Model Checkpointing
Serialize a minimal training state to a JSON string so it can be saved to disk and restored later.
What is checkpointing?
Training a large model takes hours or days. Checkpointing saves the full training state at regular intervals so you can resume after a crash, roll back to an earlier epoch, or export a specific step for evaluation. Without checkpoints, any interruption means starting from scratch.
What goes in a state_dict?
A real checkpoint typically contains:
- Model weights: the parameters learned so far.
- Optimizer state: momentum buffers, adaptive learning-rate accumulators (Adam’s m and v), etc.
- Step / epoch counter: so the scheduler knows where you are.
- Learning rate: the current value after any annealing schedule.
- RNG state: if you need perfectly reproducible resumption.
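To make the list above concrete, here is a minimal sketch of such a dict in plain Python. The key names ("weights", "optimizer", "rng") and the values are illustrative assumptions, not a fixed standard, and the stdlib random module stands in for whatever RNG a real framework uses. The last three lines show why the RNG state matters: restoring it replays the same random draw.

```python
import random

# Hypothetical checkpoint layout; key names and values are illustrative only.
d = 3
checkpoint = {
    "weights": [0.0] * d,            # model parameters
    "optimizer": {
        "m": [0.0] * d,              # Adam first-moment estimates
        "v": [0.0] * d,              # Adam second-moment estimates
    },
    "step": 1200,                    # global training step
    "lr": 3e-4,                      # current learning rate
    "rng": random.getstate(),        # stdlib RNG state at checkpoint time
}

first = random.random()              # a draw made after the checkpoint
random.setstate(checkpoint["rng"])   # "resume" from the checkpoint
replayed = random.random()           # the same draw, reproduced
```

Without the "rng" entry, a resumed run would still train, but it would see a different sequence of random numbers than the original run.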
This problem focuses on the serialization step: pack the three most
essential pieces (weights, step, lr) into a JSON string. Round-trip
recovery would call json.loads on the result and reconstruct the tensors.
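The recovery direction might look like the sketch below. The name deserialize_state is hypothetical (the problem only asks for the serializer), and plain Python lists stand in for tensors; in a PyTorch project the weights list would presumably be wrapped with torch.tensor instead.

```python
import json

def deserialize_state(payload):
    """Hypothetical inverse of serialize_state: parse the JSON and unpack it."""
    state = json.loads(payload)
    # Plain floats stand in for a tensor here (assumption for illustration).
    weights = [float(w) for w in state["weights"]]
    return weights, int(state["step"]), float(state["lr"])

# A hand-written payload in the format this problem describes.
payload = '{"weights": [0.5, -1.25, 2.0], "step": 100, "lr": 0.001}'
weights, step, lr = deserialize_state(payload)
```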
Pedagogy: understand what you’re saving
Reach for torch.save in real projects, but know what it does: it pickles
a dict of Python objects (usually tensors). Here you’ll build that dict
by hand so you understand exactly what state your checkpoint carries. If
you can serialize it yourself, you understand it — and you can debug it
when the shapes don’t match on reload.
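As one illustration of that kind of debugging, the sketch below checks the number of stored weights against what the model expects before using them. The helper load_checked and its expected_d parameter are hypothetical names introduced for this example only.

```python
import json

def load_checked(payload, expected_d):
    """Hypothetical loader that fails loudly when the weight count is wrong."""
    state = json.loads(payload)
    got = len(state["weights"])
    if got != expected_d:
        raise ValueError(
            f"checkpoint has {got} weights, model expects {expected_d}"
        )
    return state

payload = '{"weights": [0.5, -1.25, 2.0], "step": 100, "lr": 0.001}'
state = load_checked(payload, expected_d=3)   # matches, so it loads

try:
    load_checked(payload, expected_d=4)       # mismatch on purpose
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```

Because the checkpoint is a plain dict, the check is an ordinary len call rather than something hidden inside a pickle.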
Function signature:
def serialize_state(weights, step, lr) -> str
- weights: shape (d,), the model parameters.
- step: int, the current global training step.
- lr: float, the current learning rate.
Output: a JSON string of the form {"weights": [...], "step": <int>, "lr": <float>}.
Use only stdlib json — no torch.save, no pickle.
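One possible way to meet this specification is sketched below. It is not the site's reference solution, and it assumes that weights is any 1-D iterable of numbers (a list, a NumPy array, or a 1-D tensor); each element is converted to a built-in float because json cannot encode array or tensor scalars directly.

```python
import json

def serialize_state(weights, step, lr) -> str:
    # Convert every field to a built-in type that json.dumps can encode.
    state = {
        "weights": [float(w) for w in weights],
        "step": int(step),
        "lr": float(lr),
    }
    return json.dumps(state)

print(serialize_state([0.1, 0.2], 5, 0.01))
# {"weights": [0.1, 0.2], "step": 5, "lr": 0.01}
```

Calling json.loads on the returned string gives back a dict with the same three keys, which is the round trip described earlier.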