easy end_to_end

Model Checkpointing

Serialize a minimal training state to a JSON string so it can be saved to disk and restored later.

What is checkpointing?

Training a large model takes hours or days. Checkpointing saves the full training state at regular intervals so you can resume after a crash, roll back to an earlier epoch, or export a specific step for evaluation. Without checkpoints, any interruption means starting from scratch.

What goes in a state_dict?

A real checkpoint typically contains:

  • Model weights — the parameters learned so far.
  • Optimizer state — momentum buffers, adaptive learning-rate accumulators (Adam’s m and v), etc.
  • Step / epoch counter — so the scheduler knows where you are.
  • Learning rate — current value after any annealing schedule.
  • RNG state — if you need perfectly reproducible resumption.
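As a rough illustration of the list above, a fuller checkpoint might be laid out as a plain Python dict. This is only a sketch: the key names (`optimizer`, `rng_state`, and so on) and the values are hypothetical, chosen for this example rather than taken from any library's format.

```python
import random

# Hypothetical layout of a fuller checkpoint; key names are illustrative only.
checkpoint = {
    "weights": [0.1, -0.2, 0.3],            # model parameters learned so far
    "optimizer": {                          # Adam-style accumulators, one per weight
        "m": [0.0, 0.0, 0.0],
        "v": [0.0, 0.0, 0.0],
    },
    "step": 1000,                           # global training step
    "lr": 0.001,                            # current learning rate
    "rng_state": random.getstate(),         # Python's RNG state, for reproducible resumption
}

# Every optimizer buffer should line up with the weights it tracks.
assert len(checkpoint["optimizer"]["m"]) == len(checkpoint["weights"])
```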

This problem focuses on the serialization step: pack the three most essential pieces (weights, step, lr) into a JSON string. Round-trip recovery would call json.loads on the result and reconstruct the tensors.
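The recovery side might look like the sketch below. The name `deserialize_state` is hypothetical (the problem only asks for the serializing direction), and plain Python lists stand in for tensors so the example needs nothing beyond the standard library.

```python
import json

def deserialize_state(payload):
    """Hypothetical inverse of serialize_state: parse the JSON string back into Python values."""
    state = json.loads(payload)
    weights = [float(w) for w in state["weights"]]  # a real loader would rebuild a tensor here
    return weights, int(state["step"]), float(state["lr"])

# A hand-written payload in the format this problem describes.
payload = '{"weights": [0.5, -1.25, 2.0], "step": 100, "lr": 0.01}'
weights, step, lr = deserialize_state(payload)
```

In a real project the list of floats would be converted back into a tensor of the right shape and dtype before being loaded into the model.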

Pedagogy: understand what you’re saving

Reach for torch.save in real projects, but know what it does: it uses Python’s pickle to serialize whatever object you hand it, usually a dict of tensors. Here you’ll build that dict by hand so you understand exactly what state your checkpoint carries. If you can serialize it yourself, you understand it, and you can debug it when the shapes don’t match on reload.

Function signature:

def serialize_state(weights, step, lr) -> str
  • weights: shape (d,) — model parameters.
  • step: int — current global training step.
  • lr: float — current learning rate.

Output: a JSON string of the form {"weights": [...], "step": <int>, "lr": <float>}.

Use only stdlib json — no torch.save, no pickle.
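One possible shape for a solution is sketched below, not the reference answer. It assumes `weights` is either a plain sequence of numbers or an array-like object exposing `tolist()` (as NumPy arrays and PyTorch tensors do); the problem statement only specifies the shape (d,), so that input handling is an assumption.

```python
import json

def serialize_state(weights, step, lr):
    # Assumption: array-like inputs expose tolist(); plain lists and tuples pass through.
    if hasattr(weights, "tolist"):
        weights = weights.tolist()
    state = {
        "weights": [float(w) for w in weights],  # JSON has no tensor type, so store plain floats
        "step": int(step),
        "lr": float(lr),
    }
    return json.dumps(state)

result = serialize_state([0.5, -1.25, 2.0], 100, 0.01)
```

Casting each value explicitly keeps the output independent of the input container, and `json.loads(result)` recovers the same three fields.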

