Train Tiny GPT End-to-End
Implement a multi-step GPT pretraining loop that wraps the single causal-LM
training step from train-causal-lm-pretraining-step into a real training
procedure.
Real GPT pretraining is millions of steps over enormous corpora; this is the teaching minimum.
Pipeline
for step in range(n_steps):
    tokens = tokens_corpus[step]  # (N, T+1): pre-batched tokens for this step
    w_emb, pos_embed, blocks_weights, w_head = train_causal_lm_step(
        tokens, w_emb, pos_embed, blocks_weights, w_head, num_heads, lr
    )
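The loop above can be wrapped in a runnable function. This is a minimal sketch, assuming NumPy arrays: train_tiny_gpt and its step_fn parameter are hypothetical names, n_steps is read from the first axis of tokens_corpus, and the real step function comes from train-causal-lm-pretraining-step, which is not reproduced here. The toy_step stand-in exists only to exercise the loop.

```python
import numpy as np


def train_tiny_gpt(tokens_corpus, w_emb, pos_embed, blocks_weights, w_head,
                   num_heads, lr, step_fn):
    """Run one training step per pre-batched slice of tokens_corpus.

    step_fn is assumed to share the signature and return order of
    train_causal_lm_step in the pipeline pseudocode.
    """
    n_steps = tokens_corpus.shape[0]
    for step in range(n_steps):
        tokens = tokens_corpus[step]  # (N, T+1) batch for this step
        w_emb, pos_embed, blocks_weights, w_head = step_fn(
            tokens, w_emb, pos_embed, blocks_weights, w_head, num_heads, lr
        )
    return w_emb, pos_embed, blocks_weights, w_head


def toy_step(tokens, w_emb, pos_embed, blocks_weights, w_head, num_heads, lr):
    # Hypothetical stand-in, NOT a real causal-LM step: it shifts every
    # tensor by -lr so the loop's bookkeeping can be checked.
    return w_emb - lr, pos_embed - lr, blocks_weights - lr, w_head - lr
```

Passing the step function as an argument keeps the loop testable on its own; in a real solution the call would go directly to train_causal_lm_step.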
Each call to train_causal_lm_step runs:
- Forward: causal LM on tokens[:, :T].
- Loss: next-token CE at every position.
- Backward by hand: manual chain rule, same as the single-step problem.
- SGD update on all 4 weight tensors.
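The loss step and the first link of the manual chain rule can be sketched as follows. This is an illustration under assumptions, not the reference solution: next_token_ce is a hypothetical name, the logits are assumed to have shape (N, T, V), the targets are assumed to be tokens[:, 1:], and the loss is assumed to be averaged over all N * T positions (the problem statement does not name the reduction).

```python
import numpy as np


def next_token_ce(logits, tokens):
    """Mean next-token cross-entropy and its gradient w.r.t. the logits.

    logits: (N, T, V) scores computed from the inputs tokens[:, :T].
    tokens: (N, T+1) integer ids; targets are assumed to be tokens[:, 1:].
    """
    targets = tokens[:, 1:]                       # (N, T)
    n, t, _ = logits.shape
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    rows = np.arange(n)[:, None]
    cols = np.arange(t)[None, :]
    # Assumed reduction: mean over all N * T positions.
    loss = -log_probs[rows, cols, targets].mean()
    # d(loss)/d(logits) = (softmax - one_hot(targets)) / (N * T)
    dlogits = np.exp(log_probs)
    dlogits[rows, cols, targets] -= 1.0
    dlogits /= n * t
    return loss, dlogits
```

The gradient dlogits is what the hand-written backward pass would propagate through the head, the blocks, and the embeddings.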
Input layout
tokens_corpus has shape (n_steps, N, T+1):
- First axis: training step index.
- Second axis: batch element.
- Third axis: token sequence (input + target, length T+1).
The corpus is pre-batched and deterministic — no random sampling inside the
function. At step k, slice tokens_corpus[k].
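The layout can be checked with a small deterministic corpus. The sizes below are illustrative only, and the shifted-by-one targets are an assumption implied by the next-token objective rather than something the statement spells out.

```python
import numpy as np

n_steps, N, T, vocab = 3, 2, 4, 10  # illustrative sizes, not from the problem

# Deterministic toy corpus: consecutive ids wrapped into the vocabulary.
tokens_corpus = (np.arange(n_steps * N * (T + 1)) % vocab).reshape(
    n_steps, N, T + 1
)

k = 1
tokens = tokens_corpus[k]   # (N, T+1): the batch for step k
inputs = tokens[:, :T]      # (N, T): what the model sees
targets = tokens[:, 1:]     # (N, T): the same sequence shifted by one
```

Each row of targets is the matching row of inputs moved one position to the left, so position t of the model output is scored against token t + 1.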
Output
Returns a single flat (1-D) tensor holding all updated weights after the loop, flattened and concatenated in this order:
[w_emb_flat, pos_embed_flat, blocks_weights_flat, w_head_flat]
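One way to build that flat tensor is sketched below. flatten_weights is a hypothetical helper; the structure of blocks_weights is not specified here, so the sketch assumes it is either a single array or a list/tuple of arrays, and it assumes row-major (C-order) flattening.

```python
import numpy as np


def flatten_weights(w_emb, pos_embed, blocks_weights, w_head):
    """Concatenate all weights into one 1-D array in the documented order."""
    # Assumption: blocks_weights is one array or a list/tuple of arrays.
    if isinstance(blocks_weights, (list, tuple)):
        block_parts = [np.ravel(w) for w in blocks_weights]
    else:
        block_parts = [np.ravel(blocks_weights)]
    # np.ravel flattens in row-major order by default.
    parts = [np.ravel(w_emb), np.ravel(pos_embed), *block_parts, np.ravel(w_head)]
    return np.concatenate(parts)
```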
Edge cases
- n_steps = 0: the loop body never executes, so return the initial weights unchanged.
- lr = 0: every step is a no-op, so return the initial weights unchanged.
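The lr = 0 case follows directly from the form of the SGD update. The sketch below uses a hypothetical sgd_update helper over a list of arrays; it is not the problem's required interface.

```python
import numpy as np


def sgd_update(params, grads, lr):
    """Plain SGD: return a new array p - lr * g for each (p, g) pair."""
    return [p - lr * g for p, g in zip(params, grads)]


params = [np.ones((2, 2), dtype=np.float32), np.full((3,), 2.0, dtype=np.float32)]
grads = [np.full((2, 2), 5.0, dtype=np.float32), np.full((3,), -1.0, dtype=np.float32)]

unchanged = sgd_update(params, grads, 0.0)  # lr = 0: values equal the inputs
moved = sgd_update(params, grads, 0.1)      # lr > 0: weights actually change
```

With finite gradients, lr * g is exactly zero when lr = 0, so the weights come back bit-for-bit identical. If a gradient contains inf or NaN, 0 * g is NaN and the update is no longer a no-op.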
Float32 drift
Sequential training steps accumulate round-off error. The test contract caps
n_steps at 5 (n_steps ≤ 5), and atol = 1e-3 absorbs typical float32 drift over that range.
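The effect can be illustrated by running the same five updates in float32 and float64 and comparing with the stated tolerance. This is a toy sketch with made-up gradients, not the problem's test harness; a real training step performs far more floating-point operations, so its drift is larger than what this shows.

```python
import numpy as np

rng = np.random.default_rng(0)
w64 = rng.standard_normal(1000)
grads = rng.standard_normal((5, 1000))  # one toy gradient per step, n_steps = 5
lr = 0.1

w32 = w64.astype(np.float32)
for g in grads:
    w64 = w64 - lr * g                                 # float64 reference
    w32 = w32 - np.float32(lr) * g.astype(np.float32)  # float32 run

max_err = np.abs(w32 - w64).max()
matches = np.allclose(w32, w64, rtol=0.0, atol=1e-3)
```

Setting rtol=0.0 makes the comparison purely absolute, matching a contract stated only in terms of atol.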