Train Tiny GPT End-to-End

Implement a multi-step GPT pretraining loop that wraps the single causal-LM training step from train-causal-lm-pretraining-step into a real training procedure.

Real GPT pretraining runs for a vast number of steps over enormous corpora; this exercise is the teaching minimum: a few steps over a small, pre-batched corpus.

Pipeline

for step in range(n_steps):
    tokens = tokens_corpus[step]          # (N, T+1): pre-batched tokens for this step
    w_emb, pos_embed, blocks_weights, w_head = train_causal_lm_step(
        tokens, w_emb, pos_embed, blocks_weights, w_head, num_heads, lr
    )

Each call to train_causal_lm_step runs:

Forward: causal LM on the inputs tokens[:, :T].
Loss: next-token cross-entropy (CE) at every position, with targets tokens[:, 1:].
Backward by hand: manual chain rule, same as the single-step problem.
SGD update on all four weight groups (w_emb, pos_embed, blocks_weights, w_head).
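
The four stages above can be sketched on a deliberately reduced model. This is not train_causal_lm_step: the sketch assumes a hypothetical stand-in, toy_lm_step, with no attention blocks and no positional embedding (logits = w_emb[x] @ w_head), and it assumes the loss is averaged over all N*T positions, which the text above does not specify. It is meant only to show how forward, loss, manual backward, and SGD fit together.

```python
import numpy as np

def toy_lm_step(tokens, w_emb, w_head, lr):
    """One SGD step for a blockless stand-in model (hypothetical, for illustration)."""
    x, y = tokens[:, :-1], tokens[:, 1:]            # inputs (N, T), targets (N, T)
    n, t = x.shape
    rows, cols = np.arange(n)[:, None], np.arange(t)[None, :]

    # Forward: embedding lookup, then projection to vocabulary logits.
    h = w_emb[x]                                    # (N, T, D)
    logits = h @ w_head                             # (N, T, V)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)

    # Loss: mean next-token cross-entropy over every position (assumed reduction).
    loss = -np.log(probs[rows, cols, y]).mean()

    # Backward by hand: d(loss)/d(logits) = (softmax - one_hot) / (N*T).
    dlogits = probs.copy()
    dlogits[rows, cols, y] -= 1.0
    dlogits /= n * t
    d = h.shape[-1]
    dw_head = h.reshape(-1, d).T @ dlogits.reshape(n * t, -1)   # (D, V)
    dh = dlogits @ w_head.T                         # (N, T, D)
    dw_emb = np.zeros_like(w_emb)
    np.add.at(dw_emb, x, dh)                        # scatter-add handles repeated tokens

    # SGD update on every weight tensor of the stand-in model.
    return w_emb - lr * dw_emb, w_head - lr * dw_head, float(loss)

rng = np.random.default_rng(0)
V, D, N, T = 7, 4, 2, 3
w_emb = (0.1 * rng.standard_normal((V, D))).astype(np.float32)
w_head = (0.1 * rng.standard_normal((D, V))).astype(np.float32)
tokens = rng.integers(0, V, size=(N, T + 1))

losses = []
for _ in range(10):                                 # repeat on one batch to watch the loss fall
    w_emb, w_head, loss = toy_lm_step(tokens, w_emb, w_head, lr=0.5)
    losses.append(loss)
```

The real step has the same skeleton, with the transformer blocks and positional embedding sitting between the embedding lookup and the output head, and their gradients added to the backward pass.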

Input layout

tokens_corpus has shape (n_steps, N, T+1):

First axis: training step index.
Second axis: batch element.
Third axis: token sequence (input + target, length T+1).

The corpus is pre-batched and deterministic: the function does no random sampling. At step k, use the slice tokens_corpus[k], which has shape (N, T+1).
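
As a small illustration of this layout, the sketch below builds a random corpus (the sizes and the vocabulary size vocab are made-up values, not part of the contract) and checks how each step's slice splits into inputs and shifted targets.

```python
import numpy as np

n_steps, N, T, vocab = 3, 2, 4, 10                 # illustrative sizes only
rng = np.random.default_rng(0)
tokens_corpus = rng.integers(0, vocab, size=(n_steps, N, T + 1))

shapes = []
for k in range(n_steps):
    tokens = tokens_corpus[k]                      # (N, T+1) batch for step k
    inputs = tokens[:, :T]                         # (N, T) fed to the model
    targets = tokens[:, 1:]                        # (N, T) next-token labels
    # Targets are the inputs shifted left by one position.
    assert np.array_equal(inputs[:, 1:], targets[:, :-1])
    shapes.append((tokens.shape, inputs.shape, targets.shape))
```

Because the batch for step k is fixed by the array itself, two runs over the same tokens_corpus see exactly the same data in the same order.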

Output

Returns a single flat (1-D) tensor holding all updated weights after the loop, each flattened and concatenated in this order:

[w_emb_flat, pos_embed_flat, blocks_weights_flat, w_head_flat]
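
One way to build such a flat tensor is sketched below. The helper name flatten_weights is hypothetical, and the sketch assumes that blocks_weights is a flat iterable of arrays and that each array is flattened in row-major order; the actual container structure and ordering inside blocks_weights are defined by the single-step problem, not here.

```python
import numpy as np

def flatten_weights(w_emb, pos_embed, blocks_weights, w_head):
    # ASSUMPTION: blocks_weights is a flat iterable of arrays, kept in its given order.
    parts = [w_emb, pos_embed, *blocks_weights, w_head]
    # ravel() flattens each array in row-major (C) order before concatenation.
    return np.concatenate([np.asarray(p, dtype=np.float32).ravel() for p in parts])

# Distinct fill values make the ordering visible in the result.
w_emb = np.full((3, 2), 1.0)
pos_embed = np.full((4, 2), 2.0)
blocks_weights = [np.full((2, 2), 3.0), np.full(2, 4.0)]
w_head = np.full((2, 3), 5.0)

flat = flatten_weights(w_emb, pos_embed, blocks_weights, w_head)
print(flat.shape)   # (26,)
```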

Edge cases

n_steps = 0: the loop body never executes; return the initial weights unchanged.
lr = 0: every step still runs forward and backward, but the SGD update changes nothing; return the initial weights unchanged.
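
Both edge cases can be checked with a small driver. In this sketch, run_steps and dummy_step are hypothetical names: dummy_step stands in for the real training step and uses a made-up "gradient" (the batch mean), so only the loop behaviour is being demonstrated.

```python
import numpy as np

def run_steps(tokens_corpus, weights, step_fn, lr):
    """Apply step_fn once per leading-axis slice of tokens_corpus, in order."""
    for k in range(tokens_corpus.shape[0]):
        weights = step_fn(tokens_corpus[k], weights, lr)
    return weights

def dummy_step(tokens, weights, lr):
    # Stand-in for the real step: a fake gradient that depends on the batch.
    fake_grad = np.float32(tokens.mean())
    return [w - lr * fake_grad for w in weights]

init = [np.ones((3, 2), dtype=np.float32), np.ones(4, dtype=np.float32)]
empty_corpus = np.zeros((0, 2, 5), dtype=np.int64)     # n_steps = 0
corpus = np.arange(30).reshape(3, 2, 5)                # n_steps = 3

no_steps = run_steps(empty_corpus, init, dummy_step, lr=0.1)
zero_lr = run_steps(corpus, init, dummy_step, lr=0.0)
trained = run_steps(corpus, init, dummy_step, lr=0.1)
```

Here no_steps and zero_lr match init element for element, while trained does not.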

Float32 drift

Sequential training steps accumulate round-off. The test contract caps n_steps at 5 (n_steps ≤ 5); an absolute tolerance of atol = 1e-3 absorbs typical float32 drift over that range.
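
The effect can be seen in a simplified setting: the sketch below applies the same five updates in float64 and in float32 and compares the results with an absolute tolerance. The updates are random vectors rather than real gradients, and rtol=0 is an assumption made here to isolate atol; the text above states only atol. Drift in an actual transformer step is typically larger than in this toy example, which is why the tolerance is much looser than the difference measured here.

```python
import numpy as np

rng = np.random.default_rng(0)
w64 = rng.standard_normal(1000)            # float64 reference weights
w32 = w64.astype(np.float32)               # float32 copy of the same weights

for _ in range(5):                         # n_steps capped at 5, as in the contract
    g = rng.standard_normal(1000)          # stand-in for a gradient
    w64 = w64 - 0.1 * g
    w32 = w32 - np.float32(0.1) * g.astype(np.float32)

max_diff = float(np.abs(w32 - w64).max())  # accumulated float32 round-off
within_tol = np.allclose(w32, w64, rtol=0, atol=1e-3)
```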

References

Radford et al., “Improving Language Understanding by Generative Pre-Training” (GPT-1), OpenAI 2018.