Causal Self-Attention Block

Implement causal (masked) multi-head self-attention with a residual connection and layer norm — the core building block of GPT-style decoder transformers.

Causal vs Bi-directional Attention

In an encoder (e.g. BERT) every token can attend to every other token. In a decoder (e.g. GPT) each token may only attend to itself and earlier tokens — otherwise the model could “cheat” by peeking at future tokens during training. This constraint is enforced with a causal mask: before applying softmax, set all scores at positions (i, j) where j > i to -1e9 (a large negative number that becomes ~0 after softmax).

Why -1e9 and not -inf? Literal -inf can produce NaN gradients in some frameworks (e.g. when an entire row is masked and the gradient through softmax touches 0/0). Using -1e9 is numerically safe and ubiquitous in production code.

Pipeline

Given x of shape (N, T, d_model) and weight matrices w_q, w_k, w_v, w_o each of shape (d_model, d_model):

Project: Q = x @ w_q, K = x @ w_k, V = x @ w_v — each (N, T, d_model).
Reshape: split last dim d_model → (num_heads, d_head), transpose to (N, num_heads, T, d_head).
Scaled dot-product scores: scores = Q @ Kᵀ / sqrt(d_head) — shape (N, num_heads, T, T).
Causal mask: build a (T, T) lower-triangular matrix of ones (torch.tril), then masked_fill the upper triangle (where mask == 0) with -1e9.
Softmax + weighted sum: attn = softmax(scores, dim=-1); per_head = attn @ V — shape (N, num_heads, T, d_head).
Concat heads: transpose + reshape → (N, T, d_model).
Output projection + residual: out = (concat @ w_o) + x.
Layer norm over last dim with eps=1e-5 (no learned γ/β).

GPT Context

This block is the heart of every GPT-style model. Stack it with token embeddings, position encodings, and a feed-forward sublayer and you have a full transformer decoder layer.

The weight matrices are passed in directly (not as nn.Linear modules) so that tests are reproducible and framework-agnostic.

Inputs / Output

x: shape (N, T, d_model) — input token sequence.
w_q, w_k, w_v: shape (d_model, d_model).
w_o: shape (d_model, d_model).
num_heads: int — must divide d_model evenly.
Output: shape (N, T, d_model).

Causal Self-Attention Block

Causal vs Bi-directional Attention

Pipeline

GPT Context

Inputs / Output

Hints