Causal Attention Mask

Implement single-head causal (masked) dot-product attention as used on the decoder side of transformer models.

What is causal attention?

In a standard (unmasked) self-attention layer every token can attend to every other token — past and future. On the decoder side of a transformer (and in any autoregressive language model) this would let the model “cheat” by reading tokens it hasn’t generated yet. Causal attention fixes this by masking out the upper triangle of the attention score matrix before the softmax, so position $t$ can only attend to positions $0, 1, \ldots, t$.
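To make the "upper triangle" concrete, here is a minimal pure-Python sketch of which query/key pairs stay visible. The helper name `allowed_matrix` is hypothetical and used only for illustration.

```python
def allowed_matrix(T):
    """T x T grid: 1 where query i may attend to key j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]

# Each row is one query position; each column is one key position.
for row in allowed_matrix(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row $t$ contains ones only in columns $0$ through $t$, so the zeros form the strictly upper triangle that gets masked.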

Math

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + M\right) V$$

where the causal mask $M$ is:

$$M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$

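Adding $-\infty$ before the softmax sets those attention weights to exactly $0$, because $e^{-\infty} = 0$. In practice a large negative constant such as $-10^9$ is used instead of $-\infty$: its exponential still underflows to $0$ in floating point, but it avoids the NaNs that $-\infty$ can produce (for example, when every entry in a row is masked).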
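The effect of the two fill values can be checked on a single row with plain Python. This is only a sketch: the scores are made up, `masked_softmax_row` is a hypothetical helper, and the fully masked row (query index `-1`) is an artificial case that never occurs with a causal mask, because the diagonal is always visible. It is included only to show one situation where $-\infty$ yields NaN.

```python
import math

def masked_softmax_row(scores, i, fill):
    """Softmax over one row of scores, with columns j > i replaced by `fill`."""
    masked = [s if j <= i else fill for j, s in enumerate(scores)]
    m = max(masked)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

row = [0.5, 1.0, 2.0, -0.3]                  # made-up scores, query position i = 1

w_inf = masked_softmax_row(row, 1, float("-inf"))
w_big = masked_softmax_row(row, 1, -1e9)
print([round(w, 4) for w in w_inf])          # [0.3775, 0.6225, 0.0, 0.0]
print([round(w, 4) for w in w_big])          # [0.3775, 0.6225, 0.0, 0.0]

# Artificial fully masked row: -inf gives NaN, -1e9 gives a uniform row.
print(masked_softmax_row(row, -1, float("-inf")))   # [nan, nan, nan, nan]
print(masked_softmax_row(row, -1, -1e9))            # [0.25, 0.25, 0.25, 0.25]
```

For rows with at least one visible position, both fill values give the same weights, and the masked positions receive exactly zero.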

Algorithm

1. Compute scaled scores: scores = q @ k.transpose(-2, -1) / sqrt(d), shape (N, T, T).
2. Build the causal mask: lower-triangular (T, T) matrix of ones, with zeros above the diagonal.
3. Where the mask is 0 (upper triangle), set the corresponding score to -1e9.
4. Softmax over the last dimension: attn = softmax(scores, dim=-1).
5. Weighted sum of values: out = attn @ v.
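The steps above can be sketched in PyTorch as follows. This is one possible implementation under stated assumptions, not necessarily the reference solution: the function name `causal_attention` is chosen for illustration, and the inputs are assumed to be float32 tensors of shape (N, T, d).

```python
import math
import torch

def causal_attention(q, k, v):
    """Single-head causal attention. q, k, v: (N, T, d) -> output (N, T, d)."""
    N, T, d = q.shape
    # Step 1: scaled dot-product scores, shape (N, T, T).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    # Step 2: lower-triangular (T, T) mask of ones, zeros above the diagonal.
    mask = torch.tril(torch.ones(T, T, device=q.device))
    # Step 3: overwrite masked (upper-triangle) scores with -1e9.
    # The (T, T) mask broadcasts across the batch dimension.
    scores = scores.masked_fill(mask == 0, -1e9)
    # Step 4: normalize each row into attention weights.
    attn = torch.softmax(scores, dim=-1)
    # Step 5: weighted sum of values, shape (N, T, d).
    return attn @ v

torch.manual_seed(0)
q, k, v = (torch.randn(2, 5, 8) for _ in range(3))
out = causal_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 8])
```

Two quick sanity checks follow from the mask: the output at position 0 equals v at position 0, since that row can attend only to itself, and changing k or v at a later position leaves the outputs at all earlier positions unchanged.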

Inputs / Output

q, k, v: each of shape (N, T, d), where N is the batch size, T the sequence length, and d the head dimension.
Output: shape (N, T, d).

Reference