Causal Attention Mask
Implement single-head causal (masked) dot-product attention as used on the decoder side of transformer models.
What is causal attention?
In a standard (unmasked) self-attention layer every token can attend to every other token — past and future. On the decoder side of a transformer (and in any autoregressive language model) this would let the model “cheat” by reading tokens it hasn’t generated yet. Causal attention fixes this by masking out the upper triangle of the attention score matrix before the softmax, so position $t$ can only attend to positions $0, 1, \ldots, t$.
Math
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + M\right) V$$
where the causal mask $M$ is:
$$M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$
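Adding $-\infty$ before the softmax drives those attention weights to exactly $0$, since $e^{-\infty} = 0$. In practice a large negative constant such as $-10^9$ is often used instead of $-\infty$: in floating point its exponential still underflows to $0$, and it avoids the NaNs that can arise when every score in a row is $-\infty$.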
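As a small worked example of the mask definition, the sketch below builds $M$ for $T = 3$ and applies a row-wise softmax to an all-zero score matrix. It uses only the Python standard library; the helper names `causal_mask` and `softmax` and the all-zero scores are illustrative assumptions, not part of the problem statement.

```python
import math

def causal_mask(T):
    """Additive mask M: 0.0 where j <= i, -inf where j > i."""
    return [[0.0 if j <= i else -math.inf for j in range(T)] for i in range(T)]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3x3 score matrix of zeros, so unmasked weights come out uniform.
scores = [[0.0] * 3 for _ in range(3)]
M = causal_mask(3)
weights = [
    softmax([s + m for s, m in zip(score_row, mask_row)])
    for score_row, mask_row in zip(scores, M)
]

# Row 0 is [1.0, 0.0, 0.0] and row 1 is [0.5, 0.5, 0.0];
# row 2 is uniform over all three positions.
for row in weights:
    print(row)
```

Each row $i$ places zero weight on every position $j > i$ and still sums to one over the positions it is allowed to see.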
Algorithm
- Compute scaled scores: `scores = q @ k.transpose(-2, -1) / sqrt(d)`, shape `(N, T, T)`.
- Build the causal mask: a lower-triangular `[T, T]` matrix of ones, with zeros above the diagonal.
- Where the mask is `0` (upper triangle), set the corresponding score to `-1e9`.
- Softmax over the last dimension: `attn = softmax(scores, dim=-1)`.
- Weighted sum of values: `out = attn @ v`.
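The steps above can be sketched in PyTorch roughly as follows. This is one possible implementation rather than the site's reference solution: the function name `causal_attention`, the use of `torch.tril` and `masked_fill`, and the random test tensors are assumptions made for illustration.

```python
import math
import torch

def causal_attention(q, k, v):
    """Single-head causal attention; q, k, v of shape (N, T, d) -> (N, T, d)."""
    T, d = q.shape[-2], q.shape[-1]
    # Scaled dot-product scores, shape (N, T, T).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    # Lower-triangular [T, T] mask of ones, zeros above the diagonal.
    mask = torch.tril(torch.ones(T, T, device=q.device))
    # Where the mask is 0 (future positions), overwrite the score with -1e9.
    # The [T, T] mask broadcasts over the batch dimension.
    scores = scores.masked_fill(mask == 0, -1e9)
    # Normalise over the key dimension.
    attn = torch.softmax(scores, dim=-1)
    # Weighted sum of values.
    return attn @ v

# Hypothetical inputs: batch of 2 sequences, length 4, dimension 8.
torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 8) for _ in range(3))
out = causal_attention(q, k, v)

# Causality check: perturb only the last token's key and value.
k2, v2 = k.clone(), v.clone()
k2[:, -1] += 1.0
v2[:, -1] += 1.0
out2 = causal_attention(q, k2, v2)
print(out.shape)
print(torch.allclose(out[:, :-1], out2[:, :-1]))
```

The final check perturbs only the last position's key and value; because earlier positions cannot attend to it, their outputs should be unaffected. Note that `-1e9` assumes a dtype such as float32 that can represent it; half precision would need a smaller constant.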
Inputs / Output
- `q`, `k`, `v`: each of shape `(N, T, d)` (a batch of sequences).
- Output: shape `(N, T, d)`.
Reference
Vaswani et al., Attention Is All You Need, NeurIPS 2017. See Section 3.2.1 (Scaled Dot-Product Attention) and Section 3.1 (Encoder and Decoder Stacks, which describes the decoder's masked self-attention sub-layer).