Implement ALiBi (Attention with Linear Biases) from “Train Short, Test Long” (Press et al., 2022).
ALiBi adds a linear position bias to attention scores instead of using position embeddings. For head h with slope m_h, the bias for query position i and key position j is:
$$\text{bias}(i, j) = -m_h \cdot |i - j|$$
The slopes form a geometric sequence: $m_h = 2^{-\frac{8(h+1)}{H}}$ for 0-based head index h and H total heads (so the first head gets $2^{-8/H}$ and the last gets $2^{-8}$).
Given attention scores S of shape (n_heads, seq_len, seq_len), add the ALiBi bias to every head's scores.
Output: Tensor of shape (n_heads, seq_len, seq_len) — biased attention scores.
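A minimal PyTorch sketch of this task, assuming the symmetric $-m_h \cdot |i-j|$ bias above and the paper's geometric slopes $m_h = 2^{-8(h+1)/H}$ for 0-based head index h (function names `alibi_bias` and `add_alibi` are illustrative, not from the source):

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build the (n_heads, seq_len, seq_len) ALiBi bias tensor."""
    # Geometric slopes: m_h = 2^{-8(h+1)/H} for 0-based head index h.
    slopes = 2.0 ** (-8.0 * (torch.arange(n_heads) + 1) / n_heads)
    pos = torch.arange(seq_len)
    # |i - j| distance matrix, shape (seq_len, seq_len).
    dist = (pos[None, :] - pos[:, None]).abs()
    # Broadcast slopes over the distance matrix: bias(h, i, j) = -m_h * |i - j|.
    return -slopes[:, None, None] * dist[None, :, :]

def add_alibi(scores: torch.Tensor) -> torch.Tensor:
    """Add the ALiBi bias to attention scores of shape (n_heads, L, L)."""
    n_heads, seq_len, _ = scores.shape
    return scores + alibi_bias(n_heads, seq_len)
```

Because the bias depends only on head index and sequence length, it can be computed once and reused (or cached) across layers and batches; with 8 heads the slopes are 1/2, 1/4, ..., 1/256.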