Implement ALiBi (Attention with Linear Biases) from “Train Short, Test Long” (Press et al., 2022).
ALiBi adds a linear position bias to attention scores instead of using position embeddings. For head h with slope m_h, the bias for query position i and key position j is:
$$\text{bias}(i, j) = -m_h \cdot |i - j|$$
The slopes form a geometric sequence: $m_h = 2^{-\frac{8(h+1)}{H}}$ for 0-based head index h and H total heads (so the first head gets $2^{-8/H}$ and the last gets $2^{-8}$).
Given attention scores S of shape (n_heads, seq_len, seq_len), add the ALiBi bias to every head's scores.
Output: Tensor of shape (n_heads, seq_len, seq_len) — biased attention scores.
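A minimal PyTorch sketch of this task, assuming the symmetric $-m_h \cdot |i-j|$ bias above and the paper's geometric slopes $m_h = 2^{-8(h+1)/H}$ for 0-based head index h (function names `alibi_bias` and `add_alibi` are illustrative, not from the source):

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build the (n_heads, seq_len, seq_len) ALiBi bias tensor."""
    # Geometric slopes: m_h = 2^{-8(h+1)/H} for 0-based head index h.
    slopes = 2.0 ** (-8.0 * (torch.arange(n_heads) + 1) / n_heads)
    pos = torch.arange(seq_len)
    # |i - j| distance matrix, shape (seq_len, seq_len).
    dist = (pos[None, :] - pos[:, None]).abs()
    # Broadcast slopes over the distance matrix: bias(h, i, j) = -m_h * |i - j|.
    return -slopes[:, None, None] * dist[None, :, :]

def add_alibi(scores: torch.Tensor) -> torch.Tensor:
    """Add the ALiBi bias to attention scores of shape (n_heads, L, L)."""
    n_heads, seq_len, _ = scores.shape
    return scores + alibi_bias(n_heads, seq_len)
```

Because the bias depends only on head index and sequence length, it can be computed once and reused (or cached) across layers and batches; with 8 heads the slopes are 1/2, 1/4, ..., 1/256.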