Implement relative position encoding from “Self-Attention with Relative Position Representations” (Shaw et al., 2018).
Instead of absolute position embeddings, add a learned bias based on the relative distance between query and key positions.
Given:
- `scores`: shape `(seq_len, seq_len)` — raw attention scores (`Q @ K^T / sqrt(d)`)
- `rel_bias`: shape `(2*max_dist+1,)` — learned bias for relative positions `[-max_dist, …, -1, 0, 1, …, max_dist]`
- `max_dist`: integer — maximum relative distance to consider (clamp beyond)

For positions i and j, the relative position is clipped:

$$r = \text{clip}(j - i, -\text{max\_dist}, \text{max\_dist})$$

Index into rel_bias: $\text{rel\_bias}[r + \text{max\_dist}]$
Output: Tensor of shape (seq_len, seq_len) — scores with relative position bias added.
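The steps above can be sketched in NumPy; the function name `add_relative_position_bias` is a placeholder, not part of the spec:

```python
import numpy as np

def add_relative_position_bias(scores, rel_bias, max_dist):
    """Add clipped relative-position bias to raw attention scores.

    scores:   (seq_len, seq_len) raw attention scores
    rel_bias: (2*max_dist+1,) learned bias, indexed by r + max_dist
    max_dist: maximum relative distance; distances beyond it are clipped
    """
    seq_len = scores.shape[0]
    pos = np.arange(seq_len)
    # rel[i, j] = clip(j - i, -max_dist, max_dist), computed via broadcasting
    rel = np.clip(pos[None, :] - pos[:, None], -max_dist, max_dist)
    # Shift clipped distances into [0, 2*max_dist] to index rel_bias
    return scores + rel_bias[rel + max_dist]
```

For example, with `seq_len = 3`, `max_dist = 1`, and `rel_bias = [10, 20, 30]`, row `i = 0` of the bias is `[20, 30, 30]` (distances 0, 1, and 2 clipped to 1), and row `i = 2` is `[10, 10, 20]`.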