Implement relative position encoding from “Self-Attention with Relative Position Representations” (Shaw et al., 2018).
Instead of absolute position embeddings, add a learned bias based on the relative distance between query and key positions.
Given:
- `scores`: shape `(seq_len, seq_len)` — raw attention scores (`Q @ K^T / sqrt(d)`)
- `rel_bias`: shape `(2*max_dist+1,)` — learned bias for relative positions `[-max_dist, …, -1, 0, 1, …, max_dist]`
- `max_dist`: integer — maximum relative distance to consider (clamp beyond)

For positions i and j, the relative position is clipped:

$$r = \text{clip}(j - i, -\text{max\_dist}, \text{max\_dist})$$

Index into rel_bias: $\text{rel\_bias}[r + \text{max\_dist}]$
Output: Tensor of shape (seq_len, seq_len) — scores with relative position bias added.
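The steps above can be sketched in NumPy; the function name `add_relative_position_bias` is a placeholder, not part of the spec:

```python
import numpy as np

def add_relative_position_bias(scores, rel_bias, max_dist):
    """Add clipped relative-position bias to raw attention scores.

    scores:   (seq_len, seq_len) raw attention scores
    rel_bias: (2*max_dist+1,) learned bias, indexed by r + max_dist
    max_dist: maximum relative distance; distances beyond it are clipped
    """
    seq_len = scores.shape[0]
    pos = np.arange(seq_len)
    # rel[i, j] = clip(j - i, -max_dist, max_dist), computed via broadcasting
    rel = np.clip(pos[None, :] - pos[:, None], -max_dist, max_dist)
    # Shift clipped distances into [0, 2*max_dist] to index rel_bias
    return scores + rel_bias[rel + max_dist]
```

For example, with `seq_len = 3`, `max_dist = 1`, and `rel_bias = [10, 20, 30]`, row `i = 0` of the bias is `[20, 30, 30]` (distances 0, 1, and 2 clipped to 1), and row `i = 2` is `[10, 10, 20]`.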