Relative Position Encoding

Implement relative position encoding from “Self-Attention with Relative Position Representations” (Shaw et al., 2018).

Instead of absolute position embeddings, add a learned bias based on the relative distance between query and key positions.

Given:

  • scores: shape (seq_len, seq_len) — raw attention scores (Q @ K^T / sqrt(d))
  • rel_bias: shape (2*max_dist+1,) — learned bias for relative positions [-max_dist, …, -1, 0, 1, …, max_dist]
  • max_dist: integer — maximum relative distance to consider (clamp beyond)

For query position $i$ and key position $j$, the relative position is clipped to the allowed range: $$r = \text{clip}(j - i, -\text{max\_dist}, \text{max\_dist})$$ The corresponding bias is then $\text{rel\_bias}[r + \text{max\_dist}]$, since index $0$ holds the bias for relative position $-\text{max\_dist}$.

Output: Tensor of shape (seq_len, seq_len) — scores with relative position bias added.
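The spec above can be sketched in a few lines of NumPy. This is a minimal reference sketch, not a definitive implementation; the function name `add_relative_bias` is a placeholder, and it assumes `scores` and `rel_bias` are NumPy arrays shaped as described.

```python
import numpy as np

def add_relative_bias(scores, rel_bias, max_dist):
    # scores: (seq_len, seq_len), rel_bias: (2*max_dist+1,)
    seq_len = scores.shape[0]
    idx = np.arange(seq_len)
    # r[i, j] = clip(j - i, -max_dist, max_dist)
    rel = np.clip(idx[None, :] - idx[:, None], -max_dist, max_dist)
    # Shift into [0, 2*max_dist] to index the bias table, then add.
    return scores + rel_bias[rel + max_dist]
```

Broadcasting `idx[None, :] - idx[:, None]` builds the full matrix of relative offsets $j - i$ in one step, so no explicit double loop over positions is needed.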
