medium primitives

Sinusoidal Position Encoding

Why this matters

Self-attention is permutation-equivariant: shuffle the tokens, the output rows shuffle the same way. Without extra signal, the model can’t tell “the cat sat on the mat” from “the mat sat on the cat”. Position encoding fixes this by adding a position-dependent vector to the embeddings before attention.

Vaswani et al. (2017) introduced sinusoidal position encoding: deterministic, parameter-free, generalises to longer sequences than seen in training. Modern models often use learned or RoPE instead, but sinusoidal is the canonical reference.

The formula

For position pos ∈ [0, T) and dimension i ∈ [0, D/2):

PE[pos, 2i]   = sin(pos / 10000^(2i/D))
PE[pos, 2i+1] = cos(pos / 10000^(2i/D))

Even-indexed dims get sin, odd-indexed dims get cos. The frequencies 1 / 10000^(2i/D) span a geometric sequence: dim 0 oscillates fast (high frequency); dim D-1 oscillates slowly (low frequency, near-constant across small pos).

Why sin/cos pairs?

Two key properties:

  1. Bounded magnitude: |PE[pos, j]| ≤ 1 everywhere — no scale issues.
  2. Linear shift invariance: PE[pos+k] is a linear function of PE[pos]. Concretely, for each frequency, the (sin, cos) pair rotates by a fixed angle when pos increases by k. This lets the model learn relative-position attention via dot products.

Worked example

T, D = 4, 4
pos = jnp.arange(T)[:, None]                        # (4, 1)
i   = jnp.arange(D // 2)[None, :]                   # (1, 2)
div = 10000.0 ** (2 * i / D)                        # (1, 2): [1, 100]
angles = pos / div                                  # (4, 2)
PE = jnp.zeros((T, D))
PE = PE.at[:, 0::2].set(jnp.sin(angles))            # even dims: sin
PE = PE.at[:, 1::2].set(jnp.cos(angles))            # odd dims:  cos

Row 0 is [sin 0, cos 0, sin 0, cos 0] = [0, 1, 0, 1] — the zero position is always [0, 1] repeated. Row 1 is [sin 1, cos 1, sin 0.01, cos 0.01] — the high-freq dims move fast.

Common pitfalls

  • Off-by-one on the exponent: it’s 2i / D, not i / D or 2i / D-1.
  • Even vs odd interleaving: PE[:, 0::2] = sin, PE[:, 1::2] = cos. Some implementations stack [sin..., cos...] instead — that’s a different (but valid) convention. Stick to interleaving here.
  • Integer division on D: 2i/D must be float arithmetic. Cast early or use 2.0 * i / float(D).
  • D not even: this formulation requires D % 2 == 0 so the sin/cos pairs cover all dims.

Problem

Implement sinusoidal_pos(seed, T, d_model):

  1. Cast T, d_model to int. (seed is unused — kept for signature consistency.)
  2. Build the (T, D) matrix following the Vaswani formula above.
  3. Even-indexed dims 0, 2, 4, ... get sin; odd-indexed dims get cos.
  4. Return the matrix flattened with .reshape(-1).

Inputs:

  • seed: int (unused).
  • T: int — number of positions.
  • d_model: int — embedding dim, even.

Output: 1-D array of length T * D.

Hints

flax position-encoding transformers

Sign in to attempt this problem and view the solution.