Transformer Block with RoPE

Implement a transformer block that applies Rotary Position Embeddings (RoPE) to Q and K before computing attention — the architecture used in LLaMA, GPT-NeoX, PaLM, and most modern open-weight LLMs.

Background: RoFormer (Su et al., 2021)

Classical absolute position encodings (sinusoidal, learned) add position info once at the embedding layer. RoPE takes a different approach: it injects relative-position information directly into the attention score by rotating Q and K in head-space.

The key insight is that if Q and K are rotated by angle m·θ and n·θ respectively (positions m and n), then Q_rot · K_rot^T depends only on the difference (m - n)·θ. This makes RoPE a relative position encoding disguised as an absolute one.

Why Q and K, but not V?

Position information enters the computation via the attention score Q · K^T. By rotating Q and K, relative distances are baked into which tokens attend to which. The value vectors V carry content, not position — rotating them would distort the information being aggregated, not the selection of what to aggregate.

RoPE Rotation Rule

For a query vector at position t, head h, and frequency-pair index k:

q'[..., t, h, 2k]   = q[..., t, h, 2k]   * cos[t, k]  -  q[..., t, h, 2k+1] * sin[t, k]
q'[..., t, h, 2k+1] = q[..., t, h, 2k]   * sin[t, k]  +  q[..., t, h, 2k+1] * cos[t, k]

The same rule applies to K. The freqs_cos and freqs_sin tables (shape (T, d_head/2)) are precomputed and passed as arguments — you do not need to derive frequencies from scratch.

Full Pipeline

Project: Q = x @ w_q, K = x @ w_k, V = x @ w_v — each (N, T, d_model).
Multi-head reshape: split d_model → (num_heads, d_head), transpose to (N, num_heads, T, d_head).
Apply RoPE to Q and K using even/odd index slicing (see Hints). V is left unchanged.
Scaled dot-product: scores = Q_rot @ K_rot^T / sqrt(d_head). No causal mask (this is a bidirectional encoder-style block).
Softmax + weighted sum: attn = softmax(scores, dim=-1); out_heads = attn @ V.
Concat + output projection + residual: out = (concat @ w_o) + x.
Layer norm over last dim with eps=1e-5 (no learned γ/β).

Inputs / Output