Learned Absolute Position Embedding

Implement learned absolute position embeddings — the simplest way to give a transformer a sense of token order.

Why position embeddings?

Attention is permutation-equivariant: without explicit position information the model cannot distinguish “the cat sat” from “sat cat the”. Position embeddings inject order by adding (or concatenating) a position-specific vector to each token embedding before the first attention layer.
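The permutation claim above can be checked numerically. The following is a minimal sketch, assuming a single-head attention layer with random weights; the helper self_attention, the weight names, and the random pos_table are illustrative and not part of the exercise.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (T, d), with no position input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d = 3, 4
x = rng.normal(size=(T, d))      # stand-ins for embeddings of "the", "cat", "sat"
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
perm = np.array([2, 1, 0])       # reorder to "sat cat the"

# Without positions: permuting the input only permutes the output rows.
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)
equivariant = np.allclose(out[perm], out_perm)

# With a position vector added per slot, the two orderings differ.
pos_table = rng.normal(size=(T, d))  # hypothetical position table
out_pos = self_attention(x + pos_table, w_q, w_k, w_v)
out_pos_perm = self_attention(x[perm] + pos_table, w_q, w_k, w_v)
order_sensitive = not np.allclose(out_pos[perm], out_pos_perm)
```

Here equivariant comes out true and order_sensitive comes out true: the layer alone cannot tell the two orderings apart, while the version with added position vectors can.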

Learned vs sinusoidal:

The original “Attention Is All You Need” paper used fixed sinusoidal functions. GPT and BERT instead use a learnable lookup table: a plain nn.Embedding(max_len, d_model) whose rows are updated by backprop just like token embeddings. This lets the model learn position representations suited to its task and dataset, at the cost of a hard limit: positions at or beyond max_len have no row in the table.
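To make "updated by backprop" concrete, here is a rough NumPy sketch of the gradient that flows into the table. It assumes the gradient of a row lookup is a scatter-add of the upstream gradient; the function name embedding_backward, the all-ones upstream gradient, and the learning rate are made up for illustration.

```python
import numpy as np

def embedding_backward(positions, grad_output, max_len):
    """Gradient of the lookup w.r.t. the table: scatter-add upstream gradients."""
    d_model = grad_output.shape[-1]
    grad_table = np.zeros((max_len, d_model), dtype=grad_output.dtype)
    # np.add.at accumulates correctly when the same position appears more than once.
    np.add.at(grad_table, positions, grad_output)
    return grad_table

max_len, d_model = 6, 2
positions = np.array([[0, 1, 2], [0, 1, 2]])   # (N=2, T=3)
grad_output = np.ones((2, 3, d_model))         # pretend upstream gradient
grad_table = embedding_backward(positions, grad_output, max_len)

table = np.zeros((max_len, d_model))
learning_rate = 0.1                            # illustrative value
table -= learning_rate * grad_table            # plain SGD step
```

Only rows 0 to 2 receive a gradient in this example. Rows for positions that never occur in training stay at their initial values, which is one reason a learned table does not extend to unseen lengths.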

Operation:

Given an integer position matrix positions of shape (N, T) and a learnable embedding_table of shape (max_len, d_model), the output is:

output[n, t, :] = embedding_table[positions[n, t], :]

This is a simple 2-D index gather — no arithmetic, no sin/cos. The table rows are learned parameters; this function is just the forward lookup.
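The formula above can be written out as an explicit double loop and compared with NumPy integer-array indexing. This is only a sketch: lookup_loop and the small example table are illustrative, and positions is assumed to already hold integers.

```python
import numpy as np

def lookup_loop(positions, embedding_table):
    """Literal translation of output[n, t, :] = embedding_table[positions[n, t], :]."""
    N, T = positions.shape
    d_model = embedding_table.shape[1]
    output = np.empty((N, T, d_model), dtype=embedding_table.dtype)
    for n in range(N):
        for t in range(T):
            output[n, t, :] = embedding_table[positions[n, t], :]
    return output

embedding_table = np.arange(10, dtype=np.float32).reshape(5, 2)  # max_len=5, d_model=2
positions = np.array([[0, 1, 2], [4, 4, 0]])                     # N=2, T=3

looped = lookup_loop(positions, embedding_table)
vectorised = embedding_table[positions]  # same gather in a single indexing step
```

Both results have shape (2, 3, 2) and are identical. Indexing a 2-D table with an (N, T) integer array returns one table row per index, which is exactly the gather the formula describes.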

Contrast with:

Sinusoidal (implement-positional-encoding): no learnable params, fixed formula, generalises to unseen lengths at inference time.
RoPE (rotary-position-embeddings): rotates the query and key vectors by a position-dependent angle, so position enters through the attention scores rather than being added to the input.
ALiBi (alibi-position-bias): adds a position-dependent bias to attention logits.

Inputs:

positions: shape (N, T). Integer position ids, 0-indexed; typically [[0, 1, 2, …, T-1]] repeated across the batch, but any indices in the range [0, max_len) are valid. Delivered as float32 by the runtime; cast to int before indexing.
embedding_table: shape (max_len, d_model) — the learnable lookup table.

Output: shape (N, T, d_model) — one embedding vector per position.
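Putting the input and output contract together, one possible forward function looks like the sketch below. The name learned_position_embedding and the explicit range check are assumptions rather than a required signature. The cast assumes the float32 ids hold whole numbers, since astype truncates toward zero.

```python
import numpy as np

def learned_position_embedding(positions, embedding_table):
    """Forward lookup: (N, T) position ids -> (N, T, d_model) embeddings."""
    positions = np.asarray(positions)
    embedding_table = np.asarray(embedding_table)
    idx = positions.astype(np.int64)  # float32 ids cannot be used as indices directly
    max_len = embedding_table.shape[0]
    # Reject out-of-range ids explicitly; negative ids would otherwise wrap around.
    if idx.size and (idx.min() < 0 or idx.max() >= max_len):
        raise IndexError(f"position ids must lie in [0, {max_len})")
    return embedding_table[idx]

N, T, d_model, max_len = 2, 4, 3, 8
table = np.arange(max_len * d_model, dtype=np.float32).reshape(max_len, d_model)
positions = np.tile(np.arange(T, dtype=np.float32), (N, 1))  # [[0, 1, 2, 3]] per batch row
out = learned_position_embedding(positions, table)
```

The range check is a design choice. Without it, NumPy treats an index of -1 as the last row instead of raising an error, which can hide bugs in how the position ids were built.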
