Learned Absolute Position Embedding
Implement learned absolute position embeddings — the simplest way to give a transformer a sense of token order.
Why position embeddings?
Attention is permutation-equivariant: without explicit position information the model cannot distinguish “the cat sat” from “sat cat the”. Position embeddings inject order by adding (or concatenating) a position-specific vector to each token embedding before the first attention layer.
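To make the permutation claim concrete, here is a minimal NumPy sketch. The single-head attention function, the random weights, and every name in it (`self_attention`, `Wq`, `Wk`, `Wv`, `pos_table`) are illustrative assumptions, not part of the problem. It checks that shuffling the input tokens merely shuffles the outputs, and that adding a position vector breaks this symmetry.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention without masking (illustrative only)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d_model = 4, 8
x = rng.normal(size=(T, d_model))            # hypothetical token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
perm = np.array([2, 0, 3, 1])                # a reordering of the tokens

# Without position information: permuting the input permutes the output.
out = self_attention(x, Wq, Wk, Wv)
out_perm = self_attention(x[perm], Wq, Wk, Wv)
equivariant = np.allclose(out_perm, out[perm])

# With a position vector added, the same token gets a different input
# depending on where it sits, so the symmetry no longer holds.
pos_table = rng.normal(size=(T, d_model))    # stand-in for learned rows
out_pos = self_attention(x + pos_table, Wq, Wk, Wv)
out_pos_perm = self_attention(x[perm] + pos_table, Wq, Wk, Wv)
order_sensitive = not np.allclose(out_pos_perm, out_pos[perm])
```

Here `equivariant` records that the unpositioned model sees only a reshuffled copy of the same computation, while `order_sensitive` records that the positioned model does not.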
Learned vs sinusoidal:
The original “Attention Is All You Need” paper used fixed sinusoidal functions. GPT and BERT instead use a learnable lookup table: a plain nn.Embedding(max_len, d_model) whose rows are updated by backprop just like token embeddings. This lets the model discover position representations suited to a specific task and dataset, at the cost of supporting at most max_len positions.
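Because the table is an ordinary parameter, its gradient follows directly from the lookup: each row receives the sum of the upstream gradients at every place that position was used. Below is a minimal NumPy sketch of that backward pass followed by one plain SGD step. The helper name `embedding_backward`, the upstream gradient `grad_output`, and the learning rate `lr` are assumptions for illustration and do not come from the problem statement.

```python
import numpy as np

def embedding_backward(positions, grad_output, max_len):
    """Gradient of the lookup w.r.t. the table: scatter-add of upstream grads."""
    d_model = grad_output.shape[-1]
    grad_table = np.zeros((max_len, d_model), dtype=grad_output.dtype)
    # np.add.at accumulates correctly when the same index appears many times.
    np.add.at(grad_table, positions, grad_output)
    return grad_table

max_len, d_model = 6, 3
positions = np.array([[0, 1, 2], [0, 1, 2]])               # (N=2, T=3)
grad_output = np.ones((2, 3, d_model), dtype=np.float32)   # hypothetical upstream gradient

grad_table = embedding_backward(positions, grad_output, max_len)

# One plain SGD step on a randomly initialised table (lr is an assumed value).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(max_len, d_model)).astype(np.float32)
lr = 0.1
updated_table = embedding_table - lr * grad_table
```

Rows for positions that never occur in a batch receive zero gradient and keep their current values, which is one reason a learned table cannot be expected to work at positions it was never trained on.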
Operation:
Given an integer position matrix positions of shape (N, T) and a
learnable embedding_table of shape (max_len, d_model), the output is:
output[n, t, :] = embedding_table[positions[n, t], :]
This is a simple 2-D index gather — no arithmetic, no sin/cos. The table rows are learned parameters; this function is just the forward lookup.
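As a sanity check on the indexing rule, the sketch below compares NumPy advanced indexing against an explicit loop over `n` and `t`. The sizes and table values are arbitrary, and `positions` is assumed to already hold integers here.

```python
import numpy as np

N, T, max_len, d_model = 2, 3, 5, 4
embedding_table = np.arange(max_len * d_model, dtype=np.float32).reshape(max_len, d_model)
positions = np.array([[0, 1, 2], [4, 4, 0]])   # any valid row indices work

# Vectorised gather: indexing a (max_len, d_model) table with an (N, T)
# integer array yields an (N, T, d_model) result.
output = embedding_table[positions]

# Reference: the rule output[n, t, :] = embedding_table[positions[n, t], :]
expected = np.empty((N, T, d_model), dtype=np.float32)
for n in range(N):
    for t in range(T):
        expected[n, t, :] = embedding_table[positions[n, t], :]

matches = np.array_equal(output, expected)
```

Both paths copy rows and perform no arithmetic on the table values, so the comparison can be exact rather than approximate.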
Contrast with:
- Sinusoidal (implement-positional-encoding): no learnable params, fixed formula, can be computed for lengths unseen during training.
- RoPE (rotary-position-embeddings): rotates query and key vectors by a position-dependent angle inside attention, rather than adding a vector to the input.
- ALiBi (alibi-position-bias): adds a position-dependent bias to attention logits.
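For contrast with the learned table, here is a sketch of the fixed sinusoidal table using the standard sin/cos formula from “Attention Is All You Need”. The function name `sinusoidal_table` is hypothetical and `d_model` is assumed to be even. Nothing here is learned, so a table of any length can be built on demand, whereas a learned table has exactly `max_len` rows.

```python
import numpy as np

def sinusoidal_table(length, d_model):
    """Fixed sin/cos position table; assumes d_model is even."""
    pos = np.arange(length)[:, None]                    # (length, 1)
    i = np.arange(0, d_model, 2)[None, :]               # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)       # (length, d_model // 2)
    table = np.zeros((length, d_model), dtype=np.float32)
    table[:, 0::2] = np.sin(angles)                     # even columns
    table[:, 1::2] = np.cos(angles)                     # odd columns
    return table

short = sinusoidal_table(8, 16)
longer = sinusoidal_table(32, 16)   # a longer length needs no new parameters
```

The first eight rows of `longer` coincide with `short`, because each row depends only on its own position index.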
Inputs:
- positions: shape (N, T), integer position ids (0-indexed, typically [[0, 1, 2, …, T-1]] repeated across the batch, but any valid indices). Delivered as float32 by the runtime; cast to int before indexing.
- embedding_table: shape (max_len, d_model), the learnable lookup table.
Output: shape (N, T, d_model) — one embedding vector per position.
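Putting the input and output contract together, one possible NumPy forward pass is sketched below. The function name `learned_position_embedding`, the rounding step, and the range check are assumptions for illustration; the problem's reference solution is not shown here and may differ.

```python
import numpy as np

def learned_position_embedding(positions, embedding_table):
    """Look up one table row per position id.

    positions:       (N, T) array of position ids, possibly stored as float32
    embedding_table: (max_len, d_model) array
    returns:         (N, T, d_model) array
    """
    # Round before casting so a float like 2.9999998 maps to 3, not 2.
    idx = np.rint(positions).astype(np.int64)
    max_len = embedding_table.shape[0]
    if idx.min() < 0 or idx.max() >= max_len:
        raise IndexError(f"position ids must lie in [0, {max_len})")
    return embedding_table[idx]

N, T, max_len, d_model = 2, 4, 10, 3
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(max_len, d_model)).astype(np.float32)
# Position ids as float32, mimicking how the runtime delivers them.
positions = np.tile(np.arange(T, dtype=np.float32), (N, 1))

output = learned_position_embedding(positions, embedding_table)
```

The explicit range check is a design choice: NumPy would otherwise silently wrap negative indices to the end of the table instead of reporting an invalid position.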