Encoder-Decoder Greedy Decode
Implement seq2seq greedy decoding — the inference loop that drives machine translation, summarization, and other sequence-to-sequence tasks.
The encode-once, loop-decoder pattern
Unlike training with teacher forcing, inference generates tokens one at a time, with no access to future target tokens. The key insight: the encoder output depends only on the source, so the encoder needs to run just once.
enc_out = encode(src_ids)  # run ONCE, cache forever
seq = [start_token_id]
for step in range(max_new_tokens):
    logits = decode(seq, enc_out)  # full decoder forward on seq so far
    next_token = argmax(logits[-1])
    seq.append(next_token)
    if next_token == eos_token_id:
        break
return tensor(seq)
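As a minimal runnable sketch of the loop above, the version below takes `encode` and `decode` as plain callables. The toy stand-ins (`toy_encode`, `toy_decode`, `VOCAB_SIZE`) and the token ids are hypothetical and exist only to exercise the control flow; they are not the transformer this problem asks for. It returns a Python list rather than a tensor.

```python
from typing import Callable, List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_decode(
    src_ids: Sequence[int],
    encode: Callable,
    decode: Callable,
    start_token_id: int,
    eos_token_id: int,
    max_new_tokens: int,
) -> List[int]:
    enc_out = encode(src_ids)  # encoder runs exactly once
    seq = [start_token_id]
    for _ in range(max_new_tokens):
        logits = decode(seq, enc_out)    # one row of vocab scores per position
        next_token = argmax(logits[-1])  # greedy: last position only
        seq.append(next_token)
        if next_token == eos_token_id:
            break
    return seq


# --- Hypothetical toy model, for illustration only ---
VOCAB_SIZE = 5


def toy_encode(src_ids):
    return list(src_ids)


def toy_decode(seq, enc_out):
    # Position i "predicts" source token i, then EOS (id 0) once the source runs out.
    rows = []
    for i in range(len(seq)):
        target = enc_out[i] if i < len(enc_out) else 0
        rows.append([1.0 if v == target else 0.0 for v in range(VOCAB_SIZE)])
    return rows


result = greedy_decode(
    [3, 4, 2], toy_encode, toy_decode,
    start_token_id=1, eos_token_id=0, max_new_tokens=10,
)
```

With these stand-ins the loop copies the source and then stops on EOS. Lowering `max_new_tokens` below the source length exercises the other halting condition.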
Architecture
Re-uses the encoder-decoder transformer from encoder-decoder-forward-pass:
- Encoder (enc_blocks, shape (E, 6, d, d)): bidirectional self-attention over the source sequence. Six weight slots per block: [w_q, w_k, w_v, w_o, w_mlp1, w_mlp2].
- Decoder (dec_blocks, shape (D, 12, d, d)): three sub-layers per block:
  - Slots 0–3: causal self-attention over the partial target sequence (lower-triangular mask).
  - Slots 4–7: cross-attention, with Q from the decoder and K/V from enc_out.
  - Slots 8–9: FFN (GELU, d_ff = d_model).
  - Slots 10–11: unused / zero-padded.
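The decoder slot layout and the lower-triangular mask can be sketched in a few lines of plain Python. The names `DECODER_SLOTS`, `split_decoder_block`, and `causal_mask` are hypothetical helpers, not part of the problem's API. The sketch only groups slots by sub-layer and makes no assumption about the order of weights within a group.

```python
from typing import Dict, List, Sequence

# Slot ranges per decoder sub-layer, as listed above.
DECODER_SLOTS: Dict[str, range] = {
    "self_attn": range(0, 4),   # causal self-attention
    "cross_attn": range(4, 8),  # Q from decoder, K/V from enc_out
    "ffn": range(8, 10),        # two FFN matrices
    "unused": range(10, 12),    # zero-padded
}


def split_decoder_block(block: Sequence) -> Dict[str, List]:
    """Group the 12 weight slots of one decoder block by sub-layer."""
    if len(block) != 12:
        raise ValueError(f"expected 12 slots, got {len(block)}")
    return {name: [block[i] for i in idx] for name, idx in DECODER_SLOTS.items()}


def causal_mask(t: int) -> List[List[bool]]:
    """mask[i][j] is True when target position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(t)] for i in range(t)]


groups = split_decoder_block(list(range(12)))  # slot indices stand in for weight matrices
mask = causal_mask(3)
```

Cross-attention needs no such mask: every decoder position may attend to every encoder position.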
Post-LN convention
x = LN(x + sub_out) for every sub-layer.
LN eps = 1e-5, no learned γ/β.
GELU: 0.5 * t * (1 + tanh(sqrt(2/π) * (t + 0.044715 * t³))).
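The three conventions above can be sketched for a single feature vector in pure Python. This assumes the usual LayerNorm definition: mean and biased (population) variance taken over the feature dimension. The function names are illustrative.

```python
import math
from typing import List, Sequence

LN_EPS = 1e-5


def layer_norm(x: Sequence[float]) -> List[float]:
    """Normalize over the feature dimension; no learned scale or shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance (assumption)
    inv_std = 1.0 / math.sqrt(var + LN_EPS)
    return [(v - mean) * inv_std for v in x]


def gelu(t: float) -> float:
    """Tanh approximation of GELU, matching the formula above."""
    return 0.5 * t * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (t + 0.044715 * t ** 3)))


def post_ln(x: Sequence[float], sub_out: Sequence[float]) -> List[float]:
    """Post-LN residual: add the sub-layer output first, then normalize."""
    return layer_norm([a + b for a, b in zip(x, sub_out)])


y = post_ln([1.0, 2.0], [1.0, 2.0])  # layer_norm([2.0, 4.0])
```

Because normalization comes after the residual add, every sub-layer output, including the final decoder block's, is already normalized.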
Notes
- For v1, there is no decoder-side KV cache: just rerun the full decoder forward at each step. The KV-cache variant was already covered for causal LM in causal-lm-with-kv-cache-generation.
- Greedy = argmax at the last position only.
- Halt on EOS or max_new_tokens.
- src_ids is delivered as float; cast to int inside the function.
Reference
Vaswani et al. “Attention Is All You Need”, NeurIPS 2017.