medium end_to_end

Encoder-Decoder Greedy Decode

Implement seq2seq greedy decoding — the inference loop that drives machine translation, summarization, and other sequence-to-sequence tasks.

The encode-once, loop-decoder pattern

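Unlike training with teacher forcing, where the decoder sees the full ground-truth target in one pass, inference must generate tokens one at a time, each conditioned only on the tokens produced so far. The key insight: the source sequence never changes during generation, so the encoder only needs to run once.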

enc_out = encode(src_ids)          # run ONCE, cache forever
seq = [start_token_id]
for step in range(max_new_tokens):
    logits = decode(seq, enc_out)  # full decoder forward on seq so far
    next_token = argmax(logits[-1])
    seq.append(next_token)
    if next_token == eos_token_id:
        break
return tensor(seq)
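
The loop above can be exercised end to end with a minimal sketch in plain Python. Everything beyond the loop itself is an assumption made for illustration: the function name `greedy_decode`, its signature, and the toy `toy_encode` / `toy_decode` stand-ins (which simply copy the source and then emit EOS) are hypothetical, and plain lists stand in for tensors so the example has no dependencies.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    src_ids: Sequence[int],
    encode: Callable,
    decode: Callable,
    start_token_id: int,
    eos_token_id: int,
    max_new_tokens: int,
) -> List[int]:
    """Greedy seq2seq decoding: encode once, then extend the target one token at a time."""
    enc_out = encode(src_ids)            # encoder runs exactly once
    seq = [start_token_id]
    for _ in range(max_new_tokens):
        logits = decode(seq, enc_out)    # one row of vocab scores per target position
        last = logits[-1]                # only the final position predicts the next token
        next_token = max(range(len(last)), key=last.__getitem__)  # argmax
        seq.append(next_token)
        if next_token == eos_token_id:
            break
    return seq


# Toy stand-ins (assumptions for illustration only, not the real model):
# the "decoder" copies the source token by token, then emits EOS.
VOCAB, EOS, START = 6, 1, 0
calls = {"encode": 0}


def toy_encode(src_ids):
    calls["encode"] += 1                 # count calls to confirm encode-once
    return list(src_ids)


def toy_decode(seq, enc_out):
    rows = []
    for pos in range(len(seq)):
        target = enc_out[pos] if pos < len(enc_out) else EOS
        rows.append([1.0 if v == target else 0.0 for v in range(VOCAB)])
    return rows


out = greedy_decode([3, 4, 5], toy_encode, toy_decode, START, EOS, max_new_tokens=10)
```

With these toy functions the result begins with the start token, repeats the source, and ends at EOS, while `calls["encode"]` stays at 1 no matter how many decoding steps run. Lowering `max_new_tokens` truncates the output before EOS is reached.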

Architecture

Re-uses the encoder-decoder transformer from encoder-decoder-forward-pass:

  • Encoder (enc_blocks, shape (E, 6, d, d)): bidirectional self-attention over the source sequence. Six weight slots per block: [w_q, w_k, w_v, w_o, w_mlp1, w_mlp2].

  • Decoder (dec_blocks, shape (D, 12, d, d)): three sub-layers per block:

    • Slots 0–3: causal self-attention over the partial target sequence (lower-triangular mask).
    • Slots 4–7: cross-attention — Q from decoder, K/V from enc_out.
    • Slots 8–9: FFN (GELU, d_ff = d_model).
    • Slots 10–11: unused / zero-padded.
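
The decoder slot layout and the lower-triangular mask can be sketched as follows. This is only an illustration under stated assumptions: the names `DEC_SLOTS`, `split_decoder_block`, and `causal_mask` are hypothetical, and placeholder strings stand in for the `(d, d)` weight matrices. The ordering of weights inside each group is not specified here, so the sketch only groups slots by sub-layer.

```python
from typing import Dict, List, Sequence

# Slot ranges follow the layout listed above (half-open: lo inclusive, hi exclusive).
# The dictionary keys are hypothetical names chosen for this sketch.
DEC_SLOTS = {
    "self_attn": (0, 4),    # slots 0-3
    "cross_attn": (4, 8),   # slots 4-7
    "ffn": (8, 10),         # slots 8-9
    "unused": (10, 12),     # slots 10-11
}


def split_decoder_block(block: Sequence) -> Dict[str, list]:
    """Group the 12 weight slots of one decoder block by sub-layer."""
    if len(block) != 12:
        raise ValueError("expected 12 slots per decoder block")
    return {name: list(block[lo:hi]) for name, (lo, hi) in DEC_SLOTS.items()}


def causal_mask(t: int) -> List[List[bool]]:
    """Lower-triangular mask: target position i may attend to positions j <= i."""
    return [[j <= i for j in range(t)] for i in range(t)]


block = [f"w{i}" for i in range(12)]   # placeholder strings instead of (d, d) matrices
groups = split_decoder_block(block)
mask = causal_mask(3)
```

The mask applies only to the decoder's self-attention. Cross-attention reads the full `enc_out`, since the entire source is known before decoding starts.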

Post-LN convention

x = LN(x + sub_out) for every sub-layer. LN eps = 1e-5, no learned γ/β. GELU: 0.5 * t * (1 + tanh(sqrt(2/π) * (t + 0.044715 * t³))).
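
A minimal sketch of these three pieces for a single feature vector, in plain Python. The function names are hypothetical, and the use of the biased (population) variance inside LN is an assumption that follows the usual layer-norm convention; the text above does not state it.

```python
import math
from typing import Callable, List

LN_EPS = 1e-5


def layer_norm(x: List[float], eps: float = LN_EPS) -> List[float]:
    """Normalize one feature vector to zero mean and unit variance; no learned scale or shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # biased variance (assumed convention)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def gelu(t: float) -> float:
    """tanh approximation of GELU, matching the formula given above."""
    return 0.5 * t * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (t + 0.044715 * t ** 3)))


def post_ln(x: List[float], sublayer: Callable[[List[float]], List[float]]) -> List[float]:
    """Post-LN residual step: x = LN(x + sub_out)."""
    sub_out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, sub_out)])


# Example: an element-wise GELU stands in for a real sub-layer.
y = post_ln([1.0, 2.0, 3.0, 4.0], lambda v: [gelu(t) for t in v])
```

Because normalization comes after the residual add, every sub-layer output leaves the block with approximately zero mean and unit variance along the feature dimension.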

Notes

  • For v1, there is no decoder-side KV cache — just rerun the full decoder forward at each step. The KV-cache variant was already covered for causal LM in causal-lm-with-kv-cache-generation.
  • Greedy = argmax at the last position only.
  • Halt on EOS or max_new_tokens.
  • src_ids is delivered as float; cast to int inside the function.
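
Two of these notes lend themselves to a small sketch: the float-to-int cast, and the cost of rerunning the decoder without a KV cache. Both helper names are hypothetical, and the cast assumes each float holds an exact whole number.

```python
from typing import List, Sequence


def to_int_ids(src_ids: Sequence[float]) -> List[int]:
    """Cast float-encoded token ids to ints; assumes each value is a whole number."""
    return [int(x) for x in src_ids]


def decoder_positions_processed(steps: int) -> int:
    """Total target positions fed through the decoder when every step reruns the full prefix.

    Step k (1-based) sees a prefix of length k, so the total is 1 + 2 + ... + steps.
    """
    return steps * (steps + 1) // 2


ids = to_int_ids([5.0, 17.0, 2.0])
work = decoder_positions_processed(10)
```

The quadratic growth of `decoder_positions_processed` is the price of skipping the KV cache, which is acceptable for the short sequences this version targets.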

Reference

Vaswani et al. “Attention Is All You Need”, NeurIPS 2017.

