hard end_to_end

Encoder-Decoder Beam Search

Implement seq2seq beam search decoding — encode the source sequence once, then beam-search decode the target by maintaining beam_width candidate hypotheses at every step.

Why beam search?

Greedy decoding picks the single best token at each step — it can get stuck in locally good but globally bad paths. Beam search hedges: keep the top-K hypotheses alive and pick the globally highest-scoring complete sequence.

beams = [(seq=[start_id], log_prob=0.0)]   # one initial beam
for step in range(max_new_tokens):
    candidates = []
    for (seq, log_prob) in non-finished beams:
        logits = decode(seq, enc_out)
        log_probs = log_softmax(logits[-1])
        for tok in top_beam_width(log_probs):
            candidates.append((seq + [tok], log_prob + log_probs[tok]))
    sort candidates by log_prob descending
    beams = candidates[:beam_width]
    if all beams ended with eos: break
return highest-log-prob beam

Architecture

Re-uses the encoder-decoder transformer from encoder-decoder-forward-pass and the encode-once pattern from encoder-decoder-greedy-decode:

  • Encoder (enc_blocks, shape (E, 6, d, d)): bidirectional self-attention over the source sequence. Six weight slots per block: [w_q, w_k, w_v, w_o, w_mlp1, w_mlp2].

  • Decoder (dec_blocks, shape (D, 12, d, d)): three sub-layers per block:

    • Slots 0–3: causal self-attention over the partial target sequence.
    • Slots 4–7: cross-attention — Q from decoder, K/V from enc_out.
    • Slots 8–9: FFN (GELU, d_ff = d_model).
    • Slots 10–11: unused / zero-padded.

Post-LN convention

x = LN(x + sub_out) for every sub-layer. LN eps = 1e-5, no learned γ/β. GELU: 0.5 * t * (1 + tanh(sqrt(2/π) * (t + 0.044715 * t³))).

Tradeoffs

  • More compute than greedy: roughly beam_width × greedy cost per step.
  • Often improves BLEU / translation quality.
  • beam_width=1 is equivalent to greedy decoding.

Reference

Vaswani et al. “Attention Is All You Need”, NeurIPS 2017. Sutskever et al. “Sequence to Sequence Learning with Neural Networks”, NeurIPS 2014.

Hints

seq2seq inference beam-search

Sign in to attempt this problem and view the solution.