Encoder-Decoder Beam Search

Implement seq2seq beam search decoding — encode the source sequence once, then beam-search decode the target by maintaining beam_width candidate hypotheses at every step.

Why beam search?

Greedy decoding picks the single best token at each step — it can get stuck in locally good but globally bad paths. Beam search hedges: keep the top-K hypotheses alive and pick the globally highest-scoring complete sequence.

beams = [(seq=[start_id], log_prob=0.0)]   # one initial beam
for step in range(max_new_tokens):
    candidates = []
    for (seq, log_prob) in non-finished beams:
        logits = decode(seq, enc_out)
        log_probs = log_softmax(logits[-1])
        for tok in top_beam_width(log_probs):
            candidates.append((seq + [tok], log_prob + log_probs[tok]))
    sort candidates by log_prob descending
    beams = candidates[:beam_width]
    if all beams ended with eos: break
return highest-log-prob beam

Architecture

Re-uses the encoder-decoder transformer from encoder-decoder-forward-pass and the encode-once pattern from encoder-decoder-greedy-decode:

Encoder (enc_blocks, shape (E, 6, d, d)): bidirectional self-attention over the source sequence. Six weight slots per block: [w_q, w_k, w_v, w_o, w_mlp1, w_mlp2].
Decoder (dec_blocks, shape (D, 12, d, d)): three sub-layers per block:
- Slots 0–3: causal self-attention over the partial target sequence.
- Slots 4–7: cross-attention — Q from decoder, K/V from enc_out.
- Slots 8–9: FFN (GELU, d_ff = d_model).
- Slots 10–11: unused / zero-padded.

Post-LN convention

x = LN(x + sub_out) for every sub-layer. LN eps = 1e-5, no learned γ/β. GELU: 0.5 * t * (1 + tanh(sqrt(2/π) * (t + 0.044715 * t³))).

Tradeoffs

More compute than greedy: roughly beam_width × greedy cost per step.
Often improves BLEU / translation quality.
beam_width=1 is equivalent to greedy decoding.

Reference