We can't find the internet
Attempting to reconnect
Something went wrong!
Attempting to reconnect
Encoder-Decoder Beam Search
Implement seq2seq beam search decoding — encode the source sequence once,
then beam-search decode the target by maintaining beam_width candidate
hypotheses at every step.
Why beam search?
Greedy decoding picks the single best token at each step — it can get stuck in locally good but globally bad paths. Beam search hedges: keep the top-K hypotheses alive and pick the globally highest-scoring complete sequence.
beams = [(seq=[start_id], log_prob=0.0)] # one initial beam
for step in range(max_new_tokens):
candidates = []
for (seq, log_prob) in non-finished beams:
logits = decode(seq, enc_out)
log_probs = log_softmax(logits[-1])
for tok in top_beam_width(log_probs):
candidates.append((seq + [tok], log_prob + log_probs[tok]))
sort candidates by log_prob descending
beams = candidates[:beam_width]
if all beams ended with eos: break
return highest-log-prob beam
Architecture
Re-uses the encoder-decoder transformer from encoder-decoder-forward-pass
and the encode-once pattern from encoder-decoder-greedy-decode:
-
Encoder (
enc_blocks, shape(E, 6, d, d)): bidirectional self-attention over the source sequence. Six weight slots per block:[w_q, w_k, w_v, w_o, w_mlp1, w_mlp2]. -
Decoder (
dec_blocks, shape(D, 12, d, d)): three sub-layers per block:- Slots 0–3: causal self-attention over the partial target sequence.
-
Slots 4–7: cross-attention — Q from decoder, K/V from
enc_out. -
Slots 8–9: FFN (GELU,
d_ff = d_model). - Slots 10–11: unused / zero-padded.
Post-LN convention
x = LN(x + sub_out) for every sub-layer.
LN eps = 1e-5, no learned γ/β.
GELU: 0.5 * t * (1 + tanh(sqrt(2/π) * (t + 0.044715 * t³))).
Tradeoffs
-
More compute than greedy: roughly
beam_width × greedy costper step. - Often improves BLEU / translation quality.
-
beam_width=1is equivalent to greedy decoding.
Reference
Vaswani et al. “Attention Is All You Need”, NeurIPS 2017. Sutskever et al. “Sequence to Sequence Learning with Neural Networks”, NeurIPS 2014.
Hints
Sign in to attempt this problem and view the solution.