easy primitives

Greedy Decoding

Implement greedy decoding, the simplest strategy for generating a sequence of tokens from a language model.

What is greedy decoding?

At each step, greedy decoding picks the single token with the highest logit (the argmax) from the scores the model produces for the next position. Because softmax preserves ordering, this is also the most probable token. There is no randomness: given the same model and prompt, the output is fully deterministic.
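
To make the argmax step concrete, here is a minimal sketch in plain Python. The helper names `softmax` and `argmax` and the example logits are illustrative assumptions, not part of the problem's API. It shows that taking the argmax of the raw logits selects the same token as taking the argmax of the softmax probabilities, so greedy decoding never needs to normalize.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value; ties resolve to the lowest index."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits for a 4-token vocabulary.
logits = [1.2, -0.3, 3.1, 0.7]
probs = softmax(logits)

greedy_from_logits = argmax(logits)
greedy_from_probs = argmax(probs)
print(greedy_from_logits, greedy_from_probs)  # 2 2
```

Ties are broken toward the lowest index here. That is one possible convention; a tensor library may document a different one.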

Algorithm

seq = list(prompt)
for _ in range(max_tokens):
    logits = logits_fn(seq)          # shape (vocab,)
    next_token = argmax(logits)
    seq.append(next_token)
    if next_token == eos_id:
        break                        # include EOS, then stop
return tensor(seq)
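
The pseudocode above can be sketched as a runnable function. This version assumes plain Python lists in place of tensors so that it runs without a tensor library; the toy model `toy_logits_fn` and the vocabulary size `VOCAB` are hypothetical and exist only to exercise the loop.

```python
from typing import Callable, List, Sequence

def greedy_decode(
    logits_fn: Callable[[List[int]], Sequence[float]],
    prompt: Sequence[int],
    max_tokens: int,
    eos_id: int,
) -> List[int]:
    """Greedy decoding over plain lists (a stand-in for 1-D tensors)."""
    seq = list(prompt)                      # copy so the caller's prompt is untouched
    for _ in range(max_tokens):
        logits = logits_fn(seq)             # one score per vocabulary entry
        # Index of the largest logit; ties resolve to the lowest index.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        seq.append(next_token)
        if next_token == eos_id:
            break                           # EOS is kept in the output
    return seq

VOCAB = 5  # hypothetical vocabulary size

def toy_logits_fn(seq: List[int]) -> List[float]:
    """Hypothetical model: always favors (last token + 1) mod VOCAB."""
    logits = [0.0] * VOCAB
    logits[(seq[-1] + 1) % VOCAB] = 1.0
    return logits

full = greedy_decode(toy_logits_fn, prompt=[0], max_tokens=10, eos_id=4)
capped = greedy_decode(toy_logits_fn, prompt=[0], max_tokens=2, eos_id=4)
print(full)    # [0, 1, 2, 3, 4]
print(capped)  # [0, 1, 2]
```

The first call stops early because the toy model emits the EOS token (4) on its fourth step. The second call stops because it reaches the `max_tokens` budget. In both cases the prompt is part of the returned sequence.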

Strengths

  • Deterministic: the same prompt always produces the same output.
  • Fast: one forward pass per token, no branching.
  • Simple: trivial to implement and debug.

Weaknesses

  • Repetitive loops: the model can get stuck repeating the same phrase, because argmax always selects the same continuation for a given context.
  • No global optimality: locally best tokens can lead to poor overall sequences (beam search addresses this).
  • No diversity: creative tasks often benefit from sampling rather than always picking the mode.
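
The "no global optimality" point can be illustrated with a small worked example. The two-step probability table below is entirely hypothetical; it is chosen so that the locally best first token leads to a less probable full sequence than an exhaustive search finds.

```python
from itertools import product

# Hypothetical two-step model: next-token probabilities keyed by prefix.
PROBS = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def sequence_prob(seq):
    """Product of the per-step probabilities along a sequence."""
    p = 1.0
    for i, tok in enumerate(seq):
        p *= PROBS[tuple(seq[:i])][tok]
    return p

def greedy(steps=2):
    """Pick the most probable token at each step, ignoring the future."""
    seq = []
    for _ in range(steps):
        dist = PROBS[tuple(seq)]
        seq.append(max(dist, key=dist.get))
    return seq

def exhaustive():
    """Score every two-token sequence and return the most probable one."""
    return list(max(product("AB", "xy"), key=sequence_prob))

greedy_seq = greedy()
best_seq = exhaustive()
print(greedy_seq, sequence_prob(greedy_seq))  # ['A', 'x'], about 0.30
print(best_seq, sequence_prob(best_seq))      # ['B', 'x'], about 0.36
```

Greedy commits to "A" because 0.6 > 0.4, but every continuation of "A" has probability 0.5, so the best it can reach is about 0.30. Starting with "B" leads to a sequence with probability of about 0.36. Exhaustive search is only feasible for toy cases like this one.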

Inputs / Output

  • logits_fn: callable (seq: list[int]) -> tensor of shape (vocab,). This is the model head; it is called once per generated token.
  • prompt: 1-D tensor of starting token ids, shape (T_prompt,).
  • max_tokens: int. Maximum number of tokens to generate (the prompt is not counted).
  • eos_id: int. End-of-sequence token id. Generation halts when this token is produced, and the token is included in the output.

Output: 1-D tensor of token ids, shape (T_prompt + n_generated,). The prompt is included.

Where it fits

Greedy decoding is the baseline before any sampling strategy. Understanding it is the entry point to top-k sampling, nucleus (top-p) sampling, beam search, and speculative decoding.

