Greedy Decoding

Implement greedy decoding — the simplest strategy for generating a sequence of tokens from a language model.

What is greedy decoding?

At each step, greedy decoding picks the single token with the highest logit (the argmax over the vocabulary). Because softmax preserves the ordering of the logits, this is also the token the model considers most probable. There is no randomness: given the same model and prompt, the output is fully deterministic.
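
A small sketch of why the softmax can be skipped when decoding greedily: the index of the largest logit is also the index of the largest probability. The logit values below are made up for illustration, and `softmax` and `argmax` are hypothetical helpers written in plain Python rather than calls into any particular tensor library.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    # Index of the largest value; ties resolve to the lowest index.
    return max(range(len(values)), key=values.__getitem__)

logits = [1.2, -0.3, 3.1, 0.5]   # hypothetical scores for a 4-token vocabulary
probs = softmax(logits)

greedy_from_logits = argmax(logits)
greedy_from_probs = argmax(probs)
print(greedy_from_logits, greedy_from_probs)  # both select index 2
```

Note that this helper breaks ties by taking the lowest index; a real tensor library may or may not do the same, so check its documentation if ties matter.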

Algorithm

seq = list(prompt)
for _ in range(max_tokens):
    logits = logits_fn(seq)          # shape (vocab,)
    next_token = argmax(logits)
    seq.append(next_token)
    if next_token == eos_id:
        break                        # include EOS, then stop
return tensor(seq)
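
The pseudocode above can be turned into a runnable sketch. To stay dependency-free, this version assumes plain Python lists in place of tensors, so it returns a list rather than calling `tensor(seq)`. The `argmax` helper and `toy_logits_fn` are hypothetical stand-ins for a tensor library and a real model.

```python
def argmax(values):
    # Index of the largest value; ties resolve to the lowest index.
    return max(range(len(values)), key=values.__getitem__)

def greedy_decode(logits_fn, prompt, max_tokens, eos_id):
    seq = list(prompt)                  # copy so the caller's prompt is not mutated
    for _ in range(max_tokens):
        logits = logits_fn(seq)         # one score per vocabulary entry
        next_token = argmax(logits)
        seq.append(next_token)
        if next_token == eos_id:
            break                       # EOS is kept in the output, then we stop
    return seq

# Hypothetical toy model over a 4-token vocabulary: it always favours
# (last token + 1) mod 4, so the continuation counts upward.
def toy_logits_fn(seq):
    logits = [0.0] * 4
    logits[(seq[-1] + 1) % 4] = 1.0
    return logits

out = greedy_decode(toy_logits_fn, prompt=[0], max_tokens=10, eos_id=3)
print(out)  # [0, 1, 2, 3]
```

With a tensor library, the final list would typically be wrapped in that library's tensor constructor to match the output described in this exercise.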

Strengths

Deterministic — same prompt always produces the same output.
Fast — one forward pass per token, no branching.
Simple — trivial to implement and debug.

Weaknesses

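Repetitive loops — the model can get stuck repeating the same phrase, because nothing breaks the cycle once a repeated phrase becomes its own most likely continuation.
No global optimality — locally best tokens can lead to a lower-probability sequence overall (beam search mitigates this by keeping several candidate sequences alive).
No diversity — the same prompt always yields the same output, whereas creative tasks usually benefit from sampling rather than always taking the mode.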
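
The lack of global optimality can be seen in a two-step toy example. The probability tables below are invented for illustration: greedy commits to the most likely first token, but a different first token leads to a more probable sequence overall. Exhaustive search is used here only because the toy space has four sequences.

```python
from itertools import product

# Hypothetical next-token probabilities for a 2-token vocabulary {0, 1}.
FIRST = [0.6, 0.4]                          # P(first token)
SECOND = {0: [0.55, 0.45], 1: [0.9, 0.1]}   # P(second token | first token)

def sequence_prob(tokens):
    # Joint probability of a two-token sequence under the toy tables.
    first, second = tokens
    return FIRST[first] * SECOND[first][second]

def argmax(values):
    # Index of the largest value; ties resolve to the lowest index.
    return max(range(len(values)), key=values.__getitem__)

# Greedy: take the locally best token at each step.
first = argmax(FIRST)
greedy = (first, argmax(SECOND[first]))

# Exhaustive search over all four sequences, feasible only at toy scale.
best = max(product(range(2), repeat=2), key=sequence_prob)

print(greedy, best)  # (0, 0) (1, 0)
```

Greedy ends at `(0, 0)` with probability of about 0.33, while `(1, 0)` reaches about 0.36 despite starting with the less likely token.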

Inputs / Output

logits_fn: callable (seq: list[int]) -> tensor shape (vocab,) — the model head; called once per generated token.
prompt: 1-D tensor of starting token ids, shape (T_prompt,).
max_tokens: int — maximum number of tokens to generate (prompt not counted).
eos_id: int — end-of-sequence token id. Generation halts as soon as this token is produced, and the token is included in the output.

Output: 1-D tensor of token ids, shape (T_prompt + n_generated,). The prompt is included.
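
The contract above can be checked with a small sketch. It assumes plain Python lists in place of tensors, repeats the decoding loop so the snippet runs on its own, and uses a hypothetical `counting_logits_fn` that records each call. It covers both stopping conditions: reaching EOS and hitting the `max_tokens` cap.

```python
def greedy_decode(logits_fn, prompt, max_tokens, eos_id):
    seq = list(prompt)
    for _ in range(max_tokens):
        logits = logits_fn(seq)
        next_token = max(range(len(logits)), key=logits.__getitem__)
        seq.append(next_token)
        if next_token == eos_id:
            break
    return seq

calls = []  # one entry per logits_fn call

def counting_logits_fn(seq):
    # Hypothetical 2-token model: prefers token 1 until the sequence
    # holds 5 tokens, then prefers token 0 (used as EOS here).
    calls.append(len(seq))
    return [1.0, 0.0] if len(seq) >= 5 else [0.0, 1.0]

prompt = [1, 1]

# Case 1: EOS arrives before the cap, so it ends the output.
stopped = greedy_decode(counting_logits_fn, prompt, max_tokens=10, eos_id=0)
calls_case_1 = len(calls)

# Case 2: the cap is hit first, so the output has no EOS.
capped = greedy_decode(counting_logits_fn, prompt, max_tokens=2, eos_id=0)
calls_case_2 = len(calls) - calls_case_1

print(stopped, calls_case_1)  # [1, 1, 1, 1, 1, 0] 4
print(capped, calls_case_2)   # [1, 1, 1, 1] 2
```

In both cases the output length is the prompt length plus the number of `logits_fn` calls, and the number of generated tokens never exceeds `max_tokens`.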

Where it fits