medium primitives

BLEU-1gram

Compute a simplified BLEU score using only 1-grams plus a brevity penalty.

What is BLEU?

BLEU (Bilingual Evaluation Understudy) is a widely used automatic metric for evaluating machine-translation output. The full metric combines a brevity penalty with a weighted geometric mean of 1- to 4-gram precisions; this problem implements the 1-gram (unigram) variant, which captures simple word overlap.

Reference: Papineni et al. 2002, “BLEU: a method for automatic evaluation of machine translation.”

Algorithm

Precision component (with clipping)

For each unique token in the candidate, count how many times it appears in the candidate (cand_count) and in the reference (ref_count). The clipped count is min(cand_count, ref_count); this prevents the model from gaming precision by repeating a common word many times.

clipped_sum = Σ_{t ∈ unique(candidate)} min(count_candidate(t), count_reference(t))
precision   = clipped_sum / len(candidate)
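
As an illustration of the clipping step, here is a minimal Python sketch. The function name clipped_unigram_precision and the example sentences are assumptions made for this illustration, not part of the problem statement; the empty-candidate guard simply mirrors the edge case described later.

```python
from collections import Counter


def clipped_unigram_precision(reference, candidate):
    """Clipped 1-gram precision of a candidate against a single reference."""
    if not candidate:
        # Assumption: an empty candidate scores 0.0 (avoids division by zero).
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Each unique candidate token contributes at most its count in the reference.
    clipped_sum = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return clipped_sum / len(candidate)


reference = ["the", "cat", "is", "on", "the", "mat"]
candidate = ["the", "the", "the", "the"]
# "the" occurs 4 times in the candidate but only 2 times in the reference,
# so the clipped count is 2 and the precision is 2 / 4.
print(clipped_unigram_precision(reference, candidate))  # 0.5
```

Without clipping, the same candidate would score a precision of 1.0, since every one of its tokens appears somewhere in the reference.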

Brevity penalty (BP)

A short candidate can achieve high precision by emitting only a few safe words. The brevity penalty discourages outputs shorter than the reference:

BP = 1.0                                        if len(candidate) > len(reference)
   = exp(1 − len(reference) / len(candidate))   otherwise
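
The piecewise definition can be sketched in Python as follows. The helper name brevity_penalty and its length-based signature are assumptions for this illustration; returning 0.0 for a zero-length candidate is also an assumption, chosen so the helper never divides by zero.

```python
import math


def brevity_penalty(ref_len, cand_len):
    """Brevity penalty from the reference and candidate lengths."""
    if cand_len == 0:
        # Assumption: treat an empty candidate as fully penalised.
        return 0.0
    if cand_len > ref_len:
        return 1.0
    # Equal lengths give exp(0) = 1.0; shorter candidates give a value below 1.
    return math.exp(1.0 - ref_len / cand_len)


print(brevity_penalty(6, 8))  # 1.0 (candidate longer than reference)
print(brevity_penalty(6, 6))  # 1.0 (equal lengths, exp(0))
print(brevity_penalty(6, 3))  # exp(-1), roughly 0.368
```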

BLEU score

BLEU = BP × precision

Edge case

If the candidate is empty, return 0.0.
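
Putting the pieces together, one possible end-to-end sketch looks like this. It is an illustration under stated assumptions rather than the official solution: the function name bleu_1gram and the example sentences are made up for this example, and a single reference is assumed.

```python
import math
from collections import Counter


def bleu_1gram(reference, candidate):
    """Simplified BLEU: clipped unigram precision times the brevity penalty."""
    if not candidate:
        return 0.0  # edge case: empty candidate

    # Clipped unigram precision.
    ref_counts = Counter(reference)
    clipped_sum = sum(
        min(n, ref_counts[tok]) for tok, n in Counter(candidate).items()
    )
    precision = clipped_sum / len(candidate)

    # Brevity penalty.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / len(candidate))

    return bp * precision


reference = "the cat sat on the mat".split()
print(bleu_1gram(reference, reference))       # 1.0 (perfect match)
print(bleu_1gram(reference, ["the", "cat"]))  # precision 1.0, BP exp(-2), roughly 0.135
print(bleu_1gram(reference, []))              # 0.0 (empty candidate)
```

The second call shows why the brevity penalty matters: the two-word candidate has perfect precision, yet its score drops to about 0.135 because it is much shorter than the six-word reference.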

When to use BLEU

  • Evaluating machine translation, summarisation, and other text-generation tasks where a gold reference is available.
  • Quick sanity checks during fine-tuning or decoding experiments.
  • As a fast proxy before running human evaluation.

Inputs

  • reference: list of token strings (the gold sentence).
  • candidate: list of token strings (the generated sentence).

Output

Scalar float BLEU score in [0, 1].

Hints

metrics nlp bleu
