BLEU-1gram

Compute a simplified BLEU score using only 1-grams plus a brevity penalty.

What is BLEU?

BLEU (Bilingual Evaluation Understudy) is a widely used automatic metric for evaluating machine-translation output. The full metric uses a weighted geometric mean of 1- to 4-gram precisions; this problem implements the 1-gram (unigram) variant, which captures simple word overlap.

Reference: Papineni et al. 2002, “BLEU: a method for automatic evaluation of machine translation.”

Algorithm

Precision component (with clipping)

For each unique token in the candidate, count how many times it appears in the candidate (cand_count) and in the reference (ref_count). The clipped count is min(cand_count, ref_count), so a token is never credited more often than it occurs in the reference. This prevents a candidate from inflating precision by repeating a common word many times.

clipped_sum = Σ_{t ∈ unique(candidate)} min(count_candidate(t), count_reference(t))
precision   = clipped_sum / len(candidate)
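
The clipping step can be sketched in Python roughly as follows. The function name clipped_precision is hypothetical (it is not part of the problem statement), and the sketch assumes the candidate is non-empty; the empty case is handled separately.

```python
from collections import Counter


def clipped_precision(reference, candidate):
    """Clipped unigram precision. Assumes `candidate` is non-empty."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Sum over unique candidate tokens; a Counter returns 0 for tokens
    # that never occur in the reference, so those contribute nothing.
    clipped_sum = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return clipped_sum / len(candidate)


# "the" appears three times in the candidate but only once in the
# reference, so it is credited once: precision = 1 / 3.
print(clipped_precision(["the", "cat", "sat"], ["the", "the", "the"]))
```

Without clipping, the same candidate would score a precision of 3 / 3, which is the failure mode the min() guards against.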

Brevity penalty (BP)

A short candidate can achieve high precision by emitting only a few safe words. The brevity penalty discourages outputs shorter than the reference:

BP = 1.0                             if len(candidate) > len(reference)
   = exp(1 − len(reference) / len(candidate))   otherwise
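
As a minimal sketch, the piecewise definition above translates directly into Python. The helper name brevity_penalty and its length-based signature are assumptions made for illustration, and the candidate length is assumed to be greater than zero.

```python
import math


def brevity_penalty(ref_len, cand_len):
    """Brevity penalty from the two sentence lengths. Assumes cand_len > 0."""
    if cand_len > ref_len:
        return 1.0
    # Covers cand_len == ref_len as well: exp(1 - 1) = exp(0) = 1.0.
    return math.exp(1 - ref_len / cand_len)


print(brevity_penalty(ref_len=4, cand_len=2))  # exp(1 - 4/2) = exp(-1)
```

Equal lengths fall into the second branch but still give a penalty of 1.0, so the two branches agree at the boundary.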

BLEU score

BLEU = BP × precision

Edge case

If the candidate is empty, return 0.0.
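
Putting the pieces together, one possible self-contained sketch is shown below. The function name bleu_1gram is an assumption, as is the choice to check the empty candidate first so that neither division can fail; it is meant as an illustration of the steps described here, not a reference solution.

```python
import math
from collections import Counter


def bleu_1gram(reference, candidate):
    """Unigram BLEU: brevity penalty times clipped unigram precision."""
    # Edge case: an empty candidate scores 0.0 (and avoids dividing by zero).
    if not candidate:
        return 0.0

    # Clipped precision over the unique candidate tokens.
    ref_counts = Counter(reference)
    clipped_sum = sum(
        min(n, ref_counts[tok]) for tok, n in Counter(candidate).items()
    )
    precision = clipped_sum / len(candidate)

    # Brevity penalty: 1.0 only when the candidate is strictly longer.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))

    return bp * precision


reference = "the cat is on the mat".split()
candidate = "the cat the mat".split()
# Every candidate token is matched after clipping (precision = 4 / 4 = 1.0),
# but the candidate has 4 tokens against 6, so BP = exp(1 - 6/4) = exp(-0.5).
print(bleu_1gram(reference, candidate))
```

In this example the score is driven entirely by the brevity penalty, which shows why precision alone would overrate a short candidate.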

When to use BLEU

Evaluating machine translation, summarisation, and other text-generation tasks where a gold reference is available.
Quick sanity checks during fine-tuning or decoding experiments.
As a fast proxy before running human evaluation.

Inputs

reference: list of token strings — the gold sentence.
candidate: list of token strings — the generated sentence.

Output

Scalar float BLEU score in [0, 1].

Hints