medium end_to_end

MLM Eval — Masked Accuracy

Evaluate a masked language model (MLM) by computing the fraction of masked positions where the model correctly predicts the original token. This masked accuracy is a common intrinsic metric for BERT-style models and is often reported alongside perplexity.

What this measures

During MLM pre-training (BERT, RoBERTa), a random subset of tokens is replaced with a [MASK] token. The model must predict the original token at each masked position. Masked accuracy counts how often argmax(logits) at a masked position equals the original token id.

Unlike training (which minimises cross-entropy loss), evaluation is a single forward pass followed by argmax — no backward pass, no gradient, no SGD.
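
To make the contrast concrete, the following NumPy sketch computes both quantities from the same logits at masked positions. The function names and toy numbers are illustrative assumptions, not part of the task's reference solution. Cross-entropy depends on the full predicted distribution, while accuracy depends only on the argmax.

```python
import numpy as np

def masked_cross_entropy(logits, targets):
    """Mean cross-entropy over M masked positions.

    logits: (M, vocab_size) scores at masked positions (hypothetical input).
    targets: (M,) original token ids at those positions.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def masked_accuracy_from_logits(logits, targets):
    """Fraction of masked positions where argmax(logits) equals the target id."""
    return float((logits.argmax(axis=-1) == targets).mean())

# Toy example: 3 masked positions, vocabulary of 3 tokens.
toy_logits = np.array([[2.0, 0.5, 0.1],   # argmax 0, target 0: correct
                       [0.2, 0.3, 1.5],   # argmax 2, target 2: correct
                       [1.0, 0.9, 0.8]])  # argmax 0, target 1: wrong
toy_targets = np.array([0, 2, 1])

toy_loss = masked_cross_entropy(toy_logits, toy_targets)
toy_acc = masked_accuracy_from_logits(toy_logits, toy_targets)  # 2 of 3 correct
```

Neither function needs gradients; the loss is shown only for comparison with the quantity minimised during training.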

Pipeline

  1. Run mlm_forward (same as Task 7) to obtain logits_all with shape (N, T, vocab_size).
  2. At masked positions (mask_indicator > 0.5) take argmax over the vocab dimension → predicted token ids, shape (M,).
  3. Gather original_ids at the same masked positions → shape (M,).
  4. Compute mean(predicted == original_ids) and return as a scalar float.
  5. Edge case: if no positions are masked (M = 0), return 0.0.
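
Steps 2 to 5 can be sketched in NumPy as follows. This is a minimal sketch that assumes logits_all has already been produced by mlm_forward, which is not reproduced here; the function name masked_accuracy and the toy values are hypothetical.

```python
import numpy as np

def masked_accuracy(logits_all, original_ids, mask_indicator):
    """Fraction of masked positions whose argmax prediction matches the original id.

    logits_all: (N, T, vocab_size), original_ids: (N, T), mask_indicator: (N, T).
    """
    logits_all = np.asarray(logits_all)
    original_ids = np.asarray(original_ids)
    masked = np.asarray(mask_indicator) > 0.5        # (N, T) boolean mask

    if not masked.any():                             # step 5: M = 0
        return 0.0

    predicted = logits_all[masked].argmax(axis=-1)   # step 2: (M,)
    targets = original_ids[masked]                   # step 3: (M,)
    return float(np.mean(predicted == targets))      # step 4: scalar float

# Toy usage: N = 1, T = 4, vocab_size = 3; positions 0, 2 and 3 are masked.
toy_logits = np.array([[[0.1, 2.0, 0.3],
                        [1.5, 0.2, 0.1],
                        [0.0, 0.1, 3.0],
                        [0.9, 0.8, 0.7]]])
toy_original = np.array([[1, 0, 1, 2]])
toy_mask = np.array([[1.0, 0.0, 1.0, 1.0]])

toy_result = masked_accuracy(toy_logits, toy_original, toy_mask)
```

Boolean indexing with the (N, T) mask flattens the masked positions into a single axis of length M, so no explicit loop over the batch is needed.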

Inputs

  • input_ids: shape (N, T) — corrupted (possibly masked) token ids.
  • original_ids: shape (N, T) — the true token ids before masking.
  • mask_indicator: shape (N, T), with 1.0 at masked positions and 0.0 elsewhere.
  • w_emb: shape (vocab_size, d_model).
  • pos_embed: shape (T, d_model).
  • blocks_weights: shape (num_blocks, 6, d_model, d_model).
  • w_head: shape (d_model, vocab_size).
  • num_heads: int.
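
A quick shape check can catch mismatched inputs before the forward pass. The helper below is a hypothetical sketch that validates only the shapes listed above. The requirement that d_model be divisible by num_heads is an assumption based on standard multi-head attention, and the meaning of the six matrices per block is not assumed.

```python
import numpy as np

def check_mlm_eval_inputs(input_ids, original_ids, mask_indicator,
                          w_emb, pos_embed, blocks_weights, w_head, num_heads):
    """Validate the documented shapes and return the inferred dimensions."""
    def expect(name, array, shape):
        if np.shape(array) != shape:
            raise ValueError(f"{name}: expected shape {shape}, got {np.shape(array)}")

    n, t = np.shape(input_ids)
    vocab_size, d_model = np.shape(w_emb)
    num_blocks = np.shape(blocks_weights)[0]

    expect("original_ids", original_ids, (n, t))
    expect("mask_indicator", mask_indicator, (n, t))
    expect("pos_embed", pos_embed, (t, d_model))
    expect("blocks_weights", blocks_weights, (num_blocks, 6, d_model, d_model))
    expect("w_head", w_head, (d_model, vocab_size))
    if d_model % num_heads != 0:  # assumption: equal-width attention heads
        raise ValueError("d_model must be divisible by num_heads")

    return {"N": n, "T": t, "vocab_size": vocab_size,
            "d_model": d_model, "num_blocks": num_blocks}

# Toy inputs with arbitrary small dimensions (illustrative values only).
N, T, V, D, B, H = 2, 5, 11, 8, 3, 2
rng = np.random.default_rng(0)
toy_inputs = dict(
    input_ids=rng.integers(0, V, size=(N, T)),
    original_ids=rng.integers(0, V, size=(N, T)),
    mask_indicator=(rng.random((N, T)) < 0.15).astype(np.float32),
    w_emb=rng.standard_normal((V, D)),
    pos_embed=rng.standard_normal((T, D)),
    blocks_weights=rng.standard_normal((B, 6, D, D)),
    w_head=rng.standard_normal((D, V)),
    num_heads=H,
)
dims = check_mlm_eval_inputs(**toy_inputs)
```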

Output

A single scalar float in [0.0, 1.0] — the fraction of masked positions correctly predicted. The runtime returns this as {"value": <float>}.
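
As a small illustration of the output format, the sketch below wraps an accuracy value in the {"value": <float>} shape. How the runtime actually serialises results is not specified here, so the helper name to_result and the use of json are assumptions. Converting to a built-in float matters because NumPy scalars such as np.float32 are not JSON serialisable.

```python
import json
import numpy as np

def to_result(accuracy):
    """Wrap a scalar accuracy as {"value": float}, checking the [0.0, 1.0] range."""
    value = float(accuracy)  # converts NumPy scalars to a built-in float
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"accuracy out of range: {value}")
    return {"value": value}

result = to_result(np.float32(0.75))
payload = json.dumps(result)  # serialises because the value is a built-in float
```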

References

  • Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2019.
  • Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, 2019.

Hints

mlm bert evaluation
