MLM Eval — Masked Accuracy

Evaluate a masked language model (MLM) by computing masked accuracy: the fraction of masked positions at which the model's top prediction equals the original token. This is a standard intrinsic metric for BERT-style models and is usually reported alongside perplexity.

What this measures

During MLM pre-training (BERT, RoBERTa), a random subset of tokens is replaced with a [MASK] token, and the model must predict the original token at each masked position. Masked accuracy is the fraction of masked positions at which the argmax of the logits over the vocabulary equals the original token id.
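As a small illustration, the check at a single masked position can be sketched in NumPy. The 5-token vocabulary, the logit values, and the names `logits`, `original_id`, `predicted_id`, and `is_correct` are made up for this example and are not part of the task.

```python
import numpy as np

# Hypothetical logits at one masked position over a 5-token vocabulary.
logits = np.array([0.1, 2.3, -1.0, 2.3, 0.5])
original_id = 1  # made-up id of the token that was there before masking

# np.argmax returns the first index of the maximum, so the tie between
# ids 1 and 3 resolves to id 1.
predicted_id = int(np.argmax(logits))
is_correct = predicted_id == original_id
```

Masked accuracy is the mean of this per-position check over all masked positions.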

Unlike training (which minimises cross-entropy loss), evaluation is a single forward pass followed by argmax — no backward pass, no gradient, no SGD.

Pipeline

Run mlm_forward (same as Task 7) to obtain logits_all with shape (N, T, vocab_size).
At masked positions (mask_indicator > 0.5), take the argmax over the vocab dimension to get the predicted token ids, shape (M,), where M is the number of masked positions.
Gather original_ids at the same masked positions → shape (M,).
Compute mean(predicted == original_ids) and return as a scalar float.
Edge case: if no positions are masked (M = 0), return 0.0 instead of taking the mean of an empty array.
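The steps after the forward pass can be sketched as follows. This is a minimal NumPy sketch, not the reference solution: it assumes logits_all has already been produced by mlm_forward, and the function name `masked_accuracy_from_logits` is hypothetical.

```python
import numpy as np

def masked_accuracy_from_logits(logits_all, original_ids, mask_indicator):
    """logits_all: (N, T, vocab_size); original_ids, mask_indicator: (N, T)."""
    masked = np.asarray(mask_indicator) > 0.5      # boolean mask, shape (N, T)
    if not masked.any():                           # M = 0: nothing to score
        return 0.0
    # Boolean indexing keeps only the masked rows: (M, vocab_size) -> (M,)
    predicted = np.argmax(np.asarray(logits_all)[masked], axis=-1)
    targets = np.asarray(original_ids)[masked]     # shape (M,)
    return float(np.mean(predicted == targets))

# Toy input: N = 1, T = 3, vocab_size = 4; the first two positions are masked.
logits = np.array([[[0., 1., 5., 0.], [4., 1., 0., 0.], [0., 0., 0., 9.]]])
acc = masked_accuracy_from_logits(
    logits, np.array([[2, 1, 3]]), np.array([[1., 1., 0.]])
)
# One of the two masked positions is predicted correctly, so acc == 0.5.
```

Checking `masked.any()` before indexing avoids calling `np.mean` on an empty array, which would yield NaN and a runtime warning rather than 0.0.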

Inputs

input_ids: shape (N, T) — corrupted (possibly masked) token ids.
original_ids: shape (N, T) — the true token ids before masking.
mask_indicator: shape (N, T) — 1.0 at masked positions, 0.0 elsewhere.
w_emb: shape (vocab_size, d_model).
pos_embed: shape (T, d_model).
blocks_weights: shape (num_blocks, 6, d_model, d_model).
w_head: shape (d_model, vocab_size).
num_heads: int.
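A shape check along these lines can catch mismatched inputs before the forward pass. It is only a sketch: the helper name `check_input_shapes` is hypothetical, and the requirement that d_model be divisible by num_heads is an assumption based on standard multi-head attention, not something stated in the task.

```python
import numpy as np

def check_input_shapes(input_ids, original_ids, mask_indicator,
                       w_emb, pos_embed, blocks_weights, w_head, num_heads):
    """Raise ValueError if the arrays disagree with the shapes listed above."""
    n, t = np.shape(input_ids)
    vocab_size, d_model = np.shape(w_emb)
    checks = [
        ("original_ids", np.shape(original_ids), (n, t)),
        ("mask_indicator", np.shape(mask_indicator), (n, t)),
        ("pos_embed", np.shape(pos_embed), (t, d_model)),
        ("blocks_weights[1:]", np.shape(blocks_weights)[1:], (6, d_model, d_model)),
        ("w_head", np.shape(w_head), (d_model, vocab_size)),
    ]
    for name, got, want in checks:
        if tuple(got) != want:
            raise ValueError(f"{name}: expected shape {want}, got {tuple(got)}")
    # Assumption: heads split d_model evenly, as in standard multi-head attention.
    if d_model % num_heads != 0:
        raise ValueError(f"d_model={d_model} is not divisible by num_heads={num_heads}")
    return {"N": n, "T": t, "vocab_size": vocab_size, "d_model": d_model,
            "num_blocks": np.shape(blocks_weights)[0]}

# Dummy inputs: N = 2, T = 3, vocab_size = 11, d_model = 8, num_blocks = 2.
dims = check_input_shapes(
    np.zeros((2, 3), dtype=int), np.zeros((2, 3), dtype=int), np.zeros((2, 3)),
    np.zeros((11, 8)), np.zeros((3, 8)), np.zeros((2, 6, 8, 8)),
    np.zeros((8, 11)), num_heads=4,
)
```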

Output

A single scalar float in [0.0, 1.0] — the fraction of masked positions correctly predicted. The runtime returns this as {"value": <float>}.
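Packaging the result might look like the sketch below. The helper name `package_result` is hypothetical, and serialising with `json` is an assumption about how the runtime consumes the dictionary; the cast matters because NumPy scalars such as `np.float32` are not JSON-serialisable.

```python
import json
import numpy as np

def package_result(accuracy):
    """Wrap a masked-accuracy value in the {"value": <float>} form."""
    value = float(accuracy)  # converts NumPy scalars to a built-in float
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"accuracy must lie in [0.0, 1.0], got {value}")
    return {"value": value}

# Assumed usage: the runtime serialises the dictionary as JSON.
payload = json.dumps(package_result(np.float32(0.5)))
```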

References