Eval Loop with Metrics

Run a binary-classifier evaluation loop over a list of batches and return both mean BCE loss and accuracy.

Why eval is different from training

During training you call loss.backward() and update parameters. During evaluation you skip all of that: no gradients, no updates. You just forward-pass each batch, accumulate metrics, and report aggregates. In PyTorch you would wrap the whole loop in torch.no_grad() so that no autograd graph is built, which saves memory and time; here the weights are already frozen, so the numerical result is the same either way.

Critical: average over all examples, not over batches

If you average the per-batch averages you get the wrong answer whenever batches are unequal in size. The correct approach is to keep running sums (total_loss, total_correct, total_examples) and divide once at the end.
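To make the difference concrete, here is a minimal sketch using made-up per-example losses for two batches of unequal size. The numbers and variable names are illustrative assumptions, not part of the task data.

```python
# Hypothetical per-example losses for two batches of unequal size.
batch_losses = [[0.2, 0.4, 0.6], [1.0]]

# Biased when batch sizes differ: average the per-batch means.
batch_means = [sum(b) / len(b) for b in batch_losses]
mean_of_means = sum(batch_means) / len(batch_means)  # about 0.70

# Correct: keep running sums and divide once at the end.
total_loss = 0.0
total_examples = 0
for b in batch_losses:
    total_loss += sum(b)
    total_examples += len(b)
mean_over_examples = total_loss / total_examples  # about 0.55
```

The single-example batch counts as much as the three-example batch in the first computation, which is why the two results disagree.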

Numerical stability for BCE with logits

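The naive form, which computes sigmoid(z) and then takes log(sigmoid(z)) or log(1 - sigmoid(z)), breaks down when |z| is large: the sigmoid rounds to exactly 0 or 1 in floating point, so the log evaluates to -inf. Use the stable form instead: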

loss_per_example = max(z, 0) - z*y + log1p(exp(-|z|))

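where z = x @ weights is the logit of each example and y is its label. This is algebraically equal to the textbook binary cross-entropy, -[y*log(sigmoid(z)) + (1 - y)*log(1 - sigmoid(z))]. Because exp is only ever applied to the non-positive argument -|z|, the result stays finite for large positive or negative logits. PyTorch's binary_cross_entropy_with_logits computes the same quantity.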
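The stable form can be sketched for a single scalar logit with Python's math module. The function names stable_bce_with_logits and naive_bce are hypothetical, chosen only for this illustration, and the scalar version stands in for the elementwise tensor computation.

```python
import math

def stable_bce_with_logits(z: float, y: float) -> float:
    """Per-example BCE from a logit: max(z, 0) - z*y + log1p(exp(-|z|))."""
    # exp only ever sees a non-positive argument, so it cannot overflow.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def naive_bce(z: float, y: float) -> float:
    """Textbook form, shown only for comparison; fails for extreme logits."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

# The two forms agree for a moderate logit.
moderate_gap = abs(stable_bce_with_logits(2.0, 1.0) - naive_bce(2.0, 1.0))

# For a large logit the sigmoid rounds to exactly 1.0, so the naive form
# ends up taking log(0.0), which Python's math.log rejects.
try:
    naive_bce(40.0, 0.0)
    naive_failed = False
except ValueError:
    naive_failed = True

# The stable form stays finite and is close to |z| for a confident mistake.
large_loss = stable_bce_with_logits(40.0, 0.0)
```

With tensors, the same expression applies elementwise; libraries typically report -inf or nan rather than raising an exception in the naive case.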

Output format

Return a 1-D tensor of shape (2,) containing [mean_bce_loss, accuracy].

Inputs

weights: shape (d,) — frozen linear-classifier weights.
xs: shape (num_batches, B, d) — stacked batch inputs.
ys: shape (num_batches, B) — stacked binary labels in {0.0, 1.0}.
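
Putting the pieces together, the following is a minimal sketch of the whole loop, not a reference solution. It uses nested Python lists in place of tensors so that it runs without any dependencies. The function name evaluate and the decision rule (predict 1 when the logit is greater than 0, that is, when sigmoid(z) > 0.5) are assumptions; the text above does not state the threshold.

```python
import math

def evaluate(weights, xs, ys):
    """Return [mean_bce_loss, accuracy] over all examples in all batches."""
    total_loss = 0.0
    total_correct = 0
    total_examples = 0
    for batch_x, batch_y in zip(xs, ys):
        for x, y in zip(batch_x, batch_y):
            # Logit of the linear classifier: z = x @ weights.
            z = sum(xi * wi for xi, wi in zip(x, weights))
            # Numerically stable BCE with logits.
            total_loss += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            # Assumed decision rule: predict 1 when z > 0.
            pred = 1.0 if z > 0.0 else 0.0
            total_correct += int(pred == y)
            total_examples += 1
    # Divide once at the end, over examples rather than over batches.
    return [total_loss / total_examples, total_correct / total_examples]

# Hypothetical data: d = 2, num_batches = 2, B = 2.
weights = [1.0, -1.0]
xs = [[[2.0, 0.0], [0.0, 2.0]],
      [[1.0, 2.0], [3.0, 0.0]]]
ys = [[1.0, 0.0],
      [1.0, 0.0]]
result = evaluate(weights, xs, ys)  # accuracy (result[1]) is 0.5
```

In PyTorch the two inner loops would collapse into tensor operations, and the final list would become a tensor of shape (2,).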

Hints