
Precision, Recall & F1

Compute per-class precision, recall, and F1, plus their macro-averages.

Definitions

For class i, define:

  • True Positives (TP): examples whose true label and predicted label are both i, i.e. M[i, i] in the confusion matrix M, where M[t, p] counts examples with true label t and predicted label p.
  • False Positives (FP): examples predicted as i but with a different true label, i.e. the off-diagonal entries of column i.
  • False Negatives (FN): examples whose true label is i but that were predicted as something else, i.e. the off-diagonal entries of row i.

Precision[i] = P[i] = TP / (TP + FP) = M[i, i] / sum(M[:, i])
Recall[i]    = R[i] = TP / (TP + FN) = M[i, i] / sum(M[i, :])
F1[i]        = 2 · P[i] · R[i] / (P[i] + R[i])

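0 / 0 convention: if a denominator is zero, the metric is defined as 0.0 (avoids NaN). This happens for Precision[i] when class i is never predicted, for Recall[i] when class i never appears in the labels, and for F1[i] when P[i] + R[i] = 0.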
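
The definitions above can be sketched in NumPy as follows. This is a minimal illustration, not the reference solution: the function name per_class_metrics, the helper safe_div, and the 3-class matrix are all hypothetical, and the code assumes rows of M index true labels and columns index predicted labels.

```python
import numpy as np

def per_class_metrics(M):
    """Per-class precision, recall and F1 from a confusion matrix.

    Assumption: M[t, p] counts examples with true label t and predicted label p.
    """
    M = np.asarray(M, dtype=np.float64)
    tp = np.diag(M)              # diagonal entries M[i, i]
    fp = M.sum(axis=0) - tp      # rest of column i
    fn = M.sum(axis=1) - tp      # rest of row i

    def safe_div(num, den):
        # 0 / 0 convention: leave the result at 0.0 wherever den is zero.
        out = np.zeros_like(den, dtype=np.float64)
        np.divide(num, den, out=out, where=den > 0)
        return out

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return precision, recall, f1

# Hypothetical 3-class example; class 2 never appears and is never predicted,
# so all three of its metrics fall back to 0.0.
M = [[2, 1, 0],
     [0, 1, 0],
     [0, 0, 0]]
precision, recall, f1 = per_class_metrics(M)
```

Writing the division with a `where` mask avoids producing NaN in the first place, rather than computing it and replacing it afterwards.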

Macro-averaging

Macro-averaging computes each metric independently per class and then takes a simple (unweighted) mean:

Macro-P = mean(P[0], …, P[C-1])
Macro-R = mean(R[0], …, R[C-1])
Macro-F1 = mean(F1[0], …, F1[C-1])

This treats every class equally regardless of its frequency. The alternative, micro-averaging, pools all TP / FP / FN counts across classes before dividing, so frequent classes contribute more to the result. Use macro when class imbalance should not dominate the reported metric.
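
To make the difference concrete, here is a small sketch comparing macro- and micro-averaged recall. The imbalanced 2-class confusion matrix is made up for illustration, and rows are assumed to index true labels.

```python
import numpy as np

# Hypothetical imbalanced data: 90 examples of class 0, 10 of class 1.
# Assumption: M[t, p] counts examples with true label t and predicted label p.
M = np.array([[90.0, 0.0],
              [8.0, 2.0]])

tp = np.diag(M)
fn = M.sum(axis=1) - tp

# Per-class recall (no zero denominators in this example).
recall = tp / (tp + fn)

# Macro: average the per-class values, each class counts once.
macro_recall = recall.mean()

# Micro: pool the counts first, then divide once.
micro_recall = tp.sum() / (tp.sum() + fn.sum())
```

Here the rare class has recall 0.2, which pulls the macro average down to 0.6, while the pooled micro value stays at 0.92 because the frequent class dominates the counts.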

When to use precision vs recall vs F1

  • Precision matters when false positives are costly (e.g. spam filter: you do not want legitimate email flagged).
  • Recall matters when false negatives are costly (e.g. cancer screening: you do not want a positive case missed).
  • F1 is the harmonic mean of the two — useful when you want a single number that balances both concerns.
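
A short sketch of why the harmonic mean is used: it stays close to the smaller of the two values, so a model cannot score well on F1 by maximising only one of precision or recall. The helper name harmonic_f1 and the input values are illustrative.

```python
def harmonic_f1(p, r):
    # Return 0.0 when p + r is zero, matching the 0 / 0 convention.
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Both pairs have the same arithmetic mean (0.55).
balanced = harmonic_f1(0.55, 0.55)
lopsided = harmonic_f1(1.0, 0.1)
```

The balanced pair gives an F1 of 0.55, while the lopsided pair gives 2/11, roughly 0.18, despite perfect precision.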

Inputs

  • predictions: shape (N,) — predicted class indices (delivered as floats by the test harness).
  • labels: shape (N,) — true class indices (delivered as floats).
  • num_classes: integer C.

Output

Tensor of shape (3, C+1):

Row  Contents
0    P[0], …, P[C-1], macro-P
1    R[0], …, R[C-1], macro-R
2    F1[0], …, F1[C-1], macro-F1
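
One possible end-to-end sketch that produces this layout is shown below. It is not the site's reference solution: it uses NumPy arrays in place of whatever tensor type the harness expects, the function name precision_recall_f1 is hypothetical, and it assumes the float inputs hold exact integer class indices in the range 0 to C-1.

```python
import numpy as np

def precision_recall_f1(predictions, labels, num_classes):
    # Inputs arrive as floats; round and cast so they can be used as indices.
    preds = np.rint(np.asarray(predictions)).astype(np.int64)
    labs = np.rint(np.asarray(labels)).astype(np.int64)

    # Confusion matrix with M[t, p] = number of (true t, predicted p) pairs.
    # np.add.at accumulates correctly when the same (t, p) pair repeats.
    M = np.zeros((num_classes, num_classes), dtype=np.float64)
    np.add.at(M, (labs, preds), 1.0)

    tp = np.diag(M)

    def safe_div(num, den):
        # 0 / 0 convention: result stays 0.0 wherever the denominator is zero.
        out = np.zeros_like(den, dtype=np.float64)
        np.divide(num, den, out=out, where=den > 0)
        return out

    p = safe_div(tp, M.sum(axis=0))   # column sums = TP + FP
    r = safe_div(tp, M.sum(axis=1))   # row sums    = TP + FN
    f1 = safe_div(2 * p * r, p + r)

    per_class = np.stack([p, r, f1])                    # shape (3, C)
    macro = per_class.mean(axis=1, keepdims=True)       # shape (3, 1)
    return np.concatenate([per_class, macro], axis=1)   # shape (3, C+1)

# Hypothetical inputs with C = 3; class 2 never occurs.
out = precision_recall_f1(
    predictions=[0.0, 0.0, 1.0, 1.0, 0.0],
    labels=[0.0, 0.0, 0.0, 1.0, 1.0],
    num_classes=3,
)
```

Note that the macro column averages over all C classes, including classes whose metrics were set to 0.0 by the zero-denominator convention.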

