Precision, Recall & F1

Compute per-class precision, recall, and F1, plus their macro-averages.

Definitions

For class i, with M[t, p] counting the examples whose true label is t and whose predicted label is p, define:

True Positives (TP): examples whose true label and predicted label are both i, i.e. M[i, i] in the confusion matrix.
False Positives (FP): examples predicted as i but with a different true label, i.e. the rest of column i: sum(M[:, i]) - M[i, i].
False Negatives (FN): examples whose true label is i but that were predicted as something else, i.e. the rest of row i: sum(M[i, :]) - M[i, i].

P[i]  = Precision[i] = TP / (TP + FP) = M[i, i] / sum(M[:, i])
R[i]  = Recall[i]    = TP / (TP + FN) = M[i, i] / sum(M[i, :])
F1[i] = 2 · P[i] · R[i] / (P[i] + R[i])

0 / 0 convention: if a denominator is zero (class i is never predicted, class i never occurs, or precision and recall are both 0), the metric is defined as 0.0 instead of NaN.
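The definitions above can be sketched in a few lines. This is a minimal, dependency-free illustration that assumes the confusion matrix is a nested list indexed as M[true][predicted]; the names `safe_div` and `per_class_metrics` and the example matrix are hypothetical, not part of the exercise.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when den is zero (the 0 / 0 convention)."""
    return num / den if den != 0 else 0.0


def per_class_metrics(M):
    """Per-class precision, recall and F1 from a square confusion matrix.

    Assumes M[t][p] counts examples with true label t and predicted label p.
    """
    C = len(M)
    precision, recall, f1 = [], [], []
    for i in range(C):
        tp = M[i][i]
        col_sum = sum(M[t][i] for t in range(C))  # TP + FP: everything predicted as i
        row_sum = sum(M[i])                       # TP + FN: everything whose true label is i
        p = safe_div(tp, col_sum)
        r = safe_div(tp, row_sum)
        precision.append(p)
        recall.append(r)
        f1.append(safe_div(2 * p * r, p + r))
    return precision, recall, f1


# Hypothetical 3-class matrix; class 2 never occurs and is never predicted,
# so all three of its metrics fall back to 0.0.
M = [
    [3, 1, 0],
    [2, 4, 0],
    [0, 0, 0],
]
precision, recall, f1 = per_class_metrics(M)
```

Class 2 exercises the zero-denominator branch for precision, recall, and F1 at once, which is the case most likely to produce NaN in a vectorised implementation.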

Macro-averaging

Macro-averaging computes each metric independently per class and then takes a simple (unweighted) mean:

Macro-P = mean(P[0], …, P[C-1])
Macro-R = mean(R[0], …, R[C-1])
Macro-F1 = mean(F1[0], …, F1[C-1])

This treats every class equally regardless of its frequency. The alternative, micro-averaging, pools the TP / FP / FN counts of all classes before dividing, so frequent classes dominate the result. Use macro-averaging when class imbalance should not dominate the reported metric.
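To make the difference concrete, here is a small sketch comparing macro- and micro-averaged recall. The per-class counts are made up for illustration: class 0 is frequent and always recovered, class 1 is rare and mostly missed.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when den is zero."""
    return num / den if den != 0 else 0.0


# Hypothetical counts per class (index 0 = frequent class, index 1 = rare class).
tp = [90, 1]
fn = [0, 9]

# Macro: compute recall per class first, then take the unweighted mean.
per_class_recall = [safe_div(t, t + f) for t, f in zip(tp, fn)]
macro_recall = sum(per_class_recall) / len(per_class_recall)

# Micro: pool the counts over all classes first, then divide once.
micro_recall = safe_div(sum(tp), sum(tp) + sum(fn))

print(f"per-class recall: {per_class_recall}")
print(f"macro recall: {macro_recall:.2f}, micro recall: {micro_recall:.2f}")
```

With these counts the macro average is pulled down to about 0.55 by the rare class, while the micro average stays at 0.91 because the frequent class contributes most of the pooled counts.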

When to use precision vs recall vs F1

Precision matters when false positives are costly (e.g. spam filter: you do not want legitimate email flagged).
Recall matters when false negatives are costly (e.g. cancer screening: you do not want a positive case missed).
F1 is the harmonic mean of the two — useful when you want a single number that balances both concerns.
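A short sketch of why the harmonic mean is used: it stays close to the smaller of the two values, so a model cannot score well on F1 by maximising only one of precision or recall. The function name `f1_score` and the example values are illustrative assumptions.

```python
def f1_score(p, r):
    """Harmonic mean of precision p and recall r; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if (p + r) != 0 else 0.0


# Hypothetical classifier: perfect precision, very poor recall.
p, r = 1.0, 0.1

arithmetic = (p + r) / 2    # simple average, dominated by the larger value
harmonic = f1_score(p, r)   # F1, dominated by the smaller value

print(f"arithmetic mean: {arithmetic:.3f}")
print(f"F1 (harmonic):   {harmonic:.3f}")
```

Here the arithmetic mean is 0.55 while F1 is roughly 0.18, which reflects the poor recall much more directly.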

Inputs

predictions: shape (N,) — predicted class indices (delivered as floats by the test harness).
labels: shape (N,) — true class indices (delivered as floats).
num_classes: integer C.
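Because the class indices arrive as floats, they need to be converted to integers before they can index a confusion matrix. The sketch below assumes the inputs are plain Python sequences whose values are whole numbers in the range 0 to C-1; `confusion_matrix` is a hypothetical helper name, and the sample data is invented.

```python
def confusion_matrix(predictions, labels, num_classes):
    """Build M where M[t][p] counts examples with true label t and prediction p."""
    M = [[0] * num_classes for _ in range(num_classes)]
    for pred, true in zip(predictions, labels):
        # Values are delivered as floats (e.g. 2.0), so convert before indexing.
        M[int(round(true))][int(round(pred))] += 1
    return M


# Hypothetical inputs with N = 5 and C = 3.
predictions = [0.0, 1.0, 1.0, 2.0, 0.0]
labels = [0.0, 1.0, 0.0, 2.0, 0.0]

M = confusion_matrix(predictions, labels, 3)
for row in M:
    print(row)
```

Rows correspond to true labels and columns to predictions, which matches the column-sum (precision) and row-sum (recall) formulas in the Definitions section.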

Output

Tensor of shape (3, C+1):

Row	Contents
0	`P[0], …, P[C-1], macro-P`
1	`R[0], …, R[C-1], macro-R`
2	`F1[0], …, F1[C-1], macro-F1`
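The output layout can be sketched as follows, using nested lists in place of a tensor. The helper names `with_macro` and `assemble_output` and the per-class values are assumptions for illustration; converting the result to the tensor type expected by the test harness is left out.

```python
def with_macro(values):
    """Return the per-class values followed by their unweighted mean."""
    return list(values) + [sum(values) / len(values)]


def assemble_output(precision, recall, f1):
    """Stack the three metric rows into a 3 x (C+1) nested list."""
    return [with_macro(precision), with_macro(recall), with_macro(f1)]


# Hypothetical per-class metrics for C = 2.
precision = [0.5, 1.0]
recall = [1.0, 0.5]
f1 = [2 / 3, 2 / 3]

out = assemble_output(precision, recall, f1)
for name, row in zip(("P", "R", "F1"), out):
    print(name, [round(v, 3) for v in row])
```

Each row has C per-class entries followed by the macro-average in the last column, so the result has shape (3, C+1).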
