Tiny ResNet Classifier

Why this matters

A residual block by itself doesn’t classify anything. To get a full image classifier, you wrap a STACK of blocks with two bookends:

A stem: an initial conv (often 7x7 stride 2 in full-scale ResNets, here just 3x3 stride 1) that lifts the raw image into feature space and sets the channel count.
A head: pooling that collapses spatial dims, plus a final Dense that maps features to per-class logits.

The result is the canonical CNN classifier shape:

image (H, W, 3)
    → STEM:    Conv → BN → ReLU
    → BACKBONE: BasicBlock × N
    → HEAD:    GlobalAvgPool → Dense(num_classes)
    → logits (num_classes,)

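Most modern image classifiers follow this stem → backbone → head outline. ResNet, EfficientNet, and ConvNeXt fit it directly; ViT fits it loosely, with a patch embedding replacing the conv stem and transformer blocks as the backbone.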

Global average pooling

Global average pooling is the cheap and effective replacement for Flatten + Dense(huge). For a batched feature map of shape (B, H, W, C), take the mean over the spatial axes:

pooled = jnp.mean(features, axis=(1, 2))   # batched: axes 1 and 2
# shape: (B, C)

Each channel is summarized by a single number — its mean over space. Then a single Dense(num_classes) maps (C,) → (num_classes,).
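As a quick check of the pooling step, here is a minimal sketch using only jax.numpy. The helper name global_avg_pool and the toy feature maps are illustrative assumptions, not part of the problem statement.

```python
import jax.numpy as jnp


def global_avg_pool(x):
    """Mean over the spatial axes of a (B, H, W, C) array, giving (B, C)."""
    return jnp.mean(x, axis=(1, 2))


# Hypothetical feature map: B=2, H=4, W=4, C=3, filled with 0..95.
features = jnp.arange(2 * 4 * 4 * 3, dtype=jnp.float32).reshape(2, 4, 4, 3)
pooled = global_avg_pool(features)

# A larger spatial size still pools down to the same (B, C) shape.
pooled_big = global_avg_pool(jnp.ones((2, 8, 8, 3)))

print(pooled.shape, pooled_big.shape)
```

Both calls produce a (2, 3) array, which is what lets the Dense head stay the same size regardless of H and W.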

Why is this so much better than flatten?

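Far fewer params in the head. Flatten produces H*W*C units, so the following Dense needs H*W*C*num_classes weights; after global average pooling it needs only C*num_classes.
Spatial invariance: averaging treats every position equally, so the head has no position-specific weights to overfit. In practice this acts as a mild regularizer.
Resolution-agnostic: the head’s parameter count doesn’t depend on input size, so the same trained model runs on bigger or smaller images (accuracy can still drop far from the training resolution).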
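To put numbers on the parameter savings, the sketch below counts head parameters (weights plus biases) for both designs. The function names are hypothetical, and the 7x7x512 feature map with 1000 classes is an assumed example matching the final stage of an ImageNet-scale ResNet-18.

```python
def flatten_head_params(h, w, c, num_classes):
    """Flatten + Dense: one weight per (flattened unit, class) pair, plus biases."""
    return h * w * c * num_classes + num_classes


def gap_head_params(c, num_classes):
    """Global average pool + Dense: one weight per (channel, class) pair, plus biases."""
    return c * num_classes + num_classes


flatten_head = flatten_head_params(7, 7, 512, 1000)
gap_head = gap_head_params(512, 1000)

print(flatten_head)  # 25089000
print(gap_head)      # 513000
```

Note that gap_head_params takes no h or w argument at all, which is the resolution-agnostic property in code form.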

Worked walk-through

Input (4, 4, 3), num_classes=3, stem_features = 3 (matches input channels so the residual works without a projection):

Add batch dim: (1, 4, 4, 3).
Stem: Conv3x3(3) → BN → ReLU → (1, 4, 4, 3). The conv uses stride 1 and SAME padding, so the spatial size stays 4x4.
BasicBlock(features=3) × 2 → still (1, 4, 4, 3). Each block’s residual works because input channels = features.
Global avg pool over spatial axes (1, 2) → (1, 3).
Dense(num_classes=3) → (1, 3).
reshape(-1) → (3,).

The output is the per-class logits vector. (No softmax — that happens in the loss / inference step.)
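The shapes in the walk-through can be traced in plain JAX. This is a shapes-only sketch under stated assumptions: it uses jax.lax.conv_general_dilated with random stand-in weights, and it skips BN and the two BasicBlocks because they leave the shape unchanged. It is not the reference implementation.

```python
import jax
import jax.numpy as jnp

k_img, k_conv, k_dense = jax.random.split(jax.random.PRNGKey(0), 3)

image = jax.random.normal(k_img, (4, 4, 3))
x = image[None, ...]                                     # add batch dim: (1, 4, 4, 3)

# Stand-in stem conv: 3x3 kernel, 3 input channels, 3 output channels (HWIO layout).
kernel = 0.1 * jax.random.normal(k_conv, (3, 3, 3, 3))
stem = jax.lax.conv_general_dilated(
    x, kernel,
    window_strides=(1, 1),
    padding="SAME",                                      # keeps H and W at 4
    dimension_numbers=("NHWC", "HWIO", "NHWC"),
)
stem = jax.nn.relu(stem)                                 # (1, 4, 4, 3)

pooled = jnp.mean(stem, axis=(1, 2))                     # (1, 3)

# Stand-in Dense weights mapping C=3 features to num_classes=3 logits.
dense_w = 0.1 * jax.random.normal(k_dense, (3, 3))
logits = (pooled @ dense_w).reshape(-1)                  # (3,)

print(x.shape, stem.shape, pooled.shape, logits.shape)
```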

Common pitfalls

Pooling over the wrong axes: with batch dim, pool over (1, 2), NOT (0, 1, 2) (would also collapse the batch). For unbatched: (0, 1).
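Forgetting mutable=['batch_stats']: every BN in the stem AND the blocks updates its running statistics in the batch_stats collection during a training-mode forward pass. If apply is called without this flag, the collection is read-only and the first BN update raises an error. With the flag set, apply returns an (output, updated_state) tuple, so remember to unpack it.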
Mismatched channel counts in the residual blocks: this problem keeps stem_features constant through both blocks (matching image.shape[-1]) so the identity skip works throughout. Don’t insert a downsampling block here.
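Putting the Dense BEFORE pooling: flattening (H, W, C) and feeding it to Dense costs H*W*C*num_classes weights, and applying Dense directly to the 4-D feature map yields per-position logits of shape (B, H, W, num_classes) instead of (B, num_classes). Pool first, then apply Dense.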
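The axis pitfall is easy to see from output shapes alone. A minimal sketch with an assumed batch of two all-ones feature maps:

```python
import jax.numpy as jnp

features = jnp.ones((2, 4, 4, 3))                  # (B, H, W, C)

good = jnp.mean(features, axis=(1, 2))             # (2, 3): one vector per example
bad = jnp.mean(features, axis=(0, 1, 2))           # (3,): the batch has been averaged away
unbatched = jnp.mean(features[0], axis=(0, 1))     # (3,): correct axes for a single (H, W, C) map

print(good.shape, bad.shape, unbatched.shape)
```

The bad result has a plausible-looking shape when B is 1, so the bug tends to surface only once the batch size grows.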

Problem

Implement resnet_classifier_forward(seed, image, num_classes):

stem_features = image.shape[-1] (so block residuals work without projection).
Module stack: Conv3x3 → BN → ReLU → BasicBlock × 2 → mean over (H, W) → Dense(num_classes).
Init and apply with batched input of shape (1, H, W, C); pass mutable=['batch_stats'] to apply.
Return logits flattened to 1-D.

Inputs:

seed: int.
image: 3-D (H, W, C).
num_classes: int (output dim).