Vision Transformer (Mean-Pool Variant)

Why this matters

The Vision Transformer (Dosovitskiy et al., 2021) showed that Transformers, originally designed for sequences, can match or beat convolutional networks on image classification when pre-trained on enough data. The trick is to reshape the image into a sequence: cut it into patches, embed each patch, treat the result as (num_patches, d_model) tokens, and feed them to a stock Transformer encoder.

ViT became the backbone for CLIP, DINO, MAE, SAM, and many other modern vision models. Once you have patch embeddings, the recipe closely mirrors BERT: a stack of encoder blocks with bidirectional (unmasked) self-attention.

Architecture

image (H, W, C)
→ patch embed (strided conv)  → (num_patches, D)
→ + learned pos embed         → (num_patches, D)
→ encoder block × N           → (num_patches, D)
→ LayerNorm                   → (num_patches, D)
→ mean-pool over patches      → (D,)
→ Dense(num_classes)          → (num_classes,)
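
The shape bookkeeping in the diagram can be checked with a small helper. This is only an illustrative sketch: vit_shapes is a hypothetical name, not part of the problem's API, and the 224 × 224 image with 16 × 16 patches is an example configuration chosen here.

```python
def vit_shapes(H, W, C, P, D, num_classes):
    """Return the array shape after each stage of the mean-pool ViT pipeline."""
    if H % P or W % P:
        raise ValueError("H and W must be divisible by the patch size P")
    num_patches = (H // P) * (W // P)
    return {
        "image": (H, W, C),
        "patch_embed": (num_patches, D),
        "pos_embed": (num_patches, D),
        "encoder": (num_patches, D),     # every block preserves the shape
        "layer_norm": (num_patches, D),
        "mean_pool": (D,),
        "logits": (num_classes,),
    }


# Example configuration (an assumption for illustration): 224x224 RGB, 16x16 patches.
shapes = vit_shapes(H=224, W=224, C=3, P=16, D=768, num_classes=1000)
print(shapes["patch_embed"])  # (196, 768)
```

Only the patch embedding and the pooling step change the shape; everything in between operates on a fixed (num_patches, D) array.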

Two halves:

Patches → tokens → encoder: same idea as BERT, just with strided-conv patches as the “embedding” instead of token IDs.
Encoder output → classifier: pool to a single vector, then a linear head to class logits.

This problem uses mean-pooling over the patch sequence to get one vector per image. The next problem (pos 46) replaces mean-pool with a [CLS] token — the original ViT design.

Patch embedding (refresher from pos 38)

feat = nn.Conv(features=D, kernel_size=(P, P), strides=(P, P), padding="VALID")(image)
tokens = feat.reshape(num_patches, D)        # (H/P · W/P, D)

Strided conv with kernel = stride = patch_size is the standard way to express “non-overlapping linear projection of each patch.”
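
To see why the strided conv is just a shared linear projection of each patch, the sketch below computes the tokens both ways and compares them. It uses jax.lax.conv_general_dilated instead of nn.Conv so that it runs without a Flax module; the random kernel, the omitted bias, and the variable names are assumptions for illustration.

```python
import jax
import jax.numpy as jnp

H, W, C, P, D = 4, 4, 3, 2, 8
k_img, k_w = jax.random.split(jax.random.PRNGKey(0))
image = jax.random.normal(k_img, (H, W, C))
kernel = jax.random.normal(k_w, (P, P, C, D))  # HWIO layout; bias omitted

# Route 1: strided convolution with kernel = stride = P.
feat = jax.lax.conv_general_dilated(
    image[None],                       # add a batch axis: (1, H, W, C)
    kernel,
    window_strides=(P, P),
    padding="VALID",
    dimension_numbers=("NHWC", "HWIO", "NHWC"),
)[0]                                   # (H/P, W/P, D)
tokens_conv = feat.reshape(-1, D)      # (num_patches, D)

# Route 2: cut out each patch, flatten it, and apply one shared matmul.
patches = image.reshape(H // P, P, W // P, P, C)    # [i, a, j, b, c]
patches = patches.transpose(0, 2, 1, 3, 4)          # [i, j, a, b, c]
patches = patches.reshape(-1, P * P * C)            # (num_patches, P*P*C)
tokens_linear = patches @ kernel.reshape(P * P * C, D)

print(tokens_conv.shape)  # (4, 8)
print(jnp.allclose(tokens_conv, tokens_linear, atol=1e-4))
```

Both routes visit the patches in row-major order, so row k of either result corresponds to the same image patch.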

Worked walk-through

With image (4, 4, 3), P=2, D=8, num_layers=2, num_classes=4:

feat = conv(image) → (2, 2, 8). Reshape → (4, 8) (4 patches).
pos = pos_embed[:4] → (4, 8). x = tokens + pos.
Two ViT encoder blocks (Pre-LN, no causal mask): each computes x = x + MHA(LN(x)) and then x = x + FFN(LN(x)), so the shape stays (4, 8).
x = LayerNorm(x).
pooled = jnp.mean(x, axis=0) → (8,).
logits = Dense(4)(pooled) → (4,).
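
The walk-through can be traced in plain JAX to confirm the shapes. This is a sketch under stated assumptions, not the reference solution: num_heads=2 and d_ff=16 are picked here because the walk-through does not specify them, weights are drawn ad hoc inside the functions, LayerNorm has no learned scale or bias, and GELU is assumed as the FFN activation.

```python
import jax
import jax.numpy as jnp


def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature axis (no learned scale/bias here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / jnp.sqrt(var + eps)


def dense_params(key, d_in, d_out):
    # Hypothetical helper: random weight matrix and zero bias.
    return jax.random.normal(key, (d_in, d_out)) / jnp.sqrt(d_in), jnp.zeros((d_out,))


def encoder_block(key, x, num_heads, d_ff):
    """Pre-LN block: x + MHA(LN(x)), then x + FFN(LN(x)). No causal mask."""
    n, d = x.shape
    dh = d // num_heads
    kq, kk, kv, ko, k1, k2 = jax.random.split(key, 6)
    h = layer_norm(x)

    def heads(k):
        # Project, then split features into heads: (num_heads, n, dh).
        w, b = dense_params(k, d, d)
        return (h @ w + b).reshape(n, num_heads, dh).transpose(1, 0, 2)

    q, k_, v = heads(kq), heads(kk), heads(kv)
    scores = q @ k_.transpose(0, 2, 1) / jnp.sqrt(dh)       # (num_heads, n, n)
    ctx = jax.nn.softmax(scores, axis=-1) @ v               # (num_heads, n, dh)
    ctx = ctx.transpose(1, 0, 2).reshape(n, d)              # merge heads
    wo, bo = dense_params(ko, d, d)
    x = x + ctx @ wo + bo                                   # attention residual
    h = layer_norm(x)
    w1, b1 = dense_params(k1, d, d_ff)
    w2, b2 = dense_params(k2, d_ff, d)
    return x + jax.nn.gelu(h @ w1 + b1) @ w2 + b2           # FFN residual


H, W, C, P, D = 4, 4, 3, 2, 8
num_heads, d_ff, num_layers, num_classes = 2, 16, 2, 4     # num_heads, d_ff assumed
keys = jax.random.split(jax.random.PRNGKey(0), 4 + num_layers)
image = jax.random.normal(keys[0], (H, W, C))

# Step 1: patch embedding via strided conv, then reshape to tokens.
kernel = jax.random.normal(keys[1], (P, P, C, D)) * 0.02
feat = jax.lax.conv_general_dilated(
    image[None], kernel, (P, P), "VALID",
    dimension_numbers=("NHWC", "HWIO", "NHWC"),
)[0]
tokens = feat.reshape(-1, D)

# Step 2: add the position embedding (random stand-in for a learned one).
pos_embed = jax.random.normal(keys[2], (tokens.shape[0], D)) * 0.02
x = tokens + pos_embed

# Step 3: encoder blocks.
for i in range(num_layers):
    x = encoder_block(keys[4 + i], x, num_heads, d_ff)

# Steps 4 to 6: final LayerNorm, mean-pool over patches, linear head.
x = layer_norm(x)
pooled = jnp.mean(x, axis=0)
w_head, b_head = dense_params(keys[3], D, num_classes)
logits = pooled @ w_head + b_head

print(feat.shape, tokens.shape, pooled.shape, logits.shape)
# (2, 2, 8) (4, 8) (8,) (4,)
```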

Common pitfalls

Forgetting the position embedding: with no positions, the patches are an unordered set, so the model can’t tell “top-left” from “bottom-right,” and with mean-pooling the logits come out the same for any shuffling of the patches. Position embedding length = num_patches.
Pooling axis: jnp.mean(x, axis=0) reduces over patches (axis 0). axis=-1 reduces over D — wrong.
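Patch size not dividing image: with padding='VALID', the trailing rows and columns that do not fill a whole patch are silently dropped. Resize or crop the image first so that H and W are multiples of the patch size.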
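Stride ≠ patch size: a stride smaller than the patch size makes patches overlap, and a larger one skips pixels between patches. Neither is ViT's non-overlapping patch embedding.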
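
The pitfalls above can be reproduced in a few lines. The sketch below is illustrative only: it uses a single self-attention layer with random weights as a stand-in for the encoder, and all names (attend_and_pool, conv, and so on) are hypothetical.

```python
import jax
import jax.numpy as jnp

n, d = 4, 8  # num_patches, D
k_x, k_q, k_k, k_v, k_pos = jax.random.split(jax.random.PRNGKey(0), 5)
x = jax.random.normal(k_x, (n, d))  # stand-in for encoder output

# Pooling axis: axis=0 averages over patches, axis=-1 averages over features.
pooled_ok = jnp.mean(x, axis=0)
pooled_bad = jnp.mean(x, axis=-1)
print(pooled_ok.shape, pooled_bad.shape)  # (8,) (4,)


def conv(image, kernel, stride):
    return jax.lax.conv_general_dilated(
        image, kernel, (stride, stride), "VALID",
        dimension_numbers=("NHWC", "HWIO", "NHWC"),
    )


kernel = jnp.ones((2, 2, 3, d))  # patch size 2
# 5x5 image, patch size 2: VALID drops the leftover row and column.
feat_valid = conv(jnp.ones((1, 5, 5, 3)), kernel, stride=2)
# 4x4 image, stride 1 with patch size 2: 9 overlapping patches instead of 4.
feat_overlap = conv(jnp.ones((1, 4, 4, 3)), kernel, stride=1)
print(feat_valid.shape)    # (1, 2, 2, 8)
print(feat_overlap.shape)  # (1, 3, 3, 8)

# Missing position embedding: attention followed by mean-pool ignores patch order.
wq, wk, wv = (jax.random.normal(k, (d, d)) / jnp.sqrt(d) for k in (k_q, k_k, k_v))
pos = jax.random.normal(k_pos, (n, d))  # random stand-in for a learned embedding


def attend_and_pool(tokens):
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = jax.nn.softmax(q @ k.T / jnp.sqrt(d), axis=-1)
    return jnp.mean(attn @ v, axis=0)


perm = jnp.array([3, 2, 1, 0])  # reverse the patch order
invariant = jnp.allclose(
    attend_and_pool(x), attend_and_pool(x[perm]), atol=1e-4)
still_invariant = jnp.allclose(
    attend_and_pool(x + pos), attend_and_pool(x[perm] + pos), atol=1e-4)
print(invariant, still_invariant)
```

Without the position embedding, the shuffled and unshuffled inputs pool to the same vector; once the embedding is added at fixed positions, the two results differ.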

Problem

Implement vit_forward(seed, image, patch_size, d_model, num_heads, d_ff, num_layers, num_classes):

Patch-embed via strided conv. Reshape to (num_patches, D).
Add a learned position embedding of shape (num_patches, D).
N ViT encoder blocks (Pre-LN MHA + FFN, no causal mask).
Final LayerNorm.
Mean-pool over patches → (D,).
Dense(num_classes)(pooled) → (num_classes,).
Return the logits flattened to 1-D.

Inputs:

seed: int.
image: 3-D array of shape (H, W, C); H and W are divisible by patch_size.
All other args: ints.

Output: 1-D, length num_classes.