Checkpointed Deep Stack via scan

Why this matters

Deep stacks of identical layers — transformer blocks, ResNet stages, SSM recurrences — are the canonical use case for gradient checkpointing. For an N-layer network:

Without checkpointing: O(N) activation memory (one stored tensor per layer).
With scan + checkpoint: O(1) activation memory — only the current carry is live at any time, and the per-step activations are recomputed during the backward pass.

lax.scan provides the loop primitive. Wrapping its body function with jax.checkpoint gives you the recompute-on-demand behaviour for every scan iteration. The combination is the standard pattern for training very deep or very long models.

Worked mini-example

import jax
import jax.numpy as jnp
from jax import lax

weights = jnp.stack([jnp.eye(2)] * 3)   # 3 identity layers, shape (3,2,2)
x = jnp.ones((1, 2))

# Non-checkpointed body — activations retained across all 3 iterations.
def layer_plain(y, w):
    return jax.nn.relu(y @ w), None

# Checkpointed body — activations of each step are dropped and recomputed.
@jax.checkpoint
def layer_ckpt(y, w):
    return jax.nn.relu(y @ w), None

def loss(weights, x, body):
    y_final, _ = lax.scan(body, x, weights)
    return jnp.sum(y_final ** 2)

g_plain = jax.grad(loss, argnums=0)(weights, x, layer_plain)
g_ckpt  = jax.grad(loss, argnums=0)(weights, x, layer_ckpt)

assert jnp.allclose(g_plain, g_ckpt)   # True — gradients are identical

Common pitfalls

lax.scan body signature — body(carry, x_slice) -> (new_carry, output). Here: layer(y, w) -> (new_y, None). The None is the per-step output (stacked across steps); we discard it.
Wrap the BODY, not the whole scan — jax.checkpoint(lax.scan(...)) is wrong; that would checkpoint the entire unrolled computation. Wrap the per-step function instead.
@jax.checkpoint as a decorator is equivalent to jax.checkpoint(fn) — both are valid.
Very deep stacks (>100) — this combination is essential; without it memory grows linearly with depth.

Problem

Implement grad_checkpointed_stack(weights, x) that:

Defines a per-layer body layer(y, w) = (relu(y @ w), None) wrapped with jax.checkpoint.
Applies it across N=8 layers via lax.scan.
Computes loss = sum(y_final ** 2).
Returns jax.grad(loss, argnums=0)(weights, x) — the gradient w.r.t. weights.

weights: 3-D jax array (8, d, d).
x: 2-D jax array (N_batch, d).

Returns: array same shape as weights — gradient of the loss w.r.t. each layer weight matrix.

Checkpointed Deep Stack via scan

Why this matters

Worked mini-example

Common pitfalls

Problem

Hints