medium end_to_end

Gradient Accumulation

Implement one step of gradient-accumulated SGD for a simple linear regressor.

What is gradient accumulation?

Modern deep learning often requires large effective batch sizes for stable training, but GPU memory limits how many samples you can fit in a single forward pass. Gradient accumulation is the standard workaround: instead of one large batch, you run several smaller micro-batches, sum their gradients, and apply a single weight update at the end.

For a mean-reduced loss with equal-sized micro-batches, the averaged gradient is mathematically equivalent to the gradient of a single batch of size micro_batch_size × accum_steps, without the memory cost.
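The equivalence can be checked numerically. The sketch below assumes NumPy, a linear model with mean-MSE loss, and arbitrary illustrative sizes; the helper name mse_grad is hypothetical and not part of the problem statement.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of mean squared error for the linear model x @ w (hypothetical helper)."""
    return (2.0 / x.shape[0]) * x.T @ (x @ w - y)

rng = np.random.default_rng(0)
accum_steps, B, d = 4, 8, 3          # illustrative sizes, chosen arbitrarily
w = rng.normal(size=d)
xs = rng.normal(size=(accum_steps, B, d))
ys = rng.normal(size=(accum_steps, B))

# Average of the per-micro-batch gradients.
grad_accum = sum(mse_grad(w, xs[k], ys[k]) for k in range(accum_steps)) / accum_steps

# Gradient of one large batch built by concatenating the micro-batches.
grad_full = mse_grad(w, xs.reshape(-1, d), ys.reshape(-1))

print(np.allclose(grad_accum, grad_full))  # True
```

The two gradients agree because each micro-batch contributes the same number of samples, so the mean of the means equals the overall mean.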

Why does dividing matter?

After accumulating accum_steps gradients you must divide by accum_steps before applying the update. Omitting this step multiplies the effective learning rate by accum_steps, a common bug that can make training diverge.
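A small numerical sketch of the effect, assuming NumPy and the same mean-MSE linear model; the sizes, the learning rate, and the helper name mse_grad are illustrative assumptions.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of mean squared error for the linear model x @ w (hypothetical helper)."""
    return (2.0 / x.shape[0]) * x.T @ (x @ w - y)

rng = np.random.default_rng(1)
accum_steps, B, d, lr = 4, 8, 3, 0.1   # illustrative values
w = rng.normal(size=d)
xs = rng.normal(size=(accum_steps, B, d))
ys = rng.normal(size=(accum_steps, B))

# Sum the gradients over all micro-batches.
grad_total = np.zeros_like(w)
for k in range(accum_steps):
    grad_total += mse_grad(w, xs[k], ys[k])

step_correct = lr * grad_total / accum_steps   # averaged gradient
step_buggy = lr * grad_total                   # division forgotten

# The buggy step is accum_steps times the correct one, which is the same
# as running the correct update with a learning rate of lr * accum_steps.
ratio = np.linalg.norm(step_buggy) / np.linalg.norm(step_correct)
print(ratio)
```

With accum_steps = 4 the printed ratio is 4 up to floating-point rounding, so doubling the accumulation depth without dividing silently doubles the step size.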

Algorithm (mean-MSE regression)

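import numpy as np

def grad_accum_step(weights, xs, ys, lr, accum_steps):   # function name is illustrative
    grad_total = np.zeros_like(weights, dtype=float)
    for k in range(accum_steps):
        x, y = xs[k], ys[k]          # x: (B, d),  y: (B,)
        B = x.shape[0]               # micro-batch size
        y_hat = x @ weights          # predictions, shape (B,)
        grad = (2 / B) * x.T @ (y_hat - y)   # gradient of mean MSE
        grad_total += grad           # sum gradients across micro-batches
    grad_avg = grad_total / accum_steps      # average before the update
    return weights - lr * grad_avg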

Inputs:

  • weights: current parameters, shape (d,).
  • xs: stacked micro-batch inputs, shape (accum_steps, B, d).
  • ys: stacked micro-batch targets, shape (accum_steps, B).
  • lr: learning rate (float).
  • accum_steps: number of micro-batches to accumulate (int).

Output: Updated weight vector of shape (d,).
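One way to cross-check a loop implementation against the shapes listed above is a vectorized version that computes all micro-batch gradients at once. This is only a sketch, assuming NumPy arrays as inputs; the function name accumulated_step_vectorized is hypothetical. It holds every micro-batch in memory together, so it is useful as a test reference and not as a memory-saving technique.

```python
import numpy as np

def accumulated_step_vectorized(weights, xs, ys, lr, accum_steps):
    """Hypothetical reference: same update as the accumulation loop, in one shot."""
    weights = np.asarray(weights, dtype=float)
    xs = np.asarray(xs, dtype=float)[:accum_steps]   # (accum_steps, B, d)
    ys = np.asarray(ys, dtype=float)[:accum_steps]   # (accum_steps, B)
    B = xs.shape[1]
    residuals = xs @ weights - ys                    # (accum_steps, B)
    # Sum of (2 / B) * x_k.T @ residual_k over all micro-batches k.
    grad_total = (2.0 / B) * np.einsum("kbd,kb->d", xs, residuals)
    return weights - lr * grad_total / accum_steps

# Tiny worked example: accum_steps = 2, B = 2, d = 2.
weights = np.array([0.0, 0.0])
xs = np.array([[[1.0, 0.0], [0.0, 1.0]],
               [[1.0, 1.0], [2.0, 0.0]]])
ys = np.array([[1.0, 2.0], [3.0, 4.0]])
new_weights = accumulated_step_vectorized(weights, xs, ys, lr=0.1, accum_steps=2)
print(new_weights)  # approximately [0.6, 0.25]
```

In the worked example the two micro-batch gradients are [-1, -2] and [-11, -3]; their average is [-6, -2.5], and a step of size 0.1 from zero weights lands at roughly [0.6, 0.25].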

training gradient-accumulation
