Gradient Accumulation

Implement one step of gradient-accumulated SGD for a simple linear regressor.

What is gradient accumulation?

Modern deep learning often requires large effective batch sizes for stable training — but GPU memory limits how many samples you can fit in a single forward pass. Gradient accumulation is the standard workaround: instead of one large batch, you run several smaller micro-batches, sum their gradients, and apply a single weight update at the end.

The result is mathematically equivalent to training on a single batch of size micro_batch_size × accum_steps, without the memory cost.

Why does dividing matter?

After accumulating accum_steps gradients you must divide by accum_steps before applying the update. Omitting this step makes the effective learning rate scale with accumulation depth — a common bug that causes gradient explosion.

Algorithm (mean-MSE regression)

grad_total = zeros_like(weights)
for k in range(accum_steps):
    x, y = xs[k], ys[k]          # x: (B, d),  y: (B,)
    y_hat = x @ weights
    grad  = (2 / B) * x.T @ (y_hat - y)   # gradient of mean MSE
    grad_total += grad
grad_avg = grad_total / accum_steps
return weights - lr * grad_avg

Inputs:

weights: shape (d,) — current parameters.
xs: shape (accum_steps, B, d) — stacked micro-batch inputs.
ys: shape (accum_steps, B) — stacked micro-batch targets.
lr: learning rate (float).
accum_steps: number of micro-batches to accumulate (int).

Output: Updated weight vector of shape (d,).

Gradient Accumulation

Hints