We can't find the internet
Attempting to reconnect
Something went wrong!
Attempting to reconnect
Gradient Accumulation
Implement one step of gradient-accumulated SGD for a simple linear regressor.
What is gradient accumulation?
Modern deep learning often requires large effective batch sizes for stable training โ but GPU memory limits how many samples you can fit in a single forward pass. Gradient accumulation is the standard workaround: instead of one large batch, you run several smaller micro-batches, sum their gradients, and apply a single weight update at the end.
The result is mathematically equivalent to training on a single batch of
size micro_batch_size ร accum_steps, without the memory cost.
Why does dividing matter?
After accumulating accum_steps gradients you must divide by
accum_steps before applying the update. Omitting this step makes the
effective learning rate scale with accumulation depth โ a common bug that
causes gradient explosion.
Algorithm (mean-MSE regression)
grad_total = zeros_like(weights)
for k in range(accum_steps):
x, y = xs[k], ys[k] # x: (B, d), y: (B,)
y_hat = x @ weights
grad = (2 / B) * x.T @ (y_hat - y) # gradient of mean MSE
grad_total += grad
grad_avg = grad_total / accum_steps
return weights - lr * grad_avg
Inputs:
-
weights: shape(d,)โ current parameters. -
xs: shape(accum_steps, B, d)โ stacked micro-batch inputs. -
ys: shape(accum_steps, B)โ stacked micro-batch targets. -
lr: learning rate (float). -
accum_steps: number of micro-batches to accumulate (int).
Output: Updated weight vector of shape (d,).
Hints
Sign in to attempt this problem and view the solution.