Train with Adam End-to-End

Train a linear regressor end-to-end using Adam — implemented from scratch. No optim.Adam; you own every line of the update rule.

The model

Given feature matrix x of shape (N, d) and targets y of shape (N,), the MSE loss is L(w) = (1/N) * ||Xw - y||^2, where X in the formulas denotes the matrix x. Its gradient at weights w is:

$$\nabla_w \mathcal{L} = \frac{2}{N} X^\top (Xw - y)$$
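The gradient formula can be sanity-checked against central finite differences. The sketch below assumes NumPy; the helper names `mse_loss`, `mse_grad`, and `numeric_grad` are illustrative and not part of the task's interface.

```python
import numpy as np

def mse_loss(x, y, w):
    """Mean squared error: (1/N) * sum((x @ w - y)**2)."""
    r = x @ w - y
    return float(r @ r) / x.shape[0]

def mse_grad(x, y, w):
    """Analytic gradient: (2/N) * x.T @ (x @ w - y)."""
    return (2.0 / x.shape[0]) * (x.T @ (x @ w - y))

def numeric_grad(x, y, w, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        g[i] = (mse_loss(x, y, w + step) - mse_loss(x, y, w - step)) / (2 * h)
    return g

# Small random problem (values are arbitrary, seeded for repeatability).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

analytic = mse_grad(x, y, w)
numeric = numeric_grad(x, y, w)
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print(max_abs_diff)  # should be tiny; the loss is quadratic in w
```

Because the loss is quadratic in w, central differences carry no truncation error, so any remaining difference comes from floating-point rounding.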

Adam update rule (1-indexed step t)

Here g_t is the gradient evaluated at w_{t-1}; the square, square root, and division are elementwise.

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_t = w_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Bias correction (dividing by 1 - beta^t) counteracts the bias toward zero that m and v carry in early steps because they are initialized to zero. The divisor 1 - beta^t approaches 1 as t grows, so the correction fades out.
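To see the effect in numbers, the following sketch tracks the first moment for a scalar gradient that is held constant. The constant gradient value 2.0 and the choice of five steps are assumptions made purely for illustration.

```python
beta1 = 0.9
g = 2.0   # assumed constant gradient, for illustration only
m = 0.0   # zero-initialized first moment

history = []  # (t, raw m, bias-corrected m_hat)
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    correction = 1 - beta1 ** t
    history.append((t, m, m / correction))

for t, raw, corrected in history:
    print(t, raw, corrected)
```

With a constant gradient, m_t equals g * (1 - beta1^t), so the raw value starts near 0.2 and creeps toward 2.0, while the corrected value recovers g at every step up to rounding.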

Algorithm

m, v = m0, v0
w    = w0
for t in 1 .. n_steps:
    grad  = (2/N) * x.T @ (x @ w - y)
    m     = beta1 * m + (1 - beta1) * grad
    v     = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w     = w - lr * m_hat / (sqrt(v_hat) + eps)
return concat(w, m, v)
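The pseudocode above could be turned into runnable code along these lines. This is a minimal sketch that assumes NumPy arrays rather than framework tensors; the function name `train_adam` is hypothetical, and the argument order simply follows the Inputs list.

```python
import numpy as np

def train_adam(x, y, w0, m0, v0, lr, beta1, beta2, eps, n_steps):
    """Run n_steps of hand-written Adam on MSE linear regression."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # np.array copies, so the caller's w0, m0, v0 are never mutated.
    w = np.array(w0, dtype=np.float64)
    m = np.array(m0, dtype=np.float64)
    v = np.array(v0, dtype=np.float64)
    n = x.shape[0]

    for t in range(1, n_steps + 1):  # t is 1-indexed for bias correction
        grad = (2.0 / n) * (x.T @ (x @ w - y))
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad ** 2
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)

    return np.concatenate([w, m, v])

# Tiny example: one feature, two samples, a single step from zero state.
x = np.array([[1.0], [2.0]])
y = np.array([2.0, 4.0])
out = train_adam(x, y, w0=np.zeros(1), m0=np.zeros(1), v0=np.zeros(1),
                 lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1)
print(out)  # [w, m, v] after one step
```

In this example the initial gradient is -10, so after one step m is about -1 and v about 0.1. The bias-corrected ratio m_hat / sqrt(v_hat) is close to -1, which moves w by roughly lr, to about 0.1.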

Inputs

x: shape (N, d) — feature matrix.
y: shape (N,) — regression targets.
w0: shape (d,) — initial weights.
m0, v0: shape (d,) — initial Adam state (typically zeros).
lr, beta1, beta2, eps: floats — Adam hyperparameters.
n_steps: int — number of update steps.

Output

Returns a tensor of shape (3*d,): final_w, final_m, and final_v concatenated in that order. This makes the full optimizer state checkable in a single tensor.
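As a small illustration of that layout, the sketch below slices a packed result back into its three parts. It assumes NumPy and uses made-up values with d = 2.

```python
import numpy as np

d = 2
# Made-up packed output: [w_0, w_1, m_0, m_1, v_0, v_1]
out = np.array([0.5, -1.0, 0.1, 0.2, 0.01, 0.04])

final_w = out[:d]
final_m = out[d:2 * d]
final_v = out[2 * d:]

# np.split with 3 equal sections yields the same three pieces.
parts = np.split(out, 3)
```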

Edge cases

n_steps=0: loop never runs; output is concat(w0, m0, v0).
lr=0: w never changes, but m and v still update each step (the Adam state is tracked even when no weight update occurs).
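Both edge cases can be checked directly. The sketch below uses a condensed NumPy version of the loop from the Algorithm section; the name `train_adam` and the toy data are assumptions for illustration.

```python
import numpy as np

def train_adam(x, y, w0, m0, v0, lr, beta1, beta2, eps, n_steps):
    # Condensed restatement of the Algorithm section's loop.
    w, m, v = (np.array(a, dtype=np.float64) for a in (w0, m0, v0))
    for t in range(1, n_steps + 1):
        grad = (2.0 / x.shape[0]) * (x.T @ (x @ w - y))
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad ** 2
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return np.concatenate([w, m, v])

x = np.array([[1.0], [2.0]])
y = np.array([2.0, 4.0])
w0, m0, v0 = np.array([0.5]), np.zeros(1), np.zeros(1)
hp = dict(beta1=0.9, beta2=0.999, eps=1e-8)

# n_steps=0: the loop body never executes, so the inputs come back packed.
no_steps = train_adam(x, y, w0, m0, v0, lr=0.1, n_steps=0, **hp)

# lr=0: w is frozen at w0, yet m and v keep accumulating gradient statistics.
frozen = train_adam(x, y, w0, m0, v0, lr=0.0, n_steps=3, **hp)
```

With lr=0 the gradient stays at -7.5 on every step because w never moves, so after three steps m equals -7.5 * (1 - 0.9^3), about -2.0325, and v is strictly positive.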

Reference

Kingma & Ba, “Adam: A Method for Stochastic Optimization” (2014).