Implement a single step of the Adam optimizer.
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_t = w_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
Input:
- weights: current parameters
- gradients: current gradients
- m: first moment estimate (running mean of gradients)
- v: second moment estimate (running mean of squared gradients)
- t: current timestep (integer, starting from 1)
- lr: learning rate (default 0.001)
- beta1: first moment decay (default 0.9)
- beta2: second moment decay (default 0.999)
- eps: small constant (default 1e-8)
Output: A map (dictionary) with the keys new_weights, new_m, and new_v, holding the updated parameters and the updated first and second moment estimates.
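A minimal NumPy sketch of one possible solution is shown below. It follows the update equations above directly; the function name `adam_step` and the returned key names are taken from this problem's input/output description, and the elementwise NumPy operations are an assumption about how the parameters are represented (arrays of the same shape).

```python
import numpy as np

def adam_step(weights, gradients, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Perform a single Adam update and return the new state."""
    # Update the biased first and second moment estimates.
    new_m = beta1 * m + (1 - beta1) * gradients
    new_v = beta2 * v + (1 - beta2) * gradients ** 2

    # Bias-correct the moments (t starts at 1, so the denominators are nonzero).
    m_hat = new_m / (1 - beta1 ** t)
    v_hat = new_v / (1 - beta2 ** t)

    # Apply the parameter update.
    new_weights = weights - lr * m_hat / (np.sqrt(v_hat) + eps)

    return {"new_weights": new_weights, "new_m": new_m, "new_v": new_v}
```

Example usage with small illustrative values (the numbers here are arbitrary):

```python
w = np.array([1.0, 2.0])
g = np.array([0.1, -0.2])
state = adam_step(w, g, m=np.zeros(2), v=np.zeros(2), t=1)
print(state["new_weights"])  # each weight moves by roughly lr in the direction opposing its gradient
```

Note that on the first step (t = 1) the bias correction exactly cancels the `(1 - beta)` factors, so the update magnitude is close to `lr * sign(g)` regardless of the gradient scale.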