Implement a single step of stochastic gradient descent (SGD) with momentum, using the update rule:
$$v_{t+1} = \mu \cdot v_t + \nabla w_t$$

$$w_{t+1} = w_t - \eta \cdot v_{t+1}$$
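where $\mu$ is the momentum coefficient, $\eta$ is the learning rate, and $\nabla w_t$ is the gradient of the loss with respect to the weights $w_t$. Note that the weight update uses the new velocity $v_{t+1}$, not $v_t$.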
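As a quick sanity check, here is one step worked by hand for a single scalar weight. The numbers are chosen purely for illustration and are not part of the problem: $w_t = 1.0$, $v_t = 0.5$, $\nabla w_t = 0.2$, $\mu = 0.9$, $\eta = 0.1$.

$$v_{t+1} = 0.9 \cdot 0.5 + 0.2 = 0.65$$

$$w_{t+1} = 1.0 - 0.1 \cdot 0.65 = 0.935$$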
Input:
- weights: current parameters
- gradients: current gradients
- velocity: current velocity (momentum buffer)
- lr: learning rate
- momentum: momentum coefficient
Output: A map/tuple with new_weights and new_velocity
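
One possible solution is sketched below in Python. It is a minimal sketch rather than a reference implementation: it assumes the parameters are flat lists of floats of equal length, and the function name `sgd_momentum_step` and the choice of a dict as the return type are illustrative assumptions. Only the parameter names and the `new_weights` / `new_velocity` keys come from the problem statement.

```python
from typing import Dict, List, Sequence


def sgd_momentum_step(
    weights: Sequence[float],
    gradients: Sequence[float],
    velocity: Sequence[float],
    lr: float,
    momentum: float,
) -> Dict[str, List[float]]:
    """Apply one SGD-with-momentum update to flat lists of floats.

    Assumption: all three sequences have the same length; nested or
    multi-dimensional parameters are not handled in this sketch.
    """
    if not (len(weights) == len(gradients) == len(velocity)):
        raise ValueError("weights, gradients and velocity must have equal length")

    # v_{t+1} = mu * v_t + grad
    new_velocity = [momentum * v + g for v, g in zip(velocity, gradients)]

    # w_{t+1} = w_t - eta * v_{t+1}  (uses the NEW velocity)
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]

    return {"new_weights": new_weights, "new_velocity": new_velocity}


if __name__ == "__main__":
    # Illustrative values only.
    result = sgd_momentum_step(
        weights=[1.0, 2.0],
        gradients=[0.2, -0.4],
        velocity=[0.5, 0.0],
        lr=0.1,
        momentum=0.9,
    )
    print(result["new_weights"])
    print(result["new_velocity"])
```

The function returns fresh lists instead of mutating its inputs, which keeps the step easy to test. With momentum set to 0 the update reduces to plain SGD, since the new velocity then equals the gradient.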