Train Step with Global-Norm Clipping

Why this matters

Exploding gradients are a common failure mode when training RNNs and transformers. optax.clip_by_global_norm rescales the entire gradient pytree so that its global L2 norm does not exceed max_norm; the optimizer then applies its update rule to the clipped gradients. Chaining these two transforms is a standard defensive pattern: BERT- and GPT-style training recipes clip the global gradient norm, and the same chain appears in many JAX training loops.
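To make the rescaling rule concrete, here is a minimal hand-written sketch of global-norm clipping in JAX. It is meant to mirror the description above, not optax's internal implementation; the function name clip_by_global_norm_manual, the 1e-12 guard against division by zero, and the example pytree are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def clip_by_global_norm_manual(grads, max_norm):
    """Rescale a gradient pytree so its global L2 norm is at most max_norm."""
    leaves = jax.tree_util.tree_leaves(grads)
    # Global norm: square root of the sum of squares over every leaf.
    global_norm = jnp.sqrt(sum(jnp.sum(jnp.square(g)) for g in leaves))
    # The scale is 1 when the norm is already within bounds, so the
    # gradients pass through unchanged. The 1e-12 floor is an assumption
    # added here to avoid dividing by zero.
    scale = jnp.minimum(1.0, max_norm / jnp.maximum(global_norm, 1e-12))
    return jax.tree_util.tree_map(lambda g: g * scale, grads)

# Hypothetical two-leaf pytree with global norm sqrt(3^2 + 4^2) = 5.
grads = {"w": jnp.array([3.0, 0.0]), "b": jnp.array([0.0, 4.0])}
clipped = clip_by_global_norm_manual(grads, max_norm=1.0)
# Every leaf is multiplied by the same factor 1/5.
```

Because one shared scale factor is applied to every leaf, the direction of the full gradient is preserved; only its length changes.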

The recipe

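import optax

def train_step(params, grads, lr, max_norm):
    # Clip first, then let SGD consume the clipped gradients.
    optimizer = optax.chain(
        optax.clip_by_global_norm(max_norm),
        optax.sgd(lr),
    )
    opt_state = optimizer.init(params)
    # The new optimizer state is discarded because this is a single step.
    updates, _ = optimizer.update(grads, opt_state, params)
    return optax.apply_updates(params, updates)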

Common pitfalls

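Clip must come first in the chain. Placing the clip after optax.sgd bounds the norm of the lr-scaled update instead of the gradient, which is a different operation and usually leaves an exploding gradient unclipped.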
clip_by_global_norm operates on the global L2-norm of all gradient tensors together, not per-tensor.
max_norm=1.0 is a common choice for BERT/GPT-class training; treat it as a starting point, not a fixed rule.
When global_norm <= max_norm, the gradients are passed through unchanged.

Inputs

params: 1-D JAX array — model parameters.
grads: 1-D JAX array — gradients.
lr: scalar — SGD learning rate.
max_norm: scalar — maximum allowed gradient global norm.

Output

1-D array — params after one clipped-SGD step.
