REINFORCE with Baseline

Why this matters

Vanilla REINFORCE has notoriously high variance: different random samples can produce wildly different gradient estimates. Control variates (baselines) are the standard fix. Subtracting any constant b from the rewards leaves the expected gradient unchanged, because E[∇ log p(a)] = 0 for any normalised distribution, but it can drastically reduce variance by centring the rewards near zero. In practice, actor-critic algorithms use a state-value estimate, b = V(s). The same identity underpins the advantage-weighted updates in methods such as A3C and PPO.
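
The zero-mean identity can be written out in one line. This is a short derivation sketch for a discrete action space; the symbol θ (the policy parameters, here the logits) is notation introduced for illustration and does not appear in the text above.

```latex
\mathbb{E}_{a \sim p_\theta}\!\left[\, b \, \nabla_\theta \log p_\theta(a) \right]
  = b \sum_a p_\theta(a) \, \frac{\nabla_\theta p_\theta(a)}{p_\theta(a)}
  = b \, \nabla_\theta \sum_a p_\theta(a)
  = b \, \nabla_\theta 1
  = 0
```

The step that pulls b out of the sum is the one that requires b to be constant with respect to the action a.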

Worked mini-example

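K = 2, logits = [0, 0], so p = [0.5, 0.5]; rewards = [10, 0], baseline b = 5. The score ∇ log p(a) with respect to the logits is onehot(a) − p: [0.5, −0.5] for a = 0 and [−0.5, 0.5] for a = 1. Adjusted rewards: [10−5, 0−5] = [5, −5]. E[grad] = 0.5·5·[0.5, −0.5] + 0.5·(−5)·[−0.5, 0.5] = [2.5, −2.5]. Without a baseline: 0.5·10·[0.5, −0.5] + 0.5·0·[−0.5, 0.5] = [2.5, −2.5], the same expected gradient ✓ (unbiasedness). The variance differs sharply: with b = 5 both actions give the single-sample gradient [2.5, −2.5], so the variance is exactly zero in this example, whereas without a baseline the single-sample gradient is either [5, −5] or [0, 0], a per-component variance of 6.25.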
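
The arithmetic in this example can be checked by enumerating both actions exactly instead of sampling. The sketch below is only an illustration: the helper name exact_mean_and_var is hypothetical, and it assumes the gradient is taken with respect to the logits of a softmax policy, where ∇ log p(a) = onehot(a) − p.

```python
import jax
import jax.numpy as jnp

logits = jnp.array([0.0, 0.0])
reward_table = jnp.array([10.0, 0.0])
K = logits.shape[0]

probs = jax.nn.softmax(logits)  # uniform policy: [0.5, 0.5]
# Row a holds the score grad_logits log p(a) = onehot(a) - probs.
score = jnp.eye(K) - probs


def exact_mean_and_var(baseline):
    """Exact mean and per-component variance of the single-sample gradient."""
    # Row a is the gradient contribution when action a is sampled.
    per_action = (reward_table - baseline)[:, None] * score
    mean = probs @ per_action
    var = probs @ (per_action - mean) ** 2
    return mean, var


mean_0, var_0 = exact_mean_and_var(0.0)  # no baseline
mean_5, var_5 = exact_mean_and_var(5.0)  # baseline b = 5
print("b=0:", mean_0, var_0)
print("b=5:", mean_5, var_5)
```

Both calls return the mean [2.5, −2.5]. The per-component variance is 6.25 without a baseline and 0 with b = 5, because both actions then produce the same single-sample gradient.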

Common pitfalls

Baseline must NOT depend on the action: if b = b(a), then E[b(a) ∇ log p(a)] ≠ 0 in general, so the zero-mean identity no longer applies and the estimator becomes biased. A state-dependent baseline b = V(s) is fine because the state is fixed before the action is sampled.
Only one line changes from vanilla REINFORCE: subtract baseline before weighting — rewards = reward_table[actions] - baseline.
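Choice of constant baseline: the mean reward E[R] is a common and usually near-optimal choice, which is why the sample mean is widely used. The constant that exactly minimises the total (summed per-component) variance weights rewards by the squared score norm, b* = E[R‖∇ log p‖²] / E[‖∇ log p‖²]; it reduces to E[R] when ‖∇ log p(a)‖ is the same for every action, as in the uniform-policy mini-example.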
Compare to test 3 (baseline=0): setting b = 0 should reproduce vanilla REINFORCE exactly.
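
The first pitfall can be made concrete by computing the expected gradient exactly for a constant baseline and for an action-dependent one. This is a minimal sketch on the mini-example's numbers; the helper expected_gradient and the per-action baseline [8, 0] are hypothetical choices made purely for illustration.

```python
import jax
import jax.numpy as jnp

logits = jnp.array([0.0, 0.0])
reward_table = jnp.array([10.0, 0.0])
probs = jax.nn.softmax(logits)
# Row a holds grad_logits log p(a) = onehot(a) - probs.
score = jnp.eye(logits.shape[0]) - probs


def expected_gradient(baseline):
    """Exact E[(R(a) - b) * grad log p(a)] over both actions.

    `baseline` may be a scalar (valid) or a length-K vector with one
    value per action (the pitfall).
    """
    weights = reward_table - baseline
    return probs @ (weights[:, None] * score)


true_grad = expected_gradient(0.0)                        # vanilla REINFORCE
const_grad = expected_gradient(5.0)                       # constant baseline
action_dep_grad = expected_gradient(jnp.array([8.0, 0.0]))  # b depends on a

print(true_grad, const_grad, action_dep_grad)
```

The constant baseline reproduces the vanilla expectation [2.5, −2.5], while the action-dependent baseline shifts it to [0.5, −0.5]. The difference is exactly E[b(a) ∇ log p(a)] = [2, −2], which is the bias term.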

Problem

Implement reinforce_with_baseline(seed, logits, reward_table, baseline, n_samples) — identical to vanilla REINFORCE but with (reward − baseline) as the weight.

seed (float) → jax.random.PRNGKey(int(seed))
logits — 1-D float32 array of length K
reward_table — 1-D float32 array of length K
baseline — scalar float subtracted from every reward
n_samples (float, cast to int) — number of MC samples

Return the gradient estimate with respect to logits as a 1-D float32 array of shape (K,).
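
One possible shape for a solution is sketched below. It is not a reference implementation and rests on assumptions the statement does not spell out: the estimate is the mean over samples (not the sum), the gradient is taken with respect to the logits of a softmax policy, and the score is computed analytically as onehot(a) − softmax(logits) instead of through jax.grad.

```python
import jax
import jax.numpy as jnp


def reinforce_with_baseline(seed, logits, reward_table, baseline, n_samples):
    key = jax.random.PRNGKey(int(seed))
    n = int(n_samples)
    logits = jnp.asarray(logits, dtype=jnp.float32)
    reward_table = jnp.asarray(reward_table, dtype=jnp.float32)
    num_actions = logits.shape[0]

    # Sample n actions from the categorical policy defined by the logits.
    actions = jax.random.categorical(key, logits, shape=(n,))

    # The one line that differs from vanilla REINFORCE.
    rewards = reward_table[actions] - baseline  # shape (n,)

    # Analytic score of a softmax policy: onehot(a) - probs, shape (n, K).
    probs = jax.nn.softmax(logits)
    score = jax.nn.one_hot(actions, num_actions) - probs

    # Assumption: average (not sum) over the Monte Carlo samples.
    grad = jnp.mean(rewards[:, None] * score, axis=0)
    return grad.astype(jnp.float32)


g = reinforce_with_baseline(0.0, [0.0, 0.0], [10.0, 0.0], 5.0, 100.0)
print(g)
```

On the mini-example inputs with b = 5, every sample contributes [2.5, −2.5], so the returned estimate should equal [2.5, −2.5] for any seed. With baseline = 0 the function reduces to the vanilla estimator and its output fluctuates around the same value.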
