23-08 — Policy Gradient Methods

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-08 Prerequisites: 23-01 — MDPs, 23-04 — Monte Carlo Methods, 23-06 — Q-Learning Next subject: 23-09 — Proximal Policy Optimization (PPO)

Learning Objectives

By the end of this subject, you will be able to:

Derive the Policy Gradient Theorem: $\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]$
Implement the REINFORCE algorithm using Monte Carlo returns
Add a learned baseline (value function) to reduce gradient variance
Construct the Actor-Critic architecture and derive the TD-based critic update
Connect policy gradients to the score function estimator and the REINFORCE trick

Core Content

Why Policy Gradients?

Value-based methods (Q-learning, DQN) learn $Q(s,a)$ and derive a policy via $\arg\max$. This has limitations:
- $\arg\max$ is discontinuous: small changes in $Q$ can cause large policy changes
- Cannot represent stochastic policies (essential for games, partially observable environments)
- The greedy policy may be hard to compute in continuous action spaces

Policy gradient methods directly parameterize and optimize the policy $\pi_\theta(a|s)$ using gradient ascent on the expected return.

The Objective Function

For episodic tasks, the objective is the expected return from the start state:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G_0] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t R_{t+1}\right]$$

where $\tau = (s_0, a_0, r_1, s_1, a_1, \dots)$ is a trajectory sampled from $\pi_\theta$. For continuing tasks, we use the average reward formulation.
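
As a minimal sketch of the quantity inside the expectation, the discounted return of a sampled trajectory can be computed from its reward list, and $J(\theta)$ can be estimated by averaging over episodes. The function names and the sample rewards below are illustrative assumptions, not part of the course material.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum_t gamma^t * R_{t+1} for one episode's reward list."""
    g = 0.0
    # Accumulate backwards so each step is G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of J(theta): average G_0 over sampled episodes."""
    return sum(discounted_return(ep, gamma) for ep in episodes) / len(episodes)


# Hypothetical reward sequences from two sampled episodes.
episodes = [[1.0, 0.0, 2.0], [0.0, 1.0]]
j_hat = estimate_objective(episodes, gamma=0.5)
print(j_hat)
```

With more sampled episodes the average approaches the expectation; the estimate is only as good as the number of trajectories behind it.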

The Policy Gradient Theorem

Theorem (Sutton et al., 1999):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]$$

Proof sketch:

The probability of a trajectory under $\pi_\theta$ is:

$$P(\tau | \theta) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t | s_t) P(s_{t+1} | s_t, a_t)$$

The log-probability:

$$\log P(\tau | \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \left[\log \pi_\theta(a_t | s_t) + \log P(s_{t+1} | s_t, a_t)\right]$$

⚠️ CRITICAL: The transition dynamics $P(s_{t+1}|s_t, a_t)$ do NOT depend on $\theta$. Taking the gradient:

$$\nabla_\theta \log P(\tau | \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t | s_t)$$

The expected return gradient uses the log-derivative trick (also called the REINFORCE trick or score function estimator):

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau) \nabla_\theta \log P(\tau | \theta)]$$

Expanding $R(\tau) = \sum_{t=0}^{T-1} \gamma^t R_{t+1}$ and applying causality (the action at time $t$ cannot affect rewards received before $t$, so those terms have zero expectation):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t G_t \cdot \nabla_\theta \log \pi_\theta(a_t | s_t)\right]$$

where $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$ is the Monte Carlo return from step $t$. Since $\mathbb{E}[G_t | s_t, a_t] = Q^\pi(s_t, a_t)$, we can replace $G_t$ with $Q^\pi$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]$$

where the expectation is over the discounted state-action visitation distribution $d^\pi(s,a)$, which absorbs the sum over time steps and the $\gamma^t$ weights (up to a normalization constant).
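
The log-derivative trick used in the proof sketch can be checked numerically on a toy problem. The sketch below assumes a one-parameter Bernoulli distribution with $p = \text{sigmoid}(\theta)$ and an arbitrary payoff `f`; both are illustrative choices, small enough that every expectation can be computed exactly and compared with a finite-difference derivative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_value(theta, f):
    """J(theta) = E_{x ~ Bernoulli(sigmoid(theta))}[f(x)], computed exactly."""
    p = sigmoid(theta)
    return p * f(1) + (1.0 - p) * f(0)


def score_function_gradient(theta, f):
    """E[f(x) * d/dtheta log p(x | theta)], summed exactly over both outcomes."""
    p = sigmoid(theta)
    score_1 = 1.0 - p  # d/dtheta log p
    score_0 = -p       # d/dtheta log (1 - p)
    return p * f(1) * score_1 + (1.0 - p) * f(0) * score_0


def payoff(x):
    # Hypothetical payoff: 3 for outcome 1, 1 for outcome 0.
    return 3.0 if x == 1 else 1.0


theta, eps = 0.3, 1e-6
finite_diff = (expected_value(theta + eps, payoff)
               - expected_value(theta - eps, payoff)) / (2 * eps)
score_grad = score_function_gradient(theta, payoff)
print(finite_diff, score_grad)
```

The two numbers agree to within finite-difference error, which is the identity $\nabla_\theta \mathbb{E}[f] = \mathbb{E}[f \, \nabla_\theta \log p]$ in its simplest form.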

The REINFORCE Algorithm

REINFORCE (Williams, 1992) is the simplest Monte Carlo policy gradient algorithm:

Algorithm:
1. Initialize policy parameters $\theta$ randomly
2. For each episode:
   - Generate trajectory $(s_0, a_0, r_1, \dots, s_T)$ using $\pi_\theta$
   - For each step $t = 0, \dots, T-1$:
     - Compute return $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$
     - Update parameters: $\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla_\theta \log \pi_\theta(a_t | s_t)$

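Intuition: The update moves $\theta$ along $\nabla_\theta \log \pi_\theta(a_t | s_t)$, the direction that raises the probability of the action taken, scaled by the return $G_t$. A large positive return raises that probability strongly, a small one raises it slightly, and a negative return lowers it.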
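
The per-episode update can be sketched for a tabular softmax policy as follows. The parameter layout `theta[s][a]` (one logit per state-action pair), the function names, and the two-step episode are assumptions made for illustration; a practical implementation would usually rely on automatic differentiation instead of the hand-written score.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, trajectory, alpha, gamma):
    """Apply one episode of REINFORCE updates to a tabular softmax policy.

    theta[s][a] is the logit of action a in state s (assumed layout).
    trajectory is a list of (state, action, reward) with reward = R_{t+1}.
    Returns the list of Monte Carlo returns G_t.
    """
    # Compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    returns, g = [], 0.0
    for _, _, r in reversed(trajectory):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    for t, ((s, a, _), g_t) in enumerate(zip(trajectory, returns)):
        probs = softmax(theta[s])
        for b in range(len(theta[s])):
            # Score of a tabular softmax: indicator(b == a) - pi(b | s).
            score = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha * (gamma ** t) * g_t * score
    return returns


# Hypothetical two-state, two-action episode: (state, action, reward).
theta = [[0.0, 0.0], [0.0, 0.0]]
episode = [(0, 1, 0.0), (1, 0, 2.0)]
returns = reinforce_update(theta, episode, alpha=0.1, gamma=0.5)
print(returns, theta)
```

Both actions taken in this episode gain probability because both returns are positive, which illustrates why raw returns give a noisy learning signal.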

Variance Reduction with Baseline

REINFORCE has high variance because $G_t$ can vary wildly. The solution: subtract a baseline $b(s)$ that doesn't depend on the action:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot (Q^\pi(s,a) - b(s))\right]$$

Why baselines don't bias the gradient:

$$\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = b(s) \cdot \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = b(s) \cdot \nabla_\theta 1 = 0$$

The baseline term integrates to zero because $\sum_a \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$.
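
This zero-mean property can be confirmed by summing over the actions exactly. The sketch below assumes a softmax policy over three hypothetical logits and a constant baseline of 5; every component of the expected baseline term comes out as zero up to floating-point error.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_baseline_term(logits, baseline):
    """Exact E_{a ~ pi}[b * grad_logits log pi(a)] for a softmax over logits."""
    probs = softmax(logits)
    n = len(logits)
    total = [0.0] * n
    for a in range(n):
        for j in range(n):
            # Component j of the score for action a: indicator(j == a) - pi(j).
            score = (1.0 if j == a else 0.0) - probs[j]
            total[j] += probs[a] * baseline * score
    return total


term = expected_baseline_term([0.2, -1.0, 0.7], baseline=5.0)
print(term)
```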

Choice of baseline: A natural choice (close to, though not exactly, the variance-minimizing one) is the state value function $V^\pi(s)$, giving the advantage function:

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

The advantage represents how much better action $a$ is compared to the average action in state $s$. Using the advantage reduces variance because it centers the returns:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot A^\pi(s,a)\right]$$

Actor-Critic Architecture

The Actor-Critic combines policy gradient (actor) with value function approximation (critic):

Actor ($\pi_\theta$): parameterized policy that selects actions
Critic ($V_\phi$ or $Q_\phi$): parameterized value function that evaluates the actor's actions

The actor is updated using the policy gradient with the critic's advantage estimate:

$$\theta \leftarrow \theta + \alpha_\theta \cdot \hat{A}(s,a) \cdot \nabla_\theta \log \pi_\theta(a|s)$$

The critic is updated using TD learning (or MC):

$$\phi \leftarrow \phi + \alpha_\phi \cdot \delta \cdot \nabla_\phi V_\phi(s)$$

where $\delta = R + \gamma V_\phi(s') - V_\phi(s)$ is the TD error. When the critic is exact ($V_\phi = V^\pi$), the TD error is an unbiased estimate of the advantage: $\mathbb{E}[\delta | s,a] = A^\pi(s,a)$. With an approximate critic, the estimate is biased.
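
One online step of these two updates might look like the sketch below. It assumes a tabular softmax actor (one logit per action in the current state) and a linear critic $V_\phi(s) = \phi \cdot \mathbf{x}(s)$; the function name, the one-hot features, and the numbers are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta_s, phi, x_s, x_next, action, reward, gamma,
                      alpha_theta, alpha_phi, done=False):
    """One actor-critic update; returns (new_theta_s, new_phi, delta).

    theta_s: action logits for the current state (assumed tabular actor).
    phi: weights of an assumed linear critic, V(s) = phi . x(s).
    """
    v_s = sum(w * x for w, x in zip(phi, x_s))
    v_next = 0.0 if done else sum(w * x for w, x in zip(phi, x_next))
    delta = reward + gamma * v_next - v_s  # TD error

    # Actor: theta += alpha_theta * delta * grad log pi(a | s).
    probs = softmax(theta_s)
    new_theta = [th + alpha_theta * delta * ((1.0 if b == action else 0.0) - probs[b])
                 for b, th in enumerate(theta_s)]
    # Critic: phi += alpha_phi * delta * grad_phi V(s); for a linear critic this is x(s).
    new_phi = [w + alpha_phi * delta * x for w, x in zip(phi, x_s)]
    return new_theta, new_phi, delta


# Hypothetical transition between two one-hot encoded states.
new_theta, new_phi, delta = actor_critic_step(
    theta_s=[0.0, 0.0], phi=[1.0, 2.0], x_s=[1.0, 0.0], x_next=[0.0, 1.0],
    action=0, reward=1.0, gamma=0.5, alpha_theta=0.1, alpha_phi=0.5)
print(delta, new_theta, new_phi)
```

The critic step here is a semi-gradient update: the bootstrapped target $R + \gamma V_\phi(s')$ is treated as a constant when differentiating.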

N-step advantage estimate: For lower bias, use $n$-step returns:

$$\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V_\phi(s_{t+n}) - V_\phi(s_t)$$
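
A small sketch of the $n$-step estimate, assuming the episode's rewards and critic values are already stored in lists. The indexing convention (`rewards[k]` holds $R_{k+1}$, `values[k]` holds $V_\phi(s_k)$, terminal value 0) and the truncation at the end of the episode are assumptions of this example.

```python
def n_step_advantage(rewards, values, t, n, gamma):
    """n-step advantage estimate at time t.

    rewards[k] holds R_{k+1}; values[k] holds V(s_k), with values[T] = 0
    for a terminal state. The horizon is cut off at the episode end.
    """
    T = len(rewards)
    steps = min(n, T - t)
    # Discounted sum of the next `steps` rewards.
    ret = sum((gamma ** k) * rewards[t + k] for k in range(steps))
    # Bootstrap from the critic at s_{t+steps}, then subtract V(s_t).
    return ret + (gamma ** steps) * values[t + steps] - values[t]


# Hypothetical three-step episode and critic values.
rewards = [1.0, 0.0, 2.0]
values = [1.0, 0.5, 2.0, 0.0]
adv_1 = n_step_advantage(rewards, values, t=0, n=1, gamma=0.5)
adv_3 = n_step_advantage(rewards, values, t=0, n=3, gamma=0.5)
print(adv_1, adv_3)
```

With $n = 1$ this reduces to the TD error; with $n$ at least the remaining episode length it becomes the Monte Carlo return minus $V_\phi(s_t)$.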

⚠️ CRITICAL: Actor-Critic combines two learning processes that are coupled: the critic's accuracy affects the actor's gradient quality, and the actor's policy determines the data distribution the critic learns from. This can cause instability.

Policy Parameterizations

Discrete actions (softmax):

$$\pi_\theta(a|s) = \frac{\exp(f_\theta(s,a))}{\sum_{a'} \exp(f_\theta(s,a'))}$$

$$\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta f_\theta(s,a) - \sum_{a'} \pi_\theta(a'|s) \nabla_\theta f_\theta(s,a')$$
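
The softmax score formula can be checked against finite differences. The sketch assumes linear preferences $f_\theta(s,a) = \theta \cdot \mathbf{x}(s,a)$, so that $\nabla_\theta f_\theta(s,a) = \mathbf{x}(s,a)$; the feature vectors and parameter values are made up for illustration.

```python
import math


def action_probs(theta, features):
    """features[a] is x(s, a); logits are theta . x(s, a) (assumed linear form)."""
    logits = [sum(t * x for t, x in zip(theta, feat)) for feat in features]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_score(theta, features, a):
    """grad_theta log pi(a|s) = x(s,a) - sum_a' pi(a'|s) x(s,a')."""
    probs = action_probs(theta, features)
    dim = len(theta)
    expected = [sum(p * feat[j] for p, feat in zip(probs, features)) for j in range(dim)]
    return [features[a][j] - expected[j] for j in range(dim)]


def finite_difference_score(theta, features, a, eps=1e-6):
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        diff = (math.log(action_probs(up, features)[a])
                - math.log(action_probs(down, features)[a]))
        grad.append(diff / (2 * eps))
    return grad


theta = [0.5, -0.3]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical x(s, a) for 3 actions
analytic = softmax_score(theta, features, a=2)
numeric = finite_difference_score(theta, features, a=2)
print(analytic, numeric)
```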

Continuous actions (Gaussian):

$$\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta^2(s))$$

$$\log \pi_\theta(a|s) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(a - \mu)^2}{2\sigma^2}$$

$$\nabla_\theta \log \pi_\theta(a|s) = \frac{a - \mu_\theta(s)}{\sigma_\theta^2(s)} \nabla_\theta \mu_\theta(s) + \frac{(a - \mu_\theta(s))^2 - \sigma_\theta^2(s)}{\sigma_\theta^3(s)} \nabla_\theta \sigma_\theta(s)$$
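
The two coefficients in the Gaussian score can be verified the same way. The sketch below treats $\mu$ and $\sigma$ as plain scalars (in the full formula they are multiplied by $\nabla_\theta \mu_\theta(s)$ and $\nabla_\theta \sigma_\theta(s)$ via the chain rule); the sample values are arbitrary.

```python
import math


def gaussian_log_prob(a, mu, sigma):
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (a - mu) ** 2 / (2.0 * sigma ** 2)


def gaussian_score(a, mu, sigma):
    """Partial derivatives of log N(a; mu, sigma^2) with respect to mu and sigma."""
    d_mu = (a - mu) / sigma ** 2
    d_sigma = ((a - mu) ** 2 - sigma ** 2) / sigma ** 3
    return d_mu, d_sigma


a, mu, sigma, eps = 1.5, 0.5, 0.8, 1e-6
d_mu, d_sigma = gaussian_score(a, mu, sigma)

# Central finite differences of the log-density for comparison.
fd_mu = (gaussian_log_prob(a, mu + eps, sigma)
         - gaussian_log_prob(a, mu - eps, sigma)) / (2 * eps)
fd_sigma = (gaussian_log_prob(a, mu, sigma + eps)
            - gaussian_log_prob(a, mu, sigma - eps)) / (2 * eps)
print(d_mu, fd_mu, d_sigma, fd_sigma)
```

In practice $\sigma$ is often parameterized through its logarithm to keep it positive, which changes the second coefficient by a factor of $\sigma$.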

Key Terms

Actor
Actor-Critic
Critic
Policy Gradient Theorem

Worked Examples

Example 1: REINFORCE on a Simple Bandit

A 2-armed bandit has true values $Q^*(a_1) = 1.0$, $Q^*(a_2) = 2.0$. The policy is softmax with logits $\theta = [\theta_1, \theta_2] = [0, 0]$ initially. One episode: pick $a_1$, get reward 0.8. Compute the REINFORCE update with $\alpha = 0.1$.

Solution:

Initial probabilities: $\pi(a_1) = \pi(a_2) = \frac{e^0}{e^0 + e^0} = 0.5$

$\nabla_\theta \log \pi(a_1|\theta) = [1 - 0.5, -0.5] = [0.5, -0.5]$ (one-hot minus probabilities)

$G = 0.8$ (episode return = single reward)

Update: $\theta \leftarrow \theta + \alpha G \nabla_\theta \log \pi(a_1|\theta)$

$\theta_1 \leftarrow 0 + 0.1 \cdot 0.8 \cdot 0.5 = 0.04$

$\theta_2 \leftarrow 0 + 0.1 \cdot 0.8 \cdot (-0.5) = -0.04$

Answer:

$\theta = [0.04, -0.04]$. New probabilities: $\pi(a_1) \approx 0.52$, $\pi(a_2) \approx 0.48$. Even though $a_2$ is truly better, the update increases $\pi(a_1)$ because $a_1$ was taken and the return (0.8) was positive. Over many episodes, $a_2$ will be sampled and its higher returns will dominate.
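
The arithmetic of this example can be reproduced in a few lines. The variable names are illustrative; action index 0 stands for $a_1$.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


theta = [0.0, 0.0]
alpha, action, G = 0.1, 0, 0.8  # action 0 corresponds to a_1

probs = softmax(theta)
# Score for softmax logits: one-hot(action) minus probabilities.
score = [(1.0 if b == action else 0.0) - probs[b] for b in range(len(theta))]
theta = [th + alpha * G * sc for th, sc in zip(theta, score)]
new_probs = softmax(theta)
print(theta, new_probs)
```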

Example 2: Baseline Variance Reduction

Two actions at state $s$: action $a_1$ gives returns $G = [10, 10, 10]$; action $a_2$ gives $G = [-5, 35, 0]$. Both have true $Q(s,a) = 10$. Compute the variance of the REINFORCE gradient (scalar $\nabla_\theta \log \pi = \pm 1$) with and without baseline $b = V(s) = 10$.

Solution:

Without baseline: Each sampled gradient is $G \cdot \nabla_\theta \log \pi$. Take $\nabla_\theta \log \pi = +1$ for $a_1$ and $-1$ for $a_2$ (one reading of the $\pm 1$ in the problem statement), with both actions sampled equally often.

$a_1$ samples: $[+10, +10, +10]$

$a_2$ samples: $[+5, -35, 0]$

Pooled mean = 0, pooled variance = $(3 \cdot 100 + 25 + 1225 + 0)/6 = 1550/6 \approx 258.3$

With baseline $b = 10$: Each sampled gradient is $(G - 10) \cdot \nabla_\theta \log \pi$.

$a_1$ samples: $[0, 0, 0]$

$a_2$ samples: $[+15, -25, +10]$

Pooled mean = 0, pooled variance = $(0 + 225 + 625 + 100)/6 = 950/6 \approx 158.3$

Note that the spread of returns within each action is untouched by a constant shift: the returns of $a_2$ have variance $(15^2 + 25^2 + 10^2)/3 \approx 316.7$ with or without the baseline. What the baseline removes is the contribution of the mean return, $\mu_G^2 = 10^2 = 100$, which is exactly the drop from 258.3 to 158.3.

Answer:

The baseline lowers the gradient variance from about 258.3 to about 158.3. Because both actions have the same true $Q$, the advantage is zero for both, and the remaining variance is pure return noise from $a_2$, which no state-dependent baseline can remove. In general, $V(s) = \mathbb{E}_a[Q(s,a)]$ centers the advantage, and the larger the mean return is relative to the noise, the more subtracting $V(s)$ helps.
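
A short numeric check of this example, under the added assumptions that the scalar score is $+1$ for $a_1$ and $-1$ for $a_2$ and that all six samples are pooled with equal weight. It reports both the per-action spread of returns, which a constant shift cannot change, and the variance of the signed gradient samples $(G - b) \cdot \nabla_\theta \log \pi$.

```python
def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Population variance (divide by N), matching the hand calculation."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


returns = {"a1": [10.0, 10.0, 10.0], "a2": [-5.0, 35.0, 0.0]}
# Assumed scalar scores: +1 for a1 and -1 for a2.
score = {"a1": 1.0, "a2": -1.0}


def gradient_samples(baseline):
    """Pooled samples of (G - b) * score over both actions."""
    return [(g - baseline) * score[a] for a in returns for g in returns[a]]


# Spread of a2's returns, before and after subtracting b = 10.
var_returns_a2 = variance(returns["a2"])
var_returns_a2_centered = variance([g - 10.0 for g in returns["a2"]])

# Variance of the signed gradient samples, without and with the baseline.
var_grad_no_baseline = variance(gradient_samples(0.0))
var_grad_with_baseline = variance(gradient_samples(10.0))
print(var_returns_a2, var_returns_a2_centered)
print(var_grad_no_baseline, var_grad_with_baseline)
```

Under these assumptions the return spread stays at about 316.7 either way, while the variance of the gradient samples falls from about 258.3 to about 158.3.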

Example 3: Actor-Critic Update Step

At state $s$, the actor selects action $a$ with probability $\pi_\theta(a|s) = 0.3$. The critic estimates $V_\phi(s) = 5.0$. The agent receives $R = 2$, transitions to $s'$ where $V_\phi(s') = 4.5$, $\gamma = 0.99$. Actor learning rate $\alpha_\theta = 0.01$, critic $\alpha_\phi = 0.1$. Compute both updates.

Solution:

TD error: $\delta = 2 + 0.99(4.5) - 5.0 = 2 + 4.455 - 5.0 = 1.455$

Critic update: $\phi \leftarrow \phi + 0.1 \cdot 1.455 \cdot \nabla_\phi V_\phi(s)$ (The gradient direction depends on the specific parameterization — typically $\nabla_\phi V_\phi(s)$ for linear features is just the feature vector $\mathbf{x}(s)$)

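Actor update: $\theta \leftarrow \theta + 0.01 \cdot 1.455 \cdot \nabla_\theta \log \pi_\theta(a|s)$, where the score $\nabla_\theta \log \pi_\theta(a|s)$ is evaluated at the current parameters (where $\pi_\theta(a|s) = 0.3$) and depends on the parameterization. For a softmax over per-action logits, for example, the component for the chosen action's logit is $1 - 0.3 = 0.7$.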

Since $\delta > 0$, the action was better than expected — increase its probability.

Answer:

The positive TD error (1.455) means the transition yielded a higher return than the critic predicted. The actor increases $\pi(a|s)$ and the critic increases $V(s)$ toward $R + \gamma V(s') = 6.455$.

Practice Problems

Problem 1: Prove that $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$ for any state $s$.

Answer:

$$\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}[\nabla_\theta \log \pi_\theta(a|s)] = \sum_a \pi_\theta(a|s) \cdot \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} = \sum_a \nabla_\theta \pi_\theta(a|s) = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$$ This identity is crucial: it proves that subtracting ANY state-dependent baseline $b(s)$ from the return does not bias the policy gradient.

Problem 2: Derive the softmax policy gradient for discrete actions. If $\pi_\theta(a|s) = \frac{e^{f_\theta(s,a)}}{\sum_{a'} e^{f_\theta(s,a')}}$, show that $\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta f_\theta(s,a) - \mathbb{E}_{a' \sim \pi_\theta}[\nabla_\theta f_\theta(s,a')]$.

Answer:

$$\log \pi_\theta(a|s) = f_\theta(s,a) - \log \sum_{a'} e^{f_\theta(s,a')}$$ $$\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta f_\theta(s,a) - \frac{\sum_{a'} e^{f_\theta(s,a')} \nabla_\theta f_\theta(s,a')}{\sum_{a'} e^{f_\theta(s,a')}}$$ $$= \nabla_\theta f_\theta(s,a) - \sum_{a'} \frac{e^{f_\theta(s,a')}}{\sum_k e^{f_\theta(s,k)}} \nabla_\theta f_\theta(s,a')$$ $$= \nabla_\theta f_\theta(s,a) - \sum_{a'} \pi_\theta(a'|s) \nabla_\theta f_\theta(s,a')$$ The gradient is the "score" for the chosen action minus the expected score over all actions — a contrastive signal.

Problem 3: Show that the TD error $\delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is an unbiased estimate of the advantage $A^\pi(s_t, a_t)$ when $V = V^\pi$.

Answer:

$$\mathbb{E}[\delta_t | s_t, a_t] = \mathbb{E}[R_{t+1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) | s_t, a_t]$$ $$= \mathbb{E}[R_{t+1} | s_t, a_t] + \gamma \mathbb{E}[V^\pi(s_{t+1}) | s_t, a_t] - V^\pi(s_t)$$ By definition of the Bellman equation: $\mathbb{E}[R_{t+1} + \gamma V^\pi(s_{t+1}) | s_t, a_t] = Q^\pi(s_t, a_t)$ Therefore: $\mathbb{E}[\delta_t | s_t, a_t] = Q^\pi(s_t, a_t) - V^\pi(s_t) = A^\pi(s_t, a_t)$. This is why actor-critic methods can use the TD error as a low-variance (though biased due to function approximation) advantage estimate.

Problem 4: REINFORCE has high variance. Quantify the variance: if returns $G$ have variance $\sigma_G^2$ and the policy gradient magnitude $|\nabla_\theta \log \pi|$ has variance $\sigma_{\nabla}^2$, what is the variance of the product $G \cdot \nabla_\theta \log \pi$ assuming independence?

Answer:

Under independence: $\text{Var}[G \cdot \nabla \log \pi] = \mathbb{E}[G^2]\mathbb{E}[(\nabla \log \pi)^2] - (\mathbb{E}[G]\mathbb{E}[\nabla \log \pi])^2$ Since $\mathbb{E}[\nabla \log \pi] = 0$: $\text{Var} = \mathbb{E}[G^2] \cdot \mathbb{E}[(\nabla \log \pi)^2] = (\sigma_G^2 + \mu_G^2) \cdot \sigma_{\nabla}^2$ The variance grows with the squared return — this is why long episodes with large accumulated returns make REINFORCE nearly unusable without variance reduction. With baseline subtraction, $G$ becomes $G - b \approx A$, and $\mu_A \approx 0$, reducing the term to $\sigma_A^2 \cdot \sigma_{\nabla}^2$.
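
The product-variance formula can be confirmed by exact enumeration on a toy case. The sketch assumes $G$ is uniform on $\{1, 3\}$ and the score is uniform on $\{-1, +1\}$, independently; these distributions are hypothetical and chosen only so that every moment is easy to verify by hand.

```python
from itertools import product


def moments(values):
    """Mean and population variance of a uniform distribution over `values`."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var


def product_variance(g_values, s_values):
    """Exact variance of G * S for independent, uniformly distributed G and S."""
    prods = [g * s for g, s in product(g_values, s_values)]
    return moments(prods)[1]


g_values = [1.0, 3.0]   # returns: mean 2, variance 1
s_values = [-1.0, 1.0]  # score: mean 0, variance 1
mu_g, var_g = moments(g_values)
_, var_s = moments(s_values)

var_product = product_variance(g_values, s_values)
# Subtracting the mean return as a baseline removes the mu_G^2 term.
var_with_baseline = product_variance([g - mu_g for g in g_values], s_values)
print(var_product, (var_g + mu_g ** 2) * var_s, var_with_baseline)
```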

Problem 5: Compare the computational cost of REINFORCE vs. Actor-Critic for an episode of length $T$. REINFORCE stores all $(s_t, a_t, r_t)$ and computes returns at the end. Actor-Critic updates online. What are the memory requirements?

Answer:

**REINFORCE (MC):**
- Memory: $O(T \cdot (|s| + |a| + 1))$, since the entire trajectory must be stored
- Time: $O(T \cdot |\theta|)$, one backward pass per step after the episode ends
- Cannot update during the episode

**Actor-Critic (TD):**
- Memory: $O(|s| + |a|)$, only the current transition
- Time: $O(T \cdot (|\theta| + |\phi|))$, two backward passes per step
- Updates online and can learn from incomplete episodes

For $T = 1000$ steps and state feature dimension $d = 256$ stored as 32-bit floats, REINFORCE needs ≈ 1 MB for the trajectory buffer, while Actor-Critic needs ≈ 1 KB. But Actor-Critic introduces bias through the value function approximation. This is the classic bias-variance-memory tradeoff.

Summary

Policy gradient theorem: $\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)]$ — directly optimizes the policy
REINFORCE: Monte Carlo policy gradient — simple but high variance; stores whole episodes
Baseline subtraction: Subtracting $b(s)$ (e.g., $V(s)$) reduces variance without biasing the gradient because $\mathbb{E}[\nabla_\theta \log \pi] = 0$
Advantage function: $A^\pi = Q^\pi - V^\pi$ — how much better an action is than the state average
Actor-Critic: Actor ($\pi_\theta$) updated via policy gradient; Critic ($V_\phi$ or $Q_\phi$) updated via TD learning; TD error serves as advantage estimate
The transition dynamics do NOT depend on the policy parameters, so they drop out of $\nabla_\theta \log P(\tau | \theta)$ and the gradient can be estimated from sampled trajectories without a model

Quiz

Question 1: The Policy Gradient Theorem states that $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)]$. Why does the gradient of the transition dynamics $P(s'|s,a)$ not appear?

A. It integrates to zero
B. The transition dynamics do not depend on the policy parameters $\theta$
C. It's approximated away
D. The model is deterministic

Correct Answer: B