23-08 — Policy Gradient Methods
Phase: 23 — Reinforcement Learning Mathematics Subject: 23-08 Prerequisites: 23-01 — MDPs, 23-04 — Monte Carlo Methods, 23-06 — Q-Learning Next subject: 23-09 — Proximal Policy Optimization (PPO)
Learning Objectives
By the end of this subject, you will be able to:
- Derive the Policy Gradient Theorem: $\nabla_\theta J(\theta) = \mathbb{E}\pi\left[\nabla\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]$
- Implement the REINFORCE algorithm using Monte Carlo returns
- Add a learned baseline (value function) to reduce gradient variance
- Construct the Actor-Critic architecture and derive the TD-based critic update
- Connect policy gradients to the score function estimator and the REINFORCE trick
Core Content
Why Policy Gradients?
Value-based methods (Q-learning, DQN) learn $Q(s,a)$ and derive a policy via $\arg\max$. This has limitations: - $\arg\max$ is discontinuous: small changes in $Q$ can cause large policy changes - Cannot represent stochastic policies (essential for games, partially observable environments) - The greedy policy may be hard to compute in continuous action spaces
Policy gradient methods directly parameterize and optimize the policy $\pi_\theta(a|s)$ using gradient ascent on the expected return.
The Objective Function
For episodic tasks, the objective is the expected return from the start state:
$$J(\theta) = \mathbb{E}{\tau \sim \pi\theta}[G_0] = \mathbb{E}{\tau \sim \pi\theta}\left[\sum_{t=0}^{T} \gamma^t R_{t+1}\right]$$
where $\tau = (s_0, a_0, r_1, s_1, a_1, \dots)$ is a trajectory sampled from $\pi_\theta$. For continuing tasks, we use the average reward formulation.
The Policy Gradient Theorem
Theorem (Sutton et al., 1999):
$$\nabla_\theta J(\theta) = \mathbb{E}{\pi\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]$$
Proof sketch:
The probability of a trajectory under $\pi_\theta$ is:
$$P(\tau | \theta) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t | s_t) P(s_{t+1} | s_t, a_t)$$
The log-probability:
$$\log P(\tau | \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \left[\log \pi_\theta(a_t | s_t) + \log P(s_{t+1} | s_t, a_t)\right]$$
⚠️ CRITICAL: The transition dynamics $P(s_{t+1}|s_t, a_t)$ do NOT depend on $\theta$. Taking the gradient:
$$\nabla_\theta \log P(\tau | \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t | s_t)$$
The expected return gradient uses the log-derivative trick (also called the REINFORCE trick or score function estimator):
$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}{\tau \sim \pi\theta}[R(\tau)] = \mathbb{E}{\tau \sim \pi\theta}[R(\tau) \nabla_\theta \log P(\tau | \theta)]$$
Expanding $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$ and applying causality (future rewards don't affect past actions):
$$\nabla_\theta J(\theta) = \mathbb{E}{\pi\theta}\left[\sum_{t=0}^{T-1} \gamma^t G_t \cdot \nabla_\theta \log \pi_\theta(a_t | s_t)\right]$$
Where $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$ is the Monte Carlo return from step $t$. Since $\mathbb{E}[G_t | s_t, a_t] = Q^\pi(s_t, a_t)$, we can replace $G_t$ with $Q^\pi$:
$$\nabla_\theta J(\theta) = \mathbb{E}{\pi\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]$$
where the expectation is over the state-action visitation distribution $d^\pi(s,a)$.
The REINFORCE Algorithm
REINFORCE (Williams, 1992) is the simplest Monte Carlo policy gradient algorithm:
Algorithm: 1. Initialize policy parameters $\theta$ randomly 2. For each episode: - Generate trajectory $(s_0, a_0, r_1, \dots, s_T)$ using $\pi_\theta$ - For each step $t = 0, \dots, T-1$: - Compute return $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$ - Update parameters: $\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla_\theta \log \pi_\theta(a_t | s_t)$
Intuition: If action $a$ in state $s$ leads to higher return $G_t$, increase its probability ($\nabla_\theta \log \pi_\theta$ points toward higher probability). If lower return, decrease probability.
Variance Reduction with Baseline
REINFORCE has high variance because $G_t$ can vary wildly. The solution: subtract a baseline $b(s)$ that doesn't depend on the action:
$$\nabla_\theta J(\theta) = \mathbb{E}{\pi\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot (Q^\pi(s,a) - b(s))\right]$$
Why baselines don't bias the gradient:
$$\mathbb{E}{\pi\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = b(s) \cdot \mathbb{E}{\pi\theta}[\nabla_\theta \log \pi_\theta(a|s)] = b(s) \cdot \nabla_\theta 1 = 0$$
The baseline term integrates to zero because $\sum_a \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$.
Optimal baseline: A natural choice is the state value function $V^\pi(s)$, giving the advantage function:
$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$
The advantage represents how much better action $a$ is compared to the average action in state $s$. Using the advantage reduces variance because it centers the returns:
$$\nabla_\theta J(\theta) = \mathbb{E}{\pi\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot A^\pi(s,a)\right]$$
Actor-Critic Architecture
The Actor-Critic combines policy gradient (actor) with value function approximation (critic):
- Actor ($\pi_\theta$): parameterized policy that selects actions
- Critic ($V_\phi$ or $Q_\phi$): parameterized value function that evaluates the actor's actions
The actor is updated using the policy gradient with the critic's advantage estimate:
$$\theta \leftarrow \theta + \alpha_\theta \cdot \hat{A}(s,a) \cdot \nabla_\theta \log \pi_\theta(a|s)$$
The critic is updated using TD learning (or MC):
$$\phi \leftarrow \phi + \alpha_\phi \cdot \delta \cdot \nabla_\phi V_\phi(s)$$
where $\delta = R + \gamma V_\phi(s') - V_\phi(s)$ is the TD error, which itself is an unbiased estimate of the advantage: $\mathbb{E}[\delta | s,a] = A^\pi(s,a)$.
N-step advantage estimate: For lower bias, use $n$-step returns:
$$\hat{A}t^{(n)} = \sum{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V_\phi(s_{t+n}) - V_\phi(s_t)$$
⚠️ CRITICAL: Actor-Critic combines two learning processes that are coupled: the critic's accuracy affects the actor's gradient quality, and the actor's policy determines the data distribution the critic learns from. This can cause instability.
Policy Parameterizations
Discrete actions (softmax):
$$\pi_\theta(a|s) = \frac{\exp(f_\theta(s,a))}{\sum_{a'} \exp(f_\theta(s,a'))}$$
$$\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta f_\theta(s,a) - \sum_{a'} \pi_\theta(a'|s) \nabla_\theta f_\theta(s,a')$$
Continuous actions (Gaussian):
$$\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta^2(s))$$
$$\log \pi_\theta(a|s) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(a - \mu)^2}{2\sigma^2}$$
$$\nabla_\theta \log \pi_\theta(a|s) = \frac{a - \mu_\theta(s)}{\sigma_\theta^2(s)} \nabla_\theta \mu_\theta(s) + \frac{(a - \mu_\theta(s))^2 - \sigma_\theta^2(s)}{\sigma_\theta^3(s)} \nabla_\theta \sigma_\theta(s)$$
Key Terms
- Actor
- Actor-Critic
- Critic
- Policy Gradient Theorem
Worked Examples
Example 1: REINFORCE on a Simple Bandit
A 2-armed bandit has true values $Q^(a_1) = 1.0$, $Q^(a_2) = 2.0$. The policy is softmax with logits $\theta = [\theta_1, \theta_2] = [0, 0]$ initially. One episode: pick $a_1$, get reward 0.8. Compute the REINFORCE update with $\alpha = 0.1$.
Solution:
Initial probabilities: $\pi(a_1) = \pi(a_2) = \frac{e^0}{e^0 + e^0} = 0.5$
$\nabla_\theta \log \pi(a_1|\theta) = [1 - 0.5, -0.5] = [0.5, -0.5]$ (one-hot minus probabilities)
$G = 0.8$ (episode return = single reward)
Update: $\theta \leftarrow \theta + \alpha G \nabla_\theta \log \pi(a_1|\theta)$ $\theta_1 \leftarrow 0 + 0.1 \cdot 0.8 \cdot 0.5 = 0.04$ $\theta_2 \leftarrow 0 + 0.1 \cdot 0.8 \cdot (-0.5) = -0.04$
Click for answer
$\theta = [0.04, -0.04]$. New probabilities: $\pi(a_1) \approx 0.52$, $\pi(a_2) \approx 0.48$. Even though $a_2$ is truly better, the update increases $\pi(a_1)$ because $a_1$ was taken and the return (0.8) was positive. Over many episodes, $a_2$ will be sampled and its higher returns will dominate.Example 2: Baseline Variance Reduction
Two actions at state $s$: action $a_1$ gives returns $G = [10, 10, 10]$; action $a_2$ gives $G = [-5, 35, 0]$. Both have true $Q(s,a) = 10$. Compute the variance of the REINFORCE gradient (scalar $\nabla_\theta \log \pi = \pm 1$) with and without baseline $b = V(s) = 10$.
Solution:
Without baseline: Updates are proportional to $G_t$ directly. $a_1$ updates: $[+10, +10, +10]$ → mean = 10, variance = 0 $a_2$ updates: $[-5, +35, 0]$ → mean = 10, variance = $(15^2 + 25^2 + 10^2)/3 = (225 + 625 + 100)/3 = 316.7$
Pooled variance (assuming equal state visitation): $(0 + 316.7)/2 = 158.3$
With baseline $b = 10$: Updates proportional to $G_t - 10$. $a_1$: $[0, 0, 0]$ → mean = 0, variance = 0 $a_2$: $[-15, +25, -10]$ → mean = 0, variance = $(225 + 625 + 100)/3 = 316.7$
Wait — the variance is the same! This is because both actions have the same true $Q$. With different $Q$ values, the baseline helps more dramatically.
Click for answer
In this example baseline doesn't reduce variance because both actions have equal $Q$. But in general, $V(s) = \mathbb{E}_a[Q(s,a)]$ centers the advantage, and when actions have very different $Q$ values, subtracting $V(s)$ dramatically reduces the gradient magnitude spread.Example 3: Actor-Critic Update Step
At state $s$, the actor selects action $a$ with probability $\pi_\theta(a|s) = 0.3$. The critic estimates $V_\phi(s) = 5.0$. The agent receives $R = 2$, transitions to $s'$ where $V_\phi(s') = 4.5$, $\gamma = 0.99$. Actor learning rate $\alpha_\theta = 0.01$, critic $\alpha_\phi = 0.1$. Compute both updates.
Solution:
TD error: $\delta = 2 + 0.99(4.5) - 5.0 = 2 + 4.455 - 5.0 = 1.455$
Critic update: $\phi \leftarrow \phi + 0.1 \cdot 1.455 \cdot \nabla_\phi V_\phi(s)$ (The gradient direction depends on the specific parameterization — typically $\nabla_\phi V_\phi(s)$ for linear features is just the feature vector $\mathbf{x}(s)$)
Actor update: $\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta \log 0.3 = \nabla_\theta (-\log 0.3)$ (depends on parameterization) $\theta \leftarrow \theta + 0.01 \cdot 1.455 \cdot \nabla_\theta \log \pi_\theta(a|s)$
Since $\delta > 0$, the action was better than expected — increase its probability.
Click for answer
The positive TD error (1.455) means the transition yielded a higher return than the critic predicted. The actor increases $\pi(a|s)$ and the critic increases $V(s)$ toward $R + \gamma V(s') = 6.455$.Practice Problems
Problem 1: Prove that $\mathbb{E}{\pi\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$ for any state $s$.
Answer (click to expand)
$$\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}[\nabla_\theta \log \pi_\theta(a|s)] = \sum_a \pi_\theta(a|s) \cdot \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} = \sum_a \nabla_\theta \pi_\theta(a|s) = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$$ This identity is crucial: it proves that subtracting ANY state-dependent baseline $b(s)$ from the return does not bias the policy gradient.Problem 2: Derive the softmax policy gradient for discrete actions. If $\pi_\theta(a|s) = \frac{e^{f_\theta(s,a)}}{\sum_{a'} e^{f_\theta(s,a')}}$, show that $\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta f_\theta(s,a) - \mathbb{E}{a' \sim \pi\theta}[\nabla_\theta f_\theta(s,a')]$.
Answer (click to expand)
$$\log \pi_\theta(a|s) = f_\theta(s,a) - \log \sum_{a'} e^{f_\theta(s,a')}$$ $$\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta f_\theta(s,a) - \frac{\sum_{a'} e^{f_\theta(s,a')} \nabla_\theta f_\theta(s,a')}{\sum_{a'} e^{f_\theta(s,a')}}$$ $$= \nabla_\theta f_\theta(s,a) - \sum_{a'} \frac{e^{f_\theta(s,a')}}{\sum_k e^{f_\theta(s,k)}} \nabla_\theta f_\theta(s,a')$$ $$= \nabla_\theta f_\theta(s,a) - \sum_{a'} \pi_\theta(a'|s) \nabla_\theta f_\theta(s,a')$$ The gradient is the "score" for the chosen action minus the expected score over all actions — a contrastive signal.Problem 3: Show that the TD error $\delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is an unbiased estimate of the advantage $A^\pi(s_t, a_t)$ when $V = V^\pi$.
Answer (click to expand)
$$\mathbb{E}[\delta_t | s_t, a_t] = \mathbb{E}[R_{t+1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) | s_t, a_t]$$ $$= \mathbb{E}[R_{t+1} | s_t, a_t] + \gamma \mathbb{E}[V^\pi(s_{t+1}) | s_t, a_t] - V^\pi(s_t)$$ By definition of the Bellman equation: $\mathbb{E}[R_{t+1} + \gamma V^\pi(s_{t+1}) | s_t, a_t] = Q^\pi(s_t, a_t)$ Therefore: $\mathbb{E}[\delta_t | s_t, a_t] = Q^\pi(s_t, a_t) - V^\pi(s_t) = A^\pi(s_t, a_t)$. This is why actor-critic methods can use the TD error as a low-variance (though biased due to function approximation) advantage estimate.Problem 4: REINFORCE has high variance. Quantify the variance: if returns $G$ have variance $\sigma_G^2$ and the policy gradient magnitude $|\nabla_\theta \log \pi|$ has variance $\sigma_{\nabla}^2$, what is the variance of the product $G \cdot \nabla_\theta \log \pi$ assuming independence?
Answer (click to expand)
Under independence: $\text{Var}[G \cdot \nabla \log \pi] = \mathbb{E}[G^2]\mathbb{E}[(\nabla \log \pi)^2] - (\mathbb{E}[G]\mathbb{E}[\nabla \log \pi])^2$ Since $\mathbb{E}[\nabla \log \pi] = 0$: $\text{Var} = \mathbb{E}[G^2] \cdot \mathbb{E}[(\nabla \log \pi)^2] = (\sigma_G^2 + \mu_G^2) \cdot \sigma_{\nabla}^2$ The variance grows with the squared return — this is why long episodes with large accumulated returns make REINFORCE nearly unusable without variance reduction. With baseline subtraction, $G$ becomes $G - b \approx A$, and $\mu_A \approx 0$, reducing the term to $\sigma_A^2 \cdot \sigma_{\nabla}^2$.Problem 5: Compare the computational cost of REINFORCE vs. Actor-Critic for an episode of length $T$. REINFORCE stores all $(s_t, a_t, r_t)$ and computes returns at the end. Actor-Critic updates online. What are the memory requirements?
Answer (click to expand)
**REINFORCE (MC):** - Memory: $O(T \cdot (|s| + |a| + 1))$ — must store entire trajectory - Time: $O(T \cdot |\theta|)$ — one backward pass per step after episode ends - Cannot update during episode **Actor-Critic (TD):** - Memory: $O(|s| + |a|)$ — only current transition - Time: $O(T \cdot (|\theta| + |\phi|))$ — two backward passes per step - Updates online, can learn from incomplete episodes For $T = 1000$ steps, feature dimension $d = 256$: REINFORCE needs ≈ 256KB for the trajectory buffer; Actor-Critic needs ≈ 1KB. But Actor-Critic introduces bias through the value function approximation. This is the classic bias-variance-memory tradeoff.Summary
- Policy gradient theorem: $\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)]$ — directly optimizes the policy
- REINFORCE: Monte Carlo policy gradient — simple but high variance; stores whole episodes
- Baseline subtraction: Subtracting $b(s)$ (e.g., $V(s)$) reduces variance without biasing the gradient because $\mathbb{E}[\nabla_\theta \log \pi] = 0$
- Advantage function: $A^\pi = Q^\pi - V^\pi$ — how much better an action is than the state average
- Actor-Critic: Actor ($\pi_\theta$) updated via policy gradient; Critic ($V_\phi$ or $Q_\phi$) updated via TD learning; TD error serves as advantage estimate
- The transition dynamics do NOT depend on policy parameters — this enables the log-derivative trick
Quiz
Question 1: The Policy Gradient Theorem states that $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)]$. Why does the gradient of the transition dynamics $P(s'|s,a)$ not appear?
A. It integrates to zero B. The transition dynamics do not depend on the policy parameters $\theta$ C. It's approximated away D. The model is deterministic
Correct Answer: B
Explanation (click to expand)
- **If you chose B:** Correct. The environment dynamics $P(s'|s,a)$ are independent of the agent's policy parameters $\theta$, so $\nabla_\theta P(s'|s,a) = 0$. - If you chose A: Individual transition gradients are zero, not just their expectation. - If you chose C: No approximation is needed — it's exact. - If you chose D: The theorem works for stochastic dynamics too.Question 2: REINFORCE updates are applied:
A. At every time step, using the immediate reward B. At the end of each episode, using the full Monte Carlo return C. Only after the policy achieves a target success rate D. Using Q-learning style TD errors
Correct Answer: B
Explanation (click to expand)
- **If you chose B:** Correct. REINFORCE is a Monte Carlo method — it waits until episode end to compute the return $G_t$ for each step. - If you chose A: That's TD-based actor-critic, not pure REINFORCE. - If you chose C: REINFORCE doesn't require a success threshold. - If you chose D: REINFORCE uses $G_t$, not TD errors.Question 3: Why does subtracting a state-dependent baseline $b(s)$ not bias the policy gradient?
A. Because $\mathbb{E}[\nabla_\theta \log \pi_\theta(a|s)] = 0$ B. Because $b(s)$ is always zero C. Because the baseline is learned separately D. Baselines do introduce bias, but it's negligible
Correct Answer: A
Explanation (click to expand)
- **If you chose A:** Correct. $\mathbb{E}[b(s)\nabla_\theta \log \pi] = b(s)\mathbb{E}[\nabla_\theta \log \pi] = b(s) \cdot 0 = 0$. The baseline term contributes zero in expectation. - If you chose B: The baseline is typically non-zero (e.g., $V(s)$). - If you chose C: Separate learning doesn't guarantee unbiasedness; the mathematical property does. - If you chose D: The baseline introduces zero bias — it's a free lunch for variance reduction.Question 4: In Actor-Critic, what does the TD error $\delta$ estimate?
A. The policy gradient directly B. The advantage $A^\pi(s,a)$ C. The value function $V(s)$ D. The immediate reward
Correct Answer: B
Explanation (click to expand)
- **If you chose B:** Correct. $\mathbb{E}[\delta | s,a] = Q^\pi(s,a) - V^\pi(s) = A^\pi(s,a)$. The TD error is an unbiased estimate of the advantage. - If you chose A: $\delta$ times the score function gives the policy gradient, but $\delta$ itself estimates advantage. - If you chose C: $\delta$ is the error in $V(s)$, not $V(s)$ itself. - If you chose D: $\delta$ includes the next state value, not just reward.Question 5: Which has LOWER variance: REINFORCE or Actor-Critic with TD(0)?
A. REINFORCE — because it uses the full return B. Actor-Critic with TD(0) — because it bootstraps from a learned value function C. They have the same variance D. It depends on the discount factor
Correct Answer: B
Explanation (click to expand)
- **If you chose B:** Correct. TD(0) replaces the noisy sum of future rewards with a value function estimate $V(s')$, which has lower variance than the Monte Carlo return $G_t$ (at the cost of bias from function approximation error). - If you chose A: REINFORCE has high variance precisely because it uses noisy long-term returns. - If you chose C: Actor-Critic universally has lower variance. - If you chose D: Discount factor affects both similarly; the variance reduction from bootstrapping is the key difference.Question 6: The "log-derivative trick" refers to:
A. $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$ B. $\nabla_\theta \mathcal{L} = \nabla_\theta Q(s,a)$ C. $\log(\nabla_\theta \pi) = \nabla_\theta \log \pi$ D. Using logarithms to stabilize gradients
Correct Answer: A
Explanation (click to expand)
- **If you chose A:** Correct. Since $\nabla_\theta \log \pi = \frac{\nabla_\theta \pi}{\pi}$, multiplying both sides by $\pi$ gives $\nabla_\theta \pi = \pi \nabla_\theta \log \pi$. This lets us write $\nabla_\theta \mathbb{E}_\pi[f] = \mathbb{E}_\pi[f \nabla_\theta \log \pi]$, converting a gradient of an expectation into an expectation of a gradient. - If you chose B: That's the Q-learning gradient, not the log-derivative trick. - If you chose C: The log of a gradient is not the gradient of a log. - If you chose D: While logarithms do stabilize numerically, that's not what the "log-derivative trick" refers to.Pitfalls
- Forgetting causality: Using $R(\tau)$ instead of $G_t$ — the full trajectory return shouldn't affect early action probabilities. The correct update uses per-step returns.
- No baseline: Raw REINFORCE without baseline has extreme variance; even a simple running-average baseline helps enormously.
- Learning rate too high: Policy gradient steps are noisier than supervised learning steps. Use smaller learning rates.
- Collapsing entropy: The policy can become deterministic too quickly (low entropy). Entropy regularization $\mathcal{H}(\pi)$ encourages exploration.
- Confusing on-policy vs off-policy: Policy gradient methods are typically on-policy — data must be collected under the current policy. Reusing old data introduces bias.
Next Steps
Next up: 23-09 — Proximal Policy Optimization (PPO) — where we learn to take policy gradient steps that are large enough to make progress but small enough to avoid destructive policy collapse.
Q8: Why are policy gradient methods typically on-policy?
A) The gradient $$\mathbb{E}{\pi\theta}[\nabla_\theta \log \pi_\theta \cdot Q^\pi]$$ is an expectation under the current policy — using data from a different policy would require importance sampling corrections B) Off-policy algorithms are always unstable C) Neural networks can't handle off-policy data D) On-policy methods are legally required