Math graphic
📐 Concept diagram

23-03 — Dynamic Programming for MDPs

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-03 Prerequisites: 23-01 — MDPs, 23-02 — Bellman Equations Next subject: 23-04 — Monte Carlo Methods


Learning Objectives

By the end of this subject, you will be able to:

  1. Implement policy evaluation and explain why it converges
  2. Execute the policy improvement theorem and prove that it produces strictly better policies
  3. Combine policy evaluation and improvement into policy iteration and analyze its convergence
  4. Derive value iteration from the Bellman optimality equation and explain when to use it vs. policy iteration
  5. Understand the computational complexity of DP methods and their limitations

Core Content

Dynamic Programming Requirements

Dynamic Programming (DP) for MDPs requires a perfect model of the environment — you must know $P(s' \mid s, a)$ and $R(s, a, s')$ for all transitions. DP is thus a planning method, not a learning method. The key idea: use the Bellman equations to iteratively compute value functions.

DP assumes finite state and action spaces (or at least that you can iterate over them). For continuous spaces, function approximation is needed (covered in 23-07 through 23-10).

Policy Evaluation (Prediction)

Goal: Given a policy $\pi$, compute $V^\pi$.

The algorithm iteratively applies the Bellman operator:

$$V_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\big[R(s, a, s') + \gamma V_k(s')\big]$$

Initialization: $V_0(s) = 0$ for all $s$ (arbitrary — convergence is guaranteed from any starting point).

Convergence: Since $\mathcal{T}^\pi$ is a $\gamma$-contraction, $|V_k - V^\pi|\infty \leq \gamma^k |V_0 - V^\pi|\infty \to 0$ as $k \to \infty$.

In practice, we stop when $\max_s |V_{k+1}(s) - V_k(s)| < \theta$ for some small threshold $\theta$.

⚠️ CRITICAL: Policy evaluation is a sweep over all states. Each iteration is $O(|\mathcal{S}|^2 |\mathcal{A}|)$ — quadratic in states, linear in actions. This is why DP only works for small MDPs.

Two implementations:

Policy Improvement

Goal: Given $V^\pi$ (or an approximation), produce a better policy $\pi'$.

The policy improvement theorem states: if for all $s$:

$$Q^\pi(s, \pi'(s)) \geq V^\pi(s)$$

then $\pi'$ is at least as good as $\pi$: $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$.

Proof sketch:

$V^\pi(s) \leq Q^\pi(s, \pi'(s))$ $= \mathbb{E}[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t=s, A_t=\pi'(s)]$ $\leq \mathbb{E}[R_{t+1} + \gamma Q^\pi(S_{t+1}, \pi'(S_{t+1})) \mid \ldots]$ $\leq \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 Q^\pi(S_{t+2}, \pi'(S_{t+2})) \mid \ldots]$ $\cdots \leq V^{\pi'}(s)$

The greedy policy improvement constructs:

$$\pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} P(s' \mid s, a)[R(s, a, s') + \gamma V^\pi(s')]$$

If $\pi' = \pi$, then the Bellman optimality equation is satisfied and $\pi$ is optimal.

Policy Iteration

Algorithm:

  1. Initialization: $\pi_0$ arbitrary, $V$ arbitrary
  2. Policy Evaluation: Compute $V^{\pi_k}$ (solve or iterate until convergence)
  3. Policy Improvement: $\pi_{k+1}(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)[R + \gamma V^{\pi_k}(s')]$
  4. If $\pi_{k+1} = \pi_k$, stop (optimal policy found). Else, go to 2.

Convergence: For finite MDPs with $|\mathcal{A}|$ finite, policy iteration terminates in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (though in practice it's much faster — often just a handful of iterations). Each iteration strictly improves the policy unless it's already optimal.

Why it works: Each improvement step produces a strictly better policy (unless optimal). With finitely many deterministic policies ($|\mathcal{A}|^{|\mathcal{S}|}$), the process must terminate at the optimum.

Value Iteration

Value iteration combines policy evaluation and improvement into a single update, using the Bellman optimality operator:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a)\big[R(s, a, s') + \gamma V_k(s')\big]$$

This is policy evaluation truncated to one sweep, followed immediately by policy improvement. It's equivalent to applying $\mathcal{T}^*$ repeatedly.

Once $V$ has converged (to $V^*$), extract the optimal policy:

$$\pi^(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)[R(s, a, s') + \gamma V^(s')]$$

Convergence: Since $\mathcal{T}^$ is also a $\gamma$-contraction, value iteration converges to $V^$ from any starting point:

$$|V_k - V^|_\infty \leq \gamma^k |V_0 - V^|_\infty$$

Policy Iteration vs. Value Iteration

Aspect Policy Iteration Value Iteration
Inner loop Full policy evaluation (many sweeps) One sweep per iteration
Outer iterations Few (policies are discrete) Many (value converges continuously)
Per-iteration cost High (solving linear system or many sweeps) Lower (one Bellman update)
When to use Small state spaces, evaluations are cheap Large state spaces or when approximate evaluation is acceptable
Typical behavior Converges in < 20 policy iterations May need thousands of value iterations

⚠️ Common Pitfall: "Value iteration" is NOT the same as "iteratively computing values." Both policy evaluation and value iteration involve iterative updates — the difference is whether you use the policy's probability distribution (evaluation) or the max (value iteration).

Generalized Policy Iteration (GPI)

Most modern RL algorithms are instances of Generalized Policy Iteration: interleave evaluation and improvement at any granularity. Pure policy iteration evaluates fully; value iteration evaluates for one step; actor-critic methods (23-08) evaluate and improve simultaneously.

The key insight: as long as both processes continue (evaluation makes values more accurate, improvement makes the policy greedier), the process converges to optimality.



Key Terms

Worked Examples

Example 1: Policy Evaluation by Hand

MDP with 2 states, 1 action. $P = \begin{pmatrix} 0.5 & 0.5 \ 0 & 1 \end{pmatrix}$, $R = [10, -1]$, $\gamma = 0.9$. Start with $V_0 = [0, 0]$. Perform 3 iterations of synchronous policy evaluation.

Solution:

Iteration 1: $V_1(s_1) = 0.5[10 + 0.9(0)] + 0.5[10 + 0.9(0)] = 5 + 5 = 10$ $V_1(s_2) = 0[-1 + 0.9(0)] + 1[-1 + 0.9(0)] = -1$

Iteration 2: $V_2(s_1) = 0.5[10 + 0.9(10)] + 0.5[10 + 0.9(-1)] = 0.5(19) + 0.5(9.1) = 9.5 + 4.55 = 14.05$ $V_2(s_2) = 0[-1 + 0.9(10)] + 1[-1 + 0.9(-1)] = -1.9$

Iteration 3: $V_3(s_1) = 0.5[10 + 0.9(14.05)] + 0.5[10 + 0.9(-1.9)] = 0.5(22.645) + 0.5(8.29) = 15.47$ $V_3(s_2) = -1 + 0.9(-1.9) = -2.71$

Click for answer $V_3 = [15.47, -2.71]$. True $V^\pi = [18.18, -10]$ (converges slowly because $\gamma$ is close to 1).

Example 2: Policy Iteration — Complete Walkthrough

Grid MDP: $s_1$ → action $A$: reward 5, stays in $s_1$. Action $B$: reward 0, goes to $s_2$. $s_2$ → action $A$: reward 10, goes to terminal (value 0). Action $B$: reward -1, goes back to $s_1$. $\gamma = 0.9$. Start with $\pi_0$: always $A$. Find optimal policy.

Solution:

Policy $\pi_0$: $A$ in both states.

Evaluate $\pi_0$: $V(s_1) = 5 + 0.9 V(s_1)$ → $0.1 V(s_1) = 5$ → $V(s_1) = 50$ $V(s_2) = 10 + 0.9(0) = 10$

Improve: $Q(s_1, A) = 5 + 0.9(50) = 50$, $Q(s_1, B) = 0 + 0.9(10) = 9$ → keep $A$ ✓ $Q(s_2, A) = 10 + 0.9(0) = 10$, $Q(s_2, B) = -1 + 0.9(50) = 44$ → switch to $B$

Policy $\pi_1$: $A$ in $s_1$, $B$ in $s_2$.

Evaluate $\pi_1$: $V(s_1) = 5 + 0.9 V(s_1)$ → $V(s_1) = 50$ $V(s_2) = -1 + 0.9(50) = 44$

Improve: $Q(s_1, A) = 5 + 0.9(50) = 50$, $Q(s_1, B) = 0 + 0.9(44) = 39.6$ → keep $A$ ✓ $Q(s_2, A) = 10 + 0 = 10$, $Q(s_2, B) = -1 + 0.9(50) = 44$ → keep $B$ ✓

Policy unchanged → optimal!

Click for answer Optimal policy: $\pi^*(s_1) = A$, $\pi^*(s_2) = B$. $V^*(s_1) = 50$, $V^*(s_2) = 44$. Surprisingly, in $s_2$, it's better to take action $B$ (immediate $-1$) to get back to $s_1$ where you can earn 5 per step forever.

Example 3: Value Iteration

Same MDP as Example 2. Perform value iteration starting from $V_0 = [0, 0]$.

Solution:

Iteration 1: $V_1(s_1) = \max{5 + 0.9(0),\; 0 + 0.9(0)} = 5$ $V_1(s_2) = \max{10 + 0.9(0),\; -1 + 0.9(0)} = 10$

Iteration 2: $V_2(s_1) = \max{5 + 0.9(5),\; 0 + 0.9(10)} = \max{9.5, 9} = 9.5$ $V_2(s_2) = \max{10 + 0,\; -1 + 0.9(5)} = \max{10, 3.5} = 10$

Iteration 3: $V_3(s_1) = \max{5 + 0.9(9.5),\; 0 + 0.9(10)} = \max{13.55, 9} = 13.55$ $V_3(s_2) = \max{10, -1 + 0.9(9.5)} = \max{10, 7.55} = 10$

...continuing, $V(s_1)$ converges to 50, $V(s_2)$ converges to 44.

Click for answer Value iteration converges to $V^* = [50, 44]$, matching the policy iteration result. The greedy policy extracted from $V^*$ matches the optimal policy found above.

Practice Problems

Problem 1: For the MDP in Example 1, compute the true $V^\pi$ analytically using $(I - \gamma P^\pi)^{-1} R^\pi$ and verify it matches the limit of policy evaluation.

Answers (click to expand) $P^\pi = \begin{pmatrix} 0.5 & 0.5 \\ 0 & 1 \end{pmatrix}$, $R^\pi = [10, -1]^T$ $I - \gamma P^\pi = \begin{pmatrix} 1-0.45 & -0.45 \\ 0 & 1-0.9 \end{pmatrix} = \begin{pmatrix} 0.55 & -0.45 \\ 0 & 0.1 \end{pmatrix}$ Determinant $= 0.55(0.1) = 0.055$. Inverse $= \frac{1}{0.055} \begin{pmatrix} 0.1 & 0.45 \\ 0 & 0.55 \end{pmatrix}$ $V^\pi = \begin{pmatrix} 1.818 & 8.182 \\ 0 & 10 \end{pmatrix} \begin{pmatrix} 10 \\ -1 \end{pmatrix} = \begin{pmatrix} 18.18 - 8.182 \\ -10 \end{pmatrix} = \begin{pmatrix} 10 \\ -10 \end{pmatrix}$ Wait — let me recalculate. $V(s_1) = 1.818(10) + 8.182(-1) = 18.18 - 8.182 = 10/0.55 \approx 18.18$. $V(s_2) = 0(10) + 10(-1) = -10$. So $V^\pi = [18.18, -10]$.

Problem 2: Prove that if the policy improvement step produces $\pi' = \pi$, then $\pi$ satisfies the Bellman optimality equation and is therefore optimal.

Answers (click to expand) If $\pi' = \pi$, then for all $s$: $\pi(s) = \arg\max_a Q^\pi(s, a)$. This means: $V^\pi(s) = Q^\pi(s, \pi(s)) = \max_a Q^\pi(s, a) = \max_a \sum_{s'} P(s' \mid s, a)[R + \gamma V^\pi(s')]$ which is exactly the Bellman optimality equation. Since $V^*$ is the unique solution, $V^\pi = V^*$ and $\pi$ is optimal.

Problem 3: In policy iteration, why does each iteration strictly improve the policy (unless already optimal)? Give a proof sketch.

Answers (click to expand) By the policy improvement theorem, if there's any state $s$ where $Q^\pi(s, \pi'(s)) > V^\pi(s)$, then $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$ and $V^{\pi'}(s) > V^\pi(s)$ for at least one $s$. Since policies are finite, strict improvement cannot continue indefinitely — it terminates at the optimal policy. If $\pi' = \pi$, no improvement is possible → already optimal.

Problem 4: Compare the per-iteration computational cost of policy evaluation (one sweep) vs. value iteration for an MDP with $|\mathcal{S}| = n$, $|\mathcal{A}| = m$, branching factor $b$ (max number of possible next states).

Answers (click to expand) Policy evaluation sweep: For each state $s$ (n states), sum over actions (m) and next states (up to b): $O(n m b)$. Value iteration: For each state (n), for each action (m), sum over next states (b), then take max: $O(n m b)$. **Same per-sweep cost!** The difference is that policy evaluation may need many sweeps per policy iteration, while value iteration does exactly one sweep per iteration. However, policy iteration typically needs far fewer outer iterations.

Problem 5: A value iteration update is $V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a)[R + \gamma V_k(s')]$. Show that this is equivalent to one sweep of policy evaluation on the greedy policy with respect to $V_k$, followed by one step of policy improvement.

Answers (click to expand) Let $\pi_k(s) = \arg\max_a \sum_{s'} P(s' \mid s,a)[R + \gamma V_k(s')]$. Then $V_{k+1}(s) = \sum_{s'} P(s' \mid s, \pi_k(s))[R + \gamma V_k(s')]$. This is exactly: (1) form the greedy policy from $V_k$ (improvement), (2) apply one Bellman update using that policy (one-step evaluation). So value iteration = truncated policy iteration where evaluation is exactly one sweep.

Summary

Key takeaways:


Quiz

Question 1: What is the key requirement for applying dynamic programming to an MDP?

A. The state space must be continuous B. A perfect model of $P(s' \mid s,a)$ and $R(s,a,s')$ must be known C. The policy must be stochastic D. $\gamma$ must equal 1

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. DP is a *planning* method — it requires knowing the environment's dynamics. - If you chose A: DP typically requires finite spaces; continuous spaces need function approximation. - If you chose C: DP works with any policy, deterministic or stochastic. - If you chose D: $\gamma < 1$ is standard; $\gamma = 1$ only works for episodic tasks.

Question 2: Policy evaluation converges because:

A. The policy is deterministic B. The Bellman operator is a $\gamma$-contraction C. States are visited in a specific order D. The reward function is bounded

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. The contraction property guarantees $\|V_k - V^\pi\|_\infty \leq \gamma^k \|V_0 - V^\pi\|_\infty \to 0$. - If you chose A: Policy evaluation works for stochastic policies too. - If you chose C: Convergence is independent of visitation order (though order affects rate). - If you chose D: Bounded rewards ensure finite values but aren't what guarantees convergence of the iterative process.

Question 3: In the policy improvement step, the new policy is:

A. A random action for each state B. The action that maximizes $Q^\pi(s, a)$ (greedy with respect to the current value function) C. The same as the old policy D. A uniform distribution over all actions

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. $\pi'(s) = \arg\max_a Q^\pi(s, a)$. This is the greedy improvement. - If you chose A/C/D: These would not guarantee improvement. The policy improvement theorem specifically applies to the greedy choice.

Question 4: Value iteration differs from policy iteration primarily in that:

A. It uses a different discount factor B. It truncates policy evaluation to a single sweep C. It doesn't use the Bellman equation D. It requires continuous state spaces

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. Value iteration performs exactly one Bellman update (using max) between policy improvements. Policy iteration runs evaluation to convergence. - If you chose A: Both use the same $\gamma$. - If you chose C: Value iteration uses the Bellman *optimality* equation — it's still a Bellman equation. - If you chose D: Neither requires continuous spaces; both are designed for finite MDPs.

Question 5: For a finite MDP with $n$ states and $m$ actions, how many deterministic policies exist?

A. $n \times m$ B. $m^n$ C. $n^m$ D. $m!$

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. For each of $n$ states, you choose one of $m$ actions: $m^n$ total deterministic policies. Policy iteration must terminate within this many steps (but usually takes far fewer, often < 20). - If you chose A: $n \times m$ is the number of state-action pairs. - If you chose C: $n^m$ would be choosing a state for each action — backwards. - If you chose D: $m!$ has no meaning here.

Next Steps

Next up: 23-04 — Monte Carlo Methods — where you'll learn how to estimate value functions from experience (actual or simulated trajectories) when you don't have a model of the environment.


Pitfalls

  1. Calling any iterative value update "value iteration." Value iteration specifically uses the Bellman optimality operator (max over actions). Policy evaluation uses the Bellman expectation operator (weighted sum under the policy). The distinction is the operator, not the fact that you're iterating.

  2. Confusing synchronous and asynchronous convergence guarantees. Both converge, but asynchronous updates (in-place) can converge faster because they propagate information within the same sweep. However, the order of in-place updates can affect the rate — updating states in reverse topological order (from high-reward terminal states backward) often converges much faster than random or forward order.

  3. Stopping policy evaluation too early. The policy improvement theorem requires an accurate value function to guarantee improvement. If you truncate evaluation after just 1-2 sweeps (i.e., value iteration), the greedy policy derived may not actually be better. That's why pure policy iteration evaluates to convergence, while value iteration accepts approximate values but does many more outer iterations.

  4. Forgetting that DP requires a perfect model. DP methods are planning algorithms — they need $$P(s' \mid s,a)$$ and $$R(s,a,s')$$ for all transitions. If you're learning from experience (actual or simulated), you're doing model-free RL (MC, TD, Q-learning), not DP. Attempting DP without a model is mathematically impossible.




Q8: What is the per-sweep computational complexity of policy evaluation for an MDP with $$n$$ states, $$m$$ actions, and maximum branching factor $$b$$?

A) $$O(n)$$ B) $$O(n m b)$$ C) $$O(n^3)$$ D) $$O(b^m)$$

Answer and Explanations **Correct: B)** For each of $$n$$ states, you sum over $$m$$ actions and up to $$b$$ possible next states for each action. Total: $$O(n m b)$$ per sweep. - A) Linear in states only — ignores actions and transitions. - C) $$O(n^3)$$ is the cost of the closed-form matrix inverse solution, not per-sweep. - D) $$O(b^m)$$ is exponential and doesn't correspond to any DP operation.