23-03 — Dynamic Programming for MDPs

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-03 Prerequisites: 23-01 — MDPs, 23-02 — Bellman Equations Next subject: 23-04 — Monte Carlo Methods

Learning Objectives

By the end of this subject, you will be able to:

Implement policy evaluation and explain why it converges
State the policy improvement theorem and prove that greedy improvement yields a policy at least as good as the current one, and strictly better unless the current policy is already optimal
Combine policy evaluation and improvement into policy iteration and analyze its convergence
Derive value iteration from the Bellman optimality equation and explain when to use it vs. policy iteration
Understand the computational complexity of DP methods and their limitations

Core Content

Dynamic Programming Requirements

Dynamic Programming (DP) for MDPs requires a perfect model of the environment — you must know $P(s' \mid s, a)$ and $R(s, a, s')$ for all transitions. DP is thus a planning method, not a learning method. The key idea: use the Bellman equations to iteratively compute value functions.

DP assumes finite state and action spaces (or at least that you can iterate over them). For continuous spaces, function approximation is needed (covered in 23-07 through 23-10).

Policy Evaluation (Prediction)

Goal: Given a policy $\pi$, compute $V^\pi$.

The algorithm iteratively applies the Bellman operator:

$$V_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\big[R(s, a, s') + \gamma V_k(s')\big]$$

Initialization: $V_0(s) = 0$ for all $s$ (arbitrary — convergence is guaranteed from any starting point).

Convergence: Since $\mathcal{T}^\pi$ is a $\gamma$-contraction, $\|V_k - V^\pi\|_\infty \leq \gamma^k \|V_0 - V^\pi\|_\infty \to 0$ as $k \to \infty$.

In practice, we stop when $\max_s |V_{k+1}(s) - V_k(s)| < \theta$ for some small threshold $\theta$.
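The update and stopping rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the nested-list layout `P[s][a][s2]` / `R[s][a][s2]`, the function name, and the tiny two-state MDP are all assumptions made for the example.

```python
def policy_evaluation(P, R, pi, gamma, theta=1e-10):
    """Synchronous iterative policy evaluation.

    P[s][a][s2] : transition probability P(s2 | s, a)
    R[s][a][s2] : reward R(s, a, s2)
    pi[s][a]    : probability that the policy picks action a in state s
    """
    n_states = len(P)
    V = [0.0] * n_states  # V_0(s) = 0 for all s
    while True:
        V_new = []
        for s in range(n_states):
            v = 0.0
            for a in range(len(P[s])):
                for s2 in range(n_states):
                    v += pi[s][a] * P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
            V_new.append(v)
        # Stop when the largest change in a sweep drops below theta.
        delta = max(abs(V_new[s] - V[s]) for s in range(n_states))
        V = V_new
        if delta < theta:
            return V


# Hypothetical MDP: state 1 is absorbing with zero reward.
# In state 0, action 0 stays put (reward 1) and action 1 moves to state 1 (reward 0).
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[[1.0, 0.0], [0.0, 0.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]  # uniform random policy

V = policy_evaluation(P, R, pi, gamma=0.5)
print(V)
```

For this toy model the Bellman equation can be solved by hand: $V(0) = 0.5\,(1 + 0.5\,V(0))$, so $V(0) = 2/3$ and $V(1) = 0$, which is what the loop approaches.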

⚠️ CRITICAL: Policy evaluation is a sweep over all states. Each iteration is $O(|\mathcal{S}|^2 |\mathcal{A}|)$ — quadratic in states, linear in actions. This is why DP only works for small MDPs.

Two implementations:

Synchronous: All $V_{k+1}(s)$ computed from $V_k$, then replace the entire array. Uses two arrays.
In-place (asynchronous): Update values one state at a time, immediately reusing the latest values of successor states. This usually needs fewer sweeps than the synchronous version and still converges to $V^\pi$. Uses one array.
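To make the difference between the two implementations concrete, here is a small sketch that counts sweeps for both variants. The chain MDP, the `(prob, next_state, reward)` triple format, and the function name `evaluate` are assumptions chosen for illustration.

```python
def evaluate(P, policy, gamma, order, in_place, theta=1e-8):
    """Iterative policy evaluation for a deterministic policy.

    P[s][a] is a list of (prob, next_state, reward) triples.
    Returns the value estimates and the number of sweeps performed.
    """
    V = {s: 0.0 for s in P}
    sweeps = 0
    while True:
        # Synchronous: read from a frozen copy. In-place: read from V itself.
        source = V if in_place else dict(V)
        delta = 0.0
        for s in order:
            new_v = sum(p * (r + gamma * source[s2])
                        for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        sweeps += 1
        if delta < theta:
            return V, sweeps


# Hypothetical chain 0 -> 1 -> 2 -> 3; state 3 is absorbing with zero reward.
# Entering state 3 pays reward 1.
P = {
    0: {"next": [(1.0, 1, 0.0)]},
    1: {"next": [(1.0, 2, 0.0)]},
    2: {"next": [(1.0, 3, 1.0)]},
    3: {"next": [(1.0, 3, 0.0)]},
}
policy = {s: "next" for s in P}
order = [3, 2, 1, 0]  # sweep from the end of the chain backwards

V_sync, sweeps_sync = evaluate(P, policy, 0.9, order, in_place=False)
V_inpl, sweeps_inpl = evaluate(P, policy, 0.9, order, in_place=True)
print(sweeps_sync, sweeps_inpl)
```

Both variants reach the same values. The in-place version finishes in fewer sweeps here only because the sweep order follows the direction in which value information propagates; sweeping this chain in the order `[0, 1, 2, 3]` would remove that advantage.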

Policy Improvement

Goal: Given $V^\pi$ (or an approximation), produce a better policy $\pi'$.

The policy improvement theorem states: if for all $s$:

$$Q^\pi(s, \pi'(s)) \geq V^\pi(s)$$

then $\pi'$ is at least as good as $\pi$: $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$.

Proof sketch:

$$\begin{aligned}
V^\pi(s) &\leq Q^\pi(s, \pi'(s)) \\
&= \mathbb{E}[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t=s, A_t=\pi'(s)] \\
&\leq \mathbb{E}[R_{t+1} + \gamma Q^\pi(S_{t+1}, \pi'(S_{t+1})) \mid \ldots] \\
&\leq \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 Q^\pi(S_{t+2}, \pi'(S_{t+2})) \mid \ldots] \\
&\;\;\vdots \\
&\leq V^{\pi'}(s)
\end{aligned}$$

The greedy policy improvement constructs:

$$\pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} P(s' \mid s, a)[R(s, a, s') + \gamma V^\pi(s')]$$

If $\pi' = \pi$, then the Bellman optimality equation is satisfied and $\pi$ is optimal.
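The greedy step can be sketched as a one-step lookahead over the model. The dictionary layout `P[s][a] = [(prob, next_state, reward), ...]`, the helper names, and the two-state MDP below are hypothetical, introduced only for this example.

```python
def q_value(P, V, s, a, gamma):
    """One-step lookahead: expected reward plus discounted value of the successor."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def greedy_policy(P, V, gamma):
    """Pick, in every state, an action that maximizes the one-step lookahead.

    Ties are broken in favor of the action listed first.
    """
    return {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}


# Hypothetical MDP: "stay" in x pays 1, "go" moves to y for free, "stay" in y pays 3.
P = {
    "x": {"stay": [(1.0, "x", 1.0)], "go": [(1.0, "y", 0.0)]},
    "y": {"stay": [(1.0, "y", 3.0)]},
}
gamma = 0.5

# Value of the policy that always chooses "stay": V(x) = 1/(1-0.5), V(y) = 3/(1-0.5).
V_pi = {"x": 2.0, "y": 6.0}

new_policy = greedy_policy(P, V_pi, gamma)
print(new_policy)
```

In state `x` the lookahead gives $Q(x, \text{stay}) = 2$ and $Q(x, \text{go}) = 3$, so the greedy policy switches to `go`, in line with the theorem's condition $Q^\pi(s, \pi'(s)) \geq V^\pi(s)$.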

Policy Iteration

Algorithm:

1. Initialization: $\pi_0$ arbitrary, $V$ arbitrary
2. Policy Evaluation: Compute $V^{\pi_k}$ (solve or iterate until convergence)
3. Policy Improvement: $\pi_{k+1}(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)[R(s, a, s') + \gamma V^{\pi_k}(s')]$
4. If $\pi_{k+1} = \pi_k$, stop (optimal policy found). Else, go to 2.

Convergence: For finite MDPs with $|\mathcal{A}|$ finite, policy iteration terminates in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (though in practice it's much faster — often just a handful of iterations). Each iteration strictly improves the policy unless it's already optimal.

Why it works: Each improvement step produces a strictly better policy (unless optimal). With finitely many deterministic policies ($|\mathcal{A}|^{|\mathcal{S}|}$), the process must terminate at the optimum.
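The loop described in this section can be sketched as follows. This is an illustrative version under stated assumptions: the model format `P[s][a] = [(prob, next_state, reward), ...]`, the function names, and the two-state MDP are hypothetical, and evaluation is done iteratively rather than by solving the linear system.

```python
def evaluate(P, policy, gamma, theta=1e-12):
    """In-place iterative evaluation of a deterministic policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V


def improve(P, V, gamma):
    """Greedy policy with respect to V (ties go to the first listed action)."""
    def q(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
    return {s: max(P[s], key=lambda a: q(s, a)) for s in P}


def policy_iteration(P, gamma):
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary start: first action
    history = [policy]
    while True:
        V = evaluate(P, policy, gamma)
        new_policy = improve(P, V, gamma)
        if new_policy == policy:  # policy is stable, so it is greedy w.r.t. its own value
            return policy, V, history
        policy = new_policy
        history.append(policy)


# Hypothetical MDP used only for this sketch.
P = {
    "x": {"stay": [(1.0, "x", 1.0)], "go": [(1.0, "y", 0.0)]},
    "y": {"stay": [(1.0, "y", 3.0)]},
}
final_policy, final_V, history = policy_iteration(P, gamma=0.5)
print(history)
print(final_V)
```

On this model the initial policy stays in `x`, one improvement step switches it to `go`, and the next improvement step changes nothing, so the loop stops after visiting two policies.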

Value Iteration

Value iteration combines policy evaluation and improvement into a single update, using the Bellman optimality operator:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a)\big[R(s, a, s') + \gamma V_k(s')\big]$$

This is policy evaluation truncated to one sweep, followed immediately by policy improvement. It's equivalent to applying $\mathcal{T}^*$ repeatedly.

Once $V$ has converged (to $V^*$), extract the optimal policy:

$$\pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)[R(s, a, s') + \gamma V^*(s')]$$

Convergence: Since $\mathcal{T}^*$ is also a $\gamma$-contraction, value iteration converges to $V^*$ from any starting point:

$$\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$$
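The geometric error bound can be checked numerically on a model small enough to solve by hand. The sketch below assumes a hypothetical two-state MDP in the format `P[s][a] = [(prob, next_state, reward), ...]`, whose optimal values $V^*(x) = 3$ and $V^*(y) = 6$ were worked out manually for $\gamma = 0.5$.

```python
def bellman_optimality_update(P, V, gamma):
    """One synchronous application of the Bellman optimality operator."""
    return {
        s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
               for outcomes in P[s].values())
        for s in P
    }


# Hypothetical MDP used only for this check.
P = {
    "x": {"stay": [(1.0, "x", 1.0)], "go": [(1.0, "y", 0.0)]},
    "y": {"stay": [(1.0, "y", 3.0)]},
}
gamma = 0.5
V_star = {"x": 3.0, "y": 6.0}  # fixed point of the update, found by hand


def max_norm_error(V):
    return max(abs(V[s] - V_star[s]) for s in P)


V = {s: 0.0 for s in P}
initial_error = max_norm_error(V)

rows = []
for k in range(1, 11):
    V = bellman_optimality_update(P, V, gamma)
    rows.append((k, max_norm_error(V), gamma ** k * initial_error))

for k, err, bound in rows:
    print(k, err, bound)  # the error never exceeds the bound
```

In this particular model the bound happens to be tight, because the error in state `y` shrinks by exactly a factor of $\gamma$ per sweep. In general the observed error can be well below the bound.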

Policy Iteration vs. Value Iteration

| Aspect | Policy Iteration | Value Iteration |
|---|---|---|
| Inner loop | Full policy evaluation (many sweeps) | One sweep per iteration |
| Outer iterations | Few (policies are discrete) | Many (value converges continuously) |
| Per-iteration cost | High (solving linear system or many sweeps) | Lower (one Bellman update) |
| When to use | Small state spaces, evaluations are cheap | Large state spaces or when approximate evaluation is acceptable |
| Typical behavior | Converges in < 20 policy iterations | May need thousands of value iterations |

⚠️ Common Pitfall: "Value iteration" is NOT the same as "iteratively computing values." Both policy evaluation and value iteration involve iterative updates — the difference is whether you use the policy's probability distribution (evaluation) or the max (value iteration).

Generalized Policy Iteration (GPI)

Most modern RL algorithms are instances of Generalized Policy Iteration: interleave evaluation and improvement at any granularity. Pure policy iteration evaluates fully; value iteration evaluates for one step; actor-critic methods (23-08) evaluate and improve simultaneously.

The key insight: as long as both processes continue (evaluation makes values more accurate, improvement makes the policy greedier), the process converges to optimality.
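One way to see the "any granularity" idea is to make the number of evaluation sweeps per improvement step a parameter. The sketch below is an illustration under assumptions: the function name, the `eval_sweeps` parameter, the model format `P[s][a] = [(prob, next_state, reward), ...]`, and the two-state MDP are all hypothetical.

```python
def q_value(P, V, s, a, gamma):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def generalized_policy_iteration(P, gamma, eval_sweeps, theta=1e-9):
    """Alternate greedy improvement with a fixed number of evaluation sweeps."""
    V = {s: 0.0 for s in P}
    rounds = 0
    while True:
        V_old = dict(V)
        # Improvement: act greedily with respect to the current estimate.
        policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
                  for s in P}
        # Partial evaluation: a few synchronous sweeps under that policy.
        for _ in range(eval_sweeps):
            V = {s: q_value(P, V, s, policy[s], gamma) for s in P}
        rounds += 1
        if max(abs(V[s] - V_old[s]) for s in P) < theta:
            return policy, V, rounds


# Hypothetical MDP used only for this sketch.
P = {
    "x": {"stay": [(1.0, "x", 1.0)], "go": [(1.0, "y", 0.0)]},
    "y": {"stay": [(1.0, "y", 3.0)]},
}

results = {m: generalized_policy_iteration(P, 0.5, eval_sweeps=m)
           for m in (1, 5, 50)}
for m, (policy, V, rounds) in results.items():
    print(m, policy, V, rounds)
```

With `eval_sweeps=1` each round is exactly a value iteration update; with a large value each round approximates a full policy iteration step. All three settings end at the same policy and values on this model, differing only in how many rounds they take.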

Key Terms

Generalized Policy Iteration

Worked Examples

Example 1: Policy Evaluation by Hand

MDP with 2 states, 1 action. $P = \begin{pmatrix} 0.5 & 0.5 \\ 0 & 1 \end{pmatrix}$, $R = [10, -1]$ (the reward received on every step taken from $s_1$ and $s_2$ respectively), $\gamma = 0.9$. Start with $V_0 = [0, 0]$. Perform 3 iterations of synchronous policy evaluation.

Solution:

Iteration 1: $V_1(s_1) = 0.5[10 + 0.9(0)] + 0.5[10 + 0.9(0)] = 5 + 5 = 10$ $V_1(s_2) = 0[-1 + 0.9(0)] + 1[-1 + 0.9(0)] = -1$

Iteration 2: $V_2(s_1) = 0.5[10 + 0.9(10)] + 0.5[10 + 0.9(-1)] = 0.5(19) + 0.5(9.1) = 9.5 + 4.55 = 14.05$ $V_2(s_2) = 0[-1 + 0.9(10)] + 1[-1 + 0.9(-1)] = -1.9$

Iteration 3: $V_3(s_1) = 0.5[10 + 0.9(14.05)] + 0.5[10 + 0.9(-1.9)] = 0.5(22.645) + 0.5(8.29) = 15.47$ $V_3(s_2) = -1 + 0.9(-1.9) = -2.71$

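Answer:

$V_3 = [15.47, -2.71]$. The true value function is $V^\pi = [10, -10]$: $s_2$ is absorbing, so $V^\pi(s_2) = -1/(1 - 0.9) = -10$, and $V^\pi(s_1) = 10 + 0.9[0.5\,V^\pi(s_1) + 0.5(-10)]$ gives $0.55\,V^\pi(s_1) = 5.5$, so $V^\pi(s_1) = 10$. After three sweeps the estimates are still far off: $V_k(s_1)$ first overshoots and only falls back toward 10 as $V_k(s_2)$ slowly decays toward $-10$, because $\gamma$ is close to 1.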

Example 2: Policy Iteration — Complete Walkthrough

Two-state MDP with a terminal state: in $s_1$, action $A$ gives reward 5 and stays in $s_1$; action $B$ gives reward 0 and moves to $s_2$. In $s_2$, action $A$ gives reward 10 and moves to the terminal state (value 0); action $B$ gives reward $-1$ and moves back to $s_1$. $\gamma = 0.9$. Start with $\pi_0$: always $A$. Find the optimal policy.

Solution:

Policy $\pi_0$: $A$ in both states.

Evaluate $\pi_0$: $V(s_1) = 5 + 0.9 V(s_1)$ → $0.1 V(s_1) = 5$ → $V(s_1) = 50$ $V(s_2) = 10 + 0.9(0) = 10$

Improve: $Q(s_1, A) = 5 + 0.9(50) = 50$, $Q(s_1, B) = 0 + 0.9(10) = 9$ → keep $A$ ✓ $Q(s_2, A) = 10 + 0.9(0) = 10$, $Q(s_2, B) = -1 + 0.9(50) = 44$ → switch to $B$

Policy $\pi_1$: $A$ in $s_1$, $B$ in $s_2$.

Evaluate $\pi_1$: $V(s_1) = 5 + 0.9 V(s_1)$ → $V(s_1) = 50$ $V(s_2) = -1 + 0.9(50) = 44$

Improve: $Q(s_1, A) = 5 + 0.9(50) = 50$, $Q(s_1, B) = 0 + 0.9(44) = 39.6$ → keep $A$ ✓ $Q(s_2, A) = 10 + 0 = 10$, $Q(s_2, B) = -1 + 0.9(50) = 44$ → keep $B$ ✓

Policy unchanged → optimal!

Answer:

Optimal policy: $\pi^*(s_1) = A$, $\pi^*(s_2) = B$. $V^*(s_1) = 50$, $V^*(s_2) = 44$. Surprisingly, in $s_2$, it's better to take action $B$ (immediate $-1$) to get back to $s_1$ where you can earn 5 per step forever.

Example 3: Value Iteration

Same MDP as Example 2. Perform value iteration starting from $V_0 = [0, 0]$.

Solution:

Iteration 1: $V_1(s_1) = \max\{5 + 0.9(0),\; 0 + 0.9(0)\} = 5$ $V_1(s_2) = \max\{10 + 0.9(0),\; -1 + 0.9(0)\} = 10$

Iteration 2: $V_2(s_1) = \max\{5 + 0.9(5),\; 0 + 0.9(10)\} = \max\{9.5, 9\} = 9.5$ $V_2(s_2) = \max\{10 + 0,\; -1 + 0.9(5)\} = \max\{10, 3.5\} = 10$

Iteration 3: $V_3(s_1) = \max\{5 + 0.9(9.5),\; 0 + 0.9(10)\} = \max\{13.55, 9\} = 13.55$ $V_3(s_2) = \max\{10,\; -1 + 0.9(9.5)\} = \max\{10, 7.55\} = 10$

...continuing, $V(s_1)$ converges to 50, $V(s_2)$ converges to 44.

Answer:

Value iteration converges to $V^* = [50, 44]$, matching the policy iteration result. The greedy policy extracted from $V^*$ matches the optimal policy found above.
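The hand computation in this example can be reproduced with a short script. The dictionary encoding of the MDP and the helper names `backup` and `sweep` are choices made here for illustration; the rewards, transitions, and $\gamma$ are the ones stated in the example.

```python
GAMMA = 0.9

# P[s][a] = (reward, next_state); all transitions in this example are deterministic.
P = {
    "s1": {"A": (5.0, "s1"), "B": (0.0, "s2")},
    "s2": {"A": (10.0, "terminal"), "B": (-1.0, "s1")},
}


def backup(V, s, a):
    reward, nxt = P[s][a]
    return reward + GAMMA * V.get(nxt, 0.0)  # the terminal state has value 0


def sweep(V):
    """One synchronous value iteration update over both states."""
    return {s: max(backup(V, s, a) for a in P[s]) for s in P}


V = {"s1": 0.0, "s2": 0.0}
first_three = []
for _ in range(3):
    V = sweep(V)
    first_three.append(dict(V))
print(first_three)

# Keep sweeping until the largest change is negligible.
while True:
    V_next = sweep(V)
    delta = max(abs(V_next[s] - V[s]) for s in P)
    V = V_next
    if delta < 1e-10:
        break

policy = {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}
print(V, policy)
```

The first three sweeps match the iterations worked out by hand, and the converged values and greedy policy agree with the policy iteration result. Note how many more sweeps are needed here than the two policy evaluations used in Example 2.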

Practice Problems

Problem 1: For the MDP in Example 1, compute the true $V^\pi$ analytically using $(I - \gamma P^\pi)^{-1} R^\pi$ and verify it matches the limit of policy evaluation.

Answer:

$P^\pi = \begin{pmatrix} 0.5 & 0.5 \\ 0 & 1 \end{pmatrix}$, $R^\pi = [10, -1]^T$.

$I - \gamma P^\pi = \begin{pmatrix} 1-0.45 & -0.45 \\ 0 & 1-0.9 \end{pmatrix} = \begin{pmatrix} 0.55 & -0.45 \\ 0 & 0.1 \end{pmatrix}$.

The determinant is $0.55(0.1) = 0.055$, so $(I - \gamma P^\pi)^{-1} = \frac{1}{0.055} \begin{pmatrix} 0.1 & 0.45 \\ 0 & 0.55 \end{pmatrix} = \begin{pmatrix} 1.818 & 8.182 \\ 0 & 10 \end{pmatrix}$.

$V^\pi = \begin{pmatrix} 1.818 & 8.182 \\ 0 & 10 \end{pmatrix} \begin{pmatrix} 10 \\ -1 \end{pmatrix} = \begin{pmatrix} 18.18 - 8.18 \\ -10 \end{pmatrix} = \begin{pmatrix} 10 \\ -10 \end{pmatrix}$.

So $V^\pi = [10, -10]$. Check against the Bellman equation: $V^\pi(s_2) = -1 + 0.9(-10) = -10$ and $V^\pi(s_1) = 10 + 0.9[0.5(10) + 0.5(-10)] = 10$. This is the limit of iterative policy evaluation; the iterates $V_k(s_1) = 10, 14.05, 15.47, \ldots$ rise at first and then fall back to 10 as $V_k(s_2)$ approaches $-10$.
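As a numerical cross-check of this problem, the sketch below solves the $2 \times 2$ system $(I - \gamma P^\pi) V = R^\pi$ with Cramer's rule and compares the result with the limit of iterative evaluation. The variable names are illustrative; the matrix, rewards, and discount are the ones given in Example 1.

```python
gamma = 0.9
P = [[0.5, 0.5],
     [0.0, 1.0]]
R = [10.0, -1.0]

# Entries of A = I - gamma * P.
a11 = 1.0 - gamma * P[0][0]
a12 = -gamma * P[0][1]
a21 = -gamma * P[1][0]
a22 = 1.0 - gamma * P[1][1]

# Solve A v = R for a 2x2 system with Cramer's rule.
det = a11 * a22 - a12 * a21
v1 = (a22 * R[0] - a12 * R[1]) / det
v2 = (a11 * R[1] - a21 * R[0]) / det
print(v1, v2)

# Compare with many sweeps of synchronous policy evaluation.
V = [0.0, 0.0]
for _ in range(500):
    V = [R[s] + gamma * sum(P[s][s2] * V[s2] for s2 in range(2))
         for s in range(2)]
print(V)
```

Both computations give $V^\pi(s_1) = 10$ and $V^\pi(s_2) = -10$ up to floating-point error.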

Problem 2: Prove that if the policy improvement step produces $\pi' = \pi$, then $\pi$ satisfies the Bellman optimality equation and is therefore optimal.

Answer:

If $\pi' = \pi$, then for all $s$: $\pi(s) = \arg\max_a Q^\pi(s, a)$. This means: $V^\pi(s) = Q^\pi(s, \pi(s)) = \max_a Q^\pi(s, a) = \max_a \sum_{s'} P(s' \mid s, a)[R + \gamma V^\pi(s')]$ which is exactly the Bellman optimality equation. Since $V^*$ is the unique solution, $V^\pi = V^*$ and $\pi$ is optimal.

Problem 3: In policy iteration, why does each iteration strictly improve the policy (unless already optimal)? Give a proof sketch.

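Answer:

By the policy improvement theorem, the greedy policy satisfies $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$. If there is any state $s$ where $Q^\pi(s, \pi'(s)) > V^\pi(s)$, the first inequality in the proof of the theorem is strict at that state, so $V^{\pi'}(s) > V^\pi(s)$ for at least one $s$. Because there are only finitely many deterministic policies and strict improvement means no policy can be visited twice, the process must terminate. It terminates only when greedy improvement changes nothing, at which point $V^\pi$ satisfies the Bellman optimality equation and $\pi$ is optimal.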

Problem 4: Compare the per-iteration computational cost of policy evaluation (one sweep) vs. value iteration for an MDP with $|\mathcal{S}| = n$, $|\mathcal{A}| = m$, branching factor $b$ (max number of possible next states).

Answer:

Policy evaluation sweep: For each state $s$ (n states), sum over actions (m) and next states (up to b): $O(n m b)$. Value iteration: For each state (n), for each action (m), sum over next states (b), then take max: $O(n m b)$. **Same per-sweep cost!** The difference is that policy evaluation may need many sweeps per policy iteration, while value iteration does exactly one sweep per iteration. However, policy iteration typically needs far fewer outer iterations.

Problem 5: A value iteration update is $V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a)[R + \gamma V_k(s')]$. Show that this is equivalent to one step of greedy policy improvement with respect to $V_k$, followed by one sweep of policy evaluation under the resulting greedy policy.

Answer:

Let $\pi_k(s) = \arg\max_a \sum_{s'} P(s' \mid s,a)[R + \gamma V_k(s')]$. Then $V_{k+1}(s) = \sum_{s'} P(s' \mid s, \pi_k(s))[R + \gamma V_k(s')]$. This is exactly: (1) form the greedy policy from $V_k$ (improvement), (2) apply one Bellman update using that policy (one-step evaluation). So value iteration = truncated policy iteration where evaluation is exactly one sweep.

Summary

Key takeaways:

Dynamic programming solves MDPs when you have a perfect model of $P$ and $R$
Policy evaluation applies the Bellman operator iteratively — guaranteed to converge because $\mathcal{T}^\pi$ is a $\gamma$-contraction
The policy improvement theorem guarantees that a greedy policy is at least as good as the current one, with strict improvement unless optimal
Policy iteration alternates full evaluation and greedy improvement, terminating at optimality in finite steps
Value iteration combines evaluation and improvement into a single max-update, also converging to $V^*$
Generalized Policy Iteration (GPI) is the unifying framework — interleave evaluation and improvement at any granularity
DP is $O(|\mathcal{S}|^2 |\mathcal{A}|)$ per sweep — intractable for large state spaces, motivating model-free methods

Quiz

Question 1: What is the key requirement for applying dynamic programming to an MDP?

A. The state space must be continuous
B. A perfect model of $P(s' \mid s,a)$ and $R(s,a,s')$ must be known
C. The policy must be stochastic
D. $\gamma$ must equal 1

Correct Answer: B