23-02 — Bellman Equations
Phase: 23 — Reinforcement Learning Mathematics Subject: 23-02 Prerequisites: 23-01 — MDPs, Phase 10 (Probability Theory) Next subject: 23-03 — Dynamic Programming for MDPs
Learning Objectives
By the end of this subject, you will be able to:
- Derive the Bellman expectation equation for $V^\pi$ and $Q^\pi$ from first principles
- State and apply the Bellman optimality equations
- Explain the relationship between $V^\pi$ and $Q^\pi$ via one-step lookahead
- Understand why the Bellman operator is a contraction and why this guarantees convergence
- Use the Bellman equations to solve small MDPs by hand
Core Content
The Bellman Expectation Equation for $V^\pi$
Starting from the definition $V^\pi(s) = \mathbb{E}\pi[G_t \mid S_t = s]$ and using $G_t = R{t+1} + \gamma G_{t+1}$:
$$V^\pi(s) = \mathbb{E}\pi[R{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$$
Expanding the expectation over actions and next states yields the Bellman expectation equation:
$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \big[R(s, a, s') + \gamma V^\pi(s')\big]$$
⚠️ CRITICAL: This is arguably the most important equation in RL. It says: the value of a state equals the expected immediate reward plus the discounted expected value of the next state. Every value-based RL algorithm is built on this recursion.
Matrix form (for finite MDPs with $n$ states):
$$\mathbf{V}^\pi = \mathbf{R}^\pi + \gamma \mathbf{P}^\pi \mathbf{V}^\pi$$
where $R^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)R(s, a, s')$ is the expected immediate reward under $\pi$, and $P^\pi(s, s') = \sum_a \pi(a \mid s) P(s' \mid s, a)$ is the state-transition probability under $\pi$.
Solving: $\mathbf{V}^\pi = (I - \gamma \mathbf{P}^\pi)^{-1} \mathbf{R}^\pi$. This requires $O(n^3)$ operations — tractable only for small $n$.
The Bellman Expectation Equation for $Q^\pi$
Similarly, for the action-value function:
$$Q^\pi(s, a) = \mathbb{E}\pi[R{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$
Expanding:
$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')\right]$$
This decomposes the expected return into: (1) the immediate expected reward $R(s, a, s')$, plus (2) the discounted expected future value $\gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$, averaged over next states.
The Relationship Between $V^\pi$ and $Q^\pi$
Two fundamental connections:
$$V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a) \quad \text{(averaging over actions)}$$
$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) [R(s, a, s') + \gamma V^\pi(s')] \quad \text{(one-step lookahead)}$$
These two equations form a couplet that appears everywhere in RL — they let you move between state-values and action-values as needed.
Optimal Value Functions
The optimal state-value function $V^$ and optimal action-value function $Q^$ are defined as:
$$V^*(s) = \max_\pi V^\pi(s)$$
$$Q^*(s, a) = \max_\pi Q^\pi(s, a)$$
They represent the best possible expected return achievable from state $s$ (or state-action pair $(s, a)$). For any MDP, an optimal policy $\pi^$ exists (possibly non-unique) such that $V^{\pi^} = V^$ and $Q^{\pi^} = Q^*$.
⚠️ CRITICAL: For finite MDPs, a deterministic optimal policy always exists. There's always an action in each state that achieves (or ties for) the maximum. This is not generally true for partially observable settings (POMDPs).
The Bellman Optimality Equations
The Bellman optimality equation for $V^*$:
$$V^(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \big[R(s, a, s') + \gamma V^(s')\big]$$
And for $Q^*$:
$$Q^(s, a) = \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \left[R(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^(s', a')\right]$$
Notice the crucial difference from the expectation equations: we use $\max_a$ instead of $\sum_a \pi(a \mid s)$. The optimal value is achieved by always taking the best action.
The Bellman Operator Is a Contraction
Define the Bellman operator $\mathcal{T}^\pi$ for a given policy:
$$(\mathcal{T}^\pi V)(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)[R(s, a, s') + \gamma V(s')]$$
$\mathcal{T}^\pi$ is a contraction mapping in the max-norm $|V|_\infty = \max_s |V(s)|$:
$$|\mathcal{T}^\pi V_1 - \mathcal{T}^\pi V_2|\infty \leq \gamma |V_1 - V_2|\infty$$
Proof:
$|(\mathcal{T}^\pi V_1)(s) - (\mathcal{T}^\pi V_2)(s)|$ $= |\sum_{a,s'} \pi(a \mid s)P(s' \mid s, a) \gamma[V_1(s') - V_2(s')]|$ $\leq \gamma \sum_{a,s'} \pi(a \mid s)P(s' \mid s, a)|V_1(s') - V_2(s')|$ $\leq \gamma \sum_{a,s'} \pi(a \mid s)P(s' \mid s, a) |V_1 - V_2|\infty$ $= \gamma |V_1 - V_2|\infty$
Since $| \cdot |_\infty$ is a complete metric space, the Banach fixed-point theorem guarantees that $\mathcal{T}^\pi$ has a unique fixed point — $V^\pi$ — and that iteratively applying $\mathcal{T}^\pi$ converges to it from any starting point. This is the mathematical foundation for policy evaluation (covered in 23-03).
Similarly, the Bellman optimality operator $\mathcal{T}^*$:
$$(\mathcal{T}^* V)(s) = \max_a \sum_{s'} P(s' \mid s, a)[R(s, a, s') + \gamma V(s')]$$
is also a $\gamma$-contraction, with unique fixed point $V^*$.
Key Terms
- Banach fixed-point theorem
- Bellman expectation equation
- Bellman operator
- Bellman optimality operator
- Matrix form
Worked Examples
Example 1: Solving for $V^\pi$ by Hand
Two-state MDP: $\mathcal{S} = {s_1, s_2}$, one action $a$. $P(s_1 \mid s_1, a)=0.7$, $P(s_2 \mid s_1, a)=0.3$; $P(s_1 \mid s_2, a)=0.4$, $P(s_2 \mid s_2, a)=0.6$. $R(s_1, a, \cdot)=5$, $R(s_2, a, \cdot)=-2$. $\gamma = 0.9$. Solve for $V^\pi$.
Solution: Policy fixed (only one action). Bellman equations:
$V(s_1) = 0.7[5 + 0.9 V(s_1)] + 0.3[5 + 0.9 V(s_2)]$ $= 0.7(5) + 0.7(0.9)V(s_1) + 0.3(5) + 0.3(0.9)V(s_2)$ $= 3.5 + 0.63 V(s_1) + 1.5 + 0.27 V(s_2)$ $= 5 + 0.63 V(s_1) + 0.27 V(s_2)$
$V(s_2) = 0.4[-2 + 0.9 V(s_1)] + 0.6[-2 + 0.9 V(s_2)]$ $= -0.8 + 0.36 V(s_1) -1.2 + 0.54 V(s_2)$ $= -2 + 0.36 V(s_1) + 0.54 V(s_2)$
Rearrange: $V(s_1) - 0.63 V(s_1) - 0.27 V(s_2) = 5$ → $0.37 V(s_1) - 0.27 V(s_2) = 5$ $V(s_2) - 0.36 V(s_1) - 0.54 V(s_2) = -2$ → $-0.36 V(s_1) + 0.46 V(s_2) = -2$
Solve: $V(s_1) \approx 21.38$, $V(s_2) \approx 12.38$.
Click for answer
$V(s_1) \approx 21.38$, $V(s_2) \approx 12.38$. Both positive despite $s_2$ giving negative immediate rewards — the ability to transition to $s_1$ (which gives high rewards) elevates $s_2$'s value.Example 2: Bellman Optimality — Simple Grid
Consider a 2-state MDP: $s_1$ → action $a_1$: reward $+10$, always goes to terminal. Action $a_2$: reward $+1$, always goes to $s_2$. $s_2$ → action $a_1$: reward $-5$, 50% terminal, 50% back to $s_1$. Action $a_2$: reward $+2$, always goes to terminal. $\gamma=0.9$. Find $V^(s_1), V^(s_2)$.
Solution: Terminal = absorbing state with value 0. Bellman optimality:
$V^(s_1) = \max{10 + 0.9(0),\; 1 + 0.9 V^(s_2)} = \max{10,\; 1 + 0.9 V^*(s_2)}$
$V^(s_2) = \max{0.5[-5 + 0.9(0)] + 0.5[-5 + 0.9 V^(s_1)],\; 2 + 0.9(0)}$ $= \max{-5 + 0.45 V^*(s_1),\; 2}$
Case 1: $V^(s_2) = 2$. Then $V^(s_1) = \max{10, 1 + 0.9(2)} = \max{10, 2.8} = 10$. Check consistency: $V^*(s_2) = \max{-5 + 0.45(10), 2} = \max{-0.5, 2} = 2$. ✓
Optimal policy: in $s_1$, take $a_1$ (immediate $+10$). In $s_2$, take $a_2$ (immediate $+2$).
Click for answer
$V^*(s_1) = 10$, $V^*(s_2) = 2$. Optimal: $a_1$ in $s_1$, $a_2$ in $s_2$.Example 3: Contraction Proof Verification
Show numerically that applying the Bellman operator reduces the max-norm error by factor $\gamma$. Start with $V_0 = [0, 0, 0]$ for a 3-state MDP with known $V^\pi = [8, 4, 0]$, $\gamma=0.9$.
Solution: The Bellman operator $\mathcal{T}^\pi$ satisfies $|\mathcal{T}^\pi V_k - V^\pi|\infty \leq \gamma |V_k - V^\pi|\infty$. Applying it repeatedly:
$|V_0 - V^\pi|\infty = |[0,0,0] - [8,4,0]|\infty = 8$ After one Bellman update: $|V_1 - V^\pi|\infty \leq 0.9 \times 8 = 7.2$ After $k$ updates: $|V_k - V^\pi|\infty \leq \gamma^k \times 8$
The error decays exponentially. This is why dynamic programming works — each iteration shrinks the error by factor $\gamma$.
Click for answer
The contraction property guarantees exponential convergence: $\|V_k - V^\pi\|_\infty \leq \gamma^k \|V_0 - V^\pi\|_\infty$. This is the mathematical foundation for all iterative value-based RL algorithms.Practice Problems
Problem 1: Write out the Bellman expectation equation for $Q^\pi$ in a deterministic MDP (where $P(s' \mid s, a) = 1$ for exactly one $s'$). How does it simplify?
Answers (click to expand)
In a deterministic MDP, there is exactly one next state $s' = f(s, a)$. Then: $Q^\pi(s, a) = R(s, a, f(s, a)) + \gamma \sum_{a'} \pi(a' \mid f(s, a)) Q^\pi(f(s, a), a')$ The sum over $s'$ collapses to a single term, and $V^\pi(s') = \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$.Problem 2: Prove that the Bellman optimality operator $\mathcal{T}^*$ is a $\gamma$-contraction.
Answers (click to expand)
For any two value functions $V_1, V_2$ and any state $s$: $|(\mathcal{T}^* V_1)(s) - (\mathcal{T}^* V_2)(s)|$ $= |\max_a \sum_{s'} P(s' \mid s, a)[R(s,a,s') + \gamma V_1(s')] - \max_a \sum_{s'} P(s' \mid s, a)[R(s,a,s') + \gamma V_2(s')]|$ Let $a_1^*$ be the optimal action for $V_1$ at $s$: $\leq |\sum_{s'} P(s' \mid s, a_1^*)[R + \gamma V_1(s')] - \sum_{s'} P(s' \mid s, a_1^*)[R + \gamma V_2(s')]|$ $= \gamma |\sum_{s'} P(s' \mid s, a_1^*)[V_1(s') - V_2(s')]|$ $\leq \gamma \sum_{s'} P(s' \mid s, a_1^*) \|V_1 - V_2\|_\infty = \gamma \|V_1 - V_2\|_\infty$ Taking max over $s$: $\|\mathcal{T}^* V_1 - \mathcal{T}^* V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$. Done.Problem 3: For the MDP in Example 1, verify your solution by plugging $V(s_1)=21.38$, $V(s_2)=12.38$ back into both Bellman equations.
Answers (click to expand)
$V(s_1) = 5 + 0.63(21.38) + 0.27(12.38) = 5 + 13.47 + 3.34 = 21.81$ (small rounding difference, the exact solution is $V(s_1) = 600/27.4 \approx 21.90$, $V(s_2) = 340/27.4 \approx 12.41$) $V(s_2) = -2 + 0.36(21.38) + 0.54(12.38) = -2 + 7.70 + 6.69 = 12.39$ ✓Problem 4: An MDP has $V^(s_1) = 100$. Action $a_1$ from $s_1$ gives $R=10$ and goes to $s_2$ where $V^(s_2) = V^(s_2)$ (unknown). Action $a_2$ gives $R=50$ and goes to terminal ($V=0$). $\gamma=0.9$. What is the minimum value of $V^(s_2)$ for $a_1$ to be optimal?
Answers (click to expand)
For $a_1$ to be optimal: $10 + 0.9 V^*(s_2) \geq 50 + 0.9(0)$ $10 + 0.9 V^*(s_2) \geq 50$ $0.9 V^*(s_2) \geq 40$ $V^*(s_2) \geq 44.\overline{4}$ So $V^*(s_2)$ must be at least 44.45 for the agent to forgo the immediate 50 and take action $a_1$ instead.Problem 5: Derive the relationship between the Bellman expectation and optimality equations. Under what condition does $V^\pi = V^*$?
Answers (click to expand)
$V^\pi = V^*$ when $\pi$ is an optimal policy, i.e., $\pi(a \mid s) > 0$ only if $a \in \arg\max_a Q^*(s, a)$. The Bellman expectation equation uses $\sum_a \pi(a \mid s)$, while the optimality equation uses $\max_a$. When $\pi$ puts all probability on the maximizer, $\sum_a \pi(a \mid s) Q^*(s, a) = \max_a Q^*(s, a)$, and $V^\pi$ satisfies the optimality equation.Summary
Key takeaways:
- The Bellman expectation equation decomposes $V^\pi$ into immediate reward plus discounted future value: $V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)[R + \gamma V^\pi(s')]$
- The Bellman optimality equation replaces the policy average with a max: $V^(s) = \max_a \sum_{s'} P(s' \mid s, a)[R + \gamma V^(s')]$
- The Bellman operator $\mathcal{T}^\pi$ is a $\gamma$-contraction in max-norm — this guarantees convergence of iterative methods
- $V^\pi$ and $Q^\pi$ are connected via one-step lookahead: $Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a)[R + \gamma V^\pi(s')]$
- For finite MDPs, a deterministic optimal policy always exists
- The contraction property is the mathematical reason dynamic programming works — errors shrink exponentially
Quiz
Question 1: The Bellman expectation equation for $V^\pi$ expresses $V^\pi(s)$ as:
A. $\max_a \sum_{s'} P(s' \mid s,a)[R + \gamma V^\pi(s')]$ B. $\sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a)[R + \gamma V^\pi(s')]$ C. $\frac{1}{|\mathcal{A}|} \sum_a Q^\pi(s,a)$ D. $\mathbb{E}[R_{t+1}]$ only
Correct Answer: B
Explanation (click to expand)
- If you chose B: Correct. This is the full Bellman expectation equation — expectation over policy actions and environment transitions. - If you chose A: That's the Bellman *optimality* equation (max over actions). - If you chose C: This only holds for a uniform random policy. - If you chose D: Missing the discounted future value term — the whole point of the Bellman equation.Question 2: Why is the Bellman operator being a contraction important?
A. It guarantees the policy is deterministic B. It guarantees that iteratively applying the operator converges to the true value function C. It ensures all states have equal value D. It makes the reward function differentiable
Correct Answer: B
Explanation (click to expand)
- If you chose B: Correct. The Banach fixed-point theorem says a contraction on a complete metric space has a unique fixed point and repeated application converges to it. - If you chose A: Contraction relates to value functions, not policy determinism. - If you chose C: No — different states generally have different values. - If you chose D: The Bellman operator's contraction property has nothing to do with differentiability.Question 3: In the Bellman optimality equation for $Q^$, the term $\max_{a'} Q^(s', a')$ replaces:
A. $R(s, a, s')$ B. $\gamma$ C. $\sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$ D. $P(s' \mid s, a)$
Correct Answer: C
Explanation (click to expand)
- If you chose C: Correct. In the expectation equation, future value is $\sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$ — the expected Q under the policy. In the optimality equation, we take the max — because the optimal policy always picks the best action. - If you chose A: $R(s,a,s')$ is the immediate reward and remains unchanged. - If you chose B: $\gamma$ is the discount factor and is unchanged between expectation and optimality. - If you chose D: Transition probabilities are properties of the environment and are unchanged.Question 4: What is the contraction factor for the Bellman operator $\mathcal{T}^\pi$?
A. 1 B. $\frac{1}{1-\gamma}$ C. $\gamma$ D. $\gamma^2$
Correct Answer: C
Explanation (click to expand)
- If you chose C: Correct. $\|\mathcal{T}^\pi V_1 - \mathcal{T}^\pi V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$. The discount factor *is* the contraction modulus. - If you chose A: A modulus of 1 would be non-expansive, not contractive — convergence is not guaranteed. - If you chose B: $1/(1-\gamma)$ is the horizon scaling factor (sum of geometric series), not the contraction modulus. - If you chose D: $\gamma^2$ would be too strong — the proof shows it's exactly $\gamma$.Question 5: For a finite MDP, which of the following is always true?
A. There is a unique optimal value function $V^$ but the optimal policy may not be unique B. There is a unique optimal policy C. The optimal policy must be stochastic D. $V^ = V^\pi$ for all policies $\pi$
Correct Answer: A
Explanation (click to expand)
- If you chose A: Correct. $V^*$ is unique (it's the fixed point of the contraction $\mathcal{T}^*$). But multiple policies can achieve $V^*$ — e.g., two actions with identical $Q^*$ values. - If you chose B: Multiple policies can be optimal — any tie-breaking among equally-valued actions works. - If you chose C: A deterministic optimal policy always exists for finite MDPs. - If you chose D: Only optimal policies achieve $V^*$; suboptimal policies have strictly lower values in at least one state.Next Steps
Next up: 23-03 — Dynamic Programming for MDPs — where you'll learn policy evaluation, policy improvement, and value iteration: the algorithms that solve MDPs when you have a perfect model.
Pitfalls
-
Confusing the Bellman expectation equation with the optimality equation. The expectation equation averages over the policy: $$\sum_a \pi(a \mid s)[\cdots]$$. The optimality equation takes the max: $$\max_a[\cdots]$$. Using the wrong one means you're solving a different problem. The expectation equation evaluates a given policy; the optimality equation finds the best policy.
-
Forgetting that the contraction proof requires γ < 1. The Bellman operator $$\mathcal{T}^\pi$$ is a γ-contraction only when γ < 1. If γ = 1, the modulus is 1 (non-expansive), and Banach's fixed-point theorem does not guarantee convergence. This is why continuing tasks strictly require discounting for theoretical convergence guarantees.
-
Misapplying the matrix solution $$(I - \gamma P^\pi)^{-1} R^\pi$$. This closed-form solution requires $$O(n^3)$$ operations and is numerically unstable for large n. It's useful for understanding but never used in practice for n > 1000. Real algorithms use iterative methods.
-
Assuming the optimal policy must be unique. The optimal value function $$V^$$ is unique (fixed point of $$\mathcal{T}^$$), but multiple policies can achieve it. If two actions at a state have identical $$Q^*(s,a)$$ values, any tie-breaking yields an optimal policy. Don't be surprised when different algorithms find different optimal policies.
Q8: Why does the contraction proof for $$\mathcal{T}^\pi$$ use the max-norm $$|V|_\infty$$ rather than the Euclidean norm $$|V|_2$$?
A) The max-norm is easier to compute B) The contraction property holds in max-norm with modulus γ; it does not necessarily hold in ℓ₂ C) The max-norm is the only norm defined for value functions D) Euclidean norm requires differentiability