Math graphic
📐 Concept diagram

23-02 — Bellman Equations

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-02 Prerequisites: 23-01 — MDPs, Phase 10 (Probability Theory) Next subject: 23-03 — Dynamic Programming for MDPs


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the Bellman expectation equation for $V^\pi$ and $Q^\pi$ from first principles
  2. State and apply the Bellman optimality equations
  3. Explain the relationship between $V^\pi$ and $Q^\pi$ via one-step lookahead
  4. Understand why the Bellman operator is a contraction and why this guarantees convergence
  5. Use the Bellman equations to solve small MDPs by hand

Core Content

The Bellman Expectation Equation for $V^\pi$

Starting from the definition $V^\pi(s) = \mathbb{E}\pi[G_t \mid S_t = s]$ and using $G_t = R{t+1} + \gamma G_{t+1}$:

$$V^\pi(s) = \mathbb{E}\pi[R{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$$

Expanding the expectation over actions and next states yields the Bellman expectation equation:

$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \big[R(s, a, s') + \gamma V^\pi(s')\big]$$

⚠️ CRITICAL: This is arguably the most important equation in RL. It says: the value of a state equals the expected immediate reward plus the discounted expected value of the next state. Every value-based RL algorithm is built on this recursion.

Matrix form (for finite MDPs with $n$ states):

$$\mathbf{V}^\pi = \mathbf{R}^\pi + \gamma \mathbf{P}^\pi \mathbf{V}^\pi$$

where $R^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)R(s, a, s')$ is the expected immediate reward under $\pi$, and $P^\pi(s, s') = \sum_a \pi(a \mid s) P(s' \mid s, a)$ is the state-transition probability under $\pi$.

Solving: $\mathbf{V}^\pi = (I - \gamma \mathbf{P}^\pi)^{-1} \mathbf{R}^\pi$. This requires $O(n^3)$ operations — tractable only for small $n$.

The Bellman Expectation Equation for $Q^\pi$

Similarly, for the action-value function:

$$Q^\pi(s, a) = \mathbb{E}\pi[R{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$

Expanding:

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')\right]$$

This decomposes the expected return into: (1) the immediate expected reward $R(s, a, s')$, plus (2) the discounted expected future value $\gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$, averaged over next states.

The Relationship Between $V^\pi$ and $Q^\pi$

Two fundamental connections:

$$V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a) \quad \text{(averaging over actions)}$$

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) [R(s, a, s') + \gamma V^\pi(s')] \quad \text{(one-step lookahead)}$$

These two equations form a couplet that appears everywhere in RL — they let you move between state-values and action-values as needed.

Optimal Value Functions

The optimal state-value function $V^$ and optimal action-value function $Q^$ are defined as:

$$V^*(s) = \max_\pi V^\pi(s)$$

$$Q^*(s, a) = \max_\pi Q^\pi(s, a)$$

They represent the best possible expected return achievable from state $s$ (or state-action pair $(s, a)$). For any MDP, an optimal policy $\pi^$ exists (possibly non-unique) such that $V^{\pi^} = V^$ and $Q^{\pi^} = Q^*$.

⚠️ CRITICAL: For finite MDPs, a deterministic optimal policy always exists. There's always an action in each state that achieves (or ties for) the maximum. This is not generally true for partially observable settings (POMDPs).

The Bellman Optimality Equations

The Bellman optimality equation for $V^*$:

$$V^(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \big[R(s, a, s') + \gamma V^(s')\big]$$

And for $Q^*$:

$$Q^(s, a) = \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \left[R(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^(s', a')\right]$$

Notice the crucial difference from the expectation equations: we use $\max_a$ instead of $\sum_a \pi(a \mid s)$. The optimal value is achieved by always taking the best action.

The Bellman Operator Is a Contraction

Define the Bellman operator $\mathcal{T}^\pi$ for a given policy:

$$(\mathcal{T}^\pi V)(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)[R(s, a, s') + \gamma V(s')]$$

$\mathcal{T}^\pi$ is a contraction mapping in the max-norm $|V|_\infty = \max_s |V(s)|$:

$$|\mathcal{T}^\pi V_1 - \mathcal{T}^\pi V_2|\infty \leq \gamma |V_1 - V_2|\infty$$

Proof:

$|(\mathcal{T}^\pi V_1)(s) - (\mathcal{T}^\pi V_2)(s)|$ $= |\sum_{a,s'} \pi(a \mid s)P(s' \mid s, a) \gamma[V_1(s') - V_2(s')]|$ $\leq \gamma \sum_{a,s'} \pi(a \mid s)P(s' \mid s, a)|V_1(s') - V_2(s')|$ $\leq \gamma \sum_{a,s'} \pi(a \mid s)P(s' \mid s, a) |V_1 - V_2|\infty$ $= \gamma |V_1 - V_2|\infty$

Since $| \cdot |_\infty$ is a complete metric space, the Banach fixed-point theorem guarantees that $\mathcal{T}^\pi$ has a unique fixed point — $V^\pi$ — and that iteratively applying $\mathcal{T}^\pi$ converges to it from any starting point. This is the mathematical foundation for policy evaluation (covered in 23-03).

Similarly, the Bellman optimality operator $\mathcal{T}^*$:

$$(\mathcal{T}^* V)(s) = \max_a \sum_{s'} P(s' \mid s, a)[R(s, a, s') + \gamma V(s')]$$

is also a $\gamma$-contraction, with unique fixed point $V^*$.



Key Terms

Worked Examples

Example 1: Solving for $V^\pi$ by Hand

Two-state MDP: $\mathcal{S} = {s_1, s_2}$, one action $a$. $P(s_1 \mid s_1, a)=0.7$, $P(s_2 \mid s_1, a)=0.3$; $P(s_1 \mid s_2, a)=0.4$, $P(s_2 \mid s_2, a)=0.6$. $R(s_1, a, \cdot)=5$, $R(s_2, a, \cdot)=-2$. $\gamma = 0.9$. Solve for $V^\pi$.

Solution: Policy fixed (only one action). Bellman equations:

$V(s_1) = 0.7[5 + 0.9 V(s_1)] + 0.3[5 + 0.9 V(s_2)]$ $= 0.7(5) + 0.7(0.9)V(s_1) + 0.3(5) + 0.3(0.9)V(s_2)$ $= 3.5 + 0.63 V(s_1) + 1.5 + 0.27 V(s_2)$ $= 5 + 0.63 V(s_1) + 0.27 V(s_2)$

$V(s_2) = 0.4[-2 + 0.9 V(s_1)] + 0.6[-2 + 0.9 V(s_2)]$ $= -0.8 + 0.36 V(s_1) -1.2 + 0.54 V(s_2)$ $= -2 + 0.36 V(s_1) + 0.54 V(s_2)$

Rearrange: $V(s_1) - 0.63 V(s_1) - 0.27 V(s_2) = 5$ → $0.37 V(s_1) - 0.27 V(s_2) = 5$ $V(s_2) - 0.36 V(s_1) - 0.54 V(s_2) = -2$ → $-0.36 V(s_1) + 0.46 V(s_2) = -2$

Solve: $V(s_1) \approx 21.38$, $V(s_2) \approx 12.38$.

Click for answer $V(s_1) \approx 21.38$, $V(s_2) \approx 12.38$. Both positive despite $s_2$ giving negative immediate rewards — the ability to transition to $s_1$ (which gives high rewards) elevates $s_2$'s value.

Example 2: Bellman Optimality — Simple Grid

Consider a 2-state MDP: $s_1$ → action $a_1$: reward $+10$, always goes to terminal. Action $a_2$: reward $+1$, always goes to $s_2$. $s_2$ → action $a_1$: reward $-5$, 50% terminal, 50% back to $s_1$. Action $a_2$: reward $+2$, always goes to terminal. $\gamma=0.9$. Find $V^(s_1), V^(s_2)$.

Solution: Terminal = absorbing state with value 0. Bellman optimality:

$V^(s_1) = \max{10 + 0.9(0),\; 1 + 0.9 V^(s_2)} = \max{10,\; 1 + 0.9 V^*(s_2)}$

$V^(s_2) = \max{0.5[-5 + 0.9(0)] + 0.5[-5 + 0.9 V^(s_1)],\; 2 + 0.9(0)}$ $= \max{-5 + 0.45 V^*(s_1),\; 2}$

Case 1: $V^(s_2) = 2$. Then $V^(s_1) = \max{10, 1 + 0.9(2)} = \max{10, 2.8} = 10$. Check consistency: $V^*(s_2) = \max{-5 + 0.45(10), 2} = \max{-0.5, 2} = 2$. ✓

Optimal policy: in $s_1$, take $a_1$ (immediate $+10$). In $s_2$, take $a_2$ (immediate $+2$).

Click for answer $V^*(s_1) = 10$, $V^*(s_2) = 2$. Optimal: $a_1$ in $s_1$, $a_2$ in $s_2$.

Example 3: Contraction Proof Verification

Show numerically that applying the Bellman operator reduces the max-norm error by factor $\gamma$. Start with $V_0 = [0, 0, 0]$ for a 3-state MDP with known $V^\pi = [8, 4, 0]$, $\gamma=0.9$.

Solution: The Bellman operator $\mathcal{T}^\pi$ satisfies $|\mathcal{T}^\pi V_k - V^\pi|\infty \leq \gamma |V_k - V^\pi|\infty$. Applying it repeatedly:

$|V_0 - V^\pi|\infty = |[0,0,0] - [8,4,0]|\infty = 8$ After one Bellman update: $|V_1 - V^\pi|\infty \leq 0.9 \times 8 = 7.2$ After $k$ updates: $|V_k - V^\pi|\infty \leq \gamma^k \times 8$

The error decays exponentially. This is why dynamic programming works — each iteration shrinks the error by factor $\gamma$.

Click for answer The contraction property guarantees exponential convergence: $\|V_k - V^\pi\|_\infty \leq \gamma^k \|V_0 - V^\pi\|_\infty$. This is the mathematical foundation for all iterative value-based RL algorithms.

Practice Problems

Problem 1: Write out the Bellman expectation equation for $Q^\pi$ in a deterministic MDP (where $P(s' \mid s, a) = 1$ for exactly one $s'$). How does it simplify?

Answers (click to expand) In a deterministic MDP, there is exactly one next state $s' = f(s, a)$. Then: $Q^\pi(s, a) = R(s, a, f(s, a)) + \gamma \sum_{a'} \pi(a' \mid f(s, a)) Q^\pi(f(s, a), a')$ The sum over $s'$ collapses to a single term, and $V^\pi(s') = \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$.

Problem 2: Prove that the Bellman optimality operator $\mathcal{T}^*$ is a $\gamma$-contraction.

Answers (click to expand) For any two value functions $V_1, V_2$ and any state $s$: $|(\mathcal{T}^* V_1)(s) - (\mathcal{T}^* V_2)(s)|$ $= |\max_a \sum_{s'} P(s' \mid s, a)[R(s,a,s') + \gamma V_1(s')] - \max_a \sum_{s'} P(s' \mid s, a)[R(s,a,s') + \gamma V_2(s')]|$ Let $a_1^*$ be the optimal action for $V_1$ at $s$: $\leq |\sum_{s'} P(s' \mid s, a_1^*)[R + \gamma V_1(s')] - \sum_{s'} P(s' \mid s, a_1^*)[R + \gamma V_2(s')]|$ $= \gamma |\sum_{s'} P(s' \mid s, a_1^*)[V_1(s') - V_2(s')]|$ $\leq \gamma \sum_{s'} P(s' \mid s, a_1^*) \|V_1 - V_2\|_\infty = \gamma \|V_1 - V_2\|_\infty$ Taking max over $s$: $\|\mathcal{T}^* V_1 - \mathcal{T}^* V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$. Done.

Problem 3: For the MDP in Example 1, verify your solution by plugging $V(s_1)=21.38$, $V(s_2)=12.38$ back into both Bellman equations.

Answers (click to expand) $V(s_1) = 5 + 0.63(21.38) + 0.27(12.38) = 5 + 13.47 + 3.34 = 21.81$ (small rounding difference, the exact solution is $V(s_1) = 600/27.4 \approx 21.90$, $V(s_2) = 340/27.4 \approx 12.41$) $V(s_2) = -2 + 0.36(21.38) + 0.54(12.38) = -2 + 7.70 + 6.69 = 12.39$ ✓

Problem 4: An MDP has $V^(s_1) = 100$. Action $a_1$ from $s_1$ gives $R=10$ and goes to $s_2$ where $V^(s_2) = V^(s_2)$ (unknown). Action $a_2$ gives $R=50$ and goes to terminal ($V=0$). $\gamma=0.9$. What is the minimum value of $V^(s_2)$ for $a_1$ to be optimal?

Answers (click to expand) For $a_1$ to be optimal: $10 + 0.9 V^*(s_2) \geq 50 + 0.9(0)$ $10 + 0.9 V^*(s_2) \geq 50$ $0.9 V^*(s_2) \geq 40$ $V^*(s_2) \geq 44.\overline{4}$ So $V^*(s_2)$ must be at least 44.45 for the agent to forgo the immediate 50 and take action $a_1$ instead.

Problem 5: Derive the relationship between the Bellman expectation and optimality equations. Under what condition does $V^\pi = V^*$?

Answers (click to expand) $V^\pi = V^*$ when $\pi$ is an optimal policy, i.e., $\pi(a \mid s) > 0$ only if $a \in \arg\max_a Q^*(s, a)$. The Bellman expectation equation uses $\sum_a \pi(a \mid s)$, while the optimality equation uses $\max_a$. When $\pi$ puts all probability on the maximizer, $\sum_a \pi(a \mid s) Q^*(s, a) = \max_a Q^*(s, a)$, and $V^\pi$ satisfies the optimality equation.

Summary

Key takeaways:


Quiz

Question 1: The Bellman expectation equation for $V^\pi$ expresses $V^\pi(s)$ as:

A. $\max_a \sum_{s'} P(s' \mid s,a)[R + \gamma V^\pi(s')]$ B. $\sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a)[R + \gamma V^\pi(s')]$ C. $\frac{1}{|\mathcal{A}|} \sum_a Q^\pi(s,a)$ D. $\mathbb{E}[R_{t+1}]$ only

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. This is the full Bellman expectation equation — expectation over policy actions and environment transitions. - If you chose A: That's the Bellman *optimality* equation (max over actions). - If you chose C: This only holds for a uniform random policy. - If you chose D: Missing the discounted future value term — the whole point of the Bellman equation.

Question 2: Why is the Bellman operator being a contraction important?

A. It guarantees the policy is deterministic B. It guarantees that iteratively applying the operator converges to the true value function C. It ensures all states have equal value D. It makes the reward function differentiable

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. The Banach fixed-point theorem says a contraction on a complete metric space has a unique fixed point and repeated application converges to it. - If you chose A: Contraction relates to value functions, not policy determinism. - If you chose C: No — different states generally have different values. - If you chose D: The Bellman operator's contraction property has nothing to do with differentiability.

Question 3: In the Bellman optimality equation for $Q^$, the term $\max_{a'} Q^(s', a')$ replaces:

A. $R(s, a, s')$ B. $\gamma$ C. $\sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$ D. $P(s' \mid s, a)$

Correct Answer: C

Explanation (click to expand) - If you chose C: Correct. In the expectation equation, future value is $\sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$ — the expected Q under the policy. In the optimality equation, we take the max — because the optimal policy always picks the best action. - If you chose A: $R(s,a,s')$ is the immediate reward and remains unchanged. - If you chose B: $\gamma$ is the discount factor and is unchanged between expectation and optimality. - If you chose D: Transition probabilities are properties of the environment and are unchanged.

Question 4: What is the contraction factor for the Bellman operator $\mathcal{T}^\pi$?

A. 1 B. $\frac{1}{1-\gamma}$ C. $\gamma$ D. $\gamma^2$

Correct Answer: C

Explanation (click to expand) - If you chose C: Correct. $\|\mathcal{T}^\pi V_1 - \mathcal{T}^\pi V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$. The discount factor *is* the contraction modulus. - If you chose A: A modulus of 1 would be non-expansive, not contractive — convergence is not guaranteed. - If you chose B: $1/(1-\gamma)$ is the horizon scaling factor (sum of geometric series), not the contraction modulus. - If you chose D: $\gamma^2$ would be too strong — the proof shows it's exactly $\gamma$.

Question 5: For a finite MDP, which of the following is always true?

A. There is a unique optimal value function $V^$ but the optimal policy may not be unique B. There is a unique optimal policy C. The optimal policy must be stochastic D. $V^ = V^\pi$ for all policies $\pi$

Correct Answer: A

Explanation (click to expand) - If you chose A: Correct. $V^*$ is unique (it's the fixed point of the contraction $\mathcal{T}^*$). But multiple policies can achieve $V^*$ — e.g., two actions with identical $Q^*$ values. - If you chose B: Multiple policies can be optimal — any tie-breaking among equally-valued actions works. - If you chose C: A deterministic optimal policy always exists for finite MDPs. - If you chose D: Only optimal policies achieve $V^*$; suboptimal policies have strictly lower values in at least one state.

Next Steps

Next up: 23-03 — Dynamic Programming for MDPs — where you'll learn policy evaluation, policy improvement, and value iteration: the algorithms that solve MDPs when you have a perfect model.


Pitfalls

  1. Confusing the Bellman expectation equation with the optimality equation. The expectation equation averages over the policy: $$\sum_a \pi(a \mid s)[\cdots]$$. The optimality equation takes the max: $$\max_a[\cdots]$$. Using the wrong one means you're solving a different problem. The expectation equation evaluates a given policy; the optimality equation finds the best policy.

  2. Forgetting that the contraction proof requires γ < 1. The Bellman operator $$\mathcal{T}^\pi$$ is a γ-contraction only when γ < 1. If γ = 1, the modulus is 1 (non-expansive), and Banach's fixed-point theorem does not guarantee convergence. This is why continuing tasks strictly require discounting for theoretical convergence guarantees.

  3. Misapplying the matrix solution $$(I - \gamma P^\pi)^{-1} R^\pi$$. This closed-form solution requires $$O(n^3)$$ operations and is numerically unstable for large n. It's useful for understanding but never used in practice for n > 1000. Real algorithms use iterative methods.

  4. Assuming the optimal policy must be unique. The optimal value function $$V^$$ is unique (fixed point of $$\mathcal{T}^$$), but multiple policies can achieve it. If two actions at a state have identical $$Q^*(s,a)$$ values, any tie-breaking yields an optimal policy. Don't be surprised when different algorithms find different optimal policies.




Q8: Why does the contraction proof for $$\mathcal{T}^\pi$$ use the max-norm $$|V|_\infty$$ rather than the Euclidean norm $$|V|_2$$?

A) The max-norm is easier to compute B) The contraction property holds in max-norm with modulus γ; it does not necessarily hold in ℓ₂ C) The max-norm is the only norm defined for value functions D) Euclidean norm requires differentiability

Answer and Explanations **Correct: B)** The proof shows $$\| \mathcal{T}^\pi V_1 - \mathcal{T}^\pi V_2 \|_\infty \leq \gamma \|V_1 - V_2\|_\infty$$ using the fact that $$\sum_{a,s'} \pi(a \mid s)P(s' \mid s,a) = 1$$. This exact cancellation works in ℓ∞ but would require a different constant in ℓ₂. - A) Ease of computation isn't the reason — the contraction property must hold in the chosen norm. - C) Many norms can be defined on value functions; ℓ∞ happens to work nicely with the Bellman operator. - D) Norms don't require differentiability; this is unrelated.