📐 Concept diagram

23-02 — Bellman Equations

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-02 Prerequisites: 23-01 — MDPs, Phase 10 (Probability Theory) Next subject: 23-03 — Dynamic Programming for MDPs

Learning Objectives

By the end of this subject, you will be able to:

Derive the Bellman expectation equation for $V^\pi$ and $Q^\pi$ from first principles
State and apply the Bellman optimality equations
Explain the relationship between $V^\pi$ and $Q^\pi$ via one-step lookahead
Understand why the Bellman operator is a contraction and why this guarantees convergence
Use the Bellman equations to solve small MDPs by hand

Core Content

The Bellman Expectation Equation for $V^\pi$

Starting from the definition $V^\pi(s) = \mathbb{E}\pi[G_t \mid S_t = s]$ and using $G_t = R{t+1} + \gamma G_{t+1}$:

$$V^\pi(s) = \mathbb{E}\pi[R{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$$

Expanding the expectation over actions and next states yields the Bellman expectation equation:

$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \big[R(s, a, s') + \gamma V^\pi(s')\big]$$

⚠️ CRITICAL: This is arguably the most important equation in RL. It says: the value of a state equals the expected immediate reward plus the discounted expected value of the next state. Every value-based RL algorithm is built on this recursion.

Matrix form (for finite MDPs with $n$ states):

$$\mathbf{V}^\pi = \mathbf{R}^\pi + \gamma \mathbf{P}^\pi \mathbf{V}^\pi$$

where $R^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)R(s, a, s')$ is the expected immediate reward under $\pi$, and $P^\pi(s, s') = \sum_a \pi(a \mid s) P(s' \mid s, a)$ is the state-transition probability under $\pi$.

Solving: $\mathbf{V}^\pi = (I - \gamma \mathbf{P}^\pi)^{-1} \mathbf{R}^\pi$. This requires $O(n^3)$ operations — tractable only for small $n$.

The Bellman Expectation Equation for $Q^\pi$

Similarly, for the action-value function:

$$Q^\pi(s, a) = \mathbb{E}\pi[R{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$

Expanding:

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')\right]$$

This decomposes the expected return into: (1) the immediate expected reward $R(s, a, s')$, plus (2) the discounted expected future value $\gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$, averaged over next states.

The Relationship Between $V^\pi$ and $Q^\pi$

Two fundamental connections:

$$V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a) \quad \text{(averaging over actions)}$$

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) [R(s, a, s') + \gamma V^\pi(s')] \quad \text{(one-step lookahead)}$$

These two equations form a couplet that appears everywhere in RL — they let you move between state-values and action-values as needed.

Optimal Value Functions

The optimal state-value function $V^$ and optimal action-value function $Q^$ are defined as:

$$V^*(s) = \max_\pi V^\pi(s)$$

$$Q^*(s, a) = \max_\pi Q^\pi(s, a)$$

They represent the best possible expected return achievable from state $s$ (or state-action pair $(s, a)$). For any MDP, an optimal policy $\pi^$ exists (possibly non-unique) such that $V^{\pi^} = V^$ and $Q^{\pi^} = Q^*$.

⚠️ CRITICAL: For finite MDPs, a deterministic optimal policy always exists. There's always an action in each state that achieves (or ties for) the maximum. This is not generally true for partially observable settings (POMDPs).

The Bellman Optimality Equations

The Bellman optimality equation for $V^*$:

$$V^(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \big[R(s, a, s') + \gamma V^(s')\big]$$

And for $Q^*$:

$$Q^(s, a) = \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \left[R(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^(s', a')\right]$$

Notice the crucial difference from the expectation equations: we use $\max_a$ instead of $\sum_a \pi(a \mid s)$. The optimal value is achieved by always taking the best action.

The Bellman Operator Is a Contraction

Define the Bellman operator $\mathcal{T}^\pi$ for a given policy:

$$(\mathcal{T}^\pi V)(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)[R(s, a, s') + \gamma V(s')]$$

$\mathcal{T}^\pi$ is a contraction mapping in the max-norm $|V|_\infty = \max_s |V(s)|$:

$$|\mathcal{T}^\pi V_1 - \mathcal{T}^\pi V_2|\infty \leq \gamma |V_1 - V_2|\infty$$

Proof:

$|(\mathcal{T}^\pi V_1)(s) - (\mathcal{T}^\pi V_2)(s)|$ $= |\sum_{a,s'} \pi(a \mid s)P(s' \mid s, a) \gamma[V_1(s') - V_2(s')]|$ $\leq \gamma \sum_{a,s'} \pi(a \mid s)P(s' \mid s, a)|V_1(s') - V_2(s')|$ $\leq \gamma \sum_{a,s'} \pi(a \mid s)P(s' \mid s, a) |V_1 - V_2|\infty$ $= \gamma |V_1 - V_2|\infty$

Since $| \cdot |_\infty$ is a complete metric space, the Banach fixed-point theorem guarantees that $\mathcal{T}^\pi$ has a unique fixed point — $V^\pi$ — and that iteratively applying $\mathcal{T}^\pi$ converges to it from any starting point. This is the mathematical foundation for policy evaluation (covered in 23-03).

Similarly, the Bellman optimality operator $\mathcal{T}^*$:

$$(\mathcal{T}^* V)(s) = \max_a \sum_{s'} P(s' \mid s, a)[R(s, a, s') + \gamma V(s')]$$

is also a $\gamma$-contraction, with unique fixed point $V^*$.

Key Terms

Banach fixed-point theorem
Bellman expectation equation
Bellman operator
Bellman optimality operator
Matrix form

Worked Examples

Example 1: Solving for $V^\pi$ by Hand

Two-state MDP: $\mathcal{S} = {s_1, s_2}$, one action $a$. $P(s_1 \mid s_1, a)=0.7$, $P(s_2 \mid s_1, a)=0.3$; $P(s_1 \mid s_2, a)=0.4$, $P(s_2 \mid s_2, a)=0.6$. $R(s_1, a, \cdot)=5$, $R(s_2, a, \cdot)=-2$. $\gamma = 0.9$. Solve for $V^\pi$.

Solution: Policy fixed (only one action). Bellman equations:

$V(s_1) = 0.7[5 + 0.9 V(s_1)] + 0.3[5 + 0.9 V(s_2)]$ $= 0.7(5) + 0.7(0.9)V(s_1) + 0.3(5) + 0.3(0.9)V(s_2)$ $= 3.5 + 0.63 V(s_1) + 1.5 + 0.27 V(s_2)$ $= 5 + 0.63 V(s_1) + 0.27 V(s_2)$

$V(s_2) = 0.4[-2 + 0.9 V(s_1)] + 0.6[-2 + 0.9 V(s_2)]$ $= -0.8 + 0.36 V(s_1) -1.2 + 0.54 V(s_2)$ $= -2 + 0.36 V(s_1) + 0.54 V(s_2)$

Rearrange: $V(s_1) - 0.63 V(s_1) - 0.27 V(s_2) = 5$ → $0.37 V(s_1) - 0.27 V(s_2) = 5$ $V(s_2) - 0.36 V(s_1) - 0.54 V(s_2) = -2$ → $-0.36 V(s_1) + 0.46 V(s_2) = -2$

Solve: $V(s_1) \approx 21.38$, $V(s_2) \approx 12.38$.

Click for answer

$V(s_1) \approx 21.38$, $V(s_2) \approx 12.38$. Both positive despite $s_2$ giving negative immediate rewards — the ability to transition to $s_1$ (which gives high rewards) elevates $s_2$'s value.

Example 2: Bellman Optimality — Simple Grid

Consider a 2-state MDP: $s_1$ → action $a_1$: reward $+10$, always goes to terminal. Action $a_2$: reward $+1$, always goes to $s_2$. $s_2$ → action $a_1$: reward $-5$, 50% terminal, 50% back to $s_1$. Action $a_2$: reward $+2$, always goes to terminal. $\gamma=0.9$. Find $V^(s_1), V^(s_2)$.

Solution: Terminal = absorbing state with value 0. Bellman optimality:

$V^(s_1) = \max{10 + 0.9(0),\; 1 + 0.9 V^(s_2)} = \max{10,\; 1 + 0.9 V^*(s_2)}$

$V^(s_2) = \max{0.5[-5 + 0.9(0)] + 0.5[-5 + 0.9 V^(s_1)],\; 2 + 0.9(0)}$ $= \max{-5 + 0.45 V^*(s_1),\; 2}$

Case 1: $V^(s_2) = 2$. Then $V^(s_1) = \max{10, 1 + 0.9(2)} = \max{10, 2.8} = 10$. Check consistency: $V^*(s_2) = \max{-5 + 0.45(10), 2} = \max{-0.5, 2} = 2$. ✓

Optimal policy: in $s_1$, take $a_1$ (immediate $+10$). In $s_2$, take $a_2$ (immediate $+2$).

Click for answer

$V^*(s_1) = 10$, $V^*(s_2) = 2$. Optimal: $a_1$ in $s_1$, $a_2$ in $s_2$.

Example 3: Contraction Proof Verification

Show numerically that applying the Bellman operator reduces the max-norm error by factor $\gamma$. Start with $V_0 = [0, 0, 0]$ for a 3-state MDP with known $V^\pi = [8, 4, 0]$, $\gamma=0.9$.

Solution: The Bellman operator $\mathcal{T}^\pi$ satisfies $|\mathcal{T}^\pi V_k - V^\pi|\infty \leq \gamma |V_k - V^\pi|\infty$. Applying it repeatedly:

The error decays exponentially. This is why dynamic programming works — each iteration shrinks the error by factor $\gamma$.

Click for answer

The contraction property guarantees exponential convergence: $\|V_k - V^\pi\|_\infty \leq \gamma^k \|V_0 - V^\pi\|_\infty$. This is the mathematical foundation for all iterative value-based RL algorithms.

Practice Problems

Problem 1: Write out the Bellman expectation equation for $Q^\pi$ in a deterministic MDP (where $P(s' \mid s, a) = 1$ for exactly one $s'$). How does it simplify?

Answers (click to expand)

In a deterministic MDP, there is exactly one next state $s' = f(s, a)$. Then: $Q^\pi(s, a) = R(s, a, f(s, a)) + \gamma \sum_{a'} \pi(a' \mid f(s, a)) Q^\pi(f(s, a), a')$ The sum over $s'$ collapses to a single term, and $V^\pi(s') = \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$.

Problem 2: Prove that the Bellman optimality operator $\mathcal{T}^*$ is a $\gamma$-contraction.

Answers (click to expand)

For any two value functions $V_1, V_2$ and any state $s$: $|(\mathcal{T}^* V_1)(s) - (\mathcal{T}^* V_2)(s)|$ $= |\max_a \sum_{s'} P(s' \mid s, a)[R(s,a,s') + \gamma V_1(s')] - \max_a \sum_{s'} P(s' \mid s, a)[R(s,a,s') + \gamma V_2(s')]|$ Let $a_1^*$ be the optimal action for $V_1$ at $s$: $\leq |\sum_{s'} P(s' \mid s, a_1^*)[R + \gamma V_1(s')] - \sum_{s'} P(s' \mid s, a_1^*)[R + \gamma V_2(s')]|$ $= \gamma |\sum_{s'} P(s' \mid s, a_1^*)[V_1(s') - V_2(s')]|$ $\leq \gamma \sum_{s'} P(s' \mid s, a_1^*) \|V_1 - V_2\|_\infty = \gamma \|V_1 - V_2\|_\infty$ Taking max over $s$: $\|\mathcal{T}^* V_1 - \mathcal{T}^* V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$. Done.

Problem 3: For the MDP in Example 1, verify your solution by plugging $V(s_1)=21.38$, $V(s_2)=12.38$ back into both Bellman equations.

Answers (click to expand)

$V(s_1) = 5 + 0.63(21.38) + 0.27(12.38) = 5 + 13.47 + 3.34 = 21.81$ (small rounding difference, the exact solution is $V(s_1) = 600/27.4 \approx 21.90$, $V(s_2) = 340/27.4 \approx 12.41$) $V(s_2) = -2 + 0.36(21.38) + 0.54(12.38) = -2 + 7.70 + 6.69 = 12.39$ ✓

Problem 4: An MDP has $V^(s_1) = 100$. Action $a_1$ from $s_1$ gives $R=10$ and goes to $s_2$ where $V^(s_2) = V^(s_2)$ (unknown). Action $a_2$ gives $R=50$ and goes to terminal ($V=0$). $\gamma=0.9$. What is the minimum value of $V^(s_2)$ for $a_1$ to be optimal?

Answers (click to expand)

For $a_1$ to be optimal: $10 + 0.9 V^*(s_2) \geq 50 + 0.9(0)$ $10 + 0.9 V^*(s_2) \geq 50$ $0.9 V^*(s_2) \geq 40$ $V^*(s_2) \geq 44.\overline{4}$ So $V^*(s_2)$ must be at least 44.45 for the agent to forgo the immediate 50 and take action $a_1$ instead.

Problem 5: Derive the relationship between the Bellman expectation and optimality equations. Under what condition does $V^\pi = V^*$?

Answers (click to expand)

$V^\pi = V^*$ when $\pi$ is an optimal policy, i.e., $\pi(a \mid s) > 0$ only if $a \in \arg\max_a Q^*(s, a)$. The Bellman expectation equation uses $\sum_a \pi(a \mid s)$, while the optimality equation uses $\max_a$. When $\pi$ puts all probability on the maximizer, $\sum_a \pi(a \mid s) Q^*(s, a) = \max_a Q^*(s, a)$, and $V^\pi$ satisfies the optimality equation.

Summary

Key takeaways:

The Bellman expectation equation decomposes $V^\pi$ into immediate reward plus discounted future value: $V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)[R + \gamma V^\pi(s')]$
The Bellman optimality equation replaces the policy average with a max: $V^(s) = \max_a \sum_{s'} P(s' \mid s, a)[R + \gamma V^(s')]$
The Bellman operator $\mathcal{T}^\pi$ is a $\gamma$-contraction in max-norm — this guarantees convergence of iterative methods
$V^\pi$ and $Q^\pi$ are connected via one-step lookahead: $Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a)[R + \gamma V^\pi(s')]$
For finite MDPs, a deterministic optimal policy always exists
The contraction property is the mathematical reason dynamic programming works — errors shrink exponentially

Quiz

Question 1: The Bellman expectation equation for $V^\pi$ expresses $V^\pi(s)$ as:

A. $\max_a \sum_{s'} P(s' \mid s,a)[R + \gamma V^\pi(s')]$ B. $\sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a)[R + \gamma V^\pi(s')]$ C. $\frac{1}{|\mathcal{A}|} \sum_a Q^\pi(s,a)$ D. $\mathbb{E}[R_{t+1}]$ only

Correct Answer: B