23-01 — Markov Decision Processes (MDPs)
Phase: 23 — Reinforcement Learning Mathematics Subject: 23-01 Prerequisites: Phase 11 (Markov Chains), Phase 10 (Probability Theory) Next subject: 23-02 — Bellman Equations
Learning Objectives
By the end of this subject, you will be able to:
- Define the five components of an MDP and explain the Markov property in decision contexts
- Write the mathematical definition of a trajectory, return, and value function
- Explain why discounting is necessary and how the discount factor γ shapes optimal behavior
- Distinguish episodic from continuing tasks and their mathematical implications
- Recognize when a real-world problem can be formulated as an MDP
Core Content
The MDP Framework
A Markov Decision Process (MDP) is the mathematical formalism for sequential decision-making under uncertainty. Every reinforcement learning problem can be expressed as an MDP. Formally, an MDP is a tuple:
$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$
- $\mathcal{S}$ — State space: the set of all possible states the agent can be in. Can be finite, countably infinite, or continuous.
- $\mathcal{A}$ — Action space: the set of all possible actions. For state-dependent actions, $\mathcal{A}(s) \subseteq \mathcal{A}$ is the set of actions available in state $s$.
- $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ — Transition function: $P(s' \mid s, a) = \Pr(S_{t+1}=s' \mid S_t=s, A_t=a)$. This is a probability distribution over next states.
- $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ — Reward function: $R(s, a, s')$ is the immediate reward for transitioning from $s$ to $s'$ under action $a$.
- $\gamma \in [0,1]$ — Discount factor: trades off immediate vs. future rewards.
⚠️ CRITICAL: The transition function $P(s' \mid s, a)$ must satisfy the Markov property: the future depends only on the current state and action — not on the history of how you got there. This is the definitional requirement of an MDP. If your problem has memory effects beyond the state, either augment the state or it's not an MDP.
The Markov Property
An environment satisfies the Markov property if:
$$\Pr(S_{t+1}=s', R_{t+1}=r \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = \Pr(S_{t+1}=s', R_{t+1}=r \mid S_t, A_t)$$
In words: the future is conditionally independent of the past given the present. This is a strong assumption, but one we can satisfy by careful state design — include all relevant history in the state representation.
Trajectories and Returns
A trajectory (or episode) is a sequence of states, actions, and rewards:
$$\tau = (S_0, A_0, R_1, S_1, A_1, R_2, S_2, \ldots)$$
The return $G_t$ is the cumulative discounted reward from time $t$ onward:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
A useful recursive form:
$$G_t = R_{t+1} + \gamma G_{t+1}$$
This recurrence is the backbone of every Bellman equation you'll encounter.
For episodic tasks (finite horizon $T$): the sum terminates, and conceptually $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$. Usually $\gamma$ can be 1 (undiscounted) for episodic tasks.
For continuing tasks (infinite horizon): we must have $\gamma < 1$ to guarantee finite returns (assuming bounded rewards). Without discounting, the sum could diverge, making value comparisons meaningless.
Why Discount?
- Mathematical necessity: Ensures finite returns in infinite-horizon problems under bounded rewards ($|R| \leq R_{\max}$):
$$|G_t| \leq \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1-\gamma}$$
- Uncertainty about the future: Rewards far in the future are less certain.
- Preference for sooner rewards: Economic/behavioral motivation — agents (and humans) prefer rewards now.
- Exponential decay is chosen for mathematical tractability — it makes Bellman equations compact.
Alternative formulations exist (average reward, finite horizon), but discounted infinite-horizon is the standard.
Value Functions
The state-value function $V^\pi(s)$ is the expected return starting from state $s$ and following policy $\pi$ thereafter:
$$V^\pi(s) = \mathbb{E}\pi[G_t \mid S_t = s] = \mathbb{E}\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s\right]$$
The action-value function $Q^\pi(s, a)$ is the expected return starting from $s$, taking action $a$, then following $\pi$:
$$Q^\pi(s, a) = \mathbb{E}\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s, A_t = a\right]$$
The relationship between $V$ and $Q$:
$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, Q^\pi(s, a)$$
This says: the value of a state is the expected value over the policy's action probabilities. Conversely, $Q$ can be expressed in terms of $V$:
$$Q^\pi(s, a) = \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \big[R(s, a, s') + \gamma V^\pi(s')\big]$$
Policies
A policy $\pi$ maps states to a probability distribution over actions:
$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$$
- Deterministic policy: $\pi(s) = a$ (maps to a specific action)
- Stochastic policy: $\pi(a \mid s)$ is a full distribution (useful for exploration and in policy gradient methods)
⚠️ CRITICAL — Common Pitfall: The MDP itself contains $P$ and $R$, which are properties of the environment. The policy $\pi$ is a property of the agent. In model-free RL, the agent doesn't know $P$ and $R$; it must learn from interaction. But the MDP formulation still underlies everything.
Discrete Gridworld Example
Consider a $3 \times 3$ grid. States are grid cells $(i, j)$. Actions: up, down, left, right. If the agent tries to move off the grid, it stays in place. Reward: $+1$ for reaching the goal cell, $-1$ for falling in a pit, $-0.04$ per step (living penalty to encourage efficiency). $\gamma = 0.9$. This is an episodic MDP — the episode ends when the agent reaches the goal or pit.
The transition function is noisy: with probability $0.8$ the agent moves in the intended direction, with $0.1$ it moves 90° left, $0.1$ it moves 90° right. This is a stochastic MDP — transitions are NOT deterministic.
Key Terms
- Action space
- Deterministic policy
- Discount factor
- Exponential decay
- For continuing tasks
- For episodic tasks
- Markov Decision Process
- Markov property
- Mathematical necessity
- Preference for sooner rewards
- Reward function
- State space
- Stochastic policy
- Transition function
- Uncertainty about the future
Worked Examples
Example 1: Computing Returns
Consider the trajectory: $S_0 \xrightarrow{A_0} R_1{=}0, S_1 \xrightarrow{A_1} R_2{=}0, S_2 \xrightarrow{A_2} R_3{=}10$. Compute $G_0, G_1, G_2$ with $\gamma = 0.9$.
Solution:
$G_2 = R_3 = 10$ (no future rewards after the last step)
$G_1 = R_2 + \gamma G_2 = 0 + 0.9 \times 10 = 9$
$G_0 = R_1 + \gamma G_1 = 0 + 0.9 \times 9 = 8.1$
Click for answer
$G_0 = 8.1$, $G_1 = 9$, $G_2 = 10$. Notice exponential decay backward through time: each step further from the reward reduces its contribution by factor $\gamma$.Example 2: MDP Tuple Identification
An autonomous car drives on a highway. State: $(x, v, \text{lane})$ where $x$ is position, $v$ is velocity, lane is left/center/right. Actions: accelerate, decelerate, lane-left, lane-right. Reward: $+1$ per meter traveled, $-100$ for collision, $-1$ for hard braking. Identify each component of the MDP tuple.
Solution:
- $\mathcal{S} = \mathbb{R} \times [0, v_{\max}] \times {L, C, R}$ — continuous in position/velocity, discrete in lane
- $\mathcal{A}$ = {accelerate, decelerate, lane-left, lane-right}
- $P$: Physics model — $x_{t+1} = x_t + v_t \Delta t$, $v_{t+1}$ depends on action, lane changes succeed with some probability (other drivers may block). This is stochastic.
- $R$: Mileage reward minus collision penalty minus comfort penalty.
- $\gamma$: Close to 1 (e.g., 0.99) since we care about the long horizon.
Click for answer
This is an MDP with a mixed continuous/discrete state space and stochastic transitions. In practice, we'd discretize or use function approximation for $V$ or $Q$.Example 3: Value of a Simple Policy
In a 2-state MDP: $\mathcal{S} = {s_1, s_2}$, $\mathcal{A} = {a}$ (only one action). Transitions: $P(s_1 \mid s_1, a) = 0.5$, $P(s_2 \mid s_1, a) = 0.5$; $P(s_1 \mid s_2, a) = 1.0$. Rewards: $R(s_1, a, s_1) = 10$, $R(s_1, a, s_2) = 0$, $R(s_2, a, s_1) = -5$. $\gamma = 0.9$.
Compute $V(s_1)$ and $V(s_2)$ by solving the system of equations defined by the policy.
Solution:
$V(s_1) = 0.5[10 + 0.9 V(s_1)] + 0.5[0 + 0.9 V(s_2)] = 5 + 0.45 V(s_1) + 0.45 V(s_2)$
$V(s_2) = 1.0[-5 + 0.9 V(s_1)] = -5 + 0.9 V(s_1)$
Substitute the second into the first:
$V(s_1) = 5 + 0.45 V(s_1) + 0.45(-5 + 0.9 V(s_1))$ $V(s_1) = 5 + 0.45 V(s_1) - 2.25 + 0.405 V(s_1)$ $V(s_1) = 2.75 + 0.855 V(s_1)$ $0.145 V(s_1) = 2.75$ $V(s_1) = 18.97$
Then $V(s_2) = -5 + 0.9 \times 18.97 = 12.07$
Click for answer
$V(s_1) \approx 18.97$, $V(s_2) \approx 12.07$. Even though $s_2$ always receives $-5$, the 90% chance of transitioning to $s_1$ (which is highly valuable) makes $s_2$ net positive.Practice Problems
Problem 1: Define an MDP for a simple "student's dilemma": a student in state "studying" can choose to "continue studying" (reward $+1$, 70% chance of staying, 30% of finishing) or "go party" (reward $+5$, 100% chance of ending up in "exhausted" state with reward $-2$ thereafter). Write the full MDP tuple.
Answers (click to expand)
$\mathcal{S} = \{\text{studying}, \text{finished}, \text{exhausted}\}$ $\mathcal{A} = \{\text{study}, \text{party}\}$ $P$: from studying, study → 0.7 studying, 0.3 finished; party → 1.0 exhausted. Finished is absorbing (stay with reward 0). Exhausted is absorbing (stay with reward 0). $R$: study from studying → +1 each step; party → +5; staying in finished → 0; staying in exhausted → 0. $\gamma$: e.g., 0.9.Problem 2: For the trajectory $S_0 \to R_1{=}2, S_1 \to R_2{=}-1, S_2 \to R_3{=}4, S_3 \to R_4{=}0$, compute $G_0$ with $\gamma=0.5$ and $\gamma=0.9$. Why does the relative contribution of the final reward differ?
Answers (click to expand)
With $\gamma=0.5$: $G_0 = 2 + 0.5(-1) + 0.25(4) + 0.125(0) = 2 - 0.5 + 1 = 2.5$. Final reward contributes $0.25 \times 4 = 1$ (40% of total). With $\gamma=0.9$: $G_0 = 2 + 0.9(-1) + 0.81(4) + 0.729(0) = 2 - 0.9 + 3.24 = 4.34$. Final reward contributes $0.81 \times 4 = 3.24$ (75% of total). Lower $\gamma$ heavily discounts the future — the agent becomes "myopic." Higher $\gamma$ values future rewards nearly as much as immediate ones.Problem 3: Show that if rewards are bounded by $|R_t| \leq R_{\max}$ and $\gamma < 1$, then $|G_t| \leq \frac{R_{\max}}{1-\gamma}$. What happens as $\gamma \to 1$?
Answers (click to expand)
$|G_t| = |\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}| \leq \sum_{k=0}^{\infty} \gamma^k |R_{t+k+1}| \leq \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1-\gamma}$. As $\gamma \to 1$, the bound goes to $\infty$. Even though each individual $G_t$ may be finite, the worst-case bound diverges, which is problematic for convergence proofs in continuing tasks.Problem 4: A deterministic MDP has $P(s' \mid s, a) = 1$ for exactly one $s'$. For the gridworld described in Core Content, what is the size of the transition matrix if the grid is $n \times n$ with 4 actions? Explain why stochastic transitions prevent simple shortest-path solutions.
Answers (click to expand)
Transition matrix size: $n^2 \times 4 \times n^2$ (a 3D tensor). For each $(s, a)$ pair, you have a probability distribution over $n^2$ next states. With stochastic transitions, the agent can't guarantee reaching the goal — there's always some probability of moving sideways. The optimal policy must account for this uncertainty, making it a risk-sensitive planning problem rather than a pure shortest-path problem.Problem 5: Prove the recursive return relationship $G_t = R_{t+1} + \gamma G_{t+1}$ starting from the definition $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.
Answers (click to expand)
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ $= R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + \cdots)$ $= R_{t+1} + \gamma G_{t+1}$. This simple recurrence is the foundation for all dynamic programming and temporal difference methods in RL.Summary
Key takeaways:
- An MDP is defined by $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ — master this tuple; every RL algorithm is built on it
- The Markov property means the future depends only on the present state and action, not on history
- The return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ and its recursive form $G_t = R_{t+1} + \gamma G_{t+1}$ underpin all value-based methods
- $V^\pi(s)$ is expected return from state $s$ under policy $\pi$; $Q^\pi(s,a)$ is expected return from $(s,a)$
- $\gamma < 1$ is required for continuing tasks to keep returns bounded; $\gamma \approx 1$ makes the agent far-sighted
- MDPs can be episodic (finite horizon) or continuing (infinite horizon); algorithms must handle both
Quiz
Question 1: Which of the following is NOT a component of an MDP tuple?
A. State space $\mathcal{S}$ B. Action space $\mathcal{A}$ C. Policy $\pi$ D. Discount factor $\gamma$
Correct Answer: C. Policy $\pi$
Explanation (click to expand)
- If you chose C: Correct. The policy is the agent's behavior — it's what we *solve for*, not part of the MDP definition. The MDP defines the environment; the policy defines the agent. - If you chose A: $\mathcal{S}$ is a core MDP component; it defines what the agent observes. - If you chose B: $\mathcal{A}$ is a core MDP component; it defines what the agent can do. - If you chose D: $\gamma$ is part of the MDP definition, though sometimes treated as a hyperparameter.Question 2: What does the Markov property guarantee?
A. Rewards are always positive B. The future is conditionally independent of the past given the present state and action C. All actions have deterministic outcomes D. The value function is linear
Correct Answer: B
Explanation (click to expand)
- If you chose B: Correct. $\Pr(S_{t+1} \mid S_t, A_t, \text{history}) = \Pr(S_{t+1} \mid S_t, A_t)$. The state is a sufficient statistic. - If you chose A: Rewards can be positive, negative, or zero — the Markov property says nothing about sign. - If you chose C: MDPs can be fully stochastic; the Markov property doesn't force determinism. - If you chose D: The Markov property doesn't constrain the functional form of $V$.Question 3: In an episodic MDP with $\gamma = 1$, the return $G_t$ for a trajectory of length $T$ (starting at $t$) is:
A. $\sum_{k=0}^{\infty} R_{t+k+1}$ B. $\sum_{k=0}^{T-t-1} R_{t+k+1}$ C. $\max_{k} R_{t+k+1}$ D. $\frac{1}{T}\sum_{k=0}^{T-t-1} R_{t+k+1}$
Correct Answer: B
Explanation (click to expand)
- If you chose B: Correct. Episodic tasks have finite length, so the sum is over the remaining steps. With $\gamma=1$, it's the undiscounted sum. - If you chose A: This infinite sum is for continuing tasks, not episodic. - If you chose C: This is the max, not the sum — that would be a different objective entirely. - If you chose D: This is the average, not the (undiscounted) sum.Question 4: Why is $\gamma < 1$ necessary for continuing tasks?
A. To make the problem episodic B. To ensure the expected return is finite (assuming bounded rewards) C. Because rewards can only be positive D. To guarantee the policy is deterministic
Correct Answer: B
Explanation (click to expand)
- If you chose B: Correct. With bounded rewards $|R| \leq R_{\max}$, the return satisfies $|G_t| \leq R_{\max}/(1-\gamma) < \infty$. Without discounting, the sum could diverge. - If you chose A: Discounting doesn't make an infinite-horizon task episodic. - If you chose C: Discounting bounds returns regardless of reward sign — the sign is irrelevant. - If you chose D: Discounting affects value computation, not whether the policy must be deterministic.Question 5: Which expression correctly relates $V^\pi$ and $Q^\pi$?
A. $V^\pi(s) = \max_a Q^\pi(s, a)$ B. $V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a)$ C. $Q^\pi(s, a) = V^\pi(s) + \gamma$ D. $V^\pi(s) = \frac{1}{|\mathcal{A}|} \sum_a Q^\pi(s, a)$
Correct Answer: B
Explanation (click to expand)
- If you chose B: Correct. $V^\pi(s)$ is the expectation of $Q^\pi(s, a)$ under the policy's action distribution. - If you chose A: This would be $V^*(s)$, the *optimal* state-value function, which takes the max over actions — not $V^\pi$ for a given policy. - If you chose C: This is dimensionally wrong and has no mathematical basis. - If you chose D: This assumes a uniform policy — it's a special case of B when $\pi(a \mid s) = 1/|\mathcal{A}|$.Next Steps
Next up: 23-02 — Bellman Equations — where you'll learn how to express value functions recursively, the single most important equation in reinforcement learning.
Pitfalls
-
Confusing the policy with a component of the MDP. The MDP tuple is $$(\mathcal{S}, \mathcal{A}, P, R, \gamma)$$ — the policy $$\pi$$ is NOT part of the MDP definition. The MDP describes the environment; the policy describes the agent's behavior. Many students write "MDP = (S, A, P, R, γ, π)" — this conflates the problem (MDP) with the solution (policy).
-
Assuming the Markov property holds automatically. Real-world problems often have memory effects: the next state may depend on more than just the current state. If your state representation omits relevant history (e.g., velocity when modeling a car's position), the process is not Markovian, and MDP-based algorithms will fail. Always ask: "Is my state a sufficient statistic for the future?"
-
Setting γ = 1 in continuing tasks. In infinite-horizon problems, γ = 1 produces infinite returns (for positive rewards) or divergent sums, making value comparisons meaningless. Even in episodic tasks, γ < 1 is often used to encode preference for sooner rewards. The only mathematically safe use of γ = 1 is episodic tasks with guaranteed termination.
-
Misunderstanding the recursive return relationship. Newcomers often compute $$G_t = R_{t+1} + R_{t+2} + \cdots$$ by summing all rewards manually, forgetting the recursive form $$G_t = R_{t+1} + \gamma G_{t+1}$$. This recurrence is the foundation of all Bellman equations — internalize it.
Q8: You're designing an MDP for a chess AI. What is NOT a valid concern when defining the state space?
A) Including castling rights and en passant availability to preserve the Markov property B) Making the state space as small as possible to speed up computation C) Ensuring the state captures all information needed to predict legal next positions D) Allowing the state to include the full move history for perfect information