📐 Concept diagram

23-01 — Markov Decision Processes (MDPs)

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-01 Prerequisites: Phase 11 (Markov Chains), Phase 10 (Probability Theory) Next subject: 23-02 — Bellman Equations

Learning Objectives

By the end of this subject, you will be able to:

Define the five components of an MDP and explain the Markov property in decision contexts
Write the mathematical definition of a trajectory, return, and value function
Explain why discounting is necessary and how the discount factor γ shapes optimal behavior
Distinguish episodic from continuing tasks and their mathematical implications
Recognize when a real-world problem can be formulated as an MDP

Core Content

The MDP Framework

A Markov Decision Process (MDP) is the mathematical formalism for sequential decision-making under uncertainty. Every reinforcement learning problem can be expressed as an MDP. Formally, an MDP is a tuple:

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$

$\mathcal{S}$ — State space: the set of all possible states the agent can be in. Can be finite, countably infinite, or continuous.
$\mathcal{A}$ — Action space: the set of all possible actions. For state-dependent actions, $\mathcal{A}(s) \subseteq \mathcal{A}$ is the set of actions available in state $s$.
$P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ — Transition function: $P(s' \mid s, a) = \Pr(S_{t+1}=s' \mid S_t=s, A_t=a)$. This is a probability distribution over next states.
$R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ — Reward function: $R(s, a, s')$ is the immediate reward for transitioning from $s$ to $s'$ under action $a$.
$\gamma \in [0,1]$ — Discount factor: trades off immediate vs. future rewards.

⚠️ CRITICAL: The transition function $P(s' \mid s, a)$ must satisfy the Markov property: the future depends only on the current state and action — not on the history of how you got there. This is the definitional requirement of an MDP. If your problem has memory effects beyond the state, either augment the state or it's not an MDP.

The Markov Property

An environment satisfies the Markov property if:

$$\Pr(S_{t+1}=s', R_{t+1}=r \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = \Pr(S_{t+1}=s', R_{t+1}=r \mid S_t, A_t)$$

In words: the future is conditionally independent of the past given the present. This is a strong assumption, but one we can satisfy by careful state design — include all relevant history in the state representation.

Trajectories and Returns

A trajectory (or episode) is a sequence of states, actions, and rewards:

$$\tau = (S_0, A_0, R_1, S_1, A_1, R_2, S_2, \ldots)$$

The return $G_t$ is the cumulative discounted reward from time $t$ onward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

A useful recursive form:

$$G_t = R_{t+1} + \gamma G_{t+1}$$

This recurrence is the backbone of every Bellman equation you'll encounter.

For episodic tasks (finite horizon $T$): the sum terminates, and conceptually $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$. Usually $\gamma$ can be 1 (undiscounted) for episodic tasks.

For continuing tasks (infinite horizon): we must have $\gamma < 1$ to guarantee finite returns (assuming bounded rewards). Without discounting, the sum could diverge, making value comparisons meaningless.

Why Discount?

Mathematical necessity: Ensures finite returns in infinite-horizon problems under bounded rewards ($|R| \leq R_{\max}$):

$$|G_t| \leq \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1-\gamma}$$

Uncertainty about the future: Rewards far in the future are less certain.
Preference for sooner rewards: Economic/behavioral motivation — agents (and humans) prefer rewards now.
Exponential decay is chosen for mathematical tractability — it makes Bellman equations compact.

Alternative formulations exist (average reward, finite horizon), but discounted infinite-horizon is the standard.

Value Functions

The state-value function $V^\pi(s)$ is the expected return starting from state $s$ and following policy $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}\pi[G_t \mid S_t = s] = \mathbb{E}\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s\right]$$

The action-value function $Q^\pi(s, a)$ is the expected return starting from $s$, taking action $a$, then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s, A_t = a\right]$$

The relationship between $V$ and $Q$:

$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, Q^\pi(s, a)$$

This says: the value of a state is the expected value over the policy's action probabilities. Conversely, $Q$ can be expressed in terms of $V$:

$$Q^\pi(s, a) = \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \big[R(s, a, s') + \gamma V^\pi(s')\big]$$

Policies

A policy $\pi$ maps states to a probability distribution over actions:

$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$$

Deterministic policy: $\pi(s) = a$ (maps to a specific action)
Stochastic policy: $\pi(a \mid s)$ is a full distribution (useful for exploration and in policy gradient methods)

⚠️ CRITICAL — Common Pitfall: The MDP itself contains $P$ and $R$, which are properties of the environment. The policy $\pi$ is a property of the agent. In model-free RL, the agent doesn't know $P$ and $R$; it must learn from interaction. But the MDP formulation still underlies everything.

Discrete Gridworld Example

Consider a $3 \times 3$ grid. States are grid cells $(i, j)$. Actions: up, down, left, right. If the agent tries to move off the grid, it stays in place. Reward: $+1$ for reaching the goal cell, $-1$ for falling in a pit, $-0.04$ per step (living penalty to encourage efficiency). $\gamma = 0.9$. This is an episodic MDP — the episode ends when the agent reaches the goal or pit.

The transition function is noisy: with probability $0.8$ the agent moves in the intended direction, with $0.1$ it moves 90° left, $0.1$ it moves 90° right. This is a stochastic MDP — transitions are NOT deterministic.

Key Terms

Action space
Deterministic policy
Discount factor
Exponential decay
For continuing tasks
For episodic tasks
Markov Decision Process
Markov property
Mathematical necessity
Preference for sooner rewards
Reward function
State space
Stochastic policy
Transition function
Uncertainty about the future

Worked Examples

Example 1: Computing Returns

Consider the trajectory: $S_0 \xrightarrow{A_0} R_1{=}0, S_1 \xrightarrow{A_1} R_2{=}0, S_2 \xrightarrow{A_2} R_3{=}10$. Compute $G_0, G_1, G_2$ with $\gamma = 0.9$.

Solution:

$G_2 = R_3 = 10$ (no future rewards after the last step)

$G_1 = R_2 + \gamma G_2 = 0 + 0.9 \times 10 = 9$

$G_0 = R_1 + \gamma G_1 = 0 + 0.9 \times 9 = 8.1$

Click for answer

$G_0 = 8.1$, $G_1 = 9$, $G_2 = 10$. Notice exponential decay backward through time: each step further from the reward reduces its contribution by factor $\gamma$.

Example 2: MDP Tuple Identification

An autonomous car drives on a highway. State: $(x, v, \text{lane})$ where $x$ is position, $v$ is velocity, lane is left/center/right. Actions: accelerate, decelerate, lane-left, lane-right. Reward: $+1$ per meter traveled, $-100$ for collision, $-1$ for hard braking. Identify each component of the MDP tuple.

Solution:

$\mathcal{S} = \mathbb{R} \times [0, v_{\max}] \times {L, C, R}$ — continuous in position/velocity, discrete in lane
$\mathcal{A}$ = {accelerate, decelerate, lane-left, lane-right}
$P$: Physics model — $x_{t+1} = x_t + v_t \Delta t$, $v_{t+1}$ depends on action, lane changes succeed with some probability (other drivers may block). This is stochastic.
$R$: Mileage reward minus collision penalty minus comfort penalty.
$\gamma$: Close to 1 (e.g., 0.99) since we care about the long horizon.

Click for answer

This is an MDP with a mixed continuous/discrete state space and stochastic transitions. In practice, we'd discretize or use function approximation for $V$ or $Q$.

Example 3: Value of a Simple Policy

In a 2-state MDP: $\mathcal{S} = {s_1, s_2}$, $\mathcal{A} = {a}$ (only one action). Transitions: $P(s_1 \mid s_1, a) = 0.5$, $P(s_2 \mid s_1, a) = 0.5$; $P(s_1 \mid s_2, a) = 1.0$. Rewards: $R(s_1, a, s_1) = 10$, $R(s_1, a, s_2) = 0$, $R(s_2, a, s_1) = -5$. $\gamma = 0.9$.

Compute $V(s_1)$ and $V(s_2)$ by solving the system of equations defined by the policy.

Solution:

$V(s_1) = 0.5[10 + 0.9 V(s_1)] + 0.5[0 + 0.9 V(s_2)] = 5 + 0.45 V(s_1) + 0.45 V(s_2)$

$V(s_2) = 1.0[-5 + 0.9 V(s_1)] = -5 + 0.9 V(s_1)$

Substitute the second into the first:

$V(s_1) = 5 + 0.45 V(s_1) + 0.45(-5 + 0.9 V(s_1))$ $V(s_1) = 5 + 0.45 V(s_1) - 2.25 + 0.405 V(s_1)$ $V(s_1) = 2.75 + 0.855 V(s_1)$ $0.145 V(s_1) = 2.75$ $V(s_1) = 18.97$

Then $V(s_2) = -5 + 0.9 \times 18.97 = 12.07$

Click for answer

$V(s_1) \approx 18.97$, $V(s_2) \approx 12.07$. Even though $s_2$ always receives $-5$, the 90% chance of transitioning to $s_1$ (which is highly valuable) makes $s_2$ net positive.

Practice Problems

Problem 1: Define an MDP for a simple "student's dilemma": a student in state "studying" can choose to "continue studying" (reward $+1$, 70% chance of staying, 30% of finishing) or "go party" (reward $+5$, 100% chance of ending up in "exhausted" state with reward $-2$ thereafter). Write the full MDP tuple.

Answers (click to expand)

$\mathcal{S} = \{\text{studying}, \text{finished}, \text{exhausted}\}$ $\mathcal{A} = \{\text{study}, \text{party}\}$ $P$: from studying, study → 0.7 studying, 0.3 finished; party → 1.0 exhausted. Finished is absorbing (stay with reward 0). Exhausted is absorbing (stay with reward 0). $R$: study from studying → +1 each step; party → +5; staying in finished → 0; staying in exhausted → 0. $\gamma$: e.g., 0.9.

Problem 2: For the trajectory $S_0 \to R_1{=}2, S_1 \to R_2{=}-1, S_2 \to R_3{=}4, S_3 \to R_4{=}0$, compute $G_0$ with $\gamma=0.5$ and $\gamma=0.9$. Why does the relative contribution of the final reward differ?

Answers (click to expand)

With $\gamma=0.5$: $G_0 = 2 + 0.5(-1) + 0.25(4) + 0.125(0) = 2 - 0.5 + 1 = 2.5$. Final reward contributes $0.25 \times 4 = 1$ (40% of total). With $\gamma=0.9$: $G_0 = 2 + 0.9(-1) + 0.81(4) + 0.729(0) = 2 - 0.9 + 3.24 = 4.34$. Final reward contributes $0.81 \times 4 = 3.24$ (75% of total). Lower $\gamma$ heavily discounts the future — the agent becomes "myopic." Higher $\gamma$ values future rewards nearly as much as immediate ones.

Problem 3: Show that if rewards are bounded by $|R_t| \leq R_{\max}$ and $\gamma < 1$, then $|G_t| \leq \frac{R_{\max}}{1-\gamma}$. What happens as $\gamma \to 1$?

Answers (click to expand)

$|G_t| = |\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}| \leq \sum_{k=0}^{\infty} \gamma^k |R_{t+k+1}| \leq \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1-\gamma}$. As $\gamma \to 1$, the bound goes to $\infty$. Even though each individual $G_t$ may be finite, the worst-case bound diverges, which is problematic for convergence proofs in continuing tasks.

Problem 4: A deterministic MDP has $P(s' \mid s, a) = 1$ for exactly one $s'$. For the gridworld described in Core Content, what is the size of the transition matrix if the grid is $n \times n$ with 4 actions? Explain why stochastic transitions prevent simple shortest-path solutions.

Answers (click to expand)

Transition matrix size: $n^2 \times 4 \times n^2$ (a 3D tensor). For each $(s, a)$ pair, you have a probability distribution over $n^2$ next states. With stochastic transitions, the agent can't guarantee reaching the goal — there's always some probability of moving sideways. The optimal policy must account for this uncertainty, making it a risk-sensitive planning problem rather than a pure shortest-path problem.

Problem 5: Prove the recursive return relationship $G_t = R_{t+1} + \gamma G_{t+1}$ starting from the definition $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.

Answers (click to expand)

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ $= R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + \cdots)$ $= R_{t+1} + \gamma G_{t+1}$. This simple recurrence is the foundation for all dynamic programming and temporal difference methods in RL.

Summary

Key takeaways:

An MDP is defined by $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ — master this tuple; every RL algorithm is built on it
The Markov property means the future depends only on the present state and action, not on history
The return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ and its recursive form $G_t = R_{t+1} + \gamma G_{t+1}$ underpin all value-based methods
$V^\pi(s)$ is expected return from state $s$ under policy $\pi$; $Q^\pi(s,a)$ is expected return from $(s,a)$
$\gamma < 1$ is required for continuing tasks to keep returns bounded; $\gamma \approx 1$ makes the agent far-sighted
MDPs can be episodic (finite horizon) or continuing (infinite horizon); algorithms must handle both

Quiz

Question 1: Which of the following is NOT a component of an MDP tuple?

A. State space $\mathcal{S}$ B. Action space $\mathcal{A}$ C. Policy $\pi$ D. Discount factor $\gamma$

Correct Answer: C. Policy $\pi$