
23-04 — Monte Carlo Methods

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-04 Prerequisites: 23-01 — MDPs, 23-02 — Bellman Equations, Phase 10 (Probability Theory) Next subject: 23-05 — Temporal Difference Learning

Learning Objectives

By the end of this subject, you will be able to:

Explain how Monte Carlo methods estimate value functions from complete episodes without a model
Differentiate first-visit and every-visit MC and prove both converge to $V^\pi$
Implement MC control with exploring starts and $\varepsilon$-greedy policies
Apply importance sampling for off-policy evaluation and understand its variance properties
Compare MC methods to DP and understand the bias-variance trade-off

Core Content

What Are Monte Carlo Methods?

Monte Carlo (MC) methods learn value functions from complete episodes of experience. Unlike DP, they require NO model of the environment — only sample trajectories. The key idea: the value of a state is the expected return, and we estimate this expectation by averaging observed returns.

⚠️ CRITICAL: MC methods require episodic tasks — each episode must terminate. You cannot apply basic MC to continuing tasks (though there are workarounds with discounting and truncation).

MC Prediction (Policy Evaluation)

Goal: estimate $V^\pi(s)$ from episodes following $\pi$.

First-visit MC: For each state $s$, average the returns following the first visit to $s$ in each episode.

Every-visit MC: For each state $s$, average the returns following every visit to $s$ in each episode.

Algorithm (first-visit):

1. Generate an episode $S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_T$ using $\pi$
2. For each state $s$ appearing in the episode:
   - $G \leftarrow$ return following the first occurrence of $s$
   - Append $G$ to $\text{Returns}(s)$
   - $V(s) \leftarrow \text{average}(\text{Returns}(s))$

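Both first-visit and every-visit converge to $V^\pi(s)$ as the number of visits to $s$ goes to infinity. First-visit is unbiased; every-visit is biased for a finite number of episodes, because returns from repeated visits within one episode overlap and are correlated, but it uses more samples per episode and typically has lower variance.

Convergence proof (first-visit): Episodes are generated independently under $\pi$, so the first-visit returns to $s$ are i.i.d. samples with mean $V^\pi(s)$; their variance is finite whenever returns are bounded. By the law of large numbers, the sample mean converges to the true expected value $V^\pi(s)$.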
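The two averaging rules can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `mc_prediction` and the episode encoding (a list of `(state, reward)` pairs, where the reward is the one received on leaving that state) are assumptions made here. The data are the three episodes of Example 1 in the Worked Examples section.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) by averaging sampled returns.

    Each episode is a list of (state, reward) pairs; `reward` is the
    reward received after leaving `state` (an encoding assumed here).
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Backward pass: compute the return G_t for every time step t.
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, reward = episode[t]
            G = reward + gamma * G
            returns_at[t] = G
        # Forward pass: record returns, skipping repeat visits if requested.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue
            seen.add(state)
            returns[state].append(returns_at[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


episodes = [
    [("s1", 0.0), ("s2", 0.0), ("s3", 10.0)],
    [("s1", 0.0), ("s1", 0.0), ("s3", 5.0)],
    [("s2", 0.0), ("s3", 8.0)],
]
v_first = mc_prediction(episodes, first_visit=True)
v_every = mc_prediction(episodes, first_visit=False)
print(v_first["s1"], v_first["s2"])  # 7.5 9.0
print(v_every["s1"])                 # (10 + 5 + 5) / 3, about 6.67
```

The only difference between the two estimators is the `seen` check: every-visit also counts the second visit to `s1` in the second episode, which pulls its estimate down.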

Incremental Implementation

To avoid storing all returns, use an incremental update:

$$V(S_t) \leftarrow V(S_t) + \alpha \big[G_t - V(S_t)\big]$$

where $\alpha = 1/N(S_t)$ for the sample average, or a constant $\alpha \in (0,1]$ for a moving average (useful for non-stationary problems).

For constant $\alpha$, this becomes an exponential recency-weighted average:

$$V_{n+1} = (1-\alpha)^n V_1 + \sum_{i=1}^n \alpha(1-\alpha)^{n-i} G_i$$
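As a quick check of both formulas, the sketch below compares the incremental update with $\alpha = 1/n$ against a plain batch average, and the constant-$\alpha$ update against the closed-form recency-weighted sum. The function names and the sample returns are illustrative assumptions.

```python
def incremental_mean(returns):
    """Apply V <- V + (1/n) * (G - V) for each return in order."""
    v, n = 0.0, 0
    for g in returns:
        n += 1
        v += (g - v) / n
    return v


def constant_alpha(returns, alpha, v1=0.0):
    """Apply V <- V + alpha * (G - V) for each return, starting from v1."""
    v = v1
    for g in returns:
        v += alpha * (g - v)
    return v


def closed_form(returns, alpha, v1=0.0):
    """Evaluate (1 - alpha)^n * V_1 + sum_i alpha * (1 - alpha)^(n - i) * G_i."""
    n = len(returns)
    weighted = sum(
        alpha * (1 - alpha) ** (n - i) * g
        for i, g in enumerate(returns, start=1)
    )
    return (1 - alpha) ** n * v1 + weighted


sample_returns = [10.0, 5.0, 8.0]  # made-up returns for illustration
batch_average = sum(sample_returns) / len(sample_returns)
print(incremental_mean(sample_returns), batch_average)
print(constant_alpha(sample_returns, 0.5, v1=2.0),
      closed_form(sample_returns, 0.5, v1=2.0))
```

Each printed pair should agree up to floating-point rounding. With constant $\alpha$ the most recent return carries weight $\alpha$ and older returns decay geometrically, which is why this variant tracks non-stationary targets.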

MC Control (Finding the Optimal Policy)

Monte Carlo control uses Generalized Policy Iteration with MC for evaluation.

Problem: If $\pi$ is deterministic, many state-action pairs are never visited — we can't evaluate or improve them.

Solution 1 — Exploring Starts: Start each episode in a randomly chosen state-action pair, ensuring all pairs are visited infinitely often. Theoretically clean but impractical (can't always control the starting state).

Solution 2 — $\varepsilon$-greedy policies: With probability $1-\varepsilon$, take the greedy action; with probability $\varepsilon$, take a random action:

$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{|\mathcal{A}(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \frac{\varepsilon}{|\mathcal{A}(s)|} & \text{otherwise} \end{cases}$$

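Because the random action is drawn uniformly from all of $\mathcal{A}(s)$ (the greedy action included), $\pi(a \mid s) \geq \varepsilon/|\mathcal{A}(s)| > 0$ for every action, so each action is tried infinitely often in every state that is visited infinitely often.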
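A minimal sketch of the $\varepsilon$-greedy probabilities follows. The function name `epsilon_greedy_probs` is an assumption, and ties in $Q$ are broken in favour of the lowest action index, which is one arbitrary choice among several.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return the epsilon-greedy action probabilities for one state.

    Every action gets epsilon / |A|; the greedy action additionally
    receives the remaining 1 - epsilon.
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])  # first index wins ties
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


probs = epsilon_greedy_probs([10.0, 8.0, 7.0, 9.0], epsilon=0.1)
print(probs)
```

For these inputs the result matches Example 2 up to floating-point rounding: roughly 0.925 for the greedy action and 0.025 for each of the others.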

Algorithm (MC control with $\varepsilon$-greedy):

1. Initialize $Q(s,a)$ arbitrarily and $\pi$ as $\varepsilon$-greedy w.r.t. $Q$
2. Loop (for each episode):
   - Generate an episode using $\pi$
   - For each $(S_t, A_t)$ in the episode: $G \leftarrow$ return from $t$; $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[G - Q(S_t, A_t)]$
   - Update $\pi$ to $\varepsilon$-greedy w.r.t. $Q$
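The loop above can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the two-stage MDP (`NEXT_STATE`, `FINAL_REWARD`), the fixed $\varepsilon = 0.2$ (so this run is not GLIE), and the sample-average step size $\alpha = 1/N(s,a)$ in place of a constant $\alpha$.

```python
import random
from collections import defaultdict

# Hypothetical two-stage episodic MDP (not from the text).
# State 0: action 0 -> state 1, action 1 -> state 2, reward 0 either way.
# States 1 and 2: any action ends the episode with the reward listed below.
NEXT_STATE = {0: {0: 1, 1: 2}}
FINAL_REWARD = {1: {0: 1.0, 1: 2.0}, 2: {0: 0.0, 1: 10.0}}
ACTIONS = (0, 1)


def epsilon_greedy_action(Q, state, epsilon, rng):
    """Uniform random action with probability epsilon, otherwise greedy."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def generate_episode(Q, epsilon, rng):
    """Return one episode as a list of (state, action, reward) triples."""
    a0 = epsilon_greedy_action(Q, 0, epsilon, rng)
    s1 = NEXT_STATE[0][a0]
    a1 = epsilon_greedy_action(Q, s1, epsilon, rng)
    return [(0, a0, 0.0), (s1, a1, FINAL_REWARD[s1][a1])]


def mc_control(num_episodes=5000, epsilon=0.2, gamma=1.0, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q(s, a), initialised to 0
    N = defaultdict(int)    # visit counts, giving alpha = 1 / N(s, a)
    for _ in range(num_episodes):
        episode = generate_episode(Q, epsilon, rng)
        G = 0.0
        # Walk backwards so G holds the return from each time step.
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
        # Policy improvement is implicit: the next episode acts
        # epsilon-greedily with respect to the updated Q.
    return Q


Q = mc_control()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in (0, 1, 2)}
print(greedy)
```

In this MDP the best return is 10 (action 1 in state 0, then action 1 in state 2), so the learned greedy policy should pick action 1 in every state. Note that $Q(0, 1)$ settles below 10 because the policy being evaluated still explores in state 2.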

GLIE condition (Greedy in the Limit with Infinite Exploration): For convergence to optimality, $\varepsilon$ must decay to zero over time (e.g., $\varepsilon_k = 1/k$) while ensuring all state-action pairs are visited infinitely often.

On-Policy vs. Off-Policy

On-policy: Learn about the same policy that generates behavior. Simple, but the policy being evaluated must explore (e.g., $\varepsilon$-greedy, which is suboptimal).
Off-policy: Learn about a target policy $\pi$ while following a behavior policy $b \neq \pi$. Enables learning optimal policies while exploring, and learning from other agents' or humans' data.

Off-Policy MC with Importance Sampling

For off-policy evaluation, we need to correct for the fact that actions were chosen by $b$, not $\pi$. Importance sampling reweights returns by the probability ratio:

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

Ordinary Importance Sampling:

$$V^\pi(s) = \mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t = s]$$

Unbiased but can have infinite variance (when $\pi$ and $b$ differ significantly, $\rho$ can be very large).

Weighted Importance Sampling:

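$$\hat{V}(s) = \frac{\sum_{i} \rho^i_{t:T-1} G^i_t}{\sum_{i} \rho^i_{t:T-1}}$$

Here $i$ indexes the episodes in which $s$ is visited. The estimate is biased for finite samples, but its variance is bounded when returns are bounded, and it converges to $V^\pi(s)$. Strongly preferred in practice.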
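The two estimators can be compared on a tiny data set. This is a sketch under stated assumptions: the helper names, the one-step problem, the two policies, and the three returns are all made up for illustration.

```python
def importance_ratio(steps, pi, b):
    """Product of pi(a|s) / b(a|s) over the (state, action) pairs of one episode."""
    rho = 1.0
    for state, action in steps:
        rho *= pi(state, action) / b(state, action)
    return rho


def ordinary_is(rhos, returns):
    """Average of rho * G over all episodes."""
    return sum(r * g for r, g in zip(rhos, returns)) / len(returns)


def weighted_is(rhos, returns):
    """Average of G weighted by rho, normalised by the total weight."""
    total_weight = sum(rhos)
    if total_weight == 0.0:
        return 0.0  # no usable episode; 0 is an arbitrary default
    return sum(r * g for r, g in zip(rhos, returns)) / total_weight


# Hypothetical one-step problem: the target policy always picks "a1";
# the behavior policy picks "a1" or "a2" with probability 0.5 each.
def target(state, action):
    return 1.0 if action == "a1" else 0.0


def behavior(state, action):
    return 0.5


episodes = [
    ([("s", "a1")], 4.0),
    ([("s", "a2")], 7.0),
    ([("s", "a1")], 6.0),
]
rhos = [importance_ratio(steps, target, behavior) for steps, _ in episodes]
returns = [g for _, g in episodes]
print(rhos)                        # [2.0, 0.0, 2.0]
print(ordinary_is(rhos, returns))  # 20 / 3, about 6.67
print(weighted_is(rhos, returns))  # 5.0
```

The weighted estimate is a weighted average of the returns observed under the target's action (4 and 6), so it stays within their range. The ordinary estimate is not constrained this way and here lands above both, which hints at where its extra variance comes from.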

⚠️ CRITICAL — Common Pitfall: Importance sampling ratios are products over entire trajectories. If $\pi$ and $b$ differ substantially, the product can grow or shrink exponentially in trajectory length. This is the "curse of horizon" — off-policy MC is impractical for long episodes without additional variance reduction techniques.

Key Terms

GLIE condition
Generalized Policy Iteration
Importance sampling

Worked Examples

Example 1: First-Visit MC Estimation

Consider 3 episodes in a 3-state MDP with one action:

Episode 1: $s_1, R{=}0, s_2, R{=}0, s_3, R{=}10$ (terminal)
Episode 2: $s_1, R{=}0, s_1, R{=}0, s_3, R{=}5$ (terminal)
Episode 3: $s_2, R{=}0, s_3, R{=}8$ (terminal)

Estimate $V(s_1)$ and $V(s_2)$ using first-visit MC with $\gamma = 1$.

Solution:

Episode 1 returns from first visits:
- $s_1$: $G = 0 + 0 + 10 = 10$
- $s_2$: $G = 0 + 10 = 10$
- $s_3$: $G = 10$

Episode 2:
- $s_1$ (first visit at $t=0$): $G = 0 + 0 + 5 = 5$. Second visit to $s_1$ at $t=1$ is ignored.
- $s_3$: $G = 5$

Episode 3:
- $s_2$: $G = 0 + 8 = 8$
- $s_3$: $G = 8$

$V(s_1) = (10 + 5)/2 = 7.5$
$V(s_2) = (10 + 8)/2 = 9.0$

Answer:

$V(s_1) = 7.5$, $V(s_2) = 9.0$. Note: $s_1$ appears twice in Episode 2, but first-visit MC counts only the return from its first occurrence in that episode.

Example 2: $\varepsilon$-greedy Action Probabilities

MDP with $|\mathcal{A}(s)| = 4$. Current $Q(s, a)$ values: $[10, 8, 7, 9]$. Compute the $\varepsilon$-greedy policy for $\varepsilon = 0.1$.

Solution:

Greedy action: $a_1$ (value 10). Non-greedy actions: $a_2, a_3, a_4$.

$\pi(a_1 \mid s) = 1 - 0.1 + 0.1/4 = 0.9 + 0.025 = 0.925$
$\pi(a_2 \mid s) = \pi(a_3 \mid s) = \pi(a_4 \mid s) = 0.1/4 = 0.025$

Answer:

$\pi = [0.925, 0.025, 0.025, 0.025]$. The greedy action is chosen 92.5% of the time; each non-greedy action gets 2.5%.

Example 3: Importance Sampling Ratio

Target policy $\pi$ takes action $a_1$ in every state. Behavior policy $b$ is uniform over the four actions, so it takes each of $a_1, a_2, a_3, a_4$ with probability 0.25. Trajectory: $s, a_1, s', a_1, s'', a_2, \text{terminal}$. Compute $\rho$.

Solution:

$\rho_{0:2} = \frac{\pi(a_1 \mid s)}{b(a_1 \mid s)} \cdot \frac{\pi(a_1 \mid s')}{b(a_1 \mid s')} \cdot \frac{\pi(a_2 \mid s'')}{b(a_2 \mid s'')} = \frac{1}{0.25} \cdot \frac{1}{0.25} \cdot \frac{0}{0.25} = 0$

Answer:

$\rho = 0$ because $\pi$ never takes $a_2$, the last action. Under ordinary importance sampling this episode adds zero to the sum of weighted returns but still counts in the average over episodes; under weighted importance sampling it adds zero to both numerator and denominator, so it is effectively discarded. When $\pi$ and $b$ differ significantly, many trajectories may have zero or negligible weight.

Practice Problems

Problem 1: In first-visit MC, show that the estimator is unbiased: $\mathbb{E}[\frac{1}{n}\sum_{i=1}^n G_i] = V^\pi(s)$, where $G_i$ are independent returns from first visits to $s$.

Answer:

Each $G_i$ is the return from the first visit to $s$ in episode $i$. Since episodes are independent (generated by following $\pi$ from starting states), each $G_i \sim p(G \mid s, \pi)$ independently. By definition, $\mathbb{E}[G_i \mid s] = V^\pi(s)$. By linearity of expectation, $\mathbb{E}[\frac{1}{n}\sum G_i] = \frac{1}{n}\sum \mathbb{E}[G_i] = V^\pi(s)$. Done.

Problem 2: Prove that ordinary importance sampling is unbiased: $\mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t = s] = V^\pi(s)$.

Answer:

Assume coverage: $b(a \mid s) > 0$ whenever $\pi(a \mid s) > 0$, so the ratio is well defined on every trajectory that $\pi$ can generate.

$\mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t = s]$

$= \mathbb{E}_b\left[\prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)} \sum_{k=t}^{T-1} \gamma^{k-t} R_{k+1} \;\middle|\; S_t = s\right]$

By the definition of expectation under $b$, expanding the joint probability over trajectories $\tau$:

$= \sum_{\tau} \Pr(\tau \mid b, S_t=s) \cdot \rho_{t:T-1} \cdot G_t$

$= \sum_{\tau} \Pr(\tau \mid \pi, S_t=s) \cdot G_t$

$= \mathbb{E}_\pi[G_t \mid S_t = s] = V^\pi(s)$

The key step is $\Pr(\tau \mid b) \cdot \rho_{t:T-1} = \Pr(\tau \mid \pi)$. Writing $\Pr(\tau \mid b, S_t = s) = \prod_{k=t}^{T-1} b(A_k \mid S_k)\, P(S_{k+1}, R_{k+1} \mid S_k, A_k)$, multiplying by $\rho_{t:T-1}$ replaces every factor $b(A_k \mid S_k)$ with $\pi(A_k \mid S_k)$, while the transition probabilities are untouched because both policies face the same environment dynamics $P$.
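The identity can also be checked exactly on a small case by enumerating every trajectory. The sketch below assumes a hypothetical two-step problem with a single state, two actions, and deterministic rewards; the policies and reward values are made up, and `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction
from itertools import product

# Hypothetical policies and rewards (single state, so pi and b depend
# only on the action). Both policies give every action positive probability.
pi = {"a1": Fraction(9, 10), "a2": Fraction(1, 10)}
b = {"a1": Fraction(1, 2), "a2": Fraction(1, 2)}
reward = {"a1": 1, "a2": 3}


def rho(trajectory):
    """Importance sampling ratio: product of pi(a) / b(a) along the trajectory."""
    ratio = Fraction(1)
    for action in trajectory:
        ratio *= pi[action] / b[action]
    return ratio


def expected_return(policy, weight=lambda trajectory: 1):
    """Sum of Pr(trajectory | policy) * weight * G over all two-step trajectories."""
    total = Fraction(0)
    for trajectory in product(policy, repeat=2):
        prob = policy[trajectory[0]] * policy[trajectory[1]]
        G = reward[trajectory[0]] + reward[trajectory[1]]  # gamma = 1
        total += prob * weight(trajectory) * G
    return total


v_pi = expected_return(pi)                # E_pi[G]
naive = expected_return(b)                # E_b[G], no correction
corrected = expected_return(b, weight=rho)  # E_b[rho * G]
print(v_pi, naive, corrected)  # 12/5 4 12/5
```

Without the ratio, averaging returns under $b$ estimates $V^b$ (here 4) rather than $V^\pi$ (here 12/5); with the ratio the two expectations coincide exactly.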

Problem 3: Explain why weighted importance sampling is biased but preferred in practice. What is the source of the bias?

Answer:

Weighted importance sampling: $\hat{V}(s) = \frac{\sum_i \rho_i G_i}{\sum_i \rho_i}$.

Bias source: The denominator $\sum \rho_i$ is itself a random variable, and the expectation of a ratio is not the ratio of expectations: $\mathbb{E}[\frac{\sum \rho_i G_i}{\sum \rho_i}] \neq \frac{\mathbb{E}[\sum \rho_i G_i]}{\mathbb{E}[\sum \rho_i]}$ in general. However, the bias vanishes as $n \to \infty$ (the estimator is *consistent*).

Preferred because: Variance is bounded for bounded returns, unlike ordinary IS, where $\rho$ can be arbitrarily large. Weighted IS normalizes the weights to sum to 1, so the estimate is a convex combination of the observed returns and can never leave their range.

Problem 4: MC methods have high variance. Why? When is MC variance especially problematic?

Answer:

MC methods use complete returns $G_t = \sum_{k \geq 0} \gamma^k R_{t+k+1}$. Each return is the sum of many random variables (rewards and state transitions), so variance accumulates over the horizon:

$$\text{Var}(G_t) \approx \text{Var}(R) \cdot \frac{1}{1-\gamma^2}$$

This holds for i.i.d. rewards, $\gamma < 1$, and a long horizon, since $\sum_{k \geq 0} \gamma^{2k} = 1/(1-\gamma^2)$.

Variance is especially problematic when:
1. Episodes are long, so many random events compound
2. $\gamma$ is close to 1, giving a long effective horizon
3. Rewards are noisy (high variance)
4. The environment is highly stochastic

This motivates TD methods (23-05), which bootstrap to reduce variance.
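To see how quickly the variance grows with $\gamma$, the following sketch sums the geometric series directly and compares it with the closed form. It assumes independent rewards with a common variance, as in the formula above; the function names are illustrative.

```python
def discounted_return_variance(reward_var, gamma, horizon):
    """Variance of sum_{k < horizon} gamma^k * R_k for independent rewards."""
    return reward_var * sum(gamma ** (2 * k) for k in range(horizon))


def infinite_horizon_variance(reward_var, gamma):
    """Limit of the sum above as horizon -> infinity (requires gamma < 1)."""
    return reward_var / (1.0 - gamma ** 2)


for gamma in (0.5, 0.9, 0.99):
    finite = discounted_return_variance(1.0, gamma, horizon=5000)
    limit = infinite_horizon_variance(1.0, gamma)
    print(gamma, finite, limit)

# With gamma = 1 there is no finite limit: variance grows linearly in the horizon.
undiscounted = discounted_return_variance(2.0, 1.0, horizon=50)
print(undiscounted)  # 100.0
```

Moving $\gamma$ from 0.9 to 0.99 raises the variance multiplier from about 5.3 to about 50, which is the "long effective horizon" effect listed above.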

Problem 5: Implement the GLIE condition. If $\varepsilon_k = 1/k$, how many episodes until $\varepsilon < 0.01$? What property does decaying $\varepsilon$ satisfy?

Answer:

$\varepsilon_k = 1/k < 0.01 \implies k > 100$, so $k = 101$ is the first episode with $\varepsilon_k < 0.01$.

GLIE requires:
1. $\lim_{k \to \infty} N_k(s, a) = \infty$ (every state-action pair visited infinitely often). This needs $\varepsilon_k > 0$ for all $k$ and a decay slow enough that exploration never effectively stops; $\varepsilon_k = 1/k$ is the standard example, since $\sum_k 1/k$ diverges.
2. $\lim_{k \to \infty} \pi_k(a \mid s) = 1$ for greedy actions, which is satisfied since $\varepsilon_k \to 0$.

Together these let the policy become greedy in the limit while still exploring sufficiently, which is what convergence to an optimal policy requires.
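The threshold can be confirmed numerically. The sketch below also prints partial sums of the schedule to illustrate that the total amount of exploration keeps growing even though $\varepsilon_k \to 0$; the function name `epsilon_schedule` is an assumption.

```python
from itertools import count


def epsilon_schedule(k):
    """Exploration rate for episode k (1-indexed): epsilon_k = 1 / k."""
    return 1.0 / k


# First episode index whose exploration rate is strictly below 0.01.
first_k = next(k for k in count(1) if epsilon_schedule(k) < 0.01)
print(first_k)  # 101

# Partial sums of epsilon_k grow without bound (harmonic series),
# even though the individual terms shrink to zero.
partial_sums = {
    n: sum(epsilon_schedule(k) for k in range(1, n + 1))
    for n in (10, 1000, 100000)
}
print(partial_sums)
```

The partial sums grow roughly like $\ln n$, so they are unbounded but increase slowly, which is one reason GLIE schedules can converge slowly in practice.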

Summary

Key takeaways:

MC methods learn from complete episodes without a model — just sample returns
First-visit MC is unbiased; every-visit MC is biased but often lower-variance
MC control uses $\varepsilon$-greedy exploration to ensure all actions are visited; GLIE conditions guarantee convergence to optimality
Off-policy MC uses importance sampling to evaluate a target policy from behavior data — weighted importance sampling is preferred for bounded variance
MC has high variance (full returns over long horizons) but zero bias — the opposite of bootstrapping methods
MC is simple, intuitive, and doesn't require the Markov property (works for POMDPs), but cannot handle continuing tasks natively

Quiz

Question 1: What is the fundamental requirement for Monte Carlo methods?

A. A model of the environment dynamics
B. Complete episodes that terminate
C. A continuous state space
D. Deterministic policies

Correct Answer: B