Math graphic
📐 Concept diagram

23-04 — Monte Carlo Methods

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-04 Prerequisites: 23-01 — MDPs, 23-02 — Bellman Equations, Phase 10 (Probability Theory) Next subject: 23-05 — Temporal Difference Learning


Learning Objectives

By the end of this subject, you will be able to:

  1. Explain how Monte Carlo methods estimate value functions from complete episodes without a model
  2. Differentiate first-visit and every-visit MC and prove both converge to $V^\pi$
  3. Implement MC control with exploring starts and $\varepsilon$-greedy policies
  4. Apply importance sampling for off-policy evaluation and understand its variance properties
  5. Compare MC methods to DP and understand the bias-variance trade-off

Core Content

What Are Monte Carlo Methods?

Monte Carlo (MC) methods learn value functions from complete episodes of experience. Unlike DP, they require NO model of the environment — only sample trajectories. The key idea: the value of a state is the expected return, and we estimate this expectation by averaging observed returns.

⚠️ CRITICAL: MC methods require episodic tasks — each episode must terminate. You cannot apply basic MC to continuing tasks (though there are workarounds with discounting and truncation).

MC Prediction (Policy Evaluation)

Goal: estimate $V^\pi(s)$ from episodes following $\pi$.

First-visit MC: For each state $s$, average the returns following the first visit to $s$ in each episode.

Every-visit MC: For each state $s$, average the returns following every visit to $s$ in each episode.

Algorithm (first-visit): 1. Generate an episode $S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_T$ using $\pi$ 2. For each state $s$ appearing in the episode: - $G \leftarrow$ return following the first occurrence of $s$ - Append $G$ to $\text{Returns}(s)$ - $V(s) \leftarrow \text{average}(\text{Returns}(s))$

Both first-visit and every-visit converge to $V^\pi$ as the number of visits goes to infinity. First-visit is unbiased; every-visit is biased but typically has lower variance.

Convergence proof (first-visit): Each return is an i.i.d. estimate of $V^\pi(s)$ with finite variance (since episodes are independent). By the law of large numbers, the sample mean converges to the true expected value.

Incremental Implementation

To avoid storing all returns, use an incremental update:

$$V(S_t) \leftarrow V(S_t) + \alpha \big[G_t - V(S_t)\big]$$

where $\alpha = 1/N(S_t)$ for the sample average, or a constant $\alpha \in (0,1]$ for a moving average (useful for non-stationary problems).

For constant $\alpha$, this becomes an exponential recency-weighted average:

$$V_{n+1} = (1-\alpha)^n V_1 + \sum_{i=1}^n \alpha(1-\alpha)^{n-i} G_i$$

MC Control (Finding the Optimal Policy)

Monte Carlo control uses Generalized Policy Iteration with MC for evaluation.

Problem: If $\pi$ is deterministic, many state-action pairs are never visited — we can't evaluate or improve them.

Solution 1 — Exploring Starts: Start each episode in a randomly chosen state-action pair, ensuring all pairs are visited infinitely often. Theoretically clean but impractical (can't always control the starting state).

Solution 2 — $\varepsilon$-greedy policies: With probability $1-\varepsilon$, take the greedy action; with probability $\varepsilon$, take a random action:

$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{|\mathcal{A}(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \ \frac{\varepsilon}{|\mathcal{A}(s)|} & \text{otherwise} \end{cases}$$

This ensures $\pi(a \mid s) \geq \varepsilon/|\mathcal{A}(s)| > 0$, so all actions are explored infinitely often.

Algorithm (MC control with $\varepsilon$-greedy):

  1. Initialize $Q(s,a)$ arbitrarily, $\pi$ as $\varepsilon$-greedy w.r.t $Q$
  2. Loop:
  3. Generate episode using $\pi$
  4. For each $(S_t, A_t)$ in episode: $G \leftarrow$ return from $t$; $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[G - Q(S_t, A_t)]$
  5. Update $\pi$ to $\varepsilon$-greedy w.r.t. $Q$

GLIE condition (Greedy in the Limit with Infinite Exploration): For convergence to optimality, $\varepsilon$ must decay to zero over time (e.g., $\varepsilon_k = 1/k$) while ensuring all state-action pairs are visited infinitely often.

On-Policy vs. Off-Policy

Off-Policy MC with Importance Sampling

For off-policy evaluation, we need to correct for the fact that actions were chosen by $b$, not $\pi$. Importance sampling reweights returns by the probability ratio:

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

Ordinary Importance Sampling:

$$V^\pi(s) = \mathbb{E}b[\rho{t:T-1} G_t \mid S_t = s]$$

Unbiased but can have infinite variance (when $\pi$ and $b$ differ significantly, $\rho$ can be very large).

Weighted Importance Sampling:

$$V^\pi(s) = \frac{\sum_{i} \rho^i_{t:T-1} G^i_t}{\sum_{i} \rho^i_{t:T-1}}$$

Biased but with bounded variance, and converges to $V^\pi$. Strongly preferred in practice.

⚠️ CRITICAL — Common Pitfall: Importance sampling ratios are products over entire trajectories. If $\pi$ and $b$ differ substantially, the product can grow or shrink exponentially in trajectory length. This is the "curse of horizon" — off-policy MC is impractical for long episodes without additional variance reduction techniques.



Key Terms

Worked Examples

Example 1: First-Visit MC Estimation

Consider 3 episodes in a 3-state MDP with one action:

Episode 1: $s_1, R{=}0, s_2, R{=}0, s_3, R{=}10$ (terminal) Episode 2: $s_1, R{=}0, s_1, R{=}0, s_3, R{=}5$ (terminal) Episode 3: $s_2, R{=}0, s_3, R{=}8$ (terminal)

Estimate $V(s_1)$ and $V(s_2)$ using first-visit MC with $\gamma = 1$.

Solution:

Episode 1 returns from first visits: - $s_1$: $G = 0 + 0 + 10 = 10$ - $s_2$: $G = 0 + 10 = 10$ - $s_3$: $G = 10$

Episode 2: - $s_1$ (first visit at $t=0$): $G = 0 + 0 + 5 = 5$. Second visit to $s_1$ at $t=1$ is ignored. - $s_3$: $G = 5$

Episode 3: - $s_2$: $G = 0 + 8 = 8$ - $s_3$: $G = 8$

$V(s_1) = (10 + 5)/2 = 7.5$ $V(s_2) = (10 + 8)/2 = 9.0$

Click for answer $V(s_1) = 7.5$, $V(s_2) = 9.0$. Note: $s_2$ appears in all three episodes, but first-visit only counts the first occurrence in each.

Example 2: $\varepsilon$-greedy Action Probabilities

MDP with $|\mathcal{A}(s)| = 4$. Current $Q(s, a)$ values: $[10, 8, 7, 9]$. Compute the $\varepsilon$-greedy policy for $\varepsilon = 0.1$.

Solution:

Greedy action: $a_1$ (value 10). Non-greedy actions: $a_2, a_3, a_4$.

$\pi(a_1 \mid s) = 1 - 0.1 + 0.1/4 = 0.9 + 0.025 = 0.925$ $\pi(a_2 \mid s) = \pi(a_3 \mid s) = \pi(a_4 \mid s) = 0.1/4 = 0.025$

Click for answer $\pi = [0.925, 0.025, 0.025, 0.025]$. The greedy action is chosen 92.5% of the time; each non-greedy action gets 2.5%.

Example 3: Importance Sampling Ratio

Target policy $\pi$ always takes action $a_1$ from $s$. Behavior policy $b$ takes $a_1$ with probability 0.2 and $a_2, a_3, a_4$ each with 0.2 (uniform). Trajectory: $s, a_1, s', a_1, s'', a_2, \text{terminal}$. Compute $\rho$.

Solution:

$\rho_{0:2} = \frac{\pi(a_1 \mid s)}{b(a_1 \mid s)} \cdot \frac{\pi(a_1 \mid s')}{b(a_1 \mid s')} \cdot \frac{\pi(a_2 \mid s'')}{b(a_2 \mid s'')} = \frac{1}{0.2} \cdot \frac{1}{0.2} \cdot \frac{0}{0.25} = 0$

Click for answer $\rho = 0$ because $a_2$ (the last action) is never taken by $\pi$. This episode contributes zero to the ordinary importance sampling estimate and is effectively discarded under weighted importance sampling. When $\pi$ and $b$ differ significantly, many trajectories may have zero or negligible weight.

Practice Problems

Problem 1: In first-visit MC, show that the estimator is unbiased: $\mathbb{E}[\frac{1}{n}\sum_{i=1}^n G_i] = V^\pi(s)$, where $G_i$ are independent returns from first visits to $s$.

Answers (click to expand) Each $G_i$ is the return from the first visit to $s$ in episode $i$. Since episodes are independent (generated by following $\pi$ from starting states), each $G_i \sim p(G \mid s, \pi)$ independently. By definition, $\mathbb{E}[G_i \mid s] = V^\pi(s)$. By linearity of expectation, $\mathbb{E}[\frac{1}{n}\sum G_i] = \frac{1}{n}\sum \mathbb{E}[G_i] = V^\pi(s)$. Done.

Problem 2: Prove that ordinary importance sampling is unbiased: $\mathbb{E}b[\rho{t:T-1} G_t \mid S_t = s] = V^\pi(s)$.

Answers (click to expand) $\mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t = s]$ $= \mathbb{E}_b\left[\prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)} \sum_{k=t}^{T-1} \gamma^{k-t} R_{k+1} \;\middle|\; S_t = s\right]$ By the definition of expectation under $b$, expanding the joint probability: $= \sum_{\tau} \Pr(\tau \mid b, S_t=s) \cdot \rho_{t:T-1} \cdot G_t$ $= \sum_{\tau} \Pr(\tau \mid \pi, S_t=s) \cdot G_t$ $= \mathbb{E}_\pi[G_t \mid S_t = s] = V^\pi(s)$ The key step: $\Pr(\tau \mid b) \cdot \rho_{t:T-1} = \Pr(\tau \mid \pi)$ because the transition probabilities cancel (both policies face the same environment dynamics $P$).

Problem 3: Explain why weighted importance sampling is biased but preferred in practice. What is the source of the bias?

Answers (click to expand) Weighted importance sampling: $\hat{V}(s) = \frac{\sum_i \rho_i G_i}{\sum_i \rho_i}$. Bias source: The denominator $\sum \rho_i$ is itself a random variable. $\mathbb{E}[\frac{\sum \rho_i G_i}{\sum \rho_i}] \neq \frac{\mathbb{E}[\sum \rho_i G_i]}{\mathbb{E}[\sum \rho_i]}$ in general. However, the bias vanishes as $n \to \infty$ (the estimator is *consistent*). Preferred because: Variance is bounded — unlike ordinary IS, where $\rho$ can be arbitrarily large. Weighted IS normalizes the weights to sum to 1, so each episode contributes at most proportionally to its weight.

Problem 4: MC methods have high variance. Why? When is MC variance especially problematic?

Answers (click to expand) MC methods use complete returns $G_t = \sum \gamma^k R_{t+k+1}$. Each return is the sum of many random variables (rewards and state transitions), so variance accumulates over the horizon: $\text{Var}(G_t) \approx \text{Var}(R) \cdot \frac{1}{1-\gamma^2}$ (if rewards are i.i.d.) Variance is especially problematic when: 1. Episodes are long — many random events compound 2. $\gamma$ is close to 1 — long effective horizon 3. Rewards are noisy (high variance) 4. The environment is highly stochastic This is why TD methods (23-05) were developed — they bootstrap to reduce variance.

Problem 5: Implement the GLIE condition. If $\varepsilon_k = 1/k$, how many episodes until $\varepsilon < 0.01$? What property does decaying $\varepsilon$ satisfy?

Answers (click to expand) $\varepsilon_k = 1/k < 0.01 \implies k > 100$. After 101 episodes, $\varepsilon < 0.01$. GLIE requires: 1. $\lim_{k \to \infty} N_k(s, a) = \infty$ (every state-action pair visited infinitely often) — satisfied if $\varepsilon_k > 0$ for all $k$ 2. $\lim_{k \to \infty} \pi_k(a \mid s) = 1$ for greedy actions — satisfied since $\varepsilon_k \to 0$ This ensures the policy converges to the optimal deterministic policy while exploring sufficiently.

Summary

Key takeaways:


Quiz

Question 1: What is the fundamental requirement for Monte Carlo methods?

A. A model of the environment dynamics B. Complete episodes that terminate C. A continuous state space D. Deterministic policies

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. MC methods update from complete returns $G_t$, which require the episode to end. - If you chose A: MC is *model-free* — it doesn't need $P$ or $R$. - If you chose C: MC works with any state space, including discrete. - If you chose D: MC works with stochastic policies, which are often necessary for exploration.

Question 2: In first-visit MC, why is the estimator unbiased?

A. Because returns are normalized B. Because each return is an independent sample of the true expected return from the first visit C. Because $\gamma = 1$ D. Because the policy is deterministic

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. Each first-visit return is an i.i.d. draw from the distribution of returns starting from $s$ under $\pi$. Sample mean of i.i.d. random variables is unbiased. - If you chose A: Normalization would introduce bias (like weighted importance sampling). - If you chose C: The discount factor doesn't determine unbiasedness. - If you chose D: Policy type doesn't affect unbiasedness — any policy works.

Question 3: What is the purpose of $\varepsilon$-greedy exploration in MC control?

A. To make the value function differentiable B. To ensure all state-action pairs are visited eventually C. To reduce the discount factor D. To eliminate the need for episodes

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. $\varepsilon$-greedy ensures $\pi(a \mid s) \geq \varepsilon/|\mathcal{A}(s)| > 0$, so every action has non-zero probability. This guarantees infinite exploration. - If you chose A: Exploration doesn't affect differentiability. - If you chose C: $\varepsilon$ and $\gamma$ are independent parameters. - If you chose D: MC methods always need episodes; $\varepsilon$-greedy is about exploration within episodes.

Question 4: In off-policy MC with importance sampling, what does the importance sampling ratio $\rho$ correct for?

A. The discount factor difference between policies B. The fact that actions were chosen by $b$ instead of $\pi$ C. The difference in state transition probabilities D. The difference in reward functions

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. $\rho = \prod \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$ reweights the probability of the trajectory to match what it would be under $\pi$, correcting for the behavioral bias. - If you chose A: Both policies use the same $\gamma$. - If you chose C: Transition probabilities $P(s' \mid s,a)$ are the same under both policies (they're environment properties) — they cancel in the importance ratio. - If you chose D: Both policies face the same reward function.

Question 5: Which has lower variance: ordinary importance sampling or weighted importance sampling?

A. Ordinary importance sampling B. Weighted importance sampling C. They have identical variance D. Depends on the discount factor

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. Weighted IS normalizes weights, bounding the influence of any single episode. Ordinary IS can have unbounded variance when $\rho$ takes extreme values. - If you chose A: Ordinary IS can have infinite variance when $\pi$ and $b$ differ significantly. - If you chose C: They have very different variance properties. - If you chose D: $\gamma$ affects the magnitude of returns but not the relative variance of the two methods.

Next Steps

Next up: 23-05 — Temporal Difference Learning — where you'll learn how to combine the best of MC (model-free, from experience) and DP (bootstrapping), enabling online, incremental learning without waiting for episode completion.


Pitfalls

  1. Applying MC to continuing tasks. MC methods require complete episodes — $$G_t$$ is defined as the sum of rewards until termination. In continuing (infinite-horizon) tasks, there is no natural episode end. Attempting MC on continuing tasks requires artificial truncation, which introduces bias. Use TD methods instead.

  2. Forgetting that ordinary importance sampling can have infinite variance. When the behavior policy $$b$$ and target policy $$\pi$$ differ significantly, the importance ratio $$\rho = \prod \pi(a \mid s) / b(a \mid s)$$ can grow or shrink exponentially in the trajectory length. A single episode with enormous $$\rho$$ can dominate the estimate. Always prefer weighted importance sampling for its bounded variance.

  3. Treating every-visit MC as unbiased. Only first-visit MC is strictly unbiased — each first-visit return is an independent draw of $$V^\pi(s)$$. Every-visit returns from the same episode are correlated, introducing bias (though it vanishes asymptotically and is often offset by lower variance).

  4. Not decaying ε in ε-greedy. The GLIE (Greedy in the Limit with Infinite Exploration) conditions require ε → 0 over time. If you keep ε constant, the policy never becomes fully greedy, and MC control converges to the optimal ε-greedy policy (which is suboptimal), not the true optimal policy.




Q8: A significant advantage of MC methods over TD methods is:

A) MC can learn online without waiting for episode completion B) MC does not require the Markov property — it works for partially observable environments C) MC always converges faster D) MC can handle continuing tasks

Answer and Explanations **Correct: B)** MC averages complete returns, which are valid regardless of whether the state is Markovian. TD bootstraps from $$V(S_{t+1})$$, which assumes the state captures all relevant information (the Markov property). In POMDPs, TD can fail while MC remains unbiased (though high-variance). - A) MC requires episode completion; TD learns online — the opposite is true. - C) MC often converges slower due to higher variance. - D) MC needs episodes that terminate; TD handles continuing tasks.