Math graphic
📐 Concept diagram

23-05 — Temporal Difference Learning

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-05 Prerequisites: 23-04 — Monte Carlo Methods, 23-02 — Bellman Equations Next subject: 23-06 — Q-Learning


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the TD(0) update rule from the Bellman equation and explain what "bootstrapping" means
  2. Compare and contrast TD learning with Monte Carlo and Dynamic Programming
  3. Define the TD error and explain its role as a learning signal
  4. Implement SARSA for on-policy TD control and analyze its convergence properties
  5. Understand the bias-variance trade-off between TD and MC methods

Core Content

The Central Idea of Temporal Difference Learning

Temporal Difference (TD) learning combines two ideas: 1. From Monte Carlo: Learn directly from experience without a model 2. From Dynamic Programming: Update estimates based on other learned estimates (bootstrapping)

The key insight: instead of waiting until the end of an episode to compute $G_t$ (like MC), TD updates after every step using the immediate reward plus the current estimate of the next state's value.

⚠️ CRITICAL: TD is arguably the most important idea in reinforcement learning. It enables online, incremental learning — the agent learns at every time step without waiting for episodes to complete.

TD(0) — The Simplest TD Algorithm

The TD(0) update for state-value estimation:

$$V(S_t) \leftarrow V(S_t) + \alpha \big[\underbrace{R_{t+1} + \gamma V(S_{t+1})}{\text{TD target}} - \underbrace{V(S_t)}{\text{current estimate}}\big]$$

Derivation from the Bellman equation:

From 23-02: $V^\pi(s) = \mathbb{E}\pi[R{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$

TD(0) uses a sample of this expectation as the target and moves the estimate toward it via stochastic approximation:

$$V(S_t) \leftarrow V(S_t) + \alpha[\text{sample} - V(S_t)]$$

where $\text{sample} = R_{t+1} + \gamma V(S_{t+1})$.

Comparison: TD vs. MC vs. DP

Property DP MC TD
Requires model Yes ($P$, $R$) No No
Uses bootstrapping Yes No (uses full $G_t$) Yes (uses $V(S_{t+1})$)
Update timing Sweeps over all states End of episode Every time step
Handles continuing tasks Yes No (needs episodes) Yes (with discounting)
Bias Zero (exact solution) Zero (unbiased estimator of $V^\pi$) Non-zero (bootstrap bias)
Variance Zero High (many random events) Lower (only one step of randomness)

The bias-variance trade-off: MC has zero bias but high variance (complete returns compound many random events). TD has some bias (it bootstraps from imperfect estimates) but much lower variance. In practice, TD often converges faster than MC.

The TD Error as a Learning Signal

The TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ has an important property — it can be related to the full return:

$$G_t - V(S_t) = \sum_{k=0}^{\infty} \gamma^k \delta_{t+k}$$

Proof:

$G_t - V(S_t) = (R_{t+1} + \gamma G_{t+1}) - V(S_t)$ $= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + \gamma(G_{t+1} - V(S_{t+1}))$ $= \delta_t + \gamma(G_{t+1} - V(S_{t+1}))$ $= \delta_t + \gamma \delta_{t+1} + \gamma^2(G_{t+2} - V(S_{t+2})) = \cdots = \sum_{k=0}^{\infty} \gamma^k \delta_{t+k}$

This shows that TD errors accumulate over time to form the full MC error. If $V = V^\pi$, the expected TD error is zero: $\mathbb{E}_\pi[\delta_t \mid S_t = s] = 0$.

TD(λ) — The Bridge Between TD and MC

The λ-return $G_t^\lambda$ is a weighted average of $n$-step returns, exponentially weighted by $\lambda \in [0,1]$:

$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$$

where $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$.

TD(λ) smoothly interpolates between TD(0) and MC.

SARSA — On-Policy TD Control

SARSA (State-Action-Reward-State-Action) learns $Q(s,a)$ on-policy using TD:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big]$$

The name comes from the transition quintuple: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.

Algorithm:

  1. Initialize $Q(s,a)$ arbitrarily
  2. For each episode:
  3. Initialize $S$, choose $A$ from $S$ using policy derived from $Q$ (e.g., $\varepsilon$-greedy)
  4. For each step:
    • Take action $A$, observe $R, S'$
    • Choose $A'$ from $S'$ using policy derived from $Q$ (e.g., $\varepsilon$-greedy)
    • $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma Q(S', A') - Q(S,A)]$
    • $S \leftarrow S'$, $A \leftarrow A'$
  5. Until $S$ is terminal

SARSA is on-policy — the policy used to select $A'$ (in the update target) is the same policy used to generate behavior. This matters: if the behavior policy explores (e.g., $\varepsilon$-greedy), SARSA learns values for that exploratory policy, learning to account for the exploration penalty.

⚠️ CRITICAL — The Cliff Walking insight: In the classic cliff-walking problem, SARSA learns a safer path that stays away from the cliff edge because it accounts for the possibility of random exploratory moves sending the agent off the cliff. Q-learning (23-06), being off-policy, learns the optimal path right along the edge — but may actually fall off more during learning because it doesn't account for exploration noise.

Convergence of TD

For tabular TD(0) with a fixed policy $\pi$, TD converges to $V^\pi$ with probability 1 if: 1. The step sizes satisfy the Robbins-Monro conditions: $\sum_{t=1}^\infty \alpha_t = \infty$ and $\sum_{t=1}^\infty \alpha_t^2 < \infty$ 2. All states are visited infinitely often

Example valid step-size: $\alpha_t = 1/t$, or constant $\alpha$ with eventual averaging.

TD converges faster than MC in many practical settings because it exploits the Markov property to reduce variance, even though it introduces some bias.



Key Terms

Worked Examples

Example 1: TD(0) Update Step

$V = [0, 0, 0]$ for 3 states. Transition: $S_t = s_1, A_t, R_{t+1} = 5, S_{t+1} = s_2$. Current $V(s_1) = 10$, $V(s_2) = 3$. $\alpha = 0.1$, $\gamma = 0.9$. Compute the TD update.

Solution:

$\delta_t = 5 + 0.9(3) - 10 = 5 + 2.7 - 10 = -2.3$ $V(s_1) \leftarrow 10 + 0.1(-2.3) = 10 - 0.23 = 9.77$

The negative TD error indicates that $V(s_1)$ was too high — the observed reward (5) plus discounted next-state value (2.7) is less than the current estimate (10), so we reduce it.

Click for answer $V(s_1) = 9.77$. The TD error $\delta = -2.3$ drives the update; $\alpha$ controls how much we adjust.

Example 2: SARSA vs. the Cliff

Cliff walking grid: agent starts at bottom-left, goal is bottom-right. Cliff along the bottom row (except start and goal). Falling on cliff gives reward $-100$ and resets to start. Each step gives reward $-1$.

With $\varepsilon$-greedy ($\varepsilon = 0.1$): - The optimal path hugs the cliff edge — shortest path. - SARSA learns a path further from the cliff because the $\varepsilon$-greedy action selection means there's a 10% chance of a random move near the cliff, which would be catastrophic. - SARSA's Q-values account for the actual behavior policy's exploration, making it learn safer but longer paths.

Click for answer SARSA's on-policy nature means it learns values for the $\varepsilon$-greedy policy, which includes exploration risk. This produces a safer policy that compensates for occasional random actions. Q-learning would learn the optimal (cliff-hugging) path but might actually perform worse during training.

Example 3: $n$-step Return Calculation

Trajectory: $R_1=1, R_2=2, R_3=3, R_4=4$, then terminal. $V = [0, 0, 0]$ for all states. $\gamma = 1$. Compute $G_{0:1}$, $G_{0:2}$, $G_{0:3}$, and the full return $G_0$.

Solution:

$G_{0:1} = R_1 + \gamma V(S_1) = 1 + 0 = 1$ $G_{0:2} = R_1 + \gamma R_2 + \gamma^2 V(S_2) = 1 + 2 + 0 = 3$ $G_{0:3} = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 V(S_3) = 1 + 2 + 3 + 0 = 6$ $G_0 = R_1 + R_2 + R_3 + R_4 = 1 + 2 + 3 + 4 = 10$

Click for answer $G_{0:1}=1$, $G_{0:2}=3$, $G_{0:3}=6$, $G_0=10$. The $n$-step returns are partial — they truncate after $n$ steps and bootstrap from the value function.

Practice Problems

Problem 1: Derive the relationship $\sum_{k=0}^\infty \gamma^k \delta_{t+k} = G_t - V(S_t)$.

Answers (click to expand) $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ $\gamma \delta_{t+1} = \gamma R_{t+2} + \gamma^2 V(S_{t+2}) - \gamma V(S_{t+1})$ $\gamma^2 \delta_{t+2} = \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3}) - \gamma^2 V(S_{t+2})$ Summing all: $\sum_{k=0}^\infty \gamma^k \delta_{t+k} = \sum_{k=0}^\infty \gamma^k R_{t+k+1} - V(S_t) = G_t - V(S_t)$. All the $V$ terms telescope, leaving only $-V(S_t)$ and the full reward sum $G_t$.

Problem 2: Show that under the true value function $V^\pi$, the expected TD error is zero: $\mathbb{E}_\pi[\delta_t \mid S_t = s] = 0$.

Answers (click to expand) $\mathbb{E}_\pi[\delta_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s] - V^\pi(s)$ $= \mathbb{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s] - V^\pi(s)$ But by the Bellman equation: $V^\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$. So the expectation is zero. This makes TD a form of stochastic approximation for solving the Bellman equation.

Problem 3: Explain why TD learns faster than MC in a random walk problem. (Hint: consider how many random transitions affect each update.)

Answers (click to expand) In a random walk, MC must wait for episode termination and computes a full return — this compounds the randomness of *all* steps. TD uses a single-step transition, whose variance is much lower. Although TD has bias (bootstrapping from imperfect $V(S_{t+1})$), the reduced variance often dominates, leading to faster convergence. Numerically: if each step's reward has variance $\sigma^2$ and episodes have typical length $L$, MC returns have variance $\approx L\sigma^2$, while TD targets have variance $\approx \sigma^2 + \gamma^2\text{Var}(V(S_{t+1}))$, which is typically much smaller.

Problem 4: In SARSA, if the policy is pure greedy (no exploration), what happens to learning? Why does $\varepsilon$-greedy fix this?

Answers (click to expand) With pure greedy, SARSA only updates Q-values for actions it already thinks are best. It never tries other actions, so it can never discover if some unvisited action is actually better. $\varepsilon$-greedy forces occasional exploration, ensuring all actions are eventually tried and their Q-values updated. As $\varepsilon \to 0$, SARSA converges to the optimal policy (under the GLIE condition).

Problem 5: Compute the λ-return $G_0^\lambda$ for $\lambda = 0.5$ using the returns from Example 3, with $V = [0, 0, 0]$ for all states. Compare to $G_{0:1}$, $G_{0:2}$, and full $G_0$.

Answers (click to expand) $G_0^{0.5} = (1-0.5)\sum_{n=1}^\infty 0.5^{n-1} G_{0:n}$ For this finite episode (4 steps), $G_{0:n}$ for $n \geq 4$ equals the full return $G_0 = 10$, and $G_{0:1}=1$, $G_{0:2}=3$, $G_{0:3}=6$. $G_0^{0.5} = 0.5[0.5^0(1) + 0.5^1(3) + 0.5^2(6)] + 0.5^3 \cdot 10 \cdot \frac{1}{1-0.5}$ (tail sum approximation) $= 0.5[1 + 1.5 + 1.5] + 0.125(20) = 0.5(4) + 2.5 = 2 + 2.5 = 4.5$ The λ-return (4.5) lies between $G_{0:1}=1$ and $G_0=10$, balancing the short-term TD estimate with longer-term information.

Summary

Key takeaways:


Quiz

Question 1: The TD(0) update uses which of the following as the target?

A. The complete return $G_t$ B. $R_{t+1} + \gamma V(S_{t+1})$ C. $\max_a Q(S_{t+1}, a)$ D. $R_{t+1}$ only

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. TD(0) bootstraps — it uses the immediate reward plus the current estimate of next state's value. - If you chose A: That's the MC target — requires episode completion. - If you chose C: That's the Q-learning target (off-policy, uses max). - If you chose D: That ignores future value entirely — not even a one-step lookahead.

Question 2: What is the primary advantage of TD over MC?

A. TD is unbiased B. TD can learn online without waiting for episode completion C. TD requires a model of the environment D. TD always converges faster mathematically

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. TD updates every step, enabling online learning. MC must wait for episode end. - If you chose A: MC is unbiased; TD has bootstrap bias. The opposite is true. - If you chose C: TD is model-free — it does not require a model. - If you chose D: TD often converges faster in practice but not always — it depends on the problem.

Question 3: In SARSA, what does the second 'A' in the name stand for?

A. The action taken at the current state B. The action that will be taken at the next state, according to the current policy C. The optimal action at the next state D. An arbitrary action

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. SARSA = (S, A, R, S', A') — the second A is the action actually chosen at the next state using the current (on-policy) action selection. This is what makes it on-policy. - If you chose A: That's the first A — the action at the current state. - If you chose C: That would be Q-learning (uses max over actions), not SARSA. - If you chose D: SARSA uses the action chosen by the policy, not arbitrary.

Question 4: What does $\lambda = 0$ in TD(λ) correspond to?

A. Monte Carlo (complete returns) B. TD(0) — one-step bootstrapping C. Dynamic programming D. SARSA

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. When $\lambda = 0$, the λ-return $G_t^0 = G_{t:1} = R_{t+1} + \gamma V(S_{t+1})$, which is the TD(0) target. - If you chose A: $\lambda = 1$ corresponds to MC. - If you chose C: DP is model-based, not a setting of λ. - If you chose D: SARSA is TD control for Q-values; TD(λ) can be applied to SARSA (giving SARSA(λ)).

Question 5: Why might SARSA learn a different policy than Q-learning in the same environment?

A. SARSA uses a different discount factor B. SARSA is on-policy — it learns values for the behavior policy including exploration C. SARSA doesn't use value functions D. SARSA requires a model

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. SARSA's update uses $Q(S', A')$ where $A'$ is chosen by the behavior policy (e.g., $\varepsilon$-greedy). It learns values that account for occasional random exploratory moves. - If you chose A: Both use the same $\gamma$. - If you chose C: SARSA learns Q-values — it definitely uses value functions. - If you chose D: SARSA is model-free, like all TD methods.

Next Steps

Next up: 23-06 — Q-Learning — where you'll learn the off-policy TD control algorithm that directly approximates $$Q^*$$ and forms the basis for deep reinforcement learning.


Pitfalls

  1. Thinking TD is the same as MC because both are model-free. TD uses bootstrapping (updates from $$R_{t+1} + \gamma V(S_{t+1})$$), while MC uses complete returns. This seemingly small difference has major consequences: TD exploits the Markov property for lower variance but introduces bias; MC is unbiased but has high variance. The choice depends on your problem's structure.

  2. Misunderstanding the SARSA vs. Q-learning distinction. SARSA uses $$Q(S_{t+1}, A_{t+1})$$ where $$A_{t+1}$$ is the action actually taken by the behavior policy. Q-learning uses $$\max_{a'} Q(S_{t+1}, a')$$. This single difference makes SARSA on-policy (learns values for the exploratory policy) and Q-learning off-policy (learns optimal values regardless of exploration).

  3. Setting constant α and expecting exact convergence. The Robbins-Monro conditions ($$\sum \alpha_t = \infty$$, $$\sum \alpha_t^2 < \infty$$) require decaying learning rates. Constant α causes persistent oscillations around the true value. In non-stationary environments constant α is acceptable (it tracks changes), but for stationary problems, decay α or average final estimates.

  4. Using λ = 1 in TD(λ) on continuing tasks. λ = 1 gives the full Monte Carlo return $$G_t$$, which requires episode termination. On continuing tasks, this is undefined. Use λ < 1 for continuing environments or truncate with an artificial horizon.




Q8: The expected TD error $$\mathbb{E}_\pi[\delta_t \mid S_t = s]$$ equals zero when:

A) The learning rate is zero B) $$V = V^\pi$$ (the true value function) C) All rewards are positive D) The policy is deterministic

Answer and Explanations **Correct: B)** $$\mathbb{E}_\pi[\delta_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s] - V^\pi(s)$$. By the Bellman equation for $$V^\pi$$, the first term equals $$V^\pi(s)$$, so the expectation is zero. This makes TD a form of stochastic approximation for solving the Bellman equation. - A) α = 0 means no updates, but the TD error can still be non-zero. - C) Reward sign doesn't determine whether the expected error is zero. - D) The expected TD error is zero for any policy, deterministic or stochastic, when $$V = V^\pi$$.