23-05 — Temporal Difference Learning
Phase: 23 — Reinforcement Learning Mathematics Subject: 23-05 Prerequisites: 23-04 — Monte Carlo Methods, 23-02 — Bellman Equations Next subject: 23-06 — Q-Learning
Learning Objectives
By the end of this subject, you will be able to:
- Derive the TD(0) update rule from the Bellman equation and explain what "bootstrapping" means
- Compare and contrast TD learning with Monte Carlo and Dynamic Programming
- Define the TD error and explain its role as a learning signal
- Implement SARSA for on-policy TD control and analyze its convergence properties
- Understand the bias-variance trade-off between TD and MC methods
Core Content
The Central Idea of Temporal Difference Learning
Temporal Difference (TD) learning combines two ideas: 1. From Monte Carlo: Learn directly from experience without a model 2. From Dynamic Programming: Update estimates based on other learned estimates (bootstrapping)
The key insight: instead of waiting until the end of an episode to compute $G_t$ (like MC), TD updates after every step using the immediate reward plus the current estimate of the next state's value.
⚠️ CRITICAL: TD is arguably the most important idea in reinforcement learning. It enables online, incremental learning — the agent learns at every time step without waiting for episodes to complete.
TD(0) — The Simplest TD Algorithm
The TD(0) update for state-value estimation:
$$V(S_t) \leftarrow V(S_t) + \alpha \big[\underbrace{R_{t+1} + \gamma V(S_{t+1})}{\text{TD target}} - \underbrace{V(S_t)}{\text{current estimate}}\big]$$
- $R_{t+1} + \gamma V(S_{t+1})$ is the TD target — a one-step estimate of the return
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error — the difference between the estimated return and the current estimate
- $\alpha \in (0, 1]$ is the learning rate (step-size parameter)
Derivation from the Bellman equation:
From 23-02: $V^\pi(s) = \mathbb{E}\pi[R{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$
TD(0) uses a sample of this expectation as the target and moves the estimate toward it via stochastic approximation:
$$V(S_t) \leftarrow V(S_t) + \alpha[\text{sample} - V(S_t)]$$
where $\text{sample} = R_{t+1} + \gamma V(S_{t+1})$.
Comparison: TD vs. MC vs. DP
| Property | DP | MC | TD |
|---|---|---|---|
| Requires model | Yes ($P$, $R$) | No | No |
| Uses bootstrapping | Yes | No (uses full $G_t$) | Yes (uses $V(S_{t+1})$) |
| Update timing | Sweeps over all states | End of episode | Every time step |
| Handles continuing tasks | Yes | No (needs episodes) | Yes (with discounting) |
| Bias | Zero (exact solution) | Zero (unbiased estimator of $V^\pi$) | Non-zero (bootstrap bias) |
| Variance | Zero | High (many random events) | Lower (only one step of randomness) |
The bias-variance trade-off: MC has zero bias but high variance (complete returns compound many random events). TD has some bias (it bootstraps from imperfect estimates) but much lower variance. In practice, TD often converges faster than MC.
The TD Error as a Learning Signal
The TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ has an important property — it can be related to the full return:
$$G_t - V(S_t) = \sum_{k=0}^{\infty} \gamma^k \delta_{t+k}$$
Proof:
$G_t - V(S_t) = (R_{t+1} + \gamma G_{t+1}) - V(S_t)$ $= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + \gamma(G_{t+1} - V(S_{t+1}))$ $= \delta_t + \gamma(G_{t+1} - V(S_{t+1}))$ $= \delta_t + \gamma \delta_{t+1} + \gamma^2(G_{t+2} - V(S_{t+2})) = \cdots = \sum_{k=0}^{\infty} \gamma^k \delta_{t+k}$
This shows that TD errors accumulate over time to form the full MC error. If $V = V^\pi$, the expected TD error is zero: $\mathbb{E}_\pi[\delta_t \mid S_t = s] = 0$.
TD(λ) — The Bridge Between TD and MC
The λ-return $G_t^\lambda$ is a weighted average of $n$-step returns, exponentially weighted by $\lambda \in [0,1]$:
$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$$
where $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$.
- $\lambda = 0$: $G_t^0 = G_{t:1} = R_{t+1} + \gamma V(S_{t+1})$ — TD(0)
- $\lambda = 1$: $G_t^1 = G_t$ — Monte Carlo (complete return)
TD(λ) smoothly interpolates between TD(0) and MC.
SARSA — On-Policy TD Control
SARSA (State-Action-Reward-State-Action) learns $Q(s,a)$ on-policy using TD:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big]$$
The name comes from the transition quintuple: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.
Algorithm:
- Initialize $Q(s,a)$ arbitrarily
- For each episode:
- Initialize $S$, choose $A$ from $S$ using policy derived from $Q$ (e.g., $\varepsilon$-greedy)
- For each step:
- Take action $A$, observe $R, S'$
- Choose $A'$ from $S'$ using policy derived from $Q$ (e.g., $\varepsilon$-greedy)
- $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma Q(S', A') - Q(S,A)]$
- $S \leftarrow S'$, $A \leftarrow A'$
- Until $S$ is terminal
SARSA is on-policy — the policy used to select $A'$ (in the update target) is the same policy used to generate behavior. This matters: if the behavior policy explores (e.g., $\varepsilon$-greedy), SARSA learns values for that exploratory policy, learning to account for the exploration penalty.
⚠️ CRITICAL — The Cliff Walking insight: In the classic cliff-walking problem, SARSA learns a safer path that stays away from the cliff edge because it accounts for the possibility of random exploratory moves sending the agent off the cliff. Q-learning (23-06), being off-policy, learns the optimal path right along the edge — but may actually fall off more during learning because it doesn't account for exploration noise.
Convergence of TD
For tabular TD(0) with a fixed policy $\pi$, TD converges to $V^\pi$ with probability 1 if: 1. The step sizes satisfy the Robbins-Monro conditions: $\sum_{t=1}^\infty \alpha_t = \infty$ and $\sum_{t=1}^\infty \alpha_t^2 < \infty$ 2. All states are visited infinitely often
Example valid step-size: $\alpha_t = 1/t$, or constant $\alpha$ with eventual averaging.
TD converges faster than MC in many practical settings because it exploits the Markov property to reduce variance, even though it introduces some bias.
Key Terms
- Bias
- Handles continuing tasks
- Requires model
- TD error
- TD target
- Update timing
- Uses bootstrapping
- Variance
Worked Examples
Example 1: TD(0) Update Step
$V = [0, 0, 0]$ for 3 states. Transition: $S_t = s_1, A_t, R_{t+1} = 5, S_{t+1} = s_2$. Current $V(s_1) = 10$, $V(s_2) = 3$. $\alpha = 0.1$, $\gamma = 0.9$. Compute the TD update.
Solution:
$\delta_t = 5 + 0.9(3) - 10 = 5 + 2.7 - 10 = -2.3$ $V(s_1) \leftarrow 10 + 0.1(-2.3) = 10 - 0.23 = 9.77$
The negative TD error indicates that $V(s_1)$ was too high — the observed reward (5) plus discounted next-state value (2.7) is less than the current estimate (10), so we reduce it.
Click for answer
$V(s_1) = 9.77$. The TD error $\delta = -2.3$ drives the update; $\alpha$ controls how much we adjust.Example 2: SARSA vs. the Cliff
Cliff walking grid: agent starts at bottom-left, goal is bottom-right. Cliff along the bottom row (except start and goal). Falling on cliff gives reward $-100$ and resets to start. Each step gives reward $-1$.
With $\varepsilon$-greedy ($\varepsilon = 0.1$): - The optimal path hugs the cliff edge — shortest path. - SARSA learns a path further from the cliff because the $\varepsilon$-greedy action selection means there's a 10% chance of a random move near the cliff, which would be catastrophic. - SARSA's Q-values account for the actual behavior policy's exploration, making it learn safer but longer paths.
Click for answer
SARSA's on-policy nature means it learns values for the $\varepsilon$-greedy policy, which includes exploration risk. This produces a safer policy that compensates for occasional random actions. Q-learning would learn the optimal (cliff-hugging) path but might actually perform worse during training.Example 3: $n$-step Return Calculation
Trajectory: $R_1=1, R_2=2, R_3=3, R_4=4$, then terminal. $V = [0, 0, 0]$ for all states. $\gamma = 1$. Compute $G_{0:1}$, $G_{0:2}$, $G_{0:3}$, and the full return $G_0$.
Solution:
$G_{0:1} = R_1 + \gamma V(S_1) = 1 + 0 = 1$ $G_{0:2} = R_1 + \gamma R_2 + \gamma^2 V(S_2) = 1 + 2 + 0 = 3$ $G_{0:3} = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 V(S_3) = 1 + 2 + 3 + 0 = 6$ $G_0 = R_1 + R_2 + R_3 + R_4 = 1 + 2 + 3 + 4 = 10$
Click for answer
$G_{0:1}=1$, $G_{0:2}=3$, $G_{0:3}=6$, $G_0=10$. The $n$-step returns are partial — they truncate after $n$ steps and bootstrap from the value function.Practice Problems
Problem 1: Derive the relationship $\sum_{k=0}^\infty \gamma^k \delta_{t+k} = G_t - V(S_t)$.
Answers (click to expand)
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ $\gamma \delta_{t+1} = \gamma R_{t+2} + \gamma^2 V(S_{t+2}) - \gamma V(S_{t+1})$ $\gamma^2 \delta_{t+2} = \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3}) - \gamma^2 V(S_{t+2})$ Summing all: $\sum_{k=0}^\infty \gamma^k \delta_{t+k} = \sum_{k=0}^\infty \gamma^k R_{t+k+1} - V(S_t) = G_t - V(S_t)$. All the $V$ terms telescope, leaving only $-V(S_t)$ and the full reward sum $G_t$.Problem 2: Show that under the true value function $V^\pi$, the expected TD error is zero: $\mathbb{E}_\pi[\delta_t \mid S_t = s] = 0$.
Answers (click to expand)
$\mathbb{E}_\pi[\delta_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s] - V^\pi(s)$ $= \mathbb{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s] - V^\pi(s)$ But by the Bellman equation: $V^\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$. So the expectation is zero. This makes TD a form of stochastic approximation for solving the Bellman equation.Problem 3: Explain why TD learns faster than MC in a random walk problem. (Hint: consider how many random transitions affect each update.)
Answers (click to expand)
In a random walk, MC must wait for episode termination and computes a full return — this compounds the randomness of *all* steps. TD uses a single-step transition, whose variance is much lower. Although TD has bias (bootstrapping from imperfect $V(S_{t+1})$), the reduced variance often dominates, leading to faster convergence. Numerically: if each step's reward has variance $\sigma^2$ and episodes have typical length $L$, MC returns have variance $\approx L\sigma^2$, while TD targets have variance $\approx \sigma^2 + \gamma^2\text{Var}(V(S_{t+1}))$, which is typically much smaller.Problem 4: In SARSA, if the policy is pure greedy (no exploration), what happens to learning? Why does $\varepsilon$-greedy fix this?
Answers (click to expand)
With pure greedy, SARSA only updates Q-values for actions it already thinks are best. It never tries other actions, so it can never discover if some unvisited action is actually better. $\varepsilon$-greedy forces occasional exploration, ensuring all actions are eventually tried and their Q-values updated. As $\varepsilon \to 0$, SARSA converges to the optimal policy (under the GLIE condition).Problem 5: Compute the λ-return $G_0^\lambda$ for $\lambda = 0.5$ using the returns from Example 3, with $V = [0, 0, 0]$ for all states. Compare to $G_{0:1}$, $G_{0:2}$, and full $G_0$.
Answers (click to expand)
$G_0^{0.5} = (1-0.5)\sum_{n=1}^\infty 0.5^{n-1} G_{0:n}$ For this finite episode (4 steps), $G_{0:n}$ for $n \geq 4$ equals the full return $G_0 = 10$, and $G_{0:1}=1$, $G_{0:2}=3$, $G_{0:3}=6$. $G_0^{0.5} = 0.5[0.5^0(1) + 0.5^1(3) + 0.5^2(6)] + 0.5^3 \cdot 10 \cdot \frac{1}{1-0.5}$ (tail sum approximation) $= 0.5[1 + 1.5 + 1.5] + 0.125(20) = 0.5(4) + 2.5 = 2 + 2.5 = 4.5$ The λ-return (4.5) lies between $G_{0:1}=1$ and $G_0=10$, balancing the short-term TD estimate with longer-term information.Summary
Key takeaways:
- TD learning bootstraps — it updates estimates using other estimates ($R_{t+1} + \gamma V(S_{t+1})$), enabling online step-by-step learning
- The TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the fundamental learning signal; expected to be zero at the true value function
- TD occupies the middle ground: lower variance than MC (single-step randomness) but some bias from bootstrapping
- SARSA is on-policy TD control — learns Q-values for the actual behavior policy, naturally accounting for exploration risk
- TD(λ) interpolates between TD(0) (λ=0) and MC (λ=1) via exponentially weighted n-step returns
- TD converges under Robbins-Monro step-size conditions with sufficient exploration
Quiz
Question 1: The TD(0) update uses which of the following as the target?
A. The complete return $G_t$ B. $R_{t+1} + \gamma V(S_{t+1})$ C. $\max_a Q(S_{t+1}, a)$ D. $R_{t+1}$ only
Correct Answer: B
Explanation (click to expand)
- If you chose B: Correct. TD(0) bootstraps — it uses the immediate reward plus the current estimate of next state's value. - If you chose A: That's the MC target — requires episode completion. - If you chose C: That's the Q-learning target (off-policy, uses max). - If you chose D: That ignores future value entirely — not even a one-step lookahead.Question 2: What is the primary advantage of TD over MC?
A. TD is unbiased B. TD can learn online without waiting for episode completion C. TD requires a model of the environment D. TD always converges faster mathematically
Correct Answer: B
Explanation (click to expand)
- If you chose B: Correct. TD updates every step, enabling online learning. MC must wait for episode end. - If you chose A: MC is unbiased; TD has bootstrap bias. The opposite is true. - If you chose C: TD is model-free — it does not require a model. - If you chose D: TD often converges faster in practice but not always — it depends on the problem.Question 3: In SARSA, what does the second 'A' in the name stand for?
A. The action taken at the current state B. The action that will be taken at the next state, according to the current policy C. The optimal action at the next state D. An arbitrary action
Correct Answer: B
Explanation (click to expand)
- If you chose B: Correct. SARSA = (S, A, R, S', A') — the second A is the action actually chosen at the next state using the current (on-policy) action selection. This is what makes it on-policy. - If you chose A: That's the first A — the action at the current state. - If you chose C: That would be Q-learning (uses max over actions), not SARSA. - If you chose D: SARSA uses the action chosen by the policy, not arbitrary.Question 4: What does $\lambda = 0$ in TD(λ) correspond to?
A. Monte Carlo (complete returns) B. TD(0) — one-step bootstrapping C. Dynamic programming D. SARSA
Correct Answer: B
Explanation (click to expand)
- If you chose B: Correct. When $\lambda = 0$, the λ-return $G_t^0 = G_{t:1} = R_{t+1} + \gamma V(S_{t+1})$, which is the TD(0) target. - If you chose A: $\lambda = 1$ corresponds to MC. - If you chose C: DP is model-based, not a setting of λ. - If you chose D: SARSA is TD control for Q-values; TD(λ) can be applied to SARSA (giving SARSA(λ)).Question 5: Why might SARSA learn a different policy than Q-learning in the same environment?
A. SARSA uses a different discount factor B. SARSA is on-policy — it learns values for the behavior policy including exploration C. SARSA doesn't use value functions D. SARSA requires a model
Correct Answer: B
Explanation (click to expand)
- If you chose B: Correct. SARSA's update uses $Q(S', A')$ where $A'$ is chosen by the behavior policy (e.g., $\varepsilon$-greedy). It learns values that account for occasional random exploratory moves. - If you chose A: Both use the same $\gamma$. - If you chose C: SARSA learns Q-values — it definitely uses value functions. - If you chose D: SARSA is model-free, like all TD methods.Next Steps
Next up: 23-06 — Q-Learning — where you'll learn the off-policy TD control algorithm that directly approximates $$Q^*$$ and forms the basis for deep reinforcement learning.
Pitfalls
-
Thinking TD is the same as MC because both are model-free. TD uses bootstrapping (updates from $$R_{t+1} + \gamma V(S_{t+1})$$), while MC uses complete returns. This seemingly small difference has major consequences: TD exploits the Markov property for lower variance but introduces bias; MC is unbiased but has high variance. The choice depends on your problem's structure.
-
Misunderstanding the SARSA vs. Q-learning distinction. SARSA uses $$Q(S_{t+1}, A_{t+1})$$ where $$A_{t+1}$$ is the action actually taken by the behavior policy. Q-learning uses $$\max_{a'} Q(S_{t+1}, a')$$. This single difference makes SARSA on-policy (learns values for the exploratory policy) and Q-learning off-policy (learns optimal values regardless of exploration).
-
Setting constant α and expecting exact convergence. The Robbins-Monro conditions ($$\sum \alpha_t = \infty$$, $$\sum \alpha_t^2 < \infty$$) require decaying learning rates. Constant α causes persistent oscillations around the true value. In non-stationary environments constant α is acceptable (it tracks changes), but for stationary problems, decay α or average final estimates.
-
Using λ = 1 in TD(λ) on continuing tasks. λ = 1 gives the full Monte Carlo return $$G_t$$, which requires episode termination. On continuing tasks, this is undefined. Use λ < 1 for continuing environments or truncate with an artificial horizon.
Q8: The expected TD error $$\mathbb{E}_\pi[\delta_t \mid S_t = s]$$ equals zero when:
A) The learning rate is zero B) $$V = V^\pi$$ (the true value function) C) All rewards are positive D) The policy is deterministic