📐 Concept diagram

23-05 — Temporal Difference Learning

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-05 Prerequisites: 23-04 — Monte Carlo Methods, 23-02 — Bellman Equations Next subject: 23-06 — Q-Learning

Learning Objectives

By the end of this subject, you will be able to:

Derive the TD(0) update rule from the Bellman equation and explain what "bootstrapping" means
Compare and contrast TD learning with Monte Carlo and Dynamic Programming
Define the TD error and explain its role as a learning signal
Implement SARSA for on-policy TD control and analyze its convergence properties
Understand the bias-variance trade-off between TD and MC methods

Core Content

The Central Idea of Temporal Difference Learning

Temporal Difference (TD) learning combines two ideas: 1. From Monte Carlo: Learn directly from experience without a model 2. From Dynamic Programming: Update estimates based on other learned estimates (bootstrapping)

The key insight: instead of waiting until the end of an episode to compute $G_t$ (like MC), TD updates after every step using the immediate reward plus the current estimate of the next state's value.

⚠️ CRITICAL: TD is arguably the most important idea in reinforcement learning. It enables online, incremental learning — the agent learns at every time step without waiting for episodes to complete.

TD(0) — The Simplest TD Algorithm

The TD(0) update for state-value estimation:

$$V(S_t) \leftarrow V(S_t) + \alpha \big[\underbrace{R_{t+1} + \gamma V(S_{t+1})}{\text{TD target}} - \underbrace{V(S_t)}{\text{current estimate}}\big]$$

$R_{t+1} + \gamma V(S_{t+1})$ is the TD target — a one-step estimate of the return
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error — the difference between the estimated return and the current estimate
$\alpha \in (0, 1]$ is the learning rate (step-size parameter)

Derivation from the Bellman equation:

From 23-02: $V^\pi(s) = \mathbb{E}\pi[R{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$

TD(0) uses a sample of this expectation as the target and moves the estimate toward it via stochastic approximation:

$$V(S_t) \leftarrow V(S_t) + \alpha[\text{sample} - V(S_t)]$$

where $\text{sample} = R_{t+1} + \gamma V(S_{t+1})$.

Comparison: TD vs. MC vs. DP

Property	DP	MC	TD
Requires model	Yes ($P$, $R$)	No	No
Uses bootstrapping	Yes	No (uses full $G_t$)	Yes (uses $V(S_{t+1})$)
Update timing	Sweeps over all states	End of episode	Every time step
Handles continuing tasks	Yes	No (needs episodes)	Yes (with discounting)
Bias	Zero (exact solution)	Zero (unbiased estimator of $V^\pi$)	Non-zero (bootstrap bias)
Variance	Zero	High (many random events)	Lower (only one step of randomness)

The bias-variance trade-off: MC has zero bias but high variance (complete returns compound many random events). TD has some bias (it bootstraps from imperfect estimates) but much lower variance. In practice, TD often converges faster than MC.

The TD Error as a Learning Signal

The TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ has an important property — it can be related to the full return:

$$G_t - V(S_t) = \sum_{k=0}^{\infty} \gamma^k \delta_{t+k}$$

Proof:

$G_t - V(S_t) = (R_{t+1} + \gamma G_{t+1}) - V(S_t)$ $= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + \gamma(G_{t+1} - V(S_{t+1}))$ $= \delta_t + \gamma(G_{t+1} - V(S_{t+1}))$ $= \delta_t + \gamma \delta_{t+1} + \gamma^2(G_{t+2} - V(S_{t+2})) = \cdots = \sum_{k=0}^{\infty} \gamma^k \delta_{t+k}$

This shows that TD errors accumulate over time to form the full MC error. If $V = V^\pi$, the expected TD error is zero: $\mathbb{E}_\pi[\delta_t \mid S_t = s] = 0$.

TD(λ) — The Bridge Between TD and MC

The λ-return $G_t^\lambda$ is a weighted average of $n$-step returns, exponentially weighted by $\lambda \in [0,1]$:

$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$$

where $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$.

$\lambda = 0$: $G_t^0 = G_{t:1} = R_{t+1} + \gamma V(S_{t+1})$ — TD(0)
$\lambda = 1$: $G_t^1 = G_t$ — Monte Carlo (complete return)

TD(λ) smoothly interpolates between TD(0) and MC.

SARSA — On-Policy TD Control

SARSA (State-Action-Reward-State-Action) learns $Q(s,a)$ on-policy using TD:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big]$$

The name comes from the transition quintuple: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.

Algorithm:

Initialize $Q(s,a)$ arbitrarily
For each episode:
Initialize $S$, choose $A$ from $S$ using policy derived from $Q$ (e.g., $\varepsilon$-greedy)
For each step:
- Take action $A$, observe $R, S'$
- Choose $A'$ from $S'$ using policy derived from $Q$ (e.g., $\varepsilon$-greedy)
- $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma Q(S', A') - Q(S,A)]$
- $S \leftarrow S'$, $A \leftarrow A'$
Until $S$ is terminal

SARSA is on-policy — the policy used to select $A'$ (in the update target) is the same policy used to generate behavior. This matters: if the behavior policy explores (e.g., $\varepsilon$-greedy), SARSA learns values for that exploratory policy, learning to account for the exploration penalty.

⚠️ CRITICAL — The Cliff Walking insight: In the classic cliff-walking problem, SARSA learns a safer path that stays away from the cliff edge because it accounts for the possibility of random exploratory moves sending the agent off the cliff. Q-learning (23-06), being off-policy, learns the optimal path right along the edge — but may actually fall off more during learning because it doesn't account for exploration noise.

Convergence of TD

For tabular TD(0) with a fixed policy $\pi$, TD converges to $V^\pi$ with probability 1 if: 1. The step sizes satisfy the Robbins-Monro conditions: $\sum_{t=1}^\infty \alpha_t = \infty$ and $\sum_{t=1}^\infty \alpha_t^2 < \infty$ 2. All states are visited infinitely often

Example valid step-size: $\alpha_t = 1/t$, or constant $\alpha$ with eventual averaging.

TD converges faster than MC in many practical settings because it exploits the Markov property to reduce variance, even though it introduces some bias.

Key Terms

Bias
Handles continuing tasks
Requires model
TD error
TD target
Update timing
Uses bootstrapping
Variance

Worked Examples

Example 1: TD(0) Update Step

$V = [0, 0, 0]$ for 3 states. Transition: $S_t = s_1, A_t, R_{t+1} = 5, S_{t+1} = s_2$. Current $V(s_1) = 10$, $V(s_2) = 3$. $\alpha = 0.1$, $\gamma = 0.9$. Compute the TD update.

Solution:

$\delta_t = 5 + 0.9(3) - 10 = 5 + 2.7 - 10 = -2.3$ $V(s_1) \leftarrow 10 + 0.1(-2.3) = 10 - 0.23 = 9.77$

The negative TD error indicates that $V(s_1)$ was too high — the observed reward (5) plus discounted next-state value (2.7) is less than the current estimate (10), so we reduce it.

Click for answer

$V(s_1) = 9.77$. The TD error $\delta = -2.3$ drives the update; $\alpha$ controls how much we adjust.

Example 2: SARSA vs. the Cliff

Cliff walking grid: agent starts at bottom-left, goal is bottom-right. Cliff along the bottom row (except start and goal). Falling on cliff gives reward $-100$ and resets to start. Each step gives reward $-1$.

With $\varepsilon$-greedy ($\varepsilon = 0.1$): - The optimal path hugs the cliff edge — shortest path. - SARSA learns a path further from the cliff because the $\varepsilon$-greedy action selection means there's a 10% chance of a random move near the cliff, which would be catastrophic. - SARSA's Q-values account for the actual behavior policy's exploration, making it learn safer but longer paths.

Click for answer

SARSA's on-policy nature means it learns values for the $\varepsilon$-greedy policy, which includes exploration risk. This produces a safer policy that compensates for occasional random actions. Q-learning would learn the optimal (cliff-hugging) path but might actually perform worse during training.

Example 3: $n$-step Return Calculation

Trajectory: $R_1=1, R_2=2, R_3=3, R_4=4$, then terminal. $V = [0, 0, 0]$ for all states. $\gamma = 1$. Compute $G_{0:1}$, $G_{0:2}$, $G_{0:3}$, and the full return $G_0$.

Solution:

$G_{0:1} = R_1 + \gamma V(S_1) = 1 + 0 = 1$ $G_{0:2} = R_1 + \gamma R_2 + \gamma^2 V(S_2) = 1 + 2 + 0 = 3$ $G_{0:3} = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 V(S_3) = 1 + 2 + 3 + 0 = 6$ $G_0 = R_1 + R_2 + R_3 + R_4 = 1 + 2 + 3 + 4 = 10$

Click for answer

$G_{0:1}=1$, $G_{0:2}=3$, $G_{0:3}=6$, $G_0=10$. The $n$-step returns are partial — they truncate after $n$ steps and bootstrap from the value function.

Practice Problems

Problem 1: Derive the relationship $\sum_{k=0}^\infty \gamma^k \delta_{t+k} = G_t - V(S_t)$.

Answers (click to expand)

$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ $\gamma \delta_{t+1} = \gamma R_{t+2} + \gamma^2 V(S_{t+2}) - \gamma V(S_{t+1})$ $\gamma^2 \delta_{t+2} = \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3}) - \gamma^2 V(S_{t+2})$ Summing all: $\sum_{k=0}^\infty \gamma^k \delta_{t+k} = \sum_{k=0}^\infty \gamma^k R_{t+k+1} - V(S_t) = G_t - V(S_t)$. All the $V$ terms telescope, leaving only $-V(S_t)$ and the full reward sum $G_t$.

Problem 2: Show that under the true value function $V^\pi$, the expected TD error is zero: $\mathbb{E}_\pi[\delta_t \mid S_t = s] = 0$.

Answers (click to expand)

$\mathbb{E}_\pi[\delta_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s] - V^\pi(s)$ $= \mathbb{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s] - V^\pi(s)$ But by the Bellman equation: $V^\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$. So the expectation is zero. This makes TD a form of stochastic approximation for solving the Bellman equation.

Problem 3: Explain why TD learns faster than MC in a random walk problem. (Hint: consider how many random transitions affect each update.)

Answers (click to expand)

In a random walk, MC must wait for episode termination and computes a full return — this compounds the randomness of *all* steps. TD uses a single-step transition, whose variance is much lower. Although TD has bias (bootstrapping from imperfect $V(S_{t+1})$), the reduced variance often dominates, leading to faster convergence. Numerically: if each step's reward has variance $\sigma^2$ and episodes have typical length $L$, MC returns have variance $\approx L\sigma^2$, while TD targets have variance $\approx \sigma^2 + \gamma^2\text{Var}(V(S_{t+1}))$, which is typically much smaller.

Problem 4: In SARSA, if the policy is pure greedy (no exploration), what happens to learning? Why does $\varepsilon$-greedy fix this?

Answers (click to expand)

With pure greedy, SARSA only updates Q-values for actions it already thinks are best. It never tries other actions, so it can never discover if some unvisited action is actually better. $\varepsilon$-greedy forces occasional exploration, ensuring all actions are eventually tried and their Q-values updated. As $\varepsilon \to 0$, SARSA converges to the optimal policy (under the GLIE condition).

Problem 5: Compute the λ-return $G_0^\lambda$ for $\lambda = 0.5$ using the returns from Example 3, with $V = [0, 0, 0]$ for all states. Compare to $G_{0:1}$, $G_{0:2}$, and full $G_0$.

Answers (click to expand)

$G_0^{0.5} = (1-0.5)\sum_{n=1}^\infty 0.5^{n-1} G_{0:n}$ For this finite episode (4 steps), $G_{0:n}$ for $n \geq 4$ equals the full return $G_0 = 10$, and $G_{0:1}=1$, $G_{0:2}=3$, $G_{0:3}=6$. $G_0^{0.5} = 0.5[0.5^0(1) + 0.5^1(3) + 0.5^2(6)] + 0.5^3 \cdot 10 \cdot \frac{1}{1-0.5}$ (tail sum approximation) $= 0.5[1 + 1.5 + 1.5] + 0.125(20) = 0.5(4) + 2.5 = 2 + 2.5 = 4.5$ The λ-return (4.5) lies between $G_{0:1}=1$ and $G_0=10$, balancing the short-term TD estimate with longer-term information.

Summary

Key takeaways:

TD learning bootstraps — it updates estimates using other estimates ($R_{t+1} + \gamma V(S_{t+1})$), enabling online step-by-step learning
The TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the fundamental learning signal; expected to be zero at the true value function
TD occupies the middle ground: lower variance than MC (single-step randomness) but some bias from bootstrapping
SARSA is on-policy TD control — learns Q-values for the actual behavior policy, naturally accounting for exploration risk
TD(λ) interpolates between TD(0) (λ=0) and MC (λ=1) via exponentially weighted n-step returns
TD converges under Robbins-Monro step-size conditions with sufficient exploration

Quiz

Question 1: The TD(0) update uses which of the following as the target?

A. The complete return $G_t$ B. $R_{t+1} + \gamma V(S_{t+1})$ C. $\max_a Q(S_{t+1}, a)$ D. $R_{t+1}$ only

Correct Answer: B