📐 Concept diagram

23-06 — Q-Learning

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-06 Prerequisites: 23-05 — Temporal Difference Learning, 23-02 — Bellman Equations Next subject: 23-07 — Deep Q-Networks (DQN)

Learning Objectives

By the end of this subject, you will be able to:

Derive the Q-learning update rule from the Bellman optimality equation for $Q^*$
Explain why Q-learning is off-policy and what advantages this provides
Prove the convergence conditions for tabular Q-learning
Contrast Q-learning with SARSA and predict when each will perform better
Identify the maximization bias problem and implement double Q-learning

Core Content

The Q-Learning Algorithm

Q-learning (Watkins, 1989) is an off-policy TD control algorithm that directly approximates the optimal action-value function $Q^*$, independently of the policy being followed.

The update rule:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\big]$$

⚠️ CRITICAL: The target uses $\max_{a'} Q(S_{t+1}, a')$ — this is the key difference from SARSA, which uses $Q(S_{t+1}, A_{t+1})$ where $A_{t+1}$ is the action actually taken by the behavior policy.

Algorithm

Initialize $Q(s, a)$ arbitrarily for all $s \in \mathcal{S}, a \in \mathcal{A}$
For each episode:
Initialize $S$
For each step:
- Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\varepsilon$-greedy)
- Take action $A$, observe $R, S'$
- $Q(S, A) \leftarrow Q(S, A) + \alpha[R + \gamma \max_{a'} Q(S', a') - Q(S, A)]$
- $S \leftarrow S'$
Until $S$ is terminal

Why Q-Learning Is Off-Policy

Q-learning is off-policy because: - The behavior policy (how actions are selected) can be anything — $\varepsilon$-greedy, random, or even an entirely different agent's policy - The target policy (what Q-learning tries to learn) is always the greedy policy: $\pi^*(s) = \arg\max_a Q(s, a)$ - The update uses $\max_{a'} Q(S_{t+1}, a')$, which is the greedy choice — this is independent of which action was actually taken

This means Q-learning learns the optimal policy while the agent explores arbitrarily. This separation of exploration and learning is a powerful feature.

Comparison: Q-Learning vs. SARSA

Aspect	Q-Learning	SARSA
Type	Off-policy	On-policy
Target	$R + \gamma \max_{a'} Q(S', a')$	$R + \gamma Q(S', A')$
Learns	$Q^*$ (optimal Q, independent of behavior)	$Q^\pi$ (Q for the behavior policy)
Exploration	Can be arbitrary; learned policy ignores it	Learned policy accounts for exploration
Behavior	More aggressive/optimistic	More conservative/safe
Cliff walking	Learns optimal cliff-hugging path	Learns safer path away from cliff

When to use which: - Q-learning: When you want the optimal policy and can afford exploration risk during training (simulation with resets, no real-world consequences). - SARSA: When exploration mistakes during training are costly (robotics, medical decisions, real-world systems).

Convergence of Q-Learning

Theorem (Watkins & Dayan, 1992): Tabular Q-learning converges to $Q^*$ with probability 1 if:

All state-action pairs are visited infinitely often: $\sum_t \mathbb{I}(S_t=s, A_t=a) \to \infty$ for all $s, a$
The learning rates satisfy the Robbins-Monro conditions: $\sum_{t=1}^\infty \alpha_t(s,a) = \infty$ and $\sum_{t=1}^\infty \alpha_t^2(s,a) < \infty$
Rewards are bounded: $|R_t| \leq R_{\max}$

Proof sketch: Q-learning can be viewed as applying stochastic approximation to solve the Bellman optimality equation $Q^(s,a) = \mathbb{E}[R + \gamma \max_{a'} Q^(S', a') \mid s, a]$. The max operator is a non-expansion in the max-norm ($|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|$), and together with $\gamma < 1$, the composed operator $(1-\alpha)I + \alpha \mathcal{T}^*$ is a contraction in expectation under the right step-size schedule.

⚠️ CRITICAL: Convergence requires decaying learning rates. In practice, people often use constant $\alpha$ for non-stationary environments, but by theory, constant $\alpha$ does not guarantee convergence — it leads to oscillations around the optimum.

Maximization Bias and Double Q-Learning

A subtle but important problem: Q-learning uses $\max_{a'} Q(S', a')$ as the target. Because $Q$ values are estimates with noise, $\max_{a'} Q(S', a')$ is systematically biased upward — it selects the action whose Q-value is overestimated, not necessarily the best action.

Double Q-Learning fixes this by maintaining two Q-functions, $Q_A$ and $Q_B$:

With probability 0.5, update $Q_A$ using target: $R + \gamma Q_B(S', \arg\max_{a'} Q_A(S', a'))$
With probability 0.5, update $Q_B$ using target: $R + \gamma Q_A(S', \arg\max_{a'} Q_B(S', a'))$

This decouples action selection (using one Q-function) from action evaluation (using the other), eliminating the upward bias. The cost: twice the memory and computation, but this becomes negligible in the deep RL extensions (23-07).

Q-Learning as Stochastic Approximation

Q-learning is a form of stochastic approximation for solving:

$$Q(s,a) = \mathbb{E}{s' \sim P(\cdot \mid s,a)}[R(s,a,s') + \gamma \max{a'} Q(s', a')]$$

The update $Q \leftarrow Q + \alpha(\text{target} - Q)$ is a Robbins-Monro stochastic approximation step. The convergence proof relies on showing that the expected update direction points toward the true $Q^*$ and that noise averages out under the step-size conditions.

Key Terms

Behavior
Cliff walking
Double Q-Learning
Exploration
Learns
Target
Type

Worked Examples

Example 1: Q-Learning Update Step

Current Q-table (only relevant entries):

State	$a_1$	$a_2$
$s_1$	5.0	3.0
$s_2$	1.0	8.0

Agent at $s_1$, takes action $a_2$, receives $R = 2$, transitions to $s_2$. $\alpha = 0.1$, $\gamma = 0.9$. Compute the Q-learning update.

Solution:

$\max_{a'} Q(s_2, a') = \max(1.0, 8.0) = 8.0$

Target: $R + \gamma \max_{a'} Q(s_2, a') = 2 + 0.9(8.0) = 2 + 7.2 = 9.2$

Current $Q(s_1, a_2) = 3.0$

TD error: $9.2 - 3.0 = 6.2$

Update: $Q(s_1, a_2) \leftarrow 3.0 + 0.1(6.2) = 3.0 + 0.62 = 3.62$

Click for answer

$Q(s_1, a_2)$ increases from 3.0 to 3.62. The large positive TD error (6.2) indicates the action was much better than previously estimated — promising path discovered.

Example 2: Maximization Bias Demonstration

Consider 2 actions at state $s$, both with true $Q^*(s, a) = 0$. Our estimates have independent Gaussian noise: $Q(s, a_1) \sim \mathcal{N}(0, 1)$, $Q(s, a_2) \sim \mathcal{N}(0, 1)$. What is $\mathbb{E}[\max(Q(s, a_1), Q(s, a_2))]$?

Solution:

For two independent standard normals, $\mathbb{E}[\max(X, Y)] = \frac{1}{\sqrt{\pi}} \approx 0.56$.

So even though the true values are both 0, the maximum of noisy estimates has an expected value of ~0.56 — a systematic upward bias. With more actions, this gets worse: for $k$ independent standard normals, $\mathbb{E}[\max] \approx \sqrt{2\log k}$, growing unboundedly.

Click for answer

Expected max ≈ 0.56 for 2 actions, growing with more actions. This is the maximization bias — Q-learning's use of $\max$ over noisy estimates systematically overestimates value. Double Q-learning eliminates this bias.

Example 3: Q-Learning Convergence Trace

Simple 2-state MDP with one action per state. True $Q^ = [10, 5]$. Initial $Q = [0, 0]$. $\alpha = 0.1$, $\gamma = 0.9$. After many iterations, $Q$ converges to $Q^$. Trace 5 iterations from a specific trajectory sequence.

Solution:

State 1 always gives $R=10$ and goes to state 2. State 2 always gives $R=-2$ and goes to terminal ($Q=0$).

Iteration 1: $(s_1, R=10, s_2)$: $Q(s_1) \leftarrow 0 + 0.1[10 + 0.9(0) - 0] = 1.0$ Iteration 2: $(s_2, R=-2, \text{term})$: $Q(s_2) \leftarrow 0 + 0.1[-2 + 0.9(0) - 0] = -0.2$ Iteration 3: $(s_1, R=10, s_2)$: $Q(s_1) \leftarrow 1.0 + 0.1[10 + 0.9(-0.2) - 1.0] = 1.0 + 0.1(8.82) = 1.882$ Iteration 4: $(s_2, R=-2, \text{term})$: $Q(s_2) \leftarrow -0.2 + 0.1[-2 - (-0.2)] = -0.2 + 0.1(-1.8) = -0.38$ Iteration 5: $(s_1, R=10, s_2)$: $Q(s_1) \leftarrow 1.882 + 0.1[10 + 0.9(-0.38) - 1.882] = 1.882 + 0.1(7.776) = 2.66$

Converging slowly toward $Q(s_1)=10$, $Q(s_2)=-2$.

Click for answer

After 5 iterations: $Q(s_1) = 2.66$, $Q(s_2) = -0.38$. True values: 10 and -2. With constant $\alpha=0.1$, convergence is slow but steady. Decaying $\alpha$ would speed initial convergence but slow final convergence.

Practice Problems

Problem 1: Prove that the max operator is a non-expansion: $|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|$.

Answers (click to expand)

Let $a_f^* = \arg\max_a f(a)$ and $a_g^* = \arg\max_a g(a)$. Case 1: $\max_a f(a) \geq \max_a g(a)$: $|\max_a f(a) - \max_a g(a)| = f(a_f^*) - \max_a g(a) \leq f(a_f^*) - g(a_f^*) \leq \max_a |f(a) - g(a)|$. Case 2: $\max_a f(a) < \max_a g(a)$: $|\max_a f(a) - \max_a g(a)| = g(a_g^*) - \max_a f(a) \leq g(a_g^*) - f(a_g^*) \leq \max_a |f(a) - g(a)|$. In both cases, the difference is bounded by the maximum pointwise difference. This non-expansion property is key to proving Q-learning convergence.

Problem 2: Explain why constant learning rate $\alpha$ is acceptable in non-stationary environments even though theory requires decaying rates for convergence.

Answers (click to expand)

In non-stationary environments, the true optimal Q-function changes over time. A constant $\alpha$ gives an exponentially weighted moving average, which naturally tracks changes. Decaying $\alpha$ would eventually stop learning, making the agent unable to adapt. In practice, constant $\alpha$ causes small oscillations around the optimum rather than exact convergence — acceptable for most applications, especially with function approximation where exact convergence is infeasible anyway.

Problem 3: Show that double Q-learning's update is unbiased in expectation for the maximization step (assuming the two Q-functions have independent zero-mean errors).

Answers (click to expand)

Let $\hat{Q}_A(s,a) = Q^*(s,a) + \epsilon_A(s,a)$ and $\hat{Q}_B(s,a) = Q^*(s,a) + \epsilon_B(s,a)$ where $\mathbb{E}[\epsilon_A] = \mathbb{E}[\epsilon_B] = 0$ and $\epsilon_A, \epsilon_B$ are independent. The double Q-learning target for updating $Q_A$ is $R + \gamma \hat{Q}_B(s', \arg\max_a \hat{Q}_A(s', a))$. $\mathbb{E}[\hat{Q}_B(s', \arg\max_a \hat{Q}_A(s', a))] = \mathbb{E}[Q^*(s', a^*) + \epsilon_B(s', a^*)]$ where $a^* = \arg\max_a \hat{Q}_A(s', a)$. Since $\epsilon_B$ is independent of $\hat{Q}_A$ (which determines $a^*$), $\mathbb{E}[\epsilon_B(s', a^*) \mid a^*] = 0$, so the expected value is $\mathbb{E}[Q^*(s', a^*)]$. This equals $Q^*(s', \arg\max_a Q^*(s', a))$ in expectation — unbiased. Standard Q-learning would have $\mathbb{E}[\max_a \hat{Q}_A(s', a)] > \max_a Q^*(s', a)$ due to the positive bias from noise.

Problem 4: In the cliff-walking problem, Q-learning with ε-greedy (ε = 0.1) sometimes falls off the cliff during learning even though it learns the optimal path. Explain why and how SARSA handles this differently.

Answers (click to expand)

Q-learning learns the optimal policy that hugs the cliff. But during training, the ε-greedy behavior policy occasionally takes random actions. When standing next to the cliff, a random "down" action sends the agent off the cliff. Q-learning's learned policy doesn't account for this because it's off-policy — it learns the greedy optimal path. SARSA, being on-policy, learns values that include the ε-greedy exploration: $Q(S', A')$ uses the action actually taken (possibly random). Near the cliff, SARSA learns that some exploratory moves are catastrophic, so it assigns lower values to cliff-edge states and learns to stay further away.

Problem 5: For a tabular MDP with 100 states, 4 actions, and γ = 0.95, estimate how many visits per state-action pair are needed for Q-learning to converge within ε error, using the contraction argument.

Answers (click to expand)

Initial error: $\|Q_0 - Q^*\|_\infty = \Delta$. After $n$ effective updates (where one effective update = applying the Bellman optimality operator once to all state-action pairs), the error is bounded by $\gamma^n \Delta$. For $\|Q_n - Q^*\|_\infty < \varepsilon$: $\gamma^n \Delta < \varepsilon \implies n > \frac{\log(\varepsilon/\Delta)}{\log \gamma}$. With $\gamma = 0.95$, $\log \gamma \approx -0.0513$. If $\Delta = 100$, for $\varepsilon = 0.1$: $n > \log(0.001)/\log(0.95) = 134.6 \approx 135$ full sweeps. With 400 state-action pairs, that's 400 × 135 ≈ 54,000 updates. But in practice, with decaying α, the early updates are larger and convergence is faster. Constant α requires more updates due to oscillation.

Summary

Key takeaways:

Q-learning directly approximates $Q^*$ using an off-policy TD update: $Q(s,a) \leftarrow Q(s,a) + \alpha[R + \gamma \max_{a'} Q(s',a') - Q(s,a)]$
Off-policy means the behavior policy can be anything — exploration is decoupled from the learned policy
Q-learning converges to $Q^*$ under Robbins-Monro step-size conditions with infinite exploration
The $\max$ operator introduces positive bias (maximization bias) — double Q-learning eliminates it by decoupling action selection and evaluation
Q-learning is more aggressive/optimistic than on-policy SARSA; choose based on whether exploration mistakes are costly
Constant $\alpha$ is practical for non-stationary environments; decaying $\alpha$ is theoretically required for exact convergence

Quiz

Question 1: Q-learning is described as "off-policy" because:

A. It doesn't use a policy at all B. The learned policy (greedy) differs from the behavior policy used to generate data C. It requires offline batch processing D. It only works for deterministic policies

Correct Answer: B