Math graphic
📐 Concept diagram

23-06 — Q-Learning

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-06 Prerequisites: 23-05 — Temporal Difference Learning, 23-02 — Bellman Equations Next subject: 23-07 — Deep Q-Networks (DQN)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the Q-learning update rule from the Bellman optimality equation for $Q^*$
  2. Explain why Q-learning is off-policy and what advantages this provides
  3. Prove the convergence conditions for tabular Q-learning
  4. Contrast Q-learning with SARSA and predict when each will perform better
  5. Identify the maximization bias problem and implement double Q-learning

Core Content

The Q-Learning Algorithm

Q-learning (Watkins, 1989) is an off-policy TD control algorithm that directly approximates the optimal action-value function $Q^*$, independently of the policy being followed.

The update rule:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\big]$$

⚠️ CRITICAL: The target uses $\max_{a'} Q(S_{t+1}, a')$ — this is the key difference from SARSA, which uses $Q(S_{t+1}, A_{t+1})$ where $A_{t+1}$ is the action actually taken by the behavior policy.

Algorithm

  1. Initialize $Q(s, a)$ arbitrarily for all $s \in \mathcal{S}, a \in \mathcal{A}$
  2. For each episode:
  3. Initialize $S$
  4. For each step:
    • Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\varepsilon$-greedy)
    • Take action $A$, observe $R, S'$
    • $Q(S, A) \leftarrow Q(S, A) + \alpha[R + \gamma \max_{a'} Q(S', a') - Q(S, A)]$
    • $S \leftarrow S'$
  5. Until $S$ is terminal

Why Q-Learning Is Off-Policy

Q-learning is off-policy because: - The behavior policy (how actions are selected) can be anything — $\varepsilon$-greedy, random, or even an entirely different agent's policy - The target policy (what Q-learning tries to learn) is always the greedy policy: $\pi^*(s) = \arg\max_a Q(s, a)$ - The update uses $\max_{a'} Q(S_{t+1}, a')$, which is the greedy choice — this is independent of which action was actually taken

This means Q-learning learns the optimal policy while the agent explores arbitrarily. This separation of exploration and learning is a powerful feature.

Comparison: Q-Learning vs. SARSA

Aspect Q-Learning SARSA
Type Off-policy On-policy
Target $R + \gamma \max_{a'} Q(S', a')$ $R + \gamma Q(S', A')$
Learns $Q^*$ (optimal Q, independent of behavior) $Q^\pi$ (Q for the behavior policy)
Exploration Can be arbitrary; learned policy ignores it Learned policy accounts for exploration
Behavior More aggressive/optimistic More conservative/safe
Cliff walking Learns optimal cliff-hugging path Learns safer path away from cliff

When to use which: - Q-learning: When you want the optimal policy and can afford exploration risk during training (simulation with resets, no real-world consequences). - SARSA: When exploration mistakes during training are costly (robotics, medical decisions, real-world systems).

Convergence of Q-Learning

Theorem (Watkins & Dayan, 1992): Tabular Q-learning converges to $Q^*$ with probability 1 if:

  1. All state-action pairs are visited infinitely often: $\sum_t \mathbb{I}(S_t=s, A_t=a) \to \infty$ for all $s, a$
  2. The learning rates satisfy the Robbins-Monro conditions: $\sum_{t=1}^\infty \alpha_t(s,a) = \infty$ and $\sum_{t=1}^\infty \alpha_t^2(s,a) < \infty$
  3. Rewards are bounded: $|R_t| \leq R_{\max}$

Proof sketch: Q-learning can be viewed as applying stochastic approximation to solve the Bellman optimality equation $Q^(s,a) = \mathbb{E}[R + \gamma \max_{a'} Q^(S', a') \mid s, a]$. The max operator is a non-expansion in the max-norm ($|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|$), and together with $\gamma < 1$, the composed operator $(1-\alpha)I + \alpha \mathcal{T}^*$ is a contraction in expectation under the right step-size schedule.

⚠️ CRITICAL: Convergence requires decaying learning rates. In practice, people often use constant $\alpha$ for non-stationary environments, but by theory, constant $\alpha$ does not guarantee convergence — it leads to oscillations around the optimum.

Maximization Bias and Double Q-Learning

A subtle but important problem: Q-learning uses $\max_{a'} Q(S', a')$ as the target. Because $Q$ values are estimates with noise, $\max_{a'} Q(S', a')$ is systematically biased upward — it selects the action whose Q-value is overestimated, not necessarily the best action.

Double Q-Learning fixes this by maintaining two Q-functions, $Q_A$ and $Q_B$:

  1. With probability 0.5, update $Q_A$ using target: $R + \gamma Q_B(S', \arg\max_{a'} Q_A(S', a'))$
  2. With probability 0.5, update $Q_B$ using target: $R + \gamma Q_A(S', \arg\max_{a'} Q_B(S', a'))$

This decouples action selection (using one Q-function) from action evaluation (using the other), eliminating the upward bias. The cost: twice the memory and computation, but this becomes negligible in the deep RL extensions (23-07).

Q-Learning as Stochastic Approximation

Q-learning is a form of stochastic approximation for solving:

$$Q(s,a) = \mathbb{E}{s' \sim P(\cdot \mid s,a)}[R(s,a,s') + \gamma \max{a'} Q(s', a')]$$

The update $Q \leftarrow Q + \alpha(\text{target} - Q)$ is a Robbins-Monro stochastic approximation step. The convergence proof relies on showing that the expected update direction points toward the true $Q^*$ and that noise averages out under the step-size conditions.



Key Terms

Worked Examples

Example 1: Q-Learning Update Step

Current Q-table (only relevant entries):

State $a_1$ $a_2$
$s_1$ 5.0 3.0
$s_2$ 1.0 8.0

Agent at $s_1$, takes action $a_2$, receives $R = 2$, transitions to $s_2$. $\alpha = 0.1$, $\gamma = 0.9$. Compute the Q-learning update.

Solution:

$\max_{a'} Q(s_2, a') = \max(1.0, 8.0) = 8.0$

Target: $R + \gamma \max_{a'} Q(s_2, a') = 2 + 0.9(8.0) = 2 + 7.2 = 9.2$

Current $Q(s_1, a_2) = 3.0$

TD error: $9.2 - 3.0 = 6.2$

Update: $Q(s_1, a_2) \leftarrow 3.0 + 0.1(6.2) = 3.0 + 0.62 = 3.62$

Click for answer $Q(s_1, a_2)$ increases from 3.0 to 3.62. The large positive TD error (6.2) indicates the action was much better than previously estimated — promising path discovered.

Example 2: Maximization Bias Demonstration

Consider 2 actions at state $s$, both with true $Q^*(s, a) = 0$. Our estimates have independent Gaussian noise: $Q(s, a_1) \sim \mathcal{N}(0, 1)$, $Q(s, a_2) \sim \mathcal{N}(0, 1)$. What is $\mathbb{E}[\max(Q(s, a_1), Q(s, a_2))]$?

Solution:

For two independent standard normals, $\mathbb{E}[\max(X, Y)] = \frac{1}{\sqrt{\pi}} \approx 0.56$.

So even though the true values are both 0, the maximum of noisy estimates has an expected value of ~0.56 — a systematic upward bias. With more actions, this gets worse: for $k$ independent standard normals, $\mathbb{E}[\max] \approx \sqrt{2\log k}$, growing unboundedly.

Click for answer Expected max ≈ 0.56 for 2 actions, growing with more actions. This is the maximization bias — Q-learning's use of $\max$ over noisy estimates systematically overestimates value. Double Q-learning eliminates this bias.

Example 3: Q-Learning Convergence Trace

Simple 2-state MDP with one action per state. True $Q^ = [10, 5]$. Initial $Q = [0, 0]$. $\alpha = 0.1$, $\gamma = 0.9$. After many iterations, $Q$ converges to $Q^$. Trace 5 iterations from a specific trajectory sequence.

Solution:

State 1 always gives $R=10$ and goes to state 2. State 2 always gives $R=-2$ and goes to terminal ($Q=0$).

Iteration 1: $(s_1, R=10, s_2)$: $Q(s_1) \leftarrow 0 + 0.1[10 + 0.9(0) - 0] = 1.0$ Iteration 2: $(s_2, R=-2, \text{term})$: $Q(s_2) \leftarrow 0 + 0.1[-2 + 0.9(0) - 0] = -0.2$ Iteration 3: $(s_1, R=10, s_2)$: $Q(s_1) \leftarrow 1.0 + 0.1[10 + 0.9(-0.2) - 1.0] = 1.0 + 0.1(8.82) = 1.882$ Iteration 4: $(s_2, R=-2, \text{term})$: $Q(s_2) \leftarrow -0.2 + 0.1[-2 - (-0.2)] = -0.2 + 0.1(-1.8) = -0.38$ Iteration 5: $(s_1, R=10, s_2)$: $Q(s_1) \leftarrow 1.882 + 0.1[10 + 0.9(-0.38) - 1.882] = 1.882 + 0.1(7.776) = 2.66$

Converging slowly toward $Q(s_1)=10$, $Q(s_2)=-2$.

Click for answer After 5 iterations: $Q(s_1) = 2.66$, $Q(s_2) = -0.38$. True values: 10 and -2. With constant $\alpha=0.1$, convergence is slow but steady. Decaying $\alpha$ would speed initial convergence but slow final convergence.

Practice Problems

Problem 1: Prove that the max operator is a non-expansion: $|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|$.

Answers (click to expand) Let $a_f^* = \arg\max_a f(a)$ and $a_g^* = \arg\max_a g(a)$. Case 1: $\max_a f(a) \geq \max_a g(a)$: $|\max_a f(a) - \max_a g(a)| = f(a_f^*) - \max_a g(a) \leq f(a_f^*) - g(a_f^*) \leq \max_a |f(a) - g(a)|$. Case 2: $\max_a f(a) < \max_a g(a)$: $|\max_a f(a) - \max_a g(a)| = g(a_g^*) - \max_a f(a) \leq g(a_g^*) - f(a_g^*) \leq \max_a |f(a) - g(a)|$. In both cases, the difference is bounded by the maximum pointwise difference. This non-expansion property is key to proving Q-learning convergence.

Problem 2: Explain why constant learning rate $\alpha$ is acceptable in non-stationary environments even though theory requires decaying rates for convergence.

Answers (click to expand) In non-stationary environments, the true optimal Q-function changes over time. A constant $\alpha$ gives an exponentially weighted moving average, which naturally tracks changes. Decaying $\alpha$ would eventually stop learning, making the agent unable to adapt. In practice, constant $\alpha$ causes small oscillations around the optimum rather than exact convergence — acceptable for most applications, especially with function approximation where exact convergence is infeasible anyway.

Problem 3: Show that double Q-learning's update is unbiased in expectation for the maximization step (assuming the two Q-functions have independent zero-mean errors).

Answers (click to expand) Let $\hat{Q}_A(s,a) = Q^*(s,a) + \epsilon_A(s,a)$ and $\hat{Q}_B(s,a) = Q^*(s,a) + \epsilon_B(s,a)$ where $\mathbb{E}[\epsilon_A] = \mathbb{E}[\epsilon_B] = 0$ and $\epsilon_A, \epsilon_B$ are independent. The double Q-learning target for updating $Q_A$ is $R + \gamma \hat{Q}_B(s', \arg\max_a \hat{Q}_A(s', a))$. $\mathbb{E}[\hat{Q}_B(s', \arg\max_a \hat{Q}_A(s', a))] = \mathbb{E}[Q^*(s', a^*) + \epsilon_B(s', a^*)]$ where $a^* = \arg\max_a \hat{Q}_A(s', a)$. Since $\epsilon_B$ is independent of $\hat{Q}_A$ (which determines $a^*$), $\mathbb{E}[\epsilon_B(s', a^*) \mid a^*] = 0$, so the expected value is $\mathbb{E}[Q^*(s', a^*)]$. This equals $Q^*(s', \arg\max_a Q^*(s', a))$ in expectation — unbiased. Standard Q-learning would have $\mathbb{E}[\max_a \hat{Q}_A(s', a)] > \max_a Q^*(s', a)$ due to the positive bias from noise.

Problem 4: In the cliff-walking problem, Q-learning with ε-greedy (ε = 0.1) sometimes falls off the cliff during learning even though it learns the optimal path. Explain why and how SARSA handles this differently.

Answers (click to expand) Q-learning learns the optimal policy that hugs the cliff. But during training, the ε-greedy behavior policy occasionally takes random actions. When standing next to the cliff, a random "down" action sends the agent off the cliff. Q-learning's learned policy doesn't account for this because it's off-policy — it learns the greedy optimal path. SARSA, being on-policy, learns values that include the ε-greedy exploration: $Q(S', A')$ uses the action actually taken (possibly random). Near the cliff, SARSA learns that some exploratory moves are catastrophic, so it assigns lower values to cliff-edge states and learns to stay further away.

Problem 5: For a tabular MDP with 100 states, 4 actions, and γ = 0.95, estimate how many visits per state-action pair are needed for Q-learning to converge within ε error, using the contraction argument.

Answers (click to expand) Initial error: $\|Q_0 - Q^*\|_\infty = \Delta$. After $n$ effective updates (where one effective update = applying the Bellman optimality operator once to all state-action pairs), the error is bounded by $\gamma^n \Delta$. For $\|Q_n - Q^*\|_\infty < \varepsilon$: $\gamma^n \Delta < \varepsilon \implies n > \frac{\log(\varepsilon/\Delta)}{\log \gamma}$. With $\gamma = 0.95$, $\log \gamma \approx -0.0513$. If $\Delta = 100$, for $\varepsilon = 0.1$: $n > \log(0.001)/\log(0.95) = 134.6 \approx 135$ full sweeps. With 400 state-action pairs, that's 400 × 135 ≈ 54,000 updates. But in practice, with decaying α, the early updates are larger and convergence is faster. Constant α requires more updates due to oscillation.

Summary

Key takeaways:


Quiz

Question 1: Q-learning is described as "off-policy" because:

A. It doesn't use a policy at all B. The learned policy (greedy) differs from the behavior policy used to generate data C. It requires offline batch processing D. It only works for deterministic policies

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. The update uses $\max_{a'} Q(s', a')$ (greedy target policy), regardless of which action was actually taken (behavior policy). This decouples learning from exploration. - If you chose A: Q-learning learns the greedy policy implicitly through Q-values. - If you chose C: Q-learning can be online (step-by-step) or offline (batch). - If you chose D: Q-learning works with any behavior policy, stochastic or deterministic.

Question 2: In the Q-learning update, $\max_{a'} Q(S_{t+1}, a')$ corresponds to:

A. The expected value under the behavior policy B. The greedy action selection with respect to the current Q estimates C. The average of all Q-values at $S_{t+1}$ D. The minimum Q-value at $S_{t+1}$

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. The max selects the best action according to current estimates, embodying the Bellman optimality equation for $Q^*$. - If you chose A: That would be SARSA's target: $Q(S_{t+1}, A_{t+1})$ where $A_{t+1}$ is the action actually taken. - If you chose C: Averaging would correspond to a uniform random policy, not the optimal one. - If you chose D: Taking the min would try to find the worst action — not useful for maximizing return.

Question 3: What is the maximization bias in Q-learning?

A. Q-values are always underestimated B. The $\max$ operator over noisy Q-estimates systematically overestimates the true value C. The learning rate causes divergence D. The discount factor biases toward short-term rewards

Correct Answer: B

Explanation (click to expand) - If you chose B: Correct. $\max_a \hat{Q}(s,a)$ with noisy estimates selects the action with the highest overestimation error, introducing positive bias. - If you chose A: The bias is upward (overestimation), not downward. - If you chose C: Learning rate affects convergence speed and stability, not bias direction. - If you chose D: Discount factor is a design choice, not a bias introduced by the algorithm.

Question 4: Double Q-learning eliminates maximization bias by:

A. Using two Q-functions: one to select the action, another to evaluate it B. Halving the learning rate C. Using the minimum instead of maximum D. Averaging over all actions

Correct Answer: A

Explanation (click to expand) - If you chose A: Correct. By using independent Q-functions for selection ($Q_A$) and evaluation ($Q_B$), the positive bias cancels in expectation. - If you chose B: Learning rate doesn't address the structural bias of the $\max$ operator. - If you chose C: Taking the minimum would introduce negative bias — also undesirable. - If you chose D: Averaging loses the optimality guarantee — you'd learn values for a uniform policy.

Question 5: For Q-learning to converge with probability 1, which condition must hold?

A. All state-action pairs must be visited infinitely often B. The learning rate must be constant C. The discount factor must be 1 D. The policy must be deterministic

Correct Answer: A

Explanation (click to expand) - If you chose A: Correct. Infinite exploration ensures estimates at all $(s,a)$ pairs are updated, allowing the contraction property to take effect globally. - If you chose B: Constant $\alpha$ leads to oscillations; Robbins-Monro conditions require $\sum \alpha_t = \infty$ but $\sum \alpha_t^2 < \infty$. - If you chose C: $\gamma$ must be $< 1$ for the contraction property to hold in continuing tasks. - If you chose D: The policy does not need to be deterministic for convergence.

Next Steps

Next up: 23-07 — Deep Q-Networks (DQN) — where tabular Q-learning meets deep neural networks, enabling RL in high-dimensional state spaces like raw pixels from Atari games.


Pitfalls

  1. Treating Q-learning as on-policy. Q-learning is fundamentally off-policy: the update uses $$\max_{a'} Q(s', a')$$ regardless of which action was actually taken. This means you can use any behavior policy (e.g., random, ε-greedy, expert demonstrations) while still learning the optimal greedy policy. Don't confuse the behavior policy with what Q-learning learns.

  2. Ignoring maximization bias. The max over noisy Q-estimates systematically overestimates the true value. With 10 actions and noisy estimates, $$\mathbb{E}[\max_a \hat{Q}(s,a)] > \max_a Q^*(s,a)$$. This bias compounds through bootstrapping and can lead to wildly overoptimistic value estimates. Double Q-learning fixes this almost for free.

  3. Using constant α for tabular convergence proofs. Tabular Q-learning's convergence proof (Watkins & Dayan, 1992) requires decaying learning rates satisfying the Robbins-Monro conditions. Constant α works in practice (especially for non-stationary problems) but theoretically only converges to a distribution around the optimum.

  4. Misunderstanding the cliff-walking result. Q-learning learns the optimal (cliff-hugging) policy, but during training with ε-greedy exploration, the agent may frequently fall off the cliff. Q-learning's learned policy is optimal for a fully greedy agent — it doesn't account for exploration noise. If exploration mistakes are costly during training, use SARSA instead.




Q8: In a non-stationary environment, why might constant α be preferable to decaying α?

A) Constant α is theoretically required for convergence B) Constant α gives an exponentially weighted moving average that can track changes over time C) Constant α eliminates maximization bias D) Decaying α doesn't work with Q-learning

Answer and Explanations **Correct: B)** With constant α, recent experiences have exponentially more weight than old ones: the effective window size is roughly $$1/\alpha$$ steps. This lets Q-learning adapt when the environment changes. Decaying α would eventually stop learning altogether. - A) Decaying α is what theory requires for exact convergence in stationary settings. - C) Constant α doesn't address maximization bias — that's a separate issue. - D) Decaying α works fine with Q-learning in theory; constant α is a practical choice for non-stationarity.