Math graphic
📐 Concept diagram

23-07 — Deep Q-Networks (DQN)

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-07 Prerequisites: 23-06 — Q-Learning, Phase 17 (Neural Networks) Next subject: 23-08 — Policy Gradient Methods


Learning Objectives

By the end of this subject, you will be able to:

  1. Formulate the loss function for a neural network Q-function approximator and derive the gradient update
  2. Design an experience replay buffer and analyze why random sampling breaks harmful temporal correlations
  3. Derive the target network update rule and explain how it stabilizes Q-learning with function approximation
  4. Modify DQN to implement Double DQN by decoupling action selection and evaluation
  5. Apply the Dueling architecture to separate state value from action advantage estimates

Core Content

Why Deep Q-Networks?

Tabular Q-learning (23-06) stores one value per state-action pair. For most real-world problems, the state space is enormous (e.g., all possible Atari game frames: $256^{84 \times 84 \times 3}$ possibilities). We need function approximation:

$$Q(s, a) \approx Q(s, a; \theta)$$

where $\theta$ are the parameters of a neural network. But naively plugging a neural network into Q-learning fails catastrophically for three reasons:

  1. Correlated samples: Sequential states are highly correlated $\implies$ gradient updates are not i.i.d., breaking SGD assumptions.
  2. Non-stationary targets: The target $R + \gamma \max_{a'} Q(s', a'; \theta)$ depends on $\theta$ itself, creating a moving target that causes oscillations or divergence.
  3. Maximization bias amplified: The max over noisy Q-estimates gets worse with high-dimensional networks.

DQN (Mnih et al., 2013/2015) solves these with two innovations: Experience Replay and Target Networks.

DQN Loss Function

The loss at iteration $i$ is the mean-squared TD error over a mini-batch of experiences:

$$\mathcal{L}i(\theta_i) = \mathbb{E}{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)^2\right]$$

where: - $\mathcal{D}$ is the experience replay buffer - $\theta_i$ are the online (current) network parameters - $\theta_i^-$ are the target network parameters (frozen copy of $\theta$, updated periodically)

The gradient:

$$\nabla_{\theta_i} \mathcal{L}i(\theta_i) = \mathbb{E}{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right) \cdot \nabla_{\theta_i} Q(s, a; \theta_i)\right]$$

⚠️ CRITICAL: The gradient flows through $Q(s, a; \theta_i)$ but NOT through the target $r + \gamma \max_{a'} Q(s', a'; \theta_i^-)$. The target network parameters $\theta_i^-$ are treated as constants during backpropagation. This is essential for stability.

Experience Replay

Data structure: A circular buffer (FIFO queue) of fixed capacity $N_{\text{replay}}$ (typically $10^5$–$10^6$ transitions).

Operation: 1. At each time step, store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in the buffer 2. Every $K$ steps (typically $K=4$), sample a random mini-batch of $B$ transitions (typically $B=32$) 3. Train on this mini-batch using SGD

Why uniform random sampling works:

Consider the temporal correlation in sequential experiences. On-policy sequences generate highly correlated gradient estimates. The covariance between successive gradient steps $\nabla_{\theta} \mathcal{L}(s_t)$ and $\nabla_{\theta} \mathcal{L}(s_{t+1})$ can be large, causing SGD to oscillate along narrow valleys.

Random sampling from a large buffer $\mathcal{D}$ approximates the stationary distribution of past experiences:

$$\mathbb{E}{(s,a,r,s') \sim \mathcal{D}}[\nabla{\theta} \mathcal{L}] \approx \mathbb{E}{(s,a) \sim d^{\mu}(s,a)}[\nabla{\theta} \mathcal{L}]$$

where $d^{\mu}(s,a)$ is the empirical state-action visitation distribution (a mixture of past behavior policies). This breaks harmful correlations and satisfies the i.i.d. assumption better than online updates.

Benefits beyond decorrelation: - Each transition is reused multiple times $\rightarrow$ better sample efficiency - Mini-batch averaging reduces variance of gradient estimates

Target Network

The target network $\theta^-$ addresses the non-stationary target problem. The update rule:

$$\theta^- \leftarrow \theta \quad \text{every } C \text{ steps}$$

or with soft updates (Polyak averaging):

$$\theta^- \leftarrow \tau \theta + (1 - \tau) \theta^-$$

where $\tau \ll 1$ (e.g., $\tau = 0.001$).

Why this works: By keeping the target $r + \gamma \max_{a'} Q(s', a'; \theta^-)$ fixed for $C$ steps, Q-learning with function approximation becomes approximately supervised regression on a fixed target. This prevents the "chasing its own tail" phenomenon where both the Q-function and its target change simultaneously.

Mathematical intuition: DQN is essentially solving a sequence of regression problems. Each regression problem uses fixed targets from the frozen network, and the solution becomes the next target network. This is an instance of fitted Q-iteration interleaved with neural network training.

Double DQN

Standard DQN uses the same network (the target network) for both selecting and evaluating the best next action:

$$Y_t^{\text{DQN}} = R_{t+1} + \gamma Q\left(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a'; \theta_i^-); \theta_i^-\right)$$

This inherits Q-learning's maximization bias (see 23-06). Double DQN decouples:

$$Y_t^{\text{DoubleDQN}} = R_{t+1} + \gamma Q\left(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a'; \theta_i); \theta_i^-\right)$$

This eliminates the overestimation bias at minimal computational cost. The online network's Q-values are noisy but independent of the target network's Q-values used for evaluation.

Empirical result: Double DQN reduces overestimation from roughly +0.5 to near zero in Atari experiments, and the more accurate value estimates lead to better policies.

Dueling DQN

The dueling architecture splits the Q-network into two streams that are recombined:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)$$

where: - $V(s; \theta, \beta)$ is the state value function (how good it is to be in state $s$) - $A(s, a; \theta, \alpha)$ is the action advantage function (how much better action $a$ is compared to other actions)

⚠️ CRITICAL: Identifiability problem. From $Q = V + A$, we can add a constant to $V$ and subtract it from $A$ and get the same $Q$. The solution is to force the advantage to have zero mean:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\right)$$

This gives an identifiability condition: the expected advantage is always zero, so $V(s) = \mathbb{E}_{a \sim \text{uniform}}[Q(s,a)]$.

Why dueling helps: In many states, the choice of action doesn't matter much (e.g., when no enemies are near in an Atari game). The dueling architecture can learn that $V(s)$ captures most of the signal while $A(s,a)$ is nearly zero, learning this structure more efficiently than a monolithic Q-network.

Prioritized Experience Replay

Standard DQN samples uniformly from the replay buffer. But not all transitions are equally informative. Prioritized Experience Replay (Schaul et al., 2016) samples transitions with probability proportional to their TD error magnitude:

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

where $p_i = |\delta_i| + \varepsilon$ (TD error plus a small constant to allow zero-error transitions to be sampled), and $\alpha$ controls the degree of prioritization ($\alpha = 0 \implies$ uniform, $\alpha = 1 \implies$ full prioritization).

Importance sampling correction: Prioritized sampling biases the distribution. To restore unbiased gradient estimates, importance sampling weights are applied:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$$

where $\beta$ is annealed from an initial value (e.g., 0.4) to 1 over training. The update becomes:

$$\Delta \theta \propto w_i \cdot \delta_i \cdot \nabla_\theta Q(s, a; \theta)$$

Efficient implementation: Use a sum-tree data structure for $O(\log N)$ sampling and updates.



Key Terms

Worked Examples

Example 1: DQN Loss Computation

The online network gives $Q(s, a_1; \theta) = [2.5, 3.1, 1.8]$ for three actions. The target network gives $Q(s', a'; \theta^-) = [4.0, 2.2, 5.5]$. The agent took action $a_2$ (value 3.1), received $r = 1.0$, and $\gamma = 0.99$. Compute the TD error and contribution to the loss.

Solution:

Target: $y = r + \gamma \max_{a'} Q(s', a'; \theta^-) = 1.0 + 0.99 \cdot 5.5 = 1.0 + 5.445 = 6.445$

Current: $Q(s, a_2; \theta) = 3.1$

TD error: $\delta = 6.445 - 3.1 = 3.345$

Loss contribution: $\frac{1}{2}\delta^2 = \frac{1}{2}(3.345)^2 = 5.59$

Click for answer The large positive TD error (3.345) indicates action $a_2$ is much better than currently estimated. The squared loss of 5.59 will drive gradient updates to increase $Q(s, a_2; \theta)$ toward 6.445.

Example 2: Target Network Update Timing

DQN with soft updates ($\tau = 0.001$) has just completed 10,000 gradient steps since the last hard update. The online network has parameters $\theta$, and the target network has $\theta^-$. If $|\theta - \theta^-|_2 = 0.5$, estimate the effective "age" of the target network in terms of gradient steps, assuming each step changes $\theta$ by roughly $\eta \cdot |\nabla \mathcal{L}| \approx 0.001$.

Solution:

Each gradient step changes $\theta$ by approximately $0.001$ in $\ell_2$ norm. After 10,000 steps, the cumulative change is roughly $10,000 \times 0.001 = 10$. The observed distance is only 0.5, meaning soft updates have kept $\theta^-$ close to $\theta$.

With $\tau = 0.001$, each soft update moves $\theta^-$ by $\tau |\theta - \theta^-|$, and this happens at every step. After $n$ steps: $(1 - 0.001)^n \approx e^{-0.001 n}$. For the initial distance of 10 to decay to 0.5: $10 \cdot e^{-0.001 n} = 0.5 \implies e^{-0.001 n} = 0.05 \implies n \approx 3000$. So the effective time constant is about 1000 steps.

Click for answer The target network lags behind by roughly 1000 steps (the $1/\tau$ time constant of Polyak averaging). This is the intended behavior — enough lag to stabilize learning, but not so much that the target becomes stale.

Example 3: Double DQN vs. DQN Bias

In a state, the true $Q^*(s, a)$ values are $[0, 0, 0, 0, 10]$ (one good action among 5). The online network estimates them with independent Gaussian noise $\mathcal{N}(0, 1)$. The target network has independent noise $\mathcal{N}(0, 1)$ as well. Compute $\mathbb{E}[Y^{\text{DQN}}]$ and $\mathbb{E}[Y^{\text{DoubleDQN}}]$, ignoring $R$ and $\gamma$ for simplicity.

Solution:

Standard DQN: Target network selects and evaluates: $Y^{\text{DQN}} = \max_{a'} Q(s', a'; \theta^-)$. With 5 noisy estimates around 0, $\mathbb{E}[\max] \approx \sqrt{2\log 5} \approx 1.79$ (extreme value theory), even though the true max is 0 for the first 4 actions and 10 for the good one. In practice, the max bias is dominated by the true high value (10) once the network learns it, but early in training, the noise dominates.

Double DQN: Online network selects $a^ = \arg\max_a Q(s', a; \theta)$. Target network evaluates $Q(s', a^; \theta^-)$. Since the online and target networks have independent noise, $\mathbb{E}[Q(s', a^; \theta^-)] = \mathbb{E}[Q^(s', a^*)]$ (unbiased). The selection noise of the online network does not bias the target network's evaluation.

Click for answer Standard DQN systematically overestimates (expectation of max of noisy estimates > true max). Double DQN eliminates this because the evaluation network's noise is independent of which action was selected. Overestimation bias: $\approx$ 0.5–2.0 for DQN, $\approx$ 0 for Double DQN.

Practice Problems

Problem 1: Derive the DQN gradient update rule starting from the MSE loss. Show why the gradient does NOT flow through the target $r + \gamma \max Q(s', a'; \theta^-)$.

Answer (click to expand) Let $\mathcal{L} = \frac{1}{2}\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2$. $$\nabla_\theta \mathcal{L} = -\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right) \cdot \nabla_\theta Q(s, a; \theta)$$ The target term $r + \gamma \max_{a'} Q(s', a'; \theta^-)$ is treated as a constant because $\theta^-$ is fixed (not a function of $\theta$). If the gradient flowed through both terms, we'd have an additional term $\propto \nabla_\theta Q(s', a'; \theta^-) = 0$ (since target is frozen). The stop-gradient is enforced by the periodic (or soft) update schedule.

Problem 2: An experience replay buffer holds 1 million transitions. During training, the agent visits 30 states per second and stores each transition. If mini-batches are sampled uniformly every 4 steps with batch size 32, what fraction of experiences are ever used for training, assuming a 10-minute training run?

Answer (click to expand) Total transitions stored: 30 transitions/sec × 600 sec = 18,000 transitions (buffer doesn't fill up) Training steps: every 4 steps = 18,000/4 = 4,500 training iterations Total transitions sampled: 4,500 × 32 = 144,000 Each transition is sampled on average 144,000/18,000 = 8 times. So each transition is reused about 8 times — much more sample-efficient than online learning where each transition is used once and discarded.

Problem 3: In Dueling DQN, prove that with the mean-subtraction formulation $Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')$, we have $\mathbb{E}{a \sim \text{uniform}}[Q(s,a)] = V(s)$ and $\mathbb{E}{a \sim \text{uniform}}[A(s,a)] = 0$.

Answer (click to expand) $$\mathbb{E}_a[Q(s,a)] = \frac{1}{|\mathcal{A}|}\sum_a \left(V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')\right)$$ $$= \frac{1}{|\mathcal{A}|}\sum_a V(s) + \frac{1}{|\mathcal{A}|}\sum_a A(s,a) - \frac{1}{|\mathcal{A}|}\sum_a \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')$$ $$= V(s) + \frac{1}{|\mathcal{A}|}\sum_a A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a') = V(s)$$ Similarly: $\mathbb{E}_a[A(s,a) - \bar{A}(s)] = \bar{A}(s) - \bar{A}(s) = 0$. The identifiability condition holds exactly.

Problem 4: In Prioritized Experience Replay with $\alpha = 1$ and proportional prioritization ($p_i = |\delta_i| + \varepsilon$), show that without importance sampling ($\beta = 0$), the expected update is biased toward high-TD-error transitions.

Answer (click to expand) The expected gradient without IS correction is: $$\mathbb{E}_{i \sim P}[\nabla_\theta \mathcal{L}_i] = \sum_i \frac{p_i}{\sum_k p_k} \nabla_\theta \mathcal{L}_i \neq \frac{1}{N}\sum_i \nabla_\theta \mathcal{L}_i = \mathbb{E}_{i \sim \text{uniform}}[\nabla_\theta \mathcal{L}_i]$$ The prioritization distribution $P(i) \propto |\delta_i|$ oversamples high-TD-error transitions. Without the IS weight $w_i = (N \cdot P(i))^{-\beta}$, the gradient estimate is biased toward these transitions. With $\beta = 1$, $\mathbb{E}[w_i \cdot \nabla_\theta \mathcal{L}_i] = \sum_i \frac{p_i}{\sum_k p_k} \cdot (N \cdot \frac{p_i}{\sum_k p_k})^{-1} \cdot \nabla_\theta \mathcal{L}_i = \frac{1}{N}\sum_i \nabla_\theta \mathcal{L}_i$, restoring unbiasedness.

Problem 5: A DQN agent on an Atari game uses a CNN that processes 4 stacked frames (84×84×4 input). The action space has 18 actions. If the replay buffer stores each transition as (4×84×84 uint8 frames + action + reward), estimate the memory usage for a buffer of 1 million transitions in gigabytes.

Answer (click to expand) Each frame: 84 × 84 = 7,056 bytes (uint8) 4 stacked frames: 4 × 7,056 = 28,224 bytes per state Two states (s and s'): 56,448 bytes Plus action (4 bytes as int32), reward (4 bytes as float32), terminal flag (1 byte): ~9 bytes Per transition: ~56,457 bytes ≈ 55.1 KB 1 million transitions: ~55.1 GB (without compression) In practice, frames are stored as uint8 (28KB per 4-frame stack), and the DQN paper used 1M transitions ≈ 28 GB (storing only s, not s', or using references). Modern implementations often use frame-level storage with pointers to save memory. This is why clever replay buffer design matters.

Summary


Quiz

Question 1: Why does DQN use a target network separate from the online network?

A. To reduce memory usage B. To provide a stable, slowly-changing target for the TD update, preventing divergence C. To enable Double DQN D. It's required for backpropagation through time

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. Without a frozen target, both the prediction and the target use the same network, creating a feedback loop where Q-values can diverge. - If you chose A: The target network uses additional memory, not less. - If you chose C: Double DQN is a separate enhancement that also uses two networks but for a different purpose. - If you chose D: DQN doesn't use backpropagation through time.

Question 2: In experience replay, why is random sampling from the buffer better than training on consecutive transitions?

A. Random sampling is faster B. Consecutive transitions are highly correlated, violating the i.i.d. assumption of SGD C. The buffer discards old transitions D. Consecutive transitions have zero TD error

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. Sequential states have high correlation, causing SGD to oscillate and converge slowly or not at all. - If you chose A: Random sampling is actually slightly slower due to non-sequential memory access. - If you chose C: The buffer is a FIFO — old transitions are retained until evicted. - If you chose D: Consecutive transitions can have any TD error; correlation is the issue, not error magnitude.

Question 3: Double DQN eliminates overestimation bias by:

A. Using a larger network B. Decoupling action selection (online network) from action evaluation (target network) C. Halving the learning rate D. Using only positive rewards

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. The online network picks the best action; the target network evaluates it. Independent noise in the two networks prevents systematic overestimation. - If you chose A: Network size doesn't address the structural bias of the max operator. - If you chose C: Learning rate affects convergence speed, not bias direction. - If you chose D: Reward sign is irrelevant to the estimation bias.

Question 4: In the Dueling DQN architecture, what constraint resolves the identifiability problem?

A. $V(s) \geq 0$ B. $\sum_a A(s,a) = 0$ (mean-subtraction of advantages) C. $Q(s,a) \geq 0$ D. $A(s,a) = V(s)$

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. Forcing the advantage to have zero mean makes the decomposition $Q = V + A$ unique. - If you chose A: Non-negativity doesn't prevent adding a constant to $V$ and subtracting from $A$. - If you chose C: Same issue as A. - If you chose D: That would make all advantages equal to the same value, defeating the purpose.

Question 5: In Prioritized Experience Replay, what are transitions prioritized by?

A. The absolute TD error $|\delta|$ B. The age of the transition C. The magnitude of the reward D. The state visitation count

Correct Answer: A

Explanation (click to expand) - **If you chose A:** Correct. Transitions with large TD errors are more "surprising" and thus more informative for learning. - If you chose B: Age-based prioritization is a different approach (FIFO replay). - If you chose C: Reward magnitude alone doesn't indicate informativeness. - If you chose D: State visitation count relates to exploration, not replay prioritization.

Pitfalls

  1. Forgetting to stop gradients on the target: If gradients flow through the target network, the loss is no longer a meaningful TD error — it collapses to a degenerate solution.
  2. Setting the replay buffer too small: Too small a buffer behaves like online learning. Too large retains obsolete experiences from early, random policies. Typical sizes: 50K–1M.
  3. Target update frequency too high: Updating the target network every step defeats its purpose. Typical: every 1000–10000 steps.
  4. Applying Double DQN incorrectly: The selection should use the online network, evaluation uses the target network. Reversing them gives standard DQN behavior.
  5. Dueling without mean subtraction: The network can learn arbitrary shifts between $V$ and $A$, making training unstable and interpretation impossible.

Next Steps

Next up: 23-08 — Policy Gradient Methods — where we leave value-based methods behind and directly optimize the policy by ascending the gradient of expected return.




Q8: Why is DQN considered an "off-policy" algorithm?

A) It uses ε-greedy exploration B) The replay buffer contains transitions from many previous behavior policies, but the algorithm learns the greedy policy C) It doesn't use any policy at all D) It requires offline pre-training

Answer and Explanations **Correct: B)** DQN inherits Q-learning's off-policy nature. The replay buffer stores transitions generated by older versions of the policy (different ε values, different network weights), but the max operator in the target always drives learning toward the greedy optimal policy. - A) ε-greedy is the behavior policy mechanism — it doesn't make the algorithm on-policy. - C) DQN implicitly learns a greedy policy via Q-values. - D) DQN can be trained online, interleaving data collection with learning from the replay buffer.