📐 Concept diagram

23-07 — Deep Q-Networks (DQN)

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-07 Prerequisites: 23-06 — Q-Learning, Phase 17 (Neural Networks) Next subject: 23-08 — Policy Gradient Methods

Learning Objectives

By the end of this subject, you will be able to:

Formulate the loss function for a neural network Q-function approximator and derive the gradient update
Design an experience replay buffer and analyze why random sampling breaks harmful temporal correlations
Derive the target network update rule and explain how it stabilizes Q-learning with function approximation
Modify DQN to implement Double DQN by decoupling action selection and evaluation
Apply the Dueling architecture to separate state value from action advantage estimates

Core Content

Why Deep Q-Networks?

Tabular Q-learning (23-06) stores one value per state-action pair. For most real-world problems, the state space is enormous (e.g., all possible Atari game frames: $256^{84 \times 84 \times 3}$ possibilities). We need function approximation:

$$Q(s, a) \approx Q(s, a; \theta)$$

where $\theta$ are the parameters of a neural network. But naively plugging a neural network into Q-learning fails catastrophically for three reasons:

Correlated samples: Sequential states are highly correlated $\implies$ gradient updates are not i.i.d., breaking SGD assumptions.
Non-stationary targets: The target $R + \gamma \max_{a'} Q(s', a'; \theta)$ depends on $\theta$ itself, creating a moving target that causes oscillations or divergence.
Maximization bias amplified: The max over noisy Q-estimates gets worse with high-dimensional networks.

DQN (Mnih et al., 2013/2015) solves these with two innovations: Experience Replay and Target Networks.

DQN Loss Function

The loss at iteration $i$ is the mean-squared TD error over a mini-batch of experiences:

$$\mathcal{L}i(\theta_i) = \mathbb{E}{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)^2\right]$$

where: - $\mathcal{D}$ is the experience replay buffer - $\theta_i$ are the online (current) network parameters - $\theta_i^-$ are the target network parameters (frozen copy of $\theta$, updated periodically)

The gradient:

$$\nabla_{\theta_i} \mathcal{L}i(\theta_i) = \mathbb{E}{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right) \cdot \nabla_{\theta_i} Q(s, a; \theta_i)\right]$$

⚠️ CRITICAL: The gradient flows through $Q(s, a; \theta_i)$ but NOT through the target $r + \gamma \max_{a'} Q(s', a'; \theta_i^-)$. The target network parameters $\theta_i^-$ are treated as constants during backpropagation. This is essential for stability.

Experience Replay

Data structure: A circular buffer (FIFO queue) of fixed capacity $N_{\text{replay}}$ (typically $10^5$–$10^6$ transitions).

Operation: 1. At each time step, store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in the buffer 2. Every $K$ steps (typically $K=4$), sample a random mini-batch of $B$ transitions (typically $B=32$) 3. Train on this mini-batch using SGD

Why uniform random sampling works:

Consider the temporal correlation in sequential experiences. On-policy sequences generate highly correlated gradient estimates. The covariance between successive gradient steps $\nabla_{\theta} \mathcal{L}(s_t)$ and $\nabla_{\theta} \mathcal{L}(s_{t+1})$ can be large, causing SGD to oscillate along narrow valleys.

Random sampling from a large buffer $\mathcal{D}$ approximates the stationary distribution of past experiences:

$$\mathbb{E}{(s,a,r,s') \sim \mathcal{D}}[\nabla{\theta} \mathcal{L}] \approx \mathbb{E}{(s,a) \sim d^{\mu}(s,a)}[\nabla{\theta} \mathcal{L}]$$

where $d^{\mu}(s,a)$ is the empirical state-action visitation distribution (a mixture of past behavior policies). This breaks harmful correlations and satisfies the i.i.d. assumption better than online updates.

Benefits beyond decorrelation: - Each transition is reused multiple times $\rightarrow$ better sample efficiency - Mini-batch averaging reduces variance of gradient estimates

Target Network

The target network $\theta^-$ addresses the non-stationary target problem. The update rule:

$$\theta^- \leftarrow \theta \quad \text{every } C \text{ steps}$$

or with soft updates (Polyak averaging):

$$\theta^- \leftarrow \tau \theta + (1 - \tau) \theta^-$$

where $\tau \ll 1$ (e.g., $\tau = 0.001$).

Why this works: By keeping the target $r + \gamma \max_{a'} Q(s', a'; \theta^-)$ fixed for $C$ steps, Q-learning with function approximation becomes approximately supervised regression on a fixed target. This prevents the "chasing its own tail" phenomenon where both the Q-function and its target change simultaneously.

Mathematical intuition: DQN is essentially solving a sequence of regression problems. Each regression problem uses fixed targets from the frozen network, and the solution becomes the next target network. This is an instance of fitted Q-iteration interleaved with neural network training.

Double DQN

Standard DQN uses the same network (the target network) for both selecting and evaluating the best next action:

$$Y_t^{\text{DQN}} = R_{t+1} + \gamma Q\left(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a'; \theta_i^-); \theta_i^-\right)$$

This inherits Q-learning's maximization bias (see 23-06). Double DQN decouples:

$$Y_t^{\text{DoubleDQN}} = R_{t+1} + \gamma Q\left(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a'; \theta_i); \theta_i^-\right)$$

Action selection uses the online network $\theta_i$
Action evaluation uses the target network $\theta_i^-$

This eliminates the overestimation bias at minimal computational cost. The online network's Q-values are noisy but independent of the target network's Q-values used for evaluation.

Empirical result: Double DQN reduces overestimation from roughly +0.5 to near zero in Atari experiments, and the more accurate value estimates lead to better policies.

Dueling DQN

The dueling architecture splits the Q-network into two streams that are recombined:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)$$

where: - $V(s; \theta, \beta)$ is the state value function (how good it is to be in state $s$) - $A(s, a; \theta, \alpha)$ is the action advantage function (how much better action $a$ is compared to other actions)

⚠️ CRITICAL: Identifiability problem. From $Q = V + A$, we can add a constant to $V$ and subtract it from $A$ and get the same $Q$. The solution is to force the advantage to have zero mean:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\right)$$

This gives an identifiability condition: the expected advantage is always zero, so $V(s) = \mathbb{E}_{a \sim \text{uniform}}[Q(s,a)]$.

Why dueling helps: In many states, the choice of action doesn't matter much (e.g., when no enemies are near in an Atari game). The dueling architecture can learn that $V(s)$ captures most of the signal while $A(s,a)$ is nearly zero, learning this structure more efficiently than a monolithic Q-network.

Prioritized Experience Replay

Standard DQN samples uniformly from the replay buffer. But not all transitions are equally informative. Prioritized Experience Replay (Schaul et al., 2016) samples transitions with probability proportional to their TD error magnitude:

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

where $p_i = |\delta_i| + \varepsilon$ (TD error plus a small constant to allow zero-error transitions to be sampled), and $\alpha$ controls the degree of prioritization ($\alpha = 0 \implies$ uniform, $\alpha = 1 \implies$ full prioritization).

Importance sampling correction: Prioritized sampling biases the distribution. To restore unbiased gradient estimates, importance sampling weights are applied:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$$

where $\beta$ is annealed from an initial value (e.g., 0.4) to 1 over training. The update becomes:

$$\Delta \theta \propto w_i \cdot \delta_i \cdot \nabla_\theta Q(s, a; \theta)$$

Efficient implementation: Use a sum-tree data structure for $O(\log N)$ sampling and updates.

Key Terms

Action evaluation
Action selection
Double DQN
Dueling DQN
Experience Replay
Experience replay
Prioritized Experience Replay
Prioritized experience replay
TD error magnitude
Target Networks
Target network

Worked Examples

Example 1: DQN Loss Computation

The online network gives $Q(s, a_1; \theta) = [2.5, 3.1, 1.8]$ for three actions. The target network gives $Q(s', a'; \theta^-) = [4.0, 2.2, 5.5]$. The agent took action $a_2$ (value 3.1), received $r = 1.0$, and $\gamma = 0.99$. Compute the TD error and contribution to the loss.

Solution:

Target: $y = r + \gamma \max_{a'} Q(s', a'; \theta^-) = 1.0 + 0.99 \cdot 5.5 = 1.0 + 5.445 = 6.445$

Current: $Q(s, a_2; \theta) = 3.1$

TD error: $\delta = 6.445 - 3.1 = 3.345$

Loss contribution: $\frac{1}{2}\delta^2 = \frac{1}{2}(3.345)^2 = 5.59$

Click for answer

The large positive TD error (3.345) indicates action $a_2$ is much better than currently estimated. The squared loss of 5.59 will drive gradient updates to increase $Q(s, a_2; \theta)$ toward 6.445.

Example 2: Target Network Update Timing

DQN with soft updates ($\tau = 0.001$) has just completed 10,000 gradient steps since the last hard update. The online network has parameters $\theta$, and the target network has $\theta^-$. If $|\theta - \theta^-|_2 = 0.5$, estimate the effective "age" of the target network in terms of gradient steps, assuming each step changes $\theta$ by roughly $\eta \cdot |\nabla \mathcal{L}| \approx 0.001$.

Solution:

Each gradient step changes $\theta$ by approximately $0.001$ in $\ell_2$ norm. After 10,000 steps, the cumulative change is roughly $10,000 \times 0.001 = 10$. The observed distance is only 0.5, meaning soft updates have kept $\theta^-$ close to $\theta$.

With $\tau = 0.001$, each soft update moves $\theta^-$ by $\tau |\theta - \theta^-|$, and this happens at every step. After $n$ steps: $(1 - 0.001)^n \approx e^{-0.001 n}$. For the initial distance of 10 to decay to 0.5: $10 \cdot e^{-0.001 n} = 0.5 \implies e^{-0.001 n} = 0.05 \implies n \approx 3000$. So the effective time constant is about 1000 steps.

Click for answer

The target network lags behind by roughly 1000 steps (the $1/\tau$ time constant of Polyak averaging). This is the intended behavior — enough lag to stabilize learning, but not so much that the target becomes stale.

Example 3: Double DQN vs. DQN Bias

In a state, the true $Q^*(s, a)$ values are $[0, 0, 0, 0, 10]$ (one good action among 5). The online network estimates them with independent Gaussian noise $\mathcal{N}(0, 1)$. The target network has independent noise $\mathcal{N}(0, 1)$ as well. Compute $\mathbb{E}[Y^{\text{DQN}}]$ and $\mathbb{E}[Y^{\text{DoubleDQN}}]$, ignoring $R$ and $\gamma$ for simplicity.

Solution:

Standard DQN: Target network selects and evaluates: $Y^{\text{DQN}} = \max_{a'} Q(s', a'; \theta^-)$. With 5 noisy estimates around 0, $\mathbb{E}[\max] \approx \sqrt{2\log 5} \approx 1.79$ (extreme value theory), even though the true max is 0 for the first 4 actions and 10 for the good one. In practice, the max bias is dominated by the true high value (10) once the network learns it, but early in training, the noise dominates.

Double DQN: Online network selects $a^ = \arg\max_a Q(s', a; \theta)$. Target network evaluates $Q(s', a^; \theta^-)$. Since the online and target networks have independent noise, $\mathbb{E}[Q(s', a^; \theta^-)] = \mathbb{E}[Q^(s', a^*)]$ (unbiased). The selection noise of the online network does not bias the target network's evaluation.

Click for answer

Standard DQN systematically overestimates (expectation of max of noisy estimates > true max). Double DQN eliminates this because the evaluation network's noise is independent of which action was selected. Overestimation bias: $\approx$ 0.5–2.0 for DQN, $\approx$ 0 for Double DQN.

Practice Problems

Problem 1: Derive the DQN gradient update rule starting from the MSE loss. Show why the gradient does NOT flow through the target $r + \gamma \max Q(s', a'; \theta^-)$.

Answer (click to expand)

Let $\mathcal{L} = \frac{1}{2}\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2$. $$\nabla_\theta \mathcal{L} = -\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right) \cdot \nabla_\theta Q(s, a; \theta)$$ The target term $r + \gamma \max_{a'} Q(s', a'; \theta^-)$ is treated as a constant because $\theta^-$ is fixed (not a function of $\theta$). If the gradient flowed through both terms, we'd have an additional term $\propto \nabla_\theta Q(s', a'; \theta^-) = 0$ (since target is frozen). The stop-gradient is enforced by the periodic (or soft) update schedule.

Problem 2: An experience replay buffer holds 1 million transitions. During training, the agent visits 30 states per second and stores each transition. If mini-batches are sampled uniformly every 4 steps with batch size 32, what fraction of experiences are ever used for training, assuming a 10-minute training run?

Answer (click to expand)

Total transitions stored: 30 transitions/sec × 600 sec = 18,000 transitions (buffer doesn't fill up) Training steps: every 4 steps = 18,000/4 = 4,500 training iterations Total transitions sampled: 4,500 × 32 = 144,000 Each transition is sampled on average 144,000/18,000 = 8 times. So each transition is reused about 8 times — much more sample-efficient than online learning where each transition is used once and discarded.

Problem 3: In Dueling DQN, prove that with the mean-subtraction formulation $Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')$, we have $\mathbb{E}{a \sim \text{uniform}}[Q(s,a)] = V(s)$ and $\mathbb{E}{a \sim \text{uniform}}[A(s,a)] = 0$.

Answer (click to expand)

$$\mathbb{E}_a[Q(s,a)] = \frac{1}{|\mathcal{A}|}\sum_a \left(V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')\right)$$ $$= \frac{1}{|\mathcal{A}|}\sum_a V(s) + \frac{1}{|\mathcal{A}|}\sum_a A(s,a) - \frac{1}{|\mathcal{A}|}\sum_a \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')$$ $$= V(s) + \frac{1}{|\mathcal{A}|}\sum_a A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a') = V(s)$$ Similarly: $\mathbb{E}_a[A(s,a) - \bar{A}(s)] = \bar{A}(s) - \bar{A}(s) = 0$. The identifiability condition holds exactly.

Problem 4: In Prioritized Experience Replay with $\alpha = 1$ and proportional prioritization ($p_i = |\delta_i| + \varepsilon$), show that without importance sampling ($\beta = 0$), the expected update is biased toward high-TD-error transitions.

Answer (click to expand)

The expected gradient without IS correction is: $$\mathbb{E}_{i \sim P}[\nabla_\theta \mathcal{L}_i] = \sum_i \frac{p_i}{\sum_k p_k} \nabla_\theta \mathcal{L}_i \neq \frac{1}{N}\sum_i \nabla_\theta \mathcal{L}_i = \mathbb{E}_{i \sim \text{uniform}}[\nabla_\theta \mathcal{L}_i]$$ The prioritization distribution $P(i) \propto |\delta_i|$ oversamples high-TD-error transitions. Without the IS weight $w_i = (N \cdot P(i))^{-\beta}$, the gradient estimate is biased toward these transitions. With $\beta = 1$, $\mathbb{E}[w_i \cdot \nabla_\theta \mathcal{L}_i] = \sum_i \frac{p_i}{\sum_k p_k} \cdot (N \cdot \frac{p_i}{\sum_k p_k})^{-1} \cdot \nabla_\theta \mathcal{L}_i = \frac{1}{N}\sum_i \nabla_\theta \mathcal{L}_i$, restoring unbiasedness.

Problem 5: A DQN agent on an Atari game uses a CNN that processes 4 stacked frames (84×84×4 input). The action space has 18 actions. If the replay buffer stores each transition as (4×84×84 uint8 frames + action + reward), estimate the memory usage for a buffer of 1 million transitions in gigabytes.

Answer (click to expand)

Each frame: 84 × 84 = 7,056 bytes (uint8) 4 stacked frames: 4 × 7,056 = 28,224 bytes per state Two states (s and s'): 56,448 bytes Plus action (4 bytes as int32), reward (4 bytes as float32), terminal flag (1 byte): ~9 bytes Per transition: ~56,457 bytes ≈ 55.1 KB 1 million transitions: ~55.1 GB (without compression) In practice, frames are stored as uint8 (28KB per 4-frame stack), and the DQN paper used 1M transitions ≈ 28 GB (storing only s, not s', or using references). Modern implementations often use frame-level storage with pointers to save memory. This is why clever replay buffer design matters.

Summary

DQN = Q-learning + neural network function approximation + experience replay + target network
Experience replay stores transitions in a circular buffer; random sampling breaks temporal correlation and improves sample efficiency
Target network provides fixed Q-targets for $C$ steps, preventing the moving-target divergence problem
Double DQN uses the online network for action selection and target network for evaluation, eliminating maximization bias
Dueling DQN splits Q into $V(s) + A(s,a)$ with a mean-subtraction constraint for identifiability
Prioritized experience replay samples transitions proportional to TD error, with importance sampling corrections

Quiz

Question 1: Why does DQN use a target network separate from the online network?

A. To reduce memory usage B. To provide a stable, slowly-changing target for the TD update, preventing divergence C. To enable Double DQN D. It's required for backpropagation through time

Correct Answer: B