22-03 — Generative Adversarial Networks (GANs)
Phase: 22 — Generative Models Mathematics Subject: 22-03 Prerequisites: 22-01 — Autoencoders, Phase 16–17 (Neural Networks), Phase 14 (Optimization) Next subject: 22-04 — Normalizing Flows
Learning Objectives
By the end of this subject, you will be able to:
- Derive the GAN minimax objective and the optimal discriminator in closed form
- Prove the connection between the GAN objective and the Jensen-Shannon divergence
- Explain the causes of GAN training instability (vanishing gradients, mode collapse)
- Analyze the Wasserstein GAN (WGAN) objective and the Lipschitz constraint
- Compare GANs to likelihood-based generative models (VAEs, flows)
Core Content
The GAN Framework
A GAN consists of two neural networks playing a two-player minimax game:
- Generator $G_\theta: \mathbb{R}^{d_z} \to \mathbb{R}^{d_x}$: maps noise $\mathbf{z} \sim p_z$ to synthetic data $G(\mathbf{z})$. The induced distribution is $p_g$.
- Discriminator $D_\omega: \mathbb{R}^{d_x} \to [0, 1]$: estimates the probability that a sample comes from $p_{\text{data}}$ rather than $p_g$.
The minimax objective:
$$\min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\mathbf{z})))]$$
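To make the objective concrete, here is a minimal sketch that estimates $V(D, G)$ from samples in a 1D toy setting. The functions `generator`, `discriminator`, and `value_estimate`, and all parameter values, are hypothetical choices for illustration only; a real GAN would use trained neural networks.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def generator(z, shift=0.0):
    # Hypothetical toy generator: shifts the noise by a constant.
    return z + shift

def discriminator(x, w=2.0, b=-2.0):
    # Hypothetical logistic discriminator D(x) = sigmoid(w*x + b), output in (0, 1).
    return sigmoid(w * x + b)

def value_estimate(x_real, z, shift=0.0):
    # Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].
    d_real = discriminator(x_real)
    d_fake = discriminator(generator(z, shift))
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

rng = np.random.default_rng(0)
x_real = rng.normal(2.0, 1.0, size=10_000)  # samples from p_data = N(2, 1)
z = rng.normal(0.0, 1.0, size=10_000)       # samples from p_z = N(0, 1)

v_far = value_estimate(x_real, z, shift=0.0)    # p_g = N(0, 1): easy to tell apart
v_match = value_estimate(x_real, z, shift=2.0)  # p_g = N(2, 1): same as p_data
print(v_far, v_match)
```

For this fixed discriminator, moving $p_g$ onto $p_{\text{data}}$ lowers the estimated value, which is the direction the generator pushes in the minimax game.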
⚠️ CRITICAL — Optimal Discriminator
For a fixed generator $G$, what is the optimal discriminator $D^*$? We maximize:
$$V(D, G) = \int_{\mathbf{x}} \left[ p_{\text{data}}(\mathbf{x}) \log D(\mathbf{x}) + p_g(\mathbf{x}) \log(1 - D(\mathbf{x})) \right] d\mathbf{x}$$
For each $\mathbf{x}$, the integrand is $f(D) = a \log D + b \log(1-D)$ with $a = p_{\text{data}}(\mathbf{x})$, $b = p_g(\mathbf{x})$. Since $f$ is concave in $D$ on $(0, 1)$, setting $f'(D) = 0$ gives the maximizer:
$$\frac{a}{D} - \frac{b}{1-D} = 0 \implies D^* = \frac{a}{a+b}$$
Therefore:
$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$$
At optimality, $D^*(\mathbf{x}) = 0.5$ when $p_{\text{data}}(\mathbf{x}) = p_g(\mathbf{x})$ — the discriminator cannot distinguish real from fake.
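The closed form can be sanity-checked numerically. The sketch below maximizes $f(D) = a \log D + b \log(1-D)$ on a fine grid for one illustrative pair of density values (the numbers `a = 0.3`, `b = 0.1` are arbitrary assumptions) and compares the result with $a/(a+b)$.

```python
import numpy as np

def pointwise_objective(d, a, b):
    # f(D) = a*log(D) + b*log(1 - D), with a = p_data(x) and b = p_g(x).
    return a * np.log(d) + b * np.log(1.0 - d)

a, b = 0.3, 0.1  # illustrative density values at some fixed x
grid = np.linspace(1e-4, 1.0 - 1e-4, 100_001)

d_numeric = grid[np.argmax(pointwise_objective(grid, a, b))]
d_closed_form = a / (a + b)
print(d_numeric, d_closed_form)
```

The grid maximizer agrees with the closed form up to the grid resolution.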
The Jensen-Shannon Divergence Connection
Substituting $D^*$ into $V(D, G)$:
$$V(D^*, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}}{p_{\text{data}} + p_g}\right] + \mathbb{E}_{\mathbf{x} \sim p_g}\left[\log \frac{p_g}{p_{\text{data}} + p_g}\right]$$
$$= -\log 4 + 2 \cdot \text{JSD}(p_{\text{data}} \| p_g)$$
where the Jensen-Shannon divergence is:
$$\text{JSD}(P \| Q) = \frac{1}{2}D_{\text{KL}}\left(P \;\middle\|\; \frac{P+Q}{2}\right) + \frac{1}{2}D_{\text{KL}}\left(Q \;\middle\|\; \frac{P+Q}{2}\right)$$
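The identity $V(D^*, G) = -\log 4 + 2 \cdot \text{JSD}$ can be checked on a small discrete example. This is a sketch under the assumption of two hand-picked three-outcome distributions; the helper names `kl` and `jsd` are introduced here for illustration.

```python
import numpy as np

def kl(p, q):
    # KL divergence for discrete distributions with strictly positive entries.
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    # Jensen-Shannon divergence via the mixture m = (p + q) / 2.
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.5, 0.3, 0.2])  # illustrative data distribution
p_g = np.array([0.2, 0.2, 0.6])     # illustrative generator distribution

# Plug the optimal discriminator into the value function.
d_star = p_data / (p_data + p_g)
v_star = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

rhs = -np.log(4.0) + 2.0 * jsd(p_data, p_g)
print(v_star, rhs)
```

Both numbers coincide, and replacing `p_g` with `p_data` drives the JSD term to zero, leaving $-\log 4$.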
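⚠️ CRITICAL — The Saturation Problem: When $D$ is near-perfect, $D(G(\mathbf{z})) \approx 0$, so $\log(1 - D(G(\mathbf{z}))) \approx 0$. The generator's gradient is $\nabla_\theta \log(1-D(G(\mathbf{z}))) = \frac{-1}{1-D(G(\mathbf{z}))}\nabla_\theta D(G(\mathbf{z})) \approx -\nabla_\theta D(G(\mathbf{z}))$: the prefactor is close to 1, so nothing amplifies $\nabla_\theta D(G(\mathbf{z}))$, which is itself tiny because a confident sigmoid output is flat. Writing $D = \sigma(a)$ with logit $a$ makes this explicit: $\nabla_\theta \log(1 - D) = -D \, \nabla_\theta a \to 0$ as $D \to 0$. The generator therefore receives vanishing gradients. In practice, the generator is trained to maximize $\log D(G(\mathbf{z}))$ instead, whose gradient $(1 - D) \, \nabla_\theta a$ stays large when $D \approx 0$, providing stronger gradients early in training.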
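The difference between the two generator losses is easiest to see at the level of the discriminator logit. The sketch below assumes a sigmoid-output discriminator, $D = \sigma(a)$, and compares the derivative of each loss with respect to $a$; the function names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_minimax(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a): vanishes when D is near 0.
    return -sigmoid(a)

def grad_non_saturating(a):
    # d/da [-log sigmoid(a)] = -(1 - sigmoid(a)): close to -1 when D is near 0.
    return -(1.0 - sigmoid(a))

# Logits at which the discriminator is increasingly unsure about a fake sample.
for a in np.array([-8.0, -4.0, 0.0]):
    print(f"D={sigmoid(a):.4f}  "
          f"minimax={grad_minimax(a):+.4f}  "
          f"non-saturating={grad_non_saturating(a):+.4f}")
```

When the discriminator confidently rejects a sample (very negative logit), the minimax gradient is nearly zero while the non-saturating gradient has magnitude close to 1.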
Training Instability and Mode Collapse
GANs are notoriously hard to train. Key issues:
- Non-convergence: The minimax game is non-convex in both players' parameters. Gradient descent on a non-convex game may cycle or diverge rather than finding a Nash equilibrium.
- Mode collapse: The generator learns to produce only a few modes of the data distribution, e.g., generating the same face for every $\mathbf{z}$. The discriminator gets stuck in a local optimum, and the generator exploits it.
- Vanishing gradients: When the discriminator is too good, the generator's gradient vanishes (as shown above).
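Non-convergence already shows up in the simplest possible game. The sketch below runs simultaneous gradient descent-ascent on the toy objective $V(x, y) = xy$ (an assumption chosen for illustration, not a GAN), whose unique equilibrium is $(0, 0)$. Each step multiplies the distance to the equilibrium by $\sqrt{1 + \eta^2}$, so the iterates spiral outward.

```python
import numpy as np

def simultaneous_gda(x, y, lr=0.1, steps=100):
    # Toy game V(x, y) = x * y: x descends on V, y ascends on V, updated at once.
    radii = [np.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x  # dV/dx = y, dV/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y
        radii.append(np.hypot(x, y))
    return np.array(radii)

radii = simultaneous_gda(1.0, 1.0)
print(radii[0], radii[-1])  # distance from the equilibrium before and after
```

The distance to the equilibrium grows at every step, even though each player follows its own gradient exactly.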
Wasserstein GAN (WGAN)
The WGAN replaces the JS divergence with the Earth Mover's (Wasserstein-1) distance:
$$W(p_{\text{data}}, p_g) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_g)} \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\gamma}[\|\mathbf{x} - \mathbf{y}\|]$$
where $\Pi(p_{\text{data}}, p_g)$ is the set of all joint distributions (couplings) with marginals $p_{\text{data}}$ and $p_g$. Intuitively, it is the minimum total cost (mass times distance moved) needed to transform $p_g$ into $p_{\text{data}}$.
Key advantage: under mild regularity assumptions on the generator, $W(p_{\text{data}}, p_g)$ is continuous in the generator parameters and differentiable almost everywhere, even when the distributions have disjoint support. The JS divergence, by contrast, saturates at $\log 2$ in that case.
By the Kantorovich-Rubinstein duality:
$$W(p_{\text{data}}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[f(\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim p_g}[f(\mathbf{x})]$$
where $\|f\|_L \leq 1$ means $f$ is 1-Lipschitz: $|f(\mathbf{x}) - f(\mathbf{y})| \leq \|\mathbf{x} - \mathbf{y}\|$.
The critic (not discriminator — it outputs real numbers, not probabilities) is trained to maximize the difference in expectations, and the generator minimizes it:
$$\min_G \max_{\|D\|_L \leq 1} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[D(\mathbf{x})] - \mathbb{E}_{\mathbf{z} \sim p_z}[D(G(\mathbf{z}))]$$
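The relationship between the primal and dual forms can be illustrated in 1D, where the optimal coupling between two equal-size samples simply pairs sorted values. The sketch below is a toy check under that assumption; `w1_empirical` and `critic_gap` are hypothetical helper names, and the two candidate critics are hand-picked 1-Lipschitz functions rather than trained networks.

```python
import numpy as np

def w1_empirical(x, y):
    # Primal form: for equal-size 1D samples the optimal coupling pairs sorted values.
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def critic_gap(f, x, y):
    # Dual objective E[f(x)] - E[f(y)] for a candidate critic f.
    return np.mean(f(x)) - np.mean(f(y))

rng = np.random.default_rng(0)
x_real = rng.normal(3.0, 1.0, size=5_000)  # stand-in for p_data
x_fake = rng.normal(0.0, 1.0, size=5_000)  # stand-in for p_g

primal = w1_empirical(x_real, x_fake)
dual_identity = critic_gap(lambda t: t, x_real, x_fake)  # f(x) = x is 1-Lipschitz
dual_tanh = critic_gap(np.tanh, x_real, x_fake)          # tanh is also 1-Lipschitz
print(primal, dual_identity, dual_tanh)
```

Every 1-Lipschitz critic gives a lower bound on the primal value. For two distributions that differ by a shift, the identity critic nearly attains it, while `tanh` is a valid but looser critic.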
Enforcing the Lipschitz Constraint
- Weight clipping (original WGAN): Clip all weights to $[-c, c]$. Simple but limits capacity and can cause optimization difficulties.
- Gradient penalty (WGAN-GP): Add a penalty on the gradient norm:
$$\mathcal{L}_{\text{GP}} = \lambda \cdot \mathbb{E}_{\hat{\mathbf{x}} \sim p_{\hat{\mathbf{x}}}}\left[(\|\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})\|_2 - 1)^2\right]$$
where $\hat{\mathbf{x}}$ is sampled uniformly along lines between real and generated samples: $\hat{\mathbf{x}} = \epsilon \mathbf{x}_{\text{real}} + (1-\epsilon)\mathbf{x}_{\text{fake}}$, $\epsilon \sim U[0,1]$. This encourages $\|\nabla D\| \approx 1$ at the sampled points, which enforces the Lipschitz constraint softly rather than guaranteeing it everywhere.
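The gradient penalty can be sketched without a deep learning framework by using a tiny critic whose input gradient is available analytically. Everything below is an illustrative assumption: the critic $D(\mathbf{x}) = \mathbf{v}^\top \tanh(\mathbf{W}\mathbf{x} + \mathbf{b})$, its random parameters, and the choice `lam=10.0`. In practice the input gradient would come from automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 2, 8
W = rng.normal(size=(d_h, d_x))  # hypothetical critic parameters
b = rng.normal(size=d_h)
v = rng.normal(size=d_h)

def critic(x):
    # Toy critic D(x) = v . tanh(W x + b), applied row-wise to a batch.
    return np.tanh(x @ W.T + b) @ v

def critic_input_grad(x):
    # Analytic gradient of D with respect to each input row.
    h = np.tanh(x @ W.T + b)
    return ((1.0 - h ** 2) * v) @ W

def gradient_penalty(x_real, x_fake, lam=10.0):
    eps = rng.uniform(size=(x_real.shape[0], 1))  # one epsilon per real/fake pair
    x_hat = eps * x_real + (1.0 - eps) * x_fake   # points on the connecting lines
    grad_norm = np.linalg.norm(critic_input_grad(x_hat), axis=1)
    return lam * np.mean((grad_norm - 1.0) ** 2)

x_real = rng.normal(2.0, 1.0, size=(64, d_x))
x_fake = rng.normal(0.0, 1.0, size=(64, d_x))
gp = gradient_penalty(x_real, x_fake)
print(gp)
```

The penalty is added to the critic loss, so the critic is pushed toward unit gradient norm only at the sampled interpolates.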
Pitfalls
| Pitfall | Why It Happens | Fix |
|---|---|---|
| Mode collapse | Generator finds a "trick" that fools discriminator for all $\mathbf{z}$ | WGAN, minibatch discrimination, unrolled GANs |
| Non-convergence | Simultaneous gradient updates are not guaranteed to converge in a two-player game | Two time-scale update rules (TTUR), spectral normalization |
| Discriminator overfitting | Discriminator memorizes training set | Data augmentation (DiffAug, ADA), more data |
| Checkerboard artifacts | Transposed convolutions overlap unevenly when the kernel size is not divisible by the stride | Use resize + convolution instead of transposed conv |
| Training oscillation | Generator and discriminator are out of balance | Update discriminator more often (e.g., 5:1 ratio) |
Key Terms
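- Discriminator: the network $D_\omega$ that estimates the probability that a sample comes from $p_{\text{data}}$ rather than $p_g$
- Generator: the network $G_\theta$ that maps noise $\mathbf{z} \sim p_z$ to synthetic data, inducing the distribution $p_g$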
Worked Examples
Example 1: Optimal Discriminator for Simple Distributions
Let $p_{\text{data}} = \mathcal{N}(2, 1)$ and $p_g = \mathcal{N}(0, 1)$. Find the optimal discriminator $D^*(x)$.
Solution:
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} = \frac{\frac{1}{\sqrt{2\pi}}e^{-(x-2)^2/2}}{\frac{1}{\sqrt{2\pi}}e^{-(x-2)^2/2} + \frac{1}{\sqrt{2\pi}}e^{-x^2/2}}$$
$$= \frac{e^{-(x-2)^2/2}}{e^{-(x-2)^2/2} + e^{-x^2/2}} = \frac{1}{1 + e^{(x-2)^2/2 - x^2/2}} = \frac{1}{1 + e^{2 - 2x}} = \sigma(2x - 2)$$
where $\sigma$ is the sigmoid function. At $x=1$ (where both densities are equal), $D^*(1) = 0.5$.
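As a numerical cross-check of this example, the sketch below evaluates the density ratio directly and compares it with a logistic curve of slope 2 centred at $x = 1$. The helper names are illustrative.

```python
import numpy as np

def normal_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def d_star(x):
    # Optimal discriminator for p_data = N(2, 1) and p_g = N(0, 1).
    p_data = normal_pdf(x, 2.0)
    p_g = normal_pdf(x, 0.0)
    return p_data / (p_data + p_g)

def logistic(x):
    # Logistic curve with slope 2 centred at x = 1, i.e. sigmoid(2x - 2).
    return 1.0 / (1.0 + np.exp(-(2.0 * x - 2.0)))

xs = np.linspace(-3.0, 5.0, 81)
max_gap = np.max(np.abs(d_star(xs) - logistic(xs)))
print(d_star(np.array([0.0, 1.0, 2.0])), max_gap)
```

The ratio and the logistic curve agree to floating-point precision, and $D^*$ crosses 0.5 exactly at $x = 1$, rising toward 1 on the side where $p_{\text{data}}$ dominates.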
Answer: $D^*(x) = \sigma(2x - 2)$. For $x < 1$: $p_g$ dominates → $D^* < 0.5$. For $x > 1$: $p_{\text{data}}$ dominates → $D^* > 0.5$. The discriminator is a simple logistic classifier.
Example 2: GAN Game Value at Convergence
If the generator perfectly matches the data distribution ($p_g = p_{\text{data}}$), compute $V(D^*, G)$.
Solution:
When $p_g = p_{\text{data}}$, $D^*(\mathbf{x}) = \frac{p_{\text{data}}}{p_{\text{data}} + p_{\text{data}}} = \frac{1}{2}$ for all $\mathbf{x}$.
$$V(D^*, G) = \mathbb{E}_{p_{\text{data}}}[\log \tfrac{1}{2}] + \mathbb{E}_{p_g}[\log \tfrac{1}{2}] = \log \tfrac{1}{2} + \log \tfrac{1}{2} = -2\log 2 = \log \tfrac{1}{4}$$
So $V(D^*, G) = -\log 4 \approx -1.386$. This is the global minimum of $C(G) = \max_D V(D,G)$.
Answer: $V(D^*, G) = -\log 4 \approx -1.386$. Any deviation from $p_g = p_{\text{data}}$ gives a larger value, confirming that the minimax game finds the data distribution at equilibrium.
Example 3: Wasserstein Distance for Gaussians
Compute $W(\mathcal{N}(\mu_1, \sigma_1^2), \mathcal{N}(\mu_2, \sigma_2^2))$ for two 1D Gaussians.
Solution:
The Wasserstein-2 distance between Gaussians has a closed form:
$$W_2^2 = (\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2$$
For the Wasserstein-1 distance (used in WGAN): no simple closed form in general, but for 1D:
$$W_1(p, q) = \int_0^1 |F_p^{-1}(t) - F_q^{-1}(t)| \, dt$$
where $F^{-1}$ is the inverse CDF (quantile function). For Gaussians: $F^{-1}(t) = \mu + \sigma \Phi^{-1}(t)$, where $\Phi$ is the standard normal CDF.
Since $\Phi^{-1}(t)$ is the same for both, for $\sigma_1 = \sigma_2 = 1$:
$$W_1 = \int_0^1 |\mu_1 + \Phi^{-1}(t) - \mu_2 - \Phi^{-1}(t)| \, dt = |\mu_1 - \mu_2|$$
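This illustrates a key property: $W_1$ grows linearly with the separation $|\mu_1 - \mu_2|$, so it keeps providing meaningful gradients however far apart the distributions are. The JS divergence, by contrast, saturates at $\log 2$ once the distributions barely overlap.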
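The contrast between the two distances can be tabulated numerically. The sketch below approximates $W_1$ through the quantile integral and the JS divergence through a Riemann sum for unit-variance Gaussians at increasing separation. The function names, grid sizes, and integration range are assumptions made for this illustration.

```python
import numpy as np
from statistics import NormalDist

def w1_gaussians(mu1, mu2, sigma1=1.0, sigma2=1.0, n=2_000):
    # Midpoint-rule approximation of the integral of |F_p^{-1}(t) - F_q^{-1}(t)| over (0, 1).
    t = (np.arange(n) + 0.5) / n
    q = np.array([NormalDist().inv_cdf(ti) for ti in t])  # standard normal quantiles
    return np.mean(np.abs((mu1 + sigma1 * q) - (mu2 + sigma2 * q)))

def jsd_unit_gaussians(mu1, mu2, lo=-12.0, hi=20.0, n=64_001):
    # Riemann-sum approximation of JSD between N(mu1, 1) and N(mu2, 1).
    # The range is chosen for means in [0, 8] so the densities never underflow to zero.
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    p = np.exp(-0.5 * (x - mu1) ** 2) / np.sqrt(2.0 * np.pi)
    q = np.exp(-0.5 * (x - mu2) ** 2) / np.sqrt(2.0 * np.pi)
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) * dx + 0.5 * np.sum(q * np.log(q / m)) * dx

for gap in [1.0, 2.0, 4.0, 8.0]:
    print(gap, w1_gaussians(gap, 0.0), jsd_unit_gaussians(gap, 0.0))
```

$W_1$ tracks the separation exactly, while the JS column flattens out just below $\log 2 \approx 0.693$ as the gap grows.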
Answer: For $\mathcal{N}(\mu_1, 1)$ and $\mathcal{N}(\mu_2, 1)$: $W_1 = |\mu_1 - \mu_2|$. The Wasserstein distance scales linearly with separation, providing meaningful gradients for the generator even when the distributions are far apart. This is one reason WGANs tend to be more stable to train than standard GANs.
Practice Problems
- Prove that $\text{JSD}(P \| Q) \in [0, \log 2]$ and find when each bound is achieved.
  Answer: Lower bound 0: both KL terms are nonnegative and vanish only when $P=Q$ (distributions identical). Upper bound $\log 2$: with $M = (P+Q)/2$ we have $P \leq 2M$ pointwise, so each KL term is at most $\log 2$. When $P$ and $Q$ have disjoint support, $D_{\text{KL}}(P \| (P+Q)/2) = \log 2$ (similarly for $Q$), so JSD = $\log 2$. Because JSD stays at $\log 2$ for every pair of disjoint distributions, it gives the generator no gradient in that regime, which is why standard GANs suffer from vanishing gradients.
- Derive the gradient of the generator loss $\mathcal{L}_G = -\mathbb{E}_{\mathbf{z}}[\log D(G(\mathbf{z}))]$ (the non-saturating loss) and compare it to the minimax loss.
  Answer: $\nabla_\theta \mathcal{L}_G = -\mathbb{E}_{\mathbf{z}}\left[\frac{1}{D(G(\mathbf{z}))} \nabla_\theta D(G(\mathbf{z}))\right]$. Non-saturating loss: the gradient is scaled by $1/D(G)$. When $D(G)$ is small (generator is bad), this factor is large → strong learning signal early on. Minimax loss $\mathbb{E}[\log(1-D(G))]$: the gradient is scaled by $1/(1-D(G))$. When $D(G)$ is small, this factor is close to 1, so the already small $\nabla_\theta D(G)$ of a confident discriminator is not amplified → weak learning signal. This is the saturation problem.
- For WGAN-GP, the gradient penalty term is $\mathbb{E}[(\|\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})\|_2 - 1)^2]$. Explain why the target gradient norm is 1, not 0.
  Answer: If the target were 0 ($\|\nabla D\| = 0$ everywhere), the critic would be constant and therefore useless for comparing distributions. The Lipschitz constraint requires $\|\nabla D\| \leq 1$, and the optimal critic for the Kantorovich-Rubinstein dual achieves $\|\nabla D\| = 1$ almost everywhere on the optimal transport paths. Penalizing deviation from 1 encourages the critic to be maximally discriminating (gradient = 1) while satisfying the Lipschitz constraint.
- A GAN generator maps $\mathbf{z} \in \mathbb{R}^2 \sim \mathcal{N}(0,I)$ through a linear layer: $G(\mathbf{z}) = \mathbf{W}\mathbf{z}$. If $\mathbf{W} = \begin{pmatrix} 2 & 0 \\ 0 & 0.1 \end{pmatrix}$, what distribution does $p_g$ follow? Why might this cause mode collapse?
  Answer: $G(\mathbf{z}) \sim \mathcal{N}(\mathbf{0}, \mathbf{W}\mathbf{W}^T) = \mathcal{N}\left(\mathbf{0}, \begin{pmatrix} 4 & 0 \\ 0 & 0.01 \end{pmatrix}\right)$. The second dimension has variance 0.01, so the generator produces almost no variation in that direction. This is a form of mode collapse: the generator ignores one dimension of variation, effectively collapsing to a 1D manifold in a 2D space.
- Explain why minibatch discrimination helps prevent mode collapse.
  Answer: Standard discriminators process each sample independently, so they can't detect if the generator produces the same output for different $\mathbf{z}$. Minibatch discrimination gives the discriminator access to inter-sample statistics (e.g., pairwise distances) within a batch. If the generator collapses to one mode, all samples are similar, and the discriminator can detect the lack of diversity. This forces the generator to diversify.
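The idea behind the last answer can be illustrated with a single hand-crafted batch statistic. The sketch below is a simplification: it computes the mean pairwise distance within a batch, which is an assumption chosen for illustration and not the learned feature map used in actual minibatch discrimination layers.

```python
import numpy as np

def mean_pairwise_distance(batch):
    # Average Euclidean distance over all distinct ordered pairs in the batch.
    diffs = batch[:, None, :] - batch[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    n = batch.shape[0]
    return dists.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
diverse = rng.normal(0.0, 1.0, size=(32, 2))  # spread-out samples
# Near-identical samples, mimicking a generator collapsed onto one mode.
collapsed = np.array([1.0, -1.0]) + 0.01 * rng.normal(size=(32, 2))

print(mean_pairwise_distance(diverse), mean_pairwise_distance(collapsed))
```

A per-sample discriminator sees each collapsed sample as plausible on its own, whereas a batch-level statistic like this one exposes the missing diversity immediately.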
Summary
Key takeaways:
- GANs use a minimax game: $\min_G \max_D \mathbb{E}[\log D] + \mathbb{E}[\log(1-D(G))]$
- The optimal discriminator is $D^*(\mathbf{x}) = p_{\text{data}}(\mathbf{x})/(p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x}))$
- At optimality, the GAN objective equals $-\log 4 + 2\cdot\text{JSD}(p_{\text{data}} \| p_g)$
- GANs suffer from training instability: vanishing gradients, mode collapse, non-convergence
- WGAN replaces JS with the Wasserstein distance, providing meaningful gradients everywhere
- WGAN-GP enforces the Lipschitz constraint via a gradient penalty rather than weight clipping
Quiz
- The optimal discriminator for a given generator is:
  - A) $D^*(\mathbf{x}) = p_{\text{data}}(\mathbf{x})$
  - B) $D^*(\mathbf{x}) = p_g(\mathbf{x})/(p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x}))$
  - C) $D^*(\mathbf{x}) = p_{\text{data}}(\mathbf{x})/(p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x}))$
  - D) $D^*(\mathbf{x}) = 0.5$ always
  Correct: C)
  - If you chose C: Derived by maximizing $p_{\text{data}}\log D + p_g\log(1-D)$ pointwise.
  - If you chose B: Inverted; that gives the probability of being fake.
  - If you chose D: Only true when $p_{\text{data}} = p_g$.
- Mode collapse in GANs refers to:
  - A) The discriminator becoming too powerful
  - B) The generator producing only a limited variety of outputs
  - C) The training loss exploding
  - D) The latent space collapsing to zero
  Correct: B)
  - If you chose B: The generator maps many different $\mathbf{z}$ values to the same or very similar outputs, failing to capture the full data distribution.
  - If you chose A: That's a separate issue (vanishing gradients).
  - If you chose C: That's numerical instability, not mode collapse.
  - If you chose D: The latent space dimensionality is fixed.
- The gradient penalty in WGAN-GP targets gradient norm:
  - A) 0
  - B) 1
  - C) Any value < 1
  - D) The Lipschitz constant of the data distribution
  Correct: B)
  - If you chose B: The 1-Lipschitz constraint requires $\|\nabla D\| \leq 1$, and the optimal critic saturates this at 1. Penalizing $(\|\nabla D\| - 1)^2$ encourages the critic to be maximally discriminating.
  - If you chose A: A constant function can't discriminate.
  - If you chose C: That wouldn't push the critic to be maximally useful.
- Why does the standard GAN generator use the "non-saturating" loss $-\log D(G(\mathbf{z}))$ instead of $\log(1-D(G(\mathbf{z})))$?
  - A) It's computationally faster
  - B) It provides stronger gradients when the generator is performing poorly
  - C) It makes the loss always positive
  - D) It avoids the need for the discriminator
  Correct: B)
  - If you chose B: When $D(G)$ is near 0, $\log(1-D(G))$ has gradient near 0 (saturation), but the gradient of $-\log D(G)$ is scaled by $1/D(G)$, which is large, giving the generator a strong learning signal when it needs it most.
  - If you chose A: Computation is nearly identical.
  - If you chose D: Both losses require a discriminator.
- The JS divergence between two distributions with disjoint support is:
  - A) 0
  - B) $\infty$
  - C) $\log 2$
  - D) Undefined
  Correct: C)
  - If you chose C: For disjoint $P$ and $Q$, $D_{\text{KL}}(P\|(P+Q)/2) = \log 2$, so JSD = $\log 2$. Because JSD stays constant at $\log 2$ whenever the supports are disjoint, it gives the generator no gradient in that regime.
  - If you chose B: KL divergence can be infinite for disjoint support, but JS is always bounded in $[0, \log 2]$.
  - If you chose A: Only when distributions are identical.
Next Steps
22-04 — Normalizing Flows — a different approach to generative modeling through invertible transformations, offering exact likelihood computation and efficient sampling.
Pitfalls
- Training with the minimax loss instead of the non-saturating loss from the start: Using $\log(1 - D(G(\mathbf{z})))$ for the generator gives vanishing gradients when the discriminator is confident ($D(G) \approx 0$). The non-saturating loss $-\log D(G(\mathbf{z}))$ provides strong gradients early in training. Always start with the non-saturating loss for the generator; the minimax loss is primarily a theoretical construct.
- Updating generator and discriminator at equal rates: A common failure mode is the discriminator becoming too strong, leaving the generator with no useful gradient signal. In practice, treat the update ratio as a hyperparameter: update the discriminator $k$ times per generator update ($k \in [1, 5]$), and use two time-scale update rules (TTUR) with different learning rates for generator and discriminator to maintain balance.
- Relying solely on JS divergence without understanding saturation: The JS divergence is bounded at $\log 2$ for disjoint distributions. When $p_g$ and $p_{\text{data}}$ have little overlap (common early in training), JS saturates and the gradient vanishes. This is a property of the divergence itself, not an implementation bug, so you need a different objective or regularization (WGAN, spectral normalization) or careful balancing to maintain useful gradients.
- Ignoring the Lipschitz constraint approximation in WGAN-GP: The gradient penalty $\mathbb{E}[(\|\nabla_{\hat{\mathbf{x}}}D(\hat{\mathbf{x}})\|_2 - 1)^2]$ is computed on random interpolations between real and fake samples, which is an approximation. The true Lipschitz constraint must hold everywhere, not just on the interpolations. Using too few interpolation points or sampling from a poor interpolation distribution can lead to critics that violate the constraint off the interpolation path, causing instability.
Q7: The Jensen-Shannon divergence between $p_{\text{data}}$ and $p_g$ appears in the GAN objective via $V(D^*, G) = -\log 4 + 2 \cdot \text{JSD}(p_{\text{data}} \| p_g)$. If $\text{JSD} = \log 2$, what does this imply about $p_g$?
- A) $p_g = p_{\text{data}}$ exactly.
- B) $p_g$ and $p_{\text{data}}$ have completely disjoint support.
- C) $p_g$ is a Gaussian approximation of $p_{\text{data}}$.
- D) The GAN has converged to a local minimum.