22-10: Evaluation of Generative Models

- Phase: 22, Generative Models Mathematics
- Subject: 22-10
- Prerequisites: 22-01 through 22-09 (all generative model subjects), Phase 13 (Probability: expectations, divergences)
- Next subject: 23-01, Markov Decision Processes (MDPs)


Learning Objectives

By the end of this subject, you will be able to:

  1. Define and compute the Inception Score (IS) and explain its relationship to conditional label entropy
  2. Define and compute the Fréchet Inception Distance (FID) and explain its Gaussian approximation assumption
  3. Interpret log-likelihood and bits per dimension (BPD) as evaluation metrics for generative models
  4. Define precision and recall for generative models via manifold coverage
  5. Critically evaluate which metric to use for a given generative modeling task, recognizing each metric's limitations

Core Content

Why Evaluation Is Hard

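Generative models are fundamentally harder to evaluate than discriminative models: a classifier's prediction can be checked against a ground-truth label, but a generated sample has no single correct answer to compare against.

We need metrics that measure at least two axes:

  1. Fidelity (quality): do samples look realistic?
  2. Diversity (coverage): does the model cover all modes of the data distribution?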


Inception Score (IS)

Proposed by Salimans et al. (2016), the Inception Score uses a pretrained ImageNet classifier (Inception v3) to evaluate generated images.

Definition

Given generated samples $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, run each through the Inception classifier to get conditional label distributions $p(y \mid \mathbf{x})$. Also compute the marginal distribution $p(y) = \frac{1}{N}\sum_{i=1}^N p(y \mid \mathbf{x}_i)$.

$$\text{IS} = \exp\left(\mathbb{E}_{\mathbf{x}}\left[\text{KL}(p(y \mid \mathbf{x}) \,\|\, p(y))\right]\right)$$

Expanding:

$$\text{IS} = \exp\left(\frac{1}{N}\sum_{i=1}^{N} \sum_{y=1}^{K} p(y \mid \mathbf{x}_i) \log\frac{p(y \mid \mathbf{x}_i)}{p(y)}\right)$$
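The expanded formula translates almost line for line into NumPy. The following is a minimal sketch, assuming the classifier outputs are already available as an array of probabilities; the function name `inception_score` and the `eps` clipping constant are illustrative choices, not part of the original definition.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from an (N, K) array whose row i is p(y | x_i).

    eps is an assumed numerical guard so that log(0) never occurs;
    terms with p(y | x_i) = 0 still contribute exactly 0 to the KL sum.
    """
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0)  # p(y) = (1/N) sum_i p(y | x_i)
    log_ratio = np.log(np.clip(probs, eps, None)) - np.log(np.clip(marginal, eps, None))
    kl_per_sample = np.sum(probs * log_ratio, axis=1)  # KL(p(y|x_i) || p(y))
    return float(np.exp(kl_per_sample.mean()))

# Toy classifier outputs for three samples over three classes.
probs = np.array([[0.9, 0.1, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.1, 0.9]])
score = inception_score(probs)

# Two boundary cases: identical conditionals and perfectly peaked conditionals.
score_collapsed = inception_score(np.tile([0.2, 0.5, 0.3], (4, 1)))
score_ideal = inception_score(np.eye(3))
```

In practice the rows of `probs` would come from a pretrained Inception v3 forward pass, which this sketch does not include.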

⚠️ CRITICAL: Interpreting IS
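One way to make the interpretation concrete is to split the averaged KL term into two entropies. This is a standard information-theoretic identity; the symbols $H(y)$ and $H(y \mid \mathbf{x})$ are notation introduced here for the sketch. Using $p(y) = \frac{1}{N}\sum_i p(y \mid \mathbf{x}_i)$:

$$\log \text{IS} = \underbrace{-\sum_{y=1}^{K} p(y)\log p(y)}_{H(y)} \;-\; \underbrace{\left(-\frac{1}{N}\sum_{i=1}^{N}\sum_{y=1}^{K} p(y \mid \mathbf{x}_i)\log p(y \mid \mathbf{x}_i)\right)}_{H(y \mid \mathbf{x})}$$

IS is therefore large when the marginal entropy $H(y)$ is high (classes are used evenly) and the conditional entropy $H(y \mid \mathbf{x})$ is low (each sample is classified confidently). Since $0 \le H(y \mid \mathbf{x}) \le H(y) \le \log K$, the score satisfies $1 \le \text{IS} \le K$.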

Limitations of IS

  1. Classifier-dependent: only measures what the Inception model can detect; may miss artifacts invisible to Inception
  2. No mode-dropping detection: a model generating one perfect sample per class achieves maximum IS, missing intra-class diversity entirely
  3. Not applicable beyond natural images: requires a pretrained classifier on the target domain
  4. Sensitive to sample count: IS requires many samples ($\sim$50k) for stable estimates
  5. No comparison to real data: IS only evaluates generated samples in isolation

⚠️ CRITICAL: Fréchet Inception Distance (FID)

Heusel et al. (2017) proposed FID to address IS's inability to compare against real data. FID measures the distance between the distributions of real and generated images in a feature space.

Definition

Extract features from a penultimate Inception v3 layer (typically pool3, 2048-d) for both real images $\{\mathbf{a}_i\}$ and generated images $\{\mathbf{b}_j\}$. Assume both follow multivariate Gaussian distributions:

$$\mathbf{a} \sim \mathcal{N}(\boldsymbol{\mu}_r, \Sigma_r), \quad \mathbf{b} \sim \mathcal{N}(\boldsymbol{\mu}_g, \Sigma_g)$$

The Fréchet distance (Wasserstein-2 distance between Gaussians) is:

$$\text{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|_2^2 + \text{tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $(\Sigma_r \Sigma_g)^{1/2}$ is the matrix square root of $\Sigma_r \Sigma_g$.

Lower FID = better (real and generated distributions are closer).

Components of FID

If both distributions have the same mean ($\boldsymbol{\mu}_r = \boldsymbol{\mu}_g$) and identical covariances ($\Sigma_r = \Sigma_g$):

$$\text{FID} = 0 + \text{tr}(\Sigma + \Sigma - 2\Sigma) = 0$$

Computing the Matrix Square Root

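$(\Sigma_r \Sigma_g)^{1/2}$ is computed via eigendecomposition. If $\Sigma_r$ and $\Sigma_g$ commute (they share eigenvectors), then $(\Sigma_r \Sigma_g)^{1/2} = \Sigma_r^{1/2}\Sigma_g^{1/2}$. In general the product $\Sigma_r \Sigma_g$ is not symmetric, so a general-purpose matrix square root (for example a Schur-based routine or the Newton-Schulz iteration) is used. Because only the trace is needed, an equivalent route is $\text{tr}\left((\Sigma_r \Sigma_g)^{1/2}\right) = \text{tr}\left((\Sigma_r^{1/2} \Sigma_g \Sigma_r^{1/2})^{1/2}\right)$, which requires only symmetric eigendecompositions.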
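A minimal NumPy sketch of the FID formula, assuming the feature means and covariances (or the raw feature arrays) are already available. It relies on the fact that $\Sigma_r \Sigma_g$ and $\Sigma_r^{1/2} \Sigma_g \Sigma_r^{1/2}$ share eigenvalues, so the trace of the square root can be obtained from symmetric eigendecompositions alone. The function names `fid_from_stats` and `fid_from_features` are hypothetical.

```python
import numpy as np

def _sqrt_psd(mat):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid_from_stats(mu_r, sigma_r, mu_g, sigma_g):
    """FID between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    mu_r, mu_g = np.asarray(mu_r, float), np.asarray(mu_g, float)
    sigma_r, sigma_g = np.asarray(sigma_r, float), np.asarray(sigma_g, float)
    root_r = _sqrt_psd(sigma_r)
    # Symmetric matrix with the same eigenvalues as sigma_r @ sigma_g.
    inner = root_r @ sigma_g @ root_r
    inner = (inner + inner.T) / 2.0  # remove round-off asymmetry
    eigvals = np.clip(np.linalg.eigvalsh(inner), 0.0, None)
    trace_sqrt = np.sqrt(eigvals).sum()  # tr((sigma_r sigma_g)^{1/2})
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)

def fid_from_features(feats_r, feats_g):
    """FID from two (N, d) feature arrays with d >= 2."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    sigma_r = np.cov(feats_r, rowvar=False)
    sigma_g = np.cov(feats_g, rowvar=False)
    return fid_from_stats(mu_r, sigma_r, mu_g, sigma_g)

# Small 2-d check: shifted mean, smaller generated variance.
fid_value = fid_from_stats([0.0, 0.0], 2.0 * np.eye(2), [1.0, 0.0], np.eye(2))
fid_same = fid_from_stats([0.0, 0.0], np.eye(2), [0.0, 0.0], np.eye(2))
```

For the 2-d check the result is $1 + 6 - 4\sqrt{2} \approx 1.343$, and identical statistics give 0. Extracting the Inception features themselves is outside the scope of this sketch.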

Limitations of FID

  1. Gaussian assumption: feature distributions aren't truly Gaussian, so FID can be misleading
  2. Sample count sensitivity: FID is biased for small sample sizes ($<$10k); bias correction needed
  3. Inception features: domain-specific β€” only works for natural images
  4. Single-number summary: collapses all distributional differences into one scalar
  5. Doesn't distinguish fidelity vs. diversity failures: both poor quality and mode collapse increase FID

Log-Likelihood and Bits Per Dimension

For models providing explicit density estimates (VAEs, flows, autoregressive models, diffusion via probability flow ODE):

Log-Likelihood

$$\mathcal{LL} = \frac{1}{N}\sum_{i=1}^{N} \log p_\theta(\mathbf{x}_i)$$

Higher is better. However:

- VAEs give a lower bound (ELBO), not exact likelihood
- Autoregressive models give exact likelihood (chain rule)
- Diffusion models can give exact likelihood via the probability flow ODE (22-07)
- GANs and implicit models cannot compute likelihood

Bits Per Dimension (BPD)

$$\text{BPD} = -\frac{\mathcal{LL}}{D \cdot \ln 2}$$

where $D$ is the number of dimensions (e.g., $32 \times 32 \times 3 = 3072$ for CIFAR-10). Lower BPD = better compression (model explains data with fewer bits).

Relationship to compression: BPD is the average number of bits needed to encode one dimension of the data under the model. Uniform random 8-bit pixel values → $-\log_2(1/256) = 8$ BPD. A perfect model capturing all structure → lower BPD.
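The conversion from nats to bits per dimension is a one-liner. A minimal sketch, assuming the log-likelihood is given in nats for the whole image; the helper name `bits_per_dim` is hypothetical.

```python
import math

def bits_per_dim(log_likelihood_nats, num_dims):
    """Convert a log-likelihood in nats into bits per dimension."""
    return -log_likelihood_nats / (num_dims * math.log(2))

# A 64 x 64 x 3 image with log p(x) = -4500 nats.
bpd_example = bits_per_dim(-4500.0, 64 * 64 * 3)

# Sanity check: a uniform model over 8-bit values assigns log(1/256) nats
# per dimension, which should come out as 8 bits per dimension.
dims = 32 * 32 * 3
bpd_uniform = bits_per_dim(dims * math.log(1.0 / 256.0), dims)
```

The uniform baseline is a useful reference point: any model reporting more than 8 BPD on 8-bit data is doing worse than assuming nothing about the pixels.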

Limitations of Likelihood-Based Metrics

  1. Not comparable across model families: a VAE ELBO is a bound; an autoregressive model gives exact likelihood
  2. High likelihood $\neq$ high sample quality: models can memorize training data (overfit) and achieve low BPD while producing poor samples
  3. Insensitive to manifold structure: a model placing mass everywhere but most in the right places can have good likelihood but blurry samples
  4. Numerical issues: likelihoods can be $-\infty$ for out-of-distribution points, making averaging unstable

Precision and Recall for Generative Models

Sajjadi et al. (2018) and Kynkäänniemi et al. (2019) proposed precision/recall metrics to disentangle fidelity and diversity.

Manifold-Based Definition

Define the real and generated manifolds via $k$-nearest-neighbor spheres in feature space.

For each generated sample $\mathbf{b}_j$:

- Precision: $\mathbf{b}_j$ is "precise" if it falls within the manifold of real data → fraction of generated samples that are realistic

For each real sample $\mathbf{a}_i$:

- Recall: $\mathbf{a}_i$ is "covered" if it falls within the manifold of generated data → fraction of real samples the model can produce

Formally, using Inception features and nearest neighbors:

$$\text{Precision} = \frac{1}{M}\sum_{j=1}^{M} \mathbf{1}\left[\mathbf{b}_j \in \bigcup_{i=1}^{N} B(\mathbf{a}_i, \text{NN}_k(\mathbf{a}_i))\right]$$

$$\text{Recall} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left[\mathbf{a}_i \in \bigcup_{j=1}^{M} B(\mathbf{b}_j, \text{NN}_k(\mathbf{b}_j))\right]$$

where $B(\mathbf{x}, r)$ is the ball of radius $r$ centered at $\mathbf{x}$, and $\text{NN}_k(\mathbf{x})$ is the distance from $\mathbf{x}$ to its $k$-th nearest neighbor within its own set (real or generated).

⚠️ CRITICAL: Interpreting Precision and Recall

| Precision | Recall | Interpretation |
|---|---|---|
| High | High | Ideal: realistic AND diverse |
| High | Low | High quality, mode collapse (e.g., BigGAN) |
| Low | High | Diverse but low quality (blurry samples) |
| Low | Low | Poor model (bad quality + mode dropping) |

Improved Precision and Recall (Kynkäänniemi et al.)

Uses explicit manifold estimation: for each point, compute the radius to its $k$-th nearest neighbor. Points within the union of hyperspheres are considered on-manifold. The fraction of generated samples within the real manifold = precision. The fraction of real samples within the generated manifold = recall.
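The k-nearest-neighbor construction can be sketched in a few lines of NumPy. This is a brute-force illustration under stated assumptions: features are already extracted, Euclidean distance is used, the default `k=3` is an illustrative choice, and the helper names are hypothetical. It builds the full pairwise distance matrix, so it is only suitable for small sample counts.

```python
import numpy as np

def _pairwise_dist(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def _knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbor in the same set."""
    dist = _pairwise_dist(points, points)
    # After sorting, column 0 is the point itself (distance 0).
    return np.sort(dist, axis=1)[:, k]

def _coverage(queries, support, k):
    """Fraction of queries lying inside at least one k-NN ball of support."""
    radii = _knn_radii(support, k)          # one radius per support point
    dist = _pairwise_dist(queries, support)  # (n_queries, n_support)
    return float(np.mean((dist <= radii[None, :]).any(axis=1)))

def precision_recall(real, gen, k=3):
    precision = _coverage(gen, real, k)  # generated samples inside the real manifold
    recall = _coverage(real, gen, k)     # real samples inside the generated manifold
    return precision, recall

# Toy mode-collapse scenario: real data has two clusters, the model covers one.
rng = np.random.default_rng(0)
real = np.concatenate([rng.normal(0.0, 0.1, size=(50, 2)),
                       rng.normal(0.0, 0.1, size=(50, 2)) + np.array([10.0, 0.0])])
gen = rng.normal(0.0, 0.1, size=(100, 2))
precision, recall = precision_recall(real, gen)
```

In the toy scenario the second real cluster is never covered, so recall cannot exceed 0.5, while precision stays comparatively high: the "high precision, low recall" pattern associated with mode collapse.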


Comparing Metrics

| Metric | Measures | Requires | Weakness |
|---|---|---|---|
| IS | Quality + diversity (class-level) | Pretrained classifier | No comparison to real data; only class diversity |
| FID | Distributional distance | Pretrained classifier features | Gaussian assumption; sample-size bias |
| BPD / LL | Compression quality | Explicit density model | Not comparable across model families |
| Precision | Sample quality (fidelity) | Pretrained classifier features | Choice of $k$, feature space |
| Recall | Coverage (diversity) | Pretrained classifier features | Same as above |
| Kernel Inception Distance (KID) | Maximum mean discrepancy | Pretrained features + kernel | Choice of kernel |

Practical Recommendations



Key Terms

Worked Examples

Example 1: Computing IS from Classifier Outputs

Three generated images produce classifier outputs over the classes (dog, cat, bird):

- $\mathbf{x}_1$: $p(y|\mathbf{x}_1) = [0.9, 0.1, 0.0]$
- $\mathbf{x}_2$: $p(y|\mathbf{x}_2) = [0.1, 0.8, 0.1]$
- $\mathbf{x}_3$: $p(y|\mathbf{x}_3) = [0.0, 0.1, 0.9]$

Compute the Inception Score.

Solution:

Step 1: Marginal $p(y) = \frac{1}{3}([0.9,0.1,0.0] + [0.1,0.8,0.1] + [0.0,0.1,0.9])$ $= [1.0/3, 1.0/3, 1.0/3] = [0.333, 0.333, 0.333]$

Step 2: Per-sample KL divergence: $\text{KL}_1 = 0.9\log(0.9/0.333) + 0.1\log(0.1/0.333) + 0.0\log(0.0/0.333)$ $= 0.9 \cdot 0.993 + 0.1 \cdot (-1.204) + 0 = 0.894 - 0.120 = 0.774$

$\text{KL}_2 = 0.1\log(0.1/0.333) + 0.8\log(0.8/0.333) + 0.1\log(0.1/0.333)$ $= 0.1(-1.204) + 0.8(0.875) + 0.1(-1.204) = -0.120 + 0.700 - 0.120 = 0.459$

$\text{KL}_3 = 0.0(\ldots) + 0.1(-1.204) + 0.9(0.993) = -0.120 + 0.894 = 0.774$

Step 3: $\text{IS} = \exp\left(\frac{0.774 + 0.459 + 0.774}{3}\right) = \exp(0.669) = 1.95$

Answer: IS ≈ 1.95 (the maximum with 3 classes is 3). The model generates clear images (peaked conditionals) covering all three classes (uniform marginal), giving a high score relative to the 3-class maximum. For scale, with the 1000-class Inception classifier the theoretical maximum is 1000, and strong CIFAR-10 models typically score around 8-10.

Example 2: Computing FID

Real features: $\boldsymbol{\mu}_r = (0, 0)$, $\Sigma_r = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$

Generated features: $\boldsymbol{\mu}_g = (1, 0)$, $\Sigma_g = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

Compute FID.

Solution:

Mean term: $\|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2 = (0-1)^2 + (0-0)^2 = 1$

Covariance term: $\Sigma_r \Sigma_g = \begin{pmatrix}2&0\\0&2\end{pmatrix}\begin{pmatrix}1&0\\0&1\end{pmatrix} = \begin{pmatrix}2&0\\0&2\end{pmatrix}$

$(\Sigma_r \Sigma_g)^{1/2} = \begin{pmatrix}\sqrt{2}&0\\0&\sqrt{2}\end{pmatrix}$

$\text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}) = \text{tr}\left(\begin{pmatrix}3&0\\0&3\end{pmatrix} - \begin{pmatrix}2\sqrt{2}&0\\0&2\sqrt{2}\end{pmatrix}\right)$

$= (3 - 2\sqrt{2}) + (3 - 2\sqrt{2}) = 6 - 4\sqrt{2} \approx 6 - 5.657 = 0.343$

$\text{FID} = 1 + 0.343 = 1.343$

Answer: FID ≈ 1.343. The mean shift contributes 1.0 and the lower variance of generated samples contributes ~0.34. The generated distribution is shifted and less spread out than real data, visible in the mean and covariance terms respectively.

Example 3: Bits Per Dimension

A model achieves $\log p(\mathbf{x}) = -4500$ (natural log) on a $64 \times 64 \times 3$ image. Compute BPD.

Solution:

$D = 64 \times 64 \times 3 = 12288$ dimensions.

$\text{BPD} = -\frac{-4500}{12288 \cdot \ln 2} = \frac{4500}{12288 \cdot 0.693147} = \frac{4500}{8517.4} \approx 0.528$

Answer: BPD ≈ 0.528. This is implausibly low for natural images (CIFAR-10 SOTA ≈ 2.8): it would mean the model can compress each pixel dimension to ~0.5 bits. Such numbers are only possible for nearly-deterministic data or when evaluating on the training set (overfitting).

Practice Problems

  1. Show that if all $p(y|\mathbf{x})$ are identical (e.g., all generated images look like the same class), then $\text{IS} = 1$ regardless of $K$.

    Answer: If $p(y|\mathbf{x}) = \mathbf{q}$ for all $\mathbf{x}$, then $p(y) = \mathbf{q}$. Then $\text{KL}(p(y|\mathbf{x})\|p(y)) = \text{KL}(\mathbf{q}\|\mathbf{q}) = 0$ for all $\mathbf{x}$. $\text{IS} = \exp(0) = 1$. The model has zero diversity under IS: only one effective class.

  2. Prove that when $\Sigma_r = \Sigma_g = \Sigma$, the covariance term of FID simplifies to zero. What does the FID then measure?

    Answer: Since $\Sigma$ is symmetric positive semidefinite, $(\Sigma\Sigma)^{1/2} = \Sigma$, so $\text{tr}(\Sigma + \Sigma - 2(\Sigma\Sigma)^{1/2}) = \text{tr}(2\Sigma - 2\Sigma) = 0$ and $\text{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2$. With equal covariances, FID reduces to the squared Euclidean distance between means, a measure of global distribution shift independent of diversity differences.

  3. An autoregressive model achieves 3.2 BPD on the CIFAR-10 test set but 1.1 BPD on the training set. What likely happened?

    Answer: The model is severely overfitting by memorizing training data. The gap of 2.1 BPD between train and test is enormous. The model has high likelihood on training data (low BPD) because it's essentially reproducing memorized patterns, but fails to generalize. This is why likelihood alone can be misleading: a lookup table of the training set achieves perfect likelihood on the training data but generates nothing new.

  4. Model A has high precision (0.95) but low recall (0.30). Model B has lower precision (0.70) but higher recall (0.85). Which is preferable, and when?

    Answer: Model A produces very realistic images but only covers 30% of the data distribution (mode collapse). Model B is more diverse but less realistic. Preference depends on application: for photorealistic image synthesis where diversity is secondary (e.g., super-resolution), prefer A. For data augmentation where coverage matters (e.g., medical imaging), prefer B. Most applications want a balance; neither is clearly superior.

  5. Why can't we compare the FID of a VAE with the FID of a GAN trained on the same dataset using just the FID number?

    Answer: Actually, we CAN compare them; that is the point of FID. FID is model-agnostic and only depends on the generated samples vs. real samples. But the question highlights that we shouldn't rely on FID alone: VAEs typically have high recall (cover modes) but low precision (blurry samples), while GANs typically have high precision but low recall (mode collapse). FID conflates these into one number, so a VAE and a GAN could have the same FID for completely different reasons. Always supplement FID with precision/recall.


Summary

Key takeaways:


Quiz

  1. A generative model achieves IS = 8.2 on a 1000-class dataset. This means:

     - A) The model is 8.2× better than random
     - B) The effective number of clearly-generated classes is approximately 8.2
     - C) FID is approximately 8.2
     - D) The model's precision is 82%

     Correct: B)

     - If you chose B: IS is $\exp(\mathbb{E}[\text{KL}])$. When $p(y|\mathbf{x})$ is one-hot and $p(y)$ is uniform, IS = $K$. An IS of 8.2 means the model covers ~8.2 "effective classes" with clear (high-confidence) images. It could be generating images across all 1000 classes with moderate confidence, or perfectly across 8-9 classes.
     - If you chose A: IS is not multiplicative; IS=1 is the baseline (no diversity).
     - If you chose C: IS and FID measure different things and have different scales.
     - If you chose D: IS doesn't measure precision in this sense.

  2. The covariance term in FID uses the matrix square root $(\Sigma_r \Sigma_g)^{1/2}$ because:

     - A) It's computationally cheaper than other matrix operations
     - B) It arises from the Wasserstein-2 distance between two multivariate Gaussians
     - C) Inception features are always diagonal
     - D) The square root cancels with the trace

     Correct: B)

     - If you chose B: The Wasserstein-2 distance between $\mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$ and $\mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$ has the closed form $\|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|^2 + \text{tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2})$. FID writes the last term as $\text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$. The two forms have the same trace for any pair of covariance matrices, because $\Sigma_1\Sigma_2$ and $\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}$ share the same eigenvalues.
     - If you chose A: The matrix square root is actually expensive ($O(d^3)$ via eigendecomposition).
     - If you chose C: Inception features have rich covariance structure.
     - If you chose D: It doesn't cancel; it contributes the key distributional information.

  3. A VAE reports "log-likelihood = -120 nats" on MNIST. A PixelCNN reports "log-likelihood = -85 nats". Which is better?

     - A) PixelCNN, because -85 > -120
     - B) VAE, because $|-120| > |-85|$
     - C) Cannot directly compare because the VAE reports a lower bound (ELBO), not the true log-likelihood
     - D) They're equally good because nats are arbitrary

     Correct: C)

     - If you chose C: $\log p_{\text{VAE}}(\mathbf{x}) \geq \text{ELBO} = -120$. The true log-likelihood could be -100, -90, or even -85; we don't know the ELBO gap $\text{KL}(q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}))$. PixelCNN gives the exact log-likelihood. The numbers are not comparable.
     - If you chose A: Only valid if both are exact likelihoods, which they aren't.
     - If you chose B: Higher log-likelihood is better.
     - If you chose D: Nats (natural log units) are well-defined; "arbitrary" doesn't apply.

  4. FID is biased for small sample sizes because:

     - A) The Inception network needs many samples to activate
     - B) The sample mean and covariance are noisy estimates, and the quadratic/root terms introduce systematic bias
     - C) Small samples cannot represent multiple classes
     - D) FID requires exactly 50,000 samples by definition

     Correct: B)

     - If you chose B: $\hat{\boldsymbol{\mu}}$ and $\hat{\Sigma}$ from $N$ samples are unbiased estimates, but $\|\hat{\boldsymbol{\mu}}_r - \hat{\boldsymbol{\mu}}_g\|^2$ is biased upward (Jensen's inequality: expectation of a convex function of an unbiased estimator). The trace term with the matrix square root also introduces bias. Bias corrections exist (e.g., the "FID infinity" extrapolation).
     - If you chose A: Inception activates for any valid image.
     - If you chose C: Even with one class, statistical bias exists.
     - If you chose D: 50k is a common convention, not a requirement.

  5. Precision and recall for generative models differ from their classification counterparts in that:

     - A) They use the same formulas
     - B) Generative precision measures whether samples look realistic (are in the real data manifold); generative recall measures whether all modes are covered
     - C) Generative precision is always higher
     - D) Generative recall requires human evaluation

     Correct: B)

     - If you chose B: In classification, precision = TP/(TP+FP) and recall = TP/(TP+FN). In generative evaluation, precision = fraction of generated samples that are realistic (near real data), and recall = fraction of real-data diversity that is covered by generated samples. Both are estimated via nearest-neighbor geometry in feature space.
     - If you chose A: Different definitions, though the names share the "quality vs. coverage" spirit.
     - If you chose C: Not true. VAE precision is typically lower than GAN precision despite higher recall.
     - If you chose D: Both are computed from Inception features, no human evaluation needed.

Next Steps

23-01: Markov Decision Processes (MDPs). Transitioning from generative models to reinforcement learning. MDPs formalize sequential decision-making with states, actions, rewards, and the Markov property. This begins Phase 23 (Reinforcement Learning Mathematics).


Pitfalls

  1. Using a single metric as the definitive measure of generative quality: FID, IS, and BPD each capture different aspects. A model with excellent FID might have severe mode collapse (high precision, low recall). A model with great BPD might produce blurry samples. Always report at least 2-3 complementary metrics: FID + precision/recall for images; BPD + FID for likelihood-based models; never rely on a single number.

  2. Computing FID with too few samples: FID is biased upward for small sample sizes because the sample mean and covariance estimates are noisy, and the quadratic/covariance-root terms introduce systematic bias. With fewer than 5,000 samples, FID can overstate the distance between distributions. Use at least 10,000 samples (50,000 is standard) and report the sample size alongside FID values. For small datasets, use KID (Kernel Inception Distance), which has an unbiased estimator.

  3. Comparing log-likelihoods across different model families: A VAE's ELBO is a lower bound, not the true log-likelihood. An autoregressive model gives exact likelihood. A flow gives exact likelihood. Comparing a VAE's ELBO of -100 nats against a flow's exact likelihood of -95 nats is meaningless: the VAE might actually achieve -90 nats with a tighter bound. Only compare likelihood-based metrics within the same model family, or use sample-based metrics (FID) for cross-family comparison.

  4. Confusing IS with a measure of sample-level diversity: A model generating one perfect image per ImageNet class achieves maximum IS, but might be incapable of intra-class variation (e.g., produces exactly one dog image for all dog prompts). IS measures class-level diversity, not sample-level diversity. A model with IS = 50 and another with IS = 50 could have wildly different intra-class coverage. Precision/recall metrics are needed to detect this.
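Pitfall 2 can be illustrated with a small simulation. The sketch below draws both "real" and "generated" features from the same standard Gaussian, so the true FID is 0, and compares the estimated FID at two sample sizes. The dimension (16), the sample sizes (50 and 5000), the number of trials, and the helper names are arbitrary choices made for illustration only.

```python
import numpy as np

def fid(feats_a, feats_b):
    """FID between two (N, d) feature arrays, using symmetric eigendecompositions."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    vals, vecs = np.linalg.eigh(cov_a)
    root_a = (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    inner = root_a @ cov_b @ root_a  # same eigenvalues as cov_a @ cov_b
    eig = np.clip(np.linalg.eigvalsh((inner + inner.T) / 2.0), 0.0, None)
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * np.sqrt(eig).sum())

rng = np.random.default_rng(0)
dim = 16

def mean_fid(num_samples, trials=20):
    """Average estimated FID between two independent draws from N(0, I)."""
    estimates = [fid(rng.standard_normal((num_samples, dim)),
                     rng.standard_normal((num_samples, dim)))
                 for _ in range(trials)]
    return float(np.mean(estimates))

fid_small = mean_fid(50)    # few samples: estimate sits well above the true value 0
fid_large = mean_fid(5000)  # many samples: estimate is much closer to 0
```

Both estimates are positive even though the two distributions are identical, and the small-sample estimate is the larger of the two, which is why FID values are only comparable when computed with the same sample count.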




Q7: Precision and recall for generative models are computed using:

- A) The same formulas as classification precision/recall.
- B) Nearest-neighbor geometry in a feature space: precision measures the fraction of generated samples within the real data manifold, and recall measures the fraction of real samples covered by the generated manifold.
- C) The discriminator's output probabilities.
- D) Human evaluation studies.

Answer and Explanations

**Correct: B)** Generative precision/recall (Sajjadi et al., Kynkäänniemi et al.) estimate the real and generated manifolds via $k$-nearest-neighbor distances in Inception feature space. A generated sample is "precise" if it lies within the union of hyperspheres around real samples. A real sample is "recalled" if it lies within the union of hyperspheres around generated samples. This measures fidelity and coverage geometrically.

- A) They share names but have different definitions: classification P/R uses true/false positives.
- C) Discriminator outputs are used for GAN training, not precision/recall evaluation.
- D) Human evaluation is a separate (and expensive) alternative, not how these metrics are computed.