22-10 — Evaluation of Generative Models

Phase: 22 (Generative Models Mathematics)
Subject: 22-10
Prerequisites: 22-01 through 22-09 (all generative model subjects); Phase 13 (Probability: expectations, divergences)
Next subject: 23-01, Markov Decision Processes (MDPs)

Learning Objectives

By the end of this subject, you will be able to:

Define and compute the Inception Score (IS) and explain its relationship to conditional label entropy
Define and compute the Fréchet Inception Distance (FID) and explain its Gaussian approximation assumption
Interpret log-likelihood and bits per dimension (BPD) as evaluation metrics for generative models
Define precision and recall for generative models via manifold coverage
Critically evaluate which metric to use for a given generative modeling task, recognizing each metric's limitations

Core Content

Why Evaluation Is Hard

Generative models are fundamentally harder to evaluate than discriminative models:

Discriminative models: accuracy, precision, recall, F1 — measure against ground-truth labels
Generative models: no single "correct" output — two different samples can both be valid

We need metrics that measure at least two axes:
1. Fidelity (quality): do samples look realistic?
2. Diversity (coverage): does the model cover all modes of the data distribution?

Inception Score (IS)

Proposed by Salimans et al. (2016), the Inception Score uses a pretrained ImageNet classifier (Inception v3) to evaluate generated images.

Definition

Given generated samples $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, run each through the Inception classifier to get conditional label distributions $p(y \mid \mathbf{x})$. Also compute the marginal distribution $p(y) = \frac{1}{N}\sum_{i=1}^N p(y \mid \mathbf{x}_i)$.

$$\text{IS} = \exp\left(\mathbb{E}_{\mathbf{x}}\left[\text{KL}(p(y \mid \mathbf{x}) \,\|\, p(y))\right]\right)$$

Expanding:

$$\text{IS} = \exp\left(\frac{1}{N}\sum_{i=1}^{N} \sum_{y=1}^{K} p(y \mid \mathbf{x}_i) \log\frac{p(y \mid \mathbf{x}_i)}{p(y)}\right)$$
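The exponent can be split into two entropy terms, which makes the link to conditional label entropy explicit. The short derivation below uses only the definitions above; writing $H(\cdot)$ for Shannon entropy is notation introduced here for illustration.

$$
\begin{aligned}
\log \text{IS}
&= \frac{1}{N}\sum_{i=1}^{N}\sum_{y=1}^{K} p(y \mid \mathbf{x}_i)\log p(y \mid \mathbf{x}_i) - \sum_{y=1}^{K}\left(\frac{1}{N}\sum_{i=1}^{N} p(y \mid \mathbf{x}_i)\right)\log p(y) \\
&= -\frac{1}{N}\sum_{i=1}^{N} H\big(p(y \mid \mathbf{x}_i)\big) + H\big(p(y)\big)
\end{aligned}
$$

The second line uses $\frac{1}{N}\sum_i p(y \mid \mathbf{x}_i) = p(y)$. So $\log \text{IS}$ is the entropy of the marginal minus the average conditional label entropy. Because $0 \le H \le \log K$ and the average conditional entropy never exceeds the marginal entropy, it follows that $1 \le \text{IS} \le K$.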

⚠️ CRITICAL — Interpreting IS:

High IS requires two things simultaneously:
$p(y \mid \mathbf{x})$ is peaked (low entropy) → the classifier is confident about what object is in each image → samples are clear/realistic
$p(y)$ is uniform (high entropy) → samples cover many different classes → diversity
IS ranges from 1 to $K$ (number of classes; $K=1000$ for ImageNet)
IS = 1: all samples classified identically → no diversity
IS = $K$: perfect uniform distribution of clear images → perfect diversity + quality
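The two extremes above can be checked numerically. The following is a minimal sketch in plain Python, assuming the classifier outputs are already available as a list of probability vectors; the function name `inception_score` and the toy inputs are illustrative and not part of any library.

```python
import math

def inception_score(probs):
    """Inception Score from per-sample class distributions p(y|x_i)."""
    n = len(probs)
    k = len(probs[0])
    # Marginal p(y): average of the conditional distributions.
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    # Mean KL(p(y|x_i) || p(y)); terms with p(y|x_i) = 0 contribute 0.
    mean_kl = 0.0
    for p in probs:
        mean_kl += sum(pc * math.log(pc / marginal[c])
                       for c, pc in enumerate(p) if pc > 0.0)
    mean_kl /= n
    return math.exp(mean_kl)

# One confident sample per class: peaked conditionals, uniform marginal.
one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Every sample gets the same distribution: no class-level diversity.
collapsed = [[0.7, 0.2, 0.1]] * 4

print(inception_score(one_hot))    # approximately K = 3
print(inception_score(collapsed))  # approximately 1
```

In a real evaluation the rows of `probs` would be softmax outputs of Inception v3 over tens of thousands of generated images; the arithmetic is unchanged.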

Limitations of IS

Classifier-dependent: only measures what the Inception model can detect; may miss artifacts invisible to Inception
Blind to intra-class mode dropping: a model generating one perfect sample per class achieves maximum IS while having no intra-class diversity at all
Not applicable beyond natural images: requires a pretrained classifier on the target domain
Sensitive to sample count: IS requires many samples ($\sim$50k) for stable estimates
No comparison to real data: IS only evaluates generated samples in isolation

⚠️ CRITICAL — Fréchet Inception Distance (FID)

Heusel et al. (2017) proposed FID to address IS's inability to compare against real data. FID measures the distance between the distribution of real and generated images in a feature space.

Definition

Extract features from a penultimate Inception v3 layer (typically pool3, 2048-d) for both real images $\{\mathbf{a}_i\}$ and generated images $\{\mathbf{b}_j\}$. Assume both follow multivariate Gaussian distributions:

$$\mathbf{a} \sim \mathcal{N}(\boldsymbol{\mu}_r, \Sigma_r), \quad \mathbf{b} \sim \mathcal{N}(\boldsymbol{\mu}_g, \Sigma_g)$$

The Fréchet distance (Wasserstein-2 distance between Gaussians) is:

$$\text{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|_2^2 + \text{tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $(\Sigma_r \Sigma_g)^{1/2}$ is the matrix square root of $\Sigma_r \Sigma_g$.

Lower FID = better (real and generated distributions are closer).

Components of FID

$\|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|_2^2$: distance between means; captures global shift (e.g., all generated images are too dark)
$\text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$: covariance mismatch — captures diversity differences

If both distributions have the same mean ($\boldsymbol{\mu}_r = \boldsymbol{\mu}_g$) and identical covariances ($\Sigma_r = \Sigma_g$):

$$\text{FID} = 0 + \text{tr}(\Sigma + \Sigma - 2\Sigma) = 0$$

Computing the Matrix Square Root

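The product $\Sigma_r \Sigma_g$ is generally not symmetric, so a symmetric eigendecomposition cannot be applied to it directly. Only the trace of the square root is needed, and $\text{tr}\big((\Sigma_r \Sigma_g)^{1/2}\big) = \text{tr}\big((\Sigma_r^{1/2}\Sigma_g\Sigma_r^{1/2})^{1/2}\big)$, where the inner matrix is symmetric positive semidefinite and can be handled by eigendecomposition. Equivalently, the trace is the sum of the singular values of $\Sigma_r^{1/2}\Sigma_g^{1/2}$, which can be obtained by SVD. If $\Sigma_r$ and $\Sigma_g$ commute (they share eigenvectors), then $(\Sigma_r \Sigma_g)^{1/2} = \Sigma_r^{1/2}\Sigma_g^{1/2}$. General-purpose matrix square root routines or the Newton-Schulz iteration are also used in practice.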
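As a concrete illustration of the FID formula, here is a minimal NumPy sketch. It is one possible implementation, not the reference one: it assumes the identity $\text{tr}\big((\Sigma_r\Sigma_g)^{1/2}\big) = \text{tr}\big((\Sigma_r^{1/2}\Sigma_g\Sigma_r^{1/2})^{1/2}\big)$ so that only symmetric eigendecompositions are needed. The names `sqrt_psd`, `frechet_distance`, and `fid_from_features` are hypothetical helpers introduced here.

```python
import numpy as np

def sqrt_psd(mat):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative values from round-off
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Squared Frechet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    mu_r, mu_g = np.asarray(mu_r, float), np.asarray(mu_g, float)
    sigma_r, sigma_g = np.asarray(sigma_r, float), np.asarray(sigma_g, float)
    root_r = sqrt_psd(sigma_r)
    inner = root_r @ sigma_g @ root_r
    inner = (inner + inner.T) / 2.0  # symmetrise against round-off
    # Trace of the square root = sum of square roots of the eigenvalues.
    eigs = np.clip(np.linalg.eigvalsh(inner), 0.0, None)
    tr_sqrt = np.sqrt(eigs).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

def fid_from_features(real_feats, gen_feats):
    """Rows are feature vectors (for FID proper: 2048-d pool3 activations)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)

# Small 2-d check: mean shift of 1, covariances 2I and I.
print(frechet_distance([0, 0], 2 * np.eye(2), [1, 0], np.eye(2)))  # about 1.343 (= 7 - 4*sqrt(2))
```

Feature extraction with Inception v3 is deliberately left out; `fid_from_features` only assumes that two arrays of features have already been computed.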

Limitations of FID

Gaussian assumption: feature distributions aren't truly Gaussian — FID can be misleading
Sample count sensitivity: FID is biased for small sample sizes ($<$10k); bias correction needed
Inception features: the feature extractor was trained on ImageNet, so FID is most meaningful for natural images and less reliable in other domains
Single-number summary: collapses all distributional differences into one scalar
Doesn't distinguish fidelity vs. diversity failures: both poor quality and mode collapse increase FID
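The sample-size bias listed above can be seen in a small simulation. This sketch covers only the mean term of FID (the covariance term is biased as well) and assumes both sample sets are drawn from the same standard normal distribution, so the true distance is zero. In that setting the expected value of the estimated mean term is $2d/N$. The function names and the choices `d=8`, `trials=50` are illustrative.

```python
import random

def mean_term(n, d, rng):
    """Squared distance between sample means of two size-n draws from N(0, I_d)."""
    mu_r = [sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n for _ in range(d)]
    mu_g = [sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n for _ in range(d)]
    return sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))

def average_mean_term(n, d=8, trials=50, seed=0):
    """Average the estimated mean term over repeated trials."""
    rng = random.Random(seed)
    return sum(mean_term(n, d, rng) for _ in range(trials)) / trials

for n in (10, 100, 1000):
    # The true value is 0; the estimate stays positive and shrinks like 2*d/n.
    print(n, average_mean_term(n), "theory:", 2 * 8 / n)
```

Even with identical distributions the estimate never reaches zero, which is why FID values computed at different sample sizes should not be compared directly.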

Log-Likelihood and Bits Per Dimension

For models providing explicit density estimates (VAEs, flows, autoregressive models, diffusion via probability flow ODE):

Log-Likelihood

$$\mathcal{LL} = \frac{1}{N}\sum_{i=1}^{N} \log p_\theta(\mathbf{x}_i)$$

Higher is better. However:
VAEs give a lower bound (ELBO), not exact likelihood
Autoregressive models give exact likelihood (chain rule)
Normalizing flows give exact likelihood (change of variables)
Diffusion models can give exact likelihood via the probability flow ODE (22-07)
GANs and implicit models cannot compute likelihood

Bits Per Dimension (BPD)

$$\text{BPD} = -\frac{\mathcal{LL}}{D \cdot \ln 2}$$

where $D$ is the number of dimensions (e.g., $32 \times 32 \times 3 = 3072$ for CIFAR-10). Lower BPD = better compression (model explains data with fewer bits).

Relationship to compression: BPD is the average number of bits needed to encode one dimension of the data under the model. Uniform random 8-bit pixel values → $-\log_2(1/256) = 8$ BPD. A perfect model capturing all structure → lower BPD.
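The conversion from nats to bits per dimension is a one-liner. The sketch below assumes the log-likelihood is given in nats for a single example (or averaged over examples); the helper name `bits_per_dim` is illustrative. It reproduces the uniform 8-bit baseline stated above.

```python
import math

def bits_per_dim(log_likelihood_nats, num_dims):
    """Convert a log-likelihood in nats into bits per dimension."""
    return -log_likelihood_nats / (num_dims * math.log(2))

# Baseline: independent uniform 8-bit values, probability 1/256 per dimension.
d = 32 * 32 * 3
uniform_ll = d * math.log(1 / 256)
print(bits_per_dim(uniform_ll, d))  # approximately 8
```

Any model worth reporting should land below this 8 BPD baseline on 8-bit image data.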

Limitations of Likelihood-Based Metrics

Not comparable across model families: a VAE ELBO is a bound; an autoregressive model gives exact likelihood
High likelihood $\neq$ high sample quality: a model that memorizes the training data (overfits) achieves very low BPD on that data while generalizing poorly, and even good held-out likelihood does not guarantee visually good samples
Insensitive to manifold structure: a model placing mass everywhere but most in the right places can have good likelihood but blurry samples
Numerical issues: likelihoods can be $-\infty$ for out-of-distribution points, making averaging unstable

Precision and Recall for Generative Models

Sajjadi et al. (2018) and Kynkäänniemi et al. (2019) proposed precision/recall metrics to disentangle fidelity and diversity.

Manifold-Based Definition

Define the real and generated manifolds via $k$-nearest-neighbor spheres in feature space.

For each generated sample $\mathbf{b}_j$:
Precision: $\mathbf{b}_j$ is "precise" if it falls within the manifold of real data → fraction of generated samples that are realistic

For each real sample $\mathbf{a}_i$:
Recall: $\mathbf{a}_i$ is "covered" if it falls within the manifold of generated data → fraction of real samples the model can produce

Formally, using Inception features and nearest neighbors:

$$\text{Precision} = \frac{1}{M}\sum_{j=1}^{M} \mathbf{1}\left[\mathbf{b}_j \in \bigcup_{i=1}^{N} B(\mathbf{a}_i, \text{NN}_k(\mathbf{a}_i))\right]$$

$$\text{Recall} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left[\mathbf{a}_i \in \bigcup_{j=1}^{M} B(\mathbf{b}_j, \text{NN}_k(\mathbf{b}_j))\right]$$

where $B(\mathbf{x}, r)$ is the ball of radius $r$ centered at $\mathbf{x}$, and $\text{NN}_k(\mathbf{x})$ is the distance from $\mathbf{x}$ to its $k$-th nearest neighbor within its own set (real samples for $\mathbf{a}_i$, generated samples for $\mathbf{b}_j$).
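The two formulas translate almost directly into code. The following brute-force sketch in plain Python assumes the feature vectors are already computed and small enough for an $O(n^2)$ search; the names `knn_radii`, `coverage`, and `precision_recall` and the toy 2-d data are illustrative only.

```python
import math

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii

def coverage(queries, support, k):
    """Fraction of queries inside at least one k-NN ball centred on a support point."""
    radii = knn_radii(support, k)
    hits = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return hits / len(queries)

def precision_recall(real, generated, k=3):
    precision = coverage(generated, real, k)  # generated samples inside the real manifold
    recall = coverage(real, generated, k)     # real samples inside the generated manifold
    return precision, recall

# Toy data: real samples form two clusters, generated samples cover only one.
real = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
generated = [(0.02, 0.02), (0.12, 0.02), (0.02, 0.12), (0.12, 0.12)]

print(precision_recall(real, generated, k=2))  # (1.0, 0.5): realistic but one mode missing
```

Every generated point lies near a real one, so precision is 1.0, while the second real cluster is never reached, so recall is 0.5. Real evaluations use thousands of high-dimensional features and an efficient nearest-neighbour search instead of the double loop.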

⚠️ CRITICAL — Interpreting Precision and Recall:

Precision	Recall	Interpretation
High	High	Ideal: realistic AND diverse
High	Low	High quality, mode collapse (e.g., BigGAN)
Low	High	Diverse but low quality (blurry samples)
Low	Low	Poor model (bad quality + mode dropping)

Improved Precision and Recall (Kynkäänniemi et al.)

Uses explicit manifold estimation: for each point, compute the radius to its $k$-th nearest neighbor. Points within the union of hyperspheres are considered on-manifold. The fraction of generated samples within the real manifold = precision. The fraction of real samples within the generated manifold = recall.

Comparing Metrics

Metric	Measures	Requires	Weakness
IS	Quality + diversity (class-level)	Pretrained classifier	No comparison to real data; only class diversity
FID	Distributional distance	Pretrained classifier features	Gaussian assumption; sample-size bias
BPD / LL	Compression quality	Explicit density model	Not comparable across model families
Precision	Sample quality (fidelity)	Pretrained classifier features	Choice of $k$, feature space
Recall	Coverage (diversity)	Pretrained classifier features	Same as above
Kernel Inception Distance (KID)	Maximum mean discrepancy	Pretrained features + kernel	Choice of kernel
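For the KID row, a rough sketch may help make "maximum mean discrepancy" concrete. It assumes the cubic polynomial kernel $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^\top\mathbf{y}/d + 1)^3$ commonly used for KID and the standard unbiased MMD$^2$ estimator; practical implementations also average over many subsets, which is omitted here. The names `poly_kernel` and `kid` and the toy data are illustrative.

```python
def poly_kernel(x, y, degree=3):
    """Polynomial kernel (x.y / d + 1)^degree."""
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** degree

def kid(real, generated):
    """Unbiased MMD^2 estimate between two sets of feature vectors."""
    m, n = len(real), len(generated)
    # Within-set terms exclude the diagonal (i == j), which removes the bias.
    k_rr = sum(poly_kernel(real[i], real[j])
               for i in range(m) for j in range(m) if i != j)
    k_gg = sum(poly_kernel(generated[i], generated[j])
               for i in range(n) for j in range(n) if i != j)
    k_rg = sum(poly_kernel(r, g) for r in real for g in generated)
    return k_rr / (m * (m - 1)) + k_gg / (n * (n - 1)) - 2.0 * k_rg / (m * n)

real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near = [(x + 0.1, y + 0.1) for x, y in real]  # small shift
far = [(x + 3.0, y + 3.0) for x, y in real]   # large shift

print(kid(real, near), kid(real, far))  # the far set scores much higher
```

Because the estimator is unbiased, it can come out slightly negative when the two sets are close and the sample is tiny; that is expected and not a bug.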

Practical Recommendations

Image generation: Report FID (primary) + IS (secondary) + Precision/Recall
Likelihood-based models: Report BPD + FID
Audio generation: Fréchet Audio Distance (FAD), analogous to FID
Text generation: Perplexity, BLEU/ROUGE (task-specific), human evaluation
Always report multiple metrics — no single number captures all aspects of generative quality

Key Terms

Always report multiple metrics
Audio generation
Classifier-dependent
Discriminative models
Diversity
FID is the de facto standard
Fidelity
Gaussian assumption
Generative models
High IS
Image generation
Inception features
Insensitive to manifold structure
Likelihood-based models
No comparison to real data

Worked Examples

Example 1: Computing IS from Classifier Outputs

Three generated images produce classifier outputs over the classes (dog, cat, bird):
$\mathbf{x}_1$: $p(y|\mathbf{x}_1) = [0.9, 0.1, 0.0]$
$\mathbf{x}_2$: $p(y|\mathbf{x}_2) = [0.1, 0.8, 0.1]$
$\mathbf{x}_3$: $p(y|\mathbf{x}_3) = [0.0, 0.1, 0.9]$

Compute the Inception Score.

Solution:

Step 1: Marginal
$p(y) = \frac{1}{3}([0.9,0.1,0.0] + [0.1,0.8,0.1] + [0.0,0.1,0.9]) = [1.0/3, 1.0/3, 1.0/3] \approx [0.333, 0.333, 0.333]$

Step 2: Per-sample KL divergence (natural log, with the convention $0 \log 0 = 0$):
$\text{KL}_1 = 0.9\log(0.9/0.333) + 0.1\log(0.1/0.333) + 0.0\log(0.0/0.333) = 0.9 \cdot 0.993 + 0.1 \cdot (-1.204) + 0 = 0.894 - 0.120 = 0.774$

$\text{KL}_2 = 0.1\log(0.1/0.333) + 0.8\log(0.8/0.333) + 0.1\log(0.1/0.333) = 0.1(-1.204) + 0.8(0.875) + 0.1(-1.204) = -0.120 + 0.700 - 0.120 = 0.460$

$\text{KL}_3 = 0.0(\ldots) + 0.1(-1.204) + 0.9(0.993) = -0.120 + 0.894 = 0.774$

Step 3: $\text{IS} = \exp\left(\frac{0.774 + 0.460 + 0.774}{3}\right) = \exp(0.669) \approx 1.95$

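IS ≈ 1.95 (the maximum with 3 classes is 3). The conditionals are fairly peaked and the marginal is exactly uniform, so the score is well above the minimum of 1; it falls short of 3 because the conditionals are not one-hot. For scale: on CIFAR-10, strong models score roughly 8-10 even though the Inception classifier has 1000 output classes, while strong class-conditional ImageNet models can exceed 100.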

Example 2: Computing FID

Real features: $\boldsymbol{\mu}_r = (0, 0)$, $\Sigma_r = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$
Generated features: $\boldsymbol{\mu}_g = (1, 0)$, $\Sigma_g = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

Compute FID.

Solution:

Mean term: $\|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2 = (0-1)^2 + (0-0)^2 = 1$

Covariance term: $\Sigma_r \Sigma_g = \begin{pmatrix}2&0\\0&2\end{pmatrix}\begin{pmatrix}1&0\\0&1\end{pmatrix} = \begin{pmatrix}2&0\\0&2\end{pmatrix}$

$(\Sigma_r \Sigma_g)^{1/2} = \begin{pmatrix}\sqrt{2}&0\\0&\sqrt{2}\end{pmatrix}$

$\text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}) = \text{tr}\left(\begin{pmatrix}3&0\\0&3\end{pmatrix} - \begin{pmatrix}2\sqrt{2}&0\\0&2\sqrt{2}\end{pmatrix}\right)$

$= (3 - 2\sqrt{2}) + (3 - 2\sqrt{2}) = 6 - 4\sqrt{2} \approx 6 - 5.657 = 0.343$

$\text{FID} = 1 + 0.343 = 1.343$

FID ≈ 1.343. The mean shift contributes 1.0 and the lower variance of generated samples contributes ~0.34. The generated distribution is shifted and less spread out than real data — visible in both terms.

Example 3: Bits Per Dimension

A model achieves $\log p(\mathbf{x}) = -4500$ (natural log) on a $64 \times 64 \times 3$ image. Compute BPD.

Solution:

$D = 64 \times 64 \times 3 = 12288$ dimensions.

$\text{BPD} = -\frac{-4500}{12288 \cdot \ln 2} = \frac{4500}{12288 \cdot 0.693147} = \frac{4500}{8517.4} \approx 0.528$

BPD ≈ 0.528. This is implausibly low for natural images (strong likelihood-based models on CIFAR-10 report roughly 2.5-3 BPD): it would mean the model can compress each pixel dimension to about 0.5 bits. Such numbers are only plausible for nearly deterministic data or when evaluating on the training set (overfitting).

Practice Problems

Show that if all $p(y|\mathbf{x})$ are identical (e.g., all generated images look like the same class), then $\text{IS} = 1$ regardless of $K$.

Answer:
If $p(y|\mathbf{x}) = \mathbf{q}$ for all $\mathbf{x}$, then $p(y) = \mathbf{q}$. Then $\text{KL}(p(y|\mathbf{x})\|p(y)) = \text{KL}(\mathbf{q}\|\mathbf{q}) = 0$ for all $\mathbf{x}$. $\text{IS} = \exp(0) = 1$. The model has zero diversity under IS — only one effective class.
Prove that when $\Sigma_r = \Sigma_g = \Sigma$, the covariance term of FID simplifies to zero. What does the FID then measure?

Answer:
$\text{tr}(\Sigma + \Sigma - 2(\Sigma\Sigma)^{1/2}) = \text{tr}(2\Sigma - 2\Sigma) = 0$. $\text{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2$. With equal covariances, FID reduces to the squared Euclidean distance between means — a measure of global distribution shift independent of diversity differences.
An autoregressive model achieves 3.2 BPD on CIFAR-10 test set but 1.1 BPD on the training set. What likely happened?

Answer:
The model is severely overfitting — memorizing training data. The gap of 2.1 BPD between train and test is enormous. The model has high likelihood on training data (low BPD) because it's essentially reproducing memorized patterns, but fails to generalize. This is why likelihood alone can be misleading: a lookup table of the training set achieves perfect likelihood but generates nothing new.
Model A has high precision (0.95) but low recall (0.30). Model B has lower precision (0.70) but higher recall (0.85). Which is preferable, and when?

Answer:
Model A produces very realistic images but only covers 30% of the data distribution (mode collapse). Model B is more diverse but less realistic. Preference depends on application: for photorealistic image synthesis where diversity is secondary (e.g., super-resolution), prefer A. For data augmentation where coverage matters (e.g., medical imaging), prefer B. Most applications want a balance; neither is clearly superior.
Why can't we compare the FID of a VAE with the FID of a GAN trained on the same dataset using just the FID number?

Answer:
Actually, we CAN compare them — that's the point of FID. FID is model-agnostic and only depends on the generated samples vs. real samples. But the question highlights that we shouldn't rely on FID alone: VAEs typically have high recall (cover modes) but low precision (blurry samples), while GANs typically have high precision but low recall (mode collapse). FID conflates these into one number — a VAE and GAN could have the same FID for completely different reasons. Always supplement FID with precision/recall.

Summary

Key takeaways:

No single metric fully captures generative model quality — always report multiple
IS = $\exp(\mathbb{E}[\text{KL}(p(y|\mathbf{x}) \,\|\, p(y))])$: peaked conditionals + uniform marginal = high score; no real-data comparison
FID = $\|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2 + \text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$: Gaussian distance in Inception feature space; lower is better; captures distributional shift
Log-likelihood / BPD: exact or lower-bound measure of how well the model compresses data; model-family dependent; high likelihood $\neq$ high sample quality
Precision = fraction of generated samples in real manifold (fidelity); Recall = fraction of real samples covered (diversity)
FID is the de facto standard for image generation; supplement with IS and precision/recall
Sample-size corrections are critical for FID (the estimate is biased with fewer than 10k samples)

Quiz

A generative model achieves IS = 8.2 on a 1000-class dataset. This means:
A) The model is 8.2× better than random
B) The effective number of clearly-generated classes is approximately 8.2
C) FID is approximately 8.2
D) The model's precision is 82%
Correct: B)
If you chose B: IS is $\exp(\mathbb{E}[\text{KL}])$. When $p(y|\mathbf{x})$ is one-hot and $p(y)$ is uniform, IS = $K$. An IS of 8.2 means the model covers ~8.2 "effective classes" with clear (high-confidence) images. It could be generating images across all 1000 classes with moderate confidence, or perfectly across 8-9 classes.
If you chose A: IS is not multiplicative; IS=1 is the baseline (no diversity).
If you chose C: IS and FID measure different things and have different scales.
If you chose D: IS doesn't measure precision in this sense.
The covariance term in FID uses the matrix square root $(\Sigma_r \Sigma_g)^{1/2}$ because:
A) It's computationally cheaper than other matrix operations
B) It arises from the Wasserstein-2 distance between two multivariate Gaussians
C) Inception features are always diagonal
D) The square root cancels with the trace
Correct: B)
If you chose B: The Wasserstein-2 distance between $\mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$ and $\mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$ has the closed form $\|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|^2 + \text{tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2})$. The FID expression $\text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$ is the same quantity: $\Sigma_r\Sigma_g$ and $\Sigma_r^{1/2}\Sigma_g\Sigma_r^{1/2}$ are similar matrices when $\Sigma_r$ is invertible, so they share eigenvalues and their square roots have equal trace. No commutativity assumption is needed.
If you chose A: The matrix square root is actually expensive ($O(d^3)$ via eigendecomposition).
If you chose C: Inception features have rich covariance structure.
If you chose D: It doesn't cancel — it contributes the key distributional information.
A VAE reports "log-likelihood = -120 nats" on MNIST. A PixelCNN reports "log-likelihood = -85 nats". Which is better?
A) PixelCNN, because -85 > -120
B) VAE, because | -120 | > | -85 |
C) Cannot directly compare because the VAE reports a lower bound (ELBO), not the true log-likelihood
D) They're equally good because nats are arbitrary
Correct: C)
If you chose C: $\log p_{\text{VAE}}(\mathbf{x}) \geq \text{ELBO} = -120$. The true log-likelihood could be -100, -90, or even -85; we don't know the ELBO gap $\text{KL}(q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}))$. PixelCNN gives the exact log-likelihood. The numbers are not comparable.
If you chose A: Only valid if both are exact likelihoods, which they aren't.
If you chose B: Higher log-likelihood is better.
If you chose D: Nats (natural log units) are well-defined; "arbitrary" doesn't apply.
FID is biased for small sample sizes because:
A) The Inception network needs many samples to activate
B) The sample mean and covariance are noisy estimates, and the quadratic/root terms introduce systematic bias
C) Small samples cannot represent multiple classes
D) FID requires exactly 50,000 samples by definition
Correct: B)
If you chose B: $\hat{\boldsymbol{\mu}}$ and $\hat{\Sigma}$ from $N$ samples are unbiased estimates, but $\|\hat{\boldsymbol{\mu}}_r - \hat{\boldsymbol{\mu}}_g\|^2$ is biased upward (Jensen's inequality: expectation of a convex function of an unbiased estimator). The trace term with the matrix square root also introduces bias. Bias corrections exist (e.g., the "FID infinity" extrapolation).
If you chose A: Inception activates for any valid image.
If you chose C: Even with one class, statistical bias exists.
If you chose D: 50k is a common convention, not a requirement.
Precision and recall for generative models differ from their classification counterparts in that:
A) They use the same formulas
B) Generative precision measures whether samples look realistic (are in the real data manifold); generative recall measures whether all modes are covered
C) Generative precision is always higher
D) Generative recall requires human evaluation
Correct: B)
If you chose B: In classification, precision = TP/(TP+FP) and recall = TP/(TP+FN). In generative evaluation, precision = fraction of generated samples that are realistic (near real data), and recall = fraction of real-data diversity that is covered by generated samples. Both are estimated via nearest-neighbor geometry in feature space.
If you chose A: Different definitions, though the names share the "quality vs. coverage" spirit.
If you chose C: Not true — VAE precision is typically lower than GAN precision despite higher recall.
If you chose D: Both are computed from Inception features, no human evaluation needed.

Next Steps

23-01 — Markov Decision Processes (MDPs) — Transitioning from generative models to reinforcement learning. MDPs formalize sequential decision-making with states, actions, rewards, and the Markov property. This begins Phase 23 (Reinforcement Learning Mathematics).

Pitfalls

Using a single metric as the definitive measure of generative quality: FID, IS, and BPD each capture different aspects. A model with excellent FID might have severe mode collapse (high precision, low recall). A model with great BPD might produce blurry samples. Always report at least 2-3 complementary metrics: FID + precision/recall for images; BPD + FID for likelihood-based models; never rely on a single number.
Computing FID with too few samples: FID is biased upward for small sample sizes — the sample mean and covariance estimates are noisy, and the quadratic/covariance-root terms introduce systematic bias. With fewer than 5,000 samples, FID can overstate the distance between distributions. Use at least 10,000 samples (50,000 is standard) and report the sample size alongside FID values. For small datasets, use KID (Kernel Inception Distance) which is unbiased.
Comparing log-likelihoods across different model families: A VAE's ELBO is a lower bound, not the true log-likelihood. An autoregressive model gives exact likelihood. A flow gives exact likelihood. Comparing a VAE's ELBO of -100 nats against a flow's exact likelihood of -95 nats is meaningless — the VAE might actually achieve -90 nats with a tighter bound. Only compare likelihood-based metrics within the same model family, or use sample-based metrics (FID) for cross-family comparison.
Confusing IS with a measure of sample-level diversity: A model generating one perfect image per ImageNet class achieves maximum IS but might be incapable of intra-class variation (e.g., it produces exactly one dog image for all dog prompts). IS measures class-level diversity, not sample-level diversity. Two models with the same IS (say, 50) could have wildly different intra-class coverage. Precision/recall metrics are needed to detect this.

Q7: Precision and recall for generative models are computed using:

A) The same formulas as classification precision/recall.
B) Nearest-neighbor geometry in a feature space: precision measures the fraction of generated samples within the real data manifold, and recall measures the fraction of real samples covered by the generated manifold.
C) The discriminator's output probabilities.
D) Human evaluation studies.

Answer and Explanations

Correct: B)
Generative precision/recall (Sajjadi et al., Kynkäänniemi et al.) estimate the real and generated manifolds via $k$-nearest-neighbor distances in Inception feature space. A generated sample is "precise" if it lies within the union of hyperspheres around real samples. A real sample is "recalled" if it lies within the union of hyperspheres around generated samples. This measures fidelity and coverage geometrically.
If you chose A: They share names but have different definitions; classification P/R uses true/false positives.
If you chose C: Discriminator outputs are used for GAN training, not precision/recall evaluation.
If you chose D: Human evaluation is a separate (and expensive) alternative, not how these metrics are computed.
