22-10: Evaluation of Generative Models
Phase: 22 (Generative Models Mathematics)
Subject: 22-10
Prerequisites: 22-01 through 22-09 (all generative model subjects), Phase 13 (Probability: expectations, divergences)
Next subject: 23-01: Markov Decision Processes (MDPs)
Learning Objectives
By the end of this subject, you will be able to:
- Define and compute the Inception Score (IS) and explain its relationship to conditional label entropy
- Define and compute the Fréchet Inception Distance (FID) and explain its Gaussian approximation assumption
- Interpret log-likelihood and bits per dimension (BPD) as evaluation metrics for generative models
- Define precision and recall for generative models via manifold coverage
- Critically evaluate which metric to use for a given generative modeling task, recognizing each metric's limitations
Core Content
Why Evaluation Is Hard
Generative models are fundamentally harder to evaluate than discriminative models:
- Discriminative models: accuracy, precision, recall, F1, all measured against ground-truth labels
- Generative models: no single "correct" output; two different samples can both be valid
We need metrics that measure at least two axes:
1. Fidelity (quality): do samples look realistic?
2. Diversity (coverage): does the model cover all modes of the data distribution?
Inception Score (IS)
Proposed by Salimans et al. (2016), the Inception Score uses a pretrained ImageNet classifier (Inception v3) to evaluate generated images.
Definition
Given generated samples $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, run each through the Inception classifier to get conditional label distributions $p(y \mid \mathbf{x})$. Also compute the marginal distribution $p(y) = \frac{1}{N}\sum_{i=1}^N p(y \mid \mathbf{x}_i)$.
$$\text{IS} = \exp\left(\mathbb{E}_{\mathbf{x}}\left[\text{KL}(p(y \mid \mathbf{x}) \;\|\; p(y))\right]\right)$$
Expanding:
$$\text{IS} = \exp\left(\frac{1}{N}\sum_{i=1}^{N} \sum_{y=1}^{K} p(y \mid \mathbf{x}_i) \log\frac{p(y \mid \mathbf{x}_i)}{p(y)}\right)$$
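The link between IS and label entropy follows from splitting the logarithm in the expanded form. The short derivation below is a standard identity written in this section's notation; $H(\cdot)$ denotes Shannon entropy, a symbol introduced here for illustration.

$$
\begin{aligned}
\log \text{IS}
&= \frac{1}{N}\sum_{i=1}^{N}\sum_{y=1}^{K} p(y \mid \mathbf{x}_i)\log p(y \mid \mathbf{x}_i)
 - \sum_{y=1}^{K}\left(\frac{1}{N}\sum_{i=1}^{N} p(y \mid \mathbf{x}_i)\right)\log p(y) \\
&= -\frac{1}{N}\sum_{i=1}^{N} H\big(p(y \mid \mathbf{x}_i)\big) + H\big(p(y)\big)
\end{aligned}
$$

So $\log \text{IS}$ is the marginal label entropy minus the average conditional label entropy. Since this difference lies between $0$ and $\log K$, it follows that $1 \le \text{IS} \le K$.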
⚠️ CRITICAL: Interpreting IS
- High IS requires two things simultaneously:
  - $p(y \mid \mathbf{x})$ is peaked (low entropy): the classifier is confident about what object is in each image, so samples are clear/realistic
  - $p(y)$ is uniform (high entropy): samples cover many different classes, so the set is diverse
- IS ranges from 1 to $K$ (number of classes; $K=1000$ for ImageNet)
  - IS = 1: all samples classified identically, so no diversity
  - IS = $K$: perfectly uniform distribution of clear images, so perfect diversity + quality
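The definition and its two extremes can be checked with a few lines of code. The following is a minimal sketch in plain Python, assuming the classifier outputs are already available as lists of probabilities; the function name `inception_score` and the toy inputs are illustrative, not part of any library.

```python
import math

def inception_score(cond_probs):
    """Inception Score from per-sample class distributions p(y | x_i)."""
    n = len(cond_probs)
    k = len(cond_probs[0])
    # Marginal p(y): average of the conditionals over all samples.
    marginal = [sum(p[c] for p in cond_probs) / n for c in range(k)]
    # Mean KL(p(y | x_i) || p(y)); zero-probability terms contribute 0 by convention.
    mean_kl = 0.0
    for p in cond_probs:
        mean_kl += sum(
            pc * math.log(pc / marginal[c]) for c, pc in enumerate(p) if pc > 0
        )
    mean_kl /= n
    return math.exp(mean_kl)

# Hypothetical toy inputs with K = 3 classes.
collapsed = [[0.7, 0.2, 0.1]] * 4  # identical conditionals: no diversity
ideal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot, evenly spread

print(inception_score(collapsed), inception_score(ideal))
```

Up to floating-point rounding, the collapsed case gives 1 and the ideal case gives $K = 3$, matching the stated range.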
Limitations of IS
- Classifier-dependent: only measures what the Inception model can detect; may miss artifacts invisible to Inception
- No mode-dropping detection: a model generating one perfect sample per class achieves maximum IS, missing intra-class diversity entirely
- Not applicable beyond natural images: requires a pretrained classifier on the target domain
- Sensitive to sample count: IS requires many samples ($\sim$50k) for stable estimates
- No comparison to real data: IS only evaluates generated samples in isolation
⚠️ CRITICAL: Fréchet Inception Distance (FID)
Heusel et al. (2017) proposed FID to address IS's inability to compare against real data. FID measures the distance between the distribution of real and generated images in a feature space.
Definition
Extract features from the penultimate Inception v3 layer (typically pool3, 2048-d), giving feature vectors $\{\mathbf{a}_i\}$ for real images and $\{\mathbf{b}_j\}$ for generated images. Assume both follow multivariate Gaussian distributions:
$$\mathbf{a} \sim \mathcal{N}(\boldsymbol{\mu}_r, \Sigma_r), \quad \mathbf{b} \sim \mathcal{N}(\boldsymbol{\mu}_g, \Sigma_g)$$
The Fréchet distance between these Gaussians (equal to their squared Wasserstein-2 distance) is:
$$\text{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|_2^2 + \text{tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$
where $(\Sigma_r \Sigma_g)^{1/2}$ is the principal matrix square root of $\Sigma_r \Sigma_g$.
Lower FID = better (real and generated distributions are closer).
Components of FID
- $\|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|_2^2$: distance between means; captures global shift (e.g., all generated images are too dark)
- $\text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$: covariance mismatch; captures diversity differences
If both distributions have the same mean ($\boldsymbol{\mu}_r = \boldsymbol{\mu}_g$) and identical covariances ($\Sigma_r = \Sigma_g = \Sigma$), then $(\Sigma\Sigma)^{1/2} = \Sigma$ and:
$$\text{FID} = 0 + \text{tr}(\Sigma + \Sigma - 2\Sigma) = 0$$
Computing the Matrix Square Root
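$(\Sigma_r \Sigma_g)^{1/2}$ is the principal square root of a product that is generally not symmetric, even though each factor is. If $\Sigma_r$ and $\Sigma_g$ commute (they share eigenvectors), then $(\Sigma_r \Sigma_g)^{1/2} = \Sigma_r^{1/2}\Sigma_g^{1/2}$, and each factor follows from an eigendecomposition. In general, use a general-purpose matrix square root routine or the Newton-Schulz iteration. Since only the trace is needed, one can also take the eigenvalues of the symmetric matrix $\Sigma_r^{1/2}\Sigma_g\Sigma_r^{1/2}$, which has the same eigenvalues as $\Sigma_r\Sigma_g$.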
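As an illustration of the formula, here is a minimal NumPy sketch. It assumes the features are already extracted into arrays with one row per sample, and it evaluates the trace term through the symmetric matrix $\Sigma_r^{1/2}\Sigma_g\Sigma_r^{1/2}$ so that only symmetric eigendecompositions are needed. The function names are hypothetical, and this is not the reference FID implementation.

```python
import numpy as np

def sqrtm_psd(mat):
    """Principal square root of a symmetric positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # remove tiny negative eigenvalues from round-off
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    mu_r, mu_g = np.asarray(mu_r, float), np.asarray(mu_g, float)
    sigma_r, sigma_g = np.asarray(sigma_r, float), np.asarray(sigma_g, float)
    # tr((S_r S_g)^{1/2}) is the sum of square roots of the eigenvalues of
    # S_r^{1/2} S_g S_r^{1/2}, which is symmetric PSD.
    root_r = sqrtm_psd(sigma_r)
    inner = root_r @ sigma_g @ root_r
    inner = (inner + inner.T) / 2.0  # symmetrize against round-off
    eigvals = np.clip(np.linalg.eigvalsh(inner), 0.0, None)
    trace_sqrt = np.sqrt(eigvals).sum()
    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    cov_term = float(np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)
    return mean_term + cov_term

def fid_from_features(real_feats, gen_feats):
    """Fit a Gaussian to each feature set (rows = samples) and compare them."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)

# Toy 2-d Gaussians (hypothetical values chosen for easy hand-checking).
print(frechet_distance([0, 0], 2 * np.eye(2), [1, 0], np.eye(2)))
```

For real Inception features (2048-d), the covariance estimate needs many more samples than dimensions to be well conditioned, which is one reason large sample counts are standard.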
Limitations of FID
- Gaussian assumption: feature distributions aren't truly Gaussian, so FID can be misleading
- Sample count sensitivity: FID is biased for small sample sizes ($<$10k); bias correction needed
- Inception features: domain-specific; only works for natural images
- Single-number summary: collapses all distributional differences into one scalar
- Doesn't distinguish fidelity vs. diversity failures: both poor quality and mode collapse increase FID
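The sample-size bias can be seen in a small simulation. The sketch below assumes toy 16-dimensional standard normal "features" rather than real Inception activations, and draws both sets from the same distribution, so the true distance is zero and any positive value is estimation bias. All names and sizes are illustrative choices.

```python
import numpy as np

def fid(a, b):
    """Frechet distance between Gaussians fitted to feature arrays a and b (rows = samples)."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a, cov_b = np.cov(a, rowvar=False), np.cov(b, rowvar=False)
    # Cross term via the symmetric form tr((cov_a^{1/2} cov_b cov_a^{1/2})^{1/2}).
    vals, vecs = np.linalg.eigh(cov_a)
    root_a = (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    inner = root_a @ cov_b @ root_a
    inner_vals = np.clip(np.linalg.eigvalsh((inner + inner.T) / 2.0), 0.0, None)
    cross = np.sqrt(inner_vals).sum()
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a) + np.trace(cov_b) - 2.0 * cross)

rng = np.random.default_rng(0)
dim = 16  # toy feature dimension (an assumption; Inception pool3 features are 2048-d)
results = {}
for n in (50, 500, 5000):
    # Both sets come from the SAME distribution, so the true distance is 0.
    real = rng.standard_normal((n, dim))
    fake = rng.standard_normal((n, dim))
    results[n] = fid(real, fake)
    print(n, round(results[n], 4))
```

The estimate stays above zero and shrinks as the sample count grows, which is why FID values computed at different sample sizes should not be compared directly.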
Log-Likelihood and Bits Per Dimension
For models providing explicit density estimates (VAEs, flows, autoregressive models, diffusion via probability flow ODE):
Log-Likelihood
$$\mathcal{LL} = \frac{1}{N}\sum_{i=1}^{N} \log p_\theta(\mathbf{x}_i)$$
Higher is better. However:
- VAEs give a lower bound (ELBO), not exact likelihood
- Autoregressive models give exact likelihood (chain rule)
- Diffusion models can give exact likelihood via the probability flow ODE (22-07)
- GANs and implicit models cannot compute likelihood
Bits Per Dimension (BPD)
$$\text{BPD} = -\frac{\mathcal{LL}}{D \cdot \ln 2}$$
where $\mathcal{LL}$ is measured in nats and $D$ is the number of dimensions (e.g., $32 \times 32 \times 3 = 3072$ for CIFAR-10). Lower BPD = better compression (model explains data with fewer bits).
Relationship to compression: BPD is the average number of bits needed to encode one dimension of the data under the model. Independent uniform 8-bit pixel values give $-\log_2(1/256) = 8$ BPD; a model that captures structure in the data achieves lower BPD.
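The conversion from nats to bits per dimension is a one-liner. The sketch below, with a hypothetical helper name `bits_per_dim`, also runs the uniform 8-bit sanity check described above.

```python
import math

def bits_per_dim(log_likelihood_nats, num_dims):
    """Convert a (mean) log-likelihood in nats into bits per dimension."""
    return -log_likelihood_nats / (num_dims * math.log(2))

# Sanity check: independent uniform 8-bit values have log p(x) = D * ln(1/256).
dims = 32 * 32 * 3
uniform_ll = dims * math.log(1 / 256)
print(bits_per_dim(uniform_ll, dims))
```

The uniform baseline comes out at 8 BPD up to floating-point rounding; any model worth reporting should land below it.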
Limitations of Likelihood-Based Metrics
- Not comparable across model families: a VAE ELBO is a bound; an autoregressive model gives exact likelihood
- High likelihood $\neq$ high sample quality: models can memorize training data (overfit) and achieve low BPD on the training set while producing poor samples
- Insensitive to manifold structure: a model placing mass everywhere but most in the right places can have good likelihood but blurry samples
- Numerical issues: likelihoods can be $-\infty$ for out-of-distribution points, making averaging unstable
Precision and Recall for Generative Models
Sajjadi et al. (2018) and Kynkäänniemi et al. (2019) proposed precision/recall metrics to disentangle fidelity and diversity.
Manifold-Based Definition
Define the real and generated manifolds via $k$-nearest-neighbor spheres in feature space.
For each generated sample $\mathbf{b}_j$:
- Precision: $\mathbf{b}_j$ is "precise" if it falls within the manifold of real data. Precision is the fraction of generated samples that are realistic.

For each real sample $\mathbf{a}_i$:
- Recall: $\mathbf{a}_i$ is "covered" if it falls within the manifold of generated data. Recall is the fraction of real samples the model can produce.
Formally, using Inception features and nearest neighbors:
$$\text{Precision} = \frac{1}{M}\sum_{j=1}^{M} \mathbf{1}\left[\mathbf{b}_j \in \bigcup_{i=1}^{N} B(\mathbf{a}_i, \text{NN}_k(\mathbf{a}_i))\right]$$
$$\text{Recall} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left[\mathbf{a}_i \in \bigcup_{j=1}^{M} B(\mathbf{b}_j, \text{NN}_k(\mathbf{b}_j))\right]$$
where $B(\mathbf{x}, r)$ is the ball of radius $r$ centered at $\mathbf{x}$, and $\text{NN}_k(\mathbf{x})$ is the distance from $\mathbf{x}$ to its $k$-th nearest neighbor within its own set (real or generated).
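The two formulas translate directly into nearest-neighbor computations. The following NumPy sketch uses brute-force pairwise distances, so it is only suitable for small feature sets. The function names and the default `k=3` are illustrative assumptions, and each set must contain more than `k` points.

```python
import numpy as np

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbor within the same set."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # After sorting each row, column 0 is the point itself (distance 0).
    return np.sort(dists, axis=1)[:, k]

def covered_fraction(queries, support, k):
    """Fraction of `queries` lying inside at least one k-NN ball centered on `support`."""
    radii = knn_radii(support, k)
    dists = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return float(np.mean((dists <= radii[None, :]).any(axis=1)))

def precision_recall(real_feats, gen_feats, k=3):
    precision = covered_fraction(gen_feats, real_feats, k)  # generated inside real manifold
    recall = covered_fraction(real_feats, gen_feats, k)     # real inside generated manifold
    return precision, recall

# Toy 1-d "features": real data spans 0..9, generated data clusters near 0.
real = np.arange(10, dtype=float).reshape(-1, 1)
gen = np.array([[0.0], [0.1], [0.2]])
print(precision_recall(real, gen, k=1))
```

On this toy input every generated point lies inside a real ball, while only one of the ten real points is covered by a generated ball: the high-precision, low-recall signature of mode collapse.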
⚠️ CRITICAL: Interpreting Precision and Recall
| Precision | Recall | Interpretation |
|---|---|---|
| High | High | Ideal: realistic AND diverse |
| High | Low | High quality, mode collapse (e.g., BigGAN) |
| Low | High | Diverse but low quality (blurry samples) |
| Low | Low | Poor model (bad quality + mode dropping) |
Improved Precision and Recall (Kynkäänniemi et al.)
Uses explicit manifold estimation: for each point, compute the radius to its $k$-th nearest neighbor. Points within the union of hyperspheres are considered on-manifold. The fraction of generated samples within the real manifold = precision. The fraction of real samples within the generated manifold = recall.
Comparing Metrics
| Metric | Measures | Requires | Weakness |
|---|---|---|---|
| IS | Quality + diversity (class-level) | Pretrained classifier | No comparison to real data; only class diversity |
| FID | Distributional distance | Pretrained classifier features | Gaussian assumption; sample-size bias |
| BPD / LL | Compression quality | Explicit density model | Not comparable across model families |
| Precision | Sample quality (fidelity) | Pretrained classifier features | Choice of $k$, feature space |
| Recall | Coverage (diversity) | Pretrained classifier features | Same as above |
| Kernel Inception Distance (KID) | Maximum mean discrepancy | Pretrained features + kernel | Choice of kernel |
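KID, the last row of the table, is a squared maximum mean discrepancy between the two feature sets. The sketch below assumes the cubic polynomial kernel $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^\top \mathbf{y}/d + 1)^3$ commonly associated with KID, and uses the standard unbiased MMD estimator. Function names are hypothetical, and practical implementations usually average over several subsets, which is omitted here.

```python
import numpy as np

def poly_kernel(x, y):
    """Cubic polynomial kernel k(x, y) = (x . y / d + 1)^3 on row-wise features."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid(real_feats, gen_feats):
    """Unbiased estimate of the squared MMD between two feature sets."""
    m, n = len(real_feats), len(gen_feats)
    k_rr = poly_kernel(real_feats, real_feats)
    k_gg = poly_kernel(gen_feats, gen_feats)
    k_rg = poly_kernel(real_feats, gen_feats)
    # Drop the diagonal (i = j) terms of the within-set sums to keep the estimate unbiased.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    return float(term_rr + term_gg - 2.0 * k_rg.mean())

rng = np.random.default_rng(0)
a = rng.standard_normal((500, 8))  # toy "real" features (assumed, not Inception outputs)
b = rng.standard_normal((500, 8))  # toy "generated" features from the same distribution
print(kid(a, b), kid(a, b + 1.0))
```

Because the estimator is unbiased, it can come out slightly negative when the two distributions match; a clearly positive value, as with the shifted set, indicates a distributional difference.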
Practical Recommendations
- Image generation: Report FID (primary) + IS (secondary) + Precision/Recall
- Likelihood-based models: Report BPD + FID
- Audio generation: Fréchet Audio Distance (FAD), analogous to FID
- Text generation: Perplexity, BLEU/ROUGE (task-specific), human evaluation
- Always report multiple metrics: no single number captures all aspects of generative quality
Key Terms
- Always report multiple metrics
- Audio generation
- Classifier-dependent
- Discriminative models
- Diversity
- FID is the de facto standard
- Fidelity
- Gaussian assumption
- Generative models
- High IS
- Image generation
- Inception features
- Insensitive to manifold structure
- Likelihood-based models
- No comparison to real data
Worked Examples
Example 1: Computing IS from Classifier Outputs
Three generated images produce classifier outputs (classes: dog, cat, bird):
- $\mathbf{x}_1$: $p(y|\mathbf{x}_1) = [0.9, 0.1, 0.0]$
- $\mathbf{x}_2$: $p(y|\mathbf{x}_2) = [0.1, 0.8, 0.1]$
- $\mathbf{x}_3$: $p(y|\mathbf{x}_3) = [0.0, 0.1, 0.9]$
Compute the Inception Score.
Solution:
Step 1: Marginal $p(y) = \frac{1}{3}([0.9,0.1,0.0] + [0.1,0.8,0.1] + [0.0,0.1,0.9])$ $= [1.0/3, 1.0/3, 1.0/3] = [0.333, 0.333, 0.333]$
Step 2: Per-sample KL divergence: $\text{KL}_1 = 0.9\log(0.9/0.333) + 0.1\log(0.1/0.333) + 0.0\log(0.0/0.333)$ $= 0.9 \cdot 0.993 + 0.1 \cdot (-1.204) + 0 = 0.894 - 0.120 = 0.774$
$\text{KL}_2 = 0.1\log(0.1/0.333) + 0.8\log(0.8/0.333) + 0.1\log(0.1/0.333)$ $= 0.1(-1.204) + 0.8(0.875) + 0.1(-1.204) = -0.120 + 0.700 - 0.120 = 0.460$
$\text{KL}_3 = 0.0(\ldots) + 0.1(-1.204) + 0.9(0.993) = -0.120 + 0.894 = 0.774$
Step 3: $\text{IS} = \exp\left(\frac{0.774 + 0.460 + 0.774}{3}\right) = \exp(0.669) = 1.95$
Answer: IS ≈ 1.95, against a maximum of 3 for three classes. The model generates fairly clear images (peaked conditionals) covering all three classes (uniform marginal); the residual uncertainty in each prediction keeps the score below the maximum. For scale, with the 1000-class ImageNet classifier, strong CIFAR-10 models typically score around 8-10, while strong ImageNet models score far higher.
Example 2: Computing FID
Real features: $\boldsymbol{\mu}_r = (0, 0)$, $\Sigma_r = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$
Generated features: $\boldsymbol{\mu}_g = (1, 0)$, $\Sigma_g = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
Compute FID.
Solution:
Mean term: $\|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2 = (0-1)^2 + (0-0)^2 = 1$
Covariance term: $\Sigma_r \Sigma_g = \begin{pmatrix}2&0\\0&2\end{pmatrix}\begin{pmatrix}1&0\\0&1\end{pmatrix} = \begin{pmatrix}2&0\\0&2\end{pmatrix}$
$(\Sigma_r \Sigma_g)^{1/2} = \begin{pmatrix}\sqrt{2}&0\\0&\sqrt{2}\end{pmatrix}$
$\text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}) = \text{tr}\left(\begin{pmatrix}3&0\\0&3\end{pmatrix} - \begin{pmatrix}2\sqrt{2}&0\\0&2\sqrt{2}\end{pmatrix}\right)$
$= (3 - 2\sqrt{2}) + (3 - 2\sqrt{2}) = 6 - 4\sqrt{2} \approx 6 - 5.657 = 0.343$
$\text{FID} = 1 + 0.343 = 1.343$
Answer: FID ≈ 1.343. The mean shift contributes 1.0 and the lower variance of generated samples contributes ~0.34. The generated distribution is shifted and less spread out than real data, and each effect is visible in its own term.
Example 3: Bits Per Dimension
A model achieves $\log p(\mathbf{x}) = -4500$ (natural log) on a $64 \times 64 \times 3$ image. Compute BPD.
Solution:
$D = 64 \times 64 \times 3 = 12288$ dimensions.
$\text{BPD} = -\frac{-4500}{12288 \cdot \ln 2} = \frac{4500}{12288 \cdot 0.693147} = \frac{4500}{8517.4} \approx 0.528$
Answer: BPD ≈ 0.528. This is implausibly low for natural images (CIFAR-10 SOTA ≈ 2.8): it would mean the model can compress each pixel dimension to ~0.5 bits. Such numbers are only possible for nearly-deterministic data or when evaluating on the training set (overfitting).
Practice Problems
- Show that if all $p(y|\mathbf{x})$ are identical (e.g., all generated images look like the same class), then $\text{IS} = 1$ regardless of $K$.

  Answer: If $p(y|\mathbf{x}) = \mathbf{q}$ for all $\mathbf{x}$, then $p(y) = \mathbf{q}$. Then $\text{KL}(p(y|\mathbf{x})\|p(y)) = \text{KL}(\mathbf{q}\|\mathbf{q}) = 0$ for all $\mathbf{x}$, so $\text{IS} = \exp(0) = 1$. The model has zero diversity under IS: only one effective class.
- Prove that when $\Sigma_r = \Sigma_g = \Sigma$, the covariance term of FID simplifies to zero. What does the FID then measure?

  Answer: Because $\Sigma$ is symmetric positive semidefinite, $(\Sigma\Sigma)^{1/2} = \Sigma$, so $\text{tr}(\Sigma + \Sigma - 2(\Sigma\Sigma)^{1/2}) = \text{tr}(2\Sigma - 2\Sigma) = 0$ and $\text{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2$. With equal covariances, FID reduces to the squared Euclidean distance between means, a measure of global distribution shift independent of diversity differences.
- An autoregressive model achieves 3.2 BPD on the CIFAR-10 test set but 1.1 BPD on the training set. What likely happened?

  Answer: The model is severely overfitting, memorizing training data. The gap of 2.1 BPD between train and test is enormous. The model has high likelihood on training data (low BPD) because it is essentially reproducing memorized patterns, but it fails to generalize. This is why likelihood alone can be misleading: a lookup table of the training set achieves perfect training likelihood but generates nothing new.
- Model A has high precision (0.95) but low recall (0.30). Model B has lower precision (0.70) but higher recall (0.85). Which is preferable, and when?

  Answer: Model A produces very realistic images but only covers 30% of the data distribution (mode collapse). Model B is more diverse but less realistic. Preference depends on application: for photorealistic image synthesis where diversity is secondary (e.g., super-resolution), prefer A. For data augmentation where coverage matters (e.g., medical imaging), prefer B. Most applications want a balance; neither is clearly superior.
- Why can't we compare the FID of a VAE with the FID of a GAN trained on the same dataset using just the FID number?

  Answer: Actually, we CAN compare them; that is the point of FID. FID is model-agnostic and only depends on the generated samples vs. real samples. But the question highlights that we shouldn't rely on FID alone: VAEs typically have high recall (cover modes) but low precision (blurry samples), while GANs typically have high precision but low recall (mode collapse). FID conflates these into one number, so a VAE and a GAN could have the same FID for completely different reasons. Always supplement FID with precision/recall.
Summary
Key takeaways:
- No single metric fully captures generative model quality; always report multiple
- IS = $\exp(\mathbb{E}[\text{KL}(p(y|\mathbf{x}) \| p(y))])$: peaked conditionals + uniform marginal = high score; no real-data comparison
- FID = $\|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2 + \text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$: Gaussian distance in Inception feature space; lower is better; captures distributional shift
- Log-likelihood / BPD: exact or lower-bound measure of how well the model compresses data; model-family dependent; high likelihood $\neq$ high sample quality
- Precision = fraction of generated samples in real manifold (fidelity); Recall = fraction of real samples covered (diversity)
- FID is the de facto standard for image generation; supplement with IS and precision/recall
- Sample-size corrections are critical for FID (biased with < 10k samples)
Quiz
- A generative model achieves IS = 8.2 on a 1000-class dataset. This means:
  - A) The model is 8.2× better than random
  - B) The effective number of clearly-generated classes is approximately 8.2
  - C) FID is approximately 8.2
  - D) The model's precision is 82%

  Correct: B)
  - If you chose B: IS is $\exp(\mathbb{E}[\text{KL}])$. When $p(y|\mathbf{x})$ is one-hot and $p(y)$ is uniform, IS = $K$. An IS of 8.2 means the model covers ~8.2 "effective classes" with clear (high-confidence) images. It could be generating images across all 1000 classes with moderate confidence, or perfectly across 8-9 classes.
  - If you chose A: IS is not multiplicative; IS=1 is the baseline (no diversity).
  - If you chose C: IS and FID measure different things and have different scales.
  - If you chose D: IS doesn't measure precision in this sense.
- The covariance term in FID uses the matrix square root $(\Sigma_r \Sigma_g)^{1/2}$ because:
  - A) It's computationally cheaper than other matrix operations
  - B) It arises from the Wasserstein-2 distance between two multivariate Gaussians
  - C) Inception features are always diagonal
  - D) The square root cancels with the trace

  Correct: B)
  - If you chose B: The squared Wasserstein-2 distance between $\mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$ and $\mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$ has the closed form $\|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|^2 + \text{tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2})$. Because $\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}$ and $\Sigma_1\Sigma_2$ have the same nonnegative eigenvalues, $\text{tr}((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}) = \text{tr}((\Sigma_1\Sigma_2)^{1/2})$, so the FID form $\text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$ is the same quantity whether or not the covariances commute.
  - If you chose A: The matrix square root is actually expensive ($O(d^3)$ via eigendecomposition).
  - If you chose C: Inception features have rich covariance structure.
  - If you chose D: It doesn't cancel; it contributes the key distributional information.
- A VAE reports "log-likelihood = -120 nats" on MNIST. A PixelCNN reports "log-likelihood = -85 nats". Which is better?
  - A) PixelCNN, because -85 > -120
  - B) VAE, because | -120 | > | -85 |
  - C) Cannot directly compare because the VAE reports a lower bound (ELBO), not the true log-likelihood
  - D) They're equally good because nats are arbitrary

  Correct: C)
  - If you chose C: $\log p_{\text{VAE}}(\mathbf{x}) \geq \text{ELBO} = -120$. The true log-likelihood could be -100, -90, or even -85; we don't know the ELBO gap $\text{KL}(q(z|\mathbf{x}) \| p(z|\mathbf{x}))$. PixelCNN gives the exact log-likelihood. The numbers are not comparable.
  - If you chose A: Only valid if both are exact likelihoods, which they aren't.
  - If you chose B: Higher log-likelihood is better.
  - If you chose D: Nats (natural log units) are well-defined; "arbitrary" doesn't apply.
- FID is biased for small sample sizes because:
  - A) The Inception network needs many samples to activate
  - B) The sample mean and covariance are noisy estimates, and the quadratic/root terms introduce systematic bias
  - C) Small samples cannot represent multiple classes
  - D) FID requires exactly 50,000 samples by definition

  Correct: B)
  - If you chose B: $\hat{\boldsymbol{\mu}}$ and $\hat{\Sigma}$ from $N$ samples are unbiased estimates, but $\|\hat{\boldsymbol{\mu}}_r - \hat{\boldsymbol{\mu}}_g\|^2$ is biased upward (Jensen's inequality: expectation of a convex function of an unbiased estimator). The trace term with the matrix square root also introduces bias. Bias corrections exist (e.g., the "FID infinity" extrapolation).
  - If you chose A: Inception activates for any valid image.
  - If you chose C: Even with one class, statistical bias exists.
  - If you chose D: 50k is a common convention, not a requirement.
- Precision and recall for generative models differ from their classification counterparts in that:
  - A) They use the same formulas
  - B) Generative precision measures whether samples look realistic (are in the real data manifold); generative recall measures whether all modes are covered
  - C) Generative precision is always higher
  - D) Generative recall requires human evaluation

  Correct: B)
  - If you chose B: In classification, precision = TP/(TP+FP) and recall = TP/(TP+FN). In generative evaluation, precision = fraction of generated samples that are realistic (near real data), and recall = fraction of real-data diversity that is covered by generated samples. Both are estimated via nearest-neighbor geometry in feature space.
  - If you chose A: Different definitions, though the names share the "quality vs. coverage" spirit.
  - If you chose C: Not true; VAE precision is typically lower than GAN precision despite higher recall.
  - If you chose D: Both are computed from Inception features, no human evaluation needed.
Next Steps
23-01: Markov Decision Processes (MDPs). Transitioning from generative models to reinforcement learning. MDPs formalize sequential decision-making with states, actions, rewards, and the Markov property. This begins Phase 23 (Reinforcement Learning Mathematics).
Pitfalls
- Using a single metric as the definitive measure of generative quality: FID, IS, and BPD each capture different aspects. A model with excellent FID might have severe mode collapse (high precision, low recall). A model with great BPD might produce blurry samples. Always report at least 2-3 complementary metrics: FID + precision/recall for images; BPD + FID for likelihood-based models; never rely on a single number.
- Computing FID with too few samples: FID is biased upward for small sample sizes because the sample mean and covariance estimates are noisy, and the quadratic/covariance-root terms introduce systematic bias. With fewer than 5,000 samples, FID can overstate the distance between distributions. Use at least 10,000 samples (50,000 is standard) and report the sample size alongside FID values. For small datasets, use KID (Kernel Inception Distance), which has an unbiased estimator.
- Comparing log-likelihoods across different model families: A VAE's ELBO is a lower bound, not the true log-likelihood. An autoregressive model gives exact likelihood. A flow gives exact likelihood. Comparing a VAE's ELBO of -100 nats against a flow's exact likelihood of -95 nats is meaningless, since the VAE's true log-likelihood might be -90 nats, well above its bound. Only compare likelihood-based metrics within the same model family, or use sample-based metrics (FID) for cross-family comparison.
- Confusing IS with a measure of sample-level diversity: A model generating exactly one perfect image per ImageNet class achieves maximum IS, but might be incapable of intra-class variation (e.g., produces exactly one dog image for all dog prompts). IS measures class-level diversity, not sample-level diversity. Two models that both score IS = 50 could have wildly different intra-class coverage. Precision/recall metrics are needed to detect this.
Q7: Precision and recall for generative models are computed using:
- A) The same formulas as classification precision/recall.
- B) Nearest-neighbor geometry in a feature space: precision measures the fraction of generated samples within the real data manifold, and recall measures the fraction of real samples covered by the generated manifold.
- C) The discriminator's output probabilities.
- D) Human evaluation studies.