22-02: Variational Autoencoders (VAEs)

Phase: 22 (Generative Models)
Mathematics Subject: 22-02
Prerequisites: 22-01: Autoencoders, Phase 16–17 (Neural Networks), Phase 13 (Probability)
Next subject: 22-03: Generative Adversarial Networks (GANs)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the Evidence Lower Bound (ELBO) from first principles using Jensen's inequality
  2. Explain the reparameterization trick and prove it enables unbiased gradient estimates
  3. Formulate the VAE as a latent variable model with encoder $q_\phi(\mathbf{z}|\mathbf{x})$ and decoder $p_\theta(\mathbf{x}|\mathbf{z})$
  4. Analyze the $\beta$-VAE and its effect on disentanglement
  5. Connect the VAE loss to both reconstruction and KL regularization

Core Content

From Autoencoders to Probabilistic Models

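Standard autoencoders learn a deterministic mapping $\mathbf{x} \to \mathbf{z} \to \hat{\mathbf{x}}$. VAEs make this probabilistic: the encoder outputs a distribution $q_\phi(\mathbf{z}|\mathbf{x})$ over latent codes instead of a single point, the latent code is given a prior $p(\mathbf{z})$, and the decoder defines a likelihood $p_\theta(\mathbf{x}|\mathbf{z})$.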

The ELBO Derivation

We want to maximize the marginal log-likelihood $\log p_\theta(\mathbf{x})$. It's intractable to compute directly because it requires integrating over $\mathbf{z}$:

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) d\mathbf{z}$$

Using the variational distribution $q_\phi(\mathbf{z}|\mathbf{x})$:

$$\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) d\mathbf{z}$$

$$= \log \int q_\phi(\mathbf{z}|\mathbf{x}) \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})} d\mathbf{z}$$

⚠️ CRITICAL (Jensen's inequality): Since $\log$ is concave, $\log \mathbb{E}[Y] \geq \mathbb{E}[\log Y]$. The integral above is an expectation under $q_\phi(\mathbf{z}|\mathbf{x})$, so applying this gives:

$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]$$

Expanding the RHS:

$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{q_\phi(\mathbf{z}|\mathbf{x})}{p(\mathbf{z})}\right]$$

$$= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$$

This lower bound is the ELBO (Evidence Lower BOund):

$$\mathcal{L}(\theta, \phi; \mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]}_{\text{Reconstruction term}} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))}_{\text{Regularization term}}$$

Maximizing the ELBO simultaneously:

  1. Maximizes reconstruction quality (first term)
  2. Keeps the approximate posterior close to the prior (second term)

Decomposing the ELBO

We can also write:

$$\log p_\theta(\mathbf{x}) = \mathcal{L}(\theta, \phi; \mathbf{x}) + D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$$

Since KL divergence is always $\geq 0$, this confirms $\mathcal{L} \leq \log p_\theta(\mathbf{x})$. The gap is exactly the KL divergence between the approximate and true posteriors, i.e., our variational approximation error.
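The decomposition can be checked numerically on a model small enough to solve by hand. The sketch below assumes a toy 1D model that is not part of this lesson: prior $p(z) = \mathcal{N}(0, 1)$ and likelihood $p(x|z) = \mathcal{N}(z, 1)$, which give $p(x) = \mathcal{N}(0, 2)$ and the exact posterior $p(z|x) = \mathcal{N}(x/2, 1/2)$. The helper names `log_normal`, `kl_normal`, and `elbo` are illustrative.

```python
import math

# Toy model (an assumption for illustration, not from the lesson):
#   prior      p(z)   = N(0, 1)
#   likelihood p(x|z) = N(z, 1)
# so p(x) = N(0, 2) and the exact posterior is p(z|x) = N(x/2, 1/2).

def log_normal(x, mean, var):
    """Log density of a 1D Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def kl_normal(m_q, var_q, m_p, var_p):
    """KL( N(m_q, var_q) || N(m_p, var_p) ) for 1D Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (m_q - m_p) ** 2) / var_p - 1.0)

def elbo(x, m, var):
    """Closed-form ELBO for q(z|x) = N(m, var) in the toy model."""
    # E_q[log p(x|z)] uses E_q[(x - z)^2] = (x - m)^2 + var
    recon = -0.5 * (math.log(2 * math.pi) + (x - m) ** 2 + var)
    return recon - kl_normal(m, var, 0.0, 1.0)

x = 1.0
log_px = log_normal(x, 0.0, 2.0)

# Three candidate posteriors; the last one is the exact posterior.
for m, var in [(0.0, 1.0), (0.3, 0.64), (x / 2, 0.5)]:
    gap = kl_normal(m, var, x / 2, 0.5)  # KL(q || true posterior)
    print(f"q=N({m}, {var}): ELBO={elbo(x, m, var):.4f}  gap={gap:.4f}  "
          f"ELBO+gap={elbo(x, m, var) + gap:.4f}  log p(x)={log_px:.4f}")
```

In this toy setting the sum of ELBO and gap matches $\log p(x)$ for every choice of $q$, and the gap vanishes only when $q$ equals the exact posterior.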

Gaussian VAE in Detail

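The standard VAE uses Gaussian distributions: a prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I)$ and an encoder $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x})))$ whose mean and variance are network outputs.

The KL divergence between two Gaussians has a closed form. For $q = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p = \mathcal{N}(\mathbf{0}, I)$, with $h$ the latent dimension: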

$$D_{\text{KL}}(q \| p) = \frac{1}{2}\sum_{j=1}^{h}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$

Derivation: $$D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu}, \Sigma) \| \mathcal{N}(\mathbf{0}, I)) = \frac{1}{2}\left[\text{tr}(\Sigma) + \boldsymbol{\mu}^T\boldsymbol{\mu} - h - \log\det\Sigma\right]$$

For diagonal $\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_h^2)$: $\text{tr}(\Sigma) = \sum \sigma_j^2$, $\log\det\Sigma = \sum \log \sigma_j^2$, giving the element-wise formula.
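The element-wise formula translates directly into code. This is a minimal sketch in plain Python; the function name `gaussian_kl` and the choice to pass variances (rather than log-variances) are assumptions made for illustration.

```python
import math

def gaussian_kl(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))

# Posterior equal to the prior: every term vanishes.
print(gaussian_kl([0.0, 0.0], [1.0, 1.0]))

# Mean shifted by 1 in one dimension, unit variances: only mu_1^2 contributes.
print(gaussian_kl([1.0, 0.0], [1.0, 1.0]))

# A narrow posterior is penalized through the -log(var) term.
print(gaussian_kl([0.0], [0.01]))
```

Practical implementations usually have the encoder output $\log \sigma^2$ and use `exp` inside the formula, which avoids taking the log of a value that may underflow.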

⚠️ The Reparameterization Trick

The ELBO requires gradients through sampling $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$. Direct sampling breaks gradient flow. The solution: express $\mathbf{z}$ as a deterministic function of the encoder outputs plus independent noise:

$$\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)$$

This is critical: the randomness is in $\boldsymbol{\varepsilon}$, which does not depend on $\phi$. Gradient flows through $\boldsymbol{\mu}_\phi$ and $\boldsymbol{\sigma}_\phi$ deterministically.

The gradient estimator:

$$\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})] = \mathbb{E}_{\boldsymbol{\varepsilon} \sim \mathcal{N}(0,I)}[\nabla_\phi f(\boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\varepsilon})]$$

This is an unbiased estimator of the true gradient and typically has lower variance than alternative estimators (e.g., score function / REINFORCE estimator).
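Both claims can be illustrated with a small Monte Carlo experiment. The sketch below assumes a toy objective that is not from the lesson, $f(z) = z^2$ with $z \sim \mathcal{N}(\mu, \sigma^2)$, for which $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ and the exact gradients are $2\mu$ and $2\sigma$. It compares the reparameterized (pathwise) estimator with the score-function estimator for $\mu$.

```python
import random
import statistics

# Assumed toy objective: f(z) = z^2, z ~ N(mu, sigma^2).
# Exact gradients of E[f(z)] are 2*mu and 2*sigma.
mu, sigma = 2.0, 0.5
rng = random.Random(0)

path_mu, path_sigma, score_mu = [], [], []
for _ in range(200_000):
    eps = rng.gauss(0.0, 1.0)        # noise that does not depend on (mu, sigma)
    z = mu + sigma * eps             # reparameterized sample
    df_dz = 2.0 * z                  # f'(z)
    path_mu.append(df_dz)            # chain rule with dz/dmu = 1
    path_sigma.append(df_dz * eps)   # chain rule with dz/dsigma = eps
    # Score-function (REINFORCE) estimator for mu: f(z) * d/dmu log q(z)
    score_mu.append(z * z * (z - mu) / sigma ** 2)

grad_mu = statistics.fmean(path_mu)
grad_sigma = statistics.fmean(path_sigma)
grad_mu_score = statistics.fmean(score_mu)

print("pathwise d/dmu   :", grad_mu, " exact:", 2 * mu)
print("pathwise d/dsigma:", grad_sigma, " exact:", 2 * sigma)
print("score    d/dmu   :", grad_mu_score, " exact:", 2 * mu)
print("per-sample variance, pathwise vs score:",
      statistics.variance(path_mu), statistics.variance(score_mu))
```

Both estimators target the same gradient, but for this objective the per-sample variance of the score-function estimator is far larger, which is the usual motivation for preferring reparameterization when it is available.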

$\beta$-VAE: Controlling Disentanglement

The $\beta$-VAE introduces a weight on the KL term:

$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \beta \cdot D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$$

The trade-off: $\beta = 1$ recovers the standard ELBO, while $\beta > 1$ tends to improve disentanglement but may hurt reconstruction (the "information bottleneck" becomes tighter).
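In training code the objective is usually written in minimization form, as the negative of $\mathcal{L}_\beta$. A minimal sketch, where the function name `beta_vae_loss` and the numeric values are hypothetical and chosen only to show how $\beta$ reweights the two terms:

```python
def beta_vae_loss(recon_nll, kl, beta=1.0):
    """Negative beta-ELBO for one example: reconstruction NLL + beta * KL.

    recon_nll is -E_q[log p(x|z)] (a non-negative loss for typical decoders),
    kl is D_KL(q(z|x) || p(z)).
    """
    return recon_nll + beta * kl

# Hypothetical per-example values, for illustration only.
recon_nll, kl = 120.0, 8.0
for beta in (0.5, 1.0, 4.0):
    print(f"beta={beta}: loss={beta_vae_loss(recon_nll, kl, beta)}")
```

Loss values obtained with different $\beta$ are not comparable with each other, since each $\beta$ defines a different objective.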

Pitfalls

| Pitfall | Why It Happens | Fix |
|---|---|---|
| Posterior collapse | Decoder ignores $\mathbf{z}$; KL goes to 0 | KL annealing, free bits, $\beta < 1$ |
| Blurry samples | Gaussian decoder assumes pixel independence | Use hierarchical VAEs, VQ-VAE, or diffusion decoders |
| High variance gradients | Monte Carlo estimation of expectation | Use more MC samples per training step |
| Numerical instability | $\log \sigma^2$ can go to $-\infty$ | Clamp $\sigma^2$ to $[10^{-4}, 10^4]$ |
| Mode collapse in generation | Latent prior doesn't match aggregate posterior | Use more flexible priors (VampPrior, normalizing flow prior) |
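Three of the fixes listed above are short enough to sketch directly. The code below is illustrative only: the linear shape of the annealing schedule, the default values, and the function names `kl_weight`, `clamp_logvar`, and `free_bits_kl` are assumptions, not prescriptions from this lesson.

```python
import math

def kl_weight(step, warmup_steps, beta_max=1.0):
    """Linear KL annealing: ramp the KL weight from 0 to beta_max over warmup_steps."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)

def clamp_logvar(logvar, var_min=1e-4, var_max=1e4):
    """Clamp log(sigma^2) so that sigma^2 stays inside [var_min, var_max]."""
    return max(math.log(var_min), min(math.log(var_max), logvar))

def free_bits_kl(kl_per_dim, free_bits=0.5):
    """Free bits: each latent dimension is charged at least `free_bits` nats,
    so pushing a dimension's KL below that floor no longer lowers the loss."""
    return sum(max(free_bits, k) for k in kl_per_dim)

print([kl_weight(s, 1000) for s in (0, 250, 1000, 5000)])
print(clamp_logvar(-50.0), clamp_logvar(0.3))
print(free_bits_kl([0.1, 2.0]))
```

Clamping the log-variance is equivalent to clamping $\sigma^2$ itself and is convenient when the encoder outputs $\log \sigma^2$ directly.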


Key Terms

Worked Examples

Example 1: ELBO Derivation by KL Decomposition

Derive the ELBO by starting from $D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$ and rearranging.

Solution:

$$D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x})) = \mathbb{E}_{q_\phi}\left[\log \frac{q_\phi(\mathbf{z}|\mathbf{x})}{p_\theta(\mathbf{z}|\mathbf{x})}\right]$$

$$= \mathbb{E}_{q_\phi}\left[\log q_\phi(\mathbf{z}|\mathbf{x}) - \log \frac{p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p_\theta(\mathbf{x})}\right]$$

$$= \mathbb{E}_{q_\phi}[\log q_\phi(\mathbf{z}|\mathbf{x}) - \log p_\theta(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z})] + \log p_\theta(\mathbf{x})$$

$$= -\mathcal{L}(\theta, \phi; \mathbf{x}) + \log p_\theta(\mathbf{x})$$

Rearranging: $\log p_\theta(\mathbf{x}) = \mathcal{L} + D_{\text{KL}}(q_\phi \| p_\theta)$, so $\mathcal{L} \leq \log p_\theta(\mathbf{x})$.

Answer: $\mathcal{L} = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$. Because $\log p_\theta(\mathbf{x})$ does not depend on $\phi$, this derivation shows that maximizing $\mathcal{L}$ over $\phi$ is equivalent to minimizing $D_{\text{KL}}(q_\phi \| p_\theta)$, i.e., making the approximate posterior close to the true posterior.

Example 2: Computing the Gaussian KL Term

For a 2D latent VAE, the encoder outputs $\boldsymbol{\mu} = (0.5, -0.3)$ and $\boldsymbol{\sigma}^2 = (0.1, 0.04)$ for a given input. Compute the KL divergence to $\mathcal{N}(\mathbf{0}, I_2)$.

Solution:

$$D_{\text{KL}} = \frac{1}{2}\sum_{j=1}^{2}(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$$

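Per dimension:

- $j=1$: $0.25 + 0.1 - \log 0.1 - 1 = 0.25 + 0.1 + 2.3026 - 1 = 1.6526$
- $j=2$: $0.09 + 0.04 - \log 0.04 - 1 = 0.09 + 0.04 + 3.2189 - 1 = 2.3489$

$D_{\text{KL}} = \frac{1}{2}(1.6526 + 2.3489) = \frac{1}{2}(4.0015) \approx 2.0007$

Answer: $D_{\text{KL}} \approx 2.001$. Note that $\sigma^2 = 0.04$ gives $\sigma = 0.2$, which is quite far from 1.0, and this drives up the KL term. Small variances carry a large $-\log \sigma^2$ penalty.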
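The arithmetic in this example can be reproduced in a few lines, which also makes it easy to see which term dominates in each dimension. This is a plain-Python check using the values from the problem statement; the variable names are illustrative.

```python
import math

mu = (0.5, -0.3)
var = (0.1, 0.04)

# Per-dimension contribution before the overall factor of 1/2.
terms = [m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var)]
kl = 0.5 * sum(terms)

for j, (m, v, t) in enumerate(zip(mu, var, terms), start=1):
    print(f"j={j}: mu^2={m * m:.4f}  var={v:.4f}  "
          f"-log var={-math.log(v):.4f}  term={t:.4f}")
print(f"KL = {kl:.4f}")
```

In both dimensions the $-\log \sigma_j^2$ contribution is the largest piece, which matches the remark that narrow posteriors are what drive the KL up here.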

Example 3: Reparameterization Trick in Action

For a scalar VAE, the encoder outputs $\mu = 2.0$, $\sigma = 0.5$. We sample $\varepsilon = 0.8$ from $\mathcal{N}(0,1)$. Compute $z$ and show how gradients flow to $\mu$ and $\sigma$.

Solution:

$$z = \mu + \sigma \cdot \varepsilon = 2.0 + 0.5 \cdot 0.8 = 2.4$$

Gradients:

- $\frac{\partial z}{\partial \mu} = 1$: the gradient flows directly to $\mu$
- $\frac{\partial z}{\partial \sigma} = \varepsilon = 0.8$: the gradient is scaled by the noise sample
- $\frac{\partial z}{\partial \varepsilon} = \sigma = 0.5$, but nothing is updated through it: $\varepsilon$ is a sample, not a parameter

If the downstream loss depends on $z$ with gradient $\frac{\partial L}{\partial z} = g$, then:

- $\frac{\partial L}{\partial \mu} = g$
- $\frac{\partial L}{\partial \sigma} = g \cdot \varepsilon = 0.8g$

Answer: $z = 2.4$. Gradients: $\partial z/\partial \mu = 1$, $\partial z/\partial \sigma = 0.8$. The reparameterization trick makes the sampling operation differentiable by moving the stochasticity to the independent noise $\varepsilon$.
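The chain-rule results can be verified with finite differences while holding $\varepsilon$ fixed. The sketch below assumes a hypothetical downstream loss $L(z) = (z - 3)^2$, chosen only so that $g$ has a concrete value; the function names are illustrative.

```python
def sample_z(mu, sigma, eps):
    """Reparameterized sample: deterministic in (mu, sigma) once eps is fixed."""
    return mu + sigma * eps

def loss(z, target=3.0):
    """Hypothetical downstream loss, chosen only for illustration."""
    return (z - target) ** 2

mu, sigma, eps = 2.0, 0.5, 0.8
z = sample_z(mu, sigma, eps)
g = 2.0 * (z - 3.0)                      # dL/dz, computed analytically

# Chain rule from the example: dL/dmu = g, dL/dsigma = g * eps
analytic_mu, analytic_sigma = g, g * eps

# Central finite differences with the same eps reused for every evaluation
h = 1e-6
num_mu = (loss(sample_z(mu + h, sigma, eps)) - loss(sample_z(mu - h, sigma, eps))) / (2 * h)
num_sigma = (loss(sample_z(mu, sigma + h, eps)) - loss(sample_z(mu, sigma - h, eps))) / (2 * h)

print("z =", z)
print("dL/dmu    analytic:", analytic_mu, " numeric:", num_mu)
print("dL/dsigma analytic:", analytic_sigma, " numeric:", num_sigma)
```

Reusing the same `eps` in every evaluation is what makes the finite difference meaningful: it treats the sample as a fixed input, exactly as the reparameterized computation graph does.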

Practice Problems

  1. Prove that $D_{\text{KL}}(q \| p) \geq 0$ for any distributions $q, p$, and show that equality holds iff $q = p$ almost everywhere.

    Answer: Using Jensen's inequality (with $-\log$ convex): $D_{\text{KL}}(q\|p) = \mathbb{E}_q[-\log(p/q)] \geq -\log \mathbb{E}_q[p/q] = -\log \int q \cdot (p/q) = -\log 1 = 0$. Equality holds when $p/q$ is constant a.e., i.e., $p=q$ a.e. This justifies the ELBO as a lower bound.

  2. For a Bernoulli decoder $p_\theta(\mathbf{x}|\mathbf{z}) = \prod_j \hat{x}_j^{x_j}(1-\hat{x}_j)^{1-x_j}$ where $\hat{\mathbf{x}} = \sigma(\mathbf{W}\mathbf{z} + \mathbf{b})$, write the reconstruction term of the ELBO.

    Answer: $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] = \mathbb{E}_{q_\phi}\left[\sum_j x_j \log \hat{x}_j + (1-x_j)\log(1-\hat{x}_j)\right]$. This is the negative of the binary cross-entropy loss, summed over pixels/dimensions. In practice, we use one MC sample from $q_\phi$ per training step.

  3. For a $\beta$-VAE with $\beta = 4$, compute the total loss if the reconstruction loss is 50 and KL = 5 (a standard VAE would have loss = 55). What does $\beta=4$ imply?

    Answer: In minimization form the loss is the negative $\beta$-ELBO, i.e., the reconstruction loss plus $\beta$ times the KL: $-\mathcal{L}_\beta = 50 + 4 \cdot 5 = 50 + 20 = 70$. The $\beta$-VAE loss is 70 (vs 55 for standard). Higher $\beta$ heavily penalizes deviation from the prior, pushing the latent representation to be more factorial/disentangled. The higher numeric loss value doesn't mean "worse": it's a different objective.

  4. The standard VAE prior is $\mathcal{N}(\mathbf{0}, I)$. Why is this a reasonable choice? What happens if we use a different prior?

    Answer: $\mathcal{N}(\mathbf{0}, I)$ is the maximum-entropy distribution with fixed mean and variance. The KL term with this prior encourages the latent dimensions to be independent and roughly unit-Gaussian, which aids disentanglement. Any prior is valid mathematically, but the KL term becomes more complex. A non-standard prior may better match the aggregate posterior but loses the simple closed-form KL.

  5. Show that for a VAE with Gaussian encoder and decoder, the ELBO is equivalent to a regularized autoencoder objective. What role does each term play?

    Answer: With Gaussian decoder $p_\theta(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x}; \hat{\mathbf{x}}, \sigma^2 I)$: $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] = -\frac{1}{2\sigma^2}\mathbb{E}_{q_\phi}[\|\mathbf{x} - \hat{\mathbf{x}}\|^2] - \frac{d}{2}\log(2\pi\sigma^2)$. So the reconstruction term is proportional to MSE, up to an additive constant. The KL term acts as a regularizer, pulling the latent distribution toward $\mathcal{N}(\mathbf{0}, I)$. Together: $\mathcal{L} = -\text{(scaled MSE)} - D_{\text{KL}} + \text{const}$.
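The reconstruction terms from problems 2 and 5 can be written as two small functions evaluated at a single sampled $\mathbf{z}$. This is a sketch in plain Python; the function names, the `tiny` guard against $\log 0$, and the example values for $\hat{\mathbf{x}}$ are assumptions made for illustration.

```python
import math

def bernoulli_log_lik(x, x_hat, tiny=1e-12):
    """log p(x|z) for a Bernoulli decoder; equals minus the binary cross-entropy."""
    return sum(
        xj * math.log(max(pj, tiny)) + (1 - xj) * math.log(max(1 - pj, tiny))
        for xj, pj in zip(x, x_hat)
    )

def gaussian_log_lik(x, x_hat, sigma2=1.0):
    """log p(x|z) for a Gaussian decoder N(x_hat, sigma2 * I)."""
    d = len(x)
    sq_err = sum((xj - pj) ** 2 for xj, pj in zip(x, x_hat))
    return -sq_err / (2 * sigma2) - 0.5 * d * math.log(2 * math.pi * sigma2)

x = [1.0, 0.0, 1.0]
x_hat = [0.9, 0.2, 0.6]   # hypothetical decoder outputs for one sampled z

print("Bernoulli log-likelihood:", bernoulli_log_lik(x, x_hat))
print("Gaussian  log-likelihood:", gaussian_log_lik(x, x_hat, sigma2=1.0))
```

With a fixed $\sigma^2$ the Gaussian version differs from a scaled squared error only by a constant, so minimizing MSE and maximizing this term lead to the same decoder.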


Summary

Key takeaways:


Quiz

  1. What does ELBO stand for?

     - A) Estimated Lower Bound Objective
     - B) Evidence Lower Bound
     - C) Empirical Likelihood Bound Optimizer
     - D) Encoder Likelihood Bound Output

     Correct: B)

     - If you chose B: ELBO = Evidence Lower BOund, a lower bound on $\log p_\theta(\mathbf{x})$ (the "evidence").
     - If you chose A, C, D: These are all incorrect expansions. "Evidence" is the marginal likelihood $p_\theta(\mathbf{x})$.

  2. The reparameterization trick is necessary because:

     - A) Sampling from a Gaussian is expensive
     - B) Direct sampling blocks gradient flow through $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$
     - C) The KL divergence is intractable
     - D) The decoder is non-differentiable

     Correct: B)

     - If you chose B: $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ is a non-differentiable operation w.r.t. $\boldsymbol{\mu}, \boldsymbol{\sigma}$. Reparameterization moves the randomness to $\boldsymbol{\varepsilon}$.
     - If you chose A: Gaussian sampling is fast; the issue is differentiability.
     - If you chose C: The KL has a closed form for Gaussians.
     - If you chose D: The decoder is a neural network, so it's differentiable by design.

  3. In the VAE loss, the KL term $D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$ encourages:

     - A) Better reconstructions
     - B) The latent distribution to be close to the prior
     - C) Sparsity in the latent space
     - D) Higher-dimensional latent codes

     Correct: B)

     - If you chose B: The KL pulls each input's approximate posterior toward the prior $p(\mathbf{z})$, regularizing the latent space.
     - If you chose A: The reconstruction term handles that.
     - If you chose C: Sparsity is a different regularization; this is distribution matching.
     - If you chose D: KL doesn't affect dimensionality; the architecture determines $h$.

  4. For $q = \mathcal{N}(0, 0.01)$ and $p = \mathcal{N}(0, 1)$, the KL divergence is approximately:

     - A) 0
     - B) Greater than 1
     - C) Negative
     - D) Exactly 1

     Correct: B)

     - If you chose B: $D_{\text{KL}} = \frac{1}{2}(0 + 0.01 - \log 0.01 - 1) = \frac{1}{2}(0.01 + 4.605 - 1) = \frac{1}{2}(3.615) \approx 1.81$. A very narrow posterior ($\sigma^2 \ll 1$) is heavily penalized because $\log \sigma^2$ is very negative, making $-\log \sigma^2$ very positive.
     - If you chose A: Both are centered at 0, but KL is sensitive to variance mismatch.
     - If you chose C: KL is always non-negative.
     - If you chose D: It's approximately 1.81, not 1.

  5. Posterior collapse in VAEs refers to:

     - A) The encoder producing deterministic outputs
     - B) $q_\phi(\mathbf{z}|\mathbf{x}) \approx p(\mathbf{z})$ for all $\mathbf{x}$, making $\mathbf{z}$ uninformative
     - C) The decoder failing to reconstruct inputs
     - D) The latent space dimensionality shrinking to 0

     Correct: B)

     - If you chose B: When $q_\phi(\mathbf{z}|\mathbf{x}) \approx \mathcal{N}(0,I)$ for all inputs, $\mathbf{z}$ carries no information about $\mathbf{x}$. The decoder learns to ignore $\mathbf{z}$, converging to the data mean.
     - If you chose A: The encoder always outputs distribution parameters; it's the distribution that collapses, not determinism.
     - If you chose C: The decoder still works; it just ignores the latent code.
     - If you chose D: Dimensionality is fixed by architecture.

Next Steps

22-03: Generative Adversarial Networks (GANs). The adversarial alternative to likelihood-based generative models: minimax games, the optimal discriminator, and the Jensen-Shannon connection.


Pitfalls

  1. Treating the ELBO gap as negligible during evaluation: The ELBO is a lower bound, not the log-likelihood. The true log-likelihood is $\log p_\theta(\mathbf{x}) = \mathcal{L} + D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$. If the encoder is not expressive enough, the KL gap can be large, and the ELBO may substantially underestimate the true log-likelihood. Because the gap is non-negative, the exact ELBO never overestimates it; only a noisy Monte Carlo estimate of the ELBO can land above it by chance. Never compare a VAE's ELBO directly against an autoregressive model's exact log-likelihood as if the two were the same quantity.

  2. Using too few Monte Carlo samples for the reconstruction term: The standard VAE training uses a single sample $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$ per training step. While this works for training (the gradient is unbiased), evaluating the ELBO with a single sample gives a noisy estimate. For evaluation and model comparison, use many MC samples (e.g., 100-500) and report the importance-weighted bound $\log \frac{1}{K}\sum_{k=1}^K \frac{p_\theta(\mathbf{x}, \mathbf{z}^{(k)})}{q_\phi(\mathbf{z}^{(k)}|\mathbf{x})}$, which converges to the true log-likelihood as $K \to \infty$.

  3. Misdiagnosing posterior collapse: Posterior collapse ($\text{KL} \to 0$) can be confused with successful training if only monitoring the total loss, because the reconstruction term may still be good (the decoder learns the data mean). Always monitor the KL term and reconstruction term separately. If $\text{KL} \approx 0$ but reconstructions are acceptable, the model has collapsed: it's ignoring the latent code.

  4. Setting $\beta$ too high in $\beta$-VAE without adjusting architecture: High $\beta$ (e.g., $\beta = 10$) heavily penalizes the KL term, which can force all latent dimensions to match the prior so tightly that no information passes through the bottleneck. The latent dimensions are then trivially independent, but they encode nothing useful: the latent codes become pure noise. Scale $\beta$ gradually and monitor the mutual information $I(\mathbf{x}; \mathbf{z})$ between inputs and latent codes.
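The importance-weighted bound from pitfall 2 can be tried out on a model whose true log-likelihood is known. The sketch below assumes a toy 1D model that is not part of this lesson (prior $\mathcal{N}(0,1)$, likelihood $\mathcal{N}(z,1)$, so $p(x) = \mathcal{N}(0,2)$) and a deliberately poor $q$ that simply copies the prior. The helper names are illustrative, and the weights are averaged in log space for numerical stability.

```python
import math
import random

def log_normal(x, mean, var):
    """Log density of a 1D Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_mean_exp(values):
    """Numerically stable log( (1/K) * sum_k exp(values[k]) )."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def iw_bound(x, K, m_q, var_q, rng):
    """One draw of the K-sample importance-weighted bound for the toy model."""
    log_w = []
    for _ in range(K):
        z = rng.gauss(m_q, math.sqrt(var_q))                         # z ~ q(z|x)
        log_joint = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)  # log p(x, z)
        log_w.append(log_joint - log_normal(z, m_q, var_q))
    return log_mean_exp(log_w)

rng = random.Random(0)
x, m_q, var_q = 1.0, 0.0, 1.0          # assumed poor q: identical to the prior
true_log_px = log_normal(x, 0.0, 2.0)

estimates = {}
for K in (1, 10, 500):
    reps = 25_000 // K                  # same total sample budget for each K
    draws = [iw_bound(x, K, m_q, var_q, rng) for _ in range(reps)]
    estimates[K] = sum(draws) / len(draws)
    print(f"K={K:4d}: average bound = {estimates[K]:.4f}")
print(f"true log p(x) = {true_log_px:.4f}")
```

With $K = 1$ the average recovers the ordinary ELBO, which sits well below the true value for this mismatched $q$; with large $K$ the average approaches $\log p(x)$.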




Q6: Which of the following is a valid way to compute an approximate log-likelihood from a trained VAE?

- A) Use the ELBO with a single MC sample.
- B) Use importance sampling: $\log \frac{1}{K}\sum_{k=1}^K \frac{p_\theta(\mathbf{x}|\mathbf{z}^{(k)})p(\mathbf{z}^{(k)})}{q_\phi(\mathbf{z}^{(k)}|\mathbf{x})}$ with large $K$ (e.g., 500).
- C) Set $\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x})$ (deterministic encoding) and compute $\log p_\theta(\mathbf{x}|\boldsymbol{\mu})$.
- D) The ELBO on the training set divided by the number of latent dimensions.

Answer and Explanations

**Correct: B)** Importance-weighted autoencoders (IWAE) use multiple MC samples to tighten the bound. As $K \to \infty$, the IWAE bound converges to the true $\log p_\theta(\mathbf{x})$. For practical evaluation, $K = 100$–$500$ gives a tight estimate.

- A) The single-sample ELBO is a valid lower bound but can be loose, so it is not a good approximation of the true likelihood.
- C) Using only the mean discards the variance and ignores the prior term entirely, so it is not a valid likelihood.
- D) This is nonsensical: dividing by latent dimension doesn't produce a valid likelihood.