22-02 - Variational Autoencoders (VAEs)
Phase: 22 - Generative Models Mathematics
Subject: 22-02
Prerequisites: 22-01 - Autoencoders, Phase 16-17 (Neural Networks), Phase 13 (Probability)
Next subject: 22-03 - Generative Adversarial Networks (GANs)
Learning Objectives
By the end of this subject, you will be able to:
- Derive the Evidence Lower Bound (ELBO) from first principles using Jensen's inequality
- Explain the reparameterization trick and prove it enables unbiased gradient estimates
- Formulate the VAE as a latent variable model with encoder $q_\phi(\mathbf{z}|\mathbf{x})$ and decoder $p_\theta(\mathbf{x}|\mathbf{z})$
- Analyze the $\beta$-VAE and its effect on disentanglement
- Connect the VAE loss to both reconstruction and KL regularization
Core Content
From Autoencoders to Probabilistic Models
Standard autoencoders learn a deterministic mapping $\mathbf{x} \to \mathbf{z} \to \hat{\mathbf{x}}$. VAEs make this probabilistic:
- Probabilistic encoder (inference network): $q_\phi(\mathbf{z}|\mathbf{x})$, which approximates the true posterior $p_\theta(\mathbf{z}|\mathbf{x})$
- Probabilistic decoder (generative network): $p_\theta(\mathbf{x}|\mathbf{z})$, the likelihood of the data given the latent code
- Prior: $p(\mathbf{z})$, typically $\mathcal{N}(\mathbf{0}, I)$
The ELBO Derivation
We want to maximize the marginal log-likelihood $\log p_\theta(\mathbf{x})$. It's intractable to compute directly because it requires integrating over $\mathbf{z}$:
$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) d\mathbf{z}$$
Using the variational distribution $q_\phi(\mathbf{z}|\mathbf{x})$:
$$\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) d\mathbf{z}$$
$$= \log \int q_\phi(\mathbf{z}|\mathbf{x}) \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})} d\mathbf{z}$$
⚠️ CRITICAL: Jensen's Inequality. Since $\log$ is concave, $\log \mathbb{E}[Y] \geq \mathbb{E}[\log Y]$. Applying this:
$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]$$
Expanding the RHS:
$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{q_\phi(\mathbf{z}|\mathbf{x})}{p(\mathbf{z})}\right]$$
$$= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$$
This lower bound is the ELBO (Evidence Lower BOund):
$$\mathcal{L}(\theta, \phi; \mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]}_{\text{Reconstruction term}} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))}_{\text{Regularization term}}$$
Maximizing the ELBO simultaneously:
1. Maximizes reconstruction quality (first term)
2. Keeps the approximate posterior close to the prior (second term)
Decomposing the ELBO
We can also write:
$$\log p_\theta(\mathbf{x}) = \mathcal{L}(\theta, \phi; \mathbf{x}) + D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$$
Since KL divergence is always $\geq 0$, this confirms $\mathcal{L} \leq \log p_\theta(\mathbf{x})$. The gap is exactly the KL divergence between the approximate and true posteriors, i.e., our variational approximation error.
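The decomposition can be checked numerically on a toy model where every quantity has a closed form. The sketch below assumes a scalar linear-Gaussian model that is not part of the text above: $p(z) = \mathcal{N}(0, 1)$ and $p(x|z) = \mathcal{N}(z, 1)$, so that $p(x) = \mathcal{N}(0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. The function names and the chosen values of `x`, `m`, `v` are illustrative.

```python
import math

def gaussian_kl(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians (v = variance)."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo(x, m, v):
    """Closed-form ELBO for q(z|x) = N(m, v) in the assumed toy model."""
    # E_q[log N(x; z, 1)] = -0.5*log(2*pi) - 0.5*((x - m)^2 + v)
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + v)
    kl_to_prior = gaussian_kl(m, v, 0.0, 1.0)
    return recon - kl_to_prior

def log_evidence(x):
    """Exact log p(x) = log N(x; 0, 2) for the toy model."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / 4.0

x = 1.5
m, v = 0.5, 1.0                      # a deliberately imperfect encoder output
gap = log_evidence(x) - elbo(x, m, v)
kl_to_posterior = gaussian_kl(m, v, x / 2.0, 0.5)
print(f"ELBO = {elbo(x, m, v):.4f}, log p(x) = {log_evidence(x):.4f}")
print(f"gap = {gap:.4f}, KL(q || posterior) = {kl_to_posterior:.4f}")
```

The two printed quantities on the last line should agree, and setting `m = x / 2`, `v = 0.5` (the true posterior) should close the gap entirely.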
Gaussian VAE in Detail
The standard VAE uses Gaussian distributions:
- Prior: $p(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, I)$
- Encoder: $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x})))$
- Decoder: $p_\theta(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_\theta(\mathbf{z}), \sigma^2 I)$ (for continuous data) or Bernoulli (for binary data)
The KL divergence between two Gaussians has a closed form. For $q = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p = \mathcal{N}(\mathbf{0}, I)$, with $h$ the number of latent dimensions:
$$D_{\text{KL}}(q \| p) = \frac{1}{2}\sum_{j=1}^{h}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$
Derivation: $$D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu}, \Sigma) \| \mathcal{N}(\mathbf{0}, I)) = \frac{1}{2}\left[\text{tr}(\Sigma) + \boldsymbol{\mu}^T\boldsymbol{\mu} - h - \log\det\Sigma\right]$$
For diagonal $\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_h^2)$: $\text{tr}(\Sigma) = \sum \sigma_j^2$, $\log\det\Sigma = \sum \log \sigma_j^2$, giving the element-wise formula.
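As a sanity check, the element-wise formula can be compared against a Monte Carlo estimate of $\mathbb{E}_q[\log q(\mathbf{z}) - \log p(\mathbf{z})]$. This is a minimal sketch in plain Python; the function names, the sample count, and the test values of `mu` and `var` are assumptions made for illustration.

```python
import math
import random

def kl_diag_gaussian_to_standard(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) )."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))

def kl_monte_carlo(mu, var, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E_q[log q(z) - log p(z)] as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Dimensions are independent, so the log-densities add up per dimension.
        for m, v in zip(mu, var):
            z = rng.gauss(m, math.sqrt(v))
            log_q = -0.5 * (math.log(2 * math.pi * v) + (z - m) ** 2 / v)
            log_p = -0.5 * (math.log(2 * math.pi) + z * z)
            total += log_q - log_p
    return total / n_samples

mu, var = [1.0, -0.5], [0.5, 2.0]
print("closed form:", kl_diag_gaussian_to_standard(mu, var))
print("Monte Carlo:", kl_monte_carlo(mu, var))
```

The closed form is what training code would normally use; the sampling version is only there to confirm that the formula matches the definition of the KL divergence.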
⚠️ The Reparameterization Trick
The ELBO requires gradients through sampling $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$. Direct sampling breaks gradient flow. The solution: express $\mathbf{z}$ as a deterministic function of the encoder outputs plus independent noise:
$$\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)$$
This is critical: the randomness is in $\boldsymbol{\varepsilon}$, which does not depend on $\phi$. Gradient flows through $\boldsymbol{\mu}_\phi$ and $\boldsymbol{\sigma}_\phi$ deterministically.
The gradient estimator:
$$\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})] = \mathbb{E}_{\boldsymbol{\varepsilon} \sim \mathcal{N}(0,I)}[\nabla_\phi f(\boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\varepsilon})]$$
A Monte Carlo average of the right-hand side over samples of $\boldsymbol{\varepsilon}$ is an unbiased estimator of the true gradient, and it typically has lower variance than alternatives such as the score function (REINFORCE) estimator.
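The variance claim can be illustrated on a case where the exact gradient is known. The sketch below assumes a scalar latent and the test function $f(z) = z^2$ (chosen only for illustration), for which $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ and hence $\partial/\partial\mu = 2\mu$. It compares per-sample reparameterization estimates with score-function estimates; the function name and parameter values are hypothetical.

```python
import random
import statistics

def gradient_samples(mu, sigma, n_samples=20_000, seed=0):
    """Per-sample estimates of d/dmu E[z^2] for z ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    reparam, score = [], []
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps                 # reparameterized sample
        # Pathwise estimator: df/dz * dz/dmu = 2z * 1
        reparam.append(2.0 * z)
        # Score-function estimator: f(z) * d/dmu log q(z) = z^2 * (z - mu) / sigma^2
        score.append(z * z * (z - mu) / sigma ** 2)
    return reparam, score

mu, sigma = 1.0, 0.5
reparam, score = gradient_samples(mu, sigma)
print("exact gradient:", 2.0 * mu)
print("reparameterization: mean", statistics.fmean(reparam),
      "variance", statistics.variance(reparam))
print("score function:     mean", statistics.fmean(score),
      "variance", statistics.variance(score))
```

Both sample means should land near the exact gradient, while the per-sample variance of the score-function estimator should be noticeably larger for this choice of $f$, $\mu$, and $\sigma$.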
$\beta$-VAE: Controlling Disentanglement
The $\beta$-VAE introduces a weight on the KL term:
$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \beta \cdot D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$$
- $\beta = 1$: Standard VAE
- $\beta > 1$: Stronger pressure for the latent space to match the prior, which encourages more disentangled representations (each latent dimension captures an independent factor of variation)
- $\beta < 1$: Weaker regularization, giving better reconstructions but a less structured latent space
The trade-off: higher $\beta$ improves disentanglement but may hurt reconstruction (the "information bottleneck" becomes tighter).
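In training code the objective is usually written as a loss to minimize, i.e., the negative of $\mathcal{L}_\beta$. A minimal sketch, assuming the reconstruction term is supplied as a negative log-likelihood; the function name and the numeric values are illustrative only.

```python
def beta_vae_loss(recon_nll, kl, beta=1.0):
    """Negative beta-ELBO: reconstruction negative log-likelihood + beta * KL."""
    return recon_nll + beta * kl

recon_nll, kl = 50.0, 5.0   # illustrative per-example values
for beta in (0.5, 1.0, 4.0):
    print(f"beta={beta}: loss={beta_vae_loss(recon_nll, kl, beta)}")
```

Losses computed with different values of $\beta$ are different objectives, so their numeric values are not directly comparable across runs.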
Pitfalls
| Pitfall | Why It Happens | Fix |
|---|---|---|
| Posterior collapse | Decoder ignores $\mathbf{z}$; KL goes to 0 | KL annealing, free bits, $\beta < 1$ |
| Blurry samples | Gaussian decoder assumes pixel independence | Use hierarchical VAEs, VQ-VAE, or diffusion decoders |
| High variance gradients | Monte Carlo estimation of expectation | Use more MC samples per training step |
| Numerical instability | $\log \sigma^2$ can go to $-\infty$ | Clamp $\sigma^2$ to $[10^{-4}, 10^4]$ |
| Mode collapse in generation | Latent prior doesn't match aggregate posterior | Use more flexible priors (VampPrior, normalizing flow prior) |
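Three of the fixes in the table are small enough to sketch directly. The helpers below are hypothetical, and the hyperparameters (warm-up length, free-bits floor) are assumed defaults rather than recommended values. Published variants of free bits differ in detail, for example by applying the floor to groups of dimensions or to minibatch averages.

```python
import math

def kl_weight(step, warmup_steps=10_000, max_weight=1.0):
    """Linear KL annealing: ramp the KL weight from 0 to max_weight."""
    return max_weight * min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, free_bits=0.1):
    """Free bits: each latent dimension is charged at least `free_bits` nats,
    so there is no incentive to push a dimension's KL below that floor."""
    return sum(max(kl_j, free_bits) for kl_j in kl_per_dim)

def clamp_log_var(log_var, lo=math.log(1e-4), hi=math.log(1e4)):
    """Keep log sigma^2 inside [log 1e-4, log 1e4] before exponentiating."""
    return max(lo, min(hi, log_var))

print(kl_weight(2_500))                   # a quarter of the way through warm-up
print(free_bits_kl([0.02, 0.5, 1.3]))     # first dimension is lifted to the floor
print(math.exp(clamp_log_var(-50.0)))     # variance stays at the lower clamp
```

In a training loop, the annealed weight would multiply the KL term of the loss at each step, and the clamp would be applied to the encoder's log-variance output before sampling.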
Key Terms
- ELBO
Worked Examples
Example 1: ELBO Derivation by KL Decomposition
Derive the ELBO by starting from $D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$ and rearranging.
Solution:
$$D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x})) = \mathbb{E}_{q_\phi}\left[\log \frac{q_\phi(\mathbf{z}|\mathbf{x})}{p_\theta(\mathbf{z}|\mathbf{x})}\right]$$
$$= \mathbb{E}_{q_\phi}\left[\log q_\phi(\mathbf{z}|\mathbf{x}) - \log \frac{p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p_\theta(\mathbf{x})}\right]$$
$$= \mathbb{E}_{q_\phi}[\log q_\phi(\mathbf{z}|\mathbf{x}) - \log p_\theta(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z})] + \log p_\theta(\mathbf{x})$$
$$= -\mathcal{L}(\theta, \phi; \mathbf{x}) + \log p_\theta(\mathbf{x})$$
Rearranging: $\log p_\theta(\mathbf{x}) = \mathcal{L} + D_{\text{KL}}(q_\phi \| p_\theta)$, so $\mathcal{L} \leq \log p_\theta(\mathbf{x})$.
Answer: $\mathcal{L} = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$. Because $\log p_\theta(\mathbf{x})$ does not depend on $\phi$, maximizing $\mathcal{L}$ with respect to $\phi$ is equivalent to minimizing $D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$, i.e., making the approximate posterior close to the true posterior.
Example 2: Computing the Gaussian KL Term
For a 2D latent VAE, the encoder outputs $\boldsymbol{\mu} = (0.5, -0.3)$ and $\boldsymbol{\sigma}^2 = (0.1, 0.04)$ for a given input. Compute the KL divergence to $\mathcal{N}(\mathbf{0}, I_2)$.
Solution:
$$D_{\text{KL}} = \frac{1}{2}\sum_{j=1}^{2}(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$$
- $j=1$: $0.5^2 + 0.1 - \log(0.1) - 1 = 0.25 + 0.1 - (-2.3026) - 1 = 0.35 + 2.3026 - 1 = 1.6526$
- $j=2$: $(-0.3)^2 + 0.04 - \log(0.04) - 1 = 0.09 + 0.04 - (-3.2189) - 1 = 0.13 + 3.2189 - 1 = 2.3489$
$D_{\text{KL}} = \frac{1}{2}(1.6526 + 2.3489) = \frac{1}{2}(4.0015) = 2.0008$
Answer: $D_{\text{KL}} \approx 2.001$. Note that $\sigma^2 = 0.04$ gives $\sigma = 0.2$, which is quite far from 1.0, and this drives up the KL term. Small variances carry a large $-\log \sigma^2$ penalty.
Example 3: Reparameterization Trick in Action
For a scalar VAE, the encoder outputs $\mu = 2.0$, $\sigma = 0.5$. We sample $\varepsilon = 0.8$ from $\mathcal{N}(0,1)$. Compute $z$ and show how gradients flow to $\mu$ and $\sigma$.
Solution:
$$z = \mu + \sigma \cdot \varepsilon = 2.0 + 0.5 \cdot 0.8 = 2.4$$
Gradients:
- $\frac{\partial z}{\partial \mu} = 1$: the gradient flows directly to $\mu$
- $\frac{\partial z}{\partial \sigma} = \varepsilon = 0.8$: the gradient is scaled by the noise sample
- $\frac{\partial z}{\partial \varepsilon}$: no gradient is sent to $\varepsilon$ (it's a sample, not a parameter)

If the downstream loss depends on $z$ with gradient $\frac{\partial L}{\partial z} = g$, then:
- $\frac{\partial L}{\partial \mu} = g$
- $\frac{\partial L}{\partial \sigma} = g \cdot \varepsilon = 0.8g$
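These chain-rule results can be confirmed with finite differences while holding $\varepsilon$ fixed. The downstream loss $L(z) = (z - 1)^2$ below is a hypothetical choice made only so that $g$ has a concrete value; the function names are likewise illustrative.

```python
def sample_z(mu, sigma, eps):
    """Reparameterized sample: deterministic in (mu, sigma) once eps is fixed."""
    return mu + sigma * eps

def loss(z):
    """Hypothetical downstream loss, chosen only for illustration."""
    return (z - 1.0) ** 2

mu, sigma, eps = 2.0, 0.5, 0.8
z = sample_z(mu, sigma, eps)
g = 2.0 * (z - 1.0)                  # dL/dz for the loss above

# Chain rule from the example: dL/dmu = g, dL/dsigma = g * eps
analytic = (g, g * eps)

# Central finite differences with eps held fixed
h = 1e-6
numeric = (
    (loss(sample_z(mu + h, sigma, eps)) - loss(sample_z(mu - h, sigma, eps))) / (2 * h),
    (loss(sample_z(mu, sigma + h, eps)) - loss(sample_z(mu, sigma - h, eps))) / (2 * h),
)
print("z =", z)
print("analytic:", analytic)
print("numeric: ", numeric)
```

An automatic differentiation framework performs the same chain-rule computation when the sample is written as `mu + sigma * eps`.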
Answer: $z = 2.4$. Gradients: $\partial z/\partial \mu = 1$, $\partial z/\partial \sigma = 0.8$. The reparameterization trick makes the sampling operation differentiable by moving the stochasticity to the independent noise $\varepsilon$.
Practice Problems
- Prove that $D_{\text{KL}}(q \| p) \geq 0$ for any distributions $q, p$, and show that equality holds iff $q = p$ almost everywhere.
  Answer: Using Jensen's inequality: $D_{\text{KL}}(q\|p) = \mathbb{E}_q[-\log(p/q)] \geq -\log \mathbb{E}_q[p/q] = -\log \int q \cdot (p/q) = -\log 1 = 0$. Equality holds when $p/q$ is constant a.e., i.e., $p=q$ a.e. This justifies the ELBO as a lower bound.
- For a Bernoulli decoder $p_\theta(\mathbf{x}|\mathbf{z}) = \prod_j \hat{x}_j^{x_j}(1-\hat{x}_j)^{1-x_j}$ where $\hat{\mathbf{x}} = \sigma(\mathbf{W}\mathbf{z} + \mathbf{b})$, write the reconstruction term of the ELBO.
  Answer: $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] = \mathbb{E}_{q_\phi}\left[\sum_j x_j \log \hat{x}_j + (1-x_j)\log(1-\hat{x}_j)\right]$. This is the negative of the binary cross-entropy loss, summed over pixels/dimensions. In practice, we use one MC sample from $q_\phi$ per training step.
- For a $\beta$-VAE with $\beta = 4$, compute the total loss if the reconstruction loss is 50 and KL = 5 (a standard VAE would have loss = 55). What does $\beta=4$ imply?
  Answer: The loss is the negative of $\mathcal{L}_\beta$, i.e., reconstruction loss $+ \beta \cdot D_{\text{KL}} = 50 + 4 \cdot 5 = 70$ (vs. $50 + 5 = 55$ for the standard VAE). Higher $\beta$ heavily penalizes deviation from the prior, pushing the latent representation to be more factorial/disentangled. The higher numeric loss value doesn't mean "worse": it's a different objective.
- The standard VAE prior is $\mathcal{N}(\mathbf{0}, I)$. Why is this a reasonable choice? What happens if we use a different prior?
  Answer: $\mathcal{N}(\mathbf{0}, I)$ is the maximum-entropy distribution for a fixed mean and covariance. The KL term with this prior encourages the latent dimensions to be independent and roughly unit-Gaussian, which aids disentanglement. Any prior is valid mathematically, but the KL term becomes more complex. A non-standard prior may better match the aggregate posterior but loses the simple closed-form KL.
- Show that for a VAE with Gaussian encoder and decoder, the ELBO is equivalent to a regularized autoencoder objective. What role does each term play?
  Answer: With Gaussian decoder $p_\theta(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x}; \hat{\mathbf{x}}, \sigma^2 I)$: $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] = -\frac{1}{2\sigma^2}\mathbb{E}_{q_\phi}[\|\mathbf{x} - \hat{\mathbf{x}}\|^2] - \frac{d}{2}\log(2\pi\sigma^2)$. So, up to an additive constant, the reconstruction term is proportional to the negative MSE. The KL term acts as a regularizer, pulling the latent distribution toward $\mathcal{N}(\mathbf{0}, I)$. Together: $\mathcal{L} = -\text{(scaled MSE)} - D_{\text{KL}} + \text{const}$.
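The two reconstruction terms from the Bernoulli and Gaussian decoder problems can be written out for a single decoded sample. This is a minimal sketch in plain Python; the function names, the clipping constant `eps`, and the example vectors are assumptions for illustration.

```python
import math

def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """log p(x|z) for a Bernoulli decoder; equals minus the summed binary cross-entropy."""
    total = 0.0
    for x_j, p_j in zip(x, x_hat):
        p_j = min(max(p_j, eps), 1.0 - eps)   # avoid log(0)
        total += x_j * math.log(p_j) + (1.0 - x_j) * math.log(1.0 - p_j)
    return total

def gaussian_log_likelihood(x, x_hat, sigma2=1.0):
    """log p(x|z) for N(x; x_hat, sigma2 * I): scaled squared error plus a constant."""
    d = len(x)
    sq_err = sum((x_j - m_j) ** 2 for x_j, m_j in zip(x, x_hat))
    return -sq_err / (2.0 * sigma2) - 0.5 * d * math.log(2.0 * math.pi * sigma2)

x = [1.0, 0.0, 1.0]
x_hat = [0.9, 0.2, 0.6]              # hypothetical decoder outputs for one sampled z
print(bernoulli_log_likelihood(x, x_hat))
print(gaussian_log_likelihood(x, x_hat, sigma2=0.5))
```

Averaging either quantity over samples $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$ gives a Monte Carlo estimate of the reconstruction term. Note how the decoder variance `sigma2` sets the relative weight of the squared error against the KL term.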
Summary
Key takeaways:
- VAEs are probabilistic autoencoders: encoder $q_\phi(\mathbf{z}|\mathbf{x})$, decoder $p_\theta(\mathbf{x}|\mathbf{z})$, prior $p(\mathbf{z})$
- The ELBO $\mathcal{L} = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi \| p)$ lower-bounds $\log p_\theta(\mathbf{x})$
- The reparameterization trick $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}$ makes sampling differentiable
- $\beta$-VAE controls the reconstruction-regularization trade-off via $\beta \cdot D_{\text{KL}}$
- Posterior collapse ($\text{KL} \to 0$) is one of the most common VAE failure modes: the decoder learns to ignore $\mathbf{z}$
Quiz
Q1: What does ELBO stand for?
- A) Estimated Lower Bound Objective
- B) Evidence Lower Bound
- C) Empirical Likelihood Bound Optimizer
- D) Encoder Likelihood Bound Output

Correct: B)

- If you chose B: ELBO = Evidence Lower BOund, a lower bound on $\log p_\theta(\mathbf{x})$ (the "evidence").
- If you chose A, C, D: These are all incorrect expansions. "Evidence" is the marginal likelihood $p_\theta(\mathbf{x})$.
Q2: The reparameterization trick is necessary because:
- A) Sampling from a Gaussian is expensive
- B) Direct sampling blocks gradient flow through $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$
- C) The KL divergence is intractable
- D) The decoder is non-differentiable

Correct: B)

- If you chose B: $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ is a non-differentiable operation w.r.t. $\boldsymbol{\mu}, \boldsymbol{\sigma}$. Reparameterization moves the randomness to $\boldsymbol{\varepsilon}$.
- If you chose A: Gaussian sampling is fast. The issue is differentiability.
- If you chose C: The KL has a closed form for Gaussians.
- If you chose D: The decoder is a neural network, so it's differentiable by design.
Q3: In the VAE loss, the KL term $D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$ encourages:
- A) Better reconstructions
- B) The latent distribution to be close to the prior
- C) Sparsity in the latent space
- D) Higher-dimensional latent codes

Correct: B)

- If you chose B: The KL pulls each input's approximate posterior toward the prior $p(\mathbf{z})$, regularizing the latent space.
- If you chose A: The reconstruction term handles that.
- If you chose C: Sparsity is a different kind of regularization. This is distribution matching.
- If you chose D: KL doesn't affect dimensionality; the architecture determines $h$.
Q4: For $q = \mathcal{N}(0, 0.01)$ and $p = \mathcal{N}(0, 1)$, the KL divergence is approximately:
- A) 0
- B) Greater than 1
- C) Negative
- D) Exactly 1

Correct: B)

- If you chose B: $D_{\text{KL}} = \frac{1}{2}(0 + 0.01 - \log 0.01 - 1) = \frac{1}{2}(0.01 + 4.605 - 1) \approx 1.81$. A very narrow posterior ($\sigma^2 \ll 1$) is heavily penalized because $\log \sigma^2$ is very negative, making $-\log \sigma^2$ very positive.
- If you chose A: Both distributions are centered at 0, but KL is sensitive to variance mismatch.
- If you chose C: KL is always non-negative.
- If you chose D: It's approximately 1.81, not 1.
Q5: Posterior collapse in VAEs refers to:
- A) The encoder producing deterministic outputs
- B) $q_\phi(\mathbf{z}|\mathbf{x}) \approx p(\mathbf{z})$ for all $\mathbf{x}$, making $\mathbf{z}$ uninformative
- C) The decoder failing to reconstruct inputs
- D) The latent space dimensionality shrinking to 0

Correct: B)

- If you chose B: When $q_\phi(\mathbf{z}|\mathbf{x}) \approx \mathcal{N}(0,I)$ for all inputs, $\mathbf{z}$ carries no information about $\mathbf{x}$. The decoder learns to ignore $\mathbf{z}$, converging to the data mean.
- If you chose A: The encoder always outputs distribution parameters. It's the distribution that collapses to the prior, not the sampling that becomes deterministic.
- If you chose C: The decoder still works. It just ignores the latent code.
- If you chose D: Dimensionality is fixed by architecture.
Next Steps
22-03 - Generative Adversarial Networks (GANs): the adversarial alternative to likelihood-based generative models, covering minimax games, the optimal discriminator, and the Jensen-Shannon connection.
Pitfalls
- Treating the ELBO gap as negligible during evaluation: The ELBO is a lower bound, not the log-likelihood. The true log-likelihood is $\log p_\theta(\mathbf{x}) = \mathcal{L} + D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$. If the encoder is not expressive enough, the KL gap can be large, and the ELBO may substantially underestimate the true log-likelihood. Because the gap is a KL divergence and therefore non-negative, the exact ELBO never overestimates it. Never compare a VAE's ELBO directly against an autoregressive model's exact log-likelihood.
- Using too few Monte Carlo samples for the reconstruction term: The standard VAE training uses a single sample $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$ per training step. While this works for training (the gradient is unbiased), evaluating the ELBO with a single sample gives a noisy estimate. For evaluation and model comparison, use many MC samples (e.g., 100-500) and report the importance-weighted bound $\log \frac{1}{K}\sum_{k=1}^K \frac{p_\theta(\mathbf{x}, \mathbf{z}^{(k)})}{q_\phi(\mathbf{z}^{(k)}|\mathbf{x})}$, which converges to the true log-likelihood as $K \to \infty$.
- Misdiagnosing posterior collapse: Posterior collapse ($\text{KL} \to 0$) can be confused with successful training if only monitoring the total loss, because the reconstruction term may still be good (the decoder learns the data mean). Always monitor the KL term and reconstruction term separately. If $\text{KL} \approx 0$ but reconstructions are acceptable, the model has collapsed: it's ignoring the latent code.
- Setting $\beta$ too high in $\beta$-VAE without adjusting architecture: High $\beta$ (e.g., $\beta = 10$) heavily penalizes the KL term, which can force all latent dimensions to match the prior so tightly that no information passes through the bottleneck. The latent dimensions are then trivially independent but encode nothing useful: the latent codes become pure noise. Scale $\beta$ gradually and monitor the mutual information $I(\mathbf{x}; \mathbf{z})$ between inputs and latent codes.
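The importance-weighted bound from the second pitfall can be sketched on a toy model where the exact log-likelihood is known. The code below assumes a scalar model that is not part of the text above ($p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z, 1)$, hence $p(x) = \mathcal{N}(0, 2)$) and a deliberately mismatched $q = \mathcal{N}(0.5, 1)$. All names and values are illustrative.

```python
import math
import random

def log_normal(x, mean, var):
    """Log-density of a scalar Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def iw_bound(x, m, v, k, rng):
    """log (1/K) sum_k p(x, z_k) / q(z_k | x), with z_k ~ q = N(m, v)."""
    log_w = []
    for _ in range(k):
        z = rng.gauss(m, math.sqrt(v))
        log_joint = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
        log_w.append(log_joint - log_normal(z, m, v))
    # log-mean-exp, shifted by the max for numerical stability
    top = max(log_w)
    return top + math.log(sum(math.exp(lw - top) for lw in log_w) / k)

x, m, v = 1.5, 0.5, 1.0
rng = random.Random(0)
exact = log_normal(x, 0.0, 2.0)          # log p(x) for this toy model
for k in (1, 10, 100, 2000):
    # average several independent estimates to reduce noise in the printout
    est = sum(iw_bound(x, m, v, k, rng) for _ in range(50)) / 50
    print(f"K={k:5d}: bound ~ {est:.4f}   (exact log p(x) = {exact:.4f})")
```

With $K = 1$ the average estimate corresponds to the ELBO. As $K$ grows, the estimates should move toward the exact value. For a real VAE, `log_joint` would come from the decoder and prior, and the proposal from the encoder.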
Q6: Which of the following is a valid way to compute an approximate log-likelihood from a trained VAE?
- A) Use the ELBO with a single MC sample.
- B) Use importance sampling: $\log \frac{1}{K}\sum_{k=1}^K \frac{p_\theta(\mathbf{x}|\mathbf{z}^{(k)})p(\mathbf{z}^{(k)})}{q_\phi(\mathbf{z}^{(k)}|\mathbf{x})}$ with large $K$ (e.g., 500).
- C) Set $\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x})$ (deterministic encoding) and compute $\log p_\theta(\mathbf{x}|\boldsymbol{\mu})$.
- D) The ELBO on the training set divided by the number of latent dimensions.