22-02 — Variational Autoencoders (VAEs)

Phase: 22 — Generative Models Mathematics
Subject: 22-02
Prerequisites: 22-01 — Autoencoders, Phase 16–17 (Neural Networks), Phase 13 (Probability)
Next subject: 22-03 — Generative Adversarial Networks (GANs)

Learning Objectives

By the end of this subject, you will be able to:

Derive the Evidence Lower Bound (ELBO) from first principles using Jensen's inequality
Explain the reparameterization trick and prove it enables unbiased gradient estimates
Formulate the VAE as a latent variable model with encoder $q_\phi(\mathbf{z}|\mathbf{x})$ and decoder $p_\theta(\mathbf{x}|\mathbf{z})$
Analyze the $\beta$-VAE and its effect on disentanglement
Connect the VAE loss to both reconstruction and KL regularization

Core Content

From Autoencoders to Probabilistic Models

Standard autoencoders learn a deterministic mapping $\mathbf{x} \to \mathbf{z} \to \hat{\mathbf{x}}$. VAEs make this probabilistic:

Probabilistic encoder (inference network): $q_\phi(\mathbf{z}|\mathbf{x})$ — approximates the true posterior $p(\mathbf{z}|\mathbf{x})$
Probabilistic decoder (generative network): $p_\theta(\mathbf{x}|\mathbf{z})$ — likelihood of data given latent code
Prior: $p(\mathbf{z})$ — typically $\mathcal{N}(\mathbf{0}, I)$

The ELBO Derivation

We want to maximize the marginal log-likelihood $\log p_\theta(\mathbf{x})$. It's intractable to compute directly because it requires integrating over $\mathbf{z}$:

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) d\mathbf{z}$$

Using the variational distribution $q_\phi(\mathbf{z}|\mathbf{x})$:

$$\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) d\mathbf{z}$$

$$= \log \int q_\phi(\mathbf{z}|\mathbf{x}) \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})} d\mathbf{z}$$

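⚠️ CRITICAL (Jensen's Inequality): The integral above is an expectation under $q_\phi(\mathbf{z}|\mathbf{x})$ of the ratio $p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) / q_\phi(\mathbf{z}|\mathbf{x})$. Since $\log$ is concave, $\log \mathbb{E}[Y] \geq \mathbb{E}[\log Y]$. Applying this: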

$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]$$

Expanding the RHS:

$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{q_\phi(\mathbf{z}|\mathbf{x})}{p(\mathbf{z})}\right]$$

$$= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$$

This lower bound is the ELBO (Evidence Lower BOund):

$$\mathcal{L}(\theta, \phi; \mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]}_{\text{Reconstruction term}} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))}_{\text{Regularization term}}$$

Maximizing the ELBO simultaneously:

1. Maximizes reconstruction quality (first term)
2. Keeps the approximate posterior close to the prior (second term)

Decomposing the ELBO

We can also write:

$$\log p_\theta(\mathbf{x}) = \mathcal{L}(\theta, \phi; \mathbf{x}) + D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$$

Since KL divergence is always $\geq 0$, this confirms $\mathcal{L} \leq \log p_\theta(\mathbf{x})$. The gap is exactly the KL divergence between the approximate and true posteriors — our variational approximation error.
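
The identity $\log p_\theta(\mathbf{x}) = \mathcal{L} + D_{\text{KL}}$ can be checked numerically on a model small enough to enumerate. The sketch below assumes a hypothetical binary latent with made-up probabilities (none of these numbers come from the lesson), so the evidence and the true posterior are available exactly.

```python
import math

# Hypothetical toy model: binary latent z and one fixed observation x.
prior = [0.5, 0.5]        # p(z)
likelihood = [0.9, 0.2]   # p(x | z) evaluated at the observed x
q = [0.7, 0.3]            # an arbitrary approximate posterior q(z | x)

# Exact evidence and true posterior (tractable because z has two states).
evidence = sum(p * l for p, l in zip(prior, likelihood))
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]

# ELBO = E_q[log p(x|z) + log p(z) - log q(z|x)]
elbo = sum(qz * (math.log(l) + math.log(p) - math.log(qz))
           for qz, p, l in zip(q, prior, likelihood))

# Gap = KL(q(z|x) || p(z|x)), the variational approximation error.
gap = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

print(math.log(evidence), elbo + gap)  # the two values should agree
```

Setting `q` equal to `posterior` should shrink the gap to zero, at which point the ELBO equals the log-evidence.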

Gaussian VAE in Detail

The standard VAE uses Gaussian distributions:

Prior: $p(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, I)$
Encoder: $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x})))$
Decoder: $p_\theta(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_\theta(\mathbf{z}), \sigma^2 I)$ (for continuous data) or Bernoulli (for binary data)

The KL divergence between two Gaussians has a closed form. For $q = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p = \mathcal{N}(\mathbf{0}, I)$, with $h$ the latent dimension:

$$D_{\text{KL}}(q \| p) = \frac{1}{2}\sum_{j=1}^{h}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$

Derivation: $$D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu}, \Sigma) \| \mathcal{N}(\mathbf{0}, I)) = \frac{1}{2}\left[\text{tr}(\Sigma) + \boldsymbol{\mu}^T\boldsymbol{\mu} - h - \log\det\Sigma\right]$$

For diagonal $\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_h^2)$: $\text{tr}(\Sigma) = \sum \sigma_j^2$, $\log\det\Sigma = \sum \log \sigma_j^2$, giving the element-wise formula.
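
As a minimal sketch, the element-wise formula translates directly into a few lines of Python. The function name and the input values are illustrative assumptions. Many implementations have the encoder output $\log \sigma^2$ instead of $\sigma^2$, which this sketch does not do.

```python
import math

def gaussian_kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) using the element-wise closed form."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))

# The KL vanishes when q already equals the prior.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))

# Illustrative 2D encoder output with small variances.
print(gaussian_kl_to_standard_normal([0.5, -0.3], [0.1, 0.04]))
```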

⚠️ The Reparameterization Trick

The ELBO requires gradients through sampling $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$. Direct sampling breaks gradient flow. The solution: express $\mathbf{z}$ as a deterministic function of the encoder outputs plus independent noise:

$$\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)$$

This is critical: the randomness is in $\boldsymbol{\varepsilon}$, which does not depend on $\phi$. Gradient flows through $\boldsymbol{\mu}_\phi$ and $\boldsymbol{\sigma}_\phi$ deterministically.

The gradient estimator:

$$\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})] = \mathbb{E}_{\boldsymbol{\varepsilon} \sim \mathcal{N}(0,I)}[\nabla_\phi f(\boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\varepsilon})]$$

This is an unbiased estimator of the true gradient and typically has lower variance than alternative estimators (e.g., score function / REINFORCE estimator).
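
To see the estimator at work, the sketch below uses a scalar case with the assumed test function $f(z) = z^2$, chosen because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$. The function name, sample count, and seed are illustrative choices, not part of the lesson.

```python
import random

def reparam_grad_estimate(mu, sigma, n_samples, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2], z = mu + sigma*eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise drawn independently of (mu, sigma)
        z = mu + sigma * eps        # reparameterized sample
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule with dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule with dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_grad_estimate(mu=2.0, sigma=0.5, n_samples=200_000)
print(g_mu, g_sigma)  # analytic gradients are 2*mu = 4.0 and 2*sigma = 1.0
```

The averages should land close to the analytic values, with the $\sigma$ gradient visibly noisier than the $\mu$ gradient for these settings.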

$\beta$-VAE: Controlling Disentanglement

The $\beta$-VAE introduces a weight on the KL term:

$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \beta \cdot D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$$

$\beta = 1$: Standard VAE
$\beta > 1$: Stronger pressure for the latent space to match the prior → tends to give more disentangled representations (each latent dimension captures an independent factor of variation)
$\beta < 1$: Weaker regularization → better reconstructions but less structured latent space

The trade-off: higher $\beta$ improves disentanglement but may hurt reconstruction (the "information bottleneck" becomes tighter).
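
In training code the objective is usually minimized, so the sign flips: the loss is the reconstruction negative log-likelihood plus $\beta$ times the KL. A minimal sketch of that convention follows. The function name and the numbers are illustrative assumptions.

```python
def beta_vae_loss(recon_nll, kl, beta=1.0):
    """Negative beta-ELBO: reconstruction negative log-likelihood plus beta * KL."""
    return recon_nll + beta * kl

# Same reconstruction and KL values, three different weightings of the KL term.
for beta in (0.5, 1.0, 4.0):
    print(beta, beta_vae_loss(recon_nll=50.0, kl=5.0, beta=beta))
```

Loss values computed with different $\beta$ are not comparable to one another, since each $\beta$ defines a different objective.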

Pitfalls

Pitfall	Why It Happens	Fix
Posterior collapse	Decoder ignores $\mathbf{z}$; KL goes to 0	KL annealing, free bits, $\beta < 1$
Blurry samples	Gaussian decoder assumes pixel independence	Use hierarchical VAEs, VQ-VAE, or diffusion decoders
High variance gradients	Monte Carlo estimation of expectation	Use more MC samples per training step
Numerical instability	$\log \sigma^2$ can go to $-\infty$	Clamp $\sigma^2$ to $[10^{-4}, 10^4]$
Mode collapse in generation	Latent prior doesn't match aggregate posterior	Use more flexible priors (VampPrior, normalizing flow prior)
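
Two of the fixes in the table are easy to sketch: KL annealing and clamping the variance. The linear warm-up below is only one possible annealing schedule, and the function names and step counts are assumptions made for illustration. The clamp range matches the $[10^{-4}, 10^4]$ bounds from the table, applied in log space.

```python
import math

def kl_weight(step, warmup_steps):
    """Linear KL warm-up: ramps the KL weight from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

def clamp_log_var(log_var, lo=math.log(1e-4), hi=math.log(1e4)):
    """Clamp log sigma^2 so that sigma^2 stays inside [1e-4, 1e4]."""
    return max(lo, min(hi, log_var))

# KL weight at a few training steps with a hypothetical 1000-step warm-up.
print([kl_weight(s, warmup_steps=1000) for s in (0, 250, 1000, 5000)])

# A runaway log-variance is pulled back to the lower bound.
print(math.exp(clamp_log_var(-50.0)))
```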

Key Terms

ELBO

Worked Examples

Example 1: ELBO Derivation by KL Decomposition

Derive the ELBO by starting from $D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$ and rearranging.

Solution:

$$D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x})) = \mathbb{E}_{q_\phi}\left[\log \frac{q_\phi(\mathbf{z}|\mathbf{x})}{p_\theta(\mathbf{z}|\mathbf{x})}\right]$$

$$= \mathbb{E}_{q_\phi}\left[\log q_\phi(\mathbf{z}|\mathbf{x}) - \log \frac{p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p_\theta(\mathbf{x})}\right]$$

$$= \mathbb{E}_{q_\phi}[\log q_\phi(\mathbf{z}|\mathbf{x}) - \log p_\theta(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z})] + \log p_\theta(\mathbf{x})$$

$$= -\mathcal{L}(\theta, \phi; \mathbf{x}) + \log p_\theta(\mathbf{x})$$

Rearranging: $\log p_\theta(\mathbf{x}) = \mathcal{L} + D_{\text{KL}}(q_\phi \| p_\theta)$, so $\mathcal{L} \leq \log p_\theta(\mathbf{x})$.

Answer:

$\mathcal{L} = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$. Because $\log p_\theta(\mathbf{x})$ does not depend on $\phi$, maximizing $\mathcal{L}$ over $\phi$ for fixed $\theta$ is equivalent to minimizing $D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$, which pulls the approximate posterior toward the true posterior.

Example 2: Computing the Gaussian KL Term

For a 2D latent VAE, the encoder outputs $\boldsymbol{\mu} = (0.5, -0.3)$ and $\boldsymbol{\sigma}^2 = (0.1, 0.04)$ for a given input. Compute the KL divergence to $\mathcal{N}(\mathbf{0}, I_2)$.

Solution:

$$D_{\text{KL}} = \frac{1}{2}\sum_{j=1}^{2}(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$$

$j=1$: $0.5^2 + 0.1 - \log(0.1) - 1 = 0.25 + 0.1 - (-2.3026) - 1 = 0.35 + 2.3026 - 1 = 1.6526$
$j=2$: $(-0.3)^2 + 0.04 - \log(0.04) - 1 = 0.09 + 0.04 - (-3.2189) - 1 = 0.13 + 3.2189 - 1 = 2.3489$

$D_{\text{KL}} = \frac{1}{2}(1.6526 + 2.3489) = \frac{1}{2}(4.0015) = 2.0008$

Answer:

$D_{\text{KL}} \approx 2.001$. Note that $\sigma^2 = 0.04$ gives $\sigma = 0.2$, which is quite far from 1.0, and this drives up the KL term. Small variances carry a large $-\log \sigma^2$ penalty.

Example 3: Reparameterization Trick in Action

For a scalar VAE, the encoder outputs $\mu = 2.0$, $\sigma = 0.5$. We sample $\varepsilon = 0.8$ from $\mathcal{N}(0,1)$. Compute $z$ and show how gradients flow to $\mu$ and $\sigma$.

Solution:

$$z = \mu + \sigma \cdot \varepsilon = 2.0 + 0.5 \cdot 0.8 = 2.4$$

Gradients:

- $\frac{\partial z}{\partial \mu} = 1$: the gradient flows directly to $\mu$
- $\frac{\partial z}{\partial \sigma} = \varepsilon = 0.8$: the gradient is scaled by the noise sample
- $\frac{\partial z}{\partial \varepsilon} = \sigma = 0.5$, but this is never used: $\varepsilon$ is a sample, not a parameter, so nothing is updated through it

If the downstream loss depends on $z$ with gradient $\frac{\partial L}{\partial z} = g$, then:

- $\frac{\partial L}{\partial \mu} = g$
- $\frac{\partial L}{\partial \sigma} = g \cdot \varepsilon = 0.8g$

Answer:

$z = 2.4$. Gradients: $\partial z/\partial \mu = 1$, $\partial z/\partial \sigma = 0.8$. The reparameterization trick makes the sampling operation differentiable by moving the stochasticity to the independent noise $\varepsilon$.
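
The chain-rule gradients in Example 3 can be sanity-checked with finite differences. The sketch below assumes a hypothetical downstream loss $L = z^2$ (the example itself leaves $L$ unspecified), so $g = 2z = 4.8$ at $z = 2.4$.

```python
def loss(mu, sigma, eps):
    """Hypothetical downstream loss L = z^2 with z = mu + sigma * eps."""
    z = mu + sigma * eps
    return z * z

mu, sigma, eps, h = 2.0, 0.5, 0.8, 1e-6

g = 2.0 * (mu + sigma * eps)      # dL/dz evaluated at z = 2.4
analytic = (g * 1.0, g * eps)     # (dL/dmu, dL/dsigma) from the chain rule

# Central finite differences, holding the noise sample eps fixed.
numeric = (
    (loss(mu + h, sigma, eps) - loss(mu - h, sigma, eps)) / (2 * h),
    (loss(mu, sigma + h, eps) - loss(mu, sigma - h, eps)) / (2 * h),
)
print(analytic, numeric)
```

Holding `eps` fixed while perturbing `mu` and `sigma` is exactly what makes the sample a deterministic, differentiable function of the encoder outputs.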

Practice Problems

Prove that $D_{\text{KL}}(q \| p) \geq 0$ for any distributions $q, p$, and show that equality holds iff $q = p$ almost everywhere.

Answer:
Using Jensen's inequality: $D_{\text{KL}}(q\|p) = \mathbb{E}_q[-\log(p/q)] \geq -\log \mathbb{E}_q[p/q] = -\log \int q \cdot (p/q) = -\log 1 = 0$. Equality when $p/q$ is constant a.e., i.e., $p=q$ a.e. This justifies the ELBO as a lower bound.
For a Bernoulli decoder $p_\theta(\mathbf{x}|\mathbf{z}) = \prod_j \hat{x}_j^{x_j}(1-\hat{x}_j)^{1-x_j}$ where $\hat{\mathbf{x}} = \sigma(\mathbf{W}\mathbf{z} + \mathbf{b})$, write the reconstruction term of the ELBO.

Answer:
$\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] = \mathbb{E}_{q_\phi}\left[\sum_j x_j \log \hat{x}_j + (1-x_j)\log(1-\hat{x}_j)\right]$ This is the negative of the binary cross-entropy loss, summed over pixels/dimensions. In practice, we use one MC sample from $q_\phi$ per training step.
For a $\beta$-VAE with $\beta = 4$, compute the total loss if reconstruction = 50 and KL = 5 (standard VAE would have loss = 55). What does $\beta=4$ imply?

Answer:
The stated standard loss of $55 = 50 + 5$ shows that the loss to minimize is the reconstruction loss plus the KL term, so the $\beta$-weighted loss is $50 + 4 \cdot 5 = 50 + 20 = 70$ (vs 55 for standard). Higher $\beta$ heavily penalizes deviation from the prior, pushing the latent representation to be more factorial/disentangled. The higher numeric loss value doesn't mean "worse": it's a different objective.
The standard VAE prior is $\mathcal{N}(\mathbf{0}, I)$. Why is this a reasonable choice? What happens if we use a different prior?

Answer:
$\mathcal{N}(\mathbf{0}, I)$ is the maximum-entropy distribution with fixed mean and variance. The KL term with this prior encourages the latent dimensions to be independent and roughly unit-Gaussian, which aids disentanglement. Any prior is valid mathematically, but the KL term becomes more complex. A non-standard prior may better match the aggregate posterior but loses the simple closed-form KL.
Show that for a VAE with Gaussian encoder and decoder, the ELBO is equivalent to a regularized autoencoder objective. What role does each term play?

Answer:
With Gaussian decoder $p_\theta(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x}; \hat{\mathbf{x}}, \sigma^2 I)$: $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] = -\frac{1}{2\sigma^2}\mathbb{E}_{q_\phi}[\|\mathbf{x} - \hat{\mathbf{x}}\|^2] - \frac{d}{2}\log(2\pi\sigma^2)$ So the reconstruction term is a negatively scaled MSE plus a constant that does not depend on the reconstruction. The KL term acts as a regularizer, pulling the latent distribution toward $\mathcal{N}(\mathbf{0}, I)$. Together, up to that additive constant: $\mathcal{L} = -\text{(scaled MSE)} - D_{\text{KL}}$.
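
The Gaussian-decoder identity in the last answer can be checked for a single reconstruction. In the sketch below the data vector, the reconstruction, and $\sigma^2 = 0.1$ are made-up values, and the function names are hypothetical.

```python
import math

def gaussian_log_likelihood(x, x_hat, sigma2):
    """log N(x; x_hat, sigma2 * I), summing per-dimension Gaussian log-densities."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2) - (a - b) ** 2 / (2 * sigma2)
               for a, b in zip(x, x_hat))

def scaled_mse_form(x, x_hat, sigma2):
    """Same quantity as -||x - x_hat||^2 / (2 sigma2) - (d/2) log(2 pi sigma2)."""
    d = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -sq_err / (2 * sigma2) - 0.5 * d * math.log(2 * math.pi * sigma2)

x, x_hat = [0.2, 0.9, 0.5], [0.25, 0.7, 0.5]   # hypothetical data and reconstruction
print(gaussian_log_likelihood(x, x_hat, 0.1), scaled_mse_form(x, x_hat, 0.1))
```

The choice of $\sigma^2$ sets the relative weight of reconstruction against the KL term, which is one way to read the role of $\beta$.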

Summary

Key takeaways:

VAEs are probabilistic autoencoders: encoder $q_\phi(\mathbf{z}|\mathbf{x})$, decoder $p_\theta(\mathbf{x}|\mathbf{z})$, prior $p(\mathbf{z})$
The ELBO $\mathcal{L} = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi \| p)$ lower-bounds $\log p_\theta(\mathbf{x})$
The reparameterization trick $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}$ makes sampling differentiable
$\beta$-VAE controls the reconstruction-regularization trade-off via $\beta \cdot D_{\text{KL}}$
Posterior collapse (KL→0) is the most common VAE failure mode — the decoder learns to ignore $\mathbf{z}$

Quiz

What does ELBO stand for?
A) Estimated Lower Bound Objective
B) Evidence Lower Bound
C) Empirical Likelihood Bound Optimizer
D) Encoder Likelihood Bound Output
Correct: B)
If you chose B: ELBO = Evidence Lower BOund — a lower bound on $\log p_\theta(\mathbf{x})$ (the "evidence").
If you chose A, C, D: These are all incorrect expansions. "Evidence" is the marginal likelihood $p_\theta(\mathbf{x})$.
The reparameterization trick is necessary because:
A) Sampling from a Gaussian is expensive
B) Direct sampling blocks gradient flow through $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$
C) The KL divergence is intractable
D) The decoder is non-differentiable
Correct: B)
If you chose B: $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ is a non-differentiable operation w.r.t. $\boldsymbol{\mu}, \boldsymbol{\sigma}$. Reparameterization moves the randomness to $\boldsymbol{\varepsilon}$.
If you chose A: Gaussian sampling is fast — the issue is differentiability.
If you chose C: The KL has a closed form for Gaussians.
If you chose D: The decoder is a neural network — it's differentiable by design.
In the VAE loss, the KL term $D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$ encourages:
A) Better reconstructions
B) The latent distribution to be close to the prior
C) Sparsity in the latent space
D) Higher-dimensional latent codes
Correct: B)
If you chose B: The KL pulls each input's approximate posterior toward the prior $p(\mathbf{z})$, regularizing the latent space.
If you chose A: The reconstruction term handles that.
If you chose C: Sparsity is a different regularization — this is distribution matching.
If you chose D: KL doesn't affect dimensionality; the architecture determines $h$.
For $q = \mathcal{N}(0, 0.01)$ and $p = \mathcal{N}(0, 1)$, the KL divergence is approximately:
A) 0
B) Greater than 1
C) Negative
D) Exactly 1
Correct: B)
If you chose B: $D_{\text{KL}} = \frac{1}{2}(0 + 0.01 - \log 0.01 - 1) = \frac{1}{2}(0.01 + 4.605 - 1) = 1.81$. A very narrow posterior ($\sigma^2 \ll 1$) is heavily penalized because $\log \sigma^2$ is very negative, making $-\log \sigma^2$ very positive.
If you chose A: Both centered at 0, but KL is sensitive to variance mismatch.
If you chose C: KL is always non-negative.
If you chose D: It's approximately 1.81, not 1.
Posterior collapse in VAEs refers to:
A) The encoder producing deterministic outputs
B) $q_\phi(\mathbf{z}|\mathbf{x}) \approx p(\mathbf{z})$ for all $\mathbf{x}$, making $\mathbf{z}$ uninformative
C) The decoder failing to reconstruct inputs
D) The latent space dimensionality shrinking to 0
Correct: B)
If you chose B: When $q_\phi(\mathbf{z}|\mathbf{x}) \approx \mathcal{N}(0,I)$ for all inputs, $\mathbf{z}$ carries no information about $\mathbf{x}$. The decoder learns to ignore $\mathbf{z}$, converging to the data mean.
If you chose A: The encoder always outputs distribution parameters. Collapse means those distributions stop depending on $\mathbf{x}$, not that they become deterministic.
If you chose C: Decoder still works — it just ignores the latent code.
If you chose D: Dimensionality is fixed by architecture.

Next Steps

22-03 — Generative Adversarial Networks (GANs) — the adversarial alternative to likelihood-based generative models: minimax games, the optimal discriminator, and the Jensen-Shannon connection.

Pitfalls

Treating the ELBO gap as negligible during evaluation: The ELBO is a lower bound, not the log-likelihood. The true log-likelihood is $\log p_\theta(\mathbf{x}) = \mathcal{L} + D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$. If the encoder is not expressive enough, the KL gap can be large, and the ELBO may substantially underestimate the true log-likelihood. It never overestimates it in expectation, because the gap is a KL divergence and therefore non-negative. Avoid comparing a VAE's ELBO directly against an autoregressive model's exact log-likelihood without accounting for this gap.
Using too few Monte Carlo samples for the reconstruction term: The standard VAE training uses a single sample $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$ per training step. While this works for training (the gradient is unbiased), evaluating the ELBO with a single sample gives a noisy estimate. For evaluation and model comparison, use many MC samples (e.g., 100-500) and report the importance-weighted bound $\log \frac{1}{K}\sum_{k=1}^K \frac{p_\theta(\mathbf{x}, \mathbf{z}^{(k)})}{q_\phi(\mathbf{z}^{(k)}|\mathbf{x})}$, which converges to the true log-likelihood as $K \to \infty$.
Misdiagnosing posterior collapse: Posterior collapse ($\text{KL} \to 0$) can be confused with successful training if only monitoring the total loss, because the reconstruction term may still be good (the decoder learns the data mean). Always monitor the KL term and reconstruction term separately. If $\text{KL} \approx 0$ but reconstructions are acceptable, the model has collapsed — it's ignoring the latent code.
Setting $\beta$ too high in $\beta$-VAE without adjusting architecture: High $\beta$ (e.g., $\beta = 10$) heavily penalizes the KL term, which can force all latent dimensions to match the prior so tightly that no information passes through the bottleneck. The latent space then looks perfectly factorized but encodes nothing useful: the sampled codes are indistinguishable from prior noise. Increase $\beta$ gradually and monitor the mutual information $I(\mathbf{x}; \mathbf{z})$ between inputs and latent codes.
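
The difference between the averaged ELBO estimate and the importance-weighted bound is easiest to see on a model whose evidence is known exactly. The sketch below assumes a hypothetical scalar linear-Gaussian model, $z \sim \mathcal{N}(0,1)$ and $x|z \sim \mathcal{N}(z,1)$, for which $p(x) = \mathcal{N}(x; 0, 2)$. The proposal $q$ is deliberately mismatched, and $K = 5000$ is used only to keep Monte Carlo noise small in the demonstration.

```python
import math
import random

def log_normal(v, mean, var):
    """Log-density of a scalar Gaussian N(v; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)

def log_weight(x, z, q_mean, q_var):
    """log [ p(x|z) p(z) / q(z|x) ] for the toy model z ~ N(0,1), x|z ~ N(z,1)."""
    return log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0) - log_normal(z, q_mean, q_var)

def estimates(x, q_mean, q_var, k, seed=0):
    """Return (ELBO estimate, importance-weighted bound) from k samples of q."""
    rng = random.Random(seed)
    lw = [log_weight(x, rng.gauss(q_mean, math.sqrt(q_var)), q_mean, q_var)
          for _ in range(k)]
    elbo = sum(lw) / k                                   # mean of the log-weights
    m = max(lw)                                          # log-sum-exp for stability
    iw_bound = m + math.log(sum(math.exp(v - m) for v in lw) / k)
    return elbo, iw_bound

x = 1.5
true_log_px = log_normal(x, 0.0, 2.0)   # exact evidence for this toy model
elbo, iw_bound = estimates(x, q_mean=0.5, q_var=1.0, k=5000)
print(true_log_px, elbo, iw_bound)
```

With a mismatched $q$, the ELBO estimate should sit below the exact value by roughly the KL gap, while the importance-weighted bound should land much closer to it.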

Q6: Which of the following is a valid way to compute an approximate log-likelihood from a trained VAE?

A) Use the ELBO with a single MC sample.
B) Use importance sampling: $\log \frac{1}{K}\sum_{k=1}^K \frac{p_\theta(\mathbf{x}|\mathbf{z}^{(k)})p(\mathbf{z}^{(k)})}{q_\phi(\mathbf{z}^{(k)}|\mathbf{x})}$ with large $K$ (e.g., 500).
C) Set $\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x})$ (deterministic encoding) and compute $\log p_\theta(\mathbf{x}|\boldsymbol{\mu})$.
D) The ELBO on the training set divided by the number of latent dimensions.

Answer and Explanations

Correct: B) Importance-weighted autoencoders (IWAE) use multiple MC samples to tighten the bound. As $K \to \infty$, the IWAE bound converges to the true $\log p_\theta(\mathbf{x})$. For practical evaluation, $K = 100$–$500$ gives a tight estimate.
A) The single-sample ELBO is a valid lower bound but can be loose, so it is not a good approximation of the true likelihood.
C) Using only the mean discards the variance and ignores the prior term entirely, so it is not a valid likelihood.
D) This is nonsensical: dividing by latent dimension doesn't produce a valid likelihood.
