
22-06 — Diffusion Models: Foundations

Phase: 22 — Generative Models Mathematics Subject: 22-06 Prerequisites: 22-05 — Score-Based Generative Models, Phase 13 (Probability, Gaussians), Phase 06 (Stochastic Calculus basics) Next subject: 22-07 — Diffusion Models: Advanced


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the forward diffusion process and its closed-form marginal $q(\mathbf{x}_t|\mathbf{x}_0)$
  2. Formulate the reverse diffusion process as a learned Gaussian transition
  3. Derive the simplified DDPM training objective $\mathbb{E}[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\|^2]$
  4. Explain the connection between diffusion models and score matching
  5. Implement the forward noising, reverse denoising, and sampling procedures

Core Content

The Forward (Diffusion) Process

Diffusion models gradually destroy data structure by adding Gaussian noise over $T$ timesteps. Given a data point $\mathbf{x}_0 \sim q(\mathbf{x}_0)$, the forward process is a Markov chain:

$$q(\mathbf{x}_{1:T}|\mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t|\mathbf{x}_{t-1})$$

where each step adds a small amount of Gaussian noise:

$$q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{1 - \beta_t}\;\mathbf{x}_{t-1}, \beta_t I\right)$$

⚠️ CRITICAL — Closed-Form Marginal

The beauty of Gaussian diffusion: we can sample $\mathbf{x}_t$ directly from $\mathbf{x}_0$ without iterating through $t$ steps. Define:

$$\alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$

Then:

$$q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\;\mathbf{x}_0, (1 - \bar{\alpha}_t)I\right)$$

Proof by induction:

Base case $t=1$: $q(\mathbf{x}_1|\mathbf{x}_0) = \mathcal{N}(\sqrt{\alpha_1}\mathbf{x}_0, (1-\alpha_1)I) = \mathcal{N}(\sqrt{\bar{\alpha}_1}\mathbf{x}_0, (1-\bar{\alpha}_1)I)$ ✓

Inductive step: Assume true for $t-1$. Using the reparameterization trick:

$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\boldsymbol{\varepsilon}_{t-1}, \quad \boldsymbol{\varepsilon}_{t-1} \sim \mathcal{N}(0,I)$$

Then $\mathbf{x}_t = \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\boldsymbol{\varepsilon}_t$:

$$\mathbf{x}_t = \sqrt{\alpha_t}\left(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\boldsymbol{\varepsilon}_{t-1}\right) + \sqrt{1-\alpha_t}\boldsymbol{\varepsilon}_t$$

The signal coefficient is $\sqrt{\alpha_t\bar{\alpha}_{t-1}} = \sqrt{\bar{\alpha}_t}$. The noise terms combine because $\boldsymbol{\varepsilon}_{t-1}$ and $\boldsymbol{\varepsilon}_t$ are independent, so their variances add: $\sqrt{\alpha_t(1-\bar{\alpha}_{t-1})}\boldsymbol{\varepsilon}_{t-1} + \sqrt{1-\alpha_t}\boldsymbol{\varepsilon}_t \sim \mathcal{N}(0, (\alpha_t(1-\bar{\alpha}_{t-1}) + 1 - \alpha_t)I)$

$$= \mathcal{N}(0, (1 - \alpha_t\bar{\alpha}_{t-1})I) = \mathcal{N}(0, (1 - \bar{\alpha}_t)I)$$

Therefore $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon}$ where $\boldsymbol{\varepsilon} \sim \mathcal{N}(0,I)$. ✓

Equivalently:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)$$

This is hugely important for efficient training: we can sample $\mathbf{x}_t$ at any timestep in one step.
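The one-step property can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the helper names (`make_linear_schedule`, `q_sample`, `q_sample_iterative`), the linear schedule values, and the scalar data point are all assumptions chosen for illustration. It compares samples drawn with the closed form against samples produced by iterating the one-step kernel.

```python
import numpy as np

def make_linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return (betas, alphas, alpha_bars); array index i holds timestep t = i + 1."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, eps):
    """Closed form: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with 1-based t."""
    abar = alpha_bars[t - 1]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

def q_sample_iterative(x0, t, betas, rng):
    """Reference path: apply the one-step kernel q(x_s | x_{s-1}) for s = 1..t."""
    x = x0
    for s in range(t):
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * noise
    return x

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_linear_schedule()
t = 300
x0 = np.full(100_000, 2.0)  # many copies of one scalar data point (illustrative)

x_closed = q_sample(x0, t, alpha_bars, rng.standard_normal(x0.shape))
x_iter = q_sample_iterative(x0, t, betas, rng)

# Both routes should match in distribution, up to Monte Carlo error.
print("means:", x_closed.mean(), x_iter.mean(), np.sqrt(alpha_bars[t - 1]) * 2.0)
print("variances:", x_closed.var(), x_iter.var(), 1.0 - alpha_bars[t - 1])
```

The closed-form route costs one draw regardless of $t$, while the iterative route costs $t$ draws, which is why training uses the former.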

The Reverse (Denoising) Process

The reverse process starts from pure noise $\mathbf{x}_T \sim \mathcal{N}(0, I)$ and gradually denoises:

$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$$

where each reverse step is learned as a Gaussian:

$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 I\right)$$

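When $\beta_t$ is small, the true reverse conditional $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is approximately Gaussian, so the reverse process has the same functional form as the forward process. This is a key property of diffusion processes (Feller, Kolmogorov) and is what justifies the Gaussian parameterization above.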

The Variational Bound

We train by maximizing the variational lower bound on $\log p_\theta(\mathbf{x}_0)$:

$$\log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_q\left[\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right] = \mathcal{L}_{\text{VLB}}$$

After algebraic manipulation (expanding the Markov chains and using the Gaussian forms), this decomposes into:

$$\mathcal{L}_{\text{VLB}} = \mathbb{E}_q\left[-\underbrace{D_{\text{KL}}(q(\mathbf{x}_T|\mathbf{x}_0) \| p(\mathbf{x}_T))}_{L_T} - \sum_{t=2}^{T} \underbrace{D_{\text{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \| p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))}_{L_{t-1}} + \underbrace{\log p_\theta(\mathbf{x}_0|\mathbf{x}_1)}_{L_0}\right]$$

⚠️ CRITICAL — The Key Insight: $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is Tractable

Using Bayes' rule and the known Gaussian forms:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t|\mathbf{x}_{t-1}) q(\mathbf{x}_{t-1}|\mathbf{x}_0)}{q(\mathbf{x}_t|\mathbf{x}_0)}$$

After Gaussian algebra, this is also Gaussian:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t I\right)$$

where:

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\mathbf{x}_t$$

$$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\beta_t$$

Since $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is a known Gaussian, each $L_{t-1}$ term is a KL divergence between two Gaussians, which has a closed form.
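The "Gaussian algebra" can be sanity-checked numerically. The sketch below, under assumed illustrative values for $\beta_t$ and $\bar{\alpha}_{t-1}$, computes the posterior once from the closed-form coefficients and once by combining the two Gaussian factors of Bayes' rule through their precisions. The function names `posterior_q` and `posterior_by_bayes` are hypothetical.

```python
import numpy as np

def posterior_q(x_t, x0, beta_t, abar_prev, abar_t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from the closed-form expressions."""
    alpha_t = 1.0 - beta_t
    coef_x0 = np.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    mean = coef_x0 * x0 + coef_xt * x_t
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return mean, var

def posterior_by_bayes(x_t, x0, beta_t, abar_prev):
    """Same posterior via a precision-weighted product of two Gaussians (per coordinate)."""
    alpha_t = 1.0 - beta_t
    # q(x_t | x_{t-1}) viewed as a function of x_{t-1}:
    # precision alpha_t / beta_t, centred at x_t / sqrt(alpha_t).
    prec_lik = alpha_t / beta_t
    mean_lik = x_t / np.sqrt(alpha_t)
    # q(x_{t-1} | x_0): precision 1 / (1 - abar_prev), centred at sqrt(abar_prev) * x0.
    prec_prior = 1.0 / (1.0 - abar_prev)
    mean_prior = np.sqrt(abar_prev) * x0
    var = 1.0 / (prec_lik + prec_prior)
    mean = var * (prec_lik * mean_lik + prec_prior * mean_prior)
    return mean, var

# Illustrative values (assumptions, not taken from any particular schedule).
beta_t, abar_prev = 0.02, 0.5
abar_t = (1.0 - beta_t) * abar_prev
x_t = np.array([0.3, -1.2])
x0 = np.array([1.0, 0.5])

mean_cf, var_cf = posterior_q(x_t, x0, beta_t, abar_prev, abar_t)
mean_bayes, var_bayes = posterior_by_bayes(x_t, x0, beta_t, abar_prev)
print(mean_cf, mean_bayes)
print(var_cf, var_bayes)
```

Both routes should agree to floating-point precision, since the closed-form coefficients are just the precision-weighted combination simplified with $\alpha_t(1-\bar{\alpha}_{t-1}) + \beta_t = 1 - \bar{\alpha}_t$.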

The Simplified Training Objective

Through clever reparameterization, Ho et al. (2020) showed that the variational bound simplifies dramatically. Express $\mathbf{x}_0$ in terms of $\mathbf{x}_t$ and noise:

$$\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon})$$

Substituting into $\tilde{\boldsymbol{\mu}}_t$ and parameterizing $\boldsymbol{\mu}_\theta$ to match:

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\right)$$
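The intermediate algebra can be spelled out as follows. This is a sketch of the substitution step using only the definitions already given, together with the identities $\sqrt{\bar{\alpha}_{t-1}}/\sqrt{\bar{\alpha}_t} = 1/\sqrt{\alpha_t}$ and $\beta_t + \alpha_t(1-\bar{\alpha}_{t-1}) = 1 - \bar{\alpha}_t$:

$$\tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t}\cdot\frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon}}{\sqrt{\bar{\alpha}_t}} + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\mathbf{x}_t$$

$$= \frac{\beta_t + \alpha_t(1 - \bar{\alpha}_{t-1})}{\sqrt{\alpha_t}(1 - \bar{\alpha}_t)}\mathbf{x}_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar{\alpha}_t}}\boldsymbol{\varepsilon} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\varepsilon}\right)$$

Replacing the unknown true noise $\boldsymbol{\varepsilon}$ with the network prediction $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)$ gives the parameterization of $\boldsymbol{\mu}_\theta$.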

The KL divergence $L_{t-1}$ reduces to:

$$L_{t-1} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\varepsilon}}\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)} \|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\|^2\right]$$
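A quick numerical check of this reduction, for a single draw inside the expectation, is sketched below. It assumes $\sigma_t^2 = \tilde{\beta}_t$, so both Gaussians share the same covariance and the KL divergence is $\|\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta\|^2 / (2\sigma_t^2)$ with no extra constant. The numeric values and the random stand-in for the network output `eps_hat` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative schedule values (assumptions).
beta_t, abar_prev = 0.02, 0.5
alpha_t = 1.0 - beta_t
abar_t = alpha_t * abar_prev
sigma2 = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t  # assumed choice: sigma_t^2 = beta_tilde_t

x0 = rng.standard_normal(3)
eps = rng.standard_normal(3)        # true noise
eps_hat = rng.standard_normal(3)    # stand-in for a network output eps_theta(x_t, t)
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# Posterior mean (target) and model mean (noise-prediction parameterization).
mu_tilde = (np.sqrt(abar_prev) * beta_t * x0
            + np.sqrt(alpha_t) * (1.0 - abar_prev) * x_t) / (1.0 - abar_t)
mu_theta = (x_t - beta_t / np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(alpha_t)

# KL between two Gaussians that share covariance sigma2 * I.
kl = np.sum((mu_tilde - mu_theta) ** 2) / (2.0 * sigma2)

# Weighted noise-prediction error from the formula in the text.
weight = beta_t ** 2 / (2.0 * sigma2 * alpha_t * (1.0 - abar_t))
weighted_sq_err = weight * np.sum((eps - eps_hat) ** 2)

print(kl, weighted_sq_err)
```

The two printed numbers should coincide up to rounding, which is the per-sample version of the identity stated above.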

Discarding the weighting factor (which empirically improves sample quality), we get the elegantly simple DDPM objective:

$$\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\varepsilon}}\left[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\|^2\right]$$

where:

- $t \sim \text{Uniform}(1, \ldots, T)$
- $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ (data)
- $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)$
- $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon}$

This is the same objective as denoising score matching! The model learns to predict the noise that was added — equivalent to learning the score function up to a scaling factor.
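One batch evaluation of the simplified objective might look like the sketch below. It is framework-free NumPy with no gradient step; `ddpm_simple_loss`, `zero_model`, and `oracle_model` are hypothetical names, and the "dataset" is a single repeated point so that an exact noise oracle exists for comparison. A real implementation would pass a neural network as `eps_model` and differentiate the loss.

```python
import numpy as np

def ddpm_simple_loss(eps_model, x0_batch, alpha_bars, rng):
    """Monte Carlo estimate of L_simple on one batch of shape (batch, dim)."""
    batch_size = x0_batch.shape[0]
    T = len(alpha_bars)
    t = rng.integers(1, T + 1, size=batch_size)     # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0_batch.shape)       # eps ~ N(0, I)
    abar = alpha_bars[t - 1][:, None]               # broadcast over the feature axis
    x_t = np.sqrt(abar) * x0_batch + np.sqrt(1.0 - abar) * eps
    eps_hat = eps_model(x_t, t)
    return np.mean(np.sum((eps - eps_hat) ** 2, axis=1))

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule
c = np.array([1.0, -2.0])
x0_batch = np.tile(c, (4096, 1))                    # toy data: every sample equals c

def zero_model(x_t, t):
    """Baseline that always predicts zero noise."""
    return np.zeros_like(x_t)

def oracle_model(x_t, t):
    """Exact noise for the toy data, obtained by inverting the closed-form marginal."""
    abar = alpha_bars[t - 1][:, None]
    return (x_t - np.sqrt(abar) * c) / np.sqrt(1.0 - abar)

loss_zero = ddpm_simple_loss(zero_model, x0_batch, alpha_bars, rng)
loss_oracle = ddpm_simple_loss(oracle_model, x0_batch, alpha_bars, rng)
print(loss_zero, loss_oracle)
```

The zero predictor should score close to the data dimension (here 2, the expected squared norm of $\boldsymbol{\varepsilon}$), while the oracle should score essentially zero.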

Connection to Score Matching

Recall from 22-05: $\mathbf{s}_\theta(\mathbf{x}) = -\boldsymbol{\varepsilon}_\theta(\mathbf{x})/\sigma$. For diffusion:

$$\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t|\mathbf{x}_0) = -\frac{\boldsymbol{\varepsilon}}{\sqrt{1-\bar{\alpha}_t}}$$

So the score model is:

$$\mathbf{s}_\theta(\mathbf{x}_t, t) = -\frac{\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$

This explicitly connects diffusion models to the score-based framework from 22-05.
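The score identity can be verified with finite differences. The sketch below uses illustrative values for $\bar{\alpha}_t$, $\mathbf{x}_0$, and $\boldsymbol{\varepsilon}$ (all assumptions), and compares a numerical gradient of the Gaussian log-density with $-\boldsymbol{\varepsilon}/\sqrt{1-\bar{\alpha}_t}$.

```python
import numpy as np

def log_q_xt_given_x0(x_t, x0, abar_t):
    """Log density of N(x_t; sqrt(abar_t) x0, (1 - abar_t) I), up to an additive constant."""
    diff = x_t - np.sqrt(abar_t) * x0
    return -0.5 * np.sum(diff ** 2) / (1.0 - abar_t)

def score_from_eps(eps, abar_t):
    """Convert a noise vector (or noise prediction) into a score."""
    return -eps / np.sqrt(1.0 - abar_t)

abar_t = 0.6
x0 = np.array([1.0, -2.0])
eps = np.array([0.5, -0.3])
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# Central finite differences along each coordinate axis.
h = 1e-5
grad_fd = np.array([
    (log_q_xt_given_x0(x_t + h * e, x0, abar_t)
     - log_q_xt_given_x0(x_t - h * e, x0, abar_t)) / (2.0 * h)
    for e in np.eye(2)
])

print(grad_fd, score_from_eps(eps, abar_t))
```

Because the log-density is quadratic in $\mathbf{x}_t$, the central difference is exact up to rounding, so the two printed vectors should match closely.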

Sampling (DDPM)

To generate a sample:

  1. Sample $\mathbf{x}_T \sim \mathcal{N}(0, I)$
  2. For $t = T, T-1, \ldots, 1$:

     $$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}, \quad \mathbf{z} \sim \mathcal{N}(0, I)$$

     where $\sigma_t^2 = \tilde{\beta}_t$ (or $\sigma_t^2 = \beta_t$ for the simplified version)
  3. Output $\mathbf{x}_0$
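The three steps above can be sketched as a short NumPy loop. The function name `ddpm_sample`, the 200-step linear schedule, and the oracle noise predictor are assumptions for illustration; a trained network would replace `oracle_eps`. One detail not stated in the steps is also an assumption here: no noise is added on the final step ($t = 1$), which is a common convention.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng, use_posterior_variance=True):
    """Ancestral DDPM sampling with a noise predictor eps_model(x_t, t), 1-based t."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    T = len(betas)
    x = rng.standard_normal(shape)                  # step 1: x_T ~ N(0, I)
    for t in range(T, 0, -1):                       # step 2: t = T, ..., 1
        i = t - 1                                   # array index for timestep t
        eps_hat = eps_model(x, t)
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
        if t == 1:
            x = mean                                # assumed convention: z = 0 on the last step
            break
        if use_posterior_variance:
            sigma2 = (1.0 - alpha_bars[i - 1]) / (1.0 - alpha_bars[i]) * betas[i]
        else:
            sigma2 = betas[i]
        x = mean + np.sqrt(sigma2) * rng.standard_normal(shape)
    return x                                        # step 3: output x_0

# Toy check: if the data distribution is a single point c, the exact noise is known.
betas = np.linspace(1e-4, 0.02, 200)
alpha_bars = np.cumprod(1.0 - betas)
c = np.array([1.0, -2.0])

def oracle_eps(x_t, t):
    """Exact noise for point-mass data at c (stands in for a trained network)."""
    abar = alpha_bars[t - 1]
    return (x_t - np.sqrt(abar) * c) / np.sqrt(1.0 - abar)

sample = ddpm_sample(oracle_eps, c.shape, betas, np.random.default_rng(0))
print(sample)
```

With the oracle, the last step maps any $\mathbf{x}_1$ exactly back to `c`, which makes this a convenient smoke test for the indexing and the mean formula.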

Pitfalls

| Pitfall | Why It Happens | Fix |
|---|---|---|
| Poor sample quality | $T$ too small, so there are not enough denoising steps | Use $T \geq 1000$ for DDPM |
| Slow sampling | $T$ large, so there are many sequential steps | Use DDIM (22-07) for fewer steps |
| Bad noise schedule | $\beta_t$ not tuned for data scale | Linear schedule: $\beta_t$ from $10^{-4}$ to $0.02$; cosine schedule often better |
| Low diversity | Model overfits or $T$ too small | More data augmentation, validate $T$ |
| Training instability | Large $\beta_t$ early makes the reverse process hard to learn | Start schedule small; use cosine schedule |


Key Terms

Worked Examples

Example 1: Compute $\bar{\alpha}_t$ and $\mathbf{x}_t$

For a diffusion process with $\beta_t = 0.001$ constant for $t = 1, \ldots, 5$, compute $\bar{\alpha}_5$. If $\mathbf{x}_0 = (1.0, -2.0)$ and $\boldsymbol{\varepsilon} = (0.5, -0.3)$, compute $\mathbf{x}_5$.

Solution:

$\alpha_t = 1 - 0.001 = 0.999$ for all $t$.

$\bar{\alpha}_5 = (0.999)^5 = 0.99501$

$\mathbf{x}_5 = \sqrt{0.99501} \cdot (1.0, -2.0) + \sqrt{1 - 0.99501} \cdot (0.5, -0.3)$

$= (0.9975, -1.9950) + (0.07064 \cdot 0.5, 0.07064 \cdot (-0.3))$

$= (0.9975 + 0.03532, -1.9950 - 0.02119)$

$= (1.0328, -2.0162)$

Answer: $\bar{\alpha}_5 = 0.99501$, $\mathbf{x}_5 = (1.0328, -2.0162)$. After only 5 steps with a very small $\beta$, the data is barely perturbed: $\sqrt{\bar{\alpha}_5} \approx 0.9975$ means 99.75% of the signal amplitude is preserved.
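The arithmetic in Example 1 can be reproduced in a few lines of NumPy. This is only a check of the numbers given in the example, with variable names chosen for illustration.

```python
import numpy as np

beta = 0.001
alpha_bar_5 = (1.0 - beta) ** 5          # constant beta, so abar_5 = alpha^5
x0 = np.array([1.0, -2.0])
eps = np.array([0.5, -0.3])

# Closed-form marginal: x_5 = sqrt(abar_5) * x0 + sqrt(1 - abar_5) * eps
x5 = np.sqrt(alpha_bar_5) * x0 + np.sqrt(1.0 - alpha_bar_5) * eps

print(round(alpha_bar_5, 5))
print(np.round(x5, 4))
```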

Example 2: Derive $\tilde{\boldsymbol{\mu}}_t$ for a Simple Case

Given $\bar{\alpha}_{t-1} = 0.8$, $\bar{\alpha}_t = 0.7$, $\beta_t = 0.125$, $\mathbf{x}_t = (0.5, 0.5)$, $\mathbf{x}_0 = (1.0, 0.0)$. Compute $\tilde{\boldsymbol{\mu}}_t$.

Solution:

$\alpha_t = \bar{\alpha}_t/\bar{\alpha}_{t-1} = 0.7/0.8 = 0.875$

Check: $\beta_t = 1 - \alpha_t = 1 - 0.875 = 0.125$ ✓

Coefficient for $\mathbf{x}_0$:

$$c_0 = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} = \frac{\sqrt{0.8} \cdot 0.125}{1 - 0.7} = \frac{0.8944 \cdot 0.125}{0.3} = \frac{0.1118}{0.3} = 0.3727$$

Coefficient for $\mathbf{x}_t$:

$$c_t = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} = \frac{\sqrt{0.875} \cdot (1 - 0.8)}{0.3} = \frac{0.9354 \cdot 0.2}{0.3} = \frac{0.1871}{0.3} = 0.6236$$

$\tilde{\boldsymbol{\mu}}_t = 0.3727 \cdot (1.0, 0.0) + 0.6236 \cdot (0.5, 0.5) = (0.3727 + 0.3118, 0 + 0.3118) = (0.6845, 0.3118)$

Answer: $\tilde{\boldsymbol{\mu}}_t = (0.6845, 0.3118)$. This is the mean of the posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$, which is the target for our learned $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$.

Example 3: Reverse Sampling Step

Using the DDPM sampler with $\beta_t = 0.001$, $\alpha_t = 0.999$, $\bar{\alpha}_t = 0.9$, and a trained model predicting $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t) = (0.1, -0.2)$. Given $\mathbf{x}_t = (0.5, 0.5)$, $\mathbf{z} = (0.3, -0.5)$, compute $\mathbf{x}_{t-1}$.

Solution:

$\boldsymbol{\mu}_\theta = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\varepsilon}_\theta\right)$

$= \frac{1}{\sqrt{0.999}}\left((0.5, 0.5) - \frac{0.001}{\sqrt{0.1}}(0.1, -0.2)\right)$

$= 1.0005 \cdot \left((0.5, 0.5) - 0.003162 \cdot (0.1, -0.2)\right)$

$= 1.0005 \cdot (0.5 - 0.000316, 0.5 - (-0.000632))$

$= 1.0005 \cdot (0.49968, 0.50063)$

$= (0.49993, 0.50088)$

With $\sigma_t = \sqrt{\beta_t} = \sqrt{0.001} = 0.03162$:

$\mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta + \sigma_t \mathbf{z} = (0.49993, 0.50088) + 0.03162 \cdot (0.3, -0.5)$

$= (0.49993 + 0.00949, 0.50088 - 0.01581) = (0.5094, 0.4851)$

Answer: $\mathbf{x}_{t-1} = (0.5094, 0.4851)$. The model "denoised" from $(0.5, 0.5)$ and added a small stochastic perturbation. Notice the denoising is very subtle because $\beta_t$ is small: the reverse step makes only tiny adjustments.
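Example 3 can be checked the same way. The sketch below plugs the given numbers into the sampler update, using $\sigma_t^2 = \beta_t$ as in the worked solution; the variable names are illustrative.

```python
import numpy as np

beta_t, alpha_t, alpha_bar_t = 0.001, 0.999, 0.9
x_t = np.array([0.5, 0.5])
eps_hat = np.array([0.1, -0.2])   # model prediction given in the example
z = np.array([0.3, -0.5])         # noise draw given in the example

# Reverse-step mean from the noise-prediction parameterization.
mu = (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_t)

# Simplified variance choice: sigma_t^2 = beta_t.
sigma_t = np.sqrt(beta_t)
x_prev = mu + sigma_t * z

print(np.round(mu, 5))
print(np.round(x_prev, 4))
```

Here the stochastic term $\sigma_t \mathbf{z}$ is more than an order of magnitude larger than the shift in the mean, which is typical when $\beta_t$ is small.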

Practice Problems

  1. Prove that if $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon} \sim \mathcal{N}(0,I)$ and $\mathbf{x}_0$ is independent of $\boldsymbol{\varepsilon}$, then $\text{Var}[\mathbf{x}_t] = \bar{\alpha}_t\text{Var}[\mathbf{x}_0] + (1-\bar{\alpha}_t)I$.

    Answer: Since $\mathbb{E}[\boldsymbol{\varepsilon}] = 0$, the centered variable is $\mathbf{x}_t - \mathbb{E}[\mathbf{x}_t] = \sqrt{\bar{\alpha}_t}(\mathbf{x}_0 - \mathbb{E}[\mathbf{x}_0]) + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon}$. Expanding the outer product gives $\text{Cov}[\mathbf{x}_t] = \bar{\alpha}_t\text{Cov}[\mathbf{x}_0] + \sqrt{\bar{\alpha}_t(1-\bar{\alpha}_t)}\left(\text{Cov}[\mathbf{x}_0, \boldsymbol{\varepsilon}] + \text{Cov}[\boldsymbol{\varepsilon}, \mathbf{x}_0]\right) + (1-\bar{\alpha}_t)\text{Cov}[\boldsymbol{\varepsilon}]$. By independence the cross-covariance terms vanish, and $\text{Cov}[\boldsymbol{\varepsilon}] = I$, so $\text{Var}[\mathbf{x}_t] = \bar{\alpha}_t\text{Var}[\mathbf{x}_0] + (1-\bar{\alpha}_t)I$. No zero-mean assumption on $\mathbf{x}_0$ is needed. If data is normalized ($\text{Var}[\mathbf{x}_0] = I$), then $\text{Var}[\mathbf{x}_t] = I$ for all $t$, which is the variance-preserving property.

  2. Show that as $\bar{\alpha}_T \to 0$, the forward process converges to $\mathcal{N}(0, I)$. What noise schedule ensures $\bar{\alpha}_T \approx 0$?

    Answer: $q(\mathbf{x}_T|\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1-\bar{\alpha}_T)I)$. As $\bar{\alpha}_T \to 0$, the mean goes to 0 and the covariance to $I$, regardless of $\mathbf{x}_0$. For the linear schedule $\beta_t = 10^{-4} + (t-1)\frac{0.02 - 10^{-4}}{T-1}$ with $T=1000$: $\bar{\alpha}_T = \prod_{t=1}^{1000}(1-\beta_t) \approx \exp(-\sum_t \beta_t) \approx e^{-10} \approx 4.5 \times 10^{-5} \approx 0$. ✓

  3. Derive $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$ from the formula for the conditional Gaussian $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$.

    Answer: Using Bayes' rule for Gaussians: $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \propto q(\mathbf{x}_t|\mathbf{x}_{t-1})q(\mathbf{x}_{t-1}|\mathbf{x}_0)$. Both factors are Gaussian: $q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{\alpha_t}\mathbf{x}_{t-1}, \beta_t I)$ and $q(\mathbf{x}_{t-1}|\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0, (1-\bar{\alpha}_{t-1})I)$. Viewed as functions of $\mathbf{x}_{t-1}$, their precisions (inverse variances) add: $\frac{1}{\tilde{\beta}_t} = \frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}$. Solving: $\tilde{\beta}_t = \frac{\beta_t(1-\bar{\alpha}_{t-1})}{\alpha_t(1-\bar{\alpha}_{t-1}) + \beta_t} = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$, using $\alpha_t(1-\bar{\alpha}_{t-1}) + \beta_t = 1 - \bar{\alpha}_t$.

  4. Explain why the simplified DDPM objective drops the weighting factor $\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)}$. What effect does this have?

    Answer: The weighting factor heavily upweights training at small $t$ (when $1-\bar{\alpha}_t$ is small). Dropping it weights all $t$ equally. Empirically, this improves sample quality: the model learns to denoise at all noise levels equally well, not just the subtle denoising steps. The simplified objective is not a proper variational bound but works better as a training signal.

  5. For the forward process $q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t I)$, verify that the variance of $\mathbf{x}_t$ remains 1 if $\mathbf{x}_0 \sim \mathcal{N}(0, I)$ and $\beta_t$ are chosen to preserve variance.

    Answer: $\text{Var}[\mathbf{x}_t] = (1-\beta_t)\text{Var}[\mathbf{x}_{t-1}] + \beta_t = (1-\beta_t) \cdot 1 + \beta_t = 1$. The variance-preserving (VP) SDE formulation keeps the marginal variance constant. This contrasts with variance-exploding (VE) SDEs where variance grows over time.
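Problems 2 and 5 lend themselves to a quick numerical check. The sketch below builds the linear schedule from Problem 2 with $T = 1000$, computes $\bar{\alpha}_T$ exactly and via the first-order approximation $\exp(-\sum_t \beta_t)$, and then iterates the variance recursion from Problem 5 starting at unit variance. Variable names are illustrative.

```python
import numpy as np

T = 1000
# beta_t = 1e-4 + (t - 1) * (0.02 - 1e-4) / (T - 1) for t = 1..T
betas = 1e-4 + np.arange(T) * (0.02 - 1e-4) / (T - 1)

# Problem 2: exact product versus the first-order approximation.
alpha_bar_T = np.prod(1.0 - betas)
first_order = np.exp(-betas.sum())
print(alpha_bar_T, first_order)

# Problem 5: Var[x_t] = (1 - beta_t) * Var[x_{t-1}] + beta_t, starting from Var[x_0] = 1.
var = 1.0
for b in betas:
    var = (1.0 - b) * var + b
print(var)
```

Because $\log(1-\beta) < -\beta$, the exact product comes out somewhat below the first-order estimate, but both are on the order of $10^{-5}$, and the variance stays at 1 throughout.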


Summary

Key takeaways:


Quiz

  1. The closed-form $q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I)$ enables:

     - A) Faster sampling
     - B) Efficient training by sampling any $t$ in one step
     - C) Better sample quality
     - D) Smaller model size

     Correct: B)

     - If you chose B: Without the closed form, training would require iterating through all $t$ steps for each datapoint. The closed form lets you draw $\mathbf{x}_t$ for a random $t$ in $O(1)$.
     - If you chose A: Sampling still requires $T$ sequential steps; that is addressed by DDIM.
     - If you chose C: The closed form helps training efficiency but doesn't directly improve quality.
     - If you chose D: Unrelated to model size.

  2. In DDPM, the model $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)$ directly predicts:

     - A) $\mathbf{x}_0$
     - B) $\mathbf{x}_{t-1}$
     - C) The noise $\boldsymbol{\varepsilon}$ added to $\mathbf{x}_0$
     - D) The score $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$

     Correct: C)

     - If you chose C: The loss is $\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta\|^2$. The model is a noise predictor.
     - If you chose D: The score is proportional to $-\boldsymbol{\varepsilon}_\theta$, so it's equivalent, but the DDPM parameterization directly outputs noise.

  3. The forward process scaling $\sqrt{1-\beta_t}$ (rather than 1) is used to:

     - A) Make the process converge faster
     - B) Preserve variance (variance-preserving diffusion)
     - C) Increase the noise at each step
     - D) Simplify the reverse process

     Correct: B)

     - If you chose B: $\text{Var}[\mathbf{x}_t] = (1-\beta_t)\text{Var}[\mathbf{x}_{t-1}] + \beta_t$. If $\text{Var}[\mathbf{x}_{t-1}] = 1$, then $\text{Var}[\mathbf{x}_t] = 1$. Without the scaling, variance would grow monotonically.
     - If you chose A: How fast the chain reaches noise is set by the $\beta_t$ schedule. The scaling's role is to shrink the signal so the variance stays bounded.
     - If you chose C: The noise added per step is set by $\beta_t$. The scaling shrinks the previous state; it does not add noise.
     - If you chose D: The scaling choice affects both forward and reverse, not simplifying either.

  4. The KL term $L_T$ is dropped from the variational bound in practice because:

     - A) It's intractable
     - B) With enough steps, $q(\mathbf{x}_T|\mathbf{x}_0) \approx \mathcal{N}(0,I)$, so the KL is near zero
     - C) It's computationally too expensive
     - D) It's exactly zero for any noise schedule

     Correct: B)

     - If you chose B: With properly chosen $T$ and $\beta_t$, $\bar{\alpha}_T \approx 0$, so $q(\mathbf{x}_T|\mathbf{x}_0) \approx \mathcal{N}(0,I) = p(\mathbf{x}_T)$. The KL divergence approaches zero. With a fixed schedule, $L_T$ also contains no trainable parameters.
     - If you chose A: It has a closed form, the KL between two Gaussians.
     - If you chose C: Gaussian KL is $O(d)$.
     - If you chose D: Only zero if distributions match perfectly.

  5. The optimal $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ targets $\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)$, which is the mean of:

     - A) $q(\mathbf{x}_t|\mathbf{x}_{t-1})$
     - B) $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$
     - C) $q(\mathbf{x}_0|\mathbf{x}_t)$
     - D) $p_\theta(\mathbf{x}_t|\mathbf{x}_{t-1})$

     Correct: B)

     - If you chose B: $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is the true posterior given both the noisy observation $\mathbf{x}_t$ and the clean data $\mathbf{x}_0$. The reverse model $p_\theta$ tries to approximate this without access to $\mathbf{x}_0$.
     - If you chose A: That's the forward process, which is known analytically.
     - If you chose C: That's a different conditional, the full denoising distribution.
     - If you chose D: That's not the target; the learned reverse step is $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$.

Next Steps

22-07 — Diffusion Models: Advanced — DDIM for accelerated sampling, classifier guidance, classifier-free guidance, and the SDE/ODE formulations that unify diffusion and score-based models.


Pitfalls

  1. Using the simplified loss without understanding its bias: The DDPM simplified loss $\mathcal{L}_{\text{simple}} = \mathbb{E}[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta\|^2]$ drops the per-timestep weighting factor $\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)}$. This means the model is not optimized for the variational bound; it is trained with equal weight across all $t$. While this empirically improves sample quality, it means the training loss is no longer a valid lower bound on $\log p(\mathbf{x})$. For likelihood evaluation, use the full weighted variational bound or the probability flow ODE.

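  2. Setting $T$ too small and expecting quality: DDPM typically uses $T \approx 1000$ steps so that each $\beta_t$ is small and the true reverse conditional $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is well approximated by a Gaussian. Note that the posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is exactly Gaussian for any $T$; it is $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$, with $\mathbf{x}_0$ marginalized out, that the learned Gaussian $p_\theta$ has to match. With $T = 100$ covering the same overall noise range, the per-step $\beta_t$ are larger and that Gaussian approximation breaks down, leading to poor sample quality. If you need fewer sampling steps, switch to DDIM or a higher-order ODE solver rather than just reducing $T$ in DDPM.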

  3. Confusing SNR(t) with $\bar{\alpha}_t$ in noise schedule design: The key quantity controlling training difficulty is $\text{SNR}(t) = \bar{\alpha}_t/(1 - \bar{\alpha}_t)$, not $\bar{\alpha}_t$ directly. A cosine schedule keeps SNR high at intermediate $t$ (preserving more signal for the model to learn from), while a linear schedule drops SNR rapidly. Always plot SNR vs. $t$ when designing or comparing schedules, not just $\bar{\alpha}_t$.

  4. Applying DDPM sampling with the wrong variance choice: DDPM can use either $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$ for the reverse process variance. Using $\sigma_t^2 = \beta_t$ when $\tilde{\beta}_t$ is expected (or vice versa) can cause a distribution shift. For best log-likelihood, use the learned variance $\Sigma_\theta(\mathbf{x}_t, t)$; for best sample quality, $\sigma_t^2 = \beta_t$ is often preferred.
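To make the schedule pitfall concrete, the sketch below computes $\text{SNR}(t) = \bar{\alpha}_t/(1-\bar{\alpha}_t)$ for a linear schedule and for a cosine schedule. The cosine form used here, $\bar{\alpha}_t = f(t)/f(0)$ with $f(t) = \cos^2\left(\frac{t/T + s}{1+s}\cdot\frac{\pi}{2}\right)$ and $s = 0.008$, follows Nichol and Dhariwal (2021) and is an assumption about which cosine schedule is meant; the clipping of $\beta_t$ that is usually applied in practice is omitted. Function names are hypothetical.

```python
import numpy as np

def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for a linear beta schedule, t = 1..T."""
    return np.cumprod(1.0 - np.linspace(beta_start, beta_end, T))

def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2), t = 1..T."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return (f / f[0])[1:]

def snr(alpha_bars):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar)."""
    return alpha_bars / (1.0 - alpha_bars)

T = 1000
lin = linear_alpha_bars(T)
cos = cosine_alpha_bars(T)

# Compare the two schedules at a few timesteps (1-based t).
for t in (100, 500, 900):
    print(t, snr(lin)[t - 1], snr(cos)[t - 1])
```

Printing or plotting these values on a log scale shows where each schedule spends its steps; at the midpoint the cosine schedule retains noticeably more signal than the linear one.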




Q6: The variational bound for diffusion models involves $D_{\text{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \| p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))$. Why is $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ tractable?

- A) It's analytically derived from the known Gaussian forward transitions using Bayes' rule.
- B) It's estimated via Monte Carlo with a single sample.
- C) The neural network learns it.
- D) It's always $\mathcal{N}(0, I)$.

Answer and Explanations

**Correct: A)** $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = q(\mathbf{x}_t|\mathbf{x}_{t-1})q(\mathbf{x}_{t-1}|\mathbf{x}_0) / q(\mathbf{x}_t|\mathbf{x}_0)$. All three terms are known Gaussians. The product-and-division of Gaussians yields another Gaussian, with mean $\tilde{\boldsymbol{\mu}}_t$ and variance $\tilde{\beta}_t$, both given by closed-form expressions. This tractability is what makes the variational bound and training objective computable.

- B) It's exact, not estimated.
- C) $q$ is the forward process; it's fixed by the noise schedule, not learned.
- D) That's the prior $p(\mathbf{x}_T)$, not the conditional posterior.