22-06 — Diffusion Models: Foundations
Phase: 22 — Generative Models Mathematics
Subject: 22-06
Prerequisites: 22-05 — Score-Based Generative Models, Phase 13 (Probability, Gaussians), Phase 06 (Stochastic Calculus basics)
Next subject: 22-07 — Diffusion Models: Advanced
Learning Objectives
By the end of this subject, you will be able to:
- Derive the forward diffusion process and its closed-form marginal $q(\mathbf{x}_t|\mathbf{x}_0)$
- Formulate the reverse diffusion process as a learned Gaussian transition
- Derive the simplified DDPM training objective $\mathbb{E}[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\|^2]$
- Explain the connection between diffusion models and score matching
- Implement the forward noising, reverse denoising, and sampling procedures
Core Content
The Forward (Diffusion) Process
Diffusion models gradually destroy data structure by adding Gaussian noise over $T$ timesteps. Given a data point $\mathbf{x}_0 \sim q(\mathbf{x}_0)$, the forward process is a Markov chain:
$$q(\mathbf{x}_{1:T}|\mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t|\mathbf{x}_{t-1})$$
where each step adds a small amount of Gaussian noise:
$$q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{1 - \beta_t}\;\mathbf{x}_{t-1}, \beta_t I\right)$$
- $\beta_t \in (0, 1)$ is the noise schedule — typically small, increasing over time
- $\sqrt{1-\beta_t}$ scaling ensures variance doesn't explode (variance-preserving process)
- After enough steps, $\mathbf{x}_T$ is approximately distributed as $\mathcal{N}(0, I)$ (pure noise), regardless of $\mathbf{x}_0$
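The step-by-step chain described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `linear_beta_schedule` and `forward_step`, the toy data, and the linear schedule endpoints ($10^{-4}$ to $0.02$, taken from the Pitfalls table) are assumptions chosen for the example.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas; index 0 holds beta_1 (assumed schedule)."""
    return np.linspace(beta_start, beta_end, T)

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
betas = linear_beta_schedule(T=1000)

# Toy "data" that is deliberately far from N(0, I): mean 5, std 3.
x = rng.standard_normal((20_000, 2)) * 3.0 + 5.0
for beta_t in betas:
    x = forward_step(x, beta_t, rng)

# After T steps the empirical mean and variance should be close to 0 and 1.
print(x.mean(axis=0), x.var(axis=0))
```

Iterating like this costs $T$ steps per sample, which is exactly the cost the closed-form marginal avoids.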
⚠️ CRITICAL — Closed-Form Marginal
The beauty of Gaussian diffusion: we can sample $\mathbf{x}_t$ directly from $\mathbf{x}_0$ without iterating through $t$ steps. Define:
$$\alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$
Then:
$$q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\;\mathbf{x}_0, (1 - \bar{\alpha}_t)I\right)$$
Proof by induction:
Base case $t=1$: $q(\mathbf{x}_1|\mathbf{x}_0) = \mathcal{N}(\sqrt{\alpha_1}\mathbf{x}_0, (1-\alpha_1)I) = \mathcal{N}(\sqrt{\bar{\alpha}_1}\mathbf{x}_0, (1-\bar{\alpha}_1)I)$ ✓
Inductive step: Assume true for $t-1$. Using the reparameterization trick:
$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\boldsymbol{\varepsilon}_{t-1}, \quad \boldsymbol{\varepsilon}_{t-1} \sim \mathcal{N}(0,I)$$
Then $\mathbf{x}_t = \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\boldsymbol{\varepsilon}_t$, with $\boldsymbol{\varepsilon}_t \sim \mathcal{N}(0,I)$ independent of $\boldsymbol{\varepsilon}_{t-1}$:
$$\mathbf{x}_t = \sqrt{\alpha_t}\left(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\boldsymbol{\varepsilon}_{t-1}\right) + \sqrt{1-\alpha_t}\boldsymbol{\varepsilon}_t$$
The signal term is $\sqrt{\alpha_t\bar{\alpha}_{t-1}}\mathbf{x}_0 = \sqrt{\bar{\alpha}_t}\mathbf{x}_0$. The noise terms are independent zero-mean Gaussians, so their variances add: $\sqrt{\alpha_t(1-\bar{\alpha}_{t-1})}\boldsymbol{\varepsilon}_{t-1} + \sqrt{1-\alpha_t}\boldsymbol{\varepsilon}_t \sim \mathcal{N}(0, (\alpha_t(1-\bar{\alpha}_{t-1}) + 1 - \alpha_t)I)$
$$= \mathcal{N}(0, (1 - \alpha_t\bar{\alpha}_{t-1})I) = \mathcal{N}(0, (1 - \bar{\alpha}_t)I)$$
Therefore $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon}$ where $\boldsymbol{\varepsilon} \sim \mathcal{N}(0,I)$. ✓
Equivalently:
$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)$$
This is hugely important for efficient training: we can sample $\mathbf{x}_t$ at any timestep in one step.
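As a sketch of how the closed-form marginal is used, the snippet below precomputes $\bar{\alpha}_t$ with a cumulative product and draws $\mathbf{x}_t$ in one step. It also checks numerically the variance identity from the inductive step. The name `q_sample` and the linear schedule are assumptions made for illustration.

```python
import numpy as np

def q_sample(x0, t, alpha_bar, eps):
    """Closed-form draw of x_t given x_0; t is 1-indexed, alpha_bar[0] is alpha_bar_1."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

# Inductive step: alpha_t * (1 - alpha_bar_{t-1}) + (1 - alpha_t) == 1 - alpha_bar_t
lhs = alphas[1:] * (1.0 - alpha_bar[:-1]) + (1.0 - alphas[1:])
rhs = 1.0 - alpha_bar[1:]
print(np.allclose(lhs, rhs))

rng = np.random.default_rng(0)
x0 = np.array([1.0, -2.0])
x_500 = q_sample(x0, 500, alpha_bar, rng.standard_normal(2))
print(x_500)
```

Because `alpha_bar` is a lookup table, sampling at a random timestep costs the same as sampling at $t = 1$.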
The Reverse (Denoising) Process
The reverse process starts from pure noise $\mathbf{x}_T \sim \mathcal{N}(0, I)$ and gradually denoises:
$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$$
where each reverse step is learned as a Gaussian:
$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 I\right)$$
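When $\beta_t$ is small, the true reverse conditional $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ has approximately the same functional form as the forward step (Gaussian). This is a classical property of diffusion processes (Feller, Kolmogorov), and it is what justifies modeling each reverse step as a Gaussian.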
The Variational Bound
We train by maximizing the variational lower bound on $\log p_\theta(\mathbf{x}_0)$:
$$\log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_q\left[\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right] = \mathcal{L}_{\text{VLB}}$$
After algebraic manipulation (expanding the Markov chains and using the Gaussian forms), this decomposes into:
$$\mathcal{L}_{\text{VLB}} = \mathbb{E}_q\left[\underbrace{-D_{\text{KL}}(q(\mathbf{x}_T|\mathbf{x}_0) \| p(\mathbf{x}_T))}_{L_T} - \sum_{t=2}^{T} \underbrace{D_{\text{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \| p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))}_{L_{t-1}} + \underbrace{\log p_\theta(\mathbf{x}_0|\mathbf{x}_1)}_{L_0}\right]$$
⚠️ CRITICAL — The Key Insight: $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is Tractable
Using Bayes' rule and the known Gaussian forms:
$$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t|\mathbf{x}_{t-1}) q(\mathbf{x}_{t-1}|\mathbf{x}_0)}{q(\mathbf{x}_t|\mathbf{x}_0)}$$
After Gaussian algebra, this is also Gaussian:
$$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t I\right)$$
where:
$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\mathbf{x}_t$$
$$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\beta_t$$
Since $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is a known Gaussian, each $L_{t-1}$ term is a KL divergence between two Gaussians, which has a closed form.
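The posterior formulas translate directly into code. The sketch below uses a hypothetical helper `q_posterior` that takes $\bar{\alpha}_{t-1}$ and $\bar{\alpha}_t$ and recovers $\alpha_t$ and $\beta_t$ from them; the input numbers are the ones used in Worked Example 2.

```python
import numpy as np

def q_posterior(x_t, x0, alpha_bar_prev, alpha_bar_t):
    """Mean and (scalar) variance of q(x_{t-1} | x_t, x_0)."""
    alpha_t = alpha_bar_t / alpha_bar_prev
    beta_t = 1.0 - alpha_t
    c0 = np.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)      # weight on x_0
    ct = np.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)  # weight on x_t
    mean = c0 * x0 + ct * x_t
    var = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t
    return mean, var

mean, var = q_posterior(np.array([0.5, 0.5]), np.array([1.0, 0.0]), 0.8, 0.7)
print(np.round(mean, 4), round(var, 6))

# Cross-check: the posterior precision is the sum of the two Gaussian precisions.
alpha_t, beta_t = 0.875, 0.125
precision = alpha_t / beta_t + 1.0 / (1.0 - 0.8)
print(np.isclose(1.0 / var, precision))
```

The precision cross-check is the same identity that Practice Problem 3 asks you to derive by hand.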
The Simplified Training Objective
Through clever reparameterization, Ho et al. (2020) showed that the variational bound simplifies dramatically. Express $\mathbf{x}_0$ in terms of $\mathbf{x}_t$ and noise:
$$\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon})$$
Substituting into $\tilde{\boldsymbol{\mu}}_t$ and parameterizing $\boldsymbol{\mu}_\theta$ to match:
$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\right)$$
The KL divergence $L_{t-1}$ reduces to:
$$L_{t-1} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\varepsilon}}\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)} \|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\|^2\right]$$
Discarding the weighting factor (which empirically improves sample quality), we get the elegantly simple DDPM objective:
$$\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\varepsilon}}\left[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\|^2\right]$$
where:
- $t \sim \text{Uniform}(\{1, \ldots, T\})$
- $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ (data)
- $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)$
- $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon}$
This is the same objective as denoising score matching! The model learns to predict the noise that was added — equivalent to learning the score function up to a scaling factor.
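To make the objective concrete, here is a minimal NumPy sketch of one Monte Carlo estimate of $\mathcal{L}_{\text{simple}}$ on a batch. It is an illustration under assumptions: `ddpm_simple_loss` is a hypothetical helper, the schedule is linear, and `zero_eps_model` is a placeholder standing in for a trained network (a real model would be a neural network trained by gradient descent on this loss).

```python
import numpy as np

def ddpm_simple_loss(eps_model, x0, alpha_bar, rng):
    """One Monte Carlo estimate of L_simple on a batch x0 of shape (B, D)."""
    B = x0.shape[0]
    T = alpha_bar.shape[0]
    t = rng.integers(1, T + 1, size=B)              # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)             # eps ~ N(0, I)
    a = alpha_bar[t - 1][:, None]                   # shape (B, 1) for broadcasting
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps  # closed-form forward sample
    residual = eps - eps_model(x_t, t)
    return np.mean(np.sum(residual ** 2, axis=1))   # squared norm, averaged over batch

def zero_eps_model(x_t, t):
    """Hypothetical placeholder: always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((4096, 2))
loss = ddpm_simple_loss(zero_eps_model, x0, alpha_bar, rng)
print(loss)  # close to D = 2, because E||eps||^2 = D when the prediction is zero
```

The zero predictor gives a useful baseline: any trained model should bring the loss below the data dimension $D$.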
Connection to Score Matching
Recall from 22-05: $\mathbf{s}_\theta(\mathbf{x}) = -\boldsymbol{\varepsilon}_\theta(\mathbf{x})/\sigma$. For diffusion:
$$\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t|\mathbf{x}_0) = -\frac{\boldsymbol{\varepsilon}}{\sqrt{1-\bar{\alpha}_t}}$$
So the score model is:
$$\mathbf{s}_\theta(\mathbf{x}_t, t) = -\frac{\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$
This explicitly connects diffusion models to the score-based framework from 22-05.
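The score identity can be checked numerically. The sketch below compares a finite-difference gradient of $\log q(\mathbf{x}_t|\mathbf{x}_0)$ with $-\boldsymbol{\varepsilon}/\sqrt{1-\bar{\alpha}_t}$; the helper names and the specific values of $\bar{\alpha}_t$, $\mathbf{x}_0$, and $\boldsymbol{\varepsilon}$ are arbitrary choices for illustration.

```python
import numpy as np

def log_q(x_t, x0, a_bar):
    """log N(x_t; sqrt(a_bar) * x0, (1 - a_bar) * I), dropping the additive constant."""
    diff = x_t - np.sqrt(a_bar) * x0
    return -0.5 * np.sum(diff ** 2) / (1.0 - a_bar)

def score_from_eps(eps, a_bar):
    """Score implied by a noise value: -eps / sqrt(1 - a_bar)."""
    return -eps / np.sqrt(1.0 - a_bar)

a_bar = 0.7
x0 = np.array([1.0, 0.0])
eps = np.array([0.5, -0.3])
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

# Central finite differences of log q with respect to each coordinate of x_t.
h = 1e-5
grad = np.array([
    (log_q(x_t + h * e, x0, a_bar) - log_q(x_t - h * e, x0, a_bar)) / (2 * h)
    for e in np.eye(2)
])
print(grad, score_from_eps(eps, a_bar))
```

Both printed vectors should agree to numerical precision, which is the "up to a scaling factor" equivalence between noise prediction and score estimation.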
Sampling (DDPM)
To generate a sample:
- Sample $\mathbf{x}_T \sim \mathcal{N}(0, I)$
- For $t = T, T-1, \ldots, 1$: $$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}, \quad \mathbf{z} \sim \mathcal{N}(0, I)$$ where $\sigma_t^2 = \tilde{\beta}_t$ (the posterior variance) or $\sigma_t^2 = \beta_t$ (the simpler forward-variance choice). Ho et al. (2020) set $\mathbf{z} = 0$ at the final step $t = 1$.
- Output $\mathbf{x}_0$
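The sampling procedure above can be sketched as a short loop. This is an illustrative sketch, not a definitive implementation: `ddpm_sample` and its `use_posterior_var` flag are hypothetical names, skipping the noise at $t = 1$ follows the convention in Ho et al. (2020), and `gaussian_eps_model` is a stand-in for a trained network. It is the exact noise predictor only for the toy case where the data itself is $\mathcal{N}(0, I)$, since then $\mathbb{E}[\boldsymbol{\varepsilon}|\mathbf{x}_t] = \sqrt{1-\bar{\alpha}_t}\,\mathbf{x}_t$.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng, use_posterior_var=True):
    """Ancestral sampling loop; betas[0] holds beta_1."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    T = betas.shape[0]
    x = rng.standard_normal(shape)                       # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        i = t - 1                                        # 0-based index for step t
        eps_hat = eps_model(x, t)
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_hat) / np.sqrt(alphas[i])
        if t == 1:
            x = mean                                     # no noise on the final step
            break
        if use_posterior_var:
            var = (1.0 - alpha_bar[i - 1]) / (1.0 - alpha_bar[i]) * betas[i]  # beta-tilde_t
        else:
            var = betas[i]                               # sigma_t^2 = beta_t
        x = mean + np.sqrt(var) * rng.standard_normal(shape)
    return x

betas = np.linspace(1e-4, 0.02, 1000)                    # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def gaussian_eps_model(x_t, t):
    """Toy predictor: exact E[eps | x_t] when the data distribution is N(0, I)."""
    return np.sqrt(1.0 - alpha_bar[t - 1]) * x_t

rng = np.random.default_rng(0)
samples = ddpm_sample(gaussian_eps_model, (5000, 2), betas, rng, use_posterior_var=False)
print(samples.mean(axis=0), samples.var(axis=0))         # close to 0 and 1
```

With this toy predictor and $\sigma_t^2 = \beta_t$, each reverse step maps $\mathcal{N}(0, I)$ back to $\mathcal{N}(0, I)$, so the output statistics give a quick sanity check on the loop's indexing.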
Pitfalls
| Pitfall | Why It Happens | Fix |
|---|---|---|
| Poor sample quality | $T$ too small — not enough denoising steps | Use $T \geq 1000$ for DDPM |
| Slow sampling | $T$ large — many sequential steps | Use DDIM (22-07) for fewer steps |
| Bad noise schedule | $\beta_t$ not tuned for data scale | Linear schedule: $\beta_t$ from $10^{-4}$ to $0.02$; cosine schedule often better |
| Low diversity | Model overfits or $T$ too small | More data augmentation, validate $T$ |
| Training instability | Large $\beta_t$ early → hard to learn reverse | Start schedule small; use cosine schedule |
Key Terms
- DDPM objective
Worked Examples
Example 1: Compute $\bar{\alpha}_t$ and $\mathbf{x}_t$
For a diffusion process with $\beta_t = 0.001$ constant for $t = 1, \ldots, 5$, compute $\bar{\alpha}_5$. If $\mathbf{x}_0 = (1.0, -2.0)$ and $\boldsymbol{\varepsilon} = (0.5, -0.3)$, compute $\mathbf{x}_5$.
Solution:
$\alpha_t = 1 - 0.001 = 0.999$ for all $t$.
$\bar{\alpha}_5 = (0.999)^5 = 0.99501$
$\mathbf{x}_5 = \sqrt{0.99501} \cdot (1.0, -2.0) + \sqrt{1 - 0.99501} \cdot (0.5, -0.3)$
$= (0.9975, -1.9950) + (0.07064 \cdot 0.5, 0.07064 \cdot (-0.3))$
$= (0.9975 + 0.03532, -1.9950 - 0.02119)$
$= (1.0328, -2.0162)$
Answer: $\bar{\alpha}_5 = 0.99501$, $\mathbf{x}_5 = (1.0328, -2.0162)$. After only 5 steps with a very small $\beta$, the data is barely perturbed: $\sqrt{\bar{\alpha}_5} \approx 0.9975$ means 99.75% of the signal amplitude is preserved.
Example 2: Derive $\tilde{\boldsymbol{\mu}}_t$ for a Simple Case
Given $\bar{\alpha}_{t-1} = 0.8$, $\bar{\alpha}_t = 0.7$, $\beta_t = 0.125$, $\mathbf{x}_t = (0.5, 0.5)$, $\mathbf{x}_0 = (1.0, 0.0)$. Compute $\tilde{\boldsymbol{\mu}}_t$.
Solution:
$\alpha_t = \bar{\alpha}_t/\bar{\alpha}_{t-1} = 0.7/0.8 = 0.875$
Check: $\beta_t = 1 - \alpha_t = 1 - 0.875 = 0.125$ ✓
Coefficient for $\mathbf{x}_0$: $$c_0 = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} = \frac{\sqrt{0.8} \cdot 0.125}{1 - 0.7} = \frac{0.8944 \cdot 0.125}{0.3} = \frac{0.1118}{0.3} = 0.3727$$
Coefficient for $\mathbf{x}_t$: $$c_t = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} = \frac{\sqrt{0.875} \cdot (1 - 0.8)}{0.3} = \frac{0.9354 \cdot 0.2}{0.3} = \frac{0.1871}{0.3} = 0.6236$$
$\tilde{\boldsymbol{\mu}}_t = 0.3727 \cdot (1.0, 0.0) + 0.6236 \cdot (0.5, 0.5) = (0.3727 + 0.3118, 0 + 0.3118) = (0.6845, 0.3118)$
Answer: $\tilde{\boldsymbol{\mu}}_t = (0.6845, 0.3118)$. This is the optimal mean for $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$, the target for our learned $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$.
Example 3: Reverse Sampling Step
Using the DDPM sampler with $\beta_t = 0.001$, $\alpha_t = 0.999$, $\bar{\alpha}_t = 0.9$, and a trained model predicting $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t) = (0.1, -0.2)$. Given $\mathbf{x}_t = (0.5, 0.5)$, $\mathbf{z} = (0.3, -0.5)$, compute $\mathbf{x}_{t-1}$.
Solution:
$\boldsymbol{\mu}_\theta = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\varepsilon}_\theta\right)$
$= \frac{1}{\sqrt{0.999}}\left((0.5, 0.5) - \frac{0.001}{\sqrt{0.1}}(0.1, -0.2)\right)$
$= 1.0005 \cdot \left((0.5, 0.5) - 0.003162 \cdot (0.1, -0.2)\right)$
$= 1.0005 \cdot (0.5 - 0.000316, 0.5 - (-0.000632))$
$= 1.0005 \cdot (0.49968, 0.50063)$
$= (0.49993, 0.50088)$
With $\sigma_t = \sqrt{\beta_t} = \sqrt{0.001} = 0.03162$:
$\mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta + \sigma_t \mathbf{z} = (0.49993, 0.50088) + 0.03162 \cdot (0.3, -0.5)$
$= (0.49993 + 0.00949, 0.50088 - 0.01581) = (0.5094, 0.4851)$
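The arithmetic in this example can be reproduced with a few lines of NumPy. The variable names are chosen for illustration; all numeric inputs are the ones given in the problem statement.

```python
import numpy as np

beta_t, alpha_t, alpha_bar_t = 0.001, 0.999, 0.9
x_t = np.array([0.5, 0.5])
eps_hat = np.array([0.1, -0.2])   # model output given in the example
z = np.array([0.3, -0.5])         # fixed noise draw given in the example

# Reverse mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)
mu = (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_t)

sigma_t = np.sqrt(beta_t)         # the sigma_t^2 = beta_t choice used in the example
x_prev = mu + sigma_t * z
print(np.round(mu, 5), np.round(x_prev, 4))
```

Rounding to the same number of digits as the hand calculation makes the comparison direct.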
Answer: $\mathbf{x}_{t-1} = (0.5094, 0.4851)$. The model "denoised" from $(0.5, 0.5)$ and added a small stochastic perturbation. Notice the denoising is very subtle because $\beta_t$ is small: the reverse step makes only tiny adjustments.
Practice Problems
- Prove that if $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon} \sim \mathcal{N}(0,I)$ and $\mathbf{x}_0$ is independent of $\boldsymbol{\varepsilon}$, then $\text{Var}[\mathbf{x}_t] = \bar{\alpha}_t\text{Var}[\mathbf{x}_0] + (1-\bar{\alpha}_t)I$.
  Answer: Let $\mathbf{m} = \mathbb{E}[\mathbf{x}_0]$. Since $\mathbb{E}[\boldsymbol{\varepsilon}] = 0$, we have $\mathbf{x}_t - \mathbb{E}[\mathbf{x}_t] = \sqrt{\bar{\alpha}_t}(\mathbf{x}_0 - \mathbf{m}) + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon}$. Expanding the outer product: $\text{Cov}[\mathbf{x}_t] = \bar{\alpha}_t\mathbb{E}[(\mathbf{x}_0 - \mathbf{m})(\mathbf{x}_0 - \mathbf{m})^T] + \sqrt{\bar{\alpha}_t(1-\bar{\alpha}_t)}(\mathbb{E}[\mathbf{x}_0 - \mathbf{m}]\mathbb{E}[\boldsymbol{\varepsilon}]^T + \mathbb{E}[\boldsymbol{\varepsilon}]\mathbb{E}[\mathbf{x}_0 - \mathbf{m}]^T) + (1-\bar{\alpha}_t)\mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T]$, where independence lets the cross expectations factor. The cross terms vanish because $\mathbb{E}[\boldsymbol{\varepsilon}] = 0$, so $\text{Cov}[\mathbf{x}_t] = \bar{\alpha}_t\text{Var}[\mathbf{x}_0] + (1-\bar{\alpha}_t)I$, with no assumption on the mean of $\mathbf{x}_0$. If data is normalized ($\text{Var}[\mathbf{x}_0] = I$), then $\text{Var}[\mathbf{x}_t] = I$ for all $t$: the variance-preserving property.
- Show that as $\bar{\alpha}_T \to 0$, the forward process converges to $\mathcal{N}(0, I)$. What noise schedule ensures $\bar{\alpha}_T \approx 0$?
  Answer: $q(\mathbf{x}_T|\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1-\bar{\alpha}_T)I)$. As $\bar{\alpha}_T \to 0$, the mean goes to 0 and the covariance to $I$, regardless of $\mathbf{x}_0$. For the linear schedule $\beta_t = 10^{-4} + (t-1)\frac{0.02 - 10^{-4}}{T-1}$ with $T=1000$: $\bar{\alpha}_T = \prod_{t=1}^{1000}(1-\beta_t) \approx \exp\left(-\sum_t \beta_t\right) \approx e^{-10} \approx 4.5 \times 10^{-5} \approx 0$. ✓
- Derive $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$ from the formula for the conditional Gaussian $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$.
  Answer: Using Bayes' rule for Gaussians: $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \propto q(\mathbf{x}_t|\mathbf{x}_{t-1})q(\mathbf{x}_{t-1}|\mathbf{x}_0)$. Both factors are Gaussian: $q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{\alpha_t}\mathbf{x}_{t-1}, \beta_t I)$ and $q(\mathbf{x}_{t-1}|\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0, (1-\bar{\alpha}_{t-1})I)$. Viewed as functions of $\mathbf{x}_{t-1}$, their product has precision (inverse variance) $\frac{1}{\tilde{\beta}_t} = \frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}$. Solving: $\tilde{\beta}_t = \frac{\beta_t(1-\bar{\alpha}_{t-1})}{\alpha_t(1-\bar{\alpha}_{t-1}) + \beta_t} = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$, since $\alpha_t(1-\bar{\alpha}_{t-1}) + \beta_t = 1 - \bar{\alpha}_t$.
- Explain why the simplified DDPM objective drops the weighting factor $\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)}$. What effect does this have?
  Answer: The weighting factor heavily upweights training at small $t$ (when $1-\bar{\alpha}_t$ is small). Dropping it weights all $t$ equally. Empirically, this improves sample quality: the model learns to denoise at all noise levels equally well, not just the subtle denoising steps. The simplified objective is not a proper variational bound but works better as a training signal.
- For the forward process $q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t I)$, verify that the per-coordinate variance of $\mathbf{x}_t$ remains 1 for every $t$ if $\mathbf{x}_0 \sim \mathcal{N}(0, I)$, for any schedule with $\beta_t \in (0, 1)$.
  Answer: By induction. Assume $\text{Var}[\mathbf{x}_{t-1}] = 1$. The added noise is independent of $\mathbf{x}_{t-1}$, so $\text{Var}[\mathbf{x}_t] = (1-\beta_t)\text{Var}[\mathbf{x}_{t-1}] + \beta_t = (1-\beta_t) \cdot 1 + \beta_t = 1$. The variance-preserving (VP) SDE formulation keeps the marginal variance constant. This contrasts with variance-exploding (VE) SDEs, where variance grows over time.
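Two of the answers above are easy to confirm numerically: the size of $\bar{\alpha}_T$ under the linear schedule from Problem 2, and the variance recursion from Problem 5. The snippet is a small sketch; the variable names are chosen for illustration.

```python
import numpy as np

T = 1000
# Linear schedule from Problem 2: beta_t = 1e-4 + (t - 1) * (0.02 - 1e-4) / (T - 1)
betas = 1e-4 + np.arange(T) * (0.02 - 1e-4) / (T - 1)

alpha_bar_T = np.prod(1.0 - betas)      # exact product
approx = np.exp(-betas.sum())           # first-order approximation exp(-sum beta_t)
print(alpha_bar_T, approx)

# Variance recursion from Problem 5, starting from Var[x_0] = 1.
var = 1.0
for b in betas:
    var = (1.0 - b) * var + b
print(var)
```

Because $\log(1-\beta) \le -\beta$, the exact product is never larger than the exponential approximation, so the approximation is a safe upper bound on the residual signal.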
Summary
Key takeaways:
- Forward diffusion: $q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t I)$, with closed-form $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon}$
- The reverse process is learned as $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 I)$
- The DDPM objective simplifies to $\mathbb{E}[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\|^2]$ — predict the noise
- Diffusion models are equivalent to learning the score function at multiple noise scales
- Sampling iteratively denoises: subtract predicted noise, then add small stochastic perturbation
- The connection to score-based models is: $\mathbf{s}_\theta(\mathbf{x}_t, t) = -\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)/\sqrt{1-\bar{\alpha}_t}$
Quiz
- The closed-form $q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I)$ enables:
  - A) Faster sampling
  - B) Efficient training by sampling any $t$ in one step
  - C) Better sample quality
  - D) Smaller model size
  Correct: B)
  - If you chose B: Without the closed form, training would require iterating through all $t$ steps for each datapoint. The closed form enables random $t$ sampling in $O(1)$.
  - If you chose A: Sampling still requires $T$ sequential steps. That is addressed by DDIM.
  - If you chose C: The closed form helps training efficiency but doesn't directly improve quality.
  - If you chose D: Unrelated to model size.
- In DDPM, the model $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)$ directly predicts:
  - A) $\mathbf{x}_0$
  - B) $\mathbf{x}_{t-1}$
  - C) The noise $\boldsymbol{\varepsilon}$ added to $\mathbf{x}_0$
  - D) The score $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$
  Correct: C)
  - If you chose C: The loss is $\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta\|^2$. The model is a noise predictor.
  - If you chose D: The score is proportional to $-\boldsymbol{\varepsilon}_\theta$, so it's equivalent, but the DDPM parameterization directly outputs noise.
- The forward process scaling $\sqrt{1-\beta_t}$ (rather than 1) is used to:
  - A) Make the process converge faster
  - B) Preserve variance (variance-preserving diffusion)
  - C) Increase the noise at each step
  - D) Simplify the reverse process
  Correct: B)
  - If you chose B: $\text{Var}[\mathbf{x}_t] = (1-\beta_t)\text{Var}[\mathbf{x}_{t-1}] + \beta_t$. If $\text{Var}[\mathbf{x}_{t-1}] = 1$, then $\text{Var}[\mathbf{x}_t] = 1$. Without the scaling, variance would grow monotonically.
  - If you chose A: The scaling is not about speed. How quickly the process approaches noise is governed by how fast $\bar{\alpha}_t$ decays, that is, by the schedule $\beta_t$.
  - If you chose C: The added noise has variance $\beta_t$ either way. The scaling shrinks the previous state rather than increasing the noise.
  - If you chose D: The scaling choice affects both forward and reverse, not simplifying either.
- In practice, the KL term $L_T$ is dropped from the training objective because:
  - A) It's intractable
  - B) With enough steps, $q(\mathbf{x}_T|\mathbf{x}_0) \approx \mathcal{N}(0,I)$, so the KL is near zero
  - C) It's computationally too expensive
  - D) It's exactly zero for any noise schedule
  Correct: B)
  - If you chose B: With properly chosen $T$ and $\beta_t$, $\bar{\alpha}_T \approx 0$, so $q(\mathbf{x}_T|\mathbf{x}_0) \approx \mathcal{N}(0,I) = p(\mathbf{x}_T)$. The KL divergence approaches zero. With a fixed schedule, $L_T$ also contains no learnable parameters, so it is constant during training.
  - If you chose A: It has a closed form: KL between two Gaussians.
  - If you chose C: Gaussian KL is $O(d)$.
  - If you chose D: Only zero if distributions match perfectly.
- The optimal $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ targets $\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)$, which is the mean of:
  - A) $q(\mathbf{x}_t|\mathbf{x}_{t-1})$
  - B) $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$
  - C) $q(\mathbf{x}_0|\mathbf{x}_t)$
  - D) $p_\theta(\mathbf{x}_t|\mathbf{x}_{t-1})$
  Correct: B)
  - If you chose B: $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is the true posterior given both the noisy observation $\mathbf{x}_t$ and the clean data $\mathbf{x}_0$. The reverse model $p_\theta$ tries to approximate this without access to $\mathbf{x}_0$.
  - If you chose A: That's the forward process, which is known analytically.
  - If you chose C: That's a different conditional: the full denoising distribution.
  - If you chose D: That's the learned reverse step.
Next Steps
22-07 — Diffusion Models: Advanced — DDIM for accelerated sampling, classifier guidance, classifier-free guidance, and the SDE/ODE formulations that unify diffusion and score-based models.
Pitfalls
- Using the simplified loss without understanding its bias: The DDPM simplified loss $\mathcal{L}_{\text{simple}} = \mathbb{E}[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta\|^2]$ drops the per-timestep weighting factor $\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)}$. This means the model is not optimized for the variational bound; it is trained equally across all $t$. While this empirically improves sample quality, it means the training loss is no longer a valid lower bound on $\log p(\mathbf{x})$. For likelihood evaluation, use the full weighted variational bound or the probability flow ODE.
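- Setting $T$ too small and expecting quality: DDPM requires $T \approx 1000$ steps for the reverse process to be well-approximated as Gaussian. With $T = 100$ and the same total amount of noise, each $\beta_t$ must be larger, and the true reverse conditional $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is no longer close to Gaussian, so the Gaussian model $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ fits it poorly and sample quality drops. (The conditional $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is exactly Gaussian for any $\beta_t$; it is the marginalization over $\mathbf{x}_0$ that breaks Gaussianity.) If you need fewer sampling steps, switch to DDIM or a higher-order ODE solver rather than reducing $T$ in DDPM.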
- Confusing SNR(t) with $\bar{\alpha}_t$ in noise schedule design: The key quantity controlling training difficulty is $\text{SNR}(t) = \bar{\alpha}_t/(1 - \bar{\alpha}_t)$, not $\bar{\alpha}_t$ directly. A cosine schedule keeps SNR high at intermediate $t$ (preserving more signal for the model to learn from), while a linear schedule drops SNR rapidly. Always plot SNR vs. $t$ when designing or comparing schedules, not just $\bar{\alpha}_t$.
- Applying DDPM sampling with the wrong variance choice: DDPM can use either $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$ for the reverse process variance. Using $\sigma_t^2 = \beta_t$ when $\tilde{\beta}_t$ is expected (or vice versa) can cause a distribution shift. For best log-likelihood, use the learned variance $\Sigma_\theta(\mathbf{x}_t, t)$; for best sample quality, $\sigma_t^2 = \beta_t$ is often preferred.
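To compare schedules by SNR as suggested above, a short sketch like the following can help. It is written under assumptions: the linear endpoints come from the earlier Pitfalls table, and the cosine form $\bar{\alpha}_t = f(t)/f(0)$ with $f(t) = \cos^2\left(\frac{t/T + s}{1 + s}\cdot\frac{\pi}{2}\right)$, $s = 0.008$, is the commonly used cosine schedule, which this document does not define. The function names are hypothetical.

```python
import numpy as np

def alpha_bar_linear(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_1..alpha_bar_T for a linear beta schedule (assumed endpoints)."""
    return np.cumprod(1.0 - np.linspace(beta_start, beta_end, T))

def alpha_bar_cosine(T, s=0.008):
    """alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return (f / f[0])[1:]          # drop t = 0 so index 0 holds alpha_bar_1

def snr(alpha_bar):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar)."""
    return alpha_bar / (1.0 - alpha_bar)

T = 1000
schedules = {"linear": alpha_bar_linear(T), "cosine": alpha_bar_cosine(T)}
for name, ab in schedules.items():
    # Report SNR at an early, middle, and late timestep.
    print(name, [float(snr(ab)[t - 1]) for t in (100, 500, 900)])
```

In practice you would plot $\log \text{SNR}(t)$ against $t$ for each candidate schedule; the printed values at three timesteps are a compact stand-in for that plot.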
Q6: The variational bound for diffusion models involves $D_{\text{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \| p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))$. Why is $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ tractable?
- A) It's analytically derived from the known Gaussian forward transitions using Bayes' rule.
- B) It's estimated via Monte Carlo with a single sample.
- C) The neural network learns it.
- D) It's always $\mathcal{N}(0, I)$.