22-07: Diffusion Models (Advanced)
Phase: 22 - Generative Models Mathematics Subject: 22-07 Prerequisites: 22-06 - Diffusion Models: Foundations, 22-05 - Score-Based Generative Models Next subject: 22-08 - Autoregressive Models
Learning Objectives
By the end of this subject, you will be able to:
- Derive the DDIM deterministic sampling procedure and explain why it enables fewer-step generation
- Formulate classifier guidance and classifier-free guidance for conditional diffusion
- Understand the SDE/ODE continuous-time formulation that unifies diffusion and score-based models
- Choose appropriate noise schedules (linear, cosine, sigmoid) for different applications
- Implement accelerated sampling with DDIM and guidance-controlled generation
Core Content
DDIM: Denoising Diffusion Implicit Models
DDPM sampling (22-06) requires $T \approx 1000$ sequential denoising steps, making generation slow. Song et al. (2021) introduced DDIM, which generalizes DDPM to a non-Markovian forward process, enabling deterministic sampling with far fewer steps.
⚠️ CRITICAL: The DDIM Key Insight
The DDPM objective $L_{\text{simple}} = \mathbb{E}\left[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\|^2\right]$ depends only on the marginals $q(\mathbf{x}_t|\mathbf{x}_0)$, not on the joint $q(\mathbf{x}_{1:T}|\mathbf{x}_0)$. Any forward process with the same marginals therefore yields the same training objective, while admitting different reverse (sampling) processes.
DDIM defines a non-Markovian forward process:
$$q_\sigma(\mathbf{x}_{1:T}|\mathbf{x}_0) = q_\sigma(\mathbf{x}_T|\mathbf{x}_0)\prod_{t=2}^{T} q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$$
where for $t > 1$:
$$q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \cdot \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{\sqrt{1-\bar{\alpha}_t}},\ \sigma_t^2 I\right)$$
The parameter $\sigma_t$ controls stochasticity:
- $\sigma_t = 0$: DDIM (fully deterministic)
- $\sigma_t = \sqrt{(1-\bar{\alpha}_{t-1})/(1-\bar{\alpha}_t)}\,\sqrt{1-\bar{\alpha}_t/\bar{\alpha}_{t-1}}$: recovers DDPM
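As a quick numerical check of the DDPM case, the sketch below evaluates $\sigma_t$ on a linear schedule and compares $\sigma_t^2$ with the DDPM posterior variance $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$. The function name `ddim_sigma` and the `eta` scaling knob (0 for deterministic DDIM, 1 for DDPM) are illustrative assumptions, not definitions from this subject.

```python
import numpy as np

def ddim_sigma(alpha_bar_t, alpha_bar_prev, eta=1.0):
    """Noise scale sigma_t; eta is a hypothetical knob: 0 -> DDIM, 1 -> DDPM."""
    return (eta
            * np.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t))
            * np.sqrt(1.0 - alpha_bar_t / alpha_bar_prev))

# Linear schedule, used here only for illustration.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

t = 500  # 0-based array index; any index >= 1 works
sigma_ddpm = ddim_sigma(alpha_bar[t], alpha_bar[t - 1], eta=1.0)
sigma_ddim = ddim_sigma(alpha_bar[t], alpha_bar[t - 1], eta=0.0)

# DDPM posterior variance, for comparison with sigma_ddpm**2.
beta_tilde = (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t]
print(sigma_ddim, sigma_ddpm**2, beta_tilde)
```

The two printed variances agree because $1 - \bar{\alpha}_t/\bar{\alpha}_{t-1} = \beta_t$.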
DDIM Sampling (Deterministic)
With $\sigma_t = 0$, the reverse step becomes deterministic. Given a trained noise predictor $\boldsymbol{\varepsilon}_\theta$:
- Predict $\hat{\mathbf{x}}_0$ from $\mathbf{x}_t$: $$\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}}$$
- Compute $\mathbf{x}_{t-1}$: $$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)$$
This is the "predicted $\mathbf{x}_0$" interpretation: we estimate what $\mathbf{x}_0$ would be given $\mathbf{x}_t$, then re-noise to level $t-1$ using the same predicted noise direction.
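The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: `ddim_step` is a hypothetical helper name, and the inputs are made-up numbers standing in for a trained network's output.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (sigma_t = 0) from level t to an earlier level."""
    # Step 1: estimate the clean sample from x_t and the predicted noise.
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Step 2: re-noise x0_hat to the earlier level along the same noise direction.
    x_prev = np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
    return x_prev, x0_hat

# Made-up inputs standing in for a trained network's output.
x_t = np.array([0.5, -1.0])
eps_pred = np.array([0.3, 0.1])
x_prev, x0_hat = ddim_step(x_t, eps_pred, alpha_bar_t=0.7, alpha_bar_prev=0.85)

# Consistency check: re-noising x0_hat back to level t recovers x_t.
x_t_again = np.sqrt(0.7) * x0_hat + np.sqrt(1.0 - 0.7) * eps_pred
print(x_prev, np.allclose(x_t_again, x_t))
```

The consistency check holds by construction: $\hat{\mathbf{x}}_0$ is defined as the point that, combined with the predicted noise at level $t$, reproduces $\mathbf{x}_t$.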
Why DDIM is faster: We can sample using a subsequence $\tau_1 < \tau_2 < \cdots < \tau_S$ of $S \ll T$ timesteps. For $S=50$ on a $T=1000$ model, quality degrades only slightly:
$$\mathbf{x}_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar{\alpha}_{\tau_{i-1}}}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{\tau_i}, \tau_i)$$
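A sampling loop over such a sub-sequence might look like the sketch below. Everything named here is an assumption for illustration: `ddim_sample` is a hypothetical helper, timesteps are 0-based array indices, and `toy_eps_model` is the exact noise predictor for a dataset consisting of the single point `mu`, used in place of a trained network so the result can be checked.

```python
import numpy as np

def ddim_sample(eps_model, alpha_bar, shape, num_steps, rng):
    """Deterministic DDIM sampling on a sub-sequence of num_steps timesteps."""
    T = len(alpha_bar)
    # Evenly spaced 0-based timesteps, from most noisy to least noisy.
    taus = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = rng.standard_normal(shape)  # start from pure noise
    for i, t in enumerate(taus):
        eps = eps_model(x, t)
        ab_t = alpha_bar[t]
        # Target level: next timestep in the sub-sequence, or clean data at the end.
        ab_prev = alpha_bar[taus[i + 1]] if i + 1 < len(taus) else 1.0
        x0_hat = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
mu = np.array([1.0, -2.0])

def toy_eps_model(x, t):
    """Exact noise predictor when all data sit at the single point mu (toy stand-in)."""
    return (x - np.sqrt(alpha_bar[t]) * mu) / np.sqrt(1.0 - alpha_bar[t])

sample = ddim_sample(toy_eps_model, alpha_bar, shape=(2,), num_steps=50,
                     rng=np.random.default_rng(0))
print(sample)  # lands on mu up to floating-point error
```

With a real network, only `eps_model` changes; the loop makes 50 network calls instead of 1000.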
DDIM Inversion
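A useful property of deterministic DDIM ($\sigma_t = 0$): the mapping between $\mathbf{x}_0$ and $\mathbf{x}_T$ is invertible, up to discretization error. Given $\mathbf{x}_0$, we can run the update in the direction of increasing noise to obtain a latent $\mathbf{x}_T$ that reproduces $\mathbf{x}_0$ under reverse DDIM:
$$\mathbf{x}_{t} = \sqrt{\bar{\alpha}_t}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{t-1}, t-1), \qquad \hat{\mathbf{x}}_0 = \frac{\mathbf{x}_{t-1} - \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{t-1}, t-1)}{\sqrt{\bar{\alpha}_{t-1}}}$$
Here the noise prediction at $\mathbf{x}_{t-1}$ stands in for the one at $\mathbf{x}_t$, which is accurate when consecutive steps are close.
This enables image editing: encode a real image to a latent → modify the latent → decode back. It underlies real-image editing pipelines such as prompt-to-prompt editing (SDEdit, by contrast, encodes by adding random noise rather than by inverting).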
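To see how close inversion gets to a perfect round trip, the sketch below encodes and then decodes a point using the same deterministic update in both directions. It is a toy under stated assumptions: `ddim_move`, `toy_eps`, and `roundtrip_error` are hypothetical names, and the noise predictor is the exact one for Gaussian data $\mathcal{N}(0, 4I)$ rather than a trained network.

```python
import numpy as np

def ddim_move(x, eps, ab_from, ab_to):
    """Deterministic DDIM update between two noise levels, in either direction."""
    x0_hat = (x - np.sqrt(1.0 - ab_from) * eps) / np.sqrt(ab_from)
    return np.sqrt(ab_to) * x0_hat + np.sqrt(1.0 - ab_to) * eps

def toy_eps(x, ab, data_var=4.0):
    """Exact noise predictor for data ~ N(0, data_var * I); stands in for a network."""
    return np.sqrt(1.0 - ab) * x / (ab * data_var + 1.0 - ab)

def roundtrip_error(x0, alpha_bar, num_steps):
    """Relative error after inverting x0 to a latent and decoding it again."""
    T = len(alpha_bar)
    taus = np.linspace(0, T - 1, num_steps).round().astype(int)
    levels = np.concatenate(([1.0], alpha_bar[taus]))   # clean -> noisy
    x = x0.copy()
    for a_from, a_to in zip(levels[:-1], levels[1:]):   # inversion (encode)
        x = ddim_move(x, toy_eps(x, a_from), a_from, a_to)
    rev = levels[::-1]                                  # noisy -> clean
    for a_from, a_to in zip(rev[:-1], rev[1:]):         # sampling (decode)
        x = ddim_move(x, toy_eps(x, a_from), a_from, a_to)
    return np.linalg.norm(x - x0) / np.linalg.norm(x0)

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = np.array([1.5, -0.7])
for steps in (10, 200):
    print(steps, roundtrip_error(x0, alpha_bar, steps))
```

In this toy setting the reconstruction error is clearly nonzero with 10 steps and much smaller with 200, which illustrates that the round trip is exact only in the small-step limit.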
⚠️ CRITICAL: Classifier Guidance
Classifier guidance (Dhariwal & Nichol, 2021) uses a pretrained classifier $p_\phi(y|\mathbf{x}_t)$ to steer generation toward a desired class $y$.
The key insight: we can modify the score function to incorporate the gradient of the log-classifier probability, shifting the sampling trajectory toward regions of high $p(y|\mathbf{x})$:
$$\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t|y) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(y|\mathbf{x}_t)$$
Using Bayes' rule: $p(\mathbf{x}_t|y) \propto p(\mathbf{x}_t)p(y|\mathbf{x}_t)$. The score of the class-conditional distribution decomposes into unconditional score + classifier gradient.
Translating to the noise prediction framework. Recall from 22-06:
$$\mathbf{s}_\theta(\mathbf{x}_t, t) = -\frac{\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$
The guided score is:
$$\mathbf{s}_{\text{guided}}(\mathbf{x}_t) = \mathbf{s}_\theta(\mathbf{x}_t) + w \cdot \nabla_{\mathbf{x}_t} \log p_\phi(y|\mathbf{x}_t)$$
where $w$ is the guidance scale. In terms of noise prediction:
$$\hat{\boldsymbol{\varepsilon}}_\theta(\mathbf{x}_t) = \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t) - w \cdot \sqrt{1-\bar{\alpha}_t} \cdot \nabla_{\mathbf{x}_t} \log p_\phi(y|\mathbf{x}_t)$$
The guidance scale $w$ controls the trade-off:
- $w = 1$: standard class-conditional sampling
- $w > 1$: pushes samples further toward high-density class regions → better quality, lower diversity
- $w = 0$: unconditional (ignore classifier)
Typical values: $w \in [1, 10]$ for image generation.
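The noise-prediction form of guidance can be sketched with a toy classifier whose gradient is available in closed form. The logistic classifier, its parameters, and the helper names below are all assumptions for illustration; a real noise-aware classifier would be a network that also takes $t$ as input, with the gradient obtained by automatic differentiation.

```python
import numpy as np

def grad_log_sigmoid_classifier(x, weight, bias):
    """Gradient w.r.t. x of log p(y=1|x) for a toy logistic classifier."""
    z = weight @ x + bias
    p = 1.0 / (1.0 + np.exp(-z))
    return (1.0 - p) * weight

def classifier_guided_eps(eps_pred, grad_log_p, alpha_bar_t, w):
    """Shift the noise prediction against the classifier gradient, scaled by w."""
    return eps_pred - w * np.sqrt(1.0 - alpha_bar_t) * grad_log_p

x_t = np.array([0.2, -0.1])
eps_pred = np.array([0.05, 0.30])          # made-up network output
weight, bias = np.array([1.0, 2.0]), 0.0   # made-up classifier parameters

grad = grad_log_sigmoid_classifier(x_t, weight, bias)
for w in (0.0, 1.0, 5.0):
    print(w, classifier_guided_eps(eps_pred, grad, alpha_bar_t=0.5, w=w))
```

With `w=0.0` the prediction is returned unchanged; larger `w` moves it further along the negative classifier gradient, which corresponds to a larger push of the score toward class $y$.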
Practical Issues with Classifier Guidance
- Requires training a separate noise-aware classifier $p_\phi(y|\mathbf{x}_t)$ on noisy inputs covering every noise level
- Gradient computation through classifier adds overhead
- Classifier must be robust to noisy inputs at all $t$
⚠️ CRITICAL: Classifier-Free Guidance (CFG)
Classifier-free guidance (Ho & Salimans, 2021) eliminates the need for a separate classifier by jointly training a conditional and unconditional model in one network.
During training, we randomly drop the conditioning $c$ (replace with a null token $\varnothing$) with probability $p_{\text{uncond}}$ (typically 10-20%):
$$\mathcal{L} = \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\varepsilon}}\left[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t, c)\|^2\right] \quad \text{with } c \in \{y, \varnothing\}$$
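The conditioning dropout itself is a one-line masking operation. The sketch below is one possible way to write it, assuming integer class labels and a reserved id `NULL_LABEL` for the null token; both the helper name and the id are hypothetical.

```python
import numpy as np

NULL_LABEL = -1  # hypothetical id reserved for the null token

def drop_conditioning(labels, p_uncond, rng):
    """Replace each label with NULL_LABEL independently with probability p_uncond."""
    drop = rng.random(labels.shape[0]) < p_uncond
    return np.where(drop, NULL_LABEL, labels)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=10_000)      # made-up class labels 0..9
train_labels = drop_conditioning(labels, p_uncond=0.1, rng=rng)
print((train_labels == NULL_LABEL).mean())     # close to 0.1
```

The network then sees `train_labels` in place of `labels`, so the same weights learn both the conditional and the unconditional noise prediction.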
At sampling time, we combine the conditional and unconditional predictions, extrapolating past the conditional one when $w > 1$:
$$\tilde{\boldsymbol{\varepsilon}}_\theta(\mathbf{x}_t, c) = \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing) + w\left(\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, c) - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing)\right)$$
This is the CFG formula. Interpretation:
- $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing)$ = unconditional prediction (what the model thinks is noise regardless of class)
- $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, c)$ = conditional prediction
- The difference is the "direction toward class $c$"
- $w$ amplifies this direction
Equivalently in score space:
$$\mathbf{s}_{\text{CFG}}(\mathbf{x}_t, c) = \mathbf{s}_\theta(\mathbf{x}_t, \varnothing) + w\left(\mathbf{s}_\theta(\mathbf{x}_t, c) - \mathbf{s}_\theta(\mathbf{x}_t, \varnothing)\right)$$
Why CFG dominates in practice:
- No separate classifier needed
- Works with any conditioning (text, images, labels, etc.)
- Natural for text-to-image models (Stable Diffusion, DALL·E, Imagen)
- The same network handles all guidance internally
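The CFG combination is a single line once both predictions are available. The sketch below uses made-up prediction vectors in place of the two network forward passes (one with the condition, one with the null token); `cfg_eps` is a hypothetical helper name.

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: move from the unconditional toward the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_cond = np.array([0.4, -0.2])    # made-up conditional prediction
eps_uncond = np.array([0.1, 0.1])   # made-up unconditional prediction

for w in (0.0, 1.0, 3.0):
    print(w, cfg_eps(eps_cond, eps_uncond, w))
```

`w=0.0` returns the unconditional prediction, `w=1.0` returns the conditional one, and `w=3.0` extrapolates three times as far along the same direction.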
Noise Schedules
The choice of $\beta_t$ (or equivalently $\bar{\alpha}_t$) significantly affects generation quality.
Linear Schedule (DDPM original)
$$\beta_t = \beta_{\text{start}} + \frac{t-1}{T-1}(\beta_{\text{end}} - \beta_{\text{start}})$$
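Typical: $\beta_{\text{start}} = 10^{-4}$, $\beta_{\text{end}} = 0.02$. Problem: signal is destroyed too quickly, so $\bar{\alpha}_t$ is already close to zero well before $t = T$ and the last part of the chain is nearly pure noise.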
Cosine Schedule (Nichol & Dhariwal, 2021)
$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2$$
where $s = 0.008$ is a small offset preventing $\beta_t$ from being too small near $t=0$. Better performance than linear: prevents information from being destroyed too quickly.
Sigmoid Schedule
$$\bar{\alpha}_t = \frac{\text{sigmoid}(c - 2c \cdot t/T) - \text{sigmoid}(-c)}{\text{sigmoid}(c) - \text{sigmoid}(-c)}$$
where $c$ controls steepness. Normalized this way, $\bar{\alpha}_0 = 1$ and $\bar{\alpha}_T = 0$, and the schedule transitions smoothly from signal to noise.
Key comparison:

| Schedule | SNR at $t=T/2$ | SNR decay shape | Best for |
|----------|---------------|-----------------|----------|
| Linear | Moderate | Uniform | Simple baselines |
| Cosine | High (preserves signal) | Gradual then steep | High-quality images |
| Sigmoid | Tunable | Controllable | Custom applications |
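To put numbers on the comparison, the sketch below computes $\bar{\alpha}_t$ and $\text{SNR}(t) = \bar{\alpha}_t/(1-\bar{\alpha}_t)$ at $t = T/2$ for the linear and cosine schedules as defined above. The helper names are illustrative assumptions, and the sigmoid schedule is left out to keep the example short.

```python
import numpy as np

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, T)   # beta_1 ... beta_T
    return np.cumprod(1.0 - betas)                 # alpha_bar_1 ... alpha_bar_T

def cosine_alpha_bar(T, s=0.008):
    def f(t):
        return np.cos((t / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    t = np.arange(1, T + 1)
    return f(t) / f(0)

def snr(alpha_bar):
    return alpha_bar / (1.0 - alpha_bar)

T = 1000
mid = T // 2 - 1  # array index of t = T/2
for name, ab in (("linear", linear_alpha_bar(T)), ("cosine", cosine_alpha_bar(T))):
    print(f"{name}: alpha_bar={ab[mid]:.3f}  SNR={snr(ab)[mid]:.3f}")
```

With $T = 1000$, the linear schedule has $\bar{\alpha}_{T/2}$ of roughly 0.08 while the cosine schedule is near 0.49, so the cosine schedule retains far more signal at the midpoint.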
SDE/ODE Formulation (Continuous-Time)
The continuous-time perspective (Song et al., 2021) unifies diffusion and score-based models. The forward process is an Itô SDE:
$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + g(t)d\mathbf{w}$$
where $\mathbf{w}$ is a standard Wiener process, $\mathbf{f}$ is the drift, and $g(t)$ is the diffusion coefficient.
Two common SDEs:
- Variance-Preserving (VP) SDE (DDPM): $$d\mathbf{x} = -\frac{1}{2}\beta(t)\mathbf{x}\,dt + \sqrt{\beta(t)}\,d\mathbf{w}$$
- Variance-Exploding (VE) SDE (Score-based/22-05): $$d\mathbf{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\,d\mathbf{w}$$
The reverse-time SDE (Anderson, 1982) is:
$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt + g(t)d\bar{\mathbf{w}}$$
where $d\bar{\mathbf{w}}$ is a reverse-time Wiener process. The score $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$ is what we learn with $\mathbf{s}_\theta(\mathbf{x}, t)$.
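The reverse-time SDE can be simulated with Euler-Maruyama steps. The sketch below is a toy under stated assumptions: a VP-SDE with constant $\beta$, one-dimensional Gaussian data $\mathcal{N}(0, 4)$ so the score is known exactly, and a hypothetical helper name `reverse_sde_sample`. In practice the exact score would be replaced by a learned $\mathbf{s}_\theta$.

```python
import numpy as np

def reverse_sde_sample(num_samples, beta, data_var, t_end, num_steps, rng):
    """Euler-Maruyama on the reverse VP-SDE with the exact Gaussian score."""
    dt = t_end / num_steps
    x = rng.standard_normal(num_samples)           # approximately p_T = N(0, 1)
    t = t_end
    for _ in range(num_steps):
        a = np.exp(-beta * t)                      # alpha_bar(t) for constant beta
        var_t = a * data_var + (1.0 - a)           # variance of p_t
        score = -x / var_t                         # grad log N(0, var_t)
        drift = -0.5 * beta * x - beta * score     # f - g^2 * score
        # Step backward in time: subtract the drift, add fresh Gaussian noise.
        x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(num_samples)
        t -= dt
    return x

samples = reverse_sde_sample(20_000, beta=2.0, data_var=4.0, t_end=5.0,
                             num_steps=1000, rng=np.random.default_rng(0))
print(samples.var())  # should land near data_var = 4.0
```

Starting from unit-variance noise, the reverse SDE recovers samples whose variance is close to that of the data distribution.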
Probability Flow ODE
Remarkably, there exists a deterministic probability flow ODE with the same marginal distributions:
$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}g(t)^2\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt$$
DDIM can be viewed as a first-order (Euler-type) discretization of this ODE. This explains why DDIM is deterministic: it solves the probability flow ODE rather than the reverse SDE.
The ODE formulation enables:
- Exact likelihood computation: $\log p_0(\mathbf{x}_0)$ via instantaneous change-of-variables
- Deterministic encoding/decoding: invertible mappings for image editing
- Higher-order solvers: Heun's method, Runge-Kutta for faster/better sampling
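A plain Euler solver for the probability flow ODE can be checked against a closed form in a toy setting. The sketch below assumes a VP-SDE with constant $\beta$ and Gaussian data $\mathcal{N}(0, 4)$, for which the marginal variance is $v_t = 4e^{-\beta t} + 1 - e^{-\beta t}$ and the ODE solution satisfies $x_t/\sqrt{v_t} = \text{const}$. The helper name is hypothetical.

```python
import numpy as np

def probability_flow_euler(x_T, beta, data_var, t_end, num_steps):
    """Integrate the VP probability flow ODE from t = t_end down to t = 0."""
    dt = t_end / num_steps
    x = np.asarray(x_T, dtype=float).copy()
    t = t_end
    for _ in range(num_steps):
        a = np.exp(-beta * t)                              # alpha_bar(t)
        var_t = a * data_var + (1.0 - a)                   # variance of p_t
        score = -x / var_t                                 # exact Gaussian score
        velocity = -0.5 * beta * x - 0.5 * beta * score    # f - (1/2) g^2 * score
        x = x - velocity * dt                              # Euler step toward t = 0
        t -= dt
    return x

x_T = np.array([-1.0, 0.5, 2.0])
x_0 = probability_flow_euler(x_T, beta=2.0, data_var=4.0, t_end=5.0, num_steps=2000)
var_T = np.exp(-10.0) * 4.0 + (1.0 - np.exp(-10.0))
print(x_0, x_T * np.sqrt(4.0 / var_T))  # Euler result vs closed form
```

Unlike the reverse SDE, no random numbers are drawn: the same $\mathbf{x}_T$ always maps to the same $\mathbf{x}_0$, and a higher-order solver could reach the same accuracy with fewer steps.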
Key Terms
- CFG formula
- Classifier guidance
- Classifier-free guidance
- DDIM
- DDPM
- Exact likelihood computation
- Higher-order solvers
- Noise schedules
- Probability flow ODE
- SDE formulation
- Variance-Exploding (VE) SDE
- Variance-Preserving (VP) SDE
Worked Examples
Example 1: DDIM Sampling Step
A model trained with $T=1000$ uses $\bar{\alpha}_{100} = 0.8$ and $\bar{\alpha}_{50} = 0.9$. Given $\mathbf{x}_{100} = (0.3, -0.4)$ and $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{100}, 100) = (0.2, -0.5)$, compute the deterministic DDIM step to $\mathbf{x}_{50}$.
Solution:
Step 1 β Predict $\hat{\mathbf{x}}_0$:
$$\hat{\mathbf{x}}_0 = \frac{(0.3, -0.4) - \sqrt{1-0.8}\,(0.2, -0.5)}{\sqrt{0.8}} = \frac{(0.3, -0.4) - 0.4472 \cdot (0.2, -0.5)}{0.8944}$$
$$= \frac{(0.3 - 0.08944, -0.4 + 0.2236)}{0.8944} = \frac{(0.2106, -0.1764)}{0.8944} = (0.2354, -0.1972)$$
Step 2 β Re-noise to $t=50$:
$$\mathbf{x}_{50} = \sqrt{0.9}\,(0.2354, -0.1972) + \sqrt{1-0.9}\,(0.2, -0.5)$$
$$= 0.9487 \cdot (0.2354, -0.1972) + 0.3162 \cdot (0.2, -0.5)$$
$$= (0.2233, -0.1871) + (0.06325, -0.1581) = (0.2866, -0.3452)$$
Answer:
$\mathbf{x}_{50} = (0.2866, -0.3452)$. A single DDIM jump moved the sample from noise level $t=100$ to $t=50$, skipping the intermediate steps. Because the step is deterministic, the same input always produces the same output.
Example 2: Classifier-Free Guidance
A text-to-image model produces $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, c) = (0.1, -0.3, 0.2)$ and $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing) = (-0.1, 0.1, 0.0)$. Compute the guided noise prediction for $w = 7.5$.
Solution:
$$\tilde{\boldsymbol{\varepsilon}}_\theta = (-0.1, 0.1, 0.0) + 7.5\left[(0.1, -0.3, 0.2) - (-0.1, 0.1, 0.0)\right]$$
$$= (-0.1, 0.1, 0.0) + 7.5 \cdot (0.2, -0.4, 0.2)$$
$$= (-0.1, 0.1, 0.0) + (1.5, -3.0, 1.5) = (1.4, -2.9, 1.5)$$
Answer:
$\tilde{\boldsymbol{\varepsilon}}_\theta = (1.4, -2.9, 1.5)$. With $w > 1$, the prediction is pushed far beyond the conditional model's natural range: its norm is about nine times that of the conditional prediction. This amplifies the "class signal" at the cost of more extreme predictions, which is typical CFG behavior.
Example 3: SDE to ODE Conversion
For the VP-SDE with $\beta(t) = \beta$ (constant), write the probability flow ODE.
Solution:
VP-SDE: $d\mathbf{x} = -\frac{1}{2}\beta\mathbf{x}\,dt + \sqrt{\beta}\,d\mathbf{w}$
With $\mathbf{f}(\mathbf{x}, t) = -\frac{1}{2}\beta\mathbf{x}$ and $g(t) = \sqrt{\beta}$:
Probability flow ODE:
$$d\mathbf{x} = \left[-\frac{1}{2}\beta\mathbf{x} - \frac{1}{2}\beta \cdot \nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt$$
$$= -\frac{1}{2}\beta\left[\mathbf{x} + \nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt$$
Using the score-noise relationship $\nabla_{\mathbf{x}}\log p_t(\mathbf{x}) = -\boldsymbol{\varepsilon}_\theta(\mathbf{x}, t)/\sigma_t$, where $\sigma_t = \sqrt{1-\bar{\alpha}_t}$ here denotes the marginal noise standard deviation (not the DDIM stochasticity parameter):
$$d\mathbf{x} = -\frac{1}{2}\beta\left[\mathbf{x} - \frac{\boldsymbol{\varepsilon}_\theta(\mathbf{x}, t)}{\sigma_t}\right]dt$$
Euler discretization with step size $\Delta t$, stepping backward in time, gives:
$$\mathbf{x}_{t-\Delta t} = \mathbf{x}_t + \frac{1}{2}\beta\Delta t\left[\mathbf{x}_t - \frac{\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)}{\sigma_t}\right]$$
Answer:
This is the continuous analog of DDIM. At each step, we move $\mathbf{x}$ in the direction that reduces noise, since the score points toward higher density. The ODE provides a smooth, deterministic trajectory from noise to data.
Practice Problems
- For a DDIM model with $\bar{\alpha}_{200} = 0.6$, $\bar{\alpha}_{100} = 0.8$, compute $\mathbf{x}_{100}$ from $\mathbf{x}_{200} = \mathbf{0}$ and $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{200}, 200) = (0.1, 0.1)$.
Answer:
$\hat{\mathbf{x}}_0 = (\mathbf{0} - \sqrt{0.4}\,(0.1, 0.1))/\sqrt{0.6} = (-0.08165, -0.08165)$
$\mathbf{x}_{100} = \sqrt{0.8}\,(-0.08165, -0.08165) + \sqrt{0.2}\,(0.1, 0.1) = (-0.0730 + 0.04472, -0.0730 + 0.04472) = (-0.0283, -0.0283)$
- Prove that CFG with $w = 1$ reduces to standard conditional sampling.
Answer:
$\tilde{\boldsymbol{\varepsilon}}_\theta = \boldsymbol{\varepsilon}_\theta(\varnothing) + 1 \cdot (\boldsymbol{\varepsilon}_\theta(c) - \boldsymbol{\varepsilon}_\theta(\varnothing)) = \boldsymbol{\varepsilon}_\theta(c)$. The unconditional component cancels, leaving only the conditional prediction, which is exactly standard conditional diffusion.
- Explain why classifier guidance is normally used with $w > 0$, even though both it and CFG are defined for any real $w$. What does negative $w$ produce?
Answer:
The sign of $w$ sets the direction of the push. In classifier guidance, $w > 0$ follows the classifier gradient toward class $y$, while $w < 0$ pushes away from class $y$ (anti-guidance). In CFG, $w < 0$ moves past the unconditional prediction in the direction opposite to the conditioning, producing "anti-samples." With $w = 0$, CFG gives purely unconditional output. With very large $w$, the model over-emphasizes the conditioning direction and samples saturate.
- Derive the SNR at time $t$ for the DDPM forward process: $\text{SNR}(t) = \bar{\alpha}_t/(1-\bar{\alpha}_t)$. Why does the cosine schedule preserve higher SNR longer?
Answer:
$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\varepsilon}$. Signal power = $\bar{\alpha}_t\|\mathbf{x}_0\|^2$, noise power = $(1-\bar{\alpha}_t)$ per dimension. For data normalized to unit variance per dimension, the ratio is $\text{SNR} = \bar{\alpha}_t/(1-\bar{\alpha}_t)$. The cosine schedule keeps $\bar{\alpha}_t$ near 1 longer (flat near $t=0$), preserving signal. The linear schedule drops $\bar{\alpha}_t$ faster, destroying signal earlier. This matters because the model needs signal at intermediate noise levels to learn meaningful structure.
- Given the probability flow ODE $d\mathbf{x} = v(\mathbf{x}, t)dt$, how would you compute the log-likelihood $\log p_0(\mathbf{x}_0)$ of a generated sample?
Answer:
Using the instantaneous change-of-variables formula (continuous normalizing flows): $d\log p_t(\mathbf{x}_t)/dt = -\text{tr}(\nabla_{\mathbf{x}} v(\mathbf{x}_t, t))$ along the ODE trajectory. Integrate from $t=0$ to $t=T$ to get $\log p_0(\mathbf{x}_0) = \log p_T(\mathbf{x}_T) + \int_0^T \text{tr}(\nabla_{\mathbf{x}} v(\mathbf{x}_t, t))\,dt$. Since $p_T \approx \mathcal{N}(0, I)$, this gives the exact likelihood under the model. Computing the trace of the Jacobian is expensive; Hutchinson's trace estimator provides an unbiased estimate.
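Hutchinson's estimator replaces the trace with an average of $\mathbf{z}^\top J \mathbf{z}$ over random probe vectors. The sketch below uses Rademacher ($\pm 1$) probes and a toy linear field $v(\mathbf{x}) = A\mathbf{x}$, whose Jacobian is $A$, so the estimate can be compared with the exact trace. The function name and the matrix are illustrative assumptions; for a neural vector field the Jacobian-vector product would come from automatic differentiation.

```python
import numpy as np

def hutchinson_trace(jacobian_vector_product, dim, num_probes, rng):
    """Estimate tr(J) as the average of z^T J z over random +/-1 probe vectors z."""
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=dim)
        total += z @ jacobian_vector_product(z)
    return total / num_probes

# Toy linear field v(x) = A x, whose Jacobian is A everywhere (made-up numbers).
A = np.array([[0.5, 0.2, 0.0],
              [-0.1, -1.0, 0.3],
              [0.4, 0.0, 0.25]])
estimate = hutchinson_trace(lambda z: A @ z, dim=3, num_probes=2000,
                            rng=np.random.default_rng(0))
print(estimate, np.trace(A))  # estimate should be close to the exact trace -0.25
```

Each probe costs one Jacobian-vector product instead of the `dim` products needed for the exact trace, which is what makes likelihood evaluation tractable in high dimensions.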
Summary
Key takeaways:
- DDIM makes diffusion sampling non-Markovian, enabling deterministic generation with far fewer steps ($S \ll T$) via the probability flow ODE
- Classifier guidance: $\mathbf{s}_{\text{guided}} = \mathbf{s}_\theta(\mathbf{x}) + w\nabla_{\mathbf{x}}\log p_\phi(y|\mathbf{x})$, requires separate noise-aware classifier
- Classifier-free guidance: $\tilde{\boldsymbol{\varepsilon}}_\theta(c) = \boldsymbol{\varepsilon}_\theta(\varnothing) + w(\boldsymbol{\varepsilon}_\theta(c) - \boldsymbol{\varepsilon}_\theta(\varnothing))$, no classifier needed, dominates practice
- Noise schedules (linear, cosine, sigmoid) control SNR decay; cosine typically outperforms linear
- SDE formulation: $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + g(t)d\mathbf{w}$ unifies diffusion/score models
- Probability flow ODE: deterministic path with same marginals; DDIM = Euler discretization
- Guidance scale $w$ trades diversity ($w$ low) for quality/adherence ($w$ high)
Quiz
Q1: The key property of DDPM that DDIM exploits for faster sampling is:
- A) The Markov property of the forward process
- B) That training depends only on marginals $q(\mathbf{x}_t|\mathbf{x}_0)$, not the joint
- C) That the reverse process is Gaussian
- D) That the noise schedule is linear

Correct: B)
- If you chose B: The DDPM loss uses only $q(\mathbf{x}_t|\mathbf{x}_0)$ for each $t$, so any forward process with the same marginals yields the same trained model. DDIM changes the inference (reverse) process while keeping the same trained weights.
- If you chose A: DDIM is explicitly non-Markovian.
- If you chose C: Gaussian reverse steps are shared with DDPM, so they are not what makes DDIM faster.
- If you chose D: DDIM works with any noise schedule.
Q2: In classifier-free guidance, setting $w = 7.5$ means:
- A) The model uses 7.5× more compute
- B) The conditioning signal is amplified 7.5× beyond the unconditional baseline
- C) The model was trained with 7.5× more unconditional samples
- D) Only 1/7.5 of the timesteps are used

Correct: B)
- If you chose B: $\tilde{\boldsymbol{\varepsilon}} = \boldsymbol{\varepsilon}_\theta(\varnothing) + 7.5(\boldsymbol{\varepsilon}_\theta(c) - \boldsymbol{\varepsilon}_\theta(\varnothing))$, so the vector from unconditional to conditional is stretched by a factor of 7.5.
- If you chose A: The compute is roughly 2× (two forward passes), regardless of $w$.
- If you chose C: Training dropout rate $p_{\text{uncond}}$ is separate from sampling guidance $w$.
- If you chose D: $w$ controls adherence, not step count.
Q3: The probability flow ODE differs from the reverse SDE by:
- A) Having different marginal distributions
- B) Lacking the stochastic $d\bar{\mathbf{w}}$ term, which makes it deterministic
- C) Using a different score function
- D) Being defined only at discrete timesteps

Correct: B)
- If you chose B: The ODE drops the $g(t)d\bar{\mathbf{w}}$ term while keeping the same marginals. The SDE has random diffusion; the ODE is a deterministic vector field.
- If you chose A: They have identical marginal distributions $p_t(\mathbf{x})$ by construction.
- If you chose C: Both use the same score function $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$.
- If you chose D: The ODE is defined continuously; DDIM discretizes it.
Q4: Classifier guidance requires a noise-aware classifier because:
- A) The guidance is applied at every noise level, and $p(y|\mathbf{x}_t)$ must be defined for noisy $\mathbf{x}_t$
- B) Noisy classifiers are more accurate
- C) The classifier must share weights with the diffusion model
- D) Clean-image classifiers are too slow

Correct: A)
- If you chose A: The classifier must estimate $p(y|\mathbf{x}_t)$ for all $t$, meaning it must handle inputs at every noise level, from nearly clean ($t$ small) to pure noise ($t \approx T$).
- If you chose B: Noisy classifiers are typically less accurate than clean ones.
- If you chose C: The classifier is a separate model.
- If you chose D: Inference speed depends on architecture, not noise-awareness.
Q5: DDIM inversion is possible because:
- A) DDIM uses a Markov forward process
- B) The deterministic DDIM mapping ($\sigma = 0$) is invertible: given $\mathbf{x}_0$, you can compute the unique $\mathbf{x}_T$ that decodes to it
- C) DDIM trains on both forward and reverse directions
- D) Neural networks are invertible by design

Correct: B)
- If you chose B: With $\sigma=0$, each step is a deterministic function of $\mathbf{x}_t$ (since $\hat{\mathbf{x}}_0$ depends only on $\mathbf{x}_t$). The chain of deterministic steps is invertible by running the ODE backward.
- If you chose A: DDIM is non-Markovian. DDPM (Markovian) is not easily invertible due to stochasticity.
- If you chose C: DDIM uses the same trained model as DDPM, with no separate inversion training.
- If you chose D: Invertibility comes from the ODE, not network architecture.
Next Steps
22-08 - Autoregressive Models: PixelCNN, WaveNet, and the autoregressive approach to generative modeling. While diffusion models generate all pixels simultaneously through iterative refinement, autoregressive models generate one element at a time, conditioning on previously generated outputs.
Pitfalls
- Using DDIM with too few steps and expecting DDPM-quality samples: DDIM with $S=10$ steps on a $T=1000$ model produces deterministic samples, but at substantially lower quality than the full 1000-step DDPM sampler. Quality generally degrades as the step count shrinks. For high-quality generation, use $S \geq 50$; for latency-sensitive applications, $S = 20$ to $50$ with DDIM is a reasonable trade-off. Higher-order ODE solvers (Heun, DPM-Solver) can achieve better quality than DDIM at the same step count.
- Setting CFG scale $w$ too high: Classifier-free guidance with $w > 10$ can produce over-saturated, unnatural images because the sample is pushed far beyond the training distribution. The CFG formula $\tilde{\boldsymbol{\varepsilon}} = \boldsymbol{\varepsilon}_{\text{uncond}} + w(\boldsymbol{\varepsilon}_{\text{cond}} - \boldsymbol{\varepsilon}_{\text{uncond}})$ amplifies the conditional signal, and for extreme $w$ the extrapolated predictions lie outside the range seen during training. Typical sweet spot: $w \in [3, 8]$ for text-to-image; $w \in [1, 3]$ for class-conditional.
- Confusing DDIM inversion with exact invertibility: Deterministic DDIM ($\sigma=0$) is invertible only in the limit of small steps. In practice, each inversion step substitutes $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{t-1}, t-1)$ for $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)$, and these errors accumulate along the trajectory. The forward (encoding) and reverse (decoding) paths may not perfectly reconstruct the original image, especially with classifier-free guidance, where the unconditional and conditional paths differ. For image editing, use enough inversion steps to keep this error small, and consider techniques like null-text inversion or EDICT for better reconstruction fidelity.
- Using the wrong noise schedule for the task: The linear schedule ($\beta_t$ from $10^{-4}$ to $0.02$) destroys signal too quickly for high-resolution images, where fine details need preservation at intermediate noise levels. The cosine schedule preserves SNR longer and generally produces better FID scores. For very high resolutions ($1024^2$ and above), shifted cosine or sigmoid schedules with SNR tuned for the specific resolution are commonly needed.
Q6: The log-SNR at time $t$ in a diffusion model is $\log(\bar{\alpha}_t/(1-\bar{\alpha}_t))$. Why is this quantity important for noise schedule design?
- A) It determines the size of the neural network needed.
- B) It controls how much information about $\mathbf{x}_0$ remains: low log-SNR means the data is mostly noise, high log-SNR means the data is mostly clean. The schedule should smoothly transition between these regimes.
- C) It determines the batch size during training.
- D) It must equal zero at $t = T$ for the model to work.