
22-07 — Diffusion Models (Advanced)

Phase: 22 — Generative Models Mathematics
Subject: 22-07
Prerequisites: 22-06 — Diffusion Models: Foundations, 22-05 — Score-Based Generative Models
Next subject: 22-08 — Autoregressive Models

Learning Objectives

By the end of this subject, you will be able to:

Derive the DDIM deterministic sampling procedure and explain why it enables fewer-step generation
Formulate classifier guidance and classifier-free guidance for conditional diffusion
Understand the SDE/ODE continuous-time formulation that unifies diffusion and score-based models
Choose appropriate noise schedules (linear, cosine, sigmoid) for different applications
Implement accelerated sampling with DDIM and guidance-controlled generation

Core Content

DDIM: Denoising Diffusion Implicit Models

DDPM sampling (22-06) requires $T \approx 1000$ sequential denoising steps, making generation slow. Song et al. (2021) introduced DDIM, which generalizes DDPM to a non-Markovian forward process, enabling deterministic sampling with far fewer steps.

⚠️ CRITICAL — The DDIM Key Insight

The DDPM objective $L_{\text{simple}} = \mathbb{E}\left[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\|^2\right]$ depends only on the marginals $q(\mathbf{x}_t|\mathbf{x}_0)$, not on the joint $q(\mathbf{x}_{1:T}|\mathbf{x}_0)$. This means any forward process yielding the same marginals produces the same training objective, but different reverse processes.

DDIM defines a non-Markovian forward process:

$$q_\sigma(\mathbf{x}_{1:T}|\mathbf{x}_0) = q_\sigma(\mathbf{x}_T|\mathbf{x}_0)\prod_{t=2}^{T} q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$$

where for $t > 1$:

$$q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \cdot \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{\sqrt{1-\bar{\alpha}_t}},\; \sigma_t^2 I\right)$$

The parameter $\sigma_t$ controls stochasticity:
- $\sigma_t = 0$ → DDIM (fully deterministic)
- $\sigma_t = \sqrt{(1-\bar{\alpha}_{t-1})/(1-\bar{\alpha}_t)}\,\sqrt{1-\bar{\alpha}_t/\bar{\alpha}_{t-1}}$ → recovers DDPM
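As a quick numerical check of the DDPM case, the sketch below verifies that this choice of $\sigma_t^2$ equals the DDPM posterior variance $\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$. The helper name `ddim_sigma`, the scaling argument `eta` (0 for deterministic DDIM, 1 for DDPM), and the linear schedule values are assumptions made for illustration.

```python
import numpy as np

def ddim_sigma(alpha_bar, t, eta=1.0):
    """sigma_t for the DDIM family; eta=0 is deterministic, eta=1 matches DDPM."""
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
    return eta * np.sqrt((1 - ab_prev) / (1 - ab_t)) * np.sqrt(1 - ab_t / ab_prev)

# Illustrative linear schedule; index 0 is a placeholder so that index t maps to timestep t.
T = 1000
betas = np.concatenate([[0.0], np.linspace(1e-4, 0.02, T)])
alpha_bar = np.cumprod(1.0 - betas)

t = 500
posterior_var = (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t]) * betas[t]
sigma_ddpm = ddim_sigma(alpha_bar, t, eta=1.0)   # stochastic DDPM-style step
sigma_ddim = ddim_sigma(alpha_bar, t, eta=0.0)   # deterministic DDIM step
```

The identity holds because $1 - \bar{\alpha}_t/\bar{\alpha}_{t-1} = \beta_t$; intermediate values of `eta` give samplers between the two extremes.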

DDIM Sampling (Deterministic)

With $\sigma_t = 0$, the reverse step becomes deterministic. Given a trained noise predictor $\boldsymbol{\varepsilon}_\theta$:

1. Predict $\hat{\mathbf{x}}_0$ from $\mathbf{x}_t$: $$\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}}$$
2. Compute $\mathbf{x}_{t-1}$: $$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)$$

This is the "predicted $\mathbf{x}_0$" interpretation: we estimate what $\mathbf{x}_0$ would be given $\mathbf{x}_t$, then re-noise to level $t-1$ using the same predicted noise direction.
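The two steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `ddim_step` is hypothetical, and the noise prediction is passed in as an array instead of coming from a trained network. The inputs are the ones used in Example 1 of this subject.

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update (sigma_t = 0) from noise level ab_t to ab_prev."""
    # Step 1: estimate the clean sample implied by the predicted noise.
    x0_hat = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    # Step 2: re-noise that estimate to the target level, reusing the same noise.
    x_prev = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_pred
    return x_prev, x0_hat

x_t = np.array([0.3, -0.4])
eps_pred = np.array([0.2, -0.5])   # stand-in for the network output eps_theta(x_t, t)
x_prev, x0_hat = ddim_step(x_t, eps_pred, ab_t=0.8, ab_prev=0.9)
```

Returning `x0_hat` alongside `x_prev` is a design convenience: the clean-sample estimate is often useful for clipping or for visualizing intermediate predictions.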

Why DDIM is faster: We can sample using a subsequence $\tau_1 < \tau_2 < \cdots < \tau_S$ of $S \ll T$ timesteps. For $S=50$ on a $T=1000$ model, quality degrades only slightly:

$$\mathbf{x}_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar{\alpha}_{\tau_{i-1}}}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{\tau_i}, \tau_i)$$
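A sampling loop over such a subsequence might look like the sketch below. Everything named here is an assumption for illustration: `ddim_sample` is a hypothetical helper, the schedule is an illustrative linear one, the timesteps are evenly spaced, and `toy_eps` is an analytic stand-in (exact only for data distributed as $\mathcal{N}(0, I)$) in place of a trained network.

```python
import numpy as np

def ddim_sample(eps_model, alpha_bar, taus, x_T):
    """Deterministic DDIM sampling over an increasing timestep subsequence `taus`."""
    x = x_T
    # Walk the subsequence from the noisiest timestep down to the cleanest.
    for i in range(len(taus) - 1, -1, -1):
        t = taus[i]
        ab_t = alpha_bar[t]
        # The step taken from tau_1 lands on the clean sample (alpha_bar = 1).
        ab_prev = alpha_bar[taus[i - 1]] if i > 0 else 1.0
        eps = eps_model(x, t)
        x0_hat = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps
    return x

T, S = 1000, 50
betas = np.concatenate([[0.0], np.linspace(1e-4, 0.02, T)])  # index t = timestep t
alpha_bar = np.cumprod(1.0 - betas)
taus = np.linspace(1, T, S).round().astype(int)               # S of the T timesteps

# Toy predictor: exact for data distributed as N(0, I); a real model is a network.
toy_eps = lambda x, t: np.sqrt(1.0 - alpha_bar[t]) * x

rng = np.random.default_rng(0)
x_T = rng.standard_normal((4, 2))
x_0 = ddim_sample(toy_eps, alpha_bar, taus, x_T)
```

The loop calls the noise predictor once per entry of `taus`, so the cost is $S$ network evaluations instead of $T$.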

DDIM Inversion

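A remarkable property of deterministic DDIM ($\sigma_t = 0$): the mapping between $\mathbf{x}_0$ and $\mathbf{x}_T$ is invertible in the continuous-time limit. Given $\mathbf{x}_0$, we can run the update in the opposite direction to obtain a latent $\mathbf{x}_T$ that (approximately) reproduces $\mathbf{x}_0$ under reverse DDIM:

$$\mathbf{x}_{t} = \sqrt{\bar{\alpha}_t}\,\hat{\mathbf{x}}_0(\mathbf{x}_{t-1}) + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{t-1}, t-1), \qquad \hat{\mathbf{x}}_0(\mathbf{x}_{t-1}) = \frac{\mathbf{x}_{t-1} - \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{t-1}, t-1)}{\sqrt{\bar{\alpha}_{t-1}}}$$

The step evaluates the network at $\mathbf{x}_{t-1}$ in place of the unknown $\mathbf{x}_t$, so the inversion becomes exact only as the step size goes to zero.

This enables image editing: encode a real image to a latent, modify the latent or the conditioning, then decode back. Related editing methods include prompt-to-prompt editing of real images and SDEdit (which noises the image stochastically rather than inverting it).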
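The round trip (invert, then sample) can be checked numerically. The sketch below is built on stated assumptions: the schedule $\bar{\alpha} = \cos^2\theta$ with evenly spaced $\theta$ is chosen for convenience, the noise predictor is an analytic toy that is exact for $\mathcal{N}(0, I)$ data, and the inversion step predicts $\hat{\mathbf{x}}_0$ at the current point before re-noising to the next level. The names `ddim_move` and `roundtrip_error` are hypothetical.

```python
import numpy as np

def ddim_move(x, eps, ab_from, ab_to):
    """Shared DDIM update: predict x0 at level ab_from, then re-noise to ab_to."""
    x0_hat = (x - np.sqrt(1.0 - ab_from) * eps) / np.sqrt(ab_from)
    return np.sqrt(ab_to) * x0_hat + np.sqrt(1.0 - ab_to) * eps

def roundtrip_error(n_steps, x0):
    # Assumed schedule: alpha_bar = cos^2(theta), theta evenly spaced; index 0 is the clean end.
    theta = np.linspace(0.0, 1.5, n_steps + 1)
    ab = np.cos(theta) ** 2
    eps_model = lambda x, k: np.sqrt(1.0 - ab[k]) * x   # toy predictor, exact for N(0, I) data
    x = x0
    for k in range(n_steps):                  # inversion: level k -> k + 1
        x = ddim_move(x, eps_model(x, k), ab[k], ab[k + 1])
    for k in range(n_steps, 0, -1):           # sampling: level k -> k - 1
        x = ddim_move(x, eps_model(x, k), ab[k], ab[k - 1])
    return np.max(np.abs(x - x0))

x0 = np.array([0.5, -1.0])
err_coarse = roundtrip_error(10, x0)      # few, large steps
err_fine = roundtrip_error(1000, x0)      # many, small steps
```

In this toy setting the reconstruction error shrinks as the number of steps grows, which illustrates why inversion is a discretization of an invertible map rather than an exactly invertible procedure at any finite step count.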

⚠️ CRITICAL — Classifier Guidance

Classifier guidance (Dhariwal & Nichol, 2021) uses a pretrained classifier $p_\phi(y|\mathbf{x}_t)$ to steer generation toward a desired class $y$.

The key insight: we can modify the score function to incorporate the gradient of the log-classifier probability, shifting the sampling trajectory toward regions of high $p(y|\mathbf{x})$:

$$\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t|y) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(y|\mathbf{x}_t)$$

Using Bayes' rule: $p(\mathbf{x}_t|y) \propto p(\mathbf{x}_t)p(y|\mathbf{x}_t)$. The score of the class-conditional distribution decomposes into unconditional score + classifier gradient.

Translating to the noise prediction framework. Recall from 22-06:

$$\mathbf{s}_\theta(\mathbf{x}_t, t) = -\frac{\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$

The guided score is:

$$\mathbf{s}_{\text{guided}}(\mathbf{x}_t) = \mathbf{s}_\theta(\mathbf{x}_t) + w \cdot \nabla_{\mathbf{x}_t} \log p_\phi(y|\mathbf{x}_t)$$

where $w$ is the guidance scale. In terms of noise prediction:

$$\hat{\boldsymbol{\varepsilon}}_\theta(\mathbf{x}_t) = \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t) - w \cdot \sqrt{1-\bar{\alpha}_t} \cdot \nabla_{\mathbf{x}_t} \log p_\phi(y|\mathbf{x}_t)$$

The guidance scale $w$ controls the trade-off:
- $w = 1$: standard class-conditional sampling
- $w > 1$: pushes samples further toward high-density class regions → better quality, lower diversity
- $w = 0$: unconditional (ignore classifier)

Typical values: $w \in [1, 10]$ for image generation.
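The noise-space guidance rule can be sketched with a toy classifier whose gradient is available in closed form. The logistic "classifier" $p(y|\mathbf{x}) = \text{sigmoid}(\mathbf{a}\cdot\mathbf{x} + b)$, its weights, and the function names are all hypothetical; a real setup would differentiate a noise-aware network with automatic differentiation.

```python
import numpy as np

def guided_eps(eps, grad_log_p, ab_t, w):
    """Shift the noise prediction by the scaled classifier gradient."""
    return eps - w * np.sqrt(1.0 - ab_t) * grad_log_p

def toy_grad_log_p(x, a, b):
    """Gradient of log sigmoid(a.x + b) with respect to x for a toy logistic classifier."""
    p = 1.0 / (1.0 + np.exp(-(a @ x + b)))
    return (1.0 - p) * a

x_t = np.array([0.3, -0.4])
eps = np.array([0.2, -0.5])            # stand-in for eps_theta(x_t, t)
a, b = np.array([1.0, 2.0]), 0.0       # hypothetical classifier weights
grad = toy_grad_log_p(x_t, a, b)

eps_w0 = guided_eps(eps, grad, ab_t=0.8, w=0.0)   # no guidance
eps_w3 = guided_eps(eps, grad, ab_t=0.8, w=3.0)   # guidance scale 3

# Converting back to scores recovers s_guided = s_theta + w * grad.
score = -eps / np.sqrt(1.0 - 0.8)
score_w3 = -eps_w3 / np.sqrt(1.0 - 0.8)
```

The last two lines confirm that the noise-space and score-space forms of guidance describe the same update.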

Practical Issues with Classifier Guidance

Requires training a separate noise-aware classifier $p_\phi(y|\mathbf{x}_t)$ that covers every noise level (in practice, one network conditioned on $t$)
Gradient computation through classifier adds overhead
Classifier must be robust to noisy inputs at all $t$

⚠️ CRITICAL — Classifier-Free Guidance (CFG)

Classifier-free guidance (Ho & Salimans, 2021) eliminates the need for a separate classifier by jointly training a conditional and unconditional model in one network.

During training, we randomly drop the conditioning $c$ (replace with a null token $\varnothing$) with probability $p_{\text{uncond}}$ (typically 10-20%):

$$\mathcal{L} = \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\varepsilon}}\left[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t, c)\|^2\right] \quad \text{with } c \in \{y, \varnothing\}$$
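The conditioning dropout used during training can be sketched as a label-masking step applied to each batch. This is a minimal illustration; the function name `drop_conditioning`, the integer class labels, and the reserved id `-1` for the null token are assumptions, not part of any specific library.

```python
import numpy as np

def drop_conditioning(labels, p_uncond, null_label, rng):
    """Replace each label with the null token independently with probability p_uncond."""
    drop = rng.random(labels.shape) < p_uncond
    return np.where(drop, null_label, labels)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=10_000)   # class ids 0..9 (illustrative)
NULL = -1                                   # hypothetical id reserved for the null token
train_labels = drop_conditioning(labels, p_uncond=0.1, null_label=NULL, rng=rng)
dropped_fraction = np.mean(train_labels == NULL)
```

Because the mask is drawn per example, a single network sees both conditional and unconditional targets within the same batch.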

At sampling time, we combine the conditional and unconditional predictions, extrapolating past the conditional one when $w > 1$:

$$\tilde{\boldsymbol{\varepsilon}}_\theta(\mathbf{x}_t, c) = \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing) + w\left(\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, c) - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing)\right)$$

This is the CFG formula. Interpretation:
- $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing)$ = unconditional prediction (what the model thinks is noise regardless of class)
- $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, c)$ = conditional prediction
- The difference is the "direction toward class $c$"
- $w$ amplifies this direction

Equivalently in score space:

$$\mathbf{s}_{\text{CFG}}(\mathbf{x}_t, c) = \mathbf{s}_\theta(\mathbf{x}_t, \varnothing) + w\left(\mathbf{s}_\theta(\mathbf{x}_t, c) - \mathbf{s}_\theta(\mathbf{x}_t, \varnothing)\right)$$
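A minimal sketch of the CFG combination, applied once in noise space and once in score space to show that the two forms agree. The function name `cfg`, the example vectors (the same as in Example 2 of this subject), and the value $\bar{\alpha}_t = 0.8$ used for the conversion are illustrative assumptions.

```python
import numpy as np

def cfg(pred_cond, pred_uncond, w):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    return pred_uncond + w * (pred_cond - pred_uncond)

eps_c = np.array([0.1, -0.3, 0.2])    # stand-in for eps_theta(x_t, c)
eps_u = np.array([-0.1, 0.1, 0.0])    # stand-in for eps_theta(x_t, null)

eps_guided = cfg(eps_c, eps_u, w=7.5)

# The same rule applied to scores gives the same result after conversion.
scale = np.sqrt(1.0 - 0.8)            # sqrt(1 - alpha_bar_t) for an assumed alpha_bar_t = 0.8
score_guided = cfg(-eps_c / scale, -eps_u / scale, w=7.5)
```

Each guided step needs both `eps_c` and `eps_u`, which in practice means two forward passes of the same network (often batched together).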

Why CFG dominates in practice:
- No separate classifier needed
- Works with any conditioning (text, images, labels, etc.)
- Natural for text-to-image models (Stable Diffusion, DALL·E, Imagen)
- The same network handles all guidance internally

Noise Schedules

The choice of $\beta_t$ (or equivalently $\bar{\alpha}_t$) significantly affects generation quality.

Linear Schedule (DDPM original)

$$\beta_t = \beta_{\text{start}} + \frac{t-1}{T-1}(\beta_{\text{end}} - \beta_{\text{start}})$$

Typical: $\beta_1 = 10^{-4}, \beta_T = 0.02$. Problem: too much noise added too quickly in early steps.

Cosine Schedule (Nichol & Dhariwal, 2021)

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2$$

where $s = 0.008$ is a small offset preventing $\beta_t$ from being too small near $t=0$. Better performance than linear: prevents information from being destroyed too quickly.
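The cosine schedule is defined through $\bar{\alpha}_t$, so the per-step $\beta_t$ has to be recovered from ratios of consecutive values. The sketch below shows one way to do this; the function names are hypothetical, and the upper clip on $\beta_t$ is an assumed safeguard against the singularity near $t = T$.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) for t = 0..T, with f as defined above."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Recover beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}; the clip is an assumed safeguard."""
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

alpha_bar = cosine_alpha_bar(T=1000)
betas = betas_from_alpha_bar(alpha_bar)
ab_mid = alpha_bar[500]   # roughly half of the signal power remains at the midpoint
```

Defining the schedule through `alpha_bar` and deriving `betas` afterwards keeps the signal level at each timestep under direct control.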

Sigmoid Schedule

$$\bar{\alpha}_t = \frac{\text{sigmoid}(c) - \text{sigmoid}(-c + 2c \cdot t/T)}{\text{sigmoid}(c) - \text{sigmoid}(-c)}$$

where $c$ controls steepness and the normalization gives $\bar{\alpha}_0 = 1$ and $\bar{\alpha}_T = 0$. The sigmoid schedule transitions smoothly from signal to noise.
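A short sketch of a sigmoid schedule, assuming the normalized form $\bar{\alpha}_t = \frac{\text{sigmoid}(c) - \text{sigmoid}(-c + 2c\,t/T)}{\text{sigmoid}(c) - \text{sigmoid}(-c)}$, which runs from 1 at $t=0$ to 0 at $t=T$. The function names and the two values of $c$ are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_alpha_bar(T, c=3.0):
    """Assumed normalized sigmoid schedule: alpha_bar falls from 1 at t = 0 to 0 at t = T."""
    t = np.arange(T + 1)
    ramp = sigmoid(-c + 2.0 * c * t / T)
    return (sigmoid(c) - ramp) / (sigmoid(c) - sigmoid(-c))

gentle = sigmoid_alpha_bar(T=1000, c=1.0)   # close to a straight line in alpha_bar
steep = sigmoid_alpha_bar(T=1000, c=6.0)    # holds signal longer, then drops quickly
```

Both settings pass through $\bar{\alpha} = 0.5$ at $t = T/2$; larger $c$ keeps more signal in the first half and removes it faster in the second.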

Key comparison:

| Schedule | SNR at $t=T/2$ | SNR decay shape | Best for |
|----------|----------------|-----------------|----------|
| Linear | Moderate | Uniform | Simple baselines |
| Cosine | High (preserves signal) | Gradual then steep | High-quality images |
| Sigmoid | Tunable | Controllable | Custom applications |
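The midpoint comparison can be made concrete by computing $\text{SNR}(t) = \bar{\alpha}_t/(1-\bar{\alpha}_t)$ for the linear and cosine schedules. The sketch assumes $T = 1000$, the linear endpoints $10^{-4}$ and $0.02$, and $s = 0.008$; variable names are illustrative.

```python
import numpy as np

T = 1000
# Linear schedule: alpha_bar from cumulative products of (1 - beta_t), t = 1..T.
lin_ab = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
# Cosine schedule: alpha_bar_t = f(t) / f(0), t = 1..T.
t = np.arange(1, T + 1)
f = lambda u: np.cos((u / T + 0.008) / 1.008 * np.pi / 2.0) ** 2
cos_ab = f(t) / f(0)

def snr(alpha_bar):
    """Signal-to-noise ratio for unit-variance data."""
    return alpha_bar / (1.0 - alpha_bar)

mid = T // 2 - 1                        # array index of timestep T/2
snr_linear_mid = snr(lin_ab[mid])
snr_cosine_mid = snr(cos_ab[mid])
```

With these settings the cosine schedule retains a substantially higher SNR at $t = T/2$ than the linear one, which is the behavior the table summarizes.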

SDE/ODE Formulation (Continuous-Time)

The continuous-time perspective (Song et al., 2021) unifies diffusion and score-based models. The forward process is an Itô SDE:

$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + g(t)d\mathbf{w}$$

where $\mathbf{w}$ is a standard Wiener process, $\mathbf{f}$ is the drift, and $g(t)$ is the diffusion coefficient.

Two common SDEs:

Variance-Preserving (VP) SDE (DDPM): $$d\mathbf{x} = -\frac{1}{2}\beta(t)\mathbf{x}\,dt + \sqrt{\beta(t)}\,d\mathbf{w}$$
Variance-Exploding (VE) SDE (Score-based/22-05): $$d\mathbf{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\,d\mathbf{w}$$
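To see the "variance-preserving" behavior, the VP SDE can be simulated with the Euler-Maruyama method. This is a sketch under simplifying assumptions: $\beta(t)$ is held constant, the data are one-dimensional Gaussian samples with variance 9, and the function name `simulate_vp_sde` is hypothetical.

```python
import numpy as np

def simulate_vp_sde(x0, beta, t_end, n_steps, rng):
    """Euler-Maruyama simulation of dx = -0.5*beta*x dt + sqrt(beta) dw with constant beta."""
    dt = t_end / n_steps
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Drift pulls x toward zero; the diffusion term injects variance beta * dt.
        x = x - 0.5 * beta * x * dt + np.sqrt(beta * dt) * noise
    return x

rng = np.random.default_rng(0)
x0 = 3.0 * rng.standard_normal(20_000)     # toy data with variance 9
x_T = simulate_vp_sde(x0, beta=1.0, t_end=8.0, n_steps=800, rng=rng)
var_start, var_end = x0.var(), x_T.var()
```

The drift shrinks the signal at the same rate the diffusion term adds noise, so the sample variance relaxes toward 1 regardless of where it starts. The VE SDE has no drift, so its variance only grows.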

The reverse-time SDE (Anderson, 1982) is:

$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt + g(t)d\bar{\mathbf{w}}$$

where $d\bar{\mathbf{w}}$ is a reverse-time Wiener process. The score $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$ is what we learn with $\mathbf{s}_\theta(\mathbf{x}, t)$.

Probability Flow ODE

Remarkably, there exists a deterministic probability flow ODE with the same marginal distributions:

$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}g(t)^2\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt$$

Deterministic DDIM can be viewed as a first-order (Euler-type) discretization of this ODE, after a change of variables in the VP case. This explains why DDIM is deterministic: it follows the probability flow ODE rather than the reverse SDE.

The ODE formulation enables:
- Exact likelihood computation: $\log p_0(\mathbf{x}_0)$ via instantaneous change-of-variables
- Deterministic encoding/decoding: invertible mappings for image editing
- Higher-order solvers: Heun's method, Runge-Kutta for faster/better sampling
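The probability flow ODE can be integrated in closed-form settings to check its behavior. The sketch below assumes one-dimensional Gaussian data $\mathcal{N}(0, s_0^2)$ under the VP SDE with constant $\beta$, where the score is known exactly, and uses plain Euler steps backward in time. All function names and parameter values are illustrative.

```python
import numpy as np

def variance_t(t, s0_sq, beta):
    """Marginal variance under the constant-beta VP SDE for data ~ N(0, s0_sq)."""
    return s0_sq * np.exp(-beta * t) + (1.0 - np.exp(-beta * t))

def pf_ode_drift(x, t, s0_sq, beta):
    """f(x, t) - 0.5 * g(t)^2 * score, with the exact Gaussian score -x / variance_t."""
    score = -x / variance_t(t, s0_sq, beta)
    return -0.5 * beta * x - 0.5 * beta * score

def integrate_backward(x_T, t_end, n_steps, s0_sq, beta):
    """Euler steps from t = t_end down to t = 0."""
    dt = t_end / n_steps
    x, t = x_T.copy(), t_end
    for _ in range(n_steps):
        x = x - dt * pf_ode_drift(x, t, s0_sq, beta)
        t -= dt
    return x

s0_sq, beta, t_end = 0.25, 1.0, 10.0
rng = np.random.default_rng(0)
x_T = np.sqrt(variance_t(t_end, s0_sq, beta)) * rng.standard_normal(5)
x_0_a = integrate_backward(x_T, t_end, 2000, s0_sq, beta)
x_0_b = integrate_backward(x_T, t_end, 2000, s0_sq, beta)
# Closed form for this toy case: the ODE rescales x by sqrt(variance_0 / variance_T).
x_0_exact = x_T * np.sqrt(s0_sq / variance_t(t_end, s0_sq, beta))
```

Running the integration twice from the same `x_T` gives identical results, and the Euler solution stays close to the closed-form rescaling, so the noise-to-data map is deterministic and maps the terminal Gaussian onto the data Gaussian.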

Key Terms

CFG formula
Classifier guidance
Classifier-free guidance
DDIM
DDPM
Exact likelihood computation
Higher-order solvers
Noise schedules
Probability flow ODE
SDE formulation
Variance-Exploding (VE) SDE
Variance-Preserving (VP) SDE

Worked Examples

Example 1: DDIM Sampling Step

A model trained with $T=1000$ uses $\bar{\alpha}_{100} = 0.8$ and $\bar{\alpha}_{50} = 0.9$. Given $\mathbf{x}_{100} = (0.3, -0.4)$ and $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{100}, 100) = (0.2, -0.5)$, compute the deterministic DDIM step to $\mathbf{x}_{50}$.

Solution:

Step 1 — Predict $\hat{\mathbf{x}}_0$:

$$\hat{\mathbf{x}}_0 = \frac{(0.3, -0.4) - \sqrt{1-0.8}\,(0.2, -0.5)}{\sqrt{0.8}} = \frac{(0.3, -0.4) - 0.4472 \cdot (0.2, -0.5)}{0.8944}$$

$$= \frac{(0.3 - 0.08944, -0.4 + 0.2236)}{0.8944} = \frac{(0.2106, -0.1764)}{0.8944} = (0.2354, -0.1972)$$

Step 2 — Re-noise to $t=50$:

$$\mathbf{x}_{50} = \sqrt{0.9}\,(0.2354, -0.1972) + \sqrt{1-0.9}\,(0.2, -0.5)$$

$$= 0.9487 \cdot (0.2354, -0.1972) + 0.3162 \cdot (0.2, -0.5)$$

$$= (0.2233, -0.1871) + (0.06325, -0.1581) = (0.2866, -0.3452)$$

Answer:

$\mathbf{x}_{50} \approx (0.2866, -0.3452)$. Notice that the noisier $\mathbf{x}_{100} = (0.3, -0.4)$ was moved to the lower noise level $t=50$ in a single DDIM jump that skips the 49 intermediate timesteps. Because the path is deterministic, the same inputs always produce the same output.

Example 2: Classifier-Free Guidance

A text-to-image model produces $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, c) = (0.1, -0.3, 0.2)$ and $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing) = (-0.1, 0.1, 0.0)$. Compute the guided noise prediction for $w = 7.5$.

Solution:

$$\tilde{\boldsymbol{\varepsilon}}_\theta = (-0.1, 0.1, 0.0) + 7.5\left[(0.1, -0.3, 0.2) - (-0.1, 0.1, 0.0)\right]$$

$$= (-0.1, 0.1, 0.0) + 7.5 \cdot (0.2, -0.4, 0.2)$$

$$= (-0.1, 0.1, 0.0) + (1.5, -3.0, 1.5) = (1.4, -2.9, 1.5)$$

Answer:

$\tilde{\boldsymbol{\varepsilon}}_\theta = (1.4, -2.9, 1.5)$. With $w > 1$, the prediction is pushed far beyond the range of either network output. This amplifies the conditioning signal at the cost of sample diversity, and very large values can over-saturate images, which is typical CFG behavior.

Example 3: SDE to ODE Conversion

For the VP-SDE with $\beta(t) = \beta$ (constant), write the probability flow ODE.

Solution:

VP-SDE: $d\mathbf{x} = -\frac{1}{2}\beta\mathbf{x}\,dt + \sqrt{\beta}\,d\mathbf{w}$

With $\mathbf{f}(\mathbf{x}, t) = -\frac{1}{2}\beta\mathbf{x}$ and $g(t) = \sqrt{\beta}$:

Probability flow ODE:

$$d\mathbf{x} = \left[-\frac{1}{2}\beta\mathbf{x} - \frac{1}{2}\beta \cdot \nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt$$

$$= -\frac{1}{2}\beta\left[\mathbf{x} + \nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt$$

Using the score-noise relationship $\nabla_{\mathbf{x}}\log p_t(\mathbf{x}) = -\boldsymbol{\varepsilon}_\theta(\mathbf{x}, t)/\sigma_t$, where $\sigma_t = \sqrt{1-\bar{\alpha}_t}$ is the noise standard deviation at time $t$:

$$d\mathbf{x} = -\frac{1}{2}\beta\left[\mathbf{x} - \frac{\boldsymbol{\varepsilon}_\theta(\mathbf{x}, t)}{\sigma_t}\right]dt$$

Euler discretization with step size $\Delta t$ gives:

$$\mathbf{x}_{t-\Delta t} = \mathbf{x}_t + \frac{1}{2}\beta\Delta t\left[\mathbf{x}_t - \frac{\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)}{\sigma_t}\right]$$

Answer:

This is the continuous analog of DDIM. At each step, we move $\mathbf{x}$ in the direction that reduces noise — the score points toward higher density. The ODE provides a smooth, deterministic trajectory from noise to data.

Practice Problems

For a DDIM model with $\bar{\alpha}_{200} = 0.6$, $\bar{\alpha}_{100} = 0.8$, compute $\mathbf{x}_{100}$ from $\mathbf{x}_{200} = \mathbf{0}$ and $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{200}, 200) = (0.1, 0.1)$.

Answer:
$\hat{\mathbf{x}}_0 = (\mathbf{0} - \sqrt{0.4}\,(0.1, 0.1))/\sqrt{0.6} = (-0.08165, -0.08165)$, then $\mathbf{x}_{100} = \sqrt{0.8}\,(-0.08165, -0.08165) + \sqrt{0.2}\,(0.1, 0.1) = (-0.0730 + 0.04472, -0.0730 + 0.04472) = (-0.0283, -0.0283)$.
Prove that CFG with $w = 1$ reduces to standard conditional sampling.

Answer:
$\tilde{\boldsymbol{\varepsilon}}_\theta = \boldsymbol{\varepsilon}_\theta(\varnothing) + 1 \cdot (\boldsymbol{\varepsilon}_\theta(c) - \boldsymbol{\varepsilon}_\theta(\varnothing)) = \boldsymbol{\varepsilon}_\theta(c)$. The unconditional component cancels, leaving only the conditional prediction, which is exactly standard conditional diffusion.
Explain why classifier guidance requires $w > 0$ but CFG works with any real $w$. What does negative $w$ produce?

Answer:
Classifier guidance uses a gradient, so $w < 0$ pushes away from class $y$ (anti-guidance). CFG with $w < 0$ moves the prediction past the unconditional one, in the direction opposite to the conditioning, and produces "anti-samples." With $w = 0$, CFG gives purely unconditional output. With very large $w$, CFG saturates (the model over-emphasizes the conditioning direction).
Derive the SNR at time $t$ for the DDPM forward process: $\text{SNR}(t) = \bar{\alpha}_t/(1-\bar{\alpha}_t)$. Why does the cosine schedule preserve higher SNR longer?

Answer:
$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon}$. Signal power = $\bar{\alpha}_t\|\mathbf{x}_0\|^2$, noise power = $(1-\bar{\alpha}_t)$ per dimension. For data normalized to unit variance per dimension, the ratio is $\text{SNR} = \bar{\alpha}_t/(1-\bar{\alpha}_t)$. The cosine schedule keeps $\bar{\alpha}_t$ near 1 longer (flat near $t=0$), preserving signal. The linear schedule drops $\bar{\alpha}_t$ faster, destroying signal earlier. This matters because the model needs signal at intermediate noise levels to learn meaningful structure.
Given the probability flow ODE $d\mathbf{x} = v(\mathbf{x}, t)dt$, how would you compute the log-likelihood $\log p_0(\mathbf{x}_0)$ of a generated sample?

Answer:
Use the instantaneous change-of-variables formula (continuous normalizing flows): $d\log p_t(\mathbf{x}_t)/dt = -\text{tr}(\nabla_{\mathbf{x}} v(\mathbf{x}_t, t))$. Integrating along the ODE trajectory from $t=0$ to $t=T$ gives $\log p_0(\mathbf{x}_0) = \log p_T(\mathbf{x}_T) + \int_0^T \text{tr}(\nabla_{\mathbf{x}} v(\mathbf{x}_t, t))\,dt$. Since $p_T \approx \mathcal{N}(0, I)$ has a known density, this yields the likelihood up to ODE solver error. Computing the trace of the Jacobian exactly is expensive; Hutchinson's trace estimator provides an unbiased approximation.
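Hutchinson's estimator from the last answer can be sketched as follows. This is a toy illustration: the vector field is linear, $v(\mathbf{x}) = A\mathbf{x}$, so its Jacobian is the matrix $A$ and the exact trace is available for comparison. The function name `hutchinson_trace` and the use of Rademacher probes are choices made for this example; with a neural network, the `jvp` callable would come from automatic differentiation.

```python
import numpy as np

def hutchinson_trace(jvp, dim, n_probes, rng):
    """Estimate tr(J) as the mean of z^T (J z) over random Rademacher probes z."""
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=dim)
        total += z @ jvp(z)
    return total / n_probes

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))      # Jacobian of a toy linear field v(x) = A x
jvp = lambda z: A @ z                # for a network this would be a Jacobian-vector product
estimate = hutchinson_trace(jvp, dim=5, n_probes=20_000, rng=rng)
exact = np.trace(A)
```

Each probe costs one Jacobian-vector product instead of the `dim` products needed for the exact trace, which is what makes likelihood evaluation tractable in high dimensions.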

Summary

Key takeaways:

DDIM replaces DDPM's Markovian forward process with a non-Markovian one that has the same marginals, enabling deterministic generation with far fewer steps ($S \ll T$); the deterministic sampler discretizes the probability flow ODE
Classifier guidance: $\mathbf{s}_{\text{guided}} = \mathbf{s}_\theta(\mathbf{x}) + w\nabla_{\mathbf{x}}\log p_\phi(y|\mathbf{x})$, requires separate noise-aware classifier
Classifier-free guidance: $\tilde{\boldsymbol{\varepsilon}}_\theta(c) = \boldsymbol{\varepsilon}_\theta(\varnothing) + w(\boldsymbol{\varepsilon}_\theta(c) - \boldsymbol{\varepsilon}_\theta(\varnothing))$, no classifier needed, dominates practice
Noise schedules (linear, cosine, sigmoid) control SNR decay; cosine is typically best
SDE formulation: $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + g(t)d\mathbf{w}$ unifies diffusion/score models
Probability flow ODE: deterministic path with same marginals; DDIM = Euler discretization
Guidance scale $w$ trades diversity ($w$ low) for quality/adherence ($w$ high)

Quiz

Q1: The key property of DDPM that DDIM exploits for faster sampling is:
A) The Markov property of the forward process
B) That training depends only on marginals $q(\mathbf{x}_t|\mathbf{x}_0)$, not the joint
C) That the reverse process is Gaussian
D) That the noise schedule is linear
Correct: B)
If you chose B: The DDPM loss uses only $q(\mathbf{x}_t|\mathbf{x}_0)$ for each $t$, so any forward process with the same marginals yields the same trained model — DDIM changes the inference (reverse) process while keeping the same trained weights.
If you chose A: DDIM is explicitly non-Markovian.
If you chose C: The reverse process in DDIM remains Gaussian but that's not why it's faster.
If you chose D: DDIM works with any noise schedule.
Q2: In classifier-free guidance, setting $w = 7.5$ means:
A) The model uses 7.5× more compute
B) The conditioning signal is amplified 7.5× beyond the unconditional baseline
C) The model was trained with 7.5× more uncond samples
D) Only 1/7.5 of the timesteps are used
Correct: B)
If you chose B: $\tilde{\boldsymbol{\varepsilon}} = \boldsymbol{\varepsilon}_\theta(\varnothing) + 7.5(\boldsymbol{\varepsilon}_\theta(c) - \boldsymbol{\varepsilon}_\theta(\varnothing))$, so the vector from unconditional to conditional is stretched by a factor of 7.5.
If you chose A: The compute is roughly 2× (two forward passes), regardless of $w$.
If you chose C: Training dropout rate $p_{\text{uncond}}$ is separate from sampling guidance $w$.
If you chose D: $w$ controls adherence, not step count.
Q3: The probability flow ODE differs from the reverse SDE by:
A) Having different marginal distributions
B) Lacking the stochastic $d\bar{\mathbf{w}}$ term, which makes it deterministic
C) Using a different score function
D) Being defined only at discrete timesteps
Correct: B)
If you chose B: The ODE drops the $g(t)d\bar{\mathbf{w}}$ term while keeping the same marginals. The SDE has random diffusion; the ODE is a deterministic vector field.
If you chose A: They have identical marginal distributions $p_t(\mathbf{x})$ by construction.
If you chose C: Both use the same score function $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$.
If you chose D: The ODE is defined continuously; DDIM discretizes it.
Q4: Classifier guidance requires a noise-aware classifier because:
A) The guidance is applied at every noise level, and $p(y|\mathbf{x}_t)$ must be defined for noisy $\mathbf{x}_t$
B) Noisy classifiers are more accurate
C) The classifier must share weights with the diffusion model
D) Clean-image classifiers are too slow
Correct: A)
If you chose A: The classifier must estimate $p(y|\mathbf{x}_t)$ for all $t$, meaning it must handle inputs at every noise level — from nearly clean ($t$ small) to pure noise ($t \approx T$).
If you chose B: Noisy classifiers are typically less accurate than clean ones.
If you chose C: The classifier is a separate model.
If you chose D: Inference speed depends on architecture, not noise-awareness.
Q5: DDIM inversion is possible because:
A) DDIM uses a Markov forward process
B) The deterministic DDIM mapping ($\sigma = 0$) is invertible: given $\mathbf{x}_0$, you can compute the unique $\mathbf{x}_T$ that decodes to it
C) DDIM trains on both forward and reverse directions
D) Neural networks are invertible by design
Correct: B)
If you chose B: With $\sigma=0$, each step is a deterministic function of $\mathbf{x}_t$ (since $\hat{\mathbf{x}}_0$ depends only on $\mathbf{x}_t$). The chain of deterministic steps is invertible by running the ODE backward.
If you chose A: DDIM is non-Markovian. DDPM (Markovian) is not easily invertible due to stochasticity.
If you chose C: DDIM uses the same trained model as DDPM — no separate inversion training.
If you chose D: Invertibility comes from the ODE, not network architecture.

Next Steps

22-08 — Autoregressive Models — PixelCNN, WaveNet, and the autoregressive approach to generative modeling. While diffusion models generate all pixels simultaneously through iterative refinement, autoregressive models generate one element at a time, conditioning on previously generated outputs.

Pitfalls

Using DDIM with too few steps and expecting DDPM-quality samples: DDIM with $S=10$ steps on a $T=1000$ model produces deterministic samples but at substantially lower quality than DDPM with $T=1000$. The quality degradation is monotonic with step reduction. For high-quality generation, use $S \geq 50$; for real-time applications, $S=20$–$50$ with DDIM is a reasonable trade-off. Higher-order ODE solvers (Heun, DPM-Solver) can achieve better quality than DDIM at the same step count.
Setting CFG scale $w$ too high: Classifier-free guidance with $w > 10$ can produce over-saturated, unnatural images because the model is pushed far beyond the training distribution. The CFG formula $\tilde{\boldsymbol{\varepsilon}} = \boldsymbol{\varepsilon}_{\text{uncond}} + w(\boldsymbol{\varepsilon}_{\text{cond}} - \boldsymbol{\varepsilon}_{\text{uncond}})$ amplifies the conditional signal, but the model's outputs for extreme $w$ were never seen during training. Typical sweet spot: $w \in [3, 8]$ for text-to-image; $w \in [1, 3]$ for class-conditional.
Confusing DDIM inversion with exact invertibility: Deterministic DDIM ($\sigma=0$) is theoretically invertible, but in practice, numerical errors accumulate over many steps. The forward (encoding) and reverse (decoding) paths may not perfectly reconstruct the original image, especially with classifier-free guidance where the unconditional and conditional paths differ. For image editing, use fewer inversion steps and consider techniques like null-text inversion or EDICT for better reconstruction fidelity.
Using the wrong noise schedule for the task: The linear schedule ($\beta_t$ from $10^{-4}$ to $0.02$) destroys signal too quickly for high-resolution images, where fine details need preservation at intermediate noise levels. The cosine schedule preserves SNR longer and generally produces better FID scores. For very high resolutions ($1024^2$ and above), shifted cosine or sigmoid schedules with SNR tuned for the specific resolution are essential.

Q6: The log-SNR at time $t$ in a diffusion model is $\log(\bar{\alpha}_t/(1-\bar{\alpha}_t))$. Why is this quantity important for noise schedule design?

A) It determines the size of the neural network needed.
B) It controls how much information about $\mathbf{x}_0$ remains: low log-SNR means the data is mostly noise, high log-SNR means the data is mostly clean. The schedule should smoothly transition between these regimes.
C) It determines the batch size during training.
D) It must equal zero at $t = T$ for the model to work.

Answer and Explanations

Correct: B)
If you chose B: Log-SNR is the logarithm of the ratio of signal power ($\bar{\alpha}_t$) to noise power ($1-\bar{\alpha}_t$). At $t \approx 0$, log-SNR is high (clean data). At $t \approx T$, log-SNR is very negative (mostly noise). The schedule design problem is to choose how log-SNR decays as a function of $t$. Linear decay of log-SNR (exponential decay of SNR) is a common design principle that gives equal emphasis to all noise levels.
If you chose A: Network size depends on data complexity, not the schedule.
If you chose C: Batch size is independent of the noise schedule.
If you chose D: Log-SNR must be very negative at $t=T$ (so $\mathbf{x}_T \approx \mathcal{N}(0,I)$), not exactly zero.
