
22-07 - Diffusion Models (Advanced)

Phase: 22 - Generative Models Mathematics Subject: 22-07 Prerequisites: 22-06 - Diffusion Models: Foundations, 22-05 - Score-Based Generative Models Next subject: 22-08 - Autoregressive Models


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the DDIM deterministic sampling procedure and explain why it enables fewer-step generation
  2. Formulate classifier guidance and classifier-free guidance for conditional diffusion
  3. Understand the SDE/ODE continuous-time formulation that unifies diffusion and score-based models
  4. Choose appropriate noise schedules (linear, cosine, sigmoid) for different applications
  5. Implement accelerated sampling with DDIM and guidance-controlled generation

Core Content

DDIM: Denoising Diffusion Implicit Models

DDPM sampling (22-06) requires $T \approx 1000$ sequential denoising steps, making generation slow. Song et al. (2021) introduced DDIM, which generalizes DDPM to a non-Markovian forward process, enabling deterministic sampling with far fewer steps.

⚠️ CRITICAL β€” The DDIM Key Insight

The DDPM objective $L_{\text{simple}} = \mathbb{E}\left[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)\|^2\right]$ depends only on the marginals $q(\mathbf{x}_t|\mathbf{x}_0)$, not on the joint $q(\mathbf{x}_{1:T}|\mathbf{x}_0)$. This means any forward process yielding the same marginals produces the same training objective, while admitting different reverse processes.

DDIM defines a non-Markovian forward process:

$$q_\sigma(\mathbf{x}_{1:T}|\mathbf{x}_0) = q_\sigma(\mathbf{x}_T|\mathbf{x}_0)\prod_{t=2}^{T} q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$$

where for $t > 1$:

$$q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \cdot \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0}{\sqrt{1-\bar{\alpha}_t}}, \sigma_t^2 I\right)$$

The parameter $\sigma_t$ controls stochasticity:

- $\sigma_t = 0$ → DDIM (fully deterministic)
- $\sigma_t = \sqrt{(1-\bar{\alpha}_{t-1})/(1-\bar{\alpha}_t)}\sqrt{1-\bar{\alpha}_t/\bar{\alpha}_{t-1}}$ → recovers DDPM

DDIM Sampling (Deterministic)

With $\sigma_t = 0$, the reverse step becomes deterministic. Given a trained noise predictor $\boldsymbol{\varepsilon}_\theta$:

  1. Predict $\hat{\mathbf{x}}_0$ from $\mathbf{x}_t$: $$\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}}$$

  2. Compute $\mathbf{x}_{t-1}$: $$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)$$

This is the "predicted $\mathbf{x}_0$" interpretation: we estimate what $\mathbf{x}_0$ would be given $\mathbf{x}_t$, then re-noise to level $t-1$ using the same predicted noise direction.

Why DDIM is faster: We can sample using a subsequence $\tau_1 < \tau_2 < \cdots < \tau_S$ of $S \ll T$ timesteps. For $S=50$ on a $T=1000$ model, quality degrades only slightly:

$$\mathbf{x}_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar{\alpha}_{\tau_{i-1}}}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{\tau_i}, \tau_i)$$
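The two-step update and the timestep subsequence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear schedule, the evenly spaced subsequence, and `toy_eps_model` (the exact noise predictor when the data are standard normal) are all assumptions made here so the block runs without a trained network.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear-beta schedule; alpha_bar[0] = 1 stands for clean data at t = 0."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def ddim_step(x_t, eps, a_t, a_prev):
    """One deterministic DDIM update (sigma_t = 0) from level a_t to level a_prev."""
    x0_hat = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)      # step 1: predict x_0
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps  # step 2: re-noise

def ddim_sample(eps_model, x_T, alpha_bar, num_steps=50):
    """Run deterministic DDIM over an evenly spaced subsequence of timesteps."""
    T = len(alpha_bar) - 1
    # tau_0 = 0 < tau_1 < ... < tau_S = T
    taus = np.linspace(0, T, num_steps + 1).round().astype(int)
    x = x_T
    for i in range(num_steps, 0, -1):
        t, t_prev = taus[i], taus[i - 1]
        eps = eps_model(x, t)
        x = ddim_step(x, eps, alpha_bar[t], alpha_bar[t_prev])
    return x

alpha_bar = make_alpha_bar()

def toy_eps_model(x, t):
    """Hypothetical stand-in for a trained network: exact for N(0, I) data."""
    return np.sqrt(1.0 - alpha_bar[t]) * x

rng = np.random.default_rng(0)
x_T = rng.standard_normal((4, 2))
x_0 = ddim_sample(toy_eps_model, x_T, alpha_bar, num_steps=50)
print(x_0.shape)
```

With a real model, `eps_model` would be the trained network and the loop would cost `num_steps` forward passes instead of $T$.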

DDIM Inversion

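A remarkable property of deterministic DDIM ($\sigma_t = 0$): the mapping between $\mathbf{x}_0$ and $\mathbf{x}_T$ is invertible, up to discretization error. Given $\mathbf{x}_0$, we can run the update toward increasing noise to recover the latent $\mathbf{x}_T$ that reverse DDIM would decode back to (approximately) $\mathbf{x}_0$:

$$\mathbf{x}_{t} = \sqrt{\bar{\alpha}_t}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{t-1}, t-1), \qquad \hat{\mathbf{x}}_0 = \frac{\mathbf{x}_{t-1} - \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{t-1}, t-1)}{\sqrt{\bar{\alpha}_{t-1}}}$$

The step reuses the noise predicted at $\mathbf{x}_{t-1}$ in place of $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)$, which is accurate when consecutive noise levels are close.

This enables image editing: encode real image to latent → modify latent or conditioning → decode back. It is used for editing real images with prompt-to-prompt; SDEdit, by contrast, noises the image stochastically rather than inverting it.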
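A small experiment can make the encode/decode round trip concrete. The sketch below assumes a linear schedule and a toy noise predictor that is exact for standard normal data (both are illustrative choices, not part of the lesson), then measures how far the decoded sample lands from the original.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])  # alpha_bar[0] = 1 (clean data)

def eps_model(x, t):
    """Toy stand-in for a trained network: exact when the data are N(0, I)."""
    return np.sqrt(1.0 - alpha_bar[t]) * x

def ddim_move(x, t_from, t_to):
    """Deterministic DDIM update between two noise levels.

    t_to < t_from is a sampling (denoising) step; t_to > t_from is an
    inversion (encoding) step that reuses the noise predicted at t_from.
    """
    eps = eps_model(x, t_from)
    x0_hat = (x - np.sqrt(1.0 - alpha_bar[t_from]) * eps) / np.sqrt(alpha_bar[t_from])
    return np.sqrt(alpha_bar[t_to]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_to]) * eps

def roundtrip_error(x0, num_steps):
    """Relative reconstruction error after encoding x_0 -> x_T and decoding back."""
    taus = np.linspace(0, T, num_steps + 1).round().astype(int)
    x = x0
    for i in range(num_steps):            # encode: increasing noise
        x = ddim_move(x, taus[i], taus[i + 1])
    for i in range(num_steps, 0, -1):     # decode: decreasing noise
        x = ddim_move(x, taus[i], taus[i - 1])
    return np.linalg.norm(x - x0) / np.linalg.norm(x0)

x0 = np.random.default_rng(0).standard_normal(8)
err_50 = roundtrip_error(x0, 50)
err_500 = roundtrip_error(x0, 500)
print(err_50, err_500)
```

In this toy setting the reconstruction is close but not exact, and the error shrinks as the step count grows, because each step's approximation improves when neighbouring noise levels are closer together.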


⚠️ CRITICAL β€” Classifier Guidance

Classifier guidance (Dhariwal & Nichol, 2021) uses a pretrained classifier $p_\phi(y|\mathbf{x}_t)$ to steer generation toward a desired class $y$.

The key insight: we can modify the score function to incorporate the gradient of the log-classifier probability, shifting the sampling trajectory toward regions of high $p(y|\mathbf{x})$:

$$\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t|y) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(y|\mathbf{x}_t)$$

Using Bayes' rule: $p(\mathbf{x}_t|y) \propto p(\mathbf{x}_t)p(y|\mathbf{x}_t)$. The score of the class-conditional distribution decomposes into unconditional score + classifier gradient.

Translating to the noise prediction framework. Recall from 22-06:

$$\mathbf{s}_\theta(\mathbf{x}_t, t) = -\frac{\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$

The guided score is:

$$\mathbf{s}_{\text{guided}}(\mathbf{x}_t) = \mathbf{s}_\theta(\mathbf{x}_t) + w \cdot \nabla_{\mathbf{x}_t} \log p_\phi(y|\mathbf{x}_t)$$

where $w$ is the guidance scale. In terms of noise prediction:

$$\hat{\boldsymbol{\varepsilon}}_\theta(\mathbf{x}_t) = \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t) - w \cdot \sqrt{1-\bar{\alpha}_t} \cdot \nabla_{\mathbf{x}_t} \log p_\phi(y|\mathbf{x}_t)$$

The guidance scale $w$ controls the trade-off:

- $w = 1$: standard class-conditional sampling
- $w > 1$: pushes samples further toward high-density class regions → better quality, lower diversity
- $w = 0$: unconditional (ignore classifier)

Typical values: $w \in [1, 10]$ for image generation.
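The noise-space correction can be checked numerically on a toy problem. The sketch below assumes a hypothetical logistic classifier $p(y=1|\mathbf{x}) = \text{sigmoid}(\mathbf{a}^\top\mathbf{x} + b)$, whose log-gradient is available in closed form; a real system would use a noise-aware network that also takes $t$ and would obtain the gradient by automatic differentiation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_log_classifier(x, weights, bias):
    """Gradient w.r.t. x of log p(y=1 | x) for the toy logistic classifier."""
    p = sigmoid(weights @ x + bias)
    return (1.0 - p) * weights

def guided_eps(eps, x_t, a_t, guidance_scale, weights, bias):
    """Shift the noise prediction by -w * sqrt(1 - alpha_bar_t) * grad log p(y | x_t)."""
    grad = grad_log_classifier(x_t, weights, bias)
    return eps - guidance_scale * np.sqrt(1.0 - a_t) * grad

# Illustrative numbers only.
weights, bias = np.array([1.0, -2.0]), 0.5
x_t, eps, a_t = np.array([0.3, -0.4]), np.array([0.2, -0.5]), 0.8
eps_guided = guided_eps(eps, x_t, a_t, guidance_scale=3.0, weights=weights, bias=bias)
print(eps_guided)
```

Dividing `eps_guided` by $-\sqrt{1-\bar{\alpha}_t}$ recovers the guided score, which is the unconditional score plus $w$ times the classifier gradient.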

Practical Issues with Classifier Guidance


⚠️ CRITICAL β€” Classifier-Free Guidance (CFG)

Classifier-free guidance (Ho & Salimans, 2021) eliminates the need for a separate classifier by jointly training a conditional and unconditional model in one network.

During training, we randomly drop the conditioning $c$ (replace with a null token $\varnothing$) with probability $p_{\text{uncond}}$ (typically 10-20%):

$$\mathcal{L} = \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\varepsilon}}\left[\|\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t, c)\|^2\right] \quad \text{with } c \in \{y, \varnothing\}$$
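The conditioning dropout in this objective amounts to a few lines of data preparation. The sketch below builds one training batch under stated assumptions: labels are integers, `NULL_TOKEN` is a hypothetical id reserved for $\varnothing$, and `alpha_bar[k]` holds $\bar{\alpha}_{k+1}$. The network and optimizer are left out.

```python
import numpy as np

NULL_TOKEN = -1  # hypothetical id reserved for the null condition

def drop_conditioning(labels, p_uncond, rng):
    """Replace each label by NULL_TOKEN independently with probability p_uncond."""
    mask = rng.random(labels.shape) < p_uncond
    return np.where(mask, NULL_TOKEN, labels)

def cfg_training_batch(x0, labels, alpha_bar, rng, p_uncond=0.1):
    """Assemble network inputs (x_t, t, c) and the regression target eps."""
    num_levels = len(alpha_bar)
    t = rng.integers(0, num_levels, size=x0.shape[0])   # one noise level per example
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t][:, None]                            # broadcast over feature dim
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps       # sample from q(x_t | x_0)
    c = drop_conditioning(labels, p_uncond, rng)
    return x_t, t, c, eps

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((8, 2))
labels = np.arange(8) % 3
x_t, t, c, eps = cfg_training_batch(x0, labels, alpha_bar, rng, p_uncond=0.2)
print(c)
```

The loss for the batch would then be the mean squared error between `eps` and the network output evaluated at `(x_t, t, c)`.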

At sampling time, we interpolate between conditional and unconditional predictions:

$$\tilde{\boldsymbol{\varepsilon}}_\theta(\mathbf{x}_t, c) = \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing) + w\left(\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, c) - \boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing)\right)$$

This is the CFG formula. Interpretation:

- $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing)$ = unconditional prediction (what the model thinks is noise regardless of class)
- $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, c)$ = conditional prediction
- The difference is the "direction toward class $c$"
- $w$ amplifies this direction

Equivalently in score space:

$$\mathbf{s}_{\text{CFG}}(\mathbf{x}_t, c) = \mathbf{s}_\theta(\mathbf{x}_t, \varnothing) + w(\mathbf{s}_\theta(\mathbf{x}_t, c) - \mathbf{s}_\theta(\mathbf{x}_t, \varnothing))$$

Why CFG dominates in practice:

- No separate classifier needed
- Works with any conditioning (text, images, labels, etc.)
- Natural for text-to-image models (Stable Diffusion, DALL·E, Imagen)
- The same network handles all guidance internally
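At sampling time the CFG formula is a one-line combination of two forward passes. In the sketch below, `eps_model` and `null_cond` are placeholder names for whatever network and null token a given codebase uses; the numeric vectors are those of Example 2 further down.

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def cfg_predict(eps_model, x_t, t, c, null_cond, w):
    """Two forward passes through the same network, then the CFG combination."""
    eps_c = eps_model(x_t, t, c)
    eps_u = eps_model(x_t, t, null_cond)
    return cfg_eps(eps_c, eps_u, w)

eps_cond = np.array([0.1, -0.3, 0.2])
eps_uncond = np.array([-0.1, 0.1, 0.0])
guided = cfg_eps(eps_cond, eps_uncond, w=7.5)
print(guided)
```

Setting `w=1` returns the conditional prediction unchanged and `w=0` returns the unconditional one, which is a quick sanity check for any implementation.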


Noise Schedules

The choice of $\beta_t$ (or equivalently $\bar{\alpha}_t$) significantly affects generation quality.

Linear Schedule (DDPM original)

$$\beta_t = \beta_{\text{start}} + \frac{t-1}{T-1}(\beta_{\text{end}} - \beta_{\text{start}})$$

Typical: $\beta_1 = 10^{-4}, \beta_T = 0.02$. Problem: too much noise added too quickly in early steps.

Cosine Schedule (Nichol & Dhariwal, 2021)

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2$$

where $s = 0.008$ is a small offset preventing $\beta_t$ from being too small near $t=0$. Better performance than linear: prevents information from being destroyed too quickly.

Sigmoid Schedule

$$\bar{\alpha}_t = \frac{\text{sigmoid}(c) - \text{sigmoid}(-c + 2c \cdot t/T)}{\text{sigmoid}(c) - \text{sigmoid}(-c)}$$

where $c$ controls steepness. The normalization gives $\bar{\alpha}_0 = 1$ and $\bar{\alpha}_T = 0$, so the sigmoid schedule transitions smoothly from signal to noise.

Key comparison:

| Schedule | SNR at $t=T/2$ | SNR decay shape | Best for |
|----------|---------------|-----------------|----------|
| Linear | Moderate | Uniform | Simple baselines |
| Cosine | High (preserves signal) | Gradual then steep | High-quality images |
| Sigmoid | Tunable | Controllable | Custom applications |
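The three schedules are easy to compare numerically. The sketch below computes $\bar{\alpha}_t$ for $t = 1, \dots, T$ and the SNR at the midpoint. The sigmoid variant is written in a normalized form that runs from 1 at $t=0$ to 0 at $t=T$, which is an assumption about the intended parameterization, and $c = 3$ is an arbitrary illustrative steepness.

```python
import numpy as np

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for linearly spaced betas, t = 1..T."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    f = lambda u: np.cos((u / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f(np.arange(1, T + 1)) / f(0)

def sigmoid_alpha_bar(T, c=3.0):
    """Normalized sigmoid schedule (assumed form): 1 at t = 0, 0 at t = T."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    t = np.arange(1, T + 1)
    return (sig(c) - sig(-c + 2.0 * c * t / T)) / (sig(c) - sig(-c))

def snr(alpha_bar):
    return alpha_bar / (1.0 - alpha_bar)

T = 1000
schedules = {
    "linear": linear_alpha_bar(T),
    "cosine": cosine_alpha_bar(T),
    "sigmoid": sigmoid_alpha_bar(T),
}
mid = T // 2 - 1   # array index of t = T/2
for name, ab in schedules.items():
    print(f"{name:8s} alpha_bar(T/2) = {ab[mid]:.4f}  SNR(T/2) = {snr(ab)[mid]:.4f}")
```

Plotting `np.log(snr(ab))` against $t$ for each schedule is a convenient way to see the decay shapes side by side.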


SDE/ODE Formulation (Continuous-Time)

The continuous-time perspective (Song et al., 2021) unifies diffusion and score-based models. The forward process is an Itô SDE:

$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + g(t)d\mathbf{w}$$

where $\mathbf{w}$ is a standard Wiener process, $\mathbf{f}$ is the drift, and $g(t)$ is the diffusion coefficient.

Two common SDEs:

  1. Variance-Preserving (VP) SDE (DDPM): $$d\mathbf{x} = -\frac{1}{2}\beta(t)\mathbf{x}\,dt + \sqrt{\beta(t)}\,d\mathbf{w}$$

  2. Variance-Exploding (VE) SDE (Score-based/22-05): $$d\mathbf{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\,d\mathbf{w}$$

The reverse-time SDE (Anderson, 1982) is:

$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt + g(t)d\bar{\mathbf{w}}$$

where $d\bar{\mathbf{w}}$ is a reverse-time Wiener process. The score $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$ is what we learn with $\mathbf{s}_\theta(\mathbf{x}, t)$.
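An Euler-Maruyama discretization of the reverse-time SDE can be tested on a case where the score is known exactly. The sketch below assumes one-dimensional data $x_0 \sim \mathcal{N}(0, 0.25)$ and a VP SDE with constant $\beta$, so every marginal $p_t$ is Gaussian and the analytic score stands in for $\mathbf{s}_\theta$. The constants are illustrative only.

```python
import numpy as np

BETA, T_END, VAR0 = 1.0, 10.0, 0.25   # illustrative constants

def marginal_var(t):
    """Variance of p_t when x_0 ~ N(0, VAR0) under the constant-beta VP SDE."""
    return 1.0 + (VAR0 - 1.0) * np.exp(-BETA * t)

def score(x, t):
    """Exact score of the Gaussian marginal; stands in for s_theta(x, t)."""
    return -x / marginal_var(t)

def reverse_sde_sample(n_samples, n_steps, rng):
    """Integrate the reverse-time SDE from t = T_END down to t = 0."""
    h = T_END / n_steps
    x = rng.standard_normal(n_samples) * np.sqrt(marginal_var(T_END))
    for k in range(n_steps):
        t = T_END - k * h
        drift = -0.5 * BETA * x - BETA * score(x, t)        # f - g^2 * score
        noise = np.sqrt(BETA * h) * rng.standard_normal(n_samples)
        x = x - drift * h + noise                            # step backward in time
    return x

samples = reverse_sde_sample(20000, 2000, np.random.default_rng(0))
print(samples.var(), VAR0)
```

If the discretization is right, the sample variance should land close to the data variance `VAR0`, up to Monte Carlo and step-size error.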

Probability Flow ODE

Remarkably, there exists a deterministic probability flow ODE with the same marginal distributions:

$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}g(t)^2\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt$$

Deterministic DDIM can be read as a discretization of this ODE: the DDIM paper shows that the $\sigma_t = 0$ update is an Euler step on a reparameterized form of the probability flow ODE. This explains why DDIM is deterministic: it solves the probability flow ODE rather than the reverse SDE.

The ODE formulation enables:

- Exact likelihood computation: $\log p_0(\mathbf{x}_0)$ via instantaneous change-of-variables
- Deterministic encoding/decoding: invertible mappings for image editing
- Higher-order solvers: Heun's method, Runge-Kutta for faster/better sampling
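To see the benefit of a higher-order solver, the probability flow ODE can be integrated on a toy problem with a closed-form answer. The sketch assumes one-dimensional Gaussian data and a constant-$\beta$ VP SDE (illustrative constants), for which the ODE maps $x_T$ to $x_0 = x_T\sqrt{\mathrm{Var}_0/\mathrm{Var}_T}$ exactly.

```python
import numpy as np

BETA, T_END, VAR0 = 1.0, 10.0, 0.25   # illustrative constants

def marginal_var(t):
    """Variance of p_t when x_0 ~ N(0, VAR0) under the constant-beta VP SDE."""
    return 1.0 + (VAR0 - 1.0) * np.exp(-BETA * t)

def velocity(x, t):
    """Probability flow drift f - 0.5 * g^2 * score, with the exact Gaussian score."""
    score = -x / marginal_var(t)
    return -0.5 * BETA * x - 0.5 * BETA * score

def integrate_backward(x_T, n_steps, method="heun"):
    """Integrate the ODE from t = T_END down to t = 0 with Euler or Heun steps."""
    h = T_END / n_steps
    x = np.asarray(x_T, dtype=float)
    for k in range(n_steps):
        t = T_END - k * h
        v1 = velocity(x, t)
        x_euler = x - h * v1                 # first-order prediction
        if method == "euler":
            x = x_euler
        else:
            v2 = velocity(x_euler, t - h)    # slope at the predicted endpoint
            x = x - 0.5 * h * (v1 + v2)      # Heun (trapezoidal) correction
    return x

def exact_x0(x_T):
    """Closed-form ODE solution for this Gaussian toy problem."""
    return np.asarray(x_T) * np.sqrt(marginal_var(0.0) / marginal_var(T_END))

x_T = np.array([1.5, -0.7])
truth = exact_x0(x_T)
err_euler = np.max(np.abs(integrate_backward(x_T, 2000, "euler") - truth) / np.abs(truth))
err_heun = np.max(np.abs(integrate_backward(x_T, 2000, "heun") - truth) / np.abs(truth))
print(err_euler, err_heun)
```

Heun costs two drift evaluations per step, so a fair comparison in practice matches the number of network calls rather than the number of steps.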



Key Terms

Worked Examples

Example 1: DDIM Sampling Step

A model trained with $T=1000$ uses $\bar{\alpha}_{100} = 0.8$ and $\bar{\alpha}_{50} = 0.9$. Given $\mathbf{x}_{100} = (0.3, -0.4)$ and $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{100}, 100) = (0.2, -0.5)$, compute the deterministic DDIM step to $\mathbf{x}_{50}$.

Solution:

Step 1: Predict $\hat{\mathbf{x}}_0$:

$$\hat{\mathbf{x}}_0 = \frac{(0.3, -0.4) - \sqrt{1-0.8}\,(0.2, -0.5)}{\sqrt{0.8}} = \frac{(0.3, -0.4) - 0.4472 \cdot (0.2, -0.5)}{0.8944}$$

$$= \frac{(0.3 - 0.08944, -0.4 + 0.2236)}{0.8944} = \frac{(0.2106, -0.1764)}{0.8944} = (0.2354, -0.1972)$$

Step 2: Re-noise to $t=50$:

$$\mathbf{x}_{50} = \sqrt{0.9}\,(0.2354, -0.1972) + \sqrt{1-0.9}\,(0.2, -0.5)$$

$$= 0.9487 \cdot (0.2354, -0.1972) + 0.3162 \cdot (0.2, -0.5)$$

$$= (0.2233, -0.1871) + (0.06325, -0.1581) = (0.2866, -0.3452)$$

Answer: $\mathbf{x}_{50} \approx (0.2866, -0.3452)$. A single DDIM update moved the sample from noise level $t=100$ to $t=50$, skipping the 49 intermediate timesteps. Because the path is deterministic, the same inputs always produce the same output.

Example 2: Classifier-Free Guidance

A text-to-image model produces $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, c) = (0.1, -0.3, 0.2)$ and $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, \varnothing) = (-0.1, 0.1, 0.0)$. Compute the guided noise prediction for $w = 7.5$.

Solution:

$$\tilde{\boldsymbol{\varepsilon}}_\theta = (-0.1, 0.1, 0.0) + 7.5\left[(0.1, -0.3, 0.2) - (-0.1, 0.1, 0.0)\right]$$

$$= (-0.1, 0.1, 0.0) + 7.5 \cdot (0.2, -0.4, 0.2)$$

$$= (-0.1, 0.1, 0.0) + (1.5, -3.0, 1.5) = (1.4, -2.9, 1.5)$$

Answer: $\tilde{\boldsymbol{\varepsilon}}_\theta = (1.4, -2.9, 1.5)$. With $w > 1$, the prediction is extrapolated well beyond both the conditional and the unconditional outputs. This amplifies the conditioning signal at the cost of sample diversity and, for large $w$, over-saturated outputs, which is typical CFG behavior.

Example 3: SDE to ODE Conversion

For the VP-SDE with $\beta(t) = \beta$ (constant), write the probability flow ODE.

Solution:

VP-SDE: $d\mathbf{x} = -\frac{1}{2}\beta\mathbf{x}\,dt + \sqrt{\beta}\,d\mathbf{w}$

With $\mathbf{f}(\mathbf{x}, t) = -\frac{1}{2}\beta\mathbf{x}$ and $g(t) = \sqrt{\beta}$:

Probability flow ODE:

$$d\mathbf{x} = \left[-\frac{1}{2}\beta\mathbf{x} - \frac{1}{2}\beta \cdot \nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt$$

$$= -\frac{1}{2}\beta\left[\mathbf{x} + \nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt$$

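Using the score-noise relationship $\nabla_{\mathbf{x}}\log p_t(\mathbf{x}) = -\boldsymbol{\varepsilon}_\theta(\mathbf{x}, t)/\sigma_t$, where here $\sigma_t = \sqrt{1-\bar{\alpha}_t}$ is the noise standard deviation at time $t$ (not the DDIM stochasticity parameter):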

$$d\mathbf{x} = -\frac{1}{2}\beta\left[\mathbf{x} - \frac{\boldsymbol{\varepsilon}_\theta(\mathbf{x}, t)}{\sigma_t}\right]dt$$

Euler discretization with step size $\Delta t$ gives:

$$\mathbf{x}_{t-\Delta t} = \mathbf{x}_t + \frac{1}{2}\beta\Delta t\left[\mathbf{x}_t - \frac{\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)}{\sigma_t}\right]$$

Answer: This is the continuous analog of DDIM. At each step, we move $\mathbf{x}$ in the direction that reduces noise, since the score points toward higher density. The ODE provides a smooth, deterministic trajectory from noise to data.

Practice Problems

  1. For a DDIM model with $\bar{\alpha}_{200} = 0.6$, $\bar{\alpha}_{100} = 0.8$, compute $\mathbf{x}_{100}$ from $\mathbf{x}_{200} = \mathbf{0}$ and $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_{200}, 200) = (0.1, 0.1)$.

    Answer: $\hat{\mathbf{x}}_0 = (\mathbf{0} - \sqrt{0.4}(0.1, 0.1))/\sqrt{0.6} = (-0.08165, -0.08165)$, so $\mathbf{x}_{100} = \sqrt{0.8}(-0.08165, -0.08165) + \sqrt{0.2}(0.1, 0.1) = (-0.0730 + 0.04472, -0.0730 + 0.04472) = (-0.0283, -0.0283)$.

  2. Prove that CFG with $w = 1$ reduces to standard conditional sampling.

    Answer: $\tilde{\boldsymbol{\varepsilon}}_\theta = \boldsymbol{\varepsilon}_\theta(\varnothing) + 1 \cdot (\boldsymbol{\varepsilon}_\theta(c) - \boldsymbol{\varepsilon}_\theta(\varnothing)) = \boldsymbol{\varepsilon}_\theta(c)$. The unconditional component cancels, leaving only the conditional prediction, which is exactly standard conditional diffusion.

  3. Explain why classifier guidance requires $w > 0$ but CFG works with any real $w$. What does negative $w$ produce?

    Answer: Classifier guidance adds $w$ times a classifier gradient, so $w < 0$ pushes samples away from class $y$ (anti-guidance). CFG with $w < 0$ extrapolates past the unconditional prediction, away from the conditioning, and produces "anti-samples." With $w = 0$, CFG gives purely unconditional output. With very large $w$, the model over-emphasizes the conditioning direction and samples saturate.

  4. Derive the SNR at time $t$ for the DDPM forward process: $\text{SNR}(t) = \bar{\alpha}_t/(1-\bar{\alpha}_t)$. Why does the cosine schedule preserve higher SNR longer?

    Answer: $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\varepsilon}$. Signal power = $\bar{\alpha}_t\|\mathbf{x}_0\|^2$, noise power = $(1-\bar{\alpha}_t)$ per dimension. For data normalized to unit power, the ratio is $\text{SNR} = \bar{\alpha}_t/(1-\bar{\alpha}_t)$. The cosine schedule keeps $\bar{\alpha}_t$ near 1 longer (flat near $t=0$), preserving signal. The linear schedule drops $\bar{\alpha}_t$ faster, destroying signal earlier. This matters because the model needs signal at intermediate noise levels to learn meaningful structure.

  5. Given the probability flow ODE $d\mathbf{x} = v(\mathbf{x}, t)dt$, how would you compute the log-likelihood $\log p_0(\mathbf{x}_0)$ of a generated sample?

    Answer: Using the instantaneous change-of-variables formula (continuous normalizing flows): $d\log p_t/dt = -\text{tr}(\nabla_{\mathbf{x}} v(\mathbf{x}_t, t))$. Integrate from $t=0$ to $t=T$ to get $\log p_0(\mathbf{x}_0) = \log p_T(\mathbf{x}_T) + \int_0^T \text{tr}(\nabla_{\mathbf{x}} v(\mathbf{x}_t, t))dt$. Since $p_T \approx \mathcal{N}(0, I)$, the likelihood is exact up to this prior approximation and ODE solver error. Computing the trace of the Jacobian is expensive; Hutchinson's trace estimator provides an unbiased approximation.
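The likelihood recipe in Problem 5 can be verified on a case where the answer is known. The sketch below assumes one-dimensional Gaussian data under a constant-$\beta$ VP SDE (illustrative constants), so the Jacobian trace is a scalar available in closed form and no Hutchinson estimator is needed; it uses the exact Gaussian $p_T$ rather than the $\mathcal{N}(0, I)$ approximation.

```python
import numpy as np

BETA, T_END, VAR0 = 1.0, 10.0, 0.25   # illustrative constants

def marginal_var(t):
    """Variance of p_t when x_0 ~ N(0, VAR0) under the constant-beta VP SDE."""
    return 1.0 + (VAR0 - 1.0) * np.exp(-BETA * t)

def velocity(x, t):
    """Probability flow drift with the exact Gaussian score -x / var_t."""
    return -0.5 * BETA * x * (1.0 - 1.0 / marginal_var(t))

def divergence(t):
    """d velocity / d x; in 1-D this is the whole Jacobian trace (independent of x here)."""
    return -0.5 * BETA * (1.0 - 1.0 / marginal_var(t))

def log_normal(x, var):
    """Log density of N(0, var) at x."""
    return -0.5 * (x ** 2 / var + np.log(2.0 * np.pi * var))

def log_likelihood(x0, n_steps=2000):
    """log p_0(x_0) = log p_T(x_T) + integral over [0, T] of tr(grad v) dt."""
    h = T_END / n_steps
    x, integral = float(x0), 0.0
    for k in range(n_steps):
        t = k * h
        v1 = velocity(x, t)
        x_pred = x + h * v1
        v2 = velocity(x_pred, t + h)
        integral += 0.5 * h * (divergence(t) + divergence(t + h))  # trapezoid rule
        x = x + 0.5 * h * (v1 + v2)                                # Heun step forward in time
    return log_normal(x, marginal_var(T_END)) + integral

estimate = log_likelihood(0.3)
reference = log_normal(0.3, VAR0)
print(estimate, reference)
```

For image-sized data the divergence would come from Jacobian-vector products with random probe vectors, which is where Hutchinson's estimator enters.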


Summary

Key takeaways:


Quiz

  1. The key property of DDPM that DDIM exploits for faster sampling is:
     - A) The Markov property of the forward process
     - B) That training depends only on marginals $q(\mathbf{x}_t|\mathbf{x}_0)$, not the joint
     - C) That the reverse process is Gaussian
     - D) That the noise schedule is linear

     Correct: B)
     - If you chose B: The DDPM loss uses only $q(\mathbf{x}_t|\mathbf{x}_0)$ for each $t$, so any forward process with the same marginals yields the same trained model; DDIM changes the inference (reverse) process while keeping the same trained weights.
     - If you chose A: DDIM is explicitly non-Markovian.
     - If you chose C: The reverse process in DDIM remains Gaussian but that's not why it's faster.
     - If you chose D: DDIM works with any noise schedule.

  2. In classifier-free guidance, setting $w = 7.5$ means:
     - A) The model uses 7.5× more compute
     - B) The conditioning signal is amplified 7.5× beyond the unconditional baseline
     - C) The model was trained with 7.5× more uncond samples
     - D) Only 1/7.5 of the timesteps are used

     Correct: B)
     - If you chose B: $\tilde{\boldsymbol{\varepsilon}} = \boldsymbol{\varepsilon}_\theta(\varnothing) + 7.5(\boldsymbol{\varepsilon}_\theta(c) - \boldsymbol{\varepsilon}_\theta(\varnothing))$, so the vector from unconditional to conditional is stretched by a factor of 7.5.
     - If you chose A: The compute is roughly 2× (two forward passes), regardless of $w$.
     - If you chose C: Training dropout rate $p_{\text{uncond}}$ is separate from sampling guidance $w$.
     - If you chose D: $w$ controls adherence, not step count.

  3. The probability flow ODE differs from the reverse SDE by:
     - A) Having different marginal distributions
     - B) Lacking the stochastic $d\bar{\mathbf{w}}$ term: it's deterministic
     - C) Using a different score function
     - D) Being defined only at discrete timesteps

     Correct: B)
     - If you chose B: The ODE drops the $g(t)d\bar{\mathbf{w}}$ term while keeping the same marginals. The SDE has random diffusion; the ODE is a deterministic vector field.
     - If you chose A: They have identical marginal distributions $p_t(\mathbf{x})$ by construction.
     - If you chose C: Both use the same score function $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$.
     - If you chose D: The ODE is defined continuously; DDIM discretizes it.

  4. Classifier guidance requires a noise-aware classifier because:
     - A) The guidance is applied at every noise level, and $p(y|\mathbf{x}_t)$ must be defined for noisy $\mathbf{x}_t$
     - B) Noisy classifiers are more accurate
     - C) The classifier must share weights with the diffusion model
     - D) Clean-image classifiers are too slow

     Correct: A)
     - If you chose A: The classifier must estimate $p(y|\mathbf{x}_t)$ for all $t$, meaning it must handle inputs at every noise level, from nearly clean ($t$ small) to pure noise ($t \approx T$).
     - If you chose B: Noisy classifiers are typically less accurate than clean ones.
     - If you chose C: The classifier is a separate model.
     - If you chose D: Inference speed depends on architecture, not noise-awareness.

  5. DDIM inversion is possible because:
     - A) DDIM uses a Markov forward process
     - B) The deterministic DDIM mapping ($\sigma = 0$) is invertible: given $\mathbf{x}_0$, you can compute the unique $\mathbf{x}_T$ that decodes to it
     - C) DDIM trains on both forward and reverse directions
     - D) Neural networks are invertible by design

     Correct: B)
     - If you chose B: With $\sigma=0$, each step is a deterministic function of $\mathbf{x}_t$ (since $\hat{\mathbf{x}}_0$ depends only on $\mathbf{x}_t$). The chain of deterministic steps is invertible by running the ODE backward.
     - If you chose A: DDIM is non-Markovian. DDPM (Markovian) is not easily invertible due to stochasticity.
     - If you chose C: DDIM uses the same trained model as DDPM, with no separate inversion training.
     - If you chose D: Invertibility comes from the ODE, not network architecture.

Next Steps

22-08 - Autoregressive Models: PixelCNN, WaveNet, and the autoregressive approach to generative modeling. While diffusion models generate all pixels simultaneously through iterative refinement, autoregressive models generate one element at a time, conditioning on previously generated outputs.


Pitfalls

  1. Using DDIM with too few steps and expecting DDPM-quality samples: DDIM with $S=10$ steps on a $T=1000$ model produces deterministic samples, but at substantially lower quality than 1000-step DDPM sampling. Quality generally degrades as the step count shrinks. For high-quality generation, use $S \geq 50$; for real-time applications, $S = 20$ to $50$ with DDIM is a reasonable trade-off. Higher-order ODE solvers (Heun, DPM-Solver) can achieve better quality than DDIM at the same step count.

  2. Setting CFG scale $w$ too high: Classifier-free guidance with $w > 10$ can produce over-saturated, unnatural images because the model is pushed far beyond the training distribution. The CFG formula $\tilde{\boldsymbol{\varepsilon}} = \boldsymbol{\varepsilon}_{\text{uncond}} + w(\boldsymbol{\varepsilon}_{\text{cond}} - \boldsymbol{\varepsilon}_{\text{uncond}})$ amplifies the conditional signal, but the model's outputs for extreme $w$ were never seen during training. Typical sweet spot: $w \in [3, 8]$ for text-to-image; $w \in [1, 3]$ for class-conditional.

  3. Confusing DDIM inversion with exact invertibility: Deterministic DDIM ($\sigma=0$) is theoretically invertible, but in practice each inversion step is only approximate (it reuses the noise predicted at the previous noise level), and these errors accumulate along the trajectory. The forward (encoding) and reverse (decoding) paths may not perfectly reconstruct the original image, especially with classifier-free guidance where the unconditional and conditional paths differ. For image editing, avoid very coarse inversion schedules and consider techniques like null-text inversion or EDICT for better reconstruction fidelity.

  4. Using the wrong noise schedule for the task: The linear schedule ($\beta_t$ from $10^{-4}$ to $0.02$) destroys signal too quickly for high-resolution images, where fine details need preservation at intermediate noise levels. The cosine schedule preserves SNR longer and generally produces better FID scores. For very high resolutions ($1024^2$ and above), shifted cosine or sigmoid schedules with SNR tuned for the specific resolution are essential.




Q6: The log-SNR at time $t$ in a diffusion model is $\log(\bar{\alpha}_t/(1-\bar{\alpha}_t))$. Why is this quantity important for noise schedule design?

- A) It determines the size of the neural network needed.
- B) It controls how much information about $\mathbf{x}_0$ remains: low log-SNR means the data is mostly noise, high log-SNR means the data is mostly clean. The schedule should smoothly transition between these regimes.
- C) It determines the batch size during training.
- D) It must equal zero at $t = T$ for the model to work.

Correct: B)

Log-SNR is the log of the ratio of signal power ($\bar{\alpha}_t$) to noise power ($1-\bar{\alpha}_t$). At $t \approx 0$, log-SNR is high (clean data). At $t \approx T$, log-SNR is very negative (mostly noise). The schedule design problem is to choose how log-SNR decays as a function of $t$; linear decay of log-SNR (exponential decay of SNR) is a common design principle that gives equal emphasis to all noise levels.

- A) Network size depends on data complexity, not the schedule.
- C) Batch size is independent of the noise schedule.
- D) Log-SNR must be very negative at $t=T$ (so $\mathbf{x}_T \approx \mathcal{N}(0,I)$), not exactly zero.