22-05 - Score-Based Generative Models
Phase: 22 - Generative Models Mathematics
Subject: 22-05
Prerequisites: 22-01 - Autoencoders (DAE section), Phase 06 (Vector Calculus), Phase 13 (Probability)
Next subject: 22-06 - Diffusion Models: Foundations
Learning Objectives
By the end of this subject, you will be able to:
- Define the score function $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ and explain its role in generative modeling
- Derive the denoising score matching (DSM) objective from the score matching loss
- Explain Langevin dynamics as a sampling procedure using only the score function
- Analyze the manifold hypothesis and why it necessitates noise-perturbed score matching
- Describe annealed Langevin dynamics and connect it to diffusion models
Core Content
What Is the Score Function?
For a probability density $p(\mathbf{x})$, the score function is the gradient of the log-density:
$$\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$$
⚠️ CRITICAL: Why the Score? The score function has a remarkable property: it is independent of the normalization constant. If $p(\mathbf{x}) = \frac{1}{Z}\tilde{p}(\mathbf{x})$ where $Z = \int \tilde{p}(\mathbf{x}) d\mathbf{x}$ is the (usually intractable) partition function:
$$\nabla_{\mathbf{x}} \log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x}) - \nabla_{\mathbf{x}} \log Z = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x})$$
The $Z$ term vanishes because $Z$ doesn't depend on $\mathbf{x}$! This is the key insight: we can learn the score function without knowing the normalization constant, avoiding one of the hardest problems in probabilistic modeling.
Geometric interpretation: $\mathbf{s}(\mathbf{x})$ points in the direction of steepest increase in log-probability, i.e., toward higher-density regions. Following the score moves samples toward the modes of the distribution.
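The invariance to $Z$ can be checked numerically. The following is a minimal sketch, assuming a 1D Gaussian with illustrative values $\mu = 1$, $\sigma = 2$ and a central finite difference in place of an analytic gradient; the function names are hypothetical and not part of the course material.

```python
import math

def unnormalized_log_density(x, mu=1.0, sigma=2.0):
    # log of p~(x) = exp(-(x - mu)^2 / (2 sigma^2)); the constant Z is deliberately left out
    return -((x - mu) ** 2) / (2.0 * sigma ** 2)

def normalized_log_density(x, mu=1.0, sigma=2.0):
    # log p(x) = log p~(x) - log Z, with Z = sigma * sqrt(2 pi) for a 1D Gaussian
    log_z = math.log(sigma * math.sqrt(2.0 * math.pi))
    return unnormalized_log_density(x, mu, sigma) - log_z

def finite_difference_score(log_density, x, h=1e-5):
    # central-difference approximation of d/dx log p(x)
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)

x = 3.0
score_unnormalized = finite_difference_score(unnormalized_log_density, x)
score_normalized = finite_difference_score(normalized_log_density, x)
exact_score = -(x - 1.0) / 2.0 ** 2  # closed form -(x - mu) / sigma^2

print(score_unnormalized, score_normalized, exact_score)
```

Both finite-difference estimates should agree with each other and with the closed form, because subtracting $\log Z$ shifts the log-density by a constant and leaves its slope unchanged.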
Score Matching
Given samples $\{\mathbf{x}^{(i)}\}$ from $p_{\text{data}}$, we want a model $\mathbf{s}_\theta(\mathbf{x})$ that approximates the true score $\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})$. The score matching objective minimizes the Fisher divergence:
$$J(\theta) = \frac{1}{2} \mathbb{E}_{p_{\text{data}}}\left[\|\mathbf{s}_\theta(\mathbf{x}) - \nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})\|^2\right]$$
But we don't know $\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})$! Hyvärinen (2005) showed that under mild regularity conditions, this is equivalent to:
$$J(\theta) = \mathbb{E}_{p_{\text{data}}}\left[\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})) + \frac{1}{2}\|\mathbf{s}_\theta(\mathbf{x})\|^2\right] + \text{const}$$
However, the trace of the Jacobian $\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta) = \sum_{i=1}^d \frac{\partial s_{\theta,i}}{\partial x_i}$ requires $d$ backward passes, which is computationally prohibitive for high-dimensional data like images.
⚠️ Denoising Score Matching (DSM)
Vincent (2011) introduced a brilliant solution: instead of matching the score of the data distribution directly, match the score of the noise-perturbed data distribution.
Perturb data with Gaussian noise: $\tilde{\mathbf{x}} = \mathbf{x} + \sigma \boldsymbol{\varepsilon}$, $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)$. The perturbed distribution is:
$$p_\sigma(\tilde{\mathbf{x}}) = \int p_{\text{data}}(\mathbf{x}) \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 I) d\mathbf{x}$$
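Vincent (2011) showed that, up to a constant independent of $\theta$, matching the marginal score $\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}})$ is equivalent to matching the conditional score $\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}|\mathbf{x})$. The DSM objective is:
$$J_{\text{DSM}}(\theta) = \frac{1}{2} \mathbb{E}_{p_{\text{data}}(\mathbf{x})} \mathbb{E}_{\tilde{\mathbf{x}} \sim \mathcal{N}(\mathbf{x}, \sigma^2 I)}\left[\left\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}|\mathbf{x})\right\|^2\right]$$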
And here's the crucial simplification: since $p_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 I)$:
$$\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) = \nabla_{\tilde{\mathbf{x}}}\left(-\frac{\|\tilde{\mathbf{x}} - \mathbf{x}\|^2}{2\sigma^2}\right) = -\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^2} = -\frac{\boldsymbol{\varepsilon}}{\sigma}$$
Therefore:
$$J_{\text{DSM}}(\theta) = \frac{1}{2} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}, \boldsymbol{\varepsilon} \sim \mathcal{N}(0,I)}\left[\left\|\mathbf{s}_\theta(\mathbf{x} + \sigma\boldsymbol{\varepsilon}) + \frac{\boldsymbol{\varepsilon}}{\sigma}\right\|^2\right]$$
Equivalently, we can parameterize $\mathbf{s}_\theta(\tilde{\mathbf{x}}) = -\frac{1}{\sigma}\boldsymbol{\varepsilon}_\theta(\tilde{\mathbf{x}})$, which turns the objective into $\frac{1}{2\sigma^2}\mathbb{E}\left[\|\boldsymbol{\varepsilon}_\theta(\mathbf{x} + \sigma\boldsymbol{\varepsilon}) - \boldsymbol{\varepsilon}\|^2\right]$. Dropping the constant factor, we train $\boldsymbol{\varepsilon}_\theta$ to predict the noise:
$$\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{x}, \boldsymbol{\varepsilon}}\left[\|\boldsymbol{\varepsilon}_\theta(\mathbf{x} + \sigma\boldsymbol{\varepsilon}) - \boldsymbol{\varepsilon}\|^2\right]$$
This is just a denoising objective: predict the noise that was added!
Connection to DAE: Recall from 22-01 that a denoising autoencoder trained to reconstruct $\mathbf{x}$ from the noisy input $\tilde{\mathbf{x}} = \mathbf{x} + \sigma\boldsymbol{\varepsilon}$ learns $\hat{\mathbf{x}}(\tilde{\mathbf{x}}) \approx \tilde{\mathbf{x}} + \sigma^2 \mathbf{s}(\tilde{\mathbf{x}})$. Rearranging: $\mathbf{s}(\tilde{\mathbf{x}}) \approx (\hat{\mathbf{x}} - \tilde{\mathbf{x}})/\sigma^2$. For a perfect reconstruction $\hat{\mathbf{x}} = \mathbf{x}$ this equals $-\boldsymbol{\varepsilon}/\sigma$, which is exactly the DSM regression target.
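A small numerical check can make the DSM claim concrete. The sketch below assumes toy 1D data $x \sim \mathcal{N}(0, 1)$, $\sigma = 0.5$, and a hypothetical one-parameter linear model $s_\theta(\tilde{x}) = a\,\tilde{x}$, chosen only so that the DSM minimizer has a closed-form least-squares solution. Under these assumptions the perturbed distribution is $\mathcal{N}(0, 1 + \sigma^2)$, so the fitted slope should approach $-1/(1+\sigma^2)$.

```python
import random

random.seed(0)
sigma = 0.5
num_samples = 50000

# DSM loss for the linear model: E[(a * x_tilde + eps / sigma)^2].
# Its minimizer is a = -E[x_tilde * eps / sigma] / E[x_tilde^2] (ordinary least squares).
numerator = 0.0
denominator = 0.0
for _ in range(num_samples):
    x = random.gauss(0.0, 1.0)      # clean sample from the assumed toy data distribution
    eps = random.gauss(0.0, 1.0)    # standard Gaussian noise
    x_tilde = x + sigma * eps       # perturbed sample
    numerator += x_tilde * (eps / sigma)
    denominator += x_tilde ** 2

a_dsm = -numerator / denominator

# The marginal p_sigma is N(0, 1 + sigma^2), whose score is -x_tilde / (1 + sigma^2).
a_true = -1.0 / (1.0 + sigma ** 2)

print(a_dsm, a_true)
```

The regression target only ever uses the conditional score $-\varepsilon/\sigma$, yet the fitted slope lands near the slope of the marginal score, which is the point of the DSM equivalence.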
Langevin Dynamics for Sampling
Once we have $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})$, how do we generate samples? With Langevin dynamics, a Markov chain Monte Carlo method:
$$\mathbf{x}_{t+1} = \mathbf{x}_t + \frac{\epsilon}{2} \mathbf{s}_\theta(\mathbf{x}_t) + \sqrt{\epsilon} \cdot \mathbf{z}_t, \quad \mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, I)$$
where $\epsilon$ is the step size.
Why this works: Langevin dynamics is a discretization of the stochastic differential equation:
$$d\mathbf{x}_t = \frac{1}{2}\nabla_{\mathbf{x}} \log p(\mathbf{x}_t) dt + d\mathbf{W}_t$$
where $\mathbf{W}_t$ is a Wiener process (Brownian motion). This SDE has $p(\mathbf{x})$ as its stationary distribution, so running it long enough produces samples from $p$.
Intuition: The drift term $\frac{1}{2}\mathbf{s}(\mathbf{x})$ pushes toward high-density regions. The diffusion term $\sqrt{\epsilon}\mathbf{z}_t$ adds exploration, preventing the chain from getting stuck in local modes.
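The update rule can be tried on a target whose score is known exactly. This is a minimal sketch, assuming a 1D Gaussian target $\mathcal{N}(3, 1)$ with score $-(x - 3)$, step size $\epsilon = 0.05$, 200 steps per chain, and 2000 independent chains started at $x = 0$; all of these values are illustrative choices, not prescriptions from the text.

```python
import math
import random

def langevin_sample(score, x0, step_size, n_steps, rng):
    # Iterate x <- x + (step_size / 2) * score(x) + sqrt(step_size) * z
    x = x0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * z
    return x

def gaussian_score(x, mu=3.0, var=1.0):
    # exact score of N(mu, var)
    return -(x - mu) / var

rng = random.Random(0)
samples = [langevin_sample(gaussian_score, 0.0, 0.05, 200, rng) for _ in range(2000)]

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

The empirical mean and variance should come out close to 3 and 1. A small residual bias is expected because the discrete chain only approximates the continuous-time SDE.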
⚠️ The Manifold Hypothesis Problem
The manifold hypothesis holds that real data (images, audio, text) lies on or near low-dimensional manifolds embedded in high-dimensional space. For example, $256 \times 256$ grayscale images live in $\mathbb{R}^{65536}$, but the manifold of natural images has much lower intrinsic dimension.
On the data manifold, the score $\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})$ is undefined in directions perpendicular to the manifold: the density is zero off-manifold, so the log-density is $-\infty$ and its gradient is undefined. Score matching on clean data therefore fails.
Solution: Noise perturbation. Adding Gaussian noise spreads the distribution to the ambient space, making the score well-defined everywhere. But this introduces a trade-off:
- Small $\sigma$: score is accurate near the manifold but poorly estimated off-manifold
- Large $\sigma$: score is well-estimated everywhere but the perturbed distribution is a blurred version of the data
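One side of this trade-off can be seen in the most extreme case of a low-dimensional data set. The sketch below assumes the data is a single point at $0$ (a zero-dimensional "manifold" in 1D), so the perturbed density is $\mathcal{N}(0, \sigma^2)$ with score $-\tilde{x}/\sigma^2$. It is an illustration under that assumption only, with a hypothetical function name.

```python
def perturbed_point_mass_score(x_tilde, sigma):
    # Data is a Dirac mass at 0, so p_sigma = N(0, sigma^2) and its score is -x_tilde / sigma^2
    return -x_tilde / sigma ** 2

# Score at a fixed off-data location x_tilde = 0.5 for decreasing noise levels
scores = {sigma: perturbed_point_mass_score(0.5, sigma) for sigma in (1.0, 0.1, 0.01)}
for sigma, value in scores.items():
    print(sigma, value)
```

At a fixed off-data location the score magnitude grows like $1/\sigma^2$ as $\sigma$ shrinks, while the perturbed density there becomes vanishingly small. A model trained at small $\sigma$ therefore sees almost no samples in exactly the regions where the target is largest.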
Annealed Langevin Dynamics
The solution: use multiple noise levels $\sigma_1 > \sigma_2 > \cdots > \sigma_L$:
- Train separate score models $\mathbf{s}_\theta(\mathbf{x}, \sigma_i)$ for each noise level, or one model conditioned on $\sigma$
- Sample via annealed Langevin dynamics:
  - Start with $\mathbf{x}_0 \sim \mathcal{N}(0, I)$ (high noise)
  - Run Langevin dynamics with $\sigma_1$ (coarse structure)
  - Reduce to $\sigma_2$ and continue (refine details)
  - ... until $\sigma_L$ (fine details)
This gradually refines samples from coarse to fine, the same intuition that powers diffusion models (22-06).
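The sampling loop above can be sketched on a toy problem where the perturbed score is available in closed form. The example assumes the bimodal data $\frac{1}{2}\mathcal{N}(-2, 1) + \frac{1}{2}\mathcal{N}(2, 1)$, whose $\sigma$-perturbed version has components of variance $1 + \sigma^2$. The geometric noise schedule from 3.0 to 0.3 and the per-level step size $\epsilon \cdot \sigma_i^2 / \sigma_L^2$ are assumptions (a commonly used scaling), not something this section specifies.

```python
import math
import random

def perturbed_mixture_score(x, sigma, mode=2.0):
    # Exact score of 0.5 N(-mode, v) + 0.5 N(mode, v) with v = 1 + sigma^2
    v = 1.0 + sigma ** 2
    return (-x + mode * math.tanh(mode * x / v)) / v

def annealed_langevin(sigmas, base_step, steps_per_level, rng):
    x = rng.gauss(0.0, 1.0)  # start from N(0, 1)
    for sigma in sigmas:     # sigmas are ordered from largest to smallest
        step = base_step * (sigma / sigmas[-1]) ** 2  # assumed step-size scaling
        for _ in range(steps_per_level):
            z = rng.gauss(0.0, 1.0)
            x = x + 0.5 * step * perturbed_mixture_score(x, sigma) + math.sqrt(step) * z
    return x

rng = random.Random(0)
num_levels = 10
sigmas = [3.0 * (0.3 / 3.0) ** (i / (num_levels - 1)) for i in range(num_levels)]
samples = [annealed_langevin(sigmas, 0.02, 50, rng) for _ in range(500)]

fraction_right = sum(s > 0 for s in samples) / len(samples)
mean_abs = sum(abs(s) for s in samples) / len(samples)
print(fraction_right, mean_abs)
```

With these settings roughly half of the chains should end near each mode, and the average distance from the origin should be close to 2. In a learned setting the exact score would be replaced by $\mathbf{s}_\theta(\mathbf{x}, \sigma_i)$.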
The training objective with multiple noise levels (noise-conditioned score network, NCSN):
$$\mathcal{L}(\theta) = \frac{1}{L}\sum_{i=1}^{L} \sigma_i^2 \cdot \mathbb{E}_{\mathbf{x}, \boldsymbol{\varepsilon}}\left[\left\|\mathbf{s}_\theta(\mathbf{x} + \sigma_i\boldsymbol{\varepsilon}, \sigma_i) + \frac{\boldsymbol{\varepsilon}}{\sigma_i}\right\|^2\right]$$
The $\sigma_i^2$ weighting balances the different noise scales (since $\|\nabla \log p_\sigma\| \propto 1/\sigma$).
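To see the effect of the weighting term by term, one can substitute the noise-prediction parameterization $\mathbf{s}_\theta(\tilde{\mathbf{x}}, \sigma_i) = -\boldsymbol{\varepsilon}_\theta(\tilde{\mathbf{x}}, \sigma_i)/\sigma_i$ (assumed here to be applied per noise level, with $\tilde{\mathbf{x}} = \mathbf{x} + \sigma_i\boldsymbol{\varepsilon}$):

$$\sigma_i^2 \left\|\mathbf{s}_\theta(\tilde{\mathbf{x}}, \sigma_i) + \frac{\boldsymbol{\varepsilon}}{\sigma_i}\right\|^2 = \left\|\sigma_i\,\mathbf{s}_\theta(\tilde{\mathbf{x}}, \sigma_i) + \boldsymbol{\varepsilon}\right\|^2 = \left\|\boldsymbol{\varepsilon}_\theta(\tilde{\mathbf{x}}, \sigma_i) - \boldsymbol{\varepsilon}\right\|^2$$

Each weighted term is then a plain noise-prediction error whose target $\boldsymbol{\varepsilon}$ has the same scale at every noise level, so no single level dominates the sum.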
Pitfalls
| Pitfall | Why It Happens | Fix |
|---|---|---|
| Poor sample quality | Insufficient Langevin steps or wrong $\epsilon$ | Tune step size and number of steps; use annealed sampling |
| Score is undefined off-manifold | Data lies on low-dim manifold | Always use noise perturbation (DSM) |
| Mixing failure | Langevin dynamics gets stuck between modes | Annealed sampling: start at high noise, gradually reduce |
| Computational cost | Many Langevin steps needed per sample | Use fewer noise levels with learned step sizes; see diffusion models |
| Gradient explosion | Score grows unboundedly in low-density regions | Noise-conditioning regularizes; clip score magnitude |
Key Terms
- Langevin dynamics
Worked Examples
Example 1: Score of a Gaussian
For $p(\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, compute $\nabla_{\mathbf{x}} \log p(\mathbf{x})$.
Solution:
$$\log p(\mathbf{x}) = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$$
$$\nabla_{\mathbf{x}} \log p(\mathbf{x}) = -\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})$$
Interpretation: the score points toward the mean $\boldsymbol{\mu}$, with the direction and magnitude scaled by the inverse covariance. In high-variance directions, the score is weak; in low-variance directions, it's strong.
Answer: $\nabla_{\mathbf{x}} \log p(\mathbf{x}) = -\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})$. For $\Sigma = \sigma^2 I$: $\mathbf{s}(\mathbf{x}) = -(\mathbf{x} - \boldsymbol{\mu})/\sigma^2$, a simple linear function.
Example 2: DSM as Noise Prediction
For a 1D data point $x = 2$ perturbed by $\varepsilon = 0.8$ with $\sigma = 0.5$, the noisy sample is $\tilde{x} = 2 + 0.5(0.8) = 2.4$. If the score model predicts $\varepsilon_\theta(2.4) = 0.6$, compute the DSM loss and the implied score.
Solution:
DSM loss: $\mathcal{L} = \|0.6 - 0.8\|^2 = 0.04$
Implied score: $\mathbf{s}_\theta(\tilde{x}) = -\varepsilon_\theta/\sigma = -0.6/0.5 = -1.2$
True score: $\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}|x) = -\varepsilon/\sigma = -0.8/0.5 = -1.6$
The score points in the negative direction (toward the clean data at $x=2$, since $\tilde{x} = 2.4 > x$).
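The arithmetic in this example can be reproduced in a few lines. The variable names below are illustrative, and the prediction `eps_pred = 0.6` is simply the value given in the problem statement rather than the output of a trained model.

```python
x, eps, sigma = 2.0, 0.8, 0.5

x_tilde = x + sigma * eps            # noisy sample
eps_pred = 0.6                       # model's noise prediction, as given in the example

noise_loss = (eps_pred - eps) ** 2   # squared noise-prediction error
implied_score = -eps_pred / sigma    # score implied by the noise prediction
target_score = -eps / sigma          # conditional score used as the DSM target

print(x_tilde, noise_loss, implied_score, target_score)
```

Up to floating-point rounding, this gives $\tilde{x} = 2.4$, a loss of $0.04$, an implied score of $-1.2$, and a target score of $-1.6$.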
Answer: Loss = 0.04. The score model predicts $\mathbf{s}_\theta = -1.2$ (true: $-1.6$). Both point toward $x=2$, but the model underestimates the magnitude. Training would increase the score magnitude for this input.
Example 3: One Step of Langevin Dynamics
Data distribution is $p(x) = \frac{1}{2}\mathcal{N}(-2, 1) + \frac{1}{2}\mathcal{N}(2, 1)$. Starting at $x_0 = 0.5$ with step size $\epsilon = 0.1$ and noise $z = -0.3$, perform one Langevin step using the exact score.
Solution:
$$p(x) = \frac{1}{2\sqrt{2\pi}}[e^{-(x+2)^2/2} + e^{-(x-2)^2/2}]$$
Score: $s(x) = \nabla_x \log p(x) = \frac{\nabla_x p(x)}{p(x)}$
At $x = 0.5$: $e^{-(2.5)^2/2} = e^{-3.125} \approx 0.04394$ and $e^{-(-1.5)^2/2} = e^{-1.125} \approx 0.32465$.
$p(0.5) = \frac{1}{2\sqrt{2\pi}}(0.04394 + 0.32465) = \frac{0.36859}{5.0133} \approx 0.0735$
$p'(0.5) = \frac{1}{2\sqrt{2\pi}}[-(0.5+2)e^{-(0.5+2)^2/2} - (0.5-2)e^{-(0.5-2)^2/2}]$
$= \frac{1}{5.0133}[-2.5 \cdot 0.04394 - (-1.5) \cdot 0.32465] = \frac{1}{5.0133}[-0.10985 + 0.48698] = \frac{0.37713}{5.0133} \approx 0.0752$
$s(0.5) = p'(0.5)/p(0.5) = 0.37713/0.36859 \approx 1.0232$ (the common prefactor $\frac{1}{2\sqrt{2\pi}}$ cancels in the ratio)
Langevin step: $x_1 = 0.5 + \frac{0.1}{2}(1.0232) + \sqrt{0.1}(-0.3) = 0.5 + 0.0512 - 0.0949 = 0.4563$
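The same step can be computed programmatically as a cross-check. This sketch uses the exact mixture score $p'(x)/p(x)$ for the stated distribution; the function names are hypothetical.

```python
import math

def mixture_score(x):
    # Exact score of 0.5 N(-2, 1) + 0.5 N(2, 1): p'(x) / p(x).
    # The shared normalizing prefactor cancels in the ratio.
    left = math.exp(-((x + 2.0) ** 2) / 2.0)
    right = math.exp(-((x - 2.0) ** 2) / 2.0)
    return (-(x + 2.0) * left - (x - 2.0) * right) / (left + right)

def langevin_step(x, step_size, z):
    # One update: drift along the score plus scaled Gaussian noise z
    return x + 0.5 * step_size * mixture_score(x) + math.sqrt(step_size) * z

x1 = langevin_step(0.5, 0.1, -0.3)
print(mixture_score(0.5), x1)
```

The result should agree with the hand computation, $x_1 \approx 0.4563$.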
Answer: $x_1 = 0.4563$. The score at 0.5 is positive (pointing right toward the mode at $x=2$), but the random noise $z=-0.3$ pulled leftward and outweighed the drift, so the sample moved slightly left. Langevin dynamics is stochastic; individual steps can go either way.
Practice Problems
- Prove that $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ is invariant to scaling of $p(\mathbf{x})$: if $q(\mathbf{x}) = c \cdot p(\mathbf{x})$, then $\nabla_{\mathbf{x}} \log q(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$.
  Answer: $\nabla_{\mathbf{x}} \log(c \cdot p(\mathbf{x})) = \nabla_{\mathbf{x}}(\log c + \log p(\mathbf{x})) = 0 + \nabla_{\mathbf{x}} \log p(\mathbf{x})$. The constant disappears. This is why score models don't need to know the partition function: it's a multiplicative constant that disappears under the gradient of the log.
- Show that the score of a Gaussian mixture $p(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)$ is a weighted average of component scores.
  Answer: $\nabla_{\mathbf{x}} \log p(\mathbf{x}) = \frac{\sum_k \pi_k \nabla_{\mathbf{x}} \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\Sigma_k)}{\sum_k \pi_k \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\Sigma_k)}$. Using $\nabla_{\mathbf{x}} \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\Sigma_k) = -\Sigma_k^{-1}(\mathbf{x} - \boldsymbol{\mu}_k)\,\mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\Sigma_k)$, this equals $\sum_k w_k(\mathbf{x}) \cdot (-\Sigma_k^{-1}(\mathbf{x} - \boldsymbol{\mu}_k))$, where $w_k(\mathbf{x}) = \frac{\pi_k \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\Sigma_k)}{\sum_j \pi_j \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_j,\Sigma_j)}$ is the posterior probability of component $k$. The score points toward the most likely nearby mode.
- Derive why the score matching objective $\mathbb{E}_{p_{\text{data}}}[\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta) + \frac{1}{2}\|\mathbf{s}_\theta\|^2]$ is equivalent to $\frac{1}{2}\mathbb{E}_{p_{\text{data}}}[\|\mathbf{s}_\theta - \nabla\log p_{\text{data}}\|^2]$ up to a constant. (Hint: use integration by parts.)
  Answer: Expand $\frac{1}{2}\|\mathbf{s}_\theta - \nabla\log p\|^2 = \frac{1}{2}\|\mathbf{s}_\theta\|^2 - \mathbf{s}_\theta^T \nabla\log p + \frac{1}{2}\|\nabla\log p\|^2$. Since $p \nabla \log p = \nabla p$, integration by parts (assuming $p(\mathbf{x})\mathbf{s}_\theta(\mathbf{x}) \to 0$ as $\|\mathbf{x}\| \to \infty$) gives $\mathbb{E}_p[\mathbf{s}_\theta^T \nabla\log p] = \int \mathbf{s}_\theta(\mathbf{x})^T \nabla p(\mathbf{x}) d\mathbf{x} = -\int p(\mathbf{x}) \nabla \cdot \mathbf{s}_\theta(\mathbf{x}) d\mathbf{x} = -\mathbb{E}_p[\nabla \cdot \mathbf{s}_\theta]$. So the objective becomes $\frac{1}{2}\mathbb{E}_p[\|\mathbf{s}_\theta\|^2] + \mathbb{E}_p[\nabla \cdot \mathbf{s}_\theta] + \text{const}$, where the divergence is $\nabla \cdot \mathbf{s}_\theta = \text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta)$ and the constant $\frac{1}{2}\mathbb{E}_p[\|\nabla\log p\|^2]$ does not depend on $\theta$.
- For Langevin dynamics, explain why $\epsilon$ must be small. What happens if $\epsilon$ is too large?
  Answer: Langevin dynamics is derived from the continuous-time SDE $d\mathbf{x} = \frac{1}{2}\nabla\log p(\mathbf{x})dt + d\mathbf{W}$. The discrete update $\mathbf{x}_{t+1} = \mathbf{x}_t + \frac{\epsilon}{2}\mathbf{s}(\mathbf{x}_t) + \sqrt{\epsilon}\mathbf{z}_t$ is an Euler-Maruyama discretization. Large $\epsilon$ causes discretization error: the stationary distribution of the discrete chain may differ from $p(\mathbf{x})$, and the chain may overshoot modes or even diverge. Metropolis-Hastings corrections can mitigate this.
- Why can't we train a score model on clean data (no noise perturbation)? What goes wrong mathematically?
  Answer: Data lies on a low-dimensional manifold. For points exactly on the manifold, the score is undefined in the normal direction (density is a Dirac delta on the manifold). For points off the manifold, there are no training samples, so the objective, which is an expectation under $p_{\text{data}}$, places no constraint on the model there. Langevin sampling nevertheless starts in and passes through these off-manifold regions, where the learned score is arbitrary. Noise perturbation spreads the distribution, giving the model training signal everywhere in the ambient space.
Summary
Key takeaways:
- The score function $\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$ points toward higher-density regions
- Score is invariant to normalization constants, a key advantage over density estimation
- Denoising score matching (DSM) trains via $\|\mathbf{s}_\theta(\mathbf{x}+\sigma\boldsymbol{\varepsilon}) + \boldsymbol{\varepsilon}/\sigma\|^2$, which is equivalent to noise prediction
- Langevin dynamics $\mathbf{x}_{t+1} = \mathbf{x}_t + \frac{\epsilon}{2}\mathbf{s}(\mathbf{x}_t) + \sqrt{\epsilon}\mathbf{z}_t$ samples from the distribution
- Manifold hypothesis necessitates noise perturbation; annealed Langevin dynamics uses multiple noise scales
- Score-based models are the foundation for diffusion models (Phase 22-06, 22-07)
Quiz
- The score function $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ is independent of:
  - A) The mean of $p$
  - B) The variance of $p$
  - C) The normalization constant of $p$
  - D) The dimensionality of $\mathbf{x}$
  Correct: C)
  - If you chose C: $\log(c \cdot \tilde{p}) = \log c + \log \tilde{p}$; gradient eliminates $\log c$. This is the fundamental advantage of score-based modeling.
  - If you chose A, B: The score depends on both; e.g., for $\mathcal{N}(\mu, \sigma^2)$, score = $-(x-\mu)/\sigma^2$.
  - If you chose D: Dimensionality affects the vector dimension but not independence from constants.
- Denoising score matching trains the model to predict:
  - A) The clean data point $\mathbf{x}$
  - B) The added noise $\boldsymbol{\varepsilon}$
  - C) The density $p(\mathbf{x})$
  - D) The gradient of the model itself
  Correct: B)
  - If you chose B: DSM loss is $\|\boldsymbol{\varepsilon}_\theta(\mathbf{x}+\sigma\boldsymbol{\varepsilon}) - \boldsymbol{\varepsilon}\|^2$. The score is recovered as $-\boldsymbol{\varepsilon}_\theta/\sigma$.
  - If you chose A: Predicting $\mathbf{x}$ is the DAE objective, which is related but scaled differently.
  - If you chose C: Density estimation is harder; score matching avoids it.
  - If you chose D: That's meta-learning; DSM predicts the noise.
- Langevin dynamics generates samples by:
  - A) Optimization of the score function
  - B) Following the score with added noise for exploration
  - C) Rejection sampling from the score
  - D) Inverting the score function
  Correct: B)
  - If you chose B: Update is drift (deterministic score-following) + diffusion (Gaussian noise). The noise prevents getting stuck and ensures convergence to $p$.
  - If you chose A: Optimization finds modes, not samples from the distribution.
  - If you chose C: Rejection sampling requires the density, not just the score.
  - If you chose D: The score is not generally invertible.
- Why is annealed Langevin dynamics necessary?
  - A) To reduce computational cost of a single noise level
  - B) Because the score on clean data is ill-defined off the data manifold
  - C) To increase the dimensionality of samples
  - D) Single-scale Langevin dynamics always diverges
  Correct: B)
  - If you chose B: High noise provides global structure (well-defined score everywhere); low noise provides fine details. Annealing bridges the gap.
  - If you chose A: Annealing actually increases computation (multiple scales).
  - If you chose C: Dimensionality is fixed.
  - If you chose D: Single-scale can work for sufficiently smoothed distributions; annealing improves efficiency and quality.
- The DSM objective $\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) + \boldsymbol{\varepsilon}/\sigma\|^2$ uses the fact that:
  - A) $\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) = -\boldsymbol{\varepsilon}/\sigma$
  - B) $\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) = -\boldsymbol{\varepsilon}/\sigma$
  - C) $\nabla_{\mathbf{x}} \log p_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) = -\boldsymbol{\varepsilon}/\sigma$
  - D) $\nabla_{\tilde{\mathbf{x}}} \log p_{\text{data}}(\tilde{\mathbf{x}}) = -\boldsymbol{\varepsilon}/\sigma$
  Correct: B)
  - If you chose B: The conditional distribution $p_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 I)$ has known score $-(\tilde{\mathbf{x}}-\mathbf{x})/\sigma^2 = -\boldsymbol{\varepsilon}/\sigma$. DSM matches this conditional score.
  - If you chose A: The marginal score of $p_\sigma(\tilde{\mathbf{x}})$ is unknown; that's what we're trying to learn.
  - If you chose C: Wrong variable of differentiation.
  - If you chose D: We don't know the data score; that's the whole problem.
Next Steps
22-06 - Diffusion Models: Foundations. This is the natural evolution of score-based models: forward noising processes, reverse denoising, and the elegant training objective that powers modern generative AI.
Pitfalls
- Training a score model on clean data without noise perturbation: If data lies on a low-dimensional manifold, the score $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ is undefined in normal directions (density is zero off-manifold, log-density is $-\infty$, gradient is singular). Training on clean data gives no signal in the ambient directions and the model will fail to generate valid samples. Always use noise perturbation (denoising score matching) to spread the distribution and make the score well-defined everywhere.
- Using too few Langevin steps or an inappropriate step size: Langevin dynamics requires sufficient steps to mix to the stationary distribution. Too few steps produce samples that haven't traveled far from the initialization. Step size $\epsilon$ that's too large causes discretization error (the discrete chain's stationary distribution differs from $p$); too small requires impractically many steps. Tune via MCMC diagnostics (e.g., effective sample size, autocorrelation).
- Applying single-scale Langevin dynamics to multi-modal distributions: Without annealing, Langevin dynamics initialized from a single Gaussian can get trapped in local modes and fail to mix between well-separated modes. The chain's mixing time can be exponentially long in the distance between modes. Annealed Langevin dynamics (starting at high noise and gradually reducing) helps the chain explore globally before refining locally.
- Confusing score matching's trace-of-Jacobian objective with DSM: The original score matching objective $J(\theta) = \mathbb{E}[\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta) + \frac{1}{2}\|\mathbf{s}_\theta\|^2]$ requires computing the trace of the Jacobian, which takes $d$ backward passes. For $d = 10^5$ (a modest image), this is intractable. DSM replaces this with a denoising objective that only requires predicting the added noise, so the cost is a single forward and backward pass per sample regardless of $d$. Never implement the original (implicit) score matching for high-dimensional data.
Q6: Why can't the original (implicit) score matching objective be used for high-dimensional data like images?
- A) It requires computing $\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta)$, which needs $d$ backward passes, prohibitively expensive for images where $d \approx 10^5$.
- B) It's mathematically incorrect for dimensions larger than 100.
- C) It requires knowing the true data distribution.
- D) It produces biased gradient estimates.
Correct: A)