
22-05 - Score-Based Generative Models

Phase: 22 - Generative Models Mathematics

Subject: 22-05

Prerequisites: 22-01 - Autoencoders (DAE section), Phase 06 (Vector Calculus), Phase 13 (Probability)

Next subject: 22-06 - Diffusion Models: Foundations


Learning Objectives

By the end of this subject, you will be able to:

  1. Define the score function $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ and explain its role in generative modeling
  2. Derive the denoising score matching (DSM) objective from the score matching loss
  3. Explain Langevin dynamics as a sampling procedure using only the score function
  4. Analyze the manifold hypothesis and why it necessitates noise-perturbed score matching
  5. Describe annealed Langevin dynamics and connect it to diffusion models

Core Content

What Is the Score Function?

For a probability density $p(\mathbf{x})$, the score function is the gradient of the log-density:

$$\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$$

⚠️ CRITICAL: Why the Score? The score function has a remarkable property: it is independent of the normalization constant. If $p(\mathbf{x}) = \frac{1}{Z}\tilde{p}(\mathbf{x})$ where $Z = \int \tilde{p}(\mathbf{x}) d\mathbf{x}$ is the (usually intractable) partition function:

$$\nabla_{\mathbf{x}} \log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x}) - \nabla_{\mathbf{x}} \log Z = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x})$$

The $Z$ term vanishes because $Z$ doesn't depend on $\mathbf{x}$! This is the key insight: we can learn the score function without knowing the normalization constant, avoiding one of the hardest problems in probabilistic modeling.

Geometric interpretation: $\mathbf{s}(\mathbf{x})$ points in the direction of steepest increase in log-probability, i.e., toward higher-density regions. Following the score moves samples toward the modes of the distribution.
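As a quick numerical check of the normalization argument, the sketch below compares finite-difference scores of an unnormalized and a normalized 1D Gaussian log-density. The function names and the test point are illustrative assumptions, not part of the lesson.

```python
import math

def log_p_tilde(x):
    """Unnormalized log-density of a standard Gaussian: log exp(-x^2 / 2)."""
    return -0.5 * x * x

def log_p(x):
    """Normalized log-density: subtracts log Z, with Z = sqrt(2 * pi)."""
    return log_p_tilde(x) - 0.5 * math.log(2.0 * math.pi)

def numerical_score(log_density, x, h=1e-5):
    """Central finite-difference estimate of d/dx log_density(x)."""
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)

x = 1.3
score_unnormalized = numerical_score(log_p_tilde, x)
score_normalized = numerical_score(log_p, x)
# Both estimates should agree and be close to the analytic score -x.
print(score_unnormalized, score_normalized)
```

The constant $\log Z$ shifts the log-density up or down but never changes its slope, which is all the score measures.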

Score Matching

Given samples $\{\mathbf{x}^{(i)}\}$ from $p_{\text{data}}$, we want a model $\mathbf{s}_\theta(\mathbf{x})$ that approximates the true score $\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})$. The score matching objective minimizes the Fisher divergence:

$$J(\theta) = \frac{1}{2} \mathbb{E}_{p_{\text{data}}}\left[\|\mathbf{s}_\theta(\mathbf{x}) - \nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})\|^2\right]$$

But we don't know $\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})$! Hyvärinen (2005) showed that under mild regularity conditions, this is equivalent to:

$$J(\theta) = \mathbb{E}_{p_{\text{data}}}\left[\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})) + \frac{1}{2}\|\mathbf{s}_\theta(\mathbf{x})\|^2\right] + \text{const}$$

However, the trace of the Jacobian $\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta) = \sum_{i=1}^d \frac{\partial s_{\theta,i}}{\partial x_i}$ requires $d$ backward passes, which is computationally prohibitive for high-dimensional data like images.
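To see the implicit objective at work in the one case where the trace is trivial, here is a minimal 1D sketch. It assumes a hypothetical linear model $s_\theta(x) = -\theta x$ and synthetic data from $\mathcal{N}(0, 4)$, whose true score is $-x/4$; neither choice comes from the lesson.

```python
import random

random.seed(0)
data = [random.gauss(0.0, 2.0) for _ in range(20000)]  # samples with variance 4

def implicit_sm_loss(theta, samples):
    """Empirical E[ d/dx s_theta(x) + 0.5 * s_theta(x)^2 ] for s_theta(x) = -theta * x."""
    total = 0.0
    for x in samples:
        score = -theta * x
        dscore_dx = -theta  # in 1D the Jacobian trace is a single derivative
        total += dscore_dx + 0.5 * score * score
    return total / len(samples)

# The loss is quadratic in theta: -theta + 0.5 * theta^2 * mean(x^2),
# so its minimizer is available in closed form.
second_moment = sum(x * x for x in data) / len(data)
theta_star = 1.0 / second_moment
print(theta_star)  # expected near 1 / 4, the inverse variance
```

The objective never touches the true score, yet its minimizer recovers the inverse variance. In $d$ dimensions the `dscore_dx` term becomes the full Jacobian trace, which is where the cost comes from.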

⚠️ Denoising Score Matching (DSM)

Vincent (2011) introduced a brilliant solution: instead of matching the score of the data distribution directly, match the score of the noise-perturbed data distribution.

Perturb data with Gaussian noise: $\tilde{\mathbf{x}} = \mathbf{x} + \sigma \boldsymbol{\varepsilon}$, $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)$. The perturbed distribution is:

$$p_\sigma(\tilde{\mathbf{x}}) = \int p_{\text{data}}(\mathbf{x}) \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 I) d\mathbf{x}$$

The DSM objective is:

$$J_{\text{DSM}}(\theta) = \frac{1}{2} \mathbb{E}_{p_{\text{data}}(\mathbf{x})} \mathbb{E}_{\tilde{\mathbf{x}} \sim \mathcal{N}(\mathbf{x}, \sigma^2 I)}\left[\left\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}|\mathbf{x})\right\|^2\right]$$

And here's the crucial simplification: since $p_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 I)$:

$$\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) = \nabla_{\tilde{\mathbf{x}}}\left(-\frac{\|\tilde{\mathbf{x}} - \mathbf{x}\|^2}{2\sigma^2}\right) = -\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^2} = -\frac{\boldsymbol{\varepsilon}}{\sigma}$$

Therefore:

$$J_{\text{DSM}}(\theta) = \frac{1}{2} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}},\, \boldsymbol{\varepsilon} \sim \mathcal{N}(0,I)}\left[\left\|\mathbf{s}_\theta(\mathbf{x} + \sigma\boldsymbol{\varepsilon}) + \frac{\boldsymbol{\varepsilon}}{\sigma}\right\|^2\right]$$

Equivalently, we can parameterize $\mathbf{s}_\theta(\tilde{\mathbf{x}}) = -\frac{1}{\sigma}\boldsymbol{\varepsilon}_\theta(\tilde{\mathbf{x}})$, which turns the objective into $\frac{1}{2\sigma^2}\mathbb{E}\left[\|\boldsymbol{\varepsilon}_\theta(\mathbf{x} + \sigma\boldsymbol{\varepsilon}) - \boldsymbol{\varepsilon}\|^2\right]$. Dropping the constant factor, we train $\boldsymbol{\varepsilon}_\theta$ to predict the noise:

$$\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{x}, \boldsymbol{\varepsilon}}\left[\|\boldsymbol{\varepsilon}_\theta(\mathbf{x} + \sigma\boldsymbol{\varepsilon}) - \boldsymbol{\varepsilon}\|^2\right]$$

This is just a denoising objective: predict the noise that was added!

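Connection to DAE: Recall from 22-01 that a denoising autoencoder trained to reconstruct $\mathbf{x}$ from $\tilde{\mathbf{x}} = \mathbf{x} + \sigma\boldsymbol{\varepsilon}$ learns, at the optimum, $\hat{\mathbf{x}}(\tilde{\mathbf{x}}) \approx \tilde{\mathbf{x}} + \sigma^2 \mathbf{s}_\sigma(\tilde{\mathbf{x}})$, where $\mathbf{s}_\sigma = \nabla \log p_\sigma$ is the score of the perturbed distribution. Rearranging: $\mathbf{s}_\sigma(\tilde{\mathbf{x}}) \approx (\hat{\mathbf{x}} - \tilde{\mathbf{x}})/\sigma^2$. With the regression target $\hat{\mathbf{x}} = \mathbf{x}$, this becomes $(\mathbf{x} - \tilde{\mathbf{x}})/\sigma^2 = -\boldsymbol{\varepsilon}/\sigma$, exactly the DSM target.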
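A small experiment can make it plausible that regressing on the conditional target $-\boldsymbol{\varepsilon}/\sigma$ recovers the score of the perturbed marginal. The sketch below assumes toy 1D data from $\mathcal{N}(0, 1)$ and a hypothetical linear model $s_\theta(\tilde{x}) = -\theta\tilde{x}$, so the least-squares fit has a closed form; the perturbed marginal is $\mathcal{N}(0, 1 + \sigma^2)$ with score $-\tilde{x}/(1+\sigma^2)$.

```python
import random

random.seed(0)
sigma = 0.5
n = 50000

# Least squares for s_theta(x_tilde) = -theta * x_tilde against the DSM
# target -eps / sigma gives theta = E[x_tilde * eps / sigma] / E[x_tilde^2].
num = 0.0
den = 0.0
for _ in range(n):
    x = random.gauss(0.0, 1.0)      # clean sample, p_data = N(0, 1)
    eps = random.gauss(0.0, 1.0)    # standard Gaussian noise
    x_tilde = x + sigma * eps       # perturbed sample
    num += x_tilde * eps / sigma
    den += x_tilde * x_tilde
theta_dsm = num / den

# Slope of the true marginal score of N(0, 1 + sigma^2).
theta_marginal = 1.0 / (1.0 + sigma ** 2)
print(theta_dsm, theta_marginal)  # the two should be close
```

Each individual target $-\varepsilon/\sigma$ is noisy, but its average at a given $\tilde{x}$ is the marginal score, which is why the fit lands near $1/(1+\sigma^2)$.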

Langevin Dynamics for Sampling

Once we have $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})$, how do we generate samples? With Langevin dynamics, a Markov chain Monte Carlo method:

$$\mathbf{x}_{t+1} = \mathbf{x}_t + \frac{\epsilon}{2} \mathbf{s}_\theta(\mathbf{x}_t) + \sqrt{\epsilon} \cdot \mathbf{z}_t, \quad \mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, I)$$

where $\epsilon$ is the step size.

Why this works: Langevin dynamics is a discretization of the stochastic differential equation:

$$d\mathbf{x}_t = \frac{1}{2}\nabla_{\mathbf{x}} \log p(\mathbf{x}_t)\, dt + d\mathbf{W}_t$$

where $\mathbf{W}_t$ is a Wiener process (Brownian motion). This SDE has $p(\mathbf{x})$ as its stationary distribution, so running it long enough produces samples from $p$.

Intuition: In the discrete update, the drift term $\frac{\epsilon}{2}\mathbf{s}_\theta(\mathbf{x}_t)$ pushes toward high-density regions. The noise term $\sqrt{\epsilon}\,\mathbf{z}_t$ adds exploration, preventing the chain from getting stuck in local modes.
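The update rule can be sketched in a few lines. This example assumes a 1D target $\mathcal{N}(3, 1)$ with its exact score $-(x - 3)$ standing in for a trained model; the step size, chain length, and helper names are illustrative choices.

```python
import math
import random

def langevin_chain(score, x0, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + (eps / 2) * score(x) + sqrt(eps) * z."""
    x = x0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * z
    return x

mu = 3.0

def gaussian_score(x):
    """Exact score of N(mu, 1)."""
    return -(x - mu)

rng = random.Random(0)
# Run many independent chains, all started far from the mode at x = 0.
samples = [langevin_chain(gaussian_score, 0.0, 0.05, 300, rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)  # expected near 3 and near 1
```

Only the score is queried; the density itself never appears. The sample variance comes out slightly above 1 because of the finite step size.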

⚠️ The Manifold Hypothesis Problem

Real data (images, audio, text) lies on low-dimensional manifolds embedded in high-dimensional space. For example, $256 \times 256$ grayscale images live in $\mathbb{R}^{65536}$, but the manifold of natural images has much lower intrinsic dimension.

On the data manifold, the score $\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})$ is undefined in directions perpendicular to the manifold (density is zero off-manifold → log-density is $-\infty$ → gradient is undefined). Score matching on clean data fails.

Solution: Noise perturbation. Adding Gaussian noise spreads the distribution to the ambient space, making the score well-defined everywhere. But this introduces a trade-off: a large $\sigma$ covers the ambient space well but blurs the data distribution, while a small $\sigma$ stays close to the data but leaves low-density regions with little training signal.

Annealed Langevin Dynamics

The solution: use multiple noise levels $\sigma_1 > \sigma_2 > \cdots > \sigma_L$:

  1. Train separate score models $\mathbf{s}_\theta(\mathbf{x}, \sigma_i)$ for each noise level, or one model conditioned on $\sigma$
  2. Sample via annealed Langevin dynamics:
     - Start with $\mathbf{x}_0 \sim \mathcal{N}(0, I)$ (high noise)
     - Run Langevin dynamics with $\sigma_1$ (coarse structure)
     - Reduce to $\sigma_2$ and continue (refine details)
     - ... until $\sigma_L$ (fine details)

This gradually refines samples from coarse to fine, the same intuition that powers diffusion models (22-06).
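The sampling loop above can be sketched on a toy problem where the perturbed score is known in closed form. The assumptions here are all illustrative: the data are two point masses at $\pm 2$, so $p_\sigma = \frac{1}{2}\mathcal{N}(-2, \sigma^2) + \frac{1}{2}\mathcal{N}(2, \sigma^2)$; the noise levels form a geometric schedule; and the step size at each level is set proportional to $\sigma_i^2$, one common heuristic rather than anything prescribed by the lesson.

```python
import math
import random

def perturbed_score(x, sigma):
    """Exact score of 0.5 * N(-2, sigma^2) + 0.5 * N(2, sigma^2)."""
    return (-x + 2.0 * math.tanh(2.0 * x / sigma ** 2)) / sigma ** 2

def annealed_langevin(score, sigmas, steps_per_level, ratio, rng):
    """Run Langevin dynamics at each noise level, from largest to smallest sigma."""
    x = rng.gauss(0.0, 1.0)  # x_0 ~ N(0, 1)
    for sigma in sigmas:
        step = ratio * sigma ** 2  # assumed rule: step size shrinks with the noise level
        for _ in range(steps_per_level):
            z = rng.gauss(0.0, 1.0)
            x = x + 0.5 * step * score(x, sigma) + math.sqrt(step) * z
    return x

n_levels = 10
sigma_max, sigma_min = 3.0, 0.1
# Geometric schedule sigma_1 > sigma_2 > ... > sigma_L.
sigmas = [sigma_max * (sigma_min / sigma_max) ** (i / (n_levels - 1)) for i in range(n_levels)]

rng = random.Random(0)
samples = [annealed_langevin(perturbed_score, sigmas, 50, 0.1, rng) for _ in range(500)]
frac_right = sum(s > 0 for s in samples) / len(samples)
mean_dist = sum(abs(abs(s) - 2.0) for s in samples) / len(samples)
print(frac_right, mean_dist)  # roughly half the chains per mode, all close to +-2
```

At $\sigma = 3$ the two modes merge into one broad bump, so chains move freely; as $\sigma$ shrinks the bump splits and each chain commits to a mode.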

The training objective with multiple noise levels (noise-conditioned score network, NCSN):

$$\mathcal{L}(\theta) = \frac{1}{L}\sum_{i=1}^{L} \sigma_i^2 \cdot \mathbb{E}_{\mathbf{x}, \boldsymbol{\varepsilon}}\left[\left\|\mathbf{s}_\theta(\mathbf{x} + \sigma_i\boldsymbol{\varepsilon}, \sigma_i) + \frac{\boldsymbol{\varepsilon}}{\sigma_i}\right\|^2\right]$$

The $\sigma_i^2$ weighting balances the different noise scales (since $\|\nabla \log p_\sigma\| \propto 1/\sigma$).
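To illustrate why the weighting matters, the sketch below evaluates the per-level loss of a deliberately trivial model, $s(x) = 0$, with and without the $\sigma^2$ factor. The zero model and the function name are assumptions made only to isolate the scale of the target $-\varepsilon/\sigma$.

```python
import random

random.seed(0)
eps_samples = [random.gauss(0.0, 1.0) for _ in range(20000)]

def level_loss(sigma, weighted):
    """1D DSM loss of the zero model s(x) = 0 at one noise level."""
    raw = sum((0.0 + e / sigma) ** 2 for e in eps_samples) / len(eps_samples)
    return sigma ** 2 * raw if weighted else raw

for sigma in (1.0, 0.1, 0.01):
    # Unweighted loss grows like 1 / sigma^2; weighted loss stays on one scale.
    print(sigma, level_loss(sigma, False), level_loss(sigma, True))
```

Without the weighting, the smallest noise level would dominate the sum by several orders of magnitude.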

Pitfalls

| Pitfall | Why It Happens | Fix |
| --- | --- | --- |
| Poor sample quality | Insufficient Langevin steps or wrong $\epsilon$ | Tune step size and number of steps; use annealed sampling |
| Score is undefined off-manifold | Data lies on low-dim manifold | Always use noise perturbation (DSM) |
| Mixing failure | Langevin dynamics gets stuck between modes | Annealed sampling: start at high noise, gradually reduce |
| Computational cost | Many Langevin steps needed per sample | Use fewer noise levels with learned step sizes; see diffusion models |
| Gradient explosion | Score grows unboundedly in low-density regions | Noise-conditioning regularizes; clip score magnitude |


Key Terms

Worked Examples

Example 1: Score of a Gaussian

For $p(\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, compute $\nabla_{\mathbf{x}} \log p(\mathbf{x})$.

Solution:

$$\log p(\mathbf{x}) = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$$

$$\nabla_{\mathbf{x}} \log p(\mathbf{x}) = -\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})$$

Interpretation: the score points toward the mean $\boldsymbol{\mu}$, with the direction and magnitude scaled by the inverse covariance. In high-variance directions, the score is weak; in low-variance directions, it's strong.

Answer: $\nabla_{\mathbf{x}} \log p(\mathbf{x}) = -\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})$. For $\Sigma = \sigma^2 I$: $\mathbf{s}(\mathbf{x}) = -(\mathbf{x} - \boldsymbol{\mu})/\sigma^2$, a simple linear function.
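The formula is easy to evaluate directly. The following sketch uses a hypothetical 2D Gaussian with a diagonal covariance (values chosen for illustration) to show how the inverse covariance rescales the score in each direction.

```python
def gaussian_score_2d(x, mu, cov):
    """Score -Sigma^{-1} (x - mu) for a 2D Gaussian with covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    inv = [[c / det, -b / det], [-b / det, a / det]]  # closed-form 2x2 inverse
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    return [-(inv[0][0] * d0 + inv[0][1] * d1),
            -(inv[1][0] * d0 + inv[1][1] * d1)]

mu = [1.0, -1.0]
cov = [[4.0, 0.0], [0.0, 0.25]]  # high variance along axis 0, low along axis 1
score = gaussian_score_2d([2.0, 0.0], mu, cov)
print(score)  # [-0.25, -4.0]
```

The point is one unit away from the mean along both axes, yet the pull back toward $\boldsymbol{\mu}$ is 16 times stronger along the low-variance axis.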

Example 2: DSM as Noise Prediction

For a 1D data point $x = 2$ perturbed by $\varepsilon = 0.8$ with $\sigma = 0.5$, the noisy sample is $\tilde{x} = 2 + 0.5(0.8) = 2.4$. If the score model predicts $\varepsilon_\theta(2.4) = 0.6$, compute the DSM loss and the implied score.

Solution:

DSM loss: $\mathcal{L} = \|0.6 - 0.8\|^2 = 0.04$

Implied score: $\mathbf{s}_\theta(\tilde{x}) = -\varepsilon_\theta/\sigma = -0.6/0.5 = -1.2$

True score: $\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}|x) = -\varepsilon/\sigma = -0.8/0.5 = -1.6$

The score points in the negative direction (toward the clean data at $x=2$, since $\tilde{x} = 2.4 > x$).

Answer: Loss = 0.04. The score model predicts $\mathbf{s}_\theta = -1.2$ (true: -1.6). Both point toward $x=2$, but the model underestimates the magnitude. Training would increase the score magnitude for this input.
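The same arithmetic, written out as a short script for checking; the variable names are illustrative.

```python
x, eps, sigma = 2.0, 0.8, 0.5
x_tilde = x + sigma * eps            # noisy sample
eps_pred = 0.6                       # model's noise prediction at x_tilde

loss = (eps_pred - eps) ** 2         # noise-prediction loss
implied_score = -eps_pred / sigma    # score implied by the prediction
target_score = -(x_tilde - x) / sigma ** 2  # conditional score, equal to -eps / sigma

print(x_tilde, loss, implied_score, target_score)
```

Up to floating-point rounding this reproduces $\tilde{x} = 2.4$, a loss of $0.04$, and scores of $-1.2$ and $-1.6$.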

Example 3: One Step of Langevin Dynamics

Data distribution is $p(x) = \frac{1}{2}\mathcal{N}(-2, 1) + \frac{1}{2}\mathcal{N}(2, 1)$. Starting at $x_0 = 0.5$ with step size $\epsilon = 0.1$ and noise $z = -0.3$, perform one Langevin step using the exact score.

Solution:

$$p(x) = \frac{1}{2\sqrt{2\pi}}[e^{-(x+2)^2/2} + e^{-(x-2)^2/2}]$$

Score: $s(x) = \nabla_x \log p(x) = \frac{\nabla_x p(x)}{p(x)}$

At $x = 0.5$:

$e^{-(2.5)^2/2} = e^{-3.125} = 0.0439$

$e^{-(-1.5)^2/2} = e^{-1.125} = 0.3247$

Note that $2\sqrt{2\pi} = 5.0133$.

$p(0.5) = \frac{1}{2\sqrt{2\pi}}(0.0439 + 0.3247) = \frac{0.3686}{5.0133} = 0.0735$

$p'(0.5) = \frac{1}{2\sqrt{2\pi}}[-(0.5+2)e^{-(0.5+2)^2/2} - (0.5-2)e^{-(0.5-2)^2/2}]$

$= \frac{1}{5.0133}[-2.5 \cdot 0.0439 - (-1.5) \cdot 0.3247] \approx \frac{1}{5.0133}[-0.1098 + 0.4870] \approx \frac{0.3771}{5.0133} = 0.0752$

$s(0.5) = p'(0.5)/p(0.5) \approx 1.0232$ (using unrounded intermediate values; the normalizing factor cancels in the ratio)

Langevin step: $x_1 = 0.5 + \frac{0.1}{2}(1.0232) + \sqrt{0.1}(-0.3) = 0.5 + 0.0512 - 0.0949 = 0.4563$

Answer: $x_1 = 0.4563$. The score at 0.5 is positive (pointing right toward the mode at $x=2$), but the noise term $\sqrt{0.1}\,z \approx -0.0949$ outweighs the drift of about $+0.051$, so this step moves left. Langevin dynamics is stochastic: individual steps can go either way.
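The step can be reproduced numerically. This sketch computes the mixture score as $p'(x)/p(x)$ and applies one Langevin update with the values from the example; the function name is an illustrative choice.

```python
import math

def mixture_score(x):
    """Score of 0.5 * N(-2, 1) + 0.5 * N(2, 1), computed as p'(x) / p(x)."""
    left = math.exp(-0.5 * (x + 2.0) ** 2)
    right = math.exp(-0.5 * (x - 2.0) ** 2)
    # The shared factor 1 / (2 * sqrt(2 * pi)) cancels in the ratio.
    return (-(x + 2.0) * left - (x - 2.0) * right) / (left + right)

x0, step_size, z = 0.5, 0.1, -0.3
s0 = mixture_score(x0)
x1 = x0 + 0.5 * step_size * s0 + math.sqrt(step_size) * z
print(s0, x1)  # roughly 1.023 and 0.456
```

Because the normalizing constant cancels, only the two exponentials are needed to evaluate the score.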

Practice Problems

  1. Prove that $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ is invariant to scaling of $p(\mathbf{x})$: if $q(\mathbf{x}) = c \cdot p(\mathbf{x})$, then $\nabla_{\mathbf{x}} \log q(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$.

    Answer: $\nabla_{\mathbf{x}} \log(c \cdot p(\mathbf{x})) = \nabla_{\mathbf{x}}(\log c + \log p(\mathbf{x})) = 0 + \nabla_{\mathbf{x}} \log p(\mathbf{x})$. The constant disappears. This is why score models don't need to know the partition function: it's a multiplicative constant that disappears under the gradient of the log.

  2. Show that the score of a Gaussian mixture $p(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)$ is a weighted average of component scores.

    Answer: $\nabla_{\mathbf{x}} \log p(\mathbf{x}) = \frac{\sum_k \pi_k \nabla_{\mathbf{x}} \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\Sigma_k)}{\sum_k \pi_k \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\Sigma_k)}$ $= \sum_k w_k(\mathbf{x}) \cdot (-\Sigma_k^{-1}(\mathbf{x} - \boldsymbol{\mu}_k))$ where $w_k(\mathbf{x}) = \frac{\pi_k \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\Sigma_k)}{\sum_j \pi_j \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_j,\Sigma_j)}$ is the posterior probability of component $k$. The score points toward the most likely nearby mode.

  3. Derive why the score matching objective $\mathbb{E}_{p_{\text{data}}}[\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta) + \frac{1}{2}\|\mathbf{s}_\theta\|^2]$ is equivalent to $\frac{1}{2}\mathbb{E}_{p_{\text{data}}}[\|\mathbf{s}_\theta - \nabla\log p_{\text{data}}\|^2]$ up to a constant. (Hint: use integration by parts.)

    Answer: Expand $\frac{1}{2}\|\mathbf{s}_\theta - \nabla\log p\|^2 = \frac{1}{2}\|\mathbf{s}_\theta\|^2 - \mathbf{s}_\theta^T \nabla\log p + \frac{1}{2}\|\nabla\log p\|^2$. $\mathbb{E}_p[\mathbf{s}_\theta^T \nabla\log p] = \int \mathbf{s}_\theta(\mathbf{x})^T \nabla p(\mathbf{x}) d\mathbf{x} = -\int p(\mathbf{x}) \nabla \cdot \mathbf{s}_\theta(\mathbf{x}) d\mathbf{x} = -\mathbb{E}_p[\nabla \cdot \mathbf{s}_\theta]$. So the objective becomes $\frac{1}{2}\mathbb{E}_p[\|\mathbf{s}_\theta\|^2] + \mathbb{E}_p[\nabla \cdot \mathbf{s}_\theta] + \text{const}$. The divergence $\nabla \cdot \mathbf{s}_\theta = \text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta)$.

  4. For Langevin dynamics, explain why $\epsilon$ must be small. What happens if $\epsilon$ is too large?

    Answer: Langevin dynamics is derived from the continuous-time SDE $d\mathbf{x} = \frac{1}{2}\nabla\log p(\mathbf{x})dt + d\mathbf{W}$. The discrete update $\mathbf{x}_{t+1} = \mathbf{x}_t + \frac{\epsilon}{2}\mathbf{s}(\mathbf{x}_t) + \sqrt{\epsilon}\mathbf{z}_t$ is an Euler-Maruyama discretization. Large $\epsilon$ causes discretization error: the stationary distribution of the discrete chain may differ from $p(\mathbf{x})$, and the chain may overshoot modes or even diverge. Metropolis-Hastings corrections can mitigate this.

  5. Why can't we train a score model on clean data (no noise perturbation)? What goes wrong mathematically?

    Answer: Data lies on a low-dimensional manifold. For points exactly on the manifold, the score is undefined in the normal direction (density is a Dirac delta on the manifold). For points off the manifold, there are no training samples, so the objective (an expectation over $p_{\text{data}}$) places no constraint on the model there, even though sampling starts in and passes through those regions. Noise perturbation spreads the distribution, giving the model training signal everywhere in the ambient space.
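For problem 4, the step-size bias can be made concrete in one case where it is exactly solvable. Assuming a target of $\mathcal{N}(0, 1)$ with score $-x$, the discrete update becomes $x_{t+1} = (1 - \epsilon/2)x_t + \sqrt{\epsilon}\,z_t$, whose stationary variance $v$ satisfies $v = (1 - \epsilon/2)^2 v + \epsilon$, giving $v = 1/(1 - \epsilon/4)$ for $0 < \epsilon < 4$. The helper below is a sketch under that assumption only.

```python
def stationary_variance(step_size):
    """Stationary variance of the discrete Langevin chain targeting N(0, 1)."""
    if not 0.0 < step_size < 4.0:
        raise ValueError("the chain has no stationary distribution for this step size")
    return 1.0 / (1.0 - step_size / 4.0)

for step_size in (0.01, 0.1, 1.0, 3.0):
    # The true variance is 1; the excess is pure discretization bias.
    print(step_size, stationary_variance(step_size))
```

The bias vanishes as $\epsilon \to 0$, grows quickly for larger steps, and at $\epsilon \geq 4$ the chain no longer has a stationary distribution at all.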


Summary

Key takeaways:


Quiz

  1. The score function $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ is independent of:
     - A) The mean of $p$
     - B) The variance of $p$
     - C) The normalization constant of $p$
     - D) The dimensionality of $\mathbf{x}$

     Correct: C)
     - If you chose C: $\log(c \cdot \tilde{p}) = \log c + \log \tilde{p}$; gradient eliminates $\log c$. This is the fundamental advantage of score-based modeling.
     - If you chose A, B: The score depends on both, e.g., for $\mathcal{N}(\mu, \sigma^2)$, score = $-(x-\mu)/\sigma^2$.
     - If you chose D: Dimensionality affects the vector dimension but not independence from constants.

  2. Denoising score matching trains the model to predict:
     - A) The clean data point $\mathbf{x}$
     - B) The added noise $\boldsymbol{\varepsilon}$
     - C) The density $p(\mathbf{x})$
     - D) The gradient of the model itself

     Correct: B)
     - If you chose B: DSM loss is $\|\boldsymbol{\varepsilon}_\theta(\mathbf{x}+\sigma\boldsymbol{\varepsilon}) - \boldsymbol{\varepsilon}\|^2$. The score is recovered as $-\boldsymbol{\varepsilon}_\theta/\sigma$.
     - If you chose A: Predicting $\mathbf{x}$ is the DAE objective, which is related but uses a different scaling.
     - If you chose C: Density estimation is harder; score matching avoids it.
     - If you chose D: That's meta-learning; DSM predicts the noise.

  3. Langevin dynamics generates samples by:
     - A) Optimization of the score function
     - B) Following the score with added noise for exploration
     - C) Rejection sampling from the score
     - D) Inverting the score function

     Correct: B)
     - If you chose B: Update is drift (deterministic score-following) + diffusion (Gaussian noise). The noise prevents getting stuck and ensures convergence to $p$.
     - If you chose A: Optimization finds modes, not samples from the distribution.
     - If you chose C: Rejection sampling requires the density, not just the score.
     - If you chose D: The score is not generally invertible.

  4. Why is annealed Langevin dynamics necessary?
     - A) To reduce computational cost of a single noise level
     - B) Because the score on clean data is ill-defined off the data manifold
     - C) To increase the dimensionality of samples
     - D) Single-scale Langevin dynamics always diverges

     Correct: B)
     - If you chose B: High noise provides global structure (well-defined score everywhere); low noise provides fine details. Annealing bridges the gap.
     - If you chose A: Annealing actually increases computation (multiple scales).
     - If you chose C: Dimensionality is fixed.
     - If you chose D: Single-scale can work for sufficiently smoothed distributions; annealing improves efficiency and quality.

  5. The DSM objective $\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) + \boldsymbol{\varepsilon}/\sigma\|^2$ uses the fact that:
     - A) $\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) = -\boldsymbol{\varepsilon}/\sigma$
     - B) $\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) = -\boldsymbol{\varepsilon}/\sigma$
     - C) $\nabla_{\mathbf{x}} \log p_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) = -\boldsymbol{\varepsilon}/\sigma$
     - D) $\nabla_{\tilde{\mathbf{x}}} \log p_{\text{data}}(\tilde{\mathbf{x}}) = -\boldsymbol{\varepsilon}/\sigma$

     Correct: B)
     - If you chose B: The conditional distribution $p_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 I)$ has known score $-(\tilde{\mathbf{x}}-\mathbf{x})/\sigma^2 = -\boldsymbol{\varepsilon}/\sigma$. DSM matches this conditional score.
     - If you chose A: The marginal score of $p_\sigma(\tilde{\mathbf{x}})$ is unknown; that's what we're trying to learn.
     - If you chose C: Wrong variable of differentiation.
     - If you chose D: We don't know the data score; that's the whole problem.

Next Steps

22-06 - Diffusion Models: Foundations: the natural evolution of score-based models, covering forward noising processes, reverse denoising, and the elegant training objective that powers modern generative AI.


Pitfalls

  1. Training a score model on clean data without noise perturbation: If data lies on a low-dimensional manifold, the score $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ is undefined in normal directions (density is zero off-manifold, log-density is $-\infty$, gradient is singular). Training on clean data gives no signal in the ambient directions and the model will fail to generate valid samples. Always use noise perturbation (denoising score matching) to spread the distribution and make the score well-defined everywhere.

  2. Using too few Langevin steps or an inappropriate step size: Langevin dynamics requires sufficient steps to mix to the stationary distribution. Too few steps produce samples that haven't traveled far from the initialization. Step size $\epsilon$ that's too large causes discretization error (the discrete chain's stationary distribution differs from $p$); too small requires impractically many steps. Tune via MCMC diagnostics (e.g., effective sample size, autocorrelation).

  3. Applying single-scale Langevin dynamics to multi-modal distributions: Without annealing, Langevin dynamics initialized from a single Gaussian can get trapped in local modes and fail to mix between well-separated modes. The chain's mixing time can be exponentially long in the distance between modes. Annealed Langevin dynamics (starting at high noise and gradually reducing) helps the chain explore globally before refining locally.

  4. Confusing score matching's trace-of-Jacobian objective with DSM: The original score matching objective $J(\theta) = \mathbb{E}[\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta) + \frac{1}{2}\|\mathbf{s}_\theta\|^2]$ requires computing the trace of the Jacobian, which takes $d$ backward passes. For $d = 10^5$ (a modest image), this is intractable. DSM replaces this with a denoising objective that only requires predicting the added noise, so it needs $O(1)$ backward passes per sample. Avoid the original (implicit) score matching for high-dimensional data.




Q6: Why can't the original (implicit) score matching objective be used for high-dimensional data like images?

  - A) It requires computing $\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta)$, which needs $d$ backward passes, prohibitively expensive for images where $d \approx 10^5$.
  - B) It's mathematically incorrect for dimensions larger than 100.
  - C) It requires knowing the true data distribution.
  - D) It produces biased gradient estimates.

Answer and Explanations

**Correct: A)** The implicit score matching objective $J = \mathbb{E}[\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta) + \frac{1}{2}\|\mathbf{s}_\theta\|^2]$ requires computing the Jacobian trace $\sum_{i=1}^d \partial s_{\theta,i}/\partial x_i$. For each dimension $i$, we need to compute a vector-Jacobian product, which requires a separate backward pass per dimension. With $d \approx 10^5$ for even small images, this is $10^5$ times more expensive than a single forward/backward pass. DSM avoids this entirely.

  - B) The method is mathematically correct for any dimension; it's just computationally impractical.
  - C) The whole point is that integration by parts eliminates the need to know $p_{\text{data}}$.
  - D) The estimates are unbiased; the issue is computational cost, not bias.