
22-09 — Energy-Based Models

Phase: 22 (Generative Models Mathematics)
Subject: 22-09
Prerequisites: 22-01 (Autoencoders), Phase 13 (Probability: distributions, KL divergence, Monte Carlo), Phase 14 (Optimization: gradient descent, MCMC basics)
Next subject: 22-10 (Evaluation of Generative Models)

Learning Objectives

By the end of this subject, you will be able to:

Formulate an energy-based model using the Boltzmann (Gibbs) distribution and explain the role of the partition function
Derive the contrastive divergence (CD) training algorithm and explain why it approximates the log-likelihood gradient
Derive noise contrastive estimation (NCE) as an alternative to MCMC-based training
Compare EBMs to other generative models (VAEs, GANs, diffusion) in terms of training difficulty, sampling, and flexibility
Understand the connection between EBMs, score matching, and diffusion models

Core Content

The Energy-Based Formulation

An energy-based model (EBM) defines a probability distribution over $\mathbf{x} \in \mathbb{R}^d$ via an energy function $E_\theta(\mathbf{x}) : \mathbb{R}^d \to \mathbb{R}$:

$$p_\theta(\mathbf{x}) = \frac{\exp(-E_\theta(\mathbf{x}))}{Z(\theta)}$$

where the partition function (normalizing constant) is:

$$Z(\theta) = \int \exp(-E_\theta(\mathbf{x})) \, d\mathbf{x}$$

This is the Boltzmann (Gibbs) distribution from statistical mechanics. Lower energy → higher probability. The energy function is typically a neural network $E_\theta : \mathbb{R}^d \to \mathbb{R}$.

⚠️ CRITICAL: $Z(\theta)$ is generally intractable. Computing it requires integrating over the entire input space $\mathbb{R}^d$; for a neural-network energy there is no closed form, and the cost of grid-based numerical integration grows exponentially in $d$. This is the central challenge of EBMs: we can define the distribution, but we can't directly normalize it.
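To make the cost concrete, here is a minimal sketch that approximates $Z$ by brute-force quadrature for a 1D quadratic energy, where the exact value $\sqrt{2\pi}$ is known. The function names, integration bounds, and grid size are illustrative assumptions, not part of any library.

```python
import math

def energy(x):
    # Illustrative 1D energy: E(x) = x^2 / 2 (a standard Gaussian up to normalization).
    return 0.5 * x * x

def partition_function_1d(energy_fn, lo=-10.0, hi=10.0, n=2001):
    # Trapezoidal rule on a uniform grid over [lo, hi].
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * math.exp(-energy_fn(x))
    return total * h

def grid_evaluations(n_per_axis, d):
    # Energy evaluations needed by a tensor-product grid in d dimensions.
    return n_per_axis ** d

Z = partition_function_1d(energy)
print(Z, math.sqrt(2 * math.pi))          # numerical vs. exact value
print(grid_evaluations(100, 1), grid_evaluations(100, 10))
```

The same 100-points-per-axis grid that is trivial in one dimension already needs $100^{10}$ evaluations in ten dimensions, which is why direct normalization is not an option for image-sized inputs.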

Why EBMs Are Powerful

Despite the intractability of $Z(\theta)$, EBMs offer unique advantages:

Unnormalized models are flexible: Any scalar function $E_\theta(\mathbf{x})$ for which $\exp(-E_\theta(\mathbf{x}))$ is integrable defines a valid (unnormalized) density. There are no architectural constraints such as invertibility (flows) or a fixed latent dimensionality (VAEs).
Compositional: Given two independent EBMs $p_1(\mathbf{x}) \propto \exp(-E_1(\mathbf{x}))$ and $p_2(\mathbf{x}) \propto \exp(-E_2(\mathbf{x}))$, we can combine them: $p_{\text{combined}}(\mathbf{x}) \propto \exp(-(E_1(\mathbf{x}) + E_2(\mathbf{x})))$.
Natural for structured prediction: The energy can capture arbitrary constraints and preferences.
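The compositional property can be sketched in a few lines: adding energies multiplies the unnormalized densities. The helper names `gaussian_energy` and `compose` are hypothetical, and the two quadratic energies are chosen only because the combined minimum is easy to verify by hand.

```python
def gaussian_energy(mu, sigma):
    # Returns the function E(x) = (x - mu)^2 / (2 sigma^2).
    def energy(x):
        return (x - mu) ** 2 / (2.0 * sigma ** 2)
    return energy

def compose(*energies):
    # Adding energies corresponds to multiplying exp(-E_1) * exp(-E_2) * ...
    def combined(x):
        return sum(e(x) for e in energies)
    return combined

e1 = gaussian_energy(mu=0.0, sigma=1.0)
e2 = gaussian_energy(mu=2.0, sigma=1.0)
e12 = compose(e1, e2)

# Locate the lowest-energy (highest-probability) point on a grid.
grid = [-5.0 + 0.001 * i for i in range(10001)]
x_star = min(grid, key=e12)
print(x_star)  # close to 1.0, midway between the two equally confident experts
```

Each expert alone prefers its own mean; the combined model concentrates where both energies are low at the same time.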

Training EBMs: The Log-Likelihood Gradient

The log-likelihood for a datapoint $\mathbf{x}$ is:

$$\log p_\theta(\mathbf{x}) = -E_\theta(\mathbf{x}) - \log Z(\theta)$$

The gradient with respect to parameters $\theta$:

$$\nabla_\theta \log p_\theta(\mathbf{x}) = -\nabla_\theta E_\theta(\mathbf{x}) - \nabla_\theta \log Z(\theta)$$

Now, the partition function gradient:

$$\nabla_\theta \log Z(\theta) = \frac{1}{Z(\theta)}\nabla_\theta Z(\theta) = \frac{1}{Z(\theta)}\int \nabla_\theta \exp(-E_\theta(\mathbf{x}')) d\mathbf{x}'$$

$$= \frac{1}{Z(\theta)}\int \exp(-E_\theta(\mathbf{x}')) \cdot (-\nabla_\theta E_\theta(\mathbf{x}')) d\mathbf{x}'$$

$$= -\int \frac{\exp(-E_\theta(\mathbf{x}'))}{Z(\theta)} \nabla_\theta E_\theta(\mathbf{x}') d\mathbf{x}' = -\mathbb{E}_{\mathbf{x}' \sim p_\theta}[\nabla_\theta E_\theta(\mathbf{x}')]$$

Therefore:

$$\boxed{\nabla_\theta \log p_\theta(\mathbf{x}) = -\nabla_\theta E_\theta(\mathbf{x}) + \mathbb{E}_{\mathbf{x}' \sim p_\theta}[\nabla_\theta E_\theta(\mathbf{x}')]}$$

The gradient of the log-likelihood is the negative energy gradient at the data point plus the expected energy gradient under the model distribution.

In plain terms: training lowers the energy of real data and raises the energy of everything else (model samples).
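The identity can be checked numerically on a toy model. The sketch below assumes the 1D energy $E_\theta(x) = \theta x^2$ (so $\nabla_\theta E_\theta(x) = x^2$) and compares the boxed formula against a finite-difference derivative of $\log p_\theta(x)$, with all integrals done by a simple trapezoidal rule. Function names and grid settings are illustrative.

```python
import math

def energy(x, theta):
    return theta * x * x

def denergy_dtheta(x, theta):
    # For E = theta * x^2 the parameter gradient does not depend on theta.
    return x * x

def integrate(f, lo=-12.0, hi=12.0, n=4801):
    # Trapezoidal rule on a uniform grid.
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * f(lo + i * h)
    return total * h

def log_prob(x, theta):
    z = integrate(lambda u: math.exp(-energy(u, theta)))
    return -energy(x, theta) - math.log(z)

def grad_via_identity(x, theta):
    # -dE/dtheta at x, plus the model expectation of dE/dtheta.
    z = integrate(lambda u: math.exp(-energy(u, theta)))
    model_expectation = integrate(
        lambda u: denergy_dtheta(u, theta) * math.exp(-energy(u, theta))
    ) / z
    return -denergy_dtheta(x, theta) + model_expectation

def grad_via_finite_difference(x, theta, eps=1e-5):
    return (log_prob(x, theta + eps) - log_prob(x, theta - eps)) / (2 * eps)

x, theta = 1.0, 1.0
print(grad_via_identity(x, theta), grad_via_finite_difference(x, theta))
```

For this energy the model is a Gaussian with variance $1/(2\theta)$, so both numbers should be close to $-x^2 + 1/(2\theta) = -0.5$.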

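The full training gradient over a batch, writing $\mathcal{L}(\theta) = -\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log p_\theta(\mathbf{x})]$ for the negative log-likelihood loss:

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\nabla_\theta E_\theta(\mathbf{x})] - \mathbb{E}_{\mathbf{x}' \sim p_\theta}[\nabla_\theta E_\theta(\mathbf{x}')]$$

This is the exact maximum-likelihood gradient: gradient descent on $\mathcal{L}$ pulls the energy down on data and pushes it up on model samples. Contrastive divergence, introduced next, approximates the second expectation.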

⚠️ CRITICAL — Contrastive Divergence (CD)

The problem: we need samples from $p_\theta$ to compute $\mathbb{E}_{\mathbf{x}' \sim p_\theta}[\nabla_\theta E_\theta(\mathbf{x}')]$, but $p_\theta$ is known only up to the intractable $Z(\theta)$, so exact sampling is unavailable and running a long MCMC chain at every parameter update is too expensive.

Contrastive Divergence (Hinton, 2002) approximates the expectation with short MCMC chains:

CD-$k$ Algorithm:

1. Initialize the MCMC chain at a data point: $\mathbf{x}^{(0)} = \mathbf{x}_{\text{data}}$.
2. Run $k$ steps of Langevin dynamics (or another MCMC method) to get $\mathbf{x}^{(k)}$:
$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \frac{\epsilon}{2}\nabla_{\mathbf{x}} E_\theta(\mathbf{x}^{(t)}) + \sqrt{\epsilon}\,\boldsymbol{\eta}_t, \quad \boldsymbol{\eta}_t \sim \mathcal{N}(0, I)$$
3. Use $\mathbf{x}^{(k)}$ as an approximate model sample in the gradient:
$$\nabla_\theta \mathcal{L}_{\text{CD-}k} \approx \nabla_\theta E_\theta(\mathbf{x}^{(0)}) - \nabla_\theta E_\theta(\mathbf{x}^{(k)})$$

Why it works: Starting the MCMC chain at data points gives reasonable negative samples even with few steps ($k = 1$ works in practice for many problems). The chains don't need to fully mix: the short run shows where the current model drifts when started from the data, and the update raises the energy there while lowering it at the data.
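The three CD-$k$ steps can be sketched for the toy energy $E_\theta(x) = \theta x^2$. This is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, and the `noise_scale` argument exists only so the Langevin noise can be switched off for a reproducible check.

```python
import math
import random

def grad_x_energy(x, theta):
    # d/dx of E_theta(x) = theta * x^2
    return 2.0 * theta * x

def grad_theta_energy(x, theta):
    # d/dtheta of E_theta(x) = theta * x^2
    return x * x

def langevin_step(x, theta, eps, noise):
    # x <- x - (eps / 2) * dE/dx + sqrt(eps) * noise
    return x - 0.5 * eps * grad_x_energy(x, theta) + math.sqrt(eps) * noise

def cd_k_gradient(x_data, theta, k=1, eps=0.1, rng=None, noise_scale=1.0):
    rng = rng or random.Random(0)
    x = x_data                      # step 1: start the chain at the data point
    for _ in range(k):              # step 2: k Langevin steps
        x = langevin_step(x, theta, eps, noise_scale * rng.gauss(0.0, 1.0))
    # step 3: energy gradient at the data minus energy gradient at the chain end
    return grad_theta_energy(x_data, theta) - grad_theta_energy(x, theta)

# Noise switched off (noise_scale=0.0) to obtain a deterministic illustration.
g = cd_k_gradient(x_data=1.0, theta=1.0, k=1, eps=0.1, noise_scale=0.0)
print(g)
```

With the noise disabled, the chain moves from $1.0$ to $0.9$ and the estimate is $1.0 - 0.81 = 0.19$, the same numbers as Example 2 in the Worked Examples.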

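Langevin dynamics is the MCMC method of choice for EBMs because it only needs $\nabla_{\mathbf{x}}E_\theta(\mathbf{x})$ (available via backprop) and, in the limit of small step size $\epsilon$ and many steps, its stationary distribution is $p_\theta(\mathbf{x}) \propto \exp(-E_\theta(\mathbf{x}))$. For finite $\epsilon$ the chain is slightly biased unless a Metropolis-Hastings correction is added.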
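As a sanity check on the sampler itself, the sketch below runs a single long Langevin chain on the energy $E(x) = x^2/2$, whose Boltzmann distribution is $\mathcal{N}(0, 1)$, and compares the sample mean and variance with the target. Step size, chain length, burn-in, and the seed are arbitrary choices made for illustration.

```python
import math
import random

def grad_energy(x):
    # E(x) = x^2 / 2, so dE/dx = x; the target exp(-E) is N(0, 1) up to normalization.
    return x

def langevin_chain(x0, n_steps, eps, rng):
    x = x0
    samples = []
    for _ in range(n_steps):
        x = x - 0.5 * eps * grad_energy(x) + math.sqrt(eps) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples

rng = random.Random(0)
samples = langevin_chain(x0=3.0, n_steps=50000, eps=0.05, rng=rng)
kept = samples[1000:]  # discard burn-in from the deliberately bad start at x0 = 3
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)  # expected to be near 0 and near 1
```

For this quadratic energy the unadjusted chain is a linear autoregression with stationary variance $1/(1 - \epsilon/4)$, slightly above 1, which is a small concrete instance of the finite-step-size bias.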

Persistent CD (PCD)

Instead of reinitializing chains at data each iteration, persistent contrastive divergence (Tieleman, 2008) maintains a persistent bank of "fantasy particles" — MCMC chains that continue across training iterations:

$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \frac{\epsilon}{2}\nabla_{\mathbf{x}} E_\theta(\mathbf{x}^{(t)}) + \sqrt{\epsilon}\,\boldsymbol{\eta}_t$$

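Between parameter updates, the fantasy particles track the slowly changing model distribution. When $E_\theta$ changes slowly, the persistent chains have effectively run for many steps, so they are closer to true samples from $p_\theta$ than the $k$-step chains of CD and give a less biased estimate of the model expectation.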
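A minimal PCD training loop might look like the following sketch, again for the toy energy $E_\theta(x) = \theta x^2$, where the maximum-likelihood solution $\theta = 1/(2\,\overline{x^2})$ is available for comparison. All hyperparameters (number of particles, steps per update, step size, learning rate, seeds) are assumptions chosen so the example runs quickly, not recommended settings.

```python
import math
import random

def grad_x_energy(x, theta):
    return 2.0 * theta * x  # E_theta(x) = theta * x^2

def langevin_step(x, theta, eps, rng):
    return x - 0.5 * eps * grad_x_energy(x, theta) + math.sqrt(eps) * rng.gauss(0.0, 1.0)

def train_pcd(data, theta0=0.5, n_particles=200, n_iters=500, k=5,
              eps=0.02, lr=0.2, seed=0):
    rng = random.Random(seed)
    theta = theta0
    # Fantasy particles are created once and never reset to the data.
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    data_term = sum(x * x for x in data) / len(data)  # E_data[dE/dtheta]
    for _ in range(n_iters):
        for _ in range(k):
            particles = [langevin_step(x, theta, eps, rng) for x in particles]
        # Particle average approximates E_model[dE/dtheta].
        model_term = sum(x * x for x in particles) / n_particles
        theta -= lr * (data_term - model_term)  # gradient descent on the NLL
    return theta

data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 0.5) for _ in range(500)]
theta_hat = train_pcd(data)
theta_mle = 1.0 / (2.0 * sum(x * x for x in data) / len(data))
print(theta_hat, theta_mle)
```

The learned value should land near the closed-form estimate, with a small offset from the finite Langevin step size and some jitter from the finite number of particles.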

⚠️ CRITICAL — Noise Contrastive Estimation (NCE)

NCE (Gutmann & Hyvärinen, 2010) converts density estimation into a binary classification problem, completely avoiding MCMC.

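Key idea: Instead of computing $Z(\theta)$, treat its logarithm as an additional learnable parameter $c \approx \log Z(\theta)$. The unnormalized model is $\tilde{p}_\theta(\mathbf{x}) = \exp(-E_\theta(\mathbf{x}))$. In the formulas below, $c$ is absorbed into $E_\theta$ as a learnable additive offset, so $\tilde{p}_\theta$ is trained to be approximately normalized.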

NCE objective: Train a classifier to distinguish data samples ($C=1$) from noise samples ($C=0$) drawn from a known noise distribution $p_n(\mathbf{x})$:

$$p(C=1 \mid \mathbf{x}) = \frac{\tilde{p}_\theta(\mathbf{x})}{\tilde{p}_\theta(\mathbf{x}) + \nu \, p_n(\mathbf{x})}$$

where $\nu = P(C=0)/P(C=1)$ is the noise-to-data ratio (typically $\nu = 1$, equal mixing).

Equivalently, using logits:

$$\log\frac{p(C=1 \mid \mathbf{x})}{p(C=0 \mid \mathbf{x})} = \log \tilde{p}_\theta(\mathbf{x}) - \log(\nu \, p_n(\mathbf{x})) = -E_\theta(\mathbf{x}) - \log(\nu \, p_n(\mathbf{x}))$$

The NCE loss is binary cross-entropy, with the noise term weighted by $\nu$ because there are $\nu$ noise samples per data sample:

$$\mathcal{L}_{\text{NCE}} = -\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log \sigma(\Delta(\mathbf{x}))] - \nu\,\mathbb{E}_{\mathbf{x} \sim p_n}[\log(1 - \sigma(\Delta(\mathbf{x})))]$$

where $\Delta(\mathbf{x}) = -E_\theta(\mathbf{x}) - \log(\nu \, p_n(\mathbf{x}))$ and $\sigma$ is the sigmoid.
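The relation between the ratio form of the classifier and the sigmoid of the logit $\Delta$ can be verified directly. The sketch assumes the energy $E(x) = x^2/2$ and Gaussian noise with standard deviation 2; the function names are illustrative. The noise term of the loss is weighted by $\nu$ here, which coincides with the unweighted sum when $\nu = 1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def energy(x):
    return 0.5 * x * x  # unnormalized model: p_tilde(x) = exp(-E(x))

def noise_pdf(x, sigma_n=2.0):
    return math.exp(-x * x / (2.0 * sigma_n ** 2)) / math.sqrt(2.0 * math.pi * sigma_n ** 2)

def nce_logit(x, nu=1.0):
    # Delta(x) = -E(x) - log(nu * p_n(x))
    return -energy(x) - math.log(nu * noise_pdf(x))

def posterior_ratio_form(x, nu=1.0):
    # p(C=1 | x) written as p_tilde / (p_tilde + nu * p_n)
    p_tilde = math.exp(-energy(x))
    return p_tilde / (p_tilde + nu * noise_pdf(x))

def nce_loss(data_samples, noise_samples, nu=1.0):
    # Binary cross-entropy; fine for moderate |x|, not numerically hardened.
    data_term = -sum(math.log(sigmoid(nce_logit(x, nu))) for x in data_samples) / len(data_samples)
    noise_term = -sum(math.log(1.0 - sigmoid(nce_logit(x, nu))) for x in noise_samples) / len(noise_samples)
    return data_term + nu * noise_term

for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x, sigmoid(nce_logit(x)), posterior_ratio_form(x))
```

The two printed columns agree because $\sigma(\Delta) = 1/(1 + \nu p_n/\tilde{p}_\theta)$, which is the ratio form after multiplying numerator and denominator by $\tilde{p}_\theta$.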

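Critical property: NCE is consistent. Provided $p_n$ is nonzero wherever $p_{\text{data}}$ is and the model family contains the data distribution, the minimizer of the NCE loss converges to the true parameters (including $c = \log Z$) as the number of samples grows. As the noise-to-data ratio $\nu \to \infty$, its statistical efficiency approaches that of MLE. In practice, a modest number of noise samples works well.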
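To see NCE recover a normalizer, the sketch below fixes the energy $E(x) = x^2/2$ and learns only the offset $c$ by gradient descent on the NCE loss, using data from $\mathcal{N}(0, 1)$ and noise from $\mathcal{N}(0, 4)$. The true value is $\log\sqrt{2\pi} \approx 0.919$. Sample sizes, learning rate, iteration count, and seed are assumptions made for a quick illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_noise_pdf(x, sigma_n=2.0):
    return -x * x / (2.0 * sigma_n ** 2) - 0.5 * math.log(2.0 * math.pi * sigma_n ** 2)

def logit(x, c, nu=1.0):
    # log[exp(-E(x) - c)] - log[nu * p_n(x)], with E(x) = x^2 / 2
    return -0.5 * x * x - c - math.log(nu) - log_noise_pdf(x)

def fit_log_z(data, noise, lr=1.0, n_iters=100):
    nu = len(noise) / len(data)
    c = 0.0
    for _ in range(n_iters):
        # d(loss)/dc for loss = -E_data[log s] - nu * E_noise[log(1 - s)], s = sigmoid(logit)
        g_data = sum(1.0 - sigmoid(logit(x, c, nu)) for x in data) / len(data)
        g_noise = sum(sigmoid(logit(x, c, nu)) for x in noise) / len(noise)
        c -= lr * (g_data - nu * g_noise)
    return c

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
noise = [rng.gauss(0.0, 2.0) for _ in range(2000)]
c_hat = fit_log_z(data, noise)
print(c_hat, 0.5 * math.log(2.0 * math.pi))
```

No integral over $x$ is ever computed; the classifier's need to calibrate its logit against the known noise density is what pins down $c$.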

Comparison CD vs. NCE:

Aspect	Contrastive Divergence	NCE
Requires MCMC	Yes (Langevin dynamics)	No
Requires noise distribution	No	Yes ($p_n$)
Statistical correctness	Approximate (biased for $k < \infty$)	Consistent (asymptotically)
Sample efficiency	Uses real data as starting points	Needs noise samples
Stability	Can diverge if step size too large	Very stable (classification task)

Connection to Score Matching and Diffusion

Recall from 22-05 and 22-06: score matching learns $\nabla_{\mathbf{x}} \log p(\mathbf{x})$; diffusion models learn $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t) \propto -\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)$.

EBMs and score matching are closely related:

- An EBM defines $p_\theta(\mathbf{x}) \propto \exp(-E_\theta(\mathbf{x}))$, so $\nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) = -\nabla_{\mathbf{x}} E_\theta(\mathbf{x})$
- Score matching learns $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$ directly
- If we train an EBM via score matching instead of CD/NCE, we bypass $Z(\theta)$ entirely

This connection was exploited in Denoising Score Matching and ultimately diffusion models, which can be viewed as a multi-scale EBM trained via denoising score matching across noise levels.
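The fact that the score does not see $Z(\theta)$ can be checked numerically. The sketch below assumes a 1D Gaussian energy with $\mu = 1$, $\sigma = 2$ and compares the derivative of the normalized log-density with the negative derivative of the energy, using central finite differences. The helper names are illustrative.

```python
import math

def energy(x, mu=1.0, sigma=2.0):
    return (x - mu) ** 2 / (2.0 * sigma ** 2)

def log_prob(x, mu=1.0, sigma=2.0):
    # Normalized Gaussian log-density; the log Z term is included explicitly.
    log_z = 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return -energy(x, mu, sigma) - log_z

def derivative(f, x, h=1e-5):
    # Central finite difference.
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.3
score_from_energy = -derivative(energy, x)
score_from_log_prob = derivative(log_prob, x)
print(score_from_energy, score_from_log_prob)
```

Both values should be close to $-(x - \mu)/\sigma^2 = 0.175$: the $\log Z$ term is constant in $x$ and drops out of the derivative.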

Key Terms

Compositional
Contrastive Divergence
Denoising Score Matching
Langevin dynamics
Log-likelihood gradient
Natural for structured prediction
Noise Contrastive Estimation
Persistent CD
Score matching connection
Unnormalized models are flexible

Worked Examples

Example 1: Gaussian EBM

A 1D EBM has energy $E_\theta(x) = \frac{(x-\mu)^2}{2\sigma^2}$. Compute $Z(\theta)$ and verify $\nabla_\theta \log p_\theta(x)$ matches the gradient formula.

Solution:

$$Z(\theta) = \int_{-\infty}^{\infty} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)dx = \sqrt{2\pi\sigma^2}$$

$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \mathcal{N}(\mu, \sigma^2)$$

$\log p_\theta(x) = -\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$

Gradient w.r.t $\mu$:

Direct: $\frac{\partial}{\partial\mu}\log p_\theta(x) = \frac{x-\mu}{\sigma^2}$

Using the formula: $\nabla_\mu \log p_\theta(x) = -\nabla_\mu E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}[\nabla_\mu E_\theta(x')]$

$-\nabla_\mu E_\theta(x) = -\frac{\partial}{\partial\mu}\frac{(x-\mu)^2}{2\sigma^2} = \frac{x-\mu}{\sigma^2}$

$\mathbb{E}_{x'}[\nabla_\mu E_\theta(x')] = \mathbb{E}_{x'}\left[-\frac{x'-\mu}{\sigma^2}\right] = -\frac{\mathbb{E}[x']-\mu}{\sigma^2} = 0$ (since $\mathbb{E}[x']=\mu$)

Sum: $\frac{x-\mu}{\sigma^2} + 0 = \frac{x-\mu}{\sigma^2}$ ✓

Answer:

$Z(\theta) = \sqrt{2\pi\sigma^2}$, and the gradient formula is verified: $-\nabla_\mu E_\theta(x)$ gives $\frac{x-\mu}{\sigma^2}$, while the expectation term is zero because the model mean equals $\mu$. The EBM is simply a Gaussian.

Example 2: CD-1 Update

A 1D EBM $E_\theta(x) = \theta x^2$ with $\theta > 0$, currently at $\theta = 1$. Given a single datapoint $x_{\text{data}} = 1.0$, run one step of CD-1 with Langevin step size $\epsilon = 0.1$ and compute the CD gradient.

Solution:

Step 1: Initialize $x^{(0)} = x_{\text{data}} = 1.0$

Step 2: One Langevin step: $\nabla_x E_\theta(x) = 2\theta x$

With $\theta = 1$ (example): $x^{(1)} = 1.0 - 0.05 \cdot 2 \cdot 1.0 + \sqrt{0.1}\,\eta = 1.0 - 0.1 + 0.3162\eta$

Assuming $\eta = 0$ for simplicity (deterministic): $x^{(1)} = 0.9$

Step 3: CD-1 gradient. Since $\nabla_\theta E_\theta(x) = x^2$:

$\nabla_\theta E_\theta(x^{(0)}) = (1.0)^2 = 1.0$

$\nabla_\theta E_\theta(x^{(1)}) = (0.9)^2 = 0.81$

$\nabla_\theta \mathcal{L}_{\text{CD-1}} \approx 1.0 - 0.81 = 0.19$

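Answer:

$\nabla_\theta \mathcal{L}_{\text{CD-1}} \approx 0.19$. Because $\mathcal{L}$ is the negative log-likelihood, a positive gradient means gradient descent decreases $\theta$. That flattens the energy and widens the model, which assigns more probability to $|x| \approx 1$. This agrees with maximum likelihood: at $\theta = 1$ the model has $\mathbb{E}[x^2] = 1/(2\theta) = 0.5$, less than the observed $x_{\text{data}}^2 = 1$. The noise-free Langevin step moved the sample toward the mode at $0$, so the update lowers the energy at the data point relative to the point the chain reached.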

Example 3: NCE Logit

An EBM has $E_\theta(x) = \frac{1}{2}x^2$ (standard Gaussian energy). The noise distribution is $p_n(x) = \mathcal{N}(0, 4)$ (variance 4). With $\nu = 1$, compute the NCE decision boundary (where $P(C=1|x) = P(C=0|x) = 0.5$).

Solution:

At the decision boundary: $\tilde{p}_\theta(x) = \nu \, p_n(x)$

$\exp(-\frac{1}{2}x^2) = 1 \cdot \frac{1}{\sqrt{2\pi \cdot 4}}\exp(-\frac{x^2}{2 \cdot 4})$

$\exp(-\frac{x^2}{2}) = \frac{1}{2\sqrt{2\pi}}\exp(-\frac{x^2}{8})$

Taking logs: $-\frac{x^2}{2} = -\ln(2\sqrt{2\pi}) - \frac{x^2}{8}$

$\frac{x^2}{8} - \frac{x^2}{2} = -\ln(2\sqrt{2\pi})$

$-\frac{3x^2}{8} = -\ln(2\sqrt{2\pi})$

$x^2 = \frac{8}{3}\ln(2\sqrt{2\pi}) \approx \frac{8}{3} \cdot 1.612 \approx 4.299$

$x \approx \pm 2.073$

Answer:

Decision boundaries at $x \approx \pm 2.073$. For $|x| < 2.073$, the unnormalized model $\tilde{p}_\theta$ exceeds the noise density (classifier says "data"). For $|x| > 2.073$, the noise density dominates (classifier says "noise"). The narrow Gaussian (variance 1) is larger near zero, while the wide noise (variance 4) is larger in the tails.
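The boundary from Example 3 can also be found numerically by locating the zero of the NCE logit. The sketch uses a plain bisection; `bisect_root` is a hypothetical helper written for this illustration, not a library function.

```python
import math

def nce_logit(x, nu=1.0, sigma_n=2.0):
    # Delta(x) = -E(x) - log(nu * p_n(x)) with E(x) = x^2 / 2 and Gaussian noise.
    log_noise = -x * x / (2.0 * sigma_n ** 2) - 0.5 * math.log(2.0 * math.pi * sigma_n ** 2)
    return -0.5 * x * x - math.log(nu) - log_noise

def bisect_root(f, lo, hi, tol=1e-10):
    # Assumes f(lo) and f(hi) have opposite signs.
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

boundary = bisect_root(nce_logit, 0.0, 5.0)
closed_form = math.sqrt((8.0 / 3.0) * math.log(2.0 * math.sqrt(2.0 * math.pi)))
print(boundary, closed_form)  # both approximately 2.073
```

The logit is positive at $x = 0$ and negative at $x = 5$, so the bracket contains exactly the positive boundary; the negative one follows by symmetry.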

Practice Problems

Prove that adding a constant to $E_\theta(\mathbf{x})$ doesn't change $p_\theta(\mathbf{x})$.

Answer: Let $\tilde{E}_\theta(\mathbf{x}) = E_\theta(\mathbf{x}) + c$. Then $\tilde{p}_\theta(\mathbf{x}) = \exp(-E_\theta(\mathbf{x}) - c) / \int \exp(-E_\theta(\mathbf{x}') - c) d\mathbf{x}' = e^{-c}\exp(-E_\theta(\mathbf{x})) / (e^{-c}\int \exp(-E_\theta(\mathbf{x}')) d\mathbf{x}') = p_\theta(\mathbf{x})$. The constant cancels between numerator and denominator. This is why absolute energy values carry no meaning: only energy differences matter.
Show that Langevin dynamics with $\epsilon \to 0$ and infinite steps converges to the stationary distribution $p_\theta(\mathbf{x}) \propto \exp(-E_\theta(\mathbf{x}))$.

Answer: The update $\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \frac{\epsilon}{2}\nabla E(\mathbf{x}^{(t)}) + \sqrt{\epsilon}\,\boldsymbol{\eta}_t$ is the Euler-Maruyama discretization, with time step $\epsilon$, of the Langevin SDE $d\mathbf{x} = -\frac{1}{2}\nabla E(\mathbf{x})\,dt + d\mathbf{w}$. Its Fokker-Planck equation for the density $p_t$ is $\partial_t p_t = \frac{1}{2}\nabla \cdot (p_t \nabla E) + \frac{1}{2}\nabla^2 p_t = \frac{1}{2}\nabla \cdot (p_t \nabla E + \nabla p_t)$. Substituting $p_\infty \propto \exp(-E)$ gives $\nabla p_\infty = -p_\infty \nabla E$, so $p_\infty \nabla E + \nabla p_\infty = 0$ and therefore $\partial_t p_\infty = 0$: the Boltzmann distribution is stationary. The probability current vanishes identically, which is the continuous-time form of detailed balance. The rescaled SDE $d\mathbf{x} = -\nabla E(\mathbf{x})\,dt + \sqrt{2}\,d\mathbf{w}$ has the same stationary distribution; the factor of $\frac{1}{2}$ only rescales time. Under mild regularity conditions (a smooth $E$ that grows fast enough at infinity for $\exp(-E)$ to be integrable), the stationary distribution is unique and $p_t \to p_\infty$ from any initial condition. As $\epsilon \to 0$ the discretization error vanishes, so the discrete chain converges to $p_\theta$ in the limit of infinitely many steps. For finite $\epsilon$ the discrete step satisfies detailed balance only approximately, which is why a Metropolis-Hastings correction (MALA) is sometimes added.
Compare CD-1 and PCD. Why might PCD produce better samples when the model distribution changes slowly?

Answer: CD-1 resets chains to data at each iteration, so the negative samples always stay near the data. Spurious low-energy regions far from the data are never visited and therefore never pushed up. PCD maintains persistent chains that keep exploring as parameters change, so they can reach such regions and provide a more accurate approximation of $\mathbb{E}_{p_\theta}[\nabla_\theta E]$. However, PCD can become unstable if parameters change too quickly: the chains then lag behind the current model and the negative samples no longer represent $p_\theta$.
For NCE with noise $p_n = \mathcal{N}(0, \sigma_n^2)$ and EBM $E(x) = \frac{x^2}{2\sigma^2}$, find $\sigma_n$ such that the optimal classifier assigns equal probability to data/noise for all $x$. Interpret.

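Answer: $\Delta(x) = -E(x) - \log(\nu p_n(x)) = -\frac{x^2}{2\sigma^2} - \log\nu + \frac{1}{2}\log(2\pi\sigma_n^2) + \frac{x^2}{2\sigma_n^2}$. For $\Delta(x)$ to be constant (independent of $x$): $\frac{1}{2\sigma_n^2} - \frac{1}{2\sigma^2} = 0 \implies \sigma_n = \sigma$. The remaining constant is $\Delta = \frac{1}{2}\log(2\pi\sigma^2) - \log\nu$; once the learned normalizer $c = \log Z = \frac{1}{2}\log(2\pi\sigma^2)$ is included and $\nu = 1$, $\Delta = 0$ and the classifier outputs $0.5$ everywhere. Interpretation: when the noise distribution coincides with the model, no $x$ is more data-like than noise-like, so the classification task is as hard as possible. At the other extreme, noise that is very different from the data makes the task trivially easy and teaches the model little about the structure of the data. In practice the noise is chosen to be easy to sample and evaluate while staying reasonably close to the data distribution.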
An EBM defines $p_\theta(\mathbf{x}) \propto \exp(-E_\theta(\mathbf{x}))$. Its score function is $\nabla_{\mathbf{x}}\log p_\theta(\mathbf{x}) = -\nabla_{\mathbf{x}}E_\theta(\mathbf{x})$. Explain why score matching (22-05) can train this EBM without needing $Z(\theta)$ or MCMC.

Answer: The score matching objective is $\mathbb{E}_{p_{\text{data}}}[\frac{1}{2}\|\mathbf{s}_\theta(\mathbf{x}) - \nabla_{\mathbf{x}}\log p_{\text{data}}(\mathbf{x})\|^2]$. After integration by parts (Hyvärinen, 2005), this becomes $\mathbb{E}_{p_{\text{data}}}[\text{tr}(\nabla_{\mathbf{x}}\mathbf{s}_\theta(\mathbf{x})) + \frac{1}{2}\|\mathbf{s}_\theta(\mathbf{x})\|^2]$ up to a constant that does not depend on $\theta$. It depends only on $\mathbf{s}_\theta$ and its Jacobian, so no $Z(\theta)$ appears. For an EBM, $\mathbf{s}_\theta(\mathbf{x}) = -\nabla_{\mathbf{x}}E_\theta(\mathbf{x})$ and $\text{tr}(\nabla_{\mathbf{x}}\mathbf{s}_\theta(\mathbf{x})) = -\text{tr}(\nabla_{\mathbf{x}}^2 E_\theta(\mathbf{x}))$. This trace-of-Hessian term is expensive but can be approximated by sliced score matching or avoided with denoising score matching. The approach bypasses both MCMC and $Z(\theta)$, connecting EBMs directly to the score-based framework.
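For a 1D Gaussian EBM the integration-by-parts objective can be evaluated in closed form per sample, which makes the "no $Z(\theta)$, no MCMC" claim easy to try out. The sketch assumes the energy $E_a(x) = a x^2 / 2$ with precision $a = 1/\sigma^2$, so $s(x) = -a x$ and $ds/dx = -a$; the grid search and data settings are illustrative choices.

```python
import random

def score_matching_objective(a, data):
    # Model: E_a(x) = a * x^2 / 2, so s(x) = -a * x and ds/dx = -a.
    # Objective: average over the data of ds/dx + 0.5 * s(x)^2.
    return sum(-a + 0.5 * (a * x) ** 2 for x in data) / len(data)

rng = random.Random(0)
data = [rng.gauss(0.0, 1.5) for _ in range(2000)]
second_moment = sum(x * x for x in data) / len(data)

# Grid search over the precision a; no normalizer or sampler is involved.
candidates = [0.01 * i for i in range(1, 301)]
a_best = min(candidates, key=lambda a: score_matching_objective(a, data))
print(a_best, 1.0 / second_moment)
```

The objective equals $-a + \frac{1}{2}a^2\,\overline{x^2}$, which is minimized at $a = 1/\overline{x^2}$, so the grid minimizer should match the inverse sample second moment to within the grid spacing.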

Summary

Key takeaways:

EBM: $p_\theta(\mathbf{x}) = \exp(-E_\theta(\mathbf{x}))/Z(\theta)$ — any scalar function defines an unnormalized density
$Z(\theta) = \int \exp(-E_\theta(\mathbf{x}))d\mathbf{x}$ is intractable — the central challenge of EBMs
Log-likelihood gradient: $\nabla_\theta \mathcal{L} = \mathbb{E}_{\text{data}}[\nabla_\theta E_\theta] - \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta]$ for the negative log-likelihood $\mathcal{L}$ ("pull down data, push up model")
Contrastive Divergence (CD-$k$): MCMC approximation — initialize at data, run $k$ Langevin steps
Persistent CD: Maintain chains across iterations for better statistical properties
Noise Contrastive Estimation: Convert to binary classification — data vs. noise, with $\log Z$ as learned parameter
Score matching connection: $\nabla_{\mathbf{x}}\log p_\theta = -\nabla_{\mathbf{x}}E_\theta$ — EBMs can be trained via score matching, avoiding $Z(\theta)$ entirely

Quiz

The partition function $Z(\theta)$ in an EBM is:
A) A hyperparameter set by the user
B) The integral $\int \exp(-E_\theta(\mathbf{x}))d\mathbf{x}$, generally intractable
C) Always equal to 1 for properly trained EBMs
D) Computed analytically for any neural network $E_\theta$
Correct: B)
If you chose B: $Z(\theta)$ normalizes the distribution. For a neural network $E_\theta$, the integral over $\mathbb{R}^d$ has no closed form and is exponentially expensive.
If you chose A: $Z(\theta)$ is determined by the energy function, not chosen freely.
If you chose C: $Z = 1$ holds only for models that are normalized by construction, such as normalizing flows or autoregressive models, not for general EBMs.
If you chose D: Only for simple energy functions like quadratics (Gaussians).
In contrastive divergence, MCMC chains are initialized at data points because:
A) Data points are already low-energy under a well-trained model, providing good starting points for the MCMC sampler
B) The theory requires initialization at extrema of the energy
C) It guarantees convergence in one step
D) Data points are easier to store than random points
Correct: A)
If you chose A: Starting near the data manifold gives reasonable negative samples even with few MCMC steps, because the model distribution is (or should be) concentrated near data. Random initialization would require many steps to reach relevant regions.
If you chose B: There's no such theoretical requirement.
If you chose C: CD-1 is a biased approximation; a single step does not guarantee convergence.
If you chose D: Computational storage is irrelevant.
NCE transforms density estimation into:
A) A regression problem predicting $E_\theta(\mathbf{x})$
B) A binary classification problem (data vs. noise)
C) A reinforcement learning problem
D) A clustering problem
Correct: B)
If you chose B: NCE trains a classifier with logit $\Delta(\mathbf{x}) = -E_\theta(\mathbf{x}) - \log(\nu p_n(\mathbf{x}))$. The classifier learns to distinguish $\mathbf{x} \sim p_{\text{data}}$ from $\mathbf{x} \sim p_n$.
If you chose A: The energy is learned indirectly through classification.
If you chose C: No reward signal or sequential decisions.
If you chose D: No cluster assignments.
The Langevin dynamics update $\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \frac{\epsilon}{2}\nabla_{\mathbf{x}}E_\theta(\mathbf{x}^{(t)}) + \sqrt{\epsilon}\,\boldsymbol{\eta}_t$ is used in CD because:
A) It's the only MCMC method available for continuous spaces
B) It uses $\nabla_{\mathbf{x}}E_\theta$ (available via backprop) and converges to the stationary distribution $p_\theta \propto \exp(-E_\theta)$
C) It always converges in one step
D) It avoids computing $E_\theta(\mathbf{x})$ entirely
Correct: B)
If you chose B: Langevin dynamics is gradient-based MCMC that naturally targets the Boltzmann distribution. The gradient $\nabla_{\mathbf{x}}E_\theta$ is cheap (one backward pass), making it ideal for neural EBMs.
If you chose A: Other methods exist (HMC, Metropolis-Hastings with random walk proposals), but Langevin is most natural for EBMs.
If you chose C: Mixing time depends on energy landscape geometry.
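If you chose D: The Langevin update uses only the gradient $\nabla_{\mathbf{x}}E_\theta$, but backprop obtains it from a forward pass that evaluates $E_\theta(\mathbf{x})$, and the energy is also needed for the parameter gradient $\nabla_\theta E_\theta$.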
Which connection correctly links EBMs to diffusion models?
A) Diffusion models are EBMs with a fixed energy function
B) The score function $\nabla_{\mathbf{x}}\log p_\theta(\mathbf{x}) = -\nabla_{\mathbf{x}}E_\theta(\mathbf{x})$ — diffusion models learn the score, which is the negative energy gradient of an implicit EBM
C) EBMs and diffusion models are mathematically unrelated
D) Diffusion models approximate $Z(\theta)$ for EBMs
Correct: B)
If you chose B: A diffusion model's noise predictor $\boldsymbol{\varepsilon}_\theta(\mathbf{x}, t)$ is proportional to $-\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$, which is $\nabla_{\mathbf{x}}E_t(\mathbf{x})$ for the implicit energy $E_t(\mathbf{x}) = -\log p_t(\mathbf{x})$. Every diffusion model implicitly defines an EBM at each noise level.
If you chose A: The energy is learned (via the score/noise predictor), not fixed.
If you chose C: They're deeply connected via score matching.
If you chose D: Diffusion models don't compute $Z(\theta)$; they bypass it via score matching.

Next Steps

22-10 — Evaluation of Generative Models — How do we actually measure how good these generative models are? Inception Score (IS), Fréchet Inception Distance (FID), log-likelihood, bits per dimension, and the many pitfalls of evaluating generative models.

Pitfalls

Comparing EBMs with different training methods using the same loss metric: CD, PCD, and NCE optimize different objectives. CD and PCD follow an approximation of the log-likelihood gradient; NCE minimizes a classification loss. The training loss values are not comparable across methods. Always evaluate EBMs on held-out log-likelihood (computed via Annealed Importance Sampling, AIS) or sample quality metrics (FID for images) rather than comparing training losses.
Expecting CD-1 to be statistically correct: CD-1 (initializing at data and taking one Langevin step) is computationally efficient but biased — it doesn't converge to the MLE even with infinite data. The bias decreases with more MCMC steps (CD-$k$ with larger $k$) and with Persistent CD. For final model evaluation, use PCD or train with larger $k$, then evaluate with AIS for proper log-likelihood estimates.
Choosing a noise distribution for NCE without considering overlap: NCE requires $p_n(\mathbf{x})$ to have sufficient overlap with $p_{\text{data}}(\mathbf{x})$. If the noise distribution is too narrow (concentrated far from data), the classifier trivially separates data from noise with near-perfect accuracy — no learning signal for the energy function. If too broad, the classifier learns a trivial boundary. The noise should be a rough approximation of the data distribution (e.g., a Gaussian matched to data mean and variance, scaled wider).
Forgetting that the energy $E_\theta(\mathbf{x})$ is only determined up to an additive constant: Adding any constant $c$ to $E_\theta$ doesn't change $p_\theta$ (it cancels in the Boltzmann distribution). This means the absolute energy values are meaningless; only energy differences $E_\theta(\mathbf{x}) - E_\theta(\mathbf{x}')$ matter. Don't interpret the raw energy output; instead, look at relative energies or the score $-\nabla_{\mathbf{x}}E_\theta(\mathbf{x})$, which is unaffected by constant shifts.
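The shift invariance is easy to confirm numerically. The sketch below normalizes $\exp(-E)$ on a grid for an illustrative double-well energy and for the same energy shifted by a constant, then compares the two densities point by point. The energy, the shift of 7, and the grid are arbitrary choices for the demonstration.

```python
import math

def normalized_density(energy_fn, grid):
    # Normalize exp(-E) on a uniform grid with a Riemann sum.
    h = grid[1] - grid[0]
    weights = [math.exp(-energy_fn(x)) for x in grid]
    z = sum(weights) * h
    return [w / z for w in weights]

def energy(x):
    return 0.25 * x ** 4 - x ** 2  # illustrative double-well energy

def shifted_energy(x, c=7.0):
    return energy(x) + c  # same landscape, raised by a constant

grid = [-4.0 + 0.01 * i for i in range(801)]
p = normalized_density(energy, grid)
p_shifted = normalized_density(shifted_energy, grid)
max_diff = max(abs(a - b) for a, b in zip(p, p_shifted))
print(max_diff)  # zero up to floating-point rounding
```

The raw outputs of `energy` and `shifted_energy` differ by 7 everywhere, yet the normalized densities coincide, so only differences such as `energy(1.0) - energy(0.0)` carry information.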

Q6: How are EBMs connected to score-based and diffusion models?

A) They are completely unrelated approaches.
B) $\nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) = -\nabla_{\mathbf{x}} E_\theta(\mathbf{x})$, so the score function is the negative energy gradient. Diffusion models learn the score, which implicitly defines an EBM at each noise level.
C) EBMs are a special case of GANs.
D) Diffusion models compute $Z(\theta)$ for EBMs.

Answer and Explanations

Correct: B)
If you chose B: For an EBM $p_\theta(\mathbf{x}) = \exp(-E_\theta(\mathbf{x}))/Z$, the score is $\nabla_{\mathbf{x}} \log p_\theta = -\nabla_{\mathbf{x}} E_\theta$ (the $Z$ term vanishes). A diffusion model's noise predictor $\boldsymbol{\varepsilon}_\theta(\mathbf{x}_t, t)$ approximates $-\sqrt{1-\bar{\alpha}_t} \nabla \log p_t(\mathbf{x}_t)$, which equals $\sqrt{1-\bar{\alpha}_t} \nabla E_t(\mathbf{x}_t)$ for an implicit energy $E_t = -\log p_t$. Every diffusion model is implicitly an EBM, and EBMs can be trained via score matching to bypass $Z(\theta)$.
If you chose A: They are deeply connected through the score function.
If you chose C: EBMs are a distinct model class.
If you chose D: Diffusion models bypass $Z$ via score matching; they don't compute it.
