22-04 — Normalizing Flows
Phase: 22 — Generative Models Mathematics
Subject: 22-04
Prerequisites: 22-02 — VAEs, Phase 08 (Linear Algebra), Phase 06 (Multivariable Calculus)
Next subject: 22-05 — Score-Based Generative Models
Learning Objectives
By the end of this subject, you will be able to:
- Derive the change-of-variables formula for probability densities under invertible transformations
- Compute log-determinants for structured Jacobians (autoregressive, coupling layers)
- Explain how RealNVP affine coupling layers achieve both invertibility and efficient Jacobian computation
- Analyze the trade-off between expressiveness and computational cost in flow architectures
- Connect normalizing flows to VAEs as improved prior/posterior distributions
Core Content
The Change of Variables Formula
Normalizing flows model a complex distribution $p_X(\mathbf{x})$ by transforming a simple base distribution $p_Z(\mathbf{z})$ (typically $\mathcal{N}(\mathbf{0}, I)$) through a sequence of invertible transformations $f = f_K \circ f_{K-1} \circ \cdots \circ f_1$:
$$\mathbf{x} = f(\mathbf{z}), \quad \mathbf{z} \sim p_Z$$
⚠️ CRITICAL — Change of Variables: For an invertible, differentiable transformation $f: \mathbb{R}^d \to \mathbb{R}^d$:
$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \left| \det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$$
Or equivalently, in terms of the forward transformation:
$$p_X(\mathbf{x}) = p_Z(\mathbf{z}) \left| \det \frac{\partial f(\mathbf{z})}{\partial \mathbf{z}} \right|^{-1}$$
where $\mathbf{z} = f^{-1}(\mathbf{x})$.
Intuition: The determinant of the Jacobian measures how much the transformation $f$ expands or contracts volume locally. If $f$ expands a region by factor 2, the density must be halved to keep total probability mass at 1.
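To make the volume argument concrete, the sketch below applies the formula in one dimension to $f(z) = e^z$ (a map chosen here purely for illustration, not taken from the text). It integrates the resulting density numerically, once with the Jacobian factor and once without it; the helper names are hypothetical.

```python
import math

def std_normal_pdf(z):
    """Density of the base distribution N(0, 1)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def pushforward_pdf(x):
    """Density of x = f(z) = exp(z), z ~ N(0, 1), via change of variables.

    f^{-1}(x) = log(x) and |d f^{-1} / dx| = 1 / x for x > 0.
    """
    if x <= 0.0:
        return 0.0
    return std_normal_pdf(math.log(x)) / x

def missing_jacobian(x):
    """Same expression with the Jacobian factor dropped (not a valid density)."""
    return std_normal_pdf(math.log(x)) if x > 0.0 else 0.0

def integrate(fn, lo, hi, n=200_000):
    """Midpoint-rule integral of fn over [lo, hi]."""
    h = (hi - lo) / n
    return sum(fn(lo + (k + 0.5) * h) for k in range(n)) * h

total_mass = integrate(pushforward_pdf, 0.0, 200.0)
mass_without_jacobian = integrate(missing_jacobian, 0.0, 200.0)
print(f"with Jacobian factor:    {total_mass:.4f}")
print(f"without Jacobian factor: {mass_without_jacobian:.4f}")
```

With the Jacobian factor the total mass comes out close to 1. Without it the integral equals $\mathbb{E}[e^z] = e^{1/2}$, so the expression is no longer a probability density.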
The Log-Likelihood Objective
For a dataset $\{\mathbf{x}^{(i)}\}_{i=1}^N$, we maximize the total log-likelihood $\sum_{i=1}^N \log p_X(\mathbf{x}^{(i)})$, where each term is:
$$\log p_X(\mathbf{x}) = \log p_Z(f^{-1}(\mathbf{x})) + \log \left| \det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$$
$$= \log p_Z(\mathbf{z}) - \log \left| \det \frac{\partial f(\mathbf{z})}{\partial \mathbf{z}} \right|$$
where $\mathbf{z} = f^{-1}(\mathbf{x})$.
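As a minimal sketch of this objective, assume the simplest possible flow, $x = az + b$ with $a > 0$, and a small made-up dataset. The function names and the data values are illustrative assumptions only.

```python
import math

def log_std_normal(z):
    """log p_Z(z) for the base distribution N(0, 1)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def log_prob(x, a, b):
    """log p_X(x) for the flow x = a * z + b with a > 0."""
    z = (x - b) / a                          # z = f^{-1}(x)
    return log_std_normal(z) - math.log(a)   # log p_Z(z) - log|df/dz|

def mean_nll(data, a, b):
    """Average negative log-likelihood, the quantity a flow is trained to minimize."""
    return -sum(log_prob(x, a, b) for x in data) / len(data)

data = [1.2, 0.7, 2.9, 1.8, 2.4]             # made-up sample for illustration

# For this affine flow the maximum-likelihood parameters are available in
# closed form: the sample mean and the (biased) sample standard deviation.
b_hat = sum(data) / len(data)
a_hat = math.sqrt(sum((x - b_hat) ** 2 for x in data) / len(data))

print("NLL at a=1, b=0:      ", mean_nll(data, 1.0, 0.0))
print("NLL at fitted a and b:", mean_nll(data, a_hat, b_hat))
```

A real flow replaces the closed-form fit with gradient descent on the same quantity, but the two terms in `log_prob` are exactly the two terms of the formula above.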
Key requirements for practical flows:
1. $f$ must be invertible (bijective) with an efficiently computable inverse
2. The Jacobian determinant must be tractable: $O(d)$ rather than the $O(d^3)$ of a general matrix
3. The transformation should be expressive enough to model complex distributions
Autoregressive Flows
Autoregressive flows factor the transformation dimension by dimension:
$$x_i = f_i(z_i; \mathbf{z}_{<i}) \quad \text{or} \quad x_i = \tau(z_i; \mathbf{h}_i) \text{ where } \mathbf{h}_i = c_i(\mathbf{x}_{<i})$$
Because $x_i$ depends only on $z_i$ and previous $z_j$ ($j < i$), the Jacobian $\partial f / \partial \mathbf{z}$ is triangular:
$$\frac{\partial f}{\partial \mathbf{z}} = \begin{pmatrix} \frac{\partial f_1}{\partial z_1} & 0 & \cdots & 0 \\ \frac{\partial f_2}{\partial z_1} & \frac{\partial f_2}{\partial z_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_d}{\partial z_1} & \frac{\partial f_d}{\partial z_2} & \cdots & \frac{\partial f_d}{\partial z_d} \end{pmatrix}$$
The determinant of a triangular matrix is the product of diagonal entries:
$$\det \frac{\partial f}{\partial \mathbf{z}} = \prod_{i=1}^{d} \frac{\partial f_i}{\partial z_i}$$
⚠️ CRITICAL — Computational Efficiency: Computing $\prod_{i=1}^d \frac{\partial f_i}{\partial z_i}$ takes $O(d)$ time, compared to $O(d^3)$ for a dense Jacobian. This makes autoregressive flows practical for high-dimensional data.
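However, autoregressive flows have a drawback: one direction is inherently sequential. When $x_i$ is conditioned on $\mathbf{z}_{<i}$, inverting the flow (computing $\mathbf{z}$ from $\mathbf{x}$) needs $z_1, \dots, z_{i-1}$ before $z_i$ can be recovered, so the inverse takes $O(d)$ sequential steps, which can be slow. When the conditioner reads $\mathbf{x}_{<i}$ instead, the roles swap: the inverse parallelizes and sampling becomes sequential.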
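The asymmetry between the two directions is easy to see in code. The sketch below assumes an affine autoregressive transform whose scale and shift are conditioned on $\mathbf{z}_{<i}$; the `conditioner` function is a hypothetical stand-in for a learned network.

```python
import math

def conditioner(prefix):
    """Hypothetical stand-in for a network: maps z_{<i} to (log_scale, shift)."""
    total = sum(prefix)
    return 0.5 * math.tanh(total), 0.1 * total

def forward(z):
    """z -> x. Each x_i reads only entries of z, so iterations are independent."""
    x, log_det = [], 0.0
    for i, z_i in enumerate(z):
        log_scale, shift = conditioner(z[:i])
        x.append(math.exp(log_scale) * z_i + shift)
        log_det += log_scale      # log of the i-th diagonal Jacobian entry
    return x, log_det

def inverse(x):
    """x -> z. Recovering z_i needs z_{<i} first, so the steps must run in order."""
    z = []
    for x_i in x:
        log_scale, shift = conditioner(z)   # z currently holds z_{<i}
        z.append((x_i - shift) * math.exp(-log_scale))
    return z

z = [0.3, -1.2, 0.8, 2.0]
x, log_det = forward(z)
z_back = inverse(x)
print("x          =", x)
print("log|det J| =", log_det)
print("recovered z =", z_back)
```

In `forward`, the loop could run in parallel because every conditioner input is already known. In `inverse`, each iteration depends on the result of the previous one.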
Affine Coupling Layers (RealNVP)
Coupling layers split the input into two parts and transform one part conditioned on the other:
- Split $\mathbf{z}$ into $(\mathbf{z}_a, \mathbf{z}_b)$ (e.g., first half and second half)
- Compute scale $\mathbf{s}$ and translation $\mathbf{t}$ from $\mathbf{z}_a$ via neural networks: $[\mathbf{s}, \mathbf{t}] = \text{NN}(\mathbf{z}_a)$
- Transform $\mathbf{z}_b$ with an affine (and easily invertible) operation:
$$\mathbf{x}_a = \mathbf{z}_a \quad \text{(identity, kept unchanged)}$$
$$\mathbf{x}_b = \mathbf{s} \odot \mathbf{z}_b + \mathbf{t}$$
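Inverse (maps a data point back to the base space, as needed for density evaluation; $\mathbf{s}$ and $\mathbf{t}$ are recomputed from $\mathbf{x}_a = \mathbf{z}_a$):
$$\mathbf{z}_a = \mathbf{x}_a$$
$$\mathbf{z}_b = (\mathbf{x}_b - \mathbf{t}) \oslash \mathbf{s}$$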
where $\oslash$ is element-wise division. This requires $\mathbf{s} \neq 0$ — typically we use $\mathbf{s} = \exp(\text{NN}(\mathbf{z}_a))$ to ensure positivity.
Jacobian: Because $\mathbf{x}_a = \mathbf{z}_a$ does not depend on $\mathbf{z}_b$, the upper-right block is zero and the Jacobian is block lower-triangular. The off-diagonal block $\partial\mathbf{x}_b/\partial\mathbf{z}_a$ can be arbitrarily complex without affecting the determinant:
$$\frac{\partial \mathbf{x}}{\partial \mathbf{z}} = \begin{pmatrix} I & 0 \\ \frac{\partial\mathbf{x}_b}{\partial\mathbf{z}_a} & \text{diag}(\mathbf{s}) \end{pmatrix}$$
$$\log \left| \det \frac{\partial \mathbf{x}}{\partial \mathbf{z}} \right| = \sum_{i} \log |s_i|$$
Again, $O(d)$ computation!
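A minimal sketch of one affine coupling layer follows. The function `scale_and_shift` is a hypothetical stand-in for the neural network, written for a 4-dimensional input split into two halves; it returns the log-scale so that $\mathbf{s} = \exp(\cdot)$ is positive by construction.

```python
import math

def scale_and_shift(z_a):
    """Hypothetical stand-in for NN(z_a): returns (log_s, t), one entry per element of z_b."""
    h = sum(z_a)
    log_s = [math.tanh(h), -0.5 * math.tanh(h)]
    t = [h, 2.0 * h]
    return log_s, t

def coupling_forward(z):
    """z -> x, plus log|det J| = sum_i log s_i."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = scale_and_shift(z_a)
    x_b = [math.exp(ls) * zb + ti for ls, zb, ti in zip(log_s, z_b, t)]
    return z_a + x_b, sum(log_s)

def coupling_inverse(x):
    """x -> z. Because x_a equals z_a, s and t are recomputed exactly."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = scale_and_shift(x_a)
    z_b = [(xb - ti) * math.exp(-ls) for ls, xb, ti in zip(log_s, x_b, t)]
    return x_a + z_b

z = [0.5, -0.2, 1.0, 3.0]
x, log_det = coupling_forward(z)
z_back = coupling_inverse(x)
print("x          =", x)
print("log|det J| =", log_det)
print("recovered z =", z_back)
```

Neither direction ever inverts the network itself, which is why the conditioner can be an arbitrary function of $\mathbf{z}_a$.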
Invertible 1×1 Convolutions (Glow)
To allow all dimensions to interact (not just pairs), Glow introduces learnable invertible $1 \times 1$ convolutions between coupling layers. For a $c \times h \times w$ tensor, a $1 \times 1$ convolution with weight matrix $\mathbf{W} \in \mathbb{R}^{c \times c}$ applies the same linear transformation at every spatial location.
The Jacobian determinant for all spatial positions is:
$$\log \left| \det \frac{\partial f}{\partial \mathbf{z}} \right| = h \cdot w \cdot \log |\det \mathbf{W}|$$
$\mathbf{W}$ must be invertible. Glow parameterizes it via LU decomposition: $\mathbf{W} = \mathbf{P}\mathbf{L}\mathbf{U}$, where $\mathbf{P}$ is a fixed permutation matrix, $\mathbf{L}$ is lower triangular with ones on the diagonal, and $\mathbf{U}$ is upper triangular. Since $|\det \mathbf{P}| = 1$ and $\det \mathbf{L} = 1$, we get $\log|\det\mathbf{W}| = \sum_i \log |U_{ii}|$, which is $O(c)$.
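The LU shortcut can be checked against a direct determinant on a small case. The matrices and the spatial size below are arbitrary illustrative values, and `det3` is only a reference implementation for $c = 3$.

```python
import math

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def det3(m):
    """Cofactor expansion for a 3x3 matrix, used only as a reference."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

P = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]                        # fixed permutation
L = [[1.0, 0.0, 0.0], [0.4, 1.0, 0.0], [-0.3, 0.7, 1.0]]     # unit lower triangular
U = [[1.5, 0.2, -0.1], [0.0, -0.8, 0.6], [0.0, 0.0, 2.0]]    # upper triangular

W = matmul(P, matmul(L, U))
h, w = 4, 4                                                  # illustrative spatial size

log_det_lu = sum(math.log(abs(U[i][i])) for i in range(3))   # O(c) via the diagonal of U
log_det_ref = math.log(abs(det3(W)))                         # direct determinant of W
log_det_total = h * w * log_det_lu                           # same W at every position

print("from diag(U):", log_det_lu)
print("from det(W): ", log_det_ref)
print("total over h*w positions:", log_det_total)
```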
Multi-Scale Architecture
To handle high-dimensional data efficiently, flows often use a multi-scale architecture:
- After several flow layers, factor out half the dimensions
- Model the factored-out dimensions with a Gaussian
- Continue transforming the remaining dimensions
This reduces computation and allows the model to capture features at multiple scales — similar to wavelet decompositions.
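The three steps above can be sketched as a loop over levels. This is a toy under stated assumptions: `flow_step` is a hypothetical placeholder (a constant elementwise scaling in the data-to-latent direction) standing in for a stack of real flow layers, and the function names are not from any library.

```python
import math

def gaussian_log_prob(values):
    """Sum of standard-normal log-densities over a list of scalars."""
    return sum(-0.5 * v * v - 0.5 * math.log(2.0 * math.pi) for v in values)

def flow_step(h, level):
    """Hypothetical invertible step in the x -> z direction.

    Scales every element by a per-level constant and returns the
    transformed values together with log|det| of this step.
    """
    scale = 1.0 / (level + 2)
    return [scale * v for v in h], len(h) * math.log(scale)

def multiscale_log_prob(x, num_levels):
    """log p_X(x) for a multi-scale flow, plus the factored-out latents."""
    h, log_prob, latents = list(x), 0.0, []
    for level in range(num_levels):
        h, log_det = flow_step(h, level)
        log_prob += log_det
        if level < num_levels - 1:
            half = len(h) // 2
            z_out, h = h[:half], h[half:]   # factor out half, keep transforming the rest
        else:
            z_out, h = h, []                # last level: everything left goes to the Gaussian
        latents.append(z_out)
        log_prob += gaussian_log_prob(z_out)
    return log_prob, latents

x = [0.5, -1.0, 2.0, 0.1, -0.7, 1.3, 0.9, -0.4]
log_p, latents = multiscale_log_prob(x, num_levels=3)
print("latent chunk sizes:", [len(z) for z in latents])
print("log p_X(x) =", log_p)
```

Later levels operate on fewer dimensions, which is where the computational saving comes from; the total latent dimension still equals the data dimension.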
Flows as VAE Priors/Posteriors
Normalizing flows can enhance VAEs:
- Flow prior: $p(\mathbf{z}) = p_0(f^{-1}(\mathbf{z}))\,|\det J_{f^{-1}}(\mathbf{z})|$, where $p_0$ is a simple base density such as $\mathcal{N}(0, I)$. This gives a learnable, flexible prior in place of a fixed $\mathcal{N}(0,I)$
- Flow posterior: $q_\phi(\mathbf{z}|\mathbf{x})$ is a flow-transformed Gaussian — more expressive approximate posterior
This reduces the gap between the approximate and true posterior, tightening the ELBO.
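For the flow posterior, the same change-of-variables bookkeeping applies. As a sketch, assume the encoder outputs a Gaussian $q_0(\mathbf{z}_0|\mathbf{x})$ and the sample is pushed through $K$ flow layers, $\mathbf{z}_K = f_K \circ \cdots \circ f_1(\mathbf{z}_0)$ (the symbols $q_0$, $\mathbf{z}_0$, and $\mathbf{z}_K$ are notation introduced here for illustration). The posterior log-density is then:

$$\log q_K(\mathbf{z}_K|\mathbf{x}) = \log q_0(\mathbf{z}_0|\mathbf{x}) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right|$$

Substituting this into the ELBO gives:

$$\mathcal{L}(\mathbf{x}) = \mathbb{E}_{q_0(\mathbf{z}_0|\mathbf{x})}\left[ \log p(\mathbf{x}, \mathbf{z}_K) - \log q_0(\mathbf{z}_0|\mathbf{x}) + \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right| \right]$$

The expectation is still taken over the simple Gaussian $q_0$, so the reparameterization trick from 22-02 carries over unchanged.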
Pitfalls
| Pitfall | Why It Happens | Fix |
|---|---|---|
| Numerical instability | $\exp(\text{NN output})$ can overflow or underflow to zero | Use $\mathbf{s} = \sigma(\text{NN output})$ rescaled to a safe range |
| Poor expressiveness | Coupling layers only transform half the dimensions per layer | Alternate which half is transformed; use $1\times1$ convolutions |
| Slow inverse | Autoregressive flows require $O(d)$ sequential steps to invert | Use coupling layers for fast inversion |
| Volume-preserving flows | Only translations, no scaling → log-det = 0, limited expressiveness | Always include scaling (affine coupling) |
| NaN log-likelihood | Invalid transformations (non-invertible, zero determinant) | Ensure $\mathbf{s} > 0$ and $\mathbf{W}$ is nonsingular |
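The fix in the numerical-instability row can be sketched in two common ways: rescaling a sigmoid into a bounded interval, or squashing the log-scale with $\tanh$ before exponentiating. The bounds `s_min`, `s_max`, and `max_log_scale` below are assumed example values, not recommendations from the text.

```python
import math

def sigmoid(v):
    """Logistic function written so that math.exp never receives a large positive argument."""
    if v >= 0.0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def bounded_scale_sigmoid(raw, s_min=0.1, s_max=10.0):
    """Scale confined to [s_min, s_max] for any raw network output."""
    return s_min + (s_max - s_min) * sigmoid(raw)

def bounded_scale_tanh(raw, max_log_scale=2.0):
    """exp of a log-scale squashed into [-max_log_scale, max_log_scale]."""
    return math.exp(max_log_scale * math.tanh(raw))

for raw in (-1000.0, -5.0, 0.0, 5.0, 1000.0):
    print(raw, bounded_scale_sigmoid(raw), bounded_scale_tanh(raw))
```

By contrast, a plain `math.exp(1000.0)` raises `OverflowError` in Python, and `math.exp(-1000.0)` underflows to `0.0`, which would make the layer non-invertible.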
Key Terms
- Inverse
- Jacobian determinant
Worked Examples
Example 1: Scalar Flow
A base distribution $p_Z(z) = \mathcal{N}(0, 1)$ is transformed by $f(z) = az + b$ with $a > 0$. Find $p_X(x)$.
Solution:
$f^{-1}(x) = (x-b)/a$, and $\frac{\partial f^{-1}}{\partial x} = 1/a$.
$$p_X(x) = p_Z\left(\frac{x-b}{a}\right) \cdot \frac{1}{a} = \frac{1}{\sqrt{2\pi}} e^{-(x-b)^2/(2a^2)} \cdot \frac{1}{a}$$
$$= \frac{1}{\sqrt{2\pi a^2}} e^{-(x-b)^2/(2a^2)} = \mathcal{N}(x; b, a^2)$$
A linear transformation of a Gaussian is Gaussian — the flow simply shifts and scales.
Answer: $p_X(x) = \mathcal{N}(x; b, a^2)$. Log-density: $\log p_X(x) = -\frac{1}{2}\log(2\pi a^2) - \frac{(x-b)^2}{2a^2}$. This demonstrates the simplest possible flow.
Example 2: 2D Affine Coupling Layer
Input $\mathbf{z} = (z_1, z_2) = (1, 3)$. A coupling layer splits into $\mathbf{z}_a = (z_1)$, $\mathbf{z}_b = (z_2)$. The network outputs $s = \exp(0.5) \approx 1.649$, $t = 2$. Compute $\mathbf{x}$ and $\log|\det J|$.
Solution:
$\mathbf{x}_a = z_1 = 1$
$\mathbf{x}_b = s \cdot z_2 + t = e^{0.5} \cdot 3 + 2 \approx 6.946$
Jacobian determinant: $\log|\det J| = \sum \log|s_i| = \log(1.649) = 0.5$
Inverse check: $\mathbf{z}_a = \mathbf{x}_a = 1$, $\mathbf{z}_b = (\mathbf{x}_b - t)/s = (6.946 - 2)/1.649 = 4.946/1.649 = 3.000$. ✓
Answer: $\mathbf{x} = (1, 6.946)$, $\log|\det J| = 0.5$. The forward pass is $O(d)$, the inverse is $O(d)$, and the log-determinant is $O(d)$: all linear in dimensionality.
Example 3: Planar Flow Log-Likelihood
A planar flow transforms $\mathbf{z} \in \mathbb{R}^2$ with $f(\mathbf{z}) = \mathbf{z} + \mathbf{u} \cdot \tanh(\mathbf{w}^T\mathbf{z} + b)$ where $\mathbf{u}, \mathbf{w} \in \mathbb{R}^2$, $b \in \mathbb{R}$. Given $\mathbf{u} = (1, 0)$, $\mathbf{w} = (1, 2)$, $b = 0$, and $\mathbf{z} = (0.5, -0.5)$, compute the log-determinant of the Jacobian.
Solution:
$$\frac{\partial f}{\partial \mathbf{z}} = I + \mathbf{u} \cdot \nabla_{\mathbf{z}}[\tanh(\mathbf{w}^T\mathbf{z} + b)]^T = I + \mathbf{u} \cdot \text{sech}^2(\mathbf{w}^T\mathbf{z} + b) \cdot \mathbf{w}^T$$
$\mathbf{w}^T\mathbf{z} + b = 1(0.5) + 2(-0.5) = 0.5 - 1.0 = -0.5$
$\text{sech}^2(-0.5) = (1/\cosh(-0.5))^2 \approx (1/1.1276)^2 = 0.7864$
$$J = I + \begin{pmatrix} 1 \\ 0 \end{pmatrix} \cdot 0.7864 \cdot (1, 2) = \begin{pmatrix} 1 + 0.7864 & 1.5728 \\ 0 & 1 \end{pmatrix}$$
$\det J = (1.7864)(1) - (1.5728)(0) = 1.7864$
$\log|\det J| = \log(1.7864) \approx 0.580$
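The arithmetic above can be cross-checked in a few lines. This sketch computes the log-determinant twice: once from the explicit $2 \times 2$ Jacobian, and once from the rank-one identity $\det(I + \mathbf{u}\mathbf{v}^T) = 1 + \mathbf{v}^T\mathbf{u}$ with $\mathbf{v} = \text{sech}^2(\mathbf{w}^T\mathbf{z} + b)\,\mathbf{w}$. The function names are illustrative.

```python
import math

def sech2(a):
    """sech^2(a) = 1 / cosh^2(a), the derivative of tanh."""
    return 1.0 / math.cosh(a) ** 2

def planar_jacobian_2d(z, u, w, b):
    """Explicit Jacobian I + u * sech^2(w.z + b) * w^T for 2-D inputs."""
    pre = w[0] * z[0] + w[1] * z[1] + b
    g = sech2(pre)
    return [[(1.0 if i == j else 0.0) + u[i] * g * w[j] for j in range(2)]
            for i in range(2)]

def planar_log_det(z, u, w, b):
    """log|det J| via det(I + u v^T) = 1 + v.u, valid in any dimension."""
    pre = sum(wi * zi for wi, zi in zip(w, z)) + b
    return math.log(abs(1.0 + sech2(pre) * sum(wi * ui for wi, ui in zip(w, u))))

z, u, w, b = (0.5, -0.5), (1.0, 0.0), (1.0, 2.0), 0.0

J = planar_jacobian_2d(z, u, w, b)
log_det_explicit = math.log(abs(J[0][0] * J[1][1] - J[0][1] * J[1][0]))
log_det_lemma = planar_log_det(z, u, w, b)

print("explicit 2x2 determinant:", log_det_explicit)
print("rank-one identity:       ", log_det_lemma)
```

The rank-one form costs $O(d)$ and never builds the $d \times d$ matrix, which is what makes planar flows usable in higher dimensions.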
Answer: $\log|\det J| \approx 0.580$. The matrix determinant lemma gives a general formula for planar flows: $\det(I + \mathbf{u}\mathbf{v}^T) = 1 + \mathbf{v}^T\mathbf{u}$. Here, with $\mathbf{v} = 0.7864\,\mathbf{w}$: $1 + 0.7864 \cdot (1,2) \cdot (1,0)^T = 1 + 0.7864 = 1.7864$. ✓
Practice Problems
- Prove that the composition of two invertible transformations $f = f_2 \circ f_1$ has Jacobian determinant $\det J_f = \det J_{f_2} \cdot \det J_{f_1}$.
  Answer: By the chain rule: $J_f = J_{f_2}(f_1(\mathbf{z})) \cdot J_{f_1}(\mathbf{z})$. By multiplicativity of determinants: $\det J_f = \det J_{f_2} \cdot \det J_{f_1}$. The log-determinant is additive: $\log|\det J_f| = \log|\det J_{f_2}| + \log|\det J_{f_1}|$. This is why composing many simple flows (each with tractable Jacobian) yields a complex overall transformation with a tractable total log-determinant.
- For a RealNVP coupling layer, explain why $\mathbf{s}$ must be nonzero. What happens if $s_i = 0$ for some $i$?
  Answer: If $s_i = 0$, the transformation $x_i = 0 \cdot z_i + t_i = t_i$ is constant and therefore not invertible (many $z_i$ map to the same $x_i$). The inverse requires division by $s_i$, which would be undefined. Using $\mathbf{s} = \exp(\text{NN output})$ guarantees $\mathbf{s} > 0$ always.
- A flow transforms $\mathbf{z} \sim \mathcal{N}(0, I_2)$ by applying rotation matrix $R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$. What is $\log|\det J|$? Why is rotation alone insufficient?
  Answer: $\det R(\theta) = \cos^2\theta + \sin^2\theta = 1$, so $\log|\det J| = 0$. Rotation is volume-preserving, and rotating a spherically symmetric Gaussian returns the same Gaussian. Rotation alone therefore cannot model any other distribution; we need nonlinear, input-dependent transformations (such as affine coupling layers with learned scaling) to change the shape.
- For a Glow-style $1 \times 1$ convolution with 3 channels, $h=w=32$, the LU-decomposed weight matrix has $U_{11}=2, U_{22}=0.5, U_{33}=0.8$. Compute the total log-determinant contribution.
  Answer: $\log|\det \mathbf{W}| = \log(2) + \log(0.5) + \log(0.8) = 0.6931 - 0.6931 - 0.2231 = -0.2231$. Total for all spatial positions: $h \cdot w \cdot \log|\det \mathbf{W}| = 32 \cdot 32 \cdot (-0.2231) = 1024 \cdot (-0.2231) \approx -228.5$.
- Explain why normalizing flows can compute exact likelihoods while VAEs and GANs cannot.
  Answer: Flows are bijective: $\mathbf{x} = f(\mathbf{z})$ with known inverse. The density is given exactly by the change-of-variables formula: $p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x}))|\det J_{f^{-1}}|$. VAEs optimize a lower bound (the ELBO) because the true likelihood involves an intractable integral over the latent variable. GANs are implicit models: they can sample but do not provide a density function. Of these three, only flows give exact, tractable likelihoods.
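The chain-rule argument from the first practice problem can be checked numerically. The sketch below composes two scalar maps chosen here for illustration ($f_1(z) = 2z + 1$ and $f_2(y) = y^3 + y$) and compares the sum of per-layer log-derivatives with a finite-difference estimate for the composed map.

```python
import math

def f1(z):
    return 2.0 * z + 1.0

def df1(z):
    return 2.0

def f2(y):
    return y ** 3 + y             # strictly increasing, hence invertible on R

def df2(y):
    return 3.0 * y ** 2 + 1.0

def log_det_composed(z):
    """log|d(f2 o f1)/dz| as a sum of per-layer terms (chain rule)."""
    y = f1(z)
    return math.log(abs(df1(z))) + math.log(abs(df2(y)))

def log_det_numeric(z, eps=1e-6):
    """Central finite difference of the composed map, for comparison."""
    def g(v):
        return f2(f1(v))
    return math.log(abs((g(z + eps) - g(z - eps)) / (2.0 * eps)))

z0 = 0.4
print("sum of per-layer terms:", log_det_composed(z0))
print("finite difference:     ", log_det_numeric(z0))
```

Note that the second layer's derivative is evaluated at $f_1(z)$, not at $z$. Forgetting this is a common bug when accumulating log-determinants across layers.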
Summary
Key takeaways:
- Flows transform a simple distribution into a complex one via invertible $f$, using $p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x}))|\det J_{f^{-1}}|$
- The Jacobian determinant must be tractable — triangular structure (autoregressive, coupling) gives $O(d)$ computation
- Affine coupling layers are fast and invertible: $\mathbf{x}_b = \mathbf{s} \odot \mathbf{z}_b + \mathbf{t}$, $\mathbf{z}_b = (\mathbf{x}_b - \mathbf{t})/\mathbf{s}$
- Invertible $1 \times 1$ convolutions (Glow) mix channels while keeping Jacobian computation $O(c)$ via LU decomposition
- Flows provide exact likelihoods — unlike VAEs (lower bound) and GANs (implicit)
- The trade-off: flows require $d_{\text{latent}} = d_{\text{data}}$ (no dimensionality reduction)
Quiz
- The change-of-variables formula for $\mathbf{x} = f(\mathbf{z})$ gives density $p_X(\mathbf{x})$ equal to:
  - A) $p_Z(\mathbf{z})$
  - B) $p_Z(f^{-1}(\mathbf{x})) \cdot |\det J_f|$
  - C) $p_Z(f^{-1}(\mathbf{x})) \cdot |\det J_{f^{-1}}|$
  - D) $p_Z(\mathbf{x}) / |\det J_f|$
  Correct: C)
  - If you chose C: $p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x}))|\det \partial f^{-1}/\partial\mathbf{x}|$. Equivalently $p_Z(\mathbf{z})/|\det J_f|$.
  - If you chose B: Missing the inverse. The determinant is that of $f^{-1}$, not $f$.
  - If you chose A: Ignores the change in volume. This is only correct if $f$ is volume-preserving ($|\det J|=1$).
- Why is the Jacobian determinant of an autoregressive flow easy to compute?
  - A) The Jacobian is the identity matrix
  - B) The Jacobian is triangular, so the determinant is the product of diagonal entries
  - C) The Jacobian is always orthogonal
  - D) The Jacobian is sparse with only one non-zero per row
  Correct: B)
  - If you chose B: In autoregressive flows, $x_i = f_i(z_i; \mathbf{z}_{<i})$, so $\partial f_i/\partial z_j = 0$ for $j > i$ → triangular Jacobian → $\det = \prod_i \partial f_i/\partial z_i$ in $O(d)$.
  - If you chose A: Only true for the identity transformation.
  - If you chose C: Only for specific flows (e.g., pure rotations). Autoregressive Jacobians are triangular, not orthogonal in general.
  - If you chose D: Not necessarily. Off-diagonal entries below the diagonal can be non-zero; they just don't enter the determinant.
- In a RealNVP affine coupling layer, the output of the scaling network should always be:
  - A) Any real number
  - B) Non-zero (typically positive via exp)
  - C) Between 0 and 1
  - D) Exactly 1
  Correct: B)
  - If you chose B: $\mathbf{s} = \exp(\text{NN})$ ensures $\mathbf{s} > 0$, which is necessary for invertibility (can't divide by zero) and prevents sign flips.
  - If you chose A: $s_i = 0$ makes the transformation non-invertible.
  - If you chose C: Restricting to (0,1) limits expressiveness; $\exp$ allows any positive value.
  - If you chose D: Fixing $\mathbf{s} = 1$ reduces the layer to NICE-style additive coupling, which is volume-preserving and less expressive.
- What distinguishes normalizing flows from VAEs in terms of likelihood evaluation?
  - A) Flows can only evaluate likelihood, not sample
  - B) Flows provide exact likelihoods; VAEs provide a lower bound
  - C) VAEs provide exact likelihoods; flows provide a lower bound
  - D) Both provide exact likelihoods
  Correct: B)
  - If you chose B: Flows are bijective, so $p_X$ is exact via change-of-variables. VAEs optimize the ELBO, which lower-bounds $\log p(\mathbf{x})$.
  - If you chose A: Flows can sample by forward-transforming $\mathbf{z} \sim p_Z$.
  - If you chose C: Reversed. VAEs use a bound, flows are exact.
  - If you chose D: VAEs don't. Their marginal likelihood is intractable.
- A normalizing flow must have latent dimensionality:
  - A) Less than the data dimensionality
  - B) Greater than the data dimensionality
  - C) Equal to the data dimensionality
  - D) Any dimensionality
  Correct: C)
  - If you chose C: The transformation $f$ must be bijective between spaces of the same dimension. Flows can't do dimensionality reduction (unlike autoencoders).
  - If you chose D: This is true for VAEs, but flows require $d_z = d_x$ for bijectivity. Dimension-changing transformations are not invertible.
Next Steps
22-05 — Score-Based Generative Models — where we learn the gradient of the log-density (the score function) instead of the density itself, enabling generation via Langevin dynamics.
Pitfalls
- Expecting a single coupling layer to mix all dimensions: A single affine coupling layer transforms only half the dimensions (identity on $\mathbf{z}_a$, affine on $\mathbf{z}_b$). If you use only one coupling layer, half the input remains unchanged, so the flow cannot transform all coordinates. Always stack multiple coupling layers with alternating splits (or use $1 \times 1$ convolutions) to ensure every dimension gets transformed.
- Using volume-preserving transformations exclusively: Transformations with $|\det J| = 1$ (e.g., pure rotations, permutations, NICE-style additive coupling) satisfy $p_X(f(\mathbf{z})) = p_Z(\mathbf{z})$, so density values are only transported, never rescaled. Linear volume-preserving maps such as rotations and permutations leave a spherical Gaussian base unchanged. Nonlinear ones such as additive coupling can bend the density into non-Gaussian shapes, but they cannot concentrate or dilute probability mass, which limits expressiveness. Include scaling (multiplicative/affine transformations) so the model can also stretch and compress density.
- Neglecting numerical stability of the scale: Using $\mathbf{s} = \exp(\text{NN output})$ ensures positivity but can cause numerical overflow if the network outputs large values. Always clamp or use numerically stable alternatives: $\mathbf{s} = \sigma(\text{NN output}) \cdot (\text{max} - \text{min}) + \text{min}$ with reasonable bounds, or apply $\tanh$ before exponentiation to limit the range of the log-scale.
- Using flows where dimensionality reduction is needed: Normalizing flows require $d_z = d_x$ because the transformation must be bijective. If your goal is dimensionality reduction (e.g., compressing images to a low-dimensional manifold), flows are the wrong tool; use autoencoders or VAEs instead. Flows are best for density estimation where exact likelihood matters.
Q7: Why do normalizing flows typically use $\mathcal{N}(0, I)$ as the base distribution?
- A) It's the only distribution for which the change-of-variables formula is valid.
- B) It's easy to sample from and evaluate, and has full support over $\mathbb{R}^d$, ensuring the flow can (in principle) model any distribution.
- C) The flow architecture requires a Gaussian base.
- D) It minimizes the number of flow layers needed.