
22-04 — Normalizing Flows

Phase: 22 — Generative Models Mathematics Subject: 22-04 Prerequisites: 22-02 — VAEs, Phase 08 (Linear Algebra), Phase 06 (Multivariable Calculus) Next subject: 22-05 — Score-Based Generative Models


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the change-of-variables formula for probability densities under invertible transformations
  2. Compute log-determinants for structured Jacobians (autoregressive, coupling layers)
  3. Explain how RealNVP affine coupling layers achieve both invertibility and efficient Jacobian computation
  4. Analyze the trade-off between expressiveness and computational cost in flow architectures
  5. Connect normalizing flows to VAEs as improved prior/posterior distributions

Core Content

The Change of Variables Formula

Normalizing flows model a complex distribution $p_X(\mathbf{x})$ by transforming a simple base distribution $p_Z(\mathbf{z})$ (typically $\mathcal{N}(\mathbf{0}, I)$) through a sequence of invertible transformations $f = f_K \circ f_{K-1} \circ \cdots \circ f_1$:

$$\mathbf{x} = f(\mathbf{z}), \quad \mathbf{z} \sim p_Z$$

⚠️ CRITICAL — Change of Variables: For an invertible, differentiable transformation $f: \mathbb{R}^d \to \mathbb{R}^d$:

$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \left| \det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$$

Or equivalently, in terms of the forward transformation:

$$p_X(\mathbf{x}) = p_Z(\mathbf{z}) \left| \det \frac{\partial f(\mathbf{z})}{\partial \mathbf{z}} \right|^{-1}$$

where $\mathbf{z} = f^{-1}(\mathbf{x})$.

Intuition: The determinant of the Jacobian measures how much the transformation $f$ expands or contracts volume locally. If $f$ expands a region by factor 2, the density must be halved to keep total probability mass at 1.
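As a numerical sanity check of the formula, the sketch below pushes a standard normal through the bijection $f(z) = e^z$ and integrates the resulting density. The choice of $f$ and the helper names are assumptions made for illustration; they do not come from the text above.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def pushforward_pdf(x):
    """Density of x = exp(z) with z ~ N(0, 1), via change of variables."""
    z = math.log(x)          # z = f^{-1}(x)
    inv_jacobian = 1.0 / x   # |d f^{-1}(x) / dx|
    return std_normal_pdf(z) * inv_jacobian

def integrate(fn, lo, hi, n=200_000):
    """Midpoint rule on [lo, hi] with n equal cells."""
    h = (hi - lo) / n
    return h * sum(fn(lo + (k + 0.5) * h) for k in range(n))

# The transformed density should still carry total probability mass 1.
total_mass = integrate(pushforward_pdf, 1e-9, 200.0)
print(round(total_mass, 4))
```

Dropping the `inv_jacobian` factor makes the integral drift away from 1, which is the volume correction the intuition describes.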

The Log-Likelihood Objective

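For a dataset $\{\mathbf{x}^{(i)}\}_{i=1}^N$, we maximize the sum of per-example log-likelihoods $\sum_{i=1}^N \log p_X(\mathbf{x}^{(i)})$, where each term is: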

$$\log p_X(\mathbf{x}) = \log p_Z(f^{-1}(\mathbf{x})) + \log \left| \det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$$

$$= \log p_Z(\mathbf{z}) - \log \left| \det \frac{\partial f(\mathbf{z})}{\partial \mathbf{z}} \right|$$

where $\mathbf{z} = f^{-1}(\mathbf{x})$.
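A minimal sketch of this objective for a one-dimensional affine flow $x = az + b$, averaged over a dataset. The helper names and the toy data are made up for the example.

```python
import math

def log_std_normal(z):
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def log_prob(x, a, b):
    """log p_X(x) for the flow x = a*z + b with a != 0."""
    z = (x - b) / a                      # z = f^{-1}(x)
    log_det_forward = math.log(abs(a))   # log |df/dz|
    return log_std_normal(z) - log_det_forward

def mean_log_likelihood(data, a, b):
    return sum(log_prob(x, a, b) for x in data) / len(data)

data = [2.1, 2.9, 3.4, 1.8, 3.0, 2.6]    # toy values, invented for illustration
mean = sum(data) / len(data)
std = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

ll_fitted = mean_log_likelihood(data, a=std, b=mean)   # maximum-likelihood fit
ll_identity = mean_log_likelihood(data, a=1.0, b=0.0)  # untrained identity flow
print(ll_fitted > ll_identity)
```

For this flow family the maximizer is available in closed form (the sample mean and standard deviation); deeper flows are trained by gradient ascent on the same quantity.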

Key requirements for practical flows:

  1. $f$ must be invertible (bijective) with an efficiently computable inverse
  2. The Jacobian determinant must be tractable: $O(d)$ rather than the $O(d^3)$ of a general matrix
  3. The transformation should be expressive enough to model complex distributions

Autoregressive Flows

Autoregressive flows factor the transformation dimension by dimension:

$$x_i = f_i(z_i; \mathbf{z}_{<i}) \quad \text{or} \quad x_i = \tau(z_i; \mathbf{h}_i) \text{ where } \mathbf{h}_i = c_i(\mathbf{x}_{<i})$$

Because $x_i$ depends only on $z_i$ and previous $z_j$ ($j < i$), the Jacobian $\partial f / \partial \mathbf{z}$ is triangular:

$$\frac{\partial f}{\partial \mathbf{z}} = \begin{pmatrix} \frac{\partial f_1}{\partial z_1} & 0 & \cdots & 0 \\ \frac{\partial f_2}{\partial z_1} & \frac{\partial f_2}{\partial z_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_d}{\partial z_1} & \frac{\partial f_d}{\partial z_2} & \cdots & \frac{\partial f_d}{\partial z_d} \end{pmatrix}$$

The determinant of a triangular matrix is the product of diagonal entries:

$$\det \frac{\partial f}{\partial \mathbf{z}} = \prod_{i=1}^{d} \frac{\partial f_i}{\partial z_i}$$

⚠️ CRITICAL — Computational Efficiency: Computing $\prod_{i=1}^d \frac{\partial f_i}{\partial z_i}$ takes $O(d)$ time, compared to $O(d^3)$ for a dense Jacobian. This makes autoregressive flows practical for high-dimensional data.

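However, autoregressive flows have a drawback: one direction is inherently sequential. When the conditioner reads $\mathbf{z}_{<i}$, inverting the flow (computing $\mathbf{z}$ from $\mathbf{x}$) takes $O(d)$ sequential steps, because each $z_i$ can be recovered only after $\mathbf{z}_{<i}$ is known. When the conditioner reads $\mathbf{x}_{<i}$ instead, the forward (sampling) direction is the sequential one. Either way, one of the two passes cannot be parallelized across dimensions, which can be slow for large $d$.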
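The asymmetry between the two directions can be seen in a small sketch of an affine autoregressive flow whose conditioner reads $\mathbf{z}_{<i}$. The `conditioner` function is a hypothetical stand-in for a neural network, chosen only so the example runs.

```python
import math

def conditioner(prev):
    """Hypothetical stand-in for a network: maps z_{<i} to (log_scale, shift)."""
    total = sum(prev)
    return 0.5 * math.tanh(total), 0.1 * total

def forward(z):
    """x_i = exp(log_s_i) * z_i + t_i, with (log_s_i, t_i) computed from z_{<i}."""
    x, log_det = [], 0.0
    for i, z_i in enumerate(z):
        log_s, t = conditioner(z[:i])   # all of z is known, so no step waits on another
        x.append(math.exp(log_s) * z_i + t)
        log_det += log_s                # log of the i-th diagonal Jacobian entry
    return x, log_det

def inverse(x):
    """Recover z one coordinate at a time: z_i needs z_{<i} first."""
    z = []
    for x_i in x:
        log_s, t = conditioner(z)       # depends on the values recovered so far
        z.append((x_i - t) * math.exp(-log_s))
    return z

z = [0.3, -1.2, 0.8, 0.1]
x, log_det = forward(z)
z_back = inverse(x)
print(max(abs(a - b) for a, b in zip(z, z_back)))
```

The loop in `forward` is written sequentially for readability, but every iteration only reads the input `z`, so it could run in parallel. The loop in `inverse` reads its own earlier outputs and cannot.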

Affine Coupling Layers (RealNVP)

Coupling layers split the input into two parts and transform one part conditioned on the other:

  1. Split $\mathbf{z}$ into $(\mathbf{z}_a, \mathbf{z}_b)$ (e.g., first half and second half)
  2. Compute scale $\mathbf{s}$ and translation $\mathbf{t}$ from $\mathbf{z}_a$ via neural networks: $[\mathbf{s}, \mathbf{t}] = \text{NN}(\mathbf{z}_a)$
  3. Transform $\mathbf{z}_b$ with an affine (and easily invertible) operation:

$$\mathbf{x}_a = \mathbf{z}_a \quad \text{(identity — kept unchanged)}$$ $$\mathbf{x}_b = \mathbf{s} \odot \mathbf{z}_b + \mathbf{t}$$

Inverse (used for density evaluation; $\mathbf{s}$ and $\mathbf{t}$ are recomputed from $\mathbf{x}_a = \mathbf{z}_a$): $$\mathbf{z}_a = \mathbf{x}_a$$ $$\mathbf{z}_b = (\mathbf{x}_b - \mathbf{t}) \oslash \mathbf{s}$$

where $\oslash$ is element-wise division. This requires $\mathbf{s} \neq 0$ — typically we use $\mathbf{s} = \exp(\text{NN}(\mathbf{z}_a))$ to ensure positivity.

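Jacobian: Because $\mathbf{x}_a = \mathbf{z}_a$ does not depend on $\mathbf{z}_b$, the upper-right block of the Jacobian is zero and the matrix is block-triangular. The off-diagonal block $\partial\mathbf{x}_b/\partial\mathbf{z}_a$ can be arbitrarily complicated, but it does not enter the determinant: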

$$\frac{\partial \mathbf{x}}{\partial \mathbf{z}} = \begin{pmatrix} I & 0 \\ \frac{\partial\mathbf{x}_b}{\partial\mathbf{z}_a} & \text{diag}(\mathbf{s}) \end{pmatrix}$$

$$\log \left| \det \frac{\partial \mathbf{x}}{\partial \mathbf{z}} \right| = \sum_{i} \log |s_i|$$

Again, $O(d)$ computation!
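A minimal sketch of an affine coupling layer, assuming an even number of dimensions split into two equal halves. The `scale_shift_net` function is a hypothetical stand-in for $\text{NN}(\mathbf{z}_a)$; it returns the log-scale so that $\mathbf{s} = \exp(\cdot)$ stays positive.

```python
import math

def scale_shift_net(z_a):
    """Hypothetical stand-in for NN(z_a): one (log_s, t) pair per element of z_b."""
    log_s = [0.5 * a for a in z_a]
    t = [2.0 * a for a in z_a]
    return log_s, t

def coupling_forward(z):
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = scale_shift_net(z_a)
    x_b = [math.exp(ls) * zb + tt for ls, zb, tt in zip(log_s, z_b, t)]
    return z_a + x_b, sum(log_s)        # log|det J| = sum_i log s_i

def coupling_inverse(x):
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = scale_shift_net(x_a)     # x_a equals z_a, so s and t are the same
    z_b = [(xb - tt) * math.exp(-ls) for ls, xb, tt in zip(log_s, x_b, t)]
    return x_a + z_b

x, log_det = coupling_forward([1.0, 3.0])
z_back = coupling_inverse(x)
print(x, log_det, z_back)
```

Neither direction ever inverts the network itself, which is why it can be an arbitrary (non-invertible) function.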

Invertible 1×1 Convolutions (Glow)

To let all channels mix between coupling layers (rather than relying on a fixed split or permutation), Glow introduces learnable invertible $1 \times 1$ convolutions. For a $c \times h \times w$ tensor, a $1 \times 1$ convolution with weight matrix $\mathbf{W} \in \mathbb{R}^{c \times c}$ applies the same linear transformation at every spatial location.

The Jacobian determinant for all spatial positions is:

$$\log \left| \det \frac{\partial f}{\partial \mathbf{z}} \right| = h \cdot w \cdot \log |\det \mathbf{W}|$$

$\mathbf{W}$ must be invertible. Glow parameterizes it via an LU decomposition, $\mathbf{W} = \mathbf{P}\mathbf{L}\mathbf{U}$, where $\mathbf{P}$ is a fixed permutation matrix, $\mathbf{L}$ is lower triangular with ones on the diagonal, and $\mathbf{U}$ is upper triangular. Since $|\det\mathbf{P}| = 1$ and $\det\mathbf{L} = 1$, $\log|\det\mathbf{W}| = \sum_i \log |U_{ii}|$, which costs $O(c)$.
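The shortcut can be checked numerically on a small case. The sketch below builds a $3 \times 3$ matrix $\mathbf{W} = \mathbf{P}\mathbf{L}\mathbf{U}$ from illustrative factors (all values are assumptions chosen for the example) and compares the diagonal-sum formula with a direct determinant.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

P = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]                # fixed permutation
L = [[1, 0, 0], [0.3, 1, 0], [-0.2, 0.5, 1]]         # unit lower triangular
U = [[2.0, 0.4, -0.1], [0, 0.5, 0.7], [0, 0, 0.8]]   # upper triangular

W = matmul(P, matmul(L, U))

log_det_lu = sum(math.log(abs(U[k][k])) for k in range(3))  # O(c) shortcut
log_det_direct = math.log(abs(det3(W)))                     # dense determinant

height, width = 32, 32
total_log_det = height * width * log_det_lu   # same W at every spatial position
print(log_det_lu, log_det_direct, total_log_det)
```

During training only the entries of `L` and `U` would be updated, so the determinant never has to be recomputed from the dense matrix.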

Multi-Scale Architecture

To handle high-dimensional data efficiently, flows often use a multi-scale architecture:

  1. After several flow layers, factor out half the dimensions
  2. Model the factored-out dimensions with a Gaussian
  3. Continue transforming the remaining dimensions

This reduces computation and allows the model to capture features at multiple scales — similar to wavelet decompositions.
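The three steps above can be sketched in the data-to-latent direction used for likelihood evaluation, where log-determinants of the normalizing map are added to the base log-density. The `flow_step` function is a hypothetical stand-in (a plain elementwise scaling) for a real stack of flow layers.

```python
import math

def log_std_normal(values):
    return sum(-0.5 * v * v - 0.5 * math.log(2.0 * math.pi) for v in values)

def flow_step(h, log_scale):
    """Hypothetical stand-in for several flow layers: elementwise scaling."""
    out = [math.exp(log_scale) * v for v in h]
    return out, log_scale * len(h)      # log|det| of a diagonal map

def multiscale_log_prob(x, log_scales):
    """log p(x) when half of the remaining dimensions are factored out per level."""
    h, log_prob = list(x), 0.0
    for log_scale in log_scales:
        h, log_det = flow_step(h, log_scale)
        log_prob += log_det
        half = len(h) // 2
        h, factored = h[:half], h[half:]
        log_prob += log_std_normal(factored)  # factored-out dims go straight to the Gaussian
    return log_prob + log_std_normal(h)       # dims that passed through every level

x = [0.5, -1.0, 0.25, 2.0]
print(multiscale_log_prob(x, log_scales=[0.1, -0.3]))
```

Later levels operate on fewer dimensions, which is where the computational saving comes from.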

Flows as VAE Priors/Posteriors

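Normalizing flows can enhance VAEs by replacing the simple Gaussian used for the approximate posterior $q(\mathbf{z} \mid \mathbf{x})$, or for the prior $p(\mathbf{z})$, with a more flexible distribution obtained by passing Gaussian samples through a flow.

A more flexible posterior reduces the gap between the approximate and true posterior, tightening the ELBO.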
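A sketch of how this looks for a flow-based posterior, assuming the notation $q_0(\mathbf{z}_0 \mid \mathbf{x})$ for the encoder's Gaussian and $f_1, \dots, f_K$ for the flow layers (symbols introduced here for illustration). Applying the change-of-variables formula to $\mathbf{z}_K = f_K \circ \cdots \circ f_1(\mathbf{z}_0)$ gives

$$\log q_K(\mathbf{z}_K \mid \mathbf{x}) = \log q_0(\mathbf{z}_0 \mid \mathbf{x}) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right|$$

so the ELBO becomes

$$\mathcal{L} = \mathbb{E}_{q_0(\mathbf{z}_0 \mid \mathbf{x})}\left[ \log p(\mathbf{x}, \mathbf{z}_K) - \log q_0(\mathbf{z}_0 \mid \mathbf{x}) + \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right| \right]$$

The expectation is still taken over the simple Gaussian $q_0$, so the reparameterization trick applies unchanged.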

Pitfalls

| Pitfall | Why It Happens | Fix |
|---|---|---|
| Numerical instability | $\exp(\mathbf{s})$ can overflow or vanish | Use $\mathbf{s} = \sigma(\text{NN output})$ scaled to a safe range |
| Poor expressiveness | Coupling layers only transform half the dimensions per layer | Alternate which half is transformed; use $1\times1$ convolutions |
| Slow inverse | Autoregressive flows require $O(d)$ sequential steps to invert | Use coupling layers for fast inversion |
| Volume-preserving flows | Only translations, no scaling → log-det = 0, limited expressiveness | Always include scaling (affine coupling) |
| NaN log-likelihood | Invalid transformations (non-invertible, zero determinant) | Ensure $\mathbf{s} > 0$ and $\mathbf{W}$ is nonsingular |


Key Terms

Worked Examples

Example 1: Scalar Flow

A base distribution $p_Z(z) = \mathcal{N}(0, 1)$ is transformed by $f(z) = az + b$ with $a > 0$. Find $p_X(x)$.

Solution:

$f^{-1}(x) = (x-b)/a$, and $\frac{\partial f^{-1}}{\partial x} = 1/a$.

$$p_X(x) = p_Z\left(\frac{x-b}{a}\right) \cdot \frac{1}{a} = \frac{1}{\sqrt{2\pi}} e^{-(x-b)^2/(2a^2)} \cdot \frac{1}{a}$$

$$= \frac{1}{\sqrt{2\pi a^2}} e^{-(x-b)^2/(2a^2)} = \mathcal{N}(x; b, a^2)$$

A linear transformation of a Gaussian is Gaussian — the flow simply shifts and scales.

Answer: $p_X(x) = \mathcal{N}(x; b, a^2)$. Log-density: $\log p_X(x) = -\frac{1}{2}\log(2\pi a^2) - \frac{(x-b)^2}{2a^2}$. This demonstrates the simplest possible flow.

Example 2: 2D Affine Coupling Layer

Input $\mathbf{z} = (z_1, z_2) = (1, 3)$. A coupling layer splits into $\mathbf{z}_a = (z_1)$, $\mathbf{z}_b = (z_2)$. The network outputs $s = \exp(0.5) \approx 1.649$, $t = 2$. Compute $\mathbf{x}$ and $\log|\det J|$.

Solution:

$\mathbf{x}_a = z_1 = 1$ and $\mathbf{x}_b = s \cdot z_2 + t = 1.649 \cdot 3 + 2 = 6.946$

Log-determinant of the Jacobian: $\log|\det J| = \sum \log|s_i| = \log(1.649) = 0.5$

Inverse check: $\mathbf{z}_a = \mathbf{x}_a = 1$, $\mathbf{z}_b = (\mathbf{x}_b - t)/s = (6.946 - 2)/1.649 = 4.946/1.649 = 3.000$. ✓

Answer: $\mathbf{x} = (1, 6.946)$, $\log|\det J| = 0.5$. The forward pass, the inverse, and the log-determinant are each $O(d)$, all linear in dimensionality.

Example 3: Planar Flow Log-Likelihood

A planar flow transforms $\mathbf{z} \in \mathbb{R}^2$ with $f(\mathbf{z}) = \mathbf{z} + \mathbf{u} \cdot \tanh(\mathbf{w}^T\mathbf{z} + b)$ where $\mathbf{u}, \mathbf{w} \in \mathbb{R}^2$, $b \in \mathbb{R}$. Given $\mathbf{u} = (1, 0)$, $\mathbf{w} = (1, 2)$, $b = 0$, and $\mathbf{z} = (0.5, -0.5)$, compute the log-determinant of the Jacobian.

Solution:

$$\frac{\partial f}{\partial \mathbf{z}} = I + \mathbf{u} \cdot \nabla_{\mathbf{z}}[\tanh(\mathbf{w}^T\mathbf{z} + b)]^T = I + \mathbf{u} \cdot \text{sech}^2(\mathbf{w}^T\mathbf{z} + b) \cdot \mathbf{w}^T$$

$\mathbf{w}^T\mathbf{z} + b = 1(0.5) + 2(-0.5) = 0.5 - 1.0 = -0.5$

$\text{sech}^2(-0.5) = (1/\cosh(-0.5))^2 \approx (1/1.1276)^2 = 0.7864$

$$J = I + \begin{pmatrix} 1 \\ 0 \end{pmatrix} \cdot 0.7864 \cdot (1, 2) = \begin{pmatrix} 1 + 0.7864 & 1.5728 \\ 0 & 1 \end{pmatrix}$$

$\det J = (1.7864)(1) - (1.5728)(0) = 1.7864$

$\log|\det J| = \log(1.7864) \approx 0.580$

Answer: $\log|\det J| \approx 0.580$. The matrix determinant lemma gives a general formula for planar flows: $\det(I + \mathbf{u}\mathbf{v}^T) = 1 + \mathbf{v}^T\mathbf{u}$. Here $\mathbf{v} = 0.7864\,\mathbf{w}$, so $1 + 0.7864 \cdot (1,2) \cdot (1,0)^T = 1 + 0.7864 = 1.7864$. ✓
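The calculation in Example 3 can be reproduced with a short sketch that computes the log-determinant two ways: from the matrix determinant lemma and from the explicit $2 \times 2$ Jacobian. The function names are illustrative.

```python
import math

def planar_log_det(z, u, w, b):
    """log|det J| for f(z) = z + u * tanh(w.z + b), via the matrix determinant lemma."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    sech2 = 1.0 / math.cosh(a) ** 2
    return math.log(abs(1.0 + sech2 * sum(wi * ui for wi, ui in zip(w, u))))

def planar_log_det_explicit_2d(z, u, w, b):
    """Same quantity from the explicit 2x2 Jacobian I + sech^2(a) * u w^T."""
    a = w[0] * z[0] + w[1] * z[1] + b
    sech2 = 1.0 / math.cosh(a) ** 2
    j00 = 1.0 + sech2 * u[0] * w[0]
    j01 = sech2 * u[0] * w[1]
    j10 = sech2 * u[1] * w[0]
    j11 = 1.0 + sech2 * u[1] * w[1]
    return math.log(abs(j00 * j11 - j01 * j10))

z, u, w, b = (0.5, -0.5), (1.0, 0.0), (1.0, 2.0), 0.0
lemma_value = planar_log_det(z, u, w, b)
explicit_value = planar_log_det_explicit_2d(z, u, w, b)
print(lemma_value, explicit_value)
```

The lemma version needs only two dot products, so it stays $O(d)$ in any dimension, while the explicit determinant does not scale.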

Practice Problems

  1. Prove that the composition of two invertible transformations $f = f_2 \circ f_1$ has Jacobian determinant $\det J_f = \det J_{f_2} \cdot \det J_{f_1}$.

    Answer: By the chain rule: $J_f = J_{f_2}(f_1(\mathbf{z})) \cdot J_{f_1}(\mathbf{z})$. By multiplicativity of determinants: $\det J_f = \det J_{f_2} \cdot \det J_{f_1}$. The log-determinant is additive: $\log|\det J_f| = \log|\det J_{f_2}| + \log|\det J_{f_1}|$. This is why composing many simple flows (each with tractable Jacobian) yields a complex overall transformation with a tractable total log-determinant.

  2. For a RealNVP coupling layer, explain why $\mathbf{s}$ must be nonzero. What happens if $s_i = 0$ for some $i$?

    Answer: If $s_i = 0$, the transformation $x_i = 0 \cdot z_i + t_i = t_i$ is constant and therefore not invertible (many $z_i$ map to the same $x_i$). The inverse requires division by $s_i$, which would be undefined. Using $\mathbf{s} = \exp(\text{NN output})$ guarantees $\mathbf{s} > 0$ always.

  3. A flow transforms $\mathbf{z} \sim \mathcal{N}(0, I_2)$ by applying rotation matrix $R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$. What is $\log|\det J|$? Why is rotation alone insufficient?

    Answer: $\det R(\theta) = \cos^2\theta + \sin^2\theta = 1$, so $\log|\det J| = 0$. Rotation is volume-preserving, and because $\mathcal{N}(0, I_2)$ is spherically symmetric, the rotated distribution is again $\mathcal{N}(0, I_2)$. More generally, any linear map sends a Gaussian to a Gaussian, so rotation alone cannot model non-Gaussian distributions; the flow needs nonlinear transformations, such as coupling layers whose scale and shift depend on the input, to change the shape.

  4. For a Glow-style $1 \times 1$ convolution with 3 channels, $h=w=32$, the LU-decomposed weight matrix has $U_{11}=2, U_{22}=0.5, U_{33}=0.8$. Compute the total log-determinant contribution.

    Answer: $\log|\det \mathbf{W}| = \log(2) + \log(0.5) + \log(0.8) = 0.6931 - 0.6931 - 0.2231 = -0.2231$. Total for all spatial positions: $h \cdot w \cdot \log|\det \mathbf{W}| = 32 \cdot 32 \cdot (-0.2231) = 1024 \cdot (-0.2231) \approx -228.5$.

  5. Explain why normalizing flows can compute exact likelihoods while VAEs and GANs cannot.

    Answer: Flows are bijective: $\mathbf{x} = f(\mathbf{z})$ with known inverse. The density is given exactly by the change-of-variables formula: $p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x}))|\det J_{f^{-1}}|$. VAEs optimize a lower bound (the ELBO) because the true likelihood requires an intractable integral over the latent variables. GANs are implicit models: they can sample but do not provide a density function. Of the three, only flows give exact, tractable likelihoods.
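The additivity result from Problem 1 can be checked numerically. The sketch below composes three scalar flows, sums their log-derivatives, and compares the total with a finite-difference estimate of the composed map's derivative. The specific flows and the helper names are assumptions chosen for illustration.

```python
import math

def affine(a, b):
    """Scalar flow z -> a*z + b; returns (output, log|derivative|)."""
    return lambda z: (a * z + b, math.log(abs(a)))

def sinh_flow(z):
    """Scalar flow z -> sinh(z); the derivative cosh(z) is always positive."""
    return math.sinh(z), math.log(math.cosh(z))

def compose(*flows):
    def composed(z):
        total_log_det = 0.0
        for flow in flows:
            z, log_det = flow(z)         # each layer sees the previous layer's output
            total_log_det += log_det     # log-determinants add under composition
        return z, total_log_det
    return composed

flow = compose(affine(2.0, -1.0), sinh_flow, affine(0.5, 3.0))
z0 = 0.4
x, log_det = flow(z0)

# Independent check: central finite difference of the composed map.
eps = 1e-6
numeric_derivative = (flow(z0 + eps)[0] - flow(z0 - eps)[0]) / (2 * eps)
print(log_det, math.log(abs(numeric_derivative)))
```

Each layer's Jacobian is evaluated at that layer's own input, which matches the $J_{f_2}(f_1(\mathbf{z}))$ term in the chain rule.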


Summary

Key takeaways:


Quiz

  1. The change-of-variables formula for $\mathbf{x} = f(\mathbf{z})$ gives density $p_X(\mathbf{x})$ equal to:
     - A) $p_Z(\mathbf{z})$
     - B) $p_Z(f^{-1}(\mathbf{x})) \cdot |\det J_f|$
     - C) $p_Z(f^{-1}(\mathbf{x})) \cdot |\det J_{f^{-1}}|$
     - D) $p_Z(\mathbf{x}) / |\det J_f|$

     Correct: C)
     - If you chose C: $p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x}))|\det \partial f^{-1}/\partial\mathbf{x}|$. Equivalently $p_Z(\mathbf{z})/|\det J_f|$.
     - If you chose B: Missing the inverse. The determinant is that of $f^{-1}$, not $f$.
     - If you chose A: Ignores the change in volume. This is only correct if $f$ is volume-preserving ($|\det J|=1$).

  2. Why is the Jacobian determinant of an autoregressive flow easy to compute?
     - A) The Jacobian is the identity matrix
     - B) The Jacobian is triangular, so the determinant is the product of diagonal entries
     - C) The Jacobian is always orthogonal
     - D) The Jacobian is sparse with only one non-zero per row

     Correct: B)
     - If you chose B: In autoregressive flows, $x_i = f_i(z_i; \mathbf{z}_{<i})$, so $\partial x_i/\partial z_j = 0$ for $j > i$ → triangular Jacobian → $\det = \prod_i \partial f_i/\partial z_i$ in $O(d)$.
     - If you chose A: Only true for the identity transformation.
     - If you chose C: Only true for specific transformations such as rotations or permutations; an autoregressive Jacobian is not orthogonal in general.
     - If you chose D: Not necessarily. Off-diagonal entries can be non-zero; they just don't enter the determinant.

  3. In a RealNVP affine coupling layer, the output of the scaling network should always be:
     - A) Any real number
     - B) Non-zero (typically positive via exp)
     - C) Between 0 and 1
     - D) Exactly 1

     Correct: B)
     - If you chose B: $\mathbf{s} = \exp(\text{NN})$ ensures $\mathbf{s} > 0$, which guarantees invertibility (no division by zero) and prevents sign flips.
     - If you chose A: $s_i = 0$ makes the transformation non-invertible.
     - If you chose C: Restricting to (0,1) limits expressiveness; $\exp$ allows any positive value.
     - If you chose D: Fixing $\mathbf{s} = 1$ reduces the layer to NICE-style additive coupling, which is volume-preserving and less expressive.

  4. What distinguishes normalizing flows from VAEs in terms of likelihood evaluation?
     - A) Flows can only evaluate likelihood, not sample
     - B) Flows provide exact likelihoods; VAEs provide a lower bound
     - C) VAEs provide exact likelihoods; flows provide a lower bound
     - D) Both provide exact likelihoods

     Correct: B)
     - If you chose B: Flows are bijective, so $p_X$ is exact via change-of-variables. VAEs optimize the ELBO, which lower-bounds $\log p(\mathbf{x})$.
     - If you chose A: Flows can sample by forward-transforming $\mathbf{z} \sim p_Z$.
     - If you chose C: Reversed. VAEs use a bound, flows are exact.
     - If you chose D: VAEs don't; their marginal likelihood is intractable.

  5. A normalizing flow must have latent dimensionality:
     - A) Less than the data dimensionality
     - B) Greater than the data dimensionality
     - C) Equal to the data dimensionality
     - D) Any dimensionality

     Correct: C)
     - If you chose C: The transformation $f$ must be bijective between spaces of the same dimension. Flows can't do dimensionality reduction (unlike autoencoders).
     - If you chose D: This is true for VAEs, but flows require $d_z = d_x$ for bijectivity. Dimension-changing transformations are not invertible.

Next Steps

22-05 — Score-Based Generative Models — where we learn the gradient of the log-density (the score function) instead of the density itself, enabling generation via Langevin dynamics.


Pitfalls

  1. Expecting a single coupling layer to mix all dimensions: A single affine coupling layer transforms only half the dimensions (identity on $\mathbf{z}_a$, affine on $\mathbf{z}_b$). If you use only one coupling layer, half the input remains unchanged — the flow cannot transform all coordinates. Always stack multiple coupling layers with alternating splits (or use $1 \times 1$ convolutions) to ensure every dimension gets transformed.

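  2. Using volume-preserving transformations exclusively: Transformations with $|\det J| = 1$ (e.g., pure rotations, permutations, NICE-style additive coupling) move probability mass around without locally expanding or compressing it, so $p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x}))$ and the density values of the base are only rearranged. Linear volume-preserving maps such as rotations and permutations leave a spherical Gaussian unchanged. Additive coupling with a nonlinear shift can bend the Gaussian into non-Gaussian shapes, but it cannot concentrate or spread mass, which limits expressiveness. Include scaling (multiplicative/affine transformations) so the model can change local volume.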

  3. Neglecting numerical stability of $\exp(\mathbf{s})$: Using $\mathbf{s} = \exp(\text{NN output})$ ensures positivity but can cause numerical overflow if the network outputs large values. Always clamp or use numerically stable alternatives: $\mathbf{s} = \sigma(\text{NN output}) \cdot (\text{max} - \text{min}) + \text{min}$ with reasonable bounds, or apply $\tanh$ before exponentiation to limit the range of the log-scale.

  4. Using flows where dimensionality reduction is needed: Normalizing flows require $d_z = d_x$ because the transformation must be bijective. If your goal is dimensionality reduction (e.g., compressing images to a low-dimensional manifold), flows are the wrong tool — use autoencoders or VAEs instead. Flows are best for density estimation where exact likelihood matters.
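The two stabilizing options from pitfall 3 can be sketched as follows. The bounds (`limit=2.0`, `s_min=0.1`, `s_max=10.0`) are arbitrary illustrative defaults, not recommended values.

```python
import math

def bounded_log_scale(raw, limit=2.0):
    """Squash an unconstrained network output into [-limit, limit] before exp."""
    return limit * math.tanh(raw / limit)

def sigmoid_scale(raw, s_min=0.1, s_max=10.0):
    """Scale between s_min and s_max via a sigmoid of the raw output."""
    sigmoid = 0.5 * (1.0 + math.tanh(0.5 * raw))  # sigmoid(raw) written via tanh, so no overflow
    return s_min + (s_max - s_min) * sigmoid

raw_outputs = [-1000.0, -3.0, 0.0, 3.0, 1000.0]   # includes extreme values on purpose
tanh_scales = [math.exp(bounded_log_scale(r)) for r in raw_outputs]
sigmoid_scales = [sigmoid_scale(r) for r in raw_outputs]
print(tanh_scales)
print(sigmoid_scales)
```

Calling `math.exp(1000.0)` directly raises `OverflowError` in Python; both variants above stay finite and strictly positive for the same input, so the coupling layer remains invertible.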




Q7: Why do normalizing flows typically use $\mathcal{N}(0, I)$ as the base distribution?

- A) It's the only distribution for which the change-of-variables formula is valid.
- B) It's easy to sample from and evaluate, and has full support over $\mathbb{R}^d$, ensuring the flow can (in principle) model any distribution.
- C) The flow architecture requires a Gaussian base.
- D) It minimizes the number of flow layers needed.

Answer and Explanations

**Correct: B)** $\mathcal{N}(0, I)$ is computationally convenient: sampling is $O(d)$, and log-density evaluation is a simple quadratic form. Its full support over $\mathbb{R}^d$ means the flow can theoretically reach any point in the data space. Combined with a sufficiently expressive flow, any distribution can be modeled.

- A) The change-of-variables formula works for any base distribution.
- C) Any distribution is mathematically valid as a base; Gaussians are chosen for convenience.
- D) The number of layers needed depends on the complexity of the target distribution, not the base.