24-05 — Disentanglement and Representation Theory

Phase: 24 (Information Geometry & Advanced Theory)
Subject: 24-05
Prerequisites: 24-04 (Manifold Hypothesis), 22-02 (Variational Autoencoders), 16-04 (Loss Functions)
Next subject: 24-06 (Optimal Transport)

Learning Objectives

By the end of this subject, you will be able to:

Define disentangled representation and compare multiple formal definitions
Derive the β-VAE objective and explain β's role in the disentanglement–reconstruction trade-off
Compute the mutual information gap and explain its role in evaluating disentanglement against ground-truth factors
Apply group-theoretic decompositions to enforce structured disentanglement
Evaluate disentanglement quality using standard metrics (FactorVAE score, MIG, DCI)

Core Content

1. What Is Disentanglement?

A representation is disentangled when each latent dimension corresponds to exactly one generative factor of variation, and changing one factor leaves all others unchanged — analogous to the way a Fourier transform separates a signal into independent frequency components.

Intuition: Imagine a dataset of 3D rendered chairs. The generative factors include chair height, leg thickness, backrest angle, and material colour. A disentangled representation would have one dimension controlling colour and another controlling height, with changes to one not affecting the other.

Formal definitions (three perspectives that do not always agree):

β-VAE definition (Higgins et al., 2017): A representation is more disentangled when $q_\phi(\mathbf{z}|\mathbf{x})$ is closer to a factorised prior $p(\mathbf{z}) = \prod_j p(z_j)$. Enforced via a weighted KL penalty.
Symmetry-based definition (Higgins et al., 2018): A representation is disentangled with respect to a symmetry group $G$ decomposing as $G = G_1 \times \cdots \times G_K$ if each $G_k$ acts on exactly one latent subspace and trivially on all others.
Information-theoretic definition (cf. Chen et al., 2018; Eastwood & Williams, 2018): The mutual information $I(z_j; v_k)$ between latent $z_j$ and ground-truth factor $v_k$ should be high for exactly one $j$ per $k$ and low otherwise.

⚠️ CRITICAL: No single definition of disentanglement is universally accepted. Different papers use different definitions and metrics — always check which one is in play when reading the literature.

2. The β-VAE and the Disentanglement Trade-off

Recall the standard VAE objective (from 22-02):

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$$

The β-VAE introduces a hyperparameter $\beta > 1$:

$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \beta \cdot D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$$

How it works:

The KL term $D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}) \| \prod_j p(z_j))$ measures how far $q(\mathbf{z}|\mathbf{x})$ is from the fully factorised prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$
For $\beta = 1$, this is the standard VAE: the KL regularises the latent space but adds no extra pressure toward independence
For $\beta > 1$, the KL term is up-weighted, putting stronger pressure on each $q(z_j|\mathbf{x})$ to match the independent Gaussian prior
Because the prior factorises as $\prod_j p(z_j)$, a low average KL pulls the aggregate posterior $q(\mathbf{z}) = \mathbb{E}_{p_{\text{data}}(\mathbf{x})}[q(\mathbf{z}|\mathbf{x})]$ toward a factorised distribution. The per-input posterior $q(\mathbf{z}|\mathbf{x})$ is usually factorised by construction (mean-field encoder), so the pressure acts on how information about $\mathbf{x}$ is spread across dimensions

Thus, $\beta > 1$ encourages (but does not guarantee) statistically independent latent dimensions, which tends to align them with independent generative factors and so promotes disentanglement.

The trade-off: Higher $\beta$ tends to improve disentanglement but worsens reconstruction, because the KL term restricts how much information about $\mathbf{x}$ can pass through the bottleneck. In information-bottleneck terms, the average KL upper-bounds the mutual information: $I_q(\mathbf{x}; \mathbf{z}) \le \mathbb{E}_{p_{\text{data}}(\mathbf{x})}[D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))]$.

⚠️ CRITICAL: β-VAE is simple to implement, but the disentanglement–reconstruction trade-off limits its practical utility. Later approaches (FactorVAE, β-TCVAE) address this by decomposing the KL into components and up-weighting only the total correlation.
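
The β-weighted objective from this section can be sketched numerically. The block below is a minimal NumPy illustration, not a training implementation: it assumes a diagonal Gaussian encoder parameterised by `mu` and `logvar`, and a unit-variance Gaussian decoder so that the reconstruction term reduces to a squared error. The function names `gaussian_kl_per_dim` and `beta_vae_loss` are hypothetical, and the loss returned is the negative of the objective (lower is better).

```python
import numpy as np

def gaussian_kl_per_dim(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ) for each latent dimension, in nats."""
    return 0.5 * (mu**2 + np.exp(logvar) - 1.0 - logvar)

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Negative beta-VAE objective averaged over a batch.

    Assumes a unit-variance Gaussian decoder, so -log p(x|z) is the squared
    error up to an additive constant, which is dropped here.
    """
    recon = 0.5 * np.sum((x - x_recon) ** 2, axis=1)       # shape (batch,)
    kl = np.sum(gaussian_kl_per_dim(mu, logvar), axis=1)   # shape (batch,)
    return float(np.mean(recon + beta * kl))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))                    # toy inputs
x_recon = x + 0.1 * rng.normal(size=(8, 5))    # toy reconstructions
mu = rng.normal(size=(8, 2))                   # toy encoder means
logvar = np.full((8, 2), -1.0)                 # toy encoder log-variances

loss_b1 = beta_vae_loss(x, x_recon, mu, logvar, beta=1.0)
loss_b4 = beta_vae_loss(x, x_recon, mu, logvar, beta=4.0)
print(loss_b1, loss_b4)
```

With the encoder outputs held fixed, raising `beta` only raises the KL share of the loss; during training the encoder responds by moving `mu` toward 0 and `logvar` toward 0, which is the bottleneck effect described above.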

3. Total Correlation: FactorVAE and β-TCVAE

Kim & Mnih (2018) argued that the total correlation (TC), rather than the full KL, is the component most relevant to disentanglement. For a factorised prior, the KL splits as:

$$D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})) = \underbrace{D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}) \| \prod_j q(z_j|\mathbf{x}))}_{\text{Total Correlation}} + \underbrace{\sum_j D_{\text{KL}}(q(z_j|\mathbf{x}) \| p(z_j))}_{\text{Dimension-wise KL}}$$

With a mean-field encoder the first term vanishes for each individual $\mathbf{x}$, so in practice the TC of interest is that of the aggregate posterior $q(\mathbf{z}) = \mathbb{E}_{p_{\text{data}}(\mathbf{x})}[q(\mathbf{z}|\mathbf{x})]$.

The FactorVAE objective adds a penalty on this total correlation to the VAE objective:

$$\mathcal{L}_{\text{FactorVAE}} = \mathcal{L}_{\text{VAE}} - \gamma \cdot \text{TC}(q(\mathbf{z}))$$

where $\text{TC}(q(\mathbf{z})) = D_{\text{KL}}(q(\mathbf{z}) \| \prod_j q(z_j))$ is estimated via a density-ratio trick using a discriminator network.
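
The density-ratio trick needs samples from both $q(\mathbf{z})$ and $\prod_j q(z_j)$. A common way to approximate the latter is to shuffle each latent dimension independently across a batch. The sketch below shows only that shuffling step; the helper name `permute_dims` and the toy batch are assumptions for illustration, and no discriminator is trained here.

```python
import numpy as np

def permute_dims(z, rng):
    """Shuffle each latent dimension independently across the batch.

    Rows of the result approximate samples from prod_j q(z_j): every column
    keeps its marginal, but the pairing between columns is broken.
    """
    z_perm = np.empty_like(z)
    for j in range(z.shape[1]):
        z_perm[:, j] = rng.permutation(z[:, j])
    return z_perm

rng = np.random.default_rng(0)

# Strongly correlated toy batch standing in for samples from q(z).
base = rng.normal(size=(2000, 1))
z = np.hstack([base, base + 0.1 * rng.normal(size=(2000, 1))])
z_perm = permute_dims(z, rng)

corr_joint = np.corrcoef(z.T)[0, 1]        # close to 1: dimensions are dependent
corr_perm = np.corrcoef(z_perm.T)[0, 1]    # close to 0: dependence removed
print(corr_joint, corr_perm)
```

A discriminator $D$ trained to separate `z` from `z_perm` then yields the estimate $\text{TC}(q(\mathbf{z})) \approx \mathbb{E}_{q(\mathbf{z})}[\log D(\mathbf{z}) - \log(1 - D(\mathbf{z}))]$, which is the quantity weighted by $\gamma$ in the objective above.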

Similarly, β-TCVAE (Chen et al., 2018) decomposes the KL and applies different weights:

$$\mathcal{L}_{\beta\text{-TCVAE}} = \mathbb{E}_{q}[\log p(\mathbf{x}|\mathbf{z})] - \alpha I_q(\mathbf{x};\mathbf{z}) - \beta\, \text{TC}(q(\mathbf{z})) - \gamma \sum_j D_{\text{KL}}(q(z_j) \| p(z_j))$$

where $I_q(\mathbf{x};\mathbf{z})$ is the index-code mutual information (weighted by $\alpha$), TC is weighted by $\beta$, and dimension-wise KL by $\gamma$. Setting $\beta > 1$ and $\alpha = \gamma = 1$ up-weights only the total correlation, which aims to improve disentanglement at a smaller cost to reconstruction than β-VAE.

4. Group-Theoretic Disentanglement

Higgins et al. (2018) proposed a symmetry-based framework using group theory:

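Setup: The world state $\mathbf{w}$ transforms under a symmetry group $G$. The agent learns an encoder $f: \mathbf{w} \to \mathbf{z}$ and a decoder $d: \mathbf{z} \to \mathbf{w}$ (written $d$ here so that $g$ can denote a group element). The group action on $\mathbf{w}$ induces an action on $\mathbf{z}$ via $f$, so that $f(g \cdot \mathbf{w}) = g \cdot f(\mathbf{w})$ for every $g \in G$.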

Key result: If $G$ decomposes as a direct product $G = G_1 \times G_2 \times \cdots \times G_K$, and each $G_k$ acts only on its corresponding latent subspace $\mathbf{z}_k$, then the representation is equivariantly disentangled.

Example — 2D translation group: Let $\mathbf{w}$ be an image of a shape that can translate in $x$ and $y$. The group is $\mathbb{R}^2 = \mathbb{R}_x \times \mathbb{R}_y$. A disentangled representation would have one latent dimension $z_1$ controlling $x$-translation and $z_2$ controlling $y$-translation, with $x$-translation leaving $z_2$ unchanged and vice versa.

Implementation: Enforce that a forward model $M: G \times \mathbf{z} \to \mathbf{z}$ predicts how the latent changes under a group action. Train by comparing $M(g, f(\mathbf{w}))$ with $f(g \cdot \mathbf{w})$ (the latent of the transformed input).
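
The comparison between $M(g, f(\mathbf{w}))$ and $f(g \cdot \mathbf{w})$ can be written as a loss. The sketch below uses the 2D translation example with deliberately simple stand-ins: a linear encoder `A` and a linear forward model `B` are assumptions chosen so the result can be checked by hand, and the function names are hypothetical. In practice both would be learned networks.

```python
import numpy as np

def encoder(w, A):
    """Toy linear encoder f(w) = A w applied to a batch of 2D positions."""
    return w @ A.T

def forward_model(g, z, B):
    """Toy forward model M(g, z) = z + B g for a translation g in R^2."""
    return z + g @ B.T

def equivariance_loss(w, g, A, B):
    """Mean squared gap between M(g, f(w)) and f(g . w), with g . w = w + g."""
    predicted = forward_model(g, encoder(w, A), B)
    target = encoder(w + g, A)
    return float(np.mean(np.sum((predicted - target) ** 2, axis=1)))

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 2))      # world states: (x, y) positions
g = rng.normal(size=(64, 2))      # translations (dx, dy)

A = np.diag([2.0, -0.5])          # axis-aligned encoder
loss_matched = equivariance_loss(w, g, A, B=A)             # M agrees with f
loss_mismatched = equivariance_loss(w, g, A, B=np.eye(2))  # M disagrees with f

# An x-only translation moves z_1 and leaves z_2 untouched.
gx = np.array([[1.0, 0.0]])
delta = encoder(w[:1] + gx, A) - encoder(w[:1], A)
print(loss_matched, loss_mismatched, delta)
```

Because `A` is diagonal, each translation axis maps to its own latent coordinate, which is the disentangled case from the example above. A non-diagonal `A` could still reach zero equivariance loss with `B = A`, so the loss checks equivariance only; the subspace structure has to be imposed on the forward model separately.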

5. Evaluation Metrics

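Metric	What It Measures	Limitation
FactorVAE score	Accuracy of a majority-vote classifier that predicts which factor was held fixed from the index of the least-varying latent dimension	Requires ground-truth factor labels and sampling with one factor held fixed
MIG (Mutual Information Gap)	$\frac{1}{K}\sum_k \frac{I(z_{j^{*}}; v_k) - I(z_{j^{**}}; v_k)}{H(v_k)}$, where $j^{*}$ and $j^{**}$ index the latents with the highest and second-highest mutual information with $v_k$	Requires ground-truth factors and conditional sampling from the generative process
DCI (Disentanglement, Completeness, Informativeness)	Three scores derived from regressors (e.g. Lasso or tree ensembles) that predict factors from latents	Sensitive to regressor choice and hyperparameters
SAP (Separated Attribute Predictability)	Difference between top two latent dimensions' predictive power per factor	Requires factor labels
Modularity	Whether each latent dimension encodes at most one factor	Requires factor labels; closely related to the disentanglement (D) score of DCI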

Common Pitfall: These metrics require ground-truth generative factors — they are only applicable on synthetic datasets with known factors (dSprites, 3D Shapes, Cars3D). On real data, there is no agreed-upon evaluation protocol.
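
To make the MIG row of the table concrete, here is a minimal estimator for the fully discrete case. It is a sketch under stated assumptions: latents are already discretised, mutual information is computed from empirical counts, and the helper names are hypothetical. Published implementations differ in how they bin continuous latents.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy (bits) of a 1D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_info_bits(a, b):
    """I(a; b) = H(a) + H(b) - H(a, b) for discrete 1D arrays, in bits."""
    ca = np.unique(a, return_inverse=True)[1]   # integer codes for a
    cb = np.unique(b, return_inverse=True)[1]   # integer codes for b
    joint = ca * (cb.max() + 1) + cb            # one code per (a, b) pair
    return entropy_bits(ca) + entropy_bits(cb) - entropy_bits(joint)

def mig(latents, factors):
    """Mean over factors of (top MI - second MI) / H(factor).

    latents: (n, d) discretised codes; factors: (n, K) discrete factors.
    """
    gaps = []
    for k in range(factors.shape[1]):
        mi = sorted(
            (mutual_info_bits(latents[:, j], factors[:, k])
             for j in range(latents.shape[1])),
            reverse=True,
        )
        gaps.append((mi[0] - mi[1]) / entropy_bits(factors[:, k]))
    return float(np.mean(gaps))

# Toy dataset: two independent factors on a full 3 x 2 grid, repeated 50 times.
v1, v2 = np.meshgrid(np.arange(3), np.arange(2), indexing="ij")
factors = np.tile(np.stack([v1.ravel(), v2.ravel()], axis=1), (50, 1))

clean = factors.copy()                                     # z_1 = v_1, z_2 = v_2
mixed = np.stack([factors[:, 0], factors[:, 0]], axis=1)   # both latents copy v_1

mig_clean = mig(clean, factors)
mig_mixed = mig(mixed, factors)
print(mig_clean, mig_mixed)
```

The `clean` code gives each factor its own latent and scores close to 1. The `mixed` code duplicates one factor and ignores the other, so both gaps are zero and the score is close to 0.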

Key Terms

Disentanglement
Evaluation metrics
FactorVAE
FactorVAE score
Formal definitions
Group-theoretic disentanglement
Modularity

Worked Examples

Example 1: β-VAE KL Pressure

Problem: For a VAE with latent dim $d=2$ and Gaussian encoder/prior, show mathematically how $\beta > 1$ forces the posterior to factorise.

Solution:

The KL divergence for Gaussian posterior $q(z_i|\mathbf{x}) = \mathcal{N}(\mu_i(\mathbf{x}), \sigma_i^2(\mathbf{x}))$ and prior $p(z_i) = \mathcal{N}(0, 1)$:

$$D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})) = \frac{1}{2}\sum_{i=1}^d \left(\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2\right)$$

The total correlation is not directly visible here: the KL term is additive across dimensions because both $q$ (mean-field) and $p$ (factorised Gaussian) factorise, so for a single $\mathbf{x}$ there is nothing left to factorise. The pressure toward disentanglement comes instead from compression. With $\beta > 1$ the encoder has less capacity, and encoding the same information redundantly across several dimensions wastes it. Under this argument, the efficient solution gives each independent factor its own dimension, although this is a heuristic and not a guarantee.

Specifically, when $\beta$ is large, the effective capacity $C = \sum_i D_{\text{KL}}(q(z_i|\mathbf{x}) \| p(z_i))$ is constrained. Different generative factors $v_k$ compete for this capacity, and efficient encoding tends to place each factor in a distinct latent dimension.
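
As an illustrative numeric case for $d = 2$ (the values are chosen here for the arithmetic and do not come from any experiment), take $\mu = (1.5, 0)$ and $\sigma^2 = (0.2, 1)$. The first dimension is informative and the second has collapsed to the prior:

$$D_{\text{KL}}^{(1)} = \tfrac{1}{2}\left(1.5^2 + 0.2 - 1 - \ln 0.2\right) = \tfrac{1}{2}\left(2.25 + 0.2 - 1 + 1.609\right) \approx 1.530 \text{ nats}$$

$$D_{\text{KL}}^{(2)} = \tfrac{1}{2}\left(0^2 + 1 - 1 - \ln 1\right) = 0 \text{ nats}$$

The whole capacity $C \approx 1.530$ nats is spent on $z_1$. With $\beta = 4$ the penalty for this input is about $4 \times 1.530 = 6.12$ nats, so the encoder keeps $z_1$ informative only if it buys at least that much reconstruction log-likelihood.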

Example 2: Computing Mutual Information Gap

Problem: Given a 3-factor dataset where factor 1 (shape) has entropy $H(v_1) = \log_2 3$ (3 shapes), and the mutual information between latent $z_1$ and $v_1$ is $I(z_1; v_1) = 1.2$ bits, with $I(z_2; v_1) = 0.1$ bits and $I(z_3; v_1) = 0.05$ bits, compute the MIG for factor $v_1$.

Solution:

MIG for factor $v_k$ measures the gap between the highest and second-highest mutual information:

$$\text{MIG}_k = \frac{I(z_{j^{*}}; v_k) - I(z_{j^{**}}; v_k)}{H(v_k)}$$

For $v_1$:

$j^{*} = 1$ (highest MI: $I(z_1; v_1) = 1.2$)
$j^{**} = 2$ (second highest: $I(z_2; v_1) = 0.1$)

$$\text{MIG}_1 = \frac{1.2 - 0.1}{\log_2 3} = \frac{1.1}{1.585} = 0.694$$

A MIG of 0.694 means the gap between the best and second-best latent equals 69.4% of factor $v_1$'s entropy: $z_1$ carries most of the information about $v_1$ and the other dimensions carry little, which indicates good disentanglement for this factor.
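
The arithmetic in this example can be checked in a few lines. The helper name `mig_for_factor` is hypothetical; the inputs are the mutual information values and entropy given in the problem statement.

```python
import math

def mig_for_factor(mi_values, factor_entropy):
    """Gap between the two largest mutual informations, normalised by H(v_k)."""
    top, second = sorted(mi_values, reverse=True)[:2]
    return (top - second) / factor_entropy

# I(z_1; v_1), I(z_2; v_1), I(z_3; v_1) in bits, and H(v_1) = log2(3) bits.
mig_1 = mig_for_factor([1.2, 0.1, 0.05], math.log2(3))
print(round(mig_1, 3))  # 0.694
```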

Example 3: Group Action on Latent Space

Problem: Consider images of a coloured geometric shape where the group is $G = SO(2) \times \mathbb{Z}_3$ (rotation × colour). The model learns latents $\mathbf{z} = [z_1, z_2, z_3]$. Rotation by angle $\theta$ changes $z_1$; colour cycling changes $z_2, z_3$ but leaves $z_1$ invariant. Show the group action decomposition.

Solution:

Let $g = (R_\theta, c) \in SO(2) \times \mathbb{Z}_3$ where $R_\theta$ is rotation by $\theta$ and $c \in \{0, 1, 2\}$ is the colour index.

The induced action on $\mathbf{z}$: $$g \cdot \mathbf{z} = (M_\theta(z_1),\; N_c(z_2, z_3))$$

where $M_\theta$ applies the rotation to the rotation-encoding dimension $z_1$ only (for instance $z_1 \mapsto z_1 + \theta \bmod 2\pi$ if $z_1$ stores an angle), and $N_c$ applies the colour cycling to $(z_2, z_3)$, a 2D representation of 3 colours on a circle, while leaving $z_1$ invariant.

Decomposition verification:

$SO(2)$ acts only on subspace $\mathbf{z}_1$ (dimension $z_1$)
$\mathbb{Z}_3$ acts only on subspace $\mathbf{z}_c$ (dimensions $z_2, z_3$)
$G \cong SO(2) \times \mathbb{Z}_3$ with clean subspace decomposition $\rightarrow$ equivariantly disentangled ✓
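
The decomposition can also be checked numerically. The sketch below assumes one concrete parameterisation that the example leaves open: $z_1$ stores an angle in $[0, 2\pi)$, and a colour step $c$ rotates the point $(z_2, z_3)$ on the unit circle by $2\pi c / 3$. The function name `act` is hypothetical.

```python
import numpy as np

def act(theta, c, z):
    """Action of g = (R_theta, c) in SO(2) x Z_3 on z = [z1, z2, z3].

    Assumed parameterisation: z1 is an angle in [0, 2*pi); (z2, z3) is a point
    on the unit circle, and colour step c rotates it by 2*pi*c/3.
    """
    z1 = (z[0] + theta) % (2 * np.pi)
    phi = 2 * np.pi * c / 3
    rot = np.array([[np.cos(phi), -np.sin(phi)],
                    [np.sin(phi),  np.cos(phi)]])
    z23 = rot @ z[1:]
    return np.concatenate([[z1], z23])

z = np.array([0.3, 1.0, 0.0])

rotated = act(0.7, 0, z)       # pure rotation: only z1 should change
recoloured = act(0.0, 1, z)    # pure colour step: only (z2, z3) should change

# The two factors commute, as expected for a direct product.
rot_then_colour = act(0.0, 1, act(0.7, 0, z))
colour_then_rot = act(0.7, 0, act(0.0, 1, z))

full_cycle = act(0.0, 3, z)    # three colour steps return to the start
print(rotated, recoloured, full_cycle)
```

Each subgroup touches only its own coordinates, the order of application does not matter, and three colour steps act as the identity, which matches the structure of $SO(2) \times \mathbb{Z}_3$.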

Practice Problems

(Answers are below. Try each problem before checking.)

Problem 1: Explain why $\beta$-VAE with $\beta > 1$ improves disentanglement but degrades reconstruction. Frame your answer in terms of the information bottleneck.

Problem 2: For a 4-factor dataset, MIG scores are $[0.82, 0.45, 0.91, 0.73]$. Interpret these scores — which factors are well disentangled and which might have entanglement issues?

Problem 3: The FactorVAE discriminator estimates $\text{TC}(q(\mathbf{z}))$ by distinguishing samples from $q(\mathbf{z})$ (joint) vs $\prod_j q(z_j)$ (shuffled). If the discriminator achieves 50% accuracy, what does this imply about the latent representation?

Problem 4: Prove or disprove: when the encoder posterior factorises perfectly ($q(\mathbf{z}|\mathbf{x}) = \prod_j q(z_j|\mathbf{x})$) AND the prior factorises ($p(\mathbf{z}) = \prod_j p(z_j)$), the total correlation in the aggregate posterior is zero.

Problem 5: Design a group-theoretic disentanglement setup for images of faces where factors are: identity (10 people), expression (5 categories), and illumination angle (continuous). What group structure would you use?

Answers

**Problem 1:** Averaged over the data, the KL term $D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$ upper-bounds how much information the latent $\mathbf{z}$ carries about $\mathbf{x}$. With $\beta > 1$, this term is up-weighted, which pushes the mutual information $I(\mathbf{x}; \mathbf{z})$ down. Less information through the bottleneck means the decoder receives a degraded signal $\rightarrow$ worse reconstruction. At the same time, the pull toward a factorised Gaussian prior encourages dimensions to be independent, each capturing one generative factor $\rightarrow$ better disentanglement. This is the classic information bottleneck trade-off: stronger compression costs reconstruction quality.

**Problem 2:** Factor 3 (0.91) and factor 1 (0.82) show strong disentanglement: one latent dimension captures most of their entropy with little leakage. Factor 4 (0.73) is decent. Factor 2 (0.45) is problematic: the gap between its best and second-best latent is less than half of its entropy, suggesting that several dimensions encode this factor (entanglement) or that the model has not learned to separate it well.

**Problem 3:** 50% accuracy is chance level. The discriminator cannot distinguish joint samples from shuffled (factorised) ones, meaning $q(\mathbf{z}) \approx \prod_j q(z_j)$. The aggregate posterior already factorises, so the latent dimensions are statistically independent and the FactorVAE penalty is near zero. This assumes the discriminator is well trained; an undertrained discriminator also sits at 50%. Independence alone does not show that the latents align with the true factors.

**Problem 4:** The claim is false in general, so it cannot be proved as stated. The aggregate posterior is $q(\mathbf{z}) = \mathbb{E}_{p_{\text{data}}(\mathbf{x})}[q(\mathbf{z}|\mathbf{x})]$. If $q(\mathbf{z}|\mathbf{x})$ factorises perfectly for every $\mathbf{x}$:

$$q(\mathbf{z}) = \mathbb{E}_{p_{\text{data}}}\left[\prod_j q(z_j|\mathbf{x})\right]$$

In general, the expectation of a product ≠ product of expectations, so $q(\mathbf{z}) \neq \prod_j q(z_j)$, and the total correlation is NOT automatically zero. For example, if $\mathbf{x} \in \{-1, +1\}$ with equal probability and $q(z_j|\mathbf{x}) = \mathcal{N}(\mathbf{x}, \sigma^2)$ for $j = 1, 2$, then $q(\mathbf{z})$ is a mixture of two Gaussians centred at $(-1, -1)$ and $(+1, +1)$, so $z_1$ and $z_2$ are strongly correlated.

TC = 0 does hold under stronger conditions, for example when each $z_j$ depends on $\mathbf{x}$ only through its own factor $v_j$ and the factors are mutually independent in the data. The subtle point: a mean-field posterior does NOT guarantee a factorised aggregate posterior.

**Problem 5:** One natural choice is $G = S_{10} \times \mathbb{Z}_5 \times SO(2)$ where:

- $S_{10}$ (symmetric group) permutes the 10 identity labels (discrete category)
- $\mathbb{Z}_5$ cycles through 5 expressions (smile→surprise→anger→neutral→sad); the cyclic order is a modelling convenience, since expressions have no natural ordering
- $SO(2)$ continuously rotates the illumination direction

The latent space decomposes as $\mathbf{z} = [\mathbf{z}_{\text{id}}, \mathbf{z}_{\text{expr}}, \mathbf{z}_{\text{light}}]$ with subspaces of dimensions corresponding to the group representations. $S_{10}$ acts on $\mathbf{z}_{\text{id}}$ through a permutation representation (for example, permuting the coordinates of a one-hot-style identity code), $\mathbb{Z}_5$ cycles $\mathbf{z}_{\text{expr}}$ on a circular latent manifold, and $SO(2)$ rotates $\mathbf{z}_{\text{light}}$ in a 2D subspace.

Summary

Disentanglement means each latent dimension corresponds to exactly one generative factor, with multiple formal definitions (VAE-based, symmetry-based, information-theoretic).
β-VAE up-weights the KL term to force latent dimensions toward the factorised prior, trading reconstruction quality for disentanglement — a direct information bottleneck effect.
FactorVAE and β-TCVAE isolate the total correlation of the aggregate posterior and weight it more heavily, targeting statistical dependence between latents rather than per-dimension deviation from the prior.
Group-theoretic disentanglement enforces that symmetry group actions decompose cleanly onto latent subspaces, providing a principled alternative to KL-based methods.
Evaluation metrics (MIG, FactorVAE score, DCI) all require ground-truth generative factors — no consensus metric exists for unsupervised disentanglement on real data.

Quiz

Question 1: Which of the following is the primary mechanism by which β-VAE (β > 1) encourages disentanglement?

A. By increasing the reconstruction loss weight
B. By penalising deviation from a factorised prior more strongly
C. By adding a total correlation discriminator
D. By enforcing group equivariance constraints

Correct Answer: B

Explanation: The factorised Gaussian prior p(z) = ∏ⱼ 𝒩(zⱼ; 0, 1) is the key — the KL divergence D_KL(q(z|x) || p(z)) with β > 1 puts stronger pressure on q(z|x) to match this factorised structure, encouraging each latent dimension to be independent.

Question 2: The total correlation decomposition of the KL divergence separates the KL into which components?

A. Reconstruction loss and regularisation loss
B. Total correlation and dimension-wise KL
C. Mutual information and entropy
D. Encoder KL and decoder KL

Correct Answer: B

Explanation: D_KL(q(z|x) || p(z)) = D_KL(q(z|x) || ∏ⱼ q(zⱼ|x)) + ∑ⱼ D_KL(q(zⱼ|x) || p(zⱼ)), where the first term is the total correlation (dependence between latents) and the second is dimension-wise KL (deviation of each dimension from the prior). FactorVAE and β-TCVAE put extra weight on the total correlation of the aggregate posterior q(z), aiming to improve disentanglement with less damage to reconstruction than β-VAE.

Question 3: What distinguishes the Mutual Information Gap (MIG) metric from simpler metrics like the FactorVAE score?

A. MIG doesn't require ground-truth factor labels
B. MIG measures the gap between the best and second-best latent for each factor
C. MIG is computed using a classifier, not mutual information
D. MIG measures only reconstruction quality

Correct Answer: B

Explanation: MIG for factor v_k is (I(z_{j*}; v_k) - I(z_{j**}; v_k)) / H(v_k), where j* is the latent with the highest MI and j** is the second-highest. This gap quantifies how uniquely one latent dimension captures a factor versus having information spread across multiple dimensions, providing a more nuanced measure than a simple majority-vote classifier.

Question 4: In group-theoretic disentanglement, what condition must hold for a representation to be equivariantly disentangled?

A. The data manifold must be flat
B. Each direct product component G_k must act on exactly one latent subspace and trivially on all others
C. The group must be Abelian
D. The latent space dimension must equal the number of group elements

Correct Answer: B

Explanation: If the symmetry group decomposes as G = G₁ × ⋯ × G_K, equivariant disentanglement requires that each G_k acts on exactly one latent subspace z_k and leaves all other subspaces unchanged. This aligns the latent space structure with the group-theoretic structure of the data, providing a principled alternative to KL-based methods.

Question 5: Why might a mean-field posterior q(z|x) = ∏ⱼ q(zⱼ|x) NOT guarantee a factorised aggregate posterior?

A. Because the encoder network is stochastic
B. Because the expectation of a product is not the product of expectations: E_x[∏ q(zⱼ|x)] ≠ ∏ E_x[q(zⱼ|x)]
C. Because VAEs always produce correlated latents
D. Because the prior is not factorised

Correct Answer: B

Explanation: The aggregate posterior is q(z) = E_x[∏ⱼ q(zⱼ|x)]. Even if the posterior factorises for each individual x, the expectation over the data distribution can introduce correlations between latent dimensions. The total correlation in q(z) is generally non-zero unless additional conditions hold — this is a subtle but important point often missed in the disentanglement literature.

Pitfalls

Random seeds dominate results: Disentanglement scores are highly sensitive to random seed and hyperparameters. Always report mean ± std over 5+ seeds.
β-VAE conflates disentanglement with sparsity: High $\beta$ can cause dimensions to collapse (posterior collapses to prior), which improves disentanglement metrics trivially but loses information.
Metric gaming: Optimising directly for a disentanglement metric (e.g., FactorVAE score) often produces representations that score well but don't actually disentangle — the metric is a proxy, not the objective.
Mean-field posterior ≠ factorised aggregate posterior: $q(\mathbf{z}|\mathbf{x})$ factorising for each $\mathbf{x}$ does not imply $q(\mathbf{z})$ factorises across the dataset — the TC can still be nonzero.
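
The last pitfall is easy to reproduce with a toy simulation. The setup below is an assumption made for illustration: a binary input $\mathbf{x} \in \{-1, +1\}$ and a mean-field encoder $q(z_j|\mathbf{x}) = \mathcal{N}(\mathbf{x}, 0.3^2)$ that writes the same information into both latent dimensions. Pearson correlation is used as a simple stand-in for dependence; it is not a total correlation estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Toy data: x is -1 or +1 with equal probability.
x = rng.choice([-1.0, 1.0], size=n)

# Mean-field encoder (illustrative assumption): q(z_j | x) = N(x, 0.3^2),
# sampled independently for j = 1, 2, so q(z | x) factorises for every x.
z = x[:, None] + 0.3 * rng.normal(size=(n, 2))

corr_given_x = np.corrcoef(z[x > 0].T)[0, 1]   # within a single x: near 0
corr_aggregate = np.corrcoef(z.T)[0, 1]        # pooled over the data: near 0.9
print(corr_given_x, corr_aggregate)
```

Conditioned on one value of `x` the two dimensions are uncorrelated, yet pooled over the dataset they are strongly correlated, because both dimensions track the same underlying variable.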

Next Steps

Move on to 24-06 — Optimal Transport, where we study the Wasserstein distance, Sinkhorn algorithm, and the mathematical foundations of distribution comparison that connect to WGANs.
