24-04 — Manifold Hypothesis and Representation Geometry
Phase: 24 (Information Geometry & Advanced Theory)
Subject: 24-04
Prerequisites: Phase 16-17 (Neural Networks & Deep Learning: architectures, representations), Phase 6 (Linear Algebra: SVD, matrix decompositions), 24-03 NTK
Next subject: 24-05 (Disentanglement and Representation Theory)
Learning Objectives
By the end of this subject, you will be able to:
- State and justify the manifold hypothesis and its implications for deep learning
- Estimate intrinsic dimension of data and learned representations
- Define and compute representation similarity metrics (CKA, CCA, PWCCA)
- Understand how neural networks progressively "untangle" data manifolds through their layers
- Analyze the geometry of hidden representations using spectral methods
Core Content
The Manifold Hypothesis
The manifold hypothesis states:
Real-world high-dimensional data lies on (or near) a low-dimensional manifold embedded in the ambient space.
For example:
- Natural images ($\mathbb{R}^{H \times W \times 3}$, with up to millions of dimensions) lie near a manifold of much lower intrinsic dimension: the set of "valid" natural images is a tiny subset of all possible pixel configurations
- Speech signals live on a manifold defined by vocal tract physics
- Text embeddings concentrate near semantic manifolds
Formally: data $\mathbf{x} \in \mathbb{R}^D$ is generated as $\mathbf{x} = g(\mathbf{z}) + \boldsymbol{\epsilon}$ where $\mathbf{z} \in \mathbb{R}^d$ are latent factors, $d \ll D$, $g: \mathbb{R}^d \to \mathbb{R}^D$ is a smooth embedding, and $\boldsymbol{\epsilon}$ is small noise.
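As a concrete illustration of this generative model, the sketch below samples from $\mathbf{x} = g(\mathbf{z}) + \boldsymbol{\epsilon}$ with a one-dimensional latent variable. The particular $g$ (a unit circle placed in a random 2-D plane of $\mathbb{R}^D$), the function name `sample_noisy_manifold`, and the noise level are illustrative assumptions, not part of the hypothesis itself.

```python
import numpy as np

def sample_noisy_manifold(n_points, ambient_dim, noise_std=0.01, seed=0):
    """Sample x = g(z) + eps with a 1-D latent z (an angle) and a D-dim output."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 2.0 * np.pi, size=n_points)      # latent factor, d = 1
    circle = np.stack([np.cos(z), np.sin(z)], axis=1)      # unit circle in 2-D
    # Hypothetical choice of g: place the circle in a random orthonormal 2-D frame.
    frame, _ = np.linalg.qr(rng.standard_normal((ambient_dim, 2)))
    x = circle @ frame.T                                    # shape (n_points, ambient_dim)
    x += noise_std * rng.standard_normal(x.shape)           # small ambient noise eps
    return z, x

z, x = sample_noisy_manifold(n_points=500, ambient_dim=50)
```

Here $D = 50$ while the data is described, up to noise, by a single angle, which is the situation $d \ll D$ that the hypothesis refers to.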
⚠️ CRITICAL: The manifold hypothesis is a central geometric motivation for deep learning. One influential view is that neural networks succeed because they learn to flatten the data manifold: each layer progressively untangles the nonlinear structure so that the final representation is (approximately) linearly separable.
Intrinsic Dimension
The intrinsic dimension (ID) of a dataset is the minimum number of variables needed to describe the data without significant information loss. Several estimation methods exist:
1. PCA-Based (Global Linear)
Compute the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_D$ of the data matrix. The ID is the number of singular values needed to explain a fraction $\tau$ of the total variance:
$$d_{\text{PCA}} = \min\left\{k : \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^D \sigma_i^2} \geq \tau\right\}$$
This gives a global linear estimate — it misses nonlinear manifold structure.
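The threshold rule above can be sketched in a few lines of NumPy. The function name `pca_intrinsic_dim`, the default $\tau = 0.95$, and the synthetic test data (a 3-D Gaussian placed in $\mathbb{R}^{20}$) are assumptions made for illustration.

```python
import numpy as np

def pca_intrinsic_dim(X, tau=0.95):
    """Smallest k whose top-k squared singular values reach a fraction tau of the total."""
    Xc = X - X.mean(axis=0, keepdims=True)        # center the data
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, descending
    explained = np.cumsum(s**2) / np.sum(s**2)    # cumulative variance fraction
    k = int(np.searchsorted(explained, tau)) + 1  # first index reaching tau, 1-based
    return min(k, len(s))                         # guard against round-off when tau = 1

# Synthetic check: isotropic 3-D Gaussian embedded linearly in 20 dimensions.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((20, 3)))   # orthonormal 20x3 frame
X = rng.standard_normal((2000, 3)) @ basis.T
d_pca = pca_intrinsic_dim(X, tau=0.95)
```

Because the embedding here is linear, PCA is the appropriate tool; the estimate is sensitive to the choice of $\tau$ whenever the spectrum decays gradually.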
2. Two-NN Estimator (Local, Nonlinear)
A more robust method (Facco et al., 2017): For each point, compute distances $r_1$ and $r_2$ to its nearest and second-nearest neighbor. The ratio $\mu = r_2/r_1$ follows a Pareto distribution with parameter $d$ (the intrinsic dimension):
$$P(\mu) = d \cdot \mu^{-(d+1)}$$
The MLE for $d$ is:
$$\hat{d} = \frac{N}{\sum_{i=1}^N \log(r_{2,i}/r_{1,i})}$$
This estimator works remarkably well because it exploits the local scaling of volumes: in a $d$-dimensional space, the volume ratio of balls with radii $r_2$ and $r_1$ scales as $(r_2/r_1)^d$.
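A minimal sketch of the MLE above, using brute-force pairwise distances (fine for a few thousand points). The function name `two_nn_dimension` and the test data are assumptions, and the sketch assumes no duplicate points, since $r_1 = 0$ would make the ratio undefined.

```python
import numpy as np

def two_nn_dimension(X):
    """Two-NN MLE: d_hat = N / sum_i log(r2_i / r1_i)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # pairwise squared distances
    d2 = np.maximum(d2, 0.0)                            # clip tiny negative round-off
    np.fill_diagonal(d2, np.inf)                        # exclude self-distances
    nearest = np.sqrt(np.sort(d2, axis=1)[:, :2])       # columns: r1, r2
    mu = nearest[:, 1] / nearest[:, 0]
    return len(mu) / np.sum(np.log(mu))

# Synthetic check: a flat 2-D patch placed in R^5.
rng = np.random.default_rng(0)
frame, _ = np.linalg.qr(rng.standard_normal((5, 2)))
X = rng.uniform(size=(2000, 2)) @ frame.T
d_hat = two_nn_dimension(X)
```

The estimate should land near 2 for this patch, with sampling error of order $d/\sqrt{N}$. Facco et al. also describe discarding the largest ratios for robustness; that refinement is omitted here.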
3. Maximum Likelihood Estimation (Levina-Bickel)
For each point, consider its $k$ nearest neighbors within radius $R$. The MLE is:
$$\hat{d}_k(\mathbf{x}_i) = \left(\frac{1}{k-1}\sum_{j=1}^{k-1} \log\frac{R_k(\mathbf{x}_i)}{R_j(\mathbf{x}_i)}\right)^{-1}$$
where $R_j(\mathbf{x}_i)$ is the distance to the $j$-th nearest neighbor. Average over all points to get the global estimate.
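The per-point estimate and its average can be sketched as follows. The function name `levina_bickel_dimension`, the default $k$, and the test data are assumptions; the code follows the formula exactly as written above.

```python
import numpy as np

def levina_bickel_dimension(X, k=20):
    """Average over points of the k-NN maximum-likelihood dimension estimate."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    np.fill_diagonal(d2, np.inf)                     # exclude self-distances
    knn = np.sqrt(np.sort(d2, axis=1)[:, :k])        # R_1 .. R_k per point, ascending
    log_ratios = np.log(knn[:, -1:] / knn[:, :-1])   # log(R_k / R_j), j = 1 .. k-1
    local = 1.0 / log_ratios.mean(axis=1)            # d_hat_k(x_i)
    return float(local.mean())                       # global estimate

# Synthetic check: a flat 2-D patch placed in R^5.
rng = np.random.default_rng(0)
frame, _ = np.linalg.qr(rng.standard_normal((5, 2)))
X = rng.uniform(size=(1500, 2)) @ frame.T
d_mle = levina_bickel_dimension(X, k=20)
```

For small $k$ this per-point form tends to be biased upward (roughly by a factor $(k-1)/(k-2)$ under the idealized uniform-density model), so it is worth checking how the estimate changes with $k$.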
Typical intrinsic dimensions:

| Dataset | Ambient Dimension | Estimated ID |
|---------|------------------|-------------|
| MNIST | 784 | ~10-15 |
| CIFAR-10 | 3072 | ~25-40 |
| ImageNet | 150528 | ~40-80 |
| GPT-2 embeddings | 768-1600 | ~10-30 |
Neural Networks as Manifold Untangling
The function of deep neural network layers can be understood geometrically:
$$\mathbf{x} \xrightarrow{\text{layer 1}} \mathbf{h}^{(1)} \xrightarrow{\text{layer 2}} \mathbf{h}^{(2)} \xrightarrow{\cdots} \mathbf{h}^{(L)} \xrightarrow{\text{classifier}} \mathbf{y}$$
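At each layer, the representation $\mathbf{h}^{(\ell)}$ lives on a manifold $M_\ell$ embedded in $\mathbb{R}^{n_\ell}$. Deep networks apply a sequence of continuous maps that gradually flatten the data manifold. These maps are homeomorphisms (invertible continuous deformations) only when each layer is invertible on the data, which ReLU and width-reducing layers need not be: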
- Early layers: Learn local features, manifold remains highly curved
- Middle layers: Progressive untangling, curvature decreases
- Late layers: Manifold becomes approximately flat, enabling linear separation
This is why linear classifiers work on top of deep networks — the network's job is to make the data linearly separable in its final hidden representation.
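⚠️ CRITICAL: Layer width constrains what can be embedded. By the Whitney embedding theorem, any smooth $d$-dimensional manifold can be embedded in $\mathbb{R}^{2d}$ without self-intersections, so a width of $2d$ always suffices topologically. Some manifolds need less (a $d$-sphere fits in $\mathbb{R}^{d+1}$), but none embeds in fewer than $d$ dimensions. In practice, layers are much wider, providing "room" to untangle the manifold.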
Spectral Analysis of Representations
For a batch of $N$ data points, the layer-$\ell$ representation forms a matrix $\mathbf{H}^{(\ell)} \in \mathbb{R}^{N \times n_\ell}$. Its singular value spectrum reveals geometric structure:
- Rapidly decaying spectrum: The representation is low-dimensional (strong manifold compression)
- Slow decay / high effective rank: The representation fills the ambient space (distributed code)
- Spectral gaps: Indicate separate sub-manifolds (e.g., different classes)
The effective rank (stable rank) of $\mathbf{H}$ is:
$$\text{srank}(\mathbf{H}) = \frac{\|\mathbf{H}\|_F^2}{\|\mathbf{H}\|_2^2} = \frac{\sum_i \sigma_i^2}{\sigma_{\max}^2}$$
This is always between 1 and $\min(N, n_\ell)$, with higher values indicating more distributed (less compressed) representations.
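The stable rank is straightforward to compute from the singular values. The sketch below uses the hypothetical helper name `stable_rank` and two toy matrices that hit the extremes of the range.

```python
import numpy as np

def stable_rank(H):
    """Squared Frobenius norm divided by squared spectral norm."""
    s = np.linalg.svd(H, compute_uv=False)   # singular values, descending
    return float(np.sum(s**2) / s[0]**2)

H_rank_one = np.outer(np.arange(1.0, 6.0), np.ones(4))  # one direction only
H_identity = np.eye(4)                                   # all directions equal

srank_low = stable_rank(H_rank_one)
srank_high = stable_rank(H_identity)
```

A rank-one matrix sits at the lower bound and a matrix with equal singular values sits at the upper bound. Whether to center $\mathbf{H}$ first is a modeling choice this sketch leaves to the caller.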
Representation Similarity Metrics
How do we compare representations across different networks, layers, or training runs? Several metrics exist.
CKA (Centered Kernel Alignment)
CKA measures the similarity between two representation matrices $\mathbf{X} \in \mathbb{R}^{N \times p}$ and $\mathbf{Y} \in \mathbb{R}^{N \times q}$ (same $N$ points, possibly different dimensionalities):
$$\text{CKA}(\mathbf{X}, \mathbf{Y}) = \frac{\|\mathbf{Y}^T\mathbf{X}\|_F^2}{\|\mathbf{X}^T\mathbf{X}\|_F \cdot \|\mathbf{Y}^T\mathbf{Y}\|_F}$$
For linear CKA (the most common variant), this is:
$$\text{CKA}_{\text{linear}}(\mathbf{X}, \mathbf{Y}) = \frac{\|\text{Cov}(\mathbf{X}, \mathbf{Y})\|_F^2}{\|\text{Cov}(\mathbf{X})\|_F \cdot \|\text{Cov}(\mathbf{Y})\|_F}$$
where $\mathbf{X}$ and $\mathbf{Y}$ are centered (mean-subtracted).
Properties:
- $0 \leq \text{CKA} \leq 1$, with 1 meaning identical representations up to orthogonal transformation and isotropic scaling
- Invariant to orthogonal transformations and isotropic scaling of either representation
- NOT invariant to invertible linear transformations (unlike CCA)
- Robust to the number of neurons: comparing layers of different widths works
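The formula and the invariance properties can be checked numerically. In this sketch the function name `linear_cka` and the random test matrices are assumptions; the function centers its inputs itself.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (N, p) and (N, q) representation matrices."""
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    norm_x = np.linalg.norm(Xc.T @ Xc, "fro")
    norm_y = np.linalg.norm(Yc.T @ Yc, "fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))   # random orthogonal matrix
A = np.diag(np.arange(1.0, 17.0))                    # invertible, not orthogonal

cka_rotated = linear_cka(X, 2.5 * X @ Q)   # rotation plus isotropic scaling
cka_stretched = linear_cka(X, X @ A)       # anisotropic rescaling of features
cka_widths = linear_cka(X, rng.standard_normal((200, 7)))  # different widths
```

The rotated and rescaled copy should score 1 up to round-off, while the anisotropic stretch scores below 1, which is the distinction between CKA and CCA drawn in the list above.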
⚠️ CRITICAL: CKA is one of the most widely used metrics for representation comparison in modern deep learning. Kornblith et al. (2019) showed it reliably identifies corresponding layers across architectures and training runs.
CCA (Canonical Correlation Analysis) and SVCCA
CCA finds linear projections $\mathbf{X}\mathbf{a}$ and $\mathbf{Y}\mathbf{b}$ that maximize correlation:
$$\rho = \max_{\mathbf{a}, \mathbf{b}} \text{corr}(\mathbf{X}\mathbf{a}, \mathbf{Y}\mathbf{b})$$
This gives $k = \min(p, q)$ canonical correlations $\rho_1 \geq \rho_2 \geq \cdots \geq \rho_k$. The mean CCA similarity is:
$$\bar{\rho}_{\text{CCA}} = \frac{1}{k}\sum_{i=1}^k \rho_i$$
SVCCA (Raghu et al., 2017): Apply SVD to each representation first, keeping only directions with significant variance, then compute CCA. This removes noise dimensions and focuses on the signal subspace.
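One standard way to obtain the canonical correlations is to orthonormalize each centered representation with a QR decomposition and take the singular values of the product of the two bases. The sketch below does that and adds an SVCCA-style truncation step. The function names and the 0.99 variance threshold are assumptions, and both matrices are assumed to have full column rank with $N > p, q$.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations rho_1 >= ... >= rho_k, with k = min(p, q)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    Qx, _ = np.linalg.qr(Xc)                  # orthonormal basis of col-space of X
    Qy, _ = np.linalg.qr(Yc)                  # orthonormal basis of col-space of Y
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)             # remove round-off outside [0, 1]

def svcca_preprocess(X, var_fraction=0.99):
    """Keep the top SVD directions explaining var_fraction of the variance."""
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(explained, var_fraction)) + 1, len(s))
    return U[:, :k] * s[:k]                   # data expressed in the kept directions

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 8))
A = rng.standard_normal((8, 8))               # generic invertible matrix
rho_same = canonical_correlations(X, X @ A)   # same subspace, different basis
rho_indep = canonical_correlations(X, rng.standard_normal((300, 6)))
mean_svcca = canonical_correlations(svcca_preprocess(X), svcca_preprocess(X @ A)).mean()
```

Because `X` and `X @ A` span the same column space, every canonical correlation is 1, which illustrates the invariance of CCA to invertible linear maps. Independent noise still yields nonzero correlations at finite $N$, which is one reason the mean can be hard to interpret.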
PWCCA (Projection-Weighted CCA)
Morcos et al. (2018) noted that not all CCA directions are equally important. PWCCA weights each canonical correlation by its importance to the original representation:
$$\text{PWCCA} = \frac{\sum_{i=1}^k \alpha_i \rho_i}{\sum_{i=1}^k \alpha_i}, \quad \alpha_i = \sum_{j=1}^p |\langle \mathbf{h}_i, \mathbf{x}_j \rangle|$$
where $\mathbf{h}_i$ is the CCA vector and $\mathbf{x}_j$ is a direction in the original $\mathbf{X}$ space. PWCCA gives higher weight to CCA directions that are actually used by the representation.
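A minimal sketch of the weighting, under one reading of the formula above: $\mathbf{h}_i$ is taken to be the $i$-th canonical variate of $\mathbf{X}$ (a unit-norm vector over the $N$ samples) and $\mathbf{x}_j$ the centered activation vector of neuron $j$ (column $j$ of $\mathbf{X}$). That interpretation, the function name `pwcca`, and the test data are assumptions.

```python
import numpy as np

def pwcca(X, Y):
    """Projection-weighted mean of canonical correlations, weighted by X."""
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    U, rho, _ = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    rho = np.clip(rho, 0.0, 1.0)
    H = Qx @ U                              # columns h_i: canonical variates of X
    alpha = np.abs(H.T @ Xc).sum(axis=1)    # alpha_i = sum_j |<h_i, x_j>|
    return float(np.sum(alpha * rho) / np.sum(alpha))

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 8))
score_same = pwcca(X, X @ rng.standard_normal((8, 8)))   # same subspace
score_indep = pwcca(X, rng.standard_normal((300, 6)))    # unrelated representation
```

Note that the weights come from $\mathbf{X}$ only, so under this reading the score is not symmetric in its two arguments.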
Comparison of Metrics
| Metric | Invariance | Handles Different Widths | Robust to Noise | Identifies Same Architecture |
|---|---|---|---|---|
| CKA | Orthogonal transforms | Yes | Yes | Yes |
| CCA | Invertible linear transforms | Needs SVD first | Only with SVCCA | Yes, but less discriminative |
| PWCCA | Same as CCA | Needs SVD first | Yes (projection weights) | Yes, most sensitive |
| Neuron-by-neuron | None (brittle) | No | No | Requires aligned neurons |
Manifold Capacity and Separability
The manifold capacity theory (Chung et al., 2018) quantifies how many manifolds can be packed into a representation space while remaining linearly separable. For $P$ class manifolds in $\mathbb{R}^N$, with each manifold having radius $R_M$ and dimension $D_M$:
The maximum number of separable manifolds scales as:
$$P_{\max} \propto N \cdot \left(\frac{1}{R_M}\right)^{D_M}$$
Key insight: reducing the radius (compressing within-class variability) or the dimension (simplifying the manifold) increases capacity exponentially. Good representations have:
- Small within-class radius (tight clustering of same-class points)
- Large between-class separation
- Low-dimensional manifolds for each class
This connects directly to the neural collapse phenomenon (Papyan et al., 2020): at the terminal phase of training, class representations collapse to their means, achieving maximal capacity.
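The class-mean geometry reported for neural collapse is a simplex equiangular tight frame (ETF), which Problem 5 of this subject returns to. The sketch below builds one with the standard construction $\sqrt{K/(K-1)}\,(I - \tfrac{1}{K}\mathbf{1}\mathbf{1}^T)$ and inspects its Gram matrix. The function name `simplex_etf` and the choice $K = 4$ are illustrative assumptions.

```python
import numpy as np

def simplex_etf(num_classes):
    """Columns are K unit-norm class means with equal pairwise inner products."""
    K = num_classes
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

K = 4
M = simplex_etf(K)
gram = M.T @ M                        # inner products between class means
off_diag = gram[~np.eye(K, dtype=bool)]
```

The diagonal of `gram` is 1 and every off-diagonal entry equals $-1/(K-1)$: all class means have the same norm and are separated by the same, maximal angle.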
Geometry of the Loss Landscape
The representation geometry is closely connected to the loss landscape (Phase 14). The Fisher information matrix $I(\theta)$ (24-01) defines a Riemannian metric on parameter space, and hence on the family of functions the network can represent. When the NTK (24-03) is well-conditioned, gradient descent converges quickly, and the minima it reaches are often flat ones that generalize well; empirically, flat minima tend to accompany representations with good manifold untangling.
Sharp vs. flat minima:
- Sharp minima: Small parameter perturbations cause large changes in the function → representations are fragile
- Flat minima: The function is robust to parameter variation → representations are well-separated and robust
The Hessian eigenvalues of the loss at convergence reveal the sharpness/flatness of the representation. Small top eigenvalues indicate a flat minimum and typically better generalization.
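For large models the Hessian is never formed explicitly; its top eigenvalue is usually estimated from Hessian-vector products. The sketch below uses power iteration with finite-difference Hessian-vector products on a toy quadratic loss. The function name `top_hessian_eigenvalue`, the step size, and the toy loss are assumptions; a real network would use automatic differentiation instead of finite differences.

```python
import numpy as np

def top_hessian_eigenvalue(grad_fn, theta, n_iter=100, eps=1e-4, seed=0):
    """Power iteration on the Hessian, using central-difference HVPs of grad_fn."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(n_iter):
        # H v is approximated by (g(theta + eps v) - g(theta - eps v)) / (2 eps)
        hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)
        eig = float(v @ hv)                 # Rayleigh quotient for the current v
        norm = np.linalg.norm(hv)
        if norm == 0.0:                     # flat in every probed direction
            break
        v = hv / norm
    return eig

# Toy loss L(theta) = 0.5 * theta^T A theta, whose Hessian is A.
A = np.diag([10.0, 1.0, 0.1])
grad_fn = lambda theta: A @ theta
sharpness = top_hessian_eigenvalue(grad_fn, theta=np.zeros(3))
```

Power iteration converges to the eigenvalue of largest magnitude, which at a minimum (positive semidefinite Hessian) is the top eigenvalue referred to above.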
Key Terms
- Increasing effective rank
- Intrinsic dimension
- Large between-class separation
- Low-dimensional manifolds
- Neural collapse
- SVCCA
- Small within-class radius
Worked Examples
Example 1: Intrinsic Dimension of a Swiss Roll
The Swiss roll is a 2D manifold embedded in $\mathbb{R}^3$: $(x, y, z) = (t\cos t, t\sin t, s)$ with $t \in [3\pi/2, 9\pi/2]$, $s \in [0, 1]$. Sample 1000 points and estimate the intrinsic dimension using PCA and the Two-NN method.
Solution:
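PCA: The data lives in $\mathbb{R}^3$. After centering, the two largest singular values are comparable in size (both come from the spiral in the $x$-$y$ plane), and the third is more than an order of magnitude smaller because $s$ spans only $[0, 1]$. The first two components explain over 99% of the variance, so $d_{\text{PCA}} = 2$ for $\tau = 0.95$ or $\tau = 0.99$. This agreement with the true dimension is partly a coincidence of the narrow $s$ range: PCA is picking up the plane that contains the spiral, not the two intrinsic coordinates $(t, s)$. For a taller roll, with the range of $s$ comparable to the spiral's diameter, all three singular values would be comparable and PCA would report 3.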
Two-NN: For each point, compute the ratio $\mu = r_2/r_1$. For a true 2D manifold, $\mu$ follows a Pareto distribution with $d=2$. The MLE $\hat{d} = N / \sum_i \log\mu_i \approx 2.0$. Both methods correctly identify $d=2$.
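The experiment can be reproduced with a short script. The seed and the variable names are assumptions, and the exact numbers depend on the random sample, so the sketch computes the estimates rather than quoting values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=n)
s = rng.uniform(0.0, 1.0, size=n)
X = np.stack([t * np.cos(t), t * np.sin(t), s], axis=1)   # Swiss roll in R^3

# PCA estimate with threshold tau = 0.95
Xc = X - X.mean(axis=0)
sing = np.linalg.svd(Xc, compute_uv=False)                # descending singular values
explained = np.cumsum(sing**2) / np.sum(sing**2)
d_pca = int(np.searchsorted(explained, 0.95)) + 1

# Two-NN estimate from brute-force pairwise distances
diff = X[:, None, :] - X[None, :, :]                      # shape (n, n, 3)
dist = np.sqrt(np.sum(diff**2, axis=2))
np.fill_diagonal(dist, np.inf)                            # exclude self-distances
nearest = np.sort(dist, axis=1)[:, :2]                    # columns: r1, r2
d_two_nn = n / np.sum(np.log(nearest[:, 1] / nearest[:, 0]))

print("singular values:", sing)
print("d_pca:", d_pca, "d_two_nn:", round(float(d_two_nn), 2))
```

Printing `sing` makes it easy to see which ambient directions carry the variance, which helps in interpreting the PCA result.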
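Answer:
Both PCA and Two-NN give $d \approx 2$, but for different reasons. PCA reports 2 here only because the $s$ direction carries very little variance; its two leading components span the $x$-$y$ plane of the spiral, so projecting onto them does not unroll the sheet. For manifolds whose curvature spreads variance across all ambient directions (e.g., a taller Swiss roll or a knot embedding), PCA overestimates the dimension because the manifold isn't globally linearizable, while Two-NN still works because it's a local estimator.
Example 2: CKA Between Corresponding Layers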
Two ResNet-18 networks are trained on CIFAR-10 with different random seeds, achieving similar accuracy. Their layer-4 representations (before the final FC) are $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^{1000 \times 512}$. Compute linear CKA and interpret.
Solution:
Center the representations: $\tilde{\mathbf{X}} = \mathbf{X} - \boldsymbol{\mu}_X$, $\tilde{\mathbf{Y}} = \mathbf{Y} - \boldsymbol{\mu}_Y$.
Compute Gram matrices: $K_X = \tilde{\mathbf{X}}\tilde{\mathbf{X}}^T$, $K_Y = \tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^T$.
HSIC (Hilbert-Schmidt Independence Criterion): $\text{HSIC}(\mathbf{X}, \mathbf{Y}) = \frac{1}{(N-1)^2}\text{tr}(K_X H K_Y H)$ where $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$.
Linear CKA: $\text{CKA} = \frac{\text{HSIC}(\mathbf{X}, \mathbf{Y})}{\sqrt{\text{HSIC}(\mathbf{X}, \mathbf{X}) \cdot \text{HSIC}(\mathbf{Y}, \mathbf{Y})}}$.
If $\text{CKA} \approx 0.92$, this indicates the two networks learned very similar representations at this layer — consistent across random seeds. Typical CKA values for corresponding ResNet layers are 0.85-0.95, while different layers within the same network have CKA $\approx$ 0.3-0.7.
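The Gram-matrix steps above can be sketched directly and compared against the feature-space formula from the Core Content. Random matrices stand in for the ResNet activations here (they are not real network features), and the function names are assumptions.

```python
import numpy as np

def hsic(K, L):
    """Empirical HSIC: tr(K H L H) / (N - 1)^2 with centering matrix H."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def cka_from_grams(X, Y):
    """Linear CKA computed from the N x N Gram matrices."""
    K, L = X @ X.T, Y @ Y.T
    return float(hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L)))

def cka_from_features(X, Y):
    """Linear CKA computed from p x q cross-covariances of centered features."""
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))            # synthetic stand-in for layer features
Y = X @ rng.standard_normal((20, 12)) + 0.5 * rng.standard_normal((100, 12))

cka_gram = cka_from_grams(X, Y)
cka_feat = cka_from_features(X, Y)
```

The two routes agree up to round-off. The Gram form costs $O(N^2)$ memory and the feature form $O(pq)$, so the cheaper one depends on whether there are more samples or more features.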
Answer:
CKA $\approx 0.92$ indicates strong agreement between the two networks' representations at this layer. The representations are essentially the same up to an orthogonal transformation. This is evidence that neural networks converge to similar internal representations for the same task, supporting the idea that the learned manifold geometry is determined by the data+architecture, not by random initialization.
Example 3: CCA for Measuring Shared Subspace
Two representation matrices $\mathbf{X} \in \mathbb{R}^{500 \times 100}$, $\mathbf{Y} \in \mathbb{R}^{500 \times 80}$ are centered. SVD gives $\mathbf{X} = \mathbf{U}_X \mathbf{S}_X \mathbf{V}_X^T$ (and similarly for $\mathbf{Y}$). The top 30 singular vectors of each are kept. The CCA gives canonical correlations $\rho_1, \ldots, \rho_{30}$ with $\rho_1 = 0.95$, $\rho_2 = 0.88$, $\rho_3 = 0.72$, and then a sharp drop to $\rho_i < 0.3$ for $i \geq 4$. Interpret this result.
Solution:
The first 3 canonical correlations are high (>0.7), indicating a 3-dimensional shared subspace between the two representations. After that, correlations drop sharply — there's little shared structure beyond 3 dimensions.
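Mean CCA: $\bar{\rho} \approx 0.24$ (diluted by many low correlations; this value corresponds to the remaining 27 correlations averaging about 0.17).
SVCCA: because CCA was run on the top-30 SVD directions of each representation, this mean over all 30 correlations is the SVCCA score, so it is equally diluted.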
PWCCA: heavily weights the first 3 directions → $\text{PWCCA} \approx 0.78$.
This is why PWCCA is preferred: it captures that the two representations are strongly aligned in the important (high-variance) directions, even if the mean over all directions is low. The 3 shared dimensions likely correspond to the most important features for the task.
Answer:
The representations share a 3-dimensional dominant subspace (high $\rho_1$ through $\rho_3$) and diverge in higher dimensions. PWCCA captures this well (~0.78), while mean CCA (~0.24) masks the strong alignment in important directions. This pattern is common: different networks agree on the "coarse" representation structure but differ in fine-grained details. The choice of metric significantly affects the conclusion.
Practice Problems
Problem 1: Estimate the intrinsic dimension of a $d$-dimensional Gaussian $\mathcal{N}(\mathbf{0}, I_d)$ with $d=10$, embedded in $\mathbb{R}^{100}$ by multiplying with a random $100 \times 10$ matrix $\mathbf{A}$ (all entries $\sim \mathcal{N}(0,1)$). What happens as $N \to \infty$ with PCA? With Two-NN?
Answer:
The data is $\mathbf{x} = \mathbf{A}\mathbf{z}$ where $\mathbf{z} \sim \mathcal{N}(0, I_{10})$. The covariance is $\mathbf{A}\mathbf{A}^T \in \mathbb{R}^{100 \times 100}$, which has rank 10 (with probability 1).

**PCA:** As $N \to \infty$, the sample covariance converges to $\mathbf{A}\mathbf{A}^T$, which has exactly 10 nonzero eigenvalues. PCA correctly identifies $d=10$.

**Two-NN:** The data lives on a 10-dimensional affine subspace of $\mathbb{R}^{100}$. The Two-NN estimator also gives $\hat{d} \approx 10$ for large $N$, because the local geometry of a flat 10D affine subspace is exactly 10-dimensional. This shows that linear manifolds are handled by both methods.

**Caution:** If noise is added ($\mathbf{x} = \mathbf{A}\mathbf{z} + \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I_{100})$), the covariance eigenvalues become $\lambda_i + \sigma^2$ for the top 10 (where $\lambda_i$ are the nonzero eigenvalues of $\mathbf{A}\mathbf{A}^T$) and $\sigma^2$ for the rest, so PCA still shows a clear gap when $\sigma^2 \ll \lambda_{10}$. Two-NN is less robust to noise: at small scales, nearest-neighbor distances are dominated by the noise, which fills all 100 dimensions, so it overestimates the dimension. This is a common pitfall.
Problem 2: Prove that linear CKA is invariant to orthogonal transformations of either representation: $\text{CKA}(\mathbf{X}\mathbf{Q}_X, \mathbf{Y}) = \text{CKA}(\mathbf{X}, \mathbf{Y})$ where $\mathbf{Q}_X^T\mathbf{Q}_X = I$.
Answer:
Let $\mathbf{X}' = \mathbf{X}\mathbf{Q}_X$ with $\mathbf{Q}_X$ square and orthogonal. $\mathbf{X}'$ is still centered, because the column means of $\mathbf{X}$ are zero. Then $\mathbf{X}'^T\mathbf{X}' = \mathbf{Q}_X^T\mathbf{X}^T\mathbf{X}\mathbf{Q}_X$.

$\|\mathbf{X}'^T\mathbf{X}'\|_F = \|\mathbf{Q}_X^T\mathbf{X}^T\mathbf{X}\mathbf{Q}_X\|_F = \|\mathbf{X}^T\mathbf{X}\|_F$ (the Frobenius norm is unchanged by left- or right-multiplication with an orthogonal matrix).

Similarly, $\mathbf{Y}^T\mathbf{X}' = \mathbf{Y}^T\mathbf{X}\mathbf{Q}_X$, and $\|\mathbf{Y}^T\mathbf{X}\mathbf{Q}_X\|_F^2 = \|\mathbf{Y}^T\mathbf{X}\|_F^2$ (right-multiplication by an orthogonal matrix preserves the Frobenius norm).

Therefore: $\text{CKA}(\mathbf{X}', \mathbf{Y}) = \frac{\|\mathbf{Y}^T\mathbf{X}\mathbf{Q}_X\|_F^2}{\|\mathbf{X}'^T\mathbf{X}'\|_F \cdot \|\mathbf{Y}^T\mathbf{Y}\|_F} = \frac{\|\mathbf{Y}^T\mathbf{X}\|_F^2}{\|\mathbf{X}^T\mathbf{X}\|_F \cdot \|\mathbf{Y}^T\mathbf{Y}\|_F} = \text{CKA}(\mathbf{X}, \mathbf{Y})$.

This invariance is essential: it means CKA measures the *geometry* of the representation (which is preserved under rotations/reflections) rather than the specific coordinate axes. Two networks that learn the same geometry but in different coordinate systems will have CKA = 1.
Problem 3: A neural network's hidden representation at layer $\ell$ has an effective rank that increases from $\text{srank}(\mathbf{H}^{(1)}) = 5$ to $\text{srank}(\mathbf{H}^{(L)}) = 45$ across layers. What does this tell you about the network's computation? Is this typical?
Answer:
**Increasing effective rank** means the representation is becoming more distributed: the network is expanding the data into a higher-dimensional space where linear separation is easier. This is the opposite of dimensionality reduction: it's *dimensionality expansion* for separability.

A common pattern across depth is:
- **Early layers:** Compress input (remove noise, extract features); effective rank may decrease initially
- **Middle layers:** Expand into higher dimensions to create linear separability; effective rank increases
- **Final hidden layer:** May compress again before the classifier

The pattern you'd expect: decrease (input compression) → increase (expansion for separation) → maintain or slight decrease (classification-ready). A monotonic increase is unusual and might indicate that the early layers aren't doing enough compression, or the network is overparameterized.

**Connection to manifold untangling:** Expanding the effective dimension gives the manifold more "room" to untangle. Think of unknotting a tangled string: you sometimes need to pull it into a higher-dimensional space before you can lay it flat.
Problem 4: Show that for centered representations $\mathbf{X}$ and $\mathbf{Y}$, linear CKA can be written as $\frac{\|\mathbf{X}^T\mathbf{Y}\|_F^2}{\|\mathbf{X}^T\mathbf{X}\|_F \|\mathbf{Y}^T\mathbf{Y}\|_F}$. Does this formulation require $\mathbf{X}$ and $\mathbf{Y}$ to have the same number of features, or does it also work for different widths? Why?
Answer:
For centered data, the Gram matrix is $\mathbf{X}\mathbf{X}^T$ (up to scaling). The HSIC numerator: $\text{tr}((\mathbf{X}\mathbf{X}^T)(\mathbf{Y}\mathbf{Y}^T)) = \|\mathbf{X}^T\mathbf{Y}\|_F^2$.

This works regardless of $p$ and $q$: $\mathbf{X} \in \mathbb{R}^{N \times p}$, $\mathbf{Y} \in \mathbb{R}^{N \times q}$, then $\mathbf{X}^T\mathbf{Y} \in \mathbb{R}^{p \times q}$ and its Frobenius norm is defined.

The alternative formula $\frac{\|\mathbf{X}^T\mathbf{Y}\|_F^2}{\|\mathbf{X}^T\mathbf{X}\|_F \|\mathbf{Y}^T\mathbf{Y}\|_F}$ is actually equivalent to linear CKA when both are centered. The $\mathbf{X}^T\mathbf{Y}$ formulation naturally handles different dimensionalities because it computes the cross-covariance between all pairs of features across the two representations. The Frobenius norm on the $p \times q$ matrix aggregates all these cross-covariances.

This is a key advantage over CCA, which can only find $\min(p, q)$ directions. CKA considers all pairwise interactions.
Problem 5: The neural collapse phenomenon: at the terminal phase of training a classifier, the within-class covariance $\Sigma_W \to 0$ and the class means $\boldsymbol{\mu}_c$ converge to a simplex equiangular tight frame (ETF). Compute the CKA between the final hidden representations of two different classes $c$ and $c'$ in this limit. What does this say about the representation?
Answer:
At neural collapse, for any sample $\mathbf{x}$ from class $c$: $\mathbf{h}(\mathbf{x}) = \boldsymbol{\mu}_c$ (all within-class variance vanishes). For $K$ classes, the (globally centered, unit-norm) means form a simplex ETF: $\boldsymbol{\mu}_c^T\boldsymbol{\mu}_{c'} = \frac{K}{K-1}\delta_{cc'} - \frac{1}{K-1}$.

The representation matrix for $N$ samples (with $N_c$ per class, $N = \sum N_c$) has: $\mathbf{H}^T\mathbf{H} = \sum_c N_c \boldsymbol{\mu}_c \boldsymbol{\mu}_c^T$. The Gram matrix $\mathbf{H}\mathbf{H}^T$ is block-constant: samples from the same class have pairwise similarity 1, and samples from different classes have similarity $-\frac{1}{K-1}$.

Two different classes $c$ and $c'$: treat each class's samples as a separate representation matrix. Every row of $\mathbf{H}_c$ equals $\boldsymbol{\mu}_c$, so centering gives $\tilde{\mathbf{H}}_c = \mathbf{0}$, and likewise for $c'$. Both the numerator and the denominator of linear CKA vanish, so CKA is undefined ($0/0$) in the exact limit: there is no within-class structure left to compare. Just before collapse, its value is determined entirely by residual within-class noise. All remaining structure is between classes, in the ETF pattern of equal angles between class centers.

**Interpretation:** Neural collapse means the representation has achieved maximum manifold untangling: each class manifold has collapsed to a single point, and all class centers are maximally separated. This is the ideal for linear classification but may harm transfer learning because all within-class variation has been discarded.
Summary
Key takeaways:
- The manifold hypothesis: real data lives on low-dimensional manifolds embedded in high-dimensional ambient space; this is a leading geometric explanation for why deep learning works
- Intrinsic dimension can be estimated via PCA (global/linear) or Two-NN/Levina-Bickel (local/nonlinear); typical estimates are 10-80 for common datasets, against ambient dimensions ranging from hundreds to hundreds of thousands
- Neural networks perform manifold untangling — each layer progressively flattens the data manifold until it becomes linearly separable in the final layer
- CKA is the preferred metric for comparing representations: invariant to orthogonal transforms, robust to layer width, and captures shared geometry
- CCA/SVCCA/PWCCA identify aligned subspaces; PWCCA weights by importance, avoiding dilution by noise directions
- Neural collapse and manifold capacity theory explain why good representations have low within-class radius and maximal between-class separation
Quiz
Question 1: Which statement best describes the manifold hypothesis?
A. All data lies on a single global linear subspace
B. High-dimensional real-world data is concentrated near low-dimensional manifolds
C. Every neural network layer projects data onto a lower-dimensional manifold
D. Manifold structure is irrelevant to deep learning performance
Correct Answer: B. High-dimensional real-world data is concentrated near low-dimensional manifolds
Explanation: The manifold hypothesis posits that despite living in $\mathbb{R}^D$ with $D$ large, the data-generating process involves many fewer degrees of freedom ($d \ll D$). Option A is too restrictive — the manifolds are generally nonlinear. Option C describes what networks do (sometimes), not the hypothesis itself. Option D is contradicted by extensive evidence linking manifold geometry to network performance.
Question 2: What does the Two-NN intrinsic dimension estimator exploit?
A. The global covariance structure of the data
B. The ratio of distances to the first and second nearest neighbors
C. The number of PCA components needed for 95% variance
D. The rank of the data matrix
Correct Answer: B. The ratio of distances to the first and second nearest neighbors
Explanation: In a $d$-dimensional space, the ratio $r_2/r_1$ follows a Pareto($d$) distribution. The Two-NN method estimates $d$ from the log-ratios: $\hat{d} = N / \sum \log(r_{2,i}/r_{1,i})$. This is a local estimator — it uses only nearest-neighbor information, making it valid for nonlinear manifolds where PCA (a global linear method) fails. Option A describes PCA, option D describes matrix factorization approaches.
Question 3: Which representation similarity metric is invariant to invertible linear transformations?
A. CKA
B. Linear regression from one representation to another
C. CCA
D. Euclidean distance between representation vectors
Correct Answer: C. CCA
Explanation: CCA finds the maximum correlations achievable by linear projections. If you apply an invertible linear transformation $\mathbf{Y}' = \mathbf{Y}\mathbf{A}$, the canonical correlations don't change because any linear projection of $\mathbf{Y}'$ is equivalent to some linear projection of $\mathbf{Y}$ via $\mathbf{A}$. CKA is invariant only to orthogonal transformations, not arbitrary invertible ones. Option D is not invariant to anything useful.
Question 4: A network's hidden layer has effective rank 5 at layer 2 and effective rank 45 at layer 8 (final hidden layer). What computational strategy is the network employing?
A. Dimensionality reduction for compression
B. Dimensionality expansion for linear separability
C. Random projection
D. Overfitting to noise
Correct Answer: B. Dimensionality expansion for linear separability
Explanation: Going from low to high effective rank means the network is expanding its representation into a higher-dimensional space. This is the kernel trick in learned form: projecting data into a high-dimensional space where it becomes linearly separable. It's a form of "untangling" — giving the manifold more room to unfold. Option A would show decreasing rank. Options C and D are possible pathologies but the expansion pattern is consistent with normal successful training.
Question 5: What does a CKA value of 0.95 between two different networks' corresponding layers indicate?
A. The networks are identical
B. The representations encode very similar similarity structure (up to orthogonal rotation)
C. The networks must have the same architecture
D. CKA is miscalibrated and this value is meaningless
Correct Answer: B. The representations encode very similar similarity structure (up to orthogonal rotation)
Explanation: CKA = 1 means identical representational geometry up to orthogonal transformation. A value of 0.95 means the representations are nearly identical in terms of which inputs have similar/dissimilar representations. The networks could have different architectures, different initializations, or even different widths — CKA measures the similarity of the geometry, not the coordinate values. This is a robust finding: networks trained on the same task tend to converge to similar representational geometries.
Question 6: Neural collapse refers to:
A. The network's weights becoming zero
B. Within-class representations collapsing to their class means and class means forming a symmetric structure
C. The loss function becoming non-convex
D. The manifold hypothesis being violated
Correct Answer: B. Within-class representations collapsing to their class means and class means forming a symmetric structure
Explanation: Neural collapse (Papyan et al., 2020) is the phenomenon where, late in training, the last-layer features of each class converge to their class mean ($\Sigma_W \to 0$), the class means converge to a simplex ETF (maximally separated and symmetric), and the classifier weights align with the class means. This represents perfect manifold untangling — each class manifold has collapsed to a point, achieving maximum capacity for linear separability.
Pitfalls
- Using PCA intrinsic dimension on nonlinear manifolds: PCA estimates the global linear dimension, i.e., the dimension of the best-fit affine subspace. For curved manifolds (e.g., Swiss rolls, knots), PCA can dramatically overestimate the dimension because it can't "see" the nonlinear structure. Always complement PCA with local estimators (Two-NN, Levina-Bickel) when dealing with potentially nonlinear manifolds.
- Confusing CKA and CCA invariance properties: CKA is invariant to orthogonal transformations but NOT to arbitrary invertible linear transformations. CCA is invariant to all invertible linear transformations. A high CKA between two representations means they share the same geometry up to rotation; a low CKA doesn't mean the representations are fundamentally different, since they could be related by a non-orthogonal linear transform (which CCA would detect). Use both metrics for a complete picture.
- Misinterpreting effective rank trends: An increasing effective rank across layers doesn't necessarily mean the network is "doing well." Some increase is normal (expansion for separability), but if all layers have very high effective rank, the network may not be compressing information and may simply be passing noise forward. Look for spectral gaps, not just rank magnitudes.
- Treating neural collapse as universally desirable: Neural collapse (within-class covariance → 0, class means → simplex ETF) maximizes linear separability, which is great for the current task. But it discards all within-class variation, which can severely harm transfer learning: downstream tasks that need fine-grained distinctions within classes will struggle. Useful representations strike a balance: enough compression for the task, enough variation preserved for reuse.
Next Steps
Next up: 24-05 — Disentanglement and Representation Theory — where you'll learn about learning representations whose individual dimensions correspond to meaningful and independent factors of variation, using $\beta$-VAEs, mutual information gaps, and group-theoretic frameworks.