
24-04 — Manifold Hypothesis and Representation Geometry

Phase: 24 (Information Geometry & Advanced Theory)

Subject: 24-04

Prerequisites: Phase 16-17 (Neural Networks & Deep Learning: architectures, representations), Phase 6 (Linear Algebra: SVD, matrix decompositions), 24-03 NTK

Next subject: 24-05 (Disentanglement and Representation Theory)


Learning Objectives

By the end of this subject, you will be able to:

  1. State and justify the manifold hypothesis and its implications for deep learning
  2. Estimate intrinsic dimension of data and learned representations
  3. Define and compute representation similarity metrics (CKA, CCA, PWCCA)
  4. Understand how neural networks progressively "untangle" data manifolds through their layers
  5. Analyze the geometry of hidden representations using spectral methods

Core Content

The Manifold Hypothesis

The manifold hypothesis states:

Real-world high-dimensional data lies on (or near) a low-dimensional manifold embedded in the ambient space.

For example:

- Natural images ($\mathbb{R}^{H \times W \times 3}$, with up to millions of dimensions) lie near a manifold of much lower intrinsic dimension: the set of "valid" natural images is a tiny subset of all possible pixel configurations
- Speech signals live on a manifold defined by vocal tract physics
- Text embeddings concentrate near semantic manifolds

Formally: data $\mathbf{x} \in \mathbb{R}^D$ is generated as $\mathbf{x} = g(\mathbf{z}) + \boldsymbol{\epsilon}$ where $\mathbf{z} \in \mathbb{R}^d$ are latent factors, $d \ll D$, $g: \mathbb{R}^d \to \mathbb{R}^D$ is a smooth embedding, and $\boldsymbol{\epsilon}$ is small noise.

⚠️ CRITICAL: The manifold hypothesis is a central geometric motivation for deep learning. On this view, neural networks succeed in part because they learn to flatten the data manifold: each layer progressively untangles the nonlinear structure so that the final representation is (close to) linearly separable.

Intrinsic Dimension

The intrinsic dimension (ID) of a dataset is the minimum number of variables needed to describe the data without significant information loss. Several estimation methods exist:

1. PCA-Based (Global Linear)

Compute the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_D$ of the data matrix. The ID is the number of singular values needed to explain a fraction $\tau$ of the total variance:

$$d_{\text{PCA}} = \min\left\{k : \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^D \sigma_i^2} \geq \tau\right\}$$

This gives a global linear estimate — it misses nonlinear manifold structure.
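The thresholding rule above can be sketched in a few lines of NumPy. The function name `pca_intrinsic_dim`, the synthetic data, and the thresholds are illustrative assumptions, not part of the course material.

```python
import numpy as np

def pca_intrinsic_dim(X, tau=0.95):
    """Smallest k whose top-k singular values explain a fraction tau of the variance."""
    Xc = X - X.mean(axis=0)                       # center so sigma_i^2 measures variance
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, descending
    ratio = np.cumsum(s**2) / np.sum(s**2)        # cumulative explained variance
    # first index where ratio >= tau (tiny tolerance guards against round-off)
    return int(np.searchsorted(ratio, tau - 1e-12) + 1)

rng = np.random.default_rng(0)

# Linear case: 3 latent factors pushed through a hypothetical linear map into R^20.
Z = rng.normal(size=(500, 3))
A = rng.normal(size=(3, 20))
d_lin = pca_intrinsic_dim(Z @ A, tau=0.99)

# Nonlinear case: a circle is a 1D manifold, but PCA needs both axes.
theta = rng.uniform(0.0, 2.0 * np.pi, size=500)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
d_circle = pca_intrinsic_dim(circle, tau=0.95)

print(d_lin, d_circle)
```

The circle illustrates the limitation stated above: the manifold is one-dimensional, yet the variance is split evenly between two linear directions.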

2. Two-NN Estimator (Local, Nonlinear)

A more robust method (Facco et al., 2017): For each point, compute distances $r_1$ and $r_2$ to its nearest and second-nearest neighbor. The ratio $\mu = r_2/r_1$ follows a Pareto distribution with parameter $d$ (the intrinsic dimension):

$$P(\mu) = d \cdot \mu^{-(d+1)}$$

The MLE for $d$ is:

$$\hat{d} = \frac{N}{\sum_{i=1}^N \log(r_{2,i}/r_{1,i})}$$

This estimator works remarkably well because it exploits the local scaling of volumes: in a $d$-dimensional space, the volume ratio of balls with radii $r_2$ and $r_1$ scales as $(r_2/r_1)^d$.
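A minimal sketch of the Two-NN MLE, assuming brute-force pairwise distances (fine for a few thousand points) and no duplicate points. The function name `two_nn_dimension` and the curved test surface are illustrative choices; Facco et al. also describe a regression-based fit that discards the largest ratios, which is not shown here.

```python
import numpy as np

def two_nn_dimension(X):
    """Two-NN MLE of intrinsic dimension: N / sum_i log(r2_i / r1_i)."""
    sq = np.sum(X**2, axis=1)
    # Squared pairwise distances via the Gram matrix (avoids an N x N x D array).
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)            # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)           # a point is not its own neighbor
    nearest = np.partition(d2, 1, axis=1)[:, :2]   # two smallest per row, in order
    r1 = np.sqrt(nearest[:, 0])
    r2 = np.sqrt(nearest[:, 1])
    mask = r1 > 0                          # drop exact duplicates (assumed rare)
    return float(mask.sum() / np.log(r2[mask] / r1[mask]).sum())

rng = np.random.default_rng(0)
uv = rng.uniform(0.0, 1.0, size=(2000, 2))          # 2 latent coordinates
# Hypothetical smooth embedding: a curved sheet (paraboloid patch) in R^3.
X = np.column_stack([uv[:, 0], uv[:, 1], uv[:, 0]**2 + uv[:, 1]**2])
d_hat = two_nn_dimension(X)
print(f"Two-NN estimate: {d_hat:.2f}")
```

The estimate should land close to 2 even though the sheet is curved, since only the two nearest neighbors of each point enter the formula.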

3. Maximum Likelihood Estimation (Levina-Bickel)

For each point, consider its $k$ nearest neighbors within radius $R$. The MLE is:

$$\hat{d}_k(\mathbf{x}_i) = \left(\frac{1}{k-1}\sum_{j=1}^{k-1} \log\frac{R_k(\mathbf{x}_i)}{R_j(\mathbf{x}_i)}\right)^{-1}$$

where $R_j(\mathbf{x}_i)$ is the distance to the $j$-th nearest neighbor. Average over all points to get the global estimate.
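The per-point estimate and its average can be sketched as follows. This is a brute-force illustration with an assumed function name (`levina_bickel_dimension`) and synthetic data; with the $1/(k-1)$ normalization used above, the averaged estimate tends to sit slightly above the true dimension for small $k$.

```python
import numpy as np

def levina_bickel_dimension(X, k=15):
    """Average of per-point Levina-Bickel estimates using k nearest neighbors."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared pairwise distances
    np.maximum(d2, 0.0, out=d2)
    np.fill_diagonal(d2, np.inf)
    R = np.sqrt(np.sort(d2, axis=1)[:, :k])       # R[i, j-1] = distance to j-th neighbor
    log_ratio = np.log(R[:, -1:] / R[:, :-1])     # log(R_k / R_j) for j = 1..k-1
    d_local = 1.0 / log_ratio.mean(axis=1)        # one estimate per point
    return float(d_local.mean()), d_local

rng = np.random.default_rng(0)
Z = rng.normal(size=(1500, 3))                    # 3 latent dimensions
A = rng.normal(size=(3, 10))                      # hypothetical linear embedding into R^10
d_global, d_local = levina_bickel_dimension(Z @ A, k=15)
print(f"global estimate: {d_global:.2f}")
```

The array `d_local` is worth inspecting on real data: a wide spread of per-point estimates suggests that the dimension varies across the dataset or that $k$ is too small.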

Typical intrinsic dimensions:

| Dataset | Ambient Dimension | Estimated ID |
|---------|------------------|-------------|
| MNIST | 784 | ~10-15 |
| CIFAR-10 | 3072 | ~25-40 |
| ImageNet | 150528 | ~40-80 |
| GPT-2 embeddings | 768-1600 | ~10-30 |

Neural Networks as Manifold Untangling

The function of deep neural network layers can be understood geometrically:

$$\mathbf{x} \xrightarrow{\text{layer 1}} \mathbf{h}^{(1)} \xrightarrow{\text{layer 2}} \mathbf{h}^{(2)} \xrightarrow{\cdots} \mathbf{h}^{(L)} \xrightarrow{\text{classifier}} \mathbf{y}$$

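At each layer, the representation $\mathbf{h}^{(\ell)}$ lives on a manifold $M_\ell$ embedded in $\mathbb{R}^{n_\ell}$. Loosely speaking, deep networks apply a sequence of continuous deformations that gradually flatten the data manifold. (These maps are true homeomorphisms only when each layer is invertible; ReLU layers and changes in width can fold or collapse parts of the manifold.)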

This is why linear classifiers work on top of deep networks — the network's job is to make the data linearly separable in its final hidden representation.

⚠️ CRITICAL: By the Whitney embedding theorem, any smooth $d$-dimensional manifold can be embedded in $\mathbb{R}^{2d}$ without self-intersections, so a hidden width of $2d$ is always sufficient (particular manifolds may need less, but never fewer than $d$ dimensions). In practice, layers are much wider, providing "room" to untangle the manifold.

Spectral Analysis of Representations

For a batch of $N$ data points, the layer-$\ell$ representation forms a matrix $\mathbf{H}^{(\ell)} \in \mathbb{R}^{N \times n_\ell}$. Its singular value spectrum reveals geometric structure:

The stable rank of $\mathbf{H}$, used here as a measure of effective rank, is:

$$\text{srank}(\mathbf{H}) = \frac{\|\mathbf{H}\|_F^2}{\|\mathbf{H}\|_2^2} = \frac{\sum_i \sigma_i^2}{\sigma_{\max}^2}$$

This is always between 1 and $\min(N, n_\ell)$, with higher values indicating more distributed (less compressed) representations.
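A short sketch of the stable-rank computation, checked on three synthetic matrices whose spectra are known by construction. The helper name `stable_rank` and the test matrices are illustrative assumptions.

```python
import numpy as np

def stable_rank(H):
    """||H||_F^2 / ||H||_2^2, computed from the singular values."""
    s = np.linalg.svd(H, compute_uv=False)
    return float(np.sum(s**2) / s[0]**2)

rng = np.random.default_rng(0)

# Rank-1 representation: every sample is a multiple of one direction.
H_rank1 = np.outer(rng.normal(size=50), rng.normal(size=8))

# Orthonormal columns: all 8 singular values equal 1, the most "distributed" case.
H_flat, _ = np.linalg.qr(rng.normal(size=(50, 8)))

# Prescribed spectrum (4, 2, 1, 1): stable rank = (16 + 4 + 1 + 1) / 16 = 1.375.
U, _ = np.linalg.qr(rng.normal(size=(50, 4)))
V, _ = np.linalg.qr(rng.normal(size=(8, 4)))
H_decay = U @ np.diag([4.0, 2.0, 1.0, 1.0]) @ V.T

print(stable_rank(H_rank1), stable_rank(H_flat), stable_rank(H_decay))
```

Note that the third matrix has algebraic rank 4 but stable rank 1.375: the measure is dominated by the largest singular value, which is why it reads as "compression" rather than as a count of nonzero directions.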

Representation Similarity Metrics

How do we compare representations across different networks, layers, or training runs? Several metrics exist.

CKA (Centered Kernel Alignment)

CKA measures the similarity between two representation matrices $\mathbf{X} \in \mathbb{R}^{N \times p}$ and $\mathbf{Y} \in \mathbb{R}^{N \times q}$ (same $N$ points, possibly different dimensionalities):

$$\text{CKA}(\mathbf{X}, \mathbf{Y}) = \frac{\|\mathbf{Y}^T\mathbf{X}\|_F^2}{\|\mathbf{X}^T\mathbf{X}\|_F \cdot \|\mathbf{Y}^T\mathbf{Y}\|_F}$$

This is linear CKA (the most common variant; kernel CKA replaces the inner products with a kernel). Written in terms of covariance matrices, it is:

$$\text{CKA}_{\text{linear}}(\mathbf{X}, \mathbf{Y}) = \frac{\|\text{Cov}(\mathbf{X}, \mathbf{Y})\|_F^2}{\|\text{Cov}(\mathbf{X})\|_F \cdot \|\text{Cov}(\mathbf{Y})\|_F}$$

In both forms $\mathbf{X}$ and $\mathbf{Y}$ are centered (column means subtracted).

Properties:

- $0 \leq \text{CKA} \leq 1$, with 1 meaning identical representations up to orthogonal transformation
- Invariant to orthogonal transformations and isotropic scaling of either representation
- NOT invariant to invertible linear transformations (unlike CCA)
- Robust to the number of neurons: comparing layers of different widths works
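The properties listed above can be checked numerically. The sketch below assumes synthetic Gaussian "representations"; `linear_cka` is an illustrative helper name, not a library function.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA in feature-space form; centers both inputs first."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    num = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))

# Orthogonal transform plus isotropic scaling: CKA stays at 1.
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
cka_rot = linear_cka(X, 3.0 * X @ Q)

# Generic invertible (non-orthogonal) transform: CKA drops below 1.
A = rng.normal(size=(16, 16))
cka_lin = linear_cka(X, X @ A)

# Different widths (16 vs 40 features) are handled without any projection step.
cka_width = linear_cka(X, rng.normal(size=(200, 40)))

print(f"rotation: {cka_rot:.4f}, invertible map: {cka_lin:.4f}, widths: {cka_width:.4f}")
```

The last value is not exactly 0 even though the two matrices are independent: with finitely many samples, chance correlations give a positive baseline, which is worth keeping in mind when reading small CKA values.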

⚠️ CRITICAL: CKA is one of the most widely used tools for representation comparison in modern deep learning. Kornblith et al. (2019) showed it reliably identifies corresponding layers across architectures and training runs. Like any single summary number, it should still be read alongside other metrics.

CCA (Canonical Correlation Analysis) and SVCCA

CCA finds linear projections $\mathbf{X}\mathbf{a}$ and $\mathbf{Y}\mathbf{b}$ that maximize correlation:

$$\rho = \max_{\mathbf{a}, \mathbf{b}} \text{corr}(\mathbf{X}\mathbf{a}, \mathbf{Y}\mathbf{b})$$

This gives $k = \min(p, q)$ canonical correlations $\rho_1 \geq \rho_2 \geq \cdots \geq \rho_k$. The mean CCA similarity is:

$$\bar{\rho}_{\text{CCA}} = \frac{1}{k}\sum_{i=1}^k \rho_i$$

SVCCA (Raghu et al., 2017): Apply SVD to each representation first, keeping only directions with significant variance, then compute CCA. This removes noise dimensions and focuses on the signal subspace.
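One standard way to compute canonical correlations is via orthonormal bases of the two column spaces: the singular values of $\mathbf{Q}_X^T\mathbf{Q}_Y$ are the $\rho_i$. The sketch below uses that route and adds an SVCCA-style truncation; function names, the 99% variance threshold, and the synthetic data are assumptions for illustration, and the code assumes $N$ exceeds both widths with full column rank.

```python
import numpy as np

def cca_correlations(X, Y):
    """Canonical correlations as singular values of Qx^T Qy (descending)."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))   # orthonormal basis of centered X
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)              # remove round-off overshoot

def svcca(X, Y, keep=0.99):
    """SVCCA sketch: keep top singular directions of each input, then run CCA."""
    def top_subspace(M):
        Mc = M - M.mean(axis=0)
        U, s, _ = np.linalg.svd(Mc, full_matrices=False)
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep) + 1)
        k = min(k, len(s))
        return U[:, :k] * s[:k]                # data expressed in its top-k directions
    rho = cca_correlations(top_subspace(X), top_subspace(Y))
    return float(rho.mean()), rho

rng = np.random.default_rng(0)

# Invertible linear map: every canonical correlation equals 1.
X = rng.normal(size=(300, 10))
rho_lin = cca_correlations(X, X @ rng.normal(size=(10, 10)))

# Three shared latent directions plus independent noise features on each side.
S = rng.normal(size=(300, 3))
X2 = np.hstack([S, rng.normal(size=(300, 7))])
Y2 = np.hstack([S @ rng.normal(size=(3, 3)), rng.normal(size=(300, 5))])
rho_shared = cca_correlations(X2, Y2)
svcca_mean, _ = svcca(X2, Y2)

print(np.round(rho_shared, 2), round(svcca_mean, 2))
```

In the second example the first three correlations are essentially 1 and the rest reflect only chance alignment, so the mean understates how strongly the shared subspace agrees.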

PWCCA (Projection-Weighted CCA)

Morcos et al. (2018) noted that not all CCA directions are equally important. PWCCA weights each canonical correlation by its importance to the original representation:

$$\text{PWCCA} = \frac{\sum_{i=1}^k \alpha_i \rho_i}{\sum_{i=1}^k \alpha_i}, \quad \alpha_i = \sum_{j=1}^p |\langle \mathbf{h}_i, \mathbf{x}_j \rangle|$$

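where $\mathbf{h}_i$ is the $i$-th canonical variable on the $\mathbf{X}$ side (the projection of $\mathbf{X}$ onto the $i$-th CCA direction) and $\mathbf{x}_j$ is the activation vector of the $j$-th neuron (the $j$-th column of $\mathbf{X}$). PWCCA gives higher weight to CCA directions that are actually used by the representation.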
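A minimal sketch of the weighting, assuming the canonical variables are taken as unit-norm vectors and the weights are computed from the $\mathbf{X}$ side only. The function name `pwcca` and the synthetic setup (three high-variance shared directions plus noise features) are illustrative; this is not the reference implementation of Morcos et al.

```python
import numpy as np

def pwcca(X, Y):
    """Projection-weighted CCA score, with weights computed from X."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    U, rho, _ = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    rho = np.clip(rho, 0.0, 1.0)
    H = Qx @ U                                  # columns h_i: canonical variables of X
    alpha = np.abs(H.T @ Xc).sum(axis=1)        # alpha_i = sum_j |<h_i, x_j>|
    weights = alpha / alpha.sum()
    return float(weights @ rho), rho, weights

rng = np.random.default_rng(0)
S = rng.normal(size=(400, 3))                   # shared latent signal
# In X the shared signal has 10x the scale of the 7 noise features.
X = np.hstack([10.0 * S, rng.normal(size=(400, 7))])
Y = np.hstack([S @ rng.normal(size=(3, 3)), rng.normal(size=(400, 5))])

score, rho, weights = pwcca(X, Y)
mean_cca = float(rho.mean())
print(f"mean CCA: {mean_cca:.2f}, PWCCA: {score:.2f}")
```

Because the shared directions account for most of what $\mathbf{X}$ encodes, they receive most of the weight, and the PWCCA score ends up well above the unweighted mean.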

Comparison of Metrics

| Metric | Invariance | Handles Different Widths | Robust to Noise | Identifies Same Architecture |
|--------|------------|--------------------------|-----------------|------------------------------|
| CKA | Orthogonal transforms | Yes | Yes | Yes |
| CCA | Invertible linear transforms | Needs SVD first | Only with SVCCA | Yes, but less discriminative |
| PWCCA | Same as CCA | Needs SVD first | Yes (projection weights) | Yes, most sensitive |
| Neuron-by-neuron | None (brittle) | No | No | Requires aligned neurons |

Manifold Capacity and Separability

The manifold capacity theory (Chung et al., 2018) quantifies how many manifolds can be packed into a representation space while remaining linearly separable. For $P$ class manifolds in $\mathbb{R}^N$, with each manifold having radius $R_M$ and dimension $D_M$:

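As a rough heuristic (the exact capacity in Chung et al. comes from a mean-field calculation rather than a closed-form power law), the maximum number of separable manifolds scales as: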

$$P_{\max} \propto N \cdot \left(\frac{1}{R_M}\right)^{D_M}$$

Key insight: reducing the radius (compressing within-class variability) or the dimension (simplifying the manifold) increases capacity, often substantially. Good representations have:

- Small within-class radius (tight clustering of same-class points)
- Large between-class separation
- Low-dimensional manifolds for each class

This connects directly to the neural collapse phenomenon (Papyan et al., 2020): at the terminal phase of training, class representations collapse to their means, achieving maximal capacity.

Geometry of the Loss Landscape

The representation geometry is closely connected to the loss landscape (Phase 14). The Fisher information matrix $I(\theta)$ (24-01) defines a Riemannian metric on parameter space (equivalently, on the family of distributions the model can express). A well-conditioned NTK (24-03) makes gradient descent converge quickly, and flat minima are often associated with both better generalization and representations with good manifold untangling, although these links are empirical tendencies rather than theorems.

Sharp vs. flat minima:

The Hessian eigenvalues of the loss at convergence reveal the sharpness/flatness of the representation. Small top eigenvalues indicate a flat minimum and typically better generalization.



Key Terms

Worked Examples

Example 1: Intrinsic Dimension of a Swiss Roll

The Swiss roll is a 2D manifold embedded in $\mathbb{R}^3$: $(x, y, z) = (t\cos t, t\sin t, s)$ with $t \in [3\pi/2, 9\pi/2]$, $s \in [0, 1]$. Sample 1000 points and estimate the intrinsic dimension using PCA and the Two-NN method.

Solution:

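PCA: The data lives in $\mathbb{R}^3$. Assuming $t$ and $s$ are sampled uniformly, the spiral coordinates $x$ and $y$ together carry a variance of roughly 91, while $z = s$ has variance $1/12 \approx 0.08$. The first two components therefore explain about 99.9% of the variance, and $d_{\text{PCA}} = 2$ at $\tau = 0.99$. This agreement with the true dimension is a coincidence of scale: the two retained directions span the $x$-$y$ plane in which the spiral is drawn, not the manifold coordinates $(t, s)$, and the genuine degree of freedom $s$ is discarded as if it were noise. If $s$ ranged over $[0, 20]$ instead, all three components would carry substantial variance and PCA would report $d_{\text{PCA}} = 3$.

Two-NN: For each point, compute the ratio $\mu = r_2/r_1$. For a true 2D manifold, $\mu$ follows a Pareto distribution with $d=2$. The MLE gives $\hat{d} = N / \sum_i \log\mu_i \approx 2.0$, and this holds regardless of how the roll is scaled, because the estimator only looks at local neighborhoods.

Answer: Both PCA and Two-NN give $d \approx 2$ here, but for different reasons. Two-NN recovers the true local dimension. PCA returns 2 only because the width of the roll is small compared with the spiral; it cannot unroll the sheet, and a wider roll gives $d_{\text{PCA}} = 3$. For curved manifolds in general, PCA tends to overestimate the dimension because the manifold isn't globally linear, while Two-NN still works because it's a local estimator.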
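The example can be reproduced with a short simulation. This sketch assumes $t$ and $s$ are drawn uniformly, uses 1000 points as in the problem statement, and adds a second, wider roll ($s \in [0, 20]$) as an illustrative contrast; helper names are hypothetical.

```python
import numpy as np

def two_nn_dimension(X):
    """Two-NN MLE: N / sum_i log(r2_i / r1_i), brute-force neighbors."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)
    np.fill_diagonal(d2, np.inf)
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    return float(len(X) / np.log(nearest[:, 1] / nearest[:, 0]).sum())

def explained_variance(X):
    """Fraction of variance carried by each principal component, descending."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    return s**2 / np.sum(s**2)

def swiss_roll(n, height, rng):
    t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=n)   # position along the spiral
    s = rng.uniform(0.0, height, size=n)                # position across the sheet
    return np.column_stack([t * np.cos(t), t * np.sin(t), s])

rng = np.random.default_rng(0)
thin = swiss_roll(1000, 1.0, rng)     # the roll from the example
wide = swiss_roll(1000, 20.0, rng)    # same manifold, wider sheet

ev_thin, ev_wide = explained_variance(thin), explained_variance(wide)
d_thin, d_wide = two_nn_dimension(thin), two_nn_dimension(wide)

print("thin roll: top-2 variance", round(ev_thin[:2].sum(), 4), "Two-NN", round(d_thin, 2))
print("wide roll: top-2 variance", round(ev_wide[:2].sum(), 4), "Two-NN", round(d_wide, 2))
```

The PCA answer changes with the width of the sheet while the Two-NN estimate stays near 2, which is the distinction between a global linear estimator and a local one.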

Example 2: CKA Between Corresponding Layers

Two ResNet-18 networks are trained on CIFAR-10 with different random seeds, achieving similar accuracy. Their layer-4 representations (before the final FC) are $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^{1000 \times 512}$. Compute linear CKA and interpret.

Solution:

Center the representations: $\tilde{\mathbf{X}} = \mathbf{X} - \boldsymbol{\mu}_X$, $\tilde{\mathbf{Y}} = \mathbf{Y} - \boldsymbol{\mu}_Y$.

Compute Gram matrices: $K_X = \tilde{\mathbf{X}}\tilde{\mathbf{X}}^T$, $K_Y = \tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^T$.

HSIC (Hilbert-Schmidt Independence Criterion): $\text{HSIC}(\mathbf{X}, \mathbf{Y}) = \frac{1}{(N-1)^2}\text{tr}(K_X H K_Y H)$ where $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$.

Linear CKA: $\text{CKA} = \frac{\text{HSIC}(\mathbf{X}, \mathbf{Y})}{\sqrt{\text{HSIC}(\mathbf{X}, \mathbf{X}) \cdot \text{HSIC}(\mathbf{Y}, \mathbf{Y})}}$.

If $\text{CKA} \approx 0.92$, this indicates the two networks learned very similar representations at this layer — consistent across random seeds. Typical CKA values for corresponding ResNet layers are 0.85-0.95, while different layers within the same network have CKA $\approx$ 0.3-0.7.

Answer: CKA $\approx 0.92$ indicates strong agreement between the two networks' representations at this layer. The representations are essentially the same up to an orthogonal transformation. This is evidence that neural networks converge to similar internal representations for the same task, supporting the idea that the learned manifold geometry is determined by the data+architecture, not by random initialization.
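The HSIC recipe in the solution and the feature-space formula from the CKA section should agree. The sketch below checks this on synthetic stand-in matrices (not actual ResNet features); the shapes, the noise level, and the helper names are assumptions chosen to keep the example small.

```python
import numpy as np

def cka_hsic(X, Y):
    """Linear CKA from Gram matrices: HSIC(X,Y) / sqrt(HSIC(X,X) HSIC(Y,Y))."""
    N = X.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    Kx, Ky = X @ X.T, Y @ Y.T                    # linear-kernel Gram matrices

    def hsic(K, L):
        return np.trace(K @ H @ L @ H) / (N - 1) ** 2

    return float(hsic(Kx, Ky) / np.sqrt(hsic(Kx, Kx) * hsic(Ky, Ky)))

def cka_features(X, Y):
    """Linear CKA from centered feature matrices (p x q cross-product form)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    num = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                               # stand-in for network 1
Y = X @ rng.normal(size=(20, 30)) + 0.5 * rng.normal(size=(100, 30))  # related, wider

cka_gram = cka_hsic(X, Y)
cka_feat = cka_features(X, Y)
print(f"Gram form: {cka_gram:.6f}, feature form: {cka_feat:.6f}")
```

For the $1000 \times 512$ matrices in the example, the feature-space form works with $512 \times 512$ products instead of $1000 \times 1000$ Gram matrices; which form is cheaper depends on whether $N$ or the width is larger.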

Example 3: CCA for Measuring Shared Subspace

Two representation matrices $\mathbf{X} \in \mathbb{R}^{500 \times 100}$, $\mathbf{Y} \in \mathbb{R}^{500 \times 80}$ are centered. SVD gives $\mathbf{X} = \mathbf{U}_X \mathbf{S}_X \mathbf{V}_X^T$ (and similarly for $\mathbf{Y}$). The top 30 singular vectors of each are kept. The CCA gives canonical correlations $\rho_1, \ldots, \rho_{30}$ with $\rho_1 = 0.95$, $\rho_2 = 0.88$, $\rho_3 = 0.72$, and then a sharp drop to $\rho_i < 0.3$ for $i \geq 4$. Interpret this result.

Solution:

The first 3 canonical correlations are high (>0.7), indicating a 3-dimensional shared subspace between the two representations. After that, correlations drop sharply — there's little shared structure beyond 3 dimensions.

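Mean CCA: $\bar{\rho} \approx 0.24$ (diluted by many low correlations; this value corresponds to the 27 remaining correlations averaging about 0.17).
SVCCA: because the SVD truncation to 30 directions was applied before CCA, the mean over these 30 correlations is the SVCCA score; keeping more low-variance directions would typically lower it further.
PWCCA: heavily weights the first 3 directions → $\text{PWCCA} \approx 0.78$ (assuming those directions carry most of the projection weight).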

This is why PWCCA is preferred: it captures that the two representations are strongly aligned in the important (high-variance) directions, even if the mean over all directions is low. The 3 shared dimensions likely correspond to the most important features for the task.

Answer: The representations share a 3-dimensional dominant subspace (high $\rho_1$ through $\rho_3$) and diverge in higher dimensions. PWCCA captures this well (~0.78), while mean CCA (~0.24) masks the strong alignment in important directions. This pattern is common: different networks agree on the "coarse" representation structure but differ in fine-grained details. The choice of metric significantly affects the conclusion.
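The two summary values can be checked with a line of arithmetic each. The sketch below introduces two quantities that the problem does not state and that are assumptions here: $\bar{\rho}_{\text{low}}$, the average of the 27 low correlations, and $w$, the total PWCCA weight on the top three directions (taken to be spread evenly among them).

$$
\begin{aligned}
\bar{\rho}_{\text{CCA}} &= \frac{0.95 + 0.88 + 0.72 + 27\,\bar{\rho}_{\text{low}}}{30} = 0.24
  \;\Rightarrow\; \bar{\rho}_{\text{low}} = \frac{7.2 - 2.55}{27} \approx 0.17 \\
\text{PWCCA} &\approx w \cdot \frac{0.95 + 0.88 + 0.72}{3} + (1 - w)\,\bar{\rho}_{\text{low}}
  = 0.85\,w + 0.17\,(1 - w) = 0.78
  \;\Rightarrow\; w \approx 0.90
\end{aligned}
$$

Under these assumptions, a PWCCA of about 0.78 requires the three shared directions to carry roughly 90% of the projection weight, which is what "heavily weights" means quantitatively.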

Practice Problems

Problem 1: Estimate the intrinsic dimension of a $d$-dimensional Gaussian $\mathcal{N}(\mathbf{0}, I_d)$ with $d=10$, embedded in $\mathbb{R}^{100}$ by multiplying with a random $100 \times 10$ matrix $\mathbf{A}$ (all entries $\sim \mathcal{N}(0,1)$). What happens as $N \to \infty$ with PCA? With Two-NN?

Answer: The data is $\mathbf{x} = \mathbf{A}\mathbf{z}$ where $\mathbf{z} \sim \mathcal{N}(0, I_{10})$. The covariance is $\mathbf{A}\mathbf{A}^T \in \mathbb{R}^{100 \times 100}$, which has rank 10 (with probability 1).

**PCA:** As $N \to \infty$, the sample covariance converges to $\mathbf{A}\mathbf{A}^T$, which has exactly 10 nonzero eigenvalues. PCA correctly identifies $d=10$.

**Two-NN:** The data lives on a 10-dimensional linear subspace of $\mathbb{R}^{100}$. The Two-NN estimator also gives $\hat{d} \approx 10$ for large $N$, because the local geometry of a flat 10D subspace is exactly 10-dimensional. This shows that linear manifolds are handled correctly by both methods.

**Caution:** If noise is added ($\mathbf{x} = \mathbf{A}\mathbf{z} + \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I_{100})$), the covariance becomes $\mathbf{A}\mathbf{A}^T + \sigma^2 I_{100}$. Its top 10 eigenvalues are $\lambda_i + \sigma^2$, where the $\lambda_i$ are the nonzero eigenvalues of $\mathbf{A}\mathbf{A}^T$ (each on the order of 100 for this $\mathbf{A}$), and the remaining 90 equal $\sigma^2$. PCA still shows a clear spectral gap as long as $\sigma^2 \ll \lambda_i$. Two-NN is less robust: once nearest-neighbor distances shrink to the scale of the noise, they are dominated by the 100-dimensional noise and the estimator overestimates the dimension. This is a common pitfall.
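A finite-sample simulation of this problem, as a sketch. The sample size $N = 2000$, the noise level $\sigma = 2$, and the eigenvalue threshold $3\sigma^2$ are assumptions chosen for illustration; at finite $N$ the Two-NN estimate on clean data is only approximately 10.

```python
import numpy as np

def two_nn_dimension(X):
    """Two-NN MLE: N / sum_i log(r2_i / r1_i), brute-force neighbors."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)
    np.fill_diagonal(d2, np.inf)
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    return float(len(X) / np.log(nearest[:, 1] / nearest[:, 0]).sum())

def cov_eigenvalues(X):
    """Eigenvalues of the sample covariance, descending."""
    Xc = X - X.mean(axis=0)
    return np.linalg.eigvalsh(Xc.T @ Xc / (len(X) - 1))[::-1]

rng = np.random.default_rng(0)
N, d, D, sigma = 2000, 10, 100, 2.0
A = rng.normal(size=(D, d))                    # random 100 x 10 embedding
Z = rng.normal(size=(N, d))                    # latent Gaussian
X_clean = Z @ A.T
X_noisy = X_clean + sigma * rng.normal(size=(N, D))

eig_clean = cov_eigenvalues(X_clean)
eig_noisy = cov_eigenvalues(X_noisy)
n_signal = int(np.sum(eig_noisy > 3 * sigma**2))   # eigenvalues clearly above noise floor

d_clean = two_nn_dimension(X_clean)
d_noisy = two_nn_dimension(X_noisy)

print("nonzero clean eigenvalues:", int(np.sum(eig_clean > 1e-8 * eig_clean[0])))
print("signal eigenvalues under noise:", n_signal)
print(f"Two-NN clean: {d_clean:.1f}, Two-NN noisy: {d_noisy:.1f}")
```

PCA keeps its spectral gap under noise, while the Two-NN estimate on the noisy data moves far above 10, matching the caution in the answer.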

Problem 2: Prove that linear CKA is invariant to orthogonal transformations of either representation: $\text{CKA}(\mathbf{X}\mathbf{Q}_X, \mathbf{Y}) = \text{CKA}(\mathbf{X}, \mathbf{Y})$ where $\mathbf{Q}_X^T\mathbf{Q}_X = I$.

Answer: Let $\mathbf{X}' = \mathbf{X}\mathbf{Q}_X$ with $\mathbf{Q}_X$ orthogonal. $\mathbf{X}'$ is still centered, because the column means are transformed by the same linear map. Then $\mathbf{X}'^T\mathbf{X}' = \mathbf{Q}_X^T\mathbf{X}^T\mathbf{X}\mathbf{Q}_X$, and $\|\mathbf{Q}_X^T\mathbf{X}^T\mathbf{X}\mathbf{Q}_X\|_F = \|\mathbf{X}^T\mathbf{X}\|_F$ because the Frobenius norm is unchanged by left- or right-multiplication with an orthogonal matrix.

Similarly, $\mathbf{Y}^T\mathbf{X}' = \mathbf{Y}^T\mathbf{X}\mathbf{Q}_X$, and $\|\mathbf{Y}^T\mathbf{X}\mathbf{Q}_X\|_F^2 = \|\mathbf{Y}^T\mathbf{X}\|_F^2$ (right-multiplication by an orthogonal matrix preserves the Frobenius norm).

Therefore: $\text{CKA}(\mathbf{X}', \mathbf{Y}) = \frac{\|\mathbf{Y}^T\mathbf{X}\mathbf{Q}_X\|_F^2}{\|\mathbf{X}'^T\mathbf{X}'\|_F \cdot \|\mathbf{Y}^T\mathbf{Y}\|_F} = \frac{\|\mathbf{Y}^T\mathbf{X}\|_F^2}{\|\mathbf{X}^T\mathbf{X}\|_F \cdot \|\mathbf{Y}^T\mathbf{Y}\|_F} = \text{CKA}(\mathbf{X}, \mathbf{Y})$.

This invariance is essential: it means CKA measures the *geometry* of the representation (which is preserved under rotations/reflections) rather than the specific coordinate axes. Two networks that learn the same geometry but in different coordinate systems will have CKA = 1.

Problem 3: A neural network's hidden representation at layer $\ell$ has an effective rank that increases from $\text{srank}(\mathbf{H}^{(1)}) = 5$ to $\text{srank}(\mathbf{H}^{(L)}) = 45$ across layers. What does this tell you about the network's computation? Is this typical?

Answer: **Increasing effective rank** means the representation is becoming more distributed: the network is expanding the data into a higher-dimensional space where linear separation is easier. This is the opposite of dimensionality reduction: it's *dimensionality expansion* for separability.

A commonly described pattern across depth is:

- **Early layers:** Compress input (remove noise, extract features); effective rank may decrease initially
- **Middle layers:** Expand into higher dimensions to create linear separability; effective rank increases
- **Final hidden layer:** May compress again before the classifier

The pattern you'd expect is therefore: decrease (input compression) → increase (expansion for separation) → maintain or slight decrease (classification-ready). A monotonic increase is less usual and might indicate that the early layers aren't doing enough compression, or that the network is overparameterized.

**Connection to manifold untangling:** Expanding the effective dimension gives the manifold more "room" to untangle. Think of unknotting a tangled string: you sometimes need to pull it into a higher-dimensional space before you can lay it flat.

Problem 4: For centered representations $\mathbf{X}$ and $\mathbf{Y}$, linear CKA can be written as $\frac{\|\mathbf{X}^T\mathbf{Y}\|_F^2}{\|\mathbf{X}^T\mathbf{X}\|_F \|\mathbf{Y}^T\mathbf{Y}\|_F}$. It is sometimes claimed that this formulation only works when $\mathbf{X}$ and $\mathbf{Y}$ have the same number of features and fails for different widths. Show that it is equivalent to the HSIC form, and decide whether the claim holds.

Answer: The formulation does not fail for different widths. For centered data, the Gram matrix is $\mathbf{X}\mathbf{X}^T$ (up to scaling). The HSIC numerator is $\text{tr}((\mathbf{X}\mathbf{X}^T)(\mathbf{Y}\mathbf{Y}^T)) = \text{tr}(\mathbf{Y}^T\mathbf{X}\mathbf{X}^T\mathbf{Y}) = \|\mathbf{X}^T\mathbf{Y}\|_F^2$, and in the same way $\text{tr}((\mathbf{X}\mathbf{X}^T)^2) = \|\mathbf{X}^T\mathbf{X}\|_F^2$ for the denominator. None of these steps depends on $p$ and $q$: with $\mathbf{X} \in \mathbb{R}^{N \times p}$ and $\mathbf{Y} \in \mathbb{R}^{N \times q}$, the product $\mathbf{X}^T\mathbf{Y} \in \mathbb{R}^{p \times q}$ and its Frobenius norm is defined.

So the formula $\frac{\|\mathbf{X}^T\mathbf{Y}\|_F^2}{\|\mathbf{X}^T\mathbf{X}\|_F \|\mathbf{Y}^T\mathbf{Y}\|_F}$ is equivalent to linear CKA whenever both matrices are centered. The $\mathbf{X}^T\mathbf{Y}$ formulation naturally handles different dimensionalities because it computes the cross-covariance between all pairs of features across the two representations, and the Frobenius norm on the $p \times q$ matrix aggregates all these cross-covariances.

This is a key advantage over CCA, which can only find $\min(p, q)$ directions. CKA considers all pairwise interactions.

Problem 5: The neural collapse phenomenon: at the terminal phase of training a classifier, the within-class covariance $\Sigma_W \to 0$ and the class means $\boldsymbol{\mu}_c$ converge to a simplex equiangular tight frame (ETF). Compute the CKA between the final hidden representations of two different classes $c$ and $c'$ in this limit. What does this say about the representation?

Answer: At neural collapse, for any sample $\mathbf{x}$ from class $c$: $\mathbf{h}(\mathbf{x}) = \boldsymbol{\mu}_c$ (all within-class variance vanishes). For $K$ classes, the globally centered, unit-norm means form a simplex ETF: $\boldsymbol{\mu}_c^T\boldsymbol{\mu}_{c'} = \frac{K}{K-1}\delta_{cc'} - \frac{1}{K-1}$.

The representation matrix for $N$ samples (with $N_c$ per class, $N = \sum N_c$) has $\mathbf{H}^T\mathbf{H} = \sum_c N_c \boldsymbol{\mu}_c \boldsymbol{\mu}_c^T$. The Gram matrix $\mathbf{H}\mathbf{H}^T$ is block-constant: entries equal 1 for two samples of the same class and $-\frac{1}{K-1}$ for samples of different classes.

CKA compares two representations of the same inputs, so "CKA between classes $c$ and $c'$" is only meaningful if each class's samples are treated as a separate representation matrix. In the collapse limit every row of such a matrix equals $\boldsymbol{\mu}_c$, so centering yields the zero matrix, every HSIC term is 0, and CKA is the indeterminate form $0/0$: there is no within-class structure left to compare. By contrast, two different networks that have both collapsed (with balanced classes) have Gram matrices with the same block ETF pattern up to scale, so the CKA between them over the full dataset is 1.

**Interpretation:** Neural collapse means the representation has achieved maximum manifold untangling: each class manifold has collapsed to a single point, and all class centers are maximally separated. This is the ideal for linear classification but may harm transfer learning because all within-class variation has been discarded.
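The collapse limit can be built explicitly. The sketch below constructs a simplex ETF for $K = 5$ classes, embeds it into two hypothetical "networks" with different random orthonormal bases and scales, and checks the statements in the answer; all sizes and names are illustrative assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA in feature-space form; centers both inputs first."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    num = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return float(num / den)

K, n_per_class, width = 5, 40, 32
# Rows of M are unit-norm class means with pairwise inner product -1/(K-1).
M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)
etf_gram = M @ M.T

labels = np.repeat(np.arange(K), n_per_class)      # balanced classes
rng = np.random.default_rng(0)

def collapsed_features(scale):
    """Every sample sits exactly on its class mean, in a random orthonormal basis."""
    P, _ = np.linalg.qr(rng.normal(size=(width, K)))   # width x K, orthonormal columns
    return scale * M[labels] @ P.T

H_a = collapsed_features(1.0)      # "network A"
H_b = collapsed_features(2.5)      # "network B": different basis and scale

cka_ab = linear_cka(H_a, H_b)

# Within a single class the centered representation vanishes (CKA would be 0/0).
one_class = H_a[labels == 0]
within_norm = float(np.linalg.norm(one_class - one_class.mean(axis=0)))

print(np.round(etf_gram, 3))
print(f"CKA between collapsed networks: {cka_ab:.6f}, within-class norm: {within_norm:.2e}")
```

Both collapsed networks share the same block-constant Gram pattern, so CKA cannot distinguish them, while a single class offers nothing to compare.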

Summary

Key takeaways:


Quiz

Question 1: Which statement best describes the manifold hypothesis?

- A. All data lies on a single global linear subspace
- B. High-dimensional real-world data is concentrated near low-dimensional manifolds
- C. Every neural network layer projects data onto a lower-dimensional manifold
- D. Manifold structure is irrelevant to deep learning performance

Correct Answer: B. High-dimensional real-world data is concentrated near low-dimensional manifolds

Explanation: The manifold hypothesis posits that despite living in $\mathbb{R}^D$ with $D$ large, the data-generating process involves many fewer degrees of freedom ($d \ll D$). Option A is too restrictive — the manifolds are generally nonlinear. Option C describes what networks do (sometimes), not the hypothesis itself. Option D is contradicted by extensive evidence linking manifold geometry to network performance.


Question 2: What does the Two-NN intrinsic dimension estimator exploit?

- A. The global covariance structure of the data
- B. The ratio of distances to the first and second nearest neighbors
- C. The number of PCA components needed for 95% variance
- D. The rank of the data matrix

Correct Answer: B. The ratio of distances to the first and second nearest neighbors

Explanation: In a $d$-dimensional space, the ratio $r_2/r_1$ follows a Pareto($d$) distribution. The Two-NN method estimates $d$ from the log-ratios: $\hat{d} = N / \sum \log(r_{2,i}/r_{1,i})$. This is a local estimator — it uses only nearest-neighbor information, making it valid for nonlinear manifolds where PCA (a global linear method) fails. Option A describes PCA, option D describes matrix factorization approaches.


Question 3: Which representation similarity metric is invariant to invertible linear transformations?

- A. CKA
- B. Linear regression from one representation to another
- C. CCA
- D. Euclidean distance between representation vectors

Correct Answer: C. CCA

Explanation: CCA finds the maximum correlations achievable by linear projections. If you apply an invertible linear transformation $\mathbf{Y}' = \mathbf{Y}\mathbf{A}$, the canonical correlations don't change because any linear projection of $\mathbf{Y}'$ is equivalent to some linear projection of $\mathbf{Y}$ via $\mathbf{A}$. CKA is invariant only to orthogonal transformations, not arbitrary invertible ones. Option D is not invariant to anything useful.


Question 4: A network's hidden layer has effective rank 5 at layer 2 and effective rank 45 at layer 8 (final hidden layer). What computational strategy is the network employing?

- A. Dimensionality reduction for compression
- B. Dimensionality expansion for linear separability
- C. Random projection
- D. Overfitting to noise

Correct Answer: B. Dimensionality expansion for linear separability

Explanation: Going from low to high effective rank means the network is expanding its representation into a higher-dimensional space. This is the kernel trick in learned form: projecting data into a high-dimensional space where it becomes linearly separable. It's a form of "untangling" — giving the manifold more room to unfold. Option A would show decreasing rank. Options C and D are possible pathologies but the expansion pattern is consistent with normal successful training.


Question 5: What does a CKA value of 0.95 between two different networks' corresponding layers indicate?

- A. The networks are identical
- B. The representations encode very similar similarity structure (up to orthogonal rotation)
- C. The networks must have the same architecture
- D. CKA is miscalibrated and this value is meaningless

Correct Answer: B. The representations encode very similar similarity structure (up to orthogonal rotation)

Explanation: CKA = 1 means identical representational geometry up to orthogonal transformation. A value of 0.95 means the representations are nearly identical in terms of which inputs have similar/dissimilar representations. The networks could have different architectures, different initializations, or even different widths — CKA measures the similarity of the geometry, not the coordinate values. This is a robust finding: networks trained on the same task tend to converge to similar representational geometries.


Question 6: Neural collapse refers to:

- A. The network's weights becoming zero
- B. Within-class representations collapsing to their class means and class means forming a symmetric structure
- C. The loss function becoming non-convex
- D. The manifold hypothesis being violated

Correct Answer: B. Within-class representations collapsing to their class means and class means forming a symmetric structure

Explanation: Neural collapse (Papyan et al., 2020) is the phenomenon where, late in training, the last-layer features of each class converge to their class mean ($\Sigma_W \to 0$), the class means converge to a simplex ETF (maximally separated and symmetric), and the classifier weights align with the class means. This represents perfect manifold untangling — each class manifold has collapsed to a point, achieving maximum capacity for linear separability.


Pitfalls

  1. Using PCA intrinsic dimension on nonlinear manifolds: PCA estimates the global linear dimension — the dimension of the best-fit affine subspace. For curved manifolds (e.g., Swiss rolls, knots), PCA dramatically overestimates the dimension because it can't "see" the nonlinear structure. Always complement PCA with local estimators (Two-NN, Levina-Bickel) when dealing with potentially nonlinear manifolds.

  2. Confusing CKA and CCA invariance properties: CKA is invariant to orthogonal transformations but NOT to arbitrary invertible linear transformations. CCA is invariant to all invertible linear transformations. A high CKA between two representations means they share the same geometry up to rotation; a low CKA doesn't mean the representations are fundamentally different — they could be related by a non-orthogonal linear transform (which CCA would detect). Always use both metrics for a complete picture.

  3. Misinterpreting effective rank trends: An increasing effective rank across layers doesn't necessarily mean the network is "doing well." Some increase is normal (expansion for separability), but if all layers have very high effective rank, the network may not be compressing information — it's just passing noise forward. Look for spectral gaps, not just rank magnitudes.

  4. Treating neural collapse as universally desirable: Neural collapse (within-class covariance → 0, class means → simplex ETF) maximizes linear separability, which is great for the current task. But it discards all within-class variation, which can severely harm transfer learning — downstream tasks that need fine-grained distinctions within classes will struggle. Task-specific representations strike a balance: enough compression for the task, enough variation preserved for reuse.


Next Steps

Next up: 24-05 — Disentanglement and Representation Theory — where you'll learn about learning representations whose individual dimensions correspond to meaningful and independent factors of variation, using $\beta$-VAEs, mutual information gaps, and group-theoretic frameworks.