25-05 — Mode Connectivity and Loss Landscapes
Phase: 25 — Frontiers & Active Research Areas Subject: 25-05 Prerequisites: 25-04 (Double Descent), 14 (Optimization Theory), 24-04 (Manifold Hypothesis) Next subject: 25-06 — Federated Learning
Learning Objectives
By the end of this subject, you will be able to:
- Define linear mode connectivity and explain its significance for understanding neural network optimisation
- Describe the loss landscape geometry of deep networks, including basins and barriers
- Explain the Lottery Ticket Hypothesis and its connection to sparse subnetworks
- Apply mode connectivity concepts to model merging and ensemble analysis
- Interpret loss landscape visualisations and identify key topological features
Core Content
1. Mode Connectivity
Mode connectivity is the empirical observation that different trained neural networks (different optimisation trajectories, random seeds) often converge to solutions connected by paths of near-constant low loss — they don't fall into isolated local minima but into a single connected basin.
⚠️ CRITICAL: This challenges the classical picture of optimisation landing in isolated local minima. In overparameterised networks, the loss landscape is dominated by connected valleys, not isolated pits.
Linear mode connectivity (Frankle et al., 2020): Two solutions $\theta_1$ and $\theta_2$ are linearly mode-connected if the convex combination $\alpha\theta_1 + (1-\alpha)\theta_2$ has low loss for all $\alpha \in [0, 1]$:
$$L(\alpha\theta_1 + (1-\alpha)\theta_2) \leq \max(L(\theta_1), L(\theta_2)) + \delta$$
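for small $\delta$. A sufficient condition is that both solutions lie in the same approximately convex basin, since the loss along a segment inside a convex region cannot exceed the larger endpoint loss.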
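As a minimal sketch of this check, the snippet below evaluates a loss along the linear path and reports the excess of the worst path loss over the worse endpoint, i.e. the smallest $\delta$ that satisfies the inequality on the sampled grid. The `toy_loss` function (a double well in the first coordinate) and the helper names are illustrative assumptions, not part of any library.

```python
import numpy as np

def interpolate_losses(loss_fn, theta_1, theta_2, num_points=11):
    """Evaluate loss_fn at alpha*theta_1 + (1-alpha)*theta_2 on an even alpha grid."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn(a * theta_1 + (1.0 - a) * theta_2) for a in alphas])
    return alphas, losses

def linear_barrier(losses):
    """Excess of the worst path loss over the worse endpoint loss."""
    return float(losses.max() - max(losses[0], losses[-1]))

def toy_loss(theta):
    """Hypothetical double-well loss with minima at theta[0] = -1 and theta[0] = +1."""
    return float((theta[0] ** 2 - 1.0) ** 2 + 0.5 * np.sum(theta[1:] ** 2))

theta_a = np.array([-1.0, 0.0])   # minimum of the left well
theta_b = np.array([1.0, 0.0])    # minimum of the right well
theta_c = np.array([0.9, 0.1])    # another point in the right well

_, losses_ab = interpolate_losses(toy_loss, theta_a, theta_b)
_, losses_bc = interpolate_losses(toy_loss, theta_b, theta_c)

barrier_ab = linear_barrier(losses_ab)   # path crosses the ridge at theta[0] = 0
barrier_bc = linear_barrier(losses_bc)   # path stays inside one well
print(f"barrier a-b: {barrier_ab:.3f}, barrier b-c: {barrier_bc:.3f}")
```

For a real network, `loss_fn` would load the interpolated parameters into the model and evaluate it on held-out data; the grid resolution is a choice, and a coarse grid can miss a narrow barrier.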
Nonlinear mode connectivity (Garipov et al., 2018): Even when the linear path crosses a loss barrier, the solutions are often connected by low-loss nonlinear paths. One simple parameterisation is a quadratic Bezier curve with a trainable midpoint (control point) $\theta_{\text{mid}}$:
$$\theta(\alpha) = (1-\alpha)^2\theta_1 + 2\alpha(1-\alpha)\theta_{\text{mid}} + \alpha^2\theta_2, \quad \alpha \in [0, 1]$$
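The midpoint $\theta_{\text{mid}}$ is a control point: the curve bends towards it but does not generally pass through it. It is trained so that the loss stays low along the whole curve; Garipov et al. do this by minimising the expected loss over $\alpha$ sampled uniformly from $[0, 1]$.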
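The curve formula can be sketched directly in NumPy. The example below uses made-up two-dimensional "parameters" to show that the curve starts at $\theta_1$, ends at $\theta_2$, and collapses to the straight segment when $\theta_{\text{mid}}$ is placed at the average of the endpoints; the function name and the numbers are illustrative assumptions.

```python
import numpy as np

def bezier_point(alpha, theta_1, theta_mid, theta_2):
    """Point on the quadratic Bezier curve at position alpha in [0, 1]."""
    return ((1.0 - alpha) ** 2 * theta_1
            + 2.0 * alpha * (1.0 - alpha) * theta_mid
            + alpha ** 2 * theta_2)

theta_1 = np.array([0.0, 0.0])
theta_2 = np.array([2.0, 0.0])
bent_mid = np.array([1.0, 3.0])         # hypothetical control point off the segment
flat_mid = 0.5 * (theta_1 + theta_2)    # control point at the average of the endpoints

start = bezier_point(0.0, theta_1, bent_mid, theta_2)
end = bezier_point(1.0, theta_1, bent_mid, theta_2)
halfway_bent = bezier_point(0.5, theta_1, bent_mid, theta_2)

# With the control point at the average, the curve is the straight segment.
quarter_flat = bezier_point(0.25, theta_1, flat_mid, theta_2)
quarter_line = 0.75 * theta_1 + 0.25 * theta_2
print(start, end, halfway_bent, quarter_flat, quarter_line)
```

In practice only `theta_mid` would be updated by gradient descent while both endpoints stay fixed.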
Key findings:
- Early in training, networks are NOT mode-connected: solutions from different initialisations are in separate basins
- As training progresses, the basins expand and merge
- At convergence (especially with SGD noise), solutions often connect
- Batch normalisation significantly improves mode connectivity
2. Loss Landscape Geometry
The loss landscape of a neural network is the function $\theta \mapsto L(\theta)$ — typically extremely high-dimensional (millions of parameters) and non-convex.
Key topological features:
| Feature | Description | Why It Matters |
|---|---|---|
| Basin | Connected region of low loss | Solutions from different runs fall into same basin |
| Barrier | Ridge of high loss between basins | Causes mode disconnectivity |
| Saddle point | Gradient zero, mixed curvature | Dominant in high dimensions; SGD escapes easily |
| Plateau | Region of near-zero gradient | Can trap optimisation momentarily |
| Narrow valley | Low-loss corridor with steep walls | Sharp minima may generalise worse than flat ones |
Sharp vs flat minima: Keskar et al. (2016) showed that large-batch SGD converges to sharp minima (high curvature, worse generalisation), while small-batch SGD converges to flat minima (low curvature, better generalisation). The intuition: flat minima are robust to parameter perturbations. However, Dinh et al. (2017) showed that sharpness can be gamed through reparameterisation — ReLU networks have scale invariance (multiply incoming weights by $c$, divide outgoing by $c$, same function, different sharpness).
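The scale-invariance argument can be checked numerically. The sketch below assumes a bias-free two-layer ReLU network and a squared-error loss (both illustrative choices): rescaling by $c$ leaves the outputs unchanged, while the Hessian trace with respect to the outgoing weights grows by $c^2$. For this loss that trace has the closed form used in `out_layer_hessian_trace`, because the loss is quadratic in the outgoing weights.

```python
import numpy as np

def hidden(x, w_in):
    """ReLU activations of the hidden layer."""
    return np.maximum(x @ w_in, 0.0)

def relu_net(x, w_in, w_out):
    """Two-layer ReLU network without biases."""
    return hidden(x, w_in) @ w_out

def out_layer_hessian_trace(x, w_in, w_out):
    """Hessian trace of 0.5 * squared error with respect to w_out.

    The diagonal entry for w_out[j, k] is sum_n h[n, j]**2, independent of targets.
    """
    h = hidden(x, w_in)
    return w_out.shape[1] * float(np.sum(h ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
w_in = rng.normal(size=(3, 4))
w_out = rng.normal(size=(4, 2))

c = 10.0
same_function = np.allclose(relu_net(x, w_in, w_out),
                            relu_net(x, c * w_in, w_out / c))
trace_original = out_layer_hessian_trace(x, w_in, w_out)
trace_rescaled = out_layer_hessian_trace(x, c * w_in, w_out / c)
print(same_function, trace_rescaled / trace_original)
```

The function is identical in both parameterisations, yet this particular sharpness measure differs by a factor of $c^2$, which is the core of the objection.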
3. Loss Landscape Visualisation
Direct visualisation of million-dimensional loss landscapes is impossible. Two main techniques:
Random direction method (Goodfellow et al., 2015): Choose two random direction vectors $\delta_1, \delta_2$ in parameter space, then plot:
$$f(\alpha, \beta) = L(\theta^* + \alpha\delta_1 + \beta\delta_2)$$
as a 2D surface. The directions are typically normalised and may be filtered through the Hessian for more informative views.
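A minimal sketch of the two-direction slice, assuming a hypothetical quadratic `quadratic_loss` as a stand-in for a real network loss and unit-norm random directions:

```python
import numpy as np

def loss_surface(loss_fn, theta_star, delta_1, delta_2, alphas, betas):
    """Tabulate f(alpha, beta) = loss_fn(theta_star + alpha*delta_1 + beta*delta_2)."""
    surface = np.empty((len(alphas), len(betas)))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            surface[i, j] = loss_fn(theta_star + a * delta_1 + b * delta_2)
    return surface

def quadratic_loss(theta):
    """Hypothetical stand-in for a network loss, minimised at the origin."""
    return float(0.5 * np.sum(theta ** 2))

rng = np.random.default_rng(0)
dim = 50
theta_star = np.zeros(dim)              # the minimum of the toy loss
delta_1 = rng.normal(size=dim)
delta_2 = rng.normal(size=dim)
delta_1 /= np.linalg.norm(delta_1)      # normalise both directions to unit length
delta_2 /= np.linalg.norm(delta_2)

grid = np.linspace(-1.0, 1.0, 21)
surface = loss_surface(quadratic_loss, theta_star, delta_1, delta_2, grid, grid)
print(surface.shape, surface.min())
```

The resulting array can be passed to any contour or surface plotting routine. Each grid cell costs one full loss evaluation, so the resolution is usually kept modest for real networks.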
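Filter-normalised directions (Li et al., 2018): To address scale invariance, rescale each filter of the random direction so that it has the same Frobenius norm as the corresponding filter of the trained weights:
$$\delta_{ij}^{\text{norm}} = \frac{\delta_{ij}}{\|\delta_{ij}\|_F} \cdot \|\theta_{ij}^*\|_F$$
where $\delta_{ij}$ and $\theta_{ij}^*$ denote the $j$-th filter of layer $i$ in the direction and in the trained weights, respectively.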
This prevents scale-invariant directions from distorting the visualisation.
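A minimal sketch of filter normalisation as described by Li et al. (2018): each filter of the random direction is rescaled to the Frobenius norm of the matching filter in the trained weights. The layer shape, the `eps` guard against division by zero, and the function name are assumptions made for illustration.

```python
import numpy as np

def filter_normalise(direction, weights, eps=1e-10):
    """Rescale each filter of `direction` to the norm of the matching filter in `weights`.

    Both arrays have shape (num_filters, ...); axis 0 indexes filters.
    """
    flat_d = direction.reshape(direction.shape[0], -1)
    flat_w = weights.reshape(weights.shape[0], -1)
    d_norms = np.linalg.norm(flat_d, axis=1, keepdims=True)
    w_norms = np.linalg.norm(flat_w, axis=1, keepdims=True)
    scaled = flat_d / (d_norms + eps) * w_norms
    return scaled.reshape(direction.shape)

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 3, 3, 3))     # hypothetical conv layer: 8 filters
direction = rng.normal(size=weights.shape)  # random Gaussian direction
normalised = filter_normalise(direction, weights)

norms_after = np.linalg.norm(normalised.reshape(8, -1), axis=1)
norms_target = np.linalg.norm(weights.reshape(8, -1), axis=1)
print(np.max(np.abs(norms_after - norms_target)))
```

For a full network the same rescaling would be applied layer by layer before the direction is used in the 2D slice.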
Common visualisation reveals:
- Convex-looking basins near minima (though globally non-convex)
- Chaotic landscapes for untrained networks
- Smooth, well-structured landscapes after training
- Pathways (low-loss corridors) connecting different minima
4. The Lottery Ticket Hypothesis
Original hypothesis (Frankle & Carbin, 2019): "A randomly-initialised, dense neural network contains a subnetwork that is initialised such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations."
The pruning procedure:
1. Randomly initialise a network $f(x; \theta_0)$
2. Train for $j$ iterations to obtain $\theta_j$
3. Prune $p\%$ of parameters with the smallest magnitude
4. Reset remaining parameters to their original initialisation $\theta_0$
5. Repeat from step 2 (iterative magnitude pruning, IMP)
The surviving subnetwork structure with the original initialisations is the winning ticket.
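The procedure can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the setup of Frankle & Carbin: the "network" is a linear model trained by gradient descent on synthetic data with four informative features, and each round prunes 30% of the surviving weights. All function names are hypothetical.

```python
import numpy as np

def train(theta_init, mask, x, y, steps=200, lr=0.1):
    """Gradient descent on mean squared error, updating only unmasked weights."""
    theta = theta_init * mask
    for _ in range(steps):
        grad = x.T @ (x @ theta - y) / len(y)
        theta = theta - lr * grad * mask
    return theta

def prune_smallest(theta, mask, fraction):
    """Zero the mask for `fraction` of the surviving weights with smallest magnitude."""
    alive = np.flatnonzero(mask)
    num_prune = int(round(fraction * alive.size))
    order = np.argsort(np.abs(theta[alive]), kind="stable")
    new_mask = mask.copy()
    new_mask[alive[order[:num_prune]]] = 0.0
    return new_mask

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:4] = [2.0, -3.0, 1.5, 4.0]          # toy assumption: 4 informative features
y = x @ true_w

theta_0 = rng.normal(scale=0.1, size=20)    # step 1: random initialisation
mask = np.ones(20)
for _ in range(4):                          # step 5: repeat for several rounds
    theta_j = train(theta_0, mask, x, y)        # step 2: train
    mask = prune_smallest(theta_j, mask, 0.3)   # step 3: prune by magnitude
    # step 4: the rewind is implicit, since train() always restarts from theta_0

winning_ticket = train(theta_0, mask, x, y)
print(int(mask.sum()), np.round(winning_ticket[:4], 3))
```

The mask together with `theta_0` plays the role of the winning ticket here; in a real experiment the comparison of interest is test accuracy of the sparse network against the dense one.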
Key results:
- Winning tickets achieve comparable or better performance than the full network, using 50-99% fewer parameters
- Rewinding matters: Resetting weights to their original initialisation is critical; random reinitialisation fails
- Late rewinding: For larger networks, resetting to epoch 1-5 weights (rather than epoch 0) still works, suggesting early training establishes a useful parameter subspace
- Supermasks: Zhou et al. (2019) found that random weights with carefully chosen binary masks can achieve good performance without any weight training; the mask alone defines the function
Implications: The initialisation provides a "library" of subnetworks; SGD training selects and amplifies a winning ticket. The network architecture is overparameterised not for expressivity but to increase the probability of containing a good subnetwork at initialisation.
Key Terms
- Barrier
- Basin
- Loss landscape geometry
- Lottery Ticket Hypothesis
- Mode connectivity
- Narrow valley
- Plateau
- Saddle point
Worked Examples
Example 1: Testing Linear Mode Connectivity
Problem: Two trained ResNet-20 models (seed A: accuracy 91.2%, seed B: accuracy 91.0%) have parameters $\theta_A$ and $\theta_B$. Evaluate linear connectivity by checking the loss at $\alpha = 0, 0.25, 0.5, 0.75, 1$. Results: $L(0) = 0.32$, $L(0.25) = 0.55$, $L(0.5) = 1.42$, $L(0.75) = 0.58$, $L(1) = 0.33$. Are these models linearly mode-connected?
Solution:
The barrier height is the max loss along the path minus the average endpoint loss:
$$\text{barrier} = \max_\alpha L(\alpha) - \frac{L(0) + L(1)}{2} = 1.42 - \frac{0.32 + 0.33}{2} = 1.42 - 0.325 = 1.095$$
A barrier of 1.095 is substantial: about 3.4× the average endpoint loss. For CIFAR-10 classification (cross-entropy loss), a midpoint loss of 1.42 is more than four times the endpoint loss and over halfway to the $\ln 10 \approx 2.30$ of a uniform guess over 10 classes, so accuracy at the midpoint would fall far below the ~91% of the endpoints. These models are NOT linearly mode-connected.
Remedy: Try nonlinear connectivity by training a Bezier midpoint $\theta_{\text{mid}}$ so that the loss $L(\theta(\alpha))$ stays low for all $\alpha \in [0, 1]$. If such a low-loss path is found, the models lie in the same connected (though non-convex) low-loss region.
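The barrier arithmetic for this example can be reproduced in a few lines. The function name is a hypothetical helper; it uses the average-endpoint definition from the solution.

```python
def barrier_height(losses):
    """Max loss along the path minus the average of the two endpoint losses."""
    endpoint_mean = (losses[0] + losses[-1]) / 2
    return max(losses) - endpoint_mean

# Losses at alpha = 0, 0.25, 0.5, 0.75, 1 from the problem statement.
path_losses = [0.32, 0.55, 1.42, 0.58, 0.33]

barrier = barrier_height(path_losses)
relative = barrier / ((path_losses[0] + path_losses[-1]) / 2)
print(f"barrier = {barrier:.3f}, relative to endpoints = {relative:.2f}x")
```

With only five sampled values of $\alpha$ the true maximum could lie between grid points, so the computed barrier is a lower bound on the barrier along the full path.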
Example 2: Hessian-Based Flatness Measure
Problem: At a local minimum $\theta^*$, the Hessian eigenvalues are $\lambda_1 = 0.01, \lambda_2 = 0.05, \ldots, \lambda_{100} = 15.0, \lambda_{101} = 40.0, \ldots$ (spread over many orders of magnitude). The trace is $\text{tr}(H) = 500$. Compare sharpness to another minimum with $\text{tr}(H) = 80$.
Solution:
The trace of the Hessian (sum of eigenvalues) measures the average curvature — higher trace = sharper minimum. The first minimum ($\text{tr}(H) = 500$) is significantly sharper. Under small parameter perturbation $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$:
$$\mathbb{E}[L(\theta^* + \epsilon)] \approx L(\theta^*) + \frac{\sigma^2}{2}\text{tr}(H)$$
For $\sigma = 0.01$: expected loss increase is $0.5 \cdot 10^{-4} \cdot 500 = 0.025$ (first minimum) vs $0.5 \cdot 10^{-4} \cdot 80 = 0.004$ (second minimum) — the sharper minimum is 6.25× more sensitive to perturbation.
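These numbers can be checked numerically. The sketch below computes the second-order estimate for both minima and compares the first against a Monte Carlo average on a hypothetical quadratic loss whose diagonal Hessian has trace 500; for a quadratic loss the estimate is exact in expectation. The eigenvalue choice (100 equal values of 5) is an assumption made only to hit that trace.

```python
import numpy as np

def expected_increase(trace, sigma):
    """Second-order estimate (sigma^2 / 2) * tr(H) of the mean loss increase."""
    return 0.5 * sigma ** 2 * trace

sigma = 0.01
sharp = expected_increase(500.0, sigma)   # first minimum
flat = expected_increase(80.0, sigma)     # second minimum

# Monte Carlo check: L(eps) = 0.5 * sum_k lambda_k * eps_k^2 with tr(H) = 500.
rng = np.random.default_rng(0)
eigenvalues = np.full(100, 5.0)
eps = rng.normal(scale=sigma, size=(20000, 100))
mc_increase = float(np.mean(0.5 * np.sum(eigenvalues * eps ** 2, axis=1)))

print(f"sharp: {sharp:.4f}, flat: {flat:.4f}, ratio: {sharp / flat:.2f}")
print(f"Monte Carlo estimate for the sharp minimum: {mc_increase:.4f}")
```

Only the trace enters the estimate, so two minima with the same trace but very different eigenvalue spreads are indistinguishable by this measure.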
Example 3: Iterative Magnitude Pruning
Problem: A 2-layer MLP with weights $W_1$ (4×3) and $W_2$ (3×2) is trained. After training, 30% magnitude pruning is applied to $W_1$. Show the pruning mask for $W_1$, which removes the smallest 30% of entries by absolute value, given:
$$W_1 = \begin{bmatrix} 0.8 & -0.1 & 0.3 \\ -0.5 & 0.02 & 0.6 \\ 0.1 & -0.9 & -0.2 \\ 0.4 & 0.05 & -0.7 \end{bmatrix}$$
Solution:
12 total entries. 30% pruning → prune 3.6 → round to 4 smallest by absolute value.
Sorted by $|w|$: 0.02, 0.05, 0.1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
Prune: 0.02, 0.05, 0.1, 0.1 (the second 0.1 at position (3,1) is 4th smallest)
Mask (0 = pruned, 1 = kept):
$$M_1 = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}$$
After rewinding to original init and retraining, the surviving connections should recover performance.
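The mask can be reproduced with NumPy as a quick check. This is a sketch for this example only; rounding 3.6 up to 4 pruned entries follows the solution above.

```python
import numpy as np

w1 = np.array([[0.8, -0.1, 0.3],
               [-0.5, 0.02, 0.6],
               [0.1, -0.9, -0.2],
               [0.4, 0.05, -0.7]])

num_prune = int(round(0.30 * w1.size))      # 3.6 rounds to 4
# Indices of the flattened matrix, sorted by absolute value (smallest first).
order = np.argsort(np.abs(w1), axis=None, kind="stable")

mask = np.ones(w1.size, dtype=int)
mask[order[:num_prune]] = 0                 # 0 = pruned, 1 = kept
mask = mask.reshape(w1.shape)

pruned_w1 = w1 * mask
print(mask)
```

The two entries of magnitude 0.1 both fall inside the pruned set, so the tie between them does not affect the result.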
Practice Problems
Problem 1: Three independent training runs produce solutions $\theta_1, \theta_2, \theta_3$ with losses 0.15, 0.16, 0.17. Linear interpolation between (1,2) has max loss 0.18; between (2,3) has max loss 0.95. What can you conclude about the loss landscape topology?
Problem 2: The Hessian at a minimum has eigenvalues decaying as $\lambda_k \propto 1/k$. Compute the expected loss perturbation for isotropic Gaussian noise with variance $\sigma^2$, and compare to a minimum with $\lambda_k \propto e^{-k}$.
Problem 3: Why might a "winning ticket" found via iterative magnitude pruning at 90% sparsity NOT transfer to a different dataset, even though the original architecture works on both?
Problem 4: Explain why Dinh et al.'s (2017) scale-invariance argument means that "sharp vs flat minima" claims must be treated carefully for ReLU networks.
Problem 5: A Bezier curve with midpoint $\theta_{\text{mid}}$ connects $\theta_1$ and $\theta_2$. Derive the loss along the curve and explain why the midpoint must be trained.
Answers

**Problem 1:** (1,2) are linearly mode-connected (barrier 0.02-0.03, negligible), so they lie in the same convex basin. (2,3) are NOT linearly connected (barrier ~0.78), so they are separated by a loss barrier along the straight path. The landscape has at least two distinct regions containing good solutions. A nonlinear path might still connect them.

**Problem 2:** With $\lambda_k \propto 1/k$: $\text{tr}(H) \propto \sum_{k=1}^{d} 1/k \approx \log d$, which diverges slowly with $d$. Expected loss increase $\approx \sigma^2 \log d / 2$ (up to the proportionality constant), growing logarithmically with dimension. With $\lambda_k \propto e^{-k}$: $\text{tr}(H) \propto \sum_{k=1}^{\infty} e^{-k} = 1/(e-1) \approx 0.58$, which is finite. Expected loss increase $\approx 0.29\,\sigma^2$ (same caveat), independent of dimension. The first landscape is much more sensitive in high dimensions.

**Problem 3:** The winning ticket is a specific subnetwork structure paired with a specific initialisation. This structure is an inductive bias optimised for the training data distribution. Transfer to a different dataset requires different features and representations, which a different subnetwork might serve better. Winning tickets found this way are dataset-specific: they encode data-dependent architectural priors.

**Problem 4:** For a ReLU network, multiplying incoming weights to a neuron by $c > 0$ and dividing outgoing weights by $c$ produces an identical function. One can make the Hessian arbitrarily sharp or flat by choosing $c$, without changing the function, loss, or generalisation. Any sharpness-based generalisation measure must account for this scale invariance through normalisation; otherwise it measures reparameterisation choice, not network geometry.

**Problem 5:** The Bezier curve: $\theta(\alpha) = (1-\alpha)^2\theta_1 + 2\alpha(1-\alpha)\theta_{\text{mid}} + \alpha^2\theta_2$. At $\alpha=0$: $\theta(0)=\theta_1$; at $\alpha=1$: $\theta(1)=\theta_2$; at $\alpha=0.5$: $\theta(0.5)=0.25\theta_1+0.5\theta_{\text{mid}}+0.25\theta_2$. The loss along the curve is $L(\theta(\alpha))$ evaluated at these points.

The midpoint must be trained because the naive choice $\theta_{\text{mid}} = (\theta_1+\theta_2)/2$ (midpoint of the convex combination) gives a straight-line path, equivalent to linear connectivity. The Bezier midpoint adds a degree of freedom to navigate around loss barriers: it is trained to keep the loss low along the whole curve (Garipov et al. minimise the expected loss over uniformly sampled $\alpha$), effectively learning to "go around" barriers.

Summary
- Mode connectivity shows that overparameterised networks trained from different seeds often converge to solutions in the same low-loss basin, connected by near-constant-loss paths.
- Linear vs nonlinear connectivity: When linear interpolation crosses a loss barrier, Bezier curves with trained midpoints can find low-loss nonlinear paths around the barrier.
- Loss landscape geometry is dominated by extended basins in high dimensions — isolated local minima are rare, but barriers between basins exist.
- Lottery Ticket Hypothesis shows that dense networks contain sparse winning tickets at initialisation that can be trained in isolation to match full-network performance.
- Practical implications: Mode connectivity enables model merging; understanding landscape geometry informs optimiser design; winning tickets suggest we can train much smaller networks.
Pitfalls
- Sharpness is not well-defined: Without filter-wise normalisation, sharpness measures can be gamed by scale invariance in ReLU networks. Always normalise when comparing minima.
- Linear interpolation doesn't prove same basin: Low loss at a handful of $\alpha$ values shows only that the sampled points on one straight segment are low-loss; a coarse grid can miss a narrow barrier, and the segment says nothing about the surrounding region. In practice, though, successful linear interpolation on a fine grid strongly suggests same-basin membership.
- Winning tickets require careful rewinding: Using a different random seed or even slightly different initialisation destroys the winning ticket property. The specific coupling between architecture and initialisation matters.
Quiz
Q1: Two trained networks have a loss barrier of 3.2 when linearly interpolated. What does this most likely indicate?
A) The networks have identical weights
B) The networks are in different basins of the loss landscape
C) The loss function is convex
D) The networks are poorly trained
Q2: The Lottery Ticket Hypothesis requires rewinding weights to:
A) Random reinitialisation
B) The weights at epoch 50
C) Their original initialisation values
D) Zero weights
Q3: Why does batch normalisation improve mode connectivity?
A) It increases model capacity
B) It provides scale invariance that smooths the loss landscape
C) It makes training slower
D) It adds noise to the weights
Q4: A Hessian with trace 1200 at one minimum and trace 300 at another indicates:
A) The first minimum has lower loss
B) The second minimum has higher curvature
C) The first minimum is sharper (more sensitive to parameter noise)
D) Both minima have identical generalisation properties
Q5: The "supermasks" phenomenon demonstrates that:
A) Training is always necessary for good performance
B) Random weights with learned binary masks can achieve good accuracy without any weight updates
C) Larger networks always outperform smaller ones
D) Pruning always degrades performance
Answers

**Q1: B)** A loss barrier of 3.2 along a linear interpolation path means the midpoint has much higher loss than the endpoints, so the solutions are in separate basins. **A** is wrong (identical weights have zero barrier). **C** is wrong (convex functions have no barriers). **D** is wrong (both can be well-trained in their respective basins).

**Q2: C)** The key insight is resetting to the original initialisation values, not random values or later epochs. Original init preserves the "winning ticket" structure. **A** fails empirically. **B** (late rewinding) works for some cases but original rewinding is the hypothesis. **D** is just a zero network.

**Q3: B)** Batch norm introduces scale invariance (output invariant to weight scale) and reduces internal covariate shift, smoothing the loss landscape and making different solutions more likely to connect. **A** is wrong; BN affects training dynamics, not capacity. **C** is wrong (BN typically speeds up training). **D** is a side effect, not the mechanism.

**Q4: C)** Higher Hessian trace = higher average curvature = sharper minimum = larger loss increase under parameter perturbations. **A** is wrong (trace doesn't indicate loss level). **B** is the opposite. **D** is false; sharpness correlates (imperfectly) with generalisation.

**Q5: B)** Supermasks (Zhou et al., 2019) found binary masks over random weights that achieve non-trivial accuracy without training. **A** is contradicted. **C** is a separate scaling observation. **D** is false (winning tickets improve with pruning).

Next Steps
Move on to 25-06 — Federated Learning, where we study distributed training across decentralised data with privacy constraints.