24-03 — Neural Tangent Kernel (NTK)
Phase: 24 — Information Geometry & Advanced Theory Subject: 24-03 Prerequisites: Phase 16-17 (Neural Networks & Deep Learning — backpropagation, architectures), Phase 8 (Functional Analysis — kernel methods), 24-01 Fisher Information Next subject: 24-04 — Manifold Hypothesis and Representation Geometry
Learning Objectives
By the end of this subject, you will be able to:
- Define the Neural Tangent Kernel and derive it from the infinite-width limit of neural networks
- Explain why the NTK stays constant during training in the infinite-width regime
- Connect neural network training to kernel ridge regression via the NTK
- Compute the NTK analytically for simple architectures (fully connected, ReLU, erf)
- Understand the implications of the NTK for generalization, convergence, and the "lazy training" regime
Core Content
The Infinite-Width Limit
Consider a fully-connected neural network with $L$ layers, each of width $n$:
$$\mathbf{h}^{(0)} = \mathbf{x}, \quad \mathbf{h}^{(\ell)} = \frac{1}{\sqrt{n}} W^{(\ell)} \phi(\mathbf{h}^{(\ell-1)}), \quad f(\mathbf{x}; \theta) = \frac{1}{\sqrt{n}} \mathbf{w}^{(L)} \cdot \phi(\mathbf{h}^{(L-1)})$$
where $W^{(\ell)} \in \mathbb{R}^{n \times n}$ with entries i.i.d. $\mathcal{N}(0, 1)$, and $\mathbf{w}^{(L)} \in \mathbb{R}^n$ similarly. The scaling factor $1/\sqrt{n}$ ensures the variance remains stable across layers.
⚠️ CRITICAL: As $n \to \infty$, the network at initialization becomes a Gaussian process (NNGP — Neural Network Gaussian Process). The NTK describes what happens when you train this infinite network.
Definition of the NTK
The Neural Tangent Kernel $\Theta(\mathbf{x}, \mathbf{x}')$ is defined as:
$$\boxed{\Theta(\mathbf{x}, \mathbf{x}') = \sum_{p=1}^P \frac{\partial f(\mathbf{x}; \theta)}{\partial \theta_p} \cdot \frac{\partial f(\mathbf{x}'; \theta)}{\partial \theta_p} = \left\langle \nabla_\theta f(\mathbf{x}), \nabla_\theta f(\mathbf{x}') \right\rangle}$$
where $P$ is the total number of parameters. The NTK measures the similarity between two inputs through the lens of the network's Jacobian — it's the inner product of the gradients of the network output with respect to all parameters.
Key property: The NTK is a positive semidefinite kernel — for any set of inputs ${\mathbf{x}1, \ldots, \mathbf{x}_m}$, the Gram matrix $K{ij} = \Theta(\mathbf{x}_i, \mathbf{x}_j)$ is PSD.
Gradient Flow Dynamics
Consider training the network with gradient flow (continuous-time gradient descent) on the MSE loss:
$$\mathcal{L} = \frac{1}{2}\sum_{i=1}^m (f(\mathbf{x}_i; \theta) - y_i)^2$$
The parameter dynamics are:
$$\dot{\theta} = -\nabla_\theta\mathcal{L} = -\sum_{i=1}^m (f(\mathbf{x}i; \theta) - y_i) \nabla\theta f(\mathbf{x}_i)$$
The function dynamics follow via the chain rule:
$$\dot{f}(\mathbf{x}) = \nabla_\theta f(\mathbf{x})^T \dot{\theta} = -\sum_{i=1}^m (f(\mathbf{x}i; \theta) - y_i) \underbrace{\nabla\theta f(\mathbf{x})^T \nabla_\theta f(\mathbf{x}i)}{\Theta(\mathbf{x}, \mathbf{x}_i)}$$
So the function evolves according to:
$$\boxed{\dot{f}(\mathbf{x}) = -\sum_{i=1}^m \Theta(\mathbf{x}, \mathbf{x}_i) (f(\mathbf{x}_i) - y_i)}$$
The Fundamental Result: NTK is Constant at Infinite Width
This is the central theorem of NTK theory (Jacot et al., 2018). As $n \to \infty$:
$$\Theta(\mathbf{x}, \mathbf{x}'; \theta_t) \xrightarrow{n \to \infty} \Theta^\infty(\mathbf{x}, \mathbf{x}')$$
where $\Theta^\infty$ depends ONLY on the architecture and initialization distribution, NOT on the parameters $t$ during training.
Why this matters: If $\Theta$ is constant, the function dynamics become a closed-form linear ODE:
$$\dot{\mathbf{f}} = -\Theta \cdot (\mathbf{f} - \mathbf{y})$$
where $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_m))^T$ and $\Theta$ is the $m \times m$ Gram matrix. The solution is:
$$\mathbf{f}(t) = \mathbf{y} + (\mathbf{f}(0) - \mathbf{y}) e^{-\Theta t}$$
Or, more usefully, the prediction at a test point $\mathbf{x}^*$ at time $t$:
$$f(\mathbf{x}^; t) = f_0(\mathbf{x}^) + \Theta(\mathbf{x}^*, X) \Theta(X, X)^{-1} \left(I - e^{-\Theta(X,X)t}\right) (\mathbf{y} - \mathbf{f}_0)$$
As $t \to \infty$, this converges to the kernel ridge regression predictor:
$$f_\infty(\mathbf{x}^) = f_0(\mathbf{x}^) + \Theta(\mathbf{x}^*, X) \Theta(X, X)^{-1} (\mathbf{y} - \mathbf{f}_0)$$
⚠️ CRITICAL: An infinitely wide neural network trained with gradient descent is exactly equivalent to kernel regression with the NTK. Every aspect of training — the trajectory, the final predictor, the generalization — is determined by the NTK.
The Lazy Training Regime
The constant-NTK behavior defines the lazy training (or kernel) regime:
- The network function $f(\mathbf{x})$ changes, but the tangent features $\nabla_\theta f(\mathbf{x})$ do NOT change
- Parameters barely move from initialization: $|\theta_t - \theta_0| = O(1/\sqrt{n})$
- The network learns by combining its random features linearly, not by changing the features themselves
In contrast, the rich (or feature-learning) regime occurs when the width is finite and the NTK evolves during training — the network actually changes its internal representations.
Analytical NTK Formulas
Two-Layer ReLU Network
For $f(\mathbf{x}) = \frac{1}{\sqrt{n}} \sum_{i=1}^n a_i \sigma(\mathbf{w}_i^T \mathbf{x})$ with ReLU activation $\sigma(z) = \max(0, z)$:
The NNGP kernel (covariance of the random function at initialization): $$K(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\mathbf{w} \sim \mathcal{N}(0,I)}[\sigma(\mathbf{w}^T\mathbf{x}) \sigma(\mathbf{w}^T\mathbf{x}')]$$
For ReLU on the sphere ($|\mathbf{x}| = |\mathbf{x}'| = 1$): $$K(\mathbf{x}, \mathbf{x}') = \frac{1}{2\pi}\left(\sqrt{1-u^2} + u(\pi - \arccos u)\right)$$ where $u = \mathbf{x}^T\mathbf{x}'$ (the cosine similarity).
The NTK for the two-layer ReLU network is: $$\Theta(\mathbf{x}, \mathbf{x}') = K(\mathbf{x}, \mathbf{x}') + \mathbf{x}^T \mathbf{x}' \cdot \dot{K}(\mathbf{x}, \mathbf{x}')$$ where $\dot{K}$ is the derivative of $K$ with respect to $u$: $$\dot{K}(\mathbf{x}, \mathbf{x}') = \frac{1}{2\pi}(\pi - \arccos u)$$
Erf Activation
For $\sigma(z) = \text{erf}(z/\sqrt{2})$ (the error function, which approximates tanh), closed forms are even simpler:
$$K(\mathbf{x}, \mathbf{x}') = \frac{2}{\pi} \arcsin\left(\frac{\mathbf{x}^T\mathbf{x}'}{\sqrt{(1+|\mathbf{x}|^2)(1+|\mathbf{x}'|^2)}}\right)$$
This is the arcsin kernel, which plays a special role in random feature theory.
Deep Networks (Recursive NTK)
For a deep network, the NTK can be computed recursively. Let $K^{(\ell)}$ be the NNGP kernel at layer $\ell$ and $\Theta^{(\ell)}$ the NTK at layer $\ell$. For fully-connected ReLU networks:
$$K^{(0)}(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{x}'$$ $$\Theta^{(0)}(\mathbf{x}, \mathbf{x}') = K^{(0)}(\mathbf{x}, \mathbf{x}')$$
For $\ell \geq 1$: $$K^{(\ell)}(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(\ell-1)})}[\sigma(u)\sigma(v)]$$ $$\Theta^{(\ell)}(\mathbf{x}, \mathbf{x}') = K^{(\ell)}(\mathbf{x}, \mathbf{x}') + \Theta^{(\ell-1)}(\mathbf{x}, \mathbf{x}') \cdot \dot{K}^{(\ell)}(\mathbf{x}, \mathbf{x}')$$
where $\Lambda^{(\ell-1)} = \begin{pmatrix} K^{(\ell-1)}(\mathbf{x},\mathbf{x}) & K^{(\ell-1)}(\mathbf{x},\mathbf{x}') \ K^{(\ell-1)}(\mathbf{x}',\mathbf{x}) & K^{(\ell-1)}(\mathbf{x}',\mathbf{x}') \end{pmatrix}$.
NTK as a Kernels-on-Kernels Structure
The deep NTK can be understood as a hierarchical composition:
$$\Theta^{(L)}(\mathbf{x}, \mathbf{x}') = \sum_{\ell=1}^L K^{(\ell)}(\mathbf{x}, \mathbf{x}') \cdot \prod_{k=\ell+1}^L \dot{K}^{(k)}(\mathbf{x}, \mathbf{x}')$$
This sum over layers reveals that the NTK is an ensemble of kernels at different scales. Each layer contributes a term weighted by the "gradient propagation" through higher layers. The deeper the layer, the more its contribution is modulated by higher-level derivative terms.
NTK and Fisher Information
A profound connection: the NTK at initialization is the empirical Fisher information matrix for the regression parameters, but computed over the parameter space rather than the function space:
$$\Theta(\mathbf{x}, \mathbf{x}') = \nabla_\theta f(\mathbf{x})^T \nabla_\theta f(\mathbf{x}')$$
The Fisher information for the function values $f$ (treating $f(\mathbf{x})$ as the "parameter") is related to the NTK Gram matrix. Moreover, natural gradient descent in function space using the NTK as the metric gives exactly the same dynamics as standard gradient descent in parameter space at infinite width. The NTK is the pullback of the Euclidean metric in parameter space to function space.
Practical Implications
Convergence Speed
The eigenvalues of the NTK Gram matrix $\Theta(X, X)$ determine the convergence rate of gradient descent. The function along eigenmode $i$ converges at rate $e^{-\lambda_i t}$:
$$f_{\text{train}}(t) - \mathbf{y} = \sum_i \mathbf{u}_i e^{-\lambda_i t} \langle \mathbf{u}_i, \mathbf{f}(0) - \mathbf{y} \rangle$$
Small eigenvalues mean slow convergence. The condition number of $\Theta$ governs the worst-case convergence. Networks that generalize well often have NTKs with a rapidly decaying spectrum — they fit the "easy" directions first (benign overfitting).
Generalization
The generalization error for NTK regression is:
$$\mathbb{E}_{\mathbf{x}^}[(f_\infty(\mathbf{x}^) - y^*)^2] = \text{Bias}^2 + \text{Variance}$$
where the variance term is $\sigma^2 \cdot \Theta(\mathbf{x}^, X) \Theta(X, X)^{-2} \Theta(X, \mathbf{x}^)$. The NTK's eigenstructure determines which functions are easy to learn: functions aligned with large-eigenvalue directions of $\Theta$ are learned quickly with low sample complexity.
Width Dependence
Real networks have finite width $n$, and the NTK evolves. The rate of NTK change at initialization scales as:
$$|\Theta_t - \Theta_0| \sim O(1/\sqrt{n})$$
This means: - Very wide networks ($n \gg m$) behave like kernel methods - Narrow networks ($n \sim m$) exhibit feature learning - The crossover happens around the "critical width" where $n \approx m$
Common Misconceptions
Myth: "NTK theory says neural networks are just kernel methods."
Reality: NTK theory describes the infinite-width limit. Finite networks can and do learn features. The NTK is a theoretical tool for understanding training dynamics, not a claim that all neural networks reduce to kernel methods.
Myth: "The NTK being constant means the network doesn't learn."
Reality: The function $f(\mathbf{x})$ changes substantially! It's the tangent features $\nabla_\theta f$ that stay constant. The network learns by reweighting fixed random features, which is how kernel methods work.
Myth: "NTK applies only to MSE loss."
Reality: While MSE gives the cleanest theory, the NTK framework extends to any loss via the Gauss-Newton decomposition of the Hessian. For cross-entropy loss, the Fisher-NTK appears.
Key Terms
- Finite-width
- Gaussian process
Worked Examples
Example 1: Two-Point NTK Regression
Consider a two-layer ReLU network (infinite width) with two training points $\mathbf{x}_1 = (1, 0)^T$, $\mathbf{x}_2 = (0, 1)^T$ and targets $y_1 = 2$, $y_2 = -1$. Assume $\Theta(\mathbf{x}_1, \mathbf{x}_1) = \Theta(\mathbf{x}_2, \mathbf{x}_2) = 1$ and $\Theta(\mathbf{x}_1, \mathbf{x}_2) = 0.2$. Compute the infinite-time prediction for $\mathbf{x}^* = (1, 1)^T$ assuming $f_0 \equiv 0$.
Solution:
The NTK Gram matrix: $\Theta(X, X) = \begin{pmatrix} 1 & 0.2 \ 0.2 & 1 \end{pmatrix}$.
Inverse: $\Theta^{-1} = \frac{1}{1 - 0.04}\begin{pmatrix} 1 & -0.2 \ -0.2 & 1 \end{pmatrix} = \begin{pmatrix} 1.0417 & -0.2083 \ -0.2083 & 1.0417 \end{pmatrix}$.
Need $\Theta(\mathbf{x}^, X)$. For ReLU: $K(\mathbf{x}_1, \mathbf{x}^) = \mathbb{E}[\sigma(\mathbf{w}^T\mathbf{x}_1)\sigma(\mathbf{w}^T\mathbf{x}^*)]$.
Since $\mathbf{x}_1^T\mathbf{x}^ = 1$, they're not orthogonal. For ReLU on the sphere, $K(\mathbf{x}_1, \mathbf{x}^) \approx 0.4$ (can be computed analytically). Similarly for $\mathbf{x}_2$.
Let's use simplified values: assume $\Theta(\mathbf{x}^, \mathbf{x}_1) = 0.6$, $\Theta(\mathbf{x}^, \mathbf{x}_2) = 0.6$.
Prediction: $f_\infty(\mathbf{x}^*) = \begin{pmatrix}0.6 & 0.6\end{pmatrix} \begin{pmatrix}1.0417 & -0.2083 \ -0.2083 & 1.0417\end{pmatrix} \begin{pmatrix}2 \ -1\end{pmatrix}$
$= (0.6, 0.6) \cdot (2.2917, -1.4583)^T = 0.6 \times 0.8334 = 0.5$.
The prediction is roughly the average of the targets, weighted by kernel similarity.
Click for answer
$f_\infty(\mathbf{x}^*) \approx 0.5$. The NTK regression interpolates between the training points using the kernel similarities. With correlation 0.2 and similar test-kernel values, the prediction approximates an average of the targets. In the limit of orthogonal training points ($\Theta_{12} = 0$), the prediction would be $0.6 \times 2 + 0.6 \times (-1) = 0.6$.Example 2: Convergence Rate from Eigenvalues
Two training points with NTK Gram matrix $\Theta = \begin{pmatrix} 1 & \rho \ \rho & 1 \end{pmatrix}$ and $\mathbf{f}(0) = \mathbf{0}$, $\mathbf{y} = (1, -1)^T$. Compute the function values $f_1(t)$ and $f_2(t)$ as a function of time. Interpret the result for $\rho = 0$ vs $\rho = 0.9$.
Solution:
Eigenvalues of $\Theta$: $\lambda_1 = 1 + \rho$ (eigenvector $\mathbf{u}_1 = \frac{1}{\sqrt{2}}(1, 1)^T$), $\lambda_2 = 1 - \rho$ (eigenvector $\mathbf{u}_2 = \frac{1}{\sqrt{2}}(1, -1)^T$).
The solution to $\dot{\mathbf{f}} = -\Theta\mathbf{f} + \Theta\mathbf{y}$ is:
$\mathbf{f}(t) = \mathbf{y} - e^{-\Theta t}\mathbf{y}$ (since $\mathbf{f}(0) = 0$).
Decompose $\mathbf{y}$: $\mathbf{y} = (1, -1)^T = 0 \cdot \mathbf{u}_1 + \sqrt{2} \cdot \mathbf{u}_2$.
So $\mathbf{f}(t) = \mathbf{y} - e^{-\lambda_1 t} \cdot 0 \cdot \mathbf{u}_1 - e^{-\lambda_2 t} \cdot \sqrt{2} \cdot \mathbf{u}_2 = (1 - e^{-(1-\rho)t}) \mathbf{y}$.
$$f_1(t) = 1 - e^{-(1-\rho)t}, \quad f_2(t) = -1 + e^{-(1-\rho)t}$$
For $\rho = 0$: rate is $e^{-t}$, fast convergence. For $\rho = 0.9$: rate is $e^{-0.1t}$, very slow — it takes 10× longer. When $\rho \to 1$ (nearly identical points), $\lambda_2 \to 0$ and training takes infinitely long — the model can't distinguish the two points.
Click for answer
$f_1(t) = 1 - e^{-(1-\rho)t}$, $f_2(t) = -1 + e^{-(1-\rho)t}$. The convergence rate is governed by the *smallest* eigenvalue $\lambda_{\min} = 1 - \rho$. When $\rho$ is close to 1 (highly correlated inputs), the NTK is ill-conditioned and training is very slow. This is the kernel analog of the "narrow valley" problem in neural network optimization.Example 3: Computing the NTK for a Linear Network
Consider a deep linear network $f(\mathbf{x}) = W_L \cdots W_2 W_1 \mathbf{x}$ with $W_\ell \in \mathbb{R}^{n \times n}$ for all $\ell$. Compute the NTK at initialization and show it only depends on $\mathbf{x}^T\mathbf{x}'$.
Solution:
By the chain rule, for any parameter in layer $\ell$:
$\frac{\partial f}{\partial W_\ell} = W_L \cdots W_{\ell+1} \otimes (\mathbf{h}^{(\ell-1)})^T$ where $\mathbf{h}^{(\ell-1)} = W_{\ell-1} \cdots W_1 \mathbf{x}$.
The NTK sums over all parameters. At infinite width and random Gaussian initialization:
$$\Theta(\mathbf{x}, \mathbf{x}') = \sum_{\ell=1}^L \mathbb{E}\left[\left| \frac{\partial f}{\partial W_\ell} \right|^2_{\text{relative to } \mathbf{x}, \mathbf{x}'}\right]$$
For deep linear networks, this simplifies to:
$$\Theta(\mathbf{x}, \mathbf{x}') = L \cdot (\mathbf{x}^T\mathbf{x}')^{L-1} \cdot \mathbf{x}^T\mathbf{x}' = L (\mathbf{x}^T\mathbf{x}')^L$$
The NTK is simply $L$ times the $L$-th power of the dot product. For $L=1$ (linear regression): $\Theta = \mathbf{x}^T\mathbf{x}'$ — the linear kernel. For $L \to \infty$: $\Theta(\mathbf{x}, \mathbf{x}') \to 0$ for $|\mathbf{x}^T\mathbf{x}'| < 1$, and blows up for $\mathbf{x}^T\mathbf{x}' > 1$. This pathological behavior is why deep linear networks need careful normalization — a deep ReLU network doesn't suffer from this because ReLU's nonlinearity regularizes the NTK.
Click for answer
$\Theta(\mathbf{x}, \mathbf{x}') = L (\mathbf{x}^T\mathbf{x}')^L$. The kernel is a simple polynomial of the dot product, reflecting the depth. The gradient contribution from each layer amplifies the dot product. Deep linear networks have pathological NTKs (exponential in $L$), which is one reason why nonlinear activations are essential for deep architectures.Practice Problems
Problem 1: For the NTK Gram matrix of $m$ i.i.d. points on the unit sphere with ReLU activation, the eigenvalues follow a Marchenko-Pastur-like distribution. If the largest eigenvalue is $\lambda_1 \approx m$ and the smallest nonzero is $\lambda_m \approx 1/m$, what does this imply about the condition number and training time?
Click for answer
Condition number $\kappa = \lambda_{\max}/\lambda_{\min} \approx m^2$. Training time $T$ to reach error $\epsilon$ satisfies $T \propto \kappa \log(1/\epsilon) \propto m^2 \log(1/\epsilon)$. This means training time scales quadratically with dataset size in the kernel regime — a fundamental limitation of lazy training. In contrast, feature-learning regimes can break this scaling by adapting the kernel during training, which is why finite-width networks can be more sample-efficient than their infinite-width limits for some problems.Problem 2: Derive the gradient flow ODE for training with weight decay (L2 regularization $\frac{\lambda}{2}|\theta|^2$). Show that the infinite-time limit is kernel ridge regression with regularization $\lambda$.
Click for answer
With weight decay: $\mathcal{L}_{\text{reg}} = \frac{1}{2}\sum_i (f(\mathbf{x}_i) - y_i)^2 + \frac{\lambda}{2}\|\theta\|^2$. Parameter dynamics: $\dot{\theta} = -\sum_i (f(\mathbf{x}_i) - y_i)\nabla_\theta f(\mathbf{x}_i) - \lambda\theta$. Function dynamics: $\dot{f}(\mathbf{x}) = -\sum_i \Theta(\mathbf{x}, \mathbf{x}_i)(f(\mathbf{x}_i) - y_i) - \lambda \nabla_\theta f(\mathbf{x})^T \theta$. In the NTK regime, $\nabla_\theta f(\mathbf{x})^T \theta \approx f(\mathbf{x})$ (since $f$ is approximately linear in $\theta$ near initialization). So: $\dot{f}(\mathbf{x}) \approx -\sum_j \Theta(\mathbf{x}, \mathbf{x}_j)(f(\mathbf{x}_j) - y_j) - \lambda f(\mathbf{x})$. At steady state: $\Theta(X, X)(\mathbf{f} - \mathbf{y}) + \lambda\mathbf{f} = 0 \implies \mathbf{f} = (\Theta + \lambda I)^{-1} \Theta \mathbf{y} = \Theta(\Theta + \lambda I)^{-1} \mathbf{y}$. This is exactly kernel ridge regression! The parameter $\lambda$ directly controls the ridge penalty.Problem 3: A two-layer network $f(\mathbf{x}) = \frac{1}{\sqrt{n}}\sum_i a_i \sigma(\mathbf{w}_i^T \mathbf{x})$ has the NTK $\Theta(\mathbf{x}, \mathbf{x}') = K(\mathbf{x}, \mathbf{x}') + \mathbf{x}^T\mathbf{x}' \dot{K}(\mathbf{x}, \mathbf{x}')$. Explain the two terms: which parameters contribute to each?
Click for answer
The two terms come from different parameter groups: 1. **$K(\mathbf{x}, \mathbf{x}')$:** From the second-layer weights $\{a_i\}$. The gradient w.r.t. $a_i$ is $\frac{1}{\sqrt{n}}\sigma(\mathbf{w}_i^T\mathbf{x})$. So $\sum_i \nabla_{a_i} f(\mathbf{x}) \nabla_{a_i} f(\mathbf{x}') = \frac{1}{n}\sum_i \sigma(\mathbf{w}_i^T\mathbf{x})\sigma(\mathbf{w}_i^T\mathbf{x}') \to K(\mathbf{x}, \mathbf{x}')$ as $n \to \infty$. 2. **$\mathbf{x}^T\mathbf{x}' \dot{K}(\mathbf{x}, \mathbf{x}')$:** From the first-layer weights $\{\mathbf{w}_i\}$. The gradient w.r.t. $\mathbf{w}_i$ is $\frac{a_i}{\sqrt{n}}\sigma'(\mathbf{w}_i^T\mathbf{x})\mathbf{x}$. So $\sum_i \nabla_{\mathbf{w}_i}f(\mathbf{x})^T \nabla_{\mathbf{w}_i}f(\mathbf{x}') = \frac{1}{n}\sum_i a_i^2 \sigma'(\mathbf{w}_i^T\mathbf{x})\sigma'(\mathbf{w}_i^T\mathbf{x}') \mathbf{x}^T\mathbf{x}' \to \mathbf{x}^T\mathbf{x}' \cdot \mathbb{E}[\sigma'(\mathbf{w}^T\mathbf{x})\sigma'(\mathbf{w}^T\mathbf{x}')] = \mathbf{x}^T\mathbf{x}' \dot{K}$. The first-layer contribution incorporates the input correlation $\mathbf{x}^T\mathbf{x}'$ explicitly — this is why the NTK can distinguish inputs that the NNGP kernel cannot. The "tangent" part (first-layer gradients) provides the learnability that the "function" part (NNGP) lacks.Problem 4: In the finite-width regime, the NTK changes during training. Show that the rate of change is bounded by $|\Theta_t - \Theta_0| \leq O(|\theta_t - \theta_0| \cdot \sup_\theta |\nabla_\theta \Theta|)$. What happens to this bound as $n \to \infty$?
Click for answer
By the mean value theorem: $\|\Theta_t - \Theta_0\| \leq \|\theta_t - \theta_0\| \cdot \sup_{\theta \in [\theta_0, \theta_t]} \|\nabla_\theta \Theta\|$. From the NTK definition: $\Theta(\mathbf{x}, \mathbf{x}') = \nabla_\theta f(\mathbf{x})^T \nabla_\theta f(\mathbf{x}')$. So: $\nabla_{\theta_p} \Theta = (\nabla_\theta^2 f(\mathbf{x}))_{:,p}^T \nabla_\theta f(\mathbf{x}') + \nabla_\theta f(\mathbf{x})^T (\nabla_\theta^2 f(\mathbf{x}'))_{:,p}$. At Gaussian initialization, each term scales as $O(1/\sqrt{n})$ (since each parameter contributes $1/\sqrt{n}$ and there are $O(n^2)$ parameters). More carefully: $\|\nabla_\theta \Theta\| = O(1/\sqrt{n})$ and $\|\theta_t - \theta_0\| = O(1/\sqrt{n})$ (lazy regime). So: $\|\Theta_t - \Theta_0\| = O(1/n) \to 0$ as $n \to \infty$. This proves the NTK becomes constant at infinite width. For finite $n$, the NTK changes by $O(1/n)$, which is the leading-order correction to the kernel regime.Problem 5: Beyond MSE loss: Derive the NTK-like dynamics for training with cross-entropy loss on a binary classification problem. What kernel appears, and how does it relate to the Fisher information?
Click for answer
For binary classification with log-likelihood: $\mathcal{L} = -\sum_i \left[y_i \log \sigma(f_i) + (1-y_i)\log(1-\sigma(f_i))\right]$. Gradient: $\dot{\theta} = -\sum_i (\sigma(f_i) - y_i) \nabla_\theta f_i$. Function dynamics: $\dot{f}(\mathbf{x}) = -\sum_i \Theta(\mathbf{x}, \mathbf{x}_i)(\sigma(f_i) - y_i)$. This is the same functional form as MSE, but with $\sigma(f) - y$ instead of $f - y$. The kernel remains $\Theta$. For the *Newton* (second-order) dynamics, the Hessian of the loss has a Gauss-Newton decomposition: $\nabla_\theta^2 \mathcal{L} = \sum_i \sigma(f_i)(1-\sigma(f_i)) \nabla_\theta f_i \nabla_\theta f_i^T + \sum_i (\sigma(f_i) - y_i) \nabla_\theta^2 f_i$. The first term uses the **tangent kernel weighted by Fisher information of the output layer**: $\tilde{\Theta}(\mathbf{x}, \mathbf{x}') = \sigma(f(\mathbf{x}))(1-\sigma(f(\mathbf{x}))) \cdot \Theta(\mathbf{x}, \mathbf{x}')$. This is exactly the Gauss-Newton matrix, which for generalized linear models is the natural gradient preconditioner in parameter space. The NTK appears as the *Jacobian part* of the Fisher/preconditioner.Summary
Key takeaways:
- The NTK $\Theta(\mathbf{x}, \mathbf{x}') = \langle \nabla_\theta f(\mathbf{x}), \nabla_\theta f(\mathbf{x}') \rangle$ measures input similarity through the lens of the network's Jacobian
- At infinite width, the NTK stays constant during training — gradient descent becomes equivalent to kernel ridge regression with the NTK
- The function dynamics follow a linear ODE: $\dot{\mathbf{f}} = -\Theta(\mathbf{f} - \mathbf{y})$, with convergence rate determined by $\Theta$'s eigenvalues
- The NTK is always PSD and has a recursive structure through layers, connecting it to NNGP kernels via $\Theta^{(\ell)} = K^{(\ell)} + \Theta^{(\ell-1)}\dot{K}^{(\ell)}$
- Finite-width networks deviate from the kernel regime by $O(1/\sqrt{n})$, enabling feature learning that infinite networks cannot achieve
- The NTK provides a theoretical bridge between deep learning and kernel methods, explaining when and why neural networks work
Quiz
Question 1: What happens to the Neural Tangent Kernel as the network width $n \to \infty$?
A. It becomes zero everywhere B. It becomes constant during training (independent of $t$) C. It becomes a function of the training labels D. It oscillates randomly
Correct Answer: B. It becomes constant during training (independent of $t$)
Explanation: The central result of NTK theory: $\Theta_t \to \Theta^\infty$ (deterministic and time-independent) as $n \to \infty$. This is because each parameter moves by $O(1/\sqrt{n})$ and the NTK change is $O(1/n)$. A is wrong — it converges to a well-defined, non-zero kernel. C is wrong — it depends on architecture and inputs, not labels. D is wrong — the randomness averages out in the infinite-width limit.
Question 2: In the NTK regime, what is the infinite-time prediction $f_\infty(\mathbf{x}^*)$ for a test point (assuming $f_0 \equiv 0$)?
A. The average of training labels B. $\Theta(\mathbf{x}^, X) \Theta(X, X)^{-1} \mathbf{y}$ C. $\mathbf{y}$ regardless of $\mathbf{x}^$ D. A random function of $\mathbf{x}^*$
Correct Answer: B. $\Theta(\mathbf{x}^*, X) \Theta(X, X)^{-1} \mathbf{y}$
Explanation: This is exactly the kernel ridge regression predictor (with zero regularization). At $t \to \infty$, gradient flow converges to the minimum-norm interpolant in the RKHS defined by the NTK. Option A is only correct if $\Theta$ is the all-ones kernel. Option C means perfect memorization of labels, which only happens if the NTK is the identity matrix. Option D is wrong because the NTK is deterministic in the limit.
Question 3: Which part of a two-layer network's NTK $\Theta = K + \mathbf{x}^T\mathbf{x}'\dot{K}$ comes from the first-layer weights?
A. $K(\mathbf{x}, \mathbf{x}')$ only B. $\mathbf{x}^T\mathbf{x}'\dot{K}$ only C. Both $K$ and $\mathbf{x}^T\mathbf{x}'\dot{K}$ D. Neither — it comes from the biases
Correct Answer: B. $\mathbf{x}^T\mathbf{x}'\dot{K}$ only
Explanation: The first-layer weight gradients involve $\sigma'(\mathbf{w}^T\mathbf{x})\mathbf{x}$, and the $\mathbf{x}^T\mathbf{x}'$ factor comes from the outer product of these input vectors. $K$ comes from the second-layer weights ${a_i}$ whose gradients are simply $\sigma(\mathbf{w}^T\mathbf{x})$. The "tangent" part $\mathbf{x}^T\mathbf{x}'\dot{K}$ is what enables the NTK to be a richer kernel than the NNGP kernel $K$.
Question 4: How does the NTK eigenvalue spectrum affect training?
A. All eigenvalues contribute equally to convergence B. Only the largest eigenvalue matters C. Modes with small eigenvalues converge slowly; the smallest eigenvalue controls the worst-case rate D. Eigenvalues have no effect on training dynamics
Correct Answer: C. Modes with small eigenvalues converge slowly; the smallest eigenvalue controls the worst-case rate
Explanation: Each NTK eigenmode decays as $e^{-\lambda_i t}$. The slowest mode (smallest $\lambda_i$) determines how long training takes to reach a given accuracy. Highly ill-conditioned NTKs (large $\kappa = \lambda_{\max}/\lambda_{\min}$) lead to slow convergence. This is analogous to how the condition number of the Hessian affects convergence in convex optimization.
Question 5: What is the relationship between the NTK and natural gradient descent?
A. They are unrelated concepts B. The NTK is the Fisher information matrix used in natural gradient C. The NTK acting on function space gives the same dynamics as standard gradient in parameter space at infinite width D. Natural gradient is the inverse of the NTK update
Correct Answer: C. The NTK acting on function space gives the same dynamics as standard gradient in parameter space at infinite width
Explanation: $\dot{f} = -\Theta(\mathbf{f} - \mathbf{y})$ in function space. The NTK is the pullback of the Euclidean metric in parameter space to function space. Natural gradient in parameter space uses $I^{-1}$; the NTK is $JJ^T$ where $J = \nabla_\theta \mathbf{f}$. In the infinite-width limit, these two perspectives coincide for MSE loss. Option B is close but not exactly right — the NTK is $JJ^T$ while the Fisher for regression would involve expectations over data.
Question 6: In the "lazy training" regime, which of the following is true?
A. The network weights change dramatically during training B. The network function does not change at all C. The tangent features $\nabla_\theta f$ remain approximately constant while the function changes D. The NTK becomes singular
Correct Answer: C. The tangent features $\nabla_\theta f$ remain approximately constant while the function changes
Explanation: "Lazy" refers to the parameters barely moving ($|\theta_t - \theta_0| \ll 1$), but the function can change significantly because $f(\mathbf{x}) \approx f_0(\mathbf{x}) + \nabla_\theta f_0(\mathbf{x})^T(\theta_t - \theta_0)$ — a small parameter change is amplified by the large gradient. The tangent features (gradients) are what stay fixed. This is why the NTK is constant: $\Theta = \nabla f \cdot \nabla f^T$ depends on the features, not the function values.
Pitfalls
-
Thinking NTK theory means neural networks are "just kernel methods": The NTK describes the infinite-width limit. Finite-width networks can and do learn features — the NTK evolves during training. Confusing the infinite-width limit with reality leads to bad predictions about what real networks can learn. The NTK is a theoretical lens, not the whole story.
-
Confusing the NTK with the NNGP kernel: The NNGP kernel $K(\mathbf{x}, \mathbf{x}')$ describes the covariance of the network's output at initialization (a Gaussian process). The NTK $\Theta(\mathbf{x}, \mathbf{x}') = K + \mathbf{x}^T\mathbf{x}'\dot{K}$ describes the training dynamics. The NTK is always at least as expressive as the NNGP (it includes the tangent component), and using only $K$ for predictions ignores the learnability captured by the first-layer gradients.
-
Ignoring finite-width corrections: Real networks operate at finite width $n$, where the NTK changes by $O(1/\sqrt{n})$ during training. For $n \sim m$ (width comparable to dataset size), this change is significant — the network is in the feature-learning regime. NTK-based analyses that ignore finite-width effects can be misleading for practical architectures.
-
Over-interpreting convergence rates from NTK eigenvalues: The NTK eigenvalue spectrum predicts kernel regime convergence, but real networks often converge faster because feature learning adapts the kernel during training. Relying solely on the NTK condition number to predict training time can overestimate how long training takes in the feature-learning regime.
Next Steps
Next up: 24-04 — Manifold Hypothesis and Representation Geometry — where you'll explore how neural networks learn to untangle data manifolds, and how to measure and compare the geometry of learned representations using CKA and related metrics.