
24-01 — Fisher Information

Phase: 24 (Information Geometry & Advanced Theory)

Subject: 24-01

Prerequisites: Phase 11 (Information Theory: entropy, KL divergence), Phase 12-13 (Statistics: likelihood, estimators)

Next subject: 24-02 (Natural Gradient Descent)


Learning Objectives

By the end of this subject, you will be able to:

  1. Define the score function and Fisher information matrix from first principles
  2. Compute Fisher information for common distributions (Bernoulli, Gaussian, categorical)
  3. State and apply the Cramér-Rao lower bound to bound estimator variance
  4. Understand the Fisher information matrix as a Riemannian metric on the statistical manifold
  5. Relate Fisher information to KL divergence via the second-order Taylor expansion

Core Content

The Score Function

Consider a parametric family of probability distributions $p(x \mid \theta)$ where $\theta \in \mathbb{R}^d$ is the parameter vector. The score function $s(\theta, x)$ is the gradient of the log-likelihood with respect to the parameters:

$$s(\theta, x) = \nabla_\theta \log p(x \mid \theta)$$

Key property: The score has zero expectation under the true distribution:

$$\mathbb{E}_{x \sim p(x \mid \theta)}[s(\theta, x)] = \int p(x \mid \theta) \, \nabla_\theta \log p(x \mid \theta) \, dx = \int \nabla_\theta p(x \mid \theta) \, dx = \nabla_\theta \int p(x \mid \theta) \, dx = \nabla_\theta 1 = \mathbf{0}$$

This follows from differentiating $\int p = 1$ under the integral sign, which is valid under the usual regularity conditions (for example, the support of $p$ must not depend on $\theta$). The score tells you how sensitive the log-likelihood is to changes in $\theta$: steep log-likelihoods mean the data is highly informative about the parameter.
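The zero-mean property can be checked numerically. The sketch below is illustrative only: it assumes the Bernoulli family (derived later in this subject), and the helper names `bernoulli_score` and `expected_score` are made up for this example. Because $x$ takes only two values, the expectation is an exact two-term sum rather than a Monte Carlo estimate.

```python
def bernoulli_score(theta: float, x: int) -> float:
    """Score d/dtheta log p(x | theta) for p = theta^x (1 - theta)^(1 - x)."""
    return x / theta - (1 - x) / (1 - theta)


def expected_score(theta: float) -> float:
    """Exact expectation of the score: sum over x in {0, 1} of p(x) * s(theta, x)."""
    pmf = {0: 1.0 - theta, 1: theta}
    return sum(pmf[x] * bernoulli_score(theta, x) for x in (0, 1))


for theta in (0.1, 0.5, 0.9):
    # Values should be zero up to floating-point rounding.
    print(f"theta={theta}: E[s] = {expected_score(theta):+.2e}")
```

Individual score values can be large (for example $s = 10$ at $\theta = 0.1$, $x = 1$); it is only their probability-weighted average that cancels.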

Fisher Information Matrix

The Fisher information matrix $I(\theta)$ is the covariance of the score function:

$$I(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)}\left[s(\theta, x) \, s(\theta, x)^T\right]$$

Since $\mathbb{E}[s] = \mathbf{0}$, the covariance equals the second moment. Equivalently, it's the negative expected Hessian of the log-likelihood:

$$I(\theta) = -\mathbb{E}_{x \sim p(x \mid \theta)}\left[\nabla_\theta^2 \log p(x \mid \theta)\right]$$

Proof of equivalence:

Start with the identity $\int p(x \mid \theta) \, dx = 1$. Differentiate twice:

$$\nabla_\theta^2 \int p(x \mid \theta) \, dx = \int \nabla_\theta^2 p(x \mid \theta) \, dx = \mathbf{0}$$

Using $\nabla_\theta^2 \log p = \frac{\nabla_\theta^2 p}{p} - \frac{(\nabla_\theta p)(\nabla_\theta p)^T}{p^2} = \frac{\nabla_\theta^2 p}{p} - s s^T$:

$$\mathbb{E}[\nabla_\theta^2 \log p] = \int p \cdot \left(\frac{\nabla_\theta^2 p}{p} - s s^T\right) dx = \int \nabla_\theta^2 p \, dx - \mathbb{E}[s s^T] = \mathbf{0} - I(\theta)$$

Thus $I(\theta) = -\mathbb{E}[\nabla_\theta^2 \log p]$.

⚠️ CRITICAL: $I(\theta)$ is always positive semidefinite (PSD) because it is a covariance matrix. For a well-parameterized model, where every direction in parameter space changes the distribution, it is positive definite, which is the property a Riemannian metric requires.
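As a sanity check on the equivalence proof, the two expressions for $I(\theta)$ can be evaluated side by side. The sketch below assumes a Binomial$(m, \theta)$ model, which is not used elsewhere in this subject and is chosen only because its expectations are finite sums; `fisher_two_ways` is a hypothetical helper name.

```python
from math import comb


def binomial_pmf(m: int, theta: float, x: int) -> float:
    """P(X = x) for X ~ Binomial(m, theta)."""
    return comb(m, x) * theta**x * (1 - theta) ** (m - x)


def fisher_two_ways(m: int, theta: float) -> tuple[float, float]:
    """Return (E[s^2], -E[d^2/dtheta^2 log p]) by exact summation over x = 0..m."""
    second_moment = 0.0
    neg_expected_hessian = 0.0
    for x in range(m + 1):
        p = binomial_pmf(m, theta, x)
        score = x / theta - (m - x) / (1 - theta)
        hessian = -x / theta**2 - (m - x) / (1 - theta) ** 2
        second_moment += p * score**2
        neg_expected_hessian -= p * hessian
    return second_moment, neg_expected_hessian


outer, hess = fisher_two_ways(m=10, theta=0.3)
# Both values should agree with each other and with m / (theta * (1 - theta)).
print(outer, hess, 10 / (0.3 * 0.7))
```

The score-based form is usually easier to estimate from samples, while the Hessian form is often easier to evaluate by hand; agreement between the two is a useful test of a derivation.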

Computing Fisher Information for Common Distributions

Bernoulli Distribution

$p(x \mid \theta) = \theta^x (1-\theta)^{1-x}$ for $x \in \{0, 1\}$, $\theta \in (0, 1)$.

$$\log p = x \log \theta + (1-x) \log(1-\theta)$$

Score: $s(\theta, x) = \frac{\partial}{\partial\theta} \log p = \frac{x}{\theta} - \frac{1-x}{1-\theta} = \frac{x - \theta}{\theta(1-\theta)}$

Fisher information (scalar):

$$I(\theta) = \mathbb{E}\left[\left(\frac{x-\theta}{\theta(1-\theta)}\right)^2\right] = \frac{\mathbb{E}[(x-\theta)^2]}{\theta^2(1-\theta)^2} = \frac{\theta(1-\theta)}{\theta^2(1-\theta)^2} = \frac{1}{\theta(1-\theta)}$$

Note: $I(\theta) \to \infty$ as $\theta \to 0$ or $\theta \to 1$ — extreme probabilities are the easiest to estimate precisely because there's almost no randomness.

Gaussian with Known Variance

$p(x \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, with $\sigma^2$ fixed.

$$\log p = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$$

Score: $s(\mu, x) = \frac{\partial}{\partial\mu}\log p = \frac{x-\mu}{\sigma^2}$

$I(\mu) = \mathbb{E}\left[\left(\frac{x-\mu}{\sigma^2}\right)^2\right] = \frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2}$

The Fisher information is constant — every observation is equally informative about $\mu$ regardless of $\mu$'s value.

Gaussian with Both Parameters

For $\theta = (\mu, \sigma^2)^T$:

$$\log p = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$$

The Fisher matrix is $2 \times 2$:

$$I(\mu, \sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}$$

The off-diagonal zero means $\mu$ and $\sigma^2$ are orthogonal parameters — information about one doesn't tell you about the other.
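The $2 \times 2$ matrix can be approximated by averaging score outer products over samples. The following is a rough Monte Carlo sketch, not a derivation: the score components are obtained by differentiating the log-density above with respect to $\mu$ and $\sigma^2$, and the sample size, seed, and function names are arbitrary choices for illustration.

```python
import random


def gaussian_scores(x: float, mu: float, var: float) -> tuple[float, float]:
    """Score components with respect to (mu, sigma^2)."""
    s_mu = (x - mu) / var
    s_var = -0.5 / var + (x - mu) ** 2 / (2.0 * var**2)
    return s_mu, s_var


def monte_carlo_fisher(mu: float, var: float, n: int = 200_000, seed: int = 0):
    """Estimate the 2x2 matrix E[s s^T] from n samples drawn from N(mu, var)."""
    rng = random.Random(seed)
    std = var**0.5
    acc = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n):
        s = gaussian_scores(rng.gauss(mu, std), mu, var)
        for i in range(2):
            for j in range(2):
                acc[i][j] += s[i] * s[j]
    return [[a / n for a in row] for row in acc]


fisher = monte_carlo_fisher(mu=1.0, var=4.0)
# With sigma^2 = 4 the closed form is [[1/4, 0], [0, 1/32]];
# the estimate should be near that, up to sampling noise.
print(fisher)
```

The off-diagonal estimate hovers around zero because the product of the two score components is an odd function of $x - \mu$, which averages out under a symmetric density.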

Categorical Distribution

$p(x = k \mid \boldsymbol{\pi}) = \pi_k$ for $k = 1, \ldots, K$, with $\sum_k \pi_k = 1$.

Treating $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_{K-1})$ as free parameters (with $\pi_K = 1 - \sum_{k=1}^{K-1} \pi_k$):

$$\log p(x = k) = \log \pi_k$$

Score vector $s \in \mathbb{R}^{K-1}$ has entries:

$$s_j = \begin{cases} \frac{1}{\pi_j} & \text{if } x = j \\ -\frac{1}{\pi_K} & \text{if } x = K \\ 0 & \text{otherwise} \end{cases}$$

The Fisher matrix is:

$$I_{ij}(\boldsymbol{\pi}) = \frac{\delta_{ij}}{\pi_i} + \frac{1}{\pi_K}$$

Every entry shares the common term $1/\pi_K$, which comes from the dependent probability $\pi_K$; the diagonal entries add $1/\pi_i$ on top. This matrix appears in the natural gradient of softmax classifiers.
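The closed form can be checked against the definition $\mathbb{E}[s s^T]$ by enumerating all $K$ outcomes. This is a small sketch with an arbitrary three-class example; `categorical_fisher` is an illustrative name, and classes are indexed from 0 in the code, so the last list element plays the role of $\pi_K$.

```python
def categorical_fisher(pi: list[float]) -> list[list[float]]:
    """Exact E[s s^T] for the first K-1 probabilities of a K-class categorical."""
    K = len(pi)
    d = K - 1
    fisher = [[0.0] * d for _ in range(d)]
    for k in range(K):  # enumerate outcomes x = k (0-based)
        # Score entries: 1/pi_j if x = j, -1/pi_K if x = K, else 0.
        if k < d:
            score = [1.0 / pi[k] if j == k else 0.0 for j in range(d)]
        else:
            score = [-1.0 / pi[-1]] * d
        for i in range(d):
            for j in range(d):
                fisher[i][j] += pi[k] * score[i] * score[j]
    return fisher


pi = [0.2, 0.3, 0.5]
exact = categorical_fisher(pi)
# Closed form: delta_ij / pi_i + 1 / pi_K.
formula = [[(1.0 / pi[i] if i == j else 0.0) + 1.0 / pi[-1] for j in range(2)]
           for i in range(2)]
print(exact)
print(formula)
```

The shared $1/\pi_K$ term arises entirely from the outcome $x = K$, where every score entry equals $-1/\pi_K$.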

Cramér-Rao Lower Bound

For any unbiased estimator $\hat{\theta}$ of $\theta$, the Cramér-Rao bound states:

$$\text{Cov}(\hat{\theta}) \succeq I(\theta)^{-1}$$

In the scalar case: $\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$.

⚠️ CRITICAL: This is a lower bound — no unbiased estimator can beat it. The MLE asymptotically achieves this bound (it's efficient). The Fisher information quantifies the fundamental limit of how well you can estimate $\theta$ from data.

Proof sketch (scalar): For unbiased $\hat{\theta}$: $\mathbb{E}[\hat{\theta} - \theta] = 0 \implies \int (\hat{\theta} - \theta)p \, dx = 0$. Differentiating w.r.t $\theta$: $\int (\hat{\theta} - \theta)\frac{\partial p}{\partial\theta} dx - \int p \, dx = 0$, so $\int (\hat{\theta} - \theta)\frac{\partial \log p}{\partial\theta} p \, dx = 1$. By Cauchy-Schwarz: $1 \leq \sqrt{\text{Var}(\hat{\theta}) \cdot I(\theta)}$, hence $\text{Var}(\hat{\theta}) \geq 1/I(\theta)$.

Fisher Information as a Riemannian Metric

This is the key insight of information geometry. The statistical manifold — the space of probability distributions $p(x \mid \theta)$ — has a natural Riemannian structure defined by the Fisher information metric.

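For nearby parameters $\theta_1$ and $\theta_2$ (with $I$ evaluated at either point, written $\theta$ below), the Fisher-Rao distance between $p(x \mid \theta_1)$ and $p(x \mid \theta_2)$ satisfies: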

$$d_F(\theta_1, \theta_2)^2 \approx (\theta_1 - \theta_2)^T I(\theta)(\theta_1 - \theta_2)$$

to second order. More fundamentally, Fisher information gives the second-order Taylor expansion of KL divergence:

$$D_{KL}(p(\cdot \mid \theta) \,\|\, p(\cdot \mid \theta + d\theta)) \approx \frac{1}{2} d\theta^T I(\theta) d\theta$$

Infinitesimally, KL divergence is half the squared Fisher-Rao distance. This connects information theory, geometry, and statistics.

Why this matters for ML: The parameter space of a model has geometry defined by $I(\theta)$. A "unit distance" in parameter space means different things depending on where you are — the Fisher metric tells you the correct local distance. This directly motivates natural gradient descent (24-02).
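To see the second-order relationship between KL divergence and Fisher information numerically, one can compare the exact KL divergence between two nearby Bernoulli distributions with $\frac{1}{2}\delta^2 I(\theta)$. This sketch assumes the Bernoulli result $I(\theta) = 1/(\theta(1-\theta))$ from earlier; the values of $\theta$ and $\delta$ are arbitrary.

```python
from math import log


def kl_bernoulli(p: float, q: float) -> float:
    """Exact KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))


def quadratic_approx(theta: float, delta: float) -> float:
    """Second-order approximation 0.5 * delta^2 * I(theta), I = 1/(theta(1-theta))."""
    return 0.5 * delta**2 / (theta * (1 - theta))


theta = 0.3
for delta in (0.1, 0.01, 0.001):
    exact = kl_bernoulli(theta, theta + delta)
    approx = quadratic_approx(theta, delta)
    # The ratio should approach 1 as delta shrinks.
    print(f"delta={delta}: exact={exact:.3e}, approx={approx:.3e}, "
          f"ratio={exact / approx:.4f}")
```

The approximation error is driven by the third-order term of the expansion, so the ratio drifts away from 1 roughly linearly in $\delta$.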

Connection to Shannon Information

$I(\theta)$ measures how much information a single sample carries about $\theta$. For $n$ i.i.d. samples, $I_n(\theta) = n I(\theta)$, so information scales linearly with data. The link to Shannon's information theory is the expansion above: Fisher information is the local curvature of the KL divergence around $\theta$.

Edge Cases and Considerations



Key Terms

Worked Examples

Example 1: Fisher Information of Exponential Distribution

Compute the Fisher information for $p(x \mid \lambda) = \lambda e^{-\lambda x}$, $x \geq 0$, $\lambda > 0$.

Solution:

$$\log p = \log \lambda - \lambda x$$

Score: $s(\lambda, x) = \frac{\partial}{\partial\lambda}(\log \lambda - \lambda x) = \frac{1}{\lambda} - x$

Check $\mathbb{E}[s] = \frac{1}{\lambda} - \mathbb{E}[x] = \frac{1}{\lambda} - \frac{1}{\lambda} = 0$ ✓

$$I(\lambda) = \mathbb{E}\left[\left(\frac{1}{\lambda} - x\right)^2\right] = \text{Var}(x) = \frac{1}{\lambda^2}$$

Alternatively via the Hessian: $\frac{\partial^2}{\partial\lambda^2}\log p = -\frac{1}{\lambda^2}$, so $I(\lambda) = -\mathbb{E}[-\frac{1}{\lambda^2}] = \frac{1}{\lambda^2}$.

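For a single observation, the Cramér-Rao bound gives $\text{Var}(\hat{\lambda}) \geq \lambda^2$; for $n$ i.i.d. observations it tightens to $\lambda^2/n$. The MLE $\hat{\lambda} = 1/\bar{x}$ is biased for finite $n$, but its asymptotic variance is $\lambda^2/n$, so it achieves the bound as $n \to \infty$.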

Answer: $I(\lambda) = 1/\lambda^2$. For $n$ i.i.d. samples, $I_n(\lambda) = n/\lambda^2$. Cramér-Rao: any unbiased estimator of $\lambda$ has variance at least $\lambda^2/n$.
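A simulation gives a feel for how close the MLE comes to the bound at a moderate sample size. The sketch below is illustrative: the rate, sample size, number of trials, and seed are arbitrary, and `exponential_mle_samples` is a made-up helper. It uses `random.expovariate`, whose argument is the rate $\lambda$.

```python
import random


def exponential_mle_samples(lam: float, n: int, trials: int, seed: int = 0) -> list[float]:
    """Draw `trials` datasets of size n from Exp(lam); return the MLE of each."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        total = sum(rng.expovariate(lam) for _ in range(n))
        estimates.append(n / total)  # MLE is 1 / x_bar = n / sum(x)
    return estimates


lam, n, trials = 2.0, 200, 4000
est = exponential_mle_samples(lam, n, trials)
m = sum(est) / trials
var = sum((e - m) ** 2 for e in est) / (trials - 1)
print(f"mean of MLE     = {m:.4f} (true lambda = {lam})")
print(f"n * Var(MLE)    = {n * var:.3f}")
print(f"lambda^2 (1/I)  = {lam**2:.3f}")
```

At this sample size the scaled variance should land close to $\lambda^2$, with a small upward bias in the mean of the estimates that shrinks as $n$ grows.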

Example 2: Fisher Matrix for Multivariate Gaussian (Known Covariance)

For $p(\mathbf{x} \mid \boldsymbol{\mu}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma)$ with known $\Sigma$, compute the Fisher information matrix.

Solution:

$$\log p = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})$$

Score: $\nabla_{\boldsymbol{\mu}} \log p = \Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})$

The Fisher matrix:

$$I(\boldsymbol{\mu}) = \mathbb{E}\left[\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1}\right] = \Sigma^{-1} \, \mathbb{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] \, \Sigma^{-1} = \Sigma^{-1} \Sigma \Sigma^{-1} = \Sigma^{-1}$$

The Fisher information is the inverse covariance (precision) matrix. This makes sense: the less noise there is along a direction, the more a single observation reveals about the mean in that direction.

Answer: $I(\boldsymbol{\mu}) = \Sigma^{-1}$. If $\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_d^2)$, then $I_{ii} = 1/\sigma_i^2$, so the diagonal entries match the scalar Gaussian case. If variables are perfectly correlated, $\Sigma$ is singular and $\Sigma^{-1}$ does not exist: the information diverges along the noise-free direction, because that linear combination of the means is observed exactly.

Example 3: Fisher Information and the Cramér-Rao Bound

Suppose $X_1, \ldots, X_n \sim \text{Bernoulli}(\theta)$. The MLE is $\hat{\theta} = \bar{X} = \frac{1}{n}\sum_i X_i$. Verify that (a) $\hat{\theta}$ is unbiased, (b) its variance achieves the Cramér-Rao bound asymptotically.

Solution:

(a) $\mathbb{E}[\hat{\theta}] = \frac{1}{n} \sum_i \mathbb{E}[X_i] = \frac{n\theta}{n} = \theta$ ✓

(b) $\text{Var}(\hat{\theta}) = \text{Var}\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n^2} \cdot n\theta(1-\theta) = \frac{\theta(1-\theta)}{n}$

From earlier: $I_1(\theta) = \frac{1}{\theta(1-\theta)}$, so $I_n(\theta) = \frac{n}{\theta(1-\theta)}$.

Cramér-Rao: $\text{Var}(\hat{\theta}) \geq \frac{1}{I_n(\theta)} = \frac{\theta(1-\theta)}{n}$

The MLE achieves the bound exactly — it is efficient for the Bernoulli family. The variance $\theta(1-\theta)/n$ is worst (largest) at $\theta = 0.5$ and best (smallest) as $\theta \to 0$ or $1$, matching the Fisher information pattern.

Answer: The MLE is unbiased with variance $\theta(1-\theta)/n$, which equals the Cramér-Rao bound $1/I_n(\theta)$. The MLE is efficient: no unbiased estimator can have lower variance. For $n=100$ and $\theta=0.5$, the standard error is $\sqrt{0.25/100} = 0.05$.
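Because $n\bar{X}$ follows a Binomial$(n, \theta)$ distribution, the variance of the sample mean can be computed exactly and compared with the bound. The sketch below does this by summing over the binomial pmf; the function names are illustrative.

```python
from math import comb


def sample_mean_variance(n: int, theta: float) -> float:
    """Exact Var(X_bar) for n Bernoulli(theta) draws, via the Binomial(n, theta) pmf."""
    pmf = [comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)]
    mean = sum(p * k / n for k, p in enumerate(pmf))
    return sum(p * (k / n - mean) ** 2 for k, p in enumerate(pmf))


def cramer_rao_bound(n: int, theta: float) -> float:
    """1 / I_n(theta) with I_n(theta) = n / (theta * (1 - theta))."""
    return theta * (1 - theta) / n


for theta in (0.1, 0.5, 0.9):
    # The two columns should match up to floating-point rounding.
    print(theta, sample_mean_variance(100, theta), cramer_rao_bound(100, theta))
```

Exact equality at every finite $n$ is special to cases like this one; in general the MLE only reaches the bound asymptotically.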

Practice Problems

Problem 1: Compute the Fisher information for the Poisson distribution $p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$, $x \in \{0, 1, 2, \ldots\}$.

Answer: $\log p = x\log\lambda - \lambda - \log x!$, so the score is $s(\lambda, x) = \frac{x}{\lambda} - 1$. Check: $\mathbb{E}[s] = \frac{\lambda}{\lambda} - 1 = 0$. Then $I(\lambda) = \mathbb{E}\left[\left(\frac{x}{\lambda} - 1\right)^2\right] = \frac{\mathbb{E}[(x-\lambda)^2]}{\lambda^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}$. Alternatively: $\frac{\partial^2}{\partial\lambda^2}\log p = -\frac{x}{\lambda^2}$, so $I(\lambda) = -\mathbb{E}[-\frac{x}{\lambda^2}] = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}$. Cramér-Rao for $n$ i.i.d. samples: $\text{Var}(\hat{\lambda}) \geq \lambda/n$. The MLE $\hat{\lambda} = \bar{x}$ achieves this.

Problem 2: Show that if $\theta$ is reparameterized as $\phi = g(\theta)$ where $g$ is a smooth bijection, the Fisher information transforms as $I(\phi) = (g'(\theta))^{-2} I(\theta)$ (scalar case).

Answer: By the chain rule: $\frac{\partial}{\partial\phi}\log p = \frac{\partial\theta}{\partial\phi} \cdot \frac{\partial}{\partial\theta}\log p = \frac{1}{g'(\theta)} s(\theta, x)$. Therefore $I(\phi) = \mathbb{E}\left[\left(\frac{1}{g'(\theta)} s(\theta, x)\right)^2\right] = \frac{1}{(g'(\theta))^2} I(\theta)$. This is exactly how a Riemannian metric tensor transforms under a coordinate change, so Fisher information is a proper tensor. For the multivariate case: $I(\phi) = J^{-T} I(\theta) J^{-1}$ where $J = \frac{\partial\phi}{\partial\theta}$.
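The transformation rule can be tested on a concrete reparameterization. The sketch below assumes the Bernoulli family with the logit map $\phi = g(\theta) = \log\frac{\theta}{1-\theta}$, chosen here only as a convenient example; it computes $I(\phi)$ once directly from the score in $\phi$ and once via $I(\theta)/g'(\theta)^2$.

```python
from math import exp, log


def fisher_theta(theta: float) -> float:
    """Bernoulli Fisher information in the mean parameterization."""
    return 1.0 / (theta * (1.0 - theta))


def fisher_phi_direct(phi: float) -> float:
    """E[(d/dphi log p)^2] for Bernoulli with phi = log(theta / (1 - theta))."""
    theta = 1.0 / (1.0 + exp(-phi))
    # In the logit parameterization, d/dphi log p(x | phi) = x - theta.
    return theta * (1 - theta) ** 2 + (1 - theta) * (0 - theta) ** 2


def fisher_phi_transformed(theta: float) -> float:
    """Transformation rule I(phi) = I(theta) / g'(theta)^2 with g = logit."""
    g_prime = 1.0 / (theta * (1.0 - theta))
    return fisher_theta(theta) / g_prime**2


theta = 0.3
phi = log(theta / (1 - theta))
# Both routes should give theta * (1 - theta).
print(fisher_phi_direct(phi), fisher_phi_transformed(theta), theta * (1 - theta))
```

Note how the information behaves oppositely in the two coordinate systems: $I(\theta)$ grows without bound near $\theta \to 0$ or $1$, while $I(\phi)$ shrinks to zero there.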

Problem 3: For the bivariate Gaussian $\mathcal{N}\left(\begin{pmatrix}\mu_1 \\ \mu_2\end{pmatrix}, \begin{pmatrix}\sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2\end{pmatrix}\right)$ with known $\sigma^2$, compute the Fisher matrix for $\boldsymbol{\mu} = (\mu_1, \mu_2)^T$ in terms of $\rho$.

Answer: $I(\boldsymbol{\mu}) = \Sigma^{-1}$ where $\Sigma = \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, so $\Sigma^{-1} = \frac{1}{\sigma^2(1-\rho^2)} \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}$. Its eigenvalues are $\frac{1}{\sigma^2(1+\rho)}$ along $(1, 1)^T$ and $\frac{1}{\sigma^2(1-\rho)}$ along $(1, -1)^T$. As $\rho \to 1$, the information diverges along $(1, -1)^T$, because $x_1 - x_2$ becomes noise-free and pins down $\mu_1 - \mu_2$ exactly, while the information about $\mu_1 + \mu_2$ stays finite. (For $\rho \to -1$ the two directions swap roles.) In the limit $\Sigma$ is singular, so $\Sigma^{-1}$ does not exist: the Fisher matrix blows up rather than becoming singular.
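The limiting behaviour can be read off numerically from the closed form for $\Sigma^{-1}$ in the answer above. The sketch below evaluates the quadratic form $v^T I v$ along the sum direction $(1, 1)$ and the difference direction $(1, -1)$ as $\rho$ approaches 1; the helper names and the choice $\sigma^2 = 1$ are illustrative.

```python
def fisher_bivariate(sigma2: float, rho: float) -> list[list[float]]:
    """Sigma^{-1} for Sigma = sigma2 * [[1, rho], [rho, 1]]; requires |rho| < 1."""
    scale = 1.0 / (sigma2 * (1.0 - rho**2))
    return [[scale, -rho * scale], [-rho * scale, scale]]


def information_along(fisher, v) -> float:
    """Quadratic form v^T I v, normalized so that v is treated as a unit vector."""
    norm2 = v[0] ** 2 + v[1] ** 2
    quad = sum(v[i] * fisher[i][j] * v[j] for i in range(2) for j in range(2))
    return quad / norm2


for rho in (0.0, 0.9, 0.99, 0.999):
    fisher = fisher_bivariate(1.0, rho)
    along_sum = information_along(fisher, (1, 1))
    along_diff = information_along(fisher, (1, -1))
    print(f"rho={rho}: sum direction={along_sum:.4f}, difference direction={along_diff:.1f}")
```

With $\sigma^2 = 1$ the two columns follow $1/(1+\rho)$ and $1/(1-\rho)$ respectively, so only one of them grows without bound.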

Problem 4: Prove that Fisher information is additive for independent observations: if $X_1, \ldots, X_n$ are i.i.d., then $I_n(\theta) = n I_1(\theta)$.

Answer: By independence: $\log p(x_1, \ldots, x_n \mid \theta) = \sum_{i=1}^n \log p(x_i \mid \theta)$. Score: $s_n = \sum_{i=1}^n s_1(\theta, x_i)$ where $s_1$ is the single-sample score. Because the $x_i$ are independent, so are the per-sample scores, and each has mean zero. Hence $I_n(\theta) = \mathbb{E}\left[\left(\sum_i s_1(x_i)\right)\left(\sum_j s_1(x_j)\right)^T\right] = \sum_{i,j} \mathbb{E}[s_1(x_i) s_1(x_j)^T] = \sum_i \mathbb{E}[s_1(x_i) s_1(x_i)^T] = n I_1(\theta)$. The cross-terms vanish because, for $i \neq j$, $\mathbb{E}[s_1(x_i) s_1(x_j)^T] = \mathbb{E}[s_1(x_i)] \, \mathbb{E}[s_1(x_j)]^T = \mathbf{0}$. This is why standard errors scale as $1/\sqrt{n}$.

Problem 5: A neural network classifier outputs a categorical distribution over $K$ classes via softmax. Show that the per-example Fisher matrix $I(\boldsymbol{\pi})$ is singular (rank $K-1$) for the raw softmax probabilities, and explain what this implies about natural gradient on the logits.

Answer: Since $\sum_k \pi_k = 1$, the probabilities are not independent. The Fisher matrix for $\boldsymbol{\pi}$ has rank $K-1$ because the constraint removes one degree of freedom. However, when parameterized by logits $\mathbf{z}$ (where $\pi_k = e^{z_k}/\sum_j e^{z_j}$), the Fisher matrix for $\mathbf{z}$ is: $$I(\mathbf{z}) = \text{diag}(\boldsymbol{\pi}) - \boldsymbol{\pi}\boldsymbol{\pi}^T$$ This $K \times K$ matrix is also rank $K-1$ (singular) but is PSD. For natural gradient in practice, we either (a) add a damping term $\lambda \mathbf{I}$, where $\mathbf{I}$ is the identity matrix, (b) use the Moore-Penrose pseudoinverse, or (c) work in a reduced $(K-1)$-dimensional parameterization. The singularity reflects the fact that adding a constant to all logits doesn't change the probabilities: softmax is invariant to translation in logit space.
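The null direction of the logit Fisher matrix is easy to exhibit in code. The sketch below builds $\text{diag}(\boldsymbol{\pi}) - \boldsymbol{\pi}\boldsymbol{\pi}^T$ for an arbitrary set of logits and checks that every row sums to zero, which is the same as saying the all-ones vector is mapped to zero. The `damped` helper illustrates option (a) with a hypothetical damping value.

```python
from math import exp


def softmax(z: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(z)
    e = [exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def logit_fisher(z: list[float]) -> list[list[float]]:
    """Per-example Fisher matrix diag(pi) - pi pi^T with respect to the logits z."""
    pi = softmax(z)
    K = len(pi)
    return [[(pi[i] if i == j else 0.0) - pi[i] * pi[j] for j in range(K)]
            for i in range(K)]


def damped(fisher: list[list[float]], lam: float) -> list[list[float]]:
    """Add lam to the diagonal so the matrix becomes invertible."""
    K = len(fisher)
    return [[fisher[i][j] + (lam if i == j else 0.0) for j in range(K)]
            for i in range(K)]


F = logit_fisher([2.0, 0.5, -1.0])
row_sums = [sum(row) for row in F]
# Each row sum is zero up to rounding: the all-ones vector lies in the null space.
print(row_sums)
print([sum(row) for row in damped(F, 0.1)])  # damping lifts that eigenvalue to lam
```

Damping leaves the other eigen-directions almost untouched when $\lambda$ is small relative to the nonzero eigenvalues, which is why it is a common practical choice.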

Summary

Key takeaways:


Quiz

Question 1: What is the expected value of the score function $s(\theta, x) = \nabla_\theta \log p(x \mid \theta)$?

A. $I(\theta)$ B. $\theta$ C. $\mathbf{0}$ D. It depends on the parametrization

Correct Answer: C. $\mathbf{0}$

Explanation: The score has zero expectation: $\mathbb{E}[s] = \int \nabla_\theta p \, dx = \nabla_\theta \int p \, dx = \nabla_\theta 1 = \mathbf{0}$. This holds for any smooth parameterization, provided differentiation and integration can be interchanged. Option A is the second moment (covariance) of $s$, not the mean. Option D is wrong because the zero-mean property does not depend on the parameterization.


Question 2: Which of the following is NOT a valid expression for the Fisher information matrix?

A. $\mathbb{E}[s(\theta) s(\theta)^T]$ B. $-\mathbb{E}[\nabla_\theta^2 \log p(x \mid \theta)]$ C. $\mathbb{E}[\nabla_\theta^2 p(x \mid \theta)]$ D. The covariance matrix of the score function

Correct Answer: C. $\mathbb{E}[\nabla_\theta^2 p(x \mid \theta)]$

Explanation: Options A, B, and D are all equivalent definitions of $I(\theta)$. Option C is the expected Hessian of $p$ itself, not $\log p$. We proved earlier that $\mathbb{E}[\nabla_\theta^2 p] = \mathbf{0}$ from differentiating $\int p = 1$, so this is the zero matrix, not the Fisher information.


Question 3: For $n$ i.i.d. observations from a Bernoulli distribution with $\theta = 0.5$, what is the Cramér-Rao lower bound on the variance of any unbiased estimator?

A. $1/n$ B. $4/n$ C. $1/(4n)$ D. $n/4$

Correct Answer: C. $1/(4n)$

Explanation: $I_1(\theta) = \frac{1}{\theta(1-\theta)} = \frac{1}{0.25} = 4$. For $n$ samples, $I_n = 4n$. The Cramér-Rao bound is $1/I_n = 1/(4n)$. The MLE $\bar{X}$ has variance $\theta(1-\theta)/n = 0.25/n = 1/(4n)$, achieving the bound. Option A drops the factor $\theta(1-\theta) = 1/4$, B is $I_1/n$ rather than $1/(n I_1)$, and D is $n\theta(1-\theta)$, the variance of the sum $\sum_i X_i$ rather than of the mean.


Question 4: Why is the Fisher information matrix always positive semidefinite?

A. Because it equals $-\mathbb{E}[\nabla^2 \log p]$ B. Because it's a covariance matrix C. Because it appears in the Cramér-Rao bound D. Only if the model is well-specified

Correct Answer: B. Because it's a covariance matrix

Explanation: $I(\theta) = \mathbb{E}[s s^T] - \mathbb{E}[s]\mathbb{E}[s]^T = \mathbb{E}[s s^T]$ since $\mathbb{E}[s] = 0$. Any covariance matrix is PSD by construction: for any vector $\mathbf{v}$, $\mathbf{v}^T I \mathbf{v} = \mathbb{E}[(\mathbf{v}^T s)^2] \geq 0$. Option A is equivalent but doesn't explain why it's PSD. Option C is a consequence, not a cause. Option D is false — PSD holds for any model.


Question 5: Which statement about the Fisher-Rao metric is correct?

A. It measures Euclidean distance in parameter space B. It gives the KL divergence between any two distributions exactly C. It provides a second-order approximation to KL divergence for nearby distributions D. It is unrelated to KL divergence

Correct Answer: C. It provides a second-order approximation to KL divergence for nearby distributions

Explanation: $D_{KL}(p_\theta \,\|\, p_{\theta+d\theta}) \approx \frac{1}{2} d\theta^T I(\theta) d\theta$ is the second-order Taylor expansion, so the Fisher metric is the infinitesimal form of KL divergence. It is NOT Euclidean (A is wrong): the geometry is curved. It is NOT exact for arbitrary distances (B is wrong): the approximation holds only for nearby distributions. And it is deeply related to KL (D is wrong).


Question 6: For a Gaussian $\mathcal{N}(\mu, \sigma^2)$ with both parameters unknown, the Fisher matrix is $\text{diag}(1/\sigma^2, 1/(2\sigma^4))$. What does the zero off-diagonal entry mean?

A. The model is misspecified B. $\mu$ and $\sigma^2$ are orthogonal parameters in the Fisher-Rao sense C. The Fisher matrix is singular D. There is no relationship between $\mu$ and $\sigma^2$

Correct Answer: B. $\mu$ and $\sigma^2$ are orthogonal parameters in the Fisher-Rao sense

Explanation: Orthogonal parameters mean that information about $\mu$ and $\sigma^2$ is "decoupled" — the score components are uncorrelated. This does NOT mean $\mu$ and $\sigma^2$ are unrelated (option D) — they appear together in the likelihood. It means that estimating $\mu$ and estimating $\sigma^2$ are locally independent tasks. The matrix $\text{diag}(1/\sigma^2, 1/(2\sigma^4))$ is full rank, so C is wrong.


Pitfalls

  1. Confusing the score function with Fisher information: The score $s(\theta, x) = \nabla_\theta \log p(x \mid \theta)$ is a vector-valued function of both $\theta$ and data $x$. Fisher information $I(\theta)$ is the covariance matrix of the score. A single evaluation of the score is not Fisher information — Fisher is the expected outer product (or negative expected Hessian).

  2. Forgetting that empirical Fisher is not true Fisher: Using training labels to compute $\hat{I} = \frac{1}{B}\sum s(y_i)s(y_i)^T$ gives the empirical Fisher under the data distribution, not the model distribution. The true Fisher requires expectation under $p(x \mid \theta)$. This matters particularly early in training when the model doesn't fit the data — the empirical Fisher can give misleading curvature estimates.

  3. Treating singular Fisher as an error: In overparameterized models (especially neural networks with redundant parameters), $I(\theta)$ is often singular. This is expected, not a bug: the singularity reflects true parameter redundancy (e.g., softmax invariance to logit shifts). Don't try to invert a singular Fisher directly; use the damped matrix $I(\theta) + \lambda \mathbf{I}$ (with $\mathbf{I}$ the identity matrix and $\lambda > 0$) or the Moore-Penrose pseudoinverse.

  4. Misapplying the Cramér-Rao bound to biased estimators: The standard Cramér-Rao bound $\text{Var}(\hat{\theta}) \geq 1/I(\theta)$ applies only to unbiased estimators. For biased estimators, you need the information inequality: $\text{MSE}(\hat{\theta}) \geq (1 + b'(\theta))^2 / I(\theta) + b(\theta)^2$. Modern ML estimators are often biased (due to regularization), making the simple bound inapplicable.
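Pitfall 2 can be made concrete with a one-parameter model. The sketch below is a simplified illustration, not a recipe for neural networks: it assumes a Bernoulli model with parameter $\theta$ and labels that actually come from Bernoulli$(q)$, and it computes the population (infinite-batch) version of the empirical Fisher. The names `true_fisher` and `empirical_fisher` are hypothetical.

```python
def true_fisher(theta: float) -> float:
    """Expectation of s^2 under the model Bernoulli(theta)."""
    return 1.0 / (theta * (1.0 - theta))


def empirical_fisher(theta: float, q: float) -> float:
    """Expectation of s^2 when labels come from Bernoulli(q) instead of the model."""
    s1 = 1.0 / theta           # score at y = 1
    s0 = -1.0 / (1.0 - theta)  # score at y = 0
    return q * s1**2 + (1.0 - q) * s0**2


theta = 0.2
for q in (0.2, 0.5, 0.8):
    # The two quantities coincide only when the data distribution matches the model.
    print(f"q={q}: true={true_fisher(theta):.4f}, empirical={empirical_fisher(theta, q):.4f}")
```

The gap widens as the label distribution moves away from the model, which mirrors the situation early in training when the model fits the data poorly.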


Next Steps

Next up: 24-02 — Natural Gradient Descent — where you'll use the Fisher information matrix as a preconditioner to take gradient steps that are invariant to parameterization, fundamentally improving optimization in probability space.