
24-01 — Fisher Information

Phase: 24 — Information Geometry & Advanced Theory
Subject: 24-01
Prerequisites: Phase 11 (Information Theory — entropy, KL divergence), Phase 12-13 (Statistics — likelihood, estimators)
Next subject: 24-02 — Natural Gradient Descent

Learning Objectives

By the end of this subject, you will be able to:

Define the score function and Fisher information matrix from first principles
Compute Fisher information for common distributions (Bernoulli, Gaussian, categorical)
State and apply the Cramér-Rao lower bound to bound estimator variance
Understand the Fisher information matrix as a Riemannian metric on the statistical manifold
Relate Fisher information to KL divergence via the second-order Taylor expansion

Core Content

The Score Function

Consider a parametric family of probability distributions $p(x \mid \theta)$ where $\theta \in \mathbb{R}^d$ is the parameter vector. The score function $s(\theta, x)$ is the gradient of the log-likelihood with respect to the parameters:

$$s(\theta, x) = \nabla_\theta \log p(x \mid \theta)$$

Key property: The score has zero expectation under the true distribution:

$$\mathbb{E}_{x \sim p(x \mid \theta)}[s(\theta, x)] = \int p(x \mid \theta) \, \nabla_\theta \log p(x \mid \theta) \, dx = \nabla_\theta \int p(x \mid \theta) \, dx = \nabla_\theta 1 = \mathbf{0}$$

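This follows from differentiating $\int p = 1$ under the integral sign, which is valid under the usual regularity conditions (for example, the support of $p$ must not depend on $\theta$). The score tells you how sensitive the log-likelihood is to changes in $\theta$: a steep log-likelihood means the data are highly informative about the parameter.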

Fisher Information Matrix

The Fisher information matrix $I(\theta)$ is the covariance of the score function:

$$I(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)}\left[s(\theta, x) \, s(\theta, x)^T\right]$$

Since $\mathbb{E}[s] = \mathbf{0}$, the covariance equals the second moment. Equivalently, it's the negative expected Hessian of the log-likelihood:

$$I(\theta) = -\mathbb{E}_{x \sim p(x \mid \theta)}\left[\nabla_\theta^2 \log p(x \mid \theta)\right]$$

Proof of equivalence:

Start with the identity $\int p(x \mid \theta) \, dx = 1$. Differentiate twice:

$$\nabla_\theta^2 \int p(x \mid \theta) \, dx = \int \nabla_\theta^2 p(x \mid \theta) \, dx = \mathbf{0}$$

Using $\nabla_\theta^2 \log p = \frac{\nabla_\theta^2 p}{p} - \frac{(\nabla_\theta p)(\nabla_\theta p)^T}{p^2} = \frac{\nabla_\theta^2 p}{p} - s s^T$:

$$\mathbb{E}[\nabla_\theta^2 \log p] = \int p \cdot \left(\frac{\nabla_\theta^2 p}{p} - s s^T\right) dx = \int \nabla_\theta^2 p \, dx - \mathbb{E}[s s^T] = \mathbf{0} - I(\theta)$$

Thus $I(\theta) = -\mathbb{E}[\nabla_\theta^2 \log p]$.
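The equivalence can be checked numerically on a family that is easy to sum exactly. The sketch below is only an illustration: it assumes a geometric distribution $p(x \mid \theta) = (1-\theta)^x \theta$ on $x = 0, 1, 2, \ldots$ (a family not used elsewhere in this subject), with the score and second derivative worked out by hand in the comments, and it truncates the infinite sum at a hypothetical cutoff `x_max`.

```python
import numpy as np

def geometric_fisher_checks(theta, x_max=5000):
    """Truncated-sum checks for p(x | theta) = (1 - theta)**x * theta, x = 0, 1, 2, ..."""
    x = np.arange(x_max + 1)
    # pmf evaluated in log space; the tail beyond x_max is negligible for moderate theta
    pmf = np.exp(x * np.log1p(-theta) + np.log(theta))
    score = 1.0 / theta - x / (1.0 - theta)             # d/dtheta log p
    hessian = -1.0 / theta**2 - x / (1.0 - theta) ** 2  # d^2/dtheta^2 log p
    mean_score = np.sum(pmf * score)        # should be ~0
    outer_form = np.sum(pmf * score**2)     # E[s^2]
    hessian_form = -np.sum(pmf * hessian)   # -E[d^2 log p]
    return mean_score, outer_form, hessian_form

theta = 0.3
mean_score, outer_form, hessian_form = geometric_fisher_checks(theta)
closed_form = 1.0 / (theta**2 * (1.0 - theta))  # hand-derived value for this family
print(mean_score, outer_form, hessian_form, closed_form)
```

Both forms should agree with each other and with the hand-derived value up to floating-point error, and the mean score should be numerically zero.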

⚠️ CRITICAL: $I(\theta)$ is always positive semidefinite (PSD), because it is a covariance matrix. A Riemannian metric must be positive definite, which holds for a well-parameterized model in which every parameter direction changes the distribution. Where parameters are redundant, $I(\theta)$ is only PSD and the metric degenerates.

Computing Fisher Information for Common Distributions

Bernoulli Distribution

$p(x \mid \theta) = \theta^x (1-\theta)^{1-x}$ for $x \in \{0, 1\}$, $\theta \in (0, 1)$.

$$\log p = x \log \theta + (1-x) \log(1-\theta)$$

Score: $s(\theta, x) = \frac{\partial}{\partial\theta} \log p = \frac{x}{\theta} - \frac{1-x}{1-\theta} = \frac{x - \theta}{\theta(1-\theta)}$

Fisher information (scalar):

$$I(\theta) = \mathbb{E}\left[\left(\frac{x-\theta}{\theta(1-\theta)}\right)^2\right] = \frac{\mathbb{E}[(x-\theta)^2]}{\theta^2(1-\theta)^2} = \frac{\theta(1-\theta)}{\theta^2(1-\theta)^2} = \frac{1}{\theta(1-\theta)}$$

Note: $I(\theta) \to \infty$ as $\theta \to 0$ or $\theta \to 1$ — extreme probabilities are the easiest to estimate precisely because there's almost no randomness.
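As a quick sanity check of the closed form, the expectation can be computed exactly by summing over the two outcomes. This is a minimal sketch; the function name `bernoulli_fisher` is just an illustrative choice.

```python
import numpy as np

def bernoulli_fisher(theta):
    """Fisher information of Bernoulli(theta) by exact expectation over x in {0, 1}."""
    xs = np.array([0.0, 1.0])
    probs = np.array([1.0 - theta, theta])
    score = (xs - theta) / (theta * (1.0 - theta))
    return float(np.sum(probs * score**2))

# Compare the enumerated value with 1 / (theta * (1 - theta)) at a few points
for theta in (0.5, 0.1, 0.01):
    print(theta, bernoulli_fisher(theta), 1.0 / (theta * (1.0 - theta)))
```

The printed values grow as $\theta$ moves toward the boundary, in line with the note above.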

Gaussian with Known Variance

$p(x \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, with $\sigma^2$ fixed.

$$\log p = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$$

Score: $s(\mu, x) = \frac{\partial}{\partial\mu}\log p = \frac{x-\mu}{\sigma^2}$

$I(\mu) = \mathbb{E}\left[\left(\frac{x-\mu}{\sigma^2}\right)^2\right] = \frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2}$

The Fisher information is constant — every observation is equally informative about $\mu$ regardless of $\mu$'s value.

Gaussian with Both Parameters

For $\theta = (\mu, \sigma^2)^T$:

$$\log p = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$$

The Fisher matrix is $2 \times 2$:

$$I(\mu, \sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}$$

The off-diagonal zero means $\mu$ and $\sigma^2$ are orthogonal parameters — information about one doesn't tell you about the other.
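The $2 \times 2$ matrix can be reproduced numerically from the definition $\mathbb{E}[s s^T]$. The sketch below assumes the score components $s_\mu = (x-\mu)/\sigma^2$ and $s_{\sigma^2} = -\frac{1}{2\sigma^2} + \frac{(x-\mu)^2}{2\sigma^4}$ (obtained by differentiating the log-density above) and uses Gauss-Hermite quadrature from NumPy to take the expectation; the parameter values are arbitrary.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_fisher(mu, sigma2, n_nodes=20):
    """Fisher matrix for theta = (mu, sigma^2), via Gauss-Hermite quadrature."""
    z, w = hermegauss(n_nodes)        # nodes/weights for the weight exp(-z^2 / 2)
    w = w / np.sqrt(2.0 * np.pi)      # rescale so the weights sum to 1 (standard normal)
    x = mu + np.sqrt(sigma2) * z
    s_mu = (x - mu) / sigma2
    s_var = -0.5 / sigma2 + (x - mu) ** 2 / (2.0 * sigma2**2)
    scores = np.stack([s_mu, s_var], axis=1)            # shape (n_nodes, 2)
    return np.einsum("n,ni,nj->ij", w, scores, scores)  # weighted sum of outer products

fisher = gaussian_fisher(mu=1.0, sigma2=2.0)
expected = np.diag([1.0 / 2.0, 1.0 / (2.0 * 2.0**2)])
print(fisher)
```

Because the score products are low-degree polynomials in $x$, the quadrature is exact up to rounding, so the off-diagonal entries come out as numerical zeros.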

Categorical Distribution

$p(x = k \mid \boldsymbol{\pi}) = \pi_k$ for $k = 1, \ldots, K$, with $\sum_k \pi_k = 1$.

Treating $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_{K-1})$ as free parameters (with $\pi_K = 1 - \sum_{k=1}^{K-1} \pi_k$):

$$\log p(x = k) = \log \pi_k$$

Score vector $s \in \mathbb{R}^{K-1}$ has entries:

$$s_j = \begin{cases} \frac{1}{\pi_j} & \text{if } x = j \\ -\frac{1}{\pi_K} & \text{if } x = K \\ 0 & \text{otherwise} \end{cases}$$

The Fisher matrix is:

$$I_{ij}(\boldsymbol{\pi}) = \frac{\delta_{ij}}{\pi_i} + \frac{1}{\pi_K}$$

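Equivalently, $I(\boldsymbol{\pi}) = \text{diag}(1/\pi_1, \ldots, 1/\pi_{K-1}) + \frac{1}{\pi_K}\mathbf{1}\mathbf{1}^T$: every entry shares the common term $1/\pi_K$, and the diagonal entries add $1/\pi_i$. This matrix appears in the natural gradient of softmax classifiers.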
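The categorical Fisher matrix can be built directly from the score vector by summing over the $K$ outcomes. The following sketch uses 0-based indexing and an arbitrary example probability vector; `categorical_fisher` is an illustrative name.

```python
import numpy as np

def categorical_fisher(pi_full):
    """Fisher matrix for the free parameters (pi_1, ..., pi_{K-1}) by exact expectation."""
    pi_full = np.asarray(pi_full, dtype=float)
    K = pi_full.size
    fisher = np.zeros((K - 1, K - 1))
    for k in range(K):                       # outcome x = k (0-based), probability pi_full[k]
        score = np.zeros(K - 1)
        if k < K - 1:
            score[k] = 1.0 / pi_full[k]      # d/dpi_k log pi_k
        else:
            score[:] = -1.0 / pi_full[-1]    # last class: pi_K = 1 - sum of the free parameters
        fisher += pi_full[k] * np.outer(score, score)
    return fisher

pi = np.array([0.2, 0.3, 0.5])
fisher = categorical_fisher(pi)
closed_form = np.diag(1.0 / pi[:-1]) + 1.0 / pi[-1]   # delta_ij / pi_i + 1 / pi_K
print(fisher)
```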

Cramér-Rao Lower Bound

For any unbiased estimator $\hat{\theta}$ of $\theta$, the Cramér-Rao bound states:

$$\text{Cov}(\hat{\theta}) \succeq I(\theta)^{-1}$$

In the scalar case: $\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$.

⚠️ CRITICAL: This is a lower bound — no unbiased estimator can beat it. The MLE asymptotically achieves this bound (it's efficient). The Fisher information quantifies the fundamental limit of how well you can estimate $\theta$ from data.

Proof sketch (scalar): For unbiased $\hat{\theta}$: $\mathbb{E}[\hat{\theta} - \theta] = 0 \implies \int (\hat{\theta} - \theta)p \, dx = 0$. Differentiating w.r.t $\theta$: $\int (\hat{\theta} - \theta)\frac{\partial p}{\partial\theta} dx - \int p \, dx = 0$, so $\int (\hat{\theta} - \theta)\frac{\partial \log p}{\partial\theta} p \, dx = 1$. By Cauchy-Schwarz: $1 \leq \sqrt{\text{Var}(\hat{\theta}) \cdot I(\theta)}$, hence $\text{Var}(\hat{\theta}) \geq 1/I(\theta)$.
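A small simulation can make the bound concrete. The sketch below assumes the Gaussian-mean setting from earlier ($I(\mu) = 1/\sigma^2$ per observation, so the bound for $n$ samples is $\sigma^2/n$) and uses the sample mean as the unbiased estimator; the seed, sample size, and number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, trials = 1.5, 4.0, 25, 20000

# Each row is one dataset of n observations; each row mean is one estimate of mu
samples = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=(trials, n))
estimates = samples.mean(axis=1)

empirical_var = estimates.var(ddof=1)   # variance of the estimator across repeated datasets
cr_bound = sigma2 / n                   # 1 / (n * I(mu)) with I(mu) = 1 / sigma^2
print(empirical_var, cr_bound)
```

Up to Monte Carlo noise, the empirical variance should sit at the bound rather than below it, since the sample mean is efficient for this family.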

Fisher Information as a Riemannian Metric

This is the key insight of information geometry. The statistical manifold — the space of probability distributions $p(x \mid \theta)$ — has a natural Riemannian structure defined by the Fisher information metric.

For nearby parameters $\theta_1$ and $\theta_2$, the Fisher-Rao distance between $p(x \mid \theta_1)$ and $p(x \mid \theta_2)$ satisfies, with $I(\theta)$ evaluated at either point:

$$d_F(\theta_1, \theta_2)^2 \approx (\theta_1 - \theta_2)^T I(\theta)(\theta_1 - \theta_2)$$

to second order. More fundamentally, Fisher information gives the second-order Taylor expansion of KL divergence:

$$D_{KL}(p(\cdot \mid \theta) \,\|\, p(\cdot \mid \theta + d\theta)) \approx \frac{1}{2} d\theta^T I(\theta) d\theta$$

Infinitesimally, KL divergence is half the squared Fisher-Rao distance. This connects information theory, geometry, and statistics.
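The quality of the quadratic approximation can be inspected numerically. This sketch assumes the Bernoulli family from earlier, where both the KL divergence and $I(\theta) = 1/(\theta(1-\theta))$ are available in closed form, and compares them for shrinking step sizes.

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

theta = 0.3
fisher = 1.0 / (theta * (1.0 - theta))
ratios = {}
for delta in (1e-1, 1e-2, 1e-3):
    kl = kl_bernoulli(theta, theta + delta)
    quad = 0.5 * fisher * delta**2      # second-order approximation
    ratios[delta] = kl / quad
    print(delta, kl, quad, ratios[delta])
```

The ratio of exact KL to the quadratic form should approach 1 as the step shrinks, with the remaining gap coming from third-order terms.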

Why this matters for ML: The parameter space of a model has geometry defined by $I(\theta)$. A "unit distance" in parameter space means different things depending on where you are — the Fisher metric tells you the correct local distance. This directly motivates natural gradient descent (24-02).

Connection to Shannon Information

Fisher information measures how much information a single sample carries about $\theta$. For $n$ i.i.d. samples, $I_n(\theta) = n I(\theta)$, so information scales linearly with data. The link to the Shannon-style quantities from Phase 11 is the expansion above: the KL divergence between nearby members of the family is, to second order, the quadratic form defined by $I(\theta)$.

Edge Cases and Considerations

Singular Fisher: When parameters are redundant (e.g., overparameterized neural networks), $I(\theta)$ becomes singular and cannot be inverted. This is why natural gradient methods add damping and invert $I(\theta) + \lambda \mathbf{I}_d$ instead, where $\lambda > 0$ and $\mathbf{I}_d$ is the $d \times d$ identity matrix.
Empirical Fisher: In practice, we replace $\mathbb{E}_{x \sim p}$ with the empirical expectation over a batch: $\hat{I}(\theta) = \frac{1}{B} \sum_{i=1}^B s(\theta, x_i) s(\theta, x_i)^T$. The empirical Fisher is always PSD by construction.
Observed vs. Expected Fisher: The observed Fisher uses $-\nabla_\theta^2 \log p(x \mid \theta)$ at the observed data, without expectation. The expected Fisher averages over $p(x \mid \theta)$. They coincide asymptotically but differ in finite samples.
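To illustrate the singular case and the effect of damping, the sketch below uses a deliberately redundant toy model that is not part of this subject's examples: a Bernoulli with $\theta = \text{sigmoid}(a + b)$, so the two parameters $a$ and $b$ only matter through their sum. The damping value is an arbitrary assumption.

```python
import numpy as np

def redundant_bernoulli_fisher(a, b):
    """Fisher matrix for Bernoulli(sigmoid(a + b)) in the redundant parameters (a, b)."""
    theta = 1.0 / (1.0 + np.exp(-(a + b)))
    fisher = np.zeros((2, 2))
    for x, prob in ((0.0, 1.0 - theta), (1.0, theta)):
        score = (x - theta) * np.ones(2)   # d/da log p = d/db log p = x - theta
        fisher += prob * np.outer(score, score)
    return fisher

fisher = redundant_bernoulli_fisher(0.4, -0.1)
rank = np.linalg.matrix_rank(fisher)       # 1: only the direction a + b is identifiable
damping = 1e-3
damped = fisher + damping * np.eye(2)      # I(theta) + lambda * identity
damped_inverse = np.linalg.inv(damped)     # well defined once damping is added
print(rank, np.linalg.eigvalsh(fisher))
```

The undamped matrix has one zero eigenvalue, along the direction that leaves $a + b$ unchanged, while the damped matrix is positive definite and can be inverted.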

Key Terms

Fisher information
Fisher information is always PSD
Fisher information matrix
Fisher-Rao distance
Riemannian metric

Worked Examples

Example 1: Fisher Information of Exponential Distribution

Compute the Fisher information for $p(x \mid \lambda) = \lambda e^{-\lambda x}$, $x \geq 0$, $\lambda > 0$.

Solution:

$$\log p = \log \lambda - \lambda x$$

Score: $s(\lambda, x) = \frac{\partial}{\partial\lambda}(\log \lambda - \lambda x) = \frac{1}{\lambda} - x$

Check $\mathbb{E}[s] = \frac{1}{\lambda} - \mathbb{E}[x] = \frac{1}{\lambda} - \frac{1}{\lambda} = 0$ ✓

$$I(\lambda) = \mathbb{E}\left[\left(\frac{1}{\lambda} - x\right)^2\right] = \text{Var}(x) = \frac{1}{\lambda^2}$$

Alternatively via the Hessian: $\frac{\partial^2}{\partial\lambda^2}\log p = -\frac{1}{\lambda^2}$, so $I(\lambda) = -\mathbb{E}[-\frac{1}{\lambda^2}] = \frac{1}{\lambda^2}$.

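For a single observation, the Cramér-Rao bound gives $\text{Var}(\hat{\lambda}) \geq \lambda^2$ for unbiased estimators; for $n$ i.i.d. observations the bound is $\lambda^2/n$. The MLE $\hat{\lambda} = 1/\bar{x}$ is biased in finite samples, but its asymptotic variance is $\lambda^2/n$, so it attains the bound as $n \to \infty$.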

Answer:

$I(\lambda) = 1/\lambda^2$. For $n$ i.i.d. samples, $I_n(\lambda) = n/\lambda^2$. Cramér-Rao: any unbiased estimator of $\lambda$ has variance at least $\lambda^2/n$.
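The asymptotic statement about the MLE can be probed with a simulation. This is a rough sketch with an arbitrary seed, rate, and sample size; note that NumPy's exponential sampler is parameterized by the scale $1/\lambda$, not the rate.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, trials = 2.0, 200, 20000

x = rng.exponential(scale=1.0 / lam, size=(trials, n))  # scale = 1 / rate
mle = 1.0 / x.mean(axis=1)                              # one MLE per simulated dataset

scaled_var = n * mle.var(ddof=1)   # should be close to lam**2 for large n
print(scaled_var, lam**2)
```

At finite $n$ the scaled variance is expected to sit slightly above $\lambda^2$, with the gap shrinking as $n$ grows.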

Example 2: Fisher Matrix for Multivariate Gaussian (Known Covariance)

For $p(\mathbf{x} \mid \boldsymbol{\mu}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma)$ with known $\Sigma$, compute the Fisher information matrix.

Solution:

$$\log p = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})$$

Score: $\nabla_{\boldsymbol{\mu}} \log p = \Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})$

The Fisher matrix:

$$I(\boldsymbol{\mu}) = \mathbb{E}\left[\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1}\right] = \Sigma^{-1} \, \mathbb{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] \, \Sigma^{-1} = \Sigma^{-1} \Sigma \Sigma^{-1} = \Sigma^{-1}$$

The Fisher information is the inverse covariance (precision) matrix. This makes sense: directions in which the data vary little carry the most information about the mean.

Answer:

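$I(\boldsymbol{\mu}) = \Sigma^{-1}$. If $\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_d^2)$, then $I_{ii} = 1/\sigma_i^2$, so the diagonal entries match the scalar Gaussian case. If variables approach perfect correlation, $\Sigma$ approaches a singular matrix and $\Sigma^{-1}$ diverges along the degenerate direction: the data pin down that combination of means almost exactly. The Cramér-Rao bound for each individual mean, $[I(\boldsymbol{\mu})^{-1}]_{ii} = \Sigma_{ii}$, is unaffected by the correlation.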
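One way to see how correlation shapes $\Sigma^{-1}$ is to compute it for a small example and look at its eigenvalues. The sketch below assumes a $2 \times 2$ covariance with equal variances $\sigma^2$ and correlation $\rho$, chosen purely for illustration.

```python
import numpy as np

def gaussian_mean_fisher(sigma2, rho):
    """Return (Fisher matrix, covariance) for a 2-D Gaussian mean with known covariance."""
    cov = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
    return np.linalg.inv(cov), cov

for rho in (0.0, 0.5, 0.9, 0.99):
    fisher, cov = gaussian_mean_fisher(1.0, rho)
    eigenvalues = np.linalg.eigvalsh(fisher)      # ascending order
    cr_bounds = np.diag(np.linalg.inv(fisher))    # per-mean Cramér-Rao bounds
    print(rho, eigenvalues, cr_bounds)
```

For this covariance the eigenvalues of the Fisher matrix are $1/(\sigma^2(1+\rho))$ and $1/(\sigma^2(1-\rho))$, so one of them grows without bound as $\rho \to 1$ while the per-mean bounds stay at $\sigma^2$.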

Example 3: Fisher Information and the Cramér-Rao Bound

Suppose $X_1, \ldots, X_n \sim \text{Bernoulli}(\theta)$. The MLE is $\hat{\theta} = \bar{X} = \frac{1}{n}\sum_i X_i$. Verify that (a) $\hat{\theta}$ is unbiased, (b) its variance achieves the Cramér-Rao bound asymptotically.

Solution:

(a) $\mathbb{E}[\hat{\theta}] = \frac{1}{n} \sum_i \mathbb{E}[X_i] = \frac{n\theta}{n} = \theta$ ✓

(b) $\text{Var}(\hat{\theta}) = \text{Var}\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n^2} \cdot n\theta(1-\theta) = \frac{\theta(1-\theta)}{n}$

From earlier: $I_1(\theta) = \frac{1}{\theta(1-\theta)}$, so $I_n(\theta) = \frac{n}{\theta(1-\theta)}$.

Cramér-Rao: $\text{Var}(\hat{\theta}) \geq \frac{1}{I_n(\theta)} = \frac{\theta(1-\theta)}{n}$

The MLE achieves the bound exactly — it is efficient for the Bernoulli family. The variance $\theta(1-\theta)/n$ is worst (largest) at $\theta = 0.5$ and best (smallest) as $\theta \to 0$ or $1$, matching the Fisher information pattern.

Answer:

MLE is unbiased with variance $\theta(1-\theta)/n$, which equals the Cramér-Rao bound $1/I_n(\theta)$. The MLE is efficient — no unbiased estimator can have lower variance. For $n=100$ and $\theta=0.5$, the standard error is $\sqrt{0.25/100} = 0.05$.
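Because $n\bar{X}$ is Binomial$(n, \theta)$, the mean and variance of the MLE can be computed exactly by enumeration rather than simulation. The following is a minimal sketch of that check; `mle_variance_exact` is an illustrative helper name.

```python
from math import comb

def mle_variance_exact(n, theta):
    """Exact mean and variance of the sample mean of n Bernoulli(theta) draws."""
    mean = 0.0
    second = 0.0
    for k in range(n + 1):                       # k = number of successes
        prob = comb(n, k) * theta**k * (1.0 - theta) ** (n - k)
        mean += prob * (k / n)
        second += prob * (k / n) ** 2
    return mean, second - mean**2

n, theta = 100, 0.5
mean, variance = mle_variance_exact(n, theta)
cr_bound = theta * (1.0 - theta) / n             # 1 / I_n(theta)
print(mean, variance, cr_bound, variance**0.5)
```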

Practice Problems

Problem 1: Compute the Fisher information for the Poisson distribution $p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$, $x \in \{0, 1, 2, \ldots\}$.

Answer:

$\log p = x\log\lambda - \lambda - \log x!$, so $s(\lambda, x) = \frac{x}{\lambda} - 1$. Verify: $\mathbb{E}[s] = \frac{\lambda}{\lambda} - 1 = 0$. Then $I(\lambda) = \mathbb{E}\left[\left(\frac{x}{\lambda} - 1\right)^2\right] = \frac{\mathbb{E}[(x-\lambda)^2]}{\lambda^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}$. Alternatively: $\frac{\partial^2}{\partial\lambda^2}\log p = -\frac{x}{\lambda^2}$, so $I(\lambda) = -\mathbb{E}\left[-\frac{x}{\lambda^2}\right] = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}$. Cramér-Rao: $\text{Var}(\hat{\lambda}) \geq \lambda/n$. The MLE $\hat{\lambda} = \bar{x}$ achieves this.

Problem 2: Show that if $\theta$ is reparameterized as $\phi = g(\theta)$ where $g$ is a smooth bijection, the Fisher information transforms as $I(\phi) = (g'(\theta))^{-2} I(\theta)$ (scalar case).

Answer:

By chain rule: $\frac{\partial}{\partial\phi}\log p = \frac{\partial\theta}{\partial\phi} \cdot \frac{\partial}{\partial\theta}\log p = \frac{1}{g'(\theta)} s(\theta, x)$. $I(\phi) = \mathbb{E}\left[\left(\frac{1}{g'(\theta)} s(\theta)\right)^2\right] = \frac{1}{(g'(\theta))^2} I(\theta)$. This is exactly how a Riemannian metric tensor transforms under coordinate change — Fisher information is a proper tensor. For multivariate: $I(\phi) = J^{-T} I(\theta) J^{-1}$ where $J = \frac{\partial\phi}{\partial\theta}$.
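The transformation rule can be checked on a concrete reparameterization. The sketch below assumes the Bernoulli family with the logit map $\phi = g(\theta) = \log\frac{\theta}{1-\theta}$, for which $g'(\theta) = \frac{1}{\theta(1-\theta)}$ and the score in $\phi$ simplifies to $x - \theta$; it compares the direct computation with the rule.

```python
import numpy as np

def fisher_theta(theta):
    """Bernoulli Fisher information in the original parameter theta."""
    return 1.0 / (theta * (1.0 - theta))

def fisher_logit_direct(theta):
    """E[(d/dphi log p)^2] with phi = logit(theta); the score in phi is x - theta."""
    xs = np.array([0.0, 1.0])
    probs = np.array([1.0 - theta, theta])
    return float(np.sum(probs * (xs - theta) ** 2))

def fisher_logit_via_rule(theta):
    """Apply I(phi) = I(theta) / g'(theta)^2 for the logit map."""
    g_prime = 1.0 / (theta * (1.0 - theta))
    return fisher_theta(theta) / g_prime**2

for theta in (0.2, 0.5, 0.9):
    print(theta, fisher_logit_direct(theta), fisher_logit_via_rule(theta))
```

Both routes give $\theta(1-\theta)$: in logit coordinates the information is largest at $\theta = 0.5$, the reverse of the pattern in the original parameter.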

Problem 3: For the bivariate Gaussian $\mathcal{N}\left(\begin{pmatrix}\mu_1 \\ \mu_2\end{pmatrix}, \begin{pmatrix}\sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2\end{pmatrix}\right)$ with known $\sigma^2$, compute the Fisher matrix for $\boldsymbol{\mu} = (\mu_1, \mu_2)^T$ in terms of $\rho$.

Answer:

$I(\boldsymbol{\mu}) = \Sigma^{-1}$ where $\Sigma = \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$. $\Sigma^{-1} = \frac{1}{\sigma^2(1-\rho^2)} \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}$. Its eigenvalues are $\frac{1}{\sigma^2(1-\rho)}$ along $(1, -1)^T$ and $\frac{1}{\sigma^2(1+\rho)}$ along $(1, 1)^T$. As $\rho \to 1$, the first diverges while the second tends to $\frac{1}{2\sigma^2}$ (the roles swap as $\rho \to -1$). With perfect positive correlation, $x_1 - x_2 = \mu_1 - \mu_2$ holds exactly, so a single observation determines the difference of the means; $\Sigma$ itself is singular in that limit, and the Fisher matrix diverges rather than becoming singular.

Problem 4: Prove that Fisher information is additive for independent observations: if $X_1, \ldots, X_n$ are i.i.d., then $I_n(\theta) = n I_1(\theta)$.

Answer:

By independence: $\log p(x_1, \ldots, x_n \mid \theta) = \sum_{i=1}^n \log p(x_i \mid \theta)$. Score: $s_n = \sum_{i=1}^n s_1(\theta, x_i)$ where $s_1$ is the single-sample score. Because the samples are independent, so are their scores, and since $\mathbb{E}[s_1] = 0$ every cross-term factorizes to zero: $\mathbb{E}[s_1(x_i) s_1(x_j)^T] = \mathbb{E}[s_1(x_i)] \, \mathbb{E}[s_1(x_j)]^T = \mathbf{0}$ for $i \neq j$. Hence $I_n(\theta) = \mathbb{E}\left[\left(\sum_i s_1(x_i)\right)\left(\sum_j s_1(x_j)\right)^T\right] = \sum_{i,j} \mathbb{E}[s_1(x_i) s_1(x_j)^T] = \sum_i \mathbb{E}[s_1(x_i) s_1(x_i)^T] = n I_1(\theta)$. This is why standard errors scale as $1/\sqrt{n}$.

Problem 5: A neural network classifier outputs a categorical distribution over $K$ classes via softmax. Show that the per-example Fisher matrix $I(\boldsymbol{\pi})$ is singular (rank $K-1$) for the raw softmax probabilities, and explain what this implies about natural gradient on the logits.

Answer:

Since $\sum_k \pi_k = 1$, the probabilities are not independent. The Fisher matrix for $\boldsymbol{\pi}$ has rank $K-1$ because the constraint removes one degree of freedom. However, when parameterized by logits $\mathbf{z}$ (where $\pi_k = e^{z_k}/\sum_j e^{z_j}$), the Fisher matrix for $\mathbf{z}$ is: $$I(\mathbf{z}) = \text{diag}(\boldsymbol{\pi}) - \boldsymbol{\pi}\boldsymbol{\pi}^T$$ This $K \times K$ matrix is also rank $K-1$ (singular) but is PSD. For natural gradient in practice, we either (a) use a damping term $\lambda I$, (b) use the Moore-Penrose pseudoinverse, or (c) work in a reduced $(K-1)$-dimensional parameterization. The singular nature reflects the fact that adding a constant to all logits doesn't change the probabilities — softmax is invariant to translation in logit space.
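The rank deficiency of the logit-space Fisher matrix is easy to confirm numerically. This sketch assumes an arbitrary logit vector with $K = 4$ and uses the fact that $\nabla_{\mathbf{z}} \log \pi_k = \mathbf{e}_k - \boldsymbol{\pi}$ for softmax; the helper names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)      # shift for numerical stability; softmax is unchanged by it
    e = np.exp(z)
    return e / e.sum()

def logit_fisher(z):
    """Fisher matrix w.r.t. the logits, by exact expectation over the K classes."""
    pi = softmax(z)
    K = pi.size
    fisher = np.zeros((K, K))
    for k in range(K):
        score = np.eye(K)[k] - pi     # gradient of log pi_k with respect to z
        fisher += pi[k] * np.outer(score, score)
    return fisher, pi

z = np.array([1.0, -0.5, 0.3, 2.0])
fisher, pi = logit_fisher(z)
rank = np.linalg.matrix_rank(fisher)
pseudo_inverse = np.linalg.pinv(fisher)   # one of the options listed above
print(rank, fisher @ np.ones(z.size))
```

The all-ones vector lies in the null space, matching the invariance of softmax to adding a constant to every logit.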

Summary

Key takeaways:

The score function $s(\theta, x) = \nabla_\theta \log p(x \mid \theta)$ has zero mean and its covariance is the Fisher information $I(\theta)$
Fisher information quantifies how much a random observation reveals about $\theta$ — it's the fundamental limit of estimation accuracy
The Cramér-Rao bound: $\text{Var}(\hat{\theta}) \geq 1/I(\theta)$ for unbiased estimators; MLEs asymptotically achieve this
$I(\theta)$ is the Riemannian metric on the statistical manifold: the KL divergence between nearby distributions is approximately $\frac{1}{2}d\theta^T I(\theta) d\theta$
Fisher information is always PSD, but can be singular in overparameterized models — this motivates damping in natural gradient methods

Quiz

Question 1: What is the expected value of the score function $s(\theta, x) = \nabla_\theta \log p(x \mid \theta)$?

A. $I(\theta)$
B. $\theta$
C. $\mathbf{0}$
D. It depends on the parametrization

Correct Answer: C. $\mathbf{0}$

Explanation: The score always has zero expectation: $\mathbb{E}[s] = \int \nabla_\theta p \, dx = \nabla_\theta \int p \, dx = \nabla_\theta 1 = \mathbf{0}$. This holds for any valid distribution regardless of parametrization. Option A is the second moment (covariance) of $s$, not the mean. Option D is wrong because the zero-mean property is parameterization-invariant.

Question 2: Which of the following is NOT a valid expression for the Fisher information matrix?

A. $\mathbb{E}[s(\theta) s(\theta)^T]$
B. $-\mathbb{E}[\nabla_\theta^2 \log p(x \mid \theta)]$
C. $\mathbb{E}[\nabla_\theta^2 p(x \mid \theta)]$
D. The covariance matrix of the score function

Correct Answer: C. $\mathbb{E}[\nabla_\theta^2 p(x \mid \theta)]$

Explanation: Options A, B, and D are all equivalent definitions of $I(\theta)$. Option C is the expected Hessian of $p$ itself, not $\log p$. We proved earlier that $\mathbb{E}[\nabla_\theta^2 p] = \mathbf{0}$ from differentiating $\int p = 1$, so this is the zero matrix, not the Fisher information.

Question 3: For $n$ i.i.d. observations from a Bernoulli distribution with $\theta = 0.5$, what is the Cramér-Rao lower bound on the variance of any unbiased estimator?

A. $1/n$
B. $4/n$
C. $1/(4n)$
D. $n/4$

Correct Answer: C. $1/(4n)$

Explanation: $I_1(\theta) = \frac{1}{\theta(1-\theta)} = \frac{1}{0.25} = 4$. For $n$ samples, $I_n = 4n$. The Cramér-Rao bound is $1/I_n = 1/(4n)$. The MLE $\bar{X}$ has variance $\theta(1-\theta)/n = 0.25/n = 1/(4n)$, achieving the bound. Option A ($1/n$) drops the factor $\theta(1-\theta) = 1/4$, B ($4/n$) is $I_1/n$ rather than $1/(n I_1)$, and D ($n/4$) grows with $n$ instead of shrinking.

Question 4: Why is the Fisher information matrix always positive semidefinite?

A. Because it equals $-\mathbb{E}[\nabla^2 \log p]$
B. Because it's a covariance matrix
C. Because it appears in the Cramér-Rao bound
D. Only if the model is well-specified

Correct Answer: B. Because it's a covariance matrix

Explanation: $I(\theta) = \mathbb{E}[s s^T] - \mathbb{E}[s]\mathbb{E}[s]^T = \mathbb{E}[s s^T]$ since $\mathbb{E}[s] = 0$. Any covariance matrix is PSD by construction: for any vector $\mathbf{v}$, $\mathbf{v}^T I \mathbf{v} = \mathbb{E}[(\mathbf{v}^T s)^2] \geq 0$. Option A is equivalent but doesn't explain why it's PSD. Option C is a consequence, not a cause. Option D is false — PSD holds for any model.

Question 5: Which statement about the Fisher-Rao metric is correct?

A. It measures Euclidean distance in parameter space
B. It gives the KL divergence between any two distributions exactly
C. It provides a second-order approximation to KL divergence for nearby distributions
D. It is unrelated to KL divergence

Correct Answer: C. It provides a second-order approximation to KL divergence for nearby distributions

Explanation: $D_{KL}(p_\theta \,\|\, p_{\theta+d\theta}) \approx \frac{1}{2} d\theta^T I(\theta) d\theta$ is the second-order Taylor expansion, so the Fisher metric is the infinitesimal form of KL divergence. It is NOT Euclidean (A is wrong): the geometry is curved. It is NOT exact for arbitrary distances (B is wrong): the quadratic form is only a local approximation. And it is deeply related to KL (D is wrong).

Question 6: For a Gaussian $\mathcal{N}(\mu, \sigma^2)$ with both parameters unknown, the Fisher matrix is $\text{diag}(1/\sigma^2, 1/(2\sigma^4))$. What does the zero off-diagonal entry mean?

A. The model is misspecified
B. $\mu$ and $\sigma^2$ are orthogonal parameters in the Fisher-Rao sense
C. The Fisher matrix is singular
D. There is no relationship between $\mu$ and $\sigma^2$

Correct Answer: B. $\mu$ and $\sigma^2$ are orthogonal parameters in the Fisher-Rao sense

Explanation: Orthogonal parameters mean that information about $\mu$ and $\sigma^2$ is "decoupled" — the score components are uncorrelated. This does NOT mean $\mu$ and $\sigma^2$ are unrelated (option D) — they appear together in the likelihood. It means that estimating $\mu$ and estimating $\sigma^2$ are locally independent tasks. The matrix $\text{diag}(1/\sigma^2, 1/(2\sigma^4))$ is full rank, so C is wrong.

Pitfalls

Confusing the score function with Fisher information: The score $s(\theta, x) = \nabla_\theta \log p(x \mid \theta)$ is a vector-valued function of both $\theta$ and data $x$. Fisher information $I(\theta)$ is the covariance matrix of the score. A single evaluation of the score is not Fisher information — Fisher is the expected outer product (or negative expected Hessian).
Forgetting that empirical Fisher is not true Fisher: Using training labels to compute $\hat{I} = \frac{1}{B}\sum s(y_i)s(y_i)^T$ gives the empirical Fisher under the data distribution, not the model distribution. The true Fisher requires expectation under $p(x \mid \theta)$. This matters particularly early in training when the model doesn't fit the data — the empirical Fisher can give misleading curvature estimates.
Treating singular Fisher as an error: In overparameterized models (especially neural networks with redundant parameters), $I(\theta)$ is often singular. This is expected, not a bug: the singularity reflects true parameter redundancy (e.g., softmax invariance to logit shifts). Don't try to invert a singular Fisher; use the damped matrix $I(\theta) + \lambda \mathbf{I}_d$ (with $\mathbf{I}_d$ the identity matrix) or the Moore-Penrose pseudoinverse.
Misapplying the Cramér-Rao bound to biased estimators: The standard Cramér-Rao bound $\text{Var}(\hat{\theta}) \geq 1/I(\theta)$ applies only to unbiased estimators. For biased estimators, you need the information inequality: $\text{MSE}(\hat{\theta}) \geq (1 + b'(\theta))^2 / I(\theta) + b(\theta)^2$. Modern ML estimators are often biased (due to regularization), making the simple bound inapplicable.

Next Steps

Next up: 24-02 — Natural Gradient Descent — where you'll use the Fisher information matrix as a preconditioner to take gradient steps that are invariant to parameterization, fundamentally improving optimization in probability space.
