### 12.4 — Maximum Likelihood Estimation (MLE)

Phase: Statistics

Prerequisites: 12-03-point-estimation, 04-07-applications-of-derivatives, 04-08-optimization

Learning Objectives

By the end of this subject, you will be able to:

  1. Define the likelihood function and log-likelihood
  2. Derive MLEs for common distributions by solving the score equation
  3. Apply the invariance property of MLEs
  4. Interpret the Fisher information and asymptotic normality of MLEs
  5. Compute MLEs when a closed form exists and recognise when numerical methods are needed

Core Content

⚠️ CRITICAL: The Likelihood Function

The likelihood is the probability (or density) of observing the data, viewed as a function of the parameter(s):

$$\mathcal{L}(\theta) = P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)$$

For independent observations: $$\mathcal{L}(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

KEY DISTINCTION: The likelihood is a function of $\theta$ with the data fixed — this is the OPPOSITE of the probability distribution, which is a function of the data with $\theta$ fixed.

The Log-Likelihood

Since products become unwieldy and log is monotonic: $$\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

Maximising $\ell(\theta)$ is equivalent to maximising $\mathcal{L}(\theta)$.
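A small numerical sketch of why the log form is preferred in practice. It assumes a hypothetical sample of 2,000 standard-normal draws (the sample, seed, and function name are illustrative, not part of the lesson) and compares the raw product of densities with the sum of log-densities.

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

random.seed(0)  # fixed seed so the illustration is reproducible
data = [random.gauss(0.0, 1.0) for _ in range(2000)]  # hypothetical sample

# Raw likelihood: a product of 2000 densities, each below 0.4.
likelihood = 1.0
for x in data:
    likelihood *= normal_pdf(x)

# Log-likelihood: the same information expressed as a sum of logs.
log_likelihood = sum(math.log(normal_pdf(x)) for x in data)

print(likelihood)      # underflows to 0.0 in double precision
print(log_likelihood)  # a finite negative number
```

The product is too small to represent as a float, while the sum of logs stays well within range, so the argmax can still be located.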

Finding the MLE

  1. Write the likelihood $\mathcal{L}(\theta)$
  2. Take the log: $\ell(\theta) = \log \mathcal{L}(\theta)$
  3. Differentiate: find the score function $S(\theta) = \frac{d\ell}{d\theta}$
  4. Set $S(\theta) = 0$ (the score equation) and solve for $\hat{\theta}_{\text{MLE}}$
  5. Verify it's a maximum (second derivative negative)

🚩 Common Pitfall: Forgetting to check the second derivative. Setting $S(\theta) = 0$ finds critical points, but some may be minima or saddle points.
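When the score equation has no closed-form solution, steps 4 and 5 are usually carried out numerically. The sketch below applies Newton-Raphson to the Bernoulli score (derived in Example 1), purely so the result can be compared with the known answer $k/n$. The function names, starting point, and tolerance are illustrative assumptions, and Newton-Raphson is one possible choice of root-finder rather than a prescribed method.

```python
def score(p, k, n):
    """Bernoulli score S(p) = k/p - (n - k)/(1 - p)."""
    return k / p - (n - k) / (1.0 - p)

def score_derivative(p, k, n):
    """Second derivative of the Bernoulli log-likelihood at p."""
    return -k / p**2 - (n - k) / (1.0 - p) ** 2

def newton_mle(k, n, p0=0.5, tol=1e-12, max_iter=50):
    """Solve S(p) = 0 by Newton-Raphson, starting from p0."""
    p = p0
    for _ in range(max_iter):
        step = score(p, k, n) / score_derivative(p, k, n)
        p -= step
        if abs(step) < tol:
            break
    return p

p_hat = newton_mle(k=63, n=100)
# Step 5: the second derivative must be negative at the solution.
is_maximum = score_derivative(p_hat, 63, 100) < 0
print(p_hat, is_maximum)  # p_hat is close to 0.63
```

This sketch does no bounds checking: from a poor starting point the iterates can leave the interval $(0, 1)$, so a practical implementation would need safeguards.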

Properties of MLEs

  1. Invariance: If $\hat{\theta}_{\text{MLE}}$ is the MLE of $\theta$, then $g(\hat{\theta}_{\text{MLE}})$ is the MLE of $g(\theta)$ for any function $g$. (For example, if $\hat{\sigma}^2$ is the MLE of $\sigma^2$, then the MLE of $\sigma$ is $\sqrt{\hat{\sigma}^2}$.)

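  2. Asymptotic normality: For large $n$: $$\hat{\theta}_{\text{MLE}} \approx N\left(\theta, \frac{1}{n \cdot I(\theta)}\right)$$ where $I(\theta) = -E\left[\frac{d^2}{d\theta^2} \log f(X \mid \theta)\right]$ is the Fisher information of a single observation. For $n$ i.i.d. observations, the information in the full sample is $-E\left[\frac{d^2 \ell}{d\theta^2}\right] = n \cdot I(\theta)$.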

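  3. Consistency: Under standard regularity conditions, MLEs are consistent: $\hat{\theta}_{\text{MLE}}$ converges in probability to the true value of $\theta$ as $n \to \infty$.

  4. Asymptotic efficiency: Under the same regularity conditions, among consistent, asymptotically normal estimators the MLE attains the smallest possible asymptotic variance, the Cramér-Rao lower bound $\frac{1}{n \cdot I(\theta)}$, as $n \to \infty$.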
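
As an illustration of property 2, the Fisher information can be worked out for a single Bernoulli($p$) observation $X$, using $E[X] = p$. This is a sketch for one distribution, not a general derivation:

$$\log f(X \mid p) = X \log p + (1 - X)\log(1 - p)$$

$$\frac{d^2}{dp^2}\log f(X \mid p) = -\frac{X}{p^2} - \frac{1 - X}{(1 - p)^2}$$

$$I(p) = -E\left[\frac{d^2}{dp^2}\log f(X \mid p)\right] = \frac{p}{p^2} + \frac{1 - p}{(1 - p)^2} = \frac{1}{p(1 - p)}$$

So for large $n$, $\hat{p}_{\text{MLE}} \approx N\left(p, \frac{p(1 - p)}{n}\right)$, which agrees with the familiar variance of a sample proportion.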

⚠️ CRITICAL: MLEs are NOT necessarily unbiased for finite $n$. The MLE of $\sigma^2$ for a normal distribution is $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$, whose expectation is $\frac{n-1}{n}\sigma^2$, so it underestimates $\sigma^2$ on average. The bias shrinks to zero as $n \to \infty$; the asymptotic properties above are large-sample guarantees only.
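
The finite-sample bias can be seen in a quick Monte Carlo sketch. The settings below ($n = 5$, $\sigma^2 = 4$, 20,000 replications, the seed) are hypothetical choices made only for illustration. The average of the MLE over many samples should land near $\frac{n-1}{n}\sigma^2 = 3.2$ rather than $4$.

```python
import random

def mle_variance(sample):
    """MLE of sigma^2 for a normal sample: divides by n, not n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n

random.seed(1)  # fixed seed so the illustration is reproducible
n, sigma2, reps = 5, 4.0, 20000  # hypothetical settings

estimates = [
    mle_variance([random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)])
    for _ in range(reps)
]
average_estimate = sum(estimates) / reps
expected = (n - 1) / n * sigma2  # theoretical mean of the MLE: 3.2

print(average_estimate, expected)
```

Increasing `n` in this sketch moves the average towards $\sigma^2$, which is the sense in which the bias disappears asymptotically.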



Key Terms

Worked Examples

Example 1: MLE for Bernoulli parameter $p$

Data: $n$ independent Bernoulli trials, $k$ successes.

Likelihood: $\mathcal{L}(p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^k (1-p)^{n-k}$

Log-likelihood: $\ell(p) = k \log p + (n-k) \log(1-p)$

Score: $S(p) = \frac{k}{p} - \frac{n-k}{1-p}$

Set to zero: $\frac{k}{p} = \frac{n-k}{1-p}$

Cross-multiply: $k(1-p) = (n-k)p$

$k - kp = np - kp$

$k = np$

$\hat{p}_{\text{MLE}} = \frac{k}{n}$ — the sample proportion. Intuitive and correct!

Check second derivative: $\frac{d^2\ell}{dp^2} = -\frac{k}{p^2} - \frac{n-k}{(1-p)^2} < 0$ ✓ (maximum)

Example 2: MLE for normal mean $\mu$ (with $\sigma^2$ known)

$\ell(\mu) = \sum \log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right)$

$= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum(x_i - \mu)^2$

Score: $S(\mu) = \frac{1}{\sigma^2}\sum(x_i - \mu)$

Set to zero: $\sum(x_i - \mu) = 0$ → $\hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum x_i = \bar{x}$

The MLE of the normal mean is the sample mean.
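
The second-derivative check from step 5 of the procedure can be completed for this example by differentiating the score once more:

$$\frac{d^2\ell}{d\mu^2} = \frac{d}{d\mu}\left[\frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)\right] = -\frac{n}{\sigma^2} < 0$$

so $\bar{x}$ is a maximum. Because this second derivative does not depend on the data, the Fisher information of a single observation is $I(\mu) = \frac{1}{\sigma^2}$, and the asymptotic variance $\frac{1}{n \cdot I(\mu)} = \frac{\sigma^2}{n}$ coincides with the exact variance of the sample mean.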

Example 3: MLE for Poisson rate $\lambda$

For Poisson($\lambda$) data $x_1, \ldots, x_n$ (all non-negative integers):

$\mathcal{L}(\lambda) = \prod \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}$

$\ell(\lambda) = \sum\left(x_i \log \lambda - \lambda - \log(x_i!)\right)$

$= (\sum x_i)\log \lambda - n\lambda - \sum \log(x_i!)$

Score: $S(\lambda) = \frac{\sum x_i}{\lambda} - n$

Set to zero: $\frac{\sum x_i}{\lambda} = n$ → $\hat{\lambda}_{\text{MLE}} = \frac{\sum x_i}{n} = \bar{x}$
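
A short numerical check of this result, as a sketch: the counts below are hypothetical, and the function name is illustrative. The code evaluates the Poisson log-likelihood at $\bar{x}$ and at a few nearby values of $\lambda$ to confirm that $\bar{x}$ gives the largest value.

```python
import math

def poisson_log_likelihood(lam, data):
    """l(lambda) = (sum x_i) log(lambda) - n lambda - sum log(x_i!)."""
    n = len(data)
    total = sum(data)
    # log(x!) is computed as lgamma(x + 1) to avoid large factorials.
    log_factorials = sum(math.lgamma(x + 1) for x in data)
    return total * math.log(lam) - n * lam - log_factorials

data = [2, 0, 3, 1, 4, 2, 1, 3]  # hypothetical counts
lam_hat = sum(data) / len(data)  # closed-form MLE: the sample mean

at_mle = poisson_log_likelihood(lam_hat, data)
nearby = [poisson_log_likelihood(lam_hat + d, data) for d in (-0.5, -0.1, 0.1, 0.5)]
print(lam_hat, all(at_mle > value for value in nearby))  # 2.0 True
```

The same conclusion follows analytically: $\frac{d^2\ell}{d\lambda^2} = -\frac{\sum x_i}{\lambda^2} < 0$ whenever $\sum x_i > 0$, so the critical point is a maximum.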



Quiz

Q1: What does the concept of Fisher information primarily refer to in this subject?

A) A visual representation of Fisher information B) A historical anecdote about Fisher information C) A computational error related to Fisher information D) The definition and application of Fisher information

Correct: D)

Q2: Which of the following is the key formula discussed in this subject?

A) The inverse operation of the formula in question B) An unrelated formula from a different topic C) A simplified version of $\mathcal{L}(\theta) = P(X_1, \ldots)$ D) $\mathcal{L}(\theta) = P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)$

Correct: D)

Q3: What is the primary purpose of the likelihood function?

A) It measures how well each parameter value explains the observed data B) It is primarily a historical notation system C) It is used only in advanced research contexts D) It replaces all other methods in this domain

Correct: A)

Q4: Which statement about Log-likelihood is TRUE?

A) Log-likelihood is an advanced topic beyond this subject's scope B) Log-likelihood is a fundamental concept covered in this subject C) Log-likelihood is mentioned only as a historical footnote D) Log-likelihood is not related to this subject

Correct: B)

Q5: Based on the worked examples in this subject, what is the correct result?

A) A different result from a common mistake B) An unrelated numerical value C) $\bar{x}$ D) The inverse of the correct answer

Correct: C)

Q6: How are Log-likelihood and Score equation related?

A) Log-likelihood and Score equation are completely unrelated topics B) Log-likelihood is the inverse of Score equation C) Log-likelihood is a special case of Score equation D) Log-likelihood and Score equation are closely related concepts

Correct: D)

Q7: What is a common pitfall when working with the likelihood function?

A) The likelihood function is always computed the same way in all contexts B) The main error with the likelihood function is using it when it is not needed C) A common mistake is confusing the likelihood function with a similar concept, such as the probability distribution of the data D) The likelihood function has no common misconceptions

Correct: C)

Q8: When should you apply the log-likelihood?

A) Apply the log-likelihood to solve problems in this subject's domain B) Avoid the log-likelihood unless explicitly instructed C) The log-likelihood is not practically useful D) Use the log-likelihood only in pure mathematics contexts

Correct: A)

Practice Problems

  1. For $n$ i.i.d. exponential($\beta$) observations with PDF $f(x) = \frac{1}{\beta}e^{-x/\beta}$ for $x > 0$, derive the MLE for $\beta$.

    Answer: $\ell(\beta) = \sum(-\log\beta - x_i/\beta) = -n\log\beta - \frac{1}{\beta}\sum x_i$

    $S(\beta) = -\frac{n}{\beta} + \frac{\sum x_i}{\beta^2}$

    Set to zero: $\frac{\sum x_i}{\beta^2} = \frac{n}{\beta}$ → $\hat{\beta}_{\text{MLE}} = \frac{1}{n}\sum x_i = \bar{x}$

  2. A coin is flipped 100 times, yielding 63 heads. What is the MLE of the probability of heads?

    Answer: $\hat{p}_{\text{MLE}} = 63/100 = 0.63$

  3. If $\hat{\theta}$ is the MLE of $\theta$, what is the MLE of $\theta^2$?

    Answer: By the invariance property: $\hat{\theta}^2$. The MLE of any function $g(\theta)$ is simply $g(\hat{\theta})$.

  4. For the normal distribution, find the MLE of $\sigma$ given that the MLE of $\sigma^2$ is $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$.

    Answer: By invariance: $\hat{\sigma}_{\text{MLE}} = \sqrt{\hat{\sigma}^2} = \sqrt{\frac{1}{n}\sum(X_i - \bar{X})^2}$

    Note: this MLE for $\sigma$ is biased (and so is the MLE for $\sigma^2$), but it IS the MLE.

  5. Why do we maximise the log-likelihood rather than the likelihood itself?

    Answer: Three reasons: (1) The product $\prod f(x_i \mid \theta)$ is numerically unstable for large $n$: multiplying many small values underflows to zero in floating-point arithmetic. (2) The log transforms products into sums, which are easier to differentiate. (3) The log function is strictly increasing, so the argmax is unchanged.


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 12-05-moments-bayesian-estimation.md