### 12.4 — Maximum Likelihood Estimation (MLE)

Phase: Statistics

Prerequisites: 12-03-point-estimation, 04-07-applications-of-derivatives, 04-08-optimization

Learning Objectives

By the end of this subject, you will be able to:

Define the likelihood function and log-likelihood
Derive MLEs for common distributions by solving the score equation
Apply the invariance property of MLEs
Interpret the Fisher information and asymptotic normality of MLEs
Compute MLEs when a closed form exists and recognise when numerical methods are needed

Core Content

⚠️ CRITICAL: The Likelihood Function

The likelihood is the probability (or density) of observing the data, viewed as a function of the parameter(s):

$$\mathcal{L}(\theta) = P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)$$

For independent observations: $$\mathcal{L}(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

KEY DISTINCTION: The likelihood is a function of $\theta$ with the data fixed — this is the OPPOSITE of the probability distribution, which is a function of the data with $\theta$ fixed.
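
To make the distinction concrete, here is a minimal sketch that holds a small data set fixed and varies only the parameter. The data (7 successes in 10 Bernoulli trials) and the helper name `bernoulli_likelihood` are illustrative assumptions, not part of the lesson.

```python
def bernoulli_likelihood(p, data):
    """Likelihood of p for a fixed list of 0/1 outcomes (assumed independent)."""
    k = sum(data)          # number of successes
    n = len(data)          # number of trials
    return p**k * (1 - p) ** (n - k)

# Hypothetical data, fixed once: 7 successes in 10 trials.
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

# Vary the parameter, not the data.
grid = [i / 100 for i in range(1, 100)]
values = [bernoulli_likelihood(p, data) for p in grid]

# Grid point with the largest likelihood.
best_p = grid[values.index(max(values))]

# Riemann-sum approximation of the area under L(p) on (0, 1).
area = sum(values) * 0.01

print(best_p, area)
```

On this grid the largest likelihood sits at the sample proportion, and the area under the curve is far below 1, which illustrates that the likelihood is not a probability distribution over the parameter.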

The Log-Likelihood

Since products become unwieldy and log is monotonic: $$\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

Maximising $\ell(\theta)$ is equivalent to maximising $\mathcal{L}(\theta)$.
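
The practical benefit of the log scale can be seen numerically. The sketch below assumes a made-up situation in which each of 2000 observations contributes a density value of 0.01; the numbers are illustrative only.

```python
import math

# Hypothetical: 2000 observations, each with density value 0.01
# under some candidate parameter.
densities = [0.01] * 2000

# Direct product: the true value 1e-4000 is below the smallest
# positive double, so the running product underflows to 0.0.
product = 1.0
for d in densities:
    product *= d

# Sum of logs: the same information, with no underflow.
log_sum = sum(math.log(d) for d in densities)

print(product)
print(log_sum)
```

The product carries no usable information once it has underflowed, while the sum of logs remains an ordinary finite number that can still be compared across parameter values.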

Finding the MLE

Write the likelihood $\mathcal{L}(\theta)$
Take the log: $\ell(\theta) = \log \mathcal{L}(\theta)$
Differentiate: find the score function $S(\theta) = \frac{d\ell}{d\theta}$
Set $S(\theta) = 0$ (the score equation) and solve for $\hat{\theta}_{\text{MLE}}$
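Verify that it is a maximum: the second derivative must be negative at the candidate, $\frac{d^2\ell}{d\theta^2}\Big|_{\theta = \hat{\theta}_{\text{MLE}}} < 0$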

🚩 Common Pitfall: Forgetting to check the second derivative. Setting $S(\theta) = 0$ finds critical points, but some may be minima or saddle points.
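
When the score equation cannot be solved by hand, the same steps can be carried out numerically. The following is a minimal sketch of Newton-Raphson iteration on the score, not a production routine: the function name `newton_mle`, the starting value, and the data (63 successes in 100 Bernoulli trials) are assumptions chosen so the result can be checked against a known closed form.

```python
def newton_mle(score, score_prime, theta0, tol=1e-10, max_iter=50):
    """Solve score(theta) = 0 by Newton-Raphson, starting from theta0."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / score_prime(theta)
        theta -= step
        if abs(step) < tol:   # stop once the update is negligible
            break
    return theta

# Hypothetical data: k successes in n Bernoulli trials.
k, n = 63, 100

def score(p):
    # Derivative of k*log(p) + (n-k)*log(1-p)
    return k / p - (n - k) / (1 - p)

def score_prime(p):
    # Second derivative of the log-likelihood
    return -k / p**2 - (n - k) / (1 - p) ** 2

p_hat = newton_mle(score, score_prime, theta0=0.5)

# Final step of the recipe: confirm the critical point is a maximum.
is_max = score_prime(p_hat) < 0

print(p_hat, is_max)
```

Newton-Raphson is only one possible choice here; it can diverge from a poor starting value, so in practice the second-derivative check and a sanity check on the parameter range are both worth keeping.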

Properties of MLEs

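Invariance: If $\hat{\theta}_{\text{MLE}}$ is the MLE of $\theta$, then $g(\hat{\theta}_{\text{MLE}})$ is the MLE of $g(\theta)$ for any function $g$. (For example, the MLE of $\sigma$ is $\sqrt{\hat{\sigma}^2}$, where $\hat{\sigma}^2$ is the MLE of $\sigma^2$.)
Asymptotic normality: For large $n$, under standard regularity conditions: $$\hat{\theta}_{\text{MLE}} \approx N\left(\theta, \frac{1}{n \cdot I(\theta)}\right)$$ where $I(\theta) = -E\left[\frac{d^2}{d\theta^2} \log f(X \mid \theta)\right]$ is the Fisher information in a single observation. (For i.i.d. data, $-E\left[\frac{d^2 \ell}{d\theta^2}\right] = n \cdot I(\theta)$ is the information in the whole sample.)
Consistency: Under standard regularity conditions, MLEs are consistent: $\hat{\theta}_{\text{MLE}}$ converges in probability to the true value as $n \to \infty$.
Asymptotic efficiency: Among consistent, asymptotically normal estimators, the MLE's asymptotic variance attains the Cramér-Rao lower bound $\frac{1}{n \cdot I(\theta)}$ as $n \to \infty$, so no such estimator has smaller asymptotic variance.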
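
As an illustration of the asymptotic variance, here is a short worked calculation for the Bernoulli model, taking $I(p)$ to be the information in a single observation $X$ (an assumption about how the formula above is read). The log-density is

$$\log f(x \mid p) = x \log p + (1 - x)\log(1 - p)$$

Differentiating twice with respect to $p$:

$$\frac{d^2}{dp^2} \log f(x \mid p) = -\frac{x}{p^2} - \frac{1 - x}{(1 - p)^2}$$

Taking the negative expectation with $E[X] = p$:

$$I(p) = \frac{p}{p^2} + \frac{1 - p}{(1 - p)^2} = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)}$$

so the approximate variance of the MLE is

$$\frac{1}{n \cdot I(p)} = \frac{p(1 - p)}{n}$$

which agrees with the exact variance of a sample proportion.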

⚠️ CRITICAL: MLEs are NOT necessarily unbiased for finite $n$. The MLE of $\sigma^2$ for a normal distribution is $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$, whose expected value is $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$, so it underestimates $\sigma^2$ on average. The asymptotic properties only kick in for large $n$.
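
The bias can be seen in a small simulation. This is a sketch under assumed settings (sample size 5, true variance 4, 20000 replications, a fixed seed); the helper name `mle_variance` is hypothetical.

```python
import random

def mle_variance(sample):
    """MLE of the normal variance: divide by n, not n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n

random.seed(0)                      # fixed seed so the run is repeatable
n, sigma2, reps = 5, 4.0, 20000     # assumed simulation settings

estimates = [
    mle_variance([random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)])
    for _ in range(reps)
]

average = sum(estimates) / reps     # Monte Carlo estimate of E[sigma_hat^2]
expected = (n - 1) / n * sigma2     # theoretical value, ((n-1)/n) * sigma^2

print(average, expected)
```

With a small $n$ the average estimate should sit close to $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$; rerunning with a larger `n` shows the gap shrinking.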

Key Terms

Fisher information
Likelihood
Log-likelihood
Score equation

Worked Examples

Example 1: MLE for Bernoulli parameter $p$

Data: $n$ independent Bernoulli trials, $k$ successes.

Likelihood: $\mathcal{L}(p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^k (1-p)^{n-k}$

Log-likelihood: $\ell(p) = k \log p + (n-k) \log(1-p)$

Score: $S(p) = \frac{k}{p} - \frac{n-k}{1-p}$

Set to zero: $\frac{k}{p} = \frac{n-k}{1-p}$

Cross-multiply: $k(1-p) = (n-k)p$

$k - kp = np - kp$

$k = np$

$\hat{p}_{\text{MLE}} = \frac{k}{n}$ — the sample proportion. Intuitive and correct!

Check second derivative: $\frac{d^2\ell}{dp^2} = -\frac{k}{p^2} - \frac{n-k}{(1-p)^2} < 0$ ✓ (maximum)

Example 2: MLE for normal mean $\mu$ (with $\sigma^2$ known)

$\ell(\mu) = \sum \log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right)$

$= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum(x_i - \mu)^2$

Score: $S(\mu) = \frac{1}{\sigma^2}\sum(x_i - \mu)$

Set to zero: $\sum(x_i - \mu) = 0$ → $\hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum x_i = \bar{x}$

Check second derivative: $\frac{d^2\ell}{d\mu^2} = -\frac{n}{\sigma^2} < 0$ ✓ (maximum)

The MLE of the normal mean is the sample mean.
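
A quick numerical check of this result, as a sketch: the five data values and $\sigma^2 = 1$ are made up for illustration, and the function names are hypothetical.

```python
import math

def normal_loglik(mu, xs, sigma2):
    """Log-likelihood of mu for normal data with known variance sigma2."""
    n = len(xs)
    return (-n / 2 * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

def normal_score(mu, xs, sigma2):
    """Derivative of the log-likelihood with respect to mu."""
    return sum(x - mu for x in xs) / sigma2

xs = [4.1, 5.3, 4.8, 6.0, 5.2]   # hypothetical observations
sigma2 = 1.0                     # assumed known variance
x_bar = sum(xs) / len(xs)

# The score vanishes (up to rounding) at the sample mean ...
score_at_mean = normal_score(x_bar, xs, sigma2)

# ... and the log-likelihood is lower on either side of it.
ll_mean = normal_loglik(x_bar, xs, sigma2)
ll_left = normal_loglik(x_bar - 0.1, xs, sigma2)
ll_right = normal_loglik(x_bar + 0.1, xs, sigma2)

print(score_at_mean, ll_mean, ll_left, ll_right)
```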

Example 3: MLE for Poisson rate $\lambda$

For Poisson($\lambda$) data $x_1, \ldots, x_n$ (all non-negative integers):

$\mathcal{L}(\lambda) = \prod \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}$

$\ell(\lambda) = \sum(x_i \log \lambda - \lambda - \log x_i!)$

$= (\sum x_i)\log \lambda - n\lambda - \sum \log(x_i!)$

Score: $S(\lambda) = \frac{\sum x_i}{\lambda} - n$

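Set to zero: $\frac{\sum x_i}{\lambda} = n$ → $\hat{\lambda}_{\text{MLE}} = \frac{\sum x_i}{n} = \bar{x}$

Check second derivative: $\frac{d^2\ell}{d\lambda^2} = -\frac{\sum x_i}{\lambda^2} < 0$ whenever $\sum x_i > 0$ ✓ (maximum)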
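
The closed form can be cross-checked by maximising the log-likelihood directly. The sketch below uses a simple ternary search, which is one possible choice that works here because the Poisson log-likelihood has a single peak when $\sum x_i > 0$. The counts, the search interval, and the function names are illustrative assumptions.

```python
import math

def poisson_loglik(lam, xs):
    """Poisson log-likelihood; lgamma(x + 1) equals log(x!)."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

def ternary_search_max(f, lo, hi, iters=200):
    """Maximise a single-peaked function f on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1        # the peak lies to the right of m1
        else:
            hi = m2        # the peak lies to the left of m2
    return (lo + hi) / 2

xs = [2, 0, 3, 1, 4, 2, 2]   # hypothetical counts

# Assumed search interval; it must contain the sample mean.
lam_hat = ternary_search_max(lambda lam: poisson_loglik(lam, xs), 1e-6, 20.0)
x_bar = sum(xs) / len(xs)

print(lam_hat, x_bar)
```

The numerical maximiser should agree with the sample mean to several decimal places; the agreement is limited by floating-point resolution near the flat top of the curve, not by the method.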

Quiz

Q1: What does the concept of Fisher information primarily refer to in this subject?

A) A visual representation of Fisher information
B) A historical anecdote about Fisher information
C) A computational error related to Fisher information
D) The definition and application of Fisher information

Correct: D)

If you chose A: This is incorrect. This subject defines Fisher information and applies it to the asymptotic variance of the MLE; it is not a visual representation.
If you chose B: This is incorrect. This subject defines Fisher information and applies it to the asymptotic variance of the MLE; it is not a historical anecdote.
If you chose C: This is incorrect. This subject defines Fisher information and applies it to the asymptotic variance of the MLE; it is not a computational error.
If you chose D: Correct! This subject defines Fisher information and applies it to the asymptotic variance of the MLE.

Q2: Which of the following is the key formula discussed in this subject?

A) The inverse operation of the formula in question
B) An unrelated formula from a different topic
C) A simplified version of $\mathcal{L}(\theta) = P(X_1 \ldots)$
D) $\mathcal{L}(\theta) = P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)$

Correct: D)

If you chose A: This is incorrect. The formula $\mathcal{L}(\theta) = P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)$ is central to this subject. The other options are either simplified versions or unrelated.
If you chose B: This is incorrect. The formula $\mathcal{L}(\theta) = P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)$ is central to this subject. The other options are either simplified versions or unrelated.
If you chose C: This is incorrect. The formula $\mathcal{L}(\theta) = P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)$ is central to this subject. The other options are either simplified versions or unrelated.
If you chose D: Correct! The formula $\mathcal{L}(\theta) = P(X_1 = x_1, \ldots, X_n = x_n \mid \theta)$ is central to this subject. The other options are either simplified versions or unrelated.

Q3: What is the primary purpose of the likelihood?

A) It measures how well each value of $\theta$ explains the observed data
B) It is primarily a historical notation system
C) It is used only in advanced research contexts
D) It replaces all other methods in this domain

Correct: A)

If you chose A: Correct! The likelihood is the probability (or density) of the observed data viewed as a function of $\theta$, so it compares parameter values by how well they explain the data.
If you chose B: This is incorrect. The likelihood is the probability (or density) of the observed data viewed as a function of $\theta$; it is not a notation system.
If you chose C: This is incorrect. The likelihood is the probability (or density) of the observed data viewed as a function of $\theta$; it is used in everyday estimation, as the worked examples show.
If you chose D: This is incorrect. The likelihood is the probability (or density) of the observed data viewed as a function of $\theta$; it does not replace all other estimation methods.

Q4: Which statement about Log-likelihood is TRUE?

A) Log-likelihood is an advanced topic beyond this subject's scope
B) Log-likelihood is a fundamental concept covered in this subject
C) Log-likelihood is mentioned only as a historical footnote
D) Log-likelihood is not related to this subject

Correct: B)

If you chose A: This is incorrect. Log-likelihood is a fundamental concept covered in this subject. This subject covers Log-likelihood as part of its core content.
If you chose B: Log-likelihood is a fundamental concept covered in this subject. This subject covers Log-likelihood as part of its core content. Correct!
If you chose C: This is incorrect. Log-likelihood is a fundamental concept covered in this subject. This subject covers Log-likelihood as part of its core content.
If you chose D: This is incorrect. Log-likelihood is a fundamental concept covered in this subject. This subject covers Log-likelihood as part of its core content.

Q5: Based on worked Examples 2 and 3, what is the MLE of the normal mean and of the Poisson rate?

A) A different result from a common mistake
B) An unrelated numerical value
C) $\bar{x}$, the sample mean
D) The inverse of the correct answer

Correct: C)

If you chose A: This is incorrect. The worked examples show that the result is $\bar{x}$. The other options represent common errors.
If you chose B: This is incorrect. The worked examples show that the result is $\bar{x}$. The other options represent common errors.
If you chose C: Correct! The worked examples show that the result is $\bar{x}$. The other options represent common errors.
If you chose D: This is incorrect. The worked examples show that the result is $\bar{x}$. The other options represent common errors.

Q6: How are the log-likelihood and the score equation related?

A) The log-likelihood and the score equation are completely unrelated topics
B) The log-likelihood is the inverse of the score equation
C) The log-likelihood is a special case of the score equation
D) The log-likelihood and the score equation are closely related concepts

Correct: D)

If you chose A: This is incorrect. The score function is the derivative of the log-likelihood, and the score equation sets that derivative to zero.
If you chose B: This is incorrect. The score function is the derivative of the log-likelihood, and the score equation sets that derivative to zero.
If you chose C: This is incorrect. The score function is the derivative of the log-likelihood, and the score equation sets that derivative to zero.
If you chose D: Correct! The score function is the derivative of the log-likelihood, and the score equation sets that derivative to zero.

Q7: What is a common pitfall when working with the likelihood function?

A) The likelihood function is always computed the same way in all contexts
B) The main error with the likelihood function is using it when it is not needed
C) A common mistake is confusing the likelihood function with a similar concept
D) The likelihood function has no common misconceptions

Correct: C)

If you chose A: This is incorrect. Students often confuse the likelihood function with a probability distribution over $\theta$. The likelihood is a function of $\theta$ with the data fixed. Pay attention to the precise definitions.
If you chose B: This is incorrect. Students often confuse the likelihood function with a probability distribution over $\theta$. The likelihood is a function of $\theta$ with the data fixed. Pay attention to the precise definitions.
If you chose C: Correct! Students often confuse the likelihood function with a probability distribution over $\theta$. The likelihood is a function of $\theta$ with the data fixed. Pay attention to the precise definitions.
If you chose D: This is incorrect. Students often confuse the likelihood function with a probability distribution over $\theta$. The likelihood is a function of $\theta$ with the data fixed. Pay attention to the precise definitions.

Q8: When should you apply the log-likelihood?

A) Apply the log-likelihood to solve problems in this subject's domain
B) Avoid the log-likelihood unless explicitly instructed
C) The log-likelihood is not practically useful
D) Use the log-likelihood only in pure mathematics contexts

Correct: A)

If you chose A: Correct! The log-likelihood is a practical tool used throughout this subject: every worked example derives the MLE by differentiating it.
If you chose B: This is incorrect. The log-likelihood is a practical tool used throughout this subject: every worked example derives the MLE by differentiating it.
If you chose C: This is incorrect. The log-likelihood is a practical tool used throughout this subject: every worked example derives the MLE by differentiating it.
If you chose D: This is incorrect. The log-likelihood is a practical tool used throughout this subject: every worked example derives the MLE by differentiating it.

Practice Problems

For $n$ i.i.d. exponential($\beta$) observations with PDF $f(x) = \frac{1}{\beta}e^{-x/\beta}$ for $x > 0$, derive the MLE for $\beta$.

Answer:

Log-likelihood: $\ell(\beta) = \sum(-\log\beta - x_i/\beta) = -n\log\beta - \frac{1}{\beta}\sum x_i$

Score: $S(\beta) = -\frac{n}{\beta} + \frac{\sum x_i}{\beta^2}$

Set to zero: $\frac{\sum x_i}{\beta^2} = \frac{n}{\beta}$. Multiplying both sides by $\beta^2$ gives $\sum x_i = n\beta$ → $\hat{\beta}_{\text{MLE}} = \frac{1}{n}\sum x_i = \bar{x}$

A coin is flipped 100 times, yielding 63 heads. What is the MLE of the probability of heads?

Answer: $\hat{p}_{\text{MLE}} = 63/100 = 0.63$

If $\hat{\theta}$ is the MLE of $\theta$, what is the MLE of $\theta^2$?

Answer: By the invariance property: $\hat{\theta}^2$. The MLE of any function $g(\theta)$ is simply $g(\hat{\theta})$.

For the normal distribution, find the MLE of $\sigma$ given that the MLE of $\sigma^2$ is $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$.

Answer: By invariance: $\hat{\sigma}_{\text{MLE}} = \sqrt{\hat{\sigma}^2} = \sqrt{\frac{1}{n}\sum(X_i - \bar{X})^2}$

Note: this MLE for $\sigma$ is biased (and so is the MLE for $\sigma^2$), but it IS the MLE.

Why do we maximise the log-likelihood rather than the likelihood itself?

Answer: Three reasons: (1) The product $\prod f(x_i \mid \theta)$ becomes numerically unstable for large $n$: multiplying many values smaller than 1 can underflow to zero in floating-point arithmetic. (2) The log transforms products into sums, which are easier to differentiate. (3) The log function is strictly increasing, so the argmax is unchanged.

Summary

Key takeaways:

Likelihood $\mathcal{L}(\theta)$ = probability of data given $\theta$, viewed as a function of $\theta$
Log-likelihood converts products to sums for easier differentiation
Score equation $S(\theta) = 0$ gives the MLE (verify second derivative)
Invariance: MLE of $g(\theta)$ is $g(\hat{\theta}_{\text{MLE}})$
MLEs are consistent, asymptotically efficient, and asymptotically normal
MLEs may be biased for finite $n$ — the trade-off is asymptotic optimality

Pitfalls

Forgetting to verify the second derivative: Setting the score function to zero finds critical points, but not all critical points are maxima. The second derivative (or the Hessian for multiple parameters) must be negative (negative definite) at the candidate point. A zero score could correspond to a minimum or saddle point, especially when the log-likelihood is not concave.
Assuming MLEs are always unbiased: The MLE of the normal variance is $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$, which has expected value $\frac{n-1}{n}\sigma^2$; it is biased for every finite $n$. MLEs are consistent and asymptotically efficient, but unbiasedness is not guaranteed. The bias typically vanishes at rate $1/n$.
Maximizing the likelihood instead of the log-likelihood: Products of many probabilities (or densities) can produce numerical underflow, where intermediate values smaller than the smallest representable floating-point number round to zero. The log transformation converts products to sums and avoids this problem. Always work on the log scale.
Treating the likelihood function as a probability distribution over $\theta$: The likelihood $\mathcal{L}(\theta) = P(\text{data} \mid \theta)$ is a function of $\theta$ with the data fixed. It does NOT integrate to 1 over $\theta$ in general and is NOT a probability distribution for the parameter. Confusing likelihood with posterior probability is a fundamental category error.
Misapplying the invariance property to expected values: The MLE of $g(\theta)$ is $g(\hat{\theta})$, but $E[g(\hat{\theta})] \neq g(E[\hat{\theta}])$ in general. Even if $\hat{\theta}$ were unbiased for $\theta$, $g(\hat{\theta})$ would usually be biased for $g(\theta)$ when $g$ is nonlinear. For example, the MLE of $\sigma$ is the square root of the MLE of $\sigma^2$, and both are biased. Invariance gives you the MLE, not the expected value.

Next Steps

Next up: 12-05-moments-bayesian-estimation.md
