### 12.5 - Method of Moments and Bayesian Estimation
Phase: Statistics
Prerequisites: 12-04-mle, 10-02-conditional-probability, 12-03-point-estimation
Learning Objectives
By the end of this subject, you will be able to:
- Derive method of moments estimators for common distributions
- Compare MoM estimators with MLEs
- Define prior, likelihood, posterior, and marginal likelihood in Bayesian inference
- Compute posterior distributions using conjugate priors
- Derive MAP (Maximum A Posteriori) estimates
Core Content
Method of Moments (MoM)
The idea: equate population moments to sample moments and solve for parameters.
Procedure (where $k$ is the number of parameters to estimate):
1. Compute the first $k$ population moments, $E[X], E[X^2], \ldots, E[X^k]$, as functions of the parameters.
2. Compute the corresponding $k$ sample moments: $\frac{1}{n}\sum X_i, \frac{1}{n}\sum X_i^2, \ldots, \frac{1}{n}\sum X_i^k$.
3. Set them equal and solve the system for the parameters.
Example: For $X \sim \text{Exponential}(\beta)$, parameterised by its mean $\beta$:
- Population mean: $E[X] = \beta$
- Sample mean: $\bar{X}$
- Set equal: $\tilde{\beta}_{\text{MoM}} = \bar{X}$
(For the exponential, MoM and MLE coincide.)
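The exponential example can be tried on simulated data. The sketch below is illustrative only: the true mean `true_beta`, the seed, and the sample size are made-up values, and `mom_exponential` is a hypothetical helper name.

```python
import random

def mom_exponential(xs):
    """MoM estimate of the exponential mean beta: match E[X] = beta to the sample mean."""
    return sum(xs) / len(xs)

rng = random.Random(0)      # fixed seed so the run is repeatable
true_beta = 2.0             # hypothetical true mean, chosen for illustration
# random.expovariate takes the rate (1 / mean), not the mean itself
sample = [rng.expovariate(1.0 / true_beta) for _ in range(10_000)]

beta_mom = mom_exponential(sample)
print(f"MoM estimate of beta: {beta_mom:.3f}")
```

With a sample this large the estimate should land close to `true_beta`, which is the consistency property at work.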
MoM vs MLE:

| Property | MoM | MLE |
|---|---|---|
| Computational ease | Often simpler | May need numerical optimisation |
| Efficiency | Generally less efficient | Asymptotically efficient (under regularity conditions) |
| Consistency | Yes (under mild conditions) | Yes (under regularity conditions) |
| Finite-sample bias | Can be biased | Can be biased |
| Requires distribution | Yes (only through its moments) | Yes (full likelihood) |
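A small simulation can make the comparison concrete. The sketch below uses the Uniform(0, $\theta$) model from the practice problems, where the MoM estimator is $2\bar{X}$ and the MLE is the sample maximum. The values of `theta`, `n`, and `reps` are arbitrary choices. Uniform(0, $\theta$) is a non-regular model, so treat this as an illustration of how the two estimators can differ, not as a demonstration of the asymptotic efficiency theory.

```python
import random

def mom_uniform(xs):
    """MoM estimate for Uniform(0, theta): solve theta / 2 = sample mean."""
    return 2.0 * sum(xs) / len(xs)

def mle_uniform(xs):
    """MLE for Uniform(0, theta): the sample maximum."""
    return max(xs)

rng = random.Random(1)
theta, n, reps = 5.0, 20, 2000   # hypothetical settings for the experiment
mom_sq_err = mle_sq_err = 0.0
below_max = 0                    # how often the MoM estimate is below the largest observation

for _ in range(reps):
    xs = [rng.uniform(0.0, theta) for _ in range(n)]
    mom, mle = mom_uniform(xs), mle_uniform(xs)
    mom_sq_err += (mom - theta) ** 2
    mle_sq_err += (mle - theta) ** 2
    below_max += mom < mle

mom_mse, mle_mse = mom_sq_err / reps, mle_sq_err / reps
print(f"MSE of MoM: {mom_mse:.3f}, MSE of MLE: {mle_mse:.3f}")
print(f"MoM estimate below the sample maximum in {below_max} of {reps} samples")
```

In this setting the MLE has the smaller mean squared error, and a noticeable fraction of samples give a MoM estimate that the data themselves rule out.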
⚠️ CRITICAL: Bayesian vs Frequentist
The fundamental philosophical divide:
| | Frequentist | Bayesian |
|---|---|---|
| Parameter $\theta$ | Fixed, unknown constant | Random variable |
| Probability | Long-run frequency | Degree of belief |
| Inference | Point estimates, CIs, p-values | Posterior distribution |
| Prior information | Not formally incorporated | Explicit through prior $p(\theta)$ |
Bayes' Theorem for Inference
$$p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta) \cdot p(\theta)}{p(\text{data})}$$
- Prior $p(\theta)$: beliefs about $\theta$ before seeing data
- Likelihood $p(\text{data} \mid \theta)$: same as in MLE
- Marginal likelihood $p(\text{data}) = \int p(\text{data} \mid \theta) p(\theta) d\theta$: normalising constant
- Posterior $p(\theta \mid \text{data})$: updated beliefs after seeing data
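The four ingredients above can be computed numerically on a grid when no closed form is at hand. The sketch below assumes a coin with unknown heads probability, a flat prior, and hypothetical data of 7 heads in 10 flips. `grid_posterior` is an illustrative helper, and the integral for $p(\text{data})$ is approximated by a simple midpoint sum.

```python
def grid_posterior(prior, likelihood, grid, step):
    """Evaluate prior * likelihood on a grid and normalise by the marginal likelihood."""
    unnorm = [likelihood(t) * prior(t) for t in grid]
    evidence = sum(unnorm) * step            # midpoint-rule approximation of p(data)
    posterior = [u / evidence for u in unnorm]
    return posterior, evidence

heads, flips = 7, 10                          # hypothetical data
cells = 1000
step = 1.0 / cells
grid = [(i + 0.5) * step for i in range(cells)]   # cell midpoints on (0, 1)

def prior(t):
    return 1.0                                # flat prior density on (0, 1)

def likelihood(t):
    return t ** heads * (1 - t) ** (flips - heads)   # one specific sequence of flips

posterior, evidence = grid_posterior(prior, likelihood, grid, step)
post_mean = sum(t * p for t, p in zip(grid, posterior)) * step
print(f"marginal likelihood ~ {evidence:.6f}, posterior mean ~ {post_mean:.3f}")
```

The posterior here is a list of density values, so any summary (mean, mode, interval) can be read off it. The marginal likelihood only rescales the curve and does not change its shape.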
"Today's posterior is tomorrow's prior."
Conjugate Priors
A prior is conjugate to a likelihood if the posterior belongs to the same distributional family. This makes computation tractable.
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Bernoulli ($k$ successes in $n$ trials) | Beta($\alpha$, $\beta$) | Beta($\alpha + k$, $\beta + n - k$) |
| Poisson ($n$ observations $x_1, \ldots, x_n$) | Gamma($a$, $b$) with shape $a$, rate $b$ | Gamma($a + \sum x_i$, $b + n$) |
| Normal (mean, known $\sigma^2$) | Normal($\mu_0$, $\sigma_0^2$) | Normal (updated) |
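As a quick check of the Gamma-Poisson row, the update is just two additions. The sketch below assumes the shape/rate parameterisation of the Gamma distribution (the one for which the rate becomes $b + n$). The prior values and counts are made up.

```python
def gamma_poisson_update(a, b, counts):
    """Gamma(shape a, rate b) prior plus Poisson counts gives Gamma(a + sum, b + n)."""
    return a + sum(counts), b + len(counts)

counts = [3, 5, 4, 2, 6]                 # hypothetical Poisson observations
a_post, b_post = gamma_poisson_update(2.0, 1.0, counts)
post_mean = a_post / b_post              # mean of Gamma(shape, rate) is shape / rate
print(a_post, b_post, round(post_mean, 3))
```

The posterior mean sits between the prior mean $a/b$ and the sample mean, and moves toward the sample mean as more counts arrive.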
Interpretation of Beta-Bernoulli:
- Prior: Beta(2, 2), a belief that the coin is roughly fair, with some uncertainty
- Data: 7 heads in 10 flips
- Posterior: Beta(2+7, 2+3) = Beta(9, 5), with mean 9/14 ≈ 0.64
- The prior's "2, 2" acts like 4 pseudo-observations
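The Beta-Bernoulli numbers above can be reproduced in a few lines. The helper names are illustrative. The second half of the sketch splits the same 10 flips into two hypothetical batches to show that updating in stages gives the same posterior as updating all at once.

```python
def beta_bernoulli_update(alpha, beta, heads, tails):
    """Conjugate update: add observed counts to the prior pseudo-counts."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

# All 10 flips at once: Beta(2, 2) prior, 7 heads and 3 tails
a, b = beta_bernoulli_update(2, 2, heads=7, tails=3)
print(a, b, round(beta_mean(a, b), 3))

# Same data in two batches: the first posterior serves as the prior for the second batch
a1, b1 = beta_bernoulli_update(2, 2, heads=4, tails=1)
a2, b2 = beta_bernoulli_update(a1, b1, heads=3, tails=2)
print(a2, b2)
```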
MAP Estimation
The Maximum A Posteriori estimate is the mode of the posterior:
$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \, p(\theta \mid \text{data}) = \arg\max_\theta \, p(\text{data} \mid \theta) \, p(\theta)$$
Connection to regularisation: in linear regression with Gaussian noise, MAP with a zero-mean Gaussian prior on the regression coefficients is equivalent to ridge regression ($L_2$ regularisation). MAP with a Laplace prior gives Lasso ($L_1$ regularisation).
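The ridge connection can be checked numerically in the simplest case. The sketch assumes a one-coefficient model $y_i = w x_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$ and prior $w \sim N(0, \tau^2)$. Under those assumptions the ridge penalty is $\lambda = \sigma^2 / \tau^2$. The data and variances are invented, and the MAP is located by brute-force grid search, not by an optimiser.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form minimiser of sum((y - w*x)^2) + lam * w^2 for a single coefficient."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def log_posterior(w, xs, ys, sigma2, tau2):
    """Log-likelihood plus log-prior, dropping terms that do not depend on w."""
    log_lik = -sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma2)
    log_prior = -w * w / (2 * tau2)
    return log_lik + log_prior

xs = [0.5, 1.0, 1.5, 2.0, 2.5]           # hypothetical inputs
ys = [1.1, 1.9, 3.2, 3.8, 5.1]           # hypothetical responses
sigma2, tau2 = 1.0, 0.5                  # assumed noise and prior variances

grid = [i / 10_000 for i in range(40_001)]   # candidate w values from 0 to 4
w_map = max(grid, key=lambda w: log_posterior(w, xs, ys, sigma2, tau2))
w_ridge = ridge_1d(xs, ys, lam=sigma2 / tau2)
print(f"MAP by grid search: {w_map:.4f}, ridge closed form: {w_ridge:.4f}")
```

The two numbers agree up to the grid spacing: a tighter prior (smaller $\tau^2$) means a larger $\lambda$ and more shrinkage toward zero.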
🚩 Common Pitfall: MAP gives a point estimate (the mode), NOT the full posterior. It discards uncertainty information. When you need uncertainty quantification, use the full posterior, not just the MAP.
Key Terms
- Bayesian inference
- Conjugate priors
- Inference
- Likelihood
- Marginal likelihood
- Maximum A Posteriori
- Posterior
- Prior
- Prior information
- Probability
Worked Examples
Example 1: MoM for Normal($\mu$, $\sigma^2$)
Two parameters, so we need two moment equations.
1st moment: $E[X] = \mu$. Setting this equal to $\bar{X}$ gives $\tilde{\mu} = \bar{X}$.
2nd moment: $E[X^2] = \mu^2 + \sigma^2$
Set equal to sample second moment: $\tilde{\mu}^2 + \tilde{\sigma}^2 = \frac{1}{n}\sum X_i^2$
$\tilde{\sigma}^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$
The MoM estimator for $\sigma^2$ is the biased version (divide by $n$, not $n-1$).
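The two moment equations translate directly into code. The sketch below simulates a hypothetical Normal(10, 3²) sample and compares the MoM variance with the two variance functions in Python's `statistics` module: `pvariance` divides by $n$, while `variance` divides by $n-1$.

```python
import random
import statistics

def mom_normal(xs):
    """Solve the first two moment equations for (mu, sigma^2)."""
    n = len(xs)
    m1 = sum(xs) / n                      # first sample moment
    m2 = sum(x * x for x in xs) / n       # second sample moment
    return m1, m2 - m1 ** 2

rng = random.Random(42)
sample = [rng.gauss(10.0, 3.0) for _ in range(5000)]   # hypothetical data

mu_mom, var_mom = mom_normal(sample)
print(f"MoM mean: {mu_mom:.3f}, MoM variance: {var_mom:.3f}")
print(f"divide-by-n variance:     {statistics.pvariance(sample):.3f}")
print(f"divide-by-(n-1) variance: {statistics.variance(sample):.3f}")
```

The MoM variance matches the divide-by-$n$ figure and is slightly smaller than the divide-by-$(n-1)$ one, which is the bias noted above.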
Example 2: Bayesian update for a coin flip
Prior: Beta(1, 1) = Uniform(0, 1), representing complete ignorance
Data: H, T, H, H, T, H, H, H, T, H, giving 7 heads and 3 tails
Posterior: Beta(1+7, 1+3) = Beta(8, 4)
Posterior mean: $\frac{8}{8+4} = \frac{8}{12} \approx 0.667$
Posterior mode (MAP): $\frac{8-1}{8+4-2} = \frac{7}{10} = 0.7$
95% credible interval (equal-tailed): roughly [0.39, 0.89]
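Contrast with the MLE: $\hat{p} = 7/10 = 0.7$, which coincides with the MAP under the uniform prior but, on its own, is a single point estimate with no uncertainty attached.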
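The summaries in this example can be recovered numerically without a statistics library. The sketch below evaluates the Beta density on a fine grid and reads off the mean, the mode, and the equal-tailed 95% interval from the cumulative sum. `beta_summary` is an illustrative helper, and the grid size is an arbitrary choice that trades speed for accuracy.

```python
def beta_summary(alpha, beta, cells=100_000):
    """Grid-based mean, mode and equal-tailed 95% interval of a Beta(alpha, beta)."""
    step = 1.0 / cells
    grid = [(i + 0.5) * step for i in range(cells)]
    dens = [t ** (alpha - 1) * (1 - t) ** (beta - 1) for t in grid]   # unnormalised
    total = sum(dens)
    mean = sum(t * d for t, d in zip(grid, dens)) / total
    mode = grid[max(range(cells), key=dens.__getitem__)]
    lower = upper = None
    cum = 0.0
    for t, d in zip(grid, dens):
        cum += d / total                  # running approximation of the CDF
        if lower is None and cum >= 0.025:
            lower = t
        if upper is None and cum >= 0.975:
            upper = t
    return mean, mode, (lower, upper)

mean, mode, (lo, hi) = beta_summary(8, 4)
print(f"mean={mean:.3f} mode={mode:.3f} 95% interval=({lo:.2f}, {hi:.2f})")
```

The interval is not symmetric around the mean because the Beta(8, 4) density is skewed toward smaller values.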
Example 3: MAP for normal mean with normal prior
Likelihood: $X_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ known
Prior: $\mu \sim N(\mu_0, \tau^2)$
Posterior: $\mu \mid \text{data} \sim N\left(\frac{\frac{n}{\sigma^2}\bar{X} + \frac{1}{\tau^2}\mu_0}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}, \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}\right)$
The posterior mean (and MAP, since normal) is a precision-weighted average of prior mean and sample mean:
$$\hat{\mu}_{\text{MAP}} = \frac{\tau^{-2} \mu_0 + n\sigma^{-2} \bar{X}}{\tau^{-2} + n\sigma^{-2}}$$
As $n \to \infty$, $\hat{\mu}_{\text{MAP}} \to \bar{X}$: the data overwhelm the prior.
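The precision-weighted average is easy to tabulate. The sketch below holds a hypothetical sample mean fixed at 5, uses a prior centred at 0, and increases $n$ to show the posterior mean moving from the prior toward the data. All numeric values are made up for illustration.

```python
def normal_posterior(xbar, n, sigma2, mu0, tau2):
    """Posterior mean and variance of mu for a normal likelihood with known sigma2."""
    prior_prec = 1.0 / tau2               # precision of the prior
    data_prec = n / sigma2                # precision contributed by n observations
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * mu0 + data_prec * xbar)
    return post_mean, post_var

for n in (1, 10, 100, 10_000):
    post_mean, post_var = normal_posterior(xbar=5.0, n=n, sigma2=4.0, mu0=0.0, tau2=1.0)
    print(n, round(post_mean, 4), round(post_var, 6))
```

With one observation the posterior mean stays close to the prior mean. With ten thousand it is nearly equal to $\bar{X}$, and the posterior variance shrinks roughly like $\sigma^2 / n$.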
Quiz
Q1: What does the concept of Probability primarily refer to in this subject?
A) A computational error related to Probability B) The definition and application of Probability C) A historical anecdote about Probability D) A visual representation of Probability
Correct: B)
- If you chose A: This is incorrect. Probability is defined as: the definition and application of probability. The other options describe different aspects that are not the primary focus.
- If you chose B: Probability is defined as: the definition and application of probability. The other options describe different aspects that are not the primary focus. Correct!
- If you chose C: This is incorrect. Probability is defined as: the definition and application of probability. The other options describe different aspects that are not the primary focus.
- If you chose D: This is incorrect. Probability is defined as: the definition and application of probability. The other options describe different aspects that are not the primary focus.
Q2: Which of the following is the key formula discussed in this subject?
A) An unrelated formula from a different topic B) $E[X], E[X^2], \ldots, E[X^k]$ C) The inverse operation of the formula in question D) A simplified version of $E[X], E[X^2], \ldots, E[X^k]$...
Correct: B)
- If you chose A: This is incorrect. The formula $E[X], E[X^2], \ldots, E[X^k]$ is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: The formula $E[X], E[X^2], \ldots, E[X^k]$ is central to this subject. The other options are either simplified versions or unrelated. Correct!
- If you chose C: This is incorrect. The formula $E[X], E[X^2], \ldots, E[X^k]$ is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: This is incorrect. The formula $E[X], E[X^2], \ldots, E[X^k]$ is central to this subject. The other options are either simplified versions or unrelated.
Q3: What is the primary purpose of Inference?
A) It is used only in advanced research contexts B) It is used to draw conclusions about unknown parameters from data C) It is primarily a historical notation system D) It replaces all other methods in this domain
Correct: B)
- If you chose A: This is incorrect. Inference serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: Inference serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose C: This is incorrect. Inference serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. Inference serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about Prior information is TRUE?
A) Prior information is not related to this subject B) Prior information is an advanced topic beyond this subject's scope C) Prior information is a fundamental concept covered in this subject D) Prior information is mentioned only as a historical footnote
Correct: C)
- If you chose A: This is incorrect. Prior information is a fundamental concept covered in this subject. This subject covers Prior information as part of its core content.
- If you chose B: This is incorrect. Prior information is a fundamental concept covered in this subject. This subject covers Prior information as part of its core content.
- If you chose C: Prior information is a fundamental concept covered in this subject. This subject covers Prior information as part of its core content. Correct!
- If you chose D: This is incorrect. Prior information is a fundamental concept covered in this subject. This subject covers Prior information as part of its core content.
Q5: Based on the practice problems in this subject, what is the method of moments estimator of $\theta$ for a Uniform(0, $\theta$) distribution?
A) An unrelated numerical value B) $2\bar{X}$ C) A different result from a common mistake D) The inverse of the correct answer
Correct: B)
- If you chose A: This is incorrect. The practice problems show that the result is $2\bar{X}$. The other options represent common errors.
- If you chose B: The practice problems show that the result is $2\bar{X}$. The other options represent common errors. Correct!
- If you chose C: This is incorrect. The practice problems show that the result is $2\bar{X}$. The other options represent common errors.
- If you chose D: This is incorrect. The practice problems show that the result is $2\bar{X}$. The other options represent common errors.
Q6: How are Prior information and Prior related?
A) Prior information and Prior are completely unrelated topics B) Prior information is a special case of Prior C) Prior information is the inverse of Prior D) Prior information and Prior are closely related concepts
Correct: D)
- If you chose A: This is incorrect. Both Prior information and Prior are covered in this subject as interconnected topics.
- If you chose B: This is incorrect. Both Prior information and Prior are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Prior information and Prior are covered in this subject as interconnected topics.
- If you chose D: Both Prior information and Prior are covered in this subject as interconnected topics. Correct!
Q7: What is a common pitfall when working with Likelihood?
A) The main error with Likelihood is using it when it is not needed B) Likelihood has no common misconceptions C) A common mistake is confusing Likelihood with a similar concept D) Likelihood is always computed the same way in all contexts
Correct: C)
- If you chose A: This is incorrect. Students often confuse Likelihood with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse Likelihood with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: Students often confuse Likelihood with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose D: This is incorrect. Students often confuse Likelihood with similar-sounding or related concepts. Pay attention to the precise definitions.
Q8: When should you apply Marginal likelihood?
A) Apply Marginal likelihood to solve problems in this subject's domain B) Use Marginal likelihood only in pure mathematics contexts C) Marginal likelihood is not practically useful D) Avoid Marginal likelihood unless explicitly instructed
Correct: A)
- If you chose A: Marginal likelihood is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose B: This is incorrect. Marginal likelihood is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: This is incorrect. Marginal likelihood is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: This is incorrect. Marginal likelihood is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
- Derive the MoM estimator for a Uniform(0, $\theta$) distribution. (Hint: $E[X] = \theta/2$)
Answer: $E[X] = \frac{\theta}{2}$. Setting this equal to $\bar{X}$ gives $\frac{\tilde{\theta}}{2} = \bar{X}$, so $\tilde{\theta}_{\text{MoM}} = 2\bar{X}$. Note: this can give estimates less than the maximum observation, which is incoherent. The MLE $\hat{\theta} = X_{(n)}$ (the sample maximum) is better here.
- For a Beta(2, 2) prior and 7 heads in 10 flips, what is the MAP estimate?
Answer: Posterior: Beta(2+7, 2+3) = Beta(9, 5). MAP (mode of Beta($\alpha$, $\beta$) for $\alpha, \beta > 1$): $\frac{\alpha-1}{\alpha+\beta-2} = \frac{9-1}{9+5-2} = \frac{8}{12} \approx 0.667$
- Why does MAP with a Gaussian prior correspond to $L_2$ regularisation?
Answer: $\log p(w \mid \text{data}) = \log p(\text{data} \mid w) + \log p(w) + \text{const}$. For a Gaussian prior $p(w) \propto \exp(-w^2/(2\sigma^2))$: $\log p(w) = -w^2/(2\sigma^2) + \text{const}$, which is an $L_2$ penalty. For a Laplace prior $p(w) \propto \exp(-|w|/b)$: $\log p(w) = -|w|/b + \text{const}$, which is an $L_1$ penalty. So the MAP objective is the log-likelihood plus the log-prior, that is, the log-likelihood minus a regularisation term.
- What happens to a Bayesian posterior as $n \to \infty$?
Answer: The posterior concentrates around the true parameter value (posterior consistency). The influence of the prior vanishes: the data dominate. For large $n$, the posterior is approximately $N(\hat{\theta}_{\text{MLE}}, 1/(nI(\hat{\theta}_{\text{MLE}})))$ regardless of the prior (Bernstein-von Mises theorem), provided the prior puts positive density near the true value and the usual regularity conditions hold.
- If your prior is Beta(10, 10) and you observe 1 head in 1 flip, what is the posterior mean?
Answer: Posterior: Beta(10+1, 10+0) = Beta(11, 10). Posterior mean: $\frac{11}{11+10} = \frac{11}{21} \approx 0.524$. Even though the MLE is 1.0, your strong prior (equivalent to 20 pseudo-observations, roughly fair) keeps the estimate near 0.5.
Summary
Key takeaways:
- MoM equates population and sample moments; computationally simple but less efficient than MLE
- Bayesian inference treats $\theta$ as random and produces a full posterior distribution
- Prior + likelihood → posterior via Bayes' theorem
- Conjugate priors (Beta-Bernoulli, Gamma-Poisson, Normal-Normal) yield tractable posteriors
- MAP is the mode of the posterior; connects to regularisation ($L_2$ = Gaussian prior, $L_1$ = Laplace prior)
- The prior's influence diminishes as sample size grows
Pitfalls
- Treating MAP as a full Bayesian analysis: MAP provides only a point estimate, the mode of the posterior. It discards all uncertainty information (spread, skewness, tails) contained in the full posterior distribution. When you need credible intervals or uncertainty quantification, you must use the full posterior, not just its mode.
- Confusing credible intervals with confidence intervals: A 95% Bayesian credible interval means "there is a 95% probability the parameter lies in this interval." This is NOT the correct interpretation of a frequentist confidence interval. The two may coincide numerically in some cases but have fundamentally different meanings, so do not switch between frameworks casually.
- Forgetting to add prior pseudo-counts in conjugate updating: When updating a Beta($\alpha$, $\beta$) prior with $k$ successes in $n$ Bernoulli trials, the posterior is Beta($\alpha + k$, $\beta + n - k$), not Beta($k$, $n - k$). The prior parameters act as "pseudo-observations" that must be added to the actual data counts. Omitting them is equivalent to using a Beta(0, 0) improper prior.
- Assuming MoM estimators always produce valid estimates: The method of moments can produce estimates outside the parameter space. For a Uniform(0, $\theta$) distribution, the MoM estimator $\tilde{\theta} = 2\bar{X}$ can be less than the sample maximum $X_{(n)}$, which is impossible because every observation must lie in $[0, \theta]$. Always check whether the estimate respects the parameter constraints.
- Choosing a strong informative prior without sensitivity analysis: With a Beta(10, 10) prior (equivalent to 20 pseudo-observations) and only 5 actual data points, the prior dominates the posterior. Always test how conclusions change under different reasonable priors. If the answer changes substantially, your data is too weak to overcome prior assumptions.
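A prior sensitivity check like the one recommended in the last pitfall can be as simple as re-running the conjugate update under several priors. The sketch below compares three Beta priors on a small and a large hypothetical data set. The helper name, the choice of priors, and the counts are all illustrative.

```python
def posterior_mean(alpha, beta, heads, tails):
    """Posterior mean of a Beta(alpha, beta) prior after observing heads and tails."""
    return (alpha + heads) / (alpha + beta + heads + tails)

def prior_sensitivity(heads, tails, priors):
    """Return posterior means under each prior and the spread between them."""
    means = {name: posterior_mean(a, b, heads, tails) for name, (a, b) in priors.items()}
    spread = max(means.values()) - min(means.values())
    return means, spread

priors = {"Beta(1, 1)": (1, 1), "Beta(2, 2)": (2, 2), "Beta(10, 10)": (10, 10)}

small_means, small_spread = prior_sensitivity(heads=4, tails=1, priors=priors)
large_means, large_spread = prior_sensitivity(heads=400, tails=100, priors=priors)

print("5 flips:  ", {k: round(v, 3) for k, v in small_means.items()}, round(small_spread, 3))
print("500 flips:", {k: round(v, 3) for k, v in large_means.items()}, round(large_spread, 3))
```

With 5 flips the posterior mean moves noticeably depending on the prior. With 500 flips the three priors give nearly the same answer, which is the signal that the data, not the prior, are driving the conclusion.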
Next Steps
Next up: 12-06-confidence-intervals.md