
### 12.5 — Method of Moments and Bayesian Estimation

Phase: Statistics

Prerequisites: 12-04-mle, 10-02-conditional-probability, 12-03-point-estimation

Learning Objectives

By the end of this subject, you will be able to:

Derive method of moments estimators for common distributions
Compare MoM estimators with MLEs
Define prior, likelihood, posterior, and marginal likelihood in Bayesian inference
Compute posterior distributions using conjugate priors
Derive MAP (Maximum A Posteriori) estimates

Core Content

Method of Moments (MoM)

The idea: equate population moments to sample moments and solve for parameters.

Procedure:

1. Compute $k$ population moments: $E[X], E[X^2], \ldots, E[X^k]$ (as functions of parameters)
2. Compute the corresponding $k$ sample moments: $\frac{1}{n}\sum X_i, \frac{1}{n}\sum X_i^2, \ldots$
3. Set them equal and solve the system for the parameters

Where $k$ = number of parameters to estimate.

Example: For $X \sim \text{Exponential}(\beta)$:

- Population mean: $E[X] = \beta$
- Sample mean: $\bar{X}$
- Set equal: $\tilde{\beta}_{\text{MoM}} = \bar{X}$

(For the exponential, MoM and MLE coincide.)
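
The exponential case can be sketched in a few lines of Python. This is a minimal illustration, assuming the mean parameterisation $E[X] = \beta$ used above. The function name `mom_exponential`, the seed, and the simulated data are hypothetical choices and not part of the lesson.

```python
import random
import statistics

def mom_exponential(sample):
    """Method of moments estimate of beta for Exponential(beta) with E[X] = beta."""
    # One parameter, so one moment equation: beta = sample mean.
    return statistics.fmean(sample)

random.seed(0)  # fixed seed so the run is repeatable
true_beta = 2.0
# random.expovariate takes the rate, which is 1 / beta in this parameterisation.
simulated = [random.expovariate(1.0 / true_beta) for _ in range(10_000)]
beta_mom = mom_exponential(simulated)
print(beta_mom)  # should land close to true_beta = 2.0
```

Because the estimator is just the sample mean, its standard error here is about $\beta/\sqrt{n} = 0.02$, so the printed value should sit within a few hundredths of 2.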

MoM vs MLE:

| Property | MoM | MLE |
|---|---|---|
| Computational ease | Often simpler | May need numerical optimisation |
| Efficiency | Less efficient generally | Asymptotically efficient |
| Consistency | Yes | Yes |
| Finite sample bias | Can be biased | Can be biased |
| Requires distribution | Yes | Yes |

⚠️ CRITICAL: Bayesian vs Frequentist

The fundamental philosophical divide:

| | Frequentist | Bayesian |
|---|---|---|
| Parameter $\theta$ | Fixed, unknown constant | Random variable |
| Probability | Long-run frequency | Degree of belief |
| Inference | Point estimates, CIs, p-values | Posterior distribution |
| Prior information | Not formally incorporated | Explicit through prior $p(\theta)$ |

Bayes' Theorem for Inference

$$p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta) \cdot p(\theta)}{p(\text{data})}$$

Prior $p(\theta)$: beliefs about $\theta$ before seeing data
Likelihood $p(\text{data} \mid \theta)$: same as in MLE
Marginal likelihood $p(\text{data}) = \int p(\text{data} \mid \theta) p(\theta) d\theta$: normalising constant
Posterior $p(\theta \mid \text{data})$: updated beliefs after seeing data
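
The four ingredients can be made concrete with a grid approximation. The sketch below is illustrative only: it assumes a coin with unknown heads probability $\theta$, a uniform prior, and hypothetical data of 7 heads and 3 tails. The names `grid_posterior` and `uniform_prior` are invented for the example.

```python
def uniform_prior(theta):
    """Flat prior density on [0, 1] (an assumption made for this example)."""
    return 1.0

def grid_posterior(heads, tails, prior, grid_size=1001):
    """Approximate p(theta | data) on an evenly spaced grid over [0, 1]."""
    step = 1.0 / (grid_size - 1)
    thetas = [i * step for i in range(grid_size)]
    # Likelihood of one specific ordered sequence (no binomial coefficient) times prior.
    unnorm = [(t ** heads) * ((1 - t) ** tails) * prior(t) for t in thetas]
    # Marginal likelihood p(data): trapezoid-rule integral of likelihood * prior.
    marginal = step * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))
    posterior = [u / marginal for u in unnorm]
    return thetas, posterior, marginal

thetas, posterior, marginal = grid_posterior(7, 3, uniform_prior)
step = thetas[1] - thetas[0]
post_mean = sum(t * p for t, p in zip(thetas, posterior)) * step
print(marginal)   # close to 1/1320, the exact value of the integral
print(post_mean)  # close to 2/3
```

The marginal likelihood only rescales the curve. The shape of the posterior is fixed by likelihood times prior, which is why it is often called a normalising constant.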

"Today's posterior is tomorrow's prior."

Conjugate Priors

A prior is conjugate to a likelihood if the posterior belongs to the same distributional family. This makes computation tractable.

| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Bernoulli | Beta($\alpha$, $\beta$) | Beta($\alpha + k$, $\beta + n - k$) |
| Poisson | Gamma($a$, $b$), with shape $a$ and rate $b$ | Gamma($a + \sum x_i$, $b + n$) |
| Normal (mean, known $\sigma^2$) | Normal($\mu_0$, $\sigma_0^2$) | Normal (updated) |
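
The Beta-Bernoulli row can be checked directly from Bayes' theorem. Writing $k$ for the number of successes in $n$ trials and dropping factors that do not depend on $\theta$:

$$p(\theta \mid \text{data}) \propto \underbrace{\theta^{k}(1-\theta)^{n-k}}_{\text{likelihood}} \cdot \underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{prior}} = \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1}$$

This is the kernel of a Beta($\alpha + k$, $\beta + n - k$) density, so the posterior stays in the Beta family.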

Interpretation of Beta-Bernoulli:

- Prior: Beta(2, 2), a belief that the coin is roughly fair, with some uncertainty
- Data: 7 heads in 10 flips
- Posterior: Beta(2+7, 2+3) = Beta(9, 5), with mean 9/14 ≈ 0.64
- The prior's "2, 2" acts like 4 pseudo-observations
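
The conjugate update is simple enough to write as a two-line function. The following is a minimal sketch of the Beta-Bernoulli rule from the table; the helper names `beta_bernoulli_update` and `beta_mean` are illustrative.

```python
def beta_bernoulli_update(alpha, beta, successes, trials):
    """Return the Beta posterior parameters after observing Bernoulli data."""
    failures = trials - successes
    # Successes add to alpha, failures add to beta.
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

post_alpha, post_beta = beta_bernoulli_update(2, 2, successes=7, trials=10)
print(post_alpha, post_beta)             # 9 5
print(beta_mean(post_alpha, post_beta))  # 9/14, about 0.643
```

Calling the function again with the returned parameters as the new prior reproduces sequential updating: the result depends only on the total counts, not on the order in which batches arrive.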

MAP Estimation

The Maximum A Posteriori estimate is the mode of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta p(\theta \mid \text{data}) = \arg\max_\theta p(\text{data} \mid \theta) p(\theta)$$

Connection to regularisation: MAP with a Gaussian prior on regression coefficients is equivalent to ridge regression ($L_2$ regularisation). MAP with a Laplace prior gives Lasso ($L_1$ regularisation).
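
The ridge connection can be checked numerically in a deliberately small setting. The sketch below assumes a one-feature model without intercept, $y_i = w x_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$ and prior $w \sim N(0, \tau^2)$. The data, the variances, and the function names are all hypothetical. Under these assumptions the MAP estimate should match ridge regression with penalty $\lambda = \sigma^2/\tau^2$.

```python
def neg_log_posterior(w, xs, ys, noise_var, prior_var):
    """Negative log posterior for y = w*x + noise, up to an additive constant."""
    sse = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    # Gaussian likelihood term plus Gaussian prior term.
    return sse / (2 * noise_var) + w ** 2 / (2 * prior_var)

def ridge_slope(xs, ys, lam):
    """Closed-form minimiser of sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
noise_var, prior_var = 1.0, 0.5
lam = noise_var / prior_var  # ridge penalty implied by the Gaussian prior

w_ridge = ridge_slope(xs, ys, lam)

# Brute-force MAP: minimise the negative log posterior over a fine grid of w in [0, 2].
grid = [i / 10_000 for i in range(20_001)]
w_map = min(grid, key=lambda w: neg_log_posterior(w, xs, ys, noise_var, prior_var))

print(w_ridge, w_map)  # the two agree up to the grid spacing of 1e-4
```

A tighter prior (smaller `prior_var`) gives a larger `lam` and pulls the slope further toward zero, which is the shrinkage behaviour of ridge regression.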

🚩 Common Pitfall: MAP gives a point estimate (the mode), NOT the full posterior. It discards uncertainty information. When you need uncertainty quantification, use the full posterior, not just the MAP.

Key Terms

Bayesian inference
Conjugate priors
Inference
Likelihood
Marginal likelihood
Maximum A Posteriori
Posterior
Prior
Prior information
Probability

Worked Examples

Example 1: MoM for Normal($\mu$, $\sigma^2$)

Two parameters → need two moment equations.

1st moment: $E[X] = \mu = \bar{X}$ → $\tilde{\mu} = \bar{X}$

2nd moment: $E[X^2] = \mu^2 + \sigma^2$

Set equal to sample second moment: $\tilde{\mu}^2 + \tilde{\sigma}^2 = \frac{1}{n}\sum X_i^2$

$\tilde{\sigma}^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$

The MoM estimator for $\sigma^2$ is the biased version (divide by $n$, not $n-1$).
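
The two moment equations translate directly into code. This is a small sketch with made-up data; `mom_normal` is an illustrative name, and the comparison with `statistics.pvariance` and `statistics.variance` is included only to show the divide-by-$n$ versus divide-by-$(n-1)$ difference.

```python
import statistics

def mom_normal(sample):
    """Method of moments estimates (mu, sigma^2) for a Normal sample."""
    n = len(sample)
    m1 = sum(sample) / n                 # first sample moment
    m2 = sum(x * x for x in sample) / n  # second sample moment
    mu = m1
    sigma2 = m2 - m1 ** 2                # solve mu^2 + sigma^2 = m2
    return mu, sigma2

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, sigma2_hat = mom_normal(data)
print(mu_hat, sigma2_hat)          # 5.0 4.0
print(statistics.pvariance(data))  # 4.0, the divide-by-n variance
print(statistics.variance(data))   # 32/7, the divide-by-(n-1) variance
```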

Example 2: Bayesian update for a coin flip

Prior: Beta(1, 1) = Uniform(0, 1) — complete ignorance

Data: H, T, H, H, T, H, H, H, T, H → 7 heads, 3 tails

Posterior: Beta(1+7, 1+3) = Beta(8, 4)

Posterior mean: $\frac{8}{8+4} = \frac{8}{12} \approx 0.667$

Posterior mode (MAP): $\frac{8-1}{8+4-2} = \frac{7}{10} = 0.7$

95% credible interval (equal-tailed): roughly [0.39, 0.89]

Contrast with MLE: $\hat{p} = 0.7$, no uncertainty quantification.
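
The numbers in this example can be reproduced with the standard library alone. The sketch below is one possible way to do it: the equal-tailed interval is found by accumulating a midpoint-rule integral of the Beta density, which is a simple numerical stand-in for a proper quantile function. The function names and the step count are assumptions.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def beta_quantile(q, a, b, steps=100_000):
    """Approximate the q-quantile by accumulating a midpoint-rule integral."""
    width = 1.0 / steps
    cumulative = 0.0
    for i in range(steps):
        midpoint = (i + 0.5) * width
        cumulative += beta_pdf(midpoint, a, b) * width
        if cumulative >= q:
            return (i + 1) * width  # right edge of the cell where the CDF crosses q
    return 1.0

a, b = 1 + 7, 1 + 3                # Beta(1, 1) prior updated with 7 heads, 3 tails
post_mean = a / (a + b)            # 8/12
post_mode = (a - 1) / (a + b - 2)  # 7/10, the MAP estimate
lower = beta_quantile(0.025, a, b)
upper = beta_quantile(0.975, a, b)
print(round(post_mean, 3), round(post_mode, 3))
print(lower, upper)
```

The interval is not symmetric around the posterior mean because Beta(8, 4) is skewed toward 1, which is exactly the kind of information a single point estimate cannot carry.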

Example 3: MAP for normal mean with normal prior

Likelihood: $X_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ known

Prior: $\mu \sim N(\mu_0, \tau^2)$

Posterior: $\mu \mid \text{data} \sim N\left(\frac{\frac{n}{\sigma^2}\bar{X} + \frac{1}{\tau^2}\mu_0}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}, \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}\right)$

The posterior mean (and MAP, since normal) is a precision-weighted average of prior mean and sample mean:

$$\hat{\mu}_{\text{MAP}} = \frac{\tau^{-2} \mu_0 + n\sigma^{-2} \bar{X}}{\tau^{-2} + n\sigma^{-2}}$$

As $n \to \infty$, $\hat{\mu}_{\text{MAP}} \to \bar{X}$ — the data overwhelms the prior.
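
The precision-weighted average is easy to explore numerically. The sketch below implements the posterior formulas above; the function name `normal_posterior` and the input values ($\bar{X} = 2$, $\sigma^2 = 4$, $\mu_0 = 0$, $\tau^2 = 1$) are hypothetical.

```python
def normal_posterior(xbar, n, sigma2, mu0, tau2):
    """Posterior mean and variance of mu for a Normal likelihood with known variance."""
    data_precision = n / sigma2   # precision contributed by the sample mean
    prior_precision = 1.0 / tau2  # precision contributed by the prior
    post_var = 1.0 / (data_precision + prior_precision)
    post_mean = post_var * (data_precision * xbar + prior_precision * mu0)
    return post_mean, post_var

for n in (1, 4, 100, 10_000):
    mean, var = normal_posterior(xbar=2.0, n=n, sigma2=4.0, mu0=0.0, tau2=1.0)
    print(n, round(mean, 4), round(var, 6))
```

With $n = 4$ the data and prior precisions are equal, so the posterior mean sits exactly halfway between $\mu_0 = 0$ and $\bar{X} = 2$. By $n = 10{,}000$ it is within a thousandth of $\bar{X}$.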

Quiz

Q1: What does the concept of Probability primarily refer to in this subject?

A) A computational error related to Probability B) The definition and application of Probability C) A historical anecdote about Probability D) A visual representation of Probability

Correct: B)

If you chose A: This is incorrect. Probability is defined as: the definition and application of probability. The other options describe different aspects that are not the primary focus.
If you chose B: Probability is defined as: the definition and application of probability. The other options describe different aspects that are not the primary focus. Correct!
If you chose C: This is incorrect. Probability is defined as: the definition and application of probability. The other options describe different aspects that are not the primary focus.
If you chose D: This is incorrect. Probability is defined as: the definition and application of probability. The other options describe different aspects that are not the primary focus.

Q2: Which of the following is the key formula discussed in this subject?

A) An unrelated formula from a different topic B) $E[X], E[X^2], \ldots, E[X^k]$ C) The inverse operation of the formula in question D) A simplified version of $E[X], E[X^2], \ldots, E[X^k]$...

Correct: B)

If you chose A: This is incorrect. The formula $E[X], E[X^2], \ldots, E[X^k]$ is central to this subject. The other options are either simplified versions or unrelated.
If you chose B: The formula $E[X], E[X^2], \ldots, E[X^k]$ is central to this subject. The other options are either simplified versions or unrelated. Correct!
If you chose C: This is incorrect. The formula $E[X], E[X^2], \ldots, E[X^k]$ is central to this subject. The other options are either simplified versions or unrelated.
If you chose D: This is incorrect. The formula $E[X], E[X^2], \ldots, E[X^k]$ is central to this subject. The other options are either simplified versions or unrelated.

Q3: What is the primary purpose of Inference?

A) It is used only in advanced research contexts B) It is used to draw conclusions about unknown parameters from data C) It is primarily a historical notation system D) It replaces all other methods in this domain

Correct: B)

If you chose A: This is incorrect. Inference serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose B: Inference serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
If you chose C: This is incorrect. Inference serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose D: This is incorrect. Inference serves the purpose described in the correct answer. The other options misrepresent its role.

Q4: Which statement about Prior information is TRUE?

A) Prior information is not related to this subject B) Prior information is an advanced topic beyond this subject's scope C) Prior information is a fundamental concept covered in this subject D) Prior information is mentioned only as a historical footnote

Correct: C)

If you chose A: This is incorrect. Prior information is a fundamental concept covered in this subject. This subject covers Prior information as part of its core content.
If you chose B: This is incorrect. Prior information is a fundamental concept covered in this subject. This subject covers Prior information as part of its core content.
If you chose C: Prior information is a fundamental concept covered in this subject. This subject covers Prior information as part of its core content. Correct!
If you chose D: This is incorrect. Prior information is a fundamental concept covered in this subject. This subject covers Prior information as part of its core content.

Q5: Based on the practice problems in this subject, what is the MoM estimator of $\theta$ for a Uniform(0, $\theta$) distribution?

A) An unrelated numerical value B) $2\bar{X}$ C) A different result from a common mistake D) The inverse of the correct answer

Correct: B)

If you chose A: This is incorrect. The practice problems show that the result is $2\bar{X}$. The other options represent common errors.
If you chose B: The practice problems show that the result is $2\bar{X}$. The other options represent common errors. Correct!
If you chose C: This is incorrect. The practice problems show that the result is $2\bar{X}$. The other options represent common errors.
If you chose D: This is incorrect. The practice problems show that the result is $2\bar{X}$. The other options represent common errors.

Q6: How are Prior information and Prior related?

A) Prior information and Prior are completely unrelated topics B) Prior information is a special case of Prior C) Prior information is the inverse of Prior D) Prior information and Prior are closely related concepts

Correct: D)

If you chose A: This is incorrect. Both Prior information and Prior are covered in this subject as interconnected topics.
If you chose B: This is incorrect. Both Prior information and Prior are covered in this subject as interconnected topics.
If you chose C: This is incorrect. Both Prior information and Prior are covered in this subject as interconnected topics.
If you chose D: Both Prior information and Prior are covered in this subject as interconnected topics. Correct!

Q7: What is a common pitfall when working with Likelihood?

A) The main error with Likelihood is using it when it is not needed B) Likelihood has no common misconceptions C) A common mistake is confusing Likelihood with a similar concept D) Likelihood is always computed the same way in all contexts

Correct: C)

If you chose A: This is incorrect. Students often confuse Likelihood with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose B: This is incorrect. Students often confuse Likelihood with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose C: Students often confuse Likelihood with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
If you chose D: This is incorrect. Students often confuse Likelihood with similar-sounding or related concepts. Pay attention to the precise definitions.

Q8: When should you apply Marginal likelihood?

A) Apply Marginal likelihood to solve problems in this subject's domain B) Use Marginal likelihood only in pure mathematics contexts C) Marginal likelihood is not practically useful D) Avoid Marginal likelihood unless explicitly instructed

Correct: A)

If you chose A: Marginal likelihood is a practical tool used throughout this subject to solve relevant problems. Correct!
If you chose B: This is incorrect. Marginal likelihood is a practical tool used throughout this subject to solve relevant problems.
If you chose C: This is incorrect. Marginal likelihood is a practical tool used throughout this subject to solve relevant problems.
If you chose D: This is incorrect. Marginal likelihood is a practical tool used throughout this subject to solve relevant problems.

Practice Problems

Derive the MoM estimator for a Uniform(0, $\theta$) distribution. (Hint: $E[X] = \theta/2$)

Answer: $E[X] = \frac{\theta}{2}$. Set equal to $\bar{X}$: $\frac{\tilde{\theta}}{2} = \bar{X}$ → $\tilde{\theta}_{\text{MoM}} = 2\bar{X}$.

Note: this can give estimates less than the maximum observation, which is incoherent. The MLE $\hat{\theta} = X_{(n)}$ (the sample maximum) is better here.
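
A tiny made-up sample shows how the incoherence can arise. In the sketch below the data and the function names `mom_uniform` and `mle_uniform` are hypothetical; the sample was chosen so that one large observation sits well above twice the mean.

```python
def mom_uniform(sample):
    """Method of moments estimate of theta for Uniform(0, theta): twice the sample mean."""
    return 2.0 * sum(sample) / len(sample)

def mle_uniform(sample):
    """Maximum likelihood estimate of theta for Uniform(0, theta): the sample maximum."""
    return max(sample)

data = [1.0, 1.0, 1.0, 9.0]
theta_mom = mom_uniform(data)
theta_mle = mle_uniform(data)
print(theta_mom, theta_mle)   # 6.0 9.0
print(theta_mom < max(data))  # True: the MoM estimate rules out an observed value
```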

For a Beta(2, 2) prior and 7 heads in 10 flips, what is the MAP estimate?

Answer: Posterior: Beta(2+7, 2+3) = Beta(9, 5). MAP (mode of Beta($\alpha$, $\beta$)): $\frac{\alpha-1}{\alpha+\beta-2} = \frac{9-1}{9+5-2} = \frac{8}{12} \approx 0.667$

Why does MAP with a Gaussian prior correspond to $L_2$ regularisation?

Answer: $\log p(w \mid \text{data}) = \log p(\text{data} \mid w) + \log p(w) + \text{const}$. For a Gaussian prior with variance $\tau^2$, $p(w) \propto \exp(-w^2/(2\tau^2))$, so $\log p(w) = -w^2/(2\tau^2) + \text{const}$ → $L_2$ penalty. For a Laplace prior $p(w) \propto \exp(-|w|/b)$, $\log p(w) = -|w|/b + \text{const}$ → $L_1$ penalty. Maximising the posterior therefore means maximising log-likelihood + log-prior, that is, log-likelihood − regularisation term.

What happens to a Bayesian posterior as $n \to \infty$?

Answer: The posterior concentrates around the true parameter value (posterior consistency). The influence of the prior vanishes and the data dominates. Under regularity conditions, including a prior that puts positive density on the true value, the posterior for large $n$ is approximately $N(\hat{\theta}_{\text{MLE}}, 1/(nI(\hat{\theta})))$ regardless of the prior (Bernstein-von Mises theorem).

If your prior is Beta(10, 10) and you observe 1 head in 1 flip, what is the posterior mean?

Answer: Posterior: Beta(10+1, 10+0) = Beta(11, 10). Posterior mean: $\frac{11}{11+10} = \frac{11}{21} \approx 0.524$. Even though the MLE is 1.0, your strong prior (equivalent to 20 pseudo-observations, roughly fair) keeps the estimate near 0.5.
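
The last two problems can be tied together with a quick numerical check of how the prior's pull fades as data accumulate. The sketch below compares a Beta(1, 1) prior with a Beta(10, 10) prior. It assumes, purely for illustration, that 70% of the flips come up heads at every sample size; `posterior_mean` is a hypothetical helper name.

```python
def posterior_mean(alpha, beta, heads, flips):
    """Posterior mean of a Beta(alpha, beta) prior after Bernoulli data."""
    return (alpha + heads) / (alpha + beta + flips)

print(posterior_mean(10, 10, heads=1, flips=1))  # 11/21, about 0.524

gaps = []
for flips in (10, 100, 10_000):
    heads = 7 * flips // 10                         # assumed 70% heads
    weak = posterior_mean(1, 1, heads, flips)       # uniform prior
    strong = posterior_mean(10, 10, heads, flips)   # 20 pseudo-observations
    gaps.append(abs(weak - strong))
    print(flips, round(weak, 4), round(strong, 4), round(gaps[-1], 4))
```

With 10 flips the two posterior means differ by 0.1. With 10,000 flips the difference is below 0.001, and both are close to the observed proportion of 0.7.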

Summary

Key takeaways:

MoM equates population and sample moments; computationally simple but less efficient than MLE
Bayesian inference treats $\theta$ as random and produces a full posterior distribution
Prior + likelihood → posterior via Bayes' theorem
Conjugate priors (Beta-Bernoulli, Gamma-Poisson, Normal-Normal) yield tractable posteriors
MAP is the mode of the posterior; connects to regularisation ($L_2$ = Gaussian prior, $L_1$ = Laplace prior)
The prior's influence diminishes as sample size grows

Pitfalls

Treating MAP as a full Bayesian analysis: MAP provides only a point estimate — the mode of the posterior. It discards all uncertainty information (spread, skewness, tails) contained in the full posterior distribution. When you need credible intervals or uncertainty quantification, you must use the full posterior, not just its mode.
Confusing credible intervals with confidence intervals: A 95% Bayesian credible interval means "there is a 95% probability the parameter lies in this interval." This is NOT the correct interpretation of a frequentist confidence interval. The two may coincide numerically in some cases but have fundamentally different meanings — do not switch between frameworks casually.
Forgetting to add prior pseudo-counts in conjugate updating: When updating a Beta(α, β) prior with k successes in n Bernoulli trials, the posterior is Beta(α + k, β + n − k), not Beta(k, n − k). The prior parameters act as "pseudo-observations" that must be added to the actual data counts. Omitting them is equivalent to using a Beta(0, 0) improper prior.
Assuming MoM estimators always produce valid estimates: The method of moments can produce estimates that contradict the model's constraints. For a Uniform(0, θ) distribution, the MoM estimator θ̃ = 2X̄ can be less than the sample maximum X₍ₙ₎, which is impossible because every observation must lie in [0, θ]. Always check whether the estimate respects the parameter constraints and the observed data.
Choosing a strong informative prior without sensitivity analysis: With a Beta(10, 10) prior (equivalent to 20 pseudo-observations) and only 5 actual data points, the prior dominates the posterior. Always test how conclusions change under different reasonable priors — if the answer changes substantially, your data is too weak to overcome prior assumptions.

Next Steps

Next up: 12-06-confidence-intervals.md
