### 12.5: Method of Moments and Bayesian Estimation

Phase: Statistics

Prerequisites: 12-04-mle, 10-02-conditional-probability, 12-03-point-estimation

Learning Objectives

By the end of this subject, you will be able to:

  1. Derive method of moments estimators for common distributions
  2. Compare MoM estimators with MLEs
  3. Define prior, likelihood, posterior, and marginal likelihood in Bayesian inference
  4. Compute posterior distributions using conjugate priors
  5. Derive MAP (Maximum A Posteriori) estimates

Core Content

Method of Moments (MoM)

The idea: equate population moments to sample moments and solve for parameters.

Procedure:

  1. Compute $k$ population moments, $E[X], E[X^2], \ldots, E[X^k]$, as functions of the parameters.
  2. Compute the corresponding $k$ sample moments: $\frac{1}{n}\sum X_i, \frac{1}{n}\sum X_i^2, \ldots, \frac{1}{n}\sum X_i^k$.
  3. Set them equal and solve the system for the parameters.

Here $k$ is the number of parameters to estimate.

Example: For $X \sim \text{Exponential}(\beta)$, parameterised by its mean $\beta$:

- Population mean: $E[X] = \beta$
- Sample mean: $\bar{X}$
- Set equal: $\tilde{\beta}_{\text{MoM}} = \bar{X}$

(For the exponential, MoM and MLE coincide.)
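As a minimal sketch of the three-step procedure for this exponential case, the snippet below estimates $\beta$ from simulated data. The seed, sample size, and `beta_true` are illustrative assumptions, not values from the text.

```python
import random
import statistics

def mom_exponential(sample):
    """MoM estimate of the exponential mean: the first sample moment."""
    return statistics.fmean(sample)

random.seed(0)       # fixed seed so the illustration is repeatable
beta_true = 2.0      # hypothetical true mean, chosen for the demo
# random.expovariate takes the rate, i.e. 1 / mean
sample = [random.expovariate(1.0 / beta_true) for _ in range(10_000)]

beta_hat = mom_exponential(sample)
print(f"MoM estimate of beta: {beta_hat:.3f}")
```

With one parameter only one moment equation is needed, so the estimator reduces to the sample mean.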

MoM vs MLE:

| Property | MoM | MLE |
|---|---|---|
| Computational ease | Often simpler | May need numerical optimisation |
| Efficiency | Less efficient generally | Asymptotically efficient |
| Consistency | Yes | Yes |
| Finite sample bias | Can be biased | Can be biased |
| Requires distribution | Yes | Yes |
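To illustrate the "often simpler" entry, here is a hedged sketch for a two-parameter Gamma(shape, scale) model, chosen here as an example and not taken from the text. It assumes the standard facts $E[X] = \text{shape} \cdot \text{scale}$ and $\text{Var}[X] = \text{shape} \cdot \text{scale}^2$. The MoM estimates then come out in closed form, whereas the Gamma MLE for the shape has no closed form and needs a numerical solver.

```python
import statistics

def mom_gamma(sample):
    """Closed-form MoM estimates (shape, scale) for a Gamma model.

    Solves mean = shape * scale and variance = shape * scale**2,
    using the divide-by-n variance that the moment equations imply.
    """
    mean = statistics.fmean(sample)
    var = statistics.pvariance(sample)  # divides by n, not n - 1
    shape = mean**2 / var
    scale = var / mean
    return shape, scale

toy_data = [2.0, 4.0, 6.0]  # toy numbers, for illustration only
shape_hat, scale_hat = mom_gamma(toy_data)
print(f"shape = {shape_hat:.3f}, scale = {scale_hat:.3f}")
```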

⚠️ CRITICAL: Bayesian vs Frequentist

The fundamental philosophical divide:

| | Frequentist | Bayesian |
|---|---|---|
| Parameter $\theta$ | Fixed, unknown constant | Random variable |
| Probability | Long-run frequency | Degree of belief |
| Inference | Point estimates, CIs, p-values | Posterior distribution |
| Prior information | Not formally incorporated | Explicit through prior $p(\theta)$ |

Bayes' Theorem for Inference

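$$p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta) \cdot p(\theta)}{p(\text{data})}$$

Here $p(\theta)$ is the prior, $p(\text{data} \mid \theta)$ is the likelihood, $p(\theta \mid \text{data})$ is the posterior, and $p(\text{data}) = \int p(\text{data} \mid \theta)\, p(\theta)\, d\theta$ is the marginal likelihood, which normalises the posterior.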
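A small sketch can make the four pieces of the theorem concrete. It assumes, purely for illustration, that $\theta$ can take only three candidate values with equal prior weight, and that the data are 7 heads in 10 coin flips. The grid, prior weights, and function name are hypothetical.

```python
from math import comb

def discrete_posterior(thetas, prior, heads, n):
    """Apply Bayes' theorem on a finite set of candidate theta values."""
    likelihood = [comb(n, heads) * t**heads * (1 - t) ** (n - heads) for t in thetas]
    unnormalised = [lik * p for lik, p in zip(likelihood, prior)]
    marginal = sum(unnormalised)  # p(data): sum over all candidates
    posterior = [u / marginal for u in unnormalised]
    return posterior, marginal

thetas = [0.3, 0.5, 0.7]       # hypothetical candidate coin biases
prior = [1 / 3, 1 / 3, 1 / 3]  # equal prior belief in each candidate
posterior, marginal = discrete_posterior(thetas, prior, heads=7, n=10)

for t, p in zip(thetas, posterior):
    print(f"theta = {t}: posterior = {p:.3f}")
```

In the discrete case the marginal likelihood is a sum over candidates. For a continuous parameter the sum becomes an integral.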

"Today's posterior is tomorrow's prior."

Conjugate Priors

A prior is conjugate to a likelihood if the posterior belongs to the same distributional family. This makes computation tractable.

| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Bernoulli | Beta($\alpha$, $\beta$) | Beta($\alpha + k$, $\beta + n - k$) |
| Poisson | Gamma($a$, $b$) | Gamma($a + \sum x_i$, $b + n$) |
| Normal (mean, known $\sigma^2$) | Normal($\mu_0$, $\sigma_0^2$) | Normal (updated) |
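The Poisson row can be sketched in a few lines. This reads $b$ as a rate parameter, which the $b + n$ update implies but the table does not state. The prior values and counts are made up for illustration.

```python
def gamma_poisson_update(a, b, counts):
    """Posterior Gamma(shape, rate) after observing Poisson counts."""
    return a + sum(counts), b + len(counts)

counts = [3, 5, 2, 4]  # made-up count data
a_post, b_post = gamma_poisson_update(a=2, b=1, counts=counts)

# Mean of a Gamma(shape, rate) distribution is shape / rate
post_mean = a_post / b_post
print(f"Posterior: Gamma({a_post}, {b_post}), mean = {post_mean:.2f}")
```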

Interpretation of Beta-Bernoulli:

- Prior: Beta(2, 2), a belief that the coin is roughly fair, with some uncertainty
- Data: 7 heads in 10 flips
- Posterior: Beta(2+7, 2+3) = Beta(9, 5), with mean = 9/14 ≈ 0.64
- The prior's "2, 2" acts like 4 pseudo-observations
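The update above is simple enough to check in code. This sketch also processes the same ten flips one at a time, each posterior serving as the next prior, to show that sequential and batch updating agree. The function name and the flip ordering are illustrative choices.

```python
def beta_bernoulli_update(alpha, beta, heads, tails):
    """Conjugate update: add observed heads to alpha and tails to beta."""
    return alpha + heads, beta + tails

# Batch update with the numbers from the text
batch = beta_bernoulli_update(2, 2, heads=7, tails=3)

# Same data one flip at a time: each posterior becomes the next prior
flips = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # 1 = head, 0 = tail; order is arbitrary
alpha, beta = 2, 2
for flip in flips:
    alpha, beta = beta_bernoulli_update(alpha, beta, heads=flip, tails=1 - flip)

posterior_mean = batch[0] / (batch[0] + batch[1])
print(batch, (alpha, beta), round(posterior_mean, 3))
```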

MAP Estimation

The Maximum A Posteriori estimate is the mode of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta p(\theta \mid \text{data}) = \arg\max_\theta p(\text{data} \mid \theta)\, p(\theta)$$

Connection to regularisation: MAP with a Gaussian prior on regression coefficients is equivalent to ridge regression ($L_2$ regularisation). MAP with a Laplace prior gives Lasso ($L_1$ regularisation).
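The ridge connection can be checked numerically in the simplest setting. The sketch below assumes a one-coefficient model $y = wx + \varepsilon$ with Gaussian noise of variance `sigma2` and a $N(0, \texttt{tau2})$ prior on $w$. These modelling choices, the toy data, and the grid search are illustrative assumptions. Under them, the MAP estimate should match ridge regression with penalty $\lambda = \sigma^2 / \tau^2$.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge estimate for a one-coefficient model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def neg_log_posterior(w, xs, ys, sigma2, tau2):
    """Negative log posterior, up to an additive constant."""
    sse = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return sse / (2 * sigma2) + w * w / (2 * tau2)

xs = [1.0, 2.0, 3.0, 4.0]  # toy inputs
ys = [1.1, 1.9, 3.2, 3.9]  # toy responses
sigma2, tau2 = 1.0, 0.5    # hypothetical noise and prior variances

# Brute-force MAP: minimise the negative log posterior over a grid on [0, 2]
grid = [i / 1000 for i in range(2001)]
w_map = min(grid, key=lambda w: neg_log_posterior(w, xs, ys, sigma2, tau2))

w_ridge = ridge_1d(xs, ys, lam=sigma2 / tau2)
print(f"grid MAP = {w_map:.3f}, ridge = {w_ridge:.4f}")
```

The two values agree up to the grid spacing, since minimising the negative log posterior is the same as minimising the squared error plus $\lambda w^2$.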

🚩 Common Pitfall: MAP gives a point estimate (the mode), NOT the full posterior. It discards uncertainty information. When you need uncertainty quantification, use the full posterior, not just the MAP.
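To see what the MAP leaves out, this sketch compares two Beta posteriors with the same mode and very different spreads. The second posterior, for a hypothetical 100-flip experiment, is an assumption added for contrast.

```python
from math import sqrt

def beta_mode(a, b):
    """Mode of Beta(a, b); this formula assumes a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)

def beta_sd(a, b):
    """Standard deviation of Beta(a, b)."""
    return sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

posteriors = {
    "10 flips": (8, 4),     # Beta(1, 1) prior + 7 heads, 3 tails
    "100 flips": (71, 31),  # Beta(1, 1) prior + 70 heads, 30 tails (hypothetical)
}
for label, (a, b) in posteriors.items():
    print(f"{label}: MAP = {beta_mode(a, b):.3f}, posterior sd = {beta_sd(a, b):.3f}")
```

Both posteriors report the same MAP, yet the second is far more concentrated. Only the full posterior shows that difference.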



Key Terms

Worked Examples

Example 1: MoM for Normal($\mu$, $\sigma^2$)

Two parameters → need two moment equations.

1st moment: $E[X] = \mu = \bar{X}$ → $\tilde{\mu} = \bar{X}$

2nd moment: $E[X^2] = \mu^2 + \sigma^2$

Set equal to sample second moment: $\tilde{\mu}^2 + \tilde{\sigma}^2 = \frac{1}{n}\sum X_i^2$

$\tilde{\sigma}^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$

The MoM estimator for $\sigma^2$ is the biased version (divide by $n$, not $n-1$).
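A short sketch of this example, using toy data chosen for illustration, confirms that the MoM variance matches the divide-by-$n$ variance and not the divide-by-$(n-1)$ one.

```python
import statistics

def mom_normal(sample):
    """MoM estimates (mu, sigma^2) from the first two sample moments."""
    n = len(sample)
    m1 = sum(sample) / n                 # first sample moment
    m2 = sum(x * x for x in sample) / n  # second sample moment
    return m1, m2 - m1**2

data = [4.0, 7.0, 5.0, 8.0, 6.0]  # toy data, for illustration only
mu_tilde, sigma2_tilde = mom_normal(data)

print(f"MoM: mu = {mu_tilde}, sigma^2 = {sigma2_tilde}")
print(f"divide by n:     {statistics.pvariance(data)}")
print(f"divide by n - 1: {statistics.variance(data)}")
```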

Example 2: Bayesian update for a coin flip

Prior: Beta(1, 1) = Uniform(0, 1), i.e. complete ignorance

Data: H, T, H, H, T, H, H, H, T, H → 7 heads, 3 tails

Posterior: Beta(1+7, 1+3) = Beta(8, 4)

Posterior mean: $\frac{8}{8+4} = \frac{8}{12} \approx 0.667$

Posterior mode (MAP): $\frac{8-1}{8+4-2} = \frac{7}{10} = 0.7$

95% credible interval (equal-tailed): roughly [0.39, 0.89]

Contrast with MLE: $\hat{p} = 7/10 = 0.7$, which coincides with the MAP under the uniform prior but, as a bare point estimate, carries no uncertainty quantification.
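The posterior summaries above can be reproduced with the standard library alone. This sketch relies on an identity that holds for integer parameters: the Beta($a$, $b$) CDF at $x$ equals $P(\text{Bin}(a+b-1, x) \ge a)$. It then finds quantiles by bisection. This is one convenient approach chosen for the illustration, and the helper names are hypothetical.

```python
from math import comb

def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) for integer a, b via the binomial-tail identity."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))

def beta_quantile_int(q, a, b, tol=1e-10):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

a, b = 8, 4  # posterior from the example
post_mean = a / (a + b)
post_mode = (a - 1) / (a + b - 2)
lower = beta_quantile_int(0.025, a, b)
upper = beta_quantile_int(0.975, a, b)
print(f"mean = {post_mean:.3f}, mode = {post_mode:.3f}, "
      f"95% interval = [{lower:.2f}, {upper:.2f}]")
```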

Example 3: MAP for normal mean with normal prior

Likelihood: $X_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ known

Prior: $\mu \sim N(\mu_0, \tau^2)$

Posterior: $\mu \mid \text{data} \sim N\left(\frac{\frac{n}{\sigma^2}\bar{X} + \frac{1}{\tau^2}\mu_0}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}, \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}\right)$

The posterior mean (and MAP, since normal) is a precision-weighted average of prior mean and sample mean:

$$\hat{\mu}_{\text{MAP}} = \frac{\tau^{-2} \mu_0 + n\sigma^{-2} \bar{X}}{\tau^{-2} + n\sigma^{-2}}$$

As $n \to \infty$, the weight on $\bar{X}$ tends to 1, so $\hat{\mu}_{\text{MAP}}$ approaches $\bar{X}$: the data overwhelms the prior.
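The precision-weighted average is easy to sketch. All numerical values below are hypothetical, chosen so the prior pulls the estimate toward 0 for small $n$ and loses its grip as $n$ grows.

```python
def normal_posterior(xbar, n, sigma2, mu0, tau2):
    """Posterior mean and variance of mu for known sigma2 and a N(mu0, tau2) prior."""
    data_precision = n / sigma2
    prior_precision = 1 / tau2
    post_var = 1 / (data_precision + prior_precision)
    post_mean = post_var * (data_precision * xbar + prior_precision * mu0)
    return post_mean, post_var

# Hypothetical setting: sample mean 2.0, prior centred at 0.0
for n in (4, 40, 400):
    mean, var = normal_posterior(xbar=2.0, n=n, sigma2=4.0, mu0=0.0, tau2=1.0)
    print(f"n = {n:3d}: posterior mean = {mean:.4f}, variance = {var:.4f}")
```

With $n = 4$ the data and prior precisions are equal, so the posterior mean sits halfway between $\mu_0$ and $\bar{X}$. By $n = 400$ it is nearly $\bar{X}$.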



Quiz

Q1: How is probability interpreted in the Bayesian framework described in this subject?

- A) As a long-run frequency
- B) As a degree of belief
- C) As a fixed, unknown constant
- D) As a point estimate

Correct: B)

Q2: Which population moments does the method of moments use to estimate $k$ parameters?

- A) Only the first moment, $E[X]$
- B) $E[X], E[X^2], \ldots, E[X^k]$
- C) Only the $k$-th moment, $E[X^k]$
- D) None; it uses the likelihood function instead

Correct: B)

Q3: What is the primary output of Bayesian inference?

- A) A p-value
- B) The posterior distribution
- C) A confidence interval
- D) A long-run frequency

Correct: B)

Q4: Which statement about prior information is TRUE?

- A) Frequentist inference formally incorporates it through $p(\theta)$
- B) It has no effect on the posterior, whatever the sample size
- C) Bayesian inference incorporates it explicitly through the prior $p(\theta)$
- D) Its influence grows as $n \to \infty$

Correct: C)

Q5: Based on the practice problems in this subject, what is the MoM estimator of $\theta$ for a Uniform(0, $\theta$) distribution?

- A) $\bar{X}$
- B) $2\bar{X}$
- C) $X_{(n)}$, the sample maximum
- D) $\bar{X}/2$

Correct: B)

Q6: How does prior information relate to the prior $p(\theta)$?

- A) They are unrelated
- B) The prior is computed from the observed data
- C) The prior is the inverse of the likelihood
- D) The prior $p(\theta)$ is how prior information enters the analysis

Correct: D)

Q7: What is a common pitfall when working with the likelihood?

- A) Using it to compute the MLE
- B) Multiplying it by the prior
- C) Confusing the likelihood $p(\text{data} \mid \theta)$ with the posterior $p(\theta \mid \text{data})$
- D) Evaluating it at the observed data

Correct: C)

Q8: What role does the marginal likelihood $p(\text{data})$ play in Bayes' theorem?

- A) It is the normalising constant in the denominator, making the posterior a proper distribution
- B) It is the prior over $\theta$
- C) It is the mode of the posterior
- D) It replaces the likelihood

Correct: A)

Practice Problems

  1. Derive the MoM estimator for a Uniform(0, $\theta$) distribution. (Hint: $E[X] = \theta/2$)

    Answer: $E[X] = \frac{\theta}{2}$. Setting this equal to $\bar{X}$ gives $\frac{\tilde{\theta}}{2} = \bar{X}$ → $\tilde{\theta}_{\text{MoM}} = 2\bar{X}$. Note: this can give an estimate smaller than the maximum observation, which is incoherent because every observation must lie in $[0, \theta]$. The MLE $\hat{\theta} = X_{(n)}$ (the sample maximum) avoids this problem.

  2. For a Beta(2, 2) prior and 7 heads in 10 flips, what is the MAP estimate?

    Answer: Posterior: Beta(2+7, 2+3) = Beta(9, 5). MAP (mode of Beta($\alpha$, $\beta$)): $\frac{\alpha-1}{\alpha+\beta-2} = \frac{9-1}{9+5-2} = \frac{8}{12} \approx 0.667$

  3. Why does MAP with a Gaussian prior correspond to $L_2$ regularisation?

    Answer: Up to an additive constant, $\log p(w \mid \text{data}) = \log p(\text{data} \mid w) + \log p(w)$. For a Gaussian prior $p(w) \propto \exp(-w^2/2\sigma^2)$, $\log p(w) = -w^2/2\sigma^2 + \text{const}$ → $L_2$ penalty. For a Laplace prior $p(w) \propto \exp(-|w|/b)$, $\log p(w) = -|w|/b + \text{const}$ → $L_1$ penalty. The MAP objective is therefore log-likelihood + log-prior = log-likelihood − regularisation term.

  4. What happens to a Bayesian posterior as $n \to \infty$?

    Answer: The posterior concentrates around the true parameter value (posterior consistency). The influence of the prior vanishes and the data dominates. Under regularity conditions, and for any prior with positive density near the true value, the posterior for large $n$ is approximately $N(\hat{\theta}_{\text{MLE}}, 1/(nI(\hat{\theta})))$ (Bernstein-von Mises theorem).

  5. If your prior is Beta(10, 10) and you observe 1 head in 1 flip, what is the posterior mean?

    Answer: Posterior: Beta(10+1, 10+0) = Beta(11, 10). Posterior mean: $\frac{11}{11+10} = \frac{11}{21} \approx 0.524$. Even though the MLE is 1.0, your strong prior (equivalent to 20 pseudo-observations, roughly fair) keeps the estimate near 0.5.


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 12-06-confidence-intervals.md