### 12.2 — Sampling and Sampling Distributions

Phase: Statistics

Prerequisites: 12-01-descriptive-statistics, 10-08-normal-gaussian-distribution, 11-06-limit-theorems

Learning Objectives

By the end of this subject, you will be able to:

  1. Distinguish between population parameters and sample statistics
  2. Explain the concept of a sampling distribution
  3. Compute the standard error of the mean
  4. Apply the Central Limit Theorem to construct sampling distributions
  5. Work with the t-distribution and F-distribution for small samples

Core Content

⚠️ CRITICAL: Population vs Sample

The single most important distinction in statistics:

| Concept | Population | Sample |
|---|---|---|
| Size | $N$ (usually unknown or infinite) | $n$ |
| Mean | $\mu$ (parameter) | $\bar{x}$ (statistic) |
| Variance | $\sigma^2$ (parameter) | $s^2$ (statistic) |
| Known? | Almost never | Always |

Parameters are fixed (but unknown) numbers describing the population. Statistics are random variables computed from samples — they vary from sample to sample.

What is a Sampling Distribution?

Take many samples of size $n$ from the same population. Compute a statistic (e.g. the mean) for each sample. The distribution of those statistics is the sampling distribution of that statistic.

Example: Roll a fair die 5 times and compute the mean. Repeat this 10,000 times. The histogram of those 10,000 means approximates the sampling distribution of the mean for $n=5$; the approximation improves as the number of repetitions grows.

Key insight: The statistic itself is a random variable — it has its own mean, variance, and distribution shape.
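The die-rolling example can be sketched as a short simulation. This is a minimal illustration using only the Python standard library; the function name `simulate_sample_means` and the seed are assumptions made for the sketch. For a fair die, $\mu = 3.5$ and $\sigma^2 = 35/12$, so the means should centre near 3.5 with a spread near $\sqrt{35/12}/\sqrt{5} \approx 0.76$.

```python
import random
import statistics

def simulate_sample_means(n=5, num_samples=10_000, seed=0):
    """Return num_samples means, each computed from n rolls of a fair die."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    return [
        statistics.fmean(rng.randint(1, 6) for _ in range(n))
        for _ in range(num_samples)
    ]

means = simulate_sample_means()
mean_of_means = statistics.fmean(means)   # expected to be close to mu = 3.5
sd_of_means = statistics.pstdev(means)    # expected to be close to 0.76

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"SD of sample means:   {sd_of_means:.3f}")
```

Plotting a histogram of `means` would show the shape of the sampling distribution; the two summary numbers already show that the statistic has its own centre and spread.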

Sampling Distribution of the Sample Mean

For independent observations drawn from a population with mean $\mu$ and variance $\sigma^2$:

Mean of sample means: $E[\bar{X}] = \mu$. The sample mean is an unbiased estimator of the population mean.

Variance of sample means: $$\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$$
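One way to see where the factor $1/n$ comes from, assuming the observations $X_1, \dots, X_n$ are independent with common variance $\sigma^2$ (independence is what allows the variances to add):

$$\text{Var}(\bar{X}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\text{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$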

Standard error of the mean (SEM): $$\text{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

⚠️ CRITICAL: The standard error is NOT the same as the standard deviation. SD describes spread of individual data points; SE describes precision of the sample mean as an estimator.
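To make the SD versus SE distinction concrete, here is a minimal Python sketch. The normal population with mean 100 and SD 15, the seed, and the helper name `sd_and_se` are illustrative assumptions. Because $\sigma$ is treated as unknown here, the SE is estimated as $s/\sqrt{n}$.

```python
import math
import random
import statistics

def sd_and_se(sample):
    """Return (sample SD, estimated standard error of the mean)."""
    s = statistics.stdev(sample)            # spread of individual observations
    return s, s / math.sqrt(len(sample))    # precision of the sample mean

rng = random.Random(1)  # arbitrary seed for reproducibility
for n in (25, 100, 400):
    # Hypothetical population: normal with mean 100 and SD 15
    sample = [rng.gauss(100, 15) for _ in range(n)]
    s, se = sd_and_se(sample)
    print(f"n={n:4d}  SD={s:6.2f}  SE={se:5.2f}")
```

As $n$ grows, the SD should hover around the population value of 15, while the SE keeps shrinking.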

The Central Limit Theorem (Applied)

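The CLT states: for independent, identically distributed observations from ANY distribution with finite variance, the sampling distribution of the mean approaches a normal distribution as $n \to \infty$. For large $n$: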

$$\bar{X} \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$

Or equivalently: $$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \approx N(0, 1)$$

Practical rule of thumb: $n \geq 30$ is usually sufficient for the normal approximation to be reasonable. For symmetric distributions an even smaller $n$ works; heavily skewed or heavy-tailed distributions can need considerably more.

🚩 Common Pitfall: The CLT is about the distribution of the SAMPLE MEAN, not about the raw data. The raw data can be any shape; it's the sampling distribution that becomes normal.
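A small simulation can illustrate this pitfall. The sketch below assumes an exponential population with mean 1 (strongly right-skewed) and tracks the skewness of the sample means as $n$ grows. The helper names, seed, and sample sizes are choices made for illustration only.

```python
import random
import statistics

def skewness(values):
    """Mean cubed z-score: about 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - m) / sd) ** 3 for v in values)

def sample_means(n, num_samples=5_000, seed=2):
    """Means of num_samples samples of size n from an exponential population (mean 1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]

skew_by_n = {n: skewness(sample_means(n)) for n in (1, 5, 30, 100)}
for n, g in skew_by_n.items():
    print(f"n={n:3d}  skewness of sample means = {g:5.2f}")
```

With $n = 1$ the "sample means" are just the raw data, so the skewness stays near the exponential's value of 2. As $n$ increases it should fall towards 0, which is the sampling distribution becoming more normal while the raw data stay skewed.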

The t-Distribution

When $\sigma$ is unknown (which is almost always the case in practice), we estimate it with $s$. The standardised statistic:

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$

follows a t-distribution with $n-1$ degrees of freedom (df), NOT a standard normal. This result is exact when the population is normal and approximate otherwise.

Properties of the t-distribution:

- Bell-shaped and symmetric about 0
- Heavier tails than normal (accounts for uncertainty in estimating $\sigma$)
- As df $\to \infty$, the t-distribution approaches $N(0,1)$
- For df $\geq 30$, the t and normal are nearly indistinguishable
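The heavier tails can be checked by simulation. The sketch below assumes standard normal data and counts how often $|t|$ exceeds the normal cutoff 1.96; the function names, seed, and sample sizes are illustrative assumptions.

```python
import math
import random

def t_statistic(sample, mu):
    """Standardise the sample mean using the sample SD s in place of sigma."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return (mean - mu) / (s / math.sqrt(n))

def tail_fraction(n, cutoff=1.96, num_samples=20_000, seed=3):
    """Fraction of simulated t-statistics with |t| > cutoff, for N(0, 1) data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(t_statistic(sample, 0.0)) > cutoff:
            hits += 1
    return hits / num_samples

small_n = tail_fraction(5)    # 4 df: noticeably more than 5% beyond +/-1.96
large_n = tail_fraction(50)   # 49 df: close to the normal value of 5%
print(f"n=5:  P(|t| > 1.96) ~ {small_n:.3f}")
print(f"n=50: P(|t| > 1.96) ~ {large_n:.3f}")
```

For $n = 5$ the fraction should come out near 12% rather than 5%, which is why normal critical values are too optimistic for small samples.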

The F-Distribution

The ratio of two independent chi-squared variables (each divided by its df) follows an F-distribution. Used in ANOVA and for comparing variances.

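If $S_1^2$ and $S_2^2$ are sample variances from independent samples of sizes $n_1$ and $n_2$, drawn from normal populations with variances $\sigma_1^2$ and $\sigma_2^2$: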

$$F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F_{n_1-1, n_2-1}$$
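As a rough check of this result, the sketch below simulates the variance ratio for two independent normal samples. The sample sizes ($n_1 = 6$, $n_2 = 11$), the seed, and the function name `variance_ratio` are hypothetical choices. It relies on the standard fact that an $F_{d_1, d_2}$ variable has mean $d_2/(d_2 - 2)$ when $d_2 > 2$.

```python
import random
import statistics

def variance_ratio(n1, n2, rng, sigma1=1.0, sigma2=1.0):
    """One draw of F = (S1^2 / sigma1^2) / (S2^2 / sigma2^2) from two normal samples."""
    s1_sq = statistics.variance([rng.gauss(0.0, sigma1) for _ in range(n1)])
    s2_sq = statistics.variance([rng.gauss(0.0, sigma2) for _ in range(n2)])
    return (s1_sq / sigma1**2) / (s2_sq / sigma2**2)

rng = random.Random(4)               # arbitrary seed
n1, n2 = 6, 11                       # hypothetical sizes: 5 and 10 degrees of freedom
ratios = [variance_ratio(n1, n2, rng) for _ in range(20_000)]

simulated_mean = statistics.fmean(ratios)
theoretical_mean = (n2 - 1) / (n2 - 3)   # d2 / (d2 - 2) with d2 = n2 - 1

print(f"simulated mean of F:   {simulated_mean:.3f}")
print(f"theoretical mean of F: {theoretical_mean:.3f}")
```

The simulated ratios are right-skewed and never negative, which matches the shape of an F-distribution.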



Key Terms

Worked Examples

Example 1: Computing standard error

A population has $\sigma = 12$. You take a sample of $n = 36$. What is the standard error of the mean?

Solution:

$$\text{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{36}} = \frac{12}{6} = 2$$

If you quadruple the sample size to $n = 144$: $$\text{SE}(\bar{X}) = \frac{12}{\sqrt{144}} = \frac{12}{12} = 1$$

Quadrupling $n$ halves the standard error, because SE scales as $1/\sqrt{n}$. In general, cutting the SE by a factor of $k$ requires $k^2$ times as many observations.
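The arithmetic in this example can be wrapped in two small helpers. The names `standard_error` and `required_sample_size` are made up for this sketch, and the second function simply inverts $\text{SE} = \sigma/\sqrt{n}$ to give $n = (\sigma/\text{SE})^2$, rounded up.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean for known population SD sigma."""
    return sigma / math.sqrt(n)

def required_sample_size(sigma, target_se):
    """Smallest n whose standard error is at most target_se."""
    return math.ceil((sigma / target_se) ** 2)

print(standard_error(12, 36))          # 2.0
print(standard_error(12, 144))         # 1.0
print(required_sample_size(12, 0.5))   # 576
```

Halving the SE again, from 1 to 0.5, takes another fourfold increase, from 144 to 576 observations.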

Example 2: CLT application

The time to complete a task is exponentially distributed with mean $\mu = 30$ minutes and $\sigma = 30$ minutes (for exponential, $\sigma = \mu$). You time 64 independent workers. What is the probability the sample mean exceeds 34 minutes?

Solution:

By the CLT, $\bar{X} \approx N(30, 30^2/64) = N(30, 14.0625)$.

$z = \frac{34 - 30}{\sqrt{14.0625}} = \frac{4}{3.75} \approx 1.067$

$P(\bar{X} > 34) = P(Z > 1.067) \approx 1 - 0.857 = 0.143$

There's about a 14.3% chance.
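The calculation in Example 2 can be reproduced with Python's standard library and compared against a direct simulation of exponential task times. The seed and the number of repetitions are arbitrary choices for this sketch. The simulated value is itself only an estimate, so small differences from the normal approximation are expected.

```python
import math
import random
from statistics import NormalDist

mu, sigma, n = 30, 30, 64                   # values from Example 2
se = sigma / math.sqrt(n)                   # 30 / 8 = 3.75
z = (34 - mu) / se                          # about 1.067
p_normal = 1 - NormalDist(mu, se).cdf(34)   # CLT (normal) approximation

# Simulation check: draw many samples of 64 exponential times and count
# how often the sample mean exceeds 34 minutes.
rng = random.Random(5)
reps = 20_000
exceed = sum(
    sum(rng.expovariate(1 / mu) for _ in range(n)) / n > 34
    for _ in range(reps)
)
p_simulated = exceed / reps

print(f"z = {z:.3f}")
print(f"normal approximation: {p_normal:.3f}")
print(f"simulated:            {p_simulated:.3f}")
```

Note that `expovariate` takes the rate $1/\mu$, not the mean.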

Example 3: t-distribution critical values

For a sample of size $n = 16$, find the critical t-value for a 95% confidence interval: $t_{0.025, 15}$ (two-tailed, $\alpha = 0.05$).

From t-tables or software: $t_{0.025, 15} \approx 2.131$

Compare with the normal critical value $z_{0.025} = 1.96$. The t-value is larger because the t-distribution has heavier tails — we need to go further out to capture 95% of the probability.
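In practice the critical value comes from tables or a statistics library (for example, SciPy provides `scipy.stats.t.ppf`). As a dependency-free sketch, the value can also be recovered by integrating the t density numerically and searching for the cutoff by bisection. The function names, step count, and search bounds below are assumptions made for illustration, not a recommended production method.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: Simpson's rule on [0, x] plus 0.5 by symmetry."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(upper_tail, df):
    """Find t with P(T > t) = upper_tail by bisection on [0, 100]."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, df) > upper_tail:
            lo = mid   # tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

t_crit = t_critical(0.025, 15)
print(round(t_crit, 3))   # 2.131
```

Calling `t_critical(0.025, df)` with larger `df` should give values that move down towards 1.96, in line with the comparison above.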



Quiz

Q1: What does the concept of variance primarily refer to in this subject?

- A) A computational error related to variance
- B) A visual representation of variance
- C) The definition and application of variance
- D) A historical anecdote about variance

Correct: C)

Q2: Which of the following is the key formula discussed in this subject?

- A) An unrelated formula from a different topic
- B) A simplified version of $\bar{x}$...
- C) $\bar{x}$
- D) The inverse operation of the formula in question

Correct: C)

Q3: What is the primary role of parameters?

- A) They are primarily a historical notation system
- B) They are used only in advanced research contexts
- C) They replace all other methods in this domain
- D) They are fixed (but usually unknown) numbers that describe a population

Correct: D)

Q4: Which statement about statistics is TRUE?

- A) Statistics is a fundamental concept covered in this subject
- B) Statistics is mentioned only as a historical footnote
- C) Statistics is an advanced topic beyond this subject's scope
- D) Statistics is not related to this subject

Correct: A)

Q5: Which of the following correctly summarises a key takeaway of this subject?

- A) The inverse of the correct answer
- B) A different result from a common mistake
- C) Parameters describe populations; statistics describe samples
- D) An unrelated numerical value

Correct: C)

Q6: How is a sample statistic related to the idea of a random variable?

- A) A statistic is a fixed constant, so it is never a random variable
- B) A statistic is the inverse of a random variable
- C) Statistics and random variables are completely unrelated topics
- D) A statistic is itself a random variable, with its own mean, variance, and distribution shape

Correct: D)

Q7: What is a common pitfall when working with the F-distribution?

- A) The F-distribution is always computed the same way in all contexts
- B) The main error with the F-distribution is using it when it is not needed
- C) A common mistake is confusing the F-distribution with a similar concept
- D) The F-distribution has no common misconceptions

Correct: C)

Q8: When should you apply the population vs sample distinction?

- A) Avoid it unless explicitly instructed
- B) Apply it to solve problems in this subject's domain
- C) Use it only in pure mathematics contexts
- D) It is not practically useful

Correct: B)

Practice Problems

  1. A population has $\mu = 100$ and $\sigma = 15$. For a sample of $n = 25$, find $E[\bar{X}]$ and $\text{SE}(\bar{X})$.

    Answer: $E[\bar{X}] = \mu = 100$ and $\text{SE}(\bar{X}) = \frac{15}{\sqrt{25}} = \frac{15}{5} = 3$.

  2. True or False: The Central Limit Theorem guarantees that raw data from a large sample will be normally distributed.

    Answer: **False.** The CLT applies to the sampling distribution of the mean, not to the raw data. If you sample from a uniform distribution, the raw data remains uniform regardless of sample size; only the distribution of sample means becomes approximately normal.

  3. For $n = 10$, what distribution does $t = (\bar{X} - \mu)/(s / \sqrt{n})$ follow? How many degrees of freedom?

    Answer: A t-distribution with $n - 1 = 9$ degrees of freedom (exactly so when the population is normal).

  4. Why does the standard error decrease as sample size increases? Provide an intuitive explanation.

    Answer: Larger samples average out individual fluctuations. An extreme observation has less influence on the mean when averaged over many observations, so the sample mean becomes more precise (tends to fall closer to $\mu$) with larger $n$. Mathematically, $\text{Var}(\bar{X}) = \sigma^2/n$, so the variance of the sample mean shrinks in proportion to $1/n$ and the standard error in proportion to $1/\sqrt{n}$.

  5. Sample A ($n=81$, $\bar{x}=52$, $s=9$) and Sample B ($n=81$, $\bar{x}=52$, $s=9$) produce the same sample mean. Which would be more convincing that $\mu \neq 50$?

    Answer: They're identical: same $n$, same $\bar{x}$, same $s$. The test statistic $t = (52-50)/(9/\sqrt{81}) = 2/1 = 2$ would be the same, as would the p-value. What matters is the combination of effect size and sample size, captured in the standard error.


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 12-03-point-estimation.md