### 12.2 — Sampling and Sampling Distributions
Phase: Statistics
Prerequisites: 12-01-descriptive-statistics, 10-08-normal-gaussian-distribution, 11-06-limit-theorems
Learning Objectives
By the end of this subject, you will be able to:
- Distinguish between population parameters and sample statistics
- Explain the concept of a sampling distribution
- Compute the standard error of the mean
- Apply the Central Limit Theorem to construct sampling distributions
- Work with the t-distribution and F-distribution for small samples
Core Content
⚠️ CRITICAL: Population vs Sample
The single most important distinction in statistics:
| Concept | Population | Sample |
|---|---|---|
| Size | $N$ (usually unknown or infinite) | $n$ |
| Mean | $\mu$ (parameter) | $\bar{x}$ (statistic) |
| Variance | $\sigma^2$ (parameter) | $s^2$ (statistic) |
| Known? | Almost never | Always |
Parameters are fixed (but unknown) numbers describing the population. Statistics are random variables computed from samples — they vary from sample to sample.
What is a Sampling Distribution?
Take many samples of size $n$ from the same population. Compute a statistic (e.g. the mean) for each sample. The distribution of those statistics is the sampling distribution of that statistic.
Example: Roll a fair die 5 times, compute the mean. Do this 10,000 times. The histogram of those 10,000 means IS the sampling distribution of the mean for $n=5$.
Key insight: The statistic itself is a random variable — it has its own mean, variance, and distribution shape.
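The die-rolling example above can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the helper name `sample_means` and the fixed seed are illustrative choices, not part of the lesson.

```python
import random
import statistics
from collections import Counter

def sample_means(n, reps, seed=0):
    """Return `reps` sample means, each from `n` rolls of a fair six-sided die."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.randint(1, 6) for _ in range(n)) for _ in range(reps)]

means = sample_means(n=5, reps=10_000)

# The sampling distribution has its own centre and spread.
print(f"mean of sample means: {statistics.fmean(means):.3f}")   # close to 3.5
print(f"SD of sample means:   {statistics.pstdev(means):.3f}")  # close to sqrt((35/12)/5)

# Crude text histogram: sample means grouped into half-unit bins.
bins = Counter(round(m * 2) / 2 for m in means)
for value in sorted(bins):
    print(f"{value:4.1f} {'#' * (bins[value] // 100)}")
```

A single die roll is uniform on 1 to 6, yet the histogram of means is already mound-shaped and centred near 3.5.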
Sampling Distribution of the Sample Mean
For a population with mean $\mu$ and variance $\sigma^2$:
Mean of sample means: $E[\bar{X}] = \mu$. The sample mean is an unbiased estimator of the population mean.
Variance of sample means: $$\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$$
Standard error of the mean (SEM): $$\text{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$
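Where these formulas come from: a short derivation, assuming the observations $X_1, \dots, X_n$ are independent with common mean $\mu$ and variance $\sigma^2$ (the independence assumption is what lets the variances add).

$$E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{n\mu}{n} = \mu$$

$$\text{Var}(\bar{X}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\text{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

Taking the square root of the variance gives $\text{SE}(\bar{X}) = \sigma / \sqrt{n}$.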
⚠️ CRITICAL: The standard error is NOT the same as the standard deviation. SD describes spread of individual data points; SE describes precision of the sample mean as an estimator.
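To make the SD versus SE distinction concrete, the sketch below simulates repeated samples and reports both quantities. It assumes a normal population with $\mu = 100$ and $\sigma = 15$ purely for illustration; the function name `sd_and_se` is hypothetical.

```python
import math
import random
import statistics

def sd_and_se(n, reps=2_000, mu=100.0, sigma=15.0, seed=1):
    """Simulate `reps` samples of size `n`; return (average SD, empirical SE, theoretical SE)."""
    rng = random.Random(seed)
    samples = [[rng.gauss(mu, sigma) for _ in range(n)] for _ in range(reps)]
    # SD: typical spread of individual observations within one sample.
    avg_sd = statistics.fmean(statistics.stdev(s) for s in samples)
    # SE: spread of the sample means across repeated samples.
    se_empirical = statistics.stdev(statistics.fmean(s) for s in samples)
    return avg_sd, se_empirical, sigma / math.sqrt(n)

for n in (4, 25, 100):
    avg_sd, se_emp, se_theory = sd_and_se(n)
    print(f"n={n:3d}  SD ~ {avg_sd:5.2f}  SE (simulated) ~ {se_emp:5.2f}  sigma/sqrt(n) = {se_theory:5.2f}")
```

The SD column should stay near 15 whatever the sample size, while the SE column shrinks as $n$ grows.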
The Central Limit Theorem (Applied)
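The CLT states: For independent, identically distributed observations from ANY distribution with finite variance, the sampling distribution of the mean approaches a normal distribution as $n \to \infty$. For large $n$: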
$$\bar{X} \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$
Or equivalently: $$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \approx N(0, 1)$$
Practical rule of thumb: $n \geq 30$ is usually sufficient for the normal approximation to be reasonable (for symmetric distributions, even smaller $n$ works).
🚩 Common Pitfall: The CLT is about the distribution of the SAMPLE MEAN, not about the raw data. The raw data can be any shape; it's the sampling distribution that becomes normal.
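One way to watch the CLT at work is to track the skewness of the sampling distribution as $n$ grows. The sketch below assumes an Exponential(1) population (strongly right-skewed, skewness 2); for this population the mean of $n$ observations has skewness $2/\sqrt{n}$, so the simulated values should fall towards 0. The helper names are illustrative.

```python
import random
import statistics

def skewness(xs):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

def mean_skewness(n, reps=10_000, seed=2):
    """Skewness of the simulated sampling distribution of the mean for Exponential(1) data."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]
    return skewness(means)

results = {n: mean_skewness(n) for n in (1, 5, 30, 100)}
for n, skew in results.items():
    print(f"n={n:3d}  skewness of sample means ~ {skew:.2f}")
```

With $n = 1$ the "sample mean" is just a raw observation, so the first row shows the skewness of the data themselves. Some skewness is still visible at $n = 30$, which is why the rule of thumb is only a guide.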
The t-Distribution
When $\sigma$ is unknown (which is almost always the case in practice), we estimate it with $s$. The standardised statistic:
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
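follows a t-distribution with $n-1$ degrees of freedom (df), NOT a standard normal. This result is exact when the observations are independent draws from a normal population; for other populations it is an approximation that improves as $n$ grows.
Properties of the t-distribution:
- Bell-shaped and symmetric about 0
- Heavier tails than normal (accounts for uncertainty in estimating $\sigma$)
- As df $\to \infty$, the t-distribution approaches $N(0,1)$
- For df $\geq 30$, the t and normal are close but not identical (for example, $t_{0.025, 30} \approx 2.042$ versus $z_{0.025} = 1.96$)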
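The heavier tails can be seen by simulation. The sketch below draws normal samples, computes the t statistic for each, and counts how often $|t|$ exceeds the normal cutoff 1.96. It assumes a standard normal population; `tail_rate` is a hypothetical helper name.

```python
import math
import random

def tail_rate(n, cutoff=1.96, reps=20_000, seed=3):
    """Fraction of simulated t statistics with |t| > cutoff, for N(0, 1) data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = math.fsum(x) / n
        s = math.sqrt(math.fsum((xi - mean) ** 2 for xi in x) / (n - 1))
        t = mean / (s / math.sqrt(n))  # the true mu is 0
        hits += abs(t) > cutoff
    return hits / reps

rates = {n: tail_rate(n) for n in (5, 15, 30)}
for n, rate in rates.items():
    print(f"n={n:2d} (df={n - 1:2d})  P(|t| > 1.96) ~ {rate:.3f}   (standard normal: 0.050)")
```

If $t$ were standard normal, every row would be close to 0.05. For small $n$ the rate is noticeably higher, which is the extra tail mass that comes from estimating $\sigma$ with $s$.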
The F-Distribution
The ratio of two independent chi-squared variables (each divided by its df) follows an F-distribution. Used in ANOVA and for comparing variances.
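If $S_1^2$ and $S_2^2$ are the sample variances of two independent samples, of sizes $n_1$ and $n_2$, drawn from normal populations with variances $\sigma_1^2$ and $\sigma_2^2$: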
$$F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F_{n_1-1, n_2-1}$$
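A quick simulation can check this ratio against a known property of the F-distribution: for $d_2 > 2$, an $F_{d_1, d_2}$ variable has mean $d_2 / (d_2 - 2)$. The population means, standard deviations, and sample sizes below are arbitrary values chosen for illustration.

```python
import random
import statistics

def f_ratios(n1, n2, sigma1=2.0, sigma2=5.0, reps=10_000, seed=4):
    """Simulate F = (S1^2 / sigma1^2) / (S2^2 / sigma2^2) for two independent normal samples."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        s1_sq = statistics.variance([rng.gauss(0.0, sigma1) for _ in range(n1)])
        s2_sq = statistics.variance([rng.gauss(10.0, sigma2) for _ in range(n2)])
        out.append((s1_sq / sigma1**2) / (s2_sq / sigma2**2))
    return out

n1, n2 = 8, 12
ratios = f_ratios(n1, n2)
d2 = n2 - 1  # denominator degrees of freedom
print(f"simulated mean of F:        {statistics.fmean(ratios):.3f}")
print(f"theoretical mean d2/(d2-2): {d2 / (d2 - 2):.3f}")
```

Dividing each sample variance by its own population variance is what removes the scale, so the result depends only on the two degrees of freedom.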
Key Terms
- Central Limit Theorem
- Degrees of freedom
- F-distribution
- Parameter
- Sampling distribution
- Standard error
- Statistic
- t-distribution
Worked Examples
Example 1: Computing standard error
A population has $\sigma = 12$. You take a sample of $n = 36$. What is the standard error of the mean?
Solution:
$$\text{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{36}} = \frac{12}{6} = 2$$
If you quadruple the sample size to $n = 144$: $$\text{SE}(\bar{X}) = \frac{12}{\sqrt{144}} = \frac{12}{12} = 1$$
Quadrupling $n$ halves the standard error, because SE scales as $1/\sqrt{n}$.
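The same arithmetic as a small function, sketched here with a hypothetical name `standard_error`:

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean, sigma / sqrt(n), for n independent observations."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return sigma / math.sqrt(n)

# Each step quadruples n, so each step halves the standard error.
for n in (36, 144, 576):
    print(n, standard_error(12, n))
# prints:
# 36 2.0
# 144 1.0
# 576 0.5
```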
Example 2: CLT application
The time to complete a task is exponentially distributed with mean $\mu = 30$ minutes and $\sigma = 30$ minutes (for exponential, $\sigma = \mu$). You time 64 independent workers. What is the probability the sample mean exceeds 34 minutes?
Solution:
By the CLT, $\bar{X} \approx N(30, 30^2/64) = N(30, 14.0625)$.
$z = \frac{34 - 30}{\sqrt{14.0625}} = \frac{4}{3.75} \approx 1.067$
$P(\bar{X} > 34) = P(Z > 1.067) \approx 1 - 0.857 = 0.143$
There's about a 14.3% chance.
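The calculation can be checked with the standard library's `statistics.NormalDist`, and the CLT approximation itself can be compared against a direct simulation of 64 exponential task times. This is a sketch; the simulation size and seed are arbitrary.

```python
import math
import random
import statistics
from statistics import NormalDist

mu, sigma, n = 30.0, 30.0, 64
se = sigma / math.sqrt(n)  # 3.75

# CLT approximation: treat the sample mean as N(mu, se^2).
clt_approx = 1 - NormalDist(mu, se).cdf(34)
print(f"CLT approximation: {clt_approx:.4f}")

# Direct simulation: draw 64 exponential times with mean 30, many times over.
rng = random.Random(5)
reps = 20_000
exceed = sum(
    statistics.fmean(rng.expovariate(1 / mu) for _ in range(n)) > 34
    for _ in range(reps)
)
simulated = exceed / reps
print(f"Simulated:         {simulated:.4f}")
```

The two numbers should agree to within about a percentage point. Any small gap reflects the residual right skew of the exponential population at $n = 64$, plus simulation noise.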
Example 3: t-distribution critical values
For a sample of size $n = 16$, find the critical t-value for a 95% confidence interval: $t_{0.025, 15}$ (two-tailed, $\alpha = 0.05$).
From t-tables or software: $t_{0.025, 15} \approx 2.131$
Compare with the normal critical value $z_{0.025} = 1.96$. The t-value is larger because the t-distribution has heavier tails — we need to go further out to capture 95% of the probability.
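If no t-table or statistics package is at hand, the critical value can be recovered numerically. The sketch below integrates the t density with Simpson's rule and inverts the CDF by bisection, using only the standard library. The function names and step counts are illustrative choices, and the search assumes the quantile lies below 100.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with `df` degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|], using symmetry about 0 (steps must be even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(p, df, lo=0.0, hi=100.0):
    """Quantile by bisection: the t with CDF(t) = p, for 0.5 <= p < 1."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(f"t(0.975, df=15) = {t_critical(0.975, 15):.3f}")
print(f"z(0.975)        = 1.960")
```

The first line should reproduce the table value of about 2.131 quoted above.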
Quiz
Q1: For a population with variance $\sigma^2$, what is the variance of the sample mean $\bar{X}$ based on $n$ independent observations?
A) $\sigma^2$ B) $\sigma / \sqrt{n}$ C) $\sigma^2 / n$ D) $n\sigma^2$
Correct: C)
- If you chose A: This is incorrect. $\sigma^2$ is the variance of a single observation; averaging $n$ observations reduces it.
- If you chose B: This is incorrect. $\sigma / \sqrt{n}$ is the standard error, which is the square root of the variance of $\bar{X}$.
- If you chose C: Correct! $\text{Var}(\bar{X}) = \sigma^2 / n$.
- If you chose D: This is incorrect. $n\sigma^2$ is the variance of the sum of the observations, not of their mean.
Q2: Which formula gives the standard error of the mean?
A) $\sigma^2 / n$ B) $\sigma / n$ C) $\sigma / \sqrt{n}$ D) $\sqrt{n} / \sigma$
Correct: C)
- If you chose A: This is incorrect. $\sigma^2 / n$ is the variance of $\bar{X}$; the standard error is its square root.
- If you chose B: This is incorrect. The standard error shrinks with $\sqrt{n}$, not with $n$.
- If you chose C: Correct! $\text{SE}(\bar{X}) = \sigma / \sqrt{n}$.
- If you chose D: This is incorrect. This expression grows with $n$, whereas the standard error must shrink as the sample gets larger.
Q3: What is a population parameter?
A) A quantity computed from the sample data B) A random variable that changes from sample to sample C) A value that is always known to the analyst D) A fixed but usually unknown number describing the population
Correct: D)
- If you chose A: This is incorrect. A quantity computed from sample data is a statistic, such as $\bar{x}$ or $s^2$.
- If you chose B: This is incorrect. Statistics vary from sample to sample; parameters do not.
- If you chose C: This is incorrect. Parameters such as $\mu$ and $\sigma^2$ are almost never known, which is why we estimate them.
- If you chose D: Correct! Parameters are fixed (but unknown) numbers describing the population.
Q4: Which statement about sample statistics is TRUE?
A) A statistic is a random variable that varies from sample to sample B) A statistic is a fixed number describing the population C) A statistic always equals the parameter it estimates D) A statistic has no distribution of its own
Correct: A)
- If you chose A: Correct! A statistic is computed from a sample, so a different sample gives a different value.
- If you chose B: This is incorrect. A fixed number describing the population is a parameter.
- If you chose C: This is incorrect. A statistic estimates the parameter and generally differs from it because of sampling variability.
- If you chose D: This is incorrect. Every statistic has a sampling distribution with its own mean, variance, and shape.
Q5: In Worked Example 1 ($\sigma = 12$), what happens to the standard error when the sample size is quadrupled from 36 to 144?
A) It doubles, from 2 to 4 B) It falls to a quarter, from 2 to 0.5 C) It halves, from 2 to 1 D) It stays at 2
Correct: C)
- If you chose A: This is incorrect. A larger sample makes the mean more precise, so the standard error cannot increase.
- If you chose B: This is incorrect. The standard error scales as $1/\sqrt{n}$, not $1/n$, so quadrupling $n$ only halves it.
- If you chose C: Correct! $12/\sqrt{36} = 2$ and $12/\sqrt{144} = 1$.
- If you chose D: This is incorrect. The standard error depends on $n$ through $\sigma / \sqrt{n}$.
Q6: How is a statistic related to its sampling distribution?
A) The sampling distribution is the distribution of the raw data within one sample B) The sampling distribution is the distribution of the whole population C) The two are unrelated D) The sampling distribution is the distribution of the statistic across repeated samples of the same size
Correct: D)
- If you chose A: This is incorrect. The spread of raw data within one sample is described by the sample standard deviation, not by the sampling distribution.
- If you chose B: This is incorrect. The population distribution describes individual values; the sampling distribution describes a statistic computed from samples.
- If you chose C: This is incorrect. Because a statistic is a random variable, it necessarily has a distribution, and that distribution is its sampling distribution.
- If you chose D: Correct! Take many samples of size $n$, compute the statistic for each, and the distribution of those values is the sampling distribution.
Q7: Which quantity follows an F-distribution?
A) The sample mean of a large sample B) The standardised sample mean when $\sigma$ is estimated by $s$ C) The ratio of two scaled sample variances from independent normal samples D) The sum of squared standard normal variables
Correct: C)
- If you chose A: This is incorrect. By the CLT, the sample mean of a large sample is approximately normal.
- If you chose B: This is incorrect. That statistic follows a t-distribution with $n-1$ degrees of freedom.
- If you chose C: Correct! $\frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F_{n_1-1, n_2-1}$.
- If you chose D: This is incorrect. A sum of squared standard normal variables is chi-squared; the F-distribution is a ratio of two such variables, each divided by its df.
Q8: When does the distinction between population and sample matter?
A) Only when the whole population has been measured B) Whenever a sample statistic such as $\bar{x}$ is used to draw conclusions about a population parameter such as $\mu$ C) Only in pure mathematics contexts D) Never, because $\bar{x}$ and $\mu$ are the same quantity
Correct: B)
- If you chose A: This is incorrect. If the whole population were measured, the parameters would be known and no inference would be needed.
- If you chose B: Correct! Inference means using a statistic that varies from sample to sample to learn about a fixed, unknown parameter.
- If you chose C: This is incorrect. The distinction underlies every practical estimate, confidence interval, and hypothesis test.
- If you chose D: This is incorrect. $\bar{x}$ is a statistic that varies from sample to sample; $\mu$ is a fixed parameter.
Practice Problems
- A population has $\mu = 100$ and $\sigma = 15$. For a sample of $n = 25$, find $E[\bar{X}]$ and $\text{SE}(\bar{X})$.
Answer:
$E[\bar{X}] = \mu = 100$ and $\text{SE}(\bar{X}) = \frac{15}{\sqrt{25}} = \frac{15}{5} = 3$.
- True or False: The Central Limit Theorem guarantees that raw data from a large sample will be normally distributed.
Answer:
**False.** The CLT applies to the sampling distribution of the mean, not to the raw data. If you sample from a uniform distribution, the raw data remains uniform regardless of sample size; only the distribution of sample means becomes approximately normal.
- For $n = 10$, what distribution does $t = (\bar{X} - \mu)/(s / \sqrt{n})$ follow? How many degrees of freedom?
Answer:
A t-distribution with $n - 1 = 9$ degrees of freedom (assuming the data come from a normal population).
- Why does the standard error decrease as sample size increases? Provide an intuitive explanation.
Answer:
Larger samples average out individual fluctuations. An extreme observation has less influence on the mean when it is averaged with many others, so the sample mean tends to land closer to $\mu$ as $n$ grows. Mathematically, $\text{Var}(\bar{X}) = \sigma^2/n$, so the variance of the mean shrinks in proportion to $1/n$ and the standard error in proportion to $1/\sqrt{n}$.
- Sample A ($n=81$, $\bar{x}=52$, $s=9$) and Sample B ($n=81$, $\bar{x}=52$, $s=9$) produce the same sample mean. Which would be more convincing that $\mu \neq 50$?
Answer:
Neither: they are identical, with the same $n$, the same $\bar{x}$, and the same $s$. The test statistic $t = (52-50)/(9/\sqrt{81}) = 2/1 = 2$ is the same for both, as is the p-value. What matters is the combination of effect size and sample size, captured in the standard error.
Summary
Key takeaways:
- Parameters describe populations; statistics describe samples and are random variables
- The sampling distribution of a statistic describes how that statistic varies across repeated samples
- $\text{SE}(\bar{X}) = \sigma / \sqrt{n}$ — precision scales with $\sqrt{n}$, not linearly
- The CLT says $\bar{X} \approx N(\mu, \sigma^2/n)$ for large $n$, regardless of the population shape
- When $\sigma$ is estimated by $s$, use the t-distribution ($n-1$ df) instead of normal
- The F-distribution arises from ratios of variances and is central to ANOVA
Pitfalls
- Confusing standard deviation with standard error: The standard deviation (σ or s) measures the spread of individual data points. The standard error (σ/√n) measures the precision of the sample mean as an estimator. They have different meanings and different formulas — do not use them interchangeably.
- Believing the CLT makes raw data normal: The Central Limit Theorem applies to the sampling distribution of the sample mean, not to the individual observations. If you sample from a uniform distribution, the data remains uniform regardless of sample size — only the distribution of the sample mean becomes approximately normal.
- Using z-procedures when σ is unknown: In practice σ is almost always unknown. When σ is estimated by s, the standardized statistic follows a t-distribution, not a standard normal. For small n, the difference in tail probabilities is substantial — using z instead of t produces confidence intervals that are too narrow and hypothesis tests with inflated Type I error.
- Treating n ≥ 30 as a guarantee of normality: The rule of thumb n ≥ 30 is approximate and depends on the population shape. Heavily skewed or heavy-tailed populations may require much larger samples for the CLT approximation to be adequate. Always examine the shape of the data (skewness, outliers), not just the sample size.
- Assuming the t-distribution and normal are interchangeable for all n: With df = 5, the t critical value t₀.₀₂₅ ≈ 2.571 vs z₀.₀₂₅ = 1.96 — a 31% difference. The t-distribution has heavier tails to account for uncertainty in estimating σ. Only at very large df (50+) do they essentially coincide.
Next Steps
Next up: 12-03-point-estimation.md