
### 12.7 — Hypothesis Testing (Basics)

Phase: Statistics

Prerequisites: 12-06-confidence-intervals, 12-02-sampling-sampling-distributions

Learning Objectives

By the end of this subject, you will be able to:

Formulate null and alternative hypotheses
Distinguish between Type I and Type II errors
Interpret p-values correctly and state the decision rule
Apply one-sided and two-sided tests appropriately
Compute and interpret statistical power

Core Content

The Hypothesis Testing Framework

1. State the hypotheses:
   - $H_0$ (null): the "status quo" or "no effect" claim
   - $H_a$ or $H_1$ (alternative): what we suspect might be true
2. Choose a significance level $\alpha$ (typically 0.05): the probability of rejecting $H_0$ when it is true.
3. Compute the test statistic from the data.
4. Compute the p-value: the probability of observing a test statistic at least as extreme as the one observed, assuming $H_0$ is true.
5. Decide: reject $H_0$ if $p \leq \alpha$; fail to reject $H_0$ if $p > \alpha$.
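The steps above can be sketched in a few lines of code. This is a minimal illustration, not a reference implementation: it assumes a one-sample z-test with a known population standard deviation (a simplification, since the worked examples later use the t distribution), and the function name `z_test` and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha=0.05, alternative="two-sided"):
    """One-sample z-test (sigma assumed known). Returns (z, p_value, decision)."""
    # Step 3: test statistic, the distance from mu0 in standard errors.
    z = (sample_mean - mu0) / (sigma / sqrt(n))

    # Step 4: p-value, the probability of a statistic at least this extreme under H0.
    std_normal = NormalDist()
    if alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

    # Step 5: decision rule.
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    return z, p_value, decision


# Hypothetical data: H0: mu <= 100 vs Ha: mu > 100 (steps 1 and 2 are choices, not code).
z, p, decision = z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100,
                        alternative="greater")
print(f"z = {z:.2f}, p = {p:.4f}, decision: {decision}")
```

Note that steps 1 and 2 appear only as arguments: the hypotheses and $\alpha$ are fixed before the data are examined.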

⚠️ CRITICAL: p-value Misconceptions

A p-value is NOT:

- The probability that $H_0$ is true
- The probability that $H_a$ is false
- The probability the result is due to chance
- A measure of effect size or practical importance

A p-value IS: $P(\text{data at least as extreme} \mid H_0 \text{ is true})$

🚩 Common Pitfall: Failing to reject $H_0$ does NOT prove $H_0$ is true. "Absence of evidence is not evidence of absence." A non-significant result may just mean your sample was too small.
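A small simulation can make both points concrete. The sketch below is illustrative only: it assumes normally distributed data with known $\sigma$, a two-sided z-test, and hypothetical parameter values; `simulate_rejection_rate` is a made-up helper name.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_rejection_rate(true_mean, mu0=0.0, sigma=1.0, n=30,
                            alpha=0.05, trials=5000, seed=0):
    """Fraction of simulated samples for which a two-sided z-test rejects H0: mu = mu0."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / sqrt(n))
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value <= alpha:
            rejections += 1
    return rejections / trials


# H0 is true: p <= alpha should occur in roughly a fraction alpha of samples.
rate_under_h0 = simulate_rejection_rate(true_mean=0.0)

# H0 is false (small real effect, small sample): most samples still fail to reject.
rate_small_effect = simulate_rejection_rate(true_mean=0.2)

print(f"rejection rate when H0 is true:  {rate_under_h0:.3f}")
print(f"rejection rate when mu = 0.2:    {rate_small_effect:.3f}")
```

In the first run the p-value is computed under $H_0$ and falls below $\alpha$ about 5% of the time, which says nothing about the probability that $H_0$ is true. In the second run a real effect exists, yet most samples produce a non-significant result because $n$ is small.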

Type I and Type II Errors

| Decision | $H_0$ true | $H_0$ false |
|---|---|---|
| Reject $H_0$ | Type I error (probability $\alpha$) | Correct (power $= 1-\beta$) |
| Fail to reject $H_0$ | Correct (probability $1-\alpha$) | Type II error (probability $\beta$) |

Type I error ($\alpha$): False positive — detecting an effect that isn't there
Type II error ($\beta$): False negative — missing a real effect

Trade-off: Decreasing $\alpha$ (harder to reject) increases $\beta$ (harder to detect real effects), for a fixed sample size.
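The trade-off can be tabulated directly for a simple case. The sketch below assumes an upper-tailed z-test in which the true effect sits `delta` standard errors above the null value; the value `delta = 2.5` and the function name `beta_one_sided_z` are hypothetical choices for illustration.

```python
from statistics import NormalDist


def beta_one_sided_z(alpha, delta):
    """Type II error rate of an upper-tailed z-test when the true shift is delta standard errors."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Under the alternative the statistic is N(delta, 1); we fail to reject when it is <= z_crit.
    return std_normal.cdf(z_crit - delta)


delta = 2.5  # hypothetical effect, in standard errors; fixed because n is fixed
betas = {alpha: beta_one_sided_z(alpha, delta) for alpha in (0.10, 0.05, 0.01)}

for alpha, beta in betas.items():
    print(f"alpha = {alpha:.2f}  ->  beta = {beta:.3f}")
```

With the sample size held fixed, each reduction in $\alpha$ raises the critical value and therefore raises $\beta$.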

One-Sided vs Two-Sided Tests

Two-sided: $H_0: \mu = \mu_0$ vs $H_a: \mu \neq \mu_0$

- Use when any deviation from $H_0$ is interesting
- Rejection region split between both tails
- p-value = $2 \cdot P(T > |t_{\text{obs}}|)$ (double the one-sided tail)

One-sided: $H_0: \mu \leq \mu_0$ vs $H_a: \mu > \mu_0$

- Use when only one direction matters
- Rejection region in one tail only
- p-value = $P(T > t_{\text{obs}})$ (upper tail)
- More powerful for detecting effects in the specified direction because all of $\alpha$ is in one tail

🚩 Common Pitfall: Switching to a one-sided test AFTER seeing the data because the two-sided test wasn't significant. This is p-hacking and inflates the Type I error rate.
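To see how the two p-values relate and how much post-hoc switching can cost, here is a rough sketch. It uses the standard normal as a stand-in for the reference distribution of $T$ (an assumption made to keep the code dependency-free), and the observed statistic `z_obs = 1.80` is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_upper(z):
    """One-sided (upper-tail) p-value."""
    return 1 - STD_NORMAL.cdf(z)


def p_two_sided(z):
    """Two-sided p-value: double the tail beyond |z|."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


z_obs = 1.80  # hypothetical observed statistic
print(f"one-sided p = {p_upper(z_obs):.4f}, two-sided p = {p_two_sided(z_obs):.4f}")

alpha = 0.05
z_one = STD_NORMAL.inv_cdf(1 - alpha)      # one-sided critical value
z_two = STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided critical value

# Strategy A: one-sided test if the effect is in the predicted (upper) direction,
# two-sided test otherwise. Under H0 this rejects when Z > z_one or Z < -z_two.
rate_fallback = p_upper(z_one) + STD_NORMAL.cdf(-z_two)

# Strategy B: pick the tail after seeing the sign of the effect.
# Under H0 this rejects whenever |Z| > z_one.
rate_pick_tail = p_two_sided(z_one)

print(f"nominal alpha:             {alpha:.3f}")
print(f"strategy A false positive: {rate_fallback:.3f}")
print(f"strategy B false positive: {rate_pick_tail:.3f}")
```

Under these assumptions the false positive rate rises from the nominal 5% to 7.5% for strategy A and to 10% for strategy B.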

Statistical Power

Power = $1 - \beta$ = $P(\text{reject } H_0 \mid H_a \text{ is true})$

Factors affecting power:

- Sample size $n$: larger → more power
- Effect size: larger true difference → more power
- Significance level $\alpha$: larger $\alpha$ → more power (but more Type I errors)
- Population variance $\sigma^2$: smaller → more power

Power analysis: Determine required $n$ to achieve desired power (typically 0.80) for a given effect size and $\alpha$.
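A power analysis of this kind can be sketched with the normal approximation. The code below assumes a one-sided z-test with known $\sigma$; an exact calculation for a t-test would use the noncentral t distribution and give slightly different numbers. The function names and the inputs (`effect=5.0`, `sigma=10.0`) are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Approximate power of a one-sided z-test to detect a true shift of `effect`."""
    delta = effect / (sigma / sqrt(n))        # shift measured in standard errors
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)    # rejection threshold under H0
    return 1 - STD_NORMAL.cdf(z_crit - delta)


def required_n_one_sided_z(effect, sigma, alpha=0.05, power=0.80):
    """Smallest n whose approximate power reaches the target."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_power) * sigma / effect) ** 2)


# Hypothetical planning problem: detect a 5-unit shift when sigma = 10.
n_needed = required_n_one_sided_z(effect=5.0, sigma=10.0)
print(f"required n: {n_needed}")
for n in (10, 25, 50):
    print(f"n = {n:3d}  ->  power = {power_one_sided_z(5.0, 10.0, n):.3f}")
```

The loop shows the first factor in the list above: with effect size, $\alpha$ and $\sigma$ held fixed, power grows with $n$.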

Key Terms

Effect size
Fail to reject
One-sided tests
Population variance
Power
Sample size
Significance level
Test statistic
Type I error
Type II error

Worked Examples

Example 1: Full hypothesis test

A manufacturer claims light bulbs last $\mu = 1000$ hours. A consumer group tests 36 bulbs, finding $\bar{x} = 962$ hours, $s = 120$ hours. Test at $\alpha = 0.05$ whether the true mean is less than 1000.

Step 1: $H_0: \mu = 1000$ vs $H_a: \mu < 1000$ (one-sided, lower tail)

Step 2: $\alpha = 0.05$

Step 3: Test statistic: $t = \frac{962 - 1000}{120/\sqrt{36}} = \frac{-38}{20} = -1.90$

Step 4: df = 35. For a one-sided test: $P(t_{35} < -1.90) \approx 0.033$

Step 5: $p = 0.033 < 0.05$ → Reject $H_0$.

Conclusion: There is sufficient evidence that the true mean bulb life is less than 1000 hours.

Example 2: Two-sided test

Same data, but now test $H_0: \mu = 1000$ vs $H_a: \mu \neq 1000$.

$t = -1.90$ (same test statistic)

p-value = $2 \cdot P(t_{35} < -1.90) \approx 2 \cdot 0.033 = 0.066$

$p = 0.066 > 0.05$ → Fail to reject $H_0$.

The same data lead to different conclusions. The two-sided test requires stronger evidence because the rejection region is split between both tails, so only $\alpha/2$ lies in the direction of the observed effect.
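Examples 1 and 2 can be checked numerically. The sketch below is one way to do so with only the Python standard library: `t_pdf` and `t_cdf` are hypothetical helper names, and the CDF is approximated with Simpson's rule rather than taken from a statistics package, so the digits should be read as approximate.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def t_cdf(x, df, steps=2000):
    """Approximate CDF: Simpson's rule on [0, |x|] plus symmetry about zero."""
    upper = abs(x)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 < T < |x|)
    return 0.5 + area if x >= 0 else 0.5 - area


# Data from the bulb example.
x_bar, mu0, s, n = 962.0, 1000.0, 120.0, 36
df = n - 1

t_stat = (x_bar - mu0) / (s / sqrt(n))
p_lower = t_cdf(t_stat, df)                 # Example 1: Ha: mu < 1000
p_two_sided = 2 * t_cdf(-abs(t_stat), df)   # Example 2: Ha: mu != 1000

print(f"t = {t_stat:.2f}, df = {df}")
print(f"one-sided p = {p_lower:.3f}, two-sided p = {p_two_sided:.3f}")
```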

Example 3: Power calculation (conceptual)

Suppose true $\mu = 960$ (40 hours below claim), $\sigma = 120$, $n = 36$, $\alpha = 0.05$ (one-sided).

Effect size: $d = \frac{40}{120} = 0.333$ (Cohen's d, small-medium)

Non-centrality parameter: $\delta = \frac{40}{120/\sqrt{36}} = \frac{40}{20} = 2.0$

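Power $\approx \Phi(\delta - z_{0.95}) = \Phi(2.0 - 1.645) \approx 0.64$ under the normal approximation (slightly lower with the exact noncentral $t$ distribution). That is only about a 64% chance of detecting this difference, so the study falls short of the conventional 80% target.

To achieve 80% power: $n \approx \left(\frac{(1.645 + 0.842) \cdot 120}{40}\right)^2 \approx 55.6$, so about 56 bulbs under the same approximation.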

Quiz

Q1: What does the significance level $\alpha$ represent?

A) The probability that $H_0$ is true
B) The probability of missing a real effect
C) The probability of rejecting $H_0$ when it is true
D) The probability that the result is practically important

Correct: C)

If you chose A: This is incorrect. $\alpha$ is chosen before the test and says nothing about how likely $H_0$ is to be true.
If you chose B: This is incorrect. The probability of missing a real effect is $\beta$, the Type II error rate.
If you chose C: Correct! $\alpha$ is the probability of rejecting $H_0$ when it is true, i.e. the Type I error rate.
If you chose D: This is incorrect. The significance level does not measure effect size or practical importance.

Q2: Which of the following is the decision rule used in this subject?

A) Reject $H_0$ if $p \leq \alpha$
B) Reject $H_0$ if $p > \alpha$
C) Accept $H_0$ as proven if $p > \alpha$
D) Reject $H_0$ if $p \leq \beta$

Correct: A)

If you chose A: Correct! $H_0$ is rejected when the p-value is at or below the chosen significance level.
If you chose B: This is incorrect. A large p-value means the data are consistent with $H_0$, so we fail to reject it.
If you chose C: This is incorrect. When $p > \alpha$ we fail to reject $H_0$; this does not prove $H_0$ is true.
If you chose D: This is incorrect. The p-value is compared with $\alpha$, not with the Type II error rate $\beta$.

Q3: In Example 1 ($\bar{x} = 962$, $\mu_0 = 1000$, $s = 120$, $n = 36$), what is the value of the test statistic?

A) $-38$
B) $-0.32$
C) $1.90$
D) $-1.90$

Correct: D)

If you chose A: This is incorrect. $-38$ is the raw difference $\bar{x} - \mu_0$; it still has to be divided by the standard error $s/\sqrt{n} = 20$.
If you chose B: This is incorrect. Dividing by $s = 120$ instead of the standard error $s/\sqrt{n} = 20$ gives $-0.32$.
If you chose C: This is incorrect. The sample mean is below $\mu_0$, so the statistic is negative.
If you chose D: Correct! $t = \frac{962 - 1000}{120/\sqrt{36}} = \frac{-38}{20} = -1.90$.

Q4: Which statement about one-sided tests is TRUE?

A) Their rejection region is split between both tails
B) They may be chosen after seeing the data if the two-sided test is not significant
C) Their p-value is double the two-sided p-value
D) They are more powerful for detecting effects in the specified direction

Correct: D)

If you chose A: This is incorrect. A one-sided test places the rejection region in one tail only.
If you chose B: This is incorrect. Switching after seeing the data is p-hacking and inflates the Type I error rate.
If you chose C: This is incorrect. The two-sided p-value is double the one-sided tail, not the other way round.
If you chose D: Correct! All of $\alpha$ is in one tail, so the test is more powerful in the specified direction.

Q5: Which expression defines the p-value?

A) $P(H_0 \text{ is true} \mid \text{data})$
B) $P(\text{data at least as extreme} \mid H_0 \text{ is true})$
C) $P(\text{reject } H_0 \mid H_a \text{ is true})$
D) $P(H_a \text{ is false} \mid \text{data})$

Correct: B)

If you chose A: This is incorrect. The p-value is not the probability that $H_0$ is true; it is computed assuming $H_0$ is true.
If you chose B: Correct! The p-value is $P(\text{data at least as extreme} \mid H_0 \text{ is true})$.
If you chose C: This is incorrect. $P(\text{reject } H_0 \mid H_a \text{ is true})$ is the power of the test.
If you chose D: This is incorrect. The p-value is not the probability that $H_a$ is false.

Q6: How are power and the Type II error rate $\beta$ related?

A) Power $= \beta$
B) Power and $\beta$ are unrelated
C) Power $= 1 - \alpha$
D) Power $= 1 - \beta$

Correct: D)

If you chose A: This is incorrect. $\beta$ is the probability of missing a real effect; power is the probability of detecting it.
If you chose B: This is incorrect. Power is defined directly in terms of $\beta$.
If you chose C: This is incorrect. $1 - \alpha$ is the probability of correctly failing to reject a true $H_0$.
If you chose D: Correct! Power $= 1 - \beta = P(\text{reject } H_0 \mid H_a \text{ is true})$.

Q7: What is a common pitfall related to sample size?

A) Assuming that a larger sample increases power
B) Running a power analysis before collecting data
C) Reporting the sample size alongside the p-value
D) Treating a non-significant result from a small sample as proof that $H_0$ is true

Correct: D)

If you chose A: This is incorrect. A larger sample does increase power, so this is not a pitfall.
If you chose B: This is incorrect. Determining the required $n$ in advance is good practice.
If you chose C: This is incorrect. Reporting $n$ helps readers judge the result.
If you chose D: Correct! A non-significant result may just mean the sample was too small to detect a real effect.

Q8: Why should effect size be reported alongside the p-value?

A) Because the p-value already measures effect size
B) Because statistical significance does not imply practical importance
C) Because effect size replaces the need for a significance level
D) Because effect size has no influence on power

Correct: B)

If you chose A: This is incorrect. A p-value is not a measure of effect size or practical importance.
If you chose B: Correct! With a large enough sample, even a trivially small effect can be statistically significant.
If you chose C: This is incorrect. The significance level is still needed for the decision rule.
If you chose D: This is incorrect. A larger true difference gives more power.

Practice Problems

1. A test yields $p = 0.03$ with $\alpha = 0.05$. What is the correct conclusion?

Answer: Reject $H_0$. At the 5% significance level there is sufficient evidence in favor of the alternative hypothesis.

2. Explain why failing to reject $H_0$ does not prove $H_0$ is true.

Answer: Failing to reject means the data are consistent with $H_0$, not that $H_0$ has been proven. There might be a real effect that the study was underpowered to detect. It is like a court verdict of "not guilty" rather than "innocent": insufficient evidence, not proof of absence.

3. If you decrease $\alpha$ from 0.05 to 0.01, what happens to $\beta$ (for fixed $n$)?

Answer: $\beta$ increases. Making it harder to reject $H_0$ (smaller $\alpha$) means you are also more likely to miss real effects. There is a direct trade-off between Type I and Type II error rates for a fixed sample size.

4. A study with $n = 100$ finds a statistically significant result ($p = 0.04$) for a tiny effect size. Is this practically significant?

Answer: Not necessarily. With a large enough $n$, even trivially small effects can become statistically significant. Statistical significance ≠ practical importance. Always report and interpret effect sizes alongside p-values.

5. A researcher runs a two-sided test, gets $p = 0.08$, then decides to report a one-sided test instead ($p = 0.04$) and claims significance. Why is this wrong?

Answer: This is p-hacking. The choice of one-sided vs two-sided must be made BEFORE seeing the data, based on the research question. Switching post hoc inflates the effective Type I error rate because you give yourself two chances to reject: a one-sided test if the effect is in the predicted direction, and a two-sided test otherwise. At $\alpha = 0.05$ that strategy rejects a true $H_0$ about 7.5% of the time, and choosing the tail entirely after seeing the data raises the rate to 10%, double the nominal level.

Summary

Key takeaways:

$H_0$ is the "nothing happening" claim; $H_a$ is the research hypothesis
p-value = $P(\text{data at least as extreme} \mid H_0)$, NOT $P(H_0 \mid \text{data})$
Type I error ($\alpha$) = false positive; Type II error ($\beta$) = false negative
One-sided tests have more power but must be justified BEFORE seeing data
Power depends on $n$, effect size, $\alpha$, and variance
Statistical significance $\neq$ practical significance — always consider effect size

Pitfalls

Interpreting "fail to reject H₀" as "H₀ is true": A non-significant result means the data are consistent with the null hypothesis, not that the null has been proven. There may be a real effect that the study was underpowered to detect. "Absence of evidence is not evidence of absence." This is one of the most common misinterpretations in hypothesis testing.
Equating statistical significance with practical importance: With large sample sizes, even trivially small effects become statistically significant. A p-value of 0.001 with n = 10,000 might correspond to a Cohen's d of 0.05 — a negligible effect. Always report and interpret effect sizes alongside p-values.
Switching from two-sided to one-sided after seeing the data: If a two-sided test gives p = 0.08, halving it to report a one-sided p = 0.04 is p-hacking. The choice of test direction must be justified by the research question BEFORE data collection. Post-hoc switching can as much as double your false positive rate (from 5% to 10% when the tail is chosen after seeing the data).
Believing the p-value is P(H₀ | data): The p-value is P(data or more extreme | H₀), not P(H₀ | data). These are fundamentally different quantities. A p-value of 0.01 does NOT mean there is a 1% chance the null is true — it means that if the null were true, you'd see results this extreme only 1% of the time.
Running multiple tests without multiplicity correction: Testing 20 independent true null hypotheses at α = 0.05 each yields an expected 1 false positive, and the family-wise probability of at least one false positive is 1 − 0.95²⁰ ≈ 0.64. Without correction (Bonferroni, Holm, FDR), you are more likely than not to find at least one "significant" result by chance alone.
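The arithmetic in the last pitfall can be reproduced with a short sketch. It assumes the tests are independent and all null hypotheses are true; the function names are hypothetical, and only the Bonferroni correction (the simplest of the three named above) is shown.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests of true nulls."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m


m, alpha = 20, 0.05

expected_false_positives = m * alpha
fwer_uncorrected = familywise_error_rate(alpha, m)
fwer_bonferroni = familywise_error_rate(bonferroni_alpha(alpha, m), m)

print(f"expected false positives:      {expected_false_positives:.1f}")
print(f"family-wise rate, uncorrected: {fwer_uncorrected:.3f}")
print(f"per-test alpha (Bonferroni):   {bonferroni_alpha(alpha, m):.4f}")
print(f"family-wise rate, Bonferroni:  {fwer_bonferroni:.3f}")
```

Testing each hypothesis at $\alpha / m$ keeps the family-wise rate at or below the nominal $\alpha$, at the cost of lower power for each individual test.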

Next Steps

Next up: 12-08-common-tests.md
