
### 12.7 — Hypothesis Testing (Basics)

Phase: Statistics

Prerequisites: 12-06-confidence-intervals, 12-02-sampling-sampling-distributions

Learning Objectives

By the end of this subject, you will be able to:

  1. Formulate null and alternative hypotheses
  2. Distinguish between Type I and Type II errors
  3. Interpret p-values correctly and state the decision rule
  4. Apply one-sided and two-sided tests appropriately
  5. Compute and interpret statistical power

Core Content

The Hypothesis Testing Framework

  1. State hypotheses:
     - $H_0$ (null): the "status quo" or "no effect" claim
     - $H_a$ or $H_1$ (alternative): what we suspect might be true

  2. Choose significance level $\alpha$ (typically 0.05): the probability of rejecting $H_0$ when it's true.

  3. Compute test statistic from the data.

  4. Compute p-value: probability of observing a test statistic at least as extreme as the one observed, assuming $H_0$ is true.

  5. Decision: Reject $H_0$ if $p \leq \alpha$. Fail to reject $H_0$ if $p > \alpha$.

⚠️ CRITICAL: p-value Misconceptions

A p-value is NOT:

- The probability that $H_0$ is true
- The probability that $H_a$ is false
- The probability the result is due to chance
- A measure of effect size or practical importance

A p-value IS: $P(\text{data at least as extreme} \mid H_0 \text{ is true})$

🚩 Common Pitfall: Failing to reject $H_0$ does NOT prove $H_0$ is true. "Absence of evidence is not evidence of absence." A non-significant result may just mean your sample was too small.
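To make the definition concrete, a p-value can be approximated by simulation: generate many samples for which $H_0$ holds, compute the test statistic for each, and count how often it is at least as extreme as the observed one. The sketch below is a minimal illustration under stated assumptions: normally distributed data, a lower-tail one-sample t-test, a hypothetical observed statistic of $-1.90$ from $n = 36$ observations, and a made-up function name `simulated_p_value`.

```python
import math
import random

def simulated_p_value(t_obs, n, trials=20000, seed=0):
    """Estimate P(T <= t_obs | H0 true) for a lower-tail one-sample t-test."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(trials):
        # Draw a sample for which H0 holds. The t statistic does not depend on
        # the location or scale of the data, so mu0 = 0 and sigma = 1 suffice.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        t_sim = mean / math.sqrt(var / n)
        if t_sim <= t_obs:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials

# Hypothetical observed statistic and sample size (assumptions for illustration).
p_hat = simulated_p_value(t_obs=-1.90, n=36)
print(f"simulated p-value: {p_hat:.3f}")
```

The estimate should land near the lower-tail area of a t distribution with 35 degrees of freedom. Note what was conditioned on: every simulated sample assumes $H_0$ is true, so the result describes the data given $H_0$, not the probability that $H_0$ is true.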

Type I and Type II Errors

| | $H_0$ True | $H_0$ False |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct (Power = $1-\beta$) |
| Fail to reject $H_0$ | Correct ($1-\alpha$) | Type II error ($\beta$) |

Trade-off: Decreasing $\alpha$ (harder to reject) increases $\beta$ (harder to detect real effects), for a fixed sample size.
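The trade-off can be checked with a small simulation. The sketch below assumes a lower-tail z-test with known $\sigma$ (a simplification) and uses illustrative values ($\mu_0 = 1000$, $\sigma = 120$, $n = 36$, and a true mean of 960 under the alternative); the function name `rejection_rate` is hypothetical.

```python
import random
from statistics import NormalDist, fmean

def rejection_rate(true_mu, alpha, mu0=1000.0, sigma=120.0, n=36,
                   trials=20000, seed=1):
    """Fraction of simulated lower-tail z-tests that reject H0: mu = mu0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(alpha)  # negative cutoff, about -1.645 for alpha = 0.05
    se = sigma / n ** 0.5                 # standard error of the sample mean
    rejections = 0
    for _ in range(trials):
        xbar = fmean([rng.gauss(true_mu, sigma) for _ in range(n)])
        if (xbar - mu0) / se <= z_crit:
            rejections += 1
    return rejections / trials

# H0 true: every rejection is a Type I error, so the rate should be near alpha.
type_1_rate = rejection_rate(true_mu=1000.0, alpha=0.05)

# H0 false (true mean 960): every non-rejection is a Type II error.
beta_05 = 1 - rejection_rate(true_mu=960.0, alpha=0.05)
beta_01 = 1 - rejection_rate(true_mu=960.0, alpha=0.01)

print(f"Type I rate at alpha=0.05: {type_1_rate:.3f}")
print(f"beta at alpha=0.05: {beta_05:.3f}, beta at alpha=0.01: {beta_01:.3f}")
```

With the sample size held fixed, the stricter $\alpha = 0.01$ run should miss the real effect noticeably more often than the $\alpha = 0.05$ run.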

One-Sided vs Two-Sided Tests

Two-sided: $H_0: \mu = \mu_0$ vs $H_a: \mu \neq \mu_0$

- Use when any deviation from $H_0$ is interesting
- Rejection region split between both tails
- p-value = $2 \cdot P(T > |t_{\text{obs}}|)$ (double the one-sided tail)

One-sided: $H_0: \mu \leq \mu_0$ vs $H_a: \mu > \mu_0$

- Use when only one direction matters
- Rejection region in one tail only
- p-value = $P(T > t_{\text{obs}})$ (upper tail)
- More powerful for detecting effects in the specified direction because all $\alpha$ is in one tail
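The tail formulas can be written as a short function. This sketch assumes a standard normal reference distribution instead of a t distribution (reasonable only for large samples or known $\sigma$); the function name `p_value`, the `alternative` labels, and the observed statistic 1.80 are illustrative choices.

```python
from statistics import NormalDist

def p_value(z_obs, alternative):
    """p-value for a test statistic with a standard normal null distribution."""
    std_normal = NormalDist()
    if alternative == "greater":      # H_a: mu > mu_0, upper tail
        return 1 - std_normal.cdf(z_obs)
    if alternative == "less":         # H_a: mu < mu_0, lower tail
        return std_normal.cdf(z_obs)
    if alternative == "two-sided":    # H_a: mu != mu_0, both tails
        return 2 * (1 - std_normal.cdf(abs(z_obs)))
    raise ValueError(f"unknown alternative: {alternative!r}")

z_obs = 1.80  # hypothetical observed statistic
p_upper = p_value(z_obs, "greater")
p_two = p_value(z_obs, "two-sided")
print(f"one-sided p = {p_upper:.4f}, two-sided p = {p_two:.4f}")
```

For this statistic the one-sided p-value falls below 0.05 while the two-sided p-value, exactly twice as large, does not. That is why the choice of alternative has to be fixed before looking at the data.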

🚩 Common Pitfall: Switching to a one-sided test AFTER seeing the data because the two-sided test wasn't significant. This is p-hacking and inflates the Type I error rate.

Statistical Power

Power = $1 - \beta$ = $P(\text{reject } H_0 \mid H_a \text{ is true})$

Factors affecting power:

- Sample size $n$: larger → more power
- Effect size: larger true difference → more power
- Significance level $\alpha$: larger $\alpha$ → more power (but more Type I errors)
- Population variance $\sigma^2$: smaller → more power

Power analysis: Determine required $n$ to achieve desired power (typically 0.80) for a given effect size and $\alpha$.
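For a one-sided z-test with known $\sigma$ (a simplifying assumption), the required sample size follows from setting the power equal to the target. The symbols $\Delta = |\mu - \mu_0|$ for the true shift, $\Phi$ for the standard normal CDF, and $z_q$ for its $q$-quantile are introduced here for illustration:

$$
\text{Power} = \Phi\!\left(\frac{\Delta\sqrt{n}}{\sigma} - z_{1-\alpha}\right) \geq 1-\beta
\quad\Longrightarrow\quad
n \geq \left(\frac{(z_{1-\alpha} + z_{1-\beta})\,\sigma}{\Delta}\right)^2
$$

The formula reflects all four factors: $n$ grows with $\sigma$, shrinks with $\Delta$, and grows as $\alpha$ or $\beta$ is tightened. A t-test needs a slightly larger $n$ because $\sigma$ is estimated from the sample.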



Key Terms

Worked Examples

Example 1: Full hypothesis test

A manufacturer claims light bulbs last $\mu = 1000$ hours. A consumer group tests 36 bulbs, finding $\bar{x} = 962$ hours, $s = 120$ hours. Test at $\alpha = 0.05$ whether the true mean is less than 1000.

Step 1: $H_0: \mu = 1000$ vs $H_a: \mu < 1000$ (one-sided, lower tail)

Step 2: $\alpha = 0.05$

Step 3: Test statistic: $t = \frac{962 - 1000}{120/\sqrt{36}} = \frac{-38}{20} = -1.90$

Step 4: df = 35. For a one-sided test: $P(t_{35} < -1.90) \approx 0.033$

Step 5: $p = 0.033 < 0.05$ → Reject $H_0$.

Conclusion: There is sufficient evidence that the true mean bulb life is less than 1000 hours.
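The steps of this example can be reproduced numerically. The sketch below uses only the Python standard library, so the CDF of the t distribution is obtained by integrating its density with Simpson's rule. The helper names `t_pdf` and `t_cdf` are hypothetical; a statistics library would normally supply this function.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(t, df, steps=2000):
    """P(T <= t), using Simpson's rule on [0, |t|] and the symmetry of the density."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # integral of the density from 0 to |t|
    return 0.5 + area if t >= 0 else 0.5 - area

# Data from the example: n bulbs, sample mean, sample sd, claimed mean, alpha.
n, xbar, s, mu0, alpha = 36, 962.0, 120.0, 1000.0, 0.05

t_obs = (xbar - mu0) / (s / math.sqrt(n))   # test statistic
p_lower = t_cdf(t_obs, df=n - 1)            # lower-tail p-value for H_a: mu < mu0
reject = p_lower <= alpha                   # decision rule

print(f"t = {t_obs:.2f}, p = {p_lower:.3f}, reject H0: {reject}")
```

The computed p-value should agree with the table-based value of roughly 0.033 used in Step 4.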

Example 2: Two-sided test

Same data, but now test $H_0: \mu = 1000$ vs $H_a: \mu \neq 1000$.

$t = -1.90$ (same test statistic)

p-value = $2 \cdot P(t_{35} < -1.90) \approx 2 \cdot 0.033 = 0.066$

$p = 0.066 > 0.05$ → Fail to reject $H_0$.

The same data lead to different conclusions! A two-sided test requires stronger evidence because it looks for a deviation in either direction, so the p-value counts both tails.
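Critical values show the same thing. With 35 degrees of freedom and $\alpha = 0.05$, and writing $t_{q,\,35}$ for the upper-tail critical value (notation introduced here; values are standard t-table entries, rounded):

$$
t_{0.05,\,35} \approx 1.690 \;<\; |t_{\text{obs}}| = 1.90 \;<\; t_{0.025,\,35} \approx 2.030
$$

The observed statistic clears the one-sided cutoff but not the two-sided one, which places only $\alpha/2 = 0.025$ in each tail.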

Example 3: Power calculation (conceptual)

Suppose true $\mu = 960$ (40 hours below claim), $\sigma = 120$, $n = 36$, $\alpha = 0.05$ (one-sided).

Effect size: $d = \frac{40}{120} = 0.333$ (Cohen's d, small-medium)

Non-centrality parameter: $\delta = \frac{40}{120/\sqrt{36}} = \frac{40}{20} = 2.0$

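Power $\approx \Phi(\delta - z_{0.95}) = \Phi(2.0 - 1.645) \approx 0.64$ under a normal approximation, so there is only about a 64% chance of detecting this difference. The study is underpowered relative to the usual 0.80 target.

To achieve 80% power: need $n \approx 56$ bulbs.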
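These quantities can be computed under a normal approximation, treating $\sigma$ as known. The sketch below is one way to do so with the standard library; the function names `power_one_sided` and `required_n` are hypothetical, and an exact calculation based on the noncentral t distribution would give slightly lower power and a slightly larger $n$.

```python
import math
from statistics import NormalDist

def power_one_sided(delta, sigma, n, alpha):
    """Approximate power of a one-sided z-test to detect a true shift of delta."""
    z = NormalDist()
    ncp = delta / (sigma / math.sqrt(n))          # non-centrality parameter
    return z.cdf(ncp - z.inv_cdf(1 - alpha))

def required_n(delta, sigma, alpha, power):
    """Smallest n whose approximate one-sided power reaches the target."""
    z = NormalDist()
    n_exact = ((z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * sigma / delta) ** 2
    return math.ceil(n_exact)

power = power_one_sided(delta=40, sigma=120, n=36, alpha=0.05)
n_needed = required_n(delta=40, sigma=120, alpha=0.05, power=0.80)
print(f"power at n=36: {power:.2f}, n for 80% power: {n_needed}")
```

Rounding up in `required_n` is deliberate: rounding down would leave the study just short of the target power.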



Quiz

Q1: What does choosing a significance level primarily refer to in this subject?

- A) A visual representation of the significance level
- B) A historical anecdote about the significance level
- C) The definition and application of the significance level
- D) A computational error related to the significance level

Correct: C)

Q2: Which of the following is the key formula discussed in this subject?

- A) $\alpha$
- B) An unrelated formula from a different topic
- C) The inverse operation of the formula in question
- D) A simplified version of $\alpha$...

Correct: A)

Q3: What is the primary purpose of computing a test statistic?

- A) It is primarily a historical notation system
- B) It is used only in advanced research contexts
- C) It replaces all other methods in this domain
- D) It is used to compute the test statistic from the data in a statistical analysis

Correct: D)

Q4: Which statement about a test being "more powerful" is TRUE?

- A) It is mentioned only as a historical footnote
- B) It is an advanced topic beyond this subject's scope
- C) It is not related to this subject
- D) It is a fundamental concept covered in this subject

Correct: D)

Q5: Based on the worked examples in this subject, what is the correct result?

- A) The inverse of the correct answer
- B) $P(\text{data at least as extreme} \mid H_0 \text{ is true})$
- C) An unrelated numerical value
- D) A different result from a common mistake

Correct: B)

Q6: How are a "more powerful" test and statistical power related?

- A) A "more powerful" test is the inverse of power
- B) They are completely unrelated topics
- C) A "more powerful" test is a special case of power
- D) They are closely related concepts

Correct: D)

Q7: What is a common pitfall when working with sample size?

- A) Sample size is always computed the same way in all contexts
- B) The main error with sample size is using it when it is not needed
- C) Sample size has no common misconceptions
- D) A common mistake is confusing sample size with a similar concept

Correct: D)

Q8: When should you apply effect size?

- A) Avoid effect size unless explicitly instructed
- B) Apply effect size to solve problems in this subject's domain
- C) Use effect size only in pure mathematics contexts
- D) Effect size is not practically useful

Correct: B)

Practice Problems

  1. A test yields $p = 0.03$ with $\alpha = 0.05$. What is the correct conclusion?

    Answer: Reject $H_0$, because $p = 0.03 \leq \alpha = 0.05$. There is sufficient evidence to support the alternative hypothesis at the 5% significance level.

  2. Explain why failing to reject $H_0$ does not prove $H_0$ is true.

    Answer: Failing to reject means the data are consistent with $H_0$, not that $H_0$ has been proven. There might be a real effect that the study was underpowered to detect. It's like a court verdict of "not guilty" rather than "innocent": insufficient evidence, not proof of absence.

  3. If you decrease $\alpha$ from 0.05 to 0.01, what happens to $\beta$ (for fixed $n$)?

    Answer: $\beta$ increases. Making it harder to reject $H_0$ (smaller $\alpha$) means you're also more likely to miss real effects. There's a direct trade-off between Type I and Type II error rates for fixed sample size.

  4. A study with $n = 100$ finds a statistically significant result ($p = 0.04$) for a tiny effect size. Is this practically significant?

    Answer: Not necessarily. With large $n$, even trivially small effects become statistically significant. Statistical significance ≠ practical importance. Always report and interpret effect sizes alongside p-values.

  5. A researcher runs a two-sided test, gets $p = 0.08$, then decides to report a one-sided test instead ($p = 0.04$) and claims significance. Why is this wrong?

    Answer: This is p-hacking. The choice of one-sided vs two-sided must be made BEFORE seeing the data based on the research question. Switching post-hoc inflates the effective Type I error rate because you're giving yourself two chances: if the effect is in the predicted direction, report one-sided; if not, stick with two-sided. This raises your false positive risk above the nominal $\alpha$: to $1.5\alpha$ under this strategy, and to $2\alpha$ if the direction itself is chosen after seeing the data.
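The inflation in problem 5 can be estimated by simulation. The sketch below assumes a standard normal test statistic under $H_0$ and a researcher who predicted an upper-tail effect, and who claims significance whenever either the two-sided or the one-sided p-value is at most $\alpha$. The function name `switching_false_positive_rate` is hypothetical.

```python
import random
from statistics import NormalDist

def switching_false_positive_rate(alpha=0.05, trials=100000, seed=2):
    """Type I error rate when the one-sided test is used as a fallback."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    false_positives = 0
    for _ in range(trials):
        z = rng.gauss(0.0, 1.0)                    # statistic when H0 is true
        p_two = 2 * (1 - std_normal.cdf(abs(z)))   # planned two-sided test
        p_one = 1 - std_normal.cdf(z)              # fallback: predicted upper tail
        if p_two <= alpha or p_one <= alpha:
            false_positives += 1
    return false_positives / trials

rate = switching_false_positive_rate()
print(f"false positive rate with switching: {rate:.3f}")
```

Under this particular strategy the rejection region is the upper 5% plus the lower 2.5% of the null distribution, so the rate should come out near 0.075 when the nominal level is 0.05. Choosing the one-sided direction after seeing the sign of the effect would push it to about 0.10.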


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 12-08-common-tests.md