### 12.9 — Regression (Linear)
Phase: Statistics
Prerequisites: 12-08-common-tests, 12-03-point-estimation, 04-08-optimization
Learning Objectives
By the end of this subject, you will be able to:
- Fit a simple linear regression model using least squares
- Interpret regression coefficients, $R^2$, and residual plots
- Conduct hypothesis tests on regression coefficients
- Construct confidence and prediction intervals for regression
- Understand multiple linear regression and the problem of multicollinearity
Core Content
Simple Linear Regression
Model: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$ independently.
- $\beta_0$: intercept (expected $Y$ when $X = 0$)
- $\beta_1$: slope (change in $Y$ per unit change in $X$)
⚠️ CRITICAL: Least Squares Estimation
We find $\hat{\beta}_0, \hat{\beta}_1$ that minimise the sum of squared residuals:
$$\text{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$
The solutions:
$$\hat{\beta}_1 = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2} = r_{xy} \cdot \frac{s_y}{s_x}$$
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
Where $r_{xy}$ is the sample correlation coefficient.
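The closed-form solution above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function name `fit_simple_ols` is hypothetical, and the data are the five points used in the worked examples of this subject. The last lines check numerically that the slope equals $r_{xy} \cdot s_y / s_x$.

```python
import math

def fit_simple_ols(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of X
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Data from the worked examples in this subject
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 6, 8]
b0, b1 = fit_simple_ols(x, y)

# Cross-check: slope = r_xy * s_y / s_x (sample standard deviations, n - 1)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
r_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
slope_from_r = r_xy * s_y / s_x

print(b0, b1, slope_from_r)  # intercept near 0.3, both slopes near 1.5
```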
Coefficient of Determination: $R^2$
$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = \frac{\text{SSR}}{\text{SST}}$$
Where:
- SST = $\sum(Y_i - \bar{Y})^2$ (total variation)
- SSR = $\sum(\hat{Y}_i - \bar{Y})^2$ (variation explained by model)
- SSE = $\sum(Y_i - \hat{Y}_i)^2$ (unexplained variation)
$R^2$ is the proportion of variance in $Y$ explained by $X$. Range: $[0, 1]$.
🚩 Common Pitfall: A high $R^2$ does NOT mean the model is good. You can get $R^2 \approx 1$ with nonsense models (e.g., overfitting, spurious correlations). Always check residual plots.
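To make the pitfall concrete, here is a small sketch that fits a straight line to data that are exactly quadratic. The data ($Y = X^2$ for $X = 1, \ldots, 10$) and the helper name `fit_line` are assumptions chosen for illustration. $R^2$ comes out at about 0.95, yet the residuals are positive at both ends and negative in the middle, which is the curved pattern a residual plot would expose.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    return y_bar - b1 * x_bar, b1

# Hypothetical data: a purely quadratic relationship, no noise
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

y_bar = sum(y) / len(y)
sse = sum(e ** 2 for e in residuals)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / sst

print(round(r_squared, 3))                # about 0.95 despite the wrong model
print([round(e, 1) for e in residuals])   # U-shaped: +, ..., -, ..., +
```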
Inference for Regression Coefficients
Standard error of $\hat{\beta}_1$:
$$\text{SE}(\hat{\beta}_1) = \frac{s}{\sqrt{\sum(X_i - \bar{X})^2}}$$
where $s = \sqrt{\frac{\text{SSE}}{n-2}}$ (residual standard error).
Test $H_0: \beta_1 = 0$ (no linear relationship):
$$t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} \sim t_{n-2}$$
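The test can be sketched with the standard library alone. The function names below are hypothetical. The two-sided p-value is obtained by numerically integrating the $t$ density with Simpson's rule, which is a stand-in for a $t$-table or a statistics library rather than the usual way to get it. The data are the five points from the worked examples.

```python
import math

def slope_t_statistic(x, y):
    """Return (b1, se_b1, t, df) for H0: beta_1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))          # residual standard error
    se_b1 = s / math.sqrt(s_xx)
    return b1, se_b1, b1 / se_b1, n - 2

def t_density(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=20000):
    """Two-sided p-value: 1 - 2 * P(0 < T < |t|), via Simpson's rule (steps even)."""
    a = abs(t)
    h = a / steps
    total = t_density(0.0, df) + t_density(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    return 1 - 2 * (total * h / 3)

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 6, 8]
b1, se_b1, t_stat, df = slope_t_statistic(x, y)
p_value = two_sided_p(t_stat, df)
print(t_stat, df, p_value)  # t near 15 on 3 df, p on the order of 0.0006
```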
Residual Analysis
Residual $e_i = Y_i - \hat{Y}_i$. Key diagnostic plots:
- Residuals vs fitted: Should show random scatter around 0. Patterns (fan shape, curves) indicate violated assumptions.
- Q-Q plot of residuals: Should fall along diagonal — checks normality assumption.
- Residuals vs order: Check for time trends or autocorrelation.
Confidence vs Prediction Intervals
Confidence interval for mean response at $X = x_0$: Narrower — uncertainty only in estimating $E[Y \mid X = x_0]$.
Prediction interval for a new observation at $X = x_0$: Wider — includes both estimation uncertainty AND natural variability of individual $Y$.
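The difference can be sketched numerically using the standard textbook formulas for simple linear regression: $\text{SE}_{\text{fit}} = s\sqrt{1/n + (x_0 - \bar{X})^2 / \sum(X_i - \bar{X})^2}$ for the mean response, and $\sqrt{\text{SE}^2_{\text{fit}} + s^2}$ for a new observation, with $s^2$ estimating $\sigma^2$. The function name `mean_ci_and_pi` is hypothetical, the critical value 3.182 is the tabulated 97.5th percentile of $t_3$ and is passed in by hand, and the data are the five points from the worked examples.

```python
import math

def mean_ci_and_pi(x, y, x0, t_crit):
    """Return (y0_hat, confidence interval, prediction interval) at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                                   # estimate of sigma^2
    se_fit = math.sqrt(s2 * (1 / n + (x0 - x_bar) ** 2 / s_xx))
    se_pred = math.sqrt(se_fit ** 2 + s2)                # adds observation noise
    y0_hat = b0 + b1 * x0
    conf_int = (y0_hat - t_crit * se_fit, y0_hat + t_crit * se_fit)
    pred_int = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
    return y0_hat, conf_int, pred_int

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 6, 8]
T_CRIT_95_DF3 = 3.182  # assumption: read from a t-table (df = 3, 95% two-sided)
y0_hat, conf_int, pred_int = mean_ci_and_pi(x, y, x0=4, t_crit=T_CRIT_95_DF3)
print(y0_hat, conf_int, pred_int)
```

Both intervals are centred on the same fitted value; only the standard error changes, so the prediction interval is always the wider of the two.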
Multiple Linear Regression
Model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$
$\hat{\beta} = (X^T X)^{-1} X^T Y$ (matrix form)
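As a rough sketch of the matrix form, the code below builds $X^T X$ and $X^T Y$ and solves the normal equations $(X^T X)\hat{\beta} = X^T Y$ by Gaussian elimination, which gives the same $\hat{\beta}$ as the formula without forming the inverse explicitly. The function names and the data are hypothetical; the responses are generated without noise from $Y = 1 + 2X_1 + 3X_2$ so the recovered coefficients can be checked.

```python
def solve_linear_system(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        known = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - known) / m[r][r]
    return beta

def fit_multiple_ols(rows, y):
    """Least squares coefficients [b0, b1, ..., bp] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]       # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical data with two predictors and an exact linear response
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
beta = fit_multiple_ols(rows, y)
print(beta)  # close to [1, 2, 3]
```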
Multicollinearity: When predictors are highly correlated, $(X^T X)$ becomes nearly singular. Consequences:
- Coefficient estimates become unstable (high variance)
- Individual t-tests may be non-significant even when the model has high $R^2$
- Small changes in data cause large changes in estimates
Detection: Variance Inflation Factor (VIF). $\text{VIF}_j = \frac{1}{1 - R_j^2}$ where $R_j^2$ is from regressing $X_j$ on all other predictors. VIF > 10 indicates serious multicollinearity.
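A minimal sketch of the VIF calculation follows, restricted to the special case of exactly two predictors. In that case regressing $X_1$ on $X_2$ gives $R_1^2 = r_{12}^2$, so both predictors share the same VIF. With more predictors, each $R_j^2$ needs a full regression of $X_j$ on all the others, which this sketch does not do. The function names and both data sets are hypothetical.

```python
import math

def pearson_r(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / math.sqrt(s_aa * s_bb)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R_j^2), where R_j^2 = r_12^2 for two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

x1 = [1, 2, 3, 4, 5, 6]
x2_uncorrelated = [3, 1, 2, 2, 1, 3]               # zero correlation with x1
x2_near_copy = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]      # almost a copy of x1

vif_low = vif_two_predictors(x1, x2_uncorrelated)
vif_high = vif_two_predictors(x1, x2_near_copy)
print(vif_low, vif_high)  # 1.0 for the first pair, far above 10 for the second
```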
Key Terms
- Confidence interval for mean response
- Multicollinearity
- Prediction interval for a new observation
- Prediction intervals
- Residual plots
Worked Examples
Example 1: Simple linear regression by hand
Data: (1, 2), (2, 3), (3, 5), (4, 6), (5, 8)
$\bar{X} = 3$, $\bar{Y} = 4.8$
$\sum(X_i - \bar{X})(Y_i - \bar{Y}) = (-2)(-2.8) + (-1)(-1.8) + 0(0.2) + 1(1.2) + 2(3.2)$ $= 5.6 + 1.8 + 0 + 1.2 + 6.4 = 15$
$\sum(X_i - \bar{X})^2 = 4 + 1 + 0 + 1 + 4 = 10$
$\hat{\beta}_1 = 15/10 = 1.5$
$\hat{\beta}_0 = 4.8 - 1.5(3) = 4.8 - 4.5 = 0.3$
Fitted model: $\hat{Y} = 0.3 + 1.5X$
Example 2: Computing $R^2$
Using fitted values: $\hat{Y} = 0.3 + 1.5 \cdot (1, 2, 3, 4, 5) = (1.8, 3.3, 4.8, 6.3, 7.8)$
SST = $(2-4.8)^2 + (3-4.8)^2 + (5-4.8)^2 + (6-4.8)^2 + (8-4.8)^2$ $= 7.84 + 3.24 + 0.04 + 1.44 + 10.24 = 22.8$
SSE = $(2-1.8)^2 + (3-3.3)^2 + (5-4.8)^2 + (6-6.3)^2 + (8-7.8)^2$ $= 0.04 + 0.09 + 0.04 + 0.09 + 0.04 = 0.30$
$R^2 = 1 - 0.30/22.8 = 1 - 0.013 = 0.987$
The model explains 98.7% of the variation in $Y$.
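The arithmetic in Examples 1 and 2 can be checked with a short script. This sketch takes the fitted coefficients 0.3 and 1.5 from Example 1 as given and also confirms the decomposition SST = SSR + SSE; the variable names are chosen for illustration only.

```python
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 6, 8]
b0, b1 = 0.3, 1.5   # fitted coefficients from Example 1

fitted = [b0 + b1 * xi for xi in x]
y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # explained variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained variation
r_squared = 1 - sse / sst

print(round(sst, 2), round(ssr, 2), round(sse, 2))  # 22.8 22.5 0.3
print(round(r_squared, 3))                          # 0.987
```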
Example 3: Significance test for slope
$s = \sqrt{\frac{0.30}{5-2}} = \sqrt{0.10} = 0.316$
$\text{SE}(\hat{\beta}_1) = \frac{0.316}{\sqrt{10}} = \frac{0.316}{3.162} = 0.10$
$t = \frac{1.5}{0.10} = 15.0$, $df = 3$, $p \approx 0.0006$
Strong evidence of a linear relationship.
Quiz
Q1: What does the concept of Confidence interval for mean response primarily refer to in this subject?
A) A visual representation of Confidence interval for mean response B) A historical anecdote about Confidence interval for mean response C) A computational error related to Confidence interval for mean response D) The definition and application of Confidence interval for mean response
Correct: D)
- If you chose A: This is incorrect. Confidence interval for mean response is defined as: the definition and application of confidence interval for mean response. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Confidence interval for mean response is defined as: the definition and application of confidence interval for mean response. The other options describe different aspects that are not the primary focus.
- If you chose C: This is incorrect. Confidence interval for mean response is defined as: the definition and application of confidence interval for mean response. The other options describe different aspects that are not the primary focus.
- If you chose D: Confidence interval for mean response is defined as: the definition and application of confidence interval for mean response. The other options describe different aspects that are not the primary focus. Correct!
Q2: Which of the following is the key formula discussed in this subject?
A) An unrelated formula from a different topic B) A simplified version of R^2... C) R^2 D) The inverse operation of the formula in question
Correct: C)
- If you chose A: This is incorrect. The formula R^2 is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: This is incorrect. The formula R^2 is central to this subject. The other options are either simplified versions or unrelated.
- If you chose C: The formula R^2 is central to this subject. The other options are either simplified versions or unrelated. Correct!
- If you chose D: This is incorrect. The formula R^2 is central to this subject. The other options are either simplified versions or unrelated.
Q3: What is the primary purpose of Prediction interval for a new observation?
A) It is used only in advanced research contexts B) It gives a range expected to contain a new individual observation of $Y$ at a given $X = x_0$ C) It is primarily a historical notation system D) It replaces all other methods in this domain
Correct: B)
- If you chose A: This is incorrect. Prediction interval for a new observation serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: Prediction interval for a new observation serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose C: This is incorrect. Prediction interval for a new observation serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. Prediction interval for a new observation serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about Multicollinearity is TRUE?
A) Multicollinearity is not related to this subject B) Multicollinearity is mentioned only as a historical footnote C) Multicollinearity is an advanced topic beyond this subject's scope D) Multicollinearity is a fundamental concept covered in this subject
Correct: D)
- If you chose A: This is incorrect. Multicollinearity is a fundamental concept covered in this subject. This subject covers Multicollinearity as part of its core content.
- If you chose B: This is incorrect. Multicollinearity is a fundamental concept covered in this subject. This subject covers Multicollinearity as part of its core content.
- If you chose C: This is incorrect. Multicollinearity is a fundamental concept covered in this subject. This subject covers Multicollinearity as part of its core content.
- If you chose D: Multicollinearity is a fundamental concept covered in this subject. This subject covers Multicollinearity as part of its core content. Correct!
Q5: Based on the practice problems in this subject, which standard errors set the width of a prediction interval versus a confidence interval for the mean response?
A) $\sqrt{\text{SE}^2_{\text{fit}} + \sigma^2}$ for the prediction interval vs just $\text{SE}_{\text{fit}}$ for the confidence interval B) An unrelated numerical value C) A different result from a common mistake D) The inverse of the correct answer
Correct: A)
- If you chose A: The practice problems show that the prediction interval uses $\sqrt{\text{SE}^2_{\text{fit}} + \sigma^2}$ vs just $\text{SE}_{\text{fit}}$ for the confidence interval. The other options represent common errors. Correct!
- If you chose B: This is incorrect. The practice problems show that the prediction interval uses $\sqrt{\text{SE}^2_{\text{fit}} + \sigma^2}$ vs just $\text{SE}_{\text{fit}}$ for the confidence interval. The other options represent common errors.
- If you chose C: This is incorrect. The practice problems show that the prediction interval uses $\sqrt{\text{SE}^2_{\text{fit}} + \sigma^2}$ vs just $\text{SE}_{\text{fit}}$ for the confidence interval. The other options represent common errors.
- If you chose D: This is incorrect. The practice problems show that the prediction interval uses $\sqrt{\text{SE}^2_{\text{fit}} + \sigma^2}$ vs just $\text{SE}_{\text{fit}}$ for the confidence interval. The other options represent common errors.
Q6: How are Multicollinearity and Prediction intervals related?
A) Multicollinearity is a special case of Prediction intervals B) Multicollinearity and Prediction intervals are closely related concepts C) Multicollinearity is the inverse of Prediction intervals D) Multicollinearity and Prediction intervals are completely unrelated topics
Correct: B)
- If you chose A: This is incorrect. Both Multicollinearity and Prediction intervals are covered in this subject as interconnected topics.
- If you chose B: Both Multicollinearity and Prediction intervals are covered in this subject as interconnected topics. Correct!
- If you chose C: This is incorrect. Both Multicollinearity and Prediction intervals are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Multicollinearity and Prediction intervals are covered in this subject as interconnected topics.
Q7: What is a common pitfall when working with Residual plots?
A) Residual plots have no common misconceptions B) A common mistake is confusing residual plots with a similar concept C) The main error with residual plots is using them when they are not needed D) Residual plots are always computed the same way in all contexts
Correct: B)
- If you chose A: This is incorrect. Students often confuse Residual plots with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: Students often confuse Residual plots with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose C: This is incorrect. Students often confuse Residual plots with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse Residual plots with similar-sounding or related concepts. Pay attention to the precise definitions.
Q8: When should you apply Simple Linear Regression?
A) Simple Linear Regression is not practically useful B) Use Simple Linear Regression only in pure mathematics contexts C) Apply Simple Linear Regression to solve problems in this subject's domain D) Avoid Simple Linear Regression unless explicitly instructed
Correct: C)
- If you chose A: This is incorrect. Simple Linear Regression is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. Simple Linear Regression is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: Simple Linear Regression is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose D: This is incorrect. Simple Linear Regression is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
- Interpret $\hat{\beta}_0 = 2.5$ and $\hat{\beta}_1 = -0.8$ in context: $Y$ = test score, $X$ = hours of video games.

  Answer: $\hat{\beta}_0 = 2.5$: A student who plays zero hours of video games is predicted to score 2.5. (May or may not be meaningful depending on whether $X=0$ is in the data range.) $\hat{\beta}_1 = -0.8$: Each additional hour of video games is associated with a decrease of 0.8 points in test score, on average.

- For $R^2 = 0.65$, $n = 50$, and one predictor, what is the correlation $r_{xy}$?

  Answer: For simple linear regression, $R^2 = r_{xy}^2$. So $|r_{xy}| = \sqrt{0.65} \approx 0.806$. The sign of $r_{xy}$ matches the sign of $\hat{\beta}_1$.

- A residual plot shows a clear U-shaped pattern. What does this suggest?

  Answer: The relationship between $X$ and $Y$ is not linear: a curved pattern in residuals indicates the linear model is misspecified. Consider adding a quadratic term ($X^2$) or transforming the variables.

- Why is a prediction interval wider than a confidence interval for the same $X$ value?

  Answer: A confidence interval captures uncertainty in estimating $E[Y \mid X=x_0]$, that is, the regression line itself. A prediction interval adds the natural variability of an individual observation ($\sigma^2$) on top of the estimation uncertainty. The PI must account for both: $\sqrt{\text{SE}^2_{\text{fit}} + \sigma^2}$ vs just $\text{SE}_{\text{fit}}$ for the CI.

- Two predictors have VIF values of 1.1 and 12.5. What does this tell you?

  Answer: VIF = 1.1: The first predictor is essentially uncorrelated with other predictors, so there is no multicollinearity concern. VIF = 12.5: The second predictor has serious multicollinearity (VIF > 10). Its coefficient estimate is unreliable, and its standard error is inflated. Consider removing it, combining predictors, or using regularisation (ridge regression).
Summary
Key takeaways:
- Simple linear regression: $Y = \beta_0 + \beta_1 X + \epsilon$, fitted by minimising SSE
- $\hat{\beta}_1 = r_{xy} \cdot s_y / s_x$ connects regression to correlation
- $R^2$ = proportion of variance explained; always check residuals, not just $R^2$
- Residual plots diagnose model violations (non-linearity, heteroscedasticity, non-normality)
- Prediction intervals are wider than confidence intervals because they include observation-level noise
- Multicollinearity (VIF > 10) inflates variance of coefficient estimates and makes them unstable
Pitfalls
- Judging model quality by R² alone: A high R² does not mean the model is correct. You can achieve R² ≈ 1 by overfitting (too many predictors for too few observations), by including spuriously correlated variables, or by fitting a model that violates assumptions. Always inspect residual plots — patterns in residuals reveal model inadequacies that R² cannot.
- Extrapolating beyond the range of the predictor: The linear relationship estimated from data may not hold outside the observed range of X. Predicting Y at X values far from the data is unreliable and can produce nonsensical results (e.g., negative predicted counts or probabilities outside [0, 1]).
- Interpreting β̂₁ as causal: Regression measures association, not causation. A significant slope means X and Y covary, but a confounding variable Z could drive both. "Ice cream sales predict drowning rates" is a real association (both driven by summer weather), but buying ice cream does not cause drowning.
- Ignoring residual diagnostic plots: Fan-shaped residuals indicate heteroscedasticity (non-constant variance), which invalidates standard errors and hypothesis tests. U-shaped or curved residuals indicate the relationship is not linear — a transformation or polynomial term is needed.
- Confusing prediction intervals with confidence intervals for the mean: A confidence interval for E[Y | X = x₀] captures uncertainty in estimating the regression line. A prediction interval for a new observation at x₀ additionally includes the natural variability σ² of individual Y values. Prediction intervals are always wider — using a CI when you need a PI drastically understates uncertainty.
Next Steps
Next up: 12-10-anova.md