### 12.9 — Regression (Linear)

Phase: Statistics

Prerequisites: 12-08-common-tests, 12-03-point-estimation, 04-08-optimization

Learning Objectives

By the end of this subject, you will be able to:

  1. Fit a simple linear regression model using least squares
  2. Interpret regression coefficients, $R^2$, and residual plots
  3. Conduct hypothesis tests on regression coefficients
  4. Construct confidence and prediction intervals for regression
  5. Understand multiple linear regression and the problem of multicollinearity

Core Content

Simple Linear Regression

Model: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$ independently.

⚠️ CRITICAL: Least Squares Estimation

We find $\hat{\beta}_0, \hat{\beta}_1$ that minimise the sum of squared residuals:

$$\text{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$

The solutions:

$$\hat{\beta}_1 = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2} = r_{xy} \cdot \frac{s_y}{s_x}$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

Where $r_{xy}$ is the sample correlation coefficient.
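The closed-form solutions above translate almost line for line into code. The following is a minimal sketch using only the Python standard library; the function name `fit_simple_ols` is a hypothetical helper chosen for illustration, and the data are the five points from the worked examples in this lesson.

```python
def fit_simple_ols(xs, ys):
    """Return (beta0_hat, beta1_hat) for the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator of the slope: sum of cross-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator of the slope: sum of squared x-deviations.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# The five data points used in the worked examples.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 6, 8]
beta0, beta1 = fit_simple_ols(xs, ys)
print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}")  # beta0 = 0.300, beta1 = 1.500
```

The guard on `s_xx` reflects the fact that the slope formula divides by $\sum(X_i - \bar{X})^2$, which is zero when every $X_i$ is the same.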

Coefficient of Determination: $R^2$

$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = \frac{\text{SSR}}{\text{SST}}$$

Where:

- SST = $\sum(Y_i - \bar{Y})^2$ (total variation)
- SSR = $\sum(\hat{Y}_i - \bar{Y})^2$ (variation explained by model)
- SSE = $\sum(Y_i - \hat{Y}_i)^2$ (unexplained variation)

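$R^2$ is the proportion of the variance in $Y$ explained by the model (by $X$ in simple regression). For a least-squares fit with an intercept, $\text{SST} = \text{SSR} + \text{SSE}$, so the two expressions above agree and $R^2$ lies in $[0, 1]$.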

🚩 Common Pitfall: A high $R^2$ does NOT mean the model is good. You can get $R^2 \approx 1$ with nonsense models (e.g., overfitting, spurious correlations). Always check residual plots.
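To make the sums of squares and the pitfall concrete, here is a small sketch that fits a straight line to deliberately curved data. The data ($y = x^2$ for $x = 1, \dots, 6$) and the helper name `r_squared_report` are assumptions made up for this illustration, not part of the lesson's examples.

```python
def r_squared_report(xs, ys):
    """Fit a least-squares line and return SST, SSR, SSE, R^2 and the residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    fitted = [beta0 + beta1 * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    sst = sum((y - y_bar) ** 2 for y in ys)      # total variation
    ssr = sum((f - y_bar) ** 2 for f in fitted)  # explained variation
    sse = sum(e ** 2 for e in residuals)         # unexplained variation
    return {"sst": sst, "ssr": ssr, "sse": sse,
            "r2": 1 - sse / sst, "residuals": residuals}


# Hypothetical curved data: a straight line is the wrong model here.
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]
report = r_squared_report(xs, ys)
signs = ["+" if e > 0 else "-" for e in report["residuals"]]

print(f"R^2 = {report['r2']:.3f}")
print("SST - (SSR + SSE) =", round(report["sst"] - report["ssr"] - report["sse"], 9))
print("residual signs:", signs)
```

On this data $R^2$ comes out at roughly 0.96, yet the residuals are positive at both ends and negative in the middle. That U-shape is exactly the kind of pattern a residual plot reveals and $R^2$ alone hides.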

Inference for Regression Coefficients

Standard error of $\hat{\beta}_1$:

$$\text{SE}(\hat{\beta}_1) = \frac{s}{\sqrt{\sum(X_i - \bar{X})^2}}$$

where $s = \sqrt{\frac{\text{SSE}}{n-2}}$ (residual standard error).

Test $H_0: \beta_1 = 0$ (no linear relationship):

$$t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} \sim t_{n-2}$$
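The test statistic can be assembled from the quantities already defined. The sketch below uses only the standard library; `slope_t_statistic` is a hypothetical helper name, and the data are the five points from this lesson's worked examples. It stops at the $t$ value, since converting it to a p-value needs a $t$-distribution CDF that the standard library does not provide.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (beta1_hat, se_beta1, t, df) for testing H0: beta1 = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    sse = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    df = n - 2                     # two parameters estimated
    s = math.sqrt(sse / df)        # residual standard error
    se_beta1 = s / math.sqrt(s_xx)
    return beta1, se_beta1, beta1 / se_beta1, df


beta1, se_beta1, t_stat, df = slope_t_statistic([1, 2, 3, 4, 5], [2, 3, 5, 6, 8])
print(f"beta1 = {beta1:.2f}, SE = {se_beta1:.2f}, t = {t_stat:.1f}, df = {df}")
```

Compare the resulting $t$ with the critical value of the $t_{n-2}$ distribution, or pass it to a statistics package to obtain a p-value.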

Residual Analysis

Residual $e_i = Y_i - \hat{Y}_i$. Key diagnostic plots:

  1. Residuals vs fitted: Should show random scatter around 0. Patterns (fan shape, curves) indicate violated assumptions.
  2. Q-Q plot of residuals: Should fall along diagonal — checks normality assumption.
  3. Residuals vs order: Check for time trends or autocorrelation.
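As a rough sketch, the three diagnostic plots listed above are each just a set of (x, y) pairs, which can be built without a plotting library. The coefficients are taken from the worked examples in this lesson; the plotting positions $(i - 0.5)/n$ used for the Q-Q points are one common convention and are an assumption here, as are the variable names.

```python
from statistics import NormalDist

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 6, 8]
beta0, beta1 = 0.3, 1.5  # least-squares fit from the worked examples

fitted = [beta0 + beta1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# 1. Residuals vs fitted: look for fans or curves around zero.
resid_vs_fitted = list(zip(fitted, residuals))

# 2. Q-Q plot: sorted residuals against standard normal quantiles.
n = len(residuals)
probs = [(i + 0.5) / n for i in range(n)]  # assumed plotting positions
theoretical = [NormalDist().inv_cdf(p) for p in probs]
qq_points = list(zip(theoretical, sorted(residuals)))

# 3. Residuals vs order: look for trends or runs over time.
resid_vs_order = list(enumerate(residuals, start=1))

# Sanity check: least-squares residuals (with an intercept) sum to zero.
print("sum of residuals:", round(sum(residuals), 12))
for q, e in qq_points:
    print(f"normal quantile {q:+.3f}  residual {e:+.3f}")
```

Each list can be handed to whatever plotting tool is available. With only five points the plots carry little information, so treat this as a template rather than a real diagnosis.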

Confidence vs Prediction Intervals

Confidence interval for mean response at $X = x_0$: Narrower — uncertainty only in estimating $E[Y \mid X = x_0]$.

Prediction interval for a new observation at $X = x_0$: Wider — includes both estimation uncertainty AND natural variability of individual $Y$.
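The difference between the two intervals is easiest to see in the standard errors. For simple linear regression the usual formulas are $\text{SE}_{\text{fit}} = s\sqrt{\tfrac{1}{n} + \tfrac{(x_0 - \bar{X})^2}{\sum(X_i - \bar{X})^2}}$ for the mean response and $\text{SE}_{\text{pred}} = s\sqrt{1 + \tfrac{1}{n} + \tfrac{(x_0 - \bar{X})^2}{\sum(X_i - \bar{X})^2}}$ for a new observation. The sketch below evaluates both on the worked-example data; the helper name and the choice $x_0 = 3$ are assumptions for illustration, and the critical value 3.182 is the tabulated 97.5th percentile of $t_3$.

```python
import math


def interval_standard_errors(xs, ys, x0):
    """Return (y_hat, se_fit, se_pred) at x0 for a simple least-squares fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    sse = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (n - 2)                       # estimate of sigma^2
    h = 1 / n + (x0 - x_bar) ** 2 / s_xx     # grows as x0 moves away from x_bar
    y_hat = beta0 + beta1 * x0
    se_fit = math.sqrt(s2 * h)               # uncertainty in the line only
    se_pred = math.sqrt(s2 * (1 + h))        # line plus individual variability
    return y_hat, se_fit, se_pred


T_CRIT = 3.182  # t table value for 95% two-sided intervals with df = 3
y_hat, se_fit, se_pred = interval_standard_errors([1, 2, 3, 4, 5], [2, 3, 5, 6, 8], x0=3)
ci = (y_hat - T_CRIT * se_fit, y_hat + T_CRIT * se_fit)
pi = (y_hat - T_CRIT * se_pred, y_hat + T_CRIT * se_pred)
print(f"95% CI for mean response: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI for new observation: ({pi[0]:.2f}, {pi[1]:.2f})")
```

The extra `1 +` inside the square root is the individual-observation variance, which is why the prediction interval is always the wider of the two.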

Multiple Linear Regression

Model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$

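$\hat{\beta} = (X^T X)^{-1} X^T Y$ (matrix form), where $X$ is the $n \times (p+1)$ design matrix whose first column is all ones (for the intercept) and $Y$ is the vector of responses.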
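A minimal sketch of the matrix form, using only the standard library: it builds $X^T X$ and $X^T Y$ and solves the normal equations $(X^T X)\hat{\beta} = X^T Y$ by Gaussian elimination instead of forming the inverse explicitly. The helper names and the data are hypothetical; the responses were generated as $y = 1 + 2x_1 + 3x_2$ with no noise so the recovered coefficients are easy to check. Numerical libraries typically use more stable factorisations than this.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, ys):
    """Return [beta0, beta1, ..., betap] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(r[j] * r[k] for r in design) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * y for r, y in zip(design, ys)) for j in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
beta = fit_ols(rows, ys)
print([round(b, 6) for b in beta])
```

The singularity check in `solve_linear_system` is where perfectly collinear predictors would surface: the elimination runs out of usable pivots.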

Multicollinearity: When predictors are highly correlated, $(X^T X)$ becomes nearly singular. Consequences:

- Coefficient estimates become unstable (high variance)
- Individual t-tests may be non-significant even when the model has high $R^2$
- Small changes in data cause large changes in estimates

Detection: Variance Inflation Factor (VIF). $\text{VIF}_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the $R^2$ from regressing $X_j$ on all other predictors. As a common rule of thumb, VIF > 10 indicates serious multicollinearity.
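With exactly two predictors the VIF is easy to compute by hand, because regressing one predictor on the other is a simple regression and $R_j^2$ is just the squared correlation between them. The sketch below covers only that two-predictor case; the helper name and both data sets are hypothetical. With more predictors, each $R_j^2$ requires a multiple regression.

```python
def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r_squared = s12 ** 2 / (s11 * s22)  # squared correlation = R_j^2 here
    if r_squared >= 1:
        raise ValueError("predictors are perfectly collinear; VIF is infinite")
    return 1 / (1 - r_squared)


x1 = [1, 2, 3, 4, 5, 6]
x2_unrelated = [3, 1, 4, 1, 5, 2]             # hypothetical, weakly related to x1
x2_near_copy = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]  # hypothetical, almost equal to x1

vif_low = vif_two_predictors(x1, x2_unrelated)
vif_high = vif_two_predictors(x1, x2_near_copy)
print(f"weakly related predictors: VIF = {vif_low:.2f}")
print(f"nearly identical predictors: VIF = {vif_high:.1f}")
```

The first pair gives a VIF close to 1, while the near-copy pushes the VIF far above the threshold of 10.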



Key Terms

Worked Examples

Example 1: Simple linear regression by hand

Data: (1, 2), (2, 3), (3, 5), (4, 6), (5, 8)

$\bar{X} = 3$, $\bar{Y} = 4.8$

$\sum(X_i - \bar{X})(Y_i - \bar{Y}) = (-2)(-2.8) + (-1)(-1.8) + 0(0.2) + 1(1.2) + 2(3.2)$ $= 5.6 + 1.8 + 0 + 1.2 + 6.4 = 15$

$\sum(X_i - \bar{X})^2 = 4 + 1 + 0 + 1 + 4 = 10$

$\hat{\beta}_1 = 15/10 = 1.5$

$\hat{\beta}_0 = 4.8 - 1.5(3) = 4.8 - 4.5 = 0.3$

Fitted model: $\hat{Y} = 0.3 + 1.5X$

Example 2: Computing $R^2$

Using fitted values: $\hat{Y} = 0.3 + 1.5 \cdot (1, 2, 3, 4, 5) = (1.8, 3.3, 4.8, 6.3, 7.8)$

SST = $(2-4.8)^2 + (3-4.8)^2 + (5-4.8)^2 + (6-4.8)^2 + (8-4.8)^2$ $= 7.84 + 3.24 + 0.04 + 1.44 + 10.24 = 22.8$

SSE = $(2-1.8)^2 + (3-3.3)^2 + (5-4.8)^2 + (6-6.3)^2 + (8-7.8)^2$ $= 0.04 + 0.09 + 0.04 + 0.09 + 0.04 = 0.30$

$R^2 = 1 - 0.30/22.8 = 1 - 0.013 = 0.987$

The model explains 98.7% of the variation in $Y$.

Example 3: Significance test for slope

$s = \sqrt{\frac{0.30}{5-2}} = \sqrt{0.10} = 0.316$

$\text{SE}(\hat{\beta}_1) = \frac{0.316}{\sqrt{10}} = \frac{0.316}{3.162} = 0.10$

$t = \frac{1.5}{0.10} = 15.0$, $df = n - 2 = 3$, two-sided $p \approx 0.0006$

Strong evidence of a linear relationship.
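The p-value in this example can be checked without a statistics library, because the $t$ distribution with 3 degrees of freedom happens to have a closed-form CDF: $F(t) = \tfrac{1}{2} + \tfrac{1}{\pi}\left[\tfrac{u}{1+u^2} + \arctan u\right]$ with $u = t/\sqrt{3}$. The sketch below applies that formula; it is specific to $df = 3$ and the function names are hypothetical.

```python
import math


def t3_cdf(t):
    """CDF of Student's t distribution with exactly 3 degrees of freedom."""
    u = t / math.sqrt(3)
    return 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi


def two_sided_p_df3(t):
    """Two-sided p-value for a t statistic with 3 degrees of freedom."""
    return 2 * (1 - t3_cdf(abs(t)))


p_value = two_sided_p_df3(15.0)
print(f"two-sided p = {p_value:.5f}")
```

The result is about 0.0006 for the two-sided test, in line with the value quoted in the example. For other degrees of freedom, use a statistics package or a $t$ table.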



Quiz

Q1: What does the concept of Confidence interval for mean response primarily refer to in this subject?

- A) A visual representation of Confidence interval for mean response
- B) A historical anecdote about Confidence interval for mean response
- C) A computational error related to Confidence interval for mean response
- D) The definition and application of Confidence interval for mean response

Correct: D)

Q2: Which of the following is the key formula discussed in this subject?

- A) An unrelated formula from a different topic
- B) A simplified version of $R^2$...
- C) $R^2$
- D) The inverse operation of the formula in question

Correct: C)

Q3: What is the primary purpose of Prediction interval for a new observation?

- A) It is used only in advanced research contexts
- B) It gives a range likely to contain a new individual observation at a given $X$
- C) It is primarily a historical notation system
- D) It replaces all other methods in this domain

Correct: B)

Q4: Which statement about Multicollinearity is TRUE?

- A) Multicollinearity is not related to this subject
- B) Multicollinearity is mentioned only as a historical footnote
- C) Multicollinearity is an advanced topic beyond this subject's scope
- D) Multicollinearity is a fundamental concept covered in this subject

Correct: D)

Q5: Based on the worked examples in this subject, what is the correct result?

A) $\sqrt{\text{SE}^2_{\text{fit}} + \sigma^2}$ vs ju B) An unrelated numerical value C) A different result from a common mistake D) The inverse of the correct answer

Correct: A)

Q6: How are Multicollinearity and Prediction intervals related?

- A) Multicollinearity is a special case of Prediction intervals
- B) Multicollinearity and Prediction intervals are closely related concepts
- C) Multicollinearity is the inverse of Prediction intervals
- D) Multicollinearity and Prediction intervals are completely unrelated topics

Correct: B)

Q7: What is a common pitfall when working with Residual plots?

- A) Residual plots have no common misconceptions
- B) A common mistake is confusing Residual plots with a similar concept
- C) The main error with Residual plots is using them when they are not needed
- D) Residual plots are always computed the same way in all contexts

Correct: B)

Q8: When should you apply Simple Linear Regression?

- A) Simple Linear Regression is not practically useful
- B) Use Simple Linear Regression only in pure mathematics contexts
- C) Apply Simple Linear Regression to solve problems in this subject's domain
- D) Avoid Simple Linear Regression unless explicitly instructed

Correct: C)

Practice Problems

  1. Interpret $\hat{\beta}_0 = 2.5$ and $\hat{\beta}_1 = -0.8$ in context: $Y$ = test score, $X$ = hours of video games.

    Answer: $\hat{\beta}_0 = 2.5$: a student who plays zero hours of video games is predicted to score 2.5. (This is meaningful only if $X = 0$ lies within the range of the data.) $\hat{\beta}_1 = -0.8$: each additional hour of video games is associated with a decrease of 0.8 points in the predicted test score, on average.

  2. For $R^2 = 0.65$, $n = 50$, and one predictor, what is the correlation $r_{xy}$?

    Answer: For simple linear regression, $R^2 = r_{xy}^2$. So $|r_{xy}| = \sqrt{0.65} \approx 0.806$. The sign of $r_{xy}$ matches the sign of $\hat{\beta}_1$.

  3. A residual plot shows a clear U-shaped pattern. What does this suggest?

    Answer: The relationship between $X$ and $Y$ is not linear. A curved pattern in the residuals indicates that the linear model is misspecified. Consider adding a quadratic term ($X^2$) or transforming the variables.

  4. Why is a prediction interval wider than a confidence interval for the same $X$ value?

    Answer: A confidence interval captures only the uncertainty in estimating $E[Y \mid X=x_0]$, that is, the regression line itself. A prediction interval adds the natural variability of an individual observation ($\sigma^2$) on top of that estimation uncertainty, so its standard error is $\sqrt{\text{SE}^2_{\text{fit}} + \sigma^2}$ rather than just $\text{SE}_{\text{fit}}$ for the CI.

  5. Two predictors have VIF values of 1.1 and 12.5. What does this tell you?

    Answer: VIF = 1.1: the first predictor is essentially uncorrelated with the other predictors, so there is no multicollinearity concern. VIF = 12.5: the second predictor shows serious multicollinearity (VIF > 10). Its coefficient estimate is unstable and its standard error is inflated. Consider removing it, combining predictors, or using regularisation (ridge regression).


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 12-10-anova.md