### 12.1 Descriptive Statistics

Phase: Statistics

Prerequisites: 10-04-discrete-random-variables, 11-01-expectation-continuous-rv

Learning Objectives

By the end of this subject, you will be able to:

  1. Compute and interpret measures of central tendency: mean, median, mode
  2. Compute and interpret measures of dispersion: range, variance, standard deviation, IQR
  3. Construct and read box plots, histograms, and scatter plots
  4. Identify skewness and kurtosis from data visualisations
  5. Use percentiles and quantiles to describe data position

Core Content

What is Descriptive Statistics?

Descriptive statistics summarise and describe the main features of a dataset. Unlike inferential statistics, descriptive statistics draw no conclusions beyond the data at hand: they simply describe what we see.

Measures of Central Tendency

These answer: "What is a typical value?"

Mean (arithmetic average): For data $x_1, x_2, \ldots, x_n$:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Median: The middle value when data is sorted. If $n$ is even, average the two middle values.

Mode: The most frequently occurring value. A dataset can have no mode, one mode (unimodal), or multiple modes (bimodal/multimodal).

When to use each:

- Mean: symmetric data without outliers
- Median: skewed data or data with outliers
- Mode: categorical data or identifying peaks
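To make the contrast concrete, here is a minimal sketch using Python's standard `statistics` module. The dataset is hypothetical (it is not from this subject's examples) and was chosen so that a single extreme value pulls the mean away from the median.

```python
import statistics

# Hypothetical data: nine ordinary values plus one extreme value (90).
data = [12, 13, 13, 14, 14, 14, 15, 16, 17, 90]

mean = statistics.mean(data)        # sensitive to the extreme value
median = statistics.median(data)    # n is even, so the two middle values are averaged
modes = statistics.multimode(data)  # list of every most-frequent value

print(f"mean   = {mean}")    # 21.8
print(f"median = {median}")  # 14.0
print(f"modes  = {modes}")   # [14]
```

The mean (21.8) is larger than nine of the ten observations, while the median and mode both stay at 14, which is why the median is usually preferred when outliers are present.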

Measures of Dispersion

These answer: "How spread out is the data?"

Range: $\max(x_i) - \min(x_i)$

Variance (sample): $$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The $(n-1)$ denominator (Bessel's correction) makes this an unbiased estimator of population variance.

Standard deviation: $s = \sqrt{s^2}$, which returns the measure to the original units of the data.

Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$, where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile.

⚠️ CRITICAL: Standard deviation is NOT the same as average deviation. SD penalises outliers more heavily because it squares deviations before averaging.
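The difference between the $n-1$ and $n$ denominators, and between standard deviation and average (mean absolute) deviation, can be sketched as follows. The dataset is hypothetical, and `mean_absolute_deviation` is a helper written for this illustration, not a standard-library function.

```python
import statistics

def mean_absolute_deviation(values):
    """Average of |x - mean|: deviations are NOT squared (illustrative helper)."""
    m = statistics.mean(values)
    return sum(abs(x - m) for x in values) / len(values)

# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_var = statistics.variance(data)       # divides by n - 1 (Bessel's correction)
population_var = statistics.pvariance(data)  # divides by n
sample_sd = statistics.stdev(data)           # square root of the sample variance
avg_dev = mean_absolute_deviation(data)

print(f"sample variance     = {sample_var:.3f}")  # 4.571
print(f"population variance = {population_var}")  # 4
print(f"sample SD           = {sample_sd:.3f}")   # 2.138
print(f"average deviation   = {avg_dev}")         # 1.5
```

The standard deviation exceeds the average deviation here because squaring gives the largest deviations (the values 2 and 9) more weight.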

Percentiles and Quantiles

The $p$-th percentile is the value below which $p\%$ of observations fall.

Finding a percentile manually:

  1. Sort the data
  2. Compute index $i = \frac{p}{100} \cdot n$
  3. If $i$ is an integer, average the values at positions $i$ and $i+1$
  4. If $i$ is not an integer, round up and take the value at that position
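The manual procedure above can be sketched in a few lines of Python. The function name `manual_percentile` and the dataset are assumptions made for illustration. Statistical software often uses interpolation-based conventions instead, so library results may differ slightly from this rule.

```python
import math

def manual_percentile(values, p):
    """p-th percentile (0 < p < 100) using the manual rule described above."""
    ordered = sorted(values)            # step 1: sort
    n = len(ordered)
    i = p * n / 100                     # step 2: index i = (p / 100) * n
    if i.is_integer():                  # step 3: average positions i and i + 1
        i = int(i)
        return (ordered[i - 1] + ordered[i]) / 2   # 1-based positions -> 0-based indices
    return ordered[math.ceil(i) - 1]    # step 4: round up, take that position

# Hypothetical data, n = 12.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 18, 20]

print(manual_percentile(data, 25))  # i = 3    -> (4 + 5) / 2 = 4.5
print(manual_percentile(data, 50))  # i = 6    -> (8 + 9) / 2 = 8.5
print(manual_percentile(data, 90))  # i = 10.8 -> 11th value  = 18
```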

Box Plots

A box plot displays the five-number summary:

- Minimum (excluding outliers)
- $Q_1$ (25th percentile)
- Median ($Q_2$, 50th percentile)
- $Q_3$ (75th percentile)
- Maximum (excluding outliers)

Outlier detection: Points below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$ are flagged as outliers.
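As a rough sketch, the $1.5 \times \text{IQR}$ rule can be coded directly from the two quartiles. The function names, quartile values, and data below are hypothetical. The quartiles are passed in rather than computed so that the sketch does not depend on any particular percentile convention.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) bounds; points outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, q1, q3):
    """List the values that fall outside the fences."""
    lower, upper = outlier_fences(q1, q3)
    return [x for x in values if x < lower or x > upper]

# Hypothetical quartiles and data.
q1, q3 = 10, 18
values = [4, 9, 12, 15, 19, 31, 50]

print(outlier_fences(q1, q3))         # (-2.0, 30.0)
print(flag_outliers(values, q1, q3))  # [31, 50]
```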

Histograms

Group data into bins and display frequency. Key decisions:

- Bin width: too narrow → noisy; too wide → loss of detail
- Bin count: Sturges' rule recommends $k \approx \log_2(n) + 1$
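A minimal sketch of Sturges' rule follows. The text gives only the approximation $k \approx \log_2(n) + 1$, so rounding $\log_2(n)$ up to a whole number before adding 1 is an assumption made here (it is a common way to obtain an integer bin count); `sturges_bins` is a hypothetical helper name.

```python
import math

def sturges_bins(n):
    """Suggested bin count for n observations (log2(n) rounded up, plus 1)."""
    return math.ceil(math.log2(n)) + 1

for n in (30, 100, 1000):
    print(n, sturges_bins(n))
# 30 6
# 100 8
# 1000 11
```

Because the count grows only logarithmically, a dataset 30 times larger gains just a handful of extra bins. Treat the result as a starting point and adjust it after looking at the plot.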

Skewness and Kurtosis

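Skewness measures asymmetry:

- Positive skew (right tail): typically mean > median > mode
- Negative skew (left tail): typically mean < median < mode
- Zero skew: symmetric distribution

The orderings of mean, median, and mode are rules of thumb that hold for most unimodal distributions, not guarantees.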

Sample skewness: $$g_1 = \frac{\frac{1}{n}\sum(x_i - \bar{x})^3}{(\frac{1}{n}\sum(x_i - \bar{x})^2)^{3/2}}$$
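The formula for $g_1$ translates almost directly into code. This is a sketch with a hypothetical helper name and hypothetical datasets, chosen to show the sign of $g_1$ for symmetric, right-skewed, and left-skewed data.

```python
def sample_skewness(values):
    """g1 = m3 / m2**1.5, where m2 and m3 are central moments with a 1/n divisor."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]             # one large value forms a right tail
left_skewed = [-x for x in right_skewed]       # mirror image forms a left tail

print(sample_skewness(symmetric))     # 0.0
print(sample_skewness(right_skewed))  # positive
print(sample_skewness(left_skewed))   # negative, same magnitude
```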

Kurtosis measures "tailedness":

- Excess kurtosis > 0: heavy-tailed (leptokurtic)
- Excess kurtosis = 0: normal tail weight (mesokurtic)
- Excess kurtosis < 0: light-tailed (platykurtic)

🚩 Common Pitfall: Kurtosis is often described as "peakedness", but it is more accurately about the tails. A distribution with high kurtosis has heavier tails, often accompanied by a sharper central peak.
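This subject gives no formula for kurtosis, so the sketch below assumes the usual moment-based definition of sample excess kurtosis, $g_2 = m_4 / m_2^2 - 3$, where $m_k = \frac{1}{n}\sum(x_i - \bar{x})^k$. The helper name and datasets are hypothetical.

```python
def excess_kurtosis(values):
    """g2 = m4 / m2**2 - 3 (assumed moment-based definition, 1/n divisors)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]                        # evenly spread, no extreme values
heavy_tails = [-10, -1, 0, 0, 0, 0, 1, 10]    # two extreme values dominate m4

print(round(excess_kurtosis(flat), 3))         # -1.3  (light-tailed)
print(round(excess_kurtosis(heavy_tails), 3))  # positive (heavy-tailed)
```

The fourth power makes the two values at $\pm 10$ contribute almost all of $m_4$ in the second dataset, which illustrates why kurtosis is driven by the tails rather than by the shape of the centre.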



Key Terms

Worked Examples

Example 1: Computing descriptive statistics

Data: 12, 15, 14, 18, 12, 20, 14, 13, 16, 14

Find the mean, median, mode, range, variance, and standard deviation.

Solution:

  1. Sort: 12, 12, 13, 14, 14, 14, 15, 16, 18, 20

  2. Mean: $\bar{x} = \frac{12+15+14+18+12+20+14+13+16+14}{10} = \frac{148}{10} = 14.8$

  3. Median: With $n=10$ (even), average 5th and 6th values: $(14 + 14)/2 = 14$

  4. Mode: 14 appears three times → mode = 14 (unimodal)

  5. Range: $20 - 12 = 8$

  6. Variance:
     - Deviations from mean: -2.8, +0.2, -0.8, +3.2, -2.8, +5.2, -0.8, -1.8, +1.2, -0.8
     - Squared deviations: 7.84, 0.04, 0.64, 10.24, 7.84, 27.04, 0.64, 3.24, 1.44, 0.64
     - Sum: 59.6
     - $s^2 = 59.6/9 \approx 6.622$ (using $n-1$)

  7. Standard deviation: $s = \sqrt{6.622} \approx 2.573$
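The hand calculations in Example 1 can be checked with Python's standard `statistics` module. This is only a verification sketch; the variable names are arbitrary.

```python
import statistics

data = [12, 15, 14, 18, 12, 20, 14, 13, 16, 14]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
data_range = max(data) - min(data)
variance = statistics.variance(data)  # sample variance, n - 1 denominator
std_dev = statistics.stdev(data)      # sample standard deviation

print(mean)                # 14.8
print(median)              # 14.0
print(mode)                # 14
print(data_range)          # 8
print(round(variance, 3))  # 6.622
print(round(std_dev, 3))   # 2.573
```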

Example 2: Box plot and outlier detection

Data: 3, 7, 8, 9, 10, 12, 13, 14, 14, 15, 16, 18, 22, 45

For $n=14$, using the manual percentile rule (index $i = \frac{p}{100} \cdot n$):

- $Q_1$: $i = 0.25 \times 14 = 3.5$, not an integer → round up to the 4th value: $Q_1 = 9$
- $Q_2$ (median): $i = 0.5 \times 14 = 7$, an integer → average the 7th and 8th values: $(13+14)/2 = 13.5$
- $Q_3$: $i = 0.75 \times 14 = 10.5$, not an integer → round up to the 11th value: $Q_3 = 16$

$\text{IQR} = 16 - 9 = 7$

Outlier bounds:

- Lower: $9 - 1.5 \times 7 = 9 - 10.5 = -1.5$ → no low outliers
- Upper: $16 + 1.5 \times 7 = 16 + 10.5 = 26.5$

45 exceeds 26.5 → 45 is an outlier. The upper whisker extends to 22 (the largest non-outlier).

Example 3: Histogram bin selection

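Data: 30 exam scores on a 0-100 scale. $n=30$, so Sturges' rule suggests $k \approx \log_2(30) + 1 \approx 4.9 + 1 \approx 6$ bins.

Bin width: taking the lowest score as 36 and the highest as 100, $(100 - 36)/6 \approx 10.67$ → use a width of 10 for readability.

Bins: 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-100

Rounding the width down to 10 and starting at 30 gives 7 bins, one more than Sturges' rule suggests. This is acceptable because the rule is a guideline, not a requirement.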



Quiz

Q1: What does the concept of Descriptive statistics primarily refer to in this subject?

- A) A historical anecdote about Descriptive statistics
- B) A visual representation of Descriptive statistics
- C) A computational error related to Descriptive statistics
- D) The definition and application of Descriptive statistics

Correct: D)

Q2: Which of the following is the key formula discussed in this subject?

- A) $x_1, x_2, \ldots, x_n$
- B) A simplified version of $x_1, x_2, \ldots, x_n$
- C) An unrelated formula from a different topic
- D) The inverse operation of the formula in question

Correct: A)

Q3: What is the primary purpose of Skewness?

- A) It is primarily a historical notation system
- B) It is used only in advanced research contexts
- C) It replaces all other methods in this domain
- D) It is used to measure the asymmetry of a distribution

Correct: D)

Q4: Which statement about Kurtosis is TRUE?

- A) Kurtosis is not related to this subject
- B) Kurtosis is a fundamental concept covered in this subject
- C) Kurtosis is mentioned only as a historical footnote
- D) Kurtosis is an advanced topic beyond this subject's scope

Correct: B)

Q5: According to this subject, which question do measures of central tendency answer?

- A) "How spread out is the data?"
- B) "How asymmetric is the distribution?"
- C) "How heavy are the tails?"
- D) "What is a typical value?"

Correct: D)

Q6: How are Kurtosis and Box plots related?

- A) Kurtosis and Box plots are closely related concepts
- B) Kurtosis is the inverse of Box plots
- C) Kurtosis is a special case of Box plots
- D) Kurtosis and Box plots are completely unrelated topics

Correct: A)

Q7: What is a common pitfall when working with Central tendency?

- A) Central tendency is always computed the same way in all contexts
- B) A common mistake is confusing Central tendency with a similar concept
- C) Central tendency has no common misconceptions
- D) The main error with Central tendency is using it when it is not needed

Correct: B)

Q8: When should you apply Dispersion?

- A) Apply Dispersion to solve problems in this subject's domain
- B) Use Dispersion only in pure mathematics contexts
- C) Dispersion is not practically useful
- D) Avoid Dispersion unless explicitly instructed

Correct: A)

Practice Problems

  1. For the dataset [23, 25, 28, 28, 30, 32, 35, 38, 42, 45], compute the mean, median, mode, and IQR.

    Answer (quartiles use the manual percentile rule from this subject; other conventions give slightly different values):
    - Mean: $(23+25+28+28+30+32+35+38+42+45)/10 = 326/10 = 32.6$
    - Median: $(30+32)/2 = 31$
    - Mode: 28 (appears twice)
    - $Q_1$: $i = 0.25 \times 10 = 2.5$ → round up to the 3rd value = 28
    - $Q_3$: $i = 0.75 \times 10 = 7.5$ → round up to the 8th value = 38
    - IQR: $38 - 28 = 10$

  2. A dataset has mean 50 and standard deviation 5. What percentage of data lies within $[40, 60]$ if the distribution is approximately normal?

    Answer: The interval $[40, 60]$ is $\bar{x} \pm 2s$. For approximately normal data, about 95% lies within 2 standard deviations of the mean.

  3. Given $Q_1 = 20$, $Q_3 = 35$, find the outlier boundaries. Would the value 62 be flagged as an outlier?

    Answer: IQR = 35 - 20 = 15. Lower bound: $20 - 1.5 \times 15 = 20 - 22.5 = -2.5$. Upper bound: $35 + 1.5 \times 15 = 35 + 22.5 = 57.5$. 62 > 57.5 → **Yes**, 62 would be flagged as an outlier.

  4. A dataset has sample skewness $g_1 = 1.8$. Interpret this value and describe the likely relationship between mean and median.

    Answer: $g_1 = 1.8$ indicates substantial positive skew (right tail). In positively skewed data, the mean is typically greater than the median, because the mean is pulled toward the long right tail while the median is more robust.

  5. You have data on 100 house prices. The mean is \$450,000 and the median is \$385,000. What does this suggest about the distribution?

    Answer: Mean > median by a substantial margin (\$65,000) suggests positive (right) skew. A few very expensive houses are pulling the mean upward while the median better represents the "typical" house price.
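The "about 95%" figure in the normal-distribution practice problem (mean 50, standard deviation 5) can be checked numerically. This sketch assumes the data follow an exactly normal distribution, which real data only approximate.

```python
from statistics import NormalDist

# Assumed model: exactly normal with the stated mean and standard deviation.
dist = NormalDist(mu=50, sigma=5)

# Probability of falling in [40, 60], i.e. within 2 standard deviations of the mean.
share = dist.cdf(60) - dist.cdf(40)

print(f"{share:.4f}")  # 0.9545
```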


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 12-02-sampling-sampling-distributions.md