### 12.1 - Descriptive Statistics
Phase: Statistics
Prerequisites: 10-04-discrete-random-variables, 11-01-expectation-continuous-rv
Learning Objectives
By the end of this subject, you will be able to:
- Compute and interpret measures of central tendency: mean, median, mode
- Compute and interpret measures of dispersion: range, variance, standard deviation, IQR
- Construct and read box plots, histograms, and scatter plots
- Identify skewness and kurtosis from data visualisations
- Use percentiles and quantiles to describe data position
Core Content
What is Descriptive Statistics?
Descriptive statistics summarise and describe the main features of a dataset. Unlike inferential statistics, we do not draw conclusions beyond the data at hand; we simply describe what we see.
Measures of Central Tendency
These answer: "What is a typical value?"
Mean (arithmetic average): For data $x_1, x_2, \ldots, x_n$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Median: The middle value when data is sorted. If $n$ is even, average the two middle values.
Mode: The most frequently occurring value. A dataset can have no mode, one mode (unimodal), or multiple modes (bimodal/multimodal).
When to use each:
- Mean: symmetric data without outliers
- Median: skewed data or data with outliers
- Mode: categorical data or identifying peaks
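As a minimal sketch of these three measures, the snippet below uses Python's standard `statistics` module on the ten values from Worked Example 1. The extra value `200` is a made-up outlier, added only to illustrate why the median is preferred when outliers are present.

```python
from statistics import mean, median, multimode

# Values from Worked Example 1 in this section.
data = [12, 15, 14, 18, 12, 20, 14, 13, 16, 14]

center = {
    "mean": mean(data),
    "median": median(data),
    "modes": multimode(data),  # a list, because data can be multimodal
}
print(center)

# Hypothetical outlier: the mean shifts a lot, the median does not move.
with_outlier = data + [200]
print(mean(with_outlier), median(with_outlier))
```

`multimode` returns every value tied for the highest count, which matches the unimodal, bimodal and multimodal cases described above.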
Measures of Dispersion
These answer: "How spread out is the data?"
Range: $\max(x_i) - \min(x_i)$
Variance (sample): $$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
The $(n-1)$ denominator (Bessel's correction) makes this an unbiased estimator of population variance.
Standard deviation: $s = \sqrt{s^2}$, which returns the measure to the original units of the data.
Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$, where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile.
⚠️ CRITICAL: Standard deviation is NOT the same as average deviation. SD penalises outliers more heavily because it squares deviations before averaging.
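The difference between standard deviation and average deviation is easier to see numerically. The sketch below uses the standard `statistics` module for the sample variance and standard deviation; `mean_absolute_deviation` is a helper written for this illustration (it is not a library function), and the data are the values from Worked Example 1.

```python
from statistics import mean, stdev, variance


def mean_absolute_deviation(values):
    """Average absolute distance from the mean (divides by n)."""
    centre = mean(values)
    return sum(abs(x - centre) for x in values) / len(values)


data = [12, 15, 14, 18, 12, 20, 14, 13, 16, 14]

data_range = max(data) - min(data)
s2 = variance(data)  # sample variance, n - 1 denominator
s = stdev(data)      # sample standard deviation, square root of s2
mad = mean_absolute_deviation(data)

print(data_range, round(s2, 3), round(s, 3), round(mad, 3))
```

For this data the average absolute deviation comes out smaller than the standard deviation, because squaring gives the largest deviations extra weight.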
Percentiles and Quantiles
The $p$-th percentile is the value below which $p\%$ of observations fall.
Finding a percentile manually:
1. Sort the data
2. Compute index $i = \frac{p}{100} \cdot n$
3. If $i$ is an integer, average the values at positions $i$ and $i+1$
4. If $i$ is not an integer, round up and take the value at that position
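The four steps above can be sketched as a small function. The name `percentile` and the restriction to $0 < p < 100$ are choices made for this illustration. Statistical libraries often interpolate between neighbouring values instead, so their results can differ slightly from this manual rule.

```python
import math


def percentile(values, p):
    """p-th percentile via the sort / index / round-up rule (0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)             # step 1
    index = p * len(ordered) / 100       # step 2, a 1-based position
    if index.is_integer():               # step 3: average positions i and i + 1
        i = int(index)
        return (ordered[i - 1] + ordered[i]) / 2
    position = math.ceil(index)          # step 4: round up
    return ordered[position - 1]


data = [12, 15, 14, 18, 12, 20, 14, 13, 16, 14]
quartiles = [percentile(data, p) for p in (25, 50, 75)]
print(quartiles)
```

The `- 1` offsets are there only because Python lists are 0-based while the manual rule counts positions from 1.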
Box Plots
A box plot displays the five-number summary:
- Minimum (excluding outliers)
- $Q_1$ (25th percentile)
- Median ($Q_2$, 50th percentile)
- $Q_3$ (75th percentile)
- Maximum (excluding outliers)
Outlier detection: Points below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$ are flagged as outliers.
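A minimal sketch of the $1.5 \times \text{IQR}$ rule follows. The function names, the quartiles `10` and `18`, and the sample values are all hypothetical, chosen only to make the arithmetic easy to follow.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3):
    """List the values that fall outside the fences."""
    low, high = iqr_fences(q1, q3)
    return [x for x in values if x < low or x > high]


# Hypothetical quartiles and data, for illustration only.
low, high = iqr_fences(10, 18)
outliers = flag_outliers([4, 9, 11, 14, 17, 19, 33], 10, 18)
print(low, high, outliers)
```

Here the IQR is 8, so the fences sit at -2 and 30 and only 33 is flagged. The whiskers of the box plot would then extend to the smallest and largest values that are not flagged.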
Histograms
Group data into bins and display the frequency in each bin. Key decisions:
- Bin width: too narrow gives a noisy plot; too wide loses detail
- Bin count: Sturges' rule recommends $k \approx \log_2(n) + 1$
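Sturges' rule is simple enough to compute directly. In the sketch below, rounding up to a whole number of bins is an assumption (the rule itself only gives an approximate value), and `sturges_bins` and `bin_width` are names invented for this illustration.

```python
import math


def sturges_bins(n):
    """Suggested bin count: log2(n) + 1, rounded up (rounding is an assumption)."""
    return math.ceil(math.log2(n) + 1)


def bin_width(lowest, highest, k):
    """Equal-width bins covering the range from lowest to highest."""
    return (highest - lowest) / k


k = sturges_bins(30)
width = bin_width(36, 100, k)
print(k, round(width, 2))
```

Treat the result as a starting point: in practice the width is usually rounded to a convenient value, and it is worth trying a few alternatives.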
Skewness and Kurtosis
Skewness measures asymmetry:
- Positive skew (right tail): typically mean > median > mode
- Negative skew (left tail): typically mean < median < mode
- Zero skew: symmetric distribution
Sample skewness: $$g_1 = \frac{\frac{1}{n}\sum(x_i - \bar{x})^3}{(\frac{1}{n}\sum(x_i - \bar{x})^2)^{3/2}}$$
Kurtosis measures "tailedness":
- Excess kurtosis > 0: heavy-tailed (leptokurtic)
- Excess kurtosis = 0: normal tail weight (mesokurtic)
- Excess kurtosis < 0: light-tailed (platykurtic)
🚩 Common Pitfall: Kurtosis is often described as "peakedness", but it is more accurately about the tails. A distribution with high kurtosis has heavier tails and a higher central peak as a consequence.
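Both shape measures can be computed from central moments. The sketch below implements the $g_1$ formula given above. This section gives no formula for kurtosis, so the moment-based definition $m_4 / m_2^2 - 3$ used here is an assumption, and all function names are illustrative.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order, dividing by n."""
    n = len(values)
    centre = sum(values) / n
    return sum((x - centre) ** order for x in values) / n


def sample_skewness(values):
    """g1 = m3 / m2 ** 1.5, matching the formula in this section."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """m4 / m2 ** 2 - 3 (assumed definition, zero for a normal distribution)."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant data")
    return central_moment(values, 4) / m2 ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]  # made-up values with a long right tail
print(sample_skewness(symmetric), excess_kurtosis(symmetric))
print(sample_skewness(right_tailed))
```

The symmetric list has zero skewness and negative excess kurtosis (light tails), while the right-tailed list has positive skewness.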
Key Terms
- Box plots
- Central tendency
- Descriptive statistics
- Dispersion
- Kurtosis
- Skewness
Worked Examples
Example 1: Computing descriptive statistics
Data: 12, 15, 14, 18, 12, 20, 14, 13, 16, 14
Find the mean, median, mode, range, variance, and standard deviation.
Solution:
- Sort: 12, 12, 13, 14, 14, 14, 15, 16, 18, 20
- Mean: $\bar{x} = \frac{12+15+14+18+12+20+14+13+16+14}{10} = \frac{148}{10} = 14.8$
- Median: With $n=10$ (even), average the 5th and 6th values: $(14 + 14)/2 = 14$
- Mode: 14 appears three times, so mode = 14 (unimodal)
- Range: $20 - 12 = 8$
- Variance:
  - Deviations from mean (in the original data order): -2.8, +0.2, -0.8, +3.2, -2.8, +5.2, -0.8, -1.8, +1.2, -0.8
  - Squared deviations: 7.84, 0.04, 0.64, 10.24, 7.84, 27.04, 0.64, 3.24, 1.44, 0.64
  - Sum: 59.6
  - $s^2 = 59.6/9 \approx 6.622$ (using $n-1$)
- Standard deviation: $s = \sqrt{6.622} \approx 2.573$
Example 2: Box plot and outlier detection
Data: 3, 7, 8, 9, 10, 12, 13, 14, 14, 15, 16, 18, 22, 45
For $n=14$, using the index rule $i = \frac{p}{100} \cdot n$ from the percentile section:
- $Q_1$: $i = 0.25 \times 14 = 3.5$ is not an integer, so round up and take the 4th value: $Q_1 = 9$
- $Q_2$ (median): $i = 0.50 \times 14 = 7$ is an integer, so average the 7th and 8th values: $(13+14)/2 = 13.5$
- $Q_3$: $i = 0.75 \times 14 = 10.5$ is not an integer, so round up and take the 11th value: $Q_3 = 16$
$\text{IQR} = 16 - 9 = 7$
Outlier bounds:
- Lower: $9 - 1.5 \times 7 = 9 - 10.5 = -1.5$, so there are no low outliers
- Upper: $16 + 1.5 \times 7 = 16 + 10.5 = 26.5$
45 exceeds 26.5, so 45 is an outlier. The upper whisker extends to 22 (the largest non-outlier).
Example 3: Histogram bin selection
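Data: 30 exam scores on a 0-100 scale. With $n=30$, Sturges' rule suggests $k \approx \log_2(30) + 1 \approx 4.9 + 1 \approx 6$ bins.
Bin width: taking 36 as the lowest observed score and 100 as the highest, $(100 - 36)/6 \approx 10.67$, so use a width of 10 for readability.
Bins: 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-100. Rounding the width down and aligning the edges to multiples of 10 gives seven bins, one more than Sturges' suggestion, and the last bin is widened to include 100.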
Quiz
Q1: What does descriptive statistics primarily refer to in this subject?
- A) Drawing conclusions about a population that go beyond the data at hand
- B) Producing charts only, without any numerical summaries
- C) Correcting computational errors in a dataset
- D) Summarising and describing the main features of a dataset
Correct: D)
- If you chose A: This is incorrect. Drawing conclusions beyond the data is inferential statistics; descriptive statistics only describe the data at hand.
- If you chose B: This is incorrect. Box plots and histograms are part of descriptive statistics, but so are numerical measures such as the mean and standard deviation.
- If you chose C: This is incorrect. Descriptive statistics summarise data; they are not an error-correction procedure.
- If you chose D: Correct! Descriptive statistics summarise and describe the main features of a dataset.
Q2: Which of the following is the formula for the sample mean of data $x_1, x_2, \ldots, x_n$?
- A) $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- B) $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
- C) $\max(x_i) - \min(x_i)$
- D) $Q_3 - Q_1$
Correct: A)
- If you chose A: Correct! The mean is the sum of the observations divided by $n$.
- If you chose B: This is incorrect. This is the sample variance $s^2$, a measure of dispersion.
- If you chose C: This is incorrect. This is the range.
- If you chose D: This is incorrect. This is the interquartile range (IQR).
Q3: What does skewness measure?
- A) The most frequently occurring value
- B) The weight of the tails relative to a normal distribution
- C) The spread of the middle 50% of the data
- D) The asymmetry of a distribution
Correct: D)
- If you chose A: This is incorrect. The most frequently occurring value is the mode.
- If you chose B: This is incorrect. Tail weight is described by kurtosis.
- If you chose C: This is incorrect. The spread of the middle 50% is the IQR.
- If you chose D: Correct! Skewness measures asymmetry: positive for a long right tail, negative for a long left tail, zero for a symmetric distribution.
Q4: Which statement about kurtosis is TRUE?
- A) Kurtosis measures the asymmetry of a distribution
- B) Kurtosis primarily measures tail weight
- C) Excess kurtosis below 0 indicates heavy tails
- D) Kurtosis is defined as the height of the central peak
Correct: B)
- If you chose A: This is incorrect. Asymmetry is measured by skewness.
- If you chose B: Correct! Kurtosis measures "tailedness"; excess kurtosis above 0 indicates heavier tails than a normal distribution.
- If you chose C: This is incorrect. Excess kurtosis below 0 indicates light tails (platykurtic); heavy tails correspond to excess kurtosis above 0.
- If you chose D: This is incorrect. A higher central peak can accompany heavy tails, but it is a consequence, not the definition.
Q5: Based on Worked Example 1, what is the sample standard deviation of the data?
- A) About 2.441
- B) About 6.622
- C) 8
- D) About 2.573
Correct: D)
- If you chose A: This is incorrect. This value comes from dividing the sum of squared deviations by $n$ instead of $n-1$: $\sqrt{59.6/10} \approx 2.441$.
- If you chose B: This is incorrect. 6.622 is the sample variance $s^2$; the standard deviation is its square root.
- If you chose C: This is incorrect. 8 is the range, $20 - 12$.
- If you chose D: Correct! $s = \sqrt{59.6/9} \approx 2.573$.
Q6: How are kurtosis and box plots related?
- A) Both reflect tail behaviour: heavy-tailed data (high excess kurtosis) tends to produce more points flagged beyond the whiskers of a box plot
- B) Kurtosis is the inverse of the IQR shown in a box plot
- C) Kurtosis is one of the values in the five-number summary
- D) Kurtosis and box plots describe entirely unrelated aspects of the data
Correct: A)
- If you chose A: Correct! Kurtosis quantifies tail weight, and a box plot flags extreme values with the $1.5 \times \text{IQR}$ rule, so both respond to heavy tails.
- If you chose B: This is incorrect. The IQR measures the spread of the middle 50% of the data; kurtosis is not derived from it.
- If you chose C: This is incorrect. The five-number summary consists of the minimum, $Q_1$, the median, $Q_3$, and the maximum.
- If you chose D: This is incorrect. Both give information about extreme values in the tails.
Q7: What is a common pitfall when working with measures of central tendency?
- A) Reporting both the mean and the median
- B) Using the mean for skewed data without comparing it with the median
- C) Using the median when the data contains outliers
- D) Using the mode for categorical data
Correct: B)
- If you chose A: This is incorrect. Reporting both is good practice, especially when they differ.
- If you chose B: Correct! In skewed data the mean is pulled toward the tail and may no longer represent a typical value.
- If you chose C: This is incorrect. The median is the recommended choice when outliers are present.
- If you chose D: This is incorrect. The mode is the appropriate measure for categorical data.
Q8: When should you report a measure of dispersion?
- A) Whenever you need to quantify how spread out the data is, alongside a measure of central tendency
- B) Only when the mean and median are equal
- C) Never, because central tendency already describes the data fully
- D) Only after all outliers have been removed
Correct: A)
- If you chose A: Correct! Measures such as the range, variance, standard deviation, and IQR answer "How spread out is the data?"
- If you chose B: This is incorrect. Dispersion is meaningful whether or not the distribution is symmetric.
- If you chose C: This is incorrect. Two datasets can share the same mean and still differ greatly in spread.
- If you chose D: This is incorrect. Outliers do not need to be removed first; the IQR in particular is used to detect them.
Practice Problems
- For the dataset [23, 25, 28, 28, 30, 32, 35, 38, 42, 45], compute the mean, median, mode, and IQR.
  Answer:
  - Mean: $(23+25+28+28+30+32+35+38+42+45)/10 = 326/10 = 32.6$
  - Median: $(30+32)/2 = 31$
  - Mode: 28 (appears twice)
  - $Q_1$: $i = 0.25 \times 10 = 2.5$, round up to the 3rd value = 28
  - $Q_3$: $i = 0.75 \times 10 = 7.5$, round up to the 8th value = 38
  - IQR: $38 - 28 = 10$
- A dataset has mean 50 and standard deviation 5. What percentage of data lies within $[40, 60]$ if the distribution is approximately normal?
  Answer: The interval $[40, 60]$ is $\bar{x} \pm 2s$. For approximately normal data, about 95% lies within 2 standard deviations of the mean.
- Given $Q_1 = 20$, $Q_3 = 35$, find the outlier boundaries. Would the value 62 be flagged as an outlier?
  Answer:
  - IQR = 35 - 20 = 15
  - Lower bound: $20 - 1.5 \times 15 = 20 - 22.5 = -2.5$
  - Upper bound: $35 + 1.5 \times 15 = 35 + 22.5 = 57.5$
  - 62 > 57.5, so **yes**, 62 would be flagged as an outlier.
- A dataset has sample skewness $g_1 = 1.8$. Interpret this value and describe the likely relationship between mean and median.
  Answer: $g_1 = 1.8$ indicates substantial positive skew (right tail). In positively skewed data, mean > median, because the mean is pulled toward the long right tail while the median is more robust.
- You have data on 100 house prices. The mean is \$450,000 and the median is \$385,000. What does this suggest about the distribution?
  Answer: Mean > median by a substantial margin (\$65,000) suggests positive (right) skew. A few very expensive houses are pulling the mean upward while the median better represents the "typical" house price.
Summary
Key takeaways:
- Central tendency (mean, median, mode) describes typical values; choose based on data shape
- Dispersion (range, variance, SD, IQR) quantifies spread; SD uses squared deviations, which makes it sensitive to outliers
- Box plots display the five-number summary and flag outliers via the $1.5 \times \text{IQR}$ rule
- Skewness indicates asymmetry: with positive skew, mean > median
- Kurtosis measures tail weight, not peakedness
- Always visualise your data before choosing summary statistics
Pitfalls
- Using the mean for skewed data without checking: When data is skewed, the mean is pulled toward the tail and no longer represents the "typical" value. Always compare mean and median; if they differ substantially, prefer the median for central tendency and report both.
- Confusing standard deviation with average deviation: $s = \sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2}$, which penalises outliers more heavily because deviations are squared. The average absolute deviation $\frac{1}{n}\sum|x_i - \bar{x}|$ is smaller and less sensitive to extremes. They are not interchangeable.
- Using $n$ instead of $n-1$ for sample variance: Dividing by $n$ gives the population variance formula, which systematically underestimates $\sigma^2$ when applied to a sample. For a sample (not the entire population), always use $n-1$ (Bessel's correction) unless you specifically want the maximum likelihood estimate.
- Drawing conclusions from poorly binned histograms: Bin width changes the appearance dramatically: too wide hides features, too narrow shows noise. Always try multiple bin widths and use rules like Sturges' or Freedman-Diaconis as starting points, not final answers.
- Misinterpreting kurtosis as "peakedness": Kurtosis primarily measures tail weight, not how pointy the distribution is. A distribution with high kurtosis has heavier tails (more extreme outliers), and the higher central peak is a consequence, not the definition.