Statistics

Data Presentation & Interpretation

Year 12 · Year 13

  • Calculate and interpret measures of location (mean, median, mode) and spread (range, interquartile range, standard deviation, variance) for both raw and grouped data.
  • Construct and interpret histograms and box plots, understanding the significance of frequency density and the five-number summary.
  • Identify and handle outliers using the interquartile range rule and discuss their impact on statistical measures.
  • Calculate and interpret Pearson's product-moment correlation coefficient (PMCC) and determine the equation of a linear regression line, interpreting its coefficients in context.

Key concepts

Measures of Location

These statistics describe the central tendency of a dataset. The mean is the arithmetic average, the median is the middle value when data is ordered, and the mode is the most frequent value.

Mean (raw data): x̄ = (Σx) / n
Mean (grouped data): x̄ = (Σfx) / Σf
Median position (discrete): (n+1)/2-th value
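
As a rough illustration, these formulas can be checked with Python's standard `statistics` module. The raw values, the grouped table and the helper name `grouped_mean` are made up for this sketch.

```python
from statistics import mean, median, mode

# Illustrative raw data (made up for this sketch).
raw = [3, 5, 5, 6, 8, 9, 13]

raw_mean = mean(raw)      # sum of values / n = 49 / 7
raw_median = median(raw)  # (n + 1)/2 = 4th ordered value
raw_mode = mode(raw)      # most frequent value

# Illustrative grouped data: (class midpoint x, frequency f).
grouped = [(5, 2), (15, 5), (25, 3)]


def grouped_mean(table):
    """Estimate the mean of grouped data as sum(fx) / sum(f)."""
    total_f = sum(f for _, f in table)
    total_fx = sum(x * f for x, f in table)
    return total_fx / total_f


print(raw_mean, raw_median, raw_mode, grouped_mean(grouped))
```
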
Measures of Spread

These statistics describe the variability or dispersion of a dataset. The range is the difference between the maximum and minimum values. The interquartile range (IQR) is the spread of the middle 50% of the data. Variance is the mean of the squared deviations from the mean; standard deviation is its square root, giving a typical deviation in the same units as the data.

Range = Maximum value - Minimum value
IQR = Q3 - Q1
Variance (raw data): σ² = (Σx²) / n - x̄²
Standard Deviation: σ = √Variance
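
A minimal sketch of the variance formula above, assuming a small made-up dataset. The function name `variance_shortcut` is hypothetical. Because the formula divides by n, it corresponds to `statistics.pvariance` rather than `statistics.variance`, which divides by n - 1.

```python
from math import sqrt
from statistics import pvariance

# Illustrative values (made up for this sketch).
data = [2, 4, 4, 4, 5, 5, 7, 9]


def variance_shortcut(values):
    """Variance via sum(x^2)/n - mean^2, dividing by n."""
    n = len(values)
    x_bar = sum(values) / n
    return sum(x * x for x in values) / n - x_bar ** 2


var = variance_shortcut(data)        # 232/8 - 5^2
sd = sqrt(var)                       # standard deviation
data_range = max(data) - min(data)   # range

# Cross-check against the standard library's population variance.
library_var = pvariance(data)
```
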
Histograms

A graphical representation of the distribution of numerical data. The area of each bar is proportional to the frequency of the data in that class interval. The vertical axis represents frequency density.

Frequency Density = Frequency / Class Width
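
The relationship between bar height and bar area can be sketched in Python as follows. The class boundaries and frequencies are invented for illustration, and `frequency_densities` is a hypothetical helper name.

```python
# Illustrative classes: (lower bound, upper bound, frequency).
classes = [(0, 10, 5), (10, 15, 10), (15, 30, 6)]


def frequency_densities(table):
    """Return frequency / class width for each class."""
    return [f / (upper - lower) for lower, upper, f in table]


# Frequency density is the height of each bar.
densities = frequency_densities(classes)

# Height x width gives back the frequency, so area represents frequency.
areas = [d * (upper - lower)
         for d, (lower, upper, _) in zip(densities, classes)]
```

Note that the second class has the tallest bar even though the classes have unequal widths, which is why frequency rather than frequency density on the vertical axis would be misleading.
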
Box Plots (Box-and-Whisker Diagrams)

A graphical display that summarises the distribution of a dataset using five key values: minimum value, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum value. It effectively shows the spread and skewness of the data.

Outliers

Data points that lie an abnormal distance from other values in a dataset. They can significantly affect the mean and standard deviation. A common rule for identification is based on the interquartile range.

An observation x is an outlier if x < Q1 - 1.5 * IQR or x > Q3 + 1.5 * IQR
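
A short sketch of the rule above. The function names and the sample data are assumptions made for illustration, and the quartiles are passed in by hand because different conventions for locating quartiles can give slightly different values.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """List the values lying strictly outside the fences."""
    lower, upper = outlier_fences(q1, q3)
    return [x for x in values if x < lower or x > upper]


# Illustrative data with quartiles supplied by hand.
sample = [1, 10, 11, 12, 13, 14, 15, 30]
print(find_outliers(sample, q1=10.5, q3=14.5))
```
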
Correlation

Describes the strength and direction of a linear relationship between two quantitative variables. Pearson's product-moment correlation coefficient (PMCC), denoted by 'r', is used for this. 'r' ranges from -1 (perfect negative linear correlation) to +1 (perfect positive linear correlation), with 0 indicating no linear correlation.

r = Sxy / √(Sxx * Syy)
where:
Sxx = Σx² - (Σx)²/n
Syy = Σy² - (Σy)²/n
Sxy = Σxy - (ΣxΣy)/n
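
A minimal Python sketch of the PMCC formula above. The function name `pmcc` is chosen for illustration; the data are the revision hours and test marks from Example 3 further down.

```python
from math import sqrt


def pmcc(xs, ys):
    """Pearson's r = Sxy / sqrt(Sxx * Syy), built from summary sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sxy / sqrt(sxx * syy)


hours = [2, 3, 4, 5, 6, 7, 8, 9]
marks = [35, 40, 50, 55, 60, 65, 70, 75]
r = pmcc(hours, marks)  # about 0.994
```
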
Linear Regression

A statistical method used to model the linear relationship between a dependent variable (y) and an independent variable (x). The least squares regression line, y = a + bx, is found by minimising the sum of the squared vertical distances from the data points to the line.

y = a + bx
b = Sxy / Sxx
a = ȳ - b * x̄
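
The two coefficient formulas can be sketched in Python as below. The function name `least_squares_line` and the four data points are assumptions made for illustration; the points were chosen to lie close to y = 1 + 2x.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for y = a + bx, with b = Sxy/Sxx and a = y_bar - b*x_bar."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = sxy / sxx
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Illustrative points lying close to y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]
a, b = least_squares_line(xs, ys)

# The fitted line always passes through the mean point (x_bar, y_bar).
y_at_mean = a + b * (sum(xs) / len(xs))
```

Because a is defined as ȳ - b * x̄, substituting x = x̄ into the line always returns ȳ, which is a useful check on hand calculations.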

Key facts to remember

  1. The mean is sensitive to extreme values (outliers), while the median is resistant.
  2. Standard deviation measures the typical spread of data points around the mean. A larger standard deviation indicates greater variability.
  3. In a histogram, the area of each bar, not its height, represents the frequency of the class. Frequency density is the height.
  4. A box plot visually summarises the minimum, lower quartile, median, upper quartile, and maximum values of a dataset.
  5. Outliers are typically defined as values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
  6. Pearson's product-moment correlation coefficient (PMCC), r, measures the strength and direction of *linear* correlation, ranging from -1 (perfect negative) to +1 (perfect positive).
  7. The least squares regression line y = a + bx minimises the sum of the squared vertical distances (residuals) from the data points to the line.
  8. Correlation does not imply causation.

Worked examples

Example 1

The table shows the heights, h cm, of 100 students.

Height (h cm) | Frequency
--------------|----------
150 ≤ h < 155 | 12
155 ≤ h < 160 | 28
160 ≤ h < 165 | 35
165 ≤ h < 170 | 15
170 ≤ h < 175 | 10

Estimate the mean and standard deviation of the students' heights. Estimate the median height.

I. Calculate the midpoint (x) of each class, then fx and fx²:
Class 150-155: x = 152.5, f = 12, fx = 1830, fx² = 279075
Class 155-160: x = 157.5, f = 28, fx = 4410, fx² = 694575
Class 160-165: x = 162.5, f = 35, fx = 5687.5, fx² = 924218.75
Class 165-170: x = 167.5, f = 15, fx = 2512.5, fx² = 420843.75
Class 170-175: x = 172.5, f = 10, fx = 1725, fx² = 297562.5

II. Calculate the sums:
Σf = 12 + 28 + 35 + 15 + 10 = 100
Σfx = 1830 + 4410 + 5687.5 + 2512.5 + 1725 = 16165
Σfx² = 279075 + 694575 + 924218.75 + 420843.75 + 297562.5 = 2616275

III. Estimate the mean (x̄):
x̄ = Σfx / Σf = 16165 / 100 = 161.65 cm

IV. Estimate the standard deviation (σ):
σ² = (Σfx²) / Σf - x̄²
σ² = 2616275 / 100 - (161.65)²
σ² = 26162.75 - 26130.7225 = 32.0275
σ = √32.0275 ≈ 5.659... ≈ 5.66 cm (3 s.f.)

V. Estimate the median:
Total frequency = 100, so the median position is the (100/2) = 50th value.
Cumulative frequencies:
< 155: 12
< 160: 12 + 28 = 40
< 165: 40 + 35 = 75 (the 50th value is in this class)
Median class: 160 ≤ h < 165
Using interpolation: Median = L + [(n/2 - C_f) / f_m] * c, where L is the lower boundary of the median class, C_f is the cumulative frequency below it, f_m is its frequency and c is its width.
Median = 160 + [(50 - 40) / 35] * 5
Median = 160 + (10 / 35) * 5 = 160 + (2 / 7) * 5 = 160 + 10/7 ≈ 161.428... ≈ 161.43 cm (2 d.p.)

Answer

Mean ≈ 161.65 cm, Standard deviation ≈ 5.66 cm, Median ≈ 161.43 cm.

For grouped data, these are estimates: the mean and standard deviation treat every value as lying at its class midpoint, and the interpolated median assumes values are spread evenly within the median class.
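
The interpolation used for the median can be sketched in Python as follows. The function name `grouped_median` is hypothetical, and the sketch assumes the classes are listed in ascending order with no gaps between them.

```python
def grouped_median(classes):
    """Estimate the median of grouped data by linear interpolation.

    classes: list of (lower bound, upper bound, frequency),
    in ascending order with no gaps between classes.
    """
    n = sum(f for _, _, f in classes)
    target = n / 2   # median position for grouped data
    cumulative = 0   # total frequency below the current class
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= target:
            # L + [(n/2 - C_f) / f_m] * c
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("frequency table has no data")


heights = [(150, 155, 12), (155, 160, 28), (160, 165, 35),
           (165, 170, 15), (170, 175, 10)]
median_height = grouped_median(heights)  # 160 + (10/35)*5, about 161.43
```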

Example 2

A dataset of 15 values is given: 12, 15, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 30, 32, 45.
a) Find the median, lower quartile (Q1), and upper quartile (Q3).
b) Identify any outliers, using the rule that a value is an outlier if it is below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
c) Describe how you would draw a box plot for the data, clearly indicating any outliers.

I. a) Find Q1, Median, Q3:
Data in order (n = 15): 12, 15, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 30, 32, 45.
Median (Q2) position: (15+1)/2 = 8th value. Median = 23.
Q1 position: (15+1)/4 = 4th value. Q1 = 19.
Q3 position: 3(15+1)/4 = 12th value. Q3 = 28.

II. b) Identify outliers:
IQR = Q3 - Q1 = 28 - 19 = 9.
Lower bound for outliers: Q1 - 1.5 * IQR = 19 - 1.5 * 9 = 19 - 13.5 = 5.5.
Upper bound for outliers: Q3 + 1.5 * IQR = 28 + 1.5 * 9 = 28 + 13.5 = 41.5.
Check data: the minimum value is 12 (> 5.5), so there are no low outliers. The maximum value is 45 (> 41.5), so 45 is an outlier. The next largest value, 32, is below 41.5, so 45 is the only outlier.

III. c) Describe box plot construction:
Draw a numerical scale covering the range of the data (e.g., from 10 to 50).
Draw a box from Q1 (19) to Q3 (28).
Draw a line inside the box at the median (23).
Draw a 'whisker' from Q1 to the minimum non-outlier value (12).
Draw a 'whisker' from Q3 to the maximum non-outlier value (32).
Mark the outlier (45) with a cross or asterisk beyond the end of the upper whisker.

Answer

a) Median = 23, Q1 = 19, Q3 = 28. b) The value 45 is an outlier. c) A box plot would be drawn with a scale, a box from 19 to 28, a median line at 23, whiskers extending to 12 and 32, and an outlier marked at 45.

Outliers can significantly skew the appearance of a box plot and affect measures like the mean and range. It's important to identify and consider their impact.
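
As a cross-check, the values needed for this box plot can be computed with Python's standard `statistics.quantiles`. This is only a sketch: the variable names are chosen for illustration, and the `"exclusive"` method is used because it places the quartiles at positions k(n + 1)/4, the same rule as in the working above.

```python
from statistics import quantiles

data = [12, 15, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 30, 32, 45]

# method="exclusive" uses positions k(n + 1)/4: the 4th, 8th and 12th values.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme values that are not outliers.
inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, q2, q3, whisker_low, whisker_high, outliers)
```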

Example 3

A student investigates the relationship between the number of hours spent revising (x) and the mark achieved in a test (y) for 8 students.

x (hours) | y (mark)
----------|---------
2 | 35
3 | 40
4 | 50
5 | 55
6 | 60
7 | 65
8 | 70
9 | 75

a) Calculate Pearson's product-moment correlation coefficient (PMCC).
b) Find the equation of the least squares regression line of y on x in the form y = a + bx.
c) Interpret the value of b in the context of this problem.
d) Predict the mark for a student who revised for 5.5 hours.

I. a) Calculate PMCC (r):
Σx = 44, Σy = 450, Σx² = 284, Σy² = 26700, Σxy = 2715, n = 8.
Sxx = Σx² - (Σx)²/n = 284 - (44)²/8 = 284 - 242 = 42.
Syy = Σy² - (Σy)²/n = 26700 - (450)²/8 = 26700 - 25312.5 = 1387.5.
Sxy = Σxy - (ΣxΣy)/n = 2715 - (44 * 450)/8 = 2715 - 2475 = 240.
r = Sxy / √(Sxx * Syy) = 240 / √(42 * 1387.5) = 240 / √58275 = 240 / 241.402... ≈ 0.994 (3 s.f.)

II. b) Find the regression line y = a + bx:
x̄ = Σx / n = 44 / 8 = 5.5.
ȳ = Σy / n = 450 / 8 = 56.25.
b = Sxy / Sxx = 240 / 42 = 40 / 7 ≈ 5.714 (3 d.p.).
a = ȳ - b * x̄ = 56.25 - (40/7) * 5.5 = 56.25 - 220/7 ≈ 56.25 - 31.42857... ≈ 24.8214... ≈ 24.82 (2 d.p.).
Equation: y = 24.82 + 5.71x (2 d.p.)

III. c) Interpret b:
b ≈ 5.71. This means that for every additional hour spent revising, the student's test mark is predicted to increase by approximately 5.71 marks.

IV. d) Predict the mark for 5.5 hours, using the unrounded coefficients to avoid rounding error:
y = 24.8214... + (40/7) * 5.5 = 24.8214... + 31.4285... = 56.25
Since x = 5.5 equals x̄, the regression line returns exactly ȳ = 56.25. (Substituting the rounded coefficients instead gives 24.82 + 5.71 * 5.5 = 56.225, which is slightly too low.)

Answer

a) PMCC (r) ≈ 0.994 b) y = 24.82 + 5.71x c) For every additional hour of revision, the predicted test mark increases by approximately 5.71 marks. d) Predicted mark = 56.25, or about 56 marks.

The PMCC value of 0.994 indicates a very strong positive linear correlation between revision hours and test marks. The prediction for 5.5 hours is an interpolation, as it falls within the range of the observed data.

Common mistakes

  • Using frequency as the height of bars in a histogram instead of frequency density, or having gaps between bars for continuous data.
  • Incorrectly calculating quartile positions, especially for discrete data or when interpolating for grouped data.
  • Not checking for outliers, or removing them without justification or consideration of their impact on statistical measures.
  • Confusing correlation with causation, assuming that a strong correlation between two variables means one causes the other.
  • Extrapolating beyond the range of the original data for regression predictions without acknowledging the potential unreliability of such predictions.

Exam tips

  • Always show full working for calculations involving grouped data, PMCC, or regression, including intermediate sums and formulas used.
  • Ensure all axes on histograms and box plots are correctly labelled with units and appropriate scales.
  • When asked to interpret measures (e.g., mean, standard deviation, PMCC, regression coefficients), relate your answer back to the context of the problem.
  • Be aware of the limitations of statistical models, such as the assumption of linearity for PMCC and regression, and the dangers of extrapolation.
