Statistics
Data Presentation & Interpretation
Year 12 · Year 13
- ✓ Calculate and interpret measures of location (mean, median, mode) and spread (range, interquartile range, standard deviation, variance) for both raw and grouped data.
- ✓ Construct and interpret histograms and box plots, understanding the significance of frequency density and the five-number summary.
- ✓ Identify and handle outliers using the interquartile range rule and discuss their impact on statistical measures.
- ✓ Calculate and interpret Pearson's product-moment correlation coefficient (PMCC) and determine the equation of a linear regression line, interpreting its coefficients in context.
Key concepts
Measures of location: These statistics describe the central tendency of a dataset. The mean is the arithmetic average, the median is the middle value when the data are ordered, and the mode is the most frequent value.
Measures of spread: These statistics describe the variability or dispersion of a dataset. The range is the difference between the maximum and minimum values. The interquartile range (IQR), Q3 − Q1, is the spread of the middle 50% of the data. Variance is the mean of the squared deviations from the mean, and standard deviation is the square root of the variance, so it is in the same units as the data.
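As a minimal sketch, these measures can be checked with Python's standard `statistics` module. The dataset is hypothetical, and the divisor-n forms `pvariance` and `pstdev` are an assumption; some calculators and courses divide by n − 1 instead.

```python
import statistics

# Hypothetical raw data, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)        # arithmetic average
median = statistics.median(data)    # middle value of the ordered data
mode = statistics.mode(data)        # most frequent value
data_range = max(data) - min(data)  # maximum minus minimum

# Divisor-n forms: variance is the mean squared deviation from the mean,
# and standard deviation is the square root of the variance.
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

print(mean, median, mode, data_range)
print(round(variance, 3), round(std_dev, 3))
```

Squaring `std_dev` recovers `variance`, which is a quick way to confirm which of the two a calculator is reporting.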
Histogram: A graphical representation of the distribution of continuous numerical data. The area of each bar is proportional to the frequency of its class interval, so the vertical axis shows frequency density (frequency ÷ class width).
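The link between frequency density and bar area can be sketched in a few lines of Python. The classes below are hypothetical, with unequal widths so that height and area differ; `frequency_density` is an illustrative helper name.

```python
# Hypothetical grouped data with unequal class widths:
# (lower bound, upper bound, frequency).
classes = [(0, 10, 8), (10, 15, 12), (15, 20, 15), (20, 40, 10)]


def frequency_density(lower, upper, frequency):
    """Bar height for a histogram: frequency divided by class width."""
    return frequency / (upper - lower)


densities = [frequency_density(lo, hi, f) for lo, hi, f in classes]

for (lo, hi, f), d in zip(classes, densities):
    area = d * (hi - lo)  # bar area recovers the class frequency
    print(f"{lo} to {hi}: frequency {f}, density {d:.2f}, area {area:.1f}")
```

Here the widest class (20 to 40) has the lowest bar even though its frequency is not the smallest, which is why plotting raw frequency as height would mislead.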
Box plot: A graphical display that summarises the distribution of a dataset using five key values: minimum value, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum value. It shows the spread and skewness of the data at a glance.
Outliers: Data points that lie an abnormal distance from the other values in a dataset. They can significantly affect the mean and standard deviation. A common rule flags values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
Correlation: Describes the strength and direction of a linear relationship between two quantitative variables. Pearson's product-moment correlation coefficient (PMCC), denoted by r, is used for this. r ranges from −1 (perfect negative linear correlation) to +1 (perfect positive linear correlation), with 0 indicating no linear correlation.
Linear regression: A statistical method used to model the linear relationship between a dependent variable (y) and an independent variable (x). The least squares regression line, y = a + bx, is found by minimising the sum of the squared vertical distances from the data points to the line.
Key facts to remember
- 1. The mean is sensitive to extreme values (outliers), while the median is resistant.
- 2. Standard deviation measures the typical spread of data points around the mean. A larger standard deviation indicates greater variability.
- 3. In a histogram, the area of each bar, not its height, represents the frequency of the class. Frequency density is the height.
- 4. A box plot visually summarises the minimum, lower quartile, median, upper quartile, and maximum values of a dataset.
- 5. Outliers are typically defined as values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
- 6. Pearson's product-moment correlation coefficient (PMCC), r, measures the strength and direction of *linear* correlation, ranging from −1 (perfect negative) to +1 (perfect positive).
- 7. The least squares regression line y = a + bx minimises the sum of the squared vertical distances (residuals) from the data points to the line.
- 8. Correlation does not imply causation.
Worked examples
Example 1
The table shows the heights, h cm, of 100 students.

Height (h cm) | Frequency
--------------|----------
150 ≤ h < 155 | 12
155 ≤ h < 160 | 28
160 ≤ h < 165 | 35
165 ≤ h < 170 | 15
170 ≤ h < 175 | 10

Estimate the mean and standard deviation of the students' heights. Estimate the median height.
Answer
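Mean ≈ 161.65 cm, Standard deviation ≈ 5.66 cm, Median ≈ 161.43 cm.
Working: using class midpoints x (152.5, 157.5, 162.5, 167.5, 172.5), Σf = 100, Σfx = 16165 and Σfx² = 2616275. So mean = 16165 / 100 = 161.65, variance = 2616275 / 100 − 161.65² ≈ 32.03, and standard deviation ≈ √32.03 ≈ 5.66. The median is the 50th value, which lies in the class 160 ≤ h < 165 (cumulative frequency 40 before it), so median ≈ 160 + (50 − 40) / 35 × 5 ≈ 161.43.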
For grouped data, these are estimates: the mean and standard deviation treat every value as lying at its class midpoint, and the median interpolation assumes the values are spread evenly within the class.
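The grouped-data estimates for Example 1 can be reproduced with a short Python sketch. It assumes every value sits at its class midpoint, uses divisor n for the variance, and interpolates linearly for the median; the helper name `grouped_median` is illustrative.

```python
from math import sqrt

# Classes from Example 1: (lower bound, upper bound, frequency).
classes = [(150, 155, 12), (155, 160, 28), (160, 165, 35),
           (165, 170, 15), (170, 175, 10)]

freqs = [f for _, _, f in classes]
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
n = sum(freqs)

sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))

mean = sum_fx / n
variance = sum_fx2 / n - mean ** 2  # divisor n (assumed convention)
std_dev = sqrt(variance)


def grouped_median(classes):
    """Linear interpolation to the (n/2)th value within its class."""
    total = sum(f for _, _, f in classes)
    target = total / 2
    cumulative = 0
    for lo, hi, f in classes:
        if cumulative + f >= target:
            return lo + (target - cumulative) / f * (hi - lo)
        cumulative += f
    raise ValueError("no classes supplied")


median = grouped_median(classes)
print(round(mean, 2), round(std_dev, 2), round(median, 2))
```

Printing `sum_fx` and `sum_fx2` as well gives the intermediate sums that an exam answer would be expected to show.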
Example 2
A dataset of 15 values is given: 12, 15, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 30, 32, 45.
a) Find the median, lower quartile (Q1), and upper quartile (Q3).
b) Identify any outliers, defined as values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
c) Describe how you would draw a box plot for the data, clearly indicating any outliers.
Answer
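a) Median = 23 (8th value), Q1 = 19 (4th value), Q3 = 28 (12th value).
b) IQR = 28 − 19 = 9, so the boundaries are 19 − 13.5 = 5.5 and 28 + 13.5 = 41.5. The value 45 is above 41.5, so it is an outlier. No value is below 5.5.
c) A box plot would be drawn on a labelled scale, with a box from 19 to 28, a median line at 23, whiskers extending to 12 and 32, and the outlier marked separately at 45.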
Outliers can significantly skew the appearance of a box plot and affect measures like the mean and range. It's important to identify and consider their impact.
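A sketch of Example 2 in Python, using the standard `statistics` module. The choice of `method="exclusive"` is an assumption: it places the quartiles at positions (n + 1)/4, (n + 1)/2 and 3(n + 1)/4, which picks out the 4th, 8th and 12th values here. Other quartile conventions can give slightly different answers.

```python
import statistics

data = [12, 15, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 30, 32, 45]

# Quartiles at positions (n + 1)/4, (n + 1)/2 and 3(n + 1)/4.
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Boundaries for the 1.5 x IQR outlier rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

# Whiskers stop at the most extreme values that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, iqr)
print(lower_fence, upper_fence, outliers, whisker_low, whisker_high)
```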
Example 3
A student investigates the relationship between the number of hours spent revising (x) and the mark achieved in a test (y) for 8 students.

x (hours) | y (mark)
----------|---------
2 | 35
3 | 40
4 | 50
5 | 55
6 | 60
7 | 65
8 | 70
9 | 75

a) Calculate Pearson's product-moment correlation coefficient (PMCC).
b) Find the equation of the least squares regression line of y on x in the form y = a + bx.
c) Interpret the value of b in the context of this problem.
d) Predict the mark for a student who revised for 5.5 hours.
Answer
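a) With x̄ = 5.5 and ȳ = 56.25, Sxx = 42, Syy = 1387.5 and Sxy = 240, so PMCC r = 240 / √(42 × 1387.5) ≈ 0.994.
b) b = Sxy / Sxx = 240 / 42 ≈ 5.71 and a = ȳ − b x̄ ≈ 24.82, so y = 24.82 + 5.71x.
c) For every additional hour of revision, the predicted test mark increases by approximately 5.71 marks.
d) Predicted mark ≈ 56.25. Since x = 5.5 is the mean of x, the line gives the mean mark, 450 / 8 = 56.25. Substituting the rounded values of a and b gives 56.2.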
The PMCC value of 0.994 indicates a very strong positive linear correlation between revision hours and test marks. The prediction for 5.5 hours is an interpolation, as it falls within the range of the observed data.
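The Example 3 calculation can be sketched in Python from the summary quantities Sxx, Syy and Sxy. The variable names and the `predict` helper are illustrative. The code keeps unrounded coefficients, so the last digit of a prediction may differ slightly from one worked out with rounded values of a and b.

```python
from math import sqrt

x = [2, 3, 4, 5, 6, 7, 8, 9]           # hours of revision
y = [35, 40, 50, 55, 60, 65, 70, 75]   # test marks
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = s_xy / sqrt(s_xx * s_yy)  # PMCC
b = s_xy / s_xx               # gradient of the regression line of y on x
a = y_bar - b * x_bar         # intercept: the line passes through the means


def predict(hours):
    """Predicted mark from the fitted line y = a + bx."""
    return a + b * hours


print(round(r, 3), round(a, 2), round(b, 2), round(predict(5.5), 2))
```

Calling `predict` with a value outside 2 to 9 hours would be extrapolation, so any such result should be treated with caution.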
Common mistakes
- ✗ Using frequency as the height of bars in a histogram instead of frequency density, or having gaps between bars for continuous data.
- ✗ Incorrectly calculating quartile positions, especially for discrete data or when interpolating for grouped data.
- ✗ Not checking for outliers, or removing them without justification or consideration of their impact on statistical measures.
- ✗ Confusing correlation with causation, assuming that a strong correlation between two variables means one causes the other.
- ✗ Extrapolating beyond the range of the original data for regression predictions without acknowledging the potential unreliability of such predictions.
Exam tips
- ★ Always show full working for calculations involving grouped data, PMCC, or regression, including intermediate sums and formulas used.
- ★ Ensure all axes on histograms and box plots are correctly labelled with units and appropriate scales.
- ★ When asked to interpret measures (e.g., mean, standard deviation, PMCC, regression coefficients), relate your answer back to the context of the problem.
- ★ Be aware of the limitations of statistical models, such as the assumption of linearity for PMCC and regression, and the dangers of extrapolation.
