Strand 1 — Statistics & Probability

Bivariate Data: Scatter Plots, Correlation, and Line of Best Fit

5th Year · 6th Year (Leaving Cert)

  • Understand and identify bivariate data and the roles of independent and dependent variables.
  • Construct and interpret scatter plots to visually represent relationships between two variables.
  • Describe the direction, strength, and form of a relationship shown in a scatter plot.
  • Calculate and interpret the correlation coefficient (r) using technology.
  • Draw and use a line of best fit to model linear relationships and make predictions.

Key concepts

Bivariate Data

Bivariate data involves two different variables for each observation. The primary goal of analysing bivariate data is to determine if there is a relationship or association between these two variables. For example, recording each student's study hours together with their exam score gives one pair of values per student.

Independent and Dependent Variables

When examining a relationship, one variable is often considered to influence or explain changes in the other. The independent variable (or explanatory variable) is the one that is thought to cause or influence the change, and it is typically plotted on the horizontal (x) axis. The dependent variable (or response variable) is the one that is affected by the independent variable, and it is plotted on the vertical (y) axis.

Scatter Plot

A scatter plot is a graph that displays bivariate data as a set of points. Each point on the plot represents a pair of values for the two variables for a single observation. Scatter plots are used to visually examine the direction, strength, and form (e.g., linear or non-linear) of the relationship between the two variables.

Correlation

Correlation describes the strength and direction of a linear relationship between two quantitative variables.

  • **Positive Correlation**: As the independent variable increases, the dependent variable also tends to increase. The points on the scatter plot generally rise from left to right.
  • **Negative Correlation**: As the independent variable increases, the dependent variable tends to decrease. The points on the scatter plot generally fall from left to right.
  • **No Correlation**: There is no apparent linear relationship between the variables. The points on the scatter plot show no clear pattern.

Correlation Coefficient (r)

The correlation coefficient (r), also known as Pearson's product-moment correlation coefficient, is a numerical measure of the strength and direction of the linear relationship between two quantitative variables.

  • The value of 'r' always lies between -1 and +1, inclusive (-1 ≤ r ≤ 1).
  • If r = +1, there is a perfect positive linear correlation.
  • If r = -1, there is a perfect negative linear correlation.
  • If r = 0, there is no linear correlation.
  • Values close to +1 or -1 indicate a strong linear relationship.
  • Values close to 0 indicate a weak or no linear relationship.
  • It only measures *linear* relationships; a strong non-linear relationship might have an r value close to 0.
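The notes obtain r with a calculator, but it can help to see what the calculator is computing. For reference, the standard definition of Pearson's coefficient for n paired observations is given below. The symbols \bar{x} and \bar{y} (the means of the x- and y-values) and the index i are notation introduced here for illustration, not taken from the notes.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when above-average x-values tend to pair with above-average y-values, which is why the sign of r matches the direction of the trend.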

Line of Best Fit (Regression Line)

The line of best fit is a straight line drawn through the centre of the points on a scatter plot, representing the general linear trend of the data. It is used to model the linear relationship between the variables and can be used for prediction.

  • **Interpolation**: Making predictions within the range of the observed data. These predictions are generally considered reliable.
  • **Extrapolation**: Making predictions outside the range of the observed data. These predictions are less reliable as the relationship may not hold true beyond the observed data.
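The distinction between interpolation and extrapolation comes down to a range check on the x-values. The sketch below illustrates this; the function name `prediction_type` and the sample study-hours list are hypothetical, made up only for this example.

```python
def prediction_type(observed_x, x_new):
    """Label a prediction at x_new as interpolation or extrapolation."""
    # Inside the observed range (endpoints included) counts as interpolation.
    if min(observed_x) <= x_new <= max(observed_x):
        return "interpolation"
    return "extrapolation"

# Hypothetical observed x-values (study hours), for illustration only.
study_hours = [1, 3, 4, 6, 8]

print(prediction_type(study_hours, 5))   # interpolation
print(prediction_type(study_hours, 12))  # extrapolation
```

A prediction at 12 hours lies beyond the largest observed value (8), so it should be treated with more caution than one at 5 hours.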

Key facts to remember

  1. Bivariate data involves two variables per observation, used to explore relationships.
  2. Scatter plots visually display bivariate data, revealing the direction, strength, and form of a relationship.
  3. The independent variable is plotted on the x-axis, and the dependent variable on the y-axis.
  4. Correlation describes the strength and direction of a *linear* relationship between two variables.
  5. The correlation coefficient 'r' ranges from -1 to +1. Values near ±1 indicate strong linear correlation; values near 0 indicate weak or no linear correlation.
  6. A positive 'r' means both variables tend to increase together; a negative 'r' means one tends to decrease as the other increases.
  7. The line of best fit models the linear trend in a scatter plot and is used for making predictions.
  8. Interpolation (predictions within the data range) is generally more reliable than extrapolation (predictions outside the data range).

Worked examples

Example 1

A researcher collected data on the number of hours students spent studying for a maths exam and their corresponding exam scores. The data is displayed in the scatter plot below. Describe the relationship between study hours and exam scores based on the scatter plot. (Imagine a scatter plot where the x-axis is 'Study Hours' (0-10) and the y-axis is 'Exam Score' (0-100). The points generally rise from left to right, showing a moderate to strong upward trend, but with some spread.)

I. Observe the general trend of the points on the scatter plot.
II. Determine if the points tend to go up or down from left to right to identify the direction.
III. Assess how closely the points cluster around a potential straight line to determine the strength.
IV. Identify if the relationship appears to be linear or non-linear.

Answer

The scatter plot shows a positive, moderately strong linear relationship between the number of hours spent studying and exam scores. As the number of study hours increases, the exam scores generally tend to increase. The points are somewhat clustered around a straight line, indicating a moderate strength in this linear association.

Always comment on direction (positive/negative), strength (strong/moderate/weak), and form (linear/non-linear) when describing a relationship from a scatter plot.

Example 2

The table below shows the average daily temperature (in °C) and the number of ice cream cones sold at a shop over 7 days.

| Temperature (°C) | Ice Cream Cones Sold |
|------------------|----------------------|
| 18 | 120 |
| 20 | 135 |
| 22 | 150 |
| 19 | 125 |
| 23 | 160 |
| 17 | 110 |
| 21 | 140 |

Calculate the correlation coefficient (r) and interpret its meaning.

I. Enter the data into the statistics mode of your scientific calculator (e.g., 'STAT' -> '2-VAR' or 'a+bx').
II. Input the temperature values into the x-list (independent variable) and the ice cream cone sales into the y-list (dependent variable).
III. Calculate the regression statistics (e.g., 'CALC' -> '2-VAR Stats' or 'LinReg(ax+b)').
IV. Identify the value of 'r' from the calculator output.
V. Interpret the value of 'r' in the context of the problem, commenting on both the strength and direction of the linear relationship.

Answer

1. Using a scientific calculator's statistics function, input the temperature data as the independent variable (x) and the ice cream cone sales as the dependent variable (y).
2. The calculated correlation coefficient is r ≈ 0.996.
3. Interpretation: This indicates a very strong positive linear correlation between the average daily temperature and the number of ice cream cones sold. As the temperature increases, the number of ice cream cones sold tends to increase.
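The calculator result can be cross-checked in a few lines of Python. This is a minimal sketch rather than the calculator's internal method: the function name `pearson_r` and the variable names are illustrative, and it applies the standard Pearson formula to the seven data pairs from the table.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

temperature = [18, 20, 22, 19, 23, 17, 21]         # x, in °C
cones_sold = [120, 135, 150, 125, 160, 110, 140]   # y

r = pearson_r(temperature, cones_sold)
print(round(r, 3))  # 0.996
```

Here the deviations give Sxy = 225, Sxx = 28 and Syy ≈ 1821.43, so r = 225 / √51000 ≈ 0.996.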

Ensure your calculator is set to display the correlation coefficient (r) in regression calculations. Some calculators require 'DiagnosticsOn' to be enabled in the settings.

Example 3

A company recorded the number of hours a new employee spent in training (x) and their productivity score (y) after one month. The data points are plotted on a scatter plot. Draw a line of best fit by eye and use it to estimate the productivity score of an employee who had 15 hours of training. (Imagine a scatter plot with points: (5, 40), (10, 55), (12, 60), (18, 75), (20, 80), (25, 90). The x-axis is 'Training Hours' (0-30), y-axis is 'Productivity Score' (0-100).)

I. Draw a straight line that passes through the 'centre' of the data points, ensuring roughly an equal number of points are above and below the line. The line should follow the general trend of the data.
II. To estimate for 15 hours of training, locate 15 on the x-axis.
III. Draw a vertical line from x=15 up to the line of best fit.
IV. From the point where the vertical line intersects the line of best fit, draw a horizontal line to the y-axis.
V. Read the value on the y-axis to get the estimated productivity score.

Answer

1. A line of best fit is drawn by eye, following the general upward trend of the points. It should pass through the approximate centre of the data, with roughly half the points above and half below.
2. Locate 15 on the x-axis (Training Hours).
3. Move vertically up from x=15 to the line of best fit.
4. Move horizontally from this point on the line to the y-axis (Productivity Score).
5. Reading the y-axis, the estimated productivity score for an employee with 15 hours of training is approximately 67.

(Note: The exact value may vary slightly depending on the line drawn by eye, but it should be close to this value. The mean of the six training times is 15 hours and the mean score is about 66.7, so a line through the centre of the data passes near the point (15, 67).)
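A by-eye line can be sanity-checked against a computed one. The sketch below uses the least-squares regression line, which is a different technique from drawing by eye, on the six points listed in the question. The function name `least_squares_line` is illustrative.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The least-squares line always passes through the mean point.
    intercept = mean_y - slope * mean_x
    return slope, intercept

training_hours = [5, 10, 12, 18, 20, 25]   # x
productivity = [40, 55, 60, 75, 80, 90]    # y

slope, intercept = least_squares_line(training_hours, productivity)
estimate = slope * 15 + intercept
print(round(slope, 2), round(intercept, 2), round(estimate, 1))  # 2.5 29.17 66.7
```

Since 15 hours lies between the smallest (5) and largest (25) observed training times, this estimate is an interpolation.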

When drawing a line of best fit by eye, use a ruler and ensure the line extends across the full range of the data on the x-axis. Show your prediction lines clearly on the graph.

Common mistakes

  • **Confusing correlation with causation**: A strong correlation does not automatically mean one variable causes the other. There might be a lurking variable or it could be coincidental.
  • **Misinterpreting the correlation coefficient**: Believing that r=0 implies no relationship at all, when it only means no *linear* relationship. A strong non-linear relationship can have an r value close to 0.
  • **Drawing the line of best fit inaccurately**: Not ensuring the line follows the general trend of the data or having a disproportionate number of points on one side of the line.
  • **Extrapolating too far**: Making predictions for values significantly outside the observed data range, which can lead to unreliable or inaccurate results.
  • **Incorrectly identifying variables**: Placing the dependent variable on the x-axis and the independent variable on the y-axis.
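The second mistake above can be seen with a small made-up data set. In this sketch the y-values are exactly the squares of the x-values, a perfect non-linear relationship, yet Pearson's r comes out as zero. The data and the function name `pearson_r` are illustrative assumptions, not taken from an exam question.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: y = x^2 on x-values symmetric about zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(abs(r) < 1e-12)  # True: no linear correlation, despite a perfect curve
```

The falling left half of the parabola cancels the rising right half, so the linear measure sees nothing even though y is completely determined by x.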

Exam tips

  • Always use a ruler when constructing scatter plots and drawing lines of best fit. Label your axes clearly with variable names and units.
  • When asked to describe a relationship from a scatter plot, always comment on three aspects: direction (positive/negative), strength (strong/moderate/weak), and form (linear/non-linear).
  • Be proficient in using your scientific calculator's statistics mode to calculate the correlation coefficient (r). Practice entering data and retrieving 'r' quickly and accurately.
  • Remember the key phrase: 'Correlation does not imply causation.' Be prepared to state this if a question asks about the implications of a strong correlation.
  • When using the line of best fit for prediction, clearly show your working on the graph by drawing lines from the axis to the line of best fit and then to the other axis.
