Scatter Plots and Linear Correlation

A scatter plot graphs an independent variable against a dependent variable to show whether the two are related. Instead of one-variable analysis, we use two-variable analysis.

Fig.1 - Standard Scatter Plot

As you can see, there are multiple points all over the place! Unlike graphs we've already studied, such as box plots, bar graphs, histograms, and pie charts, scatter plots use two variables, \(x\) (independent variable) and \(y\) (dependent variable), to plot the data.

Scatter plots are used in applications where two pieces of data need to be compared, for example, a group of students' study time versus their test scores in an English class.


Correlation and Types of Correlation

When we use scatter plots, we want to look at how the two variables are related to each other. This relationship is known as their correlation. To analyze the relationship between two variables, we create a linear regression model, which allows us to make predictions using the line of best fit, also called the regression line. This line represents how the data is trending. We normally use graphing software to find it.

Note: Remember, the equation of a line is \(y = mx + b\), where \( m \) is the slope, and \(b\) is the y-intercept.

Fig.2 - Scatter plot with a line of best fit

We read the graph from left to right. In this case, the line of best fit trends upwards, indicating a positive trend between the two variables. We will cover this topic in more depth in the next lesson.

Note: Correlation \(\neq\) Causation. Just because there is a relationship doesn't mean that one causes the other.

Variables have a linear correlation if changes in one variable tend to be proportional to changes in the other.

There are three types of correlation:

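  • Positive Correlation: As \(x\) increases, \(y\) increases. The correlation coefficient is positive, \(r > 0\).

  • Negative Correlation: As \(x\) increases, \(y\) decreases. The correlation coefficient is negative, \(r < 0\).

  • No Correlation: There is no clear relationship between \(x\) and \(y\) on the graph. This could be shown with randomly dispersed points. The correlation coefficient is close to zero, \(r \approx 0\).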


To develop a measure of correlation, mathematicians first defined the covariance of two variables in a sample:

\(s_{XY} = \cfrac{1}{n-1} \sum (x - \bar{x})(y - \bar{y})\)

Where:

  • \(n\) is the size of the sample
  • \(x\) represents individual values of the variable \(X\)
  • \(y\) represents individual values of the variable \(Y\)
  • \(\bar{x}\) represents the mean of \(X\)
  • \(\bar{y}\) represents the mean of \(Y\)

The capital letters denote the variables used when plotting a graph.

Note: Remember that the symbol sigma means "the sum of."

Thus, the covariance is the sum of the products of the deviations of \(x\) and \(y\) for all of the data points, divided by \(n-1\).
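As a rough illustration, the covariance formula can be translated into a few lines of Python. The function name `sample_covariance` and the five data points are made up for this sketch and do not come from the lesson.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum((x - x_bar) * (y - y_bar)) / (n - 1)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n  # mean of X
    y_bar = sum(ys) / n  # mean of Y
    # Sum of the products of the deviations for every data point.
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical data, chosen only to keep the arithmetic easy to follow.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
s_xy = sample_covariance(xs, ys)
print(s_xy)  # 1.5
```

Here the means are 3 and 4, the products of the deviations add up to 6, and dividing by \(n - 1 = 4\) gives 1.5.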

The correlation coefficient, \(r\), is the covariance divided by the product of the standard deviations for \(X\) and \(Y\):

\(r = \cfrac{s_{XY}}{s_X \times s_Y}\)

Where:

  • \(s_X\) is the standard deviation of \(X\)
  • \(s_Y\) is the standard deviation of \(Y\)

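The correlation coefficient ranges from \(-1\) to \(1\). Its sign gives the direction of the relationship, and its magnitude tells us how closely the data follow a linear relationship: values near \(-1\) or \(1\) indicate a strong linear correlation, and values near \(0\) indicate a weak one.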
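The definition of \(r\) can be sketched in Python as follows. This is a minimal version that assumes sample standard deviations (dividing by \(n - 1\), to match the covariance formula); the function name and the data are illustrative only.

```python
import math


def correlation_coefficient(xs, ys):
    """r = s_XY / (s_X * s_Y), using sample (n - 1) statistics throughout."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance of X and Y.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    # Sample standard deviations of X and Y.
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


# Hypothetical data, not taken from the lesson.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation_coefficient(xs, ys)
print(round(r, 4))  # 0.7746
```

A value of about 0.77 is positive and fairly close to 1, which suggests a moderately strong positive linear correlation for this made-up data.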


A student models a relationship between the time he studies and his grade.

What type of correlation does the data represent?

Study Time \(x\)    Grade \(y\)
1                   50
2                   60
3                   70
4                   80
5                   90

By analyzing the relationship between \(x\) and \(y\), we can see that, as \(x\) increases, \(y\) also increases. Therefore, it's a positive correlation.
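To put a number on this, the correlation coefficient for the table can be computed with the covariance and standard deviation formulas from earlier. The sketch below uses the generic names `xs` and `ys` for the two columns. Because every step of 1 in \(x\) adds exactly 10 to \(y\), the points lie on a straight line and \(r\) should come out as 1, up to floating-point rounding.

```python
import math

xs = [1, 2, 3, 4, 5]       # first column of the table
ys = [50, 60, 70, 80, 90]  # second column of the table

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sample covariance and sample standard deviations (all divide by n - 1).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

r = s_xy / (s_x * s_y)
print(round(r, 6))  # 1.0
```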


Linear Regression

Linear regression models the relationship between two variables with a linear function. As previously mentioned, the graph of this function is the line of best fit. Regression also measures the residual of each data point, which is the vertical distance between that point and the line of best fit in a scatter plot.

Scatter plot with dashed lines connecting points to the best-fit line.
Fig.3 - Scatter plot displaying the residual measures for each of the respective points

In the above image, dark dashed lines connect each point to the line of best fit. Each dashed line is the residual for that point. The residuals measure the error in the model, so ideally the sum of all residuals should be zero, and the sum of the squares of the residuals should be as small as possible. This criterion is called the least-squares fit.
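The following sketch shows why both conditions matter. It compares two candidate lines for the same small data set; the data, the two lines, and the helper names are all hypothetical. Both lines have residuals that sum to roughly zero, but one has a noticeably smaller sum of squared residuals (about 2.4 versus 4), so it is the better fit under the least-squares criterion.

```python
def residuals(xs, ys, m, b):
    """Signed vertical distances y - (m*x + b) from each point to the line."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]


def sum_of_squares(values):
    """Sum of the squared values."""
    return sum(v * v for v in values)


# Hypothetical data and two hypothetical candidate lines (slope, intercept).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
candidates = {"line A": (0.6, 2.2), "line B": (1.0, 1.0)}

results = {}
for name, (m, b) in candidates.items():
    res = residuals(xs, ys, m, b)
    results[name] = (sum(res), sum_of_squares(res))
    print(name, "sum of squared residuals:", round(results[name][1], 4))
```

Both lines pass through the point of means \((3, 4)\), which is why their residuals cancel out; only the squared residuals tell them apart.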

For the line of best fit in the least-squares method, it can be shown that the line has the equation:

\(y = mx + b, \quad \text{where} \quad m = \cfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \quad \text{and} \quad b = \bar{y} - m\bar{x}\)

Where \(\bar{x}\) is the mean of \(x\) and \(\bar{y} \) is the mean of \(y\).
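As a minimal sketch, the two formulas translate directly into Python. The function name `least_squares_line` and the sample data are assumptions made for illustration.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal, so the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = sum_y / n - m * sum_x / n  # b = y_bar - m * x_bar
    return m, b


# Hypothetical data, not taken from the lesson.
m, b = least_squares_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(m, 4), round(b, 4))  # 0.6 2.2
```

For this data the sums are \(\sum x = 15\), \(\sum y = 20\), \(\sum xy = 66\), and \(\sum x^2 = 55\), so \(m = 30/50 = 0.6\) and \(b = 4 - 0.6 \times 3 = 2.2\).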


Given the following table, answer these questions:

  1. Calculate the mean of \(X\) and \(Y\)
  2. Calculate the slope, \(m\), for the line of best fit
  3. Determine the y-intercept, \(b\)
  4. Determine the equation for the line of best fit

Study Time (Hours) \(X\)    Test Score (%) \(Y\)
1                           55
3                           65
5                           75
7                           85
9                           92

i. In order to determine the means of \(x\) and \(y\), we can divide the sum of the terms by the number of terms.

First, we can find the mean of \(x\):

\(\bar{x} = \cfrac{1+3+5+7+9}{5}\)

\(\bar{x} = \cfrac{25}{5}\)

\(\bar{x} = 5\)

Next, we can find the mean of \(y\):

\(\bar{y} = \cfrac{55+65+75+85+92}{5}\)

\(\bar{y} = \cfrac{372}{5}\)

\(\bar{y} = 74.4\)

Therefore, we can determine that the means of \(x\) and \(y\) are 5 and 74.4 respectively.


ii. In order to determine the slope, \(m\), for the line of best fit, we can use the following formula:

\(m = \cfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}\)

Next, we can plug in the values and solve for \(m\):

\(m = \cfrac{5(1\cdot55 + 3\cdot65 + 5\cdot75 + 7\cdot85 + 9\cdot92) - (1+3+5+7+9)(55+65+75+85+92)}{5(1^2 + 3^2 + 5^2 + 7^2 + 9^2) - (1+3+5+7+9)^2}\)

\(m = \cfrac{5(2048) - (25)(372)}{5(165) - (25)^2}\)

\(m = \cfrac{10240 - 9300}{825 - 625}\)

\(m = \cfrac{940}{200}\)

\(m = 4.7\)

Therefore, we can determine that the slope for the line of best fit is 4.7.
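The intermediate sums are easy to get wrong by hand, so a quick check in Python may help. This sketch simply recomputes each sum from the table; the variable names are arbitrary.

```python
hours = [1, 3, 5, 7, 9]        # X column of the table
scores = [55, 65, 75, 85, 92]  # Y column of the table

n = len(hours)
sum_x = sum(hours)
sum_y = sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)
print(sum_x, sum_y, sum_xy, sum_x2)  # 25 372 2048 165

# Slope formula for the line of best fit.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
print(m)  # 4.7
```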


iii. In order to determine the y-intercept, \(b\), we can use the following formula:

\(b = \bar{y} - m\bar{x}\)

We can then plug in the values and calculate to determine the y-intercept:

\(b = 74.4 - (4.7 \times 5)\)

\(b = 74.4 - 23.5\)

\(b = 50.9\)

Therefore, we can determine that the y-intercept is 50.9.


iv. The equation of the best-fit line is:

\(y = 4.7x + 50.9\)

This means that for every 1 additional hour of study, the model predicts that the test score increases by 4.7 points on average.
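One way to double-check the whole example is to recompute the slope and intercept directly from the table and then use the resulting line. The sketch below does that, checks that the residuals sum to roughly zero, and predicts a score for 6 hours of study. The 6-hour input and the helper name `predict` are hypothetical and only illustrate how the equation is used.

```python
hours = [1, 3, 5, 7, 9]        # X column of the table
scores = [55, 65, 75, 85, 92]  # Y column of the table

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)

# Slope and intercept from the least-squares formulas.
m = (n * sum_xy - sum(hours) * sum(scores)) / (n * sum_x2 - sum(hours) ** 2)
b = y_bar - m * x_bar


def predict(x):
    """Predicted test score for x hours of study, using y = m*x + b."""
    return m * x + b


# Residual = actual score minus predicted score for each table row.
residuals = [y - predict(x) for x, y in zip(hours, scores)]

print("slope:", round(m, 2), "intercept:", round(b, 2))
print("predicted score for 6 hours:", round(predict(6), 1))
print("residuals sum to about zero:", abs(sum(residuals)) < 1e-9)
```

Keep in mind that a prediction like this is only reasonable for study times within or near the range of the data (1 to 9 hours).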