Linear Regression

Linear regression is a mathematical method for determining the relationship between a dependent variable and an independent variable. When the two variables have a linear correlation, you can build a simple mathematical model of their relationship by finding a line of best fit: a line that displays the general trend of a set of data points on a graph.

But how do we know whether the line of best fit is a good estimate of the trend? We could simply eyeball the graph, but there is also an analytical approach using the graph's residual values and a least-squares fit.


Residuals

A residual is the vertical distance between a data point and the line of best fit. Residuals matter because they indicate how well our line of best fit reflects the data.

The formula for residuals is:

\(\text{Residual} = \text{Actual}\; Y - \text{Predicted}\; Y\)

  • \(\text{Actual}\; Y\) is the observed data point on the graph
  • \(\text{Predicted}\; Y\) is the corresponding point on the regression line, where the residual line and the regression line intersect

The magnitude of the residual tells us how well the line of best fit reflects the data.

  • If residuals are small, the model is making accurate predictions
  • If residuals are large, the model is not fitting the data well

The sign of the residual also matters in this context:

  • If the residual is positive, the actual value is higher than the predicted value (the model underestimates)
  • If the residual is negative, the actual value is lower than the predicted value (the model overestimates)
  • If the residual is equal to 0, the prediction is exactly correct for that value

Fig.1 - Scatter plot with dashed residual lines connecting each data point to its corresponding point on the line of best fit

It's important to watch out for outliers in your graph: values that deviate drastically from the line of best fit. Because these points can distort the line of best fit, you can remove the outlier and recalculate the regression line.
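The residual formula and the sign interpretation above can be sketched in a few lines of Python (the example points are hypothetical):

```python
def residual(actual_y: float, predicted_y: float) -> float:
    """Residual = actual Y - predicted Y."""
    return actual_y - predicted_y

# Hypothetical points to illustrate each case:
print(residual(3.0, 1.0))  # 2.0  -> positive: the model underestimates
print(residual(2.0, 2.5))  # -0.5 -> negative: the model overestimates
print(residual(5.0, 5.0))  # 0.0  -> the prediction is exactly correct
```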


Assuming the actual value is \(y = 3\) and the predicted value is \(y = 1\), determine the residual value for this point on the graph:

In order to determine the residual value, we can use the following formula:

\(\text{Residual} = \text{Actual}\; Y - \text{Predicted}\; Y\)

Using the above formula, we can compute the following:

\(\text{Residual} = 3 - 1\)

\(\text{Residual} = 2\)

Therefore, the residual for this point is 2.


Least-Squares Fit

The least-squares fit is a mathematical method used to find the best-fit line for a set of data points. Ideally, we want the sum of all residuals to be zero and the sum of the squares of the residuals should be as small as possible.

The sum of squared residuals (\(SSR\)) can be calculated as follows:

\(SSR = \sum (y_{\text{actual}} - y_{\text{predicted}})^2\)

  • \(y_{\text{actual}}\) is the observed value from the dataset.
  • \(y_{\text{predicted}}\) is the predicted value from the regression line.
  • \((y_{\text{actual}} - y_{\text{predicted}})\) is the residual for each data point.
  • The summation symbol (\(\sum\)) means we add up all of the squared residuals in the dataset.
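The SSR formula translates directly into code. Here is a minimal sketch, where the two lists are hypothetical example values:

```python
def sum_squared_residuals(actual, predicted):
    """SSR = sum of (y_actual - y_predicted)^2 over all data points."""
    return sum((ya - yp) ** 2 for ya, yp in zip(actual, predicted))

actual = [3.0, 5.0, 7.0]     # hypothetical observed values
predicted = [2.5, 5.5, 6.0]  # hypothetical values from a regression line
print(sum_squared_residuals(actual, predicted))  # 0.25 + 0.25 + 1.0 = 1.5
```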

Although the algebra looks daunting, the equation of this line can be determined using:

\(y = mx + b, \quad \text{where} \quad m = \cfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \quad \text{and} \quad b = \bar{y} - m\bar{x}\)

Where \(\bar{x}\) is the mean of \(x\) and \(\bar{y} \) is the mean of \(y\).

Note: You might also see this referenced as \(y = ax + b \). It's the same equation, with \(a\) as the slope.
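The slope and intercept formulas above can be implemented directly. This is a sketch of the computation rather than a production routine; the sample points lie exactly on \(y = 2x\), so the result is easy to check:

```python
def least_squares_fit(xs, ys):
    """Return (m, b) for the least-squares line y = mx + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * (sum_x / n)  # b = y-bar - m * x-bar
    return m, b

# Points that lie exactly on y = 2x, so m should be 2 and b should be 0:
print(least_squares_fit([1, 2, 3], [2, 4, 6]))  # (2.0, 0.0)
```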


Interpolating and Extrapolating

Once you have the line of best fit, you can use its equation to make predictions by interpolating and extrapolating the data.

Interpolating means predicting values that lie within the domain and range of the collected data. In simple terms, we are predicting values anywhere between the first data point and the last data point of our scatter plot.

Extrapolating means predicting values that lie outside the domain or range, beyond the data we've collected. When extrapolating, we usually present our predictions with a disclaimer such as “If the model continues this way, then…”, because we have no observed trend beyond the data to guide us, so we can't be sure what will happen.
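The distinction amounts to a simple check against the range of collected \(x\)-values. Here is a short sketch, where the fitted line and the data range are hypothetical:

```python
def predict(m, b, x):
    """Evaluate the line of best fit y = mx + b at x."""
    return m * x + b

m, b = 0.5, 1.0        # hypothetical fitted line
x_min, x_max = 0, 10   # hypothetical range of collected x-values

for x in (4, 15):
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    print(f"x={x}: y={predict(m, b, x)} ({kind})")
```

Predicting at \(x = 4\) is interpolation (inside the data range), while \(x = 15\) is extrapolation and should carry the disclaimer above.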


A cyclist modelled the relationship between the time he cycles, in minutes, and the distance he covers, in miles, in the following table.

  1. What are the independent and dependent variables?
  2. How far does he travel in 12 minutes?

Time He Cycles (Minutes)    Distance He Covers (Miles)
10                          4.0
20                          5.9
30                          7.0
40                          8.5
50                          9.8

i. The independent variable (or \(x\)-value) is the time he cycles in minutes, and the dependent variable (or \(y\)-value) is the distance he covers in miles.


ii. First, we can sketch a graph to more easily determine a formula to represent the cyclist's distance:

Fig.2 - Scatter plot of the cyclist's time and distance with the line of best fit

Based on this graph, we can represent the distance algebraically as such:

\(y = 0.142x + 2.78\)

In order to determine the cyclist's distance after \(12\) minutes, we can substitute this value into the equation:

\(y = 0.142x + 2.78\)

\(y = 0.142 (12) + 2.78\)

\(y = 4.484\)

Therefore, the cyclist's distance after \(12\) minutes is approximately \(4.5\) miles.
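As a check, we can apply the slope and intercept formulas from the least-squares section to the table's data in Python:

```python
times = [10, 20, 30, 40, 50]       # time cycled (minutes)
dists = [4.0, 5.9, 7.0, 8.5, 9.8]  # distance covered (miles)

n = len(times)
sum_x, sum_y = sum(times), sum(dists)
sum_xy = sum(x * y for x, y in zip(times, dists))
sum_x2 = sum(x * x for x in times)

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * (sum_x / n)

print(round(m, 3), round(b, 2))  # 0.142 2.78
print(round(m * 12 + b, 3))      # 4.484 -> about 4.5 miles
```

This reproduces the equation \(y = 0.142x + 2.78\) and the prediction of roughly \(4.5\) miles at \(12\) minutes.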