Non-Linear Regression

Many relationships that you may encounter between two variables, such as square-law and exponential, might not be linear. Non-linear regression is a technique for finding a curve of best fit for these types of data.

Compared to linear regression, the calculations for curves are more complicated. For these types of problems, you will most likely use a graphing calculator or a statistical program to calculate the best-fit curve of a given type. These tools can also calculate the coefficient of determination, \(r^2\), which is a useful measure of how closely a curve fits the data.


Coefficient of Determination

We've studied before that the correlation coefficient, \(r\), is a measure of the linearity of the data, so it can indicate only how closely a straight line fits the data. Because it assumes a straight-line relationship, it is not always the best choice for non-linear regression. This is why we use the coefficient of determination, \(r^2\). It is defined so that it applies to any type of regression curve, so it measures how well a regression model fits the data even when the model is non-linear.

The formula we use for the coefficient of determination is:

\(r^2 = \cfrac{\sum (y_{\text{est}} - \bar{y})^2}{\sum (y - \bar{y})^2}\)

The coefficient of determination can have values from 0 to 1. The \(r^2\) value indicates how well your regression model explains the variability in the dependent variable, \(y\), based on the independent variable, \(x\).

If the curve is a perfect fit, then \(y_{\text{est}}\) and \(y\) will be the same for each value of \(x\). This tells us that the change in \(x\) accounts for all of the variation in \(y\), so \(r^2 = 1\).

If the curve is a poor fit, the total of \( (y_{\text{est}} - \bar{y})^2 \) will be much smaller than the total of \((y - \bar{y})^2\), since the change in \( x \) will account for only a small part of the total variation in \(y\). Therefore, \(r^2\) will be close to \(0\). For any given type of regression, the curve of best fit will be the one that has the highest value for \(r^2\).

In simple terms, \(r^2\) measures the goodness of fit of a regression model: a value closer to 1 means the model does a good job of representing the data, and a value closer to 0 means the model does a bad job of representing the data.

Note: A high \(r^2\) does NOT tell you if the model is correct. Correlation ≠ Causation.
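
The formula above can be sketched in a few lines of Python. This is only a minimal illustration, not a replacement for a graphing calculator or statistics program; the function name `r_squared` and the sample numbers are made up for this example, and the formula is meant to be used with estimates that come from a best-fit curve.

```python
def r_squared(y, y_est):
    """Coefficient of determination, using the formula from this section:
    sum((y_est - y_mean)^2) / sum((y - y_mean)^2)."""
    y_mean = sum(y) / len(y)
    explained = sum((ye - y_mean) ** 2 for ye in y_est)
    total = sum((yi - y_mean) ** 2 for yi in y)
    return explained / total

# Hypothetical observed values, made up for illustration only.
y_observed = [2.0, 4.0, 6.0, 8.0]

# Perfect fit: every estimate equals the observed value.
perfect = r_squared(y_observed, y_observed)
print(perfect)  # 1.0

# Poor fit: the estimates barely move away from the mean (5.0),
# so the numerator is tiny compared to the denominator.
y_poor = [4.9, 5.0, 5.0, 5.1]
poor = r_squared(y_observed, y_poor)
print(poor)
```

The second value printed is very close to 0, matching the description of a poor fit above.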


Given the quadratic expression \(y = 12x^2 + 3x + 5\), determine \( y_{\text{est}} \) when \(x = 3\).

In order to determine \( y_{\text{est}} \) when \(x = 3\), we can substitute the corresponding value into the formula:

\(y_{\text{est}} = 12(3)^2 + 3(3) + 5\)

\(y_{\text{est}} = 122\)

Therefore, we can determine that \( y_{\text{est}}\) is 122 when \(x = 3\).
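
The same substitution can be checked with a short Python sketch. The function name `y_est` is simply chosen here to mirror the notation \(y_{\text{est}}\).

```python
def y_est(x):
    # Quadratic model from the example: y = 12x^2 + 3x + 5
    return 12 * x**2 + 3 * x + 5

result = y_est(3)
print(result)  # 122
```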


Types of Non-Linear Regression

There are several types of non-linear regression models, each suited for different kinds of relationships between the dependent variable \(y\) and the independent variable \(x\).

Below are some of the most common types of non-linear regression models:

  • Exponential Regressions produce equations with the form:

    \(y = ab^x\)

    Or with the form:

    \(y = ae^{kx}\)

    Where \(e \approx 2.718\,28\ldots\) is an irrational number commonly used as the base for exponents and logarithms.


  • Power Regression produces equations with the form $$ y = ax^b $$.
  • Polynomial Regression is used when the relationship between \( x \) and \( y \) follows a curved pattern. Instead of a straight-line equation like \( y = mx + b \), polynomial regression introduces higher-degree terms: $$ y = a_0 + a_1x + a_2x^2 + a_3x^3 + \dots + a_nx^n $$.
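
To give a feel for what a calculator or statistics program does for an exponential regression of the form \(y = ab^x\), here is a rough Python sketch. It assumes one common approach: take the natural logarithm of each \(y\)-value, fit a straight line to the points \((x, \ln y)\), and convert back. The function name `fit_exponential` and the data are hypothetical, and a particular calculator or program may use a different method.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * b**x by fitting a straight line to (x, ln y).
    Assumes every y-value is positive."""
    n = len(xs)
    log_ys = [math.log(y) for y in ys]
    x_mean = sum(xs) / n
    ly_mean = sum(log_ys) / n
    # Least-squares slope and intercept of the line ln(y) = slope*x + intercept
    slope = (sum((x - x_mean) * (ly - ly_mean) for x, ly in zip(xs, log_ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = ly_mean - slope * x_mean
    # ln(y) = ln(a) + x*ln(b), so a = e^intercept and b = e^slope
    return math.exp(intercept), math.exp(slope)

# Hypothetical data generated exactly from y = 3 * 2**x
xs = [0, 1, 2, 3, 4]
ys = [3, 6, 12, 24, 48]

a, b = fit_exponential(xs, ys)
print(a, b)
```

Because the sample data follows \(y = 3 \cdot 2^x\) exactly, the values of `a` and `b` come out very close to 3 and 2.
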

A researcher is studying the growth of a plant over time. They collect the following data on the plant's height (in cm) on different days and organize the data in a table. By analyzing the trend between the days \(x\) and the heights \(y\), would a linear regression model work to display this data?

Day (x)    Height in cm (y)
1          2
2          5
3          10
4          17
5          26

In order to determine if a linear regression model would work to display this data, we can calculate the slope between two different pairs of points. If the relationship were linear, both slopes would be the same.

First, we can use the values for Days 1 and 2 to determine the first slope:

\(m_1 = \cfrac{5-2}{2-1}\)

\(m_1 = \cfrac{3}{1} = 3\)

Next, we can use the values for Days 2 and 3 to determine the second slope:

\(m_2 = \cfrac{10-5}{3-2}\)

\(m_2 = \cfrac{5}{1} = 5\)

Since \(m_1 \neq m_2\), the slope is not constant. Therefore, we can determine that a linear regression model wouldn't be the best way to represent the data; the \(y\)-value increases by a different amount each day.
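
The slope check can be extended to every pair of consecutive days, and the coefficient of determination can be used to compare two candidate models, as in the Python sketch below. The quadratic \(y = x^2 + 1\) is a candidate chosen here by inspecting the table (it reproduces every height exactly), and the helper name `r_squared` is made up for this example; it applies the \(r^2\) formula from earlier in this lesson.

```python
days = [1, 2, 3, 4, 5]
heights = [2, 5, 10, 17, 26]

# Slope between each pair of consecutive days.
slopes = [(heights[i + 1] - heights[i]) / (days[i + 1] - days[i])
          for i in range(len(days) - 1)]
print(slopes)  # [3.0, 5.0, 7.0, 9.0]

def r_squared(y, y_est):
    # r^2 = sum((y_est - y_mean)^2) / sum((y - y_mean)^2)
    y_mean = sum(y) / len(y)
    return (sum((ye - y_mean) ** 2 for ye in y_est)
            / sum((yi - y_mean) ** 2 for yi in y))

# Least-squares straight line y = m*x + c for the data.
n = len(days)
x_mean = sum(days) / n
y_mean = sum(heights) / n
m = (sum((x - x_mean) * (y - y_mean) for x, y in zip(days, heights))
     / sum((x - x_mean) ** 2 for x in days))
c = y_mean - m * x_mean
line_est = [m * x + c for x in days]

# Candidate quadratic model y = x^2 + 1 (chosen by inspection).
quad_est = [x**2 + 1 for x in days]

r2_line = r_squared(heights, line_est)
r2_quad = r_squared(heights, quad_est)
print(round(r2_line, 4))  # 0.9626
print(r2_quad)            # 1.0
```

The slopes keep increasing, which confirms that the data is not linear. The straight line still has a fairly high \(r^2\), but the quadratic has the higher value, so it is the better fit of the two.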





Try these questions: