Regression#
Use the LINEST
formula for linear regression.
Sample usage:
LINEST(A2:A10, B2:B10)
for simple linear regression, predicting A (y) from B (x)LINEST(A2:A10, B2:B10, True, True)
for simple linear regression with verbose output for significance testing
Simple Linear Regression#
Finding the Regression Line#
Suppose, like Galton, we want to predict a son’s height from his father’s height ([FPP07] p. 170).
We’re looking for the equation
where \(\hat{y}\) is the predicted father’s height and \(x\) is the son’s height. The data is organized in this spreadsheet, with Fathers’ heights stored in A2:A1079
and sons’ in B2:B1079
.
Using =LINEST(B2:B1079, A2:A1079)
, we find
The intercept is always the value furthest to the right in the output. The next value to the left is the slope.
Predictions and Residuals#
For a father of height 75.4 (the tallest father in the data), the predicted son’s height is \(\hat{y} = 0.51 \times 75.4 + 33.89 \approx 72.3\).[1]
In a separate tab, we calculate the predictions. For the observation in the second row, this is done with =$G$2 + $F$2*A2
. G2
references the intercept from the LINEST
output and F2
references the slope. We use absolute referencing, $F$2
and $G$2
, for the coefficients so we can copy and paste the formula without the coefficients changing. If the formula is pasted into the cell below, the coefficient references are frozen, but A2
is replaced by A3
, giving the prediction for the next observation in row three. These predictions are stored in column C
.
Next, residuals (or prediction errors) are calculated with the formula B2 - C2
entered in D2
and then pasted down the rest of column D
. A useful sanity check is that the average residual should be zero (for the observations used to find the regression line). This helps ensure that no mistakes were made in finding your predictions.
The root mean square error is calculated in F14
as the square root of the average of the squared residuals.