Regression#

This is a series of basic examples for running linear regression with StatWrap. It’s more typical to use a package like sklearn or statsmodels for regression, but StatWrap is written to imitate Google Sheets/Excel.

Beyond StatWrap, there are two well-known libraries for running regressions in Python: StatsModels and Scikit-Learn. Use StatWrap if you want to keep things simple. Use Scikit-Learn if you are interested in machine learning applications. Use StatsModels if you are focused on pure statistics (e.g., statistical inference) or want something closer to R. StatsModels is a more natural choice for this course than Scikit-Learn.

StatWrap#

Imports#

Run the typical three starter lines in a Google Colab (suggested) or Jupyter notebook.

!pip install --upgrade statwrap
import statwrap
%use_all

Example 1#

x = 1, 2, 3, 4, 5
y = 10, 30, 20, 40, 50

scatter_plot(x,y)

print('Correlation:', correl(x,y))
[Scatter plot of x versus y]
Correlation: 0.8999999999999999
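As a quick cross-check (not something StatWrap requires), the same correlation can be computed with NumPy, which comes preinstalled in Colab. np.corrcoef returns a 2×2 correlation matrix, so we pick out the off-diagonal entry.

import numpy as np

# the [0, 1] entry of the 2x2 correlation matrix is corr(x, y)
np.corrcoef(x, y)[0, 1]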
# Direct Estimation

regression_line = linest(y, x)
regression_line
\[\hat{y} = 3 + 9x\]
# predict y for x = 1 and x = 2

regression_line(1), regression_line(2)
(np.float64(11.999999999999986), np.float64(20.999999999999986))
# Use correlation

correlation = correl(x,y)
sd_x = sd(x)
sd_y = sd(y)

slope = correlation * sd_y/sd_x

# y_average = intercept + slope*x_average 
intercept = average(y) - slope*average(x)

print(f'predicted Y = {intercept} + {slope}*X')
predicted Y = 3.0 + 9.0*X
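For the record, here is the arithmetic those lines carry out. The SDs themselves are not printed above, but their ratio for this data works out to 10:

\[\text{slope} = r \times \frac{SD_y}{SD_x} = 0.9 \times 10 = 9\]
\[\text{intercept} = \bar{y} - \text{slope} \times \bar{x} = 30 - 9 \times 3 = 3\]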

Example 2 - Read from CSV#

Now we use a package called pandas to import data. Pandas does not have to be explicitly imported because of the %use_all command we used earlier. Later, we use a package called matplotlib to plot a regression line. Again, matplotlib doesn’t have to be explicitly imported.

url = 'https://raw.githubusercontent.com/alexanderthclark/Stats1101/main/Data/FatherSonHeights/pearson.csv'

# load a DataFrame
df = pd.read_csv(url)

# inspect first five rows
df.head()
   Father   Son
0    65.0  59.8
1    63.3  63.2
2    65.0  63.3
3    65.8  62.8
4    61.1  64.3
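Before regressing, it can be worth a quick look at the size of the data. This is plain pandas, nothing StatWrap-specific:

# number of rows and columns, plus the column names
df.shape, list(df.columns)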
father = df.Father
son = df.Son

scatter_plot(father, son,
             xlabel = "Father's Height",
             ylabel = "Son's Height",
             alpha = 0.2) # reduce opacity of points
[Scatter plot of Father's Height versus Son's Height]
reg_line = linest(son, father) # y then x
reg_line
\[\hat{y} = 33.893 + 0.514x\]
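As in Example 1, the same line can be recovered from the correlation and the SDs. This is a sketch using the same StatWrap functions as before, assuming correl, sd, and average accept pandas columns the way scatter_plot and linest do; the slope and intercept should match the 0.514 and 33.893 above.

r = correl(father, son)
slope = r * sd(son) / sd(father)
intercept = average(son) - slope * average(father)

print(f'predicted son height = {intercept:.3f} + {slope:.3f} * father height')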
scatter_plot(father, son,
             xlabel = "Father's Height",
             ylabel = "Son's Height",
             alpha = 0.2, # reduce opacity of points
             show = False) # tell Python to let us add more stuff to the plot

# Add regression line using matplotlib
x_range = father.min(), father.max()
y_predictions = reg_line(x_range) 
plt.plot(x_range, y_predictions, color = 'black')
plt.show()
[Scatter plot with the fitted regression line added]
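The fitted line can also be used for a single prediction. For example, plugging in a 68-inch father gives roughly 33.893 + 0.514 × 68 ≈ 68.8 inches (the displayed coefficients are rounded, so the exact value may differ slightly):

# predicted son's height for a father who is 68 inches tall
reg_line(68)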

StatsModels#

import statsmodels.api as sm

# data
x = (1,2,3,4,5)
y = (1,4,3,4,10)

# Add a constant to the model (intercept)
X = sm.add_constant(x)

# Fit the model
model = sm.OLS(y, X).fit()
#model.summary()
# coefficients for intercept, slope
model.params
array([-1. ,  1.8])
residuals_sm = model.resid
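Two quick sanity checks using attributes that statsmodels exposes on the fitted results: with an intercept in the model, the residuals should sum to (numerically) zero, and model.rsquared reports the R² of the fit.

print('sum of residuals:', residuals_sm.sum())
print('R-squared:', model.rsquared)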

StatsModels for R Users#

import statsmodels.formula.api as smf
import pandas as pd

data = pd.DataFrame({"x": x, "y": y})
model_smf = smf.ols(formula='y ~ x', data=data).fit()
residuals_smf = model_smf.resid
#model_smf.summary()
# coefficients for intercept, slope
model_smf.params
Intercept   -1.0
x            1.8
dtype: float64
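The formula interface fits the same OLS model as sm.OLS above, so the two sets of coefficients should agree. A quick check with NumPy (imported here just for the comparison):

import numpy as np

# same model, same data, so the two coefficient vectors agree
np.allclose(model.params, model_smf.params)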

Scikit-Learn#

from sklearn.linear_model import LinearRegression
import numpy as np

# Create a linear regression model
model_sk = LinearRegression()

# convert x and y to arrays
x_sk = np.array(x).reshape(-1,1)
y_sk = np.array(y)

# Run the regression
# note the order of x and y is different
model_sk.fit(x_sk, y_sk)

# Get the coefficients
print(f"Slope: {model_sk.coef_}")
print(f"Intercept: {model_sk.intercept_}")
Slope: [1.8]
Intercept: -0.9999999999999991
# Predictions
y_pred = model_sk.predict(x_sk)

# Calculate residuals
residuals_sk = y_sk - y_pred
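All three libraries fit the same least-squares line to the same five points, so their residuals should agree up to floating-point error:

# compare scikit-learn residuals with the statsmodels residuals from earlier
np.allclose(residuals_sk, residuals_sm)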