Regression#

This is a series of basic examples for running linear regression with StatWrap. It’s more typical to use a package like sklearn or statsmodels for regression, but StatWrap is written to imitate Google Sheets/Excel.

Beyond StatWrap, there are two well-known libraries for running regressions in Python: StatsModels and Scikit-Learn. Use StatWrap if you want to keep things simple. Use Scikit-Learn if you are interested in machine learning applications. Use StatsModels if you are focused on pure statistics (e.g., statistical inference) or want something closer to R. StatsModels is a more natural choice for this course than Scikit-Learn.

StatWrap#

Imports#

Run the typical three starter lines in a Google Colab (suggested) or Jupyter notebook.

!pip install --upgrade statwrap
import statwrap
%use_all

Example 1#

x = 1, 2, 3, 4, 5
y = 10, 30, 20, 40, 50

scatter_plot(x,y)

print('Correlation:', correl(x,y))
[Scatter plot of x versus y]
Correlation: 0.8999999999999999
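As a quick cross-check (not something StatWrap requires), the same correlation can be computed with NumPy, which comes preinstalled in Colab. np.corrcoef returns a 2×2 correlation matrix, so we pick out the off-diagonal entry.

import numpy as np

# the [0, 1] entry of the 2x2 correlation matrix is corr(x, y)
np.corrcoef(x, y)[0, 1]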
# Direct Estimation

regression_line = linest(y, x)
regression_line
\[\hat{y} = 3 + 9x\]
# predict y for x = 1 and x = 2

regression_line(1), regression_line(2)
(np.float64(11.999999999999986), np.float64(20.999999999999986))
# Use correlation

correlation = correl(x,y)
sd_x = sd(x)
sd_y = sd(y)

slope = correlation * sd_y/sd_x

# y_average = intercept + slope*x_average 
intercept = average(y) - slope*average(x)

print(f'predicted Y = {intercept} + {slope}*X')
predicted Y = 3.0 + 9.0*X
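For the record, here is the arithmetic those lines carry out. The SDs themselves are not printed above, but their ratio for this data works out to 10:

\[\text{slope} = r \times \frac{SD_y}{SD_x} = 0.9 \times 10 = 9\]
\[\text{intercept} = \bar{y} - \text{slope} \times \bar{x} = 30 - 9 \times 3 = 3\]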

Example 2 - Read from CSV#

Now we use a package called pandas to import data. Pandas does not have to be explicitly imported because of the %use_all command we used earlier. Later, we use a package called matplotlib to plot a regression line. Again, matplotlib doesn’t have to be explicitly imported.

url = 'https://raw.githubusercontent.com/alexanderthclark/Stats1101/main/Data/FatherSonHeights/pearson.csv'

# load a DataFrame
df = pd.read_csv(url)

# inspect first five rows
df.head()
   Father   Son
0    65.0  59.8
1    63.3  63.2
2    65.0  63.3
3    65.8  62.8
4    61.1  64.3
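Before regressing, it can be worth a quick look at the size of the data. This is plain pandas, nothing StatWrap-specific:

# number of rows and columns, plus the column names
df.shape, list(df.columns)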
father = df.Father
son = df.Son

scatter_plot(father, son,
             xlabel = "Father's Height",
             ylabel = "Son's Height",
             alpha = 0.2) # reduce opacity of points
[Scatter plot of Father's Height versus Son's Height]
reg_line = linest(son, father) # y then x
reg_line
\[\hat{y} = 33.893 + 0.514x\]
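As in Example 1, the same line can be recovered from the correlation and the SDs. This is a sketch using the same StatWrap functions as before, assuming correl, sd, and average accept pandas columns the way scatter_plot and linest do; the slope and intercept should match the 0.514 and 33.893 above.

r = correl(father, son)
slope = r * sd(son) / sd(father)
intercept = average(son) - slope * average(father)

print(f'predicted son height = {intercept:.3f} + {slope:.3f} * father height')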
scatter_plot(father, son,
             xlabel = "Father's Height",
             ylabel = "Son's Height",
             alpha = 0.2, # reduce opacity of points
             show = False) # tell Python to let us add more stuff to the plot

# Add regression line using matplotlib
x_range = father.min(), father.max()
y_predictions = reg_line(x_range) 
plt.plot(x_range, y_predictions, color = 'black')
plt.show()
[Scatter plot with the fitted regression line added]
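The fitted line can also be used for a single prediction. For example, plugging in a 68-inch father gives roughly 33.893 + 0.514 × 68 ≈ 68.8 inches (the displayed coefficients are rounded, so the exact value may differ slightly):

# predicted son's height for a father who is 68 inches tall
reg_line(68)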

StatsModels#

import statsmodels.api as sm

# data
x = (1,2,3,4,5)
y = (1,4,3,4,10)

# Add a constant to the model (intercept)
X = sm.add_constant(x)

# Fit the model
model = sm.OLS(y, X).fit()
#model.summary()
# coefficients for intercept, slope
model.params
array([-1. ,  1.8])
residuals_sm = model.resid
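Two quick sanity checks using attributes that statsmodels exposes on the fitted results: with an intercept in the model, the residuals should sum to (numerically) zero, and model.rsquared reports the R² of the fit.

print('sum of residuals:', residuals_sm.sum())
print('R-squared:', model.rsquared)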

StatsModels for R Users#

import statsmodels.formula.api as smf
import pandas as pd

data = pd.DataFrame({"x": x, "y": y})
model_smf = smf.ols(formula='y ~ x', data=data).fit()
residuals_smf = model_smf.resid
#model_smf.summary()
# coefficients for intercept, slope
model_smf.params
Intercept   -1.0
x            1.8
dtype: float64
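The formula interface fits the same OLS model as sm.OLS above, so the two sets of coefficients should agree. A quick check with NumPy (imported here just for the comparison):

import numpy as np

# same model, same data, so the two coefficient vectors agree
np.allclose(model.params, model_smf.params)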

Scikit-Learn#

from sklearn.linear_model import LinearRegression
import numpy as np

# Create a linear regression model
model_sk = LinearRegression()

# convert x and y to arrays
x_sk = np.array(x).reshape(-1,1)
y_sk = np.array(y)

# Run the regression
# note the order of x and y is different
model_sk.fit(x_sk, y_sk)

# Get the coefficients
print(f"Slope: {model_sk.coef_}")
print(f"Intercept: {model_sk.intercept_}")
Slope: [1.8]
Intercept: -0.9999999999999991
# Predictions
y_pred = model_sk.predict(x_sk)

# Calculate residuals
residuals_sk = y_sk - y_pred
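All three libraries fit the same least-squares line to the same five points, so their residuals should agree up to floating-point error:

# compare scikit-learn residuals with the statsmodels residuals from earlier
np.allclose(residuals_sk, residuals_sm)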