14.7. Multivariate Linear Regression#

14.7.1. Learning Objectives#

After studying this notebook and your lecture notes, you should be able to:

  • Fit multivariate models (including polynomials) using linear regression

  • Using the 538 blog post as an example, explain why claiming a trend line is “statistically significant” can be misleading or uninformative

# load libraries
import scipy.stats as stats
import numpy as np
import scipy.optimize as optimize
import math
import matplotlib.pyplot as plt

14.7.2. Motivating Example: Predicting Bluegill Length from Age#

Data source: https://newonlinecourses.science.psu.edu/stat501/node/325/

# age of fish in years
age = np.array([1,1,2,2,2,2,3,3,3,3,3,3,3,3,3,2,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
                4,4,4,4,4,5,4,4,4,5,2,2,4,3,4,3,4,4,4,4,3,3,3,4,4,3,4,5,4,5,4,4,
                3,5,5,4,5,3,4,3,4,6,4,5,4,4])

# length of fish in mm
length = np.array([67,62,109,83,91,88,137,131,122,122,118,115,131,143,142,123,122,
                   138,135,146,146,145,145,144,140,150,152,157,155,153,154,158,162,
                   161,162,165,171,171,162,169,167,188,100,109,150,140,170,150,140,
                   140,150,150,140,150,150,150,160,140,150,170,150,150,150,150,150,
                   150,160,140,160,130,160,130,170,170,160,180,160,170])

plt.plot(age,length,'b.')
plt.xlabel("Age [years]")
plt.ylabel("Length [mm]")
plt.show()
[Scatter plot: bluegill length (mm) versus age (years)]

Can we use the age of the fish to predict its length?

14.7.3. Normal Equations#

Recall from 12-02 and 12-03 the formulation for multivariate linear regression:

\[ \underbrace{\vec{y}}_{\mathbb{R}^{n \times 1}} = \underbrace{\mathbf{X}}_{\mathbb{R}^{n \times (m+1)}} \cdot \underbrace{\vec{\beta}}_{\mathbb{R}^{(m+1) \times 1}} + \underbrace{\vec{\epsilon}}_{\mathbb{R}^{n \times 1}} \]

Observations: \(\vec{y} = [y_1, y_2, ..., y_n]^T\)

Fitted Parameters: \(\vec{\beta} = [\beta_0, \beta_1, ..., \beta_{m}]^T\)

Data / Feature Matrix:

\[\begin{split} \mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \dots & x_{1,m} \\ 1 & x_{2,1} & x_{2,2} & \dots & x_{2,m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \dots & x_{n,m} \end{bmatrix} \end{split}\]

For example, we could fit a linear model where reactor yield depends on temperature, pressure, and inlet composition.
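As a minimal sketch of what \(\mathbf{X}\) looks like in code (the reactor numbers, and the names temperature, pressure, inlet_frac, and X_example, are made up for illustration), each regressor becomes a column and the leading column of ones multiplies the intercept \(\beta_0\):

# hypothetical reactor data (made-up numbers, for illustration only)
temperature = np.array([350., 362., 355., 371., 368., 359.])  # K
pressure = np.array([1.0, 1.4, 1.2, 1.1, 1.5, 1.3])  # bar
inlet_frac = np.array([0.10, 0.12, 0.15, 0.11, 0.14, 0.13])  # mole fraction

# column of ones (intercept) followed by one column per regressor
X_example = np.column_stack((np.ones(len(temperature)), temperature, pressure, inlet_frac))
print("X_example =\n", X_example)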

The linear regression best fit estimates can be easily computed with linear algebra:

\[\mathbf{X}^T \mathbf{X} \hat{\beta} = \mathbf{X}^T y\]

or

\[ \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T y\]

where the vector arrow \(\vec{~}\) is omitted for simplicity. The equations above are known as the normal equations.
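Continuing the hypothetical reactor sketch above (the yields in y_example are also made up), the normal equations can be solved with np.linalg.solve rather than forming the inverse explicitly:

# made-up reactor yields corresponding to X_example above
y_example = np.array([0.61, 0.67, 0.66, 0.70, 0.73, 0.65])

# solve (X^T X) beta_hat = X^T y for the fitted coefficients
beta_hat_example = np.linalg.solve(X_example.T @ X_example, X_example.T @ y_example)
print("beta_hat_example =", beta_hat_example)

Solving the linear system is numerically preferable to computing the inverse, although the class activities below form \((\mathbf{X}^T \mathbf{X})^{-1}\) because it is also needed for the covariance matrix.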

By applying the error propagation formula to the above equation, we can easily calculate the covariance matrix of the fitted parameters:

\[\Sigma_{\hat{\beta}} = \hat{\sigma}^2 (\mathbf{X}^T \mathbf{X})^{-1}\]

where \(\hat{\sigma}^2\) is the variance of the residuals.
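Using the same sketch, \(\hat{\sigma}^2\) is estimated from the residuals (dividing by \(n\) minus the number of fitted parameters) and then scales \((\mathbf{X}^T \mathbf{X})^{-1}\):

# residuals of the made-up reactor fit
resid_example = y_example - X_example @ beta_hat_example

# estimate the residual variance: divide by (n - number of fitted parameters)
n_obs, n_param = X_example.shape
sigma2_hat = resid_example @ resid_example / (n_obs - n_param)

# covariance matrix of the fitted coefficients
Sigma_beta_example = sigma2_hat * np.linalg.inv(X_example.T @ X_example)
print("Sigma_beta_example =\n", Sigma_beta_example)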

14.7.4. Attempt 1. Linear Model#

Let’s start with a simple regression model:

\[ y = \beta_0 + \beta_1 x_1 \]

where \(x_1\) is the age of the fish and \(y\) is its length.

Class Activity

Construct the feature matrix X.

nfish = len(age)

# feature matrix (store in 'X')
# Add your solution here

print("X =\n",X)

Class Activity

Compute \((\mathbf{X}^T \mathbf{X})^{-1}\) and store in XXinv.

# calculate inverse of XT * X
# Add your solution here
print("inv(XT X) =\n",XXinv)

Class Activity

Compute \(\hat{\mathbf{\beta}}\) (fitted coefficients) and store in beta_hat.

# recall we can calculate the best fit coefficients with linear algebra
# Add your solution here
print("beta_hat =",beta_hat)

Class Activity

Complete the code below to evaluate the model and plot the predictions.

## evaluate predictions

# create ages to evaluate model with
age_plot = np.linspace(1,6,100)

# create new X matrix with 100 rows and 2 columns
Xplot = np.ones((len(age_plot),2))

# fill in second column
# Add your solution here

# evaluate model and store in `length_plot`
# Add your solution here

# plot
plt.plot(age,length,'b.',markersize=10,label="Data")
plt.plot(age_plot,length_plot,'r-',label="Best Fit")
plt.xlabel("Age [years]")
plt.ylabel("Length [mm]")
plt.legend()
plt.show()

Class Activity

Calculate and plot the residuals.

# calculate residuals, store in `e`
# Add your solution here

plt.plot(age,e,'b.')
plt.xlabel("Age [years]")
plt.ylabel("Residuals [mm]")
plt.show()

Class Activity

Compute the variance of the residuals. Then compute the covariance of the fitted parameters.

# compute variance of residuals, store in `se`
# Add your solution here
print("variance of residuals = ",se,"add units")

# compute covariance matrix, store in `Sigma_beta`
# Add your solution here
print("covariance matrix:\n",Sigma_beta)

14.7.5. Attempt 2. Quadratic Model#

Let’s try a quadratic model now:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

where \(x_1\) is the age and \(x_2 = x_1^2\) is the age squared.

Class Activity

With a partner, modify the code below (copied from the previous example) to fit the quadratic model.

# construct the feature matrix
X = np.ones((nfish,3))

# Add your solution here

print("X =\n",X)
# calculate inverse of XT * X, store in XXinv
# Add your solution here
print("inv(XT X) =\n",XXinv)
# recall we can calculate the best fit coefficients with linear algebra, store in beta_hat
# Add your solution here
print("beta_hat =",beta_hat)
## evaluate predictions

# create ages to evaluate model with
age_plot = np.linspace(1,6,100)

# create new X matrix with 100 rows and 3 columns
Xplot = np.ones((len(age_plot),3))

# fill in second and third column
# Add your solution here

# evaluate model, store predictions in length_plot
# Add your solution here

# plot
plt.plot(age,length,'b.',markersize=10,label="Data")
plt.plot(age_plot,length_plot,'r-',label="Best Fit")
plt.xlabel("Age [years]")
plt.ylabel("Length [mm]")
plt.legend()
plt.show()
# calculate residuals, store in `e`
# Add your solution here

plt.plot(age,e,'b.')
plt.xlabel("Age [years]")
plt.ylabel("Residuals [mm]")
plt.show()
# compute the variance of residuals, store in `se`
# Add your solution here


# compute covariance matrix, store in `Sigma_beta`
# Add your solution here
print("covariance matrix:\n",Sigma_beta)

14.7.6. How to select the best model?#

We can calculate a p-value for each fitted coefficient to test whether it is statistically significant, i.e., distinguishable from zero. As the FiveThirtyEight post below illustrates, however, statistical significance alone does not tell us which model is best.

https://fivethirtyeight.com/features/science-isnt-broken/#part1
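For example, here is a minimal sketch of a t-test on the quadratic coefficient (assuming beta_hat, Sigma_beta, and nfish from the quadratic fit above); the null hypothesis is \(\beta_2 = 0\):

# t-statistic for the quadratic coefficient (null hypothesis: beta_2 = 0)
t_stat = beta_hat[2] / np.sqrt(Sigma_beta[2, 2])

# two-sided p-value with (n - number of fitted parameters) degrees of freedom
dof = nfish - 3
p_value = 2 * stats.t.sf(abs(t_stat), dof)
print("t =", t_stat, ", p-value =", p_value)

A small p-value only says the coefficient is unlikely to be exactly zero; as the FiveThirtyEight piece argues, that is not the same as the model being correct or the effect being practically important.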