14.2. Simple Least Squares#

Further Reading: §7.2 Navidi (2015)

14.2.1. Learning Objectives#

After studying this notebook and your lecture notes, you should be able to:

  • Understand how to calculate a best fit line numerically or analytically.

  • Use scipy to calculate a line of best fit.

  • Know when the least squares line is applicable.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

14.2.2. Introduction#

| variable | symbol |
| --- | --- |
| dependent variable | \(y_i\) |
| independent variable | \(x_i\) |
| regression coefficients | \(\beta_0\), \(\beta_1\) |
| random error | \(\varepsilon_i\) |

Linear model:

\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\]

We measure many \((x_i, y_i)\) pairs in lab.

Can we compute \(\beta_0\) , \(\beta_1\) exactly? Why or why not?

Best fit line:

\[\hat{y} = \hat{\beta_0} + \hat{\beta_1}x\]

where \(\hat{y}\) is the predicted response and \(\hat{\beta_0}\), \(\hat{\beta_1}\) are the fitted coefficients.

14.2.3. Computing Best Fit Line#

\[\min_{\beta_0 , \beta_1} \sum_{i=1}^n (y_i - \hat{y_i})^2\]
\[\textrm{(residual) } e_i = y_i - \hat{y_i}\]

Notice that the quantity inside the parentheses is the residual \(e_i\), so we are minimizing the sum of the squared residuals.

We can compute \(\hat{\beta_0}\) and \(\hat{\beta_1}\) either numerically or analytically. A numerical sketch follows; the analytic formulas come after it (see the textbook for the full derivation).
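
As a minimal sketch of the numerical route (the starting guess and variable names here are illustrative, not from the text), the sum of squared residuals can be minimized directly with scipy.optimize.minimize, using the same example data as the SciPy fit later in this notebook:

import numpy as np
from scipy.optimize import minimize

# example data (same values as the linregress example below)
xdata = np.array([0,1.5,2,3.2,4,6,7.8])
ydata = np.array([1,2.4,3,5.6,6,6.1,6.8])

# sum of squared residuals as a function of the coefficients (beta0, beta1)
def sse(beta):
    beta0, beta1 = beta
    residuals = ydata - (beta0 + beta1*xdata)
    return np.sum(residuals**2)

# minimize numerically from an arbitrary starting guess
result = minimize(sse, x0=[0.0, 1.0])
beta0_hat, beta1_hat = result.x
print(beta0_hat, beta1_hat)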

\[\hat{\beta_1} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{s_{x,y}}{s_x^2} = r\frac{s_y}{s_x}\]
\[(s_{x,y} \textrm{ is the sample covariance, } s_x^2, s_y^2 \textrm{ are the sample variances, and } r \textrm{ is the sample correlation})\]
\[ \hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}\]
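
And a matching sketch of the analytic route, applying the formulas above directly with NumPy (same example data; beta0_hat and beta1_hat are illustrative names):

import numpy as np

xdata = np.array([0,1.5,2,3.2,4,6,7.8])
ydata = np.array([1,2.4,3,5.6,6,6.1,6.8])

xbar, ybar = np.mean(xdata), np.mean(ydata)

# slope: sum of cross-deviations over sum of squared x-deviations
beta1_hat = np.sum((xdata - xbar)*(ydata - ybar))/np.sum((xdata - xbar)**2)

# intercept: the fitted line passes through (xbar, ybar)
beta0_hat = ybar - beta1_hat*xbar

print(beta0_hat, beta1_hat)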

SciPy also computes the best fit line, as the example below shows.

# create data
xdata = [0,1.5,2,3.2,4,6,7.8]
ydata = [1,2.4,3,5.6,6,6.1,6.8]

# use scipy.stats.linregress to calculate a line of best fit and extract the key info
slope, intercept, r, p, se = stats.linregress(xdata,ydata)

# create a lambda function to plot the line of best fit
xline = np.linspace(0,8,10)
y = lambda x: slope*x + intercept
yline = y(xline)

# plot data
plt.plot(xdata,ydata,'o',label="data")
plt.plot(xline,yline,'-',label="best fit line")
plt.legend()
plt.show()
[Figure: scatter plot of the data with the best fit line.]
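
The other values returned by linregress are worth inspecting as well. Continuing from the cell above (a short sketch reusing the slope, intercept, r, and se variables already defined), slope and intercept are \(\hat{\beta_1}\) and \(\hat{\beta_0}\), and squaring r gives the coefficient of determination discussed at the end of this notebook:

# print the fitted coefficients and goodness-of-fit summary from linregress
print("slope (beta1 hat):      ", slope)
print("intercept (beta0 hat):  ", intercept)
print("r:                      ", r)
print("r^2:                    ", r**2)
print("standard error of slope:", se)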

14.2.4. Warnings#

  • The estimates \(\hat{\beta_0}\), \(\hat{\beta_1}\) are not the same as the true values \(\beta_0\), \(\beta_1\). The estimates are random variables because of measurement error.

  • The residuals \(e_i\) are not the same as the errors \(\varepsilon_i\).

# true line y = x and a slightly different best fit line
x1 = [0,5]
y1 = [0,5]

x2 = [0,5]
y2 = [0.25,4]

# an observed data point (x_i, y_i)
xpoint1 = 4
ypoint1 = 4.5

# the fitted value on the best fit line at the same x
xpoint2 = 4
ypoint2 = 3.25

# vertical segment from the true line to the data point: the error epsilon_i
xline1 = [3.98,3.98]
yline1 = [4,4.5]

# vertical segment from the best fit line to the data point: the residual e_i
xline2 = [4.02,4.02]
yline2 = [3.25,4.5]

plt.plot(x1,y1,'g',label="true line")
plt.plot(x2,y2,'b--',label="best fit line")
plt.plot(xpoint1,ypoint1,'ro')
plt.plot(xline1,yline1,'r',label=r"$\epsilon_i$")
plt.plot(xpoint2,ypoint2,'mo')
plt.plot(xline2,yline2,'m',label=r"$e_i$")
plt.text(4,4.7,'($x_i$,$y_i$)')
plt.text(3.8,4.1,r'$\epsilon_i$')
plt.text(4.1,3.7,r'$e_i$')
plt.legend()
plt.show()
[Figure: a data point with the error \(\varepsilon_i\) measured to the true line and the residual \(e_i\) measured to the best fit line.]
  • Do NOT extrapolate outside the range of the data.

  • Do NOT use the least-squares line when the data are not linear.

# clearly curved (nonlinear) data
xpts = [1,1.8,2.5,3,4,5,6,7,8,9]
ypts = [8,7.9,7.8,7.65,7.3,6.8,6,5.1,4.05,2]

# a straight line drawn through the data for comparison
xline = [0,10]
yline = [9,2]

plt.plot(xpts,ypts,'o')
plt.plot(xline,yline,'g')
plt.xlim([0,10])
plt.ylim([0,10])
plt.title('Data is not linear, do not use least-squares line')
plt.show()
[Figure: curved data with a straight line overlaid, illustrating a poor linear fit.]
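
A quick follow-up check (a sketch, not from the text, reusing xpts and ypts from the cell above): if a least-squares line is fit to these data anyway, the residuals \(e_i\) show a clear curved pattern instead of random scatter around zero, another sign that the straight-line model is not appropriate.

# fit a line to the curved data and plot the residuals e_i = y_i - yhat_i
slope_c, intercept_c, r_c, p_c, se_c = stats.linregress(xpts,ypts)
residuals = np.array(ypts) - (intercept_c + slope_c*np.array(xpts))

plt.plot(xpts,residuals,'o')
plt.axhline(0,color='k')
plt.xlabel('$x_i$')
plt.ylabel('residual $e_i$')
plt.title('Curved residual pattern: straight-line model is not appropriate')
plt.show()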

14.2.5. Measuring Goodness-of-Fit#

How well does the model explain the data?

Coefficient of determination:

\[r^2 = \frac{\sum_{i=1}^n (y_i - \bar{y})^2 - \sum_{i=1}^n (y_i - \hat{y_i})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}\]
\[ = \frac{\textrm{regression sum of squares}}{\textrm{total sum of squares}}\]

Interpretation: proportion of the variance in \(y\) explained by regression.

See the textbook for the derivation of the \(r^2\) formula.
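
As a small numerical check (a sketch reusing xdata, ydata, slope, intercept, and r from the example above), \(r^2\) computed from the sums of squares matches the square of the r value returned by linregress:

# predicted responses from the fitted line
yhat = intercept + slope*np.array(xdata)

# total sum of squares and residual (error) sum of squares
ss_total = np.sum((np.array(ydata) - np.mean(ydata))**2)
ss_resid = np.sum((np.array(ydata) - yhat)**2)

r_squared = (ss_total - ss_resid)/ss_total
print(r_squared, r**2)   # the two values agree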