14.6. Uncertainty Analysis and Statistical Inference#

Further Reading: §7.3 in Navidi (2015), Uncertainty in the Least-Squares Coefficients

14.6.1. Learning Objectives#

After studying this notebook and your lecture notes, you should be able to:

  • Interpret the correlation coefficient

  • Compute simple linear regression best fits

  • Check linear regression error assumptions using residual analysis (plots)

  • Compute residual standard error and covariance matrix for fitted parameters

  • Assemble confidence intervals for fitted parameters

# load libraries
import scipy.stats as stats
import numpy as np
import math
import matplotlib.pyplot as plt

14.6.2. Main Idea#

Returning to the farm example from the last notebook, we have already fit a linear model and checked the error assumptions using residual analysis.

We can now ask questions like:

  • How does uncertainty in the observed data propagate to the fitted model parameters?

  • Given the uncertainty in the observed data, what can we infer about the true values \(\beta_0\) and \(\beta_1\)?

14.6.3. Estimating Variance of Residuals#

Recall the general linear model formula:

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

and our assumption that

\[\epsilon_i \sim \mathcal{N}(0,\sigma^2) ~.\]

We rarely know \(\sigma^2\) and instead need to estimate it from the residuals:

\[s^2 = \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e^2_i}{n-2} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} = \frac{e^T e}{n-2}\]

Here \(s^2\) is the variance of the residuals and an unbiased estimate of \(\sigma^2\). Often \(s\) is called the residual standard error or the standard error of the residuals. The \(n-2\) is the number of degrees of freedom; we subtract 2 because we fit two model parameters (\(\hat{\beta}_0\) and \(\hat{\beta}_1\)).

e_farm = np.array([-243.72463768, 385.85507246, -112.98550725, -37.4057971, -253.82608696, 462.75362319, -200.66666667]) # residuals calculated in the last notebook
n_farm = len(e_farm)
se = math.sqrt(np.dot(e_farm, e_farm) / (n_farm - 2)) # residual standard error with n-2 degrees of freedom
print("Residual standard error = ",se,"(pounds/acre)")
Residual standard error =  329.0247600698057 (pounds/acre)
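As a quick sanity check (an addition, not part of the original analysis): least-squares residuals from a model with an intercept sum to zero, so the sample standard deviation with ddof=2 reproduces the residual standard error.

# sanity check: OLS residuals (with an intercept) have zero mean,
# so np.std with ddof=2 matches the residual standard error
print(np.std(e_farm, ddof=2))  # expect ~329.02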

14.6.4. Simplified Uncertainty in Regression Parameters#

Recall the formulas to calculate the best fit parameters:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\]
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

How are the fitted parameters impacted by uncertainty \(\epsilon \sim \mathcal{N}(0,\sigma^2)\) in the dependent variable observations (\(y_i\))?

Using the error propagation formulas, we can obtain:

\[s_{\hat{\beta}_{0}} = s \sqrt{\frac{\sum_{i=1}^n x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2}}\]
\[s_{\hat{\beta}_{1}} = \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}\]

where \(s\) is the residual standard error and an estimate for \(\sigma\). These formulas neglect the covariance of the fitted parameters \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\).

Class Activity

With a partner, take a minute to verify the units for the above equations make sense.

pH = np.array([4.6, 4.8, 5.2, 5.4, 5.6, 5.8, 6.0]) #pH values from last notebook
x = pH
xbar = np.mean(x)

xdiff2 = (x-xbar) @ (x-xbar)
sb0 = se * math.sqrt((x@x) / (n_farm * xdiff2))
print("Standard deviation of intercept:")
print(sb0,"(pounds/acre)")
Standard deviation of intercept:
1405.31530958388 (pounds/acre)
sb1 = se / math.sqrt(xdiff2)
print("Standard deviation of slope:")
print(sb1,"(pounds per acre / pH unit)")
Standard deviation of slope:
261.995036940794 (pounds per acre / pH unit)

14.6.5. Confidence Intervals#

Under the four assumptions for errors, the quantities

\[\frac{\hat{\beta_0} - \beta_0}{s_{\hat{\beta}_{0}}} \qquad \mathrm{and} \qquad \frac{\hat{\beta_1} - \beta_1}{s_{\hat{\beta}_{1}}}\]

have Student’s t-distribution with \(n-2\) degrees of freedom. This facilitates a direct extension of statistical inference to linear regression.

For example, level 100\((1 - \alpha)\)% confidence intervals for \(\beta_0\) and \(\beta_1\) are given by

\[\hat{\beta}_0 \pm t_{n-2,\alpha/2} \cdot s_{\hat{\beta}_0}\]
\[\hat{\beta}_1 \pm t_{n-2,\alpha/2} \cdot s_{\hat{\beta}_1}\]
# look up t-score for 95% confidence interval
tscore = stats.t.ppf([0.025, 0.975], n_farm-2)
print("t-score =",tscore)
t-score = [-2.57058184  2.57058184]
b0 = -2090.942028985508 # calculated in last notebook

print("95% confidence interval for intercept:")
intercept_interval = b0 + sb0 * tscore
print(intercept_interval," pounds/acre")
95% confidence interval for intercept:
[-5703.42003852  1521.53598055]  pounds/acre
b1 = 737.1014492753624 # calculated in last notebook

print("95% confidence interval for intercept:")
slope_interval = b1 + sb1 * tscore
print(slope_interval," pounds per acre / pH unit")
95% confidence interval for slope:
[  63.62176603 1410.58113252]  pounds per acre / pH unit
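Equivalently (an optional alternative to the manual arithmetic above), scipy.stats.t.interval builds the same intervals directly from a point estimate and standard error:

# equivalent one-liners using scipy's t.interval
print(stats.t.interval(0.95, n_farm - 2, loc=b0, scale=sb0))  # intercept CI
print(stats.t.interval(0.95, n_farm - 2, loc=b1, scale=sb1))  # slope CI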

14.6.6. Uncertainty in Regression Parameters Revisited#

Recall that the above analysis neglected the correlation between \(\hat{\beta}_0\) and \(\hat{\beta}_1\), but these parameters are almost always correlated. You should therefore exercise caution when applying the above formulas (although they are routinely the end of the story in an introductory statistics textbook).

Before we discuss a more holistic approach, we need to introduce multivariate linear models.

\[ \underbrace{\vec{y}}_{\mathbb{R}^{n \times 1}} = \underbrace{\mathbf{X}}_{\mathbb{R}^{n \times (m+1)}} \cdot \underbrace{\vec{\beta}}_{\mathbb{R}^{(m+1) \times 1}} + \underbrace{\vec{\epsilon}}_{\mathbb{R}^{n \times 1}} \]

Observations: \(\vec{y} = [y_1, y_2, ..., y_n]^T\)

Fitted Parameters: \(\vec{\beta} = [\beta_0, \beta_1, ..., \beta_{m}]^T\)

Data / Feature Matrix:

\[\begin{split} \mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \dots & x_{1,m} \\ 1 & x_{2,1} & x_{2,2} & \dots & x_{2,m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \dots & x_{n,m} \end{bmatrix} \end{split}\]

For example, we could fit a linear model where reactor yield depends on temperature, pressure, and inlet composition. More on this in a few lectures.

The linear regression best fit estimates can be easily computed with linear algebra:

\[\mathbf{X}^T \mathbf{X} \hat{\beta} = \mathbf{X}^T \vec{y}\]

or

\[ \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \vec{y}\]

where \(\vec{\epsilon}\) is omitted for simplicity. The equations above are known as the normal equations.

By applying the error propagation formula to the above equation, we can easily calculate the covariance matrix of the fitted parameters:

\[\Sigma_{\hat{\beta}} = \hat{\sigma}^2 (\mathbf{X}^T \mathbf{X})^{-1}\]

Now we will apply this to the farm example:

# feature matrix
X = np.ones((n_farm,2))
X[:,1] = pH

print("X =\n",X)
X =
 [[1.  4.6]
 [1.  4.8]
 [1.  5.2]
 [1.  5.4]
 [1.  5.6]
 [1.  5.8]
 [1.  6. ]]
yld = np.array([1056, 1833, 1629, 1852, 1783, 2647, 2131]) # data from the example in the last notebook
y = yld
print("y =",y)
y = [1056 1833 1629 1852 1783 2647 2131]
XXinv = np.linalg.inv(X.T @ X)
print("inv(XT X) =\n",XXinv)
inv(XT X) =
 [[18.24275362 -3.38768116]
 [-3.38768116  0.63405797]]
beta_hat = XXinv @ X.T @ y
print("beta_hat =",beta_hat)
beta_hat = [-2090.94202899   737.10144928]
# results from the beginning of the notebook
print("intercept =",b0,"pounds per acre")
print("slope = ",b1,"pounds per acre / pH unit")
intercept = -2090.942028985508 pounds per acre
slope =  737.1014492753624 pounds per acre / pH unit
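An aside: we formed \((\mathbf{X}^T \mathbf{X})^{-1}\) explicitly above because we reuse it for the covariance matrix below. When only the fitted parameters are needed, np.linalg.lstsq solves the least-squares problem without an explicit inverse and is more numerically stable (a minimal sketch, not the approach used in this notebook):

# sketch: solve for beta_hat directly, avoiding the explicit inverse
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat_lstsq)  # expect the same values as beta_hat above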
# covariance matrix
Sigma_beta = se**2 * XXinv
print("covariance matrix:\n",Sigma_beta)
covariance matrix:
 [[1974911.11935094 -366741.19098175]
 [-366741.19098175   68641.39938161]]
# results from middle of notebook
print("variance of intercept:\n",sb0**2)

print("\nvariance of slope:\n",sb1**2)
variance of intercept:
 1974911.1193508364

variance of slope:
 68641.39938160802

Main idea: The squares of the previously calculated standard deviations \(s_{\hat{\beta}_{0}}\) and \(s_{\hat{\beta}_{1}}\) match the diagonal elements (variances) of the covariance matrix.
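We can confirm this numerically (a quick check, not part of the original notebook):

# verify: square roots of the covariance diagonal equal sb0 and sb1
print(np.allclose(np.sqrt(np.diagonal(Sigma_beta)), [sb0, sb1]))  # expect True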

We can further scale the covariance matrix into the correlation matrix to analyze how strongly the two fitted parameters are correlated:

\[ \mathrm{Corr} = D^{-1} \Sigma D^{-1} \]

where \(D\) is the diagonal matrix of parameter standard deviations, i.e., \(D_{ii} = \sqrt{\Sigma_{ii}}\).
# extract the variances from the diagonal of the covariance matrix
size = np.shape(Sigma_beta)
diag = np.diagonal(Sigma_beta)
stdev = np.sqrt(diag)
# calculate correlation matrix
corr_matrix = np.zeros(size)
for i in range(0,size[0]):
    for j in range(0,size[1]):
        corr_matrix[i,j] = Sigma_beta[i,j] / stdev[i] / stdev[j]
print("Correlation Matrix: \n",corr_matrix)
Correlation Matrix: 
 [[ 1.         -0.99607686]
 [-0.99607686  1.        ]]
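Finally, as an independent cross-check (an addition to the original notebook), scipy.stats.linregress reports the same point estimates and standard errors:

# cross-check against scipy's built-in simple linear regression
res = stats.linregress(pH, yld)
print("slope =", res.slope, "+/-", res.stderr)  # expect ~737.10 +/- 261.99
print("intercept =", res.intercept, "+/-", res.intercept_stderr)  # expect ~-2090.94 +/- 1405.32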