Problem Set 6#

CBE 60258, University of Notre Dame. © Prof. Alexander Dowling, 2023

You may not distribute homework solutions without written permission from Prof. Alexander Dowling.

Your Name and Email:

Motivation: Nonlinear Regression for Adsorptive Nanoporous Membranes#

Contributors: Elvis A. Eugene and Alexander W. Dowling

Prof. William Phillip has been developing novel adsorptive nanoporous membranes to remove heavy metals from water. He has asked for your help analyzing data to design a water treatment system.

Assume the amount of contaminant (e.g., Pb) that adsorbs to the water treatment membranes can be modeled with the Langmuir isotherm:

\[q = \frac{Q \cdot K \cdot c}{1 + K \cdot c}\]

where \(c\) is the concentration of contaminant in the bulk fluid (water), \(q\) is the loading of contaminant on the sorbent (membrane), \(K\) is the binding affinity and \(Q\) is the capacity.
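Before fitting anything, it helps to note the limiting behavior of the isotherm (standard Langmuir properties, also a useful hint for the units question below):

\[ \lim_{c \to \infty} q = Q, \qquad q \approx Q \cdot K \cdot c \quad \text{when} \quad K \cdot c \ll 1 \]

In words: the loading saturates at the capacity \(Q\) at high concentration, and the initial slope of the isotherm is \(Q \cdot K\).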

Prof. Phillip’s lab has provided us with the following data:

| Contaminant Concentration in Bulk (mM) | Contaminant Loading on Sorbent (mmol/g) |
|---|---|
| 1 | 0.5 |
| 2.5 | 0.9 |
| 5 | 1 |
| 10 | 1.33 |
| 20 | 1.3 |
| 40 | 1.4 |

In this case study, you will compare three different regression strategies:

  1. Linear Regression after Transformation

  2. Linear Regression after a Different Transformation

  3. Nonlinear Regression

For each strategy, we will follow the same basic steps:

  • Set up the problem, including deriving the transformation (if applicable)

  • Perform the regression calculation

  • Plot the experimental data (with transformed variables if applicable)

  • Plot the experimental data versus fitted model (\(q\) vs \(c\))

  • Graphically inspect the residuals

  • Compute the covariance matrix for the fitted parameters, transform to the original parameters (if applicable)

# load libraries
import numpy as np
from scipy import optimize, stats, integrate
import matplotlib.pyplot as plt
import pandas as pd

# define the data
c = np.array([1, 2.5, 5, 10, 20, 40]);
q = np.array([0.5, 0.9, 1.0, 1.33, 1.3, 1.4]);

# number of observations
n = len(c);

# number of fitted parameters
p = 2;

Getting Started#

Determine the units of \(K\) and \(Q\). Hint: the \(1\) in the denominator of the Langmuir isotherm is dimensionless.

Units for \(K\):

Units for \(Q\):

Before we run any regression analysis, let’s first plot the data. This is a best practice in data science: always visualize first.
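Here is a minimal plotting sketch, assuming the import and data cells above have been run:

# visualize the raw data before any regression
plt.figure()
plt.plot(c, q, 'ro', label='data')
plt.xlabel('Concentration (mM)')
plt.ylabel('Loading (mmol / grams membrane)')
plt.title('Raw Adsorption Data')
plt.grid(True)
plt.legend()
plt.show()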

Parameter estimation using a transformation and linear regression#

Transformation#

We will start our analysis by performing a transformation; this is necessary to apply linear regression. We start with the Langmuir isotherm:

\[ q = \frac{Q \cdot K \cdot c}{1 + K \cdot c}\]

With a little bit of algebra, we obtain:

\[ \frac{c}{q} = \frac{c}{Q} + \frac{1}{Q \cdot K} \]

Write down the mathematical steps to go from the original isotherm \(q=~...\) to the transformed isotherm \(c/q =~...~\) on paper. Turn this in on Gradescope.

Linear Regression#

You may be asking yourselves, how is that model linear?!? We just need to define our variables and parameters carefully:

\[ \underbrace{\frac{c}{q}}_{y} = \underbrace{\frac{1}{Q}}_{\beta_1} \cdot \underbrace{c}_{x} + \underbrace{\frac{1}{K \cdot Q}}_{\beta_0} \]

Recall \(x\) is the independent variable, \(y\) is the observed variable, \(\beta_0\) is the intercept, and \(\beta_1\) is the slope. With this transformation, we can compute the regression coefficients:

# Code for parameter estimation using linear regression
x = c;
y = c/q;

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

print("slope =",slope)
print("intercept =",intercept)
print("R-squared:",r_value**2)

Alright, we now have estimates for \(\beta_0\) and \(\beta_1\). But we really care about \(K\) and \(Q\). Let’s reverse the transformation.

\[Q = \frac{1}{\beta_1}, \qquad K = \frac{\beta_1}{\beta_0}\]
## Reverse transformation to obtain Q and K and display results
Qlin1 = 1/slope;
Klin1 = slope/intercept;
print("\nK (linear regression) = {0:0.1f} l/mmol".format(Klin1));
print("Q (linear regression) = {0:0.1f} mmol/g_membrane".format(Qlin1));

Plot Fit and Residuals for Transformed Model#

Now we want to visualize the quality of the fit. Because we plan to repeat the regression analysis two more times, we will invest in writing functions for the visualization. This way, we declare the plotting procedures once and reuse them. This is good programming practice: use functions to maximize code reuse.

# define a function to plot linearized isotherm 
def plot_linearized_isotherm(x,y,slope,intercept,title):
    ''' 
    function to plot the linearized model and residuals
    Arguments:
        x: transformed independent variable (float vector)
        y: transformed dependent variable (float vector)
        slope: fitted parameter 1 (float)
        intercept: fitted parameter 2 (float)
        title: keyword in plot title (string). Either 'lin1' or 'lin2'
        
    Returns:
        nothing
    '''
    
    assert (title == 'lin1') or (title == 'lin2'), "Argument title must be 'lin1' or 'lin2'"
    
    ## evaluate the best fit line
    # determine the max value of x
    xmax = np.amax(x);
    # define range of x to evaluate the model
    x_span = np.linspace(0,1.2*xmax,50);
    # evaluate the model
    y_hat = slope*x_span + intercept
    
    ## plot the best fit line
    plt.figure()
    plt.plot(x,y,'ro',label = 'data');
    plt.plot(x_span, y_hat, 'k', label='best fit')
    
    if (title == 'lin1'):
        plt.xlabel("c (mM)")
        plt.ylabel("c/q (grams membrane / L)")
    elif (title == 'lin2'):
        plt.xlabel("1/c (l/mmol)")
        plt.ylabel("1/q (grams membrane / mmol)")            
            
    plt.grid(True)
    plt.title('Transformed Linear Langmuir Isotherm')
    plt.legend()
    plt.show()
    
    
    ## calculate the residuals
    r = y - (x*slope + intercept);
    
    ## plot the residuals versus concentration (note: this uses the global array c, not the transformed x)
    plt.figure()
    plt.plot(c,r,'ro')
    
    if (title == 'lin1'):
        plt.ylabel('Residual (grams membrane / L)')
    elif (title == 'lin2'):
        plt.ylabel('Residual (grams membrane / mmol)')
                      
    plt.xlabel('Equilibrium concentration (mM)')
    plt.grid(True)
    plt.title('Transformed Residuals')
    plt.show()
    
# End: define a function to plot linearized isotherm 
# Plot linearized isotherm versus data
# The last argument toggles between the two transformations we'll consider
plot_linearized_isotherm(x,y,slope,intercept,'lin1')

Write a few bullet points to interpret the scatter plot:

  • Fill in

  • Fill in

Write a few bullet points interpreting the residual plot:

  • Fill in

  • Fill in

# Ignore this code block

# Add your solution here

Plot Fit and Residuals for Original Model#

Next, we will define the second function which plots the fitted Langmuir isotherm without transformation and plots the non-transformed residuals. Notice the function can be directly reused for each regression strategy.

# define a function to plot Langmuir isotherm
def plot_original_isotherm(c,q,K,Q,title):
    ''' 
    function to plot fitted Langmuir isotherm and data
    Arguments:
        c: independent variable i.e. concentration - experimental data [mM] (float vector)
        q: membrane loading - experimental data [mmol/gram membrane] (float vector)
        K: fitted parameter [L/mmol] (float scalar)
        Q: fitted parameter [mmol/gram membrane] (float scalar)
        title: name of regression to add to title (string)
    Returns:
        nothing
    '''
    
    ## evaluate the model
    # determine maximum value of c
    cmax = np.amax(c);
    # define range of x to evaluate the model
    c_span = np.linspace(0,1.2*cmax,50);
    # evaluate the model
    q_hat = Q*K*c_span / (1 + K*c_span)
    
    # plot the best fit model
    plt.figure()
    plt.title('Langmuir Isotherm')
    plt.plot(c_span,q_hat,'b-',label="Fitted Model")
    plt.plot(c,q,'r.',label="Experimental Data")
    plt.legend()
    plt.grid(True)
    plt.xlabel("Concentration (mM)")
    plt.ylabel("Loading (mmol / grams membrane)")
    plt.show()
    
    ## calculate the residuals
    r = q - Q*K*c/(1+K*c)
    
    ## plot the residuals
    plt.figure()
    plt.plot(c,r,'ro')
    plt.ylabel('Residual (mmol / grams membrane)')
    plt.xlabel('Equilibrium concentration (mM)')
    plt.grid(True)
    plt.title('Residuals from {0} regression'.format(title))
    plt.show()
    
# End: define a function to plot Langmuir isotherm
# Plot Langmuir isotherm
plot_original_isotherm(c,q,Klin1,Qlin1,'first transformed')

Write a few bullet points to interpret the scatter plot:

  • Fill in

  • Fill in

Write a few bullet points to interpret the residual plot:

  • Fill in

  • Fill in

# Ignore this code block
# Add your solution here

Compute Covariance Matrices#

Next, we will compute the covariance matrix for the fitted parameters \(\beta_0\) and \(\beta_1\).

We will start by computing the variance of the residuals. Recall the formula is:

\[\hat{\sigma}_r^2 = \frac{1}{n-p} \sum_{i=1}^{n} (r_i - 0)^2\]

Here \(n\) is the number of observations, \(p\) is the number of fitted parameters, and \(r_i\) is the residual for observation \(i\). An interesting property of linear regression is that the mean of the residuals is always zero: the normal equations force \(\mathbf{X}^T \vec{r} = \vec{0}\), and the first column of \(\mathbf{X}\) is all ones when the model includes an intercept.

We can write the above formula with linear algebra in one line of code:

# compute the residuals (using the transformed variables)
r = y - (x*slope + intercept);

# variance of residuals
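# note: r @ r is the inner (dot) product, i.e., the sum of squared residuals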
var_r = r @ r / (n-p)
print("Variance of residuals =",var_r," (g / L)^2")

Recall from the previous class, stats.linregress does not directly compute the covariance of the linear regression parameters. Instead, we need to write the regression problem in matrix notation:

\[ \underbrace{\vec{y}}_{\mathbb{R}^{n \times 1}} = \underbrace{\mathbf{X}}_{\mathbb{R}^{n \times (m+1)}} \cdot \underbrace{\vec{\beta}}_{\mathbb{R}^{(m+1) \times 1}} + \underbrace{\vec{\epsilon}}_{\mathbb{R}^{n \times 1}} \]

Observations: \(\vec{y} = [y_1, y_2, ..., y_n]^T\)

Fitted Parameters: \(\vec{\beta} = [\beta_0, \beta_1, ..., \beta_{m}]^T\)

Data / Feature Matrix:

\[\begin{split} \mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \dots & x_{1,m} \\ 1 & x_{2,1} & x_{2,2} & \dots & x_{2,m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \dots & x_{n,m} \end{bmatrix} \end{split}\]

For our transformed model, the feature matrix is:

\[\begin{split} \mathbf{X} = \begin{bmatrix} 1 & c_1 \\ 1 & c_{2} \\ \vdots & \vdots \\ 1 & c_{n} \end{bmatrix} \end{split}\]
# feature matrix, a.k.a. data matrix, a.k.a. predictor matrix
X = np.ones((n,2))
X[:,1] = c
print("X=\n",X)

Next, we will compute the covariance matrix of the fitted parameters:

\[\Sigma_{\hat{\beta}} = \hat{\sigma}_r^2 (\mathbf{X}^T \mathbf{X})^{-1}\]
# assemble covariance matrix for regression parameters
cov_b = var_r * np.linalg.inv(X.transpose() @ X)

Finally, we will propagate the covariance matrix of the transformed fitted parameters, \(\mathbf{\Sigma_{\vec{\beta}}}\), to determine the covariance matrix of the desired isotherm parameters, \(\mathbf{\Sigma_{\vec{\theta}}}\).

Recall the model transformation:

\[K = \frac{\beta_1}{\beta_0}, \qquad Q = \frac{1}{\beta_1}, \]

We will define the vector of model parameters as \(\vec{\theta} = [K, Q]^T\).

In anticipation of applying the general nonlinear error propagation formula, we will calculate the following partial derivatives:

\[ \frac{\partial K}{\partial \beta_0} = -\frac{\beta_1}{\beta_0^2}, \qquad \frac{\partial K}{\partial \beta_1} = \frac{1}{\beta_0} \]
\[ \frac{\partial Q}{\partial \beta_0} = 0, \qquad \frac{\partial Q}{\partial \beta_1} = -\frac{1}{\beta_1^2} \]

We can then assemble these gradients into the Jacobian matrix:

\[\begin{split} \mathbf{\nabla_{\vec{\beta}} \vec{\theta}} = \begin{bmatrix} \frac{\partial K}{\partial \beta_0} & \frac{\partial K}{\partial \beta_1} \\ \frac{\partial Q}{\partial \beta_0} & \frac{\partial Q}{\partial \beta_1} \end{bmatrix} \end{split}\]
# Assemble gradient vectors for K and Q
gradK = np.array([-slope / intercept**2, 1/intercept])
gradQ = np.array([0, -1/slope**2])

print("\nGradient of K:",gradK)
print("Gradient of Q:",gradQ)

# Assemble the gradients into the Jacobian matrix
jac = np.stack((gradK, gradQ));
print("\nJacobian Matrix:\n",jac)

For this problem, calculating the partial derivatives was very straightforward, but for more complex expressions we can use a finite difference approximation, as sketched below. We’ll see this again later in the case study.
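Here is a minimal forward finite difference sketch; the helper names theta_from_beta and approx_jacobian and the step size h are illustrative assumptions, not part of the assignment:

# Sketch: approximate the Jacobian of theta(beta) = [K, Q] by forward differences
def theta_from_beta(beta):
    '''Map linear regression parameters [beta0, beta1] to [K, Q].'''
    return np.array([beta[1]/beta[0], 1/beta[1]])

def approx_jacobian(f, x, h=1e-6):
    '''Forward finite difference approximation of the Jacobian of f at x.'''
    fx = f(x)
    J = np.zeros((len(fx), len(x)))
    for j in range(len(x)):
        xp = np.array(x, dtype=float)   # copy so we can perturb one element
        xp[j] += h
        J[:, j] = (f(xp) - fx) / h
    return J

# should be close to the analytic Jacobian assembled above
print(approx_jacobian(theta_from_beta, np.array([intercept, slope])))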

We can now apply the general nonlinear error propagation formula:

\[ \Sigma_{\hat{\theta}} \approx \left( \mathbf{\nabla_{\vec{\beta}} \vec{\theta}} \right) \left(\Sigma_{\hat{\beta}} \right) \left(\mathbf{\nabla_{\vec{\beta}} \vec{\theta}}\right)^T\]
# Apply nonlinear error propagation formula
cov_theta_lin1 = jac @ cov_b @ jac.transpose()

print("\nCovariance of Original Model Parameters (K,Q):\n",cov_theta_lin1)

Parameter estimation using an alternate transformation and linear regression#

We start with the Langmuir isotherm:

\[ q = \frac{Q \cdot K \cdot c}{1 + K \cdot c}\]

With a little bit of algebra, we obtain:

\[ \frac{1}{q} = \frac{1}{Q} + \frac{1}{Q \cdot K} \cdot \frac{1}{c} \]

Model Transformation#

Identify \(x\), \(y\), \(\beta_0\), and \(\beta_1\) in the alternate transformed model. Show your work on paper and turn in via Gradescope.

Next, determine the new formulas to compute \(K\) and \(Q\) from the new fitted parameters \(\beta_0\) and \(\beta_1\). Show your work on paper and turn in via Gradescope.

Also, typeset the final answer below:

\[K = \frac{?}{?}, \qquad Q = \frac{?}{?}, \]

Trust us, writing this down in the notebook will help.

Linear Regression#

Use stats.linregress to compute the best fit line. To maximize code reuse, save the results in variables slope and intercept.

# Add your solution here

Compute \(K\) and \(Q\) from the new model. Store your answers in the variables Klin2 and Qlin2.

## Reverse transformation to obtain Q and K and display results
# Add your solution here
print("\nK (alternate linear regression) = {0:0.1f} L/mmol".format(Klin2));
print("Q (alternate linear regression) = {0:0.1f} mmol/g_membrane".format(Qlin2));

Plot Fitted Transformed Model and Residuals#

Plot the transformed model and the transformed residuals. Hint: When using the plot_linearized_isotherm function, enter 'lin2' as the fifth argument. This will add the appropriate units to the plot.

# Add your solution here

Interpret the plot. Write a few bullet points (sentences) with your observations.

Your Observations:

Plot Fitted Untransformed (Original) Model and Residuals#

Plot the Langmuir isotherm model and the (non-transformed) residuals. Hint: When using the plot_original_isotherm function, enter 'second transformed' as the fifth argument. This will add the appropriate title to the plot. Also remember to use Klin2 and Qlin2.

# Plot Langmuir isotherm
# Add your solution here

Interpret these plots. Write a few bullet points (sentences) with your observations.

Your Observations:

Uncertainty Analysis#

Calculate the residuals and store in variable r2.

# Calculate residuals
# Add your solution here

Calculate the variance of the residuals. Store in variable var_r2.

# variance of residuals
# Add your solution here

print("Variance of residuals =",var_r2," (g / mmol)^2")

Assemble the feature matrix and store in variable X2.

# matrix of predictors
# Add your solution here

print("X2 =\n",X2)

Calculate the covariance of the linear regression parameters. Store the result in cov_b2.

# assemble covariance matrix for regression parameters
# Add your solution here

print("Covariance of Transformed (Linearized) Regression Parameters (1/KQ, 1/Q):\n",cov_b2)

On paper, re-derive the formulas in the code below for the gradients of \(K\) and \(Q\) with respect to the new \(\beta_0\) and \(\beta_1\).

gradK2 = np.array([1/slope, -intercept/slope**2])
gradQ2 = np.array([-1/intercept**2, 0])

print("\nGradient of K:",gradK2)
print("Gradient of Q:",gradQ2)

Assemble the gradients into the Jacobian matrix. Store the matrix in jac2. Hint: Look at the documentation for numpy.stack.

# Add your solution here

print("\nJacobian Matrix:\n",jac2)

Apply the multivariate general nonlinear error propagation formulation. Store the covariance matrix of the original model parameters in cov_theta_lin2.

# Apply nonlinear error propagation formula
# Add your solution here

print("\nCovariance of Original Model Parameters (K,Q):\n",cov_theta_lin2)

Parameter estimation using nonlinear regression#

We will now apply nonlinear regression to our problem. First, we need to define a function to evaluate the model.

Define Model#

Fill in the missing line in model_func below.

## Code for parameter estimation using nonlinear regression

# define function for the model being fitted
def model_func(theta, c):
    '''
    Function to define model being fitted
    Arguments:
        theta: parameter vector (K, Q)
        c: concentration(s) to evaluate (scalar or vector)
    Returns:
        qhat: predicted loading(s), (scalar or vector)
    '''
    # Add your solution here
    return qhat
# End: define function for the model being fitted

Next we need to define a function to calculate the residuals for each data point.

Fill in the missing line in regression_func below.

# define function to return residuals of model being fitted
def regression_func(theta, c, q):
    '''
    Function to define regression function for least-squares fitting
    Arguments:
        theta: parameter vector
        c: concentration(s) to evaluate (vector)
        q: loading(s) to fit (vector)
    Returns:
        ls_func: evaluation of loss function
    '''
    qhat = model_func(theta,c)

    # Add your solution here
    return r
# End: define function to return residuals of model being fitted

Estimate Parameters#

Use the function scipy.optimize.least_squares to compute the best fit. Store the results in the variable theta_fit.
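If optimize.least_squares is new to you, here is a generic usage sketch on a made-up toy problem; the toy model and data below are illustrative assumptions, unrelated to the isotherm:

# Toy example: fit y = a*x + b with optimize.least_squares, where p = [a, b]
def toy_residuals(p, x, y):
    '''Return the residual vector for the toy linear model.'''
    return y - (p[0]*x + p[1])

x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = np.array([1.1, 2.9, 5.2, 7.1])
toy_fit = optimize.least_squares(toy_residuals, np.array([1.0, 1.0]), args=(x_toy, y_toy))
print(toy_fit.x)  # fitted values of [a, b]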

# Perform nonlinear parameter estimation

# initial guess
theta_guess = np.array([1, 1])

# nonlinear regression
# Add your solution here

# Extract fitted parameters and display results
Knl = theta_fit.x[0]
Qnl = theta_fit.x[1]
print("\nK (nonlinear regression) = {0:0.1f} l/mmol".format(Knl));
print("Q (nonlinear regression) = {0:0.1f} mmol/g_membrane".format(Qnl));

Visual Regression Diagnostics#

Plot the fitted model and residuals. Hint: Use the function plot_original_isotherm. Enter 'nonlinear' as the fifth argument.

# Add your solution here

Interpret these plots. Write a few bullet points (sentences) with your observations.

Your Observations:

Uncertainty Analysis#

We will complete the nonlinear regression analysis by estimating the covariance of the fitted parameters \(\mathbf{\Sigma}_{\vec{\theta}}\).

Calculate the residuals and store in r3. Next calculate the variance of the residuals and store in var_r3.

# Calculate residuals. Hint: use model_func
### BEGIN SOLUTION
# calculate residuals
r3 = q - model_func(theta_fit.x, c)
### END SOLUTION
# Calculate the variance of the residuals
# Add your solution here

print("Variance of the residuals: ",var_r3,"(mmol/g)^2")

Recall from the previous class the covariance matrix for nonlinear regression has a similar formula to the linear regression case:

\[\mathbf{\Sigma_{\vec{\theta}}} \approx \hat{\sigma}_r^2 (\mathbf{J}^T \mathbf{J})^{-1}\]

where \(\mathbf{J}\) is the Jacobian of the residuals w.r.t. \(\vec{\theta}\):

\[ J_{i,j} = \frac{\partial(y_i - \hat{y}_i)}{\partial \theta_j} \]

This IS NOT the same Jacobian matrix used for nonlinear error propagation. Does this formula look familiar? Recall, for LINEAR REGRESSION, the covariance estimate is:

\[\Sigma_{\hat{\beta}} = \hat{\sigma}_r^2 (\mathbf{X}^T \mathbf{X})^{-1}\]

Compare the nonlinear regression and linear regression covariance formulas. On paper, compute the elements of the Jacobian matrix for a linear model using \(\vec{\theta} = \vec{\beta}\). How are \(\mathbf{J}\) and \(\mathbf{X}\) related? Discuss then write one sentence.

Discussion:

Luckily, optimize.least_squares computes this Jacobian for us automatically.

print("J =\n",theta_fit.jac)

Compute the covariance matrix \(\mathbf{\Sigma_{\vec{\theta}}}\). Store your answer in cov_theta_nl.

# assemble covariance matrix
# Add your solution here
print("Covariance Matrix of Original Model Parameters (K,Q):\n",cov_theta_nl)

Comparison of Three Regression Approaches#

First Transformation + Linear Regression

print("\nK (linear regression) = {0:0.3f} l/mmol".format(Klin1));
print("Q (linear regression) = {0:0.3f} mmol/g_membrane".format(Qlin1));

print("\nCovariance of Original Model Parameters (K,Q):\n",cov_theta_lin1)

Second (Alternative) Transformation + Linear Regression

print("\nK (alternate linear regression) = {0:0.3f} L/mmol".format(Klin1));
print("Q (alternate linear regression) = {0:0.3f} mmol/g_membrane".format(Qlin2));


print("\nCovariance of Original Model Parameters (K,Q):\n",cov_theta_lin2)

Nonlinear Regression

print("\nK (nonlinear regression) = {0:0.3f} l/mmol".format(Knl));
print("Q (nonlinear regression) = {0:0.3f} mmol/g_membrane".format(Qnl));

print("\nCovariance Matrix of Original Model Parameters (K,Q):\n",cov_theta_nl)

Write at least a one sentence answer for each of the following discussion questions.

Discussion:

  1. How many significant figures are we estimating for \(Q\) and \(K\) based on the original data? Answer: Fill in here…

  2. Do the three regression strategies give the same or different estimates for \(K\) (within reasonable significant figures)? Answer: Fill in here…

  3. Do the three regression strategies give the same or different estimates for \(Q\) (within reasonable significant figures)? Answer: Fill in here…

  4. Do the three regression strategies give the same estimate for uncertainty (covariance)? Answer: Fill in here…

  5. Which regression technique is best suited for this problem? Answer: Fill in here…

Statistical Power#

As part of Problem Set 6, also complete the old exam questions on statistical power (posted on Canvas and handed out in class). Submit via Gradescope.