This notebook contains material from cbe67701-uncertainty-quantification; content is available on GitHub.
Created by Haimeng Wang (hwang22@nd.edu)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV
This example was adapted from:
McClarren, Ryan G. (2018). Uncertainty Quantification and Predictive Computational Science: A Foundation for Physical Scientists and Engineers, Chapter 4: Local Sensitivity Analysis Based on Derivative Approximations, Springer, https://doi.org/10.1007/978-3-319-99525-0_4
Lasso stands for the Least Absolute Shrinkage and Selection Operator. Unlike ridge regression, lasso regression uses the 1-norm ($L_1$) of the coefficient vector as the penalty:
$$\hat{\boldsymbol{\beta}}_{lasso} = \arg\min_{\boldsymbol{\beta}} \sum^{I}_{i=1}(y_i-\boldsymbol{\beta} \cdot \boldsymbol{X}_i)^2 + \lambda \Vert \boldsymbol{\beta} \Vert_1$$
The $L_1$ penalty tends to drive some of the coefficients exactly to zero, producing what is called a sparse model: most coefficients are zero or close to zero, and only a few are large and non-zero. In other words, the model drops variables that are not important (variables with zero coefficients).
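To see the selection effect in isolation before the book example, here is a minimal sketch (an illustrative addition, not from the book): it fits a plain Lasso to a small toy problem in which only the first two of ten inputs matter. The data, the choice alpha=0.1, and the coefficient pattern are all assumptions made for this demonstration.
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical toy problem: 50 samples, 10 inputs, and only the
# first two inputs actually drive the output.
rng = np.random.default_rng(0)
X_toy = rng.standard_normal((50, 10))
beta_true = np.zeros(10)
beta_true[:2] = [3.0, -2.0]
y_toy = X_toy @ beta_true + 0.01 * rng.standard_normal(50)

# A moderate L1 penalty (alpha chosen arbitrarily for illustration)
# drives the coefficients of the irrelevant inputs exactly to zero.
toy = Lasso(alpha=0.1).fit(X_toy, y_toy)
print(np.round(toy.coef_, 3))  # most entries should print as 0.0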
This example is adapted from Section 5.3 of the book. Suppose we have a simulation with 200 input variables, but we can only afford 100 simulation runs. The sklearn.linear_model module is used to perform the fit.
First, we use a model with 5 large non-zero coefficients (drawn uniformly between 5 and 25); the remaining coefficients are small, drawn uniformly between 0 and 0.1.
# Generate some sparse data
np.random.seed(711)
# 100 samples in a 200-dimensional input space: fewer runs than variables
n_samples, n_features = 100, 200
X = np.random.randn(n_samples, n_features)
# generate coefficients: 5 large ones, the rest small
LargeCoef = 5
SenCoef = np.zeros(n_features)
for i in range(LargeCoef):
    SenCoef[i] = 5 + 20*np.random.rand()   # uniform on [5, 25)
for i in range(LargeCoef, n_features):
    SenCoef[i] = 0.1*np.random.rand()      # uniform on [0, 0.1)
# generate y with a small amount of Gaussian noise
y = np.dot(X, SenCoef) + np.random.normal(loc=0, scale=0.001, size=n_samples)
# cross-validation and fit: LassoCV selects the penalty weight by 50-fold CV
reg = LassoCV(cv=50).fit(X, y)
print("R^2 score of the fit")
print(reg.score(X, y))
print("Penalty weight selected by cross-validation")
print(reg.alpha_)
# plot the fitted coefficients against the true ones
fig, ax = plt.subplots(figsize=(12, 9))
fig.subplots_adjust(bottom=0.15, left=0.2)
ax.scatter(np.where(reg.coef_)[0], reg.coef_[reg.coef_ != 0], label='Lasso coefficients', linewidth=4)
ax.scatter(np.where(SenCoef)[0], SenCoef[SenCoef != 0], label='Actual coefficients', linewidth=4)
ax.set_xlabel('variables', fontsize=25)
ax.set_ylabel('coefficients', fontsize=25)
ax.legend(fontsize=24, loc='upper right', frameon=False)
ax.tick_params(which='both', direction='in', colors='black',
               bottom=True, top=True, left=True, right=True, pad=15)
ax.tick_params(which='major', direction='in', length=15, labelsize=20, width=1.5)
ax.tick_params(which='minor', direction='in', length=6, width=1.5)
for axis in ['top', 'bottom', 'left', 'right']:
    ax.spines[axis].set_linewidth(2.0)
plt.show()
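For context on why a sparsity-promoting penalty matters in this underdetermined setting (100 samples, 200 variables), the following sketch is an illustrative addition, not part of the book example: it computes the minimum-norm least-squares solution with np.linalg.lstsq on noiseless data of the same shape. Without the $L_1$ penalty, the recovered coefficients spread across all 200 variables instead of concentrating on the 5 important ones.
# Hypothetical sanity check: unpenalized least squares on the same
# kind of underdetermined problem.
import numpy as np

rng = np.random.default_rng(711)
X_chk = rng.standard_normal((100, 200))
beta_chk = np.zeros(200)
beta_chk[:5] = 5 + 20 * rng.random(5)    # 5 large coefficients
y_chk = X_chk @ beta_chk                 # noiseless response

# Minimum-norm least-squares solution (no sparsity penalty)
beta_ls, *_ = np.linalg.lstsq(X_chk, y_chk, rcond=None)

# The fit is exact on the training data, but the coefficients are wrong:
# the signal is smeared over all 200 entries rather than the true 5.
print("max |beta_ls - beta_true|:", np.max(np.abs(beta_ls - beta_chk)))
print("entries with |beta_ls| < 1e-3:", np.sum(np.abs(beta_ls) < 1e-3))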
Then, we use a model with 20 large non-zero coefficients (drawn uniformly between 5 and 25); the remaining coefficients are again small, drawn uniformly between 0 and 0.1.
# Generate some sparse data
np.random.seed(6250)
# 100 samples in a 200-dimensional input space: fewer runs than variables
n_samples, n_features = 100, 200
X = np.random.randn(n_samples, n_features)
# generate coefficients: 20 large ones, the rest small
LargeCoef = 20
SenCoef = np.zeros(n_features)
for i in range(LargeCoef):
    SenCoef[i] = 5 + 20*np.random.rand()   # uniform on [5, 25)
for i in range(LargeCoef, n_features):
    SenCoef[i] = 0.1*np.random.rand()      # uniform on [0, 0.1)
# generate y with a small amount of Gaussian noise
y = np.dot(X, SenCoef) + np.random.normal(loc=0, scale=0.001, size=n_samples)
# cross-validation and fit: LassoCV selects the penalty weight by 50-fold CV
reg = LassoCV(cv=50).fit(X, y)
print("R^2 score of the fit")
print(reg.score(X, y))
print("Penalty weight selected by cross-validation")
print(reg.alpha_)
# plot the fitted coefficients against the true ones
fig, ax = plt.subplots(figsize=(12, 9))
fig.subplots_adjust(bottom=0.15, left=0.2)
ax.scatter(np.where(reg.coef_)[0], reg.coef_[reg.coef_ != 0], label='Lasso coefficients', linewidth=4)
ax.scatter(np.where(SenCoef)[0], SenCoef[SenCoef != 0], label='Actual coefficients', linewidth=4)
ax.set_xlabel('variables', fontsize=25)
ax.set_ylabel('coefficients', fontsize=25)
ax.legend(fontsize=24, loc='upper right', frameon=False)
ax.tick_params(which='both', direction='in', colors='black',
               bottom=True, top=True, left=True, right=True, pad=15)
ax.tick_params(which='major', direction='in', length=15, labelsize=20, width=1.5)
ax.tick_params(which='minor', direction='in', length=6, width=1.5)
for axis in ['top', 'bottom', 'left', 'right']:
    ax.spines[axis].set_linewidth(2.0)
plt.show()