This notebook contains material from cbe67701-uncertainty-quantification; content is available on GitHub.
Stephen Adams (sadams22@nd.edu) 6/18/2020
The following example shows an application of principal component analysis (PCA), also known as random variable reduction, the Hotelling transform, or proper orthogonal decomposition (see page 76 of the textbook). PCA is commonly used in machine learning to reduce the number of features (components) in a data set, thereby speeding up machine learning algorithms.
In this example, a scree plot will be generated. A scree plot shows how much of the variance in a data set can be attributed to each principal component (see Fig. 3.14 in the textbook).
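For reference (the standard definition, not a quote from the textbook), the height of each bar in a scree plot is the fraction of the total variance captured by that component. If $\lambda_k$ denotes the $k$-th largest eigenvalue of the covariance matrix of the scaled data, the percentage plotted for principal component $k$ is

$$\text{explained variance}_k = \frac{\lambda_k}{\sum_i \lambda_i} \times 100\%$$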
## Download data from GitHub
import os, requests, urllib.parse, urllib.request

# GitHub pages url
url = "https://ndcbe.github.io/cbe67701-uncertainty-quantification/"

# relative file paths to download
# this is the only line of code you need to change
file_paths = ['data/quarterbacks3.csv']

# loop over all files to download
for file_path in file_paths:
    print("Checking for", file_path)

    # split each file_path into a folder and filename
    stem, filename = os.path.split(file_path)

    # check if the folder name is not empty
    if stem:
        # if the folder does not exist, create it
        if not os.path.exists(stem):
            print("\tCreating folder", stem)
            os.mkdir(stem)

    # if the file does not exist, download it from GitHub pages
    if not os.path.isfile(file_path):
        file_url = urllib.parse.urljoin(url,
            urllib.request.pathname2url(file_path))
        print("\tDownloading", file_url)
        with open(file_path, 'wb') as f:
            f.write(requests.get(file_url).content)
    else:
        print("\tFile found!")
The data set contains statistics for the starting quarterbacks of all 32 NFL teams in the 2019 season. The passing statistics can be found at https://www.pro-football-reference.com/years/2019/passing.htm. Each quarterback's salary is also included and can be found at https://www.spotrac.com/nfl/rankings/2019/average/quarterback/.
# Import all libraries
import pandas as pd
import io
import numpy as np
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt
# Read the data into a DataFrame (the file is tab-delimited)
qb_data = pd.read_csv('./data/quarterbacks3.csv', delimiter="\t")
# Preview the first few rows
print(qb_data.head())
Now perform PCA on the data set.
# Eliminate the columns with strings such as "Player" and "Team", leaving only
# numeric columns so that PCA can be performed.
new_qb_data = qb_data.drop(['Data','Player','Team'], axis=1)
# To perform PCA, the data must be scaled. This adjusts the values so that each
# column (feature) has a mean of 0 and a standard deviation of 1.
scaled_qb_data = preprocessing.scale(new_qb_data)
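# Quick check (illustrative addition): after scaling, each column should have a
# mean of approximately 0 and a standard deviation of approximately 1.
print(np.round(scaled_qb_data.mean(axis=0), 2))
print(np.round(scaled_qb_data.std(axis=0), 2))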
# Perform PCA
qb_pca = PCA()
qb_pca.fit(scaled_qb_data)
qb_pca_data = qb_pca.transform(scaled_qb_data)
# Calculate the percentage of variation each principal component accounts for.
percent_variance = np.round(qb_pca.explained_variance_ratio_*100, decimals=2)
labels = [str(x) for x in range(1, len(percent_variance)+1)]
plt.bar(x=range(1,len(percent_variance)+1), height=percent_variance, tick_label=labels)
plt.ylabel('Explained Variance (%)')
plt.xlabel('Principal Component')
plt.title('Scree Plot for 2019 NFL Quarterback Data')
plt.show()
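As a sanity check, the same percentages can be recovered directly from the eigenvalues of the covariance matrix of the scaled data. This is a minimal sketch that assumes the scaled_qb_data array and percent_variance from above are still in memory; the sorted, normalized eigenvalues should match qb_pca.explained_variance_ratio_ up to floating-point error.
# Eigen-decomposition of the covariance matrix of the scaled data
cov = np.cov(scaled_qb_data, rowvar=False)   # features are columns
eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigenvalues, largest first
manual_percent = np.round(eigvals / eigvals.sum() * 100, decimals=2)
print(manual_percent)
print(percent_variance)                      # values computed by scikit-learn above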
Most of the variance can be explained by the first 4 principal components. Keeping more principal components preserves more of the information in the data, which generally makes downstream prediction algorithms more accurate but slower; keeping fewer components has the opposite effect.
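One common way to choose a cutoff is the cumulative explained variance. The sketch below assumes the fitted qb_pca object and scaled_qb_data from above; the 90% threshold is an arbitrary choice for illustration.
# Cumulative explained variance and the smallest number of components
# needed to capture at least 90% of the total variance
cumulative = np.cumsum(qb_pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print("Components needed for 90% of the variance:", n_keep)

# scikit-learn can make the same selection directly: passing a fraction to
# n_components keeps just enough components to reach that variance threshold
reduced_qb_data = PCA(n_components=0.90).fit_transform(scaled_qb_data)
print("Reduced data shape:", reduced_qb_data.shape)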