9.2. Summary Statistics#
Further Reading: §1.2 in Navidi (2015)
9.2.1. Learning Objectives#
After studying this notebook, reviewing your notes, completing the activities, and asking questions in class, you should be able to:
Compute descriptive statistics for data using Pandas (Python package)
Interpret the elements of a covariance matrix
Explain why some data distributions have a very different median and mean
9.2.2. Two Types of Data: Numerical and Categorical#
As engineers, you will encounter both numerical (quantitative) and categorical (qualitative) data.
Unfortunately, Example 1.8 from the textbook was not included in the data files on the publisher’s website. But do not fear! We can recreate the table from a dictionary.
import pandas as pd
# Store all of the data from Example 1.8 in a dictionary
# Notice the keys are the column names
my_dict = {"Specimen":[1, 2, 3, 4, 5], "Torque":[165, 237, 222, 255, 194],"Failure Location":['Weld','Beam','Beam','Beam','Weld']}
# Convert the dictionary into a Pandas dataframe
my_df = pd.DataFrame(my_dict)
# Print
print(my_df)
# Look at the first five entries
my_df.head()
# Profit???
Specimen Torque Failure Location
0 1 165 Weld
1 2 237 Beam
2 3 222 Beam
3 4 255 Beam
4 5 194 Weld
|  | Specimen | Torque | Failure Location |
| --- | --- | --- | --- |
| 0 | 1 | 165 | Weld |
| 1 | 2 | 237 | Beam |
| 2 | 3 | 222 | Beam |
| 3 | 4 | 255 | Beam |
| 4 | 5 | 194 | Weld |
Well that was easy.
In this example, we see that Torque is a numerical variable and Failure Location is a categorical variable.
We will now learn about statistics to summarize key characteristics of samples. Let’s start by focusing on numerical variables.
9.2.3. Sample Mean#
The sample mean is the average of the sample.
Let \(X_1\), \(X_2\), …, \(X_n\) be the sample. The sample mean is

\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]
Statisticians have many quirks, which make them the butt of many jokes. I would like to draw your attention to two:
A capital variable, such as \(X_i\), is often a random variable. More on this next class.
Statisticians like to give variables decorations. Here \(\bar{~}\) (bar) means average. Later in the semester we’ll see that \(\hat{~}\) (hat) means estimate.
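Before loading a bigger data set, here is a quick sanity check of the formula using the my_df dataframe created above: applying the formula by hand should match pandas.

# Sanity check: compute the sample mean of Torque by hand and with pandas
torque = my_df["Torque"]

# By hand: sum the observations and divide by n
manual_mean = sum(torque) / len(torque)
print(manual_mean)    # (165 + 237 + 222 + 255 + 194) / 5 = 214.6

# With pandas
print(torque.mean())  # 214.6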
It is really easy to calculate the sample mean with pandas:
low = pd.read_csv('https://raw.githubusercontent.com/ndcbe/data-and-computing/main/notebooks/data/table1-1.csv')
high = pd.read_csv('https://raw.githubusercontent.com/ndcbe/data-and-computing/main/notebooks/data/table1-2.csv')
low.mean()
PM 3.714565
dtype: float64
That output looks strange.
type(low.mean())
pandas.core.series.Series
Interesting. The command low.mean() does not return a floating point number. It instead returns a variable of type pandas.core.series.Series. What if I just want the sample mean as a floating point number?
In pandas, we can access a column using its name. Recall here are the first five elements in the data set:
low.head()
|  | PM |
| --- | --- |
| 0 | 1.50 |
| 1 | 0.87 |
| 2 | 1.12 |
| 3 | 1.25 |
| 4 | 3.46 |
This data set only has one column. But if we want to extract that column, we write:
low.PM
0 1.50
1 0.87
2 1.12
3 1.25
4 3.46
...
133 4.63
134 2.80
135 2.16
136 2.97
137 3.90
Name: PM, Length: 138, dtype: float64
And to get the numeric mean, we simply write:
low.PM.mean()
3.714565217391306
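Bracket indexing is equivalent, and it also works for column names that contain spaces:

# same result with bracket indexing
low["PM"].mean()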
Optional Home Activity
Calculate the sample mean for PM in the high altitude data set. Store your answer (a float) in the variable ans_14b_1.
# Add your solution here
# Removed autograder test. You may delete this cell.
9.2.4. Sample Variance and Standard Deviation#
While the mean measures the center of a sample, the variance and standard deviation measure its spread.

Let \(X_1\), …, \(X_n\) be a sample. The sample variance is

\[
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2
\]
Class Activity
If the measured quantity \(X\) is velocity with units m/s, then what are the units of variance \(s^2\)?
Activity Answer:
Let’s dig into the idea that variance measures the spread of the data set. Consider a synthetic (i.e., made up, artificial) data set with two columns, A and B:
my_data = pd.DataFrame({"A":[0, 0, 5, 10, 10], "B":[4, 4, 5, 6, 6]})
my_data.head()
|  | A | B |
| --- | --- | --- |
| 0 | 0 | 4 |
| 1 | 0 | 4 |
| 2 | 5 | 5 |
| 3 | 10 | 6 |
| 4 | 10 | 6 |
Both columns have the same mean (average):
my_data.mean()
A 5.0
B 5.0
dtype: float64
Optional Home Activity
Please take a minute to verify (on paper or in your head) that both data sets do in fact have a mean of 5.
Cool, pandas just calculated the means for both columns in our synthetic data set.
What about the variance of both columns? The variance formula,

\[
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2,
\]

sums the squared differences between each datum (a.k.a., data point) and the mean. Thus we expect a data set with a larger spread to have a larger variance.
my_data.var()
A 25.0
B 1.0
dtype: float64
And this is in fact what we see with the variance calculation. Column A, with range 0 to 10, has a much larger variance than column B, which has a range of 4 to 6.
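As a quick check on the formula, here is the variance of column A computed by hand using the my_data dataframe from above:

# Verify pandas' variance for column A by hand
A = my_data["A"]
n = len(A)                           # n = 5
xbar = A.mean()                      # mean = 5.0
squared_dev = (A - xbar)**2          # 25, 25, 0, 25, 25
print(squared_dev.sum() / (n - 1))   # 100 / 4 = 25.0, matching my_data.var()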
Often we prefer to work with the sample standard deviation, which is the square root of the sample variance:

\[
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2}
\]

Notice that \(s\) has the same units as \(X\).
my_data.std()
A 5.0
B 1.0
dtype: float64
But why do the standard deviation and variance formulas divide by \(n-1\) and not \(n\)? Short answer: we usually do not know the population mean, so the formula uses the sample mean \(\bar{X}\) in its place. (Recall, the sample was drawn from a population.) Estimating that one parameter from the data consumes one degree of freedom, so we divide by \(n-1\) instead of \(n\). We will revisit this idea a few times during the semester. A common exercise in a graduate level statistics course is to prove that dividing by \(n-1\) makes \(s^2\) an unbiased estimate of the population variance. Still curious? Check out this video: https://www.youtube.com/watch?v=D1hgiAla3KI
If we really wanted to, we could change ddof (pandas divides by \(n - \text{ddof}\)) from the default of 1 to any number:
# divide by n - 1 in the variance and standard deviation formulae (the default)
print("variance\n",my_data.var(ddof=1),"\n")
print("standard deviation\n",my_data.std(ddof=1))
variance
A 25.0
B 1.0
dtype: float64
standard deviation
A 5.0
B 1.0
dtype: float64
# divide by n in the variance and standard deviation formulae
print("variance\n",my_data.var(ddof=0),"\n")
print("standard deviation\n",my_data.std(ddof=0))
variance
A 20.0
B 0.8
dtype: float64
standard deviation
A 4.472136
B 0.894427
dtype: float64
# divide by n - 2 in the variance and standard deviation formulae
print("variance\n",my_data.var(ddof=2),"\n")
print("standard deviation\n",my_data.std(ddof=2))
variance
A 33.333333
B 1.333333
dtype: float64
standard deviation
A 5.773503
B 1.154701
dtype: float64
As with the mean, we can easily extract a floating point number from pandas.
my_data["A"].std()
5.0
my_data.A.std()
5.0
Optional Home Activity
Calculate the standard deviation (using \(n-1\)) for the particulate matter example. Store the results in the dictionary ans_14b_2 with keys “low” and “high”. Hint: You’ll need to calculate the standard deviation as a float and then save the answers into the dictionary.
# Add your solution here
# Removed autograder test. You may delete this cell.
Optional Home Activity
Show the following formulae are equivalent to the definitions of sample variance and sample standard deviation given above. This is excellent practice for the next exam.

\[
s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n \bar{X}^2 \right), \qquad s = \sqrt{\frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n \bar{X}^2 \right)}
\]
9.2.5. Sample Median#
The sample median also measures the center of a data set. To compute the median, we order the data from smallest to largest and find the middle. For a data set with an even number of elements, we average the two middle values.
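Here is a minimal sketch of that recipe in Python (the helper function is hypothetical, not from the textbook):

# Compute the median by hand: sort, then take the middle value(s)
def simple_median(values):
    v = sorted(values)
    n = len(v)
    mid = n // 2
    if n % 2 == 1:   # odd number of elements: take the middle value
        return v[mid]
    else:            # even: average the two middle values
        return (v[mid - 1] + v[mid]) / 2

print(simple_median([0, 0, 5, 10, 10]))  # 5 (odd n)
print(simple_median([4, 4, 6, 6]))       # 5.0 (even n)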
As you likely expect, pandas computes the median:
my_data.median()
A 5.0
B 5.0
dtype: float64
Optional Home Activity
Calculate the median for the particulate matter example. Store the results in the dictionary ans_14b_3 with keys “low” and “high”. Hint: You’ll need to calculate the median as a float and then save the answers into the dictionary.
# Add your solution here
# Removed autograder test. You may delete this cell.
9.2.6. Sample Mode and Range#
The sample mode is the most common element in a sample.
Let’s review samples A and B:
my_data.head()
|  | A | B |
| --- | --- | --- |
| 0 | 0 | 4 |
| 1 | 0 | 4 |
| 2 | 5 | 5 |
| 3 | 10 | 6 |
| 4 | 10 | 6 |
Recall, these two data sets have the same mean and median but different variances. Let’s compute the mode:
my_data.mode()
|  | A | B |
| --- | --- | --- |
| 0 | 0 | 4 |
| 1 | 10 | 6 |
Interesting. In data set A, there are two 0 elements and two 10 elements. Thus 0 and 10 are tied for the mode, and pandas returns both values. Likewise, 4 and 6 are tied for the mode in data set B.
Let’s look at the particulate matter example together.
low.PM.mode()
0 1.11
1 1.63
Name: PM, dtype: float64
This suggests that both 1.11 and 1.63 are tied for the mode. Notice that .mode() returns a dictionary-like structure. We can access the first mode with the index (key) 0:
low.PM.mode()[0]
1.11
And the second mode with index (key) 1:
low.PM.mode()[1]
1.63
We can investigate further using the .value_counts() command:
low.PM.value_counts()
1.63 3
1.11 3
2.67 2
2.96 2
5.30 2
..
0.55 1
3.67 1
1.14 1
1.37 1
3.90 1
Name: PM, Length: 124, dtype: int64
This gives us the number of times each value repeats in the data set, sorted from most to least common.
Optional Home Activity
Calculate the mode for the high elevation particulate matter example. Store the results in the float ans_14b_4. Hint: You’ll need to either extract the float from the pandas output with key 0 or manually enter your answer.
# Add your solution here
# Removed autograder test. You may delete this cell.
The sample range is the difference between the smallest and largest values in a sample. Pandas allows us to easily calculate these:
# Inspect the data set
my_data.head()
|  | A | B |
| --- | --- | --- |
| 0 | 0 | 4 |
| 1 | 0 | 4 |
| 2 | 5 | 5 |
| 3 | 10 | 6 |
| 4 | 10 | 6 |
# identify smallest values
my_data.min()
A 0
B 4
dtype: int64
# identify largest values
my_data.max()
A 10
B 6
dtype: int64
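Subtracting the two gives the sample range for each column in one line:

# sample range = largest value minus smallest value
my_data.max() - my_data.min()  # A: 10, B: 2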
9.2.7. Quartiles and Percentiles#
The median is the 50th percentile of the data set: half of the elements lie above it and half below. We can generalize this to any value between 0% and 100%. You'll notice that the pandas documentation says quantile instead of percentile; these refer to the same idea, with a quantile written as a fraction (e.g., 0.5) and a percentile as a percentage (e.g., 50%).
# Calculate 50% quantile, a.k.a, 50%-ile
low.PM.quantile(0.5)
3.18
# Verify this is the median
low.PM.median()
3.18
Quartiles divide the data set into quarters (four pieces). Quartiles are specific percentiles:
| Quartile | Percentile / Quantile |
| --- | --- |
| 1st | 25% |
| 2nd | 50% |
| 3rd | 75% |
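pandas can compute all three at once by passing a list of quantiles; for the synthetic data set from earlier:

# first, second (median), and third quartiles in one call
my_data.quantile([0.25, 0.5, 0.75])  # A: 0, 5, 10; B: 4, 5, 6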
Let’s look at the low altitude particulate matter example again:
# Determine the number of entries
low.PM.count()
138
It contains 138 data points. So what happens if we want to compute the 52.1%-ile?
Pandas will give us a number:
low.PM.quantile(0.521)
3.34754
The displayed value is rounded; the full, unrounded answer ends in …000004. What is going on?
Because there are only 138 elements, the quantiles increment by 0.724…%
1/138
0.007246376811594203
Pandas is doing interpolation under the hood because there is not a datum exactly at the 52.1%-ile. The …000004 ending is an artifact of inexact floating point arithmetic.
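To see the interpolation at work, we can ask for the actual data points on either side of the requested quantile (a sketch, assuming a pandas version whose .quantile() accepts the interpolation keyword):

# 'lower' and 'higher' return actual data points bracketing the 52.1%-ile;
# the default, 'linear', interpolates between them
print(low.PM.quantile(0.521, interpolation='lower'))
print(low.PM.quantile(0.521, interpolation='higher'))
print(low.PM.quantile(0.521, interpolation='linear'))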
Optional Home Activity
Calculate the 80%-ile for the high elevation particulate matter example. Store the results in the float ans_14b_5.
# Add your solution here
# Removed autograder test. You may delete this cell.
9.2.8. Multivariate Data Example: Exam 1 Scores#
The file Exam_1_scores.csv contains anonymized numeric scores from Exam 1 this semester. Let's open the file with Pandas.
# Open the file
exam1_full = pd.read_csv("https://raw.githubusercontent.com/ndcbe/data-and-computing/main/notebooks/data/Exam_1_scores.csv")
# Print entire dataframe
print(exam1_full)
Total Score 1-A 1-B 2-A-1 2-A-2 2-B 2-C-1 2-C-2 2-C-3 3-A \
0 60.00 4.00 4.00 2.00 2.00 3.0 2.00 2.00 3.0 3.0
1 59.75 4.00 4.00 2.00 2.00 3.0 2.00 2.00 3.0 3.0
2 59.75 4.00 4.00 2.00 2.00 3.0 2.00 2.00 3.0 3.0
3 59.75 4.00 4.00 2.00 2.00 3.0 2.00 2.00 3.0 3.0
4 59.25 4.00 4.00 2.00 2.00 3.0 2.00 2.00 3.0 3.0
.. ... ... ... ... ... ... ... ... ... ...
61 42.25 2.25 1.00 2.00 2.00 3.0 2.00 2.00 1.0 3.0
62 39.00 2.25 1.00 1.25 1.25 3.0 1.25 0.75 1.0 3.0
63 22.00 2.25 0.25 0.25 0.25 0.0 2.00 2.00 1.0 3.0
64 21.75 4.00 0.00 0.25 0.25 0.0 1.50 2.00 3.0 3.0
65 15.50 0.25 1.00 0.25 0.25 0.0 0.50 0.75 0.0 3.0
3-B-1 3-B-2 3-B-3 4-A-1 4-A-2 4-B 4-C-1 4-C-2 5-A 5-B
0 4.00 1.00 2.00 2.0 2.00 3.00 4.0 3.00 7.00 7.00
1 4.00 1.00 2.00 2.0 1.75 3.00 4.0 3.00 7.00 7.00
2 4.00 1.00 2.00 2.0 1.75 3.00 4.0 3.00 7.00 7.00
3 4.00 1.00 2.00 2.0 1.75 3.00 4.0 3.00 7.00 7.00
4 4.00 1.00 2.00 2.0 1.50 3.00 4.0 3.00 6.75 7.00
.. ... ... ... ... ... ... ... ... ... ...
61 4.00 1.00 2.00 2.0 1.75 3.00 1.5 0.00 7.00 1.75
62 4.00 1.00 2.00 1.0 0.00 0.25 4.0 2.75 5.50 3.75
63 0.75 1.00 1.00 2.0 1.25 1.00 1.5 0.00 1.00 1.50
64 0.75 1.00 0.00 0.0 1.00 0.00 0.0 0.00 5.00 0.00
65 0.25 0.25 0.25 2.0 1.25 3.00 1.0 0.00 0.00 1.50
[66 rows x 20 columns]
We see 66 rows (students) and 20 columns. A perfect score was 60 points. Let's loop over the column names:
for c in exam1_full.columns:
print(c)
Total Score
1-A
1-B
2-A-1
2-A-2
2-B
2-C-1
2-C-2
2-C-3
3-A
3-B-1
3-B-2
3-B-3
4-A-1
4-A-2
4-B
4-C-1
4-C-2
5-A
5-B
Let’s make a new Pandas dataframe with only the problem totals.
# create empty dictionary
new_data = {}
new_data['Total'] = exam1_full['Total Score']
new_data['P1'] = exam1_full['1-A'] + exam1_full['1-B']
new_data['P2'] = exam1_full['2-A-1'] + exam1_full['2-A-2'] + exam1_full['2-B']
new_data['P2'] += exam1_full['2-C-1'] + exam1_full['2-C-2'] + exam1_full['2-C-3']
new_data['P3'] = exam1_full['3-A'] + exam1_full['3-B-1'] + exam1_full['3-B-2'] + exam1_full['3-B-3']
new_data['P4'] = exam1_full['4-A-1'] + exam1_full['4-A-2'] + exam1_full['4-B']
new_data['P4'] += exam1_full['4-B'] + exam1_full['4-C-1'] + exam1_full['4-C-2']
new_data['P5'] = exam1_full['5-A'] + exam1_full['5-B']
exam1 = pd.DataFrame(new_data)
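An equivalent, more compact pattern is to select the relevant columns and sum across each row. Here is a sketch for P1 and P2; the same idea extends to the other problems:

# select the sub-question columns, then sum across each row (axis=1)
p1_alt = exam1_full[['1-A', '1-B']].sum(axis=1)
p2_alt = exam1_full[['2-A-1', '2-A-2', '2-B', '2-C-1', '2-C-2', '2-C-3']].sum(axis=1)

# confirm both approaches agree
print((p1_alt == exam1['P1']).all())
print((p2_alt == exam1['P2']).all())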
Class Activity
Print the first and last five elements of the data set.
# print top ("head") of data frame
# Add your solution here
# print bottom ("tail") of data frame
# Add your solution here
Class Activity
Compute the mean, median, mode, standard deviation, and quartiles using pandas.
# mean
# Add your solution here
# median
# Add your solution here
# mode
# Add your solution here
# standard deviation
# Add your solution here
# 25%-ile
# Add your solution here
# 50%-ile
# Add your solution here
# 75%-ile
# Add your solution here
Pandas offers a single function to compute these summary statistics.
# all in one line
exam1.describe()
|  | Total | P1 | P2 | P3 | P4 | P5 |
| --- | --- | --- | --- | --- | --- | --- |
| count | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 |
| mean | 51.261364 | 5.734848 | 12.875000 | 9.435606 | 14.924242 | 11.083333 |
| std | 8.487823 | 1.839146 | 2.156051 | 1.281008 | 3.065932 | 3.025956 |
| min | 15.500000 | 1.250000 | 1.750000 | 3.750000 | 1.000000 | 1.500000 |
| 25% | 49.312500 | 5.000000 | 12.312500 | 9.500000 | 13.562500 | 9.937500 |
| 50% | 52.500000 | 5.000000 | 14.000000 | 10.000000 | 16.500000 | 12.250000 |
| 75% | 56.500000 | 8.000000 | 14.000000 | 10.000000 | 17.000000 | 12.937500 |
| max | 60.000000 | 8.000000 | 14.000000 | 10.000000 | 17.000000 | 14.000000 |
9.2.9. Sample Covariance#
Standard deviation (and variance) measure the average squared distance between each element of a data set and the mean. But standard deviation describes only a single variable (one dimension).
Often, we want to know to what extent two variables are related. Let \(X_1\), \(X_2\), …, \(X_n\) and \(Y_1\), \(Y_2\), …, \(Y_n\) be the sample, where each pair (\(X_i\), \(Y_i\)) corresponds to one experiment. For example, \(X\) could be the effluent temperature and \(Y\) could be the conversion for an adiabatic reactor. The experiment was repeated for \(n\) trials.
The sample covariance is a generalization of variance from one to two dimensions:

\[
s_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)
\]

Here \(\bar{X}\) and \(\bar{Y}\) are the sample means for variables \(X\) and \(Y\).
If \(X_i\) and \(Y_i\) tend to move together, i.e., both are above (\(Y_i > \bar{Y}\) when \(X_i > \bar{X}\)) or below (\(Y_i < \bar{Y}\) when \(X_i < \bar{X}\)) their sample means, then \(s_{X,Y} \gg 0\).

If they move in opposite directions, i.e., \(Y_i < \bar{Y}\) when \(X_i > \bar{X}\) and \(Y_i > \bar{Y}\) when \(X_i < \bar{X}\), then \(s_{X,Y} \ll 0\).

If there is not a strong trend, then \(s_{X,Y} \approx 0\). The sketch below illustrates all three cases.
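Here is a tiny synthetic demo (made-up numbers, hypothetical column names) showing the sign conventions using pandas' Series.cov:

# synthetic data: Y_up moves with X, Y_down moves against X, Y_flat shows no trend
demo = pd.DataFrame({"X": [1, 2, 3, 4, 5],
                     "Y_up": [2, 4, 6, 8, 10],
                     "Y_down": [10, 8, 6, 4, 2],
                     "Y_flat": [4, 8, 6, 8, 4]})

print(demo["X"].cov(demo["Y_up"]))    # positive (5.0)
print(demo["X"].cov(demo["Y_down"]))  # negative (-5.0)
print(demo["X"].cov(demo["Y_flat"]))  # zero (0.0)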
Pandas can also compute the full covariance matrix of a dataframe in one call:
exam1.cov()
|  | Total | P1 | P2 | P3 | P4 | P5 |
| --- | --- | --- | --- | --- | --- | --- |
| Total | 72.043138 | 10.143444 | 16.003365 | 7.843051 | 19.822028 | 20.955769 |
| P1 | 10.143444 | 3.382459 | 1.928846 | 0.750932 | 2.007488 | 2.357051 |
| P2 | 16.003365 | 1.928846 | 4.648558 | 1.763942 | 3.881731 | 4.382692 |
| P3 | 7.843051 | 0.750932 | 1.763942 | 1.640982 | 2.086393 | 1.952564 |
| P4 | 19.822028 | 2.007488 | 3.881731 | 2.086393 | 9.399942 | 3.725641 |
| P5 | 20.955769 | 2.357051 | 4.382692 | 1.952564 | 3.725641 | 9.156410 |
Class Activity
What are the units of sample covariance?
Activity Answer:
Often, it is more convenient to interpret the sample correlation, which is the sample covariance scaled by the standard deviation of each variable:

\[
r_{X,Y} = \frac{s_{X,Y}}{s_{X} \, s_{Y}}
\]

Recall, \(s_{X,Y}\) is the covariance, and \(s_{X}\) and \(s_{Y}\) are the standard deviations for variables \(X\) and \(Y\).
By construction, correlation is bounded between -1 and 1. During the remainder of the semester, we will see why the following rules hold:
| Correlation | Interpretation |
| --- | --- |
| \(r_{X,Y} = -1.0\) | Perfect negative correlation. Samples \(X_i\) and \(Y_i\) lie exactly on a straight line with a negative slope. |
| \(-1.0 < r_{X,Y} < 0\) | Negative correlation. If we fit a line to the data \(X_i\) and \(Y_i\), the slope would be negative. |
| \(r_{X,Y} = 0\) | No correlation. If we fit a line to the data \(X_i\) and \(Y_i\), the slope would be zero. |
| \(0 < r_{X,Y} < 1\) | Positive correlation. If we fit a line to the data \(X_i\) and \(Y_i\), the slope would be positive. |
| \(r_{X,Y} = 1.0\) | Perfect positive correlation. Samples \(X_i\) and \(Y_i\) lie exactly on a straight line with a positive slope. |
Class Activity
What are the units of sample correlation?
Activity Answer:
Let’s look at the correlation for the exam1 data.
exam1.corr()
|  | Total | P1 | P2 | P3 | P4 | P5 |
| --- | --- | --- | --- | --- | --- | --- |
| Total | 1.000000 | 0.649790 | 0.874492 | 0.721335 | 0.761709 | 0.815915 |
| P1 | 0.649790 | 1.000000 | 0.486432 | 0.318737 | 0.356020 | 0.423536 |
| P2 | 0.874492 | 0.486432 | 1.000000 | 0.638665 | 0.587224 | 0.671768 |
| P3 | 0.721335 | 0.318737 | 0.638665 | 1.000000 | 0.531229 | 0.503722 |
| P4 | 0.761709 | 0.356020 | 0.587224 | 0.531229 | 1.000000 | 0.401583 |
| P5 | 0.815915 | 0.423536 | 0.671768 | 0.503722 | 0.401583 | 1.000000 |
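As a check on the formula above, one entry of this matrix can be reproduced from the covariance matrix; the P1-P2 entry should come out near 0.486:

# r = s_XY / (s_X * s_Y), applied to problems 1 and 2
cov_p1p2 = exam1['P1'].cov(exam1['P2'])
r_p1p2 = cov_p1p2 / (exam1['P1'].std() * exam1['P2'].std())
print(r_p1p2)  # approximately 0.486432, matching exam1.corr()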
Class Activity
Interpret the correlations for the exam data. Why is the diagonal exactly 1.0? What would a negative correlation mean?
What happens if we first normalize the scores by the number of available points?
# Make a copy of the DataFrame
exam1_norm = exam1.copy()
# Divide by the total number of available points
exam1_norm["P1"] = exam1_norm["P1"] / 8
exam1_norm["P2"] = exam1_norm["P2"] / 14
exam1_norm["P3"] = exam1_norm["P3"] / 10
exam1_norm["P4"] = exam1_norm["P4"] / 17
exam1_norm["P5"] = exam1_norm["P5"] / 14
exam1_norm["Total"] = exam1_norm["Total"] / 60
# Compute covariance matrix
exam1_norm.cov()
|  | Total | P1 | P2 | P3 | P4 | P5 |
| --- | --- | --- | --- | --- | --- | --- |
| Total | 0.020012 | 0.021132 | 0.019052 | 0.013072 | 0.019433 | 0.024947 |
| P1 | 0.021132 | 0.052851 | 0.017222 | 0.009387 | 0.014761 | 0.021045 |
| P2 | 0.019052 | 0.017222 | 0.023717 | 0.012600 | 0.016310 | 0.022361 |
| P3 | 0.013072 | 0.009387 | 0.012600 | 0.016410 | 0.012273 | 0.013947 |
| P4 | 0.019433 | 0.014761 | 0.016310 | 0.012273 | 0.032526 | 0.015654 |
| P5 | 0.024947 | 0.021045 | 0.022361 | 0.013947 | 0.015654 | 0.046716 |
# Compute correlation matrix
exam1_norm.corr()
|  | Total | P1 | P2 | P3 | P4 | P5 |
| --- | --- | --- | --- | --- | --- | --- |
| Total | 1.000000 | 0.649790 | 0.874492 | 0.721335 | 0.761709 | 0.815915 |
| P1 | 0.649790 | 1.000000 | 0.486432 | 0.318737 | 0.356020 | 0.423536 |
| P2 | 0.874492 | 0.486432 | 1.000000 | 0.638665 | 0.587224 | 0.671768 |
| P3 | 0.721335 | 0.318737 | 0.638665 | 1.000000 | 0.531229 | 0.503722 |
| P4 | 0.761709 | 0.356020 | 0.587224 | 0.531229 | 1.000000 | 0.401583 |
| P5 | 0.815915 | 0.423536 | 0.671768 | 0.503722 | 0.401583 | 1.000000 |
Class Activity
Why is the correlation matrix not changed by scaling?
9.2.10. Summary Statistics for Categorical Variables#
Categorical variables are qualitative. Recall our example from earlier:
my_df.head()
|  | Specimen | Torque | Failure Location |
| --- | --- | --- | --- |
| 0 | 1 | 165 | Weld |
| 1 | 2 | 237 | Beam |
| 2 | 3 | 222 | Beam |
| 3 | 4 | 255 | Beam |
| 4 | 5 | 194 | Weld |
Here, Failure Location is a categorical variable. We cannot compute the mean, median, standard deviation, or range of a categorical variable. Instead, we often summarize it by counting frequencies.
my_df['Failure Location'].value_counts()
Beam 3
Weld 2
Name: Failure Location, dtype: int64
We can also compute the sample proportions by normalizing (dividing by) the total number of samples.
my_df['Failure Location'].value_counts(normalize=True)
Beam 0.6
Weld 0.4
Name: Failure Location, dtype: float64
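One summary statistic that does carry over from numerical data is the sample mode, i.e., the most frequent category:

# the most common failure location
my_df['Failure Location'].mode()  # Beam (appears 3 of 5 times)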
9.2.11. Summary Statistics and Population Parameters#
Each sample statistic we have discussed (e.g., mean, median, standard deviation, covariance) has a counterpart for the population. We will call numerical summaries of samples statistics and numerical summaries of populations parameters. A central idea in data analysis is to use statistics to estimate/infer parameters. Often, we cannot directly measure a population. So instead we sample.