9.2. Summary Statistics#

Further Reading: §1.2 in Navidi (2015)

9.2.1. Learning Objectives#

After studying this notebook and your notes, completing the activities, and asking questions in class, you should be able to:

  • Compute descriptive statistics for data using Pandas (Python package)

  • Interpret the elements of a covariance matrix

  • Explain why some data distributions have a very different median and mean

9.2.2. Two Types of Data: Numerical and Categorical#

As engineers, you will encounter both numerical (quantitative) and categorical (qualitative) data.

Unfortunately, Example 1.8 from the textbook was not included in the data files on the publisher’s website. But do not fear! We can recreate the table from a dictionary.

import pandas as pd
# Store all of the data from Example 1.8 in a dictionary
# Notice the keys are the column names
my_dict = {"Specimen":[1, 2, 3, 4, 5], "Torque":[165, 237, 222, 255, 194],"Failure Location":['Weld','Beam','Beam','Beam','Weld']}

# Convert the dictionary into a Pandas dataframe
my_df = pd.DataFrame(my_dict)

# Print
print(my_df)

# Look at the first five entries
my_df.head()

# Profit???
   Specimen  Torque Failure Location
0         1     165             Weld
1         2     237             Beam
2         3     222             Beam
3         4     255             Beam
4         5     194             Weld
   Specimen  Torque Failure Location
0         1     165             Weld
1         2     237             Beam
2         3     222             Beam
3         4     255             Beam
4         5     194             Weld

Well that was easy.

In this example, we see that Torque is a numerical variable and Failure Location is a categorical variable.

We will now learn about statistics to summarize key characteristics of samples. Let’s start by focusing on numerical variables.

9.2.3. Sample Mean#

The sample mean is the average of the sample.

Let \(X_1\), \(X_2\), …, \(X_n\) be the sample. The sample mean is

\[\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\]

Statisticians have many quirks, which make them the butts of many jokes. I would like to draw your attention to two:

  1. A capital variable, such as \(X_i\), is often a random variable. More on this next class.

  2. Statisticians like to give variables decorations. Here \(\bar{~}\) (bar) means average. Later in the semester we’ll see that \(\hat{~}\) (hat) means estimate.
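
To connect the formula to code, here is a minimal sketch that computes the sample mean of the torque data from Example 1.8 by hand:

# Compute the sample mean by hand: sum the observations, divide by n
torque = [165, 237, 222, 255, 194]
xbar = sum(torque) / len(torque)
print(xbar)   # 214.6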

It is really easy to calculate the sample mean with pandas:

low = pd.read_csv('https://raw.githubusercontent.com/ndcbe/data-and-computing/main/notebooks/data/table1-1.csv')
high = pd.read_csv('https://raw.githubusercontent.com/ndcbe/data-and-computing/main/notebooks/data/table1-2.csv')
low.mean()
PM    3.714565
dtype: float64

That output looks strange.

type(low.mean())
pandas.core.series.Series

Interesting. The command low.mean() does not return a floating point number. It instead returns a variable of type pandas.core.series.Series. What if I just want the sample mean as a floating point number?

In pandas, we can access a column using its name. Recall here are the first five elements in the data set:

low.head()
     PM
0  1.50
1  0.87
2  1.12
3  1.25
4  3.46

This data set only has one column. But if we want to extract that column, we write:

low.PM
0      1.50
1      0.87
2      1.12
3      1.25
4      3.46
       ... 
133    4.63
134    2.80
135    2.16
136    2.97
137    3.90
Name: PM, Length: 138, dtype: float64

And to get the numeric mean, we simply write:

low.PM.mean()
3.714565217391306

Optional Home Activity

Calculate the sample mean for PM in the high altitude data set. Store your answer (a float) in the variable ans_14b_1.

# Add your solution here
# Removed autograder test. You may delete this cell.

9.2.4. Sample Variance and Standard Deviation#

While mean reflects the average of a sample, variance and standard deviation measure the spread.

Let \(X_1\), …, \(X_n\) be a sample. The sample variance is

\[s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\]

Class Activity

If the measured quantity \(X\) is velocity with units m/s, then what are the units of the variance \(s^2\)?

Activity Answer:

Let’s dig into the idea that variance measures the spread of the data set. Consider a synthetic (i.e., made up, artificial) data set with two columns, A and B:

my_data = pd.DataFrame({"A":[0, 0, 5, 10, 10], "B":[4, 4, 5, 6, 6]})
my_data.head()
    A  B
0   0  4
1   0  4
2   5  5
3  10  6
4  10  6

Both columns have the same mean (average):

my_data.mean()
A    5.0
B    5.0
dtype: float64

Optional Home Activity

Please take a minute to verify (on paper or in your head) that both data sets do in fact have a mean of 5.

Cool, pandas just calculated the means for both columns in our synthetic data set.

What about the variance of both columns? The variance formula,

\[s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\]

sums the squared difference between each datum (a.k.a. data point) and the mean. Thus we expect a data set with a larger spread to have a larger variance.

my_data.var()
A    25.0
B     1.0
dtype: float64

And this is in fact what we see with the variance calculation. Column A, with range 0 to 10, has a much larger variance than column B, which has a range of 4 to 6.

Often we prefer to work with the sample standard deviation, which is the square root of the sample variance:

\[s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}\]

Notice that \(s\) has the same units as \(X\).

my_data.std()
A    5.0
B    1.0
dtype: float64

But why do the standard deviation and variance formulas divide by \(n-1\) and not \(n\)? Short answer: we rarely know the population mean or standard deviation. (Recall, the sample was drawn from a population.) So we use \(s\) as an estimate for the population standard deviation, and we must plug the sample mean \(\bar{X}\), itself an estimate, into the formula. Estimating that one parameter from the same data consumes one degree of freedom, which is why we divide by \(n-1\) instead of \(n\). We will revisit this idea a few times during the semester. A common exercise in a graduate-level statistics course is to prove that dividing by \(n-1\) makes \(s^2\) an unbiased estimate of the population variance. Still curious? Check out this video: https://www.youtube.com/watch?v=D1hgiAla3KI
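
To make the \(n-1\) formula concrete, here is a minimal sketch that applies it by hand to column A of our synthetic data set and compares the result to pandas:

# Sample variance of column A straight from the formula
A = my_data["A"]             # values: 0, 0, 5, 10, 10
deviations = A - A.mean()    # -5, -5, 0, 5, 5
s2 = (deviations ** 2).sum() / (len(A) - 1)
print(s2)                    # 25.0, matching my_data.var()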

If we really wanted to, we could change the delta degrees of freedom (ddof) from the default of 1 to any number:

# divide by n - 1 in the variance and standard deviation formulae. This is the default
print("variance\n",my_data.var(ddof=1),"\n")
print("standard deviation\n",my_data.std(ddof=1))
variance
 A    25.0
B     1.0
dtype: float64 

standard deviation
 A    5.0
B    1.0
dtype: float64
# divide by n in the variance and standard deviation formulae.
print("variance\n",my_data.var(ddof=0),"\n")
print("standard deviation\n",my_data.std(ddof=0))
variance
 A    20.0
B     0.8
dtype: float64 

standard deviation
 A    4.472136
B    0.894427
dtype: float64
# divide by n - 2 in the variance and standard deviation formulae.
print("variance\n",my_data.var(ddof=2),"\n")
print("standard deviation\n",my_data.std(ddof=2))
variance
 A    33.333333
B     1.333333
dtype: float64 

standard deviation
 A    5.773503
B    1.154701
dtype: float64

Like with the mean, we can easily extract a floating point number from pandas.

my_data["A"].std()
5.0
my_data.A.std()
5.0

Optional Home Activity

Calculate the standard deviation (using \(n-1\)) for the particulate matter example. Store the results in the dictionary ans_14b_2 with keys “low” and “high”. Hint: You’ll need to calculate the standard deviation as a float and then save the answers into the dictionary.

# Add your solution here
# Removed autograder test. You may delete this cell.

Optional Home Activity

Show the following formulae are equivalent to the definitions for sample variance and sample standard deviation given above. This is excellent practice for the next exam.

\[s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n \bar{X}^2 \right)\]
\[s = \sqrt{\frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n \bar{X}^2 \right)}\]
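
The derivation is left to you, but here is a quick numeric sanity check (a sketch, not a proof) of the variance shortcut on column A:

# Numeric check of the shortcut formula on column A (not a proof!)
A = my_data["A"]
n = len(A)
shortcut = ((A ** 2).sum() - n * A.mean() ** 2) / (n - 1)
print(shortcut)   # 25.0
print(A.var())    # 25.0, the same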

9.2.5. Sample Median#

The sample median also measures the center of a data set. To compute the median, we order the data from smallest to largest and find the middle. For a data set with an even number of elements, we average the two middle values.
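
For example, here is a short sketch for an even-length sample, where the median is the average of the two middle values:

# Median of an even-length sample by hand
data = sorted([7, 1, 5, 3])   # -> [1, 3, 5, 7]
mid = len(data) // 2
print((data[mid - 1] + data[mid]) / 2)   # (3 + 5) / 2 = 4.0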

As you likely expect, pandas computes the median:

my_data.median()
A    5.0
B    5.0
dtype: float64

Optional Home Activity

Calculate the median for the particulate matter example. Store the results in the dictionary ans_14b_3 with keys “low” and “high”. Hint: You’ll need to calculate the median as a float and then save the answers into the dictionary.

# Add your solution here
# Removed autograder test. You may delete this cell.

9.2.6. Sample Mode and Range#

The sample mode is the most common element in a sample.

Let’s review samples A and B:

my_data.head()
    A  B
0   0  4
1   0  4
2   5  5
3  10  6
4  10  6

Recall, these two data sets have the same mean and median but different variances. Let’s compute the mode:

my_data.mode()
    A  B
0   0  4
1  10  6

Interesting. In data set A, there are two 0 elements and two 10 elements. Thus 0 and 10 are tied for the mode, and pandas returns both values. Likewise, 4 and 6 are tied for the mode in data set B.

Let’s look at the particulate matter example together.

low.PM.mode()
0    1.11
1    1.63
Name: PM, dtype: float64

This suggests that both 1.11 and 1.63 are tied for the mode. Notice that .mode() returns a pandas Series, which we can index like a dictionary. We can access the first mode with the index (key) 0:

low.PM.mode()[0]
1.11

And the second mode with index (key) 1:

low.PM.mode()[1]
1.63

We can investigate further using the .value_counts() command:

low.PM.value_counts()
1.63    3
1.11    3
2.67    2
2.96    2
5.30    2
       ..
0.55    1
3.67    1
1.14    1
1.37    1
3.90    1
Name: PM, Length: 124, dtype: int64

This gives us the number of times each value repeats in the data set, sorted from most to least common.

Optional Home Activity

Calculate the mode for the high elevation particulate matter example. Store the results in the float ans_14b_4. Hint: You’ll need to either extract the float from the pandas output with key 0 or manually enter your answer.

# Add your solution here
# Removed autograder test. You may delete this cell.

The sample range is the difference between the smallest and largest values in a sample. Pandas allows us to easily calculate both:

# Inspect the data set
my_data.head()
    A  B
0   0  4
1   0  4
2   5  5
3  10  6
4  10  6
# identify smallest values
my_data.min()
A    0
B    4
dtype: int64
# identify largest values
my_data.max()
A    10
B     6
dtype: int64
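
The range itself is just the difference of these two results:

# sample range = largest value minus smallest value
my_data.max() - my_data.min()
A    10
B     2
dtype: int64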

9.2.7. Quartiles and Percentiles#

The median is the 50%-ile (50th percentile) of the data set: half of the elements are below it and half are above. We can generalize this to any value between 0% and 100%. You'll notice that the pandas documentation says quantile instead of percentile; these are the same idea, with quantiles written as fractions (the 0.5 quantile is the 50%-ile).

# Calculate 50% quantile, a.k.a, 50%-ile
low.PM.quantile(0.5)
3.18
# Verify this is the median
low.PM.median()
3.18

Quartiles divide the data set into quarters (four pieces). Quartiles are specific percentiles:

Quartile    Percentile / Quantile
1st         25%
2nd         50%
3rd         75%
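
Pandas can also compute all three quartiles in a single call by passing a list of quantiles:

# Compute the three quartiles of the low-altitude data at once
low.PM.quantile([0.25, 0.5, 0.75])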

Let’s look at the low altitude particulate matter example again:

# Determine the number of entries
low.PM.count()
138

It contains 138 data points. So what happens if we want to compute the 52.1%-ile?

Pandas will give us a number:

low.PM.quantile(0.521)
3.3475400000000004

Notice the answer ends in …000004. What is going on?

Because there are only 138 elements, the quantiles increment by 0.724…%

1/138
0.007246376811594203

Pandas is doing interpolation under the hood because there is not a datum exactly at the 52.1%-ile. The …000004 ending is an artifact of inexact floating point arithmetic.
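
Here is a minimal sketch of that interpolation, assuming pandas' default 'linear' method: locate the fractional position \(0.521 \times (n-1)\) in the sorted data and interpolate between the two neighboring values.

import numpy as np

# Reproduce low.PM.quantile(0.521) by hand (default 'linear' interpolation)
x = np.sort(low.PM.to_numpy())
pos = 0.521 * (len(x) - 1)    # fractional index into the sorted data
i = int(np.floor(pos))
frac = pos - i
print(x[i] + frac * (x[i + 1] - x[i]))   # matches low.PM.quantile(0.521)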

Optional Home Activity

Calculate the 80%-ile for the high elevation particulate matter example. Store the results in the float ans_14b_5.

# Add your solution here
# Removed autograder test. You may delete this cell.

9.2.8. Multivariate Data Example: Exam 1 Scores#

The file Exam_1_scores.csv contains anonymized numeric scores from Exam 1 this semester. Let's open the file with Pandas.

# Open the file
exam1_full = pd.read_csv("https://raw.githubusercontent.com/ndcbe/data-and-computing/main/notebooks/data/Exam_1_scores.csv")
# Print entire dataframe
print(exam1_full)
    Total Score   1-A   1-B  2-A-1  2-A-2  2-B  2-C-1  2-C-2  2-C-3  3-A  \
0         60.00  4.00  4.00   2.00   2.00  3.0   2.00   2.00    3.0  3.0   
1         59.75  4.00  4.00   2.00   2.00  3.0   2.00   2.00    3.0  3.0   
2         59.75  4.00  4.00   2.00   2.00  3.0   2.00   2.00    3.0  3.0   
3         59.75  4.00  4.00   2.00   2.00  3.0   2.00   2.00    3.0  3.0   
4         59.25  4.00  4.00   2.00   2.00  3.0   2.00   2.00    3.0  3.0   
..          ...   ...   ...    ...    ...  ...    ...    ...    ...  ...   
61        42.25  2.25  1.00   2.00   2.00  3.0   2.00   2.00    1.0  3.0   
62        39.00  2.25  1.00   1.25   1.25  3.0   1.25   0.75    1.0  3.0   
63        22.00  2.25  0.25   0.25   0.25  0.0   2.00   2.00    1.0  3.0   
64        21.75  4.00  0.00   0.25   0.25  0.0   1.50   2.00    3.0  3.0   
65        15.50  0.25  1.00   0.25   0.25  0.0   0.50   0.75    0.0  3.0   

    3-B-1  3-B-2  3-B-3  4-A-1  4-A-2   4-B  4-C-1  4-C-2   5-A   5-B  
0    4.00   1.00   2.00    2.0   2.00  3.00    4.0   3.00  7.00  7.00  
1    4.00   1.00   2.00    2.0   1.75  3.00    4.0   3.00  7.00  7.00  
2    4.00   1.00   2.00    2.0   1.75  3.00    4.0   3.00  7.00  7.00  
3    4.00   1.00   2.00    2.0   1.75  3.00    4.0   3.00  7.00  7.00  
4    4.00   1.00   2.00    2.0   1.50  3.00    4.0   3.00  6.75  7.00  
..    ...    ...    ...    ...    ...   ...    ...    ...   ...   ...  
61   4.00   1.00   2.00    2.0   1.75  3.00    1.5   0.00  7.00  1.75  
62   4.00   1.00   2.00    1.0   0.00  0.25    4.0   2.75  5.50  3.75  
63   0.75   1.00   1.00    2.0   1.25  1.00    1.5   0.00  1.00  1.50  
64   0.75   1.00   0.00    0.0   1.00  0.00    0.0   0.00  5.00  0.00  
65   0.25   0.25   0.25    2.0   1.25  3.00    1.0   0.00  0.00  1.50  

[66 rows x 20 columns]

We see 66 rows (students) and 20 columns. A perfect score was 60 points. Let's loop over the column names:

for c in exam1_full.columns:
    print(c)
Total Score
1-A
1-B
2-A-1
2-A-2
2-B
2-C-1
2-C-2
2-C-3
3-A
3-B-1
3-B-2
3-B-3
4-A-1
4-A-2
4-B
4-C-1
4-C-2
5-A
5-B

Let’s make a new Pandas dataframe with only the problem totals.

# create empty dictionary
new_data = {}
new_data['Total'] = exam1_full['Total Score']
new_data['P1'] = exam1_full['1-A'] + exam1_full['1-B']
new_data['P2'] = exam1_full['2-A-1'] + exam1_full['2-A-2'] + exam1_full['2-B']
new_data['P2'] += exam1_full['2-C-1'] + exam1_full['2-C-2'] + exam1_full['2-C-3']
new_data['P3'] = exam1_full['3-A'] + exam1_full['3-B-1'] + exam1_full['3-B-2'] + exam1_full['3-B-3']
new_data['P4'] = exam1_full['4-A-1'] + exam1_full['4-A-2'] + exam1_full['4-B']
new_data['P4'] += exam1_full['4-C-1'] + exam1_full['4-C-2']
new_data['P5'] = exam1_full['5-A'] + exam1_full['5-B']
exam1 = pd.DataFrame(new_data)

Class Activity

Print the first and last five elements of the data set.

# print top ("head") of data frame
# Add your solution here
# print bottom ("tail") of data frame
# Add your solution here

Class Activity

Compute the mean, median, mode, standard deviation, and quartiles using pandas.

# mean
# Add your solution here
# median
# Add your solution here
# mode
# Add your solution here
# standard deviation
# Add your solution here
# 25%-ile
# Add your solution here
# 50%-ile
# Add your solution here
# 75%-ile
# Add your solution here

Pandas offers a single function to compute these summary statistics.

# all in one line
exam1.describe()
           Total         P1         P2         P3         P4         P5
count  66.000000  66.000000  66.000000  66.000000  66.000000  66.000000
mean   51.261364   5.734848  12.875000   9.435606  14.924242  11.083333
std     8.487823   1.839146   2.156051   1.281008   3.065932   3.025956
min    15.500000   1.250000   1.750000   3.750000   1.000000   1.500000
25%    49.312500   5.000000  12.312500   9.500000  13.562500   9.937500
50%    52.500000   5.000000  14.000000  10.000000  16.500000  12.250000
75%    56.500000   8.000000  14.000000  10.000000  17.000000  12.937500
max    60.000000   8.000000  14.000000  10.000000  17.000000  14.000000

9.2.9. Sample Covariance#

Variance measures the average squared distance between each element of a data set and the mean, and standard deviation is its square root. Both describe a single variable, i.e., one dimension.

Often, we want to know to what extent two variables are related. Let \(X_1\), \(X_2\), …, \(X_n\) and \(Y_1\), \(Y_2\), …, \(Y_n\) be the sample. Each pair (\(X_i\), \(Y_i\)) corresponds to one experiment. For example, \(X\) could be the effluent temperature and \(Y\) could be the conversion for an adiabatic reactor. The experiment was repeated for \(n\) trials.

The sample covariance is a generalization of variance from one to two dimensions:

\[s_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})\]

\(\bar{X}\) and \(\bar{Y}\) are the sample means for variables \(X\) and \(Y\).

If \(X_i\) and \(Y_i\) tend to move together, i.e., both are either above (\(Y_i > \bar{Y}\) when \(X_{i} > \bar{X}\)) or below (\(Y_i < \bar{Y}\) when \(X_{i} < \bar{X}\)) their sample means, then \(s_{X,Y} \gg 0\).

If they move in opposite directions, i.e., \(Y_i > \bar{Y}\) when \(X_{i} < \bar{X}\) and \(Y_i < \bar{Y}\) when \(X_{i} > \bar{X}\), then \(s_{X,Y} \ll 0\).

If there is not a strong trend, then \(s_{X,Y} \approx 0\).
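
Before calling pandas, here is a minimal sketch that applies this formula directly to two of the exam problem columns (P1 and P2) and compares the result to pandas:

# Sample covariance between P1 and P2, straight from the formula
X = exam1["P1"]
Y = exam1["P2"]
cov_manual = ((X - X.mean()) * (Y - Y.mean())).sum() / (len(X) - 1)
print(cov_manual)   # should match the P1/P2 entry of exam1.cov() below
print(X.cov(Y))     # pandas' pairwise covariance, for comparison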

We can quickly compute covariance with Pandas.

exam1.cov()
           Total         P1         P2        P3         P4         P5
Total  72.043138  10.143444  16.003365  7.843051  19.822028  20.955769
P1     10.143444   3.382459   1.928846  0.750932   2.007488   2.357051
P2     16.003365   1.928846   4.648558  1.763942   3.881731   4.382692
P3      7.843051   0.750932   1.763942  1.640982   2.086393   1.952564
P4     19.822028   2.007488   3.881731  2.086393   9.399942   3.725641
P5     20.955769   2.357051   4.382692  1.952564   3.725641   9.156410

Class Activity

What are the units of sample covariance?

Activity Answer:

Often, it is more convenient to interpret the sample correlation, which is the sample covariance scaled by the standard deviation of each variable.

\[ r_{X,Y} = \frac{s_{X,Y}}{s_X \cdot s_Y} \]

Recall, \(s_{X,Y}\) is the covariance. \(s_{X}\) and \(s_{Y}\) are the standard deviations for variables \(X\) and \(Y\).
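
As a quick check, here is a sketch that computes the correlation between P1 and P2 straight from this definition:

# Correlation from covariance and standard deviations
X = exam1["P1"]
Y = exam1["P2"]
print(X.cov(Y) / (X.std() * Y.std()))   # should match exam1.corr().loc["P1", "P2"]
print(X.corr(Y))                        # pandas' pairwise correlation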

By construction, correlation is bounded between -1 and 1. During the remainder of the semester, we will see why the following rules hold:

\(r_{X,Y} = -1.0\): Perfect negative correlation. Samples \(X_i\) and \(Y_i\) lie exactly on a straight line with a negative slope.

\(-1.0 < r_{X,Y} < 0\): Negative correlation. If we fit a line to the data \(X_i\) and \(Y_i\), the slope would be negative.

\(r_{X,Y} = 0\): No correlation. If we fit a line to the data \(X_i\) and \(Y_i\), the slope would be zero.

\(0 < r_{X,Y} < 1\): Positive correlation. If we fit a line to the data \(X_i\) and \(Y_i\), the slope would be positive.

\(r_{X,Y} = 1.0\): Perfect positive correlation. Samples \(X_i\) and \(Y_i\) lie exactly on a straight line with a positive slope.

Class Activity

What are the units of sample correlation?

Activity Answer:

Let’s look at the correlation for the exam1 data.

exam1.corr()
          Total        P1        P2        P3        P4        P5
Total  1.000000  0.649790  0.874492  0.721335  0.761709  0.815915
P1     0.649790  1.000000  0.486432  0.318737  0.356020  0.423536
P2     0.874492  0.486432  1.000000  0.638665  0.587224  0.671768
P3     0.721335  0.318737  0.638665  1.000000  0.531229  0.503722
P4     0.761709  0.356020  0.587224  0.531229  1.000000  0.401583
P5     0.815915  0.423536  0.671768  0.503722  0.401583  1.000000

Class Activity

Interpret the correlations for the exam data. Why is the diagonal exactly 1.0? What would a negative correlation mean?

What happens if we first normalize the scores by the number of available points?

# Make a copy of the DataFrame
exam1_norm = exam1.copy()

# Divide by the total number of available points
exam1_norm["P1"] = exam1_norm["P1"] / 8
exam1_norm["P2"] = exam1_norm["P2"] / 14
exam1_norm["P3"] = exam1_norm["P3"] / 10
exam1_norm["P4"] = exam1_norm["P4"] / 17
exam1_norm["P5"] = exam1_norm["P5"] / 14
exam1_norm["Total"] = exam1_norm["Total"] / 60
# Compute covariance matrix
exam1_norm.cov()
          Total        P1        P2        P3        P4        P5
Total  0.020012  0.021132  0.019052  0.013072  0.019433  0.024947
P1     0.021132  0.052851  0.017222  0.009387  0.014761  0.021045
P2     0.019052  0.017222  0.023717  0.012600  0.016310  0.022361
P3     0.013072  0.009387  0.012600  0.016410  0.012273  0.013947
P4     0.019433  0.014761  0.016310  0.012273  0.032526  0.015654
P5     0.024947  0.021045  0.022361  0.013947  0.015654  0.046716
# Compute correlation matrix
exam1_norm.corr()
          Total        P1        P2        P3        P4        P5
Total  1.000000  0.649790  0.874492  0.721335  0.761709  0.815915
P1     0.649790  1.000000  0.486432  0.318737  0.356020  0.423536
P2     0.874492  0.486432  1.000000  0.638665  0.587224  0.671768
P3     0.721335  0.318737  0.638665  1.000000  0.531229  0.503722
P4     0.761709  0.356020  0.587224  0.531229  1.000000  0.401583
P5     0.815915  0.423536  0.671768  0.503722  0.401583  1.000000

Class Activity

Why is the correlation matrix not changed by scaling?

9.2.10. Summary Statistics for Categorical Variables#

Categorical variables are qualitative. Recall our example from earlier:

my_df.head()
   Specimen  Torque Failure Location
0         1     165             Weld
1         2     237             Beam
2         3     222             Beam
3         4     255             Beam
4         5     194             Weld

Here, Failure Location is a categorical variable. We cannot compute the mean, median, standard deviation, or range for a categorical variable. Instead, we often compute frequencies by counting.

my_df['Failure Location'].value_counts()
Beam    3
Weld    2
Name: Failure Location, dtype: int64

We can instead compute sample proportions by normalizing, i.e., dividing by the total number of samples.

my_df['Failure Location'].value_counts(normalize=True)
Beam    0.6
Weld    0.4
Name: Failure Location, dtype: float64
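
Note that the mode is still well defined for categorical data; it is simply the most frequent category:

# The most common failure location
my_df['Failure Location'].mode()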

9.2.11. Summary Statistics and Population Parameters#

Each sample statistic we have discussed (e.g., mean, median, standard deviation, covariance) has a counterpart for the population. We will call numerical summaries of samples statistics and numerical summaries of populations parameters. A central idea in data analysis is to use statistics to estimate/infer parameters. Often, we cannot directly measure a population. So instead we sample.