9.2. Summary Statistics#

Further Reading: §1.2 in Navidi (2015)

9.2.1. Learning Objectives#

After studying this notebook and your notes, completing the activities, and asking questions in class, you should be able to:

  • Compute descriptive statistics for data using Pandas (Python package)

  • Interpret the elements of a covariance matrix

  • Explain why some data distributions have a very different median and mean

9.2.2. Two Types of Data: Numerical and Categorical#

As engineers, you will encounter both numerical (quantitative) and categorical (qualitative) data.

Unfortunately, Example 1.8 from the textbook was not included in the data files on the publisher’s website. But do not fear! We can recreate the table from a dictionary.

import pandas as pd
# Store all of the data from Example 1.8 in a dictionary
# Notice the keys are the column names
my_dict = {"Specimen":[1, 2, 3, 4, 5], "Torque":[165, 237, 222, 255, 194],"Failure Location":['Weld','Beam','Beam','Beam','Weld']}

# Convert the dictionary into a Pandas dataframe
my_df = pd.DataFrame(my_dict)

# Print
print(my_df)

# Look at the first five entries
my_df.head()

# Profit???
   Specimen  Torque Failure Location
0         1     165             Weld
1         2     237             Beam
2         3     222             Beam
3         4     255             Beam
4         5     194             Weld
   Specimen  Torque Failure Location
0         1     165             Weld
1         2     237             Beam
2         3     222             Beam
3         4     255             Beam
4         5     194             Weld

Well that was easy.

In this example, we see that Torque is a numerical variable and Failure Location is a categorical variable.

We will now learn about statistics to summarize key characteristics of samples. Let’s start by focusing on numerical variables.

9.2.3. Sample Mean#

The sample mean is the average of the sample.

Let \(X_1\), \(X_2\), …, \(X_n\) be the sample. The sample mean is

\[\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\]

Statisticians have many quirks, which make them the butts of many jokes. I would like to draw your attention to two:

  1. A capital variable, such as \(X_i\), is often a random variable. More on this next class.

  2. Statisticians like to give variables decorations. Here \(\bar{~}\) (bar) means average. Later in the semester we’ll see that \(\hat{~}\) (hat) means estimate.
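
To connect the formula to code, here is a minimal sketch that computes the sample mean of the torque data from Example 1.8 by hand:

# Compute the sample mean by hand: sum the observations, divide by n
torque = [165, 237, 222, 255, 194]
xbar = sum(torque) / len(torque)
print(xbar)   # 214.6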

It is really easy to calculate the sample mean with pandas:

low = pd.read_csv('https://raw.githubusercontent.com/ndcbe/data-and-computing/main/notebooks/data/table1-1.csv')
high = pd.read_csv('https://raw.githubusercontent.com/ndcbe/data-and-computing/main/notebooks/data/table1-2.csv')
low.mean()
PM    3.714565
dtype: float64

That output looks strange.

type(low.mean())
pandas.core.series.Series

Interesting. The command low.mean() does not return a floating point number. It instead returns a variable of type pandas.core.series.Series. What if I just want the sample mean as a floating point number?

In pandas, we can access a column using its name. Recall here are the first five elements in the data set:

low.head()
     PM
0  1.50
1  0.87
2  1.12
3  1.25
4  3.46

This data set only has one column. But if we want to extract that column, we write:

low.PM
0      1.50
1      0.87
2      1.12
3      1.25
4      3.46
       ... 
133    4.63
134    2.80
135    2.16
136    2.97
137    3.90
Name: PM, Length: 138, dtype: float64

And to get the numeric mean, we simply write:

low.PM.mean()
3.714565217391306

Optional Home Activity

Calculate the sample mean for PM in the high altitude data set. Store your answer (a float) in the variable ans_14b_1.

# Add your solution here
# Removed autograder test. You may delete this cell.

9.2.4. Sample Variance and Standard Deviation#

While mean reflects the average of a sample, variance and standard deviation measure the spread.

Let \(X_1\), …, \(X_n\) be a sample. The sample variance is

\[s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\]

Class Activity

If the measured quantity \(X\) is velocity with units m/s, then what are the units of the variance \(s^2\)?

Activity Answer:

Let’s dig into the idea that variance measures the spread of the data set. Consider a synthetic (i.e., made up, artificial) data set with two columns, A and B:

my_data = pd.DataFrame({"A":[0, 0, 5, 10, 10], "B":[4, 4, 5, 6, 6]})
my_data.head()
    A  B
0   0  4
1   0  4
2   5  5
3  10  6
4  10  6

Both columns have the same mean (average):

my_data.mean()
A    5.0
B    5.0
dtype: float64

Optional Home Activity

Please take a minute to verify (on paper or in your head) that both data sets do in fact have a mean of 5.

Cool, pandas just calculated the means for both columns in our synthetic data set.

What about the variance of both columns? The variance formula,

\[s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\]

sums the squared difference between each datum (a.k.a. data point) and the mean. Thus we expect a data set with a larger spread to have a larger variance.

my_data.var()
A    25.0
B     1.0
dtype: float64

And this is in fact what we see with the variance calculation. Column A, with range 0 to 10, has a much larger variance than column B, which has a range of 4 to 6.

Often we prefer to work with the sample standard deviation, which is the square root of the sample variance:

\[s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}\]

Notice that \(s\) has the same units as \(X\).

my_data.std()
A    5.0
B    1.0
dtype: float64

But why do the standard deviation and variance formulas divide by \(n-1\) and not \(n\)? Short answer: we rarely know the population mean or standard deviation. (Recall, the sample was drawn from a population.) So we use \(s\) as an estimate for the population standard deviation, and we must plug the sample mean \(\bar{X}\), itself an estimate, into the formula. Estimating that one parameter from the same data consumes one degree of freedom, which is why we divide by \(n-1\) instead of \(n\). We will revisit this idea a few times during the semester. A common exercise in a graduate-level statistics course is to prove that dividing by \(n-1\) makes \(s^2\) an unbiased estimate of the population variance. Still curious? Check out this video: https://www.youtube.com/watch?v=D1hgiAla3KI
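
To make the \(n-1\) formula concrete, here is a minimal sketch that applies it by hand to column A of our synthetic data set and compares the result to pandas:

# Sample variance of column A straight from the formula
A = my_data["A"]             # values: 0, 0, 5, 10, 10
deviations = A - A.mean()    # -5, -5, 0, 5, 5
s2 = (deviations ** 2).sum() / (len(A) - 1)
print(s2)                    # 25.0, matching my_data.var()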

If we really wanted to, we could change the delta degrees of freedom (ddof) from the default of 1 to any number:

# divide by n - 1 in the variance and standard deviation formulae. This is the default
print("variance\n",my_data.var(ddof=1),"\n")
print("standard deviation\n",my_data.std(ddof=1))
variance
 A    25.0
B     1.0
dtype: float64 

standard deviation
 A    5.0
B    1.0
dtype: float64
# divide by n in the variance and standard deviation formulae.
print("variance\n",my_data.var(ddof=0),"\n")
print("standard deviation\n",my_data.std(ddof=0))
variance
 A    20.0
B     0.8
dtype: float64 

standard deviation
 A    4.472136
B    0.894427
dtype: float64
# divide by n - 2 in the variance and standard deviation formulae.
print("variance\n",my_data.var(ddof=2),"\n")
print("standard deviation\n",my_data.std(ddof=2))
variance
 A    33.333333
B     1.333333
dtype: float64 

standard deviation
 A    5.773503
B    1.154701
dtype: float64

Like with the mean, we can easily extract a floating point number from pandas.

my_data["A"].std()
5.0
my_data.A.std()
5.0

Optional Home Activity

Calculate the standard deviation (using \(n-1\)) for the particulate matter example. Store the results in the dictionary ans_14b_2 with keys “low” and “high”. Hint: You’ll need to calculate the standard deviation as a float and then save the answers into the dictionary.

# Add your solution here
# Removed autograder test. You may delete this cell.

Optional Home Activity

Show the following formulae are equivalent to the definitions for sample variance and sample standard deviation given above. This is excellent practice for the next exam.

\[s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n \bar{X}^2 \right)\]
\[s = \sqrt{\frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n \bar{X}^2 \right)}\]
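
The derivation is left to you, but here is a quick numeric sanity check (a sketch, not a proof) of the variance shortcut on column A:

# Numeric check of the shortcut formula on column A (not a proof!)
A = my_data["A"]
n = len(A)
shortcut = ((A ** 2).sum() - n * A.mean() ** 2) / (n - 1)
print(shortcut)   # 25.0
print(A.var())    # 25.0, the same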

9.2.5. Sample Median#

The sample median also measures the center of a data set. To compute the median, we order the data from smallest to largest and find the middle. For a data set with an even number of elements, we average the two middle values.
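
For example, here is a short sketch for an even-length sample, where the median is the average of the two middle values:

# Median of an even-length sample by hand
data = sorted([7, 1, 5, 3])   # -> [1, 3, 5, 7]
mid = len(data) // 2
print((data[mid - 1] + data[mid]) / 2)   # (3 + 5) / 2 = 4.0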

As you likely expect, pandas computes the median:

my_data.median()
A    5.0
B    5.0
dtype: float64

Optional Home Activity

Calculate the median for the particulate matter example. Store the results in the dictionary ans_14b_3 with keys “low” and “high”. Hint: You’ll need to calculate the median as a float and then save the answers into the dictionary.

# Add your solution here
# Removed autograder test. You may delete this cell.

9.2.6. Sample Mode and Range#

The sample mode is the most common element in a sample.

Let’s review samples A and B:

my_data.head()
    A  B
0   0  4
1   0  4
2   5  5
3  10  6
4  10  6

Recall, these two data sets have the same mean and median but different variances. Let’s compute the mode:

my_data.mode()
    A  B
0   0  4
1  10  6

Interesting. In data set A, there are two 0 elements and two 10 elements. Thus 0 and 10 are tied for the mode, and pandas returns both values. Likewise, 4 and 6 are tied for the mode in data set B.

Let’s look at the particulate matter example together.

low.PM.mode()
0    1.11
1    1.63
Name: PM, dtype: float64

This suggests that both 1.11 and 1.63 are tied for the mode. Notice that .mode() returns a pandas Series, which we can index like a dictionary. We can access the first mode with the index (key) 0:

low.PM.mode()[0]
1.11

And the second mode with index (key) 1:

low.PM.mode()[1]
1.63

We can investigate further using the .value_counts() command:

low.PM.value_counts()
1.63    3
1.11    3
2.67    2
2.96    2
5.30    2
       ..
0.55    1
3.67    1
1.14    1
1.37    1
3.90    1
Name: PM, Length: 124, dtype: int64

This gives us the number of times each value repeats in the data set, sorted from most to least common.

Optional Home Activity

Calculate the mode for the high elevation particulate matter example. Store the results in the float ans_14b_4. Hint: You’ll need to either extract the float from the pandas output with key 0 or manually enter your answer.

# Add your solution here
# Removed autograder test. You may delete this cell.

The sample range is the difference between the smallest and largest values in a sample. Pandas allows us to easily calculate both:

# Inspect the data set
my_data.head()
    A  B
0   0  4
1   0  4
2   5  5
3  10  6
4  10  6
# identify smallest values
my_data.min()
A    0
B    4
dtype: int64
# identify largest values
my_data.max()
A    10
B     6
dtype: int64
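
The range itself is just the difference of these two results:

# sample range = largest value minus smallest value
my_data.max() - my_data.min()
A    10
B     2
dtype: int64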

9.2.7. Quartiles and Percentiles#

The median is the 50%-ile (50th percentile) of the data set: half of the elements are below it and half are above. We can generalize this to any value between 0% and 100%. You'll notice that the pandas documentation says quantile instead of percentile; these are the same idea, with quantiles written as fractions (the 0.5 quantile is the 50%-ile).

# Calculate 50% quantile, a.k.a, 50%-ile
low.PM.quantile(0.5)
3.18
# Verify this is the median
low.PM.median()
3.18

Quartiles divide the data set into quarters (four pieces). Quartiles are specific percentiles:

Quartile    Percentile / Quantile
1st         25%
2nd         50%
3rd         75%
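
Pandas can also compute all three quartiles in a single call by passing a list of quantiles:

# Compute the three quartiles of the low-altitude data at once
low.PM.quantile([0.25, 0.5, 0.75])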

Let’s look at the low altitude particulate matter example again:

# Determine the number of entries
low.PM.count()
138

It contains 138 data points. So what happens if we want to compute the 52.1%-ile?

Pandas will give us a number:

low.PM.quantile(0.521)
3.3475400000000004

Notice the answer ends in …000004. What is going on?

Because there are only 138 elements, the quantiles increment by 0.724…%

1/138
0.007246376811594203

Pandas is doing interpolation under the hood because there is not a datum exactly at the 52.1%-ile. The …000004 ending is an artifact of inexact floating point arithmetic.
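
Here is a minimal sketch of that interpolation, assuming pandas' default 'linear' method: locate the fractional position \(0.521 \times (n-1)\) in the sorted data and interpolate between the two neighboring values.

import numpy as np

# Reproduce low.PM.quantile(0.521) by hand (default 'linear' interpolation)
x = np.sort(low.PM.to_numpy())
pos = 0.521 * (len(x) - 1)    # fractional index into the sorted data
i = int(np.floor(pos))
frac = pos - i
print(x[i] + frac * (x[i + 1] - x[i]))   # matches low.PM.quantile(0.521)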

Optional Home Activity

Calculate the 80%-ile for the high elevation particulate matter example. Store the results in the float ans_14b_5.

# Add your solution here
# Removed autograder test. You may delete this cell.

9.2.8. Multivariate Data Example: Exam 1 Scores#

The file Exam_1_scores.csv contains anonymized numeric scores from Exam 1 this semester. Let's open the file with Pandas.

# Open the file
exam1_full = pd.read_csv("https://raw.githubusercontent.com/ndcbe/data-and-computing/main/notebooks/data/Exam_1_scores.csv")
# Print entire dataframe
print(exam1_full)
    Total Score   1-A   1-B  2-A-1  2-A-2  2-B  2-C-1  2-C-2  2-C-3  3-A  \
0         60.00  4.00  4.00   2.00   2.00  3.0   2.00   2.00    3.0  3.0   
1         59.75  4.00  4.00   2.00   2.00  3.0   2.00   2.00    3.0  3.0   
2         59.75  4.00  4.00   2.00   2.00  3.0   2.00   2.00    3.0  3.0   
3         59.75  4.00  4.00   2.00   2.00  3.0   2.00   2.00    3.0  3.0   
4         59.25  4.00  4.00   2.00   2.00  3.0   2.00   2.00    3.0  3.0   
..          ...   ...   ...    ...    ...  ...    ...    ...    ...  ...   
61        42.25  2.25  1.00   2.00   2.00  3.0   2.00   2.00    1.0  3.0   
62        39.00  2.25  1.00   1.25   1.25  3.0   1.25   0.75    1.0  3.0   
63        22.00  2.25  0.25   0.25   0.25  0.0   2.00   2.00    1.0  3.0   
64        21.75  4.00  0.00   0.25   0.25  0.0   1.50   2.00    3.0  3.0   
65        15.50  0.25  1.00   0.25   0.25  0.0   0.50   0.75    0.0  3.0   

    3-B-1  3-B-2  3-B-3  4-A-1  4-A-2   4-B  4-C-1  4-C-2   5-A   5-B  
0    4.00   1.00   2.00    2.0   2.00  3.00    4.0   3.00  7.00  7.00  
1    4.00   1.00   2.00    2.0   1.75  3.00    4.0   3.00  7.00  7.00  
2    4.00   1.00   2.00    2.0   1.75  3.00    4.0   3.00  7.00  7.00  
3    4.00   1.00   2.00    2.0   1.75  3.00    4.0   3.00  7.00  7.00  
4    4.00   1.00   2.00    2.0   1.50  3.00    4.0   3.00  6.75  7.00  
..    ...    ...    ...    ...    ...   ...    ...    ...   ...   ...  
61   4.00   1.00   2.00    2.0   1.75  3.00    1.5   0.00  7.00  1.75  
62   4.00   1.00   2.00    1.0   0.00  0.25    4.0   2.75  5.50  3.75  
63   0.75   1.00   1.00    2.0   1.25  1.00    1.5   0.00  1.00  1.50  
64   0.75   1.00   0.00    0.0   1.00  0.00    0.0   0.00  5.00  0.00  
65   0.25   0.25   0.25    2.0   1.25  3.00    1.0   0.00  0.00  1.50  

[66 rows x 20 columns]

We see 66 rows (students) and 20 columns. A perfect score was 60 points. Let's loop over the column names:

for c in exam1_full.columns:
    print(c)
Total Score
1-A
1-B
2-A-1
2-A-2
2-B
2-C-1
2-C-2
2-C-3
3-A
3-B-1
3-B-2
3-B-3
4-A-1
4-A-2
4-B
4-C-1
4-C-2
5-A
5-B

Let’s make a new Pandas dataframe with only the problem totals.

# create empty dictionary
new_data = {}
new_data['Total'] = exam1_full['Total Score']
new_data['P1'] = exam1_full['1-A'] + exam1_full['1-B']
new_data['P2'] = exam1_full['2-A-1'] + exam1_full['2-A-2'] + exam1_full['2-B']
new_data['P2'] += exam1_full['2-C-1'] + exam1_full['2-C-2'] + exam1_full['2-C-3']
new_data['P3'] = exam1_full['3-A'] + exam1_full['3-B-1'] + exam1_full['3-B-2'] + exam1_full['3-B-3']
new_data['P4'] = exam1_full['4-A-1'] + exam1_full['4-A-2'] + exam1_full['4-B']
new_data['P4'] += exam1_full['4-C-1'] + exam1_full['4-C-2']
new_data['P5'] = exam1_full['5-A'] + exam1_full['5-B']
exam1 = pd.DataFrame(new_data)

Class Activity

Print the first and last five elements of the data set.

# print top ("head") of data frame
# Add your solution here
# print bottom ("tail") of data frame
# Add your solution here

Class Activity

Compute the mean, median, mode, standard deviation, and quartiles using pandas.

# mean
# Add your solution here
# median
# Add your solution here
# mode
# Add your solution here
# standard deviation
# Add your solution here
# 25%-ile
# Add your solution here
# 50%-ile
# Add your solution here
# 75%-ile
# Add your solution here

Pandas offers a single function to compute these summary statistics.

# all in one line
exam1.describe()
           Total         P1         P2         P3         P4         P5
count  66.000000  66.000000  66.000000  66.000000  66.000000  66.000000
mean   51.261364   5.734848  12.875000   9.435606  14.924242  11.083333
std     8.487823   1.839146   2.156051   1.281008   3.065932   3.025956
min    15.500000   1.250000   1.750000   3.750000   1.000000   1.500000
25%    49.312500   5.000000  12.312500   9.500000  13.562500   9.937500
50%    52.500000   5.000000  14.000000  10.000000  16.500000  12.250000
75%    56.500000   8.000000  14.000000  10.000000  17.000000  12.937500
max    60.000000   8.000000  14.000000  10.000000  17.000000  14.000000

9.2.9. Sample Covariance#

Variance measures the average squared distance between each element of a data set and the mean, and standard deviation is its square root. Both describe a single variable, i.e., one dimension.

Often, we want to know to what extent two variables are related. Let \(X_1\), \(X_2\), …, \(X_n\) and \(Y_1\), \(Y_2\), …, \(Y_n\) be the sample. Each pair (\(X_i\), \(Y_i\)) corresponds to one experiment. For example, \(X\) could be the effluent temperature and \(Y\) could be the conversion for an adiabatic reactor. The experiment was repeated for \(n\) trials.

The sample covariance is a generalization of variance from one to two dimensions:

\[s_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})\]

\(\bar{X}\) and \(\bar{Y}\) are the sample means for variables \(X\) and \(Y\).

If \(X_i\) and \(Y_i\) tend to move together, i.e., both are either above (\(Y_i > \bar{Y}\) when \(X_{i} > \bar{X}\)) or below (\(Y_i < \bar{Y}\) when \(X_{i} < \bar{X}\)) their sample means, then \(s_{X,Y} \gg 0\).

If they move in opposite directions, i.e., \(Y_i > \bar{Y}\) when \(X_{i} < \bar{X}\) and \(Y_i < \bar{Y}\) when \(X_{i} > \bar{X}\), then \(s_{X,Y} \ll 0\).

If there is not a strong trend, then \(s_{X,Y} \approx 0\).
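
Before calling pandas, here is a minimal sketch that applies this formula directly to two of the exam problem columns (P1 and P2) and compares the result to pandas:

# Sample covariance between P1 and P2, straight from the formula
X = exam1["P1"]
Y = exam1["P2"]
cov_manual = ((X - X.mean()) * (Y - Y.mean())).sum() / (len(X) - 1)
print(cov_manual)   # should match the P1/P2 entry of exam1.cov() below
print(X.cov(Y))     # pandas' pairwise covariance, for comparison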

We can quickly compute covariance with Pandas.

exam1.cov()
           Total         P1         P2        P3         P4         P5
Total  72.043138  10.143444  16.003365  7.843051  19.822028  20.955769
P1     10.143444   3.382459   1.928846  0.750932   2.007488   2.357051
P2     16.003365   1.928846   4.648558  1.763942   3.881731   4.382692
P3      7.843051   0.750932   1.763942  1.640982   2.086393   1.952564
P4     19.822028   2.007488   3.881731  2.086393   9.399942   3.725641
P5     20.955769   2.357051   4.382692  1.952564   3.725641   9.156410

Class Activity

What are the units of sample covariance?

Activity Answer:

Often, it is more convenient to interpret the sample correlation, which is the sample covariance scaled by the standard deviation of each variable.

\[ r_{X,Y} = \frac{s_{X,Y}}{s_X \cdot s_Y} \]

Recall, \(s_{X,Y}\) is the covariance. \(s_{X}\) and \(s_{Y}\) are the standard deviations for variables \(X\) and \(Y\).
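
As a quick check, here is a sketch that computes the correlation between P1 and P2 straight from this definition:

# Correlation from covariance and standard deviations
X = exam1["P1"]
Y = exam1["P2"]
print(X.cov(Y) / (X.std() * Y.std()))   # should match exam1.corr().loc["P1", "P2"]
print(X.corr(Y))                        # pandas' pairwise correlation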

By construction, correlation is bounded between -1 and 1. During the remainder of the semester, we will see why the following rules hold:

\(r_{X,Y} = -1.0\): Perfect negative correlation. Samples \(X_i\) and \(Y_i\) lie exactly on a straight line with a negative slope.

\(-1.0 < r_{X,Y} < 0\): Negative correlation. If we fit a line to the data \(X_i\) and \(Y_i\), the slope would be negative.

\(r_{X,Y} = 0\): No correlation. If we fit a line to the data \(X_i\) and \(Y_i\), the slope would be zero.

\(0 < r_{X,Y} < 1\): Positive correlation. If we fit a line to the data \(X_i\) and \(Y_i\), the slope would be positive.

\(r_{X,Y} = 1.0\): Perfect positive correlation. Samples \(X_i\) and \(Y_i\) lie exactly on a straight line with a positive slope.

Class Activity

What are the units of sample correlation?

Activity Answer:

Let’s look at the correlation for the exam1 data.

exam1.corr()
          Total        P1        P2        P3        P4        P5
Total  1.000000  0.649790  0.874492  0.721335  0.761709  0.815915
P1     0.649790  1.000000  0.486432  0.318737  0.356020  0.423536
P2     0.874492  0.486432  1.000000  0.638665  0.587224  0.671768
P3     0.721335  0.318737  0.638665  1.000000  0.531229  0.503722
P4     0.761709  0.356020  0.587224  0.531229  1.000000  0.401583
P5     0.815915  0.423536  0.671768  0.503722  0.401583  1.000000

Class Activity

Interpret the correlations for the exam data. Why is the diagonal exactly 1.0? What would a negative correlation mean?

What happens if we first normalize the scores by the number of available points?

# Make a copy of the DataFrame
exam1_norm = exam1.copy()

# Divide by the total number of available points
exam1_norm["P1"] = exam1_norm["P1"] / 8
exam1_norm["P2"] = exam1_norm["P2"] / 14
exam1_norm["P3"] = exam1_norm["P3"] / 10
exam1_norm["P4"] = exam1_norm["P4"] / 17
exam1_norm["P5"] = exam1_norm["P5"] / 14
exam1_norm["Total"] = exam1_norm["Total"] / 60
# Compute covariance matrix
exam1_norm.cov()
          Total        P1        P2        P3        P4        P5
Total  0.020012  0.021132  0.019052  0.013072  0.019433  0.024947
P1     0.021132  0.052851  0.017222  0.009387  0.014761  0.021045
P2     0.019052  0.017222  0.023717  0.012600  0.016310  0.022361
P3     0.013072  0.009387  0.012600  0.016410  0.012273  0.013947
P4     0.019433  0.014761  0.016310  0.012273  0.032526  0.015654
P5     0.024947  0.021045  0.022361  0.013947  0.015654  0.046716
# Compute correlation matrix
exam1_norm.corr()
          Total        P1        P2        P3        P4        P5
Total  1.000000  0.649790  0.874492  0.721335  0.761709  0.815915
P1     0.649790  1.000000  0.486432  0.318737  0.356020  0.423536
P2     0.874492  0.486432  1.000000  0.638665  0.587224  0.671768
P3     0.721335  0.318737  0.638665  1.000000  0.531229  0.503722
P4     0.761709  0.356020  0.587224  0.531229  1.000000  0.401583
P5     0.815915  0.423536  0.671768  0.503722  0.401583  1.000000

Class Activity

Why is the correlation matrix not changed by scaling?

9.2.10. Summary Statistics for Categorical Variables#

Categorical variables are qualitative. Recall our example from earlier:

my_df.head()
   Specimen  Torque Failure Location
0         1     165             Weld
1         2     237             Beam
2         3     222             Beam
3         4     255             Beam
4         5     194             Weld

Here, Failure Location is a categorical variable. We cannot compute the mean, median, standard deviation, or range for a categorical variable. Instead, we often compute frequencies by counting.

my_df['Failure Location'].value_counts()
Beam    3
Weld    2
Name: Failure Location, dtype: int64

We can instead compute sample proportions by normalizing, i.e., dividing by the total number of samples.

my_df['Failure Location'].value_counts(normalize=True)
Beam    0.6
Weld    0.4
Name: Failure Location, dtype: float64
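
Note that the mode is still well defined for categorical data; it is simply the most frequent category:

# The most common failure location
my_df['Failure Location'].mode()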

9.2.11. Summary Statistics and Population Parameters#

Each sample statistic we have discussed (e.g., mean, median, standard deviation, covariance) has a counterpart for the population. We will call numerical summaries of samples statistics and numerical summaries of populations parameters. A central idea in data analysis is to use statistics to estimate/infer parameters. Often, we cannot directly measure a population. So instead we sample.