# IMPORT DATA FILES USED BY THIS NOTEBOOK
import os, requests
file_links = [("data/Stock_Data.csv", "https://ndcbe.github.io/cbe-xx258/data/Stock_Data.csv"),
("data/table1-1.csv", "https://ndcbe.github.io/cbe-xx258/data/table1-1.csv"),
("data/table1-2.csv", "https://ndcbe.github.io/cbe-xx258/data/table1-2.csv")]
# This cell has been added by nbpages. Run this cell to download data files required for this notebook.
for filepath, fileurl in file_links:
    stem, filename = os.path.split(filepath)
    if stem:
        if not os.path.exists(stem):
            os.mkdir(stem)
    if not os.path.isfile(filepath):
        with open(filepath, 'wb') as f:
            response = requests.get(fileurl)
            f.write(response.content)
After studying this notebook, completing the activities, and asking questions in class, you should be able to:
On Sakai, you'll find Datasets-All-Examples-Navidi.zip. This file, which I downloaded from the publisher, contains all of the data for the examples and tables in our textbook. We'll use many of these datasets to illustrate key concepts in class.
Let's start with Tables 1.1 and 1.2 (pg. 21), which give particulate matter (PM) emissions in g/gal for 138 and 62 vehicles at low and high altitudes, respectively. Please take a moment to get out your textbook and glance at the tables.
Now let's load the data into Python. In this class, we will use Pandas, a very popular and easy-to-use library for organizing and manipulating data. The official "10 minutes to pandas" tutorial is a highly recommended way to get started.
# load the Pandas library, give nickname 'pd'
import pandas as pd
The code below reads in the first text file.
low = pd.read_csv('data/table1-1.csv')
This creates a Pandas dataframe, which is stored in the variable low. We can easily print its contents to the screen.
print(low)
len(low)
The first row (vehicle) is numbered 0, which is perhaps not a surprise. We see there are 138 rows in the dataset, which matches what we expect: data for 138 vehicles at low altitude.
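To see the row numbering and the row count on a dataset small enough to check by eye, here is a minimal sketch. The four PM values are made up for illustration and are not taken from the textbook tables; the CSV text is read from memory rather than from a file.

```python
import io
import pandas as pd

# Hypothetical miniature dataset: four made-up PM readings in g/gal
csv_text = "PM\n1.50\n2.25\n3.10\n4.05\n"

# pd.read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)         # one column, 'PM', with rows labeled 0 through 3
print(len(toy))    # number of rows (the header line is not counted)
```

The header line names the column, so it does not count as a row of data.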
The output above is ugly. We can use the .head() and .tail() methods to look at only the first and last five entries.
low.head()
low.tail()
# YOUR SOLUTION HERE
Our example so far has only one column of data, named PM. We can access this column two ways:
low['PM']
low.PM
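The two access styles return the same data, but they are not always interchangeable. The sketch below uses a made-up dataframe (the "fuel type" column and all values are hypothetical) to show that bracket notation works for any column name, while dot notation needs a name that is a valid Python identifier.

```python
import pandas as pd

# Hypothetical data: the column names and values are made up
toy = pd.DataFrame({"PM": [1.5, 2.25, 3.1],
                    "fuel type": ["gas", "gas", "diesel"]})

by_bracket = toy["PM"]   # works for any column name
by_attribute = toy.PM    # works only when the name is a valid Python identifier

print(by_bracket.equals(by_attribute))  # both return the same Series

# A name containing a space can only be reached with brackets
print(toy["fuel type"])
```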
Pandas also makes it extremely easy to compute summary statistics and perform exploratory data analysis.
low.PM.describe()
We will mathematically define the mean (a.k.a. average), standard deviation (std), minimum (min), maximum (max), and 25%-, 50%-, and 75%-ile (percentile) later this semester. The 50%-ile is also known as the median. Half of the observations are above the median and half are below.
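As a preview, the entries reported by .describe() can also be computed one at a time. The sketch below uses five made-up readings, chosen so the results are easy to verify by hand; they are not from the textbook.

```python
import pandas as pd

# Hypothetical readings, chosen so the statistics are easy to check by hand
pm = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0], name="PM")

summary = pm.describe()
print(summary)

# Each entry of describe() can also be computed directly
print(pm.mean())          # same as summary["mean"]
print(pm.median())        # same as summary["50%"]
print(pm.quantile(0.25))  # same as summary["25%"]
```

In this toy dataset the mean (4.0) is larger than the median (3.0) because the single large reading of 10.0 pulls the mean upward.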
# YOUR SOLUTION HERE
Together, Pandas and Matplotlib make it easy to quickly visualize a dataset. The code below creates a histogram.
# load the Matplotlib plotting library, give nickname 'plt'
import matplotlib.pyplot as plt
plt.hist(low.PM)
plt.xlabel("Particulate Matter (PM) Emissions in g/gal")
plt.ylabel("Count")
plt.title("Emissions at Low Altitude")
plt.grid(True)
plt.show()
Each bin of the histogram shows the count (number) of vehicles with emissions between the left and right bounds of the bin. For example, the third bin from the left shows that there are approximately 30 vehicles in the dataset with emissions between 2.2 and 3.8 g/gal.
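To make the idea of bin bounds and counts concrete, the sketch below computes them directly with NumPy instead of drawing them. The ten emissions values are made up for illustration, and five bins are used here to keep the output short (plt.hist uses ten bins by default).

```python
import numpy as np

# Hypothetical emissions values in g/gal (made up for illustration)
pm = np.array([0.5, 1.2, 1.8, 2.4, 2.6, 3.1, 3.3, 3.9, 4.7, 5.0])

# Five equal-width bins between the smallest and largest value
counts, edges = np.histogram(pm, bins=5)

# Pair each bin's left and right bounds with its count
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f} g/gal: {count} vehicles")
```

There is always one more edge than there are bins, and the counts add up to the number of observations.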
# YOUR SOLUTION HERE
We will spend one-third to one-half of Class 3 working on an example to leverage our new Python skills.
The CSV (Comma-Separated Values) file Stock_Data.csv contains the historical daily adjusted closing prices for five index funds:
Symbol | Name |
---|---|
GSPC | S&P 500 |
DJI | Dow Jones Industrial Average |
IXIC | NASDAQ Composite |
RUT | Russell 2000 |
VIX | CBOE Volatility Index |
# YOUR SOLUTION HERE
We can loop over the column names of the dataframe:
for c in stocks.columns:
    print("The mean price of", c, "is", stocks[c].mean(), "dollars.")
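The same loop pattern can be tried on a dataframe small enough to check by hand. In the sketch below, the fund names AAA and BBB and their prices are made up, and the dictionary named means is introduced only for illustration.

```python
import pandas as pd

# Hypothetical prices for two made-up funds over three days
toy = pd.DataFrame({"AAA": [10.0, 11.0, 12.0],
                    "BBB": [100.0, 90.0, 110.0]})

# Looping over .columns yields each column name as a string
means = {}
for c in toy.columns:
    means[c] = toy[c].mean()
    print("The mean price of", c, "is", means[c], "dollars.")

# Calling .mean() on the whole dataframe gives the same numbers at once
print(toy.mean())
```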
This is extremely powerful. Let's use a for loop to plot the price of each index fund relative to the first day on a single plot.
for c in stocks.columns:
    # .iloc[0] selects the first day's price by position
    plt.plot(stocks[c] / stocks[c].iloc[0], label=c)
plt.xlabel("Day")
plt.ylabel("Price Relative to Day-0")
plt.grid(True)
plt.legend()
plt.show()
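Dividing by the first day's price can also be done for every column at once. The sketch below is one possible way to do it, using made-up fund names and prices; it is an alternative to the loop, not a replacement for it.

```python
import pandas as pd

# Hypothetical prices for two made-up funds over three days
toy = pd.DataFrame({"AAA": [10.0, 11.0, 12.0],
                    "BBB": [100.0, 90.0, 110.0]})

# .iloc[0] selects the first row by position, whatever the index labels are.
# Dividing the dataframe by that row scales each column by its own first value.
relative = toy / toy.iloc[0]
print(relative)
```

Every column starts at 1.0, so funds with very different prices can share one vertical axis.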
We want to create a computer program (function) that does the following:
def portfolio(stock_data, initial_investment, daily_investment):
    ''' Compute and plot portfolio value

    Assumptions:
        We invest evenly across all available index funds

    Arguments:
        stock_data: Pandas dataframe containing historical stock prices
        initial_investment: dollars invested at the start of our portfolio (float)
        daily_investment: dollars invested at the end of each day (float)

    Returns:
        portfolio: Pandas dataframe containing the number of shares of each fund
            and the value of the portfolio

    Also:
        Creates a (well labeled) plot of portfolio value versus time
    '''
    # determine the number of funds
    n = len(stock_data.columns)

    ### Create a dataframe to store the results
    # Extract the names of the columns of 'stock_data', convert to list
    c = stock_data.columns.values.tolist()
    # Add 'Value' to the list
    c.append("Value")
    # Create new dataframe with the same number of rows as 'stock_data',
    # the same columns as 'stock_data' plus 'Value', and filled with 0.0
    portfolio = pd.DataFrame(0.0, index=range(len(stock_data)), columns=c)

    # YOUR SOLUTION HERE

    return portfolio
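The dataframe-creation step in the function can be tried on its own. The sketch below is only an illustration of that one step, with made-up fund names and values and a hypothetical variable named ledger; it is not a solution to the activity.

```python
import pandas as pd

# Hypothetical fund names, used only for this illustration
columns = ["AAA", "BBB", "Value"]

# Scalar first argument: every cell starts at 0.0
ledger = pd.DataFrame(0.0, index=range(3), columns=columns)

# .loc[row, column] reads or writes a single cell
ledger.loc[0, "AAA"] = 2.5
ledger.loc[0, "Value"] = 25.0

print(ledger)
```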
# YOUR SOLUTION HERE
Discuss in a few sentences: