Weather with Pandas
Introduction
This is a basic introduction to pandas using some Environment Canada weather station data. We will look at one year's worth of daily data from Victoria, BC and we'll cover three basic topics:
- Loading and looking at data
- Basic filtering of data
- Basic plotting
To start, download the data from here and save the file as yyj_daily_2012.csv. You can get similar data for other stations, all of which can be explored on the EC website. If you want to download multiple years' worth of data, check out this wget script on my GitHub for an example.
Some of the material presented here was taken from a similar tutorial, which documents even more functionality:
Of course also check out the official docs at http://pandas.pydata.org/
Loading data and viewing its basic properties
Now let's get started. First off, let's load our required modules, including pandas.
import re
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt  # provides the plt name used in the plotting section
#%matplotlib inline
Let's read the YYJ daily weather data from the CSV file into a pandas dataframe. We tell pandas to skip the first 24 rows, which are unwanted header info. Pandas then reads the column names from row 25 and the data from the rows below it.
df = pd.read_csv('yyj_daily_2012.csv', skiprows=24)
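To see what skiprows does without needing the real file, here is a minimal sketch using a made-up, in-memory CSV. The station name, column names, and values below are invented for illustration and are not taken from the Environment Canada file.

```python
import io
import pandas as pd

# Hypothetical miniature file: two metadata lines, then the header row and data.
raw = (
    '"Station Name","EXAMPLE STATION"\n'
    '"Province","BC"\n'
    '"Date/Time","Max Temp (C)","Total Precip (mm)"\n'
    '"2012-01-01",7.5,1.2\n'
    '"2012-01-02",8.1,0.0\n'
)

# skiprows=2 discards the two metadata lines, so the third line
# supplies the column names and everything after it is data.
sample = pd.read_csv(io.StringIO(raw), skiprows=2)

print(sample.columns.tolist())
print(sample.shape)
```

The real file works the same way, just with 24 metadata lines instead of two.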
Pandas dataframes have much of the functionality of numpy arrays, plus much more. Let's look at some properties of the dataframe, like its shape:
df.shape
The shape tells us that the dataframe has 366 rows (one for each day of 2012) and 27 columns. Let's see what the names of those columns are:
df.columns
Ok, so now we know what the columns are. The column names aren't particularly easy to read, but we'll get to fixing that soon. First though, let's look at some summary statistics, starting with the mean for each column.
# numeric_only=True skips text columns such as Date/Time, which recent pandas versions refuse to average
df.mean(numeric_only=True)
max, min, std, and others are available too. Flag columns that are completely empty have NaN values for the mean, because there is no valid numeric data in those columns. Now let's access some data, say the Month field:
df['Month']
So we got a printout of the row index (left) and month (right). Pandas truncated the output and printed only the first and last rows, to make it easier to read. Now say we only want the first 10 rows. We can use the head function:
df.Month.head(10)
Notice that here we also used the "dot syntax" (df.Month) to access the "Month" column, compared to before when we used df["Month"]. Both are valid ways to access a column. However, because many of our column names contain spaces and weird characters, the dot syntax won't work for most of our columns. We'll fix that soon. First, let's get focused statistics for the Total Precip field using describe:
df['Total Precip (mm)'].describe()
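As a small sketch of the difference between the two access styles, here is a toy dataframe (the values are made up) with one simple column name and one awkward one:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({"Month": [1, 1, 2], "Total Precip (mm)": [1.2, 0.0, 3.4]})

# For a simple column name, both forms return the same Series.
same = toy["Month"].equals(toy.Month)

# A name with spaces or brackets is not a valid Python identifier,
# so only the bracket form can reach it.
precip = toy["Total Precip (mm)"]

print(same)
print(precip.max())
```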
The column names are a little unwieldy, so let's replace them with something better. For each column name, let's strip out the weird units and surrounding spaces, replace the remaining spaces with underscores, and store the result in a list called dcol. This is just a Python trick, not really pandas. You could just write a list of names if you wanted.
# use a list comprehension to strip off weird units, and replace spaces with underscores.
dcol = [ re.sub(r'\(.*?\)', '', col).strip(' ').replace(' ', '_') for col in df.columns ]
# now let's replace our old column headers with dcol in the pandas dataframe. Remember, dcol
# could be any list of names you make up, as long as it contains one string per column.
df.columns = dcol
# check that it worked
df.columns
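To make the renaming step concrete, here is the same clean-up applied to a few example names. The helper name clean_name and the example list are made up for illustration; the regex and string methods are the ones used above.

```python
import re

def clean_name(col):
    """Drop a bracketed unit, trim spaces, and join words with underscores."""
    no_units = re.sub(r"\(.*?\)", "", col)  # remove "(...)" and its contents
    return no_units.strip(" ").replace(" ", "_")

# Hypothetical column names in the style of the weather file.
examples = ["Max Temp (C)", "Total Precip (mm)", "Max Temp Flag", "Date/Time"]
cleaned = [clean_name(c) for c in examples]
print(cleaned)
```

Note that a name like Date/Time passes through unchanged, so it still needs the bracket syntax.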
Now that we have fixed the column names, we can easily use dot syntax to get some more interesting data. Let's look at the maximum temperature field.
df.Max_Temp
Ok, now let's look at the "index" field that we heard about before. Currently the index is just an integer.
df.index
That's okay, but it turns out to be useful to use the datetime as the index. Let's write a function that converts the strings in our first column, Date/Time, into actual Python datetime objects.
# Define a function to convert strings to dates
def string_to_date(date_string):
    return datetime.strptime(date_string, "%Y-%m-%d")
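Run the function on every date string and store the result in a new date column. We use bracket syntax for the assignment, because assigning to df.date would only set an attribute rather than create a column:
df['date'] = df['Date/Time'].apply(string_to_date)
df.date.head()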
Now let's replace our dataframe's index with the date field:
df.index = df.date
Check that it worked:
df.index
Actually, we are being inefficient. When we loaded the dataframe, we could have told pandas to parse the dates and use them as the index. But the above example shows how you can define a function and apply it to every value in a column.
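As a rough sketch of that more direct route, read_csv accepts parse_dates and index_col arguments. The tiny in-memory file below is made up for illustration; with the real file you would also keep skiprows=24.

```python
import io
import pandas as pd

# Hypothetical two-row file standing in for the real CSV.
raw = (
    "Date/Time,Max Temp\n"
    "2012-01-01,7.5\n"
    "2012-01-02,8.1\n"
)

# parse_dates converts the named column to datetimes while loading,
# and index_col makes that column the dataframe's index.
parsed = pd.read_csv(
    io.StringIO(raw),
    parse_dates=["Date/Time"],
    index_col="Date/Time",
)

print(type(parsed.index).__name__)
print(parsed.index[0])
```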
Basic filtering
Okay, now that we have datetimes as the index, we can do some neat filtering and plotting. Let's start by splitting the data by season. Let's find the max temp in summer (JJA, or months 6 to 8):
# keep the summer subset in its own variable rather than attaching it to df as an attribute
jja = df[ ( df.index.month >= 6 ) & ( df.index.month <= 8 )]
jja.Max_Temp.max()
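Here is a small sketch of how the month filter works, using a synthetic year of daily data in place of the weather file (the Max_Temp values are just a counter, not real temperatures):

```python
import pandas as pd

# One row per day of 2012, with placeholder values.
days = pd.date_range("2012-01-01", "2012-12-31", freq="D")
toy = pd.DataFrame({"Max_Temp": list(range(len(days)))}, index=days)

# Each comparison yields a boolean array; & combines them element by element.
summer_mask = (toy.index.month >= 6) & (toy.index.month <= 8)

# An equivalent mask, written as membership in a list of months.
summer_mask_alt = toy.index.month.isin([6, 7, 8])

summer = toy[summer_mask]
print(len(summer))  # June (30) + July (31) + August (31) = 92 days
```

The brackets around each comparison matter, because & binds more tightly than >= and <= in Python.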
Let's find information on 18 July. To do this we will use loc (the older ix indexer has been removed from pandas) and pass it the datetime for 18 July.
df.loc[ datetime(2012, 7, 18) ]
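In current pandas versions, label-based lookups like this go through loc. A minimal sketch with a few made-up days of data:

```python
from datetime import datetime
import pandas as pd

# Five invented days around mid-July.
days = pd.date_range("2012-07-16", periods=5, freq="D")
toy = pd.DataFrame({"Max_Temp": [20.0, 21.5, 23.0, 22.0, 19.5]}, index=days)

# A single datetime label returns that day's row as a Series.
row = toy.loc[datetime(2012, 7, 18)]

# A partial date string selects every row in that month.
july = toy.loc["2012-07"]

print(row["Max_Temp"])
print(len(july))
```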
Let's look only for days that are nice and warm, say Mean_Temp > 18. We'll request the index so we get a printout of the datetimes.
df[ df.Mean_Temp > 18 ].index
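It can help to look at the condition on its own. The comparison produces a boolean Series, and indexing with it keeps only the True rows. A sketch with invented values:

```python
import pandas as pd

# Four invented days of mean temperature.
days = pd.date_range("2012-07-01", periods=4, freq="D")
toy = pd.DataFrame({"Mean_Temp": [17.0, 18.5, 19.2, 16.8]}, index=days)

mask = toy.Mean_Temp > 18      # boolean Series, one entry per day
warm_days = toy[mask].index    # datetimes of the rows where mask is True

print(mask.tolist())
print(int(mask.sum()))         # True counts as 1, so this is the number of warm days
```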
Basic plotting
Let's do some basic plotting. First up, let's look at a histogram of our maximum temperature data:
df.Max_Temp.hist()
Okay, that's not bad, but now let's look at a time series of the data instead:
df.Max_Temp.plot()
plt.ylabel(r'Max. temp. ($^{\circ}$C)')  # raw string so the backslash in \circ reaches matplotlib intact
So pandas has given us a decent plot, with nice labelling on the time axis. Because pandas uses matplotlib, we can add to and alter our pandas plots in the same way we would for any matplotlib plot. Note above how I manually added the y-axis label. Now let's resample to a 5 day interval and then plot the result:
df.Max_Temp.plot(linewidth=1)
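# resample only groups the data into 5 day bins; mean() turns each bin into a single value
df.Max_Temp.resample('5D').mean().plot(color='r', linewidth=3)
plt.ylabel(r'Max. temp. ($^{\circ}$C)')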
plt.title('Resampled to 5 day means')
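To check what the resampling step computes, here is a sketch on ten made-up daily values, leaving the plotting out:

```python
import pandas as pd

# Ten invented daily values: 1.0, 2.0, ..., 10.0
days = pd.date_range("2012-01-01", periods=10, freq="D")
temps = pd.Series([float(v) for v in range(1, 11)], index=days)

# Group into consecutive 5 day bins and average each bin.
five_day = temps.resample("5D").mean()

print(five_day.tolist())  # [3.0, 8.0]: the means of 1..5 and 6..10
```

Each value in the result is labelled with the first day of its bin, which is why the red line in the plot starts at the beginning of each 5 day window.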