Weather with Pandas

Introduction

This is a basic introduction to pandas using some Environment Canada weather station data. We will look at one year's worth of daily data from Victoria, BC and we'll cover three basic topics:

  1. Loading and looking at data
  2. Basic filtering of data
  3. Basic plotting

To start with, you need to download the data from here and save the file as yyj_daily_2012.csv You can get similar data for other stations, all of which can be explored on the EC website. If you want to download multiple years worth of data checkout this wget script on my github for an example.

Some of the material presented here was taken from a similar tutorial, which documents even more functionality:

Of course also check out the official docs at http://pandas.pydata.org/

Loading data and viewing its basic properties

Now lets get started. First off lets load up our required modules, including pandas.

In [1]:
import re
import pandas as pd
from datetime import datetime
import pylab
#%matplotlib inline  

Lets read in the yyj daily weather data from the CSV formatted file into a Pandas dataframe. We will tell pandas to skip the first 24 rows which are unwanted header info. Pandas will automatically read the column names from row 25, and the data below that.

In [26]:
df = pd.read_csv('yyj_daily_2012.csv',skiprows=24)

Pandas dataframes have much of the functionality of numpy arrays, plus much more. Lets look at some properties of the dataframe, like its shape:

In [27]:
df.shape
Out[27]:
(366, 27)

The shape tells us that the dataframe has 366 rows (one for each day of 2012) and 27 columns. Lets see what the names of those columns are:

In [28]:
df.columns
Out[28]:
Index([u'Date/Time', u'Year', u'Month', u'Day', u'Data Quality', u'Max Temp (�C)', u'Max Temp Flag', u'Min Temp (�C)', u'Min Temp Flag', u'Mean Temp (�C)', u'Mean Temp Flag', u'Heat Deg Days (�C)', u'Heat Deg Days Flag', u'Cool Deg Days (�C)', u'Cool Deg Days Flag', u'Total Rain (mm)', u'Total Rain Flag', u'Total Snow (cm)', u'Total Snow Flag', u'Total Precip (mm)', u'Total Precip Flag', u'Snow on Grnd (cm)', u'Snow on Grnd Flag', u'Dir of Max Gust (10s deg)', u'Dir of Max Gust Flag', u'Spd of Max Gust (km/h)', u'Spd of Max Gust Flag'], dtype=object)

Ok so now we know what the columns are. The column names arn't particularly easy to read, but we'll get to fixing that soon. First though, lets look at some mean statistics, starting with the mean for each column.

In [29]:
df.mean()
Out[29]:
Year                         2012.000000
Month                           6.513661
Day                            15.756831
Data Quality                         NaN
Max Temp (�C)                  14.179508
Max Temp Flag                        NaN
Min Temp (�C)                   5.578142
Min Temp Flag                        NaN
Mean Temp (�C)                  9.902186
Mean Temp Flag                       NaN
Heat Deg Days (�C)              8.173224
Heat Deg Days Flag                   NaN
Cool Deg Days (�C)              0.075410
Cool Deg Days Flag                   NaN
Total Rain (mm)                 2.597541
Total Snow (cm)                 0.037705
Total Precip (mm)               2.633607
Snow on Grnd (cm)               0.040984
Dir of Max Gust (10s deg)      19.315789
dtype: float64

Max, Min, Std, and others are available too. The flag columns have NaN values for the mean, because there is not valid numeric data in those columns. Now lets access some data, lets say the month field

In [30]:
df['Month']
Out[30]:
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
...
351    12
352    12
353    12
354    12
355    12
356    12
357    12
358    12
359    12
360    12
361    12
362    12
363    12
364    12
365    12
Name: Month, Length: 366, dtype: int64

so we got a printout of the row index (left) and month (right). Pandas truncated the data, and only actually printed the first and last 15 columns, to make it easier to read. Now say we only want the first 10 rows, we can use the head function:

In [31]:
df.Month.head(10)
Out[31]:
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: Month, dtype: int64

Notice that here we also used the "dot syntax" (df.Month) to access the "month" column, compared to before when we used df["Month"]. Both are valid ways to access a column. However because many of our columns names contain spaces and weird characters, using the dot syntax won't work easily for most of our columns. We'll fix that soon. Firstly, lets get focused statistics for the Total precip field using describe:

In [32]:
df['Total Precip (mm)'].describe()
Out[32]:
count    366.000000
mean       2.633607
std        5.164477
min        0.000000
25%        0.000000
50%        0.000000
75%        2.800000
max       29.800000
dtype: float64

The columns names are a little unweildy, so lets replace them with something better: For each column name, lets strip out the wierd unit and spaces and replace the spaces with underscores and store the result in a list called dcol. This is just a python trick, not really pandas. You could just write a list of names if you wanted.

In [33]:
# use a list comprehension to strip off weird units, and replace spaces with underscores.
dcol = [ re.sub(r'\(.*?\)', '', col).strip(' ').replace(' ', '_') for col in df.columns ]
# now lets replace our old columns headers with dcol in the pandas dataframe. Remember, dcol
# could be any list of names you make up, as long as it is a list containing the correct number strings = # columns.

df.columns = dcol

# check that it worked
df.columns
Out[33]:
Index([u'Date/Time', u'Year', u'Month', u'Day', u'Data_Quality', u'Max_Temp', u'Max_Temp_Flag', u'Min_Temp', u'Min_Temp_Flag', u'Mean_Temp', u'Mean_Temp_Flag', u'Heat_Deg_Days', u'Heat_Deg_Days_Flag', u'Cool_Deg_Days', u'Cool_Deg_Days_Flag', u'Total_Rain', u'Total_Rain_Flag', u'Total_Snow', u'Total_Snow_Flag', u'Total_Precip', u'Total_Precip_Flag', u'Snow_on_Grnd', u'Snow_on_Grnd_Flag', u'Dir_of_Max_Gust', u'Dir_of_Max_Gust_Flag', u'Spd_of_Max_Gust', u'Spd_of_Max_Gust_Flag'], dtype=object)

Now that we have fixed the columns names, we can easily use dot sytax to get some more interesting data. Lets look at the maximum temperature field.

In [34]:
df.Max_Temp
Out[34]:
0      6.6
1     10.4
2      9.7
3     10.6
4      8.8
5      6.0
6      6.6
7      8.1
8     10.7
9      6.3
10     5.2
11     5.9
12     3.4
13     5.1
14     2.4
...
351    5.8
352    5.5
353    5.8
354    6.9
355    7.9
356    5.6
357    7.7
358    6.4
359    5.9
360    6.5
361    6.6
362    6.6
363    5.5
364    5.4
365    3.7
Name: Max_Temp, Length: 366, dtype: float64

Ok, now lets look at the "index" field that we heard about before. Currently the index is just an integer.

In [35]:
df.index
Out[35]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365], dtype=int64)

thats okay, but it turns out to be useful to use the datetime as the index. let write a function that converts our first column, the string Date/Time into an actual python datetime object.

In [36]:
# Define a function to convert strings to dates
def string_to_date(date_string):
    return datetime.strptime(date_string, "%Y-%m-%d")

Run the function on every date string and overwrite the column

In [37]:
df.date = df['Date/Time'].apply(string_to_date)
df.date.head()
Out[37]:
0   2012-01-01 00:00:00
1   2012-01-02 00:00:00
2   2012-01-03 00:00:00
3   2012-01-04 00:00:00
4   2012-01-05 00:00:00
Name: Date/Time, dtype: datetime64[ns]

Now lets replace our dataframes index with the date field

In [38]:
df.index = df.date

check that it worked

In [39]:
df.index
Out[39]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-01 00:00:00, ..., 2012-12-31 00:00:00]
Length: 366, Freq: None, Timezone: None

Actually, we are being inefficient. When we loaded the dataframe, we could have told pandas to parse the dates and use them. But the above example shows how you can define a function and apply it to every line in a dataframe.

Basic filtering

Okay, now that we have the index as datetimes, it allows us to do some neat filetering and plotting. Lets start by doing a seasonal decomposition. Lets find the max temp in summer (JJA or months 6 to 8):

In [102]:
df.jja = df[ ( df.index.month >= 6 ) & ( df.index.month <= 8 )]
df.jja.Max_Temp.max()
Out[102]:
31.399999999999999

Lets find information on 18th July. To do this we will use ix, and pass it the datetime for 18 July.

In [103]:
df.ix[ datetime(2012, 6, 18) ]
Out[103]:
Date/Time               2012-06-18
Year                          2012
Month                            6
Day                             18
Data_Quality                   NaN
Max_Temp                      18.2
Max_Temp_Flag                  NaN
Min_Temp                       9.5
Min_Temp_Flag                  NaN
Mean_Temp                     13.9
Mean_Temp_Flag                 NaN
Heat_Deg_Days                  4.1
Heat_Deg_Days_Flag             NaN
Cool_Deg_Days                    0
Cool_Deg_Days_Flag             NaN
Total_Rain                     1.5
Total_Rain_Flag                NaN
Total_Snow                       0
Total_Snow_Flag                NaN
Total_Precip                   1.5
Total_Precip_Flag              NaN
Snow_on_Grnd                     0
Snow_on_Grnd_Flag              NaN
Dir_of_Max_Gust                 25
Dir_of_Max_Gust_Flag             E
Spd_of_Max_Gust                 39
Spd_of_Max_Gust_Flag             E
Name: 2012-06-18 00:00:00, dtype: object

Lets look only for days that are nice and warm, lets say Mean_Temp > 18. We'll request the index so we get a print out of the datetimes.

In [104]:
df[ df.Mean_Temp > 18 ].index
Out[104]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-07-08, ..., 2012-09-08]
Length: 16, Freq: None, Timezone: None

Basic plotting

Lets do some basic plotting. First up, lets look at a historgram of our maximum temperature data:

In [105]:
df.Max_Temp.hist()
Out[105]:
<matplotlib.axes.AxesSubplot at 0x710c8d0>

Okay, that's not bad, but now lets look at a time-series of the data instead:

In [112]:
df.Max_Temp.plot()
plt.ylabel('Max. temp. ($^{\circ}$C)')
Out[112]:
<matplotlib.text.Text at 0x8f23090>

So Pandas has given us a decent plot, with nice labelling on the time-axis. Because pandas uses matplotlib, we can add-to and alter our pandas plots in the same way we would for any matplotlib plot. Note above how I manually added the y-axis label. Now lets resample to a 5 day interval and then plot the result:

In [114]:
df.Max_Temp.plot(linewidth=1)
df.Max_Temp.resample('5d').plot(color='r',linewidth=3)
plt.ylabel('Max. temp. ($^{\circ}$C)')
plt.title('Resampled to 5 day means')
Out[114]:
<matplotlib.text.Text at 0x9710450>
In [0]:
In []: