working with pandas dataframes#

Note

This is a non-interactive version of the exercise. If you want to run through the steps yourself and see the outputs, you’ll need to do one of the following:

  • follow the setup steps and work through the notebook on your own computer

  • open the workshop materials on binder and work through them online

  • open a python console, copy the lines of code here, and paste them into the console to run them

In this lesson, we’re going to learn how we can work with datasets - combining tables, creating and re-arranging variables, selecting and sorting rows, and grouping and summarizing data. Mostly, we will be working with pandas, which is designed for data manipulation (again, read this as “analyzing” or “working with”, not “fabricating”!).

data#

The data used in this exercise are the historic meteorological observations from the Armagh Observatory (1853-present), the Oxford Observatory (1853-present), the Southampton Observatory (1855-2000), and Stornoway Airport (1873-present), downloaded from the UK Met Office.

Like with the Armagh dataset we used previously, I have done the following to make the data slightly easier to work with:

  • removed the header on lines 1-5

  • replaced multiple spaces with a single space, and replaced single spaces with a comma (,)

  • removed the --- used to indicate no data, leaving these fields blank

  • removed the * indicating provisional/estimated values

  • removed the 2023 data

  • renamed the file (e.g., oxforddata.txt -> oxforddata.csv)

If you wish to use your own data (and there are loads of stations available!), please feel free. I have also included a script, convert_metoffice.py (in the scripts/ folder), that will do these steps automatically. All you need to do is run the following from a terminal:

convert_metoffice.py {station}data.txt

This will create a new file, {station}data.csv, that has converted + cleaned the data into a CSV format that can easily be read by pandas.

loading libraries#

As before, we load the libraries that we will use in the exercise at the beginning. We will be using three libraries:

  • pandas, for reading the data from a file;

  • seaborn, for plotting the data;

  • pathlib, for working with filesystem paths.

As before, we’re going to alias pandas as pd and seaborn as sns. We only want the Path sub-package from pathlib, so we use from to specify this:

import pandas as pd
import seaborn as sns
from pathlib import Path

In this exercise, we’re going to see a number of different ways that we can work with data tables. Before we get to this, however, we need to load the individual data files and combine them into a single data table.

Rather than loading all four at once and then combining them, however, we can simplify this slightly using a for loop. First, we’ll load the Armagh Observatory data, because it’s currently in a different folder.

Rather than typing the path to the file directly, we can use pathlib.Path (documentation) to construct a path in a platform-independent way. In general, we want to do this because Windows uses \ to separate folders, while Unix-style systems such as Linux and macOS use / - this way, we don’t run into issues if we share our code with people working on different systems.

To construct the filename, we’re using a relative path - that is, it is relative to some given working directory (typically the current working directory). To get to armaghdata.csv from the current directory, we have to go up a directory level (..), before entering the 03.plotting directory, and the data directory after that:

station_data = pd.read_csv(Path('..', '03.plotting', 'data', 'armaghdata.csv')) # use Path to construct a platform-independent path to the data file

Next, we want to make sure that we can keep track of which observation comes from which station - so, we should add a station variable to the table, and make sure to specify that these observations all come from the Armagh Observatory:

station_data['station'] = 'armagh' # add the station name as a column

Now, we can set up a loop to load the data from the other three stations. First, we create a list of station names:

# create a list of station names
new_stations = ['oxford', 'southampton', 'stornoway']

Now that we have the list of station names, we can build up the for loop. First, we construct the filename for each file:

fn_data = Path('data', f"{station}data.csv")

Here, we use an f-string to combine the station variable (which takes on a value from the new_stations list on each pass through the loop) with 'data.csv', so that the resulting file names will be 'oxforddata.csv', 'southamptondata.csv', and 'stornowaydata.csv'. We then use Path() to combine this with the 'data' directory name, so that the value of fn_data is the complete relative path to each file.

Next, we use pd.read_csv() to read in the file, and add a station variable to the table, just like we did with the Armagh data.

Finally, we use pd.concat() (documentation) to combine the existing table, station_data, with the newly loaded table (data), and overwrite the value of station_data with this combined table:

station_data = pd.concat([station_data, data], ignore_index=True)

Each time through the for loop, the value of station is updated:

for station in new_stations:
    fn_data = Path('data', f"{station}data.csv") # create the filename for each csv file, using Path and an f-string
    data = pd.read_csv(fn_data) # read the csv
    data['station'] = station # add the station to the table

    station_data = pd.concat([station_data, data], ignore_index=True) # combine the new data with the current data table

print(station_data) # show the data

Note that this is one advantage of using clear, consistent naming and formatting for data files - we can easily write a loop to load multiple files, instead of having to write individual paths.

selecting rows using expressions#

Now that we have a single table, we can also look at ways that we can select rows from the table. For a very in-depth overview of how indexing and slicing works with pandas, see this guide.

We have already seen an example of this - for example, we could select all observations where the monthly maximum temperature (tmax) is greater than 20°C by using .loc and a conditional statement:

station_data.loc[station_data['tmax'] > 20]

If we want to use multiple conditions - for example, all observations where the monthly maximum temperature is greater than 20°C and the monthly rainfall is greater than 100 mm - we can’t simply use the & operator with the two statements:

station_data.loc[station_data['tmax'] > 20 & station_data['rain'] > 100] # this won't work to combine conditions

This fails because the & operator has higher precedence than comparisons such as >, so python tries to evaluate 20 & station_data['rain'] before either comparison. Instead, we have to surround each condition with parentheses first:

station_data.loc[(station_data['tmax'] > 20) & (station_data['rain'] > 100)] # this will work to select tmax > 20 and rain > 100

Alternatively, we can also use the .query() method (documentation), which allows us to write slightly more natural expressions. The selection above using .query() looks like this:

station_data.query('tmax > 20 & rain > 100') # use query to select rows where tmax > 20 and rain > 100
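Within a query string, we can also spell the operators out as and, or, and not, which can read a bit more naturally - a quick equivalent sketch:

station_data.query('tmax > 20 and rain > 100') # the same selection, using 'and' instead of '&'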

We can also use variables in the query - we just need to prefix them with @:

min_temp = 20 # create a new variable with a value of 20

station_data.query('tmax > @min_temp & rain > 100') # reference the new variable in the query

using sort_values to sort rows#

Sometimes, we might want to sort our data according to the value of different variables. For example, we can sort the observations by rainfall, from smallest to largest values, using .sort_values() (documentation):

station_data.sort_values('rain') # sort by rainfall, from smallest to largest values

By default, the values are sorted in ascending order (from smallest to largest, or from A to Z for characters). If we want to see the reverse, we can set the ascending keyword argument to False:

station_data.sort_values('rain', ascending=False) # sort by rainfall, from largest to smallest values

Note that in both cases, NaN values come at the bottom - because they are not numbers, they are not sorted as being greater than or less than other values, so pandas moves them to the end by default (to put them at the beginning, we can use the na_position argument).
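For example, a minimal sketch using na_position to move the NaN rows to the top instead:

station_data.sort_values('rain', na_position='first') # sort by rainfall, but show the NaN values first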

find unique values#

To find the unique values in a Series (column), we can use .unique() (documentation). For example, we can find the unique values of the station variable:

station_data['station'].unique() # find unique values of station

We can also use .drop_duplicates() (documentation) to find only unique rows in a DataFrame. With the subset argument, we can choose which columns to use in determining whether rows are unique/duplicated:

station_data.drop_duplicates(subset='station') # find rows based on unique values of station

We can also use it to find combinations of variables:

station_data.drop_duplicates(subset=['station', 'mm']) # find rows with unique station/month pairs

Note that the distinct values found above are all from the first year of each dataset - this is because .drop_duplicates() discards all but the first occurrence of a unique row.
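If we wanted to keep the last occurrence of each unique row instead, we could use the keep argument - a brief sketch:

station_data.drop_duplicates(subset=['station', 'mm'], keep='last') # keep the last occurrence of each station/month pair, rather than the first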

counting occurrences#

If we want to count the number of non-NaN values in a table, we can use .count() (documentation):

station_data.count() # count the number of non-nan values in each column

If we want to find the frequency of each distinct row in a DataFrame, we can use .value_counts() (documentation). By itself, this looks at all columns in a row to determine whether or not the row is unique. More often, we will probably want to specify which columns to use with the subset argument.

For example, we can count the number of times each station observed rainfall greater than 150 mm in a month by first using .query() to select all rows where rain is greater than 150, then using .value_counts() with the subset argument to count how many times each value of station appears in the resulting table:

station_data \
    .query('rain > 150') \
    .value_counts(subset='station')

From this, we can quickly see that Stornoway Airport, located in the Outer Hebrides, has far more months with heavy rainfall (278) than any other station in our dataset; by contrast, Oxford has only recorded 12 such months between 1853 and 2022.

Note that in this cell, we’re also using the line continuation character, \, to split the call across multiple lines to help with readability. As far as python is concerned, there is no difference between this:

station_data \
    .query('rain > 150') \
    .value_counts(subset='station')

and this:

station_data.query('rain > 150').value_counts(subset='station')

But the former can make it easier to read and understand what is being done. You will likely see code written in both styles, but I will try to break things into different lines when it makes sense.

adding columns to the table#

In a previous exercise, we saw how we can add a variable/column to a DataFrame using the output of a function:

station_data['date'] = pd.to_datetime({'year': station_data['yyyy'], 'month': station_data['mm'], 'day': 1})

And, we saw how we could assign values to a column based on the values in other columns:

station_data['season'] = '' # initialize an empty string column
station_data.loc[station_data['mm'].isin([1, 2, 12]), 'season'] = 'winter' # if month is 1, 2, or 12, set season to winter
station_data.loc[station_data['mm'].isin(range(3, 6)), 'season'] = 'spring' # if month is 3, 4, or 5, set season to spring
station_data.loc[station_data['mm'].isin(range(6, 9)), 'season'] = 'summer' # if month is 6, 7, or 8, set season to summer
station_data.loc[station_data['mm'].isin(range(9, 12)), 'season'] = 'autumn' # if month is 9, 10, or 11, set season to autumn

Now, let’s look at another way that we can accomplish the same thing, in a slightly more “pythonic” way, by using some of the features of the language.

First, we’ll use range() (documentation) to get the sequence of numbers from 1 to 12, corresponding to the months of the year:

months = range(1, 13) # get a list of numbers from 1 to 12

Next, we’ll use list multiplication and addition to create a list of the season names for each month:

seasons = ['winter'] * 2 + ['spring'] * 3 + ['summer'] * 3 + ['autumn'] * 3 + ['winter']

seasons # show the list of season names for each month

We could, of course, have written this out explicitly:

seasons = ['winter', 'winter', 'spring', 'spring', 'spring', 'summer', 'summer', 'summer', 'autumn', 'autumn', 'autumn', 'winter']

Instead, we have used the fact that multiplying a list by an integer repeats the list, and adding lists together concatenates them, to simplify this (and also to remind you of these properties of lists).
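As a quick illustration of these two properties of lists:

['a'] * 2 + ['b'] # repeats ['a'] twice, then concatenates ['b'] -> ['a', 'a', 'b']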

Next, we can use zip() (documentation) to pair each month number with a season name, and dict() to turn those pairs into a dictionary:

dict(zip(months, seasons)) # create a dict() of month/season pairs
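The result should be a dictionary mapping each month number to its season name, like this:

{1: 'winter', 2: 'winter', 3: 'spring', 4: 'spring', 5: 'spring', 6: 'summer',
 7: 'summer', 8: 'summer', 9: 'autumn', 10: 'autumn', 11: 'autumn', 12: 'winter'}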

Finally, we will use .map() (documentation) to assign season names to each row, based on the value of mm (the month number):

station_data['season'] = station_data['mm'].map(dict(zip(months, seasons)))

station_data.head(n=12) # show the first 12 rows of the table

re-naming columns#

Often, we may also want to rename variables to make them easier to read/understand. For example, the yyyy, mm, and af variables in our table are not necessarily the easiest to understand. We can rename them to clearer names, such as year, month, and air_frost, using the .rename() method (documentation).

To make it clear that we are renaming the columns, we’ll use the columns argument, passing a dict() of old/new names. We also want this change to happen “in place”, meaning that it should update the column names of the existing DataFrame, rather than returning a new DataFrame:

station_data.rename(columns={'yyyy': 'year', 'mm': 'month', 'af': 'air_frost'}, # rename columns using old/new name pairs
                    inplace=True # update the values of this dataframe, not return a new one
                   )

station_data.head(n=5) # show the first 5 rows of the dataframe

Many of the methods that we are working with in this exercise have an inplace argument - by default, pandas assumes that you don’t want to overwrite the existing DataFrame object with these changes. If we don’t use the inplace argument, we need to assign the output to a new variable in order to use it; for example:

new_df = old_df.rename(columns={'old_name': 'new_name'})

selecting columns#

Selecting columns from a DataFrame works similarly to selecting rows. We can use square brackets ([ and ]) along with the name of the column (as a string) to select a single column:

station_data['rain'] # select the rain column

If we want to select multiple columns, we can use a list of column names inside of the square brackets:

station_data[['date', 'rain', 'station', 'season']] # select the date, rain, station, and season columns

Note that the order of the output will be the same as the order of the input - so, this is one way that we can also re-arrange columns.

You can also select a slice of columns using the : operator. Note that unlike how we have seen this used before, when used to select columns (or rows) from a DataFrame using labels, : is inclusive:

station_data.loc[:, 'year':'rain'] # select all columns from year to rain (inclusive)
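For contrast, positional slicing with .iloc follows the usual python convention, where the stop position is not included - a quick sketch (at this point, the first three columns should be year, month, and tmax):

station_data.iloc[:, 0:3] # select the first three columns by position (the stop position, 3, is excluded)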

Finally, we can also use .filter() (documentation) to select columns:

station_data.filter(['date', 'rain']) # select the date and rain columns using filter()

re-arranging columns using reindex()#

We might also want to re-arrange the order of columns - there are a number of different ways to do this, but we’ll have a look at using .reindex() (documentation).

First, we can create a list of the column names, in the order we want to see them. We then pass this to .reindex(), using the columns argument:

new_order = ['date', 'year', 'month', 'season', 'station', 'tmax', 'tmin', 'air_frost', 'rain', 'sun']
station_data = station_data.reindex(columns=new_order) # change the order of the columns and assign the output to the same variable

station_data.head() # show the first 5 rows of the dataframe

saving data to a file#

Now that we have combined the different data files, added some new columns to our data, and re-named and re-arranged the columns, we should save our dataset to a file. We’ll use the .to_csv() method (documentation) to do this. As with reading files, though, there are other file formatting options.

Once again, we will use Path to create a path object to write the file to; we also set the index argument to False so that pandas doesn’t write the row numbers to the file:

station_data.to_csv(Path('data', 'combined_stations.csv'), index=False)

Now, we’ll be able to load this file when we want to do further analysis, rather than needing to re-run the steps to load each file, combine the tables, create new variables, and so on.
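CSV is not the only option, though - as one example, here is a minimal sketch that writes the same table to an Excel file using .to_excel() (this assumes that an Excel writer package such as openpyxl is installed; the filename is just an example):

station_data.to_excel(Path('data', 'combined_stations.xlsx'), index=False) # write the table to an excel file instead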

grouping data#

Next, we’ll see how we can use different tools to aggregate and summarize our data, starting with .groupby() (documentation). To start, we’ll group the data by station:

station_data.groupby('station') # group the data by station

Here, we don’t see anything special - just that the output of .groupby() is, by itself, a DataFrameGroupBy object. Among other things, though, we can use this object to calculate descriptive statistics for each column, based on the applied groupings.

For example, we can use .mean() (documentation) to calculate the mean value of each column (specifying numeric_only=True to make sure that we only get a result for numeric columns):

station_data \
    .groupby('station') \
    .mean(numeric_only=True) # specify numeric_only=True to avoid warning messages

Note that by default, .groupby() drops NaN values - if we want to keep these, we need to specify dropna=False when we create the groupings.
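For example, a sketch of the same summary, but keeping any groups whose station value is NaN (in this dataset there shouldn’t be any, so the output should be unchanged):

station_data \
    .groupby('station', dropna=False) \
    .mean(numeric_only=True) # keep NaN group keys, rather than dropping them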

Now, let’s combine this with what we learned in the previous lesson (the plotting exercise) to create a plot that shows the distribution of rainfall by season, separated by station. First, we want to create a plot that shows the density distribution of rainfall for each season, using sns.FacetGrid() to create a single panel for each station:

g = sns.FacetGrid(data=station_data, col='station', hue='season', col_wrap=2) # create a 2x2 grid with a panel for each station
g.map_dataframe(sns.kdeplot, x='rain', fill=True) # plot the density of rainfall
g.add_legend() # add a legend

Next, we can use .groupby() to calculate the mean rainfall for each station, and assign this to a new variable, mean_values:

mean_values = station_data.groupby('station')['rain'].mean()

Now, we’ll iterate over axis and mean value pairs to plot a vertical line using matplotlib.pyplot.axvline() (documentation). As we mentioned in the previous exercise, seaborn, like many other plotting packages, is built on top of matplotlib - meaning that many seaborn objects inherit from corresponding matplotlib objects.

First, though, we’ll make sure that we’re plotting in the correct panel by using the axes_dict attribute of our FacetGrid:

g.axes_dict # show the dict of key/value pairs for the facetgrid

We can iterate over the index of mean_values (which corresponds to each station), then use the axes_dict to plot a vertical line corresponding to each mean value:

for station in mean_values.index: # iterate over station names
    g.axes_dict[station].axvline(x=mean_values[station], color='k', linestyle='--') # plot a vertical line at the mean rain value for each station

g.fig # show the updated figure

In the next cell, write some lines of code to change the axes labels and increase the font size for the tick labels, axis labels, and panel labels:

# your code goes here!

Now that you have finished the plot, be sure to save it to a file:

g.fig.savefig('seasonal_rain_distribution.svg')

slicing#

We’ll finish up by looking at a few functions that we can use to slice a dataset - that is, extract specific rows from a group. For example, we can use .loc along with the .idxmax() function (documentation) to get the row corresponding to the maximum value of rain (for the minimum, we would use .idxmin()):

station_data.loc[station_data['rain'].idxmax()] # use idxmax to find the index of the maximum value
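The equivalent for the minimum value looks like this:

station_data.loc[station_data['rain'].idxmin()] # use idxmin to find the index of the minimum value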

We can also make use of .head() (documentation), along with .sort_values(), to select n rows corresponding to the maximum value of one or more variables:

station_data \
    .sort_values('rain', ascending=False) \
    .head(n=5)

Alternatively, we can use .tail() (documentation), which returns the last n rows of the DataFrame (note, however, that this may give us NaN values):

station_data \
    .sort_values('rain') \
    .tail(n=5)

Let’s say that we wanted to find the month with the most rain from each of the stations. To do this, we can first sort rain in descending order, then group based on station, before using .head() to select the first row for each value of station:

station_data \
    .sort_values('rain', ascending=False) \
    .groupby('station') \
    .head(n=1)

Finally, we can select a random sample from a DataFrame using .sample() (documentation). On a grouped DataFrame, we get a random sample from each group:

sample = station_data \
    .groupby('station') \
    .sample(5)
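Because .sample() draws rows at random, the selection will change each time the cell is run. If we want a repeatable sample, we can pass the random_state argument - a quick sketch (the seed value of 42 is arbitrary):

sample = station_data \
    .groupby('station') \
    .sample(5, random_state=42) # fix the random seed so that the same rows are selected each time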

exercise and next steps#

That’s all for this exercise. To practice your skills, create a notebook file that does the following:

  • loads the libraries that you need

  • loads the saved data file (combined_stations.csv)

  • helps you answer the following questions:

    • what station has the highest recorded rainfall in the past 20 years, and on what date?

    • what season has the lowest average rainfall? is it the same season for all four stations?

    • what station has recorded the most months with tmin < 1°C? are all these observations from a single season?

    • what is the median rainfall in months where tmax is greater than 20°C? make sure that your result is a number, not a DataFrame!

    • what year saw the most total rainfall, using data from all four stations?

    • what are the top 5 driest years, using only data from stations in Britain?

    • what is the lowest annually-averaged monthly minimum temperature in the dataset, as measured by a single station?

    • what is the sunniest month, on average, in Armagh?

      • bonus: write a line that will rename the months from the number to a 3-letter abbreviation (hint: we saw an example of this using .map())

For a bonus, try downloading at least one additional dataset from the Met Office, saving it to the data folder, and using the script provided (convert_metoffice.py) to convert the .txt file into a .csv file.

In your new notebook file, remember to add this new data to your existing dataset (and re-save the file!), then repeat the analysis questions above.