plotting data using seaborn#

Note

This is a non-interactive version of the exercise. If you want to run through the steps yourself and see the outputs, you’ll need to do one of the following:

  • follow the setup steps and work through the notebook on your own computer

  • open the workshop materials on binder and work through them online

  • open a python console, copy the lines of code here, and paste them into the console to run them

In this exercise, we’re going to investigate how temperature has changed over time, using monthly observations from the Armagh Observatory. By the end of this exercise, you will:

  • load data from a file

  • add variables to a table

  • create date objects from character strings

  • recode values in a table

  • select data from a table using logical expressions

  • create scatter plots (with smoothing lines)

  • create histograms (+ smoothed histograms)

  • save plots to a file

data#

The data used in this exercise are the historic meteorological observations from the Armagh Observatory, downloaded from the UK Met Office.

To make the data slightly easier to work with, I have done the following:

  • Removed the header on lines 1-5

  • Replaced multiple spaces with a single space, and replaced single spaces with a comma (,)

  • Removed the --- used to indicate no data, leaving these fields blank

  • Removed the * indicating provisional/estimated values

  • Removed the 2023 data

  • Renamed the file armaghdata.csv.

If you wish to use your own data (and there are loads of stations available!), please feel free. I have also included a script, convert_metoffice.py (in the scripts/ folder), that will do these steps automatically. All you need to do is run the following from a terminal:

convert_metoffice.py {station}data.txt

This will create a new file, {station}data.csv, containing the converted and cleaned data in a CSV format that can easily be read by pandas.
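To give a rough idea of what this kind of cleaning involves, here is a minimal sketch of the first four steps listed above. This is not the actual convert_metoffice.py script: the function name clean_metoffice_text, the n_header_lines parameter, and the sample text are all made up for illustration, and the sketch does not drop the 2023 data or write a file.

```python
def clean_metoffice_text(raw_text, n_header_lines=5):
    """Convert whitespace-separated station text into CSV text (illustrative only)."""
    cleaned = []
    for line in raw_text.splitlines()[n_header_lines:]:
        fields = line.split()  # split on any run of spaces
        if not fields:
            continue  # skip blank lines
        # '---' marks no data (leave the field blank); '*' marks provisional values (strip it)
        fields = ['' if f == '---' else f.replace('*', '') for f in fields]
        cleaned.append(','.join(fields))
    return '\n'.join(cleaned) + '\n'


# made-up sample text in the same general layout - not real observations
sample = '\n'.join([
    'Station name',
    'Location line',
    'Notes line 1',
    'Notes line 2',
    'Notes line 3',
    'yyyy mm tmax tmin rain',
    '   1853   1    ---    ---   57.3',
    '   2022  12    7.1*   1.4*  80.0*',
])

print(clean_metoffice_text(sample))
```

Note that the fields are split before the --- markers are blanked out. Removing --- from the raw line first would make the empty fields disappear when the spaces are collapsed, and the columns would no longer line up.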

importing libraries#

Before getting started, we will import the libraries (packages) that we will use in the exercise:

  • pandas, for reading the data from a file;

  • seaborn, for plotting the data

To do this, we use the import statement, followed by the name of the package. Remember that we can also alias the package name using as:

import pandas as pd # import pandas, but alias it as pd
import seaborn as sns # import seaborn, but alias it as sns

loading the data#

Now, we’re ready to load the data file, using pd.read_csv() (documentation):

armagh = pd.read_csv('data/armaghdata.csv')

pandas has a number of functions for reading data from files, depending on the format - another common one you might use is pd.read_excel() (documentation). Note that in order to read Excel files, you will need to install openpyxl (link) using conda.

To preview the first few rows of the DataFrame, we can use the .head() method (documentation):

armagh.head(n=10) # show the first 10 rows of the dataframe

From this output, we can see that only the rain variable has values going back to 1853. Missing values are shown as NaN (“Not a Number”), which is how pandas marks missing numeric data.
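To see how blank fields in a CSV file turn into NaN values, here is a small sketch that reads a CSV from a string instead of from a file. The two rows are made up for illustration (they are not real observations), and the column layout is only assumed to resemble armaghdata.csv.

```python
import io

import pandas as pd

# made-up rows with some blank fields - not real observations
csv_text = 'yyyy,mm,tmax,tmin,rain\n1853,1,,,57.3\n1865,4,13.2,4.1,41.0\n'
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)                   # blank fields are shown as NaN
print(sample.isna().sum())      # number of missing values in each column
print(sample['tmax'].count())   # .count() only counts the non-missing values
```

The .isna() method (and its opposite, .notna()) is a handy way to check how much of each column is actually filled in before plotting it.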

adding variables to the table#

Before moving on, we’ll see how we can add additional variables to the table, starting with the date. This will make it easier for us to plot and analyze the time series of observations.

We want this column to be represented as a datetime object so that it displays properly when we plot it, and because we may want to do calculations using the date/time. When working with DataFrame objects, the easiest way to do this is using pd.to_datetime() (documentation). The argument that we will use for this function is a dict object, with key/value pairs corresponding to year, month, and day. Because we don’t have a day column in the DataFrame, we will just default to a value of 1:

pd.to_datetime({'year': armagh['yyyy'], 'month': armagh['mm'], 'day': 1}) # format datetime objects using the yyyy and mm columns

Now, we can add this to the DataFrame as a column called date:

armagh['date'] = pd.to_datetime({'year': armagh['yyyy'], 'month': armagh['mm'], 'day': 1}) # add the datetime series to the dataframe with the name 'date'

armagh.head(n=10) # show the first 10 rows of the table

We can also check the data type of the new column - we should see that it is datetime64[ns], corresponding to a 64-bit datetime object with units of nanoseconds:

armagh.dtypes # see the data type of each column
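One reason for storing dates as datetime objects is that we can then pull out parts of the date, or do date arithmetic, without any string handling. The sketch below uses a tiny made-up table with the same yyyy and mm column names, and the .dt accessor that pandas provides for datetime columns:

```python
import pandas as pd

# a tiny made-up table with the same yyyy/mm column names
sample = pd.DataFrame({'yyyy': [1853, 1853, 2022], 'mm': [1, 2, 12]})
sample['date'] = pd.to_datetime({'year': sample['yyyy'], 'month': sample['mm'], 'day': 1})

print(sample['date'].dt.year.tolist())           # the year of each date
print(sample['date'].dt.month_name().tolist())   # the month name of each date
print(sample['date'].max() - sample['date'].min())  # time between the first and last date
```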

calculating a new column#

One thing we might be interested in doing is aggregating our observations by meteorological season, rather than just by month or year. To help us with this, we can calculate a new variable, season, and assign it values based on whether the month is part of the meteorological spring (March, April, May), summer (June, July, August), autumn (September, October, November), or winter (December, January, February).

Another way to look at this is by thinking of these as if ... then ... else statements:

  • if month is 1, 2, or 12, then season is “winter”

  • if month is 3, 4, or 5, then season is “spring”

  • if month is 6, 7, or 8, then season is “summer”

  • if month is 9, 10, or 11, then season is “autumn”

First, let’s remember how we can select rows from a table using a conditional statement. For example, if we want to select all rows where the value in the mm column is 1, 2, or 12, we could write:

armagh.loc[(armagh['mm'] < 3) | (armagh['mm'] == 12)] # select rows where month is < 3 or == 12

As you can see, this selects a total of 510 rows from the table, wherever the value in the mm column is 1 or 2 (< 3), or 12.

Here, we’ve used the | (“pipe” or “logical or”) operator to combine two conditional statements: it returns True wherever armagh['mm'] < 3 or wherever armagh['mm'] == 12.

However, we can write this a bit more compactly using the .isin() method (documentation), which works much like Python’s in operator: it checks whether each value in the column is found in a list (or other sequence) of values:

armagh.loc[armagh['mm'].isin([1, 2, 12])] # select from the table based on whether values are in [1, 2, 12]
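As a quick check that the two approaches really do select the same rows, we can build both masks on a small made-up table of the twelve months and compare them:

```python
import pandas as pd

# a made-up table with one row per month
sample = pd.DataFrame({'mm': range(1, 13)})

# the parentheses are needed, because | binds more tightly than < and ==
mask_or = (sample['mm'] < 3) | (sample['mm'] == 12)
mask_isin = sample['mm'].isin([1, 2, 12])

print(mask_or.equals(mask_isin))             # True
print(sample.loc[mask_isin, 'mm'].tolist())  # [1, 2, 12]
```

Each mask is a Series of True/False values with one entry per row, and .loc keeps only the rows where the mask is True.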

We can then write four separate statements, one per season, to assign these values to the DataFrame:

armagh['season'] = '' # initialize an empty string column
armagh.loc[armagh['mm'].isin([1, 2, 12]), 'season'] = 'winter' # if month is 1, 2, or 12, set season to winter
armagh.loc[armagh['mm'].isin(range(3, 6)), 'season'] = 'spring' # if month is 3, 4, or 5, set season to spring
armagh.loc[armagh['mm'].isin(range(6, 9)), 'season'] = 'summer' # if month is 6, 7, or 8, set season to summer
armagh.loc[armagh['mm'].isin(range(9, 12)), 'season'] = 'autumn' # if month is 9, 10, or 11, set season to autumn

armagh.head(n=12) # show the first year of data, to check that we have set the values properly
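The four .loc statements above are one way to recode the values. As an alternative sketch (not the approach used in this exercise), we could build a dict that maps each month number to a season, and pass it to the .map() method of the column. The names seasons and month_to_season are made up for this example:

```python
import pandas as pd

# which months belong to which meteorological season
seasons = {
    'winter': [12, 1, 2],
    'spring': [3, 4, 5],
    'summer': [6, 7, 8],
    'autumn': [9, 10, 11],
}

# invert the dict, so that each month number points to its season
month_to_season = {month: season for season, months in seasons.items() for month in months}

# a made-up table with one row per month
sample = pd.DataFrame({'mm': range(1, 13)})
sample['season'] = sample['mm'].map(month_to_season)  # look up each month in the dict

print(sample)
```

Any month value that is not a key in the dict would end up as NaN, which makes unexpected values easy to spot.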

plotting data#

To plot data, we’ll use seaborn, a popular package for plotting data that integrates well with pandas DataFrame objects.

example: scatter plot#

In this exercise, we will look at a number of different example plots using our data, starting with a simple scatter plot using sns.scatterplot() (documentation).

We can pass our DataFrame to sns.scatterplot() using the data argument, and specify which columns from the DataFrame to plot along the x and y axes. The following will plot tmax (monthly maximum temperature) vs date - in other words, a time series of monthly maximum temperature:

sns.scatterplot(data=armagh, x="date", y="tmax") # create a simple scatter plot of tmax vs date, using the armagh dataframe

Note the text output of the cell above, which appears just before the plot itself:

<Axes: xlabel='date', ylabel='tmax'>

The scatterplot function creates an Axes object (documentation) - effectively, this is the container into which we put our plot. Our Axes object exists within a different kind of object - a Figure (documentation).
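To make the relationship between these two objects a bit more concrete, here is a small sketch that keeps hold of the Axes returned by sns.scatterplot() and asks it for the Figure that it belongs to. The data are made up, and the matplotlib.use('Agg') line is only there so that the example also runs without a display - you can leave it out in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook

import pandas as pd
import seaborn as sns

# made-up monthly values - not real observations
sample = pd.DataFrame({
    'date': pd.date_range('2000-01-01', periods=6, freq='MS'),
    'tmax': [7.0, 8.5, 10.1, 12.8, 15.9, 18.4],
})

ax = sns.scatterplot(data=sample, x='date', y='tmax')  # the function returns the Axes
fig = ax.get_figure()  # the Figure that contains this Axes

print(type(ax).__name__, type(fig).__name__)
print(ax.get_xlabel(), ax.get_ylabel())  # the labels come from the column names
```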

Over the rest of this exercise, we’ll look at how we can continue to customize plots by adding colors, customizing labels, changing font sizes, and so on, making use of these objects and their methods/attributes along the way. We’ll also discuss some important differences between axes-level and figure-level functions in seaborn.

example: basic histogram#

First, though, let’s look at another type of plot: a histogram. To create a histogram using sns.histplot() (documentation), we need to specify the x variable. We’ll continue using tmax, so this will be a histogram of the values of tmax:

sns.histplot(data=armagh, x='tmax') # plot a histogram of tmax, using the armagh dataframe

By default, seaborn calculates the bin width (and therefore number of bins) based on the sample size and variance of the variable. We can also specify the number of bins to use (using the bins argument), the width of the bins to use (using the binwidth argument), or even the edges of the bins (by passing a sequence of bin edges using the bins argument).

Here, let’s see how the plot changes when we specify 30 bins:

sns.histplot(data=armagh, x='tmax', bins=30) # plot a histogram of tmax, using the armagh dataframe and 30 bins

In the plot above, you can see how tmax is distributed, with several apparent peaks around 8°C, 14°C, and 18°C. Presumably, these would be peaks that roughly correspond to winter, spring/autumn, and summer, respectively - let’s change the plot slightly so that we can see if this is correct.

To do this, we can use the hue keyword argument to sns.histplot(), which will group the values based on the variable passed (in this case, season):

sns.histplot(data=armagh, x="tmax", hue="season", bins=30) # show the distribution of tmax, broken down by season

With this plot, we can see how the total distribution of the dataset is made up of each group - as we had suspected, the peaks on either side primarily correspond to winter (blue) and summer (green), while the peak in the middle is a combination of spring (orange) and autumn (red).

With different groups, we can also specify how to display the groups using the multiple keyword argument. The value of multiple must be one of the following:

  • 'layer' - the default, which plots each bin in place using transparency

  • 'dodge' - shifts and narrows each bin so that they don’t overlap

  • 'stack' - stacks bins on top of each other

  • 'fill' - stacks bins on top of each other, with each category/bin adding up to 1.

In the cell below, add an argument to histplot to show the histograms stacked on top of each other:

sns.histplot(data=armagh, x="tmax", hue="season", bins=30) # remember to add the right keyword argument!

example: density plot#

We can also plot the density distribution of the data, a smoothed version of the histogram, using sns.kdeplot() (documentation):

sns.kdeplot(data=armagh, x='tmax', hue='season', fill=True) # plot the density distribution of the data, colored by season, filled in

example: box plots#

To make a box plot, we can use sns.boxplot() (documentation):

sns.boxplot(data=armagh, x='season', y='rain') # create a vertical box plot of rain, grouped by season

To swap the orientation of the boxes, we can change the x and y mapping:

sns.boxplot(data=armagh, x='rain', y='season') # create a horizontal box plot of rain, grouped by season

figure-level vs axes-level functions#

seaborn has a number of different functions that accomplish the same thing in different ways. So far, we have seen a few different examples of functions that create plots (e.g., scatterplot, histplot, boxplot, etc.). The output of these functions, as we have noted above, is an Axes object. Inside of a script, these functions will draw the plot within the “currently active” Axes object, or inside of whatever Axes object you pass to the function using the ax argument.

Rather than using these axes-level functions, we can also use figure-level functions, which create their own Figure object before drawing the plot. To see a simple example of this, let’s draw a scatter plot using sns.relplot() (documentation):

sns.relplot(data=armagh, kind='scatter', x='date', y='tmax') # use relplot to show a scatterplot of tmax vs date

You should notice that the plot looks essentially the same as the scatter plot we created using sns.scatterplot(). However, the actual output of the function is different - this time, it’s:

<seaborn.axisgrid.FacetGrid at 0x25c20984490>

(Note that the memory address at the end of your output will differ.) This is a FacetGrid (documentation), rather than an Axes - a multi-plot grid that can be used to show conditional relationships. To see how this works in a more complicated case, let’s plot the relationship between rain and tmax, and use the col argument to create a separate panel for each season:

sns.relplot(data=armagh, kind='scatter', x='rain', y='tmax', col='season', col_wrap=2) # create a 2x2 grid of panels, one for each value of season

We can also use the hue argument to assign a color based on a variable. Note that this will also place a Legend outside of the plot, explaining what each color/symbol combination means:

sns.relplot(data=armagh, kind='scatter', x='rain', y='tmax', col='season', col_wrap=2, hue='season') # create a 2x2 grid of panels, one for each value of season, colored by season

seaborn plot types are organized into three different “modules”, based on what type of data they are meant to show:

  • relplot, for plotting relationships between variables [axes-level: scatterplot, lineplot]

  • displot, for plotting distributions [axes-level: histplot, kdeplot, ecdfplot, rugplot]

  • catplot, for plotting categorical data [axes-level: stripplot, swarmplot, boxplot, violinplot, pointplot, barplot]

For each type of plot we want to make, we can use either the figure-level or the axes-level functions, depending on what we are doing. For creating a figure with multiple subplots (for example, to show a separate subplot for each value of a categorical variable), it is easiest to use the figure-level function, rather than the axes-level function. For more information about the differences between figure-level and axes-level functions in seaborn, see this official guide.
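The same pattern applies to the other two figure-level functions. As a sketch, here is sns.catplot() with kind='box', which draws the same kind of box plot as sns.boxplot(), with one panel for each value of a second variable. The data are made up, and the period column is a hypothetical grouping variable added just for this example:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook

import pandas as pd
import seaborn as sns

# made-up values - not real observations; 'period' is a hypothetical grouping column
sample = pd.DataFrame({
    'season': ['winter', 'spring', 'summer', 'autumn'] * 4,
    'period': ['earlier'] * 8 + ['later'] * 8,
    'rain': [80, 55, 60, 75, 90, 50, 65, 85, 70, 45, 72, 95, 88, 58, 66, 79],
})

# kind='box' uses boxplot for each panel; col='period' creates one panel per period
g = sns.catplot(data=sample, kind='box', x='season', y='rain', col='period')

print(type(g).__name__)  # a FacetGrid, not an Axes
print(g.axes.shape)      # the grid of Axes objects: 1 row, 2 columns
```

The individual Axes objects are still available through the .axes attribute of the FacetGrid, so the Axes methods covered in the next section can be applied to each panel.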

customizing the plot#

In the final example, we’ll make a plot showing the relationship between rain and tmax, colored by the season, and plot regression lines for each season.

We’ll also see how we can change the axes labels, and increase font sizes, to help make our plot ready for including in a manuscript or presentation.

To plot the linear relationship between rain and tmax for each season, we can use sns.lmplot() (documentation) - this will plot the scatter plot, plus the regression lines:

sns.lmplot(data=armagh, x='rain', y='tmax', hue='season')

Up to now, we’ve just been showing the output of the different plotting calls directly by running each cell. In a script, however, this wouldn’t work - we want to assign the output to a new object, which we can then use in the script (including, ultimately, by saving the plot to a file).

We do this exactly the same way as we have previously, using the = operator:

rain_tmax_plot = sns.lmplot(data=armagh, x='rain', y='tmax', hue='season')

Now, we can use this object to change the properties of the axes. seaborn is built on top of another package, matplotlib, and many of the objects created by seaborn either are matplotlib objects, or they inherit a number of attributes from the matplotlib classes they are built on.

For example, the rain_tmax_plot object we have created is a seaborn.axisgrid.FacetGrid:

type(rain_tmax_plot) # show the type of rain_tmax_plot

This class has an attribute, ax, which is a matplotlib.axes.Axes (documentation):

type(rain_tmax_plot.ax) # show the type of rain_tmax_plot.ax

To change the properties of the axes, then, we can use the relevant Axes methods. For example, .set_xlabel() (documentation) and .set_ylabel() (documentation) will change the text and properties of the x and y axis labels, respectively, and we can use the fontsize argument to set the size of the text:

rain_tmax_plot.ax.set_xlabel('monthly rainfall (mm)', fontsize=14) # set the x axis label
rain_tmax_plot.ax.set_ylabel('monthly maximum temperature (°C)', fontsize=14) # set the y axis label

rain_tmax_plot.fig # show the figure again
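The .ax attribute is only available when the FacetGrid contains a single panel. For a grid with several panels, one option is the .set_axis_labels() method of the FacetGrid, which labels the outer axes of the whole grid in one call. The sketch below uses made-up data with two seasons, and loops over .axes.flat to change the tick label size of every panel:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook

import pandas as pd
import seaborn as sns

# made-up values - not real observations
sample = pd.DataFrame({
    'season': ['winter', 'winter', 'winter', 'summer', 'summer', 'summer'],
    'rain': [95.0, 70.0, 110.0, 60.0, 45.0, 80.0],
    'tmax': [7.5, 8.0, 6.9, 19.0, 20.2, 18.1],
})

g = sns.relplot(data=sample, kind='scatter', x='rain', y='tmax', col='season')  # two panels

# label the x and y axes of the grid; extra keyword arguments set the text properties
g.set_axis_labels('monthly rainfall (mm)', 'monthly maximum temperature (°C)', fontsize=14)

for ax in g.axes.flat:  # loop over every panel in the grid
    ax.tick_params(labelsize=12)  # set the tick label font size
```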

We can also use .set_xticks() (documentation) and .set_yticks() (documentation) to change the location, text, and properties of the ticks on the x and y axis, respectively:

rain_tmax_plot.ax.set_xticks(range(0, 250, 50), labels=range(0, 250, 50), fontsize=12) # change the xtick label font size
rain_tmax_plot.ax.set_yticks(range(5, 25, 5), labels=range(5, 25, 5), fontsize=12) # change the ytick label font size

rain_tmax_plot.fig # show the figure again

saving the figure to a file#

As you can see from the previous few cells, we have been using rain_tmax_plot.fig to show the updates to the figure. As you might have guessed, this is a matplotlib.figure.Figure object (documentation) - effectively, the canvas where seaborn draws the plot:

type(rain_tmax_plot.fig) # show the type of rain_tmax_plot.fig

To save our figure as an SVG (scalable vector graphics) file, we can use .savefig() (documentation):

rain_tmax_plot.fig.savefig('rain_tmax_plot.svg', bbox_inches='tight') # save the figure cropped close to the axes
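The file format is chosen based on the extension of the filename, so saving a PNG instead only requires changing the name. For raster formats like PNG, the dpi argument controls the resolution. The sketch below uses made-up data and writes to a temporary folder (so that it doesn't clutter your working directory). It also uses the .savefig() method of the FacetGrid itself, which passes its arguments on to the matplotlib method and uses bbox_inches='tight' by default:

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook

import pandas as pd
import seaborn as sns

# made-up values - not real observations
sample = pd.DataFrame({
    'rain': [30, 45, 60, 75, 90, 105],
    'tmax': [18.2, 16.5, 13.9, 12.4, 9.8, 8.1],
})

g = sns.lmplot(data=sample, x='rain', y='tmax', ci=None)  # ci=None skips the confidence band

out_dir = tempfile.mkdtemp()  # a temporary folder for the output file
png_path = os.path.join(out_dir, 'rain_tmax_plot.png')

g.savefig(png_path, dpi=300)  # the .png extension sets the format; dpi sets the resolution

print(os.path.exists(png_path))
```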

exercise and next steps#

That’s all for this exercise. To practice your skills, create a script that does the following:

  • loads the packages that you will need at the beginning of the script

  • adds a season variable

  • adds a variable to divide the data into three 50 year periods: 1871-1920, 1921-1970, and 1971-2020

  • selects only those observations between 1871 and 2020 (inclusive)

  • creates a figure to plot the density distribution of tmin for each period in its own panel, colored by season (using both color and fill)

  • creates a figure to plot the density distribution of tmin for each period in the same panel, colored by the period (using both color and fill)

  • sets appropriate labels and font sizes for the axis text

  • saves each plot to its own file.