basic statistical analysis using python#

Note

This is a non-interactive version of the exercise. If you want to run through the steps yourself and see the outputs, you’ll need to do one of the following:

  • follow the setup steps and work through the notebook on your own computer

  • open the workshop materials on binder and work through them online

  • open a python console, copy the lines of code here, and paste them into the console to run them

In this exercise, we’ll take a look at some basic statistical analysis with python - starting by using pandas to calculate descriptive statistics for our datasets, before moving on to a few common examples of hypothesis tests using pingouin.

data#

The data used in this exercise are the historic meteorological observations from the Armagh Observatory (1853-present), the Oxford Observatory (1853-present), the Southampton Observatory (1855-2000), and Stornoway Airport (1873-present), downloaded from the UK Met Office, which we used in previous exercises. I have copied the combined_stations.csv data into this folder - this is the same file that you created while working through the “pandas” exercise.

loading libraries#

First, we’ll import the various packages that we will be using here:

  • pandas, for reading the data from a file;

  • seaborn, for plotting the data;

  • pingouin, for the statistical tests that we’ll use;

  • pathlib, for working with filesystem paths.

import pandas as pd
import seaborn as sns
import pingouin as pg
from pathlib import Path

Next, we’ll use pd.read_csv() to load the combined station data. We’ll also use the parse_dates argument to tell pandas to read the date column as a date:

station_data = pd.read_csv(Path('data', 'combined_stations.csv'), parse_dates=['date'])
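Before going further, it’s worth a quick look at the table to make sure that the data loaded the way we expect:

station_data.head() # show the first 5 rows of the table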

descriptive statistics#

Before diving into statistical tests, we’ll spend a little bit of time expanding on calculating descriptive statistics using pandas. We have seen a little bit of this already, using .groupby() and .mean() to calculate the mean value of rain for each station.

describing variables using .describe()#

First, we’ll have a look at .describe() (documentation), which provides a summary of each of the (numeric) columns in the table:

station_data.describe()

In the output above, we can see the count (count), mean (mean), standard deviation (std), minimum (min), 1st quartile (25%), median (50%), 3rd quartile (75%), and maximum (max) values of each numeric variable.

With this, we can quickly see where we might have errors in our data - for example, if we have non-physical or nonsense values in our variables. When first getting started with a dataset, it can be a good idea to check over the dataset using .describe().
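As a quick illustration, we could also check one kind of non-physical value directly - rainfall should never be negative:

# sanity check: monthly rainfall should never be negative
if (station_data['rain'].dropna() < 0).any():
    print('warning: negative rainfall values found')
else:
    print('no negative rainfall values found')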

using .describe() to summarize groups#

What if we wanted to get a summary based on some grouping - for example, for each station? We could use .loc to select the rows for each value of station in turn, then call .describe() on each of these objects.

Not surprisingly, however, there is an easier way, using .groupby() (documentation). First, .groupby() divides the table into groups based on the values of a column (here, station); calling .describe() on the result then summarizes each group separately:

station_data.groupby('station').describe()

On its own, this output isn’t all that readable - the summary statistics are put into individual columns, which means that the table is very wide. Let’s see how we can re-arrange this so that we have a dict() of DataFrames, one for each station. First, we’ll assign the output of .describe() to a variable, group_summary:

group_summary = station_data.groupby('station').describe()

Next, we’ll iterate over the stations to work with each row of the table in turn. First, though, let’s look at how the column names are organized:

group_summary.columns # show the column names

This is an example of a MultiIndex (documentation) - a multi-level index object, similar to what we have seen previously for rows. Before beginning the for loop below, we use columns.unique() (documentation) to get the unique first-level names from the columns (i.e., the variable names from the original DataFrame).

Inside the for loop, we first select the row corresponding to each station using .loc. Have a look at this line:

reshaped = pd.concat([this_summary[ind] for ind in columns], axis=1)

This uses something called list comprehension to quickly create a list of objects. It is effectively the same as writing something like:

out_list = []
for ind in columns:
    out_list.append(this_summary[ind])

reshaped = pd.concat(out_list, axis=1)

Using list comprehension helps make the code more concise and readable - it’s a very handy tool for creating lists with iteration. In addition to list comprehension, python also has dict comprehension - we won’t use this here, but it works in a very similar way to list comprehension.
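Like list comprehension, dict comprehension uses a single expression to build a dict() - braces instead of brackets, with key: value pairs. A quick illustration:

squares = {num: num ** 2 for num in range(5)} # creates {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}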

Once we have reshaped the row (the Series) into a DataFrame, we assign the column names, before using .append() to add the reshaped table to a list:

stations = group_summary.index.unique() # get the unique values of station from the table
columns = group_summary.columns.unique(level=0) # get the unique names of the columns from the first level (level 0)

combined_stats = [] # initialize an empty list

for station in stations:
    this_summary = group_summary.loc[station] # get the row corresponding to this station

    reshaped = pd.concat([this_summary[ind] for ind in columns], axis=1) # use list comprehension to reshape the table
    reshaped.columns = columns # set the column names
    combined_stats.append(reshaped) # add the reshaped table to the list

Finally, we’ll use the built-in function zip() to get pairs of station names (from station) and DataFrames (from combined_stats), then pass this to dict() to create a dictionary of station name/DataFrame key/value pairs:

summary_dict = dict(zip(stations, combined_stats)) # create a dict of station name, dataframe pairs
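If you haven’t used zip() before, here is a minimal illustration of the pattern, with made-up values:

names = ['a', 'b', 'c'] # a hypothetical list of keys
values = [1, 2, 3] # a hypothetical list of values

dict(zip(names, values)) # creates {'a': 1, 'b': 2, 'c': 3}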

To check that this worked, let’s look at the summary data for Oxford:

summary_dict['oxford'] # show the summary data for oxford

using built-in functions for descriptive statistics#

This is helpful, but sometimes we want to calculate other descriptive statistics, or use the values of descriptive statistics in our code. pandas has a number of built-in functions for this - we have already seen .mean() (documentation), for calculating the arithmetic mean of each column of a DataFrame:

station_data.mean(numeric_only=True) # calculate the mean for each numeric column

Series objects (columns/rows) also have .mean() (documentation):

station_data['rain'].mean() # calculate the mean of the rain column

We can calculate the median of the columns of a DataFrame (or a Series) using .median() (documentation):

station_data.median(numeric_only=True) # calculate the median of each numeric column

To calculate the variance of the columns of a DataFrame (or a Series), use .var() (documentation):

station_data.var(numeric_only=True)

and for the standard deviation, .std() (documentation):

station_data.std(numeric_only=True)

pandas doesn’t have a built-in function for the inter-quartile range (IQR), but we can easily calculate it ourselves, using .quantile() (documentation) to calculate the 3rd quartile (the 0.75 quantile) and the 1st quartile (the 0.25 quantile) and subtracting the outputs:

station_data.quantile(0.75, numeric_only=True) - station_data.quantile(0.25, numeric_only=True)
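If we need the IQR in more than one place, we could wrap this in a small helper function (a minimal sketch - the name iqr is our own, not part of pandas):

def iqr(df):
    """calculate the inter-quartile range of each numeric column of a dataframe."""
    return df.quantile(0.75, numeric_only=True) - df.quantile(0.25, numeric_only=True)

iqr(station_data) # same result as the line above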

Finally, we can also calculate the sum of each column of a DataFrame (or a Series) using .sum() (documentation):

station_data.sum(numeric_only=True)

These are far from the only methods available, but they are some of the most common. For a full list, check the pandas documentation under Methods.

with .groupby()#

As we have seen, the output of .groupby() is a special grouped object (a DataFrameGroupBy), and it provides almost all of the same methods for calculating summary statistics:

station_data.groupby('station').mean(numeric_only=True)
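If we want several summary statistics at once, rather than calling each method in turn, we can also use .agg() (documentation) on the grouped object, passing a list of method names - for example:

station_data.groupby('station')['rain'].agg(['mean', 'median', 'std']) # mean, median, and standard deviation of rain for each station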

statistical tests#

In addition to descriptive statistics, we can use python for inferential statistics - for example, for hypothesis testing. In the remainder of the exercise, we’ll look at a few examples of some common statistical tests and how to perform these in python, using pingouin. Please note that these examples are far from exhaustive - if you’re looking for a specific hypothesis test, there’s a good chance someone has programmed it into python, either as part of the statsmodels package (documentation), as part of scipy.stats (documentation), or as part of an additional package that you can install. You should be able to find what you need with a quick internet search.

independent samples student’s t-test#

For a start, let’s test the hypothesis that Stornoway Airport gets more rain than Armagh. First, let’s have a look at a box plot:

selected = station_data.loc[station_data['station'].isin(['stornoway', 'armagh'])] # select stornoway airport and armagh data

sns.boxplot(data=selected, x='station', y='rain')
[box plot comparing monthly rainfall at Armagh and Stornoway Airport]

It does look like Stornoway Airport gets more rain, on average, than Armagh.

To run Student’s t-test using pingouin, we use pg.ttest() (documentation):

armagh_rain = selected.loc[selected['station'] == 'armagh', 'rain'].dropna().sample(n=30) # take a sample of 30 rain observations from armagh
stornoway_rain = selected.loc[selected['station'] == 'stornoway', 'rain'].dropna().sample(n=30) # take a sample of 30 rain observations from stornoway

# test whether stornoway_rain.mean() > armagh_rain.mean() at the 99% confidence level
rain_comp = pg.ttest(stornoway_rain, armagh_rain, alternative='greater', confidence=0.99)

rain_comp # show the results dataframe

The output of pg.ttest() is a DataFrame that gives us the following:

  • T: the value of the t-statistic;

  • dof: the degrees of freedom;

  • alternative: the hypothesis that was used;

  • p-val: the p-value of the t-statistic;

  • CI{confidence}%: the confidence interval at the chosen confidence level;

  • cohen-d: Cohen’s d effect size;

  • BF10: the Bayes Factor of the alternative hypothesis;

  • power: the achieved power of the test
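Because the output is an ordinary DataFrame, we can also extract individual values for use in our code - for example, to compare the p-value against our chosen significance level. A minimal sketch, assuming the single row label 'T-test' that pg.ttest() uses:

alpha = 0.01 # the significance level corresponding to 99% confidence
p_value = rain_comp.loc['T-test', 'p-val'] # extract the p-value from the results table

if p_value < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')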

Now, let’s look at an example of a one-sample t-test, to see if we can determine whether the mean of a small sample of summer temperatures provides a good estimate of the mean of all summer temperatures measured at Oxford.

First, we’ll select all of the summer values of tmax recorded at Oxford, then calculate the mean value of these temperatures:

oxford_summer_tmax = station_data.query('station == "oxford" & season == "summer"')['tmax']

print(f"Oxford Mean Summer Tmax: {oxford_summer_tmax.mean():.1f}")

So the mean summer temperature measured in Oxford between 1853 and 2022 is 21.1°C - now, let’s take a random sample of 30 temperatures using .sample():

sample_tmax = oxford_summer_tmax.sample(n=30) # select a random sample of 30 values
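Note that .sample() draws a different random sample each time it is called, so your results will differ from what is shown here. To make the sampling reproducible, we could pass the random_state argument (the seed value below is arbitrary):

sample_tmax = oxford_summer_tmax.sample(n=30, random_state=42) # use a fixed (arbitrary) seed for reproducibility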

Once again, we use pg.ttest() to conduct the test. Because the second argument is a single value (the mean of oxford_summer_tmax), the function knows that this is a one-sample test:

# test whether sample_tmax.mean() is not equal to the mean of all summer tmax values, at the 95% confidence level
tmax_comp = pg.ttest(sample_tmax, oxford_summer_tmax.mean(), alternative='two-sided', confidence=0.95)

tmax_comp # show the results dataframe

Based on this, we can’t conclude that our sample mean is significantly different from the mean of all of the summer values of tmax recorded at Oxford.

non-parametric tests#

We can also conduct non-parametric hypothesis tests using pingouin. The example we will look at is the Wilcoxon signed-rank test, using pg.wilcoxon() (documentation), a non-parametric analogue to the t-test. For this, we’ll use the same data that we did before, again testing whether Stornoway Airport gets more rainfall, on average, than Armagh. Note that pg.wilcoxon() treats the two samples as matched pairs, which requires them to have the same size - here, both samples have 30 observations:

# test whether stornoway_rain is greater than armagh_rain, on average
pg.wilcoxon(stornoway_rain, armagh_rain, alternative='greater')

The output table has the following columns:

  • W-val: the W-statistic of the test;

  • alternative: the alternative hypothesis used;

  • p-val: the p-value of the W-statistic;

  • RBC: the matched pairs rank-biserial correlation (i.e., the effect size);

  • CLES: the common language effect size.

From this, we can again conclude that Stornoway Airport does get more rainfall, on average, than Armagh.
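As noted above, pg.wilcoxon() treats the two samples as matched pairs, but our two rain samples are really independent. For independent samples, the non-parametric analogue of the t-test is the Mann-Whitney U test, available as pg.mwu() (documentation) - a sketch using the same samples:

# test the same hypothesis using the mann-whitney u test for independent samples
pg.mwu(stornoway_rain, armagh_rain, alternative='greater')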

analysis of variance#

Finally, we’ll see how we can set up and interpret an analysis of variance test. In this example, we’ll only look at data from Armagh, Oxford, and Stornoway Airport, because the Southampton time series ends in 1999. We’ll first calculate the annually-averaged values of the meteorological variables, using .groupby() and .mean():

annual_average = station_data.loc[station_data['station'].isin(['armagh', 'oxford', 'stornoway'])] \
    .groupby('year') \
    .mean(numeric_only=True)

Then, we’ll add a new variable, period, to divide the observations into three different 50-year periods: 1871-1920, 1921-1970, and 1971-2020, using pd.cut() (documentation). To do this, we first have to define the labels (periods) and the category boundaries (bins). Then, we use .dropna() to remove any rows where period has a NaN value (i.e., where the year is outside of the range 1871-2020):

periods = ['1871-1920', '1921-1970', '1971-2020'] # make a list of period names
bins = [1870, 1920, 1970, 2020] # make a list of period boundaries - must be 1 longer than the names

annual_average['period'] = pd.cut(annual_average.index, bins, labels=periods) # assign a value to period, using the boundaries and labels above

annual_average.dropna(subset=['period'], inplace=True) # drop any rows where period is NaN

annual_average # show the dataframe
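As a quick check on the binning, we can count the number of years assigned to each period - each should cover 50 years:

annual_average['period'].value_counts() # count the number of years assigned to each period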

Before running the test, let’s make a box plot that shows the distribution of tmax values among the three periods:

sns.boxplot(data=annual_average, x='period', y='tmax') # make a box plot of tmax, grouped by period
[box plot of annually-averaged tmax, grouped by period]

From this, it certainly appears as though there is a difference in the mean value of tmax between the three periods. To test this formally, we’ll use a one-way analysis of variance (ANOVA).
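Before running the test, we might also want to check the ANOVA assumptions of normality and equal variances within groups - pingouin provides pg.normality() and pg.homoscedasticity() for this (a quick sketch):

pg.normality(data=annual_average, dv='tmax', group='period') # shapiro-wilk test for normality within each group
pg.homoscedasticity(data=annual_average, dv='tmax', group='period') # levene's test for equal variances between groups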

To run the one-way ANOVA test, we use pg.anova() (documentation). Using this, we can pass our annual_average DataFrame using the data argument, and specify tmax as the dependent variable (dv), and period as the grouping variable (between). We’ll also look at the detailed test output (detailed=True):

aov = pg.anova(dv='tmax', between='period', data=annual_average, detailed=True) # run one-way anova for differences of tmax between periods

aov.round(3) # round the output to 3 decimal places

The table output has the following columns across two rows:

  • Source: the factor names;

  • SS: the sums of squares;

  • DF: the degrees of freedom;

  • MS: the mean squares;

  • F: the value of the F-statistic;

  • p-unc: the uncorrected p-values;

  • np2: the partial eta-squared effect sizes

From the table, we can see that there are significant differences between the groups at the 99.9% confidence level.

This doesn’t tell us which pairs of groups are different - for this, we would need to run an additional test. As one example, we can use pg.pairwise_tukey() (documentation) to compute “Tukey’s Honest Significant Difference” between each pair of groups:

tmax_mc = pg.pairwise_tukey(dv='tmax', between='period', data=annual_average) # run the pairwise tukey HSD post-hoc test

tmax_mc.round(3) # round the output to 3 decimal places

From this, we can see each pair of groups being compared (A and B); the estimated mean of each group (mean(A) and mean(B)); the estimated difference between the means (diff); the standard error of the difference (se); the t-statistic of the estimated difference (T); the corrected p-value of the t-statistic (p-tukey); and the effect size (by default, Hedges g).

Using this, we can conclude that, at the 99% confidence level, there is a significant difference in tmax between the periods 1971-2020 and 1871-1920, and between the periods 1971-2020 and 1921-1970.

exercise and next steps#

That’s all for this exercise. To help practice your skills, try at least one of the following:

  • Set up and run an ANOVA test to compare annual total rainfall at all four stations, using data from all available years. Are there significant differences between the stations? Use pg.pairwise_tukey() or pg.pairwise_tests() (documentation) to investigate further.

  • Using only observations from Armagh, set up and run a test to see if there are significant differences in rainfall based on the season.

  • Using only observations from Oxford, is there a significant difference between the values of tmax in the spring and the autumn at the 99.9% confidence level?

  • Using only observations from Stornoway Airport, is the value of tmin significantly lower in the winter, compared to the autumn?