transforming data#

Note

This is a non-interactive version of the exercise. If you want to run through the steps yourself and see the outputs, you’ll need to do one of the following:

  • follow the setup steps and work through the notebook on your own computer

  • open the workshop materials on binder and work through them online

  • open an R console or RStudio, copy the lines of code here, and paste them into the console to run them

In this lesson, we’re going to learn how we can work with datasets - combining tables, creating and re-arranging variables, selecting and sorting rows, and grouping and summarizing data. Mostly, we will be using functions from the dplyr package, which is designed for data manipulation (again, read this as “analyzing” or “working with”, not “fabricating”!).

data#

The data used in this exercise are the historic meteorological observations from the Armagh Observatory (1853-present), the Oxford Observatory (1853-present), the Southampton Observatory (1855-2000), and Stornoway Airport (1873-present), downloaded from the UK Met Office.

Like with the Armagh dataset we used previously, I have done the following to make the data slightly easier to work with:

  • Removed the header on lines 1-5

  • Replaced multiple spaces with a single space, and replaced single spaces with a comma (,)

  • Removed --- to indicate no data, leaving these fields blank

  • Removed * indicating provisional/estimated values

  • Removed the 2023 data

  • Renamed the file (e.g., oxforddata.txt -> oxforddata.csv).

If you wish to use your own data (and there are loads of stations available!), please feel free. For the best experience, you will likely need to repeat the steps indicated above.

loading libraries#

As before, we load the libraries that we will use in the exercise at the beginning. We will be using three libraries:

  • readr, for reading the data from a file;

  • ggplot2, for plotting the data;

  • and dplyr, for transforming/manipulating the data.

library(readr) # this loads the functions we'll use to load the data
library(ggplot2) # this loads the functions, etc. needed for us to plot
library(dplyr) # this loads the functions, etc. needed for us to work with the data

loading the data#

In this exercise, we’re going to see a number of different ways that we can work with data tables. Before we get to this, however, we need to load the individual data files and combine them into a single data table.

Rather than loading each of the four files separately and then combining them, however, we can simplify this slightly using a for loop. First, we’ll load the Armagh Observatory data, because it’s currently in a different folder.

Rather than typing the path to the file directly, we can use file.path() (documentation) to construct a path in a platform-independent way. In general, we want to do this because Windows uses \ to separate folders, while Unix-style systems such as Linux and macOS use / - this way, we don’t run into issues if we share our code with people working on different systems.
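
For example, here’s a quick illustration of the path that this constructs (the output, shown as a comment, is what you would see on a Unix-style system):

file.path('..', '03.plotting', 'data', 'armaghdata.csv') # "../03.plotting/data/armaghdata.csv"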

To construct the filename, we’re using a relative path - that is, it is relative to some given working directory (typically the current working directory). To get to armaghdata.csv from the current directory, we have to go up a directory level (..), before entering the 03.plotting directory, and the data directory after that:

station_data <- read_csv(file.path('..', '03.plotting', 'data', 'armaghdata.csv')) # use file.path to construct a path to the data file

Next, we want to make sure that we can keep track of which observation comes from which station - so, we should add a station variable to the table, and make sure to specify that these observations all come from the Armagh Observatory:

station_data$station <- 'armagh' # add the station name as a column

Now, we can set up a loop to load the data from the other three stations. First, we can create a vector of station names:

# create a vector of station names
new_stations <- c('oxford', 'southampton', 'stornoway')

Now that we have the vector of station names, we can construct the for loop to first read each file:

fn_data <- file.path('data', paste(station, 'data.csv', sep=""))

Here, we first use paste() to combine the station variable (which takes on a value from the new_stations vector on each pass through the loop) with 'data.csv', using a separator (sep) value of "" so that the resulting file names will be 'oxforddata.csv', 'southamptondata.csv', and 'stornowaydata.csv'. We then use file.path() to combine this with the 'data' directory name, so that the value of fn_data is the complete relative path to each file.
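
To make this concrete, here’s what each step produces on the first pass through the loop, when station is 'oxford' (outputs shown as comments):

paste('oxford', 'data.csv', sep="") # "oxforddata.csv"
file.path('data', 'oxforddata.csv') # "data/oxforddata.csv"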

Next, we use read_csv() to read in the file, and add a station variable to the table, just like we did with the Armagh data.

Finally, we use bind_rows() (documentation) to combine the existing table, station_data, with the newly loaded table (data), and overwrite the value of station_data with this combined table:

station_data <- bind_rows(station_data, data)

Each time through the for loop, the value of station is updated:

for (station in new_stations) {
    fn_data <- file.path('data', paste(station, 'data.csv', sep="")) # create the filename for each csv file, using file.path and paste
    data <- read_csv(fn_data) # read the csv
    data$station <- station # add the station to the table

    station_data <- bind_rows(station_data, data) # combine the new data with the current data table
}

print(station_data) # show the data

Note that this is one advantage of using clear, consistent naming and formatting for data files - we can easily write a loop to load multiple files, instead of having to write individual paths.

using filter() to select rows#

Now that we have a single table, we can look at ways to select rows from the table. We have already seen an example of this - selecting all observations where the monthly maximum temperature (tmax) is greater than 20°C:

station_data[station_data$tmax > 20, ]

However, there’s a small problem with this. In the example above, you can see that there are a number of rows where the values are all NA - this is because of how R handles NA values with the extraction operators ([]). Rows where tmax is missing (the value is NA) also show up, because a comparison operator with NA returns NA, not TRUE/FALSE:

station_data$tmax > 20

When we then use this vector to index the tibble, each NA in the index produces a row filled entirely with NA values - because R can’t tell whether those rows should be included, it returns missing rows rather than dropping them (see here for more information if you’re interested).
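
To see this behavior on a small example (a minimal illustration, separate from our station data):

heights <- c(10, 25, NA, 30)
heights > 20 # FALSE TRUE NA TRUE - comparing NA to a number returns NA
heights[heights > 20] # 25 NA 30 - the NA in the index produces a missing value, rather than dropping it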

We could write a combined conditional expression to select the correct rows:

(station_data$tmax > 20) & (!is.na(station_data$tmax))

Here, the conditional is TRUE only when tmax > 20 and tmax is not NA. However, there is an easier, clearer way, using dplyr::filter() (documentation):

station_data |> filter(tmax > 20) # use filter to select rows where tmax > 20

Here, we’re using an operator we haven’t seen before: the |> (“pipe”) operator.

In brief, |> tells R to take the output of the thing on the left, and pass it to the function call on the right. Thinking about this mathematically, x |> f(y) is equivalent to f(x, y). We can also use this to combine multiple function calls - so, x |> f(y) |> g(z) is equivalent to g(f(x, y), z), and so on.

So, this:

station_data |> filter(tmax > 20) # use filter to select rows where tmax > 20

Is the same as this:

filter(station_data, tmax > 20) # use filter to select rows where tmax > 20

With only one function call, the difference may not seem like much - as we will see, the real power comes when we are combining many function calls together.
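
As a small illustration, we can chain two filter() calls together using |>, which gives the same result as nesting the calls - but reads in the order the operations happen:

station_data |> filter(tmax > 20) |> filter(rain > 100) # piped: filter by tmax, then by rain
filter(filter(station_data, tmax > 20), rain > 100) # nested: the same result, but harder to read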

We can also use filter() with combined conditionals - for example, to select all monthly observations where tmax is greater than 20°C and rain is greater than 100 mm:

station_data |> filter(tmax > 20 & rain > 100) # use filter to select rows where tmax > 20 and rain > 100

using arrange() to sort rows#

Sometimes, we might want to sort our data according to the value of different variables. For example, we can sort the observations by rainfall, from smallest to largest values:

station_data |> arrange(rain) # sort by rainfall, from smallest to largest values

By default, the values are sorted in ascending order (from smallest to largest, or from A to Z for characters). If we want to see the reverse, we can use desc() (documentation):

station_data |> arrange(desc(rain)) # sort by rainfall, from largest to smallest values

We can also combine different variables to sort by - for example, sorting by station and rainfall:

station_data |> arrange(station, desc(rain)) # sort by station, then rainfall in descending order

using distinct() to find unique rows#

To find unique rows in the dataset, we can use distinct() (documentation). By itself, distinct() uses all of the variables to determine whether rows are distinct; most of the time, we likely want to use it to find unique values of a given variable:

station_data |> distinct(station) # find distinct values of station names

We can also use it to find combinations of variables:

station_data |> distinct(station, mm) # find distinct pairs of station and month values

We can also use the .keep_all argument to keep the other columns while filtering for unique rows:

station_data |> distinct(station, mm, .keep_all = TRUE) # keep all columns while finding distinct pairs of station and month values

Note that the distinct values found above are all from the first year of each dataset - this is because distinct() discards all but the first occurrence of a unique row.
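
If we want to control which occurrence is kept, we can sort the table before calling distinct() - for example (a sketch), sorting by year in descending order keeps the most recent occurrence for each station instead:

station_data |> arrange(desc(yyyy)) |> distinct(station, .keep_all = TRUE) # keep the most recent row for each station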

counting occurrences with count()#

If we want to count the number of times a particular value occurs in the table, we can use count() (documentation). We can also use this in combination with other functions - for example, we can count the number of times each station observed rainfall greater than 150 mm in a month by first using filter() to select all rows where rain is greater than 150, then using count() to count the number of occurrences of each station in the resulting table:

station_data |> filter(rain > 150) |> count(station, sort = TRUE) # select all rows where rain > 150, then count the occurrences of station, sorted in descending order

From this, we can quickly see that Stornoway Airport, located in the Outer Hebrides, has far more months with heavy rainfall (278) than any other station in our dataset; by contrast, Oxford has only recorded 12 such months between 1853 and 2022.

adding variables to the table using mutate()#

In a previous exercise, we saw how we can use R’s built-in functionality to add variables to a data frame:

armagh$date <- as.Date(paste(armagh$yyyy, armagh$mm, "1", sep="/"), format="%Y/%m/%d")

We can also use mutate() (documentation). This is more flexible than the built-in functionality, because it allows us to add more than one new variable at a time, and to specify where to put the new variable(s) using the .before or .after arguments. For example, to place the new date variable on the left-hand side of the table, we can use .before = 1:

station_data |> mutate(date = as.Date(paste(yyyy, mm, "1", sep = "/"), format = "%Y/%m/%d"), .before = 1) # use mutate to add a date variable, before the other variables

Note that we haven’t assigned the output, so station_data is unchanged, and the new variable is only printed. We may want to overwrite our existing data by assigning the output to the same object, or we may want to create a new object with the output. Ultimately, the choice depends on what we’re planning to do.

We can also use mutate() to add multiple variables to the table - for example, adding the season and date variables as we saw previously:

station_data <- station_data |> mutate(
    season = case_when(
        mm %in% c(1, 2, 12) ~ 'winter', # if month is 1, 2, or 12, set it to winter
        mm %in% 3:5 ~ 'spring', # if month is 3, 4, 5, set it to spring
        mm %in% 6:8 ~ 'summer', # if month is 6, 7, 8, set it to summer
        mm %in% 9:11 ~ 'autumn', # if month is 9, 10, 11, set it to autumn
    ),
    date = as.Date(paste(yyyy, mm, "1", sep="/"), format="%Y/%m/%d") # add a date variable
)

print(station_data)

By default, mutate() adds variables to the right-hand side of the table; in addition to specifying where to put them using .before and .after, we will also see how we can re-arrange the variables in the table later on.

using select() to select columns#

Sometimes, we might want to select a single variable, or a handful of variables from a table - we can do this using select() (documentation):

station_data |> select(date, tmax, station) # select only the date, tmax, and station variables

We can also select a subset using a range of columns:

station_data |> select(tmax:sun) # select columns between tmax and sun (inclusive)

and we can also select a subset by specifying which columns not to use:

station_data |> select(!tmax:sun) # select columns except those between tmax and sun (inclusive)

Finally, we can also select columns by their type using where() (documentation). For example, to select only variables that are numeric, we can use the is.numeric() function (documentation):

station_data |> select(where(is.numeric)) # select only numeric variables

using rename() to rename columns#

Often, we may also want to rename variables to make them easier to read and understand. For example, the yyyy, mm, and af variables in our table are not necessarily the easiest to understand. We can rename them to clearer names, such as year, month, and air_frost, using the rename() function (documentation):

station_data <- station_data |> rename(year = yyyy, month = mm, air_frost = af) # rename yyyy to year, mm to month, and af to air_frost

print(station_data)

using relocate() to move columns#

With mutate(), we saw how we can specify where to put new variables, using the .before and .after arguments. If we aren’t creating new variables, we can still re-arrange variables using relocate() (documentation), which works in much the same way. We can specify which column to move a variable .before or .after; like with select(), we can also move a range or selection of columns at once. In the cell below, we’re going to first move date so that it is the first column (before year); then, we move season so that it comes after month:

station_data |>
    relocate(date, .before = year) |>  # move date to before year
    relocate(season, .after = month) -> # move season to be after month
station_data # use the -> assignment operator to assign the output to station_data

print(station_data)

In the cell above, note that we have used -> (the right-hand assignment operator) to assign the output of the second relocate() call to the object station_data. Unlike the assignment operator we have used so far (<-, the left-hand assignment operator), -> takes the value of the expression on its left-hand side and assigns it to the object named on its right-hand side.

Normally, we tend to use <-, but sometimes, especially with long “sentences” with multiple function calls, it can make sense to use -> at the end, rather than the beginning - the end result will be the same.
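
As a minimal illustration (using a made-up object name, mean_tmax), these two lines do exactly the same thing:

mean_tmax <- station_data$tmax |> mean(na.rm = TRUE) # assign using the left-hand operator
station_data$tmax |> mean(na.rm = TRUE) -> mean_tmax # assign using the right-hand operator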

saving data to a file#

Finally, let’s save our cleaned, re-arranged dataset to a file, using write_csv() (documentation). In the simplest case, write_csv() takes two arguments: first, the data table to be written to disk, and second, the filename to write the data to. We’ll save our file to the 'data' folder, with a filename of combined_stations.csv:

write_csv(station_data, file.path('data', 'combined_stations.csv')) # write station_data to a file in the data folder

Now, we’ll be able to load this file when we want to do further analysis, rather than needing to re-run the steps to load each file, combine the tables, create new variables, and so on. We’re continuing to use a comma-separated values (.csv) file format, though there are a number of different format options available - for more information, check the documentation.
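
For example (a sketch of just one alternative), readr also provides write_tsv() for writing tab-separated files:

write_tsv(station_data, file.path('data', 'combined_stations.tsv')) # write a tab-separated version of the table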

grouping data#

Next, we’ll see how we can use different tools to aggregate and summarize our data, starting with group_by() (documentation). To start, we’ll group the data by station:

station_data |> group_by(station) # group the data by station

This looks largely the same as the previous output, with one important distinction: this is now a grouped_df, rather than a spec_tbl_df - this means that when we call the summarize() (documentation) function on the output, the summary is calculated based on each group, rather than all values of the variable. For example, if we want to calculate the mean of tmax for each station:

station_data |>
    group_by(station) |> # group the data by station
    summarize(
        tmax = mean(tmax, na.rm = TRUE) # calculate the mean of tmax, ignoring NA values
    )

We can also group based on multiple variables - for example, by both station and season:

station_data |>
    group_by(station, season) |> # group the data by station, then season
    summarize(
        tmax = mean(tmax, na.rm = TRUE), # calculate the mean of tmax, ignoring NA values
        rain = mean(rain, na.rm = TRUE)  # calculate the mean of rain, ignoring NA values
    )

Now, let’s combine this with what we learned in the previous lesson (the plotting exercise) to create a plot that shows the distribution of rainfall by season, separated by station.

First, we want to create a plot that shows the density distribution of rainfall for each season, using facet_wrap() to create a single panel for each station:

ggplot(data=station_data, mapping=aes(x=rain)) + # create a plot with rain on the x-axis
    geom_density(mapping=aes(color=season, fill=season), alpha=0.4, linewidth=1) + # add a density plot colored by season, with transparency of 0.4 and lines of width 1
    facet_wrap(~station) -> # create one panel for each station
rain_plot # assign the plot to a variable

rain_plot

Next, we can use group_by() and summarize() to calculate the mean rainfall for each station, and assign this to a new object, mean_values:

mean_values <- station_data |>
    group_by(station) |> # group by station value
    summarize(rain = mean(rain, na.rm = TRUE)) # calculate the mean of rain, ignoring NA values

Now, to add a vertical line to our plot, we use geom_vline() (documentation), along with mean_values, to place a vertical line in each panel where the mean rainfall value is:

rain_plot <- rain_plot +
    geom_vline(data = mean_values, mapping = aes(xintercept = rain), linewidth = 1, linetype = 'dashed') # add dashed vertical lines at the mean rainfall value

rain_plot

In the next cell, write some lines of code to change the axis labels and increase the font size for the tick labels, axis labels, and panel labels.

# your code goes here!

Now that you have finished the plot, be sure to save it to a file:

ggsave('seasonal_rain_distribution.png', plot=rain_plot) # save the plot to a file

slicing the dataset#

We’ll finish up by looking at a few functions that we can use to slice a dataset - that is, extract specific rows from a group. For example, we can combine group_by() and slice_max() (documentation) to find the maximum monthly temperature from each season:

station_data |>
    group_by(season) |>
    slice_max(tmax, n=1) # take the top n rows based on the value of tmax

This lets us quickly see the observations corresponding to the highest temperature in each season - split between Southampton for autumn and spring, and Oxford for summer and winter. If we want to select the minimum, we can use slice_min():

station_data |>
    group_by(season) |>
    slice_min(tmax, n=1) # take the bottom n rows based on the value of tmax

Here, you can also see that, by default, slice_min() (and slice_max()) keeps tied values - so we end up with 6 rows instead of 4. If we want to discard ties, we can set the with_ties argument to FALSE.
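
For example (a sketch):

station_data |>
    group_by(season) |>
    slice_min(tmax, n=1, with_ties = FALSE) # take a single row per season, discarding ties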

If we only want the first or last row from a group, regardless of the value, we can use slice_head() to select the first n rows, and slice_tail() to select the last n rows:

station_data |>
    group_by(season) |>
    slice_head(n=1) # take the first n rows
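
And slice_tail() works in exactly the same way, selecting from the end of each group instead:

station_data |>
    group_by(season) |>
    slice_tail(n=1) # take the last n rows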

Finally, we can select a random sample from each group using slice_sample():

station_data |>
    group_by(season) |>
    slice_sample(n=5) # take a random sample of 5 rows from each season

using ungroup() to remove groups#

Sometimes, we want to group data using group_by() to do some kind of calculations, but then we want to get rid of the groups - for example, if we want to find the top 5 warmest average months, regardless of what station they are measured at.

To do this, we can use ungroup() (documentation) to remove the grouping:

station_data |>
    group_by(station, month) |> # group by station, then month
    summarize(tmax = mean(tmax, na.rm = TRUE)) |> # calculate the average of tmax
    ungroup() |> # use ungroup to remove the grouping left on the output of summarize
    slice_max(tmax, n = 5) # select the 5 rows with the highest value of tmax

Note that without this, we would end up with 20 rows - 5 for each group. This is because summarize() drops only the last level of grouping (here, month), leaving the output grouped by station, so slice_max() selects 5 rows within each of the 4 station groups:

station_data |>
    group_by(station, month) |> # group by station, then month
    summarize(tmax = mean(tmax, na.rm = TRUE)) |> # calculate the average of tmax
    slice_max(tmax, n = 5) # select the 5 rows with the highest value of tmax

exercise and next steps#

That’s all for this exercise. To practice your skills, create a notebook file that does the following:

  • loads the libraries that you need

  • loads the saved data file (combined_stations.csv)

  • helps you answer the following questions:

    • what station has the highest recorded rainfall in the past 20 years, and on what date?

    • what season has the lowest average rainfall? is it the same season for all four stations?

    • what station has recorded the most months with tmin < 1°C? are all these observations from a single season?

    • what is the median rainfall in months where tmax is greater than 20°C? make sure that your result is a number, not a tibble!

    • what year saw the most total rainfall, using data from all four stations?

    • what are the top 5 driest years, using only data from stations in Britain?

    • what is the lowest annually-averaged monthly minimum temperature in the dataset, as measured by a single station?

    • what is the sunniest month, on average, in Armagh?

      • bonus: write a line that will rename the months from the number to a 3-letter abbreviation (hint)

For a bonus, try downloading at least one additional dataset from the Met Office, saving it to the data folder. Next, open a Terminal and enter the following:

python convert_metoffice.py data/{station}

remembering to replace {station} with the name of the file that you just downloaded (e.g., durhamdata.txt). This will convert the .txt file into a .csv file, using the steps outlined at the top of the exercise.