vector data using shapely and geopandas#

In this practical, you’ll gain some more experience working with vector data in python. You will learn about the different vector data types available in the shapely package, and how we can use the geopandas package to perform different vector data operations and analyses.

The practical this week is provided as a Jupyter Notebook, where you can interactively work through the different steps of plotting the data. There is a second file, exercise_script.py, which you can modify to perform additional analyses, based on what you’ve learned in the Jupyter Notebook and the mapping exercise in Practical 2.

Note

In the main folder, you should also see an example script, week3_example.py. Be sure to try the exercise and see whether you can work out a solution on your own before having a look at the script, which offers one (of many) possible solutions.

getting started#

Last week, we saw how we can use GitHub Desktop to merge two branches (in this case, week2 into main). This week, we’re going to see how to do this using the command line.

To get started with this week’s practical, open Anaconda Navigator, then launch the Command Prompt - either from Anaconda Navigator (make sure that your egm722 environment is selected), or from the Start Menu.

When the Command Prompt opens, navigate to your repository folder using cd, then type dir and press Enter. You should see something similar to the following:

the dir of the main branch


Switch to the week3 branch by typing:

git checkout week3

and pressing Enter.

Note

If you see some version of the following:

hint: If you meant to check out a remote tracking branch on, e.g., 'origin',
hint: you can do so by fully qualifying the name with the --track option:
hint:
hint:     git checkout --track origin/<name>
hint:
hint: If you'd like to always have checkouts of an ambiguous <name> prefer
hint: one remote, e.g. the 'origin' remote, consider setting
hint: checkout.defaultRemote=origin in your config

What this is telling you is that you either have to explicitly specify which remote branch you want to check out (e.g., origin or upstream), or you should set your defaultRemote option using git config:

git config --global checkout.defaultRemote origin

Next, type dir and press Enter again. You should now see this:

the dir of the week3 branch


To merge the week3 branch of our repository into main, we’ll use git from the command line.

Remember that at the start of last week’s practical, we discussed the difference between local, origin, and upstream branches:

  • local branches are the ones stored locally on your computer,

  • origin branches are the branches of your repository stored on GitHub,

  • upstream branches are the branches of the repository that you forked the egm722 repository from (iamdonovan/egm722).

Sometimes, there may be changes to the upstream repository that we want to integrate into our local version of a repository. For example, for this module I may have added an additional exercise to the practical in one week, and you want to make sure that you have this before you merge that week’s branch into the main branch.

To be able to keep track of the upstream changes, we need to make sure that our local repository knows where the upstream repository is. First, double-check what remote (not local) repositories git is keeping track of by typing the following at the command line:

git remote -v

This will list the remote repositories, and their nicknames. You should see an output like this:

the remote repositories for this repository


Note

If you only see origin, then we need to add the upstream remote location using git remote add:

git remote add upstream https://github.com/iamdonovan/egm722.git

This adds the URL for the upstream repository (iamdonovan/egm722.git) to our local configuration. You can check that this worked by typing git remote -v again - you should now see two lines for the upstream repository, as above.

Now, we can tell git to specifically pull the upstream version of a particular branch:

git pull upstream <branch>

This will pull (fetch and merge) the upstream version of <branch> (if it exists) into the local version of the current branch.

For example, git pull upstream week3 would merge the upstream week3 branch into our current branch (week3). Go ahead and enter this command now:

git pull upstream week3

You should see the following output:

pulling the upstream changes into the current branch


This indicates that there’s been no change to the upstream branch that isn’t already in our origin branch, so we can safely merge the local main and week3 branches.

Now, switch back to the main branch:

git checkout main

And enter the following command:

git merge week3

You should now see the following output in the window:

the updates displayed after merging week3 into main


This tells us which files have been changed (18 files), and how (854 insertions(+)). Because none of the files in the Week3 folder were present in the main branch, we only see additions/insertions here. As you work on your project and commit changes to existing files, you’ll also see deletions (lines that are deleted or changed).

You should also see Weeks 1–3 in your repository folder:

the repository folder after merging week3 into main


The last thing to do now is to push these changes to your GitHub repository:

git push

You can confirm that the changes are now on your remote repository by heading over to GitHub:

the github repository, with the merged changes updated


At this point, you can launch Jupyter Notebooks as you have in the previous weeks, and begin to work through the practical exercise.

Note

Below this point is the non-interactive text of the notebook. To actually run the notebook, you’ll need to follow the instructions above to open the notebook and run it on your own computer!


vector data using shapely and geopandas#

overview#

Up to now, you have gained some experience working with basic features of python, and used cartopy and matplotlib to create a map. In each of these practicals, you have loaded different vector datasets using a module called geopandas. In this week’s practical, we’ll be working with vector data in a bit more depth, including the different geometry types available using shapely, analyses like spatial joins and summarizing based on attributes, and how to reproject vector data from one coordinate reference system to another.

objectives#

  • Gain experience working with different vector data types using shapely

  • Use geopandas to re-project vector datasets from one coordinate reference system to another

  • Summarize features using the groupby method of a GeoDataFrame

  • Learn how to perform different vector data operations using geopandas and shapely

data provided#

In the data_files folder, you should have the following:

  • NI_roads.shp, a shapefile of roads in Northern Ireland

  • Counties.shp, a shapefile of county outlines for Northern Ireland

  • NI_Wards.shp, a shapefile of electoral wards for Northern Ireland

getting started#

In this practical, we’ll be working with vector data. As a quick refresher, the three main types of vector data that we will work with are:

  • Point: point data represent a single point in space. For our purposes, points are either two-dimensional (x, y) or three-dimensional (x, y, z). In shapely, the corresponding class of data is a Point.

  • Line: lines are a sequence of at least two points that are joined together. In shapely, the corresponding class of data is known as a LineString.

  • Polygon: polygons are a sequence of at least three points that are connected to form a ring, as well as any additional rings that represent holes in the polygon. In shapely, the corresponding class of data is a Polygon.

We can also have Collections of vector data, where each feature represents a collection of Point, Line, or Polygon objects. In shapely, these are represented as MultiPoint, MultiLineString, or MultiPolygon objects.
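For example, here is a minimal sketch (using made-up coordinates) of how these Multi* geometries can be constructed from lists of coordinate sequences or existing geometry objects:

```python
from shapely.geometry import MultiPoint, MultiLineString

# a MultiPoint can be built from a list of coordinate tuples (or Point objects)
mpt = MultiPoint([(-6.677, 55.150), (-6.658, 55.213)])

# a MultiLineString is built from a list of coordinate sequences (or LineString objects)
mls = MultiLineString([[(0, 0), (1, 1)], [(1, 0), (0, 1)]])

print(len(mpt.geoms)) # the individual geometries are accessed via the .geoms attribute
print(mls.geom_type) # shows the geometry type, 'MultiLineString'
```

The individual Point or LineString objects that make up the collection are accessed through the .geoms attribute.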

To get started, run the following cell to import geopandas and shapely.

# this lets us use the figures interactively
%matplotlib inline

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, LineString, Polygon

shapely geometry types#

Points#

As we saw in Week 1, to create a Point, we pass x, y (and optionally, z) coordinates to the Point class constructor (documentation):

pt = Point(-6.677, 55.150) # creates a 2d point with coordinates -6.677, 55.150
pt2 = Point(-6.658, 55.213) # creates a 2d point with coordinates -6.658, 55.213

pt3d = Point(86.925278, 27.988056, 8848.86) # creates a 3d point

print(pt) # print a well-known text (WKT) representation of the Point object

The last line, print(pt), prints a well-known text (WKT) representation of the Point object. WKT is a standard representation of vector geometry objects - most python libraries and GIS software packages are able to read and/or translate WKT into other formats, such as ESRI Shapefiles, GeoJSON, etc.

Remember that in python, we can find the attributes and methods for an object by looking up the documentation (for shapely, this can be found here), or using the built-in function dir(). To find out more about a particular function, we can use the built-in function help() (or, in jupyter notebooks/ipython, the ? operator).

As an example, let’s use the built-in function dir() to look at the methods and attributes associated with the Point class:

dir(pt) # show the attributes and methods associated with the pt object

Here, in addition to the special, or magic, methods (denoted with two underscores, __, at the beginning and end of the method name), there are a number of methods that we might find useful, including .distance().

To see what this method does, we can use help(Point.distance):

help(pt.distance)

So, .distance() provides the distance from the Point object to some other geometry. Because shapely does not directly deal with coordinate systems, this distance is unitless. This means that we have to make sure that the two objects have the same reference system - if we do not, the distance returned will not make sense. Don’t worry, we will cover working with coordinate reference systems later on in this exercise.
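For example, using the two Point objects defined above (with Point re-imported here so that the cell stands on its own):

```python
from shapely.geometry import Point

pt = Point(-6.677, 55.150)
pt2 = Point(-6.658, 55.213)

dist = pt.distance(pt2) # the planar (unitless) distance between the two points
print(dist)
```

Because these coordinates are latitude/longitude values, the number printed is in degrees - a good illustration of why we need to pay attention to the reference system before interpreting distances.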

Use the cell below, along with the output of dir(pt) above, to work out how we can access the x, y coordinates of a Point object. Can you see more than one way to do this? If so, are there differences between them?

# write your method to access the x,y coordinates of pt here

One of the common operations we might want to do with a Point object is to create a buffer around the point. In the list of associated methods and attributes of Point objects above, you should see there is a method called .buffer().

Taking a look at the help for this method:

help(pt.buffer) # show the help for pt.buffer

shows that buffer takes a positional parameter of distance, as well as a number of keyword parameters that determine how the buffer operation is done. Remember that the buffer distance will be in the same coordinate system as our point - shapely does not, by itself, do any conversion between coordinate systems or units.

Note that the object returned by buffer is a Polygon, rather than a point - this makes sense, as the buffer is a two-dimensional surface around the point location:

pt_buffer = pt.buffer(0.001) # buffer the point by 0.001 in the same coordinates
print(type(pt_buffer)) # show the type of the buffer

LineStrings#

Instead of using a single x, y coordinate pair, a LineString object (documentation) takes either a list of Point objects, or a list of coordinate tuples:

line1 = LineString([pt, pt2]) # method one of creating a LineString, using a list of Point objects
line2 = LineString([(-6.677, 55.150), (-6.658, 55.213)]) # method two, using a list of coordinate tuples

print(line1) # show the first line
print(line2) # show the second line

As we can see from the output above, these two LineStrings have the same coordinates. We can also use the .equals() method to check that the two objects are the same geometry:

line1.equals(line2) # check to see if these are the same geometry

The coordinates of a LineString are stored as a tuple in an attribute called xy. The tuple has two items, representing the x and y coordinate values. If we want the x and y coordinates as separate variables, we can access them using their respective indices:

x = line1.xy[0]
y = line1.xy[1]

We can also combine this using tuple assignment, or unpacking, which assigns values from a tuple on the right-hand side of the assignment to a comma-separated grouping of variables on the left-hand side:

x, y = line1.xy

print(x)
print(y)

LineString objects have a number of the same methods that Point objects do, including .buffer() and .distance().

LineString objects also have a .length attribute (just like with .distance(), it is unitless):

print(line1.length)

LineString objects have a .centroid attribute, corresponding to the midpoint of the LineString:

center = line1.centroid # get the midpoint of the line
print(center)

The last two methods of LineString objects that we will explore for now are .project() and .interpolate():

help(line1.project)

So .project() returns the distance along the LineString that comes closest to the Point (or other geometry object).

.interpolate(), on the other hand, does something a bit different:

help(line1.interpolate)

It returns the point along the line at a specified distance; the distance can be in the units of the LineString’s coordinates (normalized=False), or as a fraction of the total length of the LineString (normalized=True).

line1.project(center) / line1.length # check to see how far along the line our centerpoint is

print(center) # print the WKT representation of the center point
print(line1.interpolate(0.5, normalized=True)) # print the WKT representation of the point 50% along the line

Polygons#

The last basic geometry type we will look at in this practical are Polygon objects. Similar to LineString objects, we can construct a Polygon object (documentation) using a list of coordinate pairs, or a list of Point objects:

poly1 = Polygon([(-6.677, 55.150), (-6.658, 55.213), (-6.722, 55.189)])
poly2 = Polygon([pt, pt2, Point(-6.722, 55.189)])

print(poly1) # print a wkt representation of the polygon
print(poly2)

and, just like we saw with LineString objects, we can use .equals() to check that these two geometries are the same:

poly1.equals(poly2)

Note that even though we only passed three Point objects (or coordinate pairs) to the Polygon constructor, the Polygon has four vertices, with the first and last vertex being the same - this is because the Polygon exterior is closed.
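We can confirm this by counting the coordinates of the exterior of poly1 (re-created here so that the cell stands on its own):

```python
from shapely.geometry import Polygon

poly1 = Polygon([(-6.677, 55.150), (-6.658, 55.213), (-6.722, 55.189)])

coords = list(poly1.exterior.coords) # the coordinates of the exterior ring
print(len(coords)) # 4 vertices, even though we only passed 3 coordinate pairs
print(coords[0] == coords[-1]) # True - the first and last vertices are the same
```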

Note also the double parentheses - this is because a Polygon potentially has two sets of coordinates - the shell, or exterior, and the holes, or interiors. To create a Polygon with a hole in it, we can pass a list of coordinates that describes the shell, and a second list that describes the holes:

polygon_with_hole = Polygon(shell=[(-6.677, 55.150), (-6.658, 55.213), (-6.722, 55.189)],
                            holes=[[(-6.684, 55.168), (-6.704, 55.187), (-6.672, 55.196)]]) # note the double brackets

print(polygon_with_hole)

Note the double brackets in the holes keyword argument:

holes=[[(-6.684, 55.168), (-6.704, 55.187), (-6.672, 55.196)]]

This is necessary, because holes is expecting a sequence of coordinate sequences - effectively, a list of Polygon shells.
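As a sketch (with made-up coordinates), a Polygon can have more than one hole - each additional coordinate sequence in holes adds another interior ring:

```python
from shapely.geometry import Polygon

# a square shell with two small triangular holes cut out of it
two_holes = Polygon(shell=[(0, 0), (10, 0), (10, 10), (0, 10)],
                    holes=[[(1, 1), (2, 1), (2, 2)],
                           [(5, 5), (6, 5), (6, 6)]])

print(len(two_holes.interiors)) # there are two interior rings
print(two_holes.area) # the area of each hole is subtracted from the shell's area
```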

Accessing the coordinates of a Polygon object is a little more complicated than it is for Point and LineString objects - this is because Polygon objects have two sets of coordinates, the .exterior (shell) and .interiors (holes).

But, the .exterior attribute of the Polygon is just a LinearRing (a special case of LineString where the first and last coordinates are the same), and the .interiors attribute is an InteriorRingSequence (basically, a collection of LinearRings that have to obey additional rules):

print(polygon_with_hole.exterior) # this is a single LinearRing
for lr in polygon_with_hole.interiors: # this is potentially multiple LinearRing objects
    print(lr)

Polygon objects have non-zero .area and non-zero .length (perimeter) attributes - as with the equivalent attributes for Point and LineString objects, these are unitless.

Polygon objects also have a .centroid (center), and we can bound the geometry using either the minimum bounding box parallel to the coordinate axes (the .envelope attribute), or a rotated minimum bounding box (the .minimum_rotated_rectangle attribute):

print('perimeter: ', poly1.length) # print the perimeter
print('area: ', poly1.area) # print the area
print('centroid: ', poly1.centroid) # get the centerpoint of the rectangle
print('bounding coordinates: ', poly1.bounds) # get the minimum x, minimum y, maximum x, maximum y coordinates
print('bounding box: ', poly1.envelope) # get the minimum bounding rectangle of the polygon, parallel to the coordinate axes
print('rotated bounding box: ', poly1.minimum_rotated_rectangle) # get the smallest possible rectangle that covers the polygon

There are a number of additional methods that we will cover more as we continue through the practicals - for now, this should be enough to give an idea for how these geometry objects work.

interactions between geometry objects#

shapely also provides a number of methods that we can use to check the spatial relationship between different objects. For example, the following code shows how we can use the .contains() method (documentation) of a shapely geometry object to see whether another geometry object is located fully within the object:

poly = Polygon([(0, 0), (2, 0), (2, 3), (0, 3)])
pt1 = Point(0, -0.1)
pt2 = Point(1, 1)

print(poly.contains(pt1)) # should return False, because pt1 is not within the polygon
print(poly.contains(pt2)) # should return True, because pt2 is within the polygon

We can also check to see whether two geometry objects intersect each other using the .intersects() method (documentation):

line1 = LineString([(0, 0), (1, 1)])
line2 = LineString([(0, 1), (1, 0)])

print(line1.intersects(line2)) # intersects() returns True if the geometries touch/intersect/overlap, False otherwise

To actually get the intersection of the two geometries, we use the .intersection() method, which returns the geometry of the intersection (whether this is a Point, a LineString, a Polygon, or a mixed collection of geometries depends on the geometries and how they intersect):

line1 = LineString([(0, 0), (1, 1)])
line2 = LineString([(0, 1), (1, 0)])
poly = Polygon([(0, 0), (2, 0), (2, 3), (0, 3)])

print(line1.intersection(line2)) # if the lines intersect, this will be the Point(s) of intersection
print(line1.intersection(poly)) # if the line intersects a polygon, the result may be a line or a point

There are a number of other methods provided by shapely that we can use to determine the relationship between geometry objects, including touches, within, and overlaps. Be sure to have a look at the full list from the shapely user manual to see the rest.
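As a brief illustration of three of these predicates, a sketch using the same rectangle as the examples above:

```python
from shapely.geometry import Point, Polygon

poly = Polygon([(0, 0), (2, 0), (2, 3), (0, 3)])
rect = Polygon([(1, 1), (3, 1), (3, 2), (1, 2)])

print(Point(1, 1).within(poly)) # True - .within() is the reverse of .contains()
print(Point(2, 1).touches(poly)) # True - the point lies on the polygon's boundary
print(rect.overlaps(poly)) # True - the two polygons partially overlap
```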

geopandas GeoDataFrames#

We have used geopandas in the previous two practicals to read provided shapefiles and work with the data they contain - in Practical 1, we translated a comma-separated values (CSV) file into a shapefile, and in Practical 2, we read shapefile data and plotted it on a map using cartopy.

This week, we will extend this introduction to look at how we can use geopandas to do various GIS analyses, such as spatial joins and clipping operations, as well as projecting from one coordinate reference system to another.

To begin, we’ll load the NI_roads dataset from the data_files folder and use .head() (documentation) to show the first 5 rows of the GeoDataFrame:

roads = gpd.read_file('data_files/NI_roads.shp')

roads.head() # show the first five rows of the table

So this dataset has three columns: SURVEY, Road_class, and geometry.

Note that each of the geometries is a LineString object, which means that we are working with line geometries. Hopefully, given that the data are supposed to represent roads, this makes sense.

coordinate reference systems using PROJ#

To start with, let’s see if we can figure out how many kilometers of motorway are represented in the dataset - i.e., the sum of the length of all of the LineString objects that have the attribute MOTORWAY.

First, though, let’s check what the coordinate reference system (CRS) of our GeoDataFrame is, using the crs attribute:

roads.crs

So this dataset has a Geographic coordinate reference system, EPSG:4326. EPSG codes (originally organized by the European Petroleum Survey Group) are a common way of working with coordinate reference systems. Each CRS in the EPSG registry has a unique code and standard well-known text representation.

The crs attribute of the GeoDataFrame is actually a pyproj.CRS object (documentation). pyproj is a python interface to the PROJ library, which is a software for transforming geospatial coordinates from one CRS to another.

Each pyproj.CRS object provides a number of methods for converting to different formats, including well-known text, EPSG codes, JavaScript Object Notation (JSON), and PROJ string (i.e., '+proj=longlat +datum=WGS84 +no_defs +type=crs').

For example, to see the JSON representation of the CRS, we would use the .to_json() method (documentation):

roads.crs.to_json() # show the representation of the CRS in JSON format
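As an aside (not needed for the practical), we can also create pyproj.CRS objects directly from an EPSG code and convert between representations:

```python
from pyproj import CRS

crs = CRS.from_epsg(4326) # create a CRS object directly from an EPSG code

print(crs.to_epsg()) # recover the EPSG code from the CRS object
print(crs.to_wkt()[:30]) # the start of the well-known text representation
```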

Because this is a Geographic CRS, the length information provided by .length will also be in geographic units, which doesn’t really make sense for us. This means that we first have to convert the GeoDataFrame to a projected CRS.

To do this, we can use the method to_crs() (documentation):

help(roads.to_crs) # show the help for the .to_crs() method

So, to transform the GeoDataFrame to a different CRS, we have to provide either a CRS object or an EPSG code. We can also choose to do this in place (inplace=True), or assign the output to a new GeoDataFrame object (inplace=False, the default). Let’s transform the GeoDataFrame to the Irish Transverse Mercator CRS, and assign the output to a new object called roads_itm.

Rather than trying to find the correct JSON or PROJ representation of this CRS, we can instead use the EPSG code, which can be easier to work with.

Using the search function on the EPSG registry, or using an internet search, look up the EPSG code for the Irish Transverse Mercator CRS and enter it in the method call below:

roads_itm = roads.to_crs(epsg=XX) # replace XX with the correct EPSG code for Irish Transverse Mercator

roads_itm.head()

Note that only the geometry column has changed - instead of geographic coordinates (e.g., (-6.21243, 54.48706)), the points in each LineString should be in a projected CRS (e.g., (715821.764, 861315.722)). Now, when we access the .length attributes of each LineString object, the units will be in the same units as our CRS (meters).

summarizing data using geopandas#

So that’s the first part of our problem solved - our coordinates are in meters, and the lengths will be as well. The next step is to select all of the features that correspond to Motorways and sum the lengths. We saw an example of this in Practical 1 - we can slice the GeoDataFrame with a conditional statement ('Road_class' == 'MOTORWAY') to select only those rows where the road type is MOTORWAY:

roads_itm[roads_itm['Road_class'] == 'MOTORWAY']

But first, we might want to add a column to our GeoDataFrame that contains the .length of each of the features. One way to do this is to iterate over the rows of the GeoDataFrame using the .iterrows() method (documentation):

help(roads_itm.iterrows)

Because .iterrows() returns an iterator of (index, Series) pairs, we can use tuple assignment in our for loop definition:

for ind, row in roads_itm.iterrows():

This gives us two variables, ind and row, which we can use inside the body of the for loop:

  • ind corresponds to the index of the row

  • row corresponds to the Series, the actual data contained in the row

We can access the value stored in each “column” of the row in the same way that we do for the full GeoDataFrame - either row[column] or row.column.

Finally, we can assign a new column in the original GeoDataFrame using the .loc property, which uses either a label (for example, ind), or a Boolean array to index the GeoDataFrame.

So the line below,

roads_itm.loc[ind, 'Length'] = row['geometry'].length

assigns the length property of the row’s geometry to a new column, Length, at the index. Putting it all together, our loop looks like this:

for ind, row in roads_itm.iterrows(): # iterate over each row in the GeoDataFrame
    roads_itm.loc[ind, 'Length'] = row['geometry'].length # assign the row's geometry length to a new column, Length

roads_itm.head() # show the updated GeoDataFrame to see the changes
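As an aside, geopandas also provides a vectorized way to do this, without writing a loop: accessing .length on the geometry column returns the length of every feature at once. A minimal sketch, using a small made-up GeoDataFrame rather than the roads data:

```python
import geopandas as gpd
from shapely.geometry import LineString

# a small synthetic GeoDataFrame to illustrate (made-up geometries)
gdf = gpd.GeoDataFrame({'Road_class': ['A', 'B']},
                       geometry=[LineString([(0, 0), (3, 4)]),
                                 LineString([(0, 0), (1, 0)])])

# the .length attribute of the geometry column returns the length of every
# feature at once, which we can assign to a new column in a single step
gdf['Length'] = gdf.geometry.length

print(gdf['Length'].tolist()) # [5.0, 1.0]
```

Vectorized operations like this are typically much faster than .iterrows() on large tables, though the loop above makes each step explicit.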

Finally, we can subset our GeoDataFrame to select only MOTORWAY features, and sum their length using the .sum() method (documentation):

sum_roads = roads_itm['Length'].sum()
sum_motorway = roads_itm[roads_itm['Road_class'] == 'MOTORWAY']['Length'].sum()
print(f'{sum_roads:.2f} total m of roads')
print(f'{sum_motorway:.2f} total m of motorway')

In the cell above, look at the print function argument:

print(f'{sum_motorway:.2f} total m of motorway')

Here, we are using a “formatted string literal” (f-string) to insert the value of an object, sum_motorway, into our print() statement. We saw this in the very first exercise in Week 1, but there’s something added here: the format specification, :.2f. Rather than printing the value in an unformatted way (which would include a lot of extra decimal places), we can tell python how to format the output using : followed by a format specification. In this case, .2f tells python to format the value as a float (f), with 2 places after the decimal.
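A few more examples of format specifications, to illustrate:

```python
value = 1234.56789

print(f'{value:.2f}') # 2 places after the decimal: 1234.57
print(f'{value:.0f}') # rounded to a whole number: 1235
print(f'{value:,.2f}') # a thousands separator plus 2 decimal places: 1,234.57
```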

Let’s say now that we want to find the sum of all of the different road classes in our dataset. We could, of course, repeat the exercise above for each of the different values of Road_class. But, pandas (and by extension, geopandas) provides a nicer way to summarize data based on certain properties: the .groupby() method (documentation).

.groupby() returns an object (a DataFrameGroupBy object) that is similar to a DataFrame, but that contains information about how the data in the table is grouped; to see different properties of those groups, we can use methods like .mean(), .median(), .sum(), etc., exactly like we can on a DataFrame, GeoDataFrame, or Series object.

If we want to summarize our dataset by Road_class and use .sum() to find the total length of each type of roadway, then, it would look like this:

roads_itm.groupby(['Road_class'])['Length'].sum() / 1000 # convert to km

.groupby() returns a DataFrameGroupBy object, which we can index to select a single column, Length; calling .sum() on this gives the total length for each group. As this is a numeric column, we can also use arithmetic on it to divide by a conversion factor, converting the lengths from meters to kilometers. The .groupby() method is a very useful way to quickly summarize a DataFrame (or a GeoDataFrame - remember that this is a child class of DataFrame).

spatial data operations using geopandas and shapely#

Oftentimes in GIS analysis, we want to summarize our data spatially, as well as thematically. In this section, we will be looking at two examples of this kind of analysis: first, using a spatial join, and second, using a clipping operation.

Remember that the shapely geometry objects in the GeoDataFrame don’t have any inherent information about the CRS of the object. This means that in order to perform operations like a spatial join, we have to first ensure that the two GeoDataFrame objects have the same CRS. The cell below will load the Counties shapefile in the data_files folder, and test whether the CRS of the counties GeoDataFrame is the same as the CRS of the roads_itm GeoDataFrame.

If, when you first load the shapefile, the test below returns False, write a line of code that will ensure that the test returns True.

counties = gpd.read_file('data_files/Counties.shp') # load the Counties shapefile
# your line of code might go here.
print(counties.crs == roads_itm.crs) # test if the crs is the same for roads_itm and counties.

Now that the two GeoDataFrame objects have the same CRS, we can proceed with the spatial join using gpd.sjoin() (documentation):

join = gpd.sjoin(counties, roads_itm, how='inner', lsuffix='left', rsuffix='right') # perform the spatial join
join # show the joined table

Now, we can see that our table has additional columns - we have the unnamed index, COUNTY_ID, CountyName, Area_SqKM, OBJECTID, and geometry from the counties GeoDataFrame, along with index_right (the index of each joined feature in roads_itm), SURVEY, Road_class, and Length from the roads_itm GeoDataFrame.

Like we did with roads_itm, we can again summarize our new GeoDataFrame using .groupby(); this time, we’ll use both the CountyName and Road_class properties to see the total length of roads by each county, and by the type of road:

group_county_road = join.groupby(['CountyName', 'Road_class']) # group by county name, then road class

group_county_road['Length'].sum() / 1000 # show the total number of km for each category

From this, we can quickly see that County Antrim has the most motorway of any county in Northern Ireland (93.44 km), while County Tyrone has the most “< 4M Tarred” road surfaces, by nearly a factor of two (2809.43 km vs. 1453.77 km for County Armagh).

One thing to keep in mind is that with a spatial join, any feature in the “right” table that overlaps multiple features in the “left” table will be, in effect, double-counted. We can confirm this by calculating the total length of roads in the joined table and comparing it to the total length of roads in the original dataset:

join_total = join['Length'].sum() # find the total length of roads in the join GeoDataFrame

# check that the total length of roads is the same between both GeoDataFrames
print(f'Total length of roads from original file: {sum_roads:.2f}')
print(f'Total length of roads from spatial join: {join_total:.2f}')
print(f'Absolute difference in road length: {abs(sum_roads - join_total) / 1000:0.2f} km') # calculate the absolute difference in km
print(f'Absolute difference in road length: {(100 * abs(sum_roads - join_total) / sum_roads):0.2f}%') # calculate the absolute difference as a percentage

And indeed, we can see that the total length of roads in the spatial join is ~300 km longer (1.42%) than the total length of roads in the original dataset.

We can also see that we have double-counted features by comparing the total number of road features in the join GeoDataFrame with the number of unique road features, which we can find using a combination of len() (documentation) and .unique() (documentation):

not_unique = len(join.index) - len(join.index_right.unique()) # get the difference between the number of objects in the table and the unique objects in the table

print(f'There are {not_unique} duplicated objects in the joined table.')

Obviously, we don’t want to double-count roads - to get around this, we can use the gpd.clip() function (documentation) to clip the features of roads_itm to each of the county boundaries in the counties GeoDataFrame:

help(gpd.clip)

Note that we have to do this for each county, because gpd.clip() will use the total boundary of the GeoDataFrame if there are multiple Polygon objects.

Using a for loop to loop over the counties GeoDataFrame, then, we can clip roads_itm to each county, and combine the results in another GeoDataFrame:

clipped = [] # initialize an empty list
for county in counties['CountyName'].unique(): # iterate over unique values of county
    tmp_clip = gpd.clip(roads_itm, counties[counties['CountyName'] == county]) # clip the roads by county border
    for ind, row in tmp_clip.iterrows():
        tmp_clip.loc[ind, 'Length'] = row['geometry'].length # remember to update the length for any clipped roads
        tmp_clip.loc[ind, 'CountyName'] = county # set the county name for each road feature
    clipped.append(tmp_clip) # add the clipped GeoDataFrame to the list

Note that this step will likely take some time, as we are iterating over a large number of features.

This creates a list of GeoDataFrame objects - one for each unique value of CountyName. Now, we can use pd.concat() (documentation) to combine these into a single DataFrame, then use gpd.GeoDataFrame() to convert this to a GeoDataFrame.

Note the use of ignore_index=True with pd.concat() - this means that pandas will assign each row in the combined DataFrame with a new index, rather than keeping the original index. Because in this case our index values only correspond to the row number, we don’t need to keep track of this in the new table.
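As a quick illustration of what ignore_index=True does, using a throwaway pandas DataFrame (not the practical data):

```python
import pandas as pd

df1 = pd.DataFrame({'val': [1, 2]}) # index: 0, 1
df2 = pd.DataFrame({'val': [3, 4]}) # index: 0, 1

kept = pd.concat([df1, df2]) # keeps the original indices
renum = pd.concat([df1, df2], ignore_index=True) # assigns a new index

print(list(kept.index)) # [0, 1, 0, 1] - duplicated index values
print(list(renum.index)) # [0, 1, 2, 3] - re-numbered from 0
```

Without ignore_index=True, the combined table would have duplicated index values (one set per county), which can cause confusing behavior when selecting rows later on.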

clipped_gdf = gpd.GeoDataFrame(pd.concat(clipped, ignore_index=True)) # create a geodataframe from the combined county geodataframes

clipped_gdf # show the new, combined geodataframe

Now, we can compare the total length of the clipped roads with the total length of roads from the original dataset:

clip_total = clipped_gdf['Length'].sum() # sum the Length values of the clipped roads to get the total length

print(f'Total length of roads from original file: {sum_roads:.2f} m')
print(f'Total length of roads from clipped join: {clip_total:.2f} m')
print(f'Absolute difference in road length: {abs(sum_roads - clip_total) / 1000:0.2f} km')
print(f'Absolute difference in road length: {(100 * abs(sum_roads - clip_total) / sum_roads):0.2f}%')

So we don’t have perfect overlap. This is because there isn’t perfect overlap between the counties boundary and the roads features: there are a number of places where the roads extend beyond the border of Northern Ireland. One example of this is shown below:

one of the locations where the road shapefile extends beyond the boundaries of NI


To fix this, we could first clip roads_itm to the entire counties GeoDataFrame, which would eliminate these extraneous stretches of road.

For now, though, agreement to within 0.01% is acceptable for our purposes - much better than the 1.42% disagreement from the original spatial join.

To wrap up, write a line or two of code in the cell below that will summarize the clipped_gdf GeoDataFrame by county and road type. Which county has the most Motorways? The most roads in total?
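If you're not sure where to start, here is how .groupby() can summarize a table by two columns at once, shown on a toy DataFrame (the column name Road_class and all values below are made up - check clipped_gdf.columns for the actual column names in your data):

```python
import pandas as pd

# toy stand-in for clipped_gdf, with hypothetical values
toy = pd.DataFrame({'CountyName': ['ANTRIM', 'ANTRIM', 'DOWN'],
                    'Road_class': ['MOTORWAY', 'B_CLASS', 'MOTORWAY'],
                    'Length': [1000.0, 500.0, 2000.0]})

# group by county and road class, then sum the lengths within each group
summary = toy.groupby(['CountyName', 'Road_class'])['Length'].sum()
print(summary)
```

The same pattern (with .count() instead of .sum(), for example) can also tell you how many features fall in each group.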

# your code goes here!

exercise and next steps#

Now that you’ve gained some experience working with shapely geometry objects and geopandas GeoDataFrame objects, have a look at exercise_script.py in this folder.

Using the topics covered in the Week 2 practical and this practical, modify this script to do the following:

  1. Load the counties and ward data.

  2. Using a spatial join, summarize the total population by county. Which county has the highest population? Which has the lowest?

  3. Create a map like the one below to show population information by census area, with the county boundaries plotted on top of the choropleth map.

the sample map to be produced in the exercise

additional exercise questions#

  1. Are there any Wards that are located in more than one county? If so, how many, and what is the total population of these Wards?

  2. What Ward has the highest population? What about the lowest population?

  3. Repeat the exercise above using exercise_script.py, but this time use the population density (in number of residents per square km).
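As a hint for calculating population density: once your data are in a projected CRS with units of metres, the polygon area (in square metres) can be converted to square kilometres by dividing by 10^6. A minimal sketch with a single hypothetical ward polygon and a made-up population value (for a GeoDataFrame, you would use gdf['geometry'].area instead):

```python
from shapely.geometry import Polygon

# a hypothetical 2 km x 3 km ward, with coordinates in metres
ward = Polygon([(0, 0), (2000, 0), (2000, 3000), (0, 3000)])
population = 1500 # made-up population value

area_km2 = ward.area / 1e6 # convert m^2 to km^2
density = population / area_km2 # residents per square km

print(f'{density:.1f} residents per square km')
```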

next steps#

Once you have finished the notebook and the exercise, make sure to send me an e-mail with some ideas for your coding project. They do not have to be completely fleshed out, but you should try to have a general idea of what you would like to work on for the final project – ideally, this will be something related to your work, or a potential MSc thesis topic.