2  Vector layers

2.1 Loading packages

First, we import three packages we will be working with:

  • pandas—for working with tables,
  • shapely—for working with individual geometries, and
  • geopandas—for working with vector layers,

as follows:

import pandas as pd
import shapely
import geopandas as gpd

2.2 What is geopandas?

The geopandas package, an extension of the pandas package, is aimed at working with vector layers. To facilitate the representation of vector geometries and layers, geopandas defines two data structures, which are extensions of the analogous data structures from pandas for non-spatial data:

  • GeoSeries—An extension of a Series for storing a sequence of geometries and a Coordinate Reference System (CRS) definition, where each geometry is a shapely object
  • GeoDataFrame—An extension of a DataFrame, where one of the columns is a GeoSeries

As will be demostrated in a moment (see Section 2.3), a GeoDataFrame, representing a vector layer, is thus a combination (Figure 2.1) of:

  • A GeoSeries, which holds the geometries in a special “geometry” column
  • Zero or more Series, which represent the non-spatial attributes

Figure 2.1: GeoDataFrame structure (source: https://geopandas.org/getting_started/introduction.html)

The geopandas package depends on several other Python packages, most importantly:

  • numpy
  • pandas
  • fiona—Interface to GDAL
  • shapely—Interface to GEOS
  • pyproj—Interface to PROJ

To put it another way, geopandas may be considered a high-level “wrapper” of fiona, shapely, and pyproj, giving access to all (or most) of their functionality through pandas-based data structures and a uniform interface.

2.3 Vector layer from scratch

2.3.1 Overview

In the first example, we create a vector layer from scratch. That way, we will be familiar with the hierarchical structure of the vector layer components:

  • Separate geometries (shapely.geometry.*)
  • Geometry column (GeoSeries)
  • Vector layer (GeoDataFrame)

2.3.2 Separate geometries (shapely.geometry.*)

To create a geometry object (shapely.geometry.*, package shapely), we can pass a list of coordinates, or geometries, to the respective shapely.* function, according to the geometry type. The seven commonly used Simple Feature geometry types (Figure 2.2) are supported.

Figure 2.2: Simple feature geometry types

For example, here is how we create three POINT geometries, named pnt1, pnt2, and pnt3, using function shapely.Point:

pnt1 = shapely.Point([16.9418, 52.4643])
pnt2 = shapely.Point([16.9474, 52.4436])
pnt3 = shapely.Point([16.9308, 52.4437])

Note the type of the resulting objects:

type(pnt1)
shapely.geometry.point.Point

Printing a shapely geometry object in Jupyter Notebook draws a small figure:

pnt1

The print function prints the Well Known Text (WKT) representation of the geometry:

print(pnt1)
POINT (16.9418 52.4643)

2.3.3 Geometry column (GeoSeries)

To create a geometry column (GeoSeries, package geopandas), we pass a list of shapely geometries to the gpd.GeoSeries function. The second (optional) argument of gpd.GeoSeries is a crs, specifying the Coordinate Reference System (CRS). In this case, we specify the WGS84 geographical CRS, using its EPSG code 4326:

geom = gpd.GeoSeries([pnt1, pnt2, pnt3], crs=4326)
geom
0    POINT (16.94180 52.46430)
1    POINT (16.94740 52.44360)
2    POINT (16.93080 52.44370)
dtype: geometry

One of the first things we might want to do with a GeoSeries is to .plot it, to examine what it looks like (Figure 2.3):

geom.plot();

Figure 2.3: GeoSeries with three points of interest in Poznan

Besides the GeoSeries, a vector layer typically contains one or more Series, representing non-spatial columns (Figure 2.1), such as for numeric, text, or boolean data. For example, let’s create a Series called name, with the names of the point locations in geom, using pd.Series:

name = pd.Series([
    'Faculty of Geographical and Geological Sciences', 
    'Hotel ForZa', 
    'HL Hotel Lechicka'
])
name
0    Faculty of Geographical an...
1                      Hotel ForZa
2                HL Hotel Lechicka
dtype: object

2.3.4 Vector layer (GeoDataFrame)

A vector layer (GeoDataFrame, package geopandas) is a table with a geometry column (GeoSeries) along with zero or more “ordinary” columns (Series) (AKA “attributes”). Or, in other words, a vector layer is an extension of an ordinary table (DataFrame, package pandas), where one of the columns is a GeoSeries.

One way to create a GeoDataFrame is to pass a dict of Series and GeoSeries to the gpd.GeoDataFrame function. For example, here we combine name and geom into a GeoDataFrame named poi (“points of interest”):

poi = gpd.GeoDataFrame({'name': name, 'geometry': geom})
poi
name geometry
0 Faculty of Geographical an... POINT (16.94180 52.46430)
1 Hotel ForZa POINT (16.94740 52.44360)
2 HL Hotel Lechicka POINT (16.93080 52.44370)

Figure 2.4 summarizes the above workflow of constructing a vector layer from scratch:

  • Individual geometries—shapely objects (package shapely)
  • Geometry column—GeoSeries objects
  • Vector layer—GeoDataFrame object

Figure 2.4: Creating a GeoDataFrame from scratch

Figure 2.5 shows the hierarchical structure of GeoDataFrame, using another vector layer (two points, London and Paris):

  • The entire vector layer is a GeoDataFrame object
  • The geometric part, i.e., the geometry column, is a GeoSeries object
  • Each cell in the geometry column, i.e., an individual geometry, is a shapely object (package shapely)

Figure 2.5: Structure of a GeoDataFrame
Note

For those familiar with R, Table 2.1 gives a summary of the analogous classes in R and Python for columns, tables, and vector layers.

Table 2.1: Classes in R and Python for columns, tables, and vector layers
Concept R class {package} Python class {package}
Column numeric,character… {base} Series {pandas}
Table data.frame {base} DataFrame {pandas}
Geometry sfg {sf} shapely.geometry.* {shapely}
Geometry column sfc {sf} GeoSeries {geopandas}
Vector layer sf {sf} GeoDataFrame {geopandas}

2.4 Reading from file

We can also import an existing vector layers from a file. In the next example, we import a Shapefile named gis_osm_transport_a_free_1.shp. This is a polygonal layer with the transport-related features around Poznan. The Shapefile is a processed subset of OpenStreetMap data, downloaded from geofabrik.

To import a vector layer from a file, we use the gpd.read_file function:

pol = gpd.read_file('data/osm/gis_osm_transport_a_free_1.shp')
pol
osm_id code fclass name geometry
0 27923283 5656 apron NaN POLYGON ((16.84088 52.4247...
1 28396243 5656 apron NaN POLYGON ((16.96750 52.3274...
2 28396249 5656 apron NaN POLYGON ((16.98029 52.3239...
... ... ... ... ... ...
279 1101813658 5601 railway_station NaN POLYGON ((16.52227 52.3442...
280 1146503680 5641 taxi NaN POLYGON ((16.90591 52.4287...
281 1157127550 5641 taxi NaN POLYGON ((18.07060 51.7482...

282 rows × 5 columns

We can see that the geometry type, in this case, is POLYGON.

2.5 Subsetting by attributes

Subsetting a GeoDataFrame by attributes works the same way as for a DataFrame in pandas. We create a boolean Series, using conditions applied on one or more columns, then pass the Series as an index.

For example, here we subset the OSM transport-related polygons matching a specific 'name':

sel = 'Port Lotniczy Poznań-Ławica im. Henryka Wieniawskiego'
pol = pol[pol['name'] == sel]
pol
osm_id code fclass name geometry
116 342024881 5651 airport Port Lotniczy Poznań-Ławic... POLYGON ((16.80040 52.4249...

2.6 Plotting 1

A GeoDataFrame also has a .plot method (Figure 2.6):

pol.plot();

Figure 2.6: Plot of a GeoDataFrame with one POLYGON (Poznan airport)

Two of the most useful parameters of .plot (Figure 2.7) are:

  • color—Polygon interior fill color (or point/line color)
  • edgecolor—Polygon border color
pol.plot(color='none', edgecolor='black');

Figure 2.7: Setting color and edgecolor in .plot

2.7 Table to point layer

In the next example, we are going to read a text file representing public transport stations (with lon/lat), and convert it to a point layer. The file stops.txt contains the public transport stops in Poznan, downloaded from the GTFS repository for Poznan.

Even though the file extension is .txt, this is actually a CSV file. To read a CSV file, we can use the pd.read_csv function. The result is a DataFrame, where we also drop one of the irrelevant columns ('stop_code'):

stops = pd.read_csv('data/gtfs/stops.txt')
stops = stops.drop(columns='stop_code')
stops
stop_id stop_name stop_lat stop_lon zone_id
0 2186 Żerniki/Skrzyżowanie 52.326840 17.042626 B
1 355 Sucholeska 52.462338 16.868885 A
2 4204 Pawłowicka 52.478103 16.786287 A
... ... ... ... ... ...
2918 3876 Witkowice/ 52.497704 16.529493 C
2919 594 Międzyzdrojska 52.436424 16.808997 A
2920 1190 Koziegłowy/Zakłady Drobiar... 52.441243 16.998186 B

2921 rows × 5 columns

The columns named 'stop_lon' and 'stop_lat' contain the stop coordinates. The two columns, along with the CRS definition (4326 for lon/lat) can be transformed to a GeoSeries object, using the gpd.points_from_xy and the gpd.GeoSeries functions:

geom = gpd.points_from_xy(stops['stop_lon'], stops['stop_lat'], crs=4326)
geom = gpd.GeoSeries(geom)
geom
0       POINT (17.04263 52.32684)
1       POINT (16.86888 52.46234)
2       POINT (16.78629 52.47810)
                  ...            
2918    POINT (16.52949 52.49770)
2919    POINT (16.80900 52.43642)
2920    POINT (16.99819 52.44124)
Length: 2921, dtype: geometry

A GeoSeries can be combined with non-spatial attributes, to create a GeoDataFrame. Again, we use the gpd.GeoDataFrame function, but this time passing an existing DataFrame (data) and a GeoSeries (geometry):

stops = gpd.GeoDataFrame(data=stops, geometry=geom)
stops = stops.drop(columns=['stop_lon', 'stop_lat'])
stops
stop_id stop_name zone_id geometry
0 2186 Żerniki/Skrzyżowanie B POINT (17.04263 52.32684)
1 355 Sucholeska A POINT (16.86888 52.46234)
2 4204 Pawłowicka A POINT (16.78629 52.47810)
... ... ... ... ...
2918 3876 Witkowice/ C POINT (16.52949 52.49770)
2919 594 Międzyzdrojska A POINT (16.80900 52.43642)
2920 1190 Koziegłowy/Zakłady Drobiar... B POINT (16.99819 52.44124)

2921 rows × 4 columns

2.8 Plotting 2

The resulting layer of stops is plotted below, using another argument of .plot (Figure 2.8):

  • markersize—Point size
stops.plot(markersize=1);

Figure 2.8: Setting markersize in .plot

A .plot with symbology can be created using the following parameters:

  • column—A column name
  • legend—Whether to show a legend
  • cmap—Color map, such as from ColorBrewer and a wide variety of other options

For example, here we plot stops points colored according to their 'zone_id' (Figure 2.9):

stops.plot(markersize=1, column='zone_id', legend=True, cmap='Dark2');

Figure 2.9: Setting symbology in .plot

To display more than one layer in the same plot, we need to:

  • store the first plot in a variable (e.g., base), and
  • pass it as the ax argument of any subsequent plot(s) (e.g., ax=base).

For example, here we plot stops and poi together (Figure 2.10):

base = stops.plot(markersize=0.1)
poi.plot(ax=base, color='red');

Figure 2.10: Displaying multiple layers in .plot

Another useful method to interactively examine the layer(s) is .explore. This creates an interactive map, with background for context:

stops.explore(column='zone_id', legend=True, cmap='Dark2')
Make this Notebook Trusted to load map: File -> Trust Notebook