Vector layers (geopandas)#

Last updated: 2024-11-02 15:58:07

Introduction#

In the previous chapter, we worked with individual geometries using the shapely package (Geometries (shapely)). In GIS paractice, however, we usually work with vector layers, rather than individual geometries. Vector layers can be thought of as:

  • a collection of (one or more) geometries,

  • associated with a Coordinate Reference System (CRS), and

  • associated with (zero or more) non-spatial attributes.

In this chapter, we introduce the geopandas package for working with vector layers. First, we are going to see how a vector layer can be created from scratch (see Creating a GeoDataFrame), based on shapely geometries, to understand the relation between shapely and geopandas data structures, namely:

  • GeoSeries—A geometry column, i.e., a collection of shapely geometries with a CRS definition

  • GeoDataFrame—A vector layer, i.e., a table containing a GeoSeries

Then, we go on to cover some of the basic operations and workflows with vector layers using the geopandas package:

In the next chapter, we are going to cover spatial operations with geopandas, such as calculating new geometries (buffers, etc.) and spatial join (see Geometric operations).

What is geopandas?#

geopandas, as the name suggests, is an extension of pandas that enables working with spatial (geographic) data, specifically with vector layers. To facilitate the representation of spatial data, geopandas defines two data structures, which are extensions of the analogous data structures from pandas for non-spatial data:

  • GeoSeries—An extension of a Series (see Creating a Series) for storing a sequence of geometries and a Coordinate Reference System (CRS) definition, whereas each geometry is a shapely object (see Geometries (shapely))

  • GeoDataFrame—An extension of a DataFrame (see Creating a DataFrame), where one of the columns is a GeoSeries

As will be demostrated in a moment (see Creating a GeoDataFrame), a GeoDataFrame, representing a vector layer, is thus a combination (Fig. 48) of:

  • A GeoSeries, which holds the geometries in a special “geometry” column

  • Zero or more Series, which represent the non-spatial attributes

Since geopandas is based on pandas, many of the DataFrame workflows and operators we learned about earlier (see Tables (pandas) and Table reshaping and joins) work exactly the same way with a GeoDataFrame. This is especially true for operations that involve just the non-spatial properties, such as filtering by attributes (see Filtering by attributes).

_images/geodataframe.svg

Fig. 48 GeoDataFrame structure (source: https://geopandas.org/getting_started/introduction.html)#

The geopandas package depends on several other Python packages, most importantly:

To put it another way, geopandas may be considered a “wrapper” of pyogrio, shapely, and pyproj, giving access to all (or most) of their functionality through pandas-based data structures and simpler-to-use methods. Compared to other packages, geopandas may be considered a high-level approach for working with vector layers in Python. A lower-level approach is provided through the fiona package, for example. The osgeo package provides (through its ogr sub-package) another, more low-level, approach for vector layer access.

Since geopandas is simpler to learn and work with than its alternatives, it is the recommended approach for the purposes in this book. For instance, using fiona and osgeo we typically have to process vector features one-by-one inside a for loop, while in geopandas the entire layer is being ‘read’ and processed at once. Nevertheless, in some situations it is necessary to resort to lower-level approaches. For example, processing features one-by-one is essential when the layer is too big to fit in memory.

Vector layer from scratch#

Creating a GeoSeries#

Before starting to work with geopandas, we have to load it. We also load pandas and shapely, which we are already familiar with, since they are needed for some of the examples:

import pandas as pd
import shapely
import geopandas as gpd

As a first step to learn about geopandas, we are going to create a geometry column (GeoSeries), and then a vector layer (GeoDataFrame) (see Creating a GeoDataFrame) from sratch. This will help us get a better understanding of the structure of a GeoDataFrame (Fig. 48).

Let us start with the geometry column, i.e., a GeoSeries. The geometric component of our layer is going to be a sequence of two 'Polygon' geometries, hereby named pol1 and pol2 (see shapely from WKT):

pol1 = shapely.from_wkt('POLYGON ((0 0, 0 -1, 7.5 -1, 7.5 0, 0 0))')
pol2 = shapely.from_wkt('POLYGON ((0 1, 1 0, 2 0.5, 3 0, 4 0, 5 0.5, 6 -0.5, 7 -0.5, 7 1, 0 1))')

To create a GeoSeries, we can pass a list of shapely geometries to the gpd.GeoSeries function. The second (optional) argument passed to gpd.GeoSeries is a crs, specifying the Coordinate Reference System (CRS). In this case, we specify the WGS84 geographical CRS, using its EPSG code 4326 (Table 24):

s = gpd.GeoSeries([pol1, pol2], crs=4326)
s
0    POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1    POLYGON ((0 1, 1 0, 2 0.5, 3 0,...
dtype: geometry

Creating a GeoDataFrame#

A GeoDataFrame can be constructed from a dict representing the columns, one of them containing the geometries, and the other(s) containing non-spatial attributes (if any). The dict values may be Series, GeoSeries, or list data structure. First, let us create the dictionary:

d = {'name': ['a', 'b'], 'geometry': s}
d
{'name': ['a', 'b'],
 'geometry': 0    POLYGON ((0 0, 0 -1, 7.5 -1, 7....
 1    POLYGON ((0 1, 1 0, 2 0.5, 3 0,...
 dtype: geometry}

In this particular case, we have:

  • the (mandatory) geometry column, conventionally named 'geometry', specified using a GeoSeries (see Creating a GeoSeries), and

  • one non-spatial attribute, named 'name', specified using a list

Now, we can convert the dict to a GeoDataFrame, using the gpd.GeoDataFrame function:

gdf = gpd.GeoDataFrame(d)
gdf
name geometry
0 a POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1 b POLYGON ((0 1, 1 0, 2 0.5, 3 0,...

or in a single step:

gdf = gpd.GeoDataFrame({'name': ['a', 'b'], 'geometry': s})
gdf
name geometry
0 a POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1 b POLYGON ((0 1, 1 0, 2 0.5, 3 0,...

We can also use a list of geometries instead of a GeoSeries, in which case the crs can be specified inside gpd.GeoDataFrame:

gdf = gpd.GeoDataFrame({'name': ['a', 'b'], 'geometry': [pol1, pol2]}, crs=4326)
gdf
name geometry
0 a POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1 b POLYGON ((0 1, 1 0, 2 0.5, 3 0,...

Note that, the workflow to create a GeoDataFrame is very similar to the way we created a DataFrame with the pd.DataFrame function (see Creating a DataFrame). Basically, the only difference is the presence of the geometry column.

GeoDataFrame structure#

Series columns#

Let us examine the structure of the gdf object we just created to understand the structure of a GeoDataFrame.

An individual column of a GeoDataFrame can be accessed just like a column of a DataFrame (see Selecting DataFrame columns). For example, the following expression returns the 'name' column:

gdf['name']
0    a
1    b
Name: name, dtype: object

as a Series object:

type(gdf['name'])
pandas.core.series.Series

More specifically, 'name' is a Series of type object, since it contains strings.

GeoSeries (geometry) column#

The following expression returns the 'geometry' column:

gdf['geometry']
0    POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1    POLYGON ((0 1, 1 0, 2 0.5, 3 0,...
Name: geometry, dtype: geometry

However, the recommended approach to access the geometry is through the .geometry property, which refers to the geometry column regardless of its name (whether the name is 'geometry' or not):

gdf.geometry
0    POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1    POLYGON ((0 1, 1 0, 2 0.5, 3 0,...
Name: geometry, dtype: geometry

As mentioned above, the geometry column is a GeoSeries object:

type(gdf.geometry)
geopandas.geoseries.GeoSeries

The 'geometry' column is what differentiates a GeoDataFrame from an ordinary DataFrame.

Note

It is possible for a GeoDataFrame to contain multiple GeoSeries columns. However, only one GeoSeries column at a time can be the “active” geometry column, meaning that it becomes accessible through the .geometry property, participates in spatial calculations, displayed when plotting the layer, etc. To make a given column “active”, we use the .set_geometry method, as in dat=dat.set_geometry('column_name'), where column_name is a name of a GeoSeries column in dat.

Individual geometries#

As evident from the way we constructed the object (see Creating a GeoDataFrame), each “cell” in the geometry column is an individual shapely geometry:

gdf.geometry.iloc[0]
_images/3f5f3ed127ee5e83c8ecbb9989c7323fdb536e8c1f2e237a74a7510e8474a3e5.svg
gdf.geometry.iloc[1]
_images/a16328429069b9c096fa139809d27c9c20c26c7e27b02a524b269a716fb3164f.svg

If necessary, for instance when applying shapely functions (see Points to line (geopandas)), the entire geometry column can be extracted as a list of shapely geometries using to_list (see Series to list):

gdf.geometry.to_list()
[<POLYGON ((0 0, 0 -1, 7.5 -1, 7.5 0, 0 0))>,
 <POLYGON ((0 1, 1 0, 2 0.5, 3 0, 4 0, 5 0.5, 6 -0.5, 7 -0.5, 7 1, 0 1))>]

GeoDataFrame properties#

Geometry types#

The geometry types (Fig. 38) of the geometries in a GeoDataFrame or a GeoSeries can be obtained using the .geom_type property. The result is a Series of strings, with the geometry type names. For example:

s.geom_type
0    Polygon
1    Polygon
dtype: object
gdf.geom_type
0    Polygon
1    Polygon
dtype: object

The .geom_type property of geopandas thus serves the same purpose as the .geom_type property of shapely geometries (see Geometry type).

Coordinate Reference System (CRS)#

In addition to the geometries, a GeoSeries has a property named .crs, which specifies the Coordinate Regerence System (CRS) of the geometries. For example, the geometries in the gdf layer are in the geographic WGS84 CRS, as specified when creating the layer (see Creating a GeoDataFrame):

gdf.geometry.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

The .crs property can also be reached from the GeoDataFrame itself:

gdf.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

We are going to elaborate on CRS and modifying them later on (see CRS and reprojection).

Bounds (geopandas)#

The .total_bounds, property of a GeoDataFrame contains the bounding box coordinates of all geometries combined, in the form of a numpy array:

gdf.total_bounds
array([ 0. , -1. ,  7.5,  1. ])

For more detailed information, the .bounds property returns the bounds of each geometry, separately, in a DataFrame:

gdf.bounds
minx miny maxx maxy
0 0.0 -1.0 7.5 0.0
1 0.0 -0.5 7.0 1.0

In both cases, bounds are returned in the standard form \((x_{min}\), \(y_{min}\), \(x_{max}\), \(y_{max})\), in agreement with the .bounds property of shapely (see Bounds (shapely)).

Reading vector layers#

Usually, we read a vector layer from a file, rather than creating it from scratch (see Creating a GeoDataFrame). The gpd.read_file function can read vector layer files in any of the common formats, such as:

  • Shapefile (.shp)

  • GeoJSON (.geojson)

  • GeoPackage (.gpkg)

  • File Geodatabase (.gdb)

For example, the following expression reads a Shapefile named 'muni_il.shp' into a GeoDataFrame named towns. This is a polygonal layer of towns in Israel. The additional encoding='utf-8' argument makes the Hebrew text correctly interpreted:

towns = gpd.read_file('data/muni_il.shp', encoding='utf-8')
towns
Muni_Heb Muni_Eng Sug_Muni CR_PNIM CR_LAMAS ... FIRST_Nafa LAST_Nafa2 Shape_Leng Shape_Area geometry
0 שפיר Shafir מועצה אזורית 5534 None ... אשקלון None 48125.964322 2.721176e+07 POLYGON Z ((177281.694 610056.5...
1 דימונה Dimona עירייה 2200 2200 ... באר שבע None 1102.492262 7.708050e+04 POLYGON Z ((205753.745 560693.6...
2 דימונה Dimona עירייה 2200 2200 ... באר שבע None 7071.568894 1.043253e+06 POLYGON Z ((207740.171 563476.5...
3 זבולון Zevulun מועצה אזורית 5512 None ... חיפה None 1107.564213 4.769053e+04 POLYGON Z ((209445.229 740507.2...
4 מגידו Megido מועצה אזורית 5513 None ... יזרעאל None 1076.261429 2.144091e+04 POLYGON Z ((206769.626 727131.7...
... ... ... ... ... ... ... ... ... ... ... ...
406 ללא שיפוט - אזור מעיליא No Jurisdiction - Mi'ilya Area ללא שיפוט 9901 None ... None None 2617.936245 2.001376e+05 POLYGON Z ((225369.655 770523.6...
407 זבולון Zevulun מועצה אזורית 5512 None ... חיפה None 2209.128245 2.246316e+05 POLYGON Z ((207596.214 746263.0...
408 זבולון Zevulun מועצה אזורית 5512 None ... חיפה None 5507.014575 1.141930e+06 POLYGON Z ((208622.688 746540.1...
409 רמת השרון Ramat Hasharon עירייה 2650 2650 ... תל אביב - יפו None 937.968193 4.076764e+04 POLYGON Z ((183576.756 673627.0...
410 צור הדסה Tsur Hadasa מועצה מקומית 1113 1113 ... תל אביב - יפו None 11292.949932 3.961142e+06 POLYGON Z ((209871.005 625364.9...

411 rows × 14 columns

Let us read another layer, 'RAIL_STRATEGIC.shp', into a GeoDataFrame named rail. This is a line layer of railway lines in Israel:

rail = gpd.read_file('data/RAIL_STRATEGIC.shp')
rail
UPGRADE SHAPE_LENG SHAPE_Le_1 SEGMENT ISACTIVE UPGRAD geometry
0 שדרוג 2040 14489.400742 12379.715332 כפר יהושע - נשר_1 פעיל שדרוג בין 2030 ל-2040 LINESTRING (205530.083 741562.9...
1 שדרוג 2030 2159.344166 2274.111800 באר יעקב-ראשונים_2 פעיל שדרוג עד 2030 LINESTRING (181507.598 650706.1...
2 שדרוג 2040 4942.306156 4942.306156 נתבג - נען_3 לא פעיל שדרוג בין 2030 ל-2040 LINESTRING (189180.643 645433.4...
3 שדרוג 2040 2829.892202 2793.337699 לב המפרץ מזרח - נשר_4 פעיל שדרוג בין 2030 ל-2040 LINESTRING (203482.789 744181.5...
4 שדרוג 2040 1976.490872 1960.170882 קרית גת - להבים_5 פעיל שדרוג בין 2030 ל-2040 LINESTRING (178574.101 609392.9...
... ... ... ... ... ... ... ...
212 חדש 2030 2174.401983 2174.401983 ראש העין צפון - כפר סבא צפון_ לא פעיל פתיחה עד 2030 LINESTRING (195887.59 675861.96...
213 חדש 2030 0.000000 974.116931 רמלה דרום-ראשונים_218 לא פעיל פתיחה עד 2030 LINESTRING (183225.361 648648.2...
214 שדרוג 2030 3248.159569 1603.616014 השמונה - בית המכס_220 פעיל שדרוג עד 2030 LINESTRING (200874.999 746209.3...
215 שדרוג 2030 2611.879099 166.180958 לב המפרץ - בית המכס_221 פעיל שדרוג עד 2030 LINESTRING (203769.786 744358.6...
216 שדרוג 2030 1302.971893 1284.983680 224_לוד מרכז - נתבג פעיל שדרוג עד 2030 LINESTRING (190553.481 654170.3...

217 rows × 7 columns

Note

Note that a Shapefile consists of several files, such as .shp, .shx, .dbf and .prj, even though we specify just one the .shp file path when reading a Shapefile with gpd.read_file. The same applies to writing a Shapefile (see Writing GeoDataFrame to file).

Plotting (geopandas)#

Basic plots#

GeoSeries and GeoDataFrame objects have a .plot method, which can be used to visually examine the data we are working with. For example:

gdf.plot();
_images/5f95aa42340da117f30a0f1cbb49d21d243e30c4e1d4e9ea626426bcb786b319.png

We can use the color and edgecolor parameters to specify the fill color and border color, respectively:

gdf.plot(color='none', edgecolor='red');
_images/f7a36bbbb7ae4eb7fe0b90c995069ad85f7f672a0ad99ab5c9e38b3a4599dfdb.png

Another useful property is linewidth, to specify line width:

gdf.plot(color='none', linewidth=10);
_images/ba5fff85dbe00b28d6a98e98c37d227fd63b815468ac223a118e9966626c1764.png

Let us also plot the towns and rail layers to see what they look like:

towns.plot();
_images/cf0b0528d8fc66781f8ded691264c1aeece0eeef72b89e3c6baa3c2471219205.png
rail.plot();
_images/4d7bf37902fda15a6b4765fd93fefd019730bfc8efeb6bd8c53404e3668ec95b.png

Setting symbology#

By default, all plotted geometries share the same appearance. However, when plotting a GeoDataFrame, we can set symbology, i.e., geometry appearance corresponding to attributes values. The attribute to use for symbology is specified with the column parameter.

For example, the following expression results in a map where town fill colors correspond to the 'Machoz' attribute values:

towns.plot(column='Machoz');
_images/408748714ea66181ea16a3698fc1e7147f10636f86663e0c8581e40c451c95ee.png

The color palette for the symbology can be changed using the cmap (“color map’) argument. Valid palette names can be found in the Choosing Colormaps in Matplotlib page of the matplotlib documentation. For example, 'Set2' is one of the “Qualitative colormaps”, which is appropriate for categorical variables:

towns.plot(column='Machoz', cmap='Set2');
_images/2227863827f7f536d1cc6fcdf6a119035671a4b806cb04575e2a6a456e3400a1.png

As another example, the following expression sets a symbology corresponding to the continuous 'Shape_Area' (town area in square meters) columns. The argument legend=True adds a legend:

towns.plot(column='Shape_Area', cmap='Reds', legend=True);
_images/6dc2a8bd021c17d79b32d42ff6b518807e76d9fd2f1ed5b7c9d28c6b3bf30df9.png

Interactive maps#

geopandas also provides the .explore method, which facilitates interactive exploration of the vector layers we are working with. The .explore method takes many of the same arguments as .plot (such as column and cmap), but also some specific ones (such as style_kwds). See the documentation for details.

For example, the following expression creates an interactive map displaying the rail layer, colored according to status (active or not). Note that hovering over a feature with the mouse shows a tooltip with the attribute values (similarly to “identify” tools in GIS software).

rail.explore(
    column='ISACTIVE', 
    cmap='RdYlBu', 
    style_kwds={'weight':8, 'opacity':0.8}
)
Make this Notebook Trusted to load map: File -> Trust Notebook