Vector layers (geopandas
)#
Last updated: 2024-11-02 15:58:07
Introduction#
In the previous chapter, we worked with individual geometries using the shapely
package (Geometries (shapely)). In GIS paractice, however, we usually work with vector layers, rather than individual geometries. Vector layers can be thought of as:
a collection of (one or more) geometries,
associated with a Coordinate Reference System (CRS), and
associated with (zero or more) non-spatial attributes.
In this chapter, we introduce the geopandas
package for working with vector layers. First, we are going to see how a vector layer can be created from scratch (see Creating a GeoDataFrame), based on shapely
geometries, to understand the relation between shapely
and geopandas
data structures, namely:
GeoSeries
—A geometry column, i.e., a collection ofshapely
geometries with a CRS definitionGeoDataFrame
—A vector layer, i.e., a table containing aGeoSeries
Then, we go on to cover some of the basic operations and workflows with vector layers using the geopandas
package:
Reading vector layers—Importing a vector layer from a file, such as a Shapefile
Plotting (geopandas)—Plotting vector layers
Filtering by attributes—Selecting features by attribute
Table to point layer—Creating a point layer from a table
Points to line (geopandas)—Converting a point layer to a line layer
Writing GeoDataFrame to file—Writing a vector layer to file
In the next chapter, we are going to cover spatial operations with geopandas
, such as calculating new geometries (buffers, etc.) and spatial join (see Geometric operations).
What is geopandas
?#
geopandas
, as the name suggests, is an extension of pandas
that enables working with spatial (geographic) data, specifically with vector layers. To facilitate the representation of spatial data, geopandas
defines two data structures, which are extensions of the analogous data structures from pandas
for non-spatial data:
GeoSeries
—An extension of aSeries
(see Creating a Series) for storing a sequence of geometries and a Coordinate Reference System (CRS) definition, whereas each geometry is ashapely
object (see Geometries (shapely))GeoDataFrame
—An extension of aDataFrame
(see Creating a DataFrame), where one of the columns is aGeoSeries
As will be demostrated in a moment (see Creating a GeoDataFrame), a GeoDataFrame
, representing a vector layer, is thus a combination (Fig. 48) of:
A
GeoSeries
, which holds the geometries in a special “geometry” columnZero or more
Series
, which represent the non-spatial attributes
Since geopandas
is based on pandas
, many of the DataFrame
workflows and operators we learned about earlier (see Tables (pandas) and Table reshaping and joins) work exactly the same way with a GeoDataFrame
. This is especially true for operations that involve just the non-spatial properties, such as filtering by attributes (see Filtering by attributes).
The geopandas
package depends on several other Python packages, most importantly:
numpy
pandas
shapely
—Interface to GEOS (see Geometries (shapely))
To put it another way, geopandas
may be considered a “wrapper” of pyogrio
, shapely
, and pyproj
, giving access to all (or most) of their functionality through pandas
-based data structures and simpler-to-use methods. Compared to other packages, geopandas
may be considered a high-level approach for working with vector layers in Python. A lower-level approach is provided through the fiona
package, for example. The osgeo
package provides (through its ogr
sub-package) another, more low-level, approach for vector layer access.
Since geopandas
is simpler to learn and work with than its alternatives, it is the recommended approach for the purposes in this book. For instance, using fiona
and osgeo
we typically have to process vector features one-by-one inside a for
loop, while in geopandas
the entire layer is being ‘read’ and processed at once. Nevertheless, in some situations it is necessary to resort to lower-level approaches. For example, processing features one-by-one is essential when the layer is too big to fit in memory.
Vector layer from scratch#
Creating a GeoSeries
#
Before starting to work with geopandas
, we have to load it. We also load pandas
and shapely
, which we are already familiar with, since they are needed for some of the examples:
import pandas as pd
import shapely
import geopandas as gpd
As a first step to learn about geopandas
, we are going to create a geometry column (GeoSeries
), and then a vector layer (GeoDataFrame
) (see Creating a GeoDataFrame) from sratch. This will help us get a better understanding of the structure of a GeoDataFrame
(Fig. 48).
Let us start with the geometry column, i.e., a GeoSeries
. The geometric component of our layer is going to be a sequence of two 'Polygon'
geometries, hereby named pol1
and pol2
(see shapely from WKT):
pol1 = shapely.from_wkt('POLYGON ((0 0, 0 -1, 7.5 -1, 7.5 0, 0 0))')
pol2 = shapely.from_wkt('POLYGON ((0 1, 1 0, 2 0.5, 3 0, 4 0, 5 0.5, 6 -0.5, 7 -0.5, 7 1, 0 1))')
To create a GeoSeries
, we can pass a list
of shapely
geometries to the gpd.GeoSeries
function. The second (optional) argument passed to gpd.GeoSeries
is a crs
, specifying the Coordinate Reference System (CRS). In this case, we specify the WGS84 geographical CRS, using its EPSG code 4326
(Table 24):
s = gpd.GeoSeries([pol1, pol2], crs=4326)
s
0 POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1 POLYGON ((0 1, 1 0, 2 0.5, 3 0,...
dtype: geometry
Creating a GeoDataFrame
#
A GeoDataFrame
can be constructed from a dict
representing the columns, one of them containing the geometries, and the other(s) containing non-spatial attributes (if any). The dict
values may be Series
, GeoSeries
, or list
data structure. First, let us create the dictionary:
d = {'name': ['a', 'b'], 'geometry': s}
d
{'name': ['a', 'b'],
'geometry': 0 POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1 POLYGON ((0 1, 1 0, 2 0.5, 3 0,...
dtype: geometry}
In this particular case, we have:
the (mandatory) geometry column, conventionally named
'geometry'
, specified using aGeoSeries
(see Creating a GeoSeries), andone non-spatial attribute, named
'name'
, specified using alist
Now, we can convert the dict
to a GeoDataFrame
, using the gpd.GeoDataFrame
function:
gdf = gpd.GeoDataFrame(d)
gdf
name | geometry | |
---|---|---|
0 | a | POLYGON ((0 0, 0 -1, 7.5 -1, 7.... |
1 | b | POLYGON ((0 1, 1 0, 2 0.5, 3 0,... |
or in a single step:
gdf = gpd.GeoDataFrame({'name': ['a', 'b'], 'geometry': s})
gdf
name | geometry | |
---|---|---|
0 | a | POLYGON ((0 0, 0 -1, 7.5 -1, 7.... |
1 | b | POLYGON ((0 1, 1 0, 2 0.5, 3 0,... |
We can also use a list
of geometries instead of a GeoSeries
, in which case the crs
can be specified inside gpd.GeoDataFrame
:
gdf = gpd.GeoDataFrame({'name': ['a', 'b'], 'geometry': [pol1, pol2]}, crs=4326)
gdf
name | geometry | |
---|---|---|
0 | a | POLYGON ((0 0, 0 -1, 7.5 -1, 7.... |
1 | b | POLYGON ((0 1, 1 0, 2 0.5, 3 0,... |
Note that, the workflow to create a GeoDataFrame
is very similar to the way we created a DataFrame
with the pd.DataFrame
function (see Creating a DataFrame). Basically, the only difference is the presence of the geometry column.
GeoDataFrame
structure#
Series
columns#
Let us examine the structure of the gdf
object we just created to understand the structure of a GeoDataFrame
.
An individual column of a GeoDataFrame
can be accessed just like a column of a DataFrame
(see Selecting DataFrame columns). For example, the following expression returns the 'name'
column:
gdf['name']
0 a
1 b
Name: name, dtype: object
as a Series
object:
type(gdf['name'])
pandas.core.series.Series
More specifically, 'name'
is a Series
of type object
, since it contains strings.
GeoSeries
(geometry) column#
The following expression returns the 'geometry'
column:
gdf['geometry']
0 POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1 POLYGON ((0 1, 1 0, 2 0.5, 3 0,...
Name: geometry, dtype: geometry
However, the recommended approach to access the geometry is through the .geometry
property, which refers to the geometry column regardless of its name (whether the name is 'geometry'
or not):
gdf.geometry
0 POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1 POLYGON ((0 1, 1 0, 2 0.5, 3 0,...
Name: geometry, dtype: geometry
As mentioned above, the geometry column is a GeoSeries
object:
type(gdf.geometry)
geopandas.geoseries.GeoSeries
The 'geometry'
column is what differentiates a GeoDataFrame
from an ordinary DataFrame
.
Note
It is possible for a GeoDataFrame
to contain multiple GeoSeries
columns. However, only one GeoSeries
column at a time can be the “active” geometry column, meaning that it becomes accessible through the .geometry
property, participates in spatial calculations, displayed when plotting the layer, etc. To make a given column “active”, we use the .set_geometry
method, as in dat=dat.set_geometry('column_name')
, where column_name
is a name of a GeoSeries
column in dat
.
Individual geometries#
As evident from the way we constructed the object (see Creating a GeoDataFrame), each “cell” in the geometry column is an individual shapely
geometry:
gdf.geometry.iloc[0]
gdf.geometry.iloc[1]
If necessary, for instance when applying shapely
functions (see Points to line (geopandas)), the entire geometry column can be extracted as a list
of shapely
geometries using to_list
(see Series to list):
gdf.geometry.to_list()
[<POLYGON ((0 0, 0 -1, 7.5 -1, 7.5 0, 0 0))>,
<POLYGON ((0 1, 1 0, 2 0.5, 3 0, 4 0, 5 0.5, 6 -0.5, 7 -0.5, 7 1, 0 1))>]
GeoDataFrame
properties#
Geometry types#
The geometry types (Fig. 38) of the geometries in a GeoDataFrame
or a GeoSeries
can be obtained using the .geom_type
property. The result is a Series
of strings, with the geometry type names. For example:
s.geom_type
0 Polygon
1 Polygon
dtype: object
gdf.geom_type
0 Polygon
1 Polygon
dtype: object
The .geom_type
property of geopandas
thus serves the same purpose as the .geom_type
property of shapely
geometries (see Geometry type).
Coordinate Reference System (CRS)#
In addition to the geometries, a GeoSeries
has a property named .crs
, which specifies the Coordinate Regerence System (CRS) of the geometries. For example, the geometries in the gdf
layer are in the geographic WGS84 CRS, as specified when creating the layer (see Creating a GeoDataFrame):
gdf.geometry.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
The .crs
property can also be reached from the GeoDataFrame
itself:
gdf.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
We are going to elaborate on CRS and modifying them later on (see CRS and reprojection).
Bounds (geopandas
)#
The .total_bounds
, property of a GeoDataFrame
contains the bounding box coordinates of all geometries combined, in the form of a numpy
array:
gdf.total_bounds
array([ 0. , -1. , 7.5, 1. ])
For more detailed information, the .bounds
property returns the bounds of each geometry, separately, in a DataFrame
:
gdf.bounds
minx | miny | maxx | maxy | |
---|---|---|---|---|
0 | 0.0 | -1.0 | 7.5 | 0.0 |
1 | 0.0 | -0.5 | 7.0 | 1.0 |
In both cases, bounds are returned in the standard form \((x_{min}\), \(y_{min}\), \(x_{max}\), \(y_{max})\), in agreement with the .bounds
property of shapely
(see Bounds (shapely)).
Reading vector layers#
Usually, we read a vector layer from a file, rather than creating it from scratch (see Creating a GeoDataFrame). The gpd.read_file
function can read vector layer files in any of the common formats, such as:
Shapefile (
.shp
)GeoJSON (
.geojson
)GeoPackage (
.gpkg
)File Geodatabase (
.gdb
)
For example, the following expression reads a Shapefile named 'muni_il.shp'
into a GeoDataFrame
named towns
. This is a polygonal layer of towns in Israel. The additional encoding='utf-8'
argument makes the Hebrew text correctly interpreted:
towns = gpd.read_file('data/muni_il.shp', encoding='utf-8')
towns
Muni_Heb | Muni_Eng | Sug_Muni | CR_PNIM | CR_LAMAS | ... | FIRST_Nafa | LAST_Nafa2 | Shape_Leng | Shape_Area | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | שפיר | Shafir | מועצה אזורית | 5534 | None | ... | אשקלון | None | 48125.964322 | 2.721176e+07 | POLYGON Z ((177281.694 610056.5... |
1 | דימונה | Dimona | עירייה | 2200 | 2200 | ... | באר שבע | None | 1102.492262 | 7.708050e+04 | POLYGON Z ((205753.745 560693.6... |
2 | דימונה | Dimona | עירייה | 2200 | 2200 | ... | באר שבע | None | 7071.568894 | 1.043253e+06 | POLYGON Z ((207740.171 563476.5... |
3 | זבולון | Zevulun | מועצה אזורית | 5512 | None | ... | חיפה | None | 1107.564213 | 4.769053e+04 | POLYGON Z ((209445.229 740507.2... |
4 | מגידו | Megido | מועצה אזורית | 5513 | None | ... | יזרעאל | None | 1076.261429 | 2.144091e+04 | POLYGON Z ((206769.626 727131.7... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
406 | ללא שיפוט - אזור מעיליא | No Jurisdiction - Mi'ilya Area | ללא שיפוט | 9901 | None | ... | None | None | 2617.936245 | 2.001376e+05 | POLYGON Z ((225369.655 770523.6... |
407 | זבולון | Zevulun | מועצה אזורית | 5512 | None | ... | חיפה | None | 2209.128245 | 2.246316e+05 | POLYGON Z ((207596.214 746263.0... |
408 | זבולון | Zevulun | מועצה אזורית | 5512 | None | ... | חיפה | None | 5507.014575 | 1.141930e+06 | POLYGON Z ((208622.688 746540.1... |
409 | רמת השרון | Ramat Hasharon | עירייה | 2650 | 2650 | ... | תל אביב - יפו | None | 937.968193 | 4.076764e+04 | POLYGON Z ((183576.756 673627.0... |
410 | צור הדסה | Tsur Hadasa | מועצה מקומית | 1113 | 1113 | ... | תל אביב - יפו | None | 11292.949932 | 3.961142e+06 | POLYGON Z ((209871.005 625364.9... |
411 rows × 14 columns
Let us read another layer, 'RAIL_STRATEGIC.shp'
, into a GeoDataFrame
named rail
. This is a line layer of railway lines in Israel:
rail = gpd.read_file('data/RAIL_STRATEGIC.shp')
rail
UPGRADE | SHAPE_LENG | SHAPE_Le_1 | SEGMENT | ISACTIVE | UPGRAD | geometry | |
---|---|---|---|---|---|---|---|
0 | שדרוג 2040 | 14489.400742 | 12379.715332 | כפר יהושע - נשר_1 | פעיל | שדרוג בין 2030 ל-2040 | LINESTRING (205530.083 741562.9... |
1 | שדרוג 2030 | 2159.344166 | 2274.111800 | באר יעקב-ראשונים_2 | פעיל | שדרוג עד 2030 | LINESTRING (181507.598 650706.1... |
2 | שדרוג 2040 | 4942.306156 | 4942.306156 | נתבג - נען_3 | לא פעיל | שדרוג בין 2030 ל-2040 | LINESTRING (189180.643 645433.4... |
3 | שדרוג 2040 | 2829.892202 | 2793.337699 | לב המפרץ מזרח - נשר_4 | פעיל | שדרוג בין 2030 ל-2040 | LINESTRING (203482.789 744181.5... |
4 | שדרוג 2040 | 1976.490872 | 1960.170882 | קרית גת - להבים_5 | פעיל | שדרוג בין 2030 ל-2040 | LINESTRING (178574.101 609392.9... |
... | ... | ... | ... | ... | ... | ... | ... |
212 | חדש 2030 | 2174.401983 | 2174.401983 | ראש העין צפון - כפר סבא צפון_ | לא פעיל | פתיחה עד 2030 | LINESTRING (195887.59 675861.96... |
213 | חדש 2030 | 0.000000 | 974.116931 | רמלה דרום-ראשונים_218 | לא פעיל | פתיחה עד 2030 | LINESTRING (183225.361 648648.2... |
214 | שדרוג 2030 | 3248.159569 | 1603.616014 | השמונה - בית המכס_220 | פעיל | שדרוג עד 2030 | LINESTRING (200874.999 746209.3... |
215 | שדרוג 2030 | 2611.879099 | 166.180958 | לב המפרץ - בית המכס_221 | פעיל | שדרוג עד 2030 | LINESTRING (203769.786 744358.6... |
216 | שדרוג 2030 | 1302.971893 | 1284.983680 | 224_לוד מרכז - נתבג | פעיל | שדרוג עד 2030 | LINESTRING (190553.481 654170.3... |
217 rows × 7 columns
Note
Note that a Shapefile consists of several files, such as .shp
, .shx
, .dbf
and .prj
, even though we specify just one the .shp
file path when reading a Shapefile with gpd.read_file
. The same applies to writing a Shapefile (see Writing GeoDataFrame to file).
Plotting (geopandas
)#
Basic plots#
GeoSeries
and GeoDataFrame
objects have a .plot
method, which can be used to visually examine the data we are working with. For example:
gdf.plot();
We can use the color
and edgecolor
parameters to specify the fill color and border color, respectively:
gdf.plot(color='none', edgecolor='red');
Another useful property is linewidth
, to specify line width:
gdf.plot(color='none', linewidth=10);
Let us also plot the towns
and rail
layers to see what they look like:
towns.plot();
rail.plot();
Setting symbology#
By default, all plotted geometries share the same appearance. However, when plotting a GeoDataFrame
, we can set symbology, i.e., geometry appearance corresponding to attributes values. The attribute to use for symbology is specified with the column
parameter.
For example, the following expression results in a map where town fill colors correspond to the 'Machoz'
attribute values:
towns.plot(column='Machoz');
The color palette for the symbology can be changed using the cmap
(“color map’) argument. Valid palette names can be found in the Choosing Colormaps in Matplotlib page of the matplotlib
documentation. For example, 'Set2'
is one of the “Qualitative colormaps”, which is appropriate for categorical variables:
towns.plot(column='Machoz', cmap='Set2');
As another example, the following expression sets a symbology corresponding to the continuous 'Shape_Area'
(town area in square meters) columns. The argument legend=True
adds a legend:
towns.plot(column='Shape_Area', cmap='Reds', legend=True);
Interactive maps#
geopandas
also provides the .explore
method, which facilitates interactive exploration of the vector layers we are working with. The .explore
method takes many of the same arguments as .plot
(such as column
and cmap
), but also some specific ones (such as style_kwds
). See the documentation for details.
For example, the following expression creates an interactive map displaying the rail
layer, colored according to status (active or not). Note that hovering over a feature with the mouse shows a tooltip with the attribute values (similarly to “identify” tools in GIS software).
rail.explore(
column='ISACTIVE',
cmap='RdYlBu',
style_kwds={'weight':8, 'opacity':0.8}
)