(introduction)=
# Preface

In [2]:
!echo Last updated: `date +"%Y-%m-%d %H:%M:%S"`

Last updated: 2023-12-28 10:48:00


*****

## Welcome

This book contains the materials of the 3-credit undergraduate course named *Introduction to Spatial Data Programming with Python*, given at the [Department of Environmental, Geoinformatics and Urban Planning Sciences, Ben-Gurion University of the Negev](https://in.bgu.ac.il/en/humsos/geog/), in Autumn 2023.

The structure of the book is as follows:

* This section (see {ref}`introduction`) covers the context (why learn Python for working with spatial data?) and technical details about the material (which packages and sample data are we going to use?)
* **INTRODUCTION TO PYTHON** (see {ref}`setup`, {ref}`basics`, {ref}`conditionals-and-loops`) explains how to set up and use the Python working environment, and introduces the basics of the Python language, which are prerequisites for the later chapters
* **WORKING WITH DATA** (see {ref}`numpy`, {ref}`pandas1`, {ref}`pandas2`) covers the `numpy` and `pandas` packages for working with array and tabular data in Python, respectively
* **VECTOR LAYERS** (see {ref}`shapely`, {ref}`geopandas1`, {ref}`geopandas2`) covers the `shapely` and `geopandas` packages for working with vector layers in Python
* **RASTERS** (see {ref}`rasterio1` and {ref}`rasterio2`) covers the `rasterio` package for working with rasters in Python
* **ARCPRO SCRIPTING** (see {ref}`arcpro`) introduces the `arcpy` package for writing Python scripts to automate the ArcPro GIS software

## What is Python?

[Python](https://en.wikipedia.org/wiki/Python_(programming_language)) ({numref}`python-homepage`) is a general-purpose programming language. Python is used for a wide variety of [purposes](https://en.wikipedia.org/wiki/List_of_Python_software), such as: 

* Web servers and web applications (e.g., <https://en.wikipedia.org/wiki/Django_(web_framework)>)
* Scientific computing and data science (e.g., <https://en.wikipedia.org/wiki/NumPy>)
* Machine learning and AI (e.g., <https://en.wikipedia.org/wiki/TensorFlow>)
* Scripting language in other software (e.g., <https://en.wikipedia.org/wiki/ArcGIS_Pro>)

```{figure} images/python_homepage.png
---
name: python-homepage
---
Python website (<https://www.python.org/>)
```

Python is open-source, has an intuitive syntax, and is very popular. Among programming language questions on [StackOverflow](https://stackoverflow.com/), Python currently (2023) stands at 1<sup>st</sup> place with ~14% of all questions ({numref}`python-popularity`).

```{figure} images/stackoverflow_most_popular_languages.png
---
name: python-popularity
---
Most popular programming languages, according to StackOverflow question proportions (<https://insights.stackoverflow.com/trends>)
```

Python was initially released in 1991. The present version, which we learn in this book, is Python 3, released in 2008. 

## Why choose Python for spatial data?

There are numerous reasons to choose Python for working with spatial data. Some are general reasons for working through a Command Line Interface (CLI) rather than a Graphical User Interface (GUI), that is, roughly speaking, for writing code rather than clicking on menu buttons:

* Programming facilitates automation and reproducibility of our workflows. When programming, we interact with the computer through *scripts*. Therefore, the workflows we create can be repeated, adapted for other use cases in the future, and shared with other people who would like to accurately reproduce them.
* Through programming, the user is "forced" to have a deeper understanding of the underlying data and the computational algorithms behind GIS workflows. Working through a CLI usually involves knowledge of lower-level details and requires us to be specific about what we want to do.

Python also has specific advantages over other CLI approaches:

* Python is a widespread and extremely popular language, in GIS as well as in other industries, and in academic research ({numref}`python-popularity`). For example, according to the 2021 StackOverflow survey, Python was the 3<sup>rd</sup> most popular programming technology (after JavaScript and HTML), with 48.2% of respondents using it [^so_survey]. At a recent FOSS4G (Free and Open Source Software for Geospatial) conference (FOSS4G 2021 Buenos Aires), Python was the major programming language in the workshops, with three different 4-hour geospatial Python workshops ({numref}`foss4g-workshop`).
* Python and the packages we are going to learn (see {ref}`what-are-we-going-to-learn`) are free and open-source, which means that you can set up the workflows we learn at any place and time, at zero cost.
* The Python syntax was designed to be clear and straightforward. This means that Python has a (relatively) gentle learning curve, and that Python programs are often easy to read and understand, compared to other programming languages.
* In addition to being a standalone tool, Python is also used to automate GIS (and other) software, such as ArcGIS/ArcPro (`arcpy`) and QGIS (`PyQGIS`) (see {ref}`arcpro`). Often, Python is the main, or only, CLI of GIS software. Google Earth Engine, a popular cloud computing environment for working with big spatial data, has an official [Python API](https://developers.google.com/earth-engine/guides/python_install).
* Deep learning libraries, such as [Keras](https://keras.io/)/[TensorFlow](https://www.tensorflow.org/) and [PyTorch](https://pytorch.org/), are almost exclusively accessed through Python. Among other uses, deep learning is applicable to spatial analysis tasks such as object detection and image classification in remote sensing ({numref}`building-segmentation`). 

[^so_survey]: <https://insights.stackoverflow.com/survey/2021#most-popular-technologies-language>

```{figure} images/foss4g_workshop.png
---
name: foss4g-workshop
---
Doing Geospatial with Python workshop in the FOSS4G 2021 Buenos Aires conference
```

```{figure} images/building_segmentation.png
---
name: building-segmentation
---
Building segmentation result, using deep learning with Python (source: <https://medium.com/@anthropoco/how-to-segment-buildings-on-drone-imagery-with-fast-ai-cloud-native-geodata-tools-ae249612c321>)
```

Nevertheless, Python also has disadvantages compared to other CLI approaches. For example, the R programming language can be considered an alternative CLI tool for spatial analysis, and Python falls short of it in the following respects:

* Python is a general-purpose language, which means that it is not natively designed to work with data. For example, the Python standard library does not support basic data science concepts and data structures, such as "No Data" values, arrays, and tables. This means we almost always need to rely on third-party packages such as `numpy` (see {ref}`numpy`) and `pandas` (see {ref}`pandas1`) when working with data. In R, the standard library covers all of those data-related concepts and much more.
* Python's spatial analysis "ecosystem" is more scattered, with numerous packages independently developed and not always inter-compatible. For example, although vector-based analysis is mostly unified into a single package called `geopandas` (see {ref}`geopandas1`), there are multiple packages for raster-based analysis, each with its own features, level of abstraction, advantages, and disadvantages, such as `rasterio` (see {ref}`rasterio1` and {ref}`rasterio2`), `xarray-spatial`, `rioxarray`, `earthpy`, and `geowombat`. Consequently, vector-based and raster-based ecosystems are not well integrated. For example, basic operations such as zonal statistics may require additional third-party packages, such as `rasterstats` (see {ref}`zonal-statistics`). In R, there is much tighter integration between spatial analysis packages. For example, the pair of compatible R packages `sf` and `stars` [cover](https://keen-swartz-3146c4.netlify.app/) most vector-based and raster-based analysis tasks, respectively.
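
To make the first point above concrete, here is a minimal sketch contrasting a plain Python list with a `numpy` array. It assumes `numpy` is installed (see {ref}`installing-packages`), and the sample values are made up for illustration.

```python
import math
import numpy as np

# Hypothetical elevation values, with one missing measurement
values = [10.0, 12.5, float('nan'), 11.0]

# A plain list has no element-wise arithmetic: `*` repeats the list
doubled_list = values[:2] * 2

# The built-in `sum` propagates the missing value instead of skipping it
total = sum(values)

# A numpy array supports element-wise arithmetic and "No Data"-aware functions
arr = np.array(values)
doubled_arr = arr[:2] * 2          # each element multiplied by 2
mean_valid = np.nanmean(arr)       # mean of the three valid values only

print(doubled_list)
print(math.isnan(total))
print(doubled_arr)
print(mean_valid)
```

The same pattern, a third-party package filling a gap in the standard library, repeats for tables (`pandas`) and for spatial data in the later chapters.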

(what-are-we-going-to-learn)=
## What are we going to learn?

In this book, we are going to work with the Python programming language, using packages from the standard library (which is built-in with the Python installation), as well as several third-party Python packages (which need to be installed separately, see {ref}`installing-packages`). 

By the end of this book, you will be able to write Python programs to automate processing and analysis of spatial data. You will be able to write Python scripts for spatial analysis workflows consisting of operations such as:

* Importing tables, vector layers, and rasters
* Filtering and aggregating the data
* Calculating new attributes, or reclassifying values to new categories
* Making spatial calculations, such as calculating distances, or creating buffered layers
* Creating simple plots and maps to inspect your data
* Exporting the results to a table, vector layer, or raster file

You will also have a strong background in the fundamental packages for data science in Python, namely `numpy` and `pandas`. This is a good starting point for learning data-related topics other than spatial analysis, such as:

* [Data processing](https://en.wikipedia.org/wiki/Data_processing)
* [Statistical analysis](https://en.wikipedia.org/wiki/Statistics)
* [Machine learning](https://en.wikipedia.org/wiki/Machine_learning)
* [Image processing](https://en.wikipedia.org/wiki/Digital_image_processing)

The most important third-party packages for spatial analysis in Python, which we are going to cover in detail, are listed in order of appearance in {numref}`python-packages`. The package versions being used when compiling the book are specified in {ref}`system-information`.

```{table} Main third-party Python packages used in this book
:name: python-packages

| Package | Functionality | Website |
|---|---|---|
| `numpy`     | Arrays            | <https://numpy.org/> |
| `pandas`    | Tables            | <https://pandas.pydata.org/> | 
| `shapely`   | Vector geometries | <https://shapely.readthedocs.io/> |
| `geopandas` | Vector layers     | <https://geopandas.org/> |
| `rasterio`  | Rasters           | <https://rasterio.readthedocs.io/> |
```

As we will see, these packages depend on one another. The major dependencies are depicted in {numref}`python-package-deps`.

```{figure} images/diagram_01_packages2.svg
---
name: python-package-deps
---
Main dependencies between the Python packages we are going to learn
```

Additionally, we are going to use the packages listed in {numref}`python-packages-2` for specific tasks. Packages from the standard library (which we'll explain in {ref}`what-are-packages`) are marked with (*) in the table.

```{table} Other Python packages used in this book. Packages from the standard library are marked with (*).
:name: python-packages-2

| Package | Functionality | Website |
|---|---|---|
| `csv` (*)     | Working with CSV files          | <https://docs.python.org/3/library/csv.html> |
| `math` (*)    | Mathematical functions          | <https://docs.python.org/3/library/math.html> |
| `matplotlib`  | Plots                           | <https://matplotlib.org/> |
| `glob` (*)    | File search by pattern          | <https://docs.python.org/3/library/glob.html> |
| `rasterstats` | Zonal statistics                | <https://pythonhosted.org/rasterstats/> |
| `richdem`     | Topographic raster calculations | <https://richdem.readthedocs.io/en/latest/> |
| `scipy`       | Focal filtering                 | <https://www.scipy.org/> |
```
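
As a small taste of the standard-library packages marked with (*) in the table, the following sketch writes a tiny CSV file, finds it by pattern with `glob`, reads it back with `csv`, and computes a distance with `math`. The file name `points.csv`, the column names, and the coordinates are made up for illustration, and a temporary folder is used so that nothing is written to your working directory.

```python
import csv
import glob
import math
import os
import tempfile

# Create a temporary folder with a small, made-up CSV file
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'points.csv')
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'x', 'y'])
    writer.writerow(['a', 0, 0])
    writer.writerow(['b', 3, 4])

# glob: find all CSV files in the folder
csv_files = glob.glob(os.path.join(folder, '*.csv'))

# csv: read the rows as dictionaries (all values are read as strings)
with open(csv_files[0], newline='') as f:
    rows = list(csv.DictReader(f))

# math: Euclidean distance between the two points
dx = float(rows[1]['x']) - float(rows[0]['x'])
dy = float(rows[1]['y']) - float(rows[0]['y'])
distance = math.hypot(dx, dy)
print(distance)
```

Note that `csv` returns every value as a string, which is why the coordinates are converted with `float` before the calculation.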

(prerequisites)=
## Prerequisites

The book can be used as the primary text for an introductory course on a programmatic approach to spatial data analysis in geography departments, or by anyone who is interested in the topic. Each of the chapters is designed to be covered in a three-hour lecture, or through self-study. Short questions and exercises are given throughout the chapters to demonstrate the material from different angles and facilitate understanding.

Familiarity with basic concepts of geographic data and GIS (coordinate systems, projections, spatial layer file formats, etc.) is necessary for deeper understanding of some of the topics in the book. Readers who are not familiar with GIS can skip the theoretical considerations and still follow the material from the technical point of view.

The book assumes no background knowledge in programming, going through all necessary material from the beginning. 

(sample-data)=
## Sample data

Throughout the book, we are going to use several datasets for demonstrating the methods we learn. The data can be downloaded here: 

* [data.zip](./data.zip)
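
If you prefer to unpack the archive with Python rather than with a file manager, a minimal sketch using the standard-library `zipfile` package could look as follows. The helper name `extract_archive` and the target folder name `data` are our own choices for illustration, and the internal folder layout of the archive is not assumed.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path, target_dir):
    """Extract all files from a ZIP archive and return the archived names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as z:
        z.extractall(target)
        return z.namelist()

# Runs only if `data.zip` was downloaded to the current working directory
if Path('data.zip').exists():
    print(extract_archive('data.zip', 'data'))
```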

{numref}`datasets` lists the datasets used in the book.

```{table} Datasets used in the book
:name: datasets

| Dataset | Filename | Format | Accessed | Source |
|---|---|---|---|---|
| Python script | `test.py` | Python | 2022 | - |  
| "Requirements" file | `requirements.txt` | TXT | 2023 | - |  
| World cities | `world_cities.csv` | CSV | 2021 | R package [`maps`](https://cran.r-project.org/package=maps) |  
| Carmel DEM | `carmel.csv` | CSV | 2016 | SRTM data, from [EarthExplorer](https://earthexplorer.usgs.gov/) |
| Carmel DEM (low resolution) | `carmel_lowres.csv` | CSV | 2016 | SRTM data, from [EarthExplorer](https://earthexplorer.usgs.gov/) |
| GISS global temperature | `ZonAnn.Ts+dSST.csv` | CSV | 2024 | [NASA](https://data.giss.nasa.gov/gistemp/) | 
| Trees in Beer-Sheva | `trees.csv` | CSV | 2023 | [Beer-Sheva Municipality](https://data.gov.il/dataset/trees-br7) |
| GTFS | `gtfs/*`[^gtfs-files] | CSV | 2023 | [Israel Ministry of Transport](https://www.gov.il/he/departments/general/gtfs_general_transit_feed_specifications) | 
| BGU logo | `bgu.wkt` | WKT | 2021 | [BGU](https://in.bgu.ac.il/) | 
| Railway stations | `RAIL_STAT_ONOFF_MONTH.shp`[^shp-datasets] | Shapefile | 2020 | [Israel Ministry of Transport](https://data.gov.il/dataset/rail_stat_onoff_month) |
| Railway lines | `RAIL_STRATEGIC.shp` | Shapefile | 2020 | [Israel Ministry of Transport](https://data.gov.il/dataset/rail_strategic) |
| Israel municipalities | `muni_il.shp` | Shapefile | 2022 | [Israel Ministry of Interior](https://www.gov.il/he/departments/guides/info-gis) |
| Statistical areas demography 2019 | `statisticalareas_demography2019.gdb` | Geodatabase | 2019 | [Israel Central Bureau of Statistics](https://www.cbs.gov.il/he/Pages/geo-layers.aspx) |
| Beer-Sheva aerial photo (2015) | `BSV_res200-M.tif` | GeoTIFF | 2021 | [MAPI](https://data.gov.il/dataset/bsv) |
| Sentinel-2 image | `T36RXV_20201226T082249_B0*.jp2` | JPEG2000 | 2020 | Sentinel-2, from [EarthExplorer](https://earthexplorer.usgs.gov) | 
```
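
As the footnote on Shapefile datasets explains, each `.shp` entry in the table is really a group of files sharing the same prefix. The sketch below lists the members of such a group using the standard-library `pathlib` package. The function name `shapefile_parts` is hypothetical and not part of the book's code.

```python
from pathlib import Path

def shapefile_parts(shp_path):
    """Return the sorted names of all files sharing the `.shp` file's prefix."""
    shp = Path(shp_path)
    # Match `<prefix>.<anything>` in the same folder, e.g. `.shp`, `.shx`, `.dbf`
    return sorted(p.name for p in shp.parent.glob(shp.stem + '.*'))
```

For example, assuming the sample data were extracted into a folder named `data`, calling `shapefile_parts('data/muni_il.shp')` would list `muni_il.shp` along with its companion files. This is a handy check before copying or sharing a Shapefile, since it stops working if the `.shx` or `.dbf` file is left behind.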

[^gtfs-files]: The GTFS dataset is composed of several `.txt` files (located in the `gtfs` folder), namely: `agency.txt`, `calendar.txt`, `fare_attributes.txt`, `fare_rules.txt`, `routes.txt`, `shapes.txt`, `stops.txt`, `stop_times.txt`, `translations.txt`, `trips.txt`. 

[^shp-datasets]: By convention, Shapefile datasets are listed as `.shp` files. However, a Shapefile is actually composed of at least two more files (`.shx`, `.dbf`), and usually more, sharing the same prefix.

In some code examples in the book we are also going to create new files, to be used in later chapters or to demonstrate file export using Python. You can create them on your own, by running the code examples. Alternatively, you can download them from the following link:

* [output.zip](./output.zip)

{numref}`outputs` lists the files that we are going to create in the book.

```{table} Files created in the code examples in the book
:name: outputs

| Dataset | Filename | Format | Chapter | 
|---|---|---|---|
| World cities | `world_cities.shp` | Shapefile | {ref}`setup` |
| Packages | `packages.csv` | CSV | {ref}`writing-csv-files` |
| Railway stations | `stations.csv` | CSV | {ref}`pandas1`|
| Public transit routes | `routes.shp` | Shapefile | {ref}`geopandas1` |
| Public transit routes | `routes.geojson` | GeoJSON | {ref}`geopandas1` |
| Public transit routes | `routes.gpkg` | GeoPackage | {ref}`geopandas1` |
| Carmel DEM | `carmel.tif` | GeoTIFF | {ref}`rasterio1` | 
| Sentinel-2 stacked image | `sentinel2.tif` | GeoTIFF | {ref}`rasterio1` |
| Carmel topographic aspect | `carmel_aspect.tif` | GeoTIFF | {ref}`rasterio1` |
```

(system-information)=
## System information

The Python version used when rendering the book is:

In [3]:
import sys
print(sys.version)

3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]


The package ({numref}`python-packages`, {numref}`python-packages-2`) versions used when rendering the book are:

In [1]:
import subprocess
packages = [
    'notebook',
    'numpy',
    'pandas',
    'shapely',
    'geopandas',
    'rasterio',
    'matplotlib',
    'rasterstats',
    'richdem',
    'scipy',
]
# Collect the matching 'name==version' line of `pip freeze` for each package
result = ''
for i in packages:
    x = 'pip freeze | grep ^%s==' % i
    result += subprocess.run(x, shell=True, executable='/bin/bash', capture_output=True, text=True).stdout
print(result)

notebook==7.0.6
numpy==1.26.0
pandas==1.5.3
shapely==2.0.1
geopandas==0.14.1
rasterio==1.3.6
matplotlib==3.5.1
rasterstats==0.18.0
richdem==0.3.4
scipy==1.8.0
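
The cell above relies on `pip freeze`, `grep`, and `/bin/bash`, so it needs a Unix-like shell. As a possible portable alternative, the following sketch queries installed versions with `importlib.metadata` from the standard library (Python 3.8 and later). The helper name `installed_versions` is our own and is not part of the book's code; packages that are not installed are reported as `None`.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(names):
    """Map each package name to its installed version, or None if missing."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result

packages = ['numpy', 'pandas', 'shapely', 'geopandas', 'rasterio']
for name, ver in installed_versions(packages).items():
    print(f'{name}=={ver}' if ver else f'{name} is not installed')
```

The output depends on your own environment, so it will generally differ from the versions listed above.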



Hostname:

In [1]:
import socket
print(socket.gethostname())

dell14


## Acknowledgements

I thank the authors of the Python language, and the authors of the `numpy`, `pandas`, `shapely`, `geopandas`, and `rasterio` packages which are used extensively in the book, for building these wonderful tools.

## Recommended materials online

### Overview

This section contains links to recommended online resources which are relevant to the scope of this book. All of the listed resources are freely available online.

### Official tutorials

The official tutorials of the third-party packages are often the best place to start. Go to the home pages of the main packages we are going to learn ({numref}`python-packages`) to find the official tutorials of those packages.

### Courses

Courses on general Python:

* [Practical Data Science (Duke University)](https://www.practicaldatascience.org/html/index.html)
* [Geo-Python (University of Helsinki)](https://geo-python-site.readthedocs.io/en/latest/)
* [Python Online (King's College London)](https://kingsgeocomputation.org/teaching/code-camp/code-camp-python/)
* [Programming in Python for Data Science (The University of British Columbia)](https://prog-learn.mds.ubc.ca/en/)
* [Python Programming for Data Science (University of British Columbia)](https://www.tomasbeuzen.com/python-programming-for-data-science/README.html)

Courses on geospatial Python:

* [Automating GIS-processes (University of Helsinki)](https://autogis-site.readthedocs.io/en/latest/)
* [Geospatial Analysis with Python (University of Tartu)](https://kodu.ut.ee/~kmoch/geopython2021/index.html)
* [PyGIS - Open Source Geospatial Programming & Remote Sensing (The George Washington University)](https://pygis.io/)
* [Geographic Data Science (University of Liverpool)](https://darribas.org/gds_course/content/home.html)
* [Geographic Data Science for Applied Economists](http://darribas.org/gds4ae/content/pages/home.html)
* [Python for GIS and Geoscience (Ghent University)](https://github.com/jorisvandenbossche/DS-python-geospatial)
* [Spatial data science for sustainable development (Aalto University)](https://sustainability-gis.readthedocs.io/en/latest/#)
* [Geospatial Analysis and Representation for Data Science (University of Trento)](https://napo.github.io/geospatial_course_unitn/)
* [Foundations of Spatial Data Science (University College London)](https://github.com/jreades/fsds)
* [Geospatial Data Science (University of Copenhagen)](https://github.com/mszell/geospatialdatascience)
* [Python for Geospatial Analysis (University of British Columbia)](https://www.tomasbeuzen.com/python-for-geospatial-analysis/README.html)

### Tutorials

Tutorials on general Python:

* [Python 3 tutorial (WikiBooks, in Hebrew)](https://he.wikibooks.org/wiki/%D7%A4%D7%99%D7%99%D7%AA%D7%95%D7%9F/%D7%A4%D7%99%D7%99%D7%AA%D7%95%D7%9F_%D7%92%D7%A8%D7%A1%D7%94_3)
* [Python in a hurry](https://pyhurry.readthedocs.io/en/latest/index.html)
* [Plotting and Programming in Python (Software Carpentry)](http://swcarpentry.github.io/python-novice-gapminder/)
* [Programming with Python (Software Carpentry)](https://swcarpentry.github.io/python-novice-inflammation/)
* [Data Analysis and Visualization with Python for Social Scientists (Data Carpentry)](https://datacarpentry.org/python-socialsci/)
* [Data Analysis and Visualization in Python for Ecologists (Data Carpentry)](https://datacarpentry.org/python-ecology-lesson/)
* [Python for Atmosphere and Ocean Scientists (Data Carpentry)](https://carpentries-lab.github.io/python-aos-lesson/)
* [Calm Code tutorials](https://calmcode.io/)

Tutorials on geospatial Python:

* [Mapping and Data Visualization with Python (SpatialThoughts)](https://courses.spatialthoughts.com/python-dataviz.html)
* [Python Foundation for Spatial Analysis (SpatialThoughts)](https://courses.spatialthoughts.com/python-foundation.html)
* [Introduction to Geospatial Raster and Vector Data with Python (Data Carpentry)](https://carpentries-incubator.github.io/geospatial-python/)
* [Get Started With GIS in Open Source Python](https://www.earthdatascience.org/workshops/gis-open-source-python/)
* [Python for Geosciences (post series)](https://medium.com/analytics-vidhya/python-for-geosciences-working-with-satellite-images-step-by-step-b141dc50e1df)
* [Geospatial Analysis Tutorials (Kaggle)](https://www.kaggle.com/learn/geospatial-analysis)
* [Earth Data Science Tutorials in Python](https://www.earthdatascience.org/tutorials/python/)
* [Pythia Foundations - A community learning resource for Python-based computing in the geosciences](https://foundations.projectpythia.org/landing-page.html)

Python tutorials on related topics:

* [Image Processing with Python](https://datacarpentry.org/image-processing/)

### Books

Books on general Python:

* [Think Python (2nd Ed.)](https://greenteapress.com/wp/think-python-2e/)
* [A Byte of Python](https://python.swaroopch.com/)

Books on working with data in Python:

* [Python for Data Analysis](https://wesmckinney.com/book/)
* [Coding for Economists](https://aeturrell.github.io/coding-for-economists/intro.html)

Books on geospatial Python:

* [Geocomputation with Python (in preparation)](https://py.geocompx.org/)
* [An Introduction to Earth and Environmental Data Science](https://earth-env-data-science.github.io/intro.html)
* [Introduction to Python for Geographic Data Analysis (in preparation)](https://pythongis.org/)
* [Geographic Data Science with PySAL and the PyData Stack](https://geographicdata.science/book/intro.html)
* [Spatial Data Science (Python version)](https://r-spatial.org/python/)