import pandas
4 Python Setup
4.1 תקציר הוראות התקנה (Summary in Hebrew)
שפת פייתון ידועה בעקומת “כניסה” תלולה יחסית—לפני שאפשר להתחיל לכתוב קוד, יש מספר רב של שלבי הכנה והתקנה, ונקודות שאפשר להסתבך בהן (Figure 4.1). המטרה של המסמך להלן היא לסקור את המושגים הבסיסיים של עבודה עם שפת פייתון, ולתת הוראות פרקטיות להתקנה של הסביבה הנחוצה לעבודה עם מידע מרחבי.
בקצרה, השלבים להתקנה ועבודה עם התוכנות הם:
להוריד ולהתקין את תוכנת Miniconda (ראו Section 4.4) ולאתר את התוכנה Anaconda Prompt אשר תתווסף לתפריט “התחל”
להפעיל את ה-Anaconda Prompt (ראו Section 4.5) כאדמיניסטרטור, ולהריץ את הפקודה
conda install -c conda-forge jupyterlab
על מנת להתקין את הספריהjupyterlab
(ראו Section 4.7.2)להתקין באותה השיטה גם את הספריות:
geopandas
rasterio
openpyxl
להפעיל את ה-Anaconda Prompt בצורה רגילה, ולהריץ את הפקודה
jupyter notebook
ליצור קובץ Notebook (קובץ
.ipynb
), או לפתוח קובץ קיים, ולהתחיל לכתוב קוד (ראו Section 4.6)
להסבר נוסף ופירוט, ראו להלן.
4.2 Introduction
In this document we cover the prerequisites to setup the Python environment and interface, before we start with actually learning the Python language in the workshop. As we will see, there are quite a few preliminary steps.
More specifically, before starting to write Python code you need to install and learn how to use several software components:
- Miniconda, which is a popular Python package manager, and includes the Python program itself (see Section 4.4 and Section 4.5)
- The Jupyter Notebook interface, which can be installed from within Miniconda (see Section 4.5.4 and Section 4.6) and is the de-facto standard editor for writing and executing Python code in the data science domain
- Several Python packages, most importantly
numpy
,pandas
,shapely
,geopandas
, andrasterio
, which we are going to use in the tutorial and are essential to work with arrays, tables, vector layers and rasters (see Section 4.7), also installed from within Miniconda
We also cover the concept of file paths (see Section 4.8), which is essential for importing and exporting data from files in your Python scripts.
4.3 Running Python code
4.3.1 Python setup
Python is an interpreted programming language. This means that the expressions we write in the Python language are executed one by one by the software component knows as the interpreter. This is unlike complied languages, which are more difficult to work with, such as the C language, where the entire script needs to be translated to an executable file before it can be run. The interpreter runs one expression at a time, returning the results or performing other operations (such as writing to a file) as it goes along our code.
There are several types of working environments that facilitate writing and running Python code. The common feature of all those environments is that they communicate with the Python interpreter. However, each environment also has other features that facilitate writing code and viewing the results. Python working environments include:
- The Python command line (Section 4.5.2)
- IPython (Section 4.5.3)
- Jupyter Notebooks (Section 4.5.4)
The Python program itself, as well as the environments used to access it and manage its packages (Section 4.7), are bundled into various software packages. For example, Python can be downloaded and installed in a minimalist standalone program, or as part of a package management system, such as Anaconda or Miniconda (which we will use). Python is also commonly incorporated into other software installations, including GIS software such as QGIS, ArcGIS or ArcPro, where it can be accessed through an internal command line.
Python setup methods and environemnts are notoriously variable, whereas multiple Python versions on the same computer may conflict and “break” (Figure 4.1). In this chapter we aim to use one of the simplest and safest methods, based on Miniconda.
4.3.2 What software will we use?
The main component we are going to use is the Miniconda software. Miniconda is the recommended way to install Python on your computer for the purposes of the tutorial (and in general). Miniconda is the minimalist version of the well-known Anaconda software. Miniconda includes just Python, the conda
package manager for Python (see Section 4.7.2), and few other packages, whereas Anaconda comes with over 250 pre-installed Python packages. However, the pre-installed packages in Anaconda do not include most spatial packages we are going to need.
All software components we need other than Miniconda, namely Jupyter Notebook and other Python packages, can be installed through the Miniconda command line. Therefore, a prerequisite to the following material is that you have Miniconda installed (see Section 4.4), that you have access to the Miniconda command prompt (Section 4.5), and that you can run Miniconda as administrator to install the required packages (see Section 4.7.2), or you have those packages already installed (e.g., by the IT administrator).
It is important to note that Python, Miniconda, and the third-party Python packages we are going to use in the tutorial are free and open-source. Therefore, you can flexibly use these tools later on, in any setting and at no cost: at work in a commercial company, in the academy, in personal projects at home, and so on.
The instructions in this chapter assume the Windows 10 operating system.
4.4 Installing Miniconda
To install Miniconda, go to the Miniconda website (Figure 4.2), choose the right installation file for your system (such as Miniconda3 Windows 64-bit), download the file and execute it to run the installation process. Choose the default options when asked.
4.5 Using the miniconda command line
4.5.1 The Miniconda command line
To start the miniconda command line, locate the Anaconda Prompt (Miniconda3) entry in your Start menu (Figure 4.3), and click on it.
You should see a command line window (Figure 4.4). Just like in any command-line interface, you can type and execute commands. For working with Python, in this tutorial and in general, you need to be familiar with just few commands:
- Commands to move through drives and directories:
cd
—Short for “change directory”, to move through the directory structure (in the same drive), as incd C:\Users\dorman\Downloads
X:
—To switch to a different drive, type the drive name followed by colon, as inD:
to switch to driveD:\
dir
—To list current directory contents
- Commands to run programs:
python
—Starting the Pyton command line interface (see Section 4.5.2)ipython
—Starting the IPython interface (see Section 4.5.3)jupyter notebook
—Starting the Jupyter Notebook interface (see Section 4.5.4)conda install
—Installing third-party Python packages (see Section 4.7.2)
Miniconda can thus be thought of as an extended version of the ordinary computer’s command line (“cmd”), where the Python program (python
) and Python package management tool (conda
) are pre-installed.
You can exit from the Miniconda command line by closing the window, or by typing exit
and pressing Enter.
4.5.2 The Python command line
The first and simplest thing that the Miniconda command line provides, is the basic command-line interface to Python, i.e., the python
program. The Python command line can be accessed by typing python
in the miniconda command line (see Section 4.5):
python
This opens the Python command line, which is marked by the >>>
symbol.
Once inside the Python command line, Python expressions can be typed and executed by pressing Enter. For example, Figure 4.5 shows what it looks like to open the Python command line and execute the two expressions, which define a list
object and print it:
= [1,2,3]
a print(a)
The Python command line is convenient for experimenting with short and simple commands. When our code gets bigger and more complex, we usually prefer to keep it in persistent code files so that we can keep a record of our code, using Python script files (.py
), or Jupyter notebooks (.ipynb
) (see Section 4.5.4).
To exit the Python command line, and return to the miniconda command line, type exit()
and press Enter:
exit()
4.5.3 IPython
IPython can be thought of as an enhanced Python command-line application, with many additional features. For example, IPython introduces new helper commands (such as ?
for getting help), more sophisticated autocompletion, shortcuts for interacting with the operating system, and more. Importantly, working with IPython involves a series of cells, with code input (marked with In
) and output (marked with Out
).
The IPython command line can be accessed by typing ipython
in the miniconda command line (see Section 4.5):
ipython
You can exit from IPython (and return to the Miniconda command line) by typing exit
, or exit()
, and pressing Enter:
exit
4.5.4 Jupyter Notebook
Jupyter Notebooks is an extended interface of IPython, where, on top of the interactive interpreter and formatting, we also have a web interface to edit “notebooks” that permanently contain both the code and the output. Additionally, notebooks may contain formatted text, tables, images, equations, etc. (Figure 4.7).
To use Jupyter Notebook, you need to install the jupyterlab
package, similarly to the way that any other Python package is installed. The jupyterlab
package includes the Jupyter Notebook interface program. For example, when using miniconda, the following expression can be used to install jupyterlab
:
conda install -c conda-forge jupyterlab
Open the miniconda command line as administrator (by right-clicking on the miniconda command line icon, then choosing Run as administrator), then run the above command to install jupyterlab
. See Section 4.7.2 for more details and instructions to install jupyterlab
, and the other packages we will be working with in the tutorial. When done, go back to this section to experiment with the notebook interface.
To start the Jupyter Notebook interface, open the Miniconda command line again as an ordinary (non-administrator) user, and navigate to the directory where your notebooks are stored (or where you would like to create a new notebook). Then, run jupyter notebook
to start the Jupyter Notebook interface:
jupyter notebook
For example, if your working directory is C:\Users\dorman\Downloads
, then you need to start the miniconda command-line and run the following two expressions to open the Jupyter Notebook interface (for example, to create a new notebook):
cd C:\Users\dorman\Downloads
jupyter notebook
or the following expression to open an existing notebook named code_01.ipynb
:
cd C:\Users\dorman\Downloads
jupyter notebook code_01.ipynb
After running either jupyter notebook
or jupyter lab
, you should see some output printed in the console (Figure 4.8). At the same time, the notebook interface should automatically open in the browser (Figure 4.7).
Jupyter notebooks are viewed and executed through a web browser, but they need to be connected to a running Python process. Therefore leave the console open as long as you are working with the Jupyter notebooks in the browser. When done, first close the Jupyter notebook tab(s) in the browser, and then terminate the jupyter notebook
process with Ctrl+C.
4.6 Working with notebooks
4.6.1 Notebook files (.ipynb
)
Notebooks are saved as .ipynb
files (short for “IPython notebook”). .ipynb
are plain text files, but more sophisticated ones then .py
script files. Importantly, .ipynb
files contains not just Python code inputs, but also:
- the division into separate “cells” (see Section 4.5.3),
- code outputs (see Section 4.6.5), both textual and graphical ones, resulting from previous evaluations of the code cells (see Section 4.6.4), if any, and
- non-code cells (see Section 4.6.2). (Do not worry if these terms are not clear yet—we are going to demonstrate them shortly.)
Accordingly, and unlike a plain .py
script file, an .ipynb
file does not make much sense when viewing and editing though an plain text editor. It is intended to be viewed and edited only through an interface that can process and display it, such as Jupyter Notebook.
A notebook documents the Python workflow, so that it can be kept for future reference or shared with other people.
To create a new notebook in the Jupyter Notebook interface:
- Click on the New button (in the top-right of the screen)
- Click on Python 3
This should open a new browser tab displaying the new notebook. By default, the notebook will be named Untitled.ipynb
. You can rename it through File→Rename… in the menu (or clicking on the file name, in the top ribbon). While working with the notebook, and before closing it, remember to save your progress using the File→Save and Checkpoint, or by pressing Ctrl+S.
You can also open an existing notebook, by running jupyter notebook
and then choosing it in the file browser, which displays all files in the directory where the jupyter notebook
command was executed. Alternatively, you can open a partucular notebook directly from the command line by specifying its name. For example, to open the notebook named code_01.ipynb
in the current working directory run jupyter notebook code_01.ipynb
.
4.6.2 Cell types
Jupyter notebooks are composed of cells. There are two types of cells:
- Code cells
- Non-code cells, also known as Markdown cells
You can distinguish between code cells and non-code cells, by the In [...]
part displayed to the left of code cells (Section 4.5.4).
Code cells contain Python code, which can be executed, displaying the result (if any) immediately below the cell. We elaborate on code execution (Section 4.6.4) and code output (Section 4.6.5) below.
Markdown cells contain text, typically explanations, or background information, describing the code. Markdown cells follow the Markdown syntax, so in addition to plain text they may display headings, bold or italic text, bullet lists, tables, URLs, and so on. For example, the online version of the book you are reading right now is prepared from a set of Jupyter notebooks (one for each chapter), where all contents other then code are set using Markdown syntax. You can download the source .ipynb
files of the tutorial to see how various elements, such as bullet lists, tables, etc., are specified.
4.6.3 Working modes
There are two “modes” when working inside a notebook:
- Navigation mode
- Cell editing mode
You can switch between the two modes as follows:
- When in the navigation mode, you can enter editing mode by pressing Enter. That way, you can edit the current cell.
- To exit editing mode, and return to navigation mode, press Esc.
When in navigation mode, there are quite a few things, related to cell organization, that you can do. The most common operations are summarized in Table 4.1.
Keyboard shortcut | Operation |
---|---|
↑ / ↓ | Navigate between cells |
A | Create new cell above current one |
B | Create new cell below current one |
D, D | Delete current cell |
Y | Convert the current cell to a code cell |
M | Convert the current cell to a markdown cell |
C | “Copy” current cell |
X | “Cut” current cell |
V | “Paste” the copied or cut cell, below current cell |
When in editing mode, you can type or edit cell contents, much like in a plain text editor. If the cell is a code cell, then the text is interpreted as Python code. If the cell is a markdown cell, then the text is interpreted as markdown text.
4.6.4 Executing cells
You can execute a cell by pressing Ctrl+Enter, or Shift+Enter. The difference between the two is that the latter advances to the next cell (Table 4.2). These keyboard shortcuts work in both editing and navigation modes.
- When executing a code cell, the code is sent to the underlying Python interpreter, and the result (if any) is displayed below the cell.
- When executing a markdown cell, the text is rendered, so that any formatting (such as bold text marked with
**
) is displayed
Keyboard shortcut | Operation |
---|---|
Ctrl+Enter | Execute current cell |
Shift+Enter | Execute current cell & advance |
Note that you can navigate between cells, skip cells, and execute the cells in a notebook, in any order you choose. The Python process in the background keeps track of the current state of the environment as a result of all code you have run so far. You can always reset the environment using the Kernel→Restart button in the menu. In general, it makes sense to write the notebook assuming the cells will be executed in the given order. To run the entire notebook, from start to finish, you can use the Kernel→Restart & Run All button.
4.6.5 Code output
Code output appears below a code cell, in case it has already been executed. It is important to understand that a Jupyter Notebook, in an .ipynb
file, may store code as well as output. This is unlike a plain Python script, in a .py
file, which stores just Python code, withouth any output. You can always remove all outputs from a notebook using the Kernel→Restart & Clear Output button. Alternatively, you can use the Kernel→Restart & Run All button to run the entire notebook from start to finish, thus calculating (or updating) all code outputs at once.
Usually, code output in a Jupyter notebook is just textual, in which case it is displayed in the notebook exactly the same way as it would appear in a Python command line (see Section 4.5.2), or in an IPython command line (see Section 4.5.3). However, since Jupyter notebooks are displayed in a web browser, they are able to dispay many other types of non-textual output that the command line cannot display. For example, Jupyter notebooks can display images such as plots or maps (Figure 4.10), which result from Python functions.
4.7 Python packages
4.7.1 What are packages?
Packages are code collections that we can import in our Python script, or a Jupyter notebook, to make the functions defined there available to us.
The Python installation comes with numerous built-in packages, known as Python’s standard library (https://docs.python.org/3/library/). For example, the csv
package, which provides a basic method of reading CSV files, is part of the standard library. Packages from the standard library, such as csv
, do not need to be installed. However, they need to be loaded before use (see Section 4.7.3).
There are many Python packages that are not part of the standard Python installation, known as third-party packages, or external packages. For example, pandas
is a third-party Python package which has more advanced methods for reading CSV files than the csv
package. Since third-party packages are not part of the Python installation, they need to be installed in a separate step (see Section 4.7.2). Once installed, third-party packages need to be loaded before use (see Section 4.7.3), just like standard packages.
4.7.2 Installing packages
When using Miniconda, Python packages can be installed from the Miniconda command line, using an expression such as:
conda install -c conda-forge PACKAGE
where the word PACKAGE
needs to be replaced with the package name. For example, here is the specific expression that can be used to install the geopandas
package:
conda install -c conda-forge geopandas
Remember that the Minoconda command line must be started as administrator when installing packages (see Section 4.5.4)!
To run the examples in this tutorial, you need to install several packages, most importantly geopandas
and rasterio
. Installing geopandas
installs several dependencies that we mention in the tutorial, such as numpy
, pandas
and shapely
. Another package that we need to install, to read Excel files, is openpyxl
. Therefore, you need to run three commands, one after the other, to install all required packages:
conda install -c conda-forge jupyterlab
conda install -c conda-forge geopandas
conda install -c conda-forge rasterio
conda install -c conda-forge openpyxl
As part of the installation of each package, you will see a printout of the package dependencies and versions that are going to be installed. Type y
(yes) and press Enter to approve and proceed with the installation (Figure 4.9). (The installation progress may take a few minutes; you may need top press Enter if you see that the process is stuck.)
4.7.3 Loading packages
A package can be loaded using the import
keyword, followed by package name. For example, here is how we can import the pandas
package (for working with tables in Python):
Afterwards, we can access any function from the package, by typing pandas.*
. For example, here we create a Series
(labelled 1-dimensional array) object:
1, 3, 5]) pandas.Series([
0 1
1 3
2 5
dtype: int64
It is common practice to import a package under a different name which is shorter to type, using the import package as name
expression. For example, by convention the pandas
package is usually imported under the name pd
:
import pandas as pd
We can now access all of the functions in pandas
using pd.*
, instead of pandas.*
:
1, 3, 5]) pd.Series([
0 1
1 3
2 5
dtype: int64
4.8 File paths
Simple Python scripts may be self-contained, in the sense that the data they operate on are created as part of the script itself, and the results are not saved to permanent strage (e.g., on the hard disk). However, in realistic data analysis it is often necessary to:
- import existing data from disk at the beginning of the script, and to
- export the results to a file on disk at the end.
For example, here is a short script that reads a Shapefile and plots the contained vector layer:
import geopandas as gpd
= gpd.read_file('data/TAZ_NORTH_POPDENS_2016.shp')
pol ='lightgrey', edgecolor='black', linewidth=0.5); pol.plot(color
Do not worry about the functions and methods used in the code, as we are going to cover them in the tutorial. For now, pay attention just to the the file path "data/TAZ_NORTH_POPDENS_2016.shp"
.
First, note that, in Python code, unlike in the Miniconda command line (see {ref}using-miniconda-command-line
), you must use /
, and not \
when writing file paths!
Second, note that data/TAZ_NORTH_POPDENS_2016.shp
is a relative path, in which we specify a partial path that begins with the notebook location. For example, the relative path "data/TAZ_NORTH_POPDENS_2016.shp"
, means that we refer to:
- the file named
TAZ_NORTH_POPDENS_2016.shp
, - which is located in the
data
directory, - which is located in the directory where our notebook or script is.
Specifying the just a file name, such as "TAZ_NORTH_POPDENS_2016.shp"
is a special case of a relative path, meaning that the file is located in the same directory where our script is.
In all other code examples in the tutorial, we are going to use relative file paths starting with data
, such as "data/TAZ_NORTH_POPDENS_2016.shp"
. To reproduce the examples, therefore, you need to:
- download source tutorial.zip archive, which includes
.ipynb
Jupyter notebooks and thedata
directory with sample data files, - extract the files from the archive,
- open the notebook files
setup.ipynb
(the setup instructions, this document) orindex.ipynb
(the main tutorial) in the Jupyter Notebook interface (see Section 4.5.4).