Table reshaping and joins#
Introduction#
In the previous chapter (see Tables (pandas)), we introduced the pandas
package, and covered the basic operations with tables using pandas
, such as importing and exporting, subsetting, renaming columns, sorting, and calculating new columns. In practice, we often need to do more than that. For example, we may need to aggregate or reshape tables into a different form, or to combine numerous tables into one. In this chapter, we cover some of the common table operations which involve modifying table structure or combining multiple tables.
When working with data, it is often necessary to combine data from multiple sources, and/or reshape data into the right form that a particular function accepts:
Combining may simply involve “pasting”, or concatenating, two tables one on top of the other, or side-by-side (see Concatenation).
In more complex cases, combining involves a table join, where the information from two tables is combined according to one or more “key” columns which are common to both tables (see Joining tables).
Moreover, we may need to summarize the values of a table according to a particular grouping, which is known as aggregation (see Aggregation).
A special type of aggregation is applying a function across each row, or each column, of a table (Row/col-wise operations).
In this chapter, we learn about all of those common methods for combining and reshaping tables. By the end of the chapter, you will have the ability to not just import and transform tables in Python (see Tables (pandas)), but also to reshape and combine tables to the right form, which facilitates further analysis and insight from the data in those tables.
Sample data#
First, let us recreate the small table named stations
which we used in the previous chapter (see Creating a DataFrame):
import numpy as np
import pandas as pd
name = pd.Series(['Beer-Sheva Center', 'Beer-Sheva University', 'Dimona'])
city = pd.Series(['Beer-Sheva', 'Beer-Sheva', 'Dimona'])
lines = pd.Series([4, 5, 1])
piano = pd.Series([False, True, False])
lon = pd.Series([34.798443, 34.812831, 35.011635])
lat = pd.Series([31.243288, 31.260284, 31.068616])
d = {'name': name, 'city': city, 'lines': lines, 'piano': piano, 'lon': lon, 'lat': lat}
stations = pd.DataFrame(d)
stations
name | city | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 |
2 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 |
We will also load the sample climatic data which we are also familiar with from the previous chapter (Reading from file), into a DataFrame
named dat
:
dat = pd.read_csv('data/ZonAnn.Ts+dSST.csv')
dat
Year | Glob | NHem | SHem | 24N-90N | ... | EQU-24N | 24S-EQU | 44S-24S | 64S-44S | 90S-64S | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1880 | -0.16 | -0.29 | -0.04 | -0.37 | ... | -0.15 | -0.09 | -0.03 | 0.05 | 0.65 |
1 | 1881 | -0.08 | -0.17 | 0.01 | -0.34 | ... | 0.10 | 0.12 | -0.05 | -0.07 | 0.57 |
2 | 1882 | -0.11 | -0.21 | 0.00 | -0.31 | ... | -0.05 | -0.04 | 0.02 | 0.04 | 0.60 |
3 | 1883 | -0.17 | -0.28 | -0.06 | -0.35 | ... | -0.17 | -0.14 | -0.03 | 0.07 | 0.48 |
4 | 1884 | -0.28 | -0.43 | -0.14 | -0.61 | ... | -0.13 | -0.15 | -0.19 | -0.02 | 0.63 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
137 | 2017 | 0.92 | 1.18 | 0.67 | 1.40 | ... | 0.87 | 0.77 | 0.76 | 0.35 | 0.55 |
138 | 2018 | 0.85 | 1.04 | 0.66 | 1.25 | ... | 0.73 | 0.62 | 0.79 | 0.37 | 0.96 |
139 | 2019 | 0.98 | 1.22 | 0.75 | 1.43 | ... | 0.91 | 0.89 | 0.74 | 0.39 | 0.83 |
140 | 2020 | 1.02 | 1.36 | 0.68 | 1.68 | ... | 0.89 | 0.84 | 0.58 | 0.39 | 0.91 |
141 | 2021 | 0.85 | 1.15 | 0.55 | 1.43 | ... | 0.73 | 0.59 | 0.71 | 0.32 | 0.33 |
142 rows × 15 columns
Concatenation#
Overview#
The most basic method for combining several tables into one is concatenation, which essentially means binding tables one on top of another, or side-by-side:
Concatenation by row is usually desired when the two (or more) tables share exactly the same columns. For example, combining several tables reflecting the same data (e.g., measured on different dates, or for different subjects) involves concatenation by row.
Concatenation by column is useful when we need to attach new variables for the same observations we already have. For example, we may want to combine several different types of observations which are split across separate files.
In pandas
, concatenation can be done using pd.concat
. The pd.concat
function accepts a list
of tables, which are combined:
By row when using axis=0 (the default)
By column when using axis=1
Concatenation by row#
To demonstrate concatenation by row, let us create two subsets x
and y
of stations
, with the 1st row and the 2nd-3rd rows, respectively:
x = stations.iloc[[0], :] ## 1st row
x
name | city | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 |
y = stations.iloc[[1, 2], :] ## 2nd and 3rd rows
y
name | city | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 |
2 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 |
The expression pd.concat([x, y]) can then be used to concatenate x and y by row, re-creating the complete table stations:
pd.concat([x, y])
name | city | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 |
2 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 |
Note that pd.concat
accepts a list
of DataFrame
s to concatenate, such as [x,y]
. In this example the list
is of length 2, but in general it can be of any length.
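For instance, passing a list of three tables works just the same; here is a minimal sketch reusing x and y from above, with ignore_index=True added so that the rows of the result are renumbered 0,1,2,… rather than keeping the original index values:
pd.concat([x, y, y], ignore_index=True)  ## the repeated 'y' is just for illustration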
Note
There is a special method .append, which is a shortcut for concatenation by row: instead of pd.concat([x,y]) you can use x.append(y). Note, however, that .append was deprecated in recent versions of pandas (and removed in pandas 2.0), so pd.concat is the recommended approach.
Concatenation by column#
To demonstrate concatenation by column, let us now split stations
to subsets x
and y
by column. For example, the x
table is going to contain one column "name"
:
x = stations[['name']]
x
name | |
---|---|
0 | Beer-Sheva Center |
1 | Beer-Sheva University |
2 | Dimona |
and the y
table is going to contain two columns 'lon'
and 'lat'
:
y = stations[['lon', 'lat']]
y
lon | lat | |
---|---|---|
0 | 34.798443 | 31.243288 |
1 | 34.812831 | 31.260284 |
2 | 35.011635 | 31.068616 |
To concatenate by column, we use pd.concat
with the axis=1
argument. This concatenates the tables x
and y
side-by-side:
pd.concat([x, y], axis=1)
name | lon | lat | |
---|---|---|---|
0 | Beer-Sheva Center | 34.798443 | 31.243288 |
1 | Beer-Sheva University | 34.812831 | 31.260284 |
2 | Dimona | 35.011635 | 31.068616 |
Alignment by index#
Note that concatenation—like many other operations in pandas
—operates based on the index, rather than position. This is not a problem when concatenating by row, where columns typically need to be aligned by their index, i.e., column name (see Concatenation by row). When concatenating by column, however, we need to pay attention to the index: is it meaningful, so that we want to use it when aligning rows, or is it arbitrary, so that we want to reset it before concatenating?
For example, here is what happens in concatenation by column, when the index is not aligned (see Setting Series index):
y.index = [2,3,4]
pd.concat([x, y], axis=1)
name | lon | lat | |
---|---|---|---|
0 | Beer-Sheva Center | NaN | NaN |
1 | Beer-Sheva University | NaN | NaN |
2 | Dimona | 34.798443 | 31.243288 |
3 | NaN | 34.812831 | 31.260284 |
4 | NaN | 35.011635 | 31.068616 |
To make sure we concatenate based on position, we may reset the index (using .reset_index(drop=True)
) on both tables before concatenation:
y = y.reset_index(drop=True)
pd.concat([x, y], axis=1)
name | lon | lat | |
---|---|---|---|
0 | Beer-Sheva Center | 34.798443 | 31.243288 |
1 | Beer-Sheva University | 34.812831 | 31.260284 |
2 | Dimona | 35.011635 | 31.068616 |
Row/col-wise operations#
Built-in methods#
pandas defines summary methods which can be used to calculate row-wise and col-wise summaries of DataFrames. The pandas methods are generally similar to the numpy 2D summary functions and methods (see Summaries per dimension (2D)) which we learned about earlier, but there are several important differences to keep in mind:
- pandas only has methods, not functions, i.e., there is a .mean method but there is no pd.mean function—unlike numpy, which mostly defines both, e.g., .mean and np.mean (see Table 14)
- There is no “global” summary option in the pandas methods, i.e., we can only calculate column (axis=0, the default) or row (axis=1) summaries—unlike numpy, where a “global” summary is the default option (see Global summaries)
- The pandas methods exclude “No Data” by default (see Operators with missing values), unless we explicitly specify skipna=False—unlike numpy, which by default includes “No Data”, unless using the specific “No Data”-safe functions such as np.nanmean (see Operations with “No Data”); this difference is illustrated in the short sketch below
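To make the last difference concrete, here is a minimal sketch (with arbitrary values) contrasting the pandas and numpy defaults in the presence of np.nan:
s = pd.Series([1.0, 2.0, np.nan])
s.mean()                                  ## 1.5, "No Data" excluded by default
s.mean(skipna=False)                      ## nan
np.mean(np.array([1.0, 2.0, np.nan]))     ## nan
np.nanmean(np.array([1.0, 2.0, np.nan]))  ## 1.5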
The most commonly used pandas
summary methods are listed in Table 15.
Operation | Method
---|---
Sum | .sum
Minimum | .min
Maximum | .max
Mean | .mean
Median | .median
First value | .first
Last value | .last
Count (of non-missing values) | .count
Number of unique values | .nunique
Index of (first) minimum | .idxmin
Index of (first) maximum | .idxmax
Is at least one element True | .any
Are all elements True | .all
For the next few examples, let us take the regional temperature anomaly time series into a separate DataFrame
named regions
:
cols = ['90S-64S', '64S-44S', '44S-24S', '24S-EQU', 'EQU-24N', '24N-44N', '44N-64N', '64N-90N']
regions = dat[cols]
regions
90S-64S | 64S-44S | 44S-24S | 24S-EQU | EQU-24N | 24N-44N | 44N-64N | 64N-90N | |
---|---|---|---|---|---|---|---|---|
0 | 0.65 | 0.05 | -0.03 | -0.09 | -0.15 | -0.27 | -0.49 | -0.81 |
1 | 0.57 | -0.07 | -0.05 | 0.12 | 0.10 | -0.19 | -0.45 | -0.92 |
2 | 0.60 | 0.04 | 0.02 | -0.04 | -0.05 | -0.15 | -0.28 | -1.42 |
3 | 0.48 | 0.07 | -0.03 | -0.14 | -0.17 | -0.27 | -0.56 | -0.19 |
4 | 0.63 | -0.02 | -0.19 | -0.15 | -0.13 | -0.47 | -0.64 | -1.31 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
137 | 0.55 | 0.35 | 0.76 | 0.77 | 0.87 | 1.05 | 1.38 | 2.54 |
138 | 0.96 | 0.37 | 0.79 | 0.62 | 0.73 | 1.06 | 1.09 | 2.15 |
139 | 0.83 | 0.39 | 0.74 | 0.89 | 0.91 | 1.01 | 1.44 | 2.73 |
140 | 0.91 | 0.39 | 0.58 | 0.84 | 0.89 | 1.20 | 1.82 | 2.92 |
141 | 0.33 | 0.32 | 0.71 | 0.59 | 0.73 | 1.28 | 1.37 | 2.04 |
142 rows × 8 columns
Using the .mean
method combined with axis=0
(i.e., the function is applied on each column), we can calculate the mean temperature anomaly in each region:
regions.mean(axis=0)
90S-64S -0.089507
64S-44S -0.060704
44S-24S 0.036127
24S-EQU 0.079437
EQU-24N 0.054930
24N-44N 0.036549
44N-64N 0.122042
64N-90N 0.236761
dtype: float64
Note that axis=0 is the default, so we can also get the same result without specifying it. Unlike numpy (see Global summaries), in pandas there is no “global” option for summary methods such as .mean, only row-wise or col-wise “modes”:
regions.mean()
90S-64S -0.089507
64S-44S -0.060704
44S-24S 0.036127
24S-EQU 0.079437
EQU-24N 0.054930
24N-44N 0.036549
44N-64N 0.122042
64N-90N 0.236761
dtype: float64
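Other methods from Table 15 work the same way. For example, a minimal sketch using .idxmax and .max, which return, for each region (column), the row label of the warmest year and the anomaly value itself, respectively:
regions.idxmax()  ## row label of the maximum anomaly, per column
regions.max()     ## the maximum anomaly itself, per column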
Exercise 06-a
What if we need to, nevertheless, calculate a “global” summary of DataFrame values, such as the mean of values in regions? How can we do that? (answer: 0.05)
Conversely, using axis=1
, we can calculate row means, i.e., the average temperature anomaly for each year across all regions:
regions.mean(axis=1)
0 -0.14250
1 -0.11125
2 -0.16000
3 -0.10125
4 -0.28500
...
137 1.03375
138 0.97125
139 1.11750
140 1.19375
141 0.92125
Length: 142, dtype: float64
If necessary, the result can be assigned into a new column (see Creating new columns), here named mean, as follows:
dat['mean'] = regions.mean(axis=1)
dat
Year | Glob | NHem | SHem | 24N-90N | ... | 24S-EQU | 44S-24S | 64S-44S | 90S-64S | mean | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1880 | -0.16 | -0.29 | -0.04 | -0.37 | ... | -0.09 | -0.03 | 0.05 | 0.65 | -0.14250 |
1 | 1881 | -0.08 | -0.17 | 0.01 | -0.34 | ... | 0.12 | -0.05 | -0.07 | 0.57 | -0.11125 |
2 | 1882 | -0.11 | -0.21 | 0.00 | -0.31 | ... | -0.04 | 0.02 | 0.04 | 0.60 | -0.16000 |
3 | 1883 | -0.17 | -0.28 | -0.06 | -0.35 | ... | -0.14 | -0.03 | 0.07 | 0.48 | -0.10125 |
4 | 1884 | -0.28 | -0.43 | -0.14 | -0.61 | ... | -0.15 | -0.19 | -0.02 | 0.63 | -0.28500 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
137 | 2017 | 0.92 | 1.18 | 0.67 | 1.40 | ... | 0.77 | 0.76 | 0.35 | 0.55 | 1.03375 |
138 | 2018 | 0.85 | 1.04 | 0.66 | 1.25 | ... | 0.62 | 0.79 | 0.37 | 0.96 | 0.97125 |
139 | 2019 | 0.98 | 1.22 | 0.75 | 1.43 | ... | 0.89 | 0.74 | 0.39 | 0.83 | 1.11750 |
140 | 2020 | 1.02 | 1.36 | 0.68 | 1.68 | ... | 0.84 | 0.58 | 0.39 | 0.91 | 1.19375 |
141 | 2021 | 0.85 | 1.15 | 0.55 | 1.43 | ... | 0.59 | 0.71 | 0.32 | 0.33 | 0.92125 |
142 rows × 16 columns
Here is a visualization (see Line plots (pandas)) of the mean time series we just calculated:
dat.set_index('Year')['mean'].plot();
apply and custom functions#
In case we need to apply a custom function on each row, or column, we can use the .apply method combined with a lambda function (see Lambda functions). For example, here is an alternative way to calculate the column means, using the .apply method and a lambda function. The result is the same as regions.mean() (see Built-in methods):
regions.apply(lambda x: x.mean())
90S-64S -0.089507
64S-44S -0.060704
44S-24S 0.036127
24S-EQU 0.079437
EQU-24N 0.054930
24N-44N 0.036549
44N-64N 0.122042
64N-90N 0.236761
dtype: float64
The advantage is that we can use any custom expression inside the lambda function definition. For example, the following expression calculates whether any region had a temperature anomaly above 1 degree, for each of the 142 years. Note the axis=1 part, which determines that, this time, the function is applied on each row:
regions.apply(lambda x: (x > 1).any(), axis=1)
0 False
1 False
2 False
3 False
4 False
...
137 True
138 True
139 True
140 True
141 True
Length: 142, dtype: bool
Exercise 06-b
Function np.polyfit can be used to calculate the slope of a linear regression between two columns. For example, using the code section shown below, you can find out that the linear trend of temperature in the "44S-24S" region was an increasing one, at 0.0066 degrees per year.
- Adapt the code into the form of a function named slope which, given two Series objects x and y, calculates the linear slope. For example, slope(dat["Year"], regions["44S-24S"]) should return 0.0066.
- Use the slope function you defined within the .apply method, to calculate the slope in all columns of regions at once.
- Hint: you need to keep the x argument of slope fixed at dat["Year"], while the various columns of regions are passed to y. To do that, write another function (or use a lambda function) that “wraps” slope and runs it with just the y parameter (keeping x fixed).
x = dat["Year"]
y = regions["44S-24S"]
np.polyfit(x, y, 1)[0]
Joining tables#
Overview#
One of the most common operations when working with data is a table join. In a table join, two tables are combined into one, based on one or more common column(s). For example, when compiling a dataset about different cities, we may collect data on different aspects of each city (population size, socio-economic status, built area, etc.) from different sources. To work with these data together, the various tables need to be joined into one. The common column, in this case, would be the city name or city ID.
When working with pandas, tables can be joined, based on one or more common columns, using the pd.merge function. The pd.merge function has the following important parameters:
- left—the first (“left”) table
- right—the second (“right”) table
- how—the type of join, one of: 'left', 'right', 'outer', 'inner' (the default), 'cross'
- on—the name(s) of the common column(s) where identical values are considered a “match”
The type of join (the how
parameter) basically determines what should be done with non-matching records. The options are 'left'
, 'right'
, 'outer'
, 'inner'
(the default), and 'cross'
. The meaning of the join types is the same as in SQL 1, and will be demonstrated through examples below.
Stations example#
For a first small example, let us get back to the stations
table:
stations
name | city | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 |
2 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 |
Suppose we have another small table named status
which describes the status ('Closed'
or 'Open'
) of select stations:
name = pd.Series(['Beer-Sheva Center', 'Beer-Sheva University', 'Tel-Aviv University'])
status = pd.Series(['Closed', 'Open', 'Open'])
status = pd.DataFrame({'name': name, 'status': status})
status
name | status | |
---|---|---|
0 | Beer-Sheva Center | Closed |
1 | Beer-Sheva University | Open |
2 | Tel-Aviv University | Open |
The stations
and status
can be joined as follows:
pd.merge(stations, status)
name | city | lines | piano | lon | lat | status | |
---|---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 | Closed |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 | Open |
This relies on the defaults: on is set to the column(s) sharing the same name in both tables (on='name', in this case), and the join type is 'inner' (i.e., how='inner'). As you can see, an inner join returns just the matching rows. It is better to be explicit, for clarity:
pd.merge(stations, status, on='name', how='inner')
name | city | lines | piano | lon | lat | status | |
---|---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 | Closed |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 | Open |
Unlike an 'inner'
join, a 'left'
join returns all records from the first (“left”) table, even those that have no match in the second (“right”) table. For example, 'Dimona'
station does not appear in the status
table, but it is still returned in the result (with status
value set to np.nan
):
pd.merge(stations, status, on='name', how='left')
name | city | lines | piano | lon | lat | status | |
---|---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 | Closed |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 | Open |
2 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 | NaN |
A 'right'
join is the opposite of a 'left'
join, i.e., it returns all records from the “right” table, even those that have no match in the “left” table. This time, values absent in the “left” table are set to np.nan
:
pd.merge(stations, status, on='name', how='right')
name | city | lines | piano | lon | lat | status | |
---|---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4.0 | False | 34.798443 | 31.243288 | Closed |
1 | Beer-Sheva University | Beer-Sheva | 5.0 | True | 34.812831 | 31.260284 | Open |
2 | Tel-Aviv University | NaN | NaN | NaN | NaN | NaN | Open |
Finally, an 'outer' (also known as “full”) join returns all records from both tables, regardless of whether they have a matching record in the other table. Values for records that have no match in the other table are set to np.nan:
pd.merge(stations, status, on='name', how='outer')
name | city | lines | piano | lon | lat | status | |
---|---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4.0 | False | 34.798443 | 31.243288 | Closed |
1 | Beer-Sheva University | Beer-Sheva | 5.0 | True | 34.812831 | 31.260284 | Open |
2 | Dimona | Dimona | 1.0 | False | 35.011635 | 31.068616 | NaN |
3 | Tel-Aviv University | NaN | NaN | NaN | NaN | NaN | Open |
When working with data, we often have one main, or “template”, table we are working with (“left”), and we are interested in joining data coming from another, external (“right”) table to it. In this scenario we usually prefer not to lose any records from the “left” table. Even if some of the records have no match and get “No Data” values (namely, np.nan), we usually prefer to keep those records and deal with the missing values as necessary, rather than lose the records altogether. Therefore, a 'left' join is often the most useful in data analysis. For example, in the above situation a 'left' join makes sense if our focus is on the station properties given in stations: even though the status of the 'Dimona' station is unknown, we prefer to keep the 'Dimona' record.
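After a 'left' join, the missing values can then be handled explicitly, if needed. For example, here is a minimal sketch which fills the missing status with a placeholder (the 'Unknown' label is arbitrary):
result = pd.merge(stations, status, on='name', how='left')
result['status'] = result['status'].fillna('Unknown')  ## replace np.nan with a placeholder
result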
What is GTFS?#
In the next examples, we demonstrate table joins with a more realistic, large dataset on public transport known as GTFS. Before we go into the example, we need to be familiar with the GTFS format and structure.
The General Transit Feed Specification (GTFS) is a standard format for describing public transport routes and timetables. GTFS datasets are typically published by government transport agencies, periodically. For example, the GTFS dataset provided as part of the book’s sample data was downloaded from the Israel Ministry of Transport GTFS website on 2021-04-18. GTFS data is also used commercially, for example for routing in Google Maps.
Technically, a GTFS dataset is composed of several CSV files (even though their file extension is .txt
), which are linked via common columns, or keys. In this book, we are going to use only some of the files in the GTFS datasets, namely the six files listed in Table 16. Recall that we already met one of the GTFS files, namely stops.txt
, earlier, in an example of text file processing through for
loops (see File object and reading lines).
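To get oriented, we can list the files comprising the GTFS dataset; a minimal sketch using the standard glob module (assuming the data/gtfs directory used in the examples below):
import glob
sorted(glob.glob('data/gtfs/*.txt'))  ## paths of the GTFS component files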
Name | Contents | Keys
---|---|---
agency.txt | Agencies | agency_id
routes.txt | Routes | route_id, agency_id
trips.txt | Trips | trip_id, route_id, shape_id
stop_times.txt | Stop times | trip_id, stop_id
stops.txt | Stops | stop_id
shapes.txt | Shapes | shape_id
Let us see what the GTFS tables look like, what their purpose is, and how they are related to one another. To make things easier, we immediately subset the most important columns, which are also the ones we use in subsequent examples.
We start with agency.txt
. This file lists the public transport agencies. Each row represents one agency, identified by agency_id
:
filename = 'data/gtfs/agency.txt'
cols = ['agency_id', 'agency_name']
pd.read_csv(filename)[cols]
agency_id | agency_name | |
---|---|---|
0 | 2 | רכבת ישראל |
1 | 3 | אגד |
2 | 4 | אגד תעבורה |
3 | 5 | דן |
4 | 6 | ש.א.מ |
... | ... | ... |
33 | 95 | מוניות יהלום תחבורה |
34 | 96 | מוניות גלים |
35 | 97 | אודליה מוניות בעמ |
36 | 98 | מוניות רב קווית 4-5 |
37 | 130 | מוניות הדר לוד |
38 rows × 2 columns
Note
pd.read_csv
uses the 'utf-8'
encoding by default. All sample CSV files in this book are in the UTF-8 encoding, so there is no need to specify the encoding when reading them with pd.read_csv
. Otherwise, the right encoding can be specified with the encoding
parameter of pd.read_csv
.
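For instance, a file saved in a different encoding could be read as follows (a hypothetical example; both the file name and the encoding are for illustration only):
pd.read_csv('data/some_file.csv', encoding='windows-1255')  ## hypothetical non-UTF-8 file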
The next file is routes.txt
. This file describes the public transport routes, identified by route_id
and also specifying the name of the route:
- route_short_name—typically a number, such as a bus line number
- route_long_name—the textual description of the route
It also contains agency_id
to identify the agency which operates each route:
path = 'data/gtfs/routes.txt'
cols = ['route_id', 'agency_id', 'route_short_name', 'route_long_name']
pd.read_csv(path)[cols]
route_id | agency_id | route_short_name | route_long_name | |
---|---|---|---|---|
0 | 1 | 25 | 1 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... |
1 | 2 | 25 | 1 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... |
2 | 3 | 25 | 2 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... |
3 | 5 | 25 | 2 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... |
4 | 7 | 25 | 3 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... |
... | ... | ... | ... | ... |
7246 | 29363 | 18 | 450 | מסוף האוניברסיטה-תל אביב יפו<->... |
7247 | 29364 | 2 | 125 | תל אביב ההגנה-תל אביב יפו<->הרצ... |
7248 | 29365 | 2 | 69 | רחובות-רחובות<->תל אביב מרכז-תל... |
7249 | 29366 | 2 | 60 | נתניה-נתניה<->אשקלון-אשקלון |
7250 | 29368 | 2 | 21 | נהריה-נהריה<->תל אביב ההגנה-תל ... |
7251 rows × 4 columns
The routes.txt
table is linked to the trips.txt
table, which characterizes trips taking place at specific times of day for a given route. Trips are uniquely identified by trip_id
and linked with the routes table via the route_id
column. As you can see from the small subset below, each route may be associated with multiple trips. The trips table also has a shape_id
column which associates a trip with the geographical line geometry (see shapes.txt
below). You can see that trips of the same route at different times of day typically follow exactly the same geographical shape, thus having identical shape_id
values, but this is not always guaranteed:
filename = 'data/gtfs/trips.txt'
cols = ['route_id', 'trip_id', 'shape_id']
pd.read_csv(filename)[cols]
route_id | trip_id | shape_id | |
---|---|---|---|
0 | 1 | 28876224_180421 | 121757.0 |
1 | 1 | 28876225_180421 | 121757.0 |
2 | 1 | 28876226_180421 | 121757.0 |
3 | 1 | 28876227_180421 | 121757.0 |
4 | 1 | 28876228_180421 | 121757.0 |
... | ... | ... | ... |
287673 | 29363 | 56622686_230421 | 123144.0 |
287674 | 29364 | 56460152_180421 | NaN |
287675 | 29365 | 56458677_180421 | 67648.0 |
287676 | 29366 | 56458667_180421 | 51396.0 |
287677 | 29368 | 56458657_180421 | 123033.0 |
287678 rows × 3 columns
The stop_times.txt
table lists the stops associated with each trip (via the trip_id
column), including the (estimated) time of arrival at each stop. The stop times table also includes the stop_id
column to associate it with the stop characteristics (see stops.txt
below). In addition, the shape_dist_traveled
column contains the cumulative distance traveled at each stop. You can see the increasing distance traveled from one stop to another (in the same trip) in the printout below. The stop_times.txt table is the largest one, in terms of file size, among the GTFS files: keep in mind that it lists all stops, at all times of day, of all public transport routes:
filename = 'data/gtfs/stop_times.txt'
cols = ['trip_id', 'arrival_time', 'departure_time', 'stop_id', 'shape_dist_traveled']
pd.read_csv(filename)[cols]
trip_id | arrival_time | departure_time | stop_id | shape_dist_traveled | |
---|---|---|---|---|---|
0 | 10021427_240421 | 00:00:00 | 00:00:00 | 36133 | 0 |
1 | 10021427_240421 | 00:01:43 | 00:01:43 | 2356 | 499 |
2 | 10021427_240421 | 00:03:36 | 00:03:36 | 3869 | 1838 |
3 | 10021427_240421 | 00:05:31 | 00:05:31 | 2357 | 2692 |
4 | 10021427_240421 | 00:06:32 | 00:06:32 | 3803 | 3379 |
... | ... | ... | ... | ... | ... |
10583641 | 9967472_180421 | 11:39:25 | 11:39:25 | 34662 | 37457 |
10583642 | 9967472_180421 | 11:39:58 | 11:39:58 | 34492 | 37729 |
10583643 | 9967472_180421 | 11:40:53 | 11:40:53 | 34498 | 38112 |
10583644 | 9967472_180421 | 11:41:44 | 11:41:44 | 34500 | 38495 |
10583645 | 9967472_180421 | 11:43:31 | 11:43:31 | 34503 | 39061 |
10583646 rows × 5 columns
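Since stop_times.txt is so large, it may be worthwhile to read only the required columns in the first place, rather than reading everything and then subsetting. A minimal sketch using the usecols parameter of pd.read_csv:
cols = ['trip_id', 'arrival_time', 'departure_time', 'stop_id', 'shape_dist_traveled']
stop_times = pd.read_csv('data/gtfs/stop_times.txt', usecols=cols)  ## read only the needed columns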
The stops.txt
table, which we already worked with earlier (see Reading CSV files), describes the fixed (i.e., unrelated to time of day) properties of the stops. It is linked with the stop_times.txt table via the stop_id
column. Stop properties include the stop name (stop_name
), and its coordinates (stop_lon
and stop_lat
):
filename = 'data/gtfs/stops.txt'
cols = ['stop_id', 'stop_name', 'stop_lat', 'stop_lon']
pd.read_csv(filename)[cols]
stop_id | stop_name | stop_lat | stop_lon | |
---|---|---|---|---|
0 | 1 | בי''ס בר לב/בן יהודה | 32.183985 | 34.917554 |
1 | 2 | הרצל/צומת בילו | 31.870034 | 34.819541 |
2 | 3 | הנחשול/הדייגים | 31.984553 | 34.782828 |
3 | 6 | ת. מרכזית לוד/הורדה | 31.956392 | 34.898098 |
4 | 8 | הרצל/משה שרת | 31.857565 | 34.824106 |
... | ... | ... | ... | ... |
28250 | 44668 | חניון שפירים דרומי/הורדה | 32.001228 | 34.840406 |
28251 | 44683 | מתחם נצבא | 32.112445 | 34.941549 |
28252 | 44684 | מתחם נצבא | 32.111795 | 34.940668 |
28253 | 44691 | מסוף דרך שכם | 31.783680 | 35.228825 |
28254 | 44692 | הקוממיות/אנה פרנק | 32.002782 | 34.752376 |
28255 rows × 4 columns
Finally, the shapes.txt
table contains the trip geometries. It is linked with the trips.txt
table via the common shape_id
column.
filename = 'data/gtfs/shapes.txt'
cols = ['shape_id', 'shape_pt_lat', 'shape_pt_lon', 'shape_pt_sequence']
pd.read_csv(filename)[cols]
shape_id | shape_pt_lat | shape_pt_lon | shape_pt_sequence | |
---|---|---|---|---|
0 | 44779 | 31.887695 | 35.016271 | 1 |
1 | 44779 | 31.887745 | 35.016253 | 2 |
2 | 44779 | 31.888256 | 35.016238 | 3 |
3 | 44779 | 31.888913 | 35.016280 | 4 |
4 | 44779 | 31.888917 | 35.016892 | 5 |
... | ... | ... | ... | ... |
7226268 | 123167 | 32.729572 | 35.331201 | 876 |
7226269 | 123167 | 32.729700 | 35.330923 | 877 |
7226270 | 123167 | 32.729706 | 35.330908 | 878 |
7226271 | 123167 | 32.729841 | 35.330538 | 879 |
7226272 | 123167 | 32.729899 | 35.330249 | 880 |
7226273 rows × 4 columns
The relations between the six above-mentioned files in the GTFS dataset are summarized in Fig. 30.
GTFS example#
Now that we have an idea about the structure and purpose of the GTFS dataset, let us demonstrate join operations between its tables. This is a classical example where joining tables is essential: you cannot go very far with GTFS data analysis, or processing, without joining the various tables together (see Fig. 30).
To make things simple, in this example, consider just two of the tables in the GTFS dataset:
- agency.txt, where rows represent public transport operators
- routes.txt, where rows represent public transport routes
As shown above (see What is GTFS?), the two tables share a common column agency_id
, which we can use to join the two, to find out which public routes belong to each operator.
First, let us import the agency
table and subset the columns of interest, "agency_id"
and "agency_name"
:
agency = pd.read_csv('data/gtfs/agency.txt')
agency = agency[['agency_id', 'agency_name']]
agency
agency_id | agency_name | |
---|---|---|
0 | 2 | רכבת ישראל |
1 | 3 | אגד |
2 | 4 | אגד תעבורה |
3 | 5 | דן |
4 | 6 | ש.א.מ |
... | ... | ... |
33 | 95 | מוניות יהלום תחבורה |
34 | 96 | מוניות גלים |
35 | 97 | אודליה מוניות בעמ |
36 | 98 | מוניות רב קווית 4-5 |
37 | 130 | מוניות הדר לוד |
38 rows × 2 columns
Second, also import the routes
table, this time selecting the "route_id"
, "agency_id"
, "route_short_name"
, and "route_long_name"
columns. Note the common "agency_id"
column, which we are going to use when joining the two tables:
routes = pd.read_csv('data/gtfs/routes.txt')
routes = routes[['route_id', 'agency_id', 'route_short_name', 'route_long_name']]
routes
route_id | agency_id | route_short_name | route_long_name | |
---|---|---|---|---|
0 | 1 | 25 | 1 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... |
1 | 2 | 25 | 1 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... |
2 | 3 | 25 | 2 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... |
3 | 5 | 25 | 2 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... |
4 | 7 | 25 | 3 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... |
... | ... | ... | ... | ... |
7246 | 29363 | 18 | 450 | מסוף האוניברסיטה-תל אביב יפו<->... |
7247 | 29364 | 2 | 125 | תל אביב ההגנה-תל אביב יפו<->הרצ... |
7248 | 29365 | 2 | 69 | רחובות-רחובות<->תל אביב מרכז-תל... |
7249 | 29366 | 2 | 60 | נתניה-נתניה<->אשקלון-אשקלון |
7250 | 29368 | 2 | 21 | נהריה-נהריה<->תל אביב ההגנה-תל ... |
7251 rows × 4 columns
Now, we can join the routes
and agency
tables based on the common 'agency_id'
column:
routes = pd.merge(routes, agency, on='agency_id', how='left')
routes
route_id | agency_id | route_short_name | route_long_name | agency_name | |
---|---|---|---|---|---|
0 | 1 | 25 | 1 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | אפיקים |
1 | 2 | 25 | 1 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... | אפיקים |
2 | 3 | 25 | 2 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | אפיקים |
3 | 5 | 25 | 2 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... | אפיקים |
4 | 7 | 25 | 3 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | אפיקים |
... | ... | ... | ... | ... | ... |
7246 | 29363 | 18 | 450 | מסוף האוניברסיטה-תל אביב יפו<->... | קווים |
7247 | 29364 | 2 | 125 | תל אביב ההגנה-תל אביב יפו<->הרצ... | רכבת ישראל |
7248 | 29365 | 2 | 69 | רחובות-רחובות<->תל אביב מרכז-תל... | רכבת ישראל |
7249 | 29366 | 2 | 60 | נתניה-נתניה<->אשקלון-אשקלון | רכבת ישראל |
7250 | 29368 | 2 | 21 | נהריה-נהריה<->תל אביב ההגנה-תל ... | רכבת ישראל |
7251 rows × 5 columns
Now we have the additional 'agency_name'
column in the routes
table, which tells us the name of the operator for each route.
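The joined column can now be used like any other. For example, a minimal sketch subsetting the routes operated by one of the agencies shown above (רכבת ישראל, i.e., Israel Railways):
routes[routes['agency_name'] == 'רכבת ישראל']  ## routes operated by a single agency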
Note
For more information and examples of pd.merge, see the pandas documentation on merging tables: https://pandas.pydata.org/docs/user_guide/merging.html.
Aggregation#
Overview#
Aggregation is commonly used to summarize tabular data. In an aggregation procedure, a function is applied on subsets of table rows, where the subsets are specified through one or more grouping variables, to obtain a new (smaller) table with summaries per group. Aggregation is one of the techniques following the split-apply-combine concept.
In pandas, there are several different ways of specifying the function(s) to be applied on the table column(s). We will go over three commonly used scenarios:
- the same function applied on all columns (see Same function on all columns)
- a different function applied on each column (see Different functions (agg))
- a custom function (see Custom functions (agg))
Afterwards, we are going to introduce a useful shortcut for a particularly common type of aggregation: counting the occurrence of each unique value in a column (see Value counts).
Same function on all columns#
For a minimal example of aggregation, let us consider the small table named stations
:
stations
name | city | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 |
2 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 |
Suppose that we are interested in the number of lines (the lines
column) and the number of stations with a piano (the piano
column), which involves summing the values in those two columns. We already know how the totals, i.e., sums of each column, can be calculated (see Row/col-wise operations):
stations[['lines', 'piano']].sum()
lines 10
piano 1
dtype: int64
However, what if we wanted to calculate the number of lines and stations with a piano per town, separately? To do that, we basically need to split the table according to the 'city'
column, then apply the .sum
method on each subset. Here is how this can be done. Instead of simply using .sum
, we split (or “group”) the table by 'city'
, and only then apply the .sum
method. Splitting is done using the .groupby
method, which accepts the column name (or list
of column names) that specifies the grouping:
stations[['city', 'lines', 'piano']].groupby('city').sum()
lines | piano | |
---|---|---|
city | ||
Beer-Sheva | 9 | 1 |
Dimona | 1 | 0 |
Note that the grouping variable(s) become the index of the resulting DataFrame
. For our purposes, we are usually going to prefer to work with the grouping variable in its own column. In such case, we end the expression with .reset_index()
(see Resetting the index):
stations[['city', 'lines', 'piano']].groupby('city').sum().reset_index()
city | lines | piano | |
---|---|---|---|
0 | Beer-Sheva | 9 | 1 |
1 | Dimona | 1 | 0 |
Note
Tables can also be grouped by two or more columns, in which case the column names are passed in a list
, as in dat.groupby(['a','b'])
, where 'a'
and 'b'
are column names in dat
.
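For instance, here is a minimal sketch grouping the small stations table by both 'city' and 'piano' at once:
stations.groupby(['city', 'piano'])[['lines']].sum().reset_index()  ## one row per (city, piano) combination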
The .sum method is just one example; there are many other useful methods we can use, such as the ones listed in Table 15. For example, to get the first row per group we can combine .groupby with the .first method:
stations.groupby('city').first().reset_index()
city | name | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
0 | Beer-Sheva | Beer-Sheva Center | 4 | False | 34.798443 | 31.243288 |
1 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 |
For a more realistic example, consider the routes
table:
routes
route_id | agency_id | route_short_name | route_long_name | agency_name | |
---|---|---|---|---|---|
0 | 1 | 25 | 1 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | אפיקים |
1 | 2 | 25 | 1 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... | אפיקים |
2 | 3 | 25 | 2 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | אפיקים |
3 | 5 | 25 | 2 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... | אפיקים |
4 | 7 | 25 | 3 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | אפיקים |
... | ... | ... | ... | ... | ... |
7246 | 29363 | 18 | 450 | מסוף האוניברסיטה-תל אביב יפו<->... | קווים |
7247 | 29364 | 2 | 125 | תל אביב ההגנה-תל אביב יפו<->הרצ... | רכבת ישראל |
7248 | 29365 | 2 | 69 | רחובות-רחובות<->תל אביב מרכז-תל... | רכבת ישראל |
7249 | 29366 | 2 | 60 | נתניה-נתניה<->אשקלון-אשקלון | רכבת ישראל |
7250 | 29368 | 2 | 21 | נהריה-נהריה<->תל אביב ההגנה-תל ... | רכבת ישראל |
7251 rows × 5 columns
We can use aggregation with the .nunique method (Table 15) to calculate the number of unique route IDs that each public transport agency operates, as follows:
tmp = routes[['agency_name', 'route_id']].groupby('agency_name').nunique().reset_index()
tmp
agency_name | route_id | |
---|---|---|
0 | אגד | 1869 |
1 | אגד תעבורה | 420 |
2 | אודליה מוניות בעמ | 1 |
3 | אפיקים | 616 |
4 | גי.בי.טורס | 51 |
... | ... | ... |
26 | נתיב אקספרס | 598 |
27 | סופרבוס | 493 |
28 | קווים | 1187 |
29 | רכבת ישראל | 111 |
30 | ש.א.מ | 153 |
31 rows × 2 columns
To find out which agencies have the most routes, the resulting table can be sorted using .sort_values
combined with ascending=False
(see Sorting):
tmp.sort_values('route_id', ascending=False)
agency_name | route_id | |
---|---|---|
0 | אגד | 1869 |
28 | קווים | 1187 |
24 | מטרופולין | 625 |
3 | אפיקים | 616 |
26 | נתיב אקספרס | 598 |
... | ... | ... |
20 | מוניות מטרו קו | 3 |
19 | מוניות מאיה יצחק שדה | 3 |
18 | כרמלית | 2 |
17 | כפיר | 2 |
2 | אודליה מוניות בעמ | 1 |
31 rows × 2 columns
Exercise 06-c
Calculate a series of global temperature means for each decade, based on the information in dat.
- First, you need to create a new variable named decade based on the "Year" column, using an expression such as the one shown below.
- Second, use .groupby to calculate the mean global temperature in each decade.
dat["Year"] // 10 * 10
Exercise 06-d
The routes.txt table describes public transport routes (uniquely identified by route_id). The trips.txt table, however, describes specific journeys along those routes, at a specific time of day (uniquely identified by trip_id). The two tables are associated with one another via the common route_id column (see What is GTFS?). For example, a route that operates five times a day will be represented in one row in the routes.txt table, and five rows in the trips.txt table.
- Join routes.txt to trips.txt, using the common route_id column.
- Calculate a table with the number of unique trip_id’s per route_id.
- Sort the table according to the number of unique trips, and print a summary with the numbers and names (route_short_name and route_long_name columns from routes.txt, respectively) of the most and least frequent routes (Fig. 31).
route_id | route_short_name | route_long_name | trip_id | |
---|---|---|---|---|
678 | 2256 | 4 | מסוף הטייסים-תל אביב יפו<->מסוף... | 1001 |
677 | 2255 | 4 | מסוף רידינג/רציפים-תל אביב יפו<... | 934 |
3170 | 11525 | 6 | אבי עזרי א-מודיעין עילית<->רמב'... | 840 |
3171 | 11526 | 6 | שדרות בית שמאי/שדרות בית הלל-מו... | 830 |
2746 | 10509 | 1 | קדושת לוי/שלום רב-ביתר עילית<->... | 748 |
... | ... | ... | ... | ... |
2291 | 9539 | 24 | חמת גדר-עמק הירדן<->ת. מרכזית ט... | 1 |
1190 | 4402 | 251 | ת. מרכזית המפרץ/רציפים בינעירונ... | 1 |
1199 | 4443 | 271 | ת. מרכזית נהריה/רציפים-נהריה<->... | 1 |
1204 | 4450 | 273 | אוניברסיטה/רב תכליתי-חיפה<->ת. ... | 1 |
7250 | 29368 | 21 | נהריה-נהריה<->תל אביב ההגנה-תל ... | 1 |
7251 rows × 4 columns
Different functions (agg)#
Sometimes we need to aggregate a DataFrame
while applying a different function on each column, rather than the same function on all columns. For the next example, consider the world_cities.csv
file, which contains information about all major cities in the world:
cities = pd.read_csv('data/world_cities.csv')
cities
city | country | pop | lat | lon | capital | |
---|---|---|---|---|---|---|
0 | 'Abasan al-Jadidah | Palestine | 5629 | 31.31 | 34.34 | 0 |
1 | 'Abasan al-Kabirah | Palestine | 18999 | 31.32 | 34.35 | 0 |
2 | 'Abdul Hakim | Pakistan | 47788 | 30.55 | 72.11 | 0 |
3 | 'Abdullah-as-Salam | Kuwait | 21817 | 29.36 | 47.98 | 0 |
4 | 'Abud | Palestine | 2456 | 32.03 | 35.07 | 0 |
... | ... | ... | ... | ... | ... | ... |
43640 | az-Zubayr | Iraq | 124611 | 30.39 | 47.71 | 0 |
43641 | az-Zulfi | Saudi Arabia | 54070 | 26.30 | 44.80 | 0 |
43642 | az-Zuwaytinah | Libya | 21984 | 30.95 | 20.12 | 0 |
43643 | s-Gravenhage | Netherlands | 479525 | 52.07 | 4.30 | 0 |
43644 | s-Hertogenbosch | Netherlands | 135529 | 51.68 | 5.30 | 0 |
43645 rows × 6 columns
Suppose we want to summarize the information per country ("country"
column). However, we want each column to be treated differently:
- The city names ('city' column) will be counted
- Population ('pop' column) will be summed
- Longitude and latitude ('lon' and 'lat' columns, respectively) will be averaged
This can be done using the .agg
method applied on a grouped DataFrame
, instead of a specific function, such as nunique
. The .agg
method accepts a dictionary, with key:value
pairs of the form 'column':function
, where:
- 'column'—name of the column to be aggregated
- function—the function to apply on each group in that column. This can be: a string referring to a pandas function (such as 'nunique', see Table 15), or a plain function that can be applied on a Series, such as np.mean or np.sum
In our case, the aggregation expression is as follows. As a result, we get a new table where each row represents a country, with the count of unique city names, the sum of the population across all cities, and the average longitude and latitude across all cities:
cities.groupby('country').agg({
'city': 'nunique',
'pop': np.sum,
'lon': np.mean,
'lat': np.mean
}).reset_index()
country | city | pop | lon | lat | |
---|---|---|---|---|---|
0 | Afghanistan | 115 | 7543856 | 66.920342 | 34.807692 |
1 | Albania | 67 | 1536232 | 19.975522 | 41.087463 |
2 | Algeria | 315 | 20508642 | 3.535443 | 35.467025 |
3 | American Samoa | 35 | 58021 | -170.663714 | -14.299143 |
4 | Andorra | 7 | 69031 | 1.538571 | 42.534286 |
... | ... | ... | ... | ... | ... |
234 | Wallis and Futuna | 23 | 11380 | -176.585652 | -13.512609 |
235 | Western Sahara | 4 | 338786 | -13.820000 | 25.935000 |
236 | Yemen | 30 | 5492077 | 45.088667 | 14.481000 |
237 | Zambia | 73 | 4032170 | 28.179315 | -13.404247 |
238 | Zimbabwe | 76 | 4231859 | 30.442500 | -18.629211 |
239 rows × 5 columns
Custom functions (agg)#
Finally, sometimes we need to apply a custom function when aggregating a table. For example, suppose that we need to calculate the population difference between the most populated and least populated cities in each country. To do that, we can define our own function, named diff_max_min
, which:
accepts a
Series
named x, and returns the difference between the maximum and minimum,
x.max()-x.min()
.
Here is the function definition:
def diff_max_min(x):
return x.max() - x.min()
It is worthwhile to test the function, to make sure it works as expected:
diff_max_min(pd.Series([15, 20, 4, 12, 6]))
16
Now, we can use the function inside the .agg
expression, just like any other predefined function:
cities.groupby('country').agg({
'city': 'nunique',
'pop': diff_max_min
}).reset_index()
country | city | pop | |
---|---|---|---|
0 | Afghanistan | 115 | 3117579 |
1 | Albania | 67 | 379803 |
2 | Algeria | 315 | 2024344 |
3 | American Samoa | 35 | 11341 |
4 | Andorra | 7 | 17740 |
... | ... | ... | ... |
234 | Wallis and Futuna | 23 | 1142 |
235 | Western Sahara | 4 | 146906 |
236 | Yemen | 30 | 1918607 |
237 | Zambia | 73 | 1305210 |
238 | Zimbabwe | 76 | 1574930 |
239 rows × 3 columns
For example, here is how the “new” pop value is obtained for the first country ('Afghanistan'
):
diff_max_min(cities[cities['country'] == 'Afghanistan']['pop'])
3117579
Alternatively, we can use a lambda function (see Lambda functions), to do the same, using more concise syntax:
cities.groupby('country').agg({
'city': 'nunique',
'pop': lambda x: x.max() - x.min()
}).reset_index()
country | city | pop | |
---|---|---|---|
0 | Afghanistan | 115 | 3117579 |
1 | Albania | 67 | 379803 |
2 | Algeria | 315 | 2024344 |
3 | American Samoa | 35 | 11341 |
4 | Andorra | 7 | 17740 |
... | ... | ... | ... |
234 | Wallis and Futuna | 23 | 1142 |
235 | Western Sahara | 4 | 146906 |
236 | Yemen | 30 | 1918607 |
237 | Zambia | 73 | 1305210 |
238 | Zimbabwe | 76 | 1574930 |
239 rows × 3 columns
Note
For more information and examples on aggregation, and other types of split-apply-combine operations which we have not mentioned, see https://pandas.pydata.org/docs/user_guide/groupby.html#aggregation.
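One such variant, not covered in detail here, is passing a list of function names to .agg, to obtain several summaries of the same column at once; a minimal sketch:
cities.groupby('country')['pop'].agg(['sum', 'mean', 'max']).reset_index()  ## several summaries of 'pop' per country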
Value counts#
A very common type of aggregation is to calculate how many times each unique value is “repeated” in a Series
. For example, to find out how many times each country appears in the cities
table, we can group the table by "country"
, then apply .count
on any other column (such as "city"
):
cities[['country', 'city']].groupby('country').count().reset_index()
country | city | |
---|---|---|
0 | Afghanistan | 117 |
1 | Albania | 67 |
2 | Algeria | 316 |
3 | American Samoa | 35 |
4 | Andorra | 7 |
... | ... | ... |
234 | Wallis and Futuna | 23 |
235 | Western Sahara | 4 |
236 | Yemen | 30 |
237 | Zambia | 73 |
238 | Zimbabwe | 76 |
239 rows × 2 columns
However, this expression is quite long and not straightforward. Instead, pandas defines a much shorter way to count the occurrences of each unique value, namely the .value_counts method. For example:
cities['country'].value_counts()
France 1000
Greece 1000
Romania 999
Japan 999
Germany 998
...
Saint-Martin 1
Norfolk Island 1
Gibraltar 1
Easter Island 1
Nauru 1
Name: country, Length: 239, dtype: int64
The result is a Series where the index is the country name, and the values are the counts. It tells us how many times each unique value in the "country" column is repeated, that is, how many rows (i.e., cities) there are in the table for that country. For example, "France" appears in 1000 rows.
As you can see, the resulting Series is sorted by count, from highest to lowest. In case we need to sort by index, i.e., by country name, we can use the .sort_index method:
cities['country'].value_counts().sort_index()
Afghanistan 117
Albania 67
Algeria 316
American Samoa 35
Andorra 7
...
Wallis and Futuna 23
Western Sahara 4
Yemen 30
Zambia 73
Zimbabwe 76
Name: country, Length: 239, dtype: int64
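A related option is the normalize=True argument of .value_counts, which returns proportions (summing to 1) instead of absolute counts; a minimal sketch:
cities['country'].value_counts(normalize=True)  ## proportion of cities in each country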
Exercise 06-e
How often is there more than one shape (shape_id) for the same route (route_id)?
- Use the trips.txt table to calculate the number of unique values of shape_id for each route_id.
- Summarize the resulting column to calculate how many route_id’s have 0, 1, 2, etc., values of shape_id. (answer: 4 routes have zero shape_id, 112 have two, and 7135 have one)
More exercises#
Exercise 06-f
- Read the stops.txt and stop_times.txt tables (located in the gtfs directory).
- Subset the 'stop_id', 'stop_name', 'stop_lon', and 'stop_lat' columns from stops, and the 'trip_id', 'arrival_time', 'stop_id', 'stop_sequence', and 'shape_dist_traveled' columns from stop_times. (These are the only columns you are going to need for this exercise.)
- The trip_id value of "55745843_180421" is one of the trips of Metropoline bus line number 24 in Beer-Sheva, departing from the Ramot neighborhood at 08:15:00 and arriving at the Central Bus Station of Beer-Sheva at 08:53:19. Subset the stop_times table to get the stop times for only this particular bus trip.
- How many stops are there in the bus route? Write an expression to get the answer as an int (answer: 32).
- Join the resulting stop_times subset with the stops table, using the common "stop_id" column, to get the names of the stops.
- At what time does the bus arrive at the station named 'מרכז הספורט אוניברסיטת בן גוריון'? Write an expression to get the answer as a str (answer: '08:37:21').
- The shape_dist_traveled column contains the cumulative traveled distance at each stop. What is the total traveled distance, i.e., the last value in the shape_dist_traveled column? (answer: 12767)
- Which pair of consecutive stations are closest to each other (i.e., have the minimal consecutive difference in shape_dist_traveled)? Hint: use the .diff method to get a Series of consecutive differences, then use the .idxmin method to find the index of the minimum in that series.
- Which pair of consecutive stations are farthest from each other (i.e., have the maximal consecutive difference in shape_dist_traveled)? Hint: use the .idxmax method to find the index of the maximum in a series.
Exercise 06-g
In this exercise, you are going to find the names of the five longest bus routes (in terms of distance traveled) in Israel. To do that, go through the following steps.
- Read the stop_times.txt table and select just the 'trip_id' and 'shape_dist_traveled' columns.
- Calculate the maximum (i.e., total) distance traveled ('shape_dist_traveled') for each public transit trip ('trip_id') (Fig. 32).
- Read the trips.txt table and select just the 'trip_id' and 'route_id' columns.
- Join the resulting trips table with the stop_times table from the previous step (using the common "trip_id" column), to get the 'route_id' value of each trip.
- Retain just one (the first) trip per 'route_id' (Fig. 33).
- Read the routes.txt table and subset the 'route_id', 'route_short_name', and 'route_long_name' columns.
- Join the resulting table with the trips table (using the common 'route_id' column), to get the 'route_short_name' and 'route_long_name' values of each route (Fig. 34).
- Sort the table according to 'shape_dist_traveled' (from largest to smallest), and print the first five rows (Fig. 35). Those are the longest-distance public transport routes in the GTFS dataset.
- Convert the 'route_long_name' values of the first five rows from the previous step, collect them into a list, and print the list, to see the full names of the routes. You can see that all five routes are from Haifa to Eilat or the other way around.
trip_id | shape_dist_traveled | |
---|---|---|
0 | 10021427_240421 | 49074 |
1 | 10021428_240421 | 49074 |
2 | 10021429_240421 | 49074 |
3 | 10021430_240421 | 49074 |
4 | 10033656_180421 | 2874 |
... | ... | ... |
287659 | 9957863_180421 | 33468 |
287660 | 9967071_180421 | 22341 |
287661 | 9967462_180421 | 39061 |
287662 | 9967467_180421 | 39061 |
287663 | 9967472_180421 | 39061 |
287664 rows × 2 columns
route_id | trip_id | shape_dist_traveled | |
---|---|---|---|
0 | 1 | 28876224_180421 | 7072 |
1 | 2 | 25448344_180421 | 6870 |
2 | 3 | 56334623_180421 | 9855 |
3 | 5 | 56335137_180421 | 9751 |
4 | 7 | 37173391_180421 | 7794 |
... | ... | ... | ... |
7245 | 29363 | 56622661_230421 | 33173 |
7246 | 29364 | 56460152_180421 | 4300 |
7247 | 29365 | 56458677_180421 | 16100 |
7248 | 29366 | 56458667_180421 | 54800 |
7249 | 29368 | 56458657_180421 | 81600 |
7250 rows × 3 columns
route_id | route_short_name | route_long_name | trip_id | shape_dist_traveled | |
---|---|---|---|---|---|
0 | 1 | 1 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | 28876224_180421 | 7072 |
1 | 2 | 1 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... | 25448344_180421 | 6870 |
2 | 3 | 2 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | 56334623_180421 | 9855 |
3 | 5 | 2 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... | 56335137_180421 | 9751 |
4 | 7 | 3 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | 37173391_180421 | 7794 |
... | ... | ... | ... | ... | ... |
7245 | 29363 | 450 | מסוף האוניברסיטה-תל אביב יפו<->... | 56622661_230421 | 33173 |
7246 | 29364 | 125 | תל אביב ההגנה-תל אביב יפו<->הרצ... | 56460152_180421 | 4300 |
7247 | 29365 | 69 | רחובות-רחובות<->תל אביב מרכז-תל... | 56458677_180421 | 16100 |
7248 | 29366 | 60 | נתניה-נתניה<->אשקלון-אשקלון | 56458667_180421 | 54800 |
7249 | 29368 | 21 | נהריה-נהריה<->תל אביב ההגנה-תל ... | 56458657_180421 | 81600 |
7250 rows × 5 columns
route_id | route_short_name | route_long_name | trip_id | shape_dist_traveled | |
---|---|---|---|---|---|
5856 | 19964 | 991 | ת. מרכזית חוף הכרמל/רציפים בינע... | 30526419_230421 | 446174 |
1668 | 7295 | 990 | ת. מרכזית אילת/רציפים-אילת<->ת.... | 56505849_180421 | 445508 |
6199 | 21726 | 991 | ת. מרכזית אילת/רציפים-אילת<->ת.... | 56505908_180421 | 445472 |
1669 | 7297 | 990 | ת. מרכזית חוף הכרמל/רציפים בינע... | 30526431_230421 | 443968 |
1671 | 7307 | 993 | ת. מרכזית המפרץ/רציפים בינעירונ... | 56505854_180421 | 432200 |
- 1
For a visual explanation of join types, see https://dataschool.com/how-to-teach-people-sql/sql-join-types-explained-visually/.