Table reshaping and joins#
Introduction#
In the previous chapter (see Tables (pandas)), we introduced the pandas
package, and covered the basic operations with tables using pandas
, such as importing and exporting, subsetting, renaming columns, sorting, and calculating new columns. In practice, we often need to do more than that. For example, we may need to aggregate or reshape tables into a different form, or to combine numerous tables into one. In this chapter, we cover some of the common table operations which involve modifying table structure or combining multiple tables.
When working with data, it is often necessary to combine data from multiple sources, and/or reshape data into the right form that a particular function accepts:
Combining may simply involve “pasting”, or concatenating, two tables one on top of the other, or side-by-side (see Concatenation).
In more complex cases, combining involves a table join, where the information from two tables is combined according to one or more “key” columns which are common to both tables (see Joining tables).
Moreover, we may need to summarize the values of a table according to a particular grouping, which is known as aggregation (see Aggregation).
A special type of aggregation is applying a function across each row, or each column, of a table (Row/col-wise operations).
In this chapter, we learn about all of those common methods for combining and reshaping tables. By the end of the chapter, you will have the ability to not just import and transform tables in Python (see Tables (pandas)), but also to reshape and combine tables to the right form, which facilitates further analysis and insight from the data in those tables.
Sample data#
First, let us recreate the small table named stations
which we used in the previous chapter (see Creating a DataFrame):
import numpy as np
import pandas as pd
name = pd.Series(['Beer-Sheva Center', 'Beer-Sheva University', 'Dimona'])
city = pd.Series(['Beer-Sheva', 'Beer-Sheva', 'Dimona'])
lines = pd.Series([4, 5, 1])
piano = pd.Series([False, True, False])
lon = pd.Series([34.798443, 34.812831, 35.011635])
lat = pd.Series([31.243288, 31.260284, 31.068616])
d = {'name': name, 'city': city, 'lines': lines, 'piano': piano, 'lon': lon, 'lat': lat}
stations = pd.DataFrame(d)
stations
name | city | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 |
2 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 |
We will also load the sample climatic data which we are also familiar with from the previous chapter (Reading from file), into a DataFrame
named dat
:
dat = pd.read_csv('data/ZonAnn.Ts+dSST.csv')
dat
Year | Glob | NHem | SHem | 24N-90N | ... | EQU-24N | 24S-EQU | 44S-24S | 64S-44S | 90S-64S | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1880 | -0.16 | -0.29 | -0.04 | -0.37 | ... | -0.15 | -0.09 | -0.03 | 0.05 | 0.65 |
1 | 1881 | -0.08 | -0.17 | 0.01 | -0.34 | ... | 0.10 | 0.12 | -0.05 | -0.07 | 0.57 |
2 | 1882 | -0.11 | -0.21 | 0.00 | -0.31 | ... | -0.05 | -0.04 | 0.02 | 0.04 | 0.60 |
3 | 1883 | -0.17 | -0.28 | -0.06 | -0.35 | ... | -0.17 | -0.14 | -0.03 | 0.07 | 0.48 |
4 | 1884 | -0.28 | -0.43 | -0.14 | -0.61 | ... | -0.13 | -0.15 | -0.19 | -0.02 | 0.63 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
137 | 2017 | 0.92 | 1.18 | 0.67 | 1.40 | ... | 0.87 | 0.77 | 0.76 | 0.35 | 0.55 |
138 | 2018 | 0.85 | 1.04 | 0.66 | 1.25 | ... | 0.73 | 0.62 | 0.79 | 0.37 | 0.96 |
139 | 2019 | 0.98 | 1.22 | 0.75 | 1.43 | ... | 0.91 | 0.89 | 0.74 | 0.39 | 0.83 |
140 | 2020 | 1.02 | 1.36 | 0.68 | 1.68 | ... | 0.89 | 0.84 | 0.58 | 0.39 | 0.91 |
141 | 2021 | 0.85 | 1.15 | 0.55 | 1.43 | ... | 0.73 | 0.59 | 0.71 | 0.32 | 0.33 |
142 rows × 15 columns
Concatenation#
Overview#
The most basic method for combining several tables into one is concatenation, which essentially means binding tables one on top of another, or side-by-side:
Concatenation by row is usually desired when the two (or more) tables share exactly the same columns. For example, combining several tables reflecting the same data (e.g., measured on different dates, or for different subjects) involves concatenation by row.
Concatenation by column is useful when we need to attach new variables for the same observations we already have. For example, we may want to combine several different types of observations which are split across separate files.
In pandas
, concatenation can be done using pd.concat
. The pd.concat
function accepts a list
of tables, which are combined:
By row when using axis=0 (the default)
By column when using axis=1
Concatenation by row#
To demonstrate concatenation by row, let us create two subsets x
and y
of stations
, with the 1st row and the 2nd-3rd rows, respectively:
x = stations.iloc[[0], :] ## 1st row
x
name | city | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 |
y = stations.iloc[[1, 2], :] ## 2nd and 3rd rows
y
name | city | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 |
2 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 |
The expression pd.concat([x, y]) can then be used to concatenate x and y by row, re-creating the complete table stations:
pd.concat([x, y])
name | city | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 |
2 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 |
Note that pd.concat
accepts a list
of DataFrame
s to concatenate, such as [x,y]
. In this example the list
is of length 2, but in general it can be of any length.
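For instance, passing a list of three tables works just the same; here is a minimal sketch reusing x and y from above, with ignore_index=True added so that the rows of the result are renumbered 0,1,2,… rather than keeping the original index values:
pd.concat([x, y, y], ignore_index=True)  ## the repeated 'y' is just for illustration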
Note
There is a special method .append, which is a shortcut for concatenation by row: instead of pd.concat([x,y]) you can use x.append(y). Note, however, that .append was deprecated in recent versions of pandas (and removed in pandas 2.0), so pd.concat is the recommended approach.
Concatenation by column#
To demonstrate concatenation by column, let us now split stations
to subsets x
and y
by column. For example, the x
table is going to contain one column "name"
:
x = stations[['name']]
x
name | |
---|---|
0 | Beer-Sheva Center |
1 | Beer-Sheva University |
2 | Dimona |
and the y
table is going to contain two columns 'lon'
and 'lat'
:
y = stations[['lon', 'lat']]
y
lon | lat | |
---|---|---|
0 | 34.798443 | 31.243288 |
1 | 34.812831 | 31.260284 |
2 | 35.011635 | 31.068616 |
To concatenate by column, we use pd.concat
with the axis=1
argument. This concatenates the tables x
and y
side-by-side:
pd.concat([x, y], axis=1)
name | lon | lat | |
---|---|---|---|
0 | Beer-Sheva Center | 34.798443 | 31.243288 |
1 | Beer-Sheva University | 34.812831 | 31.260284 |
2 | Dimona | 35.011635 | 31.068616 |
Alignment by index#
Note that concatenation—like many other operations in pandas
—operates based on the index, rather than position. This is not a problem when concatenating by row, where columns typically need to be aligned by their index, i.e., column name (see Concatenation by row). When concatenating by column, however, we need to pay attention to the index: is it meaningful, so that we want to use it when aligning rows, or is it arbitrary, so that we want to reset it before concatenating?
For example, here is what happens in concatenation by column, when the index is not aligned (see Setting Series index):
y.index = [2,3,4]
pd.concat([x, y], axis=1)
name | lon | lat | |
---|---|---|---|
0 | Beer-Sheva Center | NaN | NaN |
1 | Beer-Sheva University | NaN | NaN |
2 | Dimona | 34.798443 | 31.243288 |
3 | NaN | 34.812831 | 31.260284 |
4 | NaN | 35.011635 | 31.068616 |
To make sure we concatenate based on position, we may reset the index (using .reset_index(drop=True)
) on both tables before concatenation:
y = y.reset_index(drop=True)
pd.concat([x, y], axis=1)
name | lon | lat | |
---|---|---|---|
0 | Beer-Sheva Center | 34.798443 | 31.243288 |
1 | Beer-Sheva University | 34.812831 | 31.260284 |
2 | Dimona | 35.011635 | 31.068616 |
Row/col-wise operations#
Built-in methods#
pandas defines summary methods which can be used to calculate row-wise and col-wise summaries of DataFrames. The pandas methods are generally similar to the numpy 2D summary functions and methods (see Summaries per dimension (2D)) which we learned about earlier, but there are several important differences to keep in mind:
- pandas only has methods, not functions, i.e., there is a .mean method but there is no pd.mean function—unlike numpy, which mostly defines both, e.g., .mean and np.mean (see Table 14)
- There is no “global” summary option in the pandas methods, i.e., we can only calculate column (axis=0, the default) or row (axis=1) summaries—unlike numpy, where a “global” summary is the default option (see Global summaries)
- The pandas methods exclude “No Data” by default (see Operators with missing values), unless we explicitly specify skipna=False—unlike numpy, which by default includes “No Data”, unless using the specific “No Data”-safe functions such as np.nanmean (see Operations with “No Data”); this difference is illustrated in the short sketch below
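To make the last difference concrete, here is a minimal sketch (with arbitrary values) contrasting the pandas and numpy defaults in the presence of np.nan:
s = pd.Series([1.0, 2.0, np.nan])
s.mean()                                  ## 1.5, "No Data" excluded by default
s.mean(skipna=False)                      ## nan
np.mean(np.array([1.0, 2.0, np.nan]))     ## nan
np.nanmean(np.array([1.0, 2.0, np.nan]))  ## 1.5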
The most commonly used pandas
summary methods are listed in Table 15.
Operation | Method
---|---
Sum | .sum
Minimum | .min
Maximum | .max
Mean | .mean
Median | .median
First value | .first
Last value | .last
Count (of non-missing values) | .count
Number of unique values | .nunique
Index of (first) minimum | .idxmin
Index of (first) maximum | .idxmax
Is at least one element True | .any
Are all elements True | .all
For the next few examples, let us take the regional temperature anomaly time series into a separate DataFrame
named regions
:
cols = ['90S-64S', '64S-44S', '44S-24S', '24S-EQU', 'EQU-24N', '24N-44N', '44N-64N', '64N-90N']
regions = dat[cols]
regions
90S-64S | 64S-44S | 44S-24S | 24S-EQU | EQU-24N | 24N-44N | 44N-64N | 64N-90N | |
---|---|---|---|---|---|---|---|---|
0 | 0.65 | 0.05 | -0.03 | -0.09 | -0.15 | -0.27 | -0.49 | -0.81 |
1 | 0.57 | -0.07 | -0.05 | 0.12 | 0.10 | -0.19 | -0.45 | -0.92 |
2 | 0.60 | 0.04 | 0.02 | -0.04 | -0.05 | -0.15 | -0.28 | -1.42 |
3 | 0.48 | 0.07 | -0.03 | -0.14 | -0.17 | -0.27 | -0.56 | -0.19 |
4 | 0.63 | -0.02 | -0.19 | -0.15 | -0.13 | -0.47 | -0.64 | -1.31 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
137 | 0.55 | 0.35 | 0.76 | 0.77 | 0.87 | 1.05 | 1.38 | 2.54 |
138 | 0.96 | 0.37 | 0.79 | 0.62 | 0.73 | 1.06 | 1.09 | 2.15 |
139 | 0.83 | 0.39 | 0.74 | 0.89 | 0.91 | 1.01 | 1.44 | 2.73 |
140 | 0.91 | 0.39 | 0.58 | 0.84 | 0.89 | 1.20 | 1.82 | 2.92 |
141 | 0.33 | 0.32 | 0.71 | 0.59 | 0.73 | 1.28 | 1.37 | 2.04 |
142 rows × 8 columns
Using the .mean
method combined with axis=0
(i.e., the function is applied on each column), we can calculate the mean temperature anomaly in each region:
regions.mean(axis=0)
90S-64S -0.089507
64S-44S -0.060704
44S-24S 0.036127
24S-EQU 0.079437
EQU-24N 0.054930
24N-44N 0.036549
44N-64N 0.122042
64N-90N 0.236761
dtype: float64
Note that axis=0 is the default, so we can also get the same result without specifying it. Unlike numpy (see Global summaries), in pandas there is no “global” option for summary methods such as .mean, only row-wise or col-wise “modes”:
regions.mean()
90S-64S -0.089507
64S-44S -0.060704
44S-24S 0.036127
24S-EQU 0.079437
EQU-24N 0.054930
24N-44N 0.036549
44N-64N 0.122042
64N-90N 0.236761
dtype: float64
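Other methods from Table 15 work the same way. For example, a minimal sketch using .idxmax and .max, which return, for each region (column), the row label of the warmest year and the anomaly value itself, respectively:
regions.idxmax()  ## row label of the maximum anomaly, per column
regions.max()     ## the maximum anomaly itself, per column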
Exercise 06-a
What if we need to, nevertheless, calculate a “global” summary of DataFrame values, such as the mean of values in regions? How can we do that? (answer: 0.05)
Conversely, using axis=1
, we can calculate row means, i.e., the average temperature anomaly for each year across all regions:
regions.mean(axis=1)
0 -0.14250
1 -0.11125
2 -0.16000
3 -0.10125
4 -0.28500
...
137 1.03375
138 0.97125
139 1.11750
140 1.19375
141 0.92125
Length: 142, dtype: float64
If necessary, the result can be assigned into a new column (see Creating new columns), here named mean, as follows:
dat['mean'] = regions.mean(axis=1)
dat
Year | Glob | NHem | SHem | 24N-90N | ... | 24S-EQU | 44S-24S | 64S-44S | 90S-64S | mean | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1880 | -0.16 | -0.29 | -0.04 | -0.37 | ... | -0.09 | -0.03 | 0.05 | 0.65 | -0.14250 |
1 | 1881 | -0.08 | -0.17 | 0.01 | -0.34 | ... | 0.12 | -0.05 | -0.07 | 0.57 | -0.11125 |
2 | 1882 | -0.11 | -0.21 | 0.00 | -0.31 | ... | -0.04 | 0.02 | 0.04 | 0.60 | -0.16000 |
3 | 1883 | -0.17 | -0.28 | -0.06 | -0.35 | ... | -0.14 | -0.03 | 0.07 | 0.48 | -0.10125 |
4 | 1884 | -0.28 | -0.43 | -0.14 | -0.61 | ... | -0.15 | -0.19 | -0.02 | 0.63 | -0.28500 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
137 | 2017 | 0.92 | 1.18 | 0.67 | 1.40 | ... | 0.77 | 0.76 | 0.35 | 0.55 | 1.03375 |
138 | 2018 | 0.85 | 1.04 | 0.66 | 1.25 | ... | 0.62 | 0.79 | 0.37 | 0.96 | 0.97125 |
139 | 2019 | 0.98 | 1.22 | 0.75 | 1.43 | ... | 0.89 | 0.74 | 0.39 | 0.83 | 1.11750 |
140 | 2020 | 1.02 | 1.36 | 0.68 | 1.68 | ... | 0.84 | 0.58 | 0.39 | 0.91 | 1.19375 |
141 | 2021 | 0.85 | 1.15 | 0.55 | 1.43 | ... | 0.59 | 0.71 | 0.32 | 0.33 | 0.92125 |
142 rows × 16 columns
Here is a visualization (see Line plots (pandas)) of the mean time series we just calculated:
dat.set_index('Year')['mean'].plot();
apply and custom functions#
In case we need to apply a custom function on each row, or column, we can use the .apply method combined with a lambda function (see Lambda functions). For example, here is an alternative way to calculate the column means, using the .apply method and a lambda function. The result is the same as regions.mean() (see Built-in methods):
regions.apply(lambda x: x.mean())
90S-64S -0.089507
64S-44S -0.060704
44S-24S 0.036127
24S-EQU 0.079437
EQU-24N 0.054930
24N-44N 0.036549
44N-64N 0.122042
64N-90N 0.236761
dtype: float64
The advantage is that we can use any custom expression inside the lambda function definition. For example, the following expression calculates whether any region had a temperature anomaly above 1 degree, for each of the 142 years. Note the axis=1 part, which determines that, this time, the function is applied on each row:
regions.apply(lambda x: (x > 1).any(), axis=1)
0 False
1 False
2 False
3 False
4 False
...
137 True
138 True
139 True
140 True
141 True
Length: 142, dtype: bool
Exercise 06-b
Function np.polyfit can be used to calculate the slope of a linear regression between two columns. For example, using the code section shown below, you can find out that the linear trend of temperature in the "44S-24S" region was an increasing one, at 0.0066 degrees per year.
- Adapt the code into the form of a function named slope which, given two Series objects x and y, calculates the linear slope. For example, slope(dat["Year"], regions["44S-24S"]) should return 0.0066.
- Use the slope function you defined within the .apply method, to calculate the slope in all columns of regions at once.
- Hint: you need to keep the x argument of slope fixed at dat["Year"], while the various columns of regions are passed to y. To do that, write another function (or use a lambda function) that “wraps” slope and runs it with just the y parameter (keeping x fixed).
x = dat["Year"]
y = regions["44S-24S"]
np.polyfit(x, y, 1)[0]
Joining tables#
Overview#
One of the most common operations when working with data is a table join. In a table join, two tables are combined into one, based on one or more common column(s). For example, when compiling a dataset about different cities, we may collect data on different aspects of each city (population size, socio-economic status, built area, etc.) from different sources. To work with these data together, the various tables need to be joined into one. The common column, in this case, would be the city name or city ID.
When working with pandas, tables can be joined, based on one or more common columns, using the pd.merge function. The pd.merge function has the following important parameters:
- left—the first (“left”) table
- right—the second (“right”) table
- how—the type of join, one of: 'left', 'right', 'outer', 'inner' (the default), 'cross'
- on—the name(s) of the common column(s) where identical values are considered a “match”
The type of join (the how
parameter) basically determines what should be done with non-matching records. The options are 'left'
, 'right'
, 'outer'
, 'inner'
(the default), and 'cross'
. The meaning of the join types is the same as in SQL 1, and will be demonstrated through examples below.
Stations example#
For a first small example, let us get back to the stations
table:
stations
name | city | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 |
2 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 |
Suppose we have another small table named status
which describes the status ('Closed'
or 'Open'
) of select stations:
name = pd.Series(['Beer-Sheva Center', 'Beer-Sheva University', 'Tel-Aviv University'])
status = pd.Series(['Closed', 'Open', 'Open'])
status = pd.DataFrame({'name': name, 'status': status})
status
name | status | |
---|---|---|
0 | Beer-Sheva Center | Closed |
1 | Beer-Sheva University | Open |
2 | Tel-Aviv University | Open |
The stations
and status
can be joined as follows:
pd.merge(stations, status)
name | city | lines | piano | lon | lat | status | |
---|---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 | Closed |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 | Open |
This relies on the defaults: on is set to the column(s) sharing the same name in both tables (on='name', in this case), and the join type is 'inner' (i.e., how='inner'). As you can see, an inner join returns just the matching rows. It is better to be explicit, for clarity:
pd.merge(stations, status, on='name', how='inner')
name | city | lines | piano | lon | lat | status | |
---|---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 | Closed |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 | Open |
Unlike an 'inner'
join, a 'left'
join returns all records from the first (“left”) table, even those that have no match in the second (“right”) table. For example, 'Dimona'
station does not appear in the status
table, but it is still returned in the result (with status
value set to np.nan
):
pd.merge(stations, status, on='name', how='left')
name | city | lines | piano | lon | lat | status | |
---|---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 | Closed |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 | Open |
2 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 | NaN |
A 'right'
join is the opposite of a 'left'
join, i.e., it returns all records from the “right” table, even those that have no match in the “left” table. This time, values absent in the “left” table are set to np.nan
:
pd.merge(stations, status, on='name', how='right')
name | city | lines | piano | lon | lat | status | |
---|---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4.0 | False | 34.798443 | 31.243288 | Closed |
1 | Beer-Sheva University | Beer-Sheva | 5.0 | True | 34.812831 | 31.260284 | Open |
2 | Tel-Aviv University | NaN | NaN | NaN | NaN | NaN | Open |
Finally, an 'outer' (also known as “full”) join returns all records from both tables, regardless of whether they have a matching record in the other table. Values for records that have no match in the other table are set to np.nan:
pd.merge(stations, status, on='name', how='outer')
name | city | lines | piano | lon | lat | status | |
---|---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4.0 | False | 34.798443 | 31.243288 | Closed |
1 | Beer-Sheva University | Beer-Sheva | 5.0 | True | 34.812831 | 31.260284 | Open |
2 | Dimona | Dimona | 1.0 | False | 35.011635 | 31.068616 | NaN |
3 | Tel-Aviv University | NaN | NaN | NaN | NaN | NaN | Open |
When working with data, we often have one main, or “template”, table we are working with (“left”), and we are interested in joining data coming from another, external (“right”) table to it. In this scenario we usually prefer not to lose any records from the “left” table. Even if some of the records have no match and get “No Data” values (namely, np.nan), we usually prefer to keep those records and deal with the missing values as necessary, rather than lose the records altogether. Therefore, a 'left' join is often the most useful in data analysis. For example, in the above situation a 'left' join makes sense if our focus is on the station properties given in stations: even though the status of the 'Dimona' station is unknown, we prefer to keep the 'Dimona' record.
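After a 'left' join, the missing values can then be handled explicitly, if needed. For example, here is a minimal sketch which fills the missing status with a placeholder (the 'Unknown' label is arbitrary):
result = pd.merge(stations, status, on='name', how='left')
result['status'] = result['status'].fillna('Unknown')  ## replace np.nan with a placeholder
result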
What is GTFS?#
In the next examples, we demonstrate table joins with a more realistic, large dataset on public transport known as GTFS. Before we go into the example, we need to be familiar with the GTFS format and structure.
The General Transit Feed Specification (GTFS) is a standard format for describing public transport routes and timetables. GTFS datasets are typically published by government transport agencies, periodically. For example, the GTFS dataset provided as part of the book’s sample data was downloaded from the Israel Ministry of Transport GTFS website on 2021-04-18. GTFS data is also used commercially, for example for routing in Google Maps.
Technically, a GTFS dataset is composed of several CSV files (even though their file extension is .txt
), which are linked via common columns, or keys. In this book, we are going to use only some of the files in the GTFS datasets, namely the six files listed in Table 16. Recall that we already met one of the GTFS files, namely stops.txt
, earlier, in an example of text file processing through for
loops (see File object and reading lines).
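To get oriented, we can list the files comprising the GTFS dataset; a minimal sketch using the standard glob module (assuming the data/gtfs directory used in the examples below):
import glob
sorted(glob.glob('data/gtfs/*.txt'))  ## paths of the GTFS component files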
Name | Contents | Keys
---|---|---
agency.txt | Agencies | agency_id
routes.txt | Routes | route_id, agency_id
trips.txt | Trips | trip_id, route_id, shape_id
stop_times.txt | Stop times | trip_id, stop_id
stops.txt | Stops | stop_id
shapes.txt | Shapes | shape_id
Let us see what the GTFS tables look like, what their purpose is, and how they are related to one another. To make things easier, we immediately subset the most important columns, which are also the ones we use in subsequent examples.
We start with agency.txt
. This file lists the public transport agencies. Each row represents one agency, identified by agency_id
:
filename = 'data/gtfs/agency.txt'
cols = ['agency_id', 'agency_name']
pd.read_csv(filename)[cols]
agency_id | agency_name | |
---|---|---|
0 | 2 | רכבת ישראל |
1 | 3 | אגד |
2 | 4 | אגד תעבורה |
3 | 5 | דן |
4 | 6 | ש.א.מ |
... | ... | ... |
33 | 95 | מוניות יהלום תחבורה |
34 | 96 | מוניות גלים |
35 | 97 | אודליה מוניות בעמ |
36 | 98 | מוניות רב קווית 4-5 |
37 | 130 | מוניות הדר לוד |
38 rows × 2 columns
Note
pd.read_csv
uses the 'utf-8'
encoding by default. All sample CSV files in this book are in the UTF-8 encoding, so there is no need to specify the encoding when reading them with pd.read_csv
. Otherwise, the right encoding can be specified with the encoding
parameter of pd.read_csv
.
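For instance, a file saved in a different encoding could be read as follows (a hypothetical example; both the file name and the encoding are for illustration only):
pd.read_csv('data/some_file.csv', encoding='windows-1255')  ## hypothetical non-UTF-8 file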
The next file is routes.txt
. This file describes the public transport routes, identified by route_id
and also specifying the name of the route:
- route_short_name—typically a number, such as a bus line number
- route_long_name—the textual description of the route
It also contains agency_id
to identify the agency which operates each route:
path = 'data/gtfs/routes.txt'
cols = ['route_id', 'agency_id', 'route_short_name', 'route_long_name']
pd.read_csv(path)[cols]
route_id | agency_id | route_short_name | route_long_name | |
---|---|---|---|---|
0 | 1 | 25 | 1 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... |
1 | 2 | 25 | 1 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... |
2 | 3 | 25 | 2 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... |
3 | 5 | 25 | 2 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... |
4 | 7 | 25 | 3 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... |
... | ... | ... | ... | ... |
7246 | 29363 | 18 | 450 | מסוף האוניברסיטה-תל אביב יפו<->... |
7247 | 29364 | 2 | 125 | תל אביב ההגנה-תל אביב יפו<->הרצ... |
7248 | 29365 | 2 | 69 | רחובות-רחובות<->תל אביב מרכז-תל... |
7249 | 29366 | 2 | 60 | נתניה-נתניה<->אשקלון-אשקלון |
7250 | 29368 | 2 | 21 | נהריה-נהריה<->תל אביב ההגנה-תל ... |
7251 rows × 4 columns
The routes.txt
table is linked to the trips.txt
table, which characterizes trips taking place at specific times of day for a given route. Trips are uniquely identified by trip_id
and linked with the routes table via the route_id
column. As you can see from the small subset below, each route may be associated with multiple trips. The trips table also has a shape_id
column which associates a trip with the geographical line geometry (see shapes.txt
below). You can see that trips of the same route at different times of day typically follow exactly the same geographical shape, thus having identical shape_id
values, but this is not always guaranteed:
filename = 'data/gtfs/trips.txt'
cols = ['route_id', 'trip_id', 'shape_id']
pd.read_csv(filename)[cols]
route_id | trip_id | shape_id | |
---|---|---|---|
0 | 1 | 28876224_180421 | 121757.0 |
1 | 1 | 28876225_180421 | 121757.0 |
2 | 1 | 28876226_180421 | 121757.0 |
3 | 1 | 28876227_180421 | 121757.0 |
4 | 1 | 28876228_180421 | 121757.0 |
... | ... | ... | ... |
287673 | 29363 | 56622686_230421 | 123144.0 |
287674 | 29364 | 56460152_180421 | NaN |
287675 | 29365 | 56458677_180421 | 67648.0 |
287676 | 29366 | 56458667_180421 | 51396.0 |
287677 | 29368 | 56458657_180421 | 123033.0 |
287678 rows × 3 columns
The stop_times.txt
table lists the stops associated with each trip (via the trip_id
column), including the (estimated) time of arrival at each stop. The stop times table also includes the stop_id
column to associate it with the stop characteristics (see stops.txt
below). In addition, the shape_dist_traveled
column contains the cumulative distance traveled at each stop. You can see the increasing distance traveled from one stop to another (in the same trip) in the printout below. The stop_times.txt table is the largest one, in terms of file size, among the GTFS files: keep in mind that it lists all stops, at all times of day, of all public transport routes:
filename = 'data/gtfs/stop_times.txt'
cols = ['trip_id', 'arrival_time', 'departure_time', 'stop_id', 'shape_dist_traveled']
pd.read_csv(filename)[cols]
trip_id | arrival_time | departure_time | stop_id | shape_dist_traveled | |
---|---|---|---|---|---|
0 | 10021427_240421 | 00:00:00 | 00:00:00 | 36133 | 0 |
1 | 10021427_240421 | 00:01:43 | 00:01:43 | 2356 | 499 |
2 | 10021427_240421 | 00:03:36 | 00:03:36 | 3869 | 1838 |
3 | 10021427_240421 | 00:05:31 | 00:05:31 | 2357 | 2692 |
4 | 10021427_240421 | 00:06:32 | 00:06:32 | 3803 | 3379 |
... | ... | ... | ... | ... | ... |
10583641 | 9967472_180421 | 11:39:25 | 11:39:25 | 34662 | 37457 |
10583642 | 9967472_180421 | 11:39:58 | 11:39:58 | 34492 | 37729 |
10583643 | 9967472_180421 | 11:40:53 | 11:40:53 | 34498 | 38112 |
10583644 | 9967472_180421 | 11:41:44 | 11:41:44 | 34500 | 38495 |
10583645 | 9967472_180421 | 11:43:31 | 11:43:31 | 34503 | 39061 |
10583646 rows × 5 columns
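Since stop_times.txt is so large, it may be worthwhile to read only the required columns in the first place, rather than reading everything and then subsetting. A minimal sketch using the usecols parameter of pd.read_csv:
cols = ['trip_id', 'arrival_time', 'departure_time', 'stop_id', 'shape_dist_traveled']
stop_times = pd.read_csv('data/gtfs/stop_times.txt', usecols=cols)  ## read only the needed columns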
The stops.txt
table, which we already worked with earlier (see Reading CSV files), describes the fixed (i.e., unrelated to time of day) properties of the stops. It is linked with the stop_times.txt table via the stop_id
column. Stop properties include the stop name (stop_name
), and its coordinates (stop_lon
and stop_lat
):
filename = 'data/gtfs/stops.txt'
cols = ['stop_id', 'stop_name', 'stop_lat', 'stop_lon']
pd.read_csv(filename)[cols]
stop_id | stop_name | stop_lat | stop_lon | |
---|---|---|---|---|
0 | 1 | בי''ס בר לב/בן יהודה | 32.183985 | 34.917554 |
1 | 2 | הרצל/צומת בילו | 31.870034 | 34.819541 |
2 | 3 | הנחשול/הדייגים | 31.984553 | 34.782828 |
3 | 6 | ת. מרכזית לוד/הורדה | 31.956392 | 34.898098 |
4 | 8 | הרצל/משה שרת | 31.857565 | 34.824106 |
... | ... | ... | ... | ... |
28250 | 44668 | חניון שפירים דרומי/הורדה | 32.001228 | 34.840406 |
28251 | 44683 | מתחם נצבא | 32.112445 | 34.941549 |
28252 | 44684 | מתחם נצבא | 32.111795 | 34.940668 |
28253 | 44691 | מסוף דרך שכם | 31.783680 | 35.228825 |
28254 | 44692 | הקוממיות/אנה פרנק | 32.002782 | 34.752376 |
28255 rows × 4 columns
Finally, the shapes.txt
table contains the trip geometries. It is linked with the trips.txt
table via the common shape_id
column.
filename = 'data/gtfs/shapes.txt'
cols = ['shape_id', 'shape_pt_lat', 'shape_pt_lon', 'shape_pt_sequence']
pd.read_csv(filename)[cols]
shape_id | shape_pt_lat | shape_pt_lon | shape_pt_sequence | |
---|---|---|---|---|
0 | 44779 | 31.887695 | 35.016271 | 1 |
1 | 44779 | 31.887745 | 35.016253 | 2 |
2 | 44779 | 31.888256 | 35.016238 | 3 |
3 | 44779 | 31.888913 | 35.016280 | 4 |
4 | 44779 | 31.888917 | 35.016892 | 5 |
... | ... | ... | ... | ... |
7226268 | 123167 | 32.729572 | 35.331201 | 876 |
7226269 | 123167 | 32.729700 | 35.330923 | 877 |
7226270 | 123167 | 32.729706 | 35.330908 | 878 |
7226271 | 123167 | 32.729841 | 35.330538 | 879 |
7226272 | 123167 | 32.729899 | 35.330249 | 880 |
7226273 rows × 4 columns
The relations between the six above-mentioned files in the GTFS dataset are summarized in Fig. 30.
GTFS example#
Now that we have an idea about the structure and purpose of the GTFS dataset, let us demonstrate join operations between its tables. This is a classical example where joining tables is essential: you cannot go very far with GTFS data analysis, or processing, without joining the various tables together (see Fig. 30).
To make things simple, in this example, consider just two of the tables in the GTFS dataset:
- agency.txt, where rows represent public transport operators
- routes.txt, where rows represent public transport routes
As shown above (see What is GTFS?), the two tables share a common column agency_id
, which we can use to join the two, to find out which public routes belong to each operator.
First, let us import the agency
table and subset the columns of interest, "agency_id"
and "agency_name"
:
agency = pd.read_csv('data/gtfs/agency.txt')
agency = agency[['agency_id', 'agency_name']]
agency
agency_id | agency_name | |
---|---|---|
0 | 2 | רכבת ישראל |
1 | 3 | אגד |
2 | 4 | אגד תעבורה |
3 | 5 | דן |
4 | 6 | ש.א.מ |
... | ... | ... |
33 | 95 | מוניות יהלום תחבורה |
34 | 96 | מוניות גלים |
35 | 97 | אודליה מוניות בעמ |
36 | 98 | מוניות רב קווית 4-5 |
37 | 130 | מוניות הדר לוד |
38 rows × 2 columns
Second, also import the routes
table, this time selecting the "route_id"
, "agency_id"
, "route_short_name"
, and "route_long_name"
columns. Note the common "agency_id"
column, which we are going to use when joining the two tables:
routes = pd.read_csv('data/gtfs/routes.txt')
routes = routes[['route_id', 'agency_id', 'route_short_name', 'route_long_name']]
routes
route_id | agency_id | route_short_name | route_long_name | |
---|---|---|---|---|
0 | 1 | 25 | 1 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... |
1 | 2 | 25 | 1 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... |
2 | 3 | 25 | 2 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... |
3 | 5 | 25 | 2 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... |
4 | 7 | 25 | 3 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... |
... | ... | ... | ... | ... |
7246 | 29363 | 18 | 450 | מסוף האוניברסיטה-תל אביב יפו<->... |
7247 | 29364 | 2 | 125 | תל אביב ההגנה-תל אביב יפו<->הרצ... |
7248 | 29365 | 2 | 69 | רחובות-רחובות<->תל אביב מרכז-תל... |
7249 | 29366 | 2 | 60 | נתניה-נתניה<->אשקלון-אשקלון |
7250 | 29368 | 2 | 21 | נהריה-נהריה<->תל אביב ההגנה-תל ... |
7251 rows × 4 columns
Now, we can join the routes
and agency
tables based on the common 'agency_id'
column:
routes = pd.merge(routes, agency, on='agency_id', how='left')
routes
route_id | agency_id | route_short_name | route_long_name | agency_name | |
---|---|---|---|---|---|
0 | 1 | 25 | 1 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | אפיקים |
1 | 2 | 25 | 1 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... | אפיקים |
2 | 3 | 25 | 2 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | אפיקים |
3 | 5 | 25 | 2 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... | אפיקים |
4 | 7 | 25 | 3 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | אפיקים |
... | ... | ... | ... | ... | ... |
7246 | 29363 | 18 | 450 | מסוף האוניברסיטה-תל אביב יפו<->... | קווים |
7247 | 29364 | 2 | 125 | תל אביב ההגנה-תל אביב יפו<->הרצ... | רכבת ישראל |
7248 | 29365 | 2 | 69 | רחובות-רחובות<->תל אביב מרכז-תל... | רכבת ישראל |
7249 | 29366 | 2 | 60 | נתניה-נתניה<->אשקלון-אשקלון | רכבת ישראל |
7250 | 29368 | 2 | 21 | נהריה-נהריה<->תל אביב ההגנה-תל ... | רכבת ישראל |
7251 rows × 5 columns
Now we have the additional 'agency_name'
column in the routes
table, which tells us the name of the operator for each route.
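The joined column can now be used like any other. For example, a minimal sketch subsetting the routes operated by one of the agencies shown above (רכבת ישראל, i.e., Israel Railways):
routes[routes['agency_name'] == 'רכבת ישראל']  ## routes operated by a single agency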
Note
For more information and examples of pd.merge, see the pandas documentation on merging tables: https://pandas.pydata.org/docs/user_guide/merging.html.
Aggregation#
Overview#
Aggregation is commonly used to summarize tabular data. In an aggregation procedure, a function is applied on subsets of table rows, where the subsets are specified through one or more grouping variables, to obtain a new (smaller) table with summaries per group. Aggregation is one of the techniques following the split-apply-combine concept.
In pandas, there are several different ways of specifying the function(s) to be applied on the table column(s). We will go over three commonly used scenarios:
- the same function applied on all columns (see Same function on all columns)
- a different function applied on each column (see Different functions (agg))
- a custom function (see Custom functions (agg))
Afterwards, we are going to introduce a useful shortcut for a particularly common type of aggregation: counting the occurrence of each unique value in a column (see Value counts).
Same function on all columns#
For a minimal example of aggregation, let us consider the small table named stations
:
stations
name | city | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
0 | Beer-Sheva Center | Beer-Sheva | 4 | False | 34.798443 | 31.243288 |
1 | Beer-Sheva University | Beer-Sheva | 5 | True | 34.812831 | 31.260284 |
2 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 |
Suppose that we are interested in the number of lines (the lines
column) and the number of stations with a piano (the piano
column), which involves summing the values in those two columns. We already know how the totals, i.e., sums of each column, can be calculated (see Row/col-wise operations):
stations[['lines', 'piano']].sum()
lines 10
piano 1
dtype: int64
However, what if we wanted to calculate the number of lines and stations with a piano per town, separately? To do that, we basically need to split the table according to the 'city'
column, then apply the .sum
method on each subset. Here is how this can be done. Instead of simply using .sum
, we split (or “group”) the table by 'city'
, and only then apply the .sum
method. Splitting is done using the .groupby
method, which accepts the column name (or list
of column names) that specifies the grouping:
stations[['city', 'lines', 'piano']].groupby('city').sum()
lines | piano | |
---|---|---|
city | ||
Beer-Sheva | 9 | 1 |
Dimona | 1 | 0 |
Note that the grouping variable(s) become the index of the resulting DataFrame
. For our purposes, we are usually going to prefer to work with the grouping variable in its own column. In such case, we end the expression with .reset_index()
(see Resetting the index):
stations[['city', 'lines', 'piano']].groupby('city').sum().reset_index()
city | lines | piano | |
---|---|---|---|
0 | Beer-Sheva | 9 | 1 |
1 | Dimona | 1 | 0 |
Note
Tables can also be grouped by two or more columns, in which case the column names are passed in a list
, as in dat.groupby(['a','b'])
, where 'a'
and 'b'
are column names in dat
.
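For instance, here is a minimal sketch grouping the small stations table by both 'city' and 'piano' at once:
stations.groupby(['city', 'piano'])[['lines']].sum().reset_index()  ## one row per (city, piano) combination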
The .sum method is just one example; there are many other useful methods we can use, such as the ones listed in Table 15. For example, to get the first row per group we can combine .groupby with the .first method:
stations.groupby('city').first().reset_index()
city | name | lines | piano | lon | lat | |
---|---|---|---|---|---|---|
0 | Beer-Sheva | Beer-Sheva Center | 4 | False | 34.798443 | 31.243288 |
1 | Dimona | Dimona | 1 | False | 35.011635 | 31.068616 |
For a more realistic example, consider the routes
table:
routes
route_id | agency_id | route_short_name | route_long_name | agency_name | |
---|---|---|---|---|---|
0 | 1 | 25 | 1 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | אפיקים |
1 | 2 | 25 | 1 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... | אפיקים |
2 | 3 | 25 | 2 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | אפיקים |
3 | 5 | 25 | 2 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... | אפיקים |
4 | 7 | 25 | 3 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | אפיקים |
... | ... | ... | ... | ... | ... |
7246 | 29363 | 18 | 450 | מסוף האוניברסיטה-תל אביב יפו<->... | קווים |
7247 | 29364 | 2 | 125 | תל אביב ההגנה-תל אביב יפו<->הרצ... | רכבת ישראל |
7248 | 29365 | 2 | 69 | רחובות-רחובות<->תל אביב מרכז-תל... | רכבת ישראל |
7249 | 29366 | 2 | 60 | נתניה-נתניה<->אשקלון-אשקלון | רכבת ישראל |
7250 | 29368 | 2 | 21 | נהריה-נהריה<->תל אביב ההגנה-תל ... | רכבת ישראל |
7251 rows × 5 columns
We can use aggregation with the .nunique method (Table 15) to calculate the number of unique route IDs that each public transport agency operates, as follows:
tmp = routes[['agency_name', 'route_id']].groupby('agency_name').nunique().reset_index()
tmp
agency_name | route_id | |
---|---|---|
0 | אגד | 1869 |
1 | אגד תעבורה | 420 |
2 | אודליה מוניות בעמ | 1 |
3 | אפיקים | 616 |
4 | גי.בי.טורס | 51 |
... | ... | ... |
26 | נתיב אקספרס | 598 |
27 | סופרבוס | 493 |
28 | קווים | 1187 |
29 | רכבת ישראל | 111 |
30 | ש.א.מ | 153 |
31 rows × 2 columns
To find out which agencies have the most routes, the resulting table can be sorted using .sort_values
combined with ascending=False
(see Sorting):
tmp.sort_values('route_id', ascending=False)
agency_name | route_id | |
---|---|---|
0 | אגד | 1869 |
28 | קווים | 1187 |
24 | מטרופולין | 625 |
3 | אפיקים | 616 |
26 | נתיב אקספרס | 598 |
... | ... | ... |
20 | מוניות מטרו קו | 3 |
19 | מוניות מאיה יצחק שדה | 3 |
18 | כרמלית | 2 |
17 | כפיר | 2 |
2 | אודליה מוניות בעמ | 1 |
31 rows × 2 columns
Exercise 06-c
Calculate a series of global temperature means for each decade, based on the information in dat.
- First, you need to create a new variable named decade based on the "Year" column, using an expression such as the one shown below.
- Second, use .groupby to calculate the mean global temperature in each decade.
dat["Year"] // 10 * 10
Exercise 06-d
The routes.txt table describes public transport routes (uniquely identified by route_id). The trips.txt table, however, describes specific journeys along those routes, at a specific time of day (uniquely identified by trip_id). The two tables are associated with one another via the common route_id column (see What is GTFS?). For example, a route that operates five times a day will be represented in one row in the routes.txt table, and five rows in the trips.txt table.
- Join routes.txt to trips.txt, using the common route_id column.
- Calculate a table with the number of unique trip_id’s per route_id.
- Sort the table according to the number of unique trips, and print a summary with the numbers and names (route_short_name and route_long_name columns from routes.txt, respectively) of the most and least frequent routes (Fig. 31).
route_id | route_short_name | route_long_name | trip_id | |
---|---|---|---|---|
678 | 2256 | 4 | מסוף הטייסים-תל אביב יפו<->מסוף... | 1001 |
677 | 2255 | 4 | מסוף רידינג/רציפים-תל אביב יפו<... | 934 |
3170 | 11525 | 6 | אבי עזרי א-מודיעין עילית<->רמב'... | 840 |
3171 | 11526 | 6 | שדרות בית שמאי/שדרות בית הלל-מו... | 830 |
2746 | 10509 | 1 | קדושת לוי/שלום רב-ביתר עילית<->... | 748 |
... | ... | ... | ... | ... |
2291 | 9539 | 24 | חמת גדר-עמק הירדן<->ת. מרכזית ט... | 1 |
1190 | 4402 | 251 | ת. מרכזית המפרץ/רציפים בינעירונ... | 1 |
1199 | 4443 | 271 | ת. מרכזית נהריה/רציפים-נהריה<->... | 1 |
1204 | 4450 | 273 | אוניברסיטה/רב תכליתי-חיפה<->ת. ... | 1 |
7250 | 29368 | 21 | נהריה-נהריה<->תל אביב ההגנה-תל ... | 1 |
7251 rows × 4 columns
Different functions (agg)#
Sometimes we need to aggregate a DataFrame
while applying a different function on each column, rather than the same function on all columns. For the next example, consider the world_cities.csv
file, which contains information about all major cities in the world:
cities = pd.read_csv('data/world_cities.csv')
cities
city | country | pop | lat | lon | capital | |
---|---|---|---|---|---|---|
0 | 'Abasan al-Jadidah | Palestine | 5629 | 31.31 | 34.34 | 0 |
1 | 'Abasan al-Kabirah | Palestine | 18999 | 31.32 | 34.35 | 0 |
2 | 'Abdul Hakim | Pakistan | 47788 | 30.55 | 72.11 | 0 |
3 | 'Abdullah-as-Salam | Kuwait | 21817 | 29.36 | 47.98 | 0 |
4 | 'Abud | Palestine | 2456 | 32.03 | 35.07 | 0 |
... | ... | ... | ... | ... | ... | ... |
43640 | az-Zubayr | Iraq | 124611 | 30.39 | 47.71 | 0 |
43641 | az-Zulfi | Saudi Arabia | 54070 | 26.30 | 44.80 | 0 |
43642 | az-Zuwaytinah | Libya | 21984 | 30.95 | 20.12 | 0 |
43643 | s-Gravenhage | Netherlands | 479525 | 52.07 | 4.30 | 0 |
43644 | s-Hertogenbosch | Netherlands | 135529 | 51.68 | 5.30 | 0 |
43645 rows × 6 columns
Suppose we want to summarize the information per country ("country"
column). However, we want each column to be treated differently:
- The city names ('city' column) will be counted
- Population ('pop' column) will be summed
- Longitude and latitude ('lon' and 'lat' columns, respectively) will be averaged
This can be done using the .agg
method applied on a grouped DataFrame
, instead of a specific function, such as nunique
. The .agg
method accepts a dictionary, with key:value
pairs of the form 'column':function
, where:
- 'column'—name of the column to be aggregated
- function—the function to apply on each group in that column. This can be: a string referring to a pandas function (such as 'nunique', see Table 15), or a plain function that can be applied on a Series, such as np.mean or np.sum
In our case, the aggregation expression is as follows. As a result, we get a new table where each row represents a country, with the count of unique city names, the sum of the population across all cities, and the average longitude and latitude across all cities:
cities.groupby('country').agg({
'city': 'nunique',
'pop': np.sum,
'lon': np.mean,
'lat': np.mean
}).reset_index()
country | city | pop | lon | lat | |
---|---|---|---|---|---|
0 | Afghanistan | 115 | 7543856 | 66.920342 | 34.807692 |
1 | Albania | 67 | 1536232 | 19.975522 | 41.087463 |
2 | Algeria | 315 | 20508642 | 3.535443 | 35.467025 |
3 | American Samoa | 35 | 58021 | -170.663714 | -14.299143 |
4 | Andorra | 7 | 69031 | 1.538571 | 42.534286 |
... | ... | ... | ... | ... | ... |
234 | Wallis and Futuna | 23 | 11380 | -176.585652 | -13.512609 |
235 | Western Sahara | 4 | 338786 | -13.820000 | 25.935000 |
236 | Yemen | 30 | 5492077 | 45.088667 | 14.481000 |
237 | Zambia | 73 | 4032170 | 28.179315 | -13.404247 |
238 | Zimbabwe | 76 | 4231859 | 30.442500 | -18.629211 |
239 rows × 5 columns
Custom functions (agg)#
Finally, sometimes we need to apply a custom function when aggregating a table. For example, suppose that we need to calculate the population difference between the most populated and least populated cities in each country. To do that, we can define our own function, named diff_max_min
, which:
accepts a
Series
named x, and returns the difference between the maximum and minimum,
x.max()-x.min()
.
Here is the function definition:
def diff_max_min(x):
return x.max() - x.min()
It is worthwhile to test the function, to make sure it works as expected:
diff_max_min(pd.Series([15, 20, 4, 12, 6]))
16
Now, we can use the function inside the .agg
expression, just like any other predefined function:
cities.groupby('country').agg({
'city': 'nunique',
'pop': diff_max_min
}).reset_index()
country | city | pop | |
---|---|---|---|
0 | Afghanistan | 115 | 3117579 |
1 | Albania | 67 | 379803 |
2 | Algeria | 315 | 2024344 |
3 | American Samoa | 35 | 11341 |
4 | Andorra | 7 | 17740 |
... | ... | ... | ... |
234 | Wallis and Futuna | 23 | 1142 |
235 | Western Sahara | 4 | 146906 |
236 | Yemen | 30 | 1918607 |
237 | Zambia | 73 | 1305210 |
238 | Zimbabwe | 76 | 1574930 |
239 rows × 3 columns
For example, here is how the “new” pop value is obtained for the first country ('Afghanistan'
):
diff_max_min(cities[cities['country'] == 'Afghanistan']['pop'])
3117579
Alternatively, we can use a lambda function (see Lambda functions), to do the same, using more concise syntax:
cities.groupby('country').agg({
'city': 'nunique',
'pop': lambda x: x.max() - x.min()
}).reset_index()
country | city | pop | |
---|---|---|---|
0 | Afghanistan | 115 | 3117579 |
1 | Albania | 67 | 379803 |
2 | Algeria | 315 | 2024344 |
3 | American Samoa | 35 | 11341 |
4 | Andorra | 7 | 17740 |
... | ... | ... | ... |
234 | Wallis and Futuna | 23 | 1142 |
235 | Western Sahara | 4 | 146906 |
236 | Yemen | 30 | 1918607 |
237 | Zambia | 73 | 1305210 |
238 | Zimbabwe | 76 | 1574930 |
239 rows × 3 columns
Note
For more information and examples on aggregation, and other types of split-apply-combine operations which we have not mentioned, see https://pandas.pydata.org/docs/user_guide/groupby.html#aggregation.
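One such variant, not covered in detail here, is passing a list of function names to .agg, to obtain several summaries of the same column at once; a minimal sketch:
cities.groupby('country')['pop'].agg(['sum', 'mean', 'max']).reset_index()  ## several summaries of 'pop' per country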
Value counts#
A very common type of aggregation is to calculate how many times each unique value is “repeated” in a Series
. For example, to find out how many times each country appears in the cities
table, we can group the table by "country"
, then apply .count
on any other column (such as "city"
):
cities[['country', 'city']].groupby('country').count().reset_index()
country | city | |
---|---|---|
0 | Afghanistan | 117 |
1 | Albania | 67 |
2 | Algeria | 316 |
3 | American Samoa | 35 |
4 | Andorra | 7 |
... | ... | ... |
234 | Wallis and Futuna | 23 |
235 | Western Sahara | 4 |
236 | Yemen | 30 |
237 | Zambia | 73 |
238 | Zimbabwe | 76 |
239 rows × 2 columns
However, this expression is quite long and not straightforward. Instead, pandas defines a much shorter way to count the occurrences of each unique value, namely the .value_counts method. For example:
cities['country'].value_counts()
France 1000
Greece 1000
Romania 999
Japan 999
Germany 998
...
Saint-Martin 1
Norfolk Island 1
Gibraltar 1
Easter Island 1
Nauru 1
Name: country, Length: 239, dtype: int64
The result is a Series where the index is the country name, and the values are the counts. It tells us how many times each unique value in the "country" column is repeated, that is, how many rows (i.e., cities) there are in the table for that country. For example, "France" appears in 1000 rows.
As you can see, the resulting Series is sorted by count, from highest to lowest. In case we need to sort by index, i.e., by country name, we can use the .sort_index method:
cities['country'].value_counts().sort_index()
Afghanistan 117
Albania 67
Algeria 316
American Samoa 35
Andorra 7
...
Wallis and Futuna 23
Western Sahara 4
Yemen 30
Zambia 73
Zimbabwe 76
Name: country, Length: 239, dtype: int64
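A related option is the normalize=True argument of .value_counts, which returns proportions (summing to 1) instead of absolute counts; a minimal sketch:
cities['country'].value_counts(normalize=True)  ## proportion of cities in each country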
Exercise 06-e
How often is there more than one shape (shape_id) for the same route (route_id)?
- Use the trips.txt table to calculate the number of unique values of shape_id for each route_id.
- Summarize the resulting column to calculate how many route_id’s have 0, 1, 2, etc., values of shape_id. (answer: 4 routes have zero shape_id, 112 have two, and 7135 have one)
More exercises#
Exercise 06-f
- Read the stops.txt and stop_times.txt tables (located in the gtfs directory).
- Subset the 'stop_id', 'stop_name', 'stop_lon', and 'stop_lat' columns from stops, and the 'trip_id', 'arrival_time', 'stop_id', 'stop_sequence', and 'shape_dist_traveled' columns from stop_times. (These are the only columns you are going to need for this exercise.)
- The trip_id value of "55745843_180421" is one of the trips of Metropoline bus line number 24 in Beer-Sheva, departing from the Ramot neighborhood at 08:15:00 and arriving at the Central Bus Station of Beer-Sheva at 08:53:19. Subset the stop_times table to get the stop times for only this particular bus trip.
- How many stops are there in the bus route? Write an expression to get the answer as an int (answer: 32).
- Join the resulting stop_times subset with the stops table, using the common "stop_id" column, to get the names of the stops.
- At what time does the bus arrive at the station named 'מרכז הספורט אוניברסיטת בן גוריון'? Write an expression to get the answer as a str (answer: '08:37:21').
- The shape_dist_traveled column contains the cumulative traveled distance at each stop. What is the total traveled distance, i.e., the last value in the shape_dist_traveled column? (answer: 12767)
- Which pair of consecutive stations are closest to each other (i.e., have the minimal consecutive difference in shape_dist_traveled)? Hint: use the .diff method to get a Series of consecutive differences, then use the .idxmin method to find the index of the minimum in that series.
- Which pair of consecutive stations are farthest from each other (i.e., have the maximal consecutive difference in shape_dist_traveled)? Hint: use the .idxmax method to find the index of the maximum in a series.
Exercise 06-g
In this exercise, you are going to find the names of the five longest bus routes (in terms of distance traveled) in Israel. To do that, go through the following steps.
- Read the stop_times.txt table and select just the 'trip_id' and 'shape_dist_traveled' columns.
- Calculate the maximum (i.e., total) distance traveled ('shape_dist_traveled') for each public transit trip ('trip_id') (Fig. 32).
- Read the trips.txt table and select just the 'trip_id' and 'route_id' columns.
- Join the resulting trips table with the stop_times table from the previous step (using the common "trip_id" column), to get the 'route_id' value of each trip.
- Retain just one (the first) trip per 'route_id' (Fig. 33).
- Read the routes.txt table and subset the 'route_id', 'route_short_name', and 'route_long_name' columns.
- Join the resulting table with the trips table (using the common 'route_id' column), to get the 'route_short_name' and 'route_long_name' values of each route (Fig. 34).
- Sort the table according to 'shape_dist_traveled' (from largest to smallest), and print the first five rows (Fig. 35). Those are the longest-distance public transport routes in the GTFS dataset.
- Convert the 'route_long_name' values of the first five rows from the previous step, collect them into a list, and print the list, to see the full names of the routes. You can see that all five routes are from Haifa to Eilat or the other way around.
trip_id | shape_dist_traveled | |
---|---|---|
0 | 10021427_240421 | 49074 |
1 | 10021428_240421 | 49074 |
2 | 10021429_240421 | 49074 |
3 | 10021430_240421 | 49074 |
4 | 10033656_180421 | 2874 |
... | ... | ... |
287659 | 9957863_180421 | 33468 |
287660 | 9967071_180421 | 22341 |
287661 | 9967462_180421 | 39061 |
287662 | 9967467_180421 | 39061 |
287663 | 9967472_180421 | 39061 |
287664 rows × 2 columns
route_id | trip_id | shape_dist_traveled | |
---|---|---|---|
0 | 1 | 28876224_180421 | 7072 |
1 | 2 | 25448344_180421 | 6870 |
2 | 3 | 56334623_180421 | 9855 |
3 | 5 | 56335137_180421 | 9751 |
4 | 7 | 37173391_180421 | 7794 |
... | ... | ... | ... |
7245 | 29363 | 56622661_230421 | 33173 |
7246 | 29364 | 56460152_180421 | 4300 |
7247 | 29365 | 56458677_180421 | 16100 |
7248 | 29366 | 56458667_180421 | 54800 |
7249 | 29368 | 56458657_180421 | 81600 |
7250 rows × 3 columns
route_id | route_short_name | route_long_name | trip_id | shape_dist_traveled | |
---|---|---|---|---|---|
0 | 1 | 1 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | 28876224_180421 | 7072 |
1 | 2 | 1 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... | 25448344_180421 | 6870 |
2 | 3 | 2 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | 56334623_180421 | 9855 |
3 | 5 | 2 | ת. רכבת יבנה מזרח-יבנה<->ת. רכב... | 56335137_180421 | 9751 |
4 | 7 | 3 | ת. רכבת יבנה מערב-יבנה<->ת. רכב... | 37173391_180421 | 7794 |
... | ... | ... | ... | ... | ... |
7245 | 29363 | 450 | מסוף האוניברסיטה-תל אביב יפו<->... | 56622661_230421 | 33173 |
7246 | 29364 | 125 | תל אביב ההגנה-תל אביב יפו<->הרצ... | 56460152_180421 | 4300 |
7247 | 29365 | 69 | רחובות-רחובות<->תל אביב מרכז-תל... | 56458677_180421 | 16100 |
7248 | 29366 | 60 | נתניה-נתניה<->אשקלון-אשקלון | 56458667_180421 | 54800 |
7249 | 29368 | 21 | נהריה-נהריה<->תל אביב ההגנה-תל ... | 56458657_180421 | 81600 |
7250 rows × 5 columns
route_id | route_short_name | route_long_name | trip_id | shape_dist_traveled | |
---|---|---|---|---|---|
5856 | 19964 | 991 | ת. מרכזית חוף הכרמל/רציפים בינע... | 30526419_230421 | 446174 |
1668 | 7295 | 990 | ת. מרכזית אילת/רציפים-אילת<->ת.... | 56505849_180421 | 445508 |
6199 | 21726 | 991 | ת. מרכזית אילת/רציפים-אילת<->ת.... | 56505908_180421 | 445472 |
1669 | 7297 | 990 | ת. מרכזית חוף הכרמל/רציפים בינע... | 30526431_230421 | 443968 |
1671 | 7307 | 993 | ת. מרכזית המפרץ/רציפים בינעירונ... | 56505854_180421 | 432200 |
- 1
For a visual explanation of join types, see https://dataschool.com/how-to-teach-people-sql/sql-join-types-explained-visually/.