Table reshaping and joins

Last updated: 2022-06-26 00:26:24

Introduction

In the previous chapter (see Tables (pandas)), we introduced the pandas package, and covered the basic operations with tables using pandas, such as importing and exporting, subsetting, renaming columns, sorting, and calculating new columns. In practice, we often need to do more than that. For example, we may need to aggregate, or to reshape, tables into a different form, or to combine numerous tables into one. In this chapter, we cover some of the common table operations which involve modifying table structure or combining multiple tables.

When working with data, it is often necessary to combine data from multiple sources, and/or reshape data into the right form that a particular function accepts:

  • Combining may simply involve “pasting”, or concatenating, two tables one on top of the other, or side-by-side (see Concatenation).

  • In more complex cases, combining involves a table join, where the information from two tables is combined according to one or more “key” columns which are common to both tables (see Joining tables).

  • Moreover, we may need to summarize the values of a table according to a particular grouping, which is known as aggregation (see Aggregation).

  • A special type of aggregation is applying a function across each row, or each column, of a table (Row/col-wise operations).

In this chapter, we learn about all of those common methods for combining and reshaping tables. By the end of the chapter, you will be able not just to import and transform tables in Python (see Tables (pandas)), but also to reshape and combine them into the right form, facilitating further analysis and insight from the data in those tables.

Sample data

First, let us recreate the small table named stations which we used in the previous chapter (see Creating a DataFrame):

import numpy as np
import pandas as pd
name = pd.Series(['Beer-Sheva Center', 'Beer-Sheva University', 'Dimona'])
city = pd.Series(['Beer-Sheva', 'Beer-Sheva', 'Dimona'])
lines = pd.Series([4, 5, 1])
piano = pd.Series([False, True, False])
lon = pd.Series([34.798443, 34.812831, 35.011635])
lat = pd.Series([31.243288, 31.260284, 31.068616])
d = {'name': name, 'city': city, 'lines': lines, 'piano': piano, 'lon': lon, 'lat': lat}
stations = pd.DataFrame(d)
stations
name city lines piano lon lat
0 Beer-Sheva Center Beer-Sheva 4 False 34.798443 31.243288
1 Beer-Sheva University Beer-Sheva 5 True 34.812831 31.260284
2 Dimona Dimona 1 False 35.011635 31.068616

We will also load the sample climatic data which we are also familiar with from the previous chapter (Reading from file), into a DataFrame named dat:

dat = pd.read_csv('data/ZonAnn.Ts+dSST.csv')
dat
Year Glob NHem SHem 24N-90N ... EQU-24N 24S-EQU 44S-24S 64S-44S 90S-64S
0 1880 -0.16 -0.29 -0.04 -0.37 ... -0.15 -0.09 -0.03 0.05 0.65
1 1881 -0.08 -0.17 0.01 -0.34 ... 0.10 0.12 -0.05 -0.07 0.57
2 1882 -0.11 -0.21 0.00 -0.31 ... -0.05 -0.04 0.02 0.04 0.60
3 1883 -0.17 -0.28 -0.06 -0.35 ... -0.17 -0.14 -0.03 0.07 0.48
4 1884 -0.28 -0.43 -0.14 -0.61 ... -0.13 -0.15 -0.19 -0.02 0.63
... ... ... ... ... ... ... ... ... ... ... ...
137 2017 0.92 1.18 0.67 1.40 ... 0.87 0.77 0.76 0.35 0.55
138 2018 0.85 1.04 0.66 1.25 ... 0.73 0.62 0.79 0.37 0.96
139 2019 0.98 1.22 0.75 1.43 ... 0.91 0.89 0.74 0.39 0.83
140 2020 1.02 1.36 0.68 1.68 ... 0.89 0.84 0.58 0.39 0.91
141 2021 0.85 1.15 0.55 1.43 ... 0.73 0.59 0.71 0.32 0.33

142 rows × 15 columns

Concatenation

Overview

The most basic method for combining several tables into one is concatenation, which essentially means binding tables one on top of another, or side-by-side:

  • Concatenation by row is usually desired when the two (or more) tables share exactly the same columns. For example, combining several tables reflecting the same data (e.g., measured on different dates, or for different subjects) involves concatenation by row.

  • Concatenation by column is useful when we need to attach new variables for the same observations we already have. For example, we may want to combine several different types of observations which are split across separate files.

In pandas, concatenation can be done using pd.concat. The pd.concat function accepts a list of tables, which are combined:

  • By row when using axis=0 (the default)

  • By column when using axis=1

Concatenation by row

To demonstrate concatenation by row, let us create two subsets x and y of stations, with the 1st row and the 2nd-3rd rows, respectively:

x = stations.iloc[[0], :]     ## 1st row
x
name city lines piano lon lat
0 Beer-Sheva Center Beer-Sheva 4 False 34.798443 31.243288
y = stations.iloc[[1, 2], :]  ## 2nd and 3rd rows
y
name city lines piano lon lat
1 Beer-Sheva University Beer-Sheva 5 True 34.812831 31.260284
2 Dimona Dimona 1 False 35.011635 31.068616

The expression pd.concat([x, y]) can then be used to concatenate x and y by row, re-creating the complete table stations:

pd.concat([x, y])
name city lines piano lon lat
0 Beer-Sheva Center Beer-Sheva 4 False 34.798443 31.243288
1 Beer-Sheva University Beer-Sheva 5 True 34.812831 31.260284
2 Dimona Dimona 1 False 35.011635 31.068616

Note that pd.concat accepts a list of DataFrames to concatenate, such as [x,y]. In this example the list is of length 2, but in general it can be of any length.
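To illustrate, here is a minimal sketch (with hypothetical one-row tables a, b, and c) of concatenating more than two tables at once; the ignore_index=True argument renumbers the rows of the result:

```python
import pandas as pd

# three hypothetical one-row tables sharing the same column
a = pd.DataFrame({'v': [1]})
b = pd.DataFrame({'v': [2]})
c = pd.DataFrame({'v': [3]})

# the list passed to pd.concat may be of any length;
# ignore_index=True renumbers the rows of the result as 0, 1, 2
combined = pd.concat([a, b, c], ignore_index=True)
```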

Note

There is a special method .append which is a shortcut for concatenation by row. Namely, instead of pd.concat([x,y]) you can use x.append(y). Note, however, that .append was deprecated in pandas 1.4 (and later removed), so pd.concat is the preferred approach.

Concatenation by column

To demonstrate concatenation by column, let us now split stations into subsets x and y by column. For example, the x table is going to contain one column, 'name':

x = stations[['name']]
x
name
0 Beer-Sheva Center
1 Beer-Sheva University
2 Dimona

and the y table is going to contain two columns 'lon' and 'lat':

y = stations[['lon', 'lat']]
y
lon lat
0 34.798443 31.243288
1 34.812831 31.260284
2 35.011635 31.068616

To concatenate by column, we use pd.concat with the axis=1 argument. This concatenates the tables x and y side-by-side:

pd.concat([x, y], axis=1)
name lon lat
0 Beer-Sheva Center 34.798443 31.243288
1 Beer-Sheva University 34.812831 31.260284
2 Dimona 35.011635 31.068616

Alignment by index

Note that concatenation—like many other operations in pandas—operates based on the index, rather than position. This is not a problem when concatenating by row, where columns typically need to be aligned by their index, i.e., column name (see Concatenation by row). When concatenating by column, however, we need to pay attention to the index: is it meaningful, so that we want to use it when aligning rows? Or is it arbitrary, so that we want to reset it before concatenating?

For example, here is what happens in concatenation by column, when the index is not aligned (see Setting Series index):

y.index = [2,3,4]
pd.concat([x, y], axis=1)
name lon lat
0 Beer-Sheva Center NaN NaN
1 Beer-Sheva University NaN NaN
2 Dimona 34.798443 31.243288
3 NaN 34.812831 31.260284
4 NaN 35.011635 31.068616

To make sure we concatenate based on position, we may reset the index (using .reset_index(drop=True)) on both tables before concatenation:

y = y.reset_index(drop=True)
pd.concat([x, y], axis=1)
name lon lat
0 Beer-Sheva Center 34.798443 31.243288
1 Beer-Sheva University 34.812831 31.260284
2 Dimona 35.011635 31.068616

Row/col-wise operations

Built-in methods

pandas defines summary methods which can be used to calculate row-wise and col-wise summaries of DataFrames. The pandas methods are generally similar to the numpy 2D summary functions and methods (see Summaries per dimension (2D)) which we learned about earlier, but there are several important differences to keep in mind:

  • pandas only has methods, not functions, i.e., there is a .mean method but there is no pd.mean function—unlike numpy, which mostly defines both, e.g., .mean and np.mean (see Table 14)

  • There is no “global” summary option in the pandas methods, i.e., we can only calculate column (axis=0, the default) or row (axis=1) summaries—unlike numpy where “global” summary is the default option (see Global summaries)

  • The pandas methods exclude “No Data” by default (see Operators with missing values), unless we explicitly specify skipna=False—unlike numpy which by default includes “No Data”, unless using the specific “No Data”-safe functions such as np.nanmean (see Operations with “No Data”)
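The skipna difference can be sketched with a small made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
m1 = s.mean()               # 2.0: NaN values are excluded by default
m2 = s.mean(skipna=False)   # nan: NaN propagates, as in np.mean
```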

The most commonly used pandas summary methods are listed in Table 15.

Table 15 Pandas aggregation methods

Method    Operation
.sum      Sum
.min      Minimum
.max      Maximum
.mean     Mean
.median   Median
.first    First value
.last     Last value
.count    Count (of non-missing values)
.nunique  Number of unique values
.idxmin   Index of (first) minimum
.idxmax   Index of (first) maximum
.any      Is at least one element True?
.all      Are all elements True?

For the next few examples, let us take the regional temperature anomaly time series into a separate DataFrame named regions:

cols = ['90S-64S', '64S-44S', '44S-24S', '24S-EQU', 'EQU-24N', '24N-44N', '44N-64N', '64N-90N']
regions = dat[cols]
regions
90S-64S 64S-44S 44S-24S 24S-EQU EQU-24N 24N-44N 44N-64N 64N-90N
0 0.65 0.05 -0.03 -0.09 -0.15 -0.27 -0.49 -0.81
1 0.57 -0.07 -0.05 0.12 0.10 -0.19 -0.45 -0.92
2 0.60 0.04 0.02 -0.04 -0.05 -0.15 -0.28 -1.42
3 0.48 0.07 -0.03 -0.14 -0.17 -0.27 -0.56 -0.19
4 0.63 -0.02 -0.19 -0.15 -0.13 -0.47 -0.64 -1.31
... ... ... ... ... ... ... ... ...
137 0.55 0.35 0.76 0.77 0.87 1.05 1.38 2.54
138 0.96 0.37 0.79 0.62 0.73 1.06 1.09 2.15
139 0.83 0.39 0.74 0.89 0.91 1.01 1.44 2.73
140 0.91 0.39 0.58 0.84 0.89 1.20 1.82 2.92
141 0.33 0.32 0.71 0.59 0.73 1.28 1.37 2.04

142 rows × 8 columns

Using the .mean method combined with axis=0 (i.e., the function is applied on each column), we can calculate the mean temperature anomaly in each region:

regions.mean(axis=0)
90S-64S   -0.089507
64S-44S   -0.060704
44S-24S    0.036127
24S-EQU    0.079437
EQU-24N    0.054930
24N-44N    0.036549
44N-64N    0.122042
64N-90N    0.236761
dtype: float64

Note that axis=0 is the default, so we can also get the same result without specifying it. Unlike numpy (see Global summaries), pandas has no “global” option for summary methods such as .mean, only row-wise or column-wise “modes”:

regions.mean()
90S-64S   -0.089507
64S-44S   -0.060704
44S-24S    0.036127
24S-EQU    0.079437
EQU-24N    0.054930
24N-44N    0.036549
44N-64N    0.122042
64N-90N    0.236761
dtype: float64

Exercise 06-a

  • What if we need to, nevertheless, calculate a “global” summary of DataFrame values, such as the mean of values in regions? How can we do that? (answer: 0.05)

Conversely, using axis=1, we can calculate row means, i.e., the average temperature anomaly for each year across all regions:

regions.mean(axis=1)
0     -0.14250
1     -0.11125
2     -0.16000
3     -0.10125
4     -0.28500
        ...   
137    1.03375
138    0.97125
139    1.11750
140    1.19375
141    0.92125
Length: 142, dtype: float64

If necessary, the result can be assigned into a new column (see Creating new columns), here named mean, as follows:

dat['mean'] = regions.mean(axis=1)
dat
Year Glob NHem SHem 24N-90N ... 24S-EQU 44S-24S 64S-44S 90S-64S mean
0 1880 -0.16 -0.29 -0.04 -0.37 ... -0.09 -0.03 0.05 0.65 -0.14250
1 1881 -0.08 -0.17 0.01 -0.34 ... 0.12 -0.05 -0.07 0.57 -0.11125
2 1882 -0.11 -0.21 0.00 -0.31 ... -0.04 0.02 0.04 0.60 -0.16000
3 1883 -0.17 -0.28 -0.06 -0.35 ... -0.14 -0.03 0.07 0.48 -0.10125
4 1884 -0.28 -0.43 -0.14 -0.61 ... -0.15 -0.19 -0.02 0.63 -0.28500
... ... ... ... ... ... ... ... ... ... ... ...
137 2017 0.92 1.18 0.67 1.40 ... 0.77 0.76 0.35 0.55 1.03375
138 2018 0.85 1.04 0.66 1.25 ... 0.62 0.79 0.37 0.96 0.97125
139 2019 0.98 1.22 0.75 1.43 ... 0.89 0.74 0.39 0.83 1.11750
140 2020 1.02 1.36 0.68 1.68 ... 0.84 0.58 0.39 0.91 1.19375
141 2021 0.85 1.15 0.55 1.43 ... 0.59 0.71 0.32 0.33 0.92125

142 rows × 16 columns

Here is a visualization (see Line plots (pandas)) of the mean time series we just calculated:

dat.set_index('Year')['mean'].plot();

apply and custom functions

In case we need to apply a custom function on each row, or column, we can use the .apply method combined with a lambda function (see Lambda functions). For example, here is an alternative way to calculate row-wise means, using the .apply method and a lambda function. The result is the same as regions.mean() (see Built-in methods):

regions.apply(lambda x: x.mean())
90S-64S   -0.089507
64S-44S   -0.060704
44S-24S    0.036127
24S-EQU    0.079437
EQU-24N    0.054930
24N-44N    0.036549
44N-64N    0.122042
64N-90N    0.236761
dtype: float64

The advantage is that we can use any custom expression inside the lambda function definition. For example, the following expression calculates, for each of the 142 years, whether any region had a temperature anomaly above 1 degree. Note the axis=1 part, which specifies that, this time, the function is applied on each row:

regions.apply(lambda x: (x > 1).any(), axis=1)
0      False
1      False
2      False
3      False
4      False
       ...  
137     True
138     True
139     True
140     True
141     True
Length: 142, dtype: bool

Exercise 06-b

  • Function np.polyfit can be used to calculate the slope of linear regression between two columns. For example, using the code section shown below, you can find out that the linear trend of temperature in the "44S-24S" region was an increasing one, at 0.0066 degrees per year.

  • Adapt the code into the form of a function named slope which, given two Series objects x and y, calculates the linear slope. For example, slope(dat["Year"], regions["44S-24S"]) should return 0.0066.

  • Use the slope function you defined within the .apply method, to calculate the slope in all columns of regions at once.

  • Hint: you need to keep the x argument of slope fixed at dat["Year"], while the various columns of regions are passed to y. To do that, write another function (or use a lambda function) that “wraps” slope and runs it with just the y parameter (keeping x fixed).

x = dat["Year"]
y = regions["44S-24S"]
np.polyfit(x, y, 1)[0]

Joining tables

Overview

One of the most common operations when working with data is a table join. In a table join, two tables are combined into one, based on one or more common column(s). For example, when compiling a dataset about different cities, we may collect data on different aspects of each city (population size, socio-economic status, built area, etc.) from different sources. To work with these data together, the various tables need to be joined into one. The common column, in this case, would be the city name or city ID.

When working with pandas, tables can be joined, based on one or more common columns, using the pd.merge function. pd.merge has the following important parameters:

  • left—The first (“left”) table

  • right—The second (“right”) table

  • how—The type of join, one of: 'left', 'right', 'outer', 'inner' (the default), 'cross'

  • on—The name(s) of the common column(s) where identical values are considered a “match”

The type of join (the how parameter) determines what is done with non-matching records. The meaning of the join types is the same as in SQL 1, and will be demonstrated through the examples below.

Stations example

For a first small example, let us get back to the stations table:

stations
name city lines piano lon lat
0 Beer-Sheva Center Beer-Sheva 4 False 34.798443 31.243288
1 Beer-Sheva University Beer-Sheva 5 True 34.812831 31.260284
2 Dimona Dimona 1 False 35.011635 31.068616

Suppose we have another small table named status which describes the status ('Closed' or 'Open') of select stations:

name = pd.Series(['Beer-Sheva Center', 'Beer-Sheva University', 'Tel-Aviv University'])
status = pd.Series(['Closed', 'Open', 'Open'])
status = pd.DataFrame({'name': name, 'status': status})
status
name status
0 Beer-Sheva Center Closed
1 Beer-Sheva University Open
2 Tel-Aviv University Open

The stations and status tables can be joined as follows:

pd.merge(stations, status)
name city lines piano lon lat status
0 Beer-Sheva Center Beer-Sheva 4 False 34.798443 31.243288 Closed
1 Beer-Sheva University Beer-Sheva 5 True 34.812831 31.260284 Open

This relies on the defaults: on is set to the column(s) sharing the same name in both tables (on='name', in this case), and the join type is 'inner' (how='inner'). As you can see, an inner join returns just the matching rows. For clarity, however, it is better to be explicit:

pd.merge(stations, status, on='name', how='inner')
name city lines piano lon lat status
0 Beer-Sheva Center Beer-Sheva 4 False 34.798443 31.243288 Closed
1 Beer-Sheva University Beer-Sheva 5 True 34.812831 31.260284 Open

Unlike an 'inner' join, a 'left' join returns all records from the first (“left”) table, even those that have no match in the second (“right”) table. For example, 'Dimona' station does not appear in the status table, but it is still returned in the result (with status value set to np.nan):

pd.merge(stations, status, on='name', how='left')
name city lines piano lon lat status
0 Beer-Sheva Center Beer-Sheva 4 False 34.798443 31.243288 Closed
1 Beer-Sheva University Beer-Sheva 5 True 34.812831 31.260284 Open
2 Dimona Dimona 1 False 35.011635 31.068616 NaN

A 'right' join is the opposite of a 'left' join, i.e., it returns all records from the “right” table, even those that have no match in the “left” table. This time, values absent in the “left” table are set to np.nan:

pd.merge(stations, status, on='name', how='right')
name city lines piano lon lat status
0 Beer-Sheva Center Beer-Sheva 4.0 False 34.798443 31.243288 Closed
1 Beer-Sheva University Beer-Sheva 5.0 True 34.812831 31.260284 Open
2 Tel-Aviv University NaN NaN NaN NaN NaN Open

Finally, an 'outer' (also known as “full”) join returns all records from both tables, regardless of whether they have a matching record in the other table. Values that have no match in the other table are set to np.nan:

pd.merge(stations, status, on='name', how='outer')
name city lines piano lon lat status
0 Beer-Sheva Center Beer-Sheva 4.0 False 34.798443 31.243288 Closed
1 Beer-Sheva University Beer-Sheva 5.0 True 34.812831 31.260284 Open
2 Dimona Dimona 1.0 False 35.011635 31.068616 NaN
3 Tel-Aviv University NaN NaN NaN NaN NaN Open

When working with data, we often have one main, or “template”, table we are working with (“left”), and we are interested in joining data coming from another, external (“right”) table to it. In this scenario we usually prefer not to lose any records from the “left” table. Even if some of the records have no match and get “No Data” values (namely, np.nan), we usually prefer to keep those records and deal with the missing values as necessary, rather than lose the records altogether. Therefore, a 'left' join is often the most useful in data analysis. For example, in the above situation a 'left' join makes sense if our focus is on the station properties given in stations, so that even though the status of the 'Dimona' station is unknown we prefer to keep the 'Dimona' record.
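When choosing a join type, it can help to first check which records match. One way to do that, sketched here with two hypothetical tables, is the indicator=True parameter of pd.merge, which adds a '_merge' column marking the origin of each row:

```python
import pandas as pd

left = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'name': ['b', 'c'], 'y': [3, 4]})

# '_merge' is 'left_only', 'both', or 'right_only' for each row
result = pd.merge(left, right, on='name', how='outer', indicator=True)
```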

What is GTFS?

In the next examples, we demonstrate table joins with a more realistic, large dataset on public transport, known as GTFS. Before we go into the examples, we need to become familiar with the GTFS format and structure.

The General Transit Feed Specification (GTFS) is a standard format for describing public transport routes and timetables. GTFS datasets are typically published periodically by government transport agencies. For example, the GTFS dataset provided as part of the book’s sample data was downloaded from the Israel Ministry of Transport GTFS website on 2021-04-18. GTFS data is used by commercial companies, for example for routing in Google Maps.

Technically, a GTFS dataset is composed of several CSV files (even though their file extension is .txt), which are linked via common columns, or keys. In this book, we are going to use only some of the files in the GTFS datasets, namely the six files listed in Table 16. Recall that we already met one of the GTFS files, namely stops.txt, earlier, in an example of text file processing through for loops (see File object and reading lines).

Table 16 GTFS files used in the examples

Name            Contents    Keys
agency.txt      Agencies    agency_id
routes.txt      Routes      route_id, agency_id
trips.txt       Trips       route_id, trip_id, shape_id
stop_times.txt  Stop times  trip_id, stop_id
stops.txt       Stops       stop_id
shapes.txt      Shapes      shape_id

Let us see what the GTFS tables look like, what their purpose is, and how they are related to one another. To make things easier, we immediately subset the most important columns, which are also the ones we use in subsequent examples.

We start with agency.txt. This file lists the public transport agencies. Each row represents one agency, identified by agency_id:

filename = 'data/gtfs/agency.txt'
cols = ['agency_id', 'agency_name']
pd.read_csv(filename)[cols]
agency_id agency_name
0 2 רכבת ישראל
1 3 אגד
2 4 אגד תעבורה
3 5 דן
4 6 ש.א.מ
... ... ...
33 95 מוניות יהלום תחבורה
34 96 מוניות גלים
35 97 אודליה מוניות בעמ
36 98 מוניות רב קווית 4-5
37 130 מוניות הדר לוד

38 rows × 2 columns

Note

pd.read_csv uses the 'utf-8' encoding by default. All sample CSV files in this book are in the UTF-8 encoding, so there is no need to specify the encoding when reading them with pd.read_csv. Otherwise, the right encoding can be specified with the encoding parameter of pd.read_csv.
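For illustration, here is a small sketch of reading UTF-8 encoded text with an explicit encoding argument, using an in-memory buffer in place of a real file:

```python
import io
import pandas as pd

# a tiny CSV with Hebrew text, encoded as UTF-8 bytes
raw = 'stop_id,stop_name\n1,מרכז\n'.encode('utf-8')
df = pd.read_csv(io.BytesIO(raw), encoding='utf-8')
```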

The next file is routes.txt. This file describes the public transport routes, identified by route_id and also specifying the name of the route:

  • route_short_name—typically a number, such as bus line number

  • route_long_name—the route textual description

It also contains agency_id to identify the agency which operates each route:

path = 'data/gtfs/routes.txt'
cols = ['route_id', 'agency_id', 'route_short_name', 'route_long_name']
pd.read_csv(path)[cols]
route_id agency_id route_short_name route_long_name
0 1 25 1 ת. רכבת יבנה מערב-יבנה<->ת. רכב...
1 2 25 1 ת. רכבת יבנה מזרח-יבנה<->ת. רכב...
2 3 25 2 ת. רכבת יבנה מערב-יבנה<->ת. רכב...
3 5 25 2 ת. רכבת יבנה מזרח-יבנה<->ת. רכב...
4 7 25 3 ת. רכבת יבנה מערב-יבנה<->ת. רכב...
... ... ... ... ...
7246 29363 18 450 מסוף האוניברסיטה-תל אביב יפו<->...
7247 29364 2 125 תל אביב ההגנה-תל אביב יפו<->הרצ...
7248 29365 2 69 רחובות-רחובות<->תל אביב מרכז-תל...
7249 29366 2 60 נתניה-נתניה<->אשקלון-אשקלון
7250 29368 2 21 נהריה-נהריה<->תל אביב ההגנה-תל ...

7251 rows × 4 columns

The routes.txt table is linked to the trips.txt table, which characterizes trips taking place at specific times of day for a given route. Trips are uniquely identified by trip_id and linked with the routes table via the route_id column. As you can see from the small subset below, each route may be associated with multiple trips. The trips table also has a shape_id column which associates a trip with the geographical line geometry (see shapes.txt below). You can see that trips of the same route at different times of day typically follow exactly the same geographical shape, thus having identical shape_id values, but this is not always guaranteed:

filename = 'data/gtfs/trips.txt'
cols = ['route_id', 'trip_id', 'shape_id']
pd.read_csv(filename)[cols]
route_id trip_id shape_id
0 1 28876224_180421 121757.0
1 1 28876225_180421 121757.0
2 1 28876226_180421 121757.0
3 1 28876227_180421 121757.0
4 1 28876228_180421 121757.0
... ... ... ...
287673 29363 56622686_230421 123144.0
287674 29364 56460152_180421 NaN
287675 29365 56458677_180421 67648.0
287676 29366 56458667_180421 51396.0
287677 29368 56458657_180421 123033.0

287678 rows × 3 columns

The stop_times.txt table lists the stops associated with each trip (via the trip_id column), including the (estimated) time of arrival at each stop. The stop times table also includes the stop_id column to associate it with the stop characteristics (see stops.txt below). In addition, the shape_dist_traveled column contains the cumulative distance traveled at each stop. You can see the increasing distance traveled from one stop to another (in the same trip) in the printout below. The stop_times.txt table is the largest of the GTFS files, in terms of file size: keep in mind that it lists all stops, at all times of day, of all public transport routes:

filename = 'data/gtfs/stop_times.txt'
cols = ['trip_id', 'arrival_time', 'departure_time', 'stop_id', 'shape_dist_traveled']
pd.read_csv(filename)[cols]
trip_id arrival_time departure_time stop_id shape_dist_traveled
0 10021427_240421 00:00:00 00:00:00 36133 0
1 10021427_240421 00:01:43 00:01:43 2356 499
2 10021427_240421 00:03:36 00:03:36 3869 1838
3 10021427_240421 00:05:31 00:05:31 2357 2692
4 10021427_240421 00:06:32 00:06:32 3803 3379
... ... ... ... ... ...
10583641 9967472_180421 11:39:25 11:39:25 34662 37457
10583642 9967472_180421 11:39:58 11:39:58 34492 37729
10583643 9967472_180421 11:40:53 11:40:53 34498 38112
10583644 9967472_180421 11:41:44 11:41:44 34500 38495
10583645 9967472_180421 11:43:31 11:43:31 34503 39061

10583646 rows × 5 columns

The stops.txt table, which we already worked with earlier (see Reading CSV files), describes the fixed properties of the stops, i.e., those unrelated to time of day. It is linked with the stop_times.txt table via the stop_id column. Stop properties include the stop name (stop_name) and its coordinates (stop_lon and stop_lat):

filename = 'data/gtfs/stops.txt'
cols = ['stop_id', 'stop_name', 'stop_lat', 'stop_lon']
pd.read_csv(filename)[cols]
stop_id stop_name stop_lat stop_lon
0 1 בי''ס בר לב/בן יהודה 32.183985 34.917554
1 2 הרצל/צומת בילו 31.870034 34.819541
2 3 הנחשול/הדייגים 31.984553 34.782828
3 6 ת. מרכזית לוד/הורדה 31.956392 34.898098
4 8 הרצל/משה שרת 31.857565 34.824106
... ... ... ... ...
28250 44668 חניון שפירים דרומי/הורדה 32.001228 34.840406
28251 44683 מתחם נצבא 32.112445 34.941549
28252 44684 מתחם נצבא 32.111795 34.940668
28253 44691 מסוף דרך שכם 31.783680 35.228825
28254 44692 הקוממיות/אנה פרנק 32.002782 34.752376

28255 rows × 4 columns

Finally, the shapes.txt table contains the trip geometries. It is linked with the trips.txt table via the common shape_id column.

filename = 'data/gtfs/shapes.txt'
cols = ['shape_id', 'shape_pt_lat', 'shape_pt_lon', 'shape_pt_sequence']
pd.read_csv(filename)[cols]
shape_id shape_pt_lat shape_pt_lon shape_pt_sequence
0 44779 31.887695 35.016271 1
1 44779 31.887745 35.016253 2
2 44779 31.888256 35.016238 3
3 44779 31.888913 35.016280 4
4 44779 31.888917 35.016892 5
... ... ... ... ...
7226268 123167 32.729572 35.331201 876
7226269 123167 32.729700 35.330923 877
7226270 123167 32.729706 35.330908 878
7226271 123167 32.729841 35.330538 879
7226272 123167 32.729899 35.330249 880

7226273 rows × 4 columns

The relations between the six above-mentioned files in the GTFS dataset are summarized in Fig. 30.


Fig. 30 Relation between the six GTFS files listed in Table 16

GTFS example

Now that we have an idea about the structure and purpose of the GTFS dataset, let us demonstrate join operations between its tables. This is a classical example where joining tables is essential: you cannot go very far with GTFS data analysis, or processing, without joining the various tables together (see Fig. 30).

To make things simple, in this example, consider just two of the tables in the GTFS dataset:

  • agency.txt, where rows represent public transport operators

  • routes.txt, where rows represent public transport routes

As shown above (see What is GTFS?), the two tables share a common column agency_id, which we can use to join the two, to find out which public routes belong to each operator.

First, let us import the agency table and subset the columns of interest, "agency_id" and "agency_name":

agency = pd.read_csv('data/gtfs/agency.txt')
agency = agency[['agency_id', 'agency_name']]
agency
agency_id agency_name
0 2 רכבת ישראל
1 3 אגד
2 4 אגד תעבורה
3 5 דן
4 6 ש.א.מ
... ... ...
33 95 מוניות יהלום תחבורה
34 96 מוניות גלים
35 97 אודליה מוניות בעמ
36 98 מוניות רב קווית 4-5
37 130 מוניות הדר לוד

38 rows × 2 columns

Second, let us import the routes table, this time selecting the "route_id", "agency_id", "route_short_name", and "route_long_name" columns. Note the common "agency_id" column, which we are going to use when joining the two tables:

routes = pd.read_csv('data/gtfs/routes.txt')
routes = routes[['route_id', 'agency_id', 'route_short_name', 'route_long_name']]
routes
route_id agency_id route_short_name route_long_name
0 1 25 1 ת. רכבת יבנה מערב-יבנה<->ת. רכב...
1 2 25 1 ת. רכבת יבנה מזרח-יבנה<->ת. רכב...
2 3 25 2 ת. רכבת יבנה מערב-יבנה<->ת. רכב...
3 5 25 2 ת. רכבת יבנה מזרח-יבנה<->ת. רכב...
4 7 25 3 ת. רכבת יבנה מערב-יבנה<->ת. רכב...
... ... ... ... ...
7246 29363 18 450 מסוף האוניברסיטה-תל אביב יפו<->...
7247 29364 2 125 תל אביב ההגנה-תל אביב יפו<->הרצ...
7248 29365 2 69 רחובות-רחובות<->תל אביב מרכז-תל...
7249 29366 2 60 נתניה-נתניה<->אשקלון-אשקלון
7250 29368 2 21 נהריה-נהריה<->תל אביב ההגנה-תל ...

7251 rows × 4 columns

Now, we can join the routes and agency tables based on the common 'agency_id' column:

routes = pd.merge(routes, agency, on='agency_id', how='left')
routes
route_id agency_id route_short_name route_long_name agency_name
0 1 25 1 ת. רכבת יבנה מערב-יבנה<->ת. רכב... אפיקים
1 2 25 1 ת. רכבת יבנה מזרח-יבנה<->ת. רכב... אפיקים
2 3 25 2 ת. רכבת יבנה מערב-יבנה<->ת. רכב... אפיקים
3 5 25 2 ת. רכבת יבנה מזרח-יבנה<->ת. רכב... אפיקים
4 7 25 3 ת. רכבת יבנה מערב-יבנה<->ת. רכב... אפיקים
... ... ... ... ... ...
7246 29363 18 450 מסוף האוניברסיטה-תל אביב יפו<->... קווים
7247 29364 2 125 תל אביב ההגנה-תל אביב יפו<->הרצ... רכבת ישראל
7248 29365 2 69 רחובות-רחובות<->תל אביב מרכז-תל... רכבת ישראל
7249 29366 2 60 נתניה-נתניה<->אשקלון-אשקלון רכבת ישראל
7250 29368 2 21 נהריה-נהריה<->תל אביב ההגנה-תל ... רכבת ישראל

7251 rows × 5 columns

Now we have the additional 'agency_name' column in the routes table, which tells us the name of the operator for each route.

Note

For more information and examples of pd.merge, check out the official tutorials in the pandas documentation.

Aggregation

Overview

Aggregation is commonly used to summarize tabular data. In an aggregation procedure, a function is applied to subsets of table rows, where the subsets are specified through one or more grouping variables, to obtain a new (smaller) table with summaries per group. Aggregation is one of the techniques following the split-apply-combine concept.

In pandas, there are several different ways of specifying the function(s) to be applied on the table column(s). We will go over three commonly used scenarios.

Afterwards, we are going to introduce a useful shortcut for a particularly common type of aggregation: counting the occurrence of each unique value in a column (see Value counts).
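As a preview, counting unique values amounts to something like the following minimal sketch (with made-up values):

```python
import pandas as pd

s = pd.Series(['bus', 'train', 'bus'])
counts = s.value_counts()  # 'bus' appears twice, 'train' once
```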

Same function on all columns

For a minimal example of aggregation, let us consider the small table named stations:

stations
name city lines piano lon lat
0 Beer-Sheva Center Beer-Sheva 4 False 34.798443 31.243288
1 Beer-Sheva University Beer-Sheva 5 True 34.812831 31.260284
2 Dimona Dimona 1 False 35.011635 31.068616

Suppose that we are interested in the number of lines (the lines column) and the number of stations with a piano (the piano column), which involves summing the values in those two columns. We already know how the totals, i.e., sums of each column, can be calculated (see Row/col-wise operations):

stations[['lines', 'piano']].sum()
lines    10
piano     1
dtype: int64

However, what if we wanted to calculate the number of lines and stations with a piano per city, separately? To do that, we basically need to split the table according to the 'city' column, then apply the .sum method on each subset. Instead of simply using .sum, we split (or “group”) the table by 'city', and only then apply the .sum method. Splitting is done using the .groupby method, which accepts the column name (or list of column names) that specifies the grouping:

stations[['city', 'lines', 'piano']].groupby('city').sum()
lines piano
city
Beer-Sheva 9 1
Dimona 1 0

Note that the grouping variable(s) become the index of the resulting DataFrame. For our purposes, we are usually going to prefer to work with the grouping variable in its own column. In that case, we end the expression with .reset_index() (see Resetting the index):

stations[['city', 'lines', 'piano']].groupby('city').sum().reset_index()
city lines piano
0 Beer-Sheva 9 1
1 Dimona 1 0

Note

Tables can also be grouped by two or more columns, in which case the column names are passed in a list, as in dat.groupby(['a','b']), where 'a' and 'b' are column names in dat.
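To illustrate, here is grouping by two columns on a minimal made-up table (the dat, 'a', and 'b' names follow the note above; the values are hypothetical):

```python
import pandas as pd

# Hypothetical table for illustration
dat = pd.DataFrame({
    'a': ['x', 'x', 'y', 'y'],
    'b': [1, 1, 1, 2],
    'value': [10, 20, 30, 40]
})

# Group by both 'a' and 'b'; each unique (a,b) combination becomes one group
dat.groupby(['a', 'b']).sum().reset_index()
#    a  b  value
# 0  x  1     30
# 1  y  1     30
# 2  y  2     40
```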

The .sum method is just one example; there are many other useful functions we can use, such as the ones listed in Table 15. For example, to get the first row per group we can combine .groupby with the .first method:

stations.groupby('city').first().reset_index()
city name lines piano lon lat
0 Beer-Sheva Beer-Sheva Center 4 False 34.798443 31.243288
1 Dimona Dimona 1 False 35.011635 31.068616

For a more realistic example, consider the routes table:

routes
route_id agency_id route_short_name route_long_name agency_name
0 1 25 1 ת. רכבת יבנה מערב-יבנה<->ת. רכב... אפיקים
1 2 25 1 ת. רכבת יבנה מזרח-יבנה<->ת. רכב... אפיקים
2 3 25 2 ת. רכבת יבנה מערב-יבנה<->ת. רכב... אפיקים
3 5 25 2 ת. רכבת יבנה מזרח-יבנה<->ת. רכב... אפיקים
4 7 25 3 ת. רכבת יבנה מערב-יבנה<->ת. רכב... אפיקים
... ... ... ... ... ...
7246 29363 18 450 מסוף האוניברסיטה-תל אביב יפו<->... קווים
7247 29364 2 125 תל אביב ההגנה-תל אביב יפו<->הרצ... רכבת ישראל
7248 29365 2 69 רחובות-רחובות<->תל אביב מרכז-תל... רכבת ישראל
7249 29366 2 60 נתניה-נתניה<->אשקלון-אשקלון רכבת ישראל
7250 29368 2 21 נהריה-נהריה<->תל אביב ההגנה-תל ... רכבת ישראל

7251 rows × 5 columns

We can use aggregation with the .nunique (Table 15) function to calculate the number of unique route IDs that each public transport agency operates, as follows:

tmp = routes[['agency_name', 'route_id']].groupby('agency_name').nunique().reset_index()
tmp
agency_name route_id
0 אגד 1869
1 אגד תעבורה 420
2 אודליה מוניות בעמ 1
3 אפיקים 616
4 גי.בי.טורס 51
... ... ...
26 נתיב אקספרס 598
27 סופרבוס 493
28 קווים 1187
29 רכבת ישראל 111
30 ש.א.מ 153

31 rows × 2 columns

To find out which agencies have the most routes, the resulting table can be sorted using .sort_values combined with ascending=False (see Sorting):

tmp.sort_values('route_id', ascending=False)
agency_name route_id
0 אגד 1869
28 קווים 1187
24 מטרופולין 625
3 אפיקים 616
26 נתיב אקספרס 598
... ... ...
20 מוניות מטרו קו 3
19 מוניות מאיה יצחק שדה 3
18 כרמלית 2
17 כפיר 2
2 אודליה מוניות בעמ 1

31 rows × 2 columns

Exercise 06-c

  • Calculate a series of global temperature means for each decade, based on the information in dat.

  • First, you need to create a new variable named decade based on the "Year", using an expression such as the one shown below.

  • Second, use .groupby to calculate the mean global temperature in each decade.

dat["Year"] // 10 * 10
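To see what this floor-division trick does, here it is applied to a few made-up year values:

```python
import pandas as pd

years = pd.Series([1995, 2003, 2010, 2019])  # hypothetical years
# Integer-divide by 10, then multiply back, to round each year down to its decade
years // 10 * 10
# 0    1990
# 1    2000
# 2    2010
# 3    2010
# dtype: int64
```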

Exercise 06-d

  • The routes.txt table describes public transport routes (uniquely identified by route_id). The trips.txt table, however, describes specific journeys along those routes, at a specific time of day (uniquely identified by trip_id). The two tables are associated with one another via the common route_id column (see What is GTFS?). For example, a route that operates five times a day will be represented in one row in the routes.txt table, and five rows in the trips.txt table.

  • Join routes.txt to trips.txt, using the common route_id column.

  • Calculate a table with the number of unique trip_id’s per route_id.

  • Sort the table according to number of unique trips, and print a summary with the numbers and names (route_short_name and route_long_name columns from routes.txt, respectively) of the most and least frequent routes (Fig. 31).

route_id route_short_name route_long_name trip_id
678 2256 4 מסוף הטייסים-תל אביב יפו<->מסוף... 1001
677 2255 4 מסוף רידינג/רציפים-תל אביב יפו<... 934
3170 11525 6 אבי עזרי א-מודיעין עילית<->רמב'... 840
3171 11526 6 שדרות בית שמאי/שדרות בית הלל-מו... 830
2746 10509 1 קדושת לוי/שלום רב-ביתר עילית<->... 748
... ... ... ... ...
2291 9539 24 חמת גדר-עמק הירדן<->ת. מרכזית ט... 1
1190 4402 251 ת. מרכזית המפרץ/רציפים בינעירונ... 1
1199 4443 271 ת. מרכזית נהריה/רציפים-נהריה<->... 1
1204 4450 273 אוניברסיטה/רב תכליתי-חיפה<->ת. ... 1
7250 29368 21 נהריה-נהריה<->תל אביב ההגנה-תל ... 1

7251 rows × 4 columns

Fig. 31 Solution of exercise-06-d

Different functions (agg)

Sometimes we need to aggregate a DataFrame while applying a different function on each column, rather than the same function on all columns. For the next example, consider the world_cities.csv file, which contains information about all major cities in the world:

cities = pd.read_csv('data/world_cities.csv')
cities
city country pop lat lon capital
0 'Abasan al-Jadidah Palestine 5629 31.31 34.34 0
1 'Abasan al-Kabirah Palestine 18999 31.32 34.35 0
2 'Abdul Hakim Pakistan 47788 30.55 72.11 0
3 'Abdullah-as-Salam Kuwait 21817 29.36 47.98 0
4 'Abud Palestine 2456 32.03 35.07 0
... ... ... ... ... ... ...
43640 az-Zubayr Iraq 124611 30.39 47.71 0
43641 az-Zulfi Saudi Arabia 54070 26.30 44.80 0
43642 az-Zuwaytinah Libya 21984 30.95 20.12 0
43643 s-Gravenhage Netherlands 479525 52.07 4.30 0
43644 s-Hertogenbosch Netherlands 135529 51.68 5.30 0

43645 rows × 6 columns

Suppose we want to summarize the information per country ("country" column). However, we want each column to be treated differently:

  • The city names ('city' column) will be counted

  • Population ('pop' column) will be summed

  • Longitude and latitude ('lon' and 'lat' columns, respectively) will be averaged

This can be done using the .agg method applied on a grouped DataFrame, instead of a specific function, such as nunique. The .agg method accepts a dictionary, with key:value pairs of the form 'column':function, where:

  • 'column'—name of the column to be aggregated

  • function—the function to apply on each group in that column. This can be:

    • a string referring to a pandas function (such as 'nunique', see Table 15)

    • a plain function that can be applied on a Series, such as np.mean or np.sum

In our case, the aggregation expression is as follows. As a result, we get a new table where each row represents a country, with the count of unique city names, the sum of the population, and the average longitude and latitude across all cities:

cities.groupby('country').agg({
    'city': 'nunique', 
    'pop': np.sum, 
    'lon': np.mean, 
    'lat': np.mean
}).reset_index()
country city pop lon lat
0 Afghanistan 115 7543856 66.920342 34.807692
1 Albania 67 1536232 19.975522 41.087463
2 Algeria 315 20508642 3.535443 35.467025
3 American Samoa 35 58021 -170.663714 -14.299143
4 Andorra 7 69031 1.538571 42.534286
... ... ... ... ... ...
234 Wallis and Futuna 23 11380 -176.585652 -13.512609
235 Western Sahara 4 338786 -13.820000 25.935000
236 Yemen 30 5492077 45.088667 14.481000
237 Zambia 73 4032170 28.179315 -13.404247
238 Zimbabwe 76 4231859 30.442500 -18.629211

239 rows × 5 columns

Custom functions (agg)

Finally, sometimes we need to apply a custom function when aggregating a table. For example, suppose that we need to calculate the population difference between the most populated and least populated city in each country. To do that, we can define our own function, named diff_max_min, which:

  • accepts a Series named x, and

  • returns the difference between the maximum and minimum, x.max()-x.min().

Here is the function definition:

def diff_max_min(x):
    return x.max() - x.min()

It is worthwhile to test the function, to make sure it works as expected:

diff_max_min(pd.Series([15, 20, 4, 12, 6]))
16

Now, we can use the function inside the .agg expression, just like any other predefined function:

cities.groupby('country').agg({
    'city': 'nunique', 
    'pop': diff_max_min
}).reset_index()
country city pop
0 Afghanistan 115 3117579
1 Albania 67 379803
2 Algeria 315 2024344
3 American Samoa 35 11341
4 Andorra 7 17740
... ... ... ...
234 Wallis and Futuna 23 1142
235 Western Sahara 4 146906
236 Yemen 30 1918607
237 Zambia 73 1305210
238 Zimbabwe 76 1574930

239 rows × 3 columns

For example, here is how the “new” pop value is obtained for the first country ('Afghanistan'):

diff_max_min(cities[cities['country'] == 'Afghanistan']['pop'])
3117579

Alternatively, we can use a lambda function (see Lambda functions), to do the same, using more concise syntax:

cities.groupby('country').agg({
    'city': 'nunique', 
    'pop': lambda x: x.max() - x.min()
}).reset_index()
country city pop
0 Afghanistan 115 3117579
1 Albania 67 379803
2 Algeria 315 2024344
3 American Samoa 35 11341
4 Andorra 7 17740
... ... ... ...
234 Wallis and Futuna 23 1142
235 Western Sahara 4 146906
236 Yemen 30 1918607
237 Zambia 73 1305210
238 Zimbabwe 76 1574930

239 rows × 3 columns

Note

For more information and examples on aggregation, and other types of split-apply-combine operations which we have not mentioned, see https://pandas.pydata.org/docs/user_guide/groupby.html#aggregation.

Value counts

A very common type of aggregation is to calculate how many times each unique value is “repeated” in a Series. For example, to find out how many times each country appears in the cities table, we can group the table by "country", then apply .count on any other column (such as "city"):

cities[['country', 'city']].groupby('country').count().reset_index()
country city
0 Afghanistan 117
1 Albania 67
2 Algeria 316
3 American Samoa 35
4 Andorra 7
... ... ...
234 Wallis and Futuna 23
235 Western Sahara 4
236 Yemen 30
237 Zambia 73
238 Zimbabwe 76

239 rows × 2 columns

However, this expression is quite long and not straightforward. Instead, pandas defines a much shorter way to count unique values, namely the value_counts method. For example:

cities['country'].value_counts()
Greece              1000
France              1000
Japan                999
Romania              999
Germany              998
                    ... 
Easter Island          1
Norfolk Island         1
Seychelles             1
Saint-Barthelemy       1
Singapore              1
Name: country, Length: 239, dtype: int64

The result is a Series where the index is the country name, and the values are the counts. It tells us how many times each unique value in the "country" column is repeated, that is, how many rows (i.e., cities) there are in the table for that country. For example, "France" appears in 1000 rows.

As you can see, the resulting Series is sorted by count, from highest to lowest. In case we need to sort by index, i.e., by country name, we can use the .sort_index method:

cities['country'].value_counts().sort_index()
Afghanistan          117
Albania               67
Algeria              316
American Samoa        35
Andorra                7
                    ... 
Wallis and Futuna     23
Western Sahara         4
Yemen                 30
Zambia                73
Zimbabwe              76
Name: country, Length: 239, dtype: int64

Exercise 06-e

  • How often is there more than one shape (shape_id) for the same route (route_id)?

  • Use the trips.txt table to calculate the number of unique values of shape_id for each route_id.

  • Summarize the resulting column to calculate how many route_ids have 0, 1, 2, etc., unique values of shape_id. (answer: 4 routes have zero shape_ids, 112 have two, and 7135 have one)
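The "counting the counts" pattern in this exercise can be sketched with made-up IDs (the toy values below are hypothetical, not from trips.txt):

```python
import pandas as pd

# Hypothetical trips: routes 1 and 2 have one shape each, route 3 has two
trips = pd.DataFrame({
    'route_id': [1, 1, 2, 3, 3],
    'shape_id': ['a', 'a', 'b', 'c', 'd']
})

# Step 1: number of unique shape_ids per route_id
n = trips.groupby('route_id')['shape_id'].nunique()

# Step 2: how many routes have each number of unique shapes
n.value_counts()
# 1    2
# 2    1
```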

More exercises

Exercise 06-f

  • Read the stops.txt and stop_times.txt tables (located in the gtfs directory).

  • Subset the:

    • 'stop_id', 'stop_name', 'stop_lon', and 'stop_lat' columns from stops, and

    • 'trip_id', 'arrival_time', 'stop_id', 'stop_sequence', and 'shape_dist_traveled' columns from stop_times. (These are the only columns you are going to need for this exercise.)

  • The trip_id value of "55745843_180421" is one of the trips of Metropoline bus line number 24 in Beer-Sheva, departing from the Ramot neighborhood at 08:15:00 and arriving at the Central Bus Station of Beer-Sheva at 08:53:19. Subset the stop_times table to get the stop times for only this particular bus trip.

  • How many stops are there in the bus route? Write an expression to get the answer as an int (answer: 32).

  • Join the resulting stop_times subset with the stops table, using the common "stop_id" column, to get the names of the stops.

  • At what time does the bus arrive at the station named 'מרכז הספורט אוניברסיטת בן גוריון'? Write an expression to get the answer as a str (answer: '08:37:21').

  • The shape_dist_traveled column contains the cumulative traveled distance at each stop. What is the total traveled distance, i.e., the last value in the shape_dist_traveled column? (answer: 12767)

  • Which pair of stations is closest together (i.e., has the minimal consecutive difference in shape_dist_traveled)? Hint: use the .diff method to get a Series of consecutive differences, then use the .idxmin method to find the index of the minimum in that series.

  • Which pair of stations is farthest apart (i.e., has the maximal consecutive difference in shape_dist_traveled)? Hint: use the .idxmax method to find the index of the maximum in a series.
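The .diff, .idxmin, and .idxmax hints can be illustrated on a made-up cumulative-distance series (the numbers below are hypothetical):

```python
import pandas as pd

# Hypothetical cumulative distances at four consecutive stops
dist = pd.Series([0, 500, 600, 1400])

d = dist.diff()  # consecutive differences: NaN, 500.0, 100.0, 800.0
d.idxmin()       # index of the smallest gap (100) -> 2
d.idxmax()       # index of the largest gap (800) -> 3
```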

Exercise 06-g

  • In this exercise, you are going to find the names of the five longest bus routes (in terms of distance traveled) in Israel. To do that, go through the following steps.

  • Read the stop_times.txt table and select just the 'trip_id' and 'shape_dist_traveled' columns.

  • Calculate the maximum (i.e., total) distance traveled ('shape_dist_traveled') for each public transit trip ('trip_id') (Fig. 32).

  • Read the trips.txt table and select just the 'trip_id', and 'route_id'.

  • Join the resulting trips table with the stop_times table from the previous step (using the common "trip_id" column), to get the 'route_id' value of each trip.

  • Retain just one (the first) trip per 'route_id' (Fig. 33).

  • Read the routes.txt table and subset the 'route_id', 'route_short_name', and 'route_long_name' columns.

  • Join the resulting table with the trips table (using the common 'route_id' column), to get the 'route_short_name' and 'route_long_name' values of each route (Fig. 34).

  • Sort the table according to 'shape_dist_traveled' (from largest to smallest), and print the first five rows (Fig. 35). Those are the longest-distance public transport routes in the GTFS dataset.

  • Convert the 'route_long_name' values of the first five rows from the previous step, collect them into a list, and print the list, to see the full names of the routes. You can see that all five routes are from Haifa to Eilat or the other way around.

trip_id shape_dist_traveled
0 10021427_240421 49074
1 10021428_240421 49074
2 10021429_240421 49074
3 10021430_240421 49074
4 10033656_180421 2874
... ... ...
287659 9957863_180421 33468
287660 9967071_180421 22341
287661 9967462_180421 39061
287662 9967467_180421 39061
287663 9967472_180421 39061

287664 rows × 2 columns

Fig. 32 Solution of exercise-06-g: Total distance traveled in each trip, based on stop_times.txt

route_id trip_id shape_dist_traveled
0 1 28876224_180421 7072
1 2 25448344_180421 6870
2 3 56334623_180421 9855
3 5 56335137_180421 9751
4 7 37173391_180421 7794
... ... ... ...
7245 29363 56622661_230421 33173
7246 29364 56460152_180421 4300
7247 29365 56458677_180421 16100
7248 29366 56458667_180421 54800
7249 29368 56458657_180421 81600

7250 rows × 3 columns

Fig. 33 Solution of exercise-06-g: Joined with trips.txt, selecting just the first trip per route_id

route_id route_short_name route_long_name trip_id shape_dist_traveled
0 1 1 ת. רכבת יבנה מערב-יבנה<->ת. רכב... 28876224_180421 7072
1 2 1 ת. רכבת יבנה מזרח-יבנה<->ת. רכב... 25448344_180421 6870
2 3 2 ת. רכבת יבנה מערב-יבנה<->ת. רכב... 56334623_180421 9855
3 5 2 ת. רכבת יבנה מזרח-יבנה<->ת. רכב... 56335137_180421 9751
4 7 3 ת. רכבת יבנה מערב-יבנה<->ת. רכב... 37173391_180421 7794
... ... ... ... ... ...
7245 29363 450 מסוף האוניברסיטה-תל אביב יפו<->... 56622661_230421 33173
7246 29364 125 תל אביב ההגנה-תל אביב יפו<->הרצ... 56460152_180421 4300
7247 29365 69 רחובות-רחובות<->תל אביב מרכז-תל... 56458677_180421 16100
7248 29366 60 נתניה-נתניה<->אשקלון-אשקלון 56458667_180421 54800
7249 29368 21 נהריה-נהריה<->תל אביב ההגנה-תל ... 56458657_180421 81600

7250 rows × 5 columns

Fig. 34 Solution of exercise-06-g: Joined with routes.txt

route_id route_short_name route_long_name trip_id shape_dist_traveled
5856 19964 991 ת. מרכזית חוף הכרמל/רציפים בינע... 30526419_230421 446174
1668 7295 990 ת. מרכזית אילת/רציפים-אילת<->ת.... 56505849_180421 445508
6199 21726 991 ת. מרכזית אילת/רציפים-אילת<->ת.... 56505908_180421 445472
1669 7297 990 ת. מרכזית חוף הכרמל/רציפים בינע... 30526431_230421 443968
1671 7307 993 ת. מרכזית המפרץ/רציפים בינעירונ... 56505854_180421 432200

Fig. 35 Solution of exercise-06-g: Sorted by shape_dist_traveled, from largest to smallest


1

For a visual explanation of join types, see https://dataschool.com/how-to-teach-people-sql/sql-join-types-explained-visually/.