Analysis of Data from Location-Based Social Networks (LBSN)

Michael Dorman

Geography and Environmental Development, BGU

2022-03-29


Aim

Requirements

Several Python packages need to be installed and loaded to run the code examples:
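(A sketch; the exact import list is reconstructed from the packages used throughout the tutorial.)

import glob, os
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from shapely.geometry import shape, box, LineString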

We also need to set the working directory to the folder with the data:
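(The folder name is a placeholder.)

os.chdir('data')  # path to the folder with the .json files and layers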

To reproduce the results, download the data and the notebook.

Contents

Part I: Introduction

Location-Based Social Networks (LBSN)

Lazer et al. 2009, Science
Density of Twitter (blue) and Flickr (red) data in the US (https://www.flickr.com/photos/walkingsf/5912385701)

Types of LBSN data

Two points of view on LBSN data: (A) network structure and (B) geographical structure (Onnela et al. 2011)

Twitter APIs

Our examples of working with LBSN data are going to focus on Twitter data. As in most online social networks, Twitter offers numerous ways to access and interact with the data:

  1. The "ordinary" user interface, such as the Twitter web and mobile apps, where users view and create content
  2. The Twitter API, where developers and researchers can programmatically access content
Twitter home page
Tweet data

Part II: Practical Examples

Example 1: Setting up Twitter account

To access the Twitter API we need to obtain API keys:

  1. Make sure you have a Twitter account
  2. Navigate to https://developer.twitter.com/en/apps and create a new app by providing a Name, Description, Website and Callback URL
  3. Check Yes to agree and then click "Create your Twitter application"
  4. Once you've successfully created an app, click the tab labeled Keys and Access Tokens to retrieve your keys
Creating a Twitter application
Accessing application settings
Obtaining Twitter API keys

Once we have the API keys, Twitter APIs can be accessed using various software, such as Python package twarc. For example, the following script collects all available geo-referenced tweets in the area of Boston for an hour, from the Streaming API which provides tweets in real-time:
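(A sketch based on the twarc v1 Python library; the credentials, output file name, and Boston bounding-box coordinates are placeholders.)

import json, time
from twarc import Twarc

# Placeholder credentials; replace with your own API keys
t = Twarc('CONSUMER_KEY', 'CONSUMER_SECRET', 'ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')

bbox = '-71.19,42.23,-70.99,42.40'       # approximate Boston bounding box
end = (time.time() // 3600 + 1) * 3600   # Unix time at the end of the current hour
with open('tweets.json', 'w') as f:
    for tweet in t.filter(locations=bbox):
        f.write(json.dumps(tweet) + '\n')
        if time.time() >= end:           # stop at the end of the hour
            break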

Twitter data collection

Example 2: Collecting tweets

The twarc script records tweets into a .json file until the end of the current hour. If we run the script repeatedly at the beginning of each hour, we can collect tweets for longer time periods, organized into numerous .json files (one file per hour).

To process the data, we first need to list the required file paths. For example, the following expression gets the file paths of all hourly files from 2022-03-11:
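(The hourly file-naming scheme is an assumption.)

files = sorted(glob.glob('tweets_2022-03-11*.json'))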

Next, we can read the files in a loop and combine them into one long table (DataFrame), using the pandas package:
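# Read each hourly file (assuming one JSON-encoded tweet per line) and combine
dat = pd.concat([pd.read_json(f, lines=True) for f in files], ignore_index=True)
dat = dat.sort_values('created_at')  # sort chronologically (needed later for user paths)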

The resulting table contains a lot of variables. We will keep just the most useful ones:
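(The exact selection is an assumption, based on the variables used below.)

dat = dat[['created_at', 'user', 'text', 'lang', 'coordinates']]
dat = dat[dat['coordinates'].notna()]  # keep only tweets with an exact point location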

Example 3: Analyzing spatial patterns

The created_at variable is a date-time (datetime64) object, specifying date and time in the UTC time zone:
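dat['created_at'].dtype  # datetime64[ns, UTC]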

It is more convenient to work with local times rather than UTC. Boston is in the US/Eastern time zone:
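dat['created_at'] = dat['created_at'].dt.tz_convert('US/Eastern')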

Now we can find out the time frame of the collected tweets:
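dat['created_at'].min(), dat['created_at'].max()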

We will remove tweets from the first (incomplete) hour:
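# Keep tweets from the first full hour onwards (sketch)
dat = dat[dat['created_at'] >= dat['created_at'].min().ceil('H')]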

Let us now look into the coordinates column:
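dat['coordinates']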

Each element in the column is a dict, following the GeoJSON format:
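dat['coordinates'].iloc[0]  # e.g. {'type': 'Point', 'coordinates': [lon, lat]}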

We can translate it to a "Point" geometry object using function shape from the shapely.geometry package:
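shape(dat['coordinates'].iloc[0])  # a shapely 'Point' geometry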

Using this principle, we can convert the entire column into a geometry column, using package geopandas:
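geom = gpd.GeoSeries(dat['coordinates'].apply(shape), crs='EPSG:4326')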

and combine it with dat to create a layer named pnt with both the geometries and tweet attributes:
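pnt = gpd.GeoDataFrame(dat.drop(columns='coordinates'), geometry=geom)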

The resulting layer pnt can be plotted to see the spatial pattern of tweet locations:
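pnt.plot(markersize=1)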

The bounding box which was used to collect the tweets can also be converted to a geometry:
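(The coordinates are assumed to match the bounding box used for collection.)

bbox = box(-71.19, 42.23, -70.99, 42.40)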

Here is a plot of the bounding box and the tweet locations. We can see that the Twitter API returned many tweets that are outside of the requested bounding box:
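ax = pnt.plot(markersize=1)
gpd.GeoSeries([bbox], crs=pnt.crs).plot(ax=ax, color='none', edgecolor='red')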

We will keep just the tweets within the bounding box, using the .intersects method:
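pnt = pnt[pnt.intersects(bbox)]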

We are left with only those tweets that fall inside the bounding box (the Boston area):
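pnt.plot(markersize=1)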

We will find out which county each tweet falls in, using a county borders layer. First, we read the layer from a Shapefile:
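(The Shapefile name is an assumption.)

county = gpd.read_file('county.shp')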

The following expressions visualize the three spatial layers we now have:
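# Sketch, assuming all three layers share the WGS84 CRS
ax = county.plot(color='none', edgecolor='grey')
gpd.GeoSeries([bbox], crs=county.crs).plot(ax=ax, color='none', edgecolor='red')
pnt.plot(ax=ax, markersize=1)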

The simplest form of temporal aggregation is omitting some of the time components, then counting occurrences. For example, calculating a date+hour variable:
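pnt['hour'] = pnt['created_at'].dt.floor('H')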

Then counting occurrences:
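pnt['hour'].value_counts().sort_index()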

For spatio-temporal aggregation we count the occurrences in each unique combination of time and location. For example, we can do a spatial join between the tweets layer and the counties layer:
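pnt_county = gpd.sjoin(pnt, county)  # assuming both layers share the same CRS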

The result is a table with date+hour / county name per tweet:
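pnt_county[['hour', 'NAME']]  # 'NAME' (the county name column) is an assumption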

Finally, we count occurrences of each date+hour / county value:
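counts = pnt_county.groupby(['NAME', 'hour']).size().unstack(fill_value=0)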

The table can be displayed as a heatmap using the sns.heatmap function:
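sns.heatmap(counts)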

Chronologically ordered observations per user represent their path in space. To create the line layer of paths, we first need to extract the user name from the user column, which is a dict:
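pnt['screen_name'] = pnt['user'].apply(lambda u: u['screen_name'])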

Then, we need to aggregate the point layer by user, collecting all points into a list. Importantly, the table needs to be sorted in chronological order (which we did earlier):
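paths = pnt.groupby('screen_name')['geometry'].apply(list)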

Keep in mind that most content is created by a few dominant users. For example, here we calculate the number of points n (i.e., tweets) per unique user:
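n = paths.apply(len)
n.sort_values(ascending=False)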

At least two points are required to form a line, so we filter out users who have just one tweet:
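paths = paths[n > 1]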

Then, we "connect" the points into "LineString" geometries:
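paths = gpd.GeoSeries(paths.apply(LineString), crs=pnt.crs)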

Here is a plot of the resulting paths layer:
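paths.plot()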

LBSN research applications (1)

Inferred boundaries from collective Twitter user displacements in London (Yin et al. 2017)

LBSN research applications (2)

Unevenly segregated activity spaces of West End and East End residents in Louisville, Kentucky (Shelton et al. 2015)

LBSN research applications (3)

Flows derived from 3135 individuals who posted at least one tweet in Cilento (Chua et al. 2016)

Example 4: Collecting network data

Running a Python script to construct a Twitter social network

python get_followers.py -s MichaelDorman84 -d 2
Russell (2013) Mining the Social Web
Running the get_followers.py Python script for reconstructing social network of given depth around a given user

The resulting folder of user metadata is processed into a CSV file with another Python script:

python twitter_network.py

Finally, the processed CSV file can be read into Python with pd.read_csv. The table consists of an edge list:
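(The file name and the 'from'/'to' column names are assumptions.)

edges = pd.read_csv('followers.csv')
edges.head()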

Example 5: Network analysis

We can remove users for whom we have no friend data. That way, the social network ties between the remaining users are fully described:
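(A sketch, assuming that the users whose friend lists were crawled appear in the 'from' column.)

crawled = edges['from'].unique()
edges = edges[edges['to'].isin(crawled)]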

The Python script also produces user details, including self-reported location:
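(The file and column names are assumptions.)

users = pd.read_csv('users.csv')
users[['screen_name', 'location']]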

We can use a geocoding service to convert location text to coordinates. The following expression uses the free Nominatim geocoding service based on OpenStreetMap data, accessed through geopandas:
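locations = gpd.tools.geocode(
    users['location'],
    provider='nominatim',
    user_agent='lbsn_tutorial'  # Nominatim requires a custom user agent
)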

To get the country name for each user location, we can use the world borders polygonal layer available as a built-in dataset in geopandas:
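world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))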

Here is a map of the world borders and the geocoded user locations:
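ax = world.plot(color='none', edgecolor='grey')
locations.plot(ax=ax, color='red', markersize=5)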

We can add the country each user location falls in with a spatial join. First, we detect the country name where each geocoded address "falls in":
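locations = gpd.sjoin(locations, world[['name', 'geometry']], how='left')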

Then, we use an ordinary join to attach those country names back to the Twitter users list. That way, in the locations table, we now have user+country instead of user+address:
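locations['screen_name'] = users['screen_name'].values
locations = locations[['screen_name', 'name']].rename(columns={'name': 'country'})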

Now we can replace all user names in the "edge list" table with the corresponding country names:
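name2country = locations.set_index('screen_name')['country']
edges['from'] = edges['from'].map(name2country)
edges['to'] = edges['to'].map(name2country)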

Then, remove missing values:
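edges = edges.dropna()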

and count:
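edges = edges.groupby(['from', 'to']).size().reset_index(name='count')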

The result is a country-to-country edge list.

To benefit from methods for visualization and analysis of networks we need to convert the edge list table to a network object. We use function from_pandas_edgelist from package networkx:
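G = nx.from_pandas_edgelist(edges, source='from', target='to', edge_attr='count')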

A network object basically contains two components:

  1. Nodes (vertices)
  2. Edges (links between the nodes)

In our case the nodes are countries:
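G.nodes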

and the edges are follower ties between users of those countries, weighted according to the number of ties ("count"):
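G.edges(data=True)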

The network can be visualized using function nx.draw. We are using one of the built-in algorithms (nx.kamada_kawai_layout) to calculate an optimal visual arrangement of the nodes:
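nx.draw(G, nx.kamada_kawai_layout(G), with_labels=True)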

To reflect the edge weights, we can use the width parameter of a network plot:
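w = [d['count'] for _, _, d in G.edges(data=True)]
nx.draw(G, nx.kamada_kawai_layout(G), with_labels=True, width=w)  # scale w as needed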

Let us calculate two basic graph properties:
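(Which two properties were shown is not specified here; density and transitivity are one plausible pair.)

nx.density(G), nx.transitivity(G)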

Since G is a spatial network (where vertices represent locations), it may be more natural to display it in a spatial layout. To do that, we first have to attach x/y coordinates to each vertex, using the country centroids. Here is how we can get the centroid coordinates of one specific country:
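ctr = world.set_index('name').geometry.centroid  # country centroids
(ctr['Israel'].x, ctr['Israel'].y)               # an arbitrary example country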

and here is how we get all coordinates at once, into a dict named pos:
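pos = {name: (pt.x, pt.y) for name, pt in ctr.items() if name in G.nodes}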

which can be passed to nx.draw:
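nx.draw(G, pos, with_labels=True, width=w)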

Node degree is the number of edges adjacent to the node, or, in a weighted network, the sum of the edge weights for that node:
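dict(G.degree(weight='count'))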

Community detection algorithms aim at identifying sub-groups in a network. A sub-group is a set of nodes that has a relatively large number of internal ties, and also relatively few ties from the group to other parts of the network.
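For example, a minimal sketch using networkx's greedy modularity maximization (the specific algorithm is an assumption):

from networkx.algorithms import community
communities = community.greedy_modularity_communities(G, weight='count')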

LBSN research applications (4)

Histogram of physical distances between connected users (Takhteyev et al. 2012)

LBSN research applications (5)

Examples of five multilingual Twitter user network types, where English is integrated with (1) French, (2) Japanese, (3) Portuguese, (4) Greek and (5) Arabic (Eleta & Golbeck 2014)

Example 6: Sentiment analysis

Before calculating the polarity of the tweets sample, we need to keep only those written in English:
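pnt = pnt[pnt['lang'] == 'en']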

Then we can calculate polarity and place it in a new polarity column:
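(A sketch using TextBlob; the choice of sentiment analysis tool is an assumption.)

from textblob import TextBlob
pnt['polarity'] = pnt['text'].apply(lambda s: TextBlob(s).sentiment.polarity)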

Here are the five most negative tweets:
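pnt.sort_values('polarity')[['text', 'polarity']].head(5)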

and the five most positive tweets:
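pnt.sort_values('polarity')[['text', 'polarity']].tail(5)  # most positive last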

We can also examine the spatial pattern of tweet polarity using a map:
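ax = county.plot(color='none', edgecolor='grey')
pnt.plot(ax=ax, column='polarity', cmap='RdYlBu', legend=True, markersize=5)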

LBSN research applications (6)

Spearman correlations for 432 demographic attributes with happiness (Mitchell et al. 2013)

LBSN research applications (7)

Sentiment indices over time in (a) the direct affected region (DAR) and (b) the City of Boston (Lin & Margolin 2014)

Summary: Software Tools

Thank you for listening!