Bike Sharing Usage Patterns

Overview

Business Case

A “smart city” initiative has data from the Washington, DC bike sharing system and wants to use it to better understand usage patterns across the city. Any discovered patterns can be used to improve the bike sharing system for customers.

As a first step, we will try to identify clusters of “similar” bike stations. Station similarity will be based on the types of users beginning trips from each station.

Supporting data

The use case is based on the following data sources:

  • Trips. Capital Bikeshare provides data on each bike trip, including an index of the available data. We will use a Download recipe in the walkthrough to create a dataset from these files.
  • Bike Stations. Capital Bikeshare provides an XML file with the list of bike stations and associated information about each station. We will use a Download recipe in the walkthrough to create a dataset from this file.
  • Demographics. We can use US census data to enrich the bike stations dataset with demographic information at the “block group” geographic level.

Note

The following downloadable archive contains the input data source for the Demographics dataset.

Workflow overview

The final Dataiku DSS pipeline will look like this:

../../_images/flow3.png

The Flow has the following high-level steps:

  1. Collect the data to form the input datasets
  2. Clean the datasets
  3. Join the datasets based on census blocks and station ids
  4. Create and deploy a clustering model
  5. Update the model based upon new data

Technical Requirements

  • The Get US census block plugin is required to enrich the bike station data with its US census block, so that it can be joined with the per-block demographic information.

Detailed Walkthrough

Create a new Dataiku DSS project and name it Bike Sharing.

Prepare the Bike Stations Dataset

In the Flow, select + Recipe > Visual > Download. Name the output folder bikeStations and create the recipe. Add a new source and specify the following URL: http://capitalbikeshare.com/data/stations/bikeStations.xml. Run the recipe to download the files.

Note

If the URL does not work try this one instead: http://feeds.capitalbikeshare.com/stations/stations.xml.

Create a new Files in Folder dataset from the bikeStations folder. This can be done by clicking on Actions > Create a dataset (found in the upper-right corner). Click Test to let Dataiku detect the XML format and parse the data accordingly. Name the dataset bikeStations and create it.
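
If you are curious what the Download recipe and XML parsing amount to, here is a minimal standalone Python sketch. The tag names (name, lat, long, nbBikes) are assumptions about the feed’s schema; the Files in Folder dataset does the equivalent parsing for you:

    import requests
    import xml.etree.ElementTree as ET
    import pandas as pd

    # Fetch the stations feed (same URL as the Download recipe above)
    resp = requests.get("http://feeds.capitalbikeshare.com/stations/stations.xml", timeout=30)
    resp.raise_for_status()

    # Turn each <station> element into one row of a DataFrame
    root = ET.fromstring(resp.content)
    rows = [{child.tag: child.text for child in station} for station in root.iter("station")]
    stations = pd.DataFrame(rows)
    print(stations[["name", "lat", "long", "nbBikes"]].head())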

In the new dataset, create a Visual Analysis with the following steps in its script:

  • In order to map the bike stations, we need to create a GeoPoint from the latitude and longitude of each station. Add a Create GeoPoint processor with the lat and long columns as the inputs and geopoint as the output column.
  • There are some columns we won’t use. Remove all columns except for nbBikes, long, name, lat, and geopoint.
  • Rename the column name to station_name to avoid column naming conflicts further downstream. (A pandas sketch of these three steps follows this list.)
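
For reference, the same three preparation steps look roughly like this in pandas. The “POINT(longitude latitude)” text format matches what DSS uses for geopoints; this is a sketch, not what the Prepare recipe runs verbatim:

    # stations: DataFrame parsed from the XML feed (see the earlier sketch);
    # lat and long are still strings here, so simple concatenation works
    stations["geopoint"] = "POINT(" + stations["long"] + " " + stations["lat"] + ")"

    # Keep only the columns used downstream and rename `name` to avoid clashes
    stations = stations[["nbBikes", "long", "name", "lat", "geopoint"]]
    stations = stations.rename(columns={"name": "station_name"})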

Switch to the Charts tab and create a new Scatter Map with geopoint as the geopoint column and nbBikes as the column to color the bubbles. To make the stations with the most bikes more visible, change the palette color to Green-red.

Deploy the Visual Analysis as a Prepare recipe and run it. The bike stations are now ready!

../../_images/r24JJseJ-charts.png

Prepare the Demographics Dataset

Create a new Uploaded Files dataset from the Demographics file. It contains population counts at the block group level from the 2013 US Census ACS 5-year estimates. The fully qualified block group identifier is contained in the geoid column; we can split this column to obtain the block group.

Note

We could alternatively build up the block group from the state, county, tract, and blkgrp columns. As a stretch goal, see if you can figure out how to do that. Then consider which method you prefer. There are many means to an end in data science, and you will need to assess what works best in each situation.

Create a new Prepare recipe from this dataset with the following steps in its Script:

  • Add a Split processor with geoid as the column to split and with US as the delimiter. Choose to Truncate and keep only one of the output columns, starting from the end. This will keep everything after the US in geoid.
  • Add a Rename processor to rename geoid_0 to block_group, name to block_name, and B00001_001E to nbPeople.
  • Ensure the block_group column type is set to “string” and NOT “bigint”.
  • Add a Round values processor to round the values in nbPeople to integers.
  • Add a Remove processor to delete the geoid, state, county, tract, and blkgrp columns. We won’t need them anymore.
  • Finally, run the Prepare recipe. The demographic data is now ready!
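
The same split-and-clean logic, as a rough pandas sketch (column names per the steps above; adjust the population column name to match your file):

    # demog: DataFrame loaded from the uploaded Demographics file
    # Keep only what follows "US" in geoid, mirroring "truncate, keep 1 from the end"
    demog["block_group"] = demog["geoid"].str.split("US").str[-1].astype(str)

    demog = demog.rename(columns={"name": "block_name", "B00001_001E": "nbPeople"})
    demog["nbPeople"] = demog["nbPeople"].round().astype(int)
    demog = demog.drop(columns=["geoid", "state", "county", "tract", "blkgrp"])
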
../../_images/compute_block_group_demog_prepared.png

Prepare the Trips Dataset

In order to enrich the station information with trip data, we will need to:

  1. Download the raw trips data from the Capital Bikeshare website
  2. Prepare the trips data to extract the day of the week for each trip
  3. Pivot by customer type and day of the week to aggregate the individual trips. This will let us compute the number of trips and the average trip duration for each combination of customer type and day of the week.

Download the Raw Trips Data

In the Flow, select + Recipe > Visual > Download. Name the output folder bike_data and create the recipe. Add a new source with the following URL: https://s3.amazonaws.com/capitalbikeshare-data/2016-capitalbikeshare-tripdata.zip. Run the recipe to download the files.

Create a new Files in Folder dataset from the bike_data folder. Click Test to let Dataiku detect the CSV format and parse the data accordingly. Name the dataset bike_data and create it.
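
As an aside, here is what the download-and-parse work amounts to in plain Python; a hedged sketch, assuming the archive contains one or more CSV files:

    import io
    import zipfile

    import pandas as pd
    import requests

    url = "https://s3.amazonaws.com/capitalbikeshare-data/2016-capitalbikeshare-tripdata.zip"
    archive = zipfile.ZipFile(io.BytesIO(requests.get(url, timeout=60).content))

    # Concatenate every CSV in the archive into a single DataFrame
    trips = pd.concat(
        (pd.read_csv(archive.open(name)) for name in archive.namelist() if name.endswith(".csv")),
        ignore_index=True,
    )
    print(trips.shape)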

Prepare the Trips Data

Create a new Visual Analysis on the bike_data dataset, with the following steps in its script:

  • Parse the Start Date column into a proper date column. Dataiku should detect the correct date format as yyyy-MM-dd HH:mm:ss. If it does not, go ahead and select the yyyy-MM-dd HH:mm:ss date format manually in the Smart date editor. Make sure to leave the output column blank.
  • Add an Extract date components processor and extract the day of week as a new column dow from Start Date. (Both steps are sketched in pandas below.)
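
In pandas, these two steps would look roughly like this. Note that pandas numbers weekdays Monday=0 through Sunday=6, which may differ from the convention DSS uses:

    import pandas as pd

    # trips: DataFrame of the downloaded CSVs (see the earlier sketch);
    # the exact column names may vary by file vintage
    trips["Start Date"] = pd.to_datetime(trips["Start Date"], format="%Y-%m-%d %H:%M:%S")
    trips["dow"] = trips["Start Date"].dt.dayofweek  # Monday=0 ... Sunday=6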

On the Charts tab, create a new histogram with:

  • Duration on the Y axis
  • dow on the X axis. Be sure to change the binning for dow (via the dropdown menu on the X axis) to use raw values.
  • Member type defining subgroups of bars.

Hint

Feel free to adjust the sample size to include enough data to populate the chart for each day of the week.

The chart shows us a few interesting things.

  1. “Casual” customers tend to take trips that are significantly longer than those of “Member” users: approximately 40 minutes versus 12 minutes.
  2. “Member” customers do not show much day-to-day variation in the duration of trips, while “Casual” customers make their longest trips on Friday, Saturday, and Sunday, and shortest trips on Tuesday and Wednesday: a difference of about 10 minutes.
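
To sanity-check these figures outside the chart, a quick aggregation works; a sketch, assuming Duration is recorded in seconds (check the column header and units in your download):

    # Average trip length in minutes by member type and day of week
    avg_minutes = (
        trips.groupby(["Member type", "dow"])["Duration"]
             .mean()
             .div(60)
             .round(1)
             .unstack("dow")
    )
    print(avg_minutes)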

Deploy the Visual Analysis as a Prepare recipe and run the recipe.

../../_images/Irleg03q-charts.png

Create pivoted features

In this step, we’ll create new features to support our analysis. More specifically, we want to compute the number of trips and average trip duration by station, member type, and weekday. Additionally, we want the final dataset to have a single row for each station, and separate columns for each combination of member type and weekday.

To do this, we’ll create a Pivot recipe from the bike_data_prepared dataset. Choose Member type as the column to pivot by and rename the output dataset bike_data_pivoted.

In the Pivot step of the recipe:

  • Add dow as another pivoting column (under Create columns with)
  • Choose to pivot all values
  • Select Start station as a row identifier
  • Select Duration as a field to populate content with and choose Avg as the aggregation

Run the recipe. The resulting dataset should have 29 columns: one for Start station, plus 28 for the combinations requested in the Pivot recipe (7 days of the week × 2 member types × 2 statistics, count and average).
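
Under the hood, this is equivalent to a pandas pivot table. A sketch (DSS generates different column names, but the shape is the same):

    import pandas as pd

    # One row per start station; one column per (statistic, member type, day) combination
    pivoted = pd.pivot_table(
        trips,
        index="Start station",
        columns=["Member type", "dow"],
        values="Duration",
        aggfunc=["count", "mean"],
    )

    # Flatten the MultiIndex columns into names like "count_Member_1"
    pivoted.columns = ["_".join(str(part) for part in col) for col in pivoted.columns]
    pivoted = pivoted.reset_index()
    print(pivoted.shape)  # expect (number of stations, 29)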

../../_images/compute_bike_data_pivoted.png

Enrich the Station Dataset with Demographics and Trip Data

We now have three sources of data: bike station-level data about the stations, bike station-level data aggregated from individual trips, and block group-level demographic data. We want to join all of this information into a single dataset.

In order to enrich the bike station data with the demographic data, we need to map the geographic coordinates (lat, lon) of each bike station to the associated block group id from the US census.

We can do this mapping with a Plugin recipe. From the Flow, select + Recipe > Get US census block > From Dataset - get US census block_id from lat lon. Choose bikeStations_prepared as the input dataset and bikeStations_prepared_blocks as the output dataset. In the recipe, select lat and long as the latitude and longitude columns. Run the recipe. The resulting dataset has the block group, and we can use it as the link between the bike station dataset and the demographics dataset.
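
If you are curious how such a lookup can work, here is an illustrative sketch using the public FCC Block API; this is an assumption for illustration, not necessarily the method the plugin uses:

    import requests

    def block_group_for(lat, lon):
        """Map a point to its census block group.

        The block group id is the first 12 digits of the 15-digit block FIPS code.
        """
        resp = requests.get(
            "https://geo.fcc.gov/api/census/block/find",
            params={"latitude": lat, "longitude": lon, "format": "json"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["Block"]["FIPS"][:12]

    # Example: a point in downtown Washington, DC
    print(block_group_for(38.8977, -77.0365))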

From the bikeStations_prepared dataset, create a Join recipe and select bikeStations_prepared_blocks as the dataset to join with. In the recipe settings:

  • In the Join step:
    • Dataiku has created a default left join using lat as the join key. This is a good start, but we need to add a second join condition where long equals lon.
    • Next, add block_group_demog_prepared as a new input dataset, joined to bikeStations_prepared_blocks as the existing input dataset. Dataiku should automatically find block_group as the join key. If it does not, please make sure that block_group is set as the join key for both datasets.
    • Finally, in order to enrich the bike station data with the aggregated trip data, add bike_data_pivoted as a new input dataset, joined to bikeStations_prepared as the existing input dataset. Set the type of join to an Inner Join, and the join keys to station_name and Start station.
  • In the Selected columns step, we can drop some columns we won’t need.
    • From bikeStations_prepared, drop long and lat
    • From bikeStations_prepared_blocks, keep only county_name and state_name
    • From block_group_demog_prepared, keep block_name and nbPeople
    • From bike_data_pivoted, drop Start station

Run the recipe. Our station dataset is now enriched with demographic and trip information!
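
For reference, a rough pandas equivalent of this three-way join, assuming the prepared datasets are loaded as DataFrames and that the plugin output exposes a block_group column (the real column names may differ slightly):

    # Pre-select the columns kept in the Selected columns step
    blocks = stations_blocks.rename(columns={"lon": "long"})[
        ["lat", "long", "block_group", "county_name", "state_name"]
    ]
    demog = demog_prepared[["block_group", "block_name", "nbPeople"]]

    enriched = (
        stations_prepared
        .merge(blocks, on=["lat", "long"], how="left")   # join on coordinates
        .merge(demog, on="block_group", how="left")      # join on block group
        .merge(bike_data_pivoted, how="inner",           # join on station name
               left_on="station_name", right_on="Start station")
        .drop(columns=["lat", "long", "Start station"])
    )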

../../_images/compute_bikeStations_prepared_joined.png

Identify Similar Bike Stations

Now we are ready to identify “similar” stations with a clustering model. Create a clustering ML task using a K-Means model from the Lab of the bikeStations_prepared_joined dataset. Before training it, in the Features handling section of the ML task Design:

  • Set the Roles of nbBikes, county_name, state_name, and nbPeople to Use for display only
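
For intuition, the computation behind this task looks roughly like the following scikit-learn sketch; a minimal version under stated assumptions, not what DSS runs verbatim:

    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # enriched: the joined dataset as a DataFrame. The display-only columns are
    # excluded from training, mirroring the Roles setting above.
    display_only = ["nbBikes", "county_name", "state_name", "nbPeople"]
    features = enriched.drop(columns=display_only).select_dtypes("number").fillna(0)

    X = StandardScaler().fit_transform(features)   # rescale features before K-Means
    kmeans = KMeans(n_clusters=5, random_state=0)  # 5 clusters, matching the walkthrough
    enriched["cluster"] = kmeans.fit_predict(X)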

Back in DSS, train the model. Within the resulting model, navigate to the Heatmap to gain a better understanding of the clustering results. The heatmap tells us the following:

  • Clusters 0, 3, and 4 are largely composed of stations in DC, in decreasing order of strength of association with DC. Digging deeper:
    • Cluster 3 is most strongly associated with short trips by members on weekdays. We can rename this cluster DC Commuters
    • Cluster 4 is most strongly associated with longer trips by casual users. We can rename this cluster DC Tourists
    • Cluster 0 is most strongly associated with shorter trips by members on the weekend. We can rename this cluster DC Weekenders
  • Cluster 1 is largely composed of stations in Virginia and Maryland, but more strongly associated with Virginia. We can rename this cluster VA. VA ridership has low trip counts and short durations.
  • Cluster 2 is largely composed of stations in Maryland. We can rename this cluster MD. MD ridership has low trip counts, but long durations.

../../_images/8xwWEU4o-heatmap.png

Let’s see the clustering results on a map. To do this, deploy the model to the Flow as a retrainable model. Now create an Apply recipe and score the bikeStations_prepared_joined dataset.

On the Charts tab of the output dataset, create a Scatter Map with:

  • geopoint as the column identifying the location of points
  • cluster_labels as the column to color the points
  • nbPeople as the column to set the size of points. You may need to adjust the base radius so that the points don’t overlap too much

The placement of labeled clusters on the map gives us even more insight:

  • The VA and MD clusters have a number of points outside those states. It might be better to rename these clusters Suburban short trips and Suburban long trips, respectively.
  • The DC Tourists cluster is clustered around the Mall and other sites of interest to tourists
  • The DC Commuters cluster is spread across the downtown of DC, in blocks with large numbers of people.
  • The DC Weekenders cluster is interspersed among the DC Commuter locations

These general shapes make sense, and give us confidence in the clusters. From here, it can be useful to look at individual points that seem out of place. For example, there are three stations just north of the Constitution Gardens Pond in DC that are in the VA cluster. What makes them different from the nearby DC Commuters points? Perhaps these stations are underperforming, and should be closed, relocated, or have the number of available bikes reduced.

../../_images/bikeStations_prepared_joined_scored-visualize.png

Retrain the Model with New Data

New Capital Bikeshare data is constantly being created and uploaded to the site. We can incorporate this new data into the Flow and retrain our clustering model to account for changing usage of the Bikeshare system.

In the download_to_bike_data Download recipe, add a new source with the following URL: https://s3.amazonaws.com/capitalbikeshare-data/2017-capitalbikeshare-tripdata.zip. Run the recipe to download the files.

In the Settings tab of the bike_data dataset, click Show Advanced options and change the “Files selection” setting from All to Explicitly select files. Click List Files to see the available files, then click Add to include the new 2017 data and remove the 2016 data from the dataset.

Note

Depending on the situation, we might want to keep the 2016 data and analyze the combined data. For the purposes of this use case, we’ll retrain the model on just the 2017 data.

Save the dataset.

From the Clustering (KMEANS) on bikeStations_prepared_joined recipe, click Retrain and select Build & train required. This will perform a recursive build of the pipeline and pull the 2017 data through to the cluster model retraining.

Opening the retrained model and looking at the heatmap, it appears that the clusters have shifted slightly.

  • Rename VA to Suburban Long Trips
  • Rename MD to DC Tourists
  • Rename DC Commuters to Suburban Short Trips
  • Rename DC Tourists to DC Commuters

Rebuild the scored dataset and look at the map to see what has changed. Just eyeballing it, it’s difficult to see any significant changes from 2016 to 2017. A couple of stations in the outlier cluster now appear in Maryland, far away from downtown DC; in the 2016 data, they were identified as suburban. It is now your role to identify and understand any other changes. :-)

../../_images/bikeStations_prepared_joined_scored-visualize2.png

Wrap-up

Congratulations! We created an end-to-end workflow to examine the geographic patterns of usage in a bike sharing system and retrained our clustering model on new data. Thank you for your time working through this use case.