Network Optimization

Overview

Business Case

The car rental company from an earlier use case wants to expand its network in order to satisfy new demand from a partnership with an insurance company. To get maximum value from this expansion, they have once again approached the data team we are part of. The goals they’ve defined are to:

  • Understand how the current network fits with the observed and anticipated demand
  • Optimize the vehicle rotation schedule by predicting the demand at each agency location
  • Evaluate partnerships and/or acquisition locations to expand the network efficiently

Supporting Data

The use case is based on three input data sources:

  • Demand. This is a record of all road accidents handled by the insurance company, spanning four years. The four yearly files total approximately 250k observations. This dataset is a modified version of a French open dataset.
  • Network. This is a simulated dataset that records information on the current locations of the 350 rental agencies.
  • Partners. This is a simulated dataset that contains information on potential partners; that is, new locations to expand the network. The data must be parsed in order to extract the relevant information.

Note

The following downloadable archives contain the input data sources.

Workflow Overview

The final Dataiku DSS pipeline will look like this:

../../_images/flow2.png

The Flow has the following high-level steps:

  1. Upload the datasets
  2. Clean the datasets
  3. Join the datasets based on geographic proximity
  4. Aggregate the data by geography and station

Additionally, we will create visualizations using Dataiku Charts and a Web App, to be shared on a dashboard.

Technical Requirements

Data preparation

The end goal of our data preparation is to have a single dataset, containing all the information from the input datasets, that is ready for further analysis.

To get started, within Dataiku, create a new project and name it Network Optimization.

Demand dataset

Create a new dataset from the yearly demand files (i.e. accidents.zip). Since all the files have the same structure, we can drag and drop them in a single upload and Dataiku will stack them automatically.

The dataset needs cleaning in order to be suitable for analysis. The main preparation steps we will undertake are:

  • Cleaning and parsing the date for each observation
  • Cleaning longitudes and latitudes and creating geopoints
  • Enriching geopoints with administrative information

From the demand dataset, create a Prepare recipe and store the results in an output dataset.

Cleaning Dates and Times

To get a recognizable time object of the form yyyy-MM-dd-HHmm, we will need to clean and concatenate the year, month, day, and hrmn columns by adding the following steps to the Prepare script (a pandas sketch of the equivalent logic follows the list).

  • The year column has 2-digit years, which need to become 4-digit years. Add a Formula processor with output column year and expression "20" + year.
  • There are inconsistencies in the number of digits in the hrmn column. Add a Formula processor with output column hrmn and an expression such that 12:30 am is written as “0030”, 6:45 am as “0645”, 10:00 pm as “2200”, and so on. In other words, every value in the column should contain exactly 4 digits:
if(length(hrmn) == 3,"0"+hrmn,
if(length(hrmn) == 2,"00"+hrmn,
if(length(hrmn) == 1,"000"+hrmn,hrmn)))
  • Add a Concatenate processor to merge the year, month, day, and hrmn columns into an output column datetime, using - as the delimiter.
  • Add a Parse Date processor with datetime as the column to parse and yyyy-MM-dd-HHmm as the date format.
  • Add an Extract date elements processor with datetime as the input date column and weekofyear as the column to which to extract the week of year field.
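
If you prefer to prototype the same logic in code, here is a minimal pandas sketch of the date cleaning; the sample rows are invented, and the column names match the demand dataset:
import pandas as pd

# Invented sample rows with the same columns as the demand dataset
df = pd.DataFrame({
    "year": ["13", "14"],
    "month": ["7", "12"],
    "day": ["4", "25"],
    "hrmn": ["30", "2200"],
})

# Prefix 2-digit years with "20" and left-pad hrmn to 4 digits ("30" -> "0030")
df["year"] = "20" + df["year"]
df["hrmn"] = df["hrmn"].str.zfill(4)

# Concatenate with "-" and parse as yyyy-MM-dd-HHmm (strftime codes: %Y-%m-%d-%H%M)
df["datetime"] = pd.to_datetime(
    df["year"] + "-" + df["month"].str.zfill(2) + "-" + df["day"].str.zfill(2) + "-" + df["hrmn"],
    format="%Y-%m-%d-%H%M",
)

# Extract the week of year, as the Extract date elements processor does
df["weekofyear"] = df["datetime"].dt.isocalendar().week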

Cleaning Geographic Columns

  • The latitude and longitude columns are not in the right format. The platform uses the WGS 84 coordinate system, so they should be decimal values. Add a Formula processor for each column that divides its values by 100000 (a pandas sketch of this coordinate cleanup follows the list).
  • Add a Create GeoPoint processor using latitude and longitude to define the output geopoint column.
  • Click on the geopoint column header and select Analyze. About 45% of the values are empty or null; remove these rows by selecting Mass Actions > Remove rows.
  • Add a Reverse-geocode processor with the geopoint column as input, and the enclosing city, county, region and country as outputs.
  • Delete the columns with the “_enName” suffix.
  • Keep only the rows for which country is “France”. Add a Filter rows/cells on value processor, select country as the column, and add France as a value to match.
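
A hypothetical pandas sketch of this coordinate cleanup (reverse geocoding is left to the Dataiku processor, and the POINT(lon lat) string below only stands in for a real geopoint):
import pandas as pd

# Invented sample rows: raw coordinates encode decimal degrees multiplied by 100000
df = pd.DataFrame({"latitude": ["4883600", None], "longitude": ["235200", "482100"]})

# Convert to decimal WGS 84 values by dividing by 100000
df["latitude"] = pd.to_numeric(df["latitude"], errors="coerce") / 100000
df["longitude"] = pd.to_numeric(df["longitude"], errors="coerce") / 100000

# Drop rows with missing coordinates, mirroring the Analyze > Remove rows step,
# then build a simple geopoint-like string from longitude and latitude
df = df.dropna(subset=["latitude", "longitude"])
df["geopoint"] = "POINT(" + df["longitude"].astype(str) + " " + df["latitude"].astype(str) + ")"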

Run the recipe. This dataset is cleaned and ready!

../../_images/compute_accidents_database_prepared.png

Network dataset

Create a new dataset and upload rental_agencies.csv. Here, the data is almost completely clean, and the main task is to geolocate the agencies; that is, for each row, retrieve the longitude and latitude corresponding to the address and zipcode.

To do this, create a Prepare recipe with the following steps in its script (a scripted alternative to the geocoding step is sketched after the list):

  • Add a Simplify text processor with adress as the input column.
  • Add a Simplify text processor with city as the input column.
  • Add a Concatenate columns processor to concatenate adress, zipcode, and city, using a space as a delimiter, into the output column address_full.
  • Add a Geocode processor to retrieve the latitude and longitude of each rental agency, using your MapQuest Developer or Bing maps API key. Specify geo as the prefix for the generated columns.
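
If you wanted to script the geocoding step instead, for example while waiting for a MapQuest or Bing key, a hedged sketch with the geopy package might look like the following. geopy is not part of the tutorial, and the helper function and example address are purely illustrative:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="network-optimization-demo")

def geocode_agency(adress, zipcode, city):
    # Concatenate the simplified address fields, as the Prepare recipe does
    address_full = " ".join(str(part) for part in (adress, zipcode, city))
    location = geolocator.geocode(address_full)
    if location is None:
        return None, None
    return location.latitude, location.longitude

# Hypothetical usage:
# geolat, geolon = geocode_agency("12 rue de la Paix", "75002", "Paris")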

Run the recipe. This dataset is cleaned and ready!

../../_images/compute_rental_agencies_prepared.png

Garage dataset

Create a new dataset and upload garage_locations.csv. In order for Dataiku to recognize the data, go into the preview screen and select One record per line from the Type dropdown. The created dataset should have a single column.

We now need to parse the HTML code in each cell in order to extract the relevant information (the name and the latitude/longitude of each garage). For that, we will use a combination of string transformations; the trick is to find unique string patterns on which to split the column as cleanly as possible and avoid mistakes. A regex sketch of the same extraction follows the list below.

Create a new Prepare recipe and add the following steps to the script:

  • Add a Filter rows/cells on value processor and remove rows where the column line has the value line.
  • The values in the dataset are too long to view in their entirety in the table. Right-click on the first cell and select Show complete value. Looking at the complete value of the first cell, the garage name (“Carrosserie M. SERVICES AUTOMOBILES”) sits between h1 tags, and this seems to be the case for each observation, so let’s split the first column on the h1 tag. Add a Split column processor with line as the input and h1 as the delimiter. There is still a common prefix before the garage name, so go back to that same Split and use the whole prefix h1 class="left">' + ' as the delimiter. Select Truncate in the Split processor and make sure that “Starting from” is set to End, so as to keep only the last column.
  • Split the line_0 column on ' + ' and Truncate to keep only the first column, so as to have a clean name.
  • Delete the column line_0 and rename the column line_0_0 to name.
  • Let’s now parse the latitude and longitude using the same process:
    • Split the column line on LatLng(
    • Split the column line_0 on ),icon
    • Split the column line_0_0 on ,
    • Rename the column line_0_0_0 to latitude and line_0_0_1 to longitude
    • Delete the columns line, line_0 and line_0_0
  • Add a Create geopoint processor to create a new column geopoint from latitude and longitude
  • Add a Reverse-geocode processor to extract the city, department, region and country information associated with each value of geopoint
  • Delete all the remaining columns with the _enName suffix
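
For reference, here is a hypothetical regex sketch of the same extraction. The sample line is invented, but it follows the structure described above (the name between h1 tags with the ' + ' prefix, and the coordinates inside a LatLng(...) call):
import re

# Invented sample line mimicking the structure of garage_locations.csv
line = ("<h1 class=\"left\">' + 'Carrosserie M. SERVICES AUTOMOBILES' + '</h1>"
        " ... LatLng(48.8566,2.3522),icon ...")

name_match = re.search(r"h1[^>]*>' \+ '(.*?)' \+ '", line)
coord_match = re.search(r"LatLng\(([-\d.]+),([-\d.]+)\)", line)

name = name_match.group(1) if name_match else None
latitude = float(coord_match.group(1)) if coord_match else None
longitude = float(coord_match.group(2)) if coord_match else None

print(name, latitude, longitude)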

Run the recipe. This dataset is cleaned and ready!

../../_images/compute_garage_locations_prepared.png

Geo-Joining Datasets

In this step, the objective is to enrich each accident with the nearest rental station and the nearest garage so as to simulate our operating model and plan capacity at the station or geographic level.

From the prepared accidents dataset, create a new Prepare recipe with a new dataset accidents_joined as the output and the following steps in the script (a brute-force sketch of the nearest-neighbor logic follows the list).

  • Add a Geo-Join processor, using the latitude and longitude columns from this dataset. Select rental_agencies_prepared as the dataset to join with and geolat and geolon as the columns identifying the latitude and longitude in that dataset. Copy the columns agency_name, geolat, and geolon from the agencies dataset, and specify station_ as the output column prefix.
  • Add another Geo-Join processor, this time with garage_locations_prepared as the dataset to join with. Select latitude and longitude as the columns identifying the latitude and longitude in that dataset. Copy the name and geopoint columns from the garage dataset, and specify garage_ as the output column prefix.
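
Under the hood, this is a nearest-neighbor search on geographic distance. The Geo-Join processor handles it for us, but a brute-force sketch of the idea, using a plain haversine distance and invented sample data, looks like this:
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two WGS 84 points, in kilometres
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def nearest(point, candidates):
    # point: (lat, lon); candidates: list of (name, lat, lon)
    # Returns (name, distance_km) of the closest candidate
    return min(
        ((name, haversine_km(point[0], point[1], lat, lon)) for name, lat, lon in candidates),
        key=lambda pair: pair[1],
    )

# Invented example: two agencies, one accident near Paris
agencies = [("Agency A", 48.85, 2.35), ("Agency B", 43.30, 5.37)]
print(nearest((48.80, 2.30), agencies))  # -> ('Agency A', ...)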

Run the recipe. Data preparation is done! The combined data is now ready for further analysis.

../../_images/compute_accidents_joined.png

Analysis & Visualization

The end goal of this project is to provide analyses that our coworkers outside the data team can understand and use. We can create a number of visualizations and organize them within a dashboard to get an idea of our network and the structure of the demand.

Charts

Without any further data transformation steps, the accidents_joined dataset has information we can use to create visualizations of the current network. Within the Charts tab of the accidents_joined dataset, create the following charts. Note that you will need to increase the number of records used in the sample in order to see all the data in the charts.

  • Distribution of accidents by hour of the day. This is a histogram with Count of records on the Y axis and datetime on the X axis, with Hour of day selected as the date range.
  • Distribution of accidents by day of week and year. This is a histogram with Count of records on the Y axis, datetime on the X axis with Day of week selected as the date range, and subgroups defined by year. For binning, year should use the raw values.
  • Weekly comparison of accidents count on the past 4 years. This is a line chart with Count of records on the Y axis, weekofyear on the X axis, and subgroups defined by year. For binning, weekofyear and year should use the raw values.
  • Geographic distribution of accidents, filtering out for low collision scores. This is a scatter map with geopoint defining the locations to show and collision defining the color of each point, with more collisions creating a darker color. Add the collision column as a filter and only show points where collision is greater than 3.
  • Heatmap of number of accidents. This is a filled map with geopoint defining the locations to show and Department/County as the admin level. Count of records defines the color of each area. The resulting map shows that the highest number of accidents occurs in Bouches-du-Rhône.
  • Heatmap of average distance between accidents and agencies. This is a filled map with geopoint defining the locations to show and Department/County as the admin level. The average of station_join_distance defines the color of each area. The resulting map shows that the greatest average distance to stations occurs in Corsica.
  • Cumulative distribution of accidents based on distance to agencies (5 km bins). This is a stacked bar chart with Count of records on the Y axis and station_join_distance on the X axis. The Compute setting for Count of records should be set to Cumulative values, and binning for station_join_distance should be set to fixed-size intervals of size 5.
../../_images/accidents_joined-visualize.png

This last chart is a first step towards capacity planning. We can see a significant number of accidents that are more than 15km from the closest rental station. In order to account for the effect of partner garages, we need to create some new columns. Create a new visual analysis for accidents_joined, and add the following steps to the script; a pandas sketch of the same rules follows the list.

  • Add a Formula processor with effective_network as the output column and if(station_join_distance > 15 && garage_join_distance < station_join_distance,"garage","station") as the expression. This creates a rule where if the distance between the collision and the nearest rental station is over 15km and the nearest garage is closer, the garage will handle the request, otherwise the station will.
  • Add a Formula processor with effective_distance as the output column and if(effective_network=="garage",garage_join_distance,station_join_distance) as the expression. Given the rule for determining where the request is handled, this computes the actual distance between the collision and the location that will handle the request.
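
The same two rules expressed in pandas, as a sketch (the sample distances are invented; the column names match accidents_joined):
import numpy as np
import pandas as pd

# Invented sample distances, in km
df = pd.DataFrame({
    "station_join_distance": [5.0, 22.0, 30.0],
    "garage_join_distance": [8.0, 10.0, 40.0],
})

# A garage handles the request only when the station is over 15 km away
# and the garage is closer than the station
use_garage = (df["station_join_distance"] > 15) & (df["garage_join_distance"] < df["station_join_distance"])
df["effective_network"] = np.where(use_garage, "garage", "station")
df["effective_distance"] = np.where(use_garage, df["garage_join_distance"], df["station_join_distance"])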

Now in the Charts tab, create the following chart.

  • Cumulative distribution of effective distance. This is a stacked bar chart with Count of records on the Y axis, effective_distance on the X axis, and subgroups defined by effective_network. The Compute setting for Count of records should be set to Cumulative values, and binning for effective_distance should be set to fixed-size intervals of size 5.

Now we can see that partnering with the garage network allows us to cover many more collisions than we otherwise could.

../../_images/TI066pno-charts.png

Deploy the script as a Prepare recipe, making sure to check Create graphs on the new dataset and to select the Build the dataset now option.

This first set of charts is done!

Webapp

Now let’s create a custom visualization showing the network of rental stations, the network of garages, and a sample of the accidents.

The advantages of using a webapp for this visualization are its flexibility in creating advanced / dynamic charts and the ability to visualize data from several datasets in the Flow.

  • Create a new Standard Webapp, using the starter code for creating map visualizations, and give it a name like Geo Analysis. The starter code includes a preloaded tile map (from Leaflet) and a template for the JavaScript code.
  • Go into the Settings for the webapp and click CONFIGURE under Security. Give the webapp read permission for the prepared accidents, rental agencies, and garage datasets.
  • In the HTML code, remove the default <h1> heading, and replace the default description paragraph with something like: This geographic analysis shows the network of rental stations, the network of garages, and a sample of the accidents.
  • In the JavaScript code, alter the block starting with dataiku.fetch() to the following. This block adds the rental agencies to the map.
dataiku.fetch('rental_agencies_prepared', {
    sampling : "head",
    limit : 20000
  }, function (df) {

    // Add a map marker for each row of the dataset
    // Each marker is a fixed-radius circle with a popup showing the agency name
    var nbRows = df.getNbRows();
    for (var i = 0; i < nbRows; i++) {
      var record = df.getRecord(i);

      // Column names from the rental_agencies_prepared dataset
      var lat = parseFloat(record["geolat"]);
      var lon = parseFloat(record["geolon"]);
      var name = record["agency_name"];
      var city = record["city"];
      if(isNaN(lat) || isNaN(lon)) continue;

      // Radius of the marker is in meters
      var radius = 15000;

      var marker = new L.circle([lat, lon], radius, {
          color: 'green',
          fillColor: 'green',
          fillOpacity: 0.2
        }).bindPopup("Name: <strong>" + name + "</strong>");

      marker.addTo(map);
    }
});
  • Now make a copy of that block of code and alter it to the following. This adds the garages to the map. For the sake of clarity, a random sample of 200 garages is displayed.
dataiku.fetch('garage_locations_prepared', {
    sampling : "random",
    limit : 200
  }, function (df) {

    // Add a map marker for each row of the dataset
    // Each marker is a fixed-radius circle with a popup showing the garage name
    var nbRows = df.getNbRows();
    for (var i = 0; i < nbRows; i++) {
      var record = df.getRecord(i);

      // Column names from the garage_locations_prepared dataset
      var lat = parseFloat(record["latitude"]);
      var lon = parseFloat(record["longitude"]);
      var name = record["name"];

      // Radius of the marker is in meters
      var radius = 15000;

      var marker = new L.circle([lat, lon], radius, {
          color: 'blue',
          fillColor: 'blue',
          fillOpacity: 0.2
        }).bindPopup("Name: <strong>" + name + "</strong>");

      marker.addTo(map);
    }
});
  • Now make a copy of that block of code and alter it to the following. This adds the accidents to the map. For clarity, a random sample of 10,000 accidents is displayed.
dataiku.fetch('accidents_database_prepared', {
    sampling : "random",
    limit : 10000
  }, function (df) {

    // Add a map marker for each row of the dataset
    // Each marker is a small circle with a popup showing the collision count
    var nbRows = df.getNbRows();
    for (var i = 0; i < nbRows; i++) {
      var record = df.getRecord(i);

      // Column names from the accidents_database_prepared dataset
      var lat = parseFloat(record["latitude"]);
      var lon = parseFloat(record["longitude"]);
      var collision = record["collision"];

      // Radius of the marker is in meters
      var radius = 1;

      var marker = new L.circle([lat, lon], radius, {
          color: 'green',
          fillColor: 'green',
          fillOpacity: 0.2
        }).bindPopup("Number of collisions: <strong>" + collision + "</strong>");

      marker.addTo(map);
    }
});

Save the changes, and the preview should update to show the new custom visualization. Our simple webapp is done!

../../_images/DJTSUS1_geo-analysis-edit.png

Aggregations and Window functions

In this section, we use the Group By and Window recipes to create visualizations for capacity planning at the station and regional level.

Regional Level Analysis

We want to identify the regions that have the highest number of accidents per rental station, so that we can plan to increase capacity. Additionally, we want to see if there is seasonal variation.

Starting from the accidents_joined_prepared dataset, create a Group recipe that groups by month. In the recipe settings:

  • In the Pre-filter step, create a filter to keep only rows where the value of year is greater than 2012
  • In the Group step, add region as a group key, choose to compute the count for each group, and select the following aggregations:
    • For collision: Avg
    • For station_agency_name: Distinct
    • For station_join_distance: Avg
    • For garage_name: Distinct
    • For garage_join_distance: Avg
  • In the Post-filter step, keep only the rows where region is defined.

Run the recipe. From here, we can draw the monthly load curves by region, so as to identify areas where we need to increase our capacity, and possibly identify seasonal effects that can help us optimize the rotation of our vehicles.

Create a new visual analysis. On the Script tab, add a Formula processor with capacity_ratio as the output column and count/station_agency_name_distinct as the expression.
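
For reference, the Group recipe and the capacity ratio could be sketched in a Python recipe roughly as follows. This is a sketch only; it assumes the column names used above and reads the dataset with the Dataiku Python API:
import dataiku

df = dataiku.Dataset("accidents_joined_prepared").get_dataframe()

# Keep post-2012 observations, group by month and region, and aggregate
# (rows with an undefined region are dropped by groupby, mirroring the Post-filter step)
grouped = (
    df[df["year"].astype(int) > 2012]
    .groupby(["month", "region"])
    .agg(
        count=("region", "size"),
        collision_avg=("collision", "mean"),
        station_agency_name_distinct=("station_agency_name", "nunique"),
        station_join_distance_avg=("station_join_distance", "mean"),
        garage_name_distinct=("garage_name", "nunique"),
        garage_join_distance_avg=("garage_join_distance", "mean"),
    )
    .reset_index()
)

# Accidents per rental station, i.e. the capacity_ratio column
grouped["capacity_ratio"] = grouped["count"] / grouped["station_agency_name_distinct"]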

On the Charts tab, create a new Lines chart with capacity_ratio on the Y axis, month on the X axis, and subgroups defined by region. Set the aggregation for capacity_ratio to MAX, the binning for month to use raw values, and the sorting for region to display the 7 regions with the highest monthly ratios.

The resulting chart shows that Bretagne and Provence-Alpes-Côte d’Azur are most in need of extra resources. Among the regions, there are suggestions of seasonal spikes, but further analysis would be needed to determine whether these are significant or spurious variations.

Deploy the script as a Prepare recipe, making sure to check Create graphs on the new dataset and to select the Build the dataset now option.

../../_images/mUSPLgbR-charts.png

Station Level Analysis

The goal here is to create a dataset with one record per station, and columns showing the monthly load values as well as a 3-month sliding average. We can do this with a Group recipe followed by a Window recipe.

Starting again from the accidents_joined_prepared dataset, create a Group recipe that groups by month and set the output to write to accidents_by_station. In the recipe settings:

  • In the Group step, add station_agency_name as a group key, choose to compute the count for each group, and select the following aggregations:
    • For station_join_distance: Avg

Run the recipe. From the resulting dataset, create a Window recipe. In the recipe settings:

  • In the Windows definitions step, identify station_agency_name as the partitioning column and month as the ordering column. In order to create a 3-month sliding window, set the window frame to limit the number of preceding rows to 3 and the number of following rows to 0.
  • In the Aggregations step, select the following aggregations:
    • For month: Retrieve
    • For station_agency_name: Retrieve
    • For station_join_distance_avg: Retrieve, Max
    • For count: Retrieve, Avg, LagDiff: 1

The LagDiff 1 aggregation gives, for each agency, the difference between the number of accidents closest to that agency in the current month and the same count in the previous month.
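
A rough pandas sketch of this Window logic, partitioning by agency and ordering by month; it reads the Group recipe output with the Dataiku Python API and mirrors the sliding average and lag difference:
import dataiku

df = dataiku.Dataset("accidents_by_station").get_dataframe()
df = df.sort_values(["station_agency_name", "month"])

# Partition by agency: trailing average of the monthly count over the window frame
# (the current row plus up to 3 preceding rows), and the month-over-month change
by_agency = df.groupby("station_agency_name")["count"]
df["count_avg"] = by_agency.transform(lambda s: s.rolling(window=4, min_periods=1).mean())
df["count_lag_diff"] = by_agency.diff(1)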

Run the recipe. In the Charts tab of the output dataset, create a new Lines chart with count_lag_diff on the Y axis, month on the X axis, and subgroups defined by station_agency_name. Set the aggregation for count_lag_diff to AVG, the binning for month to use raw values, and the sorting for station_agency_name to display the 5 agencies with the highest average lag-difference.

The resulting chart shows us the rental stations with the most volatile demand, with SMEA experiencing the largest changes.

../../_images/accidents_by_station_windows-visualize.png

Dashboards

Each Dataiku project starts with a default dashboard. Let’s rename ours to Network Optimization Dashboard. Then, we will go through the following steps to publish our work to the dashboard:

  • Create three slides called Geo Analysis, Network Visualization, and Capacity Planning, and choose to use the slide title as the header on each slide.
  • Publish the previously created Geo Analysis webapp to the “Geo Analysis” slide on this dashboard.
  • Publish the charts from accidents_joined, with the exception of the cumulative distribution chart, to the Network Visualization slide. This can easily be done by navigating back to accidents_joined from the Flow, opening the Charts tab, and publishing the charts to the Network Optimization Dashboard.
  • Publish the remaining charts to the Capacity Planning slide.

Hint

Feel free to adjust, resize, and reorganize the individual charts within the dashboard to improve the layout and display of each slide.

../../_images/OLiSqUR_network-optimization-dashboard-edit-DOXy8t9.png

Wrap-up

Congratulations! We created an end-to-end workflow and built a dashboard that colleagues outside the data team can understand and use. Thank you for your time working through this use case.