Web Logs Analysis


Business Case

The customer team at Dataiku is interested in using the website logs to perform:

  • Referrer Analysis to determine how visitors are getting to our website, and identify the top referrers, both in volume (number of visitors) and depth (number of pages linked).
  • Visitor Analysis to segment visitors according to how they engage with the website, and then map these segments to known customers in order to feed those customers into the right channel: Marketing, Prospective Sales, and Sales.

Supporting Data

The use case is based on 2 input data sources:

  • Web Logs. The Dataiku website logs, spanning 2 months, that contain information about each individual pageview on the website.
  • CRM. A simulated Customer Relationship Management (CRM) database containing transactional and demographic data about our clients.


The following downloadable archives contain the input data sources.

Workflow Overview

The final Dataiku DSS workflow will look this:


You will go through the following high-level steps:

  • Upload the datasets
  • Clean up and enrich the log data
  • Use visual grouping recipes
  • Run a clustering model to build segments
  • Join the CRM and segment data for known visitors
  • Customize and split dataset by segments

Technical Requirements

Detailed Walkthrough

Create a new Dataiku DSS project and name it Web Logs Analytics.

Preparing the Web Logs

Create a new UploadedFiles dataset from the Web Logs (LogsDataiku.csv.gz) and name it LogsDataiku.

The dataset is already quite clean so the Data Preparation steps will focus mostly on enrichment and feature engineering. Create a Prepare recipe with this dataset as the input and add the following steps to the script.

  • Cleaning
    • Remove columns br_height, br_width, sc_width and sc_height
    • Rename the column client_addr to ip_address
    • Clear cells in ip_address of values that are not valid for the IP address meaning
    • Rename the column location to url
  • Feature engineering dates, locations, and user agents
    • Parse date in server_ts
    • Extract the date components from server_ts (month, day, day_of_week, hour)
    • Geo-locate the ip_address column and extract the country, city and GeoPoint
    • Classify (enrich) user_agent
    • Remove user_agent and all of the enriched columns, with the exception of user_agent_type
  • Feature engineering the Dataiku URLs
    • Split URL in url, extracting the path
    • Split url_path on / and select Truncate so that it keeps only the first output column
    • Fill empty cells of url_path_0 with home
    • Create dummy columns from values of url_path_0. This final step creates the engineered features, that signifies the section of the website the visitor was on.
    • Remove the url_path and url_path_0 columns


To create dummy columns, use the Unfold processeor in the Prepare recipe.

  • Feature engineering the referrer URLs
    • Split URL in referer, extracting the hostname
    • In referer_host Replace t.co with twitter.com, matching on the complete value of the string
    • In referer_host Replace www. with an empty expression (i.e. no value), matching on substring
    • Similarly, in referer_host Replace \..* seperately with an empty expression (i.e. no value), making sure to match on regular expression. This step allows us to later put all traffic from the local Google domains under a single group.
    • Remove the referer column

In order to further reduce clutter, we can remove the type, visitor_params, session_params, event_params, br_lang, and tz_off columns.

Run the recipe. We now have a clean, enriched dataset containing information about our website visits!


Charts for the Prepared Logs

Now that our data has been enriched, cleaned, and interesting features have been added, we can do some initial analysis by creating some charts.

  • Distribution of visits by day of week. This is a histogram with Count of records on the Y axis and server_ts on the X axis with Day of week as the date range. The resulting chart shows that the number of visits is highest on Friday and Monday, lowest on the weekend, and visits steadily decline from Monday through Thursday before spiking on Friday.
  • Daily timeline of visits and from France and USA. This is a lines chart with Count of records on the Y axis and server_ts on the X axis with Day as the date range. Subgroups are defined by ip_address_country, and create a filter on ip_address_country with only France and the United States selected. Furthermore, the resulting chart shows that during the available period, the number of visits on any given day is highly variable, visits from France tend to outnumber those from the US, and aside from a spike on 2014-03-28, the number of daily visits is under 400.
  • Heatmap of visits by country of origin. This is a Filled Administrative Map with ip_address_geopoint providing the geographic information and Count of records providing the color details. Changing the selected color palette and the color scale mode to logarithmic can help make the differences between countries more apparent. The resulting chart shows that after France and the United States, the most visitors come from Great Britain, India, and Canada.

Referrer Analysis

We are trying to look at our top referrers to the Dataiku website, analyze the volume (i.e. number of pageviews and distinct visitors), as well as their level of engagement (i.e. number of distinct Dataiku URLs). In order to achieve this, we will need to group our dataset by unique values contained within the referer_host column.

From the LogsDataiku_prepared, create a new Group recipe with referer_host as the column to group by.

In the Group recipe settings, select Compute count for each group and the following per-field aggregations:

  • For server_ts: Min, Max
  • For visitor_id and url: Distinct

Run the recipe. In the output dataset, click on the Charts tab and create the following charts described below. Afterwards, publish these charts to a dashboard:

  • Number of pageviews per referrer (excluding Dataiku, Google, and No value). This is a bar chart with count on the X axis and referer_host on the Y axis. count should have the SUM aggregation and referer_host should be sorted by descending sum of count. Add referer_host host as a filter and exclude dataiku, google, and No value. Under Display, select Show horizontal axis. The resulting chart shows that journaldunet is the largest single referrer.
  • Pageviews, distinct visitors and distinct URLs per referrer. This is a pivot table with referer_host in the rows and count, visitor_id_distinct, and url_distinct as contents. referer_host should be sorted by descending order of count.

Visitor Analysis

Our visitor analysis has the following steps:

  • Group visits by visitor
  • Segment visitors using a clustering model
  • Join the cluster labels with the customer data
  • Send the segmented data to appropriate channels for further engagement

Grouping Visitors

Using the same technique we used for referrers, we will now take a look at our website visitors and their behavior (engagement with content) across time (sessions).

From the LogsDataiku_prepared dataset, create a new Group recipe with visitor_id as the column to group by.

In Group recipe settings, select Compute count for each group and the following per-field aggregations:

  • For day, day_of_week, and hour: Distinct.
  • For ip_address_country: Min. This retrieves the country of origin of the visitor
  • For session_id and url: Distinct
  • For blog, applications, home, company, products, and data-science: Sum
  • For user_agent_type: Min

Run the recipe. Basic data aggregation is done!


Clustering Web Visitors

In the prepared dataset grouped by visitor_id, go to the Lab and create a new Visual Analysis. Select Quick Model, followed by a Clustering task and a K-Means clustering model.

Dataiku will recognize the data types and use Machine Learning best practices when you train a default clustering model. The final cluster profiles can vary because of sampling effects, but there will be 5 primary clusters, plus a 6th cluster that contains rows that don’t fit the primary clusters.

Check the Summary tab of the model to observe basic information about the cluster profiles. You will also be able to give more descriptive names to the clusters. For example, we can label clusters:

  • US visitors, with larger-than-usual numbers of Americans
  • French visitors, with larger-than-usual numbers of French
  • Frequent visitors, with larger-than-usual distinct times they’ve visited
  • Engaged visitors, with larger-than-usual numbers of distinct URLs visited
  • Sales prospects, with larger-than-usual numbers of visits to the products page

Deploy the model to the Flow as a retrainable model and then apply it to the LogsDataiku_prepared_by_visitor_id dataset. Name the output dataset LogsDataiku_segmented.


Joining the Clusters and Customer Data

To map these segments to known customers, create a new UploadedFiles dataset from the customer data (CRM.csv.gz) and name it CRM.

Create a Join visual recipe from the CRM dataset and choose LogsDataiku_segmented as the dataset to join with. Name the output dataset LogsDataiku_segmented_joined.

  • In the Join step, visitor_id should be automatically recognized as the join key. Change the type of join to an Inner Join. This will allow us to keep only the records where a customer can be successfully matched with website visits.

Customizing and Activating Segments

Now we want to feed these customers into the right channel: Marketing, Prospective Sales, and Sales.

Create a Prepare recipe from the LogsDataiku_segmented_joined dataset and add a Formula processor to the script with the expression as defined below:

if(cluster_labels == "US visitors" || cluster_labels == "French visitors", "Marketing" +
  if(user_agent_type_min =="browser", " - browser", " - mobile"),
  if(applications_sum >= 2 || products_sum >= 2, "Sales", "Sales prospecting"))

Run the recipe. From the output dataset, create a Split recipe that sends each of the four values of Cluster_Final to individual datasets:

  • Send_to_Sales
  • Send_to_Sales_Prospecting
  • Send_to_Marketing_browser
  • Send_to_Marketing_mobile

Be sure to select “Drop data” from “Other values” to avoid additional outputs. These newly generated datasets are immediately usable by customer-facing team members to send targeted emails!



Congratulations! We created an end-to-end workflow to build datasets as collateral for colleagues in Sales and Marketing. Thank you for your time working through this use case.