Visual Window Analytic Functions

Window functions are one of the most powerful features of SQL databases (and SQL-like systems: Hive, Impala, Spark, …). They are also among the least known and trickiest.

The Window visual recipe in Dataiku allows you to use these features without coding:

  • Filter rows by order of appearance within a group
  • Compute moving averages, cumulative sums, …
  • Compute the number of events that occurred during the 7 days prior to another event

The PostgreSQL documentation has a very good introduction to the concept of Window analytics. Quoting:

Note

Definition of a window function

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

In other words, unlike a Group recipe, a Window recipe does not reduce the number of rows in a dataset. It creates new columns in a dataset whose values are the results of computations that use all rows in a “window”, that is, a subset of the rows of the dataset.

Conceptually, there are two steps when defining a window recipe:

  • Defining the window(s): On what rows of the dataset will the computations apply?
  • Defining the aggregation(s): What do I compute on this window?

The window can be:

  • partitioned: you define one or several columns, and one window is created for each combination of values of these columns.
  • bounded: For each row, we can restrict the window to start:
    • at the beginning (resp. at the end)
    • at a given number of rows before the current row
    • at a given range of values before the current row’s value in a given column
    • at the current row

The same bounding can be applied to the end of the window. In addition, a window can be ordered (this is actually required for bounding it).
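
For readers who know SQL, these two steps map directly onto the OVER clause of a window function. Here is a minimal sketch of the anatomy, in PostgreSQL-style syntax (all table and column names below are placeholders):

    SELECT
        AVG(some_value) OVER (          -- the aggregation
            PARTITION BY some_category  -- the partitioning
            ORDER BY some_date          -- the ordering
            ROWS BETWEEN 3 PRECEDING    -- the bounds (the "window frame")
                     AND CURRENT ROW
        ) AS some_value_avg
    FROM some_table;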

Create Your Project

From the Dataiku home page, click + New Project > DSS Tutorials > Visual recipes > Window Recipe.

There are two datasets:

  • an extract from the “Blue Book for Bulldozers” Kaggle competition data. It is a dataset where each row represents the sale of a used bulldozer. We’re interested in the saledate column from this dataset, but it is an unparsed text string. Attach a Prepare recipe and create a new output dataset called bulldozer, parse the saledate column, and run the recipe.
  • an extract of a web log, containing events where users switch from one page to another

Note

If you have an available connection to a database or distributed system, you may want to Sync the datasets to that system before continuing the rest of the tutorial.

Adding Average Price as a Column

The bulldozer dataset contains one row per sale, with information about the sale, information about the bulldozer, the sale date and the sale price.

For this first example, we want to add a new column to each row: the average price of all bulldozers in the same ProductGroup.

With the bulldozer dataset selected, choose Actions > Window and create a new Window recipe.

The recipe opens on the Windows definitions step, which is where you define the windows. Here, we want to add information computed across all rows with the same ProductGroup, so we want to partition the window by ProductGroup, and not bound it, nor order it.

  • Activate “Partitioning columns”
  • Select the ProductGroup column
Window recipe; specifying Partitioning columns

Let’s now move on to the Aggregations step.

We want to Retrieve all columns of the dataset (keep all existing data), and add another column: the average value of SalePrice. Check the SalePrice > Avg box.

Window recipe; specifying Avg aggregations

Click Run and accept the schema change warning. The output dataset should include a new SalePrice_avg column.

You can verify that each time the ProductGroup changes, the average price changes.

Thanks to this new column, we could now compute, for example, the difference between the average price of the ProductGroup and the SalePrice of an individual bulldozer, to get a sense of whether this particular bulldozer seems “overpriced”.
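
For reference, this whole recipe boils down to a single window function call. Here is a sketch of the equivalent SQL, assuming a PostgreSQL-style engine (the query Dataiku actually generates may differ):

    SELECT
        *,
        -- One unbounded, unordered window per ProductGroup:
        AVG("SalePrice") OVER (PARTITION BY "ProductGroup") AS "SalePrice_avg"
    FROM bulldozer;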

Computing Ranks

Continuing with this “overpriced” notion, let’s identify which bulldozers are the most expensive for their ProductGroup. In other words, for each value of ProductGroup, we would like to sort by descending values of SalePrice and pull only the first one.

Reopen the Window recipe and go to the Windows definitions step. Turn on Order columns, select SalePrice as the column, and click the icon so that values are sorted in decreasing order.

Window recipe; specifying Partitioning columns and Order columns

Now go to the Aggregations step. We’ve decided that we want to focus on SalesID, SalePrice, and ProductGroup. We can quickly disable Retrieve for all columns, and then add it back for the three we’re interested in.

In addition to computing aggregations based on the fields in the window, we have a few options to compute different types of ranks for each row.

Rank is the “index” of the current row within its window. In other words, since we have a window, partitioned by ProductGroup, and ordered by decreasing SalePrice, the most expensive bulldozer of each ProductGroup will have rank=1, the second most expensive of each group will have rank=2, and so on.

Enable Rank and run the recipe.

Window recipe; specifying Rank aggregations

The resulting dataset is sorted by ProductGroup and descending SalePrice, and the most expensive “BL” bulldozer has rank = 1.

Dataset sorted by ProductGroup and SalePrice

At this point, we notice two important things:

  • First of all, the ranks are not continuous. That’s because of ties: several bulldozers in a group share the same SalePrice, and Rank leaves gaps after them. There are three variants of rank (see the SQL sketch after this list):

    • Rank will output 1,1,3,3,3,5
    • Dense rank will output 1,1,2,2,2,3
    • Row number will output 1,2,3,4,5,6 (ties are broken arbitrarily)
  • Strangely, the Average price now changes within a group! That’s because as soon as you Sort a window, the definition of the window automatically changes and it becomes “bounded” between “beginning” and “current row”. In other words, the Average price is now computed for each bulldozer by only using the data of the more expensive ones (since the window is ordered by decreasing price).

    We could cancel this effect by forcing an Unbounded/Unbounded window in the DSS UI. But for the moment, we are going to use this behavior to compute cumulative sums.
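
For reference, here is how the three rank variants and the running average look in SQL. This is a sketch in PostgreSQL-style syntax, not the exact query Dataiku generates:

    SELECT
        "SalesID", "SalePrice", "ProductGroup",
        RANK()       OVER w AS "rank",
        DENSE_RANK() OVER w AS "dense_rank",
        ROW_NUMBER() OVER w AS "row_number",
        -- Because the window is ordered, the default frame runs from the
        -- beginning of the partition to the current row, so this is a
        -- running average rather than the per-group average:
        AVG("SalePrice") OVER w AS "SalePrice_avg"
    FROM bulldozer
    WINDOW w AS (PARTITION BY "ProductGroup" ORDER BY "SalePrice" DESC);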

Cumulative sums

Return to the Window recipe. We’d now like to know the proportion of the total sales made as time passes. In other words, we want to compute the cumulative distribution of the summed price. We’re no longer interested in ProductGroup; we want global information. Since this will be a major change to the Window recipe settings and we want to keep our previous work, make a copy of the recipe. Choose Actions > Copy, name the output dataset something like bulldozer_windows_cumulative_sums, and click Create recipe.

Turn Partitioning columns off. Change the Order column to saledate and click the icon so that dates are sorted in increasing order.

Window recipe, Windows definitions step

In the Aggregations step, check the SalePrice > Sum box. Since the window is ordered, for each bulldozer, this will compute the sum of the sale price of all previous bulldozers, plus this one.
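
In SQL terms, this is the classic cumulative-sum pattern. A sketch, again in PostgreSQL-style syntax:

    SELECT
        "saledate",
        "SalePrice",
        -- No PARTITION BY: one global window, ordered by date. The default
        -- frame of an ordered window turns SUM into a running total.
        SUM("SalePrice") OVER (ORDER BY "saledate") AS "SalePrice_sum"
    FROM bulldozer;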

To better see what’s going on, also check saledate > Retrieve.

Window recipe, Aggregations step

Run the recipe. In the resulting dataset, on the Charts tab, we can chart the cumulative sums along with the marginal increases in the cumulative sum.

  • Select SalePrice_sum and SalePrice as Y-axis columns. Set the aggregation for SalePrice to SUM.
  • Select saledate as the X-axis column. Set the date range as Quarter.
Bar chart of cumulative quarterly sales with quarterly increment

Moving averages

For each bulldozer, we want a moving average of the 10 previous sales in its ProductGroup. This gives us a view of how the market is moving. We’ll use the “Window frame” feature to limit the window to the 10 rows preceding the current row (up to and including the current row).
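
In SQL, this is where an explicit frame clause comes in. A sketch of the window we are about to define (PostgreSQL-style syntax):

    SELECT
        "saledate",
        "SalePrice",
        AVG("SalePrice") OVER (
            PARTITION BY "ProductGroup"
            ORDER BY "saledate"
            -- The "Window frame" setting: the 10 previous rows plus this one.
            ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
        ) AS "SalePrice_avg"
    FROM bulldozer;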

Return to the Window recipe. Make a copy of the recipe and name the new output dataset something like bulldozer_moving_average. In the new recipe:

  • Turn Partitioning columns back on and ensure the ProductGroup column is selected.
  • Turn Window Frame on, and Limit the preceding rows to 10.
Window recipe; specifying Partitioning columns, Order columns, and Window Frame

Run the recipe. In the resulting dataset, on the Charts tab, we can chart the moving average along with the monthly average.

  • Choose Lines as the chart type.
  • Select SalePrice_avg and SalePrice as Y-axis columns.
  • Select saledate as the X-axis column. Set the date range as Month.
Line chart of moving average of sales with monthly increment

Filtering sessions with Rank

Let’s switch to the second dataset: the web log. It contains events on a website. For each event, we have:

  • server_ts: the timestamp
  • visitor_id: the unique identifier of the visitor
  • session_id: automatically reset each time a visitor comes back to the site
  • page_id: the name of the page on the site
  • sc_width: the screen width
  • sc_height: the screen height

There can be several events on the same page (for example because the user clicked a button on a page, and that triggered an event). We want to keep only the first event on each page for each session (regardless of whether the user went back to the page). Window functions can do that!

Create a window recipe and define the windows with the following settings:

  • Partitioned by session_id and page_id
  • Ordered by server_ts (Ascending)
Window recipe, Windows definitions step

On the Aggregations step, choose to compute Rank, and keep all columns on Retrieve for the moment.

Now we want to only keep the rows where rank=1. We could do that with a Prepare recipe, but we can also do this in the Post-filter step of the Window recipe.

  • Turn Filters On.
  • Create a condition where rank == 1.
Window recipe, Post-filter step

Run the recipe. Our output dataset now only contains the events that are the first time that the user interacted with a page within a given session.
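
In SQL, this keep-the-first-row pattern needs a subquery, because window functions cannot appear directly in a WHERE clause. A sketch (weblog stands in for the actual input dataset name):

    SELECT *
    FROM (
        SELECT
            *,
            RANK() OVER (
                PARTITION BY "session_id", "page_id"
                ORDER BY "server_ts"
            ) AS "rank"
        FROM weblog
    ) ranked
    WHERE "rank" = 1;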

Lead and lag

Now we would like to know how much time was spent on each page in a session. We need to compute the date difference between the rank==1 items that we extracted, within each session.

There’s a Window function for that: LAG. LAG retrieves the value of a column in the previous row (or an earlier row) and inserts it in the current row in another column. Furthermore, DSS adds the “LagDiff” function to automatically compute the difference between this previous value and the current value.
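
In SQL, LAG looks like the sketch below. “LagDiff” is a DSS convenience rather than a standard SQL function; it is approximated here with a plain timestamp difference (weblog_first_events stands in for the dataset built in the previous section):

    SELECT
        *,
        -- The value of server_ts on the previous row of the window:
        LAG("server_ts") OVER w AS "server_ts_lag",
        -- An approximation of LagDiff, expressed in seconds:
        EXTRACT(EPOCH FROM "server_ts" - LAG("server_ts") OVER w) AS "server_ts_lag_diff"
    FROM weblog_first_events
    WINDOW w AS (PARTITION BY "session_id" ORDER BY "server_ts");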

Create a new window recipe, attached to the dataset that we built in the previous section. In the Windows definitions step:

  • Activate partitions and choose session_id as a partitioning column.
  • Order by server_ts (Ascending)

In the Aggregations step, choose to compute the Lag and LagDiff on the server_ts column.

The “Lag offset” is 1 because we want to retrieve the value of the previous row. For LagDiff we also need to specify the unit in which the time difference will be expressed. Choose Seconds.

Window recipe; specifying Lag and LagDiff aggregations

Run the recipe and let’s look at a few results in the output dataset.

Dataset with lagged values
  • Session 001516e1b8f688b visited only one page, malastare_2. There is therefore no information on how long they were on the page.
  • Session 001be421aea8bf8 visited naboo_4 and trandosha_2. Here we see that the user spent 11 seconds on naboo_4.
  • The following few sessions only have one page each, so no lag info.
  • Session 008615e12f23064 visited a number of pages, staying 5 seconds on the first, then 45 seconds, then 21, and so on.