Visual Window Analytic Functions¶
Window functions are one of the most powerful features of SQL (and SQL-like: Hive, Impala, Spark, …) databases. They are also one of the least known and most tricky.
The Window visual recipe in Dataiku allows you to use these features without coding:
- Filter rows by order of appearance within a group
- Compute moving averages, cumulative sums, …
- Compute the number of events that occurred during the 7 days prior to another event
The PostgreSQL documentation has a very good introduction to the concept of Window analytics. Quoting:
Definition of a window function
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
In other words, unlike a Group recipe, a Window recipe does not reduce the number of rows in a dataset. It creates new columns in a dataset that are the results of computations that use all rows in a “window”, that is, a subset, of all rows of the dataset.
Conceptually, there are two steps when defining a window recipe:
- Defining the window(s): on what rows of the dataset will the computations apply?
- Defining the aggregation(s): What do I compute on this window?
The window can be:
- partitioned: you define one or several columns, and one window is created for each combination of values of these columns.
- bounded: For each row, we can restrict the window to start:
- at the beginning (resp. to the end)
- at a given number of rows before the current row
- at a given range of values before a given column of the current row
- at the current row
The same bounding can be applied to the end of the window. In addition, a window can be ordered (this is actually required for bounding it).
Create Your Project¶
From the Dataiku home page, click + New Project > DSS Tutorials > Visual recipes > Window Recipe.
There are two datasets:
- an extract from the “Blue book for Bulldozers” Kaggle competition data. It is a dataset where each row represents the sale of a used bulldozer. We’re interested in the saledate column from this dataset, but it is an unparsed text string. Attach a Prepare recipe and create a new output dataset called bulldozer, parse the saledate column, and run the recipe.
- an extract of a web log, containing events where users switch from one page to another
If you have an available connection to a database or distributed system, you may want to Sync the datasets to that system before continuing the rest of the tutorial.
Adding Average Price as a Column¶
The bulldozer dataset contains one row per sale, with information about the sale, information about the bulldozer, the sale date and the sale price.
For this first example, we want to add a new column that adds to each row: the average price of all bulldozers in the same category of ProductGroup.
With the bulldozer dataset selected, choose Actions > Window and create a new Window recipe.
The recipe opens on the Window step, which is where you define the windows. Here, we want to add an information about all datasets with the same ProductGroup, so we want to partition the window by ProductGroup, and not bound it, nor order it.
- Activate “Partitioning columns”
- Select the ProductGroup column
Let’s now move on to the Aggregations step.
We want to Retrieve all columns of the dataset (keep all existing data), and add another column: the average value of SalePrice. Check the SalePrice > Avg box.
Click Run and accept the schema change warning. The output dataset should include a new SalePrice_avg column.
You can verify that each time the ProductGroup changes, the average price changes.
Thanks to this new column, we could now compute, for example, the difference between the Average price of the ProductGroup and the SalePrice of an actual bulldozer, to get a notion of whether this particular bulldozer seems “overpriced”.
Continuing with this “overpriced” notion, let’s identify which bulldozers are the most expensive for their ProductGroup. In other words, for each value of ProductGroup, we would like to sort by descending values of SalePrice and pull only the first one.
Reopen the Window recipe and go to the Windows definitions step. Turn on Order columns, select SalePrice as the column, and click the icon so that values are sorted in decreasing order.
Now go to the Aggregations step. We’ve decided that we want to focus on SalesID, SalePrice, and ProductGroup. We can quickly disable Retrieve for all columns, and then add it back for the three we’re interested in.
In addition to computing aggregations based on the fields in the window, we have a few options to compute different types of ranks for each row.
Rank is the “index” of the current row within its window. In other words, since we have a window, partitioned by ProductGroup, and ordered by decreasing SalePrice, the most expensive bulldozer of each ProductGroup will have rank=1, the second most expensive of each group will have rank=2, and so on.
Enable Rank and run the recipe.
The resulting dataset is sorted by ProductGroup and descending SalePrice and the most expensive “BL” bulldozer has rank = 1.
At this point, we notice two important things:
First of all, the ranks are not continuous. That’s because we need to break the ties. There are three variants of rank:
- Rank will output 1,1,3,3,3,5
- Dense rank will output 1,1,2,2,2,3
- Row number will output 1,2,3,4,5,6 (ties are broken randomly)
Strangely, the Average price now changes within a group! That’s because as soon as you Sort a window, the definition of the window automatically changes and it becomes “bounded” between “beginning” and “current row”. In other words, now, the Average price is computed for each bulldozer by only using the data of the more expensive ones (since the window is ordered by decreasing price).
We could cancel this effect by forcing a Unbounded/Unbounded window in the DSS UI. But for the moment, we are going to use this to compute cumulative sums.
Return to the Window recipe. We’d now like to know the proportion of the total sales made as time passes. In other words, we want to compute the cumulative distribution of the summed price. We’re no longer interested in ProductGroup; we want global information. Since this will be a major change to the Window recipe settings and we want to keep our previous work, make a copy of the recipe. Choose Actions > Copy, name the output dataset something like
bulldozer_windows_cumulative_sums, and click Create recipe.
Turn Partitioning columns off. Change the Order column to saledate and click the icon so that dates are sorted in increasing order.
In the Aggregations step, check the SalePrice > Sum box. Since the window is ordered, for each bulldozer, this will compute the sum of the sale price of all previous bulldozers, plus this one.
To better see what’s going on, also check saledate > Retrieve.
Run the recipe. In the resulting dataset, on the Charts tab, we can chart the cumulative sums along with the marginal increases in the cumulative sum.
- Select SalePrice_sum and SalePrice as Y-axis columns. Set the aggregation for SalePrice to SUM.
- Select saledate as the X-axis column. Set the date range as Quarter.
For each bulldozer, we want to have a moving average of the 10 previous sales of this ProductGroup. This will allow us to have a vision of how the market is moving. We’ll use the “Window frame” feature to limit the window to the 10 rows preceding the current row (up to the current row).
Return to the Window recipe. Make a copy of the recipe and name the new output dataset something like
bulldozer_moving_average. In the new recipe:
- Turn Partitioning columns back on and ensure the ProductGroup column is selected.
- Turn Window Frame on, and Limit the preceding rows to 10.
Run the recipe. In the resulting dataset, on the Charts tab, we can chart the moving average along with the monthly average.
- Choose Lines as the chart type.
- Select SalePrice_avg and SalePrice as Y-axis columns.
- Select saledate as the X-axis column. Set the date range as Month.
Filtering sessions with Rank¶
Let’s switch to the dataset: a web log. It contains events on a website. For each event, we have:
- server_ts: the timestamp
- visitor_id: the unique identifier of the visitor
- session_id: Automatically reset each time a visitor comes back on the site
- page_id: the name of the page on the site
- sc_width: the screen width
- sc_height: the screen height
There can be several events on the same page (for example because the user clicked a button on a page, and that triggered an event). We want to keep only the first event on each page for each session (regardless of whether the user went back to the page). Window functions can do that!
Create a window recipe and define the windows with the following settings:
- Partitioned by session_id and page_id
- Ordered by server_ts (Ascending)
On the Aggregations step, choose to compute Rank, and keep all columns on Retrieve for the moment.
Now we want to only keep the rows where rank=1. We could do that with a Prepare recipe, but we can also do this in the Post-filter step of the Window recipe.
- Turn Filters On.
- Create a condition where
rank == 1.
Run the recipe. Our output dataset now only contains the events that are the first time that the user interacted with a page within a given session.
Lead and lag¶
Now we would like to know how much time was spent on each page in a session. We need to compute the date difference between the
rank==1 items that we extracted, within each session.
There’s a Window function for that: LAG. LAG retrieves the value of a column in the previous (or a previous) row and inserts it in the current row in another column. Furthermore, DSS adds the “Lagdiff” function to automatically compute the difference between this previous value and the current value.
Create a new window recipe, attached to the dataset that we built in the previous section. In the Windows definitions step:
- Activate partitions and choose session_id as a partitioning column.
- Order by server_ts (Ascending)
In the Aggregations step, choose to compute the Lag and LagDiff on the server_ts column.
The “Lag offset” is 1 because we want to retrieve the value of the 1st previous row. For LagDiff we also need to specify the unit in which the time difference will be expressed. Choose Seconds.
Run the recipe and let’s look at few results in the output dataset.
001516e1b8f688bonly visited one page
malastare_2. There is therefore no information on how long they were on the page.
naboo_4``and ``trandosha_2. Here we see that the user spent 11 seconds on
- The following few sessions only have one page each, so no lag info.
008615e12f23064visited a number of pages, staying 5 seconds on the first, then 45 seconds, then 21, and so on.