Data Pipelines

If you have learned how to manipulate data in the Basics and From Lab to Flow tutorials, you’re ready to build a more complex data pipeline.

In this tutorial, you will create a Flow whose output is a dataset that can be shared with other projects or exported outside of Dataiku.

Prerequisites

This tutorial assumes that you have completed Tutorial: From Lab to Flow before beginning this one.

Duplicate/Create Your Project

We’re going to start with the Flow as it exists at the end of the Tutorial: From Lab to Flow.

../../_images/tshirt-etl-flow-01.png

In this tutorial, we are going to make changes to the Flow. Because we also want to preserve the work done to create the existing Flow, we’ll start by making a copy of the project.

  • In the project home, select Actions > Duplicate this project.
  • Give the duplicate project a name like Data Pipelines.
  • Click Duplicate.
../../_images/tshirt-duplicate.png

By default, only the input datasets are duplicated into the new project in order to speed up the duplication process. We will rebuild the other datasets as we make changes to the Flow.

../../_images/tshirt-etl-flow-02.png

Alternatively, you can import a Dataiku DSS Project that represents the end state of the Lab to Flow tutorial. From the Dataiku homepage, click +New Project > DSS Tutorials > Basics > Tutorial: Data Pipelines.

Update the Prepare Recipe

In the From Lab to Flow tutorial, we looked at the total purchase history of each customer. In this Flow, we want to break down the purchase history by year. As a first step, we’ll need to update the first Prepare recipe so that it extracts the year from column order_date.

  • Open the Prepare recipe.
  • Click on the order_date column heading and select Extract date components from the dropdown list.
  • In the newly created step in the Script, clear the entries for month and day so that only order_date_year is created.
  • Click Save and accept the schema changes.

Note

Recipe Run Options

After accepting the schema changes, it is not necessary to run the recipe at this time. The next few visual recipes can be created by saving and updating the schema each time, without building the datasets. Hold off on building until instructed; this will make Dataiku’s Run options easier to understand.

../../_images/tshirt-etl-prepare-01.png
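
If it helps to see what this step computes, here is a minimal pandas sketch of the equivalent logic outside of Dataiku. The sample rows are made up; only the column names (order_date, order_date_year) mirror the tutorial.

    import pandas as pd

    # A made-up stand-in for a few rows of the orders data.
    orders = pd.DataFrame({
        "customer_id": ["A", "A", "B"],
        "order_date": ["2016-03-14", "2017-07-02", "2017-01-09"],
        "total": [120.0, 80.0, 45.0],
    })

    # Equivalent of "Extract date components" with only the year kept:
    # parse order_date and create order_date_year.
    orders["order_date_year"] = pd.to_datetime(orders["order_date"]).dt.year
    print(orders)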

Create a New Branch

Previously, we used a Group recipe, grouped by customer, to take the orders_prepared dataset and create the orders_by_customer dataset, which contains the total purchase history of each customer. We’re going to keep this part of the Flow and create a new branch by copying the Group recipe and adding order_date_year as a group key.

  • Select the Group recipe, and from the Actions menu, select Copy.
  • Leave orders_prepared as the input dataset.
  • Name the output dataset something like orders_grouped.
  • Click Create Recipe.
../../_images/tshirt-etl-group-01.png
  • In the Group recipe, add order_date_year as a group key.
  • Click Save and accept the schema changes.
../../_images/tshirt-etl-group-02.png
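
As a rough illustration, the copied Group recipe now behaves like the pandas aggregation below, continuing the made-up orders frame from the earlier sketch. Your copied recipe may compute additional aggregates; here we only sum a hypothetical total column into total_sum.

    # Equivalent of grouping by customer_id and order_date_year:
    # one row per customer per year, with that year's summed spending.
    orders_grouped = (
        orders.groupby(["customer_id", "order_date_year"], as_index=False)
              .agg(total_sum=("total", "sum"))
    )
    print(orders_grouped)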

Using the Window Recipe to Compute Customer Ranks by Year

We now have the schema of a dataset that groups purchases by customer and year. Our next step is to get a ranking of each customer’s spending in each year. We can do this with the Window recipe. With the empty orders_grouped dataset selected:

  • Choose Actions > Window.
  • Click Create Recipe.

In order to compute a ranking of spending by year, we need to compare each customer’s spending within a given year. In the Window definitions step of the Window recipe:

  • Turn Partitioning Columns On and select order_date_year as the partitioning column. This defines a “window” for each year, and customers within each year will be compared to each other.
  • Turn Order Columns On and select total_sum as the order column. Click the ordering button so that spending is sorted in descending order. This means that the largest spender will be ranked 1, the next largest 2, and so on.
../../_images/tshirt-etl-window-01.png
  • In the Aggregations step, select Rank to compute the rank for each row.
  • Click Save and accept the schema changes.
../../_images/tshirt-etl-window-02.png
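
In pandas terms, the Window recipe configured above is roughly the rank computation below, continuing the sketch. Dataiku’s Rank aggregation behaves like a SQL RANK, which pandas approximates with method="min".

    # Equivalent of the Window recipe: partition by order_date_year,
    # order by total_sum descending, and rank each customer within the year.
    orders_grouped["rank"] = (
        orders_grouped.groupby("order_date_year")["total_sum"]
                      .rank(method="min", ascending=False)
                      .astype(int)
    )
    print(orders_grouped)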

Pivot Ranks from Rows into Columns

Each row in the orders_grouped_windows dataset will represent a particular year’s spending for a given customer. We eventually want to join this information with other customer information where there is a single row for each customer. Thus, our next step is to Pivot the dataset so that there is a separate column for each year’s customer rankings. With the empty orders_grouped_windows dataset selected:

  • Choose Actions > Pivot.
  • Choose order_date_year as the column to Pivot By.
  • Rename the output dataset to something like orders_by_customer_pivoted.
  • Click Create Recipe.
  • Select customer_id as a row identifier.
  • Under Populate content with, deselect Count of records.
  • Select rank from the dropdown list and choose min as the aggregation.
../../_images/tshirt-etl-pivot-01.png
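
The configuration above corresponds roughly to the pandas pivot below, continuing the sketch: one row per customer_id, one column per value of order_date_year, and the minimum rank as the cell value. The exact output column names in Dataiku (such as 2017_rank_min) depend on your data and settings.

    # Equivalent of the Pivot recipe: rows identified by customer_id,
    # one column per year, populated with the min of rank.
    pivoted = orders_grouped.pivot_table(
        index="customer_id",
        columns="order_date_year",
        values="rank",
        aggfunc="min",
    )
    pivoted.columns = [f"{year}_rank_min" for year in pivoted.columns]
    orders_by_customer_pivoted = pivoted.reset_index()
    print(orders_by_customer_pivoted)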

Note that, until this point, we have not built any datasets, because our goal has been clear in each recipe. It wasn’t necessary to see the resulting dataset to move on to the next recipe; all we needed was the schema.

Since the columns created by the Pivot recipe are dependent upon the range of values of order_date_year, Dataiku cannot create a schema for orders_by_customer_pivoted without actually building the dataset.

So we are going to run this recipe, but we need to tell Dataiku to also build all of the upstream datasets.

  • Click on the gear in the Run button.
  • Select Build required dependent datasets as the Build mode.
  • Click OK, then click Run.
../../_images/tshirt-etl-pivot-02.png

Warning

Setting the recipe to build required dependent datasets in its Settings means that every time it is run, Dataiku will rebuild the upstream datasets. If we don’t want this to be a permanent setting, we can change it back after running the recipe. Alternatively, we could select the orders_by_customer_pivoted dataset and build it recursively.
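
As a side note, recursive builds can also be triggered outside the recipe’s run options. The sketch below shows one possible way to do this with Dataiku’s public Python API (dataikuapi); the host URL, API key, and project key are placeholders, and the available job_type values can vary by DSS version, so treat this as an assumption to verify against your instance’s API documentation.

    import dataikuapi

    # Placeholders: point the client at your own DSS instance and project.
    client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
    project = client.get_project("DATAPIPELINES")

    # Ask DSS to build the dataset recursively, i.e. upstream datasets first.
    dataset = project.get_dataset("orders_by_customer_pivoted")
    job = dataset.build(job_type="RECURSIVE_BUILD")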

Join Pivoted Data to Customer Data

Finally, we’re ready to add these columns to the rest of the customer data.

  • Open the existing Join recipe.
  • Click +Add Input.
  • Select orders_by_customer_pivoted as the new input dataset and click Add Dataset.
../../_images/tshirt-etl-join-01.png

Dataiku automatically detects customerID and customer_id as the correct join keys. There’s no need to change the join type from left join.

  • On the Selected columns step, deselect customer_id from the orders_by_customer_pivoted list.
  • Click Save and accept the schema changes.
../../_images/tshirt-etl-join-02.png
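
In pandas terms, adding the pivoted data to the join is roughly the merge below. The customers frame here is a made-up stand-in for the existing customer data; only customerID and customer_id mirror the tutorial’s join keys.

    # A made-up stand-in for the existing customer data.
    customers = pd.DataFrame({
        "customerID": ["A", "B", "C"],
        "campaign": [True, False, True],
    })

    # Equivalent of the Left join on customerID = customer_id,
    # with the duplicate key column dropped afterwards.
    customers_joined = customers.merge(
        orders_by_customer_pivoted,
        how="left",
        left_on="customerID",
        right_on="customer_id",
    ).drop(columns="customer_id")
    print(customers_joined)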

Propagate Schema Changes

We are done making changes to recipes in the Flow, but we need to propagate the new columns to the end of the Flow. If we were to build the customers_labeled dataset now, it would not include the columns created by the Pivot recipe.

In order to update the schema of customers_labeled:

  • Select the customers_orders_joined dataset and choose Actions > Build.
  • Choose to do a Recursive build so that upstream out-of-date datasets are built.
  • Click Build Dataset.
../../_images/tshirt-etl-build-01.png

Now, we could propagate the new columns by running the Prepare recipe, but an alternative for larger flows is to use the schema propagation tool.

  • Right-click on customers_orders_joined and select Propagate schema across Flow from here.
../../_images/tshirt-etl-schema-01.png

This opens the Schema propagation dialog in the lower left corner of the Flow.

  • Click Start. The tool determines that the schema for customers_labeled needs an update.
  • Click on the Prepare recipe, now colored red, and select Review Changes.
  • Click Update Schema.

The new columns have now been propagated to customers_labeled, and you can close the schema propagation tool.

../../_images/tshirt-etl-schema-02.png

Build the Final Dataset

We’re almost there!

  • Select the empty customers_labeled dataset and choose Actions > Build.
  • Choose to do a Recursive build so that upstream out-of-date datasets are built.
  • Click Build Dataset.
../../_images/tshirt-etl-build-02.png

The final dataset in the Flow now contains the pivoted columns and is ready for sharing and further analysis.

../../_images/tshirt-etl-customers-labeled-01.png

At this point, the dataset raises more questions than it answers. For example, applying a descending sort to a column like 2017_rank_min suggests a few interesting avenues for further exploration. Do the largest customers in 2017 tend to be the top customers in previous years? Or is there considerable variation among top customers from year to year? Moreover, in 2017, certain ranks, like 4 and 10, are noticeably absent. We may wish to revisit the previous join conditions if important data is missing.
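
If you want to explore these questions in code, a quick way is to load the final dataset into a notebook inside the project and inspect it with pandas. The sketch below assumes columns like customerID, 2016_rank_min, and 2017_rank_min exist in customers_labeled; adjust the names to match your own data.

    import dataiku

    # Read the final dataset into a pandas DataFrame (inside a DSS notebook).
    customers_labeled = dataiku.Dataset("customers_labeled").get_dataframe()

    # Top 2017 customers, with their 2016 rank alongside for comparison.
    top_2017 = customers_labeled.sort_values("2017_rank_min").head(10)
    print(top_2017[["customerID", "2016_rank_min", "2017_rank_min"]])

    # Which 2017 ranks are missing entirely (e.g. 4 and 10 above)?
    present = set(customers_labeled["2017_rank_min"].dropna().astype(int))
    missing = sorted(set(range(1, max(present) + 1)) - present)
    print("Missing 2017 ranks:", missing)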

What’s Next

  • Now that you’ve created a data pipeline in the Flow, you might want to check out a use case to put those skills to use.
  • Find more detailed tutorials on visual data preparation techniques, such as the window and pivot recipes, for manipulating data without code.