Data Pipelines

If you have learned how to manipulate data in the Basics and From Lab to Flow tutorials, you’re ready to build a more complex data pipeline.

In this tutorial, you will create a Flow whose output is a dataset that can be shared with other projects or exported outside of Dataiku.

Prerequisites

This tutorial assumes that you have completed Tutorial: From Lab to Flow prior to beginning this one!

Duplicate/Create Your Project

We’re going to start with the Flow as it exists at the end of the Tutorial: From Lab to Flow.

../../_images/tshirt-etl-flow-01.png

In this tutorial, we are going to make changes to the Flow, but we also want to keep the work done to create the existing Flow, so we’re going to make a copy of the project.

  • In the project home, select Actions > Duplicate this project
  • Give the duplicate project a name like Data Pipelines.
  • Click Duplicate.
../../_images/tshirt-duplicate.png

By default, only the input datasets are duplicated into the new project in order to speed up the duplication process. We will rebuild the other datasets as we make changes to the Flow.

../../_images/tshirt-etl-flow-02.png

Alternatively, you can import a Dataiku DSS Project that represents the end state of the Lab to Flow tutorial. From the Dataiku homepage, click +New Project > DSS Tutorials > Basics > Tutorial: Data Pipelines.

Update the Prepare Recipe

In the From Lab to Flow tutorial, we looked at the total purchase history of each customer. In this Flow, we want to break down the purchase history by year. As a first step, we’ll need to update the first Prepare recipe so that it extracts the year from column order_date.

  • Open the Prepare recipe.
  • Click on the order_date column heading and select Extract date components from the dropdown list.
  • In the newly created step in the Script, clear the entries for month and day so that only order_date_year is created.
  • Click Save and accept the schema changes.
../../_images/tshirt-etl-prepare-01.png
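
Conceptually, this step just derives a year column from the order date. Here is a rough pandas sketch of the same transformation, using made-up sample rows; the actual processing happens inside the Prepare recipe, not in code:

    import pandas as pd

    # Made-up stand-in for a few rows of orders_prepared.
    orders = pd.DataFrame({
        "customer_id": ["c1", "c1", "c2"],
        "order_date": ["2015-03-14", "2016-07-02", "2016-11-23"],
    })

    # Same idea as "Extract date components", keeping only the year.
    orders["order_date_year"] = pd.to_datetime(orders["order_date"]).dt.year
    print(orders)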

Create a New Branch

Previously we used a Group recipe, by customer, to take the orders_prepared dataset and create the orders_by_customer dataset. This dataset contains the total purchase history of each customer. We’re going to keep this part of the Flow, and create a new branch by copying the Group recipe and adding order_date_year as a group key.

  • Select the Group recipe, and from the Actions menu, select Copy.
  • Leave orders_prepared as the input dataset.
  • Name the output dataset something like orders_grouped.
  • Click Create Recipe.
../../_images/tshirt-etl-group-01.png
  • In the Group recipe, add order_date_year as a group key.
  • Click Save and accept the schema changes.
../../_images/tshirt-etl-group-02.png
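
The new Group recipe now produces one row per customer per year. A minimal pandas sketch of the same aggregation, assuming a per-order amount column (the column named total below is invented for illustration; customer_id, order_date_year, and total_sum come from the tutorial):

    import pandas as pd

    # Made-up orders_prepared data with the year already extracted.
    orders_prepared = pd.DataFrame({
        "customer_id": ["c1", "c1", "c1", "c2"],
        "order_date_year": [2015, 2016, 2016, 2016],
        "total": [35.0, 20.0, 15.0, 55.0],   # assumed per-order amount column
    })

    # Group by customer AND year, so each row of orders_grouped is one
    # customer's spending in one year.
    orders_grouped = (
        orders_prepared
        .groupby(["customer_id", "order_date_year"], as_index=False)
        .agg(total_sum=("total", "sum"), count=("total", "size"))
    )
    print(orders_grouped)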

Using the Window Recipe to Compute Customer Ranks by Year

We now have a dataset that groups purchases by customer and year. Our next step is to get a ranking of each customer’s spending in each year. We can do this with the Window recipe. With the orders_grouped dataset selected:

  • Choose Actions > Window.
  • Click Create Recipe.

In order to compute a ranking of spending by year, we need to compare each customer’s spending within a given year. In the Window definitions step of the Window recipe:

  • Turn Partitioning Columns On and select order_date_year as the partitioning column. This defines a “window” for each year, and customers within each year will be compared to each other.
  • Turn Order Columns On and select total_sum as the order column. Click the ordering button so that spending is sorted in descending order. This means that the largest spender will be ranked 1, the next largest 2, and so on.
../../_images/tshirt-etl-window-01.png
  • In the Aggregations step, select Rank to compute the rank for each row.
  • Click Save and accept the schema changes.
../../_images/tshirt-etl-window-02.png
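
In pandas terms, the Window recipe’s partition-and-rank corresponds to a grouped descending rank. A rough sketch with made-up data:

    import pandas as pd

    # Made-up orders_grouped data: one row per customer per year.
    orders_grouped = pd.DataFrame({
        "customer_id": ["c1", "c2", "c3", "c1", "c2"],
        "order_date_year": [2015, 2015, 2015, 2016, 2016],
        "total_sum": [35.0, 80.0, 12.0, 20.0, 55.0],
    })

    # Within each year (the partition), rank customers by total_sum in
    # descending order, so the biggest spender of the year gets rank 1.
    orders_grouped["rank"] = (
        orders_grouped
        .groupby("order_date_year")["total_sum"]
        .rank(method="min", ascending=False)
        .astype(int)
    )
    print(orders_grouped.sort_values(["order_date_year", "rank"]))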

Pivot Ranks from Rows into Columns

Each row in the orders_grouped_windows dataset will represent a particular year’s spending for a given customer. We eventually want to join this information with other customer information where there is a single row for each customer. Thus, our next step is to Pivot the dataset so that there is a separate column for each year’s customer rankings. With the orders_grouped_windows dataset selected:

  • Choose Actions > Pivot.
  • Choose order_date_year as the column to Pivot By.
  • Rename the output dataset to something like orders_by_customer_pivoted.
  • Click Create Recipe.
  • Select customer_id as a row identifier.
  • Deselect Count of records under Populate content with.
  • Select rank from the dropdown list and choose min as the aggregation.
../../_images/tshirt-etl-pivot-01.png
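
A rough pandas sketch of what the Pivot recipe produces, again with made-up data; in Dataiku this is configured in the recipe, not written as code, and the actual output column names will differ from the illustrative rank_* names used here:

    import pandas as pd

    # Made-up orders_grouped_windows data: one row per customer per year,
    # already carrying the rank computed by the Window recipe.
    windows = pd.DataFrame({
        "customer_id": ["c1", "c2", "c3", "c1", "c2"],
        "order_date_year": [2015, 2015, 2015, 2016, 2016],
        "rank": [2, 1, 3, 2, 1],
    })

    # One row per customer, one column per year, populated with the min of
    # rank for that customer/year pair.
    pivoted = windows.pivot_table(
        index="customer_id",
        columns="order_date_year",
        values="rank",
        aggfunc="min",
    ).add_prefix("rank_").reset_index()   # prefix added only for readability
    print(pivoted)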

Note that, until this point, we have not built any datasets, because our goal has been clear in each recipe. It wasn’t necessary to see the resulting dataset to move on to the next recipe; all we needed was the schema.

Since the columns created by the Pivot recipe are dependent upon the range of values of order_date_year, Dataiku cannot create a schema for orders_by_customer_pivoted without actually building the dataset.

So we are going to run this recipe, but we need to tell Dataiku to also build all of the upstream datasets.

  • Click on the gear in the Run button.
  • Select Build required dependent datasets as the Build mode.
  • Click OK, then click Run.
../../_images/tshirt-etl-pivot-02.png

Warning

Setting the recipe to build dependent datasets in its Settings means that any time it is run, Dataiku will build the upstream datasets. If we don’t want rebuilding to be a permanent setting, we can change it after running the recipe. Alternatively, we could select the orders_by_customer_pivoted dataset and build it recursively.

Join Pivoted Data to Customer Data

Finally, we’re ready to add these columns to the rest of the customer data.

  • Open the existing Join recipe.
  • Click +Add Input.
  • Select orders_by_customer_pivoted as the new input dataset and click Add Dataset.
../../_images/tshirt-etl-join-01.png

Dataiku automatically detects customerID and customer_id as the correct join keys. There’s no need to change the join type from left join.

  • On the Selected columns step, deselect customer_id from the orders_by_customer_pivoted list.
  • Click Save and accept the schema changes.
../../_images/tshirt-etl-join-02.png
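
For reference, the added join is equivalent to a pandas left merge on the two key columns, dropping the redundant key afterwards (the sample data and the rank_* column names are illustrative):

    import pandas as pd

    # Made-up slice of the existing customer data (left side of the Join).
    customers = pd.DataFrame({
        "customerID": ["c1", "c2", "c4"],
        "campaign": [True, False, True],   # assumed example column
    })

    # Made-up pivoted rankings (right side of the Join).
    pivoted = pd.DataFrame({
        "customer_id": ["c1", "c2", "c3"],
        "rank_2015": [2, 1, 3],
        "rank_2016": [2, 1, None],
    })

    # Left join on customerID = customer_id, then drop the duplicate key,
    # mirroring the deselection on the Selected columns step.
    joined = customers.merge(
        pivoted, how="left", left_on="customerID", right_on="customer_id"
    ).drop(columns="customer_id")
    print(joined)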

Propagate Schema Changes

We are done making changes to recipes in the Flow, but we still need to propagate the new columns to the end of the Flow. If we were to build the customers_labeled dataset now, it would not include the columns created by the Pivot recipe.

In order to update the schema of customers_labeled:

  • Select the customers_joined dataset and select Actions > Build.
  • Choose to do a Recursive build so that upstream out-of-date datasets are built.
  • Click Build Dataset.
../../_images/tshirt-etl-build-01.png

Now, we could propagate the new columns by running the Prepare recipe, but an alternative for larger flows is to use the schema propagation tool.

  • Right-click on customers_joined and select Tools > Start schema propagation from here.
../../_images/tshirt-etl-schema-01.png

This opens the Schema propagation dialog in the lower left corner of the Flow.

  • Click Start. The tool determines that the schema for customers_labeled needs an update.
  • Click on the Prepare recipe, now colored red, and select Review Changes.
  • Click Update Schema.

The new columns have now been propagated to customers_labeled. You can now close the schema propagation tool.

../../_images/tshirt-etl-schema-02.png

Build the Final Dataset

We’re almost there!

  • Select the customers_labeled dataset and select Actions > Build.
  • Choose to do a Recursive build so that upstream out-of-date datasets are built.
  • Click Build Dataset.
../../_images/tshirt-etl-build-02.png

The final dataset in the Flow now contains the pivoted columns and is ready for sharing and further analysis.

../../_images/tshirt-etl-customers-labeled-01.png

What’s Next

  • Now that you’ve created a data pipeline in the Flow, you might want to check out a use case to put those skills to use.
  • There are also a few more detailed tutorials on using the Pivot recipe to pivot values and tables as well as a deeper dive into the Window recipe.