Python in Dataiku DSS

Dataiku allows you to seamlessly integrate Python code and visual recipes in a flow.

In this tutorial, we will show you how to:

  • Integrate Python as part of your data pipeline through code recipes
  • Use Jupyter notebooks to prototype and test code
  • Transfer a Dataiku dataset into a pandas dataframe and back, using the dataiku Python package

We will work with the fictional retailer Haiku T-Shirt’s data.

Prerequisites

This tutorial assumes that you are familiar with the Dataiku Basics tutorials.

Create Your Project

The first step is to create a new Dataiku DSS Project. From the Dataiku homepage, click +New Project > DSS Tutorials > Code > Tutorial: Python in Dataiku DSS. From the project homepage, click Go to Flow.

../../_images/tshirt-python-flow-01.png

In the flow, you see the Haiku T-Shirt orders and customer data uploaded into Dataiku DSS. Further, the customer data has been prepared with a visual Prepare recipe.

Your First Python Recipe

Our current goal is to group past orders by customer, aggregating their past interactions. In Tutorial: Basics, we accomplished this with a Group visual recipe, but it can also be easily accomplished with Python code.

With the orders dataset selected, choose Actions > Code Recipes > Python. Add a new output dataset named orders_by_customer. Click Create Recipe.

The recipe form is now populated with the following code, which reads the orders dataset into a pandas dataframe named orders_df, passes it unchanged to a new dataframe named orders_by_customer_df, and writes that new dataframe out to the orders_by_customer dataset.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
orders = dataiku.Dataset("orders")
orders_df = orders.get_dataframe()

# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
# NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc.

orders_by_customer_df = orders_df # For this sample code, simply copy input to output

# Write recipe outputs
orders_by_customer = dataiku.Dataset("orders_by_customer")
orders_by_customer.write_with_schema(orders_by_customer_df)

As the commented TODO says, we’ll need to provide the code that aggregates the orders by customer. Dataiku provides a number of code samples to help get us started.

../../_images/tshirt-python-recipe-codesamples-01.png

In this case, we’ll replace the line that defines orders_by_customer_df with the following code.

orders_by_customer_df = orders_df.assign(total=orders_df.tshirt_price*orders_df.tshirt_quantity
       ).groupby(by="customer_id"
                ).agg({"pages_visited":"mean",
                       "total":"sum"})

This creates a dataframe with rows grouped by customer_id. For each customer, we’ve computed the average number of pages on the Haiku T-shirt website visited by the customer during orders, and the sum total of the value of orders made by the customer, where the value of each order is the price of each t-shirt times the number of t-shirts purchased.

Now run the recipe, and when it completes, explore the output dataset. Interestingly, the output dataset does not contain the customer_id column, but we’ll need it later in order to join the customers and grouped orders data.

../../_images/tshirt-python-orders_by_customer-01.png

To diagnose the issue, in the orders_by_customer dataset click Parent recipe to reopen the Python recipe, and then click Edit in Notebook. This opens a Jupyter notebook with the recipe code, where we can interactively test the code.

The recipe code begins in a single cell. Split the cell so that the code to write recipe outputs is in a separate cell. Next, add a cell between the two existing cells and put the following code in it.

orders_by_customer_df.head()

Run the first two cells. The output shows that the orders_by_customer dataframe has the customer_id information; however, the dataframe has a hierarchical index.

../../_images/tshirt-python-orders-notebook-01.png

In order to flatten the index, add .reset_index() to the code that defines the dataframe so that it looks like the following.

orders_by_customer_df = orders_df.assign(total=orders_df.tshirt_price*orders_df.tshirt_quantity
       ).groupby(by="customer_id"
                ).agg({"pages_visited":"mean",
                       "total":"sum"}).reset_index()

Re-run the first two cells to see how the dataframe has changed, then click Save back to recipe and run the recipe again. Now the output dataset contains a customer_id column.

../../_images/tshirt-python-orders_by_customer-02.png

Explore with a Python Notebook

Previously, we started with a Python recipe because we had a specific goal of transforming the orders dataset. Alternatively, we can explore a dataset using a notebook.

In the Flow, select the customers_stacked_prepared dataset and click Lab > New Code Notebook > Python. We’ll read the dataset using Pandas; click Create.

../../_images/tshirt-python-customers-create-notebook-01.png

The notebook is automatically populated with four cells.

../../_images/tshirt-python-customers-notebook-01.png

The first two cells import various useful packages. The first cell uses the built-in magics to import the numpy and matplotlib packages.

%pylab inline
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd

The third cell reads the customers_stacked_prepared dataset into a pandas dataframe named df.

# Read the dataset as a Pandas dataframe in memory
# Note: here, we only read the first 100K rows. Other sampling options are available
dataset_customers_stacked_prepared = dataiku.Dataset("customers_stacked_prepared")
df = dataset_customers_stacked_prepared.get_dataframe(limit=100000)

The fourth cell audits the columns in the dataframe to provide information on the type of data in the column and the number of unique and missing values.

# Get some simple descriptive statistics
pdu.audit(df)

Run each of the cells in order to produce the output. The notebook also now has the dataframe df ready in memory.

../../_images/tshirt-python-customers-notebook-02.png

As in the Python recipe, Dataiku provides code samples to help us get started. As an example, open up the list of code samples and search for Get statistics (distribution).

../../_images/tshirt-python-customers-notebook-samples-01.png

Copy the first code sample my_df["col_0"].value_counts() and paste it into a new cell in the notebook. Edit the code to be relevant to our data:

df["campaign"].value_counts()

Run the cell; it returns the number of customers who are part of the marketing campaign and the number who aren’t. Now we’d like to visualize the effect of campaign on the total amount a customer has spent. Since that information is in the orders_by_customer dataset, we’ll need to read that dataset into a new dataframe:

      dataset_orders = dataiku.Dataset("orders_by_customer")
df_orders = dataset_orders.get_dataframe(limit=100000)

… and join it with the df dataframe. Search the code samples for “join dataframes on columns”, copy the code for On columns with different names to a new notebook cell, and modify it to apply to our data.

df_joined = df.merge(df_orders,left_on="customerID",right_on="customer_id")

Finally, the following code produces a paneled histogram with the bar heights normalized so that it’s easier to compare across values of campaign.

pd.pivot_table(df_joined.reset_index(),
  index='index',
  columns='campaign',
  values=['total']).plot.hist(normed=True,subplots=True)
../../_images/tshirt-python-customers-notebook-histogram-01.png

Recall that the notebook is a lab environment, intended for experimentation, so the Join we performed between the dataframes isn’t reflected in the Flow as part of our data pipeline until we create a recipe.

../../_images/tshirt-python-customers-flow-01.png

From within the notebook, click Create Recipe > Python recipe. It has automatically included the customers_stacked_prepared dataset as an input, but now we’ll want to add orders_by_customer as an input and create a new output dataset called customers_enriched.

../../_images/tshirt-python-customers-create-recipe-01.png

In the resulting recipe, note that:

  • Dataiku adds a number of commented lines, each of which shows the beginning of a notebook cell. This way, if we need to edit the recipe in a notebook again, our existing cells are maintained.
  • Dataiku has added lines for the recipe output to the customers_enriched dataset. However, it cannot know which dataframe we want to output, so in the last line, change pandas_dataframe to df_joined.
../../_images/tshirt-python-customers-recipe-01.png

Run the recipe and see how the Flow is affected.

../../_images/tshirt-python-customers-flow-02.png

What’s next

Congratulations! You’ve taken the first steps on Python integration in Dataiku. As you progress, you’ll find that:

  • Python is everywhere. You can use Python code to create:
    • Bokeh webapps
    • Traditional webapps with custom Python backends
    • Custom models in the Visual ML interface
    • Custom automation scenarios, metrics, and checks
    • Scripts that use the Python API to remotely control Dataiku
    • … and more!
  • Python is extensible. You can create:
    • Code environments to manage package dependencies and versions for your projects
    • Custom Python libraries: reuse code all over the place. Should connect in to the Git-based dev workflow
    • Plugins