How to Create a Custom Recipe

By writing a custom recipe, you can add a new kind of recipe to Dataiku DSS. The idea is:

  • You write the core of the recipe in Python or R code
  • You write a JSON descriptor that declares:
    • The kinds of inputs and outputs of the recipe
    • The available configuration parameters
  • In the Python or R code of the recipe, you use a specific API to retrieve the inputs, outputs and parameters (i.e., the “instantiation parameters”) of the recipe

To the user, the custom recipe is a visual recipe in which they can enter the declared configuration parameters and run the recipe.

Let’s write a custom recipe that computes pairwise correlations (i.e., correlations between the values in pairs of columns). Such a recipe could be used, for example, to discover that the price of a car has a strong negative correlation with the mileage.

We will start by writing a Python recipe in the Flow of the tutorial project, and then make it “reusable”.

Create Your Project

The first step is to create a new Dataiku DSS Project. From the Dataiku homepage, click +New Project > DSS Tutorials > Code > Your first plugin (Tutorial).

This includes the example dataset wine_quality.

Create the Base Recipe

Create a Python recipe with the wine_quality dataset as an input and a new wine_correlation dataset as the output.

The recipe code should look like the following:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np

# Read the input
input_dataset = dataiku.Dataset("wine_quality")
df = input_dataset.get_dataframe()
column_names = df.columns

# We'll only compute correlations on numerical columns
# So extract all pairs of names of numerical columns
pairs = []
for i in range(0, len(column_names)):
    for j in range(i + 1, len(column_names)):
        col1 = column_names[i]
        col2 = column_names[j]
        if df[col1].dtype == "float64" and \
           df[col2].dtype == "float64":
            pairs.append((col1, col2))

# Compute the correlation for each pair, and write a
# row in the output array
output = []
for pair in pairs:
    # The off-diagonal cell of the 2x2 correlation matrix is the pair's correlation
    corr = df[[pair[0], pair[1]]].corr().iloc[0, 1]
    output.append({"col0" : pair[0],
                   "col1" : pair[1],
                   "corr" :  corr})

# Write the output to the output dataset
output_dataset =  dataiku.Dataset("wine_correlation")
output_dataset.write_with_schema(pd.DataFrame(output))

Run the recipe and look at the output: a dataset with 3 columns (col0, col1, corr) and one row per pair of numerical input columns.
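
The nested loops above can also be written with a single correlation matrix. The following is a minimal sketch of that alternative, run outside Dataiku DSS on a small made-up DataFrame (the column names and values are assumptions for illustration, not part of the wine_quality dataset). It is not the code the tutorial continues with, only a way to see what the output rows contain.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the input dataset (values are made up)
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],      # perfectly correlated with "a"
    "c": [4.0, 3.0, 2.0, 1.0],      # perfectly anti-correlated with "a"
    "label": ["w", "x", "y", "z"],  # non-numerical, so it is ignored
})

# Keep only float64 columns, as the recipe does
numeric = df.select_dtypes(include=["float64"])
corr_matrix = numeric.corr()

# Upper-triangle indices (k=1 skips the diagonal), so each pair appears once
rows, cols = np.triu_indices(len(corr_matrix.columns), k=1)
names = corr_matrix.columns.to_numpy()

result = pd.DataFrame({
    "col0": names[rows],
    "col1": names[cols],
    "corr": corr_matrix.to_numpy()[rows, cols],
})
print(result)
```

With three numerical columns, the result has three rows, one per pair, in the same col0, col1, corr layout as the recipe output.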

Convert It to a Custom Recipe

To make this Python recipe a custom recipe:

  • Go to the Advanced tab of the Python recipe
  • Click Convert to custom recipe
  • Select the dev plugin to add the custom recipe to
  • Choose to place it under the folder compute_correlation; the name is generic because we expect to use this recipe beyond the wine dataset
  • Click Convert
  • Dataiku DSS generates the custom recipe files and suggests we edit them now in the Plugin Developer. Let’s do that now.

For the rest of the tutorial, we’ll tweak the generated files.

Edit Definitions in recipe.json

First, let’s have a look at the recipe.json file. The most important things to change are the inputRoles and outputRoles arrays. Roles allow you to associate one or more datasets to each kind of input and output of the recipe.

Our recipe is a simple one: it has one input role with exactly 1 dataset, and one output role with exactly 1 dataset. Edit your JSON to look like:

"inputRoles" : [
    {
        "name": "input",
        "label": "Input dataset",
        "description": "The dataset containing the raw data from which we'll compute correlations.",
        "arity": "UNARY",
        "required": true,
        "acceptsDataset": true
    }
],

"outputRoles" : [
    {
        "name": "main_output",
        "label": "Output dataset",
        "description": "The dataset containing the correlations.",
        "arity": "UNARY",
        "required": true,
        "acceptsDataset": true
    }
],

We’d like to allow users of this plugin to be able to focus on “strong” correlations (i.e., values that are closest to +1 or -1).

We can specify a threshold parameter that can be set in the recipe dialog by editing the params section of recipe.json:

"params": [
    {
        "name": "threshold",
        "label" : "Threshold for showing a correlation",
        "type": "DOUBLE",
        "defaultValue" : 0.5,
        "description": "Correlations whose absolute value is below the threshold will not appear in the output dataset",
        "mandatory" : true
    }
],
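
Because role and parameter names in recipe.json are later referenced by name from recipe.py, a quick consistency check can catch typos early. The sketch below is an assumption-laden illustration: check_descriptor is a hypothetical helper, not part of Dataiku DSS, and it parses a trimmed, comment-free copy of the sections above (the generated file may contain comments, which a strict JSON parser would reject).

```python
import json

# Trimmed copy of the sections shown above, wrapped in one JSON object
descriptor_text = """
{
    "inputRoles": [
        {"name": "input", "arity": "UNARY", "required": true, "acceptsDataset": true}
    ],
    "outputRoles": [
        {"name": "main_output", "arity": "UNARY", "required": true, "acceptsDataset": true}
    ],
    "params": [
        {"name": "threshold", "type": "DOUBLE", "defaultValue": 0.5, "mandatory": true}
    ]
}
"""

def check_descriptor(descriptor):
    """Hypothetical helper: return a list of problems found in a descriptor dict."""
    problems = []
    for section in ("inputRoles", "outputRoles", "params"):
        entries = descriptor.get(section)
        if not isinstance(entries, list):
            problems.append(f"{section} should be an array")
            continue
        names = [entry.get("name") for entry in entries]
        if None in names:
            problems.append(f"{section} has an entry without a name")
        if len(names) != len(set(names)):
            problems.append(f"{section} has duplicate names")
    return problems

descriptor = json.loads(descriptor_text)
problems = check_descriptor(descriptor)
print(problems or "descriptor looks consistent")
```

The check only covers the three sections edited in this tutorial; it says nothing about the other keys of the generated file.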

Edit Code in recipe.py

Now let’s edit recipe.py. The default contents include some generic starter code for referencing roles and parameters, the code from your Python recipe, and some comments that explain how to finish creating your custom recipe. In the end, your recipe.py should start with code for retrieving datasets and parameters like:

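import dataiku
from dataiku.customrecipe import (get_input_names_for_role,
                                  get_output_names_for_role,
                                  get_recipe_config)

# Retrieve array of dataset names from 'input' role, then create datasets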
input_names = get_input_names_for_role('input')
input_datasets = [dataiku.Dataset(name) for name in input_names]

# For outputs, the process is the same:
output_names = get_output_names_for_role('main_output')
output_datasets = [dataiku.Dataset(name) for name in output_names]

# Retrieve parameter values from the map of parameters
threshold = get_recipe_config()['threshold']
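
Since the threshold is typed by the user in the recipe dialog, it can be worth validating it before use. The following is a minimal sketch under stated assumptions: read_threshold is a hypothetical helper, the 0 to 1 range is chosen because the recipe compares the threshold with the absolute value of a correlation, and a plain dict stands in for the value returned by get_recipe_config() so the snippet runs outside Dataiku DSS.

```python
def read_threshold(config, default=0.5):
    """Hypothetical helper: read and validate the 'threshold' parameter."""
    raw = config.get("threshold", default)
    if raw is None:
        raw = default
    threshold = float(raw)
    # An absolute correlation lies in [0, 1], so other thresholds are rejected
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be between 0 and 1, got {threshold}")
    return threshold

# A plain dict stands in for the recipe configuration
threshold = read_threshold({"threshold": 0.7})
print(threshold)
```

In the recipe itself, the same helper could be called with get_recipe_config() as its argument.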

The portion of your original recipe that reads inputs needs to be updated to refer to the datasets created from the input roles, like:

# Read the input
input_dataset = input_datasets[0]
df = input_dataset.get_dataframe()
column_names = df.columns

The portion of your original recipe that computes the correlations should be updated to include the threshold to filter out the weak correlations:

for pair in pairs:
    corr = df[[pair[0], pair[1]]].corr().iloc[0, 1]
    # Keep the pair only if its absolute correlation exceeds the threshold
    if np.abs(corr) > threshold:
        output.append({"col0" : pair[0],
                       "col1" : pair[1],
                       "corr" :  corr})
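
To make the behavior of this filter concrete, here is a small sketch applied to made-up rows (the values are illustrative assumptions, and keep_strong is a hypothetical helper rather than part of the generated recipe). It uses the same strict comparison on the absolute value, so negative correlations are kept when strong, a correlation exactly equal to the threshold is dropped, and a NaN correlation is dropped because any comparison with NaN is false.

```python
def keep_strong(rows, threshold):
    """Hypothetical helper: keep rows whose |corr| is strictly above threshold."""
    return [row for row in rows if abs(row["corr"]) > threshold]

# Made-up correlation rows in the recipe's output layout
rows = [
    {"col0": "a", "col1": "b", "corr": 0.9},
    {"col0": "a", "col1": "c", "corr": -0.8},          # strong negative: kept
    {"col0": "b", "col1": "c", "corr": 0.5},           # equal to threshold: dropped
    {"col0": "b", "col1": "d", "corr": 0.1},           # weak: dropped
    {"col0": "c", "col1": "d", "corr": float("nan")},  # undefined: dropped
]

strong = keep_strong(rows, threshold=0.5)
for row in strong:
    print(row)
```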

The portion of your original recipe that writes the output datasets also needs to be updated to refer to the datasets created from the output roles, like:

# Write the output to the output dataset
output_dataset =  output_datasets[0]
output_dataset.write_with_schema(pd.DataFrame(output))

Verify that the hard-coded dataset names wine_quality and wine_correlation no longer appear in your recipe. In general, the rest of recipe.py can be left as-is.

Use Your Custom Recipe in the Flow

Note

After editing recipe.json for a custom recipe, you must do the following:

  • Click Reload
  • Reload the Dataiku DSS page in your browser

When modifying the recipe.py file, you don’t need to reload anything. Simply run the recipe again.

  • Go to the Flow
  • Click + Recipe and select your plugin recipe. The usual recipe creation tab appears.
  • Select the wine_quality input dataset
  • Create a new output dataset
  • Run the recipe, editing the default threshold value if you desire
  • Congratulations, you have created your first custom visual recipe!