How to Create a Custom Dataset

Custom datasets work in a similar way to custom recipes; however, custom datasets can only be written in Python.

  • You write Python code that reads rows from the data source, or writes rows to it
  • You write a JSON descriptor that declares the configuration parameters
  • The user is shown a visual interface in which they can enter the dataset’s configuration parameters

The dataset then behaves like any other DSS dataset: you can, for instance, run a preparation recipe on it. A typical use of custom datasets is connecting to external data sources such as REST APIs.

For our custom dataset, we’re going to read the Dataiku RaaS (Randomness as a Service) REST API. This API returns random numbers, so we want to use it to create a new type of dataset.

To use the API, we send a GET request to http://raas.dataiku.com/api.php. For example, visit: http://raas.dataiku.com/api.php?nb=5&max=200&apiKey=secret. This returns 5 random numbers between 0 and 200.
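
As a minimal sketch of how such a URL is put together, the snippet below builds the same query string with Python's standard library. The helper name build_raas_url is hypothetical and used for illustration only; no request is actually sent.

```python
from urllib.parse import urlencode

RAAS_ENDPOINT = "http://raas.dataiku.com/api.php"  # endpoint given in the text


def build_raas_url(nb, max_value, api_key):
    """Return the GET URL for the RaaS API (hypothetical helper, for illustration)."""
    # Dicts preserve insertion order, so the parameters appear as nb, max, apiKey.
    query = urlencode({"nb": nb, "max": max_value, "apiKey": api_key})
    return f"{RAAS_ENDPOINT}?{query}"


url = build_raas_url(5, 200, "secret")
print(url)
```

Building the query string from a dictionary rather than by hand means that special characters in a parameter value are escaped automatically.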

Create the custom dataset

Custom datasets are a bit more challenging to write than custom recipes since we can’t start from a regular dataset, but must build the custom dataset from scratch.

  • Go to the plugin developer page
  • Create a new dev plugin (or reuse the previous one)
  • In the dev plugin page, click on +Add Component
    • Choose Dataset
    • Select Python as the language
    • Give the new dataset type an id, like raas, and click Add
  • Use the editor to modify files.

We’ll start with the connector.json file. Our custom dataset needs the user to input 3 parameters:

  • Number of random numbers
  • Range
  • API Key

So let’s create our params array:

"params": [
    {
        "name": "apiKey",
        "label": "RAAS API Key",
        "type": "STRING",
        "description" : "You can enter more help here"
    },
    {
        "name": "nb",
        "label": "Number of random numbers",
        "type": "INT",
        "defaultValue" : 10 /* You can have the data prefilled */
    },
    {
        "name": "max",
        "label": "Max value",
        "type": "INT"
    }
]

For the Python part, we need to write a Python class.

In the constructor, we’ll retrieve the parameters:

# perform some more initialization
self.key = self.config["apiKey"]
self.nb = int(self.config["nb"])
self.max = int(self.config["max"])
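
The constructor above trusts whatever the user typed. As a hedged sketch, the parameters could be checked before use, assuming the configuration arrives as a plain dictionary keyed by the parameter names from the descriptor. The function read_raas_config and its error messages are hypothetical, not part of the DSS API.

```python
def read_raas_config(config):
    """Extract and validate the three parameters (hypothetical helper)."""
    key = config.get("apiKey")
    if not key:
        raise ValueError("The 'apiKey' parameter is required")

    # Fall back to 10, mirroring the defaultValue declared for nb in the descriptor.
    nb = int(config.get("nb", 10))
    if nb <= 0:
        raise ValueError("'nb' must be a positive integer")

    # max has no default in the descriptor, so it must be provided.
    raw_max = config.get("max")
    if raw_max is None:
        raise ValueError("The 'max' parameter is required")

    return key, nb, int(raw_max)


print(read_raas_config({"apiKey": "secret", "nb": "5", "max": 200}))
```

Failing early with a clear message is usually easier to debug than an error raised later, in the middle of the API call.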

We know the schema of our dataset in advance: it has a single column named “random” containing integers. So, in get_read_schema, let’s return this schema:

def get_read_schema(self):
    return {
        "columns" : [
            { "name" : "random", "type" : "int" }
        ]
    }
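
To make the link between this schema and the rows concrete, here is a small, self-contained sketch that checks whether a row dictionary fits a schema of this shape. The function row_matches_schema and the PYTHON_TYPES mapping are assumptions made for illustration; DSS does not require such a check.

```python
SCHEMA = {"columns": [{"name": "random", "type": "int"}]}

# Assumed mapping from schema type names to Python types; only "int" appears in the text.
PYTHON_TYPES = {"int": int}


def row_matches_schema(row, schema):
    """Return True if the row has exactly the schema's columns with matching types."""
    columns = schema["columns"]
    if set(row) != {column["name"] for column in columns}:
        return False
    return all(
        isinstance(row[column["name"]], PYTHON_TYPES[column["type"]])
        for column in columns
    )


print(row_matches_schema({"random": 42}, SCHEMA))
print(row_matches_schema({"random": "42"}, SCHEMA))
```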

Finally, the core of the connector is the generate_rows method. This method is a generator over dictionaries. Each yield in the generator becomes a row in the dataset.

If you don’t know about generators in Python, you can have a look at https://wiki.python.org/moin/Generators
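
As a quick illustration of the idea, here is a standalone generator that yields one dictionary per number. It is a simplified stand-in for the method described above, not the connector itself, and the input list is made up.

```python
def generate_rows(numbers):
    """Yield one row (a dictionary) per input number."""
    for number in numbers:
        # Execution pauses at each yield until the caller asks for the next row.
        yield {"random": number}


rows = generate_rows([17, 4, 133])
first = next(rows)      # runs the generator up to the first yield
remaining = list(rows)  # consumes the rows that are left
print(first, remaining)
```

Because rows are produced one at a time, the caller can stop reading early without the generator having built the full list in memory.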

We’ll be using the requests library to perform the API calls.

The final code of our dataset is:

from dataiku.connector import Connector
import requests

class MyConnector(Connector):

    def __init__(self, config):
        Connector.__init__(self, config)  # pass the parameters to the base class

        self.key = self.config["apiKey"]
        self.nb = int(self.config["nb"])
        self.max = int(self.config["max"])

    def get_read_schema(self):
        return {
            "columns" : [
                { "name" : "random", "type" : "int" }
            ]
        }

    def generate_rows(self, dataset_schema=None, dataset_partitioning=None,
                            partition_id=None, records_limit = -1):

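        # Query the API; requests encodes the parameters into the query string
        req = requests.get("http://raas.dataiku.com/api.php", params = {
            "apiKey": self.key,
            "nb":self.nb,
            "max":self.max
        })
        # Fail early on an HTTP error instead of trying to parse an error page
        req.raise_for_status()

        # The response body is parsed as a JSON array: yield one row per number
        array = req.json()
        for random_number in array:
            yield { "random"  : random_number}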

(The other methods are not required at this point, so we removed them.)
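
Note that generate_rows receives a records_limit argument that the code above does not use. The following is a minimal sketch of one way to honor it, assuming that a negative value such as the default -1 means "no limit". The helper limit_rows and the fake_rows generator are hypothetical and stand in for the real API call.

```python
from itertools import islice


def limit_rows(rows, records_limit=-1):
    """Stop after records_limit rows; a negative limit is assumed to mean no limit."""
    if records_limit is None or records_limit < 0:
        return iter(rows)
    return islice(rows, records_limit)


def fake_rows():
    """Stand-in for the rows built from the API response (made-up numbers)."""
    for number in [12, 7, 99, 3]:
        yield {"random": number}


limited = list(limit_rows(fake_rows(), records_limit=2))
unlimited = list(limit_rows(fake_rows()))
print(limited)
print(len(unlimited))
```

Inside a connector, the same idea would amount to wrapping the loop over the response in such a helper, so that a preview asking for only a few rows does not iterate over more than it needs.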

Use the plugin

In the new dataset menu, you can now see your new dataset (try reloading your browser if this is not the case). You are presented with a UI to set the 3 required parameters.

  • Set “secret” as API Key
  • Set any values for nb and max
  • Click Test
  • Your random numbers appear!

You can now click Create to create a dataset of your new type, and then use it like any other Dataiku DSS dataset.

About caching

There is no specific caching mechanism in custom datasets. Custom datasets are often used to access external APIs, and you may not want to perform another call on the API each time Dataiku DSS needs to read the input dataset.

It is therefore highly recommended that the first thing you do with a custom dataset is to use either a Prepare or a Sync recipe to make a cached copy in a first-party data store.