This is the first in a series of Dataiku Data Science Studio (DSS) tutorials, designed to provide a hands-on overview of the main Dataiku DSS concepts.
Throughout the tutorials, we will work with a fictional online t-shirt retailer called Haiku T-Shirt, and use their enterprise data to illustrate the basic steps to working with Dataiku.
In this tutorial, we will load Haiku T-Shirt’s order log into Dataiku DSS, perform a cursory review of the data, and apply the screening rules identified in the review to create an updated dataset.
Before jumping into the hands-on portion of the tutorial, you can watch the following video, which outlines the steps.
In this tutorial, we’ll upload a file from your own computer to Dataiku DSS. Start by downloading the orders CSV file.
Create Your Project¶
The first step is to create a new Dataiku DSS Project. From the Dataiku homepage, click +New Project, select DSS Tutorials from the list, and select 101: Basic (Tutorial).
This creates the new project and lands you on the project home page.
- Key concept: project
A Dataiku DSS project is a container for all your work on a particular activity. The project home acts as the command center from which you can see the overall status of a project, view recent activity, and collaborate through comments, tags, and a project to-do list.
Each project has a name and a unique project ID. Project names can be changed, but project IDs cannot.
Create the Dataset¶
Let’s load our first Dataset! In the project home, click +Import your first dataset.
Dataiku DSS lets you connect to a wide variety of data sources. For this tutorial, we’ll upload the orders file you just downloaded from Dataiku’s website. Click on Upload your files.
You are presented with the Upload dialog. Click on Add a file, select the orders.csv file you downloaded, and validate. The file is uploaded.
Let’s now check if Dataiku DSS detected our CSV format correctly by clicking on the Preview button.
Dataiku DSS automatically detects that the file has delimiter-separated values with a comma separator. You can see the data is in a tabular format, with columns (features) and rows (records or observations).
Dataiku DSS provides the default dataset name orders in the field at the top of the screen. Since that’s OK for us, save your work by either hitting the Create button or using the shortcut Ctrl-S. This creates the new dataset and lands you on the Explore page for the orders dataset.
- Key concept: dataset
In Dataiku DSS, a Dataset is any piece of data that you have, and which is of a tabular nature. A CSV file like orders.csv is a dataset. A sheet in an Excel file is also a dataset.
More generally, companies (and people) have systems to store all their data. They can store it in an Excel file, a relational database, or a distributed storage system if they have larger amounts of data.
Most of the time, creating a dataset means that you merely inform Dataiku DSS of how it can access the data. These external or source datasets remember the location of the original data. The data is not copied into DSS. The dataset in DSS is a view of the data in the original system.
The uploaded files dataset we just created is a bit specific, because in that precise case, the data is copied into DSS (since we don’t have another database to host it yet).
For more information about datasets, check out the main Dataiku DSS concepts.
Explore Your Data¶
The Explore page of a dataset provides a tabular view of your data where you can start to examine it.
When looking at or working interactively with your data, Dataiku DSS only shows a sample of it. This sample makes it possible to display your dataset, even if the data itself is huge. You control the size of the sample and the sampling method (should Dataiku DSS simply grab the first records, which is fast but can result in a biased sample, or should it extract a random sample?). To see these settings, click Configure sample, which opens a panel on the left.
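The trade-off between the two sampling methods can be sketched in plain Python; this is only an illustration (the record values are stand-ins, not DSS internals):

```python
import random

records = list(range(100_000))  # stand-in for a large dataset

# "First records" sampling: fast (just read the start of the file),
# but biased if the data is sorted, e.g. by date
head_sample = records[:10_000]

# Random sampling: requires a pass over the data, but is representative
random.seed(42)
random_sample = random.sample(records, 10_000)

# The head sample only ever sees the beginning of the data
print(max(head_sample))    # 9999
print(len(random_sample))  # 10000
```

This is why the head sample is fast but can misrepresent a sorted dataset, while the random sample costs a full scan.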
Storage Type and Meaning¶
Each column is identified by name, and has a storage type that indicates how the original data is stored (a string, an integer, a decimal number, a date…). For a CSV file, all of the storage types are string (because it’s just text).
Dataiku DSS also detects a meaning, in blue, which is a “semantic” type that reflects the nature of the values of the column. Dataiku DSS has many built-in meanings such as integer, US states, IP addresses, emails, etc. Here, Dataiku DSS has inferred that the tshirt_price column values are of meaning Decimal and that the tshirt_quantity column values are of meaning Integer. The meaning of a column affects whether Dataiku DSS considers a value to be valid for that column. Below the meaning is a gauge that indicates the proportion of column values that do not seem to match the meaning (in red) or are completely missing (in gray).
Dataiku DSS detects a meaning of Integer for customer_id, based upon the fact that most of the values of customer_id are integers. The gauge shows red for the few values that do not match this meaning, which allows us to determine whether these values are truly invalid customer IDs or, as is the case here, whether Integer is too restrictive a meaning for customer_id. Click on the meaning and select Text to update it. Now the gauge for customer_id is entirely green.
You can use charts to explore the input dataset; for example, we might want to know how often each type of t-shirt is ordered:
- Click on the Charts tab.
- Drag and drop Count of records as the Y variable.
- Drag and drop tshirt_category as the X variable.
DSS shows a column chart of Count of records by tshirt_category.
The chart reveals that the values of tshirt_category are not consistently recorded. Sometimes black shirt color is recorded as “Black”, and sometimes as “Bl”. Similarly, white shirts are sometimes recorded as “White” and sometimes as “Wh”.
Let’s handle these issues by preparing the data with a recipe.
Prepare Data with a Visual Recipe¶
A video below goes through the steps described in the following section.
- Key concept: recipe
A Dataiku DSS recipe is a set of actions to perform on one or more input datasets, resulting in one or more output datasets. A recipe can be visual or code.
A visual recipe allows a quick and interactive transformation of the input dataset through a number of prepackaged operations available in a visual interface. A code recipe allows a user with coding skills to go beyond visual recipe functionality and take control of the transformation using any supported language (SQL, Python, R, etc.).
Dataiku allows “coders” and “clickers” to seamlessly collaborate on the same project through code and visual recipes.
Click on the Actions button at the top right of the orders dataset and choose Prepare.
The Prepare Recipe allows you to define a series of steps, or actions, to take on the dataset. The types of steps you can add to a Prepare Recipe are wide-ranging and powerful. As part of a Prepare Recipe, you can also reorder columns by dragging and dropping them.
When creating a recipe, you must provide the input dataset (the one with the initial columns and values) and an output dataset (the one with the prepared columns and values). You can also set the value of “Store into” to decide where the output data will live. In this example, the output is written on the local filesystem, but the output could be written to a relational database or a distributed filesystem, if the infrastructure exists. Dataiku DSS provides a default output dataset name of orders_prepared. Click Create recipe.
In order to standardize the coding of tshirt_category, we will recode the values. Click on the column name, which opens a dropdown, and select Analyze. The analyze box opens. It allows you to view a quick summary of the data in the column (for the sample of data that is being displayed by DSS). The analyze box also gives you the ability to perform various data cleansing actions.
Select White T-Shirt M and Wh Tshirt M, and from the “Mass Actions” dropdown choose Merge selected. Choose to replace the values with White T-Shirt M and click Merge. Repeat this process for other categories as needed.
Close the Analyze dialog and see that a “Replace” step has been added to the Prepare Script. You just added the first step to your script.
The “Replace” step affects 517 rows in the sample, and replaces values “Wh Tshirt M” with “White T-Shirt M”, “Bl Tshirt M” with “Black T-Shirt M”, “Wh Tshirt F” with “White T-Shirt F”, and “Bl Tshirt F” with “Black T-Shirt F” of tshirt_category in place. You can create this step explicitly in the script, but the Analyze dialog provides a quick and intuitive shortcut to build the step.
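Conceptually, the “Replace” step is just a value map applied to one column. A rough plain-Python sketch (the dict-based rows and sample values are hypothetical, not how DSS stores data internally):

```python
# Mapping equivalent to the "Replace" step built from the Analyze dialog
category_map = {
    "Wh Tshirt M": "White T-Shirt M",
    "Bl Tshirt M": "Black T-Shirt M",
    "Wh Tshirt F": "White T-Shirt F",
    "Bl Tshirt F": "Black T-Shirt F",
}

rows = [
    {"order_id": "o1", "tshirt_category": "Wh Tshirt M"},
    {"order_id": "o2", "tshirt_category": "White T-Shirt M"},
]

# Replace in place; values not in the map are kept as-is
for row in rows:
    cat = row["tshirt_category"]
    row["tshirt_category"] = category_map.get(cat, cat)

print([r["tshirt_category"] for r in rows])
# ['White T-Shirt M', 'White T-Shirt M']
```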
Key concept: preparation script
When using a preparation recipe, you are building a sequence of actions, or steps, that are registered in the script. Each step in the sequence is called a processor and reflects a single data transformation.
The original data is never modified, but you are visualizing the changes on a sample of your data (10,000 lines by default).
To apply the transformations to your whole dataset and create the output dataset with the cleaned data, you’ll have to Run the recipe as we will see later on.
A preparation script has many benefits:
First, it is like an undo history on steroids: you can modify or delete any step that you added earlier.
Second, it is a history of actions that tells you how a dataset is being modified for future reference.
You will learn more about the power of processors in the Tutorial: From Lab to Flow.
At this point, you will see some values in yellow. This is because you are in the “Step preview” mode. In this mode, you can see what the step changes. The values in yellow are the ones that were modified by our “Replace” step.
If you want to see your data as it will appear after processing, click on the Disable preview (green eye) button in the top bar.
The following video goes through what we just covered.
Now, let’s deal with the order_date. Input dates are unparsed, and need to be processed (“parsed”) if we want to use them. Dates are notoriously difficult to work with, but DSS makes it very easy.
Open the column dropdown and select Parse date. The Smart Date dialog opens, showing the most likely formats for your dates and how the dates would look once parsed, with some sample values from the dataset. In our case, the dates appear to be in yyyy/MM/dd format. Select this format, click OK, and see that a “Parse date” step has been added to the Prepare Script.
By default, “Parse date” creates a new column order_date_parsed. Since we want the parsed date to stay in the order_date column, let’s clear the “Output column” text box in the script step, which has the effect of parsing dates in place.
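Parsing turns the date strings into real date objects. As an illustration outside DSS, the Java-style yyyy/MM/dd pattern corresponds to %Y/%m/%d in Python (the sample values below are hypothetical):

```python
from datetime import datetime

raw_dates = ["2017/01/15", "2017/03/02"]  # hypothetical unparsed values

# DSS's Java-style yyyy/MM/dd pattern is %Y/%m/%d in Python's strptime
parsed = [datetime.strptime(d, "%Y/%m/%d") for d in raw_dates]

# Once parsed, date arithmetic and comparisons become possible
print(parsed[0].year, parsed[0].month, parsed[0].day)  # 2017 1 15
print(parsed[1] > parsed[0])                           # True
```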
Finally, let’s compute the value of each order. The orders dataset includes the number of t-shirts in each order and the price per t-shirt, but we need to compute the total of each order so that we can analyze the value of each order. We are going to use a Formula step for this. DSS formulas are a very powerful expression language to perform calculations, manipulate strings, and much more.
This time, we will not add the step by clicking on a column header, but instead use the processors library which references all 80+ data preparation processors:
- Click the yellow +Add a New Step button in the left bar
- Select Formula (you can search for it)
- Enter total as the name of the new column
- In the expression, type tshirt_price * tshirt_quantity (you can also click Edit to bring up the advanced formula editor, which will autocomplete column names)
- Click anywhere and see the new total column appear
We no longer need the price and quantity columns, so we can delete them.
- Click on the tshirt_price column header
- Choose Delete
- Repeat for tshirt_quantity
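The Formula and Delete steps map onto very simple row operations. A plain-Python sketch, with hypothetical rows standing in for the dataset:

```python
rows = [
    {"order_id": "o1", "tshirt_price": 14.5, "tshirt_quantity": 2},
    {"order_id": "o2", "tshirt_price": 22.0, "tshirt_quantity": 1},
]

for row in rows:
    # Formula step: total = tshirt_price * tshirt_quantity
    row["total"] = row["tshirt_price"] * row["tshirt_quantity"]
    # Delete steps: drop the two input columns
    del row["tshirt_price"], row["tshirt_quantity"]

print(rows[0])  # {'order_id': 'o1', 'total': 29.0}
```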
Recall that the data visible in the recipe is merely a sample, meant to give you immediate visual feedback on the design of your Prepare Script. With the Prepare Script complete, you must now Run the recipe on the whole input dataset. Click Run in the lower left corner of the page. Dataiku DSS uses its own engine to run this recipe, but depending upon your infrastructure and the type of recipe, you can choose where the computation takes place.
You will also be asked to update the schema. A dataset’s schema is a list of the columns, plus their storage type and meaning. When you initially created this recipe, the default output schema was identical to the input schema because there were no steps in the recipe. You have since created the column total and removed the columns tshirt_price and tshirt_quantity, so you need to allow Dataiku DSS to update the schema.
When the job completes, click Explore dataset orders_prepared to visit the output dataset.
Explore the Output Dataset¶
In the Prepare Recipe, DSS automatically selected the storage type of columns according to their detected meanings; thus pages_visited was written as a bigint and total as a double in the output dataset.
A natural question at this point is which categories of t-shirt are most popular? We can begin to answer this with the cleaned data and a simple bar chart. Click the Charts tab. Select total as the Y variable and tshirt_category as the X variable.
The resulting bar chart shows that, on average, orders for tennis shirts are larger than orders for other t-shirts. However, what about overall customer spending?
Open the total dropdown and choose Sum as the aggregation. This changes our perspective. While each order for tennis shirts is relatively valuable, there are relatively few orders for tennis shirts. Hoodies and male t-shirts are more commonly ordered, and the total value of those orders exceeds that of tennis shirts.
The following section on Interactive Statistics requires DSS 7.0.
You can also use the Statistics page of the dataset to perform exploratory data analysis (EDA). The Statistics page allows you to generate statistical reports on your data by creating Worksheets and Cards.
Key concept: Worksheet
In DSS, a Worksheet is a visual summary of EDA tasks. For a dataset, you can create multiple worksheets by clicking the Statistics tab.
The worksheet header consists of a button for creating a new card and menu items that provide options to customize the worksheet. For example, there are options for running the worksheet in a container, specifying how to sample the dataset used in the worksheet, changing the global confidence level for statistical tests, duplicating the worksheet, and so on.
For more information about worksheets, see The Worksheet Interface in the reference documentation.
Within a worksheet, you can create multiple cards by clicking the +New Card button.
Key concept: Card
In DSS, a Card is used to perform a specific EDA task. For example, you can describe your dataset, draw inferences about an underlying population, analyze the effect of dimensionality reduction, and so on.
A worksheet can have many cards, with the cards appearing below the worksheet header. All cards have a configuration menu for editing card settings, viewing the JSON payloads and responses (for the purpose of leveraging the public API), and so on. Some cards also contain multiple sections, with each section having its own configuration menu.
For more information about cards, see Elements of a card in the reference documentation.
Let’s create a worksheet with cards that perform EDA tasks. For example, if we are interested in seeing a side-by-side summary of the orders_prepared dataset for each of the variables pages_visited, tshirt_category, and total, then:
- Click on the Statistics tab, and click +Create Your First Worksheet.
- Select the Univariate analysis box to open the “Univariate analysis” window.
- Select pages_visited, tshirt_category, and total from the first column of the window, and click the “plus” button next to “Variables to describe”.
DSS automatically selects the statistical “Options” (in the third column of the window) that are appropriate for the numerical variables (pages_visited and total) and the categorical variable (tshirt_category). You can deselect any of these options if you so choose.
- Click Create Card.
DSS creates a card with one section for each variable. The type of statistical chart and descriptive statistic in each section depends on whether the variable is categorical or numerical. For example, tshirt_category, a categorical variable, has a bar chart (or categorical histogram), while pages_visited and total each have a numerical histogram and box plot insert. Also, the quantile table is applicable to the numerical variables, while the frequency table is applicable to the categorical variable.
By default, DSS computes worksheet statistics on a sample of the first records in your dataset. You can configure this setting by clicking the drop-down arrow next to Sampling and filtering.
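The kind of per-variable summary a univariate analysis card shows can be approximated with Python’s standard library; the sample values here are hypothetical, chosen only to mirror the card’s numerical and categorical sections:

```python
import statistics
from collections import Counter

# Hypothetical sample of the three variables analyzed in the card
pages_visited = [3, 5, 2, 8, 5, 4]
totals = [29.0, 22.0, 51.5, 14.5, 44.0, 29.0]
categories = ["White T-Shirt M", "Black T-Shirt F", "White T-Shirt M",
              "Hoodie", "Tennis Shirt", "White T-Shirt M"]

# Numerical variables: descriptive statistics and quantile table
print(statistics.mean(pages_visited))      # 4.5
print(statistics.median(totals))           # 29.0
print(statistics.quantiles(totals, n=4))   # quartiles

# Categorical variable: frequency table, like the card's bar chart
print(Counter(categories).most_common(1))  # [('White T-Shirt M', 3)]
```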
We may also be interested in checking whether the total variable follows an exponential distribution. The interactive statistics feature allows you to estimate the parameters of univariate probability distributions using the Fit Distribution card.
- Click the +New Card button from the “Worksheet” window, and then select Fit curves & distributions.
- Select the Fit Distribution card.
- Select total as the “Variable” and Exponential as the “Distribution”.
- Click Create Card.
DSS creates a card that shows the exponential distribution fit to the data. There is also a Q-Q plot that compares the quantiles of the data against the quantiles of the fitted distribution. Observing points far from the identity line suggests that the data could not have been drawn from the exponential distribution.
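To make the fit concrete: the maximum-likelihood estimate of an exponential distribution’s scale parameter is simply the sample mean, and a Q-Q plot pairs the sorted data values with the fitted distribution’s quantiles. A plain-Python sketch with hypothetical totals (DSS’s own estimator is not exposed here; this is just the textbook computation):

```python
import math

# Hypothetical order totals; the exponential MLE scale is the sample mean
totals = [12.0, 7.5, 30.0, 3.2, 18.9, 55.0, 9.1, 22.4]
scale = sum(totals) / len(totals)  # fitted scale = 1/lambda

# Q-Q coordinates: empirical quantiles vs. fitted exponential quantiles,
# using plotting positions p_i = (i + 0.5) / n and the exponential
# quantile function Q(p) = -scale * ln(1 - p)
n = len(totals)
empirical = sorted(totals)
theoretical = [-scale * math.log(1 - (i + 0.5) / n) for i in range(n)]

# Points far from the line y = x argue against an exponential fit
for theo, emp in zip(theoretical, empirical):
    print(f"{theo:8.2f}  {emp:8.2f}")
```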
To learn about the full capabilities in the Statistics page, see the Interactive statistics section of the reference documentation. You can also get some hands-on practice by following the Interactive Statistics and Visualization tutorial.
Group Orders by Customer¶
Our ultimate goal is to understand Haiku T-Shirt’s customers; to that end, we want to group all past orders by customer, aggregating their past interactions.
With the orders_prepared dataset open, look in the upper-right corner for the Actions menu. From this menu, choose Group in the list of Visual recipes. The Group Recipe allows you to aggregate the values of some columns by the values of one or more keys.
Choose to group by customer_id. The default name of the new dataset is not very descriptive, so change it to orders_by_customer. Click Create Recipe.
The Group Recipe has several steps (on the left). The core step is the Group step, where you choose which columns to aggregate and what aggregations you want performed.
Some columns, like order_id, we won’t need in the new dataset. For the others, make the following selections:
- order_date: Min
- pages_visited: Avg
- total: Sum
This will give us the first date at which a customer ordered (the minimum of order_date), the average number of pages visited per order, and the sum of all of the customer’s orders.
In the Output step, you can select to rename the output columns. Let’s rename order_date_min to first_order_date.
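The whole Group recipe (group by customer_id, aggregate with Min/Avg/Sum, rename the date column) can be sketched in plain Python; the rows below are hypothetical:

```python
from collections import defaultdict

orders = [
    {"customer_id": "c1", "order_date": "2017/01/15", "pages_visited": 4, "total": 29.0},
    {"customer_id": "c1", "order_date": "2017/03/02", "pages_visited": 6, "total": 22.0},
    {"customer_id": "c2", "order_date": "2017/02/10", "pages_visited": 3, "total": 51.5},
]

# Group step: bucket rows by the grouping key
groups = defaultdict(list)
for o in orders:
    groups[o["customer_id"]].append(o)

# Aggregations (Min / Avg / Sum), with order_date_min renamed to
# first_order_date as in the Output step. min() works on these date
# strings because yyyy/MM/dd sorts lexicographically.
orders_by_customer = [
    {
        "customer_id": cid,
        "first_order_date": min(o["order_date"] for o in rows),
        "pages_visited_avg": sum(o["pages_visited"] for o in rows) / len(rows),
        "total_sum": sum(o["total"] for o in rows),
    }
    for cid, rows in groups.items()
]

print(orders_by_customer[0])
# {'customer_id': 'c1', 'first_order_date': '2017/01/15',
#  'pages_visited_avg': 5.0, 'total_sum': 51.0}
```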
Click Run to create the new grouped dataset, then Explore the new dataset once the run is complete.
Explore the Flow¶
Click on the flow icon next to the project name at the top of the page to go to the project Flow. The Flow is the visual representation of your project pipeline. You can see the Preparation Recipe, represented by a broom, followed by its output dataset, followed by the Group Recipe, followed by its output dataset.