SQL in Dataiku DSS¶
Dataiku allows you to seamlessly integrate SQL code and visual recipes in a flow. Where possible, processing is performed in-database to minimize movement of large amounts of data.
In this tutorial, we will show you how to:
- Upload local data into an SQL database through Dataiku
- Process this data In-database using SQL recipes
We will work with the fictional retailer Haiku T-Shirt’s data.
This tutorial assumes that you have an SQL database installed and have established a connection between Dataiku and the SQL database.
This tutorial was written while connected to PostgreSQL – see Connecting to PostgreSQL in Dataiku for details – but you should be able to follow along using any SQL database of your choice.
In particular, this tutorial assumes that:
- the PostgreSQL connection is named PostgreSQL_tshirt
- the database is named dku
- the database contains a schema named dku_tshirt, where Dataiku will write managed datasets
- Dataiku will prefix table names that it manages in the database with the project key, which is DKU_TUTORIAL_SQL
You may need to alter some of the instructions and code in this tutorial to reflect your PostgreSQL connection specifications.
Create Your Project¶
The first step is to create a new project. From the Dataiku homepage, click +New Project > DSS Tutorials > Code > SQL in Dataiku DSS (Tutorial). Click on Go to Flow.
In the Flow, you see the Haiku T-Shirt orders and customer data uploaded into Dataiku DSS. Further, the labeled and unlabeled customer data is stacked into a single dataset.
Multiple Ways to Move Data to Your Database¶
First, we want to get the uploaded data into the SQL database. We’ll highlight a couple ways to do this.
Let’s start by opening the orders dataset. Its data consists of a CSV file that has been uploaded into Dataiku DSS. CSV files do not contain any kind of typing information. Thus, the columns of this dataset are for the moment not specifically typed and Dataiku DSS assumes by default that all columns have string storage type. When we sync this dataset to the database, we want pages_visited, tshirt_price, and tshirt_quantity to have integer, double, and integer storage, respectively.
The Sync recipe maps the storage type in the input dataset to a similar type in the output database. So let us first set the column types in the input dataset. One way to handle this is to infer the storage types from the data and save the updated schema. Note that the type inference is performed against a sample of the data, and you should check that the inferred types correspond to your actual data:
- Open the Settings of the orders dataset
- In the Schema screen, click Check Now to confirm the schema is consistent, then click Infer types from data, confirm and then save your dataset.
With the updated storage types, we can open the Actions menu and select Sync to add a new Sync recipe to the flow. We’ll leave the default dataset name of orders_copy. Choose to store the new dataset into the PostgreSQL_tshirt connection. Create and run the recipe.
When the job completes, note that the columns in the new dataset have the desired storage types.
The Sync recipe is convenient when you only need to alter the storage of a dataset’s columns. When you have some more serious preprocessing to do prior to loading your local data into a database, you can use a Prepare recipe.
From the Flow, select the customers_stacked dataset. Choose Prepare from the list of visual recipes. Leave customers_stacked_prepared as the default dataset name. Choose to store the new dataset into the PostgreSQL_tshirt connection. Click Create Recipe.
In the Prepare recipe, we will:
- Parse birthdate
- Classify the user_agent column, keeping the resulting user_agent_brand and user_agent_os columns
- Resolve the GeoIP of the ip_address column, keeping the resulting ip_address_country and ip_address_geopoint columns
If you need more details on these operations, refer to the Tutorial: From Lab to Flow.
Click Run. The Prepare recipe operations are run in the DSS engine, and the data are then pushed into PostgreSQL. The Prepare recipe infers the storage type of each column, so typically you don’t have to make any manual adjustments.
Architecture model for databases
By choosing to store the output of our recipes in the database, we have created new datasets in DSS of type “PostgreSQL” (which is one kind of SQL dataset).
A SQL dataset in DSS is a pointer to a database table, with the same concepts of rows and columns, and a storage type for each column. The connection used by the dataset determines on which database and schema the table is created. Dataiku DSS sets the name of the table (by adding some prefix to the dataset name) and sets the types of the columns for you so that you don’t have to write CREATE TABLE statements.
Create a New Dataset with a SQL Query Recipe¶
Now that we have datasets that correspond to tables in a database, we want all our processing to be done in-database. One way to do this is to write our own SQL code and execute it in a code recipe.
With the orders_copy dataset selected, select the SQL code recipe from the Actions menu. Choose the SQL Query type, and click Set to define the output dataset. Name it orders_by_customer; by default it will choose to store data in PostgreSQL. Click Create Dataset and Create Recipe.
The recipe form is now populated with the following code, which selects all columns from the DKU_TUTORIAL_SQL_orders_copy table in the dku_shirt schema.
SELECT * FROM "dku_tshirt"."DKU_TUTORIAL_SQL_orders_copy"
Our current goal with the past orders is to group them by customer, aggregating their past interactions. In the Basics Tutorial, we accomplished this with a Group visual recipe, but it can also be easily accomplished with SQL code.
Edit the query to read the following:
SELECT customer_id, AVG(pages_visited) AS pages_visited_avg, SUM(tshirt_price*tshirt_quantity) AS total FROM "dku_tshirt"."DKU_TUTORIAL_SQL_orders_copy" GROUP BY customer_id;
The query result has the following columns:
- pages_visited_avg, representing the average number of pages on the Haiku T-shirt website visited by the customer during orders
- total, representing the sum total of the value of orders made by the customer, where the value of each order is the price of each t-shirt multiplied by the number of t-shirts purchased
Click on Validate. Dataiku parses your query, verifies that the syntax is correct, and computes the output dataset column types. Update the schema when prompted.
Click Run. DSS starts a job that creates the output table in the PostgreSQL database and fills it with the results of your query. Go to the Flow and see the newly added recipe and dataset.
SQL Query Recipes in Dataiku DSS
The concept of a SQL query recipe is that it must “return” as results the rows that you want in the output dataset.
If the output dataset is a SQL dataset (that corresponds to a new table in the database), DSS will automatically INSERT these records into the target table. If the output is not on SQL, DSS will retrieve the records from the database and store them in the output dataset.
Thus, while SQL recipes can be used to perform a number of general database operations within a Flow, the query ultimately must perform a SELECT statement that returns the records that you want to get in the output dataset.
Using Visual Recipes to Perform In-database Operations¶
Many visual recipes (Group, Join, Stack, etc.) can also perform in-database operations. For example, we are now ready to enrich the customers_stacked_prepared dataset with information about orders customers have made. Click on customers_stacked_prepared, and from the Actions menu choose Join with… from the list of visual recipes.
Select orders_by_customer as the second input dataset. Change the name of the output dataset to customers_enriched. Click Create Recipe.
In the Join step, change the join type to an Inner join in order to return only customers who have made an order. Note that Dataiku has automatically determined the join keys customerID and customer_id, even though they have different names in the two datasets.
In the Selected columns step, we can deselect customer_id from the orders_by_customer dataset because it is redundant with customerID in the customers_stacked_prepared dataset.
We can run the recipe now, and it will execute in-database (look under the Run button for confirmation), and we can also see the SQL code that is run by this Join recipe.
In the Output step, click View Query to preview the SQL code. We’re satisfied with this, so click Run. Confirm the schema update, and then Run the recipe.
When we create charts on database datasets, the summary statistics used to build the charts can be computed in-database. For example, in the customers_enriched dataset, click the Charts tab.
The default chart type is a bar chart. Drag pages_visited_avg to the X axis and Count of records to the Y axis.
At the moment, the chart is based on a sample of the data. Click to the Sampling & Engine tab of the left pane. Select In-database as the execution engine and click Save. In this case, the shape of the bar chart is essentially unchanged, but when the sample is not representative of the full data, using the full dataset can be illuminating, and using the in-database engine can speed your results.
SQL notebooks are useful for prototyping code and querying and analyzing data.
From the customers_enriched dataset, select Lab > New Code Notebook > SQL.
Click Create and +Query. In the Tables tab of the left pane, click on the icon to the right of the table name to paste DKU_TUTORIAL_SQL_customers_enriched to the query.
The query form is now populated with the following code, which selects all columns from the DKU_TUTORIAL_SQL_customers_enriched table in the dku_shirt schema.
SELECT * FROM "dku_tshirt"."DKU_TUTORIAL_SQL_customers_enriched"
If you click Run, the query executes and shows you a sample of the query results.
We are interested in how Haiku T-Shirt customers respond to the campaign, and in particular across different age groups. We can quickly run some SQL queries to analyze this.
Click +Query, copy the following code to the new query cell, and click Run.
SELECT campaign, AVG(total), SUM(total), COUNT(campaign) FROM "dku_tshirt"."DKU_TUTORIAL_SQL_customers_enriched" WHERE birthdate >= '1980-01-01' AND birthdate < '1990-01-01' GROUP BY campaign
The query results pull customers who were born in the 1980’s and group them by whether they are part of the campaign. The columns report:
- campaign, since this is the GROUP BY variable, the two rows represent the two values indicating whether a customer is part of the campaign
- avg, representing the average order placed by a customer. Customers who were part of the campaign placed orders averaging 97.40 in value, while those who were not placed orders averaging 34.49
- sum, representing the total value of orders placed by customers in this group
- count, representing the number of customers in this group
Click +Query, copy the following code to the new query cell, and click Run.
SELECT campaign, AVG(total), SUM(total), COUNT(campaign) FROM "dku_tshirt"."DKU_TUTORIAL_SQL_customers_enriched" WHERE birthdate >= '1990-01-01' AND birthdate < '2000-01-01' GROUP BY campaign
The query results show summaries for customers born in the 1990’s. What is of immediate interest here is that there is a much smaller gap between customers who are and are not part of the campaign, versus customers born in the 80’s. This suggests that there might exist some interaction between age and the effect of the campaign.
The SQL Notebook editor has a number of useful features for developing SQL code. You can prototype multiple queries, each in its own cell, and use code samples to quickly populate queries with starter code to achieve various tasks. As you add multiple queries, use the Queries tab of the left pane to organize and search your queries. The Tables tab of the left pane provides a handy reference of the contents of your connection, and allows you to quickly add table and column references to queries. Click Ctrl+Enter to quickly run your query.
Congratulations, you have taken your first steps toward working with SQL databases in Dataiku!