Before jumping into the hands-on portion of the tutorial, you can watch the following video, which walks through the outline of the steps.
Let’s Get Started!¶
In this tutorial, you will create your first machine learning model by analyzing the historical customer records and order logs from Haiku T-Shirts.
This is a two-part tutorial:
- First, we’ll create and improve your first model.
- Then, we’ll deploy this predictive model to score new records, like in a real application.
The goal of this tutorial is to predict whether a new customer will become a high-value customer, based on the information gathered during their first purchase.
Create Your Project¶
From the Dataiku homepage, click +New Project, select DSS Tutorials from the list, and select 103: Machine Learning (Tutorial). Click on Go to Flow. In the Flow, you can see the steps used in the previous tutorials to create, prepare, and join the customers and orders datasets.
Additionally, there is a dataset of “unlabeled” customers representing the new customers that we want to predict. These customers have been joined with the orders log and prepared in much the same way as the historical customer data.
Alternatively, you can continue in the same project you worked on in the From Lab to Flow tutorial by:
- Removing the total_sum and count columns from the customers_labeled dataset.
- Downloading a copy of the customers_unlabeled.csv file and uploading it to the project.
Predicting Whether a Customer Will be of High Value¶
Based upon the joined customer and order data, our goal is to predict (i.e. guess) whether the customer will become a “high revenue” customer. If we can predict this correctly, we would be able to assess the quality of the cohorts of new users and more effectively drive acquisition campaigns and channels.
In the Flow, select the customers_labeled dataset and click on the Lab button to create a new visual analysis. Give the analysis the more descriptive name High revenue analysis.
Our labeled dataset contains personal information about each customer, their device, and their location. The last column, high_revenue, is a flag for customers who generate a lot of revenue based on their purchase history. It will be used as the target variable of our modeling task.
Now let’s build our first model!
Click on the Models tab in the visual analysis and then click Create first model. A modal dialog appears where you must choose the type of modeling task you want to perform.
Different kinds of modeling tasks
Prediction models are supervised learning algorithms; that is, they are trained on past examples for which the actual values (the target column) are known. The nature of the target variable drives the kind of prediction task.
- Regression is used to predict a real-valued quantity (e.g., a duration, a quantity, an amount spent…).
- Two-class classification is used to predict a boolean quantity (e.g., presence / absence, yes / no…).
- Multiclass classification is used to predict a variable with a finite set of values (red/blue/green, small/medium/big…).
Clustering models infer a function that describes hidden structure in “unlabeled” data. These unsupervised learning algorithms group similar rows based on their features.
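To make the link between the target column and the kind of prediction task concrete, here is a minimal sketch. The helper `guess_task_type` and its cut-off of 10 distinct values are hypothetical and chosen only for illustration; this is not how Dataiku DSS decides.

```python
def guess_task_type(target_values, max_classes=10):
    """Guess a prediction task type from the values of a target column.

    Hypothetical helper: the max_classes cut-off is an arbitrary assumption.
    """
    distinct = {v for v in target_values if v is not None}
    if len(distinct) == 2:
        # Exactly two outcomes (yes / no, True / False, ...)
        return "two-class classification"
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if all_numeric and len(distinct) > max_classes:
        # Many distinct numeric values: treat the target as a real-valued quantity
        return "regression"
    # A small, finite set of values (red/blue/green, small/medium/big, ...)
    return "multiclass classification"


print(guess_task_type([True, False, True, True]))
print(guess_task_type(["red", "blue", "green", "red"]))
print(guess_task_type([12.5 + 0.3 * i for i in range(50)]))
```

In this tutorial, high_revenue takes two values, so the task falls in the two-class classification case.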
Here, we want to predict high_revenue. Let us choose the Prediction option, and select high_revenue as the target variable. Dataiku DSS allows you complete control over the machine learning algorithms, but since this is our first model, let’s click on Automated Machine Learning.
Automated machine learning provides templates to create models depending on what you want to achieve; for example, either using machine learning to get some insights on your data or creating a highly performant model. Let us keep the default Quick Prototypes template on the In-memory (Python) backend and click Create. Click Train on the next screen.
Dataiku guesses the best preprocessing to apply to the features of your dataset before applying the machine learning algorithms.
A few seconds later, Dataiku presents a summary of the results of this modeling session. By default, 2 classes of algorithms are used on the data:
- a simple generalized linear model (logistic regression)
- a more complex ensemble model (random forest)
The model summaries contain some important information:
- the type of model
- a performance measure; here the Area Under the ROC Curve or AUC is displayed
- a summary of the most important variables in predicting your target
The AUC measure is handy: the closer to 1, the better the model. Here the Random forest model seems to be the most accurate. Click on it, and you will be taken to the main Results page for this specific model.
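To give some intuition for the AUC, the sketch below computes it from its probabilistic definition: the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels and scores are made up for illustration and are unrelated to the tutorial data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = high revenue, 0 = not high revenue
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1]
print(round(roc_auc(labels, scores), 3))
```

A model that ranks every positive above every negative reaches 1.0, while a model that assigns the same score to everyone sits at 0.5.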
Understanding Prediction Quality and Model Results¶
The Summary tab showed an AUC value of about 0.767, which is pretty good for this type of application. Your actual figure might vary slightly due to differences in how rows are randomly assigned to training and testing samples.
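The run-to-run variation comes from the random split itself. As a minimal sketch, assuming an 80/20 split (the ratio and the helper `train_test_split` are assumptions for illustration, not Dataiku's documented defaults), fixing the random seed makes the assignment reproducible:

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=None):
    """Randomly assign rows to a training and a testing sample."""
    rng = random.Random(seed)      # a fixed seed gives a reproducible shuffle
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_ratio)))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))
train_a, test_a = train_test_split(rows, seed=42)
train_b, test_b = train_test_split(rows, seed=42)
print(len(train_a), len(test_a))
print(train_a == train_b)  # same seed, same split
```

With a different seed (or none), other rows land in the test sample, so metrics such as the AUC shift slightly.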
To get a better understanding of your model results, Dataiku DSS also offers several different outputs.
Going down the list in the left panel, you will find a first section called Interpretation, which shows the contribution of the different variables to the model. Keep in mind that the values here are algorithm-dependent (e.g., for a linear model you’ll find the model’s coefficients, while for tree-based methods the values are related to the number of splits on a variable, weighted by the depth of each split in the tree). Even so, the information in this section can be quite useful for understanding your model.
We notice that some variables seem to have a strong relationship with being a high-value customer. Notably, the age at the time of first purchase age_first_order seems to be a good indicator.
Following the Interpretation section, you will find a Performance section.
The Confusion matrix compares the actual values of the target variable with the predicted values (hence values such as false positives and false negatives) and reports some associated metrics: precision, recall, and F1 score. A machine learning model usually outputs a probability of belonging to one of the two groups, and the actual predicted value depends on the cut-off threshold we apply to this probability; that is, at which probability do we decide to classify a customer as a high-value one?
The Confusion matrix shown depends on the chosen threshold, which can be changed using the slider at the top.
The Decision Chart represents precision, recall, and F1 score for all possible cut-offs.
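The effect of the cut-off can be sketched in a few lines. The function names and the toy labels and probabilities below are made up for illustration; the code only shows how the confusion matrix and its metrics move as the threshold changes.

```python
def confusion_at(labels, probabilities, threshold):
    """Count true/false positives/negatives for a given cut-off threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, proba in zip(labels, probabilities):
        predicted = 1 if proba >= threshold else 0
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


def precision_recall_f1(cm):
    """Derive precision, recall and F1 score from confusion matrix counts."""
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    precision = cm["tp"] / predicted_pos if predicted_pos else 0.0
    recall = cm["tp"] / actual_pos if actual_pos else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def decision_chart(labels, probabilities, thresholds):
    """One (threshold, precision, recall, f1) row per cut-off."""
    return [
        (t, *precision_recall_f1(confusion_at(labels, probabilities, t)))
        for t in thresholds
    ]


labels = [1, 1, 0, 1, 0, 0]
probabilities = [0.9, 0.6, 0.55, 0.3, 0.2, 0.1]
for row in decision_chart(labels, probabilities, [0.25, 0.5, 0.75]):
    print(row)
```

Lowering the threshold catches more true high-value customers (higher recall) at the cost of more false positives (lower precision).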
The Lift charts and ROC curve are visual aids, perhaps the most useful ones, for assessing the performance of your model. A longer explanation of how these charts are built and interpreted can be found separately; for now, remember that in both cases, the steeper the curves are at the beginning of the graphs, the better the model.
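As a rough illustration of why a steep start is a good sign, the sketch below builds a cumulative gains curve, one common form of lift chart. It is an assumption that this matches the exact construction used in Dataiku DSS; the function name and toy data are made up.

```python
def cumulative_gain(labels, probabilities):
    """Points (fraction of rows contacted, fraction of positives captured),
    with rows ranked from highest to lowest predicted probability."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(probabilities, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    captured = 0
    for i, (_, actual) in enumerate(ranked, start=1):
        captured += actual
        points.append((i / len(ranked), captured / total_positives))
    return points


labels = [1, 1, 0, 1, 0, 0]
probabilities = [0.9, 0.6, 0.55, 0.3, 0.2, 0.1]
for fraction_rows, fraction_positives in cumulative_gain(labels, probabilities):
    print(f"{fraction_rows:.2f} -> {fraction_positives:.2f}")
```

When the model ranks the true positives first, most of them are captured within the first few rows, which is what makes the curve rise steeply at the start.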
In our example again, the results look pretty good.
Finally, the Density chart shows the distribution of the predicted probability of being a high-value customer, compared across the two actual groups. A good model separates the two curves as much as possible, as is the case here.
The last section, Model Information, is a recap of how the model has been built. If you go to the Features tab, you will notice some interesting things.
By default, all the available variables except customerID have been used to predict our target. Dataiku DSS has rejected customerID because this feature was detected as a unique identifier and would not help predict high-value customers. Furthermore, a feature like the geopoint is probably not useful in a predictive model, because it will not generalize well to new records. We may want to refine the settings of the model.
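To see why a unique identifier carries no predictive signal, consider a simple check: a column whose values never repeat cannot group customers in any useful way. The helper below is hypothetical and is not the detection logic that Dataiku DSS actually uses.

```python
def looks_like_unique_id(values):
    """True when every non-empty value in the column is distinct.

    Hypothetical heuristic for illustration only.
    """
    non_empty = [v for v in values if v is not None]
    return len(non_empty) > 1 and len(set(non_empty)) == len(non_empty)


customer_ids = ["c001", "c002", "c003", "c004"]   # made-up sample values
campaign = [True, False, True, True]

print(looks_like_unique_id(customer_ids))
print(looks_like_unique_id(campaign))
```

A model that split on such a column would only memorize individual training rows, which would not carry over to new customers.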
Tuning the Settings of a Model¶
To change the way models are built, go back to the models list page by clicking on the Models link and opening the Design page.
To address the issue about how we use the variables, proceed directly to the Features handling tab. Here DSS will let you tune different settings.
The Role of a variable (or feature) determines whether it is used (Input) or not used (Reject) in the model. Here, we want to remove ip_address_geopoint from the model. Click on ip_address_geopoint and hit the Reject button (or use the on/off toggle directly).
The Type of the variable is very important to define how it should be preprocessed before it is fed to the machine learning algorithm:
- Numerical variables are real-valued ones. They can be integers or numbers with decimals.
- Categorical variables are the ones storing nominal values: red/blue/green, a zip code, a gender, etc. Also, there will often be times when a variable that looks like Numerical should actually be Categorical instead. For example, this will be the case when an “id” is used in lieu of the actual value.
- Text is meant for raw blocks of textual data, such as a Tweet, or customer review. Dataiku DSS is able to handle raw text features with specific preprocessing.
Each type can be handled differently. For instance, the numerical variables age_first_order and pages_visited_avg have been automatically normalized using a standard rescaling (this means that the values are normalized to have a mean of 0 and a variance of 1). You can disable this behavior by selecting both names in the list again and clicking the No rescaling button.
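Standard rescaling can be sketched as follows: subtract the mean and divide by the standard deviation. The sample ages are made up, and using the population standard deviation is an assumption made here for simplicity.

```python
from statistics import mean, pstdev


def standard_rescale(values):
    """Rescale values to a mean of 0 and a variance of 1."""
    mu = mean(values)
    sigma = pstdev(values)   # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column has no spread; map every value to 0
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [20, 30, 40, 50, 60]          # made-up age_first_order values
scaled = standard_rescale(ages)
print([round(v, 3) for v in scaled])
print(round(mean(scaled), 6), round(pstdev(scaled), 6))
```

Rescaling mainly matters for models such as logistic regression, where features on very different scales would otherwise weigh unevenly in the fit.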
After altering these settings, you can now click on Train and build some new models.
The performance of the random forest model has now slightly increased.
Increasing Accuracy with Features Generation¶
Go to Design, and click the Feature generation tab. We can automatically generate new numeric features using Pairwise linear combinations and Polynomial combinations of existing numeric features. Click on these feature generation methods and set “Enable” to Yes. Sometimes these generated features can reveal unexpected relationships between the inputs and target.
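As a rough idea of what such generated features look like, the sketch below builds sums and differences (pairwise linear combinations) and products (polynomial combinations) of two numeric columns. The exact set of combinations and the feature names that Dataiku DSS generates are an assumption here; the sample row is made up.

```python
from itertools import combinations


def pairwise_linear_combinations(row, columns):
    """Sum and difference of every pair of numeric columns (assumed form)."""
    generated = {}
    for a, b in combinations(columns, 2):
        generated[f"{a}+{b}"] = row[a] + row[b]
        generated[f"{a}-{b}"] = row[a] - row[b]
    return generated


def polynomial_combinations(row, columns):
    """Product of every pair of numeric columns (assumed form)."""
    return {f"{a}*{b}": row[a] * row[b] for a, b in combinations(columns, 2)}


row = {"age_first_order": 30, "pages_visited_avg": 4.0}   # made-up values
numeric_columns = ["age_first_order", "pages_visited_avg"]

features = dict(row)
features.update(pairwise_linear_combinations(row, numeric_columns))
features.update(polynomial_combinations(row, numeric_columns))
for name, value in features.items():
    print(name, value)
```

With many numeric columns the number of pairs grows quickly, so generated features can lengthen training time.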
When done, you can train your model again by clicking on the Train button.
The resulting Random Forest beats the previous one – the AUC value is now higher than in either of the first two models – possibly because of the changes we made to the handling of features. Looking at the Variables importance chart for the latest model, the importance is spread across the campaign variable along with the features automatically generated from age_first_order and pages_visited_avg, so the generated features may have uncovered some previously hidden relationships. On the other hand, the increase in AUC isn’t huge, so it may be best to be grateful for the boost without reading too much into it.
Now that you have trained several models, all the results may not fit your screen anymore. To see all your models at a glance, you can switch to the Table view, which can be sorted on any column. Here we have sorted on ROC AUC.