Before jumping into the hands-on portion of the tutorial, you can watch the following video, which walks through the outline of the steps.
Let’s Get Started!
In this tutorial, you will create your first machine learning model by analyzing the historical customer records and order logs from Haiku T-Shirts.
This is a two-part tutorial:
- First, we’ll create and improve your first model.
- Then, we’ll deploy this predictive model to score new records, like in a real application.
The goal of this tutorial is to predict whether a new customer will become a high-value customer, based on the information gathered during their first purchase.
Create Your Project
From the Dataiku homepage, click +New Project, select DSS Tutorials from the list, and select 103: Machine Learning (Tutorial). Click on Go to Flow. In the Flow, you can see the steps used in the previous tutorials to create, prepare, and join the customers and orders datasets.
Additionally, there is a dataset of “unlabeled” customers representing the new customers that we want to predict. These customers have been joined with the orders log and prepared in much the same way as the historical customer data.
Alternatively, you can continue in the same project you worked on in the From Lab to Flow tutorial by:
- Removing the total_sum and count columns from the customers_labeled dataset.
- Downloading a copy of the customers_unlabeled.csv file and uploading it to the project.
- Preparing the customers_unlabeled dataset to match the schema of the customers_labeled dataset. Remember to use an inner join to join customers_unlabeled with orders_by_customer. You can even copy-paste the Prepare recipe steps from the script you used to prepare the customers_orders_joined dataset.
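To make the effect of the inner join concrete, here is a minimal pure-Python sketch. It is only an illustration of inner-join semantics, not how DSS implements its Join recipe. The sample rows and the `customer_id` key name on the orders side are hypothetical.

```python
def inner_join(left_rows, right_rows, left_key, right_key):
    """Inner-join two lists of dicts; rows without a match on the other side are dropped."""
    # Index the right-hand rows by their join key for fast lookup.
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)

    joined = []
    for row in left_rows:
        for match in index.get(row[left_key], []):
            merged = dict(match)
            merged.update(row)  # on a name clash, the left-hand value wins
            joined.append(merged)
    return joined


# Hypothetical sample rows; the real datasets have more columns and rows.
customers_unlabeled = [
    {"customerID": "c1", "gender": "F"},
    {"customerID": "c2", "gender": "M"},  # no orders, so dropped by the inner join
]
orders_by_customer = [
    {"customer_id": "c1", "pages_visited_avg": 4.5},
]

joined = inner_join(customers_unlabeled, orders_by_customer, "customerID", "customer_id")
print(joined)
```

Customers with no matching order row disappear from the result, which is why the unlabeled data ends up with the same kind of rows as the labeled data.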
Predicting Whether a Customer Will be of High Value
Based upon the joined customer and order data, our goal is to predict (i.e. guess) whether the customer will become a “high revenue” customer. If we can predict this correctly, we would be able to assess the quality of the cohorts of new users and more effectively drive acquisition campaigns and channels.
In the Flow, select the customers_labeled dataset and click on the Lab button to create a new visual analysis. Give the analysis the more descriptive name High revenue analysis.
Our labeled dataset contains personal information about each customer, including their device and location. The last column, high_revenue, is a flag for customers generating a lot of revenue based on their purchase history. It will be used as the target variable of our modeling task.
Now let’s build our first model!
Click on the Models tab in the visual analysis and then click Create first model. A modal dialog appears where you must choose the type of modeling task you want to perform.
Different kinds of modeling tasks
Prediction models are supervised learning algorithms, i.e. they are trained on past examples for which the actual values (the target column) are known. The nature of the target variable will drive the kind of prediction task.
- Regression is used to predict a real-valued quantity (e.g. a duration, a quantity, an amount spent…).
- Two-class classification is used to predict a boolean quantity (e.g. presence / absence, yes / no…).
- Multiclass classification is used to predict a variable with a finite set of values (red/blue/green, small/medium/big…).
Clustering models infer a function that describes hidden structure in “unlabeled” data. These unsupervised learning algorithms group similar rows based on their features.
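As a rough illustration of how the nature of the target drives the prediction task, the sketch below suggests a task type from a list of target values. The function name and the heuristic are hypothetical, and this is not how DSS itself chooses the task.

```python
def suggest_prediction_task(target_values):
    """Suggest a prediction task from the observed target values (simple heuristic)."""
    distinct = {v for v in target_values if v is not None}
    if len(distinct) == 2:
        return "two-class classification"
    # Treat a target made only of numbers as a real-valued quantity.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct):
        return "regression"
    return "multiclass classification"


print(suggest_prediction_task([True, False, True]))        # e.g. high_revenue
print(suggest_prediction_task([12.5, 3.0, 40.2]))          # e.g. an amount spent
print(suggest_prediction_task(["red", "blue", "green"]))   # e.g. a color
```

A heuristic like this can be wrong (numeric class codes would look like regression), so the final choice should still be made by a person who knows the data.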
Here, we want to predict high_revenue. Let us choose the Prediction option, and select high_revenue as the target variable. Dataiku DSS allows you complete control over the machine learning algorithms, but since this is our first model, let’s click on Automated Machine Learning.
Automated machine learning provides templates to create models depending on what you want to achieve; for example, either using machine learning to get some insights on your data or creating a highly performant model. Let us keep the default Quick Prototypes template on the In-memory (Python) backend and click Create. Click Train on the next screen.
Dataiku guesses the best preprocessing to apply to the features of your dataset before applying the machine learning algorithms.
A few seconds later, Dataiku presents a summary of the results of this modeling session. By default, 2 classes of algorithms are used on the data:
- a simple generalized linear model (logistic regression)
- a more complex ensemble model (random forest)
The model summaries contain some important information:
- the type of model
- a performance measure; here the Area Under the ROC Curve or AUC is displayed
- a summary of the most important variables in predicting your target
The AUC measure is handy: the closer to 1, the better the model. Here the Random forest model seems to be the most accurate. Click on it, and you will be taken to the main Results page for this specific model.
The Summary tab showed an AUC value of 0.767, which is pretty good for this type of application. Your actual figure might vary due to differences in how rows are randomly assigned to training and testing samples.
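For intuition about what the AUC measures, here is a small pure-Python sketch. It uses the standard interpretation of the AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The labels and scores are made-up illustration data, and the quadratic loop is only suitable for small samples.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = high revenue) and predicted probabilities.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
print(round(roc_auc(labels, scores), 3))
```

A model that ranks every positive above every negative scores 1.0, while a model that assigns the same score to everyone scores 0.5.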
To get a better understanding of your model results, Dataiku DSS also offers several different outputs in the left panel. These outputs are grouped into:
- Interpretation, for assessing model behavior and the effects of features
- Performance, for evaluating the model, using performance metrics
- Model Information, for providing more information about the model
Model and Feature Interpretation
Going down the list in the left panel, you will find a first section called Interpretation. This section provides information for assessing the behavior of the model and the contribution of features to the model outcome.
Some of the panels in this section are algorithm-dependent; for example, a linear model will display information about the model’s coefficients, while a tree-based model will display information about decision trees and variable importance.
The Interpretation section also contains panels for creating partial dependence plots (Partial dependence), performing subpopulation analysis (Subpopulation analysis), and providing individual prediction explanations at the row level (Individual explanations). All the information provided in this section can prove quite useful for a better understanding of your model.
To understand the random forest model, let’s begin by looking at the Variables importance panel.
We notice that some variables seem to have a strong relationship with being a high-value customer. Notably, the age at the time of first purchase age_first_order seems to be a good indicator.
Next, let’s use a partial dependence plot to understand the effect of a feature (age_first_order) on the target (high_revenue).
- Click Partial dependence in the left panel to open the partial dependence page of the output.
- Specify age_first_order as the variable.
- Click Compute.
The partial dependence plot shows the dependence of high_revenue on the age_first_order feature, computed on the test set (2177 rows). A negative partial dependence value represents a negative dependence of the predicted response on the feature value, while a positive partial dependence value represents a positive dependence of the predicted response on the feature value.
For example, the partial dependence plot shows that high_revenue being “True” has a negative relationship with age_first_order for ages below 42 years. The relationship slowly increases between ages 50 and 67, but then drops off until age 74.
The plot also displays the distribution of the age_first_order feature. From the distribution, you can see that there is sufficient data to interpret the relationship between the feature and the target.
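The idea behind a partial dependence plot can be sketched in a few lines of Python: force one feature to each value of a grid for every row, average the model's predictions, and compare with the average prediction on the unmodified rows. The `toy_model` function and sample rows below are hypothetical stand-ins for the trained random forest and the test set, and DSS may compute and scale its values differently.

```python
def toy_model(row):
    """Hypothetical stand-in for a trained model: returns P(high_revenue = True)."""
    p = 0.1 + 0.005 * (row["age_first_order"] - 40)
    if row["gender"] == "M":
        p += 0.05
    return min(1.0, max(0.0, p))


def partial_dependence(predict, rows, feature, grid):
    """Centered partial dependence of the prediction on one feature."""
    baseline = sum(predict(r) for r in rows) / len(rows)
    curve = []
    for value in grid:
        total = 0.0
        for r in rows:
            modified = dict(r)
            modified[feature] = value  # force the feature to the grid value
            total += predict(modified)
        # Negative values mean "below the average prediction", positive means above.
        curve.append((value, total / len(rows) - baseline))
    return curve


# Made-up rows standing in for the test set.
sample_rows = [
    {"age_first_order": 30, "gender": "F"},
    {"age_first_order": 40, "gender": "M"},
    {"age_first_order": 50, "gender": "F"},
]

for value, dependence in partial_dependence(toy_model, sample_rows, "age_first_order", [30, 40, 50]):
    print(value, round(dependence, 3))
```

Because every other feature keeps its observed value, the curve isolates the average effect of the chosen feature on the prediction.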
Let’s see what the partial dependence plot looks like for a categorical feature gender.
- Select gender as the variable.
- Click Compute.
The partial dependence plot shows that high_revenue being “True” has a negative relationship with gender being “F”. In the cases where gender is “M” or has no value, then the relationship is positive. The gender distribution is roughly equal between males and females, and it accounts for about 90% of the data.
Another useful tool to better understand the model is subpopulation analysis. Using this tool, we can assess if the model behaves identically across subgroups or if the model shows biases for certain groups.
Let’s use a subpopulation analysis to understand how our model behaves across different gender groups.
- Click Subpopulation analysis in the left panel to open the subpopulation analysis page of the output.
- Specify gender as the variable.
- Click Compute.
The table shows a subpopulation analysis for gender, computed on the test set. The model predicted that high_revenue was true 18% of the time when it was actually true only 9% of the time.
For the “F” subgroup, the model predicted that high_revenue was true 17% of the time when the actual number was 8%. Similarly, the model predicted “True” 20% of the time for the “M” subgroup, when the actual number was 10%. The predicted probabilities for male and female are close, but not identical. We can investigate whether this difference is significant enough by displaying more metrics in the table and more detailed statistics related to the subpopulations represented by the “F” and “M” rows.
- Click the Displayed Metrics dropdown at the top right of the table, and select F1 Score. Note that this metric considers both the precision and recall values. The best possible value is one, and the worst is zero.
- Click anywhere on the “F” and “M” rows to expand them.
In this analysis, the male subgroup has the highest F1 score (0.36) of all the groups, even though this score is quite low. Also, the confusion matrices (displaying % of actual classes) for both the male and female groups show that the male subgroup does better (54%) at correctly predicting high_revenue to be true than the female subgroup (44%).
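The per-group figures in a subpopulation analysis can be reproduced in spirit with the sketch below, which groups rows by a feature and computes the predicted rate, the actual rate, and the F1 score for each group. The row layout (a `proba` column holding the predicted probability), the 0.5 threshold, and the sample values are all assumptions made for illustration.

```python
def subpopulation_report(rows, group_key, threshold=0.5):
    """Per-group predicted rate, actual rate, and F1 score for a binary target."""
    groups = {}
    for r in rows:
        groups.setdefault(r[group_key], []).append(r)

    report = {}
    for name, members in groups.items():
        tp = fp = fn = 0
        for r in members:
            predicted = r["proba"] >= threshold
            actual = r["high_revenue"]
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
        denominator = 2 * tp + fp + fn
        report[name] = {
            "predicted_true_rate": (tp + fp) / len(members),
            "actual_true_rate": (tp + fn) / len(members),
            "f1": 2 * tp / denominator if denominator else 0.0,
        }
    return report


# Made-up scored rows; "proba" is the predicted probability of high_revenue being True.
scored_rows = [
    {"gender": "F", "proba": 0.8, "high_revenue": True},
    {"gender": "F", "proba": 0.6, "high_revenue": False},
    {"gender": "F", "proba": 0.2, "high_revenue": False},
    {"gender": "F", "proba": 0.1, "high_revenue": False},
    {"gender": "M", "proba": 0.9, "high_revenue": True},
    {"gender": "M", "proba": 0.3, "high_revenue": True},
    {"gender": "M", "proba": 0.7, "high_revenue": False},
    {"gender": "M", "proba": 0.2, "high_revenue": False},
]

for group, metrics in subpopulation_report(scored_rows, "gender").items():
    print(group, metrics)
```

Comparing these metrics side by side is what lets you judge whether a gap between groups is large enough to matter.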
Apart from exploring the effects of features on the model, it can also be useful to understand how certain features impact the prediction of specific rows in the dataset. The individual prediction explanations feature allows you to do just this! Note that this feature can be computed in two ways:
- From the Individual explanations tab in the model results page.
- Within a scoring recipe (after deploying a model to the flow), by checking the option Output explanations. See Tutorial: Scoring a Machine Learning Model for an example.
Let’s use the Individual explanations tab in the model results page to visualize the five most influential features that impact the prediction for specific samples in the dataset.
- Click Individual explanations in the left panel to open the page.
- Use 5 as the number of most influential features for the explanation.
- Keep the default ICE method.
- Click the gear icon in the top right corner to see more settings, such as the sampling details. Keep the “Sample size” as 1000. This sample is drawn from the test set because the model implemented a simple train/test split. If the model implemented K-Fold validation, then the sample would be drawn from the entire dataset.
- Move the left slider close to 0.10 (corresponding to “~42 rows”) and the right slider close to 0.70 (corresponding to “~50 rows”) to specify the number of rows at the low and high ends of the predicted probabilities. Note that the probability density function in the background is an approximation based on the test set. The boxes to enable low and high probabilities are checked by default, so that you can move the sliders.
- Click Compute.
Depending on the exact location of the sliders, your values may be different from the ones shown in this analysis.
DSS returns 30 rows for which the output probabilities are less than 0.10, and 55 rows for which the output probabilities are greater than 0.70. Each row explanation is represented in a card below the probability density plot. Notice that DSS has selected customerID as the identifier, but you can change this selection.
On the left side of the page, the cards have low probabilities of high_revenue being “True” and are sorted in order of increasing probabilities. In contrast, the cards on the right side of the page have high probabilities of high_revenue being “True” and are sorted in order of decreasing probabilities. For each card, the predicted probability is in the top right, and the “customerID” (card identifier) is in the top left.
For cards on the left side of the page, observe that all of the bars are red and oriented to the left. This reflects that the predicted probability is below average and the features negatively impact the prediction. In some cases, some of the bars may have counter effects (green and oriented to the right); even so, the net effect of the features will still be negative for cards on the left side of the page.
The opposite observation can be made for the cards on the right side of the page, where the bars are mostly — if not always — green and oriented to the right to reflect the positive impact of the features on the outcome.
You can click Features in a card to display the full list of all its features and their values.
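To give a feel for what a card's bars represent, the following sketch computes a simplified ICE-style explanation for one row. It assumes that a feature's contribution is the row's prediction minus the average prediction obtained when that feature's value is swapped for values seen in other rows. The `toy_model` function, the sample rows, and the deterministic loop over all background rows are illustrative assumptions; the exact DSS computation may differ.

```python
def toy_model(row):
    """Hypothetical stand-in for a trained model: returns P(high_revenue = True)."""
    p = 0.1 + 0.005 * (row["age_first_order"] - 40)
    if row["gender"] == "M":
        p += 0.05
    return min(1.0, max(0.0, p))


def ice_style_explanation(predict, row, background_rows, top_n=5):
    """Rank features by how much the prediction changes when each one is swapped out."""
    base = predict(row)
    contributions = {}
    for feature in row:
        total = 0.0
        for other in background_rows:
            altered = dict(row)
            altered[feature] = other[feature]  # swap in another row's value
            total += predict(altered)
        contributions[feature] = base - total / len(background_rows)
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top_n]


# Made-up background sample and a single row to explain.
background = [
    {"age_first_order": 30, "gender": "F"},
    {"age_first_order": 40, "gender": "M"},
    {"age_first_order": 50, "gender": "F"},
]
customer = {"age_first_order": 60, "gender": "M"}

for feature, contribution in ice_style_explanation(toy_model, customer, background):
    print(feature, round(contribution, 3))
```

A positive contribution corresponds to a bar pushing the prediction up, and a negative one to a bar pushing it down.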
Understanding Prediction Quality and Model Results
Following the Interpretation section, you will find a Performance section.
The Confusion matrix compares the actual values of the target variable with predicted values (hence values such as false positives, false negatives…) and some associated metrics: precision, recall, and F1 score. A machine learning model usually outputs a probability of belonging to one of the two groups, and the actual predicted value depends on which cut-off threshold we decide to use on this probability; e.g., at which probability do we decide to classify our customer as a high-value one?
The Confusion matrix shown will be dependent on the given threshold, which can be changed using the slider at the top.
The Decision Chart represents precision, recall, and F1 score for all possible cut-offs.
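The dependence of the confusion matrix on the cut-off can be sketched as follows. The functions apply a threshold to predicted probabilities, count the four confusion-matrix cells, and derive precision, recall, and F1 score; looping over several thresholds mimics what the Decision Chart plots. The labels and probabilities are made-up illustration data.

```python
def confusion_counts(labels, probas, threshold):
    """Return (tp, fp, tn, fn) for a given cut-off on the predicted probability."""
    tp = fp = tn = fn = 0
    for actual, proba in zip(labels, probas):
        predicted = proba >= threshold
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score, with 0.0 used when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels (1 = high revenue) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
probas = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]

for threshold in (0.35, 0.5, 0.8):
    tp, fp, tn, fn = confusion_counts(labels, probas, threshold)
    precision, recall, f1 = precision_recall_f1(tp, fp, fn)
    print(threshold, (tp, fp, tn, fn), round(precision, 2), round(recall, 2), round(f1, 2))
```

Lowering the threshold generally trades precision for recall, which is why the best cut-off depends on the cost of each kind of error.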
The Lift charts and ROC curve are visual aids, perhaps the most useful ones, for assessing the performance of your model. A fuller treatment of how these charts are constructed and interpreted can be found separately; for now, remember that in both cases, the steeper the curves are at the beginning of the graphs, the better the model.
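As a hedged illustration of why a steep start is good, the sketch below computes one common definition of cumulative lift: the share of all positives captured in the top-scored fraction of rows, divided by that fraction. The data is made up, and the charts in DSS may be built with a somewhat different convention.

```python
def cumulative_lift(labels, probas, fraction):
    """Lift obtained by targeting the top `fraction` of rows ranked by predicted probability."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("lift is undefined without positive examples")

    ranked = sorted(zip(probas, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    captured = sum(label for _, label in ranked[:top_n])
    share_captured = captured / total_positives
    return share_captured / (top_n / len(ranked))


# Made-up data: 10 customers, 2 of whom are actually high revenue.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
probas = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

for fraction in (0.2, 0.5, 1.0):
    print(fraction, round(cumulative_lift(labels, probas, fraction), 2))
```

A lift well above 1 in the first few percent of rows means the model concentrates the true positives at the top of its ranking.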
In our example, the results again look pretty good.
Finally, the Density chart shows the distribution of the predicted probability of being a high-value customer, compared across the two actual groups. A good model will separate the two curves as much as possible, as is the case here.
The last section, Model Information, is a recap of how the model was built. If you go to the Features tab, you will notice some interesting things.
By default, all the available variables except customerID have been used to predict our target. Dataiku DSS has rejected customerID because this feature was detected as a unique identifier and was not helpful for predicting high-value customers. Furthermore, a feature like the geopoint is probably not very useful in a predictive model, because it will not generalize well to new records. We may want to refine the settings of the model.
Tuning the Settings of a Model
To change the way models are built, go back to the models list page by clicking on the Models link and opening the Design page.
To address the issue about how we use the variables, proceed directly to the Features handling tab. Here DSS will let you tune different settings.
The Role of a variable (or feature) determines whether it is used (Input) or not used (Reject) in the model. Here, we want to remove ip_address_geopoint from the model. Click on ip_address_geopoint and hit the Reject button (or alternatively use the on/off toggle directly).
The Type of the variable is very important to define how it should be preprocessed before it is fed to the machine learning algorithm:
- Numerical variables are real-valued ones. They can be integers or numbers with decimals.
- Categorical variables are the ones storing nominal values: red/blue/green, a zip code, a gender, etc. Also, there will often be times when a variable that looks like Numerical should actually be Categorical instead. For example, this will be the case when an “id” is used in lieu of the actual value.
- Text is meant for raw blocks of textual data, such as a Tweet, or customer review. Dataiku DSS is able to handle raw text features with specific preprocessing.
Each type can be handled differently. For instance, the numerical variables age_first_order and pages_visited_avg have been automatically normalized using a standard rescaling (this means that the values are normalized to have a mean of 0 and a variance of 1). You can disable this behavior by selecting both names in the list again and clicking the No rescaling button.
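Standard rescaling can be written out in a few lines: subtract the mean, then divide by the standard deviation. The sketch below uses the population standard deviation and made-up ages; it illustrates the transformation only and is not the code DSS runs.

```python
def standard_rescale(values):
    """Rescale values to have a mean of 0 and a variance of 1."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0] * n
    return [(v - mean) / std for v in values]


# Made-up values for age_first_order.
ages = [20, 30, 40, 50, 60]
scaled = standard_rescale(ages)
print([round(s, 3) for s in scaled])
```

Rescaling mostly matters for algorithms that are sensitive to feature magnitude, such as linear models; tree-based models like random forests are largely unaffected by it.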
After altering these settings, you can now click on Train and build some new models.
The performance of the random forest model has now slightly increased.
Increasing Accuracy with Features Generation
Go to Design, and click the Feature generation tab. We can automatically generate new numeric features using Pairwise linear combinations and Polynomial combinations of existing numeric features. Click on these feature generation methods and set “Enable” to Yes. Sometimes these generated features can reveal unexpected relationships between the inputs and target.
When done, you can train your model again by clicking on the Train button.
The resulting Random Forest beats the previous one – the AUC value is now higher than in either of the first two models – possibly because of the changes we made to the handling of features. Looking at the Variables importance chart for the latest model, the importance is spread across the campaign variable along with the features automatically generated from age_first_order and pages_visited_avg, so the generated features may have uncovered some previously hidden relationships. On the other hand, the increase in AUC isn’t huge, so it may be best to be grateful for the boost without reading too much into it.
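To show what kind of columns feature generation adds, here is a minimal sketch. It assumes that pairwise linear combinations are the sum and difference of each pair of numeric features and that polynomial combinations are their product. The generated feature names and the sample values are hypothetical, and DSS may name or compute its generated features differently.

```python
from itertools import combinations


def generate_pairwise_features(row, numeric_features):
    """Build sum, difference, and product features for every pair of numeric features."""
    generated = {}
    for a, b in combinations(numeric_features, 2):
        generated[f"{a}+{b}"] = row[a] + row[b]  # pairwise linear combination
        generated[f"{a}-{b}"] = row[a] - row[b]  # pairwise linear combination
        generated[f"{a}*{b}"] = row[a] * row[b]  # polynomial combination
    return generated


# Made-up customer row.
row = {"age_first_order": 30, "pages_visited_avg": 4.0}
print(generate_pairwise_features(row, ["age_first_order", "pages_visited_avg"]))
```

The number of generated columns grows quadratically with the number of numeric features, so this is usually worth enabling only when there are few of them.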
Now that you have trained several models, all the results may not fit your screen anymore. To see all your models at a glance, you can switch to the Table view, which can be sorted on any column. Here we have sorted on ROC AUC.