Scoring a Machine Learning Model¶
This is part 2 of Tutorial: Machine Learning. Please make sure that you have completed the first part before starting, since we’ll be continuing where we left off.
In this part, we will learn how to use a predictive model to score new records.
We will go through the following steps:
- deploying a model to the Flow
- using this deployed model to score records from another dataset
- understanding the different components used by Dataiku DSS during this workflow
What Are We Going To Do?¶
In the first part, we trained a model to predict the “high revenue potential” of customers whose previous long-term behavior we had already observed. These records are stored in the customers_labeled dataset.
Now, we have some new customers, for whom we have the first purchase, and we want to predict whether they’ll turn out to be high revenue customers. This is the customers_unlabeled_prepared dataset. In this dataset, we do not yet have an indication of whether they are high revenue.
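The difference between the two datasets can be sketched with a couple of toy rows. Note that the column names below (other than the dataset names) are hypothetical, not the actual schema used in the tutorial:

```python
import pandas as pd

# Illustrative rows only -- feature and label column names are made up.
customers_labeled = pd.DataFrame({
    "customer_id": ["a1", "a2"],
    "first_purchase_amount": [35.0, 120.0],
    "high_revenue": [False, True],   # observed long-term outcome (the label)
})

customers_unlabeled_prepared = pd.DataFrame({
    "customer_id": ["b1", "b2"],
    "first_purchase_amount": [18.0, 95.0],
    # no "high_revenue" column: this is what the model will predict
})

print("high_revenue" in customers_labeled.columns)              # True
print("high_revenue" in customers_unlabeled_prepared.columns)   # False
```

The labeled dataset carries the outcome we want to learn; the unlabeled one has the same features but no outcome yet.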
Start by going back to your Tutorial: Machine Learning project. Go to the Flow, click on the customers_labeled dataset, and click on the LAB button.
The Visual Analysis Lab should be as you left it at the end of part 1, with the corresponding Script. Open the Models tab, and you should see the 6 models you previously trained. Click on your best model, the last random forest.
The following video goes through what we just covered.
Naming and describing models
From the main Results view, you can “star” a model. When you dive into the individual summary of a model, you can edit the model name and give it a description. This helps you document your best models and allows others to find and understand them more easily.
Deploy the Model¶
We are now going to deploy this model to the Flow, where we’ll be able to use it to score another dataset. Click on the Deploy button on the top right.
An important popup now appears. It will let you create a new Train recipe. Train recipes, in Dataiku DSS, are the way to deploy a model into the Flow, where you can then use it to produce predictions on new records.
We’re not going to deploy many models, so let’s change the model name to a more manageable Random Forest, and click on the Create button:
You will now be taken back to the Flow. Two new green items are displayed. The first one is the actual train recipe, and the second one is its output, the model. Now click on the model icon and look at the right panel.
You have access to some interesting features here. If you choose Open, you will be redirected to a view, similar to the Models one from your previous analysis bench, but focusing only on the model you chose to deploy (the random forest):
Without going into too much detail in this tutorial, notice that the model is marked as the Active version. If your data were to evolve over time (which is very likely in real life!), you would have the ability from this screen to retrain your model (click on Actions and then Retrain). In that case, new versions of the model would become available, and you would be able to select which version you’d like to use.
Go back to the Flow and click on the model output icon. You should see a Retrain button close to the Open one. This is a shortcut to the function described above: you can update the model with new training data, and activate a new version.
Finally, the Score icon is the one we are looking for to use the model:
Click on it, and a popup window shows up. This is where you set up a few things:
- the dataset you want to score (customers_unlabeled_prepared)
- the Prediction Model you want to use (already selected)
- a name for the output dataset
- the connection you want to store the results into
Fill in the values and hit the Create recipe button:
You are now in the scoring recipe.
The threshold is the optimal value computed to maximize a given metric (in part 1). In our case it was set to 0.625. Rows with a probability above the threshold will be classified as high-value; rows below, as low-value.
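The thresholding rule is simple enough to sketch in a few lines of Python. The 0.625 threshold is the one from the tutorial; the example probabilities are made up:

```python
# How a probability threshold turns model scores into class decisions.
# 0.625 is the threshold from the tutorial; the probabilities are illustrative.
threshold = 0.625

probabilities = [0.10, 0.62, 0.63, 0.90]
predictions = [p > threshold for p in probabilities]
print(predictions)  # [False, False, True, True]
```

Note that 0.62 falls just below the threshold and 0.63 just above it, which is why the two middle rows get different classes.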
You can now click on the Run button at the bottom left to score the second dataset.
A few seconds later, you should see Job succeeded.
Go back to the Flow screen to visualize your final workflow:
- start from the “history data”
- apply a training recipe
- get a trained model
- apply the model to get the scores on the unlabeled dataset.
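The same train-then-score workflow can be sketched outside DSS with scikit-learn. The data here is synthetic and the model settings are illustrative, but the two steps map directly onto the train recipe and the score recipe in the Flow:

```python
# Sketch of the train-then-score workflow with synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# "History" data: features plus an observed high-revenue label.
X_labeled = rng.random((200, 3))
y_labeled = X_labeled[:, 0] + X_labeled[:, 1] > 1.0

# Train recipe: fit the model on the labeled records.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_labeled, y_labeled)

# Score recipe: apply the trained model to unlabeled records.
X_unlabeled = rng.random((5, 3))
proba_true = model.predict_proba(X_unlabeled)[:, 1]
print(proba_true.shape)  # (5,)
```

Each scored record gets a probability between 0 and 1, which the threshold then converts into a class decision.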
Get the Scored Results¶
We’re almost done! Open the customers_unlabeled_scored dataset to see how the scored results look.
Three new columns have been created on the right:
The two “proba” columns are of particular interest. The model provides two probabilities, i.e. values between 0 and 1: the likelihood that a customer becomes a high-value customer (proba_True), and the complementary likelihood that they do not (proba_False).
The prediction column is the decision based on the probability and the threshold value of the scoring recipe: whenever proba_True is above 0.625, Dataiku DSS predicts “True”.
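How the three scored columns relate to each other can be sketched as follows. The probability values are made up, and the column names mirror the scored dataset described above:

```python
# The two probability columns are complements, and the prediction column
# applies the 0.625 threshold from the scoring recipe to proba_True.
import pandas as pd

threshold = 0.625
proba_true = [0.20, 0.70, 0.95]  # illustrative model outputs

scored = pd.DataFrame({
    "proba_False": [1 - p for p in proba_true],  # the two columns sum to 1
    "proba_True": proba_true,
    "prediction": [p > threshold for p in proba_true],
})
print(scored["prediction"].tolist())  # [False, True, True]
```

For a binary classifier the two probability columns always sum to 1, so either one, together with the threshold, determines the prediction.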
That’s it! You now know enough to build your first predictive model, analyze its results, and deploy it. These are the first steps towards a more complex application.