We are part of a data team working on a predictive maintenance use case at a car rental company.
To make sure customers don’t rent cars that might break down, the company wants to replace ones that are more likely to break down. Unexpected problems can really add to costs, because of the associated repairs, unavailability and the inconvenience to customers. At the same time, replacing the vehicles too often would not make sense either.
The company has some information on past failures as well as on car usage and maintenance. As the data team, we are here to offer a data-driven approach. More specifically, we want to use the information we have to answer the following questions:
- What are the factors behind these failures?
- Which cars are more likely to fail?
These questions are interrelated. As a data team, we are looking to isolate and understand which factors can predict whether vehicles have higher probabilities of failures. To do so, we’ll build end-to-end predictive models in Dataiku DSS. We’ll see an entire advanced analytics workflow from start to finish. Hopefully, its results will end up as a data product that has a direct impact on the company’s bottom line!
Here is a brief description of the 3 datasets available to us:
- Usage: number of miles the cars have been used, collected at various points
- Failure: whether a vehicle had a recorded failure–not all cases are labelled
- Maintenance: records of when cars were serviced, for which part and the reason provided for and quantity of parts replaced during maintenance
All the cars are identified by an Asset ID, which is available in each file. Some datasets are organized at the level of the vehicle, others are not. Some data detective work might be required!
By the end of this walkthrough, our goal is to get to this complete workflow in Dataiku DSS:
In order to complete this workflow, we will go through the following steps:
- import the data
- clean, restructure and merge the input datasets together
- split the merged dataset by whether outcomes are known or unknown, i.e. labelled or unlabelled
- train and analyze a predictive model on the known cases
- score the unlabelled cases using the predictive model
To complete this walkthrough, the following requirements will need to be met:
- Have access to a Dataiku DSS instance–that’s it!
Creating the Project and Importing Datasets¶
First, we will create the DSS datasets from the three input files in the Supporting data section. Here are the steps we will take to get started.
Workflows are organized as projects on a DSS instance. The project home page contains metadata (description, tags), record of activities and a lot more. These and other important ideas are detailed in explanations about the platform’s main concepts.
- Create a new DSS project called “Predictive Maintenance”. It is automatically assigned a project key. We can leave it the same or assign something else in its place.
- Let’s create the first dataset and call it usage. Here’s one way to do that:
- Click on the Import your first dataset button
- Select Files > Upload your files
- Upload the usage.csv.gz file downloaded from the Supporting Data section
- Click on Preview to view the import settings that will be used by DSS by default–these can be overridden!
- If the data is stored in a JSON format, for example, then we can detect the format in this step
- If the settings and import look ok, click Create
- By repeating these actions, we can import the two other files as datasets (via the Flow): maintenance_failure.csv.gz and maintenance.csv.gz.
With data import completed, we’ve now created the three datasets that we will need to proceed. If done correctly, they should be visible in the Flow.
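For readers curious about what the upload step handles behind the scenes, here is a minimal sketch in plain Python of reading a gzip-compressed CSV. It is not how DSS itself imports files. The sample bytes are hypothetical stand-ins for usage.csv.gz, and the column names (Asset, Time, Use) are taken from later steps of this walkthrough.

```python
import csv
import gzip
import io


def read_gzip_csv(raw: bytes) -> list:
    """Decompress gzip bytes and parse them as CSV with a header row."""
    text = gzip.decompress(raw).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical sample standing in for the real usage.csv.gz file.
sample = gzip.compress(b"Asset,Time,Use\nA001,10,1200\nA001,25,3100\n")
rows = read_gzip_csv(sample)
```

Every value comes back as a string here, which is one reason the schema check and type inference in the next section matter.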
Preparing the Usage Dataset¶
The “usage” dataset tracks the mileage for cars, identified by their “Asset” ID. The total number of miles covered is recorded over time, multiple times for each vehicle. As such, in the dataset’s current form, a car might have more than a single row of data. Since the unit of analysis for our model is the car, we need the data structured the same way, with one row per vehicle. Here is one way to aggregate the information in this dataset:
- First, we’ll make sure everything is alright with this dataset by going under the hood. We can find out how any dataset is set up by opening the dataset and going into the Settings tab.
- Within Settings, the Schema tab shows the storage types and meanings for all columns. By clicking Infer types from data, we can let DSS automatically detect the storage type and meaning of each column. Before letting DSS infer data types, we will check whether the data and the existing schema are consistent using the Check Now option.
DSS uses two kinds of data types (storage types and meanings) for ease of use and interpretation. The documentation covers this distinction in more detail.
- Once the dataset is ready to go, we’ll do the aggregation. Here is a quick way using a visual recipe:
- We can quickly launch actions by selecting the dataset and going to the Actions tab. Let’s use the Group recipe.
- Once we have initiated the recipe, let’s Group By “Asset”
- We can use the name suggested by DSS for the Output dataset or choose something else. The rest of this walkthrough will assume that the default name is used
Visual recipes can be deployed easily by clicking on the dataset and going to the “Actions” tab on the top right of the pane that opens up.
- In the next step, we can define how we want the aggregation to take place, i.e. specify how the input dataset will be converted into the output. Within the recipe Settings and under the Group tab, we select options so the output dataset provides the following information for each “Asset”:
- Count for each group
- Minimum and Maximum for “Time”
- Minimum and Maximum for “Use”
- We can now create the new dataset by saving and running the recipe.
The output dataset is now more fit for our purposes, since it is aggregated this way. Onto the next step!
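To make the aggregation concrete, here is a minimal sketch in plain Python of what the Group recipe computes for each “Asset”. The sample rows are hypothetical, and the output names (count, Time_min, Time_max, Use_min, Use_max) are assumed to mirror the recipe's default naming, as suggested by the column names used later in this walkthrough.

```python
from collections import defaultdict

# Hypothetical usage rows: (Asset, Time, Use)
usage = [("A001", 10, 1200), ("A001", 25, 3100), ("A002", 5, 400)]


def group_usage(rows):
    """Aggregate usage readings to one record per asset."""
    grouped = defaultdict(list)
    for asset, time, use in rows:
        grouped[asset].append((time, use))

    result = {}
    for asset, readings in grouped.items():
        times = [t for t, _ in readings]
        uses = [u for _, u in readings]
        result[asset] = {
            "count": len(readings),
            "Time_min": min(times),
            "Time_max": max(times),
            "Use_min": min(uses),
            "Use_max": max(uses),
        }
    return result


usage_by_asset = group_usage(usage)
```

Each asset now has exactly one record, which is the car-level structure the model needs.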
Preparing the Maintenance Dataset¶
The data preparation for the maintenance dataset is somewhat different. (HINT: We will use a different recipe!) This dataset contains all activity that has occurred, organized by part (i.e. what was repaired) and time (i.e. when it was repaired). A “Reason” code is given to each part used and provided as a column in the dataset.
At the same time, we’ll follow the same basic idea. While the current dataset has many observations for each vehicle, we want the output dataset to be “pivoted” at the level of each vehicle; that is, transformed from narrow to wide.
Data transformation from narrow to wide is a common step in data preparation. Different statistical software packages and programming languages have their own terms to describe this transformation. Wide and narrow is one standard, there are of course others.
- Again, we can make sure that the dataset’s configuration such as data types for each column are set correctly:
- Open the Maintenance dataset, click on Settings
- Navigate to the Schema tab
- Infer types from data
- Next, we are going to use the Pivot recipe to restructure the dataset at the level of each vehicle. Here’s how it’s done in brief:
- Pivot by Reason by creating a new Pivot visual recipe
- Create column with the Reason column from within the Pivot tab of the recipe
- Within Settings, select Asset as the Row identifiers
- Populate content with the sum of the Quantity column, so
- Saving and running the recipe will create the Output dataset
A step-by-step description of the Pivot recipe is available on this how-to on long to wide transformations.
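As an illustration of the narrow-to-wide idea, the sketch below pivots a few hypothetical maintenance rows in plain Python. The column naming pattern (reason code followed by _Quantity_sum) follows names that appear later in this walkthrough, such as R193_Quantity_sum. The reason code R565 and all values are made up for the example.

```python
from collections import defaultdict

# Hypothetical maintenance rows: (Asset, Reason, Quantity)
maintenance = [
    ("A001", "R193", 2),
    ("A001", "R193", 1),
    ("A001", "R565", 4),
    ("A002", "R565", 1),
]


def pivot_maintenance(rows):
    """Turn one row per (asset, reason) event into one record per asset."""
    wide = defaultdict(dict)
    for asset, reason, quantity in rows:
        column = f"{reason}_Quantity_sum"
        wide[asset][column] = wide[asset].get(column, 0) + quantity
    return dict(wide)


maintenance_wide = pivot_maintenance(maintenance)
```

Note that an asset with no events for a given reason simply has no value for that column. In a tabular output this shows up as an empty cell, which is why a later step fills empty cells with zeros.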
The final Failure dataset shows any information we have on past failures. It provides labels necessary to model predictions for failures among the fleet. It is structured at the level of the individual cars–perfect!
How can we be sure? We would need to make sure there are no duplicates in the dataset. Here’s a quick way to check:
- Explore the Failure dataset–it has two columns: Asset and failure_bin
- Using the Analyze… tool to get a quick snapshot, confirm that all observations in the dataset are Unique, i.e. that all Asset IDs have only 1 row associated with them
Find out more about the Analyze tool.
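The same uniqueness check can be expressed in a few lines of plain Python. This is only a sketch of the idea behind the check, using hypothetical rows with the two columns named above (Asset and failure_bin).

```python
from collections import Counter

# Hypothetical failure rows: (Asset, failure_bin); None marks an unlabelled case
failure = [("A001", 1), ("A002", 0), ("A003", None)]


def duplicated_assets(rows):
    """Return the Asset IDs that appear on more than one row."""
    counts = Counter(asset for asset, _ in rows)
    return sorted(asset for asset, n in counts.items() if n > 1)


duplicates = duplicated_assets(failure)
```

An empty result means every Asset ID appears exactly once, so the dataset is already at the level of the individual car.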
Our workflow is now beginning to look like a data pipeline. Next we’ll merge all of our datasets together!
Merging the Datasets¶
We have several datasets with the same level of granularity: the asset, i.e. the cars. We can join them together to create a unique dataset, which can be used to create our model. Here’s how we will achieve this merge:
- Since we have done no data preparation on it, data types for the Failure dataset can be auto-detected using the Infer Types From Data option within the dataset’s Settings > Schema (just as we did previously with the two other datasets)
- After saving these changes, we should be able to perform the merge using a single visual recipe. Let’s create a Join recipe starting with the Failure dataset.
- Let’s combine all three datasets on the Asset column using a Left Join throughout and name the output
Here is a short explanation about the Join recipe.
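To illustrate what a Left Join starting from the Failure dataset does, here is a minimal sketch in plain Python. The three dictionaries are hypothetical, keyed by Asset ID, and stand in for the Failure dataset and the two prepared datasets.

```python
def left_join(left, right):
    """Keep every key of `left`; add the columns of `right` where the key matches."""
    return {key: {**columns, **right.get(key, {})} for key, columns in left.items()}


# Hypothetical per-asset records, keyed by Asset ID
failure = {
    "A001": {"failure_bin": 1},
    "A002": {"failure_bin": 0},
    "A003": {"failure_bin": None},
}
usage_by_asset = {
    "A001": {"Use_min": 1200, "Use_max": 3100},
    "A002": {"Use_min": 400, "Use_max": 400},
}
maintenance_wide = {"A001": {"R193_Quantity_sum": 3}}

merged = left_join(left_join(failure, usage_by_asset), maintenance_wide)
```

Because the join is a left join from the Failure dataset, every asset listed there is kept, even when it has no usage or maintenance records. In this sketch the missing columns are simply absent; in a table they would be empty cells.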
All the input datasets are now ready. Congratulations, great work!
Creating the Training Dataset¶
To train models, we’ll use the Split recipe to create 2 separate datasets from the merged dataset:
- training dataset: contains labels for whether or not there was a failure event on an asset, which we’ll use to train a predictive model
- scoring dataset: contains no data on failures, i.e. is unlabelled, so we’ll predict whether or not these assets have a high probability of failure
This recipe allows us to Map values of a single column onto these output datasets. We split on the column failure_bin, based on Discrete values. We’ll send the rows with values 0 and 1 for that column to the training dataset and send the rest of the unlabelled rows to the scoring dataset.
Of course, there are different kinds of splits and options within Settings that would produce the same outputs. This combination is hopefully the simplest. Now that we have our training dataset, we can move to the model.
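The split rule described above can be sketched in plain Python as follows. The rows are hypothetical, and None is assumed to represent an empty failure_bin cell.

```python
def split_by_label(rows, label="failure_bin"):
    """Send rows labelled 0 or 1 to training and everything else to scoring."""
    training, scoring = [], []
    for row in rows:
        if row.get(label) in (0, 1):
            training.append(row)
        else:
            scoring.append(row)
    return training, scoring


# Hypothetical merged rows
merged_rows = [
    {"Asset": "A001", "failure_bin": 1},
    {"Asset": "A002", "failure_bin": 0},
    {"Asset": "A003", "failure_bin": None},
]
training, scoring = split_by_label(merged_rows)
```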
Creating the Training Dataset¶
Before making our first model on the training dataset, we’ll create a few more features. At the same time, we’re still designing this workflow, so we’ll create a sandbox environment that won’t create an output dataset, yet. By going into the LAB, we can test out such transformations as well as try out some modeling strategies, plus much more. Nothing is added back to the Flow until we are done testing and ready to deploy!
The Lab-to-Flow tutorial covers how steps in an analytics workflow can move to and from the LAB to the FLOW.
The quickest way to get to this sandbox is by right-clicking the training dataset in the Flow and selecting Lab. We can do some quick transformations by creating a new Visual Analysis. We’ll start in the Script tab, where we can carry out data steps similar to the Prepare recipe. Here’s what we’ll do:
- Create a new column with the formula Use_max - Use_min
- Create a new column with the formula Time_max - Time_min
- Replace null values with zeros in all of the “Reason” columns using a Fill empty cells step [HINT: The regular expression ^R.*_Quantity_sum$ is one way to get the job done!]
- Finally, we’ll rename some columns! Here are some suggested names to make the model results easier to interpret.
|Old col name||New col name|
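The script steps above can be sketched in plain Python as shown below. The new column names use_range and time_range are hypothetical placeholders chosen for this example, not the names used in the walkthrough. The regular expression is the one from the hint, and None is assumed to represent an empty cell.

```python
import re

# Pattern from the hint above: matches the pivoted reason-code columns
REASON_COLUMN = re.compile(r"^R.*_Quantity_sum$")


def add_features(row, all_columns):
    """Add the two difference columns and fill empty reason cells with 0."""
    out = dict(row)
    out["use_range"] = row["Use_max"] - row["Use_min"]      # hypothetical name
    out["time_range"] = row["Time_max"] - row["Time_min"]   # hypothetical name
    for column in all_columns:
        if REASON_COLUMN.match(column) and out.get(column) is None:
            out[column] = 0
    return out


# Hypothetical training row and column list
columns = ["Asset", "Use_min", "Use_max", "Time_min", "Time_max",
           "R193_Quantity_sum", "R565_Quantity_sum"]
row = {"Asset": "A002", "Use_min": 400, "Use_max": 900,
       "Time_min": 5, "Time_max": 30,
       "R193_Quantity_sum": None, "R565_Quantity_sum": 1}
prepared = add_features(row, columns)
```

Only columns matching the pattern are filled, so empty cells in other columns are left untouched.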
Next, we’ll make some models!
Creating the Prediction Model¶
Now that we have a dataset ready on which to train models, let’s use machine learning to predict car breakdown. We can find the visual Machine Learning (ML) interface in the “Models” tab. DSS lets us choose between:
- Prediction (or supervised learning): to predict a target variable (including labels), given a set of input features
- Clustering (or unsupervised learning): to create groups of observations based on shared patterns
In this case, we are trying to determine whether or not a car will have problems. So, we need to create a Prediction model. It will calculate the probabilities for one of two outcomes–failure or non-failure, i.e. perform two-class classification. The platform can pick out the type of supervised learning problem, based on the target variable, failure_bin.
Once we have picked the type of machine learning problem, we can customize this model development. Automated Machine Learning is available within the platform to help with some important decisions like the type of algorithms and parameters of those algorithms. Here, we’ll go with Quick Prototypes, the default suggestions.
Once we have some initial models, we can come back to the DESIGN tab to fiddle with these settings. For example, in the Basics > Metrics screen, we can define how we want model selection to occur. By default, the platform optimizes for AUC (Area Under the Curve), i.e. picks the model with the best AUC, while the threshold (or probability cut-off) is selected to give the best F1 score. Similarly, feature engineering can be tailored as needed, from which features are used and how they are handled to options for dimension reduction.
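To make these two metrics concrete, here is a minimal sketch in plain Python using their textbook definitions. It is not a description of how DSS computes them internally, and the labels and scores are made up. AUC is computed as the share of (failure, non-failure) pairs that the model ranks correctly, and the threshold is chosen among the observed scores to maximize F1.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def f1_at(labels, scores, threshold):
    """F1 score when every score >= threshold is predicted as a failure."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_threshold(labels, scores):
    """Pick the observed score that gives the highest F1 when used as cut-off."""
    return max(sorted(set(scores)), key=lambda t: f1_at(labels, scores, t))


# Hypothetical labels (1 = failure) and predicted failure probabilities
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

model_auc = auc(labels, scores)
threshold = best_f1_threshold(labels, scores)
```

AUC does not depend on any cut-off, which is why it is convenient for comparing models, while the F1-optimal threshold only matters once probabilities are turned into yes/no predictions.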
An important setting is the type of algorithms with which to model the data (under Modeling > Algorithms). In addition, we can define hyper-parameters for each of them. For now, we’ll run two machine learning algorithms: Logistic Regression and Random Forest. They come from two classes of algorithms popular for these kinds of problems, linear and tree-based respectively. Let’s hit Train and let the model competition begin!
Understanding the Model¶
Finally, we have some results. The platform provides model metrics auto-magically! For example, we can compare how models performed against each other. By default, the AUC is graphed for each model. By that metric, Random Forest has performed better than Logistic Regression. We can switch to TABLE view to see a number of metrics. In fact, Random Forest has performed better across a number of different metrics. Let’s explore this model in a bit more detail.
Looking underneath the hood of each model (by clicking on the name of the model), we’re given a lot of ready-made analysis. Besides providing information on features and on training and validation strategies, this analysis also helps us interpret the model and understand its performance. For example, we can get the number of correct and incorrect predictions made by this Random Forest model in a Confusion matrix. We can see the ROC curve that is used to calculate the AUC, the metric by which we selected our top model, or explore Detailed metrics.
We can dive into the Decision trees that are aggregated to calculate which features are important and to what extent. At the same time, we don’t want to miss the important details of this Random Forest for the trees! So let’s discuss some of the implications of this model.
The Variable importance chart displays the importance of each feature in the model for tree-based methods. Not surprisingly, some of the most important ones relate to the car’s age and usage. A car’s time_in_service and its last known age help predict whether or not it fails. The last known and total mileage (distance_last_known and distance) are also important. Finally, data from maintenance records are also useful. Among them, R193_Quantity_sum, i.e. the total number of parts used for reason code 193, is important in predicting failure at a later date.
Contextual knowledge becomes important at this point. For example, knowledge of the rental company’s acquisition strategy for vehicles can help us judge whether these features are truly important and why they make sense. For now, let’s use the model to make some predictions.
Using the Model¶
Let’s use this Random Forest model in the Flow on the scoring dataset. We can Deploy it directly from the model page by selecting and opening the model from within the LAB. The goal is to anticipate which cars are likely to fail and the probability of their failure. In brief, these are the rough steps that follow:
- Deploy the model to the Flow, which creates a new Train recipe
- Since it’s the first model we have, it’ll become the Active version
- Predict the likelihood of failure in the scoring data using the Score recipe
- Store output in a new dataset named
The resulting dataset now contains three new columns with the predictions:
- proba_1: probability of failure
- proba_0: probability of non-failure (1 - proba_1)
- prediction: model prediction of failure or not (based on probability threshold)
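The relationship between these three columns can be sketched in plain Python as follows. The default threshold of 0.5 is only an assumption for the example; in practice the cut-off is the one selected for the deployed model.

```python
def score_row(proba_1, threshold=0.5):
    """Derive the three output columns from a predicted failure probability."""
    return {
        "proba_0": 1 - proba_1,
        "proba_1": proba_1,
        # 1 = predicted failure, 0 = predicted non-failure
        "prediction": int(proba_1 >= threshold),
    }


# Hypothetical predicted probabilities for two unlabelled cars
likely_failure = score_row(0.8)
borderline = score_row(0.3, threshold=0.25)
```

Lowering the threshold flags more cars as likely failures, which trades extra replacements for fewer unexpected breakdowns.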
And that’s a wrap! The goal here was to build an end-to-end data product using Dataiku DSS. Hope it was fun.
To summarize, we created an end-to-end workflow to predict car failures. Hopefully, this data product will save the company a lot of money. Once we have a single working model built, we could try to go further to improve this predictive workflow.
There are many ways to improve the accuracy of the predictions, of which some are:
- adding features to the model by combining information in datasets in more ways
- trying different algorithms and hyper-parameter settings