Forecasting Time Series Data with R and Dataiku DSS

The R language has several great packages that are built specifically to handle time series data. Using these packages, you can perform time series visualization, modeling, forecasting, etc.

Let’s Get Started!

In this tutorial, you will learn how to use R in DSS for time series analysis, exploration, and modeling, and how to deploy a time series model in DSS.

We will use the passenger dataset from the U.S. International Air Passenger and Freight Statistics Report. This dataset contains data on the total number of passengers for each month and year between a pair of airports, as serviced by a particular airline.

Prerequisites

This tutorial assumes that you have access to a DSS instance with the R integration installed.

Workflow Overview

The final pipeline in Dataiku DSS is shown below. You can follow along with the completed project in the Dataiku gallery, or you can create the project within DSS and implement the steps described in this tutorial.

../../_images/airport_final_flow.png

Create Your Project

From the DSS homepage, click +New Project, select DSS Tutorials from the list, go to the Time Series section, and select Forecasting Time Series With R (Tutorial).

../../_images/airport_initial_flow.png

Notice that the Flow already performs the following preliminary steps:

  1. Uses a Download recipe to import the data from the URL: https://data.transportation.gov/api/views/xgub-n9bw/rows.csv?accessType=DOWNLOAD and creates the passengers dataset.
  2. Uses a Prepare recipe to modify the dataset so that we are left with only those columns that are relevant to the analysis: Date, carriergroup, and Total.
  3. Uses a Group recipe to create a new dataset group0_passengers that contains the total number of travellers per month for carrier group “0”.
../../_images/airport_dataset_cleaned.png

Now we can proceed to perform analysis and forecasting on the group0_passengers dataset.

Plot the Time Series Dataset

First, let’s create a Lines chart to get a feel for the data. To do this:

  • Open the group0_passengers dataset and go to the Charts tab.
  • Select the Lines chart.
  • Drag and drop “Total_passengers” as the Y variable, and “Date” as the X variable.
../../_images/linechart_total_passengers.png

Two patterns stand out. First, there’s a general upward trend in the number of passengers. Second, there is a yearly cycle, with the lowest number of passengers occurring around the new year and the highest number during the late summer. Let’s see if we can use these patterns to forecast the number of passengers after March 2019.

Perform Interactive Analysis With R

For this part, we will use an R notebook. To begin, go back to the Flow and click on the group0_passengers dataset, then click Lab, New Code Notebook, R, and then Create.

../../_images/airport_r_notebook.png

Dataiku DSS will then open an R notebook with some basic starter code already filled in.

../../_images/airport_r_startercode.png

Now that we have an R notebook, let’s focus on the code.

Begin by loading the R libraries that we need for this analysis:

library(dataiku)
library(forecast)
library(dplyr)
library(zoo)
  • The dataiku package lets us read and write datasets to Dataiku DSS.
  • The forecast package has the functions we need for training models to predict time series.
  • The dplyr package has functions for manipulating data frames.
  • The zoo package has functions for working with regular and irregular time series.

Next, we’ll load the data into R from DSS:

df <- dkuReadDataset("group0_passengers", samplingMethod="head", nbRows=100000)
head(df)
../../_images/airport_passengers_head.png

Now that we’ve loaded our data, let’s create a time series object using the ts() function.

The ts() function takes a numeric vector, the start time and the frequency of measurement. For our dataset, these values are: Total_passengers, 1990 (the year for which the measurements begin), and a frequency of 12 (months in a year).

ts_passengers = ts(df$Total_passengers, start=1990, frequency=12)
plot(ts_passengers)
../../_images/airport_data_plot.png
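Note that ts() stores only the values, a start time, and a frequency. It does not read the Date column, so the rows must already be sorted by date with no missing months. The following minimal sketch uses made-up values and an assumed January 2018 start (neither comes from the passenger data) to show how the time index is derived. Passing start=1990, as above, is equivalent to start=c(1990, 1).

```r
# Toy monthly series: 18 made-up values starting in January 2018
values <- c(10, 12, 15, 14, 18, 21, 25, 24, 19, 16, 13, 11,
            11, 13, 17, 15, 20, 23)
toy_ts <- ts(values, start = c(2018, 1), frequency = 12)

# start() and end() report (year, month within the year)
start(toy_ts)   # 2018 1
end(toy_ts)     # 2019 6

# time() returns the decimal-year index: January is .000, February is .083, ...
head(time(toy_ts), 3)

# window() extracts a sub-period by date rather than by row number
first_year <- window(toy_ts, end = c(2018, 12))
length(first_year)   # 12
```

The decimal-year index returned by time() is the same representation used later when the forecast is converted into a data frame.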

We’ve successfully visualized our time series data using the chart tool in DSS and the plot function in R. Now let’s start modeling!

Choose a Forecasting Model

We’re going to try three different forecasting methods and deploy the best one to DSS. In general, it is good practice to test several different modeling methods and choose the method that provides the best performance.

Model 1: Exponential Smoothing State Space Model

The ets() function in the forecast package fits exponential smoothing state space (ETS) models. This function automatically selects the model form and optimizes its parameters.

Let’s use the function to make a forecast for the next 24 months.

m_ets = ets(ts_passengers)
f_ets = forecast(m_ets, h=24) # forecast 24 months into the future
plot(f_ets)
../../_images/airport_ets_forecast.png

The point forecast is shown in blue, and the shaded bands are the 80% and 95% prediction intervals that forecast() computes by default. Visually, the forecast roughly matches the historical pattern of the data.
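The forecast object can also be inspected directly rather than only plotted. The sketch below uses R's built-in AirPassengers series as a stand-in for ts_passengers (an assumption, made so that the example runs outside DSS) and shows where the point forecasts and interval bounds are stored.

```r
library(forecast)

# AirPassengers is a monthly ts that ships with R; it stands in for ts_passengers
fit <- ets(AirPassengers)
fc  <- forecast(fit, h = 24)

fc$level            # interval levels; forecast() uses 80 and 95 by default
head(fc$mean, 3)    # point forecasts
head(fc$lower, 3)   # lower bounds, one column per level
head(fc$upper, 3)   # upper bounds, one column per level

# Request a single 90% interval instead of the two defaults
fc90 <- forecast(fit, h = 24, level = 90)
head(fc90$lower, 3)
```

With the default levels, the second column of lower and upper holds the 95% bounds.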

Model 2: Autoregressive Integrated Moving Average (ARIMA) Model

The auto.arima() function searches over candidate ARIMA models and returns the best one according to an information criterion (AICc by default). Using auto.arima() is almost always preferable to choosing the model orders by hand and calling arima() directly. For more information on the auto.arima() function, see the auto.arima reference documentation.

Let’s use the function to make a forecast for the next 24 months.

m_aa = auto.arima(ts_passengers)
f_aa = forecast(m_aa, h=24)
plot(f_aa)
../../_images/airport_arima_forecast.png

Observe that these prediction intervals are a bit narrower than those for the ETS model. This could be the result of a better fit to the data. Let’s train a third model and then do a model comparison.

Model 3: TBATS Model

The last model we’re going to train is a TBATS model. This model is designed for use when there are multiple cyclic patterns (e.g. daily, weekly and yearly patterns) in a single time series. We’ll see if this model can detect complicated patterns in our time series.

m_tbats = tbats(ts_passengers)
f_tbats = forecast(m_tbats, h=24)
plot(f_tbats)
../../_images/airport_tbats_forecast.png

Now we have three models that all seem to give reasonable predictions. Let’s compare them to see which one performs best.

Compare Models

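We’ll use the Akaike Information Criterion (AIC) to compare the different models. AIC measures how well a model fits the data while penalizing model complexity, so the model with the smallest AIC value offers the best trade-off. Strictly speaking, AIC values are only comparable between models of the same class fitted to the same data, so treat a comparison across ETS, ARIMA, and TBATS as a rough guide.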

barplot(c(ETS=m_ets$aic, ARIMA=m_aa$aic, TBATS=m_tbats$AIC),
        col="light blue", ylab="AIC")
../../_images/airport_aic_comparison.png
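A complementary check is to hold out the most recent observations and compare forecast errors on data the models have not seen. The sketch below is not part of the original workflow. It uses the built-in AirPassengers series as a stand-in for ts_passengers, and the 24-month holdout and the choice of RMSE are assumptions made for illustration.

```r
library(forecast)

# Stand-in data: AirPassengers ships with R (monthly, 1949 to 1960)
train <- window(AirPassengers, end = c(1958, 12))
test  <- window(AirPassengers, start = c(1959, 1))
h <- length(test)   # 24 months held out

# Fit each model on the training window only
fits <- list(
  ETS   = ets(train),
  ARIMA = auto.arima(train),
  TBATS = tbats(train)
)

# Forecast the held-out period and score against the actual values
rmse <- sapply(fits, function(fit) {
  fc <- forecast(fit, h = h)
  accuracy(fc, test)["Test set", "RMSE"]
})

barplot(rmse, col = "lightblue", ylab = "Holdout RMSE")
```

A lower bar means smaller errors on the held-out months. For the tutorial's passenger data, the AIC chart above remains the basis for the choice that follows.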

We see that the ARIMA model has the lowest AIC, so we will use it. Let’s now proceed to convert our interactive notebook into an R recipe that can be integrated into our DSS workflow.

To do this, we first have to store the output of the forecast() function into a data frame, so that we can pass it to DSS. The following code can be broken down into these three steps:

  1. Find the last date for which we have a measurement.
  2. Create a data frame with the prediction for each month. We’ll also include the lower and upper bounds of the predictions, and the date. Since the time series represents dates as decimal years, each month is 1/12 of a year.
  3. Split the date column into separate columns for year and month.
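# Decimal-year timestamp of the last observation
last_date = index(ts_passengers)[length(ts_passengers)]
# Point forecast plus the 95% bounds (column 2 of lower/upper) for the next 24 months
data.frame(passengers_predicted=f_aa$mean,
         passengers_lower=f_aa$lower[,2],
         passengers_upper=f_aa$upper[,2],
         date=last_date + seq(1/12, 2, by=1/12)) %>%
  mutate(year=floor(date)) %>%                              # whole part is the year
  mutate(month=round(((date %% 1) * 12) + 1)) -> forecast   # fractional part gives the month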
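Because decimal years are floating-point numbers, a date that should fall exactly on a year boundary could come out very slightly below it, in which case the floor() and %% arithmetic would report month 13 of the preceding year. One way to guard against this, sketched here with a hypothetical last_date of March 2019, is to round to a whole number of months before splitting into year and month.

```r
# Hypothetical last observation: March 2019 as a decimal year (2019 + 2/12)
last_date <- 2019 + 2/12
dates <- last_date + seq(1/12, 2, by = 1/12)   # the next 24 months

# Convert to a whole number of months, rounding away floating-point noise
months_total <- round(dates * 12)
year  <- months_total %/% 12       # integer division gives the year
month <- months_total %% 12 + 1    # remainder 0..11 maps to months 1..12

head(data.frame(year, month), 3)   # April, May, and June 2019
tail(data.frame(year, month), 1)   # March 2021
```

The same two lines could replace the floor() and %% steps inside mutate() if the recipe needs to be robust to other start dates.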

Awesome! Now that we have the code to create the forecast for the next 24 months and the code to convert the result into a data frame, we are all set to deploy the model to DSS.

Deploy the Model to DSS

To deploy our model, we must create a new R recipe. In the notebook:

  • Click +Create Recipe and select R recipe - native R language.
  • Ensure that the group0_passengers dataset is the input dataset, and create a new managed dataset, forecast, as the output of the recipe.
../../_images/airport_output_dataset.png
  • Create the recipe, and DSS opens the recipe editor with the code from the notebook in the recipe.
../../_images/airport_final_recipe.png

We can optimize the code in the recipe to only run the portions that will output the forecast dataset, but for now, simply run the recipe. Return to the Flow where you can see our newly created dataset.

../../_images/airport_final_flow.png

Open the forecast dataset to look at the new predictions.

../../_images/airport_forecast_dataset.png

Next Steps

Congratulations! Now that you have spent some time forecasting the time series dataset with R in DSS, you may also want to practice using the Forecast Plugin to repeat this tutorial without using code.