Preparing Time Series Data

Time series preparation cleans and structures time series data so that it is suitable for further analysis or forecasting. For example, a time series dataset may have irregular or missing timestamps, making it necessary to resample the data (so that the timestamps are equispaced) or to interpolate (to infer values for missing timestamps).
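For intuition, here is a minimal sketch of what resampling and interpolation do to an irregular series. It uses pandas purely as an illustration (the plugin itself requires no code, and the values are invented):

    import pandas as pd

    # An irregular daily series: March 2 and March 4 are missing.
    ts = pd.Series(
        [10.0, 14.0, 12.0],
        index=pd.to_datetime(["2013-03-01", "2013-03-03", "2013-03-05"]),
    )

    # Resample to an equispaced 1-day grid, then interpolate the gaps.
    equispaced = ts.resample("1D").mean().interpolate(method="linear")
    print(equispaced)
    # 2013-03-01    10.0
    # 2013-03-02    12.0   <- inferred
    # 2013-03-03    14.0
    # 2013-03-04    13.0   <- inferred
    # 2013-03-05    12.0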

Dataiku DSS provides a fully supported time series preparation plugin that includes recipes for:

  • Resampling
  • Windowing
  • Extrema extraction
  • Interval extraction

Let’s Get Started!

In this tutorial, you will learn to use the time series preparation plugin to perform resampling, windowing, extrema extraction, and interval extraction on a time series dataset.

We will continue to work with the orders_by_date time series dataset from the Visualizing Time Series Data tutorial.

Prerequisites

This tutorial assumes that you have installed the Time Series Preparation plugin in Dataiku DSS.

../../_images/Installed_prep_plugin.png

Workflow Overview

The final pipeline in Dataiku DSS is shown below. You can follow along with the completed project in the Dataiku gallery, or you can create the project within DSS and implement the steps described in this tutorial.

../../_images/time-series-basics-final-flow.png

Create Your Project

From the DSS homepage, click +New Project, select DSS Tutorials from the list, go to the Time Series section, and select Time Series Basics (Tutorial). Notice that the orders_by_date dataset is already uploaded in the Flow.

Alternatively, you can continue in the Time Series Basics project that you worked on in the Visualizing Time Series Data tutorial, or you can create a new +Blank Project in Dataiku DSS, and then upload the orders_by_date.csv file into the project.

The orders_by_date time series dataset consists of four columns:

  • order_date, which has been parsed
  • tshirt_category, an identifier that labels each row as belonging to one of six tshirt categories, each category corresponding to a time series
  • tshirt_quantity, the daily number of items sold in a category
  • amount_spent, the daily amount spent on a tshirt category

The dataset consists of six different time series (one for each value of the tshirt_category column). Each time series also consists of two variables (or dimensions): tshirt_quantity and amount_spent. Note that the data is stored in long format.
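If the long format is unfamiliar: each row holds one timestamp for one series, and the six series are stacked vertically rather than spread across columns. The toy pandas sketch below illustrates the shape (the values are invented, not taken from the dataset):

    import pandas as pd

    # Long format: one row per (date, category) pair; tshirt_category
    # identifies which of the six time series a row belongs to.
    long_df = pd.DataFrame({
        "order_date": pd.to_datetime(["2013-03-29", "2013-03-29", "2013-03-30"]),
        "tshirt_category": ["Hoodie", "White T-Shirt M", "Hoodie"],
        "tshirt_quantity": [12, 8, 15],
        "amount_spent": [300.0, 96.0, 375.0],
    })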

Before using the time series preparation plugin with this dataset, change the storage type for the order_date column from “string” to “date”.

  • Click the drop-down arrow next to “string” in the order_date column and select Date.
  • In the window that appears, check the “Refresh sample” box and click Ok.

The storage type is highlighted by the red rectangle in the following figure. Also, note that the plugin works with a parsed date column, that is, the column meaning (detected by DSS) is of “Date” type (highlighted by the blue oval in the following figure).

../../_images/dataset_parsed.png

Resample the Time Series Dataset

Observe that the dataset contains daily records; however, these records appear at irregular time intervals. To begin, we will use the resampling recipe to transform the dataset into equispaced data. This step is important because the other recipes in the plugin work only with equispaced data.

  • Click the orders_by_date dataset and then the Time Series Preparation plugin in the right panel.
  • Select the Time series resampling recipe from the window that appears.
  • Keep the default “Input time series” as orders_by_date and name the output dataset orders_resampled. Then create the output dataset.

A “Time series resampling” window opens up. To align the time series in the orders_by_date dataset on a daily sampling rate, let’s specify values for the parameters in this window.

  • Set the value of the “Timestamp column” to order_date.
  • For the resampling parameters, specify “Time step”: 1 and “Unit”: Days.
  • To compute values of missing timestamps, specify Quadratic as the interpolation method, and Same as Interpolation to use the same interpolation method for extrapolation.
  • Keep the default values of 0 in the “Edit Series” section.
  • Finally, check the “Long format” box and specify tshirt_category as the value for the “Column with identifier” parameter.
../../_images/resampling_recipe.png
  • Save and Run the recipe to build the output dataset.
../../_images/dataset_resampled.png

Notice that the order_date column now consists of equispaced daily samples. Also, the tshirt_quantity column that consisted of whole numbers (“Integer” type) is now of type “Decimal”, due to the quadratic interpolation method that was used. The same would be true if you chose Mean, Linear, or Cubic as the interpolation method.
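You can reproduce this type change outside DSS. In the hedged pandas sketch below (invented values; method="quadratic" requires SciPy), quadratic interpolation of whole-number quantities yields fractional values:

    import pandas as pd

    qty = pd.Series(
        [3, 9, 4],
        index=pd.to_datetime(["2013-03-01", "2013-03-03", "2013-03-05"]),
    )

    # Quadratic interpolation fits a curve through the known points, so
    # the inferred values are generally not whole numbers (needs SciPy).
    filled = qty.resample("1D").mean().interpolate(method="quadratic")
    print(filled.dtype)  # float64 -- the integer column becomes decimal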

Note

The resampling recipe (like the other recipes in the Time Series Preparation plugin) automatically works on all the numerical columns (integers, decimals, and so on) in your dataset.

Now that the time series are equispaced, we can proceed to apply the other recipes in the plugin.

Compute Functions Over a Moving Window of the Dataset

For noisy time series data, such as our orders_resampled dataset, observing the variations between successive time series values may not always provide insightful information. In such cases, it can be useful to filter or compute aggregations over a rolling window of timestamps.

We will use the windowing recipe to compute a 1-week moving average on the time series data.

  • Click the orders_resampled dataset and then the Time Series Preparation plugin.
  • Select the Time series windowing recipe from the window that appears.
  • Name the output dataset window_functions. Then create the output dataset.

A window opens up for the “Time series windowing” recipe. To perform windowing functions on the dataset, let’s specify values for the parameters in this window.

  • Set the value of the “Timestamp column” to order_date.
  • For the window parameters:
    • Check the “Causal window” box to use a window that does not contain any future observations. The current row in the data will be at the right border of the window.
    • Specify a Rectangular shaped window with “Width”: 1 and “Unit”: Weeks.
    • Set “Include window bounds” to Yes, left only to use a strictly causal window that does not include the present observation.
    • Select the “Aggregations”: Retrieve — to return the time series values for each day, and Average — to compute the moving average.
  • Finally, check the “Long format” box and specify tshirt_category as the value for the “Column with identifier” parameter.
../../_images/windowing_recipe.png
  • Save and Run the recipe to build the output dataset.
../../_images/dataset_windowed.png

Notice that the moving average value for the first order date (2013-03-29) is empty for all the time series. This is because we used a strictly causal moving window.
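The same behavior can be sketched in pandas (an illustration of the logic, not the plugin's implementation): a time-based rolling window with closed="left" keeps only past observations, which is why the first timestamp has no average:

    import pandas as pd

    # Toy equispaced daily series standing in for one time series.
    s = pd.Series(
        range(10),
        index=pd.date_range("2013-03-29", periods=10, freq="D"),
        dtype="float64",
    )

    # 1-week strictly causal window: closed="left" excludes the current
    # observation, so the first row has an empty (NaN) moving average.
    avg = s.rolling("7D", closed="left").mean()
    print(avg.head(3))
    # 2013-03-29    NaN
    # 2013-03-30    0.0
    # 2013-03-31    0.5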

Compute Aggregates Around an Extremum Value

Now let’s use the extrema extraction recipe to find aggregates for each time series around their global maximum value. The extrema extraction recipe lets you find a global extremum (minimum or maximum) in one dimension of a time series and perform windowing functions around the timestamp of the extremum on all dimensions.

We will use this recipe to find the global maximum amount spent for each time series, and then compute the average spent in a 7-day window around each global maximum value.

  • Click the orders_resampled dataset and then the Time Series Preparation plugin in the right panel.
  • Select the Time series extrema extraction recipe from the window that appears.
  • Name the output dataset compute_extremum. Then create the output dataset.

A window opens up for the “Time series extrema extraction” recipe. Let’s specify the following values for the parameters in this window.

  • Set the value of the “Timestamp column” to order_date.
  • Set “Find extremum in column” to amount_spent.
  • Specify the “Extremum type” as Global maximum.
  • For the window parameters:
    • Leave the “Causal window” box unchecked to use a bilateral window (that is, a window which places the timestamp for the extremum point at its center).
    • Specify a Rectangular shaped window with “Width”: 7 and “Unit”: Days.
    • Select the “Aggregations”: Average.
  • Finally, check the “Long format” box and specify tshirt_category as the value for the “Column with identifier” parameter.
../../_images/extrema_recipe.png
  • Save and Run the recipe to build the output dataset.
../../_images/dataset_extrema.png

Notice that the extremum dataset consists of one row for each time series in the orders_resampled dataset.

The average amount spent in the 7 days around the global maximum (of the amount spent) for each time series is considerably lower than the global maximum value. For example, the maximum amount spent on Hoodies was $1679, but in the 7 days surrounding the date of this spending, the average spent on Hoodies was only about $630.
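Assuming the data sits in a long-format pandas frame with the column names above (an assumption made only for this illustration), the recipe's logic can be approximated by locating each series' global maximum and averaging over a bilateral 7-day window centered on it:

    import pandas as pd

    def extremum_window_average(group, value_col="amount_spent", width_days=7):
        g = group.set_index("order_date").sort_index()
        # Timestamp of the global maximum of the chosen column.
        t_max = g[value_col].idxmax()
        # Bilateral window: the extremum sits at the center.
        half = pd.Timedelta(days=width_days) / 2
        window = g.loc[t_max - half : t_max + half]
        return pd.Series({
            "peak_date": t_max,
            "peak_value": g.loc[t_max, value_col],
            "window_average": window[value_col].mean(),
        })

    # One output row per time series, as in compute_extremum
    # (orders_resampled here is a hypothetical DataFrame of the dataset):
    # result = orders_resampled.groupby("tshirt_category").apply(extremum_window_average)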

Finally, let’s use the interval extraction recipe of the plugin with our dataset.

Identify Periods When Time Series Values Are in a Given Range

Sometimes we may be interested in extracting periods when time series values are within a given range. For example, we may be interested in finding periods when the amounts spent on tshirts were in the range of $500 to $1000.

The interval extraction recipe provides the means to perform these kinds of tasks, while allowing small deviations from the acceptable interval. Let’s begin!

  • Click the orders_resampled dataset and then the Time Series Preparation plugin in the right panel.
  • Select the Time series interval extraction recipe from the window that appears.
  • Name the output dataset compute_interval. Then create the output dataset.

A window opens up for the “Time series interval extraction” recipe. Let’s specify the following values for the parameters in this window.

  • Set the value of the “Timestamp column” to order_date.
  • Set “Apply threshold to column” to amount_spent, to specify the column to which the recipe will apply the threshold parameters.
  • Specify the “Minimal valid value” as 500 and the “Maximum valid value” as 1000.
  • For the segment parameters, specify:
    • “Unit”: Days
    • “Acceptable deviation”: 1
    • “Minimal segment duration”: 7

With these parameter values, the recipe returns time segments during which amount_spent stays within the valid interval ($500 to $1000) for at least 7 days. Within these segments, the recipe also tolerates small deviations outside the valid interval (values greater than $1000 or less than $500), provided each deviation lasts no more than 1 day (the acceptable deviation).

  • Finally, check the “Long format” box and specify tshirt_category as the value for the “Column with identifier” parameter.
../../_images/interval_recipe.png
  • Save and Run the recipe to build the output dataset.
../../_images/dataset_interval.png

Notice that the interval dataset consists of segments of the time series dataset that meet the criteria defined by the parameter values. It is interesting to note that only the time series for Hoodie and White T-Shirt M are represented in the output dataset. The other time series do not have time segments that satisfy the criteria used by the recipe.
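As a rough pandas approximation of the interval logic (for simplicity, this sketch omits the 1-day acceptable deviation the recipe tolerates), one can flag in-range days and keep only runs of at least 7 consecutive in-range rows:

    import pandas as pd

    def valid_segments(g, low=500, high=1000, min_days=7):
        g = g.set_index("order_date").sort_index()
        in_range = g["amount_spent"].between(low, high)
        # Label each run of consecutive equal True/False values.
        run_id = (in_range != in_range.shift()).cumsum()
        # Keep runs that are entirely in range and span at least min_days
        # rows (one row per day, since the series was resampled daily).
        keep = in_range.groupby(run_id).transform(
            lambda r: r.all() and len(r) >= min_days
        )
        return g[keep]

    # segments = orders_resampled.groupby(
    #     "tshirt_category", group_keys=False
    # ).apply(valid_segments)   # orders_resampled: hypothetical DataFrame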

Next Steps

Congratulations! Now that you have spent some time preparing your dataset, you are ready to move on to the next step of performing analysis and forecasting with time series datasets.

Proceed to the tutorial on Forecasting Time Series Data with R and Dataiku DSS to learn how.