## Preparing Time Series Data

#### Summary

# Preparing Time Series Data

Time series preparation is useful for cleaning and structuring time series data in ways that are suitable for further analysis or forecasting. For example, a time series dataset may have irregular or missing timestamps, so that it becomes necessary to implement resampling (to make the time stamps equispaced) or interpolation (to infer values for missing timestamps).

Dataiku DSS provides a fully supported time series preparation plugin that includes recipes for:

- Resampling
- Windowing
- Extrema extraction
- Interval extraction

## Let’s Get Started!

In this tutorial, you will learn to use the time series preparation plugin to perform resampling, windowing, extrema extraction, and interval extraction on a time series dataset.

We will continue to work with the orders_by_date time series dataset from the Time Series Visualization lesson.

### Prerequisites

This tutorial assumes that:

- You have downloaded the orders_by_date file and extracted the time series dataset.
- You have installed the Time Series Preparation plugin on your Dataiku instance (see Instructions for installing plugins).

### Workflow Overview

The final pipeline in Dataiku DSS is shown below. You can follow along with the completed project in the Dataiku gallery, or you can create the project within DSS and implement the steps described in this tutorial.

## Create Your Project

From the DSS homepage, click **+New Project**, select **DSS Tutorials** from the list, go to the **Time Series** section, and select **Time Series Basics (Tutorial)**. Notice that the *orders_by_date* dataset is already uploaded in the **Flow**.

Alternatively, you can continue in the **Time Series Basics** project that you worked on in the Visualizing Time Series Data tutorial, or you can create a new **+Blank Project** in Dataiku DSS, and then upload the orders_by_date.csv file into the project.

The *orders_by_date* time series dataset consists of four columns:

*order_date*, which has been parsed*tshirt_category*, an identifier that labels each row as belonging to one of six tshirt categories, each category corresponding to a time series*tshirt_quantity*, the daily number of items sold in a category*amount_spent*, the daily amount spent on a tshirt category

The dataset consists of six different time series (one for each value of the *tshirt_category* column). Each time series also consists of two variables (or dimensions): *tshirt_quantity* and *amount_spent*. Note that the data is stored in long format.

Before using the time series preparation plugin with this dataset, change the storage type for the *order_date* column from “string” to “date”.

- Click the drop-down arrow next to “string” in the
*order_date*column and select**Date**. - In the window that appears, check the “Refresh sample” box and click
**Ok**.

The storage type is highlighted by the red rectangle in the following figure. Also, note that the plugin works with a parsed date column, that is, the column meaning (detected by DSS) is of “Date” type (highlighted by the blue oval in the following figure).

## Resample the Time Series Dataset

Observe that the dataset contains daily records, however, these records appear in irregular time intervals. To begin, we will use the resampling recipe to transform the dataset into equispaced data. This step is important because the other recipes in the plugin work only with equispaced data.

- Click the
*orders_by_date*dataset and then the**Time Series Preparation**plugin in the right panel. - Select the
**Time series resampling**recipe from the window that appears. - Keep the default “Input time series” as
*orders_by_date*and name the output dataset`orders_resampled`

. Then create the output dataset.

A “Time series resampling” window opens up. To align the time series in the *orders_by_date* dataset on a daily sampling rate, let’s specify values for the parameters in this window.

- Set the value of the “Timestamp column” to
`order_date`

. - For the resampling parameters, specify “Time step”:
`1`

and “Unit”:**Days**. - To compute values of missing timestamps, specify
**Quadratic**as the interpolation method, and**Same as Interpolation**to use the same interpolation method for extrapolation. - Keep the default values of
`0`

in the “Edit Series” section. - Finally, check the “Long format” box and specify
`tshirt_category`

as the value for the “Column with identifier” parameter.

**Save**and**Run**the recipe to build the output dataset.

Notice that the *order_date* column now consists of equispaced daily samples. Also, the *tshirt_quantity* column that consisted of whole numbers (“Integer” type) is now of type “Decimal”, due to the interpolation method that was used. This would also be the case if you choose **Mean**, **Linear**, **Quadratic**, or **Cubic** as the interpolation method.

Note

When you use the resampling recipe (as well as the other recipes in the Time Series Preparation plugin) on your dataset, it automatically works on all the numerical columns (integers, decimals, …) in the dataset.

Now that the time series are equispaced, we can proceed to apply the other recipes in the plugin.

## Compute Functions Over a Moving Window of the Dataset

For noisy time series data, such as our *orders_resampled* dataset, observing the variations between successive time series values may not always provide insightful information. In such cases, it can be useful to filter or compute aggregations over a rolling window of timestamps.

We will use the windowing recipe to compute a 1-week moving average on the time series data.

- Click the
*orders_resampled*dataset and then the**Time Series Preparation**plugin. - Select the
**Time series windowing**recipe from the window that appears. - Name the output dataset
`window_functions`

. Then create the output dataset.

A window opens up for the “Time series windowing” recipe. To perform windowing functions on the dataset, let’s specify values for the parameters in this window.

- Set the value of the “Timestamp column” to
`order_date`

. - For the window parameters:
- Check the “Causal window” box to use a window that does not contain any future observations. The current row in the data will be at the right border of the window.
- Specify a
**Rectangular**shaped window with “Width”:`1`

and “Unit”:**Weeks**. - Set “Include window bounds” to
**Yes, left only**to use a strictly causal window that does not include the present observation. - Select the “Aggregations”:
**Retrieve**— to return the time series values for each day, and**Average**— to compute the moving average.

- Finally, check the “Long format” box and specify
`tshirt_category`

as the value for the “Column with identifier” parameter.

**Save**and**Run**the recipe to build the output dataset.

Notice that the moving average value for the first order date (2013-03-29) is empty for all the time series. This is because we used a strictly causal moving window.

## Compute Aggregates Around an Extremum Value

Now let’s use the extrema extraction recipe to find aggregates for each time series around their global maximum value. The extrema extraction recipe lets you find a global extremum (minimum or maximum) in one dimension of a time series and perform windowing functions around the timestamp of the extremum on all dimensions.

We will use this recipe to find the global maximum amount spent for each time series, and then compute the average spent in a 7-day window around each global maximum value.

- Click the
*orders_resampled*dataset and then the**Time Series Preparation**plugin in the right panel. - Select the
**Time series extrema extraction**recipe from the window that appears. - Name the output dataset
`compute_extremum`

. Then create the output dataset.

A window opens up for the “Time series extrema extraction” recipe. Let’s specify the following values for the parameters in this window.

- Set the value of the “Timestamp column” to
`order_date`

. - Set “Find extremum in column” to
`amount_spent`

. - Specify the “Extremum type” as
**Global maximum**. - For the window parameters:
- Leave the “Causal window” box unchecked to use a bilateral window (that is, a window which places the timestamp for the extremum point at its center).
- Specify a
**Rectangular**shaped window with “Width”:`7`

and “Unit”:**Days**. - Select the “Aggregations”:
**Average**.

- Finally, check the “Long format” box and specify
`tshirt_category`

as the value for the “Column with identifier” parameter.

**Save**and**Run**the recipe to build the output dataset.

Notice that the *extremum* dataset consists of one row for each time series in the *orders_resampled* dataset.

The average amount spent in the 7-days around the global maximum (of the amount spent) for each time series is considerably lower than the global maximum value. For example, the maximum amount spent on Hoodies was $1679, but in the 7 days surrounding the date of this spending, the average spent on hoodies was only about $630.

Finally, let’s use the interval extraction recipe of the plugin with our dataset.

## Identify Periods When Time Series Values Are in a Given Range

Sometimes we may be interested in extracting periods when time series values are within a given range. For example, we may be interested in finding periods when the amounts spent on tshirts were in the range of $500 to $1000.

The interval extraction recipe provides the means to perform these kinds of tasks, while allowing small deviations from the acceptable interval. Let’s begin!

- Click the
*orders_resampled*dataset and then the**Time Series Preparation**plugin in the right panel. - Select the
**Time series interval extraction**recipe from the window that appears. - Name the output dataset
`compute_interval`

. Then create the output dataset.

A window opens up for the “Time series interval extraction” recipe. Let’s specify the following values for the parameters in this window.

- Set the value of the “Timestamp column” to
`order_date`

. - Set “Apply threshold to column” to
`amount_spent`

, to specify the column to which the recipe will apply the threshold parameters. - Specify the “Minimal valid value” as
`500`

and the “Maximum valid value” as`1000`

. - For the segment parameters, specify:
- “Unit”:
**Days** - “Acceptable deviation”:
`1`

- “Minimal segment duration”:
`7`

- “Unit”:

With these parameter values, the recipe returns time segments when *amount_spent* is within the valid interval ($500 — $1000) for at least 7 days. Within these time segments, the recipe also allows small deviations from the valid interval by returning values greater than $1000 or less than $500 — provided those occurrences last no more than 1 day (the acceptable deviation).

- Finally, check the “Long format” box and specify
`tshirt_category`

as the value for the “Column with identifier” parameter.

**Save**and**Run**the recipe to build the output dataset.

Notice that the *interval* dataset consists of segments of the time series dataset that meet the criteria defined by the parameter values. It is interesting to note that only the time series for *Hoodie* and *White T-Shirt M* are represented in the output dataset. The other time series do not have time segments that satisfy the criteria used by the recipe.

## Next Steps

Congratulations! Now that you have spent some time preparing your dataset, you are ready to move on to the next step of performing analysis and forecasting with time series datasets.

See also