Reshaping Data from Long to Wide Format¶
Pivoting values, sometimes called unstacking, is the action of reshaping a dataset from long format to wide format.
Suppose we have a dataset that contains an aggregated order log from a retail company, and each row represents the total orders per country, per type of item sold.
If we want to examine the amount spent on each item, across countries, we could reorganize the dataset by pivoting the values as follows:
We’ll work with the Haiku T-shirt company’s Orders_by_Country_Category dataset, which you can download and import into a new project.
The original dataset contains a separate row per country, per t-shirt category.
We are interested in examining how sales of categories of shirts compare across countries; however, the current dataset is hard to read for this purpose. Instead, we want to pivot the dataset on the country, so that each row corresponds to a different t-shirt category, and each column records the amount spent in each country.
From the Actions menu of the Orders_by_Country_Category dataset, choose Pivot. Choose ip_address_country as the column to pivot by. Give the output dataset a descriptive name like
Pivot_by_Country and click Create Recipe.
In the Pivot recipe:
- The Create columns with field is already populated by ip_address_country. For each distinct value of this column, the Pivot recipe will create a new column for each statistic specified in the Populate content with section. By default, only the 20 most frequently occurring values are pivoted.
- Select tshirt_category as the row identifier. The output dataset will have a single row for each category of t-shirt.
- Choose to populate content with total_sum. By default it selects the count of total_sum; change the aggregate to first. Since each combination of country and t-shirt category is unique, it shouldn’t matter whether you use the first, last, min, max, sum, or average as the aggregation method. Deselect Count of records.
After clicking Run, the resulting dataset is ready to go. Note that because all categories of shirts have been sold in many countries, more than 20 countries are the “most frequently occurring”, and the Pivot recipe chooses 20 from among the ties.
What Happens When New Data Is Added¶
Unlike most other recipes, the schema of the output dataset of a Pivot recipe is affected by the values of the input dataset. In this case, the number of columns in the output dataset was determined by the fact that there are more than 20 countries, so the number of columns is limited to the 20 most frequent. But what happens as new orders come in? Let’s go back to the Output step of the Pivot recipe.
The Pivot recipe is a two-pass algorithm that determines the values to pivot on in the first pass in order to construct the schema of the output dataset, and then uses the second pass to do the pivot.
If there are a lot of unique values in the column to pivot by, the number of columns in the output dataset can grow out of control, so take care to Analyze the column to pivot by and choose the number of values to pivot carefully. By default, the maximum number of columns created by the Pivot recipe is 200.
The schema of the output dataset was established the first time we ran the Pivot recipe. Rebuilding the dataset continues to use the same schema, performing a single pass on the input of the Pivot recipe – and ignoring any changes to which 20 countries are the most frequently occurring, unless we update the schema. There are two options for updating the schema:
- You can manually drop the existing schema by clicking the Drop button and
- You can select Recompute schema at each run and it will automatically update every time the Pivot_by_Country dataset is built. The danger in this method is that a changed schema can break the flow if there are recipes that expect the old schema.
If there are particular countries that we want to examine, we can provide an explicit list to the Pivot recipe.
Choosing the Values to Pivot¶
Let’s go back to the Pivot step of the Pivot recipe. We can specify the list of values we want to pivot by selecting explicit list from the pivoted values dropdown and then adding each individual value. (Alternatively, if we want to include most of the column’s values, we can +Add Values from Dataset and then remove the undesired values from the list.)
Click Run, and the resulting dataset has the number of pivoted columns limited to those explicitly listed in the Pivot recipe. By providing an explicit list, the schema of the output dataset is also determined in advance, and only a single data pass is required to do the pivot.