The Processors Library

The Foundational materials have already introduced the Prepare recipe for data wrangling.

As you saw there, the power of the Prepare recipe lies in its ability to apply a variety of processors to transform your data.

This brief tutorial provides an overview of the functionality that can be found in the processors library.

Prerequisites

It is not strictly required, but we recommend that all newcomers to Dataiku DSS begin with the Foundational learning materials.

Visual Processors

The Prepare recipe gives you access to 90 built-in visual processors for code-free data wrangling.

From text replacements to enrichment of complex data types or various reshaping operations, these processors will help you prepare your data for steps like visualization or modeling.

[Figure: the processors library]

Reshaping data

The goal of many processors in the Prepare recipe is to reshape your data. Below are some of these processors and their effects:

  • Transpose
    • Transpose flips your data so that rows become columns and columns become rows.
  • Pivot
    • The Pivot processor transforms multiple rows into columns. It uses one column as the index, another as labels, and a third as values. It creates one row per distinct index value and one column per distinct label, and fills each cell with the associated value.
  • Fold
    • The opposite of Pivot, Fold takes the values from multiple columns and transforms them into multiple rows, one per folded column.
  • Unfold
    • Unfold is used for categorical data. It transforms the distinct values of a column into binary columns, one per value. This process is also called “dummification”.
  • Split and fold
    • The Split and fold processor splits the values of a column on a delimiter and creates one new row for each resulting piece.

This list is not exhaustive. For example, there are other reshaping processors for nesting or unnesting data. See the reference documentation on processors for a complete list.
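To make the reshaping operations above more concrete, here is a minimal sketch in plain Python, treating a dataset as a list of dictionaries. It is not how Dataiku implements these processors. The function names, the output column names ("column", "value", and the "color_red" style of naming), and the 0/1 encoding are all assumptions chosen for illustration.

```python
from collections import defaultdict


def pivot(rows, index, labels, values):
    """One output row per distinct index value, one column per distinct label."""
    cells = defaultdict(dict)
    seen_labels = []  # keep labels in order of first appearance
    for row in rows:
        cells[row[index]][row[labels]] = row[values]
        if row[labels] not in seen_labels:
            seen_labels.append(row[labels])
    # Missing (index, label) combinations are filled with None.
    return [
        {index: key, **{label: cols.get(label) for label in seen_labels}}
        for key, cols in cells.items()
    ]


def fold(rows, columns, name_col="column", value_col="value"):
    """One output row per folded column, keeping the remaining columns."""
    folded = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in columns}
        for col in columns:
            folded.append({**kept, name_col: col, value_col: row[col]})
    return folded


def unfold(rows, column):
    """One binary column per distinct value of `column` (dummification)."""
    categories = sorted({row[column] for row in rows})
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        result.append(new_row)
    return result


def split_and_fold(rows, column, delimiter):
    """One output row per delimited piece; other columns are copied as-is."""
    return [
        {**row, column: piece}
        for row in rows
        for piece in row[column].split(delimiter)
    ]


# Hypothetical sample data, used only to exercise the functions.
sales = [
    {"store": "A", "quarter": "Q1", "revenue": 10},
    {"store": "A", "quarter": "Q2", "revenue": 12},
    {"store": "B", "quarter": "Q1", "revenue": 7},
]
wide = pivot(sales, index="store", labels="quarter", values="revenue")
long_again = fold(wide, columns=["Q1", "Q2"])
dummies = unfold([{"id": 1, "color": "red"}, {"id": 2, "color": "blue"}], "color")
exploded = split_and_fold([{"id": 1, "tags": "a,b"}], "tags", ",")
```

Folding the pivoted data returns it to a long layout, which illustrates why Fold is described as the opposite of Pivot. Note that the round trip also produces a row for store B in Q2 with an empty value, since the pivot had to fill that cell.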

Enriching data

Many of the processors available in the Prepare recipe can be used to enrich your data, especially when you handle complex data types. Here are a few examples:

  • Parsing dates
    • Extract date-related information and create new columns from it.
  • Classify User-Agent
    • Extract information from a browser’s User-Agent string.
  • Resolve GeoIP
    • Resolve geographic information about an IP address, such as the city or latitude / longitude coordinates.
  • Enrich from French postcode
    • Take a column containing a French postcode and output several columns with demographic data about the cities using this postcode.
  • Geocode (API)
    • Perform forward geocoding (address -> latitude / longitude) using an external API of your choice.
  • Reverse geocoding (plugin)
    • Perform reverse geocoding (latitude / longitude -> address).
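
As an illustration of the kind of enrichment described above, the following minimal sketch parses a date column and adds new columns derived from it, using only the Python standard library. It is not the Prepare recipe's implementation. The function name, the date format, and the output column names are assumptions made for this example.

```python
from datetime import datetime


def extract_date_components(rows, column, fmt="%Y-%m-%d"):
    """Parse `column` with the assumed format and add derived columns."""
    enriched = []
    for row in rows:
        parsed = datetime.strptime(row[column], fmt)
        enriched.append({
            **row,
            f"{column}_year": parsed.year,
            f"{column}_month": parsed.month,
            f"{column}_day": parsed.day,
            # ISO convention: Monday is 1, Sunday is 7.
            f"{column}_day_of_week": parsed.isoweekday(),
        })
    return enriched


# Hypothetical sample row, used only to exercise the function.
orders = [{"order_id": 1, "order_date": "2024-03-15"}]
enriched_orders = extract_date_components(orders, "order_date")
```

The original column is kept alongside the new ones, so the enrichment only adds information. A string that does not match the assumed format raises a ValueError in this sketch.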

What’s Next?

After this overview of the processors library, practice with specific processors for parsing dates, handling decimal formats, and enriching web logs.