Applying Prepare Steps to Multiple Columns¶
In many cases when preparing data, you may want to apply the same operation to multiple columns.
This can be achieved in a Prepare recipe or visual analysis in a few different ways, depending on the type of operation and type of data at hand.
It is not strictly required, but we recommend all newcomers to Dataiku DSS begin with the Foundational learning materials.
From the Columns view¶
In the Columns View, you can select multiple columns and choose an operation from the Actions menu to apply to the selected columns. This creates a new step in the Script, and is essentially a shortcut to manually re-creating the steps individually. It is especially useful for processors that cannot be extended to multiple columns in the step editor.
For example, you can apply this strategy to rename multiple columns at once. You may have several columns that relate to a customer’s home contact information. In order to help differentiate this from their work contact information, you can select the needed columns from the Columns View, choose Rename from the Actions menu, and then add
home_ as a prefix to all of the selected columns.
As shown in the video below, the new step in the script is a Rename processor that renames each column individually.
The processors that can be applied to multiple columns in this manner depend upon the data type of columns selected:
- Selecting all text columns would suggest applying string transformations.
- Selecting all numeric columns could suggest filtering by a numeric range.
- Selecting a mixture of different data types might suggest removing invalid rows or clearing invalid cells based on meaning.
From the step editor¶
Another way to apply a step to multiple columns is through the step editor. Many processors have several modes for specifying the column.
For example, you may want to fill empty cells of a column with a fixed value, such as
Unknown. In the Fill empty cells with fixed value processor, there are a number of options for selecting columns.
- Single: Specify a single column by name, and it fills empty cells in that column. In the example below, home_zip is the selected column.
- Multiple: Specify a list of columns by name, and it fills empty cells in each of those columns. In the example below, home_zip and work zipcode are the selected columns.
- Pattern: Specify a regular expression. The processor operates on all columns that match the pattern. In the example below, the pattern
.*zip.*selects home_zip and work zipcode.
- All: The processor operates on all columns.
The video below demonstrates these differences.
Other examples of processors that can be applied in this way include processors to impute missing values, flag or filter rows, and rounding or converting number formats.
Within a Python function¶
It is also possible to process multiple columns using a Python function. This is especially useful for processors that cannot be extended to multiple columns within the step editor.
In the example below, the
process function adds 5 to the integer-valued columns.
import ast def process(row): # In 'row' mode, the process function must return the full row. # You may modify the 'row' in place to # keep the previous values of the row. for i in row.keys(): try: isint = type(ast.literal_eval(row[i])) is int except: isint = False if isint: row[i] = int(row[i]) + 5 return row