Prepare recipe - using a custom Python function as a processor
The Python function processor allows you to perform complex row-wise operations on your dataset. The output of the function depends on the mode used.
Cell mode
In cell mode, Dataiku expects a single column as output. This is useful for finding the largest value across columns, or the sum of a set of columns whose names follow some pattern.
In the example below, the process function finds the longest string value across the columns in the dataset.
def process(row):
    # In 'cell' mode, the process function must return
    # a single cell value for each row
    # The 'row' argument is a dictionary of columns of the row
    max_len = -1
    longest_str = None
    for val in row.values():
        if val is None:
            continue
        if len(val) > max_len:
            max_len = len(val)
            longest_str = val
    return longest_str
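
The other use case mentioned above, summing a set of columns whose names follow a pattern, could be sketched as follows. This is a minimal illustration, not a reference implementation: the `sales_` prefix and the sample column names are hypothetical, and cells that are empty or do not parse as numbers are simply skipped.

```python
def process(row):
    # Cell mode: return one value per row.
    # Assumption: the columns to add up share the hypothetical prefix "sales_".
    total = 0.0
    for name, val in row.items():
        if not name.startswith("sales_"):
            continue
        if val is None or val == "":
            continue  # skip empty cells
        try:
            total += float(val)  # cell values are strings, so convert first
        except (TypeError, ValueError):
            continue  # skip cells that do not parse as numbers
    return total


# Quick local check on a plain dictionary (hypothetical sample data)
example = {"id": "7", "sales_q1": "10.5", "sales_q2": "4",
           "sales_q3": None, "sales_q4": "n/a"}
result = process(example)
```

Here only `sales_q1` and `sales_q2` contribute to the total; the `id` column is ignored because its name does not match the prefix.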

Row mode
In row mode, Dataiku expects a row as output, which replaces the existing row. This is useful when you need to perform operations in place on the rows (as in the example below), or when you need to add multiple columns to the dataset (for example, if you compute the mean and standard deviation of the values in a row across columns).
In the example below, the process function adds 5 to the integer-valued columns. Note that the row parameter passed to the process function is a dictionary of string values, so we use the ast module to check whether each value is an integer.
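import ast

def process(row):
    # In 'row' mode, the process function must return the full row.
    # You may modify the 'row' in place to
    # keep the previous values of the row.
    for key in row.keys():
        try:
            # Cell values arrive as strings, so parse them first
            value = ast.literal_eval(row[key])
        except (ValueError, SyntaxError, TypeError):
            # Not a Python literal (plain text, empty cell, ...): leave as is
            continue
        # 'type(...) is int' also excludes booleans such as "True"
        if type(value) is int:
            row[key] = value + 5
    return row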
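
The second use case described above, adding several computed columns at once, might look like the sketch below. It is an illustration under stated assumptions: the output column names `mean` and `std` are hypothetical, every cell that parses as a finite number is included, and the standard deviation is the population form (dividing by the number of values).

```python
import math


def process(row):
    # Row mode: return the full row, here with two extra columns.
    # Assumption: "mean" and "std" are hypothetical output column names.
    values = []
    for val in list(row.values()):
        try:
            number = float(val)  # cell values are strings
        except (TypeError, ValueError):
            continue  # skip text and empty cells
        if math.isfinite(number):
            values.append(number)
    if values:
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        row["mean"] = mean
        row["std"] = math.sqrt(variance)  # population standard deviation
    else:
        row["mean"] = None
        row["std"] = None
    return row


# Quick local check on a plain dictionary (hypothetical sample data)
example = {"name": "a", "x": "2", "y": "4", "z": "6"}
result = process(example)
```

The existing columns are kept because the function modifies and returns the same dictionary it received.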

Rows mode
In rows mode, Dataiku expects one or more rows as output, which replace the existing row. This is especially useful when you need to transform your data from wide to long format.
In the example below, the process function splits each row into two. Where before a single row contained both a person’s work and home zip code and state, now each row contains either the home or work information, along with a new column that indicates whether it is the home or work information.
# Modify the process function to fit your needs
def process(row):
    # In 'multi rows' mode, the process function
    # must return an iterable list of rows.
    ret = []
    home = {"name": row["name"],
            "type": "home",
            "zip": row["home_zip"],
            "state": row["home_state"]}
    work = {"name": row["name"],
            "type": "work",
            "zip": row["work zipcode"],
            "state": row["work state"]}
    ret.append(home)
    ret.append(work)
    return ret
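
To see the wide-to-long effect outside of Dataiku, a function of this kind can be tried on plain dictionaries. The sketch below is only a local stand-in, not how Dataiku runs the processor: `apply_rows_mode` is a hypothetical helper, and the column names (`home_zip`, `work_zip`, and so on) and sample values are made up for illustration.

```python
def process(row):
    # Rows-mode style function: returns a list of output rows.
    # Assumption: hypothetical columns named "<kind>_zip" and "<kind>_state".
    out = []
    for kind in ("home", "work"):
        out.append({
            "name": row["name"],
            "type": kind,
            "zip": row[kind + "_zip"],
            "state": row[kind + "_state"],
        })
    return out


def apply_rows_mode(rows, func):
    # Hypothetical local helper: every input row is replaced
    # by whatever rows the function returns for it.
    result = []
    for row in rows:
        result.extend(func(dict(row)))
    return result


# One wide row in, two long rows out (hypothetical sample data)
wide = [
    {"name": "Ada", "home_zip": "10001", "home_state": "NY",
     "work_zip": "07030", "work_state": "NJ"},
]
long_rows = apply_rows_mode(wide, process)
```

Each input row yields one output row per address type, so the result has twice as many rows as the input.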
