Handling Decimal Notations

When preparing data, you often encounter numeric data in a variety of formats from around the world.

This brief tutorial explains how Dataiku DSS handles the conversion of decimal notations into a universally understood raw format.

Decimal notations

Many parts of the world commonly display large and decimal numbers as 1,234,567.89. However, depending on the country, this same number might more commonly be written with different separators, for example as 1 234 567,89 in France.

Because Dataiku DSS needs to help different systems talk to each other, and those systems may not agree on number formats, DSS treats only “computer-notation” numbers as decimals out of the box.

Thus, both for the float and double storage types, and for the Decimal meaning, DSS will only accept the following kinds of notation:

  • 1234567.89
  • 1.23456789E6
  • -1234.33

Note

You might want to re-read our documentation about storage types and meanings.

While DSS could recognize more forms, other systems, such as Hive, would not, and that would cause various inconsistencies.

Thus, for example, 1,234,567.89 will be recognized as a string by DSS, not a number.
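
To get a feel for this distinction, the sketch below uses Python's built-in float() as a stand-in for a strict “computer-notation” parser. It is only an analogy and not how DSS validates values: float() also accepts a few forms (such as inf or 1_000) that this tutorial does not cover. The helper name is_raw_decimal is hypothetical.

```python
def is_raw_decimal(text: str) -> bool:
    """Return True if `text` parses as a number in plain computer notation.

    Analogy only: this relies on Python's float(), not on DSS's own logic.
    """
    try:
        float(text)
    except ValueError:
        return False
    return True


samples = ["1234567.89", "1.23456789E6", "-1234.33", "1,234,567.89", "1 234 567,89"]
results = {s: is_raw_decimal(s) for s in samples}

for value, ok in results.items():
    # Values that fail to parse would be kept as plain strings.
    print(f"{value!r}: {'number' if ok else 'string'}")
```

The first three samples parse as numbers, while the two values containing thousands separators do not.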

Normalizing in a Prepare recipe

You can use a Prepare script (either in a visual analysis or a recipe) to handle datasets with various kinds of numeric representations. In particular, this is a job for the Convert number formats processor.

Here is a snippet of a dataset in a visual analysis containing decimals formatted in both US and French styles.

"A dataset with decimal columns in both US and French formats"

For the us_notation column, DSS predicts a meaning of “Decimal”, but the first two values are invalid. On the other hand, DSS predicts a meaning of “Decimal (comma)” for the fr_notation column. Our goal is for DSS to recognize both of these columns as valid decimals.

For the fr_notation column, DSS suggests a conversion from the French decimal format to a regular decimal. This step uses the Convert number formats processor to convert this column to a Decimal meaning.

"Context menu to convert French format to regular decimal format"

The same processor can fix the us_notation column. Add a new step to the script and find the Convert number formats processor. The input format should be recognized as “English” and the output format set to “Raw”.

"Output columns after conversion to the raw decimal format"

Now DSS recognizes all values of both output columns with a Decimal meaning, and they can be processed as such by all DSS-supported compute engines.
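
For readers who want to see the idea behind this normalization in code, here is a minimal Python sketch. It is not the implementation of the Convert number formats processor: the function to_raw and its format labels are hypothetical, and the sketch assumes the input format of each value is already known.

```python
def to_raw(text: str, input_format: str) -> str:
    """Rewrite a formatted number into raw notation such as '1234567.89'.

    `input_format` is a hypothetical label used only in this sketch:
      'english' -> ',' groups thousands, '.' marks decimals
      'french'  -> spaces group thousands, ',' marks decimals
    """
    cleaned = text.strip()
    if input_format == "english":
        cleaned = cleaned.replace(",", "")
    elif input_format == "french":
        # Grouping may use a regular, no-break, or narrow no-break space.
        for space in (" ", "\u00a0", "\u202f"):
            cleaned = cleaned.replace(space, "")
        cleaned = cleaned.replace(",", ".")
    else:
        raise ValueError(f"unsupported input format: {input_format!r}")
    float(cleaned)  # raises ValueError if the result is still not a number
    return cleaned


us_raw = to_raw("1,234,567.89", "english")
fr_raw = to_raw("1 234 567,89", "french")
print(us_raw, fr_raw)
```

Both calls yield the same raw string, 1234567.89, which mirrors the goal of the two Prepare steps: one representation that every downstream system can read.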

What’s Next?

You might wish to consult the reference documentation about storage types and meanings or about the Convert number formats processor.