Data exploration goes hand in hand with data wrangling. A typical progression is to gain an understanding of the columns in your dataset, the distribution of values within columns of interest, and then to explore values and patterns of values within the dataset.
As you enrich and transform a dataset, this exploration process can occur in a number of ways, including:
- Within the Explore tab of a dataset
- Within a Prepare recipe
- Within a Visual Analysis
It is not strictly required, but we recommend all newcomers to Dataiku DSS begin with the Foundational learning materials.
Changing from the default Table view to the Columns view is often useful when working with many columns. Using filtering and sorting, you can discover columns that are similar and find columns you are looking for.
As demonstrated in the video below, you can filter columns in three ways:
- By Text
  - Show only columns whose names contain the typed text.
- By Meaning
  - Show only columns with the selected meaning. Note that when a meaning encompasses a sub-meaning, such as “Text”, which includes “Natural Language”, columns with the sub-meaning are included in the filter.
- By Status
  - Show only columns with all valid values, with at least one invalid value, or with at least one missing value. This allows you to quickly identify which columns are clean and which columns have issues.
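If you want to reason about these filters outside the interface, the sketch below mimics filtering by text and by missing-value status on a small in-memory sample. It is an illustration only, not how Dataiku DSS implements the Columns view: the `sample` data, the function names, the case-insensitive match, and the use of `None` for a missing value are all assumptions made for the example.

```python
# Hypothetical in-memory sample: column name -> list of cell values.
# None stands in for a missing value (an assumption for this sketch).
sample = {
    "customer_id": [1, 2, 3],
    "customer_name": ["Ann", None, "Cy"],
    "order_total": [10.5, 20.0, None],
    "country": ["FR", "US", "US"],
}

def filter_by_text(columns, text):
    """Keep columns whose names contain the typed text (case-insensitive here)."""
    needle = text.lower()
    return {name: vals for name, vals in columns.items() if needle in name.lower()}

def filter_by_missing(columns, has_missing=True):
    """Keep columns that have (or do not have) at least one missing value."""
    return {
        name: vals
        for name, vals in columns.items()
        if any(v is None for v in vals) == has_missing
    }

# Applying one filter to the result of another narrows the selection further.
matches = filter_by_missing(filter_by_text(sample, "customer"))
print(list(matches))  # ['customer_name']
```

Passing `has_missing=False` instead returns only the columns with no missing values, which corresponds loosely to looking for clean columns.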
You can sort by various criteria, some of which are only appropriate to columns with a numeric meaning. It is generally useful to display the sort criteria in the column under the sort menu.
Any filtering and sorting you apply is cumulative.
The Table view allows you to quickly navigate to a column by pressing c and then typing part of the column name. The dropdown selection updates as you type to show columns whose names contain the typed text. Additionally, you can display a selection of columns as shown in the short video below.
Distributions of values in columns
As demonstrated in the video below, two built-in methods can help you quickly explore the distributions of values in columns:
- Quick column stats (to the right of the Columns view icon) shows a histogram for each column at once.
- The Analyze dialog (found in the Context menu after clicking on a column name) provides greater detail than the Quick column stats, along with the ability to take actions based upon your findings.
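To make the idea of a per-column distribution concrete, here is a minimal sketch of the two summaries such tools typically show: a histogram for numeric values and a frequency count for categorical values. It is not the computation Dataiku DSS performs. The equal-width binning, the function names, and the example data are assumptions chosen for illustration.

```python
from collections import Counter

def histogram(values, bins=5):
    """Count non-missing numeric values in equal-width bins."""
    present = [v for v in values if v is not None]
    counts = [0] * bins
    if not present:
        return counts
    lo, hi = min(present), max(present)
    if lo == hi:
        # All values are identical: place them in the first bin.
        counts[0] = len(present)
        return counts
    width = (hi - lo) / bins
    for v in present:
        # The maximum value goes into the last bin rather than a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

def top_values(values, n=3):
    """Most frequent non-missing values, useful for categorical columns."""
    return Counter(v for v in values if v is not None).most_common(n)

ages = [21, 25, 25, 30, 41, 58, None, 60]
print(histogram(ages, bins=4))  # [4, 0, 1, 2]
print(top_values(["FR", "US", "US", None, "DE", "US"], n=1))  # [('US', 3)]
```

Missing values are skipped in both functions, so the counts describe only the values that are present.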
While exploring data, the summaries provided by options such as the Analyze dialog are always based on the design sample. In order to see results for the whole dataset, you need to open the Analyze dialog from a dataset, not a recipe.
Using coloring, filtering, and highlighting, you can zero in on values of interest in the Table view.
By default, cells are colored by meaning validity, with red for cells that do not match the column meaning. However, you can also color cells by column values.
- Numeric column values are binned and colored with increasing intensity from low to high values.
- Categorical column values are colored with a different color for each of the most commonly occurring categories, and no color (white) for all other categories.
- Columns with mostly unique values are shaded light grey for all values.
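The coloring rules above can be sketched as two small functions: one that bins numeric values into intensity levels, and one that assigns a color index to the most common categories only. This is an approximation for intuition, not the product's actual logic. The number of levels, the number of colored categories, and the function names are all hypothetical.

```python
from collections import Counter

def color_numeric(values, levels=5):
    """Map each numeric value to an intensity level, 0 (low) to levels - 1 (high)."""
    present = [v for v in values if v is not None]
    if not present:
        return [None] * len(values)
    lo, hi = min(present), max(present)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    return [
        None if v is None else min(int((v - lo) / span * levels), levels - 1)
        for v in values
    ]

def color_categorical(values, top_n=3):
    """Give each of the top_n most common categories a color index; others get None."""
    ranked = [cat for cat, _ in Counter(values).most_common(top_n)]
    palette = {cat: i for i, cat in enumerate(ranked)}
    return [palette.get(v) for v in values]

print(color_numeric([0, 25, 50, 75, 100]))  # [0, 1, 2, 3, 4]
print(color_categorical(["a", "b", "a", "c", "a", "b", "d"], top_n=2))
# [0, 1, 0, None, 0, 1, None]
```

Here `None` plays the role of "no color", matching the description of less common categories being left white.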
The video below demonstrates how, using a combination of color shading and column selection, you can visually scan for patterns of values across columns of interest.
Filtering values is performed:
- Globally, using the search bar, or
- By column
As with filtering columns, any coloring, filtering, and sorting you apply is cumulative.
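The difference between a global search and a column filter, and the way the two combine, can be illustrated with a short sketch. The `rows` data and both function names are hypothetical, and the case-insensitive substring match is an assumption made for the example rather than a description of how the search bar matches values.

```python
# Hypothetical sample rows, each a mapping of column name -> value.
rows = [
    {"name": "Ann", "country": "FR", "note": "paid in full"},
    {"name": "Bob", "country": "US", "note": "refund requested"},
    {"name": "Cy", "country": "US", "note": "paid late"},
]

def search_all(rows, text):
    """Global filter: keep rows where any cell contains the text."""
    needle = text.lower()
    return [r for r in rows if any(needle in str(v).lower() for v in r.values())]

def filter_column(rows, column, text):
    """Column filter: keep rows where one column's value contains the text."""
    needle = text.lower()
    return [r for r in rows if needle in str(r[column]).lower()]

# Cumulative: the column filter narrows the result of the global search.
result = filter_column(search_all(rows, "paid"), "country", "us")
print([r["name"] for r in result])  # ['Cy']
```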
When a value is very long, you can select Show complete value, or use the Shift + v shortcut, to display the full cell contents so that it is easier to copy.
Triple-clicking on a cell also selects the full cell contents, even if the contents are not entirely displayed.
You can also highlight a row of interest by selecting Toggle row highlight, or using the Shift + h shortcut.
The video below demonstrates how to effectively filter values in different ways.