# Interactive Statistics and Visualization

Note

The Interactive Statistics feature requires DSS 7.0.

Now that you have done some preliminary data exploration in the Basics tutorial, let’s create some statistical reports and visualizations.

Dataiku DSS provides the ability to perform exploratory data analysis (EDA) through the **Statistics** tab of a dataset. Using this feature, you can implement descriptive statistics, inferential statistics, and principal component analysis (PCA).

This tutorial walks you through how to perform EDA tasks on the wine quality dataset (see [CCA+09]) that is available in the UCI Machine Learning Repository. The dataset consists of 12 features (or variables), and in this tutorial, we create an additional column for a variable *type* to indicate whether an observation belongs to the red wine or white wine category. For the purpose of this tutorial, the *type* and *quality* variables in the dataset are treated as categorical variables, while all other variables are numerical.

## Prerequisite

This tutorial assumes that you have access to Dataiku DSS.

## Create Your Project

The first step is to create a new Dataiku DSS **Project**. From the Dataiku homepage, click **+New Project**, select **DSS Tutorials** from the list, and select **101.1: Interactive Stats (Tutorial)**. Notice that the project already performs some preliminary data preparation steps to arrive at the *winequality* dataset. These steps include:

- Uploading the *winequality_red* and *winequality_white* datasets
- Stacking the two datasets into a new dataset *winequality*, and adding a new column *type* to indicate the data source (*white* or *red*)
- Changing the storage type for the numerical columns in the *winequality* dataset from “string” to “double”, so that Dataiku DSS can treat these columns as numerical variables instead of categorical variables
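The project performs these steps visually in the Flow. Purely as an illustration of the logic, here is a minimal sketch in plain Python. The inline CSV text, the sample values, and the `load` and `to_double` helpers are made up for this example and are not part of the tutorial project.

```python
import csv
import io

# Tiny stand-ins for the two uploaded files (values are made up).
RED_CSV = "density;alcohol;quality\n0.9978;9.4;5\n0.9968;9.8;5\n"
WHITE_CSV = "density;alcohol;quality\n1.001;8.8;6\n0.994;9.5;6\n"

NUMERICAL = ("density", "alcohol")  # columns to store as doubles


def load(text, wine_type):
    """Parse one CSV and tag every row with its source in a new 'type' column."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        row["type"] = wine_type
        rows.append(row)
    return rows


def to_double(rows, columns):
    """Convert the given columns from string to float; leave the others as-is."""
    return [
        {key: (float(value) if key in columns else value) for key, value in row.items()}
        for row in rows
    ]


# Stack the two datasets, then fix the storage types.
winequality = to_double(load(RED_CSV, "red") + load(WHITE_CSV, "white"), NUMERICAL)
```

Note that *quality* stays a string here, which mirrors the tutorial's choice to treat it as a categorical variable.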

The following figure shows a snippet of the *winequality* dataset, with the red box highlighting the storage type for one of the numerical columns.

## The Statistics Interface

The **Statistics** page of a dataset allows you to generate statistical reports of your data by creating **worksheets**.

Note

**Key concept: Worksheet**

In DSS, a **Worksheet** is a visual summary of EDA tasks. For a dataset, you can create multiple worksheets by clicking the **Statistics** tab.

The worksheet header consists of a button for creating a new card and menu items that provide options to customize the worksheet. For example, there are options for running the worksheet in a container, specifying how to sample the dataset used in the worksheet, changing the global confidence level for statistical tests, duplicating the worksheet, and so on.

For more information about worksheets, see The Worksheet Interface in the reference documentation.

Navigate to the **Statistics** page of the dataset and click **+Create Your First Worksheet**. This brings up a window that contains a selection of card types.

Note

**Key concept: Card**

In DSS, a **Card** is used to perform a specific EDA task. For example, you can describe your dataset, draw inferences about an underlying population, analyze the effect of dimensionality reduction, and so on.

A worksheet can have many cards, with the cards appearing below the worksheet header. All cards have a configuration menu for editing card settings, viewing the JSON payloads and responses (for the purpose of leveraging the public API), and so on. Some cards also contain multiple sections, with each section having its own configuration menu.

For more information about cards, see Elements of a card in the reference documentation.

## Implement Descriptive Statistics

To describe or summarize the *winequality* dataset, we begin by performing descriptive statistics. This includes methods for:

- Univariate analysis
- Bivariate analysis
- Distribution fitting
- Curve fitting
- Computing correlations

### Perform Univariate Analysis

Univariate analysis is useful for exploring the data distribution for individual variables side-by-side. For example, we might be interested in seeing the data distribution for the three variables: *density*, *alcohol*, and *type*.

- From the “Select a card type” window, click the **Univariate analysis** box. This brings up the “Univariate analysis” window.

The first column of the window lists the number of available variables, with the symbol “\(\#\)” denoting a numerical variable, and “\(\mathrm{A}\)” denoting a categorical variable.

- Select **density**, **alcohol**, and **type**, and click the “plus” icon to add them to “Variables to describe”.

Notice that DSS automatically selects the statistical “Options” (in the third column of the window) that are appropriate for the numerical variables (*density* and *alcohol*) and the categorical variable (*type*). You can deselect any of these options if you so choose.

- Click **Create Card** to create the univariate analysis card.

DSS creates a card with one section for each variable. The type of statistical chart and descriptive statistic in each section depends on whether the variable is categorical or numerical. For example, *type*, a categorical variable, has a bar chart (or categorical histogram), while *density* and *alcohol* each have a numerical histogram and box plot insert. Also, the quantile table is applicable to the numerical variables, while the frequency table is applicable to the categorical variable.
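To make the two kinds of summary concrete, the following sketch computes a small quantile table for a numerical variable and a frequency table for a categorical one, using only the Python standard library. The sample values and the helper names `describe_numerical` and `frequency_table` are assumptions for illustration. DSS may use different quantile conventions.

```python
from collections import Counter
from statistics import mean, quantiles, stdev

# Made-up sample values, not taken from the real dataset.
alcohol = [9.4, 9.8, 9.8, 10.0, 10.5, 11.2, 12.0, 12.8]
wine_type = ["red", "red", "red", "white", "white", "white", "white", "white"]


def describe_numerical(values):
    """Summary statistics for a numerical variable."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": mean(values),
        "std": stdev(values),  # sample standard deviation
    }


def frequency_table(values):
    """Map each category to (count, share of all observations)."""
    total = len(values)
    return {
        level: (count, count / total)
        for level, count in Counter(values).most_common()
    }


alcohol_summary = describe_numerical(alcohol)
type_frequencies = frequency_table(wine_type)
```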

Note

By default, DSS computes worksheet statistics on a sample of the first records in your dataset. You can configure this setting by clicking the drop-down arrow next to **Sampling and filtering**.

For more information about the univariate analysis card, see Univariate Analysis in the reference documentation.

### Perform Bivariate Analysis

Next, let’s use the **Bivariate analysis** card to examine the data distribution for pairs of variables simultaneously.

For example, let’s examine the response variable (*type*) for each factor variable (*density* and *alcohol*). To examine the distributions for each factor-response pair:

- Click the **New Card** button from the “Worksheet” header, and then select **Bivariate analysis**. This brings up the “Bivariate analysis” window.
- Select the variables: **density** and **alcohol**, and click the “plus” icon to add them to the “Factor(s)” box. Then select **type** and add it to the “Response” box. Notice that DSS selects the statistical “Options” that are appropriate based on the combination of the variable types.

- Click **Create Card** to create the bivariate analysis card.

DSS creates a card with one section for each factor-response pair.

Notice that each descriptive statistical option (e.g. histogram) in the card has a menu (⋮) that provides options to configure its output. For example, clicking the menu for a histogram plot provides options that include **Configure histogram…**.

- To get a better view of the distributions from the histogram plots, click **Configure histogram…** for each histogram plot, and change the “Max nb. of bins:” value for *density* (and *alcohol*) from `5` to `100`.

The card also shows additional options, as appropriate, for the selected variable types. For more information about the bivariate analysis card, see Bivariate Analysis in the reference documentation.
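To see why the number of bins matters, here is a rough sketch of a per-group histogram with equal-width bins shared by both groups. The `histogram` helper and the sample values are hypothetical, and the binning rule DSS applies may differ.

```python
def histogram(values, n_bins, lo, hi):
    """Count values in n_bins equal-width bins spanning [lo, hi]."""
    if hi <= lo or n_bins < 1:
        raise ValueError("need hi > lo and at least one bin")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in values:
        # Clamp so that value == hi falls into the last bin.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Made-up alcohol values for each response category.
alcohol_by_type = {
    "red": [9.0, 9.5, 10.0, 10.5],
    "white": [11.0, 12.0, 13.0],
}

all_values = [v for values in alcohol_by_type.values() for v in values]
lo, hi = min(all_values), max(all_values)

# Same data, two resolutions: more bins reveal more of the shape.
coarse = {t: histogram(v, 2, lo, hi) for t, v in alcohol_by_type.items()}
fine = {t: histogram(v, 4, lo, hi) for t, v in alcohol_by_type.items()}
```

Sharing `lo` and `hi` across groups keeps the bins aligned, so the two histograms can be compared directly.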

### Fit Univariate Distributions

Another aspect of descriptive statistics involves modeling the probability distribution of your dataset.

DSS allows you to estimate the parameters of univariate probability distributions using the **Fit Distribution** card. This feature is available only for numerical variables.

For example, let’s attempt to fit the Normal and Beta distributions to the dataset, considering only the *alcohol* variable.

- Click the **New Card** button from the “Worksheet” header, and then select **Fit curves & distributions**.
- Select the **Fit Distribution** card.
- Select **alcohol** as the “Variable” and **Normal** as the “Distribution”.
- Add another distribution by clicking the **+Add a Distribution** box and selecting **Beta**.
- Click **Create Card**.

DSS creates a card that shows the normal and beta probability density functions fit to the data. There is also a Q-Q plot that compares the quantiles of the data to the quantiles of the fitted distributions. Points that fall far from the identity line suggest that the data were probably not drawn from the corresponding distribution.

Additionally, the card includes goodness of fit metrics and the estimated parameters for the normal and beta distributions.
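As a rough illustration of what fitting a normal distribution and building a Q-Q plot involve, the sketch below uses the standard library's `NormalDist`. It covers only the normal case, because fitting a beta distribution needs numerical optimization. The helper names, the sample values, and the plotting positions `(i + 0.5) / n` are assumptions for this example, not a description of how DSS computes the card.

```python
from statistics import NormalDist

# Made-up alcohol values.
alcohol = [9.2, 9.5, 9.8, 10.1, 10.4, 10.9, 11.3, 12.0, 12.6]


def fit_normal(values):
    """Estimate mu and sigma from the data (sigma is the sample std)."""
    return NormalDist.from_samples(values)


def qq_points(values, dist):
    """Pair each theoretical quantile with the matching sorted observation."""
    ordered = sorted(values)
    n = len(ordered)
    # Plotting positions (i + 0.5) / n avoid the undefined 0% and 100% quantiles.
    return [(dist.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]


fitted = fit_normal(alcohol)
points = qq_points(alcohol, fitted)  # points near the line y = x indicate a good fit
```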

### Fit Bivariate Distributions

Similarly, the **2D Fit Distribution** card is available for visualizing and estimating bivariate probability distributions on your dataset.

For example, let’s attempt to fit a 2D kernel density estimate (KDE) to the dataset, considering only the *density* and *alcohol* variables.

- Click the **New Card** button from the “Worksheet” header, then select **Fit curves & distributions**.
- Select the **2D Fit Distribution** card.
- Specify **density** as the “X Variable” and **alcohol** as the “Y Variable”.
- Select the “2D KDE” radio button. Notice that the “X relative bandwidth” and “Y relative bandwidth” have the default value of `15`. Let’s keep these default values. However, you can increase the values to make the KDE plot smoother, or decrease the values to make the plot less smooth.
- Click **Create Card** to create the card.
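To build intuition for the bandwidth setting, here is a minimal sketch of a 2D kernel density estimate with a Gaussian product kernel. It takes absolute bandwidths in the units of each variable. How DSS converts its “relative bandwidth” values into absolute ones is not covered in this tutorial, so that mapping is deliberately left out. The `kde_2d` helper and the sample points are hypothetical.

```python
import math


def kde_2d(points, x, y, bandwidth_x, bandwidth_y):
    """Gaussian product-kernel density estimate evaluated at (x, y)."""
    total = 0.0
    for px, py in points:
        zx = (x - px) / bandwidth_x
        zy = (y - py) / bandwidth_y
        total += math.exp(-0.5 * (zx * zx + zy * zy))
    # Normalize so that the estimate integrates to 1 over the plane.
    return total / (2.0 * math.pi * bandwidth_x * bandwidth_y * len(points))


# Made-up (density, alcohol) pairs.
sample = [(0.992, 12.5), (0.994, 11.0), (0.996, 10.2), (0.998, 9.5)]

# A larger bandwidth spreads each observation's weight more widely (smoother surface).
narrow = kde_2d(sample, 0.994, 11.0, 0.001, 0.5)
wide = kde_2d(sample, 0.994, 11.0, 0.004, 2.0)
```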

### Model the Relationship Between Two Variables

Let’s now use the **Fit Curve** card to find the best line or curve to model the relationship between the *free sulfur dioxide* and *total sulfur dioxide* variables.

- Click the **New Card** button from the “Worksheet” header, and then select **Fit curves & distributions**.
- Select the **Fit Curve** card.
- Specify **free sulfur dioxide** as the “X Variable” and **total sulfur dioxide** as the “Y Variable”.
- Select the **Polynomial** “Curve Type” and specify `1` as the polynomial “Degree”.
- Click **Create Card** to create the card.

It appears that higher values of the *free sulfur dioxide* variable are associated with higher values of the *total sulfur dioxide* variable, and vice versa. This suggests that the two variables are positively correlated. We can confirm this by finding the correlation coefficient between these variables.

For more information, see Fit curves and distributions in the reference documentation.
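A polynomial of degree 1 is a straight line. As a sketch of what such a fit computes, assuming ordinary least squares (the tutorial does not state which method DSS uses), the slope and intercept can be derived as follows. The `fit_line` helper and the sample values are made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up (free, total) sulfur dioxide values.
free_so2 = [11.0, 25.0, 15.0, 17.0, 30.0, 45.0]
total_so2 = [34.0, 67.0, 54.0, 60.0, 97.0, 170.0]

slope, intercept = fit_line(free_so2, total_so2)  # a positive slope means an upward trend
```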

### Create a Correlation Matrix

The **Correlation matrix** card allows you to examine the degree to which pairwise relationships may exist for variables in the dataset. Let’s proceed to create the card.

- Click the **New Card** button from the “Worksheet” header, and then select **Correlation matrix**.
- Select the 11 numerical variables to add to the “Variables” column.
- Click the “Pearson” radio button to use the Pearson correlation coefficient.
- Click **Create Card** to create the card.

The correlation matrix card displays a heatmap with the pairwise correlation values in the matrix cells. Of all the variables in the dataset, *free sulfur dioxide* and *total sulfur dioxide* have the largest positive correlation (0.721). This confirms the observation that we made from finding the fit curve.
Also, notice that the variables *density* and *alcohol* have the largest negative correlation (-0.687) in the dataset. This negative correlation implies that wines having higher density values tend to have lower alcohol content.

For more information about the correlation matrix card, see Correlation Matrix in the reference documentation.
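For reference, the Pearson coefficient behind each cell of the matrix can be computed directly from its definition. The sketch below uses made-up columns, and the `pearson` and `correlation_matrix` helpers are hypothetical names.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict of pairwise coefficients for a dict of named columns."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}


# Made-up values for three numerical variables.
columns = {
    "density": [0.998, 0.996, 0.994, 0.992],
    "alcohol": [9.4, 10.2, 11.0, 12.5],
    "pH": [3.5, 3.2, 3.3, 3.1],
}
matrix = correlation_matrix(columns)
```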

## Implement Inferential Statistics

Next, let’s go beyond describing the *winequality* dataset, by using inferential statistics to make quantitative decisions about the underlying population from which the dataset was drawn.

DSS enables you to perform hypothesis tests that include one-sample, two-sample, and N-sample tests on numerical variables, or categorical tests on categorical variables.

### Perform One-sample Tests

These tests allow you to compare the location parameters of a population to a hypothesized constant, or to compare the distribution of a population to a hypothesized one.

Let’s determine whether the mean of the underlying population for the *density* variable is equal to a specified value. To do this, we will use the one-sample **Student t-test** card.

- Click the **New Card** button from the “Worksheet” header, and then select **Statistical tests**. This brings up the “Statistical Tests” window.

The left pane of the window lists the four different categories for statistical tests (one-sample tests, two-sample tests, N-sample tests, and categorical tests). Clicking any of those categories shows the specific tests that are available within the category.

- Click **One-sample test** from the left column of the window, and then click **Student t-test**.
- Select **density** as the “Variable”, and specify `0.995` as the value for the “Hypothesized mean”.

- Click **Create Card** to create the Student *t*-test card on the *density* variable.

The card displays a summary of the *density* variable, including its mean, the tested hypothesis, results of the test, and a plot of the distribution for the test statistic. The card also displays a conclusion about the test: in this case, “The population mean of density is different from 0.995”. For more information about the Student *t*-test card, see Student t-test (one-sample) in the reference documentation.
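The test statistic behind this card is simple to compute by hand. The sketch below returns the *t* statistic and its degrees of freedom. Because the standard library has no Student *t* distribution, the p-value helper uses a normal approximation, which is an assumption that is reasonable only for large samples and is not how an exact *t*-test works. The function names and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_t(values, hypothesized_mean):
    """Return (t statistic, degrees of freedom) for H0: mean == hypothesized_mean."""
    n = len(values)
    standard_error = stdev(values) / math.sqrt(n)
    return (mean(values) - hypothesized_mean) / standard_error, n - 1


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value (large samples only)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Made-up density values.
density = [0.9978, 0.9968, 0.9970, 0.9980, 0.9940, 0.9951, 0.9956, 0.9949]
t_stat, dof = one_sample_t(density, 0.995)
```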

Similarly, you can test whether the median of the population for the *density* variable is equal to a specified value, using the Sign test (one-sample).
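The sign test needs nothing beyond binomial counting, so it can be sketched exactly with the standard library. This version drops observations equal to the hypothesized median and reports a two-sided exact p-value. The `sign_test` helper is a hypothetical name, and DSS may handle ties differently.

```python
from math import comb


def sign_test(values, hypothesized_median):
    """Two-sided exact sign test p-value for H0: median == hypothesized_median."""
    above = sum(v > hypothesized_median for v in values)
    below = sum(v < hypothesized_median for v in values)
    n = above + below  # observations equal to the hypothesized median are dropped
    k = min(above, below)
    # Under H0, each remaining observation is above the median with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Made-up density values.
p_value = sign_test([0.9978, 0.9968, 0.9970, 0.9980, 0.9940, 0.9951], 0.995)
```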

Next, let’s test whether the variable *density* is normally distributed for the population. To do this, we will use the **Shapiro-Wilk Test** card.

- Click the **New Card** button from the “Worksheet” header, and then select **Statistical tests**. Under **One-sample test**, click **Shapiro-Wilk Test**.
- Select **density** as the “Variable”.
- Click **Create Card** to create the card.

The card displays a figure of a normal distribution fit to the data, a summary of the data, the tested hypothesis, results of the test, and a conclusion about the test — in this case, “density is not normally distributed”. For more information about the Shapiro-Wilk card, see Shapiro-Wilk test in the reference documentation.

### Perform Two-sample Tests

These tests allow you to compare the location parameters of two populations, or to compare the distributions of two populations.

Let’s determine whether the medians of two populations for the *density* variable are equal. To do this, we will use the two-sample **Median Mood Test** card.

- Click the **New Card** button from the “Worksheet” header, and then select **Statistical tests**. Under **Two-sample test**, click **Median Mood Test**.
- Select **density** as the “Test Variable”.
- Select **type** as the “Grouping Variable”. This prompts you to specify values of *type* to create the two populations.
- Add `red` for **Population 1** and `white` for **Population 2** to create two disjoint groups.

- Click **Create Card** to create the card.

The card displays a summary of samples from the *red* and *white* wine populations, the tested hypothesis, results of the test, and a conclusion about the test — in this case, “The median of density is different in both populations”. For more information about the two-sample median mood test, see Median mood test (two-sample) in the reference documentation.
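As a sketch of the idea behind Mood's median test, the code below counts how many observations in each sample fall above the pooled median, then applies a chi-square test with one degree of freedom to the resulting 2×2 table. It applies no continuity correction and counts values equal to the pooled median as "not above". These are assumptions of this example, and DSS may make different choices. The `mood_median_test` helper is a hypothetical name.

```python
import math
from statistics import median


def mood_median_test(sample_a, sample_b):
    """Return (chi-square statistic, p-value) for H0: both medians are equal."""
    samples = (sample_a, sample_b)
    grand_median = median(list(sample_a) + list(sample_b))
    # 2x2 table: rows are above / not above the pooled median, columns are samples.
    table = [
        [sum(v > grand_median for v in s) for s in samples],
        [sum(v <= grand_median for v in s) for s in samples],
    ]
    total = len(sample_a) + len(sample_b)
    if 0 in (sum(table[0]), sum(table[1])):
        raise ValueError("no observation above the pooled median; test undefined")
    chi2 = 0.0
    for row in table:
        for col, observed in enumerate(row):
            expected = sum(row) * len(samples[col]) / total
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Made-up density values for the two populations.
red = [0.9978, 0.9968, 0.9970, 0.9980]
white = [0.9910, 0.9940, 0.9951, 0.9926]
statistic, p_value = mood_median_test(red, white)
```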

Similarly, you can test whether the means of two populations are equal for the *density* variable using the Student t-test (two-sample).

Next, let’s test whether the distribution of the *density* variable is the same for the *red* wine and the *white* wine populations. To do this, we will use the **Kolmogorov-Smirnov** test card.

- Click the **New Card** button from the “Worksheet” header, then select **Statistical tests**. Under **Two-sample test**, click **Kolmogorov-Smirnov**.
- Select **density** as the “Test Variable”.
- Select **type** as the “Grouping Variable”. This prompts you to specify values of *type* to create the two populations.
- Add `red` for **Population 1** and `white` for **Population 2** to create two disjoint groups.
- Click **Create Card** to create the card.

The card displays a figure of the empirical Cumulative Distribution Functions (CDFs) of the two populations, a summary of samples from the two populations, the tested hypothesis, results of the test, and a conclusion about the test: in this case, “density distribution is different in the two populations”. For more information about the Kolmogorov-Smirnov card, see Kolmogorov-Smirnov test (two-sample) in the reference documentation.
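The Kolmogorov-Smirnov statistic is the largest vertical gap between the two empirical CDFs. The sketch below computes only that statistic, and leaves out the p-value, which requires the Kolmogorov distribution. The `ks_statistic` helper and the sample values are made up for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        # Fraction of observations less than or equal to x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # Both ECDFs are step functions, so the maximum gap occurs at an observed value.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


# Made-up density values for the two populations.
red = [0.9978, 0.9968, 0.9970, 0.9980]
white = [0.9910, 0.9940, 0.9951, 0.9926]
d_statistic = ks_statistic(red, white)
```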

### Perform N-sample Tests

These tests allow you to compare the location parameters of multiple populations.

Let’s determine whether the means of multiple populations for the *density* variable are equal. To do this, we will use the N-sample **Oneway ANOVA** card.

- Click the **New Card** button from the “Worksheet” header, and then select **Statistical tests**. Under **N-sample test**, click **Oneway ANOVA**.
- Select **density** as the “Test Variable”.
- Select **quality** as the “Grouping Variable”. Because *quality* has more than two values, we can use this variable to create multiple groups.
- Select “Build groups from most frequent values”, and keep the default “Maximum number of groups” value of `10`.

- Click **Create Card** to create the card.

The card displays a summary of the samples for all the groups, the tested hypothesis, results of the test, and a conclusion about the test — in this case, “The mean of density is different in all populations”. For more information about the Oneway ANOVA card, see One-way ANOVA in the reference documentation.
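One-way ANOVA compares the variation between group means with the variation inside the groups. As a sketch, the *F* statistic and its degrees of freedom can be computed as follows. The p-value, which needs the *F* distribution, is left out. The `one_way_anova_f` helper and the grouped values are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df between groups, df within groups)."""
    means = [mean(g) for g in groups]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Made-up density values grouped by three quality levels.
density_by_quality = {
    "5": [0.9978, 0.9968, 0.9970],
    "6": [0.9951, 0.9956, 0.9949],
    "7": [0.9926, 0.9930, 0.9921],
}
f_stat, df_b, df_w = one_way_anova_f(list(density_by_quality.values()))
```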

Other available N-sample test cards include the Median mood test (N-samples), the Pairwise student *t*-test, and the Pairwise median mood test. To learn more about these cards, see N-sample tests in the reference documentation.

### Perform Tests on Categorical Variables

So far, we’ve implemented hypothesis testing only on numerical variables. Now, let’s test whether two categorical variables in the *winequality* dataset are independent.

To do this, we will use the **Chi-square Independence Test** card.

- Click the **New Card** button from the “Worksheet” header, then select **Statistical tests**. Under **Categorical test**, click **Chi-square Independence Test**.
- Select the categorical variable **quality** for “Variable 1”.
- Select the categorical variable **type** for “Variable 2”.
- Keep the default value of `5` for “Maximum X Values to Display” and “Maximum Y Values to Display”.
- Click **Create Card** to create the card.

The card displays the tested hypothesis, results of the test, and a conclusion about the test — in this case, “Variables quality and type are not independent”. For more information about this card, see Chi-square independence test in the reference documentation.
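The chi-square statistic compares each observed cell count of the contingency table with the count expected under independence. A minimal sketch follows. It returns the statistic and the degrees of freedom only, and the `chi_square_independence` helper and the sample labels are assumptions for illustration.

```python
from collections import Counter


def chi_square_independence(xs, ys):
    """Return (chi-square statistic, degrees of freedom) for two categorical columns."""
    n = len(xs)
    observed = Counter(zip(xs, ys))  # missing combinations count as 0
    x_totals, y_totals = Counter(xs), Counter(ys)
    chi2 = 0.0
    for x, x_total in x_totals.items():
        for y, y_total in y_totals.items():
            expected = x_total * y_total / n  # count expected under independence
            chi2 += (observed[(x, y)] - expected) ** 2 / expected
    dof = (len(x_totals) - 1) * (len(y_totals) - 1)
    return chi2, dof


# Made-up quality and type labels.
quality = ["5", "5", "6", "6", "7", "7"]
wine_type = ["red", "red", "red", "white", "white", "white"]
statistic, dof = chi_square_independence(quality, wine_type)
```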

## Analyze Effects of Dimensionality Reduction

Lastly, when working with a dataset having many variables, we may be interested in analyzing the effects of using a reduced number of variables (or dimensions) of the data. For example, we may choose to explore the structure of the *winequality* dataset in two dimensions.

Dataiku DSS enables you to analyze the effects of dimensionality reduction using a feature extraction method called Principal Component Analysis (PCA).

### Perform PCA

Let’s use the **Principal Component Analysis** card to represent the *winequality* dataset in two dimensions.

- Click the **New Card** button from the “Worksheet” header, and then select **Principal Component Analysis**.
- Select the 11 numerical variables to add to the “Variables” column.
- Click **Create Card** to create the card.

The scree plot in the PCA card shows that the first two principal components account for only about 50.2% of the variance in the dataset. To explain at least 90% of the variance (the red vertical line), you must retain a minimum of 7 principal components.

The 2D scatter plot to the right shows the data projected onto the first two principal components.

Finally, the heatmap shows the composition of all the principal components.

For more information about the PCA card, see Principal Component Analysis (PCA) in the reference documentation.
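Reading a scree plot comes down to accumulating the share of variance carried by each component. The sketch below shows that bookkeeping for a list of eigenvalues. The eigenvalues are invented for illustration and are not the ones DSS computes for *winequality*. The helper names are hypothetical.

```python
from itertools import accumulate


def cumulative_explained_variance(eigenvalues):
    """Cumulative share of total variance, with components sorted by eigenvalue."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [partial / total for partial in accumulate(ordered)]


def components_needed(eigenvalues, threshold=0.90):
    """Smallest number of components whose cumulative share reaches the threshold."""
    shares = cumulative_explained_variance(eigenvalues)
    for k, share in enumerate(shares, start=1):
        if share >= threshold:
            return k
    return len(shares)


# Invented eigenvalues for an 11-variable dataset.
eigenvalues = [3.0, 2.5, 1.6, 1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
shares = cumulative_explained_variance(eigenvalues)
k_90 = components_needed(eigenvalues, 0.90)
```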

## Next Steps

Congratulations! Now that you have spent some time exploring your dataset, you are ready to move on to other tasks like further preparing your data, or building machine learning models.

Check out the From Lab to Flow tutorial to work on your flow and learn more about the power of preparation scripts and processors. You can also proceed to the tutorial on Machine Learning to learn how to build machine learning models for prediction.

References

[CCA+09] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine preferences by data mining from physicochemical properties. 2009.