Interactive Statistics and Visualization

Note

The Interactive Statistics feature requires DSS 7.0.

Now that you have done some preliminary data exploration in the Basics tutorial, let’s create some statistical reports and visualizations.

Dataiku DSS provides the ability to perform exploratory data analysis (EDA) through the Statistics tab of a dataset. Using this feature, you can implement descriptive statistics, inferential statistics, and principal component analysis (PCA).

This tutorial walks you through how to perform EDA tasks on the wine quality dataset (see [CCA+09]), which is available in the UCI Machine Learning Repository. The dataset consists of 12 features (or variables), and in this tutorial, we create an additional column for a variable type to indicate whether an observation belongs to the red wine or white wine category. For the purpose of this tutorial, the type and quality variables in the dataset are treated as categorical variables, while all other variables are numerical.

Prerequisite

This tutorial assumes that you have access to Dataiku DSS.

Create Your Project

The first step is to create a new Dataiku DSS Project. From the Dataiku homepage, click +New Project, select DSS Tutorials from the list, and select 101.1: Interactive Stats (Tutorial). Notice that the project already performs some preliminary data preparation steps to arrive at the winequality dataset. These steps include:

  • Uploading the winequality_red and winequality_white datasets
  • Stacking the two datasets into a new dataset winequality, and adding a new column type to indicate the data source (white or red)
  • Changing the storage type for the numerical columns in the winequality dataset from “string” to “double”, so that Dataiku DSS can treat these columns as numerical variables instead of categorical variables

The following figure shows a snippet of the winequality dataset, with the red box highlighting the storage type for one of the numerical columns.

../../_images/stats_data_winequality1.png
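
Outside of DSS, the same preparation steps can be sketched in a few lines of Python. This is only an illustration of the idea, not how the tutorial project implements it: the rows and the helper names `stack_with_type` and `to_double` are made up for the example.

```python
# Hypothetical rows standing in for the two uploaded datasets.
# Values are strings, as they are before the storage type is changed.
red_rows = [{"density": "0.9978", "alcohol": "9.4"}]
white_rows = [
    {"density": "1.0010", "alcohol": "8.8"},
    {"density": "0.9940", "alcohol": "9.5"},
]


def stack_with_type(red, white):
    """Stack both datasets and record the origin of each row in a 'type' column."""
    stacked = []
    for label, rows in (("red", red), ("white", white)):
        for row in rows:
            stacked.append({**row, "type": label})
    return stacked


def to_double(rows, numeric_columns):
    """Convert the named columns from string to float (a 'double' storage type)."""
    return [
        {key: float(value) if key in numeric_columns else value
         for key, value in row.items()}
        for row in rows
    ]


winequality = to_double(stack_with_type(red_rows, white_rows), {"density", "alcohol"})
print(winequality[0])
```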

The Statistics Interface

The Statistics page of a dataset allows you to generate statistical reports of your data by creating worksheets.

Note

Key concept: Worksheet

In DSS, a Worksheet is a visual summary of EDA tasks. For a dataset, you can create multiple worksheets by clicking the Statistics tab.

The worksheet header consists of a button for creating a new card and menu items that provide options to customize the worksheet. For example, there are options for running the worksheet in a container, specifying how to sample the dataset used in the worksheet, changing the global confidence level for statistical tests, duplicating the worksheet, and so on.

For more information about worksheets, see The Worksheet Interface in the reference documentation.

Navigate to the Statistics page of the dataset and click +Create Your First Worksheet. This brings up a window that contains a selection of card types.

../../_images/stats_card1.png

Note

Key concept: Card

In DSS, a Card is used to perform a specific EDA task. For example, you can describe your dataset, draw inferences about an underlying population, analyze the effect of dimensionality reduction, and so on.

A worksheet can have many cards, with the cards appearing below the worksheet header. All cards have a configuration menu for editing card settings, viewing the JSON payloads and responses (for the purpose of leveraging the public API), and so on. Some cards also contain multiple sections, with each section having its own configuration menu.

For more information about cards, see Elements of a card in the reference documentation.

Implement Descriptive Statistics

To describe or summarize the winequality dataset, we begin by performing descriptive statistics. This includes methods for:

  • Univariate analysis
  • Bivariate analysis
  • Distribution fitting
  • Curve fitting
  • Computing correlations

Perform Univariate Analysis

Univariate analysis is useful for exploring the data distribution for individual variables side-by-side. For example, we might be interested in seeing the data distribution for the three variables: density, alcohol, and type.

  • From the “Select a card type” window, click the Univariate analysis box. This brings up the “Univariate analysis” window.

The first column of the window lists the available variables, with the symbol “#” denoting a numerical variable, and “A” denoting a categorical variable.

  • Select density, alcohol, and type, and click the “plus” icon to add them to “Variables to describe”.

Notice that DSS automatically selects the statistical “Options” (in the third column of the window) that are appropriate for the numerical variables (density and alcohol) and the categorical variable (type). You can deselect any of these options if you so choose.

../../_images/stats_univariate_window1.png
  • Click Create Card to create the univariate analysis card.
../../_images/stats_univariate_card1.png

DSS creates a card with one section for each variable. The type of statistical chart and descriptive statistic in each section depends on whether the variable is categorical or numerical. For example, type, a categorical variable, has a bar chart (or categorical histogram), while density and alcohol each have a numerical histogram and box plot inset. Also, the quantile table is applicable to the numerical variables, while the frequency table is applicable to the categorical variable.
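
To make the difference between the two tables concrete, here is a minimal sketch using only the Python standard library. The values are illustrative, and the quantile method used by DSS may differ from the "inclusive" method chosen here.

```python
from collections import Counter
import statistics

# Illustrative values, not taken from the winequality dataset.
alcohol = [9.4, 9.8, 9.8, 10.0, 10.5, 11.2, 12.8]
wine_type = ["red", "red", "white", "white", "white", "red", "white"]

# Quantile table for a numerical variable: the three quartile cut points.
q1, median, q3 = statistics.quantiles(alcohol, n=4, method="inclusive")

# Frequency table for a categorical variable: the count of each category.
frequencies = Counter(wine_type)

print(q1, median, q3)
print(frequencies.most_common())
```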

Note

By default, DSS computes worksheet statistics on a sample of the first records in your dataset. You can configure this setting by clicking the drop-down arrow next to Sampling and filtering.

../../_images/stats_configure_sample3.png

For more information about the univariate analysis card, see Univariate Analysis in the reference documentation.

Perform Bivariate Analysis

Next, let’s use the Bivariate analysis card to examine the data distribution for pairs of variables simultaneously.

For example, let’s examine the response variable (type) for each factor variable (density and alcohol). To examine the distributions for each factor-response pair:

  • Click the New Card button from the “Worksheet” header, and then select Bivariate analysis. This brings up the “Bivariate analysis” window.
  • Select the variables: density and alcohol, and click the “plus” icon to add them to the “Factor(s)” box. Then select type and add it to the “Response” box. Notice that DSS selects the statistical “Options” that are appropriate based on the combination of the variable types.
../../_images/stats_bivariate_window1.png
  • Click Create Card to create the bivariate analysis card.

DSS creates a card with one section for each factor-response pair.

../../_images/stats_bivariate_card1.png

Notice that each descriptive statistical option (e.g. histogram) in the card has a menu (⋮) that provides options to configure its output. For example, clicking the menu for a histogram plot provides options that include Configure histogram….

  • To get a better view of the distributions from the histogram plots, click Configure histogram… for each histogram plot, and change the “Max nb. of bins:” value for density (and alcohol) from 5 to 100.
../../_images/stats_bivariate_histograms1.png
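
The effect of the bin count can be illustrated with a small equal-width histogram function. This is a sketch under the assumption of equal-width bins over the observed range; the tutorial does not describe how DSS chooses bin edges, and `histogram` is a hypothetical helper.

```python
def histogram(values, max_bins):
    """Count values in equal-width bins spanning [min(values), max(values)]."""
    low, high = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (high - low) / max_bins or 1.0
    counts = [0] * max_bins
    for v in values:
        # The maximum value falls into the last bin rather than a new one.
        index = min(int((v - low) / width), max_bins - 1)
        counts[index] += 1
    return counts


values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(histogram(values, 2))   # coarse view of the distribution
print(histogram(values, 5))   # finer view of the same data
```

More bins reveal finer structure in the distribution, at the cost of noisier counts per bin.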

The card also shows additional options, as appropriate, for the selected variable types. For more information about the bivariate analysis card, see Bivariate Analysis in the reference documentation.

Fit Univariate Distributions

Another aspect of descriptive statistics involves modeling the probability distribution of your dataset.

DSS allows you to estimate the parameters of univariate probability distributions using the Fit Distribution card. This feature is available only for numerical variables.

For example, let’s attempt to fit the Normal and Beta distributions to the dataset, considering only the alcohol variable.

  • Click the New Card button from the “Worksheet” header, and then select Fit curves & distributions.
  • Select the Fit Distribution card.
  • Select alcohol as the “Variable” and Normal as the “Distribution”.
  • Add another distribution by clicking the +Add a Distribution box and selecting Beta.
  • Click Create Card.
../../_images/stats_1Dfit1.png

DSS creates a card that shows the normal and beta probability density functions fit to the data. There is also a Q-Q plot that compares the quantiles of the data to the quantiles of the fitted distributions. Points that lie far from the identity line indicate a poor fit; here, they suggest that the data were probably not drawn from either distribution.

Additionally, the card includes goodness of fit metrics and the estimated parameters for the normal and beta distributions.
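
As a rough illustration of what fitting a normal distribution and building a Q-Q plot involve, the sketch below uses `statistics.NormalDist` from the Python standard library. The data values and the plotting positions `(i + 0.5) / n` are assumptions made for the example; DSS may estimate parameters and quantiles differently.

```python
from statistics import NormalDist

# Illustrative values, not taken from the winequality dataset.
alcohol = [9.4, 9.8, 9.8, 10.0, 10.5, 11.2, 12.8]

# Estimate the mean and standard deviation of a normal distribution from the data.
fitted = NormalDist.from_samples(alcohol)


def qq_points(data, dist):
    """Pair each sorted observation with the fitted quantile at the same rank."""
    ordered = sorted(data)
    n = len(ordered)
    return [(dist.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]


points = qq_points(alcohol, fitted)
for theoretical, observed in points:
    # Points close to the identity line have theoretical ~ observed.
    print(round(theoretical, 3), observed)
```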

Fit Bivariate Distributions

Similarly, the 2D Fit Distribution card is available for visualizing and estimating bivariate probability distributions on your dataset.

For example, let’s attempt to fit a 2D kernel density estimate (KDE) to the dataset, considering only the density and alcohol variables.

  • Click the New Card button from the “Worksheet” header, then select Fit curves & distributions.
  • Select the 2D Fit Distribution card.
  • Specify density as the “X Variable” and alcohol as the “Y Variable”.
  • Select the “2D KDE” radio button. Notice that the “X relative bandwidth” and “Y relative bandwidth” have the default value of 15. Let’s keep these default values. However, you can increase the values to make the KDE plot smoother, or decrease the values to make the plot less smooth.
  • Click Create Card to create the card.
../../_images/stats_2D_KDE1.png
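
To see why a larger bandwidth produces a smoother plot, consider this minimal 2D KDE sketch built from a product of Gaussian kernels. The bandwidths here are absolute values in data units, which is an assumption for the example; the tutorial does not define how the “relative bandwidth” settings in DSS map to kernel widths, and `kde_2d` is a hypothetical helper.

```python
import math


def kde_2d(points, bandwidth_x, bandwidth_y):
    """Return a density function built from a product of Gaussian kernels."""
    n = len(points)
    norm = 1.0 / (n * 2.0 * math.pi * bandwidth_x * bandwidth_y)

    def density(x, y):
        total = 0.0
        for px, py in points:
            zx = (x - px) / bandwidth_x
            zy = (y - py) / bandwidth_y
            total += math.exp(-0.5 * (zx * zx + zy * zy))
        return norm * total

    return density


# Illustrative (density, alcohol) pairs.
observations = [(0.994, 11.0), (0.996, 10.2), (0.998, 9.5)]
narrow = kde_2d(observations, 0.001, 0.3)
wide = kde_2d(observations, 0.004, 1.2)

# A wider kernel spreads each observation's mass, flattening the peaks.
print(narrow(0.996, 10.2) > wide(0.996, 10.2))
```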

Model the Relationship Between Two Variables

Let’s now use the Fit Curve card to find the best line or curve to model the relationship between the free sulfur dioxide and total sulfur dioxide variables.

  • Click the New Card button from the “Worksheet” header, and then select Fit curves & distributions.
  • Select the Fit Curve card.
  • Specify free sulfur dioxide as the “X Variable” and total sulfur dioxide as the “Y Variable”.
  • Select the Polynomial “Curve Type” and specify 1 as the polynomial “Degree”.
  • Click Create Card to create the card.
../../_images/stats_fit_curve1.png

The fitted line has a positive slope: higher values of the free sulfur dioxide variable tend to go with higher values of the total sulfur dioxide variable. This suggests that the two variables are positively correlated. We can confirm this by finding the correlation coefficient between these variables.
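
A degree-1 polynomial fit is an ordinary least-squares line. The sketch below shows the closed-form computation on made-up numbers; `fit_line` is a hypothetical helper, and DSS may use a different fitting routine internally.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative values only.
free_so2 = [1.0, 2.0, 3.0, 4.0]
total_so2 = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(free_so2, total_so2)
print(slope, intercept)
```

A positive slope corresponds to a positive association between the two variables.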

For more information, see Fit curves and distributions in the reference documentation.

Create a Correlation matrix

The Correlation matrix card allows you to examine the degree to which pairwise relationships may exist for variables in the dataset. Let’s proceed to create the card.

  • Click the New Card button from the “Worksheet” header, and then select Correlation matrix.
  • Select the 11 numerical variables to add to the “Variables” column.
  • Click the “Pearson” radio button to use the Pearson correlation coefficient.
  • Click Create Card to create the card.
../../_images/stats_correlation_matrix1.png

The correlation matrix card displays a heatmap with the pairwise correlation values in the matrix cells. Of all the variables in the dataset, free sulfur dioxide and total sulfur dioxide have the largest positive correlation (0.721). This confirms the observation that we made from finding the fit curve. Also, notice that the variables density and alcohol have the largest negative correlation (-0.687) in the dataset. This negative correlation implies that wines having higher density values tend to have lower alcohol content.
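
For reference, the Pearson coefficient in each cell of the heatmap can be computed as follows. This is a plain-Python sketch on illustrative columns; `pearson` and `correlation_matrix` are hypothetical helper names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of column name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Illustrative columns only.
matrix = correlation_matrix({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})
print(matrix["a"]["b"], matrix["a"]["c"])
```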

For more information about the correlation matrix card, see Correlation Matrix in the reference documentation.

Implement Inferential Statistics

Next, let’s go beyond describing the winequality dataset, by using inferential statistics to make quantitative decisions about the underlying population from which the dataset was drawn.

DSS enables you to perform hypothesis tests that include one-sample, two-sample, and N-sample tests on numerical variables, or categorical tests on categorical variables.

Perform One-sample Tests

These tests allow you to compare the location parameters of a population to a hypothesized constant, or to compare the distribution of a population to a hypothesized one.

Let’s determine whether the mean of the underlying population for the density variable is equal to a specified value. To do this, we will use the one-sample Student t-test card.

  • Click the New Card button from the “Worksheet” header, and then select Statistical tests. This brings up the “Statistical Tests” window.
../../_images/stats_statistical_tests1.png

The left pane of the window lists the four different categories for statistical tests (one-sample tests, two-sample tests, N-sample tests, and categorical tests). Clicking any of those categories shows the specific tests that are available within the category.

  • Click One-sample test from the left column of the window, and then click Student t-test.
  • Select density as the “Variable”, and specify 0.995 as the value for the “Hypothesized mean”.
../../_images/stats_uni_ttest_window1.png
  • Click Create Card to create the Student t-test card on the density variable.
../../_images/stats_uni_ttest_card1.png

The card displays a summary of the density variable, including its mean, the tested hypothesis, results of the test, and a plot of the distribution for the test statistic. The card also displays a conclusion about the test, in this case “The population mean of density is different from 0.995”. For more information about the Student t-test card, see Student t-test (one-sample) in the reference documentation.
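
The test statistic behind a one-sample Student t-test is simple to compute by hand. The sketch below returns the statistic and the degrees of freedom; turning these into a p-value requires the t-distribution, which is not in the Python standard library (`scipy.stats.ttest_1samp` is one option). The sample values are illustrative, and `one_sample_t` is a hypothetical helper.

```python
import math
import statistics


def one_sample_t(values, hypothesized_mean):
    """Return the t statistic and degrees of freedom for a one-sample t-test."""
    n = len(values)
    mean = statistics.fmean(values)
    std_error = statistics.stdev(values) / math.sqrt(n)  # sample standard deviation
    return (mean - hypothesized_mean) / std_error, n - 1


# Illustrative values only.
density = [0.9968, 0.9970, 0.9978, 0.9980, 0.9951]
t_stat, dof = one_sample_t(density, 0.995)
print(t_stat, dof)
```

A t statistic far from zero, relative to the t-distribution with the given degrees of freedom, is evidence against the hypothesized mean.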

Similarly, you can test whether the median of the population for the density variable is equal to a specified value, using the Sign test (one-sample).

Next, let’s test whether the variable density is normally distributed for the population. To do this, we will use the Shapiro-Wilk Test card.

  • Click the New Card button from the “Worksheet” header, and then select Statistical tests. Under One-sample test, click Shapiro-Wilk Test.
  • Select density as the “Variable”.
  • Click Create Card to create the card.
../../_images/stats_shapiro_wilk_card1.png

The card displays a figure of a normal distribution fit to the data, a summary of the data, the tested hypothesis, results of the test, and a conclusion about the test — in this case, “density is not normally distributed”. For more information about the Shapiro-Wilk card, see Shapiro-Wilk test in the reference documentation.

Perform Two-sample Tests

These tests allow you to compare the location parameters of two populations, or to compare the distributions of two populations.

Let’s determine whether the medians of two populations for the density variable are equal. To do this, we will use the two-sample Median Mood Test card.

  • Click the New Card button from the “Worksheet” header, and then select Statistical tests. Under Two-sample test, click Median Mood Test.
  • Select density as the “Test Variable”.
  • Select type as the “Grouping Variable”. This prompts you to specify values of type to create the two populations.
  • Add red for Population 1 and white for Population 2 to create two disjoint groups.
../../_images/stats_two_median_mood_window1.png
  • Click Create Card to create the card.
../../_images/stats_two_median_mood_test1.png

The card displays a summary of samples from the red and white wine populations, the tested hypothesis, results of the test, and a conclusion about the test — in this case, “The median of density is different in both populations”. For more information about the two-sample median mood test, see Median mood test (two-sample) in the reference documentation.
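
The idea behind Mood's median test can be sketched as follows: count how many observations in each sample lie above the median of the pooled data, then apply a chi-square test to the resulting 2×2 table. This version applies no continuity correction and assumes every expected count is non-zero, so its numbers may differ from those DSS reports; `mood_median_test` is a hypothetical helper.

```python
import math
import statistics


def mood_median_test(sample_a, sample_b):
    """Chi-square statistic and p-value for Mood's median test on two samples."""
    grand_median = statistics.median(list(sample_a) + list(sample_b))
    samples = (sample_a, sample_b)
    # 2x2 table: rows are above / not above the grand median, columns are samples.
    table = [
        [sum(v > grand_median for v in s) for s in samples],
        [sum(v <= grand_median for v in s) for s in samples],
    ]
    total = len(sample_a) + len(sample_b)
    chi2 = 0.0
    for row in table:
        row_total = sum(row)
        for col, observed in enumerate(row):
            col_total = table[0][col] + table[1][col]
            expected = row_total * col_total / total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Illustrative values only.
red = [0.9968, 0.9970, 0.9978, 0.9980]
white = [0.9910, 0.9925, 0.9940, 0.9951]
chi2, p_value = mood_median_test(red, white)
print(chi2, p_value)
```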

Similarly, you can test whether the means of two populations are equal for the density variable using the Student t-test (two-sample).

Next, let’s test whether the distribution of the density variable is the same for the red wine and the white wine populations. To do this, we will use the Kolmogorov-Smirnov test card.

  • Click the New Card button from the “Worksheet” header, then select Statistical tests. Under Two-sample test, click Kolmogorov-Smirnov.
  • Select density as the “Test Variable”.
  • Select type as the “Grouping Variable”. This prompts you to specify values of type to create the two populations.
  • Add red for Population 1 and white for Population 2 to create two disjoint groups.
  • Click Create Card to create the card.
../../_images/stats_kolmogorov_smirnov_card1.png

The card displays a figure of the empirical Cumulative Distribution Functions (CDFs) of the two populations, a summary of samples from the two populations, the tested hypothesis, results of the test, and a conclusion about the test, in this case “density distribution is different in the two populations”. For more information about the Kolmogorov-Smirnov card, see Kolmogorov-Smirnov test (two-sample) in the reference documentation.
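
The two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the two empirical CDFs. The sketch below computes only that statistic, on illustrative values; it does not compute a p-value, and `ks_statistic` is a hypothetical helper.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        # Fraction of observations less than or equal to x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The gap can only change at observed values, so checking those is enough.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


# Illustrative values only.
red = [0.9968, 0.9970, 0.9978, 0.9980]
white = [0.9910, 0.9925, 0.9940, 0.9951]
print(ks_statistic(red, white))
```

A statistic of 0 means the empirical CDFs coincide, while a statistic of 1 means the two samples do not overlap at all.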

Perform N-sample Tests

These tests allow you to compare the location parameters of multiple populations.

Let’s determine whether the means of multiple populations for the density variable are equal. To do this, we will use the N-sample Oneway ANOVA card.

  • Click the New Card button from the “Worksheet” header, and then select Statistical tests. Under N-sample test, click Oneway ANOVA.
  • Select density as the “Test Variable”.
  • Select quality as the “Grouping Variable”. Because quality has more than two values, we can use this variable to create multiple groups.
  • Select “Build groups from most frequent values”, and keep the default “Maximum number of groups” value of 10.
../../_images/stats_ANOVA_window1.png
  • Click Create Card to create the card.
../../_images/stats_ANOVA_card1.png

The card displays a summary of the samples for all the groups, the tested hypothesis, results of the test, and a conclusion about the test — in this case, “The mean of density is different in all populations”. For more information about the Oneway ANOVA card, see One-way ANOVA in the reference documentation.
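
One-way ANOVA compares the variability between group means to the variability within groups. The following sketch computes the F statistic and its degrees of freedom on made-up groups; the p-value requires the F-distribution and is left out. `one_way_anova_f` is a hypothetical helper.

```python
def one_way_anova_f(groups):
    """Return the F statistic with between-group and within-group degrees of freedom."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - mean) ** 2
        for group, mean in zip(groups, group_means)
        for v in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Illustrative groups, for example density values grouped by quality.
f_stat, df_between, df_within = one_way_anova_f(
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
)
print(f_stat, df_between, df_within)
```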

Other available N-sample test cards include the Median mood test (N-samples), the Pairwise student t-test, and the Pairwise median mood test. To learn more about these cards, see N-sample tests in the reference documentation.

Perform Tests on Categorical Variables

So far, we’ve implemented hypothesis testing only on numerical variables. Now, let’s test whether two categorical variables in the winequality dataset are independent.

To do this, we will use the Chi-square Independence Test card.

  • Click the New Card button from the “Worksheet” header, then select Statistical tests. Under Categorical test, click Chi-square Independence Test.
  • Select the categorical variable quality for “Variable 1”.
  • Select the categorical variable type for “Variable 2”.
  • Keep the default values 5 for “Maximum X Values to Display” and “Maximum Y Values to Display”.
  • Click Create Card to create the card.
../../_images/stats_chi-square_card1.png

The card displays the tested hypothesis, results of the test, and a conclusion about the test — in this case, “Variables quality and type are not independent”. For more information about this card, see Chi-square independence test in the reference documentation.
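
The chi-square independence test compares the observed counts in a contingency table with the counts expected if the two variables were independent. Here is a minimal sketch of the statistic and its degrees of freedom, using made-up counts and assuming no expected count is zero; `chi_square_independence` is a hypothetical helper.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


# Illustrative counts: rows are quality levels, columns are wine types.
chi2, dof = chi_square_independence([[20, 0], [0, 20]])
print(chi2, dof)
```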

Analyze Effects of Dimensionality Reduction

Lastly, when working with a dataset having many variables, we may be interested in analyzing the effects of using a reduced number of variables (or dimensions) of the data. For example, we may choose to explore the structure of the winequality dataset in two dimensions.

Dataiku DSS enables you to analyze the effects of dimensionality reduction using a feature extraction method called Principal Component Analysis (PCA).

Perform PCA

Let’s use the Principal Component Analysis card to represent the winequality dataset in two dimensions.

  • Click the New Card button from the “Worksheet” header, and then select Principal Component Analysis.
  • Select the 11 numerical variables to add to the “Variables” column.
  • Click Create Card to create the card.
../../_images/stats_PCA_card1.png

The scree plot in the PCA card shows that the first two principal components account for only about 50.2% of the variance in the dataset. To account for at least 90% of the variance (the red vertical line), you must retain at least 7 principal components.
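
Reading a scree plot in this way amounts to accumulating the explained variance of each component until a target is reached. The sketch below does this for a list of eigenvalues; the eigenvalues are hypothetical and are not those of the winequality dataset, and `components_needed` is a hypothetical helper.

```python
def components_needed(eigenvalues, target=0.90):
    """Smallest number of leading components whose variance share reaches the target."""
    ordered = sorted(eigenvalues, reverse=True)
    threshold = target * sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues of a correlation matrix for four variables.
eigenvalues = [2.0, 1.2, 0.6, 0.2]
print(components_needed(eigenvalues, target=0.90))
```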

The 2D scatter plot to the right shows the data projected onto the first two principal components.

Finally, the heatmap shows the composition of all the principal components.

For more information about the PCA card, see Principal Component Analysis (PCA) in the reference documentation.

Next Steps

Congratulations! Now that you have spent some time exploring your dataset, you are ready to move on to other tasks like further preparing your data, or building machine learning models.

Check out the From Lab to Flow tutorial to work on your flow and learn more about the power of preparation scripts and processors. You can also proceed to the tutorial on Machine Learning to learn how to build machine learning models for prediction.

References

[CCA+09] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine preferences by data mining from physicochemical properties. 2009.