The Main Dataiku DSS Concepts
Dataiku DSS is an on-premises/on-cloud software product (not SaaS) that operates as part of your data stack’s existing infrastructure.
A data stack typically includes development, production, and deployment environments; in order to work across these environments, a separate instance of Dataiku DSS is installed in each environment.
A Dataiku DSS instance is an installation of the product to serve the needs of a particular environment:
- The Design node instance, in the development environment, is used to create the pipelines that turn data into outputs (dashboards and reports, data used to build reports, and models)
- The Automation node instance, in the production environment, puts pipelines from the Design node into production to turn your enterprise data into the final outputs
- The API node instance, in the deployment environment, makes model outputs from the Automation node available for use in real-time scoring
Pipelines in the Design and Automation nodes are organized into projects, which can be accessed from the main page after logging in to the Dataiku DSS instance.
A Dataiku DSS project is a container for all your work on a particular activity. The project home acts as the command center from which you can see the overall status of a project, view recent activity, and collaborate through comments, tags, and a project to-do list.
Flow, Datasets, and Recipes
A Dataiku DSS dataset is a tabular view into your data that allows you to access, visualize, and write data in the same way, regardless of the underlying storage system. You can connect to a variety of storage systems (file system, SQL database, Hadoop, etc.) and file formats (CSV, JSON, Hadoop file formats, etc.).
Creating your first DSS dataset and learning how to cleanse it is the subject of the Basics Tutorial.
A Dataiku DSS recipe is a set of actions to perform on one or more input datasets, resulting in one or more output datasets. Every transformation you apply to your datasets (prepare, join, group, and so on) is done through a recipe. A recipe can be visual or code.
- A visual recipe allows a quick and interactive transformation of the input dataset through a number of prepackaged operations available in a visual interface.
- A code recipe allows a user with coding skills to go beyond visual recipe functionality and take control of the transformation using any supported language (SQL, Python, R, etc.).
Dataiku allows “coders” and “clickers” to seamlessly collaborate on the same project through code and visual recipes.
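As a rough illustration of the recipe concept, the sketch below treats a recipe as a plain function that takes input rows and returns output rows. It deliberately does not use the Dataiku API: `read_dataset`, `group_recipe`, and the sample `orders_csv` data are hypothetical names invented for this example, and an in-memory CSV string stands in for a real dataset.

```python
import csv
import io


def read_dataset(text):
    """Parse CSV text into a list of row dicts (stand-in for reading an input dataset)."""
    return list(csv.DictReader(io.StringIO(text)))


def group_recipe(rows, key, value):
    """A 'group'-style recipe: sum the `value` column for each distinct `key`."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0.0) + float(row[value])
    # The returned rows play the role of the output dataset.
    return [{key: k, "total_" + value: v} for k, v in sorted(totals.items())]


# Hypothetical input dataset: one row per order.
orders_csv = "customer,amount\nalice,10.5\nbob,4\nalice,2.5\n"

orders = read_dataset(orders_csv)
totals_by_customer = group_recipe(orders, "customer", "amount")
print(totals_by_customer)
```

A visual Group recipe would express the same transformation through the interface, while a code recipe would express it in SQL, Python, or R; in both cases the recipe sits between its input and output datasets.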
The lineage of a dataset (or a model) is defined by the inputs and outputs of its ancestor recipes. The Flow is a visual representation of your work as a set of dependencies between datasets and the recipes used to produce them.
Because it knows these dependencies, the Dataiku DSS engine can minimize the number of data processes it launches when building or rebuilding a dataset.
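The idea of using dependencies to limit rebuild work can be sketched as a small graph traversal. This is a minimal illustration under stated assumptions, not a description of how the Dataiku DSS engine is implemented: the `FLOW` mapping, the dataset and recipe names, and the `plan_rebuild` function are all hypothetical.

```python
# Hypothetical Flow: each built dataset maps to (recipe name, input datasets).
# Datasets that never appear as keys are treated as source datasets.
FLOW = {
    "orders_clean": ("prepare_orders", ["orders_raw"]),
    "customers_clean": ("prepare_customers", ["customers_raw"]),
    "orders_enriched": ("join_orders_customers", ["orders_clean", "customers_clean"]),
    "revenue_by_country": ("group_revenue", ["orders_enriched"]),
}


def plan_rebuild(flow, target, changed):
    """Return the recipes to run, in dependency order, so that `target`
    reflects new data in the source datasets listed in `changed`."""
    plan = []
    stale = {}  # memo: dataset name -> whether it is out of date

    def visit(dataset):
        if dataset in stale:
            return stale[dataset]
        if dataset not in flow:
            # Source dataset: stale only if it was reported as changed.
            stale[dataset] = dataset in changed
            return stale[dataset]
        recipe, inputs = flow[dataset]
        # Visit every input (list, not generator) so upstream recipes are planned first.
        needs_run = any([visit(name) for name in inputs])
        stale[dataset] = needs_run
        if needs_run:
            plan.append(recipe)
        return needs_run

    visit(target)
    return plan


plan = plan_rebuild(FLOW, "revenue_by_country", changed={"orders_raw"})
print(plan)
```

In this sketch only `orders_raw` has changed, so `prepare_customers` is left out of the plan: nothing upstream of `customers_clean` is out of date.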
Lab - Visual Analysis
The Visual Analysis lab allows you to experiment with your data in a code-free environment where you can:
- Perform interactive analyses with built-in charts and data preparation (cleaning, filtering, enriching). These steps can be deployed to the Flow as Prepare recipes.
- Use machine learning algorithms (unsupervised and supervised training) to generate insights and build predictive models. These models can be deployed to the Flow.
Lab - Code
The code lab allows you to experiment with your data in Jupyter notebooks (for Python and R) or in SQL notebooks when working with SQL databases, Hive, or Impala. You can perform interactive analysis in these notebooks and then deploy them to the Flow as code recipes.
The code lab also lets you create R Markdown reports that mix text, code, and complex visualizations using Python and R. These reports can be shared on dashboards or distributed in various printable formats.
Jobs, Scenarios, and Monitoring
Jobs are created when you build a dataset. Dataiku DSS provides a full job log to let you monitor what works and what does not, along with the ability to debug potential errors.
Scenarios help you automate rebuild tasks; for example, running daily updates to your models so that predictive scoring is always up to date. Reports on scenarios that ran previously, along with their results, are shown in Monitoring.
Wikis are a widely used tool for collaboration, allowing team members to create and edit articles that document a project, a group of projects, or more. Every Dataiku DSS project is wiki-enabled.
Dashboard & Insights
The Dashboard is a communication tool to organize, share, and deliver the Insights of your data project. Insights can include any Dataiku DSS object, such as charts, datasets, web apps, and reports.