Best Practices for Collaborating in Dataiku DSS
This series of guides passes on experience that Dataiku has gained over several projects. Once users have learned how to use the Studio, they start solving their own problems; the tips and tools below are meant to ease collaboration and help those projects succeed.
Properly naming your datasets and your recipes is arguably the most important element for collaboration. Good naming helps you find your previous work, share your work with others, and understand quickly what your colleagues are working on.
The two main objectives are readable and self-explanatory names. Keep names as short as possible, and think about what the element does in your flow. Default names are created by appending the name of the operation to the input’s name. This scheme has the benefit of being simple, but the names quickly become long and unreadable as operations are chained. Replace the default name with something more self-explanatory.
A good method is to focus on what the created dataset will be used for and to find differentiating names, e.g. foo_raw and foo_clean: the input is raw data, the output is clean.
Suggested naming scheme
The following rules maintain names compatible with all storage connections (SQL dialects, HDFS, Python dataframe columns, etc.):
- only alphanumeric characters and underscores (“_”),
- all lowercase,
- no spaces,
- does not begin with a number.
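The rules above can be checked automatically. The following is a minimal sketch, not a DSS feature: the function name `naming_issues` is hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits only.

```python
import re


def naming_issues(name):
    """Return the list of naming rules that `name` breaks (empty if compliant).

    Hypothetical helper; assumes "alphanumeric" means ASCII letters and digits.
    """
    issues = []
    if not name:
        issues.append("is empty")
    if " " in name:
        issues.append("contains spaces")
    # Anything other than ASCII letters, digits, underscores (spaces are reported above).
    if re.search(r"[^A-Za-z0-9_ ]", name):
        issues.append("contains characters other than letters, digits and underscores")
    if name != name.lower():
        issues.append("contains uppercase letters")
    if name[:1].isdigit():
        issues.append("begins with a number")
    return issues


if __name__ == "__main__":
    for candidate in ["foo_clean", "Foo Clean", "2024_sales", "foo-raw"]:
        print(candidate, "->", naming_issues(candidate) or "ok")
```

Returning the list of broken rules, rather than a plain true/false answer, makes it easier to tell a colleague why a name should be changed.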
Optionally, you can adopt prefixes and suffixes for your datasets, e.g. foo_t for a dataset in a SQL database or foo_hdfs for an HDFS dataset.
Keep the same tips in mind when naming the columns of your datasets, your notebooks, and your projects.
For projects, informative naming can be a good solution: topic, author, version (date-based).
Remember to use fully explicit project names (e.g. “Data Ingestion” and not “p001_data_ingestion”).
Additional collaboration features
On the insights page, one can see all created graphs and webapps, and publish them on a dashboard. Dashboards can also contain webapps, notebooks (esp. the images they generate), datasets and more.
Dashboards are a good way to share findings among the team, and can be used to show a report to a read-only user, such as a manager.
Most code input boxes, for instance those of Python recipes or of custom Python code for a model, have a “code samples” button in their top right corner. Start by exploring the code samples already provided. They are meant to help you get started when facing a blank page.
If you find yourself repeatedly writing similar portions of code, consider writing a plugin (the biggest investment, but the easiest to use, even by non-coders), a library, or a code sample (the lightest investment). A code sample can then easily be inserted in other code boxes and is available to all team members, which saves time for everyone.
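As an illustration, a small helper that a team might share as a library function or code sample could look like the sketch below. The function name `to_flow_name` and its behavior are hypothetical and not part of DSS; it assumes ASCII names, so accented letters are replaced like any other special character.

```python
import re


def to_flow_name(label):
    """Turn a free-form label into a lowercase, underscore-separated name.

    Hypothetical helper for illustration only.
    """
    # Lowercase, then collapse every run of other characters into one underscore.
    name = re.sub(r"[^0-9a-z]+", "_", label.strip().lower())
    name = name.strip("_")
    # Names should not begin with a number, so prefix those with an underscore.
    if name[:1].isdigit():
        name = "_" + name
    return name


if __name__ == "__main__":
    print(to_flow_name("Customer Orders (2024)"))  # customer_orders_2024
    print(to_flow_name("2024 Sales"))              # _2024_sales
```

Keeping such a helper in one shared place means every team member renames datasets and columns the same way.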