Data Governance

To help data teams comply with internal policies and external regulations (such as GDPR) on data privacy and protection, Dataiku provides a plugin that allows you to:

  • Document data sources with sensitive information, and enforce good practices
  • Restrict access to projects and data sources with sensitive information
  • Audit the sensitive information in a Dataiku DSS instance

Prerequisites

Technical Requirements

  • Account with administrative privileges, to install and configure the plugin and related security
  • Account with privileges as described in the plugin configuration below, to work with sensitive data

Install and Configure the Plugin

While logged in with an account that has administrative privileges, install the GDPR plugin from the plugin store.

In the plugin store, click Settings. Two fields are available:

  • GDPR admin groups. These are the user groups whose members can configure projects for work with sensitive data. Let’s set this to privacy_admin. Using a group that is separate from the main administrators group lets us grant this privilege to specific users only.
  • GDPR documentation groups. These are the user groups whose members can document whether datasets contain sensitive data. Let’s set this to privacy_doc.

We could use existing groups instead of creating new ones; how you configure these settings depends on your team’s particular situation. However, it is good practice to keep sets of privileges separate and to grant each one only to the users who need it.

../../_images/plugin-settings.png

Configure Security for Work with Personal Data

Now we need to create the security groups privacy_admin and privacy_doc.

  • Navigate to the Security tab of the administration tool, and then to the Groups panel.
  • Click +New Group.
  • Name the new group privacy_admin and give it a description like “Configure projects for work with secure data”, then click Save.

There’s no need to set any global permissions for this group, since it exists only to confer project administrator permissions related to privacy. A user’s other global permissions come from membership in other appropriate groups. Note that it may be good practice to give privacy_admin the Create projects permission, and to allow only users in privacy_admin to create projects on the instance.

  • Create another group called privacy_doc and give it a description like “Document datasets with secure data”.

../../_images/security-groups.png

On the Users panel, assign users to these groups.

../../_images/user_add-to-group.png
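
The same two groups can also be created from code rather than through the Groups panel. The following is a minimal sketch, assuming the dataikuapi Python client and its DSSClient.list_groups() and DSSClient.create_group() methods; the host URL and API key in the usage comment are placeholders, and ensure_privacy_groups is a hypothetical helper name, not something the plugin provides.

```python
# Group names and descriptions taken from the steps above.
PRIVACY_GROUPS = {
    "privacy_admin": "Configure projects for work with secure data",
    "privacy_doc": "Document datasets with secure data",
}


def ensure_privacy_groups(client, groups=PRIVACY_GROUPS):
    """Create any missing groups and return the names that were created.

    `client` is assumed to be a dataikuapi.DSSClient built with an
    administrator API key (hypothetical helper, for illustration only).
    """
    # list_groups() is assumed to return a list of dicts with a "name" key.
    existing = {group["name"] for group in client.list_groups()}
    created = []
    for name, description in groups.items():
        if name not in existing:
            client.create_group(name, description=description)
            created.append(name)
    return created


# Usage (placeholders; requires a running instance and an admin API key):
# import dataikuapi
# client = dataikuapi.DSSClient("https://dss.example.com", "ADMIN_API_KEY")
# ensure_privacy_groups(client)
```

Checking for existing groups first makes the helper safe to run more than once. Assigning users to the groups is left to the Users panel as described above.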

Note

Depending on your organization’s situation, it can be useful to create other “privacy_” groups for users who will be given access to individual projects containing secure data, but not the authority to configure the project or dataset privacy settings.

Configure Projects for Work with Personal Data

Once the Dataiku DSS instance is configured to work with personal data, you can configure projects. While logged in as a user with the privacy_admin group privilege, from the instance homepage select +New Project > DSS tutorials > General Topics > Data Governance.

At the bottom of the project homepage is an area with GDPR fields.

  • Forbid dataset sharing. This prevents any user from sharing a dataset with personal data outside of this project.
  • Forbid dataset and project export. This prevents any user from exporting a dataset with personal data, or exporting this project if it contains any dataset with personal data.
  • Forbid model creation. This prevents any user from creating a model with the Dataiku Visual ML tool on any dataset with personal data.
  • Forbid uploaded datasets. This prevents any user from creating an “Uploaded files” dataset and potentially introducing personal data to the project in an insecure way. Note that this restriction only affects new datasets, and not existing ones.
  • Forbidden connections. This prevents any user from creating a dataset in this project from any of the listed connections. You may want to explicitly restrict the use of some connections because they contain personal data (for example, a CRM source).

../../_images/project_gdpr-settings.png

Warning

The restrictions put in place here encourage best practices, but do not guarantee they will be followed. A user can circumvent the restrictions using code or the API.

Document Datasets with Personal Data

Once the project is configured to work with personal data, we can begin to document which datasets contain personal data. While logged in as a user with the privacy_doc group privilege, from the project home, go to the Customers dataset.

Click on the GDPR icon next to the dataset’s name to edit the GDPR fields for the dataset.

  • Set Contains personal data to Yes, since this dataset holds identifying information about customers.
  • For the dataset Purposes enter something like Marketing communication and Recommender system. This documents, for auditing purposes, the reason why this data was collected.
  • For Retention policy enter something like 3 years after last action. This documents, for auditing purposes and to take appropriate filtering actions, how long the personal data can be used.
  • For Legal basis for consent enter something like Explicit consent on website. This documents, for auditing purposes, how the personal data came into our possession.

../../_images/dataset-gdpr-fields.png

Now select Actions > Share and choose a project to share this dataset with. Because we’ve documented this dataset as containing personal data, and because of the project settings, Dataiku prevents the share.

../../_images/share-error.png
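
The blocked share results from two conditions holding at once: the project forbids sharing, and the dataset is documented as containing personal data. The sketch below models that rule only. ProjectGdprSettings and can_share are hypothetical names, not the plugin's implementation, and the sketch assumes that a dataset is blocked only when it is explicitly documented as containing personal data.

```python
from dataclasses import dataclass


@dataclass
class ProjectGdprSettings:
    """Hypothetical stand-in for the project-level GDPR fields."""
    forbid_dataset_sharing: bool = False


def can_share(settings, contains_personal_data):
    """Return True if sharing is allowed under the modeled rule."""
    # Sharing is refused only when both conditions are true.
    return not (settings.forbid_dataset_sharing and contains_personal_data)


strict_project = ProjectGdprSettings(forbid_dataset_sharing=True)
customers_allowed = can_share(strict_project, contains_personal_data=True)
```

With these inputs customers_allowed is False, which mirrors the error shown for the Customers dataset.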

Now open the Orders_enriched_prepared dataset. The Prepare recipe has removed the identifying information, so we can mark this dataset as not containing personal data.

Note

When datasets marked as not having personal data are used as inputs to a recipe, Dataiku will automatically mark the output dataset as not containing personal data. If any input is marked as having personal data, or Not yet defined, then the output dataset will be marked as Not yet defined.
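
The propagation rule in the note above can be written as a small function. This is an illustrative sketch rather than the plugin's code: the status strings and the name propagate_status are assumptions, as is the choice to return Not yet defined for a recipe with no inputs.

```python
YES = "Yes"
NO = "No"
UNDEFINED = "Not yet defined"


def propagate_status(input_statuses):
    """Return the status given to a recipe output, based on its inputs."""
    statuses = list(input_statuses)
    # Only when every input is explicitly marked "No" is the output "No".
    if statuses and all(status == NO for status in statuses):
        return NO
    # Any "Yes" or undefined input (or no inputs, by assumption) leaves
    # the output to be documented by hand.
    return UNDEFINED
```

Note that the output is never marked Yes automatically: a recipe such as the Prepare recipe above may have removed the identifying columns, so the decision is left to the person documenting the dataset.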

In the Flow, you can view the GDPR status of each dataset by choosing Metadata fields from the View menu. Ensure that GDPR fields - Contains personal data is the selected field.

../../_images/flow-metadata.png

Produce an Audit Report

You can produce two different types of audit reports for personal data. To run them, choose … > Macros.

GDPR Datasets Check-up

Open this macro, deselect Only UNSURE, and click Run Macro. This builds a list of all the datasets in the project (or in all projects, if you select that option in the macro dialog), so that you can quickly scan it for missing information: for example, datasets that are in an UNSURE state, or datasets that contain personal data but have no listed purpose, retention policy, or legal basis for consent.

../../_images/gdpr-datasets-audit.png
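
The kind of scan this report supports can be sketched in a few lines. The code below is not the macro itself: the record layout, the field names, the YES and NO status values, and the sample records (including the dataset names Orders and Customers_copy) are made up for illustration.

```python
# Fields that should be filled in when a dataset contains personal data.
REQUIRED_FIELDS = ("purposes", "retention_policy", "legal_basis")


def find_problems(record):
    """Return a list of problems for one dataset record (hypothetical layout)."""
    status = record.get("contains_personal_data", "UNSURE")
    if status == "UNSURE":
        return ["status is UNSURE"]
    if status == "YES":
        return [f"missing {field}" for field in REQUIRED_FIELDS
                if not record.get(field)]
    return []


def checkup(records):
    """Map each dataset name to its problems, skipping clean datasets."""
    report = {}
    for record in records:
        problems = find_problems(record)
        if problems:
            report[record["name"]] = problems
    return report


# Made-up sample records for illustration.
records = [
    {"name": "Customers", "contains_personal_data": "YES",
     "purposes": "Marketing communication",
     "retention_policy": "3 years after last action",
     "legal_basis": "Explicit consent on website"},
    {"name": "Orders", "contains_personal_data": "UNSURE"},
    {"name": "Orders_enriched_prepared", "contains_personal_data": "NO"},
    {"name": "Customers_copy", "contains_personal_data": "YES",
     "purposes": "Marketing communication"},
]
report = checkup(records)
```

Here the report flags Orders for its UNSURE status and Customers_copy for its missing retention policy and legal basis, while the fully documented Customers dataset and the dataset marked as not containing personal data are left out.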

GDPR audit

Open this macro and click Run Macro. This builds an audit trail of the personal data policies that apply to each object within the project (or in all projects, if you select that option in the macro dialog). It is more detailed than the datasets check-up report, and helps you identify potential problem areas in your projects and across the Dataiku DSS instance.

../../_images/gdpr-audit.png

What’s Next

Congratulations! You’ve taken your first steps towards understanding proper data governance and can begin to apply it in your own organization.