Git for Projects

Dataiku DSS has three primary integrations with Git:

  • Code libraries
  • Plugin development
  • Projects

The first two integrations enable coders and developers to more effectively share their work across DSS projects and instances. Git integration for projects enables even the non-coders on the team to take advantage of version control.

Each change that you make in a DSS project is automatically committed to a local Git repository. Thus, any normal contribution to a DSS project passively uses the Git integration for projects.

This tutorial walks through the active use of the Git integration to:

  • Connect a local project to a remote Git repository
  • Branch the project in order to do some “experimental” work without affecting the flow for other members of the data team
  • Push project changes from the local branch to the remote Git repository
  • Merge the branch into master
  • Pull the changes to master from the remote Git repository to the local project

Prerequisites

It is strongly recommended to have a good understanding of the Git model and terminology before using this feature.

Technical Requirements

  • Access to a remote Git repository to which you can push changes. Ideally, it should be an empty repository.
  • A DSS instance that has been set up to work with remote Git repositories. See Working with Git in the reference documentation.
  • A project to practice with. This tutorial will use the Haiku Starter project, which can be found by selecting +New Project > DSS Tutorials > General Topics > Haiku Starter.

Connect to a Remote Git Repository

From the project menu in the top navigation bar, select Version Control. This shows that we are on the master branch of the project.

../../_images/project-version-control.png
  • Click on the change tracking indicator and select Add remote.
  • Enter the URL of the remote and click OK.
  • From the change tracking indicator, select Push.
../../_images/project-version-control-push.png

In your remote Git repo, you can see that the master branch has been successfully pushed.

../../_images/project-version-control-push-github.png
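
For readers who think in command-line terms, the Add remote and Push actions correspond roughly to the plain Git commands sketched below. This is an illustration only, not a description of what DSS runs internally. The remote name origin, the helper name, and the example URL are assumptions made for the sketch.

```python
import shlex


def connect_and_push_commands(remote_url, remote_name="origin", branch="master"):
    """Build the plain Git commands that roughly mirror 'Add remote' then 'Push'.

    The commands are returned as argument lists and are not executed here.
    """
    return [
        # Register the remote repository under a short name.
        ["git", "remote", "add", remote_name, remote_url],
        # Push the branch and remember the remote branch as its upstream.
        ["git", "push", "--set-upstream", remote_name, branch],
    ]


# Hypothetical repository URL, used only for illustration.
commands = connect_and_push_commands("git@example.com:team/haiku-starter.git")
for cmd in commands:
    print(shlex.join(cmd))
```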

Note

Each project must have its own repository.

Branch the Project

  • From the branch indicator, click Create new branch.
  • Name the new branch prune-flow and click Next.
  • Click Duplicate and Create Branch.

This creates a duplicate project working on the prune-flow branch.

../../_images/create-branch.png

Note

Key concept: Duplicated projects for branching

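A given DSS project can only be on one branch at any given time. Switching the branch of the current project affects all collaborators, and no one can work on multiple branches at once within that project. This is why creating a branch duplicates the project: the duplicate works on the new branch while the original stays on its current branch.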
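
As a loose analogy in plain Git, a single working copy also has only one branch checked out at a time, and working on two branches side by side requires a second working copy. The sketch below builds a git worktree command that creates such a second copy on a new branch. It is an analogy only and makes no claim about how DSS duplicates a project. The helper name and the directory path are hypothetical.

```python
import shlex


def parallel_branch_command(new_branch, worktree_path, base="master"):
    """Build a Git command that adds a second working copy on a new branch.

    Plain-Git analogy only; the command is returned, not executed.
    """
    return ["git", "worktree", "add", "-b", new_branch, worktree_path, base]


# Hypothetical directory for the second working copy.
command = parallel_branch_command("prune-flow", "../haiku-starter-prune-flow")
print(shlex.join(command))
```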

Now we can make our changes to the duplicate project on the prune-flow branch without disturbing the rest of the data team’s use of the master branch of the project. Go to the Flow of the project and see that the Flow forks three ways from the Orders_enriched_prepared dataset.

../../_images/project-branch-flow.png

We will prune the Flow by removing the Orders_by_Country_Category and Orders_filtered datasets.

../../_images/project-branch-flow-pruned.png

Push Branch Changes to the Remote Repository

  • From the project menu in the top navigation bar, select Version Control.
  • From the change tracking indicator, select Push.
../../_images/project-branch-push.png

Merge Branch Changes to Master

You can see that the prune-flow branch has been pushed to your remote Git repo. To merge the changes into the master branch, use your usual Git workflow outside of Dataiku DSS.

../../_images/project-branch-push-github.png
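
One possible way to perform the merge outside of Dataiku DSS is from a local clone of the remote repository. The sketch below assumes such a clone exists, that the remote is named origin, and that you are allowed to push to master. Many teams use a pull request on their Git hosting service instead. The function name and the dry_run flag are illustrative. With dry_run left at True, nothing is executed and the planned commands are only returned.

```python
import subprocess


def merge_branch(branch, target="master", remote="origin", repo_dir=".", dry_run=True):
    """Plan (and optionally run) a simple merge of `branch` into `target`."""
    steps = [
        ["git", "fetch", remote],                   # get the pushed branch
        ["git", "checkout", target],                # switch to the target branch
        ["git", "merge", f"{remote}/{branch}"],     # merge the remote branch
        ["git", "push", remote, target],            # publish the merged target
    ]
    if not dry_run:
        for step in steps:
            # check=True stops at the first failing command, e.g. a merge conflict.
            subprocess.run(step, cwd=repo_dir, check=True)
    return steps


planned = merge_branch("prune-flow")
for step in planned:
    print(" ".join(step))
```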

Note

Branching and Merge Conflicts. This tutorial describes an extremely simple branch and merge. If multiple collaborators each create a separate branch off of master and then try to merge their separate branches back to master, they are likely to encounter Git merge conflicts. These can be difficult to resolve and we may not be able to solve them for you. Your data team should agree on a plan for how to collaborate on projects using Git, in order to avoid merge conflicts.
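
As part of such a plan, a team might check which files two branches have both modified before merging. The sketch below is one hedged way to do that: it builds a git diff --name-only command for each branch and intersects the resulting file lists. The file names are hypothetical and do not reflect the actual layout of a DSS project repository. An overlap marks only a candidate for a conflict, since Git reports a conflict only when both branches change the same part of a file.

```python
def files_changed_command(branch, base="master"):
    """Build the Git command listing files changed on `branch` since it left `base`."""
    # The three-dot form compares `branch` against its merge base with `base`.
    return ["git", "diff", "--name-only", f"{base}...{branch}"]


def likely_conflicts(changed_by_a, changed_by_b):
    """Return the files modified on both branches, sorted for stable output."""
    return sorted(set(changed_by_a) & set(changed_by_b))


# Hypothetical outputs of the command above for two collaborators' branches.
branch_a_files = ["recipes/compute_orders.json", "datasets/orders_filtered.json"]
branch_b_files = ["datasets/orders_filtered.json", "params.json"]

overlap = likely_conflicts(branch_a_files, branch_b_files)
print(overlap)
```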

Pull Master Changes to Local

Finally, to see the merged changes reflected in Dataiku DSS, return to the original project.

  • From the change tracking indicator, select Fetch to retrieve the changes from the remote Git repo.
  • Then select Pull to bring the changes into your local Git repository.
../../_images/project-master-pull.png
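
In plain Git terms, the fetch and pull steps look roughly like the commands sketched below, with an optional check in between that counts how many commits the local branch is behind the remote. This is an illustration under the assumption that the remote is named origin. It is not a description of what DSS executes, and the helper name is hypothetical.

```python
def sync_commands(remote="origin", branch="master"):
    """Build plain Git commands that roughly mirror Fetch followed by Pull."""
    return [
        # Download new commits without touching the local branch.
        ["git", "fetch", remote],
        # Count the commits on the remote branch that the local branch lacks.
        ["git", "rev-list", "--count", f"{branch}..{remote}/{branch}"],
        # Integrate the remote commits into the local branch.
        ["git", "pull", remote, branch],
    ]


steps = sync_commands()
for step in steps:
    print(" ".join(step))
```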