February 21, 2024

Git for Business users with Matillion Data Productivity Cloud

By Marcus Montenegro

Git techniques originated in software engineering and have now spread to a variety of other fields, including data and analytics engineering. Since they are becoming more common on visual programming platforms, even business users have begun to understand what they mean and how to utilize them.

Many people assume Git is complex and build mental barriers that keep them from learning it, but this blog aims to convince you otherwise. Git is not difficult to use, and the Matillion Data Productivity Cloud makes it even easier. Let’s take a look at how to use it.

What is Matillion Data Productivity Cloud?

Matillion’s Data Productivity Cloud is a versatile platform designed to increase the productivity of data teams. It provides a unified platform for creating and managing data pipelines that are effective for both coders and non-coders. 

The platform features AI-powered tools that enable the integration of large language models (LLMs) into your data pipelines, as well as a broad connector library and a visual, low-code design that supports a wide range of data movement and transformation operations.

The platform simplifies data pipeline orchestration by providing tools for automation, scheduling, and comprehensive visibility. It’s designed to work for people of all skill levels and to integrate smoothly with existing technology stacks. 

Matillion is also built for scalability and future data demands, with support for cloud data platforms such as Snowflake Data Cloud, Databricks, Amazon Redshift, Microsoft Azure Synapse, and Google BigQuery, making it future-ready, everyone-ready, and AI-ready.

Its core, PipelineOS, uses stateless microservice agents for scalable data flow and transformation while keeping costs low and performance high, with consumption pricing based on time spent running data pipelines rather than server uptime. As a result, Matillion is an excellent choice for businesses wishing to optimize their data operations in a scalable and user-friendly environment.

Why use Git Repositories

Connecting your Matillion Data Productivity Cloud to Git lets you apply Git’s development best practices to your solutions. Your team can work on multiple versions simultaneously and collaborate easily, which makes your developments more reliable and lets you work at a more granular level with agile methodologies.

How to use Git Repositories

As Matillion Data Productivity Cloud evolves, additional Git features will become available, but the ones already on the platform are here to stay. Let’s go over them: what they mean, where they are on the platform, and how they resemble simple actions you already perform outside of Git, to show that this is not as complicated as you may assume.

Local vs Remote

The first thing we must understand is the distinction between local and remote repositories. Long story short, consider the difference between folders on your local computer and folders stored in a cloud file storage service like Google Drive or SharePoint.

There will be times when you create a file on your computer and want to share it with the rest of the team, so you upload it to SharePoint. Whenever someone on your team wishes to make changes to that file, they can download it to their own computer and work on it without changing the shared file until they upload it again and overwrite it.

Git repositories basically follow the same concept with some extra advantages. So think of the local repository as a folder created on your local computer to store files you’re working on, which only you can see. The remote repository will be a folder on a cloud file storage provider, such as Google Drive. Remember that you can make subfolders within your local or cloud folders. You can also create something similar to subfolders within local and remote repositories, which we refer to as branches, but we’ll get into that later.

In summary:

  • Local – the repository in the instance in which you are working. The files are only accessible to you.

  • Remote – the repository hosted by the Git provider. Files can be accessed by others based on permissions.

Matillion Data Productivity Cloud is a fully SaaS product, so local repositories work slightly differently. The local repository does not exist on your PC but in a Matillion-hosted location that is only accessible through your account. The local repository logic is the same as described above, just in a location hosted by Matillion and available only to you.
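To make the local vs remote idea concrete, here is a minimal Python sketch. It is only a toy model, not how Git or Matillion stores anything: repositories are plain dictionaries, and the job names and user names (Ana, Ben) are hypothetical.

```python
import copy

# Hypothetical remote repository shared by the team: job name -> job definition.
remote = {"load_sales": "source: crm_v1"}

# Each user's local repository starts as an independent copy of the remote.
local_ana = copy.deepcopy(remote)
local_ben = copy.deepcopy(remote)

# Ana edits her local copy. Nobody else can see this change yet.
local_ana["load_sales"] = "source: crm_v2"

print(remote["load_sales"])     # the remote is unchanged
print(local_ben["load_sales"])  # Ben's local copy is unchanged
print(local_ana["load_sales"])  # only Ana sees her edit
```

Until Ana sends her change to the remote, the shared copy and Ben’s copy keep the original definition.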

Branches

Branches are an excellent Git feature. Essentially, branches will let you manage version control over your development, allowing you to make changes to many files in a new place without affecting the original ones.

Basically, a branch is a folder that contains many files. Creating a new branch means making a copy of the original folder and its contents so that you can work on them separately. It’s like going to Google Drive and creating a copy of your team’s folder so you can work on the files without affecting the originals.

This allows you to make modifications at your own pace while keeping your team’s original files available to them. When your team wants to view your progress, they can access the branch you created and every change you made. Once your modifications are complete, you can merge branches to apply them to the original files, as explained in the Merge section below.

Branches will allow you to have version control over your files. In the Data Productivity Cloud, the initial page of every project will show the branches that exist in it. The files in your development area will be displayed based on the branch you select.

As previously indicated, branches can also serve as folders. It is common for teams to create a Main or Master branch to keep files safe for real-world use (production). In addition, teams commonly create at least a Development branch, derived from the main branch, from which sub-branches for individual developments are created. 

This way, the Git repository will have some branches that act like folders in your computer’s file explorer, while other branches serve purely for version control.
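The folder-copy analogy for branches can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: real Git branches are lightweight pointers to commits rather than full copies, and the `create_branch` helper and job names are hypothetical.

```python
import copy

# Hypothetical repository: branch name -> {job name: job definition}.
repo = {"main": {"load_sales": "source: crm_v1"}}

def create_branch(repo, new_name, source_name):
    """Create a branch that starts as a copy of an existing branch."""
    repo[new_name] = copy.deepcopy(repo[source_name])

create_branch(repo, "development", "main")

# Work only on the new branch: change one job and add another.
repo["development"]["load_sales"] = "source: crm_v2"
repo["development"]["load_orders"] = "source: erp"

print(sorted(repo))         # both branches exist side by side
print(repo["main"])         # main still holds the original job only
print(repo["development"])  # development holds the edited and the new job
```

The main branch is untouched until the changes are merged into it.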

Pull Remote Changes

Now that you understand branches and the distinction between local and remote repositories, let’s look at how to use them with the Matillion Data Productivity Cloud.

Before you start working on something, especially when collaborating with others on the same team, ensure you have the most recent file versions in your local repository. In this manner, you avoid the possibility of using obsolete files or making changes that have already been developed by others. So you’ll need to synchronize your local and remote repositories.

Selecting any branch in your project opens an environment where you have access to the Matillion jobs in that branch. The Git button is located in the top right corner and shows all the Git commands we will learn here. Let’s start with Pull remote changes.

That command triggers your Data Productivity Cloud instance to check the remote repository for updates to that branch that are not yet in your local repository. These include new files, changes to existing files, and removed files. The updates are replicated in your local repository, but files that you created, modified, or removed locally and never sent to the remote are unaffected.

In other words, Pull remote changes compares your local and remote repositories but applies changes in one direction only: from remote to local. When it finishes, your local repository is up to date with all of the modifications made by your team.
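The one-way behavior of a pull can be sketched as follows. This is a toy model under stated assumptions, not Git’s or Matillion’s actual algorithm: `base` is a hypothetical snapshot of the branch at the last sync, all names are illustrative, and files changed on both sides (conflicts) are simply left as they are locally.

```python
def pull(local, remote, base):
    """Apply remote-side changes to local, leaving local-only changes alone.

    base is the snapshot both sides shared at the last sync.
    """
    result = dict(local)
    for name in set(base) | set(remote):
        remote_changed = remote.get(name) != base.get(name)
        local_changed = local.get(name) != base.get(name)
        if remote_changed and not local_changed:
            if name in remote:
                result[name] = remote[name]  # created or modified remotely
            else:
                result.pop(name, None)       # removed remotely
    return result

base = {"load_sales": "v1", "load_orders": "v1", "old_report": "v1"}
# The team modified load_sales, removed old_report, and created new_report.
remote = {"load_sales": "v2", "load_orders": "v1", "new_report": "v1"}
# You edited load_orders and created my_draft, but never pushed them.
local = {"load_sales": "v1", "load_orders": "v1-my-edit",
         "old_report": "v1", "my_draft": "v1"}

updated = pull(local, remote, base)
print(updated)
```

After the pull, the local copy has the team’s three updates, while `load_orders` and `my_draft` keep your unsent work.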

Commit

One thing you should know is that the modifications you make in your local repository are not automatically saved. It’s like creating a spreadsheet, doing a lot of work on it, and then leaving it unsaved.

Because Matillion’s Data Productivity Cloud is available 24 hours a day, seven days a week, you will not lose your work when you leave the platform, but you may lose it when syncing repositories.

To ensure that the changes you make are saved in your local repository, you must commit your work. Think of Commit as a Save button: it saves your work in your local repository and ensures that it is not lost. It does not yet apply the changes to the remote repository, so other people on the team still cannot see them. 

A commit guarantees the changes will remain in your local repository after a hard reset, a merge, or a pull of remote changes. To make them available to the rest of your team, you need to push local changes, as described in the next section.
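The "Save button" idea can be modeled with a short Python sketch. It is an illustration only: the `LocalBranch` class and its attributes are hypothetical and much simpler than what Git actually records in a commit.

```python
import copy

class LocalBranch:
    """Toy model of one branch in a local repository."""

    def __init__(self, jobs):
        self.commits = [copy.deepcopy(jobs)]  # saved snapshots, oldest first
        self.working = copy.deepcopy(jobs)    # the jobs as you are editing them

    def has_uncommitted_changes(self):
        return self.working != self.commits[-1]

    def commit(self):
        """Save the current state of every job as a new snapshot."""
        self.commits.append(copy.deepcopy(self.working))

branch = LocalBranch({"load_sales": "source: crm_v1"})
branch.working["load_sales"] = "source: crm_v2"

before = branch.has_uncommitted_changes()  # edited but not saved yet
branch.commit()
after = branch.has_uncommitted_changes()   # latest snapshot matches the edits

print(before, after, len(branch.commits))
```

Note that the commit only adds a snapshot to the local history. Nothing in this model touches a remote repository, which mirrors why a push is still needed afterwards.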

Push Local Changes

Now that you’ve made all the commits needed to update your local branches, it’s time to sync the repositories again. The Push local changes command uploads your modifications to the remote repository. Think of this as uploading a file back to SharePoint after making changes to the original version.

When you push your local changes, the Data Productivity Cloud compares your local repository to the remote repository. In contrast to Pull remote changes, Push local changes applies all modifications, creations, or deletions you made locally in the branch to your remote repository, keeping your team up to date with the most recent version of your files in this branch.

During the push, conflicts may arise, similar to those seen during the merge process. When that happens, decide which version should stay in the remote repository. If there is no conflict, all of your modifications are applied.
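As a rough sketch of the push behavior described above, the following Python function sends local changes to the remote and reports conflicts instead of overwriting a teammate’s work. It is a toy model, not Git’s or Matillion’s implementation: the `push` function, the `base` snapshot of the last sync, and the job names are all hypothetical.

```python
def push(local, remote, base):
    """Apply local changes to the remote; report files changed on both sides."""
    new_remote = dict(remote)
    conflicts = []
    for name in sorted(set(base) | set(local) | set(remote)):
        local_changed = local.get(name) != base.get(name)
        remote_changed = remote.get(name) != base.get(name)
        if not local_changed:
            continue  # nothing of yours to send for this file
        if remote_changed and remote.get(name) != local.get(name):
            conflicts.append(name)         # someone must pick a version
        elif name in local:
            new_remote[name] = local[name]  # created or modified locally
        else:
            new_remote.pop(name, None)      # removed locally
    return new_remote, conflicts

base = {"load_sales": "v1", "load_orders": "v1", "old_report": "v1"}
# You modified two jobs, removed old_report, and created new_report.
local = {"load_sales": "v2", "load_orders": "v2-mine", "new_report": "v1"}
# Meanwhile, a teammate modified load_orders in the remote.
remote = {"load_sales": "v1", "load_orders": "v2-theirs", "old_report": "v1"}

new_remote, conflicts = push(local, remote, base)
print(new_remote)
print(conflicts)
```

In this run, three of your changes reach the remote, while `load_orders` is reported as a conflict and keeps the teammate’s version until someone decides which one should stay.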

Merge

There will be times when you will have to set up new branches to work in a separate stream from the original one to guarantee that the changes you make do not affect the original branch until you are certain they are working properly.

However, there will come a moment when you will want to apply the changes you made to the original branch. Think about downloading a file from Google Drive, making numerous changes on your local computer, saving it (commit), and then uploading that file back to Google Drive to replace the original with the latest version.

The Merge command works in a comparable way. Once you’ve completed all of the modifications you want to apply to the original branch (like overwriting the original file), you merge your branch into the original, which then includes the most recent updates. The merge triggers the Data Productivity Cloud to treat the original branch as the primary one and replicate everything from your branch that does not yet exist there.

Let’s take one example to illustrate it. You have the main branch for all your production-ready Matillion jobs and a Test branch for the developments you’re doing. You have a project that is finished, and after some validation, you concluded that it is now good to be moved to production, the main branch. In order to move these jobs from one branch to another, we will use the merge feature.

First, make sure the final version of the jobs in the Test branch is committed, so that they are saved in your local repository. Once you commit the last changes in the Test branch, make them available to everyone by pushing them to the remote repository (Push local changes, as seen in the previous section). Now that they are available in the remote repository, you can go to your main branch and select the merge option. 

That lets you bring in a branch from the remote repository and mix it with your current local branch, in this case the main branch. That mix is the merge. All changes you previously pushed to the remote branch are replicated in the branch you are on in the local repository. The merge window will be like the one below.

After the merge, the result still lives only in your local main branch. To make it available to the rest of the team, push the local changes to your remote repository, as we saw in the last section. This completes a full promotion of your Matillion jobs from your Test branch to your main branch, and everyone on the team can find them whenever they need them.

One thing to keep an eye out for is changes made to the original branch while you were working on your own branch. If those changes touch the same things that you modified, they cause Git conflicts during the merge process.

For example, suppose the original branch contained a transformation job that used a certain data source. You built a new branch on top of the original one, changed the data source, and merged it back into the original, but someone else did the same thing using another data source. 

The Data Productivity Cloud will notify you that someone made a change to the job that differs from yours and will ask you to pick which changes should be kept: yours or the ones currently in the original branch.

Once you’ve resolved all of these issues, the merge will be complete. If there are no conflicts, everything will be applied immediately.
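The data source conflict described above can be reproduced with a small three-way merge sketch in Python. This is a simplified model that compares whole job definitions, not the line-level merge Git performs, and the `merge` function, its `prefer` parameter, and the job names are hypothetical.

```python
def merge(current, incoming, base, prefer):
    """Three-way merge of job definitions.

    base is the state both branches started from; prefer is "current" or
    "incoming" and decides which side wins when both changed the same job.
    """
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(current) | set(incoming)):
        b, c, i = base.get(name), current.get(name), incoming.get(name)
        if c == i:      # both sides agree
            chosen = c
        elif c == b:    # only the incoming branch changed it
            chosen = i
        elif i == b:    # only the current branch changed it
            chosen = c
        else:           # both changed it differently: conflict
            conflicts.append(name)
            chosen = c if prefer == "current" else i
        if chosen is not None:
            merged[name] = chosen
    return merged, conflicts

base = {"transform_sales": "source: crm_v1"}
# A teammate already changed the data source in the original branch.
main = {"transform_sales": "source: erp"}
# Your branch changed the same job and added a new one.
test = {"transform_sales": "source: crm_v2", "load_orders": "source: shop"}

merged, conflicts = merge(main, test, base, prefer="incoming")
print(merged)
print(conflicts)
```

Here `load_orders` merges cleanly because only your branch touched it, while `transform_sales` is flagged as a conflict and resolved in favor of your version because of `prefer="incoming"`.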

Hard Reset

Hard reset is a useful Git feature available in the Matillion Data Productivity Cloud. If you have made multiple changes to the current branch but later discover they need to be undone, the hard reset will allow you to return the branch to the last commit you made.

Use this feature with caution. It is an important resource to have on hand, but only use it if you are certain you need to return to the previous commit. The hard reset deletes all uncommitted modifications in that branch, not just for one of your jobs but for all of them.
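A minimal Python sketch shows why a hard reset affects every job in the branch at once. It is a toy model with hypothetical names: the commit history is a list of snapshots, and `hard_reset` simply returns a fresh copy of the last one.

```python
import copy

# Hypothetical commit history of one branch: the last entry is the latest commit.
commits = [{"load_sales": "source: crm_v1", "load_orders": "source: shop"}]
working = copy.deepcopy(commits[-1])

# Uncommitted work on several jobs: one edited, one removed, one created.
working["load_sales"] = "source: crm_v2"
del working["load_orders"]
working["load_returns"] = "source: shop"

def hard_reset(commits):
    """Discard all uncommitted edits by returning a copy of the last commit."""
    return copy.deepcopy(commits[-1])

working = hard_reset(commits)
print(working)
```

All three uncommitted changes disappear together, and the branch looks exactly as it did at the last commit.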

Git Example

Several Git strategies can be applied to manage version control and coordinate parallel developments across the team. Let’s go over a simple Git strategy that you can adopt and modify to meet your needs.

When you first establish a project on the Data Productivity Cloud, the main branch is automatically created. Let us refer to it as your production branch, which contains the latest version of your developments that have been tested to ensure data quality and requirement adherence. 

Create a new branch called QA based on this main branch by clicking the Add new branch button on the branches page. This branch will hold all of the developments you’ve released for testing before they go into production.

Finally, you will create a development branch. This will effectively hold all of your incomplete developments. Once your developments are complete, you will promote your Matillion jobs by merging your development branch into QA and then QA into main. 

To return to other developments, simply select the development branch again and continue to commit changes there until the next promotion, which requires new merges.

Logically, you would have something like this:

When working on a new development, it is common to create a new branch known as the feature branch for the project in question. This allows the development branch to hold numerous developments simultaneously, with the only requirement being to merge the development and feature branches to promote them later to other environments. In this simple example, it will logically be something like the following:

These images illustrate Git trees, which are commonly used to show how different branches relate to one another over time. You can view the tree for your own repository in your Git provider.
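The promotion path in this strategy (feature branch, then development, then QA, then main) can be walked through with a short Python sketch. It is a simplified model with hypothetical names: `promote` stands in for a conflict-free merge that ignores deletions, and it assumes QA and development are both created from main.

```python
import copy

def create_branch(repo, new_name, source_name):
    """Create a branch that starts as a copy of an existing branch."""
    repo[new_name] = copy.deepcopy(repo[source_name])

def promote(repo, source_name, target_name):
    """Simplified merge: copy every job from source into target.

    Assumes there are no conflicts and ignores deleted jobs.
    """
    repo[target_name].update(copy.deepcopy(repo[source_name]))

repo = {"main": {"load_sales": "v1"}}
create_branch(repo, "QA", "main")
create_branch(repo, "development", "main")
create_branch(repo, "feature/orders", "development")

# The new development happens only on the feature branch.
repo["feature/orders"]["load_orders"] = "v1"

promote(repo, "feature/orders", "development")
promote(repo, "development", "QA")
in_main_before_release = "load_orders" in repo["main"]  # still being tested in QA
promote(repo, "QA", "main")
in_main_after_release = "load_orders" in repo["main"]   # now in production

print(in_main_before_release, in_main_after_release)
```

The new job reaches production only after it has passed through development and QA, which is the point of keeping the branches separate.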

If you work in a small team or even alone, you don’t need to complicate development with multi-step Git techniques, but you can still benefit from some of Git’s advantages. A small team of up to three developers working on distinct projects can get by with just the main branch and the development branch. In short, it is important to choose a Git strategy that is appropriate for your team.

Closing

In this blog, you learned what the Git terms used in the Matillion Data Productivity Cloud mean and how to use these features to manage different team developments. Even though Git started in the software engineering field, it is good practice to incorporate it into your data developments now that you know how to apply it.

Do you require assistance in designing and implementing a data pipeline or leveraging your organization’s Matillion Data Productivity Cloud?

Please contact our team for assistance in accomplishing this goal.
