August 20, 2024

How to Effectively Version Control Your Machine Learning Pipeline

By Omar Abid

Version control is a fundamental practice in software development that ensures the integrity and reproducibility of code. However, applying version control to machine learning (ML) pipelines comes with unique challenges. From data prep and model training to validation and deployment, each step is intricate and interconnected, demanding a robust system to manage it all.

In this post, we will explain why version control is important in ML pipelines, outline the pillars of version control, and discuss how to properly version control each component of an ML pipeline, enabling better collaboration, reproducibility, and more efficient management of your ML projects.

What is Version Control?

Version control, also known as source control, is a system that records changes to a file or set of files over time so that you can recall specific versions later. In the context of ML pipelines, version control not only applies to code but also to datasets, model artifacts, and model hyperparameters. 

Figure 1 below highlights some of the key components of an ML pipeline and the corresponding frameworks commonly used for version control.

Figure 1. Some of the components of an ML pipeline and the common tools used for version control.

Why is Version Control Critical?

Implementing proper version control in ML pipelines is essential for managing code, data, and models efficiently, because it ensures reproducibility and enables collaboration.

Reproducibility ensures that experiments can be reliably reproduced by tracking changes in code, data, and model hyperparameters. Detailed documentation of experiments and results further enhances reproducibility. Without reproducibility, it becomes challenging to validate results, leading to potential inconsistencies and errors that are difficult to diagnose, loss of trust, and setbacks in project timelines.

Collaboration enables multiple team members to work on the same project concurrently by allowing parallel development and integration of changes. Version control systems like Git provide mechanisms for branching and merging, which streamline collaborative workflows. Without collaboration, teams risk duplication of effort, misaligned goals, and integration conflicts, slowing progress.

Common Challenges Faced Without Proper Version Control

Without proper version control, ML teams face significant challenges. Issues with reproducibility and software dependencies can hinder progress and innovation, creating a negative technological impact.

In addition, errors in production models can create a direct business impact, leading to revenue loss, while inefficient cross-team collaboration slows down time to production.

What are the Pillars of Version Control in ML Pipelines?

Code Versioning

There’s a plethora of resources available online on this topic, and strategies vary based on a team’s preference. Below are a few key aspects to watch out for. For more information, see Adopt a Git Branching Strategy.

  • Code Management: Use branches and tags in Git, and keep configuration files and parameters under version control to ensure that code and configurations are consistently managed and traceable (a minimal sketch follows this list).

  • Environment Management: Log dependencies and use containers and environment management tools like Docker and Conda to ensure reproducible results.

  • Branching and Merging: Employ branching strategies (e.g., feature branches, release branches) to manage different versions of the code and facilitate collaborative development.
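
As noted under Code Management, a training script can read its hyperparameters from a configuration file that lives in the same Git repository as the code. Below is a minimal sketch, assuming a hypothetical config/train.yaml with learning_rate and batch_size keys:

# Load hyperparameters from a version-controlled config file.
# The config/train.yaml path and its keys are illustrative.
import yaml

with open("config/train.yaml") as f:
    config = yaml.safe_load(f)

# Every Git commit now pins the exact configuration used for a run
learning_rate = config["learning_rate"]
batch_size = config["batch_size"]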

Data Versioning

Data is often considered the lifeblood that fuels the algorithms in an ML pipeline. Tracking changes and lineage ensures traceability for downstream components of the ML pipeline ingesting the data. Refer to this LakeFS blog post for a more detailed description.

  • Data Versioning: Track changes in datasets, ensuring that models can be trained and evaluated consistently over time. This helps manage data drift and maintain the integrity of training and test sets.

  • Data Lineage: Keep a record of data transformations and preprocessing steps to ensure the data pipeline is reproducible and auditable, as in the sketch below.
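
To make lineage concrete, a preprocessing step can fingerprint its input data so that any later change is detectable. The helper below is a hypothetical sketch; dedicated tools such as DVC automate this bookkeeping:

import hashlib
from pathlib import Path

def dataset_fingerprint(data_dir: str) -> str:
    """Hash file paths and contents so any change to the data is detectable."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(str(path).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

# Record the fingerprint alongside the preprocessing step's outputs
print(dataset_fingerprint("data/cats"))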

Experiment Tracking

There are often tens, if not hundreds, of trained models. Without a proper way to manage experiments and the corresponding model artifacts, team collaboration and efficiency can suffer. This Neptune.AI blog post provides a detailed description of experiment tracking and a comparison of commonly used experiment trackers. 

  • Experiment Management: Log experiments, including code versions, data versions, model parameters, and results; this makes it easy to compare experiments and reproduce results. Tools like MLflow, Comet, and Weights & Biases are commonly used for experiment management; a minimal logging sketch follows.
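
For instance, a training run can log parameters, metrics, and artifacts explicitly with MLflow. This is a minimal sketch; the parameter names and values are illustrative:

import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("data_version", "v1.2")  # ties the run to a dataset version
    # ... training code ...
    mlflow.log_metric("val_accuracy", 0.93)
    mlflow.log_artifact("config/train.yaml")  # snapshot the exact config used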

Model Versioning

  • Model Tracking: Keep track of different versions of machine learning models, including hyperparameters, training configurations, and performance metrics. Use tools like MLflow, NVIDIA Triton Inference Server, AWS SageMaker, or BentoML to track and deploy different model versions.

  • Model Registry: Maintain a registry of models to manage the model lifecycle, including staging and production. See the excellent blog post What is a Model Registry? for more information.

  • Tagging Models: Implement tagging for different stages (development, QA, production) and testing methods (A/B testing, canary releases, shadow testing) to manage and streamline model deployment (see the tagging sketch after this list).
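
Below is a minimal tagging sketch using MLflow's client API; the model name "cats-and-dogs", the version, and the tag values are illustrative:

from mlflow.tracking import MlflowClient

client = MlflowClient()
# Mark version 3 of the model as the QA candidate
client.set_model_version_tag(name="cats-and-dogs", version="3", key="stage", value="qa")
# Record the rollout strategy so deployment automation can act on it
client.set_model_version_tag(name="cats-and-dogs", version="3", key="rollout", value="canary")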

Automation with CI/CD

  • Continuous Integration/Continuous Deployment (CI/CD): Integrate version control into your workflow and implement CI/CD for ML pipelines to streamline development and deployment processes. Tools like Jenkins, GitHub Actions, and GitLab CI can be integrated into the pipeline.

Documentation

  • Consistent Naming Conventions: Maintain clear naming conventions using linters such as Pylint. Linter integrations are available in many IDEs. For example, see the documentation on Linting Python in Visual Studio.

  • Comprehensive Documentation: Clear documentation facilitates collaboration and understanding among team members. 

How Do I Integrate Version Control in ML Pipelines?

In this section, we will walk through integrating version control for some of the key components in the ML pipeline, as shown in Figure 1. As each component relates to the previous section on the pillars of version control, we’ll refer readers back to that section.

To make this section easier to understand, let’s use a real-world example. Imagine a company that wants to train an image classification model to identify cats and dogs.

Prepare Data

The company begins by collecting images of cats and dogs. These images are in their unaltered form and are referred to as raw data. Processed or transformed data refers to data that has been cleaned, transformed, and organized into a format ready for use by an ML model, as in the sketch below.
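
As a minimal sketch of that preprocessing step, assuming raw JPEGs under a hypothetical data/raw/dogs directory and the Pillow library, the company might normalize every image to a fixed size:

from pathlib import Path
from PIL import Image

RAW_DIR = Path("data/raw/dogs")  # illustrative layout
PROCESSED_DIR = Path("data/processed/dogs")
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

for image_path in RAW_DIR.glob("*.jpg"):
    with Image.open(image_path) as img:
        # Normalize to RGB at a fixed resolution for training
        img.convert("RGB").resize((224, 224)).save(PROCESSED_DIR / image_path.name)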

To adhere to best practices, the company follows the Data Versioning pillar of version control, allowing them to track changes to the dataset over time. One method of doing this is to use DVC to version data in a Git-like manner.

# Initialize a DVC project within an existing Git repository
dvc init
# Add the dog images
dvc add data/dogs/
# Add the cat images
dvc add data/cats/
# Track the generated .dvc files in Git
git add -A .
# Commit and push the changes
git commit -m "Add raw data"
git push

Depending on where the data is stored, we can set up remote storage and then finally push the data.

dvc remote add -d storage s3://mybucket/dvcstore
dvc push

Data pipelines can track processed data concurrently and ensure data lineage. For more details, see the DVC Data Pipelines documentation.
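
Once the data is versioned, downstream code can pin the exact revision it reads. Here is a minimal sketch using DVC's Python API; the file path and the v1.0 Git tag are illustrative:

import dvc.api

# Read the file exactly as it existed at the v1.0 Git tag,
# regardless of what the working tree currently contains
with dvc.api.open("data/dogs/dog_001.jpg", rev="v1.0", mode="rb") as f:
    image_bytes = f.read()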

Alternative Frameworks: There are several other frameworks that can be used for data version control. The choice of tool depends on the use case, business and technical requirements, and the team’s preference. Other popular tools include, but are not limited to:

  1. Git LFS: Extension for Git to handle large files. 

  2. LakeFS: Git-like version control for data lakes. See the lakeFS-samples for sample usage.

  3. Pachyderm: Data-driven pipelines.

  4. Time Travel capabilities in platforms like the Snowflake AI Data Cloud.

Train and Validate Model

After collecting, storing, and versioning their data, the company wants to train an ML model to distinguish between cats and dogs. To adhere to best practices, the company follows the Experiment Tracking and Model Versioning pillars of version control, which allow them to manage and track experiments and model versions over time. They also employ a model registry to maintain the model artifacts, using tags to identify and quickly search for staging and production models.

One possible method to adhere to these pillars is to use MLflow; see its Quickstart guide for more information. Within their training script, train.py, they can add code to track experiments and log model artifacts. See the snippet below.

import mlflow

mlflow.set_tracking_uri(uri="http://<host>:<port>")
mlflow.set_experiment("Cats and Dogs")
mlflow.autolog()
# Training code here ...

The tracked models, experiment artifacts, and run IDs will now appear in the MLflow UI. This information can be used to trace the hyperparameters used to run the experiment and the corresponding data used for model training. Models can also be registered here for use in deployment.
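
As a minimal sketch, registration can also be done programmatically; the registry name "cats-and-dogs" is illustrative, and <run_id> comes from the MLflow UI:

import mlflow

result = mlflow.register_model(model_uri="runs:/<run_id>/model", name="cats-and-dogs")
print(result.version)  # the version number assigned in the registry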

Alternative Frameworks: There’s no shortage of experiment-tracking tools out there. While MLflow is open source, users may prefer other tools depending on the features they are looking for. Examples of other tools include Weights & Biases (W&B), ClearML, CometML, and Neptune.AI. In addition, AWS now offers SageMaker with MLflow for a more seamless user experience. This blog post provides an excellent comparison of some of the most popular experiment tracking tools.

Model Deployment

With the models trained, the company now wants to deploy them to production. To adhere to best practices, the company follows the Model Versioning pillar of version control, maintaining a model registry in conjunction with the Code Versioning and Automation with CI/CD pillars.

To maintain simplicity, the company deploys its model to an AWS SageMaker endpoint. To ensure version control, model tags are created, and a model registry is maintained. The sample code they use to achieve this is as follows. For more information, refer to MLflow’s documentation on deploying models to SageMaker.

# Build and push a Docker image compatible with SageMaker
mlflow sagemaker build-and-push-container

# Deploy the model to a SageMaker endpoint
mlflow deployments create -t sagemaker --name <deployment-name> \
    -m runs:/<run_id>/model \
    -C region_name=<your-region> \
    -C instance_type=ml.m4.xlarge \
    -C instance_count=1 \
    -C env='{"DISABLE_NGINX": "true"}'
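
Once the endpoint is live, it can be queried through the same deployments API. Below is a minimal sketch; the deployment name and input row are illustrative, and real inputs must match the model’s signature:

import pandas as pd
from mlflow.deployments import get_deploy_client

client = get_deploy_client("sagemaker")
# A single illustrative input row; shape it to match the model signature
sample = pd.DataFrame({"pixels": [[0.1, 0.2, 0.3]]})
predictions = client.predict("<deployment-name>", sample)
print(predictions)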

Alternative Frameworks: BentoML, Triton Inference Server, and ONNX Runtime are other alternative frameworks that are used for deployment. For an excellent comparison of deployment frameworks, see the blog posts by Neptune.AI and Modelbit.

Automation with CI/CD

Finally, in a production environment, we may want to automate these steps so the end-to-end process is version-controlled. One method the company may use to do this is GitHub Actions. Here’s a sample GitHub workflow the company may use to put the pieces together.

name: ML Pipeline for Cats and Dogs

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  data_preparation:
    runs-on: ubuntu-latest
    steps:
    - name: Prerequisites
      run: |
        # Checkout repository
        # Install dependencies
    - name: Prepare data
      run: |
        dvc pull
        python src/data_preprocessing.py
    - name: Push processed data to DVC
      run: |
        dvc add ...
        git add -A .
        git commit -m "Add processed data"
        git push
        dvc push

  model_training:
    needs: data_preparation
    runs-on: ubuntu-latest
    steps:
    - name: Prerequisites
      run: |
        # Checkout repository
        # Install dependencies
    - name: Pull processed data
      run: |
        dvc pull ...
    - name: Train model
      run: |
        # Train model and log artifacts to MLflow
        python src/train_model.py
        # Tag the models (prod, dev, staging, etc.)

  model_deployment:
    needs: model_training
    runs-on: ubuntu-latest
    steps:
    - name: Prerequisites
      run: |
        # Checkout repository
        # Install dependencies
    - name: Deploy model
      run: |
        # Delete any old deployments
        mlflow deployments delete ...
        # Create a new deployment
        mlflow deployments create ...

Closing

There’s no single “right” way to version control ML pipelines. As we have seen in this blog post, the choice of frameworks depends on the use case, business requirements, and the team’s software stack preferences. That said, several pillars of version control serve as guideposts. By implementing these practices, you can ensure reproducibility, manage dependencies, and streamline team collaboration.

Looking for more help?

From machine learning inception to production, phData can help! Reach out today for answers, advice, best practices, and help on your toughest machine learning challenges.

FAQs

How do I manage my machine learning workflow?

To manage your ML workflow, you can use tools like MetaFlow, which simplifies building and managing data science projects, Kubeflow for running ML workflows on Kubernetes, and Airflow for scheduling and monitoring workflows. These tools help streamline your workflow, improve collaboration, and ensure scalability, making your machine learning projects more efficient and manageable.

How do I check for data drift in machine learning?

To check for data drift in machine learning, compare the current data with the training data to see if they look different. You can use tools like Evidently AI or WhyLabs to monitor these changes. These tools help spot when data patterns shift, ensuring your model stays accurate and reliable.
