This article aims to answer many frequently asked questions about model registries. For a broader perspective on how model registries fit into an MLOps framework, check out the Tracking section of our Ultimate Guide to Deploying ML Models.
What is a Model Registry?
A model registry is a repository used to store trained machine learning (ML) models.
In addition to the models themselves, a model registry stores information (metadata) about the data and training jobs used to create the model. Tracking these requisite inputs is essential to establish lineage for ML models. In this way, a model registry serves a function analogous to version control systems (e.g. Git, SVN) and artifact repositories (e.g. Artifactory, PyPI) for traditional software.
Another way to think about model lineage is to consider all of the details that would be necessary to recreate a trained model from scratch. Establishing lineage through a model registry is a vital component of a robust MLOps architecture.
How Does a Model Registry Work?
Each model stored in a model registry is assigned a unique identifier, also known as a model ID or UUID. Many off-the-shelf registry tools also include a mechanism for tracking multiple versions of the same model. Data science and ML teams can use the model ID and version to refer to specific models, both when comparing them and when confirming exactly which model is being deployed.
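To make the ID-plus-version idea concrete, here is a minimal in-memory sketch. It is not how any particular registry tool is implemented; the names `ModelEntry`, `InMemoryRegistry`, and `register` are hypothetical, and a real registry would persist entries to a database or object store.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelEntry:
    """One registered model version (hypothetical record layout)."""
    model_id: str   # unique identifier for this exact entry
    name: str       # human-readable name shared by all versions of a model
    version: int    # 1, 2, 3, ... within a given name


@dataclass
class InMemoryRegistry:
    """Toy registry kept in memory purely for illustration."""
    _entries: dict = field(default_factory=dict)  # model_id -> ModelEntry
    _latest: dict = field(default_factory=dict)   # name -> latest version number

    def register(self, name: str) -> ModelEntry:
        """Assign a fresh UUID and the next version number for `name`."""
        version = self._latest.get(name, 0) + 1
        entry = ModelEntry(model_id=str(uuid.uuid4()), name=name, version=version)
        self._entries[entry.model_id] = entry
        self._latest[name] = version
        return entry

    def get(self, model_id: str) -> ModelEntry:
        """Look up an entry by its unique ID."""
        return self._entries[model_id]


registry = InMemoryRegistry()
first = registry.register("churn-classifier")
second = registry.register("churn-classifier")
print(first.version, second.version)  # prints "1 2"
```

Registering the same name twice yields two distinct IDs with consecutive version numbers, so either the ID alone or the name and version together identify a model unambiguously.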
Registry tools also allow for storage of parameters and metrics. For instance, training and evaluation jobs could write hyperparameter values and performance metrics (e.g. accuracy) when registering a model. Storing these values allows for simple comparison of models. As teams develop new models, having this data on hand helps them see whether new versions of a model are improving upon previous versions. Many registry tools also include a graphical interface to visualize these parameters and metrics.
Parameters and metrics tracked by MLflow autologging for LightGBM.
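The comparison that stored parameters and metrics enable can be sketched in a few lines. This is an illustration under assumptions, not a real registry API: `RunRecord` and `best_version` are hypothetical names, and the hyperparameter and accuracy values are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Hyperparameters and metrics logged for one model version (hypothetical)."""
    version: int
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


def best_version(runs, metric, higher_is_better=True):
    """Return the run with the best value for `metric`, skipping runs that lack it."""
    scored = [r for r in runs if metric in r.metrics]
    if not scored:
        raise ValueError(f"no run logged the metric {metric!r}")
    pick = max if higher_is_better else min
    return pick(scored, key=lambda r: r.metrics[metric])


# Illustrative values only, not real results.
runs = [
    RunRecord(1, {"learning_rate": 0.10, "num_leaves": 31}, {"accuracy": 0.81}),
    RunRecord(2, {"learning_rate": 0.05, "num_leaves": 63}, {"accuracy": 0.84}),
]
best = best_version(runs, "accuracy")
print(f"version {best.version} has the highest accuracy")
```

For metrics where lower is better, such as a loss, passing `higher_is_better=False` selects the minimum instead.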
Under the hood, model registries are generally comprised of the following elements:
What Can Go Wrong Without a Model Registry?
Without a model registry, data scientists and machine learning engineers are more likely to cut corners or make costly mistakes.
Here are some common pitfalls that we’ve seen:
What Information Should a Model Registry Store?
Key forms of information stored in a model registry fall into the following categories: software, data, metrics, and models.
A robust model registry should be able to store all details necessary to establish model lineage.
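One hypothetical way to group the four categories into a single record is sketched below. The field names, the placeholder values, and the `example-bucket` URIs are assumptions made for illustration; actual registry tools define their own schemas.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical record covering software, data, metrics, and model."""
    code_commit: str       # software: version-control revision of the training code
    dependencies: tuple    # software: pinned packages used during training
    dataset_uri: str       # data: where the training data lives
    dataset_sha256: str    # data: checksum identifying the exact data snapshot
    metrics: dict          # metrics: evaluation results
    artifact_uri: str      # model: where the serialized model is stored


def sha256_of(data: bytes) -> str:
    """Checksum used to pin the exact bytes a model was trained on."""
    return hashlib.sha256(data).hexdigest()


# Placeholder values only; a training job would fill these in.
record = LineageRecord(
    code_commit="<git revision>",
    dependencies=("lightgbm==<version>",),
    dataset_uri="s3://example-bucket/train.csv",
    dataset_sha256=sha256_of(b"placeholder training data"),
    metrics={"accuracy": 0.84},
    artifact_uri="s3://example-bucket/models/model.pkl",
)
print(json.dumps(asdict(record), indent=2))
```

Storing a checksum alongside the dataset location is a design choice worth noting: a URI alone does not prove the data is unchanged since training.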
Model registry tools can also store input parameters to training jobs and performance metrics to enable comparisons between different models or versions of models. These elements can usually be captured completely by storing the following forms of information:
How Does a Model Registry Help Data Scientists?
Model registry tools help data scientists by enabling reproducible research during model development. You can think of a good model registry as a specialized lab notebook for machine learning models. As such, it simplifies the bookkeeping process for data scientists. By logging metrics, data, and software to a model registry, data scientists can quickly see how the changes they make impact model performance. They can then move on to new experiments right away, because their previous ones have already been documented in the registry.
Reproducible models are also easier to operationalize, which reduces friction for data scientists. A model in a registry is easier to hand off to engineering teams for deployment. When model artifacts are stored in a registry with lineage, the engineering team doesn’t need to invest effort in retraining the models using a more robust framework. If subsequent retraining is necessary, operations teams can take it on, or engineering teams can automate it, since the process is already documented. This frees data scientists to work on new models rather than retrain old ones.
Using a registry to track models may initially seem like an extra burden on data scientists, but they will quickly see that a small amount of extra code greatly accelerates their work.
How Does a Model Registry Fit into an MLOps Framework?
Model registries provide a common source of truth for referencing models and underlying versions. When data scientists communicate with engineering teams, they can use the unique ID stored in the registry to refer to a model with zero ambiguity. Similarly, applications can take the unique ID as a parameter in their deployment pipeline and fetch the associated artifacts from the registry to make updating models painless.
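As a rough sketch of the deployment side, a pipeline step might receive a model ID and resolve it to an artifact location before deploying. The `REGISTRY` mapping, the IDs, the URIs, and the function names below are all hypothetical stand-ins for a call to a real registry service.

```python
# Hypothetical registry contents: model ID -> artifact location.
REGISTRY = {
    "3f2b9c1e-0000-4000-8000-000000000001": "s3://example-bucket/models/churn/1/model.pkl",
    "3f2b9c1e-0000-4000-8000-000000000002": "s3://example-bucket/models/churn/2/model.pkl",
}


def resolve_artifact(model_id, registry=REGISTRY):
    """Look up the artifact location for a model ID, failing loudly if unknown."""
    try:
        return registry[model_id]
    except KeyError:
        raise LookupError(f"model ID {model_id!r} is not in the registry") from None


def deployment_plan(model_id, registry=REGISTRY):
    """Describe what a deployment step would fetch for the given model ID."""
    return {"model_id": model_id, "artifact_uri": resolve_artifact(model_id, registry)}


plan = deployment_plan("3f2b9c1e-0000-4000-8000-000000000002")
print(f"deploying {plan['model_id']} from {plan['artifact_uri']}")
```

Because the ID is the only input, updating the deployed model amounts to changing one parameter, and an unknown ID fails before anything is deployed.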
Lineage established in a model registry can eliminate the need for engineers to rewrite training code because details necessary to reproduce models are readily available. And if model retraining jobs also publish models and metrics, it is easy for any team member to track performance over time. Monitoring performance can help establish the return on a particular ML investment and justify operational costs.
How Does a Model Registry Contribute to Governance, Compliance, and Auditing?
Model governance, compliance, and audits are increasingly important in the context of machine learning.
Regulations have been increasing in this space; for example, GDPR contains language about a consumer’s right to explanations. But, even without regulations, models may sometimes produce erroneous or confusing predictions that require audit. In these cases, you’ll need to trace the specific version of a model that generated such predictions, as well as the underlying training data. Proper use of a model registry ensures that this is possible.
Data governance can also create issues for ML models. Many models are trained using sensitive data, and some forms of data are required to be destroyed after some period of time. Other times, data might seem irrelevant and get deleted even though it was used to train a model. What happens to models that were trained using this data? A model registry can help organizations manage the dependence of their models on specific data and put appropriate guardrails around data governance.
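The dependence of models on data can be made queryable once lineage is recorded. The sketch below assumes a hypothetical mapping from model IDs to the datasets used in training (`TRAINED_ON`); the model and dataset names are invented for the example.

```python
from collections import defaultdict

# Hypothetical lineage pulled from a registry: model ID -> datasets used in training.
TRAINED_ON = {
    "churn-v1": {"customers-2021", "billing-2021"},
    "churn-v2": {"customers-2022", "billing-2021"},
    "fraud-v1": {"transactions-2022"},
}


def models_by_dataset(trained_on):
    """Invert the lineage so each dataset maps to the models that depend on it."""
    index = defaultdict(set)
    for model_id, datasets in trained_on.items():
        for dataset in datasets:
            index[dataset].add(model_id)
    return index


def affected_models(dataset, trained_on=TRAINED_ON):
    """Models to review before `dataset` is deleted, in sorted order."""
    return sorted(models_by_dataset(trained_on).get(dataset, set()))


print(affected_models("billing-2021"))  # prints "['churn-v1', 'churn-v2']"
```

A check like this could run before a scheduled deletion, so that the affected models are retrained, retired, or documented rather than silently orphaned.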
What are Some Popular Model Registry Tools?
There are many model registry tools available, but the following tools can help you and your team get started:
Using a model registry is a key component of building a robust MLOps framework. Model registries simplify research and development for data scientists and streamline the model deployment process. They also enable complex auditing and governance that would otherwise be virtually impossible.
If your team is interested in integrating a model registry into your MLOps framework but unsure how to start, check out our Ultimate MLOps Guide: How to Deploy ML Models to Production.