Machine-learning (ML) models almost always require deployment to a production environment to provide business value. The unfortunate reality is that many models never make it to production, or if they do, the deployment process takes much longer than necessary. Even successfully deployed models will require domain-specific upkeep that can create new engineering and operations challenges.
The simple fact is that ML models are software. Deploying and maintaining any software is a serious task, but ML introduces new complexities. These demands have given rise to the field of MLOps. Analogous to the way that DevOps has added structure to the process of software engineering, a proper MLOps implementation streamlines the process of developing and deploying ML models.
On top of Observability, Operations, and other DevOps principles that have evolved for common software projects, ML models require data-quality monitoring and automated model retraining. Most importantly, reproducing an ML model requires the original dataset to be available in addition to all software and relevant configuration parameters; this volume of information is vastly larger than traditional source code and build artifacts.
In this guide, we will introduce MLOps and outline considerations that will help ensure ML applications make it to production and run smoothly. At the end of the day, that’s what it takes for a model to provide business value. We’ll leave aside the issue of estimating the business value that could be achieved by deploying such a model, though we do discuss architectures that will simplify decisions regarding future upgrades.
In order to be successful, organizations should lay out an architecture that enables ML and supports their business needs. Since various industries and domains are governed by specific data regulations, there is no one-size-fits-all solution. As ML adoption has increased, addressing these unique challenges has given rise to the completely new subfields of ML Engineering and MLOps.
Successful ML deployments generally take advantage of a few key MLOps principles, which are built on the following pillars:
Throughout the rest of this post we will drill deep into these pillars to help provide a guide for any organization looking to deploy models to production effectively.
While data-science research and model development may seem decoupled from the deployment lifecycle, the MLOps cycle starts in the lab. The most important aspects discussed in the upcoming Automation/DevOps and Monitoring/Observability sections will rely on properly tracked models. When diagnosing issues with models, engineering and operations teams must quickly be able to determine how and when a model was created. Tracing models back to their source is also increasingly important for regulatory and compliance audits.
Tracking models in the R&D phase may seem like a hassle for data scientists, but with the right tools, tracking can be unobtrusive.
PRO TIP: Have you ever spent a long time trying to understand a data scientist’s model-training notebook?
You’re not alone – most teams could greatly reduce friction at that step. Standardizing on the right tools for tracking (and training) models will noticeably reduce the time and effort necessary to transfer models between the data science and engineering teams.
Notebooks themselves are not the problem – in fact, they are vital to research and development – but they can cause problems when it is hard to reproduce the environment in which the notebook was run. Data science platforms like Dataiku and SageMaker allow users to develop and execute notebooks while providing a consistent and well-documented setting for notebook execution.
Tracking models creates a mechanism for model and data governance. Many models are trained using sensitive data. Some forms of data are required to be destroyed after a period of time. In other instances, data is deemed irrelevant and is just deleted. If data is deleted, what happens to the models that were trained using that data? Organizations should take steps to guard against these scenarios by tracking models as well as the data used to train them.
Beyond that, compliance, regulations, and auditing are becoming increasingly relevant and challenging issues. For example, the European Union’s GDPR is widely interpreted as giving individuals a “Right to Explanation” for automated decisions. Similar requirements already apply to certain algorithmic decisions in the United States, such as credit scoring.
Compliance with regulations often requires auditing. Even without regulations, models may produce erroneous or confusing predictions that require investigation. Model tracking makes auditing a tractable task rather than an impossible one.
Finally, model tracking puts structure around the model-creation process. Development of ML models is experimental in nature, and reproducibility can be a major challenge. In light of that challenge, consistent patterns for data science and model development are vital. Standard tools, practices, and processes for data scientists can greatly reduce the amount of time it takes to transfer models to engineering teams.
Standardization can also greatly reduce the time and energy data scientists spend on setting up environments and infrastructure. Organizations should build and invest in tools that streamline model development and model tracking.
The foundation for model tracking is a model registry. Ideally, your organization should have a common, enterprise-wide model registry for all ML operations. A model registry acts as a location for data scientists to store models as they are trained, simplifying the bookkeeping process during research and development. Models retrained as part of the production deployment should also be stored in the same registry to enable comparison to the original versions.
A good model registry should allow tracking of models by name/project and assign a version number. When a model is registered, it should also include metadata from the training job. At the very least, the metadata should include:
Having a model registry puts structure around the handoff between data scientists and engineering teams. When a model in production produces erroneous output, registries make it easy to determine which model is causing the issue, and roll back to a previous version of the model if necessary. Without a model registry, you might run the risk of deleting or losing track of the previous model, making rollback tedious or impossible. Model registries also enable auditing of model predictions.
Some data scientists may resist incorporating model registries into their workflows, citing the inconvenience of having to register models during their training jobs. It is easy to justify a registry requirement on the grounds of streamlined handoff and auditing, and data scientists usually come to find that registering models can simplify their bookkeeping as they experiment.
PRO TIP: Bypassing the model-registration step should be discouraged or disallowed by policy. Data scientists may see it as a shortcut to not use a model registry, but bookkeeping and auditing challenges later easily justify doing things right the first time.
Good model-registry tools make tracking of models virtually effortless for data scientists and engineering teams; in many cases, it can be automated in the background or handled with a single API call from model training code.
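To make the "single API call" idea concrete, here is a minimal sketch of a toy in-memory registry. The `ModelRegistry` class, its methods, the artifact URI, and the metadata fields are all hypothetical names chosen for illustration; they are not the API of any particular registry product, and a real registry would persist models and metadata in durable storage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass
class ModelVersion:
    """One registered model together with metadata from its training job."""
    name: str
    version: int
    artifact_uri: str
    metadata: Dict[str, Any]
    registered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class ModelRegistry:
    """Toy in-memory registry (hypothetical): tracks models by name and version."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def register(
        self, name: str, artifact_uri: str, metadata: Dict[str, Any]
    ) -> ModelVersion:
        """Store a model under `name` and assign the next version number."""
        history = self._versions.setdefault(name, [])
        entry = ModelVersion(name, len(history) + 1, artifact_uri, dict(metadata))
        history.append(entry)
        return entry

    def get(self, name: str, version: int) -> ModelVersion:
        """Look up a specific version, for example to audit it or roll back."""
        return self._versions[name][version - 1]

    def latest(self, name: str) -> ModelVersion:
        """Return the most recently registered version of a model."""
        return self._versions[name][-1]


registry = ModelRegistry()

# The single call a training job would make once the model has been saved.
first = registry.register(
    "churn-model",
    "s3://example-bucket/churn/v1/model.pkl",  # made-up location
    {"git_commit": "abc123", "metrics": {"accuracy": 0.91}},
)
# A later retraining run registers under the same name and gets version 2.
second = registry.register(
    "churn-model",
    "s3://example-bucket/churn/v2/model.pkl",
    {"git_commit": "def456", "metrics": {"accuracy": 0.93}},
)
```

Because every version stays addressable by name and number, rolling back amounts to looking up an earlier entry with `registry.get("churn-model", 1)`.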
Model registries come in many shapes and sizes to fit different organizations based on their unique needs. Common options fall into a few categories:
Feature stores make it easier to track what data is being used for ML predictions, and they also help data scientists and ML engineers reuse features across multiple models. A feature store provides a repository for data scientists to keep track of features they have extracted or developed for models. In other words, if a data scientist retrieves data for a model (or engineers a new feature based on some existing features), they can commit that to the feature store. Once a feature is in the feature store, it can be reused to train new models – not just by the data scientist who created it, but by anyone within your organization who trains models.
Feature stores are intended to let data scientists iterate more quickly by reusing past work, but they also accelerate the work of productionizing models. If features are committed to a feature store, your engineering teams can more easily incorporate the associated logic into the production pipeline. When it’s time to deploy a new model that uses the same feature, there won’t be any additional work to code up new calculations.
Feature stores work best for organizations that have commonly used data entities that are applicable to many different models or applications. Take, for example, a retailer with many e-commerce customers – most of that company’s ML models will be used to predict customer behavior and trends. In that case, it makes a lot of sense to build a feature store around the customer entity. Every time a data scientist creates a new feature to better represent customers, it can be committed to the feature store for any ML model making predictions about customers.
Another good reason to use feature stores is batch-scoring scenarios. If you are scoring multiple models on large batches of data (rather than one-off/real-time) then it makes sense to pre-compute the features. The pre-computed features can be stored for reuse rather than being recalculated for every model.
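The reuse described in this section can be sketched with a toy in-memory feature store built around the customer entity. The `FeatureStore` class, its `commit` and `get_features` methods, and the sample data are hypothetical and purely illustrative; production feature stores add persistence, versioning, and point-in-time lookups.

```python
from typing import Dict, Hashable, Iterable, List


class FeatureStore:
    """Toy in-memory feature store (hypothetical) keyed by entity ID."""

    def __init__(self) -> None:
        # feature name -> {entity ID -> precomputed value}
        self._features: Dict[str, Dict[Hashable, float]] = {}

    def commit(self, name: str, values: Dict[Hashable, float]) -> None:
        """Publish a precomputed feature so any model can reuse it."""
        self._features[name] = dict(values)

    def get_features(
        self, entity_ids: Iterable[Hashable], names: List[str]
    ) -> List[List[float]]:
        """Return one row per entity containing the requested features, in order."""
        return [[self._features[n][e] for n in names] for e in entity_ids]


# Raw order totals per customer (made-up data for illustration).
orders = {"c1": [20.0, 35.0], "c2": [5.0], "c3": [50.0, 10.0, 15.0]}

store = FeatureStore()
# Each feature is computed once, then committed for everyone to reuse.
store.commit("order_count", {c: float(len(v)) for c, v in orders.items()})
store.commit("total_spend", {c: sum(v) for c, v in orders.items()})

# Two different models read the same stored features without recomputing them.
churn_inputs = store.get_features(["c1", "c2"], ["order_count", "total_spend"])
upsell_inputs = store.get_features(["c3"], ["total_spend"])
```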
If your organization hasn’t already adopted a DevOps mindset, now is the time to start. If you’re already familiar with DevOps principles, you’ll find many similarities in applying them to ML applications. That said, there are some additional challenges you will need to face.
ML models are software and should be treated with the same care as any good software product. While traditional software artifacts (source code and binaries) are generally small and easy to store, ML artifacts commonly include large binary files and volumes of data. As opposed to traditional software that only needs updates due to new development, ML applications must be retrained to account for evolving data. By combining DevOps principles with tracking tools like model registries and feature stores, you can deliver ML applications with agility and peace of mind.
Effective collaboration is key to DevOps success. This is even more important in developing ML applications since data-science teams are added to the traditional DevOps mix of engineering and operations teams. Eliminating silos and reducing friction between teams allows organizations to deploy applications more quickly. In doing so, you’ll create more business value and keep your teams happy.
You should use collaborative development workflows that enable members of any team to submit pull requests on codebases. This means that data scientists should be able to submit pull requests to an ML application codebase developed by engineers. Meanwhile, the engineers should be able to contribute to the codebases data scientists use for research and development. By creating a structured approval process on pull requests, appropriate teams can own the code, while also enabling other teams to make contributions.
To effectively use pull requests, all enterprise code should be kept in a common version-control system. To encourage collaboration, it should be painless and simple to onboard users into the system by allowing team leads to grant access to additional developers. GitHub and Bitbucket are common tools for version control that support pull requests and simple access management.
By promoting collaboration through pull requests, you can emphasize the skillsets of every role and allow individuals to contribute where they are best suited. If your data scientists aren’t the best coders, upskill them through pair programming and code reviews. Creating small changes and receiving feedback is a great way for anyone to learn software skills!
Pull requests and code reviews also allow teams to make sure application codebases are always in a production-ready state. A good development practice is to develop software such that it is releasable at any given time. Your teams should always aim to develop and deliver small product increments. Even if you don’t want to release or deploy changes continuously, having the ability to do so will make sure the decision can be made from a business standpoint rather than a technical one.
Automation is key to building good applications, and that principle extends to ML applications. Any manual process can introduce errors and waste valuable resources. Automation helps guard against such errors. As an added benefit, engineers will spend less time on deployments so they can focus on more interesting problems.
Automation of builds and deployments is commonly referred to as continuous integration and continuous delivery, or CI/CD for short. CI/CD pipelines are designed to deliver small and safe increments of application development. Under this scheme, builds and deployments are triggered by commits (changes) to the version-control repository, and thus the changes are small. To keep things safe, CI/CD pipelines should automatically execute test suites to make sure the newest version is stable and ready.
The CI/CD pattern reduces the time and administrative burden of deployments. It also reduces the risk of issues being raised by deployments, since each released increment is small and understandable. If an issue arises, it’s easy to roll back that change and alleviate the problem, then redevelop the problematic feature to resolve the issue.
CI/CD pipelines are also self-documenting. When a version of the application is deployed, it is easy for operations teams to understand which version of the code is running and provide support for issues that arise. As a result, CI/CD deployments are highly reproducible and easy to explain. Auditing becomes easier because changes in behavior can be mapped back to build logs and commit messages.
In the ML context, model training can be treated as a build job. If data scientists are using a code repository, commits to that repository could trigger a training job. Data scientists typically save time by using a small sample of the data for development and testing, but this build job could reference the full dataset for greater accuracy. That same job can include a model evaluation step to validate the performance of the model and automatically register the model if the performance is sufficient. Automating training in this way reduces the risk of reporting erroneous model performance when the model is updated.
Whether your ML application is serving up real-time or batch predictions, there will likely be some periodic long-running processing involved. Batch scoring is of course a periodic process, but real-time applications will need to be retrained, and that will require some long-running processes. It is vital to automate these processes rather than relying on a team member to manually trigger the jobs. Not all models will be retrained at fixed intervals, but even on-demand triggers can be automated.
Some processes may involve some amount of manual review (model retraining is a good example) but this shouldn’t discourage you from automating the execution of those jobs. Even if there is a manual review step, automation will ensure that each job is configured and executed in a repeatable manner, thus reducing the risk of errors. To enable review, these jobs can generate automated reports. For a simple solution, tools such as papermill can execute Jupyter notebooks to generate reports.
Automated retraining raises the question of whether deployment of retrained models can be automatic. This depends on the application and nature of retraining, but in many cases, it makes sense to deploy updated models automatically. Consider a case where a model is trained using data from the most recent day, and will need to be updated daily as a result. In this case, anything but automatic deployment could become arduous. But deploying a model without any conditions can also be risky. In these cases, the retraining jobs should include an evaluation step that outputs performance metrics. For complete automation, a threshold can be applied – if the performance metrics are all above prescribed values, it should be safe to automatically deploy the model.
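A threshold gate of this kind can be very small. The sketch below assumes every metric is "higher is better" and treats a value at or above its minimum as passing; the function names, metric names, and threshold values are hypothetical.

```python
from typing import Dict, List


def failed_checks(
    metrics: Dict[str, float], thresholds: Dict[str, float]
) -> List[str]:
    """Return the names of metrics that are missing or below their minimum."""
    return [
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    ]


def should_deploy(metrics: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Allow automatic deployment only when every threshold is satisfied."""
    return not failed_checks(metrics, thresholds)


# Hypothetical minimums agreed on by the data science and engineering teams.
thresholds = {"accuracy": 0.90, "recall": 0.85}

good_run = {"accuracy": 0.93, "recall": 0.88}
bad_run = {"accuracy": 0.93, "recall": 0.70}

deploy_good = should_deploy(good_run, thresholds)
deploy_bad = should_deploy(bad_run, thresholds)
blocked_by = failed_checks(bad_run, thresholds)
```

Returning the list of failed checks, rather than only a boolean, gives the retraining job something useful to put in its report when a deployment is blocked.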
When a manual approval step is strictly necessary, the right tools make all the difference. For instance, CI/CD tools such as Jenkins and AWS CodePipeline support manual approval steps. A simple pattern for this is to automatically email approvers with a model-evaluation report and the CI/CD approval prompt. Once approved, the model can be automatically deployed to a production environment, or even a lower testing environment subject to further promotion processes.
Good tools can also help to manage the jobs themselves. Rather than relying on cron jobs and Bash scripts, tools like Apache Airflow can be used to orchestrate processing. Airflow enables scheduling and execution of arbitrary pipelines defined as directed acyclic graphs (DAGs), with integrations for many ecosystems, such as Hadoop and AWS.
A model in a registry can’t provide business value. To do so, it must be somehow packaged and integrated into software. There are a few broad architectures you should consider depending on your application requirements:
Application monitoring has evolved into an entire subfield called Observability Engineering. Today’s leading enterprises are making huge investments in this space and seeing dividends. We can’t possibly cover the entire space, but we aim to lay a foundation and discuss ML-specific considerations.
When designing and developing your applications, make sure you have a plan for logging, monitoring, and alerting. It’s not enough to just write arbitrary print statements throughout your code. Developers should have a common understanding of how messages should be generated from the code and how log levels (debug, info, error) will be utilized. Architects should make sure they understand the destination system that will store log messages to determine an appropriate format.
As an example, consider an imaginary Python application. From the outset of the project, the team should establish patterns such as:
The above pattern is certainly no gold standard – this is just one way to architect logging. But it does serve as an example of how clear guidelines can be established. In the absence of such guidelines, developers may introduce log messages with arbitrary discretion, leading to excessive messaging in some parts of the codebase and a complete lack thereof elsewhere.
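As one further illustration of what such guidelines can look like in code, the sketch below uses Python's standard `logging` module to emit one JSON object per log line from named loggers. The logger names (`ml_app`, `ml_app.scoring`), the JSON fields, and the choice of INFO as the default level are assumptions made for this example, not a recommended standard.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line for easy parsing."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def configure_logging(stream, level: int = logging.INFO) -> logging.Logger:
    """Attach a single JSON handler to the application's top-level logger."""
    logger = logging.getLogger("ml_app")  # hypothetical application name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stands in for stdout or a log shipper
configure_logging(buffer)

# Modules obtain child loggers by name instead of calling print().
log = logging.getLogger("ml_app.scoring")
log.debug("Feature vector assembled")     # dropped: below the INFO level
log.info("Scored batch of %d rows", 128)  # routine event
log.error("Model artifact not found")     # needs attention

records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Configuring the handler and level in one place means individual modules only decide what to say and at which level, while the format and destination stay consistent across the codebase.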
In addition to capturing log messages throughout your code, you should make sure to monitor the health of your infrastructure. Make sure your systems are keeping track of the volume of requests and resource (CPU, memory, disk space) utilization. Raise alerts when your infrastructure starts to become overloaded, and ideally before your users start to notice.
If your ML application is some sort of real-time service which responds to requests, you should also be monitoring the reliability of your service. It’s important to measure uptime, that is, what percentage of the time your service is actively responding to requests. To do this, periodically probe your service and make sure it is responding. Log those responses, and raise alerts when they fail.
You’ll also want to track latency, or how long it takes your service to respond to requests. You should track the average latency, as well as the 95th percentile. If the 95th percentile is far higher than the average, a meaningful share of requests is being served too slowly. Extreme latency values can also help identify deeper issues, such as cold-start latency when infrastructure is scaled or internal network delays.
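The comparison between average and 95th-percentile latency can be computed with the standard library. This is a minimal sketch: the `latency_summary` function and the sample latencies are made up, and in practice these numbers would usually come from your monitoring system rather than application code.

```python
import statistics
from typing import Dict, List


def latency_summary(latencies_ms: List[float]) -> Dict[str, float]:
    """Summarize request latencies with the mean and the 95th percentile."""
    mean = statistics.fmean(latencies_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=100, method="inclusive")[94]
    return {"mean_ms": mean, "p95_ms": p95, "p95_to_mean": p95 / mean}


# Made-up sample: most requests are fast, but one in ten hits a slow path.
latencies = [20.0] * 90 + [500.0] * 10
summary = latency_summary(latencies)
```

In this sample the mean looks tolerable while the 95th percentile exposes the slow tail, which is why tracking both is worthwhile.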
Since ML applications depend on data, monitoring them requires keeping an eye on the input features and output predictions. It also means making sure that you have a good understanding of which model version transformed the input to output.
Whether predictions are generated individually or in batches, every prediction should be assigned a unique ID corresponding to that prediction. If the request for that prediction is coming from an external system, this allows that prediction to be linked to the request. The prediction ID should also be linked to the version of the model in use, in whatever format that model is identified in the model registry. The model, request, and prediction identifiers should be recorded in a log message and inserted into a database if necessary.
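A minimal sketch of this bookkeeping follows. The `predict_with_tracking` function, the record fields, the logger name, and the `name:version` model identifier format are hypothetical; the model is any callable, and a real service might also write the record to a database.

```python
import json
import logging
import uuid
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("ml_app.predictions")  # hypothetical logger name


def predict_with_tracking(
    model: Callable[[Dict[str, Any]], Any],
    model_id: str,
    features: Dict[str, Any],
    request_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Score one input and return a record linking request, prediction, and model."""
    record = {
        "prediction_id": str(uuid.uuid4()),  # unique per prediction
        "request_id": request_id,            # supplied by the calling system, if any
        "model_id": model_id,                # as identified in the model registry
        "prediction": model(features),
    }
    logger.info(json.dumps(record))
    return record


def toy_model(features: Dict[str, Any]) -> float:
    """Stand-in for a real model: flags customers with few recent orders."""
    return 1.0 if features["orders_last_90d"] < 2 else 0.0


first = predict_with_tracking(
    toy_model, "churn-model:3", {"orders_last_90d": 1}, request_id="req-001"
)
second = predict_with_tracking(
    toy_model, "churn-model:3", {"orders_last_90d": 7}, request_id="req-002"
)
```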
It is also vital to capture the input and output of the model. ML applications need to be monitored for data drift, which is only possible if the data itself is captured. While it may be possible in some cases to include this data in log messages, it might not be practical to pass so much structured data through the logging system. As an alternative, your application could write this data directly to a database or object store. Or, if the data is written to logs, it should be done in a clearly structured format such that the logs can be parsed and streamed into a dedicated storage system.
Finally, as mentioned in the tracking section, some ML applications will need to be audited for explanations of the model predictions. In these cases, it may be necessary to generate and store model explanations for each prediction at the same time it is generated. Alternatively, you could design a separate system capable of generating model explanations based on other stored information, such as the input to the model, model ID, or request/prediction ID. The latter approach may serve better in cases where audits are rare and it is not worth paying the compute and storage costs associated with generating and storing explanations for every single prediction, but it requires additional engineering effort upfront.
ML applications are unique in that they depend heavily on data from other systems and processes. Even a perfectly stable ML application is subject to changes from the outside.
Imagine a weather-forecasting model trained on the last three months of data. As summer turns to winter, that model would have no understanding of freezing temperatures, and would fail to predict that winter precipitation is more likely to come as snow than rain. Of course, any meteorologist would call that a terrible forecast model, but seasonality forms a great example of data drift.
There are two key mechanisms for tracking drift in ML systems: input monitoring and ground-truth evaluation. Input monitoring can happen in real time and provide early signals for detecting drift. Ground-truth evaluation can only take place once predictions have been labeled and compared with the true value, but ultimately provides smoking-gun evidence that drift is degrading predictive power.
Input monitoring involves tracking features used as input to the model for changes relative to the original distributions. The most basic approach for input monitoring is to record descriptive statistics for the training data and compare those metrics to the batches of data observed in the production system. For example, comparing the mean value of each feature within a particular time window – hourly, daily, or monthly – to the mean value from the training dataset can indicate that the data has drifted in some way. The more metrics tracked in this way, the better – consider including attributes such as median, interquartile range, standard deviation, etc. It also helps to track the rate of outliers based on traditional measures of spread (z-score, extreme quantiles, etc.) or even to introduce outlier-detection models.
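The basic approach can be sketched for a single numeric feature using only the standard library. The function names, the temperature values, and the three-standard-deviation outlier cutoff are assumptions made for illustration; appropriate statistics and alert thresholds depend on your data.

```python
import statistics
from typing import Dict, List


def describe(values: List[float]) -> Dict[str, float]:
    """Descriptive statistics recorded once for the training data."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "median": median,
        "iqr": q3 - q1,
        "stdev": statistics.stdev(values),
    }


def mean_shift(train_stats: Dict[str, float], batch: List[float]) -> float:
    """Distance of the batch mean from the training mean, in training stdevs."""
    return abs(statistics.fmean(batch) - train_stats["mean"]) / train_stats["stdev"]


def outlier_rate(
    train_stats: Dict[str, float], batch: List[float], z_threshold: float = 3.0
) -> float:
    """Fraction of batch values whose z-score against training data is extreme."""
    extreme = [
        v
        for v in batch
        if abs(v - train_stats["mean"]) / train_stats["stdev"] > z_threshold
    ]
    return len(extreme) / len(batch)


# Made-up daily temperatures (Celsius): training data comes from summer.
training = [18.0, 20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 17.0, 20.0, 20.0]
train_stats = describe(training)

summer_batch = [19.0, 21.0, 20.0, 20.0, 20.0]  # resembles the training data
winter_batch = [2.0, 0.0, -1.0, 3.0, 1.0]      # drifted production data

summer_shift = mean_shift(train_stats, summer_batch)
winter_shift = mean_shift(train_stats, winter_batch)
winter_outliers = outlier_rate(train_stats, winter_batch)
```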
A more sophisticated means for input monitoring takes advantage of a secondary ML model. As new data is passed to the production system, batches of that data can be merged with the training data. By labeling the examples based on whether they come from the training or production sample, a classifier can be trained to predict “training data” versus “production data.” If no drift is present, the accuracy of that classifier should be close to chance, since the production data will closely resemble the training data. But if drift starts to occur, the model may start to learn to distinguish between the two datasets. Alerts can be generated by tracking the secondary model’s accuracy over time.
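The sketch below illustrates the idea with a deliberately simple nearest-centroid classifier written in plain Python, standing in for whatever secondary model you would actually use. It assumes equally sized training and production samples, so chance accuracy is about 0.5. The function names and synthetic data are hypothetical.

```python
import random
from typing import List, Sequence

Row = Sequence[float]


def centroid(rows: List[Row]) -> List[float]:
    """Column-wise mean of a list of feature rows."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def squared_distance(a: Row, b: Row) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def drift_classifier_accuracy(
    train_rows: List[Row], prod_rows: List[Row], seed: int = 0
) -> float:
    """Holdout accuracy of a classifier separating training from production rows."""
    rng = random.Random(seed)
    # Label 0 = training sample, label 1 = production sample.
    labeled = [(row, 0) for row in train_rows] + [(row, 1) for row in prod_rows]
    rng.shuffle(labeled)
    half = len(labeled) // 2
    fit, holdout = labeled[:half], labeled[half:]

    # "Training" the nearest-centroid classifier: one centroid per label.
    train_center = centroid([row for row, label in fit if label == 0])
    prod_center = centroid([row for row, label in fit if label == 1])

    correct = 0
    for row, label in holdout:
        closer_to_prod = squared_distance(row, prod_center) < squared_distance(
            row, train_center
        )
        correct += int(int(closer_to_prod) == label)
    return correct / len(holdout)


data_rng = random.Random(42)


def sample(n: int, mean: float) -> List[List[float]]:
    """Synthetic two-feature rows; only the first feature's mean varies."""
    return [[data_rng.gauss(mean, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(n)]


training = sample(1000, 0.0)
no_drift_accuracy = drift_classifier_accuracy(training, sample(1000, 0.0))
drift_accuracy = drift_classifier_accuracy(training, sample(1000, 3.0))
```

With no drift the classifier cannot do much better than guessing, while a shifted feature makes the two samples easy to tell apart. Tracking this accuracy per batch gives a single number to alert on.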
Ground-truth evaluation requires predictions to be labeled, which introduces some lag time for this method. In ML systems with a human in the loop, ground-truth labels might be generated naturally and on a relatively short timescale. For instance, a text-prediction (autocomplete) model in a word processor receives labeled ground truth immediately based on whether the user accepts the suggestion. In these cases, ground-truth labeling can provide feedback in a very timely manner.
In many cases, however, predictions are used on the fly without any natural mechanism for labeling. To monitor these systems, some proportion of predictions should be manually labeled for evaluation. This could be done by experts within your organization, or by external labelers at market rates. Third-party labeling services – including software as a service and labelers themselves – are becoming increasingly available; two notable examples are Amazon SageMaker Ground Truth and Labelbox.
Assuming labels are generated in some way, the evaluation itself is straightforward. The metrics relevant to your model, such as classification accuracy or regression error, can simply be calculated and tracked over time. If these metrics start to degrade, drift has likely impacted your model.
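As a small sketch of tracking a metric over time, the code below computes classification accuracy per time window and flags windows that fall below a baseline. The function names, window labels, baseline, and tolerance are hypothetical values chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accuracy_by_window(
    labeled: Iterable[Tuple[str, int, int]]
) -> Dict[str, float]:
    """Accuracy per time window from (window, predicted, actual) triples."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for window, predicted, actual in labeled:
        totals[window] += 1
        hits[window] += int(predicted == actual)
    return {window: hits[window] / totals[window] for window in sorted(totals)}


def degraded_windows(
    accuracy: Dict[str, float], baseline: float, tolerance: float
) -> List[str]:
    """Windows whose accuracy fell more than `tolerance` below the baseline."""
    return [w for w, acc in accuracy.items() if acc < baseline - tolerance]


# Made-up labeled predictions: (week, predicted label, ground-truth label).
labeled_predictions = [
    ("2021-W01", 1, 1), ("2021-W01", 0, 0), ("2021-W01", 1, 1), ("2021-W01", 0, 1),
    ("2021-W02", 1, 0), ("2021-W02", 0, 1), ("2021-W02", 1, 1), ("2021-W02", 0, 1),
]

weekly_accuracy = accuracy_by_window(labeled_predictions)
# Baseline taken from offline evaluation; both numbers are assumptions.
alerts = degraded_windows(weekly_accuracy, baseline=0.8, tolerance=0.1)
```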
ML predictions are only useful if they are accurate and readily available. Many data science and analytics groups are still in the early stages of justifying their existence, and it could be reputationally disastrous to deploy a model that produces erroneous output or fails to produce output at all. As with any software project, it is vital to take reliability into account from the beginning; if you don’t build a reliable platform from the start, you will create technical debt that can be too cumbersome to eliminate later.
Using the right infrastructure is as much about reliability as it is about optimizing costs. You don’t want to overbuild and end up with an infrastructure budget that exceeds the business value provided by your ML applications. In that sense, the key is technologies that allow your infrastructure to scale as necessary.
In many cases, the simplest architecture is a serverless one. Serverless deployments allow you to focus on writing source code that actually delivers business value, leaving the platform to handle the complicated hardware and scaling questions. If you’re not familiar with serverless technologies, the general idea is to enable developers to create microservices by only writing source code as a series of functions or simple containers. This small amount of source code is then deployed to the serverless platform to automatically spin up and scale servers behind the scenes using a load balancer. AWS Lambda and Heroku are popular platforms for serverless architectures.
While serverless architectures offer a quick means for developers to get code to production, they don’t always work well for ML deployments. The most common issue is that ML applications rely on large artifacts such as data or serialized models. Since serverless applications run on non-dedicated infrastructure, those artifacts will need to be downloaded each time the platform first executes your application on a new server. Those artifacts can be stored in temporary storage for subsequent requests on the same machine, which can partially mitigate this problem. But even if you download and store the artifacts, requests on new servers will be “cold starts” that likely have higher latency than subsequent requests.
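One common way to soften this cost is to load the artifact once per server process and reuse it for later requests. The sketch below keeps the loaded model in process memory rather than temporary disk storage, and uses a fake loader in place of a real download. The `handler` signature, event fields, and model identifier are hypothetical and not tied to any specific serverless platform.

```python
from typing import Any, Callable, Dict, List

# Lives for as long as the platform keeps this server process warm.
_MODEL_CACHE: Dict[str, Any] = {}

# Records each simulated download so the caching effect is visible.
download_calls: List[str] = []


def fake_download(model_id: str) -> Callable[[List[float]], float]:
    """Stand-in for downloading and deserializing a large model artifact."""
    download_calls.append(model_id)
    return lambda features: sum(features)


def get_model(model_id: str, loader: Callable[[str], Any]) -> Any:
    """Load the model on the first (cold) request and reuse it afterwards."""
    if model_id not in _MODEL_CACHE:
        _MODEL_CACHE[model_id] = loader(model_id)  # slow path: cold start
    return _MODEL_CACHE[model_id]


def handler(event: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical function entry point, invoked once per request."""
    model = get_model(event["model_id"], fake_download)
    return {"prediction": model(event["features"])}


first = handler({"model_id": "demo:1", "features": [1.0, 2.0]})   # cold
second = handler({"model_id": "demo:1", "features": [3.0, 4.0]})  # warm
```

Only the first request on a given server pays the loading cost. The cold start itself does not go away, which is why latency-sensitive services may still prefer dedicated or container-based infrastructure.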
Serverless platforms can also be expensive compared to running on dedicated infrastructure. If your application has a very constant and consistent level of activity, it is generally much more cost-effective to run on dedicated servers of an appropriate size.
PRO TIP: Use serverless architectures if you:
Assuming your team has the expertise to work with them, container platforms offer a great alternative to serverless architectures and are ideally suited to the vast majority of ML applications. Container platforms can be set up on-premises or in the cloud. Kubernetes, Amazon ECS, and Red Hat OpenShift are common container platforms.
Containers offer a good developer experience and make it easy to set up CI/CD pipelines with appropriate testing. When it comes to deployment, container platforms offer more flexibility with regard to server placement and how they are spun up and shut down. There are also better features for managing storage volumes, which can mitigate issues commonly found on serverless platforms such as “cold starts.”
As opposed to serverless architectures, your teams will have a bit more to manage when using container platforms. You’ll have to think more carefully about how traffic is routed and balanced, and how containers are scaled up and down. But compared with deploying on bare metal or virtual machines, the container platforms will provide consistency across different applications to reduce the burden on your operations teams.
In the case of a real-time prediction service, you’ll want to make sure it’s available when it needs to be. But not all applications require 100% uptime. For instance, enterprise applications only have to be available on the internal network. Or, if your application is only supporting a small team, it might be OK for that application to go down once in a while. In other cases, it will be vital that your application is up and running constantly.
If constant availability is a requirement, make sure your application is deployed across multiple geographically distributed data centers. For serverless applications, this might be trivial, since most platforms automatically deploy applications to manage high availability. Cloud container platforms, such as Amazon ECS, have features that allow you to place containers in distributed availability zones. On-premises deployments will likely depend highly on the unique environment, but teams should make sure to evaluate how availability is likely to affect their solution.
We probably don’t need to tell you that all modern software projects should include unit tests. Data science and ML projects are no exception. Data scientists should be writing unit tests for any custom code used to train models. Engineers should be writing unit tests for all application code. When builds and model-training jobs are automated, the unit tests should be executed as a build step. Failing tests should raise alerts and terminate the job to make sure erroneous source code and models don’t make it out into the wild. Unit test coverage should also be measured when tests are executed, and coverage should regularly be reported to and tracked by the team as a key performance metric.
In addition to unit tests, integration tests should be developed to ensure your application is appropriately communicating across its component services. And when bugs are reported and resolved, regression tests should be written to make sure those bugs never creep back in. Just like unit tests, execution of integration tests and regression tests should be automated.
The dependence of ML pipelines on data introduces another layer of complexity. Training data should be tested to validate assumptions and make sure there are no emergent data quality issues. Write custom scripts or applications to assess the quality of your data, with specific checks that are relevant to your business case. When data-quality issues are discovered, incorporate them into the suite of tests to make sure the data never causes issues again. There are good open-source libraries to help create these checks, such as Great Expectations and Deequ.
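A custom check script can start out very simple. The sketch below validates a list of records against a hypothetical schema (`customer_id`, `age`, `monthly_spend`) with made-up rules. Libraries such as Great Expectations and Deequ provide much richer versions of the same idea, and this example does not use their APIs.

```python
from typing import Any, Dict, List


def check_training_data(rows: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable data-quality failures (an empty list means clean)."""
    failures: List[str] = []
    required = {"customer_id", "age", "monthly_spend"}  # hypothetical schema
    seen_ids = set()
    for i, row in enumerate(rows):
        missing = required - set(row)
        if missing:
            failures.append(f"row {i}: missing columns {sorted(missing)}")
            continue  # remaining checks need every column
        if row["customer_id"] in seen_ids:
            failures.append(f"row {i}: duplicate customer_id {row['customer_id']}")
        seen_ids.add(row["customer_id"])
        if row["age"] is None or not 0 <= row["age"] <= 120:
            failures.append(f"row {i}: age out of range: {row['age']}")
        if row["monthly_spend"] is None or row["monthly_spend"] < 0:
            failures.append(f"row {i}: negative or missing monthly_spend")
    return failures


clean_rows = [
    {"customer_id": 1, "age": 34, "monthly_spend": 20.0},
    {"customer_id": 2, "age": 51, "monthly_spend": 0.0},
]
dirty_rows = [
    {"customer_id": 1, "age": 34, "monthly_spend": 20.0},
    {"customer_id": 1, "age": 150, "monthly_spend": 5.0},  # duplicate ID, bad age
    {"customer_id": 2, "age": 40},                         # missing column
]

clean_failures = check_training_data(clean_rows)
dirty_failures = check_training_data(dirty_rows)
```

A training job can run a check like this as an early build step and stop when the list of failures is not empty. Each newly discovered data issue becomes one more rule in the function.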
Evaluation of models should be an ongoing process – it doesn’t end when the data scientists sign off for deployment. The same metrics used to evaluate models during research and development should be measured regularly on new data in the production environment. Evaluation of newly trained models commonly uses cross-validation or some other form of measurement on a holdout sample. Naturally, any new data used to generate predictions forms a holdout sample by default, because the model never saw it during training. New data serves as a rich resource that can be used to learn more about your model and application. Application developers should make sure to measure the same metrics used by the data scientists and communicate with them regularly to ensure that the model is performing as expected. These metrics should also be incorporated into CI/CD pipelines.
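The idea of measuring the same metric in production can be sketched as follows, assuming a classification model evaluated by accuracy. The function names, the choice of metric, and the tolerance of 0.05 are illustrative assumptions, not recommendations; use whichever metric your data scientists used during development.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def has_degraded(baseline_score, production_score, tolerance=0.05):
    """True if the production score is more than `tolerance` below the baseline."""
    return production_score < baseline_score - tolerance


# Example: labels collected in production versus what the model predicted.
production_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
needs_attention = has_degraded(baseline_score=0.90,
                               production_score=production_accuracy)
```

A scheduled job or pipeline step could run a check like this on each new batch of labeled data and raise an alert when has_degraded returns True.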
When models are updated or retrained, it may be unclear whether the new model performs better than the old one. In this case, you may want to consider deploying both models and evaluating them “in the wild.” How you evaluate the models in production will depend on how your application generates new ground-truth labels – if this requires a long cycle, you may have to compare the models’ predicted labels without ground truth. Regardless, here are a few deployment strategies you might want to consider:
Security and privacy represent a very broad but important domain for any software application.
ML applications will very often involve personally identifiable information (PII) or, in health contexts, protected health information (PHI). In addition to making sure your data and environments are well protected, there are specific considerations you should make for your deployed model.
First, you should make sure to consider the risks of your model behaving badly. What would happen if your model produced the most erratic output you could imagine? What would be the impact on consumers of such predictions? What are the financial, reputational, security, or safety risks that could occur as a result? Depending on the severity of risks, you may want to implement extra guardrails against erroneous output.
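One simple form of guardrail is to validate each prediction before it reaches consumers and substitute a safe default when the output is implausible. The sketch below assumes a model that returns a single number; the function name guarded_prediction, the bounds, and the fallback value are hypothetical and would need to be chosen for your own application.

```python
import math


def guarded_prediction(raw_prediction, lower, upper, fallback):
    """Return raw_prediction if it is a finite number in [lower, upper].

    Otherwise return `fallback`, a safe default chosen by the application.
    """
    # Reject non-numeric output (bool is excluded even though it subclasses int).
    if isinstance(raw_prediction, bool) or not isinstance(raw_prediction, (int, float)):
        return fallback
    # Reject NaN and infinities.
    if math.isnan(raw_prediction) or math.isinf(raw_prediction):
        return fallback
    # Reject values outside the range the business considers plausible.
    if raw_prediction < lower or raw_prediction > upper:
        return fallback
    return raw_prediction
```

Depending on the severity of the risk, an application might also log or alert on every fallback so that erratic behavior is noticed quickly.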
Adversarial attacks are another important situation to guard against. Research has shown that deep neural networks, for example, are prone to attacks where images can be made to trick classifiers when altered in imperceptible ways. In other words, two images that look the same to the human eye can lead to dramatically different results. Exposure to this sort of attack is enhanced if attackers have information about your model architecture or training datasets. If this risk is relevant to your application, architectural details about your model should be hidden from external parties (such as consumers of model predictions) as much as possible.
It may also be possible to reverse engineer ML models and datasets by generating predictions. By submitting requests with random input and receiving predictions, users can build a dataset that serves as a proxy for the data originally used to train the model. This can be very dangerous. A proxy dataset generated in this way could be used to extract information about your other users or your business logic. It could also be used to engineer the sort of adversarial attacks described above. A simple way to mitigate this risk is to implement rate limits that prevent an attacker from submitting enough requests to generate a large dataset.
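A rate limit of this kind can be sketched as a sliding-window counter kept per client. The class name, the in-memory storage, and the injectable clock are assumptions made to keep the example small; a production service would more likely rely on its API gateway or a shared store so that limits hold across multiple instances.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._history = defaultdict(deque)  # client_id -> request timestamps

    def allow(self, client_id):
        """Return True and record the request if the client is under its limit."""
        now = self.clock()
        history = self._history[client_id]
        # Drop timestamps that have fallen out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_requests:
            return False
        history.append(now)
        return True
```

A prediction endpoint would call allow() with an identifier for the caller, such as an API key, before running the model, and reject the request when it returns False.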
Deploying ML models and implementing MLOps pipelines can be a challenging endeavor. In the modern data-driven world, however, avoiding these challenges is out of the question. The reward for developing MLOps expertise is more than just the business value created by your current ML project – it will pay off in multiples as your organization develops more and more models.
If the full details of this guide are a bit too much for your organization to take on at the moment, make sure to check out our Beginner’s Guide to Deploying Machine Learning Models.