
The path to putting your ML model in production

Fernando CamargoNovember 24, 2020

Suppose you are a Data Scientist or Machine Learning Engineer (or whatever your company calls this role). You took your time to analyze your dataset, clean it, and prepare it to train your model. You then prepared many model candidates using the most recent techniques and took your time to fine-tune them. After all this extensive work, you finally created a model to be proud of. Your job is done. Well, unfortunately, no. If your model never goes live and is never actively used, delivering value to the client, you wasted your time creating it in the first place.

What do you need to do next, then? This post aims to give you some guidance on what the next steps are. It's not your responsibility, especially as a Data Scientist, to do all of those steps. But it is useful to see the whole picture. Also, depending on your company's size, you might need to do it all by yourself. So, let's start.

Step 1: Package your model

After you finish training, you can think of your model as a piece of software: a function that receives an input and produces an output. And to be useful, another piece of software will call it and use its return value elsewhere. To make that possible, you need to package your model so that Software Engineers can use it. And for that, you have some options, depending on your use case.

Online vs. Embedded

The first decision is whether your model will run on your company's servers or be embedded into a device. It depends a lot on your use case. For autonomous vehicles, for example, you can't afford even a few milliseconds of network latency when making a decision, or you risk crashing the car. In other cases, you might not even have an internet connection available. In those cases, you must embed your model into the device and run it offline. Otherwise, you'll probably prefer having your model online.

Even though having your model online might add some network latency, you might still prefer it for two reasons. The first is security: when you embed your model into a device, it is much easier for someone to access and steal it. The second is that it is much easier to update and monitor a model running on your own servers (I will speak more about this later), which makes evolving and fixing your model much faster.

If you need to embed your model, you will probably need to convert it to a low-level language (like C) and use some tools to embed it. Otherwise, you will commonly expose it through a REST (or gRPC) API or package it as a library, depending on whether it will do Online or Batch Inference.

Online Inference vs. Batch Inference

If your model runs on your servers, you need to decide whether it will run per-request or in batch. Machine Learning models usually benefit, performance-wise, from running on batches of data. So, if you can run yours that way, you should. For some use cases, it is appropriate to run the inference on a schedule (once a day, for example) and save the results for later. But in most use cases, especially user-facing applications, you need to run it per-request.
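To make the batch option concrete, here is a minimal sketch of a scheduled batch-inference job. All the names (`load_new_rows`, `save_predictions`, the data-warehouse query they stand in for) are illustrative placeholders, not part of any specific tool:

```python
# Sketch of a scheduled batch-inference job: load a day's worth of data,
# score it in a single model call, and persist the results for later use.
from datetime import date

def load_new_rows(day):
    # Stand-in for a query against your data warehouse.
    return [[1.0, 2.0], [3.0, 4.0]]

def save_predictions(day, preds):
    # Stand-in for writing results back to a table or cache.
    print(f"{day}: saved {len(preds)} predictions")

def run_batch_job(model, day=None):
    day = day or date.today().isoformat()
    rows = load_new_rows(day)
    preds = model.predict(rows)  # one call over the whole batch
    save_predictions(day, preds)
    return preds
```

A scheduler (cron, Airflow, etc.) would then invoke `run_batch_job` once a day.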

Packaging for Online Inference

When packaging for Online Inference, you will usually expose a REST API through which other software can request an inference. The reason is that it makes it much easier to scale the resources for your model: if you suddenly receive many requests, you can spin up more model instances and load-balance between them.

You could create such a REST API by hand, using something like Flask, for example. But there are a lot of tools available to make your life easier; after all, not all Data Scientists know how to develop a REST API properly. Tensorflow Serving, for example, exposes a Tensorflow model. One option that I like is BentoML. It lets you package your model (with its preprocessing and postprocessing code) into a single bundle without worrying about the REST aspects. It already supports most ML frameworks and comes with micro-batching, a feature that automatically groups bursts of requests into micro-batches to improve the model's throughput.
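To see what the hand-rolled Flask version looks like, here is a minimal sketch. The `/predict` route, the payload shape, and the `DummyModel` stand-in are all illustrative assumptions; in practice you would load your real trained model instead:

```python
# Minimal Flask wrapper around a model (route and payload are illustrative).
from flask import Flask, jsonify, request

app = Flask(__name__)

def load_model():
    # Stand-in for e.g. joblib.load("model.pkl"); returns a toy scorer here.
    class DummyModel:
        def predict(self, rows):
            return [sum(row) for row in rows]  # toy "prediction"
    return DummyModel()

model = load_model()

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # expects {"instances": [[...], ...]}
    preds = model.predict(payload["instances"])
    return jsonify({"predictions": preds})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Even this toy version hints at what the serving frameworks handle for you: input validation, batching, versioning, and metrics are all still missing here.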

Nowadays, you will probably also need to package this API into a container (possibly Docker). It makes your code much more reproducible and easier to put into production in a cluster. There are also tools to build the Docker container for you if you don't know how to do it yourself; BentoML, for example, already provides this. Once you have your Docker container, you can smoothly run it locally or in a cluster.

It is usually the DevOps (or MLOps) team's job to prepare the cluster that runs the software and your models. But if you work for a small company, I recommend going with Kubernetes. Its usage keeps rising, and it is currently the de facto solution. To make it easier to integrate your model into a Kubernetes cluster, there are also some tools, such as KFServing and Seldon Core.

Packaging for Batch Inference

When packaging your model for Batch Inference, another piece of software will probably invoke your model directly. In this case, it is easier to package your model as a library. This way, the Software Engineers can install it and run it inside their software.

Step 2: Prepare your model monitoring

It is common knowledge that model performance decays over time. You want to notice it before the client starts complaining, so you need to monitor your model preemptively. There are many useful metrics you should be watching. Some are performance-related and already common for software in general, like CPU and RAM usage, request latency, and so on. There are also business metrics you need to monitor, like CTR (Click-Through Rate). And lastly, there are the metrics that help you understand your model's behavior. For that, you should also monitor the inputs and outputs of the model: if a large deviation in their distributions is detected, you know that something is wrong.
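As a toy illustration of that last kind of monitoring, here is a drift check that flags live inputs whose mean strays too far from the training distribution. The z-score approach and the threshold of 3 are illustrative choices, not a recommendation; dedicated tools use far more robust statistics:

```python
# Toy drift check: compare the mean of live inputs against training statistics.
import statistics

def drift_alert(training_values, live_values, threshold=3.0):
    """Return True if the live mean deviates too far from the training mean."""
    mu = statistics.mean(training_values)
    sigma = statistics.stdev(training_values)
    live_mu = statistics.mean(live_values)
    z = abs(live_mu - mu) / (sigma or 1.0)  # guard against zero variance
    return z > threshold
```

In practice you would run such checks per feature and per output, over sliding windows of recent requests.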

A widely used tool that you can use is Prometheus. It is very well known and has been used for standard software for years. Even though they didn't design it with ML in mind, it works very well for this use case. The metrics are saved in a Time Series database and can be used to create dashboards and alerts.

You can also use another model for Drift and Outlier Detection. Alibi-Detect, for example, provides many implementations for different data types.

Step 3: Prepare to replace your model

Yes, you've read that correctly. Your model is in production, serving many clients and working very well, but your job is still not done. This model is now your responsibility, and it will not work well forever. As said in Step 2, model performance decays over time.

Depending on your use case, this decay might be swift. In such a case, your model needs to be retrained with new data frequently. Two examples are Recommender Systems and Demand Forecasters; both need recent data to work well. In those cases, you need to automate the retraining process and the model rollout. To achieve that, you need automated data pipelines (which is commonly a Data Engineer's job) and a reproducible training setup that you can easily invoke. You can then have a scheduled task that trains your model with new data (using the same hyperparameters) and repackages it.
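The shape of such a scheduled retraining task can be sketched as below. Every function here (`fetch_training_data`, `train_model`, `package_model`) is a placeholder for your own pipeline, trainer, and packaging step; the "model" is a trivial multiply-by-w function just to keep the sketch runnable:

```python
# Sketch of an automated retraining task (all names are placeholders).
def fetch_training_data(since):
    # Stand-in for your automated data pipeline.
    return [([1.0], 2.0), ([2.0], 4.0)]

def train_model(data, hyperparams):
    # Stand-in for refitting with the same, already-tuned hyperparameters.
    w = hyperparams["w"]
    return lambda xs: [x * w for x in xs]

def package_model(model, version):
    # Stand-in for e.g. repackaging with BentoML and pushing to a registry.
    return {"version": version, "model": model}

def retrain_job(version, hyperparams=None):
    hyperparams = hyperparams or {"w": 2.0}
    data = fetch_training_data(since="last_run")
    model = train_model(data, hyperparams)
    return package_model(model, version)
```

A scheduler would run `retrain_job` periodically, and the rollout techniques below decide whether the new bundle actually replaces the live one.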

After some time, your model performance might decay even with constant retraining. Or maybe you simply had another idea for a new model that would do a better job than the current one. It is worth keeping in mind that good offline performance doesn't always translate into good production performance. So, even if the new model is better in offline evaluation, you can't simply replace the old one.

To make sure that your new model is better, you need to evaluate it online. A widespread technique is AB testing: you redirect a small portion of the requests to the new model (B) and start collecting metrics. After enough data is collected, you run a statistical test to decide whether the B model is better than the A (old) one. One downside of this technique is that if your B model is terrible, it will hurt your business metrics for the portion of the requests it receives. And since you need enough data to decide, you will expose more clients to such a model.
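For a metric like CTR, the statistical test can be a two-proportion z-test. A minimal sketch (the 1.96 cutoff corresponds to a 5% two-sided significance level; a real setup would also plan the sample size in advance):

```python
# Two-proportion z-test: is model B's click-through rate significantly higher?
import math

def ab_test(clicks_a, views_a, clicks_b, views_b, z_crit=1.96):
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    return z, z > z_crit  # (statistic, "B significantly better?")
```

For example, 100 clicks in 1,000 views for A versus 150 in 1,000 for B yields z ≈ 3.4, comfortably above the cutoff.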

Another technique is called MAB (Multi-Armed Bandit). It is a reinforcement learning technique that balances exploration and exploitation. It selects a model to serve each request while learning how good each model is from the feedback it receives. It explores at first but converges with far fewer interactions than an AB test would require. The catch is that you need some feedback for each request, which might not always be available.
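The simplest bandit strategy, epsilon-greedy, already shows the explore/exploit trade-off: with probability epsilon it tries a random model, otherwise it serves the one with the best average reward so far. This is one of several MAB algorithms (Thompson sampling and UCB are common alternatives), shown here only as a sketch:

```python
# Epsilon-greedy bandit over model variants (each "arm" is one model).
import random

class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))  # explore
        # exploit: arm with the best observed average reward
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        # Incremental mean: fold the new reward into the arm's running average.
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n
```

Each request calls `select()` to pick a model, and `update()` once the feedback (click, purchase, etc.) arrives.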

If you can collect the required business metrics without exposing your clients to the new model, you have another option: Shadow Deployment. The old model receives all the requests, while the new model gets copies of them. The user only sees the old model's responses; the new model's responses are used solely to collect metrics. This way, you can run an AB test without hurting the business metrics.
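The routing logic for a shadow deployment fits in a few lines; the key property is that the shadow model can never affect (or break) the user's response. A sketch, with an in-memory log standing in for a real metrics store:

```python
# Shadow deployment: the user always gets the old model's answer,
# while the new model's predictions are only logged for comparison.
shadow_log = []

def handle_request(features, old_model, new_model):
    live_answer = old_model(features)
    try:
        shadow_answer = new_model(features)  # must never reach the user
        shadow_log.append((features, live_answer, shadow_answer))
    except Exception:
        pass  # a broken shadow model must not break production
    return live_answer
```

In a real system the shadow call would also run asynchronously, so it cannot add latency to the live path either.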


Once you have trained your model, there are plenty of steps to take. And even once you finish all of them and everything is automated, the model is still your responsibility. You should be ready to take action at all times. And the more tools you have built for yourself to deal with problems, the better.
