Machine Learning Model Deployment Patterns in a nutshell

An introduction into generic design patterns to solve common problems when deploying trained ML models in production.


Ideally you should think about ML model deployment patterns before you deploy machine learning models into a production data pipeline. You can refactor design decisions afterwards, but with respect to some patterns it can be very hard to fix a sub-optimal design later. I won't dive deep into the reasons here, but as a short hint: in many cases it relates to model life cycle management.

This term consists of two parts:

Strictly speaking this post should have the title “The Machine Learning Model Deployment Pattern Language in a nutshell”. However, I thought that title would not have been very SEO-friendly :)

The problems with respect to model deployment can be classified into the following problem domains, which are separated from each other and whose patterns complement each other.

  • model deployment unit: How is the software component containing the model, its runtime environment, its pre-processing, its post-processing and the runtime environment for pre-/post-processing deployed into the overall system? Very often this is an immutable Linux container.
  • model runtime environment: How are the model, its pre-processing and its post-processing run? E.g. the Linux container contains a Python interpreter, which is the fundamental runtime for the pre-processing, the inference (execution of the deserialized model) and the post-processing. In addition, however, you need either (A) the Python implementation of the runtime for an interoperability standard like ONNX or (B) the Python machine learning framework used for model training, e.g. scikit-learn.
  • model packaging: How do I package the model, its pre-processing, its post-processing and potentially model metadata into the model deployment unit? E.g. when using Python and an ONNX model, the ONNX model file and the Python files for pre-/post-processing could be packaged as part of the Linux container for immutable model packaging. In contrast, mutable model deployment would be possible by not packaging the model at all, but instead transmitting the file contents, e.g. as binary data via HTTP (model interface), saving the ONNX model to a temporary file, and loading the model as well as the pre-/post-processing dynamically prior to execution.
  • model interface: How do I get input data to the pre-processing, output data from the post-processing and the configuration of the communication interfaces into the model deployment unit? E.g. in the case of flat data which is JSON-serializable, the input can be transmitted to the Linux container (model deployment unit) via MQTT messages on an input topic, the prediction via MQTT messages on an output topic and the configuration for the MQTT connection establishment via a RESTful API. The model interface is not always required, e.g. when deploying a model as part of an Android app, iPhone app or native desktop app (Windows, Linux, macOS).
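To make the interplay of these domains concrete, here is a minimal sketch of a deployment unit's internals. All names are illustrative assumptions; a trivial linear function stands in for a deserialized ONNX or scikit-learn model, and `handle_message` plays the role of the model interface handler (e.g. the callback for an MQTT input topic).

```python
import json

def preprocess(raw: dict) -> list:
    # Scale the flat, JSON-serializable input into model features
    # (hypothetical field names and scaling).
    return [float(raw["temperature"]) / 100.0, float(raw["pressure"]) / 10.0]

def infer(features: list) -> float:
    # Stand-in for executing the deserialized model
    # (e.g. an ONNX Runtime session or a scikit-learn estimator).
    weights = [0.7, 0.3]
    return sum(w * x for w, x in zip(weights, features))

def postprocess(score: float) -> dict:
    # Map the raw model output back to a JSON-serializable prediction.
    return {"anomaly": score > 0.5, "score": round(score, 3)}

def handle_message(payload: str) -> str:
    # Model interface: e.g. the body of a message on the input topic comes in,
    # the prediction is returned to be published on the output topic.
    raw = json.loads(payload)
    return json.dumps(postprocess(infer(preprocess(raw))))
```

In a real deployment unit the three stage functions and the model file would be what gets packaged (model packaging), the Python interpreter plus inference library would be the model runtime environment, and the container wrapping all of it would be the model deployment unit.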

It’s important to know about other problem contexts which cannot be put into a single class of patterns, because they are cross-cutting concerns and cannot be solved by simply picking a single pattern from a set of suitable patterns. We’ll see different design alternatives with respect to mutability. In the case of inference runtime, however, a lot of things not related to deployment are essential in addition (programming language, programming paradigm, data pipeline processing design, architecture of potential hardware-based inference acceleration, etc.). An in-depth dive into patterns for inference runtime optimization is beyond the scope of this post.

  • Mutability: immutable vs. mutable model deployment
  • Pre-processing, Inference and post-processing execution time: hard vs. firm vs. soft real-time


Mutability: immutable vs. mutable model deployment

This topic relates to the model runtime environment, model packaging and configuration problem domains. An immutable model deployment is a deployment whose model (state) cannot be modified after it has been deployed (compare: immutable object). The model of a mutable model deployment can be changed after deployment. Parameters are learned (“hard-coded”) into a trained model, whereas hyperparameters are chosen and tuned across training runs. This means the “parameter”-trained model itself is immutable, whereas the overall model artifact consisting of the trained parameters and its hyperparameters is mutable. However, because the hyperparameters are fixed once the learning phase is finished, we’ll consider a trained model an immutable artifact from now on.

Nowadays a single model is very often deployed in a single Linux container without the possibility to change the model after deployment, i.e. the deployment unit is immutable. However, in production you probably want to be able to switch easily between different, alternative models which are part of a single deployment unit. In the latter case the model in the deployment unit, seen as a black box, is not immutable anymore; it’s mutable.
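One way to sketch such a mutable deployment is a small in-process registry of interchangeable models, where the active model can be swapped at runtime, e.g. triggered via the model interface. This is a minimal illustration under assumed names, with plain callables standing in for deserialized models; a real implementation would load model artifacts from files or received bytes.

```python
import threading

class ModelRegistry:
    """Holds alternative models inside one deployment unit; the active
    model can be swapped after deployment (mutable model deployment)."""

    def __init__(self):
        self._models = {}
        self._active = None
        # Swap requests may race with in-flight inference calls.
        self._lock = threading.Lock()

    def register(self, name, model):
        with self._lock:
            self._models[name] = model

    def activate(self, name):
        with self._lock:
            if name not in self._models:
                raise KeyError(f"unknown model: {name}")
            self._active = name

    def predict(self, features):
        with self._lock:
            model = self._models[self._active]
        return model(features)

registry = ModelRegistry()
registry.register("v1", lambda x: x * 2)  # stand-ins for deserialized models
registry.register("v2", lambda x: x * 3)
registry.activate("v1")
```

Seen from outside, the deployment unit is still one black box; only the `activate` call changes which model answers, which is exactly what makes the deployment mutable.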

Pre-processing, Inference and Post-processing execution time

The term “real-time” is heavily misused and misunderstood, because it makes little sense to use it without naming the type (hard, firm, soft) in addition. Usually people mean soft real-time behaviour. Different use cases require different types of real-time behaviour. Let’s clarify the differences:

In a system with hard real-time requirements, “missing a deadline is a total system failure.” An example is a model used in autonomous driving which shall classify persons in front of a car. If the inference result is delivered too late, the car will probably not be able to dodge or brake before hitting the person.

In a system with firm real-time requirements, “infrequent deadline misses are tolerable, but may degrade the system’s quality of service. The usefulness of a result is zero after its deadline.” Let’s take another example from the automotive domain: a camera system for detecting speed cameras and automatically reducing the speed when one is detected. It would be annoying to pay a fine from time to time because the inference result did not slow down the car fast enough. However, it’s just about paying a bill, not about potentially killing someone. The majority of industrial applications falls into this category as well.

In a system with soft real-time requirements, “the usefulness of a result degrades after its deadline, thereby degrading the system’s quality of service.” This is the usual requirement for model deployments in customer-facing products and services like “whatever you can think of”-detection apps for smartphones and the like.

Especially when hard or firm real-time behaviour is required, one usually wants precise and predictable execution times for the pre-processing, the inference and the post-processing. Without proper low-level design, pre-processing can easily end up with bad “real-time” behaviour, i.e. poorly predictable execution time for varying input within the bounds of the valid input range. The post-processing usually has a lower share of the overall model execution time, which consists of pre-processing, inference and post-processing. The predictability of the inference execution time depends heavily on the type of algorithm used to generate the model. Of course, e.g. a “real-time”-capable operating system task scheduler will usually result in more predictable, though usually not shorter, pre-/post-processing and inference execution times.
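The firm real-time semantics described above can be sketched as a deadline check around the three stages: a result delivered after the deadline has zero usefulness, so the pipeline discards it. The 50 ms budget and the stage functions are illustrative assumptions, and a real hard/firm real-time system would of course rely on the scheduler, not on an after-the-fact check.

```python
import time

DEADLINE_S = 0.050  # assumed 50 ms budget for all three stages combined

def run_with_deadline(preprocess, infer, postprocess, raw, deadline_s=DEADLINE_S):
    """Run pre-processing, inference and post-processing; under firm
    real-time semantics, discard the result if the deadline was missed."""
    start = time.monotonic()  # monotonic clock for elapsed-time measurement
    result = postprocess(infer(preprocess(raw)))
    elapsed = time.monotonic() - start
    if elapsed > deadline_s:
        # Firm real-time: the usefulness of a late result is zero.
        return None, elapsed
    return result, elapsed
```

Under soft real-time semantics one would instead return the late result together with `elapsed` and let the caller decide how much its usefulness has degraded.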

As you can see, these cross-cutting concerns can be extremely important in the overall system architecture. Designing a model deployment which is mutable and has challenging inference execution time requirements is the really interesting part.

I’ll edit this post regularly. Next time I’ll add the current table of contents for generic patterns for Linux-container-based model deployments. Stay tuned!

Happy designing :)

