ONNX in a nutshell
In the majority of use cases ONNX will be the machine learning interoperability standard of choice for you. It is still evolving, but there is already broad support for training frameworks, algorithms and inference hardware acceleration.
Why should I care about ONNX?
When you are working with artificial intelligence you’ll learn that there are a lot of different frameworks to train models, runtimes to execute models, potentially compilers to improve the runtime of inferences, and other tooling. When it comes to inference runtime optimization (including optimization of potentially very costly pre-processing), the hardware architectures the models may be deployed onto can make a significant difference.
Often there is a need for interoperability between these different tools. E.g. when training a model with algorithm A in framework X, the trained model may have better prediction accuracy, lower execution runtime and better other “quality attributes” than when training a model with the same algorithm A but a different framework Y. For a different algorithm the situation could be the other way around. The reason for this is that the low-level implementations of the algorithms differ, or that the frameworks use slightly different algorithms which carry the same name. In addition, the development experience with framework Y could be way better than the one provided by framework X.
However, w.r.t. inference runtime deployment you have two choices:
- either you deploy the inference runtimes for all the frameworks you want to use right now and foresee the need to extend your overall runtime with additional frameworks, or
- use an interoperability standard like ONNX to serialize the trained model in a way that it may be executed by any runtime implementing that standard (see the sketch right after this list).
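As a minimal sketch of the second option, here is how a trained model could be serialized to the ONNX format with PyTorch (one of the frameworks covered below); the model, file name and input shape are illustrative placeholders, not a prescription:

```python
# Hedged sketch: export a trained PyTorch model to the ONNX format so that
# any ONNX-compatible runtime can execute it. Model and shapes are placeholders.
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)  # example input used for tracing

torch.onnx.export(
    model,
    dummy_input,
    "resnet18.onnx",        # framework-independent serialized model
    input_names=["input"],
    output_names=["output"],
)
```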
Among the companies backing ONNX you’ll recognize Cadence and NVIDIA, which are big players in the industrial/embedded domain for high performance computing. In addition there is Intel AI, which is well known for its Intel Neural Compute Stick 2. It enables significantly faster inference in case the target architecture (X64: Windows, Mac, Linux + ARM64: embedded Linux) does not provide other hardware acceleration like GPUs, compared to plain CPU-based execution. “Significantly faster” means, depending on the source considered, a speedup roughly in the range of 5–10×. There is already good hardware acceleration support for ONNX and it will get even better in the future.
A comparison between ONNX and its potential alternatives is not part of this post. However, it’s important to note that there are alternative interoperability standards which generally do not complement ONNX, so deciding for ONNX could make switching to an alternative a lot harder later on.
- Neural Network Exchange Format (NNEF): A serialization format which tries to abstract from the machine representation as much as possible while describing networks in a manner which allows highly efficient hardware acceleration. For a comparison with ONNX refer to the NNEF website.
- special_k: A serialization format focused on providing strong statistical and security guarantees for non-dockerized model deployments. For an introduction refer to this release blog post.
From a technological perspective NNEF and special_k are quite interesting. However, right now both standards are quite new (initial releases in 2018 and 2020, respectively) and not widely supported. In the majority of cases you’ll either end up using no interoperability standard at all or ONNX.
The dimensions of interoperability
- Data format interoperability: The ability to exchange persisted (serialized) pre-processing, trained model and post-processing.
- Data (de-)serialization interoperability: The ability to serialize programming-language-specific data structures or object state into persisted pre-processing + trained model + post-processing artifacts, and the other way around (de-serialization). This type of interoperability is a white-box characteristic and is not necessarily required.
- Semantic interoperability: “(T)he ability of computer systems to exchange data with unambiguous, shared meaning.”
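To make the semantic dimension concrete: ONNX pins down the meaning of graph operators via versioned operator sets (opsets). A small sketch, assuming the onnx Python package and a placeholder model path, of checking which opsets a serialized model targets:

```python
# Inspect the operator set versions a serialized ONNX model relies on.
import onnx

model = onnx.load("model.onnx")  # placeholder path
for opset in model.opset_import:
    # An empty domain string denotes the default ONNX operator set.
    print(opset.domain or "ai.onnx", opset.version)
```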
A lot of artificial intelligence frameworks support exporting to the ONNX format directly or via a converter. A converter translates a framework-specific model serialization format into the ONNX format (see the conversion sketch after the framework lists below). From the number of GitHub stars and Google Trends I derive the probable adoption of the frameworks.
Widely adopted frameworks (alphabetically sorted), with PyTorch/TensorFlow, scikit-learn and Keras being the top 3/4:
- Core ML: “Integrate machine learning models into your (Apple Mac and iOS) app.” Little Google Trends usage indication. But contained in a lot of Apple products.
- Keras: “Keras is an API designed for human beings, not machines. … Built on top of TensorFlow 2.0.”
- Microsoft Cognitive Toolkit (CNTK): “(…) a unified deep learning toolkit that describes neural networks as a series of computational steps via a directed graph.” Little Google Trends usage indication. But contained in a lot of Microsoft products.
- TensorFlow: The probably most widely used end-to-end open source platform for machine learning.
- scikit-learn: “Python module for machine learning built on top of SciPy”. Well known for its good documentation of statistical algorithms.
- PyTorch: “Tensors and Dynamic neural networks in Python with strong GPU acceleration.”
- XGBoost: “Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more.” Not that much Google Trends usage indication. But seems to be used for forecasting quite a lot.
Frameworks adopted quite a lot (alphabetically sorted):
- Apache MXNet: “Apache MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity.”
- Matlab Deep Learning Toolbox: Toolkit to develop, train and analyze deep learning nets.
- MindSpore: “Open Source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.” Probably mostly used in China.
- Paddle: “(T)he only independent R&D deep learning platform in China, has been officially open-sourced to professional communities since 2016.”
Not as widely adopted and niche frameworks, some of them depend on converters (alphabetically sorted):
- Apache Singa: “A distributed deep learning platform.”
- CatBoost: “(…) high-performance open source library for gradient boosting on decision trees”
- Chainer: “A flexible framework of neural networks for deep learning.” Side note: According to a blog post of the company behind Chainer, they will not continue adding features and will focus on PyTorch instead.
- DLPy: “The SAS Deep Learning Python (DLPy) package provides the high-level Python APIs to deep learning methods in SAS Visual Data Mining and Machine Learning.”
- libSVM: “(…) simple, easy-to-use, and efficient software for SVM classification and regression.”
- MyCaffe: “A complete deep learning platform written almost entirely in C# for Windows developers!”
- NeoML: “Machine learning framework for both deep learning and traditional algorithms.”
- ncnn: “(…) high-performance neural network inference framework optimized for the mobile platform.”
- Siemens Simcenter Amesim: “(…) the leading integrated, scalable system simulation platform, allowing system simulation engineers to virtually assess and optimize the performance of mechatronic systems.”
- Sony Neural Network Libraries (nnabla): “(…) deep learning framework that is intended to be used for research, development and production. We aim to have it running everywhere: desktop PCs, HPC clusters, embedded devices and production servers.”
- Tengine: “Tengine is a lite, high performance, modular inference engine for embedded devices.”
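As announced above, here is a hedged conversion sketch using the skl2onnx converter package for scikit-learn models; the model and the input signature are illustrative:

```python
# Convert a trained scikit-learn model to the ONNX format via skl2onnx.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X, y)

# Declare the input signature: batches of 4 float features.
initial_type = [("float_input", FloatTensorType([None, 4]))]
onnx_model = convert_sklearn(clf, initial_types=initial_type)

with open("logreg_iris.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```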
Something to consider is whether frameworks are able to import ONNX serialized models or only their own framework-specific serialization formats. If there is no support for ONNX import, the framework-specific serialized models as well as potential separate pre-processing and post-processing artifacts need to be managed in addition to the exported ONNX models.
For deep learning based ONNX models there are three tools which visualize models. In addition, one of the tools (Netron) is capable of visualizing non-DL based ONNX models as well. These tools are helpful because they can be used as interactive documentation. In addition they may help to explain how the models work (explainable AI).
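Besides the visualizers, a serialized model can also be inspected programmatically with the onnx Python package; a small sketch (“model.onnx” is again a placeholder path):

```python
# Validate a serialized model and print its computation graph.
import onnx

model = onnx.load("model.onnx")
onnx.checker.check_model(model)  # verify conformance to the ONNX spec
print(onnx.helper.printable_graph(model.graph))
```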
ONNX inference runtimes
ONNX inference runtimes provide a runtime environment to enable the execution of ONNX models on different operating systems (Windows, Linux, Mac, Android in preview, iOS in preview), chip architectures (X64, X86, ARM64, ARM32), hardware accelerators (CPU, CUDA, DirectML, DNNL, OpenVINO, TensorRT, ACL, ArmNN, CoreML, MIGraphX, NNAPI, NUPHAR, Rockchip NPU, Vitis AI) and in different runtime environments. In the docs, hardware accelerators are called “Execution Providers” by the way. With runtime environments I mean programming languages (e.g. Python) or programming language specific frameworks (e.g. C# .NET). For an up to date compatibility overview of possible combinations of the overall runtime components refer to the official interactive compatibility overview. W.r.t. programming language APIs, C/C++ followed by Python have the best compatibility, whereas Node.js and WinRT have the worst.
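As an illustration for the Python API, a minimal inference sketch; the model path, input shape and the execution providers listed are assumptions that depend on your installed onnxruntime build:

```python
import numpy as np
import onnxruntime as ort

# Which execution providers the installed build actually supports.
print(ort.get_available_providers())

# Providers are tried in the given order; here CUDA with a CPU fallback.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Passing None as the output list returns all model outputs.
outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```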
Which combination of runtime components to choose depends on the overall use case at hand (product type: smartphone app, desktop app, industrial edge server, etc.) and the previous experience of the developers in your team (because it cuts development cost).
Let’s say you want to develop a desktop application to be deployed to Windows as well as Mac operating systems, and your developers are experienced in C#. Then the reasonable runtime component selection for the Windows app would be
- OS: Windows, API: C#, Architecture: X64, Hardware Acceleration: CUDA → Installation Instructions: Install Nuget package Microsoft.ML.OnnxRuntime
and for the Mac App
- OS: Mac, API: C#, Architecture: X64, Hardware Acceleration: Default CPU → Install instructions: Download .tgz file from GitHub
Alternatively, a single Node.js based selection could cover Windows, Mac and Linux at once:
- OS: Windows and Mac and Linux, Architecture: X64, Hardware Acceleration: Default CPU → Install instructions:
npm install onnxruntime
Of course system design is all about tradeoffs: In the latter case there would be no support for CUDA-based acceleration on Windows machines anymore. However, you could deploy to Linux as well for free. It would even be possible to include an additional step converting the serialized model from the ONNX format into a WebAssembly:
ONNX runtimes (NodeJS) are often both magnitudes larger and slower in execution (compared to here, a WebAssembly).
ONNX inference runtime APIs
For the officially supported programming language specific APIs refer to the official docs:
- ONNX Runtime C API: C binding for running inference on ONNX models in a C runtime environment.
- ONNX Runtime C# API: “.Net binding for running inference on ONNX models in any of the .Net standard platforms.”
- ONNX Runtime Java API: “Java binding for running inference on ONNX models on a JVM, using Java 8 or newer.”
- ONNX Runtime Node.js API: “Node.js binding enables Node.js applications to run ONNX model inference.”
- ONNX Runtime Python API: Python binding for running inference on ONNX models in a Python runtime environment.
- ONNX WinRT API: Bindings to enable running inference on ONNX models in a WinRT runtime environment.
It’s worth mentioning that there are unofficial third-party ONNX inference runtimes as well. For Rust there is e.g. tract, for Go there is e.g. ONNX Go, for Ruby there is e.g. onnxruntime for Ruby, for R there is e.g. ONNX-r, etc.
Open Neural Network Compiler (ONNC)
ONNC (Open Neural Network Compiler) is a retargetable compilation framework designed specifically for proprietary deep learning accelerators.
ONNC makes it possible to compile ONNX models into a binary format which can be executed on NVDLA-based hardware designs. NVDLA stands for “NVIDIA Deep Learning Accelerator” and is a free and open architecture that promotes a standard way to design deep learning inference accelerators, widely used in IoT. The standard is quite new, with the first version released in March 2019. NVDLA is e.g. supported by FPGA-based Amazon EC2 F1 instances. There seems to be no support for NVDLA by Google Cloud and Microsoft Azure instances yet.
We are not there yet…
So far we got the big picture about ONNX from the training framework and application integration points of view, as well as inference acceleration support. We’ll continue with an introduction to the technical aspects next. Stay tuned.