MLPerf Tiny Inference Benchmark Lays Foundation for TinyML Technology Evaluation, Commercialization
July 02, 2021
The speed at which edge AI ecosystems like TinyML are evolving has made standardization difficult, let alone the creation of performance and resource-utilization benchmarks that could simplify technology evaluation. Such benchmarks would be hugely beneficial to the ML industry, helping to accelerate solution comparison, selection, and productization.
But standing in the way of this is the fundamentally distributed nature of the edge and the varied applications and systems that reside there, which mean a benchmark of any value must account for:
- Hardware heterogeneity ranging from general-purpose MCUs and processors to novel accelerators and emerging memory technologies that are commonplace in the TinyML ecosystem.
- Software heterogeneity, which varies widely across TinyML systems that often use their own inference stacks and deployment toolchains.
- Cross-product support, since the heterogeneity above means interchangeable components can be, and are, used at every level of the TinyML stack.
- Low power, profiled through a power-analysis mechanism that measures device/system power consumption and energy efficiency while accounting for factors like chip peripherals and any underlying firmware.
- Limited memory, as devices carry widely different resource constraints that, in the case of edge AI, usually amount to less than a gigabyte.
In an effort to overcome these barriers, MLCommons, the organization behind the popular MLPerf family of AI training and inference benchmarks, recently released version 0.5 of the MLPerf Tiny benchmark. It’s an open-source, system-level inference benchmark designed to measure how quickly, accurately, and power-efficiently resource-constrained embedded technologies can execute trained neural networks of 100 kB or less.
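For a sense of what fits in that 100 kB budget, here is a back-of-the-envelope sketch. The layer shapes are hypothetical and chosen only to illustrate the arithmetic; they are not the MLPerf Tiny reference models.

```python
# Rough flash-footprint estimate for a small int8-quantized network.
# Layer shapes are hypothetical and chosen only to illustrate the arithmetic.

layers = {
    "conv2d_1": 3 * 3 * 1 * 16 + 16,    # 3x3 conv, 1 input channel, 16 filters (+ biases)
    "conv2d_2": 3 * 3 * 16 * 32 + 32,   # 3x3 conv, 16 -> 32 channels (+ biases)
    "dense":    32 * 10 + 10,           # fully connected classifier head (+ biases)
}

total_params = sum(layers.values())
model_size_kb = total_params * 1 / 1024  # int8 quantization: one byte per parameter

print(f"{total_params} parameters -> ~{model_size_kb:.1f} kB of weights")
# The same network stored as float32 would be roughly 4x larger.
```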
Inside the MLPerf Tiny Edge Inferencing Benchmark
Developed in collaboration with EEMBC, the Embedded Microprocessor Benchmark Consortium, this iteration of MLPerf Tiny Inference consists of four separate tasks for measuring the latency and accuracy, or the power consumption, of an ML technology:
- Keyword Spotting (KWS) uses a neural network that detects keywords from a spectrogram
- Visual Wake Words (VWW) is a binary image classification task for determining the presence of a person in an image
- Tiny Image Classification (IC) is a small image classification benchmark with 10 classes
- Anomaly Detection (AD) uses a neural network to identify abnormalities in machine operating sounds
These tasks are presented in four different scenarios that an edge device may encounter or be deployed in, namely single-stream queries, multiple-stream queries, server configuration, or offline mode. Each scenario requires approximately 60 seconds to complete, and some have latency constraints.
Figure 1. The MLPerf Tiny inferencing benchmark v0.5 presents each of the tasks in four different deployment scenarios. (Source: MLCommons)
This combination of tasks and scenarios makes it possible to analyze sensors, ML applications, ML datasets, ML models, training frameworks, graph formats, inference frameworks, libraries, operating systems, and hardware components. That breadth comes from multi-layered test suites that define a rationale, dataset, model, and quality target (usually a measure of accuracy when executing the dataset and model) for each task.
Figure 2. The MLPerf Tiny inference benchmark test suite permits the evaluation of the end-to-end edge ML stack. (Source: MLCommons)
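One way to picture such a test suite is as a structured record per task. The sketch below is purely illustrative: the field names are not the official schema, and only the quality target is taken from the benchmark itself (see Table 1); the other values paraphrase the task descriptions above.

```python
from dataclasses import dataclass

# Illustrative only: the official suite defines these elements in its run rules
# and reference implementations, not in this form.

@dataclass
class TinyBenchmark:
    rationale: str       # why the task matters at the edge
    dataset: str         # validation data the quality target is measured against
    model: str           # reference network a submitter may modify or replace
    quality_target: str  # minimum accuracy/AUC a valid submission must reach

keyword_spotting = TinyBenchmark(
    rationale="always-on voice interfaces on milliwatt power budgets",
    dataset="keyword spectrograms",
    model="small keyword-spotting neural network",
    quality_target="90% top-1 accuracy",
)
```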
The test suite procedure is as follows (a minimal code sketch of the flow appears after this list):
- Latency – The latency measurement is performed five times, with each run following this order:
- Download the input stimulus
- Load the tensor and convert the data as needed
- Run the inference for a minimum of 10 seconds and at least 10 iterations
- Measure the inferences per second (IPS)
The median IPS of the five runs is reported as the latency score.
- Energy – The energy test is identical to the latency test, but measures the total energy used during the compute timing window.
- Accuracy – A single inference is performed on the entire set of validation inputs, which vary depending on the model. The output tensor probabilities are then collected to calculate the percentage score.
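A minimal, host-side illustration of that flow is sketched below. In the actual benchmark the device under test is driven by runner software and on-device firmware; `run_inference` and `validation_set` here are stand-ins for that machinery.

```python
import statistics
import time

def measure_ips(run_inference, stimulus, min_seconds=10, min_iterations=10):
    """Run inferences for at least 10 s and at least 10 iterations; return inferences per second."""
    iterations = 0
    start = time.perf_counter()
    while iterations < min_iterations or (time.perf_counter() - start) < min_seconds:
        run_inference(stimulus)
        iterations += 1
    return iterations / (time.perf_counter() - start)

def latency_score(run_inference, stimulus, runs=5):
    """Median inferences-per-second across five measurement runs."""
    return statistics.median(measure_ips(run_inference, stimulus) for _ in range(runs))

def accuracy_score(run_inference, validation_set):
    """Single pass over the validation inputs; percent of correct top-1 predictions."""
    correct = 0
    for stimulus, label in validation_set:
        probabilities = run_inference(stimulus)  # output tensor probabilities
        predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(validation_set)
```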
Modular, Open and Closed
Of course, there are also limitations around the MLPerf Tiny benchmark in the form of run rules that ensure components are analyzed accurately and reproducibly. The run rules are established via a modular benchmark design that addresses the end-to-end ML stack, as well as two divisions that permit different types of analysis.
- Modular design allows hardware and software users to target specific components of the pipeline, like quantization, or complete solutions. Each benchmark within the TinyML suite has a reference implementation that contains training scripts, a hardware platform, and more to provide a baseline result that a submitter can modify to show the performance of a single component.
- Closed and Open divisions are stricter and more flexible, respectively, in the submissions they accept. The Closed division offers a more direct comparison of systems, whereas the Open division provides a broader scope that allows submitters to demonstrate performance, energy, and/or accuracy improvements at any stage of the ML pipeline. The Open division also allows submitters to change the model, training scripts, and dataset.
Figure 3. MLPerf Tiny’s two divisions provide a flexible way to test edge ML components against each other and a generic reference implementation. (Source: MLCommons)
The MLPerf Tiny inferencing benchmark rules are available on GitHub.
The first batch of submissions has already been published. It includes entries from Latent AI, Peng Cheng Laboratory, Syntiant, and hls4ml, all of which except hls4ml submitted to the Closed division.
In the Closed Division:
- Latent AI submitted its Latent AI Efficient Inference Platform (LEIP) software development kit (SDK) for deep learning, which it executed on a Raspberry Pi 4.
- Syntiant submitted its NDP120 neural decision processor, which pairs the company’s Syntiant Core 2 deep learning accelerator with an Arm Cortex-M0, running TensorFlow alongside the Syntiant training and software development kits.
- Peng Cheng Laboratory ran a modified version of TensorFlow Lite for Microcontrollers (v2.3.1) on its PCL Scepu02, which contains an open-source RISC-V RV32IMA core with a floating-point unit.
- Editor’s note: The Closed Division benchmark reference was submitted by Harvard: an ST Nucleo-L4R5ZI that utilizes an Arm Cortex-M4 and FPU to execute TensorFlow Lite for Microcontrollers. The reference implementations can be found on GitHub (see the sketch after Figure 4).
Figure 4. The MLPerf Tiny inferencing benchmark reference implementation is based on an STMicroelectronics Nucleo-L4R5ZI. (Source: MLCommons)
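The sketch below is a rough, host-side analogue of that reference flow using the standard TensorFlow Lite Python interpreter; the actual reference runs TensorFlow Lite for Microcontrollers in C/C++ on the Cortex-M4, and the model path here is a placeholder rather than a file shipped with the benchmark.

```python
import numpy as np
import tensorflow as tf

# Placeholder model path; not a file distributed with the benchmark.
interpreter = tf.lite.Interpreter(model_path="kws_int8.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Feed one input of the expected shape and dtype (int8 for a quantized model).
stimulus = np.zeros(input_details["shape"], dtype=input_details["dtype"])
interpreter.set_tensor(input_details["index"], stimulus)
interpreter.invoke()

scores = interpreter.get_tensor(output_details["index"])
print("Predicted class:", int(np.argmax(scores)))
```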
In the Open Division:
- hls4ml submitted its Python package for machine learning inference on FPGAs, electing to run it on a Xilinx Pynq-Z2 board built around a Zynq-7020 SoC, which pairs a dual-core Arm Cortex-A9 with FPGA fabric used as the accelerator (a minimal conversion sketch follows below).
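As a rough idea of how that flow looks in practice, the sketch below converts a trained Keras model into an HLS project with hls4ml. The model file and FPGA part string are placeholders, and argument names can differ between hls4ml releases; hls4ml’s actual MLPerf Tiny submissions targeted the image-classification and anomaly-detection workloads.

```python
import hls4ml
from tensorflow import keras

# Placeholder model file, not one of the benchmark's reference networks.
model = keras.models.load_model("tiny_image_classifier.h5")

# Generate a conversion config, then translate the network into an HLS project
# targeting the Zynq-7020 programmable logic found on the Pynq-Z2.
config = hls4ml.utils.config_from_keras_model(model, granularity="model")
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls4ml_prj",
    part="xc7z020clg400-1",
)

hls_model.compile()            # C-simulation build for quick functional checks
# hls_model.build(synth=True)  # full HLS synthesis (requires Xilinx HLS tools)
```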
Measured on latency and energy consumption, these ML stack combinations ran the Visual Wake Word, Image Classification, Keyword Spotting, and Anomaly Detection workloads described in Table 1.
| Task | Visual Wake Words | Image Classification | Keyword Spotting | Anomaly Detection |
|---|---|---|---|---|
| Data | Visual Wake Words dataset | CIFAR-10 | Speech Commands | ToyADMOS |
| Model | MobileNetV1 (0.25x) | ResNet-8 | DS-CNN | Deep AutoEncoder |
| Accuracy | 80% (top 1) | 85% (top 1) | 90% (top 1) | 0.85 (AUC) |

Table 1. Submitters to the MLPerf Tiny v0.5 inferencing benchmark put their solutions up against these workloads. (Source: MLCommons)
Below are the results for each entrant:
- Harvard (Reference):
- Visual Wake Word Latency: 603.14 ms
- Image Classification Latency: 704.23 ms
- Keyword Spotting Latency: 181.92 ms
- Anomaly Detection Latency: 10.40 ms
- Latent AI LEIP Framework:
- Visual Wake Word Latency: 3.175 ms (avg)
- Image Classification Latency: 1.19 ms (avg)
- Keyword Spotting Latency: 0.405 ms (avg)
- Anomaly Detection Latency: 0.18 ms (avg)
- Peng Cheng Laboratory:
- Visual Wake Word Latency: 846.74 ms
- Image Classification Latency: 1239.16 ms
- Keyword Spotting Latency: 325.63 ms
- Anomaly Detection Latency: 13.65 ms
- Syntiant:
- Keyword Spotting Latency: 5.95 ms
- hls4ml:
- Image Classification Latency: 7.9 ms
- Image Classification Accuracy: 77%
- Anomaly Detection Latency: 0.096 ms
- Anomaly Detection Accuracy: 82%
Editor’s note: An expanded table containing the results can be found here: https://mlcommons.org/en/inference-tiny-05/
New Classes of Edge AI
The MLPerf Tiny inferencing benchmark is a step in the right direction for the commercialization of edge AI technology and the new classes of applications it will bring. A product of collaboration between more than 50 organizations across industry and academia, the benchmark provides a fair measure of component- and system-level ML technologies, with room to expand into other applications and higher-order benchmarks like MLPerf Inference Mobile, Edge, and Data Center.
For more information or to submit your results to the MLPerf Tiny inference benchmark, visit https://mlcommons.org/en.