On-Device Continual Learning for TinyML Devices
November 30, 2021
Blog
With the rise of modern AI technologies, on-device model training has become a significant area of research. Growing task complexity and workloads have emphasized the need to bring AI model training to the edge.
Beyond inference at the edge, an AI model needs to be trained continuously on-device to handle uncertain situations and non-stationary inputs. Traditionally, deep learning models are trained on remote servers before being deployed to embedded devices, but there has been a shift toward continual learning with on-device personalization, which adapts the model to user interaction through newly acquired data.
Updating and retraining already-trained models on the device can take a long time, making it nearly impossible to keep up with real-time inputs. Even simply updating the prediction model on new incoming data can incur catastrophic forgetting, in which an artificial neural network completely and abruptly forgets previously learned information upon learning new information.
Continual learning (CL) is the ability to learn incrementally from changing external environments and dynamic incoming data, along with the capability to generalize out-of-distribution and perform transfer and meta-learning. Because training demands far more memory and computation than inference, neural networks are usually trained before deployment and run inference-only on embedded devices. Until recently, research on deep learning models for ultra-low-power devices followed this train-then-deploy assumption, with static models that cannot adapt to changing environments. Work on Latent Replay-based CL techniques has begun to change these dynamics, but their computation and memory demands remain a problem for ultra-low-power TinyML devices.
Latent Replay for Real-Time Continual Learning
(Image Credit: Research Paper)
The architectural diagram above captures the key idea of the Latent Replay method for continual learning. Instead of storing a portion of past data in the input space, Latent Replay stores activation volumes taken at some intermediate layer. This reduces both the computational and the storage cost, and the approach has been benchmarked on complex video datasets such as CORe50 NICv2 and OpenLORIS.
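To make the idea concrete, below is a minimal PyTorch-style sketch of one replay training step. The split point between front-end and head, the buffer size, and the batch composition are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

# Hypothetical split: a frozen front-end below the Latent Replay layer
# and a trainable head above it (the split point is an assumption).
front_end = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 32 * 32, 10))

latent_buffer = []   # stores intermediate activations, not raw images
MAX_REPLAYS = 1500   # illustrative buffer size

def train_step(x_new, y_new, optimizer, loss_fn):
    # 1. Forward the new mini-batch up to the Latent Replay layer.
    with torch.no_grad():
        z_new = front_end(x_new)

    # 2. Mix the new latent activations with stored ones from the past.
    if latent_buffer:
        z_old, y_old = zip(*latent_buffer[:len(x_new)])
        z = torch.cat([z_new, torch.stack(z_old)])
        y = torch.cat([y_new, torch.stack(y_old)])
    else:
        z, y = z_new, y_new

    # 3. Train only the layers above the Latent Replay layer.
    loss = loss_fn(head(z), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 4. Store the new activations, evicting old ones once the buffer is full.
    for zi, yi in zip(z_new, y_new):
        if len(latent_buffer) >= MAX_REPLAYS:
            latent_buffer.pop(0)
        latent_buffer.append((zi, yi))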
Looking at the architectural diagram of Latent Replay, the layers closer to the input, often known as representation layers, perform low-level feature extraction. Their pre-trained weights are stable and can be reused across applications, while the higher layers extract class-specific features and are crucial to maximizing accuracy. To maintain stability, the proposed methodology slows down learning at the layers below the Latent Replay layer and leaves the layers above to learn at their own speed, as in the sketch below.
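In an optimizer like PyTorch's SGD, this stability/plasticity split can be expressed with parameter groups. Reusing the front_end/head split from the sketch above, the slowdown factor and learning rates here are illustrative choices, not values from the paper.

import torch

# Layers below the Latent Replay layer learn slowly; layers above learn
# at the normal rate. A factor of 0.0 freezes the representation layers
# entirely (the case the frozen-front-end sketch above corresponds to).
SLOWDOWN = 0.1  # illustrative; 0.0 would freeze the front-end

optimizer = torch.optim.SGD([
    {"params": front_end.parameters(), "lr": 0.01 * SLOWDOWN},
    {"params": head.parameters(),      "lr": 0.01},
], momentum=0.9)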
Even when learning in the lower layers is slowed to zero, i.e., the layers are frozen, there are computational and storage savings, because only the small fraction of patterns that are new must flow forward and backward through the full network; replayed patterns enter directly at the Latent Replay layer. In the usual scenario where the representation layers are not completely frozen, however, the activations stored in external memory undergo an aging effect: as the lower layers drift, the stored activations gradually become stale. If those layers learn slowly, the aging effect is not disruptive, because the external memory has time to be refreshed with newly computed patterns.
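A quick back-of-the-envelope comparison illustrates why storing latents rather than inputs pays off. The tensor shapes here are assumptions for a small CNN, not figures from the paper.

# Assumed shapes: a 128x128 RGB input vs. a 16x16x32 activation volume
# at a hypothetical Latent Replay layer, both stored as 32-bit floats.
input_bytes = 128 * 128 * 3 * 4    # 192 KB per raw image
latent_bytes = 16 * 16 * 32 * 4    # 32 KB per latent activation

print(f"input replay:  {input_bytes / 1024:.0f} KB/pattern")
print(f"latent replay: {latent_bytes / 1024:.0f} KB/pattern")
print(f"saving: {input_bytes / latent_bytes:.1f}x")  # 6.0x here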
On-Device Continual Learning with Quantized Latent Replays
A recent study, building on the work of Pellegrini et al., developed a TinyML platform for on-device continual learning with quantized Latent Replays. The work targets VEGA, a PULP-based end-node System-on-Chip prototype for deep learning, fabricated in 22nm process technology. Latent Replay for CL had previously been tested on smart embedded devices, including smartphones running a Snapdragon 845 CPU, but this work focuses on ultra-low-power TinyML devices and the tighter computation and memory constraints that come with them.
The research extends the Latent Replay algorithm to work with an 8-bit quantized and frozen front-end. This does not harm the CL process, and it allows the Latent Replays themselves to be compressed with quantization, reducing memory requirements by up to 4.5x. The result is known as Quantized Latent Replay for Continual Learning. The CL primitives, namely the forward and backward propagation of common layers such as convolution, depthwise convolution, and fully connected layers, are tuned for optimized execution on VEGA.
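Below is a minimal sketch of how a stored latent activation might be quantized to 8 bits and dequantized before replay. The symmetric per-tensor scheme and the scale handling are assumptions for illustration, not the exact scheme used on VEGA.

import torch

def quantize_latent(z: torch.Tensor):
    # Symmetric per-tensor int8 quantization (assumed scheme):
    # map the float activation range onto [-127, 127].
    scale = z.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((z / scale).round(), -127, 127).to(torch.int8)
    return q, scale  # int8 payload is ~4x smaller than float32

def dequantize_latent(q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximate float activation for the replay forward pass.
    return q.to(torch.float32) * scale

# Usage: store (q, scale) in the replay buffer instead of float tensors.
z = torch.randn(16, 32, 32)
q, s = quantize_latent(z)
z_hat = dequantize_latent(q, s)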
There will always be a trade-off between computation, storage, and accuracy, to be struck based on the application and the available resources. Latent Replay makes continual learning efficient across a wide range of systems, from ultra-low-power embedded devices to smart gadgets.