Deploying neural networks in resource-constrained environments, such as mobile devices or edge computing systems, is challenging, which makes efficient architectures essential. Traditionally, convolutions and attention mechanisms have been used to capture local and global dependencies in images. State Space Models (SSMs) have recently emerged as an effective alternative for global token interaction, as their computational complexity grows linearly with the number of tokens. However, efficient vision backbones built on SSMs remain underexplored. This article introduces Efficient Vision Mamba (EfficientViM), a novel architecture based on Hidden State Mixer-based State Space Duality (HSM-SSD) that captures global dependencies at a further reduced computational cost.
At the core of EfficientViM is the HSM-SSD layer. It builds upon State Space Duality (SSD), a recent reformulation of SSMs. Whereas the standard SSD layer performs channel mixing in the feature space, i.e., across all image tokens, HSM-SSD shifts these operations into the hidden state space. The hidden states can be viewed as a compressed latent representation of the input: their number is much smaller than the number of tokens, so applying the channel mixing there removes the dominant computational cost of the SSD layer without compromising the model's generalization ability. The number of hidden states thus becomes a controllable parameter that trades off model complexity against speed.
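To make this concrete, below is a minimal, hypothetical PyTorch sketch of the hidden-state mixing principle. It is not the authors' implementation: the projections `to_B` and `to_C` and the simple gated mixer are illustrative assumptions. The point it demonstrates is how the expensive channel mixing moves from the N-token axis to the much smaller n-state axis.

```python
import torch
import torch.nn as nn

class HSMSSDSketch(nn.Module):
    """Minimal sketch of the HSM-SSD idea (hypothetical simplification,
    not the paper's implementation): channel mixing is applied to the
    n hidden states instead of the N input tokens (n << N)."""

    def __init__(self, dim: int, num_states: int):
        super().__init__()
        # Projections producing the state-space parameters B and C
        # from the input tokens (names follow the SSD convention).
        self.to_B = nn.Linear(dim, num_states)
        self.to_C = nn.Linear(dim, num_states)
        # Channel mixing and gating, applied in hidden-state space:
        # cost O(n * d^2) instead of O(N * d^2) when applied per token.
        self.mixer = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N tokens, dim)
        B = self.to_B(x).softmax(dim=1)  # (batch, N, n): token-to-state weights
        C = self.to_C(x)                 # (batch, N, n): state-to-token weights
        # Contract the token axis: n hidden states, each a d-dim summary.
        h = B.transpose(1, 2) @ x        # (batch, n, dim)
        # Channel mixing + gating in the reduced hidden-state space.
        h = self.mixer(h) * torch.sigmoid(self.gate(h))
        # Expand back to the token axis.
        return C @ h                     # (batch, N, dim)
```

In this sketch, with N tokens, d channels, and n hidden states, the mixing step costs O(n·d²) rather than O(N·d²), which is where the speedup comes from when n is much smaller than N.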
To further enhance the representational power of the hidden states, EfficientViM employs multi-stage hidden state fusion: the final prediction logits are combined with auxiliary logits derived from the hidden states of each stage (see the sketch below). This fusion lets the model integrate information from different depths of processing, improving prediction accuracy.
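The following is a hypothetical sketch of such a fusion scheme. The per-stage linear heads, the mean pooling, and the simple summation of logits are illustrative assumptions rather than the paper's exact formulation; the sketch only shows the mechanism of mixing stage-wise hidden-state predictions into the final logits.

```python
import torch
import torch.nn as nn

class MultiStageFusionSketch(nn.Module):
    """Hypothetical sketch of multi-stage hidden-state fusion: a small
    classifier head per stage, whose logits are added to the logits of
    the final features (illustrative, not the paper's exact design)."""

    def __init__(self, stage_dims, num_classes: int):
        super().__init__()
        # One linear head per stage's hidden states, plus the main head.
        self.stage_heads = nn.ModuleList(
            nn.Linear(d, num_classes) for d in stage_dims
        )
        self.main_head = nn.Linear(stage_dims[-1], num_classes)

    def forward(self, stage_hidden_states, final_features):
        # stage_hidden_states: list of (batch, n_i, dim_i) tensors,
        # one per stage; final_features: (batch, dim_last).
        logits = self.main_head(final_features)
        for h, head in zip(stage_hidden_states, self.stage_heads):
            # Pool each stage's hidden states before classifying.
            logits = logits + head(h.mean(dim=1))
        return logits
```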
In addition to reducing computational cost through HSM-SSD, EfficientViM is also designed to minimize memory-bound operations. In practice, such operations, which spend their time moving data rather than computing, are often the real performance bottleneck, especially in resource-constrained environments. The design of EfficientViM therefore prioritizes measured latency over theoretical metrics like FLOPs; the micro-benchmark below illustrates why the two can diverge.
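As a small, illustrative micro-benchmark (not from the paper), the snippet below compares a compute-bound matrix multiplication against a memory-bound elementwise chain. The elementwise chain performs a tiny fraction of the matmul's FLOPs, yet its latency is dominated by memory traffic, so FLOP counts alone are a poor predictor of runtime.

```python
import time
import torch

def benchmark(fn, iters: int = 50) -> float:
    """Average wall-clock time per call (illustrative timing only)."""
    fn()  # warm-up
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
w = torch.randn(4096, 4096, device=device)

# Compute-bound: ~137 GFLOPs per call, operands reused heavily from cache.
t_mm = benchmark(lambda: x @ w)
# Memory-bound: ~0.05 GFLOPs (about 0.04% of the matmul's FLOPs), but
# three separate kernels that each read and write the full tensor.
t_ew = benchmark(lambda: (x + 1.0).relu() * 0.5)

print(f"matmul:      {t_mm * 1e3:.2f} ms")
print(f"elementwise: {t_ew * 1e3:.2f} ms "
      f"({100 * t_ew / t_mm:.0f}% of the matmul time)")
```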
Extensive experiments on ImageNet-1K show that EfficientViM sets a new state of the art in the speed-accuracy trade-off. Compared to other efficient models such as SHViT and MobileNetV3, it delivers notable gains in both accuracy and throughput, with further improvements when larger input resolutions and distillation-based training are used.
Due to its efficiency, EfficientViM is particularly well suited to resource-constrained environments such as smartphones, embedded systems, and edge computing platforms. The architecture can serve as a backbone for various computer vision tasks, including image classification, object detection, and segmentation. Future research could extend EfficientViM to additional tasks and optimize it for specific hardware platforms. The combination of HSM-SSD, multi-stage hidden state fusion, and the focus on memory-bound operations makes EfficientViM a promising approach to efficient deep learning in computer vision.