The development of Artificial Intelligence (AI) is progressing rapidly, especially in embodied AI, which aims to give robots human-like abilities to perceive and interact with their environment. A promising approach in this field is Vision-Language-Action (VLA) models, which combine visual observations with language instructions to generate actions in the real world. A new, much-discussed open-source model in this area is NORA.
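The core idea of a VLA model is an interface that maps an image observation plus a language instruction to a low-level robot action. The following is a minimal, hypothetical sketch of that interface; the class and action fields are illustrative assumptions, not NORA's actual API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Action:
    """Hypothetical low-level robot action: an end-effector translation
    plus a gripper command."""
    delta_xyz: tuple  # (dx, dy, dz) gripper translation
    gripper_open: bool


class VLAPolicy:
    """Illustrative stub of a VLA policy, not NORA's real implementation.

    A real model would jointly embed the image and the instruction with a
    vision-language backbone and decode an action from the result."""

    def act(self, image: Sequence, instruction: str) -> Action:
        # Stand-in for model inference: return a fixed "no-op" action
        # so the interface shape is clear without loading a model.
        return Action(delta_xyz=(0.0, 0.0, 0.0), gripper_open=True)


policy = VLAPolicy()
action = policy.act(image=[[0, 0], [0, 0]], instruction="pick up the red block")
```

The key design point is that perception, language, and control share one model call, rather than being separate pipeline stages.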
NORA is a comparatively small VLA model with three billion parameters. It builds on Qwen2.5-VL-3B as its backbone and was trained on 970,000 real-robot demonstrations. This allows NORA to handle tasks that require both visual perception and language understanding. Its small size relative to other VLAs is particularly noteworthy: it enables deployment on less powerful hardware and thereby democratizes research and development in this area.
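For a language-model backbone to emit robot actions at all, many VLAs discretize each continuous action dimension into a fixed number of bins so that actions can be represented as tokens. The sketch below shows generic uniform binning as an illustration of that idea; it is an assumption for exposition and not NORA's actual action tokenizer.

```python
import numpy as np

def tokenize_action(action, low=-1.0, high=1.0, n_bins=256):
    """Map continuous action values in [low, high] to integer bin indices."""
    a = np.clip(np.asarray(action, dtype=np.float64), low, high)
    bins = np.floor((a - low) / (high - low) * n_bins).astype(int)
    # A value exactly equal to `high` lands in bin n_bins, so cap it.
    return np.minimum(bins, n_bins - 1)

def detokenize_action(tokens, low=-1.0, high=1.0, n_bins=256):
    """Map bin indices back to the continuous value at each bin center."""
    return low + (np.asarray(tokens) + 0.5) * (high - low) / n_bins

# Round trip: the recovered action differs from the original by at most
# half a bin width (here 1/256 of the range).
action = [0.0, -1.0, 1.0, 0.25]
tokens = tokenize_action(action)        # → array of ints in [0, 255]
recovered = detokenize_action(tokens)
```

With 256 bins the quantization error stays below one bin width (2/256 ≈ 0.008 for this range), which is typically small compared to robot actuation noise; the trade-off is vocabulary size versus action precision.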
The release of NORA as an open-source model is an important step for the AI community. Open-source projects promote transparency and allow researchers worldwide to examine, modify, and further develop the code. This accelerates progress and enables the development of new, innovative applications. By sharing resources and insights, the development of embodied AI can be advanced more quickly and efficiently.
The application areas of VLAs like NORA are diverse and range from robotics in industry and the household to assistance systems for people with disabilities. In industry, robots equipped with VLAs can take on complex tasks in manufacturing or logistics. In the household, they could assist with everyday tasks such as cooking or cleaning. Such systems could also play an important role in the field of care and support.
Despite the promising advances, VLAs still face some challenges. The robustness and reliability of the models in complex and unpredictable environments must be further improved. The ability to learn from few examples (few-shot learning) is also an important area of research. Future developments could focus on the integration of multimodal information, such as tactile data, to further enhance the capabilities of VLAs.
The development of NORA and other open-source VLAs is an important step towards a future where robots can seamlessly interact with humans and perform complex tasks in various fields. Open collaboration and the exchange of knowledge within the AI community are crucial to fully exploit the potential of this technology and to master the challenges of the future.