The world of artificial intelligence is in constant motion. One area that has advanced particularly rapidly in recent years is multimodal AI, that is, the combination of different data types such as image and text. Google already made an important contribution to this field with SigLIP and is now setting new standards with SigLIP 2.
SigLIP 2 is a family of multilingual vision-language encoders that builds on its predecessor, SigLIP, and further expands its strengths in image-text processing. The core idea behind SigLIP was to replace the softmax-based contrastive loss used, for example, in CLIP with a pairwise sigmoid loss. SigLIP 2 extends this approach with additional training objectives that improve semantic understanding, localization, and the quality of the dense features the model produces.
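To make the difference concrete, here is a minimal sketch of such a pairwise sigmoid loss in PyTorch. It assumes L2-normalized image and text embeddings and learnable temperature and bias scalars; it follows the formulation described in the SigLIP paper but is illustrative rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def sigmoid_pairwise_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid loss over all image-text pairs in a batch.

    img_emb, txt_emb: L2-normalized embeddings of shape (batch, dim).
    t, b: learnable temperature and bias scalars (passed here as plain tensors).
    """
    logits = img_emb @ txt_emb.T * t + b                 # (batch, batch) pair logits
    labels = 2.0 * torch.eye(img_emb.size(0)) - 1.0      # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(label * logit), summed over all pairs and averaged per image
    return -F.logsigmoid(labels * logits).sum() / img_emb.size(0)

# Toy usage with random, normalized embeddings
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(sigmoid_pairwise_loss(img, txt, t=torch.tensor(10.0), b=torch.tensor(-10.0)))
```

Unlike a softmax over the whole batch, every image-text pair contributes an independent binary decision, which removes the need for a global normalization across the batch.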
The development from SigLIP to SigLIP 2 can be understood by looking at a few central questions and the improvements that answer them.
A first challenge was to improve the visual representation with respect to localization and the understanding of spatial relationships. The solution: adding a text decoder during pretraining. This decoder handles three tasks: generating image captions, predicting bounding boxes from descriptions of specific image regions, and, conversely, generating descriptions for given bounding boxes. Through these tasks, the decoder feeds the vision encoder additional signal about the spatial arrangement of objects, which improves its localization capability.
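The sketch below shows how these three decoder objectives could be combined into a single training signal. All names and interfaces here (the `decoder` callable, the token fields in `batch`) are illustrative assumptions, not SigLIP 2's actual training code.

```python
from typing import Callable, Dict
import torch

# Assumed interface: a decoder that cross-attends to the vision features and
# returns a token-level cross-entropy loss for the given targets, optionally
# conditioned on a text prefix.
DecoderFn = Callable[..., torch.Tensor]

def decoder_losses(vision_features: torch.Tensor,
                   batch: Dict[str, torch.Tensor],
                   decoder: DecoderFn) -> torch.Tensor:
    # 1) Image captioning: predict the caption tokens for the whole image.
    caption_loss = decoder(vision_features, targets=batch["caption_tokens"])

    # 2) Referring expression: given a region description as prefix,
    #    predict the bounding box (encoded as location tokens).
    refer_loss = decoder(vision_features,
                         prefix=batch["region_description_tokens"],
                         targets=batch["box_location_tokens"])

    # 3) Grounded captioning: given a bounding box as prefix,
    #    predict a description of that region.
    grounded_loss = decoder(vision_features,
                            prefix=batch["box_location_tokens"],
                            targets=batch["region_description_tokens"])

    return caption_loss + refer_loss + grounded_loss

# Toy stand-in decoder so the sketch runs end to end.
dummy = lambda feats, targets, prefix=None: torch.tensor(0.0)
feats = torch.randn(2, 196, 768)   # patch features from the vision encoder
batch = {k: torch.zeros(2, 8, dtype=torch.long)
         for k in ("caption_tokens", "region_description_tokens", "box_location_tokens")}
print(decoder_losses(feats, batch, dummy))
```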
Another step in optimizing SigLIP 2 was improving the fine-grained, local semantic understanding of the image representation. For this purpose, two new training objectives were introduced: a global-local loss and a masked prediction loss. Inspired by self-supervised learning, the model serves as its own teacher (self-distillation). The global-local loss trains the network to infer the representation of the entire image from a local crop; the masked prediction loss masks parts of the image and trains the network to reconstruct the missing information. Together, these two mechanisms strengthen the encoder's spatial understanding and local semantics.
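A condensed sketch of this self-distillation idea is shown below. The teacher is kept as an exponential moving average of the student; the masking strategy and the exact matching losses are simplified assumptions, and the linear "encoder" only stands in for the real vision transformer.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher weights are an exponential moving average of the student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

def global_local_loss(student, teacher, full_image, local_crop):
    """Student must recover the teacher's full-image representation from a crop."""
    with torch.no_grad():
        target = F.normalize(teacher(full_image), dim=-1)
    pred = F.normalize(student(local_crop), dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()   # cosine-distance style match

def masked_prediction_loss(student, teacher, image, mask):
    """Student sees a masked image and must reconstruct the teacher's features."""
    with torch.no_grad():
        target = teacher(image)
    pred = student(image * mask)                         # zero out masked pixels
    return F.mse_loss(pred, target)

# Toy usage with a linear "encoder" on flattened 32x32 RGB images.
student = torch.nn.Linear(3 * 32 * 32, 128)
teacher = copy.deepcopy(student)
img = torch.randn(4, 3 * 32 * 32)
crop = torch.randn(4, 3 * 32 * 32)                       # stands in for a resized local crop
mask = (torch.rand(4, 3 * 32 * 32) > 0.5).float()
loss = global_local_loss(student, teacher, img, crop) \
     + masked_prediction_loss(student, teacher, img, mask)
loss.backward()
ema_update(teacher, student)
```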
Finally, adaptability to different resolutions plays a crucial role. SigLIP 2 comes in two variants: models with a fixed resolution and models with dynamic resolution (NaFlex). The dynamic-resolution models can flexibly process images of different sizes and aspect ratios, which is particularly advantageous for applications such as OCR and document analysis.
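As an illustration, the sketch below feeds two images with very different aspect ratios through a NaFlex checkpoint using Hugging Face Transformers. The checkpoint name follows the published SigLIP 2 naming scheme but should be treated as an assumption; swap in whichever variant you actually use.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumed checkpoint name for the dynamic-resolution (NaFlex) variant.
ckpt = "google/siglip2-base-patch16-naflex"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

# Two placeholder images with very different aspect ratios,
# e.g. a wide document crop and a tall photo.
wide = Image.new("RGB", (1024, 256))
tall = Image.new("RGB", (300, 900))

inputs = processor(images=[wide, tall], return_tensors="pt")
with torch.no_grad():
    feats = model.get_image_features(**inputs)

print(feats.shape)  # one pooled embedding per image, despite the different shapes
```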
SigLIP 2 models outperform their predecessors in several core areas, including zero-shot classification, image-text retrieval, and the extraction of visual representations for Vision-Language Models (VLMs). Integration into existing frameworks such as Hugging Face Transformers is straightforward and allows the models to be used directly in a wide range of applications.
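For example, zero-shot classification can be run through the standard Transformers pipeline. The checkpoint id below is an assumption based on the published SigLIP 2 naming and can be replaced by any other fixed-resolution variant.

```python
from transformers import pipeline

# Assumed checkpoint id; other SigLIP 2 checkpoints should work the same way.
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",
)

labels = ["a cat", "a dog", "a scanned invoice"]
result = classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",  # any image URL, path, or PIL image
    candidate_labels=labels,
)
print(result)  # a list of {"label": ..., "score": ...} dicts, sorted by score
```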
Of particular interest is the use of SigLIP 2 as a vision encoder for VLMs such as PaliGemma. Combining it with powerful language models opens up new possibilities for complex image-text tasks.
With SigLIP 2, Google presents a significant advance in multimodal AI. Extending the SigLIP approach with additional training objectives and flexible handling of resolutions raises the bar for image-text processing and opens up promising perspectives for future applications in areas such as image analysis, document processing, and the development of advanced VLMs. The straightforward integration into existing frameworks and the availability of multiple model sizes make SigLIP 2 a valuable tool for developers and researchers.