The ability to precisely segment objects in images and videos is a central component of many computer vision applications. A promising approach is in-context segmentation, where similar objects are recognized and segmented in other images based on a single annotated example image. This task, also known as one-shot segmentation in few-shot learning, tests the generalization ability of segmentation models and finds application in areas such as scene understanding and image/video editing.
Although current models such as the Segment Anything Model (SAM) achieve impressive results in interactive segmentation, they are not directly applicable to in-context segmentation. Researchers have now developed a new prompt-tuning-based method called Dual Consistency SAM (DC-SAM), which adapts SAM and SAM2 for the in-context segmentation of images and videos.
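To make the adaptation concrete: in prompt tuning, the pretrained model's weights stay frozen and only a small set of prompt parameters is trained. The following sketch illustrates this general idea in PyTorch; the module and its interface (PromptGenerator, sam) are illustrative assumptions, not DC-SAM's actual implementation.

    import torch
    import torch.nn as nn

    class PromptGenerator(nn.Module):
        # Hypothetical lightweight module (not DC-SAM's actual architecture):
        # learnable queries attend to in-context features and yield prompt
        # embeddings for SAM's frozen mask decoder.
        def __init__(self, dim=256, num_prompts=8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_prompts, dim))
            self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

        def forward(self, context_feats):           # (B, N, dim) image features
            q = self.queries.unsqueeze(0).expand(context_feats.size(0), -1, -1)
            prompts, _ = self.attn(q, context_feats, context_feats)
            return prompts                          # (B, num_prompts, dim)

    def setup_prompt_tuning(sam, generator):
        # Freeze every SAM weight; only the prompt generator is optimized.
        for p in sam.parameters():
            p.requires_grad_(False)
        return torch.optim.AdamW(generator.parameters(), lr=1e-4)

Because the backbone stays untouched, the trainable parameter count is tiny compared to fine-tuning the full model, which is what makes this style of adaptation attractive.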
The core of DC-SAM lies in improving the features of SAM's prompt encoder by providing high-quality visual prompts. When generating a mask prior, the SAM features are fused to better align the prompt encoder. On the fused features and initial visual prompts, DC-SAM then applies cycle-consistent cross-attention, which keeps the attention focused on the annotated object. A dual-branch design supplies the prompt encoder with discriminative positive and negative prompts. For video processing, a dedicated mask-tube training strategy integrates the dual-consistency method into the mask tube.
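The cycle-consistency idea can be pictured with a simplified sketch: attention is computed between the initial prompts and the fused features, a round trip (feature token to its strongest prompt, then to that prompt's strongest token) is traced, and tokens whose round trip leaves the support foreground are suppressed. This is a loose interpretation under stated assumptions, not the paper's exact formulation; the tensor shapes and the fg_mask convention are illustrative.

    import torch

    def cycle_consistent_attention(prompts, feats, fg_mask):
        # prompts: (B, Nq, C) initial visual prompts
        # feats:   (B, Nk, C) fused SAM features
        # fg_mask: (B, Nk) bool, True where the support mask is foreground
        logits = torch.einsum("bqc,bkc->bqk", prompts, feats) / prompts.size(-1) ** 0.5
        best_tok = logits.argmax(dim=-1)             # prompt -> strongest token
        best_q = logits.argmax(dim=1)                # token  -> strongest prompt
        # Round trip: token -> its best prompt -> that prompt's best token.
        round_trip = torch.gather(best_tok, 1, best_q)           # (B, Nk)
        keep = torch.gather(fg_mask, 1, round_trip)              # (B, Nk)
        # Suppress tokens whose round trip leaves the foreground region
        # (a large negative instead of -inf keeps the softmax well defined).
        logits = logits.masked_fill(~keep.unsqueeze(1), -1e9)
        return logits.softmax(dim=-1) @ feats        # refined prompts (B, Nq, C)

In the dual-branch design, such a routine would run twice: once with the support mask as foreground to produce positive prompts, and once with its complement to produce negative prompts.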
Since no benchmark for in-context segmentation existed in the video domain, the researchers created the In-Context Video Object Segmentation (IC-VOS) benchmark. It consists of manually curated examples from existing video segmentation datasets and enables a comprehensive evaluation of models' in-context capabilities.
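The evaluation protocol can be pictured as follows; the field names and the model interface in this sketch are assumptions for illustration, not the benchmark's actual schema (j_and_f is defined in the next sketch).

    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class ICVOSSample:
        support_image: np.ndarray        # (H, W, 3) annotated example frame
        support_mask: np.ndarray         # (H, W)    binary mask of the target
        query_frames: List[np.ndarray]   # frames of the video to segment
        query_masks: List[np.ndarray]    # ground-truth masks for scoring

    def evaluate(model, samples):
        # The model sees only the support pair and must segment the same
        # category throughout each query video; scores are averaged.
        scores = [
            j_and_f(model.segment(s.support_image, s.support_mask, s.query_frames),
                    s.query_masks)
            for s in samples
        ]
        return sum(scores) / len(scores)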
Extensive experiments show that DC-SAM outperforms previous approaches: it achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the IC-VOS benchmark. These results underscore the potential of DC-SAM for in-context segmentation and open up new possibilities for the application of AI in image and video processing.
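For reference, the reported metrics can be computed roughly as follows. mIoU averages the intersection-over-union of predicted and ground-truth masks over classes; the J&F score averages region similarity (J, the mask IoU) and boundary accuracy (F). The sketch below uses a simplified one-pixel boundary; the official DAVIS-style F-measure matches boundaries within a small tolerance rather than exactly.

    import numpy as np
    from scipy.ndimage import binary_erosion

    def iou(pred, gt):
        pred, gt = pred.astype(bool), gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        return np.logical_and(pred, gt).sum() / union if union else 1.0

    def boundary(mask):
        # One-pixel boundary via erosion; a simplification of the official
        # F-measure, which matches boundaries within a tolerance.
        m = mask.astype(bool)
        return m & ~binary_erosion(m)

    def j_and_f(preds, gts):
        js, fs = [], []
        for p, g in zip(preds, gts):
            js.append(iou(p, g))                     # J: region similarity
            bp, bg = boundary(p), boundary(g)
            tp = (bp & bg).sum()
            prec = tp / bp.sum() if bp.sum() else 1.0
            rec = tp / bg.sum() if bg.sum() else 1.0
            fs.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        return (np.mean(js) + np.mean(fs)) / 2       # J&F: average of both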
The development of DC-SAM and IC-VOS marks an important advance in in-context segmentation. By combining prompt tuning, dual consistency, and mask-tube training, DC-SAM significantly improves the performance of SAM and SAM2 while laying the groundwork for future research in video segmentation.
Bibliography:
https://arxiv.org/abs/2504.12080
https://github.com/zaplm/DC-SAM
https://openreview.net/forum?id=Ha6RTeWMd0
https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/
https://www.themoonlight.io/review/sam-2-segment-anything-in-images-and-videos