Text-to-image diffusion models have revolutionized image synthesis, making it possible to create impressive images from text descriptions alone. Despite this progress, artists and designers who need precise control and flexible adaptation often run into limitations: text alone is frequently insufficient to convey specific details, especially when a particular person or object must be reproduced faithfully in new contexts. This article highlights "Diffusion Self-Distillation," a method that addresses this challenge and enables personalized image generation without complex finetuning.
A common use case is "identity-preserving generation," in which a specific person or object is to be depicted in different environments and scenarios. This calls for image-and-text-conditioned generative models, but training such models requires large amounts of high-quality paired data, which is rarely available in practice. Existing approaches such as DreamBooth or LoRA finetuning are effective but time-consuming and computationally intensive, since a separate training run is needed for each new person or object. Zero-shot methods such as IP-Adapter or InstantID are faster but do not achieve the desired consistency and adaptability.
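To make the per-subject overhead concrete, the sketch below shows roughly how finetuning-based customization is typically used with the diffusers library: every new subject needs its own previously trained LoRA adapter before it can be placed into new scenes. The model ID and adapter paths are illustrative placeholders, not artifacts from the paper.

```python
# Minimal sketch (diffusers API, placeholder paths): finetuning-based
# customization requires one trained LoRA adapter per subject, which must be
# created and loaded before that subject can be rendered in new scenes.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Each subject needs its own previously finetuned adapter (hypothetical paths).
pipe.load_lora_weights("./loras/my_dog_lora")   # trained only on subject A
image_a = pipe("a photo of sks dog surfing a wave").images[0]

pipe.unload_lora_weights()
pipe.load_lora_weights("./loras/my_mug_lora")   # a second training run for subject B
image_b = pipe("a photo of sks mug on a mountain peak").images[0]
```

The repeated training step is exactly the cost that zero-shot approaches try to avoid, and it is the bottleneck Diffusion Self-Distillation targets.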
Diffusion Self-Distillation offers an elegant solution to this dilemma. The core of the method lies in using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. This process takes place in several steps:
First, the method exploits the diffusion model's capacity for in-context generation to create grids of images. Each grid shows the desired person or object in different poses, perspectives, and environments. A vision-language model (VLM) then curates this raw output: it filters the generated grids so that only images that consistently preserve the subject's identity remain in the dataset. This curated, paired dataset serves as the basis for the actual training, in which the original text-to-image model is fine-tuned into a text-and-image-to-image model. Through this process, the model learns to extract the identity of a person or object from a reference image and transfer it to new contexts, without requiring retraining for each new instance.
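A minimal sketch of the first two stages, grid generation and VLM curation, is shown below, assuming an off-the-shelf Stable Diffusion XL checkpoint via the diffusers library. The grid prompt, the consistency check, and the example subject are illustrative placeholders, not the exact procedure from the paper.

```python
# Sketch of the grid-generation and VLM-curation stages, assuming a standard
# Stable Diffusion XL checkpoint loaded through diffusers.
import torch
from diffusers import StableDiffusionXLPipeline
from PIL import Image

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def generate_grid(subject: str) -> Image.Image:
    # Ask the base model for an in-context grid: the same subject rendered in
    # several poses, camera angles, and environments within a single image.
    prompt = (
        f"a 2x2 grid of photos of the same {subject}, "
        "different poses, camera angles and backgrounds, consistent identity"
    )
    return pipe(prompt, height=1024, width=1024).images[0]

def split_grid(grid: Image.Image, rows: int = 2, cols: int = 2) -> list[Image.Image]:
    # Cut the generated grid into individual candidate images.
    w, h = grid.size
    tw, th = w // cols, h // rows
    return [
        grid.crop((c * tw, r * th, (c + 1) * tw, (r + 1) * th))
        for r in range(rows)
        for c in range(cols)
    ]

def vlm_judges_consistent(images: list[Image.Image]) -> bool:
    # Placeholder for the VLM curation step: in practice a vision-language
    # model is asked whether all tiles depict the same identity, and only
    # grids that pass are kept. Returning True is just a stand-in.
    return True

# Build the paired dataset: (reference, target, caption) triples drawn from
# grids that the VLM judged identity-consistent.
dataset = []
for subject in ["red-haired cartoon fox wearing a scarf"]:
    tiles = split_grid(generate_grid(subject))
    if vlm_judges_consistent(tiles):
        for target in tiles[1:]:
            dataset.append({"reference": tiles[0], "target": target, "caption": subject})
```

The (reference, target, caption) triples collected this way are the paired data that the final stage consumes when finetuning the text-to-image model into a text-and-image-to-image model.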
Diffusion Self-Distillation offers several advantages over existing methods. It is significantly faster and more efficient than traditional finetuning methods while achieving comparable quality in identity-preserving generation. Furthermore, the method is very flexible and can be applied to a variety of applications, including:
- Consistent character generation for comics and animations
- Camera control and perspective changes
- Relighting and adjustment of lighting
- Object adaptation and modification

Diffusion Self-Distillation allows artists and designers to quickly and easily generate personalized images without requiring in-depth knowledge of machine learning. This opens up new creative possibilities and significantly accelerates the design process. Mindverse, as a provider of AI-powered content solutions, integrates this technology into its platform to give users easy and intuitive access to this innovative image generation method.
Diffusion Self-Distillation represents an important step towards more user-friendly and efficient image generation. Future research could focus on improving the quality of the generated images, broadening the range of applications, and integrating additional control mechanisms. More capable VLMs and improved training algorithms are likely to increase the performance of this method further and open up new possibilities for creative applications.
Bibliography:
Cai, S., Chan, E., Zhang, Y., Guibas, L., Wu, J., & Wetzstein, G. (2024). Diffusion Self-Distillation for Zero-Shot Customized Image Generation. arXiv preprint arXiv:2411.18616.
https://arxiv.org/abs/2411.18616
https://arxiv.org/html/2411.18616v1
https://www.chatpaper.com/chatpaper/fr/paper/85660
https://paperreading.club/page?id=268859
https://github.com/DmitryRyumin/ICCV-2023-Papers/blob/main/sections/2023/main/image-and-video-synthesis.md
https://openaccess.thecvf.com/CVPR2024?day=2024-06-19
https://github.com/DmitryRyumin/AAAI-2024-Papers/blob/main/sections/2024/main/computer_vision.md
https://www.reddit.com/r/ninjasaid13/comments/1h1pkir/241118616_diffusion_selfdistillation_for_zeroshot/
https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/04429.pdf
https://openreview.net/forum?id=Hc2ZwCYgmB&referrer=%5Bthe%20profile%20of%20Xiuchao%20Sui%5D(%2Fprofile%3Fid%3D~Xiuchao_Sui2)