November 28, 2024

Evaluating Interleaved Text-and-Image Generation with Scene Graphs


The generation of content that seamlessly interweaves text and images holds enormous potential for applications ranging from cookbooks to interactive tutorials. Imagine asking an AI for a recipe and receiving not only a list of ingredients and instructions, but also a matching image for each step. This vision of multimodal content creation, however, hinges on a central challenge: ensuring consistency and coherence between text and images.

The "Interleaved Scene Graph" (ISG) framework offers a novel approach to evaluating such interleaved text-and-image generation. ISG uses a scene-graph structure to capture the relationships between text and image blocks, enabling detailed evaluation at four levels of granularity:

Holistic Evaluation: Assesses the overall quality and relevance of the generated response with respect to the original request.

Structural Evaluation: Analyzes the logical arrangement and connection of text and image blocks.

Block-Level Evaluation: Examines the consistency and coherence within each individual block that combines text and image.

Image-Specific Evaluation: Focuses on the quality, accuracy, and relevance of the generated images.

This multi-level evaluation method allows for a differentiated assessment of the consistency, coherence, and accuracy of the generated content and provides interpretable feedback. ISG was developed together with a benchmark dataset called "ISG-Bench", which comprises 1,150 samples across 8 categories and 21 subcategories and covers complex dependencies between language and image. ISG-Bench makes it possible to evaluate models on demanding, image-centric tasks such as style transfer, which remain difficult for current models.
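To make the scene-graph idea and the four levels more concrete, the following Python sketch shows one possible way to represent an interleaved response as an ordered list of text and image blocks with dependency links, and to attach a score to each evaluation level. All class and function names, as well as the scoring heuristics, are illustrative assumptions made for this article; the actual ISG framework relies on far richer, model-based checks.

# Illustrative sketch (not the official ISG implementation): an interleaved
# response represented as an ordered list of text/image blocks with simple
# dependency links, plus one placeholder scorer per evaluation level.

from dataclasses import dataclass, field

@dataclass
class Block:
    kind: str                                        # "text" or "image"
    content: str                                     # text body or image path
    depends_on: list = field(default_factory=list)   # indices of related blocks

@dataclass
class InterleavedResponse:
    prompt: str
    blocks: list                                     # ordered Block objects

def holistic_score(resp: InterleavedResponse) -> float:
    # Level 1: overall quality/relevance; a real evaluator would ask a judge model.
    return 1.0 if resp.blocks else 0.0

def structural_score(resp: InterleavedResponse, expected: list) -> float:
    # Level 2: does the text/image sequence match the required structure?
    return 1.0 if [b.kind for b in resp.blocks] == expected else 0.0

def block_score(resp: InterleavedResponse) -> float:
    # Level 3: within-block consistency, approximated here as the share of
    # image blocks that are linked to at least one text block.
    images = [b for b in resp.blocks if b.kind == "image"]
    return sum(1 for b in images if b.depends_on) / len(images) if images else 0.0

def image_score(resp: InterleavedResponse) -> float:
    # Level 4: image quality/accuracy; a placeholder for VQA-style checks.
    return 1.0 if all(b.content for b in resp.blocks if b.kind == "image") else 0.0

if __name__ == "__main__":
    resp = InterleavedResponse(
        prompt="Show how to fold a paper crane, step by step, with a picture per step.",
        blocks=[
            Block("text", "Step 1: Fold the square sheet diagonally."),
            Block("image", "step1.png", depends_on=[0]),
            Block("text", "Step 2: Unfold, then fold along the other diagonal."),
            Block("image", "step2.png", depends_on=[2]),
        ],
    )
    print("holistic:  ", holistic_score(resp))
    print("structural:", structural_score(resp, ["text", "image", "text", "image"]))
    print("block:     ", block_score(resp))
    print("image:     ", image_score(resp))

Running the sketch prints one score per level for the example response; in the real framework, each level is backed by far more sophisticated, model-based evaluation rather than these toy heuristics.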

Challenges and Potentials

Tests with ISG-Bench show that current unified vision-language models struggle to generate interleaved content effectively. Compositional approaches, which combine separate language and image models, improve on unified models by 111% at the holistic level, but their performance at the block and image levels remains suboptimal.

To encourage future research, the authors also developed ISG-Agent, a baseline agent that uses a "plan-execute-refine" pipeline to invoke various tools, achieving a performance improvement of 122%.
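The general shape of such a pipeline can be illustrated with a short, simplified Python sketch. The function names and stub tools below are assumptions made for illustration and are not taken from the ISG-Agent codebase; a real agent would call a language model, an image generator, and a critic in place of the placeholders.

# Illustrative sketch (not the official ISG-Agent code): a minimal
# plan-execute-refine loop in which each planned step names the tool
# that should produce it. All tools here are stand-in functions.

def plan(prompt: str) -> list:
    # Break the request into ordered steps, each tagged with a tool.
    return [
        {"tool": "text", "instruction": f"Write an introduction for: {prompt}"},
        {"tool": "image", "instruction": f"Create an illustration for: {prompt}"},
    ]

def execute(step: dict) -> str:
    # Invoke the named tool; real tools would be an LLM and an image generator.
    if step["tool"] == "text":
        return f"[text block: {step['instruction']}]"
    return f"[image block: {step['instruction']}]"

def refine(outputs: list, prompt: str) -> list:
    # Check the draft against the request and rework weak blocks.
    # The check is a stub; a real agent would use a critic model here.
    return [o if o else f"[regenerated block for: {prompt}]" for o in outputs]

def run_agent(prompt: str) -> list:
    steps = plan(prompt)
    drafts = [execute(step) for step in steps]
    return refine(drafts, prompt)

if __name__ == "__main__":
    for block in run_agent("a three-step pancake recipe with a photo per step"):
        print(block)

In a full agent, the refine step would compare each block against the requirements captured by the scene graph and re-invoke the corresponding tool until the structural and block-level checks are satisfied.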

The development of ISG and ISG-Bench represents an important step towards robust and comprehensive evaluation of interleaved text-and-image generation. The results of the benchmark tests highlight current challenges and offer valuable insights for future research. In particular, the development of more effective compositional frameworks and the improvement of image quality and consistency at the block level are crucial for the advancement of this technology.

For companies like Mindverse, which specialize in AI-powered content creation, these developments are of great importance. The ability to automatically generate and evaluate high-quality interleaved content opens up new possibilities for developing innovative applications and improving existing solutions. From creating engaging marketing materials to developing personalized learning content, the future of AI-powered content creation promises a seamless integration of text and image.

Bibliography

Chen, D., et al. "Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment." arXiv preprint arXiv:2411.17188 (2024).

Liu, M., et al. "Holistic Evaluation for Interleaved Text-and-Image Generation." arXiv preprint arXiv:2406.14643 (2024).

Su, H., et al. "CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation." arXiv preprint arXiv:2406.05597 (2024).

Wu, C., et al. "Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning." arXiv preprint arXiv:2406.04260 (2024).