January 24, 2025

Video-MMMU: A New Benchmark for AI Video Knowledge Acquisition

Video-MMMU: A New Benchmark for Knowledge Acquisition from Videos by AI Models

The ability to learn from videos is a central aspect of human cognition. We perceive information, process it into knowledge, and finally apply this knowledge to solve new problems. Videos offer a rich medium for this learning process. But how well do AI systems, especially large multimodal models (LMMs), master this task? A new benchmark called Video-MMMU aims to determine precisely that.

Previous benchmarks for video data have mostly focused on tasks such as object recognition or action recognition. The systematic evaluation of knowledge acquisition from videos by LMMs, however, has remained largely unexplored. Video-MMMU closes this gap and provides a comprehensive test for the ability of AI models to extract and apply knowledge from videos.

The benchmark consists of 300 expert-level videos from six different disciplines, accompanied by a total of 900 human-annotated questions, three per video. The questions cover three cognitive levels: perception, understanding, and application. This allows the capabilities of LMMs to be evaluated separately at each of these levels.
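Scoring a model separately at each cognitive level can be sketched as follows. This is a minimal illustration, not the benchmark's own evaluation code: the record layout, the names `results` and `accuracy_by_level`, and the sample data are all hypothetical.

```python
from collections import defaultdict

# Hypothetical graded answers: (video_id, cognitive_level, answered_correctly).
# The values are made up for illustration and are not benchmark results.
results = [
    ("video_001", "perception", True),
    ("video_001", "understanding", True),
    ("video_001", "application", False),
    ("video_002", "perception", True),
    ("video_002", "understanding", False),
    ("video_002", "application", False),
]


def accuracy_by_level(rows):
    """Return the percentage of correct answers for each cognitive level."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for _video_id, level, is_correct in rows:
        total[level] += 1
        correct[level] += int(is_correct)
    return {level: 100.0 * correct[level] / total[level] for level in total}


print(accuracy_by_level(results))
```

Keeping the level as an explicit field on every graded answer makes it easy to see where a model's accuracy falls off as the questions move from perception to application.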

A novel metric called "Knowledge Gain" (Δknowledge) quantifies how much a model's performance improves after it has watched a video. Initial tests with various LMMs show a steep drop in performance as the cognitive demands increase, and a substantial gap remains between human and machine knowledge acquisition from videos. This finding underscores the need for improved methods that enable LMMs to learn more effectively from videos and apply the acquired knowledge.
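The article does not state the formula behind Δknowledge. The sketch below assumes a normalized-gain form, Δknowledge = (Acc_after − Acc_before) / (100% − Acc_before) × 100%, in which the improvement is measured relative to the headroom the model had before watching the video. Treat this form as an assumption and check it against Hu et al. (2025) before relying on it. The function name `knowledge_gain` and the sample accuracies are hypothetical.

```python
def knowledge_gain(acc_before: float, acc_after: float) -> float:
    """Normalized gain in percent, assuming the headroom-normalized form.

    Both arguments are accuracies expressed as percentages in [0, 100].
    """
    if not (0.0 <= acc_before <= 100.0 and 0.0 <= acc_after <= 100.0):
        raise ValueError("accuracies must be percentages between 0 and 100")
    if acc_before == 100.0:
        # No headroom left, so the normalized gain is undefined.
        raise ValueError("gain is undefined when accuracy before the video is 100%")
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


# Hypothetical accuracies, for illustration only (not benchmark results).
improved = knowledge_gain(40.0, 55.0)   # model answers better after the video
regressed = knowledge_gain(40.0, 34.0)  # model answers worse after the video
print(improved, regressed)
```

Under this assumed form, a positive value means the video helped, zero means no change, and a negative value means the model did worse after watching than before.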

The developers of Video-MMMU see the improvement of video-based knowledge acquisition as an important step towards Artificial General Intelligence (AGI). The ability to extract, understand, and apply complex information from videos is essential for the development of AI systems that can adapt to new situations and solve problems, just like humans do.

Video-MMMU not only provides a valuable resource for evaluating LMMs but also offers a basis for developing new learning methods. By analyzing the strengths and weaknesses of current models, researchers can work specifically on improving video-based knowledge acquisition, paving the way for more powerful and versatile AI systems. The results of the benchmark tests show that much research is still needed to close the gap between human and machine cognition.

For Mindverse, a German company specializing in AI-powered content creation, image generation, research, and customized AI solutions, these developments are of great importance. The ability to efficiently extract and process knowledge from videos plays a crucial role in the further development of applications such as chatbots, voicebots, AI search engines, and knowledge systems. Video-MMMU provides valuable insights that can help to further improve the performance of these technologies and open up new application possibilities.

Bibliography:
Hu, K. et al. (2025). Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos. arXiv preprint arXiv:2501.13826.
Paperreading.club. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos.
Yue, X. et al. (2024). MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. ResearchGate.
arXiv. Computer Vision and Pattern Recognition.
LongHZ140516. Awesome Framework Gallery. GitHub.
Amazon Science. The Amazon Nova Family of Models: Technical Report and Model Card.
ChatPaper. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos.
Gómez, J. D. et al. (2024). Methods and Technologies for Supporting Knowledge Sharing within Learning Communities: A Systematic Literature Review. ResearchGate.
4EU+ Alliance. 4EU+ Education Framework: Examples of Good Practices in Teaching and Learning.