Multimodal language analysis is a rapidly developing research field that draws on multiple modalities to better understand the complex semantics of human conversation. Yet the ability of multimodal large language models (MLLMs) to capture this cognitive-level semantics has so far received little systematic study. A new benchmark called MMLA aims to close this gap.
MMLA comprises more than 61,000 multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. The data comes from a variety of sources, including movies, television series, YouTube, Vimeo, Bilibili, TED talks, and improvised scripts. In total, the dataset spans 76.6 hours of video and three modalities: text, audio, and video.
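To make the dataset's shape concrete, here is a minimal sketch of how one such annotated utterance could be represented. The field names and example values are illustrative assumptions, not MMLA's actual release schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MMLAUtterance:
    """One multimodal utterance with its semantic annotations.

    Field names are illustrative assumptions, not the benchmark's
    actual release format.
    """
    video_path: str   # visual modality: the raw video clip
    audio_path: str   # audio modality: the extracted audio track
    transcript: str   # text modality: what the speaker says
    # One label slot per semantic dimension; a given source corpus
    # typically annotates only some dimensions, hence Optional.
    intent: Optional[str] = None
    emotion: Optional[str] = None
    dialogue_act: Optional[str] = None
    sentiment: Optional[str] = None
    speaking_style: Optional[str] = None
    communication_behavior: Optional[str] = None

# Example record (paths and labels invented for illustration):
sample = MMLAUtterance(
    video_path="clips/example_0001.mp4",
    audio_path="clips/example_0001.wav",
    transcript="I can't believe you did that!",
    emotion="anger",
    sentiment="negative",
)
```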
Eight mainstream foundation models were evaluated on MMLA: five MLLMs (Qwen2-VL, VideoLLaMA2, LLaVA-Video, LLaVA-OV, MiniCPM-V-2.6) and three LLMs (InternLM2.5, Qwen2, LLaMA3). The evaluation covered three methods: zero-shot inference, supervised fine-tuning, and instruction tuning.
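As a concrete illustration of the zero-shot setting, the sketch below frames one dimension (intent) as closed-set classification: the prompt lists the candidate labels, and the model's free-text reply is normalized back onto that label set. The label subset, prompt wording, and parsing logic are assumptions for illustration; the actual MLLM call is omitted.

```python
# Zero-shot classification sketch for one dimension (intent).
# Labels, prompt wording, and normalization are illustrative
# assumptions; the model call itself is not shown.
INTENT_LABELS = ["complain", "praise", "inform", "ask for help"]

def build_prompt(transcript: str, labels: list[str]) -> str:
    """Frame the task as closed-set classification: list the
    candidate labels and ask for exactly one of them."""
    return (
        "You are given a video clip and its transcript.\n"
        f'Transcript: "{transcript}"\n'
        f"Choose the speaker's intent from: {', '.join(labels)}.\n"
        "Answer with the label only."
    )

def parse_prediction(raw_reply: str, labels: list[str]) -> str:
    """Map the model's free-text reply back onto the label set;
    anything that doesn't match counts as a miss."""
    reply = raw_reply.strip().strip(".").lower()
    return reply if reply in labels else "unknown"

prompt = build_prompt("This is the third time my order arrived late!", INTENT_LABELS)
print(parse_prediction(" Complain.\n", INTENT_LABELS))  # -> "complain"
```

Supervised fine-tuning and instruction tuning differ from this setup in that they update the model's weights on MMLA's training data rather than relying on prompting alone.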
The extensive experiments show that even fine-tuned models reach only about 60% to 70% accuracy, underscoring how far current MLLMs still are from understanding complex human language. MMLA is intended as a solid foundation for exploring the potential of large language models in multimodal language analysis and as a valuable resource for advancing the field.
MMLA offers the research community an important tool for evaluating and improving MLLMs. By providing a comprehensive and diverse dataset, MMLA enables an objective comparison of different models and promotes the development of more robust and powerful AI systems. The benchmark's six semantic dimensions cover a broad spectrum of communicative aspects and allow for a detailed analysis of the strengths and weaknesses of the tested models.
The development of MLLMs that can capture human language in all its complexity is a long-term goal of AI research. MMLA represents an important step in this direction. The benchmark will help to reveal the limitations of current models and drive the development of new, more powerful algorithms. The open availability of the dataset and code enables broad participation from the research community and accelerates progress in this promising field.