Multimodal language analysis is a rapidly developing research field that draws on multiple modalities to better understand the complex semantics of human conversation. Yet the ability of multimodal large language models (MLLMs) to capture this cognitive-level semantics has so far received little systematic study. A new benchmark called MMLA aims to close this gap.
MMLA comprises more than 61,000 multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. The data comes from a variety of sources, including movies, television series, YouTube, Vimeo, Bilibili, TED talks, and improvised scripts. In total, the dataset spans 76.6 hours of video and three modalities: text, audio, and video.
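To make the dataset's shape concrete, here is a minimal sketch of how one such annotated utterance could be represented. The field names and example values are illustrative assumptions, not MMLA's actual release schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MMLAUtterance:
    """One multimodal utterance with its semantic annotations.

    Field names are illustrative assumptions, not the benchmark's
    actual release format.
    """
    video_path: str   # visual modality: the raw video clip
    audio_path: str   # audio modality: the extracted audio track
    transcript: str   # text modality: what the speaker says
    # One label slot per semantic dimension; a given source corpus
    # typically annotates only some dimensions, hence Optional.
    intent: Optional[str] = None
    emotion: Optional[str] = None
    dialogue_act: Optional[str] = None
    sentiment: Optional[str] = None
    speaking_style: Optional[str] = None
    communication_behavior: Optional[str] = None

# Example record (paths and labels invented for illustration):
sample = MMLAUtterance(
    video_path="clips/example_0001.mp4",
    audio_path="clips/example_0001.wav",
    transcript="I can't believe you did that!",
    emotion="anger",
    sentiment="negative",
)
```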
Eight mainstream foundation models were evaluated on MMLA: five MLLMs (Qwen2-VL, VideoLLaMA2, LLaVA-Video, LLaVA-OV, MiniCPM-V-2.6) and three LLMs (InternLM2.5, Qwen2, LLaMA3). The evaluation covered three methods: zero-shot inference, supervised fine-tuning, and instruction tuning.
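As a concrete illustration of the zero-shot setting, the sketch below frames one dimension (intent) as closed-set classification: the prompt lists the candidate labels, and the model's free-text reply is normalized back onto that label set. The label subset, prompt wording, and parsing logic are assumptions for illustration; the actual MLLM call is omitted.

```python
# Zero-shot classification sketch for one dimension (intent).
# Labels, prompt wording, and normalization are illustrative
# assumptions; the model call itself is not shown.
INTENT_LABELS = ["complain", "praise", "inform", "ask for help"]

def build_prompt(transcript: str, labels: list[str]) -> str:
    """Frame the task as closed-set classification: list the
    candidate labels and ask for exactly one of them."""
    return (
        "You are given a video clip and its transcript.\n"
        f'Transcript: "{transcript}"\n'
        f"Choose the speaker's intent from: {', '.join(labels)}.\n"
        "Answer with the label only."
    )

def parse_prediction(raw_reply: str, labels: list[str]) -> str:
    """Map the model's free-text reply back onto the label set;
    anything that doesn't match counts as a miss."""
    reply = raw_reply.strip().strip(".").lower()
    return reply if reply in labels else "unknown"

prompt = build_prompt("This is the third time my order arrived late!", INTENT_LABELS)
print(parse_prediction(" Complain.\n", INTENT_LABELS))  # -> "complain"
```

Supervised fine-tuning and instruction tuning differ from this setup in that they update the model's weights on MMLA's training data rather than relying on prompting alone.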
The extensive experiments show that even fine-tuned models reach only about 60% to 70% accuracy, underscoring how far current MLLMs still are from understanding complex human language. MMLA is intended as a solid foundation for exploring the potential of large language models in multimodal language analysis and as a valuable resource for advancing the field.
MMLA offers the research community an important tool for evaluating and improving MLLMs. By providing a comprehensive and diverse dataset, MMLA enables an objective comparison of different models and promotes the development of more robust and powerful AI systems. The benchmark's six semantic dimensions cover a broad spectrum of communicative aspects and allow for a detailed analysis of the strengths and weaknesses of the tested models.
The development of MLLMs that can capture human language in all its complexity is a long-term goal of AI research. MMLA represents an important step in this direction. The benchmark will help to reveal the limitations of current models and drive the development of new, more powerful algorithms. The open availability of the dataset and code enables broad participation from the research community and accelerates progress in this promising field.