The rapid development of Large Multimodal Models (LMMs) has led to remarkable advances in research and application. These models, which can process text, images, and other data types, are used in areas such as medical diagnostics, personal assistants, and robotics. Despite their power, the inner workings of LMMs remain largely opaque, which raises questions about the interpretability and safety of these complex systems. A research team has now presented an approach that offers insights into the internal mechanisms of LMMs and opens up possibilities for controlling their behavior.
Interpreting LMMs poses particular challenges. The internal representations of these models are high-dimensional and polysemantic: individual neurons can encode multiple meanings, while a single meaning can be distributed across many neurons. This complexity makes it difficult to assign specific functions to individual components of the model. A further challenge is the enormous number of concepts that LMMs can process. In contrast to traditional models, which are often trained on a limited set of concepts, LMMs operate over an open and dynamic space of meanings, which makes manual analysis and interpretation practically impossible.
The research team has developed an automated approach that combines Sparse Autoencoders (SAEs) with interpretation by a larger LMM. The SAE decomposes the complex representations of an LMM into individual, interpretable features, which a larger LMM then analyzes and interprets. The approach was demonstrated on the LLaVA-NeXT-8B model, whose features were interpreted with the help of the larger LLaVA-OV-72B model.
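To make the decomposition step concrete, the following is a minimal sketch of such a sparse autoencoder in PyTorch. It assumes a standard ReLU encoder and an L1 sparsity penalty; the paper's exact architecture, feature count, and training objective may differ.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes d_model-dimensional activations into n_features sparse features."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps only positive feature activations, yielding a sparse code
        return torch.relu(self.encoder(x))

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction term keeps the features faithful to the original activation;
    # the L1 term pushes most features to zero so each one stays interpretable.
    recon = (x - x_hat).pow(2).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```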
Specifically, the procedure works as follows: first, an SAE is attached to a specific layer of the smaller LMM and trained on that layer's activations over a dataset, learning to encode the model's representations as a sparse set of features. The learned features are then interpreted through an automated pipeline: for each feature, the images and image regions that activate it most strongly are identified, and this evidence is presented to the larger LMM, which analyzes the common factors and generates an interpretation of the feature.
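This pipeline might look roughly like the sketch below: a hook captures the chosen layer's activations, the trained SAE encodes them, and the images that most strongly activate each feature are collected before being handed to the larger model. The model objects, the layer name, the input format, and the prompt wording are illustrative assumptions, not the authors' exact implementation.

```python
import torch

@torch.no_grad()
def top_activating_images(lmm, sae, images, layer_name, k=8):
    """For each SAE feature, find the k images that activate it most strongly.

    `lmm`, `layer_name`, and the way inputs are fed to the model are
    illustrative assumptions; the authors' pipeline may differ in detail.
    """
    cache = {}

    def hook(_module, _inp, out):
        # transformer layers often return tuples; keep only the hidden states
        hidden = out[0] if isinstance(out, tuple) else out
        cache["acts"] = hidden.detach().squeeze(0)   # (tokens, d_model)

    handle = dict(lmm.named_modules())[layer_name].register_forward_hook(hook)

    scores = []
    for img_inputs in images:
        lmm(**img_inputs)                            # forward pass fills the cache
        feats = sae.encode(cache["acts"])            # (tokens, n_features)
        scores.append(feats.max(dim=0).values)       # strongest token per feature
    handle.remove()

    scores = torch.stack(scores)                     # (n_images, n_features)
    return scores.topk(k, dim=0).indices             # per feature: top-k image indices


def interpretation_prompt(feature_id, n_images):
    # The selected images (and their most activating regions) are then shown
    # to the larger LMM together with a prompt along these lines.
    return (
        f"The following {n_images} images all strongly activate feature {feature_id} "
        "of a smaller model. Describe the concept they have in common."
    )
```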
Applying this procedure has yielded interesting insights into the workings of LMMs. For example, features that correlate with emotions were identified, and by manipulating these features it was possible to influence the model's behavior, for instance to steer the generation of emotional responses. Furthermore, the causes of certain model behaviors, such as hallucinations, could be traced back to specific features and corrected by adjusting them. Also remarkable is the discovery of features that show parallels to cognitive processes in the human brain, which suggests that interpreting LMMs could also contribute to our understanding of human information processing.
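As an illustration of how such an intervention could work, the sketch below amplifies a single SAE feature by adding its decoder direction to the hooked layer's output during generation. The layer name, feature index, and steering strength are hypothetical placeholders, and the authors' exact steering mechanism may differ.

```python
import torch

def steer_feature(lmm, sae, layer_name, feature_id, strength):
    """Amplify one SAE feature during the forward pass to steer the model.

    All names here are illustrative assumptions; remove the returned hook
    handle to restore the model's original behavior.
    """
    direction = sae.decoder.weight[:, feature_id]    # d_model-dim feature direction

    def hook(_module, _inp, out):
        hidden = out[0] if isinstance(out, tuple) else out
        # add the feature's decoder direction to every token's hidden state
        hidden = hidden + strength * direction.to(hidden.dtype)
        return (hidden, *out[1:]) if isinstance(out, tuple) else hidden

    return dict(lmm.named_modules())[layer_name].register_forward_hook(hook)

# usage sketch: amplify a hypothetical emotion feature, generate, then undo
# handle = steer_feature(lmm, sae, "model.layers.16", feature_id=4711, strength=8.0)
# ... run generation ...
# handle.remove()
```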
The presented research offers a promising approach to interpreting and controlling LMMs. Automated feature analysis and interpretation makes it possible to gain insight into the complex mechanisms of these models and to influence their behavior in a targeted way. Future work could extend the procedure to other model architectures and explore further application scenarios. The development of robust, interpretable LMMs is an important step towards safe and trustworthy applications of Artificial Intelligence.