February 22, 2025

Microsoft Introduces Magma: A Multimodal AI Model for Software and Robot Control


Microsoft has introduced a new AI model called Magma, which can control software and robots. Magma, short for "Multimodal Agentic Model at Microsoft Research," was developed in collaboration with several US universities. What makes Magma special is its multimodality: it not only understands different input types such as visual information and language, but can also plan and act based on them.

Previous multimodal AI systems have often required multiple separate models for processing inputs and controlling applications or robots. Magma, by contrast, combines these capabilities in a single model. Microsoft describes Magma as a bridge between verbal, spatial, and temporal intelligence, capable of handling complex tasks and situations.

Magma's Capabilities at a Glance

Magma can control software based on user instructions. For example, it can activate airplane mode on a smartphone by navigating to the home screen, opening the quick settings, and pressing the corresponding button. Even more complex tasks, such as querying weather information in a browser, are possible.
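Tasks like the airplane-mode example can be thought of as a sequence of discrete UI actions the model emits and an agent executes. The sketch below illustrates that idea in Python; the action names and structure are hypothetical, not Magma's actual output format.

```python
from dataclasses import dataclass

@dataclass
class UIAction:
    kind: str    # e.g. "navigate", "open", "tap" (illustrative vocabulary)
    target: str  # the UI element the action refers to

# Hypothetical action plan for the airplane-mode example from the article.
airplane_mode_plan = [
    UIAction("navigate", "home_screen"),
    UIAction("open", "quick_settings"),
    UIAction("tap", "airplane_mode_toggle"),
]

# An executor would walk the plan step by step.
for step, action in enumerate(airplane_mode_plan, start=1):
    print(f"{step}. {action.kind} -> {action.target}")
```

The point of the single-model design is that perceiving the screen and producing such a plan happen in one forward pass, rather than being split across a vision model and a separate controller.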

In the field of robotics, Magma can control robot arms to precisely grip, place, and move objects. Demonstrations show a robot arm under Magma's control centering a cloth on a table or moving an orange to a water bottle.

Furthermore, Magma can process video inputs from everyday situations and respond to user queries. An example scenario shows Magma integrated into glasses, suggesting strategic moves to users during a chess game or recommending suitable activities based on objects in the living room.

Training and Functionality of Magma

Magma was trained using a combination of images, videos, and robot data. Two key techniques are used: "Set-of-Mark" and "Trace-of-Mark".

Set-of-Mark enables action grounding by overlaying numbered marks on actionable objects in an image or video frame. This allows Magma to specifically target elements in a user interface or objects in physical space.
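The core idea can be sketched briefly: numbered marks let the model refer to objects by index rather than by raw pixel coordinates. The element names and bounding boxes below are made up for illustration.

```python
# Minimal sketch of the Set-of-Mark idea: assign each detected
# element a visible numeric mark, so an action can be expressed as
# "tap mark 2" instead of a raw pixel region. All values are invented.
detected_elements = [
    {"label": "quick_settings_button", "bbox": (912, 24, 968, 80)},
    {"label": "airplane_mode_toggle", "bbox": (40, 210, 120, 260)},
    {"label": "wifi_toggle", "bbox": (140, 210, 220, 260)},
]

# Mark 1, 2, 3, ... maps back to the element it was drawn on.
marks = {i: el for i, el in enumerate(detected_elements, start=1)}

def resolve(mark_id):
    """Turn a mark reference from the model into a screen region."""
    return marks[mark_id]["bbox"]

print(resolve(2))  # region of the airplane-mode toggle
```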

Trace-of-Mark supports the planning phase. By learning movement patterns from video data, Magma can anticipate future states and plan complex movement sequences, such as the step-by-step movement of an orange toward a water bottle.

Benchmarks and Challenges

Benchmark tests conducted by Microsoft show that Magma can compete with other multimodal AI models such as GPT-4V or Qwen-VL in many areas. However, Microsoft acknowledges that Magma still reaches its limits on very complex tasks that require many steps.

To promote further research and development, Microsoft plans to publish Magma's inference and training data via GitHub.

Future Potential

Magma represents an important step in the development of multimodal AI models. The ability to process different input types and control both software and robots opens up diverse application possibilities in areas such as human-computer interaction, robotics, and automation.

Although Magma still faces challenges, the model demonstrates the potential of multimodal AI for the future of technology. Microsoft's publication of the training data will drive further research and development in this area and enable new innovations.