Magma: A Foundation Model for Multimodal AI Agents

Baolin Peng; Jianfeng Gao; Jianwei Yang; Joel Jang; Lars Liden; Mu Cai; Qianhui Wu; Reuben Tan; Ruijie Zheng; Seonghyeon Ye

arxiv: 2502.13130 · v1 · pith:BKEN5DMQnew · submitted 2025-02-18 · 💻 cs.CV · cs.AI · cs.HC · cs.LG · cs.RO

Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao

classification 💻 cs.CV cs.AI cs.HC cs.LG cs.RO

keywords magma tasks model multimodal agentic intelligence models ability

abstract

We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models: it not only retains their VL understanding ability (verbal intelligence) but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and to complete agentic tasks ranging from UI navigation to robot manipulation. To endow it with these agentic capabilities, Magma is pretrained on large amounts of heterogeneous data spanning images, videos, and robotics data, where the actionable visual objects (e.g., clickable buttons in a GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM work in strong synergy and facilitate the acquisition of spatial-temporal intelligence for Magma, which is fundamental to a wide range of tasks, as shown in Fig. 1. In particular, Magma sets new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image- and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.

This paper has not been read by Pith yet.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models
cs.LG 2026-06 unverdicted novelty 7.0

Act2Answer protocol reveals VLA models retain simple concepts but show larger gaps on complex semantics than source VLMs, with VQA co-training linked to better retention and knowledge signals peaking in middle layers.

TTT-VLA: Test-Time Latent Prompt Optimization for Vision-Language-Action Models
cs.RO 2026-06 unverdicted novelty 7.0

TTT-VLA performs test-time training for VLA models by optimizing only a latent prompt on new interaction data via a proxy self-supervised signal, yielding higher task success rates on SimplerEnv in single- and multi-e...

AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
cs.RO 2025-11 unverdicted novelty 6.0

AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.

FLARE: Robot Learning with Implicit World Modeling
cs.RO 2025-05 unverdicted novelty 6.0

FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
cs.LG 2025-04 unverdicted novelty 6.0

$\pi_{0.5}$ is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
cs.CV 2025-07 unverdicted novelty 5.0

ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.

Toward Self-Organizing Production Logistics: A Multi-Agent Approach
eess.SY 2026-04 unverdicted novelty 4.0

The paper derives system objectives for self-organizing production logistics and proposes a multi-agent architecture with embodied agents, event-driven coordination, and a three-phase demonstration roadmap.

Toward Self-Organizing Production Logistics: A Multi-Agent Approach
eess.SY 2026-04 unverdicted novelty 4.0

A multi-agent architecture with embodied agents, shared semantic knowledge, and dynamic digital twins is proposed to support decentralized, resilient production logistics in circular factories.