MiMo-Embodied: X-Embodied Foundation Model Technical Report
Abstract
We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.
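As a rough sketch of how the released checkpoint might be queried for a driving-scene prompt, the snippet below assumes a Hugging Face model id of XiaomiMiMo/MiMo-Embodied, a recent transformers version with processor chat templates, and a Qwen2-VL-style message format. These are assumptions rather than the confirmed interface; the linked repository is authoritative.

```python
# Hedged sketch: querying MiMo-Embodied as a generic Hugging Face vision-language model.
# The checkpoint id, message format, and processor interface are assumptions;
# see https://github.com/XiaomiMiMo/MiMo-Embodied for the actual API.
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

MODEL_ID = "XiaomiMiMo/MiMo-Embodied"  # assumed Hugging Face model id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, device_map="auto"
)

# Hypothetical front-camera frame from a driving scene.
image = Image.open("front_camera.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text",
         "text": "Describe the hazards ahead and propose a safe maneuver."},
    ],
}]

# Build the prompt, run generation, and decode the answer.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```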
Forward citations
Cited by 12 Pith papers
- Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.
- Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.
- RAG-KT: Cross-platform Explainable Knowledge Tracing with Multi-view Fusion Retrieval Generation
RAG-KT frames cross-platform knowledge tracing as context-constrained LLM inference by building unified multi-source context via Question Group abstractions and retrieving complementary reliable context for grounded p...
- RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
- Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.
- Large Vision-Language Models Get Lost in Attention
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
- Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...
- Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection
A routing framework maintains three parallel 3D feature streams for LiDAR, 4D radar, and fusion, with a lightweight router using weather prompts to dynamically weight them and auxiliary supervision to keep branches di...
- FASTER: Rethinking Real-Time Flow VLAs
FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...