pith. machine review for the scientific record.

arxiv: 2511.16518 · v2 · submitted 2025-11-20 · 💻 cs.RO · cs.CL · cs.CV

Recognition: unknown

MiMo-Embodied: X-Embodied Foundation Model Technical Report

Authors on Pith: no claims yet
classification 💻 cs.RO · cs.CL · cs.CV
keywords: mimo-embodied, across, driving, model, autonomous, benchmarks, embodied, foundation
0 comments
read the original abstract

We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.

  2. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.

  3. RAG-KT: Cross-platform Explainable Knowledge Tracing with Multi-view Fusion Retrieval Generation

    cs.AI 2026-04 unverdicted novelty 7.0

    RAG-KT frames cross-platform knowledge tracing as context-constrained LLM inference by building unified multi-source context via Question Group abstractions and retrieving complementary reliable context for grounded p...

  4. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  5. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.

  6. Large Vision-Language Models Get Lost in Attention

    cs.AI 2026-05 unverdicted novelty 6.0

    In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

  7. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

  8. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

  9. Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...

  10. Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    A routing framework maintains three parallel 3D feature streams for LiDAR, 4D radar, and fusion, with a lightweight router using weather prompts to dynamically weight them and auxiliary supervision to keep branches di...

  11. FASTER: Rethinking Real-Time Flow VLAs

    cs.RO 2026-03 conditional novelty 6.0

    FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.

  12. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...