MiMo-Embodied: X-Embodied Foundation Model Technical Report

· 2025 · cs.RO · arXiv 2511.16518

Mixed citation behavior. The most common role is background (67% of classified citations).

10 Pith papers cite it.

abstract

We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.

citation-role summary

background 4 · baseline 2

citation-polarity summary

background 4 · baseline 2

representative citing papers

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

cs.AI · 2026-05-09 · unverdicted · novelty 7.0 · 3 refs

VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.

RAG-KT: Cross-platform Explainable Knowledge Tracing with Multi-view Fusion Retrieval Generation

cs.AI · 2026-04-13 · unverdicted · novelty 7.0

RAG-KT frames cross-platform knowledge tracing as context-constrained LLM inference by building unified multi-source context via Question Group abstractions and retrieving complementary reliable context for grounded predictions and interpretable diagnosis.

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

Large Vision-Language Models Get Lost in Attention

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.

Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

A routing framework maintains three parallel 3D feature streams for LiDAR, 4D radar, and fusion, with a lightweight router using weather prompts to dynamically weight them and auxiliary supervision to keep branches distinct, achieving SOTA on K-Radar.

FASTER: Rethinking Real-Time Flow VLAs

cs.RO · 2026-03-19 · unverdicted · novelty 6.0 · 2 refs

FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

cs.RO · 2026-01-11 · unverdicted · novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

cs.CV · 2026-04-20 · unverdicted · novelty 4.0

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.

citing papers explorer

Showing 10 of 10 citing papers.

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

cs.AI · 2026-05-09 · unverdicted · none · ref 40 · 3 links · internal anchor

VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.

RAG-KT: Cross-platform Explainable Knowledge Tracing with Multi-view Fusion Retrieval Generation

cs.AI · 2026-04-13 · unverdicted · none · ref 4 · internal anchor

RAG-KT frames cross-platform knowledge tracing as context-constrained LLM inference by building unified multi-source context via Question Group abstractions and retrieving complementary reliable context for grounded predictions and interpretable diagnosis.

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

cs.RO · 2026-05-13 · unverdicted · none · ref 59 · internal anchor

A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

Large Vision-Language Models Get Lost in Attention

cs.AI · 2026-05-07 · unverdicted · none · ref 101 · internal anchor

In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

cs.CV · 2026-04-20 · unverdicted · none · ref 41 · 2 links · internal anchor

OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

cs.RO · 2026-04-20 · unverdicted · none · ref 10 · internal anchor

State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.

Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection

cs.CV · 2026-04-07 · unverdicted · none · ref 21 · internal anchor

A routing framework maintains three parallel 3D feature streams for LiDAR, 4D radar, and fusion, with a lightweight router using weather prompts to dynamically weight them and auxiliary supervision to keep branches distinct, achieving SOTA on K-Radar.

FASTER: Rethinking Real-Time Flow VLAs

cs.RO · 2026-03-19 · unverdicted · none · ref 26 · 2 links · internal anchor

FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

cs.RO · 2026-01-11 · unverdicted · none · ref 35 · internal anchor

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

cs.CV · 2026-04-20 · unverdicted · none · ref 31 · internal anchor

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
