pith. machine review for the scientific record.

arxiv: 2604.18000 · v1 · submitted 2026-04-20 · 💻 cs.RO

Recognition: unknown

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models


Pith reviewed 2026-05-10 04:23 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action models · embodied reasoning · robotic policies · diagnostic benchmarks · architectural bottlenecks · semantic feature collapse · causal interventions · kinematic isolation

The pith

State-of-the-art vision-language-action models fail in dynamic scenarios because of architectural bottlenecks that produce lexical shortcuts and semantic collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BeTTER, a benchmark that applies targeted causal interventions such as spatial layout shifts and temporal extrapolation while enforcing kinematic isolation to separate high-level reasoning from low-level motor execution. Systematic tests on current VLAs reveal catastrophic breakdowns in changing environments, including reliance on lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse. These symptoms are traced directly to capacity compression and myopic downsampling in the models' architectures, which degrade their core semantic representations. Static benchmarks hide the problem by allowing models to overfit sensorimotor patterns, and the same failures appear in real-world robot tests.

Core claim

State-of-the-art VLAs catastrophically fail in dynamic scenarios, exhibiting severe lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse. Mechanistic analysis traces these symptoms to fundamental architectural bottlenecks such as capacity compression and myopic downsampling, which systematically degrade the model's foundational semantic representation. Highly static evaluation protocols mask this degradation by allowing optimization to overfit to sensorimotor priors, and the representational breakdown persists in real-world robotic validation.

What carries the argument

BeTTER, a diagnostic benchmark that applies targeted causal interventions (spatial layout shifts, temporal extrapolation) while enforcing kinematic isolation to decouple high-level reasoning failures from low-level execution limits.
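The decoupling logic that carries the argument can be sketched in a few lines. This is an editorial illustration, not the paper's code: the names (`Episode`, `shift_layout`, `reasoning_gap`) are invented here, and a real harness would intervene inside a simulator rather than on a dictionary of positions.

```python
# Hypothetical sketch of a BeTTER-style causal intervention score.
# Assumption: the spatial shift stays inside the robot's reachable
# workspace, so any success-rate drop is attributed to high-level
# reasoning rather than low-level motor limits (kinematic isolation).
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Episode:
    layout: Dict[str, Tuple[float, float]]  # object name -> (x, y)
    instruction: str

def shift_layout(ep: Episode, dx: float, dy: float) -> Episode:
    """Spatial intervention: translate every object while keeping the
    instruction and the kinematic envelope unchanged."""
    moved = {k: (x + dx, y + dy) for k, (x, y) in ep.layout.items()}
    return Episode(layout=moved, instruction=ep.instruction)

def reasoning_gap(policy: Callable[[Episode], bool],
                  base: List[Episode],
                  dx: float = 0.1, dy: float = 0.0) -> float:
    """Success rate on canonical layouts minus success rate under the
    intervention; a large gap flags overfitting to sensorimotor priors."""
    base_rate = sum(policy(ep) for ep in base) / len(base)
    shifted_rate = sum(policy(shift_layout(ep, dx, dy)) for ep in base) / len(base)
    return base_rate - shifted_rate
```

A policy that has memorized absolute positions shows a maximal gap under this probe, while a policy that grounds the instruction in the scene shows none; that contrast is the benchmark's attribution argument in miniature.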

If this is right

  • Highly static evaluation protocols mask representational degradation by allowing optimization to overfit to sensorimotor priors.
  • The representational breakdown is confirmed in real-world robotic validation and is not a simulation artifact.
  • Future VLA paradigms must resolve the structural tension between high-frequency control and high-level reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If capacity compression is the root cause, simply scaling model size or training data may not restore robust semantic representations without changes to the core vision-language-action pipeline.
  • Benchmarks that enforce dynamic changes and kinematic isolation could be applied to other multimodal models to test for similar illusions of reasoning.
  • Architectures that preserve semantic features across temporal and spatial shifts may require explicit mechanisms for long-range dependency tracking beyond current downsampling approaches.
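The portability claim in the second bullet is easy to see in miniature. Below is a toy version of the counterfactual-recombination probe (the 2×2 spatial/semantic grid illustrated in the paper's Figure 2); every name here is invented for illustration, and a real probe would score robot rollouts rather than a Python function.

```python
# Toy counterfactual-recombination probe. Cross the spatial factor
# (where the red object sits) with the semantic factor (which color is
# requested) and check every cell; a lexical-kinematic shortcut passes
# the canonical cells but fails the recombined ones.
from itertools import product

def probe_shortcuts(policy):
    """policy(instruction, layout) -> name of the object grasped.
    layout maps a position ('top'/'bottom') to the object color there."""
    failures = []
    for red_pos, target in product(["top", "bottom"], ["red", "blue"]):
        other = "bottom" if red_pos == "top" else "top"
        layout = {red_pos: "red", other: "blue"}
        picked = policy(f"pick the {target} block", layout)
        if picked != target:
            failures.append((red_pos, target, picked))
    return failures

# A lexical-kinematic shortcut: the token "red" triggers the motion
# primitive for the canonical training position (here 'top'),
# regardless of where the red object actually is.
def shortcut_policy(instruction, layout):
    return layout["top"] if "red" in instruction else layout["bottom"]
```

Running the probe on `shortcut_policy` flags exactly the two recombined cells (red object moved to the bottom), while a policy that actually grounds the color word passes all four; the same grid logic transfers to any multimodal model whose instructions factor into independent attributes.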

Load-bearing premise

That the kinematic isolation in the causal interventions fully separates high-level reasoning failures from low-level execution limits so that observed drops can be attributed to architecture.

What would settle it

A current VLA architecture that achieves high success rates on BeTTER dynamic scenarios without displaying lexical-kinematic shortcuts, behavioral inertia, or semantic feature collapse.

Figures

Figures reproduced from arXiv: 2604.18000 by Haiweng Xu, Hao Luo, Sipeng Zheng, Wanpeng Zhang, Ziheng Xi, Zongqing Lu.

Figure 1: Unmasking the illusion of embodied reasoning. Left (Illusion): State-of-the-art VLAs achieve high success rates on standard, visually static benchmarks, creating a misleading impression of robust semantic grounding and sequential planning. Middle (Intervention): We introduce systematic cognitive stress tests via controlled interventions (e.g., spatial layout shifts and primitive recomposition) to challenge…

Figure 2: Illustration of shortcut learning in instruction following. We evaluate models under counterfactual layouts by recombining spatial (top/bottom) and semantic (red/blue) instructions. Dashed lines denote observed trajectory patterns. Counterfactual cases (diagonal) expose two representative failure modes: Top-right a lexical-kinematic shortcut, where the token “red” directly triggers a learned motion primiti…

Figure 3: Illustration of reasoning breakdowns in sequential planning. We use the Packing Fast Food Order task to expose two distinct failure modes. Top (Subgoal recomposition): Models trained on fixed sequences (A → B and A → C) fail to generalize to novel compositions (B → C), instead exhibiting behavioral inertia by defaulting to the most frequent starting primitive (A). Bottom (Causal confusion): Models memorize…

Figure 4: Evaluation of fine-grained semantic grounding under adversarial visual distractors. (a) Instance-Level: Adversarial substitutions (e.g., replacing a red apple with a red block) induce severe semantic feature collapse, leading models to act on superficial attributes such as color; π0.5 shows partial robustness. (b) Scene-Level: In cluttered environments without reliable spatial priors, policies fail to grou…

Figure 5: Real-world evaluation protocol. We evaluate model robustness across two task horizons (short vs. long) and two spatial layouts. To disentangle cognitive reasoning from kinematic execution, models are fine-tuned on canonical configurations (a–d) and then subjected to targeted causal interventions (e.g., partial target absence or pre-placed objects). These perturbations expose underlying failure modes and re…

Figure 6: Representative examples of State-Grounded VQA. By leveraging deterministic privileged simulation states (e.g., exact 3D coordinates, object counts, and procedural task milestones), our automated pipeline generates dense, hallucination-free visual-language supervision. The three panels demonstrate fine-grained spatial grounding, robust instance counting, and dynamic execution tracking.
Original abstract

Recent Vision-Language-Action (VLA) models report impressive success rates on standard robotic benchmarks, fueling optimism about general-purpose physical intelligence. However, recent evidence suggests a systematic misalignment between standard benchmark success and true embodied reasoning, raising the question of whether these high scores reflect genuine cognitive capability. To address this gap, we introduce BeTTER, a diagnostic Benchmark for Testing True Embodied Reasoning in robotic policies. BeTTER applies targeted causal interventions (e.g., spatial layout shifts, temporal extrapolation) while enforcing kinematic isolation to explicitly decouple high-level reasoning failures from low-level execution limits. Through systematic evaluation, we reveal that state-of-the-art VLAs catastrophically fail in dynamic scenarios, exhibiting severe lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse. Crucially, our mechanistic analysis traces these symptoms to fundamental architectural bottlenecks - such as capacity compression and myopic downsampling - which systematically degrade the model's foundational semantic representation. We demonstrate that highly static evaluation protocols effectively mask this degradation by allowing optimization to overfit to sensorimotor priors. Supported by real-world robotic validation, our findings confirm that this representational breakdown is not a simulation artifact, highlighting the critical need for future VLA paradigms to resolve the structural tension between high-frequency control and high-level reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BeTTER, a diagnostic benchmark for testing true embodied reasoning in Vision-Language-Action (VLA) models. It applies targeted causal interventions such as spatial layout shifts and temporal extrapolation under kinematic isolation to decouple high-level reasoning from low-level execution. Systematic evaluation reveals that state-of-the-art VLAs exhibit lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse in dynamic scenarios, which the authors trace to architectural bottlenecks including capacity compression and myopic downsampling. Real-world robotic validation is provided to rule out simulation artifacts, and the work argues that static benchmarks mask these issues by allowing overfitting to sensorimotor priors.

Significance. If the central attribution holds, the work is significant for the robotics and embodied AI community. It supplies an independent benchmark (BeTTER) with falsifiable interventions rather than relying on existing fitted metrics, and includes real-world validation. This could shift evaluation practices away from overly static protocols and highlight structural tensions between high-frequency control and semantic reasoning in VLAs.

major comments (2)
  1. [BeTTER benchmark design] (abstract and §3): the claim that kinematic isolation successfully attributes performance drops to architectural bottlenecks (capacity compression, myopic downsampling) rather than residual low-level execution limits is load-bearing. The description does not report quantitative checks (e.g., ablation on sensorimotor feedback completeness or state-dependent dynamics controls) that would confirm decoupling in dynamic regimes; without these, unmodeled interactions could still confound the mechanistic analysis.
  2. [Results and mechanistic analysis] (§4–5): the abstract states systematic evaluation and real-world validation, yet no quantitative results, error bars, data exclusion criteria, or baseline comparisons are referenced. This makes it impossible to assess the magnitude of the reported failures (e.g., how severe the semantic feature collapse is relative to controls) or to verify that the interventions isolate the claimed architectural causes.
minor comments (2)
  1. [Abstract] The abstract refers to 'highly static evaluation protocols' without citing specific prior benchmarks or protocols being critiqued; adding explicit references would clarify the contrast.
  2. [Benchmark description] Notation for the new benchmark (BeTTER) and its interventions should be defined more formally (e.g., a table listing each causal intervention and its kinematic-isolation enforcement) to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [BeTTER benchmark design] (abstract and §3): the claim that kinematic isolation successfully attributes performance drops to architectural bottlenecks (capacity compression, myopic downsampling) rather than residual low-level execution limits is load-bearing. The description does not report quantitative checks (e.g., ablation on sensorimotor feedback completeness or state-dependent dynamics controls) that would confirm decoupling in dynamic regimes; without these, unmodeled interactions could still confound the mechanistic analysis.

    Authors: We agree that the kinematic isolation procedure is central to attributing failures to architectural bottlenecks rather than low-level execution. Section 3 describes the targeted causal interventions (spatial layout shifts and temporal extrapolation) under kinematic isolation, but we acknowledge the absence of explicit quantitative ablations on sensorimotor feedback completeness and state-dependent dynamics controls. In the revised version, we will add these ablations to §3 (or a dedicated subsection) to empirically confirm decoupling in dynamic regimes and rule out potential confounds. revision: yes

  2. Referee: [Results and mechanistic analysis] (§4–5): the abstract states systematic evaluation and real-world validation, yet no quantitative results, error bars, data exclusion criteria, or baseline comparisons are referenced. This makes it impossible to assess the magnitude of the reported failures (e.g., how severe the semantic feature collapse is relative to controls) or to verify that the interventions isolate the claimed architectural causes.

    Authors: The abstract is a high-level summary and conventionally omits specific numerical values. The full quantitative results—including error bars, data exclusion criteria, baseline comparisons, and the magnitude of failures such as semantic feature collapse—are reported in detail in §4 and §5, together with the real-world validation. To improve readability and address the concern, we will revise the abstract to briefly reference the key quantitative findings and relative severity of the observed issues. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark with independent evaluations; no derivation chain reduces to inputs

Full rationale

The paper introduces the BeTTER benchmark and applies causal interventions plus kinematic isolation to evaluate existing VLAs. Claims about failures (lexical-kinematic shortcuts, behavioral inertia, semantic feature collapse) and their attribution to architectural bottlenecks (capacity compression, myopic downsampling) rest on experimental performance drops and real-world validation rather than any equations, fitted parameters renamed as predictions, or self-citation chains. No load-bearing step is shown to be equivalent to its inputs by construction. The work is self-contained against external benchmarks and models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the domain assumption that the benchmark interventions isolate reasoning from execution and that performance drops indicate architectural degradation rather than other causes.

axioms (1)
  • domain assumption · Targeted causal interventions such as spatial layout shifts and temporal extrapolation with kinematic isolation decouple high-level reasoning failures from low-level execution limits.
    This premise is invoked to justify that observed failures reflect reasoning deficits rather than control issues.
invented entities (1)
  • BeTTER benchmark · no independent evidence
    purpose: Diagnostic tool to expose true embodied reasoning capabilities versus shortcuts in VLA models
    Newly introduced evaluation framework whose validity depends on the isolation assumption.

pith-pipeline@v0.9.0 · 5540 in / 1434 out tokens · 58565 ms · 2026-05-10T04:23:43.129696+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 36 canonical work pages · 20 internal anchors

  1. [1]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

  2. [2]

    Temporal action detection with structured segment networks

    Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2914–2923, 2017

  3. [3]

    Few-shot action recognition with hierarchical matching and contrastive learning

    Sipeng Zheng, Shizhe Chen, and Qin Jin. Few-shot action recognition with hierarchical matching and contrastive learning. In European Conference on Computer Vision, pages 297–313. Springer, 2022

  4. [4]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  5. [5]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J Bjorck Nvidia, Fernando Castaneda, N Cherniadev, X Da, R Ding, L Fan, Y Fang, D Fox, F Hu, S Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  6. [6]

    Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

    Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, et al. Being-H0.5: Scaling human-centric robot learning for cross-embodiment generalization. arXiv preprint arXiv:2601.12993, 2026

  7. [7]

    π0.5: A Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  8. [8]

    Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

    Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-H0: Vision-language-action pretraining from large-scale human videos. arXiv preprint arXiv:2507.15597, 2025

  9. [9]

    Robobrain: A unified brain model for robotic manipulation from abstract to concrete

    Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  10. [10]

    MiMo-Embodied: X-Embodied Foundation Model Technical Report

    Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, Xianhui Meng, et al. Mimo-embodied: X-embodied foundation model technical report.arXiv preprint arXiv:2511.16518, 2025

  11. [11]

    Vlm4vla: Revisiting vision-language-models in vision-language-action models

    Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models. In International Conference on Learning Representations, 2026

  12. [12]

    Robovqa: Multimodal long-horizon reasoning for robotics

    Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024

  13. [13]

    Egoplan-bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

    Lu Qiu, Yi Chen, Yuying Ge, Yixiao Ge, Ying Shan, and Xihui Liu. Egoplan-bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447, 2024

  14. [14]

    Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

    Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 346–355, 2024

  15. [15]

    RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

    Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics.arXiv preprint arXiv:2506.04308, 2025

  16. [16]

    LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

    Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025

  17. [17]

    LIBERO-Pro: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

    Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, and Lichao Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025

  18. [18]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  19. [19]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  20. [20]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  21. [21]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

  22. [22]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  23. [23]

    GPT-3: Its Nature, Scope, Limits, and Consequences

    Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences.Minds and machines, 30(4):681–694, 2020

  24. [24]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

  25. [25]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  26. [26]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  27. [27]

    Eagle 2: Building post-training data strategies from scratch for frontier vision-language models

    Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025

  28. [28]

    From pixels to tokens: Byte-pair encoding on quantized visual modalities

    Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, and Zongqing Lu. From pixels to tokens: Byte-pair encoding on quantized visual modalities. In The Thirteenth International Conference on Learning Representations, 2025

  29. [29]

    Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

    Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, and Zongqing Lu. Spatial-aware vla pretraining through visual-physical alignment from human videos.arXiv preprint arXiv:2512.13080, 2025

  30. [30]

    RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics

    Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024

  31. [31]

    RLBench: The Robot Learning Benchmark & Learning Environment

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  32. [32]

    ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations

    Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations.arXiv preprint arXiv:2107.14483, 2021

  33. [33]

    CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

  34. [34]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523, 2024

  35. [35]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  36. [36]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

  37. [37]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941, 2024

  38. [38]

    REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation

    Martin Sedlacek, Pavlo Yefanov, Georgy Ponimatkin, Jai Bardhan, Simon Pilc, Mederic Fourmy, Evangelos Kazakos, Cees GM Snoek, Josef Sivic, and Vladimir Petrik. Realm: A real-to-sim validated benchmark for generalization in robotic manipulation.arXiv preprint arXiv:2512.19562, 2025

  39. [39]

    Towards generalizable vision-language robotic manipulation: A benchmark and llm-guided 3d policy

    Ricardo Garcia, Shizhe Chen, and Cordelia Schmid. Towards generalizable vision-language robotic manipulation: A benchmark and llm-guided 3d policy. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8996–9002. IEEE, 2025

  40. [40]

    Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks

    Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142–11152, 2025

  41. [41]

    VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models

    Borong Zhang, Jiahao Li, Jiachen Shen, Yishuai Cai, Yuhao Zhang, Yuanpei Chen, Juntao Dai, Jiaming Ji, and Yaodong Yang. Vla-arena: An open-source framework for benchmarking vision-language-action models.arXiv preprint arXiv:2512.22539, 2025

  42. [42]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  43. [43]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023

  44. [44]

    MimicGen: A Data Generation System for Scalable Robot Learning Using Human Demonstrations

    Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations.arXiv preprint arXiv:2310.17596, 2023

  45. [45]

    InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, et al. Internvla-m1: A spatially guided vision-language-action framework for generalist robot policy. arXiv preprint arXiv:2510.13778, 2025

  46. [46]

    Eo-1: Interleaved vision-text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112, 2025

    Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision-text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112, 2025

  47. [47]

    arXiv preprint arXiv:2507.02029 , year=

    BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025

  48. [48]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  49. [49]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

  50. [50]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  51. [51]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024

  52. [52]

    Modeling context in referring expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. InEuropean conference on computer vision, pages 69–85. Springer, 2016. 17

  53. [53]

    Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes

    Yuhao Lu, Yixuan Fan, Beixing Deng, Fangfu Liu, Yali Li, and Shengjin Wang. Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

  54. [54]

    The all-seeing project v2: Towards general relation comprehension of the open world

    Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The all-seeing project v2: Towards general relation comprehension of the open world. arXiv preprint arXiv:2402.19474, 2024

  55. [55]

    Embodiedonevision: Interleaved vision-text-action pretraining for general robot control

    Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Embodiedonevision: Interleaved vision-text-action pretraining for general robot control. arXiv e-prints, pages arXiv–2508, 2025

  56. [56]

    Molmo and pixmo: Open weights and open data for state- of-the-art vision-language models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state- of-the-art vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025

  57. [57]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

  58. [58]

    Interactive language: Talking to robots in real time.IEEE Robotics and Automation Letters, 2023

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time.IEEE Robotics and Automation Letters, 2023

  59. [59]

    BC-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-z: Zero-shot task generalization with robotic imitation learning. In5th Annual Conference on Robot Learning, 2021

  60. [60]

    Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation.arXiv preprint arXiv:1806.10293, 2018

  61. [61]

    Task Templates

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–1736. PMLR, 2023. 18 Appendix A Data Recipes, Evaluation Framework, and Implementation Details A.1 Pre-tra...

The task template specifies:

- `meta`: Task name and task description.
- `object_groups`: A list of abstract object groups. Each group has:
  - `group_id`: Unique identifier.
  - `description`: What this group represents.
  - `allowed_types`: Semantic categories allowed.
  - `count_range`: [min, max] number of objects to generate.

**Your Task:** Generate **{request.num_scenarios}** distinct scenarios of this template. For each variation:

- **Invent a Scenario:** A specific real-world context (e.g., "Making a BLT sandwich" for a stacking task).
- **Generate Instruction:** Write a clear, concise natural language instruction for the robot based on the `task_description` and your invented scenario, and avoid chaining all subtasks or steps to complete the task (e.g., "Make a ham sandwich on the plate.").
- **Instantiate Objects:** For each `object_group`, select specific objects that fit the scenario and the group's constraints.
  - **Count:** You MUST generate a number of items within the `count_range`.
  - **Consistency:** The objects must make sense together (e.g., bread, lettuce, tomato).
  - **Physicality:** Ensure objects are physically suitable (e.g., don’...
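The template-to-scenario expansion above can be sketched as a small sampling routine. This is a minimal illustration under stated assumptions: the class and function names (`ObjectGroup`, `TaskTemplate`, `sample_scenario`) are hypothetical, not the paper's actual implementation; only the schema fields (`meta`, `object_groups`, `group_id`, `allowed_types`, `count_range`) come from the template described above.

```python
import random
from dataclasses import dataclass, field

@dataclass
class ObjectGroup:
    group_id: str
    description: str
    allowed_types: list   # semantic categories allowed
    count_range: tuple    # (min, max) number of objects to generate

@dataclass
class TaskTemplate:
    meta: dict            # task name and task description
    object_groups: list = field(default_factory=list)

def sample_scenario(template, rng=random):
    """Instantiate one concrete scenario from an abstract template.

    Hypothetical helper: enforces the count_range constraint and draws
    object types only from each group's allowed_types.
    """
    objects = {}
    for group in template.object_groups:
        lo, hi = group.count_range
        count = rng.randint(lo, hi)  # MUST stay within count_range
        objects[group.group_id] = [rng.choice(group.allowed_types)
                                   for _ in range(count)]
    return {"meta": template.meta, "objects": objects}

# Example: a stacking task rendered as a sandwich-making scenario
template = TaskTemplate(
    meta={"name": "stack", "description": "Stack items on the plate"},
    object_groups=[ObjectGroup("fillings", "sandwich fillings",
                               ["ham", "lettuce", "tomato"], (1, 3))],
)
scenario = sample_scenario(template)
```

An LLM-driven pipeline would replace the random draws with the prompted choices (scenario invention, instruction writing), but the structural constraints checked here are the same.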

Object Grounding. Q: What is the bounding box of the salt shaker? A: [254.0, 305.0, 287.0, 368.0]

Object Counting. Q: How many cheeseburger objects can you see? A: 2

Execution Tracking. Q: What is the robot doing? A: The robot is packing apple into the lunchbox container.

Figure 6: Representative Examples of State-Grounded VQA. By leveraging deterministic privileged simulation states (e.g., exact 3D coordinates, object counts, and procedural task milestones), our automated pipeline generates dense, hallucination-free visu...
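The VQA generation in Figure 6 derives answers deterministically from privileged simulator state rather than from model predictions, which is what makes the pairs hallucination-free. A minimal sketch, assuming a hypothetical state dictionary layout (`bboxes`, `objects`, `milestone` are illustrative field names, not the paper's API):

```python
def vqa_from_state(state):
    """Derive QA pairs deterministically from ground-truth simulator state."""
    pairs = []
    # Object grounding: exact bounding boxes read from the simulator
    for name, bbox in state["bboxes"].items():
        pairs.append((f"What is the bounding box of the {name}?", bbox))
    # Object counting: counts taken directly from the scene graph
    for category, instances in state["objects"].items():
        pairs.append((f"How many {category} objects can you see?",
                      len(instances)))
    # Execution tracking: the current procedural task milestone
    pairs.append(("What is the robot doing?", state["milestone"]))
    return pairs

# Toy state mirroring the Figure 6 examples
state = {
    "bboxes": {"salt shaker": [254.0, 305.0, 287.0, 368.0]},
    "objects": {"cheeseburger": ["burger_0", "burger_1"]},
    "milestone": "packing apple into the lunchbox container",
}
qa = vqa_from_state(state)
```

Because every answer is a function of ground-truth state, the supervision is dense and exact by construction; no vision model is in the loop when labels are produced.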