arxiv: 2606.30686 · v1 · pith:X4X2BQTHnew · submitted 2026-06-28 · 💻 cs.RO · cs.AI

Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning

Pith reviewed 2026-07-01 06:53 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords vision-language-action models · physical reasoning · task success rate · semantic mapping · evaluation protocols · robot manipulation · identifiability gap · generalization

The pith

Task success rates on robot benchmarks cannot verify whether vision-language-action models reason about physics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that gains on manipulation benchmarks do not establish physical reasoning, because the main metric cannot separate semantic pattern matching from decisions that depend on physical laws. It decomposes VLA policies into a semantic mapping step and a physical action decision step, then shows that the overall success rate can rise when either step improves. This matters because reported progress could come from overlap with internet-scale pretraining data rather than from new competence in predicting how objects behave under forces and constraints. Without designs that hold semantics fixed while varying physical conditions, claims about generalization from vision-language pretraining remain untestable. The authors therefore propose controlled variations in evaluation that allow each component to be measured separately.

Core claim

By decomposing VLA policies into semantic mapping and physical action decision, the paper shows that task success rate cannot distinguish between these two sources of capability, so benchmark improvements remain consistent with semantic matching, distributional overlap, or genuine physical generalization.
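
The identifiability argument can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes, purely for illustration, that task success factorizes as the product of a semantic-mapping success probability and a physical-decision success probability. The class name, field names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HypotheticalPolicy:
    """Toy policy described by three made-up success probabilities."""
    name: str
    p_semantic: float          # chance the instruction and scene are mapped correctly
    p_physical_seen: float     # chance the action succeeds under training-like physics
    p_physical_shifted: float  # chance the action succeeds under altered physics

    def success_rate(self, shifted_physics: bool) -> float:
        # Illustrative assumption: the two stages succeed independently,
        # so task success is the product of the stage probabilities.
        p_phys = self.p_physical_shifted if shifted_physics else self.p_physical_seen
        return self.p_semantic * p_phys


matcher = HypotheticalPolicy("semantic matcher", 0.95, 0.90, 0.30)
reasoner = HypotheticalPolicy("physical reasoner", 0.95, 0.90, 0.80)

for policy in (matcher, reasoner):
    print(
        f"{policy.name}: benchmark={policy.success_rate(False):.3f}, "
        f"shifted physics={policy.success_rate(True):.3f}"
    )
```

Under these assumed numbers the two policies are indistinguishable on a benchmark that only samples training-like physics, and they separate only once the physical condition is varied with semantics held fixed. Real end-to-end policies need not factorize this cleanly, which is the separability concern raised in the referee report.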

What carries the argument

A decomposition of VLA policies into semantic mapping and physical action decision components, intended to isolate each component's contribution to task success without access to model internals.

If this is right

  • Benchmark improvements can arise from semantic matching or distributional overlap without any physical generalization.
  • Narrative drift across successive systems strengthens unverified claims about physical competence.
  • New evaluation protocols must introduce controlled physical variation while holding semantics fixed to attribute performance causally.
  • The role of VLM backbones can be assessed only as semantic interfaces, not as implicit sources of physical competence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same identifiability problem likely appears in other robotics transfer settings where pretraining data shares surface features with test tasks.
  • A direct test would fix object identities and language descriptions while varying only dynamics parameters and measure the resulting change in success.
  • Continued scaling of current VLA training may not close the physical reasoning gap until such separated measurements exist.
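
The direct test described in the list above could start from a set of matched task variants. The following is a minimal sketch of how such a set might be enumerated, assuming a simulator in which mass and friction are configurable. `TaskVariant`, `matched_variants`, and the example values are hypothetical names chosen for illustration, not part of the paper or of any benchmark API.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TaskVariant:
    instruction: str  # language description, held fixed
    object_id: str    # object identity and appearance, held fixed
    mass_kg: float    # dynamics parameter, varied
    friction: float   # dynamics parameter, varied


def matched_variants(instruction, object_id, masses, frictions):
    """Cross every mass with every friction value for one fixed semantic setup."""
    return [
        TaskVariant(instruction, object_id, m, mu)
        for m, mu in product(masses, frictions)
    ]


variants = matched_variants(
    instruction="pick up the red mug",
    object_id="red_mug_01",
    masses=[0.2, 0.5, 1.0],
    frictions=[0.3, 0.8],
)
print(len(variants), "variants sharing one instruction and one object")
```

Because every variant shares the same instruction and object, any difference in success across the grid cannot be attributed to a change in the semantic inputs.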

Load-bearing premise

That evaluation designs can be changed to separate the contributions of semantic mapping from physical action decisions without access to model internals.

What would settle it

Run the same VLA policy on matched task variants that keep semantic elements identical but change physical parameters such as mass, friction, or gravity, then check whether success rate drops when only the physical component is altered.
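
One way to carry out the check is to log episodes per physical condition and compare success rates against the nominal condition. The sketch below is illustrative only: the episode logs are invented, the condition labels are hypothetical, and the interval uses a simple normal approximation for the difference of two proportions rather than any procedure specified by the authors.

```python
from collections import defaultdict
from math import sqrt


def success_by_condition(episodes):
    """Return {condition: (success_rate, n_episodes)} from (condition, success) pairs."""
    counts = defaultdict(lambda: [0, 0])  # condition -> [successes, trials]
    for condition, success in episodes:
        counts[condition][0] += int(success)
        counts[condition][1] += 1
    return {c: (s / n, n) for c, (s, n) in counts.items()}


def drop_from_baseline(rates, baseline, condition):
    """Success-rate drop relative to baseline, with a normal-approximation standard error."""
    (p0, n0), (p1, n1) = rates[baseline], rates[condition]
    se = sqrt(p0 * (1 - p0) / n0 + p1 * (1 - p1) / n1)
    return p0 - p1, se


# Made-up logs: identical instruction and objects, only the physics label differs.
episodes = (
    [("nominal", True)] * 45 + [("nominal", False)] * 5
    + [("low_friction", True)] * 20 + [("low_friction", False)] * 30
)
rates = success_by_condition(episodes)
drop, se = drop_from_baseline(rates, "nominal", "low_friction")
print(f"drop={drop:.2f} +/- {1.96 * se:.2f} (approx. 95% interval)")
```

A drop that is large relative to its interval, with semantics held fixed, would suggest that success in the nominal condition did not rest on physical generalization. A negligible drop would be consistent with it, though not proof of it.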

Figures

Figures reproduced from arXiv: 2606.30686 by Huaming Chen, Ian Manchester, Taozhao Chen.

Figure 1: VLM-backbone VLA systems (2023–2026) organized into three architectural lineages.

Abstract

Vision-Language-Action (VLA) systems, built on pretrained vision-language models (VLMs), have shown rapidly improving performance on robot manipulation benchmarks. These gains are commonly interpreted as evidence that semantic representations learned from internet-scale data transfer to physical execution generalization. This position paper argues that the assumption underlying this interpretation -- that semantic generalization is sufficient to support physical action decisions -- has not been independently verified and cannot be tested under current evaluation protocols. We support this claim by decomposing VLA policies into semantic mapping and physical action decision, and showing that task success rate -- the dominant evaluation metric -- cannot distinguish between these two sources of capability. As a result, improvements in benchmark performance are consistent with multiple competing explanations, including semantic matching, distributional overlap, and genuine physical generalization. We further argue that this identifiability gap has been reinforced through narrative drift, whereby successive systems inherit and strengthen prior interpretations of performance gains without isolating the underlying causal mechanism. To address this limitation, we propose a research direction based on evaluation designs that introduce controlled variation to separately measure semantic and physical generalization. Such designs make it possible to causally attribute performance without requiring access to model internals, and to empirically assess the role of VLM backbones as semantic interfaces rather than implicit sources of physical competence. Our goal is not to refute the role of VLMs in robotics, but to clarify the conditions under which claims of physical generalization can be meaningfully evaluated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This position paper claims that Vision-Language-Action (VLA) models built on pretrained VLMs cannot be verified to perform physical reasoning under existing evaluation protocols. It decomposes VLA policies into semantic mapping (from VLM pretraining) and physical action decision components, argues that task success rate—the dominant metric—cannot distinguish between them, attributes this to an identifiability gap reinforced by narrative drift in the literature, and proposes controlled-variation evaluation designs to separately measure semantic and physical generalization without requiring model internals.

Significance. If the core argument holds, the result would imply that benchmark gains in robotic manipulation cannot be causally attributed to physical generalization from VLM backbones, requiring the field to adopt evaluation protocols that isolate capability sources. The manuscript receives credit for its explicit conceptual decomposition and for framing the problem as one of metric ambiguity rather than outright refutation of VLM utility in robotics.

major comments (2)
  1. [Abstract] The claim that task success rate 'cannot distinguish between these two sources of capability' rests on the premise that semantic mapping and physical action decision are separable components whose contributions can be isolated via controlled variations in evaluation design alone. The manuscript provides no formal argument or example showing that such variations succeed in disentangling the factors when representations are learned end-to-end, leaving the central identifiability conclusion dependent on an unverified separability assumption.
  2. [Abstract, decomposition paragraph] The argument that improvements are 'consistent with multiple competing explanations, including semantic matching, distributional overlap, and genuine physical generalization' is presented without demonstrating that the proposed controlled-variation designs can rule out the first two explanations in practice; the absence of even a schematic example or proof sketch makes the proposed research direction non-operational as stated.
minor comments (1)
  1. The manuscript would benefit from an explicit statement of the minimal assumptions required for the controlled-variation approach to succeed (e.g., that performance deltas remain interpretable even under partial entanglement).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and presentation of our position paper. We address each major comment below, noting that our argument is primarily conceptual and that we will make targeted revisions to strengthen the operational aspects of the proposed evaluation direction.

  1. Referee: [Abstract] The claim that task success rate 'cannot distinguish between these two sources of capability' rests on the premise that semantic mapping and physical action decision are separable components whose contributions can be isolated via controlled variations in evaluation design alone. The manuscript provides no formal argument or example showing that such variations succeed in disentangling the factors when representations are learned end-to-end, leaving the central identifiability conclusion dependent on an unverified separability assumption.

    Authors: We agree that the manuscript offers no formal proof of separability for end-to-end models; the decomposition is presented as an analytical framework to expose the identifiability gap rather than an assertion of modular independence. The core claim—that success rate alone is consistent with multiple explanations—follows from the structure of existing benchmarks and does not require proving strict separability. We will revise the abstract to emphasize the conceptual nature of the decomposition and include a short illustrative example of a controlled-variation protocol in the body of the paper. revision: partial

  2. Referee: [Abstract, decomposition paragraph] The argument that improvements are 'consistent with multiple competing explanations, including semantic matching, distributional overlap, and genuine physical generalization' is presented without demonstrating that the proposed controlled-variation designs can rule out the first two explanations in practice; the absence of even a schematic example or proof sketch makes the proposed research direction non-operational as stated.

    Authors: The consistency with multiple explanations is offered as a logical consequence of current protocols rather than an empirical demonstration that specific designs have already ruled out alternatives. The controlled-variation designs are proposed as a future research direction precisely to close this gap. We accept that the absence of a schematic example renders the direction less concrete and will add one in the revised manuscript to illustrate how such designs could isolate semantic versus physical contributions. revision: partial

Circularity Check

0 steps flagged

No significant circularity in conceptual position paper

The paper advances a position that task success rate cannot distinguish semantic mapping from physical action decision in VLA policies, supported by a proposed decomposition of policies and analysis of evaluation metrics. No equations, fitted parameters, or derivations appear in the provided text. The central claim follows from logical examination of identifiability under current protocols rather than any self-referential construction, self-citation chain, or renaming of known results. The argument is self-contained as a conceptual critique without reducing to its inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a domain assumption that policies decompose into separable semantic and physical components and that controlled evaluation variation can causally attribute performance; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: VLA policies can be decomposed into semantic mapping and physical action decision components whose contributions are separable via evaluation design
    Invoked to establish that task success rate cannot distinguish the sources of capability.



Reference graph

Works this paper leans on

78 extracted references · 35 canonical work pages · 20 internal anchors

  1. [1]

    Learning to poke by poking: Experiential learning of intuitive physics.Advances in neural information processing systems, 29, 2016

    Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics.Advances in neural information processing systems, 29, 2016

  2. [3]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kua...

  3. [4]

    Covla: Comprehensive vision-language-action dataset for autonomous driving

    Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

  4. [5]

    A survey of robot learning from demonstration.Robotics and autonomous systems, 57(5):469–483, 2009

    Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration.Robotics and autonomous systems, 57(5):469–483, 2009

  5. [6]

    Simulation as an engine of physical scene understanding.Proceedings of the national academy of sciences, 110(45):18327– 18332, 2013

    Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene understanding.Proceedings of the national academy of sciences, 110(45):18327– 18332, 2013

  6. [7]

    RT-H: Action Hierarchies Using Language

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language.arXiv preprint arXiv:2403.01823, 2024

  7. [8]

    Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, New York, NY , USA, 2021. Association for Computing Machinery

  8. [9]

    Trends and challenges in robot manipulation.Science, 364(6446):eaat8414, 2019

    Aude Billard and Danica Kragic. Trends and challenges in robot manipulation.Science, 364(6446):eaat8414, 2019

  9. [10]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  10. [11]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi_0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  11. [12]

    Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  12. [13]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

  13. [14]

    Spatialbot: Precise spatial understanding with vision language models

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9490–9498. IEEE, 2025

  14. [15]

    Fast-in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning.arXiv preprint arXiv:2506.01953, 2025

    Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, et al. Fast-in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning.arXiv preprint arXiv:2506.01953, 2025

  15. [16]

    A retrieval-augmented framework enabling vlm spatial awareness for object-centric robot manipulation.Science Robotics, 11(113):eaea2092, 2026

    Kai Chen, Chengkun Li, Chang Tu, Jiahui Pan, Yiyao Ma, Wei Chen, Zhongxiang Zhou, Xuecheng Xu, Stephen James, Chi-Wing Fu, Rong Xiong, Pieter Abbeel, Yun-Hui Liu, and Qi Dou. A retrieval-augmented framework enabling vlm spatial awareness for object-centric robot manipulation.Science Robotics, 11(113):eaea2092, 2026. 10

  16. [17]

    Un- derspecification presents challenges for credibility in modern machine learning.Journal of Machine Learning Research, 23(226):1–61, 2022

    Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D Hoffman, et al. Un- derspecification presents challenges for credibility in modern machine learning.Journal of Machine Learning Research, 23(226):1–61, 2022

  17. [18]

    RoboNet: Large-Scale Multi-Robot Learning

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2019

  18. [19]

    GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

    Shengliang Deng, Mi Yan, Songlin Wei, Haixin Ma, Yuxin Yang, Jiayi Chen, Zhiqi Zhang, Taoyu Yang, Xuheng Zhang, Wenhao Zhang, et al. Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data.arXiv preprint arXiv:2505.03233, 2025

  19. [20]

    Quar-vla: Vision-language-action model for quadruped robots

    Pengxiang Ding, Han Zhao, Wenjie Zhang, Wenxuan Song, Min Zhang, Siteng Huang, Ningxi Yang, and Donglin Wang. Quar-vla: Vision-language-action model for quadruped robots. In European Conference on Computer Vision, pages 352–367. Springer, 2024

  20. [21]

    Fast ecot: Efficient embodied chain-of-thought via thoughts reuse.arXiv preprint arXiv:2506.07639, 2025

    Zhekai Duan, Yuan Zhang, Shikai Geng, Gaowen Liu, Joschka Boedecker, and Chris Xiaox- uan Lu. Fast ecot: Efficient embodied chain-of-thought via thoughts reuse.arXiv preprint arXiv:2506.07639, 2025

  21. [22]

    LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

    Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness analysis of vision-language- action models.arXiv preprint arXiv:2510.13626, 2025

  22. [23]

    Foundation models in robotics: Applications, challenges, and the future.The International Journal of Robotics Research, 44(5):701–739, 2025

    Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. Foundation models in robotics: Applications, challenges, and the future.The International Journal of Robotics Research, 44(5):701–739, 2025

  23. [24]

    Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

  24. [25]

    The perception of the visual world

    James J Gibson. The perception of the visual world. 1950

  25. [26]

    Psychology press, 2014

    James J Gibson.The ecological approach to visual perception: classic edition. Psychology press, 2014

  26. [27]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409, 2017

  27. [28]

    Toward general-purpose robots via foundation models: A survey and meta-analysis.arXiv preprint arXiv:2312.08782, 2023

    Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, et al. Toward general-purpose robots via foundation models: A survey and meta-analysis.arXiv preprint arXiv:2312.08782, 2023

  28. [29]

    Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

    Suning Huang, Jiaqi Shao, Ke Wang, Qianzhong Chen, Jiankai Sun, Yanjiang Guo, Mac Schwager, and Jeannette Bohg. Breaking lock-in: Preserving steerability under low-data vla post-training.arXiv preprint arXiv:2604.23121, 2026

  29. [30]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models.arXiv preprint arXiv:2207.05608, 2022

  30. [31]

    The Platonic Representation Hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis.arXiv preprint arXiv:2405.07987, 2024

  31. [32]

    Scalable deep reinforcement learning for vision-based robotic manipulation

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. InConference on robot learning, pages 651–673. PMLR, 2018. 11

  32. [33]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  33. [34]

    Vision-language- action models for robotics: A review towards real-world applications.IEEE Access, 2025

    Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language- action models for robotics: A review towards real-world applications.IEEE Access, 2025

  34. [35]

    Dynabench: Rethinking benchmarking in nlp

    Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, et al. Dynabench: Rethinking benchmarking in nlp. InProceedings of the 2021 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, pages 4110–4...

  35. [36]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  36. [37]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  37. [38]

    Building machines that learn and think like people.Behavioral and brain sciences, 40:e253, 2017

    Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people.Behavioral and brain sciences, 40:e253, 2017

  38. [39]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022

  39. [40]

    Learning physical intuition of block towers by example

    Adam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example. InInternational conference on machine learning, pages 430–438. PMLR, 2016

  40. [41]

    Do six-month-old infants perceive causality?Cognition, 25(3):265–288, 1987

    Alan M Leslie and Stephanie Keeble. Do six-month-old infants perceive causality?Cognition, 25(3):265–288, 1987

  41. [42]

    Pointvla: Injecting the 3d world into vision-language-action models.IEEE Robotics and Automation Letters, 11(3):2506–2513, 2026

    Chengmeng Li, Junjie Wen, Yaxin Peng, Yan Peng, and Yichen Zhu. Pointvla: Injecting the 3d world into vision-language-action models.IEEE Robotics and Automation Letters, 11(3):2506–2513, 2026

  42. [43]

    arXiv preprint arXiv:2506.19816 (2025)

    Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, et al. Cronusvla: Towards efficient and robust manipulation via multi-frame vision-language-action modeling.arXiv preprint arXiv:2506.19816, 2025

  43. [44]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  44. [45]

    Switchvla: Execution-aware task switching for vision-language-action models.arXiv preprint arXiv:2506.03574, 2025

    Meng Li, Zhen Zhao, Zhengping Che, Fei Liao, Kun Wu, Zhiyuan Xu, Pei Ren, Zhao Jin, Ning Liu, and Jian Tang. Switchvla: Execution-aware task switching for vision-language-action models.arXiv preprint arXiv:2506.03574, 2025

  45. [46]

    Semanticvla: Semantic-aligned sparsification and enhancement for efficient robotic manipulation

    Wei Li, Renshan Zhang, Rui Shao, Zhijian Fang, Kaiwen Zhou, Zhuotao Tian, and Liqiang Nie. Semanticvla: Semantic-aligned sparsification and enhancement for efficient robotic manipulation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18397–18405, 2026

  46. [47]

    Llara: Supercharging robot learning data for vision-language policy.arXiv preprint arXiv:2406.20095, 2024

    Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, et al. Llara: Supercharging robot learning data for vision-language policy.arXiv preprint arXiv:2406.20095, 2024

  47. [48]

    Onetwovla: A unified vision- language-action model with adaptive reasoning.arXiv preprint arXiv:2505.11917, 2025

    Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onet- wovla: A unified vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2505.11917, 2025

  48. [49]

    Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023. 12

  49. [50]

    Eva-vla: Evaluating vision-language-action models’ robustness under real-world physical variations

    Hanqing Liu, Shouwei Ruan, Jiahuan Long, Junqi Wu, Jiacheng Hou, Huili Tang, Tingsong Jiang, Weien Zhou, and Wen Yao. Eva-vla: Evaluating vision-language-action models’ robust- ness under real-world physical variations.arXiv preprint arXiv:2509.18953, 2025

  50. [51]

    Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

  51. [52]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523, 2024

  52. [53]

    Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours

    Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In2016 IEEE international conference on robotics and automation (ICRA), pages 3406–3413. IEEE, 2016

  53. [54]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  54. [55]

    Transfusion: Understanding transfer learning for medical imaging.Advances in neural information processing systems, 32, 2019

    Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. Transfusion: Understanding transfer learning for medical imaging.Advances in neural information processing systems, 32, 2019

  55. [56]

    Recent advances in robot learning from demonstration.Annual review of control, robotics, and autonomous systems, 3(1):297–330, 2020

    Harish Ravichandar, Athanasios S Polydoros, Sonia Chernova, and Aude Billard. Recent advances in robot learning from demonstration.Annual review of control, robotics, and autonomous systems, 3(1):297–330, 2020

  56. [57]

    Hidden technical debt in machine learning systems.Advances in neural information processing systems, 28, 2015

    David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems.Advances in neural information processing systems, 28, 2015

  57. [58]

    Hume: Introducing system-2 thinking in visual-language-action model.arXiv preprint arXiv:2505.21432,

    Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, et al. Hume: Introducing system-2 thinking in visual-language- action model.arXiv preprint arXiv:2505.21432, 2025

  58. [59]

    Reconvla: Reconstructive vision-language-action model as effective robot perceiver

    Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18549–18557, 2026

  59. [60]

    Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  60. [61]

    Understanding the capabilities, limitations, and societal impact of large language models

    Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli. Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503, 2021

  61. [62]

    VLA Knows Its Limits: Adaptive Execution Horizons for Robot Policies

    Haoxuan Wang, Gengyu Zhang, Yan Yan, Ramana Rao Kompella, and Gaowen Liu. Vla knows its limits. arXiv preprint arXiv:2602.21445, 2026

  62. [63]

    Latentvla: Taming latent space for generalizable and long-horizon bimanual manipulation

    Junming Wang. Latentvla: Taming latent space for generalizable and long-horizon bimanual manipulation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22):18593–18601, Mar. 2026

  63. [64]

    Programmatically grounded, compositionally generalizable robotic manipulation

    Renhao Wang, Jiayuan Mao, Joy Hsu, Hang Zhao, Jiajun Wu, and Yang Gao. Programmatically grounded, compositionally generalizable robotic manipulation. arXiv preprint arXiv:2304.13826, 2023

  64. [65]

    Vla-adapter: An effective paradigm for tiny-scale vision-language-action model

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. In Proceedings of the AAAI conference on artificial intelligence, volume 40, pages 18638–18646, 2026

  65. [66]

    Diffusion-vla: Generalizable and interpretable robot foundation model via self-generated reasoning

    Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, et al. Diffusion-vla: Generalizable and interpretable robot foundation model via self-generated reasoning. arXiv preprint arXiv:2412.03293, 2024

  66. [67]

    Galileo: Perceiving physical object properties by integrating a physics engine with deep learning

    Jiajun Wu, Ilker Yildirim, Joseph J Lim, Bill Freeman, and Josh Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. Advances in neural information processing systems, 28, 2015

  67. [68]

    Robust finetuning of vision-language-action robot policies via parameter merging

    Yajat Yadav, Zhiyuan Zhou, Andrew Wagenmaker, Karl Pertsch, and Sergey Levine. Robust finetuning of vision-language-action robot policies via parameter merging. arXiv preprint arXiv:2512.08333, 2025

  68. [69]

    A survey of imitation learning: Algorithms, recent developments, and challenges

    Maryam Zare, Parham M Kebria, Abbas Khosravi, and Saeid Nahavandi. A survey of imitation learning: Algorithms, recent developments, and challenges. IEEE Transactions on Cybernetics, 54(12):7173–7186, 2024

  69. [70]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024

  70. [71]

    From recognition to cognition: Visual commonsense reasoning

    Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6720–6731, 2019

  71. [72]

    Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

    Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022

  72. [73]

    Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation

    Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Dan Wang, Yuan Du, and Shanghang Zhang. Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18764–18772, 2026

  73. [74]

    Recognizing limits: Investigating infeasibility in large language models

    Wenbo Zhang, Zihang Xu, and Hengrui Cai. Recognizing limits: Investigating infeasibility in large language models. arXiv preprint arXiv:2408.05873, 2024

  74. [75]

    Dexgraspvla: A vision-language-action framework towards general dexterous grasping

    Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Nam Lui, Yuyao Ye, Yitao Liang, et al. Dexgraspvla: A vision-language-action framework towards general dexterous grasping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18836–18844, 2026

  75. [76]

    LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

    Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, and Lichao Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025

  76. [77]

    Chatvla: Unified multimodal understanding and robot control with vision-language-action model

    Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025

  77. [78]

    Objectvla: End-to-end open-world object manipulation without demonstration

    Minjie Zhu, Yichen Zhu, Jinming Li, Zhongyi Zhou, Junjie Wen, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, and Feifei Feng. Objectvla: End-to-end open-world object manipulation without demonstration. arXiv preprint arXiv:2502.19250, 2025

  78. [79]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023