Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning

Huaming Chen; Ian Manchester; Taozhao Chen

arxiv: 2606.30686 · v1 · pith:X4X2BQTHnew · submitted 2026-06-28 · 💻 cs.RO · cs.AI

Taozhao Chen, Ian Manchester, Huaming Chen

Pith reviewed 2026-07-01 06:53 UTC · model grok-4.3

classification: cs.RO, cs.AI

keywords: vision-language-action models, physical reasoning, task success rate, semantic mapping, evaluation protocols, robot manipulation, identifiability gap, generalization

The pith

Task success rates on robot benchmarks cannot verify whether vision-language-action models reason about physics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that gains on manipulation benchmarks do not establish physical reasoning because the main metric cannot separate semantic pattern matching from decisions that depend on physical laws. It decomposes VLA policies into a semantic mapping step and a physical action decision step, then shows that overall success stays high under either source of improvement. A sympathetic reader would care because this means reported progress could come from internet-scale data overlap rather than new competence in predicting object behavior under forces and constraints. Without designs that hold semantics fixed while varying physical conditions, claims about generalization from vision-language pretraining remain untestable. The authors therefore propose controlled variations in evaluation that allow separate measurement of each component.

Core claim

By decomposing VLA policies into semantic mapping and physical action decision, the paper shows that task success rate cannot distinguish between these two sources of capability, so benchmark improvements remain consistent with semantic matching, distributional overlap, or genuine physical generalization.

What carries the argument

Decomposition of VLA policies into semantic mapping and physical action decision components that isolates their separate contributions to task success without model internals.
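
To make the identifiability point concrete, here is a toy sketch in Python. It is not the paper's formalism: the multiplicative success model, the class `HypotheticalPolicy`, and the probabilities are all assumptions chosen purely for illustration. It shows two policies with opposite strengths yielding the same expected success rate, so the aggregate number alone cannot say which component is responsible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HypotheticalPolicy:
    """Illustrative stand-in for a VLA policy; not an API from the paper."""
    name: str
    p_semantic: float  # chance the instruction and scene are mapped to the right goal
    p_physical: float  # chance the action is physically adequate, given the right goal


def expected_success(policy: HypotheticalPolicy) -> float:
    # Toy assumption: an episode succeeds only if both stages succeed,
    # and the two stages fail independently of each other.
    return policy.p_semantic * policy.p_physical


# Made-up numbers: one policy is stronger semantically, the other physically.
semantic_heavy = HypotheticalPolicy("semantic-heavy", p_semantic=0.9, p_physical=0.8)
physics_heavy = HypotheticalPolicy("physics-heavy", p_semantic=0.8, p_physical=0.9)

for policy in (semantic_heavy, physics_heavy):
    print(policy.name, round(expected_success(policy), 2))
```

Both policies report the same expected success under this toy model even though their capabilities differ. Real end-to-end policies need not factor this cleanly, which is the separability concern raised later in the referee report.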

If this is right

Benchmark improvements can arise from semantic matching or distributional overlap without any physical generalization.
Narrative drift across successive systems strengthens unverified claims about physical competence.
New evaluation protocols must introduce controlled physical variation while holding semantics fixed to attribute performance causally.
The role of VLM backbones can be assessed only as semantic interfaces, not as implicit sources of physical competence.
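
One way the controlled-variation idea in the list above might be organized is a small factorial layout, in which each evaluation episode is tagged with a semantic condition and a physical condition. The sketch below is an assumption about how such bookkeeping could look, not a protocol from the paper: the labels `seen` and `novel`, the function names, and the episode records are hypothetical placeholders rather than measured results.

```python
from collections import defaultdict


def cell_success_rates(episodes):
    """Success rate per (semantic_condition, physical_condition) cell.

    episodes: iterable of (semantic_condition, physical_condition, succeeded).
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [successes, trials]
    for semantic, physical, succeeded in episodes:
        counts[(semantic, physical)][0] += int(succeeded)
        counts[(semantic, physical)][1] += 1
    return {cell: wins / trials for cell, (wins, trials) in counts.items()}


def attribution(rates):
    """Drop in success when exactly one factor is changed from the baseline."""
    baseline = rates[("seen", "seen")]
    return {
        "semantic_drop": baseline - rates[("novel", "seen")],
        "physical_drop": baseline - rates[("seen", "novel")],
    }


# Placeholder records (not real data): four episodes per cell.
placeholder_episodes = (
    [("seen", "seen", True)] * 4
    + [("novel", "seen", True)] * 3 + [("novel", "seen", False)]
    + [("seen", "novel", True)] + [("seen", "novel", False)] * 3
)

rates = cell_success_rates(placeholder_episodes)
print(attribution(rates))
```

Holding one factor at its baseline while changing the other is what lets each drop be read as belonging to a single component. Whether that reading survives entangled internal representations is an open question the sketch does not address.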

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same identifiability problem likely appears in other robotics transfer settings where pretraining data shares surface features with test tasks.
A direct test would fix object identities and language descriptions while varying only dynamics parameters and measure the resulting change in success.
Continued scaling of current VLA training may not close the physical reasoning gap until such separated measurements exist.

Load-bearing premise

That evaluation designs can be changed to separate the contributions of semantic mapping from physical action decisions without access to model internals.

What would settle it

Run the same VLA policy on matched task variants that keep semantic elements identical but change physical parameters such as mass, friction, or gravity, then check whether success rate drops when only the physical component is altered.
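
The check described above amounts to comparing two success rates and asking whether the drop exceeds sampling noise. A minimal sketch follows, using a one-sided two-proportion z-test built on the Python standard library. The function name, the choice of a normal approximation, and the episode counts in the usage lines are all assumptions; the paper does not prescribe a particular statistical test.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """One-sided test that variant A's success rate exceeds variant B's.

    Returns (z, p_value). A is the baseline variant; B is the variant with
    altered physical parameters and unchanged semantics.
    """
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0:
        # Every episode succeeded or every episode failed: no evidence of a drop.
        return 0.0, 0.5
    z = (rate_a - rate_b) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal
    return z, p_value


# Placeholder counts (not real data): 45/50 on baseline, 30/50 under altered physics.
z, p_value = two_proportion_z_test(45, 50, 30, 50)
print(f"z = {z:.2f}, one-sided p = {p_value:.4f}")
```

A small p-value would indicate that success depends on the altered physical parameters. It would not by itself show how the policy uses them, so the test bounds the claim rather than settling the mechanism.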

Figures

Figures reproduced from arXiv: 2606.30686 by Huaming Chen, Ian Manchester, Taozhao Chen.

Original abstract

Vision-Language-Action (VLA) systems, built on pretrained vision-language models (VLMs), have shown rapidly improving performance on robot manipulation benchmarks. These gains are commonly interpreted as evidence that semantic representations learned from internet-scale data transfer to physical execution generalization. This position paper argues that the assumption underlying this interpretation -- that semantic generalization is sufficient to support physical action decisions -- has not been independently verified and cannot be tested under current evaluation protocols. We support this claim by decomposing VLA policies into semantic mapping and physical action decision, and showing that task success rate -- the dominant evaluation metric -- cannot distinguish between these two sources of capability. As a result, improvements in benchmark performance are consistent with multiple competing explanations, including semantic matching, distributional overlap, and genuine physical generalization. We further argue that this identifiability gap has been reinforced through narrative drift, whereby successive systems inherit and strengthen prior interpretations of performance gains without isolating the underlying causal mechanism. To address this limitation, we propose a research direction based on evaluation designs that introduce controlled variation to separately measure semantic and physical generalization. Such designs make it possible to causally attribute performance without requiring access to model internals, and to empirically assess the role of VLM backbones as semantic interfaces rather than implicit sources of physical competence. Our goal is not to refute the role of VLMs in robotics, but to clarify the conditions under which claims of physical generalization can be meaningfully evaluated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Task success rates on VLA benchmarks leave semantic mapping and physical decision-making confounded, so claims of physical generalization rest on untested assumptions.

The central point is that standard robot manipulation benchmarks do not let us separate whether VLA models succeed because they map language and images correctly or because they have learned genuine physical action rules. The paper decomposes the policy into those two pieces and shows that overall success rate is consistent with either story, plus distributional overlap. That framing is useful and matches how many VLA papers describe their gains.

What the paper does well is spell out the identifiability problem without overclaiming. It also notes how successive papers have carried forward the physical-generalization interpretation without new isolating evidence, which is a fair description of the citation pattern in the area.

The soft spot is that the argument stays conceptual. No concrete evaluation designs are worked through with actual tasks or data, so it is hard to judge whether controlled variations would cleanly separate the two components in practice. The stress-test concern lands here: if the model entangles semantic and physical factors inside its representations, performance deltas under changed conditions may still be ambiguous. The paper assumes the decomposition is identifiable from behavior alone, but does not show why that holds for typical end-to-end VLA policies.

This is for people who run VLA experiments or design benchmarks and want to tighten what their numbers actually mean. It is not a new result or method, but the measurement issue it flags affects how the community should read recent progress. The logic is internally consistent and the concern is real enough that the paper should go to peer review rather than be desk-rejected.

Referee Report

2 major / 1 minor

Summary. This position paper claims that Vision-Language-Action (VLA) models built on pretrained VLMs cannot be verified to perform physical reasoning under existing evaluation protocols. It decomposes VLA policies into semantic mapping (from VLM pretraining) and physical action decision components, argues that task success rate—the dominant metric—cannot distinguish between them, attributes this to an identifiability gap reinforced by narrative drift in the literature, and proposes controlled-variation evaluation designs to separately measure semantic and physical generalization without requiring model internals.

Significance. If the core argument holds, the result would imply that benchmark gains in robotic manipulation cannot be causally attributed to physical generalization from VLM backbones, requiring the field to adopt evaluation protocols that isolate capability sources. The manuscript receives credit for its explicit conceptual decomposition and for framing the problem as one of metric ambiguity rather than outright refutation of VLM utility in robotics.

major comments (2)

[Abstract] The claim that task success rate 'cannot distinguish between these two sources of capability' rests on the premise that semantic mapping and physical action decision are separable components whose contributions can be isolated via controlled variations in evaluation design alone. The manuscript provides no formal argument or example showing that such variations succeed in disentangling the factors when representations are learned end-to-end, leaving the central identifiability conclusion dependent on an unverified separability assumption.
[Abstract, decomposition paragraph] The argument that improvements are 'consistent with multiple competing explanations, including semantic matching, distributional overlap, and genuine physical generalization' is presented without demonstrating that the proposed controlled-variation designs can rule out the first two explanations in practice; the absence of even a schematic example or proof sketch makes the proposed research direction non-operational as stated.

minor comments (1)

The manuscript would benefit from an explicit statement of the minimal assumptions required for the controlled-variation approach to succeed (e.g., that performance deltas remain interpretable even under partial entanglement).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and presentation of our position paper. We address each major comment below, noting that our argument is primarily conceptual and that we will make targeted revisions to strengthen the operational aspects of the proposed evaluation direction.

Referee: [Abstract] The claim that task success rate 'cannot distinguish between these two sources of capability' rests on the premise that semantic mapping and physical action decision are separable components whose contributions can be isolated via controlled variations in evaluation design alone. The manuscript provides no formal argument or example showing that such variations succeed in disentangling the factors when representations are learned end-to-end, leaving the central identifiability conclusion dependent on an unverified separability assumption.

Authors: We agree that the manuscript offers no formal proof of separability for end-to-end models; the decomposition is presented as an analytical framework to expose the identifiability gap rather than an assertion of modular independence. The core claim, that success rate alone is consistent with multiple explanations, follows from the structure of existing benchmarks and does not require proving strict separability. We will revise the abstract to emphasize the conceptual nature of the decomposition and include a short illustrative example of a controlled-variation protocol in the body of the paper.
Revision: partial

Referee: [Abstract, decomposition paragraph] The argument that improvements are 'consistent with multiple competing explanations, including semantic matching, distributional overlap, and genuine physical generalization' is presented without demonstrating that the proposed controlled-variation designs can rule out the first two explanations in practice; the absence of even a schematic example or proof sketch makes the proposed research direction non-operational as stated.

Authors: The consistency with multiple explanations is offered as a logical consequence of current protocols rather than an empirical demonstration that specific designs have already ruled out alternatives. The controlled-variation designs are proposed as a future research direction precisely to close this gap. We accept that the absence of a schematic example renders the direction less concrete and will add one in the revised manuscript to illustrate how such designs could isolate semantic versus physical contributions.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity in conceptual position paper

The paper advances a position that task success rate cannot distinguish semantic mapping from physical action decision in VLA policies, supported by a proposed decomposition of policies and analysis of evaluation metrics. No equations, fitted parameters, or derivations appear in the provided text. The central claim follows from logical examination of identifiability under current protocols rather than any self-referential construction, self-citation chain, or renaming of known results. The argument is self-contained as a conceptual critique without reducing to its inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a domain assumption that policies decompose into separable semantic and physical components and that controlled evaluation variation can causally attribute performance; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: VLA policies can be decomposed into semantic mapping and physical action decision components whose contributions are separable via evaluation design.
Invoked to establish that task success rate cannot distinguish the sources of capability.

Reference graph

Works this paper leans on

78 extracted references · 35 canonical work pages · 20 internal anchors

[1]

Learning to poke by poking: Experiential learning of intuitive physics.Advances in neural information processing systems, 29, 2016

Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics.Advances in neural information processing systems, 29, 2016

2016
[3]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kua...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Covla: Comprehensive vision-language-action dataset for autonomous driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

1933
[5]

A survey of robot learning from demonstration.Robotics and autonomous systems, 57(5):469–483, 2009

Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration.Robotics and autonomous systems, 57(5):469–483, 2009

2009
[6]

Simulation as an engine of physical scene understanding.Proceedings of the national academy of sciences, 110(45):18327– 18332, 2013

Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene understanding.Proceedings of the national academy of sciences, 110(45):18327– 18332, 2013

2013
[7]

RT-H: Action Hierarchies Using Language

Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language.arXiv preprint arXiv:2403.01823, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, New York, NY , USA, 2021. Association for Computing Machinery

2021
[9]

Trends and challenges in robot manipulation.Science, 364(6446):eaat8414, 2019

Aude Billard and Danica Kragic. Trends and challenges in robot manipulation.Science, 364(6446):eaat8414, 2019

2019
[10]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi_0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

1901
[13]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Spatialbot: Precise spatial understanding with vision language models

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9490–9498. IEEE, 2025

2025
[15]

Fast-in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning.arXiv preprint arXiv:2506.01953, 2025

Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, et al. Fast-in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning.arXiv preprint arXiv:2506.01953, 2025

work page arXiv 2025
[16]

A retrieval-augmented framework enabling vlm spatial awareness for object-centric robot manipulation.Science Robotics, 11(113):eaea2092, 2026

Kai Chen, Chengkun Li, Chang Tu, Jiahui Pan, Yiyao Ma, Wei Chen, Zhongxiang Zhou, Xuecheng Xu, Stephen James, Chi-Wing Fu, Rong Xiong, Pieter Abbeel, Yun-Hui Liu, and Qi Dou. A retrieval-augmented framework enabling vlm spatial awareness for object-centric robot manipulation.Science Robotics, 11(113):eaea2092, 2026. 10

2026
[17]

Un- derspecification presents challenges for credibility in modern machine learning.Journal of Machine Learning Research, 23(226):1–61, 2022

Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D Hoffman, et al. Un- derspecification presents challenges for credibility in modern machine learning.Journal of Machine Learning Research, 23(226):1–61, 2022

2022
[18]

RoboNet: Large-Scale Multi-Robot Learning

Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[19]

GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

Shengliang Deng, Mi Yan, Songlin Wei, Haixin Ma, Yuxin Yang, Jiayi Chen, Zhiqi Zhang, Taoyu Yang, Xuheng Zhang, Wenhao Zhang, et al. Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data.arXiv preprint arXiv:2505.03233, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Quar-vla: Vision-language-action model for quadruped robots

Pengxiang Ding, Han Zhao, Wenjie Zhang, Wenxuan Song, Min Zhang, Siteng Huang, Ningxi Yang, and Donglin Wang. Quar-vla: Vision-language-action model for quadruped robots. In European Conference on Computer Vision, pages 352–367. Springer, 2024

2024
[21]

Fast ecot: Efficient embodied chain-of-thought via thoughts reuse.arXiv preprint arXiv:2506.07639, 2025

Zhekai Duan, Yuan Zhang, Shikai Geng, Gaowen Liu, Joschka Boedecker, and Chris Xiaox- uan Lu. Fast ecot: Efficient embodied chain-of-thought via thoughts reuse.arXiv preprint arXiv:2506.07639, 2025

work page arXiv 2025
[22]

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness analysis of vision-language- action models.arXiv preprint arXiv:2510.13626, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Foundation models in robotics: Applications, challenges, and the future.The International Journal of Robotics Research, 44(5):701–739, 2025

Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. Foundation models in robotics: Applications, challenges, and the future.The International Journal of Robotics Research, 44(5):701–739, 2025

2025
[24]

Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

2020
[25]

The perception of the visual world

James J Gibson. The perception of the visual world. 1950

1950
[26]

Psychology press, 2014

James J Gibson.The ecological approach to visual perception: classic edition. Psychology press, 2014

2014
[27]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Toward general-purpose robots via foundation models: A survey and meta-analysis.arXiv preprint arXiv:2312.08782, 2023

Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, et al. Toward general-purpose robots via foundation models: A survey and meta-analysis.arXiv preprint arXiv:2312.08782, 2023

work page arXiv 2023
[29]

Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

Suning Huang, Jiaqi Shao, Ke Wang, Qianzhong Chen, Jiankai Sun, Yanjiang Guo, Mac Schwager, and Jeannette Bohg. Breaking lock-in: Preserving steerability under low-data vla post-training.arXiv preprint arXiv:2604.23121, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Inner Monologue: Embodied Reasoning through Planning with Language Models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models.arXiv preprint arXiv:2207.05608, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

The Platonic Representation Hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis.arXiv preprint arXiv:2405.07987, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Scalable deep reinforcement learning for vision-based robotic manipulation

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. InConference on robot learning, pages 651–673. PMLR, 2018. 11

2018
[33]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[34]

Vision-language- action models for robotics: A review towards real-world applications.IEEE Access, 2025

Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language- action models for robotics: A review towards real-world applications.IEEE Access, 2025

2025
[35]

Dynabench: Rethinking benchmarking in nlp

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, et al. Dynabench: Rethinking benchmarking in nlp. InProceedings of the 2021 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, pages 4110–4...

2021
[36]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Building machines that learn and think like people.Behavioral and brain sciences, 40:e253, 2017

Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people.Behavioral and brain sciences, 40:e253, 2017

2017
[39]

A path towards autonomous machine intelligence version 0.9

Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022

2022
[40]

Learning physical intuition of block towers by example

Adam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example. InInternational conference on machine learning, pages 430–438. PMLR, 2016

2016
[41]

Do six-month-old infants perceive causality?Cognition, 25(3):265–288, 1987

Alan M Leslie and Stephanie Keeble. Do six-month-old infants perceive causality?Cognition, 25(3):265–288, 1987

1987
[42]

Pointvla: Injecting the 3d world into vision-language-action models.IEEE Robotics and Automation Letters, 11(3):2506–2513, 2026

Chengmeng Li, Junjie Wen, Yaxin Peng, Yan Peng, and Yichen Zhu. Pointvla: Injecting the 3d world into vision-language-action models.IEEE Robotics and Automation Letters, 11(3):2506–2513, 2026

2026
[43]

arXiv preprint arXiv:2506.19816 (2025)

Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, et al. Cronusvla: Towards efficient and robust manipulation via multi-frame vision-language-action modeling.arXiv preprint arXiv:2506.19816, 2025

work page arXiv 2025
[44]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

2023
[45]

Switchvla: Execution-aware task switching for vision-language-action models.arXiv preprint arXiv:2506.03574, 2025

Meng Li, Zhen Zhao, Zhengping Che, Fei Liao, Kun Wu, Zhiyuan Xu, Pei Ren, Zhao Jin, Ning Liu, and Jian Tang. Switchvla: Execution-aware task switching for vision-language-action models.arXiv preprint arXiv:2506.03574, 2025

work page arXiv 2025
[46]

Semanticvla: Semantic-aligned sparsification and enhancement for efficient robotic manipulation

Wei Li, Renshan Zhang, Rui Shao, Zhijian Fang, Kaiwen Zhou, Zhuotao Tian, and Liqiang Nie. Semanticvla: Semantic-aligned sparsification and enhancement for efficient robotic manipulation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18397–18405, 2026

2026
[47]

Llara: Supercharging robot learning data for vision-language policy.arXiv preprint arXiv:2406.20095, 2024

Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, et al. Llara: Supercharging robot learning data for vision-language policy.arXiv preprint arXiv:2406.20095, 2024

work page arXiv 2024
[48]

Onetwovla: A unified vision- language-action model with adaptive reasoning.arXiv preprint arXiv:2505.11917, 2025

Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onet- wovla: A unified vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2505.11917, 2025

work page arXiv 2025
[49]

Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023. 12

2023
[50]

Eva-vla: Evaluating vision-language-action models’ robustness under real-world physical variations

Hanqing Liu, Shouwei Ruan, Jiahuan Long, Junqi Wu, Jiacheng Hou, Huili Tang, Tingsong Jiang, Weien Zhou, and Wen Yao. Eva-vla: Evaluating vision-language-action models’ robust- ness under real-world physical variations.arXiv preprint arXiv:2509.18953, 2025

work page arXiv 2025
[51]

Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

2022
[52]

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024
[53]

Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE international conference on robotics and automation (ICRA), pages 3406–3413. IEEE, 2016
[54]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021
[55]

Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. Transfusion: Understanding transfer learning for medical imaging. Advances in neural information processing systems, 32, 2019
[56]

Harish Ravichandar, Athanasios S Polydoros, Sonia Chernova, and Aude Billard. Recent advances in robot learning from demonstration. Annual review of control, robotics, and autonomous systems, 3(1):297–330, 2020
[57]

David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. Advances in neural information processing systems, 28, 2015
[58]

Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, et al. Hume: Introducing system-2 thinking in visual-language-action model. arXiv preprint arXiv:2505.21432, 2025
[59]

Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18549–18557, 2026
[60]

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998
[61]

Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli. Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503, 2021
[62]

Haoxuan Wang, Gengyu Zhang, Yan Yan, Ramana Rao Kompella, and Gaowen Liu. VLA knows its limits: Adaptive execution horizons for robot policies. arXiv preprint arXiv:2602.21445, 2026
[63]

Junming Wang. Latentvla: Taming latent space for generalizable and long-horizon bimanual manipulation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22):18593–18601, Mar. 2026
[64]

Renhao Wang, Jiayuan Mao, Joy Hsu, Hang Zhao, Jiajun Wu, and Yang Gao. Programmatically grounded, compositionally generalizable robotic manipulation. arXiv preprint arXiv:2304.13826, 2023
[65]

Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. In Proceedings of the AAAI conference on artificial intelligence, volume 40, pages 18638–18646, 2026
[66]

Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, et al. Diffusion-vla: Generalizable and interpretable robot foundation model via self-generated reasoning. arXiv preprint arXiv:2412.03293, 2024
[67]

Jiajun Wu, Ilker Yildirim, Joseph J Lim, Bill Freeman, and Josh Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. Advances in neural information processing systems, 28, 2015
[68]

Yajat Yadav, Zhiyuan Zhou, Andrew Wagenmaker, Karl Pertsch, and Sergey Levine. Robust finetuning of vision-language-action robot policies via parameter merging. arXiv preprint arXiv:2512.08333, 2025
[69]

Maryam Zare, Parham M Kebria, Abbas Khosravi, and Saeid Nahavandi. A survey of imitation learning: Algorithms, recent developments, and challenges. IEEE Transactions on Cybernetics, 54(12):7173–7186, 2024
[70]

Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024
[71]

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6720–6731, 2019
[72]

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022
[73]

Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Dan Wang, Yuan Du, and Shanghang Zhang. Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18764–18772, 2026
[74]

Wenbo Zhang, Zihang Xu, and Hengrui Cai. Recognizing limits: Investigating infeasibility in large language models. arXiv preprint arXiv:2408.05873, 2024
[75]

Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Nam Lui, Yuyao Ye, Yitao Liang, et al. Dexgraspvla: A vision-language-action framework towards general dexterous grasping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18836–18844, 2026
[76]

Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, and Lichao Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025
[77]

Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025
[78]

Minjie Zhu, Yichen Zhu, Jinming Li, Zhongyi Zhou, Junjie Wen, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, and Feifei Feng. Objectvla: End-to-end open-world object manipulation without demonstration. arXiv preprint arXiv:2502.19250, 2025
[79]

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

[46] [47]

Llara: Supercharging robot learning data for vision-language policy.arXiv preprint arXiv:2406.20095, 2024

Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, et al. Llara: Supercharging robot learning data for vision-language policy.arXiv preprint arXiv:2406.20095, 2024

work page arXiv 2024

[47] [48]

Onetwovla: A unified vision- language-action model with adaptive reasoning.arXiv preprint arXiv:2505.11917, 2025

Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onet- wovla: A unified vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2505.11917, 2025

work page arXiv 2025

[48] [49]

Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023. 12

2023

[49] [50]

Eva-vla: Evaluating vision-language-action models’ robustness under real-world physical variations

Hanqing Liu, Shouwei Ruan, Jiahuan Long, Junqi Wu, Jiacheng Hou, Huili Tang, Tingsong Jiang, Weien Zhou, and Wen Yao. Eva-vla: Evaluating vision-language-action models’ robust- ness under real-world physical variations.arXiv preprint arXiv:2509.18953, 2025

work page arXiv 2025

[50] [51]

Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

2022

[51] [52]

RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [53]

Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours

Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In2016 IEEE international conference on robotics and automation (ICRA), pages 3406–3413. IEEE, 2016

2016

[53] [54]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021

[54] [55]

Transfusion: Understanding transfer learning for medical imaging.Advances in neural information processing systems, 32, 2019

Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. Transfusion: Understanding transfer learning for medical imaging.Advances in neural information processing systems, 32, 2019

2019

[55] [56]

Recent advances in robot learning from demonstration.Annual review of control, robotics, and autonomous systems, 3(1):297–330, 2020

Harish Ravichandar, Athanasios S Polydoros, Sonia Chernova, and Aude Billard. Recent advances in robot learning from demonstration.Annual review of control, robotics, and autonomous systems, 3(1):297–330, 2020

2020

[56] [57]

Hidden technical debt in machine learning systems.Advances in neural information processing systems, 28, 2015

David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems.Advances in neural information processing systems, 28, 2015

2015

[57] [58]

Hume: Introducing system-2 thinking in visual-language-action model.arXiv preprint arXiv:2505.21432,

Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, et al. Hume: Introducing system-2 thinking in visual-language- action model.arXiv preprint arXiv:2505.21432, 2025

work page arXiv 2025

[58] [59]

Reconvla: Reconstructive vision- language-action model as effective robot perceiver

Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. Reconvla: Reconstructive vision- language-action model as effective robot perceiver. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18549–18557, 2026

2026

[59] [60]

MIT press Cambridge, 1998

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

1998

[60] [61]

Understanding the capabilities, limitations, and societal impact of large language models.arXiv preprint arXiv:2102.02503, 2021

Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli. Understanding the capabilities, limitations, and societal impact of large language models.arXiv preprint arXiv:2102.02503, 2021

work page arXiv 2021

[61] [62]

VLA Knows Its Limits: Adaptive Execution Horizons for Robot Policies

Haoxuan Wang, Gengyu Zhang, Yan Yan, Ramana Rao Kompella, and Gaowen Liu. Vla knows its limits.arXiv preprint arXiv:2602.21445, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[62] [63]

Latentvla: Taming latent space for generalizable and long-horizon bimanual manipulation.Proceedings of the AAAI Conference on Artificial Intelligence, 40(22):18593– 18601, Mar

Junming Wang. Latentvla: Taming latent space for generalizable and long-horizon bimanual manipulation.Proceedings of the AAAI Conference on Artificial Intelligence, 40(22):18593– 18601, Mar. 2026

2026

[63] [64]

Program- matically grounded, compositionally generalizable robotic manipulation.arXiv preprint arXiv:2304.13826, 2023

Renhao Wang, Jiayuan Mao, Joy Hsu, Hang Zhao, Jiajun Wu, and Yang Gao. Program- matically grounded, compositionally generalizable robotic manipulation.arXiv preprint arXiv:2304.13826, 2023

work page arXiv 2023

[64] [65]

Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. VLA-Adapter: An effective paradigm for tiny-scale vision-language-action model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18638–18646, 2026.

[65] [66]

Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, et al. Diffusion-VLA: Generalizable and interpretable robot foundation model via self-generated reasoning. arXiv preprint arXiv:2412.03293, 2024.

[66] [67]

Jiajun Wu, Ilker Yildirim, Joseph J Lim, Bill Freeman, and Josh Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. Advances in Neural Information Processing Systems, 28, 2015.

[67] [68]

Yajat Yadav, Zhiyuan Zhou, Andrew Wagenmaker, Karl Pertsch, and Sergey Levine. Robust finetuning of vision-language-action robot policies via parameter merging. arXiv preprint arXiv:2512.08333, 2025.

[68] [69]

Maryam Zare, Parham M Kebria, Abbas Khosravi, and Saeid Nahavandi. A survey of imitation learning: Algorithms, recent developments, and challenges. IEEE Transactions on Cybernetics, 54(12):7173–7186, 2024.

[69] [70]

Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024.

[70] [71]

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6720–6731, 2019.

[71] [72]

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.

[72] [73]

Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Dan Wang, Yuan Du, and Shanghang Zhang. MoLe-VLA: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18764–18772, 2026.

[73] [74]

Wenbo Zhang, Zihang Xu, and Hengrui Cai. Recognizing limits: Investigating infeasibility in large language models. arXiv preprint arXiv:2408.05873, 2024.

[74] [75]

Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Nam Lui, Yuyao Ye, Yitao Liang, et al. DexGraspVLA: A vision-language-action framework towards general dexterous grasping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18836–18844, 2026.

[75] [76]

Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, and Lichao Sun. LIBERO-PRO: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025.

[76] [77]

Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. ChatVLA: Unified multimodal understanding and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025.

[77] [78]

Minjie Zhu, Yichen Zhu, Jinming Li, Zhongyi Zhou, Junjie Wen, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, and Feifei Feng. ObjectVLA: End-to-end open-world object manipulation without demonstration. arXiv preprint arXiv:2502.19250, 2025.

[78] [79]

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.