Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning
Pith reviewed 2026-07-01 06:53 UTC · model grok-4.3
The pith
Task success rates on robot benchmarks cannot verify whether vision-language-action models reason about physics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decomposing VLA policies into semantic mapping and physical action decision, the paper shows that task success rate cannot distinguish between these two sources of capability, so benchmark improvements remain consistent with semantic matching, distributional overlap, or genuine physical generalization.
What carries the argument
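A decomposition of VLA policies into a semantic-mapping component and a physical-action-decision component. The paper uses it to argue that task success rate conflates the two contributions, and that evaluation design alone, without access to model internals, could separate them.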
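The identifiability point can be made concrete with a toy sketch. Everything below is hypothetical and not taken from the paper: the two policies, the object set, and the friction values are illustrative assumptions chosen only to show how a benchmark with fixed physics can give identical success rates to a policy that matches semantics and one that adapts to physics.

```python
# Toy illustration only: both "policies" are hypothetical stand-ins, not real VLA models.
NOMINAL_FRICTION = 0.5            # assumed friction in every training demonstration
KNOWN_OBJECTS = {"mug", "block"}  # assumed objects seen during training


def semantic_matcher(obj, friction):
    """Replays a memorized trajectory for a recognized object.

    The replay is assumed to work only at the friction it was recorded under.
    """
    return obj in KNOWN_OBJECTS and abs(friction - NOMINAL_FRICTION) < 1e-9


def physical_reasoner(obj, friction):
    """Recognizes the object and adapts its grip to the friction.

    It is assumed to succeed anywhere in a graspable friction range.
    """
    return obj in KNOWN_OBJECTS and 0.2 <= friction <= 0.8


def success_rate(policy, episodes):
    """Fraction of (object, friction) episodes on which the policy succeeds."""
    return sum(policy(obj, mu) for obj, mu in episodes) / len(episodes)


# Standard benchmark: familiar objects, physics fixed at the training value.
benchmark = [(obj, NOMINAL_FRICTION) for obj in ("mug", "block")] * 50
# Controlled variant: same objects (semantics fixed), friction changed.
shifted = [(obj, 0.3) for obj in ("mug", "block")] * 50

for name, policy in [("semantic_matcher", semantic_matcher),
                     ("physical_reasoner", physical_reasoner)]:
    print(name, success_rate(policy, benchmark), success_rate(policy, shifted))
```

On the fixed-physics benchmark both toy policies score 1.0, so success rate cannot tell them apart. Only the friction-shifted variant separates them (0.0 versus 1.0).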
If this is right
- Benchmark improvements can arise from semantic matching or distributional overlap without any physical generalization.
- Narrative drift across successive systems strengthens unverified claims about physical competence.
- New evaluation protocols must introduce controlled physical variation while holding semantics fixed to attribute performance causally.
- The role of VLM backbones can be assessed only as semantic interfaces, not as implicit sources of physical competence.
Where Pith is reading between the lines
- The same identifiability problem likely appears in other robotics transfer settings where pretraining data shares surface features with test tasks.
- A direct test would fix object identities and language descriptions while varying only dynamics parameters and measure the resulting change in success.
- Continued scaling of current VLA training may not close the physical reasoning gap until such separated measurements exist.
Load-bearing premise
Evaluation designs can be changed to separate the contribution of semantic mapping from that of physical action decisions, without access to model internals.
What would settle it
Run the same VLA policy on matched task variants that keep semantic elements identical but change physical parameters such as mass, friction, or gravity, then check whether success rate drops when only the physical component is altered.
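A minimal sketch of how such a matched-variant comparison might be scored, assuming success is recorded as a binary outcome per rollout. The function name `success_drop` and the counts in the usage lines are placeholders for illustration, not results from the paper or from any real system. The interval is a plain Wald approximation, which is only one of several reasonable choices.

```python
import math


def success_drop(base_successes, base_trials, var_successes, var_trials, z=1.96):
    """Difference in success rate (baseline minus physical variant).

    Returns the point estimate and an approximate 95% Wald interval.
    Semantics (objects, instructions) are assumed identical across both sets
    of rollouts, so only the physical parameters differ.
    """
    p_base = base_successes / base_trials
    p_var = var_successes / var_trials
    diff = p_base - p_var
    std_err = math.sqrt(
        p_base * (1 - p_base) / base_trials + p_var * (1 - p_var) / var_trials
    )
    return diff, (diff - z * std_err, diff + z * std_err)


# Placeholder counts, chosen only to exercise the function.
diff, (low, high) = success_drop(45, 50, 30, 50)
print(f"drop = {diff:.2f}, approx. 95% CI = ({low:.2f}, {high:.2f})")
```

An interval that excludes zero would indicate that success is sensitive to the physical change alone. With small rollout counts the Wald interval is coarse, so an exact or bootstrap interval may be preferable in practice.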
Original abstract
Vision-Language-Action (VLA) systems, built on pretrained vision-language models (VLMs), have shown rapidly improving performance on robot manipulation benchmarks. These gains are commonly interpreted as evidence that semantic representations learned from internet-scale data transfer to physical execution generalization. This position paper argues that the assumption underlying this interpretation -- that semantic generalization is sufficient to support physical action decisions -- has not been independently verified and cannot be tested under current evaluation protocols. We support this claim by decomposing VLA policies into semantic mapping and physical action decision, and showing that task success rate -- the dominant evaluation metric -- cannot distinguish between these two sources of capability. As a result, improvements in benchmark performance are consistent with multiple competing explanations, including semantic matching, distributional overlap, and genuine physical generalization. We further argue that this identifiability gap has been reinforced through narrative drift, whereby successive systems inherit and strengthen prior interpretations of performance gains without isolating the underlying causal mechanism. To address this limitation, we propose a research direction based on evaluation designs that introduce controlled variation to separately measure semantic and physical generalization. Such designs make it possible to causally attribute performance without requiring access to model internals, and to empirically assess the role of VLM backbones as semantic interfaces rather than implicit sources of physical competence. Our goal is not to refute the role of VLMs in robotics, but to clarify the conditions under which claims of physical generalization can be meaningfully evaluated.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper claims that Vision-Language-Action (VLA) models built on pretrained VLMs cannot be verified to perform physical reasoning under existing evaluation protocols. It decomposes VLA policies into semantic mapping (from VLM pretraining) and physical action decision components, argues that task success rate—the dominant metric—cannot distinguish between them, attributes this to an identifiability gap reinforced by narrative drift in the literature, and proposes controlled-variation evaluation designs to separately measure semantic and physical generalization without requiring model internals.
Significance. If the core argument holds, the result would imply that benchmark gains in robotic manipulation cannot be causally attributed to physical generalization from VLM backbones, requiring the field to adopt evaluation protocols that isolate capability sources. The manuscript receives credit for its explicit conceptual decomposition and for framing the problem as one of metric ambiguity rather than outright refutation of VLM utility in robotics.
major comments (2)
- [Abstract] Abstract: the claim that task success rate 'cannot distinguish between these two sources of capability' rests on the premise that semantic mapping and physical action decision are separable components whose contributions can be isolated via controlled variations in evaluation design alone. The manuscript provides no formal argument or example showing that such variations succeed in disentangling the factors when representations are learned end-to-end, leaving the central identifiability conclusion dependent on an unverified separability assumption.
- [Abstract] Abstract (decomposition paragraph): the argument that improvements are 'consistent with multiple competing explanations, including semantic matching, distributional overlap, and genuine physical generalization' is presented without demonstrating that the proposed controlled-variation designs can rule out the first two explanations in practice; the absence of even a schematic example or proof sketch makes the proposed research direction non-operational as stated.
minor comments (1)
- The manuscript would benefit from an explicit statement of the minimal assumptions required for the controlled-variation approach to succeed (e.g., that performance deltas remain interpretable even under partial entanglement).
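One way to make the "partial entanglement" concern concrete is a 2x2 factorial layout, in which semantic shift and physical shift are each either absent or present. The sketch below is an assumption about how such a design could be analyzed, not the authors' protocol. The cell values are placeholders. An interaction contrast near zero is what an additive, separable reading would predict, and a large one would signal that the two deltas cannot be read independently.

```python
def factorial_contrasts(cells):
    """Simple effects and interaction for a 2x2 evaluation design.

    `cells` maps (semantic_shift, physical_shift) -> success rate, where each
    flag is 0 (in-distribution) or 1 (shifted).
    """
    base = cells[(0, 0)]
    semantic_effect = cells[(1, 0)] - base   # semantics varied, physics fixed
    physical_effect = cells[(0, 1)] - base   # physics varied, semantics fixed
    # Interaction: how far the joint shift departs from the sum of the two
    # single-factor shifts.
    interaction = cells[(1, 1)] - cells[(1, 0)] - cells[(0, 1)] + base
    return semantic_effect, physical_effect, interaction


# Placeholder success rates, for illustration only.
cells = {(0, 0): 0.9, (1, 0): 0.7, (0, 1): 0.6, (1, 1): 0.4}
sem, phys, inter = factorial_contrasts(cells)
print(f"semantic: {sem:+.2f}, physical: {phys:+.2f}, interaction: {inter:+.2f}")
```

In this placeholder table the joint drop equals the sum of the single-factor drops, so the interaction is zero up to floating-point error.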
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the scope and presentation of our position paper. We address each major comment below, noting that our argument is primarily conceptual and that we will make targeted revisions to strengthen the operational aspects of the proposed evaluation direction.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claim that task success rate 'cannot distinguish between these two sources of capability' rests on the premise that semantic mapping and physical action decision are separable components whose contributions can be isolated via controlled variations in evaluation design alone. The manuscript provides no formal argument or example showing that such variations succeed in disentangling the factors when representations are learned end-to-end, leaving the central identifiability conclusion dependent on an unverified separability assumption.
Authors: We agree that the manuscript offers no formal proof of separability for end-to-end models; the decomposition is presented as an analytical framework to expose the identifiability gap rather than an assertion of modular independence. The core claim, that success rate alone is consistent with multiple explanations, follows from the structure of existing benchmarks and does not require proving strict separability. We will revise the abstract to emphasize the conceptual nature of the decomposition and include a short illustrative example of a controlled-variation protocol in the body of the paper.
Revision: partial
-
Referee: [Abstract] Abstract (decomposition paragraph): the argument that improvements are 'consistent with multiple competing explanations, including semantic matching, distributional overlap, and genuine physical generalization' is presented without demonstrating that the proposed controlled-variation designs can rule out the first two explanations in practice; the absence of even a schematic example or proof sketch makes the proposed research direction non-operational as stated.
Authors: The consistency with multiple explanations is offered as a logical consequence of current protocols rather than an empirical demonstration that specific designs have already ruled out alternatives. The controlled-variation designs are proposed as a future research direction precisely to close this gap. We accept that the absence of a schematic example renders the direction less concrete and will add one in the revised manuscript to illustrate how such designs could isolate semantic versus physical contributions.
Revision: partial
Circularity Check
No significant circularity in conceptual position paper
The paper advances a position that task success rate cannot distinguish semantic mapping from physical action decision in VLA policies, supported by a proposed decomposition of policies and analysis of evaluation metrics. No equations, fitted parameters, or derivations appear in the provided text. The central claim follows from logical examination of identifiability under current protocols rather than any self-referential construction, self-citation chain, or renaming of known results. The argument is self-contained as a conceptual critique without reducing to its inputs by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: VLA policies can be decomposed into semantic mapping and physical action decision components whose contributions are separable via evaluation design.