LA4VLA: Learning to Act without Seeing via Language-Action Pretraining
Pith reviewed 2026-06-26 04:38 UTC · model grok-4.3
The pith
Language-action pretraining without visual observations lets VLA policies learn reusable manipulation skills from language alone before visuals are added.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LA4VLA decomposes expert demonstration trajectories into atomic action segments paired with low-level action descriptions to create the LA4-33K dataset, then applies LA-only, sequential LA-to-VLA, or mixed LA-VLA pretraining to a 1B-parameter model; the mixed schedule produces policies whose average success rates exceed the no-pretraining baseline by up to 17.8 percentage points in simulation and 45.0 percentage points in real-world tasks.
What carries the argument
The language-action pretraining framework that isolates language-conditioned action priors by training without visual input.
If this is right
- LA-pretrained policies consistently outperform matched VLA-pretrained counterparts across simulation and real-world tasks.
- Mixed LA-VLA pretraining produces further gains beyond either LA-only or VLA-only pretraining.
- The resulting policies reduce reliance on scene-specific visual cues by focusing on reusable skills shared across tasks and scenes.
- LA4VLA-1B achieves up to 17.8 and 45.0 percentage point improvements in average success rates in simulation and real-world tasks respectively.
Where Pith is reading between the lines
- The same decomposition step could be applied to existing large demonstration libraries to generate language-action data at scale without new robot runs.
- Policies trained this way may maintain performance under visual distribution shifts such as novel lighting or camera angles not present in the original demonstrations.
- Combining the approach with higher-level language descriptions could support skill composition for longer-horizon tasks.
Load-bearing premise
That decomposing demonstration trajectories into atomic action segments and pairing them with low-level language descriptions produces language-action priors that accurately capture reusable manipulation skills independent of visual context and transfer effectively when visual observations are later added.
What would settle it
Evaluate the trained policies on the same language instructions but with systematically altered visual conditions such as changed table textures, lighting, or background objects; if the success-rate advantage over non-LA baselines disappears, the claim that visual-shortcut reliance has been reduced is falsified.
read the original abstract
Vision-Language-Action (VLA) models are commonly pretrained on robot demonstrations by jointly mapping visual observations and language instructions to actions. However, dense visual-action supervision can dominate the comparatively sparse language-action signal. As a result, policies may rely on visual shortcuts rather than learn how language conditions action execution, making them sensitive to visual variations. To address this limitation, we propose LA4VLA, a language-action pretraining framework that enables policies to acquire language-conditioned action priors without visual observations. These priors capture reusable manipulation skills shared across tasks and scenes, reducing reliance on scene-specific visual cues. Specifically, LA4VLA decomposes expert demonstration trajectories into atomic action segments and pairs each segment with a corresponding low-level action description. This yields LA4-33K, a dataset of 33K Language-Action (LA) episodes derived entirely from existing demonstrations without additional robot data collection. We further develop LA4VLA-1B, a lightweight 1B-parameter VLA model, and investigate three paradigms for incorporating language-action supervision into VLA learning: LA-only pretraining, sequential LA-to-VLA pretraining, and mixed LA-VLA pretraining. Across simulation and real-world tasks, LA-pretrained policies consistently outperform matched VLA-pretrained counterparts, while combining LA and VLA supervision leads to further gains. In particular, mixed LA-VLA pretraining improves the average success rate of LA4VLA-1B over the no-pretraining baseline by up to 17.8 and 45.0 percentage points in simulation and real-world tasks, respectively. These results establish LA4VLA as an effective and complementary pretraining strategy for building stronger and more robust VLA policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes LA4VLA, a pretraining framework that decomposes robot demonstration trajectories into atomic action segments paired with low-level language descriptions to create the LA4-33K dataset of 33K language-action episodes. It trains a 1B-parameter LA4VLA-1B model under LA-only, sequential LA-to-VLA, and mixed LA-VLA pretraining regimes, claiming that mixed pretraining yields average success-rate gains of up to 17.8 percentage points in simulation and 45.0 percentage points in real-world tasks over a no-pretraining baseline by learning visual-independent action priors.
Significance. If the gains are shown to arise specifically from the claimed language-action priors rather than from differences in total supervision volume or optimization, the approach could offer a practical complementary pretraining stage for VLA models. The derivation of LA4-33K entirely from existing demonstrations without new robot collection is a concrete, reusable contribution that other groups could build upon.
major comments (3)
- [§4] §4 (Experiments): The reported success-rate improvements (e.g., 17.8 pp and 45.0 pp) are presented without any description of baseline matching on total training tokens, number of gradient steps, or data volume; the mixed LA-VLA condition necessarily incorporates additional LA episodes, so it is impossible to determine whether the gains are attributable to the visual-independence mechanism or simply to extra supervision.
- [§3.1] §3.1 (LA4-33K construction): The segmentation criterion used to split trajectories into atomic action segments and the provenance/generation method for the accompanying low-level language strings are not specified (state thresholds, visual features, manual annotation, or LLM prompting). Without this information it cannot be verified that the resulting LA pairs are free of implicit visual or scene-specific information, which is load-bearing for the central claim that the priors are visual-independent.
- [§4.3] §4.3 (real-world evaluation): The real-world results show the largest claimed gains (45.0 pp) yet supply no information on the number of trials per task, randomization of initial conditions, or statistical testing; this is required to assess whether the improvement is robust or could be explained by uncontrolled factors.
minor comments (2)
- [Abstract, §1] The abstract and §1 use the phrase “parameter-free” for the priors; this should be clarified or removed since the LA pretraining still involves learned weights.
- [Figure 2, §3.2] Figure 2 caption and §3.2 should explicitly state the total parameter count and training compute for each of the three pretraining paradigms so readers can compare them directly.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to incorporate clarifications and additional details where the comments identify gaps in the current presentation.
read point-by-point responses
-
Referee: [§4] §4 (Experiments): The reported success-rate improvements (e.g., 17.8 pp and 45.0 pp) are presented without any description of baseline matching on total training tokens, number of gradient steps, or data volume; the mixed LA-VLA condition necessarily incorporates additional LA episodes, so it is impossible to determine whether the gains are attributable to the visual-independence mechanism or simply to extra supervision.
Authors: The manuscript states that comparisons use matched VLA-pretrained counterparts with respect to model size, architecture, and number of gradient steps. The LA-only condition uses identical data volume to the VLA baseline (just different modality), providing evidence that the gains are not solely from extra supervision. For the mixed condition, additional LA episodes are included by design. We acknowledge the manuscript lacks an explicit table of total tokens seen across regimes. We will add this comparison in a revised Section 4 and note that the visual-independence claim rests primarily on the LA-only results. revision: partial
-
Referee: [§3.1] §3.1 (LA4-33K construction): The segmentation criterion used to split trajectories into atomic action segments and the provenance/generation method for the accompanying low-level language strings are not specified (state thresholds, visual features, manual annotation, or LLM prompting). Without this information it cannot be verified that the resulting LA pairs are free of implicit visual or scene-specific information, which is load-bearing for the central claim that the priors are visual-independent.
Authors: We agree the current text provides only a high-level description. The full manuscript will be revised to specify that segmentation uses end-effector velocity thresholds and gripper-state changes (no visual features), and language strings are produced via template rules on action primitives plus LLM prompting for phrasing. No scene or visual information enters the LA pairs. We will include pseudocode and examples in an expanded Section 3.1. revision: yes
-
Referee: [§4.3] §4.3 (real-world evaluation): The real-world results show the largest claimed gains (45.0 pp) yet supply no information on the number of trials per task, randomization of initial conditions, or statistical testing; this is required to assess whether the improvement is robust or could be explained by uncontrolled factors.
Authors: We accept this point. The revised Section 4.3 will report 20 trials per task, scripted randomization of initial object poses and lighting, and statistical details (standard errors across trials). These protocol elements were used in the original experiments and will now be documented. revision: yes
Circularity Check
No circularity: empirical pretraining comparison with external task metrics
full rationale
The paper reports success-rate gains from LA-only, sequential, and mixed pretraining regimes on held-out simulation and real-world tasks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the LA4-33K dataset is constructed once from existing demonstrations and the performance deltas are measured against an explicit no-pretraining baseline. The derivation chain is therefore self-contained experimental comparison rather than reduction to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
Pith/arXiv arXiv 2024
-
[2]
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. corr, abs/2410.24164, 2024. doi: 10.48550.arXiv preprint ARXIV.2410.24164
Pith/arXiv arXiv 2024
-
[3]
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025
Pith/arXiv arXiv 2025
-
[4]
Qwen-vla: Unifying vision-language-action modeling across tasks, environments, and robot embodiments
Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, et al. Qwen-vla: Unifying vision-language-action modeling across tasks, environments, and robot embodiments. arXiv preprint arXiv:2605.30280, 2026
Pith/arXiv arXiv 2026
-
[5]
Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, and Bo Zhao. Evo-0: Vision- language-action model with implicit spatial understanding.arXiv preprint arXiv:2507.00416, 2025
arXiv 2025
-
[6]
Evo-depth: A lightweight depth-enhanced vision-language-action model
Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu, Yunhe Li, Yuqian Fu, Yinxinyu Chen, Hongyi Cai, Zewei Ye, Bing Cheng, et al. Evo-depth: A lightweight depth-enhanced vision-language-action model. arXiv preprint arXiv:2605.14950, 2026
Pith/arXiv arXiv 2026
-
[7]
Runze Wang, Yuqian Fu, Yu Li, Tao Lin, Tianwen Qian, Mohamed Elhoseiny, Bo Zhao, Yanwei Fu, Yu-Gang Jiang, and Xiangyang Xue. Afford-vla: Action-aligned visual planning via internalized affordance.arXiv preprint arXiv:2605.24203, 2026
Pith/arXiv arXiv 2026
-
[8]
Kuanning Wang, Ke Fan, Chenhao Qiu, Zeyu Shangguan, Yuqian Fu, Yanwei Fu, Daniel Seita, and Xiangyang Xue. Oflow: Injecting object-aware temporal flow matching for robust robotic manipulation.arXiv preprint arXiv:2604.17876, 2026
Pith/arXiv arXiv 2026
-
[9]
Gaze2act: Gaze-conditioned vision-language-action policies for interactive robot manipulation
Kuangji Zuo, Gen Li, Bofan Lyu, Yanshuo Lu, Boyu Ma, Shijia Han, Xinyu Zhou, Xichen Yuan, Chuhao Zhou, Jiaqi Bai, et al. Gaze2act: Gaze-conditioned vision-language-action policies for interactive robot manipulation. arXiv preprint arXiv:2605.30282, 2026
Pith/arXiv arXiv 2026
-
[10]
Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024
Pith/arXiv arXiv 2024
-
[11]
Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0
Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024
2024
-
[12]
Bridgedata v2: A dataset for robot learning at scale
Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023
2023
-
[13]
Molmoact2: Action reasoning models for real-world deployment.arXiv preprint arXiv:2605.02881, 2026
Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu, Weikai Huang, Xiang Fan, Wei-Chuan Tsai, Shirui Chen, Yi Ru Wang, et al. Molmoact2: Action reasoning models for real-world deployment.arXiv preprint arXiv:2605.02881, 2026
Pith/arXiv arXiv 2026
-
[14]
Seeing to act, prompting to specify: A bayesian factorization of vision language action policy
Kechun Xu, Zhenjie Zhu, Anzhe Chen, Shuqi Zhao, Qing Huang, Yifei Yang, Haojian Lu, Rong Xiong, Masayoshi Tomizuka, and Yue Wang. Seeing to act, prompting to specify: A bayesian factorization of vision language action policy. arXiv preprint arXiv:2512.11218, 2025
arXiv 2025
-
[15]
Yu Fang, Yuchun Feng, Dong Jing, Jiaqi Liu, Yue Yang, Zhenyu Wei, Daniel Szafir, and Mingyu Ding. When vision overrides language: Evaluating and mitigating counterfactual failures in vlas.arXiv preprint arXiv:2602.17659, 2026
arXiv 2026
-
[16]
StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing.arXiv preprint arXiv:2604.05014, 2026. 15
Pith/arXiv arXiv 2026
-
[17]
Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning
Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. InConference on robot learning, pages 1094–1100. PMLR, 2020
2020
-
[18]
Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023
2023
-
[19]
Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022
Pith/arXiv arXiv 2022
-
[20]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023
2023
-
[21]
Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025
Pith/arXiv arXiv 2025
-
[22]
Evo-1: Lightweight vision-language-action model with preserved semantic alignment
Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, et al. Evo-1: Lightweight vision-language-action model with preserved semantic alignment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13397–13406, 2026
2026
-
[23]
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025
Pith/arXiv arXiv 2025
-
[24]
A2a: Action-to-action flow matching policy
Jindou Jia, Gen Li, Xiangyu Chen, Tuo An, Yuxuan Hu, Jingliang Li, Xinying Guo, and Jianfei Yang. A2a: Action-to-action flow matching policy. InProceedings of Robotics: Science and Systems, 2026
2026
-
[25]
Focusable monocular depth estimation.arXiv preprint arXiv:2605.11756, 2026
Yuxin Du, Tao Lin, Zile Zhong, Runting Li, Xiyao Chen, Jiting Liu, Chenglin Liu, Ying-Cong Chen, Yuqian Fu, and Bo Zhao. Focusable monocular depth estimation.arXiv preprint arXiv:2605.11756, 2026
Pith/arXiv arXiv 2026
-
[26]
Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, and Bo Zhao. Vla-pruner: Temporal-aware dual-level visual token pruning for efficient vision-language-action inference.arXiv preprint arXiv:2511.16449, 2025
Pith/arXiv arXiv 2025
-
[27]
Bc-z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. Inconference on Robot Learning, pages 991–1002. PMLR, 2022
2022
-
[28]
Interactive language: Talking to robots in real time.IEEE Robotics and Automation Letters, 2023
Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time.IEEE Robotics and Automation Letters, 2023
2023
-
[29]
Cliport: What and where pathways for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Conference on robot learning, pages 894–906. PMLR, 2022
2022
-
[30]
Do as i can, not as i say: Grounding language in robotic affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022
Pith/arXiv arXiv 2022
-
[31]
Code as policies: Language model programs for embodied control
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023
2023
-
[32]
Cot-vla: Visual chain-of-thought reasoning for vision-language-action models
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025
2025
-
[33]
Hi robot: Open-ended instruction following with hierarchical vision-language-action models
Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417, 2025. 16
Pith/arXiv arXiv 2025
-
[34]
Mask2iv: Interaction-centric video generation via mask trajectories
Gen Li, Bo Zhao, Jianfei Yang, and Laura Sevilla-Lara. Mask2iv: Interaction-centric video generation via mask trajectories. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6091–6099, 2026
2026
-
[35]
Embodied large language models enable robots to complete complex tasks in unpredictable environments.Nature Machine Intelligence, 7(4):592–601, 2025
Ruaridh Mon-Williams, Gen Li, Ran Long, Wenqian Du, and Christopher G Lucas. Embodied large language models enable robots to complete complex tasks in unpredictable environments.Nature Machine Intelligence, 7(4):592–601, 2025
2025
-
[36]
Lihan Zha, Asher J Hancock, Mingtong Zhang, Tenny Yin, Yixuan Huang, Dhruv Shah, Allen Z Ren, and Anirudha Majumdar. Lap: Language-action pre-training enables zero-shot cross-embodiment transfer.arXiv preprint arXiv:2602.10556, 2026
arXiv 2026
-
[37]
Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025
Pith/arXiv arXiv 2025
-
[38]
Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Pith/arXiv arXiv 2025
-
[39]
Unleashing large-scale video generative pre-training for visual robot manipulation
Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. InInternational Conference on Learning Representations, volume 2024, pages 10641–10662, 2024
2024
-
[40]
Feng Yan, Fanfan Liu, Liming Zheng, Yufeng Zhong, Yiyang Huang, Zechao Guan, Chengjian Feng, and Lin Ma. Robotron-mani: All-in-one multimodal large model for robotic manipulation.arXiv preprint arXiv:2412.07215, 2024
arXiv 2024
-
[41]
arXiv preprint arXiv:2504.16054, 2025
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025
Pith/arXiv arXiv 2025
-
[42]
Zuojin Tang, Haoyun Liu, Xinyuan Chang, Changjie Wu, Dongjie Huo, Yandan Yang, Bin Liu, Zhejia Cai, Feng Xiong, Mu Xu, et al. Alam: Algebraically consistent latent transitions for vision-language-action models.arXiv preprint arXiv:2605.10819, 2026. 17 Appendix A LA Dataset Construction Details This appendix complements the dataset construction pipeline de...
Pith/arXiv arXiv 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.