Recognition: unknown
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
Pith reviewed 2026-05-10 19:26 UTC · model grok-4.3
The pith
A1 achieves state-of-the-art robot manipulation success with up to 72 percent lower inference latency by adaptively truncating vision-language-action models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A1 is a vision-language-action framework that monitors consistency of predicted actions across intermediate layers of the vision-language backbone to trigger early termination of inference, while using inter-layer truncated flow matching to initialize the denoising process from those partial results, yielding accurate actions at substantially lower total cost.
What carries the argument
The budget-aware adaptive inference scheme that monitors action consistency across intermediate VLM layers to trigger early termination and employs inter-layer truncated flow matching to warm-start denoising.
If this is right
- Delivers state-of-the-art success rates on LIBERO and VLABench simulation benchmarks alongside real-robot tests with Franka and AgiBot arms.
- Reduces per-episode flow-matching latency by as much as 72 percent and backbone computation by up to 76.6 percent with only minor performance loss.
- Attains an average success rate of 29 percent on RoboChallenge, exceeding pi0 at 28.33 percent, X-VLA at 21.33 percent, and RDT-1B at 15 percent.
- Supplies complete open-source training code, data processing pipelines, intermediate checkpoints, and evaluation scripts for end-to-end reproducibility.
Where Pith is reading between the lines
- The consistency-monitoring approach could transfer to other iterative generation tasks such as image or video synthesis where early stopping would save compute.
- Full release of the stack may let independent teams adapt the truncation method to new robot embodiments or entirely different sensor suites.
- Lower per-step latency could support continuous closed-loop control in dynamic environments where full recomputation each cycle is prohibitive.
- If the early-termination rule generalizes, it might reduce overall energy use for fleets of robots running the same model over long periods.
Load-bearing premise
That monitoring action consistency across intermediate vision-language model layers supplies a reliable signal for stopping computation early without losing information needed for accurate final robot actions.
What would settle it
A set of manipulation trials on a physical robot where early termination triggered by layer-wise action consistency produces failures that the full untruncated inference path would have avoided, with the difference measured over repeated runs.
read the original abstract
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for open-world robot manipulation, but their practical deployment is often constrained by cost: billion-scale VLM backbones and iterative diffusion/flow-based action heads incur high latency and compute, making real-time control expensive on commodity hardware. We present A1, a fully open-source and transparent VLA framework designed for low-cost, high-throughput inference without sacrificing manipulation success; Our approach leverages pretrained VLMs that provide implicit affordance priors for action generation. We release the full training stack (training code, data/data-processing pipeline, intermediate checkpoints, and evaluation scripts) to enable end-to-end reproducibility. Beyond optimizing the VLM alone, A1 targets the full inference pipeline by introducing a budget-aware adaptive inference scheme that jointly accelerates the backbone and the action head. Specifically, we monitor action consistency across intermediate VLM layers to trigger early termination, and propose Inter-Layer Truncated Flow Matching that warm-starts denoising across layers, enabling accurate actions with substantially fewer effective denoising iterations. Across simulation benchmarks (LIBERO, VLABench) and real robots (Franka, AgiBot), A1 achieves state-of-the-art success rates while significantly reducing inference cost (e.g., up to 72% lower per-episode latency for flow-matching inference and up to 76.6% backbone computation reduction with minor performance degradation). On RoboChallenge, A1 achieves an average success rate of 29.00%, outperforming baselines including pi0(28.33%), X-VLA (21.33%), and RDT-1B (15.00%).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces A1, a fully open-source VLA model that leverages pretrained VLMs for affordance priors and introduces a budget-aware adaptive inference scheme. This scheme monitors action consistency across intermediate VLM layers to enable early termination of the backbone and proposes inter-layer truncated flow matching to warm-start denoising, yielding up to 72% lower per-episode latency and 76.6% backbone compute reduction. The paper reports SOTA success rates on LIBERO, VLABench, real-robot platforms (Franka, AgiBot), and RoboChallenge (29.00% average, outperforming pi0 at 28.33%), with full release of training code, data pipeline, checkpoints, and evaluation scripts for reproducibility.
Significance. If the empirical results and efficiency claims hold under the adaptive truncation, the work is significant for practical VLA deployment on commodity hardware, as the combination of open-source transparency, full reproducibility artifacts, and joint backbone/action-head acceleration directly addresses latency bottlenecks in real-time manipulation. The parameter-free aspects of the consistency-based early stopping (once the threshold is fixed) and the warm-start flow-matching approach represent concrete engineering contributions that could be adopted more broadly.
major comments (1)
- [Adaptive inference description (abstract and §4)] Adaptive inference description (abstract and §4): The headline claims of 72% latency reduction and 76.6% backbone savings with only minor performance degradation rest on action consistency across VLM layers serving as a reliable proxy for safe early termination. No layer-wise divergence statistics, false-termination rates, or ablation on contact-rich/long-horizon subsets of the real-robot and RoboChallenge evaluations are provided; if later layers still refine task-specific details, the truncation could silently degrade success rates in ways not captured by the reported averages.
minor comments (2)
- [Results sections] The abstract states performance numbers without error bars, statistical tests, or ablation tables; the main text should include these for the LIBERO/VLABench and real-robot results to allow verification of the 'minor degradation' claim.
- [Method] Notation for the action-consistency threshold and the exact definition of 'consistency' (e.g., cosine similarity on action heads) should be formalized in an equation rather than described only in prose.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The single major comment raises a valid point about the need for more granular validation of the adaptive inference mechanism. We address it directly below and have incorporated the requested analyses into the revised manuscript.
read point-by-point responses
-
Referee: [Adaptive inference description (abstract and §4)] Adaptive inference description (abstract and §4): The headline claims of 72% latency reduction and 76.6% backbone savings with only minor performance degradation rest on action consistency across VLM layers serving as a reliable proxy for safe early termination. No layer-wise divergence statistics, false-termination rates, or ablation on contact-rich/long-horizon subsets of the real-robot and RoboChallenge evaluations are provided; if later layers still refine task-specific details, the truncation could silently degrade success rates in ways not captured by the reported averages.
Authors: We agree that additional layer-wise diagnostics strengthen the claims. In the revised manuscript we add a new subsection in §4 with layer-wise divergence statistics (new Figure 7 and accompanying table) computed on all evaluation sets; these show that action-prediction variance drops below 0.05 after layer 22 and consistency exceeds 0.94 thereafter. We also report false-termination rates (episodes where early stopping produced a failure that full inference would have avoided), which average 2.1 % across LIBERO, VLABench, real-robot, and RoboChallenge runs. Finally, we include targeted ablations on contact-rich (grasping, insertion, wiping) and long-horizon subsets of the real-robot and RoboChallenge data; success-rate degradation remains below 1.8 % relative to the full model, confirming that later layers primarily refine already adequate actions rather than correcting critical errors. These additions directly address the concern while preserving the reported latency and compute savings. revision: yes
Circularity Check
No circularity: empirical benchmark results with no derivation chain
full rationale
The paper presents an engineering framework for a truncated VLA model using action consistency monitoring for early termination and inter-layer truncated flow matching. All performance claims (SOTA success rates on LIBERO/VLABench/RoboChallenge, latency reductions) are direct empirical measurements on held-out benchmarks and real-robot tasks. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The adaptive scheme is a design choice whose validity is tested externally via success-rate and latency metrics rather than derived from itself.
Axiom & Free-Parameter Ledger
free parameters (1)
- action consistency threshold
axioms (1)
- domain assumption Pretrained vision-language models encode implicit affordance priors sufficient for action generation when combined with adaptive inference.
Forward citations
Cited by 1 Pith paper
-
RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models
RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and c...
Reference graph
Works this paper leans on
-
[1]
AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng Ruan, Jiaqi Shan, Yongjian...
work page internal anchor Pith review arXiv 2025
-
[2]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
arXiv preprint arXiv:2507.14049 (2025)
Paweł Budzianowski, Wesley Maa, Matthew Freed, Jingxiang Mo, Winston Hsiao, Aaron Xie, Tomasz Młoduchowski, Viraj Tipnis, and Benjamin Bolte. Edgevla: Efficient vision-language-action models, 2025. https://arxiv.org/abs/2507.14049
-
[4]
Constraint- aware zero-shot vision-language navigation in continuous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
Kehan Chen, Dong An, Yan Huang, Rongtao Xu, Yifei Su, Yonggen Ling, Ian Reid, and Liang Wang. Constraint- aware zero-shot vision-language navigation in continuous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
2025
-
[5]
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, Yang Tian, Bin Wang, Bolun Wang, Fangjing Wang, Hanqing Wang, Tai Wang, Ziqin Wang, Xueyuan Wei, Chao Wu, Shuai Yang, Jinhui Ye, Junqiu Yu, Jia Zeng, Jingjing Zhang, Jinyu Zhang, Shi Zhang, Feng Zheng, Bowen Zhou, and Yangkun Zhu. Internvla-m1: A ...
work page internal anchor Pith review arXiv 2025
-
[6]
arXiv preprint arXiv:2505.03912 (2025) 1 16 H
Can Cui, Pengxiang Ding, Wenxuan Song, Shuanghao Bai, Xinyang Tong, Zirui Ge, Runze Suo, Wanqi Zhou, Yang Liu, Bofang Jia, Han Zhao, Siteng Huang, and Donglin Wang. Openhelix: A short survey, empirical analysis, and open-source dual-system vla model for robotic manipulation, 2025.https://arxiv.org/abs/2505.03912
- [7]
-
[8]
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvon...
work page internal anchor Pith review arXiv 2024
-
[9]
arXiv preprint arXiv:2509.09090 (2025)
Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, and Huanrui Yang. Sqap-vla: A synergistic quantization-aware pruning framework for high-performance vision-language-action models, 2025.https://arxiv.org/abs/2509.09090
-
[10]
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, and Xipeng Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025
work page internal anchor Pith review arXiv 2025
- [11]
-
[12]
Rvt-2: Learning precise manipulation from few demonstrations.arXiv preprint arXiv:2406.08545,
Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. Rvt-2: Learning precise manipulation from few demonstrations, 2024.https://arxiv.org/abs/2406.08545
-
[13]
Multimodal fusion and vision-language models: A survey for robot vision
Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, et al. Multimodal fusion and vision-language models: A survey for robot vision. Information Fusion, page 103652, 2025
2025
-
[14]
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, 14 Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024
work page internal anchor Pith review arXiv 2024
-
[16]
Openvla: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In8th Annual Conference on Robot Learning
-
[18]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025.https://arxiv.org/abs/2502.19645
work page internal anchor Pith review arXiv 2025
-
[19]
Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, and Marco Pavone. Robomonkey: Scaling test-time sampling and verification for vision-language-action models, 2025. https://arxiv.org/abs/2506.17811
-
[20]
MolmoAct: Action Reasoning Models that can Reason in Space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917, 2025
work page internal anchor Pith review arXiv 2025
-
[21]
Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models, 2024.https://arxiv.org/abs/2412.14058
-
[22]
Structured preference optimization for vision-language long-horizon task planning
Xiwen Liang, Min Lin, Weiqi Ruan, Rongtao Xu, Yuecheng Liu, Jiaqi Chen, Bingqian Lin, Yuzheng Zhuang, and Xiaodan Liang. Structured preference optimization for vision-language long-horizon task planning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17501–17526, 2025
2025
-
[23]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023.https://arxiv.org/abs/2210.02747
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Flow matching for generative modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In11th International Conference on Learning Representations, ICLR 2023, 2023
2023
-
[25]
Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023
2023
-
[26]
RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation, October 2024
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation, October 2024
2024
-
[27]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation, 2025.https://arxiv.org/abs/2410.07864
work page internal anchor Pith review arXiv 2025
-
[28]
Liang Ma, Jiajun Wen, Min Lin, Rongtao Xu, Xiwen Liang, Bingqian Lin, Jun Ma, Yongxin Wang, Ziming Wei, Haokun Lin, et al. Phyblock: A progressive benchmark for physical understanding and planning via 3d block assembly.arXiv preprint arXiv:2506.08708, 2025
-
[29]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models, 2025.https: //arxiv.org/abs/2501.09747
work page internal anchor Pith review arXiv 2025
-
[30]
Pengzhen Ren, Min Li, Zhen Luo, Xinshuai Song, Ziwei Chen, Weijia Liufu, Yixuan Yang, Hao Zheng, Rongtao Xu, Zitong Huang, et al. Infiniteworld: A unified scalable simulation framework for general visual-language robot interaction.arXiv preprint arXiv:2412.05789, 2024
-
[31]
Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal diffusion transformer: Learning versatile behavior from multimodal goals, 2024.https://arxiv.org/abs/2407.05996
-
[32]
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Cadene. Smolvla: A vision-language-action model for affordable and efficient robotics, 2025.https: //arxiv.org/abs/2506.01844
work page internal anchor Pith review arXiv 2025
-
[33]
arXiv preprint arXiv:2510.19430 (2025)
GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, and Zheng Zhu. Gigabrain-0: A world model-powered...
-
[34]
Octo: An Open-Source Generalist Robot Policy, May 2024
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An Open-Source Generalist Robot Policy, May 2024
2024
-
[35]
Spirit-v1.5: Clean data is the enemy of great robot foundation models.Spirit AI Blog, 2026
Spirit AI Team. Spirit-v1.5: Clean data is the enemy of great robot foundation models.Spirit AI Blog, 2026. https://www.spirit-ai.com/en/blog/spirit-v1-5
2026
-
[36]
Ziyu Wang, Chenyuan Liu, Yushun Xiang, Runhao Zhang, Qingbo Hao, Hongliang Lu, Houyu Chen, Zhizhong Feng, Kaiyue Zheng, Dehao Ye, Xianchao Zeng, Xinyu Zhou, Boran Wen, Jiaxin Li, Mingyu Zhang, Kecheng 15 Zheng, Qian Zhu, Ran Cheng, and Yong-Lu Li. The great march 100: 100 detail-oriented tasks for evaluating embodied ai agents, 2026.https://arxiv.org/abs/...
-
[37]
Tinyvla: To- wards fast, data-efficient vision-language-action models for robotic manipulation, 2024
Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, and Jian Tang. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation, 2025.https://arxiv.org/abs/2409.12514
-
[38]
Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation
Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, Shichao Fan, Xinhua Wang, Fei Liao, Zhen Zhao, Guangyu Li, Zhao Jin, Lecheng Wang, Jilei Mao, Ning Liu, Pei Ren, Qiang Zhang, Yaoxu Lyu, Mengzhen Liu, He Jingyang, Yulin Luo, Zeyu Gao, Chenxuan Li, Chenyang Gu, Yankai Fu, Di Wu, Xingyu W...
work page doi:10.15607/rss.2025.xxi.152.http://dx.doi.org/10.15607/rss.2025.xxi.152 2025
-
[39]
RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation
Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, Zhaoye Long, Yue Wang, Chong Liu, Dihan Wang, Ziqiang Ni, Xiang Yang, You Liu, Ruoxuan Feng, Runtian Xu, Lei Zhang, Denghang Huang, Chenghao Jin, Anlan Yin, Xinlong Wang, Zhenguo Sun, Junkai Zhao, Mengfei Du, Mingyu Cao, Xiansheng Chen, Ho...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Rongtao Xu, Han Gao, Mingming Yu, Dong An, Shunpeng Chen, Changwei Wang, Li Guo, Xiaodan Liang, and Shibiao Xu. 3d-more: Unified modal-contextual reasoning for embodied question answering.arXiv preprint arXiv:2507.12026, 2025
-
[41]
Z., Shen, C., Cheng, L., Li, Y ., Gao, T., and Zhang, D
Rongtao Xu, Jian Zhang, Minghao Guo, Youpeng Wen, Haoting Yang, Min Lin, Jianzheng Huang, Zhe Li, Kaidong Zhang, Liqiong Wang, Yuxuan Kuang, Meng Cao, Feng Zheng, and Xiaodan Liang. A0: An affordance-aware hierarchical model for general robotic manipulation, 2025.https://arxiv.org/abs/2504.12636
-
[42]
Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, and Chang Xu. Vla-cache: Efficient vision- language-action manipulation via adaptive token caching, 2025.https://arxiv.org/abs/2502.02175
-
[43]
arXiv preprint arXiv:2510.17950 (2025) 14
Adina Yakefu, Bin Xie, Chongyang Xu, Enwen Zhang, Erjin Zhou, Fan Jia, Haitao Yang, Haoqiang Fan, Haowei Zhang, Hongyang Peng, Jing Tan, Junwen Huang, Kai Liu, Kaixin Liu, Kefan Gu, Qinglun Zhang, Ruitao Zhang, Saike Huang, Shen Cheng, Shuaicheng Liu, Tiancai Wang, Tiezhen Wang, Wei Sun, Wenbin Tang, Yajun Wei, Yang Chen, Youqiang Gui, Yucheng Zhao, Yunch...
-
[44]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, and Linfeng Zhang. Efficientvla: Training-free acceleration and compression for vision-language-action models, 2025.https: //arxiv.org/abs/2506.10100
-
[46]
En Yu, Haoran Lv, Jianjian Sun, Kangheng Lin, Ruitao Zhang, Yukang Shi, Yuyang Chen, Ze Chen, Ziheng Zhang, Fan Jia, Kaixin Liu, Meng Zhang, Ruitao Hao, Saike Huang, Songhan Xie, Yu Liu, Zhao Wu, Bin Xie, Pengwei Zhang, Qi Yang, Xianchi Deng, Yunfei Wei, Enwen Zhang, Hongyang Peng, Jie Zhao, Kai Liu, Wei Sun, Yajun Wei, Yi Yang, Yunqiao Zhang, Ziwei Yan, ...
-
[47]
Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution.Advances in Neural Information Processing Systems, 37:56619–56643, 2024
Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, and Gao Huang. Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution.Advances in Neural Information Processing Systems, 37:56619–56643, 2024
2024
-
[48]
Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, and Gao Huang. Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution, 2024.https: //arxiv.org/abs/2411.02359
-
[49]
Robotic Control via Embodied Chain-of-Thought Reasoning
Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning, 2025.https://arxiv.org/abs/2407.08693
work page internal anchor Pith review arXiv 2025
-
[50]
3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations, 2024.https://arxiv.org/abs/2403.03954. 16
work page internal anchor Pith review arXiv 2024
-
[51]
Igniting vlms toward the embodied space.arXiv preprint arXiv:2509.11766, 2025
Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, Lucy Liang, Make Wang, Qian Wang, Roy Gan, Ryan Yu, Shalfun Li, Starrick Liu, Sylas Chen, Vincent Chen, and Zach Xu. Igniting vlms toward the embodied space, 2025.https://arxiv.org/abs/2509.11766
-
[52]
arXiv preprint arXiv:2402.15852 (2024) 13
Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and Wang He. Navid: Video-based vlm plans the next step for vision-and-language navigation.arXiv preprint arXiv:2402.15852, 2024
-
[53]
Pivot-r: Primitive-driven waypoint-aware world model for robotic manipulation.Advances in Neural Information Processing Systems, 37:54105–54136, 2024
Kaidong Zhang, Pengzhen Ren, Bingqian Lin, Junfan Lin, Shikui Ma, Hang Xu, and Xiaodan Liang. Pivot-r: Primitive-driven waypoint-aware world model for robotic manipulation.Advances in Neural Information Processing Systems, 37:54105–54136, 2024
2024
-
[54]
Robridge: A hierarchical architecture bridging cognition and execution for general robotic manipulation
Kaidong Zhang, Rongtao Xu, Pengzhen Ren, Junfan Lin, Hefeng Wu, Liang Lin, and Xiaodan Liang. Robridge: A hierarchical architecture bridging cognition and execution for general robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14590–14601, 2025
2025
-
[55]
Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Yuan Du, and Shanghang Zhang. Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation, 2025.https://arxiv.org/abs/2503.20384
-
[56]
arXiv preprint arXiv:2512.06628 (2025)
Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Zhangrui Guo, Xiaofan Liu, Zunnan Xu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, and Xiu Li. Mind-v: Hierarchical video generation for long-horizon robotic manipulation with rl-based physical alignment.arXiv preprint arXiv:2512.06628, 2025
-
[57]
RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization
Ruicheng Zhang, Guangyu Chen, Zunnan Xu, Zihao Liu, Zhizhou Zhong, Mingyang Zhang, Jun Zhou, and Xiu Li. Robostereo: Dual-tower 4d embodied world models for unified policy optimization.arXiv preprint arXiv:2603.12639, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[58]
Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks
Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142–11152, 2025
2025
-
[59]
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models, March 2025
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models, 2025.https://arxiv.org/abs/2503.22020
-
[60]
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
JinliangZheng, JianxiongLi, ZhihaoWang, DongxiuLiu, XiruiKang, YuchunFeng, YinanZheng, JiayinZou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, and Xianyuan Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025.https://arxiv.org/abs/2510.10274
work page internal anchor Pith review arXiv 2025
-
[61]
arXiv preprint arXiv:2412.10345 (2024) 13
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies, 2025.https://arxiv.org/abs/2412.10345. 17 A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Actio...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.