Recognition: no theorem link
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
Pith reviewed 2026-05-13 16:55 UTC · model grok-4.3
The pith
Adaptively choosing action chunk sizes from prediction entropy at inference time improves the robotic manipulation performance of VLA models
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By computing action entropy from the model's output distribution at inference time, the adaptive action chunking strategy selects how many predicted actions to execute before replanning. This balances the trade-off between maintaining consistent motion and remaining responsive to environmental changes, and it leads to higher success rates across diverse manipulation tasks.
What carries the argument
Action entropy computed from current predictions, used as the signal to adaptively set the length of the action chunk to execute without replanning.
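A minimal sketch of how such an entropy-driven chunk-size rule could work at inference time. The helper names (policy.predict_chunk, entropy_to_chunk_size), the linear mapping, and the thresholds h_min/h_max and bounds k_min/k_max are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def action_entropy(probs: np.ndarray) -> float:
    """Mean Shannon entropy (in nats) of the per-timestep action distributions.

    probs: array of shape (chunk_len, num_bins) whose rows sum to 1,
    e.g. softmax outputs over discretized action tokens.
    """
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def entropy_to_chunk_size(h: float, h_min=0.5, h_max=2.5, k_min=1, k_max=16) -> int:
    """Monotonically decreasing map: low entropy -> long chunk, high entropy -> short chunk.

    The linear form and the thresholds are placeholders, not the paper's.
    """
    t = float(np.clip((h - h_min) / (h_max - h_min), 0.0, 1.0))
    return int(round(k_max - t * (k_max - k_min)))

def run_episode(policy, env, k_max=16):
    """Hypothetical control loop: predict a full chunk, execute only the first k actions."""
    obs = env.reset()
    done = False
    while not done:
        # Assumed policy API: returns k_max actions plus their output distributions.
        actions, probs = policy.predict_chunk(obs, horizon=k_max)
        k = entropy_to_chunk_size(action_entropy(probs))
        for a in actions[:k]:
            obs, reward, done, info = env.step(a)
            if done:
                break
```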
If this is right
- Longer chunks are used for low-entropy confident predictions to promote smooth execution.
- Higher entropy triggers shorter chunks for increased reactivity to new observations.
- The approach avoids the need for task-specific empirical tuning of chunk length.
- Performance gains are observed in both simulation and real-world robot experiments.
Where Pith is reading between the lines
- This inference-time adaptation might reduce the reliance on extensive task-specific training for chunk parameters.
- It could be extended to other sequence-based control policies where uncertainty varies over time.
- Integrating entropy signals with visual feedback might further refine chunk decisions in cluttered scenes.
Load-bearing premise
The entropy of the model's action predictions accurately reflects the need for longer or shorter chunks without leading to unstable or inefficient behavior.
What would settle it
Running the same suite of simulated and real-world manipulation tasks with the adaptive method disabled and finding equal or superior results with any fixed chunk size would falsify the improvement claim.
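That check can be phrased as a small evaluation harness: sweep fixed chunk sizes over the same task suite and compare against the adaptive variant. The evaluate callable, the task list, and the trial counts below are placeholders for whatever benchmark setup is available, not the paper's protocol.

```python
FIXED_SIZES = [1, 4, 8, 16]

def falsification_check(policy, tasks, evaluate, n_trials=50):
    """evaluate(policy, task, chunk_size=None, adaptive=False, n_trials=...) is assumed
    to return a success rate in [0, 1]; the improvement claim is falsified if some
    single fixed chunk size matches or beats the adaptive strategy across the suite."""
    adaptive = {t: evaluate(policy, t, adaptive=True, n_trials=n_trials) for t in tasks}
    fixed = {k: {t: evaluate(policy, t, chunk_size=k, n_trials=n_trials) for t in tasks}
             for k in FIXED_SIZES}

    adaptive_mean = sum(adaptive.values()) / len(tasks)
    best_k, best_fixed_mean = max(
        ((k, sum(r.values()) / len(tasks)) for k, r in fixed.items()),
        key=lambda kv: kv[1],
    )
    return {
        "adaptive_mean": adaptive_mean,
        "best_fixed_size": best_k,
        "best_fixed_mean": best_fixed_mean,
        "claim_falsified": best_fixed_mean >= adaptive_mean,
    }
```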
Original abstract
In Vision-Language-Action (VLA) models, action chunking (i.e., executing a sequence of actions without intermediate replanning) is a key technique to improve robotic manipulation abilities. However, a large chunk size reduces the model's responsiveness to new information, while a small one increases the likelihood of mode-jumping, jerky behavior resulting from discontinuities between chunks. Therefore, selecting the optimal chunk size is an urgent demand to balance the model's reactivity and consistency. Unfortunately, a dominant trend in current VLA models is an empirical fixed chunk length at inference-time, hindering their superiority and scalability across diverse manipulation tasks. To address this issue, we propose a novel Adaptive Action Chunking (AAC) strategy, which exploits action entropy as the cue to adaptively determine the chunk size based on current predictions. Extensive experiments on a wide range of simulated and real-world robotic manipulation tasks have demonstrated that our approach substantially improves performance over the state-of-the-art alternatives. The videos and source code are publicly available at https://lance-lot.github.io/adaptive-chunking.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Adaptive Action Chunking (AAC) for Vision-Language-Action (VLA) models. It uses action entropy computed from the model's output distribution at each step to dynamically select chunk size at inference time (higher entropy yields smaller chunks), aiming to balance reactivity to new observations against consistency and avoidance of mode-jumping. The central claim is that this entropy-driven adaptation yields substantial performance gains over fixed-chunk SOTA baselines across a range of simulated and real-world robotic manipulation tasks.
Significance. If the entropy signal proves reliable and general without per-task calibration, the method would address a practical limitation in current VLA deployments by enabling task-adaptive chunking without retraining or architectural changes. The public release of code and videos is a positive factor for reproducibility.
major comments (2)
- [§3.2] The chunk-size mapping is defined as a monotonic function of action entropy with no derivation or comparative analysis showing why entropy dominates other uncertainty signals (e.g., predictive variance or attention entropy). Without this justification or an ablation isolating the signal choice (see the sketch after this list), observed gains could be explained by implicit per-task heuristic tuning rather than a general principle.
- [Experiments] The abstract asserts 'substantial improvements' and 'extensive experiments', but the manuscript provides no quantitative metrics, baseline comparisons, or ablation results on the entropy-to-chunk mapping; this prevents evaluation of effect size, statistical significance, or sensitivity to the mapping parameters.
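To make the requested ablation concrete, here is a sketch of two candidate uncertainty signals fed through the same monotone mapping, so that only the signal choice varies. All names, shapes, and normalization bounds are assumptions for illustration; attention entropy would require access to internal attention weights and is omitted.

```python
import numpy as np

def entropy_signal(probs: np.ndarray) -> float:
    """Mean entropy of per-step categorical action distributions, shape (chunk_len, bins)."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def variance_signal(sampled_chunks: np.ndarray) -> float:
    """Mean per-dimension variance across chunks sampled with different seeds,
    shape (n_samples, chunk_len, act_dim); a proxy for predictive variance."""
    return float(sampled_chunks.var(axis=0).mean())

def chunk_size_from_signal(u: float, lo: float, hi: float, k_min=1, k_max=16) -> int:
    """Shared monotonically decreasing map, so the ablation isolates the signal choice.
    lo/hi are per-signal normalization bounds, assumed calibrated on held-out rollouts."""
    t = float(np.clip((u - lo) / (hi - lo), 0.0, 1.0))
    return int(round(k_max - t * (k_max - k_min)))
```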
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a concise statement of the exact functional form used to map entropy to chunk length and any hyperparameters involved.
- [Figures/Tables] Figure captions and experimental tables should explicitly list the fixed chunk sizes used by the compared baselines for fair comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the concerns on theoretical justification for entropy and the presentation of experimental results below, and we will incorporate the suggested additions in the revised manuscript.
Point-by-point responses
-
Referee: [§3.2] The chunk-size mapping is defined as a monotonic function of action entropy with no derivation or comparative analysis showing why entropy dominates other uncertainty signals (e.g., predictive variance or attention entropy). Without this justification or an ablation isolating the signal choice, observed gains could be explained by implicit per-task heuristic tuning rather than a general principle.
Authors: Action entropy is chosen because it directly quantifies uncertainty in the model's per-timestep action distribution, which correlates with the risk of mode-jumping when committing to a chunk. We will add a short information-theoretic derivation in §3.2 and include a new ablation comparing entropy against predictive variance and attention entropy across tasks to demonstrate its relative effectiveness. revision: partial
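One way the promised derivation could be written out, for readers who want the cue made explicit; the exact functional form and hyperparameters in the paper may differ, so treat this as an illustrative formalization rather than the authors' equation.

```latex
% Per-step entropy of the predicted action distribution given observation o_t and instruction l
H_t = -\sum_{a \in \mathcal{A}} p_\theta(a_t = a \mid o_t, \ell)\,\log p_\theta(a_t = a \mid o_t, \ell),
\qquad
\bar{H} = \frac{1}{K_{\max}} \sum_{t=1}^{K_{\max}} H_t .

% Example of a monotonically decreasing entropy-to-chunk-size mapping (illustrative only):
k = K_{\min} + \left\lfloor (K_{\max} - K_{\min})
    \left(1 - \operatorname{clip}\!\left(\frac{\bar{H} - H_{\min}}{H_{\max} - H_{\min}},\, 0,\, 1\right)\right) \right\rfloor .
```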
-
Referee: Experiments section: The abstract asserts 'substantial improvements' and 'extensive experiments' but the manuscript provides no quantitative metrics, baseline comparisons, or ablation results on the entropy-to-chunk mapping; this prevents evaluation of effect size, statistical significance, or sensitivity to the mapping parameters.
Authors: Quantitative results, including success-rate tables versus fixed-chunk baselines (sizes 1/4/8/16) on simulated and real tasks, are reported in Section 4. We will expand this section with explicit ablations on the entropy-to-chunk mapping, parameter sensitivity plots, and statistical significance tests to allow full evaluation of effect sizes. revision: yes
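A sketch of one way the promised significance tests could be run on per-condition success counts, using Fisher's exact test on a 2x2 success/failure table; the counts below are placeholders, not results reported in the paper.

```python
from scipy.stats import fisher_exact

def compare_success_counts(succ_adaptive, n_adaptive, succ_fixed, n_fixed):
    """Two-sided Fisher's exact test comparing adaptive vs. a fixed-chunk baseline."""
    table = [[succ_adaptive, n_adaptive - succ_adaptive],
             [succ_fixed, n_fixed - succ_fixed]]
    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    return odds_ratio, p_value

# Placeholder counts (not from the paper): 50 trials per condition on one task.
odds, p = compare_success_counts(succ_adaptive=42, n_adaptive=50, succ_fixed=33, n_fixed=50)
print(f"odds ratio = {odds:.2f}, p = {p:.3f}")
```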
Circularity Check
No significant circularity; entropy signal is an external heuristic input
full rationale
The paper's core proposal computes action entropy directly from the VLA model's output distribution at each timestep and maps it monotonically to chunk size as a design choice. This mapping is introduced as a novel strategy rather than derived from prior fitted parameters, self-referential equations, or self-citations. No load-bearing step reduces the claimed performance gains to a tautology or to the inputs by construction; the entropy cue is computed from the model's own current predictions and acts as a control heuristic rather than a quantity the performance claims are defined in terms of. The claimed gains are measured against external benchmarks rather than guaranteed by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Action entropy from current predictions reliably indicates when to shorten or lengthen the action chunk
Forward citations
Cited by 4 Pith papers
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...
-
Adaptive Action Chunking via Multi-Chunk Q Value Estimation
ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
Reference graph
Works this paper leans on
-
[1]
Jianxin Bi, Kevin Yuchen Ma, Ce Hao, Mike Zheng Shou, and Harold Soh. VLA-Touch: Enhancing vision-language-action models with dual-level tactile feedback. arXiv preprint arXiv:2507.17294, 2025.
-
[2]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
-
[3]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
-
[4]
Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
-
[5]
RT-1: Robotics Transformer for real-world control at scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for real-world control at scale. Robotics: Science and Systems XIX, 2023.
-
[6]
WorldVLA: Towards Autoregressive Action World Model
Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025.
-
[7]
Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, et al. Fast-in-Slow: A dual-system foundation model unifying fast manipulation within slow reasoning. arXiv preprint arXiv:2506.01953, 2025.
-
[8]
Adaptive action chunk selector
Ruopei Chen, Ke Wang, et al. Adaptive action chunk selector. Stanford CS224R 2025 Final Report, 2025.
-
[9]
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44:1684–1704, 2023.
-
[10]
Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence. In NeurIPS 2025 Workshop on Efficient Reasoning, 2025.
-
[11]
Learning temporal action chunking for motor control
Ruiyu Gou. Learning temporal action chunking for motor control. PhD thesis, University of British Columbia, 2024.
-
[12]
Dita: Scaling diffusion transformer for generalist vision-language-action policy
Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, et al. Dita: Scaling diffusion transformer for generalist vision-language-action policy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7686–7697, 2025.
-
[13]
Tactile-VLA: Unlocking vision-language-action model’s physical knowledge for tactile generalization
Jialei Huang, Shuo Wang, Fanqi Lin, Yihang Hu, Chuan Wen, and Yang Gao. Tactile-VLA: Unlocking vision-language-action model’s physical knowledge for tactile generalization. arXiv preprint arXiv:2507.09160, 2025.
-
[14]
Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, et al. RynnVLA-001: Using human demonstrations to improve robot manipulation. arXiv preprint arXiv:2509.15212, 2025.
-
[15]
Test-time stochasticity estimation for adaptive action chunk selection
Sarosh Khan and Ellie Tanimura. Test-time stochasticity estimation for adaptive action chunk selection. Stanford CS224R 2025 Final Report, 2025.
-
[16]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
-
[17]
OpenVLA: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning, 2025.
-
[18]
HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy
Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Youngyo Seo, and Jinwoo Shin. HAMLET: Switch your vision-language-action model into a history-aware policy. arXiv preprint arXiv:2510.00695, 2025.
-
[19]
Chengmeng Li, Junjie Wen, Yaxin Peng, Yan Peng, and Yichen Zhu. PointVLA: Injecting the 3D world into vision-language-action models. IEEE Robotics and Automation Letters, 11(3):2506–2513, 2026.
-
[20]
Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, et al. BridgeVLA: Input-output alignment for efficient 3D manipulation learning with vision-language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
-
[21]
Wei Li, Renshan Zhang, et al. CogVLA: Cognition-aligned vision-language-action models via instruction-driven routing & sparsification. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
-
[22]
Eagle 2: Building post-training data strategies from scratch for frontier vision-language models
Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models. arXiv preprint arXiv:2501.14818, 2025.
-
[23]
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, et al. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 2023.
-
[24]
Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. HybridVLA: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025.
-
[25]
Bidirectional decoding: Improving action chunking via guided test-time sampling
Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Max Du, and Chelsea Finn. Bidirectional decoding: Improving action chunking via guided test-time sampling. In The Thirteenth International Conference on Learning Representations, 2025.
-
[26]
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093, 2024.
-
[27]
MimicGen: A data generation system for scalable robot learning using human demonstrations
Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. In Conference on Robot Learning, 2023.
-
[28]
RoboCasa: Large-scale simulation of everyday tasks for generalist robots
Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. In RSS 2024 Workshop: Data Generation for Robotics, 2024.
-
[29]
Efficient test-time model adaptation without forgetting
Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Efficient test-time model adaptation without forgetting. In International Conference on Machine Learning, 2022.
-
[30]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.
-
[31]
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025.
-
[32]
Learning affordances at inference-time for vision-language-action models
Ameesh Shah, William Chen, Adwait Godbole, Federico Mora, Sanjit A Seshia, and Sergey Levine. Learning affordances at inference-time for vision-language-action models. arXiv preprint arXiv:2510.19752, 2025.
-
[33]
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, et al. SmolVLA: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025.
-
[34]
OG-VLA: 3D-aware vision language action model via orthographic image generation
Ishika Singh, Ankit Goyal, Stan Birchfield, Dieter Fox, Animesh Garg, and Valts Blukis. OG-VLA: 3D-aware vision language action model via orthographic image generation. arXiv preprint arXiv:2506.01196, 2025.
-
[35]
Improving generative behavior cloning via self-guidance and adaptive chunking
Junhyuk So, Chiwoong Lee, Shinyoung Lee, et al. Improving generative behavior cloning via self-guidance and adaptive chunking. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
-
[36]
Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, et al. Hume: Introducing system-2 thinking in visual-language-action model. arXiv preprint arXiv:2505.21432, 2025.
-
[37]
Wenxuan Song, Jiayi Chen, Pengxiang Ding, Han Zhao, Wei Zhao, Zhide Zhong, Zongyuan Ge, Jun Ma, and Haoang Li. Accelerating vision-language-action model integrated with action chunking via parallel decoding. arXiv preprint arXiv:2503.02310, 2025.
-
[38]
Tent: Fully test-time adaptation by entropy minimization
Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, 2021.
-
[39]
VLA-Adapter: An effective paradigm for tiny-scale vision-language-action model
Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. VLA-Adapter: An effective paradigm for tiny-scale vision-language-action model. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026.
-
[40]
Test-time adapted reinforcement learning with action entropy regularization
Shoukai Xu, Zihao Lian, Mingkui Tan, Liu Liu, Zhong Zhang, and Peilin Zhao. Test-time adapted reinforcement learning with action entropy regularization. In International Conference on Machine Learning, 2025.
-
[41]
ForceVLA: Enhancing VLA models with a force-aware MoE for contact-rich manipulation
Jiawen Yu, Hairuo Liu, Qiaojun Yu, Jieji Ren, Ce Hao, Haitong Ding, Guangyu Huang, Guofan Huang, Yan Song, Panpan Cai, Wenqiang Zhang, and Cewu Lu. ForceVLA: Enhancing VLA models with a force-aware MoE for contact-rich manipulation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
-
[42]
Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, and Hang Zhao. DepthVLA: Enhancing vision-language-action models with depth-aware spatial reasoning. arXiv preprint arXiv:2510.13375, 2025.
-
[43]
3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations
Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations. In ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation, 2024.
-
[44]
4D-VLA: Spatiotemporal vision-language-action pretraining with cross-scene calibration
Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, et al. 4D-VLA: Spatiotemporal vision-language-action pretraining with cross-scene calibration. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
-
[45]
CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, et al. CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
-
[46]
Tony Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. Robotics: Science and Systems XIX, 2023.
-
[47]
3D-VLA: A 3D vision-language-action generative world model
Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3D-VLA: A 3D vision-language-action generative world model. In International Conference on Machine Learning, 2024.
-
[48]
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-VLA: Soft-prompted Transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025.
-
[49]
DexGraspVLA: A vision-language-action framework towards general dexterous grasping
Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Nam Lui, Yuyao Ye, Yitao Liang, et al. DexGraspVLA: A vision-language-action framework towards general dexterous grasping. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026.
-
[50]
Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, and Lichao Sun. LIBERO-PRO: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025.
-
[51]
RT-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, 2023.