arXiv: 2511.15669 · v2 · submitted 2025-10-31 · 💻 cs.LG · cs.AI · cs.RO

DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

Pith reviewed 2026-05-18 03:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.RO
keywords Vision-Language-Action models · Chain-of-Thought reasoning · Decoding alignment · Causal alignment · Hybrid attention decoder · Reinforcement learning for robots · LIBERO benchmark · Robot manipulation

The pith

Chain-of-thought reasoning improves vision-language-action robot models only when decoding and causal alignments are jointly satisfied.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether adding chain-of-thought reasoning to vision-language-action models genuinely helps robots act better or simply adds cost. Systematic tests reveal that prior inconsistent results stem from two missing requirements: thoughts and actions must be produced with separate attention mechanisms suited to their modalities, and the entire reasoning sequence must be optimized against real task outcomes rather than imitation alone. The authors introduce DeepThinkVLA to meet both requirements through a hybrid decoder and a supervised-then-reinforcement learning pipeline. This yields large gains on standard robot benchmarks plus real-world validation. A reader would care because the work supplies concrete rules for when reasoning helps embodied agents instead of leaving the question open.

Core claim

For chain-of-thought to raise performance in vision-language-action models, two conditions must hold together. Decoding alignment requires causal attention for language reasoning paired with bidirectional attention for parallel action generation; routing both through one autoregressive decoder reduces success by 4.2 points. Causal alignment requires that the full reasoning-to-action chain be trained with sparse rewards tied to task success; without this link, supervised reasoning behaves like a reasoning-free baseline and loses 32 points under distribution shift. DeepThinkVLA satisfies both conditions and records 97.0 percent success on LIBERO, 79.0 percent robustness on LIBERO-Plus, and 59.3 percent success on RoboTwin 2.0.
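
One way to picture causal alignment is a group-relative credit scheme: several rollouts of the same task each receive a sparse 0/1 success reward, and every reasoning and action token in a rollout shares that rollout's normalized advantage. The sketch below is illustrative only. The source says the rewards are sparse and tied to task success and that credit assignment is grouped; the specific normalization (mean and standard deviation within the group) and the name `group_relative_advantages` are assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize sparse task-success rewards within one group of rollouts.

    rewards: list of 0/1 outcomes, one per rollout of the same task.
    Returns one advantage per rollout. In this sketch the same value would
    be applied to every reasoning and action token of that rollout.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division finite when all rollouts share one outcome.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts: two successes, two failures.
advantages = group_relative_advantages([1, 0, 0, 1])
```

In this scheme successful rollouts receive a positive advantage and failed ones a negative advantage. When every rollout in a group has the same outcome, all advantages are zero and the group contributes no learning signal.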

What carries the argument

Hybrid-attention decoder that applies causal attention to language reasoning and bidirectional attention to parallel action decoding, together with a two-stage supervised-fine-tuning then reinforcement-learning pipeline that aligns the full reasoning-action chain to sparse task-success rewards.
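
A minimal sketch of such a hybrid mask, assuming the token layout [observation and instruction prefix | reasoning tokens | action tokens]. The layout, the causal treatment of the prefix, and the name `hybrid_attention_mask` are assumptions made for illustration; the paper's exact masks are not given in this summary.

```python
def hybrid_attention_mask(n_prefix, n_reason, n_action):
    """Build a boolean attention mask as a list of lists.

    mask[i][j] is True when query token i may attend to key token j.
    Prefix and reasoning tokens use causal attention; action tokens see
    everything before them plus every other action token.
    """
    n = n_prefix + n_reason + n_action
    action_start = n_prefix + n_reason

    # Causal (lower-triangular) mask for the whole sequence.
    mask = [[j <= i for j in range(n)] for i in range(n)]

    # Open the action block in both directions so the whole action chunk
    # can be decoded in a single parallel pass.
    for i in range(action_start, n):
        for j in range(action_start, n):
            mask[i][j] = True
    return mask


# Hypothetical sizes: 2 prefix tokens, 3 reasoning tokens, 4 action tokens.
mask = hybrid_attention_mask(2, 3, 4)
```

Reasoning tokens never attend to action tokens in this layout, so the language part still decodes left to right, while each action token can attend to later action tokens.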

If this is right

  • Single autoregressive decoding for both reasoning and actions actively harms performance instead of being neutral.
  • Supervised chain-of-thought without outcome rewards collapses under distribution shift exactly as a no-reasoning baseline does.
  • The hybrid decoder plus two-stage pipeline produces 97 percent success on LIBERO and 59.3 percent on RoboTwin 2.0.
  • Real-world robot experiments confirm the same pattern observed in simulation.
  • The same two alignments can be applied to other vision-language-action architectures to recover similar robustness gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future vision-language-action systems may need to treat reasoning and action generation as architecturally distinct from the start rather than adding reasoning as an optional prefix.
  • The findings suggest testing whether the same decoding and causal requirements appear in non-robot multimodal models that interleave language and continuous outputs.
  • A direct comparison that applies the two-stage pipeline to an existing baseline without the hybrid decoder would isolate how much each alignment contributes.
  • Extending the outcome-based reward to denser intermediate signals could further reduce the 32-point drop seen under distribution shift.

Load-bearing premise

The gains come primarily from satisfying the two identified alignments rather than from other unmeasured details of the training data, model size, or benchmark construction.

What would settle it

A controlled run that keeps the hybrid decoder but removes the outcome-based reinforcement stage and still matches the reported 97 percent LIBERO success rate would falsify the necessity of causal alignment.

Figures

Figures reproduced from arXiv: 2511.15669 by Cheng Yin, Sikyuen Tam, Wang Xu, Xiangrui Zeng, Yankai Lin, Zhiyuan Liu, Zhouping Yin.

Figure 1: Comparison of VLA architectures. Existing designs adopt either fully autoregressive ...
Figure 2: Pipeline for constructing an embodied CoT dataset. Stage 1 extracts keyframes via gripper ...
Figure 3: Reinforcement learning stage with grouped credit assignment. The model generates CoT ...
Figure 4: Effect of RL on long-horizon task performance (LIBERO-Long). Bars show base SR for ...
Figure 5: "Think before acting" enables error recovery. Comparison of rollouts on a LIBERO task. ...
Figure 6: Prompt for Constructing CoT Data at Keyframes Using a Cloud-based LVLM: The prompt ...
Original abstract

Does Chain-of-Thought (CoT) reasoning genuinely improve Vision-Language-Action (VLA) models, or does it merely add overhead? Existing CoT-VLA systems report limited and inconsistent gains, yet no prior work has rigorously diagnosed when and why CoT helps robots act. Through systematic experiments, we identify two necessary conditions that must be jointly satisfied for CoT to be effective in VLA: (1) Decoding Alignment: CoT and actions must be generated with modality-appropriate mechanisms; forcing both through a single autoregressive decoder is not merely suboptimal but actively harmful, degrading performance by 4.2 percentage points; (2) Causal Alignment: CoT must be causally linked to task success via outcome-based optimization; without it, supervised CoT is indistinguishable from no reasoning at all under distribution shift, exhibiting a 32.0 pp performance drop nearly identical to the 31.6 pp drop of a reasoning-free baseline. Guided by these findings, we build DeepThinkVLA: a hybrid-attention decoder satisfies Condition 1 by pairing causal attention for language with bidirectional attention for parallel action decoding, while a two-stage SFT-then-RL pipeline satisfies Condition 2 by aligning the full reasoning-action chain with sparse task-success rewards. DeepThinkVLA achieves 97.0% success on LIBERO, 79.0% robustness on LIBERO-Plus (vs. 61.6% for π0-FAST), and 59.3% success on RoboTwin 2.0, exceeding the strongest baseline by 21.7 points. Furthermore, we validate the practical effectiveness of our approach through real-world robot experiments. Code available at https://github.com/OpenBMB/DeepThinkVLA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces DeepThinkVLA, a Vision-Language-Action model that incorporates Chain-of-Thought reasoning. Through systematic experiments, the authors identify two jointly necessary conditions for CoT to be effective: (1) Decoding Alignment, where CoT and actions must use modality-appropriate mechanisms (forcing both through a single autoregressive decoder degrades performance by 4.2 pp); (2) Causal Alignment, where CoT must be linked to task success via outcome-based RL (supervised CoT without it yields a 32.0 pp drop under distribution shift, matching the no-reasoning baseline). They propose a hybrid-attention decoder (causal for language, bidirectional for parallel actions) and a two-stage SFT-then-RL pipeline, reporting 97.0% success on LIBERO, 79.0% robustness on LIBERO-Plus (vs. 61.6% for π0-FAST), 59.3% on RoboTwin 2.0 (exceeding strongest baseline by 21.7 pp), and real-world robot validation.

Significance. If the central claims hold, the work offers a principled diagnosis of when CoT improves VLA performance rather than adding overhead, with concrete architectural and optimization fixes that yield substantial gains on standard benchmarks plus real-robot confirmation. The empirical focus on decoding and causal alignments, combined with code release, supports reproducibility and could guide future VLA designs for robustness under shift.

major comments (1)
  1. Systematic experiments section diagnosing the two conditions: the necessity claims rest on ablations showing 4.2 pp and 32.0 pp drops. These ablations must keep the base model, training data volume, optimizer schedule, and total compute identical while toggling only the decoder attention pattern or the SFT-vs-RL stage. If any of those factors covary with the tested condition, the gaps cannot be attributed specifically to decoding or causal alignment rather than incidental pipeline differences. Clarify and confirm this control in the revised manuscript.
minor comments (2)
  1. Abstract and methods: the hybrid-attention decoder is described at a high level; add a precise specification of the attention masks and how parallel action decoding is implemented to aid replication.
  2. Results tables: ensure all reported success rates include the number of evaluation episodes or trials and any variance measures for the benchmark numbers.
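
To illustrate why the second minor comment matters, the sketch below computes a 95% Wilson score interval for a success rate. The episode counts are hypothetical, since this summary does not state how many evaluation episodes were run, and the name `wilson_interval` is introduced here for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate.

    z=1.96 corresponds to an approximate 95% confidence level.
    Returns (lower, upper) as fractions in [0, 1].
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# The same 97% success rate under two hypothetical evaluation budgets.
small = wilson_interval(97, 100)
large = wilson_interval(970, 1000)
```

The interval from 100 episodes is noticeably wider than the one from 1000, so a headline rate is hard to compare across methods unless the trial count is reported alongside it.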

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding experimental controls below and will revise the paper accordingly to improve clarity.

  1. Referee: Systematic experiments section diagnosing the two conditions: the necessity claims rest on ablations showing 4.2 pp and 32.0 pp drops. These ablations must keep the base model, training data volume, optimizer schedule, and total compute identical while toggling only the decoder attention pattern or the SFT-vs-RL stage. If any of those factors covary with the tested condition, the gaps cannot be attributed specifically to decoding or causal alignment rather than incidental pipeline differences. Clarify and confirm this control in the revised manuscript.

    Authors: We appreciate the referee's emphasis on rigorous experimental controls. In the ablations for Decoding Alignment, we held the base model, training data volume and composition, optimizer, learning rate schedule, and total compute budget fixed, varying only the attention mechanism (single autoregressive decoder versus hybrid causal-bidirectional). For Causal Alignment, the SFT-only versus SFT-then-RL comparisons likewise used identical base models, datasets, optimizer schedules, and compute, differing solely in the addition of the outcome-based RL stage. These controls are described in the experimental setup but were not stated with sufficient explicitness in the Systematic Experiments section. We will revise the manuscript to add a dedicated paragraph confirming that all listed factors were matched across conditions, ensuring the reported gaps (4.2 pp and 32.0 pp) are attributable only to the toggled variables.

    Revision: yes
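
The matched-controls requirement discussed in this exchange can be checked mechanically: two ablation runs should differ in exactly one configuration entry. The sketch below is a generic illustration; the configuration keys and values are hypothetical placeholders and do not come from the paper or its code release.

```python
_MISSING = object()  # sentinel so a missing key never equals a stored None


def differing_keys(config_a, config_b):
    """Return the sorted list of keys whose values differ between two run configs."""
    keys = set(config_a) | set(config_b)
    return sorted(
        k for k in keys
        if config_a.get(k, _MISSING) != config_b.get(k, _MISSING)
    )


# Hypothetical configs for the decoding-alignment ablation.
single_decoder_run = {
    "base_model": "placeholder-base",
    "train_data": "placeholder-dataset",
    "lr_schedule": "placeholder-schedule",
    "compute_budget": "placeholder-budget",
    "decoder": "single_autoregressive",
}
hybrid_decoder_run = dict(single_decoder_run, decoder="hybrid")

# A clean ablation toggles only the factor under test.
changed = differing_keys(single_decoder_run, hybrid_decoder_run)
```

If `changed` contains anything besides the toggled factor, the performance gap cannot be attributed to that factor alone.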

Circularity Check

0 steps flagged

No circularity: empirical claims grounded in external benchmarks and task-success signals

The paper's derivation proceeds from systematic ablations on LIBERO, LIBERO-Plus, and RoboTwin 2.0 that measure performance drops when CoT and actions share a single autoregressive decoder or when supervised CoT lacks outcome-based RL. These conditions are diagnosed using external task-success rewards and distribution-shift metrics that are independent of the model's internal reasoning tokens. The hybrid-attention decoder and SFT-then-RL pipeline are then constructed to satisfy the diagnosed conditions. No equations reduce to their own inputs by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or uniqueness theorems appear in the provided text. Results are further validated on real-world robots, confirming the chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the two conditions appear derived from experiments rather than postulated a priori.




Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    Mini-BEHAVIOR-Gran benchmark reveals a U-shaped effect of instruction granularity on embodied agent performance, with planning-width correlating best and coarse instructions linked to vision-dominant shallow policies.

  2. Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks

    cs.MA 2026-04 unverdicted novelty 7.0

    PE-MAMoE combines sparsely gated mixture-of-experts actors with a non-parametric phase controller in MAPPO to maintain plasticity under dynamic user mobility and traffic, yielding 26.3% higher normalized IQM return in...

  3. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  4. SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 6.0

    SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.

  5. DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    DA-PTQ quantizes VLAs by compensating cross-space distortions and allocating mixed precision to minimize motion errors and kinematic drift in trajectories.
