arxiv: 2605.11567 · v2 · pith:MHFEI5HEnew · submitted 2026-05-12 · 💻 cs.CV

Dynamic Execution Commitment of Vision-Language-Action Models

Pith reviewed 2026-05-20 23:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords Vision-Language-Action models · action chunking · adaptive execution · prefix verification · consensus scoring · dynamic horizon · VLA robustness

The pith

A3 lets VLA models dynamically select safe action execution lengths using internal consensus instead of fixed manual horizons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-Language-Action models predict short sequences of low-level actions but must decide how far to commit them to the real world, a choice normally set by fixed horizons tuned separately for each task. The paper introduces A3 to replace that manual step with an automatic verification process that runs inside the model itself. It draws multiple action trajectories, measures how much they agree on each step, then checks whether lower-agreement steps remain stable when re-predicted from higher-agreement ones and whether the whole sequence stays physically consistent from the start. If the checks hold, the longest unbroken prefix that passes both tests becomes the execution horizon. A sympathetic reader would care because this removes a brittle tuning step that often breaks when the environment changes or the model sees new situations.

Core claim

A3 reframes dynamic execution commitment in VLA models as a self-speculative prefix verification problem. It first obtains a trajectory-wise consensus score through group sampling, then applies consensus-ordered conditional invariance to validate lower-consensus actions by re-decoding them conditioned on higher-consensus ones, and prefix-closed sequential consistency to accept only the longest continuous verified sequence from the beginning. The resulting execution horizon is therefore the longest prefix that satisfies both the model's internal logic and sequential execution constraints, removing the need for manual tuning while improving the robustness-throughput trade-off across models and benchmarks.
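As a rough illustration of the first step, the sketch below computes a per-step consensus score from a group of sampled action chunks. It is not the paper's implementation: the 1-D delta-action model, the `induced_states` and `consensus_scores` helpers, and the tolerance `tol` are all assumptions made for the example, and "agreement" is taken to mean closeness to the per-step median state.

```python
from statistics import median

def induced_states(chunk):
    """Integrate 1-D delta actions into trajectory states (assumed action model)."""
    states, pos = [], 0.0
    for delta in chunk:
        pos += delta
        states.append(pos)
    return states

def consensus_scores(chunks, tol=0.05):
    """Per-step fraction of sampled trajectories within `tol` of the step median."""
    trajs = [induced_states(c) for c in chunks]
    scores = []
    for t in range(len(trajs[0])):
        column = [traj[t] for traj in trajs]
        center = median(column)
        agree = sum(abs(x - center) <= tol for x in column)
        scores.append(agree / len(column))
    return scores

# K = 4 hypothetical sampled chunks, 3 delta actions each.
chunks = [
    [0.10, 0.10, 0.10],
    [0.10, 0.11, 0.30],
    [0.10, 0.09, -0.20],
    [0.10, 0.10, 0.12],
]
scores = consensus_scores(chunks)
print(scores)  # [1.0, 1.0, 0.5]
```

In this toy group the samples agree on the first two steps and split on the third, so only the last step receives a low score and would be the one singled out for further verification.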

What carries the argument

The A3 Adaptive Action Acceptance mechanism that computes consensus scores via group sampling and verifies action prefixes with conditional invariance and sequential consistency checks.

If this is right

  • The execution horizon is determined automatically as the longest verifiable prefix rather than a preset value.
  • Manual per-task horizon tuning is eliminated across benchmarks.
  • A better balance between execution success rate and inference throughput is obtained compared with fixed-horizon baselines.
  • The method applies to diverse VLA models without architecture-specific changes.
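The first point, that the horizon is the longest verifiable prefix, can be sketched in a few lines. This is a minimal illustration rather than the authors' code: `execution_horizon` is a hypothetical helper, and the `min_horizon` floor is an assumption of the sketch that the source does not state.

```python
def execution_horizon(verified, min_horizon=1):
    """Length of the longest all-True prefix, floored at `min_horizon`.

    `min_horizon` is an assumption of this sketch (always execute at least
    one action so the robot makes progress); the source does not state it.
    """
    prefix = 0
    for ok in verified:
        if not ok:
            break  # a failed step closes the prefix; later passes are ignored
        prefix += 1
    return max(prefix, min_horizon)

# Steps 3 and 4 pass on their own but sit behind the failed step 2.
print(execution_horizon([True, True, False, True, True]))  # 2
print(execution_horizon([False, True, True]))              # 1 (floor applies)
```

Stopping at the first failure is what makes the rule prefix-closed: an action is only executed if every action before it was also accepted.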

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consensus-and-prefix verification pattern could be tested on other autoregressive planners that output action sequences.
  • Integration with external safety filters might further reduce the risk of accepting an unsafe prefix that passes internal checks.
  • Measuring how often the accepted prefix length varies within a single long-horizon task would show how much adaptivity the method actually uses.
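The last suggestion could be checked with a small summary over the horizons accepted during one episode. The sketch below assumes a hypothetical log with one accepted prefix length per replanning step; the `horizon_adaptivity` helper and the example values are illustrative only and are not results from the paper.

```python
from statistics import mean, pstdev

def horizon_adaptivity(horizons):
    """Summarize how much the accepted horizon varies within one episode."""
    changes = sum(a != b for a, b in zip(horizons, horizons[1:]))
    return {
        "mean": mean(horizons),
        "std": pstdev(horizons),
        "min": min(horizons),
        "max": max(horizons),
        # Fraction of consecutive replanning steps where the horizon changed.
        "change_rate": changes / max(len(horizons) - 1, 1),
    }

# Hypothetical log: accepted prefix length at each replanning step.
log = [8, 8, 3, 1, 6, 8]
stats = horizon_adaptivity(log)
print(stats["min"], stats["max"], stats["change_rate"])  # 1 8 0.8
```

A change rate near zero would suggest the method behaves like a fixed horizon on that task, while a wide min-to-max range would indicate that the adaptivity is actually being used.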

Load-bearing premise

A trajectory-wise consensus score obtained via group sampling, combined with conditional invariance and prefix-closed consistency checks, reliably identifies the longest physically safe execution prefix in real-world dynamic or out-of-distribution conditions.

What would settle it

Deploy A3 on a VLA model in a changing physical environment and observe whether the automatically chosen prefixes yield fewer task failures, at no higher inference cost, than the best manually tuned fixed horizons on the same tasks.

Figures

Figures reproduced from arXiv: 2605.11567 by Boying Li, Feng Chen, Xianghui Wang, Yefei He, Yicheng Wu, Yuxuan Chen, Zeyu Zhang.

Figure 1
Figure 1: Performance analysis of π-0.5 under varying execution horizons on the LIBERO benchmark [8]. (a) Success rate first increases and then decreases as the horizon increases, dropping below 80% when the horizon is larger than 15 for most suites. (b) Completion steps increase substantially with a larger horizon, as failed recoveries and compounding errors drive total steps up to 1.5× higher than at horizon=1. (c) Forward …
Figure 2
Figure 2: Overview of A3. Given the current observation and instruction, the VLM backbone and action expert generate K candidate action chunks. The chunks are mapped to induced trajectory states, from which the dominant mode is identified via clustering and its medoid selected as the primary draft; per-step consensus scores reflect the model’s self-consistency at each action position. The draft then undergoes dual…
Figure 3
Figure 3: Trade-off between success rate (top row) and forward calls (bottom row) across execution
Figure 4
Figure 4: Visualization of the execution horizon across different tasks.
Figure 5
Figure 5: Two representative failure cases. (a) Misalignment between the mug handle and the hook in the hang mug task. (b) Self-occlusion of the inverted mug rim by the gripper in the flip mug task.
Figure 6
Figure 6: Implementation of dual verification tree.
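The draft-selection step named in the Figure 2 caption (dominant mode, then its medoid) can be sketched as follows. The caption does not say which clustering method is used, so this example swaps in a simple radius-based neighborhood grouping; `select_draft`, `distance`, and `radius` are hypothetical names and values chosen for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length trajectories."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_draft(trajs, radius=0.2):
    """Return the index of the medoid of the largest neighborhood."""
    # Neighborhood of each trajectory: indices within `radius` (includes itself).
    hoods = [[j for j, other in enumerate(trajs) if distance(t, other) <= radius]
             for t in trajs]
    mode = max(hoods, key=len)  # largest neighborhood stands in for the dominant mode
    # Medoid: the mode member with the smallest total distance to the others.
    return min(mode, key=lambda i: sum(distance(trajs[i], trajs[j]) for j in mode))

trajs = [
    [0.10, 0.20, 0.30],
    [0.10, 0.21, 0.33],
    [0.10, 0.19, 0.27],
    [0.10, 0.50, 0.90],  # an outlying candidate that forms its own group
]
draft_index = select_draft(trajs)
print(draft_index)  # 0
```

Choosing a medoid rather than a mean keeps the draft an actual sampled chunk, so it is something the model really produced rather than an average that may lie between modes.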
Original abstract

Vision-Language-Action (VLA) models predominantly adopt action chunking, i.e., predicting and committing to a short horizon of consecutive low-level actions in a single forward pass, to amortize the inference cost of large-scale backbones and reduce per-step latency. However, committing these multi-step predictions to real-world execution requires balancing success rate against inference efficiency, a decision typically governed by fixed execution horizons tuned per task. Such heuristics ignore the state-dependent nature of predictive reliability, leading to brittle performance in dynamic or out-of-distribution settings. In this paper, we introduce A3, an Adaptive Action Acceptance mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem. A3 first computes a trajectory-wise consensus score of actions via group sampling, then selects a representative draft and prioritizes downstream verification. Specifically, it enforces: (1) consensus-ordered conditional invariance, which validates low-consensus actions by judging whether they remain consistent when re-decoded conditioned on high-consensus actions; and (2) prefix-closed sequential consistency, which guarantees physical rollout integrity by accepting only the longest continuous sequence of verified actions starting from the beginning. Consequently, the execution horizon emerges as the longest verifiable prefix satisfying both internal model logic and sequential execution constraints. Experiments across diverse VLA models and benchmarks demonstrate that A3 eliminates the need for manual horizon tuning while achieving a superior trade-off between execution robustness and inference throughput.
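One way to picture check (1), consensus-ordered conditional invariance, is the sketch below. It is an interpretation under stated assumptions, not the paper's procedure: `redecode(context, t)` is a hypothetical stand-in for re-running the action decoder on step `t` with the steps in `context` held fixed, the toy decoder simply averages the conditioning actions, and seeding the context with the highest-consensus step is an assumption of the example.

```python
def conditional_invariance(draft, scores, redecode, tol=0.05):
    """Verify draft steps in descending consensus order.

    A step passes if its re-decoded value, conditioned on the already
    accepted higher-consensus steps, stays within `tol` of the draft.
    """
    order = sorted(range(len(draft)), key=lambda t: -scores[t])
    context = {}  # step index -> action accepted as conditioning
    verified = [False] * len(draft)
    for t in order:
        if not context:
            verified[t] = True  # assumption: the top-consensus step seeds the context
        else:
            verified[t] = abs(redecode(context, t) - draft[t]) <= tol
        if verified[t]:
            context[t] = draft[t]
    return verified

def toy_redecode(context, t):
    """Hypothetical decoder stub: predicts the mean of the conditioning actions."""
    return sum(context.values()) / len(context)

draft = [0.10, 0.11, 0.40, 0.10]
scores = [1.0, 0.9, 0.3, 0.8]
verified = conditional_invariance(draft, scores, toy_redecode)
print(verified)  # [True, True, False, True]
```

Here step 3 (index 2) fails the invariance check even though the step after it passes, which is exactly the situation check (2) handles by accepting only the prefix that ends before the first failure.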

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces A3, an Adaptive Action Acceptance mechanism for Vision-Language-Action (VLA) models. It reframes execution commitment as a self-speculative prefix verification task: a trajectory-wise consensus score is computed via group sampling, a representative draft is selected, and verification proceeds via (1) consensus-ordered conditional invariance (re-decoding low-consensus actions conditioned on high-consensus actions) and (2) prefix-closed sequential consistency. The resulting execution horizon is the longest prefix satisfying both checks, eliminating manual horizon tuning and purportedly delivering a superior robustness-throughput trade-off across VLA models and benchmarks.

Significance. If the internal consensus and consistency metrics correlate with physical rollout success, the method would remove a common source of brittleness in VLA deployment by making horizon selection state-dependent and parameter-free. The absence of external state feedback or sensor noise in the verification loop, however, leaves open whether the claimed gains survive dynamic or out-of-distribution conditions.

major comments (2)
  1. Abstract: the central claim that A3 'eliminates the need for manual horizon tuning while achieving a superior trade-off' is load-bearing for the contribution, yet the abstract supplies no quantitative results, error bars, benchmark names, or baseline comparisons, rendering the experimental superiority assertion unevaluable from the provided text.
  2. The A3 mechanism (described in the abstract and presumably §3): the trajectory-wise consensus score, conditional invariance, and prefix-closed consistency checks operate entirely inside the model's output distribution and do not incorporate external state feedback, sensor noise, or unmodeled dynamics. This directly undermines the claim that the selected prefix is 'physically safe' in real-world or OOD regimes, as an internally consistent sequence can still compound prediction errors or fail under external perturbations.
minor comments (1)
  1. Abstract: the phrasing 'prioritizes downstream verification' is slightly ambiguous; a short clarifying sentence or diagram would help readers follow the flow from consensus scoring to prefix selection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating planned revisions where appropriate. We believe these responses strengthen the presentation of A3 without altering its core technical claims.

Point-by-point responses
  1. Referee: Abstract: the central claim that A3 'eliminates the need for manual horizon tuning while achieving a superior trade-off' is load-bearing for the contribution, yet the abstract supplies no quantitative results, error bars, benchmark names, or baseline comparisons, rendering the experimental superiority assertion unevaluable from the provided text.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative support. The current abstract summarizes the experimental outcomes at a high level but omits specific metrics. In the revised version we will add key results, such as average success-rate gains and throughput improvements across the evaluated VLA models and benchmarks, together with baseline comparisons and error bars. This change will make the superiority claim directly evaluable from the abstract. revision: yes

  2. Referee: The A3 mechanism (described in the abstract and presumably §3): the trajectory-wise consensus score, conditional invariance, and prefix-closed consistency checks operate entirely inside the model's output distribution and do not incorporate external state feedback, sensor noise, or unmodeled dynamics. This directly undermines the claim that the selected prefix is 'physically safe' in real-world or OOD regimes, as an internally consistent sequence can still compound prediction errors or fail under external perturbations.

    Authors: We acknowledge that A3 performs verification entirely within the model's output distribution and does not ingest external state feedback or sensor noise. The method is intentionally self-contained so that prefix selection can occur at inference time without additional hardware or environment access. The consensus-ordered conditional invariance and prefix-closed sequential consistency checks are designed to detect internal instability that often precedes compounding errors, and our experiments show that the resulting adaptive horizons improve robustness relative to fixed-horizon baselines on the tested benchmarks. We do not claim that internal consistency guarantees physical safety under arbitrary external perturbations. In the revision we will (1) clarify the scope of the 'physical rollout integrity' phrasing to emphasize that it refers to sequential consistency under the model's own distribution and (2) add an explicit limitations paragraph discussing the absence of external feedback and the need for future integration with real-world sensing. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in A3's procedural definition of adaptive execution horizon.

Full rationale

The paper defines A3 as a self-speculative prefix verification process: compute trajectory-wise consensus via group sampling, enforce consensus-ordered conditional invariance by re-decoding low-consensus actions conditioned on high-consensus actions, and apply prefix-closed sequential consistency to select the longest verifiable prefix. This constructs the execution horizon directly from the internal checks rather than deriving it from external equations or prior fitted parameters. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled in via prior work, and no known results are renamed. The central claims of eliminating manual tuning and achieving superior robustness-throughput trade-offs are presented as empirical outcomes across VLA models and benchmarks, which remain independent of the internal definitions. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on domain assumptions about sampling-based consensus reflecting true predictive reliability and on the newly introduced verification procedures.

axioms (2)
  • domain assumption: Group sampling yields a meaningful trajectory-wise consensus score that indicates action reliability.
    Central to selecting representative drafts and prioritizing verification.
  • domain assumption: Conditional re-decoding of low-consensus actions given high-consensus actions preserves physical rollout integrity.
    Used to enforce consensus-ordered conditional invariance.
invented entities (1)
  • A3 Adaptive Action Acceptance mechanism (no independent evidence)
    purpose: Dynamically determine the execution horizon via self-speculative prefix verification.
    Newly proposed framework that reframes the commitment decision.

pith-pipeline@v0.9.0 · 5798 in / 1284 out tokens · 36675 ms · 2026-05-20T23:05:53.531898+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 15 internal anchors

  1. [1]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi-0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  2. [2]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  3. [3]

    GR-3 Technical Report

    Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report. arXiv preprint arXiv:2507.15493, 2025

  4. [4]

    A survey on efficient vision-language-action models

    Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Lianli Gao, Jingkuan Song, Nicu Sebe, and Heng Tao Shen. A survey on efficient vision-language-action models, 2025. arXiv preprint arXiv:2510.24795

  5. [5]

    A Survey on Vision-Language-Action Models for Embodied AI

    Yecheng Jason Ma, Zhen Song, Yu Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied ai, 2024. arXiv preprint arXiv:2405.14093

  6. [6]

    Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

    Runze Shao, Wenxuan Li, Lei Zhang, Rui Zhang, Zhicheng Liu, Ruocheng Chen, and Liqiang Nie. Large vlm-based vision-language-action models for robotic manipulation: A survey, 2025. arXiv preprint arXiv:2508.13073

  7. [7]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  8. [8]

    Libero: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  9. [9]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi-0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  10. [10]

    Mixture of horizons in action chunking

    Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun, Yunchao Yao, Zhenyu Wei, Yunhui Liu, Zhiwu Lu, and Mingyu Ding. Mixture of horizons in action chunking. arXiv preprint arXiv:2511.19433, 2025

  11. [11]

    Everydayvla: A vision-language-action model for affordable robotic manipulation

    Samarth Chopra, Alex McMoil, Ben Carnovale, Evan Sokolson, Rajkumar Kubendran, and Samuel Dickerson. Everydayvla: A vision-language-action model for affordable robotic manipulation. arXiv preprint arXiv:2511.05397, 2025

  12. [12]

    Vla knows its limits

    Haoxuan Wang, Gengyu Zhang, Yan Yan, Ramana Rao Kompella, and Gaowen Liu. Vla knows its limits. arXiv preprint arXiv:2602.21445, 2026

  13. [13]

    When Attention Sink Emerges in Language Models: An Empirical View

    Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781, 2024

  14. [14]

    See what you are told: Visual attention sink in large multimodal models

    Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. In The Thirteenth International Conference on Learning Representations, 2025

  15. [15]

    Self speculative decoding for diffusion large language models

    Yifeng Gao, Ziang Ji, Yuxuan Wang, Biqing Qi, Hanlin Xu, and Linfeng Zhang. Self speculative decoding for diffusion large language models. arXiv preprint arXiv:2510.04147, 2025

  16. [16]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023

  17. [17]

    Spatialvla: Exploring spatial representations for visual-language-action models

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Jiayuan Gu, Zhigang Wang, Yan Ding, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representations for visual-language-action models. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025

  18. [18]

    Learning to act anywhere with task-centric latent actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Learning to act anywhere with task-centric latent actions. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025

  19. [19]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  20. [20]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025

  21. [21]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration, Abby O’Neill, et al. Open x-embodiment: Robotic learning datasets and rt-x models, 2023. arXiv preprint arXiv:2310.08864

  22. [22]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

  23. [23]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems (RSS), 2023. arXiv preprint arXiv:2303.04137

  24. [24]

    Accelerating vision-language-action model integrated with action chunking via parallel decoding

    Wei Song, Jie Chen, Peng Ding, Hao Zhao, Wei Zhao, Zhi Zhong, Zhen Ge, Jun Ma, and Hong Li. PD-VLA: Accelerating vision-language-action model integrated with action chunking via parallel decoding, 2025. arXiv preprint arXiv:2503.02310

  25. [25]

    Freqpolicy: Efficient flow-based visuomotor policy via frequency consistency

    Yu Su, Ning Liu, Dong Chen, Zhen Zhao, Kun Wu, Meng Li, Zhi Xu, Zhe Che, and Jie Tang. Freqpolicy: Efficient flow-based visuomotor policy via frequency consistency, 2025. arXiv preprint arXiv:2506.08822

  26. [26]

    Learning fine-grained bimanual manipulation with low-cost hardware

    Tony Z. Zhao et al. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems (RSS), 2023

  27. [27]

    CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

  28. [28]

    RLBench: The robot learning benchmark & learning environment

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. RLBench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  29. [29]

    DROID: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024

  30. [30]

    Zipar: Accelerating autoregressive image generation through spatial locality

    Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, and Bohan Zhuang. Zipar: Accelerating autoregressive image generation through spatial locality. arXiv preprint arXiv:2412.04062, 2(3):4, 2024

  31. [31]

    Swift: On-the-fly self-speculative decoding for llm inference acceleration

    Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, and Wenjie Li. Swift: On-the-fly self-speculative decoding for llm inference acceleration. arXiv preprint arXiv:2410.06916, 2024

  32. [32]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025

  33. [33]

    Explain before you answer: A survey on compositional visual reasoning

    Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, et al. Explain before you answer: A survey on compositional visual reasoning. arXiv preprint arXiv:2508.17298, 2025

  34. [34]

    Spec-vla: Speculative decoding for vision-language-action models with relaxed acceptance

    Shuo Wang, Ruize Yu, Zhiyuan Yuan, Chao Yu, Feng Gao, Yilin Wang, and Derek F. Wong. Spec-vla: Speculative decoding for vision-language-action models with relaxed acceptance. arXiv preprint arXiv:2507.22424

  36. [36]

    An overview of model predictive control

    Kailas S Holkar and Laxman M Waghmare. An overview of model predictive control. International Journal of control and automation, 3(4):47–63, 2010

  37. [37]

    Model predictive control

    Basil Kouvaritakis and Mark Cannon. Model predictive control. Switzerland: Springer International Publishing, 38(13-56):7, 2016

  38. [38]

    Nanovla: Routing decoupled vision-language understanding for nano-sized generalist robotic policies

    Jiahong Chen, Jing Wang, Long Chen, Chuwei Cai, and Jinghui Lu. Nanovla: Routing decoupled vision-language understanding for nano-sized generalist robotic policies. arXiv preprint arXiv:2510.25122, 2025

  39. [39]

    The kinematics of contact and grasp

    David J Montana. The kinematics of contact and grasp. The International Journal of Robotics Research, 7(3):17–32, 1988

  40. [40]

    Rlinf-user: A unified and extensible system for real-world online policy learning in embodied ai

    Hongzhi Zang, Shu’ang Yu, Hao Lin, Tianxing Zhou, Zefang Huang, Zhen Guo, Xin Xu, Jiakai Zhou, Yuze Sheng, Shizhe Zhang, Feng Gao, Wenhao Tang, Yufeng Yue, Quanlu Zhang, Xinlei Chen, Chao Yu, and Yu Wang. Rlinf-user: A unified and extensible system for real-world online policy learning in embodied ai. arXiv preprint arXiv:2602.07837, 2026

  41. [41]

    Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations

    Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations. arXiv preprint arXiv:2107.14483, 2021

  42. [42]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

    Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pages 1094–1100. PMLR, 2020