pith. machine review for the scientific record.

arxiv: 2603.01581 · v2 · submitted 2026-03-02 · 💻 cs.RO · cs.LG

Recognition: no theorem link

KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:36 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords speculative decoding · VLA models · embodied intelligence · Kalman filter · kinematic prediction · robot control · inference optimization

The pith

KERV integrates kinematic predictions to correct speculative decoding errors in VLA robot models, delivering 27-37% speedups with minimal accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces KERV to speed up Vision-Language-Action models used for robot control. These models are slow because they generate actions token by token, and speculative decoding can accelerate them but creates errors that need fixing through re-inference. KERV solves this by using a Kalman Filter based on robot kinematics to predict correct actions and adjust the decoding threshold dynamically. Experiments show consistent acceleration across tasks without hurting success rates. This matters because it makes real-time embodied AI more practical by linking physical robot motion rules directly to the AI inference process.

Core claim

KERV is a framework that combines token-domain VLA models with kinematic-domain prediction for speculative decoding. It uses a kinematics-based Kalman Filter to predict actions and compensate for SD errors without costly re-inference, and a kinematics-based adjustment strategy to dynamically rectify the acceptance threshold, resulting in 27% to 37% acceleration with nearly no Success Rate loss across diverse tasks and environments.

What carries the argument

Kinematics-based Kalman Filter that predicts robot actions to compensate for token errors in speculative decoding of VLA models, along with the dynamic threshold adjustment strategy.
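Neither the filter equations nor the compensation rule appear in the material above, so the sketch below is an editorial reconstruction, not the paper's implementation: a constant-velocity Kalman Filter over the continuous action vector predicts the next action, and a draft token whose decoded action strays too far from that prediction is replaced by the filter's estimate rather than triggering re-inference. The state layout, noise covariances, and distance rule are all assumptions.

```python
import numpy as np

class KinematicCompensator:
    """Hypothetical sketch: a constant-velocity Kalman Filter over a
    d-dimensional robot action, treating decoded SD draft actions as
    noisy position measurements."""

    def __init__(self, dim, dt=0.05, q=1e-3, r=1e-2):
        self.dim = dim
        self.x = np.zeros(2 * dim)             # state: [positions, velocities]
        self.P = np.eye(2 * dim)               # state covariance
        self.F = np.eye(2 * dim)               # transition: p' = p + v*dt
        self.F[:dim, dim:] = dt * np.eye(dim)
        self.H = np.hstack([np.eye(dim), np.zeros((dim, dim))])  # observe p
        self.Q = q * np.eye(2 * dim)           # process noise (assumed)
        self.R = r * np.eye(dim)               # measurement noise (assumed)

    def predict(self):
        """Propagate the kinematic model one step; return the predicted action."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.H @ self.x

    def update(self, z):
        """Fold an accepted action back into the filter; return the
        innovation and its covariance for downstream threshold logic."""
        y = z - self.H @ self.x                       # innovation
        S = self.H @ self.P @ self.H.T + self.R      # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2 * self.dim) - K @ self.H) @ self.P
        return y, S


def rectify(draft_action, kf, threshold):
    """Keep the SD draft if it agrees with kinematics; otherwise substitute
    the filter's prediction instead of re-running the target model."""
    predicted = kf.predict()
    if np.linalg.norm(draft_action - predicted) <= threshold:
        kf.update(draft_action)
        return draft_action, True    # draft accepted
    return predicted, False          # draft replaced without re-inference
```

The fixed `threshold` argument is exactly where the paper's dynamic rectification strategy would plug in; one plausible form of that rule is sketched after the abstract below.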

If this is right

  • Speculative decoding can be applied to real-time robot control without accuracy penalties.
  • Physical kinematics become a direct tool for optimizing AI inference in embodied systems.
  • Threshold tuning for speculative decoding no longer requires manual, task-specific adjustments.
  • Embodied VLA models can achieve higher throughput on existing hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar kinematic rectification might improve other inference optimizations like beam search in robotics.
  • If the approach generalizes, it could allow smaller VLA models to match the speed of larger ones.
  • Integration with other sensor data beyond kinematics could further enhance error compensation.

Load-bearing premise

The Kalman Filter's predictions based on robot kinematics will accurately compensate for speculative decoding errors in diverse real-world environments without causing new failures.

What would settle it

Running KERV on a robot in an environment with dynamics that violate the Kalman Filter model, such as high friction changes or external forces, and checking whether the success rate remains stable.
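That test is straightforward to operationalize. Below is a hedged sketch of such a stress sweep, assuming hypothetical `make_env` and `rollout` helpers and two policy handles; the perturbation knobs, magnitudes, and interfaces are illustrative, not from the paper.

```python
import numpy as np

def success_rate(policy, env_factory, rollout, n=50, seed=0):
    """Mean task success over n episodes; `rollout` is a caller-supplied
    helper returning True on success (hypothetical interface)."""
    rng = np.random.default_rng(seed)
    return np.mean([rollout(policy, env_factory(rng)) for _ in range(n)])

def stress_test(vanilla_vla, kerv_vla, make_env, rollout):
    """Sweep dynamics perturbations that violate a constant-velocity
    kinematic model: friction changes and sustained external pushes
    (both knobs are hypothetical)."""
    for friction in (1.0, 2.0, 5.0):        # multiple of nominal friction
        for push in (0.0, 5.0, 20.0):       # sustained external force, N
            factory = lambda rng: make_env(friction=friction,
                                           external_push=push, rng=rng)
            sr_base = success_rate(vanilla_vla, factory, rollout)
            sr_kerv = success_rate(kerv_vla, factory, rollout)
            # The claim survives if sr_kerv tracks sr_base across the grid.
            print(f"friction={friction} push={push}: "
                  f"SR base={sr_base:.2f} KERV={sr_kerv:.2f}")
```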

Figures

Figures reproduced from arXiv: 2603.01581 by Donggang Cao, Hong Mei, Jiayu Chen, Maoliang Li, Xiang Chen, Xinhao Sun, Zhaobo Zhang, Zhihao Mao, Zihao Zheng.

Figure 1. (a) Naive VLA+SD vs. (b) SpecVLA vs. (c) the proposed KERV, compared across environments (e.g., LIBERO [15]).
Figure 2. Discrepancy about errors, evaluated via Success Rate (SR), speed, average inference steps, per-step latency (T), and SD's Average First Error Position (AFEP).
Figure 4. KF-based compensation mechanism in KERV.
Figure 6. System implementation of KERV via CPU-GPU collaboration.
Figure 7. An experimental example of KERV in the LIBERO-Goal environment.
Figure 8. Experiments on hyper-parameters in KERV.
read the original abstract

Vision-Language-Action (VLA) models build a token-domain robot control paradigm, yet suffer from low speed. Speculative Decoding (SD) is an optimization strategy that can boost inference speed. Two key issues emerge when integrating VLA and SD: first, SD relies on re-inference to address token errors, which is computationally expensive; second, to mitigate token errors, the acceptance threshold in SD requires careful adjustment. Existing works fail to address the above two issues effectively. Meanwhile, as the bridge between AI and the physical world, existing embodied intelligence has overlooked the application of robotic kinematics. To address these issues, we innovatively combine token-domain VLA models with kinematic-domain prediction for SD, proposing a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy to dynamically rectify the acceptance threshold, addressing the difficulty of threshold determination. Experimental results across diverse tasks and environments demonstrate that KERV achieves 27%~37% acceleration with nearly no Success Rate loss.
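The abstract says the acceptance threshold is rectified dynamically from kinematics but does not give the rule. One plausible reading, sketched below and explicitly not the paper's formula, scales a relaxed acceptance threshold by the normalized Kalman innovation: when kinematics corroborate the draft the threshold stays relaxed, and when the innovation grows it tightens. The bounds and the linear decay are illustrative assumptions.

```python
import numpy as np

def rectified_threshold(base_tau, innovation, S, lo=0.5, hi=2.0):
    """Scale a relaxed acceptance threshold by the Mahalanobis norm of
    the Kalman innovation. `lo`/`hi` bounds and the linear decay are
    assumptions for illustration, not the paper's rule."""
    d2 = float(innovation @ np.linalg.solve(S, innovation))
    # Small normalized innovation: kinematics corroborate the draft,
    # so the threshold stays relaxed; large innovation tightens it.
    return base_tau * np.clip(hi - d2, lo, hi)
```

The `innovation` and `S` here are quantities a Kalman update already computes, so an adjustment of this shape would add negligible cost, which is consistent with the acceleration the abstract claims.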

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes KERV, a kinematic-rectified speculative decoding framework for Vision-Language-Action (VLA) models. It employs a kinematics-based Kalman Filter to predict actions and compensate for speculative decoding token errors without re-inference, along with a kinematics-based strategy to dynamically adjust the acceptance threshold. Experiments across diverse tasks and environments are reported to yield 27%–37% acceleration with nearly no success rate loss.

Significance. If the central performance claims hold under rigorous validation, the work could meaningfully advance real-time embodied AI by integrating robotic kinematics with token-domain inference optimizations, reducing reliance on expensive re-inference steps in VLA models while preserving task success.

major comments (2)
  1. [Experiments] Experiments section: The reported 27%~37% acceleration and near-zero success rate loss lack any description of baselines (e.g., standard speculative decoding or vanilla VLA inference), statistical significance tests, error bars, variance across runs, or exact protocols for task/environment diversity, which is load-bearing for the central empirical claim.
  2. [Method] Method section (Kalman Filter and projection details): No derivation or explicit mapping is given for projecting token-domain discrepancies into the continuous kinematic state space, nor for filter initialization, process/measurement noise tuning, or handling of non-linear dynamics (e.g., contact-rich tasks); this directly underpins the claim that the filter compensates errors without re-inference or new failure modes. (A sketch of the standard filter equations follows the minor comments below.)
minor comments (2)
  1. [Abstract] Abstract: The phrase 'nearly no Success Rate loss' is imprecise; quantitative bounds or per-task deltas would improve clarity.
  2. [Method] Notation: The distinction between token-domain corrections and kinematic-state adjustments could be clarified with an explicit diagram or equation linking the two domains.
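For reference, the backbone the referee asks to see written out is presumably the standard discrete-time Kalman Filter; the genuinely missing piece is the projection from draft tokens to a continuous measurement, written here as a hypothetical decoder \(g\) (the symbol and the linear observation model \(H\) are assumptions, not the paper's notation):

```latex
% Standard discrete-time Kalman Filter, with a guessed projection g(.)
% from draft tokens to a continuous action measurement (assumed notation).
\begin{aligned}
\hat{x}_{t\mid t-1} &= F\,\hat{x}_{t-1\mid t-1}, &
P_{t\mid t-1} &= F P_{t-1\mid t-1} F^{\top} + Q, \\
z_t &= g(\text{draft tokens at } t), &
y_t &= z_t - H\,\hat{x}_{t\mid t-1}, \\
S_t &= H P_{t\mid t-1} H^{\top} + R, &
K_t &= P_{t\mid t-1} H^{\top} S_t^{-1}, \\
\hat{x}_{t\mid t} &= \hat{x}_{t\mid t-1} + K_t\,y_t, &
P_{t\mid t} &= \bigl(I - K_t H\bigr) P_{t\mid t-1}.
\end{aligned}
```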

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revising the manuscript to strengthen the presentation of both the experimental protocol and the methodological derivations.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The reported 27%~37% acceleration and near-zero success rate loss lack any description of baselines (e.g., standard speculative decoding or vanilla VLA inference), statistical significance tests, error bars, variance across runs, or exact protocols for task/environment diversity, which is load-bearing for the central empirical claim.

    Authors: We agree that the experimental reporting requires greater clarity and statistical rigor to fully support the central claims. The manuscript already includes comparisons against both vanilla VLA inference and standard speculative decoding (Section 4, Tables 1–3), with the reported speedups measured under identical hardware and model settings. However, we acknowledge that error bars, run-to-run variance, and formal significance tests were not presented. In the revision we will add: (i) mean and standard deviation over 10 independent runs with different random seeds, (ii) paired t-test p-values comparing KERV against each baseline, and (iii) an expanded Section 4.1 that enumerates the exact task suites, environment variations, and success criteria used. These additions will be placed in the main text and supplementary material without changing the reported speedup range or success-rate observations (a sketch of the promised statistics appears after these responses). revision: yes

  2. Referee: [Method] Method section (Kalman Filter and projection details): No derivation or explicit mapping is given for projecting token-domain discrepancies into the continuous kinematic state space, nor for filter initialization, process/measurement noise tuning, or handling of non-linear dynamics (e.g., contact-rich tasks); this directly underpins the claim that the filter compensates errors without re-inference or new failure modes.

    Authors: We accept that the current manuscript presents the Kalman-filter integration at a high level and omits the explicit projection mathematics and parameter choices. In the revised version we will insert a new subsection (3.2.1) containing: (i) the closed-form mapping from token-level discrepancy vectors to continuous kinematic state corrections, (ii) the initialization procedure for the state covariance, (iii) the empirical procedure used to set process and measurement noise covariances, and (iv) a brief discussion of the linearization approximation together with its limitations on contact-rich tasks. These additions will be supported by the corresponding equations and a short ablation on noise sensitivity, thereby clarifying how the filter achieves error compensation without triggering re-inference. revision: yes
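The statistics promised in point 1 are routine to produce. A minimal sketch, with placeholder latency numbers that are not the paper's data:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed latencies (seconds/step) over 10 paired runs.
baseline = np.array([0.210, 0.205, 0.212, 0.208, 0.215,
                     0.209, 0.211, 0.207, 0.213, 0.210])
kerv     = np.array([0.158, 0.155, 0.160, 0.157, 0.162,
                     0.156, 0.159, 0.154, 0.161, 0.158])

# Mean and standard deviation of the per-seed speedup.
speedup = baseline / kerv
print(f"speedup: {speedup.mean():.3f} ± {speedup.std(ddof=1):.3f}")

# Paired t-test: same seeds, so observations are paired.
t, p = stats.ttest_rel(baseline, kerv)
print(f"paired t = {t:.2f}, p = {p:.2e}")
```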

Circularity Check

0 steps flagged

No circularity: derivation relies on standard Kalman Filter applied to kinematic predictions

full rationale

The paper's central mechanism applies a standard kinematics-based Kalman Filter (from robotics literature) to predict actions and adjust SD acceptance thresholds. No equations, parameters, or claims in the abstract or description reduce the 'predictions' or 'rectifications' to quantities defined by the target success-rate results themselves. The approach is presented as an engineering combination of existing token-domain SD with independent kinematic-domain filtering, followed by empirical validation. No self-citation chains, self-definitional loops, or fitted-input renamings are evident. This is the common case of a self-contained applied method.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions of Kalman Filter applicability to robot kinematics and the existence of a reliable kinematic model for the target robots; no new free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: the Kalman Filter provides accurate action predictions that can compensate for token-level errors in speculative decoding
    Invoked in the description of error compensation without re-inference

pith-pipeline@v0.9.0 · 5523 in / 1145 out tokens · 44822 ms · 2026-05-15T18:36:34.409122+00:00 · methodology

discussion (0)


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

    cs.RO 2026-05 unverdicted novelty 7.0

    A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

  2. HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness

    cs.RO 2026-03 unverdicted novelty 7.0

    HeiSD delivers up to 2.45x faster inference for embodied VLA models by hybridizing speculative decoding with kinematic boundary detection and error-mitigation tricks while preserving task success rates.

  3. VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness

    cs.RO 2026-03 conditional novelty 7.0

    VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.

  4. FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching

    cs.RO 2026-04 unverdicted novelty 6.0

    FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.

  5. RoboECC: Multi-Factor-Aware Edge-Cloud Collaborative Deployment for VLA Models

    cs.DC 2026-03 unverdicted novelty 5.0

    RoboECC delivers up to 3.28x speedup for VLA model inference via co-aware segmentation and network-aware adjustment with 2.55-2.62% overhead.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 5 Pith papers · 7 internal anchors

  1. [1]

     Andrea Bajo and Nabil Simaan. 2011. Kinematics-based detection and localization of contacts along multisegment continuum robots. IEEE Transactions on Robotics 28, 2 (2011), 291–302

  2. [2]

     Pieter M Blok, Koen van Boheemen, Frits K van Evert, Joris IJsselmuiden, and Gook-Hwan Kim. 2019. Robot navigation in orchards with localization based on Particle filter and Kalman filter. Computers and Electronics in Agriculture 157 (2019), 261–269

  3. [3]

     Paweł Budzianowski, Wesley Maa, Matthew Freed, Jingxiang Mo, Winston Hsiao, Aaron Xie, Tomasz Młoduchowski, Viraj Tipnis, and Benjamin Bolte. 2025. EdgeVLA: Efficient vision-language-action models. arXiv preprint arXiv:2507.14049 (2025)

  4. [4]

     Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. 2024. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774 (2024)

  5. [5]

     Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Xiaobao Wei, Sixiang Chen, Zhuo Li, Yang Wang, et al. 2025. FastDriveVLA: Efficient end-to-end driving via plug-and-play reconstruction-based token pruning. arXiv preprint arXiv:2507.23318 (2025)

  6. [6]

     Shen-Yong Chen. 2011. Kalman filter for robot vision: a survey. IEEE Transactions on Industrial Electronics 59, 11 (2011), 4409–4420

  7. [7]

     Can Cui, Pengxiang Ding, Wenxuan Song, Shuanghao Bai, Xinyang Tong, Zirui Ge, Runze Suo, Wanqi Zhou, Yang Liu, Bofang Jia, et al. 2025. OpenHelix: A short survey, empirical analysis, and open-source dual-system VLA model for robotic manipulation. arXiv preprint arXiv:2505.03912 (2025)

  8. [8]

     Chrysostomos Karakasis and Panagiotis Artemiadis. 2021. F-VESPA: A kinematic-based algorithm for real-time heel-strike detection during walking. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 5098–5103

  9. [9]

     Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. 2024. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

  10. [10]

     Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning. PMLR, 19274–19286

  11. [11]

     Shiyao Li, Yingchun Hu, Xuefei Ning, Xihui Liu, Ke Hong, Xiaotao Jia, Xiuhong Li, Yaqi Yan, Pei Ran, Guohao Dai, et al. 2025. MBQ: Modality-balanced quantization for large vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference. 4167–4177

  12. [12]

     Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. EAGLE-2: Faster inference of language models with dynamic draft trees. arXiv preprint arXiv:2406.16858 (2024)

  13. [13]

     Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. EAGLE: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077 (2024)

  14. [14]

     Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2025. EAGLE-3: Scaling up inference acceleration of large language models via training-time test. arXiv preprint arXiv:2503.01840 (2025)

  15. [15]

     Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. 2023. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36 (2023), 44776–44791

  16. [16]

     Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Lily Lee, Kaichen Zhou, Pengju An, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. 2024. RoboMamba: Multimodal state space model for efficient robot reasoning and manipulation. arXiv e-prints (2024), arXiv–2406

  17. [17]

     Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. 2024. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093 (2024)

  18. [18]

     Seongmin Park, Hyungmin Kim, Wonseok Jeon, Juyoung Yang, Byeongwook Jeon, Yoonseon Oh, and Jungwook Choi. 2024. Quantization-aware imitation-learning for resource-efficient robotic control. arXiv preprint arXiv:2412.01034 (2024)

  19. [19]

     Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. 2025. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)

  20. [20]

     Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505–3506

  21. [21]

     Wenxuan Song, Jiayi Chen, Pengxiang Ding, Yuxin Huang, Han Zhao, Donglin Wang, and Haoang Li. 2025. CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding. arXiv preprint arXiv:2506.13725 (2025)

  22. [22]

     Xudong Tan, Yaoxin Yang, Peng Ye, Jialin Zheng, Bizhe Bai, Xinyi Wang, Jia Hao, and Tao Chen. 2025. Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models. CoRR abs/2505.21200 (2025). arXiv:2505.21200 doi:10.48550/ARXIV.2505.21200

  23. [23]

     Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  24. [24]

     Hanzhen Wang, Jiaming Xu, Jiayi Pan, Yongkang Zhou, and Guohao Dai. 2025. SpecPrune-VLA: Accelerating vision-language-action models via action-aware self-speculative pruning. arXiv preprint arXiv:2509.05614 (2025)

  25. [25]

     Songsheng Wang, Rucheng Yu, Zhihang Yuan, Chao Yu, Feng Gao, Yu Wang, and Derek F Wong. 2025. Spec-VLA: Speculative decoding for vision-language-action models with relaxed acceptance. arXiv preprint arXiv:2507.22424 (2025)

  26. [26]

     Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. 2025. TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters (2025)

  27. [27]

     Zhuofan Wen, Shangtong Gui, and Yang Feng. 2024. Speculative decoding with CTC-based draft model for LLM inference acceleration. Advances in Neural Information Processing Systems 37 (2024), 92082–92100

  28. [28]

     Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, and Chang Xu. 2025. VLA-Cache: Towards efficient vision-language-action model via adaptive token caching in robotic manipulation. arXiv preprint arXiv:2502.02175 (2025)

  29. [29]

     Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, and Linfeng Zhang. 2025. EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models. arXiv preprint arXiv:2506.10100 (2025)

  30. [30]

     Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, and Gao Huang. 2024. DeeR-VLA: Dynamic inference of multimodal large language models for efficient robot execution. Advances in Neural Information Processing Systems 37 (2024), 56619–56643

  31. [31]

     Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. 2024. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 46, 8 (2024), 5625–5644

  32. [32]

     Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. 2023. Draft & Verify: Lossless large language model acceleration via self-speculative decoding. arXiv preprint arXiv:2309.08168 (2023)

  33. [33]

     Lefan Zhang, Xiaodan Wang, Yanhua Huang, and Ruiwen Xu. 2024. Learning harmonized representations for speculative sampling. arXiv preprint arXiv:2408.15766 (2024)

  34. [34]

     Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Yuan Du, and Shanghang Zhang. 2025. MoLe-VLA: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384 (2025)

  35. [35]

     Shulai Zhang, Ao Xu, Quan Chen, Han Zhao, Weihao Cui, Ningxin Zheng, Haibin Lin, Xin Liu, and Minyi Guo. 2025. Boosting Embodied AI Agents through Perception-Generation Disaggregation and Asynchronous Pipeline Execution. arXiv preprint arXiv:2509.09560 (2025)

  36. [36]

     Wenxin Zheng, Boyang Li, Bin Xu, Erhu Feng, Jinyu Gu, and Haibo Chen. 2025. Leveraging OS-Level Primitives for Robotic Action Management. arXiv preprint arXiv:2508.10259 (2025)

  37. [37]

     Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. 2023. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning. PMLR, 2165–2183