pith. machine review for the scientific record.

arxiv: 2605.10821 · v1 · submitted 2026-05-11 · 💻 cs.RO

Recognition: 2 Lean theorem links

Unified Noise Steering for Efficient Human-Guided VLA Adaptation


Pith reviewed 2026-05-12 04:05 UTC · model grok-4.3

classification 💻 cs.RO
keywords: vision-language-action models · noise-space reinforcement learning · human-in-the-loop adaptation · flow-matching decoder · robotic manipulation · diffusion policies · policy adaptation

The pith

UniSteer inverts human corrective actions into noise-space targets that jointly steer RL updates during VLA adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework for adapting pretrained diffusion-based vision-language-action models to real-world robotic distributions under a limited interaction budget. Human corrections arrive naturally in action space, yet efficient training occurs in noise space with a frozen VLA decoder; UniSteer bridges the gap by recovering noise supervision signals from those actions. This lets reinforcement learning on a lightweight noise actor and human-guided supervised updates proceed simultaneously, lowering exploration cost. Real-world trials across manipulation tasks confirm faster convergence to high success rates than either pure noise-space RL or direct action-space human-in-the-loop methods.

Core claim

UniSteer recovers noise targets from human corrective actions by inverting the frozen flow-matching decoder, supplying supervised guidance to the same noise actor that is simultaneously optimized via reinforcement learning.

What carries the argument

Approximate action-to-noise inversion of the frozen flow-matching decoder, which converts action-space corrections into noise-space supervision signals.
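
A minimal sketch of how such an inversion can work, assuming the decoder is a short forward-Euler integration of a velocity field and that each step is inverted by fixed-point iteration, as the appendix excerpts extracted into the reference graph suggest. The names `decode`, `invert`, and the toy linear field are illustrative assumptions, not the authors' code.

```
# Minimal sketch of approximate action-to-noise inversion, in the spirit
# of the paper's Appendix A.2. `velocity` stands in for the frozen
# decoder's velocity network v_theta(x, t, s); the Euler discretization
# and the toy linear field below are illustrative assumptions.
import numpy as np

def decode(z, s, velocity, num_steps=10):
    """Forward Euler integration of the flow: noise z (t=0) -> action (t=1)."""
    x, dt = z, 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt, s)
    return x

def invert(a, s, velocity, num_steps=10, fp_iters=20):
    """Approximate inverse: action a (t=1) -> noise estimate (t=0).

    Each backward Euler step solves x = y - dt * v(x, t_k, s) by
    fixed-point iteration, which contracts when dt * L < 1 for an
    L-Lipschitz velocity field.
    """
    y, dt = a, 1.0 / num_steps
    for k in reversed(range(num_steps)):
        x = y                      # initial guess for the pre-step state
        for _ in range(fp_iters):
            x = y - dt * velocity(x, k * dt, s)
        y = x
    return y

# Round-trip check with a toy Lipschitz (linear) velocity field.
rng = np.random.default_rng(0)
A = 0.5 * rng.standard_normal((4, 4))
velocity = lambda x, t, s: A @ x + s      # state-conditioned drift
s, z = rng.standard_normal(4), rng.standard_normal(4)
z_hat = invert(decode(z, s, velocity), s, velocity)
print(np.max(np.abs(z - z_hat)))          # small reconstruction error
```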

If this is right

  • The noise actor receives both environment rewards and human-derived supervision without altering the pretrained VLA.
  • Human interventions directly reduce the exploration burden that otherwise slows noise-space RL.
  • Real-world adaptation reaches 90% success in roughly one hour of interaction across varied manipulation tasks.
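
Taken together, the first two points describe one noise actor trained under two signals at once. A hedged sketch of what such a joint update could look like; the buffer fields, critic interface, and weighting are assumptions for illustration, not the paper's reported objective.

```
# Hedged sketch of a joint noise-actor update: an RL term from autonomous
# rollouts plus a supervised term toward inversion-derived noise targets.
import torch
import torch.nn.functional as F

def joint_actor_loss(actor, critic, rl_batch, demo_batch, bc_weight=1.0):
    # RL term: steer the actor's noise output toward high critic value
    # (a deterministic-policy-gradient-style surrogate).
    noise_rl = actor(rl_batch["obs"])
    rl_loss = -critic(rl_batch["obs"], noise_rl).mean()

    # Supervised term: regress onto noise targets recovered by inverting
    # human corrective actions through the frozen flow-matching decoder.
    noise_pred = actor(demo_batch["obs"])
    bc_loss = F.mse_loss(noise_pred, demo_batch["inverted_noise"])

    return rl_loss + bc_weight * bc_loss
```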

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The inversion technique could transfer to other diffusion policies whose training occurs in latent noise space while feedback arrives in output space.
  • Hybrid human-RL steering may shorten data collection for any latent-variable policy where approximate inversion of the decoder is feasible.

Load-bearing premise

The inversion of human corrective actions through the frozen decoder produces noise targets accurate enough that the resulting supervision improves rather than harms the joint RL optimization.

What would settle it

A controlled trial in which the inverted noise targets are replaced by random noise or omitted entirely, yet success rates and adaptation speed remain equal or higher than with UniSteer.

Figures

Figures reproduced from arXiv: 2605.10821 by Bin Liang, Chuheng Zhang, Junjie Lu, Jun Yang, Kaixin Wang, Li Zhao, Min Xu, Xinyao Qin, Yuhua Jiang.

Figure 1. UniSteer bridges human guidance and noise-space RL. Noise-space finetuning relies on exploration from a distant initial noise distribution. UniSteer maps human corrective actions into noise-space targets, providing a useful prior for faster adaptation with fewer trajectories. The VLA stays frozen, so the decoded actions naturally remain anchored to the pretrained policy prior and thus benefit from the priors … view at source ↗
Figure 2. Overview of the UniSteer framework. The noise actor maps current states into noise variables and generates actions through the frozen decoder. Autonomous rollouts are stored in the RL buffer, whereas human takeover actions are inverted into noise space and collected in both the demo buffer and the RL buffer. The two buffers jointly provide training signals for the critic and the actor. view at source ↗
Figure 3. Overview of all real-world experimental tasks. view at source ↗
Figure 4. Online adaptation efficiency of UniSteer. UniSteer uses human guidance more efficiently to … view at source ↗
Figure 5. Qualitative comparison of early exploration behavior on Pick up Spoon. DSRL relies … view at source ↗
Figure 6. Qualitative example of forgetting in DAgger on Pick up Spoon. DAgger learns the … view at source ↗
Figure 7. We use an AgileX Piper robot arm equipped with a wrist camera, together with a primary … view at source ↗
read the original abstract

Diffusion-based vision-language-action (VLA) models have emerged as strong priors for robotic manipulation, yet adapting them to real-world distributions remains challenging. In particular, on-robot reinforcement learning (RL) is expensive and time-consuming, so effective adaptation depends on efficient policy improvement within a limited budget of real-world interactions. Noise-space RL lowers the cost by keeping the pretrained VLA fixed as a denoising generator while updating only a lightweight actor that predicts the noise. However, its performance is still limited due to inefficient autonomous exploration. Human corrective interventions can reduce this exploration burden, but they are naturally provided in action space, whereas noise-space finetuning requires supervision over noise variables. To address these challenges, we propose UniSteer, a Unified Noise Steering framework that combines human corrective guidance with noise-space RL through approximate action-to-noise inversion. Given a human corrective action, UniSteer inverts the frozen flow-matching decoder to recover a noise target, which provides supervised guidance for the same noise actor that is simultaneously optimized via reinforcement learning. Real-world experiments on diverse manipulation tasks show that UniSteer adapts more efficiently than strong noise-space RL and action-space human-in-the-loop baselines, improving the success rate from 20% to 90% in 66 minutes on average across four real-world adaptation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces UniSteer, a framework for efficient adaptation of diffusion-based vision-language-action (VLA) models. It augments noise-space RL (which updates only a lightweight noise actor while keeping the pretrained VLA fixed) with human corrective interventions by using an approximate inversion through the frozen flow-matching decoder to convert action-space corrections into noise targets for supervised updates to the same actor. Real-world experiments on four manipulation tasks report that this mixed supervision raises success rates from 20% to 90% in an average of 66 minutes, outperforming pure noise-space RL and action-space human-in-the-loop baselines.

Significance. If the inversion step reliably supplies beneficial rather than harmful supervision, the approach could meaningfully lower the real-world sample complexity of adapting large VLA priors by allowing sparse human guidance to steer noise-space optimization, addressing a practical bottleneck in robotic deployment.

major comments (3)
  1. [Method description of the inversion step] The central claim that human corrections, once inverted, improve the joint RL-plus-human objective rests on the unexamined accuracy of the approximate action-to-noise inversion of the frozen decoder. No error bounds, reconstruction metrics on held-out trajectories, or ablation measuring how inversion mismatch affects actor updates (especially when corrections lie outside current policy support) are provided, leaving open the risk that biased targets could stall or reverse adaptation.
  2. [Experiments section] The headline real-world results (success-rate gains from 20% to 90%; 66-minute average) are stated without variance across runs, number of trials, statistical significance tests, or precise baseline implementations (e.g., how the action-space human-in-the-loop comparator interfaces with the noise actor, or how inversion error was quantified in practice).
  3. [Experiments and ablation studies] No sensitivity analysis or ablation isolates the contribution of the inverted human supervision versus pure RL, nor tests performance when human corrections are deliberately noisy or distant from the current policy distribution, which would directly probe the load-bearing assumption identified in the skeptic note.
minor comments (2)
  1. [Preliminaries] Notation for the flow-matching decoder and the inversion operator could be introduced earlier and used consistently to improve readability of the mixed objective.
  2. [Abstract] The abstract and introduction would benefit from a brief statement of the inversion approximation error observed during training or validation.
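
For concreteness, the reconstruction metric requested in major comment 1 could be as simple as an action-space round-trip check on held-out corrections. Here `decode(z, s)` and `invert(a, s)` are hypothetical interfaces to the frozen decoder and its approximate inverse, as in the inversion sketch above, not released code.

```
# Hypothetical round-trip metric for major comment 1: invert a held-out
# human correction to noise, decode it back, and measure the action-space
# error. `decode(z, s)` and `invert(a, s)` are assumed interfaces.
import numpy as np

def inversion_reconstruction_error(actions, states, decode, invert):
    errs = [np.linalg.norm(decode(invert(a, s), s) - a)
            for a, s in zip(actions, states)]
    return float(np.mean(errs)), float(np.max(errs))
```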

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the presentation of the inversion analysis and experimental results.

read point-by-point responses
  1. Referee: [Method description of the inversion step] The central claim that human corrections, once inverted, improve the joint RL-plus-human objective rests on the unexamined accuracy of the approximate action-to-noise inversion of the frozen decoder. No error bounds, reconstruction metrics on held-out trajectories, or ablation measuring how inversion mismatch affects actor updates (especially when corrections lie outside current policy support) are provided, leaving open the risk that biased targets could stall or reverse adaptation.

    Authors: We agree that the original manuscript would benefit from a more explicit analysis of inversion accuracy. In the revised version we add reconstruction error metrics computed on held-out trajectories, together with an ablation that quantifies how inversion mismatch propagates to actor updates when corrections fall outside the current policy support. While the real-world results demonstrate consistent gains, we acknowledge that these new analyses will better bound the risk of biased supervision. revision: yes

  2. Referee: [Experiments section] The headline real-world results (success-rate gains from 20% to 90%; 66-minute average) are stated without variance across runs, number of trials, statistical significance tests, or precise baseline implementations (e.g., how the action-space human-in-the-loop comparator interfaces with the noise actor, or how inversion error was quantified in practice).

    Authors: We will expand the experimental reporting to include a detailed table listing per-task success rates with standard deviations across five independent runs, total interaction trials, and p-values from paired statistical tests. The revised text will also provide precise descriptions of baseline implementations, including the exact interface used by the action-space human-in-the-loop comparator with the noise actor and the practical procedure for quantifying inversion error during data collection. revision: yes

  3. Referee: [Experiments and ablation studies] No sensitivity analysis or ablation isolates the contribution of the inverted human supervision versus pure RL, nor tests performance when human corrections are deliberately noisy or distant from the current policy distribution, which would directly probe the load-bearing assumption identified in the skeptic note.

    Authors: We will add new ablation experiments that isolate the contribution of inverted human supervision by comparing UniSteer against pure noise-space RL and human-only baselines. In addition, we will include sensitivity tests in which human corrections are deliberately corrupted with noise or sampled from distributions distant from the current policy; these results will directly evaluate the robustness of the inversion step under the conditions highlighted by the referee. revision: yes

Circularity Check

0 steps flagged

No significant circularity; inversion is an explicit algorithmic bridge

full rationale

The paper's core mechanism is an explicit approximate action-to-noise inversion applied to the frozen flow-matching decoder, used to generate supervision targets for the noise actor that is jointly optimized with RL. This inversion is presented as a new, independent algorithmic step rather than a quantity defined in terms of the target success rates, fitted parameters from the adaptation tasks, or the final performance metric. Empirical results (success rate lift from 20% to 90% in 66 minutes) are obtained from separate real-world experiments on four manipulation tasks and do not reduce to the method definition by construction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that collapse the central claim are present.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that the pretrained flow-matching decoder admits a usable approximate inverse from action to noise; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: The frozen flow-matching decoder can be approximately inverted to recover a noise target from a human corrective action
    This inversion is the load-bearing step that translates human action-space corrections into noise-space supervision.
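
One way to write the axiom out precisely, with notation reconstructed from the appendix extracts in the reference graph below; the tolerance $\varepsilon$ is an assumed bookkeeping device, not a quantity the paper reports.

```
% The ledger's single axiom, written out: for every state s and every
% action a reachable by the frozen decoder G_theta, an approximate
% inverse recovers a noise z_hat that decodes back to (nearly) a.
\[
  \forall s,\; \forall a = G_\theta(s, z) :\quad
  \exists\, \hat z \;\text{with}\;
  \bigl\| G_\theta(s, \hat z) - a \bigr\| \le \varepsilon ,
\]
% with Appendix A.1 arguing exact invertibility for the continuous flow
% and Appendix A.2 giving a fixed-point inverse for one Euler step.
```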

pith-pipeline@v0.9.0 · 5553 in / 1340 out tokens · 53346 ms · 2026-05-12T04:05:04.305864+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 15 internal anchors

  1. [1]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  2. [2]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

  3. [3]

    Vima: General robot manipulation with multimodal prompts

    Y Zhu et al. Vima: General robot manipulation with multimodal prompts. InInternational Conference on Learning Representations (ICLR), 2023

  4. [4]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  5. [5]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  6. [6]

    Universal actions for enhanced embodied foundation models

    Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22508–22519, 2025

  7. [7]

    Vlas: Vision-language-action model with speech instructions for customized robot manipulation,

    Wei Zhao, Pengxiang Ding, Min Zhang, Zhefei Gong, Shuanghao Bai, Han Zhao, and Donglin Wang. Vlas: Vision-language-action model with speech instructions for customized robot manipulation.arXiv preprint arXiv:2502.13508, 2025

  8. [8]

    Villa-x: enhancing latent action modeling in vision-language-action models,

    Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: enhancing latent action modeling in vision-language-action models.arXiv preprint arXiv:2507.23682, 2025

  9. [9]

    Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

    Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

  10. [10]

    Chatvla: Unified multimodal understanding and robot control with vision-language-action model, 2025

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2502.14420, 2025

  11. [11]

    Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models

    Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, and Tieniu Tan. Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models.arXiv preprint arXiv:2506.07961, 2025

  12. [12]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

  13. [13]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  14. [14]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  15. [15]

    π0.5: a vision-language-action model with open-world generalization

    Physical Intelligence. π0.5: a vision-language-action model with open-world generalization.

  16. [16]

    URL: https://arxiv.org/abs/2504.16054

  17. [17]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

  18. [18]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  19. [19]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  20. [20]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

  21. [21]

    Dexgraspvla: A vision-language- action framework towards general dexterous grasping,

    Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Num Lui, Yuyao Ye, Yitao Liang, et al. Dexgraspvla: A vision-language-action framework towards general dexterous grasping.arXiv preprint arXiv:2502.20900, 2025

  22. [22]

    DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855, 2025

  23. [23]

    Unidex: A robot foundation suite for universal dexterous hand control from egocentric human videos

    Gu Zhang, Qicheng Xu, Haozhe Zhang, Jianhan Ma, Long He, Yiming Bao, Zeyu Ping, Zhecheng Yuan, Chenhao Lu, Chengbo Yuan, et al. Unidex: A robot foundation suite for universal dexterous hand control from egocentric human videos. arXiv preprint arXiv:2603.22264, 2026

  24. [24]

    Cross-hand latent representation for vision- language-action models.arXiv preprint arXiv:2603.10158, 2026

    Guangqi Jiang, Yutong Liang, Jianglong Ye, Jia-Yang Huang, Changwei Jing, Rocky Duan, Pieter Abbeel, Xiaolong Wang, and Xueyan Zou. Cross-hand latent representation for vision- language-action models.arXiv preprint arXiv:2603.10158, 2026

  25. [25]

    Conrft: A reinforced fine-tuning method for VLA models via consistency policy

    Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for VLA models via consistency policy. arXiv preprint arXiv:2502.05450, 2025

  26. [26]

    Steering your diffusion policy with latent space reinforcement learning.arXiv preprint arXiv:2506.15799, 2025

    Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning.arXiv preprint arXiv:2506.15799, 2025

  27. [27]

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al. π∗ 0.6: A VLA That Learns From Experience.arXiv preprint arXiv:2511.14759, 2025

  28. [28]

    Interactive post-training for vision-language- action models, 2025

    Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training for vision-language-action models.arXiv preprint arXiv:2505.17016, 2025

  29. [29]

    Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719, 2025

    Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Ziwei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719, 2025

  30. [30]

    Reinbot: Amplifying robot visual-language manipulation with reinforcement learning.arXiv preprint arXiv:2505.07395,

    Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, Hongchao Lu, and Donglin Wang. Reinbot: Amplifying robot visual-language manipulation with reinforcement learning.arXiv preprint arXiv:2505.07395, 2025

  31. [31]

    Rlinf-vla: A unified and efficient framework for vla+ rl training.arXiv preprint arXiv:2510.06710, 2025

    Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, et al. Rlinf-vla: A unified and efficient framework for vla+ rl training. arXiv preprint arXiv:2510.06710, 2025

  32. [32]

    Improving vision-language-action model with online reinforcement learning

    Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15665–15672. IEEE, 2025

  33. [33]

    SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning.arXiv preprint arXiv:2509.09674, 2025

  34. [34]

    Gr-rl: Going dexterous and precise for long-horizon robotic manipulation.arXiv preprint arXiv:2512.01801, 2025

    Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui, Zhigang Han, Liqun Huang, Tao Kong, Yuxiao Liu, Hao Niu, et al. Gr-rl: Going dexterous and precise for long-horizon robotic manipulation.arXiv preprint arXiv:2512.01801, 2025

  35. [35]

    Dual-actor fine-tuning of vla models: A talk-and-tweak human-in-the-loop approach.arXiv preprint arXiv:2509.13774, 2025

    Piaopiao Jin, Qi Wang, Guokang Sun, Ziwen Cai, Pinjia He, and Yangwei You. Dual-actor fine-tuning of vla models: A talk-and-tweak human-in-the-loop approach.arXiv preprint arXiv:2509.13774, 2025

  36. [36]

    Srpo: Self-referential policy optimization for vision-language-action models,

    Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, and Xipeng Qiu. Srpo: Self-referential policy optimization for vision- language-action models.arXiv preprint arXiv:2511.15605, 2025

  37. [37]

    You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector

    Omkar Patil, Ondrej Biza, Thomas Weng, Karl Schmeckpeper, Wil Thomason, Xiaohan Zhang, Robin Walters, Nakul Gopalan, Sebastian Castro, and Eric Rosen. You’ve got a golden ticket: Improving generative robot policies with a single noise vector.arXiv preprint arXiv:2603.15757, 2026

  38. [38]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  39. [39]

    HG-DAgger: Interactive imitation learning with human experts

    Michael Kelly, Chelsea Sidrane, Katherine Driggs-Campbell, and Mykel J Kochenderfer. HG-DAgger: Interactive imitation learning with human experts. In 2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019

  40. [40]

    Rac: Robot learning for long-horizon tasks by scaling recovery and correction.arXiv preprint arXiv:2509.07953, 2025

    Zheyuan Hu, Robyn Wu, Naveen Enock, Jasmine Li, Riya Kadakia, Zackory Erickson, and Aviral Kumar. Rac: Robot learning for long-horizon tasks by scaling recovery and correction. arXiv preprint arXiv:2509.07953, 2025

  41. [41]

    Human-in-the-loop online rejection sampling for robotic manipulation.arXiv preprint arXiv:2510.26406, 2025

    Guanxing Lu, Rui Zhao, Haitao Lin, He Zhang, and Yansong Tang. Human-in-the-loop online rejection sampling for robotic manipulation.arXiv preprint arXiv:2510.26406, 2025

  42. [42]

    Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning. Science Robotics, 10(105):eads5033, 2025

    Jianlan Luo, Charles Xu, Jeffrey Wu, and Sergey Levine. Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning. Science Robotics, 10(105):eads5033, 2025

  43. [43]

    Human-assisted robotic policy refinement via action preference optimization.arXiv preprint arXiv:2506.07127, 2025

    Wenke Xia, Yichu Yang, Hongtao Wu, Xiao Ma, Tao Kong, and Di Hu. Human-assisted robotic policy refinement via action preference optimization.arXiv preprint arXiv:2506.07127, 2025

  44. [44]

    The Principles of Diffusion Models,

    Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, and Stefano Ermon. The principles of diffusion models.arXiv preprint arXiv:2510.21890, 2025

  45. [45]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  46. [46]

    Dita: Scaling diffusion transformer for generalist vision-language-action policy

    Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, et al. Dita: Scaling diffusion transformer for generalist vision-language-action policy. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7686–7697, 2025

  47. [47]

    Diffusion-vla: Generalizable and interpretable robot foundation model via self-generated reasoning

    Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, et al. Diffusion-vla: Generalizable and interpretable robot foundation model via self-generated reasoning. arXiv preprint arXiv:2412.03293, 2024

  48. [48]

    Flow matching policy gradients.arXiv preprint arXiv:2507.21053,

    David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients.arXiv preprint arXiv:2507.21053, 2025

  49. [49]

    Ig-rft: An interaction-guided RL framework for VLA models in long-horizon robotic manipulation

    Zhian Su, Weijie Kong, Haonan Dong, and Huixu Dong. Ig-rft: An interaction-guided rl framework for vla models in long-horizon robotic manipulation. arXiv preprint arXiv:2602.20715, 2026

  50. [50]

    Aloe: Action-level off-policy evaluation for vision-language-action model post-training.arXiv preprint arXiv:2602.12691, 2026

    Rushuai Yang, Hecheng Wang, Chiming Liu, Xiaohan Yan, Yunlong Wang, Xuan Du, Shuoyu Yue, Yongcheng Liu, Chuheng Zhang, Lizhe Qi, et al. Aloe: Action-level off-policy evaluation for vision-language-action model post-training.arXiv preprint arXiv:2602.12691, 2026

  51. [51]

    ARM: Advantage Reward Modeling for Long-Horizon Manipulation

    Yiming Mao, Zixi Yu, Weixin Mao, Yinhao Li, Qirui Hu, Zihan Lan, Minzhao Zhu, and Hua Chen. Arm: Advantage reward modeling for long-horizon manipulation.arXiv preprint arXiv:2604.03037, 2026

  52. [52]

    Self-improving vision-language-action models with data generation via residual rl.arXiv preprint arXiv:2511.00091, 2025

    Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi Fan, et al. Self-improving vision-language-action models with data generation via residual rl.arXiv preprint arXiv:2511.00091, 2025

  53. [53]

    RL Token: Bootstrapping Online RL with Vision-Language-Action Models

    Charles Xu, Jost Tobias Springenberg, Michael Equi, Ali Amin, Adnan Esmail, Sergey Levine, and Liyiming Ke. Rl token: Bootstrapping online rl with vision-language-action models, 2026. URL: https://arxiv.org/abs/2604.23073

  54. [54]

    Appendix A.1: Invertibility of the Continuous Flow Decoder

    Setup and statement. For a fixed state $s$, if two initial noises decode to the same action, $G_\theta(s, z_0) = G_\theta(s, z_0') = a$, then the reverse-time ODE from the same terminal value $a$ has a unique solution, so the recovered initial values must coincide: $z_0 = z_0'$. Therefore $G_\theta(s, \cdot)$ is bijective. □

  55. [55]

    Appendix A.2: One-Step Fixed-Point Inversion

    Setup. We analyze the inverse of one Euler step of the frozen flow decoder. The forward step, inverse equation, and associated fixed-point map are $y = x + \Delta t\, v_\theta(x, t_k, s)$, $x = y - \Delta t\, v_\theta(x, t_k, s)$, and $g_y(x) := y - \Delta t\, v_\theta(x, t_k, s)$. Proposition A.2. Assume that $v_\theta(\cdot, t_k, s)$ is $L$-Lipschitz and $\Delta t\, L < 1$. …
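
A quick numerical check of the contraction condition in Proposition A.2; this is a sketch with a scalar tanh field standing in for $v_\theta$, not the paper's code.

```
# Toy check of Proposition A.2: when dt * L < 1, the fixed-point map
# g_y(x) = y - dt * v(x) is a contraction, so iterating it recovers the
# unique pre-step state of one forward Euler step.
import numpy as np

L, dt = 2.0, 0.3                    # contraction condition: dt * L = 0.6 < 1
v = lambda x: L * np.tanh(x)        # |v'(x)| <= L, so v is L-Lipschitz
x_true = 0.8
y = x_true + dt * v(x_true)         # one forward Euler step
x = y                               # initial guess for the inverse
for _ in range(40):
    x = y - dt * v(x)               # fixed-point iterate g_y
print(abs(x - x_true))              # ~0: the unique inverse is recovered
```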