
arxiv: 2602.10503 · v2 · pith:HO4WRQX2new · submitted 2026-02-11 · 💻 cs.RO

Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

Pith reviewed 2026-05-21 14:03 UTC · model grok-4.3

classification 💻 cs.RO
keywords continual learning · vision-language-action models · reinforcement fine-tuning · robotic policies · multi-task learning · process rewards · lifelong adaptation · VLA models

The pith

Reinforcement fine-tuning with chunk-level rewards lets VLA models adapt to new robotic tasks while preserving prior skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LifeLong-RFT, a reinforcement fine-tuning method for vision-language-action models that operates without online robot feedback or external reward models. It applies on-policy learning at the level of action chunks and scores each chunk through three rewards that check prediction accuracy, trajectory alignment, and output format. A reader would care because current supervised approaches demand large new datasets for each task and erase earlier capabilities, limiting robots to short operational lifetimes. The method is shown to deliver a 22 percent gain in average success rate on the LIBERO continual-learning benchmark while requiring only 20 percent of the usual training data for new tasks.

Core claim

LifeLong-RFT integrates chunking-level on-policy reinforcement learning with a multi-dimensional process reward mechanism to optimize VLA policies for continual learning. The mechanism consists of three rewards: the Quantized Action Consistency Reward (QACR) enforces accuracy inside the discrete action space, the Continuous Trajectory Alignment Reward (CTAR) matches decoded continuous chunks against reference trajectories, and the Format Compliance Reward (FCR) ensures structural validity of outputs. This combination enables policy improvement independent of environmental interaction and pre-trained rewards, producing strong multi-task results across SimplerEnv, LIBERO, and real-world settings, together with the reported 22 percent gain in average success rate over SFT on LIBERO continual learning.
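
The three reward terms can be made concrete with a minimal Python sketch. The functional forms below (token match rate for QACR, an exponential of mean Euclidean distance for CTAR, a length-and-range check for FCR, and a weighted sum) are illustrative assumptions, as are all function and parameter names; the summary above does not give the paper's formulas.

```python
import math

def qacr(pred_tokens, ref_tokens):
    """Assumed form: fraction of positions where the discrete tokens match."""
    if not ref_tokens or len(pred_tokens) != len(ref_tokens):
        return 0.0
    matches = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return matches / len(ref_tokens)

def ctar(pred_chunk, ref_chunk, scale=1.0):
    """Assumed form: exp(-mean Euclidean distance / scale), a value in (0, 1]."""
    if not ref_chunk or len(pred_chunk) != len(ref_chunk):
        return 0.0
    dists = [math.dist(p, r) for p, r in zip(pred_chunk, ref_chunk)]
    return math.exp(-(sum(dists) / len(dists)) / scale)

def fcr(pred_tokens, chunk_len, action_dim, num_bins):
    """Assumed form: 1.0 if the token count and every bin index are valid."""
    right_length = len(pred_tokens) == chunk_len * action_dim
    in_range = all(0 <= t < num_bins for t in pred_tokens)
    return 1.0 if right_length and in_range else 0.0

def chunk_reward(pred_tokens, ref_tokens, pred_chunk, ref_chunk,
                 chunk_len, action_dim, num_bins, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are hypothetical."""
    w_q, w_c, w_f = weights
    return (w_q * qacr(pred_tokens, ref_tokens)
            + w_c * ctar(pred_chunk, ref_chunk)
            + w_f * fcr(pred_tokens, chunk_len, action_dim, num_bins))
```

Every input is either a sampled chunk or a reference demonstration, so no simulator step or learned reward model is called, which is the property the core claim relies on.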

What carries the argument

The multi-dimensional process reward mechanism (QACR, CTAR, FCR) that assigns separate scores to intermediate action chunks to quantify their heterogeneous contributions and drive policy optimization.

If this is right

  • VLA policies can incorporate new tasks with far less task-specific data than supervised fine-tuning requires.
  • Catastrophic forgetting is reduced during sequential task introduction on standard robotic benchmarks.
  • Performance gains appear in both simulated environments and real-world robot deployments.
  • The approach supplies a post-training route for general-purpose VLA models that does not depend on live feedback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robots could accumulate skills across months or years of operation in homes or factories without repeated full retraining.
  • The chunk-reward structure may transfer to other sequence-based embodied models beyond current VLA architectures.
  • Combining this fine-tuning stage with memory-replay or regularization methods could further strengthen long-term retention.
  • Deployment cost drops if new skills can be added from modest demonstration sets rather than exhaustive new collections.

Load-bearing premise

The three rewards can accurately measure the value of each action chunk for successful task execution without any real-time robot trials or separate reward models.

What would settle it

On the LIBERO continual-learning sequence, the method shows no average success-rate improvement over supervised fine-tuning or requires substantially more than 20 percent of the data to match prior-task retention.
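
A minimal sketch of how such a check could be scored, assuming the commonly used continual-learning metrics of final average success rate and average forgetting. The metric definitions, the function names, and the matrix values are illustrative placeholders, not the paper's evaluation protocol or results.

```python
def average_success(success):
    """Mean success rate over all tasks after the final training stage.

    success[k][i] is the success rate on task i after training through stage k.
    """
    final = success[-1]
    return sum(final) / len(final)

def average_forgetting(success):
    """Mean drop from each earlier task's best pre-final score to its final score.

    Requires at least two tasks; the last task is skipped because it cannot
    have been forgotten yet.
    """
    num_tasks = len(success)
    drops = []
    for i in range(num_tasks - 1):
        best_earlier = max(success[k][i] for k in range(i, num_tasks - 1))
        drops.append(best_earlier - success[-1][i])
    return sum(drops) / len(drops)

# Placeholder numbers for a three-task sequence (made up for illustration).
placeholder = [
    [0.9, 0.0, 0.0],
    [0.7, 0.8, 0.0],
    [0.6, 0.7, 0.9],
]
print(average_success(placeholder), average_forgetting(placeholder))
```

Running both metrics for the RFT and SFT policies on the same task order would separate a gain in new-task adaptation from a gain in prior-task retention.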

Figures

Figures reproduced from arXiv: 2602.10503 by Dongbin Zhao, Haoran Li, Shuai Tian, Yongzhen Huang, Yuan Liu, Yuhui Chen, Yupeng Zheng, Yuxing Qin.

Figure 1: Overview of VLA post-training. This phase involves single-stage multi-task adaptation and incremental continual learning. Addressing the substantial data dependence and susceptibility to catastrophic forgetting inherent in SFT, we introduce LifeLong-RFT, which combines on-policy RL with the Multi-Dimensional Process Reward mechanism.
Figure 2: Overview of the proposed LifeLong-RFT. This strategy integrates the chunking-level on-policy reinforcement learning algorithm with the Multi-Dimensional Process Reward mechanism to facilitate policy optimization.
Figure 3: Overview of real-world experimental tasks: Pick & Place (Banana, Bread), Pull Drawer, and Hang Chinese Knot.

TABLE II: Multi-Task learning performance on LIBERO. Continuous Action Models:

  Method            Training Strategy  Object  Spatial  Goal  Long  Avg
  Octo-Base [66]    SFT                85.7    78.9     84.6  51.1  75.1
  GR00T N1 [5]      SFT                97.6    94.4     93.0  90.6  93.9
  π0 [6]            SFT                98.8    96.8     95.8  85.2  94.2
  OpenVLA-OFT [29]  SFT                98.1    96.9     95.5  91.…
Figure 4: Adaptation efficiency on representative new tasks.
Figure 5: Ablation study on the reward combination weights.
Figure 6: Representative reward curves during the training phase.
Figure 7: A representative execution of the Pick Banana task.
Figure 8: A representative execution of the Pick Bread task.
Figure 9: A representative execution of the Pull Drawer task.
Figure 10: A representative execution of the Hang Chinese Knot task.
Original abstract

Pretrained on large-scale and diverse datasets, VLA models demonstrate strong generalization and adaptability as general-purpose robotic policies. However, Supervised Fine-Tuning (SFT), which serves as the primary mechanism for adapting VLAs to downstream domains, requires substantial amounts of task-specific data and is prone to catastrophic forgetting. To address these limitations, we propose LifeLong-RFT, a simple yet effective Reinforcement Fine-Tuning (RFT) strategy for VLA models independent of online environmental feedback and pre-trained reward models. By integrating chunking-level on-policy reinforcement learning with the proposed multi-dimensional process reward mechanism, LifeLong-RFT quantifies the heterogeneous contributions of intermediate action chunks across three dimensions to facilitate policy optimization. Specifically, (1) the Quantized Action Consistency Reward (QACR) ensures accurate action prediction within the discrete action space; (2) the Continuous Trajectory Alignment Reward (CTAR) aligns decoded continuous action chunks with reference trajectories to ensure precise control; (3) the Format Compliance Reward (FCR) guarantees the structural validity of outputs. Comprehensive experiments across SimplerEnv, LIBERO, and real-world tasks demonstrate that LifeLong-RFT exhibits strong performance in multi-task learning. Furthermore, for continual learning on the LIBERO benchmark, our method achieves a 22% gain in average success rate over SFT, while effectively adapting to new tasks using only 20% of the training data. Overall, our method provides a promising post-training paradigm for VLAs. The project page is available at <https://yuan-liu-lifelong-rft.github.io>.
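
The abstract distinguishes a discrete action space (scored by QACR) from decoded continuous action chunks (scored by CTAR). A minimal sketch of one way to move between the two, assuming uniform binning over a fixed range; the bin count, the range, and the function names are assumptions for illustration, not the paper's tokenizer.

```python
def quantize(value, low=-1.0, high=1.0, num_bins=256):
    """Map a continuous action value to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    width = (high - low) / num_bins
    return min(int((clipped - low) / width), num_bins - 1)

def dequantize(token, low=-1.0, high=1.0, num_bins=256):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / num_bins
    return low + (token + 0.5) * width

def decode_chunk(tokens, action_dim, low=-1.0, high=1.0, num_bins=256):
    """Group a flat token sequence into per-step continuous action tuples."""
    values = [dequantize(t, low, high, num_bins) for t in tokens]
    return [tuple(values[i:i + action_dim])
            for i in range(0, len(values), action_dim)]
```

Under this scheme two neighbouring bins count as a complete miss for an exact-match token reward while decoding to nearly identical motions, which is one plausible reason to score both the discrete and the continuous view of a chunk.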

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents LifeLong-RFT, a reinforcement fine-tuning (RFT) method for Vision-Language-Action (VLA) models to support continual learning in robotic tasks. The approach integrates chunking-level on-policy reinforcement learning with a multi-dimensional process reward mechanism comprising Quantized Action Consistency Reward (QACR), Continuous Trajectory Alignment Reward (CTAR), and Format Compliance Reward (FCR). This is claimed to enable policy optimization independent of online environmental feedback and pre-trained reward models. The paper reports strong performance in multi-task learning on SimplerEnv, LIBERO, and real-world tasks, including a 22% gain in average success rate over Supervised Fine-Tuning (SFT) on the LIBERO benchmark while adapting to new tasks with only 20% of the training data.

Significance. If the central claims hold, this work represents a significant advancement in developing long-lived robots by providing a data-efficient post-training method for VLAs that mitigates catastrophic forgetting. The independence from online feedback and external rewards is a key strength, potentially enabling safer and more practical deployment in real-world continual learning scenarios. The use of process rewards for heterogeneous action chunk contributions is an innovative aspect that could influence future VLA fine-tuning strategies.

major comments (3)
  1. [§3.2] §3.2 (definition of CTAR): The Continuous Trajectory Alignment Reward explicitly aligns decoded continuous action chunks with reference trajectories. In continual learning settings that use only 20% of the training data, these references must originate from the limited task-specific data or prior demonstrations; without an explicit derivation showing the composite reward remains informative when reference quality degrades, the reported 22% LIBERO gain risks being an artifact of the data split rather than a genuine reinforcement signal.
  2. [§4.1] §4.1 (on-policy RL formulation): The method is described as performing chunking-level on-policy reinforcement learning independent of online environmental feedback. Standard on-policy methods require environment rollouts for advantage estimation; the manuscript must clarify how policy updates are obtained without any interaction or pre-trained reward model, as this is load-bearing for the independence claim.
  3. [§5.3] §5.3 (LIBERO continual learning results): The 22% average success rate gain over SFT is presented as evidence of effective adaptation with 20% data, but the section provides insufficient detail on the number of random seeds, variance across runs, statistical significance tests, and ablations isolating QACR/CTAR/FCR contributions. This undermines verification that the gains are robustly attributable to the RFT mechanism.
minor comments (2)
  1. [Abstract] Abstract: The claim of 'comprehensive experiments' would be strengthened by briefly naming the primary baselines beyond SFT.
  2. [§2.3] §2.3 (related work): The discussion of prior VLA fine-tuning methods could include a short comparison table to highlight differences from existing continual learning approaches.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the potential impact of LifeLong-RFT. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript.

  1. Referee: [§3.2] §3.2 (definition of CTAR): The Continuous Trajectory Alignment Reward explicitly aligns decoded continuous action chunks with reference trajectories. In continual learning settings that use only 20% of the training data, these references must originate from the limited task-specific data or prior demonstrations; without an explicit derivation showing the composite reward remains informative when reference quality degrades, the reported 22% LIBERO gain risks being an artifact of the data split rather than a genuine reinforcement signal.

    Authors: We appreciate this observation. Reference trajectories for CTAR are constructed from the available 20% task-specific demonstrations combined with relevant prior demonstrations from the pre-training corpus. The composite reward integrates QACR and FCR as robust complementary signals that maintain informativeness even under partial trajectory alignment. We will add an explicit derivation of the composite reward in §3.2 along with a sensitivity analysis demonstrating stability when reference quality is degraded, plus additional experiments varying the data split percentage. revision: partial

  2. Referee: [§4.1] §4.1 (on-policy RL formulation): The method is described as performing chunking-level on-policy reinforcement learning independent of online environmental feedback. Standard on-policy methods require environment rollouts for advantage estimation; the manuscript must clarify how policy updates are obtained without any interaction or pre-trained reward model, as this is load-bearing for the independence claim.

    Authors: We agree that the on-policy formulation requires clearer exposition. Policy updates are performed by sampling action chunks from the current policy, then directly computing the multi-dimensional process rewards (QACR, CTAR, FCR) on the generated outputs to serve as surrogate advantages. No environment rollouts or external reward models are used; the rewards derive solely from internal consistency checks against references and format constraints. We will revise §4.1 to include the full mathematical update rule, a step-by-step derivation, and pseudo-code for the process. revision: yes
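
The update described in this response can be sketched as follows. The response says only that process rewards on sampled chunks serve as surrogate advantages; the group-relative normalization (in the style of GRPO) and the plain policy-gradient loss below are assumptions, and the function names are hypothetical.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Normalize the rewards of chunks sampled for the same observation.

    Assumption: advantages are rewards standardized within the sampled group.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

def policy_gradient_loss(log_probs, advantages):
    """Negative advantage-weighted mean log-probability of the sampled chunks."""
    weighted = [lp * a for lp, a in zip(log_probs, advantages)]
    return -sum(weighted) / len(weighted)
```

In this sketch the samples come from the current policy, which is what makes the update on-policy, while the rewards come from comparing each sample with a reference chunk, so no rollout in an environment is needed.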

  3. Referee: [§5.3] §5.3 (LIBERO continual learning results): The 22% average success rate gain over SFT is presented as evidence of effective adaptation with 20% data, but the section provides insufficient detail on the number of random seeds, variance across runs, statistical significance tests, and ablations isolating QACR/CTAR/FCR contributions. This undermines verification that the gains are robustly attributable to the RFT mechanism.

    Authors: We acknowledge the need for greater statistical rigor and component isolation. In the revision we will report results over 5 random seeds with mean and standard deviation, include paired t-test p-values for RFT versus SFT comparisons, and add ablation tables in §5.3 (and supplementary material) that isolate the performance contribution of each reward term while holding the others fixed. revision: yes
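
A minimal sketch of the paired comparison promised here, using only the Python standard library. The per-seed numbers are made-up placeholders, not results from the paper, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic for paired samples: mean difference over its standard error.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error

# Placeholder success rates for 5 seeds (illustrative only).
rft_placeholder = [0.80, 0.82, 0.79, 0.81, 0.83]
sft_placeholder = [0.60, 0.61, 0.58, 0.62, 0.59]

t_value = paired_t_statistic(rft_placeholder, sft_placeholder)
# With 5 seeds there are 4 degrees of freedom; the two-sided critical value
# at the 0.05 level is about 2.776.
print(t_value, t_value > 2.776)
```

Pairing by seed removes seed-to-seed variation shared by both methods, which matters when only five runs are available.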

Circularity Check

0 steps flagged

No significant circularity detected


The paper defines LifeLong-RFT via explicit process rewards (QACR for discrete action consistency, CTAR for trajectory alignment to references, FCR for format) and evaluates empirical gains on external benchmarks (LIBERO, SimplerEnv, real-world) against SFT baselines. The 22% success-rate improvement with 20% data is a measured outcome, not a quantity forced by re-using fitted parameters or self-referential definitions. Reference trajectories in CTAR are standard supervised signals within the RL setup and do not collapse the claimed on-policy optimization to the input data by construction. No load-bearing self-citation chains, uniqueness theorems, or ansatz smuggling appear in the provided derivation; the method remains self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that VLA models can be effectively optimized via process rewards without environmental interaction and on the ad-hoc invention of the three reward functions to substitute for standard RL signals.

axioms (1)
  • domain assumption: Pretrained VLA models exhibit strong generalization that can be adapted to downstream domains via fine-tuning mechanisms.
    Opening premise of the abstract.
invented entities (1)
  • Multi-dimensional process reward mechanism (QACR, CTAR, FCR) · no independent evidence
    purpose: To quantify heterogeneous contributions of action chunks for policy optimization in the absence of online feedback or pre-trained rewards.
    These reward components are newly defined in the paper to enable the proposed RFT strategy.

pith-pipeline@v0.9.0 · 5843 in / 1347 out tokens · 77754 ms · 2026-05-21T14:03:18.995651+00:00 · methodology


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

    cs.RO 2026-05 unverdicted novelty 7.0

    ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...

  2. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  3. Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

    cs.RO 2026-05 unverdicted novelty 5.0

    ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.

  4. Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

    cs.CV 2026-05 unverdicted novelty 5.0

    A closed-loop system couples LLM-based 3D scene generation with RL optimization and VR user interactions to produce adaptive, immersive environments, claiming SOTA results on the ALFRED benchmark.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 3 Pith papers · 29 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  2. [2]

    A survey of meta-reinforcement learning

    Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, and Shimon Whiteson. A survey of meta-reinforcement learning. arXiv preprint arXiv:2301.08028, 2023

  3. [3]

    A tutorial on meta-reinforcement learning

    Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, and Shimon Whiteson. A tutorial on meta-reinforcement learning. Foundations and Trends in Machine Learning, 18(2-3):224–384, 2025

  4. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

  5. [5]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  6. [6]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  7. [7]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  8. [8]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

  9. [9]

    On Tiny Episodic Memories in Continual Learning

    Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019

  10. [10]

    Retaining by doing: The role of on-policy data in mitigating forgetting

    Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting. arXiv preprint arXiv:2510.18874, 2025

  11. [11]

    Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Quanlu Zhang, Zhaofei Yu, Guoliang Fan, Tiejun Huang, Yu Wang, and Chao Yu. πRL: Online RL fine-tuning for flow-based vision-language-action models. arXiv preprint arXiv:2510.25889, 2025

  12. [12]

    Tevir: Text-to-video reward with diffusion models for efficient reinforcement learning

    Yuhui Chen, Haoran Li, Zhennan Jiang, Haowei Wen, and Dongbin Zhao. Tevir: Text-to-video reward with diffusion models for efficient reinforcement learning. arXiv preprint arXiv:2505.19769, 2025

  13. [13]

    Conrft: A reinforced fine-tuning method for vla models via consistency policy

    Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for vla models via consistency policy. arXiv preprint arXiv:2502.05450, 2025

  14. [15]

    Tgrpo: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization

    Zengjue Chen, Runliang Niu, He Kong, Qi Wang, Qianli Xing, and Zipei Fan. Tgrpo: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization. arXiv preprint arXiv:2506.08440, 2025

  15. [16]

    Deep reinforcement learning from human preferences

    Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neu...

  16. [17]

    Video prediction models as rewards for reinforcement learning

    Alejandro Escontrela, Ademi Adeniji, Wilson Yan, Ajay Jain, Xue Bin Peng, Ken Goldberg, Youngwoon Lee, Danijar Hafner, and Pieter Abbeel. Video prediction models as rewards for reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: An...

  17. [19]

    Srpo: Self-referential policy optimization for vision-language-action models

    Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, and Xipeng Qiu. Srpo: Self-referential policy optimization for vision-language-action models. arXiv preprint arXiv:2511.15605, 2025

  18. [20]

    Mergevla: Cross-skill model merging toward a generalist vision-language-action agent

    Yuxia Fu, Zhizhen Zhang, Yuqi Zhang, Zijian Wang, Zi Huang, and Yadan Luo. Mergevla: Cross-skill model merging toward a generalist vision-language-action agent. arXiv preprint arXiv:2511.18810, 2025

  19. [21]

    Improving vision-language-action model with online reinforcement learning

    Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025

  20. [22]

    ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

    Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning. arXiv preprint arXiv:2507.16815, 2025

  21. [23]

    Diffusion reward: Learning rewards via conditional video diffusion

    Tao Huang, Guangqi Jiang, Yanjie Ze, and Huazhe Xu. Diffusion reward: Learning rewards via conditional video diffusion. In Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLII, volume 1...

  22. [24]

    Nora-1.5: A vision-language-action model trained using world model-and action-based preference rewards

    Chia-Yu Hung, Navonil Majumder, Haoyuan Deng, Liu Renhang, Yankang Ang, Amir Zadeh, Chuan Li, Dorien Herremans, Ziwei Wang, and Soujanya Poria. Nora-1.5: A vision-language-action model trained using world model-and action-based preference rewards. arXiv preprint arXiv:2511.14659, 2025

  23. [25]

    NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

    Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U Tan, Navonil Majumder, Soujanya Poria, et al. Nora: A small open-sourced generalist vision language action model for embodied tasks. arXiv preprint arXiv:2504.19854, 2025

  24. [26]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  25. [27]

    World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation

    Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dongbin Zhao. World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation. arXiv preprint arXiv:2509.19080, 2025

  26. [28]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  27. [29]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025

  28. [30]

    Reinforcement fine-tuning naturally mitigates forgetting in continual post-training

    Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, and Fei Zhu. Reinforcement fine-tuning naturally mitigates forgetting in continual post-training. arXiv preprint arXiv:2507.05386, 2025

  29. [31]

    Incremental learning of retrievable skills for efficient continual task adaptation

    Daehee Lee, Minjong Yoo, Woo Kyung Kim, Wonje Choi, and Honguk Woo. Incremental learning of retrievable skills for efficient continual task adaptation. Advances in Neural Information Processing Systems, 37:17286–17312, 2024

  30. [32]

    MolmoAct: Action Reasoning Models that can Reason in Space

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025

  31. [33]

    RoboReward: General-purpose vision-language reward models for robotics

    Tony Lee, Andrew Wagenmaker, Karl Pertsch, Percy Liang, Sergey Levine, and Chelsea Finn. Roboreward: General-purpose vision-language reward models for robotics. arXiv preprint arXiv:2601.00675, 2026

  32. [34]

    SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning. arXiv preprint arXiv:2509.09674, 2025

  33. [35]

    Vla-rft: Vision-language-action reinforcement fine-tuning with verified rewards in world simulators

    Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, et al. Vla-rft: Vision-language-action reinforcement fine-tuning with verified rewards in world simulators. arXiv preprint arXiv:2510.00406, 2025

  34. [36]

    Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting

    Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In International conference on machine learning, pages 3925–3934. PMLR, 2019

  35. [37]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024

  36. [38]

    Libero: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  37. [39]

    Towards generalist robot policies: What matters in building vision-language-action models

    Huaping Liu, Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, and Hanbo Zhang. Towards generalist robot policies: What matters in building vision-language-action models. 2025

  38. [40]

    What can RL bring to VLA generalization? an empirical study

    Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, and Yu Wang. What can RL bring to VLA generalization? an empirical study. arXiv preprint arXiv:2505.19789, 2025

  39. [41]

    Tail: Task-specific adapters for imitation learning with large pretrained models

    Zuxin Liu, Jesse Zhang, Kavosh Asadi, Yao Liu, Ding Zhao, Shoham Sabach, and Rasool Fakoor. Tail: Task-specific adapters for imitation learning with large pretrained models. arXiv preprint arXiv:2310.05905, 2023

  40. [42]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  41. [44]

    VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

    Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Ziwei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning. arXiv preprint arXiv:2505.18719, 2025

  42. [45]

    Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning

    Jianlan Luo, Charles Xu, Jeffrey Wu, and Sergey Levine. Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning. Sci. Robotics, 10(105), 2025

  43. [46]

    An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747, 2023

  44. [48]

    Reinforcement fine-tuning of flow-matching policies for vision-language-action models

    Mingyang Lyu, Yinqian Sun, Erliang Lin, Huangrui Li, Ruolin Chen, Feifei Zhao, and Yi Zeng. Reinforcement fine-tuning of flow-matching policies for vision-language-action models. arXiv preprint arXiv:2510.09976, 2025

  45. [49]

    Packnet: Adding multiple tasks to a single network by iterative pruning

    Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018

  46. [50]

    Latest advancements towards catastrophic forgetting under data scarcity: A comprehensive survey on few-shot class incremental learning

    M. Anwar Ma’sum, Mahardhika Pratama, and Igor Skrjanc. Latest advancements towards catastrophic forgetting under data scarcity: A comprehensive survey on few-shot class incremental learning. arXiv preprint arXiv:2502.08181, 2025

  47. [51]

    Preserving and combining knowledge in robotic lifelong reinforcement learning

    Yuan Meng, Zhenshan Bing, Xiangtong Yao, Kejia Chen, Kai Huang, Yang Gao, Fuchun Sun, and Alois Knoll. Preserving and combining knowledge in robotic lifelong reinforcement learning. Nat. Mac. Intell., 7(2):256–269, 2025

  48. [52]

    Preserving and combining knowledge in robotic lifelong reinforcement learning

    Yuan Meng, Zhenshan Bing, Xiangtong Yao, Kejia Chen, Kai Huang, Yang Gao, Fuchun Sun, and Alois Knoll. Preserving and combining knowledge in robotic lifelong reinforcement learning. Nature Machine Intelligence, pages 1–14, 2025

  49. [53]

    Gr00t n1.5: An upgraded foundation model for humanoid robots

    NVIDIA Isaac Robotics Team. Gr00t n1.5: An upgraded foundation model for humanoid robots. https://research.nvidia.com/labs/gear/gr00t-n1_5/, 2025

  50. [54]

    Sop: A scalable online post-training system for vision-language-action models

    Mingjie Pan, Siyuan Feng, Qinglin Zhang, Xinchen Li, Jianheng Song, Chendi Qu, Yi Wang, Chuankang Li, Ziyu Xiong, Zhi Chen, et al. Sop: A scalable online post-training system for vision-language-action models. arXiv preprint arXiv:2601.03044, 2026

  51. [55]

    Pre-trained vision and language transformers are few-shot incremental learners

    Keon-Hee Park, Kyungwoo Song, and Gyeong-Moon Park. Pre-trained vision and language transformers are few-shot incremental learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 23881–23890. IEEE, 2024

  52. [56]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  53. [57]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025

  54. [58]

    Continual unsupervised representation learning

    Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye Teh, and Raia Hadsell. Continual unsupervised representation learning. Advances in Neural Information Processing Systems, 32, 2019

  55. [59]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  56. [60]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  57. [61]

    RL's Razor: Why Online Reinforcement Learning Forgets Less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s razor: Why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259, 2025

  58. [62]

    Few-shot vision-language action-incremental policy learning

    Mingchen Song, Xiang Deng, Guoqiang Zhong, Qi Lv, Jia Wan, Yinchuan Li, Jianye Hao, and Weili Guan. Few-shot vision-language action-incremental policy learning. arXiv preprint arXiv:2504.15517, 2025

  59. [63]

    Sumedh Sontakke, Jesse Zhang, Sébastien M. R. Arnold, Karl Pertsch, Erdem Biyik, Dorsa Sadigh, Chelsea Finn, and Laurent Itti. Roboclip: One demonstration is enough to learn robot policies. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Co...

  60. [64]

    Interactive Post-Training for Vision-Language-Action Models

    Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training for vision-language-action models. arXiv preprint arXiv:2505.17016, 2025

  61. [65]

    Few-shot class-incremental learning

    Xiaoyu Tao, Xiaopeng Hong, Xinyuan Chang, Songlin Dong, Xing Wei, and Yihong Gong. Few-shot class-incremental learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 12180–12189. Computer Vision Foundation / IEEE, 2020

  62. [66]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

  63. [67]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–

  64. [68]

    Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery

    Weikang Wan, Yifeng Zhu, Rutav Shah, and Yuke Zhu. Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 537–544. IEEE, 2024

  65. [69]

    Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning

    Yixiao Wang, Yifei Zhang, Mingxiao Huo, Ran Tian, Xiang Zhang, Yichen Xie, Chenfeng Xu, Pengliang Ji, Wei Zhan, Mingyu Ding, et al. Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning. arXiv preprint arXiv:2407.01531, 2024

  66. [70]

    Continually Evolving Skill Knowledge in Vision Language Action Model

    Yuxuan Wu, Guangming Wang, Zhiheng Yang, Maoqing Yao, Brian Sheil, and Hesheng Wang. Continually evolving skill knowledge in vision language action model. arXiv preprint arXiv:2511.18085, 2025

  67. [71]

    World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

    Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-env: Leveraging world model as a virtual environment for vla post-training. arXiv preprint arXiv:2509.24948, 2025

  68. [72]

    Speci: Skill prompts based hierarchical continual imitation learning for robot manipulation

    Jingkai Xu and Xiangli Nie. Speci: Skill prompts based hierarchical continual imitation learning for robot manipulation. arXiv preprint arXiv:2504.15561, 2025

  69. [73]

    Policy decorator: Model-agnostic online refinement for large policy model

    Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, and Hao Su. Policy decorator: Model-agnostic online refinement for large policy model. arXiv preprint arXiv:2412.13630, 2024

  70. [74]

    Rlinf-vla: A unified and efficient framework for vla+rl training

    Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, et al. Rlinf-vla: A unified and efficient framework for vla+rl training. arXiv preprint arXiv:2510.06710, 2025

  71. [75]

    A vision-language-action-critic model for robotic real-world reinforcement learning

    Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, and Jiangmiao Pang. A vision-language-action-critic model for robotic real-world reinforcement learning. arXiv preprint arXiv:2509.15937, 2025

  72. [76]

    Reinforcing action policies by prophesying

    Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, and Li Zhang. Reinforcing action policies by prophesying. arXiv preprint arXiv:2511.20633, 2025

  73. [77]

    ReWiND: Language-guided rewards teach robot policies without new demonstrations

    Jiahui Zhang, Yusen Luo, Abrar Anwar, Sumedh Anand Sontakke, Joseph J Lim, Jesse Thomason, Erdem Biyik, and Jesse Zhang. Rewind: Language-guided rewards teach robot policies without new demonstrations. arXiv preprint arXiv:2505.10911, 2025

  74. [78]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

  75. [79]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

  76. [80]

    TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

    Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024

  77. [81]

    Wmpo: World model-based policy optimization for vision-language-action models

    Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, and Song Guo. WMPO: world model-based policy optimization for vision-language-action models. arXiv preprint arXiv:2511.09515, 2025

  78. [82]

    Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation

    Yifeng Zhu, Peter Stone, and Yuke Zhu. Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 7(2):4126–4133, 2022

  79. [83]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

  80. [84]

    Multi-Task Learning: The training settings for multi-task learning on SimplerEnv are detailed in Table VII. Notably, the WidowX setup utilizes a global batch size of 512 for 30 epochs, whereas the Google Robot employs a batch size of 1024 for 40 epochs. Apart from these specific adjustments, the remaining hyperparameters are kept consistent, highlighting t...

Showing first 80 references.