Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning
Pith reviewed 2026-05-21 14:03 UTC · model grok-4.3
The pith
Reinforcement fine-tuning with chunk-level rewards lets VLA models adapt to new robotic tasks while preserving prior skills.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LifeLong-RFT integrates chunking-level on-policy reinforcement learning with a multi-dimensional process reward mechanism to optimize VLA policies for continual learning. The mechanism consists of Quantized Action Consistency Reward to enforce accuracy inside the discrete action space, Continuous Trajectory Alignment Reward to match decoded continuous chunks against reference trajectories, and Format Compliance Reward to ensure structural validity of outputs. This combination enables policy improvement independent of environmental interaction and pre-trained rewards, producing strong multi-task results across SimplerEnv, LIBERO, and real-world settings together with the reported 22 percent S
What carries the argument
The multi-dimensional process reward mechanism (QACR, CTAR, FCR) that assigns separate scores to intermediate action chunks to quantify their heterogeneous contributions and drive policy optimization.
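As a rough illustration of how three such terms might be combined into one score per action chunk, consider the Python sketch below. The functional forms (a token match rate for QACR, an exponentiated mean distance for CTAR, a length and vocabulary check for FCR), the equal default weights, and every function name are assumptions made for illustration; this page does not give the paper's actual definitions.

```python
import math
from typing import Sequence


def qacr(pred_tokens: Sequence[int], ref_tokens: Sequence[int]) -> float:
    """Assumed QACR form: fraction of reference tokens matched position by position."""
    if not ref_tokens:
        return 0.0
    matches = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return matches / len(ref_tokens)


def ctar(pred_chunk: Sequence[Sequence[float]],
         ref_chunk: Sequence[Sequence[float]],
         scale: float = 1.0) -> float:
    """Assumed CTAR form: exp(-mean Euclidean distance / scale), a value in (0, 1]."""
    if not ref_chunk or len(pred_chunk) != len(ref_chunk):
        return 0.0
    mean_dist = sum(math.dist(p, r) for p, r in zip(pred_chunk, ref_chunk)) / len(ref_chunk)
    return math.exp(-mean_dist / scale)


def fcr(pred_tokens: Sequence[int], expected_len: int, vocab_size: int) -> float:
    """Assumed FCR form: 1.0 if the output has the right length and valid token ids."""
    valid = len(pred_tokens) == expected_len and all(0 <= t < vocab_size for t in pred_tokens)
    return 1.0 if valid else 0.0


def chunk_reward(pred_tokens, ref_tokens, pred_chunk, ref_chunk,
                 vocab_size, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three terms for one chunk (the weights are hypothetical)."""
    w_q, w_c, w_f = weights
    return (w_q * qacr(pred_tokens, ref_tokens)
            + w_c * ctar(pred_chunk, ref_chunk)
            + w_f * fcr(pred_tokens, len(ref_tokens), vocab_size))
```

Note that every term here is computed against a stored reference or a format rule, which is what would make such a reward usable without robot trials, and also why its quality depends on the references.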
If this is right
- VLA policies can incorporate new tasks with far less task-specific data than supervised fine-tuning requires.
- Catastrophic forgetting is reduced during sequential task introduction on standard robotic benchmarks.
- Performance gains appear in both simulated environments and real-world robot deployments.
- The approach supplies a post-training route for general-purpose VLA models that does not depend on live feedback.
Where Pith is reading between the lines
- Robots could accumulate skills across months or years of operation in homes or factories without repeated full retraining.
- The chunk-reward structure may transfer to other sequence-based embodied models beyond current VLA architectures.
- Combining this fine-tuning stage with memory-replay or regularization methods could further strengthen long-term retention.
- Deployment cost drops if new skills can be added from modest demonstration sets rather than exhaustive new collections.
Load-bearing premise
The three rewards can accurately measure the value of each action chunk for successful task execution without any real-time robot trials or separate reward models.
What would settle it
The claim would be undermined if, on the LIBERO continual-learning sequence, the method showed no average success-rate improvement over supervised fine-tuning, or needed substantially more than 20 percent of the data to match prior-task retention.
Original abstract
Pretrained on large-scale and diverse datasets, VLA models demonstrate strong generalization and adaptability as general-purpose robotic policies. However, Supervised Fine-Tuning (SFT), which serves as the primary mechanism for adapting VLAs to downstream domains, requires substantial amounts of task-specific data and is prone to catastrophic forgetting. To address these limitations, we propose LifeLong-RFT, a simple yet effective Reinforcement Fine-Tuning (RFT) strategy for VLA models independent of online environmental feedback and pre-trained reward models. By integrating chunking-level on-policy reinforcement learning with the proposed multi-dimensional process reward mechanism, LifeLong-RFT quantifies the heterogeneous contributions of intermediate action chunks across three dimensions to facilitate policy optimization. Specifically, (1) the Quantized Action Consistency Reward (QACR) ensures accurate action prediction within the discrete action space; (2) the Continuous Trajectory Alignment Reward (CTAR) aligns decoded continuous action chunks with reference trajectories to ensure precise control; (3) the Format Compliance Reward (FCR) guarantees the structural validity of outputs. Comprehensive experiments across SimplerEnv, LIBERO, and real-world tasks demonstrate that LifeLong-RFT exhibits strong performance in multi-task learning. Furthermore, for continual learning on the LIBERO benchmark, our method achieves a 22% gain in average success rate over SFT, while effectively adapting to new tasks using only 20% of the training data. Overall, our method provides a promising post-training paradigm for VLAs. The project page is available at <https://yuan-liu-lifelong-rft.github.io>.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents LifeLong-RFT, a reinforcement fine-tuning (RFT) method for Vision-Language-Action (VLA) models to support continual learning in robotic tasks. The approach integrates chunking-level on-policy reinforcement learning with a multi-dimensional process reward mechanism comprising Quantized Action Consistency Reward (QACR), Continuous Trajectory Alignment Reward (CTAR), and Format Compliance Reward (FCR). This is claimed to enable policy optimization independent of online environmental feedback and pre-trained reward models. The paper reports strong performance in multi-task learning on SimplerEnv, LIBERO, and real-world tasks, including a 22% gain in average success rate over Supervised Fine-Tuning (SFT) on the LIBERO benchmark while adapting to new tasks with only 20% of the training data.
Significance. If the central claims hold, this work represents a significant advancement in developing long-lived robots by providing a data-efficient post-training method for VLAs that mitigates catastrophic forgetting. The independence from online feedback and external rewards is a key strength, potentially enabling safer and more practical deployment in real-world continual learning scenarios. The use of process rewards for heterogeneous action chunk contributions is an innovative aspect that could influence future VLA fine-tuning strategies.
major comments (3)
- [§3.2] §3.2 (definition of CTAR): The Continuous Trajectory Alignment Reward explicitly aligns decoded continuous action chunks with reference trajectories. In continual learning settings that use only 20% of the training data, these references must originate from the limited task-specific data or prior demonstrations; without an explicit derivation showing the composite reward remains informative when reference quality degrades, the reported 22% LIBERO gain risks being an artifact of the data split rather than a genuine reinforcement signal.
- [§4.1] §4.1 (on-policy RL formulation): The method is described as performing chunking-level on-policy reinforcement learning independent of online environmental feedback. Standard on-policy methods require environment rollouts for advantage estimation; the manuscript must clarify how policy updates are obtained without any interaction or pre-trained reward model, as this is load-bearing for the independence claim.
- [§5.3] §5.3 (LIBERO continual learning results): The 22% average success rate gain over SFT is presented as evidence of effective adaptation with 20% data, but the section provides insufficient detail on the number of random seeds, variance across runs, statistical significance tests, and ablations isolating QACR/CTAR/FCR contributions. This undermines verification that the gains are robustly attributable to the RFT mechanism.
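The statistics requested in the §5.3 comment could be computed along the following lines. This is a minimal Python sketch using only the standard library; the function names are hypothetical, and it assumes each method is evaluated on the same set of seeds so that the runs can be paired.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(success_rates: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed success rates."""
    return statistics.mean(success_rates), statistics.stdev(success_rates)


def paired_t_statistic(method_a: Sequence[float],
                       method_b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed results.

    Element i of each sequence must come from the same seed.
    """
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

The p-value then follows from the t distribution with the returned degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` computes the statistic and p-value in one call.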
minor comments (2)
- [Abstract] Abstract: The claim of 'comprehensive experiments' would be strengthened by briefly naming the primary baselines beyond SFT.
- [§2.3] §2.3 (related work): The discussion of prior VLA fine-tuning methods could include a short comparison table to highlight differences from existing continual learning approaches.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the potential impact of LifeLong-RFT. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript.
-
Referee: [§3.2] §3.2 (definition of CTAR): The Continuous Trajectory Alignment Reward explicitly aligns decoded continuous action chunks with reference trajectories. In continual learning settings that use only 20% of the training data, these references must originate from the limited task-specific data or prior demonstrations; without an explicit derivation showing the composite reward remains informative when reference quality degrades, the reported 22% LIBERO gain risks being an artifact of the data split rather than a genuine reinforcement signal.
Authors: We appreciate this observation. Reference trajectories for CTAR are constructed from the available 20% task-specific demonstrations combined with relevant prior demonstrations from the pre-training corpus. The composite reward integrates QACR and FCR as robust complementary signals that maintain informativeness even under partial trajectory alignment. We will add an explicit derivation of the composite reward in §3.2 along with a sensitivity analysis demonstrating stability when reference quality is degraded, plus additional experiments varying the data split percentage. revision: partial
-
Referee: [§4.1] §4.1 (on-policy RL formulation): The method is described as performing chunking-level on-policy reinforcement learning independent of online environmental feedback. Standard on-policy methods require environment rollouts for advantage estimation; the manuscript must clarify how policy updates are obtained without any interaction or pre-trained reward model, as this is load-bearing for the independence claim.
Authors: We agree that the on-policy formulation requires clearer exposition. Policy updates are performed by sampling action chunks from the current policy, then directly computing the multi-dimensional process rewards (QACR, CTAR, FCR) on the generated outputs to serve as surrogate advantages. No environment rollouts or external reward models are used; the rewards derive solely from internal consistency checks against references and format constraints. We will revise §4.1 to include the full mathematical update rule, a step-by-step derivation, and pseudo-code for the process. revision: yes
-
Referee: [§5.3] §5.3 (LIBERO continual learning results): The 22% average success rate gain over SFT is presented as evidence of effective adaptation with 20% data, but the section provides insufficient detail on the number of random seeds, variance across runs, statistical significance tests, and ablations isolating QACR/CTAR/FCR contributions. This undermines verification that the gains are robustly attributable to the RFT mechanism.
Authors: We acknowledge the need for greater statistical rigor and component isolation. In the revision we will report results over 5 random seeds with mean and standard deviation, include paired t-test p-values for RFT versus SFT comparisons, and add ablation tables in §5.3 (and supplementary material) that isolate the performance contribution of each reward term while holding the others fixed. revision: yes
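The update described in the response to the §4.1 comment (sample chunks from the current policy, score them, use the scores as surrogate advantages) can be sketched as follows. The rebuttal does not say how rewards become advantages, so the group-relative normalization used here, in the style of GRPO, is an assumption, as are the function names. The rewards are taken as given numbers, for example from a reference-based scoring function.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards of chunks sampled for the same input (assumed scheme).

    Each advantage is (reward - group mean) / (group std + eps), so chunks
    that beat their siblings get a positive weight and the rest a negative one.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two sampled chunks per group")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def surrogate_loss(log_probs: Sequence[float], advantages: Sequence[float]) -> float:
    """Plain policy-gradient surrogate: -mean(advantage * log-probability).

    log_probs are the current policy's log-probabilities of the sampled chunks.
    No environment step or learned reward model appears anywhere in this loop.
    """
    if len(log_probs) != len(advantages):
        raise ValueError("one advantage per sampled chunk is required")
    return -sum(lp * a for lp, a in zip(log_probs, advantages)) / len(log_probs)
```

In a real training loop the log-probabilities would be differentiable tensors and the loss would usually be clipped or regularized; the sketch only shows where the reward signal enters.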
Circularity Check
No significant circularity detected
The paper defines LifeLong-RFT via explicit process rewards (QACR for discrete action consistency, CTAR for trajectory alignment to references, FCR for format) and evaluates empirical gains on external benchmarks (LIBERO, SimplerEnv, real-world) against SFT baselines. The 22% success-rate improvement with 20% data is a measured outcome, not a quantity forced by re-using fitted parameters or self-referential definitions. Reference trajectories in CTAR are standard supervised signals within the RL setup and do not collapse the claimed on-policy optimization to the input data by construction. No load-bearing self-citation chains, uniqueness theorems, or ansatz smuggling appear in the provided derivation; the method remains self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Pretrained VLA models exhibit strong generalization that can be adapted to downstream domains via fine-tuning mechanisms.
invented entities (1)
- Multi-dimensional process reward mechanism (QACR, CTAR, FCR): no independent evidence
Forward citations
Cited by 4 Pith papers
-
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...
-
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
-
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.
-
Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling
A closed-loop system couples LLM-based 3D scene generation with RL optimization and VR user interactions to produce adaptive, immersive environments, claiming SOTA results on the ALFRED benchmark.
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wen- bin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
A survey of meta-reinforcement learning.arXiv preprint arXiv:2301.08028, 2023
Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, and Shimon White- son. A survey of meta-reinforcement learning.arXiv preprint arXiv:2301.08028, 2023
-
[3]
Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, and Shimon White- son. A tutorial on meta-reinforcement learning.Founda- tions and Trends in Machine Learning, 18(2-3):224–384, 2025
work page 2025
-
[4]
PaliGemma: A versatile 3B VLM for transfer
Lucas Beyer, Andreas Steiner, Andr ´e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Casta ˜neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0: A vision- language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[8]
WorldVLA: Towards Autoregressive Action World Model
Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
On Tiny Episodic Memories in Continual Learning
Arslan Chaudhry, Marcus Rohrbach, Mohamed Elho- seiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. On tiny episodic memories in continual learning.arXiv preprint arXiv:1902.10486, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1902
-
[10]
Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on- policy data in mitigating forgetting.arXiv preprint arXiv:2510.18874, 2025
- [11]
-
[12]
Tevir: Text-to-video reward with diffusion models for efficient reinforcement learning
Yuhui Chen, Haoran Li, Zhennan Jiang, Haowei Wen, and Dongbin Zhao. Tevir: Text-to-video reward with diffusion models for efficient reinforcement learning. arXiv preprint arXiv:2505.19769, 2025
-
[13]
Conrft: A reinforced fine-tuning method for vla models via consistency policy
Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for vla models via consistency policy. arXiv preprint arXiv:2502.05450, 2025
-
[15]
arXiv preprint arXiv:2506.08440 , year=
Zengjue Chen, Runliang Niu, He Kong, Qi Wang, Qianli Xing, and Zipei Fan. Tgrpo: Fine-tuning vision- language-action model via trajectory-wise group relative policy optimization.arXiv preprint arXiv:2506.08440, 2025
-
[16]
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep rein- forcement learning from human preferences. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwanathan, and Ro- man Garnett, editors,Advances in Neural Information Processing Systems 30: Annual Conference on Neu...
work page 2017
-
[17]
Video prediction models as rewards for reinforcement learning
Alejandro Escontrela, Ademi Adeniji, Wilson Yan, Ajay Jain, Xue Bin Peng, Ken Goldberg, Youngwoon Lee, Danijar Hafner, and Pieter Abbeel. Video prediction models as rewards for reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: An...
work page 2023
-
[19]
Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, and Xipeng Qiu. Srpo: Self-referential policy optimization for vision-language-action models.arXiv preprint arXiv:2511.15605, 2025
-
[20]
Yuxia Fu, Zhizhen Zhang, Yuqi Zhang, Zijian Wang, Zi Huang, and Yadan Luo. Mergevla: Cross-skill model merging toward a generalist vision-language- action agent.arXiv preprint arXiv:2511.18810, 2025
-
[21]
Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning.arXiv preprint arXiv:2501.16664, 2025
-
[22]
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu- Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision- language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Diffusion reward: Learning rewards via conditional video diffusion
Tao Huang, Guangqi Jiang, Yanjie Ze, and Huazhe Xu. Diffusion reward: Learning rewards via conditional video diffusion. In Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and G ¨ul Varol, editors,Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLII, volume 1...
work page 2024
-
[24]
Chia-Yu Hung, Navonil Majumder, Haoyuan Deng, Liu Renhang, Yankang Ang, Amir Zadeh, Chuan Li, Dorien Herremans, Ziwei Wang, and Soujanya Poria. Nora-1.5: A vision-language-action model trained using world model-and action-based preference rewards.arXiv preprint arXiv:2511.14659, 2025
-
[25]
NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks
Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U Tan, Navonil Majumder, Soujanya Poria, et al. Nora: A small open-sourced generalist vision language action model for embodied tasks.arXiv preprint arXiv:2504.19854, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yu- peng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dongbin Zhao. World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation.arXiv preprint arXiv:2509.19080, 2025
-
[28]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine- tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, and Fei Zhu. Reinforcement fine-tuning naturally mitigates forgetting in continual post-training.arXiv preprint arXiv:2507.05386, 2025
-
[31]
Incremental learning of re- trievable skills for efficient continual task adaptation
Daehee Lee, Minjong Yoo, Woo Kyung Kim, Wonje Choi, and Honguk Woo. Incremental learning of re- trievable skills for efficient continual task adaptation. Advances in Neural Information Processing Systems, 37: 17286–17312, 2024
work page 2024
-
[32]
MolmoAct: Action Reasoning Models that can Reason in Space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reason- ing models that can reason in space.arXiv preprint arXiv:2508.07917, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
RoboReward: General-purpose vision- language reward models for robotics,
Tony Lee, Andrew Wagenmaker, Karl Pertsch, Percy Liang, Sergey Levine, and Chelsea Finn. Robore- ward: General-purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026
-
[34]
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhao- hui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning.arXiv preprint arXiv:2509.09674, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, et al. Vla-rft: Vision- language-action reinforcement fine-tuning with veri- fied rewards in world simulators.arXiv preprint arXiv:2510.00406, 2025
-
[36]
Learn to grow: A continual structure learning framework for overcoming catastrophic forget- ting
Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forget- ting. InInternational conference on machine learning, pages 3925–3934. PMLR, 2019
work page 2019
-
[37]
Evaluating Real-World Robot Manipulation Policies in Simulation
Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023
work page 2023
-
[39]
Towards generalist robot policies: What matters in building vision-language-action models
Huaping Liu, Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, and Hanbo Zhang. Towards generalist robot policies: What matters in building vision-language-action models. 2025
work page 2025
-
[40]
What can RL bring to VLA generalization? an empirical study.arXiv preprint arXiv:2505.19789, 2025
Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, and Yu Wang. What can RL bring to VLA generalization? an empirical study.arXiv preprint arXiv:2505.19789, 2025
-
[41]
Tail: Task-specific adapters for imitation learning with large pretrained models,
Zuxin Liu, Jesse Zhang, Kavosh Asadi, Yao Liu, Ding Zhao, Shoham Sabach, and Rasool Fakoor. Tail: Task- specific adapters for imitation learning with large pre- trained models.arXiv preprint arXiv:2310.05905, 2023
-
[42]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[44]
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Zi- wei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Precise and dexterous robotic manipulation via human- in-the-loop reinforcement learning.Sci
Jianlan Luo, Charles Xu, Jeffrey Wu, and Sergey Levine. Precise and dexterous robotic manipulation via human- in-the-loop reinforcement learning.Sci. Robotics, 10 (105), 2025
work page 2025
-
[46]
An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine- tuning.arXiv preprint arXiv:2308.08747, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
Mingyang Lyu, Yinqian Sun, Erliang Lin, Huan- grui Li, Ruolin Chen, Feifei Zhao, and Yi Zeng. Reinforcement fine-tuning of flow-matching policies for vision-language-action models.arXiv preprint arXiv:2510.09976, 2025
-
[49]
Packnet: Adding multiple tasks to a single network by iterative pruning
Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018
work page 2018
-
[50]
Anwar Ma’sum, Mahardhika Pratama, and Igor Skr- janc
M. Anwar Ma’sum, Mahardhika Pratama, and Igor Skr- janc. Latest advancements towards catastrophic for- getting under data scarcity: A comprehensive survey on few-shot class incremental learning.arXiv preprint arXiv:2502.08181, 2025
-
[51]
Preserving and combining knowledge in robotic lifelong reinforcement learning.Nat
Yuan Meng, Zhenshan Bing, Xiangtong Yao, Kejia Chen, Kai Huang, Yang Gao, Fuchun Sun, and Alois Knoll. Preserving and combining knowledge in robotic lifelong reinforcement learning.Nat. Mac. Intell., 7(2):256–269, 2025
work page 2025
-
[52]
Yuan Meng, Zhenshan Bing, Xiangtong Yao, Kejia Chen, Kai Huang, Yang Gao, Fuchun Sun, and Alois Knoll. Preserving and combining knowledge in robotic lifelong reinforcement learning.Nature Machine Intelligence, pages 1–14, 2025
work page 2025
-
[53]
Gr00t n1.5: An upgraded foundation model for humanoid robots
NVIDIA Isaac Robotics Team. Gr00t n1.5: An upgraded foundation model for humanoid robots. https://research. nvidia.com/labs/gear/gr00t-n1 5/, 2025
work page 2025
-
[54]
Sop: A scalable online post-training system for vision-language-action models
Mingjie Pan, Siyuan Feng, Qinglin Zhang, Xinchen Li, Jianheng Song, Chendi Qu, Yi Wang, Chuankang Li, Ziyu Xiong, Zhi Chen, et al. Sop: A scalable online post-training system for vision-language-action models. arXiv preprint arXiv:2601.03044, 2026
-
[55]
Pre-trained vision and language transformers are few-shot incremental learners
Keon-Hee Park, Kyungwoo Song, and Gyeong-Moon Park. Pre-trained vision and language transformers are few-shot incremental learners. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 23881– 23890. IEEE, 2024
work page 2024
-
[56]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokeniza- tion for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[58]
Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye Teh, and Raia Hadsell. Continual unsupervised representation learning.Advances in neural information processing systems, 32, 2019
work page 2019
-
[59]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[60]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[61]
RL's Razor: Why Online Reinforcement Learning Forgets Less
Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s razor: Why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[62]
Few- shot vision-language action-incremental policy learning
Mingchen Song, Xiang Deng, Guoqiang Zhong, Qi Lv, Jia Wan, Yinchuan Li, Jianye Hao, and Weili Guan. Few- shot vision-language action-incremental policy learning. arXiv preprint arXiv:2504.15517, 2025
-
[63]
Sumedh Sontakke, Jesse Zhang, S ´ebastien M. R. Arnold, Karl Pertsch, Erdem Biyik, Dorsa Sadigh, Chelsea Finn, and Laurent Itti. Roboclip: One demonstration is enough to learn robot policies. In Alice Oh, Tristan Nau- mann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Co...
work page 2023
-
[64]
Interactive Post-Training for Vision-Language-Action Models
Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Kr ¨ahenb¨uhl. Interactive post-training for vision-language-action models.arXiv preprint arXiv:2505.17016, 2025
work page internal anchor Pith review arXiv 2025
-
[65]
Few-shot class-incremental learning
Xiaoyu Tao, Xiaopeng Hong, Xinyuan Chang, Songlin Dong, Xing Wei, and Yihong Gong. Few-shot class-incremental learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 12180–12189. Computer Vision Foundation / IEEE, 2020
-
[66]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024
-
[67]
Bridgedata v2: A dataset for robot learning at scale
Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–
-
[68]
Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery
Weikang Wan, Yifeng Zhu, Rutav Shah, and Yuke Zhu. Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 537–544. IEEE, 2024
-
[69]
Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning
Yixiao Wang, Yifei Zhang, Mingxiao Huo, Ran Tian, Xiang Zhang, Yichen Xie, Chenfeng Xu, Pengliang Ji, Wei Zhan, Mingyu Ding, et al. Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning. arXiv preprint arXiv:2407.01531, 2024
-
[70]
Continually Evolving Skill Knowledge in Vision Language Action Model
Yuxuan Wu, Guangming Wang, Zhiheng Yang, Maoqing Yao, Brian Sheil, and Hesheng Wang. Continually evolving skill knowledge in vision language action model. arXiv preprint arXiv:2511.18085, 2025
-
[71]
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training
Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-env: Leveraging world model as a virtual environment for vla post-training. arXiv preprint arXiv:2509.24948, 2025
-
[72]
Jingkai Xu and Xiangli Nie. Speci: Skill prompts based hierarchical continual imitation learning for robot manipulation. arXiv preprint arXiv:2504.15561, 2025
-
[73]
Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, and Hao Su. Policy decorator: Model-agnostic online refinement for large policy model. arXiv preprint arXiv:2412.13630, 2024
-
[74]
Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, et al. Rlinf-vla: A unified and efficient framework for vla+rl training. arXiv preprint arXiv:2510.06710, 2025
-
[75]
Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, and Jiangmiao Pang. A vision-language-action-critic model for robotic real-world reinforcement learning. arXiv preprint arXiv:2509.15937, 2025
-
[76]
Reinforcing action policies by prophesying
Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, and Li Zhang. Reinforcing action policies by prophesying. arXiv preprint arXiv:2511.20633, 2025
-
[77]
ReWiND: Language-guided rewards teach robot policies without new demonstrations
Jiahui Zhang, Yusen Luo, Abrar Anwar, Sumedh Anand Sontakke, Joseph J Lim, Jesse Thomason, Erdem Biyik, and Jesse Zhang. Rewind: Language-guided rewards teach robot policies without new demonstrations. arXiv preprint arXiv:2505.10911, 2025
-
[78]
Cot-vla: Visual chain-of-thought reasoning for vision-language-action models
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025
-
[79]
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025
-
[80]
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024
-
[81]
Wmpo: World model-based policy optimization for vision-language-action models
Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, and Song Guo. WMPO: world model-based policy optimization for vision-language-action models. arXiv preprint arXiv:2511.09515, 2025
-
[82]
Yifeng Zhu, Peter Stone, and Yuke Zhu. Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 7(2):4126–4133, 2022
-
[83]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023
-
Multi-Task Learning: The training settings for multi-task learning on SimplerEnv are detailed in Table VII. Notably, the WidowX setup uses a global batch size of 512 for 30 epochs, whereas the Google Robot setup uses a batch size of 1024 for 40 epochs. Apart from these specific adjustments, the remaining hyperparameters are kept consistent, highlighting t...