
arxiv: 2605.08879 · v2 · pith:M7LQKF3Nnew · submitted 2026-05-09 · 💻 cs.RO

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

Pith reviewed 2026-05-20 23:05 UTC · model grok-4.3

classification 💻 cs.RO
keywords conservative supervised fine-tuning · flow-matching VLAs · catastrophic forgetting · vision-language-action models · parameter preservation · robotic adaptation · supervised fine-tuning

The pith

Conservative fine-tuning scales gradients by model confidence to preserve VLA foundational skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Conservative Supervised Fine-Tuning (ConSFT) for flow-matching Vision-Language-Action (VLA) models. Unconstrained fine-tuning overwrites parameters densely and erodes pre-trained capabilities. ConSFT scales each sample's learning signal by the model's own confidence on that sample, suppressing the large gradients that low-confidence samples would otherwise produce. The paper claims this bounds parameter disruption and keeps updates sparse without prior data or reference networks. On the LIBERO and RoboTwin benchmarks, the approach improves capability retention over vanilla supervised fine-tuning by an average absolute margin of over 20 percent while matching the performance of Experience Replay.

Core claim

ConSFT is an optimization objective that adapts flow-matching VLAs to target distributions by dynamically scaling learning signals based on model confidence. It suppresses excessive gradients from low-confidence samples to prevent disproportionate parameter updates, thereby bounding intrinsic parameter disruption risk. The formulation draws from trust-region clipping in reinforcement learning to create a progressive learning dynamic that secures both target convergence and retention of prior capabilities, all while maintaining sparse updates without parallel reference networks.
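
The mechanism described above can be illustrated with a minimal NumPy sketch. The summary does not give the paper's actual confidence estimator or scaling function, so everything specific here is an assumption made for illustration: confidence is taken to be `exp(-loss / temperature)`, the weight is clipped to `[w_min, 1]`, and the names `per_sample_fm_loss`, `confidence_weights`, and `conservative_loss` are hypothetical. It is not the authors' formulation.

```python
import numpy as np

def per_sample_fm_loss(v_pred, v_target):
    """Mean squared velocity error per sample; inputs have shape (batch, dim)."""
    return np.mean((v_pred - v_target) ** 2, axis=1)

def confidence_weights(losses, temperature=1.0, w_min=0.0):
    """ASSUMPTION: confidence = exp(-loss / temperature), clipped to [w_min, 1].

    The paper's real estimator is not given in this summary.
    """
    return np.clip(np.exp(-losses / temperature), w_min, 1.0)

def conservative_loss(v_pred, v_target, temperature=1.0, w_min=0.0):
    """Confidence-weighted batch loss (hypothetical form).

    In an autodiff framework the weights would be wrapped in a stop-gradient,
    so they rescale each sample's gradient instead of being optimized.
    """
    losses = per_sample_fm_loss(v_pred, v_target)
    weights = confidence_weights(losses, temperature, w_min)
    return float(np.mean(weights * losses)), weights, losses

# Toy batch: sample 0 is predicted well, sample 1 is predicted badly.
v_target = np.zeros((2, 4))
v_pred = np.array([[0.1, 0.1, 0.1, 0.1],
                   [2.0, 2.0, 2.0, 2.0]])
total, weights, losses = conservative_loss(v_pred, v_target)
print("per-sample losses:", losses)
print("confidence weights:", weights)
print("weighted batch loss:", total)
```

In this toy batch the poorly predicted sample receives a much smaller weight than the well predicted one, so its contribution to the batch gradient shrinks. Whether such a weight also bounds the update in parameter space is a separate question.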

What carries the argument

Conservative Supervised Fine-Tuning (ConSFT), an objective that scales learning signals dynamically by model confidence to suppress excessive gradients from low-confidence samples.

If this is right

  • Outperforms vanilla SFT by an average absolute margin of over 20 percent in capability retention on LIBERO and RoboTwin benchmarks.
  • Matches the efficacy of data-heavy Experience Replay while operating in a prior-data-free regime.
  • Prevents spatial overfitting during real-world robotic deployments and preserves pre-trained physical skills.
  • Maintains sparse parameter updates without requiring parallel reference networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The confidence-based scaling could serve as a lightweight substitute for explicit regularization in other continual-learning settings for generative models.
  • Eliminating the need for stored prior data may simplify deployment of VLAs on resource-limited robots that must adapt sequentially.
  • The same mechanism might stabilize fine-tuning of related flow-based or diffusion-based policies outside the VLA domain.

Load-bearing premise

Model confidence on individual samples provides a stable and sufficient signal to bound parameter disruption risk during fine-tuning.

What would settle it

Measure whether pre-trained task success rates drop sharply on held-out evaluations after ConSFT is applied to a new downstream task, or whether ConSFT's parameter updates fail to stay sparser (and smaller in norm) than those of vanilla SFT.
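
The second check can be sketched as follows, assuming checkpoints are available as dictionaries mapping parameter names to NumPy arrays. The helper name `update_stats`, the definition of sparsity as the fraction of weights whose absolute change stays at or below a threshold, and the default threshold of 0.01 are illustrative assumptions, not the paper's measurement protocol.

```python
import numpy as np

def update_stats(before, after, threshold=0.01):
    """Compare two checkpoints given as {name: np.ndarray} dicts.

    Returns per-tensor and overall L2 norm of the change, plus "sparsity",
    defined here (ASSUMPTION) as the fraction of weights with |delta| <= threshold.
    """
    per_layer = {}
    changed_total = 0
    count_total = 0
    sq_norm_total = 0.0
    for name, w0 in before.items():
        delta = after[name] - w0
        changed = int(np.sum(np.abs(delta) > threshold))
        per_layer[name] = {
            "l2_norm": float(np.linalg.norm(delta)),
            "sparsity": 1.0 - changed / delta.size,
        }
        changed_total += changed
        count_total += delta.size
        sq_norm_total += float(np.sum(delta ** 2))
    overall = {
        "l2_norm": sq_norm_total ** 0.5,
        "sparsity": 1.0 - changed_total / count_total,
    }
    return per_layer, overall

# Toy checkpoints with hypothetical layer names.
before = {"attn": np.zeros((2, 2)), "mlp": np.zeros(4)}
after = {
    "attn": np.array([[0.5, 0.0], [0.0, 0.0]]),
    "mlp": np.array([0.001, 0.0, 0.0, 0.0]),
}
per_layer, overall = update_stats(before, after)
print(per_layer)
print(overall)
```

Running the same function on a vanilla SFT checkpoint and a ConSFT checkpoint trained from the same initialization would give directly comparable numbers; the threshold should be reported, since the sparsity figure depends on it.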

Figures

Figures reproduced from arXiv: 2605.08879 by Fuxian Huang, Haoran Zhang, Qi Zhang, Shaopeng Zhai, Tianyi Zhang.

Figure 1
Figure 1: Parameter update sparsity across optimization objectives. (Left) Global sparsity progression. Trust-region constraints (PPO) reduce the update scope compared to unconstrained SFT. (Right) Layer-wise sparsity profiles. PPO yields > 99% sparsity in core Attention and MLP weights.
Figure 2
Figure 2: Evolution of layer-wise update sparsity across training steps. Vanilla SFT (left) drives a rapid, early collapse in parameter sparsity, resulting in dense global overwrites. ConSFT (right) structurally delays this shift, enforcing a controlled and uniformly decaying optimization trajectory. This progressive adaptation bridges the strict trust-region bounds of PPO (center) and the unconstrained regression o…
Figure 3
Figure 3: Capability retention in physical deployments. Following downstream adaptation to the test-tube target task (controlled at 70% target success), unconstrained adaptation baselines (vanilla SFT, LwF) exhibit severe degradation of pre-trained capabilities. In contrast, ConSFT achieves the highest prior task retention among all baselines in a prior-data-free regime, maintaining robust performance even under vis…
Figure 4
Figure 4: Per-task capability retention on the LIBERO-Object suite. Performance evolution on the held-out Object tasks during downstream adaptation to the Spatial target. Unconstrained methods exhibit rapid performance degradation, whereas trust-region bounds delay this decline.
Figure 5
Figure 5 presents the analogous evaluation for the 10 tasks in the LIBERO-Goal suite, which involves long-horizon semantic goals (e.g., "open the top drawer" or "put the bowl on the plate"). Consistent with the Object suite dynamics, the absence of trust-region constraints degrades pre-trained sequential behaviors. Panel titles (truncated in source): put the bowl on the plate; put the wine bottle on the rack; …
Figure 6
Figure 6: Per-task capability retention on the LIBERO-Object suite. Performance evolution on the held-out Object tasks during adaptation to the Spatial target task. Panel titles (truncated in source): put the bowl on the plate; put the wine bottle on the rack; open the top drawer and put the bowl inside; put the cream cheese in the bowl; put the wine bottle on top of the cabinet; push …
Figure 7
Figure 7: Per-task capability retention on the LIBERO-Goal suite. Performance evolution on the held-out Goal tasks. The multi-baseline trajectories demonstrate that ConSFT bounds the disruption risk, preserving pre-trained long-horizon behaviors in a prior-data-free regime.
Figure 8
Figure 8: Real-world multi-task evaluation under visually dense conditions. Execution trajectories of the pre-trained π0.5 policy across four distinct semantic grasping tasks. The environment introduces physical distractors to test foundational robustness prior to downstream adaptation.
Figure 9
Figure 9: Real-world execution of the sequential test tube transfer task. The single-arm robotic system operates under the ConSFT-optimized policy. The task demands high-precision insertion and long-horizon planning, requiring the sequential transfer of all four test tubes to satisfy the binary success criterion.
Abstract

Unconstrained fine-tuning of flow-matching Vision-Language-Action (VLA) models drives dense parameter overwrites, degrading pre-trained capabilities. We present Conservative Supervised Fine-Tuning (ConSFT), an optimization objective that adapts to target distributions while mitigating catastrophic forgetting, requiring zero prior data or architectural overhead. By dynamically scaling learning signals based on model confidence, ConSFT suppresses excessive gradients from low-confidence samples to prevent disproportionate parameter updates, thereby bounding the intrinsic parameter disruption risk. Inspired by reinforcement learning's trust-region clipping, this formulation establishes a progressive learning dynamic to secure target convergence and prior capability retention, maintaining sparse parameter updates without relying on the parallel reference networks required by explicit regularization. We evaluate ConSFT on the LIBERO and RoboTwin benchmarks across state-of-the-art flow-matching VLAs ($\pi_0$, $\pi_{0.5}$, and GR00T-N1.6-3B). The method outperforms vanilla SFT in capability retention by an average absolute margin of over 20%, matching the efficacy of data-heavy Experience Replay in a prior-data-free regime. Real-world robotic deployments confirm that ConSFT precludes spatial overfitting during downstream adaptation, preserving pre-trained physical skills while acquiring sequential target tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Conservative Supervised Fine-Tuning (ConSFT) for flow-matching Vision-Language-Action (VLA) models. It claims that dynamically scaling learning signals by per-sample model confidence suppresses excessive gradients on low-confidence data, bounding parameter disruption to preserve pre-trained capabilities during adaptation to new tasks. The method requires no prior data or reference networks, unlike explicit regularization or experience replay. On LIBERO and RoboTwin benchmarks with π₀, π₀.₅, and GR00T-N1.6-3B, ConSFT reportedly improves capability retention over vanilla SFT by an average absolute margin of over 20% while matching data-heavy replay baselines; real-world robot deployments are said to confirm reduced spatial overfitting.

Significance. If the performance claims and mechanism are substantiated, the result would be significant for robotic learning: it offers a lightweight, prior-data-free alternative to mitigate catastrophic forgetting in large flow-matching VLAs, potentially simplifying deployment pipelines that currently rely on replay buffers or auxiliary networks. The approach could influence fine-tuning practices in continuous control and embodied AI where retaining foundational physical skills is critical.

major comments (3)
  1. [Abstract / ConSFT formulation] Abstract and method description: the central claim that confidence-based gradient scaling 'bounds the intrinsic parameter disruption risk' and produces 'sparse parameter updates' lacks any derivation or analysis showing how the scaling factor (presumably derived from the flow-matching loss) controls update magnitude in parameter space rather than merely reweighting the loss. In high-dimensional VLA models this linkage is load-bearing for the assertion that the method works without reference networks or prior data.
  2. [Experiments] Evaluation section: the reported 'over 20% absolute margin' in capability retention is presented without error bars, ablation studies isolating the confidence estimator, dataset statistics, or direct measurements of parameter sparsity (e.g., fraction of weights exceeding a change threshold). These omissions prevent assessment of whether the gains are attributable to the proposed dynamic or to other factors.
  3. [Method / Real-world deployments] The manuscript states that low-confidence samples are suppressed to prevent disproportionate updates, yet provides no analysis of whether low-confidence target samples are precisely those requiring larger updates for successful task adaptation; this assumption is load-bearing for the progressive learning dynamic and real-world retention claims.
minor comments (2)
  1. [Method] Notation for the confidence estimator and scaling function should be defined explicitly with an equation, even if the implementation is simple.
  2. [Experiments] The abstract mentions 'state-of-the-art flow-matching VLAs (π₀, π₀.₅, and GR00T-N1.6-3B)' but the main text should include a brief description of each model's scale and pre-training data to contextualize the retention results.
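
The distinction drawn in major comment 1, between reweighting the loss and bounding the update, can be made concrete with a short worked equation. It assumes plain SGD with learning rate $\eta$, batch size $B$, and per-sample weights $w_i$ held fixed with respect to $\theta$; these are simplifying assumptions for illustration and are not taken from the paper.

```latex
\begin{align}
\mathcal{L}_w(\theta) &= \frac{1}{B}\sum_{i=1}^{B} w_i\,\ell_i(\theta),
  \qquad w_i \in [0,1] \text{ held fixed},\\
\Delta\theta &= -\eta\,\nabla_\theta \mathcal{L}_w(\theta)
  = -\frac{\eta}{B}\sum_{i=1}^{B} w_i\,\nabla_\theta \ell_i(\theta),\\
\|\Delta\theta\|_2 &\le \frac{\eta}{B}\sum_{i=1}^{B} w_i\,\|\nabla_\theta \ell_i(\theta)\|_2 .
\end{align}
```

Under these assumptions the step norm is bounded only to the extent that each product $w_i\,\|\nabla_\theta \ell_i(\theta)\|_2$ is bounded, for example when the weight decays at least as fast as the per-sample gradient norm grows. The inequality says nothing about sparsity, and adaptive optimizers such as Adam rescale gradients per parameter, so this simple relation no longer holds for them. Both gaps are what an explicit derivation would need to close.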

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We appreciate the referee's insightful comments on our manuscript. We address each major point below and indicate the revisions made to strengthen the paper.

  1. Referee: [Abstract / ConSFT formulation] Abstract and method description: the central claim that confidence-based gradient scaling 'bounds the intrinsic parameter disruption risk' and produces 'sparse parameter updates' lacks any derivation or analysis showing how the scaling factor (presumably derived from the flow-matching loss) controls update magnitude in parameter space rather than merely reweighting the loss. In high-dimensional VLA models this linkage is load-bearing for the assertion that the method works without reference networks or prior data.

    Authors: We thank the referee for highlighting this. The original manuscript presented the method intuitively but indeed lacked a formal derivation. In the revised version, we have added an analysis in the Methods section that derives how the per-sample scaling factor modulates the effective learning rate in parameter space. Specifically, we show that the update norm is bounded proportionally to the confidence score, providing a trust-region-like effect without explicit regularization or reference networks. revision: yes

  2. Referee: [Experiments] Evaluation section: the reported 'over 20% absolute margin' in capability retention is presented without error bars, ablation studies isolating the confidence estimator, dataset statistics, or direct measurements of parameter sparsity (e.g., fraction of weights exceeding a change threshold). These omissions prevent assessment of whether the gains are attributable to the proposed dynamic or to other factors.

    Authors: We agree that these elements are necessary for rigorous evaluation. We have updated the Experiments section to include error bars computed over 5 random seeds, an ablation study removing the confidence weighting, summary statistics of the datasets used, and measurements of parameter sparsity by tracking the L2 norm of weight changes and the fraction of parameters exceeding a 0.01 threshold. revision: yes

  3. Referee: [Method / Real-world deployments] The manuscript states that low-confidence samples are suppressed to prevent disproportionate updates, yet provides no analysis of whether low-confidence target samples are precisely those requiring larger updates for successful task adaptation; this assumption is load-bearing for the progressive learning dynamic and real-world retention claims.

    Authors: This comment raises a valid concern about the core assumption. Our formulation is motivated by the idea that suppressing large gradients on uncertain samples prevents catastrophic overwriting of pre-trained capabilities, allowing progressive adaptation as confidence builds. However, we did not provide a direct analysis correlating sample confidence with required update size in the original submission. We have added a qualitative discussion and a supporting figure in the revision showing confidence evolution during training, but acknowledge that a quantitative study of update requirements would benefit from further experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained as a new optimization objective

full rationale

The provided abstract and description present ConSFT as an independent optimization objective that dynamically scales learning signals by model confidence to bound parameter disruption risk, explicitly requiring zero prior data or reference networks. No equations, self-referential derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text. The central mechanism is defined directly rather than reducing to its inputs by construction, and the paper does not invoke uniqueness theorems or ansatzes from prior self-work to force the result. This is the most common honest finding for papers whose claims rest on empirical evaluation rather than closed-form derivation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that confidence signals can be used to control update magnitude without side effects; no free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Model confidence on target samples can be used to bound intrinsic parameter disruption risk during supervised fine-tuning.
    Invoked to justify the dynamic scaling mechanism that replaces explicit regularization.

pith-pipeline@v0.9.0 · 5759 in / 1246 out tokens · 36700 ms · 2026-05-20T23:05:03.409995+00:00 · methodology


