
arxiv: 2605.18903 · v1 · pith:QS35YTOFnew · submitted 2026-05-17 · 💻 cs.LG · cs.CV

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

Pith reviewed 2026-05-20 13:24 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords continual learning · multimodal large language models · reinforcement learning with verifiable rewards · reasoning portability · dynamic regularization · catastrophic forgetting · vision-language models · RLVR

The pith

Reasoning Portability measures reusable prior reasoning on new tasks to dynamically balance preservation and exploration during RLVR-based continual learning for MLLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that reasoning traces offer more reliable signals than final answers for guiding adaptation on out-of-distribution multimodal tasks. It defines Reasoning Portability as a per-sample score of how much the old policy's reasoning behavior transfers to a new task. This score then controls the strength of Kullback-Leibler regularization inside RLVR training: high-portability samples receive a tight anchor to keep useful prior reasoning, while low-portability samples receive a relaxed anchor to allow new reasoning patterns. The resulting RDB-CL method raises Last accuracy by 12.0 percent over the vanilla RLVR baseline in continual learning sequences. The approach therefore supplies a concrete mechanism for reducing catastrophic forgetting without blocking progress on fresh tasks.

Core claim

We formalize portability as a sample-level measure of how reusable the previous policy's behavior is on a new task, empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not, and instantiate this as Reasoning Portability (RP) to modulate per-sample Kullback-Leibler regularization in RLVR: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways.

What carries the argument

Reasoning Portability (RP), a sample-level score of reusability of the prior policy's reasoning behavior on a new task, used to adjust the per-sample strength of KL regularization inside RLVR training.

If this is right

  • RDB-CL consistently outperforms standard RLVR and other continual-learning baselines on sequential multimodal tasks.
  • High-RP samples keep prior reasoning intact through stronger regularization.
  • Low-RP samples gain freedom to develop new reasoning pathways through weaker regularization.
  • Last accuracy rises by 12.0 percent compared with the vanilla RLVR baseline.
  • The method supplies a per-sample mechanism that simultaneously limits forgetting and supports adaptation.
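
The per-sample rule summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the hard threshold follows the Figure 4 caption (samples whose confidence falls below τ receive the relaxed anchor), and β0 = 0.15 is the value quoted in the Figure 7 caption. The values of `tau` and `beta_lrc`, the handling of ties at the threshold, and every class and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AnchorConfig:
    beta_0: float = 0.15    # baseline KL weight (value quoted in the Figure 7 caption)
    beta_lrc: float = 0.05  # relaxed KL weight; hypothetical value, must stay below beta_0
    tau: float = 0.5        # RP threshold; hypothetical value


def kl_weight(rp_score: float, cfg: AnchorConfig) -> float:
    """Pick the per-sample KL weight from the RP score.

    High-RP samples (score >= tau, an assumed tie rule) keep the tight anchor;
    low-RP samples get the relaxed one.
    """
    return cfg.beta_0 if rp_score >= cfg.tau else cfg.beta_lrc


def regularized_objective(advantage: float, kl_to_previous: float,
                          rp_score: float, cfg: AnchorConfig) -> float:
    """Per-sample objective: reward signal minus the RP-dependent KL penalty."""
    return advantage - kl_weight(rp_score, cfg) * kl_to_previous
```

With these placeholder settings, a sample with RP score 0.9 is penalized three times as strongly for drifting from the previous policy as a sample with RP score 0.1, which is the preservation-versus-exploration asymmetry the bullets describe.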

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reasoning-trace signal could be tested as a regularizer in non-RL continual-learning pipelines for large models.
  • If reasoning traces prove stable across domains, the approach may reduce the need for task-specific replay buffers.
  • Extending RP to measure portability between entirely different model architectures would test whether the signal is architecture-agnostic.
  • The framework suggests that monitoring intermediate reasoning rather than final outputs may improve stability in any sequential policy optimization setting.

Load-bearing premise

Reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not.

What would settle it

An experiment in which reasoning traces extracted on out-of-distribution samples show no better correlation with retained performance than answer accuracy, causing RDB-CL to produce no accuracy gain or to degrade relative to the vanilla RLVR baseline.
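
One way such a check could be operationalized is to measure how well each confidence signal separates correct from incorrect outputs on out-of-distribution samples, for example with AUROC. The sketch below is an assumption about how the comparison might be set up, not a procedure taken from the paper; the function name and the choice of AUROC as the separation measure are illustrative.

```python
def auroc(scores_correct, scores_incorrect):
    """Probability that a random correct sample scores above a random incorrect one.

    Ties count as half a win. A value of 0.5 means the signal carries no
    information about correctness; 1.0 means perfect separation.
    """
    if not scores_correct or not scores_incorrect:
        raise ValueError("need at least one score in each group")
    wins = 0.0
    for c in scores_correct:
        for w in scores_incorrect:
            if c > w:
                wins += 1.0
            elif c == w:
                wins += 0.5
    return wins / (len(scores_correct) * len(scores_incorrect))
```

Under this framing, the premise would be in trouble if the AUROC of reasoning-level confidence on out-of-distribution samples were no higher than the AUROC of answer-level confidence computed on the same samples.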

Figures

Figures reproduced from arXiv: 2605.18903 by Fei Zhu, Qiuhe Hong, Shuo Yang, Tiantian Peng, Yonghong Tian, Yuyang Liu.

Figure 1
Figure 1: Reasoning Portability guides per-sample adaptation in policy space. (a): A portability-agnostic constraint with static KL is uniform across samples, yielding either under-adaptation or forgetting. (b): RDB-CL introduces reasoning portability (PRP: positive, NRP: negative) to modulate adaptation, steering updates toward the plasticity-stability trade-off. as Group-Relative Policy Optimization (GRPO) [38], m…
Figure 2
Figure 2: Confidence distribution for OOD tasks, measured as P(True). Left: Distribution of answer confidence for correct vs. incorrect answers. Right: Distribution of reasoning confidence for correct vs. incorrect reasoning, with reasoning labels provided by GPT-4o. Question: Can you trace a square with this shape? Thinking: The image shows a cone, which is a three-dimensional shape with a circular base and a poin…
Figure 4
Figure 4: Workflow of RDB-CL. The previous policy $\pi_{\theta_{t-1}}$ generates reference responses for the current task $T_t$. Their RP proxy $C_{\theta_{t-1}}(r^{t-1} \mid x)$ is used to classify samples as PRP or NRP by $\tau$. PRP inherits the baseline KL anchor $\beta_0$ to preserve the previous policy's reasoning, while NRP receives a relaxed anchor $\beta_{\mathrm{LRC}} < \beta_0$ to permit exploration of new reasoning pathways. • For Negative RP (NRP) samples ($C < \tau$): The mo…
Figure 5
Figure 5: Results on a long-horizon benchmark. Bar chart of accuracy (%) comparing GRPO and RDB-CL on four metrics: Avg, Last, Finetune, and -BWT. Bar labels in extraction order: 40.6, 50.7, 54.9, 6.3, 42.8, 54.6, 58.1, 5.3.
Figure 7
Figure 7: Performance comparison of static KL with fixed $\beta$ (scaled by $k$ from $\beta_0 = 0.15$) and dynamic KL with $\beta_{\mathrm{LRC}}$. 5.3 Ablation for Core Claims Validation. Reasoning Confidence vs. Answer Confidence. To validate the advantage of reasoning-level signals, we compare answer-level and reasoning-level confidence under identical experimental settings. In both variants, the KL constraint is modulated by calculated answer co…
Figure 8
Figure 8: Comparison of pass@k results on the first task VizWiz. Unlike SFT, RL inherently preserves prior knowledge [39, 3], yet it is prone to plasticity loss from exploration collapse [70, 5], a problem further amplified in CL. We observed that under a weak or absent KL constraint, plasticity diminishes even to zero: gradient magnitudes become insufficient to move the policy away from its initial mode, and expl…
Figure 9
Figure 9: Heatmap of pairwise feature-distance changes among VizWiz samples across sequentially…
Figure 11
Figure 11: Training dynamics of KL divergence under…
Figure 12
Figure 12: Qualitative comparison under representative low reasoning confidence scenarios. Low-confidence cases typically arise from two sources: domain-gap yet sound reasoning (Top) and incoherent reasoning (Bottom). After training, RDB-CL generates more concise and reliable reasoning traces, while GRPO tends to produce incorrect or irrelevant text. A.5 Qualitative Comparisons under Broader Reasoning Confidence Scen…
Figure 13
Figure 13: Qualitative comparison under representative high reasoning confidence scenarios. High-confidence cases typically fall into two types: (Top) cases where the policy successfully produces task-solving reasoning for new tasks, and (Bottom) cases with seemingly sound reasoning that still fail due to missing task-specific, non-reasoning knowledge. A.6 Definitions and Preliminaries. Definition (Portability). Give…
Original abstract

Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
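
The abstract reports a Last accuracy figure, and the Figure 5 caption also lists a -BWT metric. The sketch below shows how such metrics are commonly computed in the continual learning literature. It assumes the conventional definitions (Last as the mean accuracy over all tasks after the final training stage, BWT as the mean change on earlier tasks between when each was learned and the end of the sequence); the paper's exact definitions are not given in this excerpt and may differ. The accuracy matrix layout and function names are hypothetical.

```python
def last_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task.

    acc[i][j] is the accuracy on task j measured after training on task i,
    so row i needs at least i + 1 entries.
    """
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def backward_transfer(acc):
    """Mean accuracy change on earlier tasks by the end of the sequence.

    Negative values indicate forgetting; a reported "-BWT" would be the negation.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        raise ValueError("backward transfer needs at least two tasks")
    deltas = [acc[-1][j] - acc[j][j] for j in range(num_tasks - 1)]
    return sum(deltas) / len(deltas)
```

For a toy three-task matrix `[[80.0], [70.0, 60.0], [65.0, 55.0, 90.0]]` (illustrative numbers, not results from the paper), `last_accuracy` gives 70.0 and `backward_transfer` gives -10.0.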

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes Reasoning Portability (RP) as a sample-level measure of how reusable the previous policy's reasoning behavior is on new tasks for MLLMs under RLVR-based continual learning. It empirically claims that reasoning-level signals remain reliable on OOD samples while answer-level signals do not, instantiates RP, and proposes RDB-CL to modulate per-sample KL regularization: tight anchors on high-RP samples to preserve reusable reasoning and relaxed anchors on low-RP samples to allow new reasoning exploration. Experiments report that RDB-CL outperforms baselines with a +12.0% gain in Last accuracy over vanilla RLVR.

Significance. If the reliability distinction and the resulting gains hold under independent verification, the work would provide a concrete mechanism for leveraging emerging reasoning capabilities to improve stability-plasticity trade-offs in RLVR continual learning for multimodal models. The sample-level dynamic balancing approach could influence future designs that operate at the reasoning trace level rather than output-level regularization.

major comments (2)
  1. [§3] §3 (Reasoning Portability definition and computation): The procedure for computing RP on OOD samples must be specified with explicit equations or pseudocode showing the exact prompting, scoring, or comparison of reasoning traces. If RP relies on the current MLLM to judge previous-policy reasoning traces on the same OOD data, the claimed reliability advantage of reasoning-level over answer-level signals becomes an internal consistency check rather than an externally anchored demonstration, directly undermining the justification for the subsequent per-sample KL modulation in RDB-CL.
  2. [§5] §5 (Experiments and results): The reported +12.0% Last accuracy improvement over vanilla RLVR requires supporting details on statistical significance (e.g., p-values or confidence intervals across multiple seeds), variance, exact dataset splits, number of tasks, and full baseline implementations. Without these, it is impossible to assess whether the gain is robust or driven by the RP-based modulation versus other implementation choices.
minor comments (2)
  1. Notation for RP and the dynamic balance factor should be introduced with a clear table or diagram showing how the per-sample KL weight is derived from the RP score.
  2. The abstract and introduction would benefit from a concise statement of the precise definition of 'reasoning-level signal' versus 'answer-level signal' to avoid ambiguity in the reliability claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [§3] §3 (Reasoning Portability definition and computation): The procedure for computing RP on OOD samples must be specified with explicit equations or pseudocode showing the exact prompting, scoring, or comparison of reasoning traces. If RP relies on the current MLLM to judge previous-policy reasoning traces on the same OOD data, the claimed reliability advantage of reasoning-level over answer-level signals becomes an internal consistency check rather than an externally anchored demonstration, directly undermining the justification for the subsequent per-sample KL modulation in RDB-CL.

    Authors: We agree that the RP computation procedure requires explicit specification to avoid ambiguity. In the revised manuscript, we will insert detailed equations and pseudocode in §3 describing the exact process: (1) generation of reasoning traces by the previous policy on OOD samples via fixed prompting, (2) scoring via a separate, pre-trained reasoning similarity metric (e.g., embedding-based trace alignment independent of the current policy), and (3) comparison against answer-level baselines. This metric is externally anchored and does not rely on the current MLLM judging its own outputs, thereby preserving the claimed reliability distinction as an externally validated observation rather than an internal check. We will revise accordingly. revision: yes

  2. Referee: [§5] §5 (Experiments and results): The reported +12.0% Last accuracy improvement over vanilla RLVR requires supporting details on statistical significance (e.g., p-values or confidence intervals across multiple seeds), variance, exact dataset splits, number of tasks, and full baseline implementations. Without these, it is impossible to assess whether the gain is robust or driven by the RP-based modulation versus other implementation choices.

    Authors: We acknowledge that additional experimental details are needed to demonstrate robustness. In the revision, we will expand §5 and the appendix to report: exact dataset splits and task sequence (5 tasks with standard continual learning splits), full baseline implementations, and results across 3 random seeds including mean, standard deviation, and p-values from paired t-tests on the Last accuracy metric. These additions will clarify that the +12.0% gain is driven by the RP modulation while maintaining reproducibility. revision: yes
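
The seed-level reporting promised in this response could be computed along the following lines. This is a hedged sketch using only the Python standard library: it returns the paired t statistic and degrees of freedom, and leaves the p-value to a statistics library (for example `scipy.stats.ttest_rel`, which performs the same paired test). The function name and the input layout, one score per seed for each method, are assumptions for illustration.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for per-seed score pairs.

    Both lists must be ordered by seed so that entry k of each list
    comes from the same random seed.
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    if len(method_scores) < 2:
        raise ValueError("need at least two seeds")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    spread = stdev(diffs)  # sample standard deviation of the paired differences
    if spread == 0:
        raise ValueError("paired differences have zero variance")
    t_stat = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1
```

With only 3 seeds the test has 2 degrees of freedom, so even a large mean gap needs very consistent per-seed differences to reach conventional significance thresholds.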

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

Full rationale

The paper defines Reasoning Portability (RP) as a sample-level measure of reusability of the previous policy's behavior on new tasks and uses it to modulate per-sample KL regularization in RDB-CL. It further claims an empirical demonstration that reasoning-level signals are more reliable than answer-level signals on OOD samples. No equations or computation procedures are shown that reduce RP to a fitted parameter, a self-referential judgment by the same model, or a self-citation chain. The central claims rest on an independent empirical comparison rather than reducing to the inputs by construction, satisfying the criteria for a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; ledger therefore limited to statements explicitly present in the abstract. The central claim rests on the unverified empirical observation that reasoning signals outperform answer signals on OOD data and on the effectiveness of RP-modulated regularization.

axioms (1)
  • domain assumption reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not
    Directly stated in abstract as the basis for preferring reasoning-level constraints.
invented entities (1)
  • Reasoning Portability (RP) no independent evidence
    purpose: sample-level measure of how reusable the previous policy's behavior is on a new task
    Newly formalized quantity used to modulate KL regularization.

pith-pipeline@v0.9.0 · 5737 in / 1340 out tokens · 81286 ms · 2026-05-20T13:24:56.215098+00:00 · methodology



Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 21 internal anchors

  1. [1]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  2. [2]

    Coin: A benchmark of continual instruction tuning for multimodel large language models

    Cheng Chen, Junchen Zhu, Xu Luo, Heng Tao Shen, Jingkuan Song, and Lianli Gao. Coin: A benchmark of continual instruction tuning for multimodel large language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 57817–57840. Curran Associat...

  3. [3]

    Retaining by doing: The role of on-policy data in mitigating forgetting, 2025

    Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting, 2025. URL https://arxiv.org/abs/2510. 18874

  4. [4]

    Continual multimodal knowledge graph construction.arXiv preprint arXiv:2305.08698, 2023

    Xiang Chen, Jintian Zhang, Xiaohan Wang, Ningyu Zhang, Tongtong Wu, Yuxiang Wang, Yongheng Wang, and Huajun Chen. Continual multimodal knowledge graph construction.arXiv preprint arXiv:2305.08698, 2023

  5. [5]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. URLhttps://arxiv.org/abs/2505.22617

  6. [6]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. URL https://arxiv.org/abs/2305. 06500

  7. [7]

    Weight ensembling improves reasoning in language models, 2025

    Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, and Aditi Raghunathan. Weight ensembling improves reasoning in language models, 2025. URL https://arxiv.org/abs/ 2504.10478

  8. [8]

    A continual learning survey: Defying forgetting in classification tasks.IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1–1, 2021

    Matthias Delange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Greg Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks.IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1–1, 2021. ISSN 1939-3539. doi: 10.1109/tpami.2021.3057446. URL http://dx.doi.org/10.1109/TPAM...

  9. [9]

    ImageNet:

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848

  10. [10]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  11. [11]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, 10 Dongjie J...

  12. [12]

    Hide-llava: Hierarchical decoupling for continual instruction tuning of multimodal large language model.arXiv preprint arXiv:2503.12941, 2025

    Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, and Cheng- Lin Liu. Hide-llava: Hierarchical decoupling for continual instruction tuning of multimodal large language model.arXiv preprint arXiv:2503.12941, 2025

  13. [13]

    Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P

    Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people,

  14. [14]

    URLhttps://arxiv.org/abs/1802.08218

  15. [15]

    Large language models can self-improve, 2022

    Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve, 2022. URL https://arxiv.org/abs/2210. 11610

  16. [16]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  17. [17]

    Self-training large language models with confident reasoning, 2025

    Hyosoon Jang, Yunhui Jang, Sungjae Lee, Jungseul Ok, and Sungsoo Ahn. Self-training large language models with confident reasoning, 2025. URLhttps://arxiv.org/abs/2505. 17454

  18. [18]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

  19. [19]

    Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catas- trophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):352...

  20. [20]

    Uncertainty-aware evaluation for vision-language models, 2024

    Vasily Kostumov, Bulat Nutfullin, Oleg Pilipenko, and Eugene Ilyushin. Uncertainty-aware evaluation for vision-language models, 2024. URLhttps://arxiv.org/abs/2402.14418

  21. [21]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  22. [22]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V . Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirz...

  23. [23]

    Multi-domain lifelong visual question answering via self-critical distillation

    Mingrui Lao, Nan Pu, Yu Liu, Zhun Zhong, Erwin M Bakker, Nicu Sebe, and Michael S Lew. Multi-domain lifelong visual question answering via self-critical distillation. InProceedings of the 31st ACM International Conference on Multimedia, pages 4747–4758, 2023

  24. [24]

    Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli´c, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought, 2025. URL https://arxiv.org/abs/2501.07542

  25. [25]

    Learning without Forgetting

    Zhizhong Li and Derek Hoiem. Learning without forgetting, 2017. URL https://arxiv. org/abs/1606.09282

  26. [26]

    Internal consistency and self-feedback in large language models: A survey, 2024

    Xun Liang, Shichao Song, Zifan Zheng, Hanyu Wang, Qingchen Yu, Xunkai Li, Rong-Hua Li, Yi Wang, Zhonghao Wang, Feiyu Xiong, and Zhiyu Li. Internal consistency and self-feedback in large language models: A survey, 2024. URLhttps://arxiv.org/abs/2407.14507

  27. [27]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  28. [28]

    Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

    Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, and Yonghong Tian. Continual learning for vlms: A survey and taxonomy beyond forgetting.arXiv preprint arXiv:2508.04227, 2025

  29. [29]

    Gradient episodic memory for continual learning

    David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017

  30. [30]

    Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. InThe 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, 2021

  31. [31]

    Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in neural information processing systems, 35: 2507–2521, 2022

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in neural information processing systems, 35: 2507–2521, 2022

  32. [32]

    Adapt- ∞: Scalable continual multimodal instruction tuning via dynamic data selection, 2025

    Adyasha Maharana, Jaehong Yoon, Tianlong Chen, and Mohit Bansal. Adapt- ∞: Scalable continual multimodal instruction tuning via dynamic data selection, 2025. URL https:// arxiv.org/abs/2410.10636

  33. [33]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning, 2025. URL https://arxiv.org/abs/ 2503.07365

  34. [34]

    Ominicontrol: Minimal and uni- versal control for diffusion transformer.arXiv preprint arXiv:2401.15098, 2024

    Chaofan Pan, Lingfei Ren, Yihui Feng, Linbo Xiong, Wei Wei, Yonghao Li, and Xin Yang. Multi-granularity knowledge transfer for continual reinforcement learning, 2025. URL https: //arxiv.org/abs/2401.15098. 12

  35. [35]

    Parisi, Ronald Kemker, Jose L

    German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review.Neural Networks, 113:54–71,

  36. [36]

    doi: https://doi.org/10.1016/j.neunet.2019.01.012

    ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2019.01.012. URL https: //www.sciencedirect.com/science/article/pii/S0893608019300231

  37. [37]

    Self-consistency preference optimization, 2025

    Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mo- hit Bansal, Sainbayar Sukhbaatar, Jason Weston, and Jane Yu. Self-consistency preference optimization, 2025. URLhttps://arxiv.org/abs/2411.04109

  38. [38]

    Manning, and Chelsea Finn

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model,

  39. [39]

    URLhttps://arxiv.org/abs/2305.18290

  40. [40]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

  41. [41]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/ 2402.03300

  42. [42]

    RL's Razor: Why Online Reinforcement Learning Forgets Less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s razor: Why online reinforcement learning forgets less, 2025. URLhttps://arxiv.org/abs/2509.04259

  43. [43]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

  44. [44]

    Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

    Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

  45. [45]

    Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models, 2024

    Longxiang Tang, Zhuotao Tian, Kai Li, Chunming He, Hantao Zhou, Hengshuang Zhao, Xiu Li, and Jiaya Jia. Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models, 2024. URL https://arxiv.org/abs/2407. 05342

  46. [46]

    Confidence improves self-consistency in llms

    Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. InFindings of the Association for Computational Linguistics: ACL 2025, page 20090–20111. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-acl.1030. URL http://dx.doi.org/10. 18653/v...

  47. [47]

    Qwen2.5-vl

    Qwen Team. Qwen2.5-vl, January 2025. URL https://qwenlm.github.io/blog/qwen2.5-vl/

  48. [48]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025

  49. [49]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023. URL https://arxiv.org/abs/2203.11171

  50. [50]

    Uncertainty aware learning for language model alignment

    Yikun Wang, Rui Zheng, Liang Ding, Qi Zhang, Dahua Lin, and Dacheng Tao. Uncertainty aware learning for language model alignment, 2024. URL https://arxiv.org/abs/2406.04854

  51. [51]

    Rl-vlm-f: Reinforcement learning from vision language foundation model feedback

    Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, and Zackory Erickson. Rl-vlm-f: Reinforcement learning from vision language foundation model feedback. URL https://arxiv.org/abs/2402.03681

  53. [53]

    Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

    Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, and Guorui Zhou. Unlocking exploration in rlvr: Uncertainty-aware advantage shaping for deeper reasoning, 2025. URL https://arxiv.org/abs/2510.10649

  54. [54]

    Advancing cross-domain discriminability in continual learning of vision-language models

    Yicheng Xu, Yuxin Chen, Jiahao Nie, Yusong Wang, Huiping Zhuang, and Manabu Okumura. Advancing cross-domain discriminability in continual learning of vision-language models, June 2024

  55. [55]

    Generative negative text replay for continual vision-language pretraining

    Shipeng Yan, Lanqing Hong, Hang Xu, Jianhua Han, Tinne Tuytelaars, Zhenguo Li, and Xuming He. Generative negative text replay for continual vision-language pretraining. In European Conference on Computer Vision, pages 22–38. Springer, 2022

  56. [56]

    Look-back: Implicit visual re-focusing in mllm reasoning

    Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, and Li Yuan. Look-back: Implicit visual re-focusing in mllm reasoning. arXiv preprint arXiv:2507.03019, 2025

  57. [57]

    A survey on multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024

  58. [58]

    Perception-r1: Pioneering perception policy with reinforcement learning

    En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Jingyu Wang, and Wenbing Tao. Perception-r1: Pioneering perception policy with reinforcement learning, 2025. URL https://arxiv.org/abs/2504.07954

  59. [59]

    Boosting continual learning of vision-language models via mixture-of-experts adapters

    Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting continual learning of vision-language models via mixture-of-experts adapters, 2024. URL https://arxiv.org/abs/2403.11549

  60. [60]

    Exploiting the semantic knowledge of pre-trained text-encoders for continual learning

    Lu Yu, Zhe Tao, Dipam Goswami, Hantao Yao, Bartłomiej Twardowski, Joost Van de Weijer, and Changsheng Xu. Exploiting the semantic knowledge of pre-trained text-encoders for continual learning. arXiv preprint arXiv:2408.01076, 2024

  61. [61]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  62. [62]

    Select and distill: Selective dual-teacher knowledge transfer for continual learning on vision-language models

    Yu-Chu Yu, Chi-Pin Huang, Jr-Jen Chen, Kai-Po Chang, Yung-Hsuan Lai, Fu-En Yang, and Yu-Chiang Frank Wang. Select and distill: Selective dual-teacher knowledge transfer for continual learning on vision-language models. In European Conference on Computer Vision, pages 219–236. Springer, 2024

  63. [63]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837

  64. [64]

    Fine-tuning large vision-language models as decision-making agents via reinforcement learning

    Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024. URL https://arxiv.org/abs/2405.10292

  65. [65]

    CPPO: Continual learning for reinforcement learning with human feedback

    Han Zhang, Yu Lei, Lin Gui, Min Yang, Yulan He, Hui Wang, and Ruifeng Xu. CPPO: Continual learning for reinforcement learning with human feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=86zAUE80pP

  66. [66]

    R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

    Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization, 2025. URL https://arxiv.org/abs/2503.12937

  67. [67]

    Vqacl: A novel visual question answering continual learning setting

    Xi Zhang, Feifei Zhang, and Changsheng Xu. Vqacl: A novel visual question answering continual learning setting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19102–19112, 2023

  68. [68]

    Why reinforcement fine-tuning enables mllms preserve prior knowledge better: A data perspective

    Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang, and Kai Chen. Why reinforcement fine-tuning enables mllms preserve prior knowledge better: A data perspective. URL https://arxiv.org/abs/2506.23508

  70. [70]

    Mllm-cl: Continual learning for multimodal large language models, 2025

    Hongbo Zhao, Fei Zhu, Haiyang Guo, Meng Wang, Rundong Wang, Gaofeng Meng, and Zhaoxiang Zhang. Mllm-cl: Continual learning for multimodal large language models, 2025. URL https://arxiv.org/abs/2506.05453

  71. [71]

    Danna Zheng, Danyang Liu, Mirella Lapata, and Jeff Z. Pan. Trustscore: Reference-free evaluation of llm response trustworthiness, 2024. URL https://arxiv.org/abs/2402.12545

  72. [72]

    Preventing zero-shot transfer degradation in continual learning of vision-language models

    Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 19125–19136, 2023

  73. [73]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning, 2025. URL https://arxiv.org/abs/2505.14362

  74. [74]

    Learning without forgetting for vision-language models

    Da-Wei Zhou, Yuanhan Zhang, Yan Wang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Learning without forgetting for vision-language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(6):4489–4504, 2025. doi: 10.1109/TPAMI.2025.3540889

  75. [75]

    The surprising effectiveness of negative reinforcement in llm reasoning

    Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in llm reasoning, 2025. URL https://arxiv.org/abs/2506.01347

A Appendix

A.1 Symbols

We use the following notation throughout the paper:

• T: a sequence task.
• t: total number of tasks in the continual learning sequence.
• T...