Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

Fei Zhu; Qiuhe Hong; Shuo Yang; Tiantian Peng; Yonghong Tian; Yuyang Liu

arxiv: 2605.18903 · v1 · pith:QS35YTOFnew · submitted 2026-05-17 · 💻 cs.LG · cs.CV

Pith reviewed 2026-05-20 13:24 UTC · model grok-4.3

keywords continual learning · multimodal large language models · reinforcement learning with verifiable rewards · reasoning portability · dynamic regularization · catastrophic forgetting · vision-language models · RLVR

The pith

Reasoning Portability measures reusable prior reasoning on new tasks to dynamically balance preservation and exploration during RLVR-based continual learning for MLLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reasoning traces offer more reliable signals than final answers for guiding adaptation on out-of-distribution multimodal tasks. It defines Reasoning Portability as a per-sample score of how much the old policy's reasoning behavior transfers to a new task. This score then controls the strength of Kullback-Leibler regularization inside RLVR training: high-portability samples receive a tight anchor to keep useful prior reasoning, while low-portability samples receive a relaxed anchor to allow new reasoning patterns. The resulting RDB-CL method raises final-task accuracy by 12 percent over standard RLVR in continual learning sequences. The approach therefore supplies a concrete mechanism for reducing catastrophic forgetting without blocking progress on fresh tasks.

Core claim

We formalize portability as a sample-level measure of how reusable the previous policy's behavior is on a new task, empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not, and instantiate this as Reasoning Portability (RP) to modulate per-sample Kullback-Leibler regularization in RLVR: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways.

What carries the argument

Reasoning Portability (RP), a sample-level score of reusability of the prior policy's reasoning behavior on a new task, used to adjust the per-sample strength of KL regularization inside RLVR training.
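
The per-sample anchor can be illustrated with a minimal sketch based on the formula quoted in the Lean-theorems section of this page, β_LRC = max(clip_min, 1{C≥τ} + C·1{C<τ})·β_0. The function name `beta_lrc` and the values chosen for `TAU` and `CLIP_MIN` are hypothetical; only β_0 = 0.15 appears on this page (Figure 7 caption). This is not the authors' implementation.

```python
def beta_lrc(confidence: float, tau: float, beta0: float, clip_min: float) -> float:
    """Per-sample KL coefficient, transcribed from the quoted formula:
    beta_LRC = max(clip_min, 1{C >= tau} + C * 1{C < tau}) * beta_0
    """
    # Samples at or above the threshold keep the full anchor (scale 1);
    # samples below it have the anchor scaled by their own confidence C.
    scale = 1.0 if confidence >= tau else confidence
    # The floor clip_min keeps the anchor from vanishing for very low C.
    return max(clip_min, scale) * beta0


# Hypothetical settings: TAU and CLIP_MIN are placeholders for illustration.
TAU, BETA0, CLIP_MIN = 0.5, 0.15, 0.1

for c in (0.9, 0.5, 0.3, 0.02):
    print(f"C={c:.2f} -> beta={beta_lrc(c, TAU, BETA0, CLIP_MIN):.4f}")
```

Under these placeholder settings, a sample with C = 0.9 keeps the baseline anchor, a sample with C = 0.3 gets 0.3·β_0, and a sample with C = 0.02 is held at the floor clip_min·β_0.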

If this is right

RDB-CL consistently outperforms standard RLVR and other continual-learning baselines on sequential multimodal tasks.
High-RP samples keep prior reasoning intact through stronger regularization.
Low-RP samples gain freedom to develop new reasoning pathways through weaker regularization.
Final-task accuracy rises by 12.0 percent compared with the vanilla RLVR baseline.
The method supplies a per-sample mechanism that simultaneously limits forgetting and supports adaptation.
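
The accuracy figures above refer to continual-learning metrics (the Figure 5 caption names Avg, Last, Finetune, and −BWT). A minimal sketch of how such metrics are commonly computed from a lower-triangular accuracy matrix follows. The definitions are assumptions: BWT follows the usual backward-transfer definition from the continual-learning literature, while the readings of "Avg" and "Finetune" used here are one common convention and may differ from the paper's. The matrix values are made up for illustration.

```python
from statistics import mean


def cl_metrics(acc):
    """acc[i][j] = accuracy on task j after training through task i (j <= i)."""
    num_tasks = len(acc)
    # Last: mean accuracy over all tasks after the final training stage.
    last = mean(acc[-1])
    # Finetune (assumed): accuracy on each task right after it was learned.
    finetune = mean(acc[j][j] for j in range(num_tasks))
    # Avg (assumed): mean over stages of the accuracy on tasks seen so far.
    avg = mean(mean(row) for row in acc)
    # BWT: change on earlier tasks between learning them and the end.
    bwt = (
        mean(acc[-1][j] - acc[j][j] for j in range(num_tasks - 1))
        if num_tasks > 1
        else 0.0
    )
    return {"Avg": avg, "Last": last, "Finetune": finetune, "BWT": bwt}


# Hypothetical three-task run; numbers are illustrative only.
toy = [
    [60.0],
    [55.0, 70.0],
    [50.0, 66.0, 80.0],
]
print(cl_metrics(toy))
```

A negative BWT indicates forgetting, which is why plots often report −BWT so that larger bars mean more forgetting.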

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same reasoning-trace signal could be tested as a regularizer in non-RL continual-learning pipelines for large models.
If reasoning traces prove stable across domains, the approach may reduce the need for task-specific replay buffers.
Extending RP to measure portability between entirely different model architectures would test whether the signal is architecture-agnostic.
The framework suggests that monitoring intermediate reasoning rather than final outputs may improve stability in any sequential policy optimization setting.

Load-bearing premise

Reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not.
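
The Figure 2 caption states that confidence is measured as P(True). A minimal sketch of one way to obtain such a score from a model's logits for a "True" and a "False" option is shown below. Restricting the normalisation to exactly two option logits is an assumption made for illustration; the paper's prompt and normalisation are not shown on this page, and `p_true` is a hypothetical helper name.

```python
import math


def p_true(logit_true: float, logit_false: float) -> float:
    """Normalise two option logits into P(True) with a stable softmax."""
    # Subtracting the maximum avoids overflow in exp for large logits.
    m = max(logit_true, logit_false)
    e_true = math.exp(logit_true - m)
    e_false = math.exp(logit_false - m)
    return e_true / (e_true + e_false)


# Illustrative logits only; equal logits give a confidence of 0.5.
print(p_true(0.0, 0.0))
print(p_true(2.0, 0.0))
```

The same helper can be applied to a self-evaluation prompt about the final answer or about the reasoning trace, which is the distinction the premise above depends on.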

What would settle it

An experiment in which reasoning traces extracted on out-of-distribution samples show no better correlation with retained performance than answer accuracy, causing RDB-CL to produce no accuracy gain or to degrade relative to the vanilla RLVR baseline.
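
One way to run such a check is to compare how well each confidence signal separates correct from incorrect cases, for example with AUROC. The sketch below uses made-up scores and labels purely for illustration; the choice of AUROC as the comparison statistic is an assumption, not something the paper is stated to use on this page.

```python
def auroc(scores, labels):
    """Chance that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical OOD samples: 1 = judged correct, 0 = judged incorrect.
labels = [1, 1, 1, 0, 0, 0]
reasoning_conf = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # separates the classes
answer_conf = [0.9, 0.4, 0.6, 0.8, 0.5, 0.7]     # barely informative

print("reasoning AUROC:", auroc(reasoning_conf, labels))
print("answer AUROC:", auroc(answer_conf, labels))
```

If the reasoning-level AUROC were no higher than the answer-level one on real out-of-distribution data, the premise above would be in doubt. In practice the two signals would each be scored against their own correctness labels.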

Figures

Figures reproduced from arXiv: 2605.18903 by Fei Zhu, Qiuhe Hong, Shuo Yang, Tiantian Peng, Yonghong Tian, Yuyang Liu.

**Figure 1.** Reasoning Portability guides per-sample adaptation in policy space. (a) A portability-agnostic constraint with static KL is uniform across samples, yielding either under-adaptation or forgetting. (b) RDB-CL introduces reasoning portability (PRP: positive, NRP: negative) to modulate adaptation, steering updates toward the plasticity-stability trade-off. … as Group-Relative Policy Optimization (GRPO) [38], m…

**Figure 2.** Confidence distribution for OOD tasks, measured as P(True). Left: distribution of answer confidence for correct vs. incorrect answers. Right: distribution of reasoning confidence for correct vs. incorrect reasoning, with reasoning labels provided by GPT-4o. Question: Can you trace a square with this shape? Thinking: The image shows a cone, which is a three-dimensional shape with a circular base and a poin…

**Figure 4.** Workflow of RDB-CL. The previous policy π_{θ_{t−1}} generates reference responses for the current task T_t. Their RP proxy C_{θ_{t−1}}(r^{t−1} | x) is used to classify samples as PRP or NRP by the threshold τ. PRP inherits the baseline KL anchor β_0 to preserve the previous policy's reasoning, while NRP receives a relaxed anchor β_LRC < β_0 to permit exploration of new reasoning pathways. • For Negative RP (NRP) samples (C < τ): The mo…

**Figure 5.** Results on a long-horizon benchmark. Bar chart of Accuracy (%) for GRPO and RDB-CL across the metrics Avg, Last, Finetune, and −BWT. Bar values in extraction order: 40.6, 50.7, 54.9, 6.3, 42.8, 54.6, 58.1, 5.3.

**Figure 7.** Performance comparison of static KL with fixed β (scaled by k from β_0 = 0.15) and dynamic KL with β_LRC. 5.3 Ablation for Core Claims Validation. Reasoning Confidence vs. Answer Confidence. To validate the advantage of reasoning-level signals, we compare answer-level and reasoning-level confidence under identical experimental settings. In both variants, the KL constraint is modulated by calculated answer co…

**Figure 8.** Comparison of pass@k results on the first task VizWiz. Unlike SFT, RL inherently preserves prior knowledge [39, 3], yet it is prone to plasticity loss from exploration collapse [70, 5], a problem further amplified in CL. We observed that under a weak or absent KL constraint, plasticity diminishes even to zero: gradient magnitudes become insufficient to move the policy away from its initial mode, and expl…

**Figure 9.** Heatmap of pairwise feature-distance changes among VizWiz samples across sequentially…

**Figure 11.** Training dynamics of KL divergence under…

**Figure 12.** Qualitative comparison under representative low reasoning confidence scenarios. Low-confidence cases typically arise from two sources: domain-gap yet sound reasoning (top) and incoherent reasoning (bottom). After training, RDB-CL generates more concise and reliable reasoning traces, while GRPO tends to produce incorrect or irrelevant text. A.5 Qualitative Comparisons under Broader Reasoning Confidence Scen…

**Figure 13.** Qualitative comparison under representative high reasoning confidence scenarios. High-confidence cases typically fall into two types: (top) cases where the policy successfully produces task-solving reasoning for new tasks, and (bottom) cases with seemingly sound reasoning that still fail due to missing task-specific, non-reasoning knowledge. A.6 Definitions and Preliminaries. Definition (Portability). Give…
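
The Figure 8 caption reports pass@k. For readers unfamiliar with the metric, the sketch below implements the widely used unbiased estimator 1 − C(n−c, k)/C(n, k), where n responses are sampled per question and c of them are correct. Whether the paper uses this exact estimator is not stated on this page, so treat it as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, any k-subset contains a correct one.
    if n - c < k:
        return 1.0
    # Otherwise subtract the chance that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative values only.
print(pass_at_k(n=4, c=1, k=2))
print(pass_at_k(n=10, c=0, k=5))
```

A pass@k curve that stays flat as k grows is one symptom of the exploration collapse the caption describes, since additional samples then add no new correct solutions.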

Original abstract

Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
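
The modulation described in the abstract corresponds to two expressions quoted in the Lean-theorems section of this page. The block below is only a LaTeX transcription of those quoted lines, not a derivation. The summation range and the precise definitions of $r_\theta$ and $A_{i,t}$ are not shown on this page, so they are left exactly as quoted.

```latex
\mathcal{J}_{\mathrm{RDBCL}}(\theta)
  = -\frac{1}{G} \sum \Big[ r_\theta(q, o_i, t)\, A_{i,t}
      - \beta_{\mathrm{LRC}}\,
        D_{\mathrm{KL}}\big(\pi_{\theta_t} \,\|\, \pi_{\theta_{t-1}}\big) \Big],
\qquad
\beta_{\mathrm{LRC}}
  = \max\big(\mathrm{clip}_{\min},\;
      \mathbb{1}\{C \ge \tau\} + C \cdot \mathbb{1}\{C < \tau\}\big) \cdot \beta_0
```

Read this way, a sample with $C \ge \tau$ keeps the full anchor $\beta_0$, while a sample with $C < \tau$ has its anchor scaled by $C$ and floored at $\mathrm{clip}_{\min} \cdot \beta_0$.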

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper formalizes Reasoning Portability to dynamically tune per-sample KL regularization in RLVR continual learning for MLLMs and reports a +12% last-accuracy lift, but the OOD reliability claim for reasoning signals needs explicit checks against circular measurement.

The core takeaway is that they define Reasoning Portability as a per-sample score of how reusable the prior policy's reasoning remains on a new task, then use it to tighten or loosen the KL anchor inside an RLVR loop. High-RP samples keep the old reasoning path; low-RP samples get more room to explore. That produces the reported 12-point gain over plain RLVR on their continual-learning benchmarks.

Referee Report

2 major / 2 minor

Summary. The paper formalizes Reasoning Portability (RP) as a sample-level measure of how reusable the previous policy's reasoning behavior is on new tasks for MLLMs under RLVR-based continual learning. It empirically claims that reasoning-level signals remain reliable on OOD samples while answer-level signals do not, instantiates RP, and proposes RDB-CL to modulate per-sample KL regularization: tight anchors on high-RP samples to preserve reusable reasoning and relaxed anchors on low-RP samples to allow new reasoning exploration. Experiments report that RDB-CL outperforms baselines with a +12.0% gain in Last accuracy over vanilla RLVR.

Significance. If the reliability distinction and the resulting gains hold under independent verification, the work would provide a concrete mechanism for leveraging emerging reasoning capabilities to improve stability-plasticity trade-offs in RLVR continual learning for multimodal models. The sample-level dynamic balancing approach could influence future designs that operate at the reasoning trace level rather than output-level regularization.

major comments (2)

§3 (Reasoning Portability definition and computation): The procedure for computing RP on OOD samples must be specified with explicit equations or pseudocode showing the exact prompting, scoring, or comparison of reasoning traces. If RP relies on the current MLLM to judge previous-policy reasoning traces on the same OOD data, the claimed reliability advantage of reasoning-level over answer-level signals becomes an internal consistency check rather than an externally anchored demonstration, directly undermining the justification for the subsequent per-sample KL modulation in RDB-CL.
§5 (Experiments and results): The reported +12.0% Last accuracy improvement over vanilla RLVR requires supporting details on statistical significance (e.g., p-values or confidence intervals across multiple seeds), variance, exact dataset splits, number of tasks, and full baseline implementations. Without these, it is impossible to assess whether the gain is robust or driven by the RP-based modulation versus other implementation choices.

minor comments (2)

Notation for RP and the dynamic balance factor should be introduced with a clear table or diagram showing how the per-sample KL weight is derived from the RP score.
The abstract and introduction would benefit from a concise statement of the precise definition of 'reasoning-level signal' versus 'answer-level signal' to avoid ambiguity in the reliability claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.

Referee: §3 (Reasoning Portability definition and computation): The procedure for computing RP on OOD samples must be specified with explicit equations or pseudocode showing the exact prompting, scoring, or comparison of reasoning traces. If RP relies on the current MLLM to judge previous-policy reasoning traces on the same OOD data, the claimed reliability advantage of reasoning-level over answer-level signals becomes an internal consistency check rather than an externally anchored demonstration, directly undermining the justification for the subsequent per-sample KL modulation in RDB-CL.

Authors: We agree that the RP computation procedure requires explicit specification to avoid ambiguity. In the revised manuscript, we will insert detailed equations and pseudocode in §3 describing the exact process: (1) generation of reasoning traces by the previous policy on OOD samples via fixed prompting, (2) scoring via a separate, pre-trained reasoning similarity metric (e.g., embedding-based trace alignment independent of the current policy), and (3) comparison against answer-level baselines. This metric is externally anchored and does not rely on the current MLLM judging its own outputs, thereby preserving the claimed reliability distinction as an externally validated observation rather than an internal check. We will revise accordingly.

Revision: yes

Referee: §5 (Experiments and results): The reported +12.0% Last accuracy improvement over vanilla RLVR requires supporting details on statistical significance (e.g., p-values or confidence intervals across multiple seeds), variance, exact dataset splits, number of tasks, and full baseline implementations. Without these, it is impossible to assess whether the gain is robust or driven by the RP-based modulation versus other implementation choices.

Authors: We acknowledge that additional experimental details are needed to demonstrate robustness. In the revision, we will expand §5 and the appendix to report: exact dataset splits and task sequence (5 tasks with standard continual learning splits), full baseline implementations, and results across 3 random seeds including mean, standard deviation, and p-values from paired t-tests on the Last accuracy metric. These additions will clarify that the +12.0% gain is driven by the RP modulation while maintaining reproducibility.

Revision: yes
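
The seed-level reporting promised in the rebuttal (mean, standard deviation, paired t-test) can be sketched as follows. The accuracy values are invented for illustration and are not results from the paper; `paired_t` is a hypothetical helper that returns the t statistic and degrees of freedom, leaving the p-value lookup to a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(method_a, method_b):
    """Paired t statistic for per-seed scores of two methods."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need two equally long lists with at least 2 seeds")
    # Pair the runs by seed and test whether the mean difference is zero.
    diffs = [a - b for a, b in zip(method_a, method_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = mean(diffs) / (sd / sqrt(len(diffs)))
    return mean(diffs), sd, t_stat, len(diffs) - 1


# Hypothetical Last accuracies over 3 seeds (illustrative numbers only).
rdb_cl = [54.0, 55.0, 56.5]
grpo = [50.0, 52.0, 51.5]

mean_diff, sd_diff, t_stat, dof = paired_t(rdb_cl, grpo)
print(f"mean diff={mean_diff:.2f}, sd={sd_diff:.2f}, t={t_stat:.3f}, df={dof}")
```

With only 3 seeds the test has 2 degrees of freedom, so the two-sided 5% critical value is about 4.30; small seed counts therefore need large, consistent gaps to reach significance.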

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The paper defines Reasoning Portability (RP) as a sample-level measure of reusability of the previous policy's behavior on new tasks and uses it to modulate per-sample KL regularization in RDB-CL. It further claims an empirical demonstration that reasoning-level signals are more reliable than answer-level signals on OOD samples. No equations or computation procedures are shown that reduce RP to a fitted parameter, a self-referential judgment by the same model, or a self-citation chain. The central claims rest on an independent empirical comparison rather than reducing to the inputs by construction, satisfying the criteria for a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review performed on abstract only; ledger therefore limited to statements explicitly present in the abstract. The central claim rests on the unverified empirical observation that reasoning signals outperform answer signals on OOD data and on the effectiveness of RP-modulated regularization.

axioms (1)

domain assumption · reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not
Directly stated in abstract as the basis for preferring reasoning-level constraints.

invented entities (1)

Reasoning Portability (RP) · no independent evidence
purpose: sample-level measure of how reusable the previous policy's behavior is on a new task
Newly formalized quantity used to modulate KL regularization.

pith-pipeline@v0.9.0 · 5737 in / 1340 out tokens · 81286 ms · 2026-05-20T13:24:56.215098+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We formalize portability ... and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. ... β_LRC = max(clip_min, 1{C≥τ} + C·1{C<τ})·β_0

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Paper passage: J_RDBCL(θ) = −1/G Σ [r_θ(q, o_i, t) A_{i,t} − β_LRC D_KL(π_{θ_t} || π_{θ_{t−1}})]

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 21 internal anchors

[1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi... arXiv, 2020.
[2] Cheng Chen, Junchen Zhu, Xu Luo, Heng Tao Shen, Jingkuan Song, and Lianli Gao. Coin: A benchmark of continual instruction tuning for multimodel large language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 57817–57840. Curran Associat... 2024.
[3] Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting, 2025. URL https://arxiv.org/abs/2510.18874
[4] Xiang Chen, Jintian Zhang, Xiaohan Wang, Ningyu Zhang, Tongtong Wu, Yuxiang Wang, Yongheng Wang, and Huajun Chen. Continual multimodal knowledge graph construction. arXiv preprint arXiv:2305.08698, 2023.
[5] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. URL https://arxiv.org/abs/2505.22617
[6] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. URL https://arxiv.org/abs/2305.06500
[7] Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, and Aditi Raghunathan. Weight ensembling improves reasoning in language models, 2025. URL https://arxiv.org/abs/2504.10478
[8] Matthias Delange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Greg Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1–1, 2021. ISSN 1939-3539. doi: 10.1109/tpami.2021.3057446. URL http://dx.doi.org/10.1109/TPAM...
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848
[10] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
[11] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie J... doi: 10.1038/s41586-025-09422-z, 2025.
[12] Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, and Cheng-Lin Liu. Hide-llava: Hierarchical decoupling for continual instruction tuning of multimodal large language model. arXiv preprint arXiv:2503.12941, 2025.
[13] Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people, URL https://arxiv.org/abs/1802.08218
[15] Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve, 2022. URL https://arxiv.org/abs/2210.11610
[16] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
[17] Hyosoon Jang, Yunhui Jang, Sungjae Lee, Jungseul Ok, and Sungsoo Ahn. Self-training large language models with confident reasoning, 2025. URL https://arxiv.org/abs/2505.17454
[18] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,... Language Models (Mostly) Know What They Know. arXiv, 2022.
[19] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):352... doi: 10.1073/pnas.1611835114, 2017.
[20] Vasily Kostumov, Bulat Nutfullin, Oleg Pilipenko, and Eugene Ilyushin. Uncertainty-aware evaluation for vision-language models, 2024. URL https://arxiv.org/abs/2402.14418
[21] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[22] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirz... Tulu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv, 2025.
[23] Mingrui Lao, Nan Pu, Yu Liu, Zhun Zhong, Erwin M Bakker, Nicu Sebe, and Michael S Lew. Multi-domain lifelong visual question answering via self-critical distillation. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4747–4758, 2023.
[24] Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought, 2025. URL https://arxiv.org/abs/2501.07542
[25] Zhizhong Li and Derek Hoiem. Learning without forgetting, 2017. URL https://arxiv.org/abs/1606.09282
[26] Xun Liang, Shichao Song, Zifan Zheng, Hanyu Wang, Qingchen Yu, Xunkai Li, Rong-Hua Li, Yi Wang, Zhonghao Wang, Feiyu Xiong, and Zhiyu Li. Internal consistency and self-feedback in large language models: A survey, 2024. URL https://arxiv.org/abs/2407.14507
[27] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[28] Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, and Yonghong Tian. Continual learning for vlms: A survey and taxonomy beyond forgetting. arXiv preprint arXiv:2508.04227, 2025.
[29] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017.
[30] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. In The 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, 2021.
[31] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems, 35:2507–2521, 2022.
[32] Adyasha Maharana, Jaehong Yoon, Tianlong Chen, and Mohit Bansal. Adapt-∞: Scalable continual multimodal instruction tuning via dynamic data selection, 2025. URL https://arxiv.org/abs/2410.10636
[33] Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning, 2025. URL https://arxiv.org/abs/2503.07365
[34] Chaofan Pan, Lingfei Ren, Yihui Feng, Linbo Xiong, Wei Wei, Yonghao Li, and Xin Yang. Multi-granularity knowledge transfer for continual reinforcement learning, 2025. URL https://arxiv.org/abs/2401.15098
[35] German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019. ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2019.01.012. URL https://www.sciencedirect.com/science/article/pii/S0893608019300231
[37]

Self-consistency preference optimization, 2025

Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mo- hit Bansal, Sainbayar Sukhbaatar, Jason Weston, and Jane Yu. Self-consistency preference optimization, 2025. URLhttps://arxiv.org/abs/2411.04109

work page arXiv 2025
[38] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. URL https://arxiv.org/abs/2305.18290
[40] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347
[41] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300
[42] Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s razor: Why online reinforcement learning forgets less, 2025. URL https://arxiv.org/abs/2509.04259
[43] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019
[44] Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752, 2025
[45] Longxiang Tang, Zhuotao Tian, Kai Li, Chunming He, Hantao Zhou, Hengshuang Zhao, Xiu Li, and Jiaya Jia. Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models, 2024. URL https://arxiv.org/abs/2407.05342
[46] Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20090–20111. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-acl.1030. URL http://dx.doi.org/10.18653/v...
[47] Qwen Team. Qwen2.5-vl, January 2025. URL https://qwenlm.github.io/blog/qwen2.5-vl/
[48] Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025
[49] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023. URL https://arxiv.org/abs/2203.11171
[50] Yikun Wang, Rui Zheng, Liang Ding, Qi Zhang, Dahua Lin, and Dacheng Tao. Uncertainty aware learning for language model alignment, 2024. URL https://arxiv.org/abs/2406.04854
[51] Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, and Zackory Erickson. Rl-vlm-f: Reinforcement learning from vision language foundation model feedback. URL https://arxiv.org/abs/2402.03681
[53] Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, and Guorui Zhou. Unlocking exploration in rlvr: Uncertainty-aware advantage shaping for deeper reasoning, 2025. URL https://arxiv.org/abs/2510.10649
[54] Yicheng Xu, Yuxin Chen, Jiahao Nie, Yusong Wang, Huiping Zhuang, and Manabu Okumura. Advancing cross-domain discriminability in continual learning of vision-language models, June 2024
[55] Shipeng Yan, Lanqing Hong, Hang Xu, Jianhua Han, Tinne Tuytelaars, Zhenguo Li, and Xuming He. Generative negative text replay for continual vision-language pretraining. In European Conference on Computer Vision, pages 22–38. Springer, 2022
[56] Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, and Li Yuan. Look-back: Implicit visual re-focusing in mllm reasoning. arXiv preprint arXiv:2507.03019, 2025
[57] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024
[58] En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Jingyu Wang, and Wenbing Tao. Perception-r1: Pioneering perception policy with reinforcement learning, 2025. URL https://arxiv.org/abs/2504.07954
[59] Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting continual learning of vision-language models via mixture-of-experts adapters, 2024. URL https://arxiv.org/abs/2403.11549
[60] Lu Yu, Zhe Tao, Dipam Goswami, Hantao Yao, Bartłomiej Twardowski, Joost Van de Weijer, and Changsheng Xu. Exploiting the semantic knowledge of pre-trained text-encoders for continual learning. arXiv preprint arXiv:2408.01076, 2024
[61] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...
[62] Yu-Chu Yu, Chi-Pin Huang, Jr-Jen Chen, Kai-Po Chang, Yung-Hsuan Lai, Fu-En Yang, and Yu-Chiang Frank Wang. Select and distill: Selective dual-teacher knowledge transfer for continual learning on vision-language models. In European Conference on Computer Vision, pages 219–236. Springer, 2024
[63] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837
[64] Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024. URL https://arxiv.org/abs/2405.10292
[65] Han Zhang, Yu Lei, Lin Gui, Min Yang, Yulan He, Hui Wang, and Ruifeng Xu. CPPO: Continual learning for reinforcement learning with human feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=86zAUE80pP
[66] Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization, 2025. URL https://arxiv.org/abs/2503.12937
[67] Xi Zhang, Feifei Zhang, and Changsheng Xu. Vqacl: A novel visual question answering continual learning setting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19102–19112, 2023
[68] Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang, and Kai Chen. Why reinforcement fine-tuning enables mllms preserve prior knowledge better: A data perspective. URL https://arxiv.org/abs/2506.23508
[70] Hongbo Zhao, Fei Zhu, Haiyang Guo, Meng Wang, Rundong Wang, Gaofeng Meng, and Zhaoxiang Zhang. Mllm-cl: Continual learning for multimodal large language models, 2025. URL https://arxiv.org/abs/2506.05453
[71] Danna Zheng, Danyang Liu, Mirella Lapata, and Jeff Z. Pan. Trustscore: Reference-free evaluation of llm response trustworthiness, 2024. URL https://arxiv.org/abs/2402.12545
[72] Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 19125–19136, 2023
[73] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning, 2025. URL https://arxiv.org/abs/2505.14362
[74] Da-Wei Zhou, Yuanhan Zhang, Yan Wang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Learning without forgetting for vision-language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(6):4489–4504, 2025. doi: 10.1109/TPAMI.2025.3540889
[75] Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in llm reasoning, 2025. URL https://arxiv.org/abs/2506.01347

A Appendix

A.1 Symbols

We use the following notation throughout the paper:
• T: a sequence task.
• t: total number of tasks in the continual learning sequence.
• T...

[1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

[2] Cheng Chen, Junchen Zhu, Xu Luo, Heng Tao Shen, Jingkuan Song, and Lianli Gao. Coin: A benchmark of continual instruction tuning for multimodel large language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 57817–57840. Curran Associat...

[3] Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting, 2025. URL https://arxiv.org/abs/2510.18874

[4] Xiang Chen, Jintian Zhang, Xiaohan Wang, Ningyu Zhang, Tongtong Wu, Yuxiang Wang, Yongheng Wang, and Huajun Chen. Continual multimodal knowledge graph construction. arXiv preprint arXiv:2305.08698, 2023

[5] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. URL https://arxiv.org/abs/2505.22617

[6] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. URL https://arxiv.org/abs/2305.06500

[7] Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, and Aditi Raghunathan. Weight ensembling improves reasoning in language models, 2025. URL https://arxiv.org/abs/2504.10478

[8] Matthias Delange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Greg Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1–1, 2021. ISSN 1939-3539. doi: 10.1109/tpami.2021.3057446. URL http://dx.doi.org/10.1109/TPAM...

[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848

[10] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

[11] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie J...

[12] Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, and Cheng-Lin Liu. Hide-llava: Hierarchical decoupling for continual instruction tuning of multimodal large language model. arXiv preprint arXiv:2503.12941, 2025

[13] Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people. URL https://arxiv.org/abs/1802.08218

[15] Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve, 2022. URL https://arxiv.org/abs/2210.11610

[16] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

[17] Hyosoon Jang, Yunhui Jang, Sungjae Lee, Jungseul Ok, and Sungsoo Ahn. Self-training large language models with confident reasoning, 2025. URL https://arxiv.org/abs/2505.17454

[18] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

[19] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):352...

[20] Vasily Kostumov, Bulat Nutfullin, Oleg Pilipenko, and Eugene Ilyushin. Uncertainty-aware evaluation for vision-language models, 2024. URL https://arxiv.org/abs/2402.14418

[21] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[22] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirz...

[23] Mingrui Lao, Nan Pu, Yu Liu, Zhun Zhong, Erwin M Bakker, Nicu Sebe, and Michael S Lew. Multi-domain lifelong visual question answering via self-critical distillation. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4747–4758, 2023

[24] Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought, 2025. URL https://arxiv.org/abs/2501.07542

[25] Zhizhong Li and Derek Hoiem. Learning without forgetting, 2017. URL https://arxiv.org/abs/1606.09282

[26] Xun Liang, Shichao Song, Zifan Zheng, Hanyu Wang, Qingchen Yu, Xunkai Li, Rong-Hua Li, Yi Wang, Zhonghao Wang, Feiyu Xiong, and Zhiyu Li. Internal consistency and self-feedback in large language models: A survey, 2024. URL https://arxiv.org/abs/2407.14507

[27] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

[28] Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, and Yonghong Tian. Continual learning for vlms: A survey and taxonomy beyond forgetting. arXiv preprint arXiv:2508.04227, 2025

[29] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017

[30] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. In The 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, 2021

[31] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems, 35:2507–2521, 2022

[32] [32]

Adapt- ∞: Scalable continual multimodal instruction tuning via dynamic data selection, 2025

Adyasha Maharana, Jaehong Yoon, Tianlong Chen, and Mohit Bansal. Adapt- ∞: Scalable continual multimodal instruction tuning via dynamic data selection, 2025. URL https:// arxiv.org/abs/2410.10636

work page arXiv 2025

[33] [33]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning, 2025. URL https://arxiv.org/abs/ 2503.07365

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Ominicontrol: Minimal and uni- versal control for diffusion transformer.arXiv preprint arXiv:2401.15098, 2024

Chaofan Pan, Lingfei Ren, Yihui Feng, Linbo Xiong, Wei Wei, Yonghao Li, and Xin Yang. Multi-granularity knowledge transfer for continual reinforcement learning, 2025. URL https: //arxiv.org/abs/2401.15098. 12

work page arXiv 2025

[35] [35]

Parisi, Ronald Kemker, Jose L

German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review.Neural Networks, 113:54–71,

work page

[36] [36]

doi: https://doi.org/10.1016/j.neunet.2019.01.012

ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2019.01.012. URL https: //www.sciencedirect.com/science/article/pii/S0893608019300231

work page doi:10.1016/j.neunet.2019.01.012 2019

[37] [37]

Self-consistency preference optimization, 2025

Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mo- hit Bansal, Sainbayar Sukhbaatar, Jason Weston, and Jane Yu. Self-consistency preference optimization, 2025. URLhttps://arxiv.org/abs/2411.04109

work page arXiv 2025

[38] [38]

Manning, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model,

work page

[39] [39]

URLhttps://arxiv.org/abs/2305.18290

work page internal anchor Pith review Pith/arXiv arXiv

[40] [40]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017

[41] [41]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/ 2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

RL's Razor: Why Online Reinforcement Learning Forgets Less

Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s razor: Why online reinforcement learning forgets less, 2025. URLhttps://arxiv.org/abs/2509.04259

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

work page 2019

[44] [44]

Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

work page arXiv 2025

[45] [45]

Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models, 2024

Longxiang Tang, Zhuotao Tian, Kai Li, Chunming He, Hantao Zhou, Hengshuang Zhao, Xiu Li, and Jiaya Jia. Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models, 2024. URL https://arxiv.org/abs/2407. 05342

work page 2024

[46] [46]

Confidence improves self-consistency in llms

Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. InFindings of the Association for Computational Linguistics: ACL 2025, page 20090–20111. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-acl.1030. URL http://dx.doi.org/10. 18653/v...

work page doi:10.18653/v1/2025.findings-acl.1030 2025

[47] [47]

Qwen2.5-vl, January 2025

Qwen Team. Qwen2.5-vl, January 2025. URL https://qwenlm.github.io/blog/qwen2. 5-vl/

work page 2025

[48] [48]

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[49] [49]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023. URLhttps://arxiv.org/abs/2203.11171

work page internal anchor Pith review Pith/arXiv arXiv 2023

[50] [50]

Uncertainty aware learning for language model alignment, 2024

Yikun Wang, Rui Zheng, Liang Ding, Qi Zhang, Dahua Lin, and Dacheng Tao. Uncertainty aware learning for language model alignment, 2024. URL https://arxiv.org/abs/2406. 04854

work page 2024

[51] [51]

Rl-vlm-f: Reinforcement learning from vision language foundation model feedback,

Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, and Zackory Erickson. Rl-vlm-f: Reinforcement learning from vision language foundation model feedback,

work page

[52] [52]

URLhttps://arxiv.org/abs/2402.03681. 13

work page arXiv

[53] [53]

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, and Guorui Zhou. Unlocking exploration in rlvr: Uncertainty-aware advantage shaping for deeper reasoning, 2025. URLhttps://arxiv.org/abs/2510.10649

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

Advancing cross-domain discriminability in continual learning of vison-language models, 06 2024

Yicheng Xu, Yuxin Chen, Jiahao Nie, Yusong Wang, Huiping Zhuang, and Manabu Okumura. Advancing cross-domain discriminability in continual learning of vison-language models, 06 2024

work page 2024

[55] [55]

Generative negative text replay for continual vision-language pretraining

Shipeng Yan, Lanqing Hong, Hang Xu, Jianhua Han, Tinne Tuytelaars, Zhenguo Li, and Xuming He. Generative negative text replay for continual vision-language pretraining. In European Conference on Computer Vision, pages 22–38. Springer, 2022

work page 2022

[56] [56]

Look-back: Implicit visual re-focusing in mllm reasoning.arXiv preprint arXiv:2507.03019, 2025

Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, and Li Yuan. Look-back: Implicit visual re-focusing in mllm reasoning.arXiv preprint arXiv:2507.03019, 2025

work page arXiv 2025

[57] [57]

A survey on multimodal large language models.National Science Review, 11(12):nwae403, 2024

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.National Science Review, 11(12):nwae403, 2024

work page 2024

[58] [58]

arXiv preprint arXiv:2504.07954 , year =

En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Jingyu Wang, and Wenbing Tao. Perception-r1: Pioneering perception policy with reinforcement learning, 2025. URL https://arxiv.org/abs/2504.07954

work page arXiv 2025

[59] [59]

Boosting continual learning of vision-language models via mixture-of-experts adapters, 2024

Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting continual learning of vision-language models via mixture-of-experts adapters, 2024. URL https://arxiv.org/abs/2403.11549

work page arXiv 2024

[60] [60]

Exploiting the semantic knowledge of pre-trained text-encoders for continual learning.arXiv preprint arXiv:2408.01076, 2024

Lu Yu, Zhe Tao, Dipam Goswami, Hantao Yao, Bartłomiej Twardowski, Joost Van de Weijer, and Changsheng Xu. Exploiting the semantic knowledge of pre-trained text-encoders for continual learning.arXiv preprint arXiv:2408.01076, 2024

work page arXiv 2024

[61] [61]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[62] [62]

Select and distill: Selective dual-teacher knowledge transfer for continual learning on vision-language models

Yu-Chu Yu, Chi-Pin Huang, Jr-Jen Chen, Kai-Po Chang, Yung-Hsuan Lai, Fu-En Yang, and Yu-Chiang Frank Wang. Select and distill: Selective dual-teacher knowledge transfer for continual learning on vision-language models. In European Conference on Computer Vision, pages 219–236. Springer, 2024

work page 2024

[63] [63]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837

work page internal anchor Pith review Pith/arXiv arXiv 2025

[64] [64]

Fine-tuning large vision-language models as decision-making agents via reinforcement learning

Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024. URL https://arxiv.org/abs/2405.10292

work page arXiv 2024

[65] [65]

CPPO: Continual learning for reinforcement learning with human feedback

Han Zhang, Yu Lei, Lin Gui, Min Yang, Yulan He, Hui Wang, and Ruifeng Xu. CPPO: Continual learning for reinforcement learning with human feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=86zAUE80pP

work page 2024

[66] [66]

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization, 2025. URL https://arxiv.org/abs/2503.12937

work page internal anchor Pith review Pith/arXiv arXiv 2025

[67] [67]

Vqacl: A novel visual question answering continual learning setting

Xi Zhang, Feifei Zhang, and Changsheng Xu. Vqacl: A novel visual question answering continual learning setting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19102–19112, 2023

work page 2023

[68] [68]

Why reinforcement fine-tuning enables mllms preserve prior knowledge better: A data perspective

Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang, and Kai Chen. Why reinforcement fine-tuning enables mllms preserve prior knowledge better: A data perspective. URL https://arxiv.org/abs/2506.23508

work page arXiv

[70] [70]

Mllm-cl: Continual learning for multimodal large language models

Hongbo Zhao, Fei Zhu, Haiyang Guo, Meng Wang, Rundong Wang, Gaofeng Meng, and Zhaoxiang Zhang. Mllm-cl: Continual learning for multimodal large language models, 2025. URL https://arxiv.org/abs/2506.05453

work page arXiv 2025

[71] [71]

Danna Zheng, Danyang Liu, Mirella Lapata, and Jeff Z. Pan. Trustscore: Reference-free evaluation of llm response trustworthiness, 2024. URL https://arxiv.org/abs/2402.12545

work page 2024

[72] [72]

Preventing zero-shot transfer degradation in continual learning of vision-language models

Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19125–19136, 2023

work page 2023

[73] [73]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning, 2025. URL https://arxiv.org/abs/2505.14362

work page internal anchor Pith review Pith/arXiv arXiv 2025

[74] [74]

Learning without forgetting for vision-language models

Da-Wei Zhou, Yuanhan Zhang, Yan Wang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Learning without forgetting for vision-language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(6):4489–4504, 2025. doi: 10.1109/TPAMI.2025.3540889

work page doi:10.1109/tpami.2025 2025

[75] [75]

The surprising effectiveness of negative reinforcement in llm reasoning

Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in llm reasoning, 2025. URL https://arxiv.org/abs/2506.01347

A Appendix

A.1 Symbols

We use the following notation throughout the paper:

• T: a sequence task.
• t: total number of tasks in the continual learning sequence.
• T...

work page arXiv 2025