Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
Pith reviewed 2026-05-20 13:24 UTC · model grok-4.3
The pith
Reasoning Portability measures reusable prior reasoning on new tasks to dynamically balance preservation and exploration during RLVR-based continual learning for MLLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalize portability as a sample-level measure of how reusable the previous policy's behavior is on a new task, empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not, and instantiate this as Reasoning Portability (RP) to modulate per-sample Kullback-Leibler regularization in RLVR: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways.
What carries the argument
Reasoning Portability (RP), a sample-level score of reusability of the prior policy's reasoning behavior on a new task, used to adjust the per-sample strength of KL regularization inside RLVR training.
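A minimal Python sketch of how such a per-sample adjustment could look. It assumes that the score C in the rule quoted elsewhere on this page, β_LRC = max(clip_min, 1{C≥τ} + C·1{C<τ})·β_0, plays the role of the RP score and lies in [0, 1]. The function name and the numeric values chosen for τ, clip_min, and β_0 are illustrative placeholders, not values taken from the paper.

```python
def kl_coefficient(c: float, tau: float, clip_min: float, beta_0: float) -> float:
    """Per-sample KL weight following the quoted rule
    beta = max(clip_min, 1{c >= tau} + c * 1{c < tau}) * beta_0.

    c        : portability-like score for one sample (assumed to lie in [0, 1])
    tau      : threshold above which the full base weight is kept
    clip_min : lower bound on the scaling factor
    beta_0   : base KL coefficient
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("score c is assumed to lie in [0, 1]")
    # At or above the threshold the anchor stays tight (scale 1);
    # below it, the anchor relaxes in proportion to the score.
    scale = 1.0 if c >= tau else c
    return max(clip_min, scale) * beta_0


if __name__ == "__main__":
    # Hypothetical settings, chosen only to exercise the three regimes.
    for score in (0.9, 0.3, 0.05):
        print(score, kl_coefficient(score, tau=0.5, clip_min=0.1, beta_0=0.04))
```

Under this reading, samples at or above the threshold keep the full base weight, samples below it are regularized in proportion to their score, and the clip keeps the weight from reaching zero.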
If this is right
- RDB-CL consistently outperforms standard RLVR and other continual-learning baselines on sequential multimodal tasks.
- High-RP samples keep prior reasoning intact through stronger regularization.
- Low-RP samples gain freedom to develop new reasoning pathways through weaker regularization.
- Last accuracy rises by 12.0% compared with the vanilla RLVR baseline.
- The method supplies a per-sample mechanism that simultaneously limits forgetting and supports adaptation.
Where Pith is reading between the lines
- The same reasoning-trace signal could be tested as a regularizer in non-RL continual-learning pipelines for large models.
- If reasoning traces prove stable across domains, the approach may reduce the need for task-specific replay buffers.
- Extending RP to measure portability between entirely different model architectures would test whether the signal is architecture-agnostic.
- The framework suggests that monitoring intermediate reasoning rather than final outputs may improve stability in any sequential policy optimization setting.
Load-bearing premise
Reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not.
What would settle it
An experiment in which reasoning traces extracted on out-of-distribution samples show no better correlation with retained performance than answer accuracy, causing RDB-CL to produce no accuracy gain or to degrade relative to the vanilla RLVR baseline.
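One way such a check could be set up is sketched below in Python: compute the Pearson correlation of each signal with retained performance and compare the two. The per-sample arrays are made-up placeholders that only show the mechanics; the function names and the choice of Pearson correlation are assumptions, not the paper's protocol.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def compare_signals(reasoning_signal, answer_signal, retained_performance):
    """Return (r_reasoning, r_answer): how well each signal tracks retention."""
    return (
        pearson(reasoning_signal, retained_performance),
        pearson(answer_signal, retained_performance),
    )


if __name__ == "__main__":
    # Placeholder values for illustration only, not experimental data.
    reasoning = [0.1, 0.4, 0.5, 0.9]
    answer = [0.9, 0.1, 0.8, 0.2]
    retained = [0.2, 0.5, 0.6, 0.8]
    r_reasoning, r_answer = compare_signals(reasoning, answer, retained)
    print(f"reasoning-level r = {r_reasoning:.3f}, answer-level r = {r_answer:.3f}")
```

The premise would be in trouble if, on real out-of-distribution data, the reasoning-level correlation were no higher than the answer-level one.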
Original abstract
Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes Reasoning Portability (RP) as a sample-level measure of how reusable the previous policy's reasoning behavior is on new tasks for MLLMs under RLVR-based continual learning. It claims empirically that reasoning-level signals remain reliable on out-of-distribution (OOD) samples while answer-level signals do not, instantiates RP, and proposes RDB-CL to modulate per-sample KL regularization: tight anchors on high-RP samples preserve reusable reasoning, and relaxed anchors on low-RP samples allow exploration of new reasoning pathways. Experiments report that RDB-CL outperforms baselines, with a +12.0% gain in Last accuracy over vanilla RLVR.
Significance. If the reliability distinction and the resulting gains hold under independent verification, the work would provide a concrete mechanism for leveraging emerging reasoning capabilities to improve stability-plasticity trade-offs in RLVR continual learning for multimodal models. The sample-level dynamic balancing approach could influence future designs that regularize at the reasoning-trace level rather than at the output level.
major comments (2)
- [§3] §3 (Reasoning Portability definition and computation): The procedure for computing RP on OOD samples must be specified with explicit equations or pseudocode showing the exact prompting, scoring, or comparison of reasoning traces. If RP relies on the current MLLM to judge previous-policy reasoning traces on the same OOD data, the claimed reliability advantage of reasoning-level over answer-level signals becomes an internal consistency check rather than an externally anchored demonstration, directly undermining the justification for the subsequent per-sample KL modulation in RDB-CL.
- [§5] §5 (Experiments and results): The reported +12.0% Last accuracy improvement over vanilla RLVR requires supporting details on statistical significance (e.g., p-values or confidence intervals across multiple seeds), variance, exact dataset splits, number of tasks, and full baseline implementations. Without these, it is impossible to assess whether the gain is robust or driven by the RP-based modulation versus other implementation choices.
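The kind of multi-seed reporting requested in the second comment could be produced along the following lines. This is a sketch in Python using only the standard library; the function name and the accuracy values are hypothetical, and the numbers are placeholders rather than results from the paper. It computes the per-seed differences, their mean and standard deviation, and the paired t statistic.

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(method_acc, baseline_acc):
    """Summarize per-seed accuracy differences between a method and a baseline.

    Both inputs hold one accuracy per seed, in the same seed order.
    Returns the mean difference, its sample standard deviation, the paired
    t statistic, and the degrees of freedom.
    """
    if len(method_acc) != len(baseline_acc) or len(method_acc) < 2:
        raise ValueError("need matched results for at least two seeds")
    diffs = [m - b for m, b in zip(method_acc, baseline_acc)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    # The t statistic is undefined when every seed gives the same difference.
    t_stat = mean_diff / (sd_diff / sqrt(n)) if sd_diff > 0 else float("nan")
    return {"mean_diff": mean_diff, "sd_diff": sd_diff, "t_stat": t_stat, "dof": n - 1}


if __name__ == "__main__":
    # Placeholder accuracies for three seeds; not taken from the paper.
    print(paired_summary([3.0, 5.0, 4.0], [1.0, 2.0, 3.0]))
```

Turning the t statistic into a p-value requires the Student t distribution with `dof` degrees of freedom; `scipy.stats.ttest_rel` computes the same paired test directly if SciPy is available.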
minor comments (2)
- Notation for RP and the dynamic balance factor should be introduced with a clear table or diagram showing how the per-sample KL weight is derived from the RP score.
- The abstract and introduction would benefit from a concise statement of the precise definition of 'reasoning-level signal' versus 'answer-level signal' to avoid ambiguity in the reliability claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.
- Referee: [§3] §3 (Reasoning Portability definition and computation): The procedure for computing RP on OOD samples must be specified with explicit equations or pseudocode showing the exact prompting, scoring, or comparison of reasoning traces. If RP relies on the current MLLM to judge previous-policy reasoning traces on the same OOD data, the claimed reliability advantage of reasoning-level over answer-level signals becomes an internal consistency check rather than an externally anchored demonstration, directly undermining the justification for the subsequent per-sample KL modulation in RDB-CL.
  Authors: We agree that the RP computation procedure requires explicit specification to avoid ambiguity. In the revised manuscript, we will insert detailed equations and pseudocode in §3 describing the exact process: (1) generation of reasoning traces by the previous policy on OOD samples via fixed prompting, (2) scoring via a separate, pre-trained reasoning similarity metric (e.g., embedding-based trace alignment independent of the current policy), and (3) comparison against answer-level baselines. This metric is externally anchored and does not rely on the current MLLM judging its own outputs, thereby preserving the claimed reliability distinction as an externally validated observation rather than an internal check. We will revise accordingly.
  Revision: yes
- Referee: [§5] §5 (Experiments and results): The reported +12.0% Last accuracy improvement over vanilla RLVR requires supporting details on statistical significance (e.g., p-values or confidence intervals across multiple seeds), variance, exact dataset splits, number of tasks, and full baseline implementations. Without these, it is impossible to assess whether the gain is robust or driven by the RP-based modulation versus other implementation choices.
  Authors: We acknowledge that additional experimental details are needed to demonstrate robustness. In the revision, we will expand §5 and the appendix to report: exact dataset splits and task sequence (5 tasks with standard continual learning splits), full baseline implementations, and results across 3 random seeds including mean, standard deviation, and p-values from paired t-tests on the Last accuracy metric. These additions will clarify that the +12.0% gain is driven by the RP modulation while maintaining reproducibility.
  Revision: yes
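The scoring step named in the first response (an embedding-based similarity between reasoning traces that does not involve the current policy) could be sketched in Python as follows. Everything here is an assumption made for illustration: the bag-of-words `toy_embed` is a stand-in for whatever pre-trained encoder the authors intend, and which pair of traces gets compared is not specified on this page.

```python
from collections import Counter
from math import sqrt


def toy_embed(trace: str) -> Counter:
    """Stand-in embedding: lowercase bag-of-words counts.

    Hypothetical placeholder for a pre-trained encoder.
    """
    return Counter(trace.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def trace_alignment(trace_a: str, trace_b: str) -> float:
    """Similarity score between two reasoning traces, in [0, 1] for count vectors."""
    return cosine(toy_embed(trace_a), toy_embed(trace_b))


if __name__ == "__main__":
    # Illustrative traces only.
    print(trace_alignment("count the red cubes then add", "count the blue cubes then add"))
```

Because the score depends only on the two traces and a fixed encoder, it would not change as the current policy is updated, which is the property the response relies on.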
Circularity Check
No significant circularity; derivation remains self-contained
The paper defines Reasoning Portability (RP) as a sample-level measure of reusability of the previous policy's behavior on new tasks and uses it to modulate per-sample KL regularization in RDB-CL. It further claims an empirical demonstration that reasoning-level signals are more reliable than answer-level signals on OOD samples. No equations or computation procedures are shown that reduce RP to a fitted parameter, a self-referential judgment by the same model, or a self-citation chain. The central claims rest on an independent empirical comparison rather than reducing to the inputs by construction, satisfying the criteria for a non-circular finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not
invented entities (1)
- Reasoning Portability (RP): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear · Relation between the paper passage and the cited Recognition theorem.
  We formalize portability ... and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. ... $\beta_{\mathrm{LRC}} = \max\big(\mathrm{clip}_{\min},\ \mathbb{1}\{C \ge \tau\} + C \cdot \mathbb{1}\{C < \tau\}\big) \cdot \beta_0$
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  unclear · Relation between the paper passage and the cited Recognition theorem.
  $J_{\mathrm{RDBCL}}(\theta) = -\frac{1}{G} \sum \big[\, r_\theta(q, o_i, t)\, A_{i,t} - \beta_{\mathrm{LRC}}\, D_{\mathrm{KL}}(\pi_{\theta_t} \,\|\, \pi_{\theta_{t-1}}) \,\big]$
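To make the structure of the quoted objective concrete, the Python sketch below evaluates a simplified version of it for one group of G samples. It is a reading under stated assumptions rather than the paper's implementation: the token index is collapsed so that each sample contributes one ratio, one advantage, and one KL value; no ratio clipping is applied; and the function name and inputs are hypothetical.

```python
def rdbcl_objective(ratios, advantages, kls, betas):
    """Simplified group objective: -(1/G) * sum_i [ r_i * A_i - beta_i * KL_i ].

    ratios     : importance ratio per sample
    advantages : advantage per sample
    kls        : KL divergence to the previous policy per sample
    betas      : per-sample KL coefficient (the quoted beta_LRC)
    """
    g = len(ratios)
    if g == 0 or not (len(advantages) == len(kls) == len(betas) == g):
        raise ValueError("all inputs must be non-empty and of equal length")
    total = sum(r * a - b * k for r, a, k, b in zip(ratios, advantages, kls, betas))
    # The leading minus sign turns the reward-minus-penalty sum into a loss to minimize.
    return -total / g


if __name__ == "__main__":
    # Placeholder numbers: the second sample has a much smaller KL coefficient,
    # so its KL term is penalized less.
    print(rdbcl_objective([1.0, 1.0], [0.5, -0.5], [0.1, 0.1], [0.04, 0.004]))
```

A larger per-sample coefficient increases the penalty on that sample's KL term, which corresponds to the tight anchor described for high-RP samples; a smaller one relaxes it.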
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.