Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

Alicja Dobrzeniecka; Bartlomiej Twardowski; Filip Szatkowski; Sebastian Cygert; Szymon Lukasik

arxiv: 2605.31229 · v1 · pith:C63G73HDnew · submitted 2026-05-29 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-28 23:00 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords continual multimodal retrieval · dynamic adapter routing · class-incremental learning · vision-language models · model merging · prototype-based routing · out-of-distribution generalization

The pith

Dynamic Adapter Routing outperforms standard CIL methods in continual multimodal retrieval using prototype-based selection and merging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that retrieval tasks in vision-language models need continual learning methods distinct from class-incremental learning because CIL approaches do not capture retrieval-specific dynamics. It introduces a new evaluation framework for continual multimodal retrieval across diverse visual domains, where standard CIL methods produce no meaningful gains. Dynamic Adapter Routing solves this by selecting adapters through prototype-based routing and combining them via model merging, delivering higher performance and stronger out-of-distribution generalization. This matters because retrieval is a core capability of these models, and effective continual updates could allow them to incorporate new domains without losing earlier retrieval accuracy.

Core claim

In a challenging continual multimodal retrieval scenario spanning diverse visual domains, standard class-incremental learning methods fail to yield meaningful gains, while Dynamic Adapter Routing, based on adapters selected through prototype-based routing and combined via model merging, achieves superior performance over the previous baselines and demonstrates strong generalization under out-of-distribution evaluation.

What carries the argument

Dynamic Adapter Routing (DAR), which selects adapters via prototype-based routing and combines them through model merging.
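
The source describes DAR only at this level of detail, so the following is a minimal Python sketch of one way prototype-based routing followed by merging could work, not the authors' implementation. It assumes one prototype vector per adapter, cosine similarity for routing, a softmax over the top-k matches, and a plain weighted average as a stand-in for the paper's model-merging step. The function names (`route`, `merge_adapters`), the domain names, the `top_k` and `temperature` parameters, and all numbers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def route(query_emb, prototypes, top_k=2, temperature=0.1):
    """Pick the top_k adapters whose prototypes best match the query.

    Returns {adapter_name: weight}, with weights from a softmax over the
    selected similarities (an assumption, not a detail from the paper).
    """
    sims = {name: cosine(query_emb, proto) for name, proto in prototypes.items()}
    chosen = sorted(sims, key=sims.get, reverse=True)[:top_k]
    best = max(sims[name] for name in chosen)
    exps = {name: math.exp((sims[name] - best) / temperature) for name in chosen}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def merge_adapters(adapters, weights):
    """Weighted average of flat adapter parameter lists (stand-in for merging)."""
    dim = len(next(iter(adapters.values())))
    merged = [0.0] * dim
    for name, weight in weights.items():
        for i, value in enumerate(adapters[name]):
            merged[i] += weight * value
    return merged


# Hypothetical toy setup: three domains, 2-d prototypes, 2-d adapter parameters.
prototypes = {"photos": [1.0, 0.0], "sketches": [0.0, 1.0], "scans": [-1.0, 0.0]}
adapters = {"photos": [1.0, 1.0], "sketches": [3.0, 3.0], "scans": [5.0, 5.0]}

query = [0.9, 0.1]  # embedding of an incoming query (made up)
weights = route(query, prototypes, top_k=2)
merged = merge_adapters(adapters, weights)
print(weights)
print(merged)
```

In this toy run the query lies closest to the "photos" prototype, so that adapter dominates the merged parameters; a higher `temperature` would blend the selected adapters more evenly.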

If this is right

Standard CIL methods fail to yield meaningful gains in the more challenging CMR scenario.
DAR achieves superior performance over the previous baselines.
DAR demonstrates strong generalization under out-of-distribution evaluation.
The new framework reveals unique challenges of CMR that classification-focused methods do not address.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The prototype-routing idea could be tested on other retrieval-heavy tasks such as image captioning or visual question answering under continual updates.
Alternative merging techniques might further improve DAR without changing the routing step.
The evaluation framework could serve as a testbed for retrieval-specific regularization methods beyond adapters.

Load-bearing premise

That the new principled evaluation framework for continual multimodal retrieval spanning diverse visual domains accurately captures retrieval-specific dynamics and that standard CIL methods were appropriately adapted and tested within it.

What would settle it

A result in which DAR fails to outperform adapted CIL baselines on the proposed CMR benchmark or shows no advantage on out-of-distribution test sets would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.31229 by Alicja Dobrzeniecka, Bartlomiej Twardowski, Filip Szatkowski, Sebastian Cygert, Szymon Lukasik.

**Figure 2.** (left) Image-to-Text and (right) Text-to-Image Recall@1 performance of various CL

**Figure 3.** (top) In-distribution AIR and (bottom) I2T R@1 on NoCaps during prolonged continual training. To evaluate DAR and its adapter sharing mechanism, we extend the training sequences by splitting each task from

**Figure 4.** Continual performance on selected tasks.

Original abstract

While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically underexplored. Existing work often approaches continual retrieval through the lens of class-incremental learning (CIL), evaluating both standard CIL methods and retrieval-oriented adaptations in settings that may not fully capture the retrieval-specific dynamics. To address this, we introduce a new, principled evaluation framework for continual multimodal retrieval (CMR) spanning diverse visual domains, and systematically evaluate common approaches within this setting. Our empirical analysis shows that standard CIL methods fail to yield meaningful gains in our more challenging scenario. Therefore, we propose Dynamic Adapter Routing (DAR), a novel approach based on adapters selected through prototype-based routing and combined via model merging. DAR achieves superior performance over the previous baselines and demonstrates strong generalization under out-of-distribution evaluation. Our results highlight the unique challenges of CMR and encourage further research in this direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The new CMR framework and the DAR method address an underexplored problem, but the abstract supplies no metrics or setup details, so the superiority claim cannot be checked.

The paper introduces a continual multimodal retrieval (CMR) evaluation framework that spans diverse visual domains and a Dynamic Adapter Routing (DAR) technique that picks adapters via prototypes then merges them. It argues that standard class-incremental learning methods do not transfer well to retrieval and that DAR does better with stronger out-of-distribution behavior.

That framing is the main new piece. Treating retrieval as its own continual problem rather than a classification proxy makes sense on paper, and the prototype routing plus merging idea is a concrete proposal distinct from the CIL baselines they mention.

The abstract states empirical superiority and failure of the baselines but gives no numbers, no dataset list, no task sequence, no metric (R@K, mAP), and no description of how the CIL methods were changed for retrieval heads. Without those, there is no way to tell whether the framework actually tests retrieval dynamics or whether the baselines were handicapped by the adaptation. The OOD claim is likewise unsupported here.

This is aimed at researchers already working on continual learning inside vision-language models. If the full paper contains reproducible experiments with clear protocols and fair baseline adaptations, it could be worth a referee. Based on the abstract alone the central claim cannot be evaluated, so I would not cite it or bring it to a reading group yet.

Referee Report

2 major / 1 minor

Summary. The paper introduces a new principled evaluation framework for continual multimodal retrieval (CMR) spanning diverse visual domains. It empirically shows that standard class-incremental learning (CIL) methods fail to yield meaningful gains in this setting and proposes Dynamic Adapter Routing (DAR), a method based on prototype-based routing for selecting adapters that are then combined via model merging. The central claim is that DAR achieves superior performance over previous baselines and demonstrates strong generalization under out-of-distribution evaluation.

Significance. If the framework accurately reflects retrieval-specific dynamics (embedding similarity, cross-modal matching, domain shifts) rather than classification proxies and the results are reproducible, the work would be significant in shifting continual learning research for vision-language models from classification to retrieval tasks, while providing a benchmark and a novel adapter-routing approach that could stimulate further CMR research.

major comments (2)

[Abstract] Abstract: the claim of empirical superiority for DAR and failure of CIL baselines supplies no metrics (e.g., R@K, mAP), dataset details, task sequences, or experimental protocol, preventing verification that the results support the central claim.
[Evaluation Framework] CMR framework description: no concrete information is given on domain coverage, how retrieval metrics replace classification accuracy, or the precise adaptations made to standard CIL methods (rehearsal buffers, regularization, parameter isolation) for retrieval heads instead of classifiers; this is load-bearing for the assertion that the framework captures retrieval-specific dynamics rather than introducing artifacts.
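
For readers unfamiliar with the retrieval metric named in the comments above, here is a small sketch of Recall@K computed from a query-by-candidate similarity matrix. The function name, the toy scores, and the ground-truth sets are illustrative assumptions; the paper's exact evaluation protocol is not given in the text reviewed here.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries with at least one correct candidate in the top k.

    similarity[i][j] is the score of query i against candidate j.
    ground_truth[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if relevant & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy image-to-text example: 3 image queries scored against 3 captions.
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct caption (0) ranked first
    [0.2, 0.4, 0.8],  # query 1: correct caption (1) ranked second
    [0.5, 0.6, 0.1],  # query 2: correct caption (1) ranked first
]
gt = [{0}, {1}, {1}]

r_at_1 = recall_at_k(sim, gt, k=1)  # 2 of 3 queries hit at rank 1
r_at_2 = recall_at_k(sim, gt, k=2)  # all 3 queries hit within rank 2
print(r_at_1, r_at_2)
```

Text-to-image recall would use the same function on the transposed similarity matrix, with ground truth mapping each caption to its image or images.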

minor comments (1)

Notation for 'prototype-based routing' and 'model merging' could be formalized with a short equation or pseudocode in the method section to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and evaluation framework. We address the two major comments point by point below and will revise the manuscript to incorporate the suggested clarifications.

Referee: [Abstract] Abstract: the claim of empirical superiority for DAR and failure of CIL baselines supplies no metrics (e.g., R@K, mAP), dataset details, task sequences, or experimental protocol, preventing verification that the results support the central claim.

Authors: We agree that the abstract would benefit from including concrete quantitative support for the central claims. In the revised version we will add key results (e.g., average R@1 and mAP gains of DAR over the strongest CIL baseline, number of domains, and task sequence length) while preserving the abstract's brevity. The full experimental protocol, datasets, and task sequences are already reported in Section 4; the abstract revision will simply surface the most salient numbers to allow immediate verification of the claims.

revision: yes

Referee: [Evaluation Framework] CMR framework description: no concrete information is given on domain coverage, how retrieval metrics replace classification accuracy, or the precise adaptations made to standard CIL methods (rehearsal buffers, regularization, parameter isolation) for retrieval heads instead of classifiers; this is load-bearing for the assertion that the framework captures retrieval-specific dynamics rather than introducing artifacts.

Authors: We acknowledge that the current description of the CMR framework would be strengthened by additional concrete details. Section 3 already defines the shift from classification accuracy to retrieval metrics (Recall@K, mAP) and outlines the domain coverage across natural, medical, satellite, and artistic imagery, but we will expand this section with explicit lists of domains, the exact number of tasks, and the precise modifications applied to each CIL baseline (e.g., storing normalized embeddings rather than class prototypes in rehearsal buffers, adapting EWC-style regularization to cross-modal similarity losses, and replacing classifier heads with retrieval heads). These additions will make the retrieval-specific adaptations fully explicit and address the concern that the framework may introduce artifacts.

revision: yes
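
As a rough illustration of what "EWC-style regularization" on top of a retrieval loss could look like, the sketch below adds the standard elastic weight consolidation penalty, 0.5 · λ · Σ F_i (θ_i − θ*_i)², to an already-computed retrieval loss value. The function names, the flat parameter lists, and the numbers are hypothetical, and nothing here is taken from the paper's code.

```python
def ewc_penalty(params, anchor_params, fisher, strength=1.0):
    """Quadratic EWC penalty: 0.5 * strength * sum_i F_i * (theta_i - theta*_i)^2.

    params        -- current parameter values
    anchor_params -- values stored after training on the previous task
    fisher        -- per-parameter importance estimates (diagonal Fisher)
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def regularized_loss(retrieval_loss, params, anchor_params, fisher, strength=1.0):
    """Retrieval loss on the current task plus the EWC penalty."""
    return retrieval_loss + ewc_penalty(params, anchor_params, fisher, strength)


# Made-up numbers: only the first parameter has moved away from its anchor.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 10.0], strength=0.5)
total = regularized_loss(0.8, [1.0, 2.0], [0.0, 2.0], [4.0, 10.0], strength=0.5)
print(penalty, total)
```

Here `retrieval_loss` would come from whatever cross-modal similarity objective is used on the current task; the penalty only discourages drift in parameters that the importance estimates mark as relevant to earlier tasks.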

Circularity Check

0 steps flagged

Empirical proposal with no derivation chain or self-referential reductions

The paper introduces a new evaluation framework for continual multimodal retrieval and proposes the DAR method as an empirical solution, with all central claims resting on experimental comparisons rather than any mathematical derivation. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the approach is presented as a practical adapter-based routing technique whose performance is measured against baselines in the new setting. This is a standard empirical contribution whose validity can be checked externally via reproduction of the reported metrics, with no step reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, parameters, or background assumptions; ledger is empty due to lack of technical detail.

pith-pipeline@v0.9.1-grok · 5699 in / 933 out tokens · 29465 ms · 2026-06-28T23:00:58.992502+00:00 · methodology


[1] [1]

Nocaps: Novel object captioning at scale

Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. InProceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019

2019

[2] [2]

wikiart_recaption

AterMors. wikiart_recaption. https://huggingface.co/datasets/AterMors/wikiart_ recaption, 2024. Hugging Face dataset

2024

[3] [3]

Fs-coco: Towards understanding of freehand sketches of common objects in context

Pinaki Nath Chowdhury, Aneeshan Sain, Ayan Kumar Bhunia, Tao Xiang, Yulia Gryaditskaya, and Yi-Zhe Song. Fs-coco: Towards understanding of freehand sketches of common objects in context. InEuropean conference on computer vision, pages 253–270. Springer, 2022

2022

[4] [4]

Continual vision-language retrieval via dynamic knowledge rectification

Zhenyu Cui, Yuxin Peng, Xun Wang, Manyu Zhu, and Jiahuan Zhou. Continual vision-language retrieval via dynamic knowledge rectification. InProceedings of the AAAI Conference on Artificial Intelligence, pages 11704–11712, 2024

2024

[5] [5]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 248–255. IEEE Computer Society, 2009

2009

[6] [6]

Tic-clip: Continual training of clip models

Saurabh Garg, Mehrdad Farajtabar, Hadi Pouransari, Raviteja Vemulapalli, Sachin Mehta, Oncel Tuzel, Vaishaal Shankar, and Fartash Faghri. Tic-clip: Continual training of clip models. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

2024

[7] [7]

kream-product-blip-captions

hahminlew. kream-product-blip-captions. https://huggingface.co/datasets/ hahminlew/kream-product-blip-captions, 2023. Hugging Face dataset

2023

[8] [8]

Lora: Low-rank adaptation of large language models

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022

2022

[9] [9]

Class-incremental learning with clip: Adaptive representation adjustment and parameter fusion

Linlan Huang, Xusheng Cao, Haori Lu, and Xialei Liu. Class-incremental learning with clip: Adaptive representation adjustment and parameter fusion. InEuropean Conference on Computer Vision, pages 214–231. Springer, 2024

2024

[10] [10]

Editing models with task arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Ha- jishirzi, and Ali Farhadi. Editing models with task arithmetic. InThe Eleventh International Conference on Learning Representations, 2023

2023

[11] [11]

CLAP4CLIP: Continual learning with probabilistic finetuning for vision-language models

Saurav Jha, Dong Gong, and Lina Yao. CLAP4CLIP: Continual learning with probabilistic finetuning for vision-language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024

[12] [12]

Flintstonessv_plus_plus

Janak Kapuriya. Flintstonessv_plus_plus. https://huggingface.co/datasets/Janak12/ FlintstonesSV_Plus_Plus, 2025. Hugging Face dataset

2025

[13] [13]

Flintstonessv++ : Improving story narration using visual scene graph

Janak Kapuriya and Paul Buitelaar. Flintstonessv++ : Improving story narration using visual scene graph. InText2Story@ECIR, 2025. URL https://api.semanticscholar.org/ CorpusID:279053465

2025

[14] [14]

Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, An- drei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13): 35...

2017

[15] [15]

Learning multiple layers of features from tiny images

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, Toronto, Ontario, 2009. URL https://www.cs. toronto.edu/~kriz/learning-features-2009-TR.pdf. 10

2009

[16] [16]

Coleclip: Open-domain continual learning via joint task prompt and vocabulary learning.IEEE Transactions on Neural Networks and Learning Systems, 36(8):15137–15151, 2025

Yukun Li, Guansong Pang, Wei Suo, Chenchen Jing, Yuling Xi, Lingqiao Liu, Hao Chen, Guoqiang Liang, and Peng Wang. Coleclip: Open-domain continual learning via joint task prompt and vocabulary learning.IEEE Transactions on Neural Networks and Learning Systems, 36(8):15137–15151, 2025

2025

[17] [17]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

2014

[18] [18]

C-clip: Multimodal continual learning for vision-language model

Wenzhuo Liu, Fei Zhu, Longhui Wei, and Qi Tian. C-clip: Multimodal continual learning for vision-language model. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors,International Conference on Learning Representations, pages 46461–46477, 2025

2025

[19] [19]

Continual learning on CLIP via incremental prompt tuning with intrinsic textual anchors

Haodong Lu, Xinyu Zhang, Kristen Moore, Jason Xue, Lina Yao, Anton van den Hengel, and Dong Gong. Continual learning on CLIP via incremental prompt tuning with intrinsic textual anchors. Transactions on Machine Learning Research, 2025. ISSN 2835-8856

2025

[20] [20]

MAGMAX: leveraging model merging for seamless continual learning

Daniel Marczak, Bartlomiej Twardowski, Tomasz Trzcinski, and Sebastian Cygert. MAGMAX: leveraging model merging for seamless continual learning. In Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedin...

2024

[21] [21]

No task left behind: Isotropic model merging with common and task-specific subspaces

Daniel Marczak, Simone Magistri, Sebastian Cygert, Bartłomiej Twardowski, Andrew D Bagdanov, and Joost van de Weijer. No task left behind: Isotropic model merging with common and task-specific subspaces. In Forty-second International Conference on Machine Learning, 2025

2025

[22] [22]

On class orderings for incremental learning

Marc Masana, Bartlomiej Twardowski, and Joost van de Weijer. On class orderings for incremental learning. CoRR, abs/2007.02145, 2020

arXiv 2007

[23] [23]

Semantic residual prompts for continual learning

Martin Menabue, Emanuele Frascaroli, Matteo Boschini, Enver Sangineto, Lorenzo Bonicelli, Angelo Porrello, and Simone Calderara. Semantic residual prompts for continual learning. In European Conference on Computer Vision, 2024

2024

[24] [24]

Continual vision-language representation learning with off-diagonal information

Zixuan Ni, Longhui Wei, Siliang Tang, Yueting Zhuang, and Qi Tian. Continual vision-language representation learning with off-diagonal information. In Proceedings of the 40th International Conference on Machine Learning, ICML’23, 2023

2023

[25] [25]

Accurate and efficient low-rank model merging in core space

Aniello Panariello, Daniel Marczak, Simone Magistri, Angelo Porrello, Bartłomiej Twardowski, Andrew D. Bagdanov, Simone Calderara, and Joost van de Weijer. Accurate and efficient low-rank model merging in core space. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[26] [26]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012

2012

[27] [27]

AdapterHub: A framework for adapting transformers

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. AdapterHub: A framework for adapting transformers. In Qun Liu and David Schlangen, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 46–54, Online, Octob...

doi:10.18653/v1/2020.emnlp-demos 2020

[28] [28]

URL https://aclanthology.org/2020.emnlp-demos.7/

2020

[29] [29]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings...

2021

[30] [30]

Rocov2: Radiology objects in context version 2, an updated multimodal image dataset

Johannes Rückert, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Cynthia S. Schmidt, Sven Koitka, Obioma Pelka, Asma Ben Abacha, Alba G. Seco de Herrera, Henning Müller, Peter A. Horn, Felix Nensa, and Christoph M. Friedrich. Rocov2: Radiology objects in context version 2, an updated multimodal image dataset. Scientific Data, 11(1...

2024

[31] [31]

Construct-vl: Data-free continual structured vl concepts learning

James Seale Smith, Paola Cascante-Bonilla, Assaf Arbelle, Donghyun Kim, Rameswar Panda, David Cox, Diyi Yang, Zsolt Kira, Rogerio Feris, and Leonid Karlinsky. Construct-vl: Data-free continual structured vl concepts learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14994–15004, 2023

2023

[32] [32]

A practitioner’s guide to real-world continual multimodal pretraining

Vishaal Udandarao, Karsten Roth, Sebastian Dziadzio, Ameya Prabhu, Mehdi Cherti, Oriol Vinyals, Olivier Hénaff, Samuel Albanie, Zeynep Akata, and Matthias Bethge. A practitioner’s guide to real-world continual multimodal pretraining. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Informat...

2024

[33] [33]

Continual learning in cross-modal retrieval

Kai Wang, Luis Herranz, and Joost van de Weijer. Continual learning in cross-modal retrieval. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE Computer Society, 2021

2021

[34] [34]

Learning to prompt for continual learning

Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer G. Dy, and Tomas Pfister. Learning to prompt for continual learning. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 139–149, 2021

2022

[35] [35]

Ties-merging: Resolving interference when merging models

Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. Advances in neural information processing systems, 36:7093–7115, 2023

2023

[36] [36]

From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 2014

2014

[37] [37]

Boosting continual learning of vision-language models via mixture-of-experts adapters

Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting continual learning of vision-language models via mixture-of-experts adapters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23219–23230, June 2024

2024

[38] [38]

lexica-stable-diffusion-v1-5

yuwan0. lexica-stable-diffusion-v1-5. https://huggingface.co/datasets/yuwan0/lexica-stable-diffusion-v1-5, 2024. Hugging Face dataset

2024

[39] [39]

Preventing zero-shot transfer degradation in continual learning of vision-language models

Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19125–19136, October 2023

2023

[40] [40]

sketch-scene

zoheb. sketch-scene. https://huggingface.co/datasets/zoheb/sketch-scene, 2025. Hugging Face dataset.

2025

Appendix A Implementation details

Global training protocol. Unless otherwise stated, all methods are trained with the same optimization, data, and evaluation protocol. We use CLIP ViT-B/16 initialized from the pretrained checkpoint, train for 20 epochs...

[41] [41]

Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country ...