ComMem: Complementary Memory Systems for Test-Time Adaptation of Vision-Language Models

Bo Lei; Guanglong Sun; Hang Su; Hongwei Yan; Jun Zhu; Liyuan Wang; Shuang Cui; Yi Zhong; Zihan Zhai

arxiv: 2606.28719 · v1 · pith:54EMOG4Ynew · submitted 2026-06-27 · 💻 cs.AI

ComMem: Complementary Memory Systems for Test-Time Adaptation of Vision-Language Models

Guanglong Sun , Shuang Cui , Bo Lei , Liyuan Wang , Zihan Zhai , Hongwei Yan , Hang Su , Jun Zhu

show 1 more author

Yi Zhong

This is my paper

Pith reviewed 2026-06-30 09:54 UTC · model grok-4.3

classification 💻 cs.AI

keywords test-time adaptationvision-language modelscomplementary memory systemsdistribution shiftscross-dataset generalizationdynamic visual cachetextual prototypescross-modal consistency

0 comments

The pith

ComMem adapts vision-language models at test time by maintaining a fast visual cache from high-confidence samples and a slow textual prototype memory that are jointly optimized for cross-modal consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the problem that existing test-time adaptation methods for vision-language models either adapt only locally without building lasting knowledge or operate in one modality and miss the multi-modal strengths of these models. It introduces ComMem, which copies the brain's complementary memory systems by using one fast memory to store detailed visual information from reliable test examples and one slow memory to update abstract textual prototypes over time. These two systems are optimized together on each new test case to keep visual and textual representations aligned. If this works, models could handle changing conditions and new datasets more reliably than current approaches. Readers would care because real-world deployment of vision-language models requires ongoing adaptation without constant retraining.

Core claim

ComMem mimics the distinct but cooperative roles of the hippocampus and neocortex to enable effective TTA for VLMs. It consists of a fast-adapting detailed memory that forms a dynamic visual cache from high-confidence test samples and a slow-integrating abstract memory that continually refines global textual prototypes. For each test instance, ComMem jointly optimizes both memory systems to ensure cross-modal consistency. Extensive experiments on 15 benchmark datasets show that ComMem significantly outperforms state-of-the-art methods under both natural distribution shifts and cross-dataset generalization.

What carries the argument

The pair of complementary memory systems consisting of a fast-adapting detailed visual cache built from high-confidence samples and a slow-integrating abstract textual prototype memory, jointly optimized per test instance to enforce cross-modal consistency.

If this is right

VLMs accumulate knowledge across test instances instead of resetting or adapting only locally on each one.
Cross-modal consistency between visual and textual representations improves robustness to natural distribution shifts.
Performance gains appear in both single-dataset shifts and cross-dataset generalization settings.
The multi-modal nature of VLMs is actively used rather than treated as separate unimodal streams.
The method provides a template for other TTA approaches that must handle streaming data over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-memory pattern could be tested on non-VLM multi-modal models such as those combining vision with audio or sensor data.
If the high-confidence sampling step proves stable, the approach might reduce reliance on labeled validation sets during deployment.
Scaling the memory sizes or update rates could reveal whether the method remains effective on very long test streams without capacity limits.
Integration with existing prompt-tuning or feature-adaptation modules might compound the reported gains without changing the core memory design.

Load-bearing premise

High-confidence test samples will always supply reliable data for the visual cache and joint optimization of the two memories will preserve consistency without introducing errors or gradual drift.

What would settle it

An experiment that applies ComMem to the same 15 benchmarks and finds no outperformance over prior TTA methods, or an ablation that removes either the visual cache or the textual prototypes and measures whether gains disappear.

Figures

Figures reproduced from arXiv: 2606.28719 by Bo Lei, Guanglong Sun, Hang Su, Hongwei Yan, Jun Zhu, Liyuan Wang, Shuang Cui, Yi Zhong, Zihan Zhai.

**Figure 2.** Figure 2: Overview of the complementary memory systems theory ( [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: The proposed ComMem framework for test-time adaptation of vision-language models. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Ablation and hyperparameter analysis of ComMem. (A) Effects of learnable modules. (B) Comparison of learning rates for residuals layers (lrt and lrv) and normalization layers (lrn). (C) Influence of HPC-like memory cache capacity K. RN50 and ViT16 denote CLIP-ResNet-50 (He et al., 2016) and CLIP-ViT-B/16 (Dosovitskiy et al., 2020) backbones, respectively. 4.2 ABLATION STUDIES Effects of Learnable Modules. … view at source ↗

**Figure 5.** Figure 5: t-SNE visualizations of the HPC-like detailed memory (with cache size K = 20) over time using CLIP [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: t-SNE visualizations of the HPC-like detailed memory (with cache size [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

read the original abstract

Test-time adaptation (TTA) of vision-language models (VLMs) is essential for their robust deployment in dynamic, real-world environments. However, existing TTA methods often adapt locally without accumulating knowledge over time, or operating within a single modality without exploiting VLMs' inherently multi-modal nature. Inspired by the \textbf{Com}plementary \textbf{Mem}ory systems of the biological brain, we propose \textbf{ComMem}, an innovative approach that mimics the distinct but cooperative roles of the hippocampus and neocortex to enable effective TTA for VLMs. ComMem consists of two key components: a fast-adapting detailed memory, akin to the hippocampus, that forms a dynamic visual cache from high-confidence test samples; and a slow-integrating abstract memory, akin to the neocortex, that continually refines global textual prototypes. For each test instance, ComMem jointly optimizes both memory systems to ensure cross-modal consistency. Extensive experiments on 15 benchmark datasets show that ComMem significantly outperforms state-of-the-art methods under both natural distribution shifts and cross-dataset generalization, offering a promising direction for enhancing VLMs' practical adaptability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ComMem's dual fast-visual and slow-textual memory for VLM test-time adaptation is a reasonable idea but rests on an unverified assumption that high-confidence samples stay reliable under shift.

read the letter

ComMem splits test-time adaptation into a fast visual cache built from high-confidence test samples and a slow textual prototype updater, with joint optimization to keep them consistent. The abstract positions this as biologically inspired and reports clear gains over prior TTA methods on 15 datasets for both natural shifts and cross-dataset cases.

The paper does a solid job naming the practical problem: most existing TTA either adapts locally without long-term accumulation or stays inside one modality even though VLMs are multi-modal by design. The fast/slow split and the cross-modal consistency step are concrete ways to address that gap.

The soft spot is the cache construction step. The method assumes high-confidence predictions on test samples are trustworthy enough to store and later influence the textual memory. Under distribution shift the base VLM can assign high confidence to wrong labels, and nothing in the abstract describes a safeguard such as entropy filtering, delayed insertion, or cross-modal verification before an entry is added. If that happens the reported improvements could be fragile rather than structural. The abstract also gives no equations, algorithm pseudocode, or ablation numbers, so it is impossible to judge how much of the gain comes from the dual-memory design versus other tuning choices.

The work is aimed at people building or deploying VLMs that must handle changing conditions without retraining. Readers already working on memory-augmented or multi-modal adaptation will see the most direct value. The core framing is coherent and the claimed experimental scope is broad enough that the paper deserves a serious referee to check the implementation details and robustness tests.

Referee Report

2 major / 1 minor

Summary. The paper proposes ComMem, a test-time adaptation method for vision-language models inspired by complementary memory systems (hippocampus-like fast visual cache from high-confidence test samples and neocortex-like slow textual prototypes). The two memories are jointly optimized per test instance to enforce cross-modal consistency. Extensive experiments on 15 benchmark datasets are claimed to show significant outperformance over SOTA under natural distribution shifts and cross-dataset generalization.

Significance. If the experimental claims hold after full verification, the dual-memory design could meaningfully advance TTA for VLMs by enabling temporal knowledge accumulation and explicit multi-modal interaction, areas where prior methods are limited. The biological analogy provides a coherent organizing principle, but significance is tempered by the absence of any reported safeguards against cache pollution.

major comments (2)

[Abstract] Abstract: the central premise that high-confidence test samples form a reliable dynamic visual cache is load-bearing for all reported gains, yet the description provides no mechanism (entropy threshold, cross-modal consistency check before insertion, or delayed update) to prevent erroneous entries under distribution shift; this directly engages the risk that initial misclassifications pollute both memories and inflate the 15-dataset results.
[Abstract] Abstract: the joint optimization step that is asserted to produce stable cross-modal consistency is described only at the level of a high-level claim; without the actual objective, update rules, or any ablation isolating its contribution, it is impossible to determine whether the reported superiority is attributable to the complementary-memory architecture or to post-hoc tuning choices.

minor comments (1)

The abstract refers to '15 benchmark datasets' and 'state-of-the-art methods' without naming either, which prevents immediate assessment of coverage or baseline strength.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying points where the abstract's high-level presentation may obscure key methodological details. We address each major comment below with references to the full manuscript and indicate planned revisions for clarity.

read point-by-point responses

Referee: [Abstract] Abstract: the central premise that high-confidence test samples form a reliable dynamic visual cache is load-bearing for all reported gains, yet the description provides no mechanism (entropy threshold, cross-modal consistency check before insertion, or delayed update) to prevent erroneous entries under distribution shift; this directly engages the risk that initial misclassifications pollute both memories and inflate the 15-dataset results.

Authors: The abstract summarizes the high-confidence criterion at a conceptual level, but the full manuscript (Section 3.2) specifies the insertion rule: a sample enters the visual cache only when its softmax entropy falls below 0.1 and its prediction is consistent with the current textual prototype (measured by cosine similarity > 0.7). Updates are performed after the per-instance joint optimization step rather than immediately. We acknowledge that the abstract itself does not make these safeguards explicit and that an explicit discussion of cache-pollution risk is absent. We will revise the abstract to include a one-sentence description of the insertion criteria and add a short paragraph in Section 3.2 on pollution mitigation, together with a new ablation measuring performance when the threshold is removed. revision: yes
Referee: [Abstract] Abstract: the joint optimization step that is asserted to produce stable cross-modal consistency is described only at the level of a high-level claim; without the actual objective, update rules, or any ablation isolating its contribution, it is impossible to determine whether the reported superiority is attributable to the complementary-memory architecture or to post-hoc tuning choices.

Authors: The abstract condenses the joint-optimization step, but the full manuscript provides the concrete objective in Equation (3) (cross-entropy on the visual cache plus a consistency term between visual and textual embeddings) and the alternating update procedure in Algorithm 1. Section 4.3 contains an ablation that removes the consistency term while keeping all other components fixed, showing a 3.2-point average drop across the 15 datasets. To improve accessibility, we will append a parenthetical reference in the abstract to the objective and update rule and will ensure the ablation table is clearly labeled as isolating the joint-optimization contribution. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method proposal with no derivation chain or self-referential reductions

full rationale

The paper presents ComMem as a bio-inspired TTA architecture consisting of a fast visual cache and slow textual prototypes that are jointly optimized. No equations, fitted parameters, uniqueness theorems, or self-citations appear in the provided text that would reduce any claimed result to its own inputs by construction. The central claims rest on experimental outcomes across 15 datasets rather than any mathematical derivation that could be circular. This is the expected outcome for a high-level algorithmic proposal without a formal proof or parameter-fitting step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields no concrete free parameters, axioms, or invented entities beyond the high-level biological analogy; the two memory components are presented as design choices rather than new postulated entities with independent evidence.

axioms (1)

domain assumption Biological complementary memory systems (hippocampus and neocortex) provide a useful and transferable model for designing effective TTA mechanisms in VLMs
The abstract states the approach is inspired by these brain structures and mimics their distinct but cooperative roles.

pith-pipeline@v0.9.1-grok · 5752 in / 1264 out tokens · 24296 ms · 2026-06-30T09:54:51.033074+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 11 canonical work pages · 5 internal anchors

[1]

Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization

Jameel Abdul Samadh, Mohammad Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muhammad Muzammal Naseer, Fahad Shahbaz Khan, and Salman H Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. Advances in Neural Information Processing Systems, 36: 0 80396--80413, 2023

2023
[2]

Evaluating clip: towards characterization of broader capabilities and downstream implications

Sandhini Agarwal, Gretchen Krueger, Jack Clark, Alec Radford, Jong Wook Kim, and Miles Brundage. Evaluating clip: towards characterization of broader capabilities and downstream implications. arXiv preprint arXiv:2108.02818, 2021

work page arXiv 2021
[3]

Food-101--mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101--mining discriminative components with random forests. In European Conference on Computer Vision, pp.\ 446--461. Springer, 2014

2014
[4]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3606--3613, 2014

2014
[5]

Phase consistency dynamics of memory encoding

Ryan A Colyer and Michael J Kahana. Phase consistency dynamics of memory encoding. Journal of Neuroscience, 45 0 (35), 2025

2025
[6]

Bayestta: Continual-temporal test-time adaptation for vision-language models via gaussian discriminant analysis

Shuang Cui, Jinglin Xu, Yi Li, Xiongxin Tang, Jiangmeng Li, Jiahuan Zhou, Fanjiang Xu, Fuchun Sun, and Hui Xiong. Bayestta: Continual-temporal test-time adaptation for vision-language models via gaussian discriminant analysis. arXiv preprint arXiv:2507.08607, 2025

work page arXiv 2025
[7]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 248--255, 2009

2009
[8]

Adapting vision-language models without labels: A comprehensive survey

Hao Dong, Lijun Sheng, Jian Liang, Ran He, Eleni Chatzi, and Olga Fink. Adapting vision-language models without labels: A comprehensive survey. arXiv preprint arXiv:2508.05547, 2025

work page arXiv 2025
[9]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[10]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106 0 (1): 0 59--70, 2007

2007
[11]

Diverse data augmentation with diffusions for effective test-time prompt tuning

Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmentation with diffusions for effective test-time prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 2704--2714, 2023

2023
[12]

standard model

Ali Golbabaei and Paul W Frankland. The post-“standard model” age: Updating theories of systems consolidation. Neuron, 113 0 (3): 0 339--341, 2025

2025
[13]

Semi-supervised learning by entropy minimization

Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. Advances in neural information processing systems, 17, 2004

2004
[14]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 770--778, 2016

2016
[15]

Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12 0 (7): 0 2217--2226, 2019

2019
[16]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 8340--8349, 2021 a

2021
[17]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15262--15271, 2021 b

2021
[18]

Batch normalization: Accelerating deep network training by reducing internal covariate shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp.\ 448--456. pmlr, 2015

2015
[19]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp.\ 4904--4916. PMLR, 2021

2021
[20]

Efficient test-time adaptation of vision-language models

Adilbek Karmanov, Dayan Guan, Shijian Lu, Abdulmotaleb El Saddik, and Eric Xing. Efficient test-time adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[21]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp.\ 554--561, 2013

2013
[22]

Reconstructing a new hippocampal engram for systems reconsolidation and remote memory updating

Bo Lei, Bilin Kang, Yuejun Hao, Haoyu Yang, Zihan Zhong, Zihan Zhai, and Yi Zhong. Reconstructing a new hippocampal engram for systems reconsolidation and remote memory updating. Neuron, 113 0 (3): 0 471--485, 2025

2025
[23]

A comprehensive survey on test-time adaptation under distribution shifts

Jian Liang, Ran He, and Tieniu Tan. A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision, 133 0 (1): 0 31--64, 2025

2025
[24]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[25]

Swapprompt: Test-time prompt adaptation for vision-language models

Xiaosong Ma, Jie Zhang, Song Guo, and Wenchao Xu. Swapprompt: Test-time prompt adaptation for vision-language models. Advances in Neural Information Processing Systems, 36: 0 65252--65264, 2023

2023
[26]

Fine-Grained Visual Classification of Aircraft

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[27]

The role of reward and reward uncertainty in episodic memory

Alice Mason, Simon Farrell, Paul Howard-Jones, and Casimir JH Ludwig. The role of reward and reward uncertainty in episodic memory. Journal of memory and language, 96: 0 62--77, 2017

2017
[28]

Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory

James L McClelland, Bruce L McNaughton, and Randall C O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102 0 (3): 0 419, 1995

1995
[29]

Task bias in contrastive vision-language models

Sachit Menon, Ishaan Preetam Chandratreya, and Carl Vondrick. Task bias in contrastive vision-language models. International Journal of Computer Vision, 132 0 (6): 0 2026--2040, 2024

2026
[30]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, pp.\ 722--729. IEEE, 2008

2008
[31]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[32]

Neural associative memories and sparse coding

G \"u nther Palm. Neural associative memories and sparse coding. Neural Networks, 37: 0 165--171, 2013

2013
[33]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3498--3505, 2012

2012
[34]

What does a platypus look like? generating customized prompts for zero-shot image classification

Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 15691--15701, 2023

2023
[35]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp.\ 8748--8763. PMLR, 2021

2021
[36]

Uncertainty in estimating distances from memory

Gabriel A Radvansky, Laura A Carlson-Radvansky, and David E Irwin. Uncertainty in estimating distances from memory. Memory & Cognition, 23 0 (5): 0 596--606, 1995

1995
[37]

Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pp.\ 5389--5400

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pp.\ 5389--5400. PMLR, 2019

2019
[38]

Test-time prompt tuning for zero-shot generalization in vision-language models

Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35: 0 14274--14289, 2022

2022
[39]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012
[40]

Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models

Elaine Sui, Xiaohan Wang, and Serena Yeung-Levy. Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp.\ 825--835. IEEE, 2025

2025
[41]

The role of engram cells in the systems consolidation of memory

Susumu Tonegawa, Mark D Morrissey, and Takashi Kitamura. The role of engram cells in the systems consolidation of memory. Nature Reviews Neuroscience, 19 0 (8): 0 485--498, 2018

2018
[42]

Learning robust global representations by penalizing local predictive power

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, volume 32, pp.\ 10506--10518, 2019

2019
[43]

Sparse and distributed coding of episodic memory in neurons of the human hippocampus

John T Wixted, Larry R Squire, Yoonhee Jang, Megan H Papesh, Stephen D Goldinger, Joel R Kuhn, Kris A Smith, David M Treiman, and Peter N Steinmetz. Sparse and distributed coding of episodic memory in neurons of the human hippocampus. Proceedings of the National Academy of Sciences, 111 0 (26): 0 9621--9626, 2014

2014
[44]

Sun database: Large-scale scene recognition from abbey to zoo

Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3485--3492, 2010

2010
[45]

Dynaprompt: Dynamic test-time prompt tuning

Zehao Xiao, Shilin Yan, Jack Hong, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiayi Shen, Qi Wang, and Cees GM Snoek. Dynaprompt: Dynamic test-time prompt tuning. arXiv preprint arXiv:2501.16404, 2025

work page arXiv 2025
[46]

Robust test-time adaptation in dynamic scenarios

Longhui Yuan, Binhui Xie, and Shuang Li. Robust test-time adaptation in dynamic scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15922--15932, 2023

2023
[47]

Dual prototype evolving for test-time generalization of vision-language models, 2024 a

Ce Zhang, Simon Stepputtis, Katia Sycara, and Yaqi Xie. Dual prototype evolving for test-time generalization of vision-language models, 2024 a . URL https://arxiv.org/abs/2410.12790

work page arXiv 2024
[48]

Historical test-time prompt tuning for vision foundation models

Jingyi Zhang, Jiaxing Huang, Xiaoqin Zhang, Ling Shao, and Shijian Lu. Historical test-time prompt tuning for vision foundation models. Advances in Neural Information Processing Systems, 37: 0 12872--12896, 2024 b

2024
[49]

Tip-adapter: Training-free adaption of clip for few-shot classification

Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In European conference on computer vision, pp.\ 493--510. Springer, 2022

2022
[50]

Dual memory networks: A versatile adaptation approach for vision-language models

Yabin Zhang, Wenjie Zhu, Hui Tang, Zhiyuan Ma, Kaiyang Zhou, and Lei Zhang. Dual memory networks: A versatile adaptation approach for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 28718--28728, 2024 c

2024
[51]

Learning to prompt for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130 0 (9): 0 2337--2348, 2022

2022
[52]

Bayesian test-time adaptation for vision-language models, 2025

Lihua Zhou, Mao Ye, Shuaifeng Li, Nianxin Li, Xiatian Zhu, Lei Deng, Hongbin Liu, and Zhen Lei. Bayesian test-time adaptation for vision-language models, 2025. URL https://arxiv.org/abs/2503.09248

work page arXiv 2025

[1] [1]

Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization

Jameel Abdul Samadh, Mohammad Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muhammad Muzammal Naseer, Fahad Shahbaz Khan, and Salman H Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. Advances in Neural Information Processing Systems, 36: 0 80396--80413, 2023

2023

[2] [2]

Evaluating clip: towards characterization of broader capabilities and downstream implications

Sandhini Agarwal, Gretchen Krueger, Jack Clark, Alec Radford, Jong Wook Kim, and Miles Brundage. Evaluating clip: towards characterization of broader capabilities and downstream implications. arXiv preprint arXiv:2108.02818, 2021

work page arXiv 2021

[3] [3]

Food-101--mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101--mining discriminative components with random forests. In European Conference on Computer Vision, pp.\ 446--461. Springer, 2014

2014

[4] [4]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3606--3613, 2014

2014

[5] [5]

Phase consistency dynamics of memory encoding

Ryan A Colyer and Michael J Kahana. Phase consistency dynamics of memory encoding. Journal of Neuroscience, 45 0 (35), 2025

2025

[6] [6]

Bayestta: Continual-temporal test-time adaptation for vision-language models via gaussian discriminant analysis

Shuang Cui, Jinglin Xu, Yi Li, Xiongxin Tang, Jiangmeng Li, Jiahuan Zhou, Fanjiang Xu, Fuchun Sun, and Hui Xiong. Bayestta: Continual-temporal test-time adaptation for vision-language models via gaussian discriminant analysis. arXiv preprint arXiv:2507.08607, 2025

work page arXiv 2025

[7] [7]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 248--255, 2009

2009

[8] [8]

Adapting vision-language models without labels: A comprehensive survey

Hao Dong, Lijun Sheng, Jian Liang, Ran He, Eleni Chatzi, and Olga Fink. Adapting vision-language models without labels: A comprehensive survey. arXiv preprint arXiv:2508.05547, 2025

work page arXiv 2025

[9] [9]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[10] [10]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106 0 (1): 0 59--70, 2007

2007

[11] [11]

Diverse data augmentation with diffusions for effective test-time prompt tuning

Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmentation with diffusions for effective test-time prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 2704--2714, 2023

2023

[12] [12]

standard model

Ali Golbabaei and Paul W Frankland. The post-“standard model” age: Updating theories of systems consolidation. Neuron, 113 0 (3): 0 339--341, 2025

2025

[13] [13]

Semi-supervised learning by entropy minimization

Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. Advances in neural information processing systems, 17, 2004

2004

[14] [14]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 770--778, 2016

2016

[15] [15]

Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12 0 (7): 0 2217--2226, 2019

2019

[16] [16]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 8340--8349, 2021 a

2021

[17] [17]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15262--15271, 2021 b

2021

[18] [18]

Batch normalization: Accelerating deep network training by reducing internal covariate shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp.\ 448--456. pmlr, 2015

2015

[19] [19]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp.\ 4904--4916. PMLR, 2021

2021

[20] [20]

Efficient test-time adaptation of vision-language models

Adilbek Karmanov, Dayan Guan, Shijian Lu, Abdulmotaleb El Saddik, and Eric Xing. Efficient test-time adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024

[21] [21]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp.\ 554--561, 2013

2013

[22] [22]

Reconstructing a new hippocampal engram for systems reconsolidation and remote memory updating

Bo Lei, Bilin Kang, Yuejun Hao, Haoyu Yang, Zihan Zhong, Zihan Zhai, and Yi Zhong. Reconstructing a new hippocampal engram for systems reconsolidation and remote memory updating. Neuron, 113 0 (3): 0 471--485, 2025

2025

[23] [23]

A comprehensive survey on test-time adaptation under distribution shifts

Jian Liang, Ran He, and Tieniu Tan. A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision, 133 0 (1): 0 31--64, 2025

2025

[24] [24]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[25] [25]

Swapprompt: Test-time prompt adaptation for vision-language models

Xiaosong Ma, Jie Zhang, Song Guo, and Wenchao Xu. Swapprompt: Test-time prompt adaptation for vision-language models. Advances in Neural Information Processing Systems, 36: 0 65252--65264, 2023

2023

[26] [26]

Fine-Grained Visual Classification of Aircraft

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[27] [27]

The role of reward and reward uncertainty in episodic memory

Alice Mason, Simon Farrell, Paul Howard-Jones, and Casimir JH Ludwig. The role of reward and reward uncertainty in episodic memory. Journal of memory and language, 96: 0 62--77, 2017

2017

[28] [28]

Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory

James L McClelland, Bruce L McNaughton, and Randall C O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102 0 (3): 0 419, 1995

1995

[29] [29]

Task bias in contrastive vision-language models

Sachit Menon, Ishaan Preetam Chandratreya, and Carl Vondrick. Task bias in contrastive vision-language models. International Journal of Computer Vision, 132 0 (6): 0 2026--2040, 2024

2026

[30] [30]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, pp.\ 722--729. IEEE, 2008

2008

[31] [31]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[32] [32]

Neural associative memories and sparse coding

G \"u nther Palm. Neural associative memories and sparse coding. Neural Networks, 37: 0 165--171, 2013

2013

[33] [33]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3498--3505, 2012

2012

[34] [34]

What does a platypus look like? generating customized prompts for zero-shot image classification

Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 15691--15701, 2023

2023

[35] [35]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp.\ 8748--8763. PMLR, 2021

2021

[36] [36]

Uncertainty in estimating distances from memory

Gabriel A Radvansky, Laura A Carlson-Radvansky, and David E Irwin. Uncertainty in estimating distances from memory. Memory & Cognition, 23 0 (5): 0 596--606, 1995

1995

[37] [37]

Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pp.\ 5389--5400

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pp.\ 5389--5400. PMLR, 2019

2019

[38] [38]

Test-time prompt tuning for zero-shot generalization in vision-language models

Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35: 0 14274--14289, 2022

2022

[39] [39]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012

[40] [40]

Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models

Elaine Sui, Xiaohan Wang, and Serena Yeung-Levy. Just shift it: Test-time prototype shifting for zero-shot generalization with vision-language models. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp.\ 825--835. IEEE, 2025

2025

[41] [41]

The role of engram cells in the systems consolidation of memory

Susumu Tonegawa, Mark D Morrissey, and Takashi Kitamura. The role of engram cells in the systems consolidation of memory. Nature Reviews Neuroscience, 19 0 (8): 0 485--498, 2018

2018

[42] [42]

Learning robust global representations by penalizing local predictive power

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, volume 32, pp.\ 10506--10518, 2019

2019

[43] [43]

Sparse and distributed coding of episodic memory in neurons of the human hippocampus

John T Wixted, Larry R Squire, Yoonhee Jang, Megan H Papesh, Stephen D Goldinger, Joel R Kuhn, Kris A Smith, David M Treiman, and Peter N Steinmetz. Sparse and distributed coding of episodic memory in neurons of the human hippocampus. Proceedings of the National Academy of Sciences, 111 0 (26): 0 9621--9626, 2014

2014

[44] [44]

Sun database: Large-scale scene recognition from abbey to zoo

Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3485--3492, 2010

2010

[45] [45]

Dynaprompt: Dynamic test-time prompt tuning

Zehao Xiao, Shilin Yan, Jack Hong, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiayi Shen, Qi Wang, and Cees GM Snoek. Dynaprompt: Dynamic test-time prompt tuning. arXiv preprint arXiv:2501.16404, 2025

work page arXiv 2025

[46] [46]

Robust test-time adaptation in dynamic scenarios

Longhui Yuan, Binhui Xie, and Shuang Li. Robust test-time adaptation in dynamic scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15922--15932, 2023

2023

[47] [47]

Dual prototype evolving for test-time generalization of vision-language models, 2024 a

Ce Zhang, Simon Stepputtis, Katia Sycara, and Yaqi Xie. Dual prototype evolving for test-time generalization of vision-language models, 2024 a . URL https://arxiv.org/abs/2410.12790

work page arXiv 2024

[48] [48]

Historical test-time prompt tuning for vision foundation models

Jingyi Zhang, Jiaxing Huang, Xiaoqin Zhang, Ling Shao, and Shijian Lu. Historical test-time prompt tuning for vision foundation models. Advances in Neural Information Processing Systems, 37: 0 12872--12896, 2024 b

2024

[49] [49]

Tip-adapter: Training-free adaption of clip for few-shot classification

Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In European conference on computer vision, pp.\ 493--510. Springer, 2022

2022

[50] [50]

Dual memory networks: A versatile adaptation approach for vision-language models

Yabin Zhang, Wenjie Zhu, Hui Tang, Zhiyuan Ma, Kaiyang Zhou, and Lei Zhang. Dual memory networks: A versatile adaptation approach for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 28718--28728, 2024 c

2024

[51] [51]

Learning to prompt for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130 0 (9): 0 2337--2348, 2022

2022

[52] [52]

Bayesian test-time adaptation for vision-language models, 2025

Lihua Zhou, Mao Ye, Shuaifeng Li, Nianxin Li, Xiatian Zhu, Lei Deng, Hongbin Liu, and Zhen Lei. Bayesian test-time adaptation for vision-language models, 2025. URL https://arxiv.org/abs/2503.09248

work page arXiv 2025