Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

Tao Jin; Yongbo He; Zirun Guo

arxiv: 2603.00574 · v2 · pith:YMZEST75new · submitted 2026-02-28 · 💻 cs.CV · cs.AI

Yongbo He, Zirun Guo, Tao Jin

Pith reviewed 2026-05-15 17:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords multi-modal test-time adaptation · stability-plasticity · asymmetric adaptation · interdimensional redundancy · negative transfer · catastrophic forgetting · domain shift · adapter decoupling

The pith

Decoupling each modality adapter into stable and plastic parts, activated asymmetrically by feature redundancy, lets models adapt to new domains without negative transfer or forgetting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multi-modal test-time adaptation fails when unbiased modalities suffer negative transfer and biased ones undergo catastrophic forgetting. It diagnoses the biased modality by its higher interdimensional redundancy in the latent space, then applies an asymmetric strategy: plastic components update for the biased modality while stable components update with KL regularization for the unbiased one. A sympathetic reader cares because pretrained multi-modal models must handle evolving real-world distributions without full retraining. If correct, the method preserves general knowledge while gaining domain-specific flexibility.

Core claim

The central claim is that, in the unified latent space, the biased modality exhibits substantially higher interdimensional redundancy than the unbiased one, which allows it to be identified reliably. Identification is followed by an asymmetric adaptation strategy: each modality-specific adapter is split into stable and plastic components, the plastic part is activated and updated for the biased modality, and the stable part is updated under KL regularization for the unbiased modality.

What carries the argument

Decoupled stable and plastic components within each modality-specific adapter, selected asymmetrically according to the interdimensional redundancy metric.
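The routing rule can be sketched as a small decision function. This is an illustration only, not the authors' implementation: the names `AdapterRoute` and `route_modality` are hypothetical, and comparing a per-modality redundancy score against a threshold `delta` is an assumption (the page mentions a redundancy threshold δ but does not say how it is applied).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterRoute:
    """Which parts of one modality's adapter are active and trainable (hypothetical)."""
    use_plastic: bool     # plastic branch participates in the forward pass
    train_plastic: bool   # plastic branch receives updates
    train_stable: bool    # stable branch receives updates
    kl_regularized: bool  # stable updates are constrained by a KL term


def route_modality(redundancy: float, delta: float) -> AdapterRoute:
    """Map a redundancy score to an update rule; the thresholding is an assumption."""
    if redundancy > delta:
        # Diagnosed as biased: activate and update plastic, keep stable fixed.
        return AdapterRoute(use_plastic=True, train_plastic=True,
                            train_stable=False, kl_regularized=False)
    # Diagnosed as unbiased: bypass plastic, update stable under KL regularization.
    return AdapterRoute(use_plastic=False, train_plastic=False,
                        train_stable=True, kl_regularized=True)


# Made-up scores for illustration; they are not values from the paper.
routes = {name: route_modality(score, delta=0.3)
          for name, score in {"audio": 0.42, "video": 0.11}.items()}
```

Keeping the decision in one pure function makes the misrouting case easy to probe: flipping a score across `delta` flips every flag at once.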

If this is right

The model adapts flexibly to new domains while preserving generalizable knowledge.
Negative transfer is avoided in the unbiased modality through KL regularization on stable components.
Catastrophic forgetting is avoided in the biased modality by updating only its plastic components.
Overall accuracy exceeds prior state-of-the-art methods across diverse multi-modal benchmarks.
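A minimal sketch of what a KL-regularized objective for the unbiased modality could look like, in NumPy. Several choices here are assumptions rather than details from the page: that the base loss is prediction entropy, that the KL term compares a frozen source model's predictions with the adapted predictions, the direction of the KL, and the way the coefficients `lam_ent` and `lam_kl` (named after the λent and λkl mentioned in the Figure 5 caption) are combined.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def mean_entropy(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy per sample, averaged over the batch."""
    return float(-(probs * np.log(probs + eps)).sum(axis=-1).mean())


def mean_kl(p_ref: np.ndarray, p_new: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p_ref || p_new) per sample, averaged over the batch."""
    log_ratio = np.log(p_ref + eps) - np.log(p_new + eps)
    return float((p_ref * log_ratio).sum(axis=-1).mean())


def unbiased_objective(logits_adapted: np.ndarray, logits_source: np.ndarray,
                       lam_ent: float = 1.0, lam_kl: float = 1.0) -> float:
    """Entropy term plus a KL penalty that discourages drift from the source predictions.

    Assumed form only; the paper's actual loss may differ.
    """
    p_new = softmax(logits_adapted)
    p_ref = softmax(logits_source)
    return lam_ent * mean_entropy(p_new) + lam_kl * mean_kl(p_ref, p_new)
```

When the adapted logits equal the source logits the KL term vanishes, so the penalty only becomes active once the stable component starts to move predictions away from the pretrained behaviour.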

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The redundancy diagnostic might transfer to single-modal test-time adaptation if a comparable feature correlation measure can be defined.
The stable-plastic split could be tested on additional modality pairs such as audio-visual data.
KL regularization on stable components might combine with other regularization techniques to further strengthen preservation of pretraining knowledge.

Load-bearing premise

Higher interdimensional redundancy reliably identifies the biased modality, and the stable-plastic split plus KL regularization prevents negative transfer without creating new failure modes.

What would settle it

A test set in which the redundancy measure mislabels the biased modality and the full DASP procedure still produces either forgetting in one modality or negative transfer in the other.

Figures

Figures reproduced from arXiv: 2603.00574 by Tao Jin, Yongbo He, Zirun Guo.

**Figure 1.** Limitations in Multi-Modal TTA. We evaluate changes in source domain performance during continual adaptation, measured as $\Delta = \mathrm{Acc}_{\text{original}} - \mathrm{Acc}_{\text{adapted}}$, for state-of-the-art methods (READ and TSA). Results indicate ongoing degradation in both multi-modal and uni-modal contexts. Performance drops in the biased modality are referred to as catastrophic forgetting, while drops in the unbiased modality are consi…

**Figure 2.** Entropy and confidence statistics on VGGSound-C with corrupted audio modality. Since audio serves as the dominant modality in this dataset, it continues to display lower entropy and greater confidence, even in the presence of distribution shifts.

• We perform comprehensive experiments on Kinetics50-C and VGGSound-C, and DASP exhibits enhanced adaptivity and stability in comparison to existing methods.…

**Figure 3.** Redundancy statistics on Kinetics50-C and VGGSound-C. The corrupted modality demonstrates increased redundancy in feature embeddings. Furthermore, the results underscore a significant correlation between redundancy and accuracy.

…vulnerable to distribution shifts across different modalities. Existing TTA methods, primarily developed for uni-modal tasks, inadequately address these complex shifts. In this conte…

**Figure 4.** The overview of our proposed DASP features a diagnose-then-mitigate framework. It begins by diagnosing the biased modality…

**Figure 5.** Sensitivity Analysis of Hyper-parameters: Batch Size ($B$), Redundancy Threshold ($\delta$) and Loss Coefficients ($\lambda_{\text{ent}}$, $\lambda_{\text{kl}}$).

…without asymmetric adaptation, and (iv) with asymmetric adaptation configured in the opposite manner. Removing either adapter resulted in decreased adaptive performance, indicating that the stable adapter is essential for extracting domain-invariant features and improving discrimination, w…

**Figure 6.** Accuracy vs. Throughput and Memory Usage. Compared to baselines, our method demonstrates superior performance with higher efficiency (observing comparable or lower computational cost and higher inference speed) on Kinetics50-C.

…modality observed in other methods. Meanwhile, the plastic adapter provides necessary plasticity and domain-specific knowledge for effective target domain adaptation. Lastly, we a…

**Figure 7.** The illustration of uni-modal continual corruption and…

**Figure 8.** The results demonstrate a clear, positive correlation between increasing corruption severity and the redundancy score $R(Z)$. This confirms our theoretical hypothesis: as inputs deviate further from the source manifold (higher $\sigma_\alpha^2$), the representation degradation exacerbates, which is precisely captured by the escalating r…

**Figure 9.** Redundancy vs. Batch Size. We investigate the correlation between redundancy score and batch size on VGGSound-C with audio corruptions.

Abstract

Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
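The decoupled adapter described in the abstract could be sketched as follows. This is a toy NumPy illustration, not the paper's architecture: the class name `DecoupledAdapter`, the linear branches, the residual form, and the zero initialization of the plastic branch are all assumptions.

```python
import numpy as np


class DecoupledAdapter:
    """Residual adapter with a stable and a plastic linear branch (illustrative only)."""

    def __init__(self, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Stable branch: intended to hold generalizable knowledge.
        self.w_stable = rng.normal(scale=0.02, size=(dim, dim))
        # Plastic branch: zero-initialized (assumption) so activating it
        # does not change outputs before any update has been made.
        self.w_plastic = np.zeros((dim, dim))

    def forward(self, h: np.ndarray, use_plastic: bool) -> np.ndarray:
        """Apply the stable branch always and the plastic branch only when activated."""
        out = h + h @ self.w_stable
        if use_plastic:
            out = out + h @ self.w_plastic
        return out

    def trainable(self, biased: bool) -> list[str]:
        """Names of the parameters that would be updated for this modality."""
        # Biased modality: update plastic, keep stable fixed.
        # Unbiased modality: bypass plastic, update stable (under KL regularization).
        return ["w_plastic"] if biased else ["w_stable"]


adapter = DecoupledAdapter(dim=4)
features = np.ones((2, 4))
biased_out = adapter.forward(features, use_plastic=True)
unbiased_out = adapter.forward(features, use_plastic=False)
```

Separating "which branch runs" from "which branch is trainable" mirrors the abstract's wording, where the plastic component is either activated and updated or bypassed entirely.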

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DASP uses redundancy in the latent space to route asymmetric stable/plastic updates in multi-modal TTA, but the diagnostic step looks fragile.

The main takeaway is that this paper diagnoses the biased modality by its higher interdimensional redundancy in the unified latent space, then applies an asymmetric adapter split: for the biased modality the plastic component is activated and updated while the stable component stays fixed; for the unbiased modality the plastic component is bypassed and the stable component is updated under KL regularization. This is meant to cut negative transfer and forgetting during test-time adaptation without retraining the base model. The decoupled architecture is a clean way to enforce the split once the diagnosis is made, and it directly targets the two failure modes that standard TTA methods hit in multi-modal settings. The framing around the observed correlation discrepancy is straightforward and gives a concrete handle for the routing decision.

The soft spot is exactly the one the stress-test flags. Higher redundancy is treated as a reliable marker of bias, but nothing in the abstract or the described method shows why this pattern would be invariant to shift type rather than an artifact of how the shift distorts features. If the metric misroutes on some distributions, the asymmetric rule could hold the biased modality fixed when it needs plasticity, or apply plastic updates to the unbiased one, recreating the problems it aims to solve. Without cross-shift ablations or a bound on the diagnostic, the central claim stays conditional on that assumption holding.

This is for groups already running multi-modal models at test time and looking for lightweight adaptation tricks rather than full retraining. The mechanism is specific enough to test, so it deserves a serious referee who can check the experiments against the redundancy assumption and see whether the gains survive when the shift statistics change.

Referee Report

2 major / 1 minor

Summary. The paper proposes Decoupling Adaptation for Stability and Plasticity (DASP), a diagnose-then-mitigate framework for multi-modal test-time adaptation. It observes that the biased modality exhibits higher interdimensional redundancy (feature-dimension correlations) in the unified latent space, uses this to identify the biased modality, and applies an asymmetric strategy: modality-specific adapters are split into stable and plastic components, with plasticity activated only for the biased modality while the unbiased modality uses KL-regularized updates on the stable component to avoid negative transfer and forgetting.

Significance. If the empirical claims hold, the work could meaningfully advance multi-modal TTA by providing a practical way to decouple stability and plasticity based on latent-space diagnostics, addressing negative transfer and catastrophic forgetting without requiring parameter-free derivations or machine-checked proofs.

major comments (2)

[§3] §3 (method description): The central claim that higher interdimensional redundancy reliably diagnoses the biased modality is load-bearing for the entire asymmetric split, yet no theoretical bound, invariance proof, or cross-shift validation (e.g., across correlation-inducing vs. other shift types) is supplied; if the metric is an artifact of specific shift statistics, the diagnosis misroutes adapters and reintroduces the very negative transfer the method aims to prevent.
[Abstract] Abstract and §4 (experiments): While the abstract asserts 'comprehensive evaluations on diverse multi-modal benchmarks' that 'significantly outperform state-of-the-art methods,' the manuscript supplies no quantitative tables, ablation results on the redundancy metric, or failure-mode analysis for the KL-regularized stable path, leaving the effectiveness of the asymmetric design unverified.

minor comments (1)

[§3.1] Notation for 'interdimensional redundancy' is introduced without an explicit equation or pseudocode for its computation (e.g., correlation matrix norm or similar), which would aid reproducibility.
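One plausible way to make the quantity concrete is the mean absolute off-diagonal Pearson correlation of a batch of features. The sketch below is an assumption offered for illustration; the page does not give the paper's exact definition, and the function name `redundancy_score` is hypothetical.

```python
import numpy as np


def redundancy_score(z: np.ndarray, eps: float = 1e-8) -> float:
    """Mean absolute off-diagonal correlation of a (batch, dim) feature matrix.

    Assumed definition; the paper's score R(Z) may be computed differently.
    """
    z = z - z.mean(axis=0, keepdims=True)         # center each dimension
    z = z / (z.std(axis=0, keepdims=True) + eps)  # scale to unit variance
    corr = (z.T @ z) / z.shape[0]                 # (dim, dim) correlation matrix
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.abs(off_diagonal).mean())


# Synthetic check: features that share a common component are more redundant.
rng = np.random.default_rng(0)
clean = rng.normal(size=(512, 16))
shared = 2.0 * rng.normal(size=(512, 1))          # one latent factor per sample
shifted = clean + shared * np.ones(16)            # added to every dimension
clean_score = redundancy_score(clean)
shifted_score = redundancy_score(shifted)
```

On independent Gaussian features the score stays near the sampling floor, while the shared component pushes it well above that, which is the qualitative behaviour the diagnostic relies on.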

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: [§3] §3 (method description): The central claim that higher interdimensional redundancy reliably diagnoses the biased modality is load-bearing for the entire asymmetric split, yet no theoretical bound, invariance proof, or cross-shift validation (e.g., across correlation-inducing vs. other shift types) is supplied; if the metric is an artifact of specific shift statistics, the diagnosis misroutes adapters and reintroduces the very negative transfer the method aims to prevent.

Authors: We agree that the interdimensional redundancy metric is central to the diagnosis step and that stronger validation is warranted. The manuscript presents consistent empirical observations of elevated redundancy in biased modalities across the tested benchmarks. In the revised version we will add dedicated cross-shift experiments that compare correlation-inducing shifts against other shift types (e.g., additive noise, style transfer) to test the metric’s robustness. We will also include a sensitivity analysis and explicit discussion of potential failure cases. A formal theoretical bound or invariance proof is not currently available, as the diagnostic is derived from observed latent-space statistics rather than from a closed-form derivation. revision: partial
Referee: [Abstract] Abstract and §4 (experiments): While the abstract asserts 'comprehensive evaluations on diverse multi-modal benchmarks' that 'significantly outperform state-of-the-art methods,' the manuscript supplies no quantitative tables, ablation results on the redundancy metric, or failure-mode analysis for the KL-regularized stable path, leaving the effectiveness of the asymmetric design unverified.

Authors: We acknowledge that the current abstract is high-level and that the experimental section would benefit from additional quantitative detail. The full manuscript contains comparative results in §4, but we will revise the abstract to incorporate concrete performance deltas where space allows. We will also add (i) an ablation study isolating the redundancy metric and (ii) a failure-mode analysis of the KL-regularized stable path, either in the main text or as an expanded supplementary section. These additions will make the empirical support for the asymmetric design explicit. revision: yes
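The cross-shift check promised in the first response can be prototyped on synthetic features before running real benchmarks. The toy below is a hedged illustration, not an experiment from the paper: it perturbs latent features directly, uses mean absolute off-diagonal correlation as a stand-in for the redundancy score, and says nothing about how real encoders map input corruptions into the latent space.

```python
import numpy as np


def redundancy(z: np.ndarray) -> float:
    """Mean absolute off-diagonal Pearson correlation (stand-in for the paper's score)."""
    corr = np.corrcoef(z, rowvar=False)
    mask = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.abs(corr[mask]).mean())


rng = np.random.default_rng(0)
n, d = 2048, 32
clean = rng.normal(size=(n, d))

# Correlation-inducing shift: every sample moves along one shared direction v.
v = np.ones(d) / np.sqrt(d)
alpha = rng.normal(scale=6.0, size=(n, 1))
rank1 = clean + alpha * v

# Isotropic shift: independent noise per dimension, no shared direction.
isotropic = clean + rng.normal(scale=6.0, size=(n, d))

scores = {
    "clean": redundancy(clean),
    "rank1": redundancy(rank1),
    "isotropic": redundancy(isotropic),
}
```

In this toy setting only the rank-1 shift raises the score; the isotropic shift is just as large in magnitude but leaves redundancy near the clean level. That is the kind of case in which a redundancy-based diagnosis would miss a shifted modality, which is why the referee asks for validation across shift types.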

standing simulated objections not resolved

A theoretical bound or invariance proof establishing that interdimensional redundancy is a reliable, shift-type-invariant diagnostic for the biased modality.

Circularity Check

0 steps flagged

No significant circularity; architectural choice grounded in observed discrepancy

The paper describes DASP as a diagnose-then-mitigate framework whose core step is an empirical observation of higher interdimensional redundancy in the biased modality, followed by an asymmetric stable/plastic adapter split. No equations, fitted parameters, or predictions are presented that reduce the claimed performance to a definition or input by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the asymmetric mechanism is introduced as a novel design choice rather than derived from prior self-work. The derivation chain remains self-contained against external benchmarks and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the unproven domain assumption that interdimensional redundancy differences between modalities are stable and diagnostic enough to drive the asymmetric update rule without side effects.

axioms (1)

domain assumption: The biased modality exhibits substantially higher interdimensional redundancy compared to the unbiased modality.
This discrepancy is invoked to identify which modality needs plasticity versus stability.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[2] Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. In Advances in Neural Information Processing Systems, 2022.
[3] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In International Conference on Acoustics, Speech and Signal Processing, 2020.
[4] Mingcai Chen, Baoming Zhang, Zongbo Han, Yuntao Du, Wenyu Jiang, Yanmeng Wang, Shuai Feng, and Bingkun Bao. Test-time selective adaptation for uni-modal distribution shift in multi-modal data. In International Conference on Machine Learning, 2025.
[5] Qi Dou, Daniel Coelho de Castro, Konstantinos Kamnitsas, and Ben Glocker. Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems, 2019.
[6] Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James R. Glass. Contrastive audio-visual masked autoencoder. In International Conference on Learning Representations, 2023.
[7] Zirun Guo and Tao Jin. Smoothing the shift: Towards stable test-time adaptation under complex multimodal noises. In International Conference on Learning Representations, 2025.
[8] Zirun Guo, Tao Jin, Jingyuan Chen, and Zhou Zhao. Classifier-guided gradient modulation for enhanced multimodal learning. In Advances in Neural Information Processing Systems, 2024.
[9] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019.
[10] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2020.
[11] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017.
[12] Alexander Kolesnikov, Alexey Dosovitskiy, Dirk Weissenborn, Georg Heigold, Jakob Uszkoreit, Lucas Beyer, Matthias Minderer, Mostafa Dehghani, Neil Houlsby, Sylvain Gelly, Thomas Unterthiner, and Xiaohua Zhai. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[13] Daeun Lee, Jaehong Yoon, and Sung Ju Hwang. Becotta: Input-dependent online blending of experts for continual test-time adaptation. In International Conference on Machine Learning, 2024.
[14] Jiacheng Li and Songhe Feng. Bridging modalities via progressive re-alignment for multimodal test-time adaptation. In Annual AAAI Conference on Artificial Intelligence, 2026.
[15] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, 2022.
[16] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, 2020.
[17] Hyesu Lim, Byeonggeun Kim, Jaegul Choo, and Sungha Choi. Ttn: A domain-shift aware batch normalization in test-time adaptation. In International Conference on Learning Representations, 2023.
[18] Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, and Deva Ramanan. Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models. In Computer Vision and Pattern Recognition, 2023.
[19] Jiaming Liu, Senqiao Yang, Peidong Jia, Ming Lu, Yandong Guo, Wei Xue, and Shanghang Zhang. Vida: Homeostatic visual domain adapter for continual test time adaptation. In International Conference on Learning Representations, 2024.
[20] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Efficient test-time model adaptation without forgetting. In International Conference on Machine Learning, 2022.
[21] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world. In International Conference on Learning Representations, 2023.
[22] Shuaicheng Niu, Chunyan Miao, Guohao Chen, Pengcheng Wu, and Peilin Zhao. Test-time model adaptation with only forward passes. In International Conference on Machine Learning, 2024.
[23] Shuaicheng Niu, Guohao Chen, Deyu Chen, Yifan Zhang, Jiaxiang Wu, Zhiquan Wen, Yaofo Chen, Peilin Zhao, Chunyan Miao, and Mingkui Tan. Adapt in the wild: Test-time entropy minimization with sharpness and feature regularization. arXiv preprint arXiv:2509.04977, 2025.
[24] A Emin Orhan. Robustness properties of facebook’s resnext wsl models. arXiv preprint arXiv:1907.07640, 2019.
[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
[26] Shiv Shankar, Vihari Piratla, Soumen Chakrabarti, Siddhartha Chaudhuri, Preethi Jyothi, and Sunita Sarawagi. Generalizing across domains via cross-gradient training. In International Conference on Learning Representations,
[27] Junha Song, Jungsoo Lee, In So Kweon, and Sungha Choi. Ecotta: Memory-efficient continual test-time adaptation via self-distilled regularization. In Computer Vision and Pattern Recognition, 2023.
[28] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, 2021.
[29] Guowei Wang, Fan Lyu, and Changxing Ding. Partition-then-adapt: Combating prediction bias for reliable multi-modal test-time adaptation. In Advances in Neural Information Processing Systems, 2025.
[30] Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. Continual test-time domain adaptation. In Computer Vision and Pattern Recognition, 2022.
[31] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, and Furu Wei. Image as a foreign language: Beit pretraining for vision and vision-language tasks. In Computer Vision and Pattern Recognition, 2023.
[32] Mouxing Yang, Yunfan Li, Changqing Zhang, Peng Hu, and Xi Peng. Test-time adaption against multi-modal reliability bias. In International Conference on Learning Representations, 2024.
[33] Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, and Chelsea Finn. Improving out-of-distribution robustness via selective augmentation. In International Conference on Machine Learning, 2022.
[34] Longhui Yuan, Binhui Xie, and Shuang Li. Robust test-time adaptation in dynamic scenarios. In Computer Vision and Pattern Recognition, 2023.
[35] Marvin Zhang, Sergey Levine, and Chelsea Finn. Memo: Test time robustness via adaptation and augmentation. In Advances in Neural Information Processing Systems, 2022.
[36] Yufei Zhang, Yicheng Xu, Hongxin Wei, Zhiping Lin, Xiaofeng Zou, Cen Chen, and Huiping Zhuang. Analytic continual test-time adaptation for multi-modality corruption. In ACM International Conference on Multimedia, pages 1929–1937, 2025.
[37] Yusheng Zhao, Junyu Luo, Xiao Luo, Jinsheng Huang, Jingyang Yuan, Zhiping Xiao, and Ming Zhang. Attention bootstrapping for multi-modal test-time adaptation. In Annual AAAI Conference on Artificial Intelligence, 2025.
[38] More Experimental Details. 6.1. Benchmarks. We construct two benchmarks based on Kinetics [11] and VGGSound [3] to evaluate the performance of state-of-the-art methods under multi-modal domain shifts during test-time adaptation. We introduce three experimental setups: uni-modal episodic corruption, uni-modal continual corruption, and interleaved modali...
[39] Further Analysis of the Redundancy Score. Theoretical Analysis. The distribution shift is modeled as a low-rank perturbation in the latent space. For a perturbed sample $\tilde{z} \in \mathbb{R}^{D}$, we consider the dominant rank-1 component: $\tilde{z} = z + \alpha v$. To formalize our analysis, we establish the following assumptions: 1. The dimensions of $z$ are centered and uncorrelated, i.e., $\mathbb{E}[z] =$...
[40] Extended Comparative Experiments. Main experiments. We report additional results for the main experiments that were not included in the main text. Table 8. Episodic Adaptation. Comparison with SOTA methods on VGGSound-C with video corruptions (severity level 5) regarding Accuracy (%, ↑). Column groups: Noise, Blur, Weather, Digital. Columns: Method, Gauss., Shot, Impul., Defoc., Glass, Mot., ...
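A short derivation may help connect the rank-1 perturbation model quoted in entry [39] to the redundancy claim. It is a sketch under assumptions that the truncated excerpt does not confirm: every dimension of $z$ has the same variance $\sigma^2$, the coefficient $\alpha$ is zero-mean with variance $\sigma_\alpha^2$ and independent of $z$, and $v$ is a fixed direction.

```latex
\begin{align}
\operatorname{Cov}(\tilde{z})
  &= \operatorname{Cov}(z) + \sigma_\alpha^2\, v v^\top
   = \sigma^2 I + \sigma_\alpha^2\, v v^\top, \\
\rho_{ij}
  &= \frac{\sigma_\alpha^2\, v_i v_j}
          {\sqrt{(\sigma^2 + \sigma_\alpha^2 v_i^2)(\sigma^2 + \sigma_\alpha^2 v_j^2)}},
  \qquad i \neq j, \\
|\rho_{ij}|
  &= \frac{t\,|v_i v_j|}{\sqrt{(1 + t v_i^2)(1 + t v_j^2)}},
  \qquad t = \sigma_\alpha^2 / \sigma^2 .
\end{align}
```

Under these assumptions $|\rho_{ij}|$ is zero at $t = 0$, increases monotonically in $t$ whenever $v_i v_j \neq 0$, and approaches 1 as $t \to \infty$. A stronger shift along $v$ therefore raises the off-diagonal correlations that a redundancy score would aggregate. Whether this matches the paper's own analysis cannot be checked from the truncated excerpt.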

[1] [1]

Multimodal machine learning: A survey and tax- onomy.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019

Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and tax- onomy.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. 1

work page 2019

[2] [2]

Vlmo: Unified vision-language pre-training with mixture-of-modality-experts

Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. InAdvances in Neural Information Processing Systems, 2022. 1

work page 2022

[3] [3]

Vggsound: A large-scale audio-visual dataset

Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zis- serman. Vggsound: A large-scale audio-visual dataset. In International Conference on Acoustics, Speech and Signal Processing, 2020. 5, 1

work page 2020

[4] [4]

Test-time selective adaptation for uni-modal distribu- tion shift in multi-modal data

Mingcai Chen, Baoming Zhang, Zongbo Han, Yuntao Du, Wenyu Jiang, Yanmeng Wang, Shuai Feng, and Bingkun Bao. Test-time selective adaptation for uni-modal distribu- tion shift in multi-modal data. InInternational Conference on Machine Learning, 2025. 1, 3, 6

work page 2025

[5] [5]

Domain generalization via model-agnostic learning of semantic features

Qi Dou, Daniel Coelho de Castro, Konstantinos Kamnitsas, and Ben Glocker. Domain generalization via model-agnostic learning of semantic features. InAdvances in Neural Infor- mation Processing Systems, 2019. 2

work page 2019

[6] Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James R. Glass. Contrastive audio-visual masked autoencoder. In International Conference on Learning Representations, 2023.

[7] Zirun Guo and Tao Jin. Smoothing the shift: Towards stable test-time adaptation under complex multimodal noises. In International Conference on Learning Representations, 2025.

[8] Zirun Guo, Tao Jin, Jingyuan Chen, and Zhou Zhao. Classifier-guided gradient modulation for enhanced multi-modal learning. In Advances in Neural Information Processing Systems, 2024.

[9] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019.

[10] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2020.

[11] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017.

[12] Alexander Kolesnikov, Alexey Dosovitskiy, Dirk Weissenborn, Georg Heigold, Jakob Uszkoreit, Lucas Beyer, Matthias Minderer, Mostafa Dehghani, Neil Houlsby, Sylvain Gelly, Thomas Unterthiner, and Xiaohua Zhai. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

[13] Daeun Lee, Jaehong Yoon, and Sung Ju Hwang. Becotta: Input-dependent online blending of experts for continual test-time adaptation. In International Conference on Machine Learning, 2024.

[14] Jiacheng Li and Songhe Feng. Bridging modalities via progressive re-alignment for multimodal test-time adaptation. In Annual AAAI Conference on Artificial Intelligence, 2026.

[15] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, 2022.

[16] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, 2020.

[17] Hyesu Lim, Byeonggeun Kim, Jaegul Choo, and Sungha Choi. Ttn: A domain-shift aware batch normalization in test-time adaptation. In International Conference on Learning Representations, 2023.

[18] Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, and Deva Ramanan. Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models. In Computer Vision and Pattern Recognition, 2023.

[19] Jiaming Liu, Senqiao Yang, Peidong Jia, Ming Lu, Yandong Guo, Wei Xue, and Shanghang Zhang. Vida: Homeostatic visual domain adapter for continual test time adaptation. In International Conference on Learning Representations, 2024.

[20] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Efficient test-time model adaptation without forgetting. In International Conference on Machine Learning, 2022.

[21] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world. In International Conference on Learning Representations, 2023.

[22] Shuaicheng Niu, Chunyan Miao, Guohao Chen, Pengcheng Wu, and Peilin Zhao. Test-time model adaptation with only forward passes. In International Conference on Machine Learning, 2024.

[23] Shuaicheng Niu, Guohao Chen, Deyu Chen, Yifan Zhang, Jiaxiang Wu, Zhiquan Wen, Yaofo Chen, Peilin Zhao, Chunyan Miao, and Mingkui Tan. Adapt in the wild: Test-time entropy minimization with sharpness and feature regularization. arXiv preprint arXiv:2509.04977, 2025.

[24] A Emin Orhan. Robustness properties of facebook’s resnext wsl models. arXiv preprint arXiv:1907.07640, 2019.

[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.

[26] Shiv Shankar, Vihari Piratla, Soumen Chakrabarti, Siddhartha Chaudhuri, Preethi Jyothi, and Sunita Sarawagi. Generalizing across domains via cross-gradient training. In International Conference on Learning Representations.

[27] Junha Song, Jungsoo Lee, In So Kweon, and Sungha Choi. Ecotta: Memory-efficient continual test-time adaptation via self-distilled regularization. In Computer Vision and Pattern Recognition, 2023.

[28] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, 2021.

[29] Guowei Wang, Fan Lyu, and Changxing Ding. Partition-then-adapt: Combating prediction bias for reliable multi-modal test-time adaptation. In Advances in Neural Information Processing Systems, 2025.

[30] Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. Continual test-time domain adaptation. In Computer Vision and Pattern Recognition, 2022.

[31] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, and Furu Wei. Image as a foreign language: Beit pretraining for vision and vision-language tasks. In Computer Vision and Pattern Recognition, 2023.

[32] Mouxing Yang, Yunfan Li, Changqing Zhang, Peng Hu, and Xi Peng. Test-time adaption against multi-modal reliability bias. In International Conference on Learning Representations, 2024.

[33] Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, and Chelsea Finn. Improving out-of-distribution robustness via selective augmentation. In International Conference on Machine Learning, 2022.

[34] Longhui Yuan, Binhui Xie, and Shuang Li. Robust test-time adaptation in dynamic scenarios. In Computer Vision and Pattern Recognition, 2023.

[35] Marvin Zhang, Sergey Levine, and Chelsea Finn. Memo: Test time robustness via adaptation and augmentation. In Advances in Neural Information Processing Systems, 2022.

[36] Yufei Zhang, Yicheng Xu, Hongxin Wei, Zhiping Lin, Xiaofeng Zou, Cen Chen, and Huiping Zhuang. Analytic continual test-time adaptation for multi-modality corruption. In ACM International Conference on Multimedia, pages 1929–1937, 2025.

[37] Yusheng Zhao, Junyu Luo, Xiao Luo, Jinsheng Huang, Jingyang Yuan, Zhiping Xiao, and Ming Zhang. Attention bootstrapping for multi-modal test-time adaptation. In Annual AAAI Conference on Artificial Intelligence, 2025.

Supplementary Material

This appendix contains supplementary ...


More Experimental Details

6.1. Benchmarks

We construct two benchmarks based on Kinetics [11] and VGGSound [3] to evaluate the performance of state-of-the-art methods under multi-modal domain shifts during test-time adaptation. We introduce three experimental setups: uni-modal episodic corruption, uni-modal continual corruption, and interleaved modali...
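The difference between the episodic and continual setups can be illustrated with a minimal sketch. It assumes the common test-time adaptation convention, which the text here does not spell out: episodic evaluation resets the model to its source state before each corruption type, while continual evaluation carries the adapted state forward. `adapt_step`, `run_protocol`, and the dictionary-based "model" are hypothetical stand-ins, not the paper's implementation.

```python
import copy
from typing import Dict, List, Sequence

State = Dict[str, float]


def adapt_step(state: State, batch: Sequence[float]) -> State:
    """Toy stand-in for one test-time update: an exponential running mean."""
    batch_mean = sum(batch) / len(batch)
    return {
        "mean": 0.9 * state["mean"] + 0.1 * batch_mean,
        "steps": state["steps"] + 1,
    }


def run_protocol(
    source_state: State,
    domains: Sequence[Sequence[Sequence[float]]],
    episodic: bool,
) -> List[State]:
    """Adapt over a sequence of corruption domains.

    episodic=True  : restore the source state before every domain (assumed).
    episodic=False : keep adapting from wherever the previous domain ended.
    Returns the adapted state at the end of each domain.
    """
    state = copy.deepcopy(source_state)
    final_states: List[State] = []
    for batches in domains:
        if episodic:
            state = copy.deepcopy(source_state)  # reset to source weights
        for batch in batches:
            state = adapt_step(state, batch)
        final_states.append(dict(state))
    return final_states


source = {"mean": 0.0, "steps": 0}
# Two hypothetical corruption domains, two test batches each.
domains = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
episodic_states = run_protocol(source, domains, episodic=True)
continual_states = run_protocol(source, domains, episodic=False)
```

In the episodic run each domain starts from the same source state, so errors cannot accumulate across corruptions; in the continual run the second domain inherits whatever the first one changed, which is where forgetting and error accumulation become visible.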


Further Analysis of the Redundancy Score

Theoretical Analysis. The distribution shift is modeled as a low-rank perturbation in the latent space. For a perturbed sample $\tilde{z} \in \mathbb{R}^{D}$, we consider the dominant rank-1 component: $\tilde{z} = z + \alpha v$. To formalize our analysis, we establish the following Assumptions: 1. The dimensions of $z$ are centered and uncorrelated, i.e., E[z] =...
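To make the rank-1 model concrete, the sketch below draws uncorrelated latent features, applies the perturbation z + αv, and compares a simple redundancy proxy before and after. Several choices are assumptions made for illustration and do not come from the paper: the proxy itself (mean absolute off-diagonal Pearson correlation), a zero-mean Gaussian α drawn independently for each sample, and a uniform unit-norm direction v.

```python
import numpy as np


def redundancy_proxy(features: np.ndarray) -> float:
    """Mean absolute off-diagonal Pearson correlation between dimensions.

    This is an illustrative proxy (assumption), not the paper's score.
    features has shape (num_samples, D).
    """
    corr = np.corrcoef(features, rowvar=False)  # (D, D) correlation matrix
    d = corr.shape[0]
    off_diagonal = corr[~np.eye(d, dtype=bool)]
    return float(np.mean(np.abs(off_diagonal)))


def perturb_rank1(
    z: np.ndarray, v: np.ndarray, alpha_std: float, rng: np.random.Generator
) -> np.ndarray:
    """Apply z + alpha * v with one scalar alpha per sample (assumed Gaussian)."""
    alpha = rng.normal(0.0, alpha_std, size=(z.shape[0], 1))
    return z + alpha * v[None, :]


rng = np.random.default_rng(0)
num_samples, dim = 4096, 32
z = rng.standard_normal((num_samples, dim))  # centered, uncorrelated dims
v = np.ones(dim) / np.sqrt(dim)              # hypothetical shift direction
z_tilde = perturb_rank1(z, v, alpha_std=4.0, rng=rng)

clean_score = redundancy_proxy(z)
shifted_score = redundancy_proxy(z_tilde)
print(f"clean: {clean_score:.3f}  shifted: {shifted_score:.3f}")
```

Under these assumptions the covariance of the shifted features is the clean covariance plus a multiple of the outer product of v with itself, so every pair of dimensions with nonzero entries in v becomes correlated, and the proxy rises accordingly.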
