Dual Strategies for Test-Time Adaptation
Pith reviewed 2026-05-10 06:07 UTC · model grok-4.3
The pith
DualTTA separates test samples by stability under transformations to apply entropy minimization on reliable ones and maximization on unreliable ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DualTTA identifies two groups of test samples: one where predictions are likely consistent with underlying semantics and another where predictions are likely incorrect. The groups are selected by a reliability criterion that measures prediction stability under semantic-preserving and semantic-altering transformations. Reliable samples undergo entropy minimization to reinforce correct decisions; unreliable samples undergo entropy maximization to suppress errors and unlearn spurious behavior. Theoretical analysis and empirical results show this produces a tighter separation between reliable and unreliable samples, leading to provably more effective model updates.
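The dual objective described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `entropy` and `dual_entropy_loss`, the trade-off weight `lam`, and the per-group averaging are all hypothetical choices made here for clarity.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def dual_entropy_loss(batch_probs, reliable_mask, lam=1.0):
    """Loss to minimize: mean entropy over reliable samples minus
    `lam` times mean entropy over unreliable samples.

    Minimizing it pushes entropy down on reliable samples and up on
    unreliable ones. `lam` is an assumed weight, not taken from the paper.
    """
    reliable = [entropy(p) for p, r in zip(batch_probs, reliable_mask) if r]
    unreliable = [entropy(p) for p, r in zip(batch_probs, reliable_mask) if not r]
    loss = 0.0
    if reliable:
        loss += sum(reliable) / len(reliable)
    if unreliable:
        loss -= lam * sum(unreliable) / len(unreliable)
    return loss

# Toy batch: one reliable sample, two unreliable ones.
batch = [[0.9, 0.05, 0.05], [0.4, 0.3, 0.3], [0.98, 0.01, 0.01]]
mask = [True, False, False]
loss = dual_entropy_loss(batch, mask)
```

In a real TTA loop the probabilities would come from the model's softmax output and the loss would be backpropagated; the sketch only shows how the two groups enter the objective with opposite signs.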
What carries the argument
The reliability criterion that scores prediction stability under both semantic-preserving and semantic-altering transformations to decide which samples receive entropy minimization versus entropy maximization.
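One way such a criterion could look is sketched below. The summary does not give the paper's actual scoring rule, so everything here is an assumption: agreement is measured as the fraction of transformed views whose top class matches the original prediction, the two measures are averaged into one score, and the threshold `tau` is a hypothetical value.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def agreement(pred, view_probs):
    """Fraction of transformed views whose top class equals `pred`."""
    if not view_probs:
        raise ValueError("need at least one transformed view")
    return sum(argmax(p) == pred for p in view_probs) / len(view_probs)

def reliability_score(orig_probs, preserving_probs, altering_probs):
    """Assumed score in [0, 1]: high when the prediction survives
    semantic-preserving views and changes under semantic-altering views."""
    pred = argmax(orig_probs)
    keep = agreement(pred, preserving_probs)       # stability where it is wanted
    flip = 1.0 - agreement(pred, altering_probs)   # sensitivity where it is wanted
    return 0.5 * (keep + flip)

def is_reliable(score, tau=0.75):
    """Threshold rule; `tau` is a hypothetical cut-off."""
    return score >= tau

# Prediction follows the semantics: stable when preserved, flips when altered.
good = reliability_score([0.8, 0.2], [[0.7, 0.3], [0.9, 0.1]], [[0.2, 0.8], [0.4, 0.6]])
# Suspicious prediction: unchanged even after the semantics were altered.
bad = reliability_score([0.8, 0.2], [[0.7, 0.3], [0.9, 0.1]], [[0.9, 0.1], [0.8, 0.2]])
```

The second case is the one a purely entropy-based filter would miss: the prediction is confident and stable, yet it does not react when the semantic content changes.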
If this is right
- A larger and more diverse portion of the test distribution can be used for adaptation instead of only low-entropy samples.
- The dual objectives create a sharper distinction between samples suitable and unsuitable for model updates.
- Model updates become provably more effective under distribution shifts.
- Reliable predictions are reinforced while overconfident errors and spurious patterns are actively suppressed.
Where Pith is reading between the lines
- The stability-based partition could be reused in continual learning to decide which incoming data should reinforce versus unlearn patterns.
- Entropy maximization on unreliable samples offers a concrete mechanism for online forgetting that might combine with other regularization techniques.
- The same dual-strategy logic could be tested in non-vision domains once equivalent semantic-preserving and altering operations are defined.
Load-bearing premise
Prediction stability under semantic-preserving and semantic-altering transformations accurately identifies which samples have predictions consistent with their underlying semantics.
What would settle it
An evaluation against ground-truth labels on a held-out test set would settle it: if samples labeled reliable by the stability criterion turned out to be less accurate than those labeled unreliable, the separation would be falsified.
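The check can be written down directly. The following is a minimal sketch with hypothetical function names (`group_accuracy`, `separation_is_falsified`); it assumes predictions, ground-truth labels, and the reliable/unreliable mask are available as parallel lists.

```python
def group_accuracy(preds, labels, reliable_mask):
    """Return (accuracy on reliable, accuracy on unreliable).
    A group with no samples yields None."""
    def acc(flag):
        pairs = [(p, y) for p, y, r in zip(preds, labels, reliable_mask)
                 if bool(r) == flag]
        if not pairs:
            return None
        return sum(p == y for p, y in pairs) / len(pairs)
    return acc(True), acc(False)

def separation_is_falsified(preds, labels, reliable_mask):
    """True when the 'reliable' group is strictly less accurate than the
    'unreliable' group; False otherwise, including when a group is empty."""
    rel, unrel = group_accuracy(preds, labels, reliable_mask)
    if rel is None or unrel is None:
        return False  # nothing can be concluded from an empty group
    return rel < unrel

preds = [0, 1, 1, 0]
labels = [0, 1, 0, 1]
mask = [True, True, False, False]
verdict = separation_is_falsified(preds, labels, mask)
```

A single comparison like this is only a sanity check; in practice one would repeat it across corruption types and report the accuracy gap with its variance.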
Original abstract
Conventional test-time adaptation (TTA) approaches typically adapt the model using only a small fraction of test samples, often those with low-entropy predictions, thereby failing to fully leverage the available information in the test distribution. This paper introduces DualTTA, a novel framework that improves performance under distribution shifts by utilizing a larger and more diverse set of test samples. DualTTA identifies two distinct groups: one where the model's predictions are likely consistent with the underlying semantics, and another where predictions are likely incorrect. For the first group, it minimizes prediction entropy to reinforce reliable decisions; for the second, it maximizes entropy to suppress overconfident errors and unlearn spurious behavior. These groups are adaptively selected using a new reliability criterion that measures prediction stability under both semantic-preserving and semantic-altering transformations, addressing the limitations of purely entropy-based selection. We further provide theoretical analysis and empirical justification showing that our approach enables a tighter separation between reliable and unreliable samples, in the context of their suitability for adaptation, leading to provably more effective model updates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DualTTA for test-time adaptation under distribution shift. It partitions test samples into two groups using a reliability criterion based on prediction stability under semantic-preserving and semantic-altering transformations: reliable samples (likely semantically consistent) receive entropy-minimization updates, while unreliable samples (likely incorrect) receive entropy-maximization updates. The central claim is that this dual strategy yields a tighter separation between reliable and unreliable samples than entropy-based selection alone, enabling provably more effective model updates and better utilization of the full test distribution.
Significance. If the theoretical link between the stability criterion and semantic correctness holds and the empirical gains are reproducible, the work would meaningfully advance TTA by moving beyond low-entropy-only adaptation and providing a principled way to both reinforce correct predictions and suppress overconfident errors. The dual min/max objective and the explicit reliability criterion are novel relative to prior entropy or pseudo-label methods.
major comments (2)
- [theoretical analysis] Theoretical analysis (referenced in the abstract): the claim of 'provably more effective model updates' is conditioned on the reliability criterion producing an accurate separation, yet the provided description supplies no formal argument or bound establishing why stability under semantic-altering transformations implies semantic inconsistency rather than transformation artifacts, model invariances, or other factors. This assumption is load-bearing for the 'provable' improvement and for the superiority over entropy-based selection.
- [reliability criterion] Reliability criterion definition (abstract and method description): the criterion combines stability under both transformation types, but it is unclear how the two stability measures are combined into a single selection rule and whether the rule is independent of the adaptation objective or risks circularity when the same model is used for both stability measurement and the subsequent min/max updates.
minor comments (1)
- [abstract] The abstract states that the approach 'addresses the limitations of purely entropy-based selection' but does not quantify those limitations or cite the specific prior works whose entropy thresholds are being improved upon.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments on our work. We address each of the major comments in detail below, providing clarifications and indicating revisions made to the manuscript.
- Referee: [theoretical analysis] Theoretical analysis (referenced in the abstract): the claim of 'provably more effective model updates' is conditioned on the reliability criterion producing an accurate separation, yet the provided description supplies no formal argument or bound establishing why stability under semantic-altering transformations implies semantic inconsistency rather than transformation artifacts, model invariances, or other factors. This assumption is load-bearing for the 'provable' improvement and for the superiority over entropy-based selection.
  Authors: We agree that our theoretical analysis assumes the reliability criterion provides an accurate separation and does not include a formal proof that stability under semantic-altering transformations necessarily corresponds to semantic inconsistency (as opposed to artifacts or invariances). The analysis instead shows that, given such a separation, the dual strategy yields more effective updates than single-sided entropy minimization by deriving bounds on the change in model parameters or expected loss. We have revised the manuscript to explicitly state the assumptions, discuss potential confounding factors, and moderate the language from 'provably' to 'theoretically motivated' in the abstract and relevant sections.
  Revision: yes
- Referee: [reliability criterion] Reliability criterion definition (abstract and method description): the criterion combines stability under both transformation types, but it is unclear how the two stability measures are combined into a single selection rule and whether the rule is independent of the adaptation objective or risks circularity when the same model is used for both stability measurement and the subsequent min/max updates.
  Authors: We have clarified the definition in the revised method section. The reliability criterion is computed on the initial pre-adaptation model by measuring prediction consistency under semantic-preserving transformations and inconsistency under semantic-altering ones. These are combined into a single score with a threshold for partitioning into reliable and unreliable groups. Since this measurement uses only the initial model outputs and does not depend on the adaptation updates, there is no circularity. We have added a detailed description, pseudocode, and an ablation study on the combination rule to the manuscript.
  Revision: yes
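The non-circularity argument in this response rests on scoring with a model that adaptation never touches. A rough sketch of that design choice follows; the toy dictionary "model", the `toy_score` function, and the helper name `partition_with_frozen_model` are hypothetical stand-ins, not the authors' pseudocode.

```python
import copy

def partition_with_frozen_model(model, samples, score_fn, tau):
    """Score every sample with a frozen snapshot of `model`, then split at `tau`.

    The snapshot is taken before any adaptation step, so later updates
    to `model` cannot feed back into the scores.
    """
    frozen = copy.deepcopy(model)
    scores = [score_fn(frozen, x) for x in samples]
    reliable = [x for x, s in zip(samples, scores) if s >= tau]
    unreliable = [x for x, s in zip(samples, scores) if s < tau]
    return frozen, scores, reliable, unreliable

# Toy stand-ins: the "model" is a dict holding one weight, and the score
# is just weight * sample. Both are placeholders for illustration only.
def toy_score(m, x):
    return m["w"] * x

model = {"w": 2.0}
samples = [0.1, 0.4, 0.9]
frozen, scores, reliable, unreliable = partition_with_frozen_model(
    model, samples, toy_score, tau=1.0
)

model["w"] = 0.5  # simulate an adaptation update on the live model
rescored = [toy_score(frozen, x) for x in samples]  # snapshot is unaffected
```

Whether a fixed snapshot stays informative over a long test stream is a separate question that the sketch does not address.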
Circularity Check
No circularity in derivation chain
The paper defines a new reliability criterion based on prediction stability under semantic-preserving and semantic-altering transformations, which is presented as an independent selection rule separate from the adaptation objective. Dual min-entropy and max-entropy updates are then applied conditionally on this criterion, with theoretical analysis claiming tighter separation and more effective updates. No equations or steps reduce the claimed predictions or provable improvements to fitted parameters, self-defined quantities, or prior self-citations by construction. The central claims rest on the external validity of the stability measure rather than tautological re-use of inputs, making the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Prediction stability under the chosen transformations reliably indicates whether a sample's prediction matches the underlying semantics.