arxiv: 2605.19890 · v1 · pith:5HRIIVP7new · submitted 2026-05-19 · 💻 cs.CV

GoTTA be Diverse: Rethinking Memory Policies for Test-Time Adaptation

Pith reviewed 2026-05-20 05:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords test-time adaptation · memory management · intra-class diversity · distribution shift · non-i.i.d. streams · online adaptation · video streams · corruption robustness

The pith

Memory policies that keep diverse samples within each class outperform recency-based or class-balanced buffers for test-time adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper separates memory design from the choice of adaptation loss and tests different selection rules on the same streams. It shows that simply keeping the newest samples or balancing classes still produces redundant buffers when the incoming data is temporally correlated or label-skewed. Adding a feature-space diversity term inside each class removes that redundancy and supplies more informative gradients. The resulting policies improve accuracy most when memory is small and the stream is far from i.i.d., yet stay competitive once more memory becomes available.
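The redundancy argument can be illustrated with a toy stream. The following is a minimal sketch under stated assumptions, not the paper's benchmark: the stream, the scalar "feature", and the helper names `fifo`, `balanced_recent`, and `spread` are all hypothetical and chosen only to make the effect visible.

```python
from collections import defaultdict, deque

def fifo(stream, capacity):
    """Recency-only policy: keep the `capacity` newest samples."""
    return list(deque(stream, maxlen=capacity))

def balanced_recent(stream, per_class):
    """Class-balanced policy: keep the `per_class` newest samples of each label."""
    slots = defaultdict(lambda: deque(maxlen=per_class))
    for label, feature in stream:
        slots[label].append((label, feature))
    return [item for label in sorted(slots) for item in slots[label]]

def spread(buffer, label):
    """Range of the 1-D feature among buffered samples of one class."""
    values = [feat for lab, feat in buffer if lab == label]
    return max(values) - min(values) if values else None

# Hypothetical stream: 3 classes arriving in contiguous runs of 50 frames,
# with a feature that drifts slowly inside each run (temporal correlation).
stream = [(c, 100 * c + i) for c in range(3) for i in range(50)]

fifo_buf = fifo(stream, capacity=6)
bal_buf = balanced_recent(stream, per_class=2)

print(sorted({lab for lab, _ in fifo_buf}))  # [2] -> only the latest class survives
print(spread(bal_buf, 0))                    # 1   -> balanced, yet near-duplicate frames
print(spread(stream, 0))                     # 49  -> spread actually available in class 0
```

In this toy setting the FIFO buffer collapses to a single class, and the class-balanced buffer covers all classes but holds adjacent, nearly identical frames within each one.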

Core claim

Effective test-time adaptation requires memory buffers that are both class-balanced and diverse within each class. The Guided Observational Test-Time Adaptation (GOTTA) family implements this principle by allocating slots proportionally to observed classes and then selecting, within each class, samples that maximize feature-space spread. This yields higher adaptation accuracy than recency-only or balance-only policies across corruption and video benchmarks.

What carries the argument

GOTTA diversity-aware memory policies that combine class-balanced slot allocation with explicit feature-space diversity selection inside each class.
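One way to picture this combination is the sketch below. It is an illustration under assumptions, not the authors' implementation: it gives every observed class an equal share of the budget and uses plain greedy farthest-point selection with Euclidean distance inside each class. The function names `farthest_point_subset` and `select_memory` are hypothetical, and the paper's own policies and thresholds are not specified on this page.

```python
import math
from collections import defaultdict

def farthest_point_subset(points, k):
    """Greedy farthest-point selection: pick up to k mutually spread-out points."""
    if k <= 0 or not points:
        return []
    chosen = [points[0]]  # arbitrary seed: the first point
    # Distance from every point to its nearest already-chosen point.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        idx = max(range(len(points)), key=nearest.__getitem__)
        if nearest[idx] == 0:  # only exact duplicates remain
            break
        chosen.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return chosen

def select_memory(samples, capacity):
    """samples: list of (pseudo_label, feature_vector); returns at most `capacity` items."""
    if not samples:
        return []
    by_class = defaultdict(list)
    for label, feat in samples:
        by_class[label].append(feat)
    per_class = capacity // len(by_class)  # equal share per observed class (assumption)
    memory = []
    for label in sorted(by_class):
        for feat in farthest_point_subset(by_class[label], per_class):
            memory.append((label, feat))
    return memory

samples = [
    (0, (0.0, 0.0)), (0, (0.1, 0.0)), (0, (5.0, 0.0)), (0, (0.0, 5.0)),
    (1, (10.0, 10.0)), (1, (10.1, 10.0)), (1, (20.0, 10.0)),
]
memory = select_memory(samples, capacity=4)
print(memory)
# [(0, (0.0, 0.0)), (0, (5.0, 0.0)), (1, (10.0, 10.0)), (1, (20.0, 10.0))]
```

The near-duplicate points (0.1, 0.0) and (10.1, 10.0) are skipped in favour of distant ones. Slots left over when the budget does not divide evenly are simply unused in this sketch.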

If this is right

  • Memory design becomes a first-class lever in TTA, independent of the particular adaptation objective.
  • Intra-class diversity reduces redundancy and preserves representative signals under label skew and temporal correlation.
  • The same memory policies can be swapped into existing TTA methods without changing their loss functions.
  • Gains are largest when memory capacity is tightly constrained and streams deviate strongly from i.i.d.
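The drop-in property in the third point can be sketched as a small interface. The names `MemoryPolicy`, `FifoMemory`, `ReservoirMemory`, and `run_stream` are hypothetical and not taken from the paper; `adapt_step` stands in for whatever adaptation objective is in use, and the reservoir variant follows the standard Algorithm R.

```python
import random
from collections import deque
from typing import Protocol, Sequence

class MemoryPolicy(Protocol):
    """Minimal interface an adaptation loop needs from any memory."""
    def add(self, sample) -> None: ...
    def contents(self) -> Sequence: ...

class FifoMemory:
    """Recency-only buffer."""
    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)
    def add(self, sample):
        self._buf.append(sample)
    def contents(self):
        return list(self._buf)

class ReservoirMemory:
    """Uniform random sample over the whole stream (Algorithm R)."""
    def __init__(self, capacity, seed=0):
        self._capacity = capacity
        self._buf = []
        self._seen = 0
        self._rng = random.Random(seed)
    def add(self, sample):
        self._seen += 1
        if len(self._buf) < self._capacity:
            self._buf.append(sample)
        else:
            j = self._rng.randrange(self._seen)  # uniform in [0, seen)
            if j < self._capacity:
                self._buf[j] = sample
    def contents(self):
        return list(self._buf)

def run_stream(stream, memory: MemoryPolicy, adapt_step):
    """Feed each sample to the memory, then adapt on the current buffer."""
    for sample in stream:
        memory.add(sample)
        adapt_step(memory.contents())

seen_batches = []
run_stream(range(10), FifoMemory(capacity=3), seen_batches.append)
print(seen_batches[-1])  # [7, 8, 9]
```

Because `run_stream` only touches `add` and `contents`, swapping `FifoMemory` for `ReservoirMemory` (or a diversity-aware policy) leaves the adaptation step unchanged.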

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diversity principle could be tested in continual learning settings where the model must retain knowledge across many tasks rather than adapt once.
  • Pairing diversity selection with uncertainty or entropy filters might further improve sample quality when both criteria are available.
  • Real-world deployment would benefit from measuring the computational overhead of the diversity computation relative to the adaptation gains.

Load-bearing premise

The chosen corruption and video-stream benchmarks are representative enough that the observed differences between memory policies will hold for other practical test-time conditions.

What would settle it

Running the same memory policies on a new non-i.i.d. stream where intra-class feature variance is naturally very low and finding that GOTTA no longer improves over a simple FIFO buffer.

Figures

Figures reproduced from arXiv: 2605.19890 by Bernard Ghanem, Merey Ramazanova, Shyma Alhuwaider, Silvio Giancola, Yasmeen Alsaedy.

Figure 1: We evaluate memory selection strategies on a two-class CIFAR-10 subset under a Dirichlet …

Figure 2: Memory mechanism comparison on CIFAR-10-C under continual temporal TTA streams with memory budget M = 32. We compare three memory guidance classes: uninformed (None, FIFO, Reservoir), class-guided (PBRS, CSTU), and guided observational (CDS, FPS, FPSD). For guided observational methods, we ablate over diversity thresholds ϵ, which control similarity between samples (Euclidean distance for FPS and cosine si…

Figure 3: Memory mechanism comparison on CIFAR-10-C under episodic temporal TTA streams with memory budget M = 32. We compare three memory guidance classes: uninformed (None, FIFO, Reservoir), class-guided (PBRS, CSTU), and guided observational (CDS, FPS, FPSD). For guided observational methods, we ablate over diversity thresholds ϵ, which control similarity between samples (Euclidean distance for FPS and cosine sim…

Figure 4: Memory mechanism comparison on CIFAR-10-C under episodic (a) and continual (b) temporal TTA streams with memory budget M = 64. The best result overall across all methods and strategies is indicated with a ⋆.

Figure 5: Memory-capacity scaling under temporal continual test-time adaptation for NORM, NOTE, …
Original abstract

Test-time adaptation (TTA) enables a pre-trained model to adapt online to an unlabeled test stream under distribution shift. While most TTA research focuses on the adaptation objective, practical streams also depend critically on the memory used to select which test samples drive adaptation. Existing memory mechanisms are usually evaluated as components of specific TTA algorithms, making it difficult to isolate which memory design choices matter and when they matter. In this work, we provide a systematic benchmark that decouples memory from the adaptation algorithm and evaluates memory policies under unified conditions across i.i.d., non-i.i.d., continual, and practical test streams. Our study shows that effective memory management requires more than retaining recent or class-balanced samples. In particular, intra-class diversity is a key factor for avoiding redundant buffers and maintaining representative adaptation signals under temporally correlated and label-skewed streams. Motivated by this finding, we introduce Guided Observational Test-Time Adaptation (GOTTA), a family of diversity-aware memory policies that combine class-balanced allocation with feature-space diversity. GOTTA memories act as drop-in replacements for existing buffers and can be paired with different TTA objectives. Across corruption benchmarks and video-stream settings, diversity-aware memory improves adaptation most clearly under constrained memory budgets and challenging non-i.i.d. streams, while remaining competitive as memory capacity increases. These results highlight memory management as a first-class component of robust test-time adaptation and identify diversity as a central principle for practical TTA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that memory policies in test-time adaptation (TTA) have been understudied because they are typically evaluated only as components of specific TTA algorithms. Through a systematic benchmark that decouples memory selection from the adaptation objective, the authors evaluate policies across i.i.d., non-i.i.d., continual, and practical streams on corruption benchmarks (CIFAR-C, ImageNet-C) and video streams. They conclude that intra-class diversity is essential to avoid redundant buffers under temporally correlated and label-skewed conditions, and introduce GOTTA, a family of diversity-aware policies that combine class-balanced allocation with feature-space diversity. These policies are shown to improve adaptation most under constrained memory budgets and challenging non-i.i.d. streams while remaining competitive at larger capacities.

Significance. If the benchmark results and the reported gains hold under the stated conditions, the work usefully elevates memory management to a first-class design choice in TTA and supplies a reusable evaluation framework that isolates policy effects. The emphasis on diversity as a guiding principle offers a concrete, actionable insight for practitioners facing limited buffers and distribution shift.

major comments (1)
  1. [Section 3] The central claim that the benchmark isolates memory-policy effects from adaptation objectives rests on the assumption that buffer contents interact with the TTA loss (entropy minimization, pseudo-labeling, etc.) only through the selected samples. Because the adaptation objective recomputes statistics or gradients from the buffer, diversity-aware selection could alter those statistics differently than the class-balanced or recency baselines; without an explicit ablation that replaces the adaptation objective with a parameter-free alternative, the observed gains cannot be attributed solely to diversity.
minor comments (2)
  1. [Abstract] The abstract states that diversity-aware memory 'improves adaptation most clearly under constrained memory budgets' but supplies no quantitative deltas, error bars, or statistical tests; the full experimental section should report these numbers for each budget and stream type.
  2. Clarify the precise definition and computation of 'feature-space diversity' used inside GOTTA (e.g., which distance metric, how many neighbors, whether it is enforced via explicit regularization or only via selection).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The major comment raises a valid point about the limits of isolation in our benchmark, which we address directly below with a proposed revision.

Point-by-point responses
  1. Referee: [Section 3] The central claim that the benchmark isolates memory-policy effects from adaptation objectives rests on the assumption that buffer contents interact with the TTA loss (entropy minimization, pseudo-labeling, etc.) only through the selected samples. Because the adaptation objective recomputes statistics or gradients from the buffer, diversity-aware selection could alter those statistics differently than the class-balanced or recency baselines; without an explicit ablation that replaces the adaptation objective with a parameter-free alternative, the observed gains cannot be attributed solely to diversity.

    Authors: We thank the referee for this precise observation. Our benchmark in Section 3 does fix the adaptation objective (e.g., entropy minimization or pseudo-labeling) while varying only the memory policy, allowing direct comparison of selection strategies under identical loss computations. We agree, however, that sample selection can still influence batch statistics or gradients, so the gains are not purely from sample quality independent of the objective. To strengthen the isolation claim, we will add (i) an explicit discussion of this interaction in the revised Section 3 and (ii) a new ablation that evaluates buffer quality via a parameter-free method (nearest-neighbor classification on frozen features without any model update or loss-driven adaptation). Preliminary results from this ablation show that diversity-aware buffers still yield higher representation quality than recency or class-balanced baselines, supporting that selection itself contributes meaningfully. We will include the full ablation and updated discussion in the revision.

    Revision: partial
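A parameter-free check of the kind described in the response could look like the following sketch. It is only an illustration under assumptions: the rebuttal does not specify the protocol, so the 1-nearest-neighbour rule, Euclidean distance, the labelled buffer, and the function name `nn_accuracy` are all hypothetical choices.

```python
import math

def nn_accuracy(buffer, queries):
    """1-nearest-neighbour accuracy of `queries` against a labelled buffer.

    buffer, queries: lists of (label, feature_vector) computed with frozen
    features. No model parameters are updated anywhere in this evaluation.
    """
    if not buffer or not queries:
        return 0.0
    correct = 0
    for true_label, feat in queries:
        pred_label, _ = min(buffer, key=lambda item: math.dist(item[1], feat))
        correct += pred_label == true_label
    return correct / len(queries)

# Hypothetical 1-D features: class 0 has two modes (near 0 and near 10),
# class 1 sits near 6.
queries = [(0, (0.2,)), (0, (9.8,)), (1, (6.1,))]
redundant = [(0, (0.0,)), (0, (0.1,)), (1, (6.0,))]  # class 0 covered by near-duplicates
diverse = [(0, (0.0,)), (0, (10.0,)), (1, (6.0,))]   # class 0 covered at both modes

print(nn_accuracy(diverse, queries))  # 1.0
```

Both buffers have the same size and class balance. Only the diverse one covers the second mode of class 0, so the redundant buffer misclassifies the query at 9.8.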

Circularity Check

0 steps flagged

Empirical benchmark with no circular derivation chain

full rationale

The paper is a systematic empirical benchmark study that decouples memory policies from specific TTA adaptation objectives and evaluates them across i.i.d., non-i.i.d., continual, and practical streams using corruption and video datasets. Claims about diversity-aware memory (GOTTA) improving adaptation under constrained budgets are presented as experimental outcomes rather than mathematical predictions or first-principles derivations. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text; the work remains self-contained against external benchmarks without reducing any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review based on abstract only; no explicit free parameters, mathematical axioms, or invented physical entities are described. GOTTA is presented as a new algorithmic family rather than a new postulated entity with independent evidence.

invented entities (1)
  • GOTTA memory policies · no independent evidence
    purpose: Diversity-aware buffers combining class balance and feature-space diversity for TTA
    Introduced as a family of drop-in replacement policies motivated by the benchmark findings.

pith-pipeline@v0.9.0 · 5809 in / 1182 out tokens · 51511 ms · 2026-05-20T05:21:47.230571+00:00 · methodology


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 1 internal anchor

  1. [1] Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. 2008

  2. [2] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018

  3. [3] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 2022

  4. [4] Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. In NeurIPS, 2020

  5. [5] Zachary Nado, Shreyas Padhy, D. Sculley, Alexander D’Amour, Balaji Lakshminarayanan, and Jasper Snoek. Evaluating prediction-time batch normalization for robustness under covariate shift. CoRR, abs/2006.10963, 2020

  6. [6] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno A. Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In ICLR, 2021

  7. [7] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896, 2013

  8. [8] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In ICML, pages 9229–9248, 2020

  9. [9] Taesik Gong, Jongheon Jeong, Taewon Kim, Yewon Kim, Jinwoo Shin, and Sung-Ju Lee. Robust continual test-time adaptation: Instance-aware BN and prediction-balanced memory. In NeurIPS, 2022

  10. [10] Longhui Yuan, Binhui Xie, and Shuang Li. Robust test-time adaptation in dynamic scenarios,

  11. [11] URL https://arxiv.org/abs/2303.13899

  12. [12] Shuang Li, Binhui Xie, Qiuxia Lin, Chi Harold Liu, Gao Huang, and Guoren Wang. Generalized domain conditioned adaptation network. IEEE Trans. Pattern Anal. Mach. Intell., 44(8):4093–4109, 2022

  13. [13] Mingsheng Long, Yue Cao, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. Transferable representation learning with deep adaptation networks. IEEE Trans. Pattern Anal. Mach. Intell., 41(12):3071–3085, 2019

  14. [14] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor S. Lempitsky. Domain-adversarial training of neural networks. J. Mach. Learn. Res., 17:59:1–59:35, 2016

  15. [15] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, pages 3723–3732, 2018

  16. [16] Binhui Xie, Shuang Li, Mingjia Li, Chi Harold Liu, Gao Huang, and Guoren Wang. Sepico: Semantic-guided pixel contrast for domain adaptive semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., pages 1–17, 2023

  17. [17] Yang Zou, Zhiding Yu, BVK Vijaya Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, pages 289–305, 2018

  18. [18] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Gregory G. Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Trans. Pattern Anal. Mach. Intell., 44(7):3366–3385, 2022. doi: 10.1109/TPAMI.2021.3057446. URL https://doi.org/10.1109/TPAMI.2021.3057446

  19. [19] James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. CoRR, abs/1612.00796, 2016. URL http://arxiv.org/abs/1...

  20. [20] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. In NeurIPS, pages 11816–11825, 2019

  21. [21] Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In ECCV, pages 233–248, 2018

  22. [22] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. icarl: Incremental classifier and representation learning. In CVPR, pages 5533–5542, 2017

  23. [23] Jian Liang, Ran He, and Tieniu Tan. A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision, 133(1):31–64, 2025

  24. [24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015

  25. [25] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In ICML, pages 6028–6039, 2020

  26. [26] Minguk Jang and Sae-Young Chung. Test-time adaptation via self-training with nearest neighbor information. CoRR, abs/2207.10792, 2022

  27. [27] Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier adjustment module for model-agnostic domain generalization. In NeurIPS, pages 2427–2440, 2021

  28. [28] Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. Continual test-time domain adaptation. In CVPR, pages 7191–7201, 2022

  29. [29] Shyma Alhuwaider, Motasem Alfarra, Juan C. Perez, Merey Ramazanova, and Bernard Ghanem. Advmem: Adversarial memory initialization for realistic test-time adaptation via tracklet-based benchmarking, 2025. URL https://arxiv.org/abs/2509.02182

  30. [30] Yongcan Yu, Lijun Sheng, Ran He, and Jian Liang. Stamp: Outlier-aware test-time adaptation with stable memory replay, 2024. URL https://arxiv.org/abs/2407.15773

  31. [31] Jeffrey S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, March 1985. ISSN 0098-3500. doi: 10.1145/3147.3165. URL https://doi.org/10.1145/3147.3165

  32. [32] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017

  33. [33] Silvio Giancola, Anthony Cioppa, Julia Georgieva, Johsan Billingham, Andreas Serner, Kerry Peek, Bernard Ghanem, and Marc Van Droogenbroeck. Towards active learning for action spotting in association football videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 5098–5108, June 2023

  34. [34] Yuval Eldar, Michael Lindenbaum, Moshe Porat, and Yehoshua Y Zeevi. The farthest point strategy for progressive image sampling. IEEE transactions on image processing, 6(9):1305–1315, 1997

  35. [35] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017

  36. [36] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. stat, 1050:10, 2013

  37. [37] Laming Chen, Guoxin Zhang, and Eric Zhou. Fast greedy map inference for determinantal point process to improve recommendation diversity. Advances in neural information processing systems, 31, 2018

  38. [38] Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019

  39. [39] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Efficient test-time model adaptation without forgetting. In ICML, volume 162, pages 16888–16905, 2022

  40. [40] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world, 2023. URL https://arxiv.org/abs/2302.12400

  41. [41] Yige Yuan, Bingbing Xu, Liang Hou, Fei Sun, Huawei Shen, and Xueqi Cheng. Tea: Test-time energy adaptation, 2024. URL https://arxiv.org/abs/2311.14402

  42. [42] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016

  43. [43] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016

  44. [44] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

  45. [45] Malik Boudiaf, Romain Mueller, Ismail Ben Ayed, and Luca Bertinetto. Parameter-free online test-time adaptation. In CVPR, pages 8344–8353, 2022

6 Appendix

6.1 Test-Time Adaptation settings

Following Yuan et al. [10], we distinguish four TTA settings of increasing difficulty, all sharing the same protocol: a source-trained model fθs is adapted online usi...