pith. sign in

arxiv: 2606.02339 · v1 · pith:3MBDR6VVnew · submitted 2026-06-01 · 💻 cs.LG · cs.CV

Entropy Minimization without Model Collapse: Mitigating Prediction Bias in Medical Imaging

Pith reviewed 2026-06-28 15:55 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords entropy minimizationtest-time adaptationmodel collapseprediction biasmedical imagingdistribution shiftbias reductionunsupervised adaptation
0
0 comments X

The pith

Entropy minimization amplifies prediction bias from merged feature clusters, leading to model collapse during test-time adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that distribution shifts in medical imaging can merge feature clusters belonging to different classes in the model's representation space while the decision boundary remains fixed. This produces a skewed predicted class distribution called prediction bias. Entropy minimization, the standard objective for test-time adaptation, then tightens the incorrect clusters and reinforces the skew until predictions collapse to a single trivial output. The authors introduce Distribution Shift Bias Reduction (DSBR) to equalize each predicted class's contribution to the entropy loss, thereby interrupting the amplification process. Experiments on four medical imaging datasets plus ImageNet-C confirm that DSBR prevents collapse and performs at or above existing methods while requiring no source data.

Core claim

Distribution shifts cause feature clusters corresponding to distinct classes to merge in representation space with a fixed decision boundary, inducing prediction bias; entropy minimization amplifies this bias by tightening the clusters until all predictions collapse to a trivial solution; DSBR mitigates the failure mode by equalizing the contribution of each predicted class to the unsupervised entropy minimization loss at test time.

What carries the argument

Distribution Shift Bias Reduction (DSBR), an objective that equalizes the contribution of each predicted class to the entropy minimization loss.

If this is right

  • DSBR stabilizes test-time adaptation and prevents model collapse on medical imaging data.
  • DSBR matches or outperforms state-of-the-art methods while operating solely at test time.
  • Equalizing class contributions in the entropy loss directly addresses the identified amplification mechanism.
  • The approach applies to adaptation settings on four medical-imaging datasets and ImageNet-C.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If cluster merging under fixed boundaries is a common response to distribution shift, DSBR may extend to other unsupervised adaptation tasks outside medical imaging.
  • Testing DSBR on shifts that do not produce merged clusters could isolate whether the equalization step introduces performance costs when bias is absent.
  • The mechanism suggests similar bias-amplification risks may exist in other entropy-based objectives beyond test-time adaptation.

Load-bearing premise

The observed prediction bias arises specifically from merged feature clusters with a fixed decision boundary, and equalizing class contributions directly counters the amplification mechanism without side effects.

What would settle it

An experiment in which feature clusters merge under a distribution shift but DSBR still produces collapse, or in which collapse occurs without prior cluster merging.

Figures

Figures reproduced from arXiv: 2606.02339 by Daniel M. Lang, Johannes Kiechle, Julia A. Schnabel, Sameer Ambekar, Tim Nielen.

Figure 1
Figure 1. Figure 1: Feature space clusters merge under distribution shift, inducing severe prediction bias, which is prone to failure for Entropy Minimization (EM). We show the 2-dimensional t-SNE embeddings from WILDSCamelyon [40] dataset featurized by ResNet50-GN. The background colors indicate the area of the feature space the model classifies as each class which we obtain by fitting a k-nearest neighbors classifier on the… view at source ↗
Figure 2
Figure 2. Figure 2: Entropy minimization amplifies prediction bias and ultimately causes model collapse. We show 2D UMAP embeddings of WILDS-Camelyon features from a ResNet50-GN pretrained on Hospital 2, before and during adaptation. Background colors denote predicted regions in feature space, obtained by fitting a k-nearest neighbors classifier to the embeddings and model predictions. As entropy minimization sharpens biased … view at source ↗
Figure 3
Figure 3. Figure 3: DSBR stabilizes Entropy Minimiza￾tion where ‘H’ denotes the Hospital domain id. Balanced accuracy results on all WILDSCamelyon configurations for Tent (EM) and DSBR. In Tab. 2, we provide the balanced accuracy on the medical datasets averaged over all source￾target configurations. DSBR achieves the high￾est average improvement on both backbones, +4.2% for ResNet and +2.6% for ViT, consid￾erably more than t… view at source ↗
Figure 4
Figure 4. Figure 4: Model collapse in Tent (EM) and its mitigation by DSBR. Initial bias is present for both models. Left: Tent amplifies bias until the model converges to a single class for almost every input (Model collapse) by the end of the episode. Right: DSBR progressively dampens bias until it vanishes by approximately batch 400. 5.2 Performance on natural imaging by mitigating Bias [PITH_FULL_IMAGE:figures/full_fig_p… view at source ↗
Figure 5
Figure 5. Figure 5: DSBR’s effectiveness depends on the quality [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Initial gradient norm has no effect on entropy minimization. We compute the initial gradient [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Entropy is highest close to the decision boundary. The setup is the same as in [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
read the original abstract

Entropy minimization (EM) is the dominant objective for test-time adaptation, yet its failure mode, model collapse, remains poorly understood. In this work, we show that distribution shifts can cause feature clusters corresponding to distinct classes in the model's representation space to merge, while the decision boundary remains fixed. This induces a systematic skew in the predicted class distribution, referred to as prediction bias. Prediction bias refers to a shift in the predicted class distribution, with some classes overrepresented and others suppressed. We show that entropy minimization amplifies this prediction bias by tightening the existing clusters, reinforcing the incorrect groupings until all predictions collapse to a trivial solution. Next, to demonstrate the significance of prediction bias and mitigate it, we further propose Distribution Shift Bias Reduction (DSBR), a bias-correcting objective that specifically targets this failure mode by equalizing the contribution of each predicted class to the unsupervised entropy minimization loss. To study this failure mode, we design suitable adaptation settings using four medical-imaging datasets and additionally evaluate on ImageNet-C. We find that DSBR consistently stabilizes test-time adaptation, prevents model collapse, and matches or outperforms state-of-the-art methods. Moreover, DSBR operates solely at test-time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that entropy minimization (EM) for test-time adaptation (TTA) fails via model collapse because distribution shifts merge feature clusters of distinct classes while the decision boundary remains fixed, inducing prediction bias (skewed predicted class distribution) that EM then amplifies by tightening clusters. It introduces Distribution Shift Bias Reduction (DSBR), which equalizes each predicted class's contribution to the unsupervised entropy loss to target this mechanism. Experiments on four medical imaging datasets plus ImageNet-C show DSBR stabilizes TTA, prevents collapse, and matches or exceeds SOTA methods while operating only at test time.

Significance. If the mechanistic account and DSBR's targeted mitigation hold under the reported conditions, the result is significant for reliable TTA deployment in medical imaging, where shifts are common and collapse is costly. The test-time-only nature and multi-dataset evaluation (including real medical data) are practical strengths; the work directly addresses an identified empirical failure mode rather than proposing a generic regularizer.

major comments (3)
  1. [§3.1] §3.1 (mechanism description): The account that merged clusters plus fixed boundary induce prediction bias which EM amplifies is presented descriptively without an isolation experiment (e.g., controlled synthetic data with measured inter-cluster distances before/after EM) or gradient analysis showing the amplification step; this premise is load-bearing for both the diagnosis and the design of DSBR.
  2. [§4.1] §4.1, DSBR objective: The equalizing term is motivated as directly countering class skew, yet no derivation or ablation demonstrates that it avoids side effects such as reduced adaptation speed on already-balanced shifts or new failure modes when cluster merging is absent; the central claim that DSBR 'specifically targets this failure mode' therefore rests on an untested assumption.
  3. [Table 3] Table 3 (medical dataset results): While DSBR is reported to prevent collapse where EM fails, the tables do not include variance across random seeds, frequency of collapse events, or statistical tests; without these, the claim of 'consistent stabilization' and outperformance cannot be assessed as load-bearing evidence.
minor comments (3)
  1. [Abstract] The four medical imaging datasets are referenced in the abstract and §5 but not named until later; early explicit listing would improve readability.
  2. [§4] Notation for the entropy loss and the DSBR correction term is introduced piecewise; a single consolidated equation block would reduce ambiguity.
  3. [Figure 2] Figure 2 (cluster visualizations) would benefit from quantitative metrics (e.g., cluster purity or silhouette score) alongside qualitative plots.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive review and for recognizing the practical significance of addressing model collapse in test-time adaptation for medical imaging. We address each major comment below and will incorporate revisions to strengthen the empirical support for the mechanism, the DSBR objective, and the reported results.

read point-by-point responses
  1. Referee: [§3.1] §3.1 (mechanism description): The account that merged clusters plus fixed boundary induce prediction bias which EM amplifies is presented descriptively without an isolation experiment (e.g., controlled synthetic data with measured inter-cluster distances before/after EM) or gradient analysis showing the amplification step; this premise is load-bearing for both the diagnosis and the design of DSBR.

    Authors: We acknowledge that §3.1 presents the mechanism through descriptive analysis supported by t-SNE visualizations of merged clusters and prediction skew on the medical datasets, rather than a fully isolated controlled experiment. To directly address this, we will add a synthetic experiment using controlled Gaussian mixture models in the revision: we will measure inter-cluster distances and prediction bias before/after EM, and include a gradient analysis of the entropy loss under biased class distributions to isolate the amplification effect. revision: yes

  2. Referee: [§4.1] §4.1, DSBR objective: The equalizing term is motivated as directly countering class skew, yet no derivation or ablation demonstrates that it avoids side effects such as reduced adaptation speed on already-balanced shifts or new failure modes when cluster merging is absent; the central claim that DSBR 'specifically targets this failure mode' therefore rests on an untested assumption.

    Authors: The DSBR term is introduced to equalize per-class contributions to the entropy loss precisely when prediction bias is present. While the primary experiments target shifted medical data where collapse occurs, we agree that explicit checks for side effects are needed. In revision we will add ablations on balanced (non-shifted) settings and on synthetic data without cluster merging, reporting adaptation speed and any new instabilities to confirm the targeted nature of the mitigation. revision: yes

  3. Referee: [Table 3] Table 3 (medical dataset results): While DSBR is reported to prevent collapse where EM fails, the tables do not include variance across random seeds, frequency of collapse events, or statistical tests; without these, the claim of 'consistent stabilization' and outperformance cannot be assessed as load-bearing evidence.

    Authors: We agree that variance, collapse frequency, and statistical tests are required for robust evaluation of stabilization claims. In the revised manuscript we will rerun all medical-dataset experiments across five random seeds, report mean and standard deviation in Table 3, tabulate the observed frequency of collapse events per method, and include paired statistical tests (e.g., t-tests) against baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central contribution is an empirical description of a failure mode (merged clusters inducing prediction bias that EM then amplifies) followed by the introduction of a new corrective objective DSBR that equalizes class contributions in the entropy loss. No equations, fitted parameters, or self-citations are presented that reduce the claimed result to its own inputs by construction. The method is motivated by observed behavior on medical imaging datasets and evaluated externally rather than defined tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of entropy minimization as a base objective and the empirical observation of prediction bias; no free parameters, additional axioms, or invented entities are specified beyond standard unsupervised learning assumptions.

axioms (1)
  • domain assumption Entropy minimization is a suitable starting objective for test-time adaptation whose failure modes can be diagnosed via feature cluster behavior.
    Invoked throughout the abstract as the dominant objective whose collapse is being analyzed.

pith-pipeline@v0.9.1-grok · 5755 in / 1238 out tokens · 29928 ms · 2026-06-28T15:55:25.895113+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

93 extracted references · 18 canonical work pages · 5 internal anchors

  1. [1]

    Agarwal, Z

    S. Agarwal, Z. Zhang, L. Yuan, J. Han, and H. Peng. The unreasonable effectiveness of entropy minimization in LLM reasoning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=UfFTBEsLgI

  2. [2]

    A. S. Alsolami, W. Shalash, W. Alsaggaf, S. Ashoor, H. Refaat, and M. Elmogy. King abdulaziz university breast cancer mammogram dataset (kau-bcmd).Data, 6(11):111, 2021

  3. [3]

    Bakas, H

    S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, and C. Davatzikos. Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features.Scientific data, 4(1):170117, 2017

  4. [4]

    Balendran, C

    A. Balendran, C. Beji, F. Bouvier, O. Khalifa, T. Evgeniou, P. Ravaud, and R. Porcher. A scoping review of robustness concepts for machine learning in healthcare.npj Digital Medicine, 8(1):38, 2025

  5. [5]

    Y . Bar, S. Shaer, and Y . Romano. Protected test-time adaptation via online entropy matching: A betting approach. InAdvances in Neural Information Processing Systems, 2024

  6. [6]

    Bernhardt, F

    M. Bernhardt, F. D. S. Ribeiro, and B. Glocker. Failure detection in medical image classification: A reality check and benchmarking testbed.Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URLhttps://openreview.net/forum?id=VBHuLfnOMf

  7. [7]

    Calabrese, J

    E. Calabrese, J. E. Villanueva-Meyer, J. D. Rudie, A. M. Rauschecker, U. Baid, S. Bakas, S. Cha, J. T. Mongan, and C. P. Hess. The university of california san francisco preoperative diffuse glioma mri dataset.Radiology: Artificial Intelligence, 4(6):e220058, 2022

  8. [8]

    M. J. Cardoso, W. Li, R. Brown, N. Ma, E. Kerfoot, Y . Wang, B. Murrey, A. Myronenko, C. Zhao, D. Yang, et al. Monai: An open-source framework for deep learning in healthcare. arXiv preprint arXiv:2211.02701, 2022

  9. [9]

    D. Chen, D. Wang, T. Darrell, and S. Ebrahimi. Contrastive test-time adaptation. InIEEE Conference on Computer Vision and Pattern Recognition, pages 295–305, 2022

  10. [10]

    H. Chen, G. Zheng, A. Awadallah, and Y . Ji. Pathologies of pre-trained language models in few-shot fine-tuning. InProceedings of the Third Workshop on Insights from Negative Results in NLP, pages 144–153, 2022

  11. [11]

    C. Cui, L. Li, H. Cai, Z. Fan, L. Zhang, T. Dan, J. Li, and J. Wang. The chinese mammography database (cmmd): An online mammography database with biopsy confirmed types for machine diagnosis of breast.The Cancer Imaging Archive, 1, 2021

  12. [12]

    H. Dang, T. Tran, S. Osher, H. Tran-The, N. Ho, and T. Nguyen. Neural collapse in deep linear networks: from balanced to imbalanced data. InProceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  13. [13]

    M. Z. Darestani, J. Liu, and R. Heckel. Test-time training can close the natural distribution shift performance gap in deep learning based compressed sensing. InInternational Conference on Machine Learning, 2022

  14. [14]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. InIEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

  15. [15]

    R. Deng, W. Bao, T. Wei, and J. He. Panda: Test-time adaptation with negative data augmenta- tion. InProceedings of the AAAI Conference on Artificial Intelligence, 2026

  16. [16]

    Dohmatob, Y

    E. Dohmatob, Y . Feng, and J. Kempe. Model collapse demystified: The case of regression. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=bioHNTRnQk

  17. [17]

    Dohmatob, Y

    E. Dohmatob, Y . Feng, A. Subramonian, and J. Kempe. Strong model collapse. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id=et5l9qPUhm. 11

  18. [18]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2020

  19. [19]

    Evans and D

    H. Evans and D. Snead. Understanding the errors made by artificial intelligence algorithms in histopathology in terms of patient impact.NPJ Digital Medicine, 7(1):89, 2024

  20. [20]

    Foret, A

    P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur. Sharpness-aware minimization for efficiently improving generalization.International Conference on Learning Representations (ICLR), 2021

  21. [21]

    Learning by Surprise: Adaptive Mitigation of Model Collapse in Large Language Models

    D. Gambetta, G. Gezici, F. Giannotti, D. Pedreschi, A. Knott, and L. Pappalardo. Learning by surprise: Surplexity for mitigating model collapse in generative ai, 2025. URL https: //arxiv.org/abs/2410.12341

  22. [22]

    N. P. Gaurav Bhole, S Suba. Mammo-bench: A large-scale benchmark dataset of mammography images. 2025

  23. [23]

    Goyal, M

    S. Goyal, M. Sun, A. Raghunathan, and J. Z. Kolter. Test time adaptation via conjugate pseudo-labels. InAdvances in Neural Information Processing Systems, 2022

  24. [24]

    Gretton, A

    A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, B. Schölkopf, et al. Covariate shift by kernel mean matching.Dataset shift in machine learning, 3(4):5, 2009

  25. [25]

    Gulrajani and D

    I. Gulrajani and D. Lopez-Paz. In search of lost domain generalization. InInternational Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=lQdXeXDoWtI

  26. [26]

    C. Guo, G. Pleiss, Y . Sun, and K. Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017

  27. [27]

    Y . Guo, M. Guo, J. Su, Z. Yang, M. Zhu, H. Li, M. Qiu, and S. S. Liu. Bias in large language models: Origin, evaluation, and mitigation.arXiv preprint arXiv:2411.10915, 2024

  28. [28]

    J. Han, J. Na, and W. Hwang. Ranked entropy minimization for continual test-time adap- tation. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=lHaGLJ65J9

  29. [29]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InIEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

  30. [30]

    Hendrycks and T

    D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common cor- ruptions and perturbations.International Conference on Learning Representations (ICLR), 2019

  31. [31]

    Hoffman, E

    J. Hoffman, E. Tzeng, T. Park, J.-Y . Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. InInternational Conference on Machine Learning, 2018

  32. [32]

    Huang, A

    J. Huang, A. Gretton, K. Borgwardt, B. Schölkopf, and A. Smola. Correcting sample selection bias by unlabeled data.Advances in neural information processing systems, 19, 2006

  33. [33]

    Iwasawa and Y

    Y . Iwasawa and Y . Matsuo. Test-time classifier adjustment module for model-agnostic domain generalization. InAdvances in Neural Information Processing Systems, volume 34, 2021

  34. [34]

    Karani, E

    N. Karani, E. Erdil, K. Chaitanya, and E. Konukoglu. Test-time adaptable neural networks for robust medical image segmentation.Medical Image Analysis, 68:101907, 2021

  35. [35]

    Kazdan, R

    J. Kazdan, R. Schaeffer, A. Dey, M. Gerstgrasser, R. Rafailov, D. L. Donoho, and S. Koyejo. Collapse or thrive? perils and promises of synthetic data in a self-generating world.arXiv preprint arXiv:2410.16713, 2024

  36. [36]

    Khaled, M

    R. Khaled, M. Helal, O. Alfarghaly, O. Mokhtar, A. Elkorany, H. El Kassas, and A. Fahmy. Cate- gorized digital database for low energy and subtracted contrast enhanced spectral mammography images.The Cancer Imaging Archive, 16, 2021. 12

  37. [37]

    Kilim, A

    O. Kilim, A. Olar, T. Joó, T. Palicz, P. Pollner, and I. Csabai. Physical imaging parameter variation drives domain shift.Scientific Reports, 12(1):21302, 2022

  38. [38]

    E. Kim, M. Sun, C. Baek, A. Raghunathan, and J. Z. Kolter. Test-time adaptation induces stronger accuracy and agreement-on-the-line. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id= giXUx4VH9t

  39. [39]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  40. [40]

    P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Ya- sunaga, R. L. Phillips, I. Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, 2021

  41. [41]

    Kopans, R

    D. Kopans, R. Moore, K. Bowyer, P. Kegelmeyer, and P. Shile. Ddsm: digital database for screening mammography, 2020

  42. [42]

    Lee et al

    D.-H. Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. InWorkshop on challenges in representation learning, International Conference on Machine Learning (ICML), page 896, 2013

  43. [43]

    J. Lee, D. Das, J. Choo, and S. Choi. Towards open-set test-time adaptation utilizing the wisdom of crowds in entropy minimization. InIEEE International Conference on Computer Vision, 2023

  44. [44]

    J. Lee, D. Jung, S. Lee, J. Park, J. Shin, U. Hwang, and S. Yoon. Entropy is not enough for test-time adaptation: From the perspective of disentangled factors. InInternational Conference on Learning Representations, 2024

  45. [45]

    Lekadir, A

    K. Lekadir, A. F. Frangi, A. R. Porras, B. Glocker, C. Cintas, C. P. Langlotz, E. Weicken, F. W. Asselbergs, F. Prior, G. S. Collins, et al. Future-ai: international consensus guideline for trustworthy and deployable artificial intelligence in healthcare.bmj, 388, 2025

  46. [46]

    Liang, R

    J. Liang, R. He, and T. Tan. A comprehensive survey on test-time adaptation under distribution shifts.International Journal of Computer Vision, 133(1):31–64, 2025

  47. [47]

    Lin and Y

    R. Lin and Y . You. Optimizing class-level probability reweighting coefficients for equitable prompting accuracy.arXiv preprint arXiv:2405.07623, 2024

  48. [48]

    Lipton, Y .-X

    Z. Lipton, Y .-X. Wang, and A. Smola. Detecting and correcting for label shift with black box predictors. InInternational conference on machine learning, pages 3122–3130. PMLR, 2018

  49. [49]

    Y . Liu, P. Kothari, B. van Delft, B. Bellot-Gurlet, T. Mordan, and A. Alahi. Ttt++: When does self-supervised test-time training fail or thrive? InAdvances in Neural Information Processing Systems, volume 34, 2021

  50. [50]

    D. N. Louis, A. Perry, P. Wesseling, D. J. Brat, I. A. Cree, D. Figarella-Branger, C. Hawkins, H. Ng, S. M. Pfister, G. Reifenberger, et al. The 2021 who classification of tumors of the central nervous system: a summary.Neuro-oncology, 23(8):1231–1251, 2021

  51. [51]

    R. A. Marsden, M. Döbler, and B. Yang. Universal test-time adaptation through weight ensembling, diversity weighting, and prior correction. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2555–2565, 2024

  52. [52]

    I. C. Moreira, I. Amaral, I. Domingues, A. Cardoso, M. J. Cardoso, and J. S. Cardoso. Inbreast: toward a full-field digital mammographic database.Academic radiology, 19(2):236–248, 2012

  53. [53]

    Mounsaveng, F

    S. Mounsaveng, F. Chiaroni, M. Boudiaf, M. Pedersoli, and I. Ben Ayed. Bag of tricks for fully test-time adaptation. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1936–1945, 2024

  54. [54]

    Z. Nado, S. Padhy, D. Sculley, A. D’Amour, B. Lakshminarayanan, and J. Snoek. Evaluat- ing prediction-time batch normalization for robustness under covariate shift.arXiv preprint arXiv:2006.10963, 2020. 13

  55. [55]

    S. Niu, J. Wu, Y . Zhang, Y . Chen, S. Zheng, P. Zhao, and M. Tan. Efficient test-time model adaptation without forgetting. InInternational Conference on Machine Learning, 2022

  56. [56]

    S. Niu, J. Wu, Y . Zhang, Z. Wen, Y . Chen, P. Zhao, and M. Tan. Towards stable test-time adaptation in dynamic wild world. InInternational Conference on Learning Representations, 2023

  57. [57]

    P. Oza, R. Oza, U. Oza, P. Sharma, S. Patel, P. Kumar, and B. Gohel. Digital mam- mography Dataset for Breast Cancer Diagnosis Research (DMID). 11 2023. doi: 10.6084/m9.figshare.24522883.v2. URL https://figshare.com/articles/dataset/_b_ Digital_mammography_Dataset_for_Breast_Cancer_Diagnosis_Research_DMID_ b_DMID_rar/24522883

  58. [58]

    Pandey, A

    P. Pandey, A. K. Tyagi, S. Ambekar, and A. Prathosh. Unsupervised domain adaptation for semantic segmentation of nir images through generative latent search. InComputer Vision– ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 413–429. Springer, 2020

  59. [59]

    X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko. Visda: The visual domain adaptation challenge.arXiv preprint arXiv:1710.06924, 2017

  60. [60]

    Press, R

    O. Press, R. Shwartz-Ziv, Y . LeCun, and M. Bethge. The entropy enigma: Success and failure of entropy minimization.arXiv preprint arXiv:2405.05012, 2024

  61. [61]

    Roschewitz, G

    M. Roschewitz, G. Khara, J. Yearsley, N. Sharma, J. J. James, É. Ambrózay, A. Heroux, P. Kecskemethy, T. Rijken, and B. Glocker. Automatic correction of performance drift under acquisition shift in medical image classification.Nature Communications, 14(1):6608, 2023

  62. [62]

    Schaeffer, J

    R. Schaeffer, J. Kazdan, A. C. Arulandu, and S. Koyejo. Position: Model collapse does not mean what you think.arXiv preprint arXiv:2503.03150, 2025

  63. [63]

    Schneider, E

    S. Schneider, E. Rusak, L. Eck, O. Bringmann, W. Brendel, and M. Bethge. Improving robustness against common corruptions by covariate shift adaptation. InAdvances in Neural Information Processing Systems, volume 33, pages 11539–11551, 2020

  64. [64]

    Scholz, A

    D. Scholz, A. C. Erdur, J. A. Buchner, J. C. Peeken, D. Rueckert, and B. Wiestler. Imbalance- aware loss functions improve medical image classification. InMedical imaging with deep learning, 2024

  65. [65]

    Seto, B.-J

    S. Seto, B.-J. Theobald, F. Danieli, N. Jaitly, and D. Busbridge. Realm: Robust entropy adaptive loss minimization for improved single-sample test-time adaptation, 2023. URL https://arxiv.org/abs/2309.03964

  66. [66]

    D. S. Shah, H. A. Schwartz, and D. Hovy. Predictive biases in natural language processing models: A conceptual framework and overview. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 5248–5264, 2020

  67. [67]

    C. E. Shannon. A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

  68. [68]

    L. Shi, M. Wu, H. Zhang, Z. Zhang, M. Tao, and Q. Qu. A closer look at model collapse: From a generalization-to-memorization perspective.arXiv preprint arXiv:2509.16499, 2025

  69. [69]

    Shimodaira

    H. Shimodaira. Improving predictive inference under covariate shift by weighting the log- likelihood function.Journal of statistical planning and inference, 90(2):227–244, 2000

  70. [70]

    Stacke, G

    K. Stacke, G. Eilertsen, J. Unger, and C. Lundström. Measuring domain shift for deep learning in histopathology.IEEE journal of biomedical and health informatics, 25(2):325–336, 2020

  71. [71]

    Y . Su, Y . Ji, J. Li, H. Ye, and M. Zhang. Beware of model collapse! fast and stable test-time adaptation for robust question answering. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12998–13011, 2023. 14

  72. [72]

    Y . Su, X. Xu, and K. Jia. Towards real-world test-time adaptation: Tri-net self-training with balanced normalization. InAAAI Conference on Artificial Intelligence, 2024

  73. [73]

    Sugiyama, M

    M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation.Journal of Machine Learning Research, 8(5), 2007

  74. [74]

    T. Sun, T. Meng, and Y . Liu. Camelyon 17 challenge: A comparison of traditional machine learning (svm) with the deep learning method.Wireless Communications and Mobile Computing, 2022(1):9910471, 2022

  75. [75]

    Y . Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt. Test-time training with self- supervision for generalization under distribution shifts. InInternational Conference on Machine Learning, 2020

  76. [76]

    Van der Laak, G

    J. Van der Laak, G. Litjens, and F. Ciompi. Deep learning in histopathology: the path to the clinic.Nature medicine, 27(5):775–784, 2021

  77. [77]

    S. R. van der V oort, F. Incekara, M. M. Wijnenga, G. Kapsas, R. Gahrmann, J. W. Schouten, H. J. Dubbink, A. J. Vincent, M. J. van den Bent, P. J. French, et al. The erasmus glioma database (egd): Structural mri scans, who 2016 subtypes, and segmentations of 774 patients with glioma.Data in brief, 37:107191, 2021

  78. [78]

    D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell. Tent: Fully test-time adaptation by entropy minimization. InInternational Conference on Learning Representations (ICLR), 2021

  79. [79]

    G. Wang, F. Lyu, and C. Ding. Partition-then-adapt: Combating prediction bias for reliable multi-modal test-time adaptation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=T6RkYsuoMW

  80. [80]

    Wang and W

    M. Wang and W. Deng. Deep visual domain adaptation: A survey.Neurocomputing, 2018

Showing first 80 references.