Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure
Pith reviewed 2026-05-22 06:48 UTC · model grok-4.3
The pith
Frozen vision models make noisy-label training a regime-aware selection task because the inherited small-loss rule fails when features are frozen.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 150 conditions there is no universal winner among noisy-label methods when the backbone remains frozen. Loss distributions of clean and noisy samples overlap by 53 to 61 percent under frozen DINOv2 features. Matched-rate clean-sample detection shows prediction agreement loses only 3 percentage points of precision under asymmetric noise while loss ranking loses 13 points. On ISIC2019 with 40 percent asymmetric noise, one method reaches 68 percent overall accuracy yet drops to 35.1 percent balanced accuracy with zero recall on three minority classes. These patterns recast the problem as one of choosing the right method for the observed noise regime rather than seeking a single dominant
What carries the argument
A 150-condition benchmark that directly measures loss-distribution overlap and the relative stability of loss ranking versus prediction agreement for clean-sample detection under frozen features.
If this is right
- The performance cost of choosing the wrong method rises from roughly 4.5 to 18.8 percentage points as noise becomes both heavier and asymmetric.
- Prediction agreement identifies reliable training samples more consistently than loss ranking once noise is asymmetric.
- Methods that look acceptable by overall accuracy can still erase all recall on minority classes in imbalanced medical data.
Where Pith is reading between the lines
- A practical selector could first estimate loss overlap or noise asymmetry on a small validation slice and then route to agreement-based or loss-based training accordingly.
- The same overlap pattern may appear in other domains that rely on frozen foundation models, so the benchmark design could be reused outside medical imaging.
- Developers might add an explicit class-balance check before applying any loss-threshold method to avoid the zero-recall failure mode shown here.
Load-bearing premise
The small-loss assumption that worked for end-to-end training remains valid when only a lightweight head is trained on top of frozen foundation-model features.
What would settle it
A direct measurement showing that loss-based clean-sample detection retains high precision with little drop under 40 percent asymmetric noise on ISIC2019 with frozen DINOv2 features would refute the reported overlap and stability gap.
Figures
read the original abstract
Frozen Vision Foundation Models (VFMs) with lightweight classification heads are increasingly used in medical imaging because they offer efficient and reproducible deployment. Yet noisy-label learning methods for this frozen-feature regime remain poorly understood, and most existing methods still rely on a small-loss assumption inherited from end-to-end training. We present a controlled benchmark of eight noisy-label methods across five medical datasets, three backbones, two noise types, and five noise rates (150 conditions, 6,000 training runs), evaluated with balanced accuracy. The benchmark shows that there is no universal winner: Friedman ranking over the 150 conditions yields $\chi^2 = 333.2$ ($p = 4.77 \times 10^{-68}$), ELR wins the most conditions (49/150), while CUFIT attains the best mean rank (2.51). The practical cost of method choice grows sharply with noise severity, from 4.5pp on clean data to 18.8pp at asymmetric 40\% noise. To explain these benchmark-level patterns, we revisit the small-loss assumption in a representative high-risk regime. Under frozen DINOv2 features, clean and noisy loss distributions overlap by 53--61\%, and matched-rate clean-sample detection shows that prediction agreement is markedly more stable than loss ranking under asymmetric noise (3pp vs.\ 13pp precision drop). On ISIC2019 with asymmetric 40\% noise, Co-Teaching reaches 68\% overall accuracy while collapsing to 35.1\% balanced accuracy with zero recall on three minority classes. Together, these results recast noisy-label learning for frozen VFMs as a regime-aware method-selection problem rather than a search for a single dominant algorithm. We conclude with evidence-based guidance and a low-regret feature-space selector for practical recommendation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that noisy-label learning for frozen vision foundation models (VFMs) is a regime-aware method-selection problem rather than a search for a universal algorithm. This is supported by a benchmark of eight methods across five medical datasets, three backbones, two noise types, and five noise rates (150 conditions, 6000 runs) showing no single winner via Friedman ranking (χ² = 333.2, p = 4.77×10^{-68}), with ELR winning the most conditions (49/150) and CUFIT the best mean rank (2.51). A case study under frozen DINOv2 features reports 53–61% overlap between clean and noisy loss distributions, shows prediction agreement is more stable than loss ranking for clean-sample detection (3pp vs. 13pp precision drop), and demonstrates Co-Teaching collapsing from 68% overall accuracy to 35.1% balanced accuracy with zero recall on minority classes under 40% asymmetric noise on ISIC2019.
Significance. If the results hold, the work is significant for medical imaging applications of frozen VFMs by providing large-scale empirical evidence against direct transfer of end-to-end noisy-label techniques and by including statistical validation (Friedman test with p-value) plus a targeted case study on loss overlap and detection stability. Credit is given for the use of balanced accuracy across imbalanced datasets, the scale of 6000 runs, and the explicit guidance for practical method selection.
major comments (1)
- [Case study section] Case study section: the claim of 53–61% clean/noisy loss overlap and the 3pp vs. 13pp precision drop for agreement-based vs. loss-based detection under asymmetric noise is load-bearing for challenging the small-loss assumption, yet the exact loss-distribution computation, noise-injection procedure, and matched-rate detection algorithm are described at insufficient granularity to allow independent verification or reproduction of these key numbers.
minor comments (1)
- [Abstract] The abstract states the benchmark scale but could explicitly list the three backbones and two noise types for immediate context.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for minor revision. We appreciate the recognition of the benchmark scale, statistical validation, and the practical implications for medical imaging with frozen VFMs. We address the single major comment below.
read point-by-point responses
-
Referee: [Case study section] Case study section: the claim of 53–61% clean/noisy loss overlap and the 3pp vs. 13pp precision drop for agreement-based vs. loss-based detection under asymmetric noise is load-bearing for challenging the small-loss assumption, yet the exact loss-distribution computation, noise-injection procedure, and matched-rate detection algorithm are described at insufficient granularity to allow independent verification or reproduction of these key numbers.
Authors: We agree that the case-study description requires greater granularity to support independent reproduction. In the revised manuscript we will expand the relevant subsection (and add an appendix if space is limited) with: (1) the precise loss function and normalization used to compute the clean/noisy loss distributions on frozen DINOv2 features, (2) the exact label-flip probabilities and sampling procedure for generating asymmetric noise at each rate, and (3) a step-by-step description (with pseudocode) of the matched-rate clean-sample detection procedure, including how prediction agreement is measured across epochs and how precision is evaluated for both loss-ranking and agreement-based selectors. These additions will directly enable verification of the reported 53–61 % overlap and the 3 pp vs. 13 pp precision difference. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper reports an empirical benchmark of eight noisy-label methods over 150 conditions on five datasets with statistical tests (Friedman χ²) and a case study measuring loss overlaps and balanced accuracy under asymmetric noise. All central claims rest on direct experimental measurements and standard statistical reporting across independent datasets rather than any derivation, fitted parameter renamed as prediction, or self-referential equation. No load-bearing step reduces by construction to its inputs, and the work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math The Friedman test is appropriate for comparing method rankings across the 150 experimental conditions.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
loss distributions of clean and noisy samples overlap by 53–61% under frozen DINOv2 features... small-loss criterion fundamentally fails
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien
Devansh Arpit, Stanisław Jastrz˛ ebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. InInternational Conference on Machine Learning (ICML), 2017
work page 2017
-
[2]
Noel Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (ISIC). InarXiv preprint arXiv:1902.03368, 2019
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[3]
Towards un- derstanding deep learning from noisy labels with small-loss criterion
Xian-Jin Gui, Wei Wang, and Zhang-Hao Tian. Towards un- derstanding deep learning from noisy labels with small-loss criterion. InProceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI), pages 2469– 2475, 2021
work page 2021
-
[4]
Co- teaching: Robust training of deep neural networks with ex- tremely noisy labels
Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co- teaching: Robust training of deep neural networks with ex- tremely noisy labels. InAdvances in Neural Information Processing Systems (NeurIPS), 2018
work page 2018
-
[5]
Junnan Li, Richard Socher, and Steven C.H. Hoi. Di- videmix: Learning with noisy labels as semi-supervised learning. InInternational Conference on Learning Repre- sentations (ICLR), 2020
work page 2020
-
[6]
Early-learning regularization pre- vents memorization of noisy labels
Sheng Liu, Jonathan Niles-Weed, Narges Razavian, and Car- los Fernandez-Granda. Early-learning regularization pre- vents memorization of noisy labels. InAdvances in Neural Information Processing Systems (NeurIPS), 2020
work page 2020
-
[7]
Rafael Müller, Simon Kornblith, and Geoffrey Hinton. When does label smoothing help? InAdvances in Neural Informa- tion Processing Systems (NeurIPS), 2019
work page 2019
-
[8]
Dhillon, Pradeep Raviku- mar, and Ambuj Tewari
Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Raviku- mar, and Ambuj Tewari. Learning with noisy labels. InAd- vances in Neural Information Processing Systems (NeurIPS), 2013
work page 2013
-
[9]
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervi- sion.Transactions on Machine Learning Research, 2024
work page 2024
-
[10]
Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The HAM10000 dataset, a large collection of multi-source der- matoscopic images of common pigmented skin lesions.Sci- entific Data, 5:180161, 2018
work page 2018
-
[11]
Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. MedM- NIST v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification.Scientific Data, 10:41, 2023
work page 2023
-
[12]
Yeonguk Yu, Minhwan Ko, Sungho Shin, Kangmin Kim, and Kyoobin Lee. Curriculum fine-tuning of vision foun- dation model for medical image classification under label noise. InAdvances in Neural Information Processing Sys- tems (NeurIPS), 2024
work page 2024
-
[13]
When noisy labels meet long tail dilem- mas: A representation calibration method
Manyi Zhang, Xuyang Zhao, Jun Yao, Chun Yuan, and Weiran Huang. When noisy labels meet long tail dilem- mas: A representation calibration method. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 15890–15900, 2023
work page 2023
-
[14]
Shaoting Zhang and Dimitris N. Metaxas. On the challenges and perspectives of foundation models for medical image analysis.Medical Image Analysis, 91:102996, 2024
work page 2024
-
[15]
Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Rob Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Matthew P. Lungren, Tristan Naumann, Sheng Wang, and Hoifung Poon. BiomedCLIP: A multimodal biomedical foundation model pretrained from fift...
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.