arxiv: 2605.22591 · v1 · pith:5N7ME5S4new · submitted 2026-05-21 · 💻 cs.CV

Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure

Pith reviewed 2026-05-22 06:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords noisy label learning · frozen vision models · medical imaging · benchmark · small-loss assumption · asymmetric noise · balanced accuracy

The pith

Frozen vision models make noisy-label training a regime-aware selection task because the inherited small-loss rule fails when features are frozen.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a controlled test of eight noisy-label methods on five medical datasets, three frozen backbones, two noise types, and five noise rates for a total of 150 conditions. It reports that no method wins across all settings, with clear statistical separation in performance rankings, and that the gap between best and worst method widens sharply as noise grows more severe and asymmetric. The authors then examine why, by measuring how much the loss values of clean and noisy examples overlap under frozen features and by comparing loss ranking against prediction agreement for identifying reliable training samples. The results show that agreement stays more stable while loss-based selection degrades, and that some methods reach decent overall accuracy yet produce zero recall on minority classes. This matters for medical imaging work because frozen foundation models are chosen precisely for efficiency and reproducibility, so an unreliable training rule directly affects whether the deployed system handles class imbalance.

Core claim

Across 150 conditions there is no universal winner among noisy-label methods when the backbone remains frozen. Loss distributions of clean and noisy samples overlap by 53 to 61 percent under frozen DINOv2 features. Matched-rate clean-sample detection shows prediction agreement loses only 3 percentage points of precision under asymmetric noise while loss ranking loses 13 points. On ISIC2019 with 40 percent asymmetric noise, one method (Co-Teaching) reaches 68 percent overall accuracy yet drops to 35.1 percent balanced accuracy with zero recall on three minority classes. These patterns recast the problem as one of choosing the right method for the observed noise regime rather than seeking a single dominant algorithm.
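
One way to make the overlap figure concrete is a histogram intersection between the per-sample losses of clean and noisy examples. The sketch below is an illustration only: the paper's exact overlap definition is not given on this page, and the function name `loss_overlap` and the bin count are assumptions.

```python
import numpy as np

def loss_overlap(clean_losses, noisy_losses, bins=50):
    """Histogram-intersection overlap of two loss samples, in [0, 1].

    Assumption: overlap is measured as the shared mass of two normalized
    histograms on a common grid. The paper may use a different estimator.
    """
    clean = np.asarray(clean_losses, dtype=float)
    noisy = np.asarray(noisy_losses, dtype=float)
    # Use one shared set of bin edges so the two histograms are comparable.
    lo = min(clean.min(), noisy.min())
    hi = max(clean.max(), noisy.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(clean, bins=edges)
    q, _ = np.histogram(noisy, bins=edges)
    # Normalize counts to probabilities, then sum the bin-wise minimum.
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())
```

A value near 0 means a loss threshold can separate the two groups; a value above 0.5, as reported here, means most of the probability mass is shared and a threshold cannot separate them cleanly.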

What carries the argument

A 150-condition benchmark that directly measures loss-distribution overlap and the relative stability of loss ranking versus prediction agreement for clean-sample detection under frozen features.
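
The comparison of loss ranking against prediction agreement can be sketched as a precision measurement at a matched selection rate: both selectors keep the same number of samples, and precision is the fraction of kept samples whose labels are actually clean. This is a minimal sketch under assumptions. The agreement score (for example, the fraction of epochs in which the head's prediction matched the given label) and the function names are hypothetical, and the paper's own procedure may differ.

```python
import numpy as np

def precision_at_matched_rate(scores, is_clean, keep_fraction):
    """Keep the top `keep_fraction` of samples by score (higher = more
    trusted) and return the fraction of kept samples that are clean."""
    scores = np.asarray(scores, dtype=float)
    is_clean = np.asarray(is_clean, dtype=bool)
    k = max(1, int(round(keep_fraction * len(scores))))
    keep = np.argsort(-scores, kind="stable")[:k]
    return float(is_clean[keep].mean())

def compare_selectors(losses, agreement, is_clean, keep_fraction):
    """Evaluate both selectors at the same keep rate.

    `agreement` is a hypothetical per-sample score in [0, 1]; small loss is
    treated as high trust by negating the loss.
    """
    neg_loss = -np.asarray(losses, dtype=float)
    return {
        "loss_ranking": precision_at_matched_rate(neg_loss, is_clean, keep_fraction),
        "agreement": precision_at_matched_rate(agreement, is_clean, keep_fraction),
    }
```

Matching the keep rate matters because precision otherwise depends on how aggressively each selector filters, which would confound the comparison.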

If this is right

  • The performance cost of choosing the wrong method rises from roughly 4.5 to 18.8 percentage points as noise becomes both heavier and asymmetric.
  • Prediction agreement identifies reliable training samples more consistently than loss ranking once noise is asymmetric.
  • Methods that look acceptable by overall accuracy can still erase all recall on minority classes in imbalanced medical data.
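
The gap between overall and balanced accuracy follows directly from the definitions: overall accuracy weights every sample equally, while balanced accuracy is the unweighted mean of per-class recall. The sketch below uses toy data, not the paper's results, to show how a classifier that ignores minority classes can still score well on overall accuracy.

```python
import numpy as np

def per_class_recall(y_true, y_pred, num_classes):
    """Recall for each class; NaN for classes absent from y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            recalls[c] = (y_pred[mask] == c).mean()
    return recalls

def overall_and_balanced_accuracy(y_true, y_pred, num_classes):
    """Return (overall accuracy, balanced accuracy, per-class recall)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = per_class_recall(y_true, y_pred, num_classes)
    overall = float((y_true == y_pred).mean())
    balanced = float(np.nanmean(recalls))
    return overall, balanced, recalls

# Toy imbalanced set: 90 majority samples, two minority classes of 5 each.
y_true = [0] * 90 + [1] * 5 + [2] * 5
y_pred = [0] * 100  # a head that always predicts the majority class
overall, balanced, recalls = overall_and_balanced_accuracy(y_true, y_pred, 3)
```

In this toy case overall accuracy is 0.9 while balanced accuracy is one third, with zero recall on both minority classes.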

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A practical selector could first estimate loss overlap or noise asymmetry on a small validation slice and then route to agreement-based or loss-based training accordingly.
  • The same overlap pattern may appear in other domains that rely on frozen foundation models, so the benchmark design could be reused outside medical imaging.
  • Developers might add an explicit class-balance check before applying any loss-threshold method to avoid the zero-recall failure mode shown here.
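
The class-balance check in the last point could look like the following sketch: after any sample-selection step, compute the fraction of each class that survived and flag classes that were almost entirely filtered out. The function names and the `min_fraction` threshold are hypothetical and not taken from the paper.

```python
import numpy as np

def retained_fraction_per_class(labels, selected_mask, num_classes):
    """Fraction of each class kept by a selector; NaN for empty classes."""
    labels = np.asarray(labels)
    selected = np.asarray(selected_mask, dtype=bool)
    total = np.bincount(labels, minlength=num_classes)
    kept = np.bincount(labels[selected], minlength=num_classes)
    return np.divide(
        kept, total, out=np.full(num_classes, np.nan), where=total > 0
    )

def starved_classes(labels, selected_mask, num_classes, min_fraction=0.1):
    """Classes whose retained fraction falls below `min_fraction`.

    Assumption: 0.1 is an arbitrary illustrative threshold.
    """
    frac = retained_fraction_per_class(labels, selected_mask, num_classes)
    return [
        c for c in range(num_classes)
        if not np.isnan(frac[c]) and frac[c] < min_fraction
    ]
```

A non-empty result is a warning that the selector is discarding a class wholesale, which is the precondition for the zero-recall failure described above.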

Load-bearing premise

The small-loss assumption that worked for end-to-end training remains valid when only a lightweight head is trained on top of frozen foundation-model features.

What would settle it

A direct measurement showing that loss-based clean-sample detection retains high precision with little drop under 40 percent asymmetric noise on ISIC2019 with frozen DINOv2 features would refute the reported overlap and stability gap.

Figures

Figures reproduced from arXiv: 2605.22591 by Haoyu Wang, Zitong Li.

Figure 1: Loss distributions of clean (blue) and noisy (red) samples
Figure 2: Ablation of cascade stages. The first filtering stage (Only ...
Figure 3: Method ranking reversal between noise types. Under ...
Figure 4: Per-class recall heatmap under asymmetric 40% noise
Figure 5: Co-Teaching collapse is driven by class imbalance, not ...
Original abstract

Frozen Vision Foundation Models (VFMs) with lightweight classification heads are increasingly used in medical imaging because they offer efficient and reproducible deployment. Yet noisy-label learning methods for this frozen-feature regime remain poorly understood, and most existing methods still rely on a small-loss assumption inherited from end-to-end training. We present a controlled benchmark of eight noisy-label methods across five medical datasets, three backbones, two noise types, and five noise rates (150 conditions, 6,000 training runs), evaluated with balanced accuracy. The benchmark shows that there is no universal winner: Friedman ranking over the 150 conditions yields $\chi^2 = 333.2$ ($p = 4.77 \times 10^{-68}$), ELR wins the most conditions (49/150), while CUFIT attains the best mean rank (2.51). The practical cost of method choice grows sharply with noise severity, from 4.5pp on clean data to 18.8pp at asymmetric 40% noise. To explain these benchmark-level patterns, we revisit the small-loss assumption in a representative high-risk regime. Under frozen DINOv2 features, clean and noisy loss distributions overlap by 53–61%, and matched-rate clean-sample detection shows that prediction agreement is markedly more stable than loss ranking under asymmetric noise (3pp vs. 13pp precision drop). On ISIC2019 with asymmetric 40% noise, Co-Teaching reaches 68% overall accuracy while collapsing to 35.1% balanced accuracy with zero recall on three minority classes. Together, these results recast noisy-label learning for frozen VFMs as a regime-aware method-selection problem rather than a search for a single dominant algorithm. We conclude with evidence-based guidance and a low-regret feature-space selector for practical recommendation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that noisy-label learning for frozen vision foundation models (VFMs) is a regime-aware method-selection problem rather than a search for a universal algorithm. This is supported by a benchmark of eight methods across five medical datasets, three backbones, two noise types, and five noise rates (150 conditions, 6,000 runs) showing no single winner via Friedman ranking (χ² = 333.2, p = 4.77×10^{-68}), with ELR winning the most conditions (49/150) and CUFIT attaining the best mean rank (2.51). A case study under frozen DINOv2 features reports 53–61% overlap between clean and noisy loss distributions, shows that prediction agreement is more stable than loss ranking for clean-sample detection (3pp vs. 13pp precision drop under asymmetric noise), and shows Co-Teaching reaching 68% overall accuracy while collapsing to 35.1% balanced accuracy, with zero recall on three minority classes, under 40% asymmetric noise on ISIC2019.

Significance. If the results hold, the work is significant for medical imaging applications of frozen VFMs by providing large-scale empirical evidence against direct transfer of end-to-end noisy-label techniques and by including statistical validation (Friedman test with p-value) plus a targeted case study on loss overlap and detection stability. Credit is given for the use of balanced accuracy across imbalanced datasets, the scale of 6000 runs, and the explicit guidance for practical method selection.

major comments (1)
  1. [Case study section] The claim of 53–61% clean/noisy loss overlap and the 3pp vs. 13pp precision drop for agreement-based vs. loss-based detection under asymmetric noise is load-bearing for challenging the small-loss assumption, yet the exact loss-distribution computation, noise-injection procedure, and matched-rate detection algorithm are described at insufficient granularity to allow independent verification or reproduction of these key numbers.
minor comments (1)
  1. [Abstract] The abstract states the benchmark scale but could explicitly list the three backbones and two noise types for immediate context.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We appreciate the recognition of the benchmark scale, statistical validation, and the practical implications for medical imaging with frozen VFMs. We address the single major comment below.

Point-by-point responses
  1. Referee: [Case study section] The claim of 53–61% clean/noisy loss overlap and the 3pp vs. 13pp precision drop for agreement-based vs. loss-based detection under asymmetric noise is load-bearing for challenging the small-loss assumption, yet the exact loss-distribution computation, noise-injection procedure, and matched-rate detection algorithm are described at insufficient granularity to allow independent verification or reproduction of these key numbers.

    Authors: We agree that the case-study description requires greater granularity to support independent reproduction. In the revised manuscript we will expand the relevant subsection (and add an appendix if space is limited) with: (1) the precise loss function and normalization used to compute the clean/noisy loss distributions on frozen DINOv2 features, (2) the exact label-flip probabilities and sampling procedure for generating asymmetric noise at each rate, and (3) a step-by-step description (with pseudocode) of the matched-rate clean-sample detection procedure, including how prediction agreement is measured across epochs and how precision is evaluated for both loss-ranking and agreement-based selectors. These additions will directly enable verification of the reported 53–61% overlap and the 3pp vs. 13pp precision difference.

    revision: yes
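
For readers unfamiliar with the two noise types, the sketch below shows one common way to inject them. It is an assumption-laden illustration, not the authors' procedure: symmetric noise flips a label to any other class uniformly at random, and asymmetric noise flips each class to a single fixed target class. The default class-to-class mapping (`c -> c + 1`) is hypothetical; real benchmarks usually pick visually or clinically confusable pairs.

```python
import numpy as np

def inject_label_noise(labels, num_classes, rate, kind="symmetric",
                       flip_to=None, seed=0):
    """Return (noisy_labels, flipped_mask) for a given noise rate.

    Assumptions: each sample is flipped independently with probability
    `rate`; `flip_to[c]` gives the target class for asymmetric noise.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < rate
    if kind == "symmetric":
        # A non-zero offset modulo num_classes always yields a different class.
        offsets = rng.integers(1, num_classes, size=len(labels))
        noisy[flip] = (labels[flip] + offsets[flip]) % num_classes
    elif kind == "asymmetric":
        if flip_to is None:
            flip_to = (np.arange(num_classes) + 1) % num_classes
        noisy[flip] = np.asarray(flip_to)[labels[flip]]
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return noisy, flip
```

Under the asymmetric variant, every corrupted sample of a class lands in the same wrong class, so the corrupted samples form a consistent pattern that a classifier can fit with low loss. That is one intuition for why loss-based selection would degrade more under asymmetric noise.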

Circularity Check

0 steps flagged

No significant circularity detected

The paper reports an empirical benchmark of eight noisy-label methods over 150 conditions on five datasets with statistical tests (Friedman χ²) and a case study measuring loss overlaps and balanced accuracy under asymmetric noise. All central claims rest on direct experimental measurements and standard statistical reporting across independent datasets rather than any derivation, fitted parameter renamed as prediction, or self-referential equation. No load-bearing step reduces by construction to its inputs, and the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical benchmark study; no free parameters are fitted to produce the central claims, and no new entities are postulated. The work relies on standard statistical assumptions for ranking tests and balanced accuracy.

axioms (1)
  • [standard math] The Friedman test is appropriate for comparing method rankings across the 150 experimental conditions.
    Invoked to establish statistical significance of performance differences.
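
To show what the Friedman ranking summarizes, the sketch below computes per-method mean ranks, win counts, and the Friedman chi-square statistic from a conditions-by-methods score matrix. It is a simplified illustration that assumes no tied scores within a condition (so no tie correction is applied); the function name is hypothetical.

```python
import numpy as np

def friedman_from_scores(scores):
    """scores: array of shape (n_conditions, k_methods), higher is better.

    Returns (mean_ranks, chi2, wins). Assumes no ties within a row.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    # order[i, 0] is the best method in condition i, order[i, -1] the worst.
    order = np.argsort(-scores, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(n)[:, None]
    ranks[rows, order] = np.arange(1, k + 1)  # rank 1 = best
    mean_ranks = ranks.mean(axis=0)
    # Friedman statistic: scaled squared deviation of the mean ranks from
    # the value (k + 1) / 2 expected if all methods were equivalent.
    chi2 = 12.0 * n / (k * (k + 1)) * np.sum((mean_ranks - (k + 1) / 2.0) ** 2)
    wins = np.bincount(order[:, 0], minlength=k)
    return mean_ranks, float(chi2), wins
```

Win counts and mean ranks measure different things, which is how one method (ELR) can win the most conditions while another (CUFIT) has the best mean rank: a method that is often first but sometimes near last can lose on mean rank to one that is consistently second or third.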
