Benchmarking Real-World Medical Image Classification with Noisy Labels: Challenges, Practice, and Outlook
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-16 23:53 UTC · model grok-4.3
The pith
Existing methods for learning from noisy labels lose much of their effectiveness on real medical images under high noise and domain shifts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that existing learning-with-noisy-labels (LNL) methods degrade substantially under high and real-world noise in medical image classification, with class imbalance and domain variability posing persistent challenges. It introduces LNMBench as a unified framework evaluating ten representative methods across seven datasets, six modalities, and three noise patterns, and proposes a simple yet effective improvement to enhance model robustness under these conditions.
What carries the argument
LNMBench, a benchmark framework that applies ten LNL methods to seven medical datasets spanning six modalities with three noise patterns to assess robustness under realistic conditions.
If this is right
- Existing LNL methods show substantial performance degradation under high and real-world noise.
- Class imbalance and domain variability remain persistent challenges for noise-resilient algorithms in medical data.
- A simple yet effective improvement can enhance model robustness under high noise and domain variability.
- Public release of the LNMBench codebase supports standardized evaluation and reproducible research.
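If those claims hold, the benchmark's core structure is an evaluation grid over methods, datasets, and noise patterns. A minimal sketch of such a grid follows; every method and dataset name below is a placeholder, not the released LNMBench API:

```python
from itertools import product

# Placeholder identifiers for illustration; the actual names live in the
# released LNMBench codebase, not here.
methods = ["co-teaching", "dividemix", "sce", "elr"]             # a subset of the 10 LNL methods
datasets = ["derm", "cxr", "fundus", "path", "ct", "us", "mri"]  # stand-ins for the 7 datasets
noise_patterns = ["symmetric", "asymmetric", "instance-dependent"]

def evaluate(method, dataset, noise_pattern):
    # A real run would train `method` on `dataset` with `noise_pattern`
    # injected and return test accuracy; this stub just records the config.
    return {"method": method, "dataset": dataset, "noise": noise_pattern, "acc": None}

results = [evaluate(*cfg) for cfg in product(methods, datasets, noise_patterns)]
print(len(results))  # 4 methods x 7 datasets x 3 patterns = 84 runs in this subset
```

The full benchmark would sweep all ten methods and add noise-rate levels as a fourth axis of the same product.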
Where Pith is reading between the lines
- Future noisy label methods would likely benefit from built-in handling for medical domain shifts between equipment and sites.
- Extending the benchmark to cover real multi-expert annotation disagreements could provide a closer match to clinical label noise.
- Hybrid approaches that pair noise correction with domain adaptation techniques may offer a practical next step for medical applications.
Load-bearing premise
The seven datasets, six modalities, and three noise patterns chosen for the benchmark adequately represent the annotation inconsistencies and domain shifts encountered in real clinical practice.
What would settle it
Running the same ten methods plus the proposed improvement on a new medical dataset with independently measured high real-world noise from multiple observers and observing no substantial degradation would challenge the reported performance drops.
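The falsification test above reduces to a simple decision rule per method. One possible operationalization, where the 0.05 absolute-accuracy threshold is an assumption rather than a figure from the paper:

```python
def substantially_degraded(clean_acc, noisy_acc, threshold=0.05):
    """One way to operationalize 'substantial degradation': an absolute
    accuracy drop from clean to noisy labels exceeding `threshold`."""
    return (clean_acc - noisy_acc) > threshold

# Hypothetical outcomes on a new multi-observer dataset:
print(substantially_degraded(0.90, 0.72))  # True: consistent with the reported drops
print(substantially_degraded(0.90, 0.88))  # False: would challenge the benchmark's finding
```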
Original abstract
Learning from noisy labels remains a major challenge in medical image analysis, where annotation demands expert knowledge and substantial inter-observer variability often leads to inconsistent or erroneous labels. Despite extensive research on learning with noisy labels (LNL), the robustness of existing methods in medical imaging has not been systematically assessed. To address this gap, we introduce LNMBench, a comprehensive benchmark for Label Noise in Medical imaging. LNMBench encompasses 10 representative methods evaluated across 7 datasets, 6 imaging modalities, and 3 noise patterns, establishing a unified and reproducible framework for robustness evaluation under realistic conditions. Comprehensive experiments reveal that the performance of existing LNL methods degrades substantially under high and real-world noise, highlighting the persistent challenges of class imbalance and domain variability in medical data. Motivated by these findings, we further propose a simple yet effective improvement to enhance model robustness under such conditions. The LNMBench codebase is publicly released to facilitate standardized evaluation, promote reproducible research, and provide practical insights for developing noise-resilient algorithms in both research and real-world medical applications. The codebase is publicly available at https://github.com/myyy777/LNMBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LNMBench, a benchmark for assessing the robustness of learning with noisy labels (LNL) methods in medical image classification. It evaluates 10 LNL methods on 7 datasets across 6 modalities using 3 noise patterns, reports substantial performance degradation under high and real-world noise levels, highlights challenges from class imbalance and domain variability, proposes a simple improvement, and releases the codebase publicly.
Significance. If the benchmark's noise patterns and dataset selection prove representative, the work offers a valuable standardized framework for evaluating LNL methods in medical imaging, where annotation noise is common. The public codebase release is a clear strength that supports reproducibility and future research. The findings on degradation could inform more resilient algorithm design, though their impact depends on how closely the evaluated conditions match clinical practice.
major comments (2)
- [Section 4] Section 4 (Noise Patterns): The manuscript must explicitly state whether the 3 noise patterns are generated synthetically (e.g., symmetric/asymmetric/instance-dependent label flips) or derived from multi-rater annotations or observed inter-observer inconsistencies within the 7 datasets. This detail is load-bearing for the central claim of evaluating under 'real-world noise' in the title and abstract; without it, the observed degradation may not generalize to clinical annotation variability.
- [Results] Results section (implied by abstract claims): The comprehensive experiments are described as revealing 'substantial' degradation, yet the manuscript lacks specific quantitative tables, accuracy drops with standard deviations, or ablation isolating synthetic vs. observed noise. This weakens the ability to assess the magnitude and statistical reliability of the headline result on existing LNL methods.
minor comments (2)
- [Abstract] Abstract: The phrase 'simple yet effective improvement' is used without a brief derivation or pseudocode; adding one sentence on its core mechanism would improve clarity without altering scope.
- The claim of '6 imaging modalities' and '7 datasets' should include a short table or appendix listing the exact datasets, modalities, and class imbalance statistics to aid readers in assessing representativeness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps strengthen the clarity and impact of our work on LNMBench. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where feasible.
Point-by-point responses
-
Referee: [Section 4] Section 4 (Noise Patterns): The manuscript must explicitly state whether the 3 noise patterns are generated synthetically (e.g., symmetric/asymmetric/instance-dependent label flips) or derived from multi-rater annotations or observed inter-observer inconsistencies within the 7 datasets. This detail is load-bearing for the central claim of evaluating under 'real-world noise' in the title and abstract; without it, the observed degradation may not generalize to clinical annotation variability.
Authors: We appreciate this important clarification request. The three noise patterns described in Section 4 are synthetically generated using standard models (symmetric, asymmetric, and instance-dependent label noise) with parameters chosen to reflect typical annotation error patterns reported in medical imaging literature on inter-observer variability. They are not directly derived from multi-rater annotations or observed inconsistencies within the specific 7 datasets. We will revise Section 4 to explicitly detail the synthetic generation procedure, including exact flip probabilities and instance-dependent mechanisms, and add a discussion on how these patterns approximate real-world clinical noise while noting the limitations for direct generalization. This will better support the claims regarding real-world noise in the title and abstract. revision: yes
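As a rough sketch of the synthetic generation the response describes, symmetric and asymmetric label flips can be written as follows; the rates, class count, and transition map below are hypothetical illustrations, not the paper's parameters:

```python
import numpy as np

def symmetric_noise(labels, num_classes, rate, rng):
    """Flip each label, with probability `rate`, to a uniformly chosen *different* class."""
    noisy = labels.copy()
    flip = rng.random(len(labels)) < rate
    for i in np.where(flip)[0]:
        noisy[i] = rng.choice([c for c in range(num_classes) if c != labels[i]])
    return noisy

def asymmetric_noise(labels, transition, rate, rng):
    """Flip each label, with probability `rate`, to a fixed class-dependent target
    (mimicking confusions between visually similar classes)."""
    noisy = labels.copy()
    flip = rng.random(len(labels)) < rate
    noisy[flip] = np.array([transition[int(c)] for c in labels[flip]])
    return noisy

# Instance-dependent noise would condition the flip probability on the image
# features themselves; it is omitted from this sketch.
rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=1000)
y_sym = symmetric_noise(y, num_classes=5, rate=0.4, rng=rng)
y_asym = asymmetric_noise(y, {0: 1, 1: 0, 2: 3, 3: 2, 4: 4}, rate=0.3, rng=rng)
print(f"symmetric flip rate:  {(y != y_sym).mean():.2f}")
print(f"asymmetric flip rate: {(y != y_asym).mean():.2f}")  # below 0.3: class 4 maps to itself
```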
-
Referee: [Results] Results section (implied by abstract claims): The comprehensive experiments are described as revealing 'substantial' degradation, yet the manuscript lacks specific quantitative tables, accuracy drops with standard deviations, or ablation isolating synthetic vs. observed noise. This weakens the ability to assess the magnitude and statistical reliability of the headline result on existing LNL methods.
Authors: We agree that enhanced quantitative detail would improve interpretability. The Results section currently summarizes the degradation trends across the 10 methods, 7 datasets, and 3 noise patterns, but we will expand it with new tables reporting per-method accuracy drops (with standard deviations over multiple runs) under varying noise levels. For the requested ablation, we will add a comparison where feasible using any available multi-rater subsets in the datasets; in cases where observed noise data is unavailable, we will explicitly discuss this as a limitation and suggest it as future work. These additions will provide clearer evidence for the magnitude and reliability of the findings. revision: partial
- A full ablation isolating synthetic noise from observed multi-rater noise is not possible across all 7 datasets without new data collection, which exceeds the scope of this benchmarking study.
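A small sketch of the per-method summary the expanded tables would report, aggregating accuracy over repeated seeds; every accuracy value below is an invented placeholder, not an LNMBench result:

```python
import statistics

# Invented per-seed test accuracies for one method, clean labels vs. 40% noise.
clean_runs = [0.912, 0.905, 0.909]
noisy_runs = [0.731, 0.744, 0.720]

def mean_sd(runs):
    """Mean and sample standard deviation over independent seeds."""
    return statistics.mean(runs), statistics.stdev(runs)

clean_mean, clean_sd = mean_sd(clean_runs)
noisy_mean, noisy_sd = mean_sd(noisy_runs)
drop = clean_mean - noisy_mean
print(f"clean  {clean_mean:.3f} +/- {clean_sd:.3f}")
print(f"noisy  {noisy_mean:.3f} +/- {noisy_sd:.3f}")
print(f"drop   {drop:.3f}")
```

One such row per method, dataset, and noise level would give the quantitative table the referee asks for.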
Circularity Check
No significant circularity in empirical benchmarking study
full rationale
The paper introduces LNMBench as an empirical evaluation framework and reports performance degradation of existing LNL methods based on new runs across 7 public datasets, 6 modalities, and 3 noise patterns. No mathematical derivation chain exists; claims rest on direct experimental observations rather than quantities defined from fitted parameters, self-referential definitions, or load-bearing self-citations. The proposed improvement is motivated by the benchmark results but does not reduce to a tautology by construction. This is a standard empirical benchmarking paper whose central results are independently falsifiable via the released codebase and public data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The chosen 7 datasets and 3 noise patterns capture realistic medical annotation variability.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: "We introduce LNMBench... 10 representative methods evaluated across 7 datasets, 6 imaging modalities, and 3 noise patterns... performance of existing LNL methods degrades substantially under high and real-world noise"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.