pith. machine review for the scientific record. sign in

arxiv: 2512.09315 · v1 · submitted 2025-12-10 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

Benchmarking Real-World Medical Image Classification with Noisy Labels: Challenges, Practice, and Outlook

Authors on Pith no claims yet

Pith reviewed 2026-05-16 23:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords noisy labelsmedical image classificationbenchmarklearning with noisy labelsrobustness evaluationdomain variabilityclass imbalance
0
0 comments X

The pith

Existing methods for learning from noisy labels lose much of their effectiveness on real medical images under high noise and domain shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates LNMBench to test how well current noisy label techniques hold up when applied to medical scans that carry realistic errors and inconsistencies. Tests on seven datasets across six modalities and three noise patterns show clear drops in performance once noise reaches higher levels or images come from different clinical sources. These results highlight that class imbalance and variability between domains remain hard problems for existing approaches. The authors also describe a basic adjustment that improves resistance to such conditions. Making the evaluation code public supports more consistent checks and better algorithm development for practical medical use.

Core claim

The paper establishes that existing learning with noisy labels methods degrade substantially under high and real-world noise in medical image classification, with class imbalance and domain variability posing persistent challenges. It introduces LNMBench as a unified framework evaluating ten representative methods across seven datasets, six modalities, and three noise patterns, and proposes a simple yet effective improvement to enhance model robustness under these conditions.

What carries the argument

LNMBench, a benchmark framework that applies ten LNL methods to seven medical datasets spanning six modalities with three noise patterns to assess robustness under realistic conditions.

If this is right

  • Existing LNL methods show substantial performance degradation under high and real-world noise.
  • Class imbalance and domain variability remain persistent challenges for noise-resilient algorithms in medical data.
  • A simple yet effective improvement can enhance model robustness under high noise and domain variability.
  • Public release of the LNMBench codebase supports standardized evaluation and reproducible research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future noisy label methods would likely benefit from built-in handling for medical domain shifts between equipment and sites.
  • Extending the benchmark to cover real multi-expert annotation disagreements could provide a closer match to clinical label noise.
  • Hybrid approaches that pair noise correction with domain adaptation techniques may offer a practical next step for medical applications.

Load-bearing premise

The seven datasets, six modalities, and three noise patterns chosen for the benchmark adequately represent the annotation inconsistencies and domain shifts encountered in real clinical practice.

What would settle it

Running the same ten methods plus the proposed improvement on a new medical dataset with independently measured high real-world noise from multiple observers and observing no substantial degradation would challenge the reported performance drops.

Figures

Figures reproduced from arXiv: 2512.09315 by Chao Zhang, Haoran Xie, Junlin Hou, Lie Ju, Yuan Ma, Yukun Zhou, Zongyuan Ge.

Figure 1
Figure 1. Figure 1: (a) We present a systematic benchmark for learning with noisy labels in medical image analysis, encompassing 3 noise patterns, 7 datasets from 6 imaging modalities, and 10 representative methods. Integrating both synthetic and real-world noise across balanced and imbalanced datasets, the benchmark provides a unified and reproducible framework that evaluates not only classification accuracy but also robustn… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of class distributions across the datasets utilized in LNMBench. under clinically sourced label noise using the real-world noise datasets DRTiD [13], Kaggle DR+ [15], and CheX￾pert [14]. 4.1.1. Balanced Dataset PathMNIST [48] is derived from colorectal cancer his￾tology slides and consists of hematoxylin and eosin stained tissue patches for a nine-class classification task. Each sam￾ple is a 224 ×… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of estimation error and test accuracy under different symmetric noise rates for VolMinNet. 0 10 20 30 40 50 Epoch 50 60 70 80 90 Clean Ratio (%) Co-teaching Co-teaching+ CoDis JoCoR DISC DivideMix (a) 0 10 20 30 40 50 Epoch 0 20 40 60 80 100 Coverage Ratio (%) Co-teaching Co-teaching+ CoDis JoCoR DISC DivideMix (b) 0 10 20 30 40 50 Epoch 0 2 4 6 8 10 12 14 Clean Ratio (%) Co-teaching Co-teaching… view at source ↗
Figure 4
Figure 4. Figure 4: Sample selection performance under instance￾dependent noise on PathMNIST. The clean ratio represents the proportion of correctly labeled samples among those selected, while the coverage ratio represents the proportion of selected clean samples among all clean samples. (a) Clean ratio at 50% noise. (b) Coverage ratio at 50% noise. (c) Clean ratio at 90% noise. (d) Coverage ratio at 90% noise. contrast, DISC… view at source ↗
Figure 5
Figure 5. Figure 5: The loss distributions of Co-teaching under Idn-50% and Idn-90% on PathMNIST. 4.5.3. Analysis on Sample Selection Next, we focus on sample selection and semi-supervised methods, which have been effective under instance depen￾dent noise. The sample selection methods in LNMBench are all derived from the Co-teaching, adopting the small￾loss strategy in which the proportion of selected samples is determined by… view at source ↗
Figure 6
Figure 6. Figure 6: a). At Idn-20% setting, CDR consistently outperforms CE across all classes. In contrast, DISC, CoDis and Co￾teaching underperform CE in most classes except for class 5, which is the majority class in DermaMNIST. At Idn-50%, CDR exhibits performance degradation across all classes, further confirming that regularization based methods fail un￾der high noise levels (see Fig. 6b). Meanwhile, DISC, CoDis and Co-… view at source ↗
Figure 7
Figure 7. Figure 7: Per-class performance comparison of Co-teaching and DISC methods on DermMNIST dataset with Idn-20% noise. (a) Clean ratio of Co-teaching. (b) Coverage ratio of Co-teaching. (c) Clean ratio of DISC. (d) Coverage ratio of DISC. contrast, for class 3 and class 6, the method discards all clean samples at an early stage of training, resulting in zero accu￾racy for these classes. For the remaining classes, most … view at source ↗
Figure 8
Figure 8. Figure 8: Violin plots of per-class clean sample losses and the overall noisy sample loss distribution for Co-teaching under Idn-20% on DermaMNIST. 0 10 20 30 40 50 Epoch 0 20 40 60 80 100 Clean Ratio (%) Clean set Purified set [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Epoch wise ratios of clean class 5 samples in the clean set and the purified set of DISC, normalized by the total number of samples under Idn-20% on DermaMNIST. The clean set refers to the subset of samples selected by the model as clean, while the purified set refers to the subset of excluded samples whose labels are reassigned by the model. results in some clean samples being excluded from the training s… view at source ↗
Figure 10
Figure 10. Figure 10: Per-class classification accuracy of four methods on (a) PathMNIST and (b) ImPathMNIST. Findings: Based on sample selection methods rely on the model’s loss or confidence to select clean samples. However, in imbalanced datasets, some clean samples from minority classes exhibit higher losses than noisy samples during the early training stage, preventing them from being selected. This further limits the mod… view at source ↗
Figure 11
Figure 11. Figure 11: Per-class classification accuracy of four methods on (a) DRTiD (b) Kaggle DR+ and (c) CheXpert. label noise approaches are not well suited for real-world noisy data. Findings: Current LNL methods perform poorly on real￾world medical datasets, in some cases even underperforming standard CE model. Moreover, semi-supervised methods that perform well on synthetic noisy datasets often fail to achieve comparabl… view at source ↗
read the original abstract

Learning from noisy labels remains a major challenge in medical image analysis, where annotation demands expert knowledge and substantial inter-observer variability often leads to inconsistent or erroneous labels. Despite extensive research on learning with noisy labels (LNL), the robustness of existing methods in medical imaging has not been systematically assessed. To address this gap, we introduce LNMBench, a comprehensive benchmark for Label Noise in Medical imaging. LNMBench encompasses \textbf{10} representative methods evaluated across 7 datasets, 6 imaging modalities, and 3 noise patterns, establishing a unified and reproducible framework for robustness evaluation under realistic conditions. Comprehensive experiments reveal that the performance of existing LNL methods degrades substantially under high and real-world noise, highlighting the persistent challenges of class imbalance and domain variability in medical data. Motivated by these findings, we further propose a simple yet effective improvement to enhance model robustness under such conditions. The LNMBench codebase is publicly released to facilitate standardized evaluation, promote reproducible research, and provide practical insights for developing noise-resilient algorithms in both research and real-world medical applications.The codebase is publicly available on https://github.com/myyy777/LNMBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LNMBench, a benchmark for assessing the robustness of learning with noisy labels (LNL) methods in medical image classification. It evaluates 10 LNL methods on 7 datasets across 6 modalities using 3 noise patterns, reports substantial performance degradation under high and real-world noise levels, highlights challenges from class imbalance and domain variability, proposes a simple improvement, and releases the codebase publicly.

Significance. If the benchmark's noise patterns and dataset selection prove representative, the work offers a valuable standardized framework for evaluating LNL methods in medical imaging, where annotation noise is common. The public codebase release is a clear strength that supports reproducibility and future research. The findings on degradation could inform more resilient algorithm design, though their impact depends on how closely the evaluated conditions match clinical practice.

major comments (2)
  1. [Section 4] Section 4 (Noise Patterns): The manuscript must explicitly state whether the 3 noise patterns are generated synthetically (e.g., symmetric/asymmetric/instance-dependent label flips) or derived from multi-rater annotations or observed inter-observer inconsistencies within the 7 datasets. This detail is load-bearing for the central claim of evaluating under 'real-world noise' in the title and abstract; without it, the observed degradation may not generalize to clinical annotation variability.
  2. [Results] Results section (implied by abstract claims): The comprehensive experiments are described as revealing 'substantial' degradation, yet the manuscript lacks specific quantitative tables, accuracy drops with standard deviations, or ablation isolating synthetic vs. observed noise. This weakens the ability to assess the magnitude and statistical reliability of the headline result on existing LNL methods.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'simple yet effective improvement' is used without a brief derivation or pseudocode; adding one sentence on its core mechanism would improve clarity without altering scope.
  2. The claim of '6 imaging modalities' and '7 datasets' should include a short table or appendix listing the exact datasets, modalities, and class imbalance statistics to aid readers in assessing representativeness.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback, which helps strengthen the clarity and impact of our work on LNMBench. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where feasible.

read point-by-point responses
  1. Referee: [Section 4] Section 4 (Noise Patterns): The manuscript must explicitly state whether the 3 noise patterns are generated synthetically (e.g., symmetric/asymmetric/instance-dependent label flips) or derived from multi-rater annotations or observed inter-observer inconsistencies within the 7 datasets. This detail is load-bearing for the central claim of evaluating under 'real-world noise' in the title and abstract; without it, the observed degradation may not generalize to clinical annotation variability.

    Authors: We appreciate this important clarification request. The three noise patterns described in Section 4 are synthetically generated using standard models (symmetric, asymmetric, and instance-dependent label noise) with parameters chosen to reflect typical annotation error patterns reported in medical imaging literature on inter-observer variability. They are not directly derived from multi-rater annotations or observed inconsistencies within the specific 7 datasets. We will revise Section 4 to explicitly detail the synthetic generation procedure, including exact flip probabilities and instance-dependent mechanisms, and add a discussion on how these patterns approximate real-world clinical noise while noting the limitations for direct generalization. This will better support the claims regarding real-world noise in the title and abstract. revision: yes

  2. Referee: [Results] Results section (implied by abstract claims): The comprehensive experiments are described as revealing 'substantial' degradation, yet the manuscript lacks specific quantitative tables, accuracy drops with standard deviations, or ablation isolating synthetic vs. observed noise. This weakens the ability to assess the magnitude and statistical reliability of the headline result on existing LNL methods.

    Authors: We agree that enhanced quantitative detail would improve interpretability. The Results section currently summarizes the degradation trends across the 10 methods, 7 datasets, and 3 noise patterns, but we will expand it with new tables reporting per-method accuracy drops (with standard deviations over multiple runs) under varying noise levels. For the requested ablation, we will add a comparison where feasible using any available multi-rater subsets in the datasets; in cases where observed noise data is unavailable, we will explicitly discuss this as a limitation and suggest it as future work. These additions will provide clearer evidence for the magnitude and reliability of the findings. revision: partial

standing simulated objections not resolved
  • A full ablation isolating synthetic noise from observed multi-rater noise is not possible across all 7 datasets without new data collection, which exceeds the scope of this benchmarking study.

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

full rationale

The paper introduces LNMBench as an empirical evaluation framework and reports performance degradation of existing LNL methods based on new runs across 7 public datasets, 6 modalities, and 3 noise patterns. No mathematical derivation chain exists; claims rest on direct experimental observations rather than quantities defined from fitted parameters, self-referential definitions, or load-bearing self-citations. The proposed improvement is motivated by the benchmark results but does not reduce to a tautology by construction. This is a standard empirical benchmarking paper whose central results are independently falsifiable via the released codebase and public data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the representativeness of the selected datasets and noise models; no free parameters, new entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • domain assumption The chosen 7 datasets and 3 noise patterns capture realistic medical annotation variability.
    Invoked to support the claim that observed degradation reflects real-world conditions.

pith-pipeline@v0.9.0 · 5520 in / 1096 out tokens · 27698 ms · 2026-05-16T23:53:13.404972+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 3 internal anchors

  1. [1]

    Two wrongs don’t make a right: Combating confirmation bias in learning withlabelnoise,in:ProceedingsoftheAAAIConferenceonArtificial Intelligence, pp

    Chen,M.,Cheng,H.,Du,Y.,Xu,M.,Jiang,W.,Wang,C.,2023. Two wrongs don’t make a right: Combating confirmation bias in learning withlabelnoise,in:ProceedingsoftheAAAIConferenceonArtificial Intelligence, pp. 14765–14773

  2. [2]

    Understanding and utilizing deep neural networks trained with noisy labels, in: International conference on machine learning, PMLR

    Chen, P., Liao, B.B., Chen, G., Zhang, S., 2019. Understanding and utilizing deep neural networks trained with noisy labels, in: International conference on machine learning, PMLR. pp. 1062– 1070

  3. [3]

    Cordeiro, F.R., Carneiro, G., 2020. A survey on deep learning with noisy labels: How to train your model when you cannot trust on the annotations?, in: 2020 33rd SIBGRAPI conference on graphics, patterns and images (SIBGRAPI), IEEE. pp. 9–16

  4. [4]

    Training a neural network based on unreliable human annotation of medical images, in: 2018 IEEE 15th International symposium on biomedical imaging (ISBI 2018), IEEE

    Dgani, Y., Greenspan, H., Goldberger, J., 2018. Training a neural network based on unreliable human annotation of medical images, in: 2018 IEEE 15th International symposium on biomedical imaging (ISBI 2018), IEEE. pp. 39–42

  5. [5]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy,A.,2020. Animageisworth16x16words:Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  6. [6]

    A cnn-based unified frame- work utilizing projection loss in unison with label noise handling for multiple myeloma cancer diagnosis

    Gehlot, S., Gupta, A., Gupta, R., 2021. A cnn-based unified frame- work utilizing projection loss in unison with label noise handling for multiple myeloma cancer diagnosis. Medical Image Analysis 72, 102099

  7. [7]

    Ghesu, F.C., Georgescu, B., Gibson, E., Guendel, S., Kalra, M.K., Singh, R., Digumarthy, S.R., Grbic, S., Comaniciu, D., 2019. Quan- tifying and leveraging classification uncertainty for chest radiograph assessment,in:Internationalconferenceonmedicalimagecomputing and computer-assisted intervention, Springer. pp. 676–684

  8. [8]

    Openmibood: Open medical imaging benchmarks for out-of-distribution detection, in: Proceedings of the Computer Vision and Pattern Recognition Conference, pp

    Gutbrod,M.,Rauber,D.,Nunes,D.W.,Palm,C.,2025. Openmibood: Open medical imaging benchmarks for out-of-distribution detection, in: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 25874–25886

  9. [9]

    Co-teaching:Robusttrainingofdeepneuralnetworkswith extremely noisy labels

    Han,B.,Yao,Q.,Yu,X.,Niu,G.,Xu,M.,Hu,W.,Tsang,I.,Sugiyama, M.,2018. Co-teaching:Robusttrainingofdeepneuralnetworkswith extremely noisy labels. Advances in neural information processing systems 31

  10. [10]

    Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp

    He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778

  11. [11]

    Using pre-training can improve model robustness and uncertainty, in: International confer- ence on machine learning, PMLR

    Hendrycks, D., Lee, K., Mazeika, M., 2019. Using pre-training can improve model robustness and uncertainty, in: International confer- ence on machine learning, PMLR. pp. 2712–2721

  12. [12]

    Qmix: Quality-aware learning with mixed noise for robust retinal disease diagnosis

    Hou, J., Xu, J., Feng, R., Chen, H., 2025. Qmix: Quality-aware learning with mixed noise for robust retinal disease diagnosis. IEEE Transactions on Medical Imaging

  13. [13]

    Cross-field transformer for diabetic retinopathy gradingontwo-fieldfundusimages,in:2022IEEEInternationalCon- ferenceonBioinformaticsandBiomedicine(BIBM),IEEEComputer Society

    Hou,J.,Xu,J.,Xiao,F.,Zhao,R.W.,Zhang,Y.,Zou,H.,Lu,L.,Xue, W., Feng, R., 2022. Cross-field transformer for diabetic retinopathy gradingontwo-fieldfundusimages,in:2022IEEEInternationalCon- ferenceonBioinformaticsandBiomedicine(BIBM),IEEEComputer Society. pp. 985–990. Yuan Ma et al.:Preprint submitted to ElsevierPage 14 of 15 LNMBench

  14. [14]

    MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs

    Johnson,A.E.,Pollard,T.J.,Greenbaum,N.R.,Lungren,M.P.,Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S., 2019. Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042

  15. [15]

    Improving medical images classification with label noise using dual-uncertainty estimation

    Ju, L., Wang, X., Wang, L., Mahapatra, D., Zhao, X., Zhou, Q., Liu, T., Ge, Z., 2022. Improving medical images classification with label noise using dual-uncertainty estimation. IEEE transactions on medical imaging 41, 1533–1546

  16. [16]

    Monica: Benchmarking on long-tailed medical image classification

    Ju, L., Yan, S., Zhou, Y., Nan, Y., Xing, X., Duan, P., Ge, Z., 2024. Monica: Benchmarking on long-tailed medical image classification. arXiv preprint arXiv:2410.02010

  17. [17]

    Ju, L., Yu, Z., Wang, L., Zhao, X., Wang, X., Bonnington, P., Ge, Z.,

  18. [18]

    IEEETransactionsonMedicalImaging43,335– 350

    Hierarchical knowledge guided learning for real-world retinal diseaserecognition. IEEETransactionsonMedicalImaging43,335– 350

  19. [19]

    Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis

    Karimi, D., Dou, H., Warfield, S.K., Gholipour, A., 2020. Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis. Medical image analysis 65, 101759

  20. [20]

    Improving medicalimageclassificationinnoisylabelsusingonlyself-supervised pretraining, in: MICCAI Workshop on Data Engineering in Medical Imaging, Springer

    Khanal, B., Bhattarai, B., Khanal, B., Linte, C.A., 2023. Improving medicalimageclassificationinnoisylabelsusingonlyself-supervised pretraining, in: MICCAI Workshop on Data Engineering in Medical Imaging, Springer. pp. 78–90

  21. [21]

    Ko, J., Yi, B., Yun, S.Y., 2023. A gift from label smoothing: robust training with adaptive label smoothing via auxiliary classifier under label noise, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8325–8333

  22. [22]

    Li,J.,Cao,H.,Wang,J.,Liu,F.,Dou,Q.,Chen,G.,Heng,P.A.,2023a. Learning robust classifier for imbalanced medical image dataset with noisylabelsbyminimizinginvariantrisk,in:InternationalConference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 306–316

  23. [23]

    Dividemix: Learning with noisy labelsassemi-supervisedlearning

    Li, J., Socher, R., Hoi, S.C., 2020. Dividemix: Learning with noisy labelsassemi-supervisedlearning. arXivpreprintarXiv:2002.07394

  24. [24]

    Provably end- to-end label-noise learning without anchor points, in: International conference on machine learning, PMLR

    Li, X., Liu, T., Han, B., Niu, G., Sugiyama, M., 2021. Provably end- to-end label-noise learning without anchor points, in: International conference on machine learning, PMLR. pp. 6403–6413

  25. [25]

    Disc: Learning from noisy labels via dynamic instance-specific selection and correction, in:ProceedingsoftheIEEE/CVFconferenceoncomputervisionand pattern recognition, pp

    Li, Y., Han, H., Shan, S., Chen, X., 2023b. Disc: Learning from noisy labels via dynamic instance-specific selection and correction, in:ProceedingsoftheIEEE/CVFconferenceoncomputervisionand pattern recognition, pp. 24070–24079

  26. [26]

    Instance-dependent label distribution estimation for learning with label noise

    Liao, Z., Hu, S., Xie, Y., Xia, Y., 2025a. Instance-dependent label distribution estimation for learning with label noise. International Journal of Computer Vision 133, 2568–2580

  27. [27]

    Unleashing the potential of open-set noisy samples against label noise for medical image classification

    Liao, Z., Hu, S., Zhang, Y., Xia, Y., 2025b. Unleashing the potential of open-set noisy samples against label noise for medical image classification. Medical Image Analysis , 103702

  28. [28]

    Learning the latent causal structure formodelinglabelnoise

    Lin, Y., Yao, Y., Liu, T., 2024. Learning the latent causal structure formodelinglabelnoise. AdvancesinNeuralInformationProcessing Systems 37, 120549–120577

  29. [29]

    Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian,M.,VanDerLaak,J.A.,VanGinneken,B.,Sánchez,C.I.,

  30. [30]

    Medical image analysis 42, 60–88

    A survey on deep learning in medical image analysis. Medical image analysis 42, 60–88

  31. [31]

    2537–2546

    Liu,Z.,Miao,Z.,Zhan,X.,Wang,J.,Gong,B.,Yu,S.X.,2019.Large- scale long-tailed recognitionin an open world,in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2537–2546

  32. [32]

    Does label smoothing mitigate label noise?, in: International Conference on Machine Learning, PMLR

    Lukasik, M., Bhojanapalli, S., Menon, A., Kumar, S., 2020. Does label smoothing mitigate label noise?, in: International Conference on Machine Learning, PMLR. pp. 6448–6458

  33. [33]

    Bench- marking common uncertainty estimation methods with histopatho- logical images under domain shift and label noise

    Mehrtens, H.A., Kurz, A., Bucher, T.C., Brinker, T.J., 2023. Bench- marking common uncertainty estimation methods with histopatho- logical images under domain shift and label noise. Medical image analysis 89, 102914

  34. [34]

    The multimodal brain tumor image segmentation benchmark (brats)

    Menze,B.H.,Jakab,A.,Bauer,S.,Kalpathy-Cramer,J.,Farahani,K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al., 2014. The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging 34, 1993–2024

  35. [35]

    Self: Learning to filter noisy labels with self- ensembling

    Nguyen,D.T.,Mummadi,C.K.,Ngo,T.P.N.,Nguyen,T.H.P.,Beggel, L., Brox, T., 2019. Self: Learning to filter noisy labels with self- ensembling. arXiv preprint arXiv:1910.01842

  36. [36]

    Interpreting chest x-rays via cnns that exploit hierarchical disease dependenciesanduncertaintylabels

    Pham, H.H., Le, T.T., Tran, D.Q., Ngo, D.T., Nguyen, H.Q., 2021. Interpreting chest x-rays via cnns that exploit hierarchical disease dependenciesanduncertaintylabels. Neurocomputing437,186–194

  37. [37]

    Asurveyof label-noise deep learning for medical image analysis

    Shi,J.,Zhang,K.,Guo,C.,Yang,Y.,Xu,Y.,Wu,J.,2024. Asurveyof label-noise deep learning for medical image analysis. Medical image analysis 95, 103166

  38. [38]

    Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning

    Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M., 2016. Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35, 1285–1298

  39. [39]

    Training Convolutional Networks with Noisy Labels

    Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L., Fergus, R., 2014. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080

  40. [40]

    Learning from noisy labels by regularized estimation of annotator confusion, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

    Tanno,R.,Saeedi,A.,Sankaranarayanan,S.,Alexander,D.C.,Silber- man, N., 2019. Learning from noisy labels by regularized estimation of annotator confusion, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11244–11253

  41. [41]

    Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.,

  42. [42]

    2097–2106

    Chestx-ray8: Hospital-scale chest x-ray database and bench- marks on weakly-supervised classification and localization of com- mon thorax diseases, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106

  43. [43]

    Symmet- riccrossentropyforrobustlearningwithnoisylabels,in:Proceedings of the IEEE/CVF international conference on computer vision, pp

    Wang,Y.,Ma,X.,Chen,Z.,Luo,Y.,Yi,J.,Bailey,J.,2019. Symmet- riccrossentropyforrobustlearningwithnoisylabels,in:Proceedings of the IEEE/CVF international conference on computer vision, pp. 322–330

  44. [44]

    Combating noisy labels by agreement: A joint training method with co-regularization, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

    Wei, H., Feng, L., Chen, X., An, B., 2020. Combating noisy labels by agreement: A joint training method with co-regularization, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13726–13735

  45. [45]

    Learning with noisy labels revisited: A study using real-world human annota- tions

    Wei, J., Zhu, Z., Cheng, H., Liu, T., Niu, G., Liu, Y., 2021. Learning with noisy labels revisited: A study using real-world human annota- tions. arXiv preprint arXiv:2110.12088

  46. [46]

    Xia, X., Han, B., Zhan, Y., Yu, J., Gong, M., Gong, C., Liu, T.,

  47. [47]

    1833–1843

    Combatingnoisylabelswithsampleselectionbymininghigh- discrepancyexamples,in:ProceedingsoftheIEEE/CVFinternational conference on computer vision, pp. 1833–1843

  48. [48]

    Robust early-learning: Hindering the memorization of noisy labels, in: International conference on learning representations

    Xia, X., Liu, T., Han, B., Gong, C., Wang, N., Ge, Z., Chang, Y., 2020a. Robust early-learning: Hindering the memorization of noisy labels, in: International conference on learning representations

  49. [49]

    Part-dependent label noise: Towards instance-dependent label noise

    Xia, X., Liu, T., Han, B., Wang, N., Gong, M., Liu, H., Niu, G., Tao, D., Sugiyama, M., 2020b. Part-dependent label noise: Towards instance-dependent label noise. Advances in neural information processing systems 33, 7597–7610

  50. [50]

    Xia,X.,Liu,T.,Wang,N.,Han,B.,Gong,C.,Niu,G.,Sugiyama,M.,

  51. [51]

    Areanchorpointsreallyindispensableinlabel-noiselearning? Advances in neural information processing systems 32

  52. [52]

    Xue,C.,Dou,Q.,Shi,X.,Chen,H.,Heng,P.A.,2019.Robustlearning atnoisylabeledmedicalimages:Appliedtoskinlesionclassification, in: 2019 IEEE 16th International symposium on biomedical imaging (ISBI 2019), IEEE. pp. 1280–1283

  53. [53]

    Yang,J.,Shi,R.,Wei,D.,Liu,Z.,Zhao,L.,Ke,B.,Pfister,H.,Ni,B.,

  54. [54]

    Scientific Data 10, 41

    Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data 10, 41

  55. [55]

    Howdoesdisagreementhelpgeneralizationagainstlabelcorruption?, in: International conference on machine learning, PMLR

    Yu, X., Han, B., Yao, J., Niu, G., Tsang, I., Sugiyama, M., 2019. Howdoesdisagreementhelpgeneralizationagainstlabelcorruption?, in: International conference on machine learning, PMLR. pp. 7164– 7173

  56. [56]

    Robust curriculum learning: from clean label detection to noisy label self-correction, in: Interna- tional conference on learning representations

    Zhou, T., Wang, S., Bilmes, J., 2020. Robust curriculum learning: from clean label detection to noisy label self-correction, in: Interna- tional conference on learning representations. Yuan Ma et al.:Preprint submitted to ElsevierPage 15 of 15