When Accuracy Is Not Enough: Uncertainty Collapse between Noisy Label Learning and Out-of-Distribution Detection
Pith reviewed 2026-05-20 13:16 UTC · model grok-4.3
The pith
High accuracy on noisy labels does not guarantee reliable out-of-distribution rejection because misclassified in-distribution samples overlap OOD score regions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
High closed-set accuracy under noisy label training does not ensure OOD reliability, because low-confidence misclassified in-distribution samples can overlap the score and feature regions occupied by OOD inputs, a pathology termed uncertainty collapse that reduces separability at the ID-error/OOD interface under standard post-hoc scores.
What carries the argument
Uncertainty collapse, the structural overlap of low-confidence misclassified ID samples with OOD inputs in score and feature space under noisy training, which erodes detection performance even when accuracy remains high.
If this is right
- High-accuracy LNL methods can still lose separability between ID errors and OOD inputs under standard energy or score-based detectors.
- The ACC-OOD benchmark reveals this failure across both synthetic and real label noise settings.
- Virtual Margin Regularization partially restores far-OOD detection by synthesizing boundary virtual outliers on trusted ID batches.
- Closed-set accuracy alone is insufficient for open-world reliability in noisy-label settings.
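The points above can be made concrete with a split evaluation. The following is a minimal sketch, not the paper's ACC-OOD code: it assumes logits are available as plain Python lists, uses the negative free energy as the ID-ness score (higher means more ID-like), and reports AUROC against OOD separately for correctly and wrongly classified ID samples. The names `energy_score`, `auroc`, and `split_auroc` are hypothetical.

```python
import math

def energy_score(logits, temperature=1.0):
    """Negative free energy, T * logsumexp(logits / T). Higher means more ID-like."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def split_auroc(id_logits, id_labels, ood_logits, temperature=1.0):
    """AUROC against OOD, computed separately for ID-correct and ID-wrong samples."""
    correct, wrong = [], []
    for logits, label in zip(id_logits, id_labels):
        pred = max(range(len(logits)), key=lambda k: logits[k])
        bucket = correct if pred == label else wrong
        bucket.append(energy_score(logits, temperature))
    ood = [energy_score(l, temperature) for l in ood_logits]
    return {
        "id_correct_vs_ood": auroc(correct, ood) if correct else None,
        "id_wrong_vs_ood": auroc(wrong, ood) if wrong else None,
    }
```

A high value for ID-correct samples combined with a value near 0.5 for ID-wrong samples would be one symptom of the overlap described here.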
Where Pith is reading between the lines
- LNL training objectives may need explicit terms that penalize ID-error/OOD overlap rather than relying on post-hoc fixes.
- The collapse phenomenon could be stronger when label noise correlates with the natural shifts that occur in deployed data.
- Applying the same benchmark protocol to other open-set or robust-learning methods would test whether uncertainty collapse is specific to noisy-label regimes.
Load-bearing premise
The standardized near- and far-OOD routing together with post-hoc scores in the ACC-OOD benchmark faithfully reflect the failure modes that appear under real deployment distribution shifts.
What would settle it
A new evaluation on real-world noisy data with natural shifts showing that low-confidence ID misclassifications do not overlap OOD score distributions under the same post-hoc detectors would falsify the collapse claim.
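One way to quantify the overlap this test refers to is a histogram overlap coefficient between the detector scores of misclassified ID samples and those of OOD samples. The sketch below is an illustrative assumption, not the paper's diagnostic; `overlap_coefficient` and its `bins` parameter are hypothetical names.

```python
def overlap_coefficient(scores_a, scores_b, bins=20):
    """Shared histogram mass of two score samples: 0.0 (disjoint) to 1.0 (identical)."""
    lo = min(min(scores_a), min(scores_b))
    hi = max(max(scores_a), max(scores_b))
    if hi == lo:
        return 1.0  # every score is identical, so the samples fully overlap
    width = (hi - lo) / bins

    def normalized_hist(scores):
        counts = [0] * bins
        for s in scores:
            # clamp so the maximum score lands in the last bin
            idx = min(int((s - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(scores) for c in counts]

    hist_a = normalized_hist(scores_a)
    hist_b = normalized_hist(scores_b)
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))
```

Under this measure, a value near zero for ID-wrong versus OOD scores would count against the collapse claim, while a value near one would be consistent with it. The result depends on the bin count, so it should be reported alongside the chosen `bins`.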
Original abstract
Learning with noisy labels (LNL) is typically benchmarked by closed-set classification accuracy, yet deployment often requires classifiers to reject out-of-distribution (OOD) inputs. We present a learner-agnostic ACC-OOD benchmark that freezes LNL checkpoints and evaluates them with standardized near-/far-OOD routing and post-hoc scores across synthetic and real label noise. The benchmark reveals a recurring failure mode: high closed-set accuracy does not ensure OOD reliability, because low-confidence, misclassified in-distribution samples can overlap the score and feature regions occupied by OOD inputs under noisy training. We term this pathology uncertainty collapse. This structural overlap can make high-accuracy LNL methods lose separability at the ID-error/OOD interface under standard OOD scores. As an intervention, we study Virtual Margin Regularization (VMR), a lightweight repair probe demonstrated mainly with PSSCL that synthesizes boundary virtual outliers on trusted ID batches and widens the energy margin. VMR partially reduces the collapse-induced far-OOD failure without replacing the host objective or sacrificing closed-set accuracy in the tested settings. These results support LNL benchmarks that co-report closed-set generalization, open-world reliability, and structural overlap diagnostics.
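The abstract does not spell out the VMR objective. As a rough illustration of what widening an energy margin can mean, the sketch below uses the squared-hinge energy-bounded penalty from energy-based OOD fine-tuning (Liu et al., 2020). This is a swapped-in stand-in, not VMR itself. The outlier logits are assumed to be given; how VMR synthesizes boundary virtual outliers is not reproduced here. The names `free_energy`, `energy_margin_penalty`, `m_in`, and `m_out` are assumptions.

```python
import math

def free_energy(logits, temperature=1.0):
    """E(x) = -T * logsumexp(logits / T). Lower energy means more ID-like."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))

def energy_margin_penalty(id_logits, outlier_logits, m_in, m_out, temperature=1.0):
    """Squared-hinge penalty: push ID energy below m_in and outlier energy above m_out.

    Assumption: this regularizer is added to a host LNL loss with some weight;
    it is an illustrative stand-in and not the paper's VMR objective.
    """
    id_term = sum(
        max(0.0, free_energy(l, temperature) - m_in) ** 2 for l in id_logits
    ) / len(id_logits)
    out_term = sum(
        max(0.0, m_out - free_energy(l, temperature)) ** 2 for l in outlier_logits
    ) / len(outlier_logits)
    return id_term + out_term
```

The penalty is zero once trusted ID samples sit below `m_in` and outliers sit above `m_out`, so a larger gap between the two margins asks for a wider energy separation.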
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces an ACC-OOD benchmark that freezes LNL checkpoints and evaluates them under standardized near-/far-OOD routing with post-hoc scores. It reports a recurring failure mode termed uncertainty collapse: high closed-set accuracy does not guarantee OOD reliability because low-confidence misclassified ID samples overlap the score and feature regions occupied by OOD inputs when training occurs under label noise. The authors propose Virtual Margin Regularization (VMR) as a lightweight probe that synthesizes boundary virtual outliers on trusted ID batches to widen the energy margin, showing partial mitigation of far-OOD failure without sacrificing closed-set accuracy in the tested settings.
Significance. If the central empirical observations hold, the work usefully bridges LNL and OOD detection by demonstrating that accuracy-centric benchmarks are insufficient for open-world deployment. The learner-agnostic ACC-OOD benchmark and the concrete VMR intervention supply practical tools and diagnostics. Credit is due for the reproducible-style evaluation protocol across synthetic and real noise and for framing the overlap as a structural rather than incidental issue.
major comments (2)
- The claim that uncertainty collapse is induced specifically by noisy training (rather than being a general property of high-accuracy classifiers) is load-bearing for the central thesis. The manuscript should include a direct comparison against clean-label models that reach comparable closed-set accuracy; without it, the observed ID-error/OOD overlap could be an artifact of the chosen post-hoc scores or the ACC-OOD routing protocol rather than intrinsic to the LNL objective.
- § on VMR experiments: VMR is demonstrated primarily with PSSCL. To support the statement that it is a general lightweight repair, results with at least two additional LNL methods (e.g., DivideMix or ELR) under the same ACC-OOD protocol are needed; otherwise the intervention remains tied to a single host objective.
minor comments (2)
- Clarify whether the energy score used for VMR is the same formulation as the post-hoc energy score in the main benchmark tables; any difference should be stated explicitly.
- Figure captions should explicitly list the exact noise rates and OOD datasets corresponding to each panel to improve readability.
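For context on the first minor comment, the post-hoc energy score in energy-based OOD detection (Liu et al., 2020) is usually written as below, where f_k(x) is the logit for class k, K is the number of classes, and T is a temperature. Whether the paper's benchmark tables and VMR both use exactly this form is not stated on this page, which is the point of the comment.

```latex
E(x; f) = -T \log \sum_{k=1}^{K} \exp\!\left(\frac{f_k(x)}{T}\right),
\qquad
S_{\text{energy}}(x) = -E(x; f)
```

Under this convention, in-distribution inputs are expected to receive lower energy E and therefore a higher score S, so any difference in sign or temperature between the two uses would change how the reported margins should be read.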
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which help clarify the scope of our claims on uncertainty collapse and the generality of Virtual Margin Regularization. We address each major comment below and have prepared revisions to strengthen the manuscript.
- Referee: The claim that uncertainty collapse is induced specifically by noisy training (rather than being a general property of high-accuracy classifiers) is load-bearing for the central thesis. The manuscript should include a direct comparison against clean-label models that reach comparable closed-set accuracy; without it, the observed ID-error/OOD overlap could be an artifact of the chosen post-hoc scores or the ACC-OOD routing protocol rather than intrinsic to the LNL objective.
  Authors: We agree that a direct comparison to clean-label training at matched closed-set accuracy is essential to isolate the contribution of label noise. In the revised manuscript we have added controlled experiments training the same backbone architectures on clean versions of the benchmark datasets until they reach accuracy levels comparable to the noisy-label checkpoints. These results show substantially reduced ID-error/OOD score overlap under clean training, indicating that the collapse phenomenon is indeed exacerbated by the noisy-label objective rather than being an artifact of the evaluation protocol. A new subsection and accompanying figure have been inserted to present this comparison.
  Revision: yes
- Referee: § on VMR experiments: VMR is demonstrated primarily with PSSCL. To support the statement that it is a general lightweight repair, results with at least two additional LNL methods (e.g., DivideMix or ELR) under the same ACC-OOD protocol are needed; otherwise the intervention remains tied to a single host objective.
  Authors: We acknowledge that broader empirical validation across LNL methods would better support the claim that VMR is a general lightweight repair. While the current experiments focus on PSSCL as a strong and representative host method, we have now run the identical ACC-OOD protocol with VMR applied on top of DivideMix and ELR. The additional results, included in the revised experimental section and supplementary material, show that VMR continues to deliver partial mitigation of far-OOD failure without degrading closed-set accuracy, consistent with the mechanism operating on trusted ID batches independently of the underlying LNL objective.
  Revision: yes
Circularity Check
Empirical benchmark with no circular derivation or self-referential claims
The paper's core contribution is an empirical ACC-OOD benchmark that freezes existing LNL checkpoints and measures overlaps between low-confidence misclassified ID samples and OOD inputs using standardized post-hoc scores and near-/far-OOD routing. The observation of uncertainty collapse follows directly from these experimental comparisons across synthetic and real noise settings, without any equations that define quantities in terms of themselves or predictions that reduce to fitted inputs by construction. The VMR intervention is presented as a lightweight empirical probe evaluated on the same benchmark protocol, with no load-bearing reliance on self-citations, uniqueness theorems, or ansatzes imported from prior author work. All claims remain tied to the reported benchmark results rather than any self-contained derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard assumptions about label noise models and the separability of in-distribution versus out-of-distribution data under common post-hoc scores.
invented entities (1)
- uncertainty collapse: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "high closed-set accuracy does not ensure OOD reliability, because low-confidence, misclassified in-distribution samples can overlap the score and feature regions occupied by OOD inputs under noisy training"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "We term this pathology uncertainty collapse... low-confidence ID-wrong samples occupy score and feature regions shared with OOD inputs"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.