
arxiv: 2605.17795 · v1 · pith:YI3F63DRnew · submitted 2026-05-18 · 💻 cs.LG · cs.CV

When Accuracy Is Not Enough: Uncertainty Collapse between Noisy Label Learning and Out-of-Distribution Detection

Pith reviewed 2026-05-20 13:16 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords noisy label learning · out-of-distribution detection · uncertainty collapse · open-world reliability · virtual margin regularization · ACC-OOD benchmark

The pith

High accuracy on noisy labels does not guarantee reliable out-of-distribution rejection because misclassified in-distribution samples overlap OOD score regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that closed-set accuracy benchmarks for learning with noisy labels (LNL) miss a critical deployment issue: uncertainty collapse, where low-confidence, misclassified in-distribution (ID) samples occupy the same score and feature regions as true out-of-distribution (OOD) inputs. This matters because a deployed classifier must both classify known inputs correctly and safely reject unknown ones, and the two goals can conflict under label noise. The authors build a learner-agnostic ACC-OOD benchmark that freezes LNL checkpoints and evaluates them with fixed near- and far-OOD protocols and post-hoc scores. They report the overlap across synthetic and real noise settings and introduce Virtual Margin Regularization (VMR), a lightweight probe that widens the energy margin on trusted ID batches without replacing the host objective.

Core claim

High closed-set accuracy under noisy label training does not ensure OOD reliability, because low-confidence misclassified in-distribution samples can overlap the score and feature regions occupied by OOD inputs, a pathology termed uncertainty collapse that reduces separability at the ID-error/OOD interface under standard post-hoc scores.

What carries the argument

Uncertainty collapse, the structural overlap of low-confidence misclassified ID samples with OOD inputs in score and feature space under noisy training, which erodes detection performance even when accuracy remains high.
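
To make the overlap concrete, here is a minimal sketch of how one might measure separability at the ID-error/OOD interface with a post-hoc energy score. It assumes the standard energy score of Liu et al. (2020) with "higher means more ID-like" and a pairwise AUROC. The logits are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
import math

def energy_score(logits, temperature=1.0):
    """Negative free energy, T * logsumexp(logits / T); higher means more ID-like."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    return temperature * (peak + math.log(sum(math.exp(s - peak) for s in scaled)))

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical logits, for illustration only.
id_correct = [[9.0, 0.5, 0.2], [8.0, 1.0, 0.3]]   # confident, correctly classified ID
id_wrong = [[1.2, 1.0, 0.9], [1.5, 1.4, 0.8]]     # low-confidence, misclassified ID
far_ood = [[1.1, 1.0, 1.0], [1.6, 1.2, 0.7]]      # OOD inputs with flat logits

ood_scores = [energy_score(z) for z in far_ood]
auroc_correct = auroc([energy_score(z) for z in id_correct], ood_scores)
auroc_wrong = auroc([energy_score(z) for z in id_wrong], ood_scores)
print(f"ID-correct vs OOD: {auroc_correct:.2f}, ID-wrong vs OOD: {auroc_wrong:.2f}")
```

Reporting the AUROC separately for ID-correct and ID-wrong samples, rather than pooling all ID data, is what exposes the interface the core claim is about: a pooled number can look healthy while the ID-wrong subset is barely distinguishable from OOD.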

If this is right

  • High-accuracy LNL methods can still lose separability between ID errors and OOD inputs under standard energy or score-based detectors.
  • The ACC-OOD benchmark reveals this failure across both synthetic and real label noise settings.
  • Virtual Margin Regularization partially restores far-OOD detection by synthesizing boundary virtual outliers on trusted ID batches.
  • Closed-set accuracy alone is insufficient for open-world reliability in noisy-label settings.
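
The abstract does not give VMR's loss, so the following is not the paper's formulation. It is a sketch of a related, well-known technique, the squared-hinge energy regularizer from energy-based OOD fine-tuning (Liu et al., 2020), which shows what "widening the energy margin" can look like in code. The margins `m_in` and `m_out`, the logits, and the assumption that virtual-outlier logits are already available are all hypothetical. The outlier synthesis step itself is not shown.

```python
import math

def free_energy(logits, temperature=1.0):
    """Free energy E(x) = -T * logsumexp(logits / T); lower means more ID-like."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    return -temperature * (peak + math.log(sum(math.exp(s - peak) for s in scaled)))

def energy_margin_loss(id_logits, outlier_logits, m_in, m_out):
    """Squared hinge: zero once ID energies are <= m_in and outlier energies are >= m_out."""
    id_term = sum(max(0.0, free_energy(z) - m_in) ** 2 for z in id_logits) / len(id_logits)
    out_term = sum(max(0.0, m_out - free_energy(z)) ** 2 for z in outlier_logits) / len(outlier_logits)
    return id_term + out_term

# Hypothetical logits and margins, chosen only for illustration.
trusted_id = [[10.0, 0.0, 0.0], [0.0, 9.0, 0.5]]
virtual_outliers = [[0.0, 0.0, 0.0], [0.3, 0.2, 0.1]]

separated = energy_margin_loss(trusted_id, virtual_outliers, m_in=-5.0, m_out=-2.0)
collapsed = energy_margin_loss(virtual_outliers, trusted_id, m_in=-5.0, m_out=-2.0)
print(separated, collapsed)
```

In a training loop a term like this would typically be added to the host LNL loss with a small weight, which is consistent with the description of VMR as a probe that does not replace the host objective. The weight and schedule are left open here because the source does not state them.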

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LNL training objectives may need explicit terms that penalize ID-error/OOD overlap rather than relying on post-hoc fixes.
  • The collapse phenomenon could be stronger when label noise correlates with the natural shifts that occur in deployed data.
  • Applying the same benchmark protocol to other open-set or robust-learning methods would test whether uncertainty collapse is specific to noisy-label regimes.

Load-bearing premise

The standardized near- and far-OOD routing together with post-hoc scores in the ACC-OOD benchmark faithfully reflect the failure modes that appear under real deployment distribution shifts.

What would settle it

A new evaluation on real-world noisy data with natural shifts showing that low-confidence ID misclassifications do not overlap OOD score distributions under the same post-hoc detectors would falsify the collapse claim.

Figures

Figures reproduced from arXiv: 2605.17795 by Jingyang Mao, Ningkang Peng, Peirong Ma, Runhan Zhou, Yanhui Gu.

Figure 1: Per-noise ACC–OOD profiles corresponding to …

Figure 2: Kernel-smoothed Energy densities for RRL on ID-correct, ID-wrong, and pooled far-OOD test samples (shared top-left legend; seven synthetic-noise settings), showing the low-confidence overlap on a common Energy axis.

Figure 3: Far-OOD feature probe. Two-dimensional projection of penultimate features for UNICON on C100 sym 0.5: ID-wrong (red) bridges ID-correct (blue) and far-OOD (gray), degrading coarse separation in the representation used by post-hoc Energy scoring. VMR targets this type of bridge; Section 6 reports the resulting metric changes.
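
The density comparison described for Figure 2 can be summarized as a single number. The sketch below assumes a Gaussian kernel density estimate with a fixed bandwidth and computes the overlap coefficient (the integral of the pointwise minimum of two densities) on a uniform grid. The bandwidth, grid size, and function names are illustrative assumptions, not the paper's diagnostic.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function evaluating a Gaussian kernel density estimate of `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def overlap_coefficient(a, b, bandwidth=0.5, steps=400):
    """Integrate min(p_a, p_b) with a midpoint rule; 1 is identical, 0 is disjoint."""
    lo = min(min(a), min(b)) - 4.0 * bandwidth
    hi = max(max(a), max(b)) + 4.0 * bandwidth
    dx = (hi - lo) / steps
    p_a, p_b = gaussian_kde(a, bandwidth), gaussian_kde(b, bandwidth)
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += min(p_a(x), p_b(x))
    return total * dx

# Hypothetical energy scores, for illustration only.
id_wrong_energy = [2.1, 2.4, 2.2, 2.6]
far_ood_energy = [2.0, 2.3, 2.5, 2.2]
print(round(overlap_coefficient(id_wrong_energy, far_ood_energy), 3))
```

Computing this coefficient for the ID-wrong/far-OOD pair and for the ID-correct/far-OOD pair gives a scalar counterpart to the visual overlap in the density plots.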
Original abstract

Learning with noisy labels (LNL) is typically benchmarked by closed-set classification accuracy, yet deployment often requires classifiers to reject out-of-distribution (OOD) inputs. We present a learner-agnostic ACC-OOD benchmark that freezes LNL checkpoints and evaluates them with standardized near-/far-OOD routing and post-hoc scores across synthetic and real label noise. The benchmark reveals a recurring failure mode: high closed-set accuracy does not ensure OOD reliability, because low-confidence, misclassified in-distribution samples can overlap the score and feature regions occupied by OOD inputs under noisy training. We term this pathology uncertainty collapse. This structural overlap can make high-accuracy LNL methods lose separability at the ID-error/OOD interface under standard OOD scores. As an intervention, we study Virtual Margin Regularization (VMR), a lightweight repair probe demonstrated mainly with PSSCL that synthesizes boundary virtual outliers on trusted ID batches and widens the energy margin. VMR partially reduces the collapse-induced far-OOD failure without replacing the host objective or sacrificing closed-set accuracy in the tested settings. These results support LNL benchmarks that co-report closed-set generalization, open-world reliability, and structural overlap diagnostics.
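
The co-reporting the abstract argues for can be sketched as a small evaluation routine over a frozen checkpoint's outputs. This is an illustrative outline only. It assumes predictions and a scalar score per sample (higher means more ID-like) have already been extracted, and it uses FPR at a fixed true-positive rate as the OOD metric. The function names and the report layout are hypothetical, not the benchmark's actual interface.

```python
import math

def accuracy(predictions, labels):
    """Closed-set accuracy on ID test data."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted at the threshold that keeps `tpr` of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    keep = max(1, math.ceil(tpr * len(ranked)))
    threshold = ranked[keep - 1]  # lowest score among the retained ID samples
    return sum(s >= threshold for s in ood_scores) / len(ood_scores)

def acc_ood_report(predictions, labels, id_scores, near_ood_scores, far_ood_scores, tpr=0.95):
    """Co-report closed-set accuracy with near- and far-OOD false-positive rates."""
    return {
        "acc": accuracy(predictions, labels),
        "near_fpr": fpr_at_tpr(id_scores, near_ood_scores, tpr),
        "far_fpr": fpr_at_tpr(id_scores, far_ood_scores, tpr),
    }

# Hypothetical outputs from a frozen checkpoint, for illustration only.
report = acc_ood_report(
    predictions=[0, 1, 1, 2], labels=[0, 1, 2, 2],
    id_scores=[4.0, 3.0, 2.0, 1.0],
    near_ood_scores=[3.5, 0.5], far_ood_scores=[0.2, 0.1],
    tpr=0.5,
)
print(report)
```

Keeping the checkpoint frozen and varying only the score function and the OOD split is what makes such a report comparable across LNL methods.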

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces an ACC-OOD benchmark that freezes LNL checkpoints and evaluates them under standardized near-/far-OOD routing with post-hoc scores. It reports a recurring failure mode termed uncertainty collapse: high closed-set accuracy does not guarantee OOD reliability because low-confidence misclassified ID samples overlap the score and feature regions occupied by OOD inputs when training occurs under label noise. The authors propose Virtual Margin Regularization (VMR) as a lightweight probe that synthesizes boundary virtual outliers on trusted ID batches to widen the energy margin, showing partial mitigation of far-OOD failure without sacrificing closed-set accuracy in the tested settings.

Significance. If the central empirical observations hold, the work usefully bridges LNL and OOD detection by demonstrating that accuracy-centric benchmarks are insufficient for open-world deployment. The learner-agnostic ACC-OOD benchmark and the concrete VMR intervention supply practical tools and diagnostics. Credit is due for the reproducible-style evaluation protocol across synthetic and real noise and for framing the overlap as a structural rather than incidental issue.

major comments (2)
  1. The claim that uncertainty collapse is induced specifically by noisy training (rather than being a general property of high-accuracy classifiers) is load-bearing for the central thesis. The manuscript should include a direct comparison against clean-label models that reach comparable closed-set accuracy; without it, the observed ID-error/OOD overlap could be an artifact of the chosen post-hoc scores or the ACC-OOD routing protocol rather than intrinsic to the LNL objective.
  2. § on VMR experiments: VMR is demonstrated primarily with PSSCL. To support the statement that it is a general lightweight repair, results with at least two additional LNL methods (e.g., DivideMix or ELR) under the same ACC-OOD protocol are needed; otherwise the intervention remains tied to a single host objective.
minor comments (2)
  1. Clarify whether the energy score used for VMR is the same formulation as the post-hoc energy score in the main benchmark tables; any difference should be stated explicitly.
  2. Figure captions should explicitly list the exact noise rates and OOD datasets corresponding to each panel to improve readability.
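
For reference when reading minor comment 1, the post-hoc energy score as defined by Liu et al. (2020, "Energy-based out-of-distribution detection") is shown below for logits $f_k(x)$ over $K$ classes and temperature $T$. Whether the VMR term uses this exact form, the same $T$, or a sign-flipped variant is not stated in the abstract, which is the ambiguity the comment raises.

```latex
E(x; f) = -T \log \sum_{k=1}^{K} \exp\!\left( \frac{f_k(x)}{T} \right),
\qquad
S(x) = -E(x; f)
```

Here $S(x)$ is the detection score, with higher values indicating more ID-like inputs.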

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which help clarify the scope of our claims on uncertainty collapse and the generality of Virtual Margin Regularization. We address each major comment below and have prepared revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: The claim that uncertainty collapse is induced specifically by noisy training (rather than being a general property of high-accuracy classifiers) is load-bearing for the central thesis. The manuscript should include a direct comparison against clean-label models that reach comparable closed-set accuracy; without it, the observed ID-error/OOD overlap could be an artifact of the chosen post-hoc scores or the ACC-OOD routing protocol rather than intrinsic to the LNL objective.

    Authors: We agree that a direct comparison to clean-label training at matched closed-set accuracy is essential to isolate the contribution of label noise. In the revised manuscript we have added controlled experiments training the same backbone architectures on clean versions of the benchmark datasets until they reach accuracy levels comparable to the noisy-label checkpoints. These results show substantially reduced ID-error/OOD score overlap under clean training, indicating that the collapse phenomenon is indeed exacerbated by the noisy-label objective rather than being an artifact of the evaluation protocol. A new subsection and accompanying figure have been inserted to present this comparison. revision: yes

  2. Referee: § on VMR experiments: VMR is demonstrated primarily with PSSCL. To support the statement that it is a general lightweight repair, results with at least two additional LNL methods (e.g., DivideMix or ELR) under the same ACC-OOD protocol are needed; otherwise the intervention remains tied to a single host objective.

    Authors: We acknowledge that broader empirical validation across LNL methods would better support the claim that VMR is a general lightweight repair. While the current experiments focus on PSSCL as a strong and representative host method, we have now run the identical ACC-OOD protocol with VMR applied on top of DivideMix and ELR. The additional results, included in the revised experimental section and supplementary material, show that VMR continues to deliver partial mitigation of far-OOD failure without degrading closed-set accuracy, consistent with the mechanism operating on trusted ID batches independently of the underlying LNL objective. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no circular derivation or self-referential claims

full rationale

The paper's core contribution is an empirical ACC-OOD benchmark that freezes existing LNL checkpoints and measures overlaps between low-confidence misclassified ID samples and OOD inputs using standardized post-hoc scores and near-/far-OOD routing. The observation of uncertainty collapse follows directly from these experimental comparisons across synthetic and real noise settings, without any equations that define quantities in terms of themselves or predictions that reduce to fitted inputs by construction. The VMR intervention is presented as a lightweight empirical probe evaluated on the same benchmark protocol, with no load-bearing reliance on self-citations, uniqueness theorems, or ansatzes imported from prior author work. All claims remain tied to the reported benchmark results rather than any self-contained derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical validity of the proposed benchmark and the assumption that observed overlaps generalize beyond the tested noise models and OOD datasets.

axioms (1)
  • domain assumption: Standard assumptions about label noise models and the separability of in-distribution versus out-of-distribution data under common post-hoc scores.
    The benchmark and collapse diagnosis presuppose that synthetic and real label noise behave similarly to deployment conditions.
invented entities (1)
  • uncertainty collapse (no independent evidence)
    purpose: To name the observed overlap between low-confidence ID misclassifications and OOD inputs in score/feature space.
    New descriptive term introduced to characterize the pathology; no independent falsifiable prediction is provided in the abstract.



