arxiv: 2605.17429 · v1 · pith:O46XKYCXnew · submitted 2026-05-17 · 💻 cs.LG · cs.CV

Radial-Angular Geometry for Reliable Update Diagnosis in Noisy-Label Learning

Pith reviewed 2026-05-20 13:41 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords noisy-label learning · gradient conflict · update diagnosis · empirical Fisher trace · EMA teacher · radial-angular geometry · sample reliability

The pith

Diagnosing label reliability by comparing the observed-label gradient to an EMA teacher reference improves hard-clean preservation in noisy training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard noisy-label methods judge samples with forward signals such as loss or confidence, yet these signals can rate hard clean examples and mislabeled ones similarly. The paper instead diagnoses whether the observed label would produce a reliable parameter update. It uses the trace of the sample-wise empirical Fisher information to measure update magnitude and factorizes that trace into a prediction-residual term and a feature-sensitivity term. To decide if a large update is useful, the method adds Relative Geometric Conflict, which checks the angular alignment between the observed gradient and a reference gradient from an exponential moving average teacher. When conflict is low the update is treated as reliable and the sample is kept; when high the sample is treated as noise. This distinction raises accuracy on both synthetic and real-world noisy benchmarks by retaining difficult but correct examples.

Core claim

Reliability estimation is recast as diagnosis of the observed-label update. The sample-wise empirical Fisher trace supplies a backward-space measure of update energy that factorizes into a prediction-residual term and a feature-sensitivity term for the classifier layer. Trace alone is still a radial magnitude signal and cannot decide whether a large update is useful or harmful. Relative Geometric Conflict therefore compares the observed-label gradient with the reference gradient induced by an EMA teacher. The conflict term distinguishes large but aligned hard-clean updates from large conflicting updates caused by corrupted labels.
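
The factorization can be checked numerically on a toy classifier head. The sketch below is an illustration under stated assumptions, not the paper's code: it assumes a softmax classifier with logits z = W h, cross-entropy loss, and no bias term, so the per-sample gradient with respect to W is the outer product (p − y) hᵀ and the sample-wise empirical Fisher trace (the squared gradient norm) splits into ‖p − y‖² · ‖h‖². The function names and all numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classifier_gradient(W, h, label):
    """Per-sample cross-entropy gradient w.r.t. W, assuming logits z = W h (no bias)."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    probs = softmax(logits)
    # Prediction residual: predicted probabilities minus the one-hot observed label.
    residual = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    # The gradient is the outer product residual * h^T.
    grad = [[r * x for x in h] for r in residual]
    return grad, residual

# Toy values, chosen only for illustration.
W = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [-0.5, 0.1, 0.1]]
h = [1.0, 2.0, -1.0]
grad, residual = classifier_gradient(W, h, label=1)

# The trace of the rank-one matrix g g^T equals the squared gradient norm.
trace = sum(g * g for row in grad for g in row)
residual_term = sum(r * r for r in residual)  # ||p - y||^2, prediction residual
feature_term = sum(x * x for x in h)          # ||h||^2, feature sensitivity

print(trace, residual_term * feature_term)
```

The two printed values agree up to floating-point rounding. Both factors are magnitudes, which is why the trace alone stays a radial signal: two samples with the same residual norm and feature norm get the same trace whichever way their gradients point.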

What carries the argument

Relative Geometric Conflict (RGC), the angular disagreement between the gradient induced by the observed label and the gradient induced by an exponential-moving-average teacher, used to decide whether a high-magnitude update is reliable or noise-induced.
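
One way to picture the angular term is as one minus the cosine similarity between the two gradients. The sketch below is a minimal illustration, not the paper's definition: the normalization (1 − cos, ranging over [0, 2]), the choice of the teacher's soft prediction as the reference target, and the restriction to the classifier layer are all assumptions, and every name and number is hypothetical.

```python
import math

def flatten_outer(residual, h):
    """Flattened classifier-layer gradient residual * h^T."""
    return [r * x for r in residual for x in h]

def angular_conflict(g_obs, g_ref, eps=1e-12):
    """Return 1 - cos(g_obs, g_ref): 0 when aligned, 2 when opposed."""
    dot = sum(a * b for a, b in zip(g_obs, g_ref))
    norm_obs = math.sqrt(sum(a * a for a in g_obs))
    norm_ref = math.sqrt(sum(b * b for b in g_ref))
    if norm_obs < eps or norm_ref < eps:
        return 0.0  # assumption: a vanishing gradient carries no conflict signal
    return 1.0 - dot / (norm_obs * norm_ref)

def one_hot(label, n):
    return [1.0 if k == label else 0.0 for k in range(n)]

# Toy values: the student is unsure, the teacher favors class 0.
student_probs = [0.40, 0.35, 0.25]
teacher_probs = [0.70, 0.20, 0.10]
h = [1.0, 2.0, -1.0]

# Assumed reference gradient: pulls the student toward the teacher's soft prediction.
g_ref = flatten_outer([p - q for p, q in zip(student_probs, teacher_probs)], h)

def conflict_for_label(label):
    """Conflict between the observed-label gradient and the reference gradient."""
    residual = [p - y for p, y in zip(student_probs, one_hot(label, 3))]
    return angular_conflict(flatten_outer(residual, h), g_ref)

conflict_hard_clean = conflict_for_label(0)  # observed label agrees with the teacher
conflict_mislabeled = conflict_for_label(2)  # observed label contradicts the teacher
print(conflict_hard_clean, conflict_mislabeled)
```

In this toy setup both observed labels produce large residuals because the student is uncertain, yet the conflict is close to 0 for the label the teacher supports and above 1 for the contradicting label. That is the separation the angular term is meant to provide where magnitude alone cannot.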

If this is right

  • Hard clean samples that induce large but aligned updates are retained rather than filtered out during training.
  • Mislabeled samples that induce conflicting updates are more reliably detected and downweighted.
  • Final model accuracy increases on both synthetic and real-world noisy-label benchmarks under the stated evaluation protocol.
  • The factorization of the Fisher trace supplies diagnostic information beyond scalar loss for deciding update reliability.
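
The consequences listed above can be summarized as a two-signal decision rule. The sketch below is hypothetical: the function name, the three outcome labels, and both thresholds are illustrative choices, and the paper may combine the radial and angular signals differently (for example through soft weights rather than hard cut-offs).

```python
def diagnose_update(trace, conflict, trace_threshold, conflict_threshold):
    """Hypothetical radial-angular rule; both thresholds are illustrative tuning knobs."""
    if conflict >= conflict_threshold:
        return "downweight"       # conflicting update: label is likely corrupted
    if trace >= trace_threshold:
        return "keep-hard-clean"  # large but aligned update
    return "keep-easy"            # small, aligned update

# Toy (trace, conflict) pairs, chosen only for illustration.
samples = [(5.0, 0.05), (5.0, 1.40), (0.2, 0.02)]
decisions = [
    diagnose_update(trace, conflict, trace_threshold=1.0, conflict_threshold=0.5)
    for trace, conflict in samples
]
print(decisions)
```

The first two toy samples share the same trace and differ only in conflict, so a purely radial criterion would treat them identically while this rule keeps one and downweights the other.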

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same radial-angular split could be tested in other regimes where gradients must be diagnosed, such as semi-supervised learning with partial labels.
  • If the EMA teacher drifts, periodically resetting it from a small verified clean subset might restore diagnostic power without changing the core geometry.
  • Applying RGC to structured noise patterns, such as label flips that are consistent across similar images, would test whether the conflict signal stays informative.

Load-bearing premise

The exponential moving average teacher gradient remains a stable and sufficiently clean reference that aligns with true label updates.
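
For orientation, a conventional EMA teacher keeps a slowly moving average of the student's weights. The sketch below shows that standard update on plain lists of floats; the helper name is hypothetical, the decay of 0.999 is the value quoted in the simulated rebuttal on this page, and the student is held fixed purely to expose how slowly the teacher follows.

```python
def ema_update(teacher, student, decay):
    """Return the teacher weights after one EMA step toward the student weights."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

decay = 0.999
teacher = [0.0, 0.0]
student = [1.0, -2.0]  # held fixed here only to make the teacher's lag visible
steps = 1000

for _ in range(steps):
    teacher = ema_update(teacher, student, decay)

# Closed form for a fixed student and a zero-initialized teacher.
closed_form = [(1.0 - decay ** steps) * s for s in student]
print(teacher, closed_form)
```

The lag cuts both ways: a high decay smooths over individual noisy updates, but anything the student absorbs persistently, including memorized wrong labels, eventually reaches the teacher as well. That is the drift risk the premise above depends on.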

What would settle it

Freeze the EMA teacher after clean pre-training, then add controlled label noise to the training set and check whether RGC still outperforms loss-based filtering; loss of the advantage would refute the reference stability assumption.

Figures

Figures reproduced from arXiv: 2605.17429 by Jingyang Mao, Ningkang Peng, Weiguang Qu, Xiaoqian Peng, Yanhui Gu.

Figure 1: Overall motivation of RGC. Conventional forward-space signals mainly measure sample difficulty and confuse hard …
Figure 2: Diagnostic analysis of reliability signals. Under matched difficulty, forward-space signals become non-discriminative; …
Figure 3: Hyperparameter sensitivity of RGC with respect to …
Figure 4: Ablation and boundary analysis of RGC. The figure summarizes the reference design ablation and the stress tests …
Figure 5: Plug-in compatibility of Trace and RGC on repre…
Original abstract

Noisy-label methods often estimate sample reliability from forward-space signals such as loss, confidence, or entropy. These signals indicate whether a sample is difficult to predict, but they do not directly test whether its observed label induces a reliable parameter update. This gap matters because hard clean samples and mislabeled samples can have similar loss while inducing different updates. We recast reliability estimation as diagnosis of the observed-label update. The sample-wise empirical Fisher trace gives a backward-space measure of update energy: for the classifier layer, it factorizes into a prediction-residual term and a feature-sensitivity term, so it captures information beyond scalar loss. Trace, however, is still a radial magnitude signal and cannot decide whether a large update is useful or harmful. We therefore propose Relative Geometric Conflict (RGC), which compares the observed-label gradient with a reference gradient induced by an EMA teacher. The conflict term helps distinguish large but aligned hard-clean updates from large conflicting updates caused by corrupted labels. Across synthetic and real-world noisy-label benchmarks, RGC improves hard-clean preservation and accuracy under our evaluation protocol.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that reliability estimation in noisy-label learning can be recast as diagnosis of the observed-label parameter update. It factorizes the sample-wise empirical Fisher trace (for the classifier layer) into a prediction-residual term and a feature-sensitivity term, yielding a radial magnitude signal beyond scalar loss. It then introduces Relative Geometric Conflict (RGC), which measures angular conflict between the observed-label gradient and a reference gradient from an EMA teacher; this angular term is intended to separate large but aligned hard-clean updates from large conflicting updates induced by corrupted labels. Experiments on synthetic and real-world noisy-label benchmarks report improved hard-clean preservation and accuracy under the authors' evaluation protocol.

Significance. If the central claim holds after addressing the reference-stability concern, the work supplies a geometrically motivated backward-space diagnostic that complements existing forward-space signals. The explicit factorization of the empirical Fisher trace and the introduction of an angular conflict measure against an EMA reference constitute a concrete, falsifiable proposal that could be integrated into existing noisy-label pipelines; the reported gains on hard-clean preservation would be a useful practical contribution if shown to be robust to teacher drift.

major comments (2)
  1. [Abstract / RGC construction] The central claim that RGC isolates label corruption via angular conflict presupposes that the EMA-teacher reference remains a stable proxy for the true clean-label direction. Because the teacher parameters are an exponential moving average of student weights trained on the same noisy training set, any label noise the student absorbs propagates into the teacher and contaminates the reference; in that regime the conflict score conflates teacher drift with sample-level unreliability. The abstract describes the factorization and angular comparison but supplies no ablation that holds the teacher fixed (e.g., an oracle clean EMA) while varying noise rate; the missing ablation bears directly on the diagnostic power asserted in the proposal.
  2. [Experiments section (implied by benchmark results)] The reported improvements in hard-clean preservation and accuracy rest on an evaluation protocol whose details (exact conflict-threshold selection, EMA decay schedule, and handling of the free parameters listed in the axiom ledger) are not fully specified in the visible description. Without these controls or an oracle-teacher ablation, it is impossible to determine whether the gains are attributable to the geometric conflict term or to post-hoc tuning that inadvertently favors the method.
minor comments (2)
  1. [Notation / Method] Define the precise normalization used for the angular component of RGC and clarify whether the conflict threshold is chosen once per dataset or per noise rate.
  2. [Results] Add error bars or statistical significance tests to any tables or figures that compare hard-clean preservation rates across methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major concern point by point below, providing the strongest honest defense of the manuscript while acknowledging where revisions are warranted to strengthen the claims.

Point-by-point responses
  1. Referee: [Abstract / RGC construction] The central claim that RGC isolates label corruption via angular conflict presupposes that the EMA-teacher reference remains a stable proxy for the true clean-label direction. Because the teacher parameters are an exponential moving average of student weights trained on the same noisy training set, any label noise the student absorbs propagates into the teacher and contaminates the reference; in that regime the conflict score conflates teacher drift with sample-level unreliability. The abstract describes the factorization and angular comparison but supplies no ablation that holds the teacher fixed (e.g., an oracle clean EMA) while varying noise rate; the missing ablation bears directly on the diagnostic power asserted in the proposal.

    Authors: We agree that the stability of the EMA reference is central to interpreting RGC and that the absence of an oracle ablation leaves the isolation claim partially untested. At the same time, the method is explicitly designed for the realistic noisy-label regime where a clean teacher is unavailable; the EMA is intended as a practical, slowly evolving proxy rather than a perfect clean reference. Our reported gains in hard-clean preservation are measured against standard baselines that also operate without clean supervision, indicating that the angular term supplies complementary signal even under teacher drift. To directly test the referee's concern, we have added an oracle-teacher ablation in the revised manuscript (new Figure 4 and accompanying text in Section 4.3) that trains a separate EMA on clean labels for comparison while keeping all other factors fixed. This shows that noisy-EMA RGC retains a substantial fraction of the oracle benefit, supporting that the geometric conflict remains informative rather than being wholly confounded by drift. revision: yes

  2. Referee: [Experiments section (implied by benchmark results)] The reported improvements in hard-clean preservation and accuracy rest on an evaluation protocol whose details (exact conflict-threshold selection, EMA decay schedule, and handling of the free parameters listed in the axiom ledger) are not fully specified in the visible description. Without these controls or an oracle-teacher ablation, it is impossible to determine whether the gains are attributable to the geometric conflict term or to post-hoc tuning that inadvertently favors the method.

    Authors: We accept that the original submission omitted several implementation details required for full reproducibility and attribution. In the revised manuscript we have expanded Section 4 and added Appendix C with the following specifications: conflict threshold is selected by grid search over {0.1, 0.2, ..., 0.8} on a 5 % held-out clean validation subset (never used for training or final evaluation); EMA decay is fixed at 0.999 following common practice; and all axiom-ledger hyperparameters are enumerated with their chosen values and sensitivity analysis. The newly added oracle ablation further helps isolate the contribution of the angular term from protocol tuning. We believe these additions allow readers to assess whether the reported improvements are driven by the proposed radial-angular geometry. revision: yes
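
The selection procedure described in this response amounts to a small grid search. The sketch below is an assumption-laden illustration: `threshold_grid`, `select_threshold`, and `toy_validation_score` are hypothetical names, and the scoring function is a stand-in for training with a given threshold and measuring accuracy on the held-out clean validation subset, which the response does not spell out.

```python
def threshold_grid():
    """Candidate conflict thresholds 0.1, 0.2, ..., 0.8."""
    return [round(0.1 * k, 1) for k in range(1, 9)]

def select_threshold(candidates, validation_score):
    """Return the candidate with the highest validation score (first one wins ties)."""
    return max(candidates, key=validation_score)

def toy_validation_score(threshold):
    # Stand-in for: train with this threshold, then score on the held-out clean subset.
    # The peak location is arbitrary and carries no empirical meaning.
    return 1.0 - abs(threshold - 0.3)

grid = threshold_grid()
best = select_threshold(grid, toy_validation_score)
print(grid, best)
```

Keeping the validation subset out of both training and final evaluation, as the response states, is what lets a search like this avoid leaking test information into the chosen threshold.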

Circularity Check

0 steps flagged

Derivation chain remains self-contained with no circular reductions

full rationale

The paper begins from the standard definition of the sample-wise empirical Fisher trace for the classifier layer and algebraically factorizes it into a prediction-residual term and a feature-sensitivity term; this is a direct consequence of the gradient expression (residual scaled by features) rather than a self-referential loop. It then introduces Relative Geometric Conflict (RGC) as an angular comparison between the observed-label gradient and an EMA-teacher reference gradient. This angular term is presented as an independent diagnostic that supplements the radial magnitude, not as a quantity forced by fitting or by renaming the trace itself. No equations reduce a claimed result to its inputs by construction, no parameters are fitted on a subset and then relabeled as predictions, and the description invokes no self-citations or author-specific uniqueness theorems to justify the reference. The EMA teacher is a conventional technique whose use here supplies an external directional signal relative to the current sample gradient; the overall proposal is therefore evaluated on independent synthetic and real-world benchmarks rather than tautologically reproducing its own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the empirical Fisher trace factorization at the classifier layer and the assumption that EMA provides an independent reference direction. No explicit free parameters are named in the abstract, but EMA momentum and any conflict threshold are implicit tuning choices. No new physical entities are postulated.

free parameters (2)
  • EMA momentum/decay rate
    Controls stability of the teacher reference gradient; must be chosen to balance noise filtering against lag.
  • Conflict threshold or scaling factor
    Likely used to decide when conflict indicates label corruption; not stated but required for practical use.
axioms (2)
  • domain assumption: Empirical Fisher trace at classifier layer factorizes into prediction-residual and feature-sensitivity terms
    Invoked to claim the trace captures information beyond scalar loss.
  • domain assumption: EMA teacher gradient approximates the direction of reliable updates
    Central to defining conflict as misalignment with observed-label gradient.
invented entities (1)
  • Relative Geometric Conflict (RGC) · no independent evidence
    purpose: Angular diagnostic to distinguish aligned hard-clean updates from conflicting mislabeled updates
    New quantity introduced to augment radial magnitude signals; no independent falsifiable prediction outside the method itself.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We therefore propose Relative Geometric Conflict (RGC), which compares the observed-label gradient with a reference gradient induced by an EMA teacher. The conflict term helps distinguish large but aligned hard-clean updates from large conflicting updates caused by corrupted labels.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
