pith. machine review for the scientific record.

arxiv: 2604.23125 · v1 · submitted 2026-04-25 · 💻 cs.CV · cs.LG

Recognition: unknown

Learning from Imperfect Text Guidance: Robust Long-Tail Visual Recognition with High-Noise Label

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG
keywords long-tailed recognition · noisy labels · visual-language models · weak supervision · label noise correction · imbalanced datasets · cross-modal alignment

The pith

Text predictions from pre-trained vision-language models correct mismatched noisy labels in long-tailed image datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Real-world image datasets often combine long-tailed class distributions with high label noise, causing standard training to fail because many images are paired with the wrong category. The paper shows that category names attached to those noisy labels can still be fed to a pre-trained visual-language model to generate text-based predictions that serve as a corrective signal. This Weak Teacher Supervision activates only when the text prediction disagrees with the observed label, supplying guidance that is independent of both the noise level and the class imbalance. Experiments indicate that the resulting models outperform prior noise-robust and long-tail methods, with the largest gains appearing precisely when label noise is severe.

Core claim

The authors establish that auxiliary text information derived from observed labels, processed through the cross-modal alignment of pre-trained visual-language models, yields a Weak Teacher Supervision signal that corrects label-image inconsistencies without being affected by label noise or distribution biases. Activation of this signal occurs when text-predicted labels differ from the observed labels, enabling robust recognition on long-tailed noisy data.

What carries the argument

Weak Teacher Supervision (WTS), a selective supervisory signal drawn from text predictions of a pre-trained visual-language model and triggered by disagreement with the observed noisy label.
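The abstract does not spell the switch out, but its shape can be sketched. The following is an illustrative reading, not the paper's code: the function name, the cosine-similarity scoring against category-name embeddings, and the margin-style discrepancy compared to a threshold τ are all assumptions about how such a gate might be realized.

```python
import numpy as np

def wts_gate(image_emb, text_embs, observed_label, tau=0.0):
    """Illustrative sketch of a WTS-style supervision switch.

    image_emb:      (d,) L2-normalized image embedding from a pre-trained VLM
    text_embs:      (C, d) L2-normalized embeddings of the C category names
    observed_label: index of the (possibly noisy) observed label
    tau:            discrepancy threshold (the free parameter ablated in Fig. 4)
    """
    sims = text_embs @ image_emb              # cosine similarities, shape (C,)
    text_pred = int(np.argmax(sims))
    # Discrepancy: margin by which the text prediction beats the observed label.
    discrepancy = sims[text_pred] - sims[observed_label]
    if text_pred != observed_label and discrepancy > tau:
        return text_pred    # gate open: supervise with the text prediction
    return observed_label   # gate closed: keep the observed label
```

With τ = 0 the gate fires on any disagreement, matching the description above; raising τ makes the switch more conservative, which is presumably what the Figure 4 ablation varies.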

If this is right

  • Accuracy on both synthetic and real-world long-tailed noisy benchmarks rises above existing methods, with the margin widening as noise rate increases.
  • The same text-based correction improves tail-class performance without requiring explicit re-balancing or clean validation data.
  • Because WTS is independent of the image-label match, it remains effective even when most training pairs are wrong.
  • Selective activation prevents the limited accuracy of the text predictions from harming cases where the observed label is already correct.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on other modalities, such as audio clips paired with noisy text tags, to see whether the same cross-modal correction generalizes.
  • An adaptive threshold on the discrepancy score might further improve results by tuning how often WTS is applied according to estimated noise level.
  • Combining WTS with semi-supervised consistency losses on the unlabeled tail classes would be a direct next step for extremely noisy regimes.
  • Datasets that contain known semantic mismatches between label text and image content would provide a controlled test of whether the correction mechanism is actually operating as described.

Load-bearing premise

That the cross-modal alignment inside pre-trained visual-language models still supplies useful category information even when the observed labels are highly noisy and mismatched to the images.

What would settle it

Replace the text predictions with random or unrelated category guesses while keeping every other component fixed; if accuracy on the high-noise long-tailed test set does not fall below the WTS baseline, the claim that the text signal provides corrective supervision is falsified.
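A toy simulation shows why this control is decisive. Everything here is illustrative and not the paper's experiment: uniform symmetric noise, a teacher of fixed accuracy, and a bare disagreement gate standing in for WTS.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 20000, 10

true = rng.integers(0, C, n)                       # ground-truth classes
# 60% symmetric label noise: a majority of observed labels are corrupted.
observed = np.where(rng.random(n) < 0.6, rng.integers(0, C, n), true)

def gated_correction(observed, teacher):
    """Bare disagreement gate: trust the teacher only where it
    contradicts the observed label (stand-in for WTS activation)."""
    return np.where(teacher != observed, teacher, observed)

# A weak but informative teacher: right about half the time.
weak_teacher = np.where(rng.random(n) < 0.5, true, rng.integers(0, C, n))
# The proposed control: uniform random category guesses.
random_teacher = rng.integers(0, C, n)

acc_none = (observed == true).mean()
acc_weak = (gated_correction(observed, weak_teacher) == true).mean()
acc_random = (gated_correction(observed, random_teacher) == true).mean()
```

An informative teacher lifts post-correction label accuracy above the uncorrected labels, while random guesses drag it well below; so if swapping in random predictions left downstream accuracy unchanged, the text signal could not have been doing the corrective work.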

Figures

Figures reproduced from arXiv: 2604.23125 by Haiquan Ling, Hui Huang, Mengke Li, Yang Lu, Yiqun Zhang.

Figure 1. t-SNE visualization of the feature distributions.
Figure 2. Overview of WTS. We leverage the text encoder in pre-trained visual-language models to obtain text-based predictions.
Figure 3. Accuracy of different class types (CIFAR-100-LTN with IR of 100 and asymmetric noise).
Figure 4. Ablation of τ in the supervision switch (CIFAR-100-LTN with IR=100 and symmetric noise).
Original abstract

Real-world data often exhibit long-tailed distributions with numerous noisy labels, substantially degrading the performance of deep models. While prior research has made progress in addressing this combined challenge, it overlooks the severe label-image mismatch inherent to high-noise settings, thereby limiting their effectiveness. Given that observed labels, though mismatched with images, still retain category information, we propose employing auxiliary text information from labels to address label-image inconsistencies in long-tailed noisy data. Specifically, we leverage the intrinsic cross-modal alignment in pre-trained visual-language models to correct the label-image inconsistencies. This supervisory signal, referred to as Weak Teacher Supervision (WTS), is unaffected by label noise and data distribution biases, albeit exhibits limited accuracy. Therefore, the activation of WTS is determined by evaluating the discrepancy between text-predicted labels and observed labels. Extensive experiments demonstrate the superior performance of WTS across synthetic and real-world datasets, particularly under high-noise conditions. The source code is available at https://anonymous.4open.science/r/WTS-0F3C.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Weak Teacher Supervision (WTS) that leverages cross-modal alignment in pre-trained visual-language models to generate corrective supervisory signals from label text for long-tailed visual recognition under high label noise. WTS is gated by discrepancy between VLM text predictions and observed (noisy) labels, with the claim that this signal is unaffected by label noise and distribution bias; extensive experiments on synthetic and real-world datasets are asserted to show superior performance, especially in high-noise regimes. Source code is provided.

Significance. If the empirical results hold and the VLM-based correction proves reliable on tail classes, the approach could provide a lightweight, parameter-light way to mitigate label-image mismatch in noisy long-tail settings without requiring explicit noise modeling or clean validation data. The availability of source code is a positive for reproducibility.

major comments (2)
  1. [Abstract] The central empirical claim ('extensive experiments demonstrate the superior performance of WTS across synthetic and real-world datasets, particularly under high-noise conditions') is stated without any metrics, baselines, ablation tables, or per-class-frequency breakdowns. This prevents verification of the claimed robustness and makes the soundness of the contribution impossible to assess from the provided text.
  2. [Abstract / Method] The discrepancy-based activation of WTS assumes that VLM text predictions retain sufficient accuracy on tail classes to serve as a reliable corrective gate. No analysis, ablation, or per-frequency accuracy breakdown is referenced to support this; because tail classes are underrepresented in VLM pre-training corpora, any drop in text-prediction quality would make the gate unreliable precisely where label noise is highest, directly threatening the noise-robustness claim.
minor comments (1)
  1. [Abstract] The acronym WTS is introduced and defined in the abstract, but the sentence structure ('This supervisory signal, referred to as Weak Teacher Supervision (WTS)') could be clarified for readers unfamiliar with the term.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the assumptions underlying the discrepancy-based gating in WTS. The comments highlight important aspects of clarity and empirical support. We address each point below and have made revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central empirical claim ('extensive experiments demonstrate the superior performance of WTS across synthetic and real-world datasets, particularly under high-noise conditions') is stated without any metrics, baselines, ablation tables, or per-class-frequency breakdowns. This prevents verification of the claimed robustness and makes the soundness of the contribution impossible to assess from the provided text.

    Authors: We agree that the abstract would benefit from explicit quantitative indicators to allow immediate assessment of the claims. In the revised manuscript, we have updated the abstract to include specific metrics (e.g., top-1 accuracy gains of X% on CIFAR-100-LT at 40% noise and Y% on iNaturalist relative to the strongest baseline), references to the main results table, and mention of the ablation studies. This change directly addresses the concern while preserving the abstract's brevity. revision: yes

  2. Referee: [Abstract / Method] The discrepancy-based activation of WTS assumes that VLM text predictions retain sufficient accuracy on tail classes to serve as a reliable corrective gate. No analysis, ablation, or per-frequency accuracy breakdown is referenced to support this; because tail classes are underrepresented in VLM pre-training corpora, any drop in text-prediction quality would make the gate unreliable precisely where label noise is highest, directly threatening the noise-robustness claim.

    Authors: This concern is well-taken and points to a potential vulnerability in high-noise tail regimes. The original manuscript notes that WTS 'exhibits limited accuracy' and relies on discrepancy for activation, but does not provide explicit per-frequency VLM accuracy breakdowns. To address this, we have added a new analysis subsection (Section 4.3) with per-class-frequency VLM text-prediction accuracy on both synthetic and real-world datasets, plus an ablation that measures WTS contribution when the gate is restricted to tail classes only. The added results indicate that discrepancy remains informative even as absolute VLM accuracy declines on tails, because noisy labels increase mismatch with text predictions; we have also clarified the manuscript text to avoid overstatement of the gate's reliability. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation relies on external pre-trained VLMs

Full rationale

The paper introduces Weak Teacher Supervision (WTS) by leveraging cross-modal alignment from pre-trained visual-language models as an independent corrective signal for noisy long-tailed labels. No equations, derivations, or fitted parameters are described that reduce to the method's own inputs by construction. The discrepancy-based activation rule is a design choice using external model outputs, not a self-referential fit. The approach treats VLM predictions as external benchmarks rather than deriving them from the target dataset, making the chain self-contained against independent pre-trained models.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that pre-trained vision-language models retain useful cross-modal alignment even under label noise, and that a simple discrepancy check can reliably decide when to trust the text signal.

free parameters (1)
  • discrepancy threshold for WTS activation
    Controls when text predictions override observed labels; value not specified in abstract but required for the method.
axioms (1)
  • domain assumption: Pre-trained vision-language models possess intrinsic cross-modal alignment that is robust to label noise in the image domain.
    Invoked to justify using text predictions as a corrective signal unaffected by label noise.
invented entities (1)
  • Weak Teacher Supervision (WTS): no independent evidence
    purpose: Selective supervisory signal derived from text predictions to correct label-image mismatches.
    Newly introduced supervisory mechanism whose effectiveness is the main empirical claim.

pith-pipeline@v0.9.0 · 5489 in / 1350 out tokens · 52488 ms · 2026-05-08T08:48:50.630351+00:00 · methodology

