Learning from Imperfect Text Guidance: Robust Long-Tail Visual Recognition with High-Noise Label

Haiquan Ling; Hui Huang; Mengke Li; Yang Lu; Yiqun Zhang

arxiv: 2604.23125 · v1 · submitted 2026-04-25 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-08 08:48 UTC · model grok-4.3

keywords long-tailed recognition · noisy labels · visual-language models · weak supervision · label noise correction · imbalanced datasets · cross-modal alignment

The pith

Text predictions from pre-trained vision-language models correct mismatched noisy labels in long-tailed image datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Real-world image datasets often combine long-tailed class distributions with high label noise, causing standard training to fail because many images are paired with the wrong category. The paper shows that category names attached to those noisy labels can still be fed to a pre-trained visual-language model to generate text-based predictions that serve as a corrective signal. This Weak Teacher Supervision activates only when the text prediction disagrees with the observed label, supplying guidance that is independent of both the noise level and the class imbalance. Experiments indicate that the resulting models outperform prior noise-robust and long-tail methods, with the largest gains appearing precisely when label noise is severe.
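To make the setting concrete, the sketch below builds a toy long-tailed label set and corrupts it with symmetric noise. It is an illustration only: the exponential class-size profile, the convention that a flipped label always lands on a different class, and the names `long_tail_counts` and `add_symmetric_noise` are assumptions, not details confirmed by the paper.

```python
import random

def long_tail_counts(num_classes, max_count, imbalance_ratio):
    """Exponentially decaying class sizes; largest/smallest equals imbalance_ratio."""
    if num_classes == 1:
        return [max_count]
    return [
        max(1, round(max_count * imbalance_ratio ** (-i / (num_classes - 1))))
        for i in range(num_classes)
    ]

def add_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """With probability noise_rate, flip a label to a different class chosen uniformly."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            other = rng.randrange(num_classes - 1)
            noisy.append(other if other < y else other + 1)  # skip the true class
        else:
            noisy.append(y)
    return noisy

# Hypothetical toy configuration: 10 classes, imbalance ratio 100, 60% noise.
counts = long_tail_counts(num_classes=10, max_count=500, imbalance_ratio=100)
true_labels = [c for c, n in enumerate(counts) for _ in range(n)]
observed = add_symmetric_noise(true_labels, num_classes=10, noise_rate=0.6)
mismatch = sum(t != o for t, o in zip(true_labels, observed)) / len(true_labels)
print(counts)
print(f"fraction of mismatched labels: {mismatch:.2f}")
```

At this noise rate most observed labels no longer match their images, which is the regime the paper targets.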

Core claim

The authors establish that auxiliary text information derived from observed labels, processed through the cross-modal alignment of pre-trained visual-language models, yields a Weak Teacher Supervision signal that corrects label-image inconsistencies without being affected by label noise or distribution biases. Activation of this signal occurs when text-predicted labels differ from the observed labels, enabling robust recognition on long-tailed noisy data.

What carries the argument

Weak Teacher Supervision (WTS), a selective supervisory signal drawn from text predictions of a pre-trained visual-language model and triggered by disagreement with the observed noisy label.
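A minimal sketch of such a disagreement gate follows, assuming a CLIP-style setup in which an image embedding is compared with one text embedding per class name by cosine similarity. The toy vectors stand in for real model outputs, the function names are hypothetical, and the paper's actual switch may use a threshold τ rather than a plain inequality.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def text_predicted_label(image_emb, class_text_embs):
    """Zero-shot style prediction: index of the class text embedding closest to the image."""
    sims = [cosine(image_emb, t) for t in class_text_embs]
    return max(range(len(sims)), key=sims.__getitem__)

def wts_gate(image_emb, class_text_embs, observed_label):
    """Return (active, text_label); the gate is active only when the two labels disagree."""
    text_label = text_predicted_label(image_emb, class_text_embs)
    return text_label != observed_label, text_label

# Toy 3-d embeddings standing in for VLM outputs (hypothetical values).
class_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.1, 0.9, 0.2]  # closest to class 1

print(wts_gate(image_emb, class_text_embs, observed_label=1))  # observed label agrees
print(wts_gate(image_emb, class_text_embs, observed_label=2))  # observed label disagrees
```

When the gate is inactive the observed label is trusted as is, so an unreliable text prediction cannot override a label it already agrees with.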

If this is right

Accuracy on both synthetic and real-world long-tailed noisy benchmarks rises above existing methods, with the margin widening as noise rate increases.
The same text-based correction improves tail-class performance without requiring explicit re-balancing or clean validation data.
Because WTS is independent of the image-label match, it remains effective even when most training pairs are wrong.
Selective activation prevents the limited accuracy of the text predictions from harming cases where the observed label is already correct.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be tested on other modalities, such as audio clips paired with noisy text tags, to see whether the same cross-modal correction generalizes.
An adaptive threshold on the discrepancy score might further improve results by tuning how often WTS is applied according to estimated noise level.
Combining WTS with semi-supervised consistency losses on the unlabeled tail classes would be a direct next step for extremely noisy regimes.
Datasets that contain known semantic mismatches between label text and image content would provide a controlled test of whether the correction mechanism is actually operating as described.

Load-bearing premise

That the cross-modal alignment inside pre-trained visual-language models still supplies useful category information even when the observed labels are highly noisy and mismatched to the images.

What would settle it

Replace the text predictions with random or unrelated category guesses while keeping every other component fixed; if accuracy on the high-noise long-tailed test set does not fall below that of the unmodified WTS model, the claim that the text signal provides corrective supervision is falsified.
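The control can be rehearsed on synthetic labels before any training run. The sketch below is a proxy only, since it cannot measure downstream accuracy: it simulates a teacher with a chosen accuracy and a random-guess teacher, then reports how often the disagreement gate fires on genuinely wrong labels and how often the teacher names the true class when it fires. All names and parameter values are hypothetical, and the simulated teacher reads the ground truth solely to fix its accuracy.

```python
import random

def gate_stats(num_classes, noise_rate, teacher, n=20000, seed=0):
    """Return (precision, correction_accuracy) over samples where the gate fires.

    precision: share of fired samples whose observed label is actually wrong.
    correction_accuracy: share of fired samples where the teacher names the true class.
    """
    rng = random.Random(seed)
    fired = wrong_label = teacher_right = 0
    for _ in range(n):
        y = rng.randrange(num_classes)
        observed = y
        if rng.random() < noise_rate:  # symmetric flip to a different class
            observed = (y + rng.randrange(1, num_classes)) % num_classes
        guess = teacher(y, rng)
        if guess != observed:  # gate fires on disagreement
            fired += 1
            wrong_label += observed != y
            teacher_right += guess == y
    return wrong_label / fired, teacher_right / fired

def informative_teacher(accuracy, num_classes):
    """Teacher that names the true class with the given probability, else a wrong class."""
    def teacher(y, rng):
        if rng.random() < accuracy:
            return y
        return (y + rng.randrange(1, num_classes)) % num_classes
    return teacher

def random_teacher(num_classes):
    """Control teacher: guesses uniformly at random and ignores the sample."""
    return lambda y, rng: rng.randrange(num_classes)

C, eta = 100, 0.6  # hypothetical class count and noise rate
print("informative:", gate_stats(C, eta, informative_teacher(0.6, C)))
print("random     :", gate_stats(C, eta, random_teacher(C)))
```

If the real text predictions behaved like the random teacher on these two diagnostics, the full-training control described above would be expected to show no drop.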

Figures

Figures reproduced from arXiv: 2604.23125 by Haiquan Ling, Hui Huang, Mengke Li, Yang Lu, Yiqun Zhang.

**Figure 1.** t-SNE visualization of the feature distributions.

**Figure 2.** Overview of WTS. We leverage the text encoder in pre-trained visual-language models to obtain text-based predic…

**Figure 3.** Accuracy of different class types (CIFAR100-LTN with IR of 100 and asymmetric noise).

**Figure 4.** Ablation of τ in supervision switch. The dataset is CIFAR-100-LTN with IR=100 and symmetric noise.

Original abstract

Real-world data often exhibit long-tailed distributions with numerous noisy labels, substantially degrading the performance of deep models. While prior research has made progress in addressing this combined challenge, it overlooks the severe label-image mismatch inherent to high-noise settings, thereby limiting their effectiveness. Given that observed labels, though mismatched with images, still retain category information, we propose employing auxiliary text information from labels to address label-image inconsistencies in long-tailed noisy data. Specifically, we leverage the intrinsic cross-modal alignment in pre-trained visual-language models to correct the label-image inconsistencies. This supervisory signal, referred to as Weak Teacher Supervision (WTS), is unaffected by label noise and data distribution biases, albeit exhibits limited accuracy. Therefore, the activation of WTS is determined by evaluating the discrepancy between text-predicted labels and observed labels. Extensive experiments demonstrate the superior performance of WTS across synthetic and real-world datasets, particularly under high-noise conditions. The source code is available at https://anonymous.4open.science/r/WTS-0F3C.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper's new piece is a discrepancy gate that turns on VLM text predictions as weak supervision only when they clash with the given noisy label, aimed at long-tailed high-noise recognition.

The main move is to treat pre-trained vision-language models as a source of auxiliary text supervision that stays independent of label noise. The authors call it Weak Teacher Supervision and activate it selectively by checking whether the VLM's text-derived label differs from the observed one. This targets the label-image mismatch that gets worse in long-tailed noisy data, where standard methods often fail because they assume cleaner signals. Releasing code helps, and the framing is practical for real datasets that mix imbalance and noise without needing fresh collection pipelines. If the experiments hold, it gives a lightweight way to improve robustness using existing models.

The experiments are described as extensive on both synthetic and real data with emphasis on high-noise regimes, which matches the problem's importance. The soft spot is the untested reliability of the VLM signal on tail classes. Those classes are rare by definition, so they are likely poorly represented in VLM pre-training; if text predictions degrade there, the discrepancy gate could pass through bad corrections precisely where noise is already highest. The abstract and stress-test note give no per-class breakdowns or isolated ablations on tail VLM accuracy, so the central claim rests on an assumption that may not survive closer inspection.

This is for researchers handling noisy long-tail vision tasks who already use or can run VLMs. It is an incremental but targeted extension rather than a new paradigm, so it is worth a look for the gating trick but not a must-read for everyone. I would bring it to a reading group to discuss the tail-class assumption. I would not cite it in my own work until the VLM accuracy issue is shown to be handled. It deserves peer review because the problem is common and the method is simple enough to test, even if revisions will be needed to shore up the evidence on rare classes.

Referee Report

2 major / 1 minor

Summary. The paper proposes Weak Teacher Supervision (WTS) that leverages cross-modal alignment in pre-trained visual-language models to generate corrective supervisory signals from label text for long-tailed visual recognition under high label noise. WTS is gated by discrepancy between VLM text predictions and observed (noisy) labels, with the claim that this signal is unaffected by label noise and distribution bias; extensive experiments on synthetic and real-world datasets are asserted to show superior performance, especially in high-noise regimes. Source code is provided.

Significance. If the empirical results hold and the VLM-based correction proves reliable on tail classes, the approach could provide a lightweight, parameter-light way to mitigate label-image mismatch in noisy long-tail settings without requiring explicit noise modeling or clean validation data. The availability of source code is a positive for reproducibility.

major comments (2)

[Abstract] The central empirical claim ('extensive experiments demonstrate the superior performance of WTS across synthetic and real-world datasets, particularly under high-noise conditions') is stated without any metrics, baselines, ablation tables, or per-class-frequency breakdowns. This prevents verification of the claimed robustness and makes the soundness of the contribution impossible to assess from the provided text.
[Abstract / Method] The discrepancy-based activation of WTS assumes that VLM text predictions retain sufficient accuracy on tail classes to serve as a reliable corrective gate. No analysis, ablation, or per-frequency accuracy breakdown is referenced to support this; because tail classes are underrepresented in VLM pre-training corpora, any drop in text-prediction quality would make the gate unreliable precisely where label noise is highest, directly threatening the noise-robustness claim.
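The per-frequency breakdown requested in the second comment could be computed along the following lines. This is a sketch under stated assumptions: it needs an evaluation split with clean labels, the head/medium/tail cut-offs of more than 100 and fewer than 20 training images are a common convention rather than the paper's stated choice, and the prediction list is a made-up placeholder for real text predictions.

```python
from collections import Counter

def frequency_group(count, many=100, few=20):
    """Assign a class to head/medium/tail from its training count (cut-offs are assumptions)."""
    if count > many:
        return "head"
    if count < few:
        return "tail"
    return "medium"

def accuracy_by_group(train_labels, eval_true, eval_pred):
    """Accuracy on evaluation samples, split by the training frequency of the true class."""
    train_counts = Counter(train_labels)
    hits, totals = Counter(), Counter()
    for y, p in zip(eval_true, eval_pred):
        group = frequency_group(train_counts[y])
        totals[group] += 1
        hits[group] += int(y == p)
    return {group: hits[group] / totals[group] for group in totals}

# Toy data: class 0 is a head class, class 1 medium, class 2 tail.
train_labels = [0] * 500 + [1] * 50 + [2] * 5
eval_true = [0, 0, 0, 0, 1, 1, 2, 2]
eval_pred = [0, 0, 0, 1, 1, 1, 2, 0]  # hypothetical text predictions
print(accuracy_by_group(train_labels, eval_true, eval_pred))
```

A marked drop in the tail entry relative to the head entry would support the referee's concern; comparable values would weaken it.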

minor comments (1)

[Abstract] The acronym WTS is introduced and defined in the abstract, but the sentence structure ('This supervisory signal, referred to as Weak Teacher Supervision (WTS)') could be clarified for readers unfamiliar with the term.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the assumptions underlying the discrepancy-based gating in WTS. The comments highlight important aspects of clarity and empirical support. We address each point below and have made revisions to strengthen the manuscript.

Referee: [Abstract] The central empirical claim ('extensive experiments demonstrate the superior performance of WTS across synthetic and real-world datasets, particularly under high-noise conditions') is stated without any metrics, baselines, ablation tables, or per-class-frequency breakdowns. This prevents verification of the claimed robustness and makes the soundness of the contribution impossible to assess from the provided text.

Authors: We agree that the abstract would benefit from explicit quantitative indicators to allow immediate assessment of the claims. In the revised manuscript, we have updated the abstract to include specific metrics (e.g., top-1 accuracy gains of X% on CIFAR-100-LT at 40% noise and Y% on iNaturalist relative to the strongest baseline), references to the main results table, and mention of the ablation studies. This change directly addresses the concern while preserving the abstract's brevity.

revision: yes

Referee: [Abstract / Method] The discrepancy-based activation of WTS assumes that VLM text predictions retain sufficient accuracy on tail classes to serve as a reliable corrective gate. No analysis, ablation, or per-frequency accuracy breakdown is referenced to support this; because tail classes are underrepresented in VLM pre-training corpora, any drop in text-prediction quality would make the gate unreliable precisely where label noise is highest, directly threatening the noise-robustness claim.

Authors: This concern is well-taken and points to a potential vulnerability in high-noise tail regimes. The original manuscript notes that WTS 'exhibits limited accuracy' and relies on discrepancy for activation, but does not provide explicit per-frequency VLM accuracy breakdowns. To address this, we have added a new analysis subsection (Section 4.3) with per-class-frequency VLM text-prediction accuracy on both synthetic and real-world datasets, plus an ablation that measures WTS contribution when the gate is restricted to tail classes only. The added results indicate that discrepancy remains informative even as absolute VLM accuracy declines on tails, because noisy labels increase mismatch with text predictions; we have also clarified the manuscript text to avoid overstatement of the gate's reliability.

revision: yes
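An editorial back-of-envelope model, not taken from the manuscript, illustrates why a disagreement gate can stay informative when text accuracy is low. Assume C classes, symmetric noise at rate η that always flips a label to a different class, a text prediction that is correct with probability a independently of the noise, and text errors spread uniformly over the C − 1 wrong classes. All four are simplifying assumptions.

```latex
\begin{align*}
P(\text{gate fires} \mid \text{label clean}) &= 1 - a, \\
P(\text{gate fires} \mid \text{label noisy}) &= 1 - \frac{1 - a}{C - 1}, \\
P(\text{label noisy} \mid \text{gate fires}) &=
  \frac{\eta \left(1 - \frac{1 - a}{C - 1}\right)}
       {\eta \left(1 - \frac{1 - a}{C - 1}\right) + (1 - \eta)(1 - a)}.
\end{align*}
```

Under these assumptions the gate fires more often on noisy labels than on clean ones whenever a > 1/C, that is, whenever the text prediction beats chance. With C = 100, η = 0.6 and a = 0.3, the share of fired samples that are truly mislabeled is about 0.68, against a prior of 0.6. Asymmetric noise, or text errors that concentrate on the same visually similar classes as the label noise, would weaken this margin.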

Circularity Check

0 steps flagged

No circularity; derivation relies on external pre-trained VLMs

The paper introduces Weak Teacher Supervision (WTS) by leveraging cross-modal alignment from pre-trained visual-language models as an independent corrective signal for noisy long-tailed labels. No equations, derivations, or fitted parameters are described that reduce to the method's own inputs by construction. The discrepancy-based activation rule is a design choice using external model outputs, not a self-referential fit. The approach treats VLM predictions as external benchmarks rather than deriving them from the target dataset, so the argument rests on independently pre-trained models rather than on its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that pre-trained vision-language models retain useful cross-modal alignment even under label noise, and that a simple discrepancy check can reliably decide when to trust the text signal.

free parameters (1)

discrepancy threshold for WTS activation
Controls when text predictions override observed labels; value not specified in abstract but required for the method.

axioms (1)

domain assumption: Pre-trained vision-language models possess intrinsic cross-modal alignment that is robust to label noise in the image domain.
Invoked to justify using text predictions as a corrective signal unaffected by label noise.

invented entities (1)

Weak Teacher Supervision (WTS) · no independent evidence
purpose: Selective supervisory signal derived from text predictions to correct label-image mismatches.
Newly introduced supervisory mechanism whose effectiveness is the main empirical claim.

[1] [1]

Albert, D

P. Albert, D. Ortego, E. Arazo, N. E. O’Connor, and K. McGuinness. Addressing out-of-distribution label noise in webly-labelled data. InWACV, pages 392–401. IEEE,

work page

[2] [2]

J. Cai, Y . Wang, and J.-N. Hwang. Ace: Ally complementary experts for solving long-tailed recognition in one-shot. In ICCV, pages 112–121, 2021. 2

work page 2021

[3] [3]

K. Cao, Y . Chen, J. Lu, N. Ar´echiga, A. Gaidon, and T. Ma. Heteroskedastic and imbalanced deep learning with adaptive regularization. InICLR, 2021. 2, 3

work page 2021

[4] [4]

K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma. Learn- ing imbalanced datasets with label-distribution-aware mar- gin loss. InNeurIPS, pages 1567–1578, 2019. 3, 4, 7, 8

work page 2019

[5] [5]

S. Chen, C. Ge, Z. Tong, J. Wang, Y . Song, J. Wang, and P. Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. InNeurIPS, volume 35, pages 16664–16678, 2022. 2, 4, 7

work page 2022

[6] [6]

Cheng, Y

D. Cheng, Y . Ning, N. Wang, X. Gao, H. Yang, Y . Du, B. Han, and T. Liu. Class-dependent label-noise learning with cycle-consistency regularization. InNeurIPS, 2022. 3

work page 2022

[7] [7]

E. D. Cubuk, B. Zoph, D. Man ´e, V . Vasudevan, and Q. V . Le. Autoaugment: Learning augmentation strategies from data. InCVPR, pages 113–123, 2019. 2

work page 2019

[8] [8]

E. D. Cubuk, B. Zoph, J. Shlens, and Q. V . Le. Randaugment: Practical automated data augmentation with a reduced search space. InCVPRW, pages 3008–3017, 2020. 2

work page 2020

[9] [9]

J. Cui, S. Liu, Z. Tian, Z. Zhong, and J. Jia. Reslt: Residual learning for long-tailed recognition.IEEE TPAMI, 45(3):3695–3706, 2023. 3

work page 2023

[10] [10]

Y . Cui, M. Jia, T.-Y . Lin, Y . Song, and S. Belongie. Class- balanced loss based on effective number of samples. In CVPR, pages 9268–9277, 2019. 2

work page 2019

[11] [11]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2021. 1

work page 2021

[12] [12]

B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. InNeurIPS, volume 31, 2018. 3, 7, 8

work page 2018

[13] [13]

Hendrycks, M

D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel. Us- ing trusted data to train deep networks on labels corrupted by severe noise. InNeurIPS, volume 31, 2018. 3

work page 2018

[14] [14]

Y . Hong, S. Han, K. Choi, S. Seo, B. Kim, and B. Chang. Disentangling label distribution for long-tailed visual recog- nition. InCVPR, pages 6626–6636, June 2021. 4

work page 2021

[15] [15]

Huang, B

Y . Huang, B. Bai, S. Zhao, K. Bai, and F. Wang. Uncertainty- aware learning against label noise on imbalanced datasets. In AAAI, volume 36, pages 6960–6969, 2022. 7, 8

work page 2022

[16] [16]

X. Ji, Z. Zhu, W. Xi, O. Gadyatskaya, Z. Song, Y . Cai, and Y . Liu. Fedfixer: Mitigating heterogeneous label noise in federated learning. InAAAI, pages 12830–12838, 2024. 3

work page 2024

[17] [17]

Jiang, D

L. Jiang, D. Huang, M. Liu, and W. Yang. Beyond synthetic noise: Deep learning on controlled noisy labels. InICML, volume 119, pages 4804–4815, 2020. 7

work page 2020

[18] [18]

Jiang, J

S. Jiang, J. Li, Y . Wang, B. Huang, Z. Zhang, and T. Xu. Delving into sample loss curve to embrace noisy and imbal- anced data.AAAI, 36:7024–7032, 2022. 2, 3

[19] B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis. Decoupling representation and classifier for long-tailed recognition. In ICLR, 2020. 2, 7, 8

[20] N. Karim, M. Rizve, N. Rahnavard, A. Mian, and M. Shah. Unicon: Combating label noise through uniform selection and contrastive learning. In CVPR, pages 9666–9676, 2022. 3, 7, 8

[21] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128–3137, 2015. 2, 3

[22] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Technical Report, 2009. 7

[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, volume 25, 2012. 9

[24] B. Li, Z. Han, H. Li, H. Fu, and C. Zhang. Trustworthy long-tailed classification. In CVPR, pages 6970–6979, 2022. 2

[25] H.-T. Li, T. Wei, H. Yang, K. Hu, C. Peng, L.-B. Sun, X.-L. Cai, and M.-L. Zhang. Stochastic feature averaging for learning with long-tailed noisy labels. In IJCAI, pages 3902–3910, 2023. 2

[26] J. Li, R. Socher, and S. C. Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. In ICLR, 2020. 3, 6, 7, 8

[27] J. Li, Z. Tan, J. Wan, Z. Lei, and G. Guo. Nested collaborative learning for long-tailed visual recognition. In CVPR, pages 6949–6958, 2022. 2

[28] J. Li, C. Xiong, and S. C. H. Hoi. Mopro: Webly supervised learning with momentum prototypes. In ICLR, 2021. 8

[29] M. Li. Advances in Long-Tailed Visual Recognition. PhD thesis, Hong Kong Baptist University, 2022. 1, 2

[30] M. Li, Y.-m. Cheung, and Z. Hu. Key point sensitive loss for long-tailed visual recognition. IEEE TPAMI, 45(4):4812–4825, 2023. 3

[31] M. Li, Y.-m. Cheung, and Y. Lu. Long-tailed visual recognition via gaussian clouded logit adjustment. In CVPR, pages 6929–6938, June 2022. 3

[32] S. Li, X. Xia, S. Ge, and T. Liu. Selective-supervised contrastive learning with noisy labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 316–325, 2022. 7, 8

[33] W. Li, L. Wang, W. Li, E. Agustsson, and L. Van Gool. Webvision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862, 2017. 7

[34] Z. Li, H. Zhao, Z. Li, T. Liu, D. Guo, and X. Wan. Extracting clean and balanced subset for noisy long-tailed classification,

[35] Y. Lin, Y. Yao, and T. Liu. Learning the latent causal structure for modeling label noise. In NeurIPS, volume 37, pages 120549–120577, 2024. 3

[36] S. Liu, J. Niles-Weed, N. Razavian, and C. Fernandez-Granda. Early-learning regularization prevents memorization of noisy labels. In NeurIPS, volume 33, pages 20331–20342, 2020. 8

[37] T. Liu and D. Tao. Classification with noisy labels by importance reweighting. IEEE TPAMI, 38(3):447–461, 2015. 3

[38] Y. Liu, B. Cao, and J. Fan. Improving the accuracy of learning example weights for imbalance classification. In ICLR,

[39] J. Lu, Z. Zhou, T. Leung, L.-J. Li, and F.-F. Li. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pages 2304–2313,

[40] Y. Lu, Y. Zhang, B. Han, Y.-m. Cheung, and H. Wang. Label-noise learning with intrinsically long-tailed data. In ICCV, pages 1369–1378, 2023. 1, 2, 3, 7, 8

[41] A. K. Menon, S. Jayasumana, A. S. Rawat, H. Jain, A. Veit, and S. Kumar. Long-tail learning via logit adjustment. In ICLR, 2021. 2, 3, 4, 7, 8

[42] M. Pang, B. Wang, M. Ye, Y.-M. Cheung, Y. Zhou, W. Huang, and B. Wen. Heterogeneous prototype learning from contaminated faces across domains via disentangling latent factors. IEEE TNNLS, 2024. 1

[43] S. Park, J. Lim, Y. Jeon, and J. Y. Choi. Influence-balanced loss for imbalanced visual classification. In ICCV, pages 735–744, 2021. 7, 8

[44] G. Patrini, A. Rozza, A. K. Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, pages 2233–2241, 2017. 3

[45] G. Pleiss, T. Zhang, E. Elenberg, and K. Q. Weinberger. Identifying mislabeled data using the area under the margin ranking. In NeurIPS, volume 33, pages 17044–17056, 2020. 3

[46] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In ICLR, pages 8748–8763, 2021. 1, 2, 3, 7

[47] J. Ren, C. Yu, S. Sheng, X. Ma, H. Zhao, S. Yi, and H. Li. Balanced meta-softmax for long-tailed visual recognition. In NeurIPS, volume 33, pages 4175–4186, 2020. 3

[48] M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to reweight examples for robust deep learning. In ICML, volume 80, pages 4331–4340, 2018. 2, 3

[49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115:211–252, 2015. 1

[50] M. Sheng, Z. Sun, Z. Cai, T. Chen, Y. Zhou, and Y. Yao. Adaptive integration of partial label learning and negative learning for enhanced noisy label learning. In AAAI, pages 4820–4828, 2024. 3

[51] J.-X. Shi, T. Wei, Z. Zhou, J.-J. Shao, X.-Y. Han, and Y.-F. Li. Long-tail learning with foundation model: Heavy fine-tuning hurts. In ICML, 2024. 2

[52] J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng. Meta-weight-net: Learning an explicit mapping for sample weighting. In NeurIPS, pages 1917–1928, 2019. 2, 3, 7, 8

[53] H. Song, M. Kim, and J.-G. Lee. Selfie: Refurbishing unclean samples for robust deep learning. In ICML, pages 5907–5915, 2019. 1

[54] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, pages 843–852, 2017. 1

[55] X. Wang, L. Lian, Z. Miao, Z. Liu, and S. X. Yu. Long-tailed recognition by routing diverse distribution-aware experts. In ICLR, 2021. 2

[56] T. Wei, J.-X. Shi, Y.-F. Li, and M.-L. Zhang. Prototypical classifier for robust class-imbalanced learning. In PAKDD, pages 44–57, 2022. 2, 3

[57] T. Wei, J.-X. Shi, W.-W. Tu, and Y.-F. Li. Robust long-tailed learning under label noise. ArXiv, 2021. 2, 7, 8

[58] Z.-F. Wu, T. Wei, J. Jiang, C. Mao, M. Tang, and Y. Li. Ngc: A unified framework for learning with open-world noisy data. In ICCV, pages 62–71, 2021. 8

[59] X. Xia, T. Liu, B. Han, C. Gong, N. Wang, Z. Ge, and Y. Chang. Robust early-learning: Hindering the memorization of noisy labels. In ICLR, 2020. 7, 8

[60] X. Xia, T. Liu, B. Han, M. Gong, J. Yu, G. Niu, and M. Sugiyama. Sample selection with uncertainty of losses for learning with noisy labels. In ICLR, 2022. 2, 3

[61] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In CVPR, pages 2691–2699, 2015. 1

[62] Y. Yao, T. Liu, B. Han, M. Gong, J. Deng, G. Niu, and M. Sugiyama. Dual t: reducing estimation error for transition matrix in label-noise learning. In NeurIPS, pages 7260–7271, 2020. 3

[63] Y. Yao, Z. Sun, C. Zhang, F. Shen, Q. Wu, J. Zhang, and Z. Tang. Jo-SRC: A contrastive approach for combating noisy labels. In CVPR, pages 5188–5197, 2021. 3

[64] X. Yi, K. Tang, X.-S. Hua, J.-H. Lim, and H. Zhang. Identifying hard noise in long-tailed sample distribution. In ECCV, pages 739–756, 2022. 2, 3, 7, 8

[65] M. Zhang, X. Zhao, J. Yao, C. Yuan, and W. Huang. When noisy labels meet long tail dilemmas: A representation calibration method. In ICCV, pages 15844–15854, 2023. 1, 2, 3, 7, 8

[66] S. Zhang, Z. Li, S. Yan, X. He, and J. Sun. Distribution alignment: A unified framework for long-tail visual recognition. In CVPR, pages 2361–2370, 2021. 2

[67] Y. Zhang, B. Kang, B. Hooi, S. Yan, and J. Feng. Deep long-tailed learning: A survey. IEEE TPAMI, 45(9):10795–10816,

[68] Z. Zhong, J. Cui, S. Liu, and J. Jia. Improving calibration for long-tailed recognition. In CVPR, pages 16489–16498,

[69] B. Zhou, Q. Cui, X.-S. Wei, and Z.-M. Chen. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In CVPR, pages 9719–9728, 2020. 2, 3

[70] X. Zhou, X. Liu, D. Zhai, J. Jiang, X. Gao, and X. Ji. Prototype-anchored learning for learning with imperfect annotations. In ICML, volume 162, pages 27245–27267, 2022. 2, 3
