PCSR: Pseudo-label Consistency-Guided Sample Refinement for Noisy Correspondence Learning
Pith reviewed 2026-05-18 16:30 UTC · model grok-4.3
The pith
PCSR refines noisy image-text pairs by scoring pseudo-label consistency to separate ambiguous from refinable samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the PCSR framework, which enhances correspondence reliability by explicitly dividing samples based on pseudo-label consistency. Clean and noisy pairs are distinguished via confidence-based estimation. Noisy pairs are further refined through pseudo-label consistency to uncover structurally distinct subsets. The Pseudo-label Consistency Score quantifies prediction stability to isolate ambiguous samples from refinable ones. Adaptive Pair Optimization then applies robust loss functions to the former and text replacement to the latter.
What carries the argument
The Pseudo-label Consistency Score (PCS), which measures prediction stability to partition noisy pairs so that each subset receives a matching optimization strategy inside Adaptive Pair Optimization.
If this is right
- Retrieval models trained with PCSR achieve higher performance on noisy versions of CC152K, MS-COCO, and Flickr30K.
- Noisy samples receive category-specific treatment instead of uniform handling, increasing overall data utilization.
- Misaligned pairs exert less damage on similarity learning because refinable cases are actively corrected.
- Cross-modal retrieval systems become more robust when supervision contains realistic noise levels.
Where Pith is reading between the lines
- The same consistency-based split could be tested on video-text or audio-text retrieval tasks that also suffer from alignment noise.
- Text replacement for refinable pairs might be replaced or augmented by other generative corrections without changing the core separation logic.
- Integrating the PCS with self-supervised pretraining could further stabilize the ambiguous-versus-refinable boundary on very large collections.
Load-bearing premise
That a confidence-based estimate followed by the Pseudo-label Consistency Score can reliably identify which noisy pairs will yield useful training signals when their text is replaced.
What would settle it
Train the same base model on a controlled noisy version of MS-COCO or Flickr30K once with standard clean-noisy splitting and once with the full PCSR pipeline; if recall@K and precision@K show no consistent gain for PCSR, the benefit of the finer division collapses.
Figures
read the original abstract
Cross-modal retrieval aims to align different modalities via semantic similarity. However, existing methods often assume that image-text pairs are perfectly aligned, overlooking Noisy Correspondences in real data. These misaligned pairs misguide similarity learning and degrade retrieval performance. Previous methods often rely on coarse-grained categorizations that simply divide data into clean and noisy samples, overlooking the intrinsic diversity within noisy instances. Moreover, they typically apply uniform training strategies regardless of sample characteristics, resulting in suboptimal sample utilization for model optimization. To address the above challenges, we introduce a novel framework, called Pseudo-label Consistency-Guided Sample Refinement (PCSR), which enhances correspondence reliability by explicitly dividing samples based on pseudo-label consistency. Specifically, we first employ a confidence-based estimation to distinguish clean and noisy pairs, then refine the noisy pairs via pseudo-label consistency to uncover structurally distinct subsets. We further proposed a Pseudo-label Consistency Score (PCS) to quantify prediction stability, enabling the separation of ambiguous and refinable samples within noisy pairs. Accordingly, we adopt Adaptive Pair Optimization (APO), where ambiguous samples are optimized with robust loss functions and refinable ones are enhanced via text replacement during training. Extensive experiments on CC152K, MS-COCO and Flickr30K validate the effectiveness of our method in improving retrieval robustness under noisy supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PCSR, a framework for handling noisy correspondences in cross-modal retrieval. It employs confidence-based estimation to separate clean from noisy image-text pairs, introduces a Pseudo-label Consistency Score (PCS) to further partition noisy pairs into ambiguous and refinable subsets, applies robust losses to ambiguous samples, and performs text replacement on refinable samples via Adaptive Pair Optimization (APO). Effectiveness is claimed on CC152K, MS-COCO, and Flickr30K.
Significance. If the results hold, the work offers a finer-grained alternative to binary clean/noisy splits by exploiting diversity within noisy instances, which could improve sample utilization and robustness in noisy supervision for retrieval tasks. The PCS as a stability metric is a potentially useful technical contribution.
major comments (3)
- [Abstract, paragraph describing the refinement step and APO] Abstract and APO description: the text replacement mechanism for refinable samples is load-bearing for the novelty claim of handling 'intrinsic diversity within noisy instances,' yet the source of replacement texts, selection criterion, and any guarantee of improved semantic alignment (versus the original noisy pair) are not defined. If replacements derive from the same model's pseudo-labels, error reinforcement rather than correction is possible.
- [Method, PCS definition] Method section on PCS: using pseudo-labels generated by the model under training to compute consistency scores for deciding how to refine the same data creates a circular dependency. The manuscript must demonstrate that PCS provides an independent signal rather than amplifying early errors.
- [Experiments] Experiments: the abstract asserts validation on three datasets but supplies no quantitative results, ablation on the two free parameters (confidence threshold and PCS threshold), or direct comparison against uniform strategies. Without these, it is impossible to confirm that the proposed division yields gains that support the central claim.
minor comments (2)
- [Notation and abstract] Ensure all acronyms (PCS, APO) are expanded on first use and used consistently.
- [Figures] Pipeline figures should explicitly label the clean/noisy split, PCS computation, and the two branches of APO.
Simulated Author's Rebuttal
We sincerely thank the referee for the constructive and detailed feedback. We have carefully reviewed each major comment and provide point-by-point responses below, indicating where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract, paragraph describing the refinement step and APO] Abstract and APO description: the text replacement mechanism for refinable samples is load-bearing for the novelty claim of handling 'intrinsic diversity within noisy instances,' yet the source of replacement texts, selection criterion, and any guarantee of improved semantic alignment (versus the original noisy pair) are not defined. If replacements derive from the same model's pseudo-labels, error reinforcement rather than correction is possible.
Authors: We agree that the current description of the text replacement mechanism within APO is insufficiently detailed. In the revised manuscript we will expand the APO subsection to explicitly specify the source of replacement texts, the selection criterion used, and empirical evidence (including alignment metrics before and after replacement) that the operation improves semantic correspondence. We will also add a short analysis addressing the risk of error reinforcement, for example by restricting replacements to samples whose PCS exceeds a conservative threshold and by validating against a small set of manually verified pairs. revision: yes
-
Referee: [Method, PCS definition] Method section on PCS: using pseudo-labels generated by the model under training to compute consistency scores for deciding how to refine the same data creates a circular dependency. The manuscript must demonstrate that PCS provides an independent signal rather than amplifying early errors.
Authors: This concern about circularity is well-taken. We will revise the PCS definition paragraph to clarify that consistency is measured across temporally separated model checkpoints (e.g., every k epochs) rather than solely on the instantaneous prediction, thereby providing a stability signal that is partially decoupled from any single erroneous state. In addition, we will include a new paragraph with both a brief theoretical argument and supporting experiments that track how PCS correlates with ground-truth alignment on a held-out clean subset, demonstrating that early errors do not dominate the score. revision: yes
-
Referee: [Experiments] Experiments: the abstract asserts validation on three datasets but supplies no quantitative results, ablation on the two free parameters (confidence threshold and PCS threshold), or direct comparison against uniform strategies. Without these, it is impossible to confirm that the proposed division yields gains that support the central claim.
Authors: We thank the referee for highlighting the need for more explicit evidence. The full paper already reports quantitative results on CC152K, MS-COCO, and Flickr30K in Section 4 (Tables 1–3). We will update the abstract to include the main performance deltas. We will also add a dedicated ablation subsection that varies the confidence threshold and the PCS threshold, and we will insert direct comparisons against uniform robust-loss and random-replacement baselines to quantify the benefit of the fine-grained division. revision: yes
Circularity Check
No significant circularity; derivation proposes independent framework components validated on external benchmarks.
full rationale
The paper's chain begins with a confidence-based split of clean vs. noisy pairs, followed by introduction of a new PCS metric to further partition noisy pairs into ambiguous and refinable subsets, then applies APO (robust loss on one subset, text replacement on the other). None of these steps reduces to its own inputs by construction: PCS is defined as a stability quantifier rather than a tautological re-use of the model's output, and text replacement is presented as an enhancement operation without any equation showing it equals the original noisy pair or a fitted parameter. No self-citation is invoked as a uniqueness theorem or load-bearing premise, and the method is tested on CC152K, MS-COCO and Flickr30K, which are independent of the internal pseudo-label generation. This satisfies the self-contained criterion; the approach is a standard iterative refinement proposal rather than a definitional loop.
Axiom & Free-Parameter Ledger
free parameters (2)
- confidence threshold for clean/noisy split
- PCS threshold separating ambiguous from refinable
axioms (1)
- domain assumption Pseudo-label consistency reliably indicates whether a noisy pair can be refined by text replacement
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We further propose a Pseudo-label Consistency Score (PCS) to quantify prediction stability, enabling the separation of ambiguous and refinable samples within noisy pairs. Accordingly, we adopt Adaptive Pair Optimization (APO), where ambiguous samples are optimized with robust loss functions and refinable ones are enhanced via text replacement during training.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanembed_injective unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we first employ a confidence-based estimation to distinguish clean and noisy pairs, then refine the noisy pairs via pseudo-label consistency
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 6077--6086
work page 2018
-
[2]
Bai, Y.; Yang, E.; Han, B.; Yang, Y.; Li, J.; Mao, Y.; Niu, G.; and Liu, T. 2021. Understanding and improving early stopping for learning with noisy labels. Advances in Neural Information Processing Systems, 34: 24392--24403
work page 2021
-
[3]
Chen, J.; Dun, C.; and Kyrillidis, A. 2024. Fast fixmatch: Faster semi-supervised learning with curriculum batch size. In 2024 IEEE International Symposium on Information Theory, 1836--1841. IEEE
work page 2024
-
[4]
Chen, J.; Hu, H.; Wu, H.; Jiang, Y.; and Wang, C. 2021. Learning the best pooling strategy for visual semantic embedding. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 15789--15798
work page 2021
-
[5]
Chen, M.; and Wang, C. 2024. Multi-head co-training: An uncertainty-aware and robust semi-supervised learning framework. Knowledge-Based Systems, 302: 112325
work page 2024
-
[6]
D.; Wang, X.; Vineet, V.; Joshi, N.; Torralba, A.; Jegelka, S.; and Song, Y
Chuang, C.-Y.; Hjelm, R. D.; Wang, X.; Vineet, V.; Joshi, N.; Torralba, A.; Jegelka, S.; and Song, Y. 2022. Robust contrastive learning against noisy views. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 16670--16681
work page 2022
-
[7]
Diao, H.; Zhang, Y.; Ma, L.; and Lu, H. 2021. Similarity reasoning and filtration for image-text matching. In Proceedings of the AAAI conference on Artificial Intelligence, volume 35, 1218--1226
work page 2021
-
[8]
Duan, Y.; Gu, Z.; Ying, Z.; Qi, L.; Meng, C.; and Shi, Y. 2024. Pc2: Pseudo-classification based pseudo-captioning for noisy correspondence learning in cross-modal retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, 9397--9406
work page 2024
-
[9]
Faghri, F.; Fleet, D. J.; Kiros, J. R.; and Fidler, S. 2017. Vse++: Improving visual-semantic embeddings with hard negatives. In British Machine Vision Conference
work page 2017
-
[10]
Feng, Y.; Zhu, H.; Peng, D.; Peng, X.; and Hu, P. 2023. RONO: robust discriminative learning with noisy labels for 2D-3D cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11610--11619
work page 2023
-
[11]
Fu, Z.; Zhang, L.; Xia, H.; and Mao, Z. 2024. Linguistic-aware patch slimming framework for fine-grained cross-modal alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 26307--26316
work page 2024
-
[12]
Han, H.; Miao, K.; Zheng, Q.; and Luo, M. 2023. Noisy correspondence learning with meta similarity correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7517--7526
work page 2023
-
[13]
Han, H.; Zheng, Q.; Dai, G.; Luo, M.; and Wang, J. 2024. Learning to rematch mismatched pairs for robust cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 26679--26688
work page 2024
-
[14]
Heidari, M.; Zhang, H.; and Guo, Y. 2024. Reinforcement learning guided semi-supervised learning. Advances in Neural Information Processing Systems, 37: 136990--137009
work page 2024
-
[15]
Hu, P.; Huang, Z.; Peng, D.; Wang, X.; and Peng, X. 2023. Cross-modal retrieval with partially mismatched pairs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8): 9595--9610
work page 2023
-
[16]
Huang, Z.; Niu, G.; Liu, X.; Ding, W.; Xiao, X.; Wu, H.; and Peng, X. 2021. Learning with noisy correspondence for cross-modal matching. Advances in Neural Information Processing Systems, 34: 29406--29419
work page 2021
-
[17]
Iscen, A.; Valmadre, J.; Arnab, A.; and Schmid, C. 2022. Learning with neighbor consistency for noisy labels. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 4672--4681
work page 2022
-
[18]
Jiang, L.; Huang, D.; Liu, M.; and Yang, W. 2020. Beyond synthetic noise: Deep learning on controlled noisy labels. In International conference on Machine Learning, 4804--4815. PMLR
work page 2020
-
[19]
Lee, D.-H.; et al. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In International Conference on Machine Learning, volume 3, 896. Atlanta
work page 2013
-
[20]
Lee, K.-H.; Chen, X.; Hua, G.; Hu, H.; and He, X. 2018. Stacked cross attention for image-text matching. In Proceedings of the European conference on Computer Vision, 201--216
work page 2018
-
[21]
Li, K.; Zhang, Y.; Li, K.; Li, Y.; and Fu, Y. 2019. Visual semantic reasoning for image-text matching. In Proceedings of the IEEE/CVF international conference on Computer Vision, 4654--4662
work page 2019
-
[22]
Li, Y.; Huang, H.; Xu, J.; and Huang, S.-L. 2024. NAC: Mitigating Noisy Correspondence in Cross-Modal Matching Via Neighbor Auxiliary Corrector. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing, 6815--6819. IEEE
work page 2024
-
[23]
Liu, C.; Mao, Z.; Zhang, T.; Xie, H.; Wang, B.; and Zhang, Y. 2020. Graph structured network for image-text matching. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 10921--10930
work page 2020
-
[24]
Liu, Y.; Liu, M.; Huang, S.; and Lv, J. 2025. Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 5676--5684
work page 2025
-
[25]
Ma, X.; Yang, M.; Li, Y.; Hu, P.; Lv, J.; and Peng, X. 2024. Cross-modal retrieval with noisy correspondence via consistency refining and mining. IEEE transactions on Image Processing, 33: 2587--2598
work page 2024
-
[26]
Miyato, T.; Maeda, S.-I.; Koyama, M.; and Ishii, S. 2019. Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8): 1979--1993
work page 2019
-
[27]
Pan, Z.; Wu, F.; and Zhang, B. 2023. Fine-grained image-text matching by cross-modal hard aligning network. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 19275--19284
work page 2023
-
[28]
Qin, Y.; Peng, D.; Peng, X.; Wang, X.; and Hu, P. 2022. Deep evidential learning with noisy correspondence for cross-modal retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, 4948--4956
work page 2022
-
[29]
Wang, Q.; Han, B.; Liu, T.; Niu, G.; Yang, J.; and Gong, C. 2021. Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model. Proceedings of the AAAI Conference on Artificial Intelligence, 35(11): 10183--10191
work page 2021
-
[30]
Wang, S.; Wang, R.; Yao, Z.; Shan, S.; and Chen, X. 2020. Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE/CVF winter conference on Applications of Computer Vision, 1508--1517
work page 2020
-
[31]
Wen, T.; Lai, S.; and Qian, X. 2021. Preparing lessons: Improve knowledge distillation with better supervision. Neurocomputing, 454: 25--33
work page 2021
-
[32]
Yan, J.; Luo, L.; Deng, C.; and Huang, H. 2023. Adaptive hierarchical similarity metric learning with noisy labels. IEEE Transactions on Image Processing, 32: 1245--1256
work page 2023
-
[33]
Yang, S.; Li, Q.; Li, W.; Li, X.; and Liu, A.-A. 2022. Dual-level representation enhancement on characteristic and context for image-text retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 32(11): 8037--8050
work page 2022
-
[34]
Yang, S.; Xu, Z.; Wang, K.; You, Y.; Yao, H.; Liu, T.; and Xu, M. 2023. Bicro: Noisy correspondence rectification for multi-modality data via bi-directional cross-modal similarity consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19883--19892
work page 2023
-
[35]
Zhai, X.; Mustafa, B.; Kolesnikov, A.; and Beyer, L. 2023. Sigmoid Loss for Language Image Pre-Training. In 2023 IEEE/CVF International Conference on Computer Vision, 11941--11952
work page 2023
-
[36]
Zhang, H.; Mao, Z.; Zhang, K.; and Zhang, Y. 2022 a . Show your faith: Cross-modal confidence-aware network for image-text matching. In Proceedings of the AAAI conference on Artificial Intelligence, volume 36, 3262--3270
work page 2022
-
[37]
Zhang, K.; Mao, Z.; Wang, Q.; and Zhang, Y. 2022 b . Negative-aware attention framework for image-text matching. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 15661--15670
work page 2022
-
[38]
Zhang, Q.; Lei, Z.; Zhang, Z.; and Li, S. Z. 2020. Context-aware attention network for image-text retrieval. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 3536--3545
work page 2020
-
[39]
Zhang, S.; Li, Y.; Tian, J.; Man, Z.; Chung, C. Y.; and Shen, W. 2024. Improving Battery Life Prediction with Unlabeled Data: Confidence-Weighted Semi-Supervised Learning with Label Propagation. IEEE Transactions on Transportation Electrification
work page 2024
-
[40]
Zhang, Z.; and Sabuncu, M. 2018. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in Neural Information Processing Systems, 31
work page 2018
-
[41]
Zhao, X.; Li, D.; Zhong, Y.; Hu, B.; Chen, Y.; Hu, B.; and Zhang, M. 2024. SEER : Self-Aligned Evidence Extraction for Retrieval-Augmented Generation. In Al-Onaizan, Y.; Bansal, M.; and Chen, Y.-N., eds., Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 3027--3041. Association for Computational Linguistics
work page 2024
-
[42]
, " * write output.state after.block = add.period write newline
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...
-
[43]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.