Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings
Pith reviewed 2026-05-20 06:15 UTC · model grok-4.3
The pith
A hierarchical concept ontology built with large language models allows filtering of noisy negative pairs to improve zero-shot chest X-ray classification and grounding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that constructing a hierarchical concept ontology with large language models and applying it through cross-patient pair relabeling plus a Concept-Aware NCE loss suppresses noisy negatives that standard contrastive training introduces, thereby reducing semantic ambiguity and raising accuracy in zero-shot classification and grounding of chest X-ray findings over prior methods that rely on raw or templatized reports.
What carries the argument
The hierarchical concept ontology, which organizes 41 clinical concepts by presence, attributes (location and characteristics), and texts (evidential segments and presence statements). It drives the three-step cross-patient pair relabeling: categorize pairs by finding presence, filter false negatives, and mine hard negatives.
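To make the ontology fields concrete, here is a minimal sketch of what one per-report concept record might look like. The class name `ConceptEntry`, its field names, and the example values are all hypothetical; the abstract names only the kinds of information stored (presence, location, characteristics, evidential segment, presence statement), not a schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConceptEntry:
    """One concept's record for one report (hypothetical schema, not the paper's)."""
    concept: str                                               # e.g. "pleural effusion"
    present: bool                                              # presence of the finding
    locations: List[str] = field(default_factory=list)         # attribute: location
    characteristics: List[str] = field(default_factory=list)   # attribute: characteristics
    evidential_segment: str = ""                               # report text supporting the label
    presence_statement: str = ""                               # short normalized statement


# Made-up example values, for illustration only.
entry = ConceptEntry(
    concept="pleural effusion",
    present=True,
    locations=["left"],
    characteristics=["small"],
    evidential_segment="Small left pleural effusion.",
    presence_statement="There is pleural effusion.",
)
print(entry.concept, entry.present, entry.locations)
```

Keeping presence separate from attributes mirrors the split the abstract describes, where pairs are first categorized by presence and only later compared on attributes.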
If this is right
- Zero-shot grounding at multiple levels of detail improves because false negatives no longer pull apart matching visual and textual features.
- Accuracy rises across five distinct zero-shot classification datasets once semantic conflicts are resolved by relabeling.
- Hard negative mining based on attribute differences supplies training signals that standard random negatives miss.
- The Concept-Aware NCE loss provides a direct mechanism to down-weight only the pairs flagged as noisy by the ontology.
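The suppression idea in the last bullet can be sketched as an InfoNCE loss that drops flagged pairs from the denominator. This is an assumed simplification, not the paper's Concept-Aware NCE loss, whose exact form is not given here; the function name `masked_info_nce`, the hard 0/1 mask, and the temperature value are illustrative choices.

```python
import math
from typing import Sequence


def masked_info_nce(sims: Sequence[float], pos: int,
                    noisy: Sequence[bool], tau: float = 0.07) -> float:
    """InfoNCE for one image over candidate texts, skipping flagged negatives.

    sims[j]  : similarity between the image and text j
    pos      : index of the paired (positive) text
    noisy[j] : True if text j is a flagged noisy negative (assumed hard mask)
    """
    # Keep the positive and every negative that is not flagged as noisy.
    logits = [s / tau for j, s in enumerate(sims) if j == pos or not noisy[j]]
    # Numerically stable log-sum-exp over the kept logits.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - sims[pos] / tau


# Toy similarities: text 1 describes the same finding in another patient.
sims = [0.9, 0.85, 0.1]
plain = masked_info_nce(sims, pos=0, noisy=[False, False, False])
masked = masked_info_nce(sims, pos=0, noisy=[False, True, False])
print(plain, masked)
```

In this toy case the unmasked loss penalizes the model for scoring a semantically matching text highly, while the masked loss does not, which is the behavior the bullet describes.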
Where Pith is reading between the lines
- The same relabeling logic could be tested on other radiology modalities such as CT or MRI where report variability across patients is also high.
- If the ontology construction step proves sensitive to the choice of large language model, replacing it with a smaller domain-specific model would be a direct next experiment.
- The approach suggests a general pattern for any vision-language contrastive setting in which cross-example similarity is common but not captured by patient identity alone.
Load-bearing premise
The large language model produces a hierarchical concept ontology that accurately and completely represents clinical findings so that noisy negative pairs can be identified and removed without systematic mistakes.
What would settle it
If a model trained with the full CoNNS pipeline shows no gain or lower accuracy than a plain contrastive baseline when evaluated on any of the five zero-shot classification datasets or the multi-granularity grounding tasks, the benefit of noisy-negative suppression would be refuted.
Original abstract
Vision-language alignment using chest X-rays and radiology reports has emerged as an advanced paradigm for zero-shot classification and grounding of chest X-ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero-shot understanding tasks. To address this challenge, we propose CoNNS, a concept-guided noisy-negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross-patient pair relabeling strategy comprising three steps: (1) Fine-Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept-Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi-granularity zero-shot grounding tasks and five zero-shot classification datasets validate that CoNNS outperforms existing state-of-the-art models. The code is available at https://github.com/DopamineLcy/conns.
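The three relabeling steps in the abstract can be sketched for a single concept as follows. This is a rough illustration under stated assumptions: the `Finding` type, the label strings, the treatment of "both absent" pairs as noisy negatives, and the use of exact attribute-set comparison in place of the lightweight language model are all hypothetical stand-ins, not the authors' implementation.

```python
from typing import FrozenSet, NamedTuple


class Finding(NamedTuple):
    """Ontology view of one concept on one side of a pair (hypothetical)."""
    present: bool
    attributes: FrozenSet[str]   # location and characteristic terms


def relabel_pair(image_side: Finding, text_side: Finding) -> str:
    """Label a cross-patient (image, text) pair for a single concept."""
    # Step 1: fine-grained breakdown by finding presence.
    if image_side.present != text_side.present:
        return "negative"            # presence disagrees: a genuine negative
    if not image_side.present:
        return "noisy_negative"      # both absent: the text also holds for the image (assumption)
    # Step 2: same finding and same attributes: a false negative to filter out.
    if image_side.attributes == text_side.attributes:
        return "noisy_negative"
    # Step 3: same finding, different attributes: keep as a hard negative.
    # Set inequality stands in for the lightweight language model here.
    return "hard_negative"


small_left = Finding(True, frozenset({"left", "small"}))
large_right = Finding(True, frozenset({"right", "large"}))
absent = Finding(False, frozenset())
print(relabel_pair(small_left, small_left))
print(relabel_pair(small_left, large_right))
print(relabel_pair(small_left, absent))
```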
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CoNNS, a concept-guided noisy-negative suppression framework for zero-shot classification and grounding of chest X-ray findings. It constructs a hierarchical concept ontology for 41 clinical concepts via LLMs that explicitly models presence, attributes (location/characteristics), and texts (evidential segments and presence statements). This ontology supports a three-step cross-patient pair relabeling process (fine-grained breakdown, noisy negative filtering, hard negative mining) and a Concept-Aware NCE loss that suppresses identified noisy negatives while aligning visual and text features. The authors claim this yields superior performance over prior SOTA on multi-granularity zero-shot grounding tasks and five zero-shot classification datasets, with code released.
Significance. If the LLM ontology accurately and completely represents the 41 concepts without systematic omissions or hallucinations, the approach could meaningfully improve vision-language alignment in radiology by reducing semantic ambiguity from shared findings across patients. The explicit attribute modeling and concept-aware loss differentiate it from raw-report or template baselines, and the public code supports reproducibility. The work addresses a genuine limitation in standard contrastive learning for medical reports.
major comments (2)
- [Ontology Construction] Ontology construction (described in the abstract and methods): The central mechanism relies on an LLM-generated hierarchical ontology for 41 concepts to enable reliable fine-grained breakdown, noisy-negative filtering, and hard-negative mining. However, the manuscript provides no details on prompt design, expert review, inter-annotator agreement, or error analysis against clinical ground truth. This is load-bearing because any misattribution of presence or attributes would directly produce incorrect pair labels, undermining the Concept-Aware NCE loss and the reported gains over baselines.
- [Experiments] Experimental claims (abstract and results section): The paper states that extensive experiments across grounding and five classification datasets validate outperformance, yet without reported ablations that isolate the contribution of the ontology-based relabeling versus the loss formulation or mining step, it remains unclear whether the gains are attributable to the proposed components or other factors such as dataset specifics.
minor comments (2)
- [Abstract] The abstract could more precisely state the number of patients or reports used in the cross-patient relabeling to contextualize the scale of the filtering process.
- [Methods] Notation for the Concept-Aware NCE loss should be introduced with an explicit equation early in the methods to improve readability for readers unfamiliar with the variant.
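As an illustration of the kind of explicit equation the last comment asks for, one plausible form is a masked InfoNCE objective. This is an assumed form written for this review, not the equation from the paper; the symbols $s_{ij}$ (similarity between image $i$ and text $j$), $\tau$ (temperature), and $m_{ij} \in \{0, 1\}$ (1 when the ontology flags the pair as a noisy negative) are introduced here for illustration.

```latex
\mathcal{L}_i = -\log
\frac{\exp(s_{ii}/\tau)}
     {\exp(s_{ii}/\tau) + \sum_{j \neq i} (1 - m_{ij})\,\exp(s_{ij}/\tau)}
```

With every $m_{ij} = 0$ this reduces to the standard InfoNCE loss, which makes the effect of suppression easy to read off: each flagged pair removes exactly one term from the denominator.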
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of transparency in our ontology construction and the need for clearer isolation of component contributions in experiments. We address each major comment below and will revise the manuscript to strengthen these areas.
Point-by-point responses
- Referee: [Ontology Construction] Ontology construction (described in the abstract and methods): The central mechanism relies on an LLM-generated hierarchical ontology for 41 concepts to enable reliable fine-grained breakdown, noisy-negative filtering, and hard-negative mining. However, the manuscript provides no details on prompt design, expert review, inter-annotator agreement, or error analysis against clinical ground truth. This is load-bearing because any misattribution of presence or attributes would directly produce incorrect pair labels, undermining the Concept-Aware NCE loss and the reported gains over baselines.
  Authors: We agree that greater transparency on the ontology construction is warranted given its central role. In the revised manuscript, we will add the exact LLM prompts used to generate the hierarchical structure for the 41 concepts, along with a description of the iterative refinement process based on standard radiological taxonomies. The authors include individuals with clinical domain knowledge who performed internal consistency checks against sample reports and literature. However, we did not conduct a formal multi-expert inter-annotator agreement study or exhaustive error analysis against external clinical ground truth, as this would have required substantial additional annotation resources beyond the scope of the original work. We will include a dedicated limitations subsection discussing potential risks of LLM hallucinations or omissions and provide a qualitative error analysis on a sampled subset of concepts to illustrate reliability.
  Revision: partial
- Referee: [Experiments] Experimental claims (abstract and results section): The paper states that extensive experiments across grounding and five classification datasets validate outperformance, yet without reported ablations that isolate the contribution of the ontology-based relabeling versus the loss formulation or mining step, it remains unclear whether the gains are attributable to the proposed components or other factors such as dataset specifics.
  Authors: We appreciate this observation on experimental rigor. The current manuscript includes component-wise ablations in Section 4.3 that examine the impact of the relabeling pipeline and the Concept-Aware NCE loss, but these do not fully decouple the ontology-driven fine-grained breakdown from the hard-negative mining step or the loss formulation. In the revision, we will introduce new ablation tables that independently disable or replace each element (ontology-based relabeling, noisy-negative filtering, hard-negative mining, and the concept-aware loss) while keeping other factors fixed. These additional experiments will be run on the same datasets to more clearly attribute performance improvements to the individual proposed mechanisms rather than dataset characteristics or other implementation details.
  Revision: yes
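The ablation design promised in the second response (disable each element in turn while holding the rest fixed) can be sketched as a leave-one-out grid. The component names are taken from the rebuttal text; the function `leave_one_out` and the boolean-flag representation are hypothetical and say nothing about how the released code is configured.

```python
from typing import Dict, List

# Component names follow the rebuttal; the flag layout is an assumption.
COMPONENTS = [
    "ontology_relabeling",
    "noisy_negative_filtering",
    "hard_negative_mining",
    "concept_aware_loss",
]


def leave_one_out(components: List[str]) -> List[Dict[str, bool]]:
    """Return the full configuration followed by one run per disabled component."""
    full = {name: True for name in components}
    runs = [full]
    for name in components:
        run = dict(full)      # copy so each run differs from full in one flag only
        run[name] = False
        runs.append(run)
    return runs


runs = leave_one_out(COMPONENTS)
for run in runs:
    disabled = [name for name, enabled in run.items() if not enabled]
    print(disabled or ["none (full model)"])
```

Each run differs from the full model in exactly one flag, so a drop in accuracy can be attributed to that component rather than to an interaction between several changes.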
Circularity Check
No significant circularity; derivation introduces independent components validated externally
Full rationale
The paper's core contribution is a new framework (CoNNS) that constructs an LLM-generated hierarchical ontology for 41 concepts, applies a three-step cross-patient relabeling process (fine-grained breakdown, noisy negative filtering, hard negative mining), and defines a Concept-Aware NCE loss to suppress noisy negatives. These elements are introduced as novel inputs rather than derived from or fitted to the target zero-shot tasks. No equations or steps reduce by construction to previously fitted quantities, self-citations, or renamed known results. Performance is assessed via experiments on five classification datasets and multi-granularity grounding tasks, which are independent of the method's internal definitions. The approach is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Radiographs and reports from different patients can be treated as negative pairs, but they often share similar findings, which produces noisy negatives.
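How often this assumption bites can be estimated by counting cross-patient pairs whose findings overlap. The sketch below uses a made-up four-patient batch and a hypothetical helper `shared_finding_rate`; it illustrates the counting only and reports no figure from the paper.

```python
from itertools import permutations
from typing import List, Set


def shared_finding_rate(batch: List[Set[str]]) -> float:
    """Fraction of ordered cross-patient pairs whose finding sets overlap."""
    pairs = list(permutations(range(len(batch)), 2))
    if not pairs:
        return 0.0
    overlapping = sum(1 for i, j in pairs if batch[i] & batch[j])
    return overlapping / len(pairs)


# Made-up batch: one set of findings per patient.
batch = [
    {"pleural effusion"},
    {"pleural effusion", "cardiomegaly"},
    {"pneumothorax"},
    {"cardiomegaly"},
]
rate = shared_finding_rate(batch)
print(f"{rate:.3f}")
```

In this toy batch 4 of the 12 ordered pairs share a finding, so a third of the pairs that standard contrastive training would treat as negatives are at least partly positive.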
invented entities (1)
- hierarchical concept ontology: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, absolute_floor_iff_bare_distinguishability
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "Concept-Aware NCE loss to align visual features with text while suppressing the identified noisy negatives"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.