
arxiv: 2605.19374 · v1 · pith:5RWCEUCCnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI · cs.LG

Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings

Pith reviewed 2026-05-20 06:15 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords zero-shot classification · chest X-ray · noisy negative suppression · concept ontology · vision-language model · contrastive learning · grounding · medical imaging

The pith

A hierarchical concept ontology built with large language models allows filtering of noisy negative pairs to improve zero-shot chest X-ray classification and grounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard contrastive learning treats chest X-ray images and reports from different patients as negative pairs even when the patients share similar findings, which creates semantic ambiguity. The paper builds a hierarchical concept ontology covering 41 clinical concepts that explicitly records presence, attributes such as location and characteristics, and supporting text segments. Using this ontology, a three-step relabeling process categorizes cross-patient pairs by whether each finding is present, removes false negatives, and mines hard negatives with a lightweight language model. A Concept-Aware NCE loss then aligns features while down-weighting the identified noisy negatives. If the approach works, zero-shot performance rises on grounding tasks at multiple granularities and on five separate classification datasets.
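To make the noisy-negative problem concrete, here is a minimal Python sketch under stated assumptions. The toy batch, the finding sets, and the helper names `standard_targets` and `noisy_negatives` are hypothetical illustrations and do not come from the paper or its code release. The sketch only shows how identity-style targets label cross-patient pairs as negatives even when the two reports share a finding.

```python
# Hypothetical toy batch: each report is reduced to the set of findings it mentions.
reports = [
    {"pleural effusion"},                  # patient 0
    {"pleural effusion", "cardiomegaly"},  # patient 1
    {"pneumothorax"},                      # patient 2
]


def standard_targets(n):
    """Identity targets: only the same-patient pair (i, i) counts as positive."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def noisy_negatives(reports):
    """Cross-patient pairs labeled negative although the reports share a finding."""
    n = len(reports)
    return [
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and reports[i] & reports[j]
    ]


targets = standard_targets(len(reports))
noisy = noisy_negatives(reports)
print(noisy)
```

In this toy batch, patients 0 and 1 both have pleural effusion, yet `targets[0][1]` is 0, so a plain contrastive objective would push that image and report apart.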

Core claim

The authors establish that constructing a hierarchical concept ontology with large language models and applying it through cross-patient pair relabeling plus a Concept-Aware NCE loss suppresses noisy negatives that standard contrastive training introduces, thereby reducing semantic ambiguity and raising accuracy in zero-shot classification and grounding of chest X-ray findings over prior methods that rely on raw or templatized reports.

What carries the argument

A hierarchical concept ontology organizes 41 clinical concepts by presence, attributes (location and characteristics), and evidential text segments. It drives the three-step cross-patient pair relabeling, which categorizes pairs, filters false negatives, and mines hard negatives.
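One way to picture a single ontology record is sketched below in Python. The class name `ConceptEntry`, its field names, and the example report text are assumptions made for illustration. The paper's actual schema is not given on this page, so this is only a plausible shape that matches the description (presence, location and characteristic attributes, evidential segment, presence statement).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConceptEntry:
    """One concept's record for one report (hypothetical schema, illustration only)."""

    concept: str                      # one of the 41 clinical concepts
    presence: Optional[bool] = None   # True = present, False = absent, None = unknown
    locations: List[str] = field(default_factory=list)        # attribute: location
    characteristics: List[str] = field(default_factory=list)  # attribute: characteristics
    evidential_segment: str = ""      # report text that supports the label
    presence_statement: str = ""      # short normalized statement of presence


# Made-up example record; the report text is invented for this sketch.
entry = ConceptEntry(
    concept="pleural effusion",
    presence=True,
    locations=["left"],
    characteristics=["small"],
    evidential_segment="Small left pleural effusion is noted.",
    presence_statement="There is pleural effusion.",
)

# A concept the report never mentions keeps presence=None (unknown).
unmentioned = ConceptEntry(concept="pneumothorax")
```

Keeping an explicit unknown state separate from absent follows the "Unk. = Unknown" note in the Figure 2 caption.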

If this is right

  • Zero-shot grounding at multiple levels of detail improves because false negatives no longer pull apart matching visual and textual features.
  • Accuracy rises across five distinct zero-shot classification datasets once semantic conflicts are resolved by relabeling.
  • Hard negative mining based on attribute differences supplies training signals that standard random negatives miss.
  • The Concept-Aware NCE loss provides a direct mechanism to down-weight only the pairs flagged as noisy by the ontology.
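The down-weighting idea in the last bullet can be sketched as a weighted InfoNCE-style loss. This is a generic illustration, not the paper's Concept-Aware NCE formula, which this page does not reproduce. The function name `weighted_nce`, the temperature value, and the 0/1 weighting scheme are all assumptions.

```python
import math


def weighted_nce(sims, pos_idx, weights, temperature=0.07):
    """InfoNCE-style loss for one image against a batch of texts.

    sims:    similarity of the image to each text in the batch
    pos_idx: index of the matched (same-patient) text
    weights: per-text weight on the denominator; 1.0 keeps a pair,
             0.0 drops a pair flagged as a noisy negative (assumed scheme)
    """
    if weights[pos_idx] != 1.0:
        raise ValueError("the positive pair must keep weight 1.0")
    logits = [s / temperature for s in sims]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    denom = sum(w * e for w, e in zip(weights, exps))
    return -math.log(exps[pos_idx] / denom)


# Text 1 is a cross-patient report describing the same finding (a noisy negative).
sims = [0.9, 0.8, 0.1]
plain = weighted_nce(sims, 0, [1.0, 1.0, 1.0])       # standard InfoNCE
suppressed = weighted_nce(sims, 0, [1.0, 0.0, 1.0])  # noisy negative dropped
```

With all weights equal to 1.0 the function reduces to the usual InfoNCE term. Setting a flagged pair's weight to 0.0 removes it from the denominator, so the similar-looking report no longer contributes a repulsive term.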

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same relabeling logic could be tested on other radiology modalities such as CT or MRI where report variability across patients is also high.
  • If the ontology construction step proves sensitive to the choice of large language model, replacing it with a smaller domain-specific model would be a direct next experiment.
  • The approach suggests a general pattern for any vision-language contrastive setting in which cross-example similarity is common but not captured by patient identity alone.

Load-bearing premise

The large language model produces a hierarchical concept ontology that accurately and completely represents clinical findings so that noisy negative pairs can be identified and removed without systematic mistakes.

What would settle it

If a model trained with the full CoNNS pipeline shows no gain or lower accuracy than a plain contrastive baseline when evaluated on any of the five zero-shot classification datasets or the multi-granularity grounding tasks, the benefit of noisy-negative suppression would be refuted.

Figures

Figures reproduced from arXiv: 2605.19374 by Chenyu Lian, Chun-Ka Wong, Hong-Yu Zhou, Jing Qin.

Figure 1: a. Noisy negatives and the proposed suppression strategy. b. Training and c. inference flows. d. Comparison of two mainstream paradigms of text inputs with ours.
Figure 2: The proposed CoNNS consists of three stages. (1) We construct a hierarchical concept ontology via LLM. (2) Cross-patient pairs are relabeled into a relation matrix based on the ontology (Unk. = Unknown). (3) We perform vision-language alignment with noisy negative suppression using the proposed Concept-Aware NCE loss.
Figure 3: Qualitative comparison of similarity maps between ours and the leading competing method [16]. Red bounding boxes indicate ground-truth annotations.
Original abstract

Vision-language alignment using chest X-rays and radiology reports has emerged as an advanced paradigm for zero-shot classification and grounding of chest X-ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero-shot understanding tasks. To address this challenge, we propose CoNNS, a concept-guided noisy-negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross-patient pair relabeling strategy comprising three steps: (1) Fine-Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept-Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi-granularity zero-shot grounding tasks and five zero-shot classification datasets validate that CoNNS outperforms existing state-of-the-art models. The code is available at https://github.com/DopamineLcy/conns.
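The three relabeling steps can be sketched as a small Python function that fills a relation matrix for one concept. Everything here is a simplified assumption: the label names, the dict layout, and above all the exact set comparison in step 3, which stands in for the lightweight language model the abstract mentions. It is not the authors' implementation.

```python
def relabel_pair(a, b):
    """Label one cross-patient pair for a single concept.

    a, b: dicts with 'presence' (True, False, or None for unknown)
          and 'attributes' (a set of strings). Hypothetical layout.
    """
    # Step 1: fine-grained breakdown by finding presence.
    if a["presence"] is None or b["presence"] is None:
        return "unknown"
    if a["presence"] != b["presence"]:
        return "negative"
    # Step 2: noisy negative filtering. Same presence and same attributes
    # means the pair is semantically consistent, so it is a false negative.
    if a["attributes"] == b["attributes"]:
        return "filtered"
    # Step 3: hard negative mining. Same finding, different attributes.
    # ASSUMPTION: exact set comparison replaces the paper's language model.
    return "hard_negative"


def relation_matrix(batch):
    """Same-patient pairs on the diagonal are positives; the rest are relabeled."""
    n = len(batch)
    return [
        ["positive" if i == j else relabel_pair(batch[i], batch[j]) for j in range(n)]
        for i in range(n)
    ]


# Made-up batch for the concept "pleural effusion".
batch = [
    {"presence": True, "attributes": {"left"}},
    {"presence": True, "attributes": {"left"}},
    {"presence": True, "attributes": {"right"}},
    {"presence": False, "attributes": set()},
    {"presence": None, "attributes": set()},
]
matrix = relation_matrix(batch)
```

Row 0 of `matrix` then contains one entry of each kind: the matched pair, a filtered false negative, a hard negative that differs only in location, a clean negative, and an unknown.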

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CoNNS, a concept-guided noisy-negative suppression framework for zero-shot classification and grounding of chest X-ray findings. It constructs a hierarchical concept ontology for 41 clinical concepts via LLMs that explicitly models presence, attributes (location/characteristics), and texts (evidential segments and presence statements). This ontology supports a three-step cross-patient pair relabeling process (fine-grained breakdown, noisy negative filtering, hard negative mining) and a Concept-Aware NCE loss that suppresses identified noisy negatives while aligning visual and text features. The authors claim this yields superior performance over prior SOTA on multi-granularity zero-shot grounding tasks and five zero-shot classification datasets, with code released.

Significance. If the LLM ontology accurately and completely represents the 41 concepts without systematic omissions or hallucinations, the approach could meaningfully improve vision-language alignment in radiology by reducing semantic ambiguity from shared findings across patients. The explicit attribute modeling and concept-aware loss differentiate it from raw-report or template baselines, and the public code supports reproducibility. The work addresses a genuine limitation in standard contrastive learning for medical reports.

major comments (2)
  1. [Ontology Construction] Ontology construction (described in the abstract and methods): The central mechanism relies on an LLM-generated hierarchical ontology for 41 concepts to enable reliable fine-grained breakdown, noisy-negative filtering, and hard-negative mining. However, the manuscript provides no details on prompt design, expert review, inter-annotator agreement, or error analysis against clinical ground truth. This is load-bearing because any misattribution of presence or attributes would directly produce incorrect pair labels, undermining the Concept-Aware NCE loss and the reported gains over baselines.
  2. [Experiments] Experimental claims (abstract and results section): The paper states that extensive experiments across grounding and five classification datasets validate outperformance, yet without reported ablations that isolate the contribution of the ontology-based relabeling versus the loss formulation or mining step, it remains unclear whether the gains are attributable to the proposed components or other factors such as dataset specifics.
minor comments (2)
  1. [Abstract] The abstract could more precisely state the number of patients or reports used in the cross-patient relabeling to contextualize the scale of the filtering process.
  2. [Methods] Notation for the Concept-Aware NCE loss should be introduced with an explicit equation early in the methods to improve readability for readers unfamiliar with the variant.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of transparency in our ontology construction and the need for clearer isolation of component contributions in experiments. We address each major comment below and will revise the manuscript to strengthen these areas.

Point-by-point responses
  1. Referee: [Ontology Construction] Ontology construction (described in the abstract and methods): The central mechanism relies on an LLM-generated hierarchical ontology for 41 concepts to enable reliable fine-grained breakdown, noisy-negative filtering, and hard-negative mining. However, the manuscript provides no details on prompt design, expert review, inter-annotator agreement, or error analysis against clinical ground truth. This is load-bearing because any misattribution of presence or attributes would directly produce incorrect pair labels, undermining the Concept-Aware NCE loss and the reported gains over baselines.

    Authors: We agree that greater transparency on the ontology construction is warranted given its central role. In the revised manuscript, we will add the exact LLM prompts used to generate the hierarchical structure for the 41 concepts, along with a description of the iterative refinement process based on standard radiological taxonomies. The authors include individuals with clinical domain knowledge who performed internal consistency checks against sample reports and literature. However, we did not conduct a formal multi-expert inter-annotator agreement study or exhaustive error analysis against external clinical ground truth, as this would have required substantial additional annotation resources beyond the scope of the original work. We will include a dedicated limitations subsection discussing potential risks of LLM hallucinations or omissions and provide a qualitative error analysis on a sampled subset of concepts to illustrate reliability. revision: partial

  2. Referee: [Experiments] Experimental claims (abstract and results section): The paper states that extensive experiments across grounding and five classification datasets validate outperformance, yet without reported ablations that isolate the contribution of the ontology-based relabeling versus the loss formulation or mining step, it remains unclear whether the gains are attributable to the proposed components or other factors such as dataset specifics.

    Authors: We appreciate this observation on experimental rigor. The current manuscript includes component-wise ablations in Section 4.3 that examine the impact of the relabeling pipeline and the Concept-Aware NCE loss, but these do not fully decouple the ontology-driven fine-grained breakdown from the hard-negative mining step or the loss formulation. In the revision, we will introduce new ablation tables that independently disable or replace each element (ontology-based relabeling, noisy-negative filtering, hard-negative mining, and the concept-aware loss) while keeping other factors fixed. These additional experiments will be run on the same datasets to more clearly attribute performance improvements to the individual proposed mechanisms rather than dataset characteristics or other implementation details. revision: yes
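The leave-one-out ablation plan described in this response could be enumerated as in the Python sketch below. The component keys and the helper name `leave_one_out` are hypothetical labels chosen for illustration and do not correspond to flags in the released code.

```python
# Hypothetical component names taken from the wording of the rebuttal.
COMPONENTS = (
    "ontology_relabeling",
    "noisy_negative_filtering",
    "hard_negative_mining",
    "concept_aware_loss",
)


def leave_one_out(components):
    """Return the full configuration plus one configuration per disabled component."""
    full = {name: True for name in components}
    configs = [("full", full)]
    for name in components:
        ablated = dict(full)
        ablated[name] = False  # disable exactly one component, keep the rest fixed
        configs.append((f"no_{name}", ablated))
    return configs


configs = leave_one_out(COMPONENTS)
for label, cfg in configs:
    enabled = [name for name, on in cfg.items() if on]
    print(label, enabled)
```

Each ablated run differs from the full pipeline in exactly one switch, which is what lets a performance drop be attributed to that component rather than to dataset specifics.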

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent components validated externally

Full rationale

The paper's core contribution is a new framework (CoNNS) that constructs an LLM-generated hierarchical ontology for 41 concepts, applies a three-step cross-patient relabeling process (fine-grained breakdown, noisy negative filtering, hard negative mining), and defines a Concept-Aware NCE loss to suppress noisy negatives. These elements are introduced as novel inputs rather than derived from or fitted to the target zero-shot tasks. No equations or steps reduce by construction to previously fitted quantities, self-citations, or renamed known results. Performance is assessed via experiments on five classification datasets and multi-granularity grounding tasks, which are independent of the method's internal definitions. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework relies on the assumption that the LLM-generated ontology is reliable and that the three-step relabeling correctly identifies noisy negatives without introducing new biases.

axioms (1)
  • domain assumption: Radiographs and reports from different patients can be treated as negative pairs but often share similar findings, leading to noisy negatives.
    Core problem statement in the abstract.
invented entities (1)
  • hierarchical concept ontology (no independent evidence)
    purpose: To model 41 clinical concepts with presence, attributes, and texts for pair relabeling.
    Built using large language models; no independent validation data mentioned in abstract.

pith-pipeline@v0.9.0 · 5814 in / 1381 out tokens · 80757 ms · 2026-05-20T06:15:28.581243+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 3 internal anchors

  1. [1]

    Maira-2: Grounded radiology report generation. arXiv preprint arXiv:2406.04449, 2024

    Bannur, S., Bouzid, K., Castro, D.C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., Ilse, M., Pérez-García, F., Salvatelli, V., Sharma, H., et al.: Maira-2: Grounded radiology report generation. arXiv preprint arXiv:2406.04449 (2024)

  2. [2]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Bannur, S., Hyland, S., Liu, Q., Perez-Garcia, F., Ilse, M., Castro, D.C., Boecking, B., Sharma, H., Bouzid, K., Thieme, A., et al.: Learning to exploit temporal structure for biomedical vision-language processing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15016–15027 (2023)

  3. [3]

    In: European conference on computer vision

    Boecking, B., Usuyama, N., Bannur, S., Castro, D.C., Schwaighofer, A., Hyland, S., Wetscherek, M., Naumann, T., Nori, A., Alvarez-Valle, J., et al.: Making the most of text semantics to improve biomedical vision–language processing. In: European conference on computer vision. pp. 1–21. Springer (2022)

  4. [4]

    NEJM AI 2(7), AIdbp2401120 (2025)

    de Castro, D.C., Bustos, A., Bannur, S., Hyland, S.L., Bouzid, K., Wetscherek, M.T., Sánchez-Valverde, M.D., Jaques-Pérez, L., Pérez-Rodríguez, L., Takeda, K., et al.: Padchest-gr: A bilingual chest x-ray dataset for grounded radiology report generation. NEJM AI 2(7), AIdbp2401120 (2025)

  5. [5]

    In: International Conference on Medical Image Computing and Computer-Assisted Intervention

    Chen, Z., Zhou, Y., Tran, A., Zhao, J., Wan, L., Ooi, G.S.K., Cheng, L.T.E., Thng, C.H., Xu, X., Liu, Y., et al.: Medical phrase grounding with region-phrase context contrastive alignment. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 371–381. Springer (2023)

  6. [6]

    Journal of the American Medical Informatics Association 23(2), 304–310 (2015)

    Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23(2), 304–310 (2015)

  7. [7]

    In: International Conference on Learning Representations (2020)

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2020)

  8. [8]

    arXiv e-prints pp

    Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The llama 3 herd of models. arXiv e-prints pp. arXiv–2407 (2024)

  9. [9]

    In: Proceedings of the AAAI conference on artificial intelligence

    Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI conference on artificial intelligence. vol. 33, pp. 590–597 (2019)

  10. [10]

    MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs

    Johnson, A.E., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019)

  11. [11]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Lai, H., Yao, Q., Jiang, Z., Wang, R., He, Z., Tao, X., Zhou, S.K.: Carzero: Cross-attention alignment for radiology zero-shot classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11137–11146 (2024)

  12. [12]

    IEEE Transactions on Medical Imaging 44(11), 4499–4510 (2025)

    Lian, C., Zhou, H.Y., Liang, D., Qin, J., Wang, L.: Efficient medical vision-language alignment through adapting masked vision models. IEEE Transactions on Medical Imaging 44(11), 4499–4510 (2025)

  13. [13]

    arXiv preprint arXiv:2006.10550 (2020)

    Liu, J., Lian, J., Yu, Y.: Chestx-det10: chest x-ray dataset on detection of thoracic abnormalities. arXiv preprint arXiv:2006.10550 (2020)

  14. [14]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  15. [15]

    Representation Learning with Contrastive Predictive Coding

    Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)

  16. [16]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

    Park, J., Yoon, B., Kim, S., Choi, K.: Radzero: Similarity-based cross-attention for explainable vision-language alignment in chest x-ray with zero-shot multi-task capability. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  17. [17]

    Advances in neural information processing systems 32 (2019)

    Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)

  18. [18]

    Nature Machine Intelligence 7(1), 119–130 (2025)

    Perez-Garcia, F., Sharma, H., Bond-Taylor, S., Bouzid, K., Salvatelli, V., Ilse, M., Bannur, S., Castro, D.C., Schwaighofer, A., Lungren, M.P., et al.: Exploring scalable medical image encoders beyond text supervision. Nature Machine Intelligence 7(1), 119–130 (2025)

  19. [19]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

  20. [20]

    In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing

    Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (11 2019)

  21. [21]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2097–2106 (2017)

  22. [22]

    In: Proceedings of the Conference on Empirical Methods in Natural Language Processing

    Wang, Z., Wu, Z., Agarwal, D., Sun, J.: Medclip: Contrastive learning from unpaired medical images and text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. vol. 2022, p. 3876 (2022)

  23. [23]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Wu, C., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 21372–21383 (2023)

  24. [24]

    International Journal of Computer Vision 126(10), 1084–1102 (2018)

    Zhang, J., Bargal, S.A., Lin, Z., Brandt, J., Shen, X., Sclaroff, S.: Top-down neural attention by excitation backprop. International Journal of Computer Vision 126(10), 1084–1102 (2018)

  25. [25]

    NEJM AI 2(1) (2024)

    Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., Wong, C., Tupini, A., Wang, Y., Mazzola, M., Shukla, S., Liden, L., Gao, J., Crabtree, A., Piening, B., Bifulco, C., Lungren, M.P., Naumann, T., Wang, S., Poon, H.: A multimodal biomedical foundation model t...

  26. [26]

    Nature Communications 14(1), 4542 (2023)

    Zhang, X., Wu, C., Zhang, Y., Xie, W., Wang, Y.: Knowledge-enhanced visual-language pre-training on chest radiology images. Nature Communications 14(1), 4542 (2023)

  27. [27]

    In: Machine Learning for Healthcare Conference

    Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., Langlotz, C.P.: Contrastive learning of medical visual representations from paired images and text. In: Machine Learning for Healthcare Conference. pp. 2–25. PMLR (2022)

  28. [28]

    In: The Eleventh International Conference on Learning Representations (2023)

    Zhou, H.Y., Lian, C., Wang, L., Yu, Y.: Advancing radiograph representation learning with masked record modeling. In: The Eleventh International Conference on Learning Representations (2023)