
arxiv: 2604.09887 · v1 · submitted 2026-04-10 · 💻 cs.LG

SemEnrich: Self-Supervised Semantic Enrichment of Radiology Reports for Vision-Language Learning

Pith reviewed 2026-05-10 16:43 UTC · model grok-4.3

classification 💻 cs.LG
keywords self-supervised learning, semantic clustering, radiology reports, vision-language models, data enrichment, GRPO training, medical imaging, report augmentation

The pith

Self-supervised semantic clustering enriches radiology reports by adding positive and neutral findings, improving vision-language model performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical vision-language datasets are small and biased toward negative findings because clinicians tend to report abnormalities while omitting some positive or neutral observations. The paper establishes that semantic clustering of sentences across reports can identify compatible positive and neutral content that can be added to other reports in a self-supervised way. This enrichment produces measurable gains when models are fine-tuned and even larger gains when the same cluster information is used to shape rewards in GRPO training. A sympathetic reader would care because the method expands limited medical data without manual labeling, addressing a core bottleneck in training reliable radiology AI systems.

Core claim

By grouping sentences from radiology reports into semantic clusters, positive and neutral observations can be identified and transferred across reports to create enriched training sets. This self-supervised augmentation leads to average gains of 5.63% on COMET, 3.04% on BERTScore, 7.40% on Sentence BLEU, 5.30% on CheXbert-F1, and 7.47% on RadGraph-F1 during supervised fine-tuning. Incorporating cluster membership into the reward function for GRPO training produces further average gains of 2.78% on COMET, 3.14% on BERTScore, and 12.80% on Sentence BLEU. Ablation experiments confirm that the improvements arise from the semantic structure rather than random sentence insertion.

What carries the argument

Semantic clustering of report sentences that groups observations by meaning to select positive or neutral findings for cross-report augmentation.

If this is right

  • Enriched reports produce consistent gains across five automatic metrics during standard supervised fine-tuning.
  • Using cluster labels inside the GRPO reward design yields additional improvements on COMET, BERTScore, and Sentence BLEU.
  • Ablation results show that random sentence addition does not replicate the gains, confirming the role of semantic grouping.
  • The method directly counters the negative-finding bias that limits current radiology vision-language datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clustering logic could be tested on other types of incomplete clinical text, such as pathology or discharge summaries.
  • If cluster-derived additions prove reliable, the approach could reduce reliance on large manually curated medical datasets.
  • Combining semantic enrichment with existing image augmentation techniques might produce multiplicative gains in end-to-end training.

Load-bearing premise

Sentences placed in the same semantic cluster reliably represent true positive or neutral medical observations that can be added to other reports without creating factual errors or noise.

What would settle it

Training a vision-language model on the enriched reports and finding no improvement or outright degradation on held-out test sets compared with the original data, or expert review showing frequent factual mismatches in the added sentences.

Figures

Figures reproduced from arXiv: 2604.09887 by Halil Ibrahim Gulluk, Olivier Gevaert.

Figure 1: An example chest X-ray from the ReXGradient-160K dataset. Findings: The lungs are adequately inflated. There is no focal infiltrate. There is no pleural effusion. No pulmonary parenchymal nodules or masses are observed. The heart and pulmonary vascularity are normal. There is calcification in the wall of the thoracic aorta. The bony thorax exhibits no acute abnormality. Impression: No pulmonary nodules are…
Figure 2: HDBSCAN log histogram of cluster sizes. Let’s denote the sentence to cluster id mapping as $f_C : s_i \to c_i$, where $c_i$ is the cluster id of the $i$-th sentence. This way, each sentence belongs to a cluster and a findings for a chest X-ray, which is naturally a set of sentences, can be represented as a set of cluster ids. Eventually, we have $K$ cluster ids $C = \{c_1, c_2, \ldots, c_K\}$ and we can represent a findings for…
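The representation described in the Figure 2 caption can be sketched in Python. This is a minimal illustration, assuming the sentence-to-cluster assignment has already been computed by a clustering step; the cluster ids, the paraphrased sentence, and the helper name `findings_to_cluster_ids` are hypothetical and not taken from the paper's code.

```python
from typing import Dict, FrozenSet, List

# Hypothetical sentence -> cluster id mapping f_C (ids are illustrative only).
F_C: Dict[str, int] = {
    "The lungs are adequately inflated.": 1,
    "There is no focal infiltrate.": 2,
    "There is no pleural effusion.": 3,
    "No pleural effusion is seen.": 3,  # assumed paraphrase sharing cluster 3
}


def findings_to_cluster_ids(sentences: List[str], f_c: Dict[str, int]) -> FrozenSet[int]:
    """Represent a findings section (a list of sentences) as a set of cluster ids."""
    return frozenset(f_c[s] for s in sentences)


report_a = ["The lungs are adequately inflated.", "There is no pleural effusion."]
report_b = ["No pleural effusion is seen.", "The lungs are adequately inflated."]

ids_a = findings_to_cluster_ids(report_a, F_C)
ids_b = findings_to_cluster_ids(report_b, F_C)
print(ids_a == ids_b)  # differently worded reports map to the same cluster-id set
```

Under this representation, two reports that phrase the same observations differently become identical sets, which is what makes cluster-level comparison across reports possible.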
Figure 3: Visualization of the cluster-based findings representation. Each node represents a cluster of semantically similar sentences. Patient 1’s findings (consisting of multiple sentences) forms a complete subgraph where each node corresponds to the cluster containing one of the finding sentences. The blue subgraph shows an example findings with 4 sentences: the bold text shows the original sentence, and italiciz…
Figure 4: Illustration of the valid enrichment process. Left: The base findings $F_j = \{c_1, c_2, c_3, c_4\}$ (enclosed in blue box) with candidate clusters (dashed nodes). Green nodes are positive, red nodes are negative. Candidate $c_5$ is valid (positive and connected to all base nodes). Candidate $c_6$ is invalid (negative sign). Candidate $c_7$ is invalid (not connected to $c_1$ and $c_3$). Candidates $c_8$ and $c_9$ form a valid pair (bot…
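The single-candidate rule in the Figure 4 caption (a candidate must be positive and connected to every base node) can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes edges link clusters that co-occur in at least one findings section, and the function names, toy reports, and sign labels are hypothetical. The pair rule for $c_8$ and $c_9$ is truncated in the caption and is not modeled here.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set

Edge = FrozenSet[int]


def build_cooccurrence_edges(reports: Iterable[Set[int]]) -> Set[Edge]:
    """Assumption: two cluster ids are connected if they appear in the same findings."""
    edges: Set[Edge] = set()
    for clusters in reports:
        for a, b in combinations(sorted(clusters), 2):
            edges.add(frozenset((a, b)))
    return edges


def is_valid_candidate(
    candidate: int, base: Set[int], edges: Set[Edge], is_positive: Dict[int, bool]
) -> bool:
    """A candidate is valid if it is new, positive, and linked to every base cluster."""
    if candidate in base or not is_positive.get(candidate, False):
        return False
    return all(frozenset((candidate, b)) in edges for b in base)


# Toy data mirroring the caption: base findings {c1..c4}, candidates c5, c6, c7.
base = {1, 2, 3, 4}
reports = [
    {1, 2, 3, 4},
    {1, 2, 3, 4, 5},  # c5 co-occurs with every base cluster
    {1, 2, 3, 4, 6},  # c6 co-occurs with every base cluster but is negative
    {2, 4, 7},        # c7 never co-occurs with c1 or c3
]
is_positive = {1: True, 2: True, 3: True, 4: True, 5: True, 6: False, 7: True}

edges = build_cooccurrence_edges(reports)
valid = [c for c in (5, 6, 7) if is_valid_candidate(c, base, edges, is_positive)]
print(valid)  # prints [5]
```

Requiring a link to every base cluster is the conservative choice here: a candidate that has only been seen alongside some of the base observations is rejected, as with $c_7$.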
Figure 5: Model architecture overview. The input chest X-ray is processed by a Vision Encoder and projected into image tokens. These are concatenated with the tokenized findings (text tokens) and fed into a Large Language Model for next token prediction. we format our input as follows: "<think> findings here </think><answer> impression here </answer>". One classical way to reward the model in GRPO training is to rew…
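The `<think>`/`<answer>` template quoted in the Figure 5 caption can be produced and parsed with a few lines of Python. This is only a sketch of the string format; the helper names `format_target` and `parse_output` and the example impression are hypothetical, and the reward computation itself is not shown because the caption is truncated before describing it.

```python
import re
from typing import Optional, Tuple

TEMPLATE = "<think> {findings} </think><answer> {impression} </answer>"
PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def format_target(findings: str, impression: str) -> str:
    """Wrap findings and impression in the tag format quoted in the caption."""
    return TEMPLATE.format(findings=findings.strip(), impression=impression.strip())


def parse_output(text: str) -> Optional[Tuple[str, str]]:
    """Return (findings, impression) if the text follows the template, else None."""
    match = PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


target = format_target("There is no pleural effusion.", "No acute abnormality.")
parsed = parse_output(target)
print(parsed)
```

A parser of this kind gives a simple way to separate the findings from the impression in a generated sequence, and returning `None` for malformed output makes format violations easy to detect.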
Figure 6: HDBSCAN clustering statistics: (a) Log histogram of cluster sizes, (b) …
Figure 7: KMEANS-5000 clustering statistics: (a) Log histogram of cluster sizes, (b) …
Original abstract

Medical vision-language datasets are often limited in size and biased toward negative findings, as clinicians report abnormalities mostly but might omit some positive/neutral findings because they might be considered as irrelevant to the patient's condition. We propose a self-supervised data enrichment method that leverages semantic clustering of report sentences. Then we enrich the findings in the medical reports in the training set by adding positive/neutral observations from different clusters in a self-supervised manner. Our approach yields consistent gains in supervised fine-tuning (5.63%, 3.04%, 7.40%, 5.30%, 7.47% average gains on COMET score, Bert score, Sentence Bleu, CheXbert-F1 and RadGraph-F1 scores respectively). Ablation studies confirm that improvements stem from semantic clustering rather than random augmentation. Furthermore, we introduce a way to incorporate semantic cluster information into the reward design for GRPO training, which leads to further performance gains (2.78%, 3.14%, 12.80% average gains on COMET score, Bert score and Sentence Bleu scores respectively). We share our code at https://anonymous.4open.science/r/SemEnrich-75CF

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SemEnrich, a self-supervised method that clusters sentences from radiology reports and enriches training reports by inserting positive or neutral observations drawn from other clusters. It reports consistent average gains in supervised fine-tuning (5.63% COMET, 3.04% BERTScore, 7.40% Sentence BLEU, 5.30% CheXbert-F1, 7.47% RadGraph-F1) and further gains when semantic cluster information is folded into the GRPO reward function (2.78% COMET, 3.14% BERTScore, 12.80% Sentence BLEU). Ablations are presented to show that gains arise from semantic rather than random augmentation.

Significance. If the added sentences are verifiably supported by the paired images, the approach could mitigate the negative-finding bias common in radiology VL datasets and improve downstream model robustness. The ablation against random augmentation and the open code release are positive elements that support reproducibility and isolate the contribution of clustering.

major comments (3)
  1. [Methods] Methods section: the enrichment step inserts sentences from other clusters without any image-conditioned consistency check or contradiction detection against the original report content. Cluster membership alone is treated as sufficient evidence that an added observation is factually safe for the given image; this assumption is load-bearing for the claim that enrichment improves data quality rather than merely increasing lexical diversity.
  2. [Results] Results / Ablation studies: the comparison to random augmentation rules out non-semantic effects but does not test whether added sentences are image-supported or introduce factual errors. COMET and BERTScore can reward semantic plausibility even when the addition contradicts the image; CheXbert-F1 and RadGraph-F1 capture only a subset of observation errors, leaving the factual accuracy of the enriched reports unverified.
  3. [GRPO section] GRPO reward design paragraph: the description of how cluster information is incorporated into the reward function is high-level; without explicit formulation or analysis of how cluster-derived rewards interact with image-text alignment, it is unclear whether the additional reported gains (2.78–12.80%) reflect improved factual grounding or simply reinforce cluster-level priors.
minor comments (2)
  1. [Abstract] The clustering algorithm, number of clusters, and exact criteria for labeling sentences as positive/neutral are not stated in the abstract; these details are necessary for reproducibility even if they appear later in the manuscript.
  2. The anonymous code link is appreciated; ensure the repository is made public and includes the exact clustering and enrichment scripts used for the reported experiments.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment below with clarifications on our design choices and indicate where revisions will be made to improve clarity and transparency.

point-by-point responses
  1. Referee: [Methods] Methods section: the enrichment step inserts sentences from other clusters without any image-conditioned consistency check or contradiction detection against the original report content. Cluster membership alone is treated as sufficient evidence that an added observation is factually safe for the given image; this assumption is load-bearing for the claim that enrichment improves data quality rather than merely increasing lexical diversity.

    Authors: We agree that the enrichment procedure relies solely on semantic cluster membership derived from report text, without an explicit image-conditioned consistency check or contradiction detection. This choice is intentional to maintain a fully self-supervised pipeline that requires no additional labels or models. The assumption is that sentences grouped by semantic clustering share compatible observational properties, which is supported by the ablation demonstrating that semantic enrichment outperforms random augmentation on downstream metrics. In the revised manuscript, we will expand the Methods section to explicitly articulate this assumption, its rationale, and its limitations, including the lack of per-instance image verification. revision: yes

  2. Referee: [Results] Results / Ablation studies: the comparison to random augmentation rules out non-semantic effects but does not test whether added sentences are image-supported or introduce factual errors. COMET and BERTScore can reward semantic plausibility even when the addition contradicts the image; CheXbert-F1 and RadGraph-F1 capture only a subset of observation errors, leaving the factual accuracy of the enriched reports unverified.

    Authors: The referee is correct that our ablations and automatic metrics do not directly confirm image support or rule out factual errors in every enriched sentence. The evaluation strategy relies on consistent improvements in multiple downstream metrics after fine-tuning, with the semantic-versus-random ablation isolating the contribution of clustering. Direct factual verification at scale would require either human review or an auxiliary image-text alignment model, which we deliberately avoided to preserve the self-supervised character of the method. In revision, we will add a limitations subsection that acknowledges the reliance on proxy metrics and the possibility of undetected inconsistencies, while noting that the observed gains across COMET, BERTScore, CheXbert-F1, and RadGraph-F1 provide indirect evidence of net benefit. revision: yes

  3. Referee: [GRPO section] GRPO reward design paragraph: the description of how cluster information is incorporated into the reward function is high-level; without explicit formulation or analysis of how cluster-derived rewards interact with image-text alignment, it is unclear whether the additional reported gains (2.78–12.80%) reflect improved factual grounding or simply reinforce cluster-level priors.

    Authors: We acknowledge that the current description of the GRPO reward modification is high-level. In the revised manuscript, we will supply the explicit mathematical formulation showing how cluster membership is encoded into the reward signal. We will also add a short analysis discussing the interaction between the cluster-derived term and the base image-text alignment objective, to help distinguish whether the reported gains arise from improved grounding or from reinforcing semantic priors. This addition should address the concern about interpretability of the extra performance. revision: yes

standing simulated objections not resolved
  • Direct, instance-level verification that every added sentence is factually supported by its paired image, which would require either large-scale human annotation or an external image-conditioned consistency model outside the self-supervised framework of the present work.

Circularity Check

0 steps flagged

No significant circularity; empirical gains are measured against external metrics and ablations

full rationale

The paper presents a self-supervised enrichment procedure that clusters report sentences and augments training reports with positive/neutral sentences drawn from other clusters. Gains are then measured on held-out test sets using standard external metrics (COMET, BERTScore, Sentence BLEU, CheXbert-F1, RadGraph-F1) after supervised fine-tuning and GRPO training, with an ablation confirming that semantic clustering outperforms random augmentation. No equations, fitted parameters, or self-citations appear in the provided text that would reduce the reported improvements to quantities defined by the method's own inputs. The central claims are therefore falsifiable via independent evaluation and the released code, satisfying the criteria for a self-contained, non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method assumes semantic clusters derived from report sentences map to medically meaningful positive/neutral observations that can be transferred without introducing errors; this is a domain assumption rather than a derived result.

axioms (1)
  • domain assumption Semantic clustering of report sentences produces groups that correspond to clinically relevant positive or neutral findings suitable for cross-report enrichment.
    Invoked when the method adds observations from different clusters; no independent validation of cluster medical accuracy is described in the abstract.

pith-pipeline@v0.9.0 · 5512 in / 1406 out tokens · 53706 ms · 2026-05-10T16:43:12.651163+00:00 · methodology

