SemEnrich: Self-Supervised Semantic Enrichment of Radiology Reports for Vision-Language Learning
Pith reviewed 2026-05-10 16:43 UTC · model grok-4.3
The pith
Self-supervised semantic clustering enriches radiology reports by adding positive and neutral findings, improving vision-language model performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By grouping sentences from radiology reports into semantic clusters, positive and neutral observations can be identified and transferred across reports to create enriched training sets. This self-supervised augmentation leads to average gains of 5.63% on COMET, 3.04% on BERTScore, 7.40% on Sentence BLEU, 5.30% on CheXbert-F1, and 7.47% on RadGraph-F1 during supervised fine-tuning. Incorporating cluster membership into the reward function for GRPO training produces further average gains of 2.78% on COMET, 3.14% on BERTScore, and 12.80% on Sentence BLEU. Ablation experiments confirm that the improvements arise from the semantic structure rather than random sentence insertion.
What carries the argument
Semantic clustering of report sentences that groups observations by meaning to select positive or neutral findings for cross-report augmentation.
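The mechanism can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not name the sentence encoder, the clustering algorithm, or the rule for choosing which sentences to insert, so every choice below is an assumption. Bag-of-words cosine similarity stands in for a sentence encoder, a greedy threshold grouping stands in for the clustering step, and `enrich` and `safe_ids` (the clusters treated as positive/neutral) are hypothetical names.

```python
import math
import re
from collections import Counter

def embed(sentence):
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster_sentences(sentences, threshold=0.4):
    """Greedy grouping: join the first cluster whose seed sentence is similar enough."""
    clusters = []  # each cluster is a list of sentences; element 0 is the seed
    for s in sentences:
        for c in clusters:
            if cosine(embed(s), embed(c[0])) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def enrich(report, clusters, safe_ids):
    """Append the seed sentence of each 'safe' cluster the report does not already touch."""
    covered = {i for i, c in enumerate(clusters) if any(s in c for s in report)}
    added = [clusters[i][0] for i in sorted(safe_ids) if i not in covered]
    return report + added

# Toy reports, invented for illustration only.
reports = [
    ["the heart size is normal", "there is a right pleural effusion"],
    ["heart size is normal", "no pneumothorax is seen"],
    ["small left pleural effusion", "no pneumothorax"],
]
clusters = cluster_sentences([s for r in reports for s in r])
# Assume clusters 0 (heart size) and 2 (pneumothorax) hold positive/neutral findings.
enriched = enrich(reports[2], clusters, safe_ids={0, 2})
```

In this toy run the third report gains a heart-size sentence it never contained. That is exactly the step whose factual safety the load-bearing premise depends on, since nothing here consults the paired image.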
If this is right
- Enriched reports produce consistent gains across five automatic metrics during standard supervised fine-tuning.
- Using cluster labels inside the GRPO reward design yields additional improvements on COMET, BERTScore, and Sentence BLEU.
- Ablation results show that random sentence addition does not replicate the gains, confirming the role of semantic grouping.
- The method directly counters the negative-finding bias that limits current radiology vision-language datasets.
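How cluster labels might enter a GRPO reward can be sketched as follows. The summary gives no formula, so the blend is purely an assumption: `cluster_reward` is a hypothetical Jaccard overlap between the cluster IDs hit by a generated report and by its reference, `base_score` stands for any text-similarity reward, and `weight` is an illustrative mixing coefficient. Only the group-relative normalisation (reward minus the group mean, divided by the group standard deviation) follows the usual GRPO recipe; the population standard deviation is used here for simplicity.

```python
from statistics import mean, pstdev

def cluster_reward(generated_ids, reference_ids):
    """Jaccard overlap of cluster-ID sets (hypothetical reward term)."""
    union = generated_ids | reference_ids
    return len(generated_ids & reference_ids) / len(union) if union else 1.0

def combined_reward(base_score, generated_ids, reference_ids, weight=0.5):
    """Blend a base text-similarity score with the cluster term."""
    return (1 - weight) * base_score + weight * cluster_reward(generated_ids, reference_ids)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalisation within one group of sampled completions."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of three sampled reports for the same image (invented values).
reference = {0, 2, 5}
samples = [(0.60, {0, 2, 5}), (0.60, {0, 2}), (0.60, {7})]
rewards = [combined_reward(base, ids, reference) for base, ids in samples]
advantages = group_relative_advantages(rewards)
```

With identical base scores, the ranking inside the group is decided entirely by cluster overlap. This makes concrete the question of whether such a term rewards grounding in the image or only agreement with cluster-level priors.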
Where Pith is reading between the lines
- The same clustering logic could be tested on other types of incomplete clinical text, such as pathology or discharge summaries.
- If cluster-derived additions prove reliable, the approach could reduce reliance on large manually curated medical datasets.
- Combining semantic enrichment with existing image augmentation techniques might produce multiplicative gains in end-to-end training.
Load-bearing premise
Sentences placed in the same semantic cluster reliably represent true positive or neutral medical observations that can be added to other reports without creating factual errors or noise.
What would settle it
Training a vision-language model on the enriched reports and finding no improvement or outright degradation on held-out test sets compared with the original data, or expert review showing frequent factual mismatches in the added sentences.
Original abstract
Medical vision-language datasets are often limited in size and biased toward negative findings, as clinicians report abnormalities mostly but might omit some positive/neutral findings because they might be considered as irrelevant to the patient's condition. We propose a self-supervised data enrichment method that leverages semantic clustering of report sentences. Then we enrich the findings in the medical reports in the training set by adding positive/neutral observations from different clusters in a self-supervised manner. Our approach yields consistent gains in supervised fine-tuning (5.63%, 3.04%, 7.40%, 5.30%, 7.47% average gains on COMET score, Bert score, Sentence Bleu, CheXbert-F1 and RadGraph-F1 scores respectively). Ablation studies confirm that improvements stem from semantic clustering rather than random augmentation. Furthermore, we introduce a way to incorporate semantic cluster information into the reward design for GRPO training, which leads to further performance gains (2.78%, 3.14%, 12.80% average gains on COMET score, Bert score and Sentence Bleu scores respectively). We share our code at https://anonymous.4open.science/r/SemEnrich-75CF
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SemEnrich, a self-supervised method that clusters sentences from radiology reports and enriches training reports by inserting positive or neutral observations drawn from other clusters. It reports consistent average gains in supervised fine-tuning (5.63% COMET, 3.04% BERTScore, 7.40% Sentence BLEU, 5.30% CheXbert-F1, 7.47% RadGraph-F1) and further gains when semantic cluster information is folded into the GRPO reward function (2.78% COMET, 3.14% BERTScore, 12.80% Sentence BLEU). Ablations are presented to show that gains arise from semantic rather than random augmentation.
Significance. If the added sentences are verifiably supported by the paired images, the approach could mitigate the negative-finding bias common in radiology VL datasets and improve downstream model robustness. The ablation against random augmentation and the open code release are positive elements that support reproducibility and isolate the contribution of clustering.
major comments (3)
- [Methods] Methods section: the enrichment step inserts sentences from other clusters without any image-conditioned consistency check or contradiction detection against the original report content. Cluster membership alone is treated as sufficient evidence that an added observation is factually safe for the given image; this assumption is load-bearing for the claim that enrichment improves data quality rather than merely increasing lexical diversity.
- [Results] Results / Ablation studies: the comparison to random augmentation rules out non-semantic effects but does not test whether added sentences are image-supported or introduce factual errors. COMET and BERTScore can reward semantic plausibility even when the addition contradicts the image; CheXbert-F1 and RadGraph-F1 capture only a subset of observation errors, leaving the factual accuracy of the enriched reports unverified.
- [GRPO section] GRPO reward design paragraph: the description of how cluster information is incorporated into the reward function is high-level; without explicit formulation or analysis of how cluster-derived rewards interact with image-text alignment, it is unclear whether the additional reported gains (2.78–12.80%) reflect improved factual grounding or simply reinforce cluster-level priors.
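The first major comment mentions contradiction detection against the original report content. A text-only version of such a check could look like the sketch below. This is an illustration of the referee's suggestion under stated assumptions, not something the manuscript is reported to contain: `NEGATION_CUES`, the stopword list, and `contradicts` are hypothetical, and the rule (shared finding term with opposite polarity) is deliberately crude. It does not address the image-conditioned part of the concern.

```python
import re

NEGATION_CUES = {"no", "without", "absent", "negative"}  # illustrative, not exhaustive
STOPWORDS = {"is", "a", "the", "there", "of", "seen"}    # illustrative, not exhaustive

def polarity_and_terms(sentence):
    """Return (is_negated, content_terms) for one sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    negated = any(t in NEGATION_CUES for t in tokens)
    terms = {t for t in tokens if t not in NEGATION_CUES and t not in STOPWORDS}
    return negated, terms

def contradicts(candidate, report):
    """True if the candidate shares a term with a report sentence of opposite polarity."""
    c_neg, c_terms = polarity_and_terms(candidate)
    for sentence in report:
        s_neg, s_terms = polarity_and_terms(sentence)
        if c_neg != s_neg and c_terms & s_terms:
            return True
    return False

# Toy report, invented for illustration only.
report = ["there is a right pleural effusion", "the heart size is normal"]
blocked = contradicts("no pleural effusion", report)      # conflicts with the effusion finding
allowed = contradicts("no pneumothorax is seen", report)  # no shared finding term
```

A filter like this is over-conservative (it would also block "no left pleural effusion" for a report describing a right-sided effusion) and says nothing about whether the sentence matches the image, so it narrows the referee's concern rather than resolving it.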
minor comments (2)
- [Abstract] The clustering algorithm, number of clusters, and exact criteria for labeling sentences as positive/neutral are not stated in the abstract; these details are necessary for reproducibility even if they appear later in the manuscript.
- The anonymous code link is appreciated; ensure the repository is made public and includes the exact clustering and enrichment scripts used for the reported experiments.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment below with clarifications on our design choices and indicate where revisions will be made to improve clarity and transparency.
Point-by-point responses
- Referee: [Methods] Methods section: the enrichment step inserts sentences from other clusters without any image-conditioned consistency check or contradiction detection against the original report content. Cluster membership alone is treated as sufficient evidence that an added observation is factually safe for the given image; this assumption is load-bearing for the claim that enrichment improves data quality rather than merely increasing lexical diversity.
  Authors: We agree that the enrichment procedure relies solely on semantic cluster membership derived from report text, without an explicit image-conditioned consistency check or contradiction detection. This choice is intentional to maintain a fully self-supervised pipeline that requires no additional labels or models. The assumption is that sentences grouped by semantic clustering share compatible observational properties, which is supported by the ablation demonstrating that semantic enrichment outperforms random augmentation on downstream metrics. In the revised manuscript, we will expand the Methods section to explicitly articulate this assumption, its rationale, and its limitations, including the lack of per-instance image verification.
  Revision: yes
- Referee: [Results] Results / Ablation studies: the comparison to random augmentation rules out non-semantic effects but does not test whether added sentences are image-supported or introduce factual errors. COMET and BERTScore can reward semantic plausibility even when the addition contradicts the image; CheXbert-F1 and RadGraph-F1 capture only a subset of observation errors, leaving the factual accuracy of the enriched reports unverified.
  Authors: The referee is correct that our ablations and automatic metrics do not directly confirm image support or rule out factual errors in every enriched sentence. The evaluation strategy relies on consistent improvements in multiple downstream metrics after fine-tuning, with the semantic-versus-random ablation isolating the contribution of clustering. Direct factual verification at scale would require either human review or an auxiliary image-text alignment model, which we deliberately avoided to preserve the self-supervised character of the method. In revision, we will add a limitations subsection that acknowledges the reliance on proxy metrics and the possibility of undetected inconsistencies, while noting that the observed gains across COMET, BERTScore, CheXbert-F1, and RadGraph-F1 provide indirect evidence of net benefit.
  Revision: yes
- Referee: [GRPO section] GRPO reward design paragraph: the description of how cluster information is incorporated into the reward function is high-level; without explicit formulation or analysis of how cluster-derived rewards interact with image-text alignment, it is unclear whether the additional reported gains (2.78–12.80%) reflect improved factual grounding or simply reinforce cluster-level priors.
  Authors: We acknowledge that the current description of the GRPO reward modification is high-level. In the revised manuscript, we will supply the explicit mathematical formulation showing how cluster membership is encoded into the reward signal. We will also add a short analysis discussing the interaction between the cluster-derived term and the base image-text alignment objective, to help distinguish whether the reported gains arise from improved grounding or from reinforcing semantic priors. This addition should address the concern about interpretability of the extra performance.
  Revision: yes
- Direct, instance-level verification that every added sentence is factually supported by its paired image, which would require either large-scale human annotation or an external image-conditioned consistency model outside the self-supervised framework of the present work.
Circularity Check
No significant circularity; empirical gains are measured against external metrics and ablations
Full rationale
The paper presents a self-supervised enrichment procedure that clusters report sentences and augments training reports with positive/neutral sentences drawn from other clusters. Gains are then measured on held-out test sets using standard external metrics (COMET, BERTScore, Sentence BLEU, CheXbert-F1, RadGraph-F1) after supervised fine-tuning and GRPO training, with an ablation confirming that semantic clustering outperforms random augmentation. No equations, fitted parameters, or self-citations appear in the provided text that would reduce the reported improvements to quantities defined by the method's own inputs. The central claims are therefore falsifiable via independent evaluation and the released code, satisfying the criteria for a self-contained, non-circular derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Semantic clustering of report sentences produces groups that correspond to clinically relevant positive or neutral findings suitable for cross-report enrichment.
Reference graph
Works this paper leans on
- [1] Alfarghaly, O., Khaled, R., Elkorany, A., Helal, M., Fahmy, A.: Automated radiology report generation using conditioned transformers. Informatics in Medicine Unlocked 24, 100557 (2021)
- [2] Boecking, B., Usuyama, N., Bannur, S., Castro, D.C., Schwaighofer, A., Hyland, S., Wetscherek, M., Naumann, T., Nori, A., Alvarez-Valle, J., et al.: Making the most of text semantics to improve biomedical vision–language processing. In: European conference on computer vision. pp. 1–21. Springer (2022)
- [3] Bron, C., Kerbosch, J.: Algorithm 457: finding all cliques of an undirected graph. Communications of the ACM 16(9), 575–577 (1973)
- [4] Chen, X., Lai, Z., Ruan, K., Chen, S., Liu, J., Liu, Z.: R-llava: Improving med-vqa understanding through visual region of interest. arXiv preprint arXiv:2410.20327 (2024)
- [5] Chen, Z., Song, Y., Chang, T.H., Wan, X.: Generating radiology reports via memory-driven transformer. arXiv preprint arXiv:2010.16056 (2020)
- [6] Deperrois, N., Matsuo, H., Ruipérez-Campillo, S., Vandenhirtz, M., Laguna, S., Ryser, A., Fujimoto, K., Nishio, M., Sutter, T.M., Vogt, J.E., et al.: Radvlm: A multitask conversational vision-language model for radiology. arXiv preprint arXiv:2502.03333 (2025)
- [7] Gai, X., Liu, J., Li, Y., Meng, Z., Wu, J., Liu, Z.: 3d-rad: A comprehensive 3d radiology med-vqa dataset with multi-temporal analysis and diverse diagnostic tasks. arXiv preprint arXiv:2506.11147 (2025)
- [8] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI conference on artificial intelligence. vol. 33, pp. 590–597 (2019)
- [9] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)
- [10] Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Mark, R.G., Horng, S.: Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data 6(1), 317 (2019)
- [11] Karp, R.M.: Reducibility among combinatorial problems. In: 50 Years of Integer Programming 1958-2008: from the Early Years to the State-of-the-Art, pp. 219–241. Springer (2009)
- [12] Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific data 5(1), 1–10 (2018)
- [13] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems 36, 28541–28564 (2023)
- [14] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)
- [15] Lin, W., Zhao, Z., Zhang, X., Wu, C., Zhang, Y., Wang, Y., Xie, W.: Pmc-clip: Contrastive language-image pre-training using biomedical documents. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 525–536. Springer (2023)
- [16] Moon, J.W., Moser, L.: On cliques in graphs. Israel journal of Mathematics 3(1), 23–28 (1965)
- [17] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
- [18] Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 (2019)
- [19] Tanno, R., Barrett, D.G., Sellergren, A., Ghaisas, S., Dathathri, S., See, A., Welbl, J., Lau, C., Tu, T., Azizi, S., et al.: Collaboration between clinicians and vision–language models in radiology report generation. Nature Medicine 31(2), 599–608 (2025)
- [20] Tomita, E., Tanaka, A., Takahashi, H.: The worst-case time complexity for generating all maximal cliques and computational experiments. Theoretical computer science 363(1), 28–42 (2006)
- [21] Tsukiyama, S., Ide, M., Ariyoshi, H., Shirakawa, I.: A new algorithm for generating all the maximal independent sets. SIAM Journal on Computing 6(3), 505–517 (1977)
- [22] Tu, T., Azizi, S., Driess, D., Schaekermann, M., Amin, M., Chang, P.C., Carroll, A., Lau, C., Tanno, R., Ktena, I., et al.: Towards generalist biomedical ai. NEJM AI 1(3), AIoa2300138 (2024)
- [23] Wan, Z., Liu, C., Zhang, M., Fu, J., Wang, B., Cheng, S., Ma, L., Quilodrán-Casas, C., Arcucci, R.: Med-unic: Unifying cross-lingual medical vision-language pre-training by diminishing bias. Advances in Neural Information Processing Systems 36, 56186–56197 (2023)
- [24] Wu, C., Lin, W., Zhang, X., Zhang, Y., Xie, W., Wang, Y.: Pmc-llama: toward building open-source language models for medicine. Journal of the American Medical Informatics Association 31(9), 1833–1843 (2024)
- [25] Xia, P., Zhu, K., Li, H., Wang, T., Shi, W., Wang, S., Zhang, L., Zou, J., Yao, H.: Mmed-rag: Versatile multimodal rag system for medical vision language models. arXiv preprint arXiv:2410.13085 (2024)
- [26] Zhang, X., Acosta, J.N., Miller, J., Huang, O., Rajpurkar, P.: Rexgradient-160k: A large-scale publicly available dataset of chest radiographs with free-text reports. arXiv preprint arXiv:2505.00228 (2025)
- [27] Zhang, X., Wu, C., Zhao, Z., Lin, W., Zhang, Y., Wang, Y., Xie, W.: Development of a large-scale medical visual question-answering dataset. Communications Medicine 4(1), 277 (2024)
A Appendix, A.1 Cluster Examples: More examples of semantic clusters are provided. Cluster 5: Pelvic Phleboliths “bilateral pelvic phleboliths.” / “pelvic calcificati...