RadSEM: A Finding-by-Finding Metric for Clinical Consistency in Radiology Reports
Pith reviewed 2026-06-28 02:45 UTC · model grok-4.3
The pith
RadSEM rewrites radiology reports into atomic findings and scores them with contradiction-constrained matching to measure clinical consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RadSEM rewrites reference and generated reports into ordered atomic finding sentences, each expressing one site-finding proposition. It then performs contradiction-constrained many-to-many matching: incompatible pairs such as effusion and no effusion receive no credit, while compatible granularity differences can receive partial credit. A deterministic stage weights pairs by part-whole and abnormal-detail relationships, counts unmatched findings, and produces an abnormal-focused weighted F1 score. On SSREE, the method achieves Kendall tau_b of 0.957, all-pairs concordance of 97.8 percent, adjacent concordance of 95.0 percent, and strict five-level ordering for 81.9 percent of reports, outperforming radiology-specific and general text metrics.
What carries the argument
Contradiction-constrained many-to-many matching of ordered atomic finding sentences, followed by deterministic part-whole weighting and abnormal-focused F1 scoring.
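The matching-then-scoring idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper or its code release: the `Finding` fields, the 0.5 partial credit for part-whole pairs, the 0.25 weight for normal findings, and the best-match simplification of many-to-many alignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    site: str       # anatomical site, e.g. "left pleural space"
    finding: str    # finding name, e.g. "effusion"
    present: bool   # False encodes a negated statement ("no effusion")
    abnormal: bool  # True when the sentence asserts an abnormality


def pair_credit(a, b, part_of):
    """Assumed credit in [0, 1] for one reference/generated pair."""
    if a.finding != b.finding or a.present != b.present:
        return 0.0  # different finding or contradictory polarity: no credit
    if a.site == b.site:
        return 1.0
    # Part-whole granularity difference: partial credit (0.5 is an assumption).
    if part_of.get(a.site) == b.site or part_of.get(b.site) == a.site:
        return 0.5
    return 0.0  # unrelated sites, including laterality mismatches


def side_score(items, others, part_of, normal_weight):
    """Weighted share of `items` supported by their best match in `others`."""
    total = credit = 0.0
    for item in items:
        weight = 1.0 if item.abnormal else normal_weight  # abnormal-focused
        best = max((pair_credit(item, o, part_of) for o in others), default=0.0)
        total += weight
        credit += weight * best
    return credit / total if total else 0.0


def weighted_f1(refs, gens, part_of, normal_weight=0.25):
    recall = side_score(refs, gens, part_of, normal_weight)
    precision = side_score(gens, refs, part_of, normal_weight)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


part_of = {"left pleural space": "pleural space"}  # hypothetical part -> whole map
normal_heart = Finding("heart", "enlargement", present=False, abnormal=False)
reference = [Finding("left pleural space", "effusion", True, True), normal_heart]
coarse = [Finding("pleural space", "effusion", True, True), normal_heart]
inverted = [Finding("left pleural space", "effusion", False, False), normal_heart]

for name, generated in [("exact", reference), ("coarse", coarse), ("inverted", inverted)]:
    print(name, round(weighted_f1(reference, generated, part_of), 3))
```

In this toy case the exact report scores highest, the coarser but compatible report scores lower, and the polarity-inverted report scores lowest, which is the ordering the contradiction constraint is meant to enforce.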
If this is right
- Generated reports can be ranked by clinical accuracy instead of surface lexical overlap.
- Polarity inversions and laterality errors receive zero credit rather than partial lexical credit.
- Compatible findings at different levels of detail receive partial credit when they align clinically.
- The final score remains deterministic after the rewriting step, supporting interpretability of per-finding alignments.
- The metric distinguishes clinical compatibility from surface similarity on controlled monotonicity tests.
Where Pith is reading between the lines
- The atomic-finding representation could support automated localization of specific clinical errors in generated reports.
- Similar rewriting-plus-matching pipelines might apply to other clinical document types where propositional consistency is required.
- The approach may generalize to non-radiology domains that need fine-grained semantic matching with explicit contradiction handling.
- Performance on real-world clinical variation outside the controlled SSREE corruption levels would test robustness to natural report phrasing.
Load-bearing premise
The LLM rewriting step produces accurate, consistent atomic finding sentences that faithfully capture clinical propositions without introducing or omitting key details.
What would settle it
A collection of radiology reports in which the LLM rewriting step alters a negation, laterality, or site detail, causing the subsequent matching and F1 score to diverge from clinical judgment.
Original abstract
Radiology report evaluation must distinguish clinical compatibility from surface similarity, because negation, laterality, or normal-abnormal polarity can reverse a finding. We propose RadSEM (Radiology Sentence-Level Evaluation Metric), a constrained LLM-assisted metric for reference-based evaluation of radiology Findings. RadSEM rewrites reference and generated reports into ordered atomic finding sentences, each expressing one site-finding proposition. It then performs contradiction-constrained many-to-many matching: incompatible pairs such as "effusion" and "no effusion" receive no credit, while compatible granularity differences can receive partial credit. A deterministic stage weights pairs by part-whole and abnormal-detail relationships, counts unmatched findings, and produces an abnormal-focused weighted F1 score. Thus, the LLM supports structured rewriting and local alignment rather than acting as an opaque judge. We evaluate RadSEM with SSREE, a controlled monotonicity stress test built from 2,448 de-identified reports expanded into five graded corruption levels. RadSEM achieves Kendall tau_b of 0.957, all-pairs concordance of 97.8%, adjacent concordance of 95.0%, and strict five-level ordering for 81.9% of reports, outperforming radiology-specific and general text metrics while avoiding the failure in which polarity-inverted reports regain lexical overlap. On the same SSREE set, RadSEM outperforms the Ref-anchored RadSEM-Alt policy, improving adjacent concordance from 90.7% to 95.0% and strict ordering from 67.2% to 81.9%. On a 599-triplet synonym/antonym subset, RadSEM prefers synonyms in 597 cases (99.67%). These results suggest that explicit finding units, contradiction-aware matching, and abnormal-focused deterministic scoring make report scoring more interpretable and sensitive to clinically meaningful errors. Code is available at https://github.com/jdh-algo/RadSEM.
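The monotonicity statistics named in the abstract can be sketched as follows. This is a hedged illustration, not the authors' evaluation code: it assumes each report contributes a list of metric scores ordered from least to most corrupted, that a pair counts as concordant only when the less corrupted version scores strictly higher, and that tau_b is computed per report. How the paper aggregates tau_b across reports is not stated in the text above.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau_b with the standard tie correction, in pure Python."""
    conc = disc = tie_x = tie_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: drops out of both factors
        if dx == 0:
            tie_x += 1
        elif dy == 0:
            tie_y += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    denom = sqrt((conc + disc + tie_x) * (conc + disc + tie_y))
    return (conc - disc) / denom if denom else 0.0  # 0.0 is a convention here


def monotonicity(scores_by_report):
    """Each inner list holds scores ordered from least to most corrupted."""
    all_ok = all_n = adj_ok = adj_n = strict = 0
    for s in scores_by_report:
        pairs = list(combinations(range(len(s)), 2))
        all_ok += sum(s[i] > s[j] for i, j in pairs)
        all_n += len(pairs)
        adjacent = [s[i] > s[i + 1] for i in range(len(s) - 1)]
        adj_ok += sum(adjacent)
        adj_n += len(adjacent)
        strict += all(adjacent)  # every level strictly below the previous one
    return {
        "all_pairs": all_ok / all_n,
        "adjacent": adj_ok / adj_n,
        "strict": strict / len(scores_by_report),
    }


# Toy scores for two reports at five corruption levels (illustrative only).
toy = [[0.9, 0.7, 0.5, 0.3, 0.1], [0.9, 0.7, 0.7, 0.3, 0.1]]
quality = [4, 3, 2, 1, 0]  # higher means less corrupted
print(monotonicity(toy))
print([round(kendall_tau_b(quality, s), 3) for s in toy])
```

The second toy report contains one tie between adjacent levels, so it lowers all three concordance figures and its tau_b falls below 1, which shows why strict ordering is the most demanding of the reported statistics.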
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RadSEM, a reference-based metric for radiology report evaluation. It rewrites reference and generated reports into ordered atomic finding sentences via LLM, performs contradiction-constrained many-to-many matching (no credit for incompatible pairs like 'effusion' vs. 'no effusion'), applies deterministic weighting by part-whole and abnormal-detail relations, and computes an abnormal-focused weighted F1. On the author-built SSREE stress test (2,448 reports expanded to five graded corruption levels), it reports Kendall tau_b of 0.957, 97.8% all-pairs concordance, 95.0% adjacent concordance, and 81.9% strict five-level ordering, outperforming other metrics and handling polarity inversions; it also shows 99.67% synonym preference on a 599-triplet subset.
Significance. If the atomic-sentence fidelity and benchmark independence hold, RadSEM could provide a more clinically interpretable alternative to lexical or embedding-based metrics by making explicit use of finding-level propositions, contradiction awareness, and abnormal focus. The public code release supports reproducibility.
major comments (3)
- [Abstract / Methods] Abstract and Methods (LLM rewriting step): The pipeline's first stage rewrites reports into atomic finding sentences, yet no human validation, inter-annotator agreement, or error analysis is reported for this step; because all downstream matching, weighting, and F1 scoring (and thus the SSREE metrics) depend on these sentences faithfully capturing clinical propositions without omission or polarity errors, the absence of verification is load-bearing for the performance claims.
- [Evaluation / SSREE] Evaluation section (SSREE construction): SSREE is built by expanding the same 2,448 de-identified reports into graded corruption levels; this author-constructed design creates dependence between metric definition and benchmark, so the reported Kendall tau_b 0.957 and 81.9% strict ordering may partly reflect that dependence rather than independent clinical fidelity.
- [Results] Results (599-triplet subset): The 99.67% synonym preference is computed after the same LLM rewriting; without separate validation that the atomic sentences preserve clinical granularity and polarity on this subset, the result does not independently confirm superiority over baselines.
minor comments (1)
- [Abstract] Abstract: the GitHub link is given but no license, version, or exact reproduction instructions are stated.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We respond to each major comment below and indicate where revisions will be made.
point-by-point responses
- Referee: [Abstract / Methods] Abstract and Methods (LLM rewriting step): The pipeline's first stage rewrites reports into atomic finding sentences, yet no human validation, inter-annotator agreement, or error analysis is reported for this step; because all downstream matching, weighting, and F1 scoring (and thus the SSREE metrics) depend on these sentences faithfully capturing clinical propositions without omission or polarity errors, the absence of verification is load-bearing for the performance claims.
  Authors: We agree that the lack of direct validation for the LLM rewriting step is a limitation, as the performance metrics depend on the quality of these atomic sentences. While the SSREE results provide indirect support through the metric's sensitivity to clinical corruptions, we will add an error analysis section in the revised manuscript. This will include manual inspection of a random sample of rewritten sentences to report rates of omission, addition, and polarity errors.
  Revision: partial
- Referee: [Evaluation / SSREE] Evaluation section (SSREE construction): SSREE is built by expanding the same 2,448 de-identified reports into graded corruption levels; this author-constructed design creates dependence between metric definition and benchmark, so the reported Kendall tau_b 0.957 and 81.9% strict ordering may partly reflect that dependence rather than independent clinical fidelity.
  Authors: The referee correctly identifies a potential circularity in the evaluation design. The corruption types were chosen to reflect clinically relevant errors (polarity, laterality, granularity) prior to metric development, but using the same report corpus does introduce dependence. We will revise the manuscript to explicitly discuss this limitation and emphasize that future work should validate on independently collected reports or external benchmarks.
  Revision: yes
- Referee: [Results] Results (599-triplet subset): The 99.67% synonym preference is computed after the same LLM rewriting; without separate validation that the atomic sentences preserve clinical granularity and polarity on this subset, the result does not independently confirm superiority over baselines.
  Authors: We concur that this result shares the same dependency on the rewriting step as the main SSREE evaluation. We will update the results section to note this caveat and incorporate the synonym preference evaluation into the planned error analysis of the rewriting step.
  Revision: partial
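The manual error analysis promised in the responses could be tallied along the following lines. This is only a sketch under stated assumptions: the label names (`ok`, `omission`, `addition`, `polarity`), the helper names, and the choice of a Wilson score interval are all hypothetical, and the labels shown are placeholders, not results from the paper.

```python
import random
from collections import Counter
from math import sqrt


def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for k errors in n inspected items."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def draw_sample(sentence_ids, size, seed=0):
    """Seeded random sample so the inspected subset can be reproduced."""
    rng = random.Random(seed)
    return rng.sample(sentence_ids, min(size, len(sentence_ids)))


def error_rates(labels, kinds=("omission", "addition", "polarity")):
    """Map each error kind to (observed rate, Wilson interval)."""
    counts = Counter(labels)
    n = len(labels)
    return {
        kind: (counts[kind] / n if n else 0.0, wilson_interval(counts[kind], n))
        for kind in kinds
    }


# Placeholder annotations for 100 inspected sentences (not real data).
placeholder_labels = ["ok"] * 95 + ["polarity"] * 3 + ["omission"] * 2
inspected = draw_sample(list(range(1000)), size=100, seed=42)
for kind, (rate, (low, high)) in error_rates(placeholder_labels).items():
    print(f"{kind}: rate={rate:.3f}, interval=({low:.3f}, {high:.3f})")
```

Reporting an interval alongside each rate makes clear how much a small inspected sample can actually say about rare polarity errors.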
Circularity Check
No significant circularity; metric definition independent of benchmark
full rationale
The paper defines RadSEM via an explicit pipeline (LLM rewriting to atomic sentences, contradiction-constrained matching, part-whole weighting, abnormal-focused F1) that does not reference or depend on SSREE construction details. SSREE is used solely for post-hoc evaluation on 2,448 expanded reports; the metric equations and rules are stated independently and contain no fitted parameters, self-definitional loops, or self-citation chains that reduce the claimed performance to the input benchmark by construction. No equations or load-bearing steps in the abstract or described text exhibit the required reduction (e.g., prediction = fit). This is a standard proposal-plus-benchmark case with no circularity per the enumerated patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: An LLM can reliably rewrite radiology reports into ordered atomic finding sentences that preserve clinical propositions without distortion or omission.