
arxiv: 2606.17062 · v1 · pith:RTAPK6C7new · submitted 2026-06-03 · 🧬 q-bio.QM · cs.LG

RadSEM: A Finding-by-Finding Metric for Clinical Consistency in Radiology Reports

Pith reviewed 2026-06-28 02:45 UTC · model grok-4.3

classification 🧬 q-bio.QM cs.LG
keywords radiology report evaluation · clinical consistency · sentence-level metric · LLM-assisted evaluation · finding matching · contradiction detection · weighted F1 score · monotonicity test

The pith

RadSEM rewrites radiology reports into atomic findings and scores them with contradiction-constrained matching to measure clinical consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RadSEM as a reference-based metric that first rewrites both reference and generated radiology reports into ordered lists of atomic finding sentences, each stating one site-finding proposition. It then applies many-to-many matching that withholds credit from contradictory pairs such as a finding and its negation, while allowing partial credit for compatible differences in granularity. A deterministic weighting stage accounts for part-whole relations and abnormal-detail priorities before computing an abnormal-focused weighted F1 score. Evaluated on the SSREE stress test of 2,448 reports expanded to five graded corruption levels, RadSEM reaches Kendall tau_b of 0.957, 97.8 percent all-pairs concordance, and strict five-level ordering on 81.9 percent of cases while correctly penalizing polarity inversions that lexical metrics overlook.

Core claim

RadSEM rewrites reference and generated reports into ordered atomic finding sentences, each expressing one site-finding proposition. It then performs contradiction-constrained many-to-many matching: incompatible pairs such as effusion and no effusion receive no credit, while compatible granularity differences can receive partial credit. A deterministic stage weights pairs by part-whole and abnormal-detail relationships, counts unmatched findings, and produces an abnormal-focused weighted F1 score. On SSREE, the method achieves Kendall tau_b of 0.957, all-pairs concordance of 97.8 percent, adjacent concordance of 95.0 percent, and strict five-level ordering for 81.9 percent of reports, outperforming radiology-specific and general text metrics.

What carries the argument

Contradiction-constrained many-to-many matching of ordered atomic finding sentences, followed by deterministic part-whole weighting and abnormal-focused F1 scoring.
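As a rough illustration of how these pieces could fit together, the sketch below scores two toy reports. It is not the authors' implementation: the `Finding` fields, the `PART_OF` table, the 0.5 partial credit, and the 2.0 abnormal weight are all hypothetical choices made for this example, and the paper's actual matching rules and weights may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    site: str       # anatomical site, e.g. "left lung"
    finding: str    # observation, e.g. "effusion"
    present: bool   # False encodes a negated finding ("no effusion")
    abnormal: bool  # True when the sentence asserts an abnormality


# Hypothetical part-whole table (child site -> parent site).
PART_OF = {"left lower lobe": "left lung"}


def pair_credit(a: Finding, b: Finding) -> float:
    """Symmetric credit in [0, 1] for one pair; the values are illustrative."""
    if a.finding != b.finding:
        return 0.0
    same_site = a.site == b.site
    part_whole = PART_OF.get(a.site) == b.site or PART_OF.get(b.site) == a.site
    if not (same_site or part_whole):
        return 0.0
    if a.present != b.present:
        return 0.0  # contradiction: a finding and its negation earn nothing
    return 1.0 if same_site else 0.5  # partial credit across granularity


def weight(f: Finding) -> float:
    return 2.0 if f.abnormal else 1.0  # assumed abnormal-focused weight


def coverage(items, others) -> float:
    """Weighted share of `items` supported by their best match in `others`."""
    total = sum(weight(f) for f in items)
    if total == 0:
        return 0.0
    got = sum(
        weight(f) * max((pair_credit(f, o) for o in others), default=0.0)
        for f in items
    )
    return got / total


def weighted_f1(reference, generated) -> float:
    recall = coverage(reference, generated)
    precision = coverage(generated, reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reference = [
    Finding("left pleural space", "effusion", True, True),
    Finding("left lower lobe", "opacity", True, True),
    Finding("heart", "enlargement", False, False),
]
generated = [
    Finding("left pleural space", "effusion", False, False),  # polarity flipped
    Finding("left lung", "opacity", True, True),              # coarser site
    Finding("heart", "enlargement", False, False),
]
score = weighted_f1(reference, generated)
print(round(score, 3))  # prints 0.444
```

In this toy case the flipped effusion contributes nothing to either precision or recall, the coarser opacity site earns half credit, and the matching normal statement earns full credit at the lower weight.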

If this is right

  • Generated reports can be ranked by clinical accuracy instead of surface lexical overlap.
  • Polarity inversions and laterality errors receive zero credit rather than partial lexical credit.
  • Compatible findings at different levels of detail receive partial credit when they align clinically.
  • The final score remains deterministic after the rewriting step, supporting interpretability of per-finding alignments.
  • The metric distinguishes clinical compatibility from surface similarity on controlled monotonicity tests.
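The contrast with lexical overlap in the points above can be made concrete with a small sketch. The token-level F1 below is a generic bag-of-words overlap written for this illustration, not any specific baseline from the paper, and the example sentences are invented.

```python
from collections import Counter


def token_f1(reference: str, generated: str) -> float:
    """Bag-of-words F1 over lowercase whitespace tokens."""
    ref = Counter(reference.lower().split())
    gen = Counter(generated.lower().split())
    overlap = sum((ref & gen).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "there is a left pleural effusion"
negated = "there is no left pleural effusion"      # polarity inversion
wrong_side = "there is a right pleural effusion"   # laterality error

print(round(token_f1(reference, negated), 3))     # prints 0.833
print(round(token_f1(reference, wrong_side), 3))  # prints 0.833
```

Both clinically reversed sentences share five of six tokens with the reference, so a purely lexical score stays high, whereas a contradiction-constrained matcher would assign these pairs no credit.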

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The atomic-finding representation could support automated localization of specific clinical errors in generated reports.
  • Similar rewriting-plus-matching pipelines might apply to other clinical document types where propositional consistency is required.
  • The approach may generalize to non-radiology domains that need fine-grained semantic matching with explicit contradiction handling.
  • Performance on real-world clinical variation outside the controlled SSREE corruption levels would test robustness to natural report phrasing.

Load-bearing premise

The LLM rewriting step produces accurate, consistent atomic finding sentences that faithfully capture clinical propositions without introducing or omitting key details.

What would settle it

A collection of radiology reports in which the LLM rewriting step alters a negation, laterality, or site detail, causing the subsequent matching and F1 score to diverge from clinical judgment.
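One inexpensive way to look for such cases is a cue-based screen that flags rewrites whose negation or laterality cues differ from the source sentence, so they can be sent for manual review. The sketch below is a crude heuristic under stated assumptions: the cue lists are hypothetical and incomplete, the function names are invented here, and a flag indicates a candidate for inspection rather than a confirmed error.

```python
import re

# Hypothetical, deliberately small cue vocabularies.
NEGATION_CUES = {"no", "not", "without", "absent"}
LATERALITY_CUES = {"left", "right", "bilateral"}


def cues(text: str, vocab: set) -> set:
    """Return the cue words from `vocab` that occur in `text`."""
    return {tok for tok in re.findall(r"[a-z]+", text.lower()) if tok in vocab}


def rewrite_drift(original: str, atomic_sentences: list) -> list:
    """Flag cue-level differences between a source sentence and its rewrite."""
    rewritten = " ".join(atomic_sentences)
    flags = []
    # Negation: compare presence only, since "without" -> "no" is a fair rewrite.
    if bool(cues(original, NEGATION_CUES)) != bool(cues(rewritten, NEGATION_CUES)):
        flags.append("negation")
    # Laterality: the set of sides mentioned should be preserved.
    if cues(original, LATERALITY_CUES) != cues(rewritten, LATERALITY_CUES):
        flags.append("laterality")
    return flags


print(rewrite_drift("No pleural effusion on the left.",
                    ["Left pleural effusion is present."]))   # ['negation']
print(rewrite_drift("Small right pleural effusion.",
                    ["Small left pleural effusion."]))        # ['laterality']
print(rewrite_drift("Lungs are clear without effusion.",
                    ["The lungs are clear.", "No pleural effusion."]))  # []
```

A screen like this cannot detect omissions or subtler site changes, so it would complement, not replace, a human audit of sampled rewrites.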

Original abstract

Radiology report evaluation must distinguish clinical compatibility from surface similarity, because negation, laterality, or normal-abnormal polarity can reverse a finding. We propose RadSEM (Radiology Sentence-Level Evaluation Metric), a constrained LLM-assisted metric for reference-based evaluation of radiology Findings. RadSEM rewrites reference and generated reports into ordered atomic finding sentences, each expressing one site-finding proposition. It then performs contradiction-constrained many-to-many matching: incompatible pairs such as "effusion" and "no effusion" receive no credit, while compatible granularity differences can receive partial credit. A deterministic stage weights pairs by part-whole and abnormal-detail relationships, counts unmatched findings, and produces an abnormal-focused weighted F1 score. Thus, the LLM supports structured rewriting and local alignment rather than acting as an opaque judge. We evaluate RadSEM with SSREE, a controlled monotonicity stress test built from 2,448 de-identified reports expanded into five graded corruption levels. RadSEM achieves Kendall tau_b of 0.957, all-pairs concordance of 97.8%, adjacent concordance of 95.0%, and strict five-level ordering for 81.9% of reports, outperforming radiology-specific and general text metrics while avoiding the failure in which polarity-inverted reports regain lexical overlap. On the same SSREE set, RadSEM outperforms the Ref-anchored RadSEM-Alt policy, improving adjacent concordance from 90.7% to 95.0% and strict ordering from 67.2% to 81.9%. On a 599-triplet synonym/antonym subset, RadSEM prefers synonyms in 597 cases (99.67%). These results suggest that explicit finding units, contradiction-aware matching, and abnormal-focused deterministic scoring make report scoring more interpretable and sensitive to clinically meaningful errors. Code is available at https://github.com/jdh-algo/RadSEM.
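To make the monotonicity statistics named in the abstract concrete, the sketch below computes them for a single report scored at five corruption levels. The scores are invented, the function names are hypothetical, and the definitions used here (a pair counts as concordant when the less corrupted version scores strictly higher) are an assumption; the paper's exact definitions and its aggregation across the 2,448 reports may differ.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y) -> float:
    """Kendall tau-b with the standard tie correction."""
    conc = disc = ties_x = ties_y = n_pairs = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        n_pairs += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            conc += 1
        elif dx * dy < 0:
            disc += 1
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (conc - disc) / denom if denom else 0.0


def monotonicity_stats(scores):
    """`scores[k]` is the metric value at corruption level k (0 = clean)."""
    n = len(scores)
    quality = [n - 1 - k for k in range(n)]  # higher = less corrupted
    pairs = list(combinations(range(n), 2))
    all_pairs = sum(scores[i] > scores[j] for i, j in pairs) / len(pairs)
    adjacent = sum(scores[k] > scores[k + 1] for k in range(n - 1)) / (n - 1)
    strict = all(scores[k] > scores[k + 1] for k in range(n - 1))
    return {
        "tau_b": kendall_tau_b(quality, scores),
        "all_pairs_concordance": all_pairs,
        "adjacent_concordance": adjacent,
        "strict_ordering": strict,
    }


# Invented scores with one inversion between levels 1 and 2.
stats = monotonicity_stats([0.95, 0.80, 0.82, 0.40, 0.10])
print(stats)
```

With one swapped adjacent pair, this toy report gives tau_b of 0.8, all-pairs concordance of 0.9, adjacent concordance of 0.75, and fails strict ordering, which shows why the strict five-level figure is the most demanding of the four.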

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes RadSEM, a reference-based metric for radiology report evaluation. It rewrites reference and generated reports into ordered atomic finding sentences via LLM, performs contradiction-constrained many-to-many matching (no credit for incompatible pairs like 'effusion' vs. 'no effusion'), applies deterministic weighting by part-whole and abnormal-detail relations, and computes an abnormal-focused weighted F1. On the author-built SSREE stress test (2,448 reports expanded to five graded corruption levels), it reports Kendall tau_b of 0.957, 97.8% all-pairs concordance, 95.0% adjacent concordance, and 81.9% strict five-level ordering, outperforming other metrics and handling polarity inversions; it also shows 99.67% synonym preference on a 599-triplet subset.

Significance. If the atomic-sentence fidelity and benchmark independence hold, RadSEM could provide a more clinically interpretable alternative to lexical or embedding-based metrics by making explicit use of finding-level propositions, contradiction awareness, and abnormal focus. The public code release supports reproducibility.

major comments (3)
  1. [Abstract / Methods] Abstract and Methods (LLM rewriting step): The pipeline's first stage rewrites reports into atomic finding sentences, yet no human validation, inter-annotator agreement, or error analysis is reported for this step; because all downstream matching, weighting, and F1 scoring (and thus the SSREE metrics) depend on these sentences faithfully capturing clinical propositions without omission or polarity errors, the absence of verification is load-bearing for the performance claims.
  2. [Evaluation / SSREE] Evaluation section (SSREE construction): SSREE is built by expanding the same 2,448 de-identified reports into graded corruption levels; this author-constructed design creates dependence between metric definition and benchmark, so the reported Kendall tau_b 0.957 and 81.9% strict ordering may partly reflect that dependence rather than independent clinical fidelity.
  3. [Results] Results (599-triplet subset): The 99.67% synonym preference is computed after the same LLM rewriting; without separate validation that the atomic sentences preserve clinical granularity and polarity on this subset, the result does not independently confirm superiority over baselines.
minor comments (1)
  1. [Abstract] Abstract: the GitHub link is given but no license, version, or exact reproduction instructions are stated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major comment below and indicate where revisions will be made.

  1. Referee: [Abstract / Methods] Abstract and Methods (LLM rewriting step): The pipeline's first stage rewrites reports into atomic finding sentences, yet no human validation, inter-annotator agreement, or error analysis is reported for this step; because all downstream matching, weighting, and F1 scoring (and thus the SSREE metrics) depend on these sentences faithfully capturing clinical propositions without omission or polarity errors, the absence of verification is load-bearing for the performance claims.

    Authors: We agree that the lack of direct validation for the LLM rewriting step is a limitation, as the performance metrics depend on the quality of these atomic sentences. While the SSREE results provide indirect support through the metric's sensitivity to clinical corruptions, we will add an error analysis section in the revised manuscript. This will include manual inspection of a random sample of rewritten sentences to report rates of omission, addition, and polarity errors. revision: partial

  2. Referee: [Evaluation / SSREE] Evaluation section (SSREE construction): SSREE is built by expanding the same 2,448 de-identified reports into graded corruption levels; this author-constructed design creates dependence between metric definition and benchmark, so the reported Kendall tau_b 0.957 and 81.9% strict ordering may partly reflect that dependence rather than independent clinical fidelity.

    Authors: The referee correctly identifies a potential circularity in the evaluation design. The corruption types were chosen to reflect clinically relevant errors (polarity, laterality, granularity) prior to metric development, but using the same report corpus does introduce dependence. We will revise the manuscript to explicitly discuss this limitation and emphasize that future work should validate on independently collected reports or external benchmarks. revision: yes

  3. Referee: [Results] Results (599-triplet subset): The 99.67% synonym preference is computed after the same LLM rewriting; without separate validation that the atomic sentences preserve clinical granularity and polarity on this subset, the result does not independently confirm superiority over baselines.

    Authors: We concur that this result shares the same dependency on the rewriting step as the main SSREE evaluation. We will update the results section to note this caveat and incorporate the synonym preference evaluation into the planned error analysis of the rewriting step. revision: partial

Circularity Check

0 steps flagged

No significant circularity; metric definition independent of benchmark


The paper defines RadSEM via an explicit pipeline (LLM rewriting to atomic sentences, contradiction-constrained matching, part-whole weighting, abnormal-focused F1) that does not reference or depend on SSREE construction details. SSREE is used solely for post-hoc evaluation on 2,448 expanded reports; the metric equations and rules are stated independently and contain no fitted parameters, self-definitional loops, or self-citation chains that reduce the claimed performance to the input benchmark by construction. No equations or load-bearing steps in the abstract or described text exhibit the required reduction (e.g., prediction = fit). This is a standard proposal-plus-benchmark case with no circularity per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no free parameters or invented entities. The primary unverified premise is the reliability of the LLM rewriting component.

axioms (1)
  • domain assumption: An LLM can reliably rewrite radiology reports into ordered atomic finding sentences that preserve clinical propositions without distortion or omission.
    The method depends on this capability for the initial rewriting and alignment steps.

pith-pipeline@v0.9.1-grok · 5899 in / 1289 out tokens · 77916 ms · 2026-06-28T02:45:23.862159+00:00 · methodology

