RadSEM: A Finding-by-Finding Metric for Clinical Consistency in Radiology Reports

Jintao Fei; Jun Xu; Jun Zhao; Shichao Quan; Wen Tang; Zhenhong Yang; Zhuoyun Liu

arxiv: 2606.17062 · v1 · pith:RTAPK6C7new · submitted 2026-06-03 · 🧬 q-bio.QM · cs.LG

Zhenhong Yang, Zhuoyun Liu, Jintao Fei, Wen Tang, Shichao Quan, Jun Zhao, Jun Xu

Pith reviewed 2026-06-28 02:45 UTC · model grok-4.3

classification 🧬 q-bio.QM cs.LG

keywords radiology report evaluation · clinical consistency · sentence-level metric · LLM-assisted evaluation · finding matching · contradiction detection · weighted F1 score · monotonicity test

The pith

RadSEM rewrites radiology reports into atomic findings and scores them with contradiction-constrained matching to measure clinical consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RadSEM as a reference-based metric that first rewrites both reference and generated radiology reports into ordered lists of atomic finding sentences, each stating one site-finding proposition. It then applies many-to-many matching that withholds credit from contradictory pairs such as a finding and its negation, while allowing partial credit for compatible differences in granularity. A deterministic weighting stage accounts for part-whole relations and abnormal-detail priorities before computing an abnormal-focused weighted F1 score. Evaluated on the SSREE stress test of 2,448 reports expanded to five graded corruption levels, RadSEM reaches Kendall tau_b of 0.957, 97.8 percent all-pairs concordance, and strict five-level ordering on 81.9 percent of cases while correctly penalizing polarity inversions that lexical metrics overlook.

Core claim

RadSEM rewrites reference and generated reports into ordered atomic finding sentences, each expressing one site-finding proposition. It then performs contradiction-constrained many-to-many matching: incompatible pairs such as effusion and no effusion receive no credit, while compatible granularity differences can receive partial credit. A deterministic stage weights pairs by part-whole and abnormal-detail relationships, counts unmatched findings, and produces an abnormal-focused weighted F1 score. On SSREE, the method achieves Kendall tau_b of 0.957, all-pairs concordance of 97.8 percent, adjacent concordance of 95.0 percent, and strict five-level ordering for 81.9 percent of reports, outperforming radiology-specific and general text metrics.
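
The matching and scoring stages described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Finding` fields, the `PART_OF` table, the credit values (1.0, 0.5, 0.0), the abnormal/normal weights, and the best-match rule used to approximate many-to-many matching are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    site: str          # anatomical site, e.g. "left lower lobe"
    observation: str   # e.g. "effusion"
    present: bool      # False encodes a negated finding ("no effusion")
    abnormal: bool     # True when the proposition asserts an abnormality


# Hypothetical part-whole table (whole -> parts); not taken from the paper.
PART_OF = {"left lung": {"left lower lobe", "left upper lobe"}}


def pair_credit(a: Finding, b: Finding) -> float:
    """Assumed credit rule: 0 for contradictions, partial for granularity."""
    if a.observation != b.observation:
        return 0.0
    if a.present != b.present:
        return 0.0  # contradiction such as "effusion" vs "no effusion"
    if a.site == b.site:
        return 1.0
    if b.site in PART_OF.get(a.site, set()) or a.site in PART_OF.get(b.site, set()):
        return 0.5  # compatible difference in granularity
    return 0.0      # unrelated site, including a laterality mismatch


def weight(f: Finding) -> float:
    # Assumed abnormal-focused weighting; the paper's weights are not given here.
    return 1.0 if f.abnormal else 0.25


def coverage(items, others) -> float:
    """Weighted share of `items` supported by their best match in `others`."""
    total = sum(weight(f) for f in items)
    if total == 0:
        return 0.0
    earned = sum(
        weight(f) * max((pair_credit(f, o) for o in others), default=0.0)
        for f in items
    )
    return earned / total


def weighted_f1(reference, generated) -> float:
    recall = coverage(reference, generated)
    precision = coverage(generated, reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


ref = [Finding("left pleural space", "effusion", True, True)]
flipped = [Finding("left pleural space", "effusion", False, False)]
print(weighted_f1(ref, ref))      # 1.0
print(weighted_f1(ref, flipped))  # 0.0
```

Because every rule after the rewriting step is a plain function of the two finding lists, the same inputs always give the same score, which is the property the paper relies on for interpretability.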

What carries the argument

Contradiction-constrained many-to-many matching of ordered atomic finding sentences, followed by deterministic part-whole weighting and abnormal-focused F1 scoring.

If this is right

Generated reports can be ranked by clinical accuracy instead of surface lexical overlap.
Polarity inversions and laterality errors receive zero credit rather than partial lexical credit.
Compatible findings at different levels of detail receive partial credit when they align clinically.
The final score remains deterministic after the rewriting step, supporting interpretability of per-finding alignments.
The metric distinguishes clinical compatibility from surface similarity on controlled monotonicity tests.
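
To see why surface overlap can reward a polarity inversion, consider a plain unigram F1. It is used here only as a simple stand-in for lexical metrics, and the two example sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1 on whitespace-split, lowercased text."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "small left pleural effusion"
inverted = "no small left pleural effusion"  # clinically opposite statement
score = unigram_f1(reference, inverted)
print(round(score, 3))  # 0.889
```

The inverted sentence shares four of its five tokens with the reference, so the overlap score stays high even though the clinical meaning is reversed.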

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The atomic-finding representation could support automated localization of specific clinical errors in generated reports.
Similar rewriting-plus-matching pipelines might apply to other clinical document types where propositional consistency is required.
The approach may generalize to non-radiology domains that need fine-grained semantic matching with explicit contradiction handling.
Performance on real-world clinical variation outside the controlled SSREE corruption levels would test robustness to natural report phrasing.
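
As a rough illustration of the error-localization idea in the first point above, per-finding comparisons could be turned into error labels. The `Finding` fields, the label names, and the rule order are assumptions made for this sketch and do not come from the paper.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    side: str         # "left", "right", or "" when laterality does not apply
    site: str
    observation: str
    present: bool


def localize_errors(reference, generated):
    """Attach a hypothetical error label to each reference finding."""
    report = []
    for ref in reference:
        same_topic = [
            g for g in generated
            if g.site == ref.site and g.observation == ref.observation
        ]
        if any(g.side == ref.side and g.present == ref.present for g in same_topic):
            label = "matched"
        elif any(g.side == ref.side and g.present != ref.present for g in same_topic):
            label = "polarity_error"
        elif same_topic:
            label = "laterality_error"  # same topic, but only on another side
        else:
            label = "omission"
        report.append((ref, label))
    return report


reference = [
    Finding("left", "pleural space", "effusion", True),
    Finding("right", "lung", "opacity", True),
    Finding("", "heart", "cardiomegaly", False),
]
generated = [
    Finding("left", "pleural space", "effusion", False),
    Finding("left", "lung", "opacity", True),
]
labels = [label for _, label in localize_errors(reference, generated)]
print(labels)  # ['polarity_error', 'laterality_error', 'omission']
```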

Load-bearing premise

The LLM rewriting step produces accurate, consistent atomic finding sentences that faithfully capture clinical propositions without introducing or omitting key details.

What would settle it

A collection of radiology reports in which the LLM rewriting step alters a negation, laterality, or site detail, causing the subsequent matching and F1 score to diverge from clinical judgment.
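
A crude screen for such cases might compare negation and laterality cues before and after rewriting. The cue lists and function names below are hypothetical, and the check is deliberately coarse: it only flags rewrites for human review, and it will miss changes that leave the overall cue signature intact.

```python
import re

# Small illustrative cue lists; a real screen would need a clinical lexicon.
NEGATION = re.compile(r"\b(?:no|not|without|absent)\b", re.IGNORECASE)
LATERALITY = re.compile(r"\b(left|right|bilateral)\b", re.IGNORECASE)


def cue_signature(text: str):
    """Return (any negation cue present?, set of laterality terms used)."""
    has_negation = NEGATION.search(text) is not None
    sides = frozenset(s.lower() for s in LATERALITY.findall(text))
    return has_negation, sides


def rewrite_needs_review(original: str, atomic_sentences) -> bool:
    """Flag a rewrite whose cue signature differs from the source text."""
    return cue_signature(original) != cue_signature(" ".join(atomic_sentences))


original = "Small left pleural effusion. No pneumothorax."
faithful = ["Small left pleural effusion.", "No pneumothorax."]
side_swapped = ["Small right pleural effusion.", "No pneumothorax."]
negation_lost = ["Small left pleural effusion.", "Pneumothorax."]

print(rewrite_needs_review(original, faithful))       # False
print(rewrite_needs_review(original, side_swapped))   # True
print(rewrite_needs_review(original, negation_lost))  # True
```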

Original abstract

Radiology report evaluation must distinguish clinical compatibility from surface similarity, because negation, laterality, or normal-abnormal polarity can reverse a finding. We propose RadSEM (Radiology Sentence-Level Evaluation Metric), a constrained LLM-assisted metric for reference-based evaluation of radiology Findings. RadSEM rewrites reference and generated reports into ordered atomic finding sentences, each expressing one site-finding proposition. It then performs contradiction-constrained many-to-many matching: incompatible pairs such as "effusion" and "no effusion" receive no credit, while compatible granularity differences can receive partial credit. A deterministic stage weights pairs by part-whole and abnormal-detail relationships, counts unmatched findings, and produces an abnormal-focused weighted F1 score. Thus, the LLM supports structured rewriting and local alignment rather than acting as an opaque judge. We evaluate RadSEM with SSREE, a controlled monotonicity stress test built from 2,448 de-identified reports expanded into five graded corruption levels. RadSEM achieves Kendall tau_b of 0.957, all-pairs concordance of 97.8%, adjacent concordance of 95.0%, and strict five-level ordering for 81.9% of reports, outperforming radiology-specific and general text metrics while avoiding the failure in which polarity-inverted reports regain lexical overlap. On the same SSREE set, RadSEM outperforms the Ref-anchored RadSEM-Alt policy, improving adjacent concordance from 90.7% to 95.0% and strict ordering from 67.2% to 81.9%. On a 599-triplet synonym/antonym subset, RadSEM prefers synonyms in 597 cases (99.67%). These results suggest that explicit finding units, contradiction-aware matching, and abnormal-focused deterministic scoring make report scoring more interpretable and sensitive to clinically meaningful errors. Code is available at https://github.com/jdh-algo/RadSEM.
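
The monotonicity statistics named in the abstract can be illustrated for a single report with five graded corruption levels. This is a sketch under assumptions: the abstract does not say whether tau_b is computed per report or pooled, nor how tied scores are counted, so the code below treats ties as non-concordant and works on one report at a time. The example scores are invented.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y) -> float:
    """Kendall tau_b with the standard correction for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied in both variables: counted nowhere
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else 0.0


def monotonicity_summary(scores):
    """scores[k] is the metric score at corruption level k (0 = least corrupted)."""
    quality = [-k for k in range(len(scores))]  # higher means less corrupted
    pairs = list(combinations(range(len(scores)), 2))
    all_pairs = sum(scores[i] > scores[j] for i, j in pairs) / len(pairs)
    adjacent_hits = [scores[k] > scores[k + 1] for k in range(len(scores) - 1)]
    return {
        "tau_b": kendall_tau_b(quality, scores),
        "all_pairs": all_pairs,
        "adjacent": sum(adjacent_hits) / len(adjacent_hits),
        "strict": all(adjacent_hits),
    }


# Invented scores: levels 2 and 3 tie, so strict ordering fails for this report.
summary = monotonicity_summary([0.95, 0.80, 0.60, 0.60, 0.20])
print(summary["all_pairs"], summary["adjacent"], summary["strict"])  # 0.9 0.75 False
```

Under these assumptions, the reported 81.9% would correspond to the share of reports for which `strict` is true, while the concordance figures average the pairwise checks.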

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RadSEM gives a structured metric that handles clinical polarity and granularity better than baselines on the authors' test, but the unvalidated LLM rewrite step is load-bearing and untested.

RadSEM's core move is rewriting reports into ordered atomic finding sentences, then running contradiction-constrained many-to-many matching before a deterministic abnormal-focused F1. On the SSREE test it reaches 0.957 Kendall tau_b and 97.8% all-pairs concordance while correctly penalizing polarity flips that trip up lexical metrics.

The new element is the explicit separation of the LLM into a rewriting role only, followed by transparent matching rules that give partial credit for compatible granularity differences and zero credit for contradictions. The deterministic weighting for part-whole relations and the abnormal focus are also clear improvements over opaque LLM judges or simple overlap scores. The synonym preference result on the 599-triplet subset is a useful sanity check.

The soft spot is that no human validation, inter-annotator numbers, or error analysis is shown for the atomic sentences themselves. Every downstream number depends on those rewrites being faithful, yet the paper gives no evidence they are. The SSREE set is built by the authors from the same underlying reports, which creates moderate circularity even if the graded corruption design is sensible.

This is for researchers working on radiology report generation who need an evaluation signal that tracks clinical compatibility rather than surface form. A reader already running model comparisons would find the metric and the test construction useful to examine.

It deserves peer review. The pipeline is concrete, the quantitative claims are specific, and the gaps around validation are straightforward for referees to address.

Referee Report

3 major / 1 minor

Summary. The paper proposes RadSEM, a reference-based metric for radiology report evaluation. It rewrites reference and generated reports into ordered atomic finding sentences via LLM, performs contradiction-constrained many-to-many matching (no credit for incompatible pairs like 'effusion' vs. 'no effusion'), applies deterministic weighting by part-whole and abnormal-detail relations, and computes an abnormal-focused weighted F1. On the author-built SSREE stress test (2,448 reports expanded to five graded corruption levels), it reports Kendall tau_b of 0.957, 97.8% all-pairs concordance, 95.0% adjacent concordance, and 81.9% strict five-level ordering, outperforming other metrics and handling polarity inversions; it also shows 99.67% synonym preference on a 599-triplet subset.
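
The synonym-preference figure in the summary above is simple to reproduce arithmetically. The helper below is a hypothetical sketch of how such a rate could be tallied from per-triplet scores; the function name and input layout are assumptions, and only the counts 597 and 599 come from the paper.

```python
def preference_rate(triplet_scores):
    """triplet_scores: iterable of (score_with_synonym, score_with_antonym)."""
    triplet_scores = list(triplet_scores)
    wins = sum(syn > ant for syn, ant in triplet_scores)
    return wins, len(triplet_scores), wins / len(triplet_scores)


# Arithmetic check of the reported figure: 597 of 599 triplets.
print(f"{597 / 599:.2%}")  # 99.67%
```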

Significance. If the atomic-sentence fidelity and benchmark independence hold, RadSEM could provide a more clinically interpretable alternative to lexical or embedding-based metrics by making explicit use of finding-level propositions, contradiction awareness, and abnormal focus. The public code release supports reproducibility.

major comments (3)

[Abstract / Methods] Abstract and Methods (LLM rewriting step): The pipeline's first stage rewrites reports into atomic finding sentences, yet no human validation, inter-annotator agreement, or error analysis is reported for this step; because all downstream matching, weighting, and F1 scoring (and thus the SSREE metrics) depend on these sentences faithfully capturing clinical propositions without omission or polarity errors, the absence of verification is load-bearing for the performance claims.
[Evaluation / SSREE] Evaluation section (SSREE construction): SSREE is built by expanding the same 2,448 de-identified reports into graded corruption levels; this author-constructed design creates dependence between metric definition and benchmark, so the reported Kendall tau_b 0.957 and 81.9% strict ordering may partly reflect that dependence rather than independent clinical fidelity.
[Results] Results (599-triplet subset): The 99.67% synonym preference is computed after the same LLM rewriting; without separate validation that the atomic sentences preserve clinical granularity and polarity on this subset, the result does not independently confirm superiority over baselines.

minor comments (1)

[Abstract] Abstract: the GitHub link is given but no license, version, or exact reproduction instructions are stated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major comment below and indicate where revisions will be made.

Referee: [Abstract / Methods] Abstract and Methods (LLM rewriting step): The pipeline's first stage rewrites reports into atomic finding sentences, yet no human validation, inter-annotator agreement, or error analysis is reported for this step; because all downstream matching, weighting, and F1 scoring (and thus the SSREE metrics) depend on these sentences faithfully capturing clinical propositions without omission or polarity errors, the absence of verification is load-bearing for the performance claims.

Authors: We agree that the lack of direct validation for the LLM rewriting step is a limitation, as the performance metrics depend on the quality of these atomic sentences. While the SSREE results provide indirect support through the metric's sensitivity to clinical corruptions, we will add an error analysis section in the revised manuscript. This will include manual inspection of a random sample of rewritten sentences to report rates of omission, addition, and polarity errors.
Revision: partial

Referee: [Evaluation / SSREE] Evaluation section (SSREE construction): SSREE is built by expanding the same 2,448 de-identified reports into graded corruption levels; this author-constructed design creates dependence between metric definition and benchmark, so the reported Kendall tau_b 0.957 and 81.9% strict ordering may partly reflect that dependence rather than independent clinical fidelity.

Authors: The referee correctly identifies a potential circularity in the evaluation design. The corruption types were chosen to reflect clinically relevant errors (polarity, laterality, granularity) prior to metric development, but using the same report corpus does introduce dependence. We will revise the manuscript to explicitly discuss this limitation and emphasize that future work should validate on independently collected reports or external benchmarks.
Revision: yes

Referee: [Results] Results (599-triplet subset): The 99.67% synonym preference is computed after the same LLM rewriting; without separate validation that the atomic sentences preserve clinical granularity and polarity on this subset, the result does not independently confirm superiority over baselines.

Authors: We concur that this result shares the same dependency on the rewriting step as the main SSREE evaluation. We will update the results section to note this caveat and incorporate the synonym preference evaluation into the planned error analysis of the rewriting step.
Revision: partial
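
The error analysis promised in the responses above could be tallied along these lines. This is a hypothetical sketch: the sampling helper, the label names, and the annotation format are assumptions, and the labels themselves would have to come from human inspection of the rewritten sentences.

```python
import random
from collections import Counter

ERROR_TYPES = ("omission", "addition", "polarity")


def draw_sample(report_ids, k, seed=0):
    """Reproducible random sample of report ids for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(list(report_ids), k)


def error_rates(annotations):
    """annotations: one set of error labels per inspected sentence.

    An empty set means the rewritten sentence was judged faithful.
    """
    counts = Counter(label for labels in annotations for label in labels)
    n = len(annotations)
    return {error_type: counts[error_type] / n for error_type in ERROR_TYPES}


# Invented annotations for four inspected sentences.
annotations = [set(), {"omission"}, {"polarity"}, set()]
rates = error_rates(annotations)
print(rates)  # {'omission': 0.25, 'addition': 0.0, 'polarity': 0.25}
```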

Circularity Check

0 steps flagged

No significant circularity; metric definition independent of benchmark

The paper defines RadSEM via an explicit pipeline (LLM rewriting to atomic sentences, contradiction-constrained matching, part-whole weighting, abnormal-focused F1) that does not reference or depend on SSREE construction details. SSREE is used solely for post-hoc evaluation on 2,448 expanded reports; the metric equations and rules are stated independently and contain no fitted parameters, self-definitional loops, or self-citation chains that reduce the claimed performance to the input benchmark by construction. No equations or load-bearing steps in the abstract or described text exhibit the required reduction (e.g., prediction = fit). This is a standard proposal-plus-benchmark case with no circularity per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no free parameters or new physical entities. The primary unverified premise is the reliability of the LLM rewriting component.

axioms (1)

Domain assumption: An LLM can reliably rewrite radiology reports into ordered atomic finding sentences that preserve clinical propositions without distortion or omission.
The method depends on this capability for the initial rewriting and alignment steps.

pith-pipeline@v0.9.1-grok · 5899 in / 1289 out tokens · 77916 ms · 2026-06-28T02:45:23.862159+00:00 · methodology
