pith. machine review for the scientific record.

arxiv: 2604.03376 · v1 · submitted 2026-04-03 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

VERT: Reliable LLM Judges for Radiology Report Evaluation

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 19:46 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords radiology report evaluation · LLM judges · VERT metric · correlation analysis · fine-tuning · multi-modality · medical AI evaluation · error categorization

The pith

VERT is an LLM-based metric that improves correlation with radiologist judgments on radiology reports by up to 11.7% relative to GREEN.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether LLM judges can reliably score radiology reports from different imaging modalities and body regions, beyond the chest X-ray focus of prior work. It introduces VERT and benchmarks it against RadFact, GREEN, and FineRadScore using open- and closed-source models on the expert-annotated RadEval and RaTE-Eval datasets. VERT shows stronger alignment with human experts, and fine-tuning a 30B model on only 1,300 examples produces further gains while cutting inference time dramatically. A systematic error analysis also maps where these metrics agree and disagree with experts. This line of work matters because scalable, accurate automated evaluation is a prerequisite for developing trustworthy AI systems that generate or review medical reports.

Core claim

We propose VERT, an LLM-based metric for radiology report evaluation. Across two expert-annotated datasets covering multiple modalities and anatomies, VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Fine-tuning Qwen3 30B on 1,300 training samples yields gains of up to 25% and reduces inference time by up to 37.2 times. These results demonstrate that reliable evaluation can be achieved with lightweight adaptation of LLMs.
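
To make the headline numbers concrete, here is a minimal sketch of how per-report metric scores could be correlated with expert ratings and compared against a baseline metric. The array names are placeholders, and the paper does not state here whether Pearson or Spearman underlies the 11.7% figure, so both are shown.

```python
# Hedged sketch: correlation of a metric's per-report scores with expert
# ratings, and the relative gain of one metric over another. Array names
# are placeholders, not the paper's data.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_with_experts(metric_scores, expert_ratings, method="pearson"):
    """Correlation between per-report metric scores and expert ratings."""
    scores = np.asarray(metric_scores, dtype=float)
    ratings = np.asarray(expert_ratings, dtype=float)
    corr_fn = pearsonr if method == "pearson" else spearmanr
    r, _ = corr_fn(scores, ratings)
    return r

def relative_gain_percent(candidate_corr, baseline_corr):
    """Relative improvement of a candidate correlation over a baseline, in percent."""
    return (candidate_corr - baseline_corr) / abs(baseline_corr) * 100.0

# Usage (placeholder arrays):
# vert_r  = correlation_with_experts(vert_scores, expert_ratings)
# green_r = correlation_with_experts(green_scores, expert_ratings)
# print(f"VERT gain over GREEN: {relative_gain_percent(vert_r, green_r):.1f}%")
```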

What carries the argument

VERT, the proposed LLM-based judge metric that uses tailored configurations, few-shot prompting, ensembling, or parameter-efficient fine-tuning to align ratings more closely with expert radiologist judgments on reports from varied modalities.
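
As an illustration of what such a judge looks like in practice, below is a minimal sketch of assembling a VERT-style evaluation prompt from a reference and a candidate report. The paper does not publish its exact templates (a point the referee raises below), so the criteria wording, error-category labels, and output fields here are assumptions modeled on the error types (a)-(f) and the [0.00, 1.00] accuracy score the paper describes.

```python
# Hedged sketch of a VERT-style judge prompt; wording, categories, and output
# fields are illustrative assumptions, not the authors' actual template.
JUDGE_TEMPLATE = """You are evaluating a candidate radiology report against a reference report.

Criteria for Judgment:
- Count clinically significant and clinically insignificant errors.
- Error categories (assumed labels): (a) false report of a finding,
  (b) missing finding, (c) wrong anatomical location, (d)-(f) other discrepancies.
- Assign an overall accuracy score in [0.00, 1.00].

Reference Report:
{reference}

Candidate Report:
{candidate}

Report your assessment as:
[Explanation]: <free text>
[Clinically Significant Errors]: (a) <count>; ...; (f) <count>
[Clinically Insignificant Errors]: (a) <count>; ...; (f) <count>
[Overall Accuracy Score]: <score>
"""

def build_judge_prompt(reference: str, candidate: str) -> str:
    """Fill the illustrative template with one reference/candidate report pair."""
    return JUDGE_TEMPLATE.format(reference=reference, candidate=candidate)
```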

Load-bearing premise

Expert annotations in RadEval and RaTE-Eval are treated as reliable ground truth without reported inter-rater agreement statistics or analysis of annotation variability across modalities.

What would settle it

A follow-up study that collects independent ratings from multiple radiologists on the same reports and finds low inter-rater agreement, or that measures VERT correlation dropping below GREEN on a new set of reports from an unseen modality.

Figures

Figures reproduced from arXiv: 2604.03376 by Asma Ben Abacha, Federica Bologna, Jean-Philippe Corbeil, Matthew Wilkens.

Figure 1
Figure 1. Fine-tuning of Qwen3-30B-A3B-Instruct-2507 on the RaTE-Eval training set (LoRA, 5 epochs). view at source ↗
Figure 2
Figure 2. Mean number of clinically significant errors annotated by humans vs. those detected by the models. view at source ↗
Figure 3
Figure 3. Mean number of clinically significant errors detected by model-prompt combinations. view at source ↗
Figure 4
Figure 4. F1 scores by error type on RadEval; results are flipped for types (a) and (b) on RaTE-Eval. view at source ↗
Figure 5
Figure 5. Histogram of error annotations in RadEval. view at source ↗
Figure 6
Figure 6. Histogram of normalized expert-annotated scores in RaTE-Eval. view at source ↗
Figure 7
Figure 7. Simulations of the impact of sweeping S or TP for GREEN and F1. The appendix text attached to this figure defines, for each error category c ∈ {(a), …, (f)} and report i, the human-annotated significant-error count h_{i,c} and the model-predicted count g_{i,c}, with count-level matches TP_{i,c} = min(h_{i,c}, g_{i,c}). view at source ↗
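
Figures 4 and 7 both rely on a count-level F1 over error categories. A minimal sketch of that computation follows, using the TP_{i,c} = min(h_{i,c}, g_{i,c}) definition recoverable from the appendix excerpt attached to Figure 7; treating excess model counts as false positives and excess human counts as false negatives is an assumption, since the rest of the definition is truncated at the source.

```python
# Hedged sketch of count-level F1 per error category. TP follows the appendix
# excerpt (TP_{i,c} = min(h_{i,c}, g_{i,c})); the FP/FN definitions are assumed,
# as the source text is truncated.
import numpy as np

def per_category_f1(human_counts, model_counts):
    """Both inputs: arrays of shape (num_reports, num_categories) of error counts."""
    h = np.asarray(human_counts, dtype=float)
    g = np.asarray(model_counts, dtype=float)
    tp = np.minimum(h, g).sum(axis=0)         # matched counts per category
    fp = np.maximum(g - h, 0.0).sum(axis=0)   # assumed: model over-counts
    fn = np.maximum(h - g, 0.0).sum(axis=0)   # assumed: model under-counts
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    denom = precision + recall
    return np.divide(2 * precision * recall, denom,
                     out=np.zeros_like(denom), where=denom > 0)
```
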
read the original abstract

Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation? We conduct a thorough correlation analysis between expert and LLM-based ratings. We compare three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) alongside VERT, our proposed LLM-based metric, using open- and closed-source models (reasoning and non-reasoning) of different sizes across two expert-annotated datasets, RadEval and RaTE-Eval, spanning multiple modalities and anatomies. We further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning using RaTE-Eval. To better understand metric behavior, we perform a systematic error detection and categorization study to assess alignment of these metrics against expert judgments and identify areas of lower and higher agreement. Our results show that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Furthermore, fine-tuning Qwen3 30B yields gains of up to 25% using only 1,300 training samples. The fine-tuned model also reduces inference time by up to 37.2 times. These findings highlight the effectiveness of LLM-based judges and demonstrate that reliable evaluation can be achieved with lightweight adaptation.
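
The abstract's parameter-efficient fine-tuning result (Qwen3 30B adapted on roughly 1,300 RaTE-Eval samples, with LoRA per Figure 1) is not specified in configuration detail here. Below is a minimal sketch of what such a setup could look like with the Hugging Face transformers and peft libraries; the model identifier, rank, learning rate, and target modules are placeholder assumptions, and only the 5-epoch count comes from Figure 1.

```python
# Hedged sketch of a LoRA fine-tuning setup in the spirit of the paper's
# parameter-efficient adaptation; hyperparameters and the model identifier
# are placeholder assumptions, not the authors' reported configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen3-30B-A3B-Instruct-2507"  # assumed Hugging Face identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

lora_config = LoraConfig(
    r=16,                                   # placeholder rank
    lora_alpha=32,                          # placeholder scaling
    target_modules=["q_proj", "v_proj"],    # placeholder target projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    output_dir="vert-judge-lora",
    num_train_epochs=5,                     # Figure 1 reports 5 epochs
    per_device_train_batch_size=1,
    learning_rate=2e-4,                     # placeholder
)
# trainer = Trainer(model=model, args=args, train_dataset=judge_train_dataset)
# trainer.train()  # judge_train_dataset: ~1,300 tokenized judge examples (placeholder)
```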

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VERT, a proposed LLM-based metric for evaluating radiology reports. Through extensive experiments on the RadEval and RaTE-Eval datasets covering multiple modalities and anatomies, it compares VERT against RadFact, GREEN, and FineRadScore using various open- and closed-source LLMs. The paper reports that VERT achieves up to 11.7% higher correlation with expert judgments than GREEN, and that fine-tuning Qwen3-30B on 1,300 samples yields up to 25% gains while reducing inference time by up to 37.2 times. It also includes analyses of few-shot learning, ensembling, and a systematic error detection study.

Significance. If the reported correlations hold under scrutiny, this work would be significant for the development of reliable automated evaluation tools in radiology AI, extending beyond chest X-rays to diverse modalities. The emphasis on lightweight fine-tuning and efficiency gains is practically valuable, and the error categorization provides insight into where LLM judges align or diverge from experts.

major comments (2)
  1. [Abstract] The abstract states that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN and that fine-tuning yields gains of up to 25%, but provides no details on the specific correlation coefficients (Pearson or Spearman), statistical significance, error bars, or how data splits were handled in the experiments.
  2. [Results] All quantitative claims rely on expert annotations in RadEval and RaTE-Eval as ground truth, yet no inter-rater agreement statistics (e.g., Cohen’s κ, Fleiss’ κ) or analysis of annotation variability across modalities are reported. This is a load-bearing issue for interpreting the correlation improvements as evidence of increased reliability.
minor comments (2)
  1. [Methods] The exact prompt templates used for the LLM judges (including VERT) are not provided, which limits reproducibility of the results.
  2. [Abstract] The claim of reducing inference time up to 37.2 times lacks specification of the baseline model and conditions under which this speedup is measured.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract requires more precise quantitative details and that inter-rater agreement is important for validating the ground-truth annotations. We will revise the manuscript to address both points while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract] The abstract states that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN and that fine-tuning yields gains of up to 25%, but provides no details on the specific correlation coefficients (Pearson or Spearman), statistical significance, error bars, or how data splits were handled in the experiments.

    Authors: We agree that the abstract should be more informative. In the revision we will explicitly state the Pearson and Spearman coefficients underlying the 11.7% and 25% relative gains, note statistical significance (p-values) where computed, include standard-error or bootstrap confidence intervals, and briefly describe the train/validation/test splits used on RadEval and RaTE-Eval. revision: yes

  2. Referee: [Results] All quantitative claims rely on expert annotations in RadEval and RaTE-Eval as ground truth, yet no inter-rater agreement statistics (e.g., Cohen’s κ, Fleiss’ κ) or analysis of annotation variability across modalities are reported. This is a load-bearing issue for interpreting the correlation improvements as evidence of increased reliability.

    Authors: We acknowledge this limitation. RadEval and RaTE-Eval originate from prior publications; we will add a dedicated paragraph reporting any inter-rater statistics (Cohen’s or Fleiss’ κ) that were originally published with those datasets, together with an analysis of annotation variability across modalities when multiple ratings exist. If only single-expert labels are available for certain subsets, we will explicitly note this and discuss its implications for the reported correlations. revision: partial
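
For context on the agreement statistics discussed above, here is a minimal sketch of computing Cohen's κ between two annotators with scikit-learn; the label arrays are placeholders, not data from RadEval or RaTE-Eval, and Fleiss' κ would be the analogue for more than two raters.

```python
# Hedged sketch: inter-rater agreement between two annotators' per-report labels.
# The example labels are placeholders, not data from RadEval or RaTE-Eval.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 2, 1, 0, 1]  # e.g. bucketed significant-error counts per report
rater_b = [1, 0, 1, 1, 0, 2]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```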

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper reports empirical correlation results (Pearson/Spearman) between LLM-based metrics and expert ratings on held-out portions of the external RadEval and RaTE-Eval datasets. No equations, fitted parameters, or self-referential definitions are present that would reduce any reported gain (e.g., 11.7% over GREEN or 25% from fine-tuning) to the inputs by construction. The central claims rest on comparisons against independent expert annotations rather than any self-citation load-bearing premise, ansatz smuggling, or renaming of known results. The evaluation is therefore self-contained against the stated external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the assumption that the two expert-annotated datasets are representative and that LLM ratings can be meaningfully compared to human ratings without additional calibration. No free parameters, axioms, or invented entities are introduced beyond the definition of VERT itself.

pith-pipeline@v0.9.0 · 5567 in / 1129 out tokens · 24526 ms · 2026-05-13T19:46:55.709163+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
