pith. machine review for the scientific record.

arxiv: 2604.03376 · v1 · submitted 2026-04-03 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

VERT: Reliable LLM Judges for Radiology Report Evaluation

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 19:46 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords radiology report evaluation · LLM judges · VERT metric · correlation analysis · fine-tuning · multi-modality · medical AI evaluation · error categorization

The pith

VERT is an LLM-based metric that improves correlation with radiologist judgments on radiology reports by up to 11.7% relative to GREEN.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether LLM judges can reliably score radiology reports from different imaging modalities and body regions, beyond the chest X-ray focus of prior work. It introduces VERT and benchmarks it against RadFact, GREEN, and FineRadScore using open- and closed-source models on the expert-annotated RadEval and RaTE-Eval datasets. VERT shows stronger alignment with human experts, and fine-tuning a 30B model on only 1,300 examples produces further gains while cutting inference time dramatically. A systematic error analysis also maps where these metrics agree and disagree with experts. This line of work matters because scalable, accurate automated evaluation is a prerequisite for developing trustworthy AI systems that generate or review medical reports.

Core claim

We propose VERT, an LLM-based metric for radiology report evaluation. Across two expert-annotated datasets covering multiple modalities and anatomies, VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Fine-tuning Qwen3 30B on 1,300 training samples yields gains of up to 25% and reduces inference time by up to 37.2 times. These results demonstrate that reliable evaluation can be achieved with lightweight adaptation of LLMs.
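
To make the headline numbers concrete, here is a minimal sketch of how per-report metric scores could be correlated with expert ratings and compared against a baseline metric. The array names are placeholders, and the paper does not state here whether Pearson or Spearman underlies the 11.7% figure, so both are shown.

```python
# Hedged sketch: correlation of a metric's per-report scores with expert
# ratings, and the relative gain of one metric over another. Array names
# are placeholders, not the paper's data.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_with_experts(metric_scores, expert_ratings, method="pearson"):
    """Correlation between per-report metric scores and expert ratings."""
    scores = np.asarray(metric_scores, dtype=float)
    ratings = np.asarray(expert_ratings, dtype=float)
    corr_fn = pearsonr if method == "pearson" else spearmanr
    r, _ = corr_fn(scores, ratings)
    return r

def relative_gain_percent(candidate_corr, baseline_corr):
    """Relative improvement of a candidate correlation over a baseline, in percent."""
    return (candidate_corr - baseline_corr) / abs(baseline_corr) * 100.0

# Usage (placeholder arrays):
# vert_r  = correlation_with_experts(vert_scores, expert_ratings)
# green_r = correlation_with_experts(green_scores, expert_ratings)
# print(f"VERT gain over GREEN: {relative_gain_percent(vert_r, green_r):.1f}%")
```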

What carries the argument

VERT, the proposed LLM-based judge metric that uses tailored configurations, few-shot prompting, ensembling, or parameter-efficient fine-tuning to align ratings more closely with expert radiologist judgments on reports from varied modalities.
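
As an illustration of what such a judge looks like in practice, below is a minimal sketch of assembling a VERT-style evaluation prompt from a reference and a candidate report. The paper does not publish its exact templates (a point the referee raises below), so the criteria wording, error-category labels, and output fields here are assumptions modeled on the error types (a)-(f) and the [0.00, 1.00] accuracy score the paper describes.

```python
# Hedged sketch of a VERT-style judge prompt; wording, categories, and output
# fields are illustrative assumptions, not the authors' actual template.
JUDGE_TEMPLATE = """You are evaluating a candidate radiology report against a reference report.

Criteria for Judgment:
- Count clinically significant and clinically insignificant errors.
- Error categories (assumed labels): (a) false report of a finding,
  (b) missing finding, (c) wrong anatomical location, (d)-(f) other discrepancies.
- Assign an overall accuracy score in [0.00, 1.00].

Reference Report:
{reference}

Candidate Report:
{candidate}

Report your assessment as:
[Explanation]: <free text>
[Clinically Significant Errors]: (a) <count>; ...; (f) <count>
[Clinically Insignificant Errors]: (a) <count>; ...; (f) <count>
[Overall Accuracy Score]: <score>
"""

def build_judge_prompt(reference: str, candidate: str) -> str:
    """Fill the illustrative template with one reference/candidate report pair."""
    return JUDGE_TEMPLATE.format(reference=reference, candidate=candidate)
```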

Load-bearing premise

Expert annotations in RadEval and RaTE-Eval are treated as reliable ground truth without reported inter-rater agreement statistics or analysis of annotation variability across modalities.

What would settle it

A follow-up study that collects independent ratings from multiple radiologists on the same reports and finds low inter-rater agreement, or that measures VERT correlation dropping below GREEN on a new set of reports from an unseen modality.

Figures

Figures reproduced from arXiv: 2604.03376 by Asma Ben Abacha, Federica Bologna, Jean-Philippe Corbeil, Matthew Wilkens.

Figure 1
Figure 1. Fine-tuning of Qwen3-30B-A3B-Instruct-2507 on the RaTE-Eval training set (LoRA, 5 epochs). view at source ↗
Figure 2
Figure 2. Mean number of clinically significant errors annotated by humans vs. those detected by the models. view at source ↗
Figure 3
Figure 3. Mean number of clinically significant errors detected by model-prompt combinations. view at source ↗
Figure 4
Figure 4. F1 scores by error type on RadEval; results are flipped for types (a) and (b) on RaTE-Eval. view at source ↗
Figure 5
Figure 5. Histogram of error annotations in RadEval. view at source ↗
Figure 6
Figure 6. Histogram of normalized expert-annotated scores in RaTE-Eval. view at source ↗
Figure 7
Figure 7. Simulations of the impact of sweeping S or TP for GREEN and F1. The appendix text attached to this figure defines, for each error category c ∈ {(a), …, (f)} and report i, the human-annotated significant-error count h_{i,c} and the model-predicted count g_{i,c}, with count-level matches TP_{i,c} = min(h_{i,c}, g_{i,c}). view at source ↗
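
Figures 4 and 7 both rely on a count-level F1 over error categories. A minimal sketch of that computation follows, using the TP_{i,c} = min(h_{i,c}, g_{i,c}) definition recoverable from the appendix excerpt attached to Figure 7; treating excess model counts as false positives and excess human counts as false negatives is an assumption, since the rest of the definition is truncated at the source.

```python
# Hedged sketch of count-level F1 per error category. TP follows the appendix
# excerpt (TP_{i,c} = min(h_{i,c}, g_{i,c})); the FP/FN definitions are assumed,
# as the source text is truncated.
import numpy as np

def per_category_f1(human_counts, model_counts):
    """Both inputs: arrays of shape (num_reports, num_categories) of error counts."""
    h = np.asarray(human_counts, dtype=float)
    g = np.asarray(model_counts, dtype=float)
    tp = np.minimum(h, g).sum(axis=0)         # matched counts per category
    fp = np.maximum(g - h, 0.0).sum(axis=0)   # assumed: model over-counts
    fn = np.maximum(h - g, 0.0).sum(axis=0)   # assumed: model under-counts
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    denom = precision + recall
    return np.divide(2 * precision * recall, denom,
                     out=np.zeros_like(denom), where=denom > 0)
```
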
read the original abstract

Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation? We conduct a thorough correlation analysis between expert and LLM-based ratings. We compare three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) alongside VERT, our proposed LLM-based metric, using open- and closed-source models (reasoning and non-reasoning) of different sizes across two expert-annotated datasets, RadEval and RaTE-Eval, spanning multiple modalities and anatomies. We further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning using RaTE-Eval. To better understand metric behavior, we perform a systematic error detection and categorization study to assess alignment of these metrics against expert judgments and identify areas of lower and higher agreement. Our results show that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Furthermore, fine-tuning Qwen3 30B yields gains of up to 25% using only 1,300 training samples. The fine-tuned model also reduces inference time by up to 37.2 times. These findings highlight the effectiveness of LLM-based judges and demonstrate that reliable evaluation can be achieved with lightweight adaptation.
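
The abstract's parameter-efficient fine-tuning result (Qwen3 30B adapted on roughly 1,300 RaTE-Eval samples, with LoRA per Figure 1) is not specified in configuration detail here. Below is a minimal sketch of what such a setup could look like with the Hugging Face transformers and peft libraries; the model identifier, rank, learning rate, and target modules are placeholder assumptions, and only the 5-epoch count comes from Figure 1.

```python
# Hedged sketch of a LoRA fine-tuning setup in the spirit of the paper's
# parameter-efficient adaptation; hyperparameters and the model identifier
# are placeholder assumptions, not the authors' reported configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen3-30B-A3B-Instruct-2507"  # assumed Hugging Face identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

lora_config = LoraConfig(
    r=16,                                   # placeholder rank
    lora_alpha=32,                          # placeholder scaling
    target_modules=["q_proj", "v_proj"],    # placeholder target projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    output_dir="vert-judge-lora",
    num_train_epochs=5,                     # Figure 1 reports 5 epochs
    per_device_train_batch_size=1,
    learning_rate=2e-4,                     # placeholder
)
# trainer = Trainer(model=model, args=args, train_dataset=judge_train_dataset)
# trainer.train()  # judge_train_dataset: ~1,300 tokenized judge examples (placeholder)
```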

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VERT, a proposed LLM-based metric for evaluating radiology reports. Through extensive experiments on the RadEval and RaTE-Eval datasets covering multiple modalities and anatomies, it compares VERT against RadFact, GREEN, and FineRadScore using various open- and closed-source LLMs. The paper reports that VERT achieves up to 11.7% higher correlation with expert judgments than GREEN, and that fine-tuning Qwen3-30B on 1,300 samples yields up to 25% gains while reducing inference time by up to 37.2 times. It also includes analyses of few-shot learning, ensembling, and a systematic error detection study.

Significance. If the reported correlations hold under scrutiny, this work would be significant for the development of reliable automated evaluation tools in radiology AI, extending beyond chest X-rays to diverse modalities. The emphasis on lightweight fine-tuning and efficiency gains is practically valuable, and the error categorization provides insight into where LLM judges align or diverge from experts.

major comments (2)
  1. [Abstract] The abstract states that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN and that fine-tuning yields gains of up to 25%, but provides no details on the specific correlation coefficients (Pearson or Spearman), statistical significance, error bars, or how data splits were handled in the experiments.
  2. [Results] All quantitative claims rely on expert annotations in RadEval and RaTE-Eval as ground truth, yet no inter-rater agreement statistics (e.g., Cohen’s κ, Fleiss’ κ) or analysis of annotation variability across modalities are reported. This is a load-bearing issue for interpreting the correlation improvements as evidence of increased reliability.
minor comments (2)
  1. [Methods] The exact prompt templates used for the LLM judges (including VERT) are not provided, which limits reproducibility of the results.
  2. [Abstract] The claim of reducing inference time up to 37.2 times lacks specification of the baseline model and conditions under which this speedup is measured.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract requires more precise quantitative details and that inter-rater agreement is important for validating the ground-truth annotations. We will revise the manuscript to address both points while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract] The abstract states that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN and that fine-tuning yields gains of up to 25%, but provides no details on the specific correlation coefficients (Pearson or Spearman), statistical significance, error bars, or how data splits were handled in the experiments.

    Authors: We agree that the abstract should be more informative. In the revision we will explicitly state the Pearson and Spearman coefficients underlying the 11.7% and 25% relative gains, note statistical significance (p-values) where computed, include standard-error or bootstrap confidence intervals, and briefly describe the train/validation/test splits used on RadEval and RaTE-Eval. revision: yes

  2. Referee: [Results] All quantitative claims rely on expert annotations in RadEval and RaTE-Eval as ground truth, yet no inter-rater agreement statistics (e.g., Cohen’s κ, Fleiss’ κ) or analysis of annotation variability across modalities are reported. This is a load-bearing issue for interpreting the correlation improvements as evidence of increased reliability.

    Authors: We acknowledge this limitation. RadEval and RaTE-Eval originate from prior publications; we will add a dedicated paragraph reporting any inter-rater statistics (Cohen’s or Fleiss’ κ) that were originally published with those datasets, together with an analysis of annotation variability across modalities when multiple ratings exist. If only single-expert labels are available for certain subsets, we will explicitly note this and discuss its implications for the reported correlations. revision: partial
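
For context on the agreement statistics discussed above, here is a minimal sketch of computing Cohen's κ between two annotators with scikit-learn; the label arrays are placeholders, not data from RadEval or RaTE-Eval, and Fleiss' κ would be the analogue for more than two raters.

```python
# Hedged sketch: inter-rater agreement between two annotators' per-report labels.
# The example labels are placeholders, not data from RadEval or RaTE-Eval.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 2, 1, 0, 1]  # e.g. bucketed significant-error counts per report
rater_b = [1, 0, 1, 1, 0, 2]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```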

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper reports empirical correlation results (Pearson/Spearman) between LLM-based metrics and expert ratings on held-out portions of the external RadEval and RaTE-Eval datasets. No equations, fitted parameters, or self-referential definitions are present that would reduce any reported gain (e.g., 11.7% over GREEN or 25% from fine-tuning) to the inputs by construction. The central claims rest on comparisons against independent expert annotations rather than any self-citation load-bearing premise, ansatz smuggling, or renaming of known results. The evaluation is therefore self-contained against the stated external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the assumption that the two expert-annotated datasets are representative and that LLM ratings can be meaningfully compared to human ratings without additional calibration. No free parameters, axioms, or invented entities are introduced beyond the definition of VERT itself.

pith-pipeline@v0.9.0 · 5567 in / 1129 out tokens · 24526 ms · 2026-05-13T19:46:55.709163+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
