LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation

Dongsheng Li; Lili Qiu; Xinyang Jiang; Xufang Luo; Zilong Wang

arxiv: 2404.00998 · v1 · pith:63N3TDQJnew · submitted 2024-04-01 · 💻 cs.CL · cs.AI



keywords evaluation, model, radiology, accessible, achieves, compare, development, distilled


Evaluating generated radiology reports is crucial for the development of radiology AI, but existing metrics fail to reflect the task's clinical requirements. This study proposes a novel evaluation framework that uses large language models (LLMs) to compare radiology reports. We compare the performance of various LLMs and demonstrate that, with GPT-4, the proposed metric achieves evaluation consistency close to that of radiologists. To reduce cost and improve accessibility, and so make the method practical, we construct a dataset from the LLM evaluation results and use knowledge distillation to train a smaller model. The distilled model achieves evaluation capabilities comparable to GPT-4. Our framework and distilled model offer an accessible and efficient evaluation method for radiology report generation, facilitating the development of more clinically relevant models. The model will be open-sourced.
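The abstract does not say how "evaluation consistency" with radiologists is measured. As a minimal sketch, assuming it is a rank agreement between metric scores and radiologist scores over the same set of reports, one common choice is Kendall's tau. The function name `kendall_tau`, its parameters, and the example scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(metric_scores, radiologist_scores):
    """Kendall tau-a between two score lists for the same reports.

    Assumption: this is an illustrative agreement measure, not
    necessarily the one used in the paper. Tied pairs count as
    neither concordant nor discordant.
    """
    if len(metric_scores) != len(radiologist_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two reports")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both raters order reports i and j the same way.
        product = (metric_scores[i] - metric_scores[j]) * (
            radiologist_scores[i] - radiologist_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


if __name__ == "__main__":
    # Made-up scores for four reports, purely for illustration.
    metric = [1, 2, 3, 4]
    radiologist = [1, 3, 2, 4]
    print(kendall_tau(metric, radiologist))
```

A value of 1.0 means the metric ranks every pair of reports the same way the radiologist does, and -1.0 means it reverses every pair. Comparing the metric-to-radiologist value with a radiologist-to-radiologist value is one way to read a claim of consistency "close to that of radiologists".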


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs
cs.CV · 2026-03 · unverdicted · novelty 5.0

CogAlign uses hierarchical supervised fine-tuning on clinical cognition data plus counterfactual RL to align MLLMs with expert diagnostic pathways and enforce causal lesion grounding for GI endoscopy diagnosis.