Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports
Pith reviewed 2026-05-10 02:05 UTC · model grok-4.3
The pith
A two-stage fine-tuning process with reinforcement learning improves both accuracy and reasoning in disease classification from radiology reports.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that supervised fine-tuning on disease labels, followed by Group Relative Policy Optimization, which optimizes accuracy and format rewards without reasoning supervision, further improves classification and enhances reasoning recall and comprehensiveness across three radiologist-annotated datasets.
What carries the argument
Group Relative Policy Optimization (GRPO), a reinforcement learning step that refines predictions using accuracy and format rewards without reasoning labels.
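The group-relative update that gives GRPO its name can be sketched in a few lines: for each prompt, a group of sampled completions is scored with a scalar reward (here, accuracy plus format), and each completion's advantage is its reward standardized against the group. This is a minimal sketch with hypothetical reward values, not the paper's implementation:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one prompt's sampled completions.

    Each completion's advantage is its reward standardized against the
    group: (r_i - mean(group)) / (std(group) + eps). No value model or
    reasoning labels are involved; only the scalar rewards.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 sampled answers to one report:
# 1.0 for a correct label plus a small format bonus.
rewards = [1.1, 1.0, 0.1, 0.0]
advantages = group_relative_advantages(rewards)
# Correct, well-formatted completions receive positive advantage and are
# reinforced; incorrect ones receive negative advantage.
```

Because advantages are centered within each group, they sum to (approximately) zero per prompt, so the policy is pushed toward the better completions relative to its own current samples rather than toward an absolute reward scale.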
If this is right
- Classification accuracy exceeds levels reached by supervised fine-tuning alone.
- Reasoning recall and comprehensiveness rise despite the absence of reasoning supervision.
- The gains hold for lightweight LLMs across three independent radiologist-annotated datasets.
- No additional reasoning annotations are needed beyond the disease labels.
Where Pith is reading between the lines
- Optimizing output accuracy and format may serve as an implicit signal that improves internal reasoning processes.
- The two-stage method could be applied to other medical text tasks that require both factual correctness and explanatory reasoning.
- It points to possible reductions in annotation costs for building medical AI systems.
Load-bearing premise
That optimizing only for accuracy and format via GRPO, without any reasoning supervision, will still enhance reasoning recall and comprehensiveness.
What would settle it
Re-running the GRPO stage on the same datasets and finding no increase, or a decrease, in reasoning recall and comprehensiveness compared with supervised fine-tuning (SFT) alone would falsify the claim.
Original abstract
Accurate disease classification from radiology reports is essential for many applications. While supervised fine-tuning (SFT) of lightweight LLMs improves accuracy, it can degrade reasoning. We propose a two-stage approach: SFT on disease labels followed by Group Relative Policy Optimization (GRPO) to refine predictions by optimizing accuracy and format without reasoning supervision. Across three radiologist-annotated datasets, SFT outperformed baselines and GRPO further improved classification and enhanced reasoning recall and comprehensiveness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a two-stage pipeline for disease classification from radiology reports using lightweight LLMs: supervised fine-tuning (SFT) on disease labels, followed by Group Relative Policy Optimization (GRPO) that optimizes only accuracy and output format rewards (no explicit reasoning supervision). It claims that SFT already outperforms baselines and that GRPO yields further gains in classification accuracy while also improving reasoning recall and comprehensiveness across three radiologist-annotated datasets.
Significance. If the empirical results and reasoning metrics hold after proper controls, the work would show that RL-based post-training with narrow rewards can simultaneously boost predictive accuracy and the apparent quality of generated reasoning in a high-stakes medical domain. This would be useful for clinical NLP applications where both correctness and interpretability matter, and the absence of direct reasoning supervision in the GRPO stage would be a practical advantage.
Major comments (2)
- [§4 Experiments, §5 Results] The central claim that GRPO improves reasoning recall and comprehensiveness rests on unverified metrics. No ablation that removes the format reward is reported, nor is there a control for output length or structure; higher recall scores could therefore be an artifact of the format term encouraging longer or more structured text rather than genuine reasoning improvement. This directly affects the weakest assumption identified in the stress-test note.
- [Tables 1, 2] Quantitative results for accuracy, recall, and comprehensiveness are presented without statistical significance tests, confidence intervals, or multiple-run variance; the reported gains of GRPO over SFT therefore cannot be assessed for reliability, undermining the claim that GRPO "further improved classification."
Minor comments (2)
- [Abstract] The abstract contains no numerical results, baseline names, or dataset sizes; while the full text supplies these, the abstract should be self-contained for readers who encounter only the summary.
- [§3] Notation for the GRPO objective and the exact form of the accuracy and format rewards is introduced without an explicit equation reference in the main text; adding a numbered equation would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below and describe the revisions we will implement to strengthen the manuscript.
Point-by-point responses
Referee: [§4 Experiments, §5 Results] The central claim that GRPO improves reasoning recall and comprehensiveness rests on unverified metrics. No ablation that removes the format reward is reported, nor is there a control for output length or structure; higher recall scores could therefore be an artifact of the format term encouraging longer or more structured text rather than genuine reasoning improvement. This directly affects the weakest assumption identified in the stress-test note.
Authors: We agree that the lack of an ablation isolating the format reward leaves the source of the reasoning gains open to alternative explanations. In the revised manuscript we will add a controlled ablation in which GRPO is run with the accuracy reward alone (format reward removed) and directly compare reasoning recall and comprehensiveness against the full-reward setting. We will also report mean output lengths (in tokens) for the SFT and GRPO models on each dataset so that readers can assess whether length or structural changes account for the observed metric differences. These additions will clarify whether the improvements stem from the RL objective itself.
Revision: yes
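The ablation under discussion can be made concrete with a toy reward function that toggles the format term. The `<think>…</think><answer>…</answer>` template and the 0.5 format bonus below are illustrative assumptions, not the paper's specification:

```python
import re

# Assumed reasoning template; the paper's exact format is not specified here.
FORMAT_RE = re.compile(r"<think>.+</think>\s*<answer>.+</answer>", re.DOTALL)

def reward(completion, gold_label, use_format_reward=True):
    """Toy GRPO reward: 1.0 for a correct predicted label, plus an
    optional format bonus when the reasoning template is followed.
    Setting use_format_reward=False gives the accuracy-only ablation."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    predicted = m.group(1).strip() if m else completion.strip()
    r = 1.0 if predicted == gold_label else 0.0
    if use_format_reward and FORMAT_RE.search(completion):
        r += 0.5  # assumed bonus magnitude
    return r

out = "<think>Opacities in the right lower lobe.</think><answer>Pneumonia</answer>"
full = reward(out, "Pneumonia")                               # accuracy + format
ablated = reward(out, "Pneumonia", use_format_reward=False)   # accuracy only
```

Comparing reasoning metrics between policies trained with `full`-style versus `ablated`-style rewards (holding everything else fixed) is exactly the control the referee asks for.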
Referee: [Tables 1, 2] Quantitative results for accuracy, recall, and comprehensiveness are presented without statistical significance tests, confidence intervals, or multiple-run variance; the reported gains of GRPO over SFT therefore cannot be assessed for reliability, undermining the claim that GRPO "further improved classification."
Authors: We concur that statistical characterization is necessary to substantiate the reliability of the reported improvements. In the revised version we will repeat all experiments across five independent random seeds, present means and standard deviations for every metric, and add 95% confidence intervals together with paired statistical tests (e.g., Wilcoxon signed-rank) to Tables 1 and 2. This will allow readers to evaluate the significance and stability of the GRPO gains over SFT.
Revision: yes
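One stdlib-only way to produce the proposed confidence intervals is a paired bootstrap over per-example correctness; a Wilcoxon signed-rank test (e.g., `scipy.stats.wilcoxon`) would complement it. The accuracy vectors below are hypothetical, for illustration only:

```python
import random
from statistics import mean

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap CI for mean(scores_b) - mean(scores_a).

    scores_a / scores_b hold per-example correctness (0 or 1) for two
    models on the *same* test set; resampling indices keeps pairs aligned,
    which is what makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(mean(scores_b[i] for i in idx)
                     - mean(scores_a[i] for i in idx))
    diffs.sort()
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical per-report correctness for SFT vs. SFT+GRPO (20 reports).
sft  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
grpo = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
lo, hi = paired_bootstrap_ci(sft, grpo)
# An interval that excludes 0 suggests the GRPO gain is unlikely to be noise.
```

On real data this would be run per dataset and per metric, with the observed mean difference reported alongside the interval.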
Circularity Check
No significant circularity; empirical claims rest on external datasets and standard training stages.
Full rationale
The paper presents a sequential two-stage pipeline (SFT on disease labels, then GRPO optimizing only accuracy and format rewards) evaluated on three independent radiologist-annotated datasets. All performance claims, including gains in classification accuracy and reasoning recall/comprehensiveness, are framed as measured outcomes from held-out test sets rather than quantities defined in terms of the optimization objective itself. No equations, self-citations, or ansatzes are shown that would reduce the reported reasoning improvements to a redefinition of the format/accuracy rewards or to prior author work. The derivation chain is therefore self-contained against external benchmarks.