Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports
Pith reviewed 2026-05-10 02:05 UTC · model grok-4.3
The pith
A two-stage fine-tuning process with reinforcement learning improves both accuracy and reasoning in disease classification from radiology reports.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that supervised fine-tuning on disease labels, followed by Group Relative Policy Optimization, which optimizes accuracy and format rewards without reasoning supervision, further improves classification and enhances reasoning recall and comprehensiveness across three radiologist-annotated datasets.
What carries the argument
Group Relative Policy Optimization (GRPO), a reinforcement learning step that refines predictions using accuracy and format rewards without reasoning labels.
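The group-relative update that gives GRPO its name can be sketched in a few lines: for each prompt, a group of sampled completions is scored with a scalar reward (here, accuracy plus format), and each completion's advantage is its reward standardized against the group. This is a minimal sketch with hypothetical reward values, not the paper's implementation:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one prompt's sampled completions.

    Each completion's advantage is its reward standardized against the
    group: (r_i - mean(group)) / (std(group) + eps). No value model or
    reasoning labels are involved; only the scalar rewards.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 sampled answers to one report:
# 1.0 for a correct label plus a small format bonus.
rewards = [1.1, 1.0, 0.1, 0.0]
advantages = group_relative_advantages(rewards)
# Correct, well-formatted completions receive positive advantage and are
# reinforced; incorrect ones receive negative advantage.
```

Because advantages are centered within each group, they sum to (approximately) zero per prompt, so the policy is pushed toward the better completions relative to its own current samples rather than toward an absolute reward scale.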
If this is right
- Classification accuracy exceeds levels reached by supervised fine-tuning alone.
- Reasoning recall and comprehensiveness rise despite the absence of reasoning supervision.
- The gains hold for lightweight LLMs across three independent radiologist-annotated datasets.
- No additional reasoning annotations are needed beyond the disease labels.
Where Pith is reading between the lines
- Optimizing output accuracy and format may serve as an implicit signal that improves internal reasoning processes.
- The two-stage method could be applied to other medical text tasks that require both factual correctness and explanatory reasoning.
- It points to possible reductions in annotation costs for building medical AI systems.
Load-bearing premise
That optimizing only for accuracy and format via GRPO, without any reasoning supervision, will still enhance reasoning recall and comprehensiveness.
What would settle it
Re-running the GRPO stage on the same datasets and finding no increase, or a decrease, in reasoning recall and comprehensiveness compared with supervised fine-tuning (SFT) alone would falsify the claim.
Original abstract
Accurate disease classification from radiology reports is essential for many applications. While supervised fine-tuning (SFT) of lightweight LLMs improves accuracy, it can degrade reasoning. We propose a two-stage approach: SFT on disease labels followed by Group Relative Policy Optimization (GRPO) to refine predictions by optimizing accuracy and format without reasoning supervision. Across three radiologist-annotated datasets, SFT outperformed baselines and GRPO further improved classification and enhanced reasoning recall and comprehensiveness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a two-stage pipeline for disease classification from radiology reports using lightweight LLMs: supervised fine-tuning (SFT) on disease labels, followed by Group Relative Policy Optimization (GRPO) that optimizes only accuracy and output format rewards (no explicit reasoning supervision). It claims that SFT already outperforms baselines and that GRPO yields further gains in classification accuracy while also improving reasoning recall and comprehensiveness across three radiologist-annotated datasets.
Significance. If the empirical results and reasoning metrics hold after proper controls, the work would show that RL-based post-training with narrow rewards can simultaneously boost predictive accuracy and the apparent quality of generated reasoning in a high-stakes medical domain. This would be useful for clinical NLP applications where both correctness and interpretability matter, and the absence of direct reasoning supervision in the GRPO stage would be a practical advantage.
Major comments (2)
- [§4 Experiments, §5 Results] The central claim that GRPO improves reasoning recall and comprehensiveness rests on unverified metrics. No ablation that removes the format reward is reported, nor is there a control for output length or structure; higher recall scores could therefore be an artifact of the format term encouraging longer or more structured text rather than genuine reasoning improvement. This directly affects the weakest assumption identified in the stress-test note.
- [Tables 1, 2] Quantitative results for accuracy, recall, and comprehensiveness are presented without statistical significance tests, confidence intervals, or multiple-run variance; the reported gains of GRPO over SFT therefore cannot be assessed for reliability, undermining the claim that GRPO "further improved classification."
Minor comments (2)
- [Abstract] The abstract contains no numerical results, baseline names, or dataset sizes; while the full text supplies these, the abstract should be self-contained for readers who encounter only the summary.
- [§3] Notation for the GRPO objective and the exact form of the accuracy and format rewards is introduced without an explicit equation reference in the main text; adding a numbered equation would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below and describe the revisions we will implement to strengthen the manuscript.
Point-by-point responses
Referee: [§4 Experiments, §5 Results] The central claim that GRPO improves reasoning recall and comprehensiveness rests on unverified metrics. No ablation that removes the format reward is reported, nor is there a control for output length or structure; higher recall scores could therefore be an artifact of the format term encouraging longer or more structured text rather than genuine reasoning improvement. This directly affects the weakest assumption identified in the stress-test note.
Authors: We agree that the lack of an ablation isolating the format reward leaves the source of the reasoning gains open to alternative explanations. In the revised manuscript we will add a controlled ablation in which GRPO is run with the accuracy reward alone (format reward removed) and directly compare reasoning recall and comprehensiveness against the full-reward setting. We will also report mean output lengths (in tokens) for the SFT and GRPO models on each dataset so that readers can assess whether length or structural changes account for the observed metric differences. These additions will clarify whether the improvements stem from the RL objective itself.
Revision: yes
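The ablation under discussion can be made concrete with a toy reward function that toggles the format term. The `<think>…</think><answer>…</answer>` template and the 0.5 format bonus below are illustrative assumptions, not the paper's specification:

```python
import re

# Assumed reasoning template; the paper's exact format is not specified here.
FORMAT_RE = re.compile(r"<think>.+</think>\s*<answer>.+</answer>", re.DOTALL)

def reward(completion, gold_label, use_format_reward=True):
    """Toy GRPO reward: 1.0 for a correct predicted label, plus an
    optional format bonus when the reasoning template is followed.
    Setting use_format_reward=False gives the accuracy-only ablation."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    predicted = m.group(1).strip() if m else completion.strip()
    r = 1.0 if predicted == gold_label else 0.0
    if use_format_reward and FORMAT_RE.search(completion):
        r += 0.5  # assumed bonus magnitude
    return r

out = "<think>Opacities in the right lower lobe.</think><answer>Pneumonia</answer>"
full = reward(out, "Pneumonia")                               # accuracy + format
ablated = reward(out, "Pneumonia", use_format_reward=False)   # accuracy only
```

Comparing reasoning metrics between policies trained with `full`-style versus `ablated`-style rewards (holding everything else fixed) is exactly the control the referee asks for.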
Referee: [Tables 1, 2] Quantitative results for accuracy, recall, and comprehensiveness are presented without statistical significance tests, confidence intervals, or multiple-run variance; the reported gains of GRPO over SFT therefore cannot be assessed for reliability, undermining the claim that GRPO "further improved classification."
Authors: We concur that statistical characterization is necessary to substantiate the reliability of the reported improvements. In the revised version we will repeat all experiments across five independent random seeds, present means and standard deviations for every metric, and add 95% confidence intervals together with paired statistical tests (e.g., Wilcoxon signed-rank) to Tables 1 and 2. This will allow readers to evaluate the significance and stability of the GRPO gains over SFT.
Revision: yes
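One stdlib-only way to produce the proposed confidence intervals is a paired bootstrap over per-example correctness; a Wilcoxon signed-rank test (e.g., `scipy.stats.wilcoxon`) would complement it. The accuracy vectors below are hypothetical, for illustration only:

```python
import random
from statistics import mean

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap CI for mean(scores_b) - mean(scores_a).

    scores_a / scores_b hold per-example correctness (0 or 1) for two
    models on the *same* test set; resampling indices keeps pairs aligned,
    which is what makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(mean(scores_b[i] for i in idx)
                     - mean(scores_a[i] for i in idx))
    diffs.sort()
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical per-report correctness for SFT vs. SFT+GRPO (20 reports).
sft  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
grpo = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
lo, hi = paired_bootstrap_ci(sft, grpo)
# An interval that excludes 0 suggests the GRPO gain is unlikely to be noise.
```

On real data this would be run per dataset and per metric, with the observed mean difference reported alongside the interval.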
Circularity Check
No significant circularity; empirical claims rest on external datasets and standard training stages.
Full rationale
The paper presents a sequential two-stage pipeline (SFT on disease labels, then GRPO optimizing only accuracy and format rewards) evaluated on three independent radiologist-annotated datasets. All performance claims, including gains in classification accuracy and reasoning recall/comprehensiveness, are framed as measured outcomes from held-out test sets rather than quantities defined in terms of the optimization objective itself. No equations, self-citations, or ansatzes are shown that would reduce the reported reasoning improvements to a redefinition of the format/accuracy rewards or to prior author work. The derivation chain is therefore self-contained against external benchmarks.