MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding

Yuhao Su , Anwesa Choudhuri , Zhongpai Gao , Benjamin Planche , Van Nguyen Nguyen , Meng Zheng , Yuhan Shen , Arun Innanje

show 3 more authors

Terrence Chen Ehsan Elhamifar Ziyan Wu

Authors on Pith no claims yet

classification 💻 cs.CV

keywords medicalmedgrpovideoacrossmedvidbenchrewardtrainingunderstanding

0 comments

read the original abstract

Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce \textbf{MedVidBench}, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce \textbf{MedGRPO}, a novel RL framework for balanced multi-dataset training with two key innovations: (1) \emph{cross-dataset reward normalization} that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a \emph{medical LLM judge} that evaluates caption quality on five clinical dimensions through comparative similarity scoring. Supervised fine-tuning Qwen2.5-VL-7B on MedVidBench outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, while MedGRPO further improves the SFT baseline on grounding and captioning. Our work establishes a foundational benchmark and training methodology for advancing medical video understanding with VLMs. Our project website is available at: https://uii-america.github.io/MedGRPO/.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild
cs.CV 2026-05 unverdicted novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?
cs.CV 2026-04 unverdicted novelty 5.0

LLM-generated narratives from surgical videos enable scalable vision-language pre-training through a noise-robust framework that maintains visual model performance on surgical benchmarks.