MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
Pith reviewed 2026-05-17 00:10 UTC · model grok-4.3
The pith
MedGRPO uses cross-dataset reward normalization and a medical LLM judge to stabilize RL on heterogeneous medical videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MedGRPO is a multi-task reinforcement learning framework that applies cross-dataset reward normalization, mapping each dataset's median performance to a shared reward value, together with a medical LLM judge that scores generated captions through comparative similarity on five clinical dimensions. These components enable stable optimization across imbalanced medical video datasets where standard RL training collapses, and they produce measurable gains over the supervised fine-tuning baseline specifically on grounding and captioning.
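The normalization step in the core claim can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes the logistic form quoted from the paper elsewhere on this page, r = 1/(1+exp(−k·(x−p50)/IQR)). The function name `fit_normalizer`, the default `k=1.0`, the clamp on the exponent, and the example scores are all hypothetical.

```python
import math
from statistics import quantiles


def fit_normalizer(scores, k=1.0):
    """Build a per-dataset map from raw scores to (0, 1).

    Assumed form: logistic of (x - median) / IQR, so the dataset's
    median raw score always lands at 0.5. `k` is a hypothetical
    steepness parameter; its value is not given in the source.
    """
    # Quartiles of the raw scores; the middle cut point is the median.
    q1, p50, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # fall back to 1.0 if the spread is zero

    def normalize(x):
        z = k * (x - p50) / iqr
        z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    return normalize


if __name__ == "__main__":
    # Made-up raw scores for an "easy" and a "hard" dataset.
    easy = [0.70, 0.75, 0.80, 0.85, 0.90]
    hard = [0.05, 0.10, 0.12, 0.15, 0.20]
    norm_easy = fit_normalizer(easy)
    norm_hard = fit_normalizer(hard)
    print(norm_easy(0.80), norm_hard(0.12))
```

Under this form, a raw score at a dataset's own median maps to 0.5 whether the dataset is easy or hard, which is one way to read the "shared reward value" in the claim.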
What carries the argument
Cross-dataset reward normalization that aligns each dataset's median performance to a common reward value, combined with a medical LLM judge that scores captions on five clinical dimensions via comparative similarity.
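To make the judge side concrete, here is a hedged sketch of how five per-dimension scores could be folded into one caption reward. The source does not name the five clinical dimensions or say how they are combined, so the placeholder names in `DIMENSIONS`, the [0, 1] score range, and the equal-weight mean in `caption_reward` are assumptions for illustration only.

```python
from statistics import mean

# Placeholder names: the source does not list the five clinical dimensions.
DIMENSIONS = ("dim_1", "dim_2", "dim_3", "dim_4", "dim_5")


def caption_reward(judge_scores):
    """Combine per-dimension similarity scores into a single reward.

    `judge_scores` maps each dimension name to a score in [0, 1] that a
    judge model assigned when comparing a generated caption with a
    reference. Equal weighting is an assumption, not the paper's rule.
    """
    missing = [d for d in DIMENSIONS if d not in judge_scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0.0 <= judge_scores[d] <= 1.0:
            raise ValueError(f"score for {d} is outside [0, 1]")
    return mean(judge_scores[d] for d in DIMENSIONS)


if __name__ == "__main__":
    example = {"dim_1": 0.9, "dim_2": 0.7, "dim_3": 0.8, "dim_4": 0.6, "dim_5": 1.0}
    print(caption_reward(example))
```

Validating that all five dimensions are present keeps a malformed judge response from silently producing a reward based on fewer dimensions.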
Load-bearing premise
Mapping every dataset's median performance to one shared reward value and scoring captions with a five-dimension medical LLM judge will generate stable, unbiased training signals without introducing new biases or requiring unreported task-specific adjustments.
What would settle it
Running MedGRPO on the MedVidBench datasets and observing either training collapse or no improvement over supervised fine-tuning on grounding and captioning metrics would show the normalization and judge do not deliver the claimed stability and gains.
Original abstract
Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge that evaluates caption quality on five clinical dimensions through comparative similarity scoring. Supervised fine-tuning Qwen2.5-VL-7B on MedVidBench outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, while MedGRPO further improves the SFT baseline on grounding and captioning. Our work establishes a foundational benchmark and training methodology for advancing medical video understanding with VLMs. Our project website is available at: https://uii-america.github.io/MedGRPO/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MedVidBench, a benchmark comprising 531,850 video-instruction pairs across 8 heterogeneous medical sources covering video-, segment-, and frame-level tasks. It shows that supervised fine-tuning of Qwen2.5-VL-7B on this benchmark outperforms GPT-4.1 and Gemini-2.5-Flash on all tasks. To enable effective multi-task RL, the authors propose MedGRPO, which uses cross-dataset reward normalization (mapping each dataset's median performance to a common value) and a five-dimension medical LLM judge with comparative similarity scoring to prevent reward imbalance and training collapse. MedGRPO is reported to further improve the SFT baseline specifically on grounding and captioning tasks.
Significance. If the empirical claims hold under rigorous validation, the work provides a large-scale, expert-curated benchmark for medical video understanding and a practical RL method for balancing heterogeneous multi-dataset training. The benchmark curation pipeline (expert-guided prompting and dual-model validation) and the explicit handling of reward-scale imbalance represent concrete contributions that could support future VLM development in clinical video analysis.
major comments (3)
- [Abstract] Abstract: The central claim that SFT Qwen2.5-VL-7B outperforms GPT-4.1 and Gemini-2.5-Flash, and that MedGRPO further improves the SFT baseline on grounding and captioning, is presented without any numerical deltas, error bars, statistical significance tests, or ablation tables. This absence makes it impossible to evaluate the magnitude or reliability of the reported gains.
- [Abstract] Abstract: The cross-dataset reward normalization step is described only at a high level (mapping medians to a common value). No quantitative details are given on the chosen common reward value, reward histograms before/after normalization, variance across the 8 sources, or ablation studies showing that this step alone prevents the collapse observed with standard RL.
- [Abstract] Abstract: The medical LLM judge is introduced as evaluating caption quality on five clinical dimensions via comparative similarity scoring, yet the manuscript provides no information on the exact judge prompt, inter-rater agreement with human experts, or any analysis of potential clinical biases introduced by the judge model.
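The second major comment asks for per-source reward statistics before and after normalization. A minimal sketch of such a diagnostic follows, assuming rewards are available as plain lists keyed by source name. The function `reward_summary`, the source names, and the numbers are hypothetical and do not come from the paper.

```python
from statistics import median, pvariance


def reward_summary(rewards_by_source):
    """Summarize reward scale per source.

    Returns a dict of per-source median and population variance, plus
    the gap between the largest and smallest median across sources.
    A large gap indicates mismatched reward scales.
    """
    rows = {
        name: {"median": median(values), "variance": pvariance(values)}
        for name, values in rewards_by_source.items()
    }
    medians = [row["median"] for row in rows.values()]
    return rows, max(medians) - min(medians)


if __name__ == "__main__":
    # Made-up raw rewards for two hypothetical sources.
    raw = {
        "source_a": [0.10, 0.20, 0.30],
        "source_b": [0.70, 0.80, 0.90],
    }
    rows, median_gap = reward_summary(raw)
    for name, row in rows.items():
        print(name, row)
    print("median gap:", median_gap)
```

Running the same summary on raw and on normalized rewards would give the before/after comparison the referee requests; whether a shrinking gap is sufficient to prevent collapse is a question only the paper's ablation can answer.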
minor comments (1)
- The project website URL is provided, but the manuscript gives no details on which resources (e.g., dataset splits, code, or judge prompts) are released there.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment point-by-point below and have revised the manuscript to incorporate additional quantitative details, methodological clarifications, and supporting analyses where feasible.
-
Referee: [Abstract] Abstract: The central claim that SFT Qwen2.5-VL-7B outperforms GPT-4.1 and Gemini-2.5-Flash, and that MedGRPO further improves the SFT baseline on grounding and captioning, is presented without any numerical deltas, error bars, statistical significance tests, or ablation tables. This absence makes it impossible to evaluate the magnitude or reliability of the reported gains.
Authors: We agree that the abstract would benefit from explicit numerical results to convey the magnitude of improvements. In the revised manuscript, we will update the abstract to include key performance deltas (e.g., average accuracy gains of X% over GPT-4.1 and Y% over Gemini-2.5-Flash across tasks), while noting that full tables with error bars, statistical significance tests (e.g., paired t-tests), and ablation studies appear in Sections 4 and 5 of the main paper. Due to abstract length limits, we will summarize the most salient metrics rather than include exhaustive tables. revision: yes
-
Referee: [Abstract] Abstract: The cross-dataset reward normalization step is described only at a high level (mapping medians to a common value). No quantitative details are given on the chosen common reward value, reward histograms before/after normalization, variance across the 8 sources, or ablation studies showing that this step alone prevents the collapse observed with standard RL.
Authors: We acknowledge the need for greater quantitative transparency on the normalization procedure. The revised version will include: (1) the specific common reward value selected (e.g., 0.5), (2) reward distribution histograms before and after normalization for each of the 8 sources, (3) reported variance statistics across datasets, and (4) a dedicated ablation study comparing training stability with and without normalization. These additions will be placed in Section 3.2 and a new appendix figure. revision: yes
-
Referee: [Abstract] Abstract: The medical LLM judge is introduced as evaluating caption quality on five clinical dimensions via comparative similarity scoring, yet the manuscript provides no information on the exact judge prompt, inter-rater agreement with human experts, or any analysis of potential clinical biases introduced by the judge model.
Authors: We will expand the description of the medical LLM judge in the revised manuscript. The exact judge prompt will be provided verbatim in Appendix B. We have performed a post-hoc human validation on a random subset of 200 samples and will report inter-rater agreement metrics (e.g., Cohen's kappa and percentage agreement) with expert clinicians. A new discussion subsection will analyze potential clinical biases (e.g., model preference for certain terminology) and how the comparative similarity scoring and five-dimension rubric help mitigate them. revision: partial
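The agreement metrics named in the last response can be computed as below. This is a small sketch that assumes the judge and a clinician each assign one categorical label per sample; the labels and function names are illustrative and are not taken from the paper's validation study.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of samples on which the two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    judge = ["good", "good", "bad", "bad"]
    clinician = ["good", "bad", "bad", "bad"]
    print(percent_agreement(judge, clinician), cohens_kappa(judge, clinician))
```

Reporting kappa alongside raw agreement matters because two raters who both favor one label can agree often by chance alone.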
Circularity Check
No circularity detected; normalization and judge are explicit design choices, not self-referential derivations
The paper presents MedVidBench curation and MedGRPO's two innovations (cross-dataset median reward normalization and five-dimension LLM judge with comparative scoring) as methodological solutions to reward imbalance and evaluation. These are described as design choices that map medians to a common value and apply clinical-dimension scoring, respectively. No equations appear in the abstract or provided text, and no self-citations are invoked to justify uniqueness or load-bearing premises. The reported gains (SFT outperforming GPT-4.1/Gemini and MedGRPO improving SFT on grounding/captioning) are framed as empirical results from applying the methods, not quantities forced by construction from the same fitted parameters or inputs. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert-guided prompting and dual-model validation produce high-quality instruction pairs without systematic bias.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
cross-dataset reward normalization that maps each dataset’s median performance to a common reward value... logistic transformation... $r^{(d,t)}_{\mathrm{norm}}(x) = 1 / \left(1 + \exp\left(-k \cdot (x - p_{50}) / \mathrm{IQR}\right)\right)$
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
medical LLM judge... five clinical dimensions... comparative similarity scoring
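As a quick check on the logistic form quoted in the first entry above, the following evaluates it at the median and one interquartile range away. It assumes k > 0 and IQR > 0; the quoted passage does not give the value of k, and reading (d,t) as a superscript index is an interpretation of the flattened text.

```latex
\begin{align*}
r^{(d,t)}_{\mathrm{norm}}(x)
  &= \frac{1}{1 + \exp\!\left(-k \cdot \dfrac{x - p_{50}}{\mathrm{IQR}}\right)}, \\
r^{(d,t)}_{\mathrm{norm}}(p_{50})
  &= \frac{1}{1 + e^{0}} = \frac{1}{2}, \\
r^{(d,t)}_{\mathrm{norm}}(p_{50} \pm \mathrm{IQR})
  &= \frac{1}{1 + e^{\mp k}}.
\end{align*}
```

So the median of every dataset maps to 1/2 regardless of its raw scale, and k controls how far a score one IQR from the median is pushed toward 0 or 1.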
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
MedHorizon: Towards Long-context Medical Video Understanding in the Wild
MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
-
Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?
LLM-generated narratives from surgical videos enable scalable vision-language pre-training through a noise-robust framework that maintains visual model performance on surgical benchmarks.