
arxiv: 2512.06581 · v4 · submitted 2025-12-06 · 💻 cs.CV

MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding

Pith reviewed 2026-05-17 00:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical video understanding, reinforcement learning, vision-language models, reward normalization, multi-task training, MedVidBench, caption evaluation, video grounding

The pith

MedGRPO uses cross-dataset reward normalization and a medical LLM judge to stabilize RL on heterogeneous medical videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MedVidBench, a benchmark containing 531,850 video-instruction pairs drawn from eight medical sources and covering video, segment, and frame-level tasks. Supervised fine-tuning of Qwen2.5-VL-7B on this benchmark already surpasses GPT-4.1 and Gemini-2.5-Flash across tasks. Standard reinforcement learning collapses because reward scales differ sharply across datasets. MedGRPO counters this imbalance with two mechanisms that produce consistent training signals and deliver further gains on grounding and captioning. The result supplies both data and a training recipe for vision-language models that must handle spatial precision, temporal order, and clinical meaning in medical video.

Core claim

MedGRPO is a multi-task reinforcement learning framework that applies cross-dataset reward normalization, mapping each dataset's median performance to a shared reward value, together with a medical LLM judge that scores generated captions through comparative similarity on five clinical dimensions. These components enable stable optimization across imbalanced medical video datasets where standard RL training collapses, and they produce measurable gains over the supervised fine-tuning baseline specifically on grounding and captioning.
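The normalization idea can be illustrated with a minimal sketch. It is not the authors' implementation: the source gives no formula, so the piecewise-linear mapping, the assumption that raw rewards lie in [0, 1], the function name `make_median_normalizer`, and the target value 0.5 (the example value floated in the simulated rebuttal) are all assumptions made for illustration.

```python
from statistics import median


def make_median_normalizer(raw_rewards, target=0.5):
    """Build a mapping that sends this dataset's median raw reward to `target`.

    Hypothetical helper: assumes raw rewards lie in [0, 1] and uses a
    piecewise-linear map with 0 -> 0, median -> target, 1 -> 1.
    """
    m = median(raw_rewards)

    def normalize(r):
        r = min(max(r, 0.0), 1.0)  # clamp to the assumed reward range
        if m <= 0.0:
            # Degenerate case: median is 0, so stretch [0, 1] onto [target, 1].
            return target + (1.0 - target) * r
        if m >= 1.0:
            # Degenerate case: median is 1, so squeeze [0, 1] onto [0, target].
            return target * r
        if r <= m:
            return target * r / m
        return target + (1.0 - target) * (r - m) / (1.0 - m)

    return normalize


# Two toy datasets with very different raw reward scales.
easy_rewards = [0.7, 0.8, 0.9]
hard_rewards = [0.05, 0.1, 0.2]

norm_easy = make_median_normalizer(easy_rewards)
norm_hard = make_median_normalizer(hard_rewards)

# A median-quality response now earns the same reward on either dataset.
print(norm_easy(0.8), norm_hard(0.1))
```

Under these assumptions a typical response on a hard dataset and a typical response on an easy one receive the same training signal, which is the property the core claim attributes to the normalization step.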

What carries the argument

cross-dataset reward normalization that aligns each dataset's median performance to a common reward value, combined with a medical LLM judge scoring captions on five clinical dimensions via comparative similarity
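How a five-dimension judge could be turned into a scalar reward is sketched below. This is an illustration under stated assumptions, not the paper's method: the dimension names, the 1 to 5 score scale, the unweighted mean, and the `judge` callable (a stand-in for an LLM call) are all hypothetical.

```python
from typing import Callable

# Hypothetical dimension names; the source does not list the five dimensions.
DIMENSIONS = (
    "procedural_context",
    "action_and_state",
    "anatomy",
    "instruments",
    "clinical_relevance",
)


def judge_reward(
    candidate: str,
    reference: str,
    judge: Callable[[str, str, str], int],
) -> float:
    """Average per-dimension similarity scores (assumed 1..5) into [0, 1]."""
    scores = []
    for dim in DIMENSIONS:
        score = judge(candidate, reference, dim)
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range for {dim}: {score}")
        scores.append(score)
    mean_score = sum(scores) / len(scores)
    return (mean_score - 1.0) / 4.0  # rescale 1..5 onto 0..1


def stub_judge(candidate: str, reference: str, dimension: str) -> int:
    """Toy stand-in for an LLM judge: exact match scores 5, anything else 1."""
    return 5 if candidate == reference else 1


print(judge_reward("grasper retracts gallbladder", "grasper retracts gallbladder", stub_judge))
```

Passing the judge as a callable keeps the aggregation logic testable without any model access; a real judge would replace `stub_judge` with a prompted comparison of the candidate against the reference caption.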

Load-bearing premise

Mapping every dataset's median performance to one shared reward value and scoring captions with a five-dimension medical LLM judge will generate stable, unbiased training signals without introducing new biases or requiring unreported task-specific adjustments.

What would settle it

Running MedGRPO on the MedVidBench datasets and observing either training collapse or no improvement over supervised fine-tuning on grounding and captioning metrics would show the normalization and judge do not deliver the claimed stability and gains.

Figures

Figures reproduced from arXiv: 2512.06581 by Anwesa Choudhuri, Arun Innanje, Benjamin Planche, Ehsan Elhamifar, Meng Zheng, Terrence Chen, Van Nguyen Nguyen, Yuhan Shen, Yuhao Su, Zhongpai Gao, Ziyan Wu.

Figure 1: Overview of MedVidBench. (a) High quality data curation pipeline for MedVidBench. We leverage expert knowledge into …
Figure 2: Overview of MedGRPO. (a) MedGRPO framework with cross-dataset reward normalization and medical LLM judge evaluation.
Figure 3: Qualitative comparison of region captioning generation.
Figure 4: Scaling law analysis. Performance on Dense Video Cap…
Figure 5: Interface for human validation study. Users were provided detailed instruction to rank caption after watching a short video. An instruction example for a good and bad caption was provided.
Figure 6: Human validation study results. User preference comparison with 12 participants on CoPESD dataset. “w/ Expert Prompt” refers to captions generated using our annotation-enriched prompting with overlaid bounding boxes, procedure context, and expert annotations. “w/o Expert Prompt” refers to captions generated from raw frames only with minimal prompting. Participants strongly prefer captions generated with…
Figure 7: Dataset distribution analysis. Dataset distribution across 532K QA instances from 8 medical video datasets. (Left) Answer length distribution showing word counts ranging from 1 to 1,170 words (median: 21, mean: 41). Short answers (≤5 words, 28.1%) are predominantly from temporal action grounding tasks, while long answers (>20 words, 51.8%) come mainly from dense video captioning and region captioning tasks…
Figure 8: Examples of diverse tasks. 5 diverse tasks from MedVidBench (Dense Video Captioning, Spatio-Temporal Grounding, Critical View Safety, Video Summary, and Next Action Prediction) spanning 3 domains (Nursing, Laparoscopic Surgery and Open Surgery).
Figure 9: Qualitative examples on dense video captioning.
Figure 10: Qualitative examples on video summary.
Figure 11: Failure case examples on Critical View of Safety (CVS) assessment.
Original abstract

Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge that evaluates caption quality on five clinical dimensions through comparative similarity scoring. Supervised fine-tuning Qwen2.5-VL-7B on MedVidBench outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, while MedGRPO further improves the SFT baseline on grounding and captioning. Our work establishes a foundational benchmark and training methodology for advancing medical video understanding with VLMs. Our project website is available at: https://uii-america.github.io/MedGRPO/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces MedVidBench, a benchmark comprising 531,850 video-instruction pairs across 8 heterogeneous medical sources covering video-, segment-, and frame-level tasks. It shows that supervised fine-tuning of Qwen2.5-VL-7B on this benchmark outperforms GPT-4.1 and Gemini-2.5-Flash on all tasks. To enable effective multi-task RL, the authors propose MedGRPO, which uses cross-dataset reward normalization (mapping each dataset's median performance to a common value) and a five-dimension medical LLM judge with comparative similarity scoring to prevent reward imbalance and training collapse. MedGRPO is reported to further improve the SFT baseline specifically on grounding and captioning tasks.

Significance. If the empirical claims hold under rigorous validation, the work provides a large-scale, expert-curated benchmark for medical video understanding and a practical RL method for balancing heterogeneous multi-dataset training. The benchmark curation pipeline (expert-guided prompting and dual-model validation) and the explicit handling of reward-scale imbalance represent concrete contributions that could support future VLM development in clinical video analysis.

major comments (3)
  1. [Abstract] Abstract: The central claim that SFT Qwen2.5-VL-7B outperforms GPT-4.1 and Gemini-2.5-Flash, and that MedGRPO further improves the SFT baseline on grounding and captioning, is presented without any numerical deltas, error bars, statistical significance tests, or ablation tables. This absence makes it impossible to evaluate the magnitude or reliability of the reported gains.
  2. [Abstract] Abstract: The cross-dataset reward normalization step is described only at a high level (mapping medians to a common value). No quantitative details are given on the chosen common reward value, reward histograms before/after normalization, variance across the 8 sources, or ablation studies showing that this step alone prevents the collapse observed with standard RL.
  3. [Abstract] Abstract: The medical LLM judge is introduced as evaluating caption quality on five clinical dimensions via comparative similarity scoring, yet the manuscript provides no information on the exact judge prompt, inter-rater agreement with human experts, or any analysis of potential clinical biases introduced by the judge model.
minor comments (1)
  1. The project website URL is provided but no details on what resources (e.g., dataset splits, code, or judge prompts) are released there.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point-by-point below and have revised the manuscript to incorporate additional quantitative details, methodological clarifications, and supporting analyses where feasible.

  1. Referee: [Abstract] Abstract: The central claim that SFT Qwen2.5-VL-7B outperforms GPT-4.1 and Gemini-2.5-Flash, and that MedGRPO further improves the SFT baseline on grounding and captioning, is presented without any numerical deltas, error bars, statistical significance tests, or ablation tables. This absence makes it impossible to evaluate the magnitude or reliability of the reported gains.

    Authors: We agree that the abstract would benefit from explicit numerical results to convey the magnitude of improvements. In the revised manuscript, we will update the abstract to include key performance deltas (e.g., average accuracy gains of X% over GPT-4.1 and Y% over Gemini-2.5-Flash across tasks), while noting that full tables with error bars, statistical significance tests (e.g., paired t-tests), and ablation studies appear in Sections 4 and 5 of the main paper. Due to abstract length limits, we will summarize the most salient metrics rather than include exhaustive tables. revision: yes

  2. Referee: [Abstract] Abstract: The cross-dataset reward normalization step is described only at a high level (mapping medians to a common value). No quantitative details are given on the chosen common reward value, reward histograms before/after normalization, variance across the 8 sources, or ablation studies showing that this step alone prevents the collapse observed with standard RL.

    Authors: We acknowledge the need for greater quantitative transparency on the normalization procedure. The revised version will include: (1) the specific common reward value selected (e.g., 0.5), (2) reward distribution histograms before and after normalization for each of the 8 sources, (3) reported variance statistics across datasets, and (4) a dedicated ablation study comparing training stability with and without normalization. These additions will be placed in Section 3.2 and a new appendix figure. revision: yes

  3. Referee: [Abstract] Abstract: The medical LLM judge is introduced as evaluating caption quality on five clinical dimensions via comparative similarity scoring, yet the manuscript provides no information on the exact judge prompt, inter-rater agreement with human experts, or any analysis of potential clinical biases introduced by the judge model.

    Authors: We will expand the description of the medical LLM judge in the revised manuscript. The exact judge prompt will be provided verbatim in Appendix B. We have performed a post-hoc human validation on a random subset of 200 samples and will report inter-rater agreement metrics (e.g., Cohen's kappa and percentage agreement) with expert clinicians. A new discussion subsection will analyze potential clinical biases (e.g., model preference for certain terminology) and how the comparative similarity scoring and five-dimension rubric help mitigate them. revision: partial
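For readers unfamiliar with the agreement metric named in the third response, the following is a minimal sketch of Cohen's kappa for two raters over categorical scores. The function name and the toy score lists are illustrative assumptions; they are not data from the paper or its promised validation study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if both raters labeled independently
    # according to their own marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: made-up 1..5 scores from an LLM judge and a human expert.
judge_scores = [5, 4, 4, 2, 1, 3]
expert_scores = [5, 4, 3, 2, 1, 3]
kappa = cohens_kappa(judge_scores, expert_scores)
print(round(kappa, 4))
```

Plain kappa treats a one-point disagreement the same as a four-point one; for ordinal rubric scores a weighted variant may be the more informative choice, though the rebuttal does not say which would be reported.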

Circularity Check

0 steps flagged

No circularity detected; normalization and judge are explicit design choices, not self-referential derivations


The paper presents MedVidBench curation and MedGRPO's two innovations (cross-dataset median reward normalization and five-dimension LLM judge with comparative scoring) as methodological solutions to reward imbalance and evaluation. These are described as design choices that map medians to a common value and apply clinical-dimension scoring, respectively. No equations appear in the abstract or provided text, and no self-citations are invoked to justify uniqueness or load-bearing premises. The reported gains (SFT outperforming GPT-4.1/Gemini and MedGRPO improving SFT on grounding/captioning) are framed as empirical results from applying the methods, not quantities forced by construction from the same fitted parameters or inputs. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the new benchmark is representative and that the two RL modifications are sufficient to stabilize training; no free parameters are explicitly fitted in the abstract description, and no new physical entities are postulated.

axioms (1)
  • Domain assumption: Expert-guided prompting and dual-model validation produce high-quality instruction pairs without systematic bias.
    Invoked in the MedVidBench curation description.

pith-pipeline@v0.9.0 · 5602 in / 1281 out tokens · 28354 ms · 2026-05-17T00:10:36.108663+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MedHorizon: Towards Long-context Medical Video Understanding in the Wild

    cs.CV 2026-05 unverdicted novelty 8.0

    MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

  2. Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?

    cs.CV 2026-04 unverdicted novelty 5.0

    LLM-generated narratives from surgical videos enable scalable vision-language pre-training through a noise-robust framework that maintains visual model performance on surgical benchmarks.
