MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding

Anwesa Choudhuri; Arun Innanje; Benjamin Planche; Ehsan Elhamifar; Meng Zheng; Terrence Chen; Van Nguyen Nguyen; Yuhan Shen; Yuhao Su; Zhongpai Gao

arxiv: 2512.06581 · v4 · submitted 2025-12-06 · 💻 cs.CV


Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu


Pith reviewed 2026-05-17 00:10 UTC · model grok-4.3

classification 💻 cs.CV

keywords medical video understanding · reinforcement learning · vision-language models · reward normalization · multi-task training · MedVidBench · caption evaluation · video grounding


The pith

MedGRPO uses cross-dataset reward normalization and a medical LLM judge to stabilize RL on heterogeneous medical videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MedVidBench, a benchmark containing 531,850 video-instruction pairs drawn from eight medical sources and covering video, segment, and frame-level tasks. Supervised fine-tuning of Qwen2.5-VL-7B on this benchmark already surpasses GPT-4.1 and Gemini-2.5-Flash across tasks. Standard reinforcement learning collapses because reward scales differ sharply across datasets. MedGRPO counters this imbalance with two mechanisms that produce consistent training signals and deliver further gains on grounding and captioning. The result supplies both data and a training recipe for vision-language models that must handle spatial precision, temporal order, and clinical meaning in medical video.

Core claim

MedGRPO is a multi-task reinforcement learning framework that applies cross-dataset reward normalization, mapping each dataset's median performance to a shared reward value, together with a medical LLM judge that scores generated captions through comparative similarity on five clinical dimensions. These components enable stable optimization across imbalanced medical video datasets where standard RL training collapses, and they produce measurable gains over the supervised fine-tuning baseline specifically on grounding and captioning.
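As a rough illustration of the normalization step, the sketch below applies the logistic form quoted later on this page, r_norm(x) = 1/(1+exp(−k·(x−p50)/IQR)), to two synthetic reward samples on very different scales. The helper names, the slope `k=1.0`, the clamp, and the sample values are assumptions made for illustration; none of them come from the paper.

```python
import math
import statistics


def fit_stats(raw_rewards):
    """Return (median, interquartile range) of one dataset's raw rewards."""
    q1, _, q3 = statistics.quantiles(raw_rewards, n=4, method="inclusive")
    return statistics.median(raw_rewards), q3 - q1


def normalize(x, p50, iqr, k=1.0, eps=1e-8):
    """Logistic squash centred on the dataset median (k is an assumed slope)."""
    z = k * (x - p50) / max(iqr, eps)
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic raw rewards on two very different scales (illustrative only).
grounding_iou = [0.05, 0.10, 0.15, 0.20, 0.25]
caption_score = [2.0, 3.0, 3.5, 4.0, 4.5]

normalized = {}
for name, rewards in [("grounding", grounding_iou), ("caption", caption_score)]:
    p50, iqr = fit_stats(rewards)
    normalized[name] = [normalize(r, p50, iqr) for r in rewards]
```

In this toy setup both medians land on the same normalized value, so neither dataset dominates simply because its raw metric lives on a larger scale.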

What carries the argument

Cross-dataset reward normalization that aligns each dataset's median performance to a common reward value, combined with a medical LLM judge that scores captions on five clinical dimensions via comparative similarity.
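One hedged way to picture how the judge's output could feed a scalar reward: the sketch below averages five per-dimension rubric scores and rescales them to [0, 1]. The 1 to 5 scale, the equal weighting, and the dimension names are assumptions for illustration (the names are placeholders, not the paper's rubric), and the judge call itself is not modelled.

```python
from statistics import fmean

# Placeholder dimension names; the paper's actual five dimensions may differ.
DIMENSIONS = ("dim_1", "dim_2", "dim_3", "dim_4", "dim_5")


def judge_reward(scores, lo=1, hi=5):
    """Map five rubric scores (assumed scale lo..hi) to a reward in [0, 1]."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not lo <= scores[d] <= hi:
            raise ValueError(f"{d} score {scores[d]} outside [{lo}, {hi}]")
    mean_score = fmean(scores[d] for d in DIMENSIONS)
    return (mean_score - lo) / (hi - lo)


# Hypothetical judge output for one generated caption.
example = {"dim_1": 5, "dim_2": 4, "dim_3": 3, "dim_4": 4, "dim_5": 4}
reward = judge_reward(example)
```

A weighted mean or a minimum over dimensions would be equally plausible aggregation choices; the page does not say which one the authors use.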

Load-bearing premise

Mapping every dataset's median performance to one shared reward value and scoring captions with a five-dimension medical LLM judge will generate stable, unbiased training signals without introducing new biases or requiring unreported task-specific adjustments.

What would settle it

Running MedGRPO on the MedVidBench datasets and observing either training collapse or no improvement over supervised fine-tuning on grounding and captioning metrics would show the normalization and judge do not deliver the claimed stability and gains.

Figures

Figures reproduced from arXiv: 2512.06581 by Anwesa Choudhuri, Arun Innanje, Benjamin Planche, Ehsan Elhamifar, Meng Zheng, Terrence Chen, Van Nguyen Nguyen, Yuhan Shen, Yuhao Su, Zhongpai Gao, Ziyan Wu.

**Figure 1.** Overview of MedVidBench. (a) High quality data curation pipeline for MedVidBench. We leverage expert knowledge into …

**Figure 2.** Overview of MedGRPO. (a) MedGRPO framework with cross-dataset reward normalization and medical LLM judge evaluation.

**Figure 3.** Qualitative comparison of region captioning generation.

**Figure 4.** Scaling law analysis. Performance on Dense Video Cap…

**Figure 5.** Interface for human validation study. Users were provided detailed instruction to rank caption after watching a short video. An instruction example for a good and bad caption was provided …

**Figure 6.** Human validation study results. User preference comparison with 12 participants on CoPESD dataset. “w/ Expert Prompt” refers to captions generated using our annotation-enriched prompting with overlaid bounding boxes, procedure context, and expert annotations. “w/o Expert Prompt” refers to captions generated from raw frames only with minimal prompting. Participants strongly prefer captions generated with…

**Figure 7.** Dataset distribution analysis. Dataset distribution across 532K QA instances from 8 medical video datasets. (Left) Answer length distribution showing word counts ranging from 1 to 1,170 words (median: 21, mean: 41). Short answers (≤5 words, 28.1%) are predominantly from temporal action grounding tasks, while long answers (>20 words, 51.8%) come mainly from dense video captioning and region captioning tasks…

**Figure 8.** Examples of diverse tasks. 5 diverse tasks from MedVidBench (Dense Video Captioning, Spatio-Temporal Grounding, Critical View Safety, Video Summary, and Next Action Prediction) spanning 3 domains (Nursing, Laparoscopic Surgery and Open Surgery).

**Figure 9.** Qualitative examples on dense video captioning.

**Figure 10.** Qualitative examples on video summary.

**Figure 11.** Failure case examples on Critical View of Safety (CVS) assessment.

Original abstract

Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge that evaluates caption quality on five clinical dimensions through comparative similarity scoring. Supervised fine-tuning Qwen2.5-VL-7B on MedVidBench outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, while MedGRPO further improves the SFT baseline on grounding and captioning. Our work establishes a foundational benchmark and training methodology for advancing medical video understanding with VLMs. Our project website is available at: https://uii-america.github.io/MedGRPO/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

MedVidBench is the main thing here, a large curated medical video dataset that could serve as a shared benchmark, while MedGRPO's median normalization and LLM judge give a workable fix for RL imbalance across heterogeneous tasks. The authors assembled 531,850 video-instruction pairs from eight medical sources spanning video, segment, and frame levels, using expert-guided prompting and dual-model validation. That scale and pipeline represent concrete effort that addresses the lack of standardized data in clinical video AI. They also report that supervised fine-tuning of Qwen2.5-VL-7B on this set beats GPT-4.1 and Gemini-2.5-Flash across tasks, which is a useful baseline comparison.

For the RL component, standard methods collapse due to reward scale differences, so MedGRPO applies cross-dataset median normalization to map each source's median performance to a common value and adds a five-dimension medical LLM judge that scores captions via comparative similarity. This combination reportedly improves grounding and captioning over the SFT baseline. The approach is direct and targets the multi-task heterogeneity problem without new theoretical machinery. The benchmark size, curation details, and the specific median-plus-LLM-judge adaptations look new relative to the prior RL-for-VLM work referenced.

The soft spots are in the supporting evidence. The abstract states the improvements but gives no ablations on the normalization parameters, no reward distribution histograms, no judge-human agreement numbers, and no error bars or significance tests. If reward variances differ substantially across the eight sources, median mapping alone may leave gradient instability, as the stress-test note suggests. The full paper may contain these checks, but based on what is shown the stability claim rests mainly on the final performance numbers.

This work is for groups doing VLM fine-tuning or evaluation on medical videos. The dataset alone makes it worth reading, and the training recipe could be adopted if the details hold up. It deserves peer review because the benchmark is substantial and the practical problem is real; referees can push for the missing ablations and variance analysis in revision.

Referee Report

3 major / 1 minor

Summary. The paper introduces MedVidBench, a benchmark comprising 531,850 video-instruction pairs across 8 heterogeneous medical sources covering video-, segment-, and frame-level tasks. It shows that supervised fine-tuning of Qwen2.5-VL-7B on this benchmark outperforms GPT-4.1 and Gemini-2.5-Flash on all tasks. To enable effective multi-task RL, the authors propose MedGRPO, which uses cross-dataset reward normalization (mapping each dataset's median performance to a common value) and a five-dimension medical LLM judge with comparative similarity scoring to prevent reward imbalance and training collapse. MedGRPO is reported to further improve the SFT baseline specifically on grounding and captioning tasks.

Significance. If the empirical claims hold under rigorous validation, the work provides a large-scale, expert-curated benchmark for medical video understanding and a practical RL method for balancing heterogeneous multi-dataset training. The benchmark curation pipeline (expert-guided prompting and dual-model validation) and the explicit handling of reward-scale imbalance represent concrete contributions that could support future VLM development in clinical video analysis.

major comments (3)

[Abstract] The central claim that SFT Qwen2.5-VL-7B outperforms GPT-4.1 and Gemini-2.5-Flash, and that MedGRPO further improves the SFT baseline on grounding and captioning, is presented without any numerical deltas, error bars, statistical significance tests, or ablation tables. This absence makes it impossible to evaluate the magnitude or reliability of the reported gains.
[Abstract] The cross-dataset reward normalization step is described only at a high level (mapping medians to a common value). No quantitative details are given on the chosen common reward value, reward histograms before/after normalization, variance across the 8 sources, or ablation studies showing that this step alone prevents the collapse observed with standard RL.
[Abstract] The medical LLM judge is introduced as evaluating caption quality on five clinical dimensions via comparative similarity scoring, yet the manuscript provides no information on the exact judge prompt, inter-rater agreement with human experts, or any analysis of potential clinical biases introduced by the judge model.
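The kind of significance check requested in the first major comment needs very little machinery. The sketch below is a paired sign-flip permutation test on per-sample metric differences between two systems. It is a generic technique chosen here because it needs only the standard library, not the test used in the paper, and the scores are synthetic.

```python
import itertools


def paired_permutation_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired scores a[i], b[i]."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally sized, non-empty samples")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Enumerate every sign assignment; feasible for small n (2**n cases).
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Synthetic per-video scores for two hypothetical systems.
rl_scores = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.73]
sft_scores = [0.60, 0.55, 0.70, 0.61, 0.59, 0.65, 0.62, 0.70]
p_value = paired_permutation_pvalue(rl_scores, sft_scores)
```

For larger evaluation sets the full enumeration becomes impractical, and a random sample of sign assignments is the usual substitute.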

minor comments (1)

The project website URL is provided but no details on what resources (e.g., dataset splits, code, or judge prompts) are released there.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point-by-point below and have revised the manuscript to incorporate additional quantitative details, methodological clarifications, and supporting analyses where feasible.


Referee: [Abstract] The central claim that SFT Qwen2.5-VL-7B outperforms GPT-4.1 and Gemini-2.5-Flash, and that MedGRPO further improves the SFT baseline on grounding and captioning, is presented without any numerical deltas, error bars, statistical significance tests, or ablation tables. This absence makes it impossible to evaluate the magnitude or reliability of the reported gains.

Authors: We agree that the abstract would benefit from explicit numerical results to convey the magnitude of improvements. In the revised manuscript, we will update the abstract to include key performance deltas (e.g., average accuracy gains of X% over GPT-4.1 and Y% over Gemini-2.5-Flash across tasks), while noting that full tables with error bars, statistical significance tests (e.g., paired t-tests), and ablation studies appear in Sections 4 and 5 of the main paper. Due to abstract length limits, we will summarize the most salient metrics rather than include exhaustive tables.

revision: yes

Referee: [Abstract] The cross-dataset reward normalization step is described only at a high level (mapping medians to a common value). No quantitative details are given on the chosen common reward value, reward histograms before/after normalization, variance across the 8 sources, or ablation studies showing that this step alone prevents the collapse observed with standard RL.

Authors: We acknowledge the need for greater quantitative transparency on the normalization procedure. The revised version will include: (1) the specific common reward value selected (e.g., 0.5), (2) reward distribution histograms before and after normalization for each of the 8 sources, (3) reported variance statistics across datasets, and (4) a dedicated ablation study comparing training stability with and without normalization. These additions will be placed in Section 3.2 and a new appendix figure.

revision: yes

Referee: [Abstract] The medical LLM judge is introduced as evaluating caption quality on five clinical dimensions via comparative similarity scoring, yet the manuscript provides no information on the exact judge prompt, inter-rater agreement with human experts, or any analysis of potential clinical biases introduced by the judge model.

Authors: We will expand the description of the medical LLM judge in the revised manuscript. The exact judge prompt will be provided verbatim in Appendix B. We have performed a post-hoc human validation on a random subset of 200 samples and will report inter-rater agreement metrics (e.g., Cohen's kappa and percentage agreement) with expert clinicians. A new discussion subsection will analyze potential clinical biases (e.g., model preference for certain terminology) and how the comparative similarity scoring and five-dimension rubric help mitigate them.

revision: partial
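For the agreement metrics mentioned in the last response, a minimal sketch of percentage agreement and Cohen's kappa between judge and human ratings follows. The rating lists are synthetic and the function name is illustrative; the validation protocol described in the rebuttal is not reproduced here.

```python
from collections import Counter


def agreement_and_kappa(rater_a, rater_b):
    """Return (percent agreement, Cohen's kappa) for two equal-length label lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally sized, non-empty label lists")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters used a single identical label
    return observed, (observed - expected) / (1.0 - expected)


# Synthetic 1-5 ratings from an LLM judge and a human expert.
judge = [5, 4, 4, 3, 2, 5, 1, 3, 4, 2]
human = [5, 4, 3, 3, 2, 5, 1, 2, 4, 2]
percent, kappa = agreement_and_kappa(judge, human)
```

Kappa discounts the agreement two raters would reach by chance given their label frequencies, which is why it is usually reported alongside raw percentage agreement rather than instead of it.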

Circularity Check

0 steps flagged

No circularity detected; normalization and judge are explicit design choices, not self-referential derivations


The paper presents MedVidBench curation and MedGRPO's two innovations (cross-dataset median reward normalization and five-dimension LLM judge with comparative scoring) as methodological solutions to reward imbalance and evaluation. These are described as design choices that map medians to a common value and apply clinical-dimension scoring, respectively. No equations appear in the abstract or provided text, and no self-citations are invoked to justify uniqueness or load-bearing premises. The reported gains (SFT outperforming GPT-4.1/Gemini and MedGRPO improving SFT on grounding/captioning) are framed as empirical results from applying the methods, not quantities forced by construction from the same fitted parameters or inputs. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the new benchmark is representative and that the two RL modifications are sufficient to stabilize training; no free parameters are explicitly fitted in the abstract description, and no new physical entities are postulated.

axioms (1)

domain assumption: Expert-guided prompting and dual-model validation produce high-quality instruction pairs without systematic bias. Invoked in the MedVidBench curation description.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: cross-dataset reward normalization that maps each dataset’s median performance to a common reward value... logistic transformation... $r^{(d,t)}_{\mathrm{norm}}(x) = \dfrac{1}{1+\exp\left(-k\,(x - p_{50})/\mathrm{IQR}\right)}$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear

Paper passage: medical LLM judge... five clinical dimensions... comparative similarity scoring
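A short worked check of the logistic form quoted in the first entry above, writing $\sigma$ for the logistic function and treating $k > 0$ and $\mathrm{IQR} > 0$ as given (the value of $k$ is not stated on this page):

```latex
\begin{align*}
r^{(d,t)}_{\mathrm{norm}}(x) &= \sigma\!\left(\frac{k\,(x - p_{50})}{\mathrm{IQR}}\right),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}},\\
r^{(d,t)}_{\mathrm{norm}}(p_{50}) &= \sigma(0) = \tfrac{1}{2},\\
r^{(d,t)}_{\mathrm{norm}}(p_{50} \pm \mathrm{IQR}) &= \sigma(\pm k),\\
\left.\frac{d\,r^{(d,t)}_{\mathrm{norm}}}{dx}\right|_{x = p_{50}} &= \frac{k}{4\,\mathrm{IQR}}.
\end{align*}
```

Under this form every dataset's median maps to 1/2 whatever its raw scale, and the sensitivity around the median is set by $k$ relative to that dataset's own spread.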

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild
cs.CV · 2026-05 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?
cs.CV · 2026-04 · unverdicted · novelty 5.0

LLM-generated narratives from surgical videos enable scalable vision-language pre-training through a noise-robust framework that maintains visual model performance on surgical benchmarks.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 2 Pith papers · 11 internal anchors

[1]

A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery.IEEE Transactions on Biomedical Engineer- ing, 64(9):2025–2041, 2017

Narges Ahmidi, Lingling Tao, Shahin Sefati, Yixin Gao, Colin Lea, Benjamin Bejar Haro, Luca Zappella, Sanjeev Khudanpur, Ren´e Vidal, and Gregory D Hager. A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery.IEEE Transactions on Biomedical Engineer- ing, 64(9):2025–2041, 2017. 4, 14

work page 2025
[2]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

work page
[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 3, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Whisperx: Time-accurate speech transcription of long- form audio.INTERSPEECH 2023, 2023

Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisser- man. Whisperx: Time-accurate speech transcription of long- form audio.INTERSPEECH 2023, 2023. 2, 4, 12

work page 2023
[5]

Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments

Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments. InProceedings of the acl workshop on in- trinsic and extrinsic evaluation measures for machine trans- lation and/or summarization, pages 65–72, 2005. 4

work page 2005
[6]

Activitynet: A large-scale video benchmark for human activity understanding

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InProceed- ings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015. 3

work page 2015
[7]

Surgllm: A versatile large multimodal model with spatial fo- cus and temporal awareness for surgical video understand- ing.arXiv preprint arXiv:2509.00357, 2025

Zhen Chen, Xingjian Luo, Kun Yuan, Jinlin Wu, Danny Chan, Nassir Navab, Hongbin Liu, Zhen Lei, and Jiebo Luo. Surgllm: A versatile large multimodal model with spatial fo- cus and temporal awareness for surgical video understand- ing.arXiv preprint arXiv:2509.00357, 2025. 1, 3

work page arXiv 2025
[8]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 1, 3, 7, 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Gptscore: Evaluate as you desire

Jinlan Fu, See Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire. InProceedings of the 2024 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Tech- nologies (Volume 1: Long Papers), pages 6556–6576, 2024. 4

work page 2024
[10]

Egosurgery-phase: A dataset of surgical phase recognition from egocentric open surgery videos

Ryo Fujii, Masashi Hatano, Hideo Saito, and Hiroki Kajita. Egosurgery-phase: A dataset of surgical phase recognition from egocentric open surgery videos. InMICCAI, 2024. 1, 2, 3, 4, 12, 14

work page 2024
[11]

Goodman, Krishna K

Emmett D. Goodman, Krishna K. Patel, Yilun Zhang, William Locke, Chris J. Kennedy, Rohan Mehrotra, Stephen Ren, Melody Guan, Orr Zohar, Maren Downing, Hao Wei Chen, Jevin Z. Clark, Margaret T. Berrigan, Gabriel A. Brat, and Serena Yeung-Levy. Analyzing surgical technique in di- verse open surgical videos with multi-task machine learning. JAMA Surgery, 202...

work page 2024
[12]

CLIPScore: a reference-free evaluation met- ric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: a reference-free evaluation met- ric for image captioning. InEMNLP, 2021. 4

work page 2021
[13]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 1, 3, 7, 12

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Prometheus: Inducing fine- grained evaluation capability in language models

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sung- dong Kim, James Thorne, et al. Prometheus: Inducing fine- grained evaluation capability in language models. InThe Twelfth International Conference on Learning Representa- tions, 2023. 4

work page 2023
[15]

Vhelm: A holistic evaluation of vision language models.Advances in Neural Information Processing Systems, 37:140632–140666, 2024

Tony Lee, Haoqin Tu, Chi H Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin S Roberts, Michihiro Yasunaga, Huaxiu Yao, Cihang Xie, et al. Vhelm: A holistic evaluation of vision language models.Advances in Neural Information Processing Systems, 37:140632–140666, 2024. 4

work page 2024
[16]

Llava-med: Training a large language- and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564,

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language- and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564,

work page
[17]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1, 3

work page 2023
[18]

VLFeedback: A large-scale AI feedback dataset for large vision-language models alignment

Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, and Qi Liu. VLFeedback: A large-scale AI feedback dataset for large vision-language models alignment. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6227–6246, 2024. 3

work page 2024
[19]

Video-llava: Learning united visual repre- sentation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual repre- sentation by alignment before projection. InProceedings of the 2024 Conference on Empirical Methods in Natural Lan- guage Processing, pages 5971–5984, 2024. 3

work page 2024
[20]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Ya- coob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning.arXiv preprint arXiv:2306.14565, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1

work page 2023
[22]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation us- ing gpt-4 with better human alignment.arXiv preprint arXiv:2303.16634, 2023. 4, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Gradient episodic memory for continual learning.Advances in neu- ral information processing systems, 30, 2017

David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning.Advances in neu- ral information processing systems, 30, 2017. 3

work page 2017
[24]

Unified-io: A unified model for vision, language, and multi-modal tasks.ArXiv, abs/2206.08916, 2022

Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mot- taghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks.arXiv preprint arXiv:2206.08916, 2022. 3

work page arXiv 2022
[25]

Video-chatgpt: Towards detailed video un- derstanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video un- derstanding via large vision and language models. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024. 3, 6

work page 2024
[26]

Packnet: Adding mul- tiple tasks to a single network by iterative pruning

Arun Mallya and Svetlana Lazebnik. Packnet: Adding mul- tiple tasks to a single network by iterative pruning. InPro- ceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018. 3

work page 2018
[27]

Nurvid: A large expert-level video database for nursing pro- cedure activity understanding

Hu Ming, Wang Lin, Yan Siyuan, Ma Don, Ren Qingli, Xia Peng, Feng Wei, Duan Peibo, Ju Lie, and Ge Zongyuan. Nurvid: A large expert-level video database for nursing pro- cedure activity understanding. InThirty-seventh Confer- ence on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 1, 3, 4, 12, 14

work page 2023
[28]

Med-flamingo: a multimodal medical few-shot learner

Michael Moor, Qian Huang, Shirley Wu, Michihiro Ya- sunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Ed- uardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. InMachine Learning for Health (ML4H), pages 353–367. PMLR, 2023. 3

work page 2023
[29]

Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos.Medical Image Analysis, 78, 2022

Chinedu Innocent Nwoye, Tong Yu, Cristians Gonzalez, Barbara Seeliger, Pietro Mascagni, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos.Medical Image Analysis, 78, 2022. 1, 3, 4, 12, 14

work page 2022
[30]

Cholectrack20: A multi-perspective tracking dataset for sur- gical tools

Chinedu Innocent Nwoye, Kareem Elgohary, Anvita Srini- vas, Fauzan Zaid, Jo ¨el L Lavanchy, and Nicolas Padoy. Cholectrack20: A multi-perspective tracking dataset for sur- gical tools. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 8942–8952, 2025. 4, 12, 14

work page 2025
[31]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,

work page
[32]

Surglavi: Large-scale hierarchical dataset for surgical vision-language representation learning.arXiv preprint arXiv:2509.10555,

Alejandra Perez, Chinedu Nwoye, Ramtin Raji Kermani, Omid Mohareri, and Muhammad Abdullah Jamal. Surglavi: Large-scale hierarchical dataset for surgical vision-language representation learning.arXiv preprint arXiv:2509.10555,

work page arXiv
[33]

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
[34]

Manuel Sebastián Ríos, María Alejandra Molina-Rodriguez, Daniella Londoño, Camilo Andrés Guillén, Sebastián Sierra, Felipe Zapata, and Luis Felipe Giraldo. Cholec80-cvs: An open dataset with an evaluation of strasberg’s critical view of safety for ai. Scientific Data, 10(1):194, 2023.
[35]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[36]

Tong Yu, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Learning from a tiny dataset of manual annotations: a teacher/student approach for surgical phase recognition. In International Conference on Information Processing in Computer-Assisted Interventions, 2019.
[37]

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
[38]

Guankun Wang, Han Xiao, Renrui Zhang, Huxin Gao, Long Bai, Xiaoxiao Yang, Zhen Li, Hongsheng Li, and Hongliang Ren. Copesd: A multi-level surgical motion dataset for training large vision-language models to co-pilot endoscopic submucosal dissection. arXiv preprint arXiv:2410.07540, 2024.
[39]

Guankun Wang, Long Bai, Junyi Wang, Kun Yuan, Zhen Li, Tianxu Jiang, Xiting He, Jinlin Wu, Zhen Chen, Zhen Lei, et al. Endochat: Grounded multimodal large language model for endoscopic surgery. arXiv preprint arXiv:2501.11347, 2025.
[40]

Guankun Wang, Wenjin Mo, Junyi Wang, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, et al. Surgvidlm: Towards multi-grained surgical video understanding with large language model. arXiv preprint arXiv:2506.17873, 2025.
[41]

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International conference on machine learning, pages 23318–23340. PMLR, 2022.
[42]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[43]

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[44]

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
[45]

Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13807–13816, 2024.
[46]

Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.
[47]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019.
[48]

Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Development of a large-scale medical visual question-answering dataset. Communications Medicine, 4(1):277, 2024.

Supplementary Material: MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding

This supplementary material provides...
