SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
Pith reviewed 2026-05-10 01:18 UTC · model grok-4.3
The pith
SurgCoT is a benchmark that tests chain-of-thought reasoning in multimodal models on surgical videos, spanning seven specialties and thirty-five procedures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SurgCoT is a benchmark that evaluates chain-of-thought reasoning in multimodal large language models on surgical videos, covering seven specialties and thirty-five procedures across five core dimensions: causal action ordering, cue-action alignment, affordance mapping, micro-transition localization, and anomaly onset tracking. Each item follows a Question-Option-Knowledge-Clue-Answer format in which the Knowledge field supplies necessary background context and the Clue field supplies definitive spatiotemporal evidence. Tests of ten leading models show that commercial systems outperform open-source and medically specialized variants while revealing persistent gaps in surgical chain-of-thought performance.
What carries the argument
The Question-Option-Knowledge-Clue-Answer annotation structure, in which Knowledge supplies background context and Clue supplies definitive spatiotemporal evidence to support step-by-step reasoning.
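To make that structure concrete, here is a minimal sketch of what a single Question-Option-Knowledge-Clue-Answer item could look like. The field names, types, and example values are illustrative assumptions for this review, not the paper's released schema.

```python
from dataclasses import dataclass

@dataclass
class SurgCoTItem:
    """Hypothetical container for one Question-Option-Knowledge-Clue-Answer item."""
    question: str        # what the model must reason about
    options: list[str]   # multiple-choice candidates
    knowledge: str       # background context needed to reason
    clue: str            # definitive spatiotemporal evidence in the video
    answer: str          # ground-truth option label
    dimension: str       # one of the five reasoning dimensions

# Illustrative example (all values invented for demonstration only).
example = SurgCoTItem(
    question="Which action must occur before the cystic duct is clipped?",
    options=["A. Dissect Calot's triangle", "B. Retract the gallbladder fundus",
             "C. Irrigate the surgical field", "D. Remove the specimen"],
    knowledge="Safe laparoscopic cholecystectomy requires exposing the cystic duct "
              "and artery before any clipping.",
    clue="From 02:31 to 02:58 the dissector separates tissue in Calot's triangle, "
         "immediately preceding the clip applier's entry at 03:02.",
    answer="A",
    dimension="Causal Action Ordering",
)
print(example.dimension, "->", example.answer)
```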
If this is right
- Commercial multimodal models currently outperform open-source and medical-specialized models on the five surgical reasoning dimensions.
- Large gaps remain between current model performance and the level of spatiotemporal reasoning required for clinical video analysis.
- The benchmark supports systematic measurement of whether models improve their reasoning when given progressive chain-of-thought scaffolding.
- SurgCoT supplies a shared, reproducible testbed that researchers can use to compare approaches aimed at clinical video understanding.
Where Pith is reading between the lines
- The same five reasoning dimensions could be reused to create parallel benchmarks for other high-stakes video domains such as traffic monitoring or sports officiating.
- Models fine-tuned on SurgCoT data might transfer better to real-time surgical assistance tasks if the annotation protocol is kept consistent.
- Extending the clue and knowledge fields to include temporal uncertainty estimates could test whether models can express confidence in their localization of micro-transitions.
Load-bearing premise
The detailed annotation protocol, with its expert-provided Knowledge and Clue fields, captures the five intended reasoning dimensions accurately and without bias.
What would settle it
A controlled experiment in which models given only the questions and options perform as well as models given the full knowledge and clue fields would show that the benchmark does not isolate the targeted chain-of-thought reasoning.
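One way to run that check is to score each model twice on the same items, once with only the question and options in the prompt and once with the Knowledge and Clue fields included, and compare accuracies. The sketch below assumes a generic `ask_model(prompt) -> str` helper and an item structure like the one sketched earlier; both are illustrative, not part of SurgCoT's released code.

```python
def build_prompt(item, with_context: bool) -> str:
    """Assemble a text prompt; Knowledge/Clue appear only in the full condition."""
    parts = [f"Question: {item.question}", "Options: " + " ".join(item.options)]
    if with_context:
        parts += [f"Knowledge: {item.knowledge}", f"Clue: {item.clue}"]
    parts.append("Answer with a single option letter.")
    return "\n".join(parts)

def accuracy(items, ask_model, with_context: bool) -> float:
    """Fraction of items answered correctly under one prompting condition."""
    correct = 0
    for item in items:
        reply = ask_model(build_prompt(item, with_context))
        if reply.strip().upper().startswith(item.answer):
            correct += 1
    return correct / max(len(items), 1)

# Hypothetical usage: if acc_bare is close to acc_full, the Knowledge/Clue
# scaffolding is not what the benchmark is actually measuring.
# acc_bare = accuracy(items, ask_model, with_context=False)
# acc_full = accuracy(items, ask_model, with_context=True)
```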
read the original abstract
Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code: https://github.com/CVI-SZU/SurgCoT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SurgCoT, a benchmark for evaluating chain-of-thought (CoT) reasoning in Multi-modal Large Language Models (MLLMs) on surgical videos. It spans 7 specialties and 35 procedures, focusing on five reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking. The benchmark employs a structured annotation protocol involving Question-Option-Knowledge-Clue-Answer, with evaluations on 10 MLLMs highlighting performance differences and gaps in surgical CoT reasoning.
Significance. If the annotations prove robust, this benchmark could significantly advance the field by providing a standardized, reproducible way to assess and improve MLLMs' spatiotemporal reasoning in high-stakes medical applications. The open code release supports reproducibility and further research.
major comments (2)
- The paper describes the Question-Option-Knowledge-Clue-Answer protocol where Knowledge provides background and Clue provides spatiotemporal evidence, but lacks inter-annotator agreement metrics or external validation to confirm that these fields accurately and unbiasedly capture the five intended reasoning dimensions across the 35 procedures.
- High-level evaluation outcomes are reported for 10 MLLMs, but the absence of detailed data statistics, full results tables, or breakdowns by specialty and dimension makes it difficult to fully verify the claims of significant gaps and effective evaluation.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the significance of SurgCoT and for the constructive major comments. We address each point below and outline the revisions we will make to strengthen the manuscript.
read point-by-point responses
- Referee: The paper describes the Question-Option-Knowledge-Clue-Answer protocol where Knowledge provides background and Clue provides spatiotemporal evidence, but lacks inter-annotator agreement metrics or external validation to confirm that these fields accurately and unbiasedly capture the five intended reasoning dimensions across the 35 procedures.
Authors: We agree that inter-annotator agreement metrics would provide valuable evidence of annotation quality. The annotations were conducted by multiple domain experts in surgery following detailed guidelines to align with the five reasoning dimensions. In the revised manuscript, we will report inter-annotator agreement scores (e.g., Fleiss' kappa; see the sketch after these responses) for the Knowledge and Clue fields across a subset of annotations. We will also elaborate on the annotation protocol and any steps taken for validation to demonstrate that the fields accurately capture the intended dimensions. revision: yes
- Referee: High-level evaluation outcomes are reported for 10 MLLMs, but the absence of detailed data statistics, full results tables, or breakdowns by specialty and dimension makes it difficult to fully verify the claims of significant gaps and effective evaluation.
Authors: We acknowledge the need for more detailed reporting to allow full verification of our claims. The current manuscript presents aggregated results to highlight overall trends, but we will expand the evaluation section in the revision to include comprehensive data statistics (such as sample distribution across specialties and dimensions), full per-model performance tables, and breakdowns by each of the 7 specialties and 5 reasoning dimensions. This will substantiate the reported gaps and the effectiveness of SurgCoT as an evaluation benchmark. revision: yes
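The first response proposes reporting Fleiss' kappa for the Knowledge and Clue fields. Below is a minimal sketch of how that agreement score could be computed from categorical annotator labels using statsmodels; the label categories, array shape, and values are assumptions for illustration only.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical data: rows are annotated items, columns are annotators, and each
# cell is the category an annotator assigned (e.g., which of the five reasoning
# dimensions the Clue evidence supports, coded 0..4).
ratings = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [4, 4, 4],
    [3, 2, 3],
    [0, 0, 1],
])

# Convert per-annotator labels into an items-by-categories count table,
# then compute Fleiss' kappa over that table.
table, _categories = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")
```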
Circularity Check
No circularity: benchmark construction and model evaluation yield independent results.
full rationale
The paper introduces SurgCoT as a new benchmark with a Question-Option-Knowledge-Clue-Answer annotation protocol across 7 specialties and 35 procedures, then reports evaluation results for 10 MLLMs on five reasoning dimensions. No equations, parameter fitting, or derivations are present. Model performance numbers are direct outputs of running the models on the held-out benchmark and are not reducible to the authors' annotation choices by construction. No self-citation chains or uniqueness theorems are invoked to justify core claims. The annotation protocol is a design choice whose validity can be externally challenged, but it does not create circularity in the reported outcomes.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The five dimensions (Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, Anomaly Onset Tracking) collectively represent core spatiotemporal reasoning needs in surgery.