SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
Pith reviewed 2026-05-10 01:18 UTC · model grok-4.3
The pith
SurgCoT is a benchmark that tests chain-of-thought reasoning in multimodal models on surgical videos, spanning seven specialties and thirty-five procedures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SurgCoT is a benchmark that evaluates chain-of-thought reasoning in multimodal large language models on surgical videos, covering seven specialties and thirty-five procedures across five core dimensions: causal action ordering, cue-action alignment, affordance mapping, micro-transition localization, and anomaly onset tracking. Each item follows a Question-Option-Knowledge-Clue-Answer format in which the Knowledge field supplies necessary background context and the Clue field supplies definitive spatiotemporal evidence. Tests of ten leading models show that commercial systems outperform open-source and medically specialized variants while revealing persistent gaps in surgical chain-of-thought performance.
What carries the argument
The Question-Option-Knowledge-Clue-Answer annotation structure, in which Knowledge supplies background context and Clue supplies definitive spatiotemporal evidence to support step-by-step reasoning.
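To make that structure concrete, here is a minimal sketch of what a single Question-Option-Knowledge-Clue-Answer item could look like. The field names, types, and example values are illustrative assumptions for this review, not the paper's released schema.

```python
from dataclasses import dataclass

@dataclass
class SurgCoTItem:
    """Hypothetical container for one Question-Option-Knowledge-Clue-Answer item."""
    question: str        # what the model must reason about
    options: list[str]   # multiple-choice candidates
    knowledge: str       # background context needed to reason
    clue: str            # definitive spatiotemporal evidence in the video
    answer: str          # ground-truth option label
    dimension: str       # one of the five reasoning dimensions

# Illustrative example (all values invented for demonstration only).
example = SurgCoTItem(
    question="Which action must occur before the cystic duct is clipped?",
    options=["A. Dissect Calot's triangle", "B. Retract the gallbladder fundus",
             "C. Irrigate the surgical field", "D. Remove the specimen"],
    knowledge="Safe laparoscopic cholecystectomy requires exposing the cystic duct "
              "and artery before any clipping.",
    clue="From 02:31 to 02:58 the dissector separates tissue in Calot's triangle, "
         "immediately preceding the clip applier's entry at 03:02.",
    answer="A",
    dimension="Causal Action Ordering",
)
print(example.dimension, "->", example.answer)
```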
If this is right
- Commercial multimodal models currently outperform open-source and medical-specialized models on the five surgical reasoning dimensions.
- Large gaps remain between current model performance and the level of spatiotemporal reasoning required for clinical video analysis.
- The benchmark supports systematic measurement of whether models improve their reasoning when given progressive chain-of-thought scaffolding.
- SurgCoT supplies a shared, reproducible testbed that researchers can use to compare approaches aimed at clinical video understanding.
Where Pith is reading between the lines
- The same five reasoning dimensions could be reused to create parallel benchmarks for other high-stakes video domains such as traffic monitoring or sports officiating.
- Models fine-tuned on SurgCoT data might transfer better to real-time surgical assistance tasks if the annotation protocol is kept consistent.
- Extending the clue and knowledge fields to include temporal uncertainty estimates could test whether models can express confidence in their localization of micro-transitions.
Load-bearing premise
The detailed annotation protocol, with its expert-provided Knowledge and Clue fields, captures the five intended reasoning dimensions accurately and without bias.
What would settle it
A controlled experiment in which models given only the questions and options perform as well as models given the full knowledge and clue fields would show that the benchmark does not isolate the targeted chain-of-thought reasoning.
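One way to run that check is to score each model twice on the same items, once with only the question and options in the prompt and once with the Knowledge and Clue fields included, and compare accuracies. The sketch below assumes a generic `ask_model(prompt) -> str` helper and an item structure like the one sketched earlier; both are illustrative, not part of SurgCoT's released code.

```python
def build_prompt(item, with_context: bool) -> str:
    """Assemble a text prompt; Knowledge/Clue appear only in the full condition."""
    parts = [f"Question: {item.question}", "Options: " + " ".join(item.options)]
    if with_context:
        parts += [f"Knowledge: {item.knowledge}", f"Clue: {item.clue}"]
    parts.append("Answer with a single option letter.")
    return "\n".join(parts)

def accuracy(items, ask_model, with_context: bool) -> float:
    """Fraction of items answered correctly under one prompting condition."""
    correct = 0
    for item in items:
        reply = ask_model(build_prompt(item, with_context))
        if reply.strip().upper().startswith(item.answer):
            correct += 1
    return correct / max(len(items), 1)

# Hypothetical usage: if acc_bare is close to acc_full, the Knowledge/Clue
# scaffolding is not what the benchmark is actually measuring.
# acc_bare = accuracy(items, ask_model, with_context=False)
# acc_full = accuracy(items, ask_model, with_context=True)
```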
read the original abstract
Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code: https://github.com/CVI-SZU/SurgCoT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SurgCoT, a benchmark for evaluating chain-of-thought (CoT) reasoning in Multi-modal Large Language Models (MLLMs) on surgical videos. It spans 7 specialties and 35 procedures, focusing on five reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking. The benchmark employs a structured annotation protocol involving Question-Option-Knowledge-Clue-Answer, with evaluations on 10 MLLMs highlighting performance differences and gaps in surgical CoT reasoning.
Significance. If the annotations prove robust, this benchmark could significantly advance the field by providing a standardized, reproducible way to assess and improve MLLMs' spatiotemporal reasoning in high-stakes medical applications. The open code release supports reproducibility and further research.
major comments (2)
- The paper describes the Question-Option-Knowledge-Clue-Answer protocol where Knowledge provides background and Clue provides spatiotemporal evidence, but lacks inter-annotator agreement metrics or external validation to confirm that these fields accurately and unbiasedly capture the five intended reasoning dimensions across the 35 procedures.
- High-level evaluation outcomes are reported for 10 MLLMs, but the absence of detailed data statistics, full results tables, or breakdowns by specialty and dimension makes it difficult to fully verify the claims of significant gaps and effective evaluation.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the significance of SurgCoT and for the constructive major comments. We address each point below and outline the revisions we will make to strengthen the manuscript.
read point-by-point responses
- Referee: The paper describes the Question-Option-Knowledge-Clue-Answer protocol where Knowledge provides background and Clue provides spatiotemporal evidence, but lacks inter-annotator agreement metrics or external validation to confirm that these fields accurately and unbiasedly capture the five intended reasoning dimensions across the 35 procedures.
Authors: We agree that inter-annotator agreement metrics would provide valuable evidence of annotation quality. The annotations were conducted by multiple domain experts in surgery following detailed guidelines to align with the five reasoning dimensions. In the revised manuscript, we will report inter-annotator agreement scores (e.g., Fleiss' kappa; see the sketch after these responses) for the Knowledge and Clue fields across a subset of annotations. We will also elaborate on the annotation protocol and any steps taken for validation to demonstrate that the fields accurately capture the intended dimensions. revision: yes
- Referee: High-level evaluation outcomes are reported for 10 MLLMs, but the absence of detailed data statistics, full results tables, or breakdowns by specialty and dimension makes it difficult to fully verify the claims of significant gaps and effective evaluation.
Authors: We acknowledge the need for more detailed reporting to allow full verification of our claims. The current manuscript presents aggregated results to highlight overall trends, but we will expand the evaluation section in the revision to include comprehensive data statistics (such as sample distribution across specialties and dimensions), full per-model performance tables, and breakdowns by each of the 7 specialties and 5 reasoning dimensions. This will substantiate the reported gaps and the effectiveness of SurgCoT as an evaluation benchmark. revision: yes
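The first response proposes reporting Fleiss' kappa for the Knowledge and Clue fields. Below is a minimal sketch of how that agreement score could be computed from categorical annotator labels using statsmodels; the label categories, array shape, and values are assumptions for illustration only.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical data: rows are annotated items, columns are annotators, and each
# cell is the category an annotator assigned (e.g., which of the five reasoning
# dimensions the Clue evidence supports, coded 0..4).
ratings = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [4, 4, 4],
    [3, 2, 3],
    [0, 0, 1],
])

# Convert per-annotator labels into an items-by-categories count table,
# then compute Fleiss' kappa over that table.
table, _categories = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")
```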
Circularity Check
No circularity: benchmark construction and model evaluation yield independent results.
full rationale
The paper introduces SurgCoT as a new benchmark with a Question-Option-Knowledge-Clue-Answer annotation protocol across 7 specialties and 35 procedures, then reports evaluation results for 10 MLLMs on five reasoning dimensions. No equations, parameter fitting, or derivations are present. Model performance numbers are direct outputs of running the models on the held-out benchmark and are not reducible to the authors' annotation choices by construction. No self-citation chains or uniqueness theorems are invoked to justify core claims. The annotation protocol is a design choice whose validity can be externally challenged, but it does not create circularity in the reported outcomes.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The five dimensions (Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, Anomaly Onset Tracking) collectively represent core spatiotemporal reasoning needs in surgery.