pith. sign in

arxiv: 2606.25905 · v1 · pith:5U3SAFK6new · submitted 2026-06-24 · 💻 cs.CV

SurgAtlas: A Large-Scale Surgical Video-Language Dataset with 2,391 Hours of Open and Minimally Invasive Surgery

Pith reviewed 2026-06-25 20:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords surgical video datasetvideo-language datasetopen surgeryminimally invasive surgerysurgical AIfoundation modelsphase recognitionvisual question answering
0
0 comments X

The pith

SurgAtlas supplies 2,391 hours of public surgical video with layered language annotations to train multimodal models for surgery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SurgAtlas as the largest surgical video-language dataset, built from 15,291 YouTube videos totaling 2,391 hours across 18 specialties and more than 5,000 procedure types. It is the first to include open surgery at scale alongside minimally invasive recordings and supplies hierarchical annotations consisting of segment captions, phase and step descriptions, video-level summaries, and reasoning question-answer pairs. These annotations are produced by an automated multi-tier pipeline that uses LLM enrichment and staged visual question answering with groundedness checks. The authors demonstrate the dataset's utility by fine-tuning Qwen3-VL-8B in a captioning-then-instruction regime and report competitive or state-of-the-art performance on established benchmarks for phase recognition, triplet detection, and surgical reasoning. The scale and public availability are positioned to support pretraining of broader foundation models for surgical AI.

Core claim

SurgAtlas is the largest surgical video-language dataset to date, comprising 15,291 videos (2,391 hours) spanning 18 specialties and over 5,000 procedure types, with 6,182 open-procedure videos alongside more than 9,000 minimally invasive recordings. It supplies one of the most diverse annotation schemas through an automated multi-tier pipeline with LLM-based enrichment and a staged VQA generation framework, plus an expert-validated subset of verified visual question-answer pairs. Fine-tuning Qwen3-VL-8B via a two-stage captioning-then-instruction pipeline on this data produces competitive or state-of-the-art results on multiple surgical benchmarks including phase recognition, triplet detect

What carries the argument

The automated multi-tier pipeline with LLM-based enrichment and staged VQA generation framework that creates segment-level captions, phase and step descriptions, video-level summaries, and grounded reasoning question-answer pairs at scale.

If this is right

  • The dataset enables training of models with broad procedural coverage across open and minimally invasive surgery.
  • It establishes the first standardized benchmarks specifically for open-surgery video understanding.
  • The public corpus supports large-scale pretraining of multimodal surgical AI systems.
  • Hierarchical annotations allow models to learn at multiple levels of temporal granularity.
  • The expert-validated subset serves as a clinically grounded testbed for surgical reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Public sourcing from YouTube may allow repeated updates as new surgical videos appear online.
  • The combination of open and minimally invasive data could help models handle the wider range of visual conditions encountered in real operating rooms.
  • The scale may reduce dependence on limited private hospital datasets for surgical AI development.
  • The annotation taxonomy could be reused to label new videos from robotic surgery platforms.

Load-bearing premise

The automated multi-tier pipeline with LLM-based enrichment produces accurate, clinically grounded annotations at scale without introducing significant errors or biases.

What would settle it

A random sample of several hundred generated question-answer pairs or captions reviewed by multiple surgeons showing a high rate of clinically inaccurate or ungrounded content.

Figures

Figures reproduced from arXiv: 2606.25905 by Ahmad M. Hider, Alyssa M. Hardin, Andre S. Gala-Garza, Chenliang Xu, Donald S. Likosky, Filippos Bellos, Jason J. Corso, Jing Bi, Miaowei Wang, Susan Liang, Yayuan Li.

Figure 1
Figure 1. Figure 1: Visual overview of SurgAtlas. Sampled frames organized by surgical specialty and split into minimally invasive (left) and open surgery (right). Insets illustrate the diverse annotation types that SurgAtlas uniquely combines in a single corpus. 1 Introduction Large multimodal foundation models have shown strong performance across domains by aligning visual content with natural language, sparking growing int… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of large-scale surgical video datasets. Bubble area is proportional to total duration; color indicates open-source, restricted/partially restricted, or ours. Callouts mark each dataset’s principal limitation (absence of open surgery, lecture-only sourcing, conversion from public benchmarks, or coarse categorical labels without reasoning supervision). SurgAtlas exceeds prior corpora in scale (15,… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the SurgAtlas construction pipeline. (1) YouTube search and channel￾level discovery yield 18,855 candidates; filtering retains 15,291 videos (9,109 MIS / 6,182 open). (2) Each video is routed to one of three annotation tiers by available signal: narrated audio, on-screen text, or external metadata. (3) Tier-specific extraction produces step-, segment-, and procedure￾level annotations (Whisper A… view at source ↗
Figure 4
Figure 4. Figure 4: Per-specialty composition of SurgAtlas. Number of videos (left) and total duration in hours (right) across the 18 surgical specialties, decomposed into open surgery (blue) and minimally in￾vasive surgery (orange). Specialties are sorted by total video count. Five specialties—Cardiothoracic, Colorectal, General, Gastrointestinal/Laparoscopic, and Neurosurgery—account for ∼ 70% of the corpus, while the open/… view at source ↗
read the original abstract

We introduce SurgAtlas, the largest surgical video-language dataset to date, comprising 15,291 videos (2,391 hours) spanning 18 surgical specialties and over 5,000 procedure types, sourced entirely from publicly available YouTube content. SurgAtlas is also the first surgical video-language dataset to include open surgery at scale, with 6,182 open procedure videos alongside over 9,000 minimally invasive recordings, and the first to establish standardized benchmarks for open-surgery video understanding. We additionally provide an expert-validated subset with verified visual question-answer pairs across diverse open and minimally invasive procedures, serving as a clinically grounded benchmark for surgical reasoning. Compared with existing surgical video-language datasets, SurgAtlas provides one of the most diverse annotation schemas, combining segment-level captions, step- and phase-level descriptions, video-level surgical descriptions, and reasoning-oriented question-answer pairs organized within a hierarchical taxonomy. These annotations are constructed through an automated multi-tier pipeline with LLM-based enrichment and a staged VQA generation framework with explicit groundedness verification. The scale and diversity of SurgAtlas enable training surgical foundation models with broad procedural coverage: we finetune Qwen3-VL-8B through a two-stage captioning-then-instruction pipeline and achieve competitive or state-of-the-art results on multiple established surgical benchmarks, including phase recognition, triplet detection, and reasoning question answering. More broadly, SurgAtlas provides a large native public video corpus that can support future large-scale pretraining of multimodal surgical AI systems and contribute to the development of next-generation foundation models for surgery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SurgAtlas, the largest surgical video-language dataset to date, comprising 15,291 YouTube-sourced videos (2,391 hours) across 18 specialties and >5,000 procedure types, with substantial open-surgery coverage (6,182 videos). Annotations follow a hierarchical schema (segment captions, phase/step descriptions, video-level summaries, and VQA pairs) generated by an automated multi-tier LLM pipeline with groundedness verification; an expert-validated VQA subset is provided as a benchmark. The authors demonstrate utility by finetuning Qwen3-VL-8B via a two-stage captioning-then-instruction process, reporting competitive or state-of-the-art results on phase recognition, triplet detection, and reasoning QA benchmarks.

Significance. If the central claims hold, SurgAtlas would be a significant contribution as the first large-scale public resource explicitly including open surgery at this volume and the first to define standardized open-surgery benchmarks. The scale, diversity of specialties/procedures, and hierarchical annotation schema are clear strengths that could support broader pretraining of surgical foundation models. The release of a native public video corpus and an expert-validated VQA subset are also positive.

major comments (2)
  1. [Section 3] Section 3 (Dataset Construction and Annotation Pipeline): Expert validation with quantitative statistics (accuracy, agreement rates, or error analysis) is reported only for the VQA subset. The bulk annotations—segment-level captions, step/phase descriptions, and video-level summaries generated by the automated LLM pipeline—are used for the two-stage finetuning that underpins the benchmark results, yet no validation metrics or bias assessment (e.g., for open-surgery visuals or rare procedures) are provided. This directly affects whether the reported gains can be attributed to data quality.
  2. [Section 5] Section 5 (Experiments and Results): The competitive/SOTA claims on phase recognition, triplet detection, and reasoning QA are presented without ablations that isolate the effect of SurgAtlas annotations versus potential label noise or biases introduced by the LLM pipeline. No comparison is shown to models trained on prior datasets using the identical pipeline, making it impossible to confirm that the improvements stem from the new data rather than the training procedure itself.
minor comments (2)
  1. [Abstract] Abstract and Section 2: The exact count of procedure types (stated as 'over 5,000') should be reported precisely in the main text or a table for reproducibility.
  2. Figure captions and tables summarizing dataset statistics should include explicit definitions of all categories (e.g., what constitutes a 'step' versus 'phase') to avoid ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive comments. We address each major point below and outline revisions to strengthen the manuscript's claims on data quality and experimental rigor.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (Dataset Construction and Annotation Pipeline): Expert validation with quantitative statistics (accuracy, agreement rates, or error analysis) is reported only for the VQA subset. The bulk annotations—segment-level captions, step/phase descriptions, and video-level summaries generated by the automated LLM pipeline—are used for the two-stage finetuning that underpins the benchmark results, yet no validation metrics or bias assessment (e.g., for open-surgery visuals or rare procedures) are provided. This directly affects whether the reported gains can be attributed to data quality.

    Authors: We acknowledge that expert validation is provided only for the VQA subset, as full manual review of 15,291 videos is not feasible. The bulk annotations rely on the multi-tier LLM pipeline with explicit groundedness verification. In the revision, we will expand Section 3 with additional details on the verification process and include a quantitative error analysis on a sampled subset (e.g., 500 annotations stratified by open vs. MIS and procedure rarity), reporting agreement metrics and bias observations where available. This addresses the concern without overclaiming full expert coverage. revision: partial

  2. Referee: [Section 5] Section 5 (Experiments and Results): The competitive/SOTA claims on phase recognition, triplet detection, and reasoning QA are presented without ablations that isolate the effect of SurgAtlas annotations versus potential label noise or biases introduced by the LLM pipeline. No comparison is shown to models trained on prior datasets using the identical pipeline, making it impossible to confirm that the improvements stem from the new data rather than the training procedure itself.

    Authors: We agree that the current experiments do not isolate the contribution of SurgAtlas annotations from the training procedure or potential pipeline noise. In the revised manuscript, we will add the requested ablations: retraining Qwen3-VL-8B on prior datasets using the identical two-stage captioning-then-instruction pipeline and directly comparing results to the SurgAtlas-trained model on the same benchmarks. This will clarify the source of performance gains. revision: yes

Circularity Check

0 steps flagged

No circularity; dataset paper with no derivations or self-referential predictions

full rationale

This is a dataset introduction paper whose central contribution is the construction and release of SurgAtlas via an automated multi-tier LLM pipeline, followed by empirical finetuning of Qwen3-VL-8B on external established benchmarks (phase recognition, triplet detection, reasoning QA). No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The claim of competitive/SOTA results is an empirical outcome on independent benchmarks rather than a reduction to the paper's own inputs by construction. Annotation accuracy is a correctness issue, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The contribution rests on standard assumptions about the representativeness of public YouTube surgical videos and the reliability of LLM-assisted automated annotation at clinical scale.

axioms (1)
  • domain assumption Publicly available YouTube videos provide representative and high-quality samples of real surgical procedures across specialties
    The dataset is sourced entirely from publicly available YouTube content spanning 18 specialties.

pith-pipeline@v0.9.1-grok · 5865 in / 1157 out tokens · 36051 ms · 2026-06-25T20:37:07.455408+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 18 canonical work pages

  1. [1]

    Lavanchy, Jacques Marescaux, Pietro Mascagni, Nassir Navab, and Nicolas Padoy

    Kun Yuan, Vinkle Srivastav, Tong Yu, Joël L. Lavanchy, Jacques Marescaux, Pietro Mascagni, Nassir Navab, and Nicolas Padoy. Learning multi-modal representations by watching hundreds of surgical video lectures.Medical Image Analysis, 105:103644, 2025. doi: 10.1016/j.media.2025.103644

  2. [2]

    Hecvl: Hierarchical video-language pretraining for zero-shot surgical phase recognition

    Kun Yuan, Vinkle Srivastav, Nassir Navab, and Nicolas Padoy. Hecvl: Hierarchical video-language pretraining for zero-shot surgical phase recognition. InMedical Image Computing and Computer Assisted Intervention – MICCAI 2024, pages 306–316. Springer, 2024. 11

  3. [3]

    Procedure-aware surgical video-language pretraining with hierarchical knowledge augmentation

    Kun Yuan, Vinkle Srivastav, Nassir Navab, and Nicolas Padoy. Procedure-aware surgical video-language pretraining with hierarchical knowledge augmentation. InAdvances in Neural Information Processing Systems, volume 37, 2024

  4. [4]

    Surglavi: Large-scale hierarchical dataset for surgical vision-language representation learning

    Alejandra Perez, Chinedu Innocent Nwoye, Ramtin Raji Kermani, Omid Mohareri, and Muhammad Abdul- lah Jamal. Surglavi: Large-scale hierarchical dataset for surgical vision-language representation learning. Medical Image Analysis, 107:103825, 2026. doi: 10.1016/j.media.2026.103825

  5. [5]

    SUREON: A benchmark and vision-language-model for surgical reasoning.arXiv preprint arXiv:2603.06570, 2026

    Alejandra Perez, Anita Rau, Lee White, Busisiwe Mlambo, Chinedu Nwoye, Muhammad Abdullah Jamal, and Omid Mohareri. SUREON: A benchmark and vision-language-model for surgical reasoning.arXiv preprint arXiv:2603.06570, 2026

  6. [6]

    SurgVLM: A large vision-language model and systematic evaluation benchmark for surgical intelligence.arXiv preprint arXiv:2506.02555, 2025

    Zhitao Zeng, Zhu Zhuo, Xiaojun Jia, Erli Zhang, Junde Wu, Jiaan Zhang, Yuxuan Wang, Chang Han Low, Jian Jiang, Zilong Zheng, Xiaochun Cao, Yutong Ban, Qi Dou, Yang Liu, and Yueming Jin. SurgVLM: A large vision-language model and systematic evaluation benchmark for surgical intelligence.arXiv preprint arXiv:2506.02555, 2025

  7. [7]

    SurgΣ: A spectrum of large-scale multimodal data and foundation models for surgical intelligence.arXiv preprint arXiv:2603.16822, 2026

    Zhitao Zeng, Mengya Xu, Jian Jiang, Pengfei Guo, Yunqiu Xu, Zhu Zhuo, Chang Han Low, Yufan He, Dong Yang, Chenxi Lin, Yiming Gu, Jiaxin Guo, Yutong Ban, Daguang Xu, Qi Dou, and Yueming Jin. SurgΣ: A spectrum of large-scale multimodal data and foundation models for surgical intelligence.arXiv preprint arXiv:2603.16822, 2026

  8. [8]

    Zhen Chen, Xingjian Luo, Kun Yuan, Jinlin Wu, Danny T. M. Chan, Nassir Navab, Hongbin Liu, Zhen Lei, and Jiebo Luo. SurgLLM: A versatile large multimodal model with spatial focus and temporal awareness for surgical video understanding.arXiv preprint arXiv:2509.00357, 2025

  9. [9]

    Endochat: Grounded multimodal large language model for endoscopic surgery.Medical Image Analysis, 107:103789, 2026

    Guankun Wang, Long Bai, Junyi Wang, Kun Yuan, Zhen Li, Tianxu Jiang, Xiting He, Jinlin Wu, Zhen Chen, Zhen Lei, Hongbin Liu, Jiazheng Wang, Fan Zhang, Nicolas Padoy, Nassir Navab, and Hongliang Ren. Endochat: Grounded multimodal large language model for endoscopic surgery.Medical Image Analysis, 107:103789, 2026. doi: 10.1016/j.media.2025.103789

  10. [10]

    Goodman, Krishna K

    Emmett D. Goodman, Krishna K. Patel, Yilun Zhang, William Locke, Chris J. Kennedy, Rohan Mehrotra, Stephen Ren, Melody Guan, Orr Zohar, Maren Downing, Hao Wei Chen, Jevin Z. Clark, Margaret T. Berrigan, Gabriel A. Brat, and Serena Yeung-Levy. Analyzing surgical technique in diverse open surgical videos with multitask machine learning.JAMA Surgery, 159(2):...

  11. [11]

    EgoSurgery-Phase: A dataset of surgical phase recognition from egocentric open surgery videos

    Ryo Fujii, Masashi Hatano, Hideo Saito, and Hiroki Kajita. EgoSurgery-Phase: A dataset of surgical phase recognition from egocentric open surgery videos. InMedical Image Computing and Computer Assisted Intervention – MICCAI 2024, volume 15006 ofLecture Notes in Computer Science, pages 184–195. Springer Nature Switzerland, 2024. doi: 10.1007/978-3-031-72089-5_18

  12. [12]

    IEEE Transactions on Medical Imaging36(1), 86–97 (2017).https://doi.org/10.1109/TMI.2016.2593957

    Apriyana P. Twinanda, Shekoofeh Shehata, Didier Mutter, Jacques Marescaux, Michel Mathelin, and Nicolas Padoy. Endonet: A deep architecture for recognition tasks on laparoscopic videos.IEEE Transactions on Medical Imaging, 36(1):86–97, 2017. doi: 10.1109/TMI.2016.2593957

  13. [13]

    Medical Image Analysis59, 101570 (Jan 2020),http://dx.doi.org/10.1016/j.media

    Chinedu Innocent Nwoye, Tong Yu, Cristians Gonzalez, Barbara Seeliger, Pietro Mascagni, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos.Medical Image Analysis, 78:102433, 2022. doi: 10.1016/j.media. 2022.102433

  14. [14]

    Au- tolaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy

    Ziyi Wang, Bo Lu, Yonghao Long, Fangxun Zhong, Tak-Hong Cheung, Qi Dou, and Yunhui Liu. Au- tolaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy. InMedical Image Computing and Computer Assisted Intervention – MICCAI 2022, pages 486–496. Springer, 2022. doi: 10.1007/978-3-031-16449-1_46

  15. [15]

    Lavanchy, Orestis Zisimopoulos, Pietro Mascagni, Didier Mutter, and Nicolas Padoy

    Joël L. Lavanchy, Orestis Zisimopoulos, Pietro Mascagni, Didier Mutter, and Nicolas Padoy. Challenges in multi-centric generalization: Phase and step recognition in roux-en-y gastric bypass surgery.International Journal of Computer Assisted Radiology and Surgery, 2024. doi: 10.1007/s11548-024-03166-3

  16. [16]

    SAR-RARP50: 12 Segmentation of surgical instrumentation and action recognition on robot-assisted radical prostatectomy challenge.arXiv preprint arXiv:2401.00496, 2024

    Dimitrios Psychogyios, Emanuele Colleoni, Beatrice Van Amsterdam, Chih-Yang Li, Shu-Yu Huang, Yuchong Li, Fucang Jia, Baosheng Zou, Guotai Wang, Yang Liu, Maxence Boels, Jiayu Huo, Rachel Sparks, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin, Mengya Xu, An Wang, Yanan Wu, Long Bai, Hongliang Ren, Atsushi Yamada, Yuriko Harai, Yuto Ishikawa, Kazu...

  17. [17]

    Cataract-101: Video dataset of 101 cataract surgeries

    Klaus Schoeffmann, Mario Taschwer, Stephanie Sarny, Bernd Münzer, Manfred Jürgen Primus, and Doris Putzgruber. Cataract-101: Video dataset of 101 cataract surgeries. InProceedings of the 9th ACM Multimedia Systems Conference, pages 421–425, 2018. doi: 10.1145/3204949.3208137

  18. [18]

    Garcia-Peraza-Herrera

    Chengan Che, Chao Wang, Tom Vercauteren, Sophia Tsoka, and Luis C. Garcia-Peraza-Herrera. LEMON: A large endoscopic MONocular dataset and foundation model for perception in surgical settings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

  19. [19]

    SurgBench: A unified large-scale benchmark for surgical video analysis.arXiv preprint arXiv:2506.07603, 2025

    Jianhui Wei, Zikai Xiao, Danyu Sun, Luqi Gong, Zongxin Yang, Zuozhu Liu, and Jian Wu. SurgBench: A unified large-scale benchmark for surgical video analysis.arXiv preprint arXiv:2506.07603, 2025

  20. [20]

    Large-scale self-supervised video foundation model for intelligent surgery.npj Digital Medicine, 9:220, 2026

    Shu Yang, Fengtao Zhou, Leon Mayer, Fuxiang Huang, Yiliang Chen, Yihui Wang, Sunan He, Yuxi- ang Nie, Xi Wang, Yueming Jin, Huihui Sun, Shuchang Xu, Alex Qinyang Liu, Zheng Li, Jing Qin, Jeremy YuenChun Teoh, Lena Maier-Hein, and Hao Chen. Large-scale self-supervised video foundation model for intelligent surgery.npj Digital Medicine, 9:220, 2026. doi: 10...

  21. [21]

    General surgery vision transformer: A video pre-trained foundation model for general surgery.arXiv preprint arXiv:2403.05949, 2024

    Samuel Schmidgall, Ji Woong Kim, Jeffrey Jopling, and Axel Krieger. General surgery vision transformer: A video pre-trained foundation model for general surgery.arXiv preprint arXiv:2403.05949, 2024

  22. [22]

    CPT® (Current Procedural Terminology)

    American Medical Association. CPT® (Current Procedural Terminology). https://www.ama-assn. org/practice-management/cpt, 2026

  23. [23]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 28492–28518. PMLR, 2023

  24. [24]

    Müller-Stich, Anastasia Kisilenko, Danail Tran, Philipp Heger, Maximilian Müller, Hannes G

    Martin Wagner, Beat P. Müller-Stich, Anastasia Kisilenko, Danail Tran, Philipp Heger, Maximilian Müller, Hannes G. Kenngott, Felix Nickel, and Stefanie Speidel. Comparative validation of machine learning algorithms for surgical workflow and skill analysis with the heichole benchmark.Medical Image Analysis, 86:102770, 2023. doi: 10.1016/j.media.2023.102770

  25. [25]

    The TUM LapChole dataset for the M2CAI 2016 workflow challenge.arXiv preprint arXiv:1610.09278, 2016

    Ralf Stauder, Daniel Ostler, Michael Kranzfelder, Sebastian Koller, Hubertus Feußner, and Nassir Navab. The TUM LapChole dataset for the M2CAI 2016 workflow challenge.arXiv preprint arXiv:1610.09278, 2016

  26. [26]

    Cataract-1K dataset for deep-learning-assisted analysis of cataract surgery videos.Scientific Data, 11(1):373, 2024

    Negin Ghamsarian, Yosuf El-Shabrawi, Sahar Nasirihaghighi, Doris Putzgruber-Adamitsch, Martin Zinkernagel, Sebastian Wolf, Klaus Schoeffmann, and Raphael Sznitman. Cataract-1K dataset for deep-learning-assisted analysis of cataract surgery videos.Scientific Data, 11(1):373, 2024. doi: 10.1038/s41597-024-03193-4

  27. [27]

    2017 robotic instrument segmentation challenge.arXiv preprint arXiv:1902.06426, 2019

    Max Allan, Alex Shvets, Thomas Kurmann, Zichen Zhang, Rahul Duggal, Yun-Hsuan Su, Nicola Rieke, Iro Laina, Niveditha Kalavakonda, Sebastian Bodenstedt, Luis Herrera, Wenqi Li, Vladimir Iglovikov, Huoling Luo, Jian Yang, Danail Stoyanov, Lena Maier-Hein, Stefanie Speidel, and Mahdi Azizian. 2017 robotic instrument segmentation challenge.arXiv preprint arXi...

  28. [28]

    2018 robotic scene segmentation challenge.arXiv preprint arXiv:2001.11190, 2020

    Max Allan, Satoshi Kondo, Sebastian Bodenstedt, Stefan Leger, Rahim Kadkhodamohammadi, Imanol Lu- engo, Felix Fuentes, Evangello Flouty, Ahmed Mohammed, Marius Pedersen, Avinash Kori, Varghese Alex, Ganapathy Krishnamurthi, David Rauber, Robert Mendel, Christoph Palm, Sophia Bano, Guinther Saibro, Chi-Sheng Shih, Hsun-An Chiang, Juntang Zhuang, Junlin Yan...

  29. [29]

    LapGyn4: A dataset for 4 automatic content analysis problems in the domain of laparoscopic gynecology

    Andreas Leibetseder, Stefan Petscharnig, Manfred Jürgen Primus, Sabrina Kletz, Bernd Münzer, Klaus Schoeffmann, and Jörg Keckstein. LapGyn4: A dataset for 4 automatic content analysis problems in the domain of laparoscopic gynecology. InProceedings of the 9th ACM Multimedia Systems Conference, pages 357–362, 2018. doi: 10.1145/3204949.3208127

  30. [30]

    Pixel-wise recognition for holistic surgical scene understanding.Medical Image Analysis, 106: 103726, 2025

    Nicolás Ayobi, Santiago Rodríguez, Alejandra Pérez, Isabela Hernández, Nicolás Aparicio, Eugénie Dessevres, Sebastián Peña, Jessica Santander, Juan Ignacio Caicedo, Nicolás Fernández, and Pablo Arbeláez. Pixel-wise recognition for holistic surgical scene understanding.Medical Image Analysis, 106: 103726, 2025. doi: 10.1016/j.media.2025.103726. 13

  31. [31]

    Hong, C.-L

    W.-Y . Hong, C.-L. Kao, Y .-H. Kuo, J.-R. Wang, W.-L. Chang, and C.-S. Shih. CholecSeg8k: A semantic seg- mentation dataset for laparoscopic cholecystectomy based on Cholec80.arXiv preprint arXiv:2012.12453, 2020

  32. [32]

    Scientific Data12(1), 331 (2025).https://doi.org/ 10.1038/s41597-025-04642-4

    Pietro Mascagni, Deepak Alapatt, Aditya Murali, Armine Vardazaryan, Alain Garcia, Nariaki Okamoto, Guido Costamagna, Didier Mutter, Jacques Marescaux, Bernard Dallemagne, and Nicolas Padoy. En- doscapes, a critical view of safety and surgical scene segmentation dataset for laparoscopic cholecystectomy. Scientific Data, 12:331, 2025. doi: 10.1038/s41597-02...

  33. [33]

    Khan, Sophia Bano, Hani J

    Runlong He, Mengya Xu, Adrito Das, Danyal Z. Khan, Sophia Bano, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, and Mobarakol Islam. PitVQA: Image-grounded text embedding LLM for visual question answering in pituitary surgery. InMedical Image Computing and Computer Assisted Intervention – MICCAI 2024, volume 15006 ofLecture Notes in Computer Science...

  34. [34]

    Video dataset for surgical phase, keypoint, and instrument recognition in laparoscopic surgery (PhaKIR).arXiv preprint arXiv:2511.06549, 2025

    Tobias Rueckert, Raphaela Maerkl, David Rauber, Leonard Klausmann, Max Gutbrod, Daniel Rueckert, Hubertus Feussner, Dirk Wilhelm, and Christoph Palm. Video dataset for surgical phase, keypoint, and instrument recognition in laparoscopic surgery (PhaKIR).arXiv preprint arXiv:2511.06549, 2025

  35. [35]

    Scientific Data 10(1), 3 (2023).https://doi.org/10.1038/s41597-022-01719-2

    Matthias Carstens, Franziska Maria Rinner, Sebastian Bodenstedt, Andreas C. Jenke, Jürgen Weitz, Marius Distler, Stefanie Speidel, and Felix R. Kolbinger. The dresden surgical anatomy dataset for abdominal organ segmentation in surgical data science.Scientific Data, 10(1):3, 2023. doi: 10.1038/s41597-022-01719-2

  36. [36]

    Surgical visual understanding (SurgVU) dataset

    Aneeq Zia, Max Berniker, Rogerio Nespolo, Conor Perreault, Ziheng Wang, Benjamin Mueller, Ryan Schmidt, Kiran Bhattacharyya, Xi Liu, and Anthony Jarc. Surgical visual understanding (SurgVU) dataset. arXiv preprint arXiv:2501.09209, 2025

  37. [37]

    Qwen3-VL Technical Report.arXiv preprint arXiv:2511.21631, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  38. [38]

    GPT-5 System Card.https://cdn.openai.com/gpt-5-system-card.pdf, August 2025

    OpenAI. GPT-5 System Card.https://cdn.openai.com/gpt-5-system-card.pdf, August 2025

  39. [39]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities, 2025

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, et al. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities, 2025. URLhttps://arxiv.org/abs/2507. 06261. 14 A Technical appendices and supplementary material...