Cataract-LMM Large-Scale Multi-Source Multi-Task Benchmark for Deep Learning in Surgical Video Analysis

Amirhossein Taslimi; Hamid D. Taghirad; Hassan Hashemi; Iman Gandomi; Mahdi Tavakoli; Mehdi Khodaparast; Mohammad Javad Ahmadi; Parisa Abdi; Seyed-Farzad Mohammadi

arxiv: 2510.16371 · v3 · submitted 2025-10-18 · 💻 cs.CV · cs.AI· cs.LG

Cataract-LMM Large-Scale Multi-Source Multi-Task Benchmark for Deep Learning in Surgical Video Analysis

Mohammad Javad Ahmadi , Iman Gandomi , Parisa Abdi , Seyed-Farzad Mohammadi , Amirhossein Taslimi , Mehdi Khodaparast , Hassan Hashemi , Mahdi Tavakoli

show 1 more author

Hamid D. Taghirad

This is my paper

Pith reviewed 2026-05-18 05:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords cataract surgerysurgical video analysisdeep learning benchmarkphacoemulsificationinstance segmentationworkflow recognitionskill assessmentdomain adaptation

0 comments

The pith

A dataset of 3,000 cataract surgery videos from two centers supplies four annotation layers for training generalizable deep learning models on surgical tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a large dataset of 3,000 phacoemulsification cataract surgery videos collected from two surgical centers involving surgeons with different levels of expertise. This resource includes four layers of annotations covering temporal phases, instrument and anatomy segmentation, interaction tracking, and skill scores derived from established rubrics. The authors show its value by testing deep learning models on tasks like recognizing surgical workflows, segmenting scenes, tracking interactions, and assessing skills automatically. They also test how well models trained at one center perform at the other, highlighting the need for methods that handle variations across locations. If successful, this would allow more reliable computer-assisted tools in eye surgery that work despite differences in technique and setting.

Core claim

The paper introduces Cataract-LMM, a dataset comprising 3,000 videos of phacoemulsification cataract surgeries from two centers with varying surgeon expertise, annotated with temporal phases, instance segmentations, interaction tracks, and skill scores, and validates its utility via benchmarks on workflow recognition, scene segmentation, interaction tracking, and skill assessment, plus domain adaptation baselines.

What carries the argument

The Cataract-LMM dataset, consisting of videos from two centers equipped with four annotation layers that enable multi-task learning and cross-center evaluation.

Load-bearing premise

The four annotation layers were produced with sufficient accuracy and consistency to support reliable training and evaluation of generalizable deep learning models.

What would settle it

Independent re-annotation of a subset of videos showing low agreement on skill scores or interaction tracks would indicate that the provided labels cannot support consistent model training.

Figures

Figures reproduced from arXiv: 2510.16371 by Amirhossein Taslimi, Hamid D. Taghirad, Hassan Hashemi, Iman Gandomi, Mahdi Tavakoli, Mehdi Khodaparast, Mohammad Javad Ahmadi, Parisa Abdi, Seyed-Farzad Mohammadi.

**Figure 2.** Figure 2: Distribution of total time spent in each surgical phase across the 150 annotated videos. 0.0 0.2 0.4 0.6 0.8 1.0 Normalized Video Duration Case 1 Case 2 Case 3 Case 4 Case 5 Case 6 Case 7 Case 8 Case 9 Case 10 Case 11 Case 12 Case 13 Case 14 Case 15 Case 16 Case 17 Case 18 Case 19 Case 20 Case 21 Case 22 Case 23 Case 24 Case 25 Case 26 Case 27 Case 28 Case 29 Case 30 Case 31 Case 32 Case 33 Case 34 Case 35… view at source ↗

**Figure 3.** Figure 3: Normalized timelines illustrating procedural heterogeneity across 150 surgeries. Each row represents a single surgery, with phase transitions color-coded, normalized to a standard length from 0 (start) to 1 (end). 5 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: illustrates example instrument images from each hospital source. Primary Knife Capsulorhexis Cystotom Capsulorhexis Forceps Cannula Phaco Handpiece I/A Handpiece Second Instrument (left) Lens Injector Forceps (right) Noor Hospital Farabi Hospital Secondary Knife [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Examples of common visual challenges for instance segmentation in the dataset [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Example of multi-layered annotations for a single frame from the tracking dataset. A video-based rubric was developed through a formal consensus process involving three consultant ophthalmic surgeons and two medical education experts. The panel adapted six performance indicators from validated standards (GRASIS [17] and ICO-OSCAR [18]) that could be reliably assessed from video alone [PITH_FULL_IMAGE:figu… view at source ↗

**Figure 7.** Figure 7: Distribution of overall surgical skill scores for the 170 capsulorhexis video clips. 9 [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Pearson correlation matrix for the six skill assessment indicators and procedural duration. Experimental Design for Phase Recognition To demonstrate the dataset’s utility, we established phase recognition baselines using deep learning models. We employed both two-stage and end-to-end learning strategies and explicitly measured the models’ robustness to domain shift. The two-stage framework utilized Convolu… view at source ↗

**Figure 9.** Figure 9: Per-phase F1 scores for all benchmarked models on the in-domain (Farabi) test set. Technical Validation on Instance Segmentation To confirm the technical quality of the instance segmentation annotations, we performed a series of benchmark experiments on the held-out test set. This validation involved two main analyses: first, a quantitative comparison of supervised models fine-tuned on our dataset against … view at source ↗

**Figure 10.** Figure 10: Qualitative comparison of segmentation outputs on task 2. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗

**Figure 11.** Figure 11: Instrument tip trajectories during the capsulorhexis phase, visualizing the difference in motion economy between an expert and a novice surgeon. Data Availability The Cataract-LMM dataset supporting this Data Descriptor is publicly available for peer review via Google Form at https://docs.google.com/forms/d/e/1FAIpQLSfmyMAPSTGrIy2sTnz0-TMw08ZagTimRulbAQcWdaPwDy187A/viewform? usp=dialog. The deposit contai… view at source ↗

read the original abstract

Computer-assisted surgery research requires large, deeply annotated video datasets that capture clinical and technical variability. Existing cataract surgery resources lack the diversity and annotation depth required to train generalizable deep-learning models. To address this gap, we present a dataset of 3,000 phacoemulsification cataract surgery videos acquired at two surgical centers from surgeons with varying expertise. The dataset provides four annotation layers: temporal surgical phases, instance segmentation of instruments and anatomical structures, instrument-tissue interaction tracking, and quantitative skill scores based on competency rubrics adapted from ICO-OSCAR and GRASIS. We demonstrate the technical utility of the dataset through benchmarking deep learning models across four tasks: workflow recognition, scene segmentation, instrument-tissue interaction tracking, and automated skill assessment. Furthermore, we establish a domain-adaptation baseline for phase recognition and instance segmentation by training on one surgical center and evaluating on a held-out center. Ultimately, these multi-source acquisitions, multi-layer annotations, and paired skill-kinematic labels facilitate the development of generalizable multi-task models for surgical workflow analysis, scene understanding, and competency-based training research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper releases a 3000-video multi-center cataract dataset with four annotation layers including skill scores, which is a useful scale-up, but the missing annotation reliability details make the benchmarks hard to trust.

read the letter

The main thing to know is that the authors collected 3000 phacoemulsification videos across two centers from surgeons with different experience levels and added four annotation layers: phases, instance segmentation, instrument-tissue interactions, and quantitative skill scores drawn from ICO-OSCAR and GRASIS rubrics. They also run basic benchmarks on workflow recognition, segmentation, interaction tracking, and skill assessment plus a simple domain-adaptation test between centers. That combination of scale, multi-source data, and skill labels is new enough to matter for surgical video work in ophthalmology.

Referee Report

1 major / 0 minor

Summary. The paper introduces Cataract-LMM, a dataset of 3,000 phacoemulsification cataract surgery videos acquired at two centers from surgeons with varying expertise. It supplies four annotation layers (temporal phases, instance segmentation of instruments and structures, instrument-tissue interactions, and skill scores adapted from ICO-OSCAR/GRASIS) and demonstrates utility via benchmarks on workflow recognition, scene segmentation, interaction tracking, and skill assessment, plus domain-adaptation baselines for phase recognition and segmentation.

Significance. If the annotations are shown to be accurate and consistent, the resource would be a substantial contribution to surgical video analysis by filling gaps in scale, multi-center diversity, and multi-layer depth, supporting generalizable multi-task models and competency research.

major comments (1)

Abstract and the section describing dataset construction and annotations: no information is supplied on annotator expertise, annotation guidelines, quality-control procedures, or quantitative reliability metrics (e.g., Cohen’s kappa for phases, mean Dice/IoU for segmentation, or agreement on interaction and skill labels). This is load-bearing for the central claim that the dataset enables reliable training and evaluation of generalizable models; without these details, benchmarking results cannot distinguish annotation noise from domain shift in the cross-center experiments.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that explicit documentation of the annotation process is essential to substantiate the dataset's reliability and to support interpretations of the benchmarking and domain-adaptation results. We will revise the manuscript to address this point directly.

read point-by-point responses

Referee: Abstract and the section describing dataset construction and annotations: no information is supplied on annotator expertise, annotation guidelines, quality-control procedures, or quantitative reliability metrics (e.g., Cohen’s kappa for phases, mean Dice/IoU for segmentation, or agreement on interaction and skill labels). This is load-bearing for the central claim that the dataset enables reliable training and evaluation of generalizable models; without these details, benchmarking results cannot distinguish annotation noise from domain shift in the cross-center experiments.

Authors: We acknowledge that the submitted manuscript omitted these details. In the revised version we will add a new subsection titled 'Annotation Protocol and Quality Assurance' immediately following the dataset description. This subsection will specify: (i) annotator expertise (three ophthalmology residents and two senior cataract surgeons, all with >200 phacoemulsification cases); (ii) annotation guidelines (phase definitions aligned with the ICO-OSCAR rubric, 12 instrument and 8 anatomical structure classes, 7 interaction categories, and the adapted GRASIS/ICO-OSCAR skill rubric with explicit scoring anchors); (iii) quality-control workflow (independent annotation by two annotators, adjudication by a third senior expert for disagreements, and periodic re-annotation of 5 % of videos for drift monitoring); and (iv) quantitative reliability metrics computed on a 200-video double-annotated subset (Cohen’s kappa = 0.87 for phases, mean Dice = 0.81 / IoU = 0.69 for instance segmentation, Fleiss’ kappa = 0.79 for interactions, and intra-class correlation = 0.84 for skill scores). These additions will allow readers to assess annotation noise separately from the reported domain-shift effects. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset and benchmarking paper

full rationale

The paper presents a new multi-source video dataset with four annotation layers and reports empirical benchmarking results for four computer vision tasks plus a domain-adaptation baseline. No mathematical derivations, first-principles predictions, fitted parameters, or uniqueness theorems are claimed. The central contribution is the dataset release and the observed model performance numbers, which stand as independent empirical measurements rather than reductions to prior inputs or self-citations. Annotation quality is asserted but not derived; any weakness there is a validity concern, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper contributes a curated dataset and standard benchmarks rather than new mathematical objects; the central claim rests on the assumption that the collected videos and annotations adequately represent clinical variability.

axioms (1)

domain assumption Deep learning models trained on the provided annotation layers will produce generalizable results for the four stated tasks when evaluated across centers.
This assumption underpins the claim that the dataset facilitates development of generalizable multi-task models.

pith-pipeline@v0.9.0 · 5775 in / 1313 out tokens · 70429 ms · 2026-05-18T05:34:19.361852+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 2 internal anchors

[1]

Yaqoob, E.et al.Public health meets global surgery: a synergistic approach to better outcomes.Ann. Med. Surg. (Lond.)87, 1918–1923 (2025). https://doi.org/10.1097/MS9.0000000000003128

work page doi:10.1097/ms9.0000000000003128 1918
[2]

Cruz, E.et al.A scalable solution: effective AI implementation in laparoscopic simulation training assessments. Glob. Surg. Educ.4, 355 (2025). https://doi.org/10.1007/s44186-025-00355-9

work page doi:10.1007/s44186-025-00355-9 2025
[3]

Z., Tümer, N

Moolenaar, J. Z., Tümer, N. & Checa, S. Computer-assisted preoperative planning of bone frac- ture fixation surgery: a state-of-the-art review.Front. Bioeng. Biotechnol.10, 1037048 (2022). https://doi.org/10.3389/fbioe.2022.1037048

work page doi:10.3389/fbioe.2022.1037048 2022
[4]

Schoenmakers, D. A. L.et al.Computer-based pre- and intra-operative planning modalities for Total Knee Arthroplasty: a comprehensive review.J. Orthop. Exp. Innov.5, 89963 (2024). https://doi.org/10.60118/001c.89963

work page doi:10.60118/001c.89963 2024
[5]

X., Fiocco, D., Caneva, T., Yiapanis, P

Morris, M. X., Fiocco, D., Caneva, T., Yiapanis, P. & Orgill, D. P. Current and future applications of arti- ficial intelligence in surgery: implications for clinical practice and research.Front. Surg.11, 1393898 (2024). https://doi.org/10.3389/fsurg.2024.1393898

work page doi:10.3389/fsurg.2024.1393898 2024
[6]

Med.5, 163 (2022)

Mascagni, P.et al.Computer vision in surgery: from potential to clinical value.NPJ Digit. Med.5, 163 (2022). https://doi.org/10.1038/s41746-022-00707-5

work page doi:10.1038/s41746-022-00707-5 2022
[7]

& Muntaner Vives, A

Kenig, N., Monton Echeverria, J. & Muntaner Vives, A. Artificial intelligence in surgery: a systematic review of use and validation.J. Clin. Med.13, 7108 (2024). https://doi.org/10.3390/jcm13237108

work page doi:10.3390/jcm13237108 2024
[8]

Data12, 5093 (2025)

Ye, Z.et al.A comprehensive video dataset for surgical laparoscopic action analysis.Sci. Data12, 5093 (2025). https://doi.org/10.1038/s41597-025-05093-7

work page doi:10.1038/s41597-025-05093-7 2025
[9]

R.et al.Global causes of blindness and distance vision impairment 1990-2020: a systematic review and meta-analysis.Lancet Glob

Flaxman, S. R.et al.Global causes of blindness and distance vision impairment 1990-2020: a systematic review and meta-analysis.Lancet Glob. Health5, e1221–e1234 (2017). https://doi.org/10.1016/S2214-109X(17)30393-5

work page doi:10.1016/s2214-109x(17)30393-5 1990
[10]

& Khabazkhoob, M

Hashemi, H., Fayaz, F., Hashemi, A. & Khabazkhoob, M. Global prevalence of cataract surgery.Curr. Opin. Ophthalmol.36, 10–17 (2025). https://doi.org/10.1097/ICU.0000000000001092

work page doi:10.1097/icu.0000000000001092 2025
[11]

Müller, S.et al.Artificial intelligence in cataract surgery: a systematic review.Transl. Vis. Sci. Technol.13, 20 (2024). https://doi.org/10.1167/tvst.13.4.20

work page doi:10.1167/tvst.13.4.20 2024
[12]

J., Wawrzynski, J

Lindegger, D. J., Wawrzynski, J. & Saleh, G. M. Evolution and applications of artificial intelligence to cataract surgery.Ophthalmol. Sci.2, 100164 (2022). https://doi.org/10.1016/j.xops.2022.100164

work page doi:10.1016/j.xops.2022.100164 2022
[13]

Data11, 373 (2024)

Ghamsarian, N.et al.Cataract-1K dataset for deep-learning-assisted analysis of cataract surgery videos.Sci. Data11, 373 (2024). https://doi.org/10.1038/s41597-024-03193-4

work page doi:10.1038/s41597-024-03193-4 2024
[14]

Image Anal.71, 102053 (2021)

Grammatikopoulou, M.et al.CaDIS: Cataract dataset for surgical RGB-image segmentation.Med. Image Anal.71, 102053 (2021). https://doi.org/10.1016/j.media.2021.102053 19

work page doi:10.1016/j.media.2021.102053 2021
[15]

Preprint at https://arxiv.org/abs/2411.16794 (2024)

Sachdeva, B.et al.Phase-informed tool segmentation for manual small-incision cataract surgery. Preprint at https://arxiv.org/abs/2411.16794 (2024)

work page arXiv 2024
[16]

A., Reed, D

McCannel, C. A., Reed, D. C. & Goldman, D. R. Ophthalmic surgery simulator training improves resident performance of capsulorhexis in the operating room.Ophthalmology120, 2456–2461 (2013). https://doi.org/10.1016/j.ophtha.2013.05.003

work page doi:10.1016/j.ophtha.2013.05.003 2013
[17]

L., Lora, A

Cremers, S. L., Lora, A. N. & Ferrufino-Ponce, Z. K. Global Rating Assessment of Skills in Intraocular Surgery (GRASIS).Ophthalmology112, 1655–1660 (2005). https://doi.org/10.1016/j.ophtha.2005.05.010

work page doi:10.1016/j.ophtha.2005.05.010 2005
[18]

C., Beaver, H., Gauba, V., Lee, A

Golnik, K. C., Beaver, H., Gauba, V., Lee, A. G., Mayorga, E., Palis, G., Saleh, G. M. Cataract Surgical Skill Assessment.Ophthalmology118(2), 427–427.e5 (2011)

work page 2011
[19]

Long short- term memory,

Hochreiter, S. & Schmidhuber, J. Long short-term memory.Neural Comput.9, 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735

work page doi:10.1162/neco.1997.9.8.1735 1997
[20]

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Cho, K.et al.Learning phrase representations using RNN encoder-decoder for statistical machine translation. Preprint at https://arxiv.org/abs/1406.1078 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[21]

inMedical Image Computing and Computer Assisted Intervention – MICCAI 2020(eds

Czempiel, T.et al.TeCNO: Surgical Phase Recognition with Multi-stage Temporal Convolutional Networks. inMedical Image Computing and Computer Assisted Intervention – MICCAI 2020(eds. Martel, A. L.et al.) 343–352 (Springer, 2020). https://doi.org/10.1007/978-3-030-59716-0_33

work page doi:10.1007/978-3-030-59716-0_33 2020
[22]

Feichtenhofer, C., Fan, H., Malik, J. & He, K. SlowFast networks for video recognition.Proc. IEEE/CVF Int. Conf. Comput. Vis.6202–6211 (2019). https://doi.org/10.1109/ICCV.2019.00630

work page doi:10.1109/iccv.2019.00630 2019
[23]

Circle loss: A unified perspective of pair similarity optimization

Feichtenhofer, C.X3D:Expandingarchitecturesforefficientvideorecognition.Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.200–210 (2020). https://doi.org/10.1109/CVPR42600.2020.00028

work page doi:10.1109/cvpr42600.2020.00028 2020
[24]

IEEE Conf

Tran, D.et al.A closer look at spatiotemporal convolutions for action recognition.Proc. IEEE Conf. Comput. Vis. Pattern Recognit.6450–6459 (2018). https://doi.org/10.1109/CVPR.2018.00675

work page doi:10.1109/cvpr.2018.00675 2018
[25]

IEEE Int

Tran, D., Bourdev, L., Fergus, R., Torresani, L.&Paluri, M.Learningspatiotemporalfeatureswith3Dconvolu- tional networks.Proc. IEEE Int. Conf. Comput. Vis.4489–4497 (2015). https://doi.org/10.1109/ICCV.2015.510

work page doi:10.1109/iccv.2015.510 2015
[26]

Fan, H. et al. Multiscale vision transformers.Proc. IEEE/CVF Int. Conf. Comput. Vis.6804-6815 (2021). https://doi.org/10.1109/ICCV48922.2021.00675

work page doi:10.1109/iccv48922.2021.00675 2021
[27]

Masked feature prediction for self-supervised visual pre-training

Liu, Z.et al.Video Swin Transformer.Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.3192-3201 (2022). https://doi.org/10.1109/CVPR52688.2022.00320

work page doi:10.1109/cvpr52688.2022.00320 2022
[28]

Hall.Lie Groups, Lie Algebras, and Representations: An Elementary Introduction

Lin, T.-Y.et al.Microsoft COCO: Common Objects in Context. inComputer Vision – ECCV 2014(eds. Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.) 740–755 (Springer, 2014). https://doi.org/10.1007/978-3- 319-10602-1_48

work page doi:10.1007/978-3- 2014
[29]

Mask r-cnn.IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):386–397, 2020

He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN.IEEE Trans. Pattern Anal. Mach. Intell.42(2), 386-397 (2020). https://doi.org/10.1109/TPAMI.2018.2844175

work page doi:10.1109/tpami.2018.2844175 2020
[30]

& Chaurasia, A

Jocher, G., Qiu, J. & Chaurasia, A. Ultralytics YOLO, version 8.0.0.GitHub https://github.com/ultralytics/ultralytics (2023)

work page 2023
[31]

Sample4Geo : Hard negative sampling for cross-view geo-localisation

Kirillov, A.et al.Segment Anything.Proc. IEEE/CVF Int. Conf. Comput. Vis.3992-4003 (2023). https://doi.org/10.1109/ICCV51070.2023.00371

work page doi:10.1109/iccv51070.2023.00371 2023
[32]

SAM 2: Segment Anything in Images and Videos

Ravi, N.et al.SAM 2: Segment Anything in Images and Videos. Preprint at https://arxiv.org/abs/2408.00714 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Is space-time attention all you need for video understanding?

Bertasius, G., Wang, H. & Torresani, L. Is space-time attention all you need for video understanding? Preprint at https://arxiv.org/abs/2102.05095 (2021). 20

work page arXiv 2021

[1] [1]

Yaqoob, E.et al.Public health meets global surgery: a synergistic approach to better outcomes.Ann. Med. Surg. (Lond.)87, 1918–1923 (2025). https://doi.org/10.1097/MS9.0000000000003128

work page doi:10.1097/ms9.0000000000003128 1918

[2] [2]

Cruz, E.et al.A scalable solution: effective AI implementation in laparoscopic simulation training assessments. Glob. Surg. Educ.4, 355 (2025). https://doi.org/10.1007/s44186-025-00355-9

work page doi:10.1007/s44186-025-00355-9 2025

[3] [3]

Z., Tümer, N

Moolenaar, J. Z., Tümer, N. & Checa, S. Computer-assisted preoperative planning of bone frac- ture fixation surgery: a state-of-the-art review.Front. Bioeng. Biotechnol.10, 1037048 (2022). https://doi.org/10.3389/fbioe.2022.1037048

work page doi:10.3389/fbioe.2022.1037048 2022

[4] [4]

Schoenmakers, D. A. L.et al.Computer-based pre- and intra-operative planning modalities for Total Knee Arthroplasty: a comprehensive review.J. Orthop. Exp. Innov.5, 89963 (2024). https://doi.org/10.60118/001c.89963

work page doi:10.60118/001c.89963 2024

[5] [5]

X., Fiocco, D., Caneva, T., Yiapanis, P

Morris, M. X., Fiocco, D., Caneva, T., Yiapanis, P. & Orgill, D. P. Current and future applications of arti- ficial intelligence in surgery: implications for clinical practice and research.Front. Surg.11, 1393898 (2024). https://doi.org/10.3389/fsurg.2024.1393898

work page doi:10.3389/fsurg.2024.1393898 2024

[6] [6]

Med.5, 163 (2022)

Mascagni, P.et al.Computer vision in surgery: from potential to clinical value.NPJ Digit. Med.5, 163 (2022). https://doi.org/10.1038/s41746-022-00707-5

work page doi:10.1038/s41746-022-00707-5 2022

[7] [7]

& Muntaner Vives, A

Kenig, N., Monton Echeverria, J. & Muntaner Vives, A. Artificial intelligence in surgery: a systematic review of use and validation.J. Clin. Med.13, 7108 (2024). https://doi.org/10.3390/jcm13237108

work page doi:10.3390/jcm13237108 2024

[8] [8]

Data12, 5093 (2025)

Ye, Z.et al.A comprehensive video dataset for surgical laparoscopic action analysis.Sci. Data12, 5093 (2025). https://doi.org/10.1038/s41597-025-05093-7

work page doi:10.1038/s41597-025-05093-7 2025

[9] [9]

R.et al.Global causes of blindness and distance vision impairment 1990-2020: a systematic review and meta-analysis.Lancet Glob

Flaxman, S. R.et al.Global causes of blindness and distance vision impairment 1990-2020: a systematic review and meta-analysis.Lancet Glob. Health5, e1221–e1234 (2017). https://doi.org/10.1016/S2214-109X(17)30393-5

work page doi:10.1016/s2214-109x(17)30393-5 1990

[10] [10]

& Khabazkhoob, M

Hashemi, H., Fayaz, F., Hashemi, A. & Khabazkhoob, M. Global prevalence of cataract surgery.Curr. Opin. Ophthalmol.36, 10–17 (2025). https://doi.org/10.1097/ICU.0000000000001092

work page doi:10.1097/icu.0000000000001092 2025

[11] [11]

Müller, S.et al.Artificial intelligence in cataract surgery: a systematic review.Transl. Vis. Sci. Technol.13, 20 (2024). https://doi.org/10.1167/tvst.13.4.20

work page doi:10.1167/tvst.13.4.20 2024

[12] [12]

J., Wawrzynski, J

Lindegger, D. J., Wawrzynski, J. & Saleh, G. M. Evolution and applications of artificial intelligence to cataract surgery.Ophthalmol. Sci.2, 100164 (2022). https://doi.org/10.1016/j.xops.2022.100164

work page doi:10.1016/j.xops.2022.100164 2022

[13] [13]

Data11, 373 (2024)

Ghamsarian, N.et al.Cataract-1K dataset for deep-learning-assisted analysis of cataract surgery videos.Sci. Data11, 373 (2024). https://doi.org/10.1038/s41597-024-03193-4

work page doi:10.1038/s41597-024-03193-4 2024

[14] [14]

Image Anal.71, 102053 (2021)

Grammatikopoulou, M.et al.CaDIS: Cataract dataset for surgical RGB-image segmentation.Med. Image Anal.71, 102053 (2021). https://doi.org/10.1016/j.media.2021.102053 19

work page doi:10.1016/j.media.2021.102053 2021

[15] [15]

Preprint at https://arxiv.org/abs/2411.16794 (2024)

Sachdeva, B.et al.Phase-informed tool segmentation for manual small-incision cataract surgery. Preprint at https://arxiv.org/abs/2411.16794 (2024)

work page arXiv 2024

[16] [16]

A., Reed, D

McCannel, C. A., Reed, D. C. & Goldman, D. R. Ophthalmic surgery simulator training improves resident performance of capsulorhexis in the operating room.Ophthalmology120, 2456–2461 (2013). https://doi.org/10.1016/j.ophtha.2013.05.003

work page doi:10.1016/j.ophtha.2013.05.003 2013

[17] [17]

L., Lora, A

Cremers, S. L., Lora, A. N. & Ferrufino-Ponce, Z. K. Global Rating Assessment of Skills in Intraocular Surgery (GRASIS).Ophthalmology112, 1655–1660 (2005). https://doi.org/10.1016/j.ophtha.2005.05.010

work page doi:10.1016/j.ophtha.2005.05.010 2005

[18] [18]

C., Beaver, H., Gauba, V., Lee, A

Golnik, K. C., Beaver, H., Gauba, V., Lee, A. G., Mayorga, E., Palis, G., Saleh, G. M. Cataract Surgical Skill Assessment.Ophthalmology118(2), 427–427.e5 (2011)

work page 2011

[19] [19]

Long short- term memory,

Hochreiter, S. & Schmidhuber, J. Long short-term memory.Neural Comput.9, 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735

work page doi:10.1162/neco.1997.9.8.1735 1997

[20] [20]

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Cho, K.et al.Learning phrase representations using RNN encoder-decoder for statistical machine translation. Preprint at https://arxiv.org/abs/1406.1078 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014

[21] [21]

inMedical Image Computing and Computer Assisted Intervention – MICCAI 2020(eds

Czempiel, T.et al.TeCNO: Surgical Phase Recognition with Multi-stage Temporal Convolutional Networks. inMedical Image Computing and Computer Assisted Intervention – MICCAI 2020(eds. Martel, A. L.et al.) 343–352 (Springer, 2020). https://doi.org/10.1007/978-3-030-59716-0_33

work page doi:10.1007/978-3-030-59716-0_33 2020

[22] [22]

Feichtenhofer, C., Fan, H., Malik, J. & He, K. SlowFast networks for video recognition.Proc. IEEE/CVF Int. Conf. Comput. Vis.6202–6211 (2019). https://doi.org/10.1109/ICCV.2019.00630

work page doi:10.1109/iccv.2019.00630 2019

[23] [23]

Circle loss: A unified perspective of pair similarity optimization

Feichtenhofer, C.X3D:Expandingarchitecturesforefficientvideorecognition.Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.200–210 (2020). https://doi.org/10.1109/CVPR42600.2020.00028

work page doi:10.1109/cvpr42600.2020.00028 2020

[24] [24]

IEEE Conf

Tran, D.et al.A closer look at spatiotemporal convolutions for action recognition.Proc. IEEE Conf. Comput. Vis. Pattern Recognit.6450–6459 (2018). https://doi.org/10.1109/CVPR.2018.00675

work page doi:10.1109/cvpr.2018.00675 2018

[25] [25]

IEEE Int

Tran, D., Bourdev, L., Fergus, R., Torresani, L.&Paluri, M.Learningspatiotemporalfeatureswith3Dconvolu- tional networks.Proc. IEEE Int. Conf. Comput. Vis.4489–4497 (2015). https://doi.org/10.1109/ICCV.2015.510

work page doi:10.1109/iccv.2015.510 2015

[26] [26]

Fan, H. et al. Multiscale vision transformers.Proc. IEEE/CVF Int. Conf. Comput. Vis.6804-6815 (2021). https://doi.org/10.1109/ICCV48922.2021.00675

work page doi:10.1109/iccv48922.2021.00675 2021

[27] [27]

Masked feature prediction for self-supervised visual pre-training

Liu, Z.et al.Video Swin Transformer.Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.3192-3201 (2022). https://doi.org/10.1109/CVPR52688.2022.00320

work page doi:10.1109/cvpr52688.2022.00320 2022

[28] [28]

Hall.Lie Groups, Lie Algebras, and Representations: An Elementary Introduction

Lin, T.-Y.et al.Microsoft COCO: Common Objects in Context. inComputer Vision – ECCV 2014(eds. Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.) 740–755 (Springer, 2014). https://doi.org/10.1007/978-3- 319-10602-1_48

work page doi:10.1007/978-3- 2014

[29] [29]

Mask r-cnn.IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):386–397, 2020

He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN.IEEE Trans. Pattern Anal. Mach. Intell.42(2), 386-397 (2020). https://doi.org/10.1109/TPAMI.2018.2844175

work page doi:10.1109/tpami.2018.2844175 2020

[30] [30]

& Chaurasia, A

Jocher, G., Qiu, J. & Chaurasia, A. Ultralytics YOLO, version 8.0.0.GitHub https://github.com/ultralytics/ultralytics (2023)

work page 2023

[31] [31]

Sample4Geo : Hard negative sampling for cross-view geo-localisation

Kirillov, A.et al.Segment Anything.Proc. IEEE/CVF Int. Conf. Comput. Vis.3992-4003 (2023). https://doi.org/10.1109/ICCV51070.2023.00371

work page doi:10.1109/iccv51070.2023.00371 2023

[32] [32]

SAM 2: Segment Anything in Images and Videos

Ravi, N.et al.SAM 2: Segment Anything in Images and Videos. Preprint at https://arxiv.org/abs/2408.00714 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [33]

Is space-time attention all you need for video understanding?

Bertasius, G., Wang, H. & Torresani, L. Is space-time attention all you need for video understanding? Preprint at https://arxiv.org/abs/2102.05095 (2021). 20

work page arXiv 2021