Recognition: 2 theorem links · Lean Theorem
SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos
Pith reviewed 2026-05-16 07:13 UTC · model grok-4.3
The pith
SurgMotion learns surgical video understanding by predicting latent motion rather than reconstructing pixels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SurgMotion is a video-native foundation model built on V-JEPA that replaces pixel-level reconstruction with latent motion prediction. It introduces motion-guided latent masked prediction to focus on meaningful regions, spatiotemporal affinity self-distillation to maintain relational consistency, and spatiotemporal feature diversity regularization to avoid collapse in texture-sparse scenes. Pretrained on the 3,658-hour SurgMotion-15M dataset spanning 13 anatomical regions, the model outperforms prior methods, with a 14.6 percent F1 improvement on EgoSurgery workflow recognition, a 10.3 percent improvement on PitVis, and 39.54 percent mAP-IVT on CholecT50 action triplet recognition.
What carries the argument
Motion-guided latent masked prediction that directs learning toward semantically meaningful regions instead of low-level visual noise.
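The review does not spell out the masking rule, but a minimal sketch helps fix the idea: estimate per-patch motion from temporal frame differences and route the latent predictor toward the highest-motion patches. The code below is an illustrative reconstruction, not the authors' implementation; the name `motion_guided_mask`, the average-pooling choice, and the default masking ratio are assumptions.

```python
# Minimal sketch of motion-guided patch selection, assuming motion is
# approximated by temporal frame differences pooled over the patch grid.
# Function and parameter names are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def motion_guided_mask(video, patch=16, mask_ratio=0.75):
    """video: (T, C, H, W) clip. Returns indices of the highest-motion
    patches per frame pair, to be used as latent prediction targets."""
    # Per-pixel motion magnitude between consecutive frames.
    diff = (video[1:] - video[:-1]).abs().mean(dim=1, keepdim=True)      # (T-1, 1, H, W)
    # Pool motion magnitude over each non-overlapping patch.
    patch_motion = F.avg_pool2d(diff, kernel_size=patch, stride=patch)   # (T-1, 1, H/p, W/p)
    scores = patch_motion.flatten(1)                                     # (T-1, N_patches)
    # Mask the highest-motion patches so the predictor must infer them in latent space.
    n_mask = int(mask_ratio * scores.shape[1])
    return scores.topk(n_mask, dim=1).indices                            # (T-1, n_mask)

# Example: a random 8-frame, 224x224 RGB clip.
clip = torch.rand(8, 3, 224, 224)
print(motion_guided_mask(clip).shape)  # torch.Size([7, 147]) with the defaults above
```

Under the V-JEPA paradigm, the context encoder would then see the unmasked patches while the predictor is trained to match the target encoder's latents at the masked indices.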
If this is right
- Higher accuracy on surgical workflow recognition without task-specific fine-tuning.
- Stronger recognition of action triplets that describe tool-tissue interactions.
- Improved performance on skill assessment and visual tasks such as polyp segmentation and depth estimation.
- A scalable pretraining recipe that works across 50 video sources and 13 anatomical regions.
Where Pith is reading between the lines
- The same motion-focused objective could transfer to other high-noise video domains such as underwater or endoscopic imaging.
- Large curated surgical datasets may become standard benchmarks for testing motion-centric video models.
- Reducing emphasis on pixel reconstruction may lower the data and compute needed to reach usable representations.
- Representations built this way could support real-time surgical assistance systems that react to procedure semantics.
Load-bearing premise
Prioritizing latent motion prediction will capture semantically meaningful structures without discarding low-level cues needed for some downstream tasks.
What would settle it
A pixel-reconstruction model trained on the identical SurgMotion-15M dataset that matches or exceeds SurgMotion's scores on the 17 benchmarks would show the motion-prediction shift is not required.
Original abstract
While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details, such as smoke, specular reflections, and fluid motion, rather than semantic structures essential for surgical understanding. We present SurgMotion, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), SurgMotion introduces three key technical innovations tailored to surgical videos: (1) motion-guided latent masked prediction to prioritize semantically meaningful regions, (2) spatiotemporal affinity self-distillation to enforce relational consistency, and (3) spatiotemporal feature diversity regularization (SFDR) to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate SurgMotion-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that SurgMotion significantly outperforms state-of-the-art methods on surgical workflow recognition, achieving 14.6 percent improvement in F1 score on EgoSurgery and 10.3 percent on PitVis; on action triplet recognition with 39.54 percent mAP-IVT on CholecT50; as well as on skill assessment, polyp segmentation, and depth estimation. These results establish SurgMotion as a new standard for universal, motion-oriented surgical video understanding.
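The abstract names SFDR only at a high level. As a hedged illustration of what a spatiotemporal feature diversity regularizer can look like, the sketch below applies a variance-and-covariance penalty over encoder tokens in the style of VICReg; the name `sfdr_loss` and the exact terms are assumptions, not the paper's formulation.

```python
# Hedged sketch of a feature diversity regularizer in the spirit of SFDR.
# It penalizes low per-dimension variance and high cross-dimension covariance,
# which discourages representation collapse in texture-sparse scenes.
import torch

def sfdr_loss(tokens, eps=1e-4):
    """tokens: (B, N, D) spatiotemporal features from the encoder."""
    B, N, D = tokens.shape
    z = tokens.reshape(B * N, D)
    z = z - z.mean(dim=0)
    # Variance term: push each feature dimension's std toward a unit target.
    std = torch.sqrt(z.var(dim=0) + eps)
    var_term = torch.relu(1.0 - std).mean()
    # Covariance term: decorrelate feature dimensions.
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_term = (off_diag ** 2).sum() / D
    return var_term + cov_term

print(sfdr_loss(torch.randn(2, 1568, 768)))  # small on random (non-collapsed) features
```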
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SurgMotion, a video-native foundation model for surgical video understanding built on V-JEPA. It replaces pixel-level reconstruction with latent motion prediction and introduces three surgical-specific components: motion-guided latent masked prediction, spatiotemporal affinity self-distillation, and spatiotemporal feature diversity regularization (SFDR). The authors curate SurgMotion-15M, a new 3,658-hour dataset from 50 sources, and report large gains over prior methods on 17 benchmarks, including +14.6% F1 on EgoSurgery workflow recognition, +10.3% on PitVis, and 39.54% mAP-IVT on CholecT50 action triplet recognition, plus gains on skill assessment, segmentation, and depth estimation.
Significance. If the attribution of gains to the proposed motion-oriented objectives and components is substantiated, the work would provide a new pretraining paradigm and the largest public surgical video corpus to date, with potential to improve downstream performance across workflow, action, and skill tasks in computer-assisted surgery.
major comments (1)
- [Abstract and §4] Abstract and §4 (Experiments): the headline improvements (14.6% F1 on EgoSurgery, 10.3% on PitVis, 39.54% mAP-IVT on CholecT50) are credited to the shift from pixel reconstruction plus the three innovations, yet no control experiment is reported that trains an unmodified V-JEPA baseline on the identical SurgMotion-15M corpus; without this isolation the gains cannot be unambiguously attributed to the methodological changes rather than the 3,658-hour scale and diversity of the new data.
minor comments (2)
- [Abstract] The abstract states specific percentage gains but supplies no error bars, number of runs, or statistical significance tests; these details should be added to the main experimental tables.
- [§3.3] Notation for the SFDR coefficient and motion threshold is introduced without an explicit hyper-parameter table or sensitivity analysis.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that isolating the contributions of our proposed components from the scale of SurgMotion-15M is important and will add the requested control experiment in the revision.
Point-by-point responses
Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline improvements (14.6% F1 on EgoSurgery, 10.3% on PitVis, 39.54% mAP-IVT on CholecT50) are credited to the shift from pixel reconstruction plus the three innovations, yet no control experiment is reported that trains an unmodified V-JEPA baseline on the identical SurgMotion-15M corpus; without this isolation the gains cannot be unambiguously attributed to the methodological changes rather than the 3,658-hour scale and diversity of the new data.
Authors: We acknowledge the validity of this concern. The current manuscript does not include a direct ablation of unmodified V-JEPA trained on the full SurgMotion-15M corpus, which limits the strength of attribution to the motion-guided masking, affinity distillation, and SFDR components. In the revised version we will add this baseline: we will train V-JEPA from scratch on SurgMotion-15M using its original pixel-reconstruction objective and report its performance on all 17 downstream benchmarks alongside the SurgMotion results. This control will be presented in a new table in §4 and referenced in the abstract and discussion, allowing readers to quantify the incremental benefit of the surgical-specific objectives over dataset scale alone.
Revision: yes
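A minimal sketch of the promised control, assuming a shared frozen-feature linear-probe protocol so that any gap between checkpoints reflects the pretraining objective rather than the evaluation head. The helpers (`linear_probe`, `compare`, `extract_features`) and the use of scikit-learn are illustrative, not the paper's evaluation code.

```python
# Sketch: evaluate a vanilla V-JEPA checkpoint and SurgMotion under one
# frozen-feature protocol across downstream benchmarks.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def linear_probe(train_feats, train_y, test_feats, test_y):
    """Fit a linear classifier on frozen features; report macro F1."""
    clf = LogisticRegression(max_iter=2000).fit(train_feats, train_y)
    return f1_score(test_y, clf.predict(test_feats), average="macro")

def compare(encoders, benchmarks, extract_features):
    """encoders: {model_name: frozen encoder}. benchmarks: {name: (train_clips,
    ytr, test_clips, yte)}. extract_features(encoder, clips) -> (n, d) array is
    a hypothetical helper. Returns a {benchmark: {model: F1}} table."""
    table = {}
    for bench, (train_clips, ytr, test_clips, yte) in benchmarks.items():
        table[bench] = {}
        for name, enc in encoders.items():
            Xtr = extract_features(enc, train_clips)
            Xte = extract_features(enc, test_clips)
            table[bench][name] = linear_probe(Xtr, ytr, Xte, yte)
    return table
```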
Circularity Check
No circularity: claims rest on external empirical benchmarks
Full rationale
The paper introduces SurgMotion by extending the external V-JEPA architecture with three described components (motion-guided masked prediction, spatiotemporal affinity self-distillation, SFDR) and a new curated dataset SurgMotion-15M. All headline performance numbers (14.6% F1 on EgoSurgery, 10.3% on PitVis, 39.54% mAP-IVT on CholecT50) are presented as direct comparisons against prior published state-of-the-art methods on public benchmarks. No equations, self-definitions, fitted-parameter renamings, or self-citation chains are supplied that would reduce any claimed prediction or uniqueness result to the paper's own inputs by construction. The derivation chain is therefore self-contained against external references.
Axiom & Free-Parameter Ledger
free parameters (2)
- masking ratio and motion threshold
- SFDR strength coefficient
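The free parameters above (and the sensitivity analysis requested in the minor comments) could be organized as a small configuration grid. The sketch below is illustrative; the names and values are placeholders, not the paper's settings.

```python
# Hypothetical collection and sweep of the ledger's free parameters.
from dataclasses import dataclass
from itertools import product

@dataclass
class PretrainConfig:
    mask_ratio: float        # fraction of patches selected as prediction targets
    motion_threshold: float  # minimum pooled motion for a patch to count as "moving"
    sfdr_coeff: float        # weight of the feature diversity regularizer

grid = [PretrainConfig(m, t, s)
        for m, t, s in product([0.6, 0.75, 0.9], [0.01, 0.05], [0.1, 1.0])]
print(len(grid), "configurations to sweep")  # 12
```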
axioms (2)
- domain assumption: Latent motion prediction captures semantic surgical structures better than pixel reconstruction
- standard math: V-JEPA joint embedding architecture transfers to surgical video without major modification
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost uniqueness, Aczél classification) · washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "shifts the learning paradigm from pixel-level reconstruction to latent motion prediction... motion-guided latent masked prediction... spatiotemporal affinity self-distillation... spatiotemporal feature diversity regularization (SFDR)"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "Built on the Video Joint Embedding Predictive Architecture (V-JEPA)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., "DINOv2: Learning robust visual features without supervision," arXiv preprint arXiv:2304.07193, 2023.
- [2] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
- [3] C. Feichtenhofer, Y. Li, K. He et al., "Masked autoencoders as spatiotemporal learners," Advances in Neural Information Processing Systems, vol. 35, pp. 35946–35958, 2022.
- [4] D. Batić, F. Holm, E. Özsoy, T. Czempiel, and N. Navab, "EndoViT: pretraining vision transformers on a large collection of endoscopic images," International Journal of Computer Assisted Radiology and Surgery, vol. 19, no. 6, pp. 1085–1091, 2024.
- [5] Z. Wang, C. Liu, S. Zhang, and Q. Dou, "Foundation model for endoscopy video analysis via large-scale self-supervised pre-train," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 101–111.
- [6] S. Schmidgall, J. W. Kim, J. Jopling, and A. Krieger, "General surgery vision transformer: A video pre-trained foundation model for general surgery," arXiv preprint arXiv:2403.05949, 2024.
- [7] Z. Tong, Y. Song, J. Wang, and L. Wang, "VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 10078–10093.
- [8] L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao, "VideoMAE V2: Scaling video masked autoencoders with dual masking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2023, pp. 14549–14560.
- [9] S. Ramesh, V. Srivastav, D. Alapatt, T. Yu, A. Murali, L. Sestini, C. I. Nwoye, I. Hamoud, S. Sharma, A. Fleurentin et al., "Dissecting self-supervised learning methods for surgical computer vision," Medical Image Analysis, vol. 88, p. 102844, 2023.
- [10] A. P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. De Mathelin, and N. Padoy, "EndoNet: a deep architecture for recognition tasks on laparoscopic videos," IEEE Transactions on Medical Imaging, vol. 36, no. 1, pp. 86–97, 2016.
- [11] A. Das, D. Z. Khan, D. Psychogyios, Y. Zhang, J. G. Hanrahan, F. Vasconcelos, Y. Pang, Z. Chen, J. Wu, X. Zou et al., "PitVis-2023 challenge: Workflow recognition in videos of endoscopic pituitary surgery," Medical Image Analysis, p. 103716, 2025.
- [12] R. Fujii, M. Hatano, H. Saito, and H. Kajita, "EgoSurgery-Phase: a dataset of surgical phase recognition from egocentric open surgery videos," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2024, pp. 187–196.
- [13] A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. LeCun, M. Assran, and N. Ballas, "Revisiting feature prediction for learning visual representations from video," arXiv preprint arXiv:2404.08471, 2024.
- [14] M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus et al., "V-JEPA 2: Self-supervised video models enable understanding, prediction and planning," arXiv preprint arXiv:2506.09985, 2025.
- [15] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., "Bootstrap your own latent: A new approach to self-supervised learning," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 21271–21284.
- [16] Y. Wang, K. Li, X. Li, J. Yu, Y. He, G. Chen, B. Pei, R. Zheng, J. Xu, Z. Wang, Y. Shi, T. Jiang, S. Li, H. Zhang, Y. Huang, Y. Qiao, Y. Wang, and L. Wang, "InternVideo2: Scaling video foundation models for multimodal video understanding," in European Conference on Computer Vision (ECCV), 2024.
- [17] C. Wang, K. Li, Y. He, Y. Wang, Z. Yan, J. Yu, Y. Wang, and L. Wang, "InternVideo-Next: Towards general video foundation models without video-text supervision," arXiv preprint arXiv:2512.01342, 2025.
- [18] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, "Emerging properties in self-supervised vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660.
- [19] M. Seitzer et al., "DINOv3," arXiv preprint arXiv:2508.10104, 2025.
- [20] M. R. Jong, T. G. Boers, K. N. Fockens, J. B. Jukema, C. H. Kusters, T. J. Jaspers, R. v. E. van Heslinga, F. C. Slooter, M. R. Struyvenberg, R. Bisschops et al., "GastroNet-5M: A multicenter dataset for developing foundation models in gastrointestinal endoscopy," Gastroenterology, 2025.
- [21] R. Hirsch, M. Caron, R. Cohen, A. Livne, R. Shapiro, T. Golany, R. Goldenberg, D. Freedman, and E. Rivlin, "Self-supervised learning for endoscopic video analysis," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 569–578.
- [22] Q. Tian, H. Liao, X. Huang, B. Yang, D. Lei, S. Ourselin, and H. Liu, "EndoMamba: an efficient foundation model for endoscopic videos via hierarchical pre-training," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2025, pp. 224–234.
- [23] T. J. Jaspers, R. L. de Jong, Y. Li, C. H. Kusters, F. H. Bakker, R. C. van Jaarsveld, G. M. Kuiper, R. van Hillegersberg, J. P. Ruurda, W. M. Brinkman et al., "Scaling up self-supervised learning for improved surgical foundation models," arXiv preprint arXiv:2501.09436, 2025.
- [24] K. Yuan, V. Srivastav, T. Yu, J. L. Lavanchy, J. Marescaux, P. Mascagni, N. Navab, and N. Padoy, "Learning multi-modal representations by watching hundreds of surgical video lectures," Medical Image Analysis, p. 103644, 2025.
- [25] R. Stauder, D. Ostler, M. Kranzfelder, S. Koller, H. Feußner, and N. Navab, "The TUM LapChole dataset for the M2CAI 2016 workflow challenge," arXiv preprint arXiv:1610.09278, 2017.
- [26] C. I. Nwoye, T. Yu, C. Gonzalez, B. Seeliger, P. Mascagni, D. Mutter, J. Marescaux, and N. Padoy, "Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos," Medical Image Analysis, vol. 78, p. 102433, 2022.
- [27] Z. Wang, B. Lu, Y. Long, F. Zhong, T.-H. Cheung, Q. Dou, and Y. Liu, "AutoLaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 486–496.
- [28] D. Guo, W. Si, Z. Li, J. Pei, and P.-A. Heng, "Surgical workflow recognition and blocking effectiveness detection in laparoscopic liver resection with pringle maneuver," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 3220–3228.
- [29] M. Hu, P. Xia, L. Wang, S. Yan, F. Tang, Z. Xu, Y. Luo, K. Song, J. Leitner, X. Cheng et al., "OphNet: A large-scale video benchmark for ophthalmic surgical workflow understanding," in European Conference on Computer Vision (ECCV). Springer, 2024.
- [30] E. D. Goodman, K. K. Patel, Y. Zhang, W. Locke, C. J. Kennedy, R. Mehrotra, S. Ren, M. Guan, O. Zohar, M. Downing et al., "Analyzing surgical technique in diverse open surgical videos with multitask machine learning," JAMA Surgery, 2024.
- [31] N. Ahmidi, L. Tao, S. Sefati, Y. Gao, C. Lea, B. B. Haro, L. Zappella, S. Khudanpur, R. Vidal, and G. D. Hager, "A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery," IEEE Transactions on Biomedical Engineering, vol. 64, no. 9, pp. 2025–2041, 2017.
- [32] H. Hoffmann, I. Funke, P. Peters, D. K. Venkatesh, J. Egger, D. Rivoir, R. Röhrig, F. Hölzle, S. Bodenstedt, M.-C. Willemer, S. Speidel, and B. Puladi, "AIxSuture: vision-based assessment of open suturing skills," International Journal of Computer Assisted Radiology and Surgery, vol. 19, no. 6, pp. 1045–1052, 2024.
- [33] K. Schoeffmann, H. Husslein, S. Kletz, S. Petscharnig, B. Münzer, and C. Beecks, "Video retrieval in laparoscopic video recordings with dynamic content descriptors," Multimedia Tools and Applications, vol. 77, no. 13, pp. 16813–16832, 2018.
- [34] Y. Tian, G. Pang, F. Liu, Y. Liu, C. Wang, Y. Chen, J. Verjans, and G. Carneiro, "Contrastive transformer-based multiple instance learning for weakly supervised polyp frame detection," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 88–98.
- [35] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and F. Vilariño, "WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians," Computerized Medical Imaging and Graphics, vol. 43, pp. 99–111, 2015.
- [36] A. Rau, P. E. Edwards, O. F. Ahmad, P. Riordan, M. Janatka, L. B. Lovat, and D. Stoyanov, "Implicit domain adaptation with conditional generative adversarial networks for depth prediction in endoscopy," International Journal of Computer Assisted Radiology and Surgery, vol. 14, pp. 1167–1176, 2019.
- [37] T. L. Bobrow, M. Golhar, R. Vijayan, V. S. Akshintala, J. R. Garcia, and N. J. Durr, "Colonoscopy 3D video dataset with paired depth from 2D-3D registration," Medical Image Analysis, p. 102956, 2023.
- [38] H. Al Hajj, M. Lamard, P.-H. Conze, B. Cochener, and G. Quellec, "CATARACTS: Challenge on automatic tool annotation for cataract surgery," Medical Image Analysis, vol. 52, pp. 24–41, 2019. Available: https://dx.doi.org/10.21227/ac97-8m18
- [39] J. L. Lavanchy, S. Ramesh, D. Dall'Alba, C. Gonzalez, P. Fiorini, B. P. Müller-Stich, P. C. Nett, J. Marescaux, D. Mutter, and N. Padoy, "Challenges in multi-centric generalization: phase and step recognition in roux-en-y gastric bypass surgery," International Journal of Computer Assisted Radiology and Surgery, vol. 19, pp. 2249–2258, 2024.
- [40] G. Wang, H. Xiao, R. Zhang, H. Gao, L. Bai, X. Yang, Z. Li, H. Li, and H. Ren, "CoPESD: A multi-level surgical motion dataset for training large vision-language models to co-pilot endoscopic submucosal dissection," in Proceedings of the 33rd ACM International Conference on Multimedia, 2024.
- [41] N. Valderrama, O. Zisimopoulos, and S. Giannarou, "Towards holistic surgical scene understanding," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 442–452.
- [42] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, and H. D. Johansen, "Kvasir-SEG: A segmented polyp dataset," in MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II, 2020, pp. 451–462.
- [43] D. Vázquez, J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, A. M. López, A. Romero, M. Drozdzal, and A. Courville, "A benchmark for endoluminal scene segmentation of colonoscopy images," Journal of Healthcare Engineering, vol. 2017, no. 1, p. 4037190, 2017.
- [44] J. Bernal, J. Sánchez, and F. Vilarino, "Towards automatic polyp detection with a polyp appearance model," Pattern Recognition, vol. 45, no. 9, pp. 3166–3182, 2012.
- [45] J. Silva, A. Histace, O. Romain, X. Dray, and B. Granado, "Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer," International Journal of Computer Assisted Radiology and Surgery, vol. 9, no. 2, pp. 283–293, 2014.
- [46] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, "PraNet: Parallel reverse attention network for polyp segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2020, pp. 263–273.
- [47] T. Kim, H. Lee, and D. Kim, "UACANet: Uncertainty augmented context attention for polyp segmentation," in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 2167–2175.
- [48] B.-C. Hu, G.-P. Ji, D. Shao, and D.-P. Fan, "PraNet-V2: Dual-supervised reverse attention for medical image segmentation," arXiv preprint arXiv:2504.10986, 2025.