pith. machine review for the scientific record.

arxiv: 2604.07577 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords surgical video analysis · instrument handover detection · multi-task learning · Vision Transformer · LSTM temporal modeling · event detection · interpretable vision · intraoperative monitoring

The pith

A ViT-LSTM model detects surgical instrument handovers at the event level and classifies transfer direction, extracting discrete events from peaks in a temporal confidence signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Surgical teams exchange instruments frequently during procedures, but tracking these handovers manually from video is tedious and error-prone. The paper builds a model that pulls spatial details from each frame with a Vision Transformer and tracks their evolution over time with a unidirectional LSTM. A single network head predicts both whether a handover is occurring and its direction, producing a smooth confidence curve whose peaks mark individual events. On kidney transplant videos this yields an F1 of 0.84 for detection and 0.72 for direction, beating a single-task version and a VideoMamba baseline on classification while matching detection performance. Layer-CAM maps show the model attends to hand-instrument contact regions, making the output easier to inspect.

Core claim

The authors establish that a unified multi-task ViT-LSTM architecture, trained to output both handover occurrence and direction, generates a temporal confidence signal from which peak detection extracts discrete events. On a kidney transplant dataset the approach reaches an F1-score of 0.84 for detection and a mean F1-score of 0.72 for direction classification, outperforming a single-task ablation and a VideoMamba baseline on direction while retaining comparable detection performance. Layer-CAM attribution maps confirm that decisions rest on hand-instrument interaction cues rather than background clutter.

What carries the argument

Multi-task ViT-LSTM network that produces a temporal handover confidence signal processed by peak detection to isolate events, with joint direction classification and Layer-CAM visualizations of spatial attention.
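
To make the moving parts concrete, here is a minimal sketch of how such a multi-task head could be wired. It is not the authors' code: the timm backbone name, hidden size, and two-way direction head are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a multi-task ViT-LSTM:
# a ViT encodes each sampled frame independently, a unidirectional LSTM
# aggregates the per-frame embeddings, and two linear heads emit a
# handover-occurrence score and direction logits per frame. The timm
# backbone name, hidden size, and class count are illustrative assumptions.
import torch
import torch.nn as nn
import timm


class MultiTaskViTLSTM(nn.Module):
    def __init__(self, hidden: int = 512, n_directions: int = 2):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.vit = timm.create_model("vit_base_patch16_224",
                                     pretrained=True, num_classes=0)
        self.lstm = nn.LSTM(self.vit.num_features, hidden,
                            batch_first=True)           # unidirectional, causal
        self.occurrence = nn.Linear(hidden, 1)          # handover vs. none
        self.direction = nn.Linear(hidden, n_directions)

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.vit(frames.flatten(0, 1)).view(b, t, -1)  # per-frame features
        temporal, _ = self.lstm(feats)                          # temporal aggregation
        conf = torch.sigmoid(self.occurrence(temporal)).squeeze(-1)  # (b, t) signal
        return conf, self.direction(temporal)                   # + direction logits
```

Training would then pair a binary cross-entropy on the confidence signal with a cross-entropy on the direction logits at handover frames; the paper's exact loss weighting and head layout are not given in the material above.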

If this is right

  • Joint training on occurrence and direction avoids error propagation that occurs in cascaded detection-then-classification pipelines.
  • Peak detection applied to the model's temporal signal isolates discrete handover events without excessive false positives on the tested data (a minimal peak-detection sketch follows this list).
  • The model surpasses both a single-task variant and a VideoMamba baseline on direction classification while maintaining detection performance.
  • Layer-CAM attribution maps highlight hand-instrument interaction regions as the primary drivers of model decisions.
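
The peak-detection step referenced above can be illustrated with scipy.signal.find_peaks. The threshold (height) and minimum gap below are assumptions; as the referee report notes, the paper's actual post-processing values are unspecified.

```python
# Illustrative event extraction from the frame-level confidence signal with
# scipy.signal.find_peaks. The threshold (height) and minimum gap between
# peaks are assumptions; the paper's actual values are unspecified.
import numpy as np
from scipy.signal import find_peaks


def extract_events(confidence: np.ndarray, fps: float,
                   height: float = 0.5, min_gap_s: float = 1.0) -> np.ndarray:
    """Return frame indices of detected handover events."""
    peaks, _ = find_peaks(
        confidence,
        height=height,                          # minimum confidence for a peak
        distance=max(1, int(min_gap_s * fps)),  # enforce temporal separation
    )
    return peaks


# Example: a synthetic 10 s signal at 25 fps with two confidence bumps.
t = np.linspace(0, 10, 250)
signal = 0.9 * np.exp(-((t - 3) ** 2) / 0.1) + 0.8 * np.exp(-((t - 7) ** 2) / 0.1)
print(extract_events(signal, fps=25) / 25.0)    # approx. [3.0, 7.0] seconds
```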

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same temporal-signal-plus-peak-detection design could be tested on videos from laparoscopic or robotic procedures to check whether the kidney-transplant patterns transfer.
  • Adding the handover detector to existing surgical-phase recognition systems might produce richer summaries of team coordination during an entire case.
  • Real-time versions could generate live alerts when expected handovers are delayed, potentially supporting OR workflow optimization.
  • The interpretability maps may help trainees review which visual cues they missed when observing recorded handovers (a Layer-CAM sketch follows this list).
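
For the interpretability maps, Layer-CAM (Jiang et al., 2021) weights each activation channel elementwise by its positive gradient and sums over channels. A minimal sketch with PyTorch hooks follows; it assumes `model` returns class logits and that `target_layer` outputs a convolutional feature map of shape (B, C, H, W), so applying it to ViT tokens would additionally require reshaping patches into a spatial grid.

```python
# Minimal Layer-CAM sketch (Jiang et al., 2021), not the authors' code:
# the map is ReLU(sum_k ReLU(dY/dA_k) * A_k), i.e. each activation channel
# is weighted elementwise by its positive gradient. `model` is assumed to
# return class logits and `target_layer` to output (B, C, H, W) features.
import torch
import torch.nn.functional as F


def layer_cam(model, target_layer, frames, class_idx):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    model.zero_grad(set_to_none=True)
    model(frames)[..., class_idx].sum().backward()  # scalar target for backprop
    h1.remove(); h2.remove()
    a, g = acts["a"], grads["g"]                    # (B, C, H, W)
    cam = F.relu((F.relu(g) * a).sum(dim=1))        # positive-gradient weighting
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)  # (B, H, W) in [0, 1]
```

Overlaying the upsampled map on the frame then shows which regions, ideally the hand-instrument contact zone, drove the prediction.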

Load-bearing premise

The patterns of handovers observed in kidney transplant videos are representative enough that the same model will perform well on other types of surgery.

What would settle it

Running the trained model on a held-out set of videos from a different procedure, such as cardiac or orthopedic surgery, and measuring whether both F1 scores stay near the reported values or fall below 0.6 (an event-matching scoring sketch follows).
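
Scoring such a cross-procedure test requires an event-level matching rule. One plausible protocol, an assumption rather than the paper's documented procedure: greedily match each predicted event time to the nearest unmatched ground-truth event within a tolerance window, then compute precision, recall, and F1.

```python
# One plausible event-level scoring protocol (an assumption, not necessarily
# the paper's rule): greedily match each predicted event time to the nearest
# unmatched ground-truth event within a tolerance window, then compute F1.
def event_f1(pred_s, true_s, tol_s: float = 1.0) -> float:
    pred, true = sorted(pred_s), sorted(true_s)
    matched, tp = set(), 0
    for p in pred:
        # nearest still-unmatched ground-truth event within tolerance
        candidates = [(abs(p - g), j) for j, g in enumerate(true)
                      if j not in matched and abs(p - g) <= tol_s]
        if candidates:
            matched.add(min(candidates)[1])
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0


print(event_f1([3.1, 7.2, 9.0], [3.0, 7.0]))  # 0.8: two matches, one false positive
```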

Figures

Figures reproduced from arXiv:2604.07577 by Alexander Ehrenhoefer, Amira Mouakher, David Przewozny, George Zountsas, Igor Maximilian Sauer, Karam Tomotaki-Dawoud, Katerina Katsarou, Paul Chojecki, Sebastian Bosse.

Figure 1. Overview of the proposed architecture. Spatial features are extracted independently from sampled video frames using a ViT […]
Figure 2. Handover detection evaluation approach applied to a […]
Figure 3. Normalized confusion matrices comparing multi-task […]
Figure 4. Layer-CAM explanation maps of the multi-task ViT– […]
Figure 5. Visualized gradient accumulation of the VideoMamba […]
Figure 6. Layer-CAM explanation maps illustrating the contri […]
Original abstract

Reliable monitoring of surgical instrument exchanges is essential for maintaining procedural efficiency and patient safety in the operating room. Automatic detection of instrument handovers in intraoperative video remains challenging due to frequent occlusions, background clutter, and the temporally evolving nature of interaction events. We propose a spatiotemporal vision framework for event-level detection and direction classification of surgical instrument handovers in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional Long Short-Term Memory (LSTM) network for temporal aggregation. A unified multi-task formulation jointly predicts handover occurrence and interaction direction, enabling consistent modeling of transfer dynamics while avoiding error propagation typical of cascaded pipelines. Predicted confidence scores form a temporal signal over the video, from which discrete handover events are identified via peak detection. Experiments on a dataset of kidney transplant procedures demonstrate strong performance, achieving an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming both a single-task variant and a VideoMamba-based baseline for direction prediction while maintaining comparable detection performance. To improve interpretability, we employ Layer-CAM attribution to visualize spatial regions driving model decisions, highlighting hand-instrument interaction cues.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes a multi-task spatiotemporal model using a Vision Transformer (ViT) backbone for spatial feature extraction and a unidirectional LSTM for temporal aggregation to jointly detect surgical instrument handovers and classify their direction in intraoperative videos. Discrete events are extracted from the model's temporal confidence scores via peak detection, and Layer-CAM is employed to visualize spatial attributions. On a kidney transplant procedure dataset, the approach reports an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming a single-task ablation and a VideoMamba baseline for direction while matching detection performance.

Significance. If the empirical results are reproducible and generalize, the work provides a practical pipeline for event-level analysis of instrument interactions in surgical videos, which could support workflow monitoring and safety applications in the operating room. The unified multi-task formulation and post-hoc interpretability via Layer-CAM are constructive elements that avoid cascaded error propagation common in sequential pipelines.

major comments (2)
  1. [Abstract and Experiments] The central performance claims (F1 = 0.84 detection, mean F1 = 0.72 direction) are presented without dataset size, number of videos/procedures, frame counts, class balance, train/validation/test split ratios, or any statistical significance testing (e.g., p-values or confidence intervals) for the reported outperformance over baselines. These details are load-bearing for verifying the claims and assessing generalizability beyond the single kidney-transplant domain; a bootstrap sketch of such a significance test follows this report.
  2. [Methods] The peak-detection step that converts the continuous temporal confidence signal into discrete handover events lacks specification of key parameters (threshold, minimum distance, or window size) and any sensitivity analysis, making it impossible to determine whether the reported F1 scores depend on post-processing choices that could introduce false positives or missed events.
minor comments (3)
  1. [Methods] Clarify whether the LSTM is truly unidirectional as stated or if bidirectional processing was considered, and ensure all model hyperparameters (learning rate, hidden size, ViT patch size) are listed with values used in the experiments.
  2. [Abstract] The abstract uses the phrase 'strong performance' without quantitative context relative to the baselines; consider replacing with direct comparison phrasing.
  3. [Discussion] Add a limitations paragraph discussing the single-procedure dataset and potential domain shift to other surgeries.
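
The significance testing requested in major comment 1 could take the form of a paired percentile bootstrap over videos. A minimal sketch with hypothetical per-video F1 scores, not figures from the paper:

```python
# Sketch of the requested significance testing: a paired percentile
# bootstrap over videos for the F1 gap between the multi-task model and a
# baseline. The per-video scores below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
f1_multi = np.array([0.86, 0.81, 0.88, 0.79, 0.85])     # hypothetical
f1_baseline = np.array([0.84, 0.78, 0.85, 0.80, 0.82])  # hypothetical

diffs = []
for _ in range(10_000):
    idx = rng.integers(0, len(f1_multi), len(f1_multi))  # resample videos
    diffs.append(f1_multi[idx].mean() - f1_baseline[idx].mean())
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the F1 gain: [{lo:.3f}, {hi:.3f}]")
```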

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on reproducibility and methodological clarity. We address each major comment below and will incorporate the requested details into the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The central performance claims (F1 = 0.84 detection, mean F1 = 0.72 direction) are presented without dataset size, number of videos/procedures, frame counts, class balance, train/validation/test split ratios, or any statistical significance testing (e.g., p-values or confidence intervals) for the reported outperformance over baselines. These details are load-bearing for verifying the claims and assessing generalizability beyond the single kidney-transplant domain.

    Authors: We agree these details are essential for reproducibility and evaluation of generalizability. In the revised manuscript we will expand the Experiments section (and update the abstract if space permits) to report the full dataset composition, including the number of videos and procedures, total frame count, class balance for handover and direction labels, the train/validation/test split ratios, and statistical measures such as confidence intervals or p-values for the comparisons against the single-task ablation and VideoMamba baseline. revision: yes

  2. Referee: [Methods] The peak-detection step that converts the continuous temporal confidence signal into discrete handover events lacks specification of key parameters (threshold, minimum distance, or window size) and any sensitivity analysis, making it impossible to determine whether the reported F1 scores depend on post-processing choices that could introduce false positives or missed events.

    Authors: We acknowledge that the peak-detection parameters were not specified. In the revised Methods section we will explicitly state the threshold, minimum peak distance, and window size used for event extraction, and we will add a sensitivity analysis showing the effect of reasonable variations in these hyperparameters on the final F1 scores for both detection and direction classification. revision: yes
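
The sensitivity analysis promised in response 2 could be a small grid sweep over the peak-detection hyperparameters, reusing the hypothetical extract_events and event_f1 helpers sketched earlier; the grid values are illustrative, not the authors'.

```python
# Sketch of the promised sensitivity analysis: sweep the peak-detection
# threshold and minimum gap, scoring each setting at the event level. Reuses
# the hypothetical extract_events and event_f1 helpers sketched earlier.
def sensitivity_grid(confidence, true_events_s, fps):
    for height in (0.3, 0.5, 0.7):
        for min_gap_s in (0.5, 1.0, 2.0):
            pred_s = extract_events(confidence, fps, height, min_gap_s) / fps
            print(f"height={height:.1f} gap={min_gap_s:.1f}s "
                  f"F1={event_f1(list(pred_s), true_events_s):.2f}")
```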

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes a standard empirical ML pipeline: a ViT+LSTM multi-task model trained on a kidney-transplant video dataset, with post-hoc peak detection on confidence scores to identify events, evaluated via F1 metrics against ablations and a VideoMamba baseline. No equations, derivations, or self-referential definitions appear; reported performance numbers are direct outputs of training and evaluation rather than quantities forced by construction from fitted inputs or prior self-citations. The methodology is self-contained against external benchmarks with no load-bearing self-citation chains or ansatzes.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions plus the premise that joint training on occurrence and direction improves consistency; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • Model hyperparameters (learning rate, LSTM hidden size, ViT patch size, etc.)
    Tuned during training to achieve reported F1 scores; typical for deep networks.
axioms (2)
  • domain assumption: Multi-task formulation prevents error propagation between detection and direction subtasks
    Invoked to justify the unified prediction head.
  • domain assumption: Peak detection on the temporal confidence signal isolates true handover events
    Used to convert continuous scores into discrete events.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

