pith. machine review for the scientific record.

arxiv: 2604.07577 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords surgical video analysis · instrument handover detection · multi-task learning · Vision Transformer · LSTM temporal modeling · event detection · interpretable vision · intraoperative monitoring

The pith

A ViT-LSTM model detects surgical instrument handovers at the event level and classifies transfer direction, extracting discrete events from peaks in a temporal confidence signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Surgical teams exchange instruments frequently during procedures, but tracking these handovers manually from video is tedious and error-prone. The paper builds a model that pulls spatial details from each frame with a Vision Transformer and tracks their evolution over time with a unidirectional LSTM. A single network head predicts both whether a handover is occurring and its direction, producing a smooth confidence curve whose peaks mark individual events. On kidney transplant videos this yields an F1 of 0.84 for detection and 0.72 for direction, beating a single-task version and a VideoMamba baseline on classification while matching detection performance. Layer-CAM maps show the model attends to hand-instrument contact regions, making the output easier to inspect.

Core claim

The authors establish that a unified multi-task ViT-LSTM architecture, trained to output both handover occurrence and direction, generates a temporal confidence signal from which peak detection extracts discrete events. On a kidney transplant dataset the approach reaches an F1-score of 0.84 for detection and a mean F1-score of 0.72 for direction classification, outperforming a single-task ablation and a VideoMamba baseline on direction while retaining comparable detection performance. Layer-CAM attribution maps confirm that decisions rest on hand-instrument interaction cues rather than background clutter.

What carries the argument

Multi-task ViT-LSTM network that produces a temporal handover confidence signal processed by peak detection to isolate events, with joint direction classification and Layer-CAM visualizations of spatial attention.
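
To make the moving parts concrete, here is a minimal sketch of how such a multi-task head could be wired. It is not the authors' code: the timm backbone name, hidden size, and two-way direction head are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a multi-task ViT-LSTM:
# a ViT encodes each sampled frame independently, a unidirectional LSTM
# aggregates the per-frame embeddings, and two linear heads emit a
# handover-occurrence score and direction logits per frame. The timm
# backbone name, hidden size, and class count are illustrative assumptions.
import torch
import torch.nn as nn
import timm


class MultiTaskViTLSTM(nn.Module):
    def __init__(self, hidden: int = 512, n_directions: int = 2):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.vit = timm.create_model("vit_base_patch16_224",
                                     pretrained=True, num_classes=0)
        self.lstm = nn.LSTM(self.vit.num_features, hidden,
                            batch_first=True)           # unidirectional, causal
        self.occurrence = nn.Linear(hidden, 1)          # handover vs. none
        self.direction = nn.Linear(hidden, n_directions)

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.vit(frames.flatten(0, 1)).view(b, t, -1)  # per-frame features
        temporal, _ = self.lstm(feats)                          # temporal aggregation
        conf = torch.sigmoid(self.occurrence(temporal)).squeeze(-1)  # (b, t) signal
        return conf, self.direction(temporal)                   # + direction logits
```

Training would then pair a binary cross-entropy on the confidence signal with a cross-entropy on the direction logits at handover frames; the paper's exact loss weighting and head layout are not given in the material above.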

If this is right

  • Joint training on occurrence and direction avoids error propagation that occurs in cascaded detection-then-classification pipelines.
  • Peak detection applied to the model's temporal signal isolates discrete handover events without excessive false positives on the tested data (a minimal peak-detection sketch follows this list).
  • The model surpasses both a single-task variant and a VideoMamba baseline on direction classification while maintaining detection performance.
  • Layer-CAM attribution maps highlight hand-instrument interaction regions as the primary drivers of model decisions.
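
The peak-detection step referenced above can be illustrated with scipy.signal.find_peaks. The threshold (height) and minimum gap below are assumptions; as the referee report notes, the paper's actual post-processing values are unspecified.

```python
# Illustrative event extraction from the frame-level confidence signal with
# scipy.signal.find_peaks. The threshold (height) and minimum gap between
# peaks are assumptions; the paper's actual values are unspecified.
import numpy as np
from scipy.signal import find_peaks


def extract_events(confidence: np.ndarray, fps: float,
                   height: float = 0.5, min_gap_s: float = 1.0) -> np.ndarray:
    """Return frame indices of detected handover events."""
    peaks, _ = find_peaks(
        confidence,
        height=height,                          # minimum confidence for a peak
        distance=max(1, int(min_gap_s * fps)),  # enforce temporal separation
    )
    return peaks


# Example: a synthetic 10 s signal at 25 fps with two confidence bumps.
t = np.linspace(0, 10, 250)
signal = 0.9 * np.exp(-((t - 3) ** 2) / 0.1) + 0.8 * np.exp(-((t - 7) ** 2) / 0.1)
print(extract_events(signal, fps=25) / 25.0)    # approx. [3.0, 7.0] seconds
```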

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same temporal-signal-plus-peak-detection design could be tested on videos from laparoscopic or robotic procedures to check whether the kidney-transplant patterns transfer.
  • Adding the handover detector to existing surgical-phase recognition systems might produce richer summaries of team coordination during an entire case.
  • Real-time versions could generate live alerts when expected handovers are delayed, potentially supporting OR workflow optimization.
  • The interpretability maps may help trainees review which visual cues they missed when observing recorded handovers (a Layer-CAM sketch follows this list).
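
For the interpretability maps, Layer-CAM (Jiang et al., 2021) weights each activation channel elementwise by its positive gradient and sums over channels. A minimal sketch with PyTorch hooks follows; it assumes `model` returns class logits and that `target_layer` outputs a convolutional feature map of shape (B, C, H, W), so applying it to ViT tokens would additionally require reshaping patches into a spatial grid.

```python
# Minimal Layer-CAM sketch (Jiang et al., 2021), not the authors' code:
# the map is ReLU(sum_k ReLU(dY/dA_k) * A_k), i.e. each activation channel
# is weighted elementwise by its positive gradient. `model` is assumed to
# return class logits and `target_layer` to output (B, C, H, W) features.
import torch
import torch.nn.functional as F


def layer_cam(model, target_layer, frames, class_idx):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    model.zero_grad(set_to_none=True)
    model(frames)[..., class_idx].sum().backward()  # scalar target for backprop
    h1.remove(); h2.remove()
    a, g = acts["a"], grads["g"]                    # (B, C, H, W)
    cam = F.relu((F.relu(g) * a).sum(dim=1))        # positive-gradient weighting
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)  # (B, H, W) in [0, 1]
```

Overlaying the upsampled map on the frame then shows which regions, ideally the hand-instrument contact zone, drove the prediction.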

Load-bearing premise

The patterns of handovers observed in kidney transplant videos are representative enough that the same model will perform well on other types of surgery.

What would settle it

Running the trained model on a held-out set of videos from a different procedure, such as cardiac or orthopedic surgery, and measuring whether both F1 scores stay near the reported values or fall below 0.6 (an event-matching scoring sketch follows).
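
Scoring such a cross-procedure test requires an event-level matching rule. One plausible protocol, an assumption rather than the paper's documented procedure: greedily match each predicted event time to the nearest unmatched ground-truth event within a tolerance window, then compute precision, recall, and F1.

```python
# One plausible event-level scoring protocol (an assumption, not necessarily
# the paper's rule): greedily match each predicted event time to the nearest
# unmatched ground-truth event within a tolerance window, then compute F1.
def event_f1(pred_s, true_s, tol_s: float = 1.0) -> float:
    pred, true = sorted(pred_s), sorted(true_s)
    matched, tp = set(), 0
    for p in pred:
        # nearest still-unmatched ground-truth event within tolerance
        candidates = [(abs(p - g), j) for j, g in enumerate(true)
                      if j not in matched and abs(p - g) <= tol_s]
        if candidates:
            matched.add(min(candidates)[1])
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0


print(event_f1([3.1, 7.2, 9.0], [3.0, 7.0]))  # 0.8: two matches, one false positive
```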

Figures

Figures reproduced from arXiv:2604.07577 by Alexander Ehrenhoefer, Amira Mouakher, David Przewozny, George Zountsas, Igor Maximilian Sauer, Karam Tomotaki-Dawoud, Katerina Katsarou, Paul Chojecki, Sebastian Bosse.

Figure 1. Overview of the proposed architecture. Spatial features are extracted independently from sampled video frames using a ViT […]
Figure 2. Handover detection evaluation approach applied to a […]
Figure 3. Normalized confusion matrices comparing multi-task […]
Figure 4. Layer-CAM explanation maps of the multi-task ViT– […]
Figure 5. Visualized gradient accumulation of the VideoMamba […]
Figure 6. Layer-CAM explanation maps illustrating the contri […]
Original abstract

Reliable monitoring of surgical instrument exchanges is essential for maintaining procedural efficiency and patient safety in the operating room. Automatic detection of instrument handovers in intraoperative video remains challenging due to frequent occlusions, background clutter, and the temporally evolving nature of interaction events. We propose a spatiotemporal vision framework for event-level detection and direction classification of surgical instrument handovers in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional Long Short-Term Memory (LSTM) network for temporal aggregation. A unified multi-task formulation jointly predicts handover occurrence and interaction direction, enabling consistent modeling of transfer dynamics while avoiding error propagation typical of cascaded pipelines. Predicted confidence scores form a temporal signal over the video, from which discrete handover events are identified via peak detection. Experiments on a dataset of kidney transplant procedures demonstrate strong performance, achieving an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming both a single-task variant and a VideoMamba-based baseline for direction prediction while maintaining comparable detection performance. To improve interpretability, we employ Layer-CAM attribution to visualize spatial regions driving model decisions, highlighting hand-instrument interaction cues.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes a multi-task spatiotemporal model using a Vision Transformer (ViT) backbone for spatial feature extraction and a unidirectional LSTM for temporal aggregation to jointly detect surgical instrument handovers and classify their direction in intraoperative videos. Discrete events are extracted from the model's temporal confidence scores via peak detection, and Layer-CAM is employed to visualize spatial attributions. On a kidney transplant procedure dataset, the approach reports an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming a single-task ablation and a VideoMamba baseline for direction while matching detection performance.

Significance. If the empirical results are reproducible and generalize, the work provides a practical pipeline for event-level analysis of instrument interactions in surgical videos, which could support workflow monitoring and safety applications in the operating room. The unified multi-task formulation and post-hoc interpretability via Layer-CAM are constructive elements that avoid cascaded error propagation common in sequential pipelines.

major comments (2)
  1. [Abstract and Experiments] The central performance claims (F1 = 0.84 detection, mean F1 = 0.72 direction) are presented without dataset size, number of videos/procedures, frame counts, class balance, train/validation/test split ratios, or any statistical significance testing (e.g., p-values or confidence intervals) for the reported outperformance over baselines. These details are load-bearing for verifying the claims and assessing generalizability beyond the single kidney-transplant domain; a bootstrap sketch of such a significance test follows this report.
  2. [Methods] The peak-detection step that converts the continuous temporal confidence signal into discrete handover events lacks specification of key parameters (threshold, minimum distance, or window size) and any sensitivity analysis, making it impossible to determine whether the reported F1 scores depend on post-processing choices that could introduce false positives or missed events.
minor comments (3)
  1. [Methods] Clarify whether the LSTM is truly unidirectional as stated or if bidirectional processing was considered, and ensure all model hyperparameters (learning rate, hidden size, ViT patch size) are listed with values used in the experiments.
  2. [Abstract] The abstract uses the phrase 'strong performance' without quantitative context relative to the baselines; consider replacing with direct comparison phrasing.
  3. [Discussion] Add a limitations paragraph discussing the single-procedure dataset and potential domain shift to other surgeries.
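
The significance testing requested in major comment 1 could take the form of a paired percentile bootstrap over videos. A minimal sketch with hypothetical per-video F1 scores, not figures from the paper:

```python
# Sketch of the requested significance testing: a paired percentile
# bootstrap over videos for the F1 gap between the multi-task model and a
# baseline. The per-video scores below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
f1_multi = np.array([0.86, 0.81, 0.88, 0.79, 0.85])     # hypothetical
f1_baseline = np.array([0.84, 0.78, 0.85, 0.80, 0.82])  # hypothetical

diffs = []
for _ in range(10_000):
    idx = rng.integers(0, len(f1_multi), len(f1_multi))  # resample videos
    diffs.append(f1_multi[idx].mean() - f1_baseline[idx].mean())
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the F1 gain: [{lo:.3f}, {hi:.3f}]")
```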

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on reproducibility and methodological clarity. We address each major comment below and will incorporate the requested details into the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The central performance claims (F1 = 0.84 detection, mean F1 = 0.72 direction) are presented without dataset size, number of videos/procedures, frame counts, class balance, train/validation/test split ratios, or any statistical significance testing (e.g., p-values or confidence intervals) for the reported outperformance over baselines. These details are load-bearing for verifying the claims and assessing generalizability beyond the single kidney-transplant domain.

    Authors: We agree these details are essential for reproducibility and evaluation of generalizability. In the revised manuscript we will expand the Experiments section (and update the abstract if space permits) to report the full dataset composition, including the number of videos and procedures, total frame count, class balance for handover and direction labels, the train/validation/test split ratios, and statistical measures such as confidence intervals or p-values for the comparisons against the single-task ablation and VideoMamba baseline. revision: yes

  2. Referee: [Methods] The peak-detection step that converts the continuous temporal confidence signal into discrete handover events lacks specification of key parameters (threshold, minimum distance, or window size) and any sensitivity analysis, making it impossible to determine whether the reported F1 scores depend on post-processing choices that could introduce false positives or missed events.

    Authors: We acknowledge that the peak-detection parameters were not specified. In the revised Methods section we will explicitly state the threshold, minimum peak distance, and window size used for event extraction, and we will add a sensitivity analysis showing the effect of reasonable variations in these hyperparameters on the final F1 scores for both detection and direction classification. revision: yes
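
The sensitivity analysis promised in response 2 could be a small grid sweep over the peak-detection hyperparameters, reusing the hypothetical extract_events and event_f1 helpers sketched earlier; the grid values are illustrative, not the authors'.

```python
# Sketch of the promised sensitivity analysis: sweep the peak-detection
# threshold and minimum gap, scoring each setting at the event level. Reuses
# the hypothetical extract_events and event_f1 helpers sketched earlier.
def sensitivity_grid(confidence, true_events_s, fps):
    for height in (0.3, 0.5, 0.7):
        for min_gap_s in (0.5, 1.0, 2.0):
            pred_s = extract_events(confidence, fps, height, min_gap_s) / fps
            print(f"height={height:.1f} gap={min_gap_s:.1f}s "
                  f"F1={event_f1(list(pred_s), true_events_s):.2f}")
```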

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes a standard empirical ML pipeline: a ViT+LSTM multi-task model trained on a kidney-transplant video dataset, with post-hoc peak detection on confidence scores to identify events, evaluated via F1 metrics against ablations and a VideoMamba baseline. No equations, derivations, or self-referential definitions appear; reported performance numbers are direct outputs of training and evaluation rather than quantities forced by construction from fitted inputs or prior self-citations. The methodology is self-contained against external benchmarks with no load-bearing self-citation chains or ansatzes.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions plus the premise that joint training on occurrence and direction improves consistency; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • Model hyperparameters (learning rate, LSTM hidden size, ViT patch size, etc.)
    Tuned during training to achieve reported F1 scores; typical for deep networks.
axioms (2)
  • domain assumption: Multi-task formulation prevents error propagation between detection and direction subtasks
    Invoked to justify the unified prediction head.
  • domain assumption: Peak detection on the temporal confidence signal isolates true handover events
    Used to convert continuous scores into discrete events.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

