Intuitive Surgical SurgToolLoc and SurgVU Challenges Results: 2022-2025

Achita Chitrapan; Adnan Qayyum; Aleksandr Matsun; Amine Yamlahi; Amir M. Hajiyavand; André Ferreira; Andrew D. Beggs; Aneeq Zia; Ange Lou; Anh Quoc Nguyen

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.


T0 review · grok-4.3

The SurgToolLoc and SurgVU challenges document AI performance in surgical tool localization and visual understanding from 2022 to 2025.

2026-05-24 08:37 UTC pith:AGVRJTZW

load-bearing objection This is a challenge results report with no new scientific claims or methods.

arxiv 2305.07152 v4 pith:AGVRJTZW submitted 2023-05-11 cs.CV


Aneeq Zia, Max Berniker, Rogerio Garcia Nespolo, Xiaorui Zhang, Conor Perreault, Kiran Bhattacharyya, Xi Liu, Ziheng Wang,


Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Bo Liu, David Austin, Yiheng Wang, Michal Futrega, Jean-Francois Puget, Zhenqiang Li, Yoichi Sato, Ryo Fujii, Ryo Hachiuma, Mana Masuda, Hideo Saito, An Wang, Mengya Xu, Mobarakol Islam, Long Bai, Winnie Pang, Hongliang Ren, Chinedu Nwoye, Luca Sestini, Nicolas Padoy, Maximilian Nielsen, Samuel Schüttler, Thilo Sentker, Hümeyra Husseini, Ivo Baltruschat, Rüdiger Schmitz, René Werner, Aleksandr Matsun, Mugariya Farooq, Numan Saaed, Jose Renato Restom Viera, Mohammad Yaqub, Neil Getty, Fangfang Xia, Zixuan Zhao, Xiaotian Duan, Xing Yao, Ange Lou, Hao Yang, Jintong Han, Jack Noble, Jie Ying Wu, Tamer Abdulbaki Alshirbaji, Nour Aldeen Jalal, Herag Arabian, Ning Ding, Knut Moeller, Weiliang Chen, Quan He, Muhammad Bilal, Taofeek Akinosho, Adnan Qayyum, Massimo Caputo, Hunaid Vohra, Michael Loizou, Anuoluwapo Ajayi, Ilhem Berrou, Faatihah Niyi-Odumosu, Charlie Budd, Oluwatosin Alabi, Tom Vercauteren, Ruoxi Zhao, Ayberk Acar, John Han, Jumanh Atoum, Yinhong Qin, Surong Hua, Lu Ping, Wenming Wu, Rongfeng Wei, Jinlin Wu, You Pang, Zhen Chen, Tim Jaspers, Amine Yamlahi, Piotr Kalinowski, Dominik Michael, Tim Rädsch, Marco Hübner, Danail Stoyanov, Stefanie Speidel, Lena Maier-Hein, Jie Tian, Ruxin Zhang, Khang Hoang Nguyen, Anh Quoc Nguyen, Tam Minh Nguyen, Khoi Dinh Tran, Minh Nguyen Dang Nhat, Trinh Thi Doan Pham, Linh Van Nguyen, Chunyang Jiang, Dewei Yang, Haitao Li, Yannick Prudent, Thibaut Boissin, Mahmood Alam, Shazad Ashraf, Andrew D. Beggs, Lukman Akanbi, Manuel D. Delgado, Narain Gupta, Amir M. Hajiyavand, Iqbal Qasim, Hafiz A. Alaka, Junaid Qadir, Shu Yang, Yihui Wang, Hao Chen, Shin Paul, Yosuke Yamagishi, Zhang Dong, Hongyun Li, Hongyu Gu, Xiaoliu Ding, Xiaoyao Liu, Xingyu Zhao, Mariana Ribeiro, Tiago Jesus, André Ferreira, Guilherme Barbosa, João Carvalho, Leonardo Barroso, Nuno Gomes, Rafael Peixoto, Rodrigo Ralha, Victor Alves, Stephanie, Nattapat Ittikosil, Achita Chitrapan, Quan Huu Cap, Jiayuan Huang, Shreyas C Dhake, Sergi Kavtaradze, Mobarak I Hoque, Ka Young Kim, Su Yong Yun, Young Tae Kim, Hyeon Bae Kim, Seong Tae Kim, Zuxing Deng, Ling Li, Jieyu Zheng, Xiaojian Li, Anthony Jarc


classification cs.CV

keywords surgical tool localization, surgical visual understanding, robotic surgery, machine learning, challenge results, computer vision


The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reports the results of a series of machine learning challenges organized by Intuitive Surgical at the MICCAI conference. The challenges focus on localizing surgical tools and understanding visual scenes in robotic assisted surgery using publicly released datasets. A sympathetic reader cares because these results provide concrete benchmarks that show how well current algorithms handle the visual demands of surgery and what remains to be solved for clinical use.

Core claim

The authors document the winning methods and their performance scores in the SurgToolLoc challenge for tool localization and the SurgVU challenge for surgical visual understanding over the years 2022 to 2025.

What carries the argument

The evaluation on the SurgToolLoc and SurgVU datasets using metrics for localization accuracy and visual understanding tasks.

Load-bearing premise

The challenge tasks and datasets provide a meaningful proxy for real clinical robotic surgery scenarios that will translate to improved patient outcomes.

What would settle it

Deploying the top challenge models in live robotic surgeries and comparing error rates or patient outcomes to standard procedures without AI assistance.


If this is right

Future research can build on the top performing methods as baselines.
The released dataset enables standardized comparisons in surgical data science.
High performance in these tasks suggests AI is approaching usability for assisting in robotic procedures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Successful methods from these challenges could be integrated into robotic systems to provide real-time feedback to surgeons.
Extending the challenges to include more varied surgical procedures might reveal gaps in current models.
The results imply that visual understanding in surgery is solvable with existing computer vision techniques when given appropriate training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

0 major / 3 minor

Summary. The manuscript documents the results of the Intuitive Surgical SurgToolLoc (surgical tool localization) and SurgVU (surgical visual understanding) challenges hosted annually at MICCAI from 2022 to 2025. It reports participation numbers, submitted methods, and performance metrics across the challenge tasks, while referring readers to a companion paper (arXiv:2501.09209) for details on the released datasets.

Significance. If the reported outcomes are accurate, the paper provides a useful archival record of community progress on standardized benchmarks in robotic surgery computer vision. Multi-year documentation of this form can help the field track incremental improvements in tool detection and scene understanding, and the public datasets enable follow-on reproducible work.

minor comments (3)

The abstract states the purpose but contains no numerical results or key findings; the main text should include a concise summary table of top-performing methods and metrics in the introduction or a dedicated results overview section for quick reference.
Ensure that all challenge years (2022–2025) are covered with consistent reporting of task definitions, evaluation metrics, and participation statistics; any year-to-year changes in rules or data should be explicitly tabulated.
Clarify the relationship to the companion dataset paper: add a short paragraph in the introduction that distinguishes what is new in this results report versus what is already described in arXiv:2501.09209.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation to accept the manuscript. The assessment correctly identifies the value of a multi-year archival record of the SurgToolLoc and SurgVU challenges.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is a factual report documenting the results of the SurgToolLoc and SurgVU challenges from 2022-2025. It contains no mathematical derivations, predictions, fitted parameters, or generalization claims. The central claim is confined to reporting participation, methods, and numerical outcomes from external submissions, with the dataset referenced to a separate paper. No load-bearing self-citations, self-definitional steps, or reductions of outputs to inputs by construction are present. The derivation chain is empty by the nature of the document.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced or required by the abstract content.



Original abstract

Robotic assisted (RA) surgery promises to transform surgical intervention. Intuitive Surgical is committed to fostering these changes and the machine learning models and algorithms that will enable them. With these goals in mind we have invited the surgical data science community to participate in a yearly competition hosted through the Medical Imaging Computing and Computer Assisted Interventions (MICCAI) conference. With varying changes from year to year, we have challenged the community to solve difficult machine learning problems in the context of advanced RA applications. Here we document the results of these challenges, focusing on surgical tool localization (SurgToolLoc) and surgical visual understanding (SurgVU). The publicly released dataset that accompanies these challenges is detailed in a separate paper arXiv:2501.09209 [1].

Figures

Figures reproduced from arXiv: 2305.07152 by the authors listed above.

**Figure 2.** Sample frames with presence labels (left) and a snapshot of the labels CSV file (right)

**Figure 3.** Sample frames with testing labels. The UI interface was blurred to avoid its embedded information

**Figure 4.** HRI MV: The information required for detection and classification is different. Thus, the ROI proposal box output by RPN network was modified

**Figure 5.** HRI MV: ROI expand makes model identification device

The detection frame output from the previous frame algorithm was used for target tracking and integrated into the results of target detection and target tracking via weighting. In this way, as long as the device appears in a certain frame, it can be identified by target tracking in its subsequent frames, solving the problem of target missing detection. T…

**Figure 6.** HRI MV: model ensemble

**Figure 7.** HKMV: Primary training dataset

The clevis part in the mask label of the dataset was converted into a bounding box label and used as the primary training dataset. Since endovis17 and endovis18 do not include all 14 surgical tools, part of the data was added to the training dataset to ensure that the dataset contains all 14 surgical tools. The final set contained 5212 images. Surgical Tool Localization Algor…

**Figure 8.** HKMV: Model architecture

In order to increase the size of the dataset, the trained model was used to infer the images of the competition dataset, adding these to the training dataset. A score threshold of 0.7 was employed to filter out the poor bounding box. After several operations, the dataset was expanded to 11035 images. The dataset expansion flow diagram of the algorithm is described in …
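The dataset expansion step described in the text above can be sketched as a simple confidence filter. This is a minimal illustration only: the `Detection` record, its field names, the helper names, and the choice of `>=` for the comparison are all assumptions made for the example. Only the 0.7 threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One predicted box. Field names are hypothetical, not from the paper."""
    image_id: str
    label: str
    score: float
    box: tuple  # (x_min, y_min, x_max, y_max)


def filter_pseudo_labels(detections, score_threshold=0.7):
    """Keep only predictions whose confidence reaches the threshold."""
    return [d for d in detections if d.score >= score_threshold]


def expand_dataset(labeled, detections, score_threshold=0.7):
    """Return the labeled set plus the confident pseudo-labels."""
    return list(labeled) + filter_pseudo_labels(detections, score_threshold)
```

Repeating `expand_dataset` after each retraining round would correspond to the "several operations" mentioned in the text, though the number of rounds is not stated there.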

**Figure 9.** HKMV: Data expansion scheme

3.5 NVIDIA - Team NVIDIA

Team Members: Bo Liu, David Austin, Yiheng Wang, Michal Futrega, Jean-Francois Puget

This team consists of five NVIDIA employees. Three of them (Jean-Francois, David, and Bo) are members of the Kaggle Grandmasters team, with extensive experience in computer vision machine learning competitions. One member (Yiheng) works on MONAI, an open-source framewor…

**Figure 10.** NVIDIA: Team NVIDIA Category 2 workflow

3.5.2 Model Training

For Category 1, they trained 5 EfficientNet-B4 models (on 5 fold splits) and 5 ConvNext-tiny models and ensembled them using a trick called logit shift. The idea of the logit shift trick is that, when data is extremely imbalanced between classes as in this dataset, the minor classes’ probabilities are extremely biased towards 0. The extent of th…
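The logit shift idea can be illustrated with a small sketch. The source text is truncated before it explains how the shift is chosen, so everything specific here is an assumption: the function names, the choice to average logits across models, and the per-class offsets passed in by the caller. It should be read as one plausible form of the trick, not as the team's implementation.

```python
import math


def sigmoid(x):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p, eps=1e-7):
    """Map a probability to a logit, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def ensemble_with_logit_shift(model_probs, shifts):
    """Average per-class logits over models, then add a per-class offset.

    model_probs: one list of per-class probabilities per model.
    shifts: per-class additive offsets in logit space (hypothetical values;
            the source does not say how they are derived).
    """
    n_models = len(model_probs)
    shifted = []
    for c, shift in enumerate(shifts):
        mean_logit = sum(logit(m[c]) for m in model_probs) / n_models
        shifted.append(sigmoid(mean_logit + shift))
    return shifted
```

A positive offset raises the probability of a rare class that the models systematically push toward 0, while a zero offset leaves the ensemble average unchanged.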

**Figure 11.** ANL-Surg: Top right frame is the result of the parts detection model. Bottom left is the

**Figure 12.** HVRL: Examples of preprocessing. The original image included the black region on the left and

**Figure 13.** SK: Overview of our proposed method.

The augmentation techniques used were horizontal flip, shift, scale, rotation, color jitter, Gaussian blur, and Gaussian noise. The augmented images were resized to 640 × 480 pixels. The employed optimization method was Adam, and the initial learning rate was set to 1.0 × 10⁻⁵ changing at every epoch with cosine annealing. The cross-entropy loss was used as the loss fu…
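The learning-rate schedule mentioned above can be written out as a short sketch. It uses the standard cosine annealing formula with the stated initial rate of 1.0 × 10⁻⁵. The total number of epochs and the minimum rate are not given in the text, so `total_epochs` and `min_lr` are assumptions, as is the function name.

```python
import math


def cosine_annealed_lr(epoch, total_epochs, initial_lr=1.0e-5, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing.

    The rate starts at initial_lr at epoch 0 and decays to min_lr at
    epoch == total_epochs, following half a cosine period.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Calling `cosine_annealed_lr(epoch, total_epochs)` once per epoch and assigning the result to the optimizer matches the "changing at every epoch" wording.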

**Figure 14.** VANDY-VISE: Architecture of Query2Label [42]

**Figure 15.** UKE: DINO-based self-distillation. Global and local crops are extracted from the input frame,

**Figure 16.** ITeM: The complete pipeline of the proposed model for tool presence detection and localiza

**Figure 17.** ITeM: Model performance on the validation data.

**Figure 18.** MM: R50+ViT-B 16 hybrid model

**Figure 19.** MM: The training process.

3.12.3 Preliminary Performance

An mAP of 0.86 was achieved on the validation dataset

**Figure 20.** BioMedIA: Frequency of each tool across the whole dataset. The graph illustrates the frequent

**Figure 21.** BioMedIA: Surgical instruments present in Cholec80 Dataset. The tools are similar to the tools

**Figure 22.** BioMedIA: Surgical instruments present in M2CAI Dataset. Some of the tools present are similar

**Figure 23.** BioMedIA: Architecture of the two-tier model used for classification. Images are passed through

**Figure 24.** WhiteBox: The tool presence classification framework used by WhiteBox team.

**Figure 25.** CAMMA: Architecture of the spatial attention network (SANet) for surgical tool presence detec

**Figure 26.** TeamZero: Proposed methodology for robust detection of surgical tools with noisy labels.

**Figure 27.** TeamZero: Examples of noisy labels (a) and cropped Images using Segmentation Learner (b).

**Figure 28.** TeamZero: Learning rate chart showing 5e-3 where ViT model can learn the most on the given

**Figure 29.** TeamZero: Learning rate chart showing 5e-3 where ViT model can learn the most on the given

**Figure 30.** Class wise distribution for the original SurgToolLoc labels and the generated pseudo labels. The

**Figure 33.** Confusion matrix for all classes.

4.5 MapleLab

Team Members: John Han, Ayberk Acar, Jumanh Atoum, Yinhong Qin, Jie Ying Wu

**Figure 34.** A fine-tuned ResNet-101 model takes the concatenated 7-channel image as input and outputs the

**Figure 35.** In the final reiterated inference stage, predictions are made on each bounding box via masking

**Figure 36.** Qualitative Results

4.5.2 Model Training

To improve the quality of training data, a morphological opening and closing was applied to the segmentation mask to remove noise and close holes after running TernausNet. The images with poor segmentation masks were removed, via comparing the size of the segmentation blob with certain thresholds. Segmentation blobs with areas > 0.02 × total image area were kept an…
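The mask clean-up described above can be sketched in plain Python on a binary mask stored as a list of rows. The 3×3 structuring element, the zero padding at the image border, and all function names are assumptions chosen for illustration. The text is also truncated after the lower area bound, so only the stated `> 0.02 × total image area` test is implemented here.

```python
def _window(mask, r, c):
    """Return the 3x3 neighbourhood of (r, c); cells outside the mask count as 0."""
    h, w = len(mask), len(mask[0])
    return [
        mask[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0
        for rr in (r - 1, r, r + 1)
        for cc in (c - 1, c, c + 1)
    ]


def _apply(mask, reducer):
    """Apply a reducer (all or any) to every 3x3 neighbourhood."""
    return [
        [int(reducer(_window(mask, r, c))) for c in range(len(mask[0]))]
        for r in range(len(mask))
    ]


def erode(mask):
    """A pixel survives only if its whole 3x3 neighbourhood is set."""
    return _apply(mask, all)


def dilate(mask):
    """A pixel is set if any pixel in its 3x3 neighbourhood is set."""
    return _apply(mask, any)


def clean_mask(mask):
    """Opening (erode then dilate) removes speckle; closing (dilate then erode) fills holes."""
    opened = dilate(erode(mask))
    return erode(dilate(opened))


def mask_area(mask):
    """Number of foreground pixels."""
    return sum(sum(row) for row in mask)


def keep_mask(mask, min_area_fraction=0.02):
    """Keep a mask whose foreground area exceeds the stated fraction of the image."""
    total = len(mask) * len(mask[0])
    return mask_area(mask) > min_area_fraction * total
```

In practice a library routine such as an image-processing toolkit's opening and closing would replace the hand-written loops; the pure-Python version is used here only to keep the sketch self-contained.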

**Figure 37.** Data Processing

4.6 PUMCH

• Team name: PUMCH
• Members: Surong Hua, Lu Ping, Wenming Wu
• Research field: Computer Vision
• Institution: Peking Union Medical College Hospital
• City: Beijing, China
• Motivation and plan: At first glance, we were thinking of weakly supervised learning. But then we found that the video tags provided for training only contain information on whether a certain instrument exist…

**Figure 38.** OsTrack Pipeline

Model Loss Function

To train the one-stage object detector, we adopted a dynamic soft label assignment strategy based on SimOTA [78]. As RTM-DET [79] does, we used the IoU between the predictions and ground truth boxes as the soft label to train the classification branch, used the logarithm of the IoU as the regression cost instead of GIoU used in the loss function, used a soft center regi…
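Two quantities named in the passage above, the IoU used as a soft classification label and the log-IoU regression cost, can be sketched as follows. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples. The function names, the `eps` stabiliser, and the negative sign on the logarithm (so that better overlap gives lower cost) are assumptions for this example and are not taken from the team's code.

```python
import math


def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def soft_label(pred_box, gt_box):
    """Soft classification target: the IoU itself, a value in [0, 1]."""
    return box_iou(pred_box, gt_box)


def log_iou_cost(pred_box, gt_box, eps=1e-7):
    """Regression cost based on the logarithm of the IoU (lower is better)."""
    return -math.log(box_iou(pred_box, gt_box) + eps)
```

A perfectly aligned prediction gets a soft label of 1 and a cost near 0, while a prediction with no overlap gets a soft label of 0 and a large cost.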

**Figure 39.** Long Tail

| Model | Size | Bbox mAP |
| --- | --- | --- |
| Cascade R-CNN-Detectors | (894,682) | 63.6 |
| RTMDet-l | (864,864) | 70.6 |
| YOLOv8-l | (864,864) | 69.1 |
| Co-DETR-swin-s | Multi scale | 64.6 |

**Figure 40.** The overall pipeline of our proposed method.

**Figure 41.** Pseudo-code of Initial Label Filter Algorithm.

**Figure 42.** Pseudo-code of Multi-round Label Filter Algorithm.

**Figure 43.** Proposed training workflow: (a) Keypoints tracking-based bounding box generation. (b) Classi

**Figure 44.** Team TUE-VCA Schematic representation of the proposed workflow.

**Figure 45.** Overview of challenge results from 2022 to 2023 demonstrating a robust increase in model


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Surgical Anatomy Recognition with Context Learning using Foundation Representations
cs.CV 2026-06 unverdicted novelty 5.0

Presents ATLAS-120k dataset and ATLAS model for context-aware surgical anatomy segmentation using foundation representations and temporal cues.

Surgical Visual Understanding (SurgVU) Dataset
cs.CV 2025-01 unverdicted novelty 5.0

Releases the SurgVU dataset of surgical videos and labels to enable machine learning research in surgical data science.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1] Aneeq Zia, Max Berniker, Rogerio Nespolo, Conor Perreault, Ziheng Wang, Benjamin Mueller, Ryan Schmidt, Kiran Bhattacharyya, Xi Liu, and Anthony Jarc. Surgical visual understanding (surgvu) dataset, 2025.

[2] Michael A Mederos, R Lorie Jacob, Rachel Ward, Rivfka Shenoy, Melinda M Gibbons, Mark D Girgis, Devan Kansagara, Denise Hynes, Paul G Shekelle, and Karli Kondo. Trends in robot-assisted procedures for general surgery in the veterans health administration. Journal of Surgical Research, 279:788–795, 2022.

[3] Emily A Grimsley, Tara M Barry, Haroon Janjua, Emanuel Eguia, Christopher DuCoin, and Paul C Kuo. Exploring the paradigm of robotic surgery and its contribution to the growth of surgical volume. Surgery Open Science, 10:36–42, 2022.

[4] Lena Maier-Hein, Matthias Eisenmann, Duygu Sarikaya, Keno März, Toby Collins, Anand Malpani, Johannes Fallert, Hubertus Feussner, Stamatia Giannarou, Pietro Mascagni, et al. Surgical data science – from concepts toward clinical translation. Medical image analysis, 76:102306, 2022.

[5] S Swaroop Vedula and Gregory D Hager. Surgical data science: the new knowledge domain. Innovative surgical sciences, 2(3):109–121, 2017.

[6] Sonia Guerin, Arnaud Huaulmé, Vincent Lavoue, Pierre Jannin, and Krystel Nyangoh Timoh. Review of automated performance metrics to assess surgical technical skills in robot-assisted laparoscopy. Surgical Endoscopy, pages 1–18, 2022.

[7] Francisco Luongo, Ryan Hakim, Jessica H Nguyen, Animashree Anandkumar, and Andrew J Hung. Deep learning-based computer vision to recognize and classify suturing gestures in robot-assisted surgery. Surgery, 169(5):1240–1244, 2021.

[8] Andrew J Hung, Jian Chen, Saum Ghodoussipour, Paul J Oh, Zequn Liu, Jessica Nguyen, Sanjay Purushotham, Inderbir S Gill, and Yan Liu. A deep-learning model using automated performance metrics and clinical features to predict urinary continence recovery after robot-assisted radical prostatectomy. BJU international, 124(3):487–495, 2019.

[9] Kristen C Brown, Kiran D Bhattacharyya, Sue Kulason, Aneeq Zia, and Anthony Jarc. How to bring surgery to the next level: interpretable skills assessment in robotic-assisted surgery. Visceral medicine, 36(6):463–470, 2020.

[10] Aneeq Zia and Irfan Essa. Automated surgical skill assessment in rmis training. International journal of computer assisted radiology and surgery, 13(5):731–739, 2018.
[11] Aneeq Zia, Chi Zhang, Xiaobin Xiong, and Anthony M Jarc. Temporal clustering of surgical activities in robot-assisted surgery. International journal of computer assisted radiology and surgery, 12(7):1171–1178, 2017.

[12] Aneeq Zia, Liheng Guo, Linlin Zhou, Irfan Essa, and Anthony Jarc. Novel evaluation of surgical activity recognition models using task-based efficiency metrics. International journal of computer assisted radiology and surgery, 14(12):2155–2163, 2019.

[13] Aneeq Zia, Andrew Hung, Irfan Essa, and Anthony Jarc. Surgical activity recognition in robot-assisted radical prostatectomy using deep learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 273–280. Springer, 2018.

[14] Matthias Eisenmann, Annika Reinke, Vivienn Weru, Minu Dietlinde Tizabi, Fabian Isensee, Tim J Adler, Patrick Godau, Veronika Cheplygina, Michal Kozubek, Sharib Ali, et al. Biomedical image analysis competitions: The state of current participation practice. arXiv preprint arXiv:2212.08568, 2022.

[15] Aneeq Zia, Kiran Bhattacharyya, Xi Liu, Ziheng Wang, Satoshi Kondo, Emanuele Colleoni, Beatrice van Amsterdam, Razeen Hussain, Raabid Hussain, Lena Maier-Hein, et al. Surgical visual domain adaptation: results from the miccai 2020 surgvisdom challenge. arXiv preprint arXiv:2102.13644, 2021.

[16] Zixu Zhao, Yueming Jin, Xiaojie Gao, Qi Dou, and Pheng-Ann Heng. Learning motion flows for semi-supervised instrument segmentation from robotic surgical video. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23, pages 679–689. Springer, 2020.

[17] Aneeq Zia, Kiran Bhattacharyya, Xi Liu, Ziheng Wang, Max Berniker, Satoshi Kondo, Emanuele Colleoni, Dimitris Psychogyios, Yueming Jin, Jinfan Zhou, et al. Objective surgical skills assessment and tool localization: Results from the miccai 2021 simsurgskill challenge. arXiv preprint arXiv:2212.04448, 2022.

[18] Max Allan, Jonathan Mcleod, Congcong Wang, Jean Claude Rosenthal, Zhenglei Hu, Niklas Gard, Peter Eisert, Ke Xue Fu, Trevor Zeffiro, Wenyao Xia, et al. Stereo correspondence and reconstruction of endoscopic data challenge. arXiv preprint arXiv:2101.01133, 2021.

[19] Max Allan, Satoshi Kondo, Sebastian Bodenstedt, Stefan Leger, Rahim Kadkhodamohammadi, Imanol Luengo, Felix Fuentes, Evangello Flouty, Ahmed Mohammed, Marius Pedersen, Avinash Kori, Varghese Alex, Ganapathy Krishnamurthi, David Rauber, Robert Mendel, Christoph Palm, Sophia Bano, Guinther Saibro, Chi-Sheng Shih, Hsun-An Chiang, Juntang Zhuang, Junlin Yan... 2018 robotic scene segmentation challenge, 2020.

[20] Andru P Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel De Mathelin, and Nicolas Padoy. Endonet: a deep architecture for recognition tasks on laparoscopic videos. IEEE transactions on medical imaging, 36(1):86–97, 2016.
[21]

Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks

Amy Jin, Serena Yeung, Jeffrey Jopling, Jonathan Krause, Dan Azagury, Arnold Milstein, and Li Fei- Fei. Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. In 2018 IEEE winter conference on applications of computer vision (WACV) , pages 691–699. IEEE, 2018

work page 2018
[22]

Imagenet large scale visual recognition challenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision , 115(3):211–252, 2015

work page 2015
[23]

Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines

Yinda Xu, Zeyu Wang, Zuoxin Li, Ye Yuan, and Gang Yu. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 12549–12556, 2020

work page 2020
[24]

Cascade r-cnn: Delving into high quality object detection

Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018

work page 2018
[25]

2017 robotic instrument segmentation challenge, 2019

Max Allan, Alex Shvets, Thomas Kurmann, Zichen Zhang, Rahul Duggal, Yun-Hsuan Su, Nicola Rieke, Iro Laina, Niveditha Kalavakonda, Sebastian Bodenstedt, Luis Herrera, Wenqi Li, Vladimir Iglovikov, Huoling Luo, Jian Yang, Danail Stoyanov, Lena Maier-Hein, Stefanie Speidel, and Mahdi Azizian. 2017 robotic instrument segmentation challenge, 2019

work page 2017
[26]

Shvets, Alexander Rakhlin, Alexandr A

Alexey A. Shvets, Alexander Rakhlin, Alexandr A. Kalinin, and Vladimir I. Iglovikov. Automatic instrument segmentation in robot-assisted surgery using deep learning. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA) . IEEE, December 2018

work page 2018
[27]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022

[28]

Efficientnet: Rethinking model scaling for convolutional neural networks

Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019

[29]

Grad-cam: Visual explanations from deep networks via gradient-based localization

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017

[30]

YOLOv5 by Ultralytics, 5 2020

Glenn Jocher. YOLOv5 by Ultralytics, 5 2020

[31]

Weakly supervised pseudo-label assisted learning for als point cloud semantic segmentation

Puzuo Wang and Wei Yao. Weakly supervised pseudo-label assisted learning for als point cloud semantic segmentation. arXiv preprint arXiv:2105.01919, 2021

[32]

Fastai: A layered API for deep learning

Jeremy Howard and Sylvain Gugger. Fastai: A layered API for deep learning. Information, 11(2):108, feb 2020

[33]

Detectron2, 2019

Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2, 2019

[34]

Surgical tool detection in open surgery videos

Ryo Fujii, Ryo Hachiuma, Hiroki Kajita, and Hideo Saito. Surgical tool detection in open surgery videos. Applied Sciences, 12(20), 2022

[35]

Swin transformer v2: Scaling up capacity and resolution

Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin transformer v2: Scaling up capacity and resolution. In CVPR, pages 12009–12019, June 2022

[36]

Efficientnetv2: Smaller models and faster training

Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In ICML, 2021

[37]

Aggregated Residual Transformations for Deep Neural Networks

Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016

[38]

Pytorch image models, 2019

Ross Wightman. Pytorch image models, 2019

[39]

Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks

Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In WACV, 2018

[40]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

[41]

Shallow feature matters for weakly supervised object localization

Jun Wei, Qin Wang, Zhen Li, Sheng Wang, S Kevin Zhou, and Shuguang Cui. Shallow feature matters for weakly supervised object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5993–6001, 2021

[42]

Query2label: A simple transformer way to multi-label classification

Shilong Liu, Lei Zhang, Xiao Yang, Hang Su, and Jun Zhu. Query2label: A simple transformer way to multi-label classification. arXiv preprint arXiv:2107.10834, 2021

[43]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

[44]

Asymmetric loss for multi-label classification

Emanuel Ben-Baruch, Tal Ridnik, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. Asymmetric loss for multi-label classification. arXiv preprint arXiv:2009.14119, 2020

[45]

The opencv library

Gary Bradski. The opencv library. Dr. Dobb’s Journal: Software Tools for the Professional Programmer, 25(11):120–123, 2000

[46]

Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, et al. Emerging Properties in Self-Supervised Vision Transformers. In ICCV, 2021

[47]

iBOT: Image BERT Pre-Training with Online Tokenizer

Jinghao Zhou, Chen Wei, Huiyu Wang, et al. iBOT: Image BERT Pre-Training with Online Tokenizer. ICLR, 2022

[48]

A deep learning spatial-temporal framework for detecting surgical tools in laparoscopic videos

Tamer Abdulbaki Alshirbaji, Nour Aldeen Jalal, Paul D Docherty, Thomas Neumuth, and Knut Möller. A deep learning spatial-temporal framework for detecting surgical tools in laparoscopic videos. Biomedical Signal Processing and Control, 68:102801, 2021

[49]

Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation

Thibaut Durand, Taylor Mordan, Nicolas Thome, and Matthieu Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 642–651, 2017

[50]

Weakly-supervised learning for tool localization in laparoscopic videos

Armine Vardazaryan, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Weakly-supervised learning for tool localization in laparoscopic videos. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, pages 169–179. Springer, 2018

[51]

Weakly supervised convolutional LSTM approach for tool tracking in laparoscopic videos

Chinedu Innocent Nwoye, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Weakly supervised convolutional LSTM approach for tool tracking in laparoscopic videos. International journal of computer assisted radiology and surgery, 14(6):1059–1067, 2019

[52]

Squeeze-and-excitation networks

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018

[53]

Improving the Generalisability of Deep CNNs by Combining Multi-stage Features for Surgical Tool Classification

T. Abdulbaki Alshirbaji, Nour A. Jalal, Paul D. Docherty, P. T. Neumuth, and Knut Möller. Improving the Generalisability of Deep CNNs by Combining Multi-stage Features for Surgical Tool Classification. In 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 533–536. IEEE, 2022

[54]

Surgical tool classification in laparoscopic videos using convolutional neural network

Tamer Abdulbaki Alshirbaji, Nour Aldeen Jalal, and Knut Möller. Surgical tool classification in laparoscopic videos using convolutional neural network. Current Directions in Biomedical Engineering, 4(1):407–410, 2018

[55]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint, 2020

[56]

Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint

[57]

Randaugment: Practical automated data augmentation with a reduced search space

Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. arXiv e-prints, 2019

[58]

Asymmetric loss for multi-label classification

Tal Ridnik, Emanuel Ben-Baruch, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. Asymmetric loss for multi-label classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 82–91, 2021

[59]

YOLOv4: Optimal Speed and Accuracy of Object Detection

Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020

[60]

Cholecseg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on cholec80

W-Y Hong, C-L Kao, Y-H Kuo, J-R Wang, W-L Chang, and C-S Shih. Cholecseg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on cholec80. arXiv preprint arXiv:2012.12453, 2020

[61]

Can masses of non-experts train highly accurate image classifiers?

Lena Maier-Hein, Sven Mersmann, Daniel Kondermann, Sebastian Bodenstedt, Alexandro Sanchez, Christian Stock, Hannes Gotz Kenngott, Mathias Eisenmann, and Stefanie Speidel. Can masses of non-experts train highly accurate image classifiers? In International conference on medical image computing and computer-assisted intervention, pages 438–445. Springer, 2014

[62]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

[63]

Weakly supervised convolutional lstm approach for tool tracking in laparoscopic videos

Chinedu Innocent Nwoye, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Weakly supervised convolutional lstm approach for tool tracking in laparoscopic videos. International journal of computer assisted radiology and surgery, 14:1059–1067, 2019

[64]

Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos

Chinedu Innocent Nwoye, Tong Yu, Cristians Gonzalez, Barbara Seeliger, Pietro Mascagni, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos. Medical Image Analysis, 78:102433, 2022

[65]

Rethinking the inception architecture for computer vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016

[66]

Fastai: A layered api for deep learning

Jeremy Howard and Sylvain Gugger. Fastai: A layered api for deep learning. Information, 11(2):108, 2020

[67]

Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis

Davood Karimi, Haoran Dou, Simon K Warfield, and Ali Gholipour. Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis. Medical image analysis, 65:101759, 2020

[68]

Scikit-learn: Machine learning in Python

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011

[69]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

[70]

Faster r-cnn: Towards real-time object detection with region proposal networks, 2016

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks, 2016

[71]

U-net: Convolutional networks for biomedical image segmentation, 2015

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015

[72]

Ultralytics YOLO, January 2023

Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics YOLO, January 2023

[73]

Mixformer: End-to-end tracking with iterative mixed attention, 2022

Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention, 2022

[74]

Dinov2: Learning robust visual features without supervision, 2024

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patri...

[75]

Deep residual learning for image recognition, 2015

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015

[76]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar, and Ross Girshick. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2022

[77]

Joint feature learning and relation modeling for tracking: A one-stream framework

Botao Ye, Hong Chang, Bingpeng Ma, and Shiguang Shan. Joint feature learning and relation modeling for tracking: A one-stream framework

[78]

Yolox: Exceeding yolo series in 2021

Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, Jian Sun, and Megvii Technology. Yolox: Exceeding yolo series in 2021

[79]

Rtmdet: An empirical study of designing real-time object detectors

Chengqi Lyu, Wenwei Zhang, Haian Huang, Yue Zhou, Yudong Wang, Yanyi Liu, Shilong Zhang, and Kai Chen. Rtmdet: An empirical study of designing real-time object detectors

[80]

Weighted boxes fusion: Ensembling boxes from different object detection models

Roman Solovyev, Weimin Wang, and Tatiana Gabruseva. Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing, page 104117, Mar 2021


[22] [22]

Imagenet large scale visual recognition challenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision , 115(3):211–252, 2015

work page 2015

[23] [23]

Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines

Yinda Xu, Zeyu Wang, Zuoxin Li, Ye Yuan, and Gang Yu. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 12549–12556, 2020

work page 2020

[24] [24]

Cascade r-cnn: Delving into high quality object detection

Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018

work page 2018

[25] [25]

2017 robotic instrument segmentation challenge, 2019

Max Allan, Alex Shvets, Thomas Kurmann, Zichen Zhang, Rahul Duggal, Yun-Hsuan Su, Nicola Rieke, Iro Laina, Niveditha Kalavakonda, Sebastian Bodenstedt, Luis Herrera, Wenqi Li, Vladimir Iglovikov, Huoling Luo, Jian Yang, Danail Stoyanov, Lena Maier-Hein, Stefanie Speidel, and Mahdi Azizian. 2017 robotic instrument segmentation challenge, 2019

work page 2017

[26] [26]

Shvets, Alexander Rakhlin, Alexandr A

Alexey A. Shvets, Alexander Rakhlin, Alexandr A. Kalinin, and Vladimir I. Iglovikov. Automatic instrument segmentation in robot-assisted surgery using deep learning. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA) . IEEE, December 2018

work page 2018

[27] [27]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022

work page 2022

[28] [28]

Efficientnet: Rethinking model scaling for convolutional neural networks

Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning , pages 6105–6114. PMLR, 2019. 71

work page 2019

[29] [29]

Grad-cam: Visual explanations from deep networks via gradient-based localization

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision , pages 618–626, 2017

work page 2017

[30] [30]

YOLOv5 by Ultralytics, 5 2020

Glenn Jocher. YOLOv5 by Ultralytics, 5 2020

work page 2020

[31] [31]

Weakly supervised pseudo-label assisted learning for als point cloud semantic segmentation

Puzuo Wang and Wei Yao. Weakly supervised pseudo-label assisted learning for als point cloud semantic segmentation. arXiv preprint arXiv:2105.01919 , 2021

work page arXiv 2021

[32] [32]

Fastai: A layered API for deep learning

Jeremy Howard and Sylvain Gugger. Fastai: A layered API for deep learning. Information, 11(2):108, feb 2020

work page 2020

[33] [33]

Detectron2, 2019

Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2, 2019

work page 2019

[34] [34]

Surgical tool detection in open surgery videos

Ryo Fujii, Ryo Hachiuma, Hiroki Kajita, and Hideo Saito. Surgical tool detection in open surgery videos. Applied Sciences, 12(20), 2022

work page 2022

[35] [35]

Swin transformer v2: Scaling up capacity and resolution

Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin transformer v2: Scaling up capacity and resolution. In CVPR, pages 12009–12019, June 2022

work page 2022

[36] [36]

Efficientnetv2: Smaller models and faster training

Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In ICML, 2021

work page 2021

[37] [37]

Aggregated Residual Transformations for Deep Neural Networks

Saining Xie, Ross B. Girshick, Piotr Doll´ ar, Zhuowen Tu, and Kaiming He. Aggregated residual trans- formations for deep neural networks. CoRR, abs/1611.05431, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[38] [38]

Pytorch image models, 2019

Ross Wightman. Pytorch image models, 2019

work page 2019

[39] [39]

Grad- cam++: Generalized gradient-based visual explanations for deep convolutional networks

Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad- cam++: Generalized gradient-based visual explanations for deep convolutional networks. In WACV, 2018

work page 2018

[40] [40]

Imagenet: A large-scale hierar- chical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierar- chical image database. In CVPR, 2009

work page 2009

[41] [41]

Shallow feature matters for weakly supervised object localization

Jun Wei, Qin Wang, Zhen Li, Sheng Wang, S Kevin Zhou, and Shuguang Cui. Shallow feature matters for weakly supervised object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5993–6001, 2021

work page 2021

[42] [42]

Query2label: A simple transformer way to multi-label classification.arXiv preprint arXiv:2107.10834,

Shilong Liu, Lei Zhang, Xiao Yang, Hang Su, and Jun Zhu. Query2label: A simple transformer way to multi-label classification. arXiv preprint arXiv:2107.10834 , 2021

work page arXiv 2021

[43] [43]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016

[44] [44]

arXiv preprint arXiv:2009.14119 (2020)

Emanuel Ben-Baruch, Tal Ridnik, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. Asymmetric loss for multi-label classification. arXiv preprint arXiv:2009.14119 , 2020

work page arXiv 2009

[45] [45]

The opencv library

Gary Bradski. The opencv library. Dr. Dobb’s Journal: Software Tools for the Professional Programmer, 25(11):120–123, 2000

work page 2000

[46] [46]

Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, et al. Emerging Properties in Self-Supervised Vision Transformers. In ICCV, 2021

work page 2021

[47] [47]

iBOT: Image BERT Pre-Training with Online Tokenizer

Jinghao Zhou, Chen Wei, Huiyu Wang, et al. iBOT: Image BERT Pre-Training with Online Tokenizer. ICLR, 2022

work page 2022

[48] [48]

A deep learning spatial-temporal framework for detecting surgical tools in laparoscopic videos

Tamer Abdulbaki Alshirbaji, Nour Aldeen Jalal, Paul D Docherty, Thomas Neumuth, and Knut M¨ oller. A deep learning spatial-temporal framework for detecting surgical tools in laparoscopic videos. Biomed- ical Signal Processing and Control , 68:102801, 2021. 72

work page 2021

[49] [49]

Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation

Thibaut Durand, Taylor Mordan, Nicolas Thome, and Matthieu Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 642–651, 2017

work page 2017

[50] [50]

Weakly-supervised learn- ing for tool localization in laparoscopic videos

Armine Vardazaryan, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Weakly-supervised learn- ing for tool localization in laparoscopic videos. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis , pages 169–179. Springer, 2018

work page 2018

[51] [51]

Weakly supervised convolutional LSTM approach for tool tracking in laparoscopic videos.International journal of computer assisted radiology and surgery, 14(6):1059–1067, 2019

Chinedu Innocent Nwoye, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Weakly supervised convolutional LSTM approach for tool tracking in laparoscopic videos.International journal of computer assisted radiology and surgery, 14(6):1059–1067, 2019

work page 2019

[52] [52]

Squeeze-and-excitation networks

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 7132–7141, 2018

work page 2018

[53] [53]

Abdulbaki Alshirbaji, Nour A

T. Abdulbaki Alshirbaji, Nour A. Jalal, Paul D. Docherty, P. T. Neumuth, and Knut M¨ oller. Improving the Generalisability of Deep CNNs by Combining Multi-stage Features for Surgical Tool Classification. In 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 533–536. IEEE, 2022

work page 2022

[54] [54]

Surgical tool classification in la- paroscopic videos using convolutional neural network

Tamer Abdulbaki Alshirbaji, Nour Aldeen Jalal, and Knut M¨ oller. Surgical tool classification in la- paroscopic videos using convolutional neural network. Current Directions in Biomedical Engineering , 4(1):407–410, 2018

work page 2018

[55] [55]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint, 2020

work page 2020

[56] [56]

Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint

work page

[57] [57]

Randaugment: Practical automated data augmentation with a reduced search space

Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. arXiv e-prints, 2019

work page 2019

[58] [58]

Asymmetric loss for multi-label classification

Tal Ridnik, Emanuel Ben-Baruch, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. Asymmetric loss for multi-label classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 82–91, 2021

work page 2021

[59] [59]

YOLOv4: Optimal Speed and Accuracy of Object Detection

Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004

[60] [60]

Cholecseg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on cholec80

W-Y Hong, C-L Kao, Y-H Kuo, J-R Wang, W-L Chang, and C-S Shih. Cholecseg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on cholec80. arXiv preprint arXiv:2012.12453, 2020

work page arXiv 2012

[61] [61]

Can masses of non-experts train highly accurate image classifiers?

Lena Maier-Hein, Sven Mersmann, Daniel Kondermann, Sebastian Bodenstedt, Alexandro Sanchez, Christian Stock, Hannes Gotz Kenngott, Mathias Eisenmann, and Stefanie Speidel. Can masses of non-experts train highly accurate image classifiers? In International conference on medical image computing and computer-assisted intervention, pages 438–445. Springer, 2014

work page 2014

[62] [62]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017

[63] [63]

Weakly supervised convolutional lstm approach for tool tracking in laparoscopic videos

Chinedu Innocent Nwoye, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Weakly supervised convolutional lstm approach for tool tracking in laparoscopic videos. International journal of computer assisted radiology and surgery, 14:1059–1067, 2019

work page 2019

[64] [64]

Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos

Chinedu Innocent Nwoye, Tong Yu, Cristians Gonzalez, Barbara Seeliger, Pietro Mascagni, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos. Medical Image Analysis, 78:102433, 2022

work page 2022

[65] [65]

Rethinking the inception architecture for computer vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016

work page 2016

[66] [66]

Fastai: A layered api for deep learning

Jeremy Howard and Sylvain Gugger. Fastai: A layered api for deep learning. Information, 11(2):108, 2020

work page 2020

[67] [67]

Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis

Davood Karimi, Haoran Dou, Simon K Warfield, and Ali Gholipour. Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis. Medical image analysis, 65:101759, 2020

work page 2020

[68] [68]

Scikit-learn: Machine learning in Python

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011

work page 2011

[69] [69]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

work page 2014

[70] [70]

Faster r-cnn: Towards real-time object detection with region proposal networks, 2016

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks, 2016

work page 2016

[71] [71]

U-net: Convolutional networks for biomedical image segmentation, 2015

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015

work page 2015

[72] [72]

Ultralytics YOLO, January 2023

Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics YOLO, January 2023

work page 2023

[73] [73]

Mixformer: End-to-end tracking with iterative mixed attention, 2022

Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention, 2022

work page 2022

[74] [74]

Dinov2: Learning robust visual features without supervision, 2024

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patri...

work page 2024

[75] [75]

Deep residual learning for image recognition, 2015

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015

work page 2015

[76] [76]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar, and Ross Girshick. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2022

work page 2022

[77] [77]

Joint feature learning and relation modeling for tracking: A one-stream framework

Botao Ye, Hong Chang, Bingpeng Ma, and Shiguang Shan. Joint feature learning and relation modeling for tracking: A one-stream framework

work page

[78] [78]

Yolox: Exceeding yolo series in 2021

Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021

work page 2021

[79] [79]

Rtmdet: An empirical study of designing real-time object detectors

Chengqi Lyu, Wenwei Zhang, Haian Huang, Yue Zhou, Yudong Wang, Yanyi Liu, Shilong Zhang, and Kai Chen. Rtmdet: An empirical study of designing real-time object detectors

work page

[80] [80]

Weighted boxes fusion: Ensembling boxes from different object detection models

Roman Solovyev, Weimin Wang, and Tatiana Gabruseva. Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing, page 104117, Mar 2021

work page 2021