
arxiv: 2407.11906 · v3 · submitted 2024-07-16 · 💻 cs.CV · cs.RO

SegSTRONG-C: Segmenting Surgical Tools Robustly On Non-adversarial Generated Corruptions -- An EndoVis'24 Challenge

Pith reviewed 2026-05-23 22:42 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords surgical tool segmentation · model robustness · non-adversarial corruptions · challenge benchmark · endoscopic images · binary segmentation · deep neural networks

The pith

A new benchmark with paired clean and corrupted surgical images shows that prior knowledge and custom training improve tool segmentation robustness to bleeding, smoke, and low brightness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the SegSTRONG-C challenge to measure how surgical tool segmentation models degrade under plausible non-adversarial corruptions and to identify methods that resist them. It supplies a dataset of paired clean and corrupted samples created through counterfactual robotic replay so that models trained only on clean data can be tested on corrupted versions. Top submissions reach high scores by drawing on prior knowledge, tailored training, and architecture choices. This setup matters because it isolates the effect of realistic corruptions that appear in surgery without adversarial intent. The results also flag that most gains still come from conventional techniques and call for fresh approaches to achieve wider robustness.

Core claim

The SegSTRONG-C challenge supplies paired clean and corrupted endoscopic images for the binary robot tool segmentation task, with corruptions generated through counterfactual robotic replay. Participants train on the clean domain and are evaluated on unreleased test sets containing bleeding, smoke, and low brightness. The leading entries attain an average 0.9394 DSC and 0.9301 NSD. These outcomes demonstrate that prior knowledge, customized training strategies, and architectural decisions can be leveraged to improve robustness. The challenge also surfaces recurring failure modes and concludes that conventional techniques remain limited, advocating new paradigms for universal robustness to unforeseen corruptions.
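
For readers unfamiliar with the first headline metric, the following is a minimal sketch of the Dice similarity coefficient (DSC) for binary masks in plain Python. The function name `dice_score`, the list-of-rows mask layout, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the challenge's own evaluation code is not described on this page and may differ. NSD, which additionally depends on a boundary tolerance, is not sketched.

```python
from typing import Sequence

# Binary mask as rows of 0/1 values (hypothetical layout for this sketch).
Mask = Sequence[Sequence[int]]


def dice_score(pred: Mask, target: Mask, empty_value: float = 1.0) -> float:
    """Dice similarity coefficient 2|P∩T| / (|P| + |T|) for two binary masks."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = pred_sum = target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            intersection += p and t  # counts pixels set in both masks
            pred_sum += p
            target_sum += t

    denom = pred_sum + target_sum
    if denom == 0:
        # Both masks empty: the ratio is undefined. Returning 1.0 is an
        # assumed convention, not something stated in the source.
        return empty_value
    return 2.0 * intersection / denom
```

As a worked case, a prediction that recovers two of three tool pixels and adds one false positive gives 2·2 / (3 + 3), roughly 0.667.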

What carries the argument

The paired clean-corrupted dataset generated through counterfactual robotic replay, which enables reproducible testing of models trained on uncorrupted data against non-adversarial corruptions.
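
The value of the paired design can be made concrete with a short sketch: because each corrupted frame has a clean counterpart, a score drop can be attributed to the corruption rather than to scene content. The code below is an illustrative assumption, not the organizers' evaluation script. The names `robustness_report` and `dice`, the `(clean, corrupted, mask)` tuple layout, and the idea that one ground-truth mask serves both frames of a pair are all hypothetical.

```python
from statistics import mean
from typing import Any, Callable, Dict, List, Sequence, Tuple

Mask = Sequence[Sequence[int]]
# (clean image, corrupted image, ground-truth mask); hypothetical layout that
# assumes both frames of a pair share one mask.
Pair = Tuple[Any, Any, Mask]


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient on flattened binary masks (1.0 if both are empty)."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")
    denom = sum(p) + sum(t)
    if denom == 0:
        return 1.0
    return 2.0 * sum(a and b for a, b in zip(p, t)) / denom


def robustness_report(
    model: Callable[[Any], Mask],
    pairs_by_corruption: Dict[str, List[Pair]],
) -> Dict[str, Dict[str, float]]:
    """Mean Dice on clean vs. corrupted frames, and the drop, per corruption."""
    report: Dict[str, Dict[str, float]] = {}
    for corruption, pairs in pairs_by_corruption.items():
        clean = mean(dice(model(c), gt) for c, _, gt in pairs)
        corrupted = mean(dice(model(x), gt) for _, x, gt in pairs)
        report[corruption] = {
            "clean": clean,
            "corrupted": corrupted,
            "drop": clean - corrupted,  # degradation attributable to the corruption
        }
    return report
```

Here `model` is any callable that maps an image to a binary mask, and the dictionary keys would be corruption names such as bleeding, smoke, or low brightness.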

If this is right

  • Models trained solely on clean data can still perform well on corrupted domains when prior knowledge and custom strategies are applied.
  • Architectural choices contribute measurably to accuracy under the tested corruption types.
  • Most successful entries rely on established techniques that carry known limits for handling unforeseen corruptions.
  • Further gains in surgical data science will require approaches beyond current conventional methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The paired structure could support training methods that explicitly enforce invariance to these specific corruptions.
  • Results from this benchmark may inform robustness evaluation in other endoscopic or medical imaging tasks.
  • Additional corruption types encountered in actual procedures could be added to increase coverage.

Load-bearing premise

The corruptions produced by counterfactual robotic replay match the non-adversarial corruptions that occur in real surgical procedures.

What would settle it

A direct comparison of the same models on the challenge's generated corruptions versus on naturally occurring corruptions recorded during live surgery would show whether performance transfers.

Original abstract

Surgical data science has seen rapid advancement with the excellent performance of end-to-end deep neural networks (DNNs). Despite their successes, DNNs have been proven susceptible to minor "corruptions," introducing a major concern for the translation of cutting-edge technology, especially in high-stakes scenarios. We introduce the SegSTRONG-C challenge dedicated to better understanding model deterioration under unforeseen but plausible non-adversarial "corruption" and the capabilities of contemporary methods that seek to improve it. Built on a dataset generated through counterfactual robotic replay, SegSTRONG-C provides paired clean and "corrupted" samples, enabling reproducible evaluation of model robustness. Participants are challenged to train tool segmentation algorithms on "uncorrupted" data and evaluate them on "corrupted" test domains for the binary robot tool segmentation task. Through comprehensive baseline experiments and participating submissions from widespread community engagement, SegSTRONG-C reveals key themes for model failure and identifies promising directions for improving robustness. The performance of challenge winners, achieving an average 0.9394 DSC and 0.9301 NSD across the unreleased test sets with "corruption" types: bleeding, smoke, and low brightness. This highlights how prior knowledge, customized training strategies, and architectural choice can be leveraged to improve robustness. In conclusion, the SegSTRONG-C challenge has identified practical approaches for enhancing model robustness. However, most approaches rely on conventional techniques that have known limitations. Looking ahead, we advocate for expanding intellectual diversity and creativity in non-adversarial robustness beyond data augmentation, calling for new paradigms that enhance universal robustness to unforeseen "corruptions" to facilitate richer applications in surgical data science.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces the SegSTRONG-C EndoVis'24 challenge for binary robot tool segmentation under non-adversarial corruptions (bleeding, smoke, low brightness) generated via counterfactual robotic replay, supplying paired clean/corrupted data. It reports baseline experiments plus community submissions, with winners reaching average DSC 0.9394 and NSD 0.9301 on unreleased test sets, and identifies themes for model failure while advocating new robustness paradigms beyond conventional augmentation.

Significance. If the generated corruptions are representative of real surgical conditions, the challenge supplies a reproducible benchmark that empirically demonstrates how prior knowledge, training strategies, and architecture choices can yield high robustness on the specified corruptions, thereby guiding practical improvements in surgical data science.

major comments (1)
  1. [Abstract] Abstract and dataset description: the positioning of the corruptions as 'plausible non-adversarial' and relevant to real OR conditions is not supported by any quantitative validation (feature-space distances, perceptual metrics, or clinician ratings) that the counterfactual robotic replay preserves the statistics causing model failure in live procedures. This assumption is load-bearing for interpreting the reported DSC/NSD scores as evidence of robustness to clinically meaningful corruptions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the positioning of the generated corruptions. We address this point directly below and outline the planned revisions.

point-by-point responses
  1. Referee: [Abstract] Abstract and dataset description: the positioning of the corruptions as 'plausible non-adversarial' and relevant to real OR conditions is not supported by any quantitative validation (feature-space distances, perceptual metrics, or clinician ratings) that the counterfactual robotic replay preserves the statistics causing model failure in live procedures. This assumption is load-bearing for interpreting the reported DSC/NSD scores as evidence of robustness to clinically meaningful corruptions.

    Authors: We agree that the manuscript does not provide quantitative validation (e.g., feature-space distances, perceptual metrics, or clinician ratings) demonstrating that the counterfactual robotic replay corruptions preserve the exact statistics of model failures observed in live procedures. The generation process relies on replaying robotic trajectories with added visual effects (bleeding, smoke, low brightness) to create paired clean/corrupted samples, which we positioned as plausible non-adversarial corruptions based on the method's design. However, this remains an unvalidated assumption. In the revised manuscript we will (1) tone down the abstract and dataset description to describe the corruptions as 'synthetically generated to simulate common non-adversarial effects' rather than asserting clinical representativeness, (2) add an explicit limitations paragraph discussing the lack of such validation and its implications for interpreting the DSC/NSD scores, and (3) note this as an important direction for future work. These textual changes will make the claims more precise without requiring new experiments. revision: yes

Circularity Check

0 steps flagged

Empirical challenge report with no derivations or fitted predictions

full rationale

The paper is a report on an EndoVis'24 segmentation challenge. It describes a dataset of paired clean/corrupted images generated via counterfactual robotic replay, reports baseline and community-submitted DSC/NSD scores on unreleased test sets, and discusses practical robustness strategies. No equations, first-principles derivations, parameter fittings, or predictions are present. The central claims are empirical observations from community results, not reductions of outputs to inputs by construction. Self-citations are limited to prior challenge organization and do not bear load on any claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark and challenge paper without mathematical derivations. No free parameters, axioms, or invented entities are introduced; the contribution rests on the new dataset construction and evaluation protocol applied to existing segmentation networks.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Robust Surgical Automation via Digital Twin Representations from Foundation Models

    cs.RO 2024-09 unverdicted novelty 5.0

    Digital twin representations from vision foundation models enable LLM-based planning for robust peg transfer and gauze retrieval on the dVRK surgical platform with claimed generalizability.
