
arxiv: 2510.04772 · v2 · submitted 2025-10-06 · 💻 cs.CV · cs.AI · cs.LG

Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge

Pith reviewed 2026-05-18 10:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.LG
keywords federated learning, surgical vision, appendicitis classification, spatiotemporal models, laparoscopic videos, privacy-preserving AI, multi-center generalization, personalized adaptation

The pith

Even with all surgical video data pooled centrally, appendicitis classification reaches only 26.31 percent F1 on an unseen center. Decentralized training adds a further, separable penalty, and video-level models outperform frame-level ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a benchmark for federated learning applied to multi-center laparoscopic appendectomy videos as a test case for privacy-preserving surgical AI. It shows that centralizing all data still produces low generalization to a new hospital, and that avoiding data sharing through federated methods creates an extra performance cost on top of that baseline difficulty. Video sequences processed as a whole prove more effective than treating individual frames separately, no matter whether training is centralized or spread across sites. Simple local adaptation on each hospital's data tends to fail because of class imbalance, so the work points toward more deliberate personalization methods as a practical next step.

Core claim

In the FedSurg challenge on a preliminary subset of the Appendix300 dataset, centralized training achieved only 26.31 percent F1-score when tested on videos from an unseen center. Federated and swarm-learning submissions incurred an additional, measurable performance drop beyond that central baseline. Spatiotemporal models operating on full video clips outperformed frame-by-frame approaches under every aggregation method tested. Naive local fine-tuning on imbalanced per-center data produced classifier collapse, whereas structured personalized federated learning combined with parameter-efficient fine-tuning provided a clearer path for center-specific adaptation.
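To make "classifier collapse" concrete, the sketch below scores a model that always predicts the majority class on an imbalanced center. It is illustrative only: the label counts are hypothetical, the four-class setup is an assumption, and macro averaging is assumed because this page does not state which F1 averaging mode the challenge used.

```python
from collections import Counter

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical imbalanced center: four classes, one of them dominant.
labels = [0, 1, 2, 3]
y_true = [0] * 14 + [1] * 3 + [2] * 2 + [3] * 1

# A collapsed classifier ignores its input and always outputs the majority class.
majority = Counter(y_true).most_common(1)[0][0]
collapsed = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, collapsed)) / len(y_true)
collapsed_f1 = macro_f1(y_true, collapsed, labels)
print(f"accuracy={accuracy:.2f}  macro-F1={collapsed_f1:.3f}")
```

Accuracy looks acceptable for the collapsed model while macro-F1 stays low, which is one reason a class-sensitive metric is the more informative choice on imbalanced per-center data.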

What carries the argument

The unseen-center generalization split used to separate inherent task difficulty from the effects of data decentralization across centralized, federated, and swarm-learning submissions.
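A minimal sketch of an unseen-center split follows. The video identifiers and the `unseen_center_split` helper are hypothetical; holding out Center 4 mirrors the figure captions on this page (Task 1 is reported on Center 4), but the challenge's actual data loading is not described here.

```python
def unseen_center_split(videos, held_out):
    """Split (video_id, center) pairs so that one center is used for testing only."""
    train = [v for v in videos if v[1] != held_out]
    test = [v for v in videos if v[1] == held_out]
    return train, test

# Hypothetical catalogue of videos, each tagged with its originating center.
videos = [
    ("vid_a", 1), ("vid_b", 1),
    ("vid_c", 2),
    ("vid_d", 3),
    ("vid_e", 4), ("vid_f", 4),
]

train, test = unseen_center_split(videos, held_out=4)
train_centers = {center for _, center in train}
test_centers = {center for _, center in test}
print("train centers:", sorted(train_centers), "| test centers:", sorted(test_centers))
```

Because no video from the held-out center reaches training under any regime, a gap between centralized and decentralized runs on that center can be attributed to the training regime rather than to data leakage.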

If this is right

  • Centralized pooling of multi-center surgical videos still yields only 26.31 percent F1 on unseen centers, showing the task remains hard even without privacy constraints.
  • Decentralized training adds a distinct performance penalty separate from the difficulty of the underlying classification problem.
  • Video-level spatiotemporal models outperform frame-level models under both centralized and decentralized training.
  • Naive local fine-tuning collapses on imbalanced center-specific data.
  • Structured personalized federated learning with parameter-efficient fine-tuning offers a more reliable route to center adaptation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Raising the performance ceiling may first require stronger base architectures for temporal surgical data before federation techniques are refined further.
  • The reported gaps suggest that larger or more balanced multi-center collections could be needed before such systems reach clinical viability.
  • The same unseen-center protocol could be applied to other laparoscopic procedures to check whether temporal modeling remains the dominant factor.

Load-bearing premise

The preliminary subset of the Appendix300 dataset together with the three-submission challenge format and the chosen unseen-center split sufficiently capture the statistical and logistical difficulties of real-world multi-institutional surgical video data.

What would settle it

Re-running the same evaluation protocol on the full Appendix300 dataset or with additional submissions and observing whether F1 on the unseen center rises well above 26.31 percent or whether decentralized methods match centralized performance would directly test the reported limitations.

Figures

Figures reproduced from arXiv: 2510.04772 by Alexander C. Jenke, Annika Reinke, Claas de Boer, Danail Stoyanov, Fiona R. Kolbinger, Hanna Hoffmann, Jakob N. Kather, Julia Alekseenko, Kevin Pfeiffer, Lena Maier-Hein, Lorenzo Mazza, Max Kirchner, Nicolas Padoy, Oliver L. Saldanha, Santhi Raj Kolamuri, Sebastian Bodenstedt, Sophia Bano, Stefanie Speidel, Weam Kanjo.

Figure 1. FedSurg24 Challenge Highlights: The top panel shows example images of intraoperative appendicitis grades, defined according to Gomes et al. [24], which were used for video annotation. The lower panel illustrates the FedSurg Challenge workflow: teams submitted Docker containers via Synapse, which were executed on a secure cluster simulating FL across three centers with local training and centralized aggrega…
Figure 2. Label Distribution Across Data Subsets per Center. Label distributions for (a) the training dataset and (b) the test dataset across the four centers. The plots highlight notable inter-center variability and class imbalance. In the training set visualization, the darker segments represent the publicly available subset for participant development, while the lighter segments show the complete dataset used for…
Figure 3. Methods Overview: The three submissions shown utilize different backbone architectures and federated strategies. A common approach is that in each server round, the best-performing model from a client’s local training rounds is sent to the server for aggregation. (a) Team Santhi uses a frozen ViViT backbone with a fine-tuned classification head processing 32 frames per video, with updates aggregated via Fe…
Figure 4. Confusion Matrices – Task 1, Center 4. Confusion matrices for the participating teams on Center 4 (Task 1). The values in the confusion matrices are not normalized. The color highlighting is normalized row-wise by true labels. The diagonal highlights class-wise recall, while off-diagonal values indicate common misclassification patterns.
Figure 5. Bootstrapped Performance Results. Visualization of the performance results with standard deviation as error bars for all teams and tasks after bootstrapping with 10,000 repetitions. The plot illustrates the variability and stability of the outcomes across different centers.
Figure 6. Ranking Stability. Bootstrapped ranking distributions for each metric and task, based on 10,000 bootstrap iterations. Circle size indicates the percentage of times a team’s model achieved a specific rank across samples. Black crosses show median ranks, and black lines denote the 95% bootstrap confidence intervals. Subfigures (a) and (b) correspond to Task 1 (generalization ability) with metrics EC and F1-s…
Figure 7. Confusion Matrices – Task 2, Centers 1–3. The values in the confusion matrices are not normalized. The color highlighting is normalized row-wise by true labels. The diagonal highlights class-wise recall, while off-diagonal values indicate common misclassification patterns.
Abstract

Developing generalizable surgical AI requires multi-institutional data, yet patient privacy constraints preclude direct data sharing, making Federated Learning (FL) a natural candidate solution. The application of FL to complex, spatiotemporal surgical video data remains largely unbenchmarked. We present the FedSurg Challenge, the first international benchmarking initiative dedicated to FL in surgical vision, evaluated as a proof-of-concept on a multi-center laparoscopic appendectomy dataset (preliminary subset of Appendix300). Three submissions were evaluated on generalization to an unseen center and center-specific adaptation. Centralized and Swarm Learning baselines isolate the contributions of task difficulty and decentralization to observed performance. Even with all data pooled centrally, the task achieved only 26.31% F1-score on the unseen center, while decentralized training introduced an additional, separable performance penalty. Temporal modeling emerges as the dominant architectural factor: video-level spatiotemporal models consistently outperformed frame-level approaches regardless of aggregation strategy. Naive local fine-tuning leads to classifier collapse on imbalanced local data; structured personalized FL with parameter-efficient fine-tuning represents a more principled path toward center-specific adaptation. By characterizing current FL limitations through rigorous statistical analysis, this work establishes a methodological reference point for robust, privacy-preserving AI systems in surgical video analysis.
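For readers unfamiliar with what an "aggregation strategy" does, here is a minimal sketch of sample-size-weighted parameter averaging in the style of FedAvg (McMahan et al., reference [9]). The flat parameter lists, client sizes, and the `fedavg` helper are hypothetical, and this page does not confirm that any submission used exactly this rule.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighting each by its local sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Three hypothetical centers, each holding a two-parameter model after local training.
client_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
client_sizes = [10, 30, 60]  # hypothetical number of training videos per center

global_weights = fedavg(client_weights, client_sizes)
print(global_weights)
```

Centers with more data pull the global model further toward their local solution, so strong imbalance between centers can bias the aggregate; only parameters, not videos, leave each center.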

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the FedSurg EndoVis 2024 Challenge as the first benchmarking effort for federated learning in surgical vision, using a preliminary subset of the multi-center Appendix300 laparoscopic appendectomy dataset. Three submissions are evaluated for generalization to an unseen center and center-specific adaptation, with comparisons to centralized and swarm learning baselines. Key findings include a low centralized F1-score of 26.31% on the unseen center, an additional performance penalty from decentralization, the superiority of video-level spatiotemporal models over frame-level approaches irrespective of aggregation, and the risks of naive local fine-tuning leading to classifier collapse on imbalanced data.

Significance. If the empirical comparisons hold under more extensive validation, this establishes a useful reference point for privacy-preserving surgical AI by quantifying the inherent difficulty of the appendicitis classification task even in the centralized case and isolating the additional impact of decentralization. The observation that temporal modeling dominates across strategies and the contrast between naive fine-tuning and structured personalized approaches with parameter-efficient fine-tuning provide concrete directions for future work.

major comments (2)
  1. [Results / Abstract] The assertion that video-level spatiotemporal models 'consistently outperformed' frame-level approaches 'regardless of aggregation strategy' rests on results from only three submissions on a single unseen-center split of the preliminary Appendix300 subset; this sample size is too small to support a general architectural conclusion without additional submissions, cross-validation, or statistical tests for significance of the observed ordering.
  2. [Abstract / Methods] The claim of a 'separable' performance penalty from decentralized training beyond the centralized 26.31% F1 baseline lacks supporting details on whether the centralized and swarm baselines used identical architectures, hyperparameters, and data preprocessing as the submissions; without this, the isolation of decentralization effects from task difficulty cannot be verified.
minor comments (2)
  1. [Results] Include explicit statistical tests (e.g., paired t-tests or bootstrap confidence intervals) for all F1 comparisons and report the exact number of videos/frames in the preliminary subset and unseen center.
  2. [Methods] Clarify the precise definitions of 'video-level spatiotemporal models' versus 'frame-level approaches' and list the three submissions' architectures in a table for reproducibility.
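The bootstrap confidence intervals requested in the minor comments could be computed along the following lines. This is a sketch under stated assumptions: the prediction pairs are hypothetical, plain accuracy stands in for the metric to keep the block short, and the percentile method is one of several bootstrap variants; the paper's own procedure (10,000 repetitions according to the figure captions) is not reproduced here.

```python
import random

def accuracy(pairs):
    """Fraction of (true, predicted) pairs that agree."""
    return sum(t == p for t, p in pairs) / len(pairs)

def bootstrap_ci(pairs, metric, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` over resampled test cases."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        metric([pairs[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical test set: 14 correct and 6 incorrect predictions.
pairs = [(0, 0)] * 14 + [(1, 0)] * 6
point = accuracy(pairs)
lo, hi = bootstrap_ci(pairs, accuracy, n_boot=2000)
print(f"accuracy={point:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Any per-video metric, including an F1 function, can be passed as `metric`. With only a few dozen test videos the interval will be wide, which is the practical reason to report it alongside the point estimate.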

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript on the FedSurg EndoVis 2024 Challenge. We address each major comment below in a point-by-point manner and indicate the revisions made to strengthen the paper.

  1. Referee: The assertion that video-level spatiotemporal models 'consistently outperformed' frame-level approaches 'regardless of aggregation strategy' rests on results from only three submissions on a single unseen-center split of the preliminary Appendix300 subset; this sample size is too small to support a general architectural conclusion without additional submissions, cross-validation, or statistical tests for significance of the observed ordering.

    Authors: We agree that the small number of submissions limits the generalizability of this observation. The challenge received only three valid submissions, and all results are reported on a single fixed unseen-center split of the preliminary dataset. In the revised manuscript we have softened the language in the abstract and results to state that spatiotemporal models outperformed frame-level approaches among the submitted methods, rather than claiming a general architectural principle. We have added an explicit limitations section noting the preliminary nature of the finding, the absence of cross-validation or significance testing due to the challenge format, and the need for future challenges with larger numbers of participants to confirm the pattern. Raw per-submission scores are already provided so readers can evaluate consistency directly. revision: yes

  2. Referee: The claim of a 'separable' performance penalty from decentralized training beyond the centralized 26.31% F1 baseline lacks supporting details on whether the centralized and swarm baselines used identical architectures, hyperparameters, and data preprocessing as the submissions; without this, the isolation of decentralization effects from task difficulty cannot be verified.

    Authors: We have revised the Methods section to provide the requested details. The centralized and swarm baselines were run on the identical data splits, preprocessing pipeline (including video sampling, normalization, and augmentation), and evaluation protocol as the federated submissions. Where architectures overlapped with submitted methods we reused the same backbone and hyperparameters; otherwise we selected representative models matched as closely as possible to the challenge task. A new supplementary table now lists the exact configuration for each baseline to make the isolation of decentralization effects transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical challenge results are self-contained


The paper reports direct empirical F1-scores and performance comparisons from three challenge submissions plus centralized (26.31% on unseen center) and Swarm baselines on a preliminary Appendix300 subset. Claims about temporal-model dominance and separable decentralization penalty rest on these held-out evaluations and standard FL aggregation, with no equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the reported outcomes to inputs by construction. The experimental design isolates task difficulty from decentralization without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard definitions of F1-score, generalization to held-out centers, and the assumption that the chosen dataset split reflects real multi-center variability; no free parameters or invented entities are introduced.

axioms (2)
  • standard math: Standard definitions of precision, recall, and F1-score for multi-class classification.
    Used to quantify performance on the unseen center.
  • domain assumption: The preliminary subset of Appendix300 is representative of multi-center laparoscopic appendectomy video distributions.
    Invoked when interpreting generalization results.
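
For reference, the standard per-class definitions behind the first axiom can be written as below, where $TP_c$, $FP_c$, and $FN_c$ are the true-positive, false-positive, and false-negative counts for class $c$ out of $C$ classes. The macro-averaged form in the second line is an assumption; this page does not state which averaging the paper applies.

```latex
\[
P_c = \frac{TP_c}{TP_c + FP_c}, \qquad
R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F1_c = \frac{2\,P_c R_c}{P_c + R_c} = \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c}
\]
\[
F1_{\text{macro}} = \frac{1}{C} \sum_{c=1}^{C} F1_c
\]
```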




Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 3 internal anchors

  1. [1]

    L. Maier-Hein, S. Vedula, S. Speidel, N. Navab, R. Kikinis, A. Park, M. Eisenmann, H. Feussner, G. Forestier, S. Giannarou, M. Hashizume, D. Katić, H. Kenngott, M. Kranzfelder, A. Malpani, K. März, T. Neumuth, N. Padoy, C. Pugh, P. Jannin, Surgical data science for next-generation interventions, Nature Biomedical Engineering 1 (Sep. 2017). doi:10.1038/s4...

  2. [2]

    J. M. Brandenburg, A. C. Jenke, A. Stern, M. T. J. Daum, A. Schulze, R. Younis, P. Petrynowski, T. Davitashvili, V. Vanat, N. Bhasker, S. Schneider, L. Mündermann, A. Reinke, F. R. Kolbinger, V. Jörns, F. Fritz-Kebede, M. Dugas, L. Maier-Hein, R. Klotz, M. Distler, J. Weitz, B. P. Müller-Stich, S. Speidel, S. Bodenstedt, M. Wagner, Active learning for ...

  3. [3]

    L. Maier-Hein, M. Eisenmann, D. Sarikaya, K. März, T. Collins, A. Malpani, J. Fallert, H. Feussner, S. Giannarou, P. Mascagni, H. Nakawala, A. Park, C. Pugh, D. Stoyanov, S. S. Vedula, K. Cleary, G. Fichtinger, G. Forestier, B. Gibaud, T. Grantcharov, M. Hashizume, D. Heckmann-Nötzel, H. G. Kenngott, R. Kikinis, L. Mündermann, N. Navab, S. Onogur, T. Roß,...

  4. [4]

    M. Carstens, S. Vasisht, Z. Zhang, I. Barbur, A. Reinke, L. Maier-Hein, D. A. Hashimoto, F. R. Kolbinger, Artificial intelligence for surgical scene understanding: A systematic review and reporting quality meta-analysis, ISSN 3067-2007 (2025). doi:10.1101/2025.07.12.25330122. URL https://www.medrxiv.org/content/10.1101/2025.07.12...

  5. [5]

    K. Kirtac, N. Aydin, J. Lavanchy, G. Beldi, M. Smit, M. Woods, F. Aspart, Surgical Phase Recognition: From Public Datasets to Real-World Data, Applied Sciences 12 (2022) 8746. doi:10.3390/app12178746

  6. [6]

    J. L. Lavanchy, S. Ramesh, D. Dall’Alba, C. Gonzalez, P. Fiorini, B. Muller-Stich, P. C. Nett, J. Marescaux, D. Mutter, N. Padoy, Challenges in Multi-centric Generalization: Phase and Step Recognition in Roux-en-Y Gastric Bypass Surgery, arXiv:2312.11250 [cs] (Dec. 2023). doi:10.48550/arXiv.2312.11250. URL http://arxiv.org/abs/2312.11250

  7. [7]

    Office for Civil Rights (OCR), Health information privacy, last modified 2025-06-27 (2021). URL https://www.hhs.gov/hipaa/index.html

  8. [8]

    General data protection regulation (GDPR) – legal text (2016). URL https://gdpr-info.eu/

  9. [9]

    B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. y Arcas, Communication-Efficient Learning of Deep Networks from Decentralized Data, in: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR, 2017, pp. 1273–1282, ISSN 2640-3498. URL https://proceedings.mlr.press/v54/mcmahan17a.html

  10. [10]

    D. Yin, Y. Chen, R. Kannan, P. Bartlett, Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates, in: Proceedings of the 35th International Conference on Machine Learning, PMLR, 2018, pp. 5650–5659, ISSN 2640-3498. URL https://proceedings.mlr.press/v80/yin18a.html

  11. [11]

    N. Rieke, J. Hancox, W. Li, F. Milletarì, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein, S. Ourselin, M. Sheller, R. M. Summers, A. Trask, D. Xu, M. Baust, M. J. Cardoso, The future of digital health with federated learning, npj Digital Medicine 3 (1) (2020) 119. doi:10.1038/s41746-020...

  12. [12]

    T. Li, A. K. Sahu, A. Talwalkar, V. Smith, Federated Learning: Challenges, Methods, and Future Directions, IEEE Signal Processing Magazine 37 (3) (2020) 50–60. doi:10.1109/MSP.2020.2975749. URL https://ieeexplore.ieee.org/document/9084352

  13. [13]

    P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, H. Eichner, S. E. Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak... arXiv preprint arXiv:1912.04977 (2019)

  14. [14]

    A. Rauniyar, D. H. Hagos, D. Jha, J. E. Håkegård, U. Bagci, D. B. Rawat, V. Vlassov, Federated learning for medical applications: A taxonomy, current trends, challenges, and future research directions (2023). arXiv:2208.03392 [cs], doi:10.48550/arXiv.2208.03392. URL http://arxiv.org/abs/2208.03392

  15. [15]

    A. Z. Tan, H. Yu, L. Cui, Q. Yang, Towards personalized federated learning, IEEE Transactions on Neural Networks and Learning Systems 34 (12) (2023) 9587–9603. doi:10.1109/TNNLS.2022.3160699

  16. [16]

    T. Li, M. Sanjabi, A. Beirami, V. Smith, Fair resource allocation in federated learning (2020). arXiv:1905.10497 [cs], doi:10.48550/arXiv.1905.10497. URL http://arxiv.org/abs/1905.10497

  17. [17]

    H. Kassem, D. Alapatt, P. Mascagni, C. AI4SafeChole, A. Karargyris, N. Padoy, Federated cycling (FedCy): Semi-supervised federated learning of surgical phases, IEEE Transactions on Medical Imaging (2022) 1–1. doi:10.1109/TMI.2022.3222126

  18. [18]

    M. Kirchner, A. C. Jenke, S. Bodenstedt, F. R. Kolbinger, O. L. Saldanha, J. N. Kather, M. Wagner, S. Speidel, Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections, arXiv:2504.16612 [cs] (May 2025). doi:10.48550/arXiv.2504.16612. URL http://arxiv.org/abs/2504.16612

  19. [19]

    Y. Li, S. S. Kundu, M. Boels, T. Mahmoodi, S. Ourselin, T. Vercauteren, P. Dasgupta, J. Shapey, A. Granados, UltraFlwr – an efficient federated medical and surgical object detection framework (2025). arXiv:2503.15161 [cs], doi:10.48550/arXiv.2503.15161. URL http://arxiv.org/abs/2503.15161

  20. [20]

    S. Speidel, L. Maier-Hein, D. Stoyanov, S. Bodenstedt, A. Reinke, S. Bano, Endoscopic Vision Challenge – A MICCAI Challenge. URL https://opencas.dkfz.de/endovis/

  21. [21]

    F. R. Kolbinger, M. Kirchner, K. Pfeiffer, S. Bodenstedt, A. C. Jenke, J. Barthel, M. R. Carstens, K. Dehlke, S. Dietz, S. Emmanouilidis, G. Fitze, L. Leitermann, S. T. Mees, S. Pistorius, C. Prudlo, A. Seiberth, J. Schultz, K. Thiel, D. Ziehn, S. Speidel, J. Weitz, J. N. Kather, M. Distler, O. L. Saldanha, Appendix300: A multi-institutional laparosco...

  22. [22]

    O. L. Saldanha, K. Pfeiffer, S. Bodenstedt, M. Kirchner, A. C. Jenke, C. Barata, S. Barbosa, J. Barthel, M. Carstens, L. T. Castro, K. Dehlke, S. Dietz, S. Emmanouilidis, G. Fitze, M. Freitag, F. Holderried, W. Kanjo, L. Leitermann, S. T. Mees, A. S. Soares, M. Pascoal, S. Pistorius, C. Prudlo, J. Schultz, A. Seiberth, K. Thiel, X. Wu, D. Ziehn, S. Spei...

  23. [23]

    L. Maier-Hein, A. Reinke, M. Kozubek, A. L. Martel, T. Arbel, M. Eisenmann, A. Hanbury, P. Jannin, H. Müller, S. Onogur, J. Saez-Rodriguez, B. van Ginneken, A. Kopp-Schneider, B. A. Landman, BIAS: Transparent reporting of biomedical image analysis challenges, Medical Image Analysis 66 (2020) 101796. doi:10.1016/j.media.2020.101796. URL https://www.scienc...

  24. [24]

    C. A. Gomes, T. A. Nunes, J. M. Fonseca Chebli, C. S. Junior, C. C. Gomes, Laparoscopy grading system of acute appendicitis: new insight for future trials, Surgical Laparoscopy, Endoscopy & Percutaneous Techniques 22 (5) (2012) 463–466. doi:10.1097/SLE.0b013e318262edf1

  25. [25]

    S. Tomar, Converting video formats with ffmpeg, Linux Journal 2006 (146) (2006) 10

  26. [26]

    L. R. Dice, Measures of the amount of ecologic association between species, Ecology 26 (3) (1945) 297–302. doi:10.2307/1932409. URL https://onlinelibrary.wiley.com/doi/abs/10.2307/1932409

  27. [27]

    L. Maier-Hein, A. Reinke, P. Godau, M. D. Tizabi, F. Buettner, E. Christodoulou, B. Glocker, F. Isensee, J. Kleesiek, M. Kozubek, M. Reyes, M. A. Riegler, M. Wiesenfarth, A. E. Kavur, C. H. Sudre, M. Baumgartner, M. Eisenmann, D. Heckmann-Nötzel, T. Rädsch, L. Acion, M. Antonelli, T. Arbel, S. Bakas, A. Benis, M. B. Blaschko, M. J. Cardoso, V. Cheplygi...

  28. [28]

    T. Hastie, R. Tibshirani, J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction, Vol. 2, Springer, 2009

  29. [29]

    C. Elkan, The foundations of cost-sensitive learning, in: Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI), 2001, pp. 973–978

  30. [30]

    B. Efron, Bootstrap methods: another look at the jackknife, in: Breakthroughs in statistics: Methodology and distribution, Springer, 1992, pp. 569–593

  31. [31]

    L. Maier-Hein, M. Eisenmann, A. Reinke, S. Onogur, M. Stankovic, P. Scholz, T. Arbel, H. Bogunovic, A. P. Bradley, A. Carass, C. Feldmann, A. F. Frangi, P. M. Full, B. van Ginneken, A. Hanbury, K. Honauer, M. Kozubek, B. A. Landman, K. März, O. Maier, K. Maier-Hein, B. H. Menze, H. Müller, P. F. Neher, W. Niessen, N. Rajpoot, G. C. Sharp, K. Sirinukunwa... Commun. 9 (2018). doi:10.1038/s41467-018-07619-7

  32. [32]

    A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, C. Schmid, ViViT: A Video Vision Transformer, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 6836–6846. URL https://openaccess.thecvf.com/content/ICCV2021/html/Arnab_ViViT_A_Video_Vision_Transformer_ICCV_2021_paper.html?ref=https://githubhelp.com

  33. [33]

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, arXiv:2010.11929 [cs] (Jun. 2021). doi:10.48550/arXiv.2010.11929. URL http://arxiv.org/abs/2010.11929

  34. [34]

    D. J. Beutel, T. Topal, A. Mathur, X. Qiu, J. Fernandez-Marques, Y. Gao, L. Sani, K. H. Li, T. Parcollet, P. P. B. d. Gusmão, N. D. Lane, Flower: A Friendly Federated Learning Research Framework, arXiv:2007.14390 [cs] (Mar. 2022). doi:10.48550/arXiv.2007.14390. URL http://arxiv.org/abs/2007.14390

  35. [35]

    D. Batić, F. Holm, E. Özsoy, T. Czempiel, N. Navab, EndoViT: pretraining vision transformers on a large collection of endoscopic images, International Journal of Computer Assisted Radiology and Surgery 19 (6) (2024) 1085–1091. doi:10.1007/s11548-024-03091-5. URL https://doi.org/10.1007/s11548-024-03091-5

  36. [36]

    S. Yang, F. Zhou, L. Mayer, F. Huang, Y. Chen, Y. Wang, S. He, Y. Nie, X. Wang, Ö. Sümer, Y. Jin, H. Sun, S. Xu, A. Q. Liu, Z. Li, J. Qin, J. Y. Teoh, L. Maier-Hein, H. Chen, Large-scale Self-supervised Video Foundation Model for Intelligent Surgery, arXiv:2506.02692 [cs] (Jun. 2025). doi:10.48550/arXiv.2506.02692. URL http://arxiv.org/abs/2506.02692

  37. [37]

    S. Schmidgall, J. W. Kim, J. Jopling, A. Krieger, General surgery vision transformer: A video pre-trained foundation model for general surgery, arXiv:2403.05949 [cs] (Apr. 2024). doi:10.48550/arXiv.2403.05949. URL http://arxiv.org/abs/2403.05949

  38. [38]

    D. Caldarola, B. Caputo, M. Ciccone, Improving Generalization in Federated Learning by Seeking Flat Minima, in: S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, T. Hassner (Eds.), Computer Vision – ECCV 2022, Springer Nature Switzerland, Cham, 2022, pp. 654–672. doi:10.1007/978-3-031-20050-2_38

  39. [39]

    P. Foret, A. Kleiner, H. Mobahi, B. Neyshabur, Sharpness-Aware Minimization for Efficiently Improving Generalization, arXiv:2010.01412 [cs] (Apr. 2021). doi:10.48550/arXiv.2010.01412. URL http://arxiv.org/abs/2010.01412

  40. [40]

    S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, H. B. McMahan, Adaptive federated optimization, arXiv:2003.00295 [cs], version 5. doi:10.48550/arXiv.2003.00295. URL http://arxiv.org/abs/2003.00295

  41. [41]

    K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, arXiv:1512.03385 [cs] (Dec. 2015). doi:10.48550/arXiv.1512.03385. URL http://arxiv.org/abs/1512.03385

  42. [42]

    J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature Verification using a "Siamese" Time Delay Neural Network, in: Advances in Neural Information Processing Systems, Vol. 6, Morgan-Kaufmann, 1993. URL https://proceedings.neurips.cc/paper/1993/hash/288cc0ff022877bd3df94bc9360b9c5d-Abstract.html

  43. [43]

    P. Luo, R. Zhang, J. Ren, Z. Peng, J. Li, Switchable Normalization for Learning-to-Normalize Deep Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2) (2021) 712–728. doi:10.1109/TPAMI.2019.2932062. URL https://ieeexplore.ieee.org/abstract/document/8781758