pith. machine review for the scientific record.

arxiv: 2605.06820 · v1 · submitted 2026-05-07 · ⚛️ physics.med-ph · cs.AI

Overcoming data scarcity through multi-center federated learning for organs-at-risk segmentation in pediatric upper abdominal radiotherapy

Pith reviewed 2026-05-11 00:58 UTC · model grok-4.3

classification ⚛️ physics.med-ph cs.AI
keywords federated learning · organs at risk segmentation · pediatric radiotherapy · CT imaging · multi-center collaboration · nnU-Net · abdominal tumors

The pith

Federated learning across two centers yields OAR segmentation models for pediatric upper abdominal tumors that match local accuracy while improving cross-center robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pediatric radiotherapy planning needs precise CT segmentation of organs at risk, yet small and scattered patient cohorts make single-center models brittle when applied elsewhere. The work tests a real-world federated learning setup that lets two hospitals exchange only model weights rather than raw scans. Local nnU-Net models trained at each site perform well inside their own data but lose accuracy on the other center's cases for four to seven of nine evaluated structures. The federated model closes most of that gap, matches or exceeds local performance on in-center tests, and records the highest cross-center scores on Dice, Hausdorff, and surface distance metrics.

Core claim

Using 310 postoperative CT scans from 272 patients at Utrecht and Heidelberg, the authors show that a federated nnU-Net trained with secure weight exchange produces a single model whose cross-center Dice scores exceed those of either local model by 0.003 to 0.007 while preserving in-center accuracy for at least seven of nine organs at risk and reducing false-positive kidney labels.

What carries the argument

nnU-Net framework adapted for federated learning via secure weight exchange on cloud storage across institutional firewalls
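
As a rough illustration of what one weight-exchange round might involve, the sketch below averages per-site parameters in proportion to local sample counts (FedAvg-style). The aggregation rule, the `federated_average` helper, the flat parameter layout, and the toy counts are all assumptions made for this example; the abstract does not state which aggregation method the authors used.

```python
from typing import Dict, List

# Hypothetical flat layout: parameter name -> list of values.
Weights = Dict[str, List[float]]


def federated_average(site_weights: List[Weights], site_sizes: List[int]) -> Weights:
    """Sample-size-weighted average of per-site parameters (FedAvg-style)."""
    if not site_weights or len(site_weights) != len(site_sizes):
        raise ValueError("need exactly one sample count per site")
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("sample counts must sum to a positive number")
    averaged: Weights = {}
    for name in site_weights[0]:
        length = len(site_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(site_weights, site_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Toy round with two sites; weights and counts are made up for illustration.
site_a = {"conv.weight": [0.0, 2.0], "conv.bias": [1.0]}
site_b = {"conv.weight": [4.0, 2.0], "conv.bias": [3.0]}
global_weights = federated_average([site_a, site_b], [3, 1])
```

With these toy inputs the averaged `conv.weight` is `[1.0, 2.0]` and `conv.bias` is `[1.5]`. In a scheme of this kind only the parameter dictionaries leave each site; the scans themselves stay local.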

If this is right

  • The federated model maintains stable performance when patient orientation varies.
  • It reduces false-positive segmentations of kidneys that have been surgically removed.
  • It delivers the best cross-center results across Dice, 95th-percentile Hausdorff, and mean surface distance for the nine evaluated OARs.
  • It matches local-model performance for at least seven of the nine OARs on each center's own data.
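
For readers less familiar with the headline metric, the following minimal sketch computes the Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The `dice_coefficient` helper and the toy masks are illustrative, and the handling of two empty masks is one convention among several, assumed here.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice similarity coefficient for two flattened binary masks (1 = organ voxel)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if size_sum == 0:
        # Both masks empty: assumed convention is to score this as perfect agreement.
        return 1.0
    return 2.0 * overlap / size_sum


# Toy masks, made up for illustration.
truth_mask = [0, 1, 1, 1, 0, 0]
pred_mask = [0, 1, 1, 0, 1, 0]
score = dice_coefficient(pred_mask, truth_mask)  # 2 * 2 / (3 + 3)

# A kidney that was surgically removed has an empty reference mask,
# so any predicted kidney voxel drives the score to zero.
removed_kidney_score = dice_coefficient([1, 0, 0], [0, 0, 0])
```

Against this scale, the reported cross-center gains of 0.003 to 0.007 are changes in the third decimal place of a score bounded by 0 and 1.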

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could scale to additional centers to further reduce domain shift.
  • Similar federated pipelines may help other pediatric imaging tasks limited by small per-site cohorts.
  • The modest Dice gains may grow in clinical value once models are deployed on larger multi-center test sets.

Load-bearing premise

The two-center dataset and the specific nnU-Net federated implementation are representative enough that the small observed gains will appear at other pediatric centers and scanner protocols.

What would settle it

Evaluation of the same federated model on CT data from a third independent pediatric center with different scanners and protocols, checking whether it still matches or beats a newly trained local model there.

Figures

Figures reproduced from arXiv: 2605.06820 by Annemieke S. Littooij, Geert O. Janssens, Jens-Peter Schenk, Marry M. van den Heuvel-Eibrink, Martine van Grotel, Matteo Maspero, Maximilian Knoll, Max van Noesel, Mianyong Ding, Semi Harrabi.

Figure 1: The study's workflow included data preparation, local training, and deployment of federated learning between UTR and HEI via cloud storage (SURFdrive) as an intermediate layer. Before training, the UTR and HEI cohorts generated dataset fingerprints locally from their own data and sent them to the central server to aggregate them and generate a global plan; the server also initialized the model based on thi…
Figure 2: Comparison of demographic characteristics, imaging parameters, and image statistics between the UTR and HEI cohorts. Panels show (A) gender distribution, (B) tumor type, (C) age distribution, (D) IV contrast usage, (E) scanner model, (F) patient orientation, (G) in-plane pixel spacing, (H) slice thickness, and (I) OAR foreground CT attenuation values distribution (HU). All summarized information for each c…
Figure 4: Three CTs from two patients in which the federated model failed to predict the left kidney (DSC = 0). The first row shows a CT from one patient acquired in the left lateral (left side-lying) position. The second and third rows show CTs from a second patient acquired in the supine position, without and with intravenous (IV) contrast, respectively. All images are displayed in their original patient orientati…
Figure 5: Performance of local models (Model_UTR and Model_HEI) and the federated model (Model_FL) on simulated CTs with different patient orientations (0° supine, 90° left lateral, 180° right lateral, and 270° prone).
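
The orientation test in Figure 5 relies on simulated CTs at four angles. How the authors generated them is not described on this page; one plausible approach, sketched below purely as an assumption, is to rotate each axial slice (and its label mask) in 90° steps. The `rotate` helper, the clockwise direction, and the toy slice are hypothetical.

```python
from typing import List

Slice = List[List[int]]


def rotate_90_clockwise(image: Slice) -> Slice:
    """Rotate a 2D slice by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotate(image: Slice, degrees: int) -> Slice:
    """Rotate a 2D slice clockwise by a multiple of 90 degrees."""
    if degrees % 90 != 0:
        raise ValueError("only multiples of 90 degrees are supported")
    result = [list(row) for row in image]
    for _ in range((degrees // 90) % 4):
        result = rotate_90_clockwise(result)
    return result


# Toy 2x2 "slice"; a real axial CT slice would be e.g. 512x512.
axial = [[1, 2],
         [3, 4]]
orientations = {angle: rotate(axial, angle) for angle in (0, 90, 180, 270)}
```

Applying the same rotation to the reference masks keeps image and label aligned, so the segmentation metrics can be recomputed per angle.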
Original abstract

Deep learning-based organs/structures-at-risk (OARs) auto-contouring models can improve radiotherapy workflows, but models trained on adult data often underperform in pediatric patients. Developing robust pediatric-specific models is hindered by data scarcity and fragmentation across centers. Federated learning (FL) enables privacy-preserving collaborative training without the need for data sharing. We evaluated the feasibility and performance of FL for developing pediatric-specific OAR segmentation models across two European medical centers. Computed tomography (CT) images from pediatric patients from Utrecht and Heidelberg with a renal tumor or abdominal neuroblastoma were retrospectively collected and locally processed. An nnU-Net-based framework segmented 19 OARs using local and FL schemes. FL was implemented with secure weight exchange via cloud storage across institutional firewalls. Performance was assessed using the Dice similarity coefficient (DSC), 95th percentile Hausdorff distance, and mean surface distance. Robustness to patient orientation, false-positive segmentation of surgically removed kidneys, and failure cases were also examined. A total of 310 postoperative CTs from 272 patients (105 renal tumors, 167 neuroblastomas) were included. Local models performed well on their respective center data but showed significantly reduced cross-center performance for four to seven of the nine evaluated OARs (DSC). In contrast, the FL model matched local performance for at least seven of nine OARs and achieved the best cross-center results across three metrics, with DSC gains of 0.003-0.007 over local models. FL also maintained stable performance across patient orientations and reduced false-positive kidney segmentations. Real-world FL improves cross-center robustness of CT-based OAR segmentation models in pediatric upper abdominal tumors.
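
The two distance metrics named in the abstract can be sketched on small surface point sets as follows. This is a minimal illustration under stated assumptions: HD95 is taken as the larger of the two directed 95th-percentile distances and the mean surface distance as the pooled mean of both directed distance sets, which is one common convention and not necessarily the one the authors used. The helper names and coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def directed_distances(source: Sequence[Point], target: Sequence[Point]) -> List[float]:
    """Distance from each source point to its nearest target point."""
    if not source or not target:
        raise ValueError("both surfaces need at least one point")
    return [min(math.dist(s, t) for t in target) for s in source]


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def hd95(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric 95th-percentile Hausdorff distance (assumed convention)."""
    return max(percentile(directed_distances(a, b), 95),
               percentile(directed_distances(b, a), 95))


def mean_surface_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Pooled mean of both directed nearest-neighbour distance sets."""
    pooled = directed_distances(a, b) + directed_distances(b, a)
    return sum(pooled) / len(pooled)


# Toy surface points in millimetres, made up for illustration.
pred_surface = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
true_surface = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
hd95_mm = hd95(pred_surface, true_surface)
msd_mm = mean_surface_distance(pred_surface, true_surface)
```

Unlike DSC, both values are in physical units and lower is better, so a single outlying contour point affects HD95 far more than it affects the Dice score.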

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates federated learning (FL) with an nnU-Net framework for segmenting 19 organs-at-risk (OARs) on postoperative CT scans from 272 pediatric patients (310 scans total) with renal tumors or abdominal neuroblastomas across two centers (Utrecht and Heidelberg). Local models are shown to degrade on cross-center data for 4-7 of 9 evaluated OARs, while the FL model (implemented via secure weight exchange over cloud storage) matches local performance on at least 7/9 OARs and yields small DSC gains of 0.003-0.007 on cross-center tests, with additional checks for orientation robustness and false-positive kidney segmentations.

Significance. If the small observed gains prove robust, the work demonstrates practical feasibility of real-world FL for improving cross-center generalization in a privacy-sensitive, data-scarce pediatric radiotherapy setting without requiring data sharing, which could support multi-center model development.

major comments (2)
  1. [Abstract and Results] The central claim of improved cross-center robustness rests on DSC gains of only 0.003-0.007 with no statistical tests, confidence intervals, or patient-level variability reported, leaving unclear whether these differences are significant or clinically meaningful.
  2. [Methods and Results] The evaluation uses data from only two centers (Utrecht and Heidelberg); the generalization claim for FL robustness would require at least one additional independent center or external validation set to rule out site-specific similarities as the source of the observed gains.
minor comments (2)
  1. [Abstract] The abstract states 19 OARs are segmented but only 9 are evaluated in cross-center tests; clarify the selection criteria and list the specific OARs.
  2. [Results] The Results section lacks details on the exact nnU-Net FL implementation (e.g., aggregation method, number of communication rounds) and any hyperparameter differences between local and FL runs.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed review of our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.

  1. Referee: [Abstract and Results] The central claim of improved cross-center robustness rests on DSC gains of only 0.003-0.007 with no statistical tests, confidence intervals, or patient-level variability reported, leaving unclear whether these differences are significant or clinically meaningful.

    Authors: We acknowledge that the reported DSC gains are modest and that the absence of statistical testing and variability measures limits interpretation of their significance. In the revised manuscript we will add 95% confidence intervals for all DSC, HD95 and MSD values and differences; report per-patient standard deviations and ranges to illustrate variability; and include results of paired non-parametric statistical tests (Wilcoxon signed-rank test with Bonferroni correction) comparing the federated model against each local model on cross-center test sets. We will also expand the discussion to address the potential clinical relevance of these small but consistent gains in a pediatric radiotherapy context where even minor reductions in manual contouring effort are valuable. revision: yes

  2. Referee: [Methods and Results] The evaluation uses data from only two centers (Utrecht and Heidelberg); the generalization claim for FL robustness would require at least one additional independent center or external validation set to rule out site-specific similarities as the source of the observed gains.

    Authors: We agree that claims of broad generalization are not supported by a two-center design. The two participating sites differ in scanner vendors, acquisition protocols, and patient demographics, providing a non-trivial test of cross-center performance; however, we cannot exclude the possibility that unobserved site-specific factors contribute to the observed results. In the revision we will (i) rephrase the abstract, results and conclusions to state that improved robustness is demonstrated between these two specific centers, (ii) add an explicit limitations section discussing the two-center scope and the risk of site-specific similarities, and (iii) include a supplementary analysis of inter-center imaging differences. We do not have access to data from additional centers under current approvals. revision: partial
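
The paired comparison promised in the first response (Wilcoxon signed-rank test with Bonferroni correction) could look roughly like the sketch below. It is a simplified standard-library version using the normal approximation without continuity correction, written only to show the mechanics; the helper names and the per-patient DSC values are made up, and a real analysis would more likely rely on an established statistics package such as SciPy.

```python
import math
from typing import List, Sequence


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided p-value for paired samples (normal approximation).

    Zero differences are dropped and tied absolute differences share an
    average rank. Ties are detected by exact float equality, which is
    adequate for a sketch but fragile on real data.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0  # average rank of the tie group
        group = j - i + 1
        tie_term += group ** 3 - group
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    variance = n * (n + 1) * (2 * n + 1) / 24.0 - tie_term / 48.0
    if variance == 0:
        return 1.0
    z = (w_plus - mean) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Synthetic per-patient DSC values for one organ, not taken from the paper.
dsc_fl = [0.90, 0.85, 0.92, 0.88, 0.95, 0.80, 0.91, 0.87]
dsc_local = [0.88, 0.86, 0.89, 0.84, 0.90, 0.86, 0.84, 0.79]
p_value = wilcoxon_signed_rank(dsc_fl, dsc_local)

# With nine evaluated OARs there would be nine such tests to correct for;
# the other eight p-values here are placeholders.
adjusted = bonferroni([p_value] + [0.5] * 8)
```

Because the correction multiplies each p-value by the number of organs tested, gains in the third decimal place of DSC would need very consistent per-patient differences to remain significant.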

standing simulated objections not resolved
  • Requirement for at least one additional independent center or external validation set, as the study is restricted to the two centers from which retrospective data were available and obtaining further multi-center data would require new ethical approvals and collaborations outside the present work.

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of local vs federated training on held-out data

The manuscript presents an empirical evaluation of nnU-Net models trained locally versus via federated learning on a two-center dataset of 310 CT scans. Performance is measured directly via DSC, Hausdorff distance, and surface distance on cross-center held-out cases, with no equations, fitted parameters, or derivations invoked. The reported DSC gains of 0.003-0.007 are computed outputs from the experiments rather than predictions forced by any self-referential construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the abstract or described methods; the central claim rests on observable metric differences, not on renaming or re-deriving inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract only; the work rests on standard assumptions of deep-learning medical segmentation (nnU-Net architecture suitability, Dice coefficient as primary metric) and the premise that secure weight exchange preserves privacy while enabling useful collaboration.

axioms (2)
  • domain assumption: the nnU-Net framework is an appropriate base model for multi-organ CT segmentation.
    Invoked implicitly by the choice of segmentation method.
  • domain assumption: federated averaging of model weights produces a model that generalizes across institutional data distributions.
    Central premise of the FL experiment.

pith-pipeline@v0.9.0 · 5660 in / 1344 out tokens · 40014 ms · 2026-05-11T00:58:16.837270+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
