pith. machine review for the scientific record.

arxiv: 2605.09025 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.LG

Recognition: 2 Lean theorem links

MedFL-Stress: A Systematic Robustness Evaluation of Federated Brain Tumor Segmentation under Cross-Hospital MRI Appearance Shift

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:18 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords federated learning · brain tumor segmentation · MRI appearance shift · robustness evaluation · FedBN · worst-case performance · multi-site medical imaging

The pith

FedBN reduces the gap between best and worst hospitals from 0.0850 to 0.0503 Dice in federated brain tumor segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that average performance metrics in federated learning can hide large failures at individual hospitals, creating safety risks in clinical use. By distributing BraTS 2020 slices across four simulated hospitals and applying controlled MRI appearance shifts, the evaluation reveals that FedAvg produces high mean accuracy but a substantial disparity across sites. FedBN narrows this disparity by 41 percent with only a tiny drop in overall Dice score and lifts the weakest hospital by 3.5 points. This demonstrates why robustness-oriented protocols that track worst-case and disparity metrics are necessary for reliable multi-site medical imaging deployments.

Core claim

Using worst-hospital Dice and inter-hospital disparity as primary metrics rather than mean accuracy, FedBN closes the performance gap between hospitals by 41 percent (0.0850 to 0.0503) while reducing mean Dice only from 0.8159 to 0.8109 and raising the weakest hospital from 0.7309 to 0.7656.
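The headline numbers are internally consistent; a quick arithmetic check using only the figures quoted above:

```python
# Sanity check on the core claim's reported numbers.
gap_fedavg, gap_fedbn = 0.0850, 0.0503      # best-worst Dice gap
mean_fedavg, mean_fedbn = 0.8159, 0.8109    # global mean Dice
worst_fedavg, worst_fedbn = 0.7309, 0.7656  # weakest-hospital Dice

gap_reduction = (gap_fedavg - gap_fedbn) / gap_fedavg
print(f"gap reduction:   {gap_reduction:.1%}")               # ~40.8%, reported as 41%
print(f"mean Dice cost:  {mean_fedavg - mean_fedbn:.4f}")    # 0.0050, half a Dice point
print(f"worst-site gain: {worst_fedbn - worst_fedavg:.4f}")  # 0.0347, ~3.5 Dice points
```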

What carries the argument

MedFL-Stress, a controlled stress-testing framework that distributes 2D BraTS 2020 axial slices across four simulated hospital clients, applies graded MRI appearance shifts (gamma contrast, scale-shift, noise-plus-blur), and evaluates federated methods with worst-hospital Dice and disparity as primary outcomes.
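The three shift families can be sketched as simple intensity transforms on a normalized 2D slice. The abstract does not give exact parameter ranges (the referee's minor comment notes this), so the values and kernel choice below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_contrast(img, gamma):
    """Nonlinear contrast change; img assumed normalized to [0, 1]."""
    return np.clip(img, 0.0, 1.0) ** gamma

def scale_shift(img, scale, shift):
    """Linear intensity remapping, as from scanner calibration drift."""
    return img * scale + shift

def noise_plus_blur(img, sigma_noise, blur_kernel=3):
    """Additive Gaussian noise followed by a separable box blur."""
    noisy = img + rng.normal(0.0, sigma_noise, img.shape)
    kernel = np.ones(blur_kernel) / blur_kernel
    blurred = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 0, noisy)
    return np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 1, blurred)

# One simulated "hospital": a fixed appearance shift applied to every
# local slice, so heterogeneity is client-level, not image-level.
slice_2d = rng.random((64, 64))
shifted = noise_plus_blur(gamma_contrast(slice_2d, gamma=1.4), sigma_noise=0.05)
```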

If this is right

  • Evaluation protocols for federated medical imaging must treat worst-site performance and inter-site disparity as primary metrics instead of reporting only global averages.
  • FedBN offers a practical way to improve equity across hospitals without meaningful loss in average segmentation accuracy.
  • Deployment decisions for privacy-preserving models should include explicit testing under scanner and acquisition variability to avoid hidden site-specific failures.
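The disparity-aware metrics the protocol advocates are straightforward to compute once per-hospital Dice scores exist. A minimal sketch (the per-hospital scores below are hypothetical, shaped only loosely like the paper's FedAvg result):

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice coefficient between two binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def site_report(per_site_dice):
    """Robustness summary: the mean hides what worst-site and disparity expose."""
    scores = np.asarray(list(per_site_dice.values()))
    return {
        "mean": scores.mean(),
        "worst_site": scores.min(),
        "disparity": scores.max() - scores.min(),  # best-worst gap
    }

# Hypothetical per-hospital scores for illustration.
report = site_report({"H1": 0.816, "H2": 0.840, "H3": 0.798, "H4": 0.731})
```

A model can score well on `mean` while `worst_site` and `disparity` flag exactly the site-level failure mode the paper warns about.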

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the same stress-test protocol to other segmentation tasks or imaging modalities could show whether the 41 percent gap reduction generalizes beyond brain tumors.
  • If real-world multi-center data produce larger appearance shifts than the simulated ones, the relative benefit of FedBN over FedAvg could increase.
  • Hospitals might adopt disparity-aware metrics to choose among federated algorithms when equitable outcomes across sites are a clinical requirement.

Load-bearing premise

The graded MRI appearance shifts and four simulated hospital clients accurately reflect real scanner and acquisition variability in multi-site clinical deployments.

What would settle it

Repeating the federated training and stress tests on real multi-hospital MRI datasets collected from actual scanners and measuring whether FedBN still reduces the worst-to-best Dice gap by roughly 41 percent.

Figures

Figures reproduced from arXiv: 2605.09025 by Kiran Naseer, Naveed Anwer Butt.

Figure 1. Overview of the proposed MedFL-Stress framework for robustness evaluation.
Figure 2. Federated convergence across communication rounds. FedAvg and FedBN show stable convergence under heterogeneous MRI distributions. FedProx with µ=0.1 exhibits degraded performance under strong appearance heterogeneity; µ=0.01 is the stronger configuration and is used in all subsequent analysis.
Figure 3. Client-wise robustness under strong appearance heterogeneity. (a) FedBN im…
Original abstract

Federated learning enables hospitals to collaboratively train segmentation models without sharing patient data. However, current evaluation protocols report only average performance across clients, masking failures at individual sites. In clinical deployment, a model that fails consistently at one hospital is a real safety risk that a good mean score can hide entirely. We introduce MedFL-Stress, a controlled stress-testing framework that exposes exactly this failure mode. Using 2D axial slices from BraTS 2020 distributed across four simulated hospital clients, we apply graded MRI appearance shifts (gamma contrast, scale-shift, and noise-plus-blur) reflecting scanner and acquisition variability in real multi-site deployments. Three federated baselines are evaluated: FedAvg, FedProx, and FedBN. Worst-hospital Dice and inter-hospital disparity are treated as primary metrics, not supplementary observations. FedAvg achieves the highest global mean Dice (0.8159) but conceals a 0.0850 gap between its best and worst-performing hospital. FedBN closes that gap by 41% (0.0850 to 0.0503) while sacrificing less than half a Dice point in mean accuracy (0.8159 to 0.8109), and the weakest hospital gains 3.5 Dice points outright (0.7309 to 0.7656). These findings demonstrate that robustness-oriented evaluation protocols are essential for reliable federated medical imaging deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MedFL-Stress, a controlled stress-testing framework for federated brain tumor segmentation. It partitions 2D axial slices from BraTS 2020 across four simulated hospital clients, applies graded parametric MRI appearance shifts (gamma contrast, scale-shift, noise-plus-blur), and evaluates FedAvg, FedProx, and FedBN. Primary metrics are worst-hospital Dice and inter-hospital disparity rather than mean performance alone. The central empirical result is that FedBN reduces the best-worst Dice gap by 41% (0.0850 to 0.0503) while preserving mean Dice near 0.81 and improving the weakest hospital by 3.5 points.

Significance. If the simulated shifts adequately capture real cross-hospital variability, the work is significant for demonstrating that mean Dice alone can conceal clinically dangerous site-specific failures in federated medical imaging. It supplies concrete quantitative support for preferring batch-norm adaptation (FedBN) over standard FedAvg or FedProx when disparity reduction is prioritized, and it advocates for robustness-oriented evaluation protocols that could shape future benchmarks.
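The mechanism behind "batch-norm adaptation" (FedBN, Li et al., ICLR 2021, reference [5]) is that the server averages every parameter except batch-norm layers, which each client keeps local. A minimal sketch using plain name-to-array dicts, where the convention that batch-norm parameter names contain "bn" is an assumption of this sketch, not something specified in the paper:

```python
import numpy as np

def fedbn_aggregate(client_states):
    """FedBN-style aggregation: average every parameter across clients
    EXCEPT batch-norm parameters, which each client keeps private.
    States are plain name->array dicts; names containing 'bn' are
    treated as batch-norm layers (a naming convention assumed here)."""
    names = client_states[0].keys()
    shared = {
        name: np.mean([state[name] for state in client_states], axis=0)
        for name in names if "bn" not in name
    }
    # Each client's new state: shared averaged weights plus its own BN stats.
    return [{**state, **shared} for state in client_states]

clients = [
    {"conv.weight": np.full(3, 1.0), "bn.running_mean": np.full(3, 0.1)},
    {"conv.weight": np.full(3, 3.0), "bn.running_mean": np.full(3, 0.9)},
]
updated = fedbn_aggregate(clients)
```

Keeping BN statistics local lets each simulated hospital normalize features to its own intensity distribution, which is why FedBN is the natural candidate under appearance shift.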

major comments (2)
  1. [Abstract] Abstract: the claim that the graded shifts (gamma contrast, scale-shift, noise-plus-blur) 'reflect scanner and acquisition variability in real multi-site deployments' is unsupported. No validation against actual multi-center MRI data is provided, and the parametric family does not span key real-world factors such as field strength, pulse-sequence parameters, reconstruction kernels, or coil sensitivities; if the simulated distribution under- or over-represents true appearance shift, the reported 41% gap reduction and relative advantage of FedBN could reverse.
  2. [Results] Results section (quantitative claims): the headline Dice values (0.8159, 0.8109, 0.7309, 0.7656, gap 0.0850 to 0.0503) are stated without error bars, standard deviations across runs, confidence intervals, or statistical significance tests. This absence prevents assessment of whether the 3.5-point weakest-hospital gain and 41% disparity reduction are reliable or sensitive to random seeds and exact shift parameters.
minor comments (1)
  1. The exact parameter ranges and application protocol for the graded shifts (e.g., per-client vs. per-image, how 'graded' levels are discretized) are not detailed in the provided abstract; including them would improve reproducibility even if the full experimental section contains them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important aspects of validation and statistical reporting. We address each major comment below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the graded shifts (gamma contrast, scale-shift, noise-plus-blur) 'reflect scanner and acquisition variability in real multi-site deployments' is unsupported. No validation against actual multi-center MRI data is provided, and the parametric family does not span key real-world factors such as field strength, pulse-sequence parameters, reconstruction kernels, or coil sensitivities; if the simulated distribution under- or over-represents true appearance shift, the reported 41% gap reduction and relative advantage of FedBN could reverse.

    Authors: We agree that our original wording in the abstract overstated the direct correspondence between our simulated shifts and real-world multi-site MRI variability. The shifts were chosen to represent common parametric variations observed in MRI (e.g., contrast adjustments, intensity scaling, and noise/blur effects), but we did not validate them against empirical distributions from actual multi-center datasets. We will revise the abstract to replace 'reflecting scanner and acquisition variability in real multi-site deployments' with 'simulating common scanner and acquisition variability observed in multi-site MRI'. We will also expand the discussion section to explicitly acknowledge this as a limitation and propose validation with real multi-center data as future work. Regarding the potential reversal of results, while we cannot rule it out without real data, the controlled nature of the simulation allows us to isolate the effect of appearance shift, providing a baseline for such comparisons. revision: yes

  2. Referee: [Results] Results section (quantitative claims): the headline Dice values (0.8159, 0.8109, 0.7309, 0.7656, gap 0.0850 to 0.0503) are stated without error bars, standard deviations across runs, confidence intervals, or statistical significance tests. This absence prevents assessment of whether the 3.5-point weakest-hospital gain and 41% disparity reduction are reliable or sensitive to random seeds and exact shift parameters.

    Authors: We acknowledge that the reported Dice values lack measures of variability and statistical testing, which is a valid concern for assessing the reliability of the findings. To address this, we will rerun the experiments with multiple random seeds (at least 5 independent runs) and report the mean and standard deviation for all key metrics, including worst-hospital Dice and the disparity gap. We will also include statistical significance tests (e.g., paired t-tests between methods) where appropriate. These updates will be incorporated into the Results section and any relevant tables or figures. revision: yes
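The variability reporting promised here is a few lines of analysis once per-seed metrics exist. A sketch with invented per-seed numbers (illustrative only; the t-statistic is computed by hand to avoid a SciPy dependency, and `scipy.stats.ttest_rel` would give the same statistic plus a p-value):

```python
import numpy as np

def paired_t(a, b):
    """Paired t-statistic over per-seed metric pairs."""
    d = np.asarray(a) - np.asarray(b)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# Hypothetical worst-hospital Dice over 5 seeds (illustrative numbers only).
fedavg = [0.728, 0.735, 0.729, 0.733, 0.730]
fedbn  = [0.763, 0.768, 0.764, 0.767, 0.766]

print(f"FedAvg worst-site: {np.mean(fedavg):.4f} +/- {np.std(fedavg, ddof=1):.4f}")
print(f"FedBN  worst-site: {np.mean(fedbn):.4f} +/- {np.std(fedbn, ddof=1):.4f}")
print(f"paired t = {paired_t(fedbn, fedavg):.2f}")
```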

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of existing FL methods on simulated data shifts.

full rationale

The manuscript introduces MedFL-Stress as an experimental stress-testing protocol that partitions BraTS 2020 slices into four clients, applies parametric appearance shifts (gamma, scale, noise+blur), and reports empirical metrics (mean Dice, worst-hospital Dice, inter-hospital gap) for FedAvg, FedProx, and FedBN. All reported numbers (e.g., 0.8159 mean Dice, 0.0850 gap reduced to 0.0503) are direct outputs of the described training and evaluation runs. No derivation chain, first-principles result, fitted parameter renamed as prediction, or self-citation load-bearing theorem exists. The paper contains no equations that define quantities in terms of themselves and no ansatz or uniqueness claim imported from prior author work. The skeptic concern about simulation fidelity is a question of external validity, not circularity within the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that the chosen image transformations and simulated clients capture real cross-hospital MRI variability; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: The simulated MRI appearance shifts reflect real scanner and acquisition variability across hospitals.
    Invoked to justify the stress test as representative of clinical multi-site conditions.

pith-pipeline@v0.9.0 · 5562 in / 1419 out tokens · 53926 ms · 2026-05-12T03:18:06.598869+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. Bakas, S., Reyes, M., Jakab, A., et al.: Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BraTS challenge. arXiv:1811.02629 (2018)
  2. Chen, R.J., Lu, M.Y., Chen, T.Y., Williamson, D.F.K., Mahmood, F.: Algorithm fairness in artificial intelligence for medicine and healthcare. Nat. Biomed. Eng. 7(6), 719–742 (2023)
  3. Guan, H., Wang, Y., Li, M., Xu, Z., Han, S., Yao, Y., et al.: Federated learning for medical image analysis: A survey. Pattern Recognit. 149, 110218 (2024)
  4. Li, T., Sahu, A.K., Talwalkar, A., Smith, V.: Federated optimization in heterogeneous networks. In: Proc. 3rd MLSys Conf. (2020)
  5. Li, X., Jiang, M., Zhang, X., Kamp, M., Dou, Q.: FedBN: Federated learning on non-IID features via local batch normalization. In: Int. Conf. Learn. Represent. (ICLR) (2021)
  6. Manthe, M., et al.: Federated brain tumor segmentation: An extensive benchmark. Med. Image Anal. 98, 103348 (2024). https://doi.org/10.1016/j.media.2024.103348
  7. McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: Proc. 20th AISTATS (2017)
  8. Menze, B.H., Jakab, A., Bauer, S., et al.: The multimodal brain tumor image segmentation benchmark (BraTS). IEEE Trans. Med. Imaging 34(10), 1993–2024 (2015)
  9. Pati, S., Baid, U., Zenk, M., et al.: The Federated Tumor Segmentation (FeTS) challenge. In: Proc. MICCAI. pp. 234–241 (2021)
  10. Pérez-García, F., Sparks, R., Ourselin, S.: TorchIO: A Python library for efficient preprocessing and sampling of medical images. Comput. Methods Programs Biomed. 208, 106236 (2021)
  11. Pfohl, S.R., et al.: The role of fairness in medical AI. npj Digit. Med. 4(1), 1–10 (2021)
  12. Rieke, N., Hancox, J., Li, W., et al.: The future of digital health with federated learning. npj Digit. Med. 3(1), 1–7 (2020)
  13. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Proc. MICCAI. pp. 234–241 (2015)
  14. Sheller, M.J., Edwards, B., Reina, G.A., et al.: Federated learning in medicine: Collaborative training without sharing patient data. Sci. Rep. 10(1), 1–12 (2020)
  15. Sheller, M.J., Reina, G.A., Edwards, B., et al.: Multi-institutional deep learning modeling without sharing patient data. In: Brainlesion: Glioma, MS, Stroke and TBI. pp. 92–104 (2019)
  16. Xu, J., Glicksberg, B.S., et al.: Federated learning in medical imaging: A survey. Med. Image Anal. 85, 102760 (2023)
  17. Yan, G., et al.: FedVCK: Robust federated learning for medical image analysis. In: Proc. AAAI (2025)
  18. Zenk, M., Baid, U., Pati, S., et al.: Towards fair decentralized benchmarking of healthcare AI algorithms. Nat. Commun. 16(1) (2025). https://doi.org/10.1038/s41467-025-60466-1
  19. Zhou, K., Liu, Z., et al.: A survey on domain generalization in medical imaging. IEEE Trans. Med. Imaging 43, 101–120 (2024)
  20. Zhou, Z., et al.: Federated learning for medical image classification: A benchmark. IEEE J. Biomed. Health Inform. (2025). https://doi.org/10.1109/JBHI.2025.3631706