FederatedRSF: Federated Random Survival Forests for Partially Overlapping Medical Data
Pith reviewed 2026-05-25 06:02 UTC · model grok-4.3
The pith
Federated random survival forests can perform comparably to centralized training on survival prediction even when sites have only partially overlapping features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FederatedRSF trains survival trees locally at each site and redistributes only feature-compatible trees. On the GBSG2 cohort with partial feature overlap, the resulting federated model performs comparably to centralized training, as measured by Harrell's C-index in cross-validation experiments.
What carries the argument
Redistributing only feature-compatible trees to each site after local training, enabling the federated model to handle partial feature overlap without data sharing.
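The compatibility filter can be sketched in a few lines. This is a minimal illustration, not the FederatedRSF API: `TreeSummary`, `compatible_trees`, and the tree identifiers are hypothetical names, and a tree is assumed to be usable at a site exactly when every feature it splits on is available there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeSummary:
    """Hypothetical stand-in for a locally trained survival tree."""
    tree_id: str
    split_features: frozenset  # features the tree actually splits on


def compatible_trees(pool, site_features):
    """Return the trees whose split features are all available at the site."""
    available = set(site_features)
    return [t for t in pool if t.split_features <= available]


# Pooled trees from two sites (illustrative feature sets only).
pool = [
    TreeSummary("a1", frozenset({"age", "tsize"})),
    TreeSummary("a2", frozenset({"age", "pnodes", "progrec"})),
    TreeSummary("b1", frozenset({"tsize", "estrec"})),
]

site_b_features = {"age", "tsize", "estrec"}
usable_at_b = [t.tree_id for t in compatible_trees(pool, site_b_features)]
print(usable_at_b)  # -> ['a1', 'b1']
```

Tree `a2` is dropped for site B because it splits on `pnodes` and `progrec`, which that site does not record. Only tree structures move between sites in this sketch, never patient rows.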
If this is right
- Survival prediction models can be trained from data distributed across institutions without centralizing records.
- The approach maintains discrimination performance similar to pooled training under simulated feature heterogeneity.
- It provides a Python implementation usable for real medical survival analysis tasks.
- Different clinical variables or sequencing panels at sites no longer prevent joint modeling.
Where Pith is reading between the lines
- This technique could be adapted for other tree-based methods in federated healthcare analytics.
- Further tests on genuinely multi-institutional datasets would clarify how well the simulation holds.
- Such methods may encourage more collaborative predictive modeling in regulated medical environments.
Load-bearing premise
Withholding random subsets of features from the GBSG2 cohort produces a realistic simulation of the partial feature overlap that occurs across real medical institutions with different sequencing panels or clinical variables.
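To make the premise concrete, here is a rough sketch of uniform random feature withholding. The summary does not state the paper's exact scheme, so the function name `withhold_features`, the number of sites, and the number of withheld features are assumptions. The column names are the GBSG2 covariates as I understand the scikit-survival copy to label them.

```python
import random

# GBSG2 covariate names (assumed to match the scikit-survival distribution).
GBSG2_FEATURES = [
    "age", "estrec", "horTh", "menostat",
    "pnodes", "progrec", "tgrade", "tsize",
]


def withhold_features(features, n_sites, n_withheld, seed=0):
    """Give each simulated site the full feature list minus a random subset."""
    rng = random.Random(seed)
    site_views = []
    for _ in range(n_sites):
        hidden = set(rng.sample(features, n_withheld))
        site_views.append([f for f in features if f not in hidden])
    return site_views


views = withhold_features(GBSG2_FEATURES, n_sites=3, n_withheld=2, seed=42)
# Features that every site still observes.
shared = set.intersection(*(set(v) for v in views))
for site_index, view in enumerate(views):
    print(site_index, view)
print("shared by all sites:", sorted(shared))
```

Each site here loses an independent, uniformly drawn subset. That independence is exactly what the premise questions, since real sites tend to lack whole blocks of related variables.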
What would settle it
Collect data from multiple actual hospitals with known distinct feature sets, apply the federated method, and compare its C-index directly to a model trained on pooled data to see if the gap stays small.
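The comparison itself reduces to computing Harrell's C-index for each model on the same held-out patients and looking at the gap. Below is a minimal pure-Python sketch of the statistic. It ignores tied survival times, which library implementations handle more carefully, and the sample values are made up for illustration rather than taken from the paper.

```python
def harrell_c_index(times, events, risks):
    """Fraction of comparable pairs that the risk scores order correctly.

    A pair (i, j) is comparable when subject i had an observed event and
    a strictly shorter time than subject j. Tied risk scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Illustrative values only: higher risk should mean shorter survival.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
c_perfect = harrell_c_index(times, events, risks=[4, 3, 2, 1])
c_reversed = harrell_c_index(times, events, risks=[1, 2, 3, 4])
print(c_perfect, c_reversed)  # -> 1.0 0.0
```

In a real check, `risks` would come once from the pooled model and once from the federated model, and the quantity of interest would be the difference between the two scores across repeated splits.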
Original abstract
Multi-center survival prediction can improve robustness and generalizability, yet privacy regulations and institutional governance often prevent pooling patient-level clinical and genomic data across institutions. In practice, deployment is further complicated by feature-space heterogeneity, in which sites collect different covariates or use different sequencing panels, resulting in only partially overlapping feature sets. We present FederatedRSF, a Python package that implements federated random survival forests, aggregating locally trained survival trees and redistributing only feature-compatible trees to each site, enabling inference with partial overlap without sharing raw data. We evaluate FederatedRSF on the GBSG2 breast cancer cohort distributed with the scikit-survival package, simulating feature heterogeneity across clients by withholding subsets of features, and assessing discrimination using Harrell's concordance index (C-Index) under repeated cross-validation and site-splits. The results demonstrated that the federated model can achieve performance comparable to that of the centralized training setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FederatedRSF, a Python package implementing federated random survival forests for multi-center survival prediction under privacy constraints and partial feature overlap. Locally trained survival trees are aggregated and only feature-compatible trees are redistributed to sites, enabling inference without raw data sharing. Evaluation on the GBSG2 breast cancer cohort (from scikit-survival) simulates heterogeneity by randomly withholding feature subsets across clients; Harrell's C-Index under repeated cross-validation and site splits is reported as comparable between federated and centralized settings.
Significance. If the comparability result holds, the approach would address a practical barrier to collaborative survival modeling in medicine where institutions cannot pool data due to privacy rules and have non-identical feature spaces. The open-source Python package is a concrete strength that supports reproducibility and potential adoption.
major comments (1)
- [Abstract / experimental evaluation] The central comparability claim rests on random feature withholding to simulate partial overlap. Real institutional heterogeneity typically exhibits structured, correlated missingness (e.g., one site has a full genomic panel while another has only clinical variables plus a disjoint panel), which can reduce the number of compatible trees and change both local tree quality and the aggregation step in ways uniform random subsets do not capture. This assumption is load-bearing for the reported performance equivalence and requires either justification via additional structured-missingness experiments or real multi-site data.
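A toy example shows why the referee's distinction matters for tree compatibility. Everything here is hypothetical: the gene names, the four hand-written trees, and the site definitions are illustrative, and the fractions are properties of this toy setup, not results from the paper. Both versions of site B lack exactly two of site A's five features; only the pattern differs.

```python
CLINICAL = {"age", "tsize", "pnodes"}
PANEL_A = {"gene_a1", "gene_a2"}  # hypothetical panel sequenced at site A
PANEL_B = {"gene_b1", "gene_b2"}  # hypothetical disjoint panel at site B


def fraction_compatible(tree_feature_sets, site_features):
    """Share of trees whose split features are all present at the site."""
    usable = [fs for fs in tree_feature_sets if fs <= site_features]
    return len(usable) / len(tree_feature_sets)


# Split-feature sets of trees trained at site A (clinical + panel A).
site_a_trees = [
    {"age", "gene_a1"},
    {"tsize", "gene_a2"},
    {"pnodes", "gene_a1", "gene_a2"},
    {"age", "tsize"},
]

# Structured missingness: site B lacks all of panel A and has panel B instead.
structured_b = CLINICAL | PANEL_B
# Scattered missingness: site B lacks two unrelated features of site A.
scattered_b = (CLINICAL | PANEL_A) - {"gene_a2", "pnodes"}

frac_structured = fraction_compatible(site_a_trees, structured_b)
frac_scattered = fraction_compatible(site_a_trees, scattered_b)
print(frac_structured, frac_scattered)  # -> 0.25 0.5
```

In this setup the block-structured site can reuse only the purely clinical tree, while the scattered site can reuse half of them. Whether a similar gap appears with real forests and real panels is the open question the comment raises.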
minor comments (2)
- [Abstract] The description of the evaluation protocol omits the number of cross-validation repetitions, the exact site-split configuration, variance or confidence intervals on the C-Index, and any statistical test for equivalence to the centralized baseline.
- [Methods] The manuscript would benefit from an explicit statement of the aggregation rule (e.g., how trees are selected or weighted when feature sets differ) and any hyperparameters controlling tree redistribution.
Simulated Author's Rebuttal
We thank the referee for highlighting an important aspect of our experimental design. We address the comment below.
- Referee: [Abstract / experimental evaluation] The central comparability claim rests on random feature withholding to simulate partial overlap. Real institutional heterogeneity typically exhibits structured, correlated missingness (e.g., one site has a full genomic panel while another has only clinical variables plus a disjoint panel), which can reduce the number of compatible trees and change both local tree quality and the aggregation step in ways uniform random subsets do not capture. This assumption is load-bearing for the reported performance equivalence and requires either justification via additional structured-missingness experiments or real multi-site data.
Authors: We agree that random feature withholding constitutes a controlled but simplified simulation of partial overlap and does not replicate the structured, correlated missingness patterns typical of real institutional data (e.g., disjoint genomic panels). Our simulation isolates the effect of varying overlap ratios on tree compatibility and aggregation while holding other factors fixed, which is a standard approach in federated learning studies when real multi-center datasets are unavailable due to privacy constraints. The FederatedRSF procedure itself is agnostic to the missingness mechanism and simply redistributes only those trees whose splitting features are present at a given site. We recognize that structured missingness could alter the number of compatible trees and local model quality in ways not tested here. We cannot provide real multi-site data or additional structured-missingness experiments within the current scope. We will add an explicit discussion of this limitation to the experimental evaluation section.
Revision: partial
- Additional structured-missingness experiments or real multi-site data
Circularity Check
No significant circularity; empirical method and simulation stand alone
The paper introduces FederatedRSF as an implementation that aggregates locally trained survival trees and redistributes feature-compatible trees for partial overlap, then reports an empirical C-Index comparison on GBSG2 under random feature withholding. No equations, parameter fits, self-citations, or uniqueness theorems are invoked in the provided text that would reduce the central comparability claim to a tautology or input by construction. The evaluation is a direct simulation-based benchmark against centralized training, with no load-bearing steps that collapse into self-definition or renamed fits. This is a standard empirical presentation of a federated algorithm and is self-contained against external benchmarks.