FederatedRSF: Federated Random Survival Forests for Partially Overlapping Medical Data

Amirreza Aleyasin; Anne-Christin Hauschild; Jonas Harriehausen; Lion Philipp Wolf; Maryam Moradpour; Youngjun Park

arxiv: 2605.22954 · v1 · pith:IEK3C3LSnew · submitted 2026-05-21 · 💻 cs.LG · q-bio.QM

Pith reviewed 2026-05-25 06:02 UTC · model grok-4.3

classification 💻 cs.LG q-bio.QM

keywords federated learning · random survival forests · survival analysis · partial feature overlap · privacy preserving · medical data · concordance index · breast cancer

The pith

Federated random survival forests can match centralized performance on survival prediction even when sites have only partially overlapping features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FederatedRSF, a method and package for training random survival forests in a federated way across medical institutions. It aggregates locally trained trees and sends back only those compatible with each site's features, avoiding any sharing of raw patient data. This is tested by splitting the GBSG2 breast cancer dataset to simulate different feature sets at different sites. The federated results come close to what a single model trained on all data would achieve, using the C-index for evaluation. Readers care because many medical datasets cannot be combined due to privacy rules, yet joint models could be more reliable.

Core claim

By training survival trees locally at each site and redistributing only feature-compatible trees, FederatedRSF allows the construction of a federated model that performs comparably to centralized training on the GBSG2 cohort when feature overlap is partial, as measured by Harrell's C-index in cross-validation experiments.

What carries the argument

Redistributing only feature-compatible trees to each site after local training, enabling the federated model to handle partial feature overlap without data sharing.
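The compatibility rule described above can be sketched in a few lines. This is a minimal illustration, not the FederatedRSF package's API: the `TreeSummary` class, the `compatible_trees` function, and the site names are hypothetical, and each tree is reduced to the set of features it splits on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeSummary:
    """Hypothetical stand-in for a locally trained survival tree."""
    origin_site: str
    split_features: frozenset  # features the tree actually splits on


def compatible_trees(pool, site_features):
    """Return the trees whose split features are all available at a site."""
    available = frozenset(site_features)
    return [tree for tree in pool if tree.split_features <= available]


# Pooled trees from two sites with partially overlapping feature sets.
pool = [
    TreeSummary("site_a", frozenset({"age", "tsize"})),
    TreeSummary("site_a", frozenset({"age", "pnodes", "progrec"})),
    TreeSummary("site_b", frozenset({"age", "estrec"})),
    TreeSummary("site_b", frozenset({"tsize"})),
]

# Features recorded at the receiving site (assumed for illustration).
site_b_features = {"age", "tsize", "estrec"}
redistributed = compatible_trees(pool, site_b_features)
```

In this toy pool, the site receives every tree except the one that splits on `pnodes` and `progrec`, which it does not record. A tree trained elsewhere can therefore still be used locally as long as it never splits on a missing feature.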

If this is right

Survival prediction models can be trained from data distributed across institutions without centralizing records.
The approach maintains discrimination performance similar to pooled training under simulated feature heterogeneity.
It provides a Python implementation usable for real medical survival analysis tasks.
Different clinical variables or sequencing panels at sites no longer prevent joint modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This technique could be adapted for other tree-based methods in federated healthcare analytics.
Further tests on genuinely multi-institutional datasets would clarify how well the simulation holds.
Such methods may encourage more collaborative predictive modeling in regulated medical environments.

Load-bearing premise

Withholding random subsets of features from the GBSG2 cohort produces a realistic simulation of the partial feature overlap that occurs across real medical institutions with different sequencing panels or clinical variables.
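A minimal sketch of what random feature withholding could look like, assuming each simulated site independently drops a fixed number of features. The function names, the number of sites, and the number of withheld features are illustrative assumptions, not the paper's protocol; the feature names are the GBSG2 columns shipped with scikit-survival.

```python
import random

# Covariates of the GBSG2 cohort as distributed with scikit-survival.
GBSG2_FEATURES = [
    "age", "estrec", "horTh", "menostat",
    "pnodes", "progrec", "tgrade", "tsize",
]


def withhold_random_features(features, n_sites, n_withheld, seed=0):
    """Give each simulated site the full feature list minus a random subset."""
    rng = random.Random(seed)
    site_features = {}
    for site in range(n_sites):
        dropped = set(rng.sample(features, n_withheld))
        site_features[f"site_{site}"] = [f for f in features if f not in dropped]
    return site_features


def shared_features(site_features):
    """Features that remain available at every site."""
    return set.intersection(*(set(v) for v in site_features.values()))


# Hypothetical configuration: three sites, two features withheld at each.
sites = withhold_random_features(GBSG2_FEATURES, n_sites=3, n_withheld=2, seed=42)
overlap = sorted(shared_features(sites))
```

Because each site's dropped subset is drawn independently and uniformly, this scheme never produces the block-wise gaps (an entire panel missing at one site) that the premise above is concerned with.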

What would settle it

Collect data from multiple actual hospitals with known distinct feature sets, apply the federated method, and compare its C-index directly to a model trained on pooled data to see if the gap stays small.

Original abstract

Multi-center survival prediction can improve robustness and generalizability, yet privacy regulations and institutional governance often prevent pooling patient-level clinical and genomic data across institutions. In practice, deployment is further complicated by feature-space heterogeneity, in which sites collect different covariates or use different sequencing panels, resulting in only partially overlapping feature sets. We present FederatedRSF, a Python package that implements federated random survival forests, aggregating locally trained survival trees and redistributing only feature-compatible trees to each site, enabling inference with partial overlap without sharing raw data. We evaluate FederatedRSF on the GBSG2 breast cancer cohort distributed with the scikit-survival package, simulating feature heterogeneity across clients by withholding subsets of features, and assessing discrimination using Harrell's concordance index (C-Index) under repeated cross-validation and site-splits. The results demonstrated that the federated model can achieve performance comparable to that of the centralized training setting.
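For readers unfamiliar with the metric, Harrell's C-index can be sketched as the fraction of comparable patient pairs that a model ranks in the right order. The version below is a simplified, pure-Python illustration: it skips pairs with tied observed times, and it is not the evaluation code used in the paper. The function name and the toy data are assumptions.

```python
def harrell_c_index(times, events, risk_scores):
    """Simplified Harrell's C-index.

    A pair (i, j) is comparable when subject i had an observed event and a
    strictly shorter observed time than subject j. The pair is concordant
    when subject i also has the higher predicted risk; tied risks count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four patients, the third one is censored.
times = [2, 4, 6, 8]
events = [True, True, False, True]
risk = [0.9, 0.7, 0.8, 0.1]
c_index = harrell_c_index(times, events, risk)  # 4 of 5 comparable pairs -> 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so "comparable performance" in the abstract means the federated and centralized models reach similar values on this scale.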

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FederatedRSF is a practical package for random survival forests under partial feature overlap, but the random-withholding simulation on GBSG2 leaves open whether it handles real institutional heterogeneity.

The paper's main point is a working Python package that trains local survival trees and only redistributes those whose features match at each site, letting centers collaborate on survival models without sharing raw data. They test this on the GBSG2 cohort by randomly dropping features to create partial overlap and report that the federated C-index stays close to the centralized baseline under cross-validation and site splits.

This is a direct engineering extension of existing random survival forest code rather than new theory, and it fills a clear gap for privacy-constrained medical data where feature sets differ across hospitals. The implementation details on tree compatibility and redistribution are the part that could be reused. The evaluation protocol is straightforward and uses standard metrics, which is fine for an initial demonstration.

The main limitation is the simulation design. Random feature withholding assumes independent missingness, but real medical sites often drop entire correlated blocks of variables, such as one center having a full genomic panel and another only basic clinical fields. That structure can reduce the number of compatible trees and change local model quality in ways the current setup does not test. If the full paper includes any structured-missingness experiments or real multi-site data, that would strengthen the claim; otherwise the comparability result rests on a narrow regime.

The citations follow the usual federated-learning and survival-analysis lines without circularity. This work is aimed at applied clinical ML groups that already use scikit-survival and need a federated starting point. A reader looking for a ready package and basic performance numbers would get value from it. It deserves peer review so referees can check the full experimental details and the realism of the heterogeneity model.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces FederatedRSF, a Python package implementing federated random survival forests for multi-center survival prediction under privacy constraints and partial feature overlap. Locally trained survival trees are aggregated and only feature-compatible trees are redistributed to sites, enabling inference without raw data sharing. Evaluation on the GBSG2 breast cancer cohort (from scikit-survival) simulates heterogeneity by randomly withholding feature subsets across clients; Harrell's C-Index under repeated cross-validation and site splits is reported as comparable between federated and centralized settings.

Significance. If the comparability result holds, the approach would address a practical barrier to collaborative survival modeling in medicine where institutions cannot pool data due to privacy rules and have non-identical feature spaces. The open-source Python package is a concrete strength that supports reproducibility and potential adoption.

major comments (1)

[Abstract / experimental evaluation] The central comparability claim rests on random feature withholding to simulate partial overlap. Real institutional heterogeneity typically exhibits structured, correlated missingness (e.g., one site has a full genomic panel while another has only clinical variables plus a disjoint panel), which can reduce the number of compatible trees and change both local tree quality and the aggregation step in ways uniform random subsets do not capture. This assumption is load-bearing for the reported performance equivalence and requires either justification via additional structured-missingness experiments or real multi-site data.
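The concern can be made concrete with a toy comparison. The sketch below is hypothetical throughout: the tree pool, the choice of `estrec` and `progrec` as a "panel" block, and the function name are illustrative assumptions, not results or code from the paper. It only shows that withholding the same number of features can leave very different numbers of compatible trees depending on which features are withheld.

```python
ALL_FEATURES = {
    "age", "estrec", "horTh", "menostat",
    "pnodes", "progrec", "tgrade", "tsize",
}

# Hypothetical pooled trees, each reduced to the features it splits on.
TREES = [
    {"age", "pnodes"},
    {"pnodes", "progrec"},
    {"estrec", "progrec"},
    {"tsize", "tgrade"},
    {"progrec", "age"},
    {"estrec", "tsize"},
]


def compatible_fraction(trees, site_features):
    """Share of pooled trees that use only features present at the site."""
    usable = sum(1 for tree in trees if tree <= site_features)
    return usable / len(trees)


# Structured gap: the site lacks an entire (assumed) receptor block.
block_site = ALL_FEATURES - {"estrec", "progrec"}

# Scattered gap: the site lacks two unrelated features.
scattered_site = ALL_FEATURES - {"menostat", "horTh"}

block_fraction = compatible_fraction(TREES, block_site)
scattered_fraction = compatible_fraction(TREES, scattered_site)
```

In this constructed pool, the block-wise gap leaves two of six trees usable while the scattered gap leaves all six. The direction and size of the effect depend entirely on which features the pooled trees split on, which is why an experiment on real or structured data would be needed to quantify it.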

minor comments (2)

[Abstract] The description of the evaluation protocol omits the number of cross-validation repetitions, the exact site-split configuration, variance or confidence intervals on the C-Index, and any statistical test for equivalence to the centralized baseline.
[Methods] The manuscript would benefit from an explicit statement of the aggregation rule (e.g., how trees are selected or weighted when feature sets differ) and any hyperparameters controlling tree redistribution.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for highlighting an important aspect of our experimental design. We address the comment below.

Referee: [Abstract / experimental evaluation] The central comparability claim rests on random feature withholding to simulate partial overlap. Real institutional heterogeneity typically exhibits structured, correlated missingness (e.g., one site has a full genomic panel while another has only clinical variables plus a disjoint panel), which can reduce the number of compatible trees and change both local tree quality and the aggregation step in ways uniform random subsets do not capture. This assumption is load-bearing for the reported performance equivalence and requires either justification via additional structured-missingness experiments or real multi-site data.

Authors: We agree that random feature withholding constitutes a controlled but simplified simulation of partial overlap and does not replicate the structured, correlated missingness patterns typical of real institutional data (e.g., disjoint genomic panels). Our simulation isolates the effect of varying overlap ratios on tree compatibility and aggregation while holding other factors fixed, which is a standard approach in federated learning studies when real multi-center datasets are unavailable due to privacy constraints. The FederatedRSF procedure itself is agnostic to the missingness mechanism and simply redistributes only those trees whose splitting features are present at a given site. We recognize that structured missingness could alter the number of compatible trees and local model quality in ways not tested here. We cannot provide real multi-site data or additional structured-missingness experiments within the current scope. We will add an explicit discussion of this limitation to the experimental evaluation section.

revision: partial

standing simulated objections not resolved

Additional structured-missingness experiments or real multi-site data

Circularity Check

0 steps flagged

No significant circularity; empirical method and simulation stand alone

The paper introduces FederatedRSF as an implementation that aggregates locally trained survival trees and redistributes feature-compatible trees for partial overlap, then reports an empirical C-Index comparison on GBSG2 under random feature withholding. No equations, parameter fits, self-citations, or uniqueness theorems are invoked in the provided text that would reduce the central comparability claim to a tautology or input by construction. The evaluation is a direct simulation-based benchmark against centralized training, with no load-bearing steps that collapse into self-definition or renamed fits. This is a standard empirical presentation of a federated algorithm and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the approach implicitly rests on standard random-forest and federated-learning assumptions that are not detailed here.
