
arxiv: 2605.22954 · v1 · pith:IEK3C3LSnew · submitted 2026-05-21 · 💻 cs.LG · q-bio.QM

FederatedRSF: Federated Random Survival Forests for Partially Overlapping Medical Data

Pith reviewed 2026-05-25 06:02 UTC · model grok-4.3

classification 💻 cs.LG q-bio.QM
keywords federated learning · random survival forests · survival analysis · partial feature overlap · privacy preserving · medical data · concordance index · breast cancer

The pith

Federated random survival forests can match centralized performance on survival prediction even when sites have only partially overlapping features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FederatedRSF, a method and Python package for training random survival forests in a federated way across medical institutions. It aggregates locally trained trees and sends back to each site only the trees compatible with that site's features, so no raw patient data is shared. The authors test it on the GBSG2 breast cancer dataset, withholding subsets of features to simulate sites that record different variables. Measured by the C-index, the federated model comes close to a single model trained on all the data. Readers care because privacy rules prevent many medical datasets from being combined, yet jointly trained models could be more reliable.

Core claim

By training survival trees locally at each site and redistributing only feature-compatible trees, FederatedRSF allows the construction of a federated model that performs comparably to centralized training on the GBSG2 cohort when feature overlap is partial, as measured by Harrell's C-index in cross-validation experiments.

What carries the argument

Redistributing only feature-compatible trees to each site after local training, enabling the federated model to handle partial feature overlap without data sharing.
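
The compatibility rule can be sketched in a few lines. This is a minimal illustration, not the FederatedRSF API: `TreeSummary`, `compatible_trees`, and `redistribute` are hypothetical names, and a tree is reduced to the set of covariates it splits on. The covariate names follow the GBSG2 columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeSummary:
    """Hypothetical record describing one locally trained survival tree."""
    tree_id: str
    origin_site: str
    split_features: frozenset  # names of the covariates the tree splits on


def compatible_trees(pool, site_features):
    """Keep the trees whose split features are all available at the site."""
    available = frozenset(site_features)
    return [tree for tree in pool if tree.split_features <= available]


def redistribute(pool, features_by_site):
    """Map each site name to the subset of the pooled trees it can evaluate."""
    return {
        site: compatible_trees(pool, features)
        for site, features in features_by_site.items()
    }


# Toy pool of three trees from two sites (illustrative values only).
pool = [
    TreeSummary("t1", "site_a", frozenset({"age", "pnodes"})),
    TreeSummary("t2", "site_a", frozenset({"age", "tsize", "progrec"})),
    TreeSummary("t3", "site_b", frozenset({"pnodes", "estrec"})),
]
features_by_site = {
    "site_a": {"age", "pnodes", "tsize", "progrec"},
    "site_b": {"age", "pnodes", "estrec"},
}
shared = redistribute(pool, features_by_site)
print({site: [t.tree_id for t in trees] for site, trees in shared.items()})
# {'site_a': ['t1', 't2'], 'site_b': ['t1', 't3']}
```

In this toy case each site keeps its own trees and gains only the foreign trees that avoid features it lacks, which is why the amount of overlap directly controls the size of each site's final forest.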

If this is right

  • Survival prediction models can be trained from data distributed across institutions without centralizing records.
  • The approach maintains discrimination performance similar to pooled training under simulated feature heterogeneity.
  • It provides a Python implementation usable for real medical survival analysis tasks.
  • Different clinical variables or sequencing panels at sites no longer prevent joint modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This technique could be adapted for other tree-based methods in federated healthcare analytics.
  • Further tests on genuinely multi-institutional datasets would clarify how well the simulation holds.
  • Such methods may encourage more collaborative predictive modeling in regulated medical environments.

Load-bearing premise

Withholding random subsets of features from the GBSG2 cohort produces a realistic simulation of the partial feature overlap that occurs across real medical institutions with different sequencing panels or clinical variables.
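
A rough sketch of what uniform random withholding looks like, assuming each simulated site independently drops a fixed number of covariates. The function names, the number of sites, and the number of withheld features are assumptions for illustration; the paper's exact protocol is not given in the abstract.

```python
import random

# Covariates of the GBSG2 cohort (outcome columns omitted).
GBSG2_FEATURES = [
    "age", "estrec", "horTh", "menostat",
    "pnodes", "progrec", "tgrade", "tsize",
]


def withhold_features(features, n_sites, n_withheld, seed=0):
    """Give each simulated site the feature list minus a random subset."""
    if not 0 <= n_withheld < len(features):
        raise ValueError("n_withheld must leave at least one feature per site")
    rng = random.Random(seed)
    site_features = {}
    for i in range(n_sites):
        dropped = set(rng.sample(features, n_withheld))
        site_features[f"site_{i}"] = [f for f in features if f not in dropped]
    return site_features


def shared_by_all(site_features):
    """Features present at every site (the fully overlapping core)."""
    return set.intersection(*(set(v) for v in site_features.values()))


sites = withhold_features(GBSG2_FEATURES, n_sites=3, n_withheld=3, seed=1)
for name, kept in sites.items():
    print(name, kept)
print("shared by all sites:", sorted(shared_by_all(sites)))
```

Because every feature is equally likely to be dropped, this scheme produces no correlated blocks of missing variables, which is the gap between the simulation and real institutional data that the premise glosses over.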

What would settle it

Collect data from multiple actual hospitals with known distinct feature sets, apply the federated method, and compare its C-index directly to a model trained on pooled data to see if the gap stays small.
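
For reference, the quantity being compared can be computed as below. This is a simplified pure-Python sketch of Harrell's C-index, assuming right-censored data; pairs with tied event times are skipped, which production implementations such as the one in scikit-survival handle more carefully. The function name and the toy values are illustrative, not taken from the paper.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs that the risk scores order correctly.

    A pair is comparable when the subject with the shorter time had an
    observed event. It is concordant when that subject also has the higher
    risk score; tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: four subjects, the third one censored.
times = [2, 4, 6, 8]
events = [True, True, False, True]
risk = [0.9, 0.7, 0.8, 0.1]
print(harrell_c_index(times, events, risk))  # 0.8
```

Settling the question would then amount to computing this score for the federated and the pooled model on the same held-out patients and checking that the difference stays small across sites.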

Original abstract

Multi-center survival prediction can improve robustness and generalizability, yet privacy regulations and institutional governance often prevent pooling patient-level clinical and genomic data across institutions. In practice, deployment is further complicated by feature-space heterogeneity, in which sites collect different covariates or use different sequencing panels, resulting in only partially overlapping feature sets. We present FederatedRSF, a Python package that implements federated random survival forests, aggregating locally trained survival trees and redistributing only feature-compatible trees to each site, enabling inference with partial overlap without sharing raw data. We evaluate FederatedRSF on the GBSG2 breast cancer cohort distributed with the scikit-survival package, simulating feature heterogeneity across clients by withholding subsets of features, and assessing discrimination using Harrell's concordance index (C-Index) under repeated cross-validation and site-splits. The results demonstrated that the federated model can achieve performance comparable to that of the centralized training setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces FederatedRSF, a Python package implementing federated random survival forests for multi-center survival prediction under privacy constraints and partial feature overlap. Locally trained survival trees are aggregated and only feature-compatible trees are redistributed to sites, enabling inference without raw data sharing. Evaluation on the GBSG2 breast cancer cohort (from scikit-survival) simulates heterogeneity by randomly withholding feature subsets across clients; Harrell's C-Index under repeated cross-validation and site splits is reported as comparable between federated and centralized settings.

Significance. If the comparability result holds, the approach would address a practical barrier to collaborative survival modeling in medicine where institutions cannot pool data due to privacy rules and have non-identical feature spaces. The open-source Python package is a concrete strength that supports reproducibility and potential adoption.

major comments (1)
  1. [Abstract / experimental evaluation] The central comparability claim rests on random feature withholding to simulate partial overlap. Real institutional heterogeneity typically exhibits structured, correlated missingness (e.g., one site has a full genomic panel while another has only clinical variables plus a disjoint panel), which can reduce the number of compatible trees and change both local tree quality and the aggregation step in ways uniform random subsets do not capture. This assumption is load-bearing for the reported performance equivalence and requires either justification via additional structured-missingness experiments or real multi-site data.
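
The concern about compatible-tree counts can be made concrete with a toy calculation. The sketch below is an assumption-laden stand-in, not the paper's experiment: local training is replaced by drawing a random feature subset per tree, and the feature names (`clin_*`, `gene_a_*`, `gene_b_*`) and both helper functions are hypothetical.

```python
import random


def simulate_tree_pool(features_by_site, trees_per_site, features_per_tree, seed=0):
    """Stand-in for local training: a tree is just the set of features it uses."""
    rng = random.Random(seed)
    pool = []
    for site in sorted(features_by_site):
        available = sorted(features_by_site[site])
        k = min(features_per_tree, len(available))
        for _ in range(trees_per_site):
            pool.append(frozenset(rng.sample(available, k)))
    return pool


def compatible_fraction(pool, features_by_site):
    """Fraction of pooled trees each site could evaluate with its own features."""
    return {
        site: sum(tree <= frozenset(feats) for tree in pool) / len(pool)
        for site, feats in features_by_site.items()
    }


# Structured layout: shared clinical variables plus disjoint genomic panels.
clinical = {f"clin_{i}" for i in range(4)}
panel_a = {f"gene_a_{i}" for i in range(4)}
panel_b = {f"gene_b_{i}" for i in range(4)}
structured = {
    "site_panel_a": clinical | panel_a,
    "site_panel_b": clinical | panel_b,
}

pool = simulate_tree_pool(structured, trees_per_site=50, features_per_tree=3)
print(compatible_fraction(pool, structured))
```

Under this layout a foreign tree is usable only if it happens to split on clinical variables alone, so each site's compatible fraction stays close to one half. Running the same calculation on randomly withheld feature sets would show how far the two regimes diverge.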
minor comments (2)
  1. [Abstract] The description of the evaluation protocol omits the number of cross-validation repetitions, the exact site-split configuration, variance or confidence intervals on the C-Index, and any statistical test for equivalence to the centralized baseline.
  2. [Methods] The manuscript would benefit from an explicit statement of the aggregation rule (e.g., how trees are selected or weighted when feature sets differ) and any hyperparameters controlling tree redistribution.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for highlighting an important aspect of our experimental design. We address the comment below.

Point-by-point responses
  1. Referee: [Abstract / experimental evaluation] The central comparability claim rests on random feature withholding to simulate partial overlap. Real institutional heterogeneity typically exhibits structured, correlated missingness (e.g., one site has a full genomic panel while another has only clinical variables plus a disjoint panel), which can reduce the number of compatible trees and change both local tree quality and the aggregation step in ways uniform random subsets do not capture. This assumption is load-bearing for the reported performance equivalence and requires either justification via additional structured-missingness experiments or real multi-site data.

    Authors: We agree that random feature withholding constitutes a controlled but simplified simulation of partial overlap and does not replicate the structured, correlated missingness patterns typical of real institutional data (e.g., disjoint genomic panels). Our simulation isolates the effect of varying overlap ratios on tree compatibility and aggregation while holding other factors fixed, which is a standard approach in federated learning studies when real multi-center datasets are unavailable due to privacy constraints. The FederatedRSF procedure itself is agnostic to the missingness mechanism and simply redistributes only those trees whose splitting features are present at a given site. We recognize that structured missingness could alter the number of compatible trees and local model quality in ways not tested here. We cannot provide real multi-site data or additional structured-missingness experiments within the current scope. We will add an explicit discussion of this limitation to the experimental evaluation section.

    revision: partial

standing simulated objections not resolved
  • Additional structured-missingness experiments or real multi-site data

Circularity Check

0 steps flagged

No significant circularity; empirical method and simulation stand alone

full rationale

The paper introduces FederatedRSF as an implementation that aggregates locally trained survival trees and redistributes feature-compatible trees for partial overlap, then reports an empirical C-Index comparison on GBSG2 under random feature withholding. No equations, parameter fits, self-citations, or uniqueness theorems are invoked in the provided text that would reduce the central comparability claim to a tautology or input by construction. The evaluation is a direct simulation-based benchmark against centralized training, with no load-bearing steps that collapse into self-definition or renamed fits. This is a standard empirical presentation of a federated algorithm and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the approach implicitly rests on standard random-forest and federated-learning assumptions that are not detailed here.

pith-pipeline@v0.9.0 · 5708 in / 1113 out tokens · 29629 ms · 2026-05-25T06:02:44.232720+00:00 · methodology

