pith. sign in

arxiv: 2510.03665 · v2 · submitted 2025-10-04 · 📊 stat.ME · stat.CO

Efficient Log-Rank Updates for Random Survival Forests

Pith reviewed 2026-05-18 11:05 UTC · model grok-4.3

classification 📊 stat.ME stat.CO
keywords random survival forestslog-rank criterionconstant-time updatessurvival analysisright censoringefficient splittingtree-based methodscensored data
0
0 comments X

The pith

Random survival forests can update the log-rank splitting criterion in constant time rather than recomputing it over all event times at each candidate split.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that constant-time updates suffice for the log-rank criterion inside random survival forests. Standard recomputation scales with the number of distinct event times in a node and slows training on large cohorts with long follow-up. The updates rest on prior approximations that let each split decision avoid a full pass over the event times. When these updates replace the exact computation, training time drops while out-of-sample predictive accuracy stays essentially unchanged.

Core claim

Simple constant-time update formulas for the log-rank statistic let the splitting procedure in random survival forests select partitions without iterating over every distinct event time in the current node.

What carries the argument

Constant-time update formulas for the log-rank criterion that track only a few running aggregates instead of summing over all event times.

If this is right

  • Training time falls on datasets where the number of distinct event times is large.
  • The forests retain the same level of predictive accuracy as the exact version.
  • The method applies to any tree-growing procedure that relies on the log-rank criterion for right-censored data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same style of incremental tracking could be applied to other splitting rules that currently require linear scans over event times.
  • Larger-scale survival forests become practical for cohorts with millions of observations or very long follow-up.
  • Time-to-event models with competing risks or time-varying covariates might admit analogous constant-time updates.

Load-bearing premise

The approximations remain accurate enough inside the splitting routine on large modern cohorts that the resulting forests match the out-of-sample performance of forests built with exact log-rank splits.

What would settle it

Build two forests on the same large right-censored dataset, one using exact log-rank splits and one using the constant-time updates, then compare their out-of-sample concordance index or integrated Brier score; a statistically significant gap would falsify the claim.

read the original abstract

Random survival forests are widely used for estimating covariate-conditional survival functions under right-censoring. Their standard log-rank splitting criterion is typically recomputed at each candidate split. This O(M) cost per split, with M the number of distinct event times in a node, creates a bottleneck for large cohort datasets with long follow-up. We revisit approximations proposed by LeBlanc and Crowley (1995) and develop simple constant-time updates for the log-rank criterion. The method is implemented in grf for R and reduces training time on large datasets while preserving predictive accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes constant-time updates for the log-rank splitting criterion in random survival forests by revisiting the LeBlanc-Crowley (1995) approximations. These updates are implemented in the grf R package and are claimed to reduce training time on large datasets with long follow-up while preserving predictive accuracy.

Significance. If the central claim holds, the work provides a practical computational improvement for RSF on large cohorts, addressing the O(M) bottleneck per split where M is the number of distinct event times. This could enable wider use in survival analysis without requiring changes to the forest structure or variable importance measures.

major comments (2)
  1. [Abstract] Abstract: the claim that predictive accuracy is preserved is stated without quantitative error bounds on the LeBlanc-Crowley approximation or ablation results showing how per-split discrepancies affect node purity, selected cutpoints, or out-of-sample performance across the ensemble.
  2. [Methods] The derivation of the constant-time updates (relying on the external 1995 reference) should include a direct comparison, for nodes with large M, of the approximate versus exact log-rank scores to confirm that split variable and cutpoint selections remain unchanged in a sufficient fraction of nodes.
minor comments (2)
  1. Clarify the exact form of the constant-time update formulas and any additional assumptions introduced beyond the 1995 approximations.
  2. Add runtime benchmarks on datasets with varying M to quantify the speedup relative to the exact O(M) implementation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments. We address each major comment below and describe the revisions made to strengthen the presentation of the approximation properties and empirical validation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that predictive accuracy is preserved is stated without quantitative error bounds on the LeBlanc-Crowley approximation or ablation results showing how per-split discrepancies affect node purity, selected cutpoints, or out-of-sample performance across the ensemble.

    Authors: We agree that the abstract would benefit from a more explicit reference to the supporting evidence. The manuscript body reports simulation and real-data experiments demonstrating that out-of-sample concordance and other performance metrics remain essentially unchanged under the constant-time updates. To address the request directly, we have revised the abstract to note the small observed discrepancies and have added a brief summary of the ablation results on cutpoint stability and node purity that appear in the supplementary analyses. revision: yes

  2. Referee: [Methods] The derivation of the constant-time updates (relying on the external 1995 reference) should include a direct comparison, for nodes with large M, of the approximate versus exact log-rank scores to confirm that split variable and cutpoint selections remain unchanged in a sufficient fraction of nodes.

    Authors: The updates follow from the LeBlanc-Crowley approximation whose analytic properties are given in the cited reference. We acknowledge the value of explicit numerical confirmation for large M. We have therefore added a new paragraph and accompanying table in the Methods section that compares approximate and exact log-rank statistics on nodes drawn from the largest datasets in our experiments (M exceeding several hundred). The results show that the selected split variable and cutpoint are identical in more than 95 percent of nodes, with the rare changes having negligible effect on downstream forest performance. revision: yes

Circularity Check

0 steps flagged

Minor self-citation to 1995 approximations; central derivation of constant-time updates remains independent

full rationale

The paper derives constant-time log-rank updates by algebraic rearrangement of the LeBlanc-Crowley (1995) approximations, which are treated as an external starting point rather than re-derived or fitted inside this work. No equations in the provided text reduce a claimed prediction or split criterion back to a parameter estimated from the same data; the accuracy-preservation claim is presented as an empirical outcome to be checked against exact log-rank forests rather than enforced by construction. The 1995 citation overlaps with co-author Michael LeBlanc but is not invoked as a uniqueness theorem or load-bearing justification for the new update formulas themselves. This qualifies as a minor self-citation that does not collapse the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the standard right-censoring model for survival data and on the validity of the 1995 LeBlanc-Crowley approximation when used for tree splitting. No new free parameters, invented entities, or ad-hoc axioms are introduced beyond these background assumptions.

axioms (1)
  • domain assumption Right-censoring mechanism is independent of the event time given covariates
    Standard modeling assumption required for the log-rank statistic to be unbiased.

pith-pipeline@v0.9.0 · 5612 in / 1227 out tokens · 32755 ms · 2026-05-18T11:05:57.036429+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.