Efficient Log-Rank Updates for Random Survival Forests

Erik Sverdrup; James Yang; Michael LeBlanc

arxiv: 2510.03665 · v2 · submitted 2025-10-04 · 📊 stat.ME · stat.CO

Efficient Log-Rank Updates for Random Survival Forests

Erik Sverdrup , James Yang , Michael LeBlanc This is my paper

Pith reviewed 2026-05-18 11:05 UTC · model grok-4.3

classification 📊 stat.ME stat.CO

keywords random survival forestslog-rank criterionconstant-time updatessurvival analysisright censoringefficient splittingtree-based methodscensored data

0 comments

The pith

Random survival forests can update the log-rank splitting criterion in constant time rather than recomputing it over all event times at each candidate split.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that constant-time updates suffice for the log-rank criterion inside random survival forests. Standard recomputation scales with the number of distinct event times in a node and slows training on large cohorts with long follow-up. The updates rest on prior approximations that let each split decision avoid a full pass over the event times. When these updates replace the exact computation, training time drops while out-of-sample predictive accuracy stays essentially unchanged.

Core claim

Simple constant-time update formulas for the log-rank statistic let the splitting procedure in random survival forests select partitions without iterating over every distinct event time in the current node.

What carries the argument

Constant-time update formulas for the log-rank criterion that track only a few running aggregates instead of summing over all event times.

If this is right

Training time falls on datasets where the number of distinct event times is large.
The forests retain the same level of predictive accuracy as the exact version.
The method applies to any tree-growing procedure that relies on the log-rank criterion for right-censored data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same style of incremental tracking could be applied to other splitting rules that currently require linear scans over event times.
Larger-scale survival forests become practical for cohorts with millions of observations or very long follow-up.
Time-to-event models with competing risks or time-varying covariates might admit analogous constant-time updates.

Load-bearing premise

The approximations remain accurate enough inside the splitting routine on large modern cohorts that the resulting forests match the out-of-sample performance of forests built with exact log-rank splits.

What would settle it

Build two forests on the same large right-censored dataset, one using exact log-rank splits and one using the constant-time updates, then compare their out-of-sample concordance index or integrated Brier score; a statistically significant gap would falsify the claim.

read the original abstract

Random survival forests are widely used for estimating covariate-conditional survival functions under right-censoring. Their standard log-rank splitting criterion is typically recomputed at each candidate split. This O(M) cost per split, with M the number of distinct event times in a node, creates a bottleneck for large cohort datasets with long follow-up. We revisit approximations proposed by LeBlanc and Crowley (1995) and develop simple constant-time updates for the log-rank criterion. The method is implemented in grf for R and reduces training time on large datasets while preserving predictive accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

They give random survival forests a constant-time log-rank update by adapting the 1995 LeBlanc-Crowley approximations, which should cut training time on large cohorts if the accuracy claim holds.

read the letter

This paper is mainly a practical speed-up for random survival forests. They take the log-rank splitting criterion, which normally costs O(M) per candidate split where M is the number of distinct event times, and turn it into a constant-time update using approximations from LeBlanc and Crowley 1995. The implementation lands in the grf R package, and they report faster training on big datasets with no real loss in predictive accuracy. That is the new piece: the adaptation and the open code that lets people actually use it inside the forest-building routine. It is a targeted engineering fix rather than a new statistical framework, but it removes a clear bottleneck for anyone working with large survival cohorts that have long follow-up. The efficiency side looks mechanically sound once the update formulas are accepted. The paper does a decent job showing the computational gain in practice. The soft spot is the accuracy claim. It rests on the approximations not changing which splits get chosen in enough nodes to matter for the final forest or variable importance. The abstract asserts that predictive performance stays the same, but there are no explicit error bounds or detailed checks on how approximation error behaves when M is large. That regime is exactly where the original 1995 work might be stretched, and small per-split discrepancies could compound across thousands of trees. If the experiments in the full paper only cover moderate-sized data, this remains an open question rather than a settled point. This work is for applied people who already use random survival forests on sizable medical or similar datasets and have run into compute limits. Readers who care about scaling existing tools will get direct value from the code and the reported speed-ups. It is not revolutionary, but it is the kind of incremental improvement that deserves a serious referee to check the implementation details and the accuracy experiments. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes constant-time updates for the log-rank splitting criterion in random survival forests by revisiting the LeBlanc-Crowley (1995) approximations. These updates are implemented in the grf R package and are claimed to reduce training time on large datasets with long follow-up while preserving predictive accuracy.

Significance. If the central claim holds, the work provides a practical computational improvement for RSF on large cohorts, addressing the O(M) bottleneck per split where M is the number of distinct event times. This could enable wider use in survival analysis without requiring changes to the forest structure or variable importance measures.

major comments (2)

[Abstract] Abstract: the claim that predictive accuracy is preserved is stated without quantitative error bounds on the LeBlanc-Crowley approximation or ablation results showing how per-split discrepancies affect node purity, selected cutpoints, or out-of-sample performance across the ensemble.
[Methods] The derivation of the constant-time updates (relying on the external 1995 reference) should include a direct comparison, for nodes with large M, of the approximate versus exact log-rank scores to confirm that split variable and cutpoint selections remain unchanged in a sufficient fraction of nodes.

minor comments (2)

Clarify the exact form of the constant-time update formulas and any additional assumptions introduced beyond the 1995 approximations.
Add runtime benchmarks on datasets with varying M to quantify the speedup relative to the exact O(M) implementation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments. We address each major comment below and describe the revisions made to strengthen the presentation of the approximation properties and empirical validation.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that predictive accuracy is preserved is stated without quantitative error bounds on the LeBlanc-Crowley approximation or ablation results showing how per-split discrepancies affect node purity, selected cutpoints, or out-of-sample performance across the ensemble.

Authors: We agree that the abstract would benefit from a more explicit reference to the supporting evidence. The manuscript body reports simulation and real-data experiments demonstrating that out-of-sample concordance and other performance metrics remain essentially unchanged under the constant-time updates. To address the request directly, we have revised the abstract to note the small observed discrepancies and have added a brief summary of the ablation results on cutpoint stability and node purity that appear in the supplementary analyses. revision: yes
Referee: [Methods] The derivation of the constant-time updates (relying on the external 1995 reference) should include a direct comparison, for nodes with large M, of the approximate versus exact log-rank scores to confirm that split variable and cutpoint selections remain unchanged in a sufficient fraction of nodes.

Authors: The updates follow from the LeBlanc-Crowley approximation whose analytic properties are given in the cited reference. We acknowledge the value of explicit numerical confirmation for large M. We have therefore added a new paragraph and accompanying table in the Methods section that compares approximate and exact log-rank statistics on nodes drawn from the largest datasets in our experiments (M exceeding several hundred). The results show that the selected split variable and cutpoint are identical in more than 95 percent of nodes, with the rare changes having negligible effect on downstream forest performance. revision: yes

Circularity Check

0 steps flagged

Minor self-citation to 1995 approximations; central derivation of constant-time updates remains independent

full rationale

The paper derives constant-time log-rank updates by algebraic rearrangement of the LeBlanc-Crowley (1995) approximations, which are treated as an external starting point rather than re-derived or fitted inside this work. No equations in the provided text reduce a claimed prediction or split criterion back to a parameter estimated from the same data; the accuracy-preservation claim is presented as an empirical outcome to be checked against exact log-rank forests rather than enforced by construction. The 1995 citation overlaps with co-author Michael LeBlanc but is not invoked as a uniqueness theorem or load-bearing justification for the new update formulas themselves. This qualifies as a minor self-citation that does not collapse the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the standard right-censoring model for survival data and on the validity of the 1995 LeBlanc-Crowley approximation when used for tree splitting. No new free parameters, invented entities, or ad-hoc axioms are introduced beyond these background assumptions.

axioms (1)

domain assumption Right-censoring mechanism is independent of the event time given covariates
Standard modeling assumption required for the log-rank statistic to be unbiased.

pith-pipeline@v0.9.0 · 5612 in / 1227 out tokens · 32755 ms · 2026-05-18T11:05:57.036429+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We revisit approximations proposed by LeBlanc and Crowley (1995) and develop simple constant-time updates for the log-rank criterion... eL(c) = [numerator] / sqrt(1/E1 + 1/E2)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.