One-sample survival tests in the presence of non-proportional hazards in oncology clinical trials

Chloé Szurewsky (U1018 (Équipe 2)); Guosheng Yin (DSAS); Gwénaël Le Teuff (U1018 (Équipe 2))

arxiv: 2506.18608 · v3 · submitted 2025-06-23 · 📊 stat.AP · stat.ME

Pith reviewed 2026-05-19 07:55 UTC · model grok-4.3

keywords: one-sample log-rank test, non-proportional hazards, single-arm trials, max-Combo test, survival analysis, oncology, score test, restricted mean survival time

The pith

Max-Combo test outperforms one-sample log-rank across non-proportional hazards in single-arm oncology trials.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Single-arm oncology trials often compare new treatments to external historical controls using the one-sample log-rank test, yet this method loses power when hazards are not proportional. The paper extends the score-test version of the test to cover early, middle, delayed, and crossing effects through piecewise exponential and accelerated hazards models, then combines these with a restricted mean survival time statistic into a max-Combo procedure. Simulations demonstrate that the max-Combo test maintains higher power than the standard test in every scenario examined. The approach therefore supplies a practical way to analyze time-to-event data in trials where randomized controls are difficult to obtain. Performance still depends heavily on how well the external control survival curve is known.
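
For orientation, the standard one-sample log-rank statistic that serves as the baseline here can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `oslrt`, the exponential control curve, the toy data, and the sign convention Z = (E - O) / sqrt(E), where positive values favour the experimental arm, are all assumptions made for this example.

```python
import math

def oslrt(times, events, cum_hazard):
    """One-sample log-rank test against a control cumulative hazard assumed known.

    times      : follow-up time of each patient (event or censoring time)
    events     : 1 if the event was observed, 0 if censored
    cum_hazard : function t -> Lambda_0(t) for the external control
    """
    observed = sum(events)                        # O: events seen in the trial
    expected = sum(cum_hazard(t) for t in times)  # E: events expected under control
    z = (expected - observed) / math.sqrt(expected)
    # One-sided p-value from the standard normal approximation: 1 - Phi(z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return observed, expected, z, p_value

# Hypothetical toy data and an exponential control with hazard 0.5 per unit time.
times = [0.5, 1.0, 2.0, 3.0]
events = [1, 0, 1, 0]
observed, expected, z, p_value = oslrt(times, events, lambda t: 0.5 * t)
```

With these toy values O = 2 and E = 3.25, so fewer events were observed than the control curve predicts, and Z is positive under the convention assumed above.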

Core claim

By constructing score tests under piecewise exponential and accelerated hazards models and combining them with a restricted mean survival time statistic into a max-Combo procedure, the resulting test is more powerful than the one-sample log-rank test for single-arm trials under any examined non-proportional hazards pattern.

What carries the argument

The max-Combo test, which takes the maximum of adjusted statistics from several component score tests each matched to a different non-proportional hazards pattern.
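
The idea of taking a maximum over component statistics can be sketched as below. This page does not say how the paper adjusts the components or derives the null distribution of their maximum, so the sketch assumes the components are jointly standard normal under the null with a known correlation matrix, and approximates the p-value by Monte Carlo. The function names, the example statistics, and the correlation value are hypothetical.

```python
import math
import random

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def max_combo_p_value(z_stats, corr, n_draws=20000, seed=0):
    """One-sided p-value for max(z_stats) when the null is N(0, corr) (assumed)."""
    rng = random.Random(seed)
    lower = cholesky(corr)
    k = len(z_stats)
    z_max = max(z_stats)
    exceed = 0
    for _ in range(n_draws):
        eps = [rng.gauss(0.0, 1.0) for _ in range(k)]
        # Correlated null draw: lower-triangular factor times independent normals.
        draw = [sum(lower[i][j] * eps[j] for j in range(i + 1)) for i in range(k)]
        if max(draw) >= z_max:
            exceed += 1
    return exceed / n_draws

# Two hypothetical component statistics with an assumed correlation of 0.5.
p_combo = max_combo_p_value([1.0, 2.0], [[1.0, 0.5], [0.5, 1.0]])
p_single = max_combo_p_value([1.96], [[1.0]])
```

The price of robustness is visible here: the p-value attached to the maximum is larger than the p-value the best component would have received on its own, because the null distribution of a maximum is shifted to the right.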

If this is right

Single-arm trials can now be powered against a wider range of treatment-effect shapes including delayed and crossing hazards.
Trial designers gain a menu of score tests that can be chosen or combined according to the expected pattern of benefit.
The same framework supplies a direct way to incorporate restricted mean survival time into the comparison with historical controls.
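
The restricted mean survival time up to a horizon tau is the area under the survival curve on [0, tau]. A minimal sketch, assuming an exponential control curve and a Kaplan-Meier estimate for the experimental arm. The function names and toy data are hypothetical, and the paper's actual test statistic, including its variance, is not reproduced here.

```python
import math

def exp_rmst(rate, tau):
    """RMST of an exponential survival curve: integral of exp(-rate * t) on [0, tau]."""
    return (1.0 - math.exp(-rate * tau)) / rate

def km_rmst(times, events, tau):
    """Area under the Kaplan-Meier curve from 0 to tau."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, last_t, area = 1.0, 0.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        deaths = leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        area += surv * (t - last_t)           # rectangle up to this time
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk  # Kaplan-Meier step
        n_at_risk -= leaving
        last_t = t
    area += surv * (tau - last_t)             # final rectangle up to tau
    return area

# Hypothetical data: three uncensored events at times 1, 2 and 3.
rmst_trial = km_rmst([1.0, 2.0, 3.0], [1, 1, 1], tau=3.0)   # 1 + 2/3 + 1/3 = 2.0
rmst_control = exp_rmst(rate=0.5, tau=3.0)
difference = rmst_trial - rmst_control
```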

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be adapted for other time-to-event endpoints such as progression-free survival if similar historical data exist.
Routine sensitivity analyses that vary the historical curve within its estimation uncertainty would strengthen claims based on these tests.
The combination approach suggests a template for constructing robust tests in other single-sample settings outside oncology.

Load-bearing premise

The survival curve of the external control group is known accurately from historical data with little uncertainty or model error.

What would settle it

A simulation or real-data re-analysis in which the external control survival curve is deliberately misspecified to check whether type I error inflates or power collapses for the max-Combo procedure.
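
Such a check could look like the sketch below, which is restricted to the plain one-sample log-rank test for brevity. It assumes exponential survival, administrative censoring at a fixed follow-up time, a one-sided 5% critical value, and the convention Z = (E - O) / sqrt(E). None of the parameter values come from the paper, and the function name is hypothetical.

```python
import math
import random

def rejection_rate(true_rate, assumed_rate, n=50, follow_up=3.0,
                   n_sim=2000, seed=1):
    """Share of simulated trials in which the OSLRT rejects at one-sided 5%.

    Data are generated with hazard `true_rate` (no treatment effect), while the
    test uses `assumed_rate` as the external control hazard.
    """
    rng = random.Random(seed)
    crit = 1.6449  # upper 5% point of the standard normal
    rejections = 0
    for _ in range(n_sim):
        observed, expected = 0, 0.0
        for _ in range(n):
            t = rng.expovariate(true_rate)
            observed += 1 if t <= follow_up else 0
            expected += assumed_rate * min(t, follow_up)
        z = (expected - observed) / math.sqrt(expected)
        if z > crit:
            rejections += 1
    return rejections / n_sim

# Control hazard specified correctly, then overstated by 20% (hypothetical values).
nominal = rejection_rate(true_rate=0.5, assumed_rate=0.5)
misspecified = rejection_rate(true_rate=0.5, assumed_rate=0.6)
```

In this setup the experimental arm has no benefit at all, so any rejection is a false positive. Overstating the control hazard makes the trial look better than it is, and the rejection rate rises well above the nominal level.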

Original abstract

In oncology, conduct well-powered time-to-event randomized clinical trials may be challenging due to limited patietns number. Many designs for single-arm trials (SATs) have recently emerged as an alternative to overcome this issue. They rely on the (modified) one-sample log-rank test (OSLRT) under the proportional hazards to compare the survival curves of an experimental and an external control group. We extend Finkelstein's formulation of OSLRT as a score test by using a piecewise exponential model for early, middle and delayed treatment effects and an accelerated hazards model for crossing hazards. We adapt the restricted mean survival time based test and construct a combination test procedure (max-Combo) to SATs. The performance of the developed are evaluated through a simulation study. The score tests are as conservative as the OSLRT and have the highest power when the data generation matches the model underlying score tests. The max-Combo test is more powerful than the OSLRT whatever the scenarios and is thus an interesting approach as compared to a score test. Uncertainty on the survival curve estimated of the external control group and its model misspecification may have a significant impact on performance. For illustration, we apply the developed tests on real data examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This extends one-sample log-rank tests for single-arm oncology trials to non-proportional hazards via piecewise and accelerated models plus a max-Combo, but the power edge for the combo test is shown only when the external control curve is treated as known exactly.

The paper takes Finkelstein's score-test version of the one-sample log-rank test and adds piecewise exponential models that target early, middle, or delayed effects plus an accelerated-hazards model for crossing curves. It also adapts the restricted-mean test and wraps everything in a max-Combo procedure for single-arm settings. That combination is the concrete new piece relative to the earlier formulation they cite.

The simulations then compare these tests against the plain OSLRT and report that the tailored score tests recover power when the data-generating process matches their assumptions, while the max-Combo stays competitive or better across the scenarios they ran. That is useful work for people who actually design single-arm oncology trials and need something that does not collapse under the non-proportional hazards that show up in real data.

The authors are clear that the external control curve comes from historical data and that uncertainty or misspecification around it can matter. The simulations, however, appear to fix that curve as known when they generate data and compute the test statistics. Because the abstract itself flags the potential impact of estimation error, the reported power advantage for max-Combo over OSLRT is conditional on an assumption that will not hold in practice. Adding even a modest bootstrap or historical-cohort sampling step to the power comparisons would make the results more credible for the setting the paper claims to address.

The derivations look straightforward and the citation pattern is appropriate; nothing in the abstract suggests internal contradictions or circular fitting. This is aimed at statisticians who work on single-arm time-to-event designs in oncology. A reader who needs concrete alternatives to the standard OSLRT under non-proportional hazards will find the extensions and the simulation layout worth looking at.

The paper is coherent enough on its own terms to deserve a serious referee, mainly so the simulation design can be tightened to include realistic estimation of the control curve. I would send it out for review with that specific request rather than desk-reject it.

Referee Report

1 major / 2 minor

Summary. The paper extends Finkelstein's one-sample log-rank test (OSLRT) for single-arm oncology trials to non-proportional hazards settings by deriving score tests under piecewise exponential models (early, middle, delayed effects) and an accelerated hazards model (crossing hazards). It further adapts a restricted mean survival time test and constructs a max-Combo combination procedure. Performance is evaluated in simulations across these scenarios, with the central claim that the max-Combo test is more powerful than the OSLRT in all cases while score tests match OSLRT conservatism but gain power under model match; uncertainty from estimating the external control curve is flagged as potentially impactful. Real-data illustrations are provided.
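
One way to read the piecewise exponential extension: if the alternative multiplies the control hazard by exp(beta) only inside a time window (start, end], the score statistic at beta = 0 reduces to observed minus expected events restricted to that window, with the windowed expected count as its variance. The sketch below assumes this form, an exponential control, and the convention Z = (E - O) / sqrt(E). The paper's exact parameterisation and change-point choices may differ, and the function name is hypothetical.

```python
import math

def windowed_oslrt_z(times, events, rate, start=0.0, end=math.inf):
    """Score-type statistic for an effect confined to the window (start, end].

    Only events inside the window are counted, and only the control cumulative
    hazard accrued inside the window contributes to the expected count.
    The expected count must be positive, i.e. some follow-up must reach the window.
    """
    observed = sum(d for t, d in zip(times, events) if start < t <= end)
    expected = sum(rate * (min(t, end) - min(t, start)) for t in times)
    return (expected - observed) / math.sqrt(expected)

# Hypothetical toy data with an exponential control hazard of 0.5.
times = [0.5, 1.0, 2.0, 3.0]
events = [1, 0, 1, 0]
z_overall = windowed_oslrt_z(times, events, rate=0.5)             # whole follow-up
z_delayed = windowed_oslrt_z(times, events, rate=0.5, start=1.0)  # effect after t = 1
z_early = windowed_oslrt_z(times, events, rate=0.5, end=1.0)      # effect up to t = 1
```

With the default window the statistic coincides with the ordinary one-sample log-rank statistic, which is why the windowed versions can be viewed as members of the same family aimed at different effect timings.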

Significance. If the reported power advantages of max-Combo hold after propagating estimation uncertainty from the external control (a common feature of SATs), the work would supply practical, more robust alternatives to standard OSLRT for small-sample oncology trials with non-PH patterns. The explicit model-based extensions and combination test are technically straightforward and directly address a recognized limitation of PH-based one-sample tests.

major comments (1)

[Abstract / Simulation study] The claim that 'the max-Combo test is more powerful than the OSLRT whatever the scenarios' rests on simulations that treat the external control survival curve as known and fixed when generating data and computing statistics. The abstract itself states that 'Uncertainty on the survival curve estimated of the external control group and its model misspecification may have a significant impact on performance,' yet the reported results do not appear to incorporate finite-sample estimation error (e.g., via bootstrap resampling of historical data or sampling from an estimated control distribution). This omission is load-bearing for the superiority claim, as the advantage may not persist in the realistic SAT setting where the control curve must itself be estimated from limited historical data.
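
The estimation error at issue can be illustrated with a small sketch in which each replicate first draws a historical cohort, fits an exponential rate to it, and then uses that fitted rate as the control hazard in the one-sample log-rank test. Everything here is an assumption made for illustration: exponential survival, a fully observed historical cohort, the cohort and trial sizes, and the function name. It uses parametric refitting rather than a bootstrap and covers only the plain OSLRT.

```python
import math
import random

def type_one_error(hist_n=None, rate=0.5, n=50, follow_up=3.0,
                   n_sim=2000, seed=2):
    """One-sided 5% rejection rate of the OSLRT when there is no treatment effect.

    hist_n=None : control hazard treated as known (equal to `rate`)
    hist_n=m    : control hazard re-estimated in each replicate from m
                  uncensored historical patients (exponential MLE)
    """
    rng = random.Random(seed)
    crit = 1.6449  # upper 5% point of the standard normal
    rejections = 0
    for _ in range(n_sim):
        if hist_n is None:
            rate_hat = rate
        else:
            hist_times = [rng.expovariate(rate) for _ in range(hist_n)]
            rate_hat = hist_n / sum(hist_times)  # events / total time at risk
        observed, expected = 0, 0.0
        for _ in range(n):
            t = rng.expovariate(rate)
            observed += 1 if t <= follow_up else 0
            expected += rate_hat * min(t, follow_up)
        z = (expected - observed) / math.sqrt(expected)
        if z > crit:
            rejections += 1
    return rejections / n_sim

known_curve = type_one_error(hist_n=None)
estimated_curve = type_one_error(hist_n=30)
```

Because the test statistic ignores the variability of the fitted control hazard, the false-positive rate with a small historical cohort exceeds the rate obtained when the curve is treated as known. The same design could be extended to the max-Combo procedure to see whether its power advantage survives.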

minor comments (2)

[Abstract] Typo: 'patietns' should be 'patients'. Grammar: 'The performance of the developed are evaluated' is missing a noun after 'developed' and should be revised for subject-verb agreement.
The manuscript would benefit from explicit statements of the exact simulation parameters (sample sizes, censoring rates, number of replications, and how the external control curve is generated or fixed) to allow full reproducibility of the power comparisons.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript on extending one-sample survival tests to non-proportional hazards settings. The major comment raises an important point about the realism of our simulation design. We address it directly below and will revise the manuscript to incorporate additional analyses that propagate estimation uncertainty from the external control curve.

Referee: Abstract and simulation study: The claim that 'the max-Combo test is more powerful than the OSLRT whatever the scenarios' rests on simulations that treat the external control survival curve as known and fixed when generating data and computing statistics. The abstract itself states that 'Uncertainty on the survival curve estimated of the external control group and its model misspecification may have a significant impact on performance,' yet the reported results do not appear to incorporate finite-sample estimation error (e.g., via bootstrap resampling of historical data or sampling from an estimated control distribution). This omission is load-bearing for the superiority claim, as the advantage may not persist in the realistic SAT setting where the control curve must itself be estimated from limited historical data.

Authors: We agree that this is a valid and substantive concern. Our current simulations were intentionally constructed under the assumption of a known external control curve to isolate and evaluate the operating characteristics of the proposed score tests, restricted mean survival time test, and max-Combo procedure across the targeted non-proportional hazards patterns (early, middle, delayed, and crossing effects). This design choice follows the standard approach in many methodological papers on one-sample tests to first establish performance under idealized conditions before layering in additional sources of variability. The abstract does flag the potential impact of estimation uncertainty and model misspecification, but we acknowledge that the superiority claim for max-Combo would be strengthened by explicit quantification of this effect. We will therefore revise the simulation study to include scenarios in which the control survival curve is estimated from finite historical data (e.g., via bootstrap resampling or parametric fitting with sampling from the estimated distribution in each replicate). These new results will be reported alongside the existing ones, with appropriate discussion of how the relative power of max-Combo versus OSLRT changes under realistic estimation error. We believe this addition will directly address the referee's point without altering the core methodological contributions.
revision: yes

Circularity Check

0 steps flagged

No circularity: explicit model-based test constructions evaluated on independent simulations

The paper explicitly extends Finkelstein's OSLRT formulation into score tests under piecewise exponential and accelerated hazards models, adapts the RMST test, and defines the max-Combo combination procedure from these components. Power comparisons are obtained from simulation studies that generate data under specified scenarios (early/middle/delayed effects, crossing hazards) and treat these as external benchmarks. No derivation step reduces a claimed result to a fitted parameter from the same dataset, nor does any central claim rest on a self-citation chain or ansatz smuggled from prior author work. The noted uncertainty in external control curve estimation affects simulation realism but does not create a definitional loop within the paper's own equations or procedures.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The methods rest on standard survival analysis assumptions plus the availability of a reliable external control survival curve; no new free parameters or invented entities are introduced in the abstract description.

axioms (1)

domain assumption: The external control survival curve can be estimated without substantial bias from historical data.
Basis: the abstract notes that uncertainty and misspecification on this curve may significantly affect performance.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We extend Finkelstein’s formulation of OSLRT as a score test under PH by using a piecewise exponential model with change-points (CPs) for early, middle and delayed treatment effects and an accelerated hazards model for crossing hazards. … The max-Combo test is more powerful than the OSLRT whatever the scenarios"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "The developed score tests are as conservative as the OSLRT … Uncertainty on the survival curve estimate of the external control group and model misspecification may have a significant impact on performance."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.