GRAFT: Decoupling Ranking and Calibration for Survival Analysis

Mohammad Ashhad; Ricardo Henao; Robert Hoehndorf

arxiv: 2602.07884 · v2 · pith:Q5RQ54S7new · submitted 2026-02-08 · 💻 cs.LG · cs.AI

GRAFT: Decoupling Ranking and Calibration for Survival Analysis

Mohammad Ashhad , Robert Hoehndorf , Ricardo Henao This is my paper

Pith reviewed 2026-05-21 13:02 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords survival analysiscalibrationC-indexaccelerated failure timeresidual networksfeature selectioncensored datadeep learning

0 comments

The pith

GRAFT decouples ranking optimization from calibration in a hybrid survival model to improve both discrimination and probability estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Survival analysis requires both accurate ranking of patients by risk and well-calibrated estimates of survival probabilities over time, yet classical models handle calibration but limit flexibility while deep networks excel at ranking but often miscalibrate outputs. GRAFT addresses the tradeoff through a hybrid architecture that trains a linear accelerated failure time base augmented by a non-linear residual network and stochastic gates. The model first optimizes a differentiable ranking loss aligned with the C-index, handling censored data via stochastic imputation from local Kaplan-Meier estimators. Calibrated survival probabilities are then produced by a separate post-training adjustment step. This separation aims to deliver strong performance on both fronts while preserving sparsity and robustness under noisy conditions.

Core claim

GRAFT is a gated residual accelerated failure time model that combines a linear AFT component with a non-linear residual neural network and stochastic gates for automatic feature selection. It is trained by optimizing a C-index-aligned ranking loss that uses stochastic conditional imputation drawn from local Kaplan-Meier estimators to manage censoring, after which simple post-training calibration produces the final survival probability estimates.

What carries the argument

The hybrid linear AFT plus non-linear residual network with stochastic gates, which separates optimization of a C-index-aligned ranking loss during training from a subsequent post-training calibration step for survival probabilities.

If this is right

The approach outperforms baselines in both discrimination and calibration on public survival analysis benchmarks.
The model remains robust and produces sparse feature selections even when inputs contain high levels of noise.
Non-linear interactions are captured without sacrificing the calibration properties typically associated with simpler linear models.
Censored observations are handled directly through the stochastic imputation method during ranking optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The ranking-then-calibrate separation could extend to other time-to-event or risk-prediction settings where discrimination and probability accuracy often trade off against each other.
Testing the method on larger real-world clinical registries with varying censoring patterns would reveal whether the post-calibration adjustment continues to preserve ranking quality at scale.
Embedding this hybrid structure into clinical decision-support systems might yield prognostic outputs that clinicians can both order reliably and interpret probabilistically.

Load-bearing premise

Post-training calibration applied after optimizing the C-index-aligned ranking loss will produce accurate survival probabilities without reintroducing bias or degrading the learned ranking.

What would settle it

A benchmark experiment in which the post-training calibration step produces survival curves with substantially worse calibration metrics or measurably lower C-index ranking performance than the uncalibrated model would falsify the decoupling benefit.

read the original abstract

Survival analysis is complicated by censored data, high-dimensional features, and non-linear interactions. Classical models offer interpretability and superior calibration but are restricted to linear or predefined functional forms, while deep learning models are flexible and achieve strong discriminative performance, but tend to produce poorly calibrated survival estimates. To address this trade-off, we propose GRAFT (Gated Residual Accelerated Failure Time), a novel AFT model that decouples prognostic ranking from survival calibration. GRAFT's hybrid architecture combines a linear AFT model with a non-linear residual neural network, and it also integrates stochastic gates for automatic feature selection. The model is trained by optimizing a differentiable, C-index-aligned ranking loss using stochastic conditional imputation from local Kaplan-Meier estimators, while calibrated survival estimates are obtained through simple post-training calibration. In public benchmarks, GRAFT outperforms baselines in discrimination and calibration, while remaining robust and sparse in high-noise settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GRAFT's hybrid linear AFT plus residual network with post-training calibration is a reasonable attempt at the ranking-calibration tradeoff, but the separation needs explicit checks to confirm it holds up.

read the letter

The main point here is that GRAFT trains a hybrid model on a C-index-aligned ranking loss with stochastic imputation from local Kaplan-Meier estimators, then applies a separate calibration step afterward. The architecture mixes a linear AFT core for some interpretability with a non-linear residual network and stochastic gates for sparsity and feature selection. This setup aims to deliver both strong discrimination and usable probability estimates in censored survival data, which is a practical pain point in medical and reliability work. The paper positions the decoupling as the key move, and if the full experiments show gains over standard baselines like Cox, AFT, and deep survival models on public benchmarks, that would be a useful incremental result. The robustness claims in high-noise settings and the sparsity from the gates also sound like they could help in real high-dimensional cases. What the work does reasonably well is lay out a concrete training procedure that avoids forcing the ranking loss to directly produce calibrated outputs. That separation keeps the optimization focused. On the softer side, the post-training calibration step is the load-bearing part of the claim, and without reported checks on whether it preserves the learned ordering or interacts badly with the imputation distribution, it's easy to worry about hidden degradation. The abstract gives no numbers or ablation details, so the full paper's tables and sensitivity analyses will decide how much weight to give the outperformance statements. This is the kind of paper that would interest applied researchers who already use survival models and want both ranking and calibration without switching to fully non-parametric approaches. It is not a big conceptual shift but packages existing pieces in a new combination. I would send it to peer review because the problem is well-motivated, the method is described clearly enough to reproduce, and the empirical questions are straightforward to test.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GRAFT, a hybrid linear AFT model augmented with a residual neural network and stochastic gates for automatic feature selection in survival analysis. The model is trained by optimizing a differentiable ranking loss aligned to the C-index, using stochastic conditional imputation drawn from local Kaplan-Meier estimators; calibrated survival probabilities are then obtained via a separate post-training calibration step. The central empirical claim is that this architecture outperforms baselines in both discrimination and calibration on public benchmarks while remaining robust and sparse under high noise.

Significance. If the reported gains in joint discrimination and calibration hold with proper controls, the decoupling strategy could meaningfully address a persistent tension in survival modeling between flexible non-linear ranking and reliable probability estimates, particularly in high-dimensional censored settings. The hybrid linear-plus-residual design and gated sparsity offer potential interpretability advantages over pure deep models.

major comments (2)

[Abstract] Abstract and §4 (empirical results): the claim of outperformance in both discrimination and calibration on public benchmarks is stated without quantitative metrics, error bars, baseline specifications, or ablation tables in the provided text, preventing verification of the central empirical assertion.
[§3] §3 (training procedure): the post-training calibration is applied after C-index-aligned ranking optimization with stochastic local KM imputation; no bound, commutativity argument, or preservation experiment is referenced showing that the calibration map leaves the induced ranking order unchanged or avoids reintroducing censoring bias, which is load-bearing for the joint superiority claim.

minor comments (2)

[§3] Notation for the stochastic gate parameters and the exact form of the conditional imputation distribution should be defined more explicitly to allow reproduction.
[§2] The manuscript would benefit from a clearer statement of how the hybrid linear-AFT + residual-NN outputs are combined before the ranking loss is applied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address each major comment below and outline the revisions we will make to strengthen the presentation of our empirical results and the justification for the decoupling approach.

read point-by-point responses

Referee: [Abstract] Abstract and §4 (empirical results): the claim of outperformance in both discrimination and calibration on public benchmarks is stated without quantitative metrics, error bars, baseline specifications, or ablation tables in the provided text, preventing verification of the central empirical assertion.

Authors: We acknowledge that the abstract summarizes the performance claims at a high level. Section 4 of the manuscript already contains the supporting quantitative evidence, including tables reporting C-index and calibration metrics (Brier score and ICI) with standard errors across repeated runs, explicit baseline implementations (CoxPH, RSF, DeepSurv, and others), and ablation studies on the gated residual and stochastic imputation components. To make these results immediately verifiable, we will revise the abstract to include representative numerical improvements and direct references to the relevant tables and figures in §4. revision: yes
Referee: [§3] §3 (training procedure): the post-training calibration is applied after C-index-aligned ranking optimization with stochastic local KM imputation; no bound, commutativity argument, or preservation experiment is referenced showing that the calibration map leaves the induced ranking order unchanged or avoids reintroducing censoring bias, which is load-bearing for the joint superiority claim.

Authors: This concern is well taken. The calibration step applies a monotonic (isotonic) mapping to the survival estimates produced by the trained model. Because the mapping is strictly non-decreasing and applied uniformly, it leaves the relative ordering of predictions—and therefore the C-index—unchanged by construction. We will add a concise paragraph in §3 stating this invariance property together with a short empirical verification that C-index values computed before and after calibration are identical on the held-out test folds. The stochastic local KM imputation occurs exclusively during training of the ranking loss; calibration itself is performed on a separate validation set using standard censoring-aware estimators, so no additional censoring bias is introduced. We will cite the relevant monotonicity property and include the before/after C-index check in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: ranking loss and post-training calibration are independent steps

full rationale

The paper trains via a C-index-aligned ranking loss with stochastic imputation from local Kaplan-Meier estimators, then applies separate post-training calibration for survival probabilities. This decoupling is explicit in the abstract and architecture description; neither the ranking objective nor the final probabilities are defined in terms of each other by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to force the result. The performance claims rest on external benchmarks rather than reducing to fitted inputs. The derivation chain is self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The approach rests on standard neural-network optimization assumptions plus the domain claim that local Kaplan-Meier imputation plus post-hoc calibration can be treated as independent of the ranking objective.

free parameters (1)

stochastic gate parameters
Learned during training to perform automatic feature selection; their values are fitted to the data.

axioms (2)

domain assumption A differentiable surrogate loss aligned to the C-index can be optimized jointly with the hybrid AFT parameters
Invoked to train the ranking component.
domain assumption Post-training calibration step produces well-calibrated survival curves without feedback into the learned ranking
Required for the decoupling claim.

pith-pipeline@v0.9.0 · 5682 in / 1300 out tokens · 47860 ms · 2026-05-21T13:02:50.893951+00:00 · methodology

GRAFT: Decoupling Ranking and Calibration for Survival Analysis

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)