arxiv: 2506.02075 · v3 · pith:72A7LXDTnew · submitted 2025-06-02 · 📊 stat.ME · cs.LG

Position: Stop Chasing the C-index when Evaluating Survival Analysis Models

classification 📊 stat.ME cs.LG
keywords evaluation · survival · analysis · modeling · alignment · assumptions · c-index · censoring

The current state of evaluation in survival analysis is plagued by the persistent use of evaluation metrics in ways that are misaligned with the stated modeling objective. In addition, many such evaluations are based on censoring assumptions that are left implicit or unjustified. This means that the reported performance can be misleading and may fail to answer the scientific or modeling question the evaluation was intended to address. In this position paper, we critically examine evaluation practices in survival analysis and highlight how censoring makes evaluation fundamentally different from standard regression or classification. We place particular focus on concordance-based measures, such as the C-index, which we show are heavily overused in the literature. To help identify appropriate metrics, we propose a set of key desiderata and introduce a double-helix ladder, in which valid evaluation requires alignment between metric and modeling assumptions. Through controlled experiments, we show that violations of this alignment can lead to misleading model comparisons. We conclude by providing practical guidance on how to evaluate a survival model.
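
To illustrate why censoring makes concordance-based evaluation differ from ordinary ranking metrics, here is a minimal sketch of Harrell's C-index in plain Python. It is not taken from the paper: the function name `harrell_c_index`, the toy data, and the choice to skip pairs with tied times are assumptions made for illustration only. A pair is treated as comparable only when the subject with the shorter observed time actually had the event, so censored subjects silently drop pairs from the comparison.

```python
from itertools import combinations


def harrell_c_index(times, events, risks):
    """Return (c_index, n_comparable_pairs) for right-censored data.

    times  : observed times (event or censoring time)
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (higher = expected to fail sooner)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification (assumption): ignore tied times
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier subject was censored: true ordering unknown
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0  # higher risk failed first
        elif risks[first] == risks[second]:
            concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable, comparable


# Hypothetical toy data: same times and predictions, different censoring.
times = [2, 4, 6, 8]
risks = [0.9, 0.2, 0.5, 0.1]
print(harrell_c_index(times, [1, 0, 1, 1], risks))  # second subject censored
print(harrell_c_index(times, [1, 1, 1, 1], risks))  # all events observed
```

In this toy case, censoring the second subject removes the only discordant pair, so the same predictions score higher than they do when every event is observed. The reported value therefore depends on the censoring pattern as well as on the model.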

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Landmarking with Latent Class Mixed Models for Dynamic Prediction of Time-to-event Data with Heterogeneous Biomarker Trajectories

    stat.ME · 2026-06 · unverdicted · novelty 6.0

    A landmarking approach using latent class mixed models for dynamic prediction of time-to-event data that accounts for latent heterogeneity in longitudinal biomarker trajectories.