Understanding Overparametrization in Survival Models through Interpolation

arxiv: 2512.12463 · v3 · submitted 2025-12-13 · stat.ML · cs.LG · math.ST · stat.TH

Yin Liu, Jianwen Cai, Didong Li

Pith reviewed 2026-05-16 22:22 UTC · model grok-4.3

keywords: survival analysis, overparametrization, interpolation, double descent, survival models, generalization, likelihood loss

The pith

Overparametrization does not improve generalization in survival models because their likelihood-based losses prevent beneficial interpolation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how increasing model capacity affects generalization in survival analysis, a setting with censored data. Classical theory suggests a U-shaped loss curve, but modern ML often shows double descent where loss drops again after the interpolation threshold. The authors analyze four survival models and find that overparametrization does not lead to this improvement, unlike in regression or classification. This matters because it means simply using larger models may not help and could worsen performance in survival tasks. They define interpolation specifically for these models to explain why.

Core claim

The study shows the existence or absence of interpolation and finite-norm interpolation in DeepSurv, PC-Hazard, Nnet-Survival, and N-MTLR. Likelihood-based losses and model implementation jointly determine the feasibility of interpolation, clarifying that overparametrization should not be regarded as benign for survival models.

What carries the argument

Interpolation and finite-norm interpolation defined for loss-based survival models, which determine whether double descent can occur.
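As a rough illustration of why the loss matters here, consider the Cox partial likelihood that DeepSurv is trained with. The sketch below is not the paper's proof. It assumes that "interpolation" means reaching the infimum (zero) of the training loss, assumes no tied event times, and uses a hypothetical four-subject dataset and a hypothetical function name. With perfectly ordered risk scores scaled by a factor `c`, the loss shrinks toward zero as `c` grows but stays strictly positive at every finite scale, which is the kind of behavior that rules out a finite-norm interpolator.

```python
import math

def cox_neg_log_partial_likelihood(scores, times, events):
    """Negative log partial likelihood, assuming no tied event times."""
    loss = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects enter only through the risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = [scores[j] for j, t_j in enumerate(times) if t_j >= t_i]
        m = max(risk)  # log-sum-exp shift for numerical stability
        log_sum = m + math.log(sum(math.exp(s - m) for s in risk))
        loss += log_sum - scores[i]
    return loss

# Hypothetical toy data: the third subject is censored.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
base_scores = [3.0, 2.0, 1.0, 0.0]  # earlier event -> higher risk score

scales = [1, 2, 4, 8, 16]
losses = [
    cox_neg_log_partial_likelihood([c * s for s in base_scores], times, events)
    for c in scales
]
for c, loss in zip(scales, losses):
    print(f"scale={c:>2}  loss={loss:.3e}")
```

Each event term has the form log(1 + sum of exp(negative gaps)), so it can vanish only when the gaps diverge or when the risk set contains the event subject alone.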

If this is right

Overparametrization does not lead to improved test performance in the examined survival models.
Likelihood-based losses make interpolation infeasible or non-beneficial in survival settings.
Model implementation details affect whether finite-norm interpolation is achieved.
Numerical experiments validate that generalization behaviors differ from those in regression and classification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Survival models may need specialized capacity control methods beyond standard scaling.
The results could apply to other domains with censored or incomplete observations.
Different loss functions might be explored to enable double descent in survival analysis.

Load-bearing premise

The four chosen models and their specific implementations are representative of the broader class of survival models.

What would settle it

Observing a decrease in test loss for any of the four models as capacity grows past the interpolation threshold would contradict the claim that overparametrization is not benign.
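One crude way to operationalize this check is sketched below. The function name, the tolerance parameter, and the loss curves are all hypothetical illustrations, not results from the paper. The rule used is: among capacities at or beyond the interpolation threshold, find the peak test loss and ask whether any larger capacity achieves a loss lower than that peak by more than `tol`.

```python
def has_second_descent(capacities, test_losses, threshold, tol=0.0):
    """Return True if test loss falls after its peak once capacity >= threshold."""
    pairs = sorted(zip(capacities, test_losses))
    past = [loss for cap, loss in pairs if cap >= threshold]
    if len(past) < 2:
        return False
    # Index of the (first) peak among points past the threshold.
    peak_idx = max(range(len(past)), key=past.__getitem__)
    after_peak = past[peak_idx + 1:]
    return bool(after_peak) and min(after_peak) < past[peak_idx] - tol

# Hypothetical curves over neurons per layer; threshold assumed at width 8.
widths = [2, 4, 8, 16, 32, 64]
double_descent_curve = [1.0, 0.8, 1.5, 1.2, 0.9, 0.7]  # drops again after peak
non_benign_curve = [1.0, 0.8, 1.1, 1.3, 1.6, 1.9]      # keeps rising

print(has_second_descent(widths, double_descent_curve, threshold=8))
print(has_second_descent(widths, non_benign_curve, threshold=8))
```

In practice the tolerance would need to reflect run-to-run noise, for example the spread of test loss across random seeds.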

Figures

Figures reproduced from arXiv: 2512.12463 by Didong Li, Jianwen Cai, Yin Liu.

**Figure 2.** Training and test losses of DeepSurv, and the error of the estimated log-hazard

**Figure 3.** Training and test losses of PC-Hazard versus the number of neurons per layer.

**Figure 4.** Training and test losses of Nnet-Survival versus the number of neurons per layer.

**Figure 5.** Training and test losses of N-MTLR versus the number of neurons per layer.

Original abstract

Classical statistical learning theory predicts a U-shaped relationship between test loss and model capacity, driven by the bias-variance trade-off. Recent advances in modern machine learning have revealed a more complex pattern, double-descent, in which test loss, after peaking near the interpolation threshold, decreases again as model capacity continues to grow. While this behavior has been extensively analyzed in regression and classification, its manifestation in survival analysis remains unexplored. This study investigates overparametrization in four representative survival models: DeepSurv, PC-Hazard, Nnet-Survival, and N-MTLR. We rigorously define interpolation and finite-norm interpolation, two key characteristics of loss-based models to understand double-descent. We then show the existence (or absence) of (finite-norm) interpolation of all four models. Our findings clarify how likelihood-based losses and model implementation jointly determine the feasibility of interpolation and show that overparametrization should not be regarded as benign for survival models. All theoretical results are supported by numerical experiments that highlight the distinct generalization behaviors of survival models.
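For contrast with the survival setting, the regression case mentioned in the abstract admits finite-norm interpolation directly: with more features than samples, the minimum-norm least-squares solution fits the training data exactly with bounded weights. The sketch below uses a hypothetical two-sample, three-feature dataset and a hypothetical helper that handles only the two-sample case. It illustrates the standard regression fact and does not reproduce any experiment from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def min_norm_interpolator(X, y):
    """Minimum-norm solution w = X^T (X X^T)^{-1} y for exactly two samples."""
    x0, x1 = X
    g00, g01, g11 = dot(x0, x0), dot(x0, x1), dot(x1, x1)
    det = g00 * g11 - g01 * g01  # nonzero when the rows are linearly independent
    a0 = (g11 * y[0] - g01 * y[1]) / det
    a1 = (g00 * y[1] - g01 * y[0]) / det
    return [a0 * u + a1 * v for u, v in zip(x0, x1)]

# Hypothetical overparametrized problem: 2 samples, 3 features.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 2.0]

w = min_norm_interpolator(X, y)
predictions = [dot(row, w) for row in X]
train_mse = sum((p - t) ** 2 for p, t in zip(predictions, y)) / len(y)
weight_norm = dot(w, w) ** 0.5

print("w =", w)
print("train MSE =", train_mse, " ||w|| =", weight_norm)
```

Here the squared loss reaches zero at a finite weight vector, whereas a likelihood-based loss may only approach its infimum as the weights diverge.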

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines interpolation for likelihood losses in survival models and checks it on four neural nets, but the broader claim about overparametrization doesn't generalize beyond those cases.

The main point is that these authors define interpolation and finite-norm interpolation for likelihood-based survival losses, then show that the four neural models they picked either reach or avoid it depending on implementation details. This leads them to say overparametrization is not benign for survival models the way it can be elsewhere.

That definition step is the clearest new piece; it adapts the usual interpolation idea to censored data and partial likelihoods in a way that prior double-descent work did not cover. The experiments line up with the definitions and illustrate different generalization curves for DeepSurv, PC-Hazard, Nnet-Survival, and N-MTLR. That part is useful for anyone already running these exact architectures.

The soft spot is exactly the one the stress-test flagged: everything rests on neural-network versions. There is no argument or check for classical Cox models, Weibull, or tree-based survival methods, so the headline conclusion does not yet apply to survival analysis as a whole. If the non-benign behavior is tied to how these networks implement the loss rather than to survival data itself, the claim narrows.

The paper is aimed at people who already work on neural survival models and want to think about capacity. A reader who knows the regression and classification double-descent results can follow the extension without much trouble. It is worth sending for peer review so the authors can address the scope question and so referees can check the derivations that the abstract only sketches.

Referee Report

2 major / 1 minor

Summary. The manuscript investigates overparametrization and double-descent phenomena in survival analysis, which have been studied in regression and classification but not yet in this domain. It defines interpolation and finite-norm interpolation for loss-based models, examines their presence or absence in four neural survival models (DeepSurv, PC-Hazard, Nnet-Survival, and N-MTLR), links the behavior to likelihood-based losses and model implementations, and concludes that overparametrization should not be regarded as benign for survival models. All claims are supported by numerical experiments.

Significance. If the central findings hold, the work usefully extends double-descent analysis to survival analysis by highlighting how censoring and likelihood losses can produce non-benign overparametrization behavior distinct from standard supervised learning. This could inform capacity selection and regularization choices in survival modeling.

major comments (2)

[Abstract] The headline conclusion that overparametrization should not be regarded as benign for survival models rests on interpolation results for only four neural-network implementations. No argument or experiment addresses whether the same non-benign behavior appears in classical semi-parametric models (e.g., Cox PH) or parametric models (e.g., Weibull), so the general claim for the survival-analysis domain does not follow from the reported evidence.
[Abstract] The statement that theoretical results on interpolation existence are supported by numerical experiments lacks accompanying details on dataset characteristics, censoring rates, hyperparameter selection procedures, or implementation choices that could affect whether finite-norm interpolation is observed; without these, it is impossible to assess whether the reported behaviors are robust or sensitive to post-hoc decisions.

minor comments (1)

The four models are all neural; a brief comparison table of their architectures, loss formulations, and how they map to the general definitions of interpolation would help readers evaluate representativeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [Abstract] The headline conclusion that overparametrization should not be regarded as benign for survival models rests on interpolation results for only four neural-network implementations. No argument or experiment addresses whether the same non-benign behavior appears in classical semi-parametric models (e.g., Cox PH) or parametric models (e.g., Weibull), so the general claim for the survival-analysis domain does not follow from the reported evidence.

Authors: We agree that the headline phrasing in the abstract and conclusion is too broad. Our work deliberately targets modern neural survival models (DeepSurv, PC-Hazard, Nnet-Survival, N-MTLR) because these are the settings in which overparametrization and interpolation are practically relevant. Classical semi-parametric and parametric models operate under different capacity regimes and loss structures and were outside the scope of the study. We will revise the abstract, introduction, and conclusion to state explicitly that the non-benign overparametrization behavior is observed for the four neural implementations examined, and we will add a short paragraph noting that classical models such as Cox PH are not expected to exhibit the same interpolation phenomena due to their fixed functional form.

Revision: yes

Referee: [Abstract] The statement that theoretical results on interpolation existence are supported by numerical experiments lacks accompanying details on dataset characteristics, censoring rates, hyperparameter selection procedures, or implementation choices that could affect whether finite-norm interpolation is observed; without these, it is impossible to assess whether the reported behaviors are robust or sensitive to post-hoc decisions.

Authors: The abstract is intentionally concise, but the referee is correct that it should convey the experimental scope. Full details on the four datasets (including sample sizes, feature dimensions, and censoring rates), the hyperparameter grids, early-stopping rules, and implementation choices appear in Sections 4.1–4.2 and the supplementary material. To address the concern directly, we will insert one additional sentence in the abstract summarizing the experimental setting: “Experiments across four real-world datasets with censoring rates ranging from 20% to 70% and systematic hyperparameter sweeps confirm the theoretical predictions.”

Revision: yes

Circularity Check

0 steps flagged

No circularity: definitions and empirical checks are independent of inputs

The paper introduces explicit definitions of interpolation and finite-norm interpolation for survival models, then verifies their presence or absence in four concrete neural implementations (DeepSurv, PC-Hazard, Nnet-Survival, N-MTLR) via direct analysis of their loss functions and architectures. These steps rely on the models' own likelihood-based formulations and numerical experiments rather than any self-citation chain, fitted-parameter renaming, or imported uniqueness theorem. The central claim about non-benign overparametrization follows from the observed interpolation behaviors and is not equivalent to the input definitions by construction. No load-bearing self-citations or ansatz smuggling appear in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Relies on standard mathematical definitions of interpolation and prior results from statistical learning theory without introducing new free parameters or invented entities.

axioms (1)

Standard math: standard definitions of interpolation and finite-norm interpolation for loss-based models.
Invoked to characterize when models achieve zero training loss or bounded-norm solutions.
