
arxiv: 2408.02839 · v6 · submitted 2024-08-05 · 📊 stat.ML · cs.LG

Mini-batch Estimation for Deep Cox Models: Statistical Foundations and Practical Guidance

Pith reviewed 2026-05-23 21:49 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords mini-batch estimation · Cox neural networks · partial likelihood · SGD · consistency · convergence rates · survival analysis · deep learning

The pith

Mini-batch maximum partial-likelihood estimators are consistent and achieve optimal minimax rates for Cox neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how stochastic gradient descent trains deep Cox neural networks by optimizing an average of mini-batch partial likelihoods instead of the usual full-data partial likelihood. This difference requires new theory for the resulting global optimizer, the mini-batch maximum partial-likelihood estimator. The authors prove that this estimator remains consistent for Cox neural networks and attains the optimal minimax convergence rate up to a polylogarithmic factor. In the linear Cox case they further establish sqrt(n)-consistency, asymptotic normality, and an asymptotic variance that approaches the information lower bound as batch size grows. They also give practical guidance: for Cox neural networks, the ratio of learning rate to batch size drives SGD dynamics; for linear Cox regression, sufficiently many SGD iterations approximate the global optimizer.
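The contrast between the two objectives can be written out explicitly. The block below is a sketch in assumed notation (observed time T_i, event indicator Δ_i, covariates X_i, risk function f, mini-batches B_1, ..., B_M of common size b); the paper's own symbols, normalization, and batching scheme are not shown on this page and may differ.

```latex
% Assumed notation, not taken from the paper.
\begin{align*}
\ell_n(f) &= \frac{1}{n}\sum_{i=1}^{n} \Delta_i
  \Big[ f(X_i) - \log \sum_{j:\, T_j \ge T_i} \exp f(X_j) \Big]
  && \text{(full-data log partial likelihood)} \\
\ell_{B}(f) &= \frac{1}{b}\sum_{i \in B} \Delta_i
  \Big[ f(X_i) - \log \sum_{j \in B:\, T_j \ge T_i} \exp f(X_j) \Big]
  && \text{(risk sets restricted to batch } B\text{)} \\
\bar{\ell}(f) &= \frac{1}{M}\sum_{m=1}^{M} \ell_{B_m}(f),
  \qquad \hat{f}_{\mathrm{mb}} \in \arg\max_{f}\, \bar{\ell}(f)
  && \text{(averaged objective and mb-MPLE)}
\end{align*}
```

The only structural difference is the inner sum: in the mini-batch version each risk set contains only subjects from the same batch, so the averaged objective is not a rescaled copy of the full-data one.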

Core claim

The mini-batch maximum partial-likelihood estimator (mb-MPLE), the global optimizer that SGD targets, is consistent for Cox neural networks and attains the optimal minimax convergence rate up to a polylogarithmic factor. In the linear covariate case, mb-MPLE is sqrt(n)-consistent, asymptotically normal, and its asymptotic variance approaches the information lower bound as batch size grows.

What carries the argument

The mini-batch maximum partial-likelihood estimator (mb-MPLE), the global optimizer of the average mini-batch partial likelihood approximated by SGD iterations.
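To make the difference between the two objectives concrete, here is a minimal NumPy sketch that evaluates both losses on the same simulated data. Everything in it is an assumption for illustration: the function names, the simulated data, the fixed random partition into batches, and the no-ties simplification are not taken from the paper.

```python
import numpy as np

def neg_log_pl(risk, time, event):
    """Average negative log partial likelihood; assumes no tied event times."""
    order = np.argsort(-time)                      # latest times first
    risk, event = risk[order], event[order]
    # Prefixes of the sorted array are the risk sets {j: T_j >= T_i}.
    log_risk_set = np.logaddexp.accumulate(risk)
    return -np.sum(event * (risk - log_risk_set)) / len(risk)

def avg_minibatch_neg_log_pl(risk, time, event, batch_size, rng):
    """Average of per-batch losses; each risk set is restricted to its batch."""
    perm = rng.permutation(len(risk))
    batches = [perm[s:s + batch_size] for s in range(0, len(risk), batch_size)]
    return float(np.mean([neg_log_pl(risk[b], time[b], event[b]) for b in batches]))

# Illustrative data: hazard exp(x) (i.e. beta = 1) with independent censoring.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
latent = rng.exponential(scale=np.exp(-x))
censor = rng.exponential(scale=2.0, size=n)
time, event = np.minimum(latent, censor), (latent <= censor).astype(float)

full_loss = neg_log_pl(x, time, event)
mb_loss = avg_minibatch_neg_log_pl(x, time, event, batch_size=20, rng=rng)
print(full_loss, mb_loss)
```

Because each batch-level risk set is a subset of the full risk set, the averaged mini-batch loss sits below the full-data loss at the same parameter value; with a single batch of size n the two coincide.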

If this is right

  • mb-MPLE is consistent for Cox neural network models.
  • mb-MPLE attains the optimal minimax convergence rate up to a polylogarithmic factor.
  • For linear Cox regression, mb-MPLE is sqrt(n)-consistent and asymptotically normal with variance near the information bound for large batches.
  • The learning-rate-to-batch-size ratio governs SGD dynamics for Cox-NN and controls approximation quality.
  • Sufficient SGD iterations ensure convergence to mb-MPLE for linear Cox models.
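The last bullet, on SGD iterations in the linear case, can be illustrated with a short NumPy loop. This is a sketch under stated assumptions rather than the paper's procedure: the simulated data, the constant learning rate, the per-epoch reshuffling, the batch size, and the helper name `batch_grad` are all hypothetical choices made for the example.

```python
import numpy as np

def batch_grad(beta, x, time, event):
    """Gradient of the within-batch average negative log partial likelihood (no ties assumed)."""
    order = np.argsort(-time)                  # latest times first, so prefixes are risk sets
    x, event = x[order], event[order]
    eta = x @ beta
    w = np.exp(eta - eta.max())                # stabilised weights; the shift cancels in the ratio
    s0 = np.cumsum(w)                          # sum of weights over each risk set
    s1 = np.cumsum(w[:, None] * x, axis=0)     # weighted covariate sums over each risk set
    return -(event[:, None] * (x - s1 / s0[:, None])).sum(axis=0) / len(event)

# Illustrative settings, not values from the paper.
rng = np.random.default_rng(1)
n, batch_size, lr, epochs = 2000, 50, 0.2, 30
beta_true = np.array([1.0, -0.5, 0.5])
x = rng.normal(size=(n, 3))
latent = rng.exponential(scale=np.exp(-x @ beta_true))   # hazard exp(x @ beta_true)
censor = rng.exponential(scale=2.0, size=n)              # independent censoring
time, event = np.minimum(latent, censor), (latent <= censor).astype(float)

beta = np.zeros(3)
for _ in range(epochs):
    perm = rng.permutation(n)                  # reshuffle batches every epoch
    for s in range(0, n, batch_size):
        idx = perm[s:s + batch_size]
        beta -= lr * batch_grad(beta, x[idx], time[idx], event[idx])
print(beta)
```

With a constant step size the iterates keep fluctuating around the optimizer instead of settling on it, so a decaying schedule or iterate averaging would be a natural refinement of this sketch.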

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Survival analyses on datasets too large for full partial-likelihood computation become statistically grounded with this estimator.
  • Tuning guidelines for deep survival models can prioritize the learning-rate-to-batch-size ratio rather than separate searches.
  • The same mini-batch surrogate approach may extend to other hazard models once similar surrogate properties are verified.
  • The remaining polylog factor leaves open the possibility of sharper rate results without extra logarithmic terms.
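If the learning-rate-to-batch-size ratio is the quantity that matters, a hyperparameter search can be organised around it. The helper below is a hypothetical illustration (the name `configs_at_fixed_ratio` and the numeric values are assumptions, not recommendations from the paper): it lists configurations that share one ratio, so that a search varies the ratio and treats batch size as a compute choice.

```python
def configs_at_fixed_ratio(ratio, batch_sizes):
    """Return (learning_rate, batch_size) pairs that all share the given lr / batch-size ratio."""
    return [(ratio * b, b) for b in batch_sizes]

# One-dimensional search over the ratio; batch sizes are picked for hardware convenience.
search_space = {r: configs_at_fixed_ratio(r, [32, 64, 128]) for r in (1e-4, 1e-3)}
for ratio, configs in search_space.items():
    print(ratio, configs)
```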

Load-bearing premise

The mini-batch partial likelihood acts as a statistically valid surrogate for the full partial likelihood so that consistency and rate results carry over.

What would settle it

Run simulations of Cox-NN on increasing sample sizes and check whether the mb-MPLE estimation error decreases at the claimed minimax rate (up to polylog) or deviates from it.
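A scaled-down version of such a check is sketched below. It is only a stand-in: it uses a one-parameter linear Cox model in place of a Cox-NN, a grid search in place of SGD, and a fixed random partition into batches of 32. All names, sample sizes, and the censoring scheme are assumptions made for the example.

```python
import numpy as np

def batch_nll(beta, x, time, event):
    """Within-batch average negative log partial likelihood (no tied times assumed)."""
    order = np.argsort(-time)
    risk, ev = beta * x[order], event[order]
    return -np.sum(ev * (risk - np.logaddexp.accumulate(risk))) / len(risk)

def simulate(n, beta_true, rng):
    x = rng.normal(size=n)
    t = rng.exponential(scale=np.exp(-beta_true * x))   # hazard exp(beta_true * x)
    c = rng.exponential(scale=2.0, size=n)              # independent censoring
    return x, np.minimum(t, c), (t <= c).astype(float)

def mb_mple_grid(x, time, event, batch_size, rng, grid):
    """Minimise the averaged mini-batch loss over a grid of candidate beta values."""
    perm = rng.permutation(len(x))
    batches = [perm[s:s + batch_size] for s in range(0, len(x), batch_size)]
    objective = [np.mean([batch_nll(b, x[i], time[i], event[i]) for i in batches])
                 for b in grid]
    return grid[int(np.argmin(objective))]

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 2.0, 201)
errors = []
for n in (256, 1024, 4096):
    x, time, event = simulate(n, 1.0, rng)
    errors.append(abs(mb_mple_grid(x, time, event, 32, rng, grid) - 1.0))
print(errors)
```

A single replication per sample size is noisy; a real check would repeat each setting many times and compare the slope of log error against log n with the claimed rate.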

Original abstract

The stochastic gradient descent (SGD) algorithm has been widely used to optimize deep Cox neural network (Cox-NN) by updating model parameters using mini-batches of data. We show that SGD aims to optimize the average of mini-batch partial-likelihood, which is different from the standard partial-likelihood. This distinction requires developing new statistical properties for the global optimizer, namely, the mini-batch maximum partial-likelihood estimator (mb-MPLE). We establish that mb-MPLE for Cox-NN is consistent and achieves the optimal minimax convergence rate up to a polylogarithmic factor. For Cox regression with linear covariate effects, we further show that mb-MPLE is $\sqrt{n}$-consistent and asymptotically normal with asymptotic variance approaching the information lower bound as batch size increases, which is confirmed by simulation studies. Additionally, we offer practical guidance on using SGD, supported by theoretical analysis and numerical evidence. For Cox-NN, we demonstrate that the ratio of the learning rate to the batch size is critical in SGD dynamics, offering insight into hyperparameter tuning. For Cox regression, we characterize the iterative convergence of SGD, ensuring that the global optimizer, mb-MPLE, can be approximated with sufficiently many iterations. Finally, we demonstrate the effectiveness of mb-MPLE in a large-scale real-world application where the standard MPLE is intractable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops statistical theory for mini-batch maximum partial likelihood estimation (mb-MPLE) in Cox proportional hazards models, including deep neural network versions (Cox-NN). It claims that the global maximizer of the averaged mini-batch partial likelihood is consistent for Cox-NN and attains the optimal minimax rate up to a polylog factor; for linear Cox regression it is sqrt(n)-consistent and asymptotically normal with variance approaching the information bound as batch size grows. The work also derives practical guidance on SGD dynamics (learning-rate-to-batch-size ratio for Cox-NN; iterative convergence for linear Cox) and demonstrates utility on large-scale data where full MPLE is intractable.

Significance. If the derivations hold, the results supply missing statistical foundations for SGD training of deep survival models, a setting where scalability demands mini-batching yet standard partial-likelihood theory does not directly apply. The linear-case asymptotic normality result that recovers the information bound, the explicit hyperparameter guidance, and the large-scale application are concrete strengths. The work is relevant to both theoretical statisticians and practitioners using neural networks for time-to-event data.

major comments (2)
  1. [§3] §3 (or the section containing the consistency proof for Cox-NN): the argument that the mini-batch partial likelihood satisfies the requisite uniform convergence or empirical-process conditions for consistency appears to rely on new properties that are not standard for the full partial likelihood; the manuscript should explicitly state the additional regularity conditions (e.g., on the neural-network class, censoring, or batch-size scaling) that close the gap between the averaged mini-batch objective and the population partial likelihood.
  2. [Theorem on minimax rate] Theorem on minimax rate (likely in §4): the claim of optimality up to a polylog factor requires a matching lower bound; if the lower bound is taken from the literature on nonparametric Cox estimation, the manuscript must verify that the mini-batch estimator does not incur an extra factor beyond the polylog term already present in the full-data case.
minor comments (2)
  1. [Introduction / §2] The notation distinguishing the full partial likelihood from the averaged mini-batch version should be introduced with an explicit equation in the introduction or §2 to avoid reader confusion.
  2. [Simulation section] Simulation figures for the linear case (asymptotic normality) would benefit from reporting the effective sample size or number of SGD iterations alongside batch size to allow direct comparison with the theoretical regime where batch size grows with n.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our theoretical results. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications.

Point-by-point responses
  1. Referee: [§3] §3 (or the section containing the consistency proof for Cox-NN): the argument that the mini-batch partial likelihood satisfies the requisite uniform convergence or empirical-process conditions for consistency appears to rely on new properties that are not standard for the full partial likelihood; the manuscript should explicitly state the additional regularity conditions (e.g., on the neural-network class, censoring, or batch-size scaling) that close the gap between the averaged mini-batch objective and the population partial likelihood.

    Authors: We agree that the regularity conditions should be stated more explicitly to highlight the differences from the full partial likelihood. In the revised manuscript we will add a dedicated subsection (or appendix) that lists the precise conditions on the neural-network function class (e.g., bounded Lipschitz constants and weight norms), the censoring distribution (independent censoring with density bounded away from zero), and the batch-size scaling (batch size b_n satisfying b_n → ∞ and b_n = o(n)) that guarantee uniform convergence of the averaged mini-batch objective to the population partial likelihood at the required rate. These conditions are already implicit in the proofs but will now be collected and contrasted with the classical full-data setting. revision: yes

  2. Referee: [Theorem on minimax rate] Theorem on minimax rate (likely in §4): the claim of optimality up to a polylog factor requires a matching lower bound; if the lower bound is taken from the literature on nonparametric Cox estimation, the manuscript must verify that the mini-batch estimator does not incur an extra factor beyond the polylog term already present in the full-data case.

    Authors: The upper bound we derive for the mb-MPLE matches the known minimax lower bound for nonparametric Cox estimation (up to the polylog factor already present in the full-data literature) because the additional variability from mini-batching is absorbed into the same polylog term under our batch-size scaling. In the revised version we will add an explicit remark (and a short proof sketch) verifying that the mini-batch averaging does not introduce a multiplicative constant beyond the polylog factor that appears in the full partial-likelihood case; the key step is that the empirical-process deviation between the mini-batch and full objectives is o_p of the rate achieved by the full estimator. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The central claims rest on newly derived statistical properties for the mini-batch partial likelihood as a surrogate for the full partial likelihood in Cox-NN models. These properties are established directly via analysis of the averaged mini-batch objective, without reduction to fitted inputs renamed as predictions, self-definitional loops, or load-bearing self-citations. The linear-case asymptotic normality result follows standard M-estimator arguments once batch size is permitted to grow, and the polylog factor is standard for neural-network rates. The derivation chain is self-contained against external benchmarks and does not invoke author-specific uniqueness theorems or ansatzes smuggled via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces mb-MPLE as a new estimator but relies on standard survival-analysis axioms; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: The standard assumptions of the Cox proportional hazards model hold, including independent censoring and proportional hazards.
    Typical for Cox models and implied by the use of partial likelihood.

pith-pipeline@v0.9.0 · 5768 in / 1211 out tokens · 42017 ms · 2026-05-23T21:49:00.901622+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Radiomics-Guided Vision Transformers for Survival Analysis

    physics.med-ph · 2026-04 · unverdicted · novelty 5.0

    A radiomics-guided hybrid Vision Transformer integrates pixel embeddings with interpretable radiomic features in a multimodal Cox model for survival analysis, yielding competitive discrimination and clinically meaning...
