Mini-batch Estimation for Deep Cox Models: Statistical Foundations and Practical Guidance
Pith reviewed 2026-05-23 21:49 UTC · model grok-4.3
The pith
Mini-batch maximum partial-likelihood estimators are consistent and achieve optimal minimax rates for Cox neural networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The mini-batch maximum partial-likelihood estimator (mb-MPLE) obtained via SGD is consistent for Cox neural networks and attains the optimal minimax convergence rate up to a polylogarithmic factor. In the linear covariate case, mb-MPLE is sqrt(n)-consistent, asymptotically normal, and its asymptotic variance approaches the information lower bound as batch size grows.
What carries the argument
The mini-batch maximum partial-likelihood estimator (mb-MPLE), defined as the global optimizer of the average mini-batch partial likelihood; SGD iterations approximate this optimizer.
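To make the objective concrete, the full and mini-batch criteria can be written side by side. The notation is an assumption chosen for illustration and is not necessarily the paper's own: observed time T_i, event indicator \Delta_i, covariates X_i, risk function f, a mini-batch B of size b, and a collection \mathcal{B} of mini-batches.

```latex
% Full-data log partial likelihood: risk sets range over all n subjects
\ell_n(f) = \frac{1}{n} \sum_{i=1}^{n} \Delta_i
  \Big[ f(X_i) - \log \sum_{j:\, T_j \ge T_i} \exp f(X_j) \Big]

% Mini-batch log partial likelihood: risk sets range over the batch B only
\ell_B(f) = \frac{1}{b} \sum_{i \in B} \Delta_i
  \Big[ f(X_i) - \log \sum_{j \in B:\, T_j \ge T_i} \exp f(X_j) \Big]

% mb-MPLE: maximizer of the average over mini-batches
\hat{f}_{\mathrm{mb}} \in \arg\max_{f}
  \frac{1}{|\mathcal{B}|} \sum_{B \in \mathcal{B}} \ell_B(f)
```

The two criteria differ only in the risk set inside the logarithm, which in the second expression is restricted to subjects in the same batch. That restriction is why the two maximizers need not coincide.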
If this is right
- mb-MPLE is consistent for Cox neural network models.
- mb-MPLE attains the optimal minimax convergence rate up to a polylogarithmic factor.
- For linear Cox regression, mb-MPLE is sqrt(n)-consistent and asymptotically normal with variance near the information bound for large batches.
- The learning-rate-to-batch-size ratio governs SGD dynamics for Cox-NN and controls approximation quality.
- Sufficient SGD iterations ensure convergence to mb-MPLE for linear Cox models.
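The last two bullets can be illustrated with a minimal sketch of SGD on the averaged mini-batch partial likelihood for a linear Cox model. This is an illustration under stated assumptions, not the paper's implementation: the function names `batch_loss_and_grad` and `sgd_linear_cox`, the default hyperparameters, and the treatment of ties (every subject in the batch with time at least t_i counts as at risk) are all choices made here.

```python
import math
import random


def batch_loss_and_grad(beta, batch):
    """Negative log partial likelihood of one mini-batch, averaged over the
    batch size, together with its gradient in beta.

    batch: list of (x, time, event), x a list of floats, event in {0, 1}.
    Risk sets are formed inside the batch only (illustrative assumption).
    """
    p = len(beta)
    eta = [sum(b * v for b, v in zip(beta, x)) for x, _, _ in batch]
    shift = max(eta)  # subtracting the max keeps the exponentials stable
    loss = 0.0
    grad = [0.0] * p
    for i, (x_i, t_i, d_i) in enumerate(batch):
        if not d_i:
            continue  # censored subjects enter only through risk sets
        # Subjects in this batch still at risk at time t_i (includes i).
        risk = [j for j, (_, t_j, _) in enumerate(batch) if t_j >= t_i]
        weights = [math.exp(eta[j] - shift) for j in risk]
        total = sum(weights)
        loss -= eta[i] - shift - math.log(total)
        for k in range(p):
            x_bar = sum(w * batch[j][0][k] for w, j in zip(weights, risk)) / total
            grad[k] -= x_i[k] - x_bar
    size = len(batch)
    return loss / size, [g / size for g in grad]


def sgd_linear_cox(data, batch_size=32, lr=0.1, epochs=30, seed=0):
    """Plain constant-step SGD over reshuffled mini-batches (hypothetical helper)."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not reordered
    beta = [0.0] * len(data[0][0])
    for _ in range(epochs):
        rng.shuffle(data)
        # Drop the final incomplete batch so every batch has the same size.
        for start in range(0, len(data) - batch_size + 1, batch_size):
            _, grad = batch_loss_and_grad(beta, data[start:start + batch_size])
            beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta
```

Calling `sgd_linear_cox(data)` on a list of `(x, time, event)` tuples returns the coefficient vector after the final epoch. Because each batch builds its own risk sets, the iterates target the mini-batch objective rather than the full partial likelihood.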
Where Pith is reading between the lines
- Survival analyses on datasets too large for full partial-likelihood computation become statistically grounded with this estimator.
- Tuning guidelines for deep survival models can prioritize the learning-rate-to-batch-size ratio rather than searching over learning rate and batch size separately.
- The same mini-batch surrogate approach may extend to other hazard models once similar surrogate properties are verified.
- The remaining polylog factor leaves open the possibility of sharper rate results without extra logarithmic terms.
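One simple way to act on the ratio-based tuning idea above is to hold the learning-rate-to-batch-size ratio fixed while the batch size changes. The sketch below is a hypothetical helper, not something taken from the paper; the names `lr_for_batch` and `ratio_grid` are assumptions, and whether a fixed ratio is the right choice for a given Cox-NN remains an empirical question.

```python
def lr_for_batch(base_lr, base_batch_size, batch_size):
    """Learning rate that keeps lr / batch_size equal to base_lr / base_batch_size."""
    return base_lr * batch_size / base_batch_size


def ratio_grid(ratios, batch_sizes):
    """All (learning_rate, batch_size) pairs for the given target ratios."""
    return [(r * b, b) for r in ratios for b in batch_sizes]
```

For example, `lr_for_batch(0.01, 32, 128)` gives the learning rate that pairs with batch size 128 at the same ratio as 0.01 with batch size 32. Searching over `ratios` in `ratio_grid` then replaces a two-dimensional search with a one-dimensional one plus a small set of batch sizes.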
Load-bearing premise
The mini-batch partial likelihood acts as a statistically valid surrogate for the full partial likelihood so that consistency and rate results carry over.
What would settle it
Run simulations of Cox-NN on increasing sample sizes and check whether the mb-MPLE estimation error decreases at the claimed minimax rate (up to polylog) or deviates from it.
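The check described above reduces to estimating an empirical convergence exponent from pairs of sample size and estimation error. A minimal sketch, assuming the errors have already been computed by repeated simulation, fits a least-squares line on the log-log scale; the function name `loglog_slope` is hypothetical. In the linear case, sqrt(n)-consistency corresponds to a slope near -0.5. The source does not state the Cox-NN exponent, so none is assumed here.

```python
import math


def loglog_slope(sample_sizes, errors):
    """Least-squares slope of log(error) against log(sample size)."""
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(e) for e in errors]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx
```

A polylogarithmic factor bends the log-log curve slightly at finite n, so a fitted slope somewhat shallower than the nominal exponent would not by itself contradict the claimed rate.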
Original abstract
The stochastic gradient descent (SGD) algorithm has been widely used to optimize deep Cox neural network (Cox-NN) by updating model parameters using mini-batches of data. We show that SGD aims to optimize the average of mini-batch partial-likelihood, which is different from the standard partial-likelihood. This distinction requires developing new statistical properties for the global optimizer, namely, the mini-batch maximum partial-likelihood estimator (mb-MPLE). We establish that mb-MPLE for Cox-NN is consistent and achieves the optimal minimax convergence rate up to a polylogarithmic factor. For Cox regression with linear covariate effects, we further show that mb-MPLE is $\sqrt{n}$-consistent and asymptotically normal with asymptotic variance approaching the information lower bound as batch size increases, which is confirmed by simulation studies. Additionally, we offer practical guidance on using SGD, supported by theoretical analysis and numerical evidence. For Cox-NN, we demonstrate that the ratio of the learning rate to the batch size is critical in SGD dynamics, offering insight into hyperparameter tuning. For Cox regression, we characterize the iterative convergence of SGD, ensuring that the global optimizer, mb-MPLE, can be approximated with sufficiently many iterations. Finally, we demonstrate the effectiveness of mb-MPLE in a large-scale real-world application where the standard MPLE is intractable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops statistical theory for mini-batch maximum partial likelihood estimation (mb-MPLE) in Cox proportional hazards models, including deep neural network versions (Cox-NN). It claims that the global maximizer of the averaged mini-batch partial likelihood is consistent for Cox-NN and attains the optimal minimax rate up to a polylog factor; for linear Cox regression it is sqrt(n)-consistent and asymptotically normal with variance approaching the information bound as batch size grows. The work also derives practical guidance on SGD dynamics (learning-rate-to-batch-size ratio for Cox-NN; iterative convergence for linear Cox) and demonstrates utility on large-scale data where full MPLE is intractable.
Significance. If the derivations hold, the results supply missing statistical foundations for SGD training of deep survival models, a setting where scalability demands mini-batching yet standard partial-likelihood theory does not directly apply. The linear-case asymptotic normality result that recovers the information bound, the explicit hyperparameter guidance, and the large-scale application are concrete strengths. The work is relevant to both theoretical statisticians and practitioners using neural networks for time-to-event data.
major comments (2)
- [§3] §3 (or the section containing the consistency proof for Cox-NN): the argument that the mini-batch partial likelihood satisfies the requisite uniform convergence or empirical-process conditions for consistency appears to rely on new properties that are not standard for the full partial likelihood; the manuscript should explicitly state the additional regularity conditions (e.g., on the neural-network class, censoring, or batch-size scaling) that close the gap between the averaged mini-batch objective and the population partial likelihood.
- [Theorem on minimax rate] Theorem on minimax rate (likely in §4): the claim of optimality up to a polylog factor requires a matching lower bound; if the lower bound is taken from the literature on nonparametric Cox estimation, the manuscript must verify that the mini-batch estimator does not incur an extra factor beyond the polylog term already present in the full-data case.
minor comments (2)
- [Introduction / §2] The notation distinguishing the full partial likelihood from the averaged mini-batch version should be introduced with an explicit equation in the introduction or §2 to avoid reader confusion.
- [Simulation section] Simulation figures for the linear case (asymptotic normality) would benefit from reporting the effective sample size or number of SGD iterations alongside batch size to allow direct comparison with the theoretical regime where batch size grows with n.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our theoretical results. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications.
Point-by-point responses
- Referee: [§3] §3 (or the section containing the consistency proof for Cox-NN): the argument that the mini-batch partial likelihood satisfies the requisite uniform convergence or empirical-process conditions for consistency appears to rely on new properties that are not standard for the full partial likelihood; the manuscript should explicitly state the additional regularity conditions (e.g., on the neural-network class, censoring, or batch-size scaling) that close the gap between the averaged mini-batch objective and the population partial likelihood.
Authors: We agree that the regularity conditions should be stated more explicitly to highlight the differences from the full partial likelihood. In the revised manuscript we will add a dedicated subsection (or appendix) that lists the precise conditions on the neural-network function class (e.g., bounded Lipschitz constants and weight norms), the censoring distribution (independent censoring with density bounded away from zero), and the batch-size scaling (batch size b_n satisfying b_n → ∞ and b_n = o(n)) that guarantee the uniform convergence of the averaged mini-batch objective to the population partial likelihood at the required rate. These conditions are already implicit in the proofs but will now be collected and contrasted with the classical full-data setting.
revision: yes
- Referee: [Theorem on minimax rate] Theorem on minimax rate (likely in §4): the claim of optimality up to a polylog factor requires a matching lower bound; if the lower bound is taken from the literature on nonparametric Cox estimation, the manuscript must verify that the mini-batch estimator does not incur an extra factor beyond the polylog term already present in the full-data case.
Authors: The upper bound we derive for the mb-MPLE matches the known minimax lower bound for nonparametric Cox estimation (up to the polylog factor already present in the full-data literature) because the additional variability from mini-batching is absorbed into the same polylog term under our batch-size scaling. In the revised version we will add an explicit remark (and a short proof sketch) verifying that the mini-batch averaging does not introduce an additional factor beyond the polylog factor that appears in the full partial-likelihood case; the key step is that the empirical-process deviation between the mini-batch and full objectives is o_p of the rate achieved by the full estimator.
revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The central claims rest on newly derived statistical properties for the mini-batch partial likelihood as a surrogate for the full partial likelihood in Cox-NN models. These properties are established directly via analysis of the averaged mini-batch objective, without reduction to fitted inputs renamed as predictions, self-definitional loops, or load-bearing self-citations. The linear-case asymptotic normality result follows standard M-estimator arguments once batch size is permitted to grow, and the polylog factor is standard for neural-network rates. The derivation chain is self-contained against external benchmarks and does not invoke author-specific uniqueness theorems or ansatzes smuggled via citation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard assumptions for the Cox proportional hazards model hold, including independent censoring and proportional hazards.
Forward citations
Cited by 1 Pith paper
- Radiomics-Guided Vision Transformers for Survival Analysis
A radiomics-guided hybrid Vision Transformer integrates pixel embeddings with interpretable radiomic features in a multimodal Cox model for survival analysis, yielding competitive discrimination and clinically meaning...