Mini-batch Estimation for Deep Cox Models: Statistical Foundations and Practical Guidance

Lang Zeng; Weijing Tang; Ying Ding; Zhao Ren

arxiv: 2408.02839 · v6 · submitted 2024-08-05 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-23 21:49 UTC · model grok-4.3

keywords mini-batch estimation · Cox neural networks · partial likelihood · SGD · consistency · convergence rates · survival analysis · deep learning

The pith

Mini-batch maximum partial-likelihood estimators are consistent and achieve optimal minimax rates for Cox neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how stochastic gradient descent trains deep Cox neural networks by optimizing an average of mini-batch partial likelihoods instead of the usual full-data partial likelihood. This difference requires new theory for the resulting global optimizer, the mini-batch maximum partial-likelihood estimator. The authors prove that this estimator remains consistent for Cox neural networks and attains the optimal minimax convergence rate aside from a polylogarithmic factor. In the linear Cox case they further establish sqrt(n)-consistency, asymptotic normality, and variance approaching the information lower bound as batch size grows. They also supply practical rules for choosing learning rate relative to batch size and for ensuring SGD iterations reach the global optimizer.
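The distinction between the two objectives can be written out explicitly. The block below is a sketch in standard Cox notation, not the paper's own: $(T_i, \delta_i, x_i)$ are the observed time, event indicator, and covariates of subject $i$; $f_\theta$ is the log-risk function (the network, or $x^\top\beta$ in the linear case); $B_1, \dots, B_M$ are mini-batches. The symbols and the normalizing constants are assumptions made for illustration.

```latex
% Full-data log partial likelihood (risk set taken over all n subjects)
\ell_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \delta_i
  \Big[ f_\theta(x_i) - \log \sum_{j:\, T_j \ge T_i} \exp\{ f_\theta(x_j) \} \Big]

% Log partial likelihood of one mini-batch B (risk set restricted to B)
\ell_B(\theta) = \frac{1}{|B|}\sum_{i \in B} \delta_i
  \Big[ f_\theta(x_i) - \log \sum_{j \in B:\, T_j \ge T_i} \exp\{ f_\theta(x_j) \} \Big]

% Averaged mini-batch objective and its maximizer, the mb-MPLE
\bar{\ell}(\theta) = \frac{1}{M}\sum_{m=1}^{M} \ell_{B_m}(\theta),
\qquad
\hat{\theta}_{\mathrm{mb}} = \arg\max_{\theta}\, \bar{\ell}(\theta)
```

Because the inner sum in $\ell_B$ runs only over subjects in the batch, $\bar{\ell}$ is in general not equal to $\ell_n$, which is why its maximizer needs its own theory.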

Core claim

The mini-batch maximum partial-likelihood estimator (mb-MPLE) obtained via SGD is consistent for Cox neural networks and attains the optimal minimax convergence rate up to a polylogarithmic factor. In the linear covariate case, mb-MPLE is sqrt(n)-consistent, asymptotically normal, and its asymptotic variance approaches the information lower bound as batch size grows.

What carries the argument

The mini-batch maximum partial-likelihood estimator (mb-MPLE), the global optimizer of the average mini-batch partial likelihood approximated by SGD iterations.
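To make the object concrete, here is a minimal NumPy sketch of the full-data loss and the averaged mini-batch loss. The function names, the Breslow-style handling of ties, and the toy data are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def neg_log_partial_likelihood(risk, time, event):
    """Negative log partial likelihood, averaged over the subjects passed in.

    risk  : log-risk scores f(x_i), shape (m,)
    time  : observed times, shape (m,)
    event : 1 if the event was observed, 0 if censored, shape (m,)
    Ties are handled Breslow-style (tied subjects stay in the risk set).
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    # at_risk[i, j] is True when subject j is still at risk at time[i]
    at_risk = time[None, :] >= time[:, None]
    # log of each risk-set sum, stabilised by subtracting the max score
    shift = risk.max()
    log_denominator = shift + np.log((at_risk * np.exp(risk - shift)[None, :]).sum(axis=1))
    return -np.sum(event * (risk - log_denominator)) / len(risk)

def averaged_minibatch_loss(risk, time, event, batches):
    """Mean of the per-batch losses; each risk set is restricted to its batch."""
    losses = [neg_log_partial_likelihood(risk[b], time[b], event[b]) for b in batches]
    return float(np.mean(losses))

# Toy data (hypothetical): four subjects, all events observed, zero log-risk
risk = np.zeros(4)
time = np.array([1.0, 2.0, 3.0, 4.0])
event = np.ones(4)

full_loss = float(neg_log_partial_likelihood(risk, time, event))
one_batch = averaged_minibatch_loss(risk, time, event, [np.arange(4)])
two_batches = averaged_minibatch_loss(
    risk, time, event, [np.array([0, 1]), np.array([2, 3])]
)
```

With a single batch containing every subject the two losses coincide. Splitting the same data into two batches shrinks each risk set and changes the value, so the two objectives are different functions of the parameters.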

If this is right

mb-MPLE is consistent for Cox neural network models.
mb-MPLE attains the optimal minimax convergence rate up to a polylogarithmic factor.
For linear Cox regression, mb-MPLE is sqrt(n)-consistent and asymptotically normal with variance near the information bound for large batches.
The learning-rate-to-batch-size ratio governs SGD dynamics for Cox-NN and controls approximation quality.
Sufficient SGD iterations ensure convergence to mb-MPLE for linear Cox models.
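For the linear case, the SGD iteration behind these statements can be sketched as follows. This is a minimal illustration under stated assumptions: plain SGD with a constant learning rate, a fresh batch drawn without replacement at each step, simulated data with an exponential baseline hazard, and hypothetical hyperparameter values. It is not the paper's experimental setup.

```python
import numpy as np

def batch_loss_and_grad(beta, x, time, event):
    """Negative log partial likelihood of one batch (linear Cox) and its gradient."""
    score = x @ beta
    shift = score.max()
    weights = np.exp(score - shift)                      # stabilised exp(x_j' beta)
    # at_risk[i, j] = 1 when subject j is at risk at time[i]
    at_risk = (time[None, :] >= time[:, None]).astype(float)
    denom = at_risk @ weights                            # risk-set sums, shape (m,)
    # risk-set weighted covariate means, shape (m, p)
    risk_mean_x = (at_risk @ (weights[:, None] * x)) / denom[:, None]
    loss = -np.sum(event * (score - shift - np.log(denom))) / len(time)
    grad = -(event[:, None] * (x - risk_mean_x)).sum(axis=0) / len(time)
    return loss, grad

def sgd_linear_cox(x, time, event, batch_size, lr, n_steps, seed=0):
    """Plain mini-batch SGD on the batch-restricted partial likelihood."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(x.shape[1])
    for _ in range(n_steps):
        idx = rng.choice(len(time), size=batch_size, replace=False)
        _, grad = batch_loss_and_grad(beta, x[idx], time[idx], event[idx])
        beta -= lr * grad
    return beta

# Simulated data (illustrative): hazard exp(x' beta), independent censoring
rng = np.random.default_rng(1)
n, true_beta = 2000, np.array([1.0, -1.0])
x = rng.normal(size=(n, 2))
event_time = rng.exponential(1.0 / np.exp(x @ true_beta))
censor_time = rng.exponential(2.0, size=n)
time = np.minimum(event_time, censor_time)
event = (event_time <= censor_time).astype(float)

beta_hat = sgd_linear_cox(x, time, event, batch_size=64, lr=0.1, n_steps=3000)
loss_start, _ = batch_loss_and_grad(np.zeros(2), x, time, event)
loss_end, _ = batch_loss_and_grad(beta_hat, x, time, event)
```

Each update only needs the risk sets inside the sampled batch, so the per-step cost depends on the batch size rather than on n. The full-data loss is evaluated here only to monitor progress.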

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Survival analyses on datasets too large for full partial-likelihood computation become statistically grounded with this estimator.
Tuning guidelines for deep survival models can prioritize the learning-rate-to-batch-size ratio rather than separate searches.
The same mini-batch surrogate approach may extend to other hazard models once similar surrogate properties are verified.
The remaining polylog factor leaves open the possibility of sharper rate results without extra logarithmic terms.
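One way to act on the ratio-based tuning idea is to fix a reference learning rate and batch size, then rescale the learning rate whenever the batch size changes so that their ratio stays constant. The helper below is a hypothetical sketch of that heuristic with illustrative values. It is an editorial reading, not a rule stated in the paper.

```python
def scaled_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Return the learning rate that keeps lr / batch_size at the reference ratio."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    ratio = base_lr / base_batch_size
    return ratio * new_batch_size

# Candidate configurations that all share one ratio (values are illustrative)
base_lr, base_batch = 0.01, 32
configs = [(b, scaled_learning_rate(base_lr, base_batch, b)) for b in (32, 64, 128)]
```

Searching over the ratio and then choosing a batch size that fits in memory gives a one-dimensional search in place of a two-dimensional grid.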

Load-bearing premise

The mini-batch partial likelihood acts as a statistically valid surrogate for the full partial likelihood so that consistency and rate results carry over.

What would settle it

Run simulations of Cox-NN on increasing sample sizes and check whether the mb-MPLE estimation error decreases at the claimed minimax rate (up to polylog) or deviates from it.
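A simple way to read such a simulation is to regress log error on log sample size and compare the fitted slope with the theoretical exponent. The sketch below assumes the estimation errors have already been computed at each sample size. The function name and the placeholder numbers are hypothetical and are not results from the paper.

```python
import numpy as np

def empirical_rate(sample_sizes, errors):
    """Least-squares slope of log(error) against log(n).

    A slope near -r suggests the error shrinks like n**(-r); a polylog
    factor makes the fitted slope somewhat shallower at moderate n.
    """
    log_n = np.log(np.asarray(sample_sizes, dtype=float))
    log_err = np.log(np.asarray(errors, dtype=float))
    slope, _intercept = np.polyfit(log_n, log_err, deg=1)
    return float(slope)

# Placeholder errors that decay exactly like n**(-0.5); real values would
# come from repeated Cox-NN fits at each sample size.
sample_sizes = [500, 1000, 2000, 4000, 8000]
placeholder_errors = [3.0 * n ** -0.5 for n in sample_sizes]
fitted_slope = empirical_rate(sample_sizes, placeholder_errors)
```

A fitted slope that stays clearly shallower than the theoretical exponent as n grows would be evidence against the claimed rate.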

Original abstract

The stochastic gradient descent (SGD) algorithm has been widely used to optimize deep Cox neural network (Cox-NN) by updating model parameters using mini-batches of data. We show that SGD aims to optimize the average of mini-batch partial-likelihood, which is different from the standard partial-likelihood. This distinction requires developing new statistical properties for the global optimizer, namely, the mini-batch maximum partial-likelihood estimator (mb-MPLE). We establish that mb-MPLE for Cox-NN is consistent and achieves the optimal minimax convergence rate up to a polylogarithmic factor. For Cox regression with linear covariate effects, we further show that mb-MPLE is $\sqrt{n}$-consistent and asymptotically normal with asymptotic variance approaching the information lower bound as batch size increases, which is confirmed by simulation studies. Additionally, we offer practical guidance on using SGD, supported by theoretical analysis and numerical evidence. For Cox-NN, we demonstrate that the ratio of the learning rate to the batch size is critical in SGD dynamics, offering insight into hyperparameter tuning. For Cox regression, we characterize the iterative convergence of SGD, ensuring that the global optimizer, mb-MPLE, can be approximated with sufficiently many iterations. Finally, we demonstrate the effectiveness of mb-MPLE in a large-scale real-world application where the standard MPLE is intractable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives consistency and near-optimal rates for the global maximizer of the averaged mini-batch partial likelihood in Cox neural nets, plus asymptotic normality in the linear case.

The new piece is treating the averaged mini-batch partial likelihood as a distinct objective and proving that its maximizer (mb-MPLE) is consistent for Cox-NN and attains the minimax rate up to a polylog factor. In the linear Cox case the authors also obtain sqrt(n)-consistency and asymptotic normality, with variance approaching the information bound as batch size grows. They tie this to SGD behavior and give concrete tuning advice on the learning-rate-to-batch-size ratio, plus an example where the full MPLE is infeasible but the mini-batch version runs on large data. Simulations check the linear asymptotics and the real-data section shows practical payoff. That combination of new theory plus usable guidance is the main value.

The polylog factor is typical for neural-net rates and not a surprise, but it does mean the result is not fully sharp. The theory centers on the global maximizer, so the gap between that and what SGD actually reaches in deep-net training is addressed only at a high level through iteration analysis. Assumptions look standard for this literature but could be spelled out more explicitly for medical-data settings.

This is for people working on scalable survival models or on the statistical side of mini-batch M-estimation. A reader who needs foundations for deep Cox methods or practical SGD rules for censored data will find it useful. It deserves a serious referee because the claims are specific, the linear-case results are checkable, and the authors supply both theory and evidence.

Referee Report

2 major / 2 minor

Summary. The paper develops statistical theory for mini-batch maximum partial likelihood estimation (mb-MPLE) in Cox proportional hazards models, including deep neural network versions (Cox-NN). It claims that the global maximizer of the averaged mini-batch partial likelihood is consistent for Cox-NN and attains the optimal minimax rate up to a polylog factor; for linear Cox regression it is sqrt(n)-consistent and asymptotically normal with variance approaching the information bound as batch size grows. The work also derives practical guidance on SGD dynamics (learning-rate-to-batch-size ratio for Cox-NN; iterative convergence for linear Cox) and demonstrates utility on large-scale data where full MPLE is intractable.

Significance. If the derivations hold, the results supply missing statistical foundations for SGD training of deep survival models, a setting where scalability demands mini-batching yet standard partial-likelihood theory does not directly apply. The linear-case asymptotic normality result that recovers the information bound, the explicit hyperparameter guidance, and the large-scale application are concrete strengths. The work is relevant to both theoretical statisticians and practitioners using neural networks for time-to-event data.

major comments (2)

[§3] In §3 (or the section containing the consistency proof for Cox-NN), the argument that the mini-batch partial likelihood satisfies the requisite uniform-convergence or empirical-process conditions for consistency appears to rely on new properties that are not standard for the full partial likelihood. The manuscript should explicitly state the additional regularity conditions (e.g., on the neural-network class, censoring, or batch-size scaling) that close the gap between the averaged mini-batch objective and the population partial likelihood.
[Theorem on minimax rate] In the theorem on the minimax rate (likely in §4), the claim of optimality up to a polylog factor requires a matching lower bound. If the lower bound is taken from the literature on nonparametric Cox estimation, the manuscript must verify that the mini-batch estimator does not incur an extra factor beyond the polylog term already present in the full-data case.

minor comments (2)

[Introduction / §2] The notation distinguishing the full partial likelihood from the averaged mini-batch version should be introduced with an explicit equation in the introduction or §2 to avoid reader confusion.
[Simulation section] Simulation figures for the linear case (asymptotic normality) would benefit from reporting the effective sample size or number of SGD iterations alongside batch size to allow direct comparison with the theoretical regime where batch size grows with n.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our theoretical results. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications.

Referee: [§3] In §3 (or the section containing the consistency proof for Cox-NN), the argument that the mini-batch partial likelihood satisfies the requisite uniform-convergence or empirical-process conditions for consistency appears to rely on new properties that are not standard for the full partial likelihood. The manuscript should explicitly state the additional regularity conditions (e.g., on the neural-network class, censoring, or batch-size scaling) that close the gap between the averaged mini-batch objective and the population partial likelihood.

Authors: We agree that the regularity conditions should be stated more explicitly to highlight the differences from the full partial likelihood. In the revised manuscript we will add a dedicated subsection (or appendix) that lists the precise conditions on the neural-network function class (e.g., bounded Lipschitz constants and weight norms), the censoring distribution (independent censoring with density bounded away from zero), and the batch-size scaling (batch size b_n satisfying b_n → ∞ and b_n = o(n)) that guarantee uniform convergence of the averaged mini-batch objective to the population partial likelihood at the required rate. These conditions are already implicit in the proofs but will now be collected and contrasted with the classical full-data setting.
revision: yes
Referee: [Theorem on minimax rate] In the theorem on the minimax rate (likely in §4), the claim of optimality up to a polylog factor requires a matching lower bound. If the lower bound is taken from the literature on nonparametric Cox estimation, the manuscript must verify that the mini-batch estimator does not incur an extra factor beyond the polylog term already present in the full-data case.

Authors: The upper bound we derive for the mb-MPLE matches the known minimax lower bound for nonparametric Cox estimation (up to the polylog factor already present in the full-data literature) because the additional variability from mini-batching is absorbed into the same polylog term under our batch-size scaling. In the revised version we will add an explicit remark (and a short proof sketch) verifying that the mini-batch averaging does not introduce an additional factor beyond the polylog factor that appears in the full partial-likelihood case. The key step is that the empirical-process deviation between the mini-batch and full objectives is o_p of the rate achieved by the full estimator.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The central claims rest on newly derived statistical properties for the mini-batch partial likelihood as a surrogate for the full partial likelihood in Cox-NN models. These properties are established directly via analysis of the averaged mini-batch objective, without reduction to fitted inputs renamed as predictions, self-definitional loops, or load-bearing self-citations. The linear-case asymptotic normality result follows standard M-estimator arguments once batch size is permitted to grow, and the polylog factor is standard for neural-network rates. The derivation chain is self-contained against external benchmarks and does not invoke author-specific uniqueness theorems or ansatzes smuggled via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces mb-MPLE as a new estimator but relies on standard survival-analysis axioms; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Standard assumptions for the Cox proportional hazards model hold, including independent censoring and proportional hazards.
These are typical for Cox models and are implied by the use of partial likelihood.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Radiomics-Guided Vision Transformers for Survival Analysis
physics.med-ph · 2026-04 · unverdicted · novelty 5.0

A radiomics-guided hybrid Vision Transformer integrates pixel embeddings with interpretable radiomic features in a multimodal Cox model for survival analysis, yielding competitive discrimination and clinically meaning...

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[2] Amari, S.-i. (1993). Backpropagation and stochastic gradient descent method. Neurocomputing 5, 185–196.
[3] Andersen, P. K. and Gill, R. D. (1982). Cox's regression model for counting processes: a large sample study. The Annals of Statistics, pages 1100–1120.
[4] Andriushchenko, M. and Flammarion, N. (2022). Towards understanding sharpness-aware minimization. In International Conference on Machine Learning, pages 639–668. PMLR.
[5] Bauer, B. and Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics.
[6] Bottou, L. (2012). Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade: Second Edition, pages 421–436. Springer.
[7] Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., and Li, H. (2007). Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136.
[8] Chen, M.-H., Ibrahim, J. G., and Shao, Q.-M. (2009). Maximum likelihood inference for the Cox regression model with applications to missing covariates. Journal of Multivariate Analysis 100, 2018–2030.
[9] Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR.
[10] Ching, T., Zhu, X., and Garmire, L. X. (2018). Cox-nnet: an artificial neural network method for prognosis prediction of high-throughput omics data. PLoS Computational Biology 14, e1006076.
[11] Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34, 187–202.
[12] Cox, D. R. (1975). Partial likelihood. Biometrika 62, 269–276.
[13] Faraggi, D. and Simon, R. (1995). A neural network model for survival data. Statistics in Medicine 14, 73–82.
[14] Goldstein, L. and Langholz, B. (1992). Asymptotic theory for nested case-control sampling in the Cox regression model. The Annals of Statistics, pages 1903–1928.
[15] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
[16] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
[17] Hinton, G., Srivastava, N., and Swersky, K. (2012). Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on 14, 2.
[18] Hoeffding, W. (1992). A class of statistics with asymptotically normal distribution. Breakthroughs in Statistics: Foundations and Basic Theory, pages 308–334.
[19] Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. (2017). Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623.
[20] Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. (2018). Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio. Artificial Neural Networks and Machine Learning – ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7...
[21] Katzman, J. L., Shaham, U., Cloninger, A., Bates, J., Jiang, T., and Kluger, Y. (2018). DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology 18, 1–12.
[22] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[23] Kleinberg, B., Li, Y., and Yuan, Y. (2018). An alternative view: When does SGD escape local minima? In International Conference on Machine Learning, pages 2698–2707. PMLR.
[24] Kvamme, H., Borgan, Ø., and Scheel, I. (2019). Time-to-event prediction with neural networks and Cox regression. Journal of Machine Learning Research 20, 1–30.
[25] Lacoste-Julien, S., Schmidt, M., and Bach, F. (2012). A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002.
[26] Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. (2018). Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems 31.
[27] Luce, R. D. (1959). Individual choice behavior, volume 4. Wiley New York.
[28] Moulines, E. and Bach, F. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems 24.
[29] Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.
[30] Plackett, R. L. (1975). The analysis of permutations. Journal of the Royal Statistical Society Series C: Applied Statistics 24, 193–202.
[31] Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30, 838–855.
[32] Qi, H., Wang, F., and Wang, H. (2023). Statistical analysis of fixed mini-batch gradient descent estimator. Journal of Computational and Graphical Statistics, pages 1–24.
[33] Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.
[34] Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics.
[35] Srinivas, S., Subramanya, A., and Venkatesh Babu, R. (2017). Training sparse neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 138–145.
[36] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1929–1958.
[37] Sun, T., Wei, Y., Chen, W., and Ding, Y. (2020). Genome-wide association study-based deep learning for survival prediction. Statistics in Medicine 39, 4605–4620.
[38] Tarkhan, A. and Simon, N. (2024). An online framework for survival analysis: reframing Cox proportional hazards model for large data sets and neural networks. Biostatistics 25, 134–153.
[39] Therneau, T. et al. (2015). A package for survival analysis in S. R package version 2, 2014.
[40] Toulis, P. and Airoldi, E. M. (2017). Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics.
[41] Xie, Z., Sato, I., and Sugiyama, M. (2020). A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In International Conference on Learning Representations.
[42] Zhong, Q., Mueller, J., and Wang, J.-L. (2022). Deep learning for the partially linear Cox model. The Annals of Statistics 50, 1348–1375.
