Combining Stochastic Adaptive Cubic Regularization with Negative Curvature for Nonconvex Optimization

Panos M. Pardalos; Seonho Park; Seung Hyun Jung

arxiv: 1906.11417 · v1 · pith:QWY6EUJ3new · submitted 2019-06-27 · 🧮 math.OC · cs.LG

Combining Stochastic Adaptive Cubic Regularization with Negative Curvature for Nonconvex Optimization

Seonho Park , Seung Hyun Jung , Panos M. Pardalos This is my paper

Pith reviewed 2026-05-25 15:12 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords nonconvex optimizationstochastic optimizationcubic regularizationnegative curvatureadaptive methodsfinite-sum problemsmachine learningsaddle point escape

0 comments

The pith

SANC merges negative curvature into adaptive cubic regularization to update even on unsuccessful iterations for stochastic nonconvex problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the SANC algorithm to minimize nonconvex finite-sum functions common in machine learning. It extends the adaptive cubic regularized Newton method by adding negative curvature directions so that the algorithm can still make progress when a step is rejected by the trust-region acceptance test. The method draws stochastic gradient and Hessian estimates from independent fixed-size data batches at every iteration rather than varying batches. This design aims to keep the global convergence and strict saddle escape properties while making the scheme practical for large problems. Experiments on neural network tasks are presented to illustrate the approach.

Core claim

The central claim is that negative curvature can be combined with the adaptive cubic regularized Newton method to produce updates on unsuccessful iterations, using independent fixed-size stochastic estimators for gradients and Hessians, and that this is the first such combination while retaining the underlying convergence guarantees for nonconvex finite-sum minimization.

What carries the argument

The SANC algorithm, which augments the cubic model acceptance test with a negative curvature direction computed from the stochastic Hessian when the regularized step is rejected.

If this is right

The algorithm performs updates at every iteration rather than skipping unsuccessful ones.
Global convergence and escape from strict saddle points are retained under the stochastic estimators.
Fixed batch sizes make the method directly applicable to large-scale machine learning without changing sample sizes per iteration.
Empirical tests on neural network training problems show practical efficiency gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same negative-curvature augmentation could be tested inside other adaptive regularization frameworks that already use trust-region acceptance.
If the fixed-batch assumption holds, similar combinations might extend to variance-reduced estimators without altering the core update logic.
The approach suggests a template for adding escape mechanisms to any cubic or higher-order regularization method that currently discards rejected steps.

Load-bearing premise

Independent fixed-size stochastic gradient and Hessian estimates drawn at each iteration remain accurate enough to preserve the convergence and saddle-escape behavior of the deterministic cubic scheme.

What would settle it

A constructed nonconvex finite-sum problem containing a known strict saddle where SANC with small fixed batches fails to produce negative curvature directions that escape the saddle while the deterministic version succeeds.

Figures

Figures reproduced from arXiv: 1906.11417 by Panos M. Pardalos, Seonho Park, Seung Hyun Jung.

**Figure 1.** Figure 1: Training losses of the logistic regression with a nonconvex regularization term with λ = 1.0 over the number of oracle calls. We set σ0 = 0.001 for both methods. This setting makes unsuccessful iterations at initial iterations. All function losses are the average of independent 10 runs. out in Section 3, with a smaller σ0, we experienced that the SCR method suffers from unsuccessful iterations in initial … view at source ↗

**Figure 2.** Figure 2: Training losses over the number of oracle calls. (a) Logistic regression with a nonconvex regularization term with λ = 1.0 solving for w1a (n = 2477, d = 300)(top) and higgs (n = 11000000, d = 28)(bottom) data. (b) multi-layer perceptron solving for seismic(top) and segment(bottom) data. (c) convolutional neural networks solving for MNIST(top) and CIFAR10(bottom) data. We omit CR method for CNN problems be… view at source ↗

**Figure 3.** Figure 3: Training losses of the logistic regression over the number of oracle calls. All function losses are the average of independent 10 runs [PITH_FULL_IMAGE:figures/full_fig_p019_3.png] view at source ↗

**Figure 4.** Figure 4: Training losses of the logistic regression with a nonconvex regularization term with λ = 1.0 over the number of oracle calls. SANC and SCR start with σ0 = 0.001. All function losses are the average of independent 10 runs. (a) MNIST dataset (b) CIFAR10 dataset [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗

**Figure 5.** Figure 5: Training losses of the convolutional neural networks with ReLU nonlinear activation functions over the number of oracle calls. All function losses are the average of independent 10 runs [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

read the original abstract

We focus on minimizing nonconvex finite-sum functions that typically arise in machine learning problems. In an attempt to solve this problem, the adaptive cubic regularized Newton method has shown its strong global convergence guarantees and ability to escape from strict saddle points. This method uses a trust region-like scheme to determine if an iteration is successful or not, and updates only when it is successful. In this paper, we suggest an algorithm combining negative curvature with the adaptive cubic regularized Newton method to update even at unsuccessful iterations. We call this new method Stochastic Adaptive cubic regularization with Negative Curvature (SANC). Unlike the previous method, in order to attain stochastic gradient and Hessian estimators, the SANC algorithm uses independent sets of data points of consistent size over all iterations. It makes the SANC algorithm more practical to apply for solving large-scale machine learning problems. To the best of our knowledge, this is the first approach that combines the negative curvature method with the adaptive cubic regularized Newton method. Finally, we provide experimental results including neural networks problems supporting the efficiency of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SANC adds negative curvature steps to stochastic adaptive cubic regularization with fixed-size batches, but constant variance likely prevents the accuracy needed to keep the global convergence and saddle-escape guarantees.

read the letter

The main thing to know is that the paper introduces SANC, which combines negative curvature with the adaptive cubic regularized Newton method so that the algorithm can still make progress on unsuccessful iterations, all while drawing stochastic gradient and Hessian estimates from independent fixed-size batches each step. The abstract positions this combination as new and reports experiments on neural network problems to support practicality for large-scale finite-sum nonconvex problems.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Stochastic Adaptive cubic regularization with Negative Curvature (SANC) algorithm for nonconvex finite-sum minimization. It augments the adaptive cubic regularized Newton method with negative curvature directions to permit progress on unsuccessful iterations, employs independent fixed-size batches for stochastic gradient and Hessian estimates at every step, claims to retain global convergence and strict-saddle escape guarantees, asserts novelty as the first such combination, and reports supporting experiments on neural-network problems.

Significance. If the analysis rigorously shows that fixed-batch stochastic estimators suffice to preserve the deterministic convergence and escape properties, the result would be a practical advance for large-scale nonconvex optimization, removing the need for growing batches or variance reduction while retaining the strong theoretical features of adaptive cubic regularization.

major comments (2)

[Convergence analysis (implicit in abstract claims)] The central claim that fixed-size independent batches preserve the accuracy conditions required by adaptive cubic regularization (trust-region acceptance and negative-curvature escape) is load-bearing yet unsupported by any mechanism (growing batches, variance bounds, or relaxed tolerances) that would counteract non-vanishing variance; this directly affects the global convergence and saddle-escape guarantees asserted in the abstract.
[Algorithm 1 and surrounding text] The description of how negative curvature is combined with the adaptive cubic model to enable updates on unsuccessful iterations must specify the modified acceptance test and the stochastic error tolerances used at each step; without these, it is impossible to verify that the stochastic estimators remain compatible with the deterministic analysis.

minor comments (2)

[Experiments] The experimental section should report the precise batch sizes employed, the number of independent runs, and quantitative comparison metrics against the plain adaptive cubic method and other stochastic baselines.
[Preliminaries] Notation for the stochastic estimators (e.g., how ĝ and Ĥ are defined from the fixed-size batches) should be introduced explicitly before the algorithm statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address the two major comments point by point below.

read point-by-point responses

Referee: [Convergence analysis (implicit in abstract claims)] The central claim that fixed-size independent batches preserve the accuracy conditions required by adaptive cubic regularization (trust-region acceptance and negative-curvature escape) is load-bearing yet unsupported by any mechanism (growing batches, variance bounds, or relaxed tolerances) that would counteract non-vanishing variance; this directly affects the global convergence and saddle-escape guarantees asserted in the abstract.

Authors: We agree that the manuscript asserts retention of the global convergence and strict-saddle escape guarantees with fixed-size batches but does not supply the supporting analysis or mechanism to control non-vanishing variance. The current text therefore leaves this central claim unsupported. In revision we will either add a rigorous section establishing the required accuracy conditions under fixed batches (or suitable relaxed tolerances) or we will remove/qualify the guarantee claims. This is a substantive gap that must be resolved. revision: yes
Referee: [Algorithm 1 and surrounding text] The description of how negative curvature is combined with the adaptive cubic model to enable updates on unsuccessful iterations must specify the modified acceptance test and the stochastic error tolerances used at each step; without these, it is impossible to verify that the stochastic estimators remain compatible with the deterministic analysis.

Authors: We accept that the current description of Algorithm 1 and the surrounding text is incomplete on this point. We will revise the manuscript to explicitly state the modified acceptance test used when negative-curvature steps are taken on unsuccessful iterations and to list the stochastic error tolerances imposed on the gradient and Hessian estimators at each step. These additions will allow direct verification of compatibility with the deterministic framework. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic proposal is self-contained

full rationale

The paper introduces SANC as a direct algorithmic combination of adaptive cubic regularization, negative curvature steps, and fixed-batch stochastic estimators. No derivation chain, fitted parameters, or predictions are presented that reduce by construction to inputs. The novelty claim ('first approach that combines') and experimental results stand independently; no self-citation is load-bearing for any mathematical result. The central assumption about estimator accuracy is stated explicitly but does not create definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the algorithm description does not introduce new mathematical objects beyond standard stochastic estimators.

pith-pipeline@v0.9.0 · 5721 in / 1026 out tokens · 23521 ms · 2026-05-25T15:12:13.634604+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We call this new method Stochastic Adaptive cubic regularization with Negative Curvature (SANC). ... uses independent sets of data points of consistent size over all iterations.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 7 (Worst-Case Iteration Complexity ... O(ε^{-3})

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 5 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

Finding approximate local minima faster than gradient descent

Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp.\ 1195--1199. ACM, 2017 a

work page 2017
[3]

Second-order stochastic optimization for machine learning in linear time

Agarwal, N., Bullins, B., and Hazan, E. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18 0 (1): 0 4148--4187, 2017 b

work page 2017
[4]

H., Diouane, Y., and Gratton, S

Bergou, E. H., Diouane, Y., and Gratton, S. A line-search algorithm inspired by the adaptive cubic regularization framework and complexity analysis. Journal of Optimization Theory and Applications, 178 0 (3): 0 885--913, 2018

work page 2018
[5]

E., and Nocedal, J

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60 0 (2): 0 223--311, 2018

work page 2018
[6]

H., Hansen, S

Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26 0 (2): 0 1008--1031, 2016

work page 2016
[7]

M., and Prieto, F

Cano, J., Moguerza, J. M., and Prieto, F. J. Using improved directions of negative curvature for the solution of bound-constrained nonconvex problems. Journal of Optimization Theory and Applications, 174 0 (2): 0 474--499, 2017

work page 2017
[8]

and Duchi, J

Carmon, Y. and Duchi, J. C. Gradient descent efficiently finds the cubic-regularized non-convex newton step. arXiv preprint arXiv:1612.00547, 2016

work page arXiv 2016
[9]

C., Hinder, O., and Sidford, A

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28 0 (2): 0 1751--1772, 2018

work page 2018
[10]

I., and Toint, P

Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127 0 (2): 0 245--295, 2011 a

work page 2011
[11]

I., and Toint, P

Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. part ii: worst-case function-and derivative-evaluation complexity. Mathematical programming, 130 0 (2): 0 295--319, 2011 b

work page 2011
[12]

and Lin, C.-J

Chang, C.-C. and Lin, C.-J. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2 0 (3): 0 27, 2011

work page 2011
[13]

Coakley, E. S. and Rokhlin, V. A fast divide-and-conquer algorithm for computing the spectra of real symmetric tridiagonal matrices. Applied and Computational Harmonic Analysis, 34 0 (3): 0 379--414, 2013

work page 2013
[14]

Curtis, F. E. and Robinson, D. P. Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[15]

and Bengio, Y

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp.\ 249--256, 2010

work page 2010
[16]

The modification of newton’s method for unconstrained optimization by bounding cubic terms

Griewank, A. The modification of newton’s method for unconstrained optimization by bounding cubic terms. Technical report, Technical report NA/12, 1981

work page 1981
[17]

Recovering low-rank matrices from few coefficients in any basis

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57 0 (3): 0 1548--1566, 2011

work page 2011
[18]

and Zhang, T

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pp.\ 315--323, 2013

work page 2013
[19]

Kohler, J. M. and Lucchi, A. Sub-sampled cubic regularization for non-convex optimization. arXiv preprint arXiv:1705.05933, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[20]

and Hinton, G

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009

work page 2009
[21]

and Wo \'z niakowski, H

Kuczy \'n ski, J. and Wo \'z niakowski, H. Estimating the largest eigenvalue by the power and lanczos algorithms with a random start. SIAM journal on matrix analysis and applications, 13 0 (4): 0 1094--1122, 1992

work page 1992
[22]

Gradient-based learning applied to document recognition

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86 0 (11): 0 2278--2324, 1998

work page 1998
[23]

Gradient Descent Converges to Minimizers

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[24]

Adaptive negative curvature descent with applications in non-convex optimization

Liu, M., Li, Z., Wang, X., Yi, J., and Yang, T. Adaptive negative curvature descent with applications in non-convex optimization. In Advances in Neural Information Processing Systems, pp.\ 4854--4863, 2018

work page 2018
[25]

Deep learning via hessian-free optimization

Martens, J. Deep learning via hessian-free optimization. In ICML, volume 27, pp.\ 735--742, 2010

work page 2010
[26]

and Sutskever, I

Martens, J. and Sutskever, I. Learning recurrent neural networks with hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp.\ 1033--1040. Citeseer, 2011

work page 2011
[27]

and Polyak, B

Nesterov, Y. and Polyak, B. T. Cubic regularization of newton method and its global performance. Mathematical Programming, 108 0 (1): 0 177--205, 2006

work page 2006
[28]

Simplified neuron model as a principal component analyzer

Oja, E. Simplified neuron model as a principal component analyzer. Journal of mathematical biology, 15 0 (3): 0 267--273, 1982

work page 1982
[29]

Pearlmutter, B. A. Fast exact multiplication by the hessian. Neural computation, 6 0 (1): 0 147--160, 1994

work page 1994
[30]

A Generic Approach for Escaping Saddle points

Reddi, S. J., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., and Smola, A. J. A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

and Povey, D

Vinyals, O. and Povey, D. Krylov subspace descent for deep learning. In Artificial Intelligence and Statistics, pp.\ 1261--1268, 2012

work page 2012
[32]

Wang, X., Fan, N., and Pardalos, P. M. Stochastic subgradient descent method for large-scale robust chance-constrained support vector machines. Optimization letters, 11 0 (5): 0 1013--1024, 2017 a

work page 2017
[33]

Stochastic quasi-newton methods for nonconvex stochastic optimization

Wang, X., Ma, S., Goldfarb, D., and Liu, W. Stochastic quasi-newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27 0 (2): 0 927--956, 2017 b

work page 2017
[34]

Cubic Regularization with Momentum for Nonconvex Optimization

Wang, Z., Zhou, Y., Liang, Y., and Lan, G. Cubic regularization with momentum for nonconvex optimization. arXiv preprint arXiv:1810.03763, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[35]

First-order stochastic algorithms for escaping from saddle points in almost linear time

Xu, Y., Rong, J., and Yang, T. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems, pp.\ 5531--5541, 2018

work page 2018

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

Finding approximate local minima faster than gradient descent

Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp.\ 1195--1199. ACM, 2017 a

work page 2017

[3] [3]

Second-order stochastic optimization for machine learning in linear time

Agarwal, N., Bullins, B., and Hazan, E. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18 0 (1): 0 4148--4187, 2017 b

work page 2017

[4] [4]

H., Diouane, Y., and Gratton, S

Bergou, E. H., Diouane, Y., and Gratton, S. A line-search algorithm inspired by the adaptive cubic regularization framework and complexity analysis. Journal of Optimization Theory and Applications, 178 0 (3): 0 885--913, 2018

work page 2018

[5] [5]

E., and Nocedal, J

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60 0 (2): 0 223--311, 2018

work page 2018

[6] [6]

H., Hansen, S

Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26 0 (2): 0 1008--1031, 2016

work page 2016

[7] [7]

M., and Prieto, F

Cano, J., Moguerza, J. M., and Prieto, F. J. Using improved directions of negative curvature for the solution of bound-constrained nonconvex problems. Journal of Optimization Theory and Applications, 174 0 (2): 0 474--499, 2017

work page 2017

[8] [8]

and Duchi, J

Carmon, Y. and Duchi, J. C. Gradient descent efficiently finds the cubic-regularized non-convex newton step. arXiv preprint arXiv:1612.00547, 2016

work page arXiv 2016

[9] [9]

C., Hinder, O., and Sidford, A

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28 0 (2): 0 1751--1772, 2018

work page 2018

[10] [10]

I., and Toint, P

Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127 0 (2): 0 245--295, 2011 a

work page 2011

[11] [11]

I., and Toint, P

Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. part ii: worst-case function-and derivative-evaluation complexity. Mathematical programming, 130 0 (2): 0 295--319, 2011 b

work page 2011

[12] [12]

and Lin, C.-J

Chang, C.-C. and Lin, C.-J. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2 0 (3): 0 27, 2011

work page 2011

[13] [13]

Coakley, E. S. and Rokhlin, V. A fast divide-and-conquer algorithm for computing the spectra of real symmetric tridiagonal matrices. Applied and Computational Harmonic Analysis, 34 0 (3): 0 379--414, 2013

work page 2013

[14] [14]

Curtis, F. E. and Robinson, D. P. Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[15] [15]

and Bengio, Y

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp.\ 249--256, 2010

work page 2010

[16] [16]

The modification of newton’s method for unconstrained optimization by bounding cubic terms

Griewank, A. The modification of newton’s method for unconstrained optimization by bounding cubic terms. Technical report, Technical report NA/12, 1981

work page 1981

[17] [17]

Recovering low-rank matrices from few coefficients in any basis

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57 0 (3): 0 1548--1566, 2011

work page 2011

[18] [18]

and Zhang, T

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pp.\ 315--323, 2013

work page 2013

[19] [19]

Kohler, J. M. and Lucchi, A. Sub-sampled cubic regularization for non-convex optimization. arXiv preprint arXiv:1705.05933, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[20] [20]

and Hinton, G

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009

work page 2009

[21] [21]

and Wo \'z niakowski, H

Kuczy \'n ski, J. and Wo \'z niakowski, H. Estimating the largest eigenvalue by the power and lanczos algorithms with a random start. SIAM journal on matrix analysis and applications, 13 0 (4): 0 1094--1122, 1992

work page 1992

[22] [22]

Gradient-based learning applied to document recognition

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86 0 (11): 0 2278--2324, 1998

work page 1998

[23] [23]

Gradient Descent Converges to Minimizers

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[24] [24]

Adaptive negative curvature descent with applications in non-convex optimization

Liu, M., Li, Z., Wang, X., Yi, J., and Yang, T. Adaptive negative curvature descent with applications in non-convex optimization. In Advances in Neural Information Processing Systems, pp.\ 4854--4863, 2018

work page 2018

[25] [25]

Deep learning via hessian-free optimization

Martens, J. Deep learning via hessian-free optimization. In ICML, volume 27, pp.\ 735--742, 2010

work page 2010

[26] [26]

and Sutskever, I

Martens, J. and Sutskever, I. Learning recurrent neural networks with hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp.\ 1033--1040. Citeseer, 2011

work page 2011

[27] [27]

and Polyak, B

Nesterov, Y. and Polyak, B. T. Cubic regularization of newton method and its global performance. Mathematical Programming, 108 0 (1): 0 177--205, 2006

work page 2006

[28] [28]

Simplified neuron model as a principal component analyzer

Oja, E. Simplified neuron model as a principal component analyzer. Journal of mathematical biology, 15 0 (3): 0 267--273, 1982

work page 1982

[29] [29]

Pearlmutter, B. A. Fast exact multiplication by the hessian. Neural computation, 6 0 (1): 0 147--160, 1994

work page 1994

[30] [30]

A Generic Approach for Escaping Saddle points

Reddi, S. J., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., and Smola, A. J. A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[31] [31]

and Povey, D

Vinyals, O. and Povey, D. Krylov subspace descent for deep learning. In Artificial Intelligence and Statistics, pp.\ 1261--1268, 2012

work page 2012

[32] [32]

Wang, X., Fan, N., and Pardalos, P. M. Stochastic subgradient descent method for large-scale robust chance-constrained support vector machines. Optimization letters, 11 0 (5): 0 1013--1024, 2017 a

work page 2017

[33] [33]

Stochastic quasi-newton methods for nonconvex stochastic optimization

Wang, X., Ma, S., Goldfarb, D., and Liu, W. Stochastic quasi-newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27 0 (2): 0 927--956, 2017 b

work page 2017

[34] [34]

Cubic Regularization with Momentum for Nonconvex Optimization

Wang, Z., Zhou, Y., Liang, Y., and Lan, G. Cubic regularization with momentum for nonconvex optimization. arXiv preprint arXiv:1810.03763, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[35] [35]

First-order stochastic algorithms for escaping from saddle points in almost linear time

Xu, Y., Rong, J., and Yang, T. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems, pp.\ 5531--5541, 2018

work page 2018