pith. sign in

arxiv: 1906.11417 · v1 · pith:QWY6EUJ3new · submitted 2019-06-27 · 🧮 math.OC · cs.LG

Combining Stochastic Adaptive Cubic Regularization with Negative Curvature for Nonconvex Optimization

Pith reviewed 2026-05-25 15:12 UTC · model grok-4.3

classification 🧮 math.OC cs.LG
keywords nonconvex optimizationstochastic optimizationcubic regularizationnegative curvatureadaptive methodsfinite-sum problemsmachine learningsaddle point escape
0
0 comments X

The pith

SANC merges negative curvature into adaptive cubic regularization to update even on unsuccessful iterations for stochastic nonconvex problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the SANC algorithm to minimize nonconvex finite-sum functions common in machine learning. It extends the adaptive cubic regularized Newton method by adding negative curvature directions so that the algorithm can still make progress when a step is rejected by the trust-region acceptance test. The method draws stochastic gradient and Hessian estimates from independent fixed-size data batches at every iteration rather than varying batches. This design aims to keep the global convergence and strict saddle escape properties while making the scheme practical for large problems. Experiments on neural network tasks are presented to illustrate the approach.

Core claim

The central claim is that negative curvature can be combined with the adaptive cubic regularized Newton method to produce updates on unsuccessful iterations, using independent fixed-size stochastic estimators for gradients and Hessians, and that this is the first such combination while retaining the underlying convergence guarantees for nonconvex finite-sum minimization.

What carries the argument

The SANC algorithm, which augments the cubic model acceptance test with a negative curvature direction computed from the stochastic Hessian when the regularized step is rejected.

If this is right

  • The algorithm performs updates at every iteration rather than skipping unsuccessful ones.
  • Global convergence and escape from strict saddle points are retained under the stochastic estimators.
  • Fixed batch sizes make the method directly applicable to large-scale machine learning without changing sample sizes per iteration.
  • Empirical tests on neural network training problems show practical efficiency gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same negative-curvature augmentation could be tested inside other adaptive regularization frameworks that already use trust-region acceptance.
  • If the fixed-batch assumption holds, similar combinations might extend to variance-reduced estimators without altering the core update logic.
  • The approach suggests a template for adding escape mechanisms to any cubic or higher-order regularization method that currently discards rejected steps.

Load-bearing premise

Independent fixed-size stochastic gradient and Hessian estimates drawn at each iteration remain accurate enough to preserve the convergence and saddle-escape behavior of the deterministic cubic scheme.

What would settle it

A constructed nonconvex finite-sum problem containing a known strict saddle where SANC with small fixed batches fails to produce negative curvature directions that escape the saddle while the deterministic version succeeds.

Figures

Figures reproduced from arXiv: 1906.11417 by Panos M. Pardalos, Seonho Park, Seung Hyun Jung.

Figure 1
Figure 1. Figure 1: Training losses of the logistic regression with a noncon￾vex regularization term with λ = 1.0 over the number of oracle calls. We set σ0 = 0.001 for both methods. This setting makes unsuccessful iterations at initial iterations. All function losses are the average of independent 10 runs. out in Section 3, with a smaller σ0, we experienced that the SCR method suffers from unsuccessful iterations in initial … view at source ↗
Figure 2
Figure 2. Figure 2: Training losses over the number of oracle calls. (a) Logistic regression with a nonconvex regularization term with λ = 1.0 solving for w1a (n = 2477, d = 300)(top) and higgs (n = 11000000, d = 28)(bottom) data. (b) multi-layer perceptron solving for seismic(top) and segment(bottom) data. (c) convolutional neural networks solving for MNIST(top) and CIFAR10(bottom) data. We omit CR method for CNN problems be… view at source ↗
Figure 3
Figure 3. Figure 3: Training losses of the logistic regression over the number of oracle calls. All function losses are the average of independent 10 runs [PITH_FULL_IMAGE:figures/full_fig_p019_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Training losses of the logistic regression with a nonconvex regularization term with λ = 1.0 over the number of oracle calls. SANC and SCR start with σ0 = 0.001. All function losses are the average of independent 10 runs. (a) MNIST dataset (b) CIFAR10 dataset [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Training losses of the convolutional neural networks with ReLU nonlinear activation functions over the number of oracle calls. All function losses are the average of independent 10 runs [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
read the original abstract

We focus on minimizing nonconvex finite-sum functions that typically arise in machine learning problems. In an attempt to solve this problem, the adaptive cubic regularized Newton method has shown its strong global convergence guarantees and ability to escape from strict saddle points. This method uses a trust region-like scheme to determine if an iteration is successful or not, and updates only when it is successful. In this paper, we suggest an algorithm combining negative curvature with the adaptive cubic regularized Newton method to update even at unsuccessful iterations. We call this new method Stochastic Adaptive cubic regularization with Negative Curvature (SANC). Unlike the previous method, in order to attain stochastic gradient and Hessian estimators, the SANC algorithm uses independent sets of data points of consistent size over all iterations. It makes the SANC algorithm more practical to apply for solving large-scale machine learning problems. To the best of our knowledge, this is the first approach that combines the negative curvature method with the adaptive cubic regularized Newton method. Finally, we provide experimental results including neural networks problems supporting the efficiency of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Stochastic Adaptive cubic regularization with Negative Curvature (SANC) algorithm for nonconvex finite-sum minimization. It augments the adaptive cubic regularized Newton method with negative curvature directions to permit progress on unsuccessful iterations, employs independent fixed-size batches for stochastic gradient and Hessian estimates at every step, claims to retain global convergence and strict-saddle escape guarantees, asserts novelty as the first such combination, and reports supporting experiments on neural-network problems.

Significance. If the analysis rigorously shows that fixed-batch stochastic estimators suffice to preserve the deterministic convergence and escape properties, the result would be a practical advance for large-scale nonconvex optimization, removing the need for growing batches or variance reduction while retaining the strong theoretical features of adaptive cubic regularization.

major comments (2)
  1. [Convergence analysis (implicit in abstract claims)] The central claim that fixed-size independent batches preserve the accuracy conditions required by adaptive cubic regularization (trust-region acceptance and negative-curvature escape) is load-bearing yet unsupported by any mechanism (growing batches, variance bounds, or relaxed tolerances) that would counteract non-vanishing variance; this directly affects the global convergence and saddle-escape guarantees asserted in the abstract.
  2. [Algorithm 1 and surrounding text] The description of how negative curvature is combined with the adaptive cubic model to enable updates on unsuccessful iterations must specify the modified acceptance test and the stochastic error tolerances used at each step; without these, it is impossible to verify that the stochastic estimators remain compatible with the deterministic analysis.
minor comments (2)
  1. [Experiments] The experimental section should report the precise batch sizes employed, the number of independent runs, and quantitative comparison metrics against the plain adaptive cubic method and other stochastic baselines.
  2. [Preliminaries] Notation for the stochastic estimators (e.g., how ĝ and Ĥ are defined from the fixed-size batches) should be introduced explicitly before the algorithm statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address the two major comments point by point below.

read point-by-point responses
  1. Referee: [Convergence analysis (implicit in abstract claims)] The central claim that fixed-size independent batches preserve the accuracy conditions required by adaptive cubic regularization (trust-region acceptance and negative-curvature escape) is load-bearing yet unsupported by any mechanism (growing batches, variance bounds, or relaxed tolerances) that would counteract non-vanishing variance; this directly affects the global convergence and saddle-escape guarantees asserted in the abstract.

    Authors: We agree that the manuscript asserts retention of the global convergence and strict-saddle escape guarantees with fixed-size batches but does not supply the supporting analysis or mechanism to control non-vanishing variance. The current text therefore leaves this central claim unsupported. In revision we will either add a rigorous section establishing the required accuracy conditions under fixed batches (or suitable relaxed tolerances) or we will remove/qualify the guarantee claims. This is a substantive gap that must be resolved. revision: yes

  2. Referee: [Algorithm 1 and surrounding text] The description of how negative curvature is combined with the adaptive cubic model to enable updates on unsuccessful iterations must specify the modified acceptance test and the stochastic error tolerances used at each step; without these, it is impossible to verify that the stochastic estimators remain compatible with the deterministic analysis.

    Authors: We accept that the current description of Algorithm 1 and the surrounding text is incomplete on this point. We will revise the manuscript to explicitly state the modified acceptance test used when negative-curvature steps are taken on unsuccessful iterations and to list the stochastic error tolerances imposed on the gradient and Hessian estimators at each step. These additions will allow direct verification of compatibility with the deterministic framework. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic proposal is self-contained

full rationale

The paper introduces SANC as a direct algorithmic combination of adaptive cubic regularization, negative curvature steps, and fixed-batch stochastic estimators. No derivation chain, fitted parameters, or predictions are presented that reduce by construction to inputs. The novelty claim ('first approach that combines') and experimental results stand independently; no self-citation is load-bearing for any mathematical result. The central assumption about estimator accuracy is stated explicitly but does not create definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the algorithm description does not introduce new mathematical objects beyond standard stochastic estimators.

pith-pipeline@v0.9.0 · 5721 in / 1026 out tokens · 23521 ms · 2026-05-25T15:12:13.634604+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 5 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    Finding approximate local minima faster than gradient descent

    Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp.\ 1195--1199. ACM, 2017 a

  3. [3]

    Second-order stochastic optimization for machine learning in linear time

    Agarwal, N., Bullins, B., and Hazan, E. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18 0 (1): 0 4148--4187, 2017 b

  4. [4]

    H., Diouane, Y., and Gratton, S

    Bergou, E. H., Diouane, Y., and Gratton, S. A line-search algorithm inspired by the adaptive cubic regularization framework and complexity analysis. Journal of Optimization Theory and Applications, 178 0 (3): 0 885--913, 2018

  5. [5]

    E., and Nocedal, J

    Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60 0 (2): 0 223--311, 2018

  6. [6]

    H., Hansen, S

    Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26 0 (2): 0 1008--1031, 2016

  7. [7]

    M., and Prieto, F

    Cano, J., Moguerza, J. M., and Prieto, F. J. Using improved directions of negative curvature for the solution of bound-constrained nonconvex problems. Journal of Optimization Theory and Applications, 174 0 (2): 0 474--499, 2017

  8. [8]

    and Duchi, J

    Carmon, Y. and Duchi, J. C. Gradient descent efficiently finds the cubic-regularized non-convex newton step. arXiv preprint arXiv:1612.00547, 2016

  9. [9]

    C., Hinder, O., and Sidford, A

    Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28 0 (2): 0 1751--1772, 2018

  10. [10]

    I., and Toint, P

    Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127 0 (2): 0 245--295, 2011 a

  11. [11]

    I., and Toint, P

    Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. part ii: worst-case function-and derivative-evaluation complexity. Mathematical programming, 130 0 (2): 0 295--319, 2011 b

  12. [12]

    and Lin, C.-J

    Chang, C.-C. and Lin, C.-J. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2 0 (3): 0 27, 2011

  13. [13]

    Coakley, E. S. and Rokhlin, V. A fast divide-and-conquer algorithm for computing the spectra of real symmetric tridiagonal matrices. Applied and Computational Harmonic Analysis, 34 0 (3): 0 379--414, 2013

  14. [14]

    Curtis, F. E. and Robinson, D. P. Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412, 2017

  15. [15]

    and Bengio, Y

    Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp.\ 249--256, 2010

  16. [16]

    The modification of newton’s method for unconstrained optimization by bounding cubic terms

    Griewank, A. The modification of newton’s method for unconstrained optimization by bounding cubic terms. Technical report, Technical report NA/12, 1981

  17. [17]

    Recovering low-rank matrices from few coefficients in any basis

    Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57 0 (3): 0 1548--1566, 2011

  18. [18]

    and Zhang, T

    Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pp.\ 315--323, 2013

  19. [19]

    Kohler, J. M. and Lucchi, A. Sub-sampled cubic regularization for non-convex optimization. arXiv preprint arXiv:1705.05933, 2017

  20. [20]

    and Hinton, G

    Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009

  21. [21]

    and Wo \'z niakowski, H

    Kuczy \'n ski, J. and Wo \'z niakowski, H. Estimating the largest eigenvalue by the power and lanczos algorithms with a random start. SIAM journal on matrix analysis and applications, 13 0 (4): 0 1094--1122, 1992

  22. [22]

    Gradient-based learning applied to document recognition

    LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86 0 (11): 0 2278--2324, 1998

  23. [23]

    Gradient Descent Converges to Minimizers

    Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016

  24. [24]

    Adaptive negative curvature descent with applications in non-convex optimization

    Liu, M., Li, Z., Wang, X., Yi, J., and Yang, T. Adaptive negative curvature descent with applications in non-convex optimization. In Advances in Neural Information Processing Systems, pp.\ 4854--4863, 2018

  25. [25]

    Deep learning via hessian-free optimization

    Martens, J. Deep learning via hessian-free optimization. In ICML, volume 27, pp.\ 735--742, 2010

  26. [26]

    and Sutskever, I

    Martens, J. and Sutskever, I. Learning recurrent neural networks with hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp.\ 1033--1040. Citeseer, 2011

  27. [27]

    and Polyak, B

    Nesterov, Y. and Polyak, B. T. Cubic regularization of newton method and its global performance. Mathematical Programming, 108 0 (1): 0 177--205, 2006

  28. [28]

    Simplified neuron model as a principal component analyzer

    Oja, E. Simplified neuron model as a principal component analyzer. Journal of mathematical biology, 15 0 (3): 0 267--273, 1982

  29. [29]

    Pearlmutter, B. A. Fast exact multiplication by the hessian. Neural computation, 6 0 (1): 0 147--160, 1994

  30. [30]

    A Generic Approach for Escaping Saddle points

    Reddi, S. J., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., and Smola, A. J. A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434, 2017

  31. [31]

    and Povey, D

    Vinyals, O. and Povey, D. Krylov subspace descent for deep learning. In Artificial Intelligence and Statistics, pp.\ 1261--1268, 2012

  32. [32]

    Wang, X., Fan, N., and Pardalos, P. M. Stochastic subgradient descent method for large-scale robust chance-constrained support vector machines. Optimization letters, 11 0 (5): 0 1013--1024, 2017 a

  33. [33]

    Stochastic quasi-newton methods for nonconvex stochastic optimization

    Wang, X., Ma, S., Goldfarb, D., and Liu, W. Stochastic quasi-newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27 0 (2): 0 927--956, 2017 b

  34. [34]

    Cubic Regularization with Momentum for Nonconvex Optimization

    Wang, Z., Zhou, Y., Liang, Y., and Lan, G. Cubic regularization with momentum for nonconvex optimization. arXiv preprint arXiv:1810.03763, 2018

  35. [35]

    First-order stochastic algorithms for escaping from saddle points in almost linear time

    Xu, Y., Rong, J., and Yang, T. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems, pp.\ 5531--5541, 2018