Cubic Regularized Newton Method with Variance Reduction for Finite-sum Non-convex Problems

Dmitry Kamzolov; Dmitry Pasechnyuk-Vilensky; Martin Tak\'a\v{c}

arxiv: 2510.08714 · v2 · submitted 2025-10-09 · 🧮 math.OC

Cubic Regularized Newton Method with Variance Reduction for Finite-sum Non-convex Problems

Dmitry Pasechnyuk-Vilensky , Dmitry Kamzolov , Martin Tak\'a\v{c} This is my paper

Pith reviewed 2026-05-18 08:19 UTC · model grok-4.3

classification 🧮 math.OC

keywords finite-sum optimizationnon-convex optimizationcubic regularizationvariance reductionsecond-order stationaritySARAH estimatorstochastic Newtonhomotopy refinement

0 comments

The pith

Variance-reduced cubic Newton method finds (ε, sqrt(L2ε)) second-order points with total complexity n plus order sqrt(n) times ε to the -3/2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a cubic-regularized Newton algorithm for finite-sum non-convex problems that incorporates variance reduction through EMA-smoothed SARAH estimators on both gradients and Hessians. It proceeds in two phases: a coarse stochastic backbone that makes initial progress at reduced per-iteration cost, followed by a terminal homotopy refinement that geometrically lowers the cubic regularization parameter, shortens stages, and ramps up mini-batch sizes while restarting exact snapshots. Under the paper's averaged squared-gradient and mean-cubed-Hessian smoothness conditions together with an inexact cubic-subproblem solver, the method is shown to return a point whose gradient norm is at most ε and whose smallest Hessian eigenvalue is at least minus sqrt(L2 ε). The resulting finite-sum oracle complexity is n plus a term that scales as the square root of n times ε to the power -3/2, improving the dependence on the number of summands relative to non-variance-reduced cubic methods.

Core claim

Under average squared gradient smoothness and average mean-cubed Hessian smoothness, the variance-reduced cubic Newton method equipped with an inexact cubic-subproblem certificate returns an (ε, √(L₂ε))-second-order stationary point after n + Õ(n^{1/2} ε^{-3/2}) finite-sum oracle evaluations; the proof splits the run into a coarse stochastic progress phase that supplies the sqrt(n) factor and a terminal local bootstrap that converts the model certificate into a true second-order guarantee.

What carries the argument

EMA-smoothed SARAH estimators for gradient and Hessian information, combined with the terminal homotopy refinement that decreases the regularization level geometrically while increasing mini-batch size and restarting exact finite-sum snapshots.

If this is right

The dominant cost after the initial full-gradient snapshot scales only as sqrt(n) ε^{-3/2} rather than linearly in n.
The analysis applies to objective classes where Hessian variation is controlled only in an averaged sense, not uniformly along every trajectory.
Once the iterates enter the certified small-step regime, the geometric reduction in regularization together with reciprocal batch-size growth converts an inexact model step into a true second-order certificate.
The separation into coarse backbone and terminal bootstrap phases isolates the stochastic variance reduction from the final accuracy requirement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The homotopy refinement schedule could be ported to other stochastic higher-order methods to obtain similar local certification without full determinism.
Problems arising in deep learning where per-sample Hessians vary sharply but their average remains well-behaved may be natural targets for this complexity regime.
An empirical study that measures the growth rate of the averaged Hessian smoothness constant on standard non-convex benchmarks would indicate how often the paper's assumptions are milder than trajectory-wise bounds.
The same variance-reduction-plus-homotopy pattern might yield improved rates for finding approximate local minima of fourth-order regularized models.

Load-bearing premise

The finite-sum objective must obey the stated average squared gradient smoothness and average mean-cubed Hessian smoothness conditions throughout the run.

What would settle it

Run the method on a finite-sum instance engineered so that the averaged mean-cubed Hessian smoothness constant grows along typical trajectories; if the observed iteration count then exceeds the predicted Õ(sqrt(n) ε^{-3/2}) bound by a polynomial factor in 1/ε, the complexity claim fails.

read the original abstract

We study finite-sum non-convex optimization $\min_{x\in\mathbb{R}^d} F(x) \;=\; \frac{1}{n}\sum_{i=1}^n f_i(x)$ and analyze a variance-reduced cubic Newton method based on EMA-smoothed SARAH estimators for both gradient and Hessian information. The method combines a coarse stochastic backbone with a terminal homotopy refinement: once the iterates enter a certified small-step regime, the algorithm decreases the regularization level geometrically, shortens the stage length, and increases the mini-batch size at the reciprocal rate while restarting exact finite-sum snapshots at each stage. We work under average squared gradient smoothness and average mean-cubed Hessian smoothness, thereby avoiding the trajectory-wise Hessian boundedness assumption that is often used in related analyses. Under these assumptions and a standard inexact cubic-subproblem certificate, we establish that the method returns an $(\varepsilon,\sqrt{L_2\varepsilon})$-second-order stationary point with total finite-sum oracle complexity $n+\widetilde{\mathcal O}\!\left(n^{1/2}\varepsilon^{-3/2}\right)$. The analysis separates into a coarse progress phase, which yields the $n^{1/2}$-scaled stochastic backbone, and a terminal local bootstrap, which supplies the pointwise accuracy needed to turn the model step certificate into a true second-order certificate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper claims an improved complexity for variance-reduced cubic Newton on finite-sum non-convex problems under averaged smoothness, but the terminal phase needs scrutiny on trajectory control.

read the letter

The main point is this paper gives a cubic regularized Newton method with variance reduction for finite-sum non-convex optimization that reaches second-order stationary points with complexity n plus order sqrt(n) epsilon to the -3/2, using averaged smoothness assumptions. They use EMA-smoothed SARAH for both gradients and Hessians. The method has a coarse stochastic backbone followed by a terminal homotopy refinement: geometric decrease in regularization, shorter stages, reciprocal batch size increase, and restarted exact snapshots. This setup works under average squared gradient smoothness and average mean-cubed Hessian smoothness, sidestepping the trajectory-wise boundedness condition common in other analyses. The phase separation is a strength. It lets the coarse phase handle the sqrt(n) scaling while the terminal part supplies the accuracy to turn the inexact cubic step into a real second-order certificate. The milder assumptions are a genuine improvement if the proof closes the gaps. The potential issue is whether the averaged cubic smoothness delivers the trajectory control needed in the bootstrap. Since it is an average over samples, the proof has to show that the EMA estimator still gives the pointwise accuracy required by the inexact certificate at each refinement stage. Without high-probability bounds or careful propagation of errors, the conversion from model minimizer to true SOSP might not hold at the claimed rate. The abstract states the result but the details are in the full text, so that section needs verification. The rest of the argument looks standard and non-circular. The citation pattern covers the relevant prior work on variance reduction and cubic methods. This paper is for people in stochastic optimization who are interested in second-order guarantees for large-scale non-convex problems. A reader who knows the SARAH and cubic Newton literature will see the technical novelty in the estimator combination and the refinement strategy. It is worth sending to peer review. The complexity claim is interesting enough and the approach is developed enough to get useful feedback from referees.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes a variance-reduced cubic-regularized Newton method for finite-sum non-convex optimization min (1/n) sum f_i(x). The algorithm employs EMA-smoothed SARAH estimators for both gradients and Hessians, combining a coarse stochastic backbone phase with a terminal homotopy refinement that geometrically decreases the cubic regularization parameter, shortens stage lengths, ramps up mini-batch sizes, and restarts exact finite-sum snapshots. Under the assumptions of average squared-gradient smoothness and average mean-cubed Hessian smoothness (avoiding trajectory-wise Hessian boundedness), together with a standard inexact cubic-subproblem certificate, the method is shown to return an (ε, √(L₂ε))-second-order stationary point with total finite-sum oracle complexity n + Õ(n^{1/2} ε^{-3/2}). The proof is separated into a coarse progress phase delivering the √n factor and a terminal local bootstrap supplying the pointwise accuracy for the second-order certificate.

Significance. If the central claims hold, the result improves upon existing stochastic second-order methods by replacing trajectory-wise Hessian bounds with averaged smoothness conditions while retaining near-optimal dependence on n and ε. The explicit separation of coarse stochastic and terminal refinement phases, together with the use of EMA-smoothed SARAH estimators, constitutes a technically interesting contribution to variance-reduced higher-order methods for non-convex finite-sum problems.

major comments (1)

Terminal bootstrap phase (described in the abstract and §4–5): the argument that average mean-cubed Hessian smoothness plus EMA-smoothed SARAH yields the high-probability pointwise accuracy required to convert the inexact cubic certificate into a true (ε, √(L₂ε))-SOSP guarantee is load-bearing. The provided description does not exhibit an explicit concentration or uniform trajectory bound showing that the averaged cubic-smoothness condition controls the error term uniformly along the realized path at the claimed rate; without this step the terminal phase may fail to close the certificate gap.

minor comments (2)

The abstract states the complexity and the two-phase structure but supplies no derivation outline or error-bound calculation; adding a one-paragraph proof sketch in the introduction would improve readability.
Notation for the averaged smoothness constants (L₁, L₂) and the geometric decrease factor for the regularization parameter should be introduced with explicit definitions before the complexity statement.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. The single major comment raises an important point about the clarity of the terminal bootstrap phase, which we address below. We will revise the manuscript to strengthen the presentation of the relevant concentration arguments.

read point-by-point responses

Referee: Terminal bootstrap phase (described in the abstract and §4–5): the argument that average mean-cubed Hessian smoothness plus EMA-smoothed SARAH yields the high-probability pointwise accuracy required to convert the inexact cubic certificate into a true (ε, √(L₂ε))-SOSP guarantee is load-bearing. The provided description does not exhibit an explicit concentration or uniform trajectory bound showing that the averaged cubic-smoothness condition controls the error term uniformly along the realized path at the claimed rate; without this step the terminal phase may fail to close the certificate gap.

Authors: We agree that the high-probability pointwise accuracy in the terminal phase is a critical step and that the current write-up would benefit from greater explicitness. The proof in Section 5 proceeds by first invoking the averaged mean-cubed Hessian smoothness (Assumption 3) to bound the third-moment terms that appear in the variance of the EMA-smoothed SARAH Hessian estimator. Because the terminal stages are geometrically shortened while the mini-batch sizes are increased at the reciprocal rate, the accumulated variance remains summable; a martingale concentration argument (Freedman-type inequality applied to the recursive SARAH updates) then yields a uniform bound over the short trajectory segment with probability 1−δ. The EMA smoothing further decouples the dependence on distant past iterates, which simplifies the martingale differences. We acknowledge that this chain of reasoning is currently distributed across several lemmas rather than isolated in a single self-contained statement. In the revision we will insert a new Lemma 5.1 that extracts precisely the uniform trajectory bound, making the passage from averaged cubic smoothness to the required (ε,√(L₂ε))-certificate fully explicit and thereby closing the gap identified by the referee. revision: yes

Circularity Check

0 steps flagged

No circularity: complexity bound derived from explicit assumptions and standard certificate without reduction to inputs by construction.

full rationale

The paper presents a variance-reduced cubic Newton method whose complexity guarantee is obtained by splitting the argument into a coarse stochastic phase (producing the n^{1/2} factor) and a terminal homotopy refinement that converts inexact cubic certificates into true second-order stationarity. Both phases rest on the explicitly declared average squared-gradient smoothness and average mean-cubed-Hessian smoothness conditions together with a standard inexact cubic-subproblem certificate; these are independent modeling assumptions, not quantities defined inside the proof or fitted to the target bound. No equation equates the final complexity expression to a parameter that was itself chosen from the same expression, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work by the same authors. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 3 axioms · 0 invented entities

The central claim rests on two averaged smoothness assumptions that replace stronger trajectory-wise bounds, plus an inexact cubic-subproblem certificate whose translation into a true second-order guarantee is part of the analysis.

free parameters (2)

geometric regularization decrease factor
The terminal phase decreases the regularization level geometrically; the specific rate is a design choice that balances stage length and accuracy in the complexity calculation.
stage-length shortening schedule
Stage lengths are shortened while mini-batch sizes increase reciprocally; these schedules are chosen to achieve the final Õ(n^{1/2} ε^{-3/2}) term.

axioms (3)

domain assumption average squared gradient smoothness
Invoked to control variance of the stochastic gradient estimator without requiring per-trajectory bounds.
domain assumption average mean-cubed Hessian smoothness
Used to bound the cubic term in the model while avoiding the stronger trajectory-wise Hessian boundedness assumption common in prior cubic Newton analyses.
domain assumption standard inexact cubic-subproblem certificate
Assumed to hold for the inner solver; the analysis must show this certificate implies the desired second-order stationarity.

pith-pipeline@v0.9.0 · 5783 in / 1823 out tokens · 42358 ms · 2026-05-18T08:19:01.376638+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Under average squared gradient smoothness and average mean-cubed Hessian smoothness... the method returns an (ε, √(L₂ε))-second-order stationary point with total finite-sum oracle complexity n + Õ(n^{1/2} ε^{-3/2}).
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We work under average squared gradient smoothness and average mean-cubed Hessian smoothness, thereby avoiding the trajectory-wise Hessian boundedness assumption

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1]

Agafonov, D

A. Agafonov, D. Kamzolov, A. Gasnikov, A. Kavis, K. Antonakopoulos, V. Cevher, and M. Tak´ aˇ c. Advancing the lower bounds: an accel- erated, stochastic, second-order method with op- timal adaptation to inexactness.arXiv preprint arXiv:2309.01570, 2023

work page arXiv 2023
[2]

Cartis, N

C. Cartis, N. I. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results.Mathematical Programming, 127(2):245–295, 2011

work page 2011
[3]

E. M. Chayti, N. Doikov, and M. Jaggi. Unified convergence theory of stochastic and variance- reduced cubic newton methods.arXiv preprint arXiv:2302.11962, 2023

work page arXiv 2023
[4]

E. M. Chayti, N. Doikov, and M. Jaggi. Im- proving stochastic cubic newton with momentum. arXiv preprint arXiv:2410.19644, 2024

work page arXiv 2024
[5]

Derezi´ nski

M. Derezi´ nski. Stochastic variance-reduced new- ton: Accelerating finite-sum minimization with large batches.arXiv preprint arXiv:2206.02702, 2022

work page arXiv 2022
[6]

C. Fang, C. J. Li, Z. Lin, and T. Zhang. Spi- der: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in neural information processing sys- tems, 31, 2018

work page 2018
[7]

Kamzolov, D

D. Kamzolov, D. Pasechnyuk, A. Agafonov, A. Gasnikov, and M. Tak´ aˇ c. Optami: Global Cubic Newton + Variance Reduction + Momentum + Quadratic Regularization superlinear convergence of high-order methods. arXiv preprint arXiv:2410.04083, 2024

work page arXiv 2024
[8]

Nesterov and B

Y. Nesterov and B. T. Polyak. Cubic regular- ization of newton method and its global perfor- mance.Mathematical programming, 108(1):177– 205, 2006

work page 2006
[9]

L. M. Nguyen, J. Liu, K. Scheinberg, and M. Tak´ aˇ c. Sarah: A novel method for machine learning problems using stochastic recursive gra- dient. InInternational conference on machine learning, pages 2613–2621. PMLR, 2017

work page 2017
[10]

C. Sun, Z. Zhang, and S. Yang. A variance- reduced cubic-regularized newton for policy opti- mization.arXiv preprint arXiv:2507.10120, 2025

work page arXiv 2025
[11]

Z. Wang, Y. Zhou, Y. Liang, and G. Lan. Stochas- tic variance-reduced cubic regularization for non- convex optimization. InThe 22nd International Conference on Artificial Intelligence and Statis- tics, pages 2731–2740. PMLR, 2019

work page 2019
[12]

Y. Yang, C. He, X. Wang, and Z. Peng. Faster stochastic cubic regularized newton methods with momentum.arXiv preprint arXiv:2507.13003, 2025

work page arXiv 2025
[13]

Zhang, L

J. Zhang, L. Xiao, and S. Zhang. Adaptive stochastic variance reduction for subsampled new- ton method with cubic regularization.INFORMS Journal on Optimization, 4(1):45–64, 2022

work page 2022
[14]

Zhou and Q

D. Zhou and Q. Gu. Stochastic recursive variance- reduced cubic regularization methods. InInter- national conference on artificial intelligence and statistics, pages 3980–3990. PMLR, 2020

work page 2020
[15]

D. Zhou, P. Xu, and Q. Gu. Stochastic variance- reduced cubic regularization methods.Journal of Machine Learning Research, 20(134):1–47, 2019

work page 2019

[1] [1]

Agafonov, D

A. Agafonov, D. Kamzolov, A. Gasnikov, A. Kavis, K. Antonakopoulos, V. Cevher, and M. Tak´ aˇ c. Advancing the lower bounds: an accel- erated, stochastic, second-order method with op- timal adaptation to inexactness.arXiv preprint arXiv:2309.01570, 2023

work page arXiv 2023

[2] [2]

Cartis, N

C. Cartis, N. I. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results.Mathematical Programming, 127(2):245–295, 2011

work page 2011

[3] [3]

E. M. Chayti, N. Doikov, and M. Jaggi. Unified convergence theory of stochastic and variance- reduced cubic newton methods.arXiv preprint arXiv:2302.11962, 2023

work page arXiv 2023

[4] [4]

E. M. Chayti, N. Doikov, and M. Jaggi. Im- proving stochastic cubic newton with momentum. arXiv preprint arXiv:2410.19644, 2024

work page arXiv 2024

[5] [5]

Derezi´ nski

M. Derezi´ nski. Stochastic variance-reduced new- ton: Accelerating finite-sum minimization with large batches.arXiv preprint arXiv:2206.02702, 2022

work page arXiv 2022

[6] [6]

C. Fang, C. J. Li, Z. Lin, and T. Zhang. Spi- der: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in neural information processing sys- tems, 31, 2018

work page 2018

[7] [7]

Kamzolov, D

D. Kamzolov, D. Pasechnyuk, A. Agafonov, A. Gasnikov, and M. Tak´ aˇ c. Optami: Global Cubic Newton + Variance Reduction + Momentum + Quadratic Regularization superlinear convergence of high-order methods. arXiv preprint arXiv:2410.04083, 2024

work page arXiv 2024

[8] [8]

Nesterov and B

Y. Nesterov and B. T. Polyak. Cubic regular- ization of newton method and its global perfor- mance.Mathematical programming, 108(1):177– 205, 2006

work page 2006

[9] [9]

L. M. Nguyen, J. Liu, K. Scheinberg, and M. Tak´ aˇ c. Sarah: A novel method for machine learning problems using stochastic recursive gra- dient. InInternational conference on machine learning, pages 2613–2621. PMLR, 2017

work page 2017

[10] [10]

C. Sun, Z. Zhang, and S. Yang. A variance- reduced cubic-regularized newton for policy opti- mization.arXiv preprint arXiv:2507.10120, 2025

work page arXiv 2025

[11] [11]

Z. Wang, Y. Zhou, Y. Liang, and G. Lan. Stochas- tic variance-reduced cubic regularization for non- convex optimization. InThe 22nd International Conference on Artificial Intelligence and Statis- tics, pages 2731–2740. PMLR, 2019

work page 2019

[12] [12]

Y. Yang, C. He, X. Wang, and Z. Peng. Faster stochastic cubic regularized newton methods with momentum.arXiv preprint arXiv:2507.13003, 2025

work page arXiv 2025

[13] [13]

Zhang, L

J. Zhang, L. Xiao, and S. Zhang. Adaptive stochastic variance reduction for subsampled new- ton method with cubic regularization.INFORMS Journal on Optimization, 4(1):45–64, 2022

work page 2022

[14] [14]

Zhou and Q

D. Zhou and Q. Gu. Stochastic recursive variance- reduced cubic regularization methods. InInter- national conference on artificial intelligence and statistics, pages 3980–3990. PMLR, 2020

work page 2020

[15] [15]

D. Zhou, P. Xu, and Q. Gu. Stochastic variance- reduced cubic regularization methods.Journal of Machine Learning Research, 20(134):1–47, 2019

work page 2019