Adaptive Regularization within Trust Region Methods for Stochastic Nonconvex Optimization
Pith reviewed 2026-05-10 10:25 UTC · model grok-4.3
The pith
Reg-ASTRO achieves almost sure Õ(ε^{-1.5}) iteration complexity for stochastic nonconvex optimization with noisy gradients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that Reg-ASTRO, which adds an adaptively regularized local model to the adaptive sampling trust-region optimization framework, attains an almost sure Õ(ε^{-1.5}) iteration complexity and Õ(ε^{-4.5}) sample complexity for reaching ε-stationary points in smooth nonconvex stochastic optimization. The sample complexity improves to Õ(ε^{-3.5}) under stronger regularity conditions together with common random numbers, outperforming first-order methods in both theory and numerical experiments.
What carries the argument
The adaptively regularized local model inside the ASTRO trust-region framework, which couples the choice of trust-region radius and regularization parameter so that both are set without advance knowledge of the gradient estimate at each iteration.
If this is right
- The total number of gradient samples required is substantially lower than for standard first-order stochastic methods under the same noise model.
- When common random numbers are employed and stronger regularity holds, the sample complexity bound improves to Õ(ε^{-3.5}).
- Adaptive sampling automatically raises accuracy requirements only for iterates already close to stationarity, limiting wasted computation on distant points.
- The framework applies to a broad class of problems because the noise model allows unbounded support and decision dependence.
Where Pith is reading between the lines
- The same coupling of adaptive regularization and radius selection could be tested inside quasi-Newton or limited-memory trust-region variants.
- The subexponential tail condition permits direct application to certain heavy-tailed gradient settings common in deep learning.
- Implementation in existing trust-region solvers would require only modest changes to the model-building step.
Load-bearing premise
The mean-zero stochastic noise is decision-dependent with unbounded support and subexponential tails, while the trust-region radius and regularization parameter remain coupled and are chosen without prior gradient estimation.
What would settle it
A numerical run or constructed counterexample in which the number of iterations needed to reach an ε-stationary point grows faster than Õ(ε^{-1.5}) with high probability, while satisfying the smoothness and subexponential noise assumptions, would falsify the almost-sure complexity guarantee.
read the original abstract
We propose a stochastic nonconvex optimization algorithm that achieves almost sure $\tilde{\mathcal{O}}(\epsilon^{-1.5})$ iteration complexity for problems with smooth objective functions and gradients only observable with noise. The mean-zero stochastic noise is decision-dependent and has unbounded support with subexponential tail, allowing our framework to cover a broad class of problems. The improved almost sure iteration complexity is achieved with a new variant of the adaptive sampling trust-region optimization (ASTRO) augmented with an adaptively regularized local model, which we term Reg-ASTRO. Adaptive sampling ensures that the estimation precision is aligned with a measure of stationarity, so that iterates closer to stationarity trigger higher accuracy requirement for sampling. A key analytical challenge arises because the trust-region radius and regularization are coupled and not determined prior to gradient estimation at each iteration. We further establish an almost sure $\tilde{\mathcal{O}}(\epsilon^{-4.5})$ sample complexity for Reg-ASTRO, which improves to $\tilde{\mathcal{O}}(\epsilon^{-3.5})$ under stronger regularity conditions and use of common random numbers, substantially outperforming first-order methods in theory and numerical experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Reg-ASTRO, an adaptive-sampling trust-region algorithm augmented with adaptive regularization for stochastic nonconvex optimization. Under smooth objectives and decision-dependent subexponential noise with unbounded support, it claims an almost-sure Õ(ε^{-1.5}) iteration complexity together with Õ(ε^{-4.5}) sample complexity (improving to Õ(ε^{-3.5}) under stronger regularity and common random numbers). The key technical device is adaptive sampling whose accuracy is tied to a stationarity measure, with the trust-region radius and regularization parameter chosen from the resulting gradient estimate.
Significance. If the almost-sure complexity claims are rigorously established, the work would strengthen the theoretical foundation for adaptive second-order methods in stochastic nonconvex settings and demonstrate concrete sample-complexity gains over first-order alternatives. The handling of decision-dependent noise with subexponential tails is a positive feature that broadens applicability.
major comments (2)
- [§4] §4 (Complexity Analysis), the proof of the Õ(ε^{-1.5}) a.s. iteration bound: the argument must explicitly close the dependency loop in which the stationarity measure (used to set gradient sampling accuracy) is itself a function of the trust-region radius Δ_k and regularization parameter, both of which are computed from the gradient estimate whose accuracy is being prescribed. If the resolution relies on an a-priori lower bound on Δ_k, a preliminary fixed-accuracy sample, or an implicit uniform noise bound not justified by the subexponential tail, the claimed rate does not follow for the stated algorithm.
- [Theorem 4.1] Theorem 4.1 (or equivalent statement of the main iteration complexity): the almost-sure bound is stated without an accompanying high-probability or expectation version that would allow direct comparison with existing stochastic trust-region analyses; the paper should clarify whether the a.s. result is obtained by Borel-Cantelli or by a direct supermartingale argument and whether the same argument yields the stated sample complexity.
minor comments (2)
- [§2] Notation for the adaptive regularization parameter and its relation to the trust-region radius should be introduced earlier (e.g., in §2 or §3) to improve readability of the algorithm description.
- [§5] The numerical experiments section would benefit from an explicit statement of the random-seed protocol and the number of independent runs used to generate the reported curves.
Simulated Author's Rebuttal
We thank the referee for the thorough review and valuable comments on our paper. We address each major comment below and have revised the manuscript accordingly to clarify the complexity analysis.
read point-by-point responses
-
Referee: [§4] §4 (Complexity Analysis), the proof of the Õ(ε^{-1.5}) a.s. iteration bound: the argument must explicitly close the dependency loop in which the stationarity measure (used to set gradient sampling accuracy) is itself a function of the trust-region radius Δ_k and regularization parameter, both of which are computed from the gradient estimate whose accuracy is being prescribed. If the resolution relies on an a-priori lower bound on Δ_k, a preliminary fixed-accuracy sample, or an implicit uniform noise bound not justified by the subexponential tail, the claimed rate does not follow for the stated algorithm.
Authors: We appreciate the referee pointing out the need to explicitly resolve this dependency. In our analysis, we do not rely on an a-priori lower bound or preliminary fixed-accuracy sampling that would alter the complexity. Instead, the proof proceeds by contradiction or by considering the possible cases for the gradient estimate error. Specifically, we use the subexponential concentration to bound the probability that the estimate deviates significantly, and show that on the event where the estimate is accurate enough (which happens almost surely by Borel-Cantelli), the chosen Δ_k and regularization are consistent with the stationarity measure. We have expanded Section 4 with a detailed explanation of this closure, including a new supporting lemma that demonstrates the consistency without additional assumptions on the noise beyond the stated subexponential tails. revision: yes
-
Referee: [Theorem 4.1] Theorem 4.1 (or equivalent statement of the main iteration complexity): the almost-sure bound is stated without an accompanying high-probability or expectation version that would allow direct comparison with existing stochastic trust-region analyses; the paper should clarify whether the a.s. result is obtained by Borel-Cantelli or by a direct supermartingale argument and whether the same argument yields the stated sample complexity.
Authors: We thank the referee for this suggestion. The almost-sure bound in Theorem 4.1 is derived using a combination of Borel-Cantelli lemma applied to the bad events of insufficient sampling accuracy and a supermartingale argument for the decrease in the objective function. This approach ensures almost sure convergence to stationarity with the stated rate. The sample complexity follows directly from summing the per-iteration sample counts, which are controlled by the stationarity measure and thus inherit the iteration complexity. To facilitate comparison, we have added a high-probability version of the complexity result as a new corollary in the revised manuscript. revision: yes
Circularity Check
Derivation self-contained; no reduction of complexity bounds to fitted quantities or self-citations
full rationale
The paper introduces Reg-ASTRO as a variant of ASTRO with adaptive regularization and sampling for stochastic nonconvex problems under decision-dependent subexponential noise. It explicitly flags the coupling of trust-region radius, regularization parameter, and gradient estimation accuracy as an analytical challenge in the abstract, then claims to derive the almost-sure Õ(ε^{-1.5}) iteration complexity and subsequent sample complexities from the algorithm's progress measures and noise tail properties. No equations or steps in the provided text reduce the target bounds to a prior fit, a self-defined quantity, or a load-bearing self-citation; the results are presented as following from direct analysis of the adaptive scheme. This is the expected non-finding for a paper whose central claims rest on explicit algorithmic construction and standard stochastic analysis tools rather than circular re-labeling of inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
SIAM Journal on Optimization 35(3), 2098– 2127 (2025)
Ha, Y., Shashaani, S., Pasupathy, R.: Complexity of zeroth-and first-order stochastic trust-region algorithms. SIAM Journal on Optimization 35(3), 2098– 2127 (2025)
work page 2098
-
[2]
Cartis, C., Gould, N.I., Toint, P.L.: Evaluation Complexity of Algorithms for Nonconvex Optimization: Theory, Computation and Perspectives. SIAM, ??? (2022)
work page 2022
-
[3]
Mathematical programming 108(1), 177–205 (2006)
Nesterov, Y., Polyak, B.T.: Cubic regularization of newton method and its global performance. Mathematical programming 108(1), 177–205 (2006)
work page 2006
-
[4]
part i: motivation, convergence and numerical results
Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming 127(2), 245–295 (2011)
work page 2011
-
[5]
Mathematical programming 204(1), 1–25 (2024)
Doikov, N., Nesterov, Y.: Gradient regularization of newton method with bregman distances. Mathematical programming 204(1), 1–25 (2024)
work page 2024
-
[6]
SIAM Journal on Optimization 33(3), 1440–1462 (2023)
Mishchenko, K.: Regularized newton method with global convergence. SIAM Journal on Optimization 33(3), 1440–1462 (2023)
work page 2023
-
[7]
IMA Journal of Numerical Analysis 45(2), 971–1008 (2025) 41
Gratton, S., Jerad, S., Toint, P.L.: Yet another fast variant of newton’s method for nonconvex optimization. IMA Journal of Numerical Analysis 45(2), 971–1008 (2025) 41
work page 2025
-
[8]
Mathematical Programming 162(1), 1–32 (2017)
Curtis, F.E., Robinson, D.P., Samadi, M.: A trust region algorithm with a worst- case iteration complexity of o( ε−3/2) for nonconvex optimization. Mathematical Programming 162(1), 1–32 (2017)
work page 2017
-
[9]
Journal of Scientific Computing 106(1), 28 (2026)
Jiang, Y., He, C., Zhang, C., Ge, D., Jiang, B., Ye, Y.: Beyond nonconvexity: A universal trust-region method with new analyses. Journal of Scientific Computing 106(1), 28 (2026)
work page 2026
-
[10]
Journal of Optimization Theory and applications 184(3), 953–971 (2020)
Park, S., Jung, S.H., Pardalos, P.M.: Combining stochastic adaptive cubic regularization with negative curvature for nonconvex optimization. Journal of Optimization Theory and applications 184(3), 953–971 (2020)
work page 2020
-
[11]
Advances in neural information processing systems 31 (2018)
Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization. Advances in neural information processing systems 31 (2018)
work page 2018
-
[12]
Scheinberg, K., Xie, M.: First-and second-order stochastic adaptive regularization with cubics: High probability iteration and sample complexity. arXiv e-prints, 2308 (2023)
work page 2023
-
[13]
Mathematical Program- ming 199(1), 165–214 (2023)
Arjevani, Y., Carmon, Y., Duchi, J.C., Foster, D.J., Srebro, N., Woodworth, B.: Lower bounds for non-convex stochastic optimization. Mathematical Program- ming 199(1), 165–214 (2023)
work page 2023
-
[14]
SIAM journal on optimization 23(4), 2341–2368 (2013)
Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM journal on optimization 23(4), 2341–2368 (2013)
work page 2013
-
[15]
In: International Conference on Machine Learning, pp
Liu, Z., Nguyen, T.D., Nguyen, T.H., Ene, A., Nguyen, H.: High probability con- vergence of stochastic gradient methods. In: International Conference on Machine Learning, pp. 21884–21914 (2023). PMLR
work page 2023
-
[16]
Journal of Machine Learning Research 25(241), 1–36 (2024)
Madden, L., Dall’Anese, E., Becker, S.: High probability convergence bounds for non-convex stochastic gradient descent with sub-weibull noise. Journal of Machine Learning Research 25(241), 1–36 (2024)
work page 2024
-
[17]
Advances in Neural Information Processing Systems 34, 20571–20582 (2021)
Levy, K., Kavis, A., Cevher, V.: Storm+: Fully adaptive sgd with recur- sive momentum for nonconvex optimization. Advances in Neural Information Processing Systems 34, 20571–20582 (2021)
work page 2021
-
[18]
Advances in neural information processing systems 31 (2018)
Fang, C., Li, C.J., Lin, Z., Zhang, T.: Spider: Near-optimal non-convex opti- mization via stochastic path-integrated differential estimator. Advances in neural information processing systems 31 (2018)
work page 2018
-
[19]
SIAM Journal on Optimization 34(2), 2067–2092 (2024) 42
Rinaldi, F., Vicente, L.N., Zeffiro, D.: Stochastic trust-region and direct-search methods: A weak tail bound condition and reduced sample sizing. SIAM Journal on Optimization 34(2), 2067–2092 (2024) 42
work page 2067
-
[20]
The Annals of Probability 27(1), 537–564 (1999)
Pena, V.H.: A general class of exponential inequalities for martingales and ratios. The Annals of Probability 27(1), 537–564 (1999)
work page 1999
-
[21]
Electronic journal of probability 20(none), 1–22 (2015)
Fan, X., Grama, I., Liu, Q.: Exponential inequalities for martingales with applications. Electronic journal of probability 20(none), 1–22 (2015)
work page 2015
-
[22]
The Annals of Statistics, 1779–1801 (1995)
Geer, S.: Exponential inequalities for martingales, with application to maximum likelihood estimation for counting processes. The Annals of Statistics, 1779–1801 (1995)
work page 1995
-
[23]
: Lectures on Convex Optimization vol
Nesterov, Y., et al. : Lectures on Convex Optimization vol. 137. Springer, Cham (2018)
work page 2018
-
[24]
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, New York (1999)
work page 1999
-
[25]
In: 2022 Winter Simulation Conference (WSC), pp
Ford, M.T., Henderson, S.G., Eckman, D.J.: Automatic differentiation for gradi- ent estimators in simulation. In: 2022 Winter Simulation Conference (WSC), pp. 3134–3145 (2022). IEEE
work page 2022
-
[26]
SIAM Journal on Optimization 28(4), 3145–3176 (2018)
Shashaani, S., Hashemi, F.S., Pasupathy, R.: Astro-df: A class of adaptive sam- pling trust-region algorithms for derivative-free stochastic optimization. SIAM Journal on Optimization 28(4), 3145–3176 (2018)
work page 2018
-
[27]
Adam: A Method for Stochastic Optimization
Kingma, D.P.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[28]
https:// github.com/simopt-admin/simopt 43
Eckman, D., Henderson, S., Shashaani, S., Pasupathy, R.: SimOpt. https:// github.com/simopt-admin/simopt 43
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.