Non-Convex Self-Concordant Functions: Practical Algorithms and Complexity Analysis

Donald Goldfarb; Jiayu Zhang; Lexiao Lai; Tianyi Lin

arxiv: 2511.15019 · v2 · submitted 2025-11-19 · 🧮 math.OC

Non-Convex Self-Concordant Functions: Practical Algorithms and Complexity Analysis

Donald Goldfarb , Lexiao Lai , Tianyi Lin , Jiayu Zhang This is my paper

Pith reviewed 2026-05-17 21:21 UTC · model grok-4.3

classification 🧮 math.OC

keywords non-convex optimizationself-concordant functionsNewton methodregularizationconvergence analysisfirst-order stationary pointssecond-order stationary pointslarge-scale optimization

0 comments

The pith

Non-convex self-concordant functions let regularized Newton methods reach first-order stationary points in O(ε^{-2}) iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines two new classes of functions—weakly self-concordant and F-based self-concordant—that extend the self-concordance property to non-convex problems. It develops a regularized Newton method and an adaptive regularization method that carry global convergence guarantees for these classes. The methods reach an ε-approximate first-order stationary point in O(ε^{-2}) iterations. They operate without assuming Lipschitz continuity of the gradient or Hessian. Experiments indicate the approach is more robust and efficient than cubic regularization or trust-region methods on large-scale and neural-network problems.

Core claim

What carries the argument

The weakly self-concordant and F-based self-concordant function classes, which support regularized Newton and adaptive regularization algorithms with global convergence.

If this is right

The algorithms converge globally to an ε-approximate first-order stationary point in O(ε^{-2}) iterations for qualifying non-convex functions.
Equipped with a negative-curvature oracle, the adaptive method reaches an approximate second-order stationary point.
The methods avoid any Lipschitz continuity requirement on the gradient or Hessian.
Numerical tests show greater robustness and speed than cubic regularization and trust-region approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regularization idea could be tested on other non-convex second-order schemes to see whether similar iteration bounds appear.
Because the classes do not require Lipschitz conditions, they may simplify analysis of training dynamics in neural networks that exhibit non-convex loss surfaces.
Extensions to constrained problems or stochastic variants could be explored by adapting the regularization step while preserving the self-concordance property.

Load-bearing premise

The objective function must belong to one of the newly defined weakly self-concordant or F-based self-concordant classes.

What would settle it

A concrete function that meets the weakly self-concordant or F-based self-concordant definition yet requires more than O(ε^{-2}) iterations of the proposed method to reach an ε-approximate first-order stationary point.

Figures

Figures reproduced from arXiv: 2511.15019 by Donald Goldfarb, Jiayu Zhang, Lexiao Lai, Tianyi Lin.

read the original abstract

We extend the standard notion of self-concordance to non-convex optimization and develop a family of second-order algorithms with global convergence guarantees. In particular, two function classes -- \textit{weakly self-concordant} functions and \textit{$F$-based self-concordant} functions -- generalize the self-concordant framework beyond convexity, without assuming the Lipschitz continuity of the gradient or Hessian. For these function classes, we propose a regularized Newton method and an adaptive regularization method that achieve an $\epsilon$-approximate first-order stationary point in $O(\epsilon^{-2})$ iterations. Equipped with an oracle capable of detecting negative curvature, the adaptive algorithm can further attain convergence to an approximate second-order stationary point. Our experimental results demonstrate that the proposed methods offer superior robustness and computational efficiency compared to cubic regularization and trust-region approaches, underscoring the broad potential of self-concordant regularization for large-scale and neural network optimization problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper defines weakly self-concordant and F-based self-concordant classes to extend self-concordance analysis to non-convex problems and gives regularized Newton methods with O(ε^{-2}) rates to first-order points without any Lipschitz assumptions.

read the letter

The main point is that the authors drop convexity from self-concordance while keeping enough structure for global convergence of second-order methods. They introduce two new function classes and show that regularized Newton and an adaptive regularization scheme reach an ε-approximate first-order stationary point in O(ε^{-2}) iterations. With a negative-curvature oracle the adaptive method also handles approximate second-order points. The analysis relies on a potential function built directly from the new inequalities and avoids hidden Lipschitz or boundedness constants.

Referee Report

0 major / 2 minor

Summary. The paper extends self-concordance beyond convexity by introducing the classes of weakly self-concordant functions and F-based self-concordant functions. It proposes a regularized Newton method and an adaptive regularization method that achieve an ε-approximate first-order stationary point in O(ε^{-2}) iterations for these classes, without assuming Lipschitz continuity of the gradient or Hessian. Equipped with a negative-curvature oracle the adaptive method further reaches approximate second-order stationarity. Experiments indicate superior robustness and efficiency relative to cubic regularization and trust-region methods on large-scale and neural-network problems.

Significance. If the definitions and potential-function arguments hold, the work supplies a new non-convex generalization of self-concordant analysis that removes standard Lipschitz assumptions while retaining global convergence guarantees and practical second-order algorithms. The O(ε^{-2}) first-order complexity is competitive, and the explicit step-size rules derived directly from the self-concordance inequalities constitute a clean technical contribution with potential applicability to neural-network training.

minor comments (2)

The experimental section would benefit from explicit reporting of the neural-network architectures, dataset sizes, and hyper-parameter selection protocol used to generate the robustness and timing comparisons.
Notation for the auxiliary function F in the F-based self-concordant definition should be introduced once in the main text rather than only in the appendix.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive report, the recognition of the technical contribution, and the recommendation for minor revision. We are pleased that the significance of extending self-concordance to the non-convex setting without Lipschitz assumptions was appreciated.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines new function classes (weakly self-concordant and F-based self-concordant) and derives algorithm convergence rates directly from the inequalities and potential-function arguments supplied in those definitions. The O(ε^{-2}) iteration bounds for first-order stationarity (and second-order with negative-curvature oracle) follow from the stated step-size rules and regularization without reducing to any fitted parameter, self-citation chain, or imported uniqueness theorem. The analysis is self-contained against the paper's own assumptions and does not invoke external Lipschitz constants or prior results by the same authors as load-bearing premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claims rest primarily on the assumption that functions satisfy the newly defined weakly self-concordant or F-based self-concordant properties, which are introduced in the paper without independent external verification.

axioms (1)

domain assumption Objective functions satisfy the properties of weakly self-concordant or F-based self-concordant functions as defined.
This is the key generalization enabling convergence analysis without convexity or Lipschitz continuity.

invented entities (2)

weakly self-concordant functions no independent evidence
purpose: Generalize self-concordance to non-convex optimization without convexity assumption
New function class introduced to support the algorithms and analysis.
F-based self-concordant functions no independent evidence
purpose: Provide an alternative generalization using a function F for non-convex cases
New function class defined in the paper.

pith-pipeline@v0.9.0 · 5469 in / 1382 out tokens · 46575 ms · 2026-05-17T21:21:48.896916+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We extend the standard notion of self-concordance to non-convex optimization... two function classes – weakly self-concordant functions and F-based self-concordant functions – generalize the self-concordant framework beyond convexity, without assuming the Lipschitz continuity of the gradient or Hessian.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_high_calibrated_iff unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

f(x+td)≤f(x)−ρt+κ^{-2}ω^*(κt√(δ+Δ))

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 1 internal anchor

[1]

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

N. Abreu, N. Vyas, S. Kakade, and D. Morwani. The potential of second-order optimization for LLMs: A study with full Gauss-Newton.ArXiv Preprint: 2510.09378, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Absil, R

P.-A. Absil, R. Mahony, and B. Andrews. Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547, 2005

work page 2005
[3]

Agarwal, Z

N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma. Finding approximate local minima faster than gradient descent. InSTOC, pages 1195–1199, 2017

work page 2017
[4]

F. Bach. Self-concordant analysis for logistic regression.Electronic Journal of Statistics, 4:384–414, 2010

work page 2010
[5]

F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression.The Journal of Machine Learning Research, 15(1):595–627, 2014

work page 2014
[6]

T. J. Boerner, S. Deems, T. R. Furlani, S. L. Knuth, and J. Towns. Access: Advancing innovation: NSF’s advanced cyberinfrastructure coordination ecosystem: Services & support. InPractice and Experience in Advanced Research Computing 2023: Computing for the Common Good, pages 173–176. Association for Computing Machinery (ACM), 2023

work page 2023
[7]

Carderera, M

A. Carderera, M. Besançon, and S. Pokutta. Simple steps are all you need: Frank-Wolfe and generalized self-concordant functions. InNeurIPS, pages 5390–5401, 2021

work page 2021
[8]

Carmon, J

Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018

work page 2018
[9]

Cartis, N

C. Cartis, N. I. M. Gould, and P. L. Toint. On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems.SIAM Journal on Optimization, 20(6):2833–2852, 2010

work page 2010
[10]

Cartis, N

C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results.Mathematical Programming, 127(2):245–295, 2011

work page 2011
[11]

Cartis, N

C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function-and derivative-evaluation complexity.Mathematical Programming, 130(2):295–319, 2011

work page 2011
[12]

Cartis, N

C. Cartis, N. I. M. Gould, and P. L. Toint.Evaluation Complexity of Algorithms for Nonconvex Optimization: Theory, Computation and Perspectives. SIAM, 2022

work page 2022
[13]

A. R. Conn, N. I. M. Gould, and P. L. Toint.Trust Region Methods. SIAM, 2000

work page 2000
[14]

F. E. Curtis, Z. Lubberts, and D. P. Robinson. Concise complexity analyses for trust region methods. Optimization Letters, 12(8):1713–1724, 2018

work page 2018
[15]

F. E. Curtis, D. P. Robinson, C. W. Royer, and S. J. Wright. Trust-region Newton-CG with strong second-order complexity guarantees for nonconvex optimization.SIAM Journal on Optimization, 31(1):518–544, 2021. 23

work page 2021
[16]

F. E. Curtis, D. P. Robinson, and M. Samadi. A trust region algorithm with a worst-case iteration complexity ofO(ϵ −3/2)for nonconvex optimization.Mathematical Programming, 162:1–32, 2017

work page 2017
[17]

F. E. Curtis and Q. Wang. Worst-case complexity of trace with inexact subproblem solutions for nonconvex smooth optimization.SIAM Journal on Optimization, 33(3):2191–2221, 2023

work page 2023
[18]

Davis and D

D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019

work page 2019
[19]

N. Doikov. Minimizing quasi-self-concordant functions by gradient regularization of Newton method. Mathematical Programming, pages 1–39, 2025

work page 2025
[20]

Doikov and Y

N. Doikov and Y. Nesterov. Gradient regularization of Newton method with Bregman distances. Mathematical Programming, 204(1):1–25, 2024

work page 2024
[21]

Dussault, T

J.-P. Dussault, T. Migot, and D. Orban. Scalable adaptive cubic regularization methods.Mathe- matical Programming, 207(1):191–225, 2024

work page 2024
[22]

Dvurechensky, P

P. Dvurechensky, P. Ostroukhov, K. Safin, S. Shtern, and M. Staudigl. Self-concordant analysis of Frank-Wolfe algorithms. InICML, pages 2814–2824. PMLR, 2020

work page 2020
[23]

Dvurechensky, K

P. Dvurechensky, K. Safin, S. Shtern, and M. Staudigl. Generalized self-concordant analysis of Frank-Wolfe algorithms.Mathematical Programming, 198(1):255–323, 2023

work page 2023
[24]

Gao and D

W. Gao and D. Goldfarb. Quasi-Newton methods: Superlinear convergence without line searches for self-concordant functions.Optimization Methods and Software, 34(1):194–217, 2019

work page 2019
[25]

Goldfarb

D. Goldfarb. The use of negative curvature in minimization algorithms. Technical report, Cornell University, 1980

work page 1980
[26]

Gratton, S

S. Gratton, S. Jerad, and P. L. Toint. Yet another fast variant of Newton’s method for nonconvex optimization.IMA Journal of Numerical Analysis, 45(2):971–1008, 2025

work page 2025
[27]

Hanzely, D

S. Hanzely, D. Kamzolov, D. Pasechnyuk, A. Gasnikov, P. Richtárik, and M. Takáč. A damped Newton method achieves globalo(1/k2)and local quadratic convergence rate. InNeurIPS, pages 25320–25334, 2022

work page 2022
[28]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InCVPR, pages 770–778, 2016

work page 2016
[29]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. InICLR, 2015

work page 2015
[30]

Krizhevsky

A. Krizhevsky. Learning multiple layers of features from tiny images.Master’s Thesis, University of Toronto, 2009

work page 2009
[31]

Kuczyński and H

J. Kuczyński and H. Woźniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start.SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992

work page 1992
[32]

Lee and S

D. Lee and S. Seung. Algorithms for non-negative matrix factorization. InNeurIPS, pages 535–541, 2000

work page 2000
[33]

Lee, S.-Y

J. Lee, S.-Y. Yun, and K.-S. Jun. Improved regret bounds of (multinomial) logistic bandits via regret-to-confidence-set conversion. InAISTATS, pages 4474–4482. PMLR, 2024. 24

work page 2024
[34]

Łojasiewicz

S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels.Les équations aux dérivées partielles, 117(87-89), 1963

work page 1963
[35]

Z. Lu. Randomized block proximal damped Newton method for composite self-concordant mini- mization.SIAM Journal on Optimization, 27(3):1910–1942, 2017

work page 1910
[36]

Mairal, F

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In NeurIPS, pages 1033–1040, 2008

work page 2008
[37]

J. Martens. New insights and perspectives on the natural gradient method.The Journal of Machine Learning Research, 21(146):1–76, 2020

work page 2020
[38]

Martens and R

J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. InICML, pages 2408–2417. PMLR, 2015

work page 2015
[39]

Martens and I

J. Martens and I. Sutskever. Training deep and recurrent networks with Hessian-free optimization. InNeural Networks: Tricks of the Trade: Second Edition, pages 479–535. Springer, 2012

work page 2012
[40]

G. P. McCormick. A modification of Armijo’s step-size rule for negative curvature.Mathematical Programming, 13(1):111–115, 1977

work page 1977
[41]

Mishchenko

K. Mishchenko. Regularized Newton method with globalO(1/k2) convergence.SIAM Journal on Optimization, 33(3):1440–1462, 2023

work page 2023
[42]

J. J. Moré and D. C. Sorensen. On the use of directions of negative curvature in a modified Newton method.Mathematical Programming, 16(1):1–20, 1979

work page 1979
[43]

J.-J. Moreau. Proximité et dualité dans un espace hilbertien.Bulletin de la Société mathématique de France, 93:273–299, 1965

work page 1965
[44]

Nesterov

Y. Nesterov. Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1):159–181, 2008

work page 2008
[45]

Nesterov.Lectures on Convex Optimization, volume 137

Y. Nesterov.Lectures on Convex Optimization, volume 137. Springer, 2018

work page 2018
[46]

Nesterov and A

Y. Nesterov and A. Nemirovskii.Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994

work page 1994
[47]

Nesterov and B

Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006

work page 2006
[48]

Nocedal and S

J. Nocedal and S. J. Wright.Numerical Optimization. Springer, 2006

work page 2006
[49]

Pascanu and Y

R. Pascanu and Y. Bengio. Revisiting natural gradient for deep networks. InICLR, 2014

work page 2014
[50]

B. T. Polyak. Gradient methods for solving equations and inequalities.USSR Computational Mathematics and Mathematical Physics, 4(6):17–32, 1964

work page 1964
[51]

R. T. Rockafellar. Monotone operators and the proximal point algorithm.SIAM journal on control and optimization, 14(5):877–898, 1976

work page 1976
[52]

Rodomanov and Y

A. Rodomanov and Y. Nesterov. Greedy quasi-Newton methods with explicit superlinear convergence. SIAM Journal on Optimization, 31(1):785–811, 2021. 25

work page 2021
[53]

C. W. Royer, M. O’Neill, and S. J. Wright. A Newton-CG algorithm with complexity guarantees for smooth unconstrained optimization.Mathematical Programming, 180(1):451–488, 2020

work page 2020
[54]

C. W. Royer and S. J. Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization.SIAM Journal on Optimization, 28(2):1448–1477, 2018

work page 2018
[55]

Schmidt, G

M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. InECML, pages 286–297. Springer, 2007

work page 2007
[56]

Simonyan and A

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. InICLR, 2015

work page 2015
[57]

Strande, H

S. Strande, H. Cai, M. Tatineni, W. Pfeiffer, C. Irving, A. Majumdar, D. Mishin, R. Sinkovits, M. Norman, N. Wolter, et al. Expanse: Computing without boundaries: Architecture, deployment, and early operations experiences of a supercomputer designed for the rapid evolution in science and engineering. InPractice and Experience in Advanced Research Computin...

work page 2021
[58]

J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval.Foundations of Computational Mathematics, 18(5):1131–1198, 2018

work page 2018
[59]

Tang, K.-C

T. Tang, K.-C. Toh, N. Xiao, and Y. Ye. A Riemannian dimension-reduced second-order method with application in sensor network localization.SIAM Journal on Scientific Computing, 46(3):A2025– A2046, 2024

work page 2024
[60]

M. J. Todd.Minimum-Volume Ellipsoids: Theory and Algorithms. SIAM, 2016

work page 2016
[61]

Tran-Dinh, A

Q. Tran-Dinh, A. Kyrillidis, and V. Cevher. Composite self-concordant minimization.The Journal of Machine Learning Research, 16(1):371–416, 2015

work page 2015
[62]

Tran-Dinh, T

Q. Tran-Dinh, T. Sun, and S. Lu. Self-concordant inclusions: A unified framework for path-following generalized Newton-type algorithms.Mathematical Programming, 177(1):173–223, 2019

work page 2019
[63]

Virtanen, R

P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python.Nature Methods, 17(3):261–272, 2020

work page 2020
[64]

Zagoruyko and N

S. Zagoruyko and N. Komodakis. Wide residual networks. InBMVC, pages 87–1. British Machine Vision Association, 2016

work page 2016
[65]

Zhang, C

G. Zhang, C. Wang, B. Xu, and R. Grosse. Three mechanisms of weight decay regularization. In ICLR, 2019

work page 2019
[66]

Zhang, C

Y. Zhang, C. Chen, T. Ding, Z. Li, R. Sun, and Z.-Q. Luo. Why transformers need ADAM: A Hessian perspective. InNeurIPS, pages 131786–131823, 2024

work page 2024
[67]

Zhang and X

Y. Zhang and X. Lin. DISCO: Distributed optimization for self-concordant empirical loss. InICML, pages 362–370. PMLR, 2015

work page 2015
[68]

Zhang and M

Y.-J. Zhang and M. Sugiyama. Online (multinomial) logistic bandit: Improved regret and constant computation cost. InNeurIPS, pages 29741–29782, 2023. 26

work page 2023

[1] [1]

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

N. Abreu, N. Vyas, S. Kakade, and D. Morwani. The potential of second-order optimization for LLMs: A study with full Gauss-Newton.ArXiv Preprint: 2510.09378, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Absil, R

P.-A. Absil, R. Mahony, and B. Andrews. Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547, 2005

work page 2005

[3] [3]

Agarwal, Z

N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma. Finding approximate local minima faster than gradient descent. InSTOC, pages 1195–1199, 2017

work page 2017

[4] [4]

F. Bach. Self-concordant analysis for logistic regression.Electronic Journal of Statistics, 4:384–414, 2010

work page 2010

[5] [5]

F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression.The Journal of Machine Learning Research, 15(1):595–627, 2014

work page 2014

[6] [6]

T. J. Boerner, S. Deems, T. R. Furlani, S. L. Knuth, and J. Towns. Access: Advancing innovation: NSF’s advanced cyberinfrastructure coordination ecosystem: Services & support. InPractice and Experience in Advanced Research Computing 2023: Computing for the Common Good, pages 173–176. Association for Computing Machinery (ACM), 2023

work page 2023

[7] [7]

Carderera, M

A. Carderera, M. Besançon, and S. Pokutta. Simple steps are all you need: Frank-Wolfe and generalized self-concordant functions. InNeurIPS, pages 5390–5401, 2021

work page 2021

[8] [8]

Carmon, J

Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018

work page 2018

[9] [9]

Cartis, N

C. Cartis, N. I. M. Gould, and P. L. Toint. On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems.SIAM Journal on Optimization, 20(6):2833–2852, 2010

work page 2010

[10] [10]

Cartis, N

C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results.Mathematical Programming, 127(2):245–295, 2011

work page 2011

[11] [11]

Cartis, N

C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function-and derivative-evaluation complexity.Mathematical Programming, 130(2):295–319, 2011

work page 2011

[12] [12]

Cartis, N

C. Cartis, N. I. M. Gould, and P. L. Toint.Evaluation Complexity of Algorithms for Nonconvex Optimization: Theory, Computation and Perspectives. SIAM, 2022

work page 2022

[13] [13]

A. R. Conn, N. I. M. Gould, and P. L. Toint.Trust Region Methods. SIAM, 2000

work page 2000

[14] [14]

F. E. Curtis, Z. Lubberts, and D. P. Robinson. Concise complexity analyses for trust region methods. Optimization Letters, 12(8):1713–1724, 2018

work page 2018

[15] [15]

F. E. Curtis, D. P. Robinson, C. W. Royer, and S. J. Wright. Trust-region Newton-CG with strong second-order complexity guarantees for nonconvex optimization.SIAM Journal on Optimization, 31(1):518–544, 2021. 23

work page 2021

[16] [16]

F. E. Curtis, D. P. Robinson, and M. Samadi. A trust region algorithm with a worst-case iteration complexity ofO(ϵ −3/2)for nonconvex optimization.Mathematical Programming, 162:1–32, 2017

work page 2017

[17] [17]

F. E. Curtis and Q. Wang. Worst-case complexity of trace with inexact subproblem solutions for nonconvex smooth optimization.SIAM Journal on Optimization, 33(3):2191–2221, 2023

work page 2023

[18] [18]

Davis and D

D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019

work page 2019

[19] [19]

N. Doikov. Minimizing quasi-self-concordant functions by gradient regularization of Newton method. Mathematical Programming, pages 1–39, 2025

work page 2025

[20] [20]

Doikov and Y

N. Doikov and Y. Nesterov. Gradient regularization of Newton method with Bregman distances. Mathematical Programming, 204(1):1–25, 2024

work page 2024

[21] [21]

Dussault, T

J.-P. Dussault, T. Migot, and D. Orban. Scalable adaptive cubic regularization methods.Mathe- matical Programming, 207(1):191–225, 2024

work page 2024

[22] [22]

Dvurechensky, P

P. Dvurechensky, P. Ostroukhov, K. Safin, S. Shtern, and M. Staudigl. Self-concordant analysis of Frank-Wolfe algorithms. InICML, pages 2814–2824. PMLR, 2020

work page 2020

[23] [23]

Dvurechensky, K

P. Dvurechensky, K. Safin, S. Shtern, and M. Staudigl. Generalized self-concordant analysis of Frank-Wolfe algorithms.Mathematical Programming, 198(1):255–323, 2023

work page 2023

[24] [24]

Gao and D

W. Gao and D. Goldfarb. Quasi-Newton methods: Superlinear convergence without line searches for self-concordant functions.Optimization Methods and Software, 34(1):194–217, 2019

work page 2019

[25] [25]

Goldfarb

D. Goldfarb. The use of negative curvature in minimization algorithms. Technical report, Cornell University, 1980

work page 1980

[26] [26]

Gratton, S

S. Gratton, S. Jerad, and P. L. Toint. Yet another fast variant of Newton’s method for nonconvex optimization.IMA Journal of Numerical Analysis, 45(2):971–1008, 2025

work page 2025

[27] [27]

Hanzely, D

S. Hanzely, D. Kamzolov, D. Pasechnyuk, A. Gasnikov, P. Richtárik, and M. Takáč. A damped Newton method achieves globalo(1/k2)and local quadratic convergence rate. InNeurIPS, pages 25320–25334, 2022

work page 2022

[28] [28]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InCVPR, pages 770–778, 2016

work page 2016

[29] [29]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. InICLR, 2015

work page 2015

[30] [30]

Krizhevsky

A. Krizhevsky. Learning multiple layers of features from tiny images.Master’s Thesis, University of Toronto, 2009

work page 2009

[31] [31]

Kuczyński and H

J. Kuczyński and H. Woźniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start.SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992

work page 1992

[32] [32]

Lee and S

D. Lee and S. Seung. Algorithms for non-negative matrix factorization. InNeurIPS, pages 535–541, 2000

work page 2000

[33] [33]

Lee, S.-Y

J. Lee, S.-Y. Yun, and K.-S. Jun. Improved regret bounds of (multinomial) logistic bandits via regret-to-confidence-set conversion. InAISTATS, pages 4474–4482. PMLR, 2024. 24

work page 2024

[34] [34]

Łojasiewicz

S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels.Les équations aux dérivées partielles, 117(87-89), 1963

work page 1963

[35] [35]

Z. Lu. Randomized block proximal damped Newton method for composite self-concordant mini- mization.SIAM Journal on Optimization, 27(3):1910–1942, 2017

work page 1910

[36] [36]

Mairal, F

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In NeurIPS, pages 1033–1040, 2008

work page 2008

[37] [37]

J. Martens. New insights and perspectives on the natural gradient method.The Journal of Machine Learning Research, 21(146):1–76, 2020

work page 2020

[38] [38]

Martens and R

J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. InICML, pages 2408–2417. PMLR, 2015

work page 2015

[39] [39]

Martens and I

J. Martens and I. Sutskever. Training deep and recurrent networks with Hessian-free optimization. InNeural Networks: Tricks of the Trade: Second Edition, pages 479–535. Springer, 2012

work page 2012

[40] [40]

G. P. McCormick. A modification of Armijo’s step-size rule for negative curvature.Mathematical Programming, 13(1):111–115, 1977

work page 1977

[41] [41]

Mishchenko

K. Mishchenko. Regularized Newton method with globalO(1/k2) convergence.SIAM Journal on Optimization, 33(3):1440–1462, 2023

work page 2023

[42] [42]

J. J. Moré and D. C. Sorensen. On the use of directions of negative curvature in a modified Newton method.Mathematical Programming, 16(1):1–20, 1979

work page 1979

[43] [43]

J.-J. Moreau. Proximité et dualité dans un espace hilbertien.Bulletin de la Société mathématique de France, 93:273–299, 1965

work page 1965

[44] [44]

Nesterov

Y. Nesterov. Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1):159–181, 2008

work page 2008

[45] [45]

Nesterov.Lectures on Convex Optimization, volume 137

Y. Nesterov.Lectures on Convex Optimization, volume 137. Springer, 2018

work page 2018

[46] [46]

Nesterov and A

Y. Nesterov and A. Nemirovskii.Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994

work page 1994

[47] [47]

Nesterov and B

Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006

work page 2006

[48] [48]

Nocedal and S

J. Nocedal and S. J. Wright.Numerical Optimization. Springer, 2006

work page 2006

[49] [49]

Pascanu and Y

R. Pascanu and Y. Bengio. Revisiting natural gradient for deep networks. InICLR, 2014

work page 2014

[50] [50]

B. T. Polyak. Gradient methods for solving equations and inequalities.USSR Computational Mathematics and Mathematical Physics, 4(6):17–32, 1964

work page 1964

[51] [51]

R. T. Rockafellar. Monotone operators and the proximal point algorithm.SIAM journal on control and optimization, 14(5):877–898, 1976

work page 1976

[52] [52]

Rodomanov and Y

A. Rodomanov and Y. Nesterov. Greedy quasi-Newton methods with explicit superlinear convergence. SIAM Journal on Optimization, 31(1):785–811, 2021. 25

work page 2021

[53] [53]

C. W. Royer, M. O’Neill, and S. J. Wright. A Newton-CG algorithm with complexity guarantees for smooth unconstrained optimization.Mathematical Programming, 180(1):451–488, 2020

work page 2020

[54] [54]

C. W. Royer and S. J. Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization.SIAM Journal on Optimization, 28(2):1448–1477, 2018

work page 2018

[55] [55]

Schmidt, G

M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. InECML, pages 286–297. Springer, 2007

work page 2007

[56] [56]

Simonyan and A

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. InICLR, 2015

work page 2015

[57] [57]

Strande, H

S. Strande, H. Cai, M. Tatineni, W. Pfeiffer, C. Irving, A. Majumdar, D. Mishin, R. Sinkovits, M. Norman, N. Wolter, et al. Expanse: Computing without boundaries: Architecture, deployment, and early operations experiences of a supercomputer designed for the rapid evolution in science and engineering. InPractice and Experience in Advanced Research Computin...

work page 2021

[58] [58]

J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval.Foundations of Computational Mathematics, 18(5):1131–1198, 2018

work page 2018

[59] [59]

Tang, K.-C

T. Tang, K.-C. Toh, N. Xiao, and Y. Ye. A Riemannian dimension-reduced second-order method with application in sensor network localization.SIAM Journal on Scientific Computing, 46(3):A2025– A2046, 2024

work page 2024

[60] [60]

M. J. Todd.Minimum-Volume Ellipsoids: Theory and Algorithms. SIAM, 2016

work page 2016

[61] [61]

Tran-Dinh, A

Q. Tran-Dinh, A. Kyrillidis, and V. Cevher. Composite self-concordant minimization.The Journal of Machine Learning Research, 16(1):371–416, 2015

work page 2015

[62] [62]

Tran-Dinh, T

Q. Tran-Dinh, T. Sun, and S. Lu. Self-concordant inclusions: A unified framework for path-following generalized Newton-type algorithms.Mathematical Programming, 177(1):173–223, 2019

work page 2019

[63] [63]

Virtanen, R

P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python.Nature Methods, 17(3):261–272, 2020

work page 2020

[64] [64]

Zagoruyko and N

S. Zagoruyko and N. Komodakis. Wide residual networks. InBMVC, pages 87–1. British Machine Vision Association, 2016

work page 2016

[65] [65]

Zhang, C

G. Zhang, C. Wang, B. Xu, and R. Grosse. Three mechanisms of weight decay regularization. In ICLR, 2019

work page 2019

[66] [66]

Zhang, C

Y. Zhang, C. Chen, T. Ding, Z. Li, R. Sun, and Z.-Q. Luo. Why transformers need ADAM: A Hessian perspective. InNeurIPS, pages 131786–131823, 2024

work page 2024

[67] [67]

Zhang and X

Y. Zhang and X. Lin. DISCO: Distributed optimization for self-concordant empirical loss. InICML, pages 362–370. PMLR, 2015

work page 2015

[68] [68]

Zhang and M

Y.-J. Zhang and M. Sugiyama. Online (multinomial) logistic bandit: Improved regret and constant computation cost. InNeurIPS, pages 29741–29782, 2023. 26

work page 2023