pith. sign in

arxiv: 2511.15019 · v2 · submitted 2025-11-19 · 🧮 math.OC

Non-Convex Self-Concordant Functions: Practical Algorithms and Complexity Analysis

Pith reviewed 2026-05-17 21:21 UTC · model grok-4.3

classification 🧮 math.OC
keywords non-convex optimizationself-concordant functionsNewton methodregularizationconvergence analysisfirst-order stationary pointssecond-order stationary pointslarge-scale optimization
0
0 comments X

The pith

Non-convex self-concordant functions let regularized Newton methods reach first-order stationary points in O(ε^{-2}) iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines two new classes of functions—weakly self-concordant and F-based self-concordant—that extend the self-concordance property to non-convex problems. It develops a regularized Newton method and an adaptive regularization method that carry global convergence guarantees for these classes. The methods reach an ε-approximate first-order stationary point in O(ε^{-2}) iterations. They operate without assuming Lipschitz continuity of the gradient or Hessian. Experiments indicate the approach is more robust and efficient than cubic regularization or trust-region methods on large-scale and neural-network problems.

Core claim

We extend the standard notion of self-concordance to non-convex optimization and develop a family of second-order algorithms with global convergence guarantees. In particular, two function classes—weakly self-concordant functions and F-based self-concordant functions—generalize the self-concordant framework beyond convexity, without assuming the Lipschitz continuity of the gradient or Hessian. For these function classes, we propose a regularized Newton method and an adaptive regularization method that achieve an ε-approximate first-order stationary point in O(ε^{-2}) iterations.

What carries the argument

The weakly self-concordant and F-based self-concordant function classes, which support regularized Newton and adaptive regularization algorithms with global convergence.

If this is right

  • The algorithms converge globally to an ε-approximate first-order stationary point in O(ε^{-2}) iterations for qualifying non-convex functions.
  • Equipped with a negative-curvature oracle, the adaptive method reaches an approximate second-order stationary point.
  • The methods avoid any Lipschitz continuity requirement on the gradient or Hessian.
  • Numerical tests show greater robustness and speed than cubic regularization and trust-region approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regularization idea could be tested on other non-convex second-order schemes to see whether similar iteration bounds appear.
  • Because the classes do not require Lipschitz conditions, they may simplify analysis of training dynamics in neural networks that exhibit non-convex loss surfaces.
  • Extensions to constrained problems or stochastic variants could be explored by adapting the regularization step while preserving the self-concordance property.

Load-bearing premise

The objective function must belong to one of the newly defined weakly self-concordant or F-based self-concordant classes.

What would settle it

A concrete function that meets the weakly self-concordant or F-based self-concordant definition yet requires more than O(ε^{-2}) iterations of the proposed method to reach an ε-approximate first-order stationary point.

Figures

Figures reproduced from arXiv: 2511.15019 by Donald Goldfarb, Jiayu Zhang, Lexiao Lai, Tianyi Lin.

Figure 1
Figure 1. Figure 1: Performance comparison on NMF with MSE loss (left) and KL divergence (right). [PITH_FULL_IMAGE:figures/full_fig_p015_1.png] view at source ↗
read the original abstract

We extend the standard notion of self-concordance to non-convex optimization and develop a family of second-order algorithms with global convergence guarantees. In particular, two function classes -- \textit{weakly self-concordant} functions and \textit{$F$-based self-concordant} functions -- generalize the self-concordant framework beyond convexity, without assuming the Lipschitz continuity of the gradient or Hessian. For these function classes, we propose a regularized Newton method and an adaptive regularization method that achieve an $\epsilon$-approximate first-order stationary point in $O(\epsilon^{-2})$ iterations. Equipped with an oracle capable of detecting negative curvature, the adaptive algorithm can further attain convergence to an approximate second-order stationary point. Our experimental results demonstrate that the proposed methods offer superior robustness and computational efficiency compared to cubic regularization and trust-region approaches, underscoring the broad potential of self-concordant regularization for large-scale and neural network optimization problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper extends self-concordance beyond convexity by introducing the classes of weakly self-concordant functions and F-based self-concordant functions. It proposes a regularized Newton method and an adaptive regularization method that achieve an ε-approximate first-order stationary point in O(ε^{-2}) iterations for these classes, without assuming Lipschitz continuity of the gradient or Hessian. Equipped with a negative-curvature oracle the adaptive method further reaches approximate second-order stationarity. Experiments indicate superior robustness and efficiency relative to cubic regularization and trust-region methods on large-scale and neural-network problems.

Significance. If the definitions and potential-function arguments hold, the work supplies a new non-convex generalization of self-concordant analysis that removes standard Lipschitz assumptions while retaining global convergence guarantees and practical second-order algorithms. The O(ε^{-2}) first-order complexity is competitive, and the explicit step-size rules derived directly from the self-concordance inequalities constitute a clean technical contribution with potential applicability to neural-network training.

minor comments (2)
  1. The experimental section would benefit from explicit reporting of the neural-network architectures, dataset sizes, and hyper-parameter selection protocol used to generate the robustness and timing comparisons.
  2. Notation for the auxiliary function F in the F-based self-concordant definition should be introduced once in the main text rather than only in the appendix.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive report, the recognition of the technical contribution, and the recommendation for minor revision. We are pleased that the significance of extending self-concordance to the non-convex setting without Lipschitz assumptions was appreciated.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines new function classes (weakly self-concordant and F-based self-concordant) and derives algorithm convergence rates directly from the inequalities and potential-function arguments supplied in those definitions. The O(ε^{-2}) iteration bounds for first-order stationarity (and second-order with negative-curvature oracle) follow from the stated step-size rules and regularization without reducing to any fitted parameter, self-citation chain, or imported uniqueness theorem. The analysis is self-contained against the paper's own assumptions and does not invoke external Lipschitz constants or prior results by the same authors as load-bearing premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claims rest primarily on the assumption that functions satisfy the newly defined weakly self-concordant or F-based self-concordant properties, which are introduced in the paper without independent external verification.

axioms (1)
  • domain assumption Objective functions satisfy the properties of weakly self-concordant or F-based self-concordant functions as defined.
    This is the key generalization enabling convergence analysis without convexity or Lipschitz continuity.
invented entities (2)
  • weakly self-concordant functions no independent evidence
    purpose: Generalize self-concordance to non-convex optimization without convexity assumption
    New function class introduced to support the algorithms and analysis.
  • F-based self-concordant functions no independent evidence
    purpose: Provide an alternative generalization using a function F for non-convex cases
    New function class defined in the paper.

pith-pipeline@v0.9.0 · 5469 in / 1382 out tokens · 46575 ms · 2026-05-17T21:21:48.896916+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 1 internal anchor

  1. [1]

    The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

    N. Abreu, N. Vyas, S. Kakade, and D. Morwani. The potential of second-order optimization for LLMs: A study with full Gauss-Newton.ArXiv Preprint: 2510.09378, 2025

  2. [2]

    Absil, R

    P.-A. Absil, R. Mahony, and B. Andrews. Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547, 2005

  3. [3]

    Agarwal, Z

    N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma. Finding approximate local minima faster than gradient descent. InSTOC, pages 1195–1199, 2017

  4. [4]

    F. Bach. Self-concordant analysis for logistic regression.Electronic Journal of Statistics, 4:384–414, 2010

  5. [5]

    F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression.The Journal of Machine Learning Research, 15(1):595–627, 2014

  6. [6]

    T. J. Boerner, S. Deems, T. R. Furlani, S. L. Knuth, and J. Towns. Access: Advancing innovation: NSF’s advanced cyberinfrastructure coordination ecosystem: Services & support. InPractice and Experience in Advanced Research Computing 2023: Computing for the Common Good, pages 173–176. Association for Computing Machinery (ACM), 2023

  7. [7]

    Carderera, M

    A. Carderera, M. Besançon, and S. Pokutta. Simple steps are all you need: Frank-Wolfe and generalized self-concordant functions. InNeurIPS, pages 5390–5401, 2021

  8. [8]

    Carmon, J

    Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018

  9. [9]

    Cartis, N

    C. Cartis, N. I. M. Gould, and P. L. Toint. On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems.SIAM Journal on Optimization, 20(6):2833–2852, 2010

  10. [10]

    Cartis, N

    C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results.Mathematical Programming, 127(2):245–295, 2011

  11. [11]

    Cartis, N

    C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function-and derivative-evaluation complexity.Mathematical Programming, 130(2):295–319, 2011

  12. [12]

    Cartis, N

    C. Cartis, N. I. M. Gould, and P. L. Toint.Evaluation Complexity of Algorithms for Nonconvex Optimization: Theory, Computation and Perspectives. SIAM, 2022

  13. [13]

    A. R. Conn, N. I. M. Gould, and P. L. Toint.Trust Region Methods. SIAM, 2000

  14. [14]

    F. E. Curtis, Z. Lubberts, and D. P. Robinson. Concise complexity analyses for trust region methods. Optimization Letters, 12(8):1713–1724, 2018

  15. [15]

    F. E. Curtis, D. P. Robinson, C. W. Royer, and S. J. Wright. Trust-region Newton-CG with strong second-order complexity guarantees for nonconvex optimization.SIAM Journal on Optimization, 31(1):518–544, 2021. 23

  16. [16]

    F. E. Curtis, D. P. Robinson, and M. Samadi. A trust region algorithm with a worst-case iteration complexity ofO(ϵ −3/2)for nonconvex optimization.Mathematical Programming, 162:1–32, 2017

  17. [17]

    F. E. Curtis and Q. Wang. Worst-case complexity of trace with inexact subproblem solutions for nonconvex smooth optimization.SIAM Journal on Optimization, 33(3):2191–2221, 2023

  18. [18]

    Davis and D

    D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019

  19. [19]

    N. Doikov. Minimizing quasi-self-concordant functions by gradient regularization of Newton method. Mathematical Programming, pages 1–39, 2025

  20. [20]

    Doikov and Y

    N. Doikov and Y. Nesterov. Gradient regularization of Newton method with Bregman distances. Mathematical Programming, 204(1):1–25, 2024

  21. [21]

    Dussault, T

    J.-P. Dussault, T. Migot, and D. Orban. Scalable adaptive cubic regularization methods.Mathe- matical Programming, 207(1):191–225, 2024

  22. [22]

    Dvurechensky, P

    P. Dvurechensky, P. Ostroukhov, K. Safin, S. Shtern, and M. Staudigl. Self-concordant analysis of Frank-Wolfe algorithms. InICML, pages 2814–2824. PMLR, 2020

  23. [23]

    Dvurechensky, K

    P. Dvurechensky, K. Safin, S. Shtern, and M. Staudigl. Generalized self-concordant analysis of Frank-Wolfe algorithms.Mathematical Programming, 198(1):255–323, 2023

  24. [24]

    Gao and D

    W. Gao and D. Goldfarb. Quasi-Newton methods: Superlinear convergence without line searches for self-concordant functions.Optimization Methods and Software, 34(1):194–217, 2019

  25. [25]

    Goldfarb

    D. Goldfarb. The use of negative curvature in minimization algorithms. Technical report, Cornell University, 1980

  26. [26]

    Gratton, S

    S. Gratton, S. Jerad, and P. L. Toint. Yet another fast variant of Newton’s method for nonconvex optimization.IMA Journal of Numerical Analysis, 45(2):971–1008, 2025

  27. [27]

    Hanzely, D

    S. Hanzely, D. Kamzolov, D. Pasechnyuk, A. Gasnikov, P. Richtárik, and M. Takáč. A damped Newton method achieves globalo(1/k2)and local quadratic convergence rate. InNeurIPS, pages 25320–25334, 2022

  28. [28]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InCVPR, pages 770–778, 2016

  29. [29]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. InICLR, 2015

  30. [30]

    Krizhevsky

    A. Krizhevsky. Learning multiple layers of features from tiny images.Master’s Thesis, University of Toronto, 2009

  31. [31]

    Kuczyński and H

    J. Kuczyński and H. Woźniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start.SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992

  32. [32]

    Lee and S

    D. Lee and S. Seung. Algorithms for non-negative matrix factorization. InNeurIPS, pages 535–541, 2000

  33. [33]

    Lee, S.-Y

    J. Lee, S.-Y. Yun, and K.-S. Jun. Improved regret bounds of (multinomial) logistic bandits via regret-to-confidence-set conversion. InAISTATS, pages 4474–4482. PMLR, 2024. 24

  34. [34]

    Łojasiewicz

    S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels.Les équations aux dérivées partielles, 117(87-89), 1963

  35. [35]

    Z. Lu. Randomized block proximal damped Newton method for composite self-concordant mini- mization.SIAM Journal on Optimization, 27(3):1910–1942, 2017

  36. [36]

    Mairal, F

    J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In NeurIPS, pages 1033–1040, 2008

  37. [37]

    J. Martens. New insights and perspectives on the natural gradient method.The Journal of Machine Learning Research, 21(146):1–76, 2020

  38. [38]

    Martens and R

    J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. InICML, pages 2408–2417. PMLR, 2015

  39. [39]

    Martens and I

    J. Martens and I. Sutskever. Training deep and recurrent networks with Hessian-free optimization. InNeural Networks: Tricks of the Trade: Second Edition, pages 479–535. Springer, 2012

  40. [40]

    G. P. McCormick. A modification of Armijo’s step-size rule for negative curvature.Mathematical Programming, 13(1):111–115, 1977

  41. [41]

    Mishchenko

    K. Mishchenko. Regularized Newton method with globalO(1/k2) convergence.SIAM Journal on Optimization, 33(3):1440–1462, 2023

  42. [42]

    J. J. Moré and D. C. Sorensen. On the use of directions of negative curvature in a modified Newton method.Mathematical Programming, 16(1):1–20, 1979

  43. [43]

    J.-J. Moreau. Proximité et dualité dans un espace hilbertien.Bulletin de la Société mathématique de France, 93:273–299, 1965

  44. [44]

    Nesterov

    Y. Nesterov. Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1):159–181, 2008

  45. [45]

    Nesterov.Lectures on Convex Optimization, volume 137

    Y. Nesterov.Lectures on Convex Optimization, volume 137. Springer, 2018

  46. [46]

    Nesterov and A

    Y. Nesterov and A. Nemirovskii.Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994

  47. [47]

    Nesterov and B

    Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006

  48. [48]

    Nocedal and S

    J. Nocedal and S. J. Wright.Numerical Optimization. Springer, 2006

  49. [49]

    Pascanu and Y

    R. Pascanu and Y. Bengio. Revisiting natural gradient for deep networks. InICLR, 2014

  50. [50]

    B. T. Polyak. Gradient methods for solving equations and inequalities.USSR Computational Mathematics and Mathematical Physics, 4(6):17–32, 1964

  51. [51]

    R. T. Rockafellar. Monotone operators and the proximal point algorithm.SIAM journal on control and optimization, 14(5):877–898, 1976

  52. [52]

    Rodomanov and Y

    A. Rodomanov and Y. Nesterov. Greedy quasi-Newton methods with explicit superlinear convergence. SIAM Journal on Optimization, 31(1):785–811, 2021. 25

  53. [53]

    C. W. Royer, M. O’Neill, and S. J. Wright. A Newton-CG algorithm with complexity guarantees for smooth unconstrained optimization.Mathematical Programming, 180(1):451–488, 2020

  54. [54]

    C. W. Royer and S. J. Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization.SIAM Journal on Optimization, 28(2):1448–1477, 2018

  55. [55]

    Schmidt, G

    M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. InECML, pages 286–297. Springer, 2007

  56. [56]

    Simonyan and A

    K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. InICLR, 2015

  57. [57]

    Strande, H

    S. Strande, H. Cai, M. Tatineni, W. Pfeiffer, C. Irving, A. Majumdar, D. Mishin, R. Sinkovits, M. Norman, N. Wolter, et al. Expanse: Computing without boundaries: Architecture, deployment, and early operations experiences of a supercomputer designed for the rapid evolution in science and engineering. InPractice and Experience in Advanced Research Computin...

  58. [58]

    J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval.Foundations of Computational Mathematics, 18(5):1131–1198, 2018

  59. [59]

    Tang, K.-C

    T. Tang, K.-C. Toh, N. Xiao, and Y. Ye. A Riemannian dimension-reduced second-order method with application in sensor network localization.SIAM Journal on Scientific Computing, 46(3):A2025– A2046, 2024

  60. [60]

    M. J. Todd.Minimum-Volume Ellipsoids: Theory and Algorithms. SIAM, 2016

  61. [61]

    Tran-Dinh, A

    Q. Tran-Dinh, A. Kyrillidis, and V. Cevher. Composite self-concordant minimization.The Journal of Machine Learning Research, 16(1):371–416, 2015

  62. [62]

    Tran-Dinh, T

    Q. Tran-Dinh, T. Sun, and S. Lu. Self-concordant inclusions: A unified framework for path-following generalized Newton-type algorithms.Mathematical Programming, 177(1):173–223, 2019

  63. [63]

    Virtanen, R

    P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python.Nature Methods, 17(3):261–272, 2020

  64. [64]

    Zagoruyko and N

    S. Zagoruyko and N. Komodakis. Wide residual networks. InBMVC, pages 87–1. British Machine Vision Association, 2016

  65. [65]

    Zhang, C

    G. Zhang, C. Wang, B. Xu, and R. Grosse. Three mechanisms of weight decay regularization. In ICLR, 2019

  66. [66]

    Zhang, C

    Y. Zhang, C. Chen, T. Ding, Z. Li, R. Sun, and Z.-Q. Luo. Why transformers need ADAM: A Hessian perspective. InNeurIPS, pages 131786–131823, 2024

  67. [67]

    Zhang and X

    Y. Zhang and X. Lin. DISCO: Distributed optimization for self-concordant empirical loss. InICML, pages 362–370. PMLR, 2015

  68. [68]

    Zhang and M

    Y.-J. Zhang and M. Sugiyama. Online (multinomial) logistic bandit: Improved regret and constant computation cost. InNeurIPS, pages 29741–29782, 2023. 26