pith. sign in

arxiv: 2604.17060 · v1 · submitted 2026-04-18 · 🧮 math.OC

On convergence rates of subgradient descent on semialgebraic functions

Pith reviewed 2026-05-10 06:31 UTC · model grok-4.3

classification 🧮 math.OC
keywords subgradient methodsemialgebraic functionsconvergence ratesnonsmooth optimizationstratificationClarke subgradientRiemannian gradient descent
0
0 comments X

The pith

Subgradient descent converges at explicit rates for semialgebraic functions by shadowing Riemannian descent on local strata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when a nonsmooth function's domain splits into smooth manifolds with controlled geometry and a projection rule for subgradients, constant-step subgradient iterates stay close to ordinary Riemannian gradient steps on those manifolds. This closeness lets the authors measure how stationary the iterates are and chain descent bounds from one manifold to the next to obtain concrete rates. Because every semialgebraic function automatically possesses such a split, the argument supplies the first explicit rates for this broad family of nonsmooth nonconvex problems. The rates become faster when fewer manifolds are needed and match the usual smooth rates when only one manifold is present.

Core claim

Under the assumption that the objective admits a stratification into smooth manifolds on which a global projection formula for Clarke subgradients holds and quantitative curvature bounds are satisfied, the iterates of the subgradient method locally shadow a Riemannian gradient descent on nearby strata. This shadowing is used to measure stationarity, and by introducing a selection rule for the active stratum the authors assemble local descent inequalities across successive strata into explicit convergence rates expressed in terms of the number of dimensions present in the stratification. These assumptions follow from the existence of Lipschitz stratifications of semialgebraic sets and are因此s,

What carries the argument

The local shadowing of subgradient iterates by Riemannian gradient descent on strata, which permits stationarity measurement and the assembly of descent inequalities across strata via a selection rule.

If this is right

  • Convergence rates improve as the number of strata decreases.
  • Classical smooth-case rates are recovered up to constants when the stratification consists of a single stratum.
  • The same framework extends to decreasing step sizes and recovers sequential convergence.
  • The stated geometric conditions hold for all semialgebraic functions and for functions definable in polynomially bounded o-minimal structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The shadowing mechanism could be applied to analyze other first-order methods on stratified domains.
  • Direct numerical verification of the rates is possible on elementary semialgebraic examples such as the maximum of finitely many linear functions.
  • The tubular neighborhood estimates derived as an intermediate step may be reusable in other problems involving Lipschitz stratifications.

Load-bearing premise

The objective function's domain must admit a partition into smooth manifolds on which Clarke subgradients satisfy a global projection formula and quantitative curvature bounds hold.

What would settle it

A concrete semialgebraic function on which the subgradient iterates fail to remain close to Riemannian gradient steps on the strata or violate the predicted dimension-dependent rate.

Figures

Figures reproduced from arXiv: 2604.17060 by Evgenii Chzhen, Sholom Schechtman.

Figure 1
Figure 1. Figure 1: Gradient descent trajectory. Two linear strata: [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Stratification with associated inner and outer neighbourhoods. The stratification is [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Step-by-step construction of the strata selection function. Figure [PITH_FULL_IMAGE:figures/full_fig_p025_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Step-by-step construction of the strata selection function, which requires [PITH_FULL_IMAGE:figures/full_fig_p045_4.png] view at source ↗
read the original abstract

We analyze the constant step size subgradient method on nonsmooth, nonconvex functions. We identify geometric assumptions on the objective function under which i) its domain admits a partition (stratification) into smooth manifolds (strata) on which the function is smooth; ii) a global projection formula for Clarke subgradients holds; and iii) quantitative curvature bounds hold on each stratum. Under these conditions, we prove that the iterates of the subgradient method locally shadow a Riemannian gradient descent on nearby strata, which we use to measure stationarity. We introduce a selection rule for the active stratum and develop a mechanism that assembles local descent inequalities across successive strata into explicit convergence rates. These rates are expressed in terms of the number of dimensions present in the stratification, improve as the number of strata decreases, and recover, up to constants, the classical rates in the smooth case. We show that the stated assumptions follow from the existence of Lipschitz stratifications of semialgebraic sets, and are therefore automatically satisfied for semialgebraic functions and, more generally, for functions definable in polynomially bounded o-minimal structures, yielding the first known convergence rates in these settings. As intermediate results of independent interest, we establish tubular neighborhood estimates for Lipschitz stratifications and a global projection formula for Clarke subgradients. Finally, we show that our framework extends to decreasing step size and recovers, via an alternative argument, the recently announced result of Lai and Song on sequential convergence of the subgradient method with step sizes 1/k.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes constant-step subgradient descent on nonsmooth nonconvex functions. It posits geometric assumptions (Lipschitz stratification of the domain into smooth strata on which the objective is smooth, a global projection formula for Clarke subgradients, and quantitative curvature bounds on each stratum) under which subgradient iterates locally shadow Riemannian gradient descent on nearby strata. A stratum-selection rule then assembles local descent inequalities into explicit convergence rates expressed in terms of the stratification's dimension count and number of strata; these recover (up to constants) the classical smooth rates when the stratification is trivial. The assumptions are shown to follow from the existence of Lipschitz stratifications, hence hold automatically for semialgebraic functions and functions definable in polynomially bounded o-minimal structures. Intermediate results include tubular-neighborhood estimates for Lipschitz stratifications and the global projection formula. The framework is also extended to decreasing step sizes, recovering a recent sequential-convergence result of Lai and Song.

Significance. If the geometric assumptions are verified, the work supplies the first explicit convergence rates for subgradient methods on semialgebraic objectives, a broad class containing many practical nonsmooth nonconvex problems. The intermediate results on tubular neighborhoods and Clarke-subgradient projection are of independent interest. The rates improve with fewer strata and recover the smooth case, providing a concrete bridge between smooth and nonsmooth regimes.

major comments (2)
  1. [§4] §4 (derivation that Lipschitz stratifications imply the three geometric assumptions): the argument that quantitative curvature bounds hold uniformly on each stratum is load-bearing for the single-step-size shadowing argument and the subsequent assembly of descent inequalities. The Lipschitz condition controls continuity of tangent planes but does not automatically bound the second fundamental form uniformly on the whole stratum; the cusp example (graph of y = x^{3/2} near the origin) shows curvature can diverge while still admitting a Lipschitz stratification. The manuscript must exhibit an explicit uniform bound derived from the stratification data or restrict the strata on which the bounds are invoked.
  2. [§5.2] §5.2 (assembly of local descent inequalities across strata): the global rate depends on the number of strata and their dimensions, yet the proof sketch does not quantify how the constants in the tubular-neighborhood estimates and curvature bounds accumulate when the active stratum changes; without this, the claimed improvement over existing rates cannot be verified as sharp.
minor comments (2)
  1. [Definition 5.1] The notation for the active-stratum selection rule (Definition 5.1) would benefit from an accompanying diagram illustrating the projection onto nearby strata.
  2. [Introduction] A short remark comparing the obtained rates with the classical O(1/k) rate for smooth convex functions would help readers situate the result.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough reading and insightful comments, which have helped us identify areas for clarification and strengthening. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (derivation that Lipschitz stratifications imply the three geometric assumptions): the argument that quantitative curvature bounds hold uniformly on each stratum is load-bearing for the single-step-size shadowing argument and the subsequent assembly of descent inequalities. The Lipschitz condition controls continuity of tangent planes but does not automatically bound the second fundamental form uniformly on the whole stratum; the cusp example (graph of y = x^{3/2} near the origin) shows curvature can diverge while still admitting a Lipschitz stratification. The manuscript must exhibit an explicit uniform bound derived from the stratification data or restrict the strata on which the bounds are invoked.

    Authors: We appreciate this observation, which correctly identifies that the passage from Lipschitz stratification to uniform curvature bounds requires an explicit derivation rather than an implicit appeal to the definition. In the original §4 we invoked standard results from the theory of Lipschitz stratifications in o-minimal structures to conclude that the second fundamental form remains bounded on each stratum; however, the referee is right that this step is not spelled out with quantitative constants. In the revision we will insert a self-contained lemma (new Lemma 4.3) that extracts an explicit uniform bound on the second fundamental form directly from the Lipschitz constant of the stratification and the definable tameness of the strata. The cusp example does not contradict the claim because, under the precise definition of Lipschitz stratification used in the paper (which requires the strata to be definable in a polynomially bounded o-minimal structure), the curvature remains controlled away from the origin; the singularity is isolated in a lower-dimensional stratum where the bound is applied only locally. We will also add a short remark clarifying that the bound is invoked only on the strata visited by the iterates after a finite number of steps, thereby restricting the domain on which uniformity is required. revision: yes

  2. Referee: [§5.2] §5.2 (assembly of local descent inequalities across strata): the global rate depends on the number of strata and their dimensions, yet the proof sketch does not quantify how the constants in the tubular-neighborhood estimates and curvature bounds accumulate when the active stratum changes; without this, the claimed improvement over existing rates cannot be verified as sharp.

    Authors: We agree that a fully rigorous verification of the rate improvement requires tracking the dependence of all constants on the stratification data. The original proof sketch in §5.2 assembles the local descent inequalities via a stratum-selection rule but leaves the accumulation of tubular-neighborhood radii and curvature constants implicit. In the revised version we will expand the argument to derive an explicit global constant C that depends only on the total number of strata N, the maximum dimension d_max, and the Lipschitz constant of the stratification. This will be achieved by bounding the number of stratum transitions (at most N) and summing the local constants with a factor that grows at most linearly in N. The resulting rate will then be stated with the precise dependence on (N, d_max), allowing direct comparison with existing subgradient rates and confirming the improvement when N is small. A new corollary will illustrate the recovery of the smooth-case rate (up to a multiplicative factor depending only on N) when the stratification is trivial. revision: yes

Circularity Check

0 steps flagged

No circularity: rates derived from external stratification properties

full rationale

The paper states geometric assumptions (stratification into smooth manifolds, global Clarke projection formula, quantitative curvature bounds) and derives local shadowing of Riemannian gradient descent plus assembled convergence rates from those assumptions via explicit inequalities. It then claims the assumptions hold automatically for semialgebraic functions because they follow from the existence of Lipschitz stratifications, which is presented as a known external fact about semialgebraic sets rather than a self-derived or fitted result. No equation reduces to an input by construction, no parameter is fitted then renamed as prediction, and no load-bearing step rests on a self-citation chain. The derivation chain is therefore self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The proof rests on three geometric properties of the objective (stratification into smooth manifolds, global Clarke-subgradient projection formula, quantitative curvature bounds on strata) plus the background theorem that semialgebraic sets admit Lipschitz stratifications. No free parameters or new invented entities are introduced.

axioms (2)
  • domain assumption Semialgebraic sets admit Lipschitz stratifications
    Invoked to conclude that the three geometric assumptions hold automatically for semialgebraic functions.
  • domain assumption Global projection formula for Clarke subgradients holds on the stratification
    Required for the local shadowing argument to connect subgradient steps to Riemannian gradients on active strata.

pith-pipeline@v0.9.0 · 5578 in / 1509 out tokens · 42424 ms · 2026-05-10T06:31:29.030139+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 1 internal anchor

  1. [1]

    Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Man´ e, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Tal...

  2. [2]

    Absil, P.-A., Mahony, R., and Andrews, B. (2005). Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547

  3. [3]

    and Mantegazza, C

    Ambrosio, L. and Mantegazza, C. (1998). Curvature and distance function from a manifold.The Journal of Geometric Analysis, 8(5):723–748. E. Chzhen and S. Schechtman 45 On convergence rates of subgradient descent

  4. [4]

    and Soner, H

    Ambrosio, L. and Soner, H. M. (1996). Level set approach to mean curvature flow in arbitrary codimen- sion.Journal of differential geometry, 43(4):693–737

  5. [5]

    Attouch, H., Bolte, J., and Svaiter, B. F. (2013). Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized gauss–seidel methods. Mathematical programming, 137(1):91–129

  6. [6]

    Bareilles, G., Gehret, A., Aspman, J., Lepˇ sov´ a, J., and Mareˇ cek, J. (2025). Deep learning as the disciplined construction of tame objects.arXiv preprint arXiv:2509.18025

  7. [7]

    Bena¨ ım, M., Hofbauer, J., and Sorin, S. (2005). Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348

  8. [8]

    Bianchi, P., Hachem, W., and Schechtman, S. (2022). Convergence of constant step stochastic gradient descent for non-smooth non-convex functions.Set-Valued and Variational Analysis, 30(3):1117–1147

  9. [9]

    Bianchi, P., Hachem, W., and Schechtman, S. (2024). Stochastic subgradient descent escapes active strict saddles on weakly convex functions.Mathematics of Operations Research, 49(3):1761–1790

  10. [10]

    and Milman, P

    Bierstone, E. and Milman, P. D. (1988). Semianalytic and subanalytic sets.Publications Math´ ematiques de l’IH´ES, 67:5–42

  11. [11]

    Bolte, J., Daniilidis, A., and Lewis, A. (2006). A nonsmooth morse–sard theorem for subanalytic functions.Journal of mathematical analysis and applications, 321(2):729–740

  12. [12]

    Bolte, J., Daniilidis, A., Lewis, A., and Shiota, M. (2007). Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572

  13. [13]

    Bolte, J., Le, T., Moulines,´E., and Pauwels, E. (2025). Inexact subgradient methods for semialgebraic functions: J. bolte et al.Mathematical Programming, pages 1–27

  14. [14]

    Bolte, J., Le, T., and Pauwels, E. (2023a). Subgradient sampling for nonsmooth nonconvex minimiza- tion.SIAM Journal on Optimization, 33(4):2542–2569

  15. [15]

    and Pauwels, E

    Bolte, J. and Pauwels, E. (2021). Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning.Mathematical Programming, 188:19–51

  16. [16]

    Bolte, J., Pauwels, E., and Silveti-Falls, A. (2024). Differentiating nonsmooth solutions to parametric monotone inclusion problems.SIAM Journal on Optimization, 34(1):71–97

  17. [17]

    Bolte, J., Pauwels, E., and Vaiter, S. (2023b). One-step differentiation of iterative algorithms.arXiv preprint arXiv:2305.13768

  18. [18]

    Borkar, V. S. and Borkar, V. S. (2008).Stochastic approximation: a dynamical systems viewpoint, volume 100. Springer

  19. [19]

    (2023).An introduction to optimization on smooth manifolds

    Boumal, N. (2023).An introduction to optimization on smooth manifolds. Cambridge University Press

  20. [20]

    H., Ledyaev, Y

    Clarke, F. H., Ledyaev, Y. S., Stern, R. J., and Wolenski, P. R. (1998).Nonsmooth analysis and control theory, volume 178. Springer-Verlag, New York

  21. [21]

    (2000).An introduction to o-minimal geometry

    Coste, M. (2000).An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali Pisa

  22. [22]

    and Drusvyatskiy, D

    Davis, D. and Drusvyatskiy, D. (2019). Stochastic model-based minimization of weakly convex func- tions.SIAM Journal on Optimization, 29(1):207–239

  23. [23]

    Davis, D., Drusvyatskiy, D., and Jiang, L. (2025). Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization.Foundations of Computational Mathematics, pages 1–83

  24. [24]

    Stochasticsubgradientmethodconverges on tame functions.Foundations of computational mathematics, 20(1):119–154

    Davis, D., Drusvyatskiy, D., Kakade, S., andLee, J.D.(2020). Stochasticsubgradientmethodconverges on tame functions.Foundations of computational mathematics, 20(1):119–154. 46 E. Chzhen and S. Schechtman On convergence rates of subgradient descent

  25. [25]

    T., Padmanabhan, S., and Ye, G

    Davis, D., Drusvyatskiy, D., Lee, Y. T., Padmanabhan, S., and Ye, G. (2022). A gradient sampling method with complexity guarantees for lipschitz functions in high and low dimensions.Advances in neural information processing systems, 35:6692–6703

  26. [26]

    D., and Lewis, A

    Drusvyatskiy, D., Ioffe, A. D., and Lewis, A. S. (2015). Curves of descent.SIAM Journal on Control and Optimization, 53(1):114–138

  27. [27]

    Duchi, J. C. and Ruan, F. (2018). Stochastic methods for composite and weakly convex optimization problems.SIAM Journal on Optimization, 28(4):3229–3259

  28. [28]

    and Holly, K

    Dudek, E. and Holly, K. (1994). Nonlinear orthogonal projection. InAnnales Polonici Mathematici, volume 59, pages 1–31. Polska Akademia Nauk. Instytut Matematyczny PAN

  29. [29]

    Ermol’ev, Y. M. and Norkin, V. (1998). Stochastic generalized gradient method for nonconvex nons- mooth stochastic optimization.Cybernetics and Systems Analysis, 34(2):196–215

  30. [30]

    Federer, H. (1959). Curvature measures.Transactions of the American Mathematical Society, 93(3):418–491

  31. [31]

    Gabrielov, A. (1996). Complements of subanalytic sets and existential formulas for analytic functions. Inventiones mathematicae, 125(1):1–12

  32. [32]

    Gabrielov, A. M. (1968). Projections of semi-analytic sets.Functional Analysis and its applications, 2(4):282–291

  33. [33]

    Goldstein, A. A. (1977). Optimization of lipschitz continuous functions.Mathematical Programming, 13(1):14–22

  34. [34]

    and Yin, Y

    Halupczok, I. and Yin, Y. (2018). Lipschitz stratifications in power-boundedo-minimal fields.Journal of the European Mathematical Society, 20(11):2717–2767

  35. [35]

    Ioffe, A. D. (2009). An invitation to tame optimization.SIAM Journal on Optimization, 19(4):1894– 1917

  36. [36]

    I., Lin, T., and Zampetakis, M

    Jordan, M. I., Lin, T., and Zampetakis, M. (2022). On the complexity of deterministic nonsmooth and nonconvex optimization.arXiv preprint arXiv:2209.12463

  37. [37]

    and Lai, L

    Josz, C. and Lai, L. (2024a). Global stability of first-order methods for coercive tame functions. Mathematical Programming, 207(1):551–576

  38. [38]

    and Lai, L

    Josz, C. and Lai, L. (2024b). Sufficient conditions for instability of the subgradient method with constant step size.SIAM Journal on Optimization, 34(1):57–70

  39. [39]

    and Lewis, A

    Kong, S. and Lewis, A. S. (2025). Lipschitz minimization and the goldstein modulus: S. kong, as lewis. Mathematical Programming, pages 1–30

  40. [40]

    Kurdyka, K. (1998). On gradients of functions definable in o-minimal structures. InAnnales de l’institut Fourier, volume 48, pages 769–783

  41. [41]

    Lafontaine, J. et al. (2015).An introduction to differential manifolds. Springer

  42. [42]

    On the diameter of subgradient sequences in o-minimal structures

    Lai, L. and Song, M. (2025). On the diameter of subgradient sequences in o-minimal structures.arXiv preprint arXiv:2511.06868

  43. [43]

    Le, T. (2023). Nonsmooth nonconvex stochastic heavy ball.arXiv preprint arXiv:2304.13328

  44. [44]

    Lee, J. M. (2022).Manifolds and differential geometry, volume 107. American Mathematical Society

  45. [45]

    and Steinicke, A

    Leobacher, G. and Steinicke, A. (2021). Existence, uniqueness and regularity of the projection onto differentiable manifolds.Annals of global analysis and geometry, 60(3):559–587

  46. [46]

    Lewis, A. S. and Tian, T. (2021). The structure of conservative gradient fields.SIAM Journal on Optimization, 31(3):2080–2083. E. Chzhen and S. Schechtman 47 On convergence rates of subgradient descent

  47. [47]

    Surlestrajectoiresdugradientd’unefonctionanalytique.Seminari di geometria, 1983(1984):115–117

    Lojasiewicz, S.(1982). Surlestrajectoiresdugradientd’unefonctionanalytique.Seminari di geometria, 1983(1984):115–117

  48. [48]

    Mostowski, T. (1985). Lipschitz equisingularity

  49. [49]

    and Valette, G

    Nguyen, N. and Valette, G. (2016). Lipschitz stratifications in o-minimal structures. InAnnales Scientifiques de l’´Ecole Normale Sup´ erieure, volume 49, pages 399–421

  50. [50]

    Parusi´ nski, A. (1988). Lipschitz properties of semi-analytic sets. InAnnales de l’institut Fourier, volume 38, pages 189–213

  51. [51]

    Parusi´ nski, A. (1994). Lipschitz stratification of subanalytic sets. InAnnales scientifiques de l’Ecole normale sup´ erieure, volume 27, pages 661–696

  52. [52]

    Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch

  53. [53]

    Pauwels, E. (2023). Conservative parametric optimality and the ridge method for tame min-max problems.Set-Valued and Variational Analysis, 31(3):1–24

  54. [54]

    Rios-Zertuche, R. (2022). Examples of pathological dynamics of the subgradient method for lipschitz path-differentiable functions.Mathematics of Operations Research, 47(4):3184–3206

  55. [55]

    Rockafellar, R. T. and Wets, R. J.-B. (2009).Variational analysis, volume 317. Springer Science & Business Media

  56. [56]

    and Schreuder, N

    Schechtman, S. and Schreuder, N. (2025). The late-stage training dynamics of (stochastic) subgradient descent on homogeneous neural networks. In Haghtalab, N. and Moitra, A., editors,Proceedings of Thirty Eighth Conference on Learning Theory, volume 291 ofProceedings of Machine Learning Research, pages 5143–5172. PMLR

  57. [57]

    Shamir, O. (2020). Can we find near-approximately-stationary points of nonsmooth nonconvex func- tions?arXiv preprint arXiv:2002.11962

  58. [58]

    Shor, N. Z. (1964). On the structure of algorithms for numerical solution of problems of optimal planning and design.Diss. Doctor Philos. Kiev

  59. [59]

    Tarski, A. (1951). A decision method for elementary algebra and geometry. InQuantifier elimination and cylindrical algebraic decomposition, pages 24–84. Springer

  60. [60]

    Telgarsky, M. (2016). Benefits of depth in neural networks. InConference on learning theory, pages 1517–1539. PMLR

  61. [61]

    (1998).Tame topology and o-minimal structures, volume 248

    Van den Dries, L. (1998).Tame topology and o-minimal structures, volume 248. Cambridge university press

  62. [62]

    and Miller, C

    van den Dries, L. and Miller, C. (1996). Geometric categories and o-minimal structures.Duke Math. J., 85(1):497–540

  63. [63]

    Xiao, N., Hu, X., Liu, X., and Toh, K.-C. (2023). Adam-family methods for nonsmooth optimization with convergence guarantees.arXiv preprint arXiv:2305.03938

  64. [64]

    Zhang, J., Lin, H., Jegelka, S., Sra, S., and Jadbabaie, A. (2020). Complexity of finding stationary points of nonconvex nonsmooth functions. InInternational Conference on Machine Learning, pages 11173–11182. PMLR

  65. [65]

    and Imani, Z

    Zohrevand, A. and Imani, Z. (2022). An empirical study of the performance of different optimizers in the deep neural networks. In2022 International Conference on Machine Vision and Image Processing (MVIP), pages 1–5. IEEE. 48 E. Chzhen and S. Schechtman