On convergence rates of subgradient descent on semialgebraic functions

Evgenii Chzhen; Sholom Schechtman

arxiv: 2604.17060 · v1 · submitted 2026-04-18 · 🧮 math.OC

On convergence rates of subgradient descent on semialgebraic functions

Evgenii Chzhen , Sholom Schechtman This is my paper

Pith reviewed 2026-05-10 06:31 UTC · model grok-4.3

classification 🧮 math.OC

keywords subgradient methodsemialgebraic functionsconvergence ratesnonsmooth optimizationstratificationClarke subgradientRiemannian gradient descent

0 comments

The pith

Subgradient descent converges at explicit rates for semialgebraic functions by shadowing Riemannian descent on local strata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when a nonsmooth function's domain splits into smooth manifolds with controlled geometry and a projection rule for subgradients, constant-step subgradient iterates stay close to ordinary Riemannian gradient steps on those manifolds. This closeness lets the authors measure how stationary the iterates are and chain descent bounds from one manifold to the next to obtain concrete rates. Because every semialgebraic function automatically possesses such a split, the argument supplies the first explicit rates for this broad family of nonsmooth nonconvex problems. The rates become faster when fewer manifolds are needed and match the usual smooth rates when only one manifold is present.

Core claim

Under the assumption that the objective admits a stratification into smooth manifolds on which a global projection formula for Clarke subgradients holds and quantitative curvature bounds are satisfied, the iterates of the subgradient method locally shadow a Riemannian gradient descent on nearby strata. This shadowing is used to measure stationarity, and by introducing a selection rule for the active stratum the authors assemble local descent inequalities across successive strata into explicit convergence rates expressed in terms of the number of dimensions present in the stratification. These assumptions follow from the existence of Lipschitz stratifications of semialgebraic sets and are因此s,

What carries the argument

The local shadowing of subgradient iterates by Riemannian gradient descent on strata, which permits stationarity measurement and the assembly of descent inequalities across strata via a selection rule.

If this is right

Convergence rates improve as the number of strata decreases.
Classical smooth-case rates are recovered up to constants when the stratification consists of a single stratum.
The same framework extends to decreasing step sizes and recovers sequential convergence.
The stated geometric conditions hold for all semialgebraic functions and for functions definable in polynomially bounded o-minimal structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The shadowing mechanism could be applied to analyze other first-order methods on stratified domains.
Direct numerical verification of the rates is possible on elementary semialgebraic examples such as the maximum of finitely many linear functions.
The tubular neighborhood estimates derived as an intermediate step may be reusable in other problems involving Lipschitz stratifications.

Load-bearing premise

The objective function's domain must admit a partition into smooth manifolds on which Clarke subgradients satisfy a global projection formula and quantitative curvature bounds hold.

What would settle it

A concrete semialgebraic function on which the subgradient iterates fail to remain close to Riemannian gradient steps on the strata or violate the predicted dimension-dependent rate.

Figures

Figures reproduced from arXiv: 2604.17060 by Evgenii Chzhen, Sholom Schechtman.

**Figure 2.** Figure 2: Stratification with associated inner and outer neighbourhoods. The stratification is [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗

**Figure 3.** Figure 3: Step-by-step construction of the strata selection function. Figure [PITH_FULL_IMAGE:figures/full_fig_p025_3.png] view at source ↗

**Figure 4.** Figure 4: Step-by-step construction of the strata selection function, which requires [PITH_FULL_IMAGE:figures/full_fig_p045_4.png] view at source ↗

read the original abstract

We analyze the constant step size subgradient method on nonsmooth, nonconvex functions. We identify geometric assumptions on the objective function under which i) its domain admits a partition (stratification) into smooth manifolds (strata) on which the function is smooth; ii) a global projection formula for Clarke subgradients holds; and iii) quantitative curvature bounds hold on each stratum. Under these conditions, we prove that the iterates of the subgradient method locally shadow a Riemannian gradient descent on nearby strata, which we use to measure stationarity. We introduce a selection rule for the active stratum and develop a mechanism that assembles local descent inequalities across successive strata into explicit convergence rates. These rates are expressed in terms of the number of dimensions present in the stratification, improve as the number of strata decreases, and recover, up to constants, the classical rates in the smooth case. We show that the stated assumptions follow from the existence of Lipschitz stratifications of semialgebraic sets, and are therefore automatically satisfied for semialgebraic functions and, more generally, for functions definable in polynomially bounded o-minimal structures, yielding the first known convergence rates in these settings. As intermediate results of independent interest, we establish tubular neighborhood estimates for Lipschitz stratifications and a global projection formula for Clarke subgradients. Finally, we show that our framework extends to decreasing step size and recovers, via an alternative argument, the recently announced result of Lai and Song on sequential convergence of the subgradient method with step sizes 1/k.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives the first explicit convergence rates for constant-step subgradient descent on semialgebraic functions by turning Lipschitz stratifications into uniform curvature bounds and local shadowing of Riemannian gradient descent.

read the letter

The main takeaway is that the authors derive concrete rates for the subgradient method on the full class of semialgebraic functions. They do this by showing that Lipschitz stratifications automatically supply the three geometric conditions needed: a smooth stratification of the domain, a global projection formula for Clarke subgradients, and quantitative curvature bounds on each stratum. From there they prove that subgradient iterates locally shadow Riemannian gradient descent on the active stratum, introduce a selection rule for which stratum is active, and glue the local descent inequalities into global rates that scale with the number of strata and their dimensions. These rates recover the smooth case up to constants and also cover decreasing step sizes, including the recent Lai-Song sequential convergence result as a byproduct. The tubular neighborhood estimates and global projection formula are presented as intermediate results that may stand on their own. The approach is new in how it assembles cross-stratum inequalities without assuming global smoothness or convexity. The potential weak point is the claim that Lipschitz stratifications automatically deliver uniform quantitative curvature bounds on each stratum. The example of a stratum whose second fundamental form diverges near a lower-dimensional boundary (such as the graph of y = x^{3/2}) shows that tangent-plane continuity alone does not guarantee a uniform bound. If the curvature constants deteriorate near stratum transitions, the fixed step size and the explicit rate assembly could lose their uniformity. The abstract asserts that the bounds follow from the stratification properties and that the authors prove the necessary tubular estimates, so the manuscript presumably closes this gap, but it is the section that needs the most careful verification. This work is aimed at researchers in nonsmooth nonconvex optimization who already work with semialgebraic or definable functions and want rate statements beyond the convex or smooth settings. A reader looking for a geometric route to explicit rates will find the framework useful. The paper deserves a serious referee because the proof outline is coherent, the claims are sharp, and the geometric reductions are checkable even if one ultimately disagrees with the final constants.

Referee Report

2 major / 2 minor

Summary. The paper analyzes constant-step subgradient descent on nonsmooth nonconvex functions. It posits geometric assumptions (Lipschitz stratification of the domain into smooth strata on which the objective is smooth, a global projection formula for Clarke subgradients, and quantitative curvature bounds on each stratum) under which subgradient iterates locally shadow Riemannian gradient descent on nearby strata. A stratum-selection rule then assembles local descent inequalities into explicit convergence rates expressed in terms of the stratification's dimension count and number of strata; these recover (up to constants) the classical smooth rates when the stratification is trivial. The assumptions are shown to follow from the existence of Lipschitz stratifications, hence hold automatically for semialgebraic functions and functions definable in polynomially bounded o-minimal structures. Intermediate results include tubular-neighborhood estimates for Lipschitz stratifications and the global projection formula. The framework is also extended to decreasing step sizes, recovering a recent sequential-convergence result of Lai and Song.

Significance. If the geometric assumptions are verified, the work supplies the first explicit convergence rates for subgradient methods on semialgebraic objectives, a broad class containing many practical nonsmooth nonconvex problems. The intermediate results on tubular neighborhoods and Clarke-subgradient projection are of independent interest. The rates improve with fewer strata and recover the smooth case, providing a concrete bridge between smooth and nonsmooth regimes.

major comments (2)

[§4] §4 (derivation that Lipschitz stratifications imply the three geometric assumptions): the argument that quantitative curvature bounds hold uniformly on each stratum is load-bearing for the single-step-size shadowing argument and the subsequent assembly of descent inequalities. The Lipschitz condition controls continuity of tangent planes but does not automatically bound the second fundamental form uniformly on the whole stratum; the cusp example (graph of y = x^{3/2} near the origin) shows curvature can diverge while still admitting a Lipschitz stratification. The manuscript must exhibit an explicit uniform bound derived from the stratification data or restrict the strata on which the bounds are invoked.
[§5.2] §5.2 (assembly of local descent inequalities across strata): the global rate depends on the number of strata and their dimensions, yet the proof sketch does not quantify how the constants in the tubular-neighborhood estimates and curvature bounds accumulate when the active stratum changes; without this, the claimed improvement over existing rates cannot be verified as sharp.

minor comments (2)

[Definition 5.1] The notation for the active-stratum selection rule (Definition 5.1) would benefit from an accompanying diagram illustrating the projection onto nearby strata.
[Introduction] A short remark comparing the obtained rates with the classical O(1/k) rate for smooth convex functions would help readers situate the result.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough reading and insightful comments, which have helped us identify areas for clarification and strengthening. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

read point-by-point responses

Referee: [§4] §4 (derivation that Lipschitz stratifications imply the three geometric assumptions): the argument that quantitative curvature bounds hold uniformly on each stratum is load-bearing for the single-step-size shadowing argument and the subsequent assembly of descent inequalities. The Lipschitz condition controls continuity of tangent planes but does not automatically bound the second fundamental form uniformly on the whole stratum; the cusp example (graph of y = x^{3/2} near the origin) shows curvature can diverge while still admitting a Lipschitz stratification. The manuscript must exhibit an explicit uniform bound derived from the stratification data or restrict the strata on which the bounds are invoked.

Authors: We appreciate this observation, which correctly identifies that the passage from Lipschitz stratification to uniform curvature bounds requires an explicit derivation rather than an implicit appeal to the definition. In the original §4 we invoked standard results from the theory of Lipschitz stratifications in o-minimal structures to conclude that the second fundamental form remains bounded on each stratum; however, the referee is right that this step is not spelled out with quantitative constants. In the revision we will insert a self-contained lemma (new Lemma 4.3) that extracts an explicit uniform bound on the second fundamental form directly from the Lipschitz constant of the stratification and the definable tameness of the strata. The cusp example does not contradict the claim because, under the precise definition of Lipschitz stratification used in the paper (which requires the strata to be definable in a polynomially bounded o-minimal structure), the curvature remains controlled away from the origin; the singularity is isolated in a lower-dimensional stratum where the bound is applied only locally. We will also add a short remark clarifying that the bound is invoked only on the strata visited by the iterates after a finite number of steps, thereby restricting the domain on which uniformity is required. revision: yes
Referee: [§5.2] §5.2 (assembly of local descent inequalities across strata): the global rate depends on the number of strata and their dimensions, yet the proof sketch does not quantify how the constants in the tubular-neighborhood estimates and curvature bounds accumulate when the active stratum changes; without this, the claimed improvement over existing rates cannot be verified as sharp.

Authors: We agree that a fully rigorous verification of the rate improvement requires tracking the dependence of all constants on the stratification data. The original proof sketch in §5.2 assembles the local descent inequalities via a stratum-selection rule but leaves the accumulation of tubular-neighborhood radii and curvature constants implicit. In the revised version we will expand the argument to derive an explicit global constant C that depends only on the total number of strata N, the maximum dimension d_max, and the Lipschitz constant of the stratification. This will be achieved by bounding the number of stratum transitions (at most N) and summing the local constants with a factor that grows at most linearly in N. The resulting rate will then be stated with the precise dependence on (N, d_max), allowing direct comparison with existing subgradient rates and confirming the improvement when N is small. A new corollary will illustrate the recovery of the smooth-case rate (up to a multiplicative factor depending only on N) when the stratification is trivial. revision: yes

Circularity Check

0 steps flagged

No circularity: rates derived from external stratification properties

full rationale

The paper states geometric assumptions (stratification into smooth manifolds, global Clarke projection formula, quantitative curvature bounds) and derives local shadowing of Riemannian gradient descent plus assembled convergence rates from those assumptions via explicit inequalities. It then claims the assumptions hold automatically for semialgebraic functions because they follow from the existence of Lipschitz stratifications, which is presented as a known external fact about semialgebraic sets rather than a self-derived or fitted result. No equation reduces to an input by construction, no parameter is fitted then renamed as prediction, and no load-bearing step rests on a self-citation chain. The derivation chain is therefore self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The proof rests on three geometric properties of the objective (stratification into smooth manifolds, global Clarke-subgradient projection formula, quantitative curvature bounds on strata) plus the background theorem that semialgebraic sets admit Lipschitz stratifications. No free parameters or new invented entities are introduced.

axioms (2)

domain assumption Semialgebraic sets admit Lipschitz stratifications
Invoked to conclude that the three geometric assumptions hold automatically for semialgebraic functions.
domain assumption Global projection formula for Clarke subgradients holds on the stratification
Required for the local shadowing argument to connect subgradient steps to Riemannian gradients on active strata.

pith-pipeline@v0.9.0 · 5578 in / 1509 out tokens · 42424 ms · 2026-05-10T06:31:29.030139+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 1 internal anchor

[1]

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Man´ e, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Tal...

work page 2015
[2]

Absil, P.-A., Mahony, R., and Andrews, B. (2005). Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547

work page 2005
[3]

and Mantegazza, C

Ambrosio, L. and Mantegazza, C. (1998). Curvature and distance function from a manifold.The Journal of Geometric Analysis, 8(5):723–748. E. Chzhen and S. Schechtman 45 On convergence rates of subgradient descent

work page 1998
[4]

and Soner, H

Ambrosio, L. and Soner, H. M. (1996). Level set approach to mean curvature flow in arbitrary codimen- sion.Journal of differential geometry, 43(4):693–737

work page 1996
[5]

Attouch, H., Bolte, J., and Svaiter, B. F. (2013). Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized gauss–seidel methods. Mathematical programming, 137(1):91–129

work page 2013
[6]

Bareilles, G., Gehret, A., Aspman, J., Lepˇ sov´ a, J., and Mareˇ cek, J. (2025). Deep learning as the disciplined construction of tame objects.arXiv preprint arXiv:2509.18025

work page arXiv 2025
[7]

Bena¨ ım, M., Hofbauer, J., and Sorin, S. (2005). Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348

work page 2005
[8]

Bianchi, P., Hachem, W., and Schechtman, S. (2022). Convergence of constant step stochastic gradient descent for non-smooth non-convex functions.Set-Valued and Variational Analysis, 30(3):1117–1147

work page 2022
[9]

Bianchi, P., Hachem, W., and Schechtman, S. (2024). Stochastic subgradient descent escapes active strict saddles on weakly convex functions.Mathematics of Operations Research, 49(3):1761–1790

work page 2024
[10]

and Milman, P

Bierstone, E. and Milman, P. D. (1988). Semianalytic and subanalytic sets.Publications Math´ ematiques de l’IH´ES, 67:5–42

work page 1988
[11]

Bolte, J., Daniilidis, A., and Lewis, A. (2006). A nonsmooth morse–sard theorem for subanalytic functions.Journal of mathematical analysis and applications, 321(2):729–740

work page 2006
[12]

Bolte, J., Daniilidis, A., Lewis, A., and Shiota, M. (2007). Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572

work page 2007
[13]

Bolte, J., Le, T., Moulines,´E., and Pauwels, E. (2025). Inexact subgradient methods for semialgebraic functions: J. bolte et al.Mathematical Programming, pages 1–27

work page 2025
[14]

Bolte, J., Le, T., and Pauwels, E. (2023a). Subgradient sampling for nonsmooth nonconvex minimiza- tion.SIAM Journal on Optimization, 33(4):2542–2569

work page
[15]

and Pauwels, E

Bolte, J. and Pauwels, E. (2021). Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning.Mathematical Programming, 188:19–51

work page 2021
[16]

Bolte, J., Pauwels, E., and Silveti-Falls, A. (2024). Differentiating nonsmooth solutions to parametric monotone inclusion problems.SIAM Journal on Optimization, 34(1):71–97

work page 2024
[17]

Bolte, J., Pauwels, E., and Vaiter, S. (2023b). One-step differentiation of iterative algorithms.arXiv preprint arXiv:2305.13768

work page arXiv
[18]

Borkar, V. S. and Borkar, V. S. (2008).Stochastic approximation: a dynamical systems viewpoint, volume 100. Springer

work page 2008
[19]

(2023).An introduction to optimization on smooth manifolds

Boumal, N. (2023).An introduction to optimization on smooth manifolds. Cambridge University Press

work page 2023
[20]

H., Ledyaev, Y

Clarke, F. H., Ledyaev, Y. S., Stern, R. J., and Wolenski, P. R. (1998).Nonsmooth analysis and control theory, volume 178. Springer-Verlag, New York

work page 1998
[21]

(2000).An introduction to o-minimal geometry

Coste, M. (2000).An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali Pisa

work page 2000
[22]

and Drusvyatskiy, D

Davis, D. and Drusvyatskiy, D. (2019). Stochastic model-based minimization of weakly convex func- tions.SIAM Journal on Optimization, 29(1):207–239

work page 2019
[23]

Davis, D., Drusvyatskiy, D., and Jiang, L. (2025). Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization.Foundations of Computational Mathematics, pages 1–83

work page 2025
[24]

Stochasticsubgradientmethodconverges on tame functions.Foundations of computational mathematics, 20(1):119–154

Davis, D., Drusvyatskiy, D., Kakade, S., andLee, J.D.(2020). Stochasticsubgradientmethodconverges on tame functions.Foundations of computational mathematics, 20(1):119–154. 46 E. Chzhen and S. Schechtman On convergence rates of subgradient descent

work page 2020
[25]

T., Padmanabhan, S., and Ye, G

Davis, D., Drusvyatskiy, D., Lee, Y. T., Padmanabhan, S., and Ye, G. (2022). A gradient sampling method with complexity guarantees for lipschitz functions in high and low dimensions.Advances in neural information processing systems, 35:6692–6703

work page 2022
[26]

D., and Lewis, A

Drusvyatskiy, D., Ioffe, A. D., and Lewis, A. S. (2015). Curves of descent.SIAM Journal on Control and Optimization, 53(1):114–138

work page 2015
[27]

Duchi, J. C. and Ruan, F. (2018). Stochastic methods for composite and weakly convex optimization problems.SIAM Journal on Optimization, 28(4):3229–3259

work page 2018
[28]

and Holly, K

Dudek, E. and Holly, K. (1994). Nonlinear orthogonal projection. InAnnales Polonici Mathematici, volume 59, pages 1–31. Polska Akademia Nauk. Instytut Matematyczny PAN

work page 1994
[29]

Ermol’ev, Y. M. and Norkin, V. (1998). Stochastic generalized gradient method for nonconvex nons- mooth stochastic optimization.Cybernetics and Systems Analysis, 34(2):196–215

work page 1998
[30]

Federer, H. (1959). Curvature measures.Transactions of the American Mathematical Society, 93(3):418–491

work page 1959
[31]

Gabrielov, A. (1996). Complements of subanalytic sets and existential formulas for analytic functions. Inventiones mathematicae, 125(1):1–12

work page 1996
[32]

Gabrielov, A. M. (1968). Projections of semi-analytic sets.Functional Analysis and its applications, 2(4):282–291

work page 1968
[33]

Goldstein, A. A. (1977). Optimization of lipschitz continuous functions.Mathematical Programming, 13(1):14–22

work page 1977
[34]

and Yin, Y

Halupczok, I. and Yin, Y. (2018). Lipschitz stratifications in power-boundedo-minimal fields.Journal of the European Mathematical Society, 20(11):2717–2767

work page 2018
[35]

Ioffe, A. D. (2009). An invitation to tame optimization.SIAM Journal on Optimization, 19(4):1894– 1917

work page 2009
[36]

I., Lin, T., and Zampetakis, M

Jordan, M. I., Lin, T., and Zampetakis, M. (2022). On the complexity of deterministic nonsmooth and nonconvex optimization.arXiv preprint arXiv:2209.12463

work page arXiv 2022
[37]

and Lai, L

Josz, C. and Lai, L. (2024a). Global stability of first-order methods for coercive tame functions. Mathematical Programming, 207(1):551–576

work page
[38]

and Lai, L

Josz, C. and Lai, L. (2024b). Sufficient conditions for instability of the subgradient method with constant step size.SIAM Journal on Optimization, 34(1):57–70

work page
[39]

and Lewis, A

Kong, S. and Lewis, A. S. (2025). Lipschitz minimization and the goldstein modulus: S. kong, as lewis. Mathematical Programming, pages 1–30

work page 2025
[40]

Kurdyka, K. (1998). On gradients of functions definable in o-minimal structures. InAnnales de l’institut Fourier, volume 48, pages 769–783

work page 1998
[41]

Lafontaine, J. et al. (2015).An introduction to differential manifolds. Springer

work page 2015
[42]

On the diameter of subgradient sequences in o-minimal structures

Lai, L. and Song, M. (2025). On the diameter of subgradient sequences in o-minimal structures.arXiv preprint arXiv:2511.06868

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Le, T. (2023). Nonsmooth nonconvex stochastic heavy ball.arXiv preprint arXiv:2304.13328

work page arXiv 2023
[44]

Lee, J. M. (2022).Manifolds and differential geometry, volume 107. American Mathematical Society

work page 2022
[45]

and Steinicke, A

Leobacher, G. and Steinicke, A. (2021). Existence, uniqueness and regularity of the projection onto differentiable manifolds.Annals of global analysis and geometry, 60(3):559–587

work page 2021
[46]

Lewis, A. S. and Tian, T. (2021). The structure of conservative gradient fields.SIAM Journal on Optimization, 31(3):2080–2083. E. Chzhen and S. Schechtman 47 On convergence rates of subgradient descent

work page 2021
[47]

Surlestrajectoiresdugradientd’unefonctionanalytique.Seminari di geometria, 1983(1984):115–117

Lojasiewicz, S.(1982). Surlestrajectoiresdugradientd’unefonctionanalytique.Seminari di geometria, 1983(1984):115–117

work page 1982
[48]

Mostowski, T. (1985). Lipschitz equisingularity

work page 1985
[49]

and Valette, G

Nguyen, N. and Valette, G. (2016). Lipschitz stratifications in o-minimal structures. InAnnales Scientifiques de l’´Ecole Normale Sup´ erieure, volume 49, pages 399–421

work page 2016
[50]

Parusi´ nski, A. (1988). Lipschitz properties of semi-analytic sets. InAnnales de l’institut Fourier, volume 38, pages 189–213

work page 1988
[51]

Parusi´ nski, A. (1994). Lipschitz stratification of subanalytic sets. InAnnales scientifiques de l’Ecole normale sup´ erieure, volume 27, pages 661–696

work page 1994
[52]

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch

work page 2017
[53]

Pauwels, E. (2023). Conservative parametric optimality and the ridge method for tame min-max problems.Set-Valued and Variational Analysis, 31(3):1–24

work page 2023
[54]

Rios-Zertuche, R. (2022). Examples of pathological dynamics of the subgradient method for lipschitz path-differentiable functions.Mathematics of Operations Research, 47(4):3184–3206

work page 2022
[55]

Rockafellar, R. T. and Wets, R. J.-B. (2009).Variational analysis, volume 317. Springer Science & Business Media

work page 2009
[56]

and Schreuder, N

Schechtman, S. and Schreuder, N. (2025). The late-stage training dynamics of (stochastic) subgradient descent on homogeneous neural networks. In Haghtalab, N. and Moitra, A., editors,Proceedings of Thirty Eighth Conference on Learning Theory, volume 291 ofProceedings of Machine Learning Research, pages 5143–5172. PMLR

work page 2025
[57]

Shamir, O. (2020). Can we find near-approximately-stationary points of nonsmooth nonconvex func- tions?arXiv preprint arXiv:2002.11962

work page arXiv 2020
[58]

Shor, N. Z. (1964). On the structure of algorithms for numerical solution of problems of optimal planning and design.Diss. Doctor Philos. Kiev

work page 1964
[59]

Tarski, A. (1951). A decision method for elementary algebra and geometry. InQuantifier elimination and cylindrical algebraic decomposition, pages 24–84. Springer

work page 1951
[60]

Telgarsky, M. (2016). Benefits of depth in neural networks. InConference on learning theory, pages 1517–1539. PMLR

work page 2016
[61]

(1998).Tame topology and o-minimal structures, volume 248

Van den Dries, L. (1998).Tame topology and o-minimal structures, volume 248. Cambridge university press

work page 1998
[62]

and Miller, C

van den Dries, L. and Miller, C. (1996). Geometric categories and o-minimal structures.Duke Math. J., 85(1):497–540

work page 1996
[63]

Xiao, N., Hu, X., Liu, X., and Toh, K.-C. (2023). Adam-family methods for nonsmooth optimization with convergence guarantees.arXiv preprint arXiv:2305.03938

work page arXiv 2023
[64]

Zhang, J., Lin, H., Jegelka, S., Sra, S., and Jadbabaie, A. (2020). Complexity of finding stationary points of nonconvex nonsmooth functions. InInternational Conference on Machine Learning, pages 11173–11182. PMLR

work page 2020
[65]

and Imani, Z

Zohrevand, A. and Imani, Z. (2022). An empirical study of the performance of different optimizers in the deep neural networks. In2022 International Conference on Machine Vision and Image Processing (MVIP), pages 1–5. IEEE. 48 E. Chzhen and S. Schechtman

work page 2022

[1] [1]

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Man´ e, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Tal...

work page 2015

[2] [2]

Absil, P.-A., Mahony, R., and Andrews, B. (2005). Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547

work page 2005

[3] [3]

and Mantegazza, C

Ambrosio, L. and Mantegazza, C. (1998). Curvature and distance function from a manifold.The Journal of Geometric Analysis, 8(5):723–748. E. Chzhen and S. Schechtman 45 On convergence rates of subgradient descent

work page 1998

[4] [4]

and Soner, H

Ambrosio, L. and Soner, H. M. (1996). Level set approach to mean curvature flow in arbitrary codimen- sion.Journal of differential geometry, 43(4):693–737

work page 1996

[5] [5]

Attouch, H., Bolte, J., and Svaiter, B. F. (2013). Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized gauss–seidel methods. Mathematical programming, 137(1):91–129

work page 2013

[6] [6]

Bareilles, G., Gehret, A., Aspman, J., Lepˇ sov´ a, J., and Mareˇ cek, J. (2025). Deep learning as the disciplined construction of tame objects.arXiv preprint arXiv:2509.18025

work page arXiv 2025

[7] [7]

Bena¨ ım, M., Hofbauer, J., and Sorin, S. (2005). Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348

work page 2005

[8] [8]

Bianchi, P., Hachem, W., and Schechtman, S. (2022). Convergence of constant step stochastic gradient descent for non-smooth non-convex functions.Set-Valued and Variational Analysis, 30(3):1117–1147

work page 2022

[9] [9]

Bianchi, P., Hachem, W., and Schechtman, S. (2024). Stochastic subgradient descent escapes active strict saddles on weakly convex functions.Mathematics of Operations Research, 49(3):1761–1790

work page 2024

[10] [10]

and Milman, P

Bierstone, E. and Milman, P. D. (1988). Semianalytic and subanalytic sets.Publications Math´ ematiques de l’IH´ES, 67:5–42

work page 1988

[11] [11]

Bolte, J., Daniilidis, A., and Lewis, A. (2006). A nonsmooth morse–sard theorem for subanalytic functions.Journal of mathematical analysis and applications, 321(2):729–740

work page 2006

[12] [12]

Bolte, J., Daniilidis, A., Lewis, A., and Shiota, M. (2007). Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572

work page 2007

[13] [13]

Bolte, J., Le, T., Moulines,´E., and Pauwels, E. (2025). Inexact subgradient methods for semialgebraic functions: J. bolte et al.Mathematical Programming, pages 1–27

work page 2025

[14] [14]

Bolte, J., Le, T., and Pauwels, E. (2023a). Subgradient sampling for nonsmooth nonconvex minimiza- tion.SIAM Journal on Optimization, 33(4):2542–2569

work page

[15] [15]

and Pauwels, E

Bolte, J. and Pauwels, E. (2021). Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning.Mathematical Programming, 188:19–51

work page 2021

[16] [16]

Bolte, J., Pauwels, E., and Silveti-Falls, A. (2024). Differentiating nonsmooth solutions to parametric monotone inclusion problems.SIAM Journal on Optimization, 34(1):71–97

work page 2024

[17] [17]

Bolte, J., Pauwels, E., and Vaiter, S. (2023b). One-step differentiation of iterative algorithms.arXiv preprint arXiv:2305.13768

work page arXiv

[18] [18]

Borkar, V. S. and Borkar, V. S. (2008).Stochastic approximation: a dynamical systems viewpoint, volume 100. Springer

work page 2008

[19] [19]

(2023).An introduction to optimization on smooth manifolds

Boumal, N. (2023).An introduction to optimization on smooth manifolds. Cambridge University Press

work page 2023

[20] [20]

H., Ledyaev, Y

Clarke, F. H., Ledyaev, Y. S., Stern, R. J., and Wolenski, P. R. (1998).Nonsmooth analysis and control theory, volume 178. Springer-Verlag, New York

work page 1998

[21] [21]

(2000).An introduction to o-minimal geometry

Coste, M. (2000).An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali Pisa

work page 2000

[22] [22]

and Drusvyatskiy, D

Davis, D. and Drusvyatskiy, D. (2019). Stochastic model-based minimization of weakly convex func- tions.SIAM Journal on Optimization, 29(1):207–239

work page 2019

[23] [23]

Davis, D., Drusvyatskiy, D., and Jiang, L. (2025). Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization.Foundations of Computational Mathematics, pages 1–83

work page 2025

[24] [24]

Stochasticsubgradientmethodconverges on tame functions.Foundations of computational mathematics, 20(1):119–154

Davis, D., Drusvyatskiy, D., Kakade, S., andLee, J.D.(2020). Stochasticsubgradientmethodconverges on tame functions.Foundations of computational mathematics, 20(1):119–154. 46 E. Chzhen and S. Schechtman On convergence rates of subgradient descent

work page 2020

[25] [25]

T., Padmanabhan, S., and Ye, G

Davis, D., Drusvyatskiy, D., Lee, Y. T., Padmanabhan, S., and Ye, G. (2022). A gradient sampling method with complexity guarantees for lipschitz functions in high and low dimensions.Advances in neural information processing systems, 35:6692–6703

work page 2022

[26] [26]

D., and Lewis, A

Drusvyatskiy, D., Ioffe, A. D., and Lewis, A. S. (2015). Curves of descent.SIAM Journal on Control and Optimization, 53(1):114–138

work page 2015

[27] [27]

Duchi, J. C. and Ruan, F. (2018). Stochastic methods for composite and weakly convex optimization problems.SIAM Journal on Optimization, 28(4):3229–3259

work page 2018

[28] [28]

and Holly, K

Dudek, E. and Holly, K. (1994). Nonlinear orthogonal projection. InAnnales Polonici Mathematici, volume 59, pages 1–31. Polska Akademia Nauk. Instytut Matematyczny PAN

work page 1994

[29] [29]

Ermol’ev, Y. M. and Norkin, V. (1998). Stochastic generalized gradient method for nonconvex nons- mooth stochastic optimization.Cybernetics and Systems Analysis, 34(2):196–215

work page 1998

[30] [30]

Federer, H. (1959). Curvature measures.Transactions of the American Mathematical Society, 93(3):418–491

work page 1959

[31] [31]

Gabrielov, A. (1996). Complements of subanalytic sets and existential formulas for analytic functions. Inventiones mathematicae, 125(1):1–12

work page 1996

[32] [32]

Gabrielov, A. M. (1968). Projections of semi-analytic sets.Functional Analysis and its applications, 2(4):282–291

work page 1968

[33] [33]

Goldstein, A. A. (1977). Optimization of lipschitz continuous functions.Mathematical Programming, 13(1):14–22

work page 1977

[34] [34]

and Yin, Y

Halupczok, I. and Yin, Y. (2018). Lipschitz stratifications in power-boundedo-minimal fields.Journal of the European Mathematical Society, 20(11):2717–2767

work page 2018

[35] [35]

Ioffe, A. D. (2009). An invitation to tame optimization.SIAM Journal on Optimization, 19(4):1894– 1917

work page 2009

[36] [36]

I., Lin, T., and Zampetakis, M

Jordan, M. I., Lin, T., and Zampetakis, M. (2022). On the complexity of deterministic nonsmooth and nonconvex optimization.arXiv preprint arXiv:2209.12463

work page arXiv 2022

[37] [37]

and Lai, L

Josz, C. and Lai, L. (2024a). Global stability of first-order methods for coercive tame functions. Mathematical Programming, 207(1):551–576

work page

[38] [38]

and Lai, L

Josz, C. and Lai, L. (2024b). Sufficient conditions for instability of the subgradient method with constant step size.SIAM Journal on Optimization, 34(1):57–70

work page

[39] [39]

and Lewis, A

Kong, S. and Lewis, A. S. (2025). Lipschitz minimization and the goldstein modulus: S. kong, as lewis. Mathematical Programming, pages 1–30

work page 2025

[40] [40]

Kurdyka, K. (1998). On gradients of functions definable in o-minimal structures. InAnnales de l’institut Fourier, volume 48, pages 769–783

work page 1998

[41] [41]

Lafontaine, J. et al. (2015).An introduction to differential manifolds. Springer

work page 2015

[42] [42]

On the diameter of subgradient sequences in o-minimal structures

Lai, L. and Song, M. (2025). On the diameter of subgradient sequences in o-minimal structures.arXiv preprint arXiv:2511.06868

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

Le, T. (2023). Nonsmooth nonconvex stochastic heavy ball.arXiv preprint arXiv:2304.13328

work page arXiv 2023

[44] [44]

Lee, J. M. (2022).Manifolds and differential geometry, volume 107. American Mathematical Society

work page 2022

[45] [45]

and Steinicke, A

Leobacher, G. and Steinicke, A. (2021). Existence, uniqueness and regularity of the projection onto differentiable manifolds.Annals of global analysis and geometry, 60(3):559–587

work page 2021

[46] [46]

Lewis, A. S. and Tian, T. (2021). The structure of conservative gradient fields.SIAM Journal on Optimization, 31(3):2080–2083. E. Chzhen and S. Schechtman 47 On convergence rates of subgradient descent

work page 2021

[47] [47]

Surlestrajectoiresdugradientd’unefonctionanalytique.Seminari di geometria, 1983(1984):115–117

Lojasiewicz, S.(1982). Surlestrajectoiresdugradientd’unefonctionanalytique.Seminari di geometria, 1983(1984):115–117

work page 1982

[48] [48]

Mostowski, T. (1985). Lipschitz equisingularity

work page 1985

[49] [49]

and Valette, G

Nguyen, N. and Valette, G. (2016). Lipschitz stratifications in o-minimal structures. InAnnales Scientifiques de l’´Ecole Normale Sup´ erieure, volume 49, pages 399–421

work page 2016

[50] [50]

Parusi´ nski, A. (1988). Lipschitz properties of semi-analytic sets. InAnnales de l’institut Fourier, volume 38, pages 189–213

work page 1988

[51] [51]

Parusi´ nski, A. (1994). Lipschitz stratification of subanalytic sets. InAnnales scientifiques de l’Ecole normale sup´ erieure, volume 27, pages 661–696

work page 1994

[52] [52]

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch

work page 2017

[53] [53]

Pauwels, E. (2023). Conservative parametric optimality and the ridge method for tame min-max problems.Set-Valued and Variational Analysis, 31(3):1–24

work page 2023

[54] [54]

Rios-Zertuche, R. (2022). Examples of pathological dynamics of the subgradient method for lipschitz path-differentiable functions.Mathematics of Operations Research, 47(4):3184–3206

work page 2022

[55] [55]

Rockafellar, R. T. and Wets, R. J.-B. (2009).Variational analysis, volume 317. Springer Science & Business Media

work page 2009

[56] [56]

and Schreuder, N

Schechtman, S. and Schreuder, N. (2025). The late-stage training dynamics of (stochastic) subgradient descent on homogeneous neural networks. In Haghtalab, N. and Moitra, A., editors,Proceedings of Thirty Eighth Conference on Learning Theory, volume 291 ofProceedings of Machine Learning Research, pages 5143–5172. PMLR

work page 2025

[57] [57]

Shamir, O. (2020). Can we find near-approximately-stationary points of nonsmooth nonconvex func- tions?arXiv preprint arXiv:2002.11962

work page arXiv 2020

[58] [58]

Shor, N. Z. (1964). On the structure of algorithms for numerical solution of problems of optimal planning and design.Diss. Doctor Philos. Kiev

work page 1964

[59] [59]

Tarski, A. (1951). A decision method for elementary algebra and geometry. InQuantifier elimination and cylindrical algebraic decomposition, pages 24–84. Springer

work page 1951

[60] [60]

Telgarsky, M. (2016). Benefits of depth in neural networks. InConference on learning theory, pages 1517–1539. PMLR

work page 2016

[61] [61]

(1998).Tame topology and o-minimal structures, volume 248

Van den Dries, L. (1998).Tame topology and o-minimal structures, volume 248. Cambridge university press

work page 1998

[62] [62]

and Miller, C

van den Dries, L. and Miller, C. (1996). Geometric categories and o-minimal structures.Duke Math. J., 85(1):497–540

work page 1996

[63] [63]

Xiao, N., Hu, X., Liu, X., and Toh, K.-C. (2023). Adam-family methods for nonsmooth optimization with convergence guarantees.arXiv preprint arXiv:2305.03938

work page arXiv 2023

[64] [64]

Zhang, J., Lin, H., Jegelka, S., Sra, S., and Jadbabaie, A. (2020). Complexity of finding stationary points of nonconvex nonsmooth functions. InInternational Conference on Machine Learning, pages 11173–11182. PMLR

work page 2020

[65] [65]

and Imani, Z

Zohrevand, A. and Imani, Z. (2022). An empirical study of the performance of different optimizers in the deep neural networks. In2022 International Conference on Machine Vision and Image Processing (MVIP), pages 1–5. IEEE. 48 E. Chzhen and S. Schechtman

work page 2022