On convergence rates of subgradient descent on semialgebraic functions
Pith reviewed 2026-05-10 06:31 UTC · model grok-4.3
The pith
Subgradient descent converges at explicit rates for semialgebraic functions by shadowing Riemannian descent on local strata.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the assumption that the objective admits a stratification into smooth manifolds on which a global projection formula for Clarke subgradients holds and quantitative curvature bounds are satisfied, the iterates of the subgradient method locally shadow a Riemannian gradient descent on nearby strata. This shadowing is used to measure stationarity, and by introducing a selection rule for the active stratum the authors assemble local descent inequalities across successive strata into explicit convergence rates expressed in terms of the number of dimensions present in the stratification. These assumptions follow from the existence of Lipschitz stratifications of semialgebraic sets and are因此s,
What carries the argument
The local shadowing of subgradient iterates by Riemannian gradient descent on strata, which permits stationarity measurement and the assembly of descent inequalities across strata via a selection rule.
If this is right
- Convergence rates improve as the number of strata decreases.
- Classical smooth-case rates are recovered up to constants when the stratification consists of a single stratum.
- The same framework extends to decreasing step sizes and recovers sequential convergence.
- The stated geometric conditions hold for all semialgebraic functions and for functions definable in polynomially bounded o-minimal structures.
Where Pith is reading between the lines
- The shadowing mechanism could be applied to analyze other first-order methods on stratified domains.
- Direct numerical verification of the rates is possible on elementary semialgebraic examples such as the maximum of finitely many linear functions.
- The tubular neighborhood estimates derived as an intermediate step may be reusable in other problems involving Lipschitz stratifications.
Load-bearing premise
The objective function's domain must admit a partition into smooth manifolds on which Clarke subgradients satisfy a global projection formula and quantitative curvature bounds hold.
What would settle it
A concrete semialgebraic function on which the subgradient iterates fail to remain close to Riemannian gradient steps on the strata or violate the predicted dimension-dependent rate.
Figures
read the original abstract
We analyze the constant step size subgradient method on nonsmooth, nonconvex functions. We identify geometric assumptions on the objective function under which i) its domain admits a partition (stratification) into smooth manifolds (strata) on which the function is smooth; ii) a global projection formula for Clarke subgradients holds; and iii) quantitative curvature bounds hold on each stratum. Under these conditions, we prove that the iterates of the subgradient method locally shadow a Riemannian gradient descent on nearby strata, which we use to measure stationarity. We introduce a selection rule for the active stratum and develop a mechanism that assembles local descent inequalities across successive strata into explicit convergence rates. These rates are expressed in terms of the number of dimensions present in the stratification, improve as the number of strata decreases, and recover, up to constants, the classical rates in the smooth case. We show that the stated assumptions follow from the existence of Lipschitz stratifications of semialgebraic sets, and are therefore automatically satisfied for semialgebraic functions and, more generally, for functions definable in polynomially bounded o-minimal structures, yielding the first known convergence rates in these settings. As intermediate results of independent interest, we establish tubular neighborhood estimates for Lipschitz stratifications and a global projection formula for Clarke subgradients. Finally, we show that our framework extends to decreasing step size and recovers, via an alternative argument, the recently announced result of Lai and Song on sequential convergence of the subgradient method with step sizes 1/k.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes constant-step subgradient descent on nonsmooth nonconvex functions. It posits geometric assumptions (Lipschitz stratification of the domain into smooth strata on which the objective is smooth, a global projection formula for Clarke subgradients, and quantitative curvature bounds on each stratum) under which subgradient iterates locally shadow Riemannian gradient descent on nearby strata. A stratum-selection rule then assembles local descent inequalities into explicit convergence rates expressed in terms of the stratification's dimension count and number of strata; these recover (up to constants) the classical smooth rates when the stratification is trivial. The assumptions are shown to follow from the existence of Lipschitz stratifications, hence hold automatically for semialgebraic functions and functions definable in polynomially bounded o-minimal structures. Intermediate results include tubular-neighborhood estimates for Lipschitz stratifications and the global projection formula. The framework is also extended to decreasing step sizes, recovering a recent sequential-convergence result of Lai and Song.
Significance. If the geometric assumptions are verified, the work supplies the first explicit convergence rates for subgradient methods on semialgebraic objectives, a broad class containing many practical nonsmooth nonconvex problems. The intermediate results on tubular neighborhoods and Clarke-subgradient projection are of independent interest. The rates improve with fewer strata and recover the smooth case, providing a concrete bridge between smooth and nonsmooth regimes.
major comments (2)
- [§4] §4 (derivation that Lipschitz stratifications imply the three geometric assumptions): the argument that quantitative curvature bounds hold uniformly on each stratum is load-bearing for the single-step-size shadowing argument and the subsequent assembly of descent inequalities. The Lipschitz condition controls continuity of tangent planes but does not automatically bound the second fundamental form uniformly on the whole stratum; the cusp example (graph of y = x^{3/2} near the origin) shows curvature can diverge while still admitting a Lipschitz stratification. The manuscript must exhibit an explicit uniform bound derived from the stratification data or restrict the strata on which the bounds are invoked.
- [§5.2] §5.2 (assembly of local descent inequalities across strata): the global rate depends on the number of strata and their dimensions, yet the proof sketch does not quantify how the constants in the tubular-neighborhood estimates and curvature bounds accumulate when the active stratum changes; without this, the claimed improvement over existing rates cannot be verified as sharp.
minor comments (2)
- [Definition 5.1] The notation for the active-stratum selection rule (Definition 5.1) would benefit from an accompanying diagram illustrating the projection onto nearby strata.
- [Introduction] A short remark comparing the obtained rates with the classical O(1/k) rate for smooth convex functions would help readers situate the result.
Simulated Author's Rebuttal
We thank the referee for the thorough reading and insightful comments, which have helped us identify areas for clarification and strengthening. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
read point-by-point responses
-
Referee: [§4] §4 (derivation that Lipschitz stratifications imply the three geometric assumptions): the argument that quantitative curvature bounds hold uniformly on each stratum is load-bearing for the single-step-size shadowing argument and the subsequent assembly of descent inequalities. The Lipschitz condition controls continuity of tangent planes but does not automatically bound the second fundamental form uniformly on the whole stratum; the cusp example (graph of y = x^{3/2} near the origin) shows curvature can diverge while still admitting a Lipschitz stratification. The manuscript must exhibit an explicit uniform bound derived from the stratification data or restrict the strata on which the bounds are invoked.
Authors: We appreciate this observation, which correctly identifies that the passage from Lipschitz stratification to uniform curvature bounds requires an explicit derivation rather than an implicit appeal to the definition. In the original §4 we invoked standard results from the theory of Lipschitz stratifications in o-minimal structures to conclude that the second fundamental form remains bounded on each stratum; however, the referee is right that this step is not spelled out with quantitative constants. In the revision we will insert a self-contained lemma (new Lemma 4.3) that extracts an explicit uniform bound on the second fundamental form directly from the Lipschitz constant of the stratification and the definable tameness of the strata. The cusp example does not contradict the claim because, under the precise definition of Lipschitz stratification used in the paper (which requires the strata to be definable in a polynomially bounded o-minimal structure), the curvature remains controlled away from the origin; the singularity is isolated in a lower-dimensional stratum where the bound is applied only locally. We will also add a short remark clarifying that the bound is invoked only on the strata visited by the iterates after a finite number of steps, thereby restricting the domain on which uniformity is required. revision: yes
-
Referee: [§5.2] §5.2 (assembly of local descent inequalities across strata): the global rate depends on the number of strata and their dimensions, yet the proof sketch does not quantify how the constants in the tubular-neighborhood estimates and curvature bounds accumulate when the active stratum changes; without this, the claimed improvement over existing rates cannot be verified as sharp.
Authors: We agree that a fully rigorous verification of the rate improvement requires tracking the dependence of all constants on the stratification data. The original proof sketch in §5.2 assembles the local descent inequalities via a stratum-selection rule but leaves the accumulation of tubular-neighborhood radii and curvature constants implicit. In the revised version we will expand the argument to derive an explicit global constant C that depends only on the total number of strata N, the maximum dimension d_max, and the Lipschitz constant of the stratification. This will be achieved by bounding the number of stratum transitions (at most N) and summing the local constants with a factor that grows at most linearly in N. The resulting rate will then be stated with the precise dependence on (N, d_max), allowing direct comparison with existing subgradient rates and confirming the improvement when N is small. A new corollary will illustrate the recovery of the smooth-case rate (up to a multiplicative factor depending only on N) when the stratification is trivial. revision: yes
Circularity Check
No circularity: rates derived from external stratification properties
full rationale
The paper states geometric assumptions (stratification into smooth manifolds, global Clarke projection formula, quantitative curvature bounds) and derives local shadowing of Riemannian gradient descent plus assembled convergence rates from those assumptions via explicit inequalities. It then claims the assumptions hold automatically for semialgebraic functions because they follow from the existence of Lipschitz stratifications, which is presented as a known external fact about semialgebraic sets rather than a self-derived or fitted result. No equation reduces to an input by construction, no parameter is fitted then renamed as prediction, and no load-bearing step rests on a self-citation chain. The derivation chain is therefore self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Semialgebraic sets admit Lipschitz stratifications
- domain assumption Global projection formula for Clarke subgradients holds on the stratification
Reference graph
Works this paper leans on
-
[1]
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Man´ e, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Tal...
work page 2015
-
[2]
Absil, P.-A., Mahony, R., and Andrews, B. (2005). Convergence of the iterates of descent methods for analytic cost functions.SIAM Journal on Optimization, 16(2):531–547
work page 2005
-
[3]
Ambrosio, L. and Mantegazza, C. (1998). Curvature and distance function from a manifold.The Journal of Geometric Analysis, 8(5):723–748. E. Chzhen and S. Schechtman 45 On convergence rates of subgradient descent
work page 1998
-
[4]
Ambrosio, L. and Soner, H. M. (1996). Level set approach to mean curvature flow in arbitrary codimen- sion.Journal of differential geometry, 43(4):693–737
work page 1996
-
[5]
Attouch, H., Bolte, J., and Svaiter, B. F. (2013). Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized gauss–seidel methods. Mathematical programming, 137(1):91–129
work page 2013
- [6]
-
[7]
Bena¨ ım, M., Hofbauer, J., and Sorin, S. (2005). Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348
work page 2005
-
[8]
Bianchi, P., Hachem, W., and Schechtman, S. (2022). Convergence of constant step stochastic gradient descent for non-smooth non-convex functions.Set-Valued and Variational Analysis, 30(3):1117–1147
work page 2022
-
[9]
Bianchi, P., Hachem, W., and Schechtman, S. (2024). Stochastic subgradient descent escapes active strict saddles on weakly convex functions.Mathematics of Operations Research, 49(3):1761–1790
work page 2024
-
[10]
Bierstone, E. and Milman, P. D. (1988). Semianalytic and subanalytic sets.Publications Math´ ematiques de l’IH´ES, 67:5–42
work page 1988
-
[11]
Bolte, J., Daniilidis, A., and Lewis, A. (2006). A nonsmooth morse–sard theorem for subanalytic functions.Journal of mathematical analysis and applications, 321(2):729–740
work page 2006
-
[12]
Bolte, J., Daniilidis, A., Lewis, A., and Shiota, M. (2007). Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572
work page 2007
-
[13]
Bolte, J., Le, T., Moulines,´E., and Pauwels, E. (2025). Inexact subgradient methods for semialgebraic functions: J. bolte et al.Mathematical Programming, pages 1–27
work page 2025
-
[14]
Bolte, J., Le, T., and Pauwels, E. (2023a). Subgradient sampling for nonsmooth nonconvex minimiza- tion.SIAM Journal on Optimization, 33(4):2542–2569
-
[15]
Bolte, J. and Pauwels, E. (2021). Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning.Mathematical Programming, 188:19–51
work page 2021
-
[16]
Bolte, J., Pauwels, E., and Silveti-Falls, A. (2024). Differentiating nonsmooth solutions to parametric monotone inclusion problems.SIAM Journal on Optimization, 34(1):71–97
work page 2024
- [17]
-
[18]
Borkar, V. S. and Borkar, V. S. (2008).Stochastic approximation: a dynamical systems viewpoint, volume 100. Springer
work page 2008
-
[19]
(2023).An introduction to optimization on smooth manifolds
Boumal, N. (2023).An introduction to optimization on smooth manifolds. Cambridge University Press
work page 2023
-
[20]
Clarke, F. H., Ledyaev, Y. S., Stern, R. J., and Wolenski, P. R. (1998).Nonsmooth analysis and control theory, volume 178. Springer-Verlag, New York
work page 1998
-
[21]
(2000).An introduction to o-minimal geometry
Coste, M. (2000).An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali Pisa
work page 2000
-
[22]
Davis, D. and Drusvyatskiy, D. (2019). Stochastic model-based minimization of weakly convex func- tions.SIAM Journal on Optimization, 29(1):207–239
work page 2019
-
[23]
Davis, D., Drusvyatskiy, D., and Jiang, L. (2025). Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization.Foundations of Computational Mathematics, pages 1–83
work page 2025
-
[24]
Davis, D., Drusvyatskiy, D., Kakade, S., andLee, J.D.(2020). Stochasticsubgradientmethodconverges on tame functions.Foundations of computational mathematics, 20(1):119–154. 46 E. Chzhen and S. Schechtman On convergence rates of subgradient descent
work page 2020
-
[25]
T., Padmanabhan, S., and Ye, G
Davis, D., Drusvyatskiy, D., Lee, Y. T., Padmanabhan, S., and Ye, G. (2022). A gradient sampling method with complexity guarantees for lipschitz functions in high and low dimensions.Advances in neural information processing systems, 35:6692–6703
work page 2022
-
[26]
Drusvyatskiy, D., Ioffe, A. D., and Lewis, A. S. (2015). Curves of descent.SIAM Journal on Control and Optimization, 53(1):114–138
work page 2015
-
[27]
Duchi, J. C. and Ruan, F. (2018). Stochastic methods for composite and weakly convex optimization problems.SIAM Journal on Optimization, 28(4):3229–3259
work page 2018
-
[28]
Dudek, E. and Holly, K. (1994). Nonlinear orthogonal projection. InAnnales Polonici Mathematici, volume 59, pages 1–31. Polska Akademia Nauk. Instytut Matematyczny PAN
work page 1994
-
[29]
Ermol’ev, Y. M. and Norkin, V. (1998). Stochastic generalized gradient method for nonconvex nons- mooth stochastic optimization.Cybernetics and Systems Analysis, 34(2):196–215
work page 1998
-
[30]
Federer, H. (1959). Curvature measures.Transactions of the American Mathematical Society, 93(3):418–491
work page 1959
-
[31]
Gabrielov, A. (1996). Complements of subanalytic sets and existential formulas for analytic functions. Inventiones mathematicae, 125(1):1–12
work page 1996
-
[32]
Gabrielov, A. M. (1968). Projections of semi-analytic sets.Functional Analysis and its applications, 2(4):282–291
work page 1968
-
[33]
Goldstein, A. A. (1977). Optimization of lipschitz continuous functions.Mathematical Programming, 13(1):14–22
work page 1977
-
[34]
Halupczok, I. and Yin, Y. (2018). Lipschitz stratifications in power-boundedo-minimal fields.Journal of the European Mathematical Society, 20(11):2717–2767
work page 2018
-
[35]
Ioffe, A. D. (2009). An invitation to tame optimization.SIAM Journal on Optimization, 19(4):1894– 1917
work page 2009
-
[36]
I., Lin, T., and Zampetakis, M
Jordan, M. I., Lin, T., and Zampetakis, M. (2022). On the complexity of deterministic nonsmooth and nonconvex optimization.arXiv preprint arXiv:2209.12463
-
[37]
Josz, C. and Lai, L. (2024a). Global stability of first-order methods for coercive tame functions. Mathematical Programming, 207(1):551–576
-
[38]
Josz, C. and Lai, L. (2024b). Sufficient conditions for instability of the subgradient method with constant step size.SIAM Journal on Optimization, 34(1):57–70
-
[39]
Kong, S. and Lewis, A. S. (2025). Lipschitz minimization and the goldstein modulus: S. kong, as lewis. Mathematical Programming, pages 1–30
work page 2025
-
[40]
Kurdyka, K. (1998). On gradients of functions definable in o-minimal structures. InAnnales de l’institut Fourier, volume 48, pages 769–783
work page 1998
-
[41]
Lafontaine, J. et al. (2015).An introduction to differential manifolds. Springer
work page 2015
-
[42]
On the diameter of subgradient sequences in o-minimal structures
Lai, L. and Song, M. (2025). On the diameter of subgradient sequences in o-minimal structures.arXiv preprint arXiv:2511.06868
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [43]
-
[44]
Lee, J. M. (2022).Manifolds and differential geometry, volume 107. American Mathematical Society
work page 2022
-
[45]
Leobacher, G. and Steinicke, A. (2021). Existence, uniqueness and regularity of the projection onto differentiable manifolds.Annals of global analysis and geometry, 60(3):559–587
work page 2021
-
[46]
Lewis, A. S. and Tian, T. (2021). The structure of conservative gradient fields.SIAM Journal on Optimization, 31(3):2080–2083. E. Chzhen and S. Schechtman 47 On convergence rates of subgradient descent
work page 2021
-
[47]
Surlestrajectoiresdugradientd’unefonctionanalytique.Seminari di geometria, 1983(1984):115–117
Lojasiewicz, S.(1982). Surlestrajectoiresdugradientd’unefonctionanalytique.Seminari di geometria, 1983(1984):115–117
work page 1982
-
[48]
Mostowski, T. (1985). Lipschitz equisingularity
work page 1985
-
[49]
Nguyen, N. and Valette, G. (2016). Lipschitz stratifications in o-minimal structures. InAnnales Scientifiques de l’´Ecole Normale Sup´ erieure, volume 49, pages 399–421
work page 2016
-
[50]
Parusi´ nski, A. (1988). Lipschitz properties of semi-analytic sets. InAnnales de l’institut Fourier, volume 38, pages 189–213
work page 1988
-
[51]
Parusi´ nski, A. (1994). Lipschitz stratification of subanalytic sets. InAnnales scientifiques de l’Ecole normale sup´ erieure, volume 27, pages 661–696
work page 1994
-
[52]
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch
work page 2017
-
[53]
Pauwels, E. (2023). Conservative parametric optimality and the ridge method for tame min-max problems.Set-Valued and Variational Analysis, 31(3):1–24
work page 2023
-
[54]
Rios-Zertuche, R. (2022). Examples of pathological dynamics of the subgradient method for lipschitz path-differentiable functions.Mathematics of Operations Research, 47(4):3184–3206
work page 2022
-
[55]
Rockafellar, R. T. and Wets, R. J.-B. (2009).Variational analysis, volume 317. Springer Science & Business Media
work page 2009
-
[56]
Schechtman, S. and Schreuder, N. (2025). The late-stage training dynamics of (stochastic) subgradient descent on homogeneous neural networks. In Haghtalab, N. and Moitra, A., editors,Proceedings of Thirty Eighth Conference on Learning Theory, volume 291 ofProceedings of Machine Learning Research, pages 5143–5172. PMLR
work page 2025
- [57]
-
[58]
Shor, N. Z. (1964). On the structure of algorithms for numerical solution of problems of optimal planning and design.Diss. Doctor Philos. Kiev
work page 1964
-
[59]
Tarski, A. (1951). A decision method for elementary algebra and geometry. InQuantifier elimination and cylindrical algebraic decomposition, pages 24–84. Springer
work page 1951
-
[60]
Telgarsky, M. (2016). Benefits of depth in neural networks. InConference on learning theory, pages 1517–1539. PMLR
work page 2016
-
[61]
(1998).Tame topology and o-minimal structures, volume 248
Van den Dries, L. (1998).Tame topology and o-minimal structures, volume 248. Cambridge university press
work page 1998
-
[62]
van den Dries, L. and Miller, C. (1996). Geometric categories and o-minimal structures.Duke Math. J., 85(1):497–540
work page 1996
- [63]
-
[64]
Zhang, J., Lin, H., Jegelka, S., Sra, S., and Jadbabaie, A. (2020). Complexity of finding stationary points of nonconvex nonsmooth functions. InInternational Conference on Machine Learning, pages 11173–11182. PMLR
work page 2020
-
[65]
Zohrevand, A. and Imani, Z. (2022). An empirical study of the performance of different optimizers in the deep neural networks. In2022 International Conference on Machine Vision and Image Processing (MVIP), pages 1–5. IEEE. 48 E. Chzhen and S. Schechtman
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.