pith. sign in

arxiv: 2507.10901 · v2 · pith:DGJTKORQnew · submitted 2025-07-15 · 🧮 math.OC

Non-smooth stochastic gradient descent using smoothing functions

Pith reviewed 2026-05-19 05:20 UTC · model grok-4.3

classification 🧮 math.OC
keywords smoothing functionsstochastic gradient descentnon-smooth optimizationcompositional optimizationconvergence ratesstochastic optimizationmachine learning
0
0 comments X

The pith

Smoothing stochastic gradient descent achieves a 1/T to the 1/4 rate for convex non-smooth compositional problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a smoothing stochastic gradient method for stochastic optimization problems that compose a non-smooth outer function with a smooth inner function. Smoothing functions that are continuously differentiable and approach the original non-smooth function as a parameter goes to zero are introduced, and this parameter is driven to zero at a chosen rate during iterations. The method establishes convergence guarantees under strongly convex, convex, and non-convex settings that match known rates for non-smooth stochastic compositional optimization. In the convex case the rate is 1 over T to the one fourth when measured by the number of stochastic gradient evaluations. A reader would care because the approach handles non-differentiable objectives that arise in machine learning and operations research while preserving theoretical efficiency.

Core claim

By approximating the non-smooth outer function with smoothing functions that are continuously differentiable and converge to the original as the smoothing parameter tends to zero (at the cost of higher Lipschitz constants), and by iteratively decreasing this parameter, the smoothing stochastic gradient method proves convergence rates that match those for non-smooth stochastic compositional optimization, including a 1/T^{1/4} rate in terms of stochastic gradient evaluations for convex objectives.

What carries the argument

Smoothing functions that are continuously differentiable, approach the original non-smooth function as the parameter tends to zero with increasing Lipschitz constants, together with stochastic gradient estimates that satisfy unbiasedness, bounded second moments, and accurate smoothing error bounds.

If this is right

  • The method applies directly to general compositional and finite-sum compositional problems that appear in large-scale machine learning and risk-averse optimization.
  • Convergence guarantees hold under strongly convex, convex, and non-convex assumptions and match existing rates for non-smooth stochastic compositional optimization.
  • For convex objectives the convergence rate is 1/T^{1/4} measured by the number of stochastic gradient evaluations.
  • Preliminary numerical results indicate competitiveness with existing methods on certain problem classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the required smoothing error bounds hold for common non-smooth functions such as the maximum or absolute value, the method could be tested on robust optimization tasks not explicitly covered in the paper.
  • The finite-sum compositional setting suggests possible use in distributed training where compositional risk measures appear.
  • A direct comparison against subgradient methods on standard non-smooth benchmarks would test whether the theoretical rate translates to practical speed-ups.

Load-bearing premise

The smoothing functions must be continuously differentiable and approach the original non-smooth function as the parameter tends to zero, while the stochastic gradient estimates satisfy unbiasedness, bounded second moments, and accurate smoothing error bounds.

What would settle it

A numerical test on a convex compositional problem where the observed error decreases slower than 1/T^{1/4} under the paper's prescribed smoothing-parameter schedule would falsify the claimed rate.

Figures

Figures reproduced from arXiv: 2507.10901 by Jingfu Tan, Luis Nunes Vicente, Tommaso Giovannelli.

Figure 1
Figure 1. Figure 1: Training and test loss curves for standard SGD vs. SSG method [PITH_FULL_IMAGE:figures/full_fig_p020_1.png] view at source ↗
read the original abstract

In this paper, we address stochastic optimization problems involving a composition of a non-smooth outer function and a smooth inner function, a formulation frequently encountered in machine learning and operations research. To deal with the non-differentiability of the outer function, we approximate the original non-smooth function using smoothing functions, which are continuously differentiable and approach the original function as a smoothing parameter goes to zero (at the price of increasingly higher Lipschitz constants). The proposed smoothing stochastic gradient method iteratively drives the smoothing parameter to zero at a designated rate. We establish convergence guarantees under strongly convex, convex, and non-convex settings, proving convergence rates that match known results for non-smooth stochastic compositional optimization. In particular, for convex objectives, smoothing stochastic gradient achieves a 1/T^(1/4) rate in terms of the number of stochastic gradient evaluations. We further show how general compositional and finite-sum compositional problems (widely used frameworks in large-scale machine learning and risk-averse optimization) fit the assumptions needed for the rates (unbiased gradient estimates, bounded second moments, and accurate smoothing errors). We present preliminary numerical results indicating that smoothing stochastic gradient descent can be competitive for certain classes of problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces a smoothing stochastic gradient method for stochastic optimization of compositions where the outer function is non-smooth and the inner function is smooth. Smoothing functions that are continuously differentiable and converge to the original non-smooth function (with growing Lipschitz constants) are used to approximate the objective; the smoothing parameter is driven to zero at a prescribed rate. Convergence guarantees and rates are established for strongly convex, convex, and non-convex cases under standard assumptions on unbiasedness, bounded second moments, and smoothing error; the convex case achieves a 1/T^{1/4} rate in stochastic gradient evaluations, matching known results for non-smooth stochastic compositional optimization. The paper also verifies that general compositional and finite-sum problems satisfy the required assumptions and presents preliminary numerical experiments.

Significance. If the rates hold under the stated assumptions, the work supplies a concrete algorithmic framework for non-smooth stochastic compositional problems that arise in machine learning and risk-averse optimization, with rates that align with the existing non-smooth literature. The explicit verification that standard compositional and finite-sum settings fit the assumptions is a useful contribution for practitioners.

major comments (1)
  1. [§4 and Assumption 3] §4 (convex case analysis) and the statement of Assumption 3 (bounded second moments): the 1/T^{1/4} rate is derived via a standard supermartingale argument that requires E[||g_t||^2] ≤ σ^2 uniformly in t. Because the smoothing functions are C^1 with Lipschitz constant L(ε) → ∞ as ε → 0 and the schedule drives ε_t → 0, the composed stochastic gradient norm can scale with L(ε_t). It is not shown that the uniform second-moment bound remains valid for the chosen decay rate of ε_t without an extra growth-control assumption; this is load-bearing for the claimed rate.
minor comments (2)
  1. [Abstract and §1] The abstract and introduction claim that the rates 'match known results for non-smooth stochastic compositional optimization'; a brief citation to the specific prior rate (e.g., the reference establishing the 1/T^{1/4} benchmark) would strengthen the comparison.
  2. [§2 and §4] Notation for the smoothing error bound (e.g., the term involving |f_ε(x) - f(x)|) is introduced in §2 but used without an explicit equation number in the rate proofs; adding a numbered display would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and valuable feedback. We address the single major comment below and indicate the revisions we will incorporate.

read point-by-point responses
  1. Referee: [§4 and Assumption 3] §4 (convex case analysis) and the statement of Assumption 3 (bounded second moments): the 1/T^{1/4} rate is derived via a standard supermartingale argument that requires E[||g_t||^2] ≤ σ^2 uniformly in t. Because the smoothing functions are C^1 with Lipschitz constant L(ε) → ∞ as ε → 0 and the schedule drives ε_t → 0, the composed stochastic gradient norm can scale with L(ε_t). It is not shown that the uniform second-moment bound remains valid for the chosen decay rate of ε_t without an extra growth-control assumption; this is load-bearing for the claimed rate.

    Authors: We thank the referee for this observation. Assumption 3 states a uniform bound on the second moments of the stochastic gradients g_t, which is used in the supermartingale argument of §4. In Section 5 we explicitly verify that the standard assumptions for general compositional problems and finite-sum compositional problems (bounded inner Jacobians, unbiased estimators, and controlled smoothing error) imply that E[||g_t||^2] remains bounded by a constant independent of t, even as ε_t → 0. The growth of L(ε_t) is offset by the fact that the inner mapping is evaluated at points whose gradients are controlled by the problem class; the specific schedule ε_t ∼ T^{-1/4} is chosen precisely so that the bias-variance trade-off yields the 1/T^{1/4} rate while preserving the uniform moment bound already verified for the motivating problem classes. To make this dependence fully transparent, we will insert a short clarifying paragraph immediately after the statement of Assumption 3 and before the convex-case theorem, referencing the verification in Section 5. revision: yes

Circularity Check

0 steps flagged

Derivation uses standard stochastic approximation on externally assumed smoothing bounds

full rationale

The paper establishes convergence rates for the smoothing stochastic gradient method by applying standard stochastic approximation arguments (telescoping sums or supermartingale inequalities) to the smoothed objective under stated assumptions on unbiased gradient estimates, bounded second moments, and smoothing error bounds. These assumptions are introduced as external conditions on the smoothing functions and stochastic oracles rather than being derived from or fitted to quantities defined inside the paper's own equations. The claimed 1/T^{1/4} rate for convex cases is presented as matching known results for non-smooth stochastic compositional optimization, with no load-bearing steps that reduce by construction to self-defined parameters, self-citations, or ansatzes imported from the authors' prior work. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard stochastic optimization assumptions plus the existence and properties of smoothing functions; no new entities are postulated and the only potential free parameter is the designated rate at which the smoothing parameter is driven to zero.

free parameters (1)
  • smoothing parameter decay rate
    The rate at which the smoothing parameter is iteratively driven to zero is designated by the method and directly affects the convergence analysis.
axioms (2)
  • domain assumption Stochastic gradient estimates are unbiased with bounded second moments
    Invoked to obtain the stated convergence rates under the smoothing approximation.
  • domain assumption Smoothing functions are continuously differentiable and converge to the original non-smooth function with controlled error as parameter tends to zero
    Core property used to justify replacing the non-smooth outer function while maintaining convergence.

pith-pipeline@v0.9.0 · 5737 in / 1385 out tokens · 50296 ms · 2026-05-19T05:20:54.101412+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 1 internal anchor

  1. [1]

    Ahmed, U

    S. Ahmed, U. C ¸ akmak, and A. Shapiro. Coherent risk measures in inventory problems. European J. Oper. Res., 182:226–238, 2007

  2. [2]

    Beck and M

    A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIIMS, 2:183–202, 2009

  3. [3]

    Bottou, F

    L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., 60:223–311, 2018

  4. [4]

    Chen and O

    C. Chen and O. L. Mangasarian. A class of smoothing functions for nonlinear and mixed complementarity problems. Comput. Optim. Appl. , 5:97–138, 1996. 21

  5. [5]

    T. Chen, Y. Sun, and W. Yin. Solving stochastic compositional optimization is nearly as easy as solving stochastic optimization. IEEE Trans. Signal Process., 69:4937–4948, 2021

  6. [6]

    X. Chen. Smoothing methods for nonsmooth, nonconvex minimization. Math. Program., 134:71–99, 2012

  7. [7]

    Chen and W

    X. Chen and W. Zhou. Smoothing nonlinear conjugate gradient method for image restora- tion using nonsmooth nonconvex minimization. SIAM J. Imaging Sci. , 3:765–790, 2010

  8. [8]

    C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: A survey and comparison. J. Mach. Learn. Res. , 15:809–883, 2014

  9. [9]

    Daubechies, M

    I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math. , 57:1413–1457, 2004

  10. [10]

    Defazio, F

    A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. NeurIPS, 27, 2014

  11. [11]

    D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory , 52:1289–1306, 2006

  12. [12]

    Duchi and Y

    J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. , 10:2899–2934, 2009

  13. [13]

    J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIOPT, 22:674–701, 2012

  14. [14]

    Ermoliev

    Y. Ermoliev. Stochastic programming methods, 1976

  15. [15]

    Ermoliev, M

    Y. Ermoliev, M. Yu, and R. B. Wets. Numerical Techniques for Stochastic Optimization . Springer-Verlag, Heidelberg, 1988

  16. [16]

    L. Evans. Measure Theory and Fine Properties of Functions . Routledge, Milton Park, Oxfordshire, UK, 2018

  17. [17]

    C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017

  18. [18]

    Garmanjani and L

    R. Garmanjani and L. N. Vicente. Smoothing and worst-case complexity for direct-search methods in nonsmooth optimization. IMA J. Numer. Anal. , 33:1008–1028, 2013

  19. [19]

    Giovannelli, G

    T. Giovannelli, G. Kent, and L. N. Vicente. Inexact bilevel stochastic gradient methods for constrained and unconstrained lower-level problems. ISE Technical Report 21T-025, Lehigh University, December 2022

  20. [20]

    Z. Huo, B. Gu, J. Liu, and H. Huang. Accelerated method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

  21. [21]

    Jalilzadeh, U

    A. Jalilzadeh, U. Shanbhag, J. Blanchet, and P. W. Glynn. Smoothed variable sample-size accelerated proximal methods for nonsmooth stochastic convex programs. Stoch. Syst., 12: 373–410, 2022. 22

  22. [22]

    Laguel, K

    Y. Laguel, K. Pillutla, J. Malick, and Z. Harchaoui. Superquantiles at work: Machine learning applications and efficient subgradient computation. Set-Valued Var. Anal. , 29: 967–996, 2021

  23. [23]

    T. Li, A. Beirami, M. Sanjabi, and V. Smith. On tilted losses in machine learning: Theory and applications. J. Mach. Learn. Res. , 24:1–79, 2023

  24. [24]

    X. Lian, M. Wang, and J. Liu. Finite-sum composition optimization via variance reduced gradient descent. In Artificial Intelligence and Statistics , pages 1159–1167. PMLR, 2017

  25. [25]

    Q. Lin, X. Chen, and J. Pena. A smoothing stochastic gradient method for composite optimization. Optim. Methods Softw. , 29:1281–1301, 2014

  26. [26]

    J. Liu, Y. Cui, and J. Pang. Solving nonsmooth and nonconvex compound stochastic programs with applications to risk measure minimization. Math. Oper. Res., 47:3051–3083, 2022

  27. [27]

    L. Liu, J. Liu, and D. Tao. Variance reduced methods for non-convex composition opti- mization. IEEE TPAMI, 44:5813–5825, 2021

  28. [28]

    J. Moreau. Proximit´ e et dualit´ e dans un espace hilbertien. Bulletin de la Soci´ et´ e math´ ematique de France, 93:273–299, 1965

  29. [29]

    Moulines and F

    E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. NeurIPS, 24, 2011

  30. [30]

    Nemirovski, A

    A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. , 19:1574–1609, 2009

  31. [31]

    A. S. Nemirovskij and D. B. Yudin. Problem complexity and method efficiency in opti- mization. SIAM Rev., 1983

  32. [32]

    Nesterov

    Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103:127–152, 2005

  33. [33]

    Nesterov

    Y. Nesterov. Introductory lectures on convex optimization: A basic course , volume 87. Springer Science & Business Media, Berlin, 2013

  34. [34]

    Nesterov and V

    Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. FoCM, 17:527–566, 2017

  35. [35]

    B. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Math. Math. Phys. , 3:864–878, 1963

  36. [36]

    B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Math. Math. Phys. , 4:1–17, 1964

  37. [37]

    B.T. Polyak. Introduction to Optimization. Optimization Software, New York, 1987

  38. [38]

    Robbins and S

    H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat. , pages 400–407, 1951. 23

  39. [39]

    R. T. Rockafellar and R. J. B. Wets. Variational Analysis. Springer, Berlin, 1998

  40. [40]

    Shapiro, D

    A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on Stochastic Programming: Mod- eling and Theory . SIAM, Philadelphia, PA, USA, 2021

  41. [41]

    N. Z. Shor. Minimization methods for non-differentiable functions , volume 3. Springer Science & Business Media, Berlin, 2012

  42. [42]

    Tibshirani

    R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol., 58:267–288, 1996

  43. [43]

    M. Wang, E. X. Fang, and H. Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program., 161:419–449, 2017

  44. [44]

    M. Wang, J. Liu, and E. X. Fang. Accelerating stochastic composition optimization. J. Mach. Learn. Res., 18:1–23, 2017

  45. [45]

    R. Wang, C. Zhang, L. Wang, and Y. Shao. A stochastic nesterov’s smoothing accelerated method for general nonsmooth constrained stochastic composite convex optimization. J. Sci. Comput., 93:52, 2022

  46. [46]

    Wolberg, O

    W. Wolberg, O. Mangasarian, N. Street, and W. Street. Breast Cancer Wisconsin (Diagnos- tic). UCI Machine Learning Repository, 1993. DOI: https://doi.org/10.24432/C5DW2B

  47. [47]

    L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. NeurIPS, 22, 2009

  48. [48]

    Xiao and T

    L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim , 24:2057–2075, 2014

  49. [49]

    Fast Stochastic Variance Reduced ADMM for Stochastic Composition Optimization

    Y. Yu and L. Huang. Fast stochastic variance reduced admm for stochastic composition optimization. arXiv preprint arXiv:1705.04138 , 2017

  50. [50]

    H. Yuan, X. Lian, and J. Liu. Stochastic recursive variance reduction for efficient smooth non-convex compositional optimization. arXiv preprint arXiv:1912.13515 , 2019

  51. [51]

    Zhang and X

    C. Zhang and X. Chen. Smoothing projected gradient method and its application to stochas- tic linear complementarity problems. SIOPT, 20:627–649, 2009

  52. [52]

    Zhang and L

    J. Zhang and L. Xiao. A composite randomized incremental gradient method. In ICML, pages 7454–7462. PMLR, 2019

  53. [53]

    Zhang and L

    J. Zhang and L. Xiao. A stochastic composite gradient method with incremental variance reduction. NeurIPS, 32, 2019. 24 A Assumption 3.8 for the case f (x) = ϕ(ψ(x)) In this part of the appendix, we want to show strong evidence, for the case f(x) = ϕ(ψ(x)), that there exists y ∈ B(x, E1µ) for some E1 > 0, such that ∥∇x ˜f(x, µ) − g(y)∥ ≤ E2µ for some E2 >...

  54. [54]

    and for sufficiently large T , we have E[∥xT +1 − x∗∥2] ≤ O G2 1G2 2 σ2 1 T 1−2a + L2 ˜ϕG6 1G2 2 σ4 1 T 2/3−4a + L2 ˜ϕG2 1Vψ σ2 1 T 2/3−2a ! = O 1 T 2/3−4a . Proof. First, from the update rule of Algorithm 4, we have ∥xk+1 − x∗∥2 = ∥xk − αk ˜ϕ′(yk+1, µk)∇xψ(xk, ξ2 k) − x∗∥2 = ∥xk − x∗∥2 − 2αk(xk − x∗)⊤ ˜ϕ′(yk+1, µk)∇xψ(xk, ξ2 k) + α2 k∥˜ϕ′(yk+1, µk)∇xψ(xk...