Non-smooth stochastic gradient descent using smoothing functions
Pith reviewed 2026-05-19 05:20 UTC · model grok-4.3
The pith
Smoothing stochastic gradient descent achieves a 1/T to the 1/4 rate for convex non-smooth compositional problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By approximating the non-smooth outer function with smoothing functions that are continuously differentiable and converge to the original as the smoothing parameter tends to zero (at the cost of higher Lipschitz constants), and by iteratively decreasing this parameter, the smoothing stochastic gradient method proves convergence rates that match those for non-smooth stochastic compositional optimization, including a 1/T^{1/4} rate in terms of stochastic gradient evaluations for convex objectives.
What carries the argument
Smoothing functions that are continuously differentiable, approach the original non-smooth function as the parameter tends to zero with increasing Lipschitz constants, together with stochastic gradient estimates that satisfy unbiasedness, bounded second moments, and accurate smoothing error bounds.
If this is right
- The method applies directly to general compositional and finite-sum compositional problems that appear in large-scale machine learning and risk-averse optimization.
- Convergence guarantees hold under strongly convex, convex, and non-convex assumptions and match existing rates for non-smooth stochastic compositional optimization.
- For convex objectives the convergence rate is 1/T^{1/4} measured by the number of stochastic gradient evaluations.
- Preliminary numerical results indicate competitiveness with existing methods on certain problem classes.
Where Pith is reading between the lines
- If the required smoothing error bounds hold for common non-smooth functions such as the maximum or absolute value, the method could be tested on robust optimization tasks not explicitly covered in the paper.
- The finite-sum compositional setting suggests possible use in distributed training where compositional risk measures appear.
- A direct comparison against subgradient methods on standard non-smooth benchmarks would test whether the theoretical rate translates to practical speed-ups.
Load-bearing premise
The smoothing functions must be continuously differentiable and approach the original non-smooth function as the parameter tends to zero, while the stochastic gradient estimates satisfy unbiasedness, bounded second moments, and accurate smoothing error bounds.
What would settle it
A numerical test on a convex compositional problem where the observed error decreases slower than 1/T^{1/4} under the paper's prescribed smoothing-parameter schedule would falsify the claimed rate.
Figures
read the original abstract
In this paper, we address stochastic optimization problems involving a composition of a non-smooth outer function and a smooth inner function, a formulation frequently encountered in machine learning and operations research. To deal with the non-differentiability of the outer function, we approximate the original non-smooth function using smoothing functions, which are continuously differentiable and approach the original function as a smoothing parameter goes to zero (at the price of increasingly higher Lipschitz constants). The proposed smoothing stochastic gradient method iteratively drives the smoothing parameter to zero at a designated rate. We establish convergence guarantees under strongly convex, convex, and non-convex settings, proving convergence rates that match known results for non-smooth stochastic compositional optimization. In particular, for convex objectives, smoothing stochastic gradient achieves a 1/T^(1/4) rate in terms of the number of stochastic gradient evaluations. We further show how general compositional and finite-sum compositional problems (widely used frameworks in large-scale machine learning and risk-averse optimization) fit the assumptions needed for the rates (unbiased gradient estimates, bounded second moments, and accurate smoothing errors). We present preliminary numerical results indicating that smoothing stochastic gradient descent can be competitive for certain classes of problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a smoothing stochastic gradient method for stochastic optimization of compositions where the outer function is non-smooth and the inner function is smooth. Smoothing functions that are continuously differentiable and converge to the original non-smooth function (with growing Lipschitz constants) are used to approximate the objective; the smoothing parameter is driven to zero at a prescribed rate. Convergence guarantees and rates are established for strongly convex, convex, and non-convex cases under standard assumptions on unbiasedness, bounded second moments, and smoothing error; the convex case achieves a 1/T^{1/4} rate in stochastic gradient evaluations, matching known results for non-smooth stochastic compositional optimization. The paper also verifies that general compositional and finite-sum problems satisfy the required assumptions and presents preliminary numerical experiments.
Significance. If the rates hold under the stated assumptions, the work supplies a concrete algorithmic framework for non-smooth stochastic compositional problems that arise in machine learning and risk-averse optimization, with rates that align with the existing non-smooth literature. The explicit verification that standard compositional and finite-sum settings fit the assumptions is a useful contribution for practitioners.
major comments (1)
- [§4 and Assumption 3] §4 (convex case analysis) and the statement of Assumption 3 (bounded second moments): the 1/T^{1/4} rate is derived via a standard supermartingale argument that requires E[||g_t||^2] ≤ σ^2 uniformly in t. Because the smoothing functions are C^1 with Lipschitz constant L(ε) → ∞ as ε → 0 and the schedule drives ε_t → 0, the composed stochastic gradient norm can scale with L(ε_t). It is not shown that the uniform second-moment bound remains valid for the chosen decay rate of ε_t without an extra growth-control assumption; this is load-bearing for the claimed rate.
minor comments (2)
- [Abstract and §1] The abstract and introduction claim that the rates 'match known results for non-smooth stochastic compositional optimization'; a brief citation to the specific prior rate (e.g., the reference establishing the 1/T^{1/4} benchmark) would strengthen the comparison.
- [§2 and §4] Notation for the smoothing error bound (e.g., the term involving |f_ε(x) - f(x)|) is introduced in §2 but used without an explicit equation number in the rate proofs; adding a numbered display would improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and valuable feedback. We address the single major comment below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: [§4 and Assumption 3] §4 (convex case analysis) and the statement of Assumption 3 (bounded second moments): the 1/T^{1/4} rate is derived via a standard supermartingale argument that requires E[||g_t||^2] ≤ σ^2 uniformly in t. Because the smoothing functions are C^1 with Lipschitz constant L(ε) → ∞ as ε → 0 and the schedule drives ε_t → 0, the composed stochastic gradient norm can scale with L(ε_t). It is not shown that the uniform second-moment bound remains valid for the chosen decay rate of ε_t without an extra growth-control assumption; this is load-bearing for the claimed rate.
Authors: We thank the referee for this observation. Assumption 3 states a uniform bound on the second moments of the stochastic gradients g_t, which is used in the supermartingale argument of §4. In Section 5 we explicitly verify that the standard assumptions for general compositional problems and finite-sum compositional problems (bounded inner Jacobians, unbiased estimators, and controlled smoothing error) imply that E[||g_t||^2] remains bounded by a constant independent of t, even as ε_t → 0. The growth of L(ε_t) is offset by the fact that the inner mapping is evaluated at points whose gradients are controlled by the problem class; the specific schedule ε_t ∼ T^{-1/4} is chosen precisely so that the bias-variance trade-off yields the 1/T^{1/4} rate while preserving the uniform moment bound already verified for the motivating problem classes. To make this dependence fully transparent, we will insert a short clarifying paragraph immediately after the statement of Assumption 3 and before the convex-case theorem, referencing the verification in Section 5. revision: yes
Circularity Check
Derivation uses standard stochastic approximation on externally assumed smoothing bounds
full rationale
The paper establishes convergence rates for the smoothing stochastic gradient method by applying standard stochastic approximation arguments (telescoping sums or supermartingale inequalities) to the smoothed objective under stated assumptions on unbiased gradient estimates, bounded second moments, and smoothing error bounds. These assumptions are introduced as external conditions on the smoothing functions and stochastic oracles rather than being derived from or fitted to quantities defined inside the paper's own equations. The claimed 1/T^{1/4} rate for convex cases is presented as matching known results for non-smooth stochastic compositional optimization, with no load-bearing steps that reduce by construction to self-defined parameters, self-citations, or ansatzes imported from the authors' prior work. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- smoothing parameter decay rate
axioms (2)
- domain assumption Stochastic gradient estimates are unbiased with bounded second moments
- domain assumption Smoothing functions are continuously differentiable and converge to the original non-smooth function with controlled error as parameter tends to zero
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
smoothing functions ... continuously differentiable and approach the original function as a smoothing parameter goes to zero (at the price of increasingly higher Lipschitz constants)
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
best convergence rate ... 1/T^{1/4} ... matches ... non-smooth stochastic compositional optimization
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIIMS, 2:183–202, 2009
work page 2009
- [3]
-
[4]
C. Chen and O. L. Mangasarian. A class of smoothing functions for nonlinear and mixed complementarity problems. Comput. Optim. Appl. , 5:97–138, 1996. 21
work page 1996
-
[5]
T. Chen, Y. Sun, and W. Yin. Solving stochastic compositional optimization is nearly as easy as solving stochastic optimization. IEEE Trans. Signal Process., 69:4937–4948, 2021
work page 2021
-
[6]
X. Chen. Smoothing methods for nonsmooth, nonconvex minimization. Math. Program., 134:71–99, 2012
work page 2012
-
[7]
X. Chen and W. Zhou. Smoothing nonlinear conjugate gradient method for image restora- tion using nonsmooth nonconvex minimization. SIAM J. Imaging Sci. , 3:765–790, 2010
work page 2010
-
[8]
C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: A survey and comparison. J. Mach. Learn. Res. , 15:809–883, 2014
work page 2014
-
[9]
I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math. , 57:1413–1457, 2004
work page 2004
-
[10]
A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. NeurIPS, 27, 2014
work page 2014
-
[11]
D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory , 52:1289–1306, 2006
work page 2006
-
[12]
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. , 10:2899–2934, 2009
work page 2009
-
[13]
J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIOPT, 22:674–701, 2012
work page 2012
- [14]
-
[15]
Y. Ermoliev, M. Yu, and R. B. Wets. Numerical Techniques for Stochastic Optimization . Springer-Verlag, Heidelberg, 1988
work page 1988
-
[16]
L. Evans. Measure Theory and Fine Properties of Functions . Routledge, Milton Park, Oxfordshire, UK, 2018
work page 2018
-
[17]
C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017
work page 2017
-
[18]
R. Garmanjani and L. N. Vicente. Smoothing and worst-case complexity for direct-search methods in nonsmooth optimization. IMA J. Numer. Anal. , 33:1008–1028, 2013
work page 2013
-
[19]
T. Giovannelli, G. Kent, and L. N. Vicente. Inexact bilevel stochastic gradient methods for constrained and unconstrained lower-level problems. ISE Technical Report 21T-025, Lehigh University, December 2022
work page 2022
-
[20]
Z. Huo, B. Gu, J. Liu, and H. Huang. Accelerated method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018
work page 2018
-
[21]
A. Jalilzadeh, U. Shanbhag, J. Blanchet, and P. W. Glynn. Smoothed variable sample-size accelerated proximal methods for nonsmooth stochastic convex programs. Stoch. Syst., 12: 373–410, 2022. 22
work page 2022
- [22]
-
[23]
T. Li, A. Beirami, M. Sanjabi, and V. Smith. On tilted losses in machine learning: Theory and applications. J. Mach. Learn. Res. , 24:1–79, 2023
work page 2023
-
[24]
X. Lian, M. Wang, and J. Liu. Finite-sum composition optimization via variance reduced gradient descent. In Artificial Intelligence and Statistics , pages 1159–1167. PMLR, 2017
work page 2017
-
[25]
Q. Lin, X. Chen, and J. Pena. A smoothing stochastic gradient method for composite optimization. Optim. Methods Softw. , 29:1281–1301, 2014
work page 2014
-
[26]
J. Liu, Y. Cui, and J. Pang. Solving nonsmooth and nonconvex compound stochastic programs with applications to risk measure minimization. Math. Oper. Res., 47:3051–3083, 2022
work page 2022
-
[27]
L. Liu, J. Liu, and D. Tao. Variance reduced methods for non-convex composition opti- mization. IEEE TPAMI, 44:5813–5825, 2021
work page 2021
-
[28]
J. Moreau. Proximit´ e et dualit´ e dans un espace hilbertien. Bulletin de la Soci´ et´ e math´ ematique de France, 93:273–299, 1965
work page 1965
-
[29]
E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. NeurIPS, 24, 2011
work page 2011
-
[30]
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. , 19:1574–1609, 2009
work page 2009
-
[31]
A. S. Nemirovskij and D. B. Yudin. Problem complexity and method efficiency in opti- mization. SIAM Rev., 1983
work page 1983
- [32]
- [33]
-
[34]
Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. FoCM, 17:527–566, 2017
work page 2017
-
[35]
B. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Math. Math. Phys. , 3:864–878, 1963
work page 1963
-
[36]
B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Math. Math. Phys. , 4:1–17, 1964
work page 1964
-
[37]
B.T. Polyak. Introduction to Optimization. Optimization Software, New York, 1987
work page 1987
-
[38]
H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat. , pages 400–407, 1951. 23
work page 1951
-
[39]
R. T. Rockafellar and R. J. B. Wets. Variational Analysis. Springer, Berlin, 1998
work page 1998
-
[40]
A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on Stochastic Programming: Mod- eling and Theory . SIAM, Philadelphia, PA, USA, 2021
work page 2021
-
[41]
N. Z. Shor. Minimization methods for non-differentiable functions , volume 3. Springer Science & Business Media, Berlin, 2012
work page 2012
-
[42]
R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol., 58:267–288, 1996
work page 1996
-
[43]
M. Wang, E. X. Fang, and H. Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program., 161:419–449, 2017
work page 2017
-
[44]
M. Wang, J. Liu, and E. X. Fang. Accelerating stochastic composition optimization. J. Mach. Learn. Res., 18:1–23, 2017
work page 2017
-
[45]
R. Wang, C. Zhang, L. Wang, and Y. Shao. A stochastic nesterov’s smoothing accelerated method for general nonsmooth constrained stochastic composite convex optimization. J. Sci. Comput., 93:52, 2022
work page 2022
-
[46]
W. Wolberg, O. Mangasarian, N. Street, and W. Street. Breast Cancer Wisconsin (Diagnos- tic). UCI Machine Learning Repository, 1993. DOI: https://doi.org/10.24432/C5DW2B
-
[47]
L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. NeurIPS, 22, 2009
work page 2009
-
[48]
L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim , 24:2057–2075, 2014
work page 2057
-
[49]
Fast Stochastic Variance Reduced ADMM for Stochastic Composition Optimization
Y. Yu and L. Huang. Fast stochastic variance reduced admm for stochastic composition optimization. arXiv preprint arXiv:1705.04138 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
- [50]
-
[51]
C. Zhang and X. Chen. Smoothing projected gradient method and its application to stochas- tic linear complementarity problems. SIOPT, 20:627–649, 2009
work page 2009
-
[52]
J. Zhang and L. Xiao. A composite randomized incremental gradient method. In ICML, pages 7454–7462. PMLR, 2019
work page 2019
-
[53]
J. Zhang and L. Xiao. A stochastic composite gradient method with incremental variance reduction. NeurIPS, 32, 2019. 24 A Assumption 3.8 for the case f (x) = ϕ(ψ(x)) In this part of the appendix, we want to show strong evidence, for the case f(x) = ϕ(ψ(x)), that there exists y ∈ B(x, E1µ) for some E1 > 0, such that ∥∇x ˜f(x, µ) − g(y)∥ ≤ E2µ for some E2 >...
work page 2019
-
[54]
and for sufficiently large T , we have E[∥xT +1 − x∗∥2] ≤ O G2 1G2 2 σ2 1 T 1−2a + L2 ˜ϕG6 1G2 2 σ4 1 T 2/3−4a + L2 ˜ϕG2 1Vψ σ2 1 T 2/3−2a ! = O 1 T 2/3−4a . Proof. First, from the update rule of Algorithm 4, we have ∥xk+1 − x∗∥2 = ∥xk − αk ˜ϕ′(yk+1, µk)∇xψ(xk, ξ2 k) − x∗∥2 = ∥xk − x∗∥2 − 2αk(xk − x∗)⊤ ˜ϕ′(yk+1, µk)∇xψ(xk, ξ2 k) + α2 k∥˜ϕ′(yk+1, µk)∇xψ(xk...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.