Non-smooth stochastic gradient descent using smoothing functions

Jingfu Tan; Luis Nunes Vicente; Tommaso Giovannelli

arxiv: 2507.10901 · v2 · pith:DGJTKORQnew · submitted 2025-07-15 · 🧮 math.OC

Non-smooth stochastic gradient descent using smoothing functions

Tommaso Giovannelli , Jingfu Tan , Luis Nunes Vicente This is my paper

Pith reviewed 2026-05-19 05:20 UTC · model grok-4.3

classification 🧮 math.OC

keywords smoothing functionsstochastic gradient descentnon-smooth optimizationcompositional optimizationconvergence ratesstochastic optimizationmachine learning

0 comments

The pith

Smoothing stochastic gradient descent achieves a 1/T to the 1/4 rate for convex non-smooth compositional problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a smoothing stochastic gradient method for stochastic optimization problems that compose a non-smooth outer function with a smooth inner function. Smoothing functions that are continuously differentiable and approach the original non-smooth function as a parameter goes to zero are introduced, and this parameter is driven to zero at a chosen rate during iterations. The method establishes convergence guarantees under strongly convex, convex, and non-convex settings that match known rates for non-smooth stochastic compositional optimization. In the convex case the rate is 1 over T to the one fourth when measured by the number of stochastic gradient evaluations. A reader would care because the approach handles non-differentiable objectives that arise in machine learning and operations research while preserving theoretical efficiency.

Core claim

By approximating the non-smooth outer function with smoothing functions that are continuously differentiable and converge to the original as the smoothing parameter tends to zero (at the cost of higher Lipschitz constants), and by iteratively decreasing this parameter, the smoothing stochastic gradient method proves convergence rates that match those for non-smooth stochastic compositional optimization, including a 1/T^{1/4} rate in terms of stochastic gradient evaluations for convex objectives.

What carries the argument

Smoothing functions that are continuously differentiable, approach the original non-smooth function as the parameter tends to zero with increasing Lipschitz constants, together with stochastic gradient estimates that satisfy unbiasedness, bounded second moments, and accurate smoothing error bounds.

If this is right

The method applies directly to general compositional and finite-sum compositional problems that appear in large-scale machine learning and risk-averse optimization.
Convergence guarantees hold under strongly convex, convex, and non-convex assumptions and match existing rates for non-smooth stochastic compositional optimization.
For convex objectives the convergence rate is 1/T^{1/4} measured by the number of stochastic gradient evaluations.
Preliminary numerical results indicate competitiveness with existing methods on certain problem classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the required smoothing error bounds hold for common non-smooth functions such as the maximum or absolute value, the method could be tested on robust optimization tasks not explicitly covered in the paper.
The finite-sum compositional setting suggests possible use in distributed training where compositional risk measures appear.
A direct comparison against subgradient methods on standard non-smooth benchmarks would test whether the theoretical rate translates to practical speed-ups.

Load-bearing premise

The smoothing functions must be continuously differentiable and approach the original non-smooth function as the parameter tends to zero, while the stochastic gradient estimates satisfy unbiasedness, bounded second moments, and accurate smoothing error bounds.

What would settle it

A numerical test on a convex compositional problem where the observed error decreases slower than 1/T^{1/4} under the paper's prescribed smoothing-parameter schedule would falsify the claimed rate.

Figures

Figures reproduced from arXiv: 2507.10901 by Jingfu Tan, Luis Nunes Vicente, Tommaso Giovannelli.

read the original abstract

In this paper, we address stochastic optimization problems involving a composition of a non-smooth outer function and a smooth inner function, a formulation frequently encountered in machine learning and operations research. To deal with the non-differentiability of the outer function, we approximate the original non-smooth function using smoothing functions, which are continuously differentiable and approach the original function as a smoothing parameter goes to zero (at the price of increasingly higher Lipschitz constants). The proposed smoothing stochastic gradient method iteratively drives the smoothing parameter to zero at a designated rate. We establish convergence guarantees under strongly convex, convex, and non-convex settings, proving convergence rates that match known results for non-smooth stochastic compositional optimization. In particular, for convex objectives, smoothing stochastic gradient achieves a 1/T^(1/4) rate in terms of the number of stochastic gradient evaluations. We further show how general compositional and finite-sum compositional problems (widely used frameworks in large-scale machine learning and risk-averse optimization) fit the assumptions needed for the rates (unbiased gradient estimates, bounded second moments, and accurate smoothing errors). We present preliminary numerical results indicating that smoothing stochastic gradient descent can be competitive for certain classes of problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Smoothing SGD for non-smooth outer/smooth inner compositions gets the expected 1/T^{1/4} convex rate by driving the parameter to zero, but the uniform second-moment bound needs checking against growing Lipschitz constants.

read the letter

The paper's main contribution is a smoothing stochastic gradient descent method tailored to stochastic optimization problems with a non-smooth outer function and a smooth inner function. By driving the smoothing parameter to zero at a specific rate, they achieve convergence rates that align with those for non-smooth stochastic compositional optimization, including a 1/T^{1/4} rate for convex objectives in terms of stochastic gradient evaluations. What stands out is how they connect this to broader frameworks like general compositional and finite-sum problems used in large-scale machine learning and risk-averse optimization. They verify that these fit the assumptions of unbiased estimates, bounded second moments, and accurate smoothing error bounds. The coverage of strongly convex, convex, and non-convex cases is comprehensive for this type of work. The approach is incremental rather than revolutionary, building on existing smoothing techniques and stochastic approximation arguments. The preliminary numerical results suggest competitiveness but are too early to draw strong conclusions from. A potential soft spot lies in maintaining the bounded second-moment assumption as the smoothing parameter decreases. Since the smoothing functions have Lipschitz constants that grow as the parameter approaches zero, the composed gradients might have norms that increase accordingly. If this growth isn't properly controlled by the decay schedule, the standard arguments for the 1/T^{1/4} rate could require additional justification. The abstract presents the bounds as holding, but seeing the details in the proofs would clarify this. This work is for researchers focused on stochastic optimization in operations research and machine learning, particularly those dealing with non-smooth compositions. Readers interested in practical algorithms backed by theory for these settings would find the rates and assumption mappings useful. I recommend sending it for peer review. The theoretical claims are worth a detailed check by experts in the area.

Referee Report

1 major / 2 minor

Summary. The paper introduces a smoothing stochastic gradient method for stochastic optimization of compositions where the outer function is non-smooth and the inner function is smooth. Smoothing functions that are continuously differentiable and converge to the original non-smooth function (with growing Lipschitz constants) are used to approximate the objective; the smoothing parameter is driven to zero at a prescribed rate. Convergence guarantees and rates are established for strongly convex, convex, and non-convex cases under standard assumptions on unbiasedness, bounded second moments, and smoothing error; the convex case achieves a 1/T^{1/4} rate in stochastic gradient evaluations, matching known results for non-smooth stochastic compositional optimization. The paper also verifies that general compositional and finite-sum problems satisfy the required assumptions and presents preliminary numerical experiments.

Significance. If the rates hold under the stated assumptions, the work supplies a concrete algorithmic framework for non-smooth stochastic compositional problems that arise in machine learning and risk-averse optimization, with rates that align with the existing non-smooth literature. The explicit verification that standard compositional and finite-sum settings fit the assumptions is a useful contribution for practitioners.

major comments (1)

[§4 and Assumption 3] §4 (convex case analysis) and the statement of Assumption 3 (bounded second moments): the 1/T^{1/4} rate is derived via a standard supermartingale argument that requires E[||g_t||^2] ≤ σ^2 uniformly in t. Because the smoothing functions are C^1 with Lipschitz constant L(ε) → ∞ as ε → 0 and the schedule drives ε_t → 0, the composed stochastic gradient norm can scale with L(ε_t). It is not shown that the uniform second-moment bound remains valid for the chosen decay rate of ε_t without an extra growth-control assumption; this is load-bearing for the claimed rate.

minor comments (2)

[Abstract and §1] The abstract and introduction claim that the rates 'match known results for non-smooth stochastic compositional optimization'; a brief citation to the specific prior rate (e.g., the reference establishing the 1/T^{1/4} benchmark) would strengthen the comparison.
[§2 and §4] Notation for the smoothing error bound (e.g., the term involving |f_ε(x) - f(x)|) is introduced in §2 but used without an explicit equation number in the rate proofs; adding a numbered display would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and valuable feedback. We address the single major comment below and indicate the revisions we will incorporate.

read point-by-point responses

Referee: [§4 and Assumption 3] §4 (convex case analysis) and the statement of Assumption 3 (bounded second moments): the 1/T^{1/4} rate is derived via a standard supermartingale argument that requires E[||g_t||^2] ≤ σ^2 uniformly in t. Because the smoothing functions are C^1 with Lipschitz constant L(ε) → ∞ as ε → 0 and the schedule drives ε_t → 0, the composed stochastic gradient norm can scale with L(ε_t). It is not shown that the uniform second-moment bound remains valid for the chosen decay rate of ε_t without an extra growth-control assumption; this is load-bearing for the claimed rate.

Authors: We thank the referee for this observation. Assumption 3 states a uniform bound on the second moments of the stochastic gradients g_t, which is used in the supermartingale argument of §4. In Section 5 we explicitly verify that the standard assumptions for general compositional problems and finite-sum compositional problems (bounded inner Jacobians, unbiased estimators, and controlled smoothing error) imply that E[||g_t||^2] remains bounded by a constant independent of t, even as ε_t → 0. The growth of L(ε_t) is offset by the fact that the inner mapping is evaluated at points whose gradients are controlled by the problem class; the specific schedule ε_t ∼ T^{-1/4} is chosen precisely so that the bias-variance trade-off yields the 1/T^{1/4} rate while preserving the uniform moment bound already verified for the motivating problem classes. To make this dependence fully transparent, we will insert a short clarifying paragraph immediately after the statement of Assumption 3 and before the convex-case theorem, referencing the verification in Section 5. revision: yes

Circularity Check

0 steps flagged

Derivation uses standard stochastic approximation on externally assumed smoothing bounds

full rationale

The paper establishes convergence rates for the smoothing stochastic gradient method by applying standard stochastic approximation arguments (telescoping sums or supermartingale inequalities) to the smoothed objective under stated assumptions on unbiased gradient estimates, bounded second moments, and smoothing error bounds. These assumptions are introduced as external conditions on the smoothing functions and stochastic oracles rather than being derived from or fitted to quantities defined inside the paper's own equations. The claimed 1/T^{1/4} rate for convex cases is presented as matching known results for non-smooth stochastic compositional optimization, with no load-bearing steps that reduce by construction to self-defined parameters, self-citations, or ansatzes imported from the authors' prior work. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard stochastic optimization assumptions plus the existence and properties of smoothing functions; no new entities are postulated and the only potential free parameter is the designated rate at which the smoothing parameter is driven to zero.

free parameters (1)

smoothing parameter decay rate
The rate at which the smoothing parameter is iteratively driven to zero is designated by the method and directly affects the convergence analysis.

axioms (2)

domain assumption Stochastic gradient estimates are unbiased with bounded second moments
Invoked to obtain the stated convergence rates under the smoothing approximation.
domain assumption Smoothing functions are continuously differentiable and converge to the original non-smooth function with controlled error as parameter tends to zero
Core property used to justify replacing the non-smooth outer function while maintaining convergence.

pith-pipeline@v0.9.0 · 5737 in / 1385 out tokens · 50296 ms · 2026-05-19T05:20:54.101412+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

smoothing functions ... continuously differentiable and approach the original function as a smoothing parameter goes to zero (at the price of increasingly higher Lipschitz constants)
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

best convergence rate ... 1/T^{1/4} ... matches ... non-smooth stochastic compositional optimization

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 1 internal anchor

[1]

Ahmed, U

S. Ahmed, U. C ¸ akmak, and A. Shapiro. Coherent risk measures in inventory problems. European J. Oper. Res., 182:226–238, 2007

work page 2007
[2]

Beck and M

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIIMS, 2:183–202, 2009

work page 2009
[3]

Bottou, F

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., 60:223–311, 2018

work page 2018
[4]

Chen and O

C. Chen and O. L. Mangasarian. A class of smoothing functions for nonlinear and mixed complementarity problems. Comput. Optim. Appl. , 5:97–138, 1996. 21

work page 1996
[5]

T. Chen, Y. Sun, and W. Yin. Solving stochastic compositional optimization is nearly as easy as solving stochastic optimization. IEEE Trans. Signal Process., 69:4937–4948, 2021

work page 2021
[6]

X. Chen. Smoothing methods for nonsmooth, nonconvex minimization. Math. Program., 134:71–99, 2012

work page 2012
[7]

Chen and W

X. Chen and W. Zhou. Smoothing nonlinear conjugate gradient method for image restora- tion using nonsmooth nonconvex minimization. SIAM J. Imaging Sci. , 3:765–790, 2010

work page 2010
[8]

C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: A survey and comparison. J. Mach. Learn. Res. , 15:809–883, 2014

work page 2014
[9]

Daubechies, M

I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math. , 57:1413–1457, 2004

work page 2004
[10]

Defazio, F

A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. NeurIPS, 27, 2014

work page 2014
[11]

D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory , 52:1289–1306, 2006

work page 2006
[12]

Duchi and Y

J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. , 10:2899–2934, 2009

work page 2009
[13]

J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIOPT, 22:674–701, 2012

work page 2012
[14]

Ermoliev

Y. Ermoliev. Stochastic programming methods, 1976

work page 1976
[15]

Ermoliev, M

Y. Ermoliev, M. Yu, and R. B. Wets. Numerical Techniques for Stochastic Optimization . Springer-Verlag, Heidelberg, 1988

work page 1988
[16]

L. Evans. Measure Theory and Fine Properties of Functions . Routledge, Milton Park, Oxfordshire, UK, 2018

work page 2018
[17]

C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017

work page 2017
[18]

Garmanjani and L

R. Garmanjani and L. N. Vicente. Smoothing and worst-case complexity for direct-search methods in nonsmooth optimization. IMA J. Numer. Anal. , 33:1008–1028, 2013

work page 2013
[19]

Giovannelli, G

T. Giovannelli, G. Kent, and L. N. Vicente. Inexact bilevel stochastic gradient methods for constrained and unconstrained lower-level problems. ISE Technical Report 21T-025, Lehigh University, December 2022

work page 2022
[20]

Z. Huo, B. Gu, J. Liu, and H. Huang. Accelerated method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018
[21]

Jalilzadeh, U

A. Jalilzadeh, U. Shanbhag, J. Blanchet, and P. W. Glynn. Smoothed variable sample-size accelerated proximal methods for nonsmooth stochastic convex programs. Stoch. Syst., 12: 373–410, 2022. 22

work page 2022
[22]

Laguel, K

Y. Laguel, K. Pillutla, J. Malick, and Z. Harchaoui. Superquantiles at work: Machine learning applications and efficient subgradient computation. Set-Valued Var. Anal. , 29: 967–996, 2021

work page 2021
[23]

T. Li, A. Beirami, M. Sanjabi, and V. Smith. On tilted losses in machine learning: Theory and applications. J. Mach. Learn. Res. , 24:1–79, 2023

work page 2023
[24]

X. Lian, M. Wang, and J. Liu. Finite-sum composition optimization via variance reduced gradient descent. In Artificial Intelligence and Statistics , pages 1159–1167. PMLR, 2017

work page 2017
[25]

Q. Lin, X. Chen, and J. Pena. A smoothing stochastic gradient method for composite optimization. Optim. Methods Softw. , 29:1281–1301, 2014

work page 2014
[26]

J. Liu, Y. Cui, and J. Pang. Solving nonsmooth and nonconvex compound stochastic programs with applications to risk measure minimization. Math. Oper. Res., 47:3051–3083, 2022

work page 2022
[27]

L. Liu, J. Liu, and D. Tao. Variance reduced methods for non-convex composition opti- mization. IEEE TPAMI, 44:5813–5825, 2021

work page 2021
[28]

J. Moreau. Proximit´ e et dualit´ e dans un espace hilbertien. Bulletin de la Soci´ et´ e math´ ematique de France, 93:273–299, 1965

work page 1965
[29]

Moulines and F

E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. NeurIPS, 24, 2011

work page 2011
[30]

Nemirovski, A

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. , 19:1574–1609, 2009

work page 2009
[31]

A. S. Nemirovskij and D. B. Yudin. Problem complexity and method efficiency in opti- mization. SIAM Rev., 1983

work page 1983
[32]

Nesterov

Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103:127–152, 2005

work page 2005
[33]

Nesterov

Y. Nesterov. Introductory lectures on convex optimization: A basic course , volume 87. Springer Science & Business Media, Berlin, 2013

work page 2013
[34]

Nesterov and V

Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. FoCM, 17:527–566, 2017

work page 2017
[35]

B. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Math. Math. Phys. , 3:864–878, 1963

work page 1963
[36]

B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Math. Math. Phys. , 4:1–17, 1964

work page 1964
[37]

B.T. Polyak. Introduction to Optimization. Optimization Software, New York, 1987

work page 1987
[38]

Robbins and S

H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat. , pages 400–407, 1951. 23

work page 1951
[39]

R. T. Rockafellar and R. J. B. Wets. Variational Analysis. Springer, Berlin, 1998

work page 1998
[40]

Shapiro, D

A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on Stochastic Programming: Mod- eling and Theory . SIAM, Philadelphia, PA, USA, 2021

work page 2021
[41]

N. Z. Shor. Minimization methods for non-differentiable functions , volume 3. Springer Science & Business Media, Berlin, 2012

work page 2012
[42]

Tibshirani

R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol., 58:267–288, 1996

work page 1996
[43]

M. Wang, E. X. Fang, and H. Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program., 161:419–449, 2017

work page 2017
[44]

M. Wang, J. Liu, and E. X. Fang. Accelerating stochastic composition optimization. J. Mach. Learn. Res., 18:1–23, 2017

work page 2017
[45]

R. Wang, C. Zhang, L. Wang, and Y. Shao. A stochastic nesterov’s smoothing accelerated method for general nonsmooth constrained stochastic composite convex optimization. J. Sci. Comput., 93:52, 2022

work page 2022
[46]

Wolberg, O

W. Wolberg, O. Mangasarian, N. Street, and W. Street. Breast Cancer Wisconsin (Diagnos- tic). UCI Machine Learning Repository, 1993. DOI: https://doi.org/10.24432/C5DW2B

work page doi:10.24432/c5dw2b 1993
[47]

L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. NeurIPS, 22, 2009

work page 2009
[48]

Xiao and T

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim , 24:2057–2075, 2014

work page 2057
[49]

Fast Stochastic Variance Reduced ADMM for Stochastic Composition Optimization

Y. Yu and L. Huang. Fast stochastic variance reduced admm for stochastic composition optimization. arXiv preprint arXiv:1705.04138 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[50]

H. Yuan, X. Lian, and J. Liu. Stochastic recursive variance reduction for efficient smooth non-convex compositional optimization. arXiv preprint arXiv:1912.13515 , 2019

work page arXiv 1912
[51]

Zhang and X

C. Zhang and X. Chen. Smoothing projected gradient method and its application to stochas- tic linear complementarity problems. SIOPT, 20:627–649, 2009

work page 2009
[52]

Zhang and L

J. Zhang and L. Xiao. A composite randomized incremental gradient method. In ICML, pages 7454–7462. PMLR, 2019

work page 2019
[53]

Zhang and L

J. Zhang and L. Xiao. A stochastic composite gradient method with incremental variance reduction. NeurIPS, 32, 2019. 24 A Assumption 3.8 for the case f (x) = ϕ(ψ(x)) In this part of the appendix, we want to show strong evidence, for the case f(x) = ϕ(ψ(x)), that there exists y ∈ B(x, E1µ) for some E1 > 0, such that ∥∇x ˜f(x, µ) − g(y)∥ ≤ E2µ for some E2 >...

work page 2019
[54]

and for sufficiently large T , we have E[∥xT +1 − x∗∥2] ≤ O G2 1G2 2 σ2 1 T 1−2a + L2 ˜ϕG6 1G2 2 σ4 1 T 2/3−4a + L2 ˜ϕG2 1Vψ σ2 1 T 2/3−2a ! = O 1 T 2/3−4a . Proof. First, from the update rule of Algorithm 4, we have ∥xk+1 − x∗∥2 = ∥xk − αk ˜ϕ′(yk+1, µk)∇xψ(xk, ξ2 k) − x∗∥2 = ∥xk − x∗∥2 − 2αk(xk − x∗)⊤ ˜ϕ′(yk+1, µk)∇xψ(xk, ξ2 k) + α2 k∥˜ϕ′(yk+1, µk)∇xψ(xk...

work page

[1] [1]

Ahmed, U

S. Ahmed, U. C ¸ akmak, and A. Shapiro. Coherent risk measures in inventory problems. European J. Oper. Res., 182:226–238, 2007

work page 2007

[2] [2]

Beck and M

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIIMS, 2:183–202, 2009

work page 2009

[3] [3]

Bottou, F

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., 60:223–311, 2018

work page 2018

[4] [4]

Chen and O

C. Chen and O. L. Mangasarian. A class of smoothing functions for nonlinear and mixed complementarity problems. Comput. Optim. Appl. , 5:97–138, 1996. 21

work page 1996

[5] [5]

T. Chen, Y. Sun, and W. Yin. Solving stochastic compositional optimization is nearly as easy as solving stochastic optimization. IEEE Trans. Signal Process., 69:4937–4948, 2021

work page 2021

[6] [6]

X. Chen. Smoothing methods for nonsmooth, nonconvex minimization. Math. Program., 134:71–99, 2012

work page 2012

[7] [7]

Chen and W

X. Chen and W. Zhou. Smoothing nonlinear conjugate gradient method for image restora- tion using nonsmooth nonconvex minimization. SIAM J. Imaging Sci. , 3:765–790, 2010

work page 2010

[8] [8]

C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: A survey and comparison. J. Mach. Learn. Res. , 15:809–883, 2014

work page 2014

[9] [9]

Daubechies, M

I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math. , 57:1413–1457, 2004

work page 2004

[10] [10]

Defazio, F

A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. NeurIPS, 27, 2014

work page 2014

[11] [11]

D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory , 52:1289–1306, 2006

work page 2006

[12] [12]

Duchi and Y

J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. , 10:2899–2934, 2009

work page 2009

[13] [13]

J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIOPT, 22:674–701, 2012

work page 2012

[14] [14]

Ermoliev

Y. Ermoliev. Stochastic programming methods, 1976

work page 1976

[15] [15]

Ermoliev, M

Y. Ermoliev, M. Yu, and R. B. Wets. Numerical Techniques for Stochastic Optimization . Springer-Verlag, Heidelberg, 1988

work page 1988

[16] [16]

L. Evans. Measure Theory and Fine Properties of Functions . Routledge, Milton Park, Oxfordshire, UK, 2018

work page 2018

[17] [17]

C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017

work page 2017

[18] [18]

Garmanjani and L

R. Garmanjani and L. N. Vicente. Smoothing and worst-case complexity for direct-search methods in nonsmooth optimization. IMA J. Numer. Anal. , 33:1008–1028, 2013

work page 2013

[19] [19]

Giovannelli, G

T. Giovannelli, G. Kent, and L. N. Vicente. Inexact bilevel stochastic gradient methods for constrained and unconstrained lower-level problems. ISE Technical Report 21T-025, Lehigh University, December 2022

work page 2022

[20] [20]

Z. Huo, B. Gu, J. Liu, and H. Huang. Accelerated method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018

[21] [21]

Jalilzadeh, U

A. Jalilzadeh, U. Shanbhag, J. Blanchet, and P. W. Glynn. Smoothed variable sample-size accelerated proximal methods for nonsmooth stochastic convex programs. Stoch. Syst., 12: 373–410, 2022. 22

work page 2022

[22] [22]

Laguel, K

Y. Laguel, K. Pillutla, J. Malick, and Z. Harchaoui. Superquantiles at work: Machine learning applications and efficient subgradient computation. Set-Valued Var. Anal. , 29: 967–996, 2021

work page 2021

[23] [23]

T. Li, A. Beirami, M. Sanjabi, and V. Smith. On tilted losses in machine learning: Theory and applications. J. Mach. Learn. Res. , 24:1–79, 2023

work page 2023

[24] [24]

X. Lian, M. Wang, and J. Liu. Finite-sum composition optimization via variance reduced gradient descent. In Artificial Intelligence and Statistics , pages 1159–1167. PMLR, 2017

work page 2017

[25] [25]

Q. Lin, X. Chen, and J. Pena. A smoothing stochastic gradient method for composite optimization. Optim. Methods Softw. , 29:1281–1301, 2014

work page 2014

[26] [26]

J. Liu, Y. Cui, and J. Pang. Solving nonsmooth and nonconvex compound stochastic programs with applications to risk measure minimization. Math. Oper. Res., 47:3051–3083, 2022

work page 2022

[27] [27]

L. Liu, J. Liu, and D. Tao. Variance reduced methods for non-convex composition opti- mization. IEEE TPAMI, 44:5813–5825, 2021

work page 2021

[28] [28]

J. Moreau. Proximit´ e et dualit´ e dans un espace hilbertien. Bulletin de la Soci´ et´ e math´ ematique de France, 93:273–299, 1965

work page 1965

[29] [29]

Moulines and F

E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. NeurIPS, 24, 2011

work page 2011

[30] [30]

Nemirovski, A

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. , 19:1574–1609, 2009

work page 2009

[31] [31]

A. S. Nemirovskij and D. B. Yudin. Problem complexity and method efficiency in opti- mization. SIAM Rev., 1983

work page 1983

[32] [32]

Nesterov

Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103:127–152, 2005

work page 2005

[33] [33]

Nesterov

Y. Nesterov. Introductory lectures on convex optimization: A basic course , volume 87. Springer Science & Business Media, Berlin, 2013

work page 2013

[34] [34]

Nesterov and V

Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. FoCM, 17:527–566, 2017

work page 2017

[35] [35]

B. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Math. Math. Phys. , 3:864–878, 1963

work page 1963

[36] [36]

B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Math. Math. Phys. , 4:1–17, 1964

work page 1964

[37] [37]

B.T. Polyak. Introduction to Optimization. Optimization Software, New York, 1987

work page 1987

[38] [38]

Robbins and S

H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat. , pages 400–407, 1951. 23

work page 1951

[39] [39]

R. T. Rockafellar and R. J. B. Wets. Variational Analysis. Springer, Berlin, 1998

work page 1998

[40] [40]

Shapiro, D

A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on Stochastic Programming: Mod- eling and Theory . SIAM, Philadelphia, PA, USA, 2021

work page 2021

[41] [41]

N. Z. Shor. Minimization methods for non-differentiable functions , volume 3. Springer Science & Business Media, Berlin, 2012

work page 2012

[42] [42]

Tibshirani

R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol., 58:267–288, 1996

work page 1996

[43] [43]

M. Wang, E. X. Fang, and H. Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program., 161:419–449, 2017

work page 2017

[44] [44]

M. Wang, J. Liu, and E. X. Fang. Accelerating stochastic composition optimization. J. Mach. Learn. Res., 18:1–23, 2017

work page 2017

[45] [45]

R. Wang, C. Zhang, L. Wang, and Y. Shao. A stochastic nesterov’s smoothing accelerated method for general nonsmooth constrained stochastic composite convex optimization. J. Sci. Comput., 93:52, 2022

work page 2022

[46] [46]

Wolberg, O

W. Wolberg, O. Mangasarian, N. Street, and W. Street. Breast Cancer Wisconsin (Diagnos- tic). UCI Machine Learning Repository, 1993. DOI: https://doi.org/10.24432/C5DW2B

work page doi:10.24432/c5dw2b 1993

[47] [47]

L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. NeurIPS, 22, 2009

work page 2009

[48] [48]

Xiao and T

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim , 24:2057–2075, 2014

work page 2057

[49] [49]

Fast Stochastic Variance Reduced ADMM for Stochastic Composition Optimization

Y. Yu and L. Huang. Fast stochastic variance reduced admm for stochastic composition optimization. arXiv preprint arXiv:1705.04138 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[50] [50]

H. Yuan, X. Lian, and J. Liu. Stochastic recursive variance reduction for efficient smooth non-convex compositional optimization. arXiv preprint arXiv:1912.13515 , 2019

work page arXiv 1912

[51] [51]

Zhang and X

C. Zhang and X. Chen. Smoothing projected gradient method and its application to stochas- tic linear complementarity problems. SIOPT, 20:627–649, 2009

work page 2009

[52] [52]

Zhang and L

J. Zhang and L. Xiao. A composite randomized incremental gradient method. In ICML, pages 7454–7462. PMLR, 2019

work page 2019

[53] [53]

Zhang and L

J. Zhang and L. Xiao. A stochastic composite gradient method with incremental variance reduction. NeurIPS, 32, 2019. 24 A Assumption 3.8 for the case f (x) = ϕ(ψ(x)) In this part of the appendix, we want to show strong evidence, for the case f(x) = ϕ(ψ(x)), that there exists y ∈ B(x, E1µ) for some E1 > 0, such that ∥∇x ˜f(x, µ) − g(y)∥ ≤ E2µ for some E2 >...

work page 2019

[54] [54]

and for sufficiently large T , we have E[∥xT +1 − x∗∥2] ≤ O G2 1G2 2 σ2 1 T 1−2a + L2 ˜ϕG6 1G2 2 σ4 1 T 2/3−4a + L2 ˜ϕG2 1Vψ σ2 1 T 2/3−2a ! = O 1 T 2/3−4a . Proof. First, from the update rule of Algorithm 4, we have ∥xk+1 − x∗∥2 = ∥xk − αk ˜ϕ′(yk+1, µk)∇xψ(xk, ξ2 k) − x∗∥2 = ∥xk − x∗∥2 − 2αk(xk − x∗)⊤ ˜ϕ′(yk+1, µk)∇xψ(xk, ξ2 k) + α2 k∥˜ϕ′(yk+1, µk)∇xψ(xk...

work page