Locally Linear Convergence for Nonsmooth Convex Optimization via Coupled Smoothing and Momentum
Pith reviewed 2026-05-17 22:40 UTC · model grok-4.3
The pith
Coupling the smoothing parameter with the momentum parameter achieves optimal global sublinear convergence and local linear convergence for nonsmooth convex optimization under a local strong convexity condition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By coupling the smoothing parameter decrease with the momentum parameter through a specific adaptive rule, the algorithm attains the provably optimal global rate of O(1/k) for minimizing a nonsmooth convex objective. Under the additional assumption that the nonsmooth term fulfills a locally strong convexity condition near the optimum, the method further guarantees local linear convergence. The same coupling extends to the sum of two nonsmooth convex functions and explains the observed practical two-phase behavior of faster initial convergence followed by asymptotic linear convergence.
What carries the argument
The adaptive accelerated smoothing technique whose smoothing update rule is coupled to the momentum parameter.
Load-bearing premise
The nonsmooth term must satisfy a locally strong convexity condition near the optimum and the smoothing and momentum parameters must obey a precise coupling rule.
What would settle it
A numerical example in which a nonsmooth convex function with local strong convexity near the optimum produces only sublinear convergence under the proposed coupled scheme would disprove the local linear rate claim.
Figures
read the original abstract
We propose an adaptive accelerated smoothing technique for a nonsmooth convex optimization problem where the smoothing update rule is coupled with the momentum parameter. We also extend the setting to the case where the objective function is the sum of two nonsmooth functions. With regard to convergence rate, we provide the global (optimal) sublinear convergence guarantees of O(1/k), which is known to be provably optimal for the studied class of functions, along with a local linear rate if the nonsmooth term fulfills a so-call locally strong convexity condition. We validate the performance of our algorithm on several problem classes, including regression with the l1-norm (the Lasso problem), sparse semidefinite programming (the MaxCut problem), Nuclear norm minimization with application in model free fault diagnosis, and l_1-regularized model predictive control to showcase the benefits of the coupling. An interesting observation is that although our global convergence result guarantees O(1/k) convergence, we consistently observe a practical transient convergence rate of O(1/k^2), followed by asymptotic linear convergence as anticipated by the theoretical result. This two-phase behavior can also be explained in view of the proposed smoothing rule.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an adaptive accelerated smoothing method for nonsmooth convex optimization in which the smoothing parameter is explicitly coupled to the momentum sequence. It establishes global O(1/k) sublinear convergence (optimal for the problem class) and, under an additional locally strong convexity assumption on the nonsmooth term, a local linear rate. The framework is extended to the sum of two nonsmooth convex functions. Numerical results on Lasso, MaxCut, nuclear-norm minimization, and l1-regularized MPC illustrate practical performance, including an observed transient O(1/k^2) phase followed by linear convergence.
Significance. If the local linear-rate analysis is complete and the coupling schedule is shown to be robust, the work would provide a meaningful advance by delivering both the optimal global rate and a faster local rate for nonsmooth convex problems, together with a plausible explanation for the commonly observed two-phase practical behavior. The extension to two nonsmooth terms and the concrete algorithmic coupling are additional strengths.
major comments (3)
- [§4, Theorem 4.3] §4, Theorem 4.3 (local linear rate): the proof relies on the smoothing bias vanishing at a rate compatible with the momentum parameter approaching 1, yet no explicit lower bound is derived for the effective strong-convexity modulus that survives the smoothing perturbation; without this, it is unclear for which range of the local strong-convexity constant the linear contraction remains valid.
- [§5] §5 (two-nonsmooth case): the error bound between the doubly smoothed objective and the original function is stated as O(μ_k), but the constants are not expressed in terms of the individual Lipschitz moduli of the two nonsmooth terms; this omission makes it impossible to verify that the same coupling schedule preserves a positive effective strong-convexity modulus near the optimum.
- [§3.2, coupling rule (3.4)] §3.2, coupling rule (3.4): the specific choice μ_k = Θ(1/k) and β_k = 1 − Θ(1/k) is shown to yield the global O(1/k) rate, but the manuscript does not quantify the perturbation size in the momentum sequence that would destroy the local linear regime, leaving the robustness of the schedule uncharacterized.
minor comments (2)
- [Figure 2] The caption of Figure 2 should explicitly state the vertical axis (function-value gap or distance to optimum) and the horizontal axis scaling (iteration count or CPU time).
- [Introduction] A short paragraph comparing the proposed coupling to existing adaptive-smoothing schedules (e.g., Nesterov 2005, Beck & Teboulle) would help readers situate the novelty.
Simulated Author's Rebuttal
We thank the referee for the thorough review and valuable suggestions. We address each of the major comments in detail below and will revise the manuscript accordingly to incorporate the clarifications.
read point-by-point responses
-
Referee: [§4, Theorem 4.3] §4, Theorem 4.3 (local linear rate): the proof relies on the smoothing bias vanishing at a rate compatible with the momentum parameter approaching 1, yet no explicit lower bound is derived for the effective strong-convexity modulus that survives the smoothing perturbation; without this, it is unclear for which range of the local strong-convexity constant the linear contraction remains valid.
Authors: We appreciate this observation. Upon closer inspection of the proof in Theorem 4.3, the smoothing bias term is bounded by a multiple of μ_k, and since μ_k = Θ(1/k) and the momentum parameter β_k approaches 1 at rate 1/k, the effective strong-convexity modulus can be bounded below by a positive constant depending on the local strong-convexity parameter μ and the Lipschitz constant. Specifically, for k sufficiently large, the effective modulus is at least μ/2. We will add this explicit lower bound and specify the range of μ for which the linear rate holds in the revised manuscript. revision: yes
-
Referee: [§5] §5 (two-nonsmooth case): the error bound between the doubly smoothed objective and the original function is stated as O(μ_k), but the constants are not expressed in terms of the individual Lipschitz moduli of the two nonsmooth terms; this omission makes it impossible to verify that the same coupling schedule preserves a positive effective strong-convexity modulus near the optimum.
Authors: The referee correctly identifies that the error bound in Section 5 is given as O(μ_k) without explicit constants. In fact, if the two nonsmooth functions have Lipschitz moduli L1 and L2, the smoothing error is at most (L1 + L2) μ_k. This allows us to show that the effective strong-convexity modulus remains positive near the optimum as long as μ_k is small enough relative to the local strong-convexity constant. We will update the manuscript to express the constants in terms of L1 and L2 and include the verification for the coupling schedule. revision: yes
-
Referee: [§3.2, coupling rule (3.4)] §3.2, coupling rule (3.4): the specific choice μ_k = Θ(1/k) and β_k = 1 − Θ(1/k) is shown to yield the global O(1/k) rate, but the manuscript does not quantify the perturbation size in the momentum sequence that would destroy the local linear regime, leaving the robustness of the schedule uncharacterized.
Authors: We agree that a characterization of robustness would enhance the result. The local linear convergence in Theorem 4.3 holds provided that the momentum parameter satisfies β_k ≥ 1 - c/k for some c, and μ_k ≤ C/k, with appropriate constants c and C depending on the problem parameters. Perturbations that violate this rate would indeed affect the linear regime. We will add a remark or corollary quantifying the allowable perturbation size in the momentum sequence to ensure the local linear rate is preserved. revision: yes
Circularity Check
No significant circularity; rates derived from standard analysis plus explicit assumption
full rationale
The global O(1/k) sublinear rate is presented as the known optimal rate for nonsmooth convex problems and follows from standard smoothing-plus-momentum analysis once the smoothing parameter is driven to zero at a suitable rate. The local linear rate is explicitly conditioned on the additional assumption that the nonsmooth term satisfies a locally strong convexity condition near the optimum; the coupling schedule between smoothing and momentum is a design choice introduced to make the analysis close, not a fitted quantity renamed as a prediction. No self-definitional steps, fitted-input predictions, load-bearing self-citations, or ansatz smuggling appear in the derivation chain. The manuscript remains self-contained against external benchmarks for convex optimization convergence rates.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The nonsmooth term satisfies a locally strong convexity condition near the solution
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose an adaptive accelerated smoothing technique ... smoothing update rule is coupled with the momentum parameter ... global (optimal) sublinear convergence guarantees of O(1/k) ... local linear rate if the nonsmooth term fulfills a so-called locally strong convexity condition.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Feriel Abboud, Emilie Chouzenoux, Jean-Christophe Pesquet, Jean-Hugues Chenot, and Louis Laborelli. Dual block-coordinate forward-backward algorithm with application to deconvolution and deinterlacing of video sequences. Journal of Mathematical Imaging and Vision , 59:415–431, 2017
work page 2017
-
[2]
An admm algorithm for solving ℓ1 regularized mpc
Mariette Annergren, Anders Hansson, and Bo W ahlberg. An admm algorithm for solving ℓ1 regularized mpc. In IEEE Conference on Decision and Control , 2012
work page 2012
-
[3]
Efficient and Practical Stochastic Subgradient Descent for Nuclear Norm Regularization
Haim A vron, Satyen Kale, Shiva Kasiviswanathan, and Vik as Sindhwani. Efficient and practical stochastic subgradient descent for nuclear norm regularization. preprint available at arXiv:1206.6384, 2012
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[4]
Adaptive accelerated composit e minimization
Reza Rahimi Baghbadorani, Sergio Grammatico, and Peyman Mohajerin Esfahani. Adaptive accelerated composit e minimization. arXiv preprint arXiv:2405.03414 , 2024
-
[5]
First-order methods in optimization
Amir Beck. First-order methods in optimization . SIAM, 2017
work page 2017
-
[6]
Mirror descent and nonlinea r projected subgradient methods for convex optimization
Amir Beck and Marc Teboulle. Mirror descent and nonlinea r projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003
work page 2003
-
[7]
A fast iterative shrinkage- thresholding algorithm for linear inverse problems
Amir Beck and Marc Teboulle. A fast iterative shrinkage- thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences , 2(1):183–202, 2009
work page 2009
-
[8]
Smoothing and first order methods: A unified framework
Amir Beck and Marc Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557–580, 2012
work page 2012
-
[9]
Quadratic growth conditions and uniqueness of optimal solution to lasso
Yunier Bello-Cruz, Guoyin Li, and Tran Thai An Nghia. Quadratic growth conditions and uniqueness of optimal solution to lasso. Journal of Optimization Theory and Applications, 194(1):167–190, 2022
work page 2022
-
[10]
Simultaneous analysis of lasso and dantzig selector
Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov . Simultaneous analysis of lasso and dantzig selector. The Annals of Statistic , 37(4):1705–1732, 2009
work page 2009
-
[11]
A variable smoothing algorithm for solving convex optimization problems
Radu Ioan Boţ and Christopher Hendrich. A variable smoothing algorithm for solving convex optimization problems. An Official Journal of the Spanish Society of Statistics and Operations Research , 23:124–150, 2015
work page 2015
-
[12]
A first-order primal - dual algorithm for convex problems with applications to imaging
Antonin Chambolle and Thomas Pock. A first-order primal - dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision , 40:120–145, 2011. 12
work page 2011
-
[13]
Libsvm: a library fo r support vector machines
Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library fo r support vector machines. ACM transactions on intelligent systems and technology (TIST) , 2(3):1–27, 2011
work page 2011
-
[14]
Nested iterative algorithms for convex constra ined image recovery problems
Caroline Chaux, Jean-Christophe Pesquet, and Nelly Pustelnik. Nested iterative algorithms for convex constra ined image recovery problems. SIAM Journal on Imaging Sciences, 2(2):730–762, 2009
work page 2009
-
[15]
Caihua Chen, Bingsheng He, Yinyu Ye, and Xiaoming Yuan. The direct extension of admm for multi-block convex minimization problems is not necessarily convergen t. Mathematical Programming, 155(1):57–79, 2016
work page 2016
-
[16]
A direct formulation for sparse pca usin g semidefinite programming
Alexandre dAspremont, Laurent Ghaoui, Michael Jordan , and Gert Lanckriet. A direct formulation for sparse pca usin g semidefinite programming. Advances in Neural Information Processing Systems, 2004
work page 2004
-
[17]
A stochas tic smoothing algorithm for semidefinite programming
Alexandre dAspremont and Noureddine Karoui. A stochas tic smoothing algorithm for semidefinite programming. SIAM Journal on Optimization , 24(3):1138–1177, 2014
work page 2014
-
[18]
Efficient minimization methods of mixed l2-l1 and l1-l1 norms for image restoration
Haoying Fu, Michael K Ng, Mila Nikolova, and Jesse L Barlow. Efficient minimization methods of mixed l2-l1 and l1-l1 norms for image restoration. SIAM Journal on Scientific computing, 27(6):1881–1902, 2006
work page 1902
-
[19]
The landscape of the proximal point method for nonconvex–nonconcave minimax optimization
Benjamin Grimmer, Haihao Lu, Pratik W orah, and Vahab Mirrokni. The landscape of the proximal point method for nonconvex–nonconcave minimax optimization. Mathematical Programming, 201(1):373–407, 2023
work page 2023
-
[20]
Robust transfer principal component analy sis with rank constraints
Yuhong Guo. Robust transfer principal component analy sis with rank constraints. Advances in Neural Information Processing Systems, 2013
work page 2013
-
[21]
A spectral bundle method for semidefinite programming
Christoph Helmberg and Franz Rendl. A spectral bundle method for semidefinite programming. SIAM Journal on Optimization, 10(3):673–696, 2000
work page 2000
-
[22]
Model predictive control for constrained systems with uncertain state-dela ys
Xiao-Bing Hu and W en-Hua Chen. Model predictive control for constrained systems with uncertain state-dela ys. International Journal of Robust and Nonlinear Control , 14(17):1421–1432, 2004
work page 2004
-
[23]
A simple algorithm f or nuclear norm regularized problems
Martin Jaggi, Marek Sulovsk, et al. A simple algorithm f or nuclear norm regularized problems. In Proceedings of the 27th International Conference on Machine Learning , 2010
work page 2010
-
[24]
Teaching multivariable control using the quadruple-tank process
Karl Henrik Johansson, Alexander Horch, Olle Wijk, and Anders Hansson. Teaching multivariable control using the quadruple-tank process. In 38th IEEE Conference on Decision and Control , volume 1. IEEE, 1999
work page 1999
-
[25]
Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition
Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16 , pages 795–
work page 2016
-
[26]
Activi ty identification and local linear convergence of forward– backward-type methods
Jingwei Liang, Jalal Fadili, and Gabriel Peyré. Activi ty identification and local linear convergence of forward– backward-type methods. SIAM Journal on Optimization , 27(1):408–437, 2017
work page 2017
-
[27]
Local linear convergence of forward–backward under partial smoothness
Jingwei Liang, Jalal M Fadili, and Gabriel Peyré. Local linear convergence of forward–backward under partial smoothness . Advances in neural information processing systems , 27, 2014
work page 2014
-
[28]
Probl em complexity and method efficiency in optimization
Arkadij Nemirovskij and David Borisovich Yudin. Probl em complexity and method efficiency in optimization. 1983
work page 1983
-
[29]
A fast and accurate first-order method for sparse recovery, 2009
Becker S Bobin J Nesta. A fast and accurate first-order method for sparse recovery, 2009
work page 2009
-
[30]
A method of solving a convex programmin g problem with convergence rate O(1/k 2)
Yurii Nesterov. A method of solving a convex programmin g problem with convergence rate O(1/k 2). In Doklady Akademii Nauk , volume 269, pages 543–547. Russian Academy of Sciences, 1983
work page 1983
-
[31]
Excessive gap technique in nonsmooth c onvex minimization
Yurii Nesterov. Excessive gap technique in nonsmooth c onvex minimization. SIAM Journal on Optimization , 16(1), 2005
work page 2005
-
[32]
Smooth minimization of non-smooth functions
Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103:127–152, 2005
work page 2005
-
[33]
Smoothing technique and its applicati ons in semidefinite optimization
Yurii Nesterov. Smoothing technique and its applicati ons in semidefinite optimization. Mathematical Programming, 110(2):245–259, 2007
work page 2007
-
[34]
Gradient methods for minimizing compo site functions
Yurii Nesterov. Gradient methods for minimizing compo site functions. Mathematical Programming, 140(1):125–161, 2013
work page 2013
-
[35]
Proximal-based recursive implementation for model-free data-driven fault diagnosis
Jacques Noom, Oleg Soloviev, and Michel Verhaegen. Proximal-based recursive implementation for model-free data-driven fault diagnosis. Automatica, 165:111656, 2024
work page 2024
-
[36]
PRISMA: PRoximal Iterative SMoothing Algorithm
Francesco Orabona, Andreas Argyriou, and Nathan Srebr o. Prisma: Proximal iterative smoothing algorithm. preprint available at arXiv:1206.2372 , 2012
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[37]
Sparse control using sum-of-norms regularized mode l predictive control
Sina Khoshfetrat Pakazad, Henrik Ohlsson, and Lennart Ljung. Sparse control using sum-of-norms regularized mode l predictive control. In 52nd IEEE Conference on Decision and Control , pages 5758–5763. IEEE, 2013
work page 2013
-
[38]
Generalized hessian properties of regularized nonsmooth functions
René A Poliquin and R Tyrrell Rockafellar. Generalized hessian properties of regularized nonsmooth functions. SIAM Journal on Optimization , 6(4):1121–1137, 1996
work page 1996
-
[39]
Boris Polyak. Introduction to Optimization . New York, NY, USA: Optimization Software, 1987
work page 1987
-
[40]
Proximity operato r of a sum of functions; application to depth map estimation
Nelly Pustelnik and Laurent Condat. Proximity operato r of a sum of functions; application to depth map estimation. IEEE Signal Processing Letters , 24(12):1827–1831, 2017
work page 2017
-
[41]
An inexact augmented lagrangian framework for nonconvex optimization with nonlinear constraints
Mehmet Fatih Sahin, Armin Eftekhari, Ahmet Alacaoglu, Fabian Latorre, and Volkan Cevher. An inexact augmented lagrangian framework for nonconvex optimization with nonlinear constraints. Advances in Neural Information Processing Systems, 2019
work page 2019
-
[42]
Helga Schramm and Jochem Zowe. A version of the bundle idea for minimizing a nonsmooth function: Conceptual idea, convergence analysis, numerical results. SIAM Journal on Optimization, 2(1):121–152, 1992
work page 1992
-
[43]
Maximum-margin matrix factorization
Nathan Srebro, Jason Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems , 2004
work page 2004
-
[44]
System identification via nuclear norm regularization
Yue Sun, Samet Oymak, and Maryam Fazel. System identification via nuclear norm regularization. preprint available at arXiv:2203.16673 , 2022
-
[45]
Lasso with non-linear measurements is equivalent to one with linear measurements
Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassi bi. Lasso with non-linear measurements is equivalent to one with linear measurements. Advances in Neural Information Processing Systems, 2015
work page 2015
-
[46]
Adaptive smoothing algorithms for nonsmooth composite convex minimization
Quoc Tran-Dinh. Adaptive smoothing algorithms for nonsmooth composite convex minimization. Computational Optimization and Applications , 66:425–451, 2017
work page 2017
-
[47]
Dictionary learning based impulse noise removal via l1–l1 minimization
Shanshan W ang, Qiegen Liu, Yong Xia, Pei Dong, Jianhua Luo, Qiu Huang, and David Dagan Feng. Dictionary learning based impulse noise removal via l1–l1 minimization. Signal Processing, 93(9):2696–2708, 2013
work page 2013
-
[48]
Alternating direction algorithms for ℓ1-problems in compressive sensing
Junfeng Yang and Yin Zhang. Alternating direction algorithms for ℓ1-problems in compressive sensing. SIAM Journal on Scientific Computing , 33(1):250–278, 2011
work page 2011
-
[49]
On decomposing the proximal map
Yao-Liang Yu. On decomposing the proximal map. Advances in Neural Information Processing Systems , 2013
work page 2013
-
[50]
Alp Yurtsever, Olivier Fercoq, and Volkan Cevher. A conditional-gradient-based augmented lagrangi an framework. ICML, 2019. 13 A Supplemental Numerics This section presents additional numerical experiments showcasing the performance of the proposed method in the case of regression problem (27) and MaxCut (30) and also the comparison with the ADMM and Algo...
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.