Why gradient clipping accelerates training: A theoretical justification for adaptivity
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 2
citation-polarity summary (verdicts): UNVERDICTED 9
citing papers explorer
- Stochastic Non-Smooth Convex Optimization with Unbounded Gradients
  Introduces a generalized Lipschitz class and shows that clipped AdamW outperforms SGD and AdaGrad for stochastic convex optimization under this and related assumptions.
- Beyond Bounded Variance: Variance-Reduced Normalized Methods for Nonconvex Optimization under Blum-Gladyshev Noise
  Normalized momentum SGD and variance-reduced STORM achieve O(ε^{-6}) and O(ε^{-4}) oracle complexities, respectively, under quadratic distance-dependent noise in nonconvex stochastic optimization.
- Newton methods beyond Hessian Lipschitz continuity: A nonlinear preconditioning approach
  Nonlinear preconditioning extends Newton methods to objectives lacking Hessian Lipschitz continuity by analyzing a transformed mapping under a relaxed smoothness condition, with superlinear convergence and O(ε^{-3/2}) iteration complexity.
- The Multi-Block DC Function Class: Theory, Algorithms, and Applications
  The Multi-Block DC class admits polynomial-size DC decompositions for problems that require exponential size under standard DC programming, and supplies explicit constructive formulations for deep ReLU networks together with convergent batch and stochastic algorithms.
- Distributionally Robust Multi-Objective Optimization
  DR-MOO adds distributional robustness to multi-objective optimization and gives single-loop MGDA algorithms reaching ε-Pareto-stationary points in O(ε^{-4}) samples for nonconvex problems.
- Cost-Aware Learning
  Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.
- Adaptive Federated Optimization
  Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.
- Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives
  Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.
- Frank-Wolfe Algorithms for (L0, L1)-smooth functions
  Proposes (L0, L1)-Frank-Wolfe and an adaptive variant, claiming superior convergence rates for (L0, L1)-smooth objectives over classical Frank-Wolfe.