Why gradient clipping accelerates training: A theoretical justification for adaptivity

Zhang, J · 2019 · arXiv 1905.11881

9 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 2

citation-polarity summary

background 1 · unclear 1

representative citing papers

Stochastic Non-Smooth Convex Optimization with Unbounded Gradients

math.OC · 2026-05-15 · unverdicted · novelty 8.0

Introduces a generalized Lipschitz class and shows clipped AdamW outperforms SGD and AdaGrad for stochastic convex optimization under this and related assumptions.

Beyond Bounded Variance: Variance-Reduced Normalized Methods for Nonconvex Optimization under Blum-Gladyshev Noise

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Normalized momentum SGD and variance-reduced STORM achieve O(ε^{-6}) and O(ε^{-4}) oracle complexities respectively under quadratic distance-dependent noise in nonconvex stochastic optimization.

Newton methods beyond Hessian Lipschitz continuity: A nonlinear preconditioning approach

math.OC · 2026-05-12 · unverdicted · novelty 7.0

Nonlinear preconditioning extends Newton methods to objectives lacking Hessian Lipschitz continuity by analyzing a transformed mapping under a relaxed smoothness condition, with superlinear convergence and O(ε^{-3/2}) iteration complexity.

The Multi-Block DC Function Class: Theory, Algorithms, and Applications

math.OC · 2026-04-19 · unverdicted · novelty 7.0

The Multi-Block DC class admits polynomial-size DC decompositions for problems that require exponential size under standard DC programming and supplies explicit constructive formulations for deep ReLU networks together with convergent batch and stochastic algorithms.

Distributionally Robust Multi-Objective Optimization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

DR-MOO adds distributional robustness to multi-objective optimization and gives single-loop MGDA algorithms reaching ε-Pareto-stationary points in O(ε^{-4}) samples for nonconvex problems.

Cost-Aware Learning

cs.LG · 2026-04-30 · unverdicted · novelty 6.0

Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.

Adaptive Federated Optimization

cs.LG · 2020-02-29 · unverdicted · novelty 6.0

Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

math.OC · 2026-05-12 · unverdicted · novelty 5.0

Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.

Frank-Wolfe Algorithms for (L0, L1)-smooth functions

math.OC · 2025-10-18 · unverdicted · novelty 5.0 · 2 refs

Proposes (L0, L1)-Frank-Wolfe and an adaptive variant, claiming superior convergence rates for (L0, L1)-smooth objectives over classical Frank-Wolfe.

citing papers explorer


Stochastic Non-Smooth Convex Optimization with Unbounded Gradients math.OC · 2026-05-15 · unverdicted · none · ref 3
Beyond Bounded Variance: Variance-Reduced Normalized Methods for Nonconvex Optimization under Blum-Gladyshev Noise cs.LG · 2026-05-14 · unverdicted · none · ref 44
Newton methods beyond Hessian Lipschitz continuity: A nonlinear preconditioning approach math.OC · 2026-05-12 · unverdicted · none · ref 42
The Multi-Block DC Function Class: Theory, Algorithms, and Applications math.OC · 2026-04-19 · unverdicted · none · ref 14
Distributionally Robust Multi-Objective Optimization cs.LG · 2026-05-07 · unverdicted · none · ref 27
Cost-Aware Learning cs.LG · 2026-04-30 · unverdicted · none · ref 22
Adaptive Federated Optimization cs.LG · 2020-02-29 · unverdicted · none · ref 44
Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives math.OC · 2026-05-12 · unverdicted · none · ref 71
Frank-Wolfe Algorithms for (L0, L1)-smooth functions math.OC · 2025-10-18 · unverdicted · none · ref 21 · 2 links
