Proves sharp O(1/k) rate for Sinkhorn via local bipartite graph analysis of positive-mass edges, bootstrapped from prior almost-sharp global bound.
hub
Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235
22 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Stochastic Krasnoselskii-Mann iterations converge almost surely and with rates under finite variance at a single fixed point rather than uniform variance bounds, recovering optimal complexity and providing first such results for some splitting methods.
Price's gradient estimator enables black-box VI to achieve the same state-of-the-art iteration complexity as Wasserstein VI, with experiments confirming it as the main performance driver.
LoRA gradient descent converges to a stationary point at rate O(1/log T).
SketchGuard decouples Byzantine filtering from aggregation in decentralized federated learning by exchanging k-dimensional Count Sketches for screening and full models only from accepted neighbors, achieving up to 50-70% communication savings while proving convergence and matching SOTA robustness.
Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.
New class of CDF-based estimators for sliced Wasserstein distance avoids sorting, enables massive parallelism, and suits federated learning and Gaussian mixture models.
RCGLS replaces the gradient in CGLS with a randomized coordinate version via a constraint correction view, proving linear convergence in expectation better than randomized coordinate descent, plus sparse implementation and ridge regression extension.
Proposes Factor-Augmented SGD that runs on streaming high-dimensional data and supplies the first convergence analysis explicitly accounting for latent-factor estimation error.
Tight feasibility thresholds are derived for the minimal sub-optimality gap in convex L-smooth distributed optimization under bounded adversarial gradient perturbations, together with algorithms attaining them at matching query complexity.
Rotosolve converges to ε-stationary points for smooth non-convex objectives and ε-suboptimal points under PL, with explicit worst-case rates in the finite-shot regime, outperforming or matching RCD in nuanced ways.
A mini-batch stochastic Krasnosel'skiĭ-Mann algorithm converges almost surely to fixed points of nonexpansive mappings when batch sizes increase appropriately.
Benchmarks gradient-ascent algorithms for constrained free energy minimization on quantum Heisenberg models and stabilizer codes, with applications to thermal state design and fixed-temperature quantum encoding.
Refines subspace preconditioning for randomized linear solvers via QR-like factorization, enabling implicit use and proving expected linear convergence while reducing to a smaller system with good singular values.
ADIW accelerates dynamic importance weighting for joint distribution shift by using a few lightweight projected gradient descent updates with warm-starting from prior weights and generalizes it to support multiple divergence-based estimators in a plug-and-play manner.
COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under coverage assumptions.
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
Treating stochastic and deterministic gradients separately in mini-batch SGD yields faster convergence and smaller error radius than uniform treatment, with further gains under strong convexity.
Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.
CT-AGD accelerates first-order optimization in deep learning by using finite-difference curvature estimates and noise-mitigation heuristics, achieving equivalent accuracy with 33% fewer training epochs and overhead comparable to Adam.
AI models discovered the worst-case complexity of the Kaczmarz algorithm for solving linear systems.
citing papers explorer
-
Sharp $O(1/k)$ convergence rate for the Sinkhorn algorithm via a local analysis
Proves sharp O(1/k) rate for Sinkhorn via local bipartite graph analysis of positive-mass edges, bootstrapped from prior almost-sharp global bound.
-
Stochastic Krasnoselskii-Mann Iterations: Convergence without Uniformly Bounded Variance
Stochastic Krasnoselskii-Mann iterations converge almost surely and with rates under finite variance at a single fixed point rather than uniform variance bounds, recovering optimal complexity and providing first such results for some splitting methods.
-
Stochastic Gradient Variational Inference with Price's Gradient Estimator from Bures-Wasserstein to Parameter Space
Price's gradient estimator enables black-box VI to achieve the same state-of-the-art iteration complexity as Wasserstein VI, with experiments confirming it as the main performance driver.
-
On the Convergence Rate of LoRA Gradient Descent
LoRA gradient descent converges to a stationary point at rate O(1/log T).
-
How does the optimizer implicitly bias the model merging loss landscape?
Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.
-
Highly Data Parallelizable Estimation of the Sliced-Wasserstein Distance Using Cumulative Distribution Functions
New class of CDF-based estimators for sliced Wasserstein distance avoids sorting, enables massive parallelism, and suits federated learning and Gaussian mixture models.
-
Randomized conjugate gradient least squares
RCGLS replaces the gradient in CGLS with a randomized coordinate version via a constraint correction view, proving linear convergence in expectation better than randomized coordinate descent, plus sparse implementation and ridge regression extension.
-
Factor Augmented High-Dimensional SGD
Proposes Factor-Augmented SGD that runs on streaming high-dimensional data and supplies the first convergence analysis explicitly accounting for latent-factor estimation error.
-
Distributed Learning with Adversarial Gradient Perturbations
Tight feasibility thresholds are derived for the minimal sub-optimality gap in convex L-smooth distributed optimization under bounded adversarial gradient perturbations, together with algorithms attaining them at matching query complexity.
-
One Coordinate at a Time: Convergence Guarantees for Rotosolve in Variational Quantum Algorithms
Rotosolve converges to ε-stationary points for smooth non-convex objectives and ε-suboptimal points under PL, with explicit worst-case rates in the finite-shot regime, outperforming or matching RCD in nuanced ways.
-
Mini-Batch Stochastic Krasnosel'ski\u\i-Mann Algorithm for Nonexpansive Fixed Point Problems
A mini-batch stochastic Krasnosel'skiĭ-Mann algorithm converges almost surely to fixed points of nonexpansive mappings when batch sizes increase appropriately.
-
Constrained free energy minimization for the design of thermal states and stabilizer thermodynamic systems
Benchmarks gradient-ascent algorithms for constrained free energy minimization on quantum Heisenberg models and stabilizer codes, with applications to thermal state design and fixed-temperature quantum encoding.
-
On subspace-constrained preconditioning for randomized iterative methods
Refines subspace preconditioning for randomized linear solvers via QR-like factorization, enabling implicit use and proving expected linear convergence while reducing to a smaller system with good singular values.
-
Accelerated Dynamic Importance Weighting with Versatile Divergence-Minimizing Estimators
ADIW accelerates dynamic importance weighting for joint distribution shift by using a few lightweight projected gradient descent updates with warm-starting from prior weights and generalizes it to support multiple divergence-based estimators in a plug-and-play manner.
-
COOPO: Cyclic Offline-Online Policy Optimization Algorithm
COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under coverage assumptions.
-
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
-
Stochastic versus Deterministic in Stochastic Gradient Descent
Treating stochastic and deterministic gradients separately in mini-batch SGD yields faster convergence and smaller error radius than uniform treatment, with further gains under strong convexity.
-
On the Convergence Analysis of Muon
Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.
-
Accelerated Gradient Descent for Faster Convergence with Minimal Overhead
CT-AGD accelerates first-order optimization in deep learning by using finite-difference curvature estimates and noise-mitigation heuristics, achieving equivalent accuracy with 33% fewer training epochs and overhead comparable to Adam.
-
How AI settled the complexity of the oldest SGD algorithm
AI models discovered the worst-case complexity of the Kaczmarz algorithm for solving linear systems.