RR dominates SGD in smooth convex optimization under any reasonable stepsize after any finite number of epochs.
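As a minimal sketch of the two sampling schemes this summary compares, assuming RR means random reshuffling (one fresh permutation per epoch) and SGD means with-replacement sampling: the scalar least-squares objective and the helper names `rr_epoch_order`, `sgd_epoch_order`, `run`, and `objective` are hypothetical and chosen only for illustration. It shows how the two index orders differ and does not attempt to reproduce the paper's result.

```python
import random


def rr_epoch_order(n, rng):
    """Random reshuffling: a fresh permutation, so each index is used exactly once."""
    order = list(range(n))
    rng.shuffle(order)
    return order


def sgd_epoch_order(n, rng):
    """With-replacement SGD: n independent uniform draws, repeats allowed."""
    return [rng.randrange(n) for _ in range(n)]


def run(order_fn, a, b, stepsize, epochs, seed=0):
    """Minimise f(x) = (1/2n) * sum_i (a[i]*x - b[i])**2 over a scalar x."""
    rng = random.Random(seed)
    n = len(a)
    x = 0.0
    for _ in range(epochs):
        for i in order_fn(n, rng):
            # Gradient of the i-th component 0.5 * (a[i]*x - b[i])**2.
            grad_i = a[i] * (a[i] * x - b[i])
            x -= stepsize * grad_i
    return x


def objective(x, a, b):
    """Average least-squares loss at x."""
    return sum((ai * x - bi) ** 2 for ai, bi in zip(a, b)) / (2 * len(a))


if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]  # illustrative data, minimiser x = 1
    for name, fn in [("RR", rr_epoch_order), ("SGD", sgd_epoch_order)]:
        x = run(fn, a, b, stepsize=0.05, epochs=20)
        print(name, x, objective(x, a, b))
```

The only difference between the two runs is the index order within each epoch; the stepsize, data, and update rule are shared, which is the setting in which such comparisons are usually stated.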
Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235
26 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Derives the cold Sinkhorn limiting dynamics as tau approaches zero, proving finite-time convergence to unregularized OT and improved O(tau^{-1}) iteration complexity for dual suboptimality.
Proves sharp O(1/k) rate for Sinkhorn via local bipartite graph analysis of positive-mass edges, bootstrapped from prior almost-sharp global bound.
Stochastic Krasnoselskii-Mann iterations converge almost surely and with rates under finite variance at a single fixed point rather than uniform variance bounds, recovering optimal complexity and providing first such results for some splitting methods.
Price's gradient estimator enables black-box VI to achieve the same state-of-the-art iteration complexity as Wasserstein VI, with experiments confirming it as the main performance driver.
LoRA gradient descent converges to a stationary point at rate O(1/log T).
SketchGuard decouples Byzantine filtering from aggregation in decentralized federated learning by exchanging k-dimensional Count Sketches for screening and full models only from accepted neighbors, achieving up to 50-70% communication savings while proving convergence and matching SOTA robustness.
Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.
New class of CDF-based estimators for sliced Wasserstein distance avoids sorting, enables massive parallelism, and suits federated learning and Gaussian mixture models.
RCGLS replaces the gradient in CGLS with a randomized coordinate version via a constraint correction view, proving linear convergence in expectation better than randomized coordinate descent, plus sparse implementation and ridge regression extension.
Proposes Factor-Augmented SGD that runs on streaming high-dimensional data and supplies the first convergence analysis explicitly accounting for latent-factor estimation error.
Tight feasibility thresholds are derived for the minimal sub-optimality gap in convex L-smooth distributed optimization under bounded adversarial gradient perturbations, together with algorithms attaining them at matching query complexity.
Rotosolve converges to ε-stationary points for smooth non-convex objectives and ε-suboptimal points under PL, with explicit worst-case rates in the finite-shot regime, outperforming or matching RCD in nuanced ways.
A mini-batch stochastic Krasnosel'skiĭ-Mann algorithm converges almost surely to fixed points of nonexpansive mappings when batch sizes increase appropriately.
Benchmarks gradient-ascent algorithms for constrained free energy minimization on quantum Heisenberg models and stabilizer codes, with applications to thermal state design and fixed-temperature quantum encoding.
Refines subspace preconditioning for randomized linear solvers via QR-like factorization, enabling implicit use and proving expected linear convergence while reducing to a smaller system with good singular values.
ADIW accelerates dynamic importance weighting for joint distribution shift by using a few lightweight projected gradient descent updates with warm-starting from prior weights and generalizes it to support multiple divergence-based estimators in a plug-and-play manner.
COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under coverage assumptions.
Under the local PL condition with multiplicative noise for C² functions, (S)GD asymptotic rates match those of strongly convex quadratics via a geometric argument.
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
Treating stochastic and deterministic gradients separately in mini-batch SGD yields faster convergence and smaller error radius than uniform treatment, with further gains under strong convexity.
Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.
FOAM adaptively controls damping and update frequency in Shampoo based on staleness-oriented error approximation to cut wall-clock time while preserving convergence.
CT-AGD accelerates first-order optimization in deep learning by using finite-difference curvature estimates and noise-mitigation heuristics, achieving equivalent accuracy with 33% fewer training epochs and overhead comparable to Adam.