Connections between schedule-free optimizers, AdEMAMix, and accelerated SGD variants. arXiv preprint arXiv:2502.02431, 2025.

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

AMUSE: Anytime Muon with Stable Gradient Evaluation

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

AMUSE is a new optimizer that integrates Muon orthogonalization with Schedule-Free averaging via adaptive interpolation, giving schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.

Optimistic Dual Averaging Unifies Modern Optimizers

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

SODA unifies several modern optimizers under optimistic dual averaging and supplies a 1/k decay wrapper that improves performance without weight decay tuning.

Benchmarking Optimizers for MLPs in Tabular Deep Learning

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

Muon optimizer outperforms AdamW across 17 tabular datasets when training MLPs under a shared protocol.

ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

ScheduleFree+ scales schedule-free learning to LLMs with fixes for large batches and models, outperforming Warmup-Stable-Decay schedules by up to 31% at 1000 tokens per parameter.

citing papers explorer

AMUSE: Anytime Muon with Stable Gradient Evaluation · cs.LG · 2026-05-21 · unverdicted · none · ref 31
Optimistic Dual Averaging Unifies Modern Optimizers · cs.LG · 2026-05-11 · unverdicted · none · ref 10
Benchmarking Optimizers for MLPs in Tabular Deep Learning · cs.LG · 2026-04-16 · unverdicted · none · ref 9
ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models · cs.LG · 2026-05-18 · unverdicted · none · ref 24
