Benchmarking Neural Network Training Algorithms

Abel L. Peirson; Ankush Garg; Bilal Khan; Chandramouli Shama Sastry; Chris J. Maddison; Daniel Snider; Daniel Suo; Ehsan Amid; Frank Schneider; George E. Dahl

arxiv: 2306.07179 · v2 · pith:GS2EYIIBnew · submitted 2023-06-12 · 💻 cs.LG · stat.ML

George E. Dahl, Frank Schneider, Zachary Nado, Naman Agarwal, Chandramouli Shama Sastry, Philipp Hennig, Sourabh Medapati, Runa Eschenhagen, Priya Kasimbeg, Daniel Suo, Juhan Bae, Justin Gilmer, Abel L. Peirson, Bilal Khan, Rohan Anil, Mike Rabbat, Shankar Krishnan, Daniel Snider, Ehsan Amid, Kongtao Chen, Chris J. Maddison, Rakshith Vasudev, Michal Badura, Ankush Garg, Peter Mattson

classification: cs.LG, stat.ML

keywords: training, benchmark, algorithms, algorithm, submissions, workload, baseline, better

Abstract

Training algorithms, broadly construed, are an essential part of every deep learning pipeline. Training algorithm improvements that speed up training across a wide variety of workloads (e.g., better update rules, tuning protocols, learning rate schedules, or data selection schemes) could save time, save computational resources, and lead to better, more accurate, models. Unfortunately, as a community, we are currently unable to reliably identify training algorithm improvements, or even determine the state-of-the-art training algorithm. In this work, using concrete experiments, we argue that real progress in speeding up training requires new benchmarks that resolve three basic challenges faced by empirical comparisons of training algorithms: (1) how to decide when training is complete and precisely measure training time, (2) how to handle the sensitivity of measurements to exact workload details, and (3) how to fairly compare algorithms that require hyperparameter tuning. In order to address these challenges, we introduce a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware, the AlgoPerf: Training Algorithms benchmark. Our benchmark includes a set of workload variants that make it possible to detect benchmark submissions that are more robust to workload changes than current widely-used methods. Finally, we evaluate baseline submissions constructed using various optimizers that represent current practice, as well as other optimizers that have recently received attention in the literature. These baseline results collectively demonstrate the feasibility of our benchmark, show that non-trivial gaps between methods exist, and set a provisional state-of-the-art for future benchmark submissions to try and surpass.
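
The time-to-result idea in the abstract can be illustrated with a minimal sketch. This is not the AlgoPerf scoring procedure; it only assumes that a run produces a log of (elapsed seconds, validation metric) pairs and that training counts as complete at the first evaluation that reaches a fixed target. The function name `time_to_result`, the log format, and the target values are hypothetical and chosen for illustration only.

```python
import math
from typing import Iterable, Tuple


def time_to_result(
    eval_log: Iterable[Tuple[float, float]],
    target: float,
    lower_is_better: bool = True,
) -> float:
    """Return the elapsed time of the first evaluation that reaches the target.

    eval_log is assumed (hypothetically) to hold (elapsed_seconds, metric)
    pairs in chronological order. If the target is never reached, the run is
    treated as unfinished and math.inf is returned.
    """
    for elapsed, metric in eval_log:
        reached = metric <= target if lower_is_better else metric >= target
        if reached:
            return elapsed
    return math.inf


# Usage: a made-up log where validation loss first reaches 0.5 after 20 s.
log = [(10.0, 0.9), (20.0, 0.5), (30.0, 0.3)]
print(time_to_result(log, target=0.5))  # prints 20.0
```

Returning infinity for runs that never hit the target keeps unfinished runs comparable with finished ones. A real benchmark would also have to fix the hardware, the evaluation frequency, and the tuning budget, which are the challenges the abstract lists.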

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AMUSE: Anytime Muon with Stable Gradient Evaluation
cs.LG · 2026-05 · unverdicted · novelty 7.0
AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
math.OC · 2026-05 · conditional · novelty 7.0
Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

Training Deep Learning Models with Norm-Constrained LMOs
cs.LG · 2025-02 · unverdicted · novelty 7.0
Scion is a new stochastic LMO-based optimizer family that unifies existing methods, supports unconstrained problems, and delivers hyperparameter transferability plus speedups on nanoGPT training.

Old Optimizer, New Norm: An Anthology
cs.LG · 2024-09 · unverdicted · novelty 7.0
Optimizers like Adam reduce to steepest descent under particular norms, opening a design space of norm assignments tailored to layer roles.

LionMuon: Alternating Spectral and Sign Descent for Efficient Training
cs.LG · 2026-05 · unverdicted · novelty 6.0
LionMuon alternates Lion and Muon steps with shared dual-EMA buffer to Pareto-dominate existing optimizers in loss and compute on models up to 720M parameters.

LionMuon: Alternating Spectral and Sign Descent for Efficient Training
cs.LG · 2026-05 · unverdicted · novelty 6.0
LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict...

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
math.OC · 2026-05 · unverdicted · novelty 6.0
Proposes equivariant optimizer updates matched to layer symmetries for embeddings, SwiGLU MLPs, and MoE routers, with reported gains in validation loss and training stability on several language model architectures.

FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo
cs.LG · 2026-06 · unverdicted · novelty 4.0
FOAM adaptively controls damping and update frequency in Shampoo based on staleness-oriented error approximation to cut wall-clock time while preserving convergence.

Position: Adopt Constraints Over Fixed Penalties in Deep Learning
cs.LG · 2025-05 · accept · novelty 4.0
Fixed penalty methods in deep learning do not reliably solve problems with hard non-negotiable constraints, so the constrained formulation should be the starting point instead.