Large Batch Training of Convolutional Networks
A common way to speed up training of large convolutional networks is to add computational units. Training is then performed using data-parallel synchronous Stochastic Gradient Descent (SGD) with the mini-batch divided between computational units. As the number of nodes increases, the batch size grows, but training with a large batch size often results in lower model accuracy. We argue that the current recipe for large-batch training (linear learning rate scaling with warm-up) is not general enough and training may diverge. To overcome these optimization difficulties we propose a new training algorithm based on Layer-wise Adaptive Rate Scaling (LARS). Using LARS, we scaled Alexnet up to a batch size of 8K and Resnet-50 to a batch size of 32K without loss in accuracy.
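For concreteness, here is a minimal sketch of the layer-wise trust-ratio idea behind LARS, assuming the formulation in the paper: each layer gets a local learning rate proportional to the ratio of its weight norm to its gradient norm (plus a weight-decay term), so layers whose gradients are large relative to their weights take smaller steps. The parameter names (trust_coeff, weight_decay) and the plain-SGD update are illustrative; the paper combines this scaling with momentum and a warm-up schedule.

```python
import numpy as np

def lars_step(weights, grads, global_lr, trust_coeff=0.001, weight_decay=5e-4):
    """One LARS-style SGD step (momentum and warm-up omitted for brevity).

    Per-layer learning rate: trust_coeff * ||w|| / (||g|| + weight_decay * ||w||).
    """
    new_weights = []
    for w, g in zip(weights, grads):
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        if w_norm > 0 and g_norm > 0:
            # Trust ratio: how large a step this layer can safely take.
            local_lr = trust_coeff * w_norm / (g_norm + weight_decay * w_norm)
        else:
            # Fall back to plain scaling for zero-norm layers (e.g. at init).
            local_lr = 1.0
        # Weight decay is folded into the gradient before scaling.
        update = global_lr * local_lr * (g + weight_decay * w)
        new_weights.append(w - update)
    return new_weights
```

Because the step size adapts per layer, a single global learning rate can be scaled up with the batch size without individual layers diverging, which is the failure mode of plain linear scaling that the abstract refers to.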
Forward citations
Cited by 14 Pith papers
- Convergence of difference inclusions via a diameter criterion
A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.
- A Simple Framework for Contrastive Learning of Visual Representations
SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
- Solving Rubik's Cube with a Robot Hand
Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
- ShardTensor: Domain Parallelism for Scientific Machine Learning
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
- OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon's orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss.
- When Losses Align: Gradient-Based Composite Loss Weighting for Efficient Pretraining
A bilevel method learns composite pretraining loss weights online via gradient alignment with a downstream objective, matching tuned baselines at roughly 30% extra cost over one training run.
- Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
- Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
- Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- Information theoretic underpinning of self-supervised learning by clustering
SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
- Communication-Efficient Gluon in Federated Learning
Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.
- Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning
OpMech defines the order-gap between consolidation and expansion operators as a real-time, trajectory-based signal for convergence and principled stopping in adaptive learning.
- Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning
OpMech defines the order-gap as a computable non-commutativity measure between consolidation and expansion operators to provide real-time convergence signals and stopping rules in adaptive learning.