Large Batch Training of Convolutional Networks
read the original abstract
A common way to speed up training of large convolutional networks is to add computational units. Training is then performed using data-parallel synchronous Stochastic Gradient Descent (SGD) with mini-batch divided between computational units. With an increase in the number of nodes, the batch size grows. But training with large batch size often results in the lower model accuracy. We argue that the current recipe for large batch training (linear learning rate scaling with warm-up) is not general enough and training may diverge. To overcome this optimization difficulties we propose a new training algorithm based on Layer-wise Adaptive Rate Scaling (LARS). Using LARS, we scaled Alexnet up to a batch size of 8K, and Resnet-50 to a batch size of 32K without loss in accuracy.
This paper has not been read by Pith yet.
Forward citations
Cited by 32 Pith papers
-
Masked Autoencoders Are Scalable Vision Learners
Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.
-
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.
-
PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment
PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.
-
Convergence of difference inclusions via a diameter criterion
A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.
-
Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation
DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
-
Training Deep Learning Models with Norm-Constrained LMOs
Scion is a new stochastic LMO-based optimizer family that unifies existing methods, supports unconstrained problems, and delivers hyperparameter transferability plus speedups on nanoGPT training.
-
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
VICReg prevents collapse in self-supervised image embeddings via explicit variance, invariance, and covariance regularization and matches state-of-the-art downstream performance.
-
A Simple Framework for Contrastive Learning of Visual Representations
SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
-
Solving Rubik's Cube with a Robot Hand
Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
-
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.
-
One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
Heavy-tail guided layerwise learning rates improve LLM convergence speed and generalization across LLaMA, GPT variants, AdamW and Muon optimizers from 60M to 1B parameters.
-
Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise
Establishes matching lower and upper oracle complexity bounds for scale-invariant methods with spectral norm under heavy-tailed noise, plus improved rates with higher-order smoothness, and practical tests on neural networks.
-
Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates
The Adam-SGD gap in large-batch LLM pre-training arises mainly from SGD's restricted effective learning rates caused by small gradients and output-layer spikes; clipping lets SGD recover nearly all of Adam's performance.
-
ShardTensor: Domain Parallelism for Scientific Machine Learning
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
-
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...
-
When Losses Align: Gradient-Based Composite Loss Weighting for Efficient Pretraining
A bilevel method learns composite pretraining loss weights online via gradient alignment with a downstream objective, matching tuned baselines at roughly 30% extra cost over one training run.
-
Closed-Form Last Layer Optimization
A method that alternates gradient steps on a neural network backbone with closed-form optimal updates to the final linear layer under squared loss, including an SGD adaptation and NTK-regime convergence analysis.
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
Scaling Laws for Transfer
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
-
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
LAMB optimizer trains BERT with batch size 32868, reducing training time to 76 minutes on TPUv3 Pod without performance loss.
-
Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling
Adaptive Batch Scaling dynamically increases batch size in on-policy RL as policy volatility drops, measured by a new Behavioral Divergence metric, and shows larger networks plus larger batches outperform on ALE with PQN.
-
Rethinking Neural Network Learning Rates: A Stackelberg Perspective
Non-uniform learning rates correspond to a Stackelberg reformulation of the training objective whose two-time-scale alternating gradient descent yields finite-time convergence and can accelerate training through stron...
-
Information theoretic underpinning of self-supervised learning by clustering
SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
-
Communication-Efficient Gluon in Federated Learning
Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.
-
A Physics-Inspired Optimizer: Velocity Regularized Adam
VRAdam hybridizes Adam's per-parameter adaptation with a physics-inspired velocity regularizer to stabilize training at the edge of stability, delivering better empirical performance than AdamW and O(ln(N)/sqrt(N)) co...
-
Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD
GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.
-
Accelerated Gradient Descent for Faster Convergence with Minimal Overhead
CT-AGD accelerates first-order optimization in deep learning by using finite-difference curvature estimates and noise-mitigation heuristics, achieving equivalent accuracy with 33% fewer training epochs and overhead co...
-
Consolidation-Expansion Operator Mechanics:A Unified Framework for Adaptive Learning
OpMech defines the order-gap between consolidation and expansion operators as a real-time, trajectory-based signal for convergence and principled stopping in adaptive learning.
-
Consolidation-Expansion Operator Mechanics:A Unified Framework for Adaptive Learning
OpMech defines the order-gap as a computable non-commutativity measure between consolidation and expansion operators to provide real-time convergence signals and stopping rules in adaptive learning.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.