arxiv: 1708.03888 · v3 · pith:CUJAYCDEnew · submitted 2017-08-13 · 💻 cs.CV

Large Batch Training of Convolutional Networks

classification 💻 cs.CV
keywords: training, batch, large, size, accuracy, computational, convolutional, lars

A common way to speed up training of large convolutional networks is to add computational units. Training is then performed using data-parallel synchronous Stochastic Gradient Descent (SGD) with each mini-batch divided between computational units. With an increase in the number of nodes, the batch size grows. But training with a large batch size often results in lower model accuracy. We argue that the current recipe for large batch training (linear learning rate scaling with warm-up) is not general enough and training may diverge. To overcome these optimization difficulties we propose a new training algorithm based on Layer-wise Adaptive Rate Scaling (LARS). Using LARS, we scaled Alexnet up to a batch size of 8K, and Resnet-50 to a batch size of 32K without loss in accuracy.
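
The abstract names two recipes without spelling them out: linear learning rate scaling with warm-up, and layer-wise adaptive rate scaling. The sketch below illustrates both in plain Python. It is not the authors' implementation. The local-rate formula (trust coefficient times weight norm, divided by gradient norm plus weight decay times weight norm) follows how LARS is usually stated and is not given in the abstract. The function names, the default `trust_coef`, the `eps` guard, and the linear shape of the warm-up ramp are assumptions made for illustration. Momentum is left out to keep the example short.

```python
import math


def linear_scaled_lr(base_lr, base_batch, batch, step, warmup_steps):
    """Linear LR scaling: the target rate grows in proportion to batch size.

    During the first `warmup_steps` steps the rate ramps linearly up to
    the target (the ramp shape is an assumption of this sketch).
    """
    target = base_lr * batch / base_batch
    if warmup_steps > 0 and step < warmup_steps:
        return target * (step + 1) / warmup_steps
    return target


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0, eps=1e-12):
    """Per-layer rate: trust_coef * ||w|| / (||g|| + weight_decay * ||w||).

    `trust_coef` and `eps` are hypothetical defaults, not values from the source.
    """
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0  # fall back to the unscaled update for degenerate layers
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)


def lars_step(layers, grads, global_lr, trust_coef=0.001, weight_decay=0.0):
    """One momentum-free update over a list of layers (each a flat list)."""
    updated = []
    for w, g in zip(layers, grads):
        local_lr = lars_local_lr(w, g, trust_coef, weight_decay)
        updated.append([
            wi - global_lr * local_lr * (gi + weight_decay * wi)
            for wi, gi in zip(w, g)
        ])
    return updated
```

With weight decay set to zero, the size of each layer's update depends on the weight norm and not on the gradient norm, so a layer with very large gradients takes a step of the same relative size as a layer with small ones.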


Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Masked Autoencoders Are Scalable Vision Learners

    cs.CV 2021-11 accept novelty 8.0

    Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.

  2. Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

    math.OC 2026-05 conditional novelty 7.0

    Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

  3. PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.

  4. Convergence of difference inclusions via a diameter criterion

    math.OC 2026-05 unverdicted novelty 7.0

    A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.

  5. Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

    cs.CV 2026-03 unverdicted novelty 7.0

    DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.

  6. Training Deep Learning Models with Norm-Constrained LMOs

    cs.LG 2025-02 unverdicted novelty 7.0

    Scion is a new stochastic LMO-based optimizer family that unifies existing methods, supports unconstrained problems, and delivers hyperparameter transferability plus speedups on nanoGPT training.

  7. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning

    cs.CV 2021-05 accept novelty 7.0

    VICReg prevents collapse in self-supervised image embeddings via explicit variance, invariance, and covariance regularization and matches state-of-the-art downstream performance.

  8. A Simple Framework for Contrastive Learning of Visual Representations

    cs.LG 2020-02 accept novelty 7.0

    SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.

  9. Solving Rubik's Cube with a Robot Hand

    cs.LG 2019-10 accept novelty 7.0

    Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.

  10. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

    cs.LG 2019-10 accept novelty 7.0

    ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.

  11. One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

    cs.LG 2026-05 conditional novelty 6.0

    Heavy-tail guided layerwise learning rates improve LLM convergence speed and generalization across LLaMA, GPT variants, AdamW and Muon optimizers from 60M to 1B parameters.

  12. Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

    math.OC 2026-05 unverdicted novelty 6.0

    Establishes matching lower and upper oracle complexity bounds for scale-invariant methods with spectral norm under heavy-tailed noise, plus improved rates with higher-order smoothness, and practical tests on neural networks.

  13. Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

    cs.LG 2026-05 unverdicted novelty 6.0

    The Adam-SGD gap in large-batch LLM pre-training arises mainly from SGD's restricted effective learning rates caused by small gradients and output-layer spikes; clipping lets SGD recover nearly all of Adam's performance.

  14. ShardTensor: Domain Parallelism for Scientific Machine Learning

    cs.DC 2026-05 unverdicted novelty 6.0

    ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

  15. OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

    cs.LG 2026-05 unverdicted novelty 6.0

    OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...

  16. When Losses Align: Gradient-Based Composite Loss Weighting for Efficient Pretraining

    cs.LG 2026-05 unverdicted novelty 6.0

    A bilevel method learns composite pretraining loss weights online via gradient alignment with a downstream objective, matching tuned baselines at roughly 30% extra cost over one training run.

  17. Closed-Form Last Layer Optimization

    cs.LG 2025-10 unverdicted novelty 6.0

    A method that alternates gradient steps on a neural network backbone with closed-form optimal updates to the final linear layer under squared loss, including an SGD adaptation and NTK-regime convergence analysis.

  18. Revisiting Feature Prediction for Learning Visual Representations from Video

    cs.CV 2024-02 conditional novelty 6.0

    V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

  19. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  20. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  21. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  22. Scaling Laws for Transfer

    cs.LG 2021-02 unverdicted novelty 6.0

    Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

  23. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

    cs.LG 2019-04 conditional novelty 6.0

    LAMB optimizer trains BERT with batch size 32868, reducing training time to 76 minutes on TPUv3 Pod without performance loss.

  24. Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

    stat.ML 2026-05 unverdicted novelty 5.0

    Adaptive Batch Scaling dynamically increases batch size in on-policy RL as policy volatility drops, measured by a new Behavioral Divergence metric, and shows larger networks plus larger batches outperform on ALE with PQN.

  25. Rethinking Neural Network Learning Rates: A Stackelberg Perspective

    cs.LG 2026-05 unverdicted novelty 5.0

    Non-uniform learning rates correspond to a Stackelberg reformulation of the training objective whose two-time-scale alternating gradient descent yields finite-time convergence and can accelerate training through stron...

  26. Information theoretic underpinning of self-supervised learning by clustering

    cs.LG 2026-05 unverdicted novelty 5.0

    SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.

  27. Communication-Efficient Gluon in Federated Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

  28. A Physics-Inspired Optimizer: Velocity Regularized Adam

    cs.LG 2025-05 unverdicted novelty 5.0

    VRAdam hybridizes Adam's per-parameter adaptation with a physics-inspired velocity regularizer to stabilize training at the edge of stability, delivering better empirical performance than AdamW and O(ln(N)/sqrt(N)) co...

  29. Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

    cs.LG 2019-06 unverdicted novelty 5.0

    GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.

  30. Accelerated Gradient Descent for Faster Convergence with Minimal Overhead

    cs.LG 2026-05 unverdicted novelty 4.0

    CT-AGD accelerates first-order optimization in deep learning by using finite-difference curvature estimates and noise-mitigation heuristics, achieving equivalent accuracy with 33% fewer training epochs and overhead co...

  31. Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning

    cs.LG 2026-05 unverdicted novelty 4.0

    OpMech defines the order-gap between consolidation and expansion operators as a real-time, trajectory-based signal for convergence and principled stopping in adaptive learning.

  32. Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning

    cs.LG 2026-05 unverdicted novelty 4.0

    OpMech defines the order-gap as a computable non-commutativity measure between consolidation and expansion operators to provide real-time convergence signals and stopping rules in adaptive learning.