
arxiv: 1804.07612 · v1 · pith:6CB5P4A4new · submitted 2018-04-20 · 💻 cs.LG · cs.CV · stat.ML

Revisiting Small Batch Training for Deep Neural Networks

classification 💻 cs.LG cs.CV stat.ML
keywords mini-batch, training, performance, sizes, gradient, learning, small, batch

Modern deep neural network training is typically based on mini-batch stochastic gradient optimization. While the use of large mini-batches increases the available computational parallelism, small batch training has been shown to provide improved generalization performance and allows a significantly smaller memory footprint, which might also be exploited to improve machine throughput. In this paper, we review common assumptions on learning rate scaling and training duration, as a basis for an experimental comparison of test performance for different mini-batch sizes. We adopt a learning rate that corresponds to a constant average weight update per gradient calculation (i.e., per unit cost of computation), and point out that this results in a variance of the weight updates that increases linearly with the mini-batch size $m$. The collected experimental results for the CIFAR-10, CIFAR-100 and ImageNet datasets show that increasing the mini-batch size progressively reduces the range of learning rates that provide stable convergence and acceptable test performance. On the other hand, small mini-batch sizes provide more up-to-date gradient calculations, which yields more stable and reliable training. The best performance has been consistently obtained for mini-batch sizes between $m = 2$ and $m = 32$, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.
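
The learning rate rule described in the abstract can be illustrated with a toy simulation. The sketch below assumes the rule amounts to linear scaling, $\eta = \tilde{\eta}\, m$ for a base learning rate $\tilde{\eta}$, so that each step applies $-\tilde{\eta}$ times the sum of the per-example gradients. It also assumes scalar per-example gradients drawn independently from a Gaussian, which is purely illustrative. The names `linear_scaled_lr`, `weight_update`, and `update_stats` are hypothetical and do not come from the paper.

```python
import random
import statistics


def linear_scaled_lr(base_lr: float, m: int) -> float:
    """Assumed rule: learning rate eta = base_lr * m for mini-batch size m."""
    return base_lr * m


def weight_update(per_example_grads, base_lr):
    """One SGD step on the mean gradient with eta = base_lr * m.

    Algebraically this equals -base_lr * sum(per_example_grads).
    """
    m = len(per_example_grads)
    eta = linear_scaled_lr(base_lr, m)
    return -eta * sum(per_example_grads) / m


def update_stats(m, base_lr=0.01, mu=1.0, sigma=2.0, trials=20000, seed=0):
    """Empirical mean and variance of the update over many random batches.

    Toy assumption: gradients are i.i.d. Gaussian with mean mu and std sigma.
    """
    rng = random.Random(seed)
    updates = [
        weight_update([rng.gauss(mu, sigma) for _ in range(m)], base_lr)
        for _ in range(trials)
    ]
    return statistics.fmean(updates), statistics.pvariance(updates)


if __name__ == "__main__":
    for m in (2, 8, 32):
        mean, var = update_stats(m)
        # Dividing by m exposes the scaling. Analytically, under the toy
        # assumptions, mean / m = -base_lr * mu and var / m = base_lr**2 * sigma**2.
        print(f"m={m:3d}  mean/m={mean / m:.5f}  var/m={var / m:.6f}")
```

Under these assumptions the mean update per gradient calculation stays roughly constant across batch sizes, while the variance of each weight update grows in proportion to $m$, which is the behaviour the abstract points out.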



Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

    cs.CL 2026-05 unverdicted novelty 7.0

    Bilingual fine-tuning on a new parallel Filipino-English dementia dataset yields Macro-F1 scores of 0.969-0.973 and eliminates cross-lingual degradation for all tested transformers.

  2. Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

    cs.LG 2026-04 unverdicted novelty 7.0

    Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.

  3. Stochastic Gradient Optimization with Model-Assisted Sampling

    cs.LG 2026-06 unverdicted novelty 6.0

    Model-assisted sampling applies survey sampling variance reduction with auxiliary predictors to stochastic gradients, yielding empirical gains on benchmarks and faster generalization with AdamW.

  4. Behavior Score Prediction in Resting-State Functional MRI by Deep State Space Modeling

    eess.SP 2026-02 unverdicted novelty 6.0

    A deep state space model on rs-fMRI time series predicts Alzheimer's behavior scores better than functional connectivity approaches and identifies key predictive brain regions.

  5. Deep Multi-View Learning via Task-Optimal CCA

    cs.LG 2019-07 unverdicted novelty 6.0

    End-to-end deep optimization of CCA plus task loss produces discriminative shared representations that outperform prior multi-view methods on classification and semi-supervised tasks.

  6. A Residual-Subspace Constraint Framework for Fourier Ptychographic Microscopy

    physics.optics 2026-05 unverdicted novelty 5.0

    RSCF decouples low-rank systematic residuals from noise via subspace constraints to enable high-fidelity phase and amplitude recovery in Fourier ptychographic microscopy under aberrations and misalignments.

  7. Algorithmic Advantage on a Gate-Based Photonic Quantum Neural Network

    quant-ph 2026-05 unverdicted novelty 5.0

    Photonic QNNs with two trainable parameters solve nonlinear tasks like XOR at 100% accuracy where parameter-matched ANNs fail, with hardware deployment confirming the result.

  8. Algorithmic Advantage on a Gate-Based Photonic Quantum Neural Network

    quant-ph 2026-05 unverdicted novelty 5.0

    A two-parameter photonic QNN achieves 100% accuracy on nonlinear tasks where a matched classical ANN saturates at random guessing, suggesting algorithmic advantage on current photonic hardware.

  9. Simulation-based inference for rapid Bayesian parameter estimation in epidemiological models: a comparison with MCMC

    cs.AI 2026-06 unverdicted novelty 4.0

    SBI matches MCMC posterior accuracy on a SECIR model but runs 15-120 times faster on GPU for 31-day and 201-day inference windows.