Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Pith reviewed 2026-05-13 17:14 UTC · model grok-4.3
The pith
Batch Normalization normalizes each layer's inputs using mini-batch statistics, allowing higher learning rates and faster convergence in deep networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Making normalization a part of the model architecture and performing it per mini-batch reduces internal covariate shift, so that the same accuracy is reached with far fewer training steps while using higher learning rates and less careful initialization.
What carries the argument
Batch Normalization, which subtracts the mini-batch mean and divides by the mini-batch standard deviation for each layer's activations before applying learned scale and shift parameters.
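A minimal sketch of the training-time transform described above, written in plain Python so it has no dependencies. The function name `batch_norm_train`, the list-of-rows layout, and the value of `eps` are assumptions made for illustration, not the paper's reference implementation.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Normalize a mini-batch feature by feature, then scale and shift.

    batch: list of rows, each row a list of floats (one value per feature).
    gamma, beta: per-feature learned scale and shift.
    eps: small stability constant (the value here is an assumption).
    """
    m = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(m)]
    for k in range(n_features):
        column = [row[k] for row in batch]
        mean = sum(column) / m
        # Biased (divide-by-m) variance of this mini-batch.
        var = sum((x - mean) ** 2 for x in column) / m
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(m):
            x_hat = (column[i] - mean) * inv_std
            out[i][k] = gamma[k] * x_hat + beta[k]
    return out

# Two features on very different scales end up on a comparable scale.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
normalized = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` fixed at 1 and `beta` at 0, each output column has approximately zero mean and unit variance; the learned parameters let the network move away from that default when it helps.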
If this is right
- Networks can safely use significantly higher learning rates without divergence.
- Training requires less careful parameter initialization.
- The regularizing effect can eliminate the need for dropout in some models.
- Applied to a state-of-the-art image classification model, the same accuracy is reached with 14 times fewer training steps.
- An ensemble of batch-normalized networks reaches 4.9% top-5 validation error (4.8% test error) on ImageNet, improving on the best previously published result.
Where Pith is reading between the lines
- The same per-batch normalization idea could stabilize training in other sequence or graph models where layer input distributions also drift.
- Smaller batch sizes may limit the reliability of the estimated statistics, pointing to possible variants that use running averages or different grouping.
- By reducing sensitivity to initialization, the method could make deep learning more accessible outside specialized labs.
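The batch-size concern in the list above can be illustrated with a small simulation. This is a sketch under stated assumptions: activations are drawn from a fixed standard normal distribution, and the helper name `spread_of_batch_means` is hypothetical. It only shows how noisy the mini-batch mean is at different batch sizes; it says nothing about training outcomes.

```python
import random
import statistics

def spread_of_batch_means(batch_size, trials=500, seed=0):
    """Standard deviation of the mini-batch mean across many sampled batches."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pstdev(means)

small = spread_of_batch_means(4)    # noisy estimate of the true mean (0.0)
large = spread_of_batch_means(64)   # noticeably tighter estimate
```

For independent samples the spread shrinks roughly as one over the square root of the batch size, so a batch of 4 gives a mean estimate about four times noisier than a batch of 64.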
Load-bearing premise
The changing distribution of each layer's inputs is the main cause of slow training, and normalizing per mini-batch will reliably reduce this shift without introducing instabilities or needing extensive extra tuning.
What would settle it
A network trained with batch normalization that still requires low learning rates, careful initialization, or more steps than the baseline to reach the same accuracy would falsify the central claim.
Original abstract
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Batch Normalization as an architectural component that normalizes each layer's inputs to zero mean and unit variance using per-mini-batch statistics, followed by learnable scale and shift parameters. It claims this mitigates internal covariate shift, enabling substantially higher learning rates, reduced sensitivity to initialization, and a regularizing effect that can replace Dropout. Experiments on MNIST and a state-of-the-art ImageNet model report that the same accuracy is reached with 14 times fewer training steps and that an ensemble improves top-5 validation error to 4.9%.
Significance. If the empirical gains hold under the reported conditions, the work is significant: it supplies a practical, low-overhead technique that has become standard in deep-network training pipelines and directly enabled deeper architectures. The paper supplies explicit algorithmic pseudocode, the full training protocol for the ImageNet model, and reproducible speed-up numbers, all of which strengthen its contribution.
major comments (2)
- [§4] §4 (ImageNet experiments): no direct metric of internal covariate shift (mean/variance drift, KL divergence, or Wasserstein distance between successive layer-input distributions) is reported for the baseline versus BN networks. Consequently the central causal claim—that the observed 14-fold reduction in training steps stems from reduced ICS rather than from stochastic regularization or improved loss-landscape conditioning—remains unverified.
- [§3.2] §3.2, Eq. (3)–(5): the normalization is performed with mini-batch statistics whose variance is itself stochastic; the manuscript provides no analysis or bound showing that this stochasticity reliably decreases (rather than merely reparameterizes) the covariate shift that the authors define in §2.
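One way the drift metrics requested in the first major comment could be computed is sketched below. Neither the paper nor this report specifies a procedure, so everything here is an assumption: the function names are hypothetical, and the KL term treats each layer-input distribution as a univariate Gaussian fitted by its mean and variance.

```python
import math

def moments(values):
    """Mean and biased variance of a list of activations."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(N(mu_p, var_p) || N(mu_q, var_q)) for univariate Gaussians."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)

def drift_between_steps(inputs_step_t, inputs_step_t1):
    """Compare one layer's inputs at two training steps on the same data."""
    mu0, var0 = moments(inputs_step_t)
    mu1, var1 = moments(inputs_step_t1)
    return {
        "mean_shift": abs(mu1 - mu0),
        "var_ratio": var1 / var0,
        "kl": gaussian_kl(mu0, var0, mu1, var1),
    }

# Toy example: the same layer's inputs before and after a parameter update.
drift = drift_between_steps([-1.0, 0.0, 1.0], [0.0, 1.0, 2.0])
```

Logging such a dictionary per layer for both the baseline and the BN network, on a fixed probe batch, would give a direct comparison of the kind the comment asks for.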
minor comments (2)
- [Figure 1] Figure 1 caption: the legend does not explicitly state which curves include the BN layers and which are the plain baseline, making the speed-up comparison harder to read at a glance.
- [§4.1] §4.1: the MNIST results are reported without error bars or the number of independent runs, even though the absolute accuracy differences are small.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We respond to each major comment below, providing clarifications and indicating where revisions can be made.
- Referee: [§4] §4 (ImageNet experiments): no direct metric of internal covariate shift (mean/variance drift, KL divergence, or Wasserstein distance between successive layer-input distributions) is reported for the baseline versus BN networks. Consequently the central causal claim—that the observed 14-fold reduction in training steps stems from reduced ICS rather than from stochastic regularization or improved loss-landscape conditioning—remains unverified.
  Authors: We acknowledge that direct metrics of internal covariate shift (e.g., distribution distances) are not reported. The primary evidence remains the empirical training speedups and accuracy gains on MNIST and ImageNet, which are consistent with reduced ICS. Other mechanisms such as regularization may contribute, and we can add a short discussion in revision noting the absence of direct ICS quantification while emphasizing the practical benefits.
  Revision: partial
- Referee: [§3.2] §3.2, Eq. (3)–(5): the normalization is performed with mini-batch statistics whose variance is itself stochastic; the manuscript provides no analysis or bound showing that this stochasticity reliably decreases (rather than merely reparameterizes) the covariate shift that the authors define in §2.
  Authors: Mini-batch statistics are stochastic by nature, yet the normalization (combined with learnable scale/shift and population statistics at inference) stabilizes each layer's input distribution. We provide no formal bound or analysis of the stochasticity, as the paper is primarily empirical; the consistent speed and accuracy improvements across models indicate a net reduction in effective covariate shift despite the stochastic estimates.
  Revision: no
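The distinction the rebuttal draws between stochastic mini-batch statistics during training and population statistics at inference can be sketched for a single feature as follows. The class name `BatchNormFeature`, the exponential-moving-average update, and the `momentum` and `eps` values are assumptions chosen for illustration; the rebuttal does not describe how the population statistics are estimated.

```python
import math

class BatchNormFeature:
    """Single-feature batch normalization that tracks population statistics."""

    def __init__(self, gamma=1.0, beta=0.0, momentum=0.9, eps=1e-5):
        self.gamma = gamma
        self.beta = beta
        self.momentum = momentum      # assumed smoothing factor
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def train_step(self, batch):
        """Normalize with this batch's own statistics and update the averages."""
        m = len(batch)
        mean = sum(batch) / m
        var = sum((x - mean) ** 2 for x in batch) / m
        # Unbiased estimate for the population variance when m > 1.
        unbiased = var * m / (m - 1) if m > 1 else var
        self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * unbiased
        inv_std = 1.0 / math.sqrt(var + self.eps)
        return [self.gamma * (x - mean) * inv_std + self.beta for x in batch]

    def infer(self, x):
        """Normalize one value with the tracked statistics; no batch needed."""
        inv_std = 1.0 / math.sqrt(self.running_var + self.eps)
        return self.gamma * (x - self.running_mean) * inv_std + self.beta

bn = BatchNormFeature()
for _ in range(200):
    bn.train_step([1.0, 2.0, 3.0])
centered = bn.infer(2.0)   # close to 0.0 once the averages have settled
```

At inference the output for a given input no longer depends on which other examples happen to share its batch, which is why the transform becomes deterministic at that stage.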
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces Batch Normalization as an explicit architectural layer that computes per-mini-batch mean and variance, normalizes activations, and applies learnable scale/shift parameters. Its central claims of faster convergence, higher learning rates, and regularization effects are supported by direct empirical comparisons on external benchmarks (e.g., ImageNet accuracy and training steps) rather than any mathematical reduction of a predicted quantity back to a fitted parameter defined from the same data. No equations equate a claimed improvement to an input by construction, and no load-bearing premise relies on self-citation chains or imported uniqueness theorems. The derivation is therefore self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
free parameters (1)
- gamma and beta
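For orientation, the role of these two learned parameters can be written out in the notation commonly used for this method. The symbols $\mu_B$, $\sigma_B^2$, and $\epsilon$ (mini-batch mean, mini-batch variance, and a small stability constant) are introduced here for illustration.

```latex
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},
\qquad
y = \gamma \hat{x} + \beta .
% Choosing \gamma = \sqrt{\sigma_B^2 + \epsilon} and \beta = \mu_B gives
% y = \sqrt{\sigma_B^2 + \epsilon}\,\hat{x} + \mu_B = x ,
% so the layer can represent the identity transform if training favors it.
```

This is why the scale and shift count as free parameters rather than fixed constants: they preserve the layer's representational capacity after normalization.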
axioms (1)
- domain assumption: changing distributions of layer inputs during training slow convergence and require lower learning rates
invented entities (1)
- internal covariate shift (no independent evidence)
Forward citations
Cited by 60 Pith papers
-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
-
Density estimation using Real NVP
Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
-
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.
-
Determining star formation histories and age-metallicity relations with convolutional neural networks
A CNN with attention and shared latent space recovers SFHs and metallicities from spectro-photometric data with ~0.12 dex age and ~0.03 dex metallicity dispersion while running thousands of times faster than full spec...
-
Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model
The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.
-
Physics-informed, Generative Adversarial Design of Funicular Shells
A modified DCGAN with an auxiliary discriminator using the membrane factor generates stable, previously unseen funicular shells optimized for pure compression in three dimensions.
-
Deep Learning for CMB Foreground Removal and Beam Deconvolution: A U-Net GAN Approach
A U-Net GAN reconstructs CMB T and E maps from Planck-like simulations with foregrounds and systematics, achieving under 1% error outside the Galactic region and demonstrating first-time correction for non-circular be...
-
High Fidelity Neural Audio Compression
EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same ...
-
A Simple Framework for Contrastive Learning of Visual Representations
SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
-
MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks
Releases MVB, a multi-view baggage re-identification dataset with 4519 identities and 22660 images, plus a merged Siamese network baseline evaluated on it.
-
Learning to learn with quantum neural networks via classical neural networks
Classical RNNs trained on small instances provide parameter initializations for QAOA and VQE that reduce total optimization iterations and generalize across problem sizes.
-
IRNet: A General Purpose Deep Residual Regression Framework for Materials Discovery
IRNet uses per-layer residual shortcuts in fully connected networks to achieve better prediction accuracy and training convergence than prior ML methods on OQMD and Materials Project datasets for material properties.
-
Importance Estimation for Neural Network Pruning
Taylor-expansion importance scoring enables layer-agnostic pruning of neural networks that outperforms prior methods on ImageNet accuracy-FLOPs trade-offs.
-
Progressive Growing of GANs for Improved Quality, Stability, and Variation
Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.
-
The Kinetics Human Action Video Dataset
Kinetics is a new video dataset of 400 human actions with over 160000 ten-second clips collected from YouTube, accompanied by baseline action-classification results from neural networks.
-
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
-
Continuous control with deep reinforcement learning
DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competiti...
-
LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop
LSUN dataset of one million images per category across 30 classes is constructed via iterative human-in-the-loop deep learning labeling.
-
CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation
CogAdapt adapts clinical ECG foundation models to 3-lead wearable signals for cognitive load assessment via a LeadBridge adapter and ProFine progressive fine-tuning, outperforming scratch-trained models with macro-F1 ...
-
Q-PhotoNAS: Hybrid Quantum Neural Architecture Search Framework on Photonic Devices
Q-PhotoNAS applies genetic algorithm search to jointly optimize classical preprocessing, phase encoding, and photonic circuit structure for hybrid quantum-classical models, reporting 99.44% and 98.78% accuracy on Digi...
-
A Dual Physics-Informed Kolmogorov-Arnold Neural Network Framework for Continuum Topology Optimization
Dual HRKAN framework (DPIKAN-TO) for topology optimization with one network predicting displacements and another handling sensitivity-based design updates.
-
Demystifying Manifold Constraints in LLM Pre-training
Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...
-
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
LeWM is the first end-to-end trainable JEPA from pixels that uses only two loss terms for stable training and fast planning on 2D/3D control tasks.
-
Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers
TaperNorm gradually removes internal normalization in pre-norm transformers via learned gates that reach zero, revealing final norm as a scale anchor and enabling up to 1.18x faster KV-cached decoding with small loss ...
-
TriagerX: Dual Transformers for Bug Triaging Tasks with Content and Interaction Based Rankings
TriagerX combines dual-transformer content rankings with developer interaction history to improve top-k accuracy for developer and component recommendations in bug triaging across five datasets.
-
Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs
Introduces layer-wise learning signals combining knowledge distillation and local errors into Equilibrium Propagation, enabling scalable training of deep VGG-style CRNNs with SOTA results on CIFAR-10 and CIFAR-100.
-
Sinc Kolmogorov-Arnold network and its application for solving PDEs with singularities
SincKANs integrate Sinc interpolation into KAN activations and report better empirical results than alternatives on function approximation and PINN tasks.
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
Sharpness-Aware Minimization for Efficiently Improving Generalization
SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.
-
Unsupervised Learning Framework of Interest Point Via Properties Optimization
Unsupervised EM-based joint optimization of interest point detector and descriptor via probability formulations of sparsity, repeatability and discriminability, yielding Property Network that outperforms SOTA on match...
-
Segmenting Objects in Day and Night:Edge-Conditioned CNN for Thermal Image Semantic Segmentation
EC-CNN uses a gated feature-wise transform to incorporate edge priors for thermal semantic segmentation and introduces the SODA dataset of over 7,000 labeled thermal images.
-
A Deep Learning System for Predicting Size and Fit in Fashion E-Commerce
A deep learning content-collaborative model for size and fit prediction that outperforms state-of-the-art on two public and two proprietary datasets.
-
Interaction-and-Aggregation Network for Person Re-identification
Introduces IA network with SIA and CIA modules to adaptively model spatial and channel feature interdependencies for improved person re-identification on benchmarks.
-
QUOTIENT: Two-Party Secure Neural Network Training and Prediction
QUOTIENT achieves 50X faster WAN training time and 6% higher absolute accuracy for secure two-party DNN training by jointly optimizing a discretized training algorithm with a tailored secure protocol.
-
Adaptive Weighting Depth-variant Deconvolution of Fluorescence Microscopy Images with Convolutional Neural Network
A CNN predicts depth-variant PSFs for patch-wise deconvolution of fluorescence microscopy images, with adaptive weighting to reduce artifacts, claiming 98.2% accuracy and up to 6.6 dB PSNR gain.
-
Graph-based Knowledge Distillation by Multi-head Attention Network
Multi-head attention constructs a graph of dataset relations from the teacher embedding procedure and transfers it to the student via multi-task learning, yielding 7.05% higher CIFAR-100 accuracy than the student alon...
-
Generalizing from a few environments in safety-critical reinforcement learning
RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.
-
Rethinking Atrous Convolution for Semantic Image Segmentation
DeepLabv3 improves semantic segmentation by capturing multi-scale context with cascaded or parallel atrous convolutions and adding global context to ASPP, achieving better results on PASCAL VOC 2012 without DenseCRF p...
-
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.
-
Unveiling Hidden Lyman Alpha Emitters in the DESI DR1 Data
A CNN detects 19,685 LAEs at z=2-3.5 in DESI DR1 spectra with 95% purity and completeness.
-
A sound-horizon-free measurement of the Hubble constant from DESI DR2 baryon acoustic oscillations using artificial neural networks
Neural network reconstruction of DESI DR2 BAO, SNe Ia, and cosmic chronometer data gives H0 = 71.5 ± 2.2 km s^{-1} Mpc^{-1} without sound horizon input.
-
Distributional Value Estimation Without Target Networks for Robust Quality-Diversity
QDHUAC is a distributional, target-free QD-RL method that enables stable high-UTD training and competitive performance on Brax locomotion tasks using far fewer environment steps than prior approaches.
-
Enhancing Event Reconstruction in Hyper-Kamiokande with Machine Learning: A ResNet Implementation
ResNet models classify four particle types and regress vertex, direction, and momentum in Hyper-Kamiokande with resolutions matching likelihood methods but at 30,000-50,000x faster inference on GPU.
-
Probabilistic Hysteresis Factor Prediction for Electric Vehicle Batteries with Graphite Anodes Containing Silicon
A data-driven probabilistic approach predicts the hysteresis factor for silicon-graphite anode batteries in electric vehicles, with tests for generalization across vehicle models.
-
DoSReMC: Domain Shift Resilient Mammography Classification using Batch Normalization Adaptation
DoSReMC improves cross-domain generalization in mammography classification by fine-tuning only batch normalization and fully connected layers of pretrained CNNs while preserving convolutional filters, combined with ad...
-
Model-independent calibration of Gamma-Ray Bursts with neural networks
Neural networks calibrate 2D and 3D Dainotti relations on the Platinum GRB sample via ANN-driven MCMC to produce a model-independent Hubble diagram with reduced scatter.
-
YOLOv4: Optimal Speed and Accuracy of Object Detection
YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
-
Product Image Recognition with Guidance Learning and Noisy Supervision
Presents the Product-90 noisy product image dataset and a guidance learning method that combines noisy labels with teacher soft labels to train CNNs, reporting gains over prior methods on Product-90 and three public n...
-
Mitigating the Hubness Problem for Zero-Shot Learning of 3D Objects
A specialized loss mitigates hubness bias in 3D zero-shot learning and sets new state-of-the-art results on ModelNet40, ModelNet10, McGill, and SHREC2015 for both ZSL and GZSL.
-
Signal Conditioning for Learning in the Wild
Olfactory-inspired signal conditioning regularizes diverse inputs so a single brain-mimetic network performs classification across gas sensing, remote sensing, and species identification without hyperparameter changes.
-
AVD: Adversarial Video Distillation
AVD maps videos to semantically realistic 2D images via 3D conv encoder-decoder plus adversarial training, enabling image-based classifiers to perform video activity recognition.
-
UnsuperPoint: End-to-end Unsupervised Interest Point Detector and Descriptor
UnsuperPoint is an end-to-end unsupervised interest point detector and descriptor trained with a siamese network and novel losses for uniform distribution and repeatability, running in real time with competitive resul...
-
Neuron ranking -- an informed way to condense convolutional neural networks architecture
Shapley value and variational importance switch methods produce consistent rankings of filter importance in CNNs, enabling compression and interpretability.
-
Semantic Product Search
A neural semantic matcher for product search uses a custom loss on behavior data, n-gram pooling, and hashing to beat prior methods by 4.7% Recall@100 and 14.5% MAP.
-
New pointwise convolution in Deep Neural Networks through Extremely Fast and Non Parametric Transforms
Replacing pointwise convolutions with DWHT yields a model with 79.1% fewer parameters, 48.4% fewer FLOPs, and 1.49% higher accuracy than MobileNet-V1 on CIFAR-100.
-
Efficient Multi-Domain Network Learning by Covariance Normalization
CovNorm reduces parameters in domain-adaptive layers via two PCAs and a mini-adaptation layer, enabling efficient multi-domain learning with performance close to full fine-tuning.
-
Multimodal and Multi-view Models for Emotion Recognition
Multimodal training with attention and contrastive multi-view learning improves both combined and acoustic-only emotion recognition on IEMOCAP over prior acoustic baselines.
-
Complex Signal Denoising and Interference Mitigation for Automotive Radar Using Convolutional Neural Networks
CNNs trained on simulated data outperform conventional methods for complex signal denoising and interference mitigation in automotive radar.
-
Deep Single Image Deraining Via Estimating Transmission and Atmospheric Light in rainy Scenes
A deep network estimates per-image atmospheric light and a transmission map, then recovers a clear image from the atmospheric scattering model, outperforming prior deraining methods.
-
Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling
Deep autoregressive models with F0 discretization, post-processing, and self-attention prenet outperform RNNs in objective and subjective metrics for singing voice synthesis on a Chinese corpus.
Reference graph
Works this paper leans on
-
[1]
Understanding the difficulty of training deep feedforward neural networks
Bengio, Yoshua and Glorot, Xavier. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249–256, May 2010
-
[2]
Large scale distributed deep networks
Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc'Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012
-
[3]
Desjardins, Guillaume and Kavukcuoglu, Koray. Natural neural networks. (unpublished)
-
[4]
Adaptive subgradient methods for online learning and stochastic optimization
Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. ISSN 1532-4435
-
[5]
Knowledge matters: Importance of prior information for optimization
Gülçehre, Çağlar and Bengio, Yoshua. Knowledge matters: Importance of prior information for optimization. CoRR, abs/1301.4083, 2013
-
[6]
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv e-prints, February 2015
-
[7]
Hyvärinen, A. and Oja, E. Independent component analysis: Algorithms and applications. Neural Netw., 13(4-5):411–430, May 2000
-
[8]
A literature survey on domain adaptation of statistical classifiers, 2008
Jiang, Jing. A literature survey on domain adaptation of statistical classifiers, 2008
-
[9]
Gradient-based learning applied to document recognition
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998a
-
[10]
LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and Muller, K. (eds.), Neural Networks: Tricks of the trade. Springer, 1998b
-
[11]
Nonlinear image representation using divisive normalization
Lyu, S and Simoncelli, E P. Nonlinear image representation using divisive normalization. In Proc. Computer Vision and Pattern Recognition, pp. 1–8. IEEE Computer Society, Jun 23-28 2008. doi:10.1109/CVPR.2008.4587821
-
[12]
Rectified linear units improve restricted boltzmann machines
Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807–814. Omnipress, 2010
-
[13]
On the difficulty of training recurrent neural networks
Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 1310–1318, 2013
-
[14]
Parallel training of DNNs with Natural Gradient and Parameter Averaging
Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of deep neural networks with natural gradient and parameter averaging. CoRR, abs/1410.7455, 2014
-
[15]
Deep learning made easier by linear transformations in perceptrons
Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 924–932, 2012
-
[16]
ImageNet Large Scale Visual Recognition Challenge, 2014
Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 2014
-
[17]
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013
-
[18]
Improving predictive inference under covariate shift by weighting the log-likelihood function
Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, October 2000
-
[19]
Dropout: A simple way to prevent neural networks from overfitting
Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014
-
[20]
On the importance of initialization and momentum in deep learning
Sutskever, Ilya, Martens, James, Dahl, George E., and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. In ICML (3), volume 28 of JMLR Proceedings, pp. 1139–1147. JMLR.org, 2013
-
[21]
Going Deeper with Convolutions
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. CoRR, abs/1409.4842, 2014
-
[22]
A convergence analysis of log-linear training
Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 657–665, Granada, Spain, December 2011
-
[23]
Mean-normalized stochastic gradient for large-scale deep learning
Wiesler, Simon, Richard, Alexander, Schlüter, Ralf, and Ney, Hermann. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 180–184, Florence, Italy, May 2014
-
[24]
Deep image: Scaling up image recognition, 2015
Wu, Ren, Yan, Shengen, Shan, Yi, Dang, Qingqing, and Sun, Gang. Deep image: Scaling up image recognition, 2015