Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe; Christian Szegedy

arxiv: 1502.03167 · v3 · submitted 2015-02-11 · 💻 cs.LG

Pith reviewed 2026-05-13 17:14 UTC · model grok-4.3

classification 💻 cs.LG

keywords batch normalization · internal covariate shift · deep neural networks · training acceleration · mini-batch statistics · image classification · regularization · learning rate

The pith

Batch Normalization normalizes each layer's inputs using mini-batch statistics, allowing higher learning rates and faster convergence in deep networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep networks train slowly because the input distribution to each layer shifts as parameters in earlier layers change, a problem the authors call internal covariate shift. This forces small learning rates and careful initialization, especially when using saturating nonlinearities. The paper integrates normalization directly into the architecture by computing mean and variance over each training mini-batch for every layer, then applying learned scale and shift parameters. The result is that networks can use much higher learning rates, become less sensitive to initialization, and gain a regularizing effect that sometimes removes the need for dropout. On a state-of-the-art image model this reaches target accuracy after 14 times fewer steps and sets a new record on ImageNet when ensembled.

Core claim

Making normalization a part of the model architecture and performing it per mini-batch reduces internal covariate shift, so that the same accuracy is reached with far fewer training steps while using higher learning rates and less careful initialization.

What carries the argument

Batch Normalization, which subtracts the mini-batch mean and divides by the mini-batch standard deviation for each layer's activations before applying learned scale and shift parameters.
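
As a rough illustration of the transform just described, the following Python sketch normalizes a single feature over one mini-batch and then applies the scale and shift. The function name `batch_norm_train`, the default `eps=1e-5`, and the toy inputs are assumptions made for illustration, not details taken from the paper.

```python
import math


def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift.

    batch: list of floats, one activation per example in the mini-batch.
    gamma, beta: learned scale and shift for this feature.
    eps: small stability constant (the value here is an assumption).
    """
    m = len(batch)
    mean = sum(batch) / m
    # Biased (divide-by-m) variance of the mini-batch.
    var = sum((x - mean) ** 2 for x in batch) / m
    inv_std = 1.0 / math.sqrt(var + eps)
    normalized = [(x - mean) * inv_std for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]


# Toy mini-batch of four activations for one feature (hypothetical values).
activations = [2.0, 4.0, 6.0, 8.0]
out = batch_norm_train(activations, gamma=1.0, beta=0.0)
```

With `gamma=1.0` and `beta=0.0` the output has approximately zero mean and unit variance; other values of `gamma` and `beta` move it to a different mean and scale.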

If this is right

Networks can safely use significantly higher learning rates without divergence.
Training requires less careful parameter initialization.
The regularizing effect can eliminate the need for dropout in some models.
Target accuracy is reached after 14 times fewer training steps on image classification tasks.
An ensemble achieves 4.9 percent top-5 validation error (4.8 percent test error) on ImageNet, beating prior published results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same per-batch normalization idea could stabilize training in other sequence or graph models where layer input distributions also drift.
Smaller batch sizes may limit the reliability of the estimated statistics, pointing to possible variants that use running averages or different grouping.
By reducing sensitivity to initialization, the method could make deep learning more accessible outside specialized labs.

Load-bearing premise

The changing distribution of each layer's inputs is the main cause of slow training, and normalizing per mini-batch will reliably reduce this shift without introducing instabilities or needing extensive extra tuning.

What would settle it

A network trained with batch normalization that still requires low learning rates, careful initialization, or more steps than the baseline to reach the same accuracy would falsify the central claim.

Original abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Batch Norm is a practical architectural tweak that speeds up deep net training with higher learning rates and delivers measurable ImageNet gains, even if the internal covariate shift story is not directly measured.

Batch normalization folds per-mini-batch mean and variance normalization into the network itself, with learnable gamma and beta parameters applied after the normalization step. This is the concrete novelty: normalization stops being a preprocessing trick and becomes part of the forward pass, so gradients flow through it during training.

The experiments show the payoff clearly. The authors reach the same accuracy on a strong image model with roughly 14 times fewer steps, tolerate much higher learning rates, and improve the final top-5 error to 4.9 percent with an ensemble. Batch normalization also sometimes removes the need for dropout, a useful side effect they document on the same models. The math is straightforward, and the implementation details are given so others can reproduce the speed-up. The results hold up on the ImageNet numbers they report.

The main soft spot is that the central motivation, reduced internal covariate shift, is never quantified. The authors do not track distribution drift metrics across layers or training steps, so the observed gains could come from the stochastic regularization of batch statistics or from better-conditioned gradients rather than from explicitly shrinking the shift. That does not invalidate the empirical wins, but it leaves the causal claim thinner than the speed-up numbers.

The paper is aimed at people who train large convolutional networks and need faster iteration. The evidence is strong enough on the practical side that it deserves a serious referee rather than a desk reject. I would send it out for review.
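
The note's point that gradients flow through the normalization can be made concrete with a small sketch. The code below is a single-feature forward and backward pass written from the chain rule, in a compact algebraic form; the function names, the `eps` default, and the toy numbers are assumptions for illustration and are not taken from the paper's text.

```python
import math


def bn_forward(x, gamma, beta, eps=1e-5):
    """Forward pass for one feature; returns outputs and a cache for backward."""
    m = len(x)
    mu = sum(x) / m
    var = sum((v - mu) ** 2 for v in x) / m
    inv_std = 1.0 / math.sqrt(var + eps)
    x_hat = [(v - mu) * inv_std for v in x]
    y = [gamma * h + beta for h in x_hat]
    return y, (x_hat, inv_std)


def bn_backward(dy, cache, gamma):
    """Backward pass: the gradient reaches x directly and via the batch mean and variance."""
    x_hat, inv_std = cache
    m = len(dy)
    dgamma = sum(d * h for d, h in zip(dy, x_hat))
    dbeta = sum(dy)
    dx_hat = [d * gamma for d in dy]
    sum_dxh = sum(dx_hat)
    sum_dxh_xh = sum(d * h for d, h in zip(dx_hat, x_hat))
    # Compact form of the chain rule through the mini-batch mean and variance.
    dx = [
        inv_std / m * (m * d - sum_dxh - h * sum_dxh_xh)
        for d, h in zip(dx_hat, x_hat)
    ]
    return dx, dgamma, dbeta


# Hypothetical inputs and upstream gradients.
x = [1.0, 2.0, 4.0, 7.0]
upstream = [0.3, -1.2, 0.5, 2.0]
y, cache = bn_forward(x, gamma=1.5, beta=0.2)
dx, dgamma, dbeta = bn_backward(upstream, cache, gamma=1.5)
```

Because the output is unchanged when every input in the batch is shifted by the same constant, the entries of `dx` sum to approximately zero, which is a quick sanity check on any implementation of this step.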

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Batch Normalization as an architectural component that normalizes each layer's inputs to zero mean and unit variance using per-mini-batch statistics, followed by learnable scale and shift parameters. It claims this mitigates internal covariate shift, enabling substantially higher learning rates, reduced sensitivity to initialization, and a regularizing effect that can replace Dropout. Experiments on MNIST and a state-of-the-art ImageNet model report that the same accuracy is reached with 14 times fewer training steps and that an ensemble improves top-5 validation error to 4.9%.

Significance. If the empirical gains hold under the reported conditions, the work is significant: it supplies a practical, low-overhead technique that has become standard in deep-network training pipelines and directly enabled deeper architectures. The paper supplies explicit algorithmic pseudocode, the full training protocol for the ImageNet model, and reproducible speed-up numbers, all of which strengthen its contribution.

major comments (2)

[§4] §4 (ImageNet experiments): no direct metric of internal covariate shift (mean/variance drift, KL divergence, or Wasserstein distance between successive layer-input distributions) is reported for the baseline versus BN networks. Consequently the central causal claim—that the observed 14-fold reduction in training steps stems from reduced ICS rather than from stochastic regularization or improved loss-landscape conditioning—remains unverified.
[§3.2] §3.2, Eq. (3)–(5): the normalization is performed with mini-batch statistics whose variance is itself stochastic; the manuscript provides no analysis or bound showing that this stochasticity reliably decreases (rather than merely reparameterizes) the covariate shift that the authors define in §2.
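
One way the kind of drift measurement requested above might be set up is sketched here: fit a univariate Gaussian to a layer input at two training steps and compute the KL divergence between the two fits. The Gaussian fit, the function names, and the sample values are all assumptions for illustration; neither the paper nor this report prescribes this particular diagnostic.

```python
import math


def mean_var(samples):
    """Sample mean and biased variance of a list of floats."""
    m = len(samples)
    mean = sum(samples) / m
    var = sum((s - mean) ** 2 for s in samples) / m
    return mean, var


def gaussian_kl(p_samples, q_samples, eps=1e-8):
    """KL(P || Q) after fitting a univariate Gaussian to each sample set."""
    mp, vp = mean_var(p_samples)
    mq, vq = mean_var(q_samples)
    vp += eps  # guard against zero variance
    vq += eps
    return 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)


# Hypothetical activations of one unit at two different training steps.
step_a = [0.0, 1.0, 2.0, 3.0]
step_b = [v + 1.0 for v in step_a]  # same spread, mean shifted by 1
drift = gaussian_kl(step_a, step_b)
```

Tracking such a quantity per layer for a baseline network and a batch-normalized one would give a direct, if crude, comparison of how much the layer-input distributions move.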

minor comments (2)

[Figure 1] Figure 1 caption: the legend does not explicitly state which curves include the BN layers and which are the plain baseline, making the speed-up comparison harder to read at a glance.
[§4.1] §4.1: the MNIST results are reported without error bars or the number of independent runs, even though the absolute accuracy differences are small.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We respond to each major comment below, providing clarifications and indicating where revisions can be made.

Referee: [§4] §4 (ImageNet experiments): no direct metric of internal covariate shift (mean/variance drift, KL divergence, or Wasserstein distance between successive layer-input distributions) is reported for the baseline versus BN networks. Consequently the central causal claim—that the observed 14-fold reduction in training steps stems from reduced ICS rather than from stochastic regularization or improved loss-landscape conditioning—remains unverified.

Authors: We acknowledge that direct metrics of internal covariate shift (e.g., distribution distances) are not reported. The primary evidence remains the empirical training speedups and accuracy gains on MNIST and ImageNet, which are consistent with reduced ICS. Other mechanisms such as regularization may contribute, and we can add a short discussion in revision noting the absence of direct ICS quantification while emphasizing the practical benefits.

revision: partial

Referee: [§3.2] §3.2, Eq. (3)–(5): the normalization is performed with mini-batch statistics whose variance is itself stochastic; the manuscript provides no analysis or bound showing that this stochasticity reliably decreases (rather than merely reparameterizes) the covariate shift that the authors define in §2.

Authors: Mini-batch statistics are stochastic by nature, yet the normalization (combined with learnable scale/shift and population statistics at inference) stabilizes each layer's input distribution. We provide no formal bound or analysis of the stochasticity, as the paper is primarily empirical; the consistent speed and accuracy improvements across models indicate a net reduction in effective covariate shift despite the stochastic estimates.

revision: no
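
The distinction drawn in the response above, mini-batch statistics during training versus population statistics at inference, can be sketched as follows. The class name, the exponential moving average with `momentum=0.9`, and the initial values are assumptions of this sketch (a common way to approximate population statistics), not a description of the authors' exact procedure.

```python
import math


class BatchNormFeature:
    """Single-feature batch norm with running statistics (illustrative only)."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0
        self.momentum, self.eps = momentum, eps
        self.running_mean, self.running_var = 0.0, 1.0

    def forward(self, batch, training):
        if training:
            m = len(batch)
            mean = sum(batch) / m
            var = sum((x - mean) ** 2 for x in batch) / m
            # Moving averages stand in for population statistics (assumption).
            k = self.momentum
            self.running_mean = k * self.running_mean + (1 - k) * mean
            self.running_var = k * self.running_var + (1 - k) * var
        else:
            # Inference: fixed statistics, so the output for one example
            # does not depend on the other examples in the batch.
            mean, var = self.running_mean, self.running_var
        inv_std = 1.0 / math.sqrt(var + self.eps)
        return [self.gamma * (x - mean) * inv_std + self.beta for x in batch]


layer = BatchNormFeature()
for _ in range(300):
    layer.forward([2.0, 4.0, 6.0, 8.0], training=True)  # hypothetical data
single = layer.forward([5.0], training=False)
```

In inference mode the layer is a fixed affine map, which is why a single example can be processed on its own without a mini-batch.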

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces Batch Normalization as an explicit architectural layer that computes per-mini-batch mean and variance, normalizes activations, and applies learnable scale/shift parameters. Its central claims of faster convergence, higher learning rates, and regularization effects are supported by direct empirical comparisons on external benchmarks (e.g., ImageNet accuracy and training steps) rather than any mathematical reduction of a predicted quantity back to a fitted parameter defined from the same data. No equations equate a claimed improvement to an input by construction, and no load-bearing premise relies on self-citation chains or imported uniqueness theorems. The derivation is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the domain assumption that input distribution shifts are the dominant training obstacle and on the introduction of two learnable parameters per feature to restore representational power after normalization.

free parameters (1)

gamma and beta
Learnable scale and shift parameters per feature, fitted during training, that let the network restore any mean and scale after normalization, including the identity transform.
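
The role of the scale and shift parameters can be written out explicitly. The notation below is assumed for illustration: a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$ of values of one feature and a small stability constant $\epsilon$.

```latex
\begin{align}
\mu_{\mathcal{B}} &= \frac{1}{m}\sum_{i=1}^{m} x_i, &
\sigma_{\mathcal{B}}^{2} &= \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_{\mathcal{B}}\right)^{2}, \\
\hat{x}_i &= \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}}, &
y_i &= \gamma\,\hat{x}_i + \beta .
\end{align}
% Choosing gamma and beta to undo the normalization:
\begin{equation}
\gamma = \sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon},\quad
\beta = \mu_{\mathcal{B}}
\;\Longrightarrow\;
y_i = \sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}\cdot
\frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}} + \mu_{\mathcal{B}} = x_i .
\end{equation}
```

So the normalized layer can represent the unnormalized one as a special case, and training is free to pick any other mean and scale if that lowers the loss.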

axioms (1)

domain assumption: Changing distributions of layer inputs during training slow convergence and require lower learning rates
Invoked in the opening paragraph to motivate the need for normalization.

invented entities (1)

internal covariate shift · no independent evidence
purpose: To name and frame the phenomenon of changing layer-input distributions as the core training difficulty
New term introduced to describe the problem the method targets; no independent measurement provided.

pith-pipeline@v0.9.0 · 5493 in / 1370 out tokens · 49252 ms · 2026-05-13T17:14:42.526886+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
cs.LG 2017-01 accept novelty 8.0

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
Density estimation using Real NVP
cs.LG 2016-05 accept novelty 8.0

Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
cs.LG 2015-11 accept novelty 8.0

DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.
Determining star formation histories and age-metallicity relations with convolutional neural networks
astro-ph.GA 2026-05 unverdicted novelty 7.0

A CNN with attention and shared latent space recovers SFHs and metallicities from spectro-photometric data with ~0.12 dex age and ~0.03 dex metallicity dispersion while running thousands of times faster than full spec...
Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model
cs.CV 2026-05 unverdicted novelty 7.0

The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.
Physics-informed, Generative Adversarial Design of Funicular Shells
cs.CE 2026-04 unverdicted novelty 7.0

A modified DCGAN with an auxiliary discriminator using the membrane factor generates stable, previously unseen funicular shells optimized for pure compression in three dimensions.
Deep Learning for CMB Foreground Removal and Beam Deconvolution: A U-Net GAN Approach
astro-ph.IM 2025-08 unverdicted novelty 7.0

A U-Net GAN reconstructs CMB T and E maps from Planck-like simulations with foregrounds and systematics, achieving under 1% error outside the Galactic region and demonstrating first-time correction for non-circular be...
High Fidelity Neural Audio Compression
eess.AS 2022-10 accept novelty 7.0

EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same ...
A Simple Framework for Contrastive Learning of Visual Representations
cs.LG 2020-02 accept novelty 7.0

SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks
cs.CV 2019-07 unverdicted novelty 7.0

Releases MVB, a multi-view baggage re-identification dataset with 4519 identities and 22660 images, plus a merged Siamese network baseline evaluated on it.
Learning to learn with quantum neural networks via classical neural networks
quant-ph 2019-07 unverdicted novelty 7.0

Classical RNNs trained on small instances provide parameter initializations for QAOA and VQE that reduce total optimization iterations and generalize across problem sizes.
IRNet: A General Purpose Deep Residual Regression Framework for Materials Discovery
physics.comp-ph 2019-07 unverdicted novelty 7.0

IRNet uses per-layer residual shortcuts in fully connected networks to achieve better prediction accuracy and training convergence than prior ML methods on OQMD and Materials Project datasets for material properties.
Importance Estimation for Neural Network Pruning
cs.LG 2019-06 unverdicted novelty 7.0

Taylor-expansion importance scoring enables layer-agnostic pruning of neural networks that outperforms prior methods on ImageNet accuracy-FLOPs trade-offs.
Progressive Growing of GANs for Improved Quality, Stability, and Variation
cs.NE 2017-10 accept novelty 7.0

Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.
The Kinetics Human Action Video Dataset
cs.CV 2017-05 accept novelty 7.0

Kinetics is a new video dataset of 400 human actions with over 160000 ten-second clips collected from YouTube, accompanied by baseline action-classification results from neural networks.
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
cs.CV 2017-04 accept novelty 7.0

MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
Continuous control with deep reinforcement learning
cs.LG 2015-09 accept novelty 7.0

DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competiti...
LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop
cs.CV 2015-06 accept novelty 7.0

LSUN dataset of one million images per category across 30 classes is constructed via iterative human-in-the-loop deep learning labeling.
CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation
cs.LG 2026-05 unverdicted novelty 6.0

CogAdapt adapts clinical ECG foundation models to 3-lead wearable signals for cognitive load assessment via a LeadBridge adapter and ProFine progressive fine-tuning, outperforming scratch-trained models with macro-F1 ...
Q-PhotoNAS: Hybrid Quantum Neural Architecture Search Framework on Photonic Devices
quant-ph 2026-05 unverdicted novelty 6.0

Q-PhotoNAS applies genetic algorithm search to jointly optimize classical preprocessing, phase encoding, and photonic circuit structure for hybrid quantum-classical models, reporting 99.44% and 98.78% accuracy on Digi...
A Dual Physics-Informed Kolmogorov-Arnold Neural Network Framework for Continuum Topology Optimization
cs.CE 2026-05 unverdicted novelty 6.0

Dual HRKAN framework (DPIKAN-TO) for topology optimization with one network predicting displacements and another handling sensitivity-based design updates.
Demystifying Manifold Constraints in LLM Pre-training
cs.LG 2026-05 unverdicted novelty 6.0

Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
cs.LG 2026-03 unverdicted novelty 6.0

LeWM is the first end-to-end trainable JEPA from pixels that uses only two loss terms for stable training and fast planning on 2D/3D control tasks.
Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers
cs.LG 2026-02 unverdicted novelty 6.0

TaperNorm gradually removes internal normalization in pre-norm transformers via learned gates that reach zero, revealing final norm as a scale anchor and enabling up to 1.18x faster KV-cached decoding with small loss ...
TriagerX: Dual Transformers for Bug Triaging Tasks with Content and Interaction Based Rankings
cs.SE 2025-08 conditional novelty 6.0

TriagerX combines dual-transformer content rankings with developer interaction history to improve top-k accuracy for developer and component recommendations in bug triaging across five datasets.
Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs
cs.LG 2025-08 unverdicted novelty 6.0

Introduces layer-wise learning signals combining knowledge distillation and local errors into Equilibrium Propagation, enabling scalable training of deep VGG-style CRNNs with SOTA results on CIFAR-10 and CIFAR-100.
Sinc Kolmogorov-Arnold network and its application for solving PDEs with singularities
cs.LG 2024-10 unverdicted novelty 6.0

SincKANs integrate Sinc interpolation into KAN activations and report better empirical results than alternatives on function approximation and PINN tasks.
Revisiting Feature Prediction for Learning Visual Representations from Video
cs.CV 2024-02 conditional novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Sharpness-Aware Minimization for Efficiently Improving Generalization
cs.LG 2020-10 conditional novelty 6.0

SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.
Unsupervised Learning Framework of Interest Point Via Properties Optimization
cs.CV 2019-07 unverdicted novelty 6.0

Unsupervised EM-based joint optimization of interest point detector and descriptor via probability formulations of sparsity, repeatability and discriminability, yielding Property Network that outperforms SOTA on match...
Segmenting Objects in Day and Night: Edge-Conditioned CNN for Thermal Image Semantic Segmentation
cs.CV 2019-07 unverdicted novelty 6.0

EC-CNN uses a gated feature-wise transform to incorporate edge priors for thermal semantic segmentation and introduces the SODA dataset of over 7,000 labeled thermal images.
A Deep Learning System for Predicting Size and Fit in Fashion E-Commerce
cs.LG 2019-07 unverdicted novelty 6.0

A deep learning content-collaborative model for size and fit prediction that outperforms state-of-the-art on two public and two proprietary datasets.
Interaction-and-Aggregation Network for Person Re-identification
cs.CV 2019-07 unverdicted novelty 6.0

Introduces IA network with SIA and CIA modules to adaptively model spatial and channel feature interdependencies for improved person re-identification on benchmarks.
QUOTIENT: Two-Party Secure Neural Network Training and Prediction
cs.CR 2019-07 unverdicted novelty 6.0

QUOTIENT achieves 50X faster WAN training time and 6% higher absolute accuracy for secure two-party DNN training by jointly optimizing a discretized training algorithm with a tailored secure protocol.
Adaptive Weighting Depth-variant Deconvolution of Fluorescence Microscopy Images with Convolutional Neural Network
eess.IV 2019-07 unverdicted novelty 6.0

A CNN predicts depth-variant PSFs for patch-wise deconvolution of fluorescence microscopy images, with adaptive weighting to reduce artifacts, claiming 98.2% accuracy and up to 6.6 dB PSNR gain.
Graph-based Knowledge Distillation by Multi-head Attention Network
cs.LG 2019-07 unverdicted novelty 6.0

Multi-head attention constructs a graph of dataset relations from the teacher embedding procedure and transfers it to the student via multi-task learning, yielding 7.05% higher CIFAR-100 accuracy than the student alon...
Generalizing from a few environments in safety-critical reinforcement learning
cs.LG 2019-07 unverdicted novelty 6.0

RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.
Rethinking Atrous Convolution for Semantic Image Segmentation
cs.CV 2017-06 unverdicted novelty 6.0

DeepLabv3 improves semantic segmentation by capturing multi-scale context with cascaded or parallel atrous convolutions and adding global context to ASPP, achieving better results on PASCAL VOC 2012 without DenseCRF p...
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
cs.LG 2016-09 unverdicted novelty 6.0

Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.
Unveiling Hidden Lyman Alpha Emitters in the DESI DR1 Data
astro-ph.GA 2026-05 unverdicted novelty 5.0

A CNN detects 19,685 LAEs at z=2-3.5 in DESI DR1 spectra with 95% purity and completeness.
A sound-horizon-free measurement of the Hubble constant from DESI DR2 baryon acoustic oscillations using artificial neural networks
astro-ph.CO 2026-04 unverdicted novelty 5.0

Neural network reconstruction of DESI DR2 BAO, SNe Ia, and cosmic chronometer data gives H0 = 71.5 ± 2.2 km s^{-1} Mpc^{-1} without sound horizon input.
Distributional Value Estimation Without Target Networks for Robust Quality-Diversity
cs.LG 2026-04 unverdicted novelty 5.0

QDHUAC is a distributional, target-free QD-RL method that enables stable high-UTD training and competitive performance on Brax locomotion tasks using far fewer environment steps than prior approaches.
Enhancing Event Reconstruction in Hyper-Kamiokande with Machine Learning: A ResNet Implementation
hep-ex 2026-04 conditional novelty 5.0

ResNet models classify four particle types and regress vertex, direction, and momentum in Hyper-Kamiokande with resolutions matching likelihood methods but at 30,000-50,000x faster inference on GPU.
Probabilistic Hysteresis Factor Prediction for Electric Vehicle Batteries with Graphite Anodes Containing Silicon
cs.LG 2026-03 unverdicted novelty 5.0

A data-driven probabilistic approach predicts the hysteresis factor for silicon-graphite anode batteries in electric vehicles, with tests for generalization across vehicle models.
DoSReMC: Domain Shift Resilient Mammography Classification using Batch Normalization Adaptation
eess.IV 2025-08 unverdicted novelty 5.0

DoSReMC improves cross-domain generalization in mammography classification by fine-tuning only batch normalization and fully connected layers of pretrained CNNs while preserving convolutional filters, combined with ad...
Model-independent calibration of Gamma-Ray Bursts with neural networks
astro-ph.CO 2024-11 unverdicted novelty 5.0

Neural networks calibrate 2D and 3D Dainotti relations on the Platinum GRB sample via ANN-driven MCMC to produce a model-independent Hubble diagram with reduced scatter.
YOLOv4: Optimal Speed and Accuracy of Object Detection
cs.CV 2020-04 unverdicted novelty 5.0

YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
Product Image Recognition with Guidance Learning and Noisy Supervision
cs.CV 2019-07 unverdicted novelty 5.0

Presents the Product-90 noisy product image dataset and a guidance learning method that combines noisy labels with teacher soft labels to train CNNs, reporting gains over prior methods on Product-90 and three public n...
Mitigating the Hubness Problem for Zero-Shot Learning of 3D Objects
cs.CV 2019-07 unverdicted novelty 5.0

A specialized loss mitigates hubness bias in 3D zero-shot learning and sets new state-of-the-art results on ModelNet40, ModelNet10, McGill, and SHREC2015 for both ZSL and GZSL.
Signal Conditioning for Learning in the Wild
cs.NE 2019-07 unverdicted novelty 5.0

Olfactory-inspired signal conditioning regularizes diverse inputs so a single brain-mimetic network performs classification across gas sensing, remote sensing, and species identification without hyperparameter changes.
AVD: Adversarial Video Distillation
cs.CV 2019-07 unverdicted novelty 5.0

AVD maps videos to semantically realistic 2D images via 3D conv encoder-decoder plus adversarial training, enabling image-based classifiers to perform video activity recognition.
UnsuperPoint: End-to-end Unsupervised Interest Point Detector and Descriptor
cs.CV 2019-07 unverdicted novelty 5.0

UnsuperPoint is an end-to-end unsupervised interest point detector and descriptor trained with a siamese network and novel losses for uniform distribution and repeatability, running in real time with competitive resul...
Neuron ranking -- an informed way to condense convolutional neural networks architecture
cs.LG 2019-07 unverdicted novelty 5.0

Shapley value and variational importance switch methods produce consistent rankings of filter importance in CNNs, enabling compression and interpretability.
Semantic Product Search
cs.IR 2019-07 unverdicted novelty 5.0

A neural semantic matcher for product search uses a custom loss on behavior data, n-gram pooling, and hashing to beat prior methods by 4.7% Recall@100 and 14.5% MAP.
New pointwise convolution in Deep Neural Networks through Extremely Fast and Non Parametric Transforms
cs.CV 2019-06 unverdicted novelty 5.0

Replacing pointwise convolutions with DWHT yields a model with 79.1% fewer parameters, 48.4% fewer FLOPs, and 1.49% higher accuracy than MobileNet-V1 on CIFAR-100.
Efficient Multi-Domain Network Learning by Covariance Normalization
cs.CV 2019-06 unverdicted novelty 5.0

CovNorm reduces parameters in domain-adaptive layers via two PCAs and a mini-adaptation layer, enabling efficient multi-domain learning with performance close to full fine-tuning.
Multimodal and Multi-view Models for Emotion Recognition
cs.CL 2019-06 unverdicted novelty 5.0

Multimodal training with attention and contrastive multi-view learning improves both combined and acoustic-only emotion recognition on IEMOCAP over prior acoustic baselines.
Complex Signal Denoising and Interference Mitigation for Automotive Radar Using Convolutional Neural Networks
eess.SP 2019-06 unverdicted novelty 5.0

CNNs trained on simulated data outperform conventional methods for complex signal denoising and interference mitigation in automotive radar.
Deep Single Image Deraining Via Estimating Transmission and Atmospheric Light in rainy Scenes
cs.CV 2019-06 unverdicted novelty 5.0

A deep network estimates per-image atmospheric light and a transmission map, then recovers a clear image from the atmospheric scattering model, outperforming prior deraining methods.
Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling
cs.SD 2019-06 unverdicted novelty 5.0

Deep autoregressive models with F0 discretization, post-processing, and self-attention prenet outperform RNNs in objective and subjective metrics for singing voice synthesis on a Chinese corpus.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 84 Pith papers

[1]

Understanding the difficulty of training deep feedforward neural networks

Bengio, Yoshua and Glorot, Xavier. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp.\ 249--256, May 2010

work page 2010
[2]

Large scale distributed deep networks

Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc'Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012

work page 2012
[3]

Natural neural networks

Desjardins, Guillaume and Kavukcuoglu, Koray. Natural neural networks. (unpublished)

work page
[4]

Adaptive subgradient methods for online learning and stochastic optimization

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121-2159, July 2011. ISSN 1532-4435

work page 2011
[5]

Knowledge matters: Importance of prior information for optimization

Gülçehre, Çaglar and Bengio, Yoshua. Knowledge matters: Importance of prior information for optimization. CoRR, abs/1301.4083, 2013

work page arXiv 2013
[6]

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv e-prints, February 2015

work page 2015
[7]

Independent component analysis: Algorithms and applications

Hyvärinen, A. and Oja, E. Independent component analysis: Algorithms and applications. Neural Netw., 13(4-5):411-430, May 2000

work page 2000
[8]

A literature survey on domain adaptation of statistical classifiers, 2008

Jiang, Jing. A literature survey on domain adaptation of statistical classifiers, 2008

work page 2008
[9]

Gradient-based learning applied to document recognition

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998a

work page 1998
[10]

Efficient backprop

LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and Muller, K. (eds.), Neural Networks: Tricks of the trade. Springer, 1998b

work page 1998
[11]

Nonlinear image representation using divisive normalization

Lyu, S and Simoncelli, E P. Nonlinear image representation using divisive normalization. In Proc. Computer Vision and Pattern Recognition, pp. 1-8. IEEE Computer Society, Jun 23-28 2008. doi:10.1109/CVPR.2008.4587821

work page doi:10.1109/cvpr.2008.4587821 2008
[12]

Rectified linear units improve restricted boltzmann machines

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807-814. Omnipress, 2010

work page 2010
[13]

On the difficulty of training recurrent neural networks

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 1310-1318, 2013

work page 2013
[14]

Parallel training of DNNs with Natural Gradient and Parameter Averaging

Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of deep neural networks with natural gradient and parameter averaging. CoRR, abs/1410.7455, 2014

work page Pith review arXiv 2014
[15]

Deep learning made easier by linear transformations in perceptrons

Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 924-932, 2012

work page 2012
[16]

ImageNet Large Scale Visual Recognition Challenge, 2014

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 2014

work page 2014
[17]

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013

work page Pith review arXiv 2013
[18]

Improving predictive inference under covariate shift by weighting the log-likelihood function

Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, October 2000

work page 2000
[19]

Dropout: A simple way to prevent neural networks from overfitting

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929-1958, January 2014

work page 2014
[20]

On the importance of initialization and momentum in deep learning

Sutskever, Ilya, Martens, James, Dahl, George E., and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. In ICML (3), volume 28 of JMLR Proceedings, pp. 1139-1147. JMLR.org, 2013

work page 2013
[21]

Going Deeper with Convolutions

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. CoRR, abs/1409.4842, 2014

work page Pith review arXiv 2014
[22]

A convergence analysis of log-linear training

Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 657-665, Granada, Spain, December 2011

work page 2011
[23]

Mean-normalized stochastic gradient for large-scale deep learning

Wiesler, Simon, Richard, Alexander, Schlüter, Ralf, and Ney, Hermann. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 180-184, Florence, Italy, May 2014

work page 2014
[24]

Deep image: Scaling up image recognition, 2015

Wu, Ren, Yan, Shengen, Shan, Yi, Dang, Qingqing, and Sun, Gang. Deep image: Scaling up image recognition, 2015

work page 2015
