Single-bit-per-weight deep convolutional neural networks without batch-normalization layers for embedded systems

Andre van Schaik; Hesham Mostafa; Mark D. McDonnell; Runchun Wang

arXiv: 1907.06916 · v2 · pith:M3ZKGF3Bnew · submitted 2019-07-16 · cs.LG · cs.CV · cs.NE · stat.ML

Pith reviewed 2026-05-24 21:03 UTC · model grok-4.3

classification: cs.LG · cs.CV · cs.NE · stat.ML

keywords: batch normalization, shifted-ReLU, single-bit weights, convolutional neural networks, embedded systems, wide residual networks, image classification

The pith

Batch-normalization layers do not consistently improve accuracy over shifted-ReLU in single-bit weight networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether batch-normalization layers remain necessary in deep convolutional networks when weights are limited to a single bit for embedded hardware. It replaces them with shifted-ReLU layers inside wide residual networks and measures performance on ImageNet, CIFAR-10 and CIFAR-100. The accuracy gap between the two choices varies with dataset, network depth and bit depth, and is often small. Shifted-ReLU versions therefore deliver comparable results while cutting the memory, speed and complexity costs that batch normalization imposes on low-power devices.

Core claim

Experiments with wide residual networks applied to the ImageNet, CIFAR 10 and CIFAR 100 image classification datasets show that batch-normalization layers do not consistently offer a significant advantage. The accuracy margin offered by batch-normalization layers depends on the data set, the network size, and the bit-depth of weights. Shifted-ReLU layers can often be used instead with no significant accuracy cost and provide advantages in speed, memory and complexity.

What carries the argument

Shifted-ReLU layers used in place of batch-normalization layers inside single-bit-per-weight wide residual networks.

If this is right

Shifted-ReLU networks avoid the computational overhead and small-batch training problems introduced by batch normalization.
Single-bit weight networks remain competitive in accuracy when batch normalization is removed.
Residual connections appear sufficient to maintain training stability without batch normalization in the tested regimes.
Designers of embedded vision systems can drop batch-normalization layers when hardware constraints make them costly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same substitution may work in detection or segmentation networks that also rely on single-bit weights.
Training with very small batches could become more reliable if shifted-ReLU replaces batch normalization.
Hardware implementations could be simplified by removing the running-mean and variance tracking required by batch normalization.
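
To make the last item concrete, here is a minimal Python sketch of the per-channel state that batch normalization carries. It assumes the common exponential-moving-average convention for running statistics; the function names, the `momentum` value and the `eps` default are illustrative assumptions, not details taken from the paper.

```python
import math

def batchnorm_inference(x, running_mean, running_var, gamma, beta, eps=1e-5):
    """Normalize one activation using the stored per-channel statistics."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta

def update_running_stats(running_mean, running_var, batch, momentum=0.1):
    """Training-time bookkeeping: blend batch statistics into the running ones.

    The (1 - momentum) / momentum weighting is an assumed convention.
    """
    batch_mean = sum(batch) / len(batch)
    batch_var = sum((v - batch_mean) ** 2 for v in batch) / len(batch)
    new_mean = (1 - momentum) * running_mean + momentum * batch_mean
    new_var = (1 - momentum) * running_var + momentum * batch_var
    return new_mean, new_var

# One channel, one update step on a toy batch of two activations.
mean_after, var_after = update_running_stats(0.0, 1.0, [1.0, 3.0])
```

A shifted-ReLU layer has no counterpart to `running_mean`, `running_var`, `gamma` or `beta`, which is the simplification this item points at.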

Load-bearing premise

That results obtained with wide residual networks on three image-classification datasets generalize to other architectures and tasks.

What would settle it

A new experiment on a different dataset or network size in which batch-normalization layers produce large, consistent accuracy gains across all tested bit depths would falsify the central claim.

Figures

Figures reproduced from arXiv: 1907.06916 by Andre van Schaik, Hesham Mostafa, Mark D. McDonnell, Runchun Wang.

**Figure 1.** Shifted Rectified Linear Unit (sReLU) activation function. The sReLU activation function lets negative inputs pass through, between 0 and some negative constant, in this case equal to −1. While the Exponential Linear Unit (ELU) is more popular, we have found sReLU to be equally effective, and less computationally demanding, due to avoiding calculation of an exponential.
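
The activation described in the Figure 1 caption can be sketched in a few lines of Python. This is a reading of the caption (clip from below at −1, identity elsewhere), not the authors' code; the function names and the `alpha` parameter of the ELU are illustrative assumptions.

```python
import math

def srelu(x, shift=-1.0):
    """Shifted ReLU: identity above `shift`, clipped to `shift` below it."""
    return max(x, shift)

def elu(x, alpha=1.0):
    """Exponential Linear Unit, shown for comparison; needs an exponential."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# Both functions are bounded below by -1, but only ELU calls exp().
samples = [-3.0, -0.5, 0.0, 2.0]
srelu_out = [srelu(v) for v in samples]
elu_out = [elu(v) for v in samples]
```

The comparison shows why the caption calls sReLU less computationally demanding: it needs a single comparison per activation, while ELU evaluates an exponential for every negative input.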

**Figure 2.** Wide ResNet architecture for Baseline CIFAR models where BN layers are used. This architecture is nearly identical to that of [10], except here there is no optional ReLU applied to the input. Note the ordering of the final layers, where global average pooling (GAP) is used after a final 1×1 convolutional layer, that reduces the number of channels to equal the number of classes, and then feeds directly to t…
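
The ordering of the final layers in the Figure 2 caption (a 1×1 convolution down to one channel per class, then global average pooling) can be illustrated with a toy Python sketch. The nested-list layout, the function names and the example numbers are assumptions made for illustration; whatever follows the pooling step is truncated in the caption and is not modelled here.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution.

    feature_map: H x W x C nested lists; weights: C x K nested lists.
    Returns an H x W x K feature map (K = number of classes).
    """
    num_classes = len(weights[0])
    return [
        [
            [sum(pixel[c] * weights[c][k] for c in range(len(pixel)))
             for k in range(num_classes)]
            for pixel in row
        ]
        for row in feature_map
    ]

def global_average_pool(feature_map):
    """Average each channel over all spatial positions; returns K values."""
    pixels = [pixel for row in feature_map for pixel in row]
    num_channels = len(pixels[0])
    return [sum(p[k] for p in pixels) / len(pixels) for k in range(num_channels)]

# Toy input: 2 x 2 spatial positions, 2 channels, mapped to 3 classes.
fmap = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 1.0], [0.0, 0.0]]]
w = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
scores = global_average_pool(conv1x1(fmap, w))  # one score per class
```

Placing the 1×1 convolution before the pooling means the network ends with one averaged score per class and no separate fully connected layer.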

**Figure 3.** Changes when training for 1-bit-per-weight. When we train 1-bit-per-weight networks following the method of [10], we apply the sign operator to full-precision copies of weights during training, and then scale by a constant equal to the initial standard deviation of the weights according to the method of [15].
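
The weight transformation described in the Figure 3 caption (sign of the full-precision copy, multiplied by a fixed per-layer constant) can be sketched as follows. The caption only says the constant equals the initial standard deviation of the weights; the He-style formula `sqrt(2 / fan_in)` used here, the choice to map an exact zero to the positive value, and all names are assumptions for illustration. How gradients reach the full-precision copies during training is not shown.

```python
import math

def initial_std(fan_in):
    """Assumed He-style initial standard deviation for a layer."""
    return math.sqrt(2.0 / fan_in)

def binarize(weights, scale):
    """Replace each weight by +scale or -scale according to its sign.

    Zero is mapped to +scale here; that tie-break is an assumption.
    """
    return [scale if w >= 0 else -scale for w in weights]

# Full-precision copies kept during training (made-up values).
full_precision = [0.31, -0.02, 0.0, -0.77]
scale = initial_std(fan_in=8)
deployed = binarize(full_precision, scale)
```

Because `scale` is a single constant per layer, only the sign of each weight needs to be stored at deployment, which is what makes the weights one bit each.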

**Figure 4.** Wide ResNet architecture for CIFAR when all BN layers are replaced by sReLUs. The architecture is identical to that of …

**Figure 5.** Spread of results: CIFAR 10, Width 4. The circle markers show the mean from 10 repeated runs for each of the 4 model types, using different random seeds for each repeat, but the same seed for each model. The error bars indicate the maximum and minimum errors over the 10 repeated runs.

**Figure 6.** Spread of results: CIFAR 100, Width 4. The circle markers show the mean from 10 repeated runs for each of the 4 model types, using different random seeds for each repeat, but the same seed for each model. The error bars indicate the maximum and minimum errors over the 10 repeated runs.
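
The summary statistics described in the captions of Figures 5 and 6 (a mean marker with error bars reaching to the minimum and maximum over 10 runs) can be computed as in the sketch below. The error rates are made-up placeholder values, not results from the paper, and the function name is an assumption.

```python
from statistics import mean

def summarize_runs(error_rates):
    """Return the marker position and asymmetric error-bar lengths."""
    centre = mean(error_rates)
    return {
        "mean": centre,
        "min": min(error_rates),
        "max": max(error_rates),
        "bar_below": centre - min(error_rates),  # length of the lower bar
        "bar_above": max(error_rates) - centre,  # length of the upper bar
    }

# Placeholder error rates (%) for 10 repeated runs of one model type.
runs = [5.0, 5.5, 4.5, 5.0, 5.2, 4.8, 5.1, 4.9, 5.3, 4.7]
summary = summarize_runs(runs)
```

Min and max bars show the full spread of outcomes rather than a confidence interval, so overlapping bars between two model types only indicate that at least one run of each landed in the same range.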

Original abstract

Batch-normalization (BN) layers are thought to be an integrally important layer type in today's state-of-the-art deep convolutional neural networks for computer vision tasks such as classification and detection. However, BN layers introduce complexity and computational overheads that are highly undesirable for training and/or inference on low-power custom hardware implementations of real-time embedded vision systems such as UAVs, robots and Internet of Things (IoT) devices. They are also problematic when batch sizes need to be very small during training, and innovations such as residual connections introduced more recently than BN layers could potentially have lessened their impact. In this paper we aim to quantify the benefits BN layers offer in image classification networks, in comparison with alternative choices. In particular, we study networks that use shifted-ReLU layers instead of BN layers. We found, following experiments with wide residual networks applied to the ImageNet, CIFAR 10 and CIFAR 100 image classification datasets, that BN layers do not consistently offer a significant advantage. We found that the accuracy margin offered by BN layers depends on the data set, the network size, and the bit-depth of weights. We conclude that in situations where BN layers are undesirable due to speed, memory or complexity costs, that using shifted-ReLU layers instead should be considered; we found they can offer advantages in all these areas, and often do not impose a significant accuracy cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that shifted-ReLU can replace BN in 1-bit wide residual networks on CIFAR and ImageNet with little accuracy cost in most tested cases, but the tests stay inside one architecture family.

The main result is that BN layers do not give a consistent edge once you fix weights to single-bit values inside wide residual networks. On the three datasets the accuracy gap between BN and shifted-ReLU changes with dataset, depth, width, and bit depth, and shifted-ReLU wins or ties often enough that the authors recommend trying it when BN's memory and batch-size costs matter for embedded hardware. That comparison is new in the 1-bit setting even though both layer types are known separately.

The work is useful because it supplies concrete numbers on a practical trade-off rather than another new architecture. The experiments cover multiple network sizes and bit depths on public data, which lets a reader see the pattern without having to rerun everything. The hardware motivation is stated plainly and the conclusion stays within what the tables show.

The main limitation is the narrow scope. All results use wide residual networks, so residual connections may be masking cases where per-channel statistics from BN are needed for convergence in plain conv nets or other families. Spreads over 10 repeated runs are shown for the width-4 CIFAR models (Figures 5 and 6), but the material reproduced here gives no comparable variance figures for the other configurations, and the paper does not describe whether the no-BN models received the same hyperparameter effort as the BN versions. Those gaps make the general advice to consider shifted-ReLU tentative rather than settled.

The work is aimed at people who already train or deploy quantized convnets for low-power vision and need to decide whether to keep BN. A reader in that group can extract the tables and test the swap on their own model. It is worth sending to peer review because the empirical comparison is reproducible on standard benchmarks and the question is directly relevant to hardware constraints, even if referees will likely ask for more architectures and variance numbers.

Referee Report

3 major / 0 minor

Summary. The manuscript claims that batch normalization (BN) layers do not consistently offer a significant accuracy advantage in single-bit-per-weight wide residual networks (WRNs) for image classification. Experiments on ImageNet, CIFAR-10, and CIFAR-100 show that the accuracy margin of BN depends on the dataset, network size, and weight bit-depth; shifted-ReLU layers are presented as a viable substitute that can reduce speed, memory, and complexity costs with often no significant accuracy penalty, making them suitable for embedded systems.

Significance. If the empirical comparisons hold under rigorous validation, the result would be significant for quantized network design in low-power embedded vision applications, as it provides evidence that BN can be omitted in 1-bit WRNs without consistent accuracy loss. The controlled variations in network size and bit-depth across three public datasets add value, though the scoped architecture limits broader impact.

major comments (3)

[Abstract] Abstract: the central claim that 'BN layers do not consistently offer a significant advantage' rests on unreported quantitative margins; no accuracy differences, error bars, or statistical tests are described to support the conclusion that margins 'depend on the data set, the network size, and the bit-depth'.
[Experiments] Experiments section: results are obtained exclusively with wide residual networks; residual connections may mask cases where BN's per-channel statistics are required for stable convergence with strictly ±1 weights, and no evidence is given that findings transfer to non-residual convnets.
No details are provided on hyperparameter search, initialization, training schedules, or number of runs for the shifted-ReLU (no-BN) models, raising the possibility that observed equivalence or advantages are due to unequal optimization effort rather than inherent layer properties.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We respond to each major comment below, indicating revisions where appropriate to strengthen the paper.

Referee: [Abstract] Abstract: the central claim that 'BN layers do not consistently offer a significant advantage' rests on unreported quantitative margins; no accuracy differences, error bars, or statistical tests are described to support the conclusion that margins 'depend on the data set, the network size, and the bit-depth'.

Authors: The Experiments section reports accuracy values for BN and shifted-ReLU variants in tables across CIFAR-10, CIFAR-100, and ImageNet for multiple network widths and weight bit-depths. The margins are directly derivable from these tabulated results and vary as stated. We will revise the abstract to include specific quantitative margin examples drawn from the tables to make the dependence explicit.

revision: yes

Referee: [Experiments] Experiments section: results are obtained exclusively with wide residual networks; residual connections may mask cases where BN's per-channel statistics are required for stable convergence with strictly ±1 weights, and no evidence is given that findings transfer to non-residual convnets.

Authors: The study is scoped to wide residual networks, which are a standard and competitive choice for quantization experiments on image classification and directly relevant to embedded vision. The experiments systematically vary depth, width, and bit-depth within this architecture, demonstrating that shifted-ReLU substitutes for BN under these conditions. Extending the claims to non-residual networks lies outside the manuscript's stated scope.

revision: no

Referee: No details are provided on hyperparameter search, initialization, training schedules, or number of runs for the shifted-ReLU (no-BN) models, raising the possibility that observed equivalence or advantages are due to unequal optimization effort rather than inherent layer properties.

Authors: The Experiments section specifies that identical training protocols, including hyperparameters, initialization, and schedules, were used for both BN and shifted-ReLU models. We will revise the text to explicitly confirm this equivalence, add the requested details on hyperparameter selection, and state the number of runs performed for the no-BN variants.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical accuracy comparisons with no derivations or self-referential reductions

The paper conducts and reports direct experiments training wide residual networks (with 1-bit weights) on ImageNet/CIFAR-10/CIFAR-100, comparing variants that include or omit BN layers and substitute shifted-ReLU. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citations are invoked to derive the central claim; the accuracy margins are measured outcomes, not constructed by definition from the inputs. This matches the default case of an empirical study whose results stand or fall on the reported trials rather than any load-bearing reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is purely empirical and introduces no new mathematical axioms, free parameters, or invented entities; it relies on standard supervised training assumptions and the existence of the three public image datasets.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 21 internal anchors

[1] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167

[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Microsoft Research, Tech. Rep., 2015, arXiv:1512.03385

[3] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,” ArXiv e-prints, June 2017

[4] Y. Wu and K. He, “Group Normalization,” ArXiv e-prints, Mar. 2018

[6] “Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models.” [Online]. Available: http://arxiv.org/abs/1702.03275

[7] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1MB model size,” CoRR, vol. abs/1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360

[8] M. Courbariaux, Y. Bengio, and J. David, “BinaryConnect: Training Deep Neural Networks with binary weights during propagations,” CoRR, vol. abs/1511.00363, 2015. [Online]. Available: http://arxiv.org/abs/1511.00363

[9] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, “Deep neural networks are robust to weight binarization and other non-linear distortions,” CoRR, vol. abs/1606.01981, 2016. [Online]. Available: http://arxiv.org/abs/1606.01981

[10] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: Imagenet classification using binary convolutional neural networks,” CoRR, vol. abs/1603.05279, 2016. [Online]. Available: http://arxiv.org/abs/1603.05279

[11] M. D. McDonnell, “Training wide residual networks for deployment using a single bit for each weight,” 2018, in Proc. ICLR 2018; arXiv:1802.08530

[12] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in Neural Information Processing Systems, 2014, pp. 2654–2662

[13] A. Choromanska, M. Henaff, M. Mathieu, G. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” in Artificial Intelligence and Statistics, 2015, pp. 192–204

[14] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016

[15] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in AISTATS, vol. 9, 2010, pp. 249–256

[16] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE International Conference on Computer Vision (ICCV) (see arXiv:1502.01852), 2015

[17] D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?” The Journal of Machine Learning Research, vol. 11, pp. 625–660, 2010

[18] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. New York, NY, USA: ACM, 2009, pp. 609–616

[19] M. Ranzato, F. Huang, Y.-L. Boureau, and Y. LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–8

[20] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

[21] M. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012

[22] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, “Sharp minima can generalize for deep nets,” arXiv preprint arXiv:1703.04933, 2017

[23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

[24] A. Morcos, D. Barrett, N. Rabinowitz, and M. Botvinick, “On the importance of single directions for generalization,” arXiv preprint arXiv:1803.06959, 2018

[25] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242

[26] N. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” arXiv preprint arXiv:1609.04836, 2016

[27] J. Ba, J. Kiros, and G. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016

[28] T. Salimans and D. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 901–909

[29] Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller, “Efficient backprop,” in Neural Networks: Tricks of the Trade, this book is an outgrowth of a 1996 NIPS workshop. Springer-Verlag, 1998, pp. 9–50

[30] D. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” CoRR, vol. abs/1511.07289, 2015. [Online]. Available: http://arxiv.org/abs/1511.07289

[31] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift),” ArXiv e-prints, May 2018

[32] S. Zagoruyko and N. Komodakis, “Wide residual networks,” 2016, arXiv:1605.07146

[33] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” Microsoft Research, Tech. Rep., 2016, arXiv:1603.05027

[34] T. Devries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” CoRR, vol. abs/1708.04552, 2017. [Online]. Available: http://arxiv.org/abs/1708.04552

[35] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” ArXiv e-prints, Mar. 2015

[36] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” CoRR, vol. abs/1706.04599, 2017. [Online]. Available: http://arxiv.org/abs/1706.04599

[37] B. Moons, D. Bankman, L. Yang, B. Murmann, and M. Verhelst, “Binareye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS,” in IEEE Custom Integrated Circuits Conference, CICC, San Diego, April 8-11, 2018, pp. 1–4

[38] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in ISSCC. IEEE, 2016, pp. 262–263

[39] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, S. Takamaeda-Yamazaki, M. Ikebe, T. Asai, T. Kuroda, and M. Motomura, “Brein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W,” IEEE Journal of Solid-State Circuits, Dec. 2017

[1] [1]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502. 03167

work page internal anchor Pith review Pith/arXiv arXiv 2015

[2] [2]

Deep Residual Learning for Image Recognition

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Microsoft Research, Tech. Rep., 2015, arxiv.1512.03385

work page internal anchor Pith review Pith/arXiv arXiv 2015

[3] [3]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,

P. Goyal, P. Doll ´ar, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y . Jia, and K. He, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,” ArXiv e-prints, June 2017

work page 2017

[4] [4]

Group Normalization,

Y . Wu and K. He, “Group Normalization,” ArXiv e-prints, Mar. 2018

work page 2018

[5] [6]

Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

[Online]. Available: http://arxiv.org/abs/1702.03275

work page internal anchor Pith review Pith/arXiv arXiv

[6] [7]

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1MB model size,” CoRR, vol. abs/1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360

work page internal anchor Pith review Pith/arXiv arXiv 2016

[7] [8]

BinaryConnect: Training Deep Neural Networks with binary weights during propagations

M. Courbariaux, Y . Bengio, and J. David, “BinaryConnect: Training Deep Neural Networks with binary weights during propagations,” CoRR, vol. abs/1511.00363, 2015. [Online]. Available: http://arxiv.org/ abs/1511.00363

work page internal anchor Pith review Pith/arXiv arXiv 2015

[8] [9]

Deep neural networks are robust to weight binarization and other non-linear distortions

P. Merolla, R. Appuswamy, J. V . Arthur, S. K. Esser, and D. S. Modha, “Deep neural networks are robust to weight binarization and other non-linear distortions,” CoRR, vol. abs/1606.01981, 2016. [Online]. Available: http://arxiv.org/abs/1606.01981

work page internal anchor Pith review Pith/arXiv arXiv 2016

[9] [10]

XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

M. Rastegari, V . Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: Imagenet classiﬁcation using binary convolutional neural networks,” CoRR, vol. abs/1603.05279, 2016. [Online]. Available: http://arxiv.org/ abs/1603.05279

work page internal anchor Pith review Pith/arXiv arXiv 2016

[10] [11]

Training wide residual networks for deployment using a single bit for each weight

M. D. McDonnell, “Training wide residual networks for deployment using a single bit for each weight,” 2018, in Proc. ICLR 2018; arxiv: 1802.08530

work page internal anchor Pith review Pith/arXiv arXiv 2018

[11] [12]

Do deep nets really need to be deep?

J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in neural information processing systems , 2014, pp. 2654– 2662

work page 2014

[12] [13]

The loss surfaces of multilayer networks,

A. Choromanska, M. Henaff, M. Mathieu, G. Arous, and Y . LeCun, “The loss surfaces of multilayer networks,” in Artiﬁcial Intelligence and Statistics, 2015, pp. 192–204

work page 2015

[13] [14]

Understanding deep learning requires rethinking generalization

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understand- ing deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[14] [15]

Understanding the difﬁculty of training deep feedforward neural networks

X. Glorot and Y . Bengio, “Understanding the difﬁculty of training deep feedforward neural networks.” in AISTATS, vol. 9, 2010, pp. 249–256

work page 2010

[15] [16]

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectiﬁers: Surpassing human-level performance on ImageNet classiﬁcation,” in Proc. IEEE International Conference on Computer Vision (ICCV) (see arXiv:1502.01852), 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[16] [17]

Why does unsupervised pre-training help deep learning?

D. Erhan, Y . Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?” The Journal of Machine Learning Research, vol. 11, pp. 625–660, 2010

work page 2010

[17] [18]

Convolutional deep belief networks for scalable unsupervised learning of hierarchical repre- sentations,

H. Lee, R. Grosse, R. Ranganath, and A. Y . Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical repre- sentations,” in Proceedings of the 26th Annual International Conference on Machine Learning , ser. ICML ’09. New York, NY , USA: ACM, 2009, pp. 609–616

work page 2009

[18] [19]

Unsuper- vised learning of invariant feature hierarchies with applications to object recognition,

M. Ranzato, F. Huang, Y .-L. Boureau, and Y . LeCun, “Unsuper- vised learning of invariant feature hierarchies with applications to object recognition,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on . IEEE, 2007, pp. 1–8


[19] [20]

Adam: A Method for Stochastic Optimization

D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014


[20] [21]

ADADELTA: An Adaptive Learning Rate Method

M. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012


[21] [22]

Sharp Minima Can Generalize For Deep Nets

L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, “Sharp minima can generalize for deep nets,” arXiv preprint arXiv:1703.04933, 2017


[22] [23]

Long short-term memory

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997


[23] [24]

On the importance of single directions for generalization

A. Morcos, D. Barrett, N. Rabinowitz, and M. Botvinick, “On the importance of single directions for generalization,” arXiv preprint arXiv:1803.06959, 2018


[24] [25]

Improved techniques for training GANs

T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242


[25] [26]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

N. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” arXiv preprint arXiv:1609.04836, 2016


[26] [27]

Layer Normalization

J. Ba, J. Kiros, and G. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016


[27] [28]

Weight normalization: A simple reparameterization to accelerate training of deep neural networks

T. Salimans and D. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 901–909


[28] [29]

Efficient backprop

Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller, “Efficient backprop,” in Neural Networks: Tricks of the Trade (this book is an outgrowth of a 1996 NIPS workshop). Springer-Verlag, 1998, pp. 9–50


[29] [30]

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

D. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” CoRR, vol. abs/1511.07289, 2015. [Online]. Available: http://arxiv.org/abs/1511.07289


[30] [31]

How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)

S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift),” ArXiv e-prints, May 2018


[31] [32]

Wide Residual Networks

S. Zagoruyko and N. Komodakis, “Wide residual networks,” 2016, arXiv:1605.07146


[32] [33]

Identity Mappings in Deep Residual Networks

K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” Microsoft Research, Tech. Rep., 2016, arXiv:1603.05027


[33] [34]

Improved Regularization of Convolutional Neural Networks with Cutout

T. Devries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” CoRR, vol. abs/1708.04552, 2017. [Online]. Available: http://arxiv.org/abs/1708.04552


[34] [35]

Distilling the Knowledge in a Neural Network

G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” ArXiv e-prints, Mar. 2015


[35] [36]

On Calibration of Modern Neural Networks

C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” CoRR, vol. abs/1706.04599, 2017. [Online]. Available: http://arxiv.org/abs/1706.04599


[36] [37]

Binareye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS

B. Moons, D. Bankman, L. Yang, B. Murmann, and M. Verhelst, “Binareye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS,” in IEEE Custom Integrated Circuits Conference, CICC, San Diego, April 8-11, 2018, pp. 1–4


[37] [38]

Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks

Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in ISSCC. IEEE, 2016, pp. 262–263


[38] [39]

Brein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W

K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, S. Takamaeda-Yamazaki, M. Ikebe, T. Asai, T. Kuroda, and M. Motomura, “Brein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W,” IEEE Journal of Solid-State Circuits, Dec. 2017
