
arxiv: 1907.06916 · v2 · pith:M3ZKGF3Bnew · submitted 2019-07-16 · 💻 cs.LG · cs.CV · cs.NE · stat.ML

Single-bit-per-weight deep convolutional neural networks without batch-normalization layers for embedded systems

Pith reviewed 2026-05-24 21:03 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · cs.NE · stat.ML
keywords batch normalization · shifted-ReLU · single-bit weights · convolutional neural networks · embedded systems · wide residual networks · image classification

The pith

Batch-normalization layers do not consistently improve accuracy over shifted-ReLU in single-bit weight networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether batch-normalization layers remain necessary in deep convolutional networks when weights are limited to a single bit for embedded hardware. It replaces them with shifted-ReLU layers inside wide residual networks and measures performance on ImageNet, CIFAR-10 and CIFAR-100. The accuracy gap between the two choices varies with dataset, network depth and bit depth, and is often small. Shifted-ReLU versions therefore deliver comparable results while cutting the memory, speed and complexity costs that batch normalization imposes on low-power devices.

Core claim

Experiments with wide residual networks applied to the ImageNet, CIFAR 10 and CIFAR 100 image classification datasets show that batch-normalization layers do not consistently offer a significant advantage. The accuracy margin offered by batch-normalization layers depends on the data set, the network size, and the bit-depth of weights. Shifted-ReLU layers can often be used instead with no significant accuracy cost and provide advantages in speed, memory and complexity.

What carries the argument

Shifted-ReLU layers used in place of batch-normalization layers inside single-bit-per-weight wide residual networks.
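
As a minimal sketch of the activation named above, the snippet below follows the description in the paper's Figure 1 caption: negative inputs pass through down to a negative constant, here −1. The function names and the default `shift` argument are illustrative assumptions, and ELU is included only as the comparison point the caption mentions.

```python
import math

def srelu(x: float, shift: float = -1.0) -> float:
    """Shifted ReLU: pass x through unchanged above `shift`, clip to `shift` below."""
    return max(x, shift)

def elu(x: float, alpha: float = 1.0) -> float:
    """ELU, shown for comparison; negative inputs require an exponential."""
    return x if x > 0 else alpha * math.expm1(x)

inputs = [-3.0, -1.0, -0.5, 0.0, 2.0]
srelu_out = [srelu(v) for v in inputs]  # clipped at -1 for inputs below -1
elu_out = [elu(v) for v in inputs]      # saturates smoothly towards -alpha
```

Both functions saturate near −1 for strongly negative inputs; the sReLU gets there with a single comparison rather than an exponential, which is the cost difference the caption points to.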

If this is right

  • Shifted-ReLU networks avoid the computational overhead and small-batch training problems introduced by batch normalization.
  • Single-bit weight networks remain competitive in accuracy when batch normalization is removed.
  • Residual connections appear sufficient to maintain training stability without batch normalization in the tested regimes.
  • Designers of embedded vision systems can drop batch-normalization layers when hardware constraints make them costly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same substitution may work in detection or segmentation networks that also rely on single-bit weights.
  • Training with very small batches could become more reliable if shifted-ReLU replaces batch normalization.
  • Hardware implementations could be simplified by removing the running-mean and variance tracking required by batch normalization.
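
To make the last bullet concrete, here is a minimal sketch of the per-channel state that batch normalization carries and that a shifted-ReLU does not. The function names, the momentum value, and the exponential-moving-average update with a divide-by-n variance are illustrative assumptions, not details taken from the paper; frameworks differ in the exact update rule.

```python
import math

def update_running_stats(running_mean, running_var, batch, momentum=0.1):
    """One training-time update of a channel's running statistics.

    Assumed convention: exponential moving average with the given momentum
    and a biased (divide-by-n) batch variance.
    """
    n = len(batch)
    batch_mean = sum(batch) / n
    batch_var = sum((v - batch_mean) ** 2 for v in batch) / n
    new_mean = (1 - momentum) * running_mean + momentum * batch_mean
    new_var = (1 - momentum) * running_var + momentum * batch_var
    return new_mean, new_var

def batchnorm_inference(x, mean, var, gamma, beta, eps=1e-5):
    """Inference-time BN for one activation, using the stored statistics."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

def srelu(x, shift=-1.0):
    """Shifted ReLU: no statistics to track, only a fixed constant."""
    return max(x, shift)

# Hypothetical two-sample batch for a single channel.
mean, var = update_running_stats(0.0, 1.0, batch=[1.0, 3.0])
```

In this sketch the BN path needs a mean, a variance, a scale and an offset per channel, while the sReLU path needs only the shared constant.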

Load-bearing premise

That results obtained with wide residual networks on three image-classification datasets generalize to other architectures and tasks.

What would settle it

A new experiment on a different dataset or network size in which batch-normalization layers produce large, consistent accuracy gains across all tested bit depths would falsify the central claim.

Figures

Figures reproduced from arXiv: 1907.06916 by Andre van Schaik, Hesham Mostafa, Mark D. McDonnell, Runchun Wang.

Figure 1: Shifted Rectified Linear Unit (sReLU) activation function. The sReLU activation function lets negative inputs pass through, between 0 and some negative constant, in this case equal to −1. While the Exponential Linear Unit (ELU) is more popular, we have found sReLU to be equally effective, and less computationally demanding, due to avoiding calculation of an exponential.
Figure 2: Wide ResNet architecture for Baseline CIFAR models where BN layers are used. This architecture is nearly identical to that of [10], except here there is no optional ReLU applied to the input. Note the ordering of the final layers, where global average pooling (GAP) is used after a final 1×1 convolutional layer, that reduces the number of channels to equal the number of classes, and then feeds directly to t…
Figure 3: Changes when training for 1-bit-per-weight. When we train 1-bit-per-weight networks following the method of [10], we apply the sign operator to full-precision copies of weights during training, and then scale by a constant equal to the initial standard deviation of the weights according to the method of [15].
Figure 4: Wide ResNet architecture for CIFAR when all BN layers are replaced by sReLUs. The architecture is identical to that of …
Figure 5: Spread of results: CIFAR 10, Width 4. The circle markers show the mean from 10 repeated runs for each of the 4 model types, using different random seeds for each repeat, but the same seed for each model. The error bars indicate the maximum and minimum errors over the 10 repeated runs.
Figure 6: Spread of results: CIFAR 100, Width 4. The circle markers show the mean from 10 repeated runs for each of the 4 model types, using different random seeds for each repeat, but the same seed for each model. The error bars indicate the maximum and minimum errors over the 10 repeated runs.
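
The 1-bit training change described in the Figure 3 caption can be sketched roughly as follows: take the sign of each full-precision weight and multiply by a fixed per-layer constant equal to the initial weight standard deviation. The sketch covers only this forward-pass weight transformation. The helper names, the layer shape, the mapping of exactly-zero weights to the positive value, and the He-style formula sqrt(2 / fan_in) for the initial standard deviation are assumptions made for illustration.

```python
import math
import random

def he_std(fan_in: int) -> float:
    """Assumed initial standard deviation (He-style): sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

def binarize(weights, scale):
    """Forward-pass weights: sign of each full-precision weight times a fixed scale.

    Assumption: a weight of exactly zero is mapped to +scale.
    """
    return [scale if w >= 0 else -scale for w in weights]

random.seed(0)
fan_in = 3 * 3 * 16  # hypothetical 3x3 convolution with 16 input channels
scale = he_std(fan_in)

# Full-precision copies are kept during training; only their signs are deployed.
full_precision = [random.gauss(0.0, scale) for _ in range(8)]
deployed = binarize(full_precision, scale)
```

Because the scale is a single constant per layer, each deployed weight can be stored as one bit.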
Original abstract

Batch-normalization (BN) layers are thought to be an integrally important layer type in today's state-of-the-art deep convolutional neural networks for computer vision tasks such as classification and detection. However, BN layers introduce complexity and computational overheads that are highly undesirable for training and/or inference on low-power custom hardware implementations of real-time embedded vision systems such as UAVs, robots and Internet of Things (IoT) devices. They are also problematic when batch sizes need to be very small during training, and innovations such as residual connections introduced more recently than BN layers could potentially have lessened their impact. In this paper we aim to quantify the benefits BN layers offer in image classification networks, in comparison with alternative choices. In particular, we study networks that use shifted-ReLU layers instead of BN layers. We found, following experiments with wide residual networks applied to the ImageNet, CIFAR 10 and CIFAR 100 image classification datasets, that BN layers do not consistently offer a significant advantage. We found that the accuracy margin offered by BN layers depends on the data set, the network size, and the bit-depth of weights. We conclude that in situations where BN layers are undesirable due to speed, memory or complexity costs, that using shifted-ReLU layers instead should be considered; we found they can offer advantages in all these areas, and often do not impose a significant accuracy cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript claims that batch normalization (BN) layers do not consistently offer a significant accuracy advantage in single-bit-per-weight wide residual networks (WRNs) for image classification. Experiments on ImageNet, CIFAR-10, and CIFAR-100 show that the accuracy margin of BN depends on the dataset, network size, and weight bit-depth; shifted-ReLU layers are presented as a viable substitute that can reduce speed, memory, and complexity costs with often no significant accuracy penalty, making them suitable for embedded systems.

Significance. If the empirical comparisons hold under rigorous validation, the result would be significant for quantized network design in low-power embedded vision applications, as it provides evidence that BN can be omitted in 1-bit WRNs without consistent accuracy loss. The controlled variations in network size and bit-depth across three public datasets add value, though the scoped architecture limits broader impact.

major comments (3)
  1. [Abstract] Abstract: the central claim that 'BN layers do not consistently offer a significant advantage' rests on unreported quantitative margins; no accuracy differences, error bars, or statistical tests are described to support the conclusion that margins 'depend on the data set, the network size, and the bit-depth'.
  2. [Experiments] Experiments section: results are obtained exclusively with wide residual networks; residual connections may mask cases where BN's per-channel statistics are required for stable convergence with strictly ±1 weights, and no evidence is given that findings transfer to non-residual convnets.
  3. No details are provided on hyperparameter search, initialization, training schedules, or number of runs for the shifted-ReLU (no-BN) models, raising the possibility that observed equivalence or advantages are due to unequal optimization effort rather than inherent layer properties.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We respond to each major comment below, indicating revisions where appropriate to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'BN layers do not consistently offer a significant advantage' rests on unreported quantitative margins; no accuracy differences, error bars, or statistical tests are described to support the conclusion that margins 'depend on the data set, the network size, and the bit-depth'.

    Authors: The Experiments section reports accuracy values for BN and shifted-ReLU variants in tables across CIFAR-10, CIFAR-100, and ImageNet for multiple network widths and weight bit-depths. The margins are directly derivable from these tabulated results and vary as stated. We will revise the abstract to include specific quantitative margin examples drawn from the tables to make the dependence explicit. revision: yes

  2. Referee: [Experiments] Experiments section: results are obtained exclusively with wide residual networks; residual connections may mask cases where BN's per-channel statistics are required for stable convergence with strictly ±1 weights, and no evidence is given that findings transfer to non-residual convnets.

    Authors: The study is scoped to wide residual networks, which are a standard and competitive choice for quantization experiments on image classification and directly relevant to embedded vision. The experiments systematically vary depth, width, and bit-depth within this architecture, demonstrating that shifted-ReLU substitutes for BN under these conditions. Extending the claims to non-residual networks lies outside the manuscript's stated scope. revision: no

  3. Referee: No details are provided on hyperparameter search, initialization, training schedules, or number of runs for the shifted-ReLU (no-BN) models, raising the possibility that observed equivalence or advantages are due to unequal optimization effort rather than inherent layer properties.

    Authors: The Experiments section specifies that identical training protocols, including hyperparameters, initialization, and schedules, were used for both BN and shifted-ReLU models. We will revise the text to explicitly confirm this equivalence, add the requested details on hyperparameter selection, and state the number of runs performed for the no-BN variants. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical accuracy comparisons with no derivations or self-referential reductions

Full rationale

The paper conducts and reports direct experiments training wide residual networks (with 1-bit weights) on ImageNet/CIFAR-10/CIFAR-100, comparing variants that include or omit BN layers and substitute shifted-ReLU. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citations are invoked to derive the central claim; the accuracy margins are measured outcomes, not constructed by definition from the inputs. This matches the default case of an empirical study whose results stand or fall on the reported trials rather than any load-bearing reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is purely empirical and introduces no new mathematical axioms, free parameters, or invented entities; it relies on standard supervised training assumptions and the existence of the three public image datasets.



Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 21 internal anchors

  1. [1]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167

  2. [2]

    Deep Residual Learning for Image Recognition

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Microsoft Research, Tech. Rep., 2015, arxiv.1512.03385

  3. [3]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,” ArXiv e-prints, June 2017

  4. [4]

    Group Normalization

    Y. Wu and K. He, “Group Normalization,” ArXiv e-prints, Mar. 2018

  5. [6]
  6. [7]

    SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

    F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1MB model size,” CoRR, vol. abs/1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360

  7. [8]

    BinaryConnect: Training Deep Neural Networks with binary weights during propagations

    M. Courbariaux, Y. Bengio, and J. David, “BinaryConnect: Training Deep Neural Networks with binary weights during propagations,” CoRR, vol. abs/1511.00363, 2015. [Online]. Available: http://arxiv.org/abs/1511.00363

  8. [9]

    Deep neural networks are robust to weight binarization and other non-linear distortions

    P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, “Deep neural networks are robust to weight binarization and other non-linear distortions,” CoRR, vol. abs/1606.01981, 2016. [Online]. Available: http://arxiv.org/abs/1606.01981

  9. [10]

    XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

    M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: Imagenet classification using binary convolutional neural networks,” CoRR, vol. abs/1603.05279, 2016. [Online]. Available: http://arxiv.org/abs/1603.05279

  10. [11]

    Training wide residual networks for deployment using a single bit for each weight

    M. D. McDonnell, “Training wide residual networks for deployment using a single bit for each weight,” 2018, in Proc. ICLR 2018; arxiv: 1802.08530

  11. [12]

    Do deep nets really need to be deep?

    J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in neural information processing systems, 2014, pp. 2654–2662

  12. [13]

    The loss surfaces of multilayer networks

    A. Choromanska, M. Henaff, M. Mathieu, G. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” in Artificial Intelligence and Statistics, 2015, pp. 192–204

  13. [14]

    Understanding deep learning requires rethinking generalization

    C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016

  14. [15]

    Understanding the difficulty of training deep feedforward neural networks

    X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks.” in AISTATS, vol. 9, 2010, pp. 249–256

  15. [16]

    Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

    K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE International Conference on Computer Vision (ICCV) (see arXiv:1502.01852), 2015

  16. [17]

    Why does unsupervised pre-training help deep learning?

    D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?” The Journal of Machine Learning Research, vol. 11, pp. 625–660, 2010

  17. [18]

    Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations

    H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. New York, NY, USA: ACM, 2009, pp. 609–616

  18. [19]

    Unsupervised learning of invariant feature hierarchies with applications to object recognition

    M. Ranzato, F. Huang, Y.-L. Boureau, and Y. LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–8

  19. [20]

    Adam: A Method for Stochastic Optimization

    D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

  20. [21]

    ADADELTA: An Adaptive Learning Rate Method

    M. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012

  21. [22]

    Sharp Minima Can Generalize For Deep Nets

    L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, “Sharp minima can generalize for deep nets,” arXiv preprint arXiv:1703.04933, 2017

  22. [23]

    Long short-term memory

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

  23. [24]

    On the importance of single directions for generalization

    A. Morcos, D. Barrett, N. Rabinowitz, and M. Botvinick, “On the importance of single directions for generalization,” arXiv preprint arXiv:1803.06959, 2018

  24. [25]

    Improved techniques for training GANs

    T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242

  25. [26]

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

    N. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” arXiv preprint arXiv:1609.04836, 2016

  26. [27]

    Layer Normalization

    J. Ba, J. Kiros, and G. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016

  27. [28]

    Weight normalization: A simple reparameterization to accelerate training of deep neural networks

    T. Salimans and D. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 901–909

  28. [29]

    Efficient backprop

    Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller, “Efficient backprop,” in Neural Networks: Tricks of the Trade, this book is an outgrowth of a 1996 NIPS workshop. Springer-Verlag, 1998, pp. 9–50

  29. [30]

    Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

    D. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” CoRR, vol. abs/1511.07289, 2015. [Online]. Available: http://arxiv.org/abs/1511.07289

  30. [31]

    How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)

    S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift),” ArXiv e-prints, May 2018

  31. [32]

    Wide Residual Networks

    S. Zagoruyko and N. Komodakis, “Wide residual networks,” 2016, arxiv.1605.07146

  32. [33]

    Identity Mappings in Deep Residual Networks

    K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” Microsoft Research, Tech. Rep., 2016, arxiv.1603.05027

  33. [34]

    Improved Regularization of Convolutional Neural Networks with Cutout

    T. Devries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” CoRR, vol. abs/1708.04552, 2017. [Online]. Available: http://arxiv.org/abs/1708.04552

  34. [35]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” ArXiv e-prints, Mar. 2015

  35. [36]

    On Calibration of Modern Neural Networks

    C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” CoRR, vol. abs/1706.04599, 2017. [Online]. Available: http://arxiv.org/abs/1706.04599

  36. [37]

    Binareye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS

    B. Moons, D. Bankman, L. Yang, B. Murmann, and M. Verhelst, “Binareye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS,” in IEEE Custom Integrated Circuits Conference, CICC, San Diego, April 8-11, 2018, pp. 1–4

  37. [38]

    Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks

    Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks.” in ISSCC. IEEE, 2016, pp. 262–263

  38. [39]

    Brein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W

    K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, S. Takamaeda-Yamazaki, M. Ikebe, T. Asai, T. Kuroda, and M. Motomura, “Brein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W,” IEEE Journal of Solid-State Circuits, 12 2017