Single-bit-per-weight deep convolutional neural networks without batch-normalization layers for embedded systems
Pith reviewed 2026-05-24 21:03 UTC · model grok-4.3
The pith
Batch-normalization layers do not consistently improve accuracy over shifted-ReLU in single-bit weight networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments with wide residual networks applied to the ImageNet, CIFAR-10 and CIFAR-100 image classification datasets show that batch-normalization layers do not consistently offer a significant advantage. The accuracy margin they offer depends on the dataset, the network size, and the bit-depth of weights. Shifted-ReLU layers can often be used instead at no significant accuracy cost, and they offer advantages in speed, memory and complexity.
What carries the argument
Shifted-ReLU layers used in place of batch-normalization layers inside single-bit-per-weight wide residual networks.
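As a rough illustration of this substitution, the sketch below binarizes a weight vector to ±1 with the sign function and applies a shifted ReLU directly to the pre-activation, with no normalization step. This page does not state the paper's exact definition of shifted-ReLU; the form max(x, shift) with shift = -1 is an assumption, and the function names `binarize`, `shifted_relu` and `binary_neuron` are hypothetical.

```python
def binarize(weights):
    """Map each real-valued weight to +1 or -1 (zero is mapped to +1)."""
    return [1 if w >= 0 else -1 for w in weights]


def shifted_relu(x, shift=-1.0):
    """Assumed form of shifted ReLU: max(x, shift). shift=0 gives plain ReLU."""
    return max(x, shift)


def binary_neuron(inputs, weights, shift=-1.0):
    """Dot product with 1-bit weights, then shifted ReLU, with no BN in between."""
    signs = binarize(weights)
    pre_activation = sum(s * x for s, x in zip(signs, inputs))
    return shifted_relu(pre_activation, shift)
```

Because the shift is a fixed constant rather than a statistic of the batch, this unit behaves identically at any batch size, which is the property the summary above relies on.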
If this is right
- Shifted-ReLU networks avoid the computational overhead and small-batch training problems introduced by batch normalization.
- Single-bit weight networks remain competitive in accuracy when batch normalization is removed.
- Residual connections appear sufficient to maintain training stability without batch normalization in the tested regimes.
- Designers of embedded vision systems can drop batch-normalization layers when hardware constraints make them costly.
Where Pith is reading between the lines
- The same substitution may work in detection or segmentation networks that also rely on single-bit weights.
- Training with very small batches could become more reliable if shifted-ReLU replaces batch normalization.
- Hardware implementations could be simplified by removing the running-mean and variance tracking required by batch normalization.
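To make the hardware point concrete, the sketch below contrasts the standard batch-normalization inference formula, which reads per-channel running statistics and affine parameters, with a shifted ReLU that needs no stored per-channel values. The shifted-ReLU form max(x, -1) and the helper `stored_values_per_layer` are assumptions for illustration and are not taken from the paper.

```python
import math


def batchnorm_inference(x, running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Standard BN at inference: normalize with stored statistics, then scale and shift."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta


def shifted_relu(x, shift=-1.0):
    """Assumed shifted-ReLU form: max(x, shift), with a fixed constant shift."""
    return max(x, shift)


def stored_values_per_layer(channels, use_batchnorm):
    """Count per-channel values kept for the layer.

    BN keeps running mean, running variance, gamma and beta per channel;
    a shifted ReLU with a fixed shift keeps none.
    """
    return 4 * channels if use_batchnorm else 0
```

At inference the four BN values per channel can be folded into a single scale and offset, so the count above is an upper bound for deployment rather than a measured cost.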
Load-bearing premise
That results obtained with wide residual networks on three image-classification datasets generalize to other architectures and tasks.
What would settle it
A new experiment on a different dataset or network size in which batch-normalization layers produce large, consistent accuracy gains across all tested bit depths would falsify the central claim.
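One way to phrase this falsification test in code is sketched below. It assumes accuracies are recorded per configuration (for example per bit depth) for a network with BN and for its shifted-ReLU counterpart. The function names and the `min_gain` threshold are hypothetical; what counts as a "large" gain is a judgment this page does not quantify.

```python
def bn_margins(acc_with_bn, acc_without_bn):
    """Per-configuration margin: BN accuracy minus shifted-ReLU accuracy."""
    return {cfg: acc_with_bn[cfg] - acc_without_bn[cfg] for cfg in acc_with_bn}


def bn_gain_is_consistent(margins, min_gain):
    """True only if BN beats shifted-ReLU by at least min_gain in every configuration."""
    return all(margin >= min_gain for margin in margins.values())
```

If `bn_gain_is_consistent` returned True for a demanding `min_gain` on a new dataset or network size, that would count as evidence against the central claim.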
Original abstract
Batch-normalization (BN) layers are thought to be an integrally important layer type in today's state-of-the-art deep convolutional neural networks for computer vision tasks such as classification and detection. However, BN layers introduce complexity and computational overheads that are highly undesirable for training and/or inference on low-power custom hardware implementations of real-time embedded vision systems such as UAVs, robots and Internet of Things (IoT) devices. They are also problematic when batch sizes need to be very small during training, and innovations such as residual connections introduced more recently than BN layers could potentially have lessened their impact. In this paper we aim to quantify the benefits BN layers offer in image classification networks, in comparison with alternative choices. In particular, we study networks that use shifted-ReLU layers instead of BN layers. We found, following experiments with wide residual networks applied to the ImageNet, CIFAR 10 and CIFAR 100 image classification datasets, that BN layers do not consistently offer a significant advantage. We found that the accuracy margin offered by BN layers depends on the data set, the network size, and the bit-depth of weights. We conclude that in situations where BN layers are undesirable due to speed, memory or complexity costs, that using shifted-ReLU layers instead should be considered; we found they can offer advantages in all these areas, and often do not impose a significant accuracy cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that batch normalization (BN) layers do not consistently offer a significant accuracy advantage in single-bit-per-weight wide residual networks (WRNs) for image classification. Experiments on ImageNet, CIFAR-10, and CIFAR-100 show that the accuracy margin of BN depends on the dataset, network size, and weight bit-depth; shifted-ReLU layers are presented as a viable substitute that can reduce speed, memory, and complexity costs with often no significant accuracy penalty, making them suitable for embedded systems.
Significance. If the empirical comparisons hold under rigorous validation, the result would be significant for quantized network design in low-power embedded vision applications, as it provides evidence that BN can be omitted in 1-bit WRNs without consistent accuracy loss. The controlled variations in network size and bit-depth across three public datasets add value, though the scoped architecture limits broader impact.
Major comments (3)
- [Abstract] The central claim that 'BN layers do not consistently offer a significant advantage' rests on unreported quantitative margins; no accuracy differences, error bars, or statistical tests are described to support the conclusion that margins 'depend on the data set, the network size, and the bit-depth'.
- [Experiments] Results are obtained exclusively with wide residual networks; residual connections may mask cases where BN's per-channel statistics are required for stable convergence with strictly ±1 weights, and no evidence is given that findings transfer to non-residual convnets.
- No details are provided on hyperparameter search, initialization, training schedules, or number of runs for the shifted-ReLU (no-BN) models, raising the possibility that observed equivalence or advantages are due to unequal optimization effort rather than inherent layer properties.
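The kind of reporting these comments ask for can be sketched as follows: given test accuracies from several training runs of the BN model and the shifted-ReLU model, compute the mean margin and its standard error. Pairing runs by index (for example by random seed) is an assumption, and `summarize_margin` is a hypothetical helper, not something described in the manuscript.

```python
import statistics


def summarize_margin(bn_runs, no_bn_runs):
    """Return (mean, standard error) of the per-run margin: BN minus shifted-ReLU accuracy.

    Runs are paired by index, which assumes matching seeds or matching protocols.
    At least two paired runs are required to estimate the spread.
    """
    if len(bn_runs) != len(no_bn_runs) or len(bn_runs) < 2:
        raise ValueError("need at least two paired runs of equal length")
    diffs = [a - b for a, b in zip(bn_runs, no_bn_runs)]
    mean = statistics.mean(diffs)
    stderr = statistics.stdev(diffs) / len(diffs) ** 0.5
    return mean, stderr
```

A mean margin that is small relative to its standard error would support the "no consistent advantage" reading; a single run per configuration cannot distinguish the two cases.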
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We respond to each major comment below, indicating revisions where appropriate to strengthen the paper.
Point-by-point responses
- Referee: [Abstract] The central claim that 'BN layers do not consistently offer a significant advantage' rests on unreported quantitative margins; no accuracy differences, error bars, or statistical tests are described to support the conclusion that margins 'depend on the data set, the network size, and the bit-depth'.
  Authors: The Experiments section reports accuracy values for BN and shifted-ReLU variants in tables across CIFAR-10, CIFAR-100, and ImageNet for multiple network widths and weight bit-depths. The margins are directly derivable from these tabulated results and vary as stated. We will revise the abstract to include specific quantitative margin examples drawn from the tables to make the dependence explicit.
  Revision: yes
- Referee: [Experiments] Results are obtained exclusively with wide residual networks; residual connections may mask cases where BN's per-channel statistics are required for stable convergence with strictly ±1 weights, and no evidence is given that findings transfer to non-residual convnets.
  Authors: The study is scoped to wide residual networks, which are a standard and competitive choice for quantization experiments on image classification and directly relevant to embedded vision. The experiments systematically vary depth, width, and bit-depth within this architecture, demonstrating that shifted-ReLU substitutes for BN under these conditions. Extending the claims to non-residual networks lies outside the manuscript's stated scope.
  Revision: no
- Referee: No details are provided on hyperparameter search, initialization, training schedules, or number of runs for the shifted-ReLU (no-BN) models, raising the possibility that observed equivalence or advantages are due to unequal optimization effort rather than inherent layer properties.
  Authors: The Experiments section specifies that identical training protocols, including hyperparameters, initialization, and schedules, were used for both BN and shifted-ReLU models. We will revise the text to explicitly confirm this equivalence, add the requested details on hyperparameter selection, and state the number of runs performed for the no-BN variants.
  Revision: yes
Circularity Check
No circularity: purely empirical accuracy comparisons with no derivations or self-referential reductions
full rationale
The paper conducts and reports direct experiments training wide residual networks (with 1-bit weights) on ImageNet/CIFAR-10/CIFAR-100, comparing variants that include or omit BN layers and substitute shifted-ReLU. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citations are invoked to derive the central claim; the accuracy margins are measured outcomes, not constructed by definition from the inputs. This matches the default case of an empirical study whose results stand or fall on the reported trials rather than any load-bearing reduction.
Axiom & Free-Parameter Ledger