pith. sign in

arxiv: 1907.00593 · v1 · pith:D36LPGMFnew · submitted 2019-07-01 · 💻 cs.LG · cs.CV· stat.ML

Weight Normalization based Quantization for Deep Neural Network Compression

Pith reviewed 2026-05-25 11:42 UTC · model grok-4.3

classification 💻 cs.LG cs.CVstat.ML
keywords model compressionquantizationweight normalizationdeep neural networksCIFAR-100ImageNet
0
0 comments X

The pith

Weight normalization before quantization avoids long-tail weight distributions and lowers quantization error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces weight normalization based quantization (WNQ) as a way to compress deep neural network models. It claims that normalizing weights first prevents the long-tail distributions that inflate quantization error in existing methods. This enables more accurate compressed models suitable for mobile and embedded deployment. Experiments are reported to show gains over prior quantization baselines on standard image datasets.

Core claim

WNQ adopts weight normalization to avoid the long-tail distribution of network weights and subsequently reduces the quantization error. Experiments on CIFAR-100 and ImageNet show that WNQ can outperform other baselines to achieve state-of-the-art performance.

What carries the argument

Weight normalization applied to network weights before quantization, intended to reshape their distribution and cut quantization error.

If this is right

  • WNQ produces lower quantization error than standard quantization methods.
  • Compressed models retain higher accuracy on CIFAR-100 and ImageNet classification.
  • WNQ reaches state-of-the-art results among quantization-based compression techniques on the tested datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The normalization step could be tested as a preprocessing module inside other quantization or pruning pipelines.
  • Whether the same distribution shift helps quantization of non-vision models such as language or reinforcement-learning networks remains open.
  • If the benefit holds only for certain layer types, selective application per layer might improve results further.

Load-bearing premise

That weight normalization will avoid the long-tail distribution of network weights and thereby reduce quantization error.

What would settle it

Direct measurement of weight histograms after normalization that still show long tails, or a side-by-side quantization error comparison where WNQ does not reduce error relative to un-normalized quantization.

Figures

Figures reproduced from arXiv: 1907.00593 by Wen-Pu Cai, Wu-Jun Li.

Figure 1
Figure 1. Figure 1: One layer in LQ-Net Forward Backward w ŵ ŵ q wq x y ∂ŵ q ∂ŵ = 1 w max(|w|) Step1 Step2 Step3 cad [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of float weights w on some selected layers of ResNet20 on CIFAR-100 in 2-bit setting. Top row is WNQ and bottom row is LQ-Net. Red dots on x-axis are the average quantization levels in this layer. “mse” in each figure denotes the relative mean-squared quantization error of the layer defined in Section 4.4 [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Top1 accuracy of ResNet18 on Ima￾geNet. distribution will cause a larger quantization error which is denoted as “mse” (relative mean-squared quantization error) in [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

With the development of deep neural networks, the size of network models becomes larger and larger. Model compression has become an urgent need for deploying these network models to mobile or embedded devices. Model quantization is a representative model compression technique. Although a lot of quantization methods have been proposed, many of them suffer from a high quantization error caused by a long-tail distribution of network weights. In this paper, we propose a novel quantization method, called weight normalization based quantization (WNQ), for model compression. WNQ adopts weight normalization to avoid the long-tail distribution of network weights and subsequently reduces the quantization error. Experiments on CIFAR-100 and ImageNet show that WNQ can outperform other baselines to achieve state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes Weight Normalization based Quantization (WNQ) as a model compression technique. It asserts that weight normalization avoids long-tail weight distributions in DNNs and thereby reduces quantization error, with experiments on CIFAR-100 and ImageNet demonstrating outperformance over baselines and state-of-the-art results.

Significance. If the central mechanism were validated through distributional analysis and direct quantization-error measurements, the approach could supply a lightweight preprocessing step for existing quantizers. The reported accuracy numbers on standard benchmarks indicate possible practical value for embedded deployment, but the missing link between normalization and error reduction prevents the result from being assessed as a clear advance.

major comments (3)
  1. [Abstract] Abstract: the claim that WNQ 'adopts weight normalization to avoid the long-tail distribution of network weights and subsequently reduces the quantization error' is stated without any derivation, cumulative-distribution analysis, or expected-error calculation showing how the normalization transform alters the weight statistics or lowers quantization error.
  2. [Experiments] Experiments (CIFAR-100 and ImageNet results): accuracy improvements are reported, yet no before/after weight histograms, no measured reduction in quantization error (e.g., L2 or per-layer), and no ablation that isolates the distribution-normalization effect from other quantization choices are supplied, leaving the stated mechanism unsupported.
  3. [Method] Method section: no equation or analysis demonstrates that the weight-normalization step changes the tail behavior in a manner that is guaranteed (or even expected) to reduce quantization error for the subsequent uniform or non-uniform quantizer.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the manuscript would be strengthened by explicit distributional analysis, error measurements, and ablations to support the claimed mechanism, and we will revise the paper to address these points.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that WNQ 'adopts weight normalization to avoid the long-tail distribution of network weights and subsequently reduces the quantization error' is stated without any derivation, cumulative-distribution analysis, or expected-error calculation showing how the normalization transform alters the weight statistics or lowers quantization error.

    Authors: We acknowledge that the abstract states the mechanism without supporting derivation or analysis in the current manuscript. In revision we will move a concise derivation and reference to cumulative-distribution effects into the method section and update the abstract to point to it. revision: yes

  2. Referee: [Experiments] Experiments (CIFAR-100 and ImageNet results): accuracy improvements are reported, yet no before/after weight histograms, no measured reduction in quantization error (e.g., L2 or per-layer), and no ablation that isolates the distribution-normalization effect from other quantization choices are supplied, leaving the stated mechanism unsupported.

    Authors: We agree these supporting measurements and ablations are absent. The revised version will add before/after weight histograms, per-layer L2 quantization-error reductions, and an ablation isolating the normalization step. revision: yes

  3. Referee: [Method] Method section: no equation or analysis demonstrates that the weight-normalization step changes the tail behavior in a manner that is guaranteed (or even expected) to reduce quantization error for the subsequent uniform or non-uniform quantizer.

    Authors: The current method section describes the procedure but does not contain the requested analysis. We will add an analytical subsection with equations showing how the normalization transform reduces tail mass and its expected effect on uniform quantization error. revision: yes

Circularity Check

0 steps flagged

No circularity: technique proposed by design choice with no self-referential reduction in any derivation chain.

full rationale

The paper introduces WNQ as a method that adopts weight normalization to avoid long-tail weight distributions. This is stated as an assertion in the abstract and central claim without any equations, fitted parameters renamed as predictions, or self-citations that would make the outcome equivalent to the input by construction. No load-bearing step reduces to a tautology (e.g., no Eq. X defined in terms of the claimed effect of X). Experiments report accuracy but do not involve the circular patterns of fitted-input-called-prediction or ansatz-smuggled-in-via-citation. The derivation chain, such as it exists, is self-contained as a proposal rather than a closed loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described or can be extracted.

pith-pipeline@v0.9.0 · 5647 in / 1065 out tokens · 70138 ms · 2026-05-25T11:42:06.509156+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 7 internal anchors

  1. [1]

    Deep learning with low precision by half-wave gaussian quantization

    Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In IEEE Conference on Computer Vision and Pattern Recognition, 2017

  2. [2]

    Binaryconnect: Training deep neural networks with binary weights during propagations

    Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, 2015

  3. [3]

    Regularizing Activation Distribution for Training Binarized Deep Networks

    Ruizhou Ding, Ting-Wu Chin, Zeye Liu, and Diana Marculescu. Regularizing activation distribution for training binarized deep networks. arXiv preprint arXiv:1904.02823, 2019

  4. [4]

    Heterogeneous bitwidth binarization in convolutional neural networks

    Joshua Fromm, Shwetak Patel, and Matthai Philipose. Heterogeneous bitwidth binarization in convolutional neural networks. In Advances in Neural Information Processing Systems, 2018

  5. [5]

    Network sketching: Exploiting binary structure in deep cnns

    Yiwen Guo, Anbang Yao, Hao Zhao, and Yurong Chen. Network sketching: Exploiting binary structure in deep cnns. In IEEE Conference on Computer Vision and Pattern Recognition, 2017

  6. [6]

    Learning both weights and connections for efficient neural network

    Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 2015

  7. [7]

    Identity mappings in deep residual networks

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, 2016

  8. [8]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016

  9. [9]

    Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration

    Yang He, Ping Liu, Ziwei Wang, and Yi Yang. Pruning filter via geometric median for deep convolutional neural networks acceleration. arXiv preprint arXiv:1811.00250, 2018

  10. [10]

    Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network using Truncated Gaussian Approximation

    Zhezhi He and Deliang Fan. Simultaneously optimizing weight and quantizer of ternary neural network using truncated gaussian approximation. arXiv preprint arXiv:1810.01018, 2018

  11. [11]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  12. [12]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017

  13. [13]

    Squeeze-and-excitation networks

    Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018

  14. [14]

    Binarized neural networks

    Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, 2016

  15. [15]

    Quantization and training of neural networks for efficient integer- arithmetic-only inference

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer- arithmetic-only inference. In IEEE Conference on Computer Vision and Pattern Recognition, 2018

  16. [16]

    Learning to quantize deep networks by optimizing quantization intervals with task loss

    Sangil Jung, Changyong Son, Seohyung Lee, Jinwoo Son, Jae-Joon Han, Youngjun Kwak, Sung Ju Hwang, and Changkyu Choi. Learning to quantize deep networks by optimizing quantization intervals with task loss. In IEEE Conference on Computer Vision and Pattern Recognition, 2019

  17. [17]

    Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications

    Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015

  18. [18]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009

  19. [19]

    Extremely low bit neural network: Squeeze the last bit out with admm

    Cong Leng, Zesheng Dou, Hao Li, Shenghuo Zhu, and Rong Jin. Extremely low bit neural network: Squeeze the last bit out with admm. In AAAI Conference on Artificial Intelligence, 2018

  20. [20]

    Ternary weight networks

    Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016

  21. [21]

    Pruning filters for efficient convnets

    Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017. 9

  22. [22]

    Synaptic strength for convolutional neural network

    Chen Lin, Zhao Zhong, Wei Wu, and Junjie Yan. Synaptic strength for convolutional neural network. In Advances in Neural Information Processing System, 2018

  23. [23]

    Towards accurate binary convolutional neural network

    Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. InAdvances in Neural Information Processing Systems, 2017

  24. [24]

    Thinet: A filter level pruning method for deep neural network compression

    Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision, 2017

  25. [25]

    Tensorizing neural networks

    Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, 2015

  26. [26]

    Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration

    Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration. PyTorch: Tensors and Dynamic Neural Networks in Python with Strong GPU Acceleration, 6, 2017

  27. [27]

    Extreme network compression via filter group approximation

    Bo Peng, Wenming Tan, Zheyang Li, Shun Zhang, Di Xie, and Shiliang Pu. Extreme network compression via filter group approximation. In European Conference on Computer Vision, 2018

  28. [28]

    Xnor-net: Imagenet clas- sification using binary convolutional neural networks

    Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet clas- sification using binary convolutional neural networks. In European Conference on Computer Vision , 2016

  29. [29]

    Imagenet large scale visual recognition challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015

  30. [30]

    Learning discrete weights using the local reparameterization trick

    Oran Shayer, Dan Levi, and Ethan Fetaya. Learning discrete weights using the local reparameterization trick. In International Conference on Learning Representations, 2018

  31. [31]

    Tbn: Convolutional neural network with ternary inputs and binary weights

    Diwen Wan, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, and Heng Tao Shen. Tbn: Convolutional neural network with ternary inputs and binary weights. In Proceedings of the European Conference on Computer Vision (ECCV), pages 315–332, 2018

  32. [32]

    Learning structured sparsity in deep neural networks

    Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, 2016

  33. [33]

    Lq-nets: Learned quantization for highly accurate and compact deep neural networks

    Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. In European Conference on Computer Vision, 2018

  34. [34]

    Deep mutual learning

    Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018

  35. [35]

    DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients

    Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016

  36. [36]

    Trained ternary quantization

    Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. In International Conference on Learning Representations, 2017

  37. [37]

    Towards effective low-bitwidth convolutional neural networks

    Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid. Towards effective low-bitwidth convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018. 10