
arxiv: 1906.12172 · v1 · pith:FRAX6U6Gnew · submitted 2019-06-25 · 💻 cs.CV · cs.LG

New pointwise convolution in Deep Neural Networks through Extremely Fast and Non Parametric Transforms

Pith reviewed 2026-05-25 16:52 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords pointwise convolution · Discrete Walsh-Hadamard Transform · DWHT · neural network efficiency · parameter reduction · CIFAR-100 · MobileNet-V1 · non-parametric transforms

The pith

Fixed transforms like DWHT replace pointwise convolutions in neural networks, cutting parameters by 79% with an accuracy gain on CIFAR-100.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard transforms such as the Discrete Walsh-Hadamard Transform and Discrete Cosine Transform can serve as non-learnable replacements for pointwise convolution layers in deep neural networks. These transforms capture cross-channel correlations using fixed operations, primarily additions and subtractions for DWHT, which removes the need for trainable weights and, in the DWHT case, floating-point multiplications. As a result, the networks require far fewer parameters and floating-point operations yet achieve comparable or better accuracy on image classification. The fast implementation of DWHT further reduces the number of additions from O(n^2) to O(n log n). This produces highly efficient models, as shown by gains over the MobileNet-V1 baseline on CIFAR-100.
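To make the DCT side of that summary concrete, here is a minimal sketch of a fixed DCT used as a channel-mixing step. It assumes the orthonormal DCT-II, which is one common variant; the page does not say which variant the paper uses, and the function names `dct_matrix` and `dct_pointwise` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (assumed variant, not confirmed by the paper)."""
    k = np.arange(n)[:, None]   # output (frequency) index
    m = np.arange(n)[None, :]   # input (channel) index
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    d[0] /= np.sqrt(2.0)        # rescale the DC row so that d @ d.T is the identity
    return d

def dct_pointwise(x):
    """Mix the channels of a (C, H, W) feature map with a fixed DCT matrix."""
    d = dct_matrix(x.shape[0])  # fixed coefficients, nothing to train
    return np.einsum("oc,chw->ohw", d, x)

x = np.random.default_rng(0).normal(size=(6, 4, 4))
y = dct_pointwise(x)
```

Unlike the DWHT, this matrix has non-trivial cosine entries, so it still needs multiplications by fixed constants; in exchange it is defined for any channel count rather than only powers of two.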

Core claim

The authors propose using the Discrete Walsh-Hadamard Transform as a parameter-free pointwise convolution in DNNs. This leverages the transform's ability to capture cross-channel correlations without learnable parameters, requiring only additions and subtractions with a fast algorithm that reduces complexity to O(n log n). When applied within MobileNet-V1 on CIFAR-100, the resulting model achieves 1.49% higher accuracy with 79.1% fewer parameters and 48.4% fewer FLOPs.
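The O(n log n) figure comes from the standard butterfly form of the Walsh-Hadamard transform. The sketch below is a generic textbook version in natural (Hadamard) order, not the paper's Algorithm 1; `hadamard` and `fwht` are illustrative names, and the input length is assumed to be a power of two.

```python
import numpy as np

def hadamard(n):
    """Dense n x n Hadamard matrix via the Sylvester construction (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform of a 1-D array."""
    y = np.array(x, dtype=float)
    n = y.shape[0]
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # One butterfly stage: pair up blocks of width h and emit sums and differences.
        y = y.reshape(-1, 2, h)
        a, b = y[:, 0, :], y[:, 1, :]
        y = np.stack((a + b, a - b), axis=1).reshape(n)
        h *= 2
    return y

x = np.arange(8.0)
assert np.allclose(fwht(x), hadamard(8) @ x)  # same result as the dense product
```

Each of the log2(n) stages performs n additions or subtractions and no multiplications, which is where the n log n count comes from.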

What carries the argument

Discrete Walsh-Hadamard Transform (DWHT) applied as a fixed, parameter-free pointwise convolution operator that mixes channels through additions and subtractions only.
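Read as a layer, this is a 1x1 convolution whose weight matrix is frozen to a Hadamard matrix. A minimal NumPy sketch under that reading follows; the (C, H, W) layout, the power-of-two channel count, and the name `dwht_pointwise` are assumptions made for illustration, and the paper's actual layer may add scaling or other steps not shown here.

```python
import numpy as np

def hadamard(n):
    """Dense n x n Hadamard matrix via the Sylvester construction (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def dwht_pointwise(x):
    """Mix the channels of a (C, H, W) feature map with fixed +1/-1 weights."""
    w = hadamard(x.shape[0])  # plays the role of the 1x1 kernel, but is never trained
    # Same contraction a pointwise convolution performs at every pixel:
    # out[o, i, j] = sum over c of w[o, c] * x[c, i, j]
    return np.einsum("oc,chw->ohw", w, x)

x = np.random.default_rng(0).normal(size=(8, 5, 5))
y = dwht_pointwise(x)
```

The dense product is used here only to show the equivalence with a pointwise convolution; an efficient layer would use the butterfly form instead.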

If this is right

  • Pointwise convolution layers can be built with zero learnable parameters.
  • Floating-point multiplications disappear from those layers, leaving only additions and subtractions.
  • Fast algorithms lower the cost of those additions from O(n^2) to O(n log n).
  • The same substitution works with DCT and yields similar efficiency gains.
  • Accuracy on CIFAR-100 rises by 1.49 percent relative to the unmodified baseline.
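As a rough per-pixel tally of the first three points, the sketch below compares a learned pointwise convolution with dense and fast DWHT for n input and n output channels. The counting conventions (bias ignored, n a power of two) and the function names are assumptions; these per-layer counts are not meant to reproduce the network-level 79.1% and 48.4% figures.

```python
def pointwise_conv_cost(n):
    """Learned 1x1 convolution, n -> n channels, per pixel (bias ignored)."""
    return {"params": n * n, "mults": n * n, "adds": n * (n - 1)}

def dense_dwht_cost(n):
    """DWHT computed as a full matrix-vector product with +1/-1 entries."""
    return {"params": 0, "mults": 0, "adds": n * (n - 1)}

def fast_dwht_cost(n):
    """DWHT computed with log2(n) butterfly stages of n additions/subtractions each."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    stages = n.bit_length() - 1  # exact log2 for powers of two
    return {"params": 0, "mults": 0, "adds": n * stages}

for name, cost in [("learned 1x1", pointwise_conv_cost(512)),
                   ("dense DWHT", dense_dwht_cost(512)),
                   ("fast DWHT", fast_dwht_cost(512))]:
    print(name, cost)
```

For n = 512 this gives 262,144 weights and multiplications for the learned layer, 261,632 additions for the dense transform, and 4,608 additions for the fast one.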

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fixed-transform replacement could be inserted into other lightweight architectures besides MobileNet-V1.
  • Fewer free parameters may reduce overfitting risk on smaller training sets.
  • Combining the method with quantization or pruning would likely produce further compute savings.
  • Testing on ImageNet would reveal whether the correlation-capture property scales to higher-resolution data.

Load-bearing premise

Fixed transforms such as DWHT capture the cross-channel correlations needed for the task as effectively as learned pointwise convolutions, without any accuracy loss.

What would settle it

Training the DWHT-replaced model on CIFAR-100 and observing accuracy below the MobileNet-V1 baseline would disprove the claim that the replacement maintains or improves performance.

Figures

Figures reproduced from arXiv: 1906.12172 by Joonhyun Jeong, Sung-Ho Bae.

Figure 1: Left: architecture of our PC layer based on fast DWHT algorithm in Algorithm 1, Right: …
Figure 2: Our blocks using conventional transform pointwise convolution (CTPC), random constant …
Figure 3: Performance curve of hierarchically applying our optimal block on CIFAR100, Top: in the …
Figure 4: Histograms of hierarchy level (low-level, middle-level, high-level) activations after the …
Figure 5: Histogram of 3 × 3 depthwise convolution weights in the third block, out of last 3 blocks. DCT-3-H and DWHT-3-H models are based on ShuffleNet V2 1.1x model with (d) block. Baseline model is ShuffleNet V2 1.1x model.
Figure 6: Ablation study of weight decay values (5e-4, 2e-3, 1e-2, 1e-1). We applied these weight …
Figure 7: Performance curve of hierarchically applying our optimal block (See Table 2 for detail …
Figure 8: Histograms of 3 × 3 depthwise convolution weights, Top: histogram of first block out of last 3 blocks, Bottom: histogram of second block out of last 3 blocks. DWHT-3-H and DCT-3-H models are based on ShuffleNet-V2 1.1x model with (d)-DWHT and (d)-DCT block, respectively. Baseline model is ShuffleNet-V2 1.1x model. Further, 3-M-Rear models gave slightly superior efficiency while 7-M, 3-M-Front, and low-leve…
Original abstract

Some conventional transforms such as Discrete Walsh-Hadamard Transform (DWHT) and Discrete Cosine Transform (DCT) have been widely used as feature extractors in image processing but rarely applied in neural networks. However, we found that these conventional transforms have the ability to capture the cross-channel correlations without any learnable parameters in DNNs. This paper firstly proposes to apply conventional transforms to pointwise convolution, showing that such transforms significantly reduce the computational complexity of neural networks without accuracy performance degradation. Especially for DWHT, it requires no floating point multiplications but only additions and subtractions, which can considerably reduce computation overheads. In addition, its fast algorithm further reduces complexity of floating point addition from $\mathcal{O}(n^2)$ to $\mathcal{O}(n\log n)$. These nice properties construct extremely efficient networks in the number parameters and operations, enjoying accuracy gain. Our proposed DWHT-based model gained 1.49\% accuracy increase with 79.1\% reduced parameters and 48.4\% reduced FLOPs compared with its baseline model (MoblieNet-V1) on the CIFAR 100 dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes replacing pointwise convolutions in DNNs (specifically MobileNet-V1) with fixed, parameter-free transforms such as the Discrete Walsh-Hadamard Transform (DWHT) and Discrete Cosine Transform (DCT). These transforms are claimed to capture cross-channel correlations, yielding networks with substantially lower parameter counts and FLOPs. On CIFAR-100 the DWHT variant is reported to improve accuracy by 1.49% while reducing parameters by 79.1% and FLOPs by 48.4% relative to the baseline.

Significance. If the empirical result is robust, the work offers a concrete, multiplication-free alternative to learned pointwise convolutions that also exploits the fast O(n log n) DWHT algorithm. This could be useful for resource-constrained settings. The manuscript does not supply machine-checked proofs or reproducible code, so the primary strength is the reported head-to-head measurement rather than a parameter-free derivation.

major comments (2)
  1. [Experimental Results] Experimental section: the 1.49% accuracy gain, 79.1% parameter reduction, and 48.4% FLOP reduction on CIFAR-100 are presented from a single run with no error bars, no multiple random seeds, and no ablation on insertion point or baseline re-tuning. This single comparison is load-bearing for the central claim yet lacks the controls needed to establish robustness.
  2. [Methods] Methods / architecture description: the paper does not specify how the fixed DWHT matrix is applied to feature maps of arbitrary channel count, whether any normalization or reshaping is required, or how the transform interacts with the existing depthwise layers. These details are necessary to reproduce the stated parameter and FLOP counts.
minor comments (2)
  1. [Abstract] Abstract: 'MoblieNet-V1' is a typographical error.
  2. [Abstract] The abstract states the transforms operate 'without accuracy performance degradation,' yet the reported result is a 1.49% improvement; the wording should be aligned with the actual finding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will make the indicated revisions to strengthen the manuscript.

  1. Referee: [Experimental Results] Experimental section: the 1.49% accuracy gain, 79.1% parameter reduction, and 48.4% FLOP reduction on CIFAR-100 are presented from a single run with no error bars, no multiple random seeds, and no ablation on insertion point or baseline re-tuning. This single comparison is load-bearing for the central claim yet lacks the controls needed to establish robustness.

    Authors: We agree that reporting results from multiple runs with error bars and ablations would better establish robustness. In the revised manuscript we will add experiments averaged over several random seeds (with standard deviations), plus ablations on transform insertion points and any baseline re-tuning required. revision: yes

  2. Referee: [Methods] Methods / architecture description: the paper does not specify how the fixed DWHT matrix is applied to feature maps of arbitrary channel count, whether any normalization or reshaping is required, or how the transform interacts with the existing depthwise layers. These details are necessary to reproduce the stated parameter and FLOP counts.

    Authors: We will expand the methods section with a precise description of how the fixed DWHT (and DCT) matrices are applied to feature maps of any channel count. This will include the exact reshaping/padding procedure, any normalization steps, and the integration point relative to the depthwise layers, allowing exact reproduction of the reported parameter and FLOP counts. revision: yes
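Until that description is available, one plausible reading of the padding step is to zero-pad the channel axis up to the next power of two before running the butterfly. The sketch below is purely hypothetical: it is not taken from the paper, and `next_pow2`, `pad_channels`, and `fwht_channels` are illustrative names.

```python
import numpy as np

def next_pow2(c):
    """Smallest power of two that is >= c (for c >= 1)."""
    return 1 << (c - 1).bit_length()

def pad_channels(x):
    """Zero-pad a (C, H, W) array along the channel axis to a power-of-two size (assumed scheme)."""
    c = x.shape[0]
    return np.pad(x, ((0, next_pow2(c) - c), (0, 0), (0, 0)))

def fwht_channels(x):
    """Unnormalized fast Walsh-Hadamard transform along axis 0 of a (C, H, W) array."""
    y = np.array(x, dtype=float)
    c, spatial = y.shape[0], y.shape[1:]
    if c < 1 or c & (c - 1):
        raise ValueError("channel count must be a power of two")
    h = 1
    while h < c:
        # Butterfly stage over channel blocks of width h, applied at every pixel at once.
        y = y.reshape(-1, 2, h, *spatial)
        a, b = y[:, 0], y[:, 1]
        y = np.stack((a + b, a - b), axis=1).reshape(c, *spatial)
        h *= 2
    return y

x = np.zeros((5, 2, 2))
x[0] = 1.0                       # only the first channel is active
out = fwht_channels(pad_channels(x))
```

Whether the extra output channels are kept, cropped, or avoided altogether by choosing power-of-two widths changes the parameter and FLOP counts, which is why the referee asks for the exact procedure.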

Circularity Check

0 steps flagged

No significant circularity


The paper advances an empirical proposal to replace pointwise convolutions with fixed, parameter-free transforms such as DWHT, then directly measures the resulting accuracy, parameter count, and FLOP reductions against MobileNet-V1 on CIFAR-100. No derivation chain, equation, or uniqueness claim is shown to reduce by construction to a fitted input, self-citation, or renamed ansatz; the central performance numbers are external experimental outcomes rather than algebraic identities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim depends on the domain assumption that DWHT and DCT already encode the necessary cross-channel statistics; no new free parameters or invented entities are introduced because the transforms are non-parametric.

axioms (1)
  • domain assumption: Conventional transforms such as DWHT and DCT can capture cross-channel correlations in feature maps without learnable parameters.
    Invoked in the abstract as the justification for replacing pointwise convolution.


