Hardware Efficient Approximate Convolution with Tunable Error Tolerance for CNNs

Anupam Kumari; Roy P Paily; Vishal Shashidhar

arxiv: 2603.10100 · v2 · submitted 2026-03-10 · cs.LG · cs.AI · cs.AR

Pith reviewed 2026-05-15 13:00 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.AR

keywords soft sparsity · approximate convolution · MSB proxy · CNN acceleration · RISC-V custom instruction · MAC reduction · edge inference · error tolerance

The pith

A soft sparsity method using an MSB proxy skips most non-zero multiplications in CNN convolutions while preserving full accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a hardware-efficient approach to approximate convolution that treats multiplications as soft-sparse rather than relying only on exact zeros. It uses the most significant bit of the operands as a quick proxy to decide which non-zero products are negligible enough to skip. When implemented as a custom RISC-V instruction and tested on LeNet-5 for MNIST, the method cuts ReLU MACs by 88.42 percent and Tanh MACs by 74.87 percent with no drop in classification accuracy. Power is further reduced by clock-gating the unused multipliers, yielding estimated savings of roughly 35 percent for ReLU networks. A reader would care because the technique targets the exact bottleneck that prevents deep networks from running efficiently on edge hardware without retraining or accuracy penalties.

Core claim

The central claim is that an MSB-based proxy can reliably flag negligible non-zero multiplications inside convolutional layers, allowing them to be skipped with tunable error tolerance. When this proxy is realized as a single custom RISC-V instruction and paired with clock-gating, LeNet-5 on MNIST sheds 88.42 percent of its ReLU MAC operations and 74.87 percent of its Tanh MAC operations while accuracy stays identical to the exact baseline. The paper reports a fivefold improvement over conventional zero-skipping and estimates power reductions of 35.2 percent for ReLU and 29.96 percent for Tanh; memory access keeps the power reduction sub-linear in the operation savings.

What carries the argument

The MSB proxy, a low-cost check on the leading bit of each operand pair that decides whether the multiplication result is small enough to be ignored and the multiplier can be clock-gated.
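
To make the mechanism concrete, here is a minimal software sketch of a leading-bit skip decision. It is an illustration under assumptions, not the paper's instruction: operands are taken to be fixed-point integers, and the rule "skip when the summed leading-bit positions fall below a cutoff" is a hypothetical stand-in for whatever comparison the hardware performs. The names `msb_index`, `should_skip`, `approx_dot`, and `cutoff` are invented for this example.

```python
def msb_index(x: int) -> int:
    """Position of the leading one bit of |x|; -1 when x is zero."""
    return abs(x).bit_length() - 1


def should_skip(a: int, w: int, cutoff: int) -> bool:
    """Hypothetical skip rule (an assumption, not the paper's exact test).

    Exact zeros are always skipped. A non-zero product is treated as
    negligible when the summed leading-bit positions are below `cutoff`.
    """
    if a == 0 or w == 0:
        return True
    return msb_index(a) + msb_index(w) < cutoff


def approx_dot(acts, weights, cutoff):
    """Dot product that skips multiplications flagged by the proxy.

    Returns (result, number_of_multiplications_performed).
    """
    total, performed = 0, 0
    for a, w in zip(acts, weights):
        if should_skip(a, w, cutoff):
            continue  # in hardware, the multiplier would be clock-gated here
        total += a * w
        performed += 1
    return total, performed


if __name__ == "__main__":
    acts = [0, 3, 100, 5]
    weights = [7, 2, 50, 1]
    print(approx_dot(acts, weights, cutoff=4))  # only the large product is computed
    print(approx_dot(acts, weights, cutoff=0))  # cutoff 0 reduces to plain zero-skipping
```

Setting `cutoff` to 0 recovers ordinary zero-skipping, which is one way to read the "tunable" part of the claim: the same datapath covers both the exact and the approximate regime.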

If this is right

ReLU MAC operations fall by 88.42 percent and Tanh MAC operations by 74.87 percent on LeNet-5 with no accuracy loss.
Estimated power reductions reach 35.2 percent for ReLU networks and 29.96 percent for Tanh networks via clock-gating.
The method outperforms traditional zero-skipping by a factor of five in operation reduction.
The same proxy supplies tunable error tolerance that can be adjusted per layer without retraining.
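
The reported figures can be cross-checked with simple arithmetic. The snippet below uses only the percentages quoted above; the ratio of power saving to MAC reduction is an editorial diagnostic, not a quantity the paper defines, and reading it as "share of power attributable to the gated multipliers" would be an assumption.

```python
# Percentages quoted in the summary above.
mac_reduction = {"ReLU": 88.42, "Tanh": 74.87}
power_saving = {"ReLU": 35.2, "Tanh": 29.96}


def retained_macs(reduction_pct: float) -> float:
    """Percent of baseline MACs still executed after skipping."""
    return round(100.0 - reduction_pct, 2)


def saving_per_mac_point(activation: str) -> float:
    """Estimated power saving per percentage point of MACs removed."""
    return power_saving[activation] / mac_reduction[activation]


for act in mac_reduction:
    print(act, retained_macs(mac_reduction[act]), round(saving_per_mac_point(act), 3))
```

The retained fractions (11.58 percent for ReLU, 25.13 percent for Tanh) match the values in the captions of Figures 5 and 6, and the ratio comes out near 0.40 for both activations, which is consistent with the paper's remark that power reduction is sub-linear in operation savings.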

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the MSB proxy continues to work on deeper networks, the same instruction could be dropped into existing RISC-V CNN accelerators with minimal redesign.
Because the decision uses only the leading bit, the technique may combine cleanly with other low-precision formats such as 8-bit integers.
The approach opens a route to accuracy-preserving approximations that require no gradient-based fine-tuning after the proxy threshold is chosen.
Real silicon measurements on a fabricated RISC-V core would be needed to confirm whether memory-access energy still leaves net power savings once the multipliers are gated.

Load-bearing premise

The MSB proxy reliably identifies negligible multiplications across layers and activations such that any introduced errors do not accumulate to affect final network accuracy.
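
A worst-case bound shows what the premise has to control. The derivation below is an editorial sketch under an assumed skip rule that the source does not state: integer operands, $m(x)$ denoting the position of the leading one bit of $|x|$, and a product $a_i w_i$ being skipped only when $m(a_i) + m(w_i) + 2 \le t$ for some cutoff $t$. The set $S$ collects the indices of skipped products within one dot product.

```latex
% Each operand is bounded by the next power of two above its leading bit.
|a_i| < 2^{m(a_i)+1}, \qquad |w_i| < 2^{m(w_i)+1}
\;\Longrightarrow\;
|a_i w_i| < 2^{m(a_i)+m(w_i)+2} \le 2^{t} \quad \text{for } i \in S .

% Error of one approximate dot product relative to the exact one.
\left| \sum_{i} a_i w_i - \sum_{i \notin S} a_i w_i \right|
= \left| \sum_{i \in S} a_i w_i \right|
\le \sum_{i \in S} |a_i w_i|
\le |S| \cdot 2^{t} .
```

The bound grows linearly with the number of skipped terms, so a layer that skips most of its products can still accumulate a non-negligible error. Whether that error survives the activation function and later layers is exactly what the premise assumes away.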

What would settle it

Measure accuracy on LeNet-5 with the same MSB threshold when the network is evaluated on a harder dataset such as CIFAR-10; any drop below the exact baseline would falsify the claim.

Figures

Figures reproduced from arXiv: 2603.10100 by Anupam Kumari, Roy P Paily, Vishal Shashidhar.

**Figure 3.** Visual demonstration of outputs with different error [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]

**Figure 4.** Distribution of fractional pixel errors for each of the [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]

**Figure 5.** At T=0.3, 11.58% of total MACs preserve accuracy. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]

**Figure 6.** At T=0.2, 25.13% of total MACs preserve accuracy. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png]

Original abstract

Modern CNNs' high computational demands hinder edge deployment, as traditional "hard" sparsity (skipping mathematical zeros) loses effectiveness in deep layers or with smooth activations like Tanh. We propose a "soft sparsity" paradigm using a hardware-efficient Most Significant Bit (MSB) proxy to skip negligible non-zero multiplications. Integrated as a custom RISC-V instruction and evaluated on LeNet-5 (MNIST), this method reduces ReLU MACs by 88.42% and Tanh MACs by 74.87% with zero accuracy loss, outperforming zero-skipping by 5x. By clock-gating inactive multipliers, we estimate power savings of 35.2% for ReLU and 29.96% for Tanh. While memory access makes power reduction sub-linear to operation savings, this approach significantly optimizes resource-constrained inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MSB proxy soft sparsity cuts MACs on LeNet-5 with reported zero loss, but the result is narrow and untested on deeper nets.

The main takeaway is a hardware-level trick that uses the most significant bit of the operands as a quick proxy to skip small non-zero multiplies in CNN convolutions. The authors turn this into a custom RISC-V instruction, clock-gate the unused multipliers, and report 88% fewer ReLU MACs and 75% fewer Tanh MACs on LeNet-5/MNIST with accuracy unchanged, plus an estimated 30 to 35% power reduction. That is a concrete step past ordinary zero-skipping, which loses steam with smooth activations like Tanh.

The implementation details on how the proxy threshold is set and how it maps to the instruction are the parts that feel most useful for someone actually wiring up an edge accelerator. The power numbers are also grounded in the clock-gating approach rather than pure operation counts, which is realistic.

The obvious limitation is the evaluation scope. Everything is shown only on LeNet-5 with MNIST. There is no deeper network, no harder dataset, and no explicit check on how the per-layer approximation errors add up after many summations and non-linearities. The zero-loss claim therefore rests on the assumption that the MSB proxy stays safe across varying activation distributions, and that assumption is not stress-tested beyond the shallow case.

A reader working on tiny custom CNN hardware will still find the RISC-V integration and the soft-sparsity idea worth examining. For anyone needing results that generalize to modern models, the paper is too preliminary. I would send it to peer review so the authors can add those broader experiments and an error-bound analysis; the core mechanism is straightforward enough that referees can judge it quickly.

Referee Report

2 major / 1 minor

Summary. The paper proposes a soft sparsity paradigm for CNN convolutions that uses a Most Significant Bit (MSB) proxy to identify and skip negligible non-zero multiplications, implemented as a custom RISC-V instruction. Evaluated on LeNet-5 for MNIST, it reports 88.42% reduction in ReLU MACs and 74.87% in Tanh MACs with zero accuracy loss (outperforming zero-skipping by 5x) plus estimated power savings of 35.2% and 29.96% via clock-gating of inactive multipliers.

Significance. If the MSB proxy reliably bounds per-operation errors so they do not accumulate to change final accuracy, the technique would extend sparsity benefits to smooth activations and deeper layers, offering practical hardware efficiency gains for edge CNN inference.

major comments (2)

[Evaluation] Evaluation section: zero-accuracy-loss is demonstrated solely on LeNet-5/MNIST with no per-layer error bounds, no explicit propagation analysis through additions and non-linearities, and no results on deeper models or harder datasets; this directly underpins the central claim that the proxy prevents accuracy degradation.
[§3] §3 (MSB proxy definition): the threshold is described as tunable for error tolerance, yet no sensitivity study or layer-wise selection procedure is given, leaving the 'tunable' guarantee unsupported beyond the single reported configuration.

minor comments (1)

[Abstract] Abstract: power savings are labeled 'estimates'; adding a sentence on whether they derive from RTL simulation, synthesis, or analytical model would clarify their strength.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional analysis where feasible.

Referee: [Evaluation] Evaluation section: zero-accuracy-loss is demonstrated solely on LeNet-5/MNIST with no per-layer error bounds, no explicit propagation analysis through additions and non-linearities, and no results on deeper models or harder datasets; this directly underpins the central claim that the proxy prevents accuracy degradation.

Authors: We acknowledge the evaluation is limited to LeNet-5/MNIST. The empirical zero-accuracy-loss result supports the claim for this benchmark, but we agree additional rigor is needed. In revision we will add per-layer error bounds and a short propagation analysis through additions and activations. Results on deeper models or harder datasets fall outside the current scope, which focuses on hardware-efficient implementation of the MSB proxy; we will state this limitation explicitly and note it as future work. revision: partial

Referee: [§3] §3 (MSB proxy definition): the threshold is described as tunable for error tolerance, yet no sensitivity study or layer-wise selection procedure is given, leaving the 'tunable' guarantee unsupported beyond the single reported configuration.

Authors: Section 3 defines the threshold as a tunable parameter controlling the MSB proxy cutoff. To substantiate tunability we will add a sensitivity study (accuracy and MAC reduction versus threshold) for each layer of LeNet-5 in the revised manuscript. From this data we will also describe a simple layer-wise heuristic for threshold selection based on target error tolerance. revision: yes
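
A sensitivity study of this kind can be prototyped in software before any RTL work. The sketch below is only an illustration: it uses synthetic 8-bit data rather than LeNet-5 activations, and a hypothetical integer leading-bit cutoff rather than the paper's threshold T. The function names and the skip rule are assumptions made for the example.

```python
import random


def msb_index(x: int) -> int:
    """Position of the leading one bit of |x|; -1 when x is zero."""
    return abs(x).bit_length() - 1


def sweep(acts, weights, cutoffs):
    """For each cutoff, return (cutoff, fraction of MACs skipped, abs error).

    Hypothetical rule: skip exact zeros, and skip non-zero products whose
    summed leading-bit positions are below the cutoff.
    """
    exact = sum(a * w for a, w in zip(acts, weights))
    rows = []
    for c in cutoffs:
        approx, skipped = 0, 0
        for a, w in zip(acts, weights):
            if a == 0 or w == 0 or msb_index(a) + msb_index(w) < c:
                skipped += 1
            else:
                approx += a * w
        rows.append((c, skipped / len(acts), abs(exact - approx)))
    return rows


random.seed(0)
acts = [random.randint(0, 255) for _ in range(1000)]        # synthetic activations
weights = [random.randint(-128, 127) for _ in range(1000)]  # synthetic weights

for cutoff, frac, err in sweep(acts, weights, range(0, 16, 3)):
    print(f"cutoff={cutoff:2d} skipped={frac:.2%} abs_error={err}")
```

On real layers the third column would be replaced by classification accuracy, and the sweep would be repeated per layer to pick the largest cutoff that keeps accuracy at the baseline.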

Circularity Check

0 steps flagged

No circularity: empirical hardware approximation validated on LeNet-5/MNIST

The paper proposes an MSB-based proxy for soft sparsity in CNN multiplications and evaluates it directly via a custom RISC-V instruction on LeNet-5 (MNIST), reporting measured MAC reductions and estimated power savings with zero accuracy loss. No derivation chain, fitted parameters renamed as predictions, self-citation load-bearing steps, or ansatz smuggling appear in the provided text or abstract. All central claims rest on experimental results rather than any equation or premise that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the unproven domain assumption that an MSB-based magnitude check suffices to control approximation error in convolution layers.

free parameters (1)

MSB proxy threshold
Cutoff value used to decide when a multiplication is negligible; must be chosen or tuned to maintain zero accuracy loss.

axioms (1)

domain assumption: MSB proxy accurately approximates multiplication magnitude for skipping decisions without full computation
Invoked to justify skipping non-zero but small operations while preserving accuracy.

pith-pipeline@v0.9.0 · 5454 in / 1170 out tokens · 48231 ms · 2026-05-15T13:00:26.869127+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 770–778.

[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Commun. ACM, vol. 60, no. 6, pp. 84–90, 2012.

[3] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 1–9.

[4] S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," in Proc. 43rd Annu. Int. Symp. Comput. Archit. (ISCA), 2016.

[5] K. Guo et al., "RRAM based buffer design for energy efficient CNN accelerator," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), 2018.

[6] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2016, pp. 262–263.

[7] S. Zhang et al., "Cambricon-X: An accelerator for sparse neural networks," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2016, pp. 1–12.

[8] J. Albericio et al., "Cnvlutin: Ineffectual-neuron-free deep neural network computing," in Proc. 43rd Annu. Int. Symp. Comput. Archit. (ISCA), 2016.

[9] A. Aimar et al., "Nullhop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," IEEE Trans. Neural Netw. Learn. Syst. (TNNLS), 2018.

[10] S. Hooker et al., "The hardware lottery," Commun. ACM, 2020.

[11] Z. Zhang et al., "Sparch: Efficiently exploiting broad sparsity in DNNs," in Proc. IEEE Int. Symp. High-Performance Comput. Archit. (HPCA), 2020.

[12] V. Akhlaghi et al., "SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks," in Proc. 45th Annu. Int. Symp. Comput. Archit. (ISCA), 2018.

[13] S. Wang et al., "Optimizing CNN computation using RISC-V custom instruction sets for edge platforms," IEEE Trans. Comput., vol. 73, no. 5, pp. 1371–1384, May 2024.
