E2AFS: Energy-Efficient Approximate Floating Point Square Rooter for Error Tolerant Computing

Jatin Kumar Reddy Mothe; Prateek Goyal; Sujit Kumar Sahoo; Swara Rajesh Shelke

arXiv: 2604.16964 · v1 · submitted 2026-04-18 · cs.AR

Pith reviewed 2026-05-10 06:59 UTC · model grok-4.3

classification 💻 cs.AR

keywords: approximate computing · floating-point square root · energy-efficient architecture · FPGA implementation · error-tolerant computing · Sobel edge detection · K-means quantization · power-delay product

The pith

E2AFS introduces a multiplier-free approximate floating-point square-root architecture that achieves lower dynamic power, shorter delay, and better power-delay product than prior designs while keeping errors acceptable for error-tolerant use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents E2AFS as a lightweight floating-point square-root design that avoids multipliers entirely to cut energy use in embedded and edge systems. It reduces logic depth and switching activity compared with conventional multiplier-based or iterative methods. On an Artix-7 FPGA the design records 7.63 mW dynamic power, 4.639 ns critical-path delay, and 35.39 pJ power-delay product, outperforming ESAS and CWAHA. Accuracy checks show low deviation from the exact square-root function, and application tests in Sobel edge detection and K-means quantization confirm the errors stay tolerable.

Core claim

E2AFS is a fully multiplier-free floating-point square-root architecture that minimizes logic depth and switching activity. FPGA implementation on Artix-7 demonstrates the lowest dynamic power of 7.63 mW, shortest critical-path delay of 4.639 ns, and minimum power-delay product of 35.39 pJ versus existing ESAS and CWAHA designs. Error metrics and graphical analysis establish consistently low deviation from the exact function, and end-to-end validation in Sobel edge detection and K-means color quantization shows the approximation remains suitable for low-power real-time edge and embedded platforms.

What carries the argument

The argument rests on the E2AFS multiplier-free floating-point square-root architecture, which reduces logic depth and switching activity to improve power and delay.

If this is right

Records 7.63 mW dynamic power on Artix-7 FPGA
Achieves 4.639 ns critical-path delay
Delivers 35.39 pJ power-delay product
Maintains low deviation from exact square-root function
Supports low-power real-time operation in edge and embedded platforms
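
The three hardware figures listed above can be cross-checked against each other, since a power-delay product is conventionally dynamic power multiplied by critical-path delay. The following is a minimal sketch assuming that plain definition; the function and constant names are illustrative and do not come from the paper.

```python
# Reported Artix-7 figures for E2AFS, copied from the summary above.
DYNAMIC_POWER_MW = 7.63    # dynamic power in milliwatts
CRITICAL_PATH_NS = 4.639   # critical-path delay in nanoseconds
REPORTED_PDP_PJ = 35.39    # power-delay product in picojoules


def power_delay_product_pj(power_mw: float, delay_ns: float) -> float:
    """Return power x delay in picojoules.

    mW x ns = (1e-3 W) x (1e-9 s) = 1e-12 J, so no unit scaling is needed.
    """
    return power_mw * delay_ns


pdp = power_delay_product_pj(DYNAMIC_POWER_MW, CRITICAL_PATH_NS)
print(f"computed PDP = {pdp:.3f} pJ, reported PDP = {REPORTED_PDP_PJ} pJ")
```

The product comes to about 35.396 pJ, which agrees with the reported 35.39 pJ to within rounding of the last digit.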

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The multiplier-free approach may transfer to other floating-point operations under similar hardware constraints
Lower power and delay could allow more complex computations within fixed energy budgets on battery-powered edge devices
The error-tolerance validation suggests the unit could fit additional real-time signal-processing pipelines beyond the two tested applications

Load-bearing premise

The approximation errors stay small enough that they do not degrade end-to-end quality in the targeted error-tolerant applications such as Sobel edge detection and K-means quantization.

What would settle it

Re-measuring the design on an Artix-7 device and finding dynamic power above 7.63 mW or critical-path delay above 4.639 ns, or finding visible quality loss in Sobel edge detection and K-means outputs when E2AFS replaces exact square-root computation, would undermine the efficiency and suitability claims.

Figures

Figures reproduced from arXiv: 2604.16964 by Jatin Kumar Reddy Mothe, Prateek Goyal, Sujit Kumar Sahoo, Swara Rajesh Shelke.

**Figure 2.** Graphical Analysis of Various Floating-Point Square Rooters

**Figure 3.** Normalized Figures of Merit (FoM1 and FoM2) highlighting speed and energy efficiency.

**Figure 4.** Visual comparison of edge-detection performance across approximate square-root architectures.

**Figure 5.** Visual outcomes of color quantization using various approximate square-root architectures.

Original abstract

Floating-point square-root computation is a power- and delay-critical operation in edge-AI, signal-processing, and embedded systems. Conventional implementations typically rely on multipliers or iterative pipelines, resulting in increased hardware complexity, switching activity, and energy consumption. This work presents E2AFS, a lightweight and fully multiplier-free floating-point square-root architecture optimized for energy-efficient computation. By reducing logic depth and minimizing switching activity, the proposed design achieves substantial improvements in hardware efficiency and performance. FPGA implementation on an Artix-7 device demonstrates that E2AFS achieves the lowest dynamic power (7.63 mW), the shortest critical-path delay (4.639 ns), and the minimum power-delay product (35.39 pJ) compared to existing ESAS and CWAHA architectures. Error evaluation using multiple accuracy metrics, together with graphical analysis, shows that E2AFS closely approximates the exact square-root function with consistently low deviation. Application-level validation in Sobel edge detection and K-means color quantization further confirms its suitability for low-power real-time edge and embedded platforms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

E2AFS is a multiplier-free approximate FP sqrt on FPGA that reports lower power and delay than two named priors, but the gains need fair re-synthesis checks to be convincing.

This paper gives us a concrete multiplier-free approximate floating point square root unit implemented on FPGA, claiming better power and speed than two previous designs, but the strength of that claim depends on whether the comparisons were done fairly.

The authors present E2AFS, which reduces logic depth and switching activity by avoiding multipliers altogether. On an Artix-7 FPGA, it shows 7.63 mW dynamic power, 4.639 ns delay, and 35.39 pJ power-delay product, outperforming ESAS and CWAHA. They back this with error metrics indicating low deviation and tests in Sobel edge detection plus K-means quantization to confirm suitability for error-tolerant applications.

The practical side is done well. Targeting real applications and reporting FPGA numbers makes it relevant for embedded systems work. The choice to go multiplier-free is a reasonable engineering move for energy efficiency in these domains.

The main concern is the fairness of the hardware comparison. Power and delay on FPGAs vary with synthesis tool settings, clock frequency for estimation, place-and-route, and switching activity. The abstract does not confirm that the prior designs were re-implemented under identical conditions. If the baselines used different flows or constraints, the deltas might not reflect true architectural gains. The error evaluation mentions multiple metrics and graphs but lacks specific numbers here, so the acceptability for applications is not fully clear from the summary.

This work suits hardware engineers focused on approximate computing for edge devices. Someone in that field could extract useful ideas from the architecture if the full details check out. It is incremental rather than foundational, yet the specific implementation and application tests give it enough substance for review.

I would recommend sending it for peer review. The topic is timely for low-power systems, and referees can request the necessary clarifications on synthesis setup and error quantification to strengthen the paper.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes E2AFS, a lightweight multiplier-free approximate floating-point square-root architecture for energy-efficient computation in error-tolerant domains. It claims that FPGA synthesis on an Artix-7 device yields the lowest dynamic power (7.63 mW), shortest critical-path delay (4.639 ns), and minimum power-delay product (35.39 pJ) relative to the ESAS and CWAHA baselines, while maintaining low approximation error as shown by accuracy metrics, graphical analysis, and end-to-end tests on Sobel edge detection and K-means color quantization.

Significance. If the performance deltas hold under identical synthesis conditions, the design could offer a practical hardware primitive for low-power edge-AI and embedded signal-processing pipelines where square-root operations are frequent. The combination of multiplier elimination, reported PDP improvement, and application-level validation would strengthen the case for approximate arithmetic in resource-constrained platforms.

major comments (3)

[FPGA Implementation Results] FPGA results section (abstract and implementation results): The central superiority claims for dynamic power, delay, and PDP rest on comparisons to ESAS and CWAHA, yet the manuscript provides no explicit statement that all three designs were re-implemented by the authors using identical Vivado synthesis settings, optimization directives, place-and-route constraints, clock frequency for power analysis, and input switching activity. These factors directly affect the reported 7.63 mW / 4.639 ns / 35.39 pJ figures; without this confirmation the deltas cannot be attributed solely to the E2AFS architecture.
[Proposed Architecture / Error Evaluation] Proposed design and error evaluation sections: No design equations, algorithmic steps, or error formulas are supplied for the approximation method. The abstract asserts a “multiplier-free” property and “consistently low deviation,” but without the underlying mapping or error-bound derivation it is impossible to verify the claimed reduction in logic depth or to reproduce the accuracy metrics used for the Sobel and K-means tests.
[Application-Level Validation] Application validation section: The claim that approximation error “does not meaningfully degrade” end-to-end quality in Sobel edge detection and K-means quantization is load-bearing for the error-tolerant suitability argument, yet the manuscript reports only qualitative confirmation without quantitative metrics (e.g., PSNR, clustering accuracy delta, or visual difference scores) comparing exact versus approximate outputs.

minor comments (1)

[Abstract / Figures] Ensure that all figures referenced in the error and application sections are numbered and captioned consistently with the text; the abstract mentions “graphical analysis” without a corresponding figure citation.
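
The quantitative check requested in the third major comment could be sketched roughly as follows: run Sobel gradient magnitude once with an exact square root and once with an approximate one, then report PSNR between the two edge maps. Everything here is an assumption for illustration. The `approx_sqrt` stand-in is a generic integer-shift approximation, not the E2AFS datapath (which the summary does not describe), and the synthetic test image and function names are hypothetical.

```python
import math
import struct


def approx_sqrt(x: float) -> float:
    """Generic stand-in (NOT E2AFS): halve the float32 bit pattern, re-add half the bias."""
    if x <= 0.0:
        return 0.0
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits >> 1) + 0x1FC00000          # 0x1FC00000 == 127 << 22
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sobel_magnitude(img, sqrt_fn):
    """Gradient magnitude sqrt(gx^2 + gy^2) on interior pixels, clamped to 255."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = min(255.0, sqrt_fn(float(gx * gx + gy * gy)))
    return out


def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D lists."""
    n = len(ref) * len(ref[0])
    mse = sum((a - b) ** 2 for ra, rb in zip(ref, test) for a, b in zip(ra, rb)) / n
    return math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)


# Hypothetical 32x32 test image: a linear ramp with a bright square in the middle.
SIZE = 32
image = [[200 if (8 <= x < 24 and 8 <= y < 24) else 4 * x + 2 * y
          for x in range(SIZE)] for y in range(SIZE)]

exact = sobel_magnitude(image, math.sqrt)
approx = sobel_magnitude(image, approx_sqrt)
print(f"PSNR of approximate vs exact Sobel map: {psnr(exact, approx):.2f} dB")
```

Passing the square root in as a parameter keeps the pipeline identical for both runs, so any difference in the edge maps is attributable to the square-root unit alone. The same harness could take a bit-accurate software model of E2AFS, ESAS, or CWAHA if one were available.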

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the clarity and rigor of our manuscript. We address each major comment point by point below and will revise the paper to incorporate the suggested clarifications and additions.

Referee: [FPGA Implementation Results] The central superiority claims for dynamic power, delay, and PDP rest on comparisons to ESAS and CWAHA, yet the manuscript provides no explicit statement that all three designs were re-implemented by the authors using identical Vivado synthesis settings, optimization directives, place-and-route constraints, clock frequency for power analysis, and input switching activity. These factors directly affect the reported 7.63 mW / 4.639 ns / 35.39 pJ figures; without this confirmation the deltas cannot be attributed solely to the E2AFS architecture.

Authors: We confirm that E2AFS, ESAS, and CWAHA were all re-implemented by the authors on the same Artix-7 device using identical Vivado synthesis settings, optimization directives, place-and-route constraints, clock frequency, and input switching activity vectors for power estimation. We will add an explicit paragraph in the FPGA Implementation Results section stating these conditions to ensure the performance comparisons are fully reproducible and attributable to architectural differences.

revision: yes

Referee: [Proposed Architecture / Error Evaluation] No design equations, algorithmic steps, or error formulas are supplied for the approximation method. The abstract asserts a “multiplier-free” property and “consistently low deviation,” but without the underlying mapping or error-bound derivation it is impossible to verify the claimed reduction in logic depth or to reproduce the accuracy metrics used for the Sobel and K-means tests.

Authors: The Proposed Architecture section describes the multiplier-free approximation via bit-level manipulations of the mantissa and exponent, along with the resulting logic simplification. To address the concern, we will add explicit design equations for the approximation mapping, a step-by-step algorithmic description (including pseudocode), and the error-bound derivation in the revised manuscript. This will enable verification of the logic-depth reduction and reproduction of the accuracy metrics.

revision: yes

Referee: [Application-Level Validation] The claim that approximation error “does not meaningfully degrade” end-to-end quality in Sobel edge detection and K-means quantization is load-bearing for the error-tolerant suitability argument, yet the manuscript reports only qualitative confirmation without quantitative metrics (e.g., PSNR, clustering accuracy delta, or visual difference scores) comparing exact versus approximate outputs.

Authors: We acknowledge that the current application validation relies primarily on qualitative visual comparisons. In the revision, we will add quantitative metrics: PSNR and SSIM for Sobel edge detection, and clustering accuracy/purity deltas for K-means, directly comparing exact and approximate outputs. These additions will provide objective support for the claim that approximation error does not meaningfully degrade end-to-end quality.

revision: yes
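
To make the phrase "bit-level manipulations of the mantissa and exponent" from the second response concrete, here is a minimal sketch of one textbook way to build a square root from shifts and adds only. It is explicitly not the E2AFS mapping, which this summary does not give. The piecewise-linear rule, the function names, and the choice of mean relative error as the metric are all assumptions made for illustration, and the sketch handles positive normal float32 inputs only.

```python
import math
import struct


def split_float32(x: float):
    """Return (unbiased exponent, 23-bit mantissa field) of a positive normal float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return ((bits >> 23) & 0xFF) - 127, bits & 0x7FFFFF


def shift_add_sqrt(x: float) -> float:
    """Piecewise-linear square root using shifts and adds only (illustrative, not E2AFS).

    With x = 2^E * (1 + m):
      even E -> 2^(E/2)     * (1   + m/2)
      odd E  -> 2^((E-1)/2) * (1.5 + m/2)
    """
    exp, man = split_float32(x)
    half_exp = exp >> 1            # arithmetic shift gives floor(E / 2)
    frac = man >> 1                # m / 2 in 23-bit fixed point
    if exp & 1:                    # odd exponent: add 0.5 to the fraction
        frac += 1 << 22
    out_bits = ((half_exp + 127) << 23) | frac
    return struct.unpack("<f", struct.pack("<I", out_bits))[0]


def relative_error(v: float) -> float:
    """Relative deviation of shift_add_sqrt from the exact square root at v."""
    return abs(shift_add_sqrt(v) - math.sqrt(v)) / math.sqrt(v)


# Sweep [1, 4): this covers one even (E = 0) and one odd (E = 1) exponent.
samples = [1.0 + i / 256.0 for i in range(256 * 3)]
mred = sum(relative_error(v) for v in samples) / len(samples)
worst = max(relative_error(v) for v in samples)
print(f"mean relative error = {mred:.4f}, worst-case relative error = {worst:.4f}")
```

For this particular rule the worst case over the sweep is a little over 6%, reached at x = 2 where the estimate is 1.5 against an exact value near 1.414. The summary does not say how E2AFS's own mapping or error figures compare, so this only shows the kind of equations and error sweep the referee asked the authors to supply.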

Circularity Check

0 steps flagged

No circularity in empirical hardware design and FPGA results

The paper proposes an approximate floating-point square-root hardware architecture (E2AFS) and reports direct FPGA synthesis metrics (power, delay, PDP) plus error and application-level results on Artix-7. It contains no derivation chain, first-principles prediction, fitted parameter presented as an output, or self-citation that carries any claim. The central results are empirical measurements from synthesis and testing, not reductions to inputs by construction or renamings of prior results. This matches the reader's assessment of minimal circularity burden for a design paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the approximation method itself is not described, so any design choices that control error versus efficiency remain unspecified.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs. London, U.K.: Oxford University Press, 2000.
[2] J. Han and M. Orshansky, “Approximate computing: An emerging paradigm for energy-efficient design,” in 2013 18th IEEE European Test Symposium (ETS), 2013, pp. 1–6.
[3] H. Jiang, C. Liu, L. Liu, F. Lombardi, and J. Han, “A review, classification, and comparative evaluation of approximate arithmetic circuits,” J. Emerg. Technol. Comput. Syst., vol. 13, no. 4, Aug. 2017.
[4] H. Jiang, F. J. H. Santiago, H. Mo, L. Liu, and J. Han, “Approximate arithmetic circuits: A survey, characterization, and recent applications,” Proceedings of the IEEE, vol. 108, no. 12, pp. 2108–2135, 2020.
[5] A. Dalloo, A. Najafi, and A. Garcia-Ortiz, “Systematic design of an approximate adder: The optimized lower part constant-OR adder,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 8, pp. 1595–1599, 2018.
[6] R. NA, A. A. A, and S. P, “Design and evaluation of low power error tolerant adder,” in 2023 International Conference on Next Generation Electronics (NEleX), 2023, pp. 1–6.
[7] W. Liu, L. Qian, C. Wang, H. Jiang, J. Han, and F. Lombardi, “Design of approximate radix-4 Booth multipliers for error-tolerant computing,” IEEE Transactions on Computers, vol. 66, no. 8, pp. 1435–1441, 2017.
[8] W. Liu, T. Cao, P. Yin, Y. Zhu, C. Wang, E. E. Swartzlander, and F. Lombardi, “Design and analysis of approximate redundant binary multipliers,” IEEE Transactions on Computers, vol. 68, no. 6, pp. 804–819, 2019.
[9] N. Arya, M. Pattanaik, and G. Sharma, “Energy-efficient logarithmic square rooter for error-resilient applications,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 29, no. 11, pp. 1994–1997, Nov. 2021.
[10] O. G. Ratnaparkhi and M. Rao, “ESAS: Exponent series based approximate square root design,” in 2022 25th Euromicro Conference on Digital System Design (DSD), 2022, pp. 39–45.
[11] P. Goyal and S. K. Sahoo, “Low-power hardware architecture of optimized logarithmic square rooter with enhanced error compensation for error-tolerant systems,” Integration, vol. 105, p. 102522, 2025.
[12] O. G. Ratnaparkhi and M. Rao, “CWAHA: Cluster-wise approximation for hardware implementation of arithmetic functions,” in 2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2023, pp. 1–6.
[13] X. Inc., “Vivado design suite tutorial: Power analysis and optimization,” 2022, available at: https://www.xilinx.com.
[14] J. Liang, J. Han, and F. Lombardi, “New metrics for the reliability of approximate and probabilistic adders,” IEEE Transactions on Computers, vol. 62, pp. 1760–1771, Sep. 2013.
[15] R. Gonzalez and R. Woods, Digital Image Processing. Upper Saddle River, N.J.: Prentice Hall, 2008.
[16] M. Leeser, J. Theiler, M. Estlick, and J. Szymanski, “Design tradeoffs in a hardware implementation of the k-means clustering algorithm,” in Proceedings of the 2000 IEEE Sensor Array and Multichannel Signal Processing Workshop. SAM 2000 (Cat. No.00EX410), 2000, pp. 520–524.
