FQA: A Full-Space Quantization-Driven Architecture for Hardware-Efficient Piecewise Approximation of Nonlinear Activation Functions

Chenjun Hao; Feng Yan; Hongbing Pan; Yuxuan Wang

arxiv: 2606.05627 · v1 · pith:KWGPFUOZnew · submitted 2026-06-04 · 💻 cs.AR · cs.ET

Pith reviewed 2026-06-27 23:43 UTC · model grok-4.3

keywords: piecewise polynomial approximation · nonlinear activation functions · hardware efficiency · quantization error · truncation error · sigmoid function · FPGA design

The pith

FQA searches the full space of truncation and quantization errors to locate optimal coefficients for piecewise approximations of activation functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes FQA to make piecewise polynomial approximations of nonlinear functions such as sigmoid more efficient on hardware. It models both fractional-bit truncation error and quantization error when choosing approximation coefficients, allowing the method to locate every coefficient value that could be optimal. This complete search reduces the number of segments needed while reaching the lowest possible maximum absolute error. The work also separates fractional word lengths, supplies two hardware schemes for different trade-offs, and adds a bisection search accelerator to keep computation feasible.
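To make the two error sources concrete, the sketch below evaluates one linear segment of sigmoid twice: once with real-valued coefficients, and once with coefficients rounded to a fixed number of fractional bits and the result truncated to the output word length. The segment [0, 0.5), the chord-based fit, the word lengths, and the helper names (`quantize`, `truncate`, `segment_mae`) are illustrative assumptions, not the paper's configuration or its coefficient search.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def quantize(value, fwl):
    """Round value to the nearest multiple of 2**-fwl (coefficient quantization)."""
    return round(value * (1 << fwl)) / (1 << fwl)

def truncate(value, fwl):
    """Drop fractional bits below 2**-fwl (floor, as plain bit truncation would)."""
    return math.floor(value * (1 << fwl)) / (1 << fwl)

def segment_mae(lo, hi, a, b, x_fwl, out_fwl):
    """Max |sigmoid(x) - truncate(a*x + b)| over every x_fwl-bit input in [lo, hi)."""
    step = 2.0 ** -x_fwl
    count = int(round((hi - lo) / step))
    worst = 0.0
    for i in range(count):
        x = lo + i * step
        y = truncate(a * x + b, out_fwl)
        worst = max(worst, abs(sigmoid(x) - y))
    return worst

# Assumed segment and fit: the chord through the segment endpoints.
lo, hi = 0.0, 0.5
a_real = (sigmoid(hi) - sigmoid(lo)) / (hi - lo)
b_real = sigmoid(lo) - a_real * lo

# Near-ideal arithmetic (30 fractional output bits) versus a narrow fixed-point format.
mae_real = segment_mae(lo, hi, a_real, b_real, x_fwl=8, out_fwl=30)
mae_fixed = segment_mae(lo, hi, quantize(a_real, 6), quantize(b_real, 8),
                        x_fwl=8, out_fwl=8)

print(f"MAE with real coefficients:      {mae_real:.6f}")
print(f"MAE with quantized + truncated:  {mae_fixed:.6f}")
```

Comparing `mae_real` with `mae_fixed` shows how much of the final error comes from the fixed-point format rather than from the polynomial fit itself.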

Core claim

FQA accounts for both the fractional-bit truncation error and the quantization error that shift the optimal approximation coefficients away from their ideal values. This lets it determine and search the complete range of optimal coefficients for hardware-efficient piecewise polynomial approximations of nonlinear activation functions. Two implementation schemes, decoupled word lengths, and an acceleration method together cut segments, area, and power.

What carries the argument

The full-space quantization-driven architecture (FQA) that jointly accounts for truncation and quantization errors to enumerate the entire set of candidate optimal coefficients.

If this is right

FQA reduces the number of segments required while still reaching the optimal maximum absolute error.
Two hardware implementation schemes allow different resource-performance balances.
Decoupling fractional word lengths opens exploration of improved hardware architectures.
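The TBW (target-guided bisection window) acceleration method keeps the expanded search computationally practical.
Sigmoid hardware achieves more than 50 percent reduction in area and power versus the prior state-of-the-art PPA (piecewise polynomial approximation) design.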
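The abstract does not describe how TBW works internally, so the following is only a generic sketch of the bisection idea it names: for a target error, bisect on the right endpoint of each segment to find the widest segment that still meets the target. The chord fit, the sampled error measure, and the names `chord_error`, `widest_segment`, and `segment_bounds` are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def chord_error(lo, hi, samples=256):
    """Max |sigmoid - chord| over a uniformly sampled segment [lo, hi]."""
    a = (sigmoid(hi) - sigmoid(lo)) / (hi - lo)
    b = sigmoid(lo) - a * lo
    xs = (lo + (hi - lo) * i / samples for i in range(samples + 1))
    return max(abs(sigmoid(x) - (a * x + b)) for x in xs)

def widest_segment(lo, limit, target, iters=40):
    """Bisect on the right endpoint: largest hi in (lo, limit] with error <= target."""
    if chord_error(lo, limit) <= target:
        return limit
    good, bad = lo, limit          # 'good' always satisfies the target, 'bad' never does
    for _ in range(iters):
        mid = 0.5 * (good + bad)
        if chord_error(lo, mid) <= target:
            good = mid
        else:
            bad = mid
    return good

def segment_bounds(lo, limit, target):
    """Greedy left-to-right segmentation using the bisection step above."""
    bounds = [lo]
    while bounds[-1] < limit:
        bounds.append(widest_segment(bounds[-1], limit, target))
    return bounds

bounds = segment_bounds(0.0, 8.0, target=1e-3)
print(f"{len(bounds) - 1} segments for target 1e-3 on [0, 8]")
```

Bisection is valid here because, for a fixed left endpoint on the concave half of sigmoid, the chord error grows as the segment widens. A search that also varies quantized coefficients, as FQA is described as doing, would replace `chord_error` with a more expensive per-segment evaluation, which is where an acceleration step pays off.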

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same error-modeling approach could be applied to other activation functions to test whether segment counts drop similarly.
The presented design workflow may improve how configurable hardware allocates resources for entire neural-network inference pipelines.
Lower area and power for individual functions could compound when many activation units sit inside a larger accelerator.

Load-bearing premise

Jointly modeling fractional-bit truncation error and quantization error is enough to locate every possible optimal coefficient without missing better solutions that would appear under different error models.

What would settle it

A coefficient set for sigmoid that achieves lower MAE or fewer segments than any FQA output while using the same hardware word lengths.
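One way to run such a check on a small case is a brute-force reference: enumerate every quantized coefficient pair in a window around the naively rounded values and record the best achievable MAE for a fixed set of word lengths. Any method's output for the same segment and word lengths can then be compared against it. The sketch assumes a single linear segment, floor truncation of the output, and hypothetical word lengths; `fixed_mae` and `brute_force_segment` are illustrative names, not part of the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fixed_mae(a_int, b_int, a_fwl, b_fwl, out_fwl, xs):
    """MAE of floor-truncated a*x + b, with a and b given as scaled integers."""
    a = a_int / 2 ** a_fwl
    b = b_int / 2 ** b_fwl
    scale = 2 ** out_fwl
    return max(abs(sigmoid(x) - math.floor((a * x + b) * scale) / scale) for x in xs)

def brute_force_segment(lo, hi, x_fwl, a_fwl, b_fwl, out_fwl, radius=4):
    """Compare naive rounding with an exhaustive search of nearby coefficient codes."""
    step = 2.0 ** -x_fwl
    xs = [lo + i * step for i in range(int(round((hi - lo) / step)))]
    # Hypothetical starting point: the chord through the segment endpoints.
    a_real = (sigmoid(hi) - sigmoid(lo)) / (hi - lo)
    b_real = sigmoid(lo) - a_real * lo
    a0 = round(a_real * 2 ** a_fwl)
    b0 = round(b_real * 2 ** b_fwl)
    naive = fixed_mae(a0, b0, a_fwl, b_fwl, out_fwl, xs)
    best = min(
        (fixed_mae(a0 + da, b0 + db, a_fwl, b_fwl, out_fwl, xs), a0 + da, b0 + db)
        for da in range(-radius, radius + 1)
        for db in range(-radius, radius + 1)
    )
    return naive, best

naive_mae, (best_mae, best_a, best_b) = brute_force_segment(
    0.0, 0.5, x_fwl=8, a_fwl=8, b_fwl=10, out_fwl=8)
print(f"naive rounding MAE = {naive_mae:.6f}")
print(f"best nearby MAE    = {best_mae:.6f} at a = {best_a}/256, b = {best_b}/1024")
```

The window radius bounds the cost of the enumeration; it is only a reference for small cases and says nothing about how FQA itself delimits its search range.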

Original abstract

In this paper, we propose a full-space quantization-driven architecture (FQA) for the hardware-efficient piecewise polynomial approximations (PPAs) of nonlinear activation functions. FQA comprehensively considers both fractional-bit truncation error and quantization error that cause the deviation of the optimal approximation coefficients. Crucially, FQA can precisely determine and search the complete range of optimal coefficients. Based on the proposed FQA, we develop two distinct hardware implementation schemes to cater to different resource-performance trade-offs. Furthermore, we decouple all the fractional word lengths (FWLs) involved in the calculation process to enable the exploration of superior hardware architectures. To mitigate the increased software computation time caused by the expanded quantization space, we design an acceleration method named TBW (target-guided bisection window) to expedite the piecewise calculation and searching process. Experimental results demonstrate that, compared to existing architectures, FQA can significantly reduce the number of required segments while achieving the optimal Maximum Absolute Error (MAE). For the hardware design of the Sigmoid function, our approach achieves over 50% reduction in area and power consumption compared to the state-of-the-art PPA architecture. Finally, we present a complete design workflow for deploying PPA on configurable hardware, maximizing the utilization of existing hardware resources and minimizing MAE.
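The benefit of decoupling fractional word lengths can be illustrated with a small sweep: give the slope, the intercept, and the output each its own FWL, and pick the cheapest combination that meets an error target. The sketch below is an assumption-laden toy, not the paper's workflow: it uses one linear segment, a chord fit, floor truncation, and total fractional bits as a stand-in cost (a hypothetical proxy, not an area or power model).

```python
import itertools
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mae(a_fwl, b_fwl, out_fwl, lo=0.0, hi=0.5, x_fwl=8):
    """MAE on one segment with independently chosen FWLs for a, b and the output."""
    a_real = (sigmoid(hi) - sigmoid(lo)) / (hi - lo)
    b_real = sigmoid(lo) - a_real * lo
    a = round(a_real * 2 ** a_fwl) / 2 ** a_fwl
    b = round(b_real * 2 ** b_fwl) / 2 ** b_fwl
    scale = 2 ** out_fwl
    step = 2.0 ** -x_fwl
    xs = (lo + i * step for i in range(int(round((hi - lo) / step))))
    return max(abs(sigmoid(x) - math.floor((a * x + b) * scale) / scale) for x in xs)

def cheapest(target, fwls=range(4, 13)):
    """Smallest total-bit (a_fwl, b_fwl, out_fwl) combination with MAE <= target."""
    feasible = [
        (a + b + o, a, b, o)
        for a, b, o in itertools.product(fwls, repeat=3)
        if mae(a, b, o) <= target
    ]
    return min(feasible) if feasible else None

target = 4e-3
decoupled = cheapest(target)
# Baseline where all three FWLs are forced to be equal.
coupled_fwl = min(f for f in range(4, 13) if mae(f, f, f) <= target)

print("decoupled (total, a_fwl, b_fwl, out_fwl):", decoupled)
print("coupled FWL for all three:", coupled_fwl, "-> total", 3 * coupled_fwl)
```

Because every coupled configuration is also a point in the decoupled space, the decoupled optimum can never cost more under this proxy; whether it costs less depends on the function, the segment, and the target.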

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FQA expands the coefficient search for PPAs by jointly modeling truncation and quantization errors, with claimed hardware savings for sigmoid that rest on an unverified completeness assumption.

The main takeaway is that FQA searches a combined space of fractional-bit truncation error and quantization error to pick coefficients for piecewise polynomial approximations of activations like sigmoid. The authors add TBW to speed up the expanded search, decouple all fractional word lengths for architecture exploration, and give two hardware schemes plus a full deployment workflow.

The paper does a reasonable job extending standard PPA fitting by treating both error sources together instead of handling them separately. The reported outcome (fewer segments while hitting optimal MAE, plus over 50% area and power reduction versus prior PPA hardware for sigmoid) addresses a practical bottleneck in edge accelerators. Decoupling the word lengths and providing the end-to-end mapping steps shows that the authors connected the software search to actual hardware constraints.

The soft spot is the load-bearing claim that this approach “precisely determines and searches the complete range of optimal coefficients.” That only follows if truncation and quantization errors are the only ones that matter. Other factors such as rounding-mode interactions or overflow could move the true optimum outside the modeled space, which would make the gains relative to an incomplete baseline rather than absolute. The abstract supplies no equations, sensitivity checks, or full comparison tables, so it is difficult to judge how tight the optimality really is.
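The rounding-mode concern can be made concrete with a small experiment: evaluate the same segment with the same coefficients under floor truncation and under round-to-nearest, and compare the maximum errors. The coefficients (slope 0.25, intercept 0.5), the segment [0, 0.5), and the 8-bit word lengths are illustrative assumptions, not values from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fixed_eval(x, a, b, fwl, mode):
    """Evaluate a*x + b and reduce it to fwl fractional bits using the given mode."""
    scaled = (a * x + b) * 2 ** fwl
    if mode == "truncate":
        code = math.floor(scaled)          # drop the low bits
    elif mode == "nearest":
        code = math.floor(scaled + 0.5)    # round half up
    else:
        raise ValueError(f"unknown mode: {mode}")
    return code / 2 ** fwl

def mae_for_mode(mode, a=0.25, b=0.5, lo=0.0, hi=0.5, x_fwl=8, out_fwl=8):
    step = 2.0 ** -x_fwl
    xs = (lo + i * step for i in range(int(round((hi - lo) / step))))
    return max(abs(sigmoid(x) - fixed_eval(x, a, b, out_fwl, mode)) for x in xs)

mae_trunc = mae_for_mode("truncate")
mae_near = mae_for_mode("nearest")
print(f"truncate: {mae_trunc:.6f}")
print(f"nearest:  {mae_near:.6f}")
```

With these values the two modes give different maximum errors, because the downward bias of truncation interacts with where the line sits relative to the curve. Coefficients that are optimal under one output-reduction mode therefore need not be optimal under another, which is why the scope of the error model matters for any completeness claim.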

This is aimed at hardware engineers building neural-network accelerators on FPGAs or low-power ASICs who need efficient fixed-point approximations. A reader working on activation implementations would pick up usable ideas from the search procedure and the workflow section.

It deserves peer review. The method is a concrete, incremental step on existing PPA techniques with measurable hardware targets, even if the completeness argument needs more evidence in the full manuscript.

Referee Report

1 major / 0 minor

Summary. The paper proposes FQA, a full-space quantization-driven architecture for hardware-efficient piecewise polynomial approximations (PPAs) of nonlinear activation functions. It jointly models fractional-bit truncation error and quantization error to search the complete range of optimal coefficients, develops two hardware implementation schemes with decoupled fractional word lengths, introduces a TBW acceleration method, and claims fewer segments, optimal MAE, over 50% area/power reduction for Sigmoid versus state-of-the-art PPA, plus a complete deployment workflow.

Significance. If the optimality and hardware gains hold under the modeled errors, the work could improve resource efficiency in neural network accelerators by enabling more compact PPAs for activations while preserving accuracy.

major comments (1)

[Abstract] The central claim that FQA 'precisely determine[s] and search[es] the complete range of optimal coefficients' by jointly modeling only fractional-bit truncation error and quantization error is load-bearing for the asserted optimality, segment reduction, and >50% area/power gains for Sigmoid. The manuscript provides no explicit verification that other error sources (e.g., rounding-mode interactions or fixed-point overflow) or alternative hardware constraints do not shift the true optimum outside the modeled space, leaving the completeness of the search unproven.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We address it point-by-point below.

Referee: [Abstract] The central claim that FQA 'precisely determine[s] and search[es] the complete range of optimal coefficients' by jointly modeling only fractional-bit truncation error and quantization error is load-bearing for the asserted optimality, segment reduction, and >50% area/power gains for Sigmoid. The manuscript provides no explicit verification that other error sources (e.g., rounding-mode interactions or fixed-point overflow) or alternative hardware constraints do not shift the true optimum outside the modeled space, leaving the completeness of the search unproven.

Authors: We agree that the manuscript does not provide explicit verification or sensitivity analysis for error sources beyond the jointly modeled fractional-bit truncation and quantization errors. FQA is formulated to exhaustively enumerate coefficient candidates under precisely these two error contributions, which the paper identifies as the dominant sources of deviation from ideal polynomial coefficients in fixed-point PPA hardware. The reported optimality, segment reduction, and hardware gains are therefore with respect to this error model. Other factors such as rounding-mode interactions or overflow are governed by standard fixed-point arithmetic conventions and are typically resolved at the implementation stage rather than during coefficient search. To address the concern, we will revise the abstract to qualify the completeness claim as applying within the modeled truncation-plus-quantization space and add a short paragraph in Section III clarifying the scope and assumptions.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; method is a search procedure validated experimentally

The paper introduces FQA as a search over an expanded coefficient space that jointly accounts for truncation and quantization error when selecting PPA coefficients for nonlinear activations. Claims of fewer segments, optimal MAE, and hardware gains (>50% area/power for Sigmoid) rest on experimental comparisons to prior PPA architectures rather than any reduction of outputs to fitted inputs or self-citation chains. No equations or sections equate a 'prediction' to a fitted parameter by construction, and no load-bearing uniqueness theorem or ansatz is imported from prior author work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Report is based solely on the abstract; no equations or implementation details are available to enumerate free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

35 extracted references · 1 linked inside Pith

[1] Z. Wang, J. Lin, and Z. Wang, “Accelerating recurrent neural networks: a memory-efficient approach,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 25, no. 10, pp. 2763–2775, Oct. 2017.

[2] S. Wang et al., “Acceleration of LSTM with structured pruning method on FPGA,” IEEE Access, vol. 7, pp. 62930–62937, 2019.

[3] Z. Liu et al., “KAN: Kolmogorov-Arnold networks,” June 16, 2024, arXiv:2404.19756. Accessed: Oct. 28, 2024. [Online]. Available: http://arxiv.org/abs/2404.19756

[4] A. Seth and W.-S. Gan, “Fixed-point square roots using L-b truncation [DSP tips and tricks],” IEEE Signal Process. Mag., vol. 28, no. 6, pp. 149–153, Nov. 2011.

[5] Y. Wang, Y. Luo, Z. Wang, Q. Shen, and H. Pan, “GH CORDIC-based architecture for computing $N$th root of single-precision floating-point number,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 28, no. 4, pp. 864–875, Apr. 2020.

[6] Y. Luo, Y. Wang, Y. Ha, Z. Wang, S. Chen, and H. Pan, “Generalized hyperbolic CORDIC and its logarithmic and exponential computation with arbitrary fixed base,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 27, no. 9, pp. 2156–2169, Sept. 2019.

[7] J.-M. Muller, “Elementary functions and approximate computing,” Proc. IEEE, vol. 108, no. 12, pp. 2136–2149, Dec. 2020.

[8] J. Y. L. Low and C. C. Jong, “A memory-efficient tables-and-additions method for accurate computation of elementary functions,” IEEE Trans. Comput., vol. 62, no. 5, pp. 858–872, May 2013.

[9] D. Das Sarma and D. W. Matula, “Faithful bipartite ROM reciprocal tables,” in Proceedings of the 12th Symposium on Computer Arithmetic, July 1995, pp. 17–28.

[10] M. J. Schulte and J. E. Stine, “Approximating elementary functions with symmetric bipartite tables,” IEEE Trans. Comput., vol. 48, no. 8, pp. 842–847, Aug. 1999.

[11] S. Paul, N. Jayakumar, and S. P. Khatri, “A fast hardware approach for approximate, efficient logarithm and antilogarithm computations,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 17, no. 2, pp. 269–277, Feb. 2009.

[12] F. de Dinechin and A. Tisserand, “Multipartite table methods,” IEEE Trans. Comput., vol. 54, no. 3, pp. 319–330, Mar. 2005.

[13] S.-F. Hsiao, C.-S. Wen, Y.-H. Chen, and K.-C. Huang, “Hierarchical multipartite function evaluation,” IEEE Trans. Comput., vol. 66, no. 1, pp. 89–99, Jan. 2017.

[14] D. M. Ellaithy, M. A. El-Moursy, G. H. Ibrahim, A. Zaki, and A. Zekry, “Double logarithmic arithmetic technique for low-power 3-D graphics applications,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 25, no. 7, pp. 2144–2152, July 2017.

[15] J.-A. Pineiro, S. F. Oberman, J.-M. Muller, and J. D. Bruguera, “High-speed function approximation using a minimax quadratic interpolator,” IEEE Trans. Comput., vol. 54, no. 3, pp. 304–318, Mar. 2005.

[16] S. A. Tawfik and H. A. H. Fahmy, “Algorithmic truncation of minimax polynomial coefficients,” in 2006 IEEE International Symposium on Circuits and Systems (ISCAS), May 2006, p. 4 pp.–2424.

[17] S.-F. Hsiao, H.-J. Ko, and C.-S. Wen, “Two-level hardware function evaluation based on correction of normalized piecewise difference functions,” IEEE Trans. Circuits Syst. II Express Briefs, vol. 59, no. 5, pp. 292–296, May 2012.

[18] D. De Caro, E. Napoli, D. Esposito, G. Castellano, N. Petra, and A. G. M. Strollo, “Minimizing coefficients wordlength for piecewise-polynomial hardware function evaluation with exact or faithful rounding,” IEEE Trans. Circuits Syst. Regul. Pap., vol. 64, no. 5, pp. 1187–1200, May 2017.

[19] D. De Caro, N. Petra, and A. G. M. Strollo, “Efficient logarithmic converters for digital signal processing applications,” IEEE Trans. Circuits Syst. II Express Briefs, vol. 58, no. 10, pp. 667–671, Oct. 2011.

[20] M. Zhu, Y. Ha, C. Gu, and L. Gao, “An optimized logarithmic converter with equal distribution of relative errors,” IEEE Trans. Circuits Syst. II Express Briefs, vol. 63, no. 9, pp. 848–852, Sept. 2016.

[21] C.-W. Liu, S.-H. Ou, K.-C. Chang, T.-C. Lin, and S.-K. Chen, “A low-error, cost-efficient design procedure for evaluating logarithms to be used in a logarithmic arithmetic processor,” IEEE Trans. Comput., vol. 65, no. 4, pp. 1158–1164, Apr. 2016.

[22] S. R. Chiluveru, M. Tripathy, and Bibhudutta, “Non-linear activation function approximation using a REMEZ algorithm,” IET Circuits Devices Syst., vol. 15, no. 7, pp. 630–640, Oct. 2021.

[23] T. Sasao, S. Nagayama, and J. T. Butler, “Numerical function generators using LUT cascades,” IEEE Trans. Comput., vol. 56, no. 6, pp. 826–838, June 2007.

[24] D.-U. Lee, R. C. C. Cheung, W. Luk, and J. D. Villasenor, “Hierarchical segmentation for hardware function evaluation,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 17, no. 1, pp. 103–116, Jan. 2009.

[25] H. Sun et al., “A universal method of linear approximation with controllable error for the efficient implementation of transcendental functions,” IEEE Trans. Circuits Syst. Regul. Pap., vol. 67, no. 1, pp. 177–188, Jan. 2020.

[26] H. Dong et al., “PLAC: piecewise linear approximation computation for all nonlinear unary functions,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 28, no. 9, pp. 2014–2027, Sept. 2020.

[27] F. Lyu, X. Xu, Y. Wang, Y. Luo, Y. Wang, and H. Pan, “Ultralow-latency VLSI architecture based on a linear approximation method for computing nth roots of floating-point numbers,” IEEE Trans. Circuits Syst. Regul. Pap., vol. 68, no. 2, pp. 715–727, Feb. 2021.

[28] F. Lyu, Z. Mao, J. Zhang, Y. Wang, and Y. Luo, “PWL-Based Architecture for the Logarithmic Computation of Floating-Point Numbers,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 29, no. 7, pp. 1470–1474, July 2021.

[29] F. Lyu, Y. Xia, Z. Mao, Y. Wang, Y. Wang, and Y. Luo, “ML-PLAC: multiplierless piecewise linear approximation for nonlinear function evaluation,” IEEE Trans. Circuits Syst. Regul. Pap., vol. 69, no. 4, pp. 1546–1559, Apr. 2022.

[30] M. An et al., “Piecewise parabolic approximate computation based on an error-flattened segmenter and a novel quantizer,” Electronics, vol. 10, no. 21, p. 2704, 2021.

[31] H. Geng, X. Chen, N. Zhao, Y. Du, and L. Du, “QPA: A Quantization-Aware Piecewise Polynomial Approximation Methodology for Hardware-Efficient Implementations,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 31, no. 7, pp. 931–944, July 2023.

[32] N. L. Carothers, “A short course on approximation theory,” Dept. Math. Statist., Bowling Green State Univ., Bowling Green, OH, USA, 2009.

[33] Z. Mei, H. Dong, Y. Wang, and H. Pan, “TEA-S: A tiny and efficient architecture for PLAC-based softmax in transformers,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 70, no. 9, pp. 3594–3598, Sep. 2023.

[34] Y. Wu, Z. Xie, H. Pan, and Y. Wang, “MBS: A high-precision approximation method for softmax and efficient hardware implementation,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 72, no. 7, pp. 3366–3375, Jul. 2025.

[35] Z. Cui et al., “TEA-SPS: A Tiny and Efficient Architecture for Softmax With Parallelism and Sparsity Adaptability,” IEEE Trans. Circuits Syst. Regul. Pap., pp. 1–14, 2025.
