
arxiv: 2604.23547 · v1 · submitted 2026-04-26 · 💻 cs.AR · cs.AI

Hardware-Efficient FPGA Implementation of Sigmoid Function Using Mixed-Radix Hyperbolic Rotation CORDIC

Pith reviewed 2026-05-08 05:09 UTC · model grok-4.3

keywords: sigmoid, CORDIC, FPGA, activation function, hyperbolic tangent, mixed-radix, fixed-point, hardware implementation

The pith

A mixed-radix CORDIC architecture computes the sigmoid function on an FPGA in 835 logic slices, with zero DSP usage and a mean absolute error of 4.23 × 10^{-4}.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a hardware-efficient way to realize the sigmoid activation on FPGAs by exploiting its direct relation to hyperbolic tangent after normalizing the input to unit range. The core technique is a mixed-radix hyperbolic rotation CORDIC that begins with stable radix-2 steps and switches to faster radix-4 steps without extra compensation, followed by a linear vectoring stage to recover tanh from the computed hyperbolic sine and cosine. The result is a fully pipelined 16-bit fixed-point design that occupies very few resources, which matters for embedding neural networks on edge devices where area and power are tight. The reported error level is low enough to replace software or floating-point alternatives in many classification and gating tasks.

Core claim

The mixed-radix hyperbolic rotation CORDIC (MR-HRC) algorithm, which performs initial radix-2 iterations for stable convergence and subsequent radix-4 iterations for accelerated convergence without scale-factor compensation, combined with a final radix-2 linear vectoring CORDIC stage, computes hyperbolic sine and cosine values whose ratio yields tanh; the sigmoid is then obtained directly from this tanh after normalizing the input range to unity.
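As a rough illustration of the hyperbolic rotation step, the sketch below runs a conventional radix-2 hyperbolic CORDIC in floating point. It is not the paper's MR-HRC schedule: the function name, the iteration count, and the textbook repetition of indices 4 and 13 are assumptions made here, and the radix-4 stage is omitted entirely.

```python
import math

def hyperbolic_rotation_cordic(theta, n_iter=16):
    """Approximate (cosh(theta), sinh(theta)) with radix-2 hyperbolic CORDIC.

    Hypothetical helper for illustration only; floating point stands in
    for the 16-bit fixed-point datapath described in the paper.
    """
    # Hyperbolic CORDIC starts at index 1 and repeats indices 4 and 13
    # so that the sequence of elementary angles converges.
    indices = []
    i = 1
    while len(indices) < n_iter:
        indices.append(i)
        if i in (4, 13) and len(indices) < n_iter:
            indices.append(i)
        i += 1

    # Accumulated gain of the shift-and-add steps (about 0.828).
    gain = 1.0
    for i in indices:
        gain *= math.sqrt(1.0 - 2.0 ** (-2 * i))

    # Pre-scaling x by 1/gain yields unscaled cosh and sinh directly.
    x, y, z = 1.0 / gain, 0.0, theta
    for i in indices:
        d = 1.0 if z >= 0 else -1.0
        t = 2.0 ** -i
        x, y = x + d * y * t, y + d * x * t
        z -= d * math.atanh(t)
    return x, y

cosh_val, sinh_val = hyperbolic_rotation_cordic(0.5)
print(cosh_val, sinh_val, sinh_val / cosh_val)
```

Because tanh is the ratio sinh/cosh, any common gain on x and y cancels in the division, which is one reason a ratio-based design can tolerate an uncompensated scale factor.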

What carries the argument

Mixed-radix hyperbolic rotation CORDIC (MR-HRC) that runs radix-2 iterations first and then switches to radix-4 iterations to generate hyperbolic sine and cosine, followed by radix-2 linear vectoring CORDIC to form tanh as their ratio.
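The division performed by the linear vectoring stage can be sketched with shift-and-add steps alone. This is a minimal floating-point model under stated assumptions: the function name and iteration count are hypothetical, the divisor is assumed positive, and the quotient magnitude is assumed below 1, which holds for tanh on the reduced range.

```python
import math

def linear_vectoring_divide(y, x, n_iter=16):
    """Approximate y / x with radix-2 linear vectoring CORDIC.

    Hypothetical helper: assumes x > 0 and |y / x| < 1. Each step uses
    only a shift (multiplication by 2**-i) and an add or subtract.
    """
    z = 0.0
    for i in range(1, n_iter + 1):
        t = 2.0 ** -i
        d = 1.0 if y >= 0 else -1.0  # direction that drives y toward zero
        y -= d * x * t
        z += d * t                   # z accumulates the quotient bits
    return z

# tanh(0.5) recovered as sinh(0.5) / cosh(0.5) without a divider
print(linear_vectoring_divide(math.sinh(0.5), math.cosh(0.5)))
```

The residual error after n steps is bounded by 2^-n, so the iteration count trades latency against quotient precision.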

If this is right

  • The fully pipelined design fits in 835 logic slices with no DSP blocks on a Virtex-7 FPGA.
  • The architecture delivers 4.23 × 10^{-4} mean absolute error in 16-bit fixed-point format.
  • Input normalization to unity limits the tanh argument to a magnitude of at most 0.5 and speeds convergence.
  • Zero DSP usage leaves multiplier resources free for other parts of a neural-network accelerator.
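
The range reduction in the list above follows from the standard sigmoid/tanh identity. A short derivation, with σ denoting the sigmoid and u = x/2 as notation introduced here:

```latex
\begin{align*}
\sigma(x) &= \frac{1}{1 + e^{-x}}
           = \frac{e^{x/2}}{e^{x/2} + e^{-x/2}}
           = \frac{1}{2}\left(1 + \tanh\frac{x}{2}\right), \\
\tanh u &= \frac{\sinh u}{\cosh u}, \qquad u = \frac{x}{2}, \\
x \in [-1, 1] &\;\Longrightarrow\; u \in [-0.5,\, 0.5].
\end{align*}
```

A conventional radix-2 hyperbolic CORDIC with repeated iterations converges for arguments up to roughly 1.118 in magnitude, so |u| ≤ 0.5 sits well inside that range. The convergence bound of the paper's mixed-radix schedule is not stated in this summary.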

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mixed-radix pattern could be adapted to implement tanh or softplus with similar resource savings.
  • Embedding the module in a full network would test whether the fixed-point error accumulates acceptably across layers.
  • The approach may translate to ASIC flows where eliminating multipliers is even more valuable.
  • Extending the word length or adding on-the-fly range detection could widen applicability to larger dynamic ranges.

Load-bearing premise

Normalizing the input to unity and using 16-bit fixed-point arithmetic preserves enough accuracy for downstream neural-network tasks without requiring higher precision.

What would settle it

Compare the hardware sigmoid output against a double-precision reference over a dense grid of inputs in [-1, 1] and verify that the mean absolute error stays at or below 4.23 × 10^{-4}. Then run an end-to-end neural-network test to check that classification or gating accuracy does not degrade.
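
A minimal sketch of such an accuracy check follows. The hardware is replaced by a stand-in software model that only quantizes input and output; the function names, the 12 fractional bits, and the grid size are all assumptions made for illustration, not values taken from the paper.

```python
import math

def sigmoid_reference(x):
    """Double-precision reference."""
    return 1.0 / (1.0 + math.exp(-x))

def quantize(value, frac_bits):
    """Round to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale

def sigmoid_fixed_point_model(x, frac_bits=12):
    """Stand-in for the hardware: quantized input and output only.

    Hypothetical model; a real check would substitute RTL simulation
    output or captured FPGA results here.
    """
    xq = quantize(x, frac_bits)
    return quantize(0.5 * (1.0 + math.tanh(0.5 * xq)), frac_bits)

def mean_absolute_error(model, lo=-1.0, hi=1.0, n_points=20001):
    """Average |model(x) - reference(x)| over a uniform grid on [lo, hi]."""
    step = (hi - lo) / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        x = lo + k * step
        total += abs(model(x) - sigmoid_reference(x))
    return total / n_points

mae = mean_absolute_error(sigmoid_fixed_point_model)
print(f"MAE of stand-in model: {mae:.3e}")
```

Reporting the maximum absolute error alongside the mean from the same loop would show whether a low average hides isolated large deviations.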

Figures

Figures reproduced from arXiv: 2604.23547 by Ankur Changela, Chintan Panchal, Mohendra Roy.

Figure 1: Sigmoid activation function over the range [−6, 6].
    Adjacent text: Despite its importance, directly implementing the sigmoid function in hardware is computationally expensive due to the complex operations involved, such as exponentiation and division. This paper addresses this challenge by proposing a novel hardware approximation method for the sigmoid function.
Figure 2.
Figure 3: The architecture of the proposed MR-HRC algorithm.
Figure 4: (a) Architecture of the R2-HRC algorithm. (b) Architecture of the R4-HRC algorithm.
    Adjacent text: In the R4-HRC algorithm, the digit selection function has five different values. As a result, the critical path involves two key components: the multiplexer and the adder.
Figure 5: The architecture of the R2-LVC algorithm with adder.
Original abstract

Efficient hardware implementation of nonlinear activation functions is a crucial task in deploying artificial neural networks on resource-constrained and edge devices such as Field-Programmable Gate Arrays (FPGAs). The sigmoid activation function is widely used for probabilistic output, binary classification, and gating mechanisms in recurrent neural networks, despite its reliance on exponential computations. This paper presents a hardware-efficient FPGA implementation of the sigmoid activation function using a mixed-radix CORDIC-based architecture. The proposed approach leverages the mathematical relationship between the sigmoid and hyperbolic tangent functions. The input range is normalized to 1, enabling the corresponding tanh computation to operate within a reduced range of 0.5, which significantly improves convergence behavior. To achieve high accuracy with minimal hardware overhead, a modified mixed-radix hyperbolic rotation CORDIC (MR-HRC) algorithm combining radix-2 and radix-4 iterations is introduced. The initial radix-2 stage ensures stable convergence, while the subsequent radix-4 stage accelerates convergence without requiring scale-factor compensation. In the final stage, a radix-2 linear vectoring CORDIC (R2-LVC) is used to compute the hyperbolic tangent by dividing hyperbolic sine and cosine values derived from the MR-HRC algorithm. The entire architecture is fully pipelined and implemented on an FPGA. The design is realized on a Xilinx Virtex-7 FPGA using a 16-bit fixed-point representation. Experimental results demonstrate a significant reduction in hardware utilization, requiring only 835 logic slices with zero DSP usage. Additionally, the design achieves a mean absolute error of 4.23 × 10^{-4}, outperforming several recent sigmoid implementations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This paper presents a hardware-efficient FPGA implementation of the sigmoid activation function via a mixed-radix hyperbolic rotation CORDIC (MR-HRC) algorithm that combines radix-2 and radix-4 iterations, followed by a radix-2 linear vectoring CORDIC (R2-LVC) stage to compute tanh. By normalizing inputs to unity (reducing the effective tanh range to [0, 0.5]), the fully pipelined 16-bit fixed-point design on Xilinx Virtex-7 uses 835 logic slices with zero DSP blocks and achieves a measured mean absolute error of 4.23 × 10^{-4}, claiming to outperform several recent sigmoid implementations.

Significance. If the reported resource counts and error hold for the full dynamic range of sigmoid inputs encountered in neural networks, the work would provide a useful DSP-free baseline for edge-AI accelerators. The concrete post-synthesis slice counts, zero DSP usage, and measured MAE on the target FPGA are clear strengths that enable direct reproducibility and comparison.

major comments (2)
  1. [Abstract] Abstract: The headline claims of 835 slices, zero DSP usage, and MAE 4.23 × 10^{-4} are obtained only after input normalization to unity. For |x| > 1 (common in NN pre-activations), the architecture requires unspecified clipping or scaling logic; neither its fixed-point format, slice overhead, nor its contribution to total error is reported. This omission directly weakens the 'outperforming' claim for practical deployment.
  2. [Abstract] Abstract and results: No error-bar statistics, full-range verification (beyond the normalized interval), or side-by-side table comparing MAE and resources against the cited recent implementations are supplied. Without these, the robustness of the accuracy claim across the input distribution actually used in networks cannot be assessed.
minor comments (1)
  1. [Implementation description] The description of the MR-HRC iteration schedule and scale-factor handling would benefit from an explicit pseudocode listing or timing diagram to clarify the radix-2 to radix-4 transition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have made revisions to improve clarity on input handling and to strengthen the results presentation.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claims of 835 slices, zero DSP usage, and MAE 4.23 × 10^{-4} are obtained only after input normalization to unity. For |x| > 1 (common in NN pre-activations), the architecture requires unspecified clipping or scaling logic; neither its fixed-point format, slice overhead, nor its contribution to total error is reported. This omission directly weakens the 'outperforming' claim for practical deployment.

    Authors: The manuscript explicitly states that the input range is normalized to unity to reduce the effective tanh range to [0, 0.5] and improve CORDIC convergence. This normalization is presented as an integral part of the proposed approach for hardware efficiency. We agree that the preprocessing logic for |x| > 1 was not detailed sufficiently. In the revised manuscript we will add a dedicated subsection on the input interface, describing a simple saturation-based clipping to [-1, 1] followed by normalization, the 16-bit fixed-point format used (Q7.8), and an estimate of its resource cost (a few comparators and multiplexers adding < 50 slices with zero DSPs). We will also report that the additional error from clipping is negligible (< 10^{-5}) relative to the core MAE, thereby supporting the practical applicability of the reported figures. revision: yes
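
The saturation stage described in this response can be sketched as below. The reading of Q7.8 as one sign bit, seven integer bits, and eight fractional bits is an assumption, as are all function names; the simulated rebuttal does not specify the logic.

```python
FRAC_BITS = 8                      # assumed Q7.8: 8 fractional bits
INT16_MIN = -(1 << 15)
INT16_MAX = (1 << 15) - 1
ONE_Q7_8 = 1 << FRAC_BITS          # raw integer code for 1.0

def to_q7_8(value):
    """Convert a real value to a saturated 16-bit Q7.8 integer code."""
    raw = round(value * (1 << FRAC_BITS))
    return max(INT16_MIN, min(INT16_MAX, raw))

def from_q7_8(raw):
    """Convert a Q7.8 integer code back to a real value."""
    return raw / (1 << FRAC_BITS)

def clip_to_unit_range(raw):
    """Saturate a Q7.8 code to [-1, 1] using two comparisons."""
    return max(-ONE_Q7_8, min(ONE_Q7_8, raw))

for sample in (0.5, 3.7, -2.0):
    clipped = clip_to_unit_range(to_q7_8(sample))
    print(sample, "->", from_q7_8(clipped))
```

Whether saturating at ±1 is acceptable depends on how inputs are scaled beforehand, since sigmoid(1) ≈ 0.731 is still well short of the function's asymptote.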

  2. Referee: [Abstract] Abstract and results: No error-bar statistics, full-range verification (beyond the normalized interval), or side-by-side table comparing MAE and resources against the cited recent implementations are supplied. Without these, the robustness of the accuracy claim across the input distribution actually used in networks cannot be assessed.

    Authors: The reported MAE is a deterministic average over the normalized input interval for which the architecture is optimized. We acknowledge the absence of a consolidated comparison table and extended verification. In the revision we will insert a side-by-side table in the results section that lists MAE, slice count, DSP usage, and latency for our design and the referenced recent implementations. We will also add full-range verification results obtained by applying the clipping stage for |x| > 1, confirming that overall MAE stays within 5 × 10^{-4}. In addition, we will report maximum absolute error and error standard deviation to provide statistical context for the accuracy claim. revision: yes

Circularity Check

0 steps flagged

No circularity: results are measured FPGA metrics from explicit implementation

full rationale

The paper describes an architecture that applies the standard identity sigmoid(x) = 0.5*(1 + tanh(x/2)), normalizes the input range to unity so tanh operates on [-0.5, 0.5], then realizes the tanh via a mixed-radix hyperbolic rotation CORDIC (radix-2 then radix-4 stages) followed by a radix-2 linear vectoring CORDIC stage that divides sinh and cosh. These steps are conventional CORDIC iterations with an explicit range-reduction design choice; none of the equations or claims reduce to fitted parameters, self-referential definitions, or load-bearing self-citations. The headline numbers (835 slices, 0 DSP, MAE 4.23e-4) are reported as post-synthesis and simulation measurements on Virtex-7, not as predictions derived from the same data. No uniqueness theorems or ansatzes are smuggled in. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The design rests on the standard identity linking sigmoid to tanh and on the convergence properties of CORDIC iterations; no new entities are postulated.

free parameters (2)
  • 16-bit fixed-point width
    Chosen for hardware efficiency; directly affects both resource count and approximation error.
  • Input normalization to unity range
    Ad-hoc scaling chosen to shrink the tanh argument to 0.5 and improve convergence speed.
axioms (2)
  • [standard math] Sigmoid(x) can be computed from tanh(x/2) via an algebraic identity
    Invoked in the abstract to justify the reduced-range tanh computation.
  • [domain assumption] Mixed-radix CORDIC converges without scale-factor compensation when radix-2 precedes radix-4
    Stated as enabling stable and fast iteration without extra correction hardware.


