A Quantitative Evaluation of Approximate Softmax Functions for Deep Neural Networks

Anthony Leiva-Valverde; Cristina Meinhardt; Fabricio Elizondo-Fernández; Jorge Castro-Godínez; Luis G. León-Vega

arxiv: 2501.13379 · v2 · submitted 2025-01-23 · 💻 cs.AR · eess.SP

Pith reviewed 2026-05-23 05:38 UTC · model grok-4.3

keywords softmax approximation · Taylor series · FPGA implementation · approximate computing · deep neural networks · LeNet-5 · MobileNet v2 · resource efficiency

The pith

Taylor series approximations to softmax deliver up to 14% resource reduction on FPGAs with at most 0.2% accuracy loss in LeNet-5 and MobileNet v2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests Taylor series expansions and LUT-based interpolation as ways to simplify the exponential and division steps inside softmax so that the function fits on resource-limited FPGAs. Quadratic interpolation produces the smallest point-wise error, yet the first- and second-order Taylor versions run faster and consume fewer LUTs and DSP blocks. When these Taylor versions replace the exact softmax inside complete networks, the end-to-end top-1 accuracy falls by no more than 0.2 percent while total hardware resources drop by as much as 14 percent. A reader would care because the same networks could then run on cheaper FPGA boards without retraining or extra compensation logic.

Core claim

Approximate softmax designs that replace the exponential with first- or second-order Taylor polynomials achieve substantial trade-offs between accuracy and resource use: on LeNet-5 and MobileNet v2 they produce at most 0.2 percent accuracy degradation while cutting synthesized FPGA resources by up to 14 percent relative to an exact implementation.

What carries the argument

First- and second-order Taylor polynomials that approximate the exponential inside the softmax, replacing the full exp and division hardware with simple multiply-add circuits.
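
A minimal Python sketch of this idea, for illustration only. It assumes the polynomial is expanded around zero and applied directly to the logits; the paper's expansion point, input scaling, and fixed-point format are not stated on this page, and the function names (`taylor_exp`, `softmax_taylor`, `softmax_exact`) are hypothetical. The normalisation step still uses a floating-point division, so the sketch does not model how the hardware treats that step.

```python
import math


def taylor_exp(x, order):
    """Truncated Maclaurin series of exp(x): sum of x**k / k! for k = 0..order."""
    term, total = 1.0, 1.0
    for k in range(1, order + 1):
        term *= x / k  # builds x**k / k! incrementally with one multiply per term
        total += term
    return total


def softmax_exact(logits):
    """Reference softmax using math.exp, shifted by the maximum for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def softmax_taylor(logits, order):
    """Softmax with exp replaced by a Taylor polynomial (assumption: no range reduction)."""
    exps = [taylor_exp(v, order) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


if __name__ == "__main__":
    logits = [0.5, -0.2, 0.1]  # made-up values close to zero
    print(softmax_exact(logits))
    print(softmax_taylor(logits, 1))
    print(softmax_taylor(logits, 2))
```

For logits this close to zero the second-order output sits closer to the exact result than the first-order one; the sketch says nothing about behaviour on the logit ranges seen in the evaluated networks.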

If this is right

Quadratic LUT interpolation yields the smallest numerical error of the methods tested.
Taylor approximations give markedly lower execution latency and resource counts than either exact or interpolation versions.
The same approximations remain usable inside complete networks without retraining.
The approach supplies a concrete starting point for integrating approximate softmax into larger models such as LLMs.
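
The LUT-based alternative can be sketched the same way. The snippet below is a software illustration under stated assumptions, not the paper's design: the table range [-4, 0] is invented, the 64-sample size is borrowed from the Figure 2 caption, the quadratic variant uses a Lagrange parabola through three neighbouring samples, and all function names are hypothetical.

```python
import math


def build_exp_lut(lo, hi, samples):
    """Tabulate exp on a uniform grid of `samples` points over [lo, hi]."""
    step = (hi - lo) / (samples - 1)
    return [math.exp(lo + i * step) for i in range(samples)], step


def lut_exp_linear(x, lut, lo, step):
    """Linear interpolation between the two samples that bracket x."""
    n = len(lut)
    pos = min(max((x - lo) / step, 0.0), n - 1.0)  # clamp to the table range
    i = min(int(pos), n - 2)
    t = pos - i
    return lut[i] + t * (lut[i + 1] - lut[i])


def lut_exp_quadratic(x, lut, lo, step):
    """Parabola through three neighbouring samples (Lagrange form on nodes -1, 0, +1)."""
    n = len(lut)
    pos = min(max((x - lo) / step, 0.0), n - 1.0)  # clamp to the table range
    i = min(max(int(round(pos)), 1), n - 2)  # centre sample, kept off the table edges
    t = pos - i  # offset from the centre sample, in units of step
    y0, y1, y2 = lut[i - 1], lut[i], lut[i + 1]
    return y0 * t * (t - 1) / 2 + y1 * (1 - t * t) + y2 * t * (t + 1) / 2


if __name__ == "__main__":
    lo, hi = -4.0, 0.0  # assumed input range
    lut, step = build_exp_lut(lo, hi, 64)
    xs = [lo + 0.01 * k for k in range(401)]
    for name, fn in (("linear", lut_exp_linear), ("quadratic", lut_exp_quadratic)):
        worst = max(abs(fn(x, lut, lo, step) - math.exp(x)) for x in xs)
        print(name, worst)
```

On this assumed range the quadratic variant has the smaller worst-case error, at the cost of one extra table read and more multiplications per sample than the linear variant.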

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Taylor replacement could be applied to other layers that contain exponentials, such as certain attention mechanisms.
Measuring power draw rather than only LUT count would show whether the resource saving also reduces energy on battery-powered devices.
The reported 0.2 percent accuracy figure is for the specific datasets and training regimes used; different datasets might shift the observed degradation.

Load-bearing premise

Numerical error measured in isolation will appear directly as model-level accuracy loss without any retraining or other compensation.

What would settle it

Synthesize both the exact and the first-order Taylor softmax on the same FPGA target, insert each into LeNet-5 and MobileNet v2, and compare measured top-1 accuracy together with LUT and DSP counts.
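
The software half of that comparison can be sketched as a top-1 agreement check. Everything here is an assumption made for illustration: the logit vectors are invented, the optional max-subtraction is not confirmed by the paper, the arithmetic is floating point rather than fixed point, and the names (`top1_agreement`, `approx_scores`) are hypothetical. FPGA synthesis and the LUT and DSP counts are outside what Python can show.

```python
def taylor_exp(x, order):
    """Truncated Maclaurin series of exp(x) up to the given order."""
    term, total = 1.0, 1.0
    for k in range(1, order + 1):
        term *= x / k
        total += term
    return total


def top1(scores):
    """Index of the largest score (first index wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def approx_scores(logits, order, subtract_max):
    """Unnormalised approximate softmax scores.

    Normalising by a positive sum would not change the ranking,
    so the division is skipped here.
    """
    shift = max(logits) if subtract_max else 0.0
    return [taylor_exp(v - shift, order) for v in logits]


def top1_agreement(batch, order, subtract_max=True):
    """Fraction of logit vectors whose top-1 class survives the approximation.

    Exact softmax is monotonic, so its top-1 class is the argmax of the logits.
    """
    same = sum(top1(approx_scores(v, order, subtract_max)) == top1(v) for v in batch)
    return same / len(batch)


if __name__ == "__main__":
    # Hypothetical logits, not taken from LeNet-5 or MobileNet v2.
    batch = [[2.0, 1.0, 0.5], [4.0, 0.5, 3.5], [1.0, -2.5, 0.0]]
    for order in (1, 2):
        print(order, top1_agreement(batch, order))
```

In this sketch the second-order polynomial 1 + x + x²/2 reaches its minimum at x = -1 and exceeds 1 once x < -2, so a class more than two units below the maximum can overtake the true top-1 class. That is a property of the plain polynomial used here, not a result reported by the paper, and it is one reason the observed logit range matters.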

Figures

Figures reproduced from arXiv: 2501.13379 by Anthony Leiva-Valverde, Cristina Meinhardt, Fabricio Elizondo-Fernández, Jorge Castro-Godínez, Luis G. León-Vega.

**Figure 2.** Resource usage and execution time of the softmax accelerators based on 3rd-order Taylor and 64-sample Linear Interpolation processing a 1024 16-bit

read the original abstract

The softmax function is a widely used activation function in the output layers of neural networks, responsible for converting raw scores into class probabilities while introducing essential non-linearity. Implementing Softmax efficiently poses challenges on low-end FPGAs due to limited hardware resources and the computational complexity of exponential and division operations. This work evaluates approximate computing techniques for softmax acceleration using Taylor series and interpolation methods using Look-Up Tables (LUTs). These approximations aim to reduce execution time and resource consumption while maintaining acceptable levels of numerical precision. Our findings show that quadratic interpolation with LUTs yields the lowest numerical error. In contrast, Taylor-based approximations offer significantly better performance in terms of execution time and resource efficiency due to their computational simplicity. When applied to real-world deep learning models such as LeNet-5 and MobileNet v2, the first- and second-order Taylor approximations provided substantial trade-offs between accuracy and resource savings, achieving up to 0.2% accuracy degradation and 14% resource reduction compared to exact implementations. These results highlight the effectiveness of approximate Softmax designs on resource-constrained FPGAs and lay the groundwork for their integration into larger models, including large language models (LLMs).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Paper supplies concrete FPGA resource and accuracy numbers for standard Taylor and LUT softmax approximations on LeNet-5 and MobileNet v2, but leaves the critical logit-range check unaddressed.

This paper measures Taylor-series and LUT approximations to softmax on low-end FPGAs and plugs the results into LeNet-5 and MobileNet v2. The reported trade-off is the main takeaway: first- and second-order Taylor versions deliver up to 14% resource reduction with at most 0.2% accuracy loss on those two models. Hardware people targeting similar constraints can use those numbers as a starting point for their own designs. Quadratic LUT interpolation shows the smallest numerical error, while the Taylor options win on speed and area. The techniques are not new; they are standard in approximate computing. The paper's contribution is the side-by-side FPGA synthesis data on these specific networks.

The soft spot is the one flagged in the stress test. Taylor approximations to exp() lose accuracy fast once |x| exceeds a couple of units, yet the manuscript never reports the actual range of pre-softmax logits, any clipping or scaling, or whether the error was checked in isolation. Without that information the 0.2% figure cannot be trusted outside the exact experiments shown. The abstract also mentions LLMs but only tests small models.

For a reader working on resource-constrained edge inference the tables are still worth a look. The work is coherent enough to send for peer review; a referee can request the missing logit statistics and synthesis settings.

Referee Report

2 major / 2 minor

Summary. The paper evaluates approximate softmax implementations on FPGAs using first- and second-order Taylor series for the exponential and LUT-based interpolation methods. It reports that, when integrated into LeNet-5 and MobileNet v2, the Taylor approximations incur at most 0.2% end-to-end accuracy degradation while reducing resources by up to 14% relative to exact softmax, whereas LUT interpolation yields lower numerical error at a higher resource cost.

Significance. If the accuracy and resource numbers are reproducible under the reported synthesis settings, the work supplies concrete FPGA-specific trade-off data for a common DNN primitive. The focus on end-to-end model accuracy rather than isolated operator error is a constructive contribution to approximate computing for resource-constrained hardware.

major comments (2)

[Evaluation on DNN models] The central claim that first- and second-order Taylor approximations produce ≤0.2% accuracy loss on LeNet-5 and MobileNet v2 rests on the unstated premise that pre-softmax logits remain inside the region where truncation error is negligible. The manuscript does not report the observed logit range, any clipping or scaling, or the maximum |x| encountered during inference (see the evaluation sections on real-world models). Because the remainder term of the Taylor series for exp(x) grows rapidly for |x| > 2–3, this omission directly undermines attribution of the measured accuracy figures to the approximation alone.
[Experimental methodology] The paper assumes that the numerical error of the approximate softmax in isolation propagates directly to model-level accuracy loss. No results are shown for models that were retrained or compensated after substitution of the approximate operator; the experimental protocol should state whether the reported 0.2% figure was obtained with frozen weights or after any adaptation.
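
The logit-range concern in the first major comment can be made concrete with the standard Lagrange remainder for the exponential. The expansion point (zero) and the sample value x = 2 are illustrative assumptions; the manuscript's own expansion point and input scaling are not reported on this page.

```latex
% Truncated Maclaurin series of the exponential with Lagrange remainder
e^{x} = \sum_{k=0}^{n} \frac{x^{k}}{k!} + R_n(x),
\qquad
R_n(x) = \frac{e^{\xi}\, x^{n+1}}{(n+1)!}
\quad \text{for some } \xi \text{ between } 0 \text{ and } x .

% Worked values at x = 2, where e^{2} \approx 7.389
T_1(2) = 1 + 2 = 3,
\qquad \frac{e^{2} - T_1(2)}{e^{2}} \approx 0.59
\\
T_2(2) = 1 + 2 + \tfrac{2^{2}}{2} = 5,
\qquad \frac{e^{2} - T_2(2)}{e^{2}} \approx 0.32
```

At x = 0.5 the same relative errors are roughly 0.09 and 0.014, which shows how strongly the truncation error depends on where the logits fall.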

minor comments (2)

The abstract contrasts quadratic LUT interpolation (lowest numerical error) with Taylor approximations (better execution time and resources); the main text should include a single table that directly juxtaposes both error and resource metrics for all methods on the same FPGA target.
Synthesis settings (target device, clock frequency, tool version) and the precise definition of the 14% resource reduction (LUTs, DSPs, or BRAMs) should be stated explicitly in the hardware implementation section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the evaluation of the approximate softmax implementations. We address each major comment below.

Referee: [Evaluation on DNN models] The central claim that first- and second-order Taylor approximations produce ≤0.2% accuracy loss on LeNet-5 and MobileNet v2 rests on the unstated premise that pre-softmax logits remain inside the region where truncation error is negligible. The manuscript does not report the observed logit range, any clipping or scaling, or the maximum |x| encountered during inference (see the evaluation sections on real-world models). Because the remainder term of the Taylor series for exp(x) grows rapidly for |x| > 2–3, this omission directly undermines attribution of the measured accuracy figures to the approximation alone.

Authors: We agree that the observed logit ranges are relevant for assessing the Taylor approximations. The revised manuscript will include the maximum |x| values and logit statistics measured during inference on LeNet-5 and MobileNet v2. These data will be added to the evaluation sections to support attribution of the accuracy results to the approximations. revision: yes
Referee: [Experimental methodology] The paper assumes that the numerical error of the approximate softmax in isolation propagates directly to model-level accuracy loss. No results are shown for models that were retrained or compensated after substitution of the approximate operator; the experimental protocol should state whether the reported 0.2% figure was obtained with frozen weights or after any adaptation.

Authors: The reported accuracy figures were obtained with the original pre-trained models using frozen weights and no retraining. This isolates the effect of the approximate operator. The revised manuscript will explicitly state the frozen-weight protocol in the experimental methodology section. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with direct measurements; no circular derivations

The paper is an empirical evaluation of approximate softmax implementations (Taylor series and LUT interpolation) on FPGA hardware for models such as LeNet-5 and MobileNet v2. All reported numbers (accuracy degradation of at most 0.2%, resource reduction of up to 14%) are obtained from direct synthesis, timing, and inference measurements rather than from any first-principles derivation or fitted prediction. No equations are presented that reduce a claimed result to its own inputs by construction, and no load-bearing self-citations or uniqueness theorems appear in the provided text. The work therefore contains no circular steps of the enumerated kinds.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The paper performs an empirical hardware evaluation rather than a first-principles derivation, so the ledger contains only the approximation orders and LUT configurations chosen by the authors.

free parameters (2)

Taylor approximation order
First- and second-order truncations are selected; the choice directly controls the accuracy-resource trade-off reported.
LUT interpolation degree and size
Quadratic interpolation table dimensions are design parameters that determine both error and memory footprint.
