A Quantitative Evaluation of Approximate Softmax Functions for Deep Neural Networks
Pith reviewed 2026-05-23 05:38 UTC · model grok-4.3
The pith
Taylor series approximations to softmax deliver up to 14% resource reduction on FPGAs with at most 0.2% accuracy loss in LeNet-5 and MobileNet v2.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Approximate softmax designs that replace the exponential with first- or second-order Taylor polynomials achieve substantial trade-offs between accuracy and resource use: on LeNet-5 and MobileNet v2 they produce at most 0.2 percent accuracy degradation while cutting synthesized FPGA resources by up to 14 percent relative to an exact implementation.
What carries the argument
First- and second-order Taylor polynomials that approximate the exponential inside the softmax, replacing the full exp and division hardware with simple multiply-add circuits.
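To make the substitution concrete, here is a minimal software sketch, not the paper's hardware design. It assumes the series is expanded around zero and, for simplicity, keeps an ordinary floating-point division for normalization. The names `taylor_exp`, `softmax_exact`, `softmax_taylor` and the sample logits are hypothetical.

```python
import math

def taylor_exp(x, order):
    """Truncated series of exp(x) around 0 (assumed expansion point)."""
    if order == 1:
        return 1.0 + x
    if order == 2:
        return 1.0 + x + 0.5 * x * x
    raise ValueError("order must be 1 or 2")

def softmax_exact(logits):
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_taylor(logits, order):
    # Only exp is approximated here; the division is kept as-is,
    # which is a simplification of this sketch.
    terms = [taylor_exp(v, order) for v in logits]
    total = sum(terms)
    return [t / total for t in terms]

logits = [0.5, -0.2, 0.1]  # made-up logits close to zero
exact = softmax_exact(logits)
for order in (1, 2):
    approx = softmax_taylor(logits, order)
    max_err = max(abs(a - e) for a, e in zip(approx, exact))
    print(f"order {order}: max abs probability error {max_err:.4f}")
```

For logits this close to zero the second-order version tracks the exact probabilities more closely than the first-order one. The gap widens quickly as the inputs move away from the expansion point.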
If this is right
- Quadratic LUT interpolation yields the smallest numerical error of the methods tested.
- Taylor approximations give markedly lower execution latency and resource counts than either exact or interpolation versions.
- The same approximations remain usable inside complete networks without retraining.
- The approach supplies a concrete starting point for integrating approximate softmax into larger models such as LLMs.
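The numerical-error ranking in the first bullet can be illustrated with a small sketch of quadratic interpolation over a look-up table. The table range, the number of segments, and the three-point Lagrange scheme are all assumptions chosen for illustration; the paper's actual LUT layout is not described here. Resource cost is not modelled at all.

```python
import math

LO, HI, SEGMENTS = -2.0, 2.0, 16   # hypothetical table range and size
STEP = (HI - LO) / SEGMENTS
TABLE = [math.exp(LO + i * STEP) for i in range(SEGMENTS + 1)]

def exp_lut_quadratic(x):
    """Three-point (quadratic) interpolation of exp from TABLE."""
    x = min(max(x, LO), HI)        # clamp to the tabulated range
    # Nearest node, kept one entry away from either end of the table
    i = min(max(round((x - LO) / STEP), 1), SEGMENTS - 1)
    t = (x - (LO + i * STEP)) / STEP   # offset from the centre node, in steps
    y0, y1, y2 = TABLE[i - 1], TABLE[i], TABLE[i + 1]
    return y1 + 0.5 * t * (y2 - y0) + 0.5 * t * t * (y2 - 2.0 * y1 + y0)

def exp_taylor2(x):
    """Second-order series around 0, for comparison."""
    return 1.0 + x + 0.5 * x * x

grid = [LO + k * (HI - LO) / 400 for k in range(401)]
lut_err = max(abs(exp_lut_quadratic(x) - math.exp(x)) for x in grid)
taylor_err = max(abs(exp_taylor2(x) - math.exp(x)) for x in grid)
print(f"max abs error: LUT {lut_err:.4f}, Taylor-2 {taylor_err:.4f}")
```

Over this range the interpolated table stays far closer to exp than the fixed polynomial does, at the price of storing 17 table entries and computing an index. That is the kind of accuracy-versus-resources tension the bullets describe.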
Where Pith is reading between the lines
- The same Taylor replacement could be applied to other layers that contain exponentials, such as certain attention mechanisms.
- Measuring power draw rather than only LUT count would show whether the resource saving also reduces energy on battery-powered devices.
- The reported 0.2 percent accuracy figure is for the specific datasets and training regimes used; different datasets might shift the observed degradation.
Load-bearing premise
Numerical error measured in isolation will appear directly as model-level accuracy loss without any retraining or other compensation.
What would settle it
Synthesize both the exact and the first-order Taylor softmax on the same FPGA target, insert each into LeNet-5 and MobileNet v2, and compare measured top-1 accuracy together with LUT and DSP counts.
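A software-only proxy for the accuracy half of that experiment is to count how often the approximate operator picks the same top-1 class as the exact one. The sketch below uses uniformly random synthetic logits as a stand-in for real network outputs, which is a strong assumption; `top1_agreement` and its parameters are hypothetical. Because exp is strictly increasing, the exact softmax never changes the argmax of the logits, whereas 1 + x + x²/2 decreases for x < -1 and therefore can.

```python
import math
import random

def taylor2(x):
    # Second-order series around 0; always positive, minimum 0.5 at x = -1
    return 1.0 + x + 0.5 * x * x

def top1(scores):
    return max(range(len(scores)), key=lambda i: scores[i])

def top1_agreement(scale, trials=2000, classes=10, seed=0):
    """Fraction of synthetic inputs where exact and Taylor top-1 match."""
    rng = random.Random(seed)
    same = 0
    for _ in range(trials):
        logits = [rng.uniform(-scale, scale) for _ in range(classes)]
        # Normalization is skipped: dividing by a positive sum
        # does not change which entry is largest.
        exact = [math.exp(v) for v in logits]
        approx = [taylor2(v) for v in logits]
        same += top1(exact) == top1(approx)
    return same / trials

for scale in (0.5, 2.0, 4.0):
    print(scale, top1_agreement(scale))
```

With logits confined to [-0.5, 0.5] the polynomial is monotonic and agreement is complete. Wider ranges allow strongly negative logits to outrank the true maximum. Real logit distributions may behave quite differently from this uniform model.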
Original abstract
The softmax function is a widely used activation function in the output layers of neural networks, responsible for converting raw scores into class probabilities while introducing essential non-linearity. Implementing Softmax efficiently poses challenges on low-end FPGAs due to limited hardware resources and the computational complexity of exponential and division operations. This work evaluates approximate computing techniques for softmax acceleration using Taylor series and interpolation methods using Look-Up Tables (LUTs). These approximations aim to reduce execution time and resource consumption while maintaining acceptable levels of numerical precision. Our findings show that quadratic interpolation with LUTs yields the lowest numerical error. In contrast, Taylor-based approximations offer significantly better performance in terms of execution time and resource efficiency due to their computational simplicity. When applied to real-world deep learning models such as LeNet-5 and MobileNet v2, the first- and second-order Taylor approximations provided substantial trade-offs between accuracy and resource savings, achieving up to 0.2% accuracy degradation and 14% resource reduction compared to exact implementations. These results highlight the effectiveness of approximate Softmax designs on resource-constrained FPGAs and lay the groundwork for their integration into larger models, including large language models (LLMs).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates approximate softmax implementations on FPGAs using first- and second-order Taylor series for the exponential and LUT-based interpolation methods. It reports that, when integrated into LeNet-5 and MobileNet v2, Taylor approximations incur at most 0.2% end-to-end accuracy degradation and reduce resources by up to 14% relative to exact softmax, while LUT interpolation yields lower numerical error but higher resource cost.
Significance. If the accuracy and resource numbers are reproducible under the reported synthesis settings, the work supplies concrete FPGA-specific trade-off data for a common DNN primitive. The focus on end-to-end model accuracy rather than isolated operator error is a constructive contribution to approximate computing for resource-constrained hardware.
major comments (2)
- [Evaluation on DNN models] The central claim that first- and second-order Taylor approximations produce ≤0.2% accuracy loss on LeNet-5 and MobileNet v2 rests on the unstated premise that pre-softmax logits remain inside the region where truncation error is negligible. The manuscript does not report the observed logit range, any clipping or scaling, or the maximum |x| encountered during inference (see the evaluation sections on real-world models). Because the remainder term of the Taylor series for exp(x) grows rapidly for |x| > 2–3, this omission directly undermines attribution of the measured accuracy figures to the approximation alone.
- [Experimental methodology] The paper assumes that the numerical error of the approximate softmax in isolation propagates directly to model-level accuracy loss. No results are shown for models that were retrained or compensated after substitution of the approximate operator; the experimental protocol should state whether the reported 0.2% figure was obtained with frozen weights or after any adaptation.
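The first major comment's point about the remainder term can be written out explicitly. The block below is the standard Lagrange form of the remainder, assuming the series is expanded around zero with no range reduction; the manuscript may use a different expansion point or input scaling.

```latex
e^{x} = \sum_{k=0}^{n} \frac{x^{k}}{k!} + R_n(x),
\qquad
R_n(x) = \frac{e^{\xi}}{(n+1)!}\, x^{n+1}
\quad \text{for some } \xi \text{ between } 0 \text{ and } x,
\qquad
|R_n(x)| \le \frac{e^{\max(0,\,x)}}{(n+1)!}\, |x|^{n+1}.
```

As a worked instance with n = 2: at x = 1 the polynomial gives 2.5 against e ≈ 2.718, a relative error of about 8%. At x = 3 it gives 8.5 against e³ ≈ 20.09, a relative error of about 58%. This is why the observed logit range matters for interpreting the reported accuracy figures.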
minor comments (2)
- The abstract contrasts quadratic LUT interpolation (lowest numerical error) with Taylor approximations (better execution time and resources); the main text should include a single table that directly juxtaposes both error and resource metrics for all methods on the same FPGA target.
- Synthesis settings (target device, clock frequency, tool version) and the precise definition of the 14% resource reduction (LUTs, DSPs, or BRAMs) should be stated explicitly in the hardware implementation section.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the evaluation of the approximate softmax implementations. We address each major comment below.
Point-by-point responses
- Referee: [Evaluation on DNN models] The central claim that first- and second-order Taylor approximations produce ≤0.2% accuracy loss on LeNet-5 and MobileNet v2 rests on the unstated premise that pre-softmax logits remain inside the region where truncation error is negligible. The manuscript does not report the observed logit range, any clipping or scaling, or the maximum |x| encountered during inference (see the evaluation sections on real-world models). Because the remainder term of the Taylor series for exp(x) grows rapidly for |x| > 2–3, this omission directly undermines attribution of the measured accuracy figures to the approximation alone.
  Authors: We agree that the observed logit ranges are relevant for assessing the Taylor approximations. The revised manuscript will include the maximum |x| values and logit statistics measured during inference on LeNet-5 and MobileNet v2. These data will be added to the evaluation sections to support attribution of the accuracy results to the approximations.
  Revision: yes
- Referee: [Experimental methodology] The paper assumes that the numerical error of the approximate softmax in isolation propagates directly to model-level accuracy loss. No results are shown for models that were retrained or compensated after substitution of the approximate operator; the experimental protocol should state whether the reported 0.2% figure was obtained with frozen weights or after any adaptation.
  Authors: The reported accuracy figures were obtained with the original pre-trained models using frozen weights and no retraining. This isolates the effect of the approximate operator. The revised manuscript will explicitly state the frozen-weight protocol in the experimental methodology section.
  Revision: yes
Circularity Check
Empirical benchmark with direct measurements; no circular derivations
full rationale
The paper is an empirical evaluation of approximate softmax implementations (Taylor series and LUT interpolation) on FPGA hardware for models like LeNet-5 and MobileNet v2. All reported numbers (accuracy degradation ≤0.2%, resource reduction 14%) are obtained from direct synthesis, timing, and inference measurements rather than from any first-principles derivation or fitted prediction. No equations are presented that reduce a claimed result to its own inputs by construction, and no load-bearing self-citations or uniqueness theorems appear in the provided text. The work therefore contains no circular steps of the enumerated kinds.
Axiom & Free-Parameter Ledger
free parameters (2)
- Taylor approximation order
- LUT interpolation degree and size