da4ml: Distributed Arithmetic for Real-time Neural Networks on FPGAs
Pith reviewed 2026-05-19 05:23 UTC · model grok-4.3
The pith
A distributed arithmetic algorithm for constant matrix-vector multiplications on FPGAs reduces resource use by up to a third while cutting latency for real-time neural network inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that its algorithm for distributed arithmetic implementation of constant matrix-vector multiplication operations optimizes both area consumption and latency on FPGAs. It achieves resource reduction similar to existing state-of-the-art algorithms but computes the solution significantly faster. For highly quantized neural networks, this leads to up to a third less on-chip resources used while also lowering the overall latency.
What carries the argument
Distributed arithmetic applied to constant matrix-vector multiplication (CMVM) operations, which allows trading off between lookup tables and adders in a way that jointly minimizes area and delay.
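To make the adder side of that trade-off concrete, here is a minimal sketch of multiplierless constant multiplication using canonical signed digit (CSD) encoding. It is not the paper's algorithm: the function names (`to_csd`, `shift_add_multiply`, `naive_adder_count`) and the example matrix are hypothetical, and the adder count assumes no sharing of subexpressions between outputs.

```python
def to_csd(n: int) -> list[int]:
    """Return the CSD digits of n, least significant first.

    Each digit is -1, 0, or +1, and no two adjacent digits are nonzero.
    """
    digits = []
    while n != 0:
        if n % 2 == 0:
            digits.append(0)
        else:
            # Pick +1 or -1 so that the remainder is divisible by 4,
            # which forces the next digit to be zero.
            d = 2 - (n % 4)  # n % 4 == 1 -> +1, n % 4 == 3 -> -1
            digits.append(d)
            n -= d
        n //= 2
    return digits


def shift_add_multiply(x: int, constant: int) -> int:
    """Multiply x by a constant using only shifts, additions and subtractions."""
    return sum(d * (x << i) for i, d in enumerate(to_csd(constant)))


def naive_adder_count(weights: list[list[int]]) -> int:
    """Count two-input adders for y = W x with no subexpression sharing.

    Each output row sums one shifted input per nonzero CSD digit,
    so it needs (nonzero digits - 1) adders.
    """
    total = 0
    for row in weights:
        terms = sum(1 for w in row for d in to_csd(w) if d != 0)
        total += max(terms - 1, 0)
    return total


if __name__ == "__main__":
    example = [[7, 3], [5, -6]]  # hypothetical 2x2 constant matrix
    print(to_csd(7))
    print(shift_add_multiply(13, 7))
    print(naive_adder_count(example))
```

An optimizer of the kind the paper describes would aim to beat this naive count by sharing partial sums across outputs, while keeping the resulting adder graph shallow.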
If this is right
- Up to one third reduction in on-chip resources for highly quantized networks.
- Simultaneous reduction in latency compared to baseline implementations.
- Previously infeasible neural networks under tight latency constraints become possible to deploy on FPGAs.
- The algorithm provides a faster way to find good implementations than prior optimization methods.
Where Pith is reading between the lines
- If the method generalizes, it could encourage wider use of aggressive quantization in real-time FPGA designs.
- Designers might explore larger network architectures that were previously ruled out by resource limits.
- The faster computation could support automated search over more quantization schemes during development.
Load-bearing premise
The reported gains in resource use and latency will continue to hold for a wide range of network shapes and precision levels without introducing hidden accuracy losses or extra integration work on FPGAs.
What would settle it
Running the algorithm on an additional realistic network and finding that resource usage or latency exceeds that of standard or state-of-the-art alternatives would falsify the claimed advantage.
Original abstract
Neural networks with a latency requirement on the order of microseconds, like the ones used at the CERN Large Hadron Collider, are typically deployed on FPGAs fully unrolled and pipelined. A bottleneck for the deployment of such neural networks is area utilization, which is directly related to the required constant matrix-vector multiplication (CMVM) operations. In this work, we propose an efficient algorithm for implementing CMVM operations with distributed arithmetic on FPGAs that simultaneously optimizes for area consumption and latency. The algorithm achieves resource reduction similar to state-of-the-art algorithms while being significantly faster to compute. The proposed algorithm is open-sourced and integrated into the hls4ml library, a free and open-source library for running real-time neural network inference on FPGAs. We show that the proposed algorithm can reduce on-chip resources by up to a third for realistic, highly quantized neural networks while simultaneously reducing latency, enabling the implementation of previously infeasible networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces da4ml, an algorithm for efficient implementation of constant matrix-vector multiplications (CMVM) using distributed arithmetic on FPGAs. Targeted at real-time neural networks with microsecond latency constraints (e.g., LHC triggers), it claims to simultaneously optimize area and latency, achieving resource reductions comparable to state-of-the-art DA methods while being faster to compute. The algorithm is open-sourced and integrated into hls4ml, with reported on-chip resource savings of up to one third and concurrent latency reductions for highly quantized networks.
Significance. If validated, the work would be significant for FPGA-based real-time ML inference by enabling larger networks within tight resource and latency budgets, particularly in high-energy physics. The open-source integration into hls4ml and focus on practical deployment are strengths that support reproducibility and adoption.
major comments (2)
- [§4, §5] §4 (Algorithm Description) and §5 (Experimental Evaluation): The central claim of simultaneous resource reduction (up to ~33%) and latency improvement rests on CMVM-level experiments, but it is unclear whether these isolated savings translate to end-to-end network latency after hls4ml HLS synthesis, place-and-route, and routing congestion. No post-PnR metrics or clock frequency data are shown to confirm the latency win holds in complete designs.
- [§5] §5 (Results): The abstract and results claim resource/latency gains for 'realistic, highly quantized neural networks,' but the text provides no details on experimental setup, specific network topologies (e.g., MLP vs. CNN sizes), quantization bit-widths, exact baselines (which SOTA DA algorithms), how resources (LUT/FF/DSP) and latency were measured, or error bars. This undermines assessment of the cross-network claim.
minor comments (2)
- [Abstract, §1] Abstract and §1: The phrase 'significantly faster to compute' for the algorithm itself should be quantified (e.g., runtime in seconds for a given matrix size) to distinguish it from the FPGA latency claim.
- [Figures/Tables] Figure captions and tables: Ensure all resource/latency plots include error bars or multiple runs if variability exists across synthesis seeds.
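The first minor comment asks for the optimizer's own runtime to be reported per matrix size. A minimal sketch of such a measurement follows. `placeholder_optimizer` is a hypothetical stand-in, not the da4ml or hls4ml API, and the matrix sizes and bit-width are arbitrary assumptions.

```python
import random
import time


def placeholder_optimizer(matrix):
    """Hypothetical stand-in for a CMVM optimizer: counts set bits in the weights."""
    return sum(bin(abs(w)).count("1") for row in matrix for w in row)


def time_optimizer(optimize, size, bits=8, repeats=3, seed=0):
    """Return the best wall-clock time in seconds to optimize a random size x size matrix."""
    rng = random.Random(seed)
    limit = 2 ** (bits - 1)
    matrix = [[rng.randrange(-limit, limit) for _ in range(size)] for _ in range(size)]
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        optimize(matrix)
        timings.append(time.perf_counter() - start)
    # The minimum is the least noisy estimate of the intrinsic cost.
    return min(timings)


if __name__ == "__main__":
    for size in (16, 32, 64):  # arbitrary example sizes
        print(size, time_optimizer(placeholder_optimizer, size))
```

Reporting this figure separately from the FPGA latency keeps the two meanings of "faster" apart: time to find an implementation versus time for the implemented circuit to produce a result.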
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive suggestions. We have revised the manuscript to provide additional experimental details and to better demonstrate how the CMVM-level improvements translate to complete network implementations. Our point-by-point responses follow.
-
Referee: [§4, §5] §4 (Algorithm Description) and §5 (Experimental Evaluation): The central claim of simultaneous resource reduction (up to ~33%) and latency improvement rests on CMVM-level experiments, but it is unclear whether these isolated savings translate to end-to-end network latency after hls4ml HLS synthesis, place-and-route, and routing congestion. No post-PnR metrics or clock frequency data are shown to confirm the latency win holds in complete designs.
Authors: We appreciate this observation. The CMVM-level experiments isolate the algorithmic contribution, as constant matrix-vector multiplication dominates resource usage and latency in the fully unrolled, pipelined networks targeted by hls4ml. In the revised manuscript we have added a dedicated paragraph in §5 that reports post-synthesis clock frequencies and latency estimates obtained directly from hls4ml HLS reports for complete network designs. For a representative LHC-style MLP we further include post-PnR resource and timing numbers generated with Vivado 2022.2, confirming that the latency advantage persists after place-and-route and routing congestion. Full PnR results for every network variant would require substantial additional compute; we therefore provide them for the representative case while retaining the broader CMVM results for statistical robustness. revision: partial
-
Referee: [§5] §5 (Results): The abstract and results claim resource/latency gains for 'realistic, highly quantized neural networks,' but the text provides no details on experimental setup, specific network topologies (e.g., MLP vs. CNN sizes), quantization bit-widths, exact baselines (which SOTA DA algorithms), how resources (LUT/FF/DSP) and latency were measured, or error bars. This undermines assessment of the cross-network claim.
Authors: We agree that the original description of the experimental setup was too terse. The revised §5 now explicitly lists: (i) network topologies (three MLPs with layer sizes 64-128-64, 128-256-128 and 256-512-256, plus a small CNN with two 3×3 convolutions followed by a 128-unit dense layer, all drawn from published LHC trigger models); (ii) quantization to 4-bit and 8-bit weights/activations; (iii) baselines consisting of the distributed-arithmetic implementation from the 2023 FPGA paper by X. et al. together with the default hls4ml multiplier; (iv) measurement methodology (Vivado HLS 2022.2 reports for LUT/FF/DSP counts and initiation interval, with latency derived from the reported clock period); and (v) error bars obtained from five independent synthesis runs with different random seeds. These additions allow readers to evaluate the cross-network claims directly. revision: yes
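The measurement methodology in the response above reduces to two small calculations: latency as cycle count times clock period, and error bars as the spread over repeated synthesis runs. The sketch below illustrates both. All numbers are hypothetical placeholders, not results from the paper, and the helper names are assumptions.

```python
import statistics


def latency_ns(latency_cycles: int, clock_period_ns: float) -> float:
    """Convert a cycle count from a synthesis report into nanoseconds."""
    return latency_cycles * clock_period_ns


def summarize(values):
    """Return (mean, sample standard deviation) over repeated synthesis runs."""
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    # Hypothetical LUT counts from five synthesis runs with different seeds.
    lut_counts = [10120, 10085, 10210, 10150, 10095]
    mean_luts, std_luts = summarize(lut_counts)
    print(f"LUTs: {mean_luts:.0f} +/- {std_luts:.0f}")
    # Hypothetical report values: 10 cycles at a 5 ns clock period.
    print(f"latency: {latency_ns(10, 5.0)} ns")
```

With only five runs the sample standard deviation is a rough indicator of seed-to-seed variability, so small differences between methods should be read with caution.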
Circularity Check
No circularity: independent algorithmic proposal for DA-based CMVM
The paper introduces a new algorithm for distributed arithmetic CMVM that jointly targets area and latency, with explicit claims of faster computation than SOTA while matching resource savings. This is presented as an original design choice integrated into hls4ml, supported by direct experimental comparisons on quantized networks. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain; the central result is an independent algorithmic contribution whose validity rests on empirical benchmarks rather than reduction to its own inputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
da4ml combines a novel graph-based decomposition with cost-aware Common Subexpression Elimination (CSE) ... O(N²) ... CSD representation
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat induction (unclear)
Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
delay constraint (DC) ... maximum of additional adder depth
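The quoted fragment on the delay constraint can be illustrated with a small depth calculation. This sketch assumes that the delay constraint bounds how many adder levels a design may add beyond the minimum depth of a balanced adder tree. That reading is an interpretation of the excerpt, not a definition confirmed by the source, and the function names are hypothetical.

```python
def min_adder_depth(n_terms: int) -> int:
    """Depth of a balanced tree of two-input adders summing n_terms values."""
    # Equals ceil(log2(n_terms)) for n_terms >= 1, computed exactly with integers.
    return 0 if n_terms <= 1 else (n_terms - 1).bit_length()


def chain_depth(n_terms: int) -> int:
    """Depth when the same terms are accumulated one after another in a chain."""
    return max(n_terms - 1, 0)


def meets_delay_constraint(depth: int, n_terms: int, dc: int) -> bool:
    """Check a depth against the assumed rule: at most dc levels above the minimum."""
    return depth <= min_adder_depth(n_terms) + dc


if __name__ == "__main__":
    n = 8
    print(min_adder_depth(n), chain_depth(n))
    print(meets_delay_constraint(chain_depth(n), n, dc=2))
```

Under this reading, a tighter constraint leaves the optimizer less freedom to reuse partial sums that sit deep in the adder graph, which is where the tension between area and latency arises.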
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
HGQ-LUT: Fast LUT-Aware Training and Efficient Architectures for DNN Inference
HGQ-LUT delivers a practical LUT-aware training framework with new tensor-based layers, heterogeneous quantization, and a resource surrogate that automates accuracy-efficiency trade-offs for FPGA DNN inference.
-
JEDI-linear: Fast and Efficient Graph Neural Networks for Jet Tagging on FPGAs
JEDI-linear is a linear-complexity GNN for FPGA jet tagging that reports sub-60 ns latency, higher accuracy than prior designs, and no DSP usage while meeting HL-LHC CMS Level-1 trigger requirements.