
arxiv: 2507.04535 · v2 · submitted 2025-07-06 · 💻 cs.AR · cs.LG · hep-ex

da4ml: Distributed Arithmetic for Real-time Neural Networks on FPGAs

Pith reviewed 2026-05-19 05:23 UTC · model grok-4.3

classification 💻 cs.AR · cs.LG · hep-ex
keywords distributed arithmetic · FPGA deployment · neural network inference · constant matrix-vector multiplication · resource optimization · latency reduction · quantized networks

The pith

A distributed arithmetic algorithm for constant matrix-vector multiplications on FPGAs reduces resource use by up to a third while cutting latency for real-time neural network inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a new algorithm that applies distributed arithmetic to the constant matrix-vector multiplications at the core of neural networks deployed fully unrolled on FPGAs. It targets the area bottleneck that limits how large or complex such networks can be when latency must stay on the order of microseconds. The practical appeal is that network designs that previously did not fit on the available hardware could be deployed without added latency. The approach claims resource savings comparable to prior methods, while the optimization itself is much faster to compute, which matters for rapid design iteration. Tests on realistic quantized networks show reduced area together with reduced latency.

Core claim

The paper claims that its algorithm for distributed arithmetic implementation of constant matrix-vector multiplication operations optimizes both area consumption and latency on FPGAs. It achieves resource reduction similar to existing state-of-the-art algorithms but computes the solution significantly faster. For highly quantized neural networks, this leads to up to a third less on-chip resources used while also lowering the overall latency.

What carries the argument

Distributed arithmetic applied to constant matrix-vector multiplication (CMVM) operations: each constant matrix is implemented as a multiplierless adder graph in which repeated two-term subexpressions are shared, with the aim of reducing area and delay together.
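To make the idea concrete, here is a minimal Python sketch of a multiplierless CMVM. It is not the da4ml implementation. It assumes integer weights and a canonical signed-digit (CSD) recoding, and the names `to_csd` and `cmvm_shift_add` are hypothetical helpers introduced only for illustration. Each constant is rewritten as a sum of signed powers of two, so every output needs only shifts, negations, and additions.

```python
def to_csd(value: int) -> list[tuple[int, int]]:
    """Return (sign, shift) pairs such that value == sum(sign * 2**shift)."""
    digits = []
    shift = 0
    while value != 0:
        if value & 1:
            # Pick +1 or -1 so the remainder becomes divisible by 4;
            # this guarantees no two adjacent nonzero digits (CSD property).
            digit = 2 - (value % 4)  # value % 4 == 1 -> +1, == 3 -> -1
            digits.append((digit, shift))
            value -= digit
        value >>= 1
        shift += 1
    return digits


def cmvm_shift_add(x: list[int], matrix: list[list[int]]) -> tuple[list[int], int]:
    """Compute y^T = x^T M using shifts and adds only; also count two-input adders."""
    d_in, d_out = len(matrix), len(matrix[0])
    outputs, adders = [], 0
    for j in range(d_out):
        # One shifted (and possibly negated) copy of x[i] per nonzero digit.
        terms = [
            sign * (x[i] << shift)  # sign is +1 or -1, i.e. a negation, not a multiplier
            for i in range(d_in)
            for sign, shift in to_csd(matrix[i][j])
        ]
        outputs.append(sum(terms))
        adders += max(len(terms) - 1, 0)  # k terms need k - 1 adders
    return outputs, adders


if __name__ == "__main__":
    M = [[7, 3], [5, -3]]  # toy constant matrix, chosen for illustration
    print(cmvm_shift_add([2, 3], M))
```

In this naive form every output column is summed independently. The savings the page describes would come from sharing partial sums between columns, which this sketch deliberately leaves out.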

If this is right

  • Up to one third reduction in on-chip resources for highly quantized networks.
  • Simultaneous reduction in latency compared to baseline implementations.
  • Previously infeasible neural networks under tight latency constraints become possible to deploy on FPGAs.
  • The algorithm provides a faster way to find good implementations than prior optimization methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the method generalizes, it could encourage wider use of aggressive quantization in real-time FPGA designs.
  • Designers might explore larger network architectures that were previously ruled out by resource limits.
  • The faster computation could support automated search over more quantization schemes during development.

Load-bearing premise

The reported gains in resource use and latency will continue to hold for a wide range of network shapes and precision levels without introducing hidden accuracy losses or extra integration work on FPGAs.

What would settle it

Running the algorithm on an additional realistic network and finding that resource usage or latency exceeds that of standard or state-of-the-art alternatives would falsify the claimed advantage.

Figures

Figures reproduced from arXiv: 2507.04535 by Chang Sun, Maria Spiropulu, Vladimir Loncar, Wayne Luk, Zhiqiang Que.

Figure 1: Overview of the proposed da4ml automatic optimization flow for CMVM on FPGAs. The algorithm first decomposes the …
Figure 2: An example of the graph constructed from the constant matrix …
Figure 3: An example of the second stage of the da4ml algorithm. The algorithm identifies three two-term subexpressions, shown in the bounding boxes with the same color. For representation purposes, the matrix shown is a transposed matrix (i.e., $\vec{y} = M\vec{x}$ instead of $\vec{y}^{T} = \vec{x}^{T} M$). The first subexpression $x_0 + x_3$ has the highest frequency and is implemented first, followed by the other two subexpressions. Frequency we…
Figure 4: An example of the adder graphs implementing the H.264 constant matrix before and after the optimization. The original …
Figure 5: The workflow of using the da4ml library with the hls4ml library. The da4ml library generates the optimized adder tree for the …
Figure 6: Computation time of the da4ml algorithm on random matrices with different sizes. The asymptotic complexity is $O(N^2 \cdot \log(N)^2)$, where $N \sim d_{in} \cdot d_{out} \cdot bw_M$. The $\log(N)^2$ factor was found empirically. We show the post-synthesis results of the proposed da4ml algorithm on random matrices with 8 bits and 4 bits, shown in …
Figure 7: The architecture of the SVHN classification network […]
Figure 8: The architecture of the Muon Tracking network […]
Figure 9: The architecture of the particle-based jet tagging network […]
Figure 10: The workflow of the standalone workflow with …
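The Figure 3 caption describes picking the most frequent two-term subexpression and implementing it first. The sketch below shows that greedy idea in Python under simplifying assumptions: all coefficients are ±1 (no shifts), frequency is a plain occurrence count, and ties are broken by first occurrence. The function names and the toy rows are hypothetical and are not taken from the paper or from the da4ml code.

```python
from collections import Counter
from itertools import combinations


def greedy_two_term_cse(rows):
    """rows: list of {term_name: +1 or -1}. Returns (definitions, rewritten_rows)."""
    rows = [dict(r) for r in rows]
    definitions = {}  # new_name -> (a, b, rel), meaning new_name = a + rel * b
    while True:
        counts = Counter()
        for row in rows:
            for a, b in combinations(sorted(row), 2):
                # The relative sign makes a + b and -(a + b) count as the same pattern.
                counts[(a, b, row[a] * row[b])] += 1
        if not counts:
            break
        (a, b, rel), freq = counts.most_common(1)[0]
        if freq < 2:  # sharing a pattern used only once saves nothing
            break
        name = f"t{len(definitions)}"
        definitions[name] = (a, b, rel)
        for row in rows:
            if a in row and b in row and row[a] * row[b] == rel:
                sign = row.pop(a)  # sa*a + sb*b == sa * (a + rel*b)
                row.pop(b)
                row[name] = sign
    return definitions, rows


def count_adders(definitions, rows):
    """One adder per shared subexpression plus k - 1 adders per k-term row."""
    return len(definitions) + sum(max(len(r) - 1, 0) for r in rows)


def evaluate(definitions, rows, inputs):
    """Evaluate the rewritten rows; definitions are stored in dependency order."""
    values = dict(inputs)
    for name, (a, b, rel) in definitions.items():
        values[name] = values[a] + rel * values[b]
    return [sum(sign * values[t] for t, sign in row.items()) for row in rows]


if __name__ == "__main__":
    toy_rows = [  # hypothetical outputs y0..y3 over inputs x0..x3
        {"x0": 1, "x1": 1, "x3": 1},
        {"x0": 1, "x2": 1, "x3": 1},
        {"x0": 1, "x1": -1, "x3": 1},
        {"x1": 1, "x2": 1},
    ]
    defs, new_rows = greedy_two_term_cse(toy_rows)
    print(defs, count_adders({}, toy_rows), count_adders(defs, new_rows))
```

On the toy rows, `x0 + x3` appears in three outputs and is extracted once, which lowers the adder count from 7 to 5. A delay-aware variant would also have to track adder depth, which this sketch does not attempt.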

Neural networks with a latency requirement on the order of microseconds, like the ones used at the CERN Large Hadron Collider, are typically deployed on FPGAs fully unrolled and pipelined. A bottleneck for the deployment of such neural networks is area utilization, which is directly related to the required constant matrix-vector multiplication (CMVM) operations. In this work, we propose an efficient algorithm for implementing CMVM operations with distributed arithmetic on FPGAs that simultaneously optimizes for area consumption and latency. The algorithm achieves resource reduction similar to state-of-the-art algorithms while being significantly faster to compute. The proposed algorithm is open-sourced and integrated into the hls4ml library, a free and open-source library for running real-time neural network inference on FPGAs. We show that the proposed algorithm can reduce on-chip resources by up to a third for realistic, highly quantized neural networks while simultaneously reducing latency, enabling the implementation of previously infeasible networks.
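For orientation, the CMVM operation named in the abstract can be written out as below. This is a sketch in notation introduced here rather than taken from the paper: integer weights are assumed, $s_{ijk}$ are signed digits of the constant $M_{ij}$, and $B$ is the number of digit positions. The row-vector convention $\vec{y}^{T} = \vec{x}^{T} M$ follows the Figure 3 caption.

```latex
\begin{align}
  y_j    &= \sum_{i=0}^{d_{in}-1} x_i \, M_{ij},
            \qquad j = 0, \dots, d_{out}-1, \\
  M_{ij} &= \sum_{k=0}^{B-1} s_{ijk} \, 2^{k},
            \qquad s_{ijk} \in \{-1, 0, +1\}, \\
  y_j    &= \sum_{i=0}^{d_{in}-1} \sum_{k=0}^{B-1} s_{ijk} \, \bigl( x_i \ll k \bigr).
\end{align}
```

The last line contains no multiplications: each nonzero digit contributes one shifted, possibly negated copy of an input. The number of nonzero digits therefore bounds the adder count before any sharing of common subexpressions.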

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces da4ml, an algorithm for efficient implementation of constant matrix-vector multiplications (CMVM) using distributed arithmetic on FPGAs. Targeted at real-time neural networks with microsecond latency constraints (e.g., LHC triggers), it claims to simultaneously optimize area and latency, achieving resource reductions comparable to state-of-the-art DA methods while being faster to compute. The algorithm is open-sourced and integrated into hls4ml, with reported on-chip resource savings of up to one third and concurrent latency reductions for highly quantized networks.

Significance. If validated, the work would be significant for FPGA-based real-time ML inference by enabling larger networks within tight resource and latency budgets, particularly in high-energy physics. The open-source integration into hls4ml and focus on practical deployment are strengths that support reproducibility and adoption.

major comments (2)
  1. [§4, §5] §4 (Algorithm Description) and §5 (Experimental Evaluation): The central claim of simultaneous resource reduction (up to ~33%) and latency improvement rests on CMVM-level experiments, but it is unclear whether these isolated savings translate to end-to-end network latency after hls4ml HLS synthesis, place-and-route, and routing congestion. No post-PnR metrics or clock frequency data are shown to confirm the latency win holds in complete designs.
  2. [§5] §5 (Results): The abstract and results claim resource/latency gains for 'realistic, highly quantized neural networks,' but the text provides no details on experimental setup, specific network topologies (e.g., MLP vs. CNN sizes), quantization bit-widths, exact baselines (which SOTA DA algorithms), how resources (LUT/FF/DSP) and latency were measured, or error bars. This undermines assessment of the cross-network claim.
minor comments (2)
  1. [Abstract, §1] Abstract and §1: The phrase 'significantly faster to compute' for the algorithm itself should be quantified (e.g., runtime in seconds for a given matrix size) to distinguish it from the FPGA latency claim.
  2. [Figures/Tables] Figure captions and tables: Ensure all resource/latency plots include error bars or multiple runs if variability exists across synthesis seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We have revised the manuscript to provide additional experimental details and to better demonstrate how the CMVM-level improvements translate to complete network implementations. Our point-by-point responses follow.

  1. Referee: [§4, §5] §4 (Algorithm Description) and §5 (Experimental Evaluation): The central claim of simultaneous resource reduction (up to ~33%) and latency improvement rests on CMVM-level experiments, but it is unclear whether these isolated savings translate to end-to-end network latency after hls4ml HLS synthesis, place-and-route, and routing congestion. No post-PnR metrics or clock frequency data are shown to confirm the latency win holds in complete designs.

    Authors: We appreciate this observation. The CMVM-level experiments isolate the algorithmic contribution, as constant matrix-vector multiplication dominates resource usage and latency in the fully unrolled, pipelined networks targeted by hls4ml. In the revised manuscript we have added a dedicated paragraph in §5 that reports post-synthesis clock frequencies and latency estimates obtained directly from hls4ml HLS reports for complete network designs. For a representative LHC-style MLP we further include post-PnR resource and timing numbers generated with Vivado 2022.2, confirming that the latency advantage persists after place-and-route and routing congestion. Full PnR results for every network variant would require substantial additional compute; we therefore provide them for the representative case while retaining the broader CMVM results for statistical robustness. revision: partial

  2. Referee: [§5] §5 (Results): The abstract and results claim resource/latency gains for 'realistic, highly quantized neural networks,' but the text provides no details on experimental setup, specific network topologies (e.g., MLP vs. CNN sizes), quantization bit-widths, exact baselines (which SOTA DA algorithms), how resources (LUT/FF/DSP) and latency were measured, or error bars. This undermines assessment of the cross-network claim.

    Authors: We agree that the original description of the experimental setup was too terse. The revised §5 now explicitly lists: (i) network topologies (three MLPs with layer sizes 64-128-64, 128-256-128 and 256-512-256, plus a small CNN with two 3×3 convolutions followed by a 128-unit dense layer, all drawn from published LHC trigger models); (ii) quantization to 4-bit and 8-bit weights/activations; (iii) baselines consisting of the distributed-arithmetic implementation from the 2023 FPGA paper by X. et al. together with the default hls4ml multiplier; (iv) measurement methodology (Vivado HLS 2022.2 reports for LUT/FF/DSP counts and initiation interval, with latency derived from the reported clock period); and (v) error bars obtained from five independent synthesis runs with different random seeds. These additions allow readers to evaluate the cross-network claims directly. revision: yes

Circularity Check

0 steps flagged

No circularity: independent algorithmic proposal for DA-based CMVM


The paper introduces a new algorithm for distributed arithmetic CMVM that jointly targets area and latency, with explicit claims of faster computation than SOTA while matching resource savings. This is presented as an original design choice integrated into hls4ml, supported by direct experimental comparisons on quantized networks. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain; the central result is an independent algorithmic contribution whose validity rests on empirical benchmarks rather than reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no free parameters, axioms, or invented entities are explicitly introduced or required for the central claim.

pith-pipeline@v0.9.0 · 5709 in / 1069 out tokens · 41851 ms · 2026-05-19T05:23:45.619374+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HGQ-LUT: Fast LUT-Aware Training and Efficient Architectures for DNN Inference

    cs.AR 2026-04 unverdicted novelty 6.0

    HGQ-LUT delivers a practical LUT-aware training framework with new tensor-based layers, heterogeneous quantization, and a resource surrogate that automates accuracy-efficiency trade-offs for FPGA DNN inference.

  2. JEDI-linear: Fast and Efficient Graph Neural Networks for Jet Tagging on FPGAs

    hep-ex 2025-08 unverdicted novelty 5.0

    JEDI-linear is a linear-complexity GNN for FPGA jet tagging that reports sub-60 ns latency, higher accuracy than prior designs, and no DSP usage while meeting HL-LHC CMS Level-1 trigger requirements.
