Recognition: unknown
HGQ-LUT: Fast LUT-Aware Training and Efficient Architectures for DNN Inference
Pith reviewed 2026-05-08 09:30 UTC · model grok-4.3
The pith
HGQ-LUT trains lookup-table neural networks over 100 times faster on GPUs while delivering state-of-the-art FPGA hardware efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HGQ-LUT introduces LUT-Dense and LUT-Conv layers that execute as standard tensor operations during training yet compile to logic LUTs for hardware. Paired with element-wise heterogeneous quantization (including zero-bit pruning) and a LUT-aware resource surrogate, these layers enable more than 100 times faster training on modern GPUs, automatic exploration of accuracy-resource trade-offs, and unified design of hybrid LUT-plus-arithmetic architectures with bit-exact verification.
What carries the argument
LUT-Dense and LUT-Conv layers implemented via regular tensor operations that later compile to FPGA logic LUTs, combined with fine-grained heterogeneous quantization and a LUT-aware resource surrogate that guides automatic design-space exploration.
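To make the training-time trick concrete, here is a minimal sketch of how a small lookup table can be evaluated with ordinary tensor operations: a gather, or equivalently a one-hot matrix multiply. This is not the HGQ-LUT implementation; the 2-input, 2-bit configuration, the quantize helper, and all names are illustrative assumptions.

```python
# Minimal sketch of a LUT-style layer realized with plain tensor operations.
# NOT the HGQ2 API; the configuration below is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

N_IN, BITS = 2, 2                      # 2 LUT inputs, 2 bits each
TABLE_SIZE = 2 ** (N_IN * BITS)        # 16 entries for a 4-bit address
table = rng.normal(size=TABLE_SIZE)    # trainable LUT contents

def quantize(x, bits):
    """Uniform quantization to unsigned integer codes. np.round is not
    differentiable; training would use a straight-through estimator here."""
    return np.clip(np.round(x * (2**bits - 1)), 0, 2**bits - 1).astype(int)

x = rng.random((8, N_IN))              # a batch of 8 examples
q = quantize(x, BITS)                  # integer codes per input

# Pack the N_IN integer codes into a single LUT address per example.
addr = q[:, 0] * (2**BITS) + q[:, 1]

# A gather is a regular, accelerator-efficient tensor op; equivalently,
# one-hot @ table is a dense matrix multiply.
out_gather = table[addr]
one_hot = np.eye(TABLE_SIZE)[addr]
out_matmul = one_hot @ table

assert np.allclose(out_gather, out_matmul)
# In hardware, `table` compiles directly to the truth table of a logic LUT,
# so the training-time tensor op and the synthesized LUT can agree
# bit-exactly once the table entries are themselves quantized.
```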
If this is right
- Training time for LUT-based networks drops from hours or days to minutes, making repeated hardware-aware design iterations feasible on standard GPU hardware.
- Designers no longer need to select bit widths by hand; the surrogate and quantization jointly search the accuracy-resource space automatically (a toy version of this joint objective is sketched after this list).
- Hybrid networks that combine LUT blocks with conventional multiply-add blocks can be designed, compiled, and verified in a single open-source flow.
- The same model can be deployed at the edge with ultra-low latency while retaining the accuracy achieved during fast GPU training.
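A hedged sketch of what such a joint objective could look like: a task loss plus a beta-weighted resource term computed from per-element bit widths. The cost model below (fabric-LUT count growing exponentially with address width, zero-bit elements pruned) and the names lut_cost_surrogate, total_loss, and beta are assumptions chosen for illustration, not the surrogate defined in the paper.

```python
# Toy surrogate-regularized objective, assuming differentiable (relaxed)
# per-element bit widths as in heterogeneous-quantization training.
import numpy as np

def lut_cost_surrogate(bits):
    """Assumed cost model: a LUT whose active inputs total W address bits
    costs roughly 2**max(W - 6, 0) 6-input fabric LUTs; zero-bit inputs
    are pruned and contribute nothing."""
    active = bits[bits > 0]
    width = active.sum()
    return 0.0 if width == 0 else 2.0 ** max(width - 6, 0)

def total_loss(task_loss, bit_tensors, beta=1e-4):
    """Joint objective: accuracy term plus beta-weighted resource term."""
    resource = sum(lut_cost_surrogate(b) for b in bit_tensors)
    return task_loss + beta * resource

bits = [np.array([2., 2., 0., 3.])]    # element-wise bit widths, one pruned
print(total_loss(task_loss=0.31, bit_tensors=bits))
```

Sweeping beta traces out the accuracy-resource Pareto front, which is the automatic trade-off exploration the pith describes.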
Where Pith is reading between the lines
- The approach could be adapted to other reconfigurable fabrics or even ASICs if a corresponding resource surrogate is built for those targets.
- The tensor-based training trick may reduce the cost of other hardware-aware training methods that currently rely on slow simulation loops.
- Integration with existing neural-architecture-search tools could turn the resource surrogate into an automatic hardware-aware NAS objective for FPGAs.
Load-bearing premise
The tensor-operation versions of the LUT layers used in training produce exactly the same numerical results as the final compiled LUT hardware on the FPGA, and the resource surrogate's predicted counts match post-synthesis utilization.
What would settle it
Measure actual FPGA latency, power, resource utilization, and inference accuracy for a trained HGQ-LUT model and compare them directly against the values predicted by the resource surrogate and the bit-exact simulation from the training phase.
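At the resource-count level, the proposed check reduces to a simple comparison harness like the sketch below; the design names and counts are placeholders, not values from the paper.

```python
# Minimal harness for the experiment proposed above: compare the surrogate's
# predicted resource counts against post-synthesis reports. The record
# layout and all numbers are dummy placeholders.
designs = [
    # (name, surrogate-predicted LUTs, post-synthesis LUTs) -- dummy data
    ("hetero_a", 1200, 1260),
    ("hetero_b", 830, 805),
]

for name, predicted, actual in designs:
    rel_err = (predicted - actual) / actual
    print(f"{name}: predicted={predicted} actual={actual} rel.err={rel_err:+.1%}")
# A consistent sign in rel.err across many heterogeneous designs would
# indicate the surrogate systematically under- or over-estimates resources.
```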
Original abstract
Lookup-table (LUT) based neural networks can deliver ultra-low latency and excellent hardware efficiency on FPGAs by mapping arithmetic operations directly onto the logic primitives. However, state-of-the-art LUT-aware training (LAT) approaches remain difficult to use in practice: they are often orders of magnitude slower to train than conventional networks, require non-trivial manual tuning for hardware efficiency, and lack an end-to-end workflow. This work presents HGQ-LUT, integrated in https://github.com/calad0i/HGQ2, a new LAT approach that achieves state-of-the-art hardware efficiency while accelerating training by over 100 times on modern GPUs. HGQ-LUT introduces LUT-Dense and LUT-Conv layers that are implemented with regular, accelerator-efficient tensor operations during training, which are then compiled into logic LUTs for hardware. By combining these layers with fine-grained, element-wise heterogeneous quantization (including zero-bit pruning) and a LUT-aware resource surrogate, HGQ-LUT enables the automatic exploration of accuracy-resource trade-offs without manual bit-width tuning. We further integrate HGQ-LUT into open-source toolchains, enabling unified design, compilation, and bit-exact verification of hybrid architectures that mix LUT-based with conventional arithmetic blocks. These features make LAT-based DNNs practical for real-world deployment, such as at the CERN Large Hadron Collider's experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HGQ-LUT, a LUT-aware training (LAT) framework for DNN inference on FPGAs. It defines LUT-Dense and LUT-Conv layers that are realized via standard tensor operations (matrix multiplies and convolutions) during GPU training to achieve >100x speedup over prior LAT methods, then compiled to logic LUTs. These layers are paired with element-wise heterogeneous quantization (including zero-bit pruning) and a LUT-aware resource surrogate that enables automatic accuracy-resource trade-off search without manual bit-width tuning. The work integrates the approach into open-source toolchains for unified design, compilation, and bit-exact verification of hybrid LUT-plus-conventional arithmetic architectures, with example use at CERN LHC experiments.
Significance. If the central claims hold, HGQ-LUT would make LUT-based DNNs substantially more practical for ultra-low-latency FPGA deployment by removing the dominant training-time and tuning barriers that have limited prior LAT work. The explicit provision of the GitHub repository (https://github.com/calad0i/HGQ2) for reproducible code, the emphasis on bit-exact verification, and the hybrid-architecture support are concrete strengths that increase the result's immediate utility for high-energy-physics and other real-time inference settings.
Major comments (2)
- [§3.1–3.2] LUT-Dense/LUT-Conv definitions: The central claim that training-time tensor implementations produce models whose accuracy and functionality are preserved on FPGA LUT hardware under heterogeneous quantization and zero-bit pruning is load-bearing. No explicit equivalence proof, exhaustive edge-case enumeration (e.g., pruning semantics when a weight is quantized to zero bits, or input-encoding differences for multi-input LUTs), or side-by-side numerical comparison of tensor vs. post-synthesis behavior is supplied for the heterogeneous case; only homogeneous examples appear to be validated.
- [§4.3, Table 3] Resource surrogate validation: The automatic trade-off exploration and reported SOTA efficiency numbers rest on the LUT-aware surrogate accurately predicting post-synthesis utilization. Direct quantitative comparison (e.g., surrogate-predicted vs. actual LUT/FF/BRAM counts after Vivado synthesis) is shown only for a subset of homogeneous designs; extension to the fine-grained heterogeneous configurations that drive the claimed gains is required to confirm the surrogate does not systematically under- or over-estimate resources.
Minor comments (2)
- [Figure 4] Axis labels and legend entries for the heterogeneous vs. homogeneous curves are difficult to distinguish at the printed resolution; adding explicit bit-width annotations on the data points would improve readability.
- [§5.1] The statement that the approach is “parameter-free” should be qualified, as the surrogate still contains a small number of tunable coefficients whose sensitivity is not reported.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of HGQ-LUT's practical contributions and for the constructive major comments. We address each point below and will revise the manuscript accordingly to strengthen the validation of the core claims.
Point-by-point responses
Referee: [§3.1–3.2] LUT-Dense/LUT-Conv definitions: The central claim that training-time tensor implementations produce models whose accuracy and functionality are preserved on FPGA LUT hardware under heterogeneous quantization and zero-bit pruning is load-bearing. No explicit equivalence proof, exhaustive edge-case enumeration (e.g., pruning semantics when a weight is quantized to zero bits, or input-encoding differences for multi-input LUTs), or side-by-side numerical comparison of tensor vs. post-synthesis behavior is supplied for the heterogeneous case; only homogeneous examples appear to be validated.
Authors: The LUT-Dense and LUT-Conv layers are defined so that the tensor operations (matrix multiplies and convolutions) during training perform exactly the same arithmetic as the compiled LUT logic on hardware, with element-wise heterogeneous quantization applied identically in both domains. For zero-bit pruning, a weight assigned zero bits is removed from the computation graph in the tensor implementation (by masking or zeroing the corresponding slice), which matches the hardware behavior of omitting that LUT input. We acknowledge that the manuscript currently provides only homogeneous validation examples. In the revision we will add an appendix containing (i) a concise equivalence argument derived directly from the layer definitions in §3.1–3.2, (ii) side-by-side numerical comparisons of tensor versus post-synthesis outputs for representative heterogeneous quantization masks (including zero-bit cases), and (iii) explicit clarification of the input-encoding convention used for multi-input LUTs. These additions will directly address the referee's concern without altering the reported results.
Revision: yes
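The masking semantics described in this response can be sketched in a few lines; the shapes and names below are illustrative assumptions, not the HGQ2 API.

```python
# Sketch of zero-bit pruning: an element assigned zero bits is masked out
# of the tensor computation, mirroring the hardware behavior of omitting
# that input from the compiled LUT entirely.
import numpy as np

bits = np.array([3, 0, 2, 0])          # element-wise bit widths; two pruned
w = np.array([0.7, -1.2, 0.4, 0.9])    # dense weights before pruning
x = np.array([1.0, 2.0, 3.0, 4.0])

mask = (bits > 0).astype(w.dtype)      # zero-bit elements contribute nothing
y_tensor = (w * mask) @ x              # training-time tensor path

# Hardware view: pruned elements never enter the computation at all.
keep = bits > 0
y_hw = w[keep] @ x[keep]

assert np.isclose(y_tensor, y_hw)      # both paths agree
```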
Referee: [§4.3, Table 3] Resource surrogate validation: The automatic trade-off exploration and reported SOTA efficiency numbers rest on the LUT-aware surrogate accurately predicting post-synthesis utilization. Direct quantitative comparison (e.g., surrogate-predicted vs. actual LUT/FF/BRAM counts after Vivado synthesis) is shown only for a subset of homogeneous designs; extension to the fine-grained heterogeneous configurations that drive the claimed gains is required to confirm the surrogate does not systematically under- or over-estimate resources.
Authors: We agree that the surrogate's accuracy must be demonstrated for the heterogeneous configurations that underpin the automatic trade-off search. The current Table 3 and §4.3 focus on homogeneous designs for brevity. In the revised manuscript we will extend the validation by adding a new table (or expanded subsection) that reports surrogate-predicted versus post-Vivado-synthesis LUT/FF/BRAM counts for multiple heterogeneous bit-width assignments drawn from the Pareto-front experiments. This will confirm that the surrogate remains reliable in the fine-grained regime and will support the SOTA efficiency claims.
Revision: yes
Circularity Check
No circularity in derivation chain
Full rationale
The paper defines LUT-Dense and LUT-Conv layers via independent tensor-operation implementations for training, then separately compiles them to hardware LUTs. The LUT-aware resource surrogate is introduced as an additional modeling component for trade-off search. No equations or claims reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the workflow is presented as an end-to-end empirical pipeline with external verification steps. This matches the default expectation of a non-circular paper.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: tensor operations used in training accurately simulate the behavior of compiled LUT logic on the FPGA.