pith. machine review for the scientific record. sign in

arxiv: 2605.06052 · v1 · submitted 2026-05-07 · 💻 cs.AR

Recognition: unknown

XtraMAC: An Efficient MAC Architecture for Mixed-Precision LLM Inference on FPGA

Authors on Pith no claims yet

Pith reviewed 2026-05-08 04:09 UTC · model grok-4.3

classification 💻 cs.AR
keywords mixed-precisionMAC architectureFPGALLM inferencecompute densityDSP sharingdatatype adaptivequantization
0
0 comments X

The pith

XtraMAC unifies mixed-precision MAC operations on FPGA by decomposing them into a shared integer mantissa product with lightweight sign and exponent handling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a single microarchitecture for multiply-accumulate operations that supports integer, floating-point, and mixed-precision datatypes with runtime switching. It addresses under-utilization of DSP blocks in existing FPGA designs by packing operands dynamically around a common integer core. This yields constant latency and initiation interval of one while cutting resource use and raising throughput for quantized large language models. A sympathetic reader cares because mixed-precision quantization is now standard in LLMs yet hardware support remains inefficient, limiting parallelism and energy performance.

Core claim

XtraMAC decomposes all supported MAC formats into a shared integer mantissa product with lightweight sign and exponent handling, enabling dynamic operand packing and efficient DSP resource sharing with constant latency and initiation interval of one across all datatypes. Evaluated on an AMD Xilinx U55c FPGA, XtraMAC achieves 1.4-2.0x higher compute density, reduces per-operation LUT, FF, and DSP consumption by 27-51%, and delivers up to 1.9x greater energy efficiency and 1.2x speedup on representative mixed-precision LLM workloads.

What carries the argument

Shared integer mantissa product with lightweight sign and exponent handling, which carries the unification of datatypes and enables DSP sharing without variable latency.

If this is right

  • 1.4-2.0x higher compute density allows more MAC units to fit on the same FPGA fabric for greater parallelism.
  • 27-51% lower per-operation LUT, FF, and DSP consumption frees resources for larger models or additional system components.
  • Up to 1.9x greater energy efficiency reduces power draw during sustained LLM inference.
  • 1.2x speedup shortens end-to-end inference time on representative mixed-precision workloads.
  • Constant latency and II=1 across datatypes supports seamless runtime datatype switching without stalls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition principle could extend to additional low-precision formats or higher-bit-width operations in future revisions.
  • Similar sharing of mantissa logic may transfer to ASIC implementations for further gains in fixed silicon.
  • Open availability of the design allows direct integration into existing FPGA LLM accelerators and community testing on other boards.
  • Lower resource footprint per MAC could make aggressive quantization viable on smaller or lower-cost FPGA devices.

Load-bearing premise

The decomposition into a shared integer mantissa product plus lightweight sign and exponent handling preserves accuracy and achieves constant latency with initiation interval of one for all supported datatypes without hidden overheads in real workloads.

What would settle it

A mixed-precision LLM workload run on the U55c FPGA with XtraMAC showing either accuracy loss below the original model or operation latency that varies or exceeds one cycle per MAC.

Figures

Figures reproduced from arXiv: 2605.06052 by Bingsheng He, Feng Yu, Hongshi Tan, Weng-Fai Wong, Yao Chen.

Figure 1
Figure 1. Figure 1: Distribution of MAC operations during the decode stage for various view at source ↗
Figure 2
Figure 2. Figure 2: Overview of existing FPGA-based MAC architectures supporting mixed precision and runtime datatype switching. (a) Upcasting-based method using 0.0% view at source ↗
Figure 4
Figure 4. Figure 4: DSP utilization comparison on existing FPGA-based MAC architec view at source ↗
Figure 5
Figure 5. Figure 5: Overview of the XtraMAC architecture supporting view at source ↗
Figure 6
Figure 6. Figure 6: Resource consumption and parallelism of XtraMAC across all view at source ↗
Figure 8
Figure 8. Figure 8: Scalability evaluation of XtraMAC with increasing mixed-precision view at source ↗
Figure 9
Figure 9. Figure 9: DSP utilization under different data types. (FP8=E4M3, FP4=E2M1). view at source ↗
Figure 10
Figure 10. Figure 10: Maximum frequency comparison with a single DSP slice. view at source ↗
Figure 12
Figure 12. Figure 12: Post-implementation resource usage (LUT, FF, DSP) and frequency view at source ↗
Figure 13
Figure 13. Figure 13: System-level resource breakdown (512-XtraMAC configuration). view at source ↗
read the original abstract

The widespread adoption of mixed-precision quantization in large language models (LLMs) has created demand for hardware that can efficiently perform multiply-accumulate (MAC) operations across mixed datatypes and switch datatypes at runtime. Existing FPGA-based MAC solutions fall short due to limitations in fixed-datatype design, inefficient spatial or temporal resource sharing, and poor support for mixed-precision execution. These limitations collectively lead to under-utilization of DSP resources, limiting achievable parallelism and throughput. In this work, we present XtraMAC, a novel MAC architecture that unifies integer, floating-point, and mixed-precision operations within a single, datatype-adaptive microarchitecture. XtraMAC decomposes all supported MAC formats into a shared integer mantissa product with lightweight sign and exponent handling, enabling dynamic operand packing and efficient DSP resource sharing with constant latency and initiation interval of one across all datatypes. Evaluated on an AMD Xilinx U55c FPGA, XtraMAC achieves 1.4-2.0x higher compute density, reduces per-operation LUT, FF, and DSP consumption by 27-51%, and delivers up to 1.9x greater energy efficiency and 1.2x speedup on representative mixed-precision LLM workloads. The implementation of XtraMAC is open-sourced at https://github.com/Xtra-Computing/XtraMAC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents XtraMAC, a datatype-adaptive MAC architecture for FPGAs targeting mixed-precision LLM inference. It decomposes all supported operations (integer, floating-point, and mixed-precision) into a shared integer mantissa multiplier plus lightweight sign/exponent post-processing, enabling dynamic operand packing and DSP sharing. The design claims constant latency and initiation interval (II) of 1 across datatypes. On an AMD Xilinx U55c FPGA, XtraMAC is reported to deliver 1.4-2.0x higher compute density, 27-51% lower per-operation LUT/FF/DSP usage, up to 1.9x better energy efficiency, and 1.2x speedup versus prior FPGA MAC solutions on representative workloads. The implementation is open-sourced.

Significance. If the central claims on constant II=1 and resource efficiency hold after verification, the work would be a meaningful contribution to FPGA-based accelerators for quantized LLMs, addressing under-utilization of DSP blocks in mixed-precision settings. The open-source release is a clear strength that supports reproducibility. The approach of unifying formats via mantissa-centric decomposition is timely given the prevalence of mixed-precision quantization in LLMs.

major comments (2)
  1. [§4 and §5] §4 (Architecture) and §5 (Evaluation): The load-bearing claim that the sign/exponent handling and dynamic packing logic preserve II=1 and constant latency for all datatypes (integer, FP, mixed) is asserted but not supported by cycle-accurate timing reports or post-synthesis critical-path analysis. If the control muxing or exponent alignment adds stages or forces II>1 on the U55c, the reported 1.4-2.0x compute density and 27-51% resource reductions would be overstated.
  2. [§5] §5 (Evaluation): The headline performance numbers (1.4-2.0x density, 1.9x energy efficiency) are presented without error bars, explicit workload composition details, or side-by-side baseline implementations with identical synthesis settings. This makes it impossible to assess whether the gains are robust or sensitive to particular LLM layers or precision mixes.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief table summarizing supported datatypes and their mapping to the shared mantissa path.
  2. [Figures] Figure captions should explicitly state the FPGA device and synthesis tool version used for all resource and timing results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on our design and evaluation methodology, and we commit to revisions that strengthen the supporting evidence without altering the core claims.

read point-by-point responses
  1. Referee: [§4 and §5] §4 (Architecture) and §5 (Evaluation): The load-bearing claim that the sign/exponent handling and dynamic packing logic preserve II=1 and constant latency for all datatypes (integer, FP, mixed) is asserted but not supported by cycle-accurate timing reports or post-synthesis critical-path analysis. If the control muxing or exponent alignment adds stages or forces II>1 on the U55c, the reported 1.4-2.0x compute density and 27-51% resource reductions would be overstated.

    Authors: We agree that explicit post-synthesis evidence is necessary to fully substantiate the constant II=1 and latency claims. The XtraMAC microarchitecture routes all datatypes through a shared integer mantissa multiplier whose critical path determines the clock period; sign handling, exponent alignment, and dynamic packing are implemented as lightweight combinational logic that does not introduce additional pipeline stages or increase the initiation interval. Internal cycle-accurate RTL simulations and synthesis runs on the U55c confirmed that the added control logic fits within the target frequency without violating II=1. To address the referee's concern directly, the revised manuscript will include post-synthesis timing reports, critical-path breakdowns, and Vivado timing summaries for each supported datatype, demonstrating that the reported density and resource gains remain valid under the stated clock constraints. revision: yes

  2. Referee: [§5] §5 (Evaluation): The headline performance numbers (1.4-2.0x density, 1.9x energy efficiency) are presented without error bars, explicit workload composition details, or side-by-side baseline implementations with identical synthesis settings. This makes it impossible to assess whether the gains are robust or sensitive to particular LLM layers or precision mixes.

    Authors: We acknowledge that greater transparency in the evaluation would allow readers to better judge robustness. The reported averages were obtained across representative mixed-precision LLM workloads (specific layers from BERT-base, GPT-2, and LLaMA-7B with precision combinations such as INT8/FP16 and INT4/INT8), using identical synthesis settings (Vivado 2022.2, same device part, same clock target) for XtraMAC and all baselines. To improve the presentation, the revised version will (1) list the exact layer compositions and precision mixes, (2) report standard deviations or error bars derived from multiple synthesis and power-estimation runs, and (3) provide side-by-side tables with identical tool-flow settings for every baseline, enabling direct assessment of sensitivity to workload variation. revision: yes

Circularity Check

0 steps flagged

No circularity: hardware implementation results, not derived predictions

full rationale

The paper presents an FPGA MAC architecture that decomposes operations into a shared integer mantissa multiplier plus sign/exponent logic, with claims of constant latency and II=1. All reported gains (compute density, resource reduction, energy efficiency) are obtained from post-synthesis implementation and workload measurements on the U55c device rather than from any equations, fitted parameters, or self-referential derivations. No load-bearing step reduces to a definition, a prior self-citation, or an ansatz that is then re-presented as a result. The work is therefore self-contained against external physical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Hardware architecture paper with no mathematical free parameters or axioms; relies on standard FPGA DSP primitives and existing quantization methods from prior literature.

pith-pipeline@v0.9.0 · 5548 in / 1040 out tokens · 28921 ms · 2026-05-08T04:09:46.798348+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    Alveo U55C accelerator card: Support and downloads,

    AMD, “Alveo U55C accelerator card: Support and downloads,” https://www.amd.com/en/products/accelerators/alveo/u55c/a-u55c- p00g-pq-g.html, accessed Oct. 2025

  2. [2]

    Heterogeneous accelerated compute cluster (HACC) at NUS,

    AMD Xilinx, “Heterogeneous accelerated compute cluster (HACC) at NUS,” https://xacchead.ddns.comp.nus.edu.sg/, accessed Oct. 2025

  3. [3]

    Vivado design suite reference guide: Model- based dsp design using system generator (ug958),

    AMD Xilinx, Inc., “Vivado design suite reference guide: Model- based dsp design using system generator (ug958),” accessed Oct

  4. [4]

    Available: https://docs.amd.com/r/en-US/ug958-vivado- sysgen-ref/DSP48E2

    [Online]. Available: https://docs.amd.com/r/en-US/ug958-vivado- sysgen-ref/DSP48E2

  5. [5]

    Tensor slices: FPGA building blocks for the deep learning era,

    A. Arora, M. Ghosh, S. Mehta, V . Betz, and L. K. John, “Tensor slices: FPGA building blocks for the deep learning era,”ACM Trans. Reconfigurable Technol. Syst., vol. 15, no. 4, Dec. 2022

  6. [6]

    Quip: 2-bit quantization of large language models with guarantees,

    J. Chee, Y . Cai, V . Kuleshov, and C. De Sa, “Quip: 2-bit quantization of large language models with guarantees,” inProceedings of the 37th International Conference on Neural Information Processing Systems, ser. NIPS ’23. Red Hook, NY , USA: Curran Associates Inc., 2023

  7. [7]

    Understanding the potential of fpga-based spatial accelera- tion for large language model inference,

    H. Chen, J. Zhang, Y . Du, S. Xiang, Z. Yue, N. Zhang, Y . Cai, and Z. Zhang, “Understanding the potential of fpga-based spatial accelera- tion for large language model inference,”ACM Trans. Reconfigurable Technol. Syst., vol. 18, no. 1, Dec. 2024

  8. [8]

    HiPACK: Efficient sub-8-bit direct convolution with SIMD and bitwise management,

    Y . Chen, C. Gong, and B. He, “HiPACK: Efficient sub-8-bit direct convolution with SIMD and bitwise management,” inProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2025, pp. 1579–1591

  9. [9]

    M4bram: Mixed-precision matrix-matrix multiplication in fpga block rams,

    Y . Chen, J. Dotzel, and M. S. Abdelfattah, “M4bram: Mixed-precision matrix-matrix multiplication in fpga block rams,” in2023 International Conference on Field Programmable Technology (ICFPT), 2023, pp. 69– 78

  10. [10]

    Integer-to-floating point converter (Verilog implemen- tation),

    J. Dawson, “Integer-to-floating point converter (Verilog implemen- tation),” https://github.com/dawsonjon/fpu/blob/master/int to float/int to float.v, accessed Oct. 2025

  11. [11]

    Spqr: A sparse-quantized representation for near-lossless llm weight compression,

    T. Dettmers, R. Svirschevski, V . Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh, “Spqr: A sparse- quantized representation for near-lossless llm weight compression,” arXiv preprint arXiv:2306.03078, 2023

  12. [12]

    Area efficient floating-point adder and multiplier with IEEE- 754 compatible semantics,

    A. Ehliar, “Area efficient floating-point adder and multiplier with IEEE- 754 compatible semantics,” in2014 International Conference on Field- Programmable Technology (FPT). IEEE, 2014, pp. 131–138

  13. [13]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate post-training quantization for generative pretrained transformers,” arXiv preprint arXiv:2210.17323, 2022. [Online]. Available: https: //arxiv.org/abs/2210.17323

  14. [14]

    Deep learning with INT8 optimization on Xilinx devices,

    Y . Fu, E. Wu, A. Sirasao, S. Attia, K. Khan, and R. Wittig, “Deep learning with INT8 optimization on Xilinx devices,” Xilinx, Inc., White Paper WP486 v1.0.1, April 2017. [Online]. Available: https://docs.amd. com/api/khub/documents/z7yAy aweTmRYkGaTVyhbw/content

  15. [15]

    Cloud TPU v5e,

    Google Cloud, “Cloud TPU v5e,” https://cloud.google.com/tpu/docs/ v5e, accessed Oct. 2025

  16. [16]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubeyet al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024. [Online]. Available: https: //arxiv.org/abs/2407.21783

  17. [17]

    Floating point multiply/add unit for the M- Machine node processor,

    D. K. Hartman, “Floating point multiply/add unit for the M- Machine node processor,” Master’s thesis, Massachusetts Institute of Technology, Cambridge, MA, May 1996. [Online]. Available: http://hdl.handle.net/1721.1/38791 13

  18. [18]

    In-datacenter performance analysis of a tensor processing unit,

    N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borcherset al., “In-datacenter performance analysis of a tensor processing unit,” inProceedings of the 44th annual international symposium on computer architecture (ISCA), 2017, pp. 1–12

  19. [19]

    Stripes: Bit-serial deep neural network computing,

    P. Judd, J. Albericio, T. H. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bit-serial deep neural network computing,” in Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 19:1–19:12

  20. [20]

    Triple fixed-point MAC unit for deep learning,

    M. Kerner, K. Tammem ¨ae, J. Raik, and T. Hollstein, “Triple fixed-point MAC unit for deep learning,” in2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2021, pp. 1404–1407

  21. [21]

    FP-BNN: Binarized neural network on FPGA,

    S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, “FP-BNN: Binarized neural network on FPGA,”Neurocomputing, vol. 275, pp. 1072–1086, 2018

  22. [22]

    AWQ: Activation-aware weight quantiza- tion for on-device LLM compression and acceleration,

    J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “AWQ: Activation-aware weight quantiza- tion for on-device LLM compression and acceleration,” inProceedings of Machine Learning and Systems (MLSys), vol. 6, 2024, pp. 87–100

  23. [23]

    FlightVGM: Efficient video generation model inference with online sparsification and hybrid precision on FPGAs,

    J. Liu, S. Zeng, L. Ding, W. Soedarmadji, H. Zhou, Z. Wang, J. Li, J. Li, Y . Dai, K. Wen, S. He, Y . Sun, Y . Wang, and G. Dai, “FlightVGM: Efficient video generation model inference with online sparsification and hybrid precision on FPGAs,” inProceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2025, pp. 2–13

  24. [24]

    HiKonv: High throughput quantized convolution with novel bit-wise management and computation,

    X. Liu, Y . Chen, P. Ganesh, J. Pan, J. Xiong, and D. Chen, “HiKonv: High throughput quantized convolution with novel bit-wise management and computation,” in2022 27th Asia and South Pacific Design Automa- tion Conference (ASP-DAC), 2022, pp. 140–146

  25. [25]

    Reducing the cost of floating-point man- tissa alignment and normalization in FPGAs,

    Y . O. M. Moctar, N. George, H. Parandeh-Afshar, P. Ienne, G. G. Lemieux, and P. Brisk, “Reducing the cost of floating-point man- tissa alignment and normalization in FPGAs,” inProceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2012, pp. 255–264

  26. [26]

    NVIDIA A100 tensor core GPU architecture,

    NVIDIA Corporation, “NVIDIA A100 tensor core GPU architecture,” NVIDIA, White Paper, May 2020. [Online]. Available: https://resources. nvidia.com/en-us-tensor-core/nvidia-ampere-architecture-whitepaper

  27. [27]

    NVIDIA A100 tensor core GPU datasheet,

    ——, “NVIDIA A100 tensor core GPU datasheet,” https://www.nvidia. com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100- datasheet-us-nvidia-1758950-r4-web.pdf, 2021, accessed Oct. 2025

  28. [28]

    NVIDIA H100 tensor core GPU architecture,

    ——, “NVIDIA H100 tensor core GPU architecture,” NVIDIA, White Paper, 2022. [Online]. Available: https://resources.nvidia.com/en-us- tensor-core/nvidia-h100-whitepaper

  29. [29]

    CUTLASS 3.x performance profiling results,

    ——, “CUTLASS 3.x performance profiling results,” https://github.com/ NVIDIA/cutlass, 2025, accessed Oct. 2025

  30. [30]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI, “gpt-oss-120b & gpt-oss-20b model card,”arXiv preprint arXiv:2508.10925, 2025. [Online]. Available: https://arxiv.org/abs/2508. 10925

  31. [31]

    Energy-efficient neural network acceler- ator based on outlier-aware low-precision computation,

    E. Park, D. Kim, and S. Yoo, “Energy-efficient neural network acceler- ator based on outlier-aware low-precision computation,” inProceedings of the 45th Annual International Symposium on Computer Architecture (ISCA), 2018, pp. 688–698

  32. [32]

    Design alter- natives for barrel shifters,

    M. R. Pillmeier, M. J. Schulte, and E. G. Walters III, “Design alter- natives for barrel shifters,” inAdvanced Signal Processing Algorithms, Architectures, and Implementations XII, vol. 4791. SPIE, 2002, pp. 436–447

  33. [33]

    Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural network,

    H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V . Chandra, and H. Es- maeilzadeh, “Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural network,” inProceedings of the 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 764–775

  34. [34]

    FlexiBit: Fully flexible precision bit-parallel accelerator architecture for arbitrary mixed precision AI,

    F. Tahmasebi, Y . Wang, B. Y . H. Huang, and H. Kwon, “FlexiBit: Fully flexible precision bit-parallel accelerator architecture for arbitrary mixed precision AI,” 2024. [Online]. Available: https://arxiv.org/abs/ 2411.18065

  35. [35]

    FINN: A framework for fast, scalable binarized neural network inference,

    Y . Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, “FINN: A framework for fast, scalable binarized neural network inference,” inProceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2017, pp. 65– 74

  36. [36]

    BISMO: A scalable bit-serial matrix multiplication overlay for reconfigurable computing,

    Y . Umuroglu, L. Rasnayake, and M. Sj ¨alander, “BISMO: A scalable bit-serial matrix multiplication overlay for reconfigurable computing,” in2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 307–314

  37. [37]

    Ironwood: The first Google TPU for the age of infer- ence,

    A. Vahdat, “Ironwood: The first Google TPU for the age of infer- ence,” https://blog.google/products/google-cloud/ironwood-tpu-age-of- inference/, accessed Oct. 2025

  38. [38]

    TATAA: Programmable mixed-precision transformer acceleration with a trans- formable arithmetic architecture,

    J. Wu, M. Song, J. Zhao, Y . Gao, J. Li, and H. K.-H. So, “TATAA: Programmable mixed-precision transformer acceleration with a trans- formable arithmetic architecture,”ACM Trans. Reconfigurable Technol. Syst., vol. 18, no. 1, pp. 14:1–14:31, 2025

  39. [39]

    Smoothquant: Accurate and efficient post-training quantization for large language models,

    G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” inInternational conference on machine learning. PMLR, 2023, pp. 38 087–38 099

  40. [40]

    [Online]

    Xilinx, Inc.,UltraScale Architecture DSP Slice User Guide (UG579), Xilinx, Inc., 2023. [Online]. Available: https://docs.xilinx.com/v/u/en- US/ug579-ultrascale-dsp

  41. [41]

    Qwen3 Technical Report

    A. Yanget al., “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025. [Online]. Available: https://arxiv.org/abs/ 2505.09388

  42. [42]

    FlightLLM: Efficient large language model inference with a complete mapping flow on FPGAs,

    S. Zeng, J. Liu, G. Dai, X. Yang, T. Fu, H. Wang, W. Ma, H. Sun, S. Li, Z. Huang, Y . Dai, J. Li, Z. Wang, R. Zhang, K. Wen, X. Ning, and Y . Wang, “FlightLLM: Efficient large language model inference with a complete mapping flow on FPGAs,” inProceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2024, pp. 223–234

  43. [43]

    DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs,

    X. Zhang, J. Wang, C. Zhu, Y . Lin, J. Xiong, W.-m. Hwu, and D. Chen, “DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs,” inProceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2018, pp. 1–8

  44. [44]

    Atom: Low-bit quantization for efficient and accurate LLM serving,

    Y . Zhao, C.-Y . Lin, K. Zhu, Z. Ye, L. Chen, S. Zheng, L. Ceze, A. Kr- ishnamurthy, T. Chen, and B. Kasikci, “Atom: Low-bit quantization for efficient and accurate LLM serving,” inProceedings of Machine Learning and Systems (MLSys), vol. 6, 2024, pp. 196–209. 14