pith. machine review for the scientific record.

arxiv: 2604.19106 · v1 · submitted 2026-04-21 · 💻 cs.AR · cs.AI · cs.LG

Recognition: unknown

Design Rules for Extreme-Edge Scientific Computing on AI Engines

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:15 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.LG
keywords extreme-edge computing · neural network deployment · FPGA architectures · performance comparison · dataflow optimization · latency metrics · on-chip inference · scientific computing

The pith

A resource metric shows when specialized compute engines outperform standard FPGA logic for low-latency sensor models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to decide when extreme-edge scientific neural networks should run on AI Engines rather than on programmable logic inside modern FPGA SoCs. It supplies measurements of how each path scales with model size and structure, then introduces a single number that puts their costs on a common footing while accounting for latency. With that number and some data-movement changes, networks that exceed the capacity of the programmable logic can still execute fully on-chip on the AI Engines. If the comparison holds, designers gain a repeatable way to pick the faster option without exhaustive trial runs.

Core claim

Systematic characterization and micro-benchmarking reveal that AI Engine implementations can host end-to-end neural networks that do not fit on programmable logic. The latency-adjusted resource equivalence metric identifies the crossover points where one path becomes preferable. Spatial and API-level dataflow changes keep latency low even as models grow.

What carries the argument

The latency-adjusted resource equivalence (LARE) metric, which normalizes resource consumption by achieved latency to decide when one hardware path beats the other.
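This page does not reproduce the paper's formula for LARE, so the following is only a minimal sketch of the idea as stated above: normalize each implementation's resource footprint to its device capacity, weight it by achieved latency, and compare the two scores. The field names, the max-fraction normalization, and the multiplicative combination are illustrative assumptions, not the paper's definition.

    from dataclasses import dataclass

    @dataclass
    class Implementation:
        name: str
        latency_us: float    # measured end-to-end inference latency (microseconds)
        used: dict           # resources consumed, e.g. {"LUT": 120_000, "DSP": 800} or {"AIE_tile": 96}
        capacity: dict       # device capacity for the same resource types

    def latency_adjusted_cost(impl: Implementation) -> float:
        # Hypothetical LARE-style score: the tightest fractional resource utilization,
        # scaled by achieved latency. Lower is better.
        utilization = max(impl.used[k] / impl.capacity[k] for k in impl.used)
        return utilization * impl.latency_us

    def preferred_path(pl: Implementation, aie: Implementation) -> str:
        # The crossover the review describes is where these two scores meet as the
        # model grows; on one side the PL design wins, on the other the AIE design does.
        return pl.name if latency_adjusted_cost(pl) < latency_adjusted_cost(aie) else aie.name

Under these assumptions, sweeping model size and recording where preferred_path flips would recover the kind of crossover point the core claim refers to.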

If this is right

  • Designers obtain a quantitative rule instead of ad-hoc testing for choosing the hardware path that meets real-time constraints.
  • Models previously blocked by resource limits on one path become deployable on the other while keeping weights on-chip.
  • Tailored dataflow patterns reduce the latency penalty that normally grows with larger networks.
  • End-to-end inference becomes practical for applications that require both high model capacity and small batch sizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same comparison approach could be applied to other high-density accelerators when they appear in embedded scientific instruments.
  • Similar equivalence metrics might shorten the time needed to evaluate future chip generations for edge sensor tasks.
  • The design rules could feed into automated tools that map a given model directly to the faster path.

Load-bearing premise

The tested networks and hardware samples represent the scaling behavior of arbitrary extreme-edge scientific models, and the toolchain adds no large unmeasured costs.

What would settle it

A scientific neural network whose measured latency on programmable logic falls below the LARE prediction for the AI Engine version, or an AI Engine deployment whose actual latency exceeds the micro-benchmark extrapolation by more than the reported margin.

Figures

Figures reproduced from arXiv: 2604.19106 by Dimitrios Danopoulos, Francesco Restuccia, G Abarajithan, Olivia Weng, Ryan Kastner, Zhenghua Ma.

Figure 1. Our design rules for AIE allow larger NN implementations to meet …
Figure 2. HLS4ML performance scalability. Performance is measured by Interval, i.e., the time between output batches in steady-state execution. A smaller Interval indicates higher throughput and thus better performance. In the resource-abundant regime, HLS4ML can fully parallelize the design, so Interval remains nearly constant while resource consumption increases with workload sizes. In the constrained-resource regime …
Figure 3. Micro-benchmarking to understand resource–latency trade-off. Each …
Figure 4. Performance analysis of GEMM workloads with batch size of 8 implemented in a single compute tile, to measure the performance impact of API-level …
Figure 5. Latency reduction of tiling a GEMM workload / dense layer of size …
Figure 6. Latency impact of exhausting the available AIE columns. We use …
Figure 7. Latency overhead of crossing the AIE-PL boundary. Each experiment …
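Figure 2's caption gives the performance measure used for the HLS4ML scaling comparison: the Interval between output batches in steady-state execution, where a smaller Interval means higher throughput. A minimal sketch of that relationship; the function and the regime note are illustrative glosses on the caption, not taken from the paper.

    def steady_state_throughput(batch_size: int, interval_s: float) -> float:
        # One output batch completes every interval_s seconds in steady state,
        # so throughput is batch size over Interval; shrinking Interval raises
        # throughput, which is the sense in which "smaller is better" in Figure 2.
        return batch_size / interval_s

    # In the resource-abundant regime the caption describes, Interval stays nearly
    # constant as the workload grows (the design parallelizes fully), so throughput
    # holds while resource use rises; in the constrained regime Interval grows instead.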
read the original abstract

Extreme-edge scientific applications use machine learning models to analyze sensor data and make real-time decisions. Their stringent latency and throughput requirements demand small batch sizes and require that model weights remain fully on-chip. Spatial dataflow implementations are common for extreme-edge applications. Spatial dataflow works well for small networks, but it fails to scale to larger models due to inherent resource scaling limitations. AI Engines on modern FPGA SoCs offer a promising alternative with high compute density and additional on-chip memory. However, the architecture, programming model, and performance-scaling behavior of AI Engines differ fundamentally from those of the programmable logic, making direct comparison non-trivial and the benefits of using AI Engines unclear. This work addresses how and when extreme-edge scientific neural networks should be implemented on AI Engines versus programmable logic. We provide systematic architectural characterization and micro-benchmarking and introduce a latency-adjusted resource equivalence (LARE) metric that identifies when AI Engine implementations outperform programmable logic designs. We further propose spatial and API-level dataflow optimizations tailored to low-latency scientific inference. Finally, we demonstrate the successful deployment of end-to-end neural networks on AI Engines that cannot fit on programmable logic when using the hls4ml toolchain.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to provide systematic architectural characterization and micro-benchmarking of AI Engines versus programmable logic for extreme-edge scientific neural networks. It introduces a latency-adjusted resource equivalence (LARE) metric to identify when AI Engine implementations outperform programmable logic designs, proposes spatial and API-level dataflow optimizations for low-latency inference, and demonstrates successful end-to-end deployment of neural networks on AI Engines that cannot fit on programmable logic when using the hls4ml toolchain.

Significance. If the results hold, the work supplies actionable design rules and a new comparison metric for hardware choice in extreme-edge scientific computing, where small batch sizes and on-chip weights are required. The empirical micro-benchmarking approach and explicit demonstration of larger models on AI Engines represent a practical contribution, particularly the focus on resource/latency trade-offs that programmable logic scaling limitations impose.

major comments (2)
  1. [Micro-benchmarking and LARE metric section] The LARE metric and derived scaling laws are validated only on the specific networks tested; the manuscript provides no cross-validation or additional experiments on a wider suite of scientific models with varying sparsity, dataflow patterns, or precision. This generalization is load-bearing for the central claim that LARE correctly identifies AI Engine superiority for networks that cannot fit on programmable logic under hls4ml.
  2. [Demonstration of end-to-end networks section] The claim of successful deployment of networks that cannot fit on programmable logic requires explicit quantitative results (latency, resource utilization, throughput) with direct hls4ml comparisons, error analysis, and validation details; the current presentation leaves the performance gains and overheads unquantified, undermining assessment of the design rules.
minor comments (2)
  1. [Abstract] Including one or two key quantitative results (e.g., LARE values or resource savings from the demonstration) would strengthen the summary of claims.
  2. [LARE metric definition] Notation: The definition of the LARE metric would benefit from an explicit equation or formula to clarify how latency and resource terms are combined.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging where revisions are warranted.

read point-by-point responses
  1. Referee: [Micro-benchmarking and LARE metric section] The LARE metric and derived scaling laws are validated only on the specific networks tested; the manuscript provides no cross-validation or additional experiments on a wider suite of scientific models with varying sparsity, dataflow patterns, or precision. This generalization is load-bearing for the central claim that LARE correctly identifies AI Engine superiority for networks that cannot fit on programmable logic under hls4ml.

    Authors: The LARE metric is derived from systematic architectural characterization and micro-benchmarks on fundamental kernels (GEMM, convolutions, activations, and data movement patterns) that are representative of extreme-edge scientific workloads, rather than being validated solely on full end-to-end networks. The scaling laws follow from these architecture-level measurements of resource and latency trade-offs. We agree, however, that explicit cross-validation on additional models would strengthen the generalization claim. In the revised manuscript we will add a dedicated discussion subsection on LARE applicability and include results from at least two further scientific models with differing sparsity and precision to provide the requested cross-validation. revision: partial

  2. Referee: [Demonstration of end-to-end networks section] The claim of successful deployment of networks that cannot fit on programmable logic requires explicit quantitative results (latency, resource utilization, throughput) with direct hls4ml comparisons, error analysis, and validation details; the current presentation leaves the performance gains and overheads unquantified, undermining assessment of the design rules.

    Authors: We accept this criticism. The end-to-end section will be expanded in revision to include explicit tables and figures reporting latency, resource utilization (AI Engine tiles, LUTs, BRAM, DSPs), throughput, and power for the deployed networks. Direct comparisons to hls4ml will be provided for all cases where an hls4ml implementation fits; for networks that exceed programmable-logic resources we will supply extrapolated estimates grounded in the micro-benchmarking data. Error analysis (output accuracy versus reference) and validation methodology will also be added so that performance gains and any optimization overheads can be quantitatively assessed. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmarking and new metric are self-contained

full rationale

The paper's core contribution is an empirical architectural characterization of AI Engines, micro-benchmarking against programmable logic, and the introduction of the LARE metric to compare latency-adjusted resource use. It then applies these to demonstrate end-to-end network deployments that exceed PL limits under the hls4ml toolchain. No derivation step reduces by construction to its own inputs, no parameter is fitted and then relabeled as a prediction, and no load-bearing claim rests on self-citation chains or imported uniqueness theorems. The argument is self-contained, grounded in its own empirical benchmarks, and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on domain assumptions about FPGA architectures and the validity of the newly introduced LARE metric for performance comparison; no free parameters are fitted and no new physical entities are postulated.

axioms (2)
  • domain assumption Spatial dataflow implementations fail to scale to larger models due to inherent resource scaling limitations
    Stated directly in the abstract as a fundamental limitation.
  • domain assumption AI Engines offer high compute density and additional on-chip memory with architecture and programming model fundamentally different from programmable logic
    Presented as the key architectural premise enabling the comparison.
invented entities (1)
  • LARE metric · no independent evidence
    purpose: To identify when AI Engine implementations outperform programmable logic designs
    Newly proposed latency-adjusted resource equivalence metric introduced to guide implementation choices.

pith-pipeline@v0.9.0 · 5526 in / 1321 out tokens · 55137 ms · 2026-05-10T02:15:53.478572+00:00 · methodology

discussion (0)

