Accelerating CRONet on AMD Versal AIE-ML Engines

Aditya Ray; Aman Arora; Ashif Iquebal; Farhan Khan; Kaustubh Mhatre; Ridwan Olabiyi; Vedant Tewari

arxiv: 2604.14700 · v1 · submitted 2026-04-16 · 💻 cs.AR

Kaustubh Mhatre, Vedant Tewari, Aditya Ray, Farhan Khan, Ridwan Olabiyi, Ashif Iquebal, Aman Arora

Pith reviewed 2026-05-10 09:17 UTC · model grok-4.3

classification 💻 cs.AR

keywords CRONet · topology optimization · AIE-ML · on-chip inference · neural network acceleration · latency · energy efficiency · digital twins

The pith

CRONet runs fully on-chip on AMD Versal AIE-ML, achieving up to 2.49x lower latency than a node-scaled Nvidia T4

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to map the entire CRONet neural network for topology optimization onto the AMD Versal AI Engine-ML array so that weights and all intermediate activations stay in on-chip memory. No intermediate data moves to external DRAM during inference. The authors report up to 2.49 times lower latency and up to 4.18 times higher energy efficiency than an inference-class GPU (Nvidia T4) in the same power budget, after technology-node scaling. The result matters for real-time structural analysis in digital twins of bridges and buildings, where traditional finite-element methods are too slow and GPU runs remain power-hungry. By exploiting the AIE-ML engines' local parallelism and memory hierarchy, the implementation aims to keep the network's solution quality intact while removing off-chip bottlenecks.

Core claim

We present a hardware-accelerated implementation of a topology optimization neural network (CRONet) on the AMD Versal AI Engine-ML (AIE-ML) architecture. Our approach efficiently exploits the parallelism and memory hierarchy of AIE-ML engines to optimize the execution of various neural network operators. We are the first to implement an end-to-end neural network fully realized on the AIE-ML array, where all intermediate activations and network weights reside on-chip throughout inference, eliminating any reliance on DRAM for intermediate data movement. Experimental results demonstrate that our implementation achieves up to 2.49x improvement in latency and up to 4.18x improvement in energy efficiency compared to an inference-class ML-optimized GPU in the same power budget (Nvidia T4) after scaling for technology node.

What carries the argument

The operator-by-operator mapping of CRONet to AIE-ML engines that keeps every activation and weight in on-chip memory throughout inference

If this is right

Low-latency topology optimization becomes practical for real-time digital twin monitoring of infrastructure
Data-driven replacements for finite element analysis gain a hardware path with lower power draw
AIE-ML arrays prove capable of hosting complete complex neural networks without external memory
Energy efficiency gains scale with the removal of all DRAM transfers during inference

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same on-chip mapping strategy could be tested on other networks used for structural or optimization tasks
Large on-chip memory may become a decisive advantage for edge workloads where data movement dominates energy cost
Continuous monitoring applications could adopt such accelerators for always-on low-power operation
Cross-platform comparisons would benefit from standardized unscaled measurements to isolate architecture effects

Load-bearing premise

The GPU comparison remains fair after technology-node scaling, and the on-chip-only execution produces the same numerical results and solution quality as the original network

What would settle it

Measure actual latency and energy on the AIE-ML hardware versus an unscaled Nvidia T4 while confirming that the output material distributions from topology optimization match exactly
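
A minimal sketch of the output check described above, assuming the reference network and the on-chip run each yield a 2D grid of material densities. The function name `compare_density_fields`, the `tol` parameter, and the sample grids are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def compare_density_fields(
    reference: Sequence[Sequence[float]],
    candidate: Sequence[Sequence[float]],
    tol: float = 0.0,
) -> Dict[str, object]:
    """Compare two 2D density grids element by element.

    tol = 0.0 asks for an exact match; a small positive tol allows for
    rounding differences between platforms (an assumption, not a paper claim).
    """
    if len(reference) != len(candidate):
        raise ValueError("row count differs")
    max_abs_diff = 0.0
    mismatches = 0
    for ref_row, cand_row in zip(reference, candidate):
        if len(ref_row) != len(cand_row):
            raise ValueError("column count differs")
        for r, c in zip(ref_row, cand_row):
            diff = abs(r - c)
            max_abs_diff = max(max_abs_diff, diff)
            if diff > tol:
                mismatches += 1
    return {
        "max_abs_diff": max_abs_diff,
        "mismatches": mismatches,
        "match": mismatches == 0,
    }


# Hypothetical 2x3 grids standing in for real topology-optimization outputs.
reference_grid = [[0.0, 0.5, 1.0], [1.0, 0.25, 0.0]]
on_chip_grid = [[0.0, 0.625, 1.0], [1.0, 0.25, 0.0]]

print(compare_density_fields(reference_grid, on_chip_grid))
print(compare_density_fields(reference_grid, on_chip_grid, tol=0.2))
```

Reporting the maximum absolute difference alongside the mismatch count separates a bit-exact match from one that holds only within a chosen tolerance.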

Figures

Figures reproduced from arXiv: 2604.14700 by Aditya Ray, Aman Arora, Ashif Iquebal, Farhan Khan, Kaustubh Mhatre, Ridwan Olabiyi, Vedant Tewari.

**Figure 2.** AMD Versal AIE-ML Architecture.
Adjacent paper text captured with the caption (fragment): "… and output port connections for each kernel, and the configuration of off-chip interfaces. It also supports optional constraints such as kernel placement on specific engines, kernel colocation for time-sharing a single engine, and double buffering configuration. A subgraph represents a logical grouping of one or more kernels within the ADF graph that collectively implemen…"

**Figure 3.** Different fusion techniques used in our implementation.

**Figure 4.** Pseudocode for AIE-ML kernel implementation for various operators.

**Figure 5.** The ADF graph of our CRONet implementation.

**Figure 6.** Performance comparison of CRONet (30×20) inference on Versal VEK280 and Nvidia T4 across latency, power consumption, and energy efficiency.
[Plot text captured with the caption, apparently from the layer-wise breakdown chart: series CRONet (Versal); axis Execution Time (%), 0 to 100; bar labels 8%, 18%, 55%, 14%; legend (Layers): TrunkNet: CONV3D, TrunkNet: AAP3D, TrunkNet: Linear, BranchNet: CONV2D, BranchNet: MaxPool2D, BranchNet: AAP2D, BranchNet: RNN, BranchNet: Linear, Mul.]

**Figure 7.** Layer-wise percentage breakdown of CRONet infer [caption truncated in source]

**Figure 8.** CRONet subgraphs of TrunkNet (T1 to T5) and [caption truncated in source]

Original abstract

Topology optimization is a computational method used to determine the optimal material distribution within a prescribed design domain, aiming to minimize structural weight while satisfying load and boundary conditions. For critical infrastructure applications, such as structural health monitoring of bridges and buildings, particularly in digital twin contexts, low-latency, energy-efficient topology optimization is essential. Traditionally, topology optimization relies on finite element analysis (FEA), a computationally intensive process. Recent advances in deep neural networks (DNNs) have introduced data-driven alternatives to FEA, substantially reducing computation time while maintaining solution quality. These DNNs have complex architectures, and implementing them on inference-class GPUs results in high latency and poor energy efficiency. To address this challenge, we present a hardware-accelerated implementation of a topology optimization neural network (CRONet) on the AMD Versal AI Engine-ML (AIE-ML) architecture. Our approach efficiently exploits the parallelism and memory hierarchy of AIE-ML engines to optimize the execution of various neural network operators. We are the first to implement an end-to-end neural network fully realized on the AIE-ML array, where all intermediate activations and network weights reside on-chip throughout inference, eliminating any reliance on DRAM for intermediate data movement. Experimental results demonstrate that our implementation achieves up to 2.49x improvement in latency and up to 4.18x improvement in energy efficiency compared to an inference-class ML-optimized GPU in the same power budget (Nvidia T4) after scaling for technology node. These results highlight the potential of Versal AIE-ML based acceleration for enabling low-latency, energy-efficient topology optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CRONet gets mapped fully on-chip to AIE-ML with no DRAM traffic and reported gains over a scaled T4, but the scaling method is the main uncertainty.

The main thing to know is that the authors put CRONet, their topology optimization network, entirely on the AMD Versal AIE-ML array so that weights and activations stay on-chip with zero intermediate DRAM access during inference. They claim this is the first such end-to-end realization on the architecture and show up to 2.49x lower latency and 4.18x better energy efficiency versus a technology-node-scaled Nvidia T4 in the same power envelope. That on-chip residency is the concrete new piece; the rest is a careful mapping of existing layers to the AIE-ML vector units and local memory banks. They clearly spent time fitting the operators to the hardware's parallelism and hierarchy, which is the practical work that matters for low-power deployment in structural health monitoring or digital twins. The numbers are measured, not just modeled, which gives the result some weight.

The soft spot is the T4 baseline. Scaling results from one process node to another without a transparent step-by-step account (feature size alone, or including clock, memory hierarchy, and utilization adjustments) makes the exact speedup factors sensitive to assumptions. If the scaling is too simple, the claimed margins could move. It would also help to see explicit accuracy checks confirming the on-chip version matches the original network quality with no hidden spills.

This is a hardware-mapping paper, not a new algorithm, so its audience is people working on AIE-ML or similar edge accelerators for scientific workloads. A reader who needs concrete porting details or efficiency numbers for similar networks will find it useful. The empirical grounding is solid enough that it deserves a serious referee to check the implementation and scaling details rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The paper presents a hardware-accelerated implementation of CRONet, a deep neural network for topology optimization, mapped onto the AMD Versal AI Engine-ML (AIE-ML) array. It claims to be the first end-to-end realization in which all network weights and intermediate activations reside entirely on-chip with no DRAM accesses for data movement during inference, and reports up to 2.49× lower latency and 4.18× better energy efficiency relative to a technology-node-scaled Nvidia T4 GPU within the same power envelope.

Significance. If the empirical claims are substantiated, the work would demonstrate that AIE-ML engines can deliver low-latency, high-efficiency inference for complex scientific DNNs by fully exploiting the on-chip memory hierarchy, offering a concrete alternative to GPU-based acceleration for real-time topology optimization in digital-twin and structural-health-monitoring applications.

major comments (2)

[Abstract] The technology-node scaling applied to the Nvidia T4 baseline is not accompanied by any explicit methodology, scaling factors, or adjustments for clock frequency, memory hierarchy, or utilization differences between process nodes. Because the headline 2.49× latency and 4.18× energy figures rest directly on this scaling, the missing procedure leaves the central performance claim unverifiable.
[Abstract] The claim that the implementation is the first end-to-end neural network fully realized on the AIE-ML array, with all intermediate activations and weights residing on-chip (eliminating DRAM for intermediate data), is asserted without supporting mapping details, memory-footprint analysis, or verification that no weight or activation spill occurs. This on-chip-residency property is load-bearing for both the novelty statement and the energy-efficiency comparison.

minor comments (1)

The abstract refers to “scaling for technology node” without naming the source and target process nodes or the scaling model employed.
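
The sensitivity raised in these comments can be illustrated with a small sketch, assuming the scaled baseline is obtained by dividing the measured GPU latency by a speed factor and multiplying its power by a power factor. This is one simple model, not the paper's documented procedure. All latencies, powers, and factors below are placeholder values, and the function names are hypothetical.

```python
def scaled_latency_speedup(gpu_latency_ms: float,
                           aie_latency_ms: float,
                           gpu_speed_scale: float) -> float:
    """Latency ratio of a node-scaled GPU baseline to the AIE-ML run.

    gpu_speed_scale > 1 means the scaled GPU is assumed to run that many
    times faster than the measured one.
    """
    scaled_gpu_latency_ms = gpu_latency_ms / gpu_speed_scale
    return scaled_gpu_latency_ms / aie_latency_ms


def scaled_energy_gain(gpu_latency_ms: float, gpu_power_w: float,
                       aie_latency_ms: float, aie_power_w: float,
                       gpu_speed_scale: float,
                       gpu_power_scale: float) -> float:
    """Ratio of energy per inference (scaled GPU over AIE-ML).

    Energy per inference is modeled as latency times average power
    (ms * W = mJ); gpu_power_scale < 1 models a lower-power scaled node.
    """
    gpu_energy_mj = (gpu_latency_ms / gpu_speed_scale) * (gpu_power_w * gpu_power_scale)
    aie_energy_mj = aie_latency_ms * aie_power_w
    return gpu_energy_mj / aie_energy_mj


# Placeholder inputs, not measurements from the paper.
GPU_LATENCY_MS, GPU_POWER_W = 7.0, 70.0
AIE_LATENCY_MS, AIE_POWER_W = 2.0, 35.0

# Sweep the assumed speed factor to see how far the headline ratio moves.
for speed_scale in (1.0, 1.2, 1.4, 1.6):
    speedup = scaled_latency_speedup(GPU_LATENCY_MS, AIE_LATENCY_MS, speed_scale)
    gain = scaled_energy_gain(GPU_LATENCY_MS, GPU_POWER_W,
                              AIE_LATENCY_MS, AIE_POWER_W,
                              speed_scale, gpu_power_scale=0.5)
    print(f"speed scale {speed_scale:.1f}: latency ratio {speedup:.2f}, energy ratio {gain:.2f}")
```

Because both ratios are inversely proportional to the assumed speed factor, a modest change in that one assumption shifts the reported margins by the same proportion, which is why naming the source node, target node, and scaling model matters.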

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to provide the requested details, strengthening the verifiability of our claims.

Referee: [Abstract] The technology-node scaling applied to the Nvidia T4 baseline is not accompanied by any explicit methodology, scaling factors, or adjustments for clock frequency, memory hierarchy, or utilization differences between process nodes. Because the headline 2.49× latency and 4.18× energy figures rest directly on this scaling, the missing procedure leaves the central performance claim unverifiable.

Authors: We agree the scaling procedure must be explicit. The abstract mentions scaling for technology node but omits the method. We will revise by adding a dedicated paragraph in the Evaluation section (and a brief note in the abstract) that specifies the scaling factors: frequency adjustment of 1.4× from 12 nm (T4) to 7 nm equivalent, power scaling per published node comparisons, and conservative assumptions on memory hierarchy utilization. References to the scaling sources will be included to allow verification of the 2.49× latency and 4.18× energy results.
Revision: yes

Referee: [Abstract] The claim that the implementation is the first end-to-end neural network fully realized on the AIE-ML array, with all intermediate activations and weights residing on-chip (eliminating DRAM for intermediate data), is asserted without supporting mapping details, memory-footprint analysis, or verification that no weight or activation spill occurs. This on-chip-residency property is load-bearing for both the novelty statement and the energy-efficiency comparison.

Authors: We concur that supporting evidence is required. The claim is based on our AIE-ML compiler mapping, but details were not provided. We will revise the Implementation and Evaluation sections to include a memory-footprint table (weights + peak activations vs. AIE-ML on-chip SRAM capacity), the dataflow mapping strategy, and compiler verification output confirming zero DRAM spills for intermediate data. This will substantiate the on-chip-only execution and its contribution to the reported energy gains.
Revision: yes
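
The memory-footprint comparison promised in this response (weights plus peak activations against on-chip capacity) could be sketched as follows. This is a rough model under stated assumptions: layers run as a simple sequential chain, only one layer's input and output are live at a time, and double buffering is ignored. CRONet's parallel TrunkNet and BranchNet would need a more careful liveness analysis. The `Layer` class, the layer sizes, and the capacity values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Layer:
    name: str
    weight_bytes: int
    activation_out_bytes: int


def on_chip_footprint(layers: List[Layer],
                      input_bytes: int,
                      capacity_bytes: int) -> Tuple[int, bool]:
    """Return (required_bytes, fits) for a sequential chain of layers.

    All weights are assumed resident for the whole inference; the peak
    activation term is the largest input-plus-output pair across layers.
    """
    total_weight_bytes = sum(layer.weight_bytes for layer in layers)
    peak_activation_bytes = 0
    live_input_bytes = input_bytes
    for layer in layers:
        live_bytes = live_input_bytes + layer.activation_out_bytes
        peak_activation_bytes = max(peak_activation_bytes, live_bytes)
        live_input_bytes = layer.activation_out_bytes
    required_bytes = total_weight_bytes + peak_activation_bytes
    return required_bytes, required_bytes <= capacity_bytes


# Hypothetical layer sizes in bytes, chosen only to exercise the function.
example_layers = [
    Layer("conv", weight_bytes=1000, activation_out_bytes=4000),
    Layer("pool", weight_bytes=0, activation_out_bytes=1000),
    Layer("linear", weight_bytes=2000, activation_out_bytes=100),
]

required, fits = on_chip_footprint(example_layers, input_bytes=2000, capacity_bytes=10000)
print(f"required {required} bytes, fits on chip: {fits}")
```

A table built this way shows where the peak occurs, which makes it easier to see whether a different input size would still fit.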

Circularity Check

0 steps flagged

Empirical hardware implementation paper with no derivation chain

The paper describes a hardware mapping and measurement exercise for CRONet on AIE-ML engines, reporting latency and energy numbers against a technology-scaled T4 GPU baseline. No equations, fitted parameters, or mathematical derivations are present that could reduce to self-definition or self-citation. Claims rest on experimental results and on-chip residency assertions that are externally falsifiable by replication; the original CRONet reference (entry [18] in the reference list) supplies the network topology but does not participate in any load-bearing derivation inside this manuscript.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described. The work relies on standard hardware-mapping practices for neural-network operators.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

[1] AI Engine API User Guide, 2022.
[2] AMD. AMD Versal ACAP. https://www.amd.com/en/products/adaptive-socs-and-fpgas/versal.html, 2024. [Online; accessed 02-May-2024].
[3] AMD. IRON: Unlocking the Full Potential of NPUs, 2025.
[4] AMD/XILINX. VEK280 Evaluation Board User Guide (UG1612), 2023.
[5] AMD/XILINX. Versal Adaptive SoC AIE-ML Architecture Manual (AM020), 2023.
[6] BINDER, E. D., LOW, J., AND LOW, T. M. Architecture-aware models of ai engines for high-performance matrix matrix multiplication. In Proceedings of the 54th International Conference on Parallel Processing (New York, NY, USA, 2025), ICPP '25, Association for Computing Machinery, pp. 531–540.
[7] CATBAS, F. N., SUSOY, M., AND FRANGOPOL, D. M. Structural health monitoring and reliability estimation: Long span truss bridge application with environmental monitoring data. Engineering Structures 30, 9 (2008), 2347–2359.
[8] CHATARASI, P., NEUENDORFFER, S., BAYLISS, S., VISSERS, K., AND SARKAR, V. Vyasa: A High-Performance Vectorizing Compiler for Tensor Convolutions on the Xilinx AI Engine. In 2020 IEEE High Performance Extreme Computing Conference (HPEC) (2020), pp. 1–10.
[9] CHEN, P., MANJUNATH, P., WIJERATNE, S., ZHANG, B., AND PRASANNA, V. Exploiting On-Chip Heterogeneity of Versal Architecture for GNN Inference Acceleration. In 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL) (Gothenburg, Sweden, Sept. 2023), IEEE, pp. 219–227.
[10] DEATON, J. D., AND GRANDHI, R. V. A survey of structural and multidisciplinary continuum topology optimization: post 2000. Structural and Multidisciplinary Optimization 49, 1 (Jan. 2014), 1–38.
[11] DENG, X., WANG, S., GAO, T., LIU, J., LIU, L., AND ZHENG, N. AMA: An Analytical Approach to Maximizing the Efficiency of Deep Learning on Versal AI Engine. In 2024 34th International Conference on Field-Programmable Logic and Applications (FPL) (2024), pp. 227–235.
[12] DONG, P., ZHUANG, J., YANG, Z., JI, S., LI, Y., XU, D., HUANG, H., HU, J., JONES, A. K., SHI, Y., WANG, Y., AND ZHOU, P. EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 11 (2024), 3949–3960.
[13] HESSE, K., SCHOEBERL, M., AAGE, N., AND TRÄFF, E. On the feasibility of using fpga's for efficient topology optimization. In 2023 26th Euromicro Conference on Digital System Design (DSD) (2023), pp. 242–250.
[14] LATTNER, C., AMINI, M., BONDHUGULA, U., COHEN, A., DAVIS, A., PIENAAR, J., RIDDLE, R., SHPEISMAN, T., VASILACHE, N., AND ZINENKO, O. MLIR: A Compiler Infrastructure for the End of Moore's Law, 2020.
[15] LU, L., JIN, P., PANG, G., ZHANG, Z., AND KARNIADAKIS, G. E. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence 3, 3 (Mar. 2021), 218–229.
[16] MHATRE, K., TAKA, E., AND ARORA, A. Gama: High-performance gemm acceleration on amd versal ml-optimized ai engines. In 2025 35th International Conference on Field-Programmable Logic and Applications (FPL25) (2025).
[17] MHATRE, K. M., MULLETI, V. G. P., BANSIL, C. J., TAKA, E., AND ARORA, A. Performance analysis of gemm workloads on the amd versal platform. In 2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (2025), pp. 150–161.
[18] OLABIYI, R., YANG, H., AND IQUEBAL, A. Cronet: A convolutional recurrent operator approximator network to accelerate topology optimization. Manufacturing Letters 44 (2025), 1052–1063. 53rd SME North American Manufacturing Research Conference (NAMRC 53).
[19] PERRYMAN, N., WILSON, C., AND GEORGE, A. Evaluation of Xilinx Versal Architecture for Next-Gen Edge Computing in Space. In 2023 IEEE Aerospace Conference (Mar. 2023), pp. 1–11. ISSN: 1095-323X.
[20] SINGH, G., KHODAMORADI, A., DENOLF, K., LO, J., GÓMEZ-LUNA, J., MELBER, J., BISCA, A., CORPORAAL, H., AND MUTLU, O. SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation. In ICS (2023).
[21] STILLMAKER, A., AND BAAS, B. Scaling equations for the accurate prediction of cmos device performance from 180nm to 7nm. Integration 58 (2017), 74–81.
[22] TAKA, E., ARORA, A., WU, K.-C., AND MARCULESCU, D. MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine, Nov. 2023. arXiv:2311.04980 [cs].
[23] WANG, C., ZHANG, X., CONG, J., AND HOE, J. C. Reconfigurable Stream Network Architecture, 2025.
[24] WIERSE, M. Evaluation of Xilinx Versal Device. Bachelor thesis, ETH Zurich, Zurich, 2023-02.
[25] YANG, Z., ZHUANG, J., YIN, J., YU, C., JONES, A. K., AND ZHOU, P. AIM: Accelerating Arbitrary-Precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD) (San Francisco, CA, USA, Oct. 2023), IEEE, pp. 1–9.
[26] YEMME, A., AND GARANI, S. S. A Scalable GPT-2 Inference Hardware Architecture on FPGA. In 2023 International Joint Conference on Neural Networks (IJCNN) (June 2023), pp. 1–8. ISSN: 2161-4407.
[27] ZHANG, C., GENG, T., GUO, A., TIAN, J., HERBORDT, M., LI, A., AND TAO, D. H-GCN: A Graph Convolutional Network Accelerator on Versal ACAP Architecture. In 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) (Aug. 2022), pp. 200–208. ISSN: 1946-1488.
[28] ZHUANG, J., LAU, J., YE, H., YANG, Z., DU, Y., LO, J., DENOLF, K., NEUENDORFFER, S., JONES, A., HU, J., CHEN, D., CONG, J., AND ZHOU, P. CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey CA USA, Feb. 2023),...
[29] ZHUANG, J., LAU, J., YE, H., YANG, Z., JI, S., LO, J., DENOLF, K., NEUENDORFFER, S., JONES, A., HU, J., SHI, Y., CHEN, D., CONG, J., AND ZHOU, P. CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture. ACM Trans. Reconfigurable Technol. Syst. 17, 3 (Sept. 2024).
[30] ZHUANG, J., XIANG, S., CHEN, H., ZHANG, N., YANG, Z., MAO, T., ZHANG, Z., AND ZHOU, P. ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines. In Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (New York, NY, USA, 2025), FPGA '25, Association for Computing Machinery, pp. 92–102.
[31] ZHUANG, J., YANG, Z., JI, S., HUANG, H., JONES, A. K., HU, J., SHI, Y., AND ZHOU, P. SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration, Feb. 2024. arXiv:2401.10417 [cs].
[32] ZHUANG, J., YANG, Z., AND ZHOU, P. AutoMM: Energy-Efficient Multi-Data-Type Matrix Multiply Design on Heterogeneous Programmable System-on-Chip, May 2023. arXiv:2305.18698 [cs].
