Accelerating CRONet on AMD Versal AIE-ML Engines
Pith reviewed 2026-05-10 09:17 UTC · model grok-4.3
The pith
CRONet runs fully on-chip on AMD Versal AIE-ML, achieving up to 2.49x lower latency than a technology-node-scaled Nvidia T4
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a hardware-accelerated implementation of a topology optimization neural network (CRONet) on the AMD Versal AI Engine-ML (AIE-ML) architecture. Our approach efficiently exploits the parallelism and memory hierarchy of AIE-ML engines to optimize the execution of various neural network operators. We are the first to implement an end-to-end neural network fully realized on the AIE-ML array, where all intermediate activations and network weights reside on-chip throughout inference, eliminating any reliance on DRAM for intermediate data movement. Experimental results demonstrate that our implementation achieves up to 2.49x improvement in latency and up to 4.18x improvement in energy efficiency compared to an inference-class ML-optimized GPU in the same power budget (Nvidia T4) after scaling for technology node.
What carries the argument
The operator-by-operator mapping of CRONet to AIE-ML engines that keeps every activation and weight in on-chip memory throughout inference
If this is right
- Low-latency topology optimization becomes practical for real-time digital twin monitoring of infrastructure
- Data-driven replacements for finite element analysis gain a hardware path with lower power draw
- AIE-ML arrays prove capable of hosting complete complex neural networks without external memory
- Energy efficiency gains scale with the removal of all DRAM transfers during inference
Where Pith is reading between the lines
- The same on-chip mapping strategy could be tested on other networks used for structural or optimization tasks
- Large on-chip memory may become a decisive advantage for edge workloads where data movement dominates energy cost
- Continuous monitoring applications could adopt such accelerators for always-on low-power operation
- Cross-platform comparisons would benefit from standardized unscaled measurements to isolate architecture effects
Load-bearing premise
The GPU comparison remains fair after technology-node scaling, and the on-chip-only execution produces the same numerical results and solution quality as the original network
What would settle it
Measure actual latency and energy on the AIE-ML hardware versus an unscaled Nvidia T4 while confirming that the output material distributions from topology optimization match exactly
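The output-matching half of that test can be sketched as follows. This is a minimal illustration, not the paper's validation procedure: the function names, the flattened density fields, and the tolerance values are all hypothetical, and a real check would compare full material-distribution outputs from both platforms.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two density fields."""
    if len(reference) != len(candidate):
        raise ValueError("density fields must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def outputs_match(reference: Sequence[float], candidate: Sequence[float],
                  tol: float = 0.0) -> bool:
    """True when every element agrees within tol (tol=0.0 means an exact match)."""
    return max_abs_diff(reference, candidate) <= tol


# Hypothetical flattened material-density outputs (values are illustrative only).
gpu_density = [0.0, 0.25, 0.5, 1.0]
aie_density = [0.0, 0.25, 0.5, 1.0]
aie_quantized = [0.0, 0.25, 0.5625, 1.0]  # one element perturbed by 0.0625

print(outputs_match(gpu_density, aie_density))               # True
print(outputs_match(gpu_density, aie_quantized))             # False
print(outputs_match(gpu_density, aie_quantized, tol=0.125))  # True
```

An exact-match requirement (tol=0.0) is the strict reading of the criterion above; if the on-chip implementation uses a different numeric precision, a stated tolerance would have to be justified separately.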
Original abstract
Topology optimization is a computational method used to determine the optimal material distribution within a prescribed design domain, aiming to minimize structural weight while satisfying load and boundary conditions. For critical infrastructure applications, such as structural health monitoring of bridges and buildings, particularly in digital twin contexts, low-latency, energy-efficient topology optimization is essential. Traditionally, topology optimization relies on finite element analysis (FEA), a computationally intensive process. Recent advances in deep neural networks (DNNs) have introduced data-driven alternatives to FEA, substantially reducing computation time while maintaining solution quality. These DNNs have complex architectures, and implementing them on inference-class GPUs results in high latency and poor energy efficiency. To address this challenge, we present a hardware-accelerated implementation of a topology optimization neural network (CRONet) on the AMD Versal AI Engine-ML (AIE-ML) architecture. Our approach efficiently exploits the parallelism and memory hierarchy of AIE-ML engines to optimize the execution of various neural network operators. We are the first to implement an end-to-end neural network fully realized on the AIE-ML array, where all intermediate activations and network weights reside on-chip throughout inference, eliminating any reliance on DRAM for intermediate data movement. Experimental results demonstrate that our implementation achieves up to 2.49x improvement in latency and up to 4.18x improvement in energy efficiency compared to an inference-class ML-optimized GPU in the same power budget (Nvidia T4) after scaling for technology node. These results highlight the potential of Versal AIE-ML-based acceleration for enabling low-latency, energy-efficient topology optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a hardware-accelerated implementation of CRONet, a deep neural network for topology optimization, mapped onto the AMD Versal AI Engine-ML (AIE-ML) array. It claims to be the first end-to-end realization in which all network weights and intermediate activations reside entirely on-chip with no DRAM accesses for data movement during inference, and reports up to 2.49× lower latency and 4.18× better energy efficiency relative to a technology-node-scaled Nvidia T4 GPU within the same power envelope.
Significance. If the empirical claims are substantiated, the work would demonstrate that AIE-ML engines can deliver low-latency, high-efficiency inference for complex scientific DNNs by fully exploiting the on-chip memory hierarchy, offering a concrete alternative to GPU-based acceleration for real-time topology optimization in digital-twin and structural-health-monitoring applications.
major comments (2)
- [Abstract] Abstract: the technology-node scaling applied to the Nvidia T4 baseline is not accompanied by any explicit methodology, scaling factors, or adjustments for clock frequency, memory hierarchy, or utilization differences between process nodes. Because the headline 2.49× latency and 4.18× energy figures rest directly on this scaling, the absence of the procedure makes the quantitative claims unverifiable and load-bearing for the central performance assertion.
- [Abstract] Abstract: the claim that the implementation is the first end-to-end neural network fully realized on the AIE-ML array with all intermediate activations and weights residing on-chip (eliminating DRAM for intermediate data) is asserted without supporting mapping details, memory-footprint analysis, or verification that no weight or activation spill occurs. This on-chip-residency property is load-bearing for both the novelty statement and the energy-efficiency comparison.
minor comments (1)
- The abstract refers to “scaling for technology node” without naming the source and target process nodes or the scaling model employed.
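To make concrete why the scaling procedure matters, here is a minimal sketch of one simple way a node-scaled latency comparison could be computed. It assumes latency shrinks in direct proportion to a single clock-frequency gain, which is a simplification and not the paper's documented method. The 1.4× factor is the one quoted in the simulated rebuttal on this page; the measured latencies and all function names are hypothetical.

```python
def scale_latency(measured_latency_ms: float, frequency_gain: float) -> float:
    """Project a measured latency onto a newer process node.

    Assumption: latency is inversely proportional to clock frequency and
    nothing else (memory hierarchy and utilization effects are ignored).
    """
    if frequency_gain <= 0:
        raise ValueError("frequency_gain must be positive")
    return measured_latency_ms / frequency_gain


def speedup(baseline_latency_ms: float, accelerator_latency_ms: float) -> float:
    """Ratio of baseline latency to accelerator latency (greater than 1 favors the accelerator)."""
    return baseline_latency_ms / accelerator_latency_ms


# Hypothetical measurements; these are not figures from the paper.
gpu_measured_ms = 14.0
aie_measured_ms = 4.0

gpu_scaled_ms = scale_latency(gpu_measured_ms, frequency_gain=1.4)

print("unscaled speedup:", speedup(gpu_measured_ms, aie_measured_ms))
print("scaled speedup:  ", speedup(gpu_scaled_ms, aie_measured_ms))
```

Under these made-up inputs the reported advantage drops from roughly 3.5× to roughly 2.5× once the baseline is scaled, which shows how directly the headline ratio depends on the chosen factor.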
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to provide the requested details, strengthening the verifiability of our claims.
- Referee: [Abstract] Abstract: the technology-node scaling applied to the Nvidia T4 baseline is not accompanied by any explicit methodology, scaling factors, or adjustments for clock frequency, memory hierarchy, or utilization differences between process nodes. Because the headline 2.49× latency and 4.18× energy figures rest directly on this scaling, the absence of the procedure makes the quantitative claims unverifiable and load-bearing for the central performance assertion.
  Authors: We agree that the scaling procedure must be explicit. The abstract mentions scaling for technology node but omits the method. We will add a dedicated paragraph to the Evaluation section (and a brief note to the abstract) that specifies the scaling factors: a frequency adjustment of 1.4× from 12 nm (T4) to a 7 nm equivalent, power scaling per published node comparisons, and conservative assumptions on memory hierarchy utilization. References to the scaling sources will be included so that the 2.49× latency and 4.18× energy results can be verified.
  Revision: yes
- Referee: [Abstract] Abstract: the claim that the implementation is the first end-to-end neural network fully realized on the AIE-ML array with all intermediate activations and weights residing on-chip (eliminating DRAM for intermediate data) is asserted without supporting mapping details, memory-footprint analysis, or verification that no weight or activation spill occurs. This on-chip-residency property is load-bearing for both the novelty statement and the energy-efficiency comparison.
  Authors: We concur that supporting evidence is required. The claim is based on our AIE-ML compiler mapping, but the details were not provided. We will revise the Implementation and Evaluation sections to include a memory-footprint table (weights + peak activations vs. AIE-ML on-chip SRAM capacity), the dataflow mapping strategy, and compiler verification output confirming zero DRAM spills for intermediate data. This will substantiate the on-chip-only execution and its contribution to the reported energy gains.
  Revision: yes
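The memory-footprint comparison described in the second response (weights + peak activations vs. on-chip capacity) can be sketched as below. This is an illustrative model only: it assumes all weights stay resident and that layers run one at a time, so the peak activation footprint is the largest input-plus-output pair. The layer sizes, the capacity values, and every name in the code are hypothetical and do not describe CRONet or the actual AIE-ML memory capacity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Layer:
    name: str
    weight_bytes: int
    input_bytes: int
    output_bytes: int


def footprint_bytes(layers: Sequence[Layer]) -> int:
    """All weights resident, plus the largest single-layer input+output buffer pair."""
    total_weights = sum(layer.weight_bytes for layer in layers)
    peak_activations = max(
        (layer.input_bytes + layer.output_bytes for layer in layers), default=0
    )
    return total_weights + peak_activations


def fits_on_chip(layers: Sequence[Layer], capacity_bytes: int) -> bool:
    """True when the modeled footprint does not exceed the given on-chip capacity."""
    return footprint_bytes(layers) <= capacity_bytes


# Hypothetical network; sizes are illustrative and not taken from the paper.
layers = [
    Layer("conv1", weight_bytes=4096, input_bytes=8192, output_bytes=16384),
    Layer("conv2", weight_bytes=8192, input_bytes=16384, output_bytes=16384),
    Layer("fc", weight_bytes=2048, input_bytes=16384, output_bytes=1024),
]

print(footprint_bytes(layers))      # 47104 = 14336 (weights) + 32768 (peak activations)
print(fits_on_chip(layers, 65536))  # True for a hypothetical 64 KiB capacity
print(fits_on_chip(layers, 32768))  # False for a hypothetical 32 KiB capacity
```

A pipelined or spatially mapped design keeps more buffers live at once, so a real footprint table would need to follow the actual dataflow mapping rather than this sequential model.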
Circularity Check
Empirical hardware implementation paper with no derivation chain
The paper describes a hardware mapping and measurement exercise for CRONet on AIE-ML engines, reporting latency and energy numbers against a technology-scaled T4 GPU baseline. No equations, fitted parameters, or mathematical derivations are present that could reduce to self-definition or self-citation. Claims rest on experimental results and on-chip residency assertions that are externally falsifiable by replication; the original CRONet reference (if cited) supplies the network topology but does not participate in any load-bearing derivation inside this manuscript.
Reference graph
Works this paper leans on
- [1] AI Engine API User Guide, 2022.
- [2] AMD. AMD Versal ACAP. https://www.amd.com/en/products/adaptive-socs-and-fpgas/versal.html, 2024. [Online; accessed 02-May-2024].
- [3] AMD. IRON: Unlocking the Full Potential of NPUs, 2025.
- [4] AMD/XILINX. VEK280 Evaluation Board User Guide (UG1612), 2023.
- [5] AMD/XILINX. Versal Adaptive SoC AIE-ML Architecture Manual (AM020), 2023.
- [6] BINDER, E. D., LOW, J., AND LOW, T. M. Architecture-aware models of ai engines for high-performance matrix matrix multiplication. In Proceedings of the 54th International Conference on Parallel Processing (New York, NY, USA, 2025), ICPP ’25, Association for Computing Machinery, pp. 531–540.
- [7] CATBAS, F. N., SUSOY, M., AND FRANGOPOL, D. M. Structural health monitoring and reliability estimation: Long span truss bridge application with environmental monitoring data. Engineering Structures 30, 9 (2008), 2347–2359.
- [8] CHATARASI, P., NEUENDORFFER, S., BAYLISS, S., VISSERS, K., AND SARKAR, V. Vyasa: A High-Performance Vectorizing Compiler for Tensor Convolutions on the Xilinx AI Engine. In 2020 IEEE High Performance Extreme Computing Conference (HPEC) (2020), pp. 1–10.
- [9] CHEN, P., MANJUNATH, P., WIJERATNE, S., ZHANG, B., AND PRASANNA, V. Exploiting On-Chip Heterogeneity of Versal Architecture for GNN Inference Acceleration. In 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL) (Gothenburg, Sweden, Sept. 2023), IEEE, pp. 219–227.
- [10] DEATON, J. D., AND GRANDHI, R. V. A survey of structural and multidisciplinary continuum topology optimization: post 2000. Structural and Multidisciplinary Optimization 49, 1 (Jan. 2014), 1–38.
- [11] DENG, X., WANG, S., GAO, T., LIU, J., LIU, L., AND ZHENG, N. AMA: An Analytical Approach to Maximizing the Efficiency of Deep Learning on Versal AI Engine. In 2024 34th International Conference on Field-Programmable Logic and Applications (FPL) (2024), pp. 227–235.
- [12] DONG, P., ZHUANG, J., YANG, Z., JI, S., LI, Y., XU, D., HUANG, H., HU, J., JONES, A. K., SHI, Y., WANG, Y., AND ZHOU, P. EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 11 (2024), 3949–3960.
- [13] HESSE, K., SCHOEBERL, M., AAGE, N., AND TRÄFF, E. On the feasibility of using fpga’s for efficient topology optimization. In 2023 26th Euromicro Conference on Digital System Design (DSD) (2023), pp. 242–250.
- [14] LATTNER, C., AMINI, M., BONDHUGULA, U., COHEN, A., DAVIS, A., PIENAAR, J., RIDDLE, R., SHPEISMAN, T., VASILACHE, N., AND ZINENKO, O. MLIR: A Compiler Infrastructure for the End of Moore’s Law, 2020.
- [15] LU, L., JIN, P., PANG, G., ZHANG, Z., AND KARNIADAKIS, G. E. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence 3, 3 (Mar. 2021), 218–229.
- [16] MHATRE, K., TAKA, E., AND ARORA, A. Gama: High-performance gemm acceleration on amd versal ml-optimized ai engines. In 2025 35th International Conference on Field-Programmable Logic and Applications (FPL25) (2025).
- [17] MHATRE, K. M., MULLETI, V. G. P., BANSIL, C. J., TAKA, E., AND ARORA, A. Performance analysis of gemm workloads on the amd versal platform. In 2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (2025), pp. 150–161.
- [18] OLABIYI, R., YANG, H., AND IQUEBAL, A. Cronet: A convolutional recurrent operator approximator network to accelerate topology optimization. Manufacturing Letters 44 (2025), 1052–1063. 53rd SME North American Manufacturing Research Conference (NAMRC 53).
- [19] PERRYMAN, N., WILSON, C., AND GEORGE, A. Evaluation of Xilinx Versal Architecture for Next-Gen Edge Computing in Space. In 2023 IEEE Aerospace Conference (Mar. 2023), pp. 1–11. ISSN: 1095-323X.
- [20] SINGH, G., KHODAMORADI, A., DENOLF, K., LO, J., GÓMEZ-LUNA, J., MELBER, J., BISCA, A., CORPORAAL, H., AND MUTLU, O. SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation. In ICS (2023).
- [21] STILLMAKER, A., AND BAAS, B. Scaling equations for the accurate prediction of cmos device performance from 180nm to 7nm. Integration 58 (2017), 74–81.
- [22] TAKA, E., ARORA, A., WU, K.-C., AND MARCULESCU, D. MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine, Nov. 2023. arXiv:2311.04980 [cs].
- [23] WANG, C., ZHANG, X., CONG, J., AND HOE, J. C. Reconfigurable Stream Network Architecture, 2025.
- [24] WIERSE, M. Evaluation of Xilinx Versal Device. Bachelor thesis, ETH Zurich, Zurich, 2023-02.
- [25] YANG, Z., ZHUANG, J., YIN, J., YU, C., JONES, A. K., AND ZHOU, P. AIM: Accelerating Arbitrary-Precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD) (San Francisco, CA, USA, Oct. 2023), IEEE, pp. 1–9.
- [26] YEMME, A., AND GARANI, S. S. A Scalable GPT-2 Inference Hardware Architecture on FPGA. In 2023 International Joint Conference on Neural Networks (IJCNN) (June 2023), pp. 1–8. ISSN: 2161-4407.
- [27] ZHANG, C., GENG, T., GUO, A., TIAN, J., HERBORDT, M., LI, A., AND TAO, D. H-GCN: A Graph Convolutional Network Accelerator on Versal ACAP Architecture. In 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) (Aug. 2022), pp. 200–208. ISSN: 1946-1488.
- [28] ZHUANG, J., LAU, J., YE, H., YANG, Z., DU, Y., LO, J., DENOLF, K., NEUENDORFFER, S., JONES, A., HU, J., CHEN, D., CONG, J., AND ZHOU, P. CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey CA USA, Feb. 2023),...
- [29] ZHUANG, J., LAU, J., YE, H., YANG, Z., JI, S., LO, J., DENOLF, K., NEUENDORFFER, S., JONES, A., HU, J., SHI, Y., CHEN, D., CONG, J., AND ZHOU, P. CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture. ACM Trans. Reconfigurable Technol. Syst. 17, 3 (Sept. 2024).
- [30] ZHUANG, J., XIANG, S., CHEN, H., ZHANG, N., YANG, Z., MAO, T., ZHANG, Z., AND ZHOU, P. ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines. In Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (New York, NY, USA, 2025), FPGA ’25, Association for Computing Machinery, pp. 92–102.
- [31] ZHUANG, J., YANG, Z., JI, S., HUANG, H., JONES, A. K., HU, J., SHI, Y., AND ZHOU, P. SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration, Feb. 2024. arXiv:2401.10417 [cs].
- [32] ZHUANG, J., YANG, Z., AND ZHOU, P. AutoMM: Energy-Efficient Multi-Data-Type Matrix Multiply Design on Heterogeneous Programmable System-on-Chip, May 2023. arXiv:2305.18698 [cs].