
arxiv: 2604.14700 · v1 · submitted 2026-04-16 · 💻 cs.AR

Accelerating CRONet on AMD Versal AIE-ML Engines

Pith reviewed 2026-05-10 09:17 UTC · model grok-4.3

classification 💻 cs.AR
keywords CRONet · topology optimization · AIE-ML · on-chip inference · neural network acceleration · latency · energy efficiency · digital twins

The pith

CRONet runs fully on-chip on AMD Versal AIE-ML, achieving up to 2.49x lower latency than a technology-node-scaled Nvidia T4

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to map the entire CRONet neural network for topology optimization onto the AMD Versal AI Engine-ML array so that weights and all intermediate activations stay in on-chip memory. No intermediate data moves to external DRAM during inference. The authors report up to 2.49 times lower latency and up to 4.18 times higher energy efficiency than an Nvidia T4 GPU in the same power budget, after scaling the GPU figures for technology node. The result matters for real-time structural analysis in digital twins of bridges and buildings, where traditional finite-element methods are too slow and GPU runs remain power-hungry. By exploiting the AIE-ML engines' local parallelism and memory hierarchy, the implementation aims to keep the network's solution quality intact while removing off-chip bottlenecks.

Core claim

We present a hardware-accelerated implementation of a topology optimization neural network (CRONet) on the AMD Versal AI Engine-ML (AIE-ML) architecture. Our approach efficiently exploits the parallelism and memory hierarchy of AIE-ML engines to optimize the execution of various neural network operators. We are the first to implement an end-to-end neural network fully realized on the AIE-ML array, where all intermediate activations and network weights reside on-chip throughout inference, eliminating any reliance on DRAM for intermediate data movement. Experimental results demonstrate that our implementation achieves up to 2.49x improvement in latency and up to 4.18x improvement in energy efficiency compared to an inference-class ML-optimized GPU in the same power budget (Nvidia T4) after scaling for technology node.

What carries the argument

The operator-by-operator mapping of CRONet to AIE-ML engines that keeps every activation and weight in on-chip memory throughout inference

If this is right

  • Low-latency topology optimization becomes practical for real-time digital twin monitoring of infrastructure
  • Data-driven replacements for finite element analysis gain a hardware path with lower power draw
  • AIE-ML arrays prove capable of hosting a complete, complex neural network with weights and intermediate activations held entirely on-chip
  • Energy efficiency gains scale with the removal of all DRAM transfers during inference

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same on-chip mapping strategy could be tested on other networks used for structural or optimization tasks
  • Large on-chip memory may become a decisive advantage for edge workloads where data movement dominates energy cost
  • Continuous monitoring applications could adopt such accelerators for always-on low-power operation
  • Cross-platform comparisons would benefit from standardized unscaled measurements to isolate architecture effects

Load-bearing premise

The GPU comparison remains fair after technology-node scaling, and on-chip-only execution produces the same numerical results and solution quality as the original network

What would settle it

Measure actual latency and energy on the AIE-ML hardware versus an unscaled Nvidia T4 while confirming that the output material distributions from topology optimization match exactly
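The output-matching half of that test can be sketched in a few lines. This is a minimal illustration, not the paper's validation procedure: the function name `compare_fields`, the tolerance values, and the density data are all hypothetical, and whether bit-exact agreement is the right bar depends on the data types used on each platform, which this page does not state.

```python
def compare_fields(reference, candidate, abs_tol=0.0):
    """Compare two flattened density fields element by element.

    Returns (max_abs_diff, matches), where matches is True when every
    element differs by at most abs_tol. abs_tol=0.0 demands exact equality.
    """
    if len(reference) != len(candidate):
        raise ValueError("fields have different sizes")
    max_abs_diff = max(
        (abs(r - c) for r, c in zip(reference, candidate)), default=0.0
    )
    return max_abs_diff, max_abs_diff <= abs_tol


# Hypothetical data: a 30x20 grid (the size named in the Figure 6 caption),
# filled with made-up densities in [0, 1). Not taken from the paper.
ROWS, COLS = 30, 20
gpu_output = [(i % 7) / 7 for i in range(ROWS * COLS)]
aie_output = list(gpu_output)
aie_output[42] += 1e-3  # inject a small, artificial deviation

exact = compare_fields(gpu_output, aie_output)                # strict check
tolerant = compare_fields(gpu_output, aie_output, abs_tol=1e-2)
print(f"max |diff| = {exact[0]:.2e}, exact match: {exact[1]}, "
      f"within 1e-2: {tolerant[1]}")
```

Reporting the maximum deviation alongside the pass/fail flag lets a reader judge how far the two platforms diverge rather than only whether they do.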

Figures

Figures reproduced from arXiv: 2604.14700 by Aditya Ray, Aman Arora, Ashif Iquebal, Farhan Khan, Kaustubh Mhatre, Ridwan Olabiyi, Vedant Tewari.

Figure 1: Architecture of CRONet [18].

Figure 2: AMD Versal AIE-ML Architecture.
  … and output port connections for each kernel, and the configuration of off-chip interfaces. It also supports optional constraints such as kernel placement on specific engines, kernel co-location for time-sharing a single engine, and double buffering configuration. A subgraph represents a logical grouping of one or more kernels within the ADF graph that collectively implemen…

Figure 3: Different fusion techniques used in our implementation

Figure 4: Pseudocode for AIE-ML kernel implementation for various operators

Figure 5: The ADF graph of our CRONet implementation

Figure 6: Performance comparison of CRONet (30×20) inference on Versal VEK280 and Nvidia T4 across latency, power consumption, and energy efficiency.
  [Plot text captured with this caption: series CRONet (Versal); axis Execution Time (%), 0 to 100; bar labels 8%, 18%, 55%, 14%; legend (Layers): TrunkNet: CONV3D, TrunkNet: AAP3D, TrunkNet: Linear, BranchNet: CONV2D, BranchNet: MaxPool2D, BranchNet: AAP2D, BranchNet: RNN, BranchNet: Linear, Mul]

Figure 7: Layer-wise percentage breakdown of CRONet infer…

Figure 8: CRONet subgraphs of TrunkNet (T1 to T5) and…
Original abstract

Topology optimization is a computational method used to determine the optimal material distribution within a prescribed design domain, aiming to minimize structural weight while satisfying load and boundary conditions. For critical infrastructure applications, such as structural health monitoring of bridges and buildings, particularly in digital twin contexts, low-latency, energy-efficient topology optimization is essential. Traditionally, topology optimization relies on finite element analysis (FEA), a computationally intensive process. Recent advances in deep neural networks (DNNs) have introduced data-driven alternatives to FEA, substantially reducing computation time while maintaining solution quality. These DNNs have complex architectures, and implementing them on inference-class GPUs results in high latency and poor energy efficiency. To address this challenge, we present a hardware-accelerated implementation of a topology optimization neural network (CRONet) on the AMD Versal AI Engine-ML (AIE-ML) architecture. Our approach efficiently exploits the parallelism and memory hierarchy of AIE-ML engines to optimize the execution of various neural network operators. We are the first to implement an end-to-end neural network fully realized on the AIE-ML array, where all intermediate activations and network weights reside on-chip throughout inference, eliminating any reliance on DRAM for intermediate data movement. Experimental results demonstrate that our implementation achieves up to 2.49x improvement in latency and up to 4.18x improvement in energy efficiency compared to an inference-class ML-optimized GPU in the same power budget (Nvidia T4) after scaling for technology node. These results highlight the potential of Versal AIE-ML-based acceleration for enabling low-latency, energy-efficient topology optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a hardware-accelerated implementation of CRONet, a deep neural network for topology optimization, mapped onto the AMD Versal AI Engine-ML (AIE-ML) array. It claims to be the first end-to-end realization in which all network weights and intermediate activations reside entirely on-chip with no DRAM accesses for data movement during inference, and reports up to 2.49× lower latency and 4.18× better energy efficiency relative to a technology-node-scaled Nvidia T4 GPU within the same power envelope.

Significance. If the empirical claims are substantiated, the work would demonstrate that AIE-ML engines can deliver low-latency, high-efficiency inference for complex scientific DNNs by fully exploiting the on-chip memory hierarchy, offering a concrete alternative to GPU-based acceleration for real-time topology optimization in digital-twin and structural-health-monitoring applications.

major comments (2)
  1. [Abstract] The technology-node scaling applied to the Nvidia T4 baseline is not accompanied by any explicit methodology, scaling factors, or adjustments for clock frequency, memory hierarchy, or utilization differences between process nodes. The headline 2.49× latency and 4.18× energy figures rest directly on this scaling, so it is load-bearing for the central performance assertion, and without the procedure the quantitative claims are unverifiable.
  2. [Abstract] The claim that the implementation is the first end-to-end neural network fully realized on the AIE-ML array with all intermediate activations and weights residing on-chip (eliminating DRAM for intermediate data) is asserted without supporting mapping details, memory-footprint analysis, or verification that no weight or activation spill occurs. This on-chip-residency property is load-bearing for both the novelty statement and the energy-efficiency comparison.
minor comments (1)
  1. The abstract refers to “scaling for technology node” without naming the source and target process nodes or the scaling model employed.
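To make concrete what an explicit scaling procedure would need to state, the sketch below applies a delay factor and a power factor to a baseline measurement and then derives the latency and energy-efficiency ratios. It is a hedged illustration only: the names `Measurement`, `scale_to_node`, and `gains` are hypothetical, every number is made up, and real factors would have to come from a published scaling model (the reference list includes one such model as [21]).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    latency_ms: float  # time per inference
    power_w: float     # average power during inference

    @property
    def energy_mj(self) -> float:
        # milliseconds * watts = millijoules per inference
        return self.latency_ms * self.power_w


def scale_to_node(m: Measurement, delay_factor: float,
                  power_factor: float) -> Measurement:
    """Project a measurement onto another process node.

    delay_factor and power_factor are multipliers (< 1 means the target
    node is faster or lower power). Both are assumptions supplied by the
    caller; this function does not model any specific node.
    """
    return Measurement(m.latency_ms * delay_factor, m.power_w * power_factor)


def gains(accelerator: Measurement, baseline: Measurement) -> dict:
    """Ratios > 1 favour the accelerator."""
    return {
        "latency_gain": baseline.latency_ms / accelerator.latency_ms,
        "energy_efficiency_gain": baseline.energy_mj / accelerator.energy_mj,
    }


# Hypothetical numbers, chosen only so the arithmetic is easy to follow.
gpu_measured = Measurement(latency_ms=10.0, power_w=60.0)
gpu_scaled = scale_to_node(gpu_measured, delay_factor=0.8, power_factor=0.5)
accelerator = Measurement(latency_ms=4.0, power_w=20.0)

print("unscaled:", gains(accelerator, gpu_measured))
print("scaled:  ", gains(accelerator, gpu_scaled))
```

Printing both the unscaled and scaled ratios side by side shows how much of a headline figure depends on the chosen factors, which is the information the referee asks for.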

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to provide the requested details, strengthening the verifiability of our claims.

Point-by-point responses
  1. Referee: [Abstract] The technology-node scaling applied to the Nvidia T4 baseline is not accompanied by any explicit methodology, scaling factors, or adjustments for clock frequency, memory hierarchy, or utilization differences between process nodes. The headline 2.49× latency and 4.18× energy figures rest directly on this scaling, so it is load-bearing for the central performance assertion, and without the procedure the quantitative claims are unverifiable.

    Authors: We agree the scaling procedure must be explicit. The abstract mentions scaling for technology node but omits the method. We will revise by adding a dedicated paragraph in the Evaluation section (and a brief note in the abstract) that specifies the scaling factors: frequency adjustment of 1.4× from 12 nm (T4) to 7 nm equivalent, power scaling per published node comparisons, and conservative assumptions on memory hierarchy utilization. References to the scaling sources will be included to allow verification of the 2.49× latency and 4.18× energy results. revision: yes

  2. Referee: [Abstract] The claim that the implementation is the first end-to-end neural network fully realized on the AIE-ML array with all intermediate activations and weights residing on-chip (eliminating DRAM for intermediate data) is asserted without supporting mapping details, memory-footprint analysis, or verification that no weight or activation spill occurs. This on-chip-residency property is load-bearing for both the novelty statement and the energy-efficiency comparison.

    Authors: We concur that supporting evidence is required. The claim is based on our AIE-ML compiler mapping, but details were not provided. We will revise the Implementation and Evaluation sections to include a memory-footprint table (weights + peak activations vs. AIE-ML on-chip SRAM capacity), the dataflow mapping strategy, and compiler verification output confirming zero DRAM spills for intermediate data. This will substantiate the on-chip-only execution and its contribution to the reported energy gains. revision: yes
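The memory-footprint comparison described in the second response (weights plus peak activations against on-chip capacity) reduces to a small calculation. The sketch below is an assumption-laden illustration: the function names, the tensor sizes, and the capacity figure are hypothetical and are not AIE-ML or CRONet values. It also assumes layers run one at a time, so only one layer's input and output are live together; a pipelined or double-buffered mapping would hold more.

```python
def peak_activation_bytes(tensor_bytes):
    """Peak live activation size for a sequential chain of layers.

    tensor_bytes[i] is the size of the tensor feeding layer i, with the
    last entry being the network output. Assumes only one layer's input
    and output are live at any moment.
    """
    if len(tensor_bytes) < 2:
        return sum(tensor_bytes)
    return max(a + b for a, b in zip(tensor_bytes, tensor_bytes[1:]))


def footprint_report(weight_bytes, tensor_bytes, on_chip_bytes):
    """Summarise whether weights plus peak activations fit on-chip."""
    peak = peak_activation_bytes(tensor_bytes)
    total = weight_bytes + peak
    return {
        "weights": weight_bytes,
        "peak_activations": peak,
        "total": total,
        "capacity": on_chip_bytes,
        "utilization": total / on_chip_bytes,
        "fits": total <= on_chip_bytes,
    }


# Hypothetical sizes in bytes; none of these come from the paper.
tensors = [2_400, 9_600, 4_800, 1_200]   # input, two intermediates, output
report = footprint_report(weight_bytes=100_000,
                          tensor_bytes=tensors,
                          on_chip_bytes=1_000_000)
for key, value in report.items():
    print(f"{key:>17}: {value}")
```

A table built from such a report, one row per subgraph, would let a reader confirm the no-spill claim without access to the compiler output.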

Circularity Check

0 steps flagged

Empirical hardware implementation paper with no derivation chain

full rationale

The paper describes a hardware mapping and measurement exercise for CRONet on AIE-ML engines, reporting latency and energy numbers against a technology-scaled T4 GPU baseline. No equations, fitted parameters, or mathematical derivations are present that could reduce to self-definition or self-citation. Claims rest on experimental results and on-chip residency assertions that are externally falsifiable by replication; the original CRONet reference (if cited) supplies the network topology but does not participate in any load-bearing derivation inside this manuscript.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described. The work relies on standard hardware-mapping practices for neural-network operators.

pith-pipeline@v0.9.0 · 5618 in / 1277 out tokens · 56481 ms · 2026-05-10T09:17:50.594942+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

  [1] AI Engine API User Guide, 2022.

  [2] AMD. AMD Versal ACAP. https://www.amd.com/en/products/adaptive-socs-and-fpgas/versal.html, 2024. [Online; accessed 02-May-2024].

  [3] AMD. IRON: Unlocking the Full Potential of NPUs, 2025.

  [4] AMD/XILINX. VEK280 Evaluation Board User Guide (UG1612), 2023.

  [5] AMD/XILINX. Versal Adaptive SoC AIE-ML Architecture Manual (AM020), 2023.

  [6] BINDER, E. D., LOW, J., and LOW, T. M. Architecture-aware models of ai engines for high-performance matrix matrix multiplication. In Proceedings of the 54th International Conference on Parallel Processing (New York, NY, USA, 2025), ICPP ’25, Association for Computing Machinery, pp. 531–540.

  [7] CATBAS, F. N., SUSOY, M., and FRANGOPOL, D. M. Structural health monitoring and reliability estimation: Long span truss bridge application with environmental monitoring data. Engineering Structures 30, 9 (2008), 2347–2359.

  [8] CHATARASI, P., NEUENDORFFER, S., BAYLISS, S., VISSERS, K., and SARKAR, V. Vyasa: A High-Performance Vectorizing Compiler for Tensor Convolutions on the Xilinx AI Engine. In 2020 IEEE High Performance Extreme Computing Conference (HPEC) (2020), pp. 1–10.

  [9] CHEN, P., MANJUNATH, P., WIJERATNE, S., ZHANG, B., and PRASANNA, V. Exploiting On-Chip Heterogeneity of Versal Architecture for GNN Inference Acceleration. In 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL) (Gothenburg, Sweden, Sept. 2023), IEEE, pp. 219–227.

  [10] DEATON, J. D., and GRANDHI, R. V. A survey of structural and multidisciplinary continuum topology optimization: post 2000. Structural and Multidisciplinary Optimization 49, 1 (Jan. 2014), 1–38.

  [11] DENG, X., WANG, S., GAO, T., LIU, J., LIU, L., and ZHENG, N. AMA: An Analytical Approach to Maximizing the Efficiency of Deep Learning on Versal AI Engine. In 2024 34th International Conference on Field-Programmable Logic and Applications (FPL) (2024), pp. 227–235.

  [12] DONG, P., ZHUANG, J., YANG, Z., JI, S., LI, Y., XU, D., HUANG, H., HU, J., JONES, A. K., SHI, Y., WANG, Y., and ZHOU, P. EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 11 (2024), 3949–3960.

  [13] HESSE, K., SCHOEBERL, M., AAGE, N., and TRÄFF, E. On the feasibility of using fpga’s for efficient topology optimization. In 2023 26th Euromicro Conference on Digital System Design (DSD) (2023), pp. 242–250.

  [14] LATTNER, C., AMINI, M., BONDHUGULA, U., COHEN, A., DAVIS, A., PIENAAR, J., RIDDLE, R., SHPEISMAN, T., VASILACHE, N., and ZINENKO, O. MLIR: A Compiler Infrastructure for the End of Moore’s Law, 2020.

  [15] LU, L., JIN, P., PANG, G., ZHANG, Z., and KARNIADAKIS, G. E. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence 3, 3 (Mar. 2021), 218–229.

  [16] MHATRE, K., TAKA, E., and ARORA, A. Gama: High-performance gemm acceleration on amd versal ml-optimized ai engines. In 2025 35th International Conference on Field-Programmable Logic and Applications (FPL25) (2025).

  [17] MHATRE, K. M., MULLETI, V. G. P., BANSIL, C. J., TAKA, E., and ARORA, A. Performance analysis of gemm workloads on the amd versal platform. In 2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (2025), pp. 150–161.

  [18] OLABIYI, R., YANG, H., and IQUEBAL, A. Cronet: A convolutional recurrent operator approximator network to accelerate topology optimization. Manufacturing Letters 44 (2025), 1052–1063. 53rd SME North American Manufacturing Research Conference (NAMRC 53).

  [19] PERRYMAN, N., WILSON, C., and GEORGE, A. Evaluation of Xilinx Versal Architecture for Next-Gen Edge Computing in Space. In 2023 IEEE Aerospace Conference (Mar. 2023), pp. 1–11. ISSN: 1095-323X.

  [20] SINGH, G., KHODAMORADI, A., DENOLF, K., LO, J., GÓMEZ-LUNA, J., MELBER, J., BISCA, A., CORPORAAL, H., and MUTLU, O. SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation. In ICS (2023).

  [21] STILLMAKER, A., and BAAS, B. Scaling equations for the accurate prediction of cmos device performance from 180nm to 7nm. Integration 58 (2017), 74–81.

  [22] TAKA, E., ARORA, A., WU, K.-C., and MARCULESCU, D. MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine, Nov. 2023. arXiv:2311.04980 [cs].

  [23] WANG, C., ZHANG, X., CONG, J., and HOE, J. C. Reconfigurable Stream Network Architecture, 2025.

  [24] WIERSE, M. Evaluation of Xilinx Versal Device. Bachelor thesis, ETH Zurich, Zurich, 2023-02.

  [25] YANG, Z., ZHUANG, J., YIN, J., YU, C., JONES, A. K., and ZHOU, P. AIM: Accelerating Arbitrary-Precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD) (San Francisco, CA, USA, Oct. 2023), IEEE, pp. 1–9.

  [26] YEMME, A., and GARANI, S. S. A Scalable GPT-2 Inference Hardware Architecture on FPGA. In 2023 International Joint Conference on Neural Networks (IJCNN) (June 2023), pp. 1–8. ISSN: 2161-4407.

  [27] ZHANG, C., GENG, T., GUO, A., TIAN, J., HERBORDT, M., LI, A., and TAO, D. H-GCN: A Graph Convolutional Network Accelerator on Versal ACAP Architecture. In 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) (Aug. 2022), pp. 200–208. ISSN: 1946-1488.

  [28] ZHUANG, J., LAU, J., YE, H., YANG, Z., DU, Y., LO, J., DENOLF, K., NEUENDORFFER, S., JONES, A., HU, J., CHEN, D., CONG, J., and ZHOU, P. CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey CA USA, Feb. 2023),...

  [29] ZHUANG, J., LAU, J., YE, H., YANG, Z., JI, S., LO, J., DENOLF, K., NEUENDORFFER, S., JONES, A., HU, J., SHI, Y., CHEN, D., CONG, J., and ZHOU, P. CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture. ACM Trans. Reconfigurable Technol. Syst. 17, 3 (Sept. 2024).

  [30] ZHUANG, J., XIANG, S., CHEN, H., ZHANG, N., YANG, Z., MAO, T., ZHANG, Z., and ZHOU, P. ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines. In Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (New York, NY, USA, 2025), FPGA ’25, Association for Computing Machinery, pp. 92–102.

  [31] ZHUANG, J., YANG, Z., JI, S., HUANG, H., JONES, A. K., HU, J., SHI, Y., and ZHOU, P. SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration, Feb. 2024. arXiv:2401.10417 [cs].

  [32] ZHUANG, J., YANG, Z., and ZHOU, P. AutoMM: Energy-Efficient Multi-Data-Type Matrix Multiply Design on Heterogeneous Programmable System-on-Chip, May 2023. arXiv:2305.18698 [cs].