DORA: Dataflow-Instruction Orchestration Architecture for DNN Acceleration

Jinming Zhuang; Peipei Zhou; Sarah Schultz; Shixin Ji; Weisong Shi; Xingzhen Chen; Zheng Dong; Zhuoping Yang

arxiv: 2605.23833 · v1 · pith:V4TTPY75new · submitted 2026-05-22 · 💻 cs.AR

DORA: Dataflow-Instruction Orchestration Architecture for DNN Acceleration

Xingzhen Chen , Zhuoping Yang , Jinming Zhuang , Shixin Ji , Sarah Schultz , Zheng Dong , Weisong Shi , Peipei Zhou This is my paper

Pith reviewed 2026-05-25 02:14 UTC · model grok-4.3

classification 💻 cs.AR

keywords DNN accelerationdataflow architectureinstruction set architectureoverlay architecturereconfigurable computingcompilation frameworkhardware efficiencyperformance optimization

0 comments

The pith

DORA uses a custom ISA to explicitly orchestrate dataflow at the layer level on DNN accelerators, sustaining stable efficiency across workloads that differ by up to 6× in operation counts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern DNN models vary widely in operations, tensor shapes, and dependencies, causing generic accelerators to lose efficiency. DORA addresses this by proposing an instruction-based overlay architecture whose ISA describes data movement, computation, and synchronization explicitly. A two-stage compilation framework with MILP and heuristic search engines produces schedules for given workloads, while on-chip memory and parallelism mechanisms support flexibility. On an AMD Versal prototype, the design keeps efficiency variation below 5% on a single vector processor and reaches up to 5× higher throughput than prior accelerators.

Core claim

DORA maintains stable efficiency, with less than 5% variation on a single vector processor across workloads exhibiting up to 6× variation in operation counts. Compared to state-of-the-art accelerators, DORA consistently achieves higher performance, delivering up to 5× throughput improvement. The heuristic-based scheduler further achieves up to 90% optimality under practical time constraints.

What carries the argument

An instruction-based overlay architecture whose proposed ISA explicitly encodes dataflow, combined with on-chip memory management, computation parallelism management, and a two-stage compilation framework that uses MILP-based and heuristic search engines to generate layer-level schedules.

If this is right

The architecture sustains high hardware efficiency on diverse and complex DNN models without incurring large overhead per workload.
DORA can be deployed directly on existing reconfigurable platforms such as the AMD Versal VCK190.
The heuristic scheduler delivers schedules within 90% of MILP optimality under realistic compile-time limits.
Fine-grained ISA control of data movement and synchronization enables consistent performance across layers with differing characteristics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Open-sourcing the framework and scheduler allows independent verification on additional platforms or models.
The same instruction-orchestration approach could be adapted to fixed-function ASICs to reduce the cost of supporting model diversity.
Stable efficiency across operation-count variation suggests the design may reduce the need for per-model hardware specialization.
The two-stage design-space exploration could be extended to include power or area constraints not emphasized in the current evaluation.

Load-bearing premise

The DNN workloads and operation-count variations used in the experiments represent the full range of real-world models that future users will deploy.

What would settle it

Running DORA on a new collection of DNN models whose operation counts vary by more than 6× or whose tensor shapes and dependencies fall outside the tested set, then checking whether efficiency variation on a single vector processor exceeds 5%.

Figures

Figures reproduced from arXiv: 2605.23833 by Jinming Zhuang, Peipei Zhou, Sarah Schultz, Shixin Ji, Weisong Shi, Xingzhen Chen, Zheng Dong, Zhuoping Yang.

**Figure 5.** Figure 5: Synchronization mechanism in MIU. separate on-chip buffers need to be allocated. To reduce the required size of on-chip buffers, the on-chip buffer can be tailored to a single operand and reused for the other one at the cost of extra padding, as shown in [PITH_FULL_IMAGE:figures/full_fig_p003_5.png] view at source ↗

**Figure 3.** Figure 3: Flexible on-chip memory resource management. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Flexible parallelism management. Sync Unit Load Unit Store Unit To LMUs To DRAM IDU Instr stream Ready List Layer 0 1 2 ... State 1 0 0 ... Iter start Update Ready List Ready stream Instr ready Issue instr Load next instr Iter start Send ready signal Load instr Store Iter start Load instr Load Direct access Instr control path Stream datapath [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 6.** Figure 6: DORA framework overview. in a modular fashion, and users can generate customized DORA architecture using a template-based design approach, where users only need to specify the number of different function units, such as MMU, LMU, etc., according to the application requirements and resource constraints. To accommodate the rapid evolution of non-linear functions in modern DNN models, DORA supports the integr… view at source ↗

**Figure 7.** Figure 7: MILP formulation. #ReqSFU, DORA explores the runtime parameter space and selects the optimal configuration to populate the candidate execution table. Based on the discussion in Section 3, DORA supports multiple levels of flexible tile sizes by adjusting computation parallelism. Specifically, each vector processor computes a tiled MM of aie_m × aie_k × aie_n. Within one MMU, the vector processor array follo… view at source ↗

**Figure 8.** Figure 8: DORA compilation and runtime behaviors. shown in [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗

**Figure 10.** Figure 10: Single AIE efficiency under #operations variation. 0 500 1000 1500 2000 2500 3000 3500 4000 4500 MLP-L MLP-S NCF BERT384 BERT256 BERT128 BERT64 BERT32 DeiT PointNet-LPointNet-S DORA (FP, FM) DORA (FP) RSN CHARM 2.0 (CHARM, AutoMM, ARIES) 1.14x 1.39x 1.42x 1.49x 1.89x 1.6x 2.8x 2.8x 4.5x 5.0x Throughout/GFLOPS 2.2x [PITH_FULL_IMAGE:figures/full_fig_p007_10.png] view at source ↗

**Figure 11.** Figure 11: End-to-end performance and gains. FP: flexible [PITH_FULL_IMAGE:figures/full_fig_p007_11.png] view at source ↗

**Figure 12.** Figure 12: DSE acceleration options evaluation. and small gains. However, NCF consists of diverse MM shapes even with 3072 × 32 × 1, introducing much imbalance between operands, which can provide optimization opportunities for flexible LMU functionality to achieve the maximum data reuse. BERT-32 is a tiny model with small MM layer shapes, and DORA can configure each LMU to match with operands in a fine-grained fashi… view at source ↗

read the original abstract

As deep neural networks develop significantly more diverse and complex, achieving high performance and efficiency on complicated DNN models faces pressing challenges. Modern DNN workloads are increasingly diverse in operation types, tensor shapes, and execution dependencies, making it difficult to sustain high hardware efficiency across models. In addition, a generic accelerator often incurs substantial overhead when executing diverse workloads. To address these problems, we propose DORA, an instruction-based overlay architecture that explicitly describes dataflow via a proposed ISA, enabling fine-grained control of data movement, computation, and synchronization at the layer level. To support flexibility while achieving high performance, DORA adopts a novel on-chip memory management and computation parallelism management mechanism. DORA proposes a compilation framework that can generate instructions for given DNN workloads after a two-stage design space exploration. DORA framework also incorporates a MILP-based and a heuristic-based search engine to generate the schedule solution for different needs and constraints. We prototype DORA on the AMD Versal VCK190 platform, demonstrating its deployability on existing reconfigurable systems. Experimental results show that DORA maintains stable efficiency, with less than 5\% variation on a single vector processor across workloads exhibiting up to 6$\times$ variation in operation counts. Compared to state-of-the-art accelerators, DORA consistently achieves higher performance, delivering up to 5$\times$ throughput improvement. The heuristic-based scheduler further achieves up to 90\% optimality under practical time constraints. DORA is open-sourced at https://github.com/arc-research-lab/DORA.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes DORA, an instruction-based overlay architecture for DNN acceleration on reconfigurable hardware. It defines a custom ISA to explicitly orchestrate dataflow at the layer level, introduces on-chip memory management and computation parallelism mechanisms, and presents a two-stage compilation framework using MILP-based and heuristic search engines for scheduling. Prototyped on the AMD Versal VCK190, the work reports stable efficiency (<5% variation) on a single vector processor across workloads with up to 6× variation in operation counts, up to 5× throughput improvement versus state-of-the-art accelerators, and up to 90% optimality for the heuristic scheduler. The implementation is open-sourced.

Significance. If the stability and performance claims hold under representative workload diversity, the result would be significant for the field of flexible DNN accelerators. It offers a concrete path to combine the adaptability of instruction-driven designs with high efficiency on existing reconfigurable platforms, directly targeting the growing diversity of DNN models. The physical prototype and open-source release provide additional value by enabling reproducibility and deployment studies.

major comments (2)

[Abstract] Abstract: The central efficiency-stability claim (<5% variation across workloads with 6× operation-count variation) and the 5× throughput improvement are presented without any description of workload selection criteria, measurement methodology, statistical significance, or the precise set of operation types, tensor shapes, and dependencies tested. This absence directly affects the ability to evaluate whether the results address the diversity challenges stated in the introduction.
[Introduction and Experimental Results] Introduction and Experimental Results: The motivation explicitly identifies diversity in operation types, tensor shapes, and execution dependencies as the core challenge, yet the reported experiments are described only in terms of operation-count variation. No evidence is supplied that the evaluated workloads differ along the other stated axes; if the workloads are structurally similar, the stability result does not substantiate the generalization to “complicated DNN models.”

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying the experimental methodology and workload characteristics. We agree that additional details are needed to substantiate the claims regarding workload diversity and will revise the manuscript to address both major comments.

read point-by-point responses

Referee: [Abstract] Abstract: The central efficiency-stability claim (<5% variation across workloads with 6× operation-count variation) and the 5× throughput improvement are presented without any description of workload selection criteria, measurement methodology, statistical significance, or the precise set of operation types, tensor shapes, and dependencies tested. This absence directly affects the ability to evaluate whether the results address the diversity challenges stated in the introduction.

Authors: We agree that the abstract is too concise on these points. In the revised version, we will expand the abstract to include a brief description of the workload selection criteria (standard models from MLPerf and common DNN benchmarks), measurement methodology (on-board execution on AMD Versal VCK190 with cycle-accurate timing via the integrated logic analyzer), and note the tested operation types (convolutions, matrix multiplications, activations, and reductions), tensor shapes (varying from 1×1 to 224×224 inputs with channel depths 64–2048), and dependencies (sequential, residual, and attention-based). Statistical significance is established via 10 repeated runs per workload with reported mean and standard deviation; these details will be summarized concisely in the abstract while retaining the core claims. revision: yes
Referee: [Introduction and Experimental Results] Introduction and Experimental Results: The motivation explicitly identifies diversity in operation types, tensor shapes, and execution dependencies as the core challenge, yet the reported experiments are described only in terms of operation-count variation. No evidence is supplied that the evaluated workloads differ along the other stated axes; if the workloads are structurally similar, the stability result does not substantiate the generalization to “complicated DNN models.”

Authors: The workloads used in the experiments are drawn from representative DNN models (ResNet-50, MobileNet-V2, BERT-base, and a custom attention-based model) that inherently differ in operation types, tensor shapes, and execution dependencies, as described in Section 5.1 of the manuscript. However, we acknowledge that the current presentation emphasizes operation-count variation and does not explicitly quantify the other dimensions. In the revision, we will add a dedicated table (new Table 2) that reports per-workload metrics for all three axes—operation-type diversity (unique op counts), tensor-shape variation (min/max dimensions and channel counts), and dependency graphs (number of parallel vs. sequential layers)—to demonstrate that the workloads are not structurally similar. This will directly support the generalization claim. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper proposes an instruction-based overlay architecture (DORA) with a custom ISA, on-chip memory management, compilation framework, and MILP/heuristic schedulers, then reports measured results from a physical prototype on AMD Versal VCK190. No equations, first-principles derivations, or 'predictions' appear in the provided text; efficiency and throughput claims rest on direct hardware measurements across tested workloads rather than any fitted parameter renamed as output or self-citation chain. The central claims are therefore self-contained against external benchmarks and receive score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unstated premise that the tested workloads capture the diversity that matters in practice and that FPGA prototype results translate to the claimed efficiency gains; no free parameters or invented physical entities are introduced in the abstract.

axioms (1)

domain assumption DNN workloads exhibit up to 6x variation in operation counts while remaining representative of production models.
Invoked to support the stability claim; location is the experimental results paragraph in the abstract.

pith-pipeline@v0.9.0 · 5827 in / 1226 out tokens · 23048 ms · 2026-05-25T02:14:33.881268+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 28 canonical work pages · 2 internal anchors

[1]

Mohamed S Abdelfattah et al. 2018. DLA: Compiler and FPGA overlay for neural network inference acceleration. InFPL. IEEE

2018
[2]

AMD 2023.Vitis AI User Guide. AMD. https://docs.amd.com/r/en-US/ug1414- vitis-ai

2023
[3]

AMD/Xilinx. 2021. Versal Adaptive Compute Acceleration Platform. https: //www.xilinx.com/products/silicon-devices/acap/versal.html

2021
[4]

2023.AI Engine API and Intrinsics User Guide

AMD/Xilinx. 2023.AI Engine API and Intrinsics User Guide

2023
[5]

2023.Versal ACAP AI Engine System C Simulator

AMD/Xilinx. 2023.Versal ACAP AI Engine System C Simulator

2023
[6]

Autoware Foundation. [n. d.]. Autoware - the world’s leading open-source soft- ware project for autonomous driving. https://github.com/autowarefoundation/ autoware
[7]

Alan Tendler Leibel Bacellar et al. 2024. Differentiable Weightless Neural Net- works. InICML. 2277–2295. https://proceedings.mlr.press/v235/bacellar24a.html GLSVLSI ’26, June 22–24, 2026, Canandaigua, NY, USA Xingzhen Chen, Zhuoping Yang, Jinming Zhuang, Shixin Ji, Sarah Schultz, Zheng Dong, Weisong Shi, and Peipei Zhou

2024
[8]

Mohammed S Bensaleh et al . 2018. Optimal task scheduling for distributed cluster with active storage devices and accelerated nodes.IEEE Access6 (2018), 48195–48209

2018
[9]

Julian Blank et al. [n. d.]. Pymoo: Multi-Objective Optimization in Python.IEEE Access([n. d.]). doi:10.1109/ACCESS.2020.2990567

work page doi:10.1109/access.2020.2990567 2020
[10]

Mohamed Bouaziz et al. [n. d.]. A Dataflow Overlay for Monte Carlo Multi-Asset Option Pricing on AMD Versal AI Engines. InISC High Performance 2025 Research Paper Proceedings. doi:10.23919/ISC.2025.11020612

work page doi:10.23919/isc.2025.11020612 2025
[11]

Andrew Boutros et al. 2020. Beyond Peak Performance: Comparing the Real Performance of AI-Optimized FPGAs and GPUs. InICFPT. 10–19. doi:10.1109/ ICFPT51103.2020.00011

work page internal anchor Pith review Pith/arXiv arXiv 2020
[12]

Andrew Boutros et al . 2020. Beyond peak performance: Comparing the real performance of AI-optimized FPGAs and GPUs. InICFPT. IEEE, 10–19

2020
[13]

Jingwei Cai et al. 2023. Inter-layer scheduling space definition and exploration for tiled accelerators. InISCA. 1–17

2023
[14]

Hongzheng Chen et al. 2024. Understanding the potential of fpga-based spatial acceleration for large language model inference.ACM TRETS18, 1 (2024), 1–29

2024
[15]

Hongzheng Chen et al. 2024. Allo: A programming model for composable accel- erator design.Proceedings of the ACM on Programming Languages8, PLDI (2024), 593–620

2024
[16]

Dimitrios Danopoulos et al . 2025. AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines.arXiv (2025)

2025
[17]

Xiaodong Deng et al. 2024. AMA: An Analytical Approach to Maximizing the Efficiency of Deep Learning on Versal AI Engine. InFPL. 227–235. doi:10.1109/ FPL64840.2024.00039

work page arXiv 2024
[18]

Jacob Devlin et al. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[19]

Peiyan Dong et al. 2024. EQ-ViT: Algorithm-Hardware Co-Design for End-to- End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture.TCAD(2024). doi:10.1109/TCAD.2024.3443692

work page doi:10.1109/tcad.2024.3443692 2024
[20]

Mario Doumet et al. 2024. H2PIPE: High throughput CNN inference on FPGAs with high-bandwidth memory. In2024 FPL. IEEE, 69–77

2024
[21]

Jeremy Fowers et al. 2018. A configurable cloud-scale DNN processor for real-time AI. InISCA. IEEE, 1–14

2018
[22]

Jeremy Fowers et al . 2018. A Configurable Cloud-Scale DNN Processor for Real-Time AI. InISCA. 1–14. doi:10.1109/ISCA.2018.00012

work page doi:10.1109/isca.2018.00012 2018
[23]

Paolo Salvatore Galfano et al . 2024. Co-Designing a 3D Transformation Ac- celerator for Versal-Based Image Registration. InICCD. 219–222. doi:10.1109/ ICCD63220.2024.00041

work page arXiv 2024
[24]

Nan Guan et al. [n. d.]. Industry Challenge
[25]

Zibo Guo et al. 2024. An overlay accelerator of DeepLab CNN for spacecraft image segmentation on FPGA.Remote Sensing16, 5 (2024), 894

2024
[26]

Mathew Hall et al. 2020. HPIPE: Heterogeneous layer-pipelined and sparse-aware CNN inference for FPGAs.arXiv preprint arXiv:2007.10451(2020)

work page arXiv 2020
[27]

Xiangnan He et al. 2017. Neural Collaborative Filtering(WWW ’17). doi:10.1145/ 3038912.3052569

work page arXiv 2017
[28]

Zifan He et al. 2025. InTAR: Inter-Task Auto-Reconfigurable Accelerator Design for High Data Volume Variation in DNNs. InFCCM. IEEE, 123–132

2025
[29]

Erika Hunhoff et al. 2025. Efficiency, expressivity, and extensibility in a close-to- metal npu programming interface. InFCCM. IEEE, 85–94

2025
[30]

Mustafa Ibrahim et al . 2025. VERSATILE: Very Fast Partial Reconfiguration Controller.ACM Transactions on Reconfigurable Technology and Systems18, 3 (2025), 1–22

2025
[31]

Shixin Ji et al. 2025. ART: Customizing accelerators for DNN-enabled real-time safety-critical systems. InGLSVLSI. 442–449

2025
[32]

Lana Josipovic et al. 2021. Synthesizing General-Purpose Code Into Dynamically Scheduled Circuits.IEEE Circuits and Systems Magazine21, 2 (2021), 97–118. doi:10.1109/MCAS.2021.3071631

work page doi:10.1109/mcas.2021.3071631 2021
[33]

Hyoukjun Kwon et al. 2021. Heterogeneous dataflow accelerators for multi-DNN workloads. InHPCA. IEEE, 71–83

2021
[34]

Jun Liu et al. 2025. FlightVGM: Efficient Video Generation Model Inference with Online Sparsification and Hybrid Precision on FPGAs(FPGA ’25). doi:10.1145/ 3706628.3708864

work page arXiv 2025
[35]

CPLEX User’s Manual. 1987. Ibm ilog cplex optimization studio.Version12, 1987-2018 (1987), 1

1987
[36]

Johannes Menzel et al . 2025. Efficient and Distributed Computation of Elec- tron Repulsion Integrals on AMD AI Engines. InFCCM. 95–104. doi:10.1109/ FCCM62733.2025.00044

work page arXiv 2025
[37]

Kaustubh Manohar Mhatre et al. 2025. Performance Analysis of GEMM Work- loads on the AMD Versal Platform. InISPASS. 150–161. doi:10.1109/ISPASS64960. 2025.00023

work page doi:10.1109/ispass64960 2025
[38]

Kaustubh Manohar Mhatre et al. 2025. GAMA: High-Performance GEMM Ac- celeration on AMD Versal ML-Optimized AI Engines. In2025 FPL. 323–331. doi:10.1109/FPL68686.2025.00051

work page doi:10.1109/fpl68686.2025.00051 2025
[39]

YoungSeok Na et al. 2026. HiLFS: FPGA-Orchestrated File System for High-Level Synthesis. InFPGA. 126–136

2026
[40]

Tan Nguyen et al . 2023. SPADES: A Productive Design Flow for Versal Pro- grammable Logic. InFPL. 65–71. doi:10.1109/FPL60245.2023.00017

work page doi:10.1109/fpl60245.2023.00017 2023
[41]

John Nickolls et al. [n. d.]. Scalable parallel programming with CUDA. ([n. d.])
[42]

Charles R Qi et al. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. InCVPR. 652–660

2017
[43]

Jan-Frederik Schulte et al. 2026. hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware.ACM TRETS(April 2026). doi:10.1145/3801979 Just Accepted

work page doi:10.1145/3801979 2026
[44]

Canberk Sönmez et al. 2026. Chext: A Domain-specific Language for Safe and Agile Elastic Dataflow Accelerators. InFPGA. 37–37

2026
[45]

Endri Taka et al. 2023. MaxEVA: Maximizing the Efficiency of Matrix Multiplica- tion on Versal AI Engine. InICFPT. 96–105. doi:10.1109/ICFPT59805.2023.00016

work page doi:10.1109/icfpt59805.2023.00016 2023
[46]

Dhananjay Rao Thallikar et al. 2026. HMix: An Efficient Hardware Accelerator for Quantized MLP-Mixer Inference. (2026)

2026
[47]

Ilya Tolstikhin et al. 2024. MLP-mixer: an all-MLP architecture for vision. In NIPS. Curran Associates Inc., Red Hook, NY, USA, Article 1857, 12 pages

2024
[48]

Jianming Tong et al. 2024. FEATHER: A reconfigurable accelerator with data reordering support for low-cost on-chip dataflow switching. InISCA. IEEE

2024
[49]

Hugo Touvron et al. 2021. Training data-efficient image transformers & distilla- tion through attention. InInternational conference on machine learning. PMLR

2021
[50]

Chengyue Wang et al. 2025. Reconfigurable Stream Network Architecture. In ISCA

2025
[51]

Erwei Wang et al. 2026. From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR.ACM TRETS(Jan. 2026). doi:10.1145/3785670 Just Accepted

work page doi:10.1145/3785670 2026
[52]

Yu Emma Wang et al. 2019. Benchmarking TPU, GPU, and CPU platforms for deep learning.arXiv preprint arXiv:1907.10701(2019)

work page arXiv 2019
[53]

Xuechao Wei et al. 2018. TGPA: Tile-grained pipeline architecture for low latency CNN inference. InICCAD. IEEE, 1–8

2018
[54]

2023.Zynq-7000 SoC Technical Reference Manual

Xilinx, Inc. 2023.Zynq-7000 SoC Technical Reference Manual. AMD. https: //docs.amd.com/r/en-US/ug585-zynq-7000-SoC-TRM

2023
[55]

Yixin Xu et al. 2024. Ferroelectric FET-based context-switching FPGA enabling dynamic reconfiguration for adaptive deep learning machines.Science Advances (2024). arXiv:https://www.science.org/doi/pdf/10.1126/sciadv.adk1525 doi:10. 1126/sciadv.adk1525

work page doi:10.1126/sciadv.adk1525 2024
[56]

Hanchen Yang et al . 2025. NSFlow: An End-to-End FPGA Framework with Scalable Dataflow Architecture for Neuro-Symbolic AI. In2025 DAC. doi:10.1109/ DAC63849.2025.11133088

work page arXiv 2025
[57]

Zhuoping Yang et al . [n. d.]. AIM: Accelerating Arbitrary-precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP. InICCAD
[58]

Shulin Zeng et al. 2024. FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs(FPGA ’24). New York, NY, USA. doi:10.1145/3626202.3637562

work page doi:10.1145/3626202.3637562 2024
[59]

Dan Zhang et al. 2022. A full-stack search technique for domain optimized deep learning accelerators. InASPLOS. 27–42

2022
[60]

Xiaofan Zhang et al. 2020. DNNExplorer: a framework for modeling and exploring a novel paradigm of FPGA-based DNN accelerator. InICCAD

2020
[61]

2019.Modeling and Optimization for Customized Computing: Perfor- mance, Energy and Cost Perspective

Peipei Zhou. 2019.Modeling and Optimization for Customized Computing: Perfor- mance, Energy and Cost Perspective. University of California, Los Angeles

2019
[62]

Jinming Zhuang et al. 2023. CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture. InFPGA(Monterey, CA, USA). ACM, 153–164. doi:10.1145/3543622.3573210

work page doi:10.1145/3543622.3573210 2023
[63]

Jinming Zhuang et al. 2024. CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture.ACM TRETS17, 3, Article 51 (Sept. 2024), 31 pages. doi:10.1145/3686163

work page doi:10.1145/3686163 2024
[64]

Jinming Zhuang et al. 2025. ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines(FPGA ’25). New York, NY, USA. doi:10.1145/3706628.3708870

work page doi:10.1145/3706628.3708870 2025
[65]

Jinming Zhuang et al . 2024. SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration. InFPGA. ACM. doi:10. 1145/3626202.3637569

work page arXiv 2024
[66]

Jinming Zhuang et al. 2023. AutoMM: Energy-efficient multi-data-type matrix multiply design on heterogeneous programmable system-on-chip. (2023)

2023
[67]

Jinming Zhuang et al . 2023. High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives. In2023 60th ACM/IEEE Design Automation Conference (DAC). 1–6. doi:10.1109/ DAC56929.2023.10247981

work page arXiv 2023

[1] [1]

Mohamed S Abdelfattah et al. 2018. DLA: Compiler and FPGA overlay for neural network inference acceleration. InFPL. IEEE

2018

[2] [2]

AMD 2023.Vitis AI User Guide. AMD. https://docs.amd.com/r/en-US/ug1414- vitis-ai

2023

[3] [3]

AMD/Xilinx. 2021. Versal Adaptive Compute Acceleration Platform. https: //www.xilinx.com/products/silicon-devices/acap/versal.html

2021

[4] [4]

2023.AI Engine API and Intrinsics User Guide

AMD/Xilinx. 2023.AI Engine API and Intrinsics User Guide

2023

[5] [5]

2023.Versal ACAP AI Engine System C Simulator

AMD/Xilinx. 2023.Versal ACAP AI Engine System C Simulator

2023

[6] [6]

Autoware Foundation. [n. d.]. Autoware - the world’s leading open-source soft- ware project for autonomous driving. https://github.com/autowarefoundation/ autoware

[7] [7]

Alan Tendler Leibel Bacellar et al. 2024. Differentiable Weightless Neural Net- works. InICML. 2277–2295. https://proceedings.mlr.press/v235/bacellar24a.html GLSVLSI ’26, June 22–24, 2026, Canandaigua, NY, USA Xingzhen Chen, Zhuoping Yang, Jinming Zhuang, Shixin Ji, Sarah Schultz, Zheng Dong, Weisong Shi, and Peipei Zhou

2024

[8] [8]

Mohammed S Bensaleh et al . 2018. Optimal task scheduling for distributed cluster with active storage devices and accelerated nodes.IEEE Access6 (2018), 48195–48209

2018

[9] [9]

Julian Blank et al. [n. d.]. Pymoo: Multi-Objective Optimization in Python.IEEE Access([n. d.]). doi:10.1109/ACCESS.2020.2990567

work page doi:10.1109/access.2020.2990567 2020

[10] [10]

Mohamed Bouaziz et al. [n. d.]. A Dataflow Overlay for Monte Carlo Multi-Asset Option Pricing on AMD Versal AI Engines. InISC High Performance 2025 Research Paper Proceedings. doi:10.23919/ISC.2025.11020612

work page doi:10.23919/isc.2025.11020612 2025

[11] [11]

Andrew Boutros et al. 2020. Beyond Peak Performance: Comparing the Real Performance of AI-Optimized FPGAs and GPUs. InICFPT. 10–19. doi:10.1109/ ICFPT51103.2020.00011

work page internal anchor Pith review Pith/arXiv arXiv 2020

[12] [12]

Andrew Boutros et al . 2020. Beyond peak performance: Comparing the real performance of AI-optimized FPGAs and GPUs. InICFPT. IEEE, 10–19

2020

[13] [13]

Jingwei Cai et al. 2023. Inter-layer scheduling space definition and exploration for tiled accelerators. InISCA. 1–17

2023

[14] [14]

Hongzheng Chen et al. 2024. Understanding the potential of fpga-based spatial acceleration for large language model inference.ACM TRETS18, 1 (2024), 1–29

2024

[15] [15]

Hongzheng Chen et al. 2024. Allo: A programming model for composable accel- erator design.Proceedings of the ACM on Programming Languages8, PLDI (2024), 593–620

2024

[16] [16]

Dimitrios Danopoulos et al . 2025. AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines.arXiv (2025)

2025

[17] [17]

Xiaodong Deng et al. 2024. AMA: An Analytical Approach to Maximizing the Efficiency of Deep Learning on Versal AI Engine. InFPL. 227–235. doi:10.1109/ FPL64840.2024.00039

work page arXiv 2024

[18] [18]

Jacob Devlin et al. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[19] [19]

Peiyan Dong et al. 2024. EQ-ViT: Algorithm-Hardware Co-Design for End-to- End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture.TCAD(2024). doi:10.1109/TCAD.2024.3443692

work page doi:10.1109/tcad.2024.3443692 2024

[20] [20]

Mario Doumet et al. 2024. H2PIPE: High throughput CNN inference on FPGAs with high-bandwidth memory. In2024 FPL. IEEE, 69–77

2024

[21] [21]

Jeremy Fowers et al. 2018. A configurable cloud-scale DNN processor for real-time AI. InISCA. IEEE, 1–14

2018

[22] [22]

Jeremy Fowers et al . 2018. A Configurable Cloud-Scale DNN Processor for Real-Time AI. InISCA. 1–14. doi:10.1109/ISCA.2018.00012

work page doi:10.1109/isca.2018.00012 2018

[23] [23]

Paolo Salvatore Galfano et al . 2024. Co-Designing a 3D Transformation Ac- celerator for Versal-Based Image Registration. InICCD. 219–222. doi:10.1109/ ICCD63220.2024.00041

work page arXiv 2024

[24] [24]

Nan Guan et al. [n. d.]. Industry Challenge

[25] [25]

Zibo Guo et al. 2024. An overlay accelerator of DeepLab CNN for spacecraft image segmentation on FPGA.Remote Sensing16, 5 (2024), 894

2024

[26] [26]

Mathew Hall et al. 2020. HPIPE: Heterogeneous layer-pipelined and sparse-aware CNN inference for FPGAs.arXiv preprint arXiv:2007.10451(2020)

work page arXiv 2020

[27] [27]

Xiangnan He et al. 2017. Neural Collaborative Filtering(WWW ’17). doi:10.1145/ 3038912.3052569

work page arXiv 2017

[28] [28]

Zifan He et al. 2025. InTAR: Inter-Task Auto-Reconfigurable Accelerator Design for High Data Volume Variation in DNNs. InFCCM. IEEE, 123–132

2025

[29] [29]

Erika Hunhoff et al. 2025. Efficiency, expressivity, and extensibility in a close-to- metal npu programming interface. InFCCM. IEEE, 85–94

2025

[30] [30]

Mustafa Ibrahim et al . 2025. VERSATILE: Very Fast Partial Reconfiguration Controller.ACM Transactions on Reconfigurable Technology and Systems18, 3 (2025), 1–22

2025

[31] [31]

Shixin Ji et al. 2025. ART: Customizing accelerators for DNN-enabled real-time safety-critical systems. InGLSVLSI. 442–449

2025

[32] [32]

Lana Josipovic et al. 2021. Synthesizing General-Purpose Code Into Dynamically Scheduled Circuits.IEEE Circuits and Systems Magazine21, 2 (2021), 97–118. doi:10.1109/MCAS.2021.3071631

work page doi:10.1109/mcas.2021.3071631 2021

[33] [33]

Hyoukjun Kwon et al. 2021. Heterogeneous dataflow accelerators for multi-DNN workloads. InHPCA. IEEE, 71–83

2021

[34] [34]

Jun Liu et al. 2025. FlightVGM: Efficient Video Generation Model Inference with Online Sparsification and Hybrid Precision on FPGAs(FPGA ’25). doi:10.1145/ 3706628.3708864

work page arXiv 2025

[35] [35]

CPLEX User’s Manual. 1987. Ibm ilog cplex optimization studio.Version12, 1987-2018 (1987), 1

1987

[36] [36]

Johannes Menzel et al . 2025. Efficient and Distributed Computation of Elec- tron Repulsion Integrals on AMD AI Engines. InFCCM. 95–104. doi:10.1109/ FCCM62733.2025.00044

work page arXiv 2025

[37] [37]

Kaustubh Manohar Mhatre et al. 2025. Performance Analysis of GEMM Work- loads on the AMD Versal Platform. InISPASS. 150–161. doi:10.1109/ISPASS64960. 2025.00023

work page doi:10.1109/ispass64960 2025

[38] [38]

Kaustubh Manohar Mhatre et al. 2025. GAMA: High-Performance GEMM Ac- celeration on AMD Versal ML-Optimized AI Engines. In2025 FPL. 323–331. doi:10.1109/FPL68686.2025.00051

work page doi:10.1109/fpl68686.2025.00051 2025

[39] [39]

YoungSeok Na et al. 2026. HiLFS: FPGA-Orchestrated File System for High-Level Synthesis. InFPGA. 126–136

2026

[40] [40]

Tan Nguyen et al . 2023. SPADES: A Productive Design Flow for Versal Pro- grammable Logic. InFPL. 65–71. doi:10.1109/FPL60245.2023.00017

work page doi:10.1109/fpl60245.2023.00017 2023

[41] [41]

John Nickolls et al. [n. d.]. Scalable parallel programming with CUDA. ([n. d.])

[42] [42]

Charles R Qi et al. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. InCVPR. 652–660

2017

[43] [43]

Jan-Frederik Schulte et al. 2026. hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware.ACM TRETS(April 2026). doi:10.1145/3801979 Just Accepted

work page doi:10.1145/3801979 2026

[44] [44]

Canberk Sönmez et al. 2026. Chext: A Domain-specific Language for Safe and Agile Elastic Dataflow Accelerators. InFPGA. 37–37

2026

[45] [45]

Endri Taka et al. 2023. MaxEVA: Maximizing the Efficiency of Matrix Multiplica- tion on Versal AI Engine. InICFPT. 96–105. doi:10.1109/ICFPT59805.2023.00016

work page doi:10.1109/icfpt59805.2023.00016 2023

[46] [46]

Dhananjay Rao Thallikar et al. 2026. HMix: An Efficient Hardware Accelerator for Quantized MLP-Mixer Inference. (2026)

2026

[47] [47]

Ilya Tolstikhin et al. 2024. MLP-mixer: an all-MLP architecture for vision. In NIPS. Curran Associates Inc., Red Hook, NY, USA, Article 1857, 12 pages

2024

[48] [48]

Jianming Tong et al. 2024. FEATHER: A reconfigurable accelerator with data reordering support for low-cost on-chip dataflow switching. InISCA. IEEE

2024

[49] [49]

Hugo Touvron et al. 2021. Training data-efficient image transformers & distilla- tion through attention. InInternational conference on machine learning. PMLR

2021

[50] [50]

Chengyue Wang et al. 2025. Reconfigurable Stream Network Architecture. In ISCA

2025

[51] [51]

Erwei Wang et al. 2026. From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR.ACM TRETS(Jan. 2026). doi:10.1145/3785670 Just Accepted

work page doi:10.1145/3785670 2026

[52] [52]

Yu Emma Wang et al. 2019. Benchmarking TPU, GPU, and CPU platforms for deep learning.arXiv preprint arXiv:1907.10701(2019)

work page arXiv 2019

[53] [53]

Xuechao Wei et al. 2018. TGPA: Tile-grained pipeline architecture for low latency CNN inference. InICCAD. IEEE, 1–8

2018

[54] [54]

2023.Zynq-7000 SoC Technical Reference Manual

Xilinx, Inc. 2023.Zynq-7000 SoC Technical Reference Manual. AMD. https: //docs.amd.com/r/en-US/ug585-zynq-7000-SoC-TRM

2023

[55] [55]

Yixin Xu et al. 2024. Ferroelectric FET-based context-switching FPGA enabling dynamic reconfiguration for adaptive deep learning machines.Science Advances (2024). arXiv:https://www.science.org/doi/pdf/10.1126/sciadv.adk1525 doi:10. 1126/sciadv.adk1525

work page doi:10.1126/sciadv.adk1525 2024

[56] [56]

Hanchen Yang et al . 2025. NSFlow: An End-to-End FPGA Framework with Scalable Dataflow Architecture for Neuro-Symbolic AI. In2025 DAC. doi:10.1109/ DAC63849.2025.11133088

work page arXiv 2025

[57] [57]

Zhuoping Yang et al . [n. d.]. AIM: Accelerating Arbitrary-precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP. InICCAD

[58] [58]

Shulin Zeng et al. 2024. FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs(FPGA ’24). New York, NY, USA. doi:10.1145/3626202.3637562

work page doi:10.1145/3626202.3637562 2024

[59] [59]

Dan Zhang et al. 2022. A full-stack search technique for domain optimized deep learning accelerators. InASPLOS. 27–42

2022

[60] [60]

Xiaofan Zhang et al. 2020. DNNExplorer: a framework for modeling and exploring a novel paradigm of FPGA-based DNN accelerator. InICCAD

2020

[61] [61]

2019.Modeling and Optimization for Customized Computing: Perfor- mance, Energy and Cost Perspective

Peipei Zhou. 2019.Modeling and Optimization for Customized Computing: Perfor- mance, Energy and Cost Perspective. University of California, Los Angeles

2019

[62] [62]

Jinming Zhuang et al. 2023. CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture. InFPGA(Monterey, CA, USA). ACM, 153–164. doi:10.1145/3543622.3573210

work page doi:10.1145/3543622.3573210 2023

[63] [63]

Jinming Zhuang et al. 2024. CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture.ACM TRETS17, 3, Article 51 (Sept. 2024), 31 pages. doi:10.1145/3686163

work page doi:10.1145/3686163 2024

[64] [64]

Jinming Zhuang et al. 2025. ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines(FPGA ’25). New York, NY, USA. doi:10.1145/3706628.3708870

work page doi:10.1145/3706628.3708870 2025

[65] [65]

Jinming Zhuang et al . 2024. SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration. InFPGA. ACM. doi:10. 1145/3626202.3637569

work page arXiv 2024

[66] [66]

Jinming Zhuang et al. 2023. AutoMM: Energy-efficient multi-data-type matrix multiply design on heterogeneous programmable system-on-chip. (2023)

2023

[67] [67]

Jinming Zhuang et al . 2023. High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives. In2023 60th ACM/IEEE Design Automation Conference (DAC). 1–6. doi:10.1109/ DAC56929.2023.10247981

work page arXiv 2023