DORA: Dataflow-Instruction Orchestration Architecture for DNN Acceleration
Pith reviewed 2026-05-25 02:14 UTC · model grok-4.3
The pith
DORA uses a custom ISA to explicitly orchestrate dataflow at the layer level on DNN accelerators, sustaining stable efficiency across workloads that differ by up to 6× in operation counts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DORA maintains stable efficiency, with less than 5% variation on a single vector processor across workloads exhibiting up to 6× variation in operation counts. Compared to state-of-the-art accelerators, DORA consistently achieves higher performance, delivering up to 5× throughput improvement. The heuristic-based scheduler further achieves up to 90% optimality under practical time constraints.
What carries the argument
An instruction-based overlay architecture whose proposed ISA explicitly encodes dataflow, combined with on-chip memory management, computation parallelism management, and a two-stage compilation framework that uses MILP-based and heuristic search engines to generate layer-level schedules.
If this is right
- The architecture sustains high hardware efficiency on diverse and complex DNN models without incurring large overhead per workload.
- DORA can be deployed directly on existing reconfigurable platforms such as the AMD Versal VCK190.
- The heuristic scheduler delivers schedules within 90% of MILP optimality under realistic compile-time limits.
- Fine-grained ISA control of data movement and synchronization enables consistent performance across layers with differing characteristics.
Where Pith is reading between the lines
- Open-sourcing the framework and scheduler allows independent verification on additional platforms or models.
- The same instruction-orchestration approach could be adapted to fixed-function ASICs to reduce the cost of supporting model diversity.
- Stable efficiency across operation-count variation suggests the design may reduce the need for per-model hardware specialization.
- The two-stage design-space exploration could be extended to include power or area constraints not emphasized in the current evaluation.
Load-bearing premise
The DNN workloads and operation-count variations used in the experiments represent the full range of real-world models that future users will deploy.
What would settle it
Running DORA on a new collection of DNN models whose operation counts vary by more than 6× or whose tensor shapes and dependencies fall outside the tested set, then checking whether efficiency variation on a single vector processor exceeds 5%.
Figures
read the original abstract
As deep neural networks develop significantly more diverse and complex, achieving high performance and efficiency on complicated DNN models faces pressing challenges. Modern DNN workloads are increasingly diverse in operation types, tensor shapes, and execution dependencies, making it difficult to sustain high hardware efficiency across models. In addition, a generic accelerator often incurs substantial overhead when executing diverse workloads. To address these problems, we propose DORA, an instruction-based overlay architecture that explicitly describes dataflow via a proposed ISA, enabling fine-grained control of data movement, computation, and synchronization at the layer level. To support flexibility while achieving high performance, DORA adopts a novel on-chip memory management and computation parallelism management mechanism. DORA proposes a compilation framework that can generate instructions for given DNN workloads after a two-stage design space exploration. DORA framework also incorporates a MILP-based and a heuristic-based search engine to generate the schedule solution for different needs and constraints. We prototype DORA on the AMD Versal VCK190 platform, demonstrating its deployability on existing reconfigurable systems. Experimental results show that DORA maintains stable efficiency, with less than 5\% variation on a single vector processor across workloads exhibiting up to 6$\times$ variation in operation counts. Compared to state-of-the-art accelerators, DORA consistently achieves higher performance, delivering up to 5$\times$ throughput improvement. The heuristic-based scheduler further achieves up to 90\% optimality under practical time constraints. DORA is open-sourced at https://github.com/arc-research-lab/DORA.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DORA, an instruction-based overlay architecture for DNN acceleration on reconfigurable hardware. It defines a custom ISA to explicitly orchestrate dataflow at the layer level, introduces on-chip memory management and computation parallelism mechanisms, and presents a two-stage compilation framework using MILP-based and heuristic search engines for scheduling. Prototyped on the AMD Versal VCK190, the work reports stable efficiency (<5% variation) on a single vector processor across workloads with up to 6× variation in operation counts, up to 5× throughput improvement versus state-of-the-art accelerators, and up to 90% optimality for the heuristic scheduler. The implementation is open-sourced.
Significance. If the stability and performance claims hold under representative workload diversity, the result would be significant for the field of flexible DNN accelerators. It offers a concrete path to combine the adaptability of instruction-driven designs with high efficiency on existing reconfigurable platforms, directly targeting the growing diversity of DNN models. The physical prototype and open-source release provide additional value by enabling reproducibility and deployment studies.
major comments (2)
- [Abstract] Abstract: The central efficiency-stability claim (<5% variation across workloads with 6× operation-count variation) and the 5× throughput improvement are presented without any description of workload selection criteria, measurement methodology, statistical significance, or the precise set of operation types, tensor shapes, and dependencies tested. This absence directly affects the ability to evaluate whether the results address the diversity challenges stated in the introduction.
- [Introduction and Experimental Results] Introduction and Experimental Results: The motivation explicitly identifies diversity in operation types, tensor shapes, and execution dependencies as the core challenge, yet the reported experiments are described only in terms of operation-count variation. No evidence is supplied that the evaluated workloads differ along the other stated axes; if the workloads are structurally similar, the stability result does not substantiate the generalization to “complicated DNN models.”
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on clarifying the experimental methodology and workload characteristics. We agree that additional details are needed to substantiate the claims regarding workload diversity and will revise the manuscript to address both major comments.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central efficiency-stability claim (<5% variation across workloads with 6× operation-count variation) and the 5× throughput improvement are presented without any description of workload selection criteria, measurement methodology, statistical significance, or the precise set of operation types, tensor shapes, and dependencies tested. This absence directly affects the ability to evaluate whether the results address the diversity challenges stated in the introduction.
Authors: We agree that the abstract is too concise on these points. In the revised version, we will expand the abstract to include a brief description of the workload selection criteria (standard models from MLPerf and common DNN benchmarks), measurement methodology (on-board execution on AMD Versal VCK190 with cycle-accurate timing via the integrated logic analyzer), and note the tested operation types (convolutions, matrix multiplications, activations, and reductions), tensor shapes (varying from 1×1 to 224×224 inputs with channel depths 64–2048), and dependencies (sequential, residual, and attention-based). Statistical significance is established via 10 repeated runs per workload with reported mean and standard deviation; these details will be summarized concisely in the abstract while retaining the core claims. revision: yes
-
Referee: [Introduction and Experimental Results] Introduction and Experimental Results: The motivation explicitly identifies diversity in operation types, tensor shapes, and execution dependencies as the core challenge, yet the reported experiments are described only in terms of operation-count variation. No evidence is supplied that the evaluated workloads differ along the other stated axes; if the workloads are structurally similar, the stability result does not substantiate the generalization to “complicated DNN models.”
Authors: The workloads used in the experiments are drawn from representative DNN models (ResNet-50, MobileNet-V2, BERT-base, and a custom attention-based model) that inherently differ in operation types, tensor shapes, and execution dependencies, as described in Section 5.1 of the manuscript. However, we acknowledge that the current presentation emphasizes operation-count variation and does not explicitly quantify the other dimensions. In the revision, we will add a dedicated table (new Table 2) that reports per-workload metrics for all three axes—operation-type diversity (unique op counts), tensor-shape variation (min/max dimensions and channel counts), and dependency graphs (number of parallel vs. sequential layers)—to demonstrate that the workloads are not structurally similar. This will directly support the generalization claim. revision: yes
Circularity Check
No circularity in derivation chain
full rationale
The paper proposes an instruction-based overlay architecture (DORA) with a custom ISA, on-chip memory management, compilation framework, and MILP/heuristic schedulers, then reports measured results from a physical prototype on AMD Versal VCK190. No equations, first-principles derivations, or 'predictions' appear in the provided text; efficiency and throughput claims rest on direct hardware measurements across tested workloads rather than any fitted parameter renamed as output or self-citation chain. The central claims are therefore self-contained against external benchmarks and receive score 0.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption DNN workloads exhibit up to 6x variation in operation counts while remaining representative of production models.
Reference graph
Works this paper leans on
-
[1]
Mohamed S Abdelfattah et al. 2018. DLA: Compiler and FPGA overlay for neural network inference acceleration. InFPL. IEEE
2018
-
[2]
AMD 2023.Vitis AI User Guide. AMD. https://docs.amd.com/r/en-US/ug1414- vitis-ai
2023
-
[3]
AMD/Xilinx. 2021. Versal Adaptive Compute Acceleration Platform. https: //www.xilinx.com/products/silicon-devices/acap/versal.html
2021
-
[4]
2023.AI Engine API and Intrinsics User Guide
AMD/Xilinx. 2023.AI Engine API and Intrinsics User Guide
2023
-
[5]
2023.Versal ACAP AI Engine System C Simulator
AMD/Xilinx. 2023.Versal ACAP AI Engine System C Simulator
2023
-
[6]
Autoware Foundation. [n. d.]. Autoware - the world’s leading open-source soft- ware project for autonomous driving. https://github.com/autowarefoundation/ autoware
-
[7]
Alan Tendler Leibel Bacellar et al. 2024. Differentiable Weightless Neural Net- works. InICML. 2277–2295. https://proceedings.mlr.press/v235/bacellar24a.html GLSVLSI ’26, June 22–24, 2026, Canandaigua, NY, USA Xingzhen Chen, Zhuoping Yang, Jinming Zhuang, Shixin Ji, Sarah Schultz, Zheng Dong, Weisong Shi, and Peipei Zhou
2024
-
[8]
Mohammed S Bensaleh et al . 2018. Optimal task scheduling for distributed cluster with active storage devices and accelerated nodes.IEEE Access6 (2018), 48195–48209
2018
-
[9]
Julian Blank et al. [n. d.]. Pymoo: Multi-Objective Optimization in Python.IEEE Access([n. d.]). doi:10.1109/ACCESS.2020.2990567
-
[10]
Mohamed Bouaziz et al. [n. d.]. A Dataflow Overlay for Monte Carlo Multi-Asset Option Pricing on AMD Versal AI Engines. InISC High Performance 2025 Research Paper Proceedings. doi:10.23919/ISC.2025.11020612
-
[11]
Andrew Boutros et al. 2020. Beyond Peak Performance: Comparing the Real Performance of AI-Optimized FPGAs and GPUs. InICFPT. 10–19. doi:10.1109/ ICFPT51103.2020.00011
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[12]
Andrew Boutros et al . 2020. Beyond peak performance: Comparing the real performance of AI-optimized FPGAs and GPUs. InICFPT. IEEE, 10–19
2020
-
[13]
Jingwei Cai et al. 2023. Inter-layer scheduling space definition and exploration for tiled accelerators. InISCA. 1–17
2023
-
[14]
Hongzheng Chen et al. 2024. Understanding the potential of fpga-based spatial acceleration for large language model inference.ACM TRETS18, 1 (2024), 1–29
2024
-
[15]
Hongzheng Chen et al. 2024. Allo: A programming model for composable accel- erator design.Proceedings of the ACM on Programming Languages8, PLDI (2024), 593–620
2024
-
[16]
Dimitrios Danopoulos et al . 2025. AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines.arXiv (2025)
2025
- [17]
-
[18]
Jacob Devlin et al. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805(2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[19]
Peiyan Dong et al. 2024. EQ-ViT: Algorithm-Hardware Co-Design for End-to- End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture.TCAD(2024). doi:10.1109/TCAD.2024.3443692
-
[20]
Mario Doumet et al. 2024. H2PIPE: High throughput CNN inference on FPGAs with high-bandwidth memory. In2024 FPL. IEEE, 69–77
2024
-
[21]
Jeremy Fowers et al. 2018. A configurable cloud-scale DNN processor for real-time AI. InISCA. IEEE, 1–14
2018
-
[22]
Jeremy Fowers et al . 2018. A Configurable Cloud-Scale DNN Processor for Real-Time AI. InISCA. 1–14. doi:10.1109/ISCA.2018.00012
- [23]
-
[24]
Nan Guan et al. [n. d.]. Industry Challenge
-
[25]
Zibo Guo et al. 2024. An overlay accelerator of DeepLab CNN for spacecraft image segmentation on FPGA.Remote Sensing16, 5 (2024), 894
2024
- [26]
- [27]
-
[28]
Zifan He et al. 2025. InTAR: Inter-Task Auto-Reconfigurable Accelerator Design for High Data Volume Variation in DNNs. InFCCM. IEEE, 123–132
2025
-
[29]
Erika Hunhoff et al. 2025. Efficiency, expressivity, and extensibility in a close-to- metal npu programming interface. InFCCM. IEEE, 85–94
2025
-
[30]
Mustafa Ibrahim et al . 2025. VERSATILE: Very Fast Partial Reconfiguration Controller.ACM Transactions on Reconfigurable Technology and Systems18, 3 (2025), 1–22
2025
-
[31]
Shixin Ji et al. 2025. ART: Customizing accelerators for DNN-enabled real-time safety-critical systems. InGLSVLSI. 442–449
2025
-
[32]
Lana Josipovic et al. 2021. Synthesizing General-Purpose Code Into Dynamically Scheduled Circuits.IEEE Circuits and Systems Magazine21, 2 (2021), 97–118. doi:10.1109/MCAS.2021.3071631
-
[33]
Hyoukjun Kwon et al. 2021. Heterogeneous dataflow accelerators for multi-DNN workloads. InHPCA. IEEE, 71–83
2021
- [34]
-
[35]
CPLEX User’s Manual. 1987. Ibm ilog cplex optimization studio.Version12, 1987-2018 (1987), 1
1987
- [36]
-
[37]
Kaustubh Manohar Mhatre et al. 2025. Performance Analysis of GEMM Work- loads on the AMD Versal Platform. InISPASS. 150–161. doi:10.1109/ISPASS64960. 2025.00023
-
[38]
Kaustubh Manohar Mhatre et al. 2025. GAMA: High-Performance GEMM Ac- celeration on AMD Versal ML-Optimized AI Engines. In2025 FPL. 323–331. doi:10.1109/FPL68686.2025.00051
-
[39]
YoungSeok Na et al. 2026. HiLFS: FPGA-Orchestrated File System for High-Level Synthesis. InFPGA. 126–136
2026
-
[40]
Tan Nguyen et al . 2023. SPADES: A Productive Design Flow for Versal Pro- grammable Logic. InFPL. 65–71. doi:10.1109/FPL60245.2023.00017
-
[41]
John Nickolls et al. [n. d.]. Scalable parallel programming with CUDA. ([n. d.])
-
[42]
Charles R Qi et al. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. InCVPR. 652–660
2017
-
[43]
Jan-Frederik Schulte et al. 2026. hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware.ACM TRETS(April 2026). doi:10.1145/3801979 Just Accepted
-
[44]
Canberk Sönmez et al. 2026. Chext: A Domain-specific Language for Safe and Agile Elastic Dataflow Accelerators. InFPGA. 37–37
2026
-
[45]
Endri Taka et al. 2023. MaxEVA: Maximizing the Efficiency of Matrix Multiplica- tion on Versal AI Engine. InICFPT. 96–105. doi:10.1109/ICFPT59805.2023.00016
-
[46]
Dhananjay Rao Thallikar et al. 2026. HMix: An Efficient Hardware Accelerator for Quantized MLP-Mixer Inference. (2026)
2026
-
[47]
Ilya Tolstikhin et al. 2024. MLP-mixer: an all-MLP architecture for vision. In NIPS. Curran Associates Inc., Red Hook, NY, USA, Article 1857, 12 pages
2024
-
[48]
Jianming Tong et al. 2024. FEATHER: A reconfigurable accelerator with data reordering support for low-cost on-chip dataflow switching. InISCA. IEEE
2024
-
[49]
Hugo Touvron et al. 2021. Training data-efficient image transformers & distilla- tion through attention. InInternational conference on machine learning. PMLR
2021
-
[50]
Chengyue Wang et al. 2025. Reconfigurable Stream Network Architecture. In ISCA
2025
-
[51]
Erwei Wang et al. 2026. From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR.ACM TRETS(Jan. 2026). doi:10.1145/3785670 Just Accepted
- [52]
-
[53]
Xuechao Wei et al. 2018. TGPA: Tile-grained pipeline architecture for low latency CNN inference. InICCAD. IEEE, 1–8
2018
-
[54]
2023.Zynq-7000 SoC Technical Reference Manual
Xilinx, Inc. 2023.Zynq-7000 SoC Technical Reference Manual. AMD. https: //docs.amd.com/r/en-US/ug585-zynq-7000-SoC-TRM
2023
-
[55]
Yixin Xu et al. 2024. Ferroelectric FET-based context-switching FPGA enabling dynamic reconfiguration for adaptive deep learning machines.Science Advances (2024). arXiv:https://www.science.org/doi/pdf/10.1126/sciadv.adk1525 doi:10. 1126/sciadv.adk1525
- [56]
-
[57]
Zhuoping Yang et al . [n. d.]. AIM: Accelerating Arbitrary-precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP. InICCAD
-
[58]
Shulin Zeng et al. 2024. FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs(FPGA ’24). New York, NY, USA. doi:10.1145/3626202.3637562
-
[59]
Dan Zhang et al. 2022. A full-stack search technique for domain optimized deep learning accelerators. InASPLOS. 27–42
2022
-
[60]
Xiaofan Zhang et al. 2020. DNNExplorer: a framework for modeling and exploring a novel paradigm of FPGA-based DNN accelerator. InICCAD
2020
-
[61]
2019.Modeling and Optimization for Customized Computing: Perfor- mance, Energy and Cost Perspective
Peipei Zhou. 2019.Modeling and Optimization for Customized Computing: Perfor- mance, Energy and Cost Perspective. University of California, Los Angeles
2019
-
[62]
Jinming Zhuang et al. 2023. CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture. InFPGA(Monterey, CA, USA). ACM, 153–164. doi:10.1145/3543622.3573210
-
[63]
Jinming Zhuang et al. 2024. CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture.ACM TRETS17, 3, Article 51 (Sept. 2024), 31 pages. doi:10.1145/3686163
-
[64]
Jinming Zhuang et al. 2025. ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines(FPGA ’25). New York, NY, USA. doi:10.1145/3706628.3708870
- [65]
-
[66]
Jinming Zhuang et al. 2023. AutoMM: Energy-efficient multi-data-type matrix multiply design on heterogeneous programmable system-on-chip. (2023)
2023
- [67]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.