DAE4HLS: Exposing Memory-Level Parallelism for High-Level Synthesis using Explicit Decoupling
Pith reviewed 2026-05-25 02:27 UTC · model grok-4.3
The pith
Explicit decoupling of memory requests and responses in high-level synthesis unlocks memory-level parallelism that compilers miss and delivers 10-79x speedups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents DAE4HLS, a decoupled access-execute paradigm whose programming model explicitly separates memory requests from responses. In the AMD Vitis HLS toolchain the model is implemented by repurposing the existing AXI stream and AXI burst interfaces, and it is also applied to a dynamic-HLS framework. In both flows it exposes memory-level parallelism that automatic compilers cannot discover, with reported speedups between 10x and 79x on the target workloads.
What carries the argument
The DAE4HLS explicit-decoupling programming model that repurposes AXI stream and burst interfaces to separate request and response streams.
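To make the idea concrete, here is a minimal software sketch of an access/execute split for an indirect gather-and-sum, sum += data[idx[i]]. It is not the paper's API. The names Channel, issue_requests, memory_model, and consume_responses are hypothetical, and std::queue only stands in for the hardware streams that the paper maps onto AXI interfaces.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for a hardware FIFO/stream (an assumption, not a Vitis type).
template <typename T>
using Channel = std::queue<T>;

// Access side: push every address without waiting for any data to come back.
void issue_requests(const std::vector<std::uint32_t>& idx,
                    Channel<std::uint32_t>& req) {
    for (std::uint32_t i : idx) req.push(i);
}

// Software memory model: answers requests in order. In hardware, many of
// these requests could be in flight at once, which is where the
// memory-level parallelism would come from.
void memory_model(const std::vector<std::int64_t>& data,
                  Channel<std::uint32_t>& req,
                  Channel<std::int64_t>& resp) {
    while (!req.empty()) {
        resp.push(data[req.front()]);
        req.pop();
    }
}

// Execute side: consume responses and do the actual computation.
std::int64_t consume_responses(Channel<std::int64_t>& resp) {
    std::int64_t sum = 0;
    while (!resp.empty()) {
        sum += resp.front();
        resp.pop();
    }
    return sum;
}

// Decoupled version: requests and responses travel on separate channels.
std::int64_t gather_sum_decoupled(const std::vector<std::int64_t>& data,
                                  const std::vector<std::uint32_t>& idx) {
    Channel<std::uint32_t> req;
    Channel<std::int64_t> resp;
    issue_requests(idx, req);
    memory_model(data, req, resp);
    return consume_responses(resp);
}

// Coupled reference: each load is requested and used in the same loop step.
std::int64_t gather_sum_coupled(const std::vector<std::int64_t>& data,
                                const std::vector<std::uint32_t>& idx) {
    std::int64_t sum = 0;
    for (std::uint32_t i : idx) sum += data[i];
    return sum;
}
```

Both functions compute the same result. The sketch only shows the source-level shape of the split and says nothing about how the real toolchains schedule or buffer the channels.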
If this is right
- Applications with complex, non-sequential memory patterns on large datasets become candidates for HLS acceleration.
- Designers can obtain the claimed speedups while continuing to use the vendor's standard AXI interfaces.
- Dynamic HLS frameworks gain an additional mechanism for handling irregular workloads that static scheduling misses.
- The performance range of 10-79x holds when the explicit decoupling is applied to both the static and dynamic tool flows.
Where Pith is reading between the lines
- Similar explicit-decoupling annotations could be added to other commercial or open HLS compilers that already expose AXI-like ports.
- The same separation of request and response phases might reduce stalls in FPGA accelerators for graph or sparse-matrix codes that currently require hand-written RTL.
- A direct test would measure whether the same source-level changes produce comparable speedups when the DAE4HLS model is ported to a different dynamic scheduling engine.
Load-bearing premise
Repurposing the existing AXI stream and burst interfaces for explicit decoupling adds no prohibitive overhead and preserves compatibility inside the AMD Vitis HLS toolchain.
What would settle it
A workload with irregular large-dataset accesses where the DAE4HLS code either fails to compile under Vitis or produces no speedup over ordinary HLS on the same platform.
Original abstract
High-level synthesis (HLS) performs well for simple memory access patterns, such as for sequential accesses that can be turned into bursts, or for memory accesses into small datasets that can be stored in scratchpads. This limits HLS to accelerating only the low-hanging fruit, where memory-level parallelism is either trivially abundant, due to simple access patterns, or latency is low, due to the small dataset. Applications with more complex access patterns on large datasets would also benefit from acceleration, and would especially benefit from the reduction in design and verification effort that HLS promises. In this paper, we present DAE4HLS, a decoupled access-execute (DAE) paradigm for HLS. We propose a new programming model for explicitly decoupling requests and responses, which unlocks memory-level parallelism that otherwise cannot be automatically provided by a compiler. We apply the DAE4HLS paradigm to the commercial AMD Vitis HLS toolchain and show that the existing AXI stream and AXI burst interfaces can be repurposed for explicit decoupling. We further apply the paradigm to a dynamic-HLS framework, which is better suited for handling irregular workloads as compared to statically scheduled HLS. We show that support for explicit decoupling improves the performance and achieves a total speedup of 10-79$\times$.
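A first-order cycle model can illustrate why overlapping requests matters when latency is high. The sketch below is an idealization and is not taken from the paper: it assumes a fixed memory latency, at most one request issued per cycle, and a fixed cap on outstanding requests, and it ignores bursts, bank conflicts, and compute time. The function and parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Coupled case (assumption): every access waits for the previous response,
// so n accesses cost n full round trips.
std::uint64_t coupled_cycles(std::uint64_t n, std::uint64_t latency) {
    return n * latency;
}

// Decoupled case (assumption): up to max_outstanding requests are in flight.
// By Little's law the steady-state rate is max_outstanding / latency requests
// per cycle, capped at one per cycle; the interval below is the integer
// (rounded-up) spacing between consecutive responses.
std::uint64_t decoupled_cycles(std::uint64_t n, std::uint64_t latency,
                               std::uint64_t max_outstanding) {
    if (n == 0) return 0;
    max_outstanding = std::max<std::uint64_t>(1, max_outstanding);
    std::uint64_t interval = std::max<std::uint64_t>(
        1, (latency + max_outstanding - 1) / max_outstanding);
    // First response arrives after one full latency; the rest follow
    // one interval apart.
    return latency + (n - 1) * interval;
}
```

With illustrative inputs of 1000 accesses and a 100-cycle latency, the model gives 100000 cycles when coupled and 1099 cycles with 100 outstanding requests. With a single outstanding request the two cases coincide. These figures are inputs to a toy model, not measurements from the paper.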
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DAE4HLS, a decoupled access-execute (DAE) paradigm for high-level synthesis. It defines a new programming model that allows explicit decoupling of memory requests and responses to expose memory-level parallelism (MLP) beyond what automatic compiler analysis can achieve for complex access patterns on large datasets. The approach is instantiated in two settings: (1) by repurposing existing AXI stream and burst interfaces within the commercial AMD Vitis HLS toolchain, and (2) within a dynamic-HLS framework. Experimental results are reported to show that explicit decoupling yields speedups of 10-79×.
Significance. If the experimental claims hold after proper characterization of interface overheads, the work would meaningfully extend the reach of HLS to irregular, memory-bound workloads that currently fall outside the “low-hanging fruit” that HLS compilers handle automatically. The pragmatic decision to reuse standard AXI interfaces rather than requiring new hardware primitives is a practical strength that could ease adoption.
major comments (2)
- [Abstract, §4] Abstract and §4 (presumed experimental section): the central performance claim of 10-79× speedup is presented without any reported cycle counts, bandwidth utilization, or ablation that isolates the cost of repurposing AXI stream/burst interfaces from the decoupling transformation itself. The skeptic's concern about unmeasured stalls or reduced burst efficiency is therefore unaddressed; this directly undermines the claim that the repurposing can be done “without introducing prohibitive overheads.”
- [§3] §3 (programming model and interface mapping): the manuscript must demonstrate that the explicit decoupling primitives map onto AXI without altering the semantics or timing of the original HLS-generated hardware. No timing diagrams, resource overhead tables, or comparison against native AXI usage are referenced in the provided abstract; this information is load-bearing for the practicality argument.
minor comments (2)
- [Abstract] The abstract should list the specific benchmarks, input sizes, and baseline configurations (e.g., plain Vitis HLS, manual RTL, other DAE approaches) so that the speedup range can be interpreted.
- [§2] Notation for the new decoupling primitives (request/response channels, decoupling buffers) should be introduced with a small code example or diagram in the main text rather than only in supplementary material.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential of DAE4HLS to extend HLS to more complex workloads. We address each major comment below and commit to revisions that strengthen the experimental characterization and interface details.
- Referee: [Abstract, §4] Abstract and §4 (presumed experimental section): the central performance claim of 10-79× speedup is presented without any reported cycle counts, bandwidth utilization, or ablation that isolates the cost of repurposing AXI stream/burst interfaces from the decoupling transformation itself. The skeptic's concern about unmeasured stalls or reduced burst efficiency is therefore unaddressed; this directly undermines the claim that the repurposing can be done “without introducing prohibitive overheads.”
  Authors: We agree that the current presentation of the 10-79× speedups lacks the requested low-level metrics. In the revised manuscript we will augment §4 with cycle counts, bandwidth utilization figures, and an ablation that isolates decoupling gains from any AXI-repurposing overheads, directly addressing concerns about stalls and burst efficiency.
  Revision: yes
- Referee: [§3] §3 (programming model and interface mapping): the manuscript must demonstrate that the explicit decoupling primitives map onto AXI without altering the semantics or timing of the original HLS-generated hardware. No timing diagrams, resource overhead tables, or comparison against native AXI usage are referenced in the provided abstract; this information is load-bearing for the practicality argument.
  Authors: The manuscript describes the AXI mapping at a high level but does not yet include the requested demonstrations. We will revise §3 to add timing diagrams, resource-overhead tables, and direct comparisons against native AXI usage, confirming that semantics and timing remain unchanged.
  Revision: yes
Circularity Check
No significant circularity; claims rest on experimental implementation.
The paper introduces a DAE programming model for HLS, repurposes existing AXI interfaces, and reports 10-79× speedups from applying the paradigm to Vitis HLS and a dynamic framework. No equations, fitted parameters, or self-citations are presented as load-bearing derivations; performance numbers are direct experimental outcomes, not reductions to inputs by construction. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard assumptions of HLS toolchains regarding interface compatibility and repurposing.