arxiv: 2605.23549 · v1 · pith:VEXQDCZPnew · submitted 2026-05-22 · 💻 cs.AR

DAE4HLS: Exposing Memory-Level Parallelism for High-Level Synthesis using Explicit Decoupling

Pith reviewed 2026-05-25 02:27 UTC · model grok-4.3

classification 💻 cs.AR
keywords high-level synthesis · decoupled access-execute · memory-level parallelism · AXI interfaces · irregular memory access · performance acceleration · Vitis HLS

The pith

Explicit decoupling of memory requests and responses in high-level synthesis unlocks memory-level parallelism that compilers miss and delivers 10-79x speedups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-level synthesis works well only for sequential accesses that can be turned into bursts or for small datasets that fit in scratchpads, leaving complex access patterns on large data out of reach. The paper introduces a programming model that lets the programmer explicitly separate memory requests from their responses. This model is realized by repurposing the standard AXI stream and AXI burst interfaces inside the commercial AMD Vitis HLS flow and is also applied to a dynamic HLS framework. The central claim is that explicit decoupling supplies the missing memory-level parallelism and produces large performance gains without new hardware interfaces.

Core claim

The paper presents DAE4HLS, a decoupled access-execute paradigm that supplies a new programming model for explicitly separating memory requests from responses; when this model is implemented by repurposing existing AXI stream and AXI burst interfaces, both static Vitis HLS and dynamic HLS frameworks can expose memory-level parallelism that automatic compilers cannot discover, resulting in measured speedups between 10x and 79x on the target workloads.

What carries the argument

The DAE4HLS explicit-decoupling programming model that repurposes AXI stream and burst interfaces to separate request and response streams.
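To make the idea of separating requests from responses concrete, here is a minimal software sketch in plain C++. It is not the paper's programming model and uses no Vitis HLS or AXI API. The memory model (fixed latency, one request accepted per cycle, in-order responses) and the names `gather_coupled`, `gather_decoupled`, and `max_outstanding` are all hypothetical, chosen only to illustrate how decoupling exposes memory-level parallelism on an irregular gather.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Toy result type: the computed sum and the number of simulated cycles.
struct Result {
    uint64_t sum;
    uint64_t cycles;
};

// Coupled version: every load must return before the next one is issued,
// so each access pays the full (assumed, fixed) memory latency.
Result gather_coupled(const std::vector<uint32_t>& data,
                      const std::vector<uint32_t>& idx,
                      unsigned latency) {
    Result r{0, 0};
    for (uint32_t i : idx) {
        r.cycles += latency;  // stall for the whole round trip
        r.sum += data[i];
    }
    return r;
}

// Decoupled version: the access side issues requests ahead of their use
// (at most one per cycle, at most max_outstanding in flight), and the
// execute side consumes responses in order as they become ready.
Result gather_decoupled(const std::vector<uint32_t>& data,
                        const std::vector<uint32_t>& idx,
                        unsigned latency,
                        std::size_t max_outstanding) {
    assert(max_outstanding > 0);  // zero would never make progress
    struct InFlight {
        uint32_t value;     // value the toy memory will return
        uint64_t ready_at;  // cycle at which the response is available
    };
    std::deque<InFlight> in_flight;
    Result r{0, 0};
    std::size_t issued = 0;
    std::size_t consumed = 0;
    while (consumed < idx.size()) {
        // Execute side: consume at most one ready response per cycle.
        if (!in_flight.empty() && in_flight.front().ready_at <= r.cycles) {
            r.sum += in_flight.front().value;
            in_flight.pop_front();
            ++consumed;
        }
        // Access side: issue at most one new request per cycle.
        if (issued < idx.size() && in_flight.size() < max_outstanding) {
            in_flight.push_back({data[idx[issued]], r.cycles + latency});
            ++issued;
        }
        ++r.cycles;
    }
    return r;
}
```

In this toy model, 1000 accesses at a latency of 100 cycles take 100000 cycles when coupled and 1100 cycles when decoupled with 100 requests allowed in flight. These numbers come from the simplified model only and say nothing about the paper's measured speedups.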

If this is right

  • Applications with complex, non-sequential memory patterns on large datasets become candidates for HLS acceleration.
  • Designers can obtain the claimed speedups while continuing to use the vendor's standard AXI interfaces.
  • Dynamic HLS frameworks gain an additional mechanism for handling irregular workloads that static scheduling misses.
  • The performance range of 10-79x holds when the explicit decoupling is applied to both the static and dynamic tool flows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar explicit-decoupling annotations could be added to other commercial or open HLS compilers that already expose AXI-like ports.
  • The same separation of request and response phases might reduce stalls in FPGA accelerators for graph or sparse-matrix codes that currently require hand-written RTL.
  • A direct test would measure whether the same source-level changes produce comparable speedups when the DAE4HLS model is ported to a different dynamic scheduling engine.

Load-bearing premise

Repurposing the existing AXI stream and burst interfaces for explicit decoupling adds no prohibitive overhead and preserves compatibility inside the AMD Vitis HLS toolchain.

What would settle it

A workload with irregular large-dataset accesses where the DAE4HLS code either fails to compile under Vitis or produces no speedup over ordinary HLS on the same platform.

Figures

Figures reproduced from arXiv: 2605.23549 by David Metz, Magnus Själander.

Figure 1. Illustration showing the general principle behind (a) prefetching, (b) streaming, and (c) decoupling.

Figure 2. State edges for stream and decouple in Listing 1.

Figure 3. Store state edge.
… memories. Instead, we equip accelerators with one AXI interface per pointer argument. To decouple the critical path of the accelerator from the memory subsystem, we place buffers on the AXI interfaces. We evaluate three different configurations for R-HLS. R-HLS and R-HLS Stream use the memory disambiguation by Metz et al. [21], with the changes to accommodate long latency memory operation…

Figure 4. Overhead in cycle count for dae4hls over "golden" reference.
mergesort_opt provides a 74.8% speedup over mergesort, at a 55.5% LUT, 40% FF, and 100% BRAM increase. This seems like a worthwhile trade-off, considering memory traffic is also reduced, and as discussed below, the performance difference exceeds 90% for larger array sizes. For R-HLS Decoupled, binsearch_for seems preferable over binsearch, since …
Original abstract

High-level synthesis (HLS) performs well for simple memory access patterns, such as for sequential accesses that can be turned into bursts, or for memory accesses into small datasets that can be stored in scratchpads. This limits HLS to accelerating only the low-hanging fruit, where memory-level parallelism is either trivially abundant, due to simple access patterns, or latency is low, due to the small dataset. Applications with more complex access patterns on large datasets would also benefit from acceleration, and would especially benefit from the reduction in design and verification effort that HLS promises. In this paper, we present DAE4HLS, a decoupled access-execute (DAE) paradigm for HLS. We propose a new programming model for explicitly decoupling requests and responses, which unlocks memory-level parallelism that otherwise cannot be automatically provided by a compiler. We apply the DAE4HLS paradigm to the commercial AMD Vitis HLS toolchain and show that the existing AXI stream and AXI burst interfaces can be repurposed for explicit decoupling. We further apply the paradigm to a dynamic-HLS framework, which is better suited for handling irregular workloads as compared to statically scheduled HLS. We show that support for explicit decoupling improves the performance and achieves a total speedup of 10-79$\times$.
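A back-of-envelope model, not taken from the paper, makes the two easy cases in the abstract concrete. Assume $N$ independent accesses, a fixed memory latency of $L$ cycles, at most one request issued per cycle, and at most $k$ requests outstanding at once. All four symbols are illustrative assumptions rather than the paper's notation.

```latex
\begin{align}
  T_{\text{coupled}}   &\approx N \cdot L \\
  T_{\text{decoupled}} &\approx L + N \cdot \max\!\left(1, \frac{L}{k}\right)
\end{align}
```

By Little's law, sustaining one access per cycle needs roughly $k \ge L$ requests in flight. Bursts reach that easily, and small scratchpad datasets keep $L$ small. Irregular accesses to large datasets have neither property unless requests are issued well ahead of their use, which is the gap explicit decoupling is meant to close.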

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DAE4HLS, a decoupled access-execute (DAE) paradigm for high-level synthesis. It defines a new programming model that allows explicit decoupling of memory requests and responses to expose memory-level parallelism (MLP) beyond what automatic compiler analysis can achieve for complex access patterns on large datasets. The approach is instantiated in two settings: (1) by repurposing existing AXI stream and burst interfaces within the commercial AMD Vitis HLS toolchain, and (2) within a dynamic-HLS framework. Experimental results are reported to show that explicit decoupling yields speedups of 10-79×.

Significance. If the experimental claims hold after proper characterization of interface overheads, the work would meaningfully extend the reach of HLS to irregular, memory-bound workloads that currently fall outside the “low-hanging fruit” that HLS compilers handle automatically. The pragmatic decision to reuse standard AXI interfaces rather than requiring new hardware primitives is a practical strength that could ease adoption.

major comments (2)
  1. [Abstract, §4] Abstract and §4 (presumed experimental section): the central performance claim of 10-79× speedup is presented without any reported cycle counts, bandwidth utilization, or ablation that isolates the cost of repurposing AXI stream/burst interfaces from the decoupling transformation itself. The skeptic concern about unmeasured stalls or reduced burst efficiency is therefore unaddressed; this directly undermines the claim that the repurposing can be done “without introducing prohibitive overheads.”
  2. [§3] §3 (programming model and interface mapping): the manuscript must demonstrate that the explicit decoupling primitives map onto AXI without altering the semantics or timing of the original HLS-generated hardware. No timing diagrams, resource overhead tables, or comparison against native AXI usage are referenced in the provided abstract; this information is load-bearing for the practicality argument.
minor comments (2)
  1. [Abstract] The abstract should list the specific benchmarks, input sizes, and baseline configurations (e.g., plain Vitis HLS, manual RTL, other DAE approaches) so that the speedup range can be interpreted.
  2. [§2] Notation for the new decoupling primitives (request/response channels, decoupling buffers) should be introduced with a small code example or diagram in the main text rather than only in supplementary material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of DAE4HLS to extend HLS to more complex workloads. We address each major comment below and commit to revisions that strengthen the experimental characterization and interface details.

  1. Referee: [Abstract, §4] Abstract and §4 (presumed experimental section): the central performance claim of 10-79× speedup is presented without any reported cycle counts, bandwidth utilization, or ablation that isolates the cost of repurposing AXI stream/burst interfaces from the decoupling transformation itself. The skeptic concern about unmeasured stalls or reduced burst efficiency is therefore unaddressed; this directly undermines the claim that the repurposing can be done “without introducing prohibitive overheads.”

    Authors: We agree that the current presentation of the 10-79× speedups lacks the requested low-level metrics. In the revised manuscript we will augment §4 with cycle counts, bandwidth utilization figures, and an ablation that isolates decoupling gains from any AXI-repurposing overheads, directly addressing concerns about stalls and burst efficiency. revision: yes

  2. Referee: [§3] §3 (programming model and interface mapping): the manuscript must demonstrate that the explicit decoupling primitives map onto AXI without altering the semantics or timing of the original HLS-generated hardware. No timing diagrams, resource overhead tables, or comparison against native AXI usage are referenced in the provided abstract; this information is load-bearing for the practicality argument.

    Authors: The manuscript describes the AXI mapping at a high level but does not yet include the requested demonstrations. We will revise §3 to add timing diagrams, resource-overhead tables, and direct comparisons against native AXI usage, confirming that semantics and timing remain unchanged. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on experimental implementation.


The paper introduces a DAE programming model for HLS, repurposes existing AXI interfaces, and reports 10-79× speedups from applying the paradigm to Vitis HLS and a dynamic framework. No equations, fitted parameters, or self-citations are presented as load-bearing derivations; performance numbers are direct experimental outcomes, not reductions to inputs by construction. The claims therefore rest on the reported experiments rather than on any internal derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution is a systems engineering and programming model effort relying on standard HLS toolchain assumptions rather than mathematical derivations.

axioms (1)
  • domain assumption: Standard assumptions of HLS toolchains regarding interface compatibility and repurposing
    The paper relies on the ability to repurpose AXI interfaces without breaking existing functionality.

pith-pipeline@v0.9.0 · 5760 in / 1011 out tokens · 22413 ms · 2026-05-25T02:27:10.331736+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

  1. [1]

    AMD. 2025. Vitis Reference Guide. Technical Report

  2. [2]

    Mikhail Asiatici. 2021. Miss-Optimized Memory Systems: Turning Thousands of Outstanding Misses into Reuse Opportunities. Ph.D. Dissertation. EPFL. https://doi.org/10.5075/epfl-thesis-8050

  3. [3]

    Suhail Basalama and Jason Cong. 2025. Stream-HLS: Towards Automatic Dataflow Acceleration. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 103–114. https://doi.org/10.1145/3706628.3708878

  4. [4]

    George Charitopoulos, Charalampos Vatsolakis, Grigorios Chrysos, and Dionisios N. Pnevmatikatos. 2018. A decoupled access-execute architecture for reconfigurable accelerators. In Proceedings of the ACM International Conference on Computing Frontiers. 244–247. https://doi.org/10.1145/3203217.3203267

  5. [5]

    Tao Chen and G. Edward Suh. 2016. Efficient data supply for hardware accelerators with prefetching and access/execute decoupling. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture. 1–12. https://doi.org/10.1109/MICRO.2016.7783749

  6. [6]

    Jianyi Cheng, Lana Josipović, George A. Constantinides, and John Wickerson. 2022. Dynamic Inter-Block Scheduling for HLS. In Proceedings of the Conference on Field Programmable Logic and Applications. 243–252. https://doi.org/10.1109/FPL57034.2022.00045

  7. [7]

    Jianyi Cheng, Lana Josipović, John Wickerson, and George A. Constantinides. 2023. Parallelising Control Flow in Dynamic-scheduling High-level Synthesis. ACM Transactions on Reconfigurable Technology and Systems 16 (Dec. 2023), 1–32. Issue 4. https://doi.org/10.1145/3599973

  8. [8]

    Shaoyi Cheng and John Wawrzynek. 2014. Architectural synthesis of computational pipelines with decoupled memory access. In FPT. 83–90. https://doi.org/10.1109/FPT.2014.7082758

  9. [9]

    William J. Dally, Yatish Turakhia, and Song Han. 2020. Domain-specific hardware accelerators. Commun. ACM 63 (June 2020), 48–57. Issue 7. https://doi.org/10.1145/3361682

  10. [10]

    Johannes de Fine Licht, Maciej Besta, Simon Meierhans, and Torsten Hoefler. 2021. Transformations of High-Level Synthesis Codes for High-Performance Computing. IEEE Transactions on Parallel and Distributed Systems 32 (May 2021), 1014–1029. Issue 5. https://doi.org/10.1109/TPDS.2020.3039409

  11. [11]

    Ayatallah Elakhras, Andrea Guerrieri, Lana Josipović, and Paolo Ienne. 2022. Unleashing Parallelism in Elastic Circuits with Faster Token Delivery. In Proceedings of the Conference on Field Programmable Logic and Applications. 253–261. https://doi.org/10.1109/FPL57034.2022.00046

  12. [12]

    Ayatallah Elakhras, Riya Sawhney, Andrea Guerrieri, Lana Josipovic, and Paolo Ienne. 2023. Straight to the Queue: Fast Load-Store Queue Allocation in Dataflow Circuits. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 39–45. https://doi.org/10.1145/3543622.3573050

  13. [13]

    DeepSeek-AI et al. 2025. DeepSeek-V3 Technical Report. https://doi.org/10.48550/arXiv.2412.19437

  14. [14]

    Shane T. Fleming and David B. Thomas. 2017. Using Runahead Execution to Hide Memory Latency in High Level Synthesis. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines. 109–116. https://doi.org/10.1109/FCCM.2017.33

  15. [15]

    Khronos OpenCL Working Group. 2025. The OpenCL Specification. Standard. Khronos. https://registry.khronos.org/OpenCL/specs/3.0-unified/pdf/OpenCL_API.pdf

  16. [16]

    Lana Josipovic, Philip Brisk, and Paolo Ienne. 2017. An Out-of-Order Load-Store Queue for Spatial Computing. ACM Transactions on Embedded Computing Systems 16 (Oct. 2017), 1–19. Issue 5s. https://doi.org/10.1145/3126525

  17. [17]

    Lana Josipović, Radhika Ghosal, and Paolo Ienne. 2018. Dynamically Scheduled High-level Synthesis. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 127–136. https://doi.org/10.1145/3174243.3174264

  18. [18]

    Lana Josipović, Andrea Guerrieri, and Paolo Ienne. 2020. Invited Tutorial: Dynamatic: From C/C++ to Dynamically Scheduled Circuits. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 1–10. https://doi.org/10.1145/3373087.3375391

  19. [19]

    Rui Li, Lincoln Berkley, Yihang Yang, and Rajit Manohar. 2021. Fluid: An Asynchronous High-level Synthesis Tool for Complex Program Structures. In Proceedings of the IEEE International Symposium on Asynchronous Circuits and Systems. 1–8. https://doi.org/10.1109/ASYNC48570.2021.00009

  20. [20]

    Jiantao Liu, Carmine Rizzi, and Lana Josipović. 2022. Load-Store Queue Sizing for Efficient Dataflow Circuits. In International Conference on Field-Programmable Technology. 1–9. https://doi.org/10.1109/ICFPT56656.2022.9974425

  21. [21]

    David Metz, Nico Reissmann, and Magnus Själander. 2024. R-HLS: An IR for Dynamic High-Level Synthesis and Memory Disambiguation based on Regions and State Edges. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. https://doi.org/10.1145/3676536.3676671

  22. [22]

    Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan Sankaralingam. 2017. Stream-Dataflow Acceleration. In Proceedings of the International Symposium on Computer Architecture. 416–429. https://doi.org/10.1145/3079856.3080255

  23. [23]

    NVIDIA. 2025. CUDA C++ Programming Guide. Technical Report

  24. [24]

    Michael Pellauer, Yakun Sophia Shao, Jason Clemons, Neal Crago, Kartik Hegde, Rangharajan Venkatesan, Stephen W. Keckler, Christopher W. Fletcher, and Joel Emer. 2019. Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration. In Proceedings of the Architectural Support for Programming Languages and Operating Systems. 1...

  25. [25]

    Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. MachSuite: Benchmarks for accelerator design and customized architectures. In Proceedings of the IEEE International Symposium on Workload Characterization. 110–119. https://doi.org/10.1109/IISWC.2014.6983050

  26. [26]

    Nico Reissmann, Jan Christian Meyer, Helge Bahmann, and Magnus Själander. 2020. RVSDG: An Intermediate Representation for Optimizing Compilers. ACM Transactions on Embedded Computing Systems 19 (Dec. 2020), 49:1–49:28. Issue 6. https://doi.org/10.1145/3391902

  27. [27]

    Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. 2011. DRAMSim2: A Cycle Accurate Memory System Simulator. IEEE Computer Architecture Letters 10 (Jan. 2011), 16–19. Issue 1. https://doi.org/10.1109/L-CA.2011.4

  28. [28]

    Santosh Shetty and Benjamin Camon Schafer. 2021. Enabling the Design of Behavioral Systems-on-Chip. In Proceedings of the ACM/IEEE Design Automation Conference. 331–336. https://doi.org/10.1109/DAC18074.2021.9586263

  29. [29]

    James E. Smith. 1982. Decoupled access/execute computer architectures. ACM SIGARCH Computer Architecture News 10 (April 1982), 112–119. Issue 3. https://doi.org/10.1145/1067649.801719

  31. [31]

    Masayuki Usui and Shinya Takamaeda-Yamazaki. 2023. High-Level Synthesis of Memory Systems for Decoupled Data Orchestration. In Proceedings of the Applied Reconfigurable Computing. 3–18. https://doi.org/10.1007/978-3-031-42921-7_1

  32. [32]

    Wm A. Wulf and Sally A. McKee. 1995. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News 23 (1995), 20–24. Issue 1. https://doi.org/10.1145/216585.216588

  33. [33]

    Jiahui Xu and Lana Josipovic. 2025. CRUSH: A Credit-Based Approach for Functional Unit Sharing in Dynamically Scheduled HLS. In Proceedings of the Architectural Support for Programming Languages and Operating Systems. 249–263. https://doi.org/10.1145/3669940.3707273

  34. [34]

    Hanchen Ye, Cong Hao, Jianyi Cheng, Hyunmin Jeong, Jack Huang, Stephen Neuendorffer, and Deming Chen. 2022. ScaleHLS: A New Scalable High-Level Synthesis Framework on Multi-Level Intermediate Representation. In Proceedings of the International Symposium High-Performance Computer Architecture. 741–755. https://doi.org/10.1109/HPCA53966.2022.00060