DAE4HLS: Exposing Memory-Level Parallelism for High-Level Synthesis using Explicit Decoupling

David Metz; Magnus Själander

arxiv: 2605.23549 · v1 · pith:VEXQDCZPnew · submitted 2026-05-22 · 💻 cs.AR


Pith reviewed 2026-05-25 02:27 UTC · model grok-4.3

classification 💻 cs.AR

keywords: high-level synthesis · decoupled access-execute · memory-level parallelism · AXI interfaces · irregular memory access · performance acceleration · Vitis HLS


The pith

Explicit decoupling of memory requests and responses in high-level synthesis unlocks memory-level parallelism that compilers miss and delivers 10-79x speedups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-level synthesis works well only for simple sequential accesses that can be turned into bursts or for small datasets that fit in scratchpads, leaving complex access patterns on large data out of reach. The paper introduces a programming model that lets the programmer explicitly separate memory requests from the responses that the computation consumes. This model is realized by repurposing the standard AXI stream and AXI burst interfaces inside the commercial Vitis HLS flow and is also applied to a dynamic HLS framework. The central claim is that explicit decoupling supplies the missing memory-level parallelism and produces large performance gains without new hardware interfaces.

Core claim

The paper presents DAE4HLS, a decoupled access-execute paradigm whose programming model explicitly separates memory requests from responses. Implemented by repurposing the existing AXI stream and AXI burst interfaces, the model lets both static Vitis HLS and a dynamic HLS framework expose memory-level parallelism that automatic compilers cannot discover. The reported result is a speedup between 10x and 79x on the target workloads.

What carries the argument

The DAE4HLS explicit-decoupling programming model that repurposes AXI stream and burst interfaces to separate request and response streams.
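To make the benefit of separating requests from responses concrete, the following is a minimal timing sketch, not the paper's implementation and not the Vitis HLS API. It assumes a toy memory with a fixed latency, in-order responses, and at most one request issued and one response consumed per cycle. The function name `cycles_to_consume` and all parameter values are hypothetical.

```python
from collections import deque

def cycles_to_consume(num_loads, latency, max_outstanding):
    """Count cycles until all responses are consumed in a toy memory model.

    Assumptions (illustrative only): fixed latency, in-order responses,
    one request issued and one response consumed per cycle at most.
    """
    in_flight = deque()  # arrival cycle of each outstanding request
    cycle = 0
    issued = 0
    consumed = 0
    while consumed < num_loads:
        # Execute side: consume the oldest response once it has arrived.
        if in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
            consumed += 1
        # Access side: issue a new request while a slot is free.
        if issued < num_loads and len(in_flight) < max_outstanding:
            in_flight.append(cycle + latency)
            issued += 1
        cycle += 1
    return cycle

# Coupled behaviour: only one request may be outstanding at a time.
coupled = cycles_to_consume(num_loads=8, latency=10, max_outstanding=1)
# Decoupled behaviour: requests run ahead of the consuming loop.
decoupled = cycles_to_consume(num_loads=8, latency=10, max_outstanding=4)
print(coupled, decoupled)
```

In this model, a single outstanding request pays roughly one full latency per load, while allowing several outstanding requests overlaps the waits. Real AXI behaviour (burst lengths, arbitration, DRAM timing) is not captured here.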

If this is right

Applications with complex, non-sequential memory patterns on large datasets become candidates for HLS acceleration.
Designers can obtain the claimed speedups while continuing to use the vendor's standard AXI interfaces.
Dynamic HLS frameworks gain an additional mechanism for handling irregular workloads that static scheduling misses.
The performance range of 10-79x holds when the explicit decoupling is applied to both the static and dynamic tool flows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar explicit-decoupling annotations could be added to other commercial or open HLS compilers that already expose AXI-like ports.
The same separation of request and response phases might reduce stalls in FPGA accelerators for graph or sparse-matrix codes that currently require hand-written RTL.
A direct test would measure whether the same source-level changes produce comparable speedups when the DAE4HLS model is ported to a different dynamic scheduling engine.
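For the sparse-matrix case mentioned above, the structural change that decoupling implies can be sketched as follows. This is a software-only illustration under stated assumptions: the queues stand in for request and response channels, the in-order "memory" loop is a placeholder for the memory subsystem, and the function names are hypothetical rather than taken from the paper.

```python
from collections import deque

def sparse_row_dot_coupled(values, cols, x):
    """Coupled form: each indirect load x[c] is used immediately."""
    total = 0
    for v, c in zip(values, cols):
        total += v * x[c]
    return total

def sparse_row_dot_decoupled(values, cols, x):
    """Decoupled form: an access phase issues addresses, an execute phase
    consumes responses. Queues are stand-ins for hardware channels."""
    requests = deque()
    responses = deque()

    # Access phase: only computes and sends addresses, never waits on data.
    for c in cols:
        requests.append(c)

    # Placeholder memory: serves requests in order.
    while requests:
        responses.append(x[requests.popleft()])

    # Execute phase: consumes responses in the order they were requested.
    total = 0
    for v in values:
        total += v * responses.popleft()
    return total

values = [2, 3, 4]
cols = [0, 2, 5]
x = [1, 0, 5, 0, 0, 7]
print(sparse_row_dot_coupled(values, cols, x),
      sparse_row_dot_decoupled(values, cols, x))
```

Both functions compute the same result. The point of the second form is that the address-generating loop no longer depends on returned data, so in hardware it could run ahead and keep many requests in flight.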

Load-bearing premise

Repurposing the existing AXI stream and burst interfaces for explicit decoupling adds no prohibitive overhead and preserves compatibility inside the AMD Vitis HLS toolchain.

What would settle it

A workload with irregular large-dataset accesses where the DAE4HLS code either fails to compile under Vitis or produces no speedup over ordinary HLS on the same platform.

Figures

Figures reproduced from arXiv: 2605.23549 by David Metz, Magnus Själander.

**Figure 2.** State edges for stream and decouple in Listing 1.

**Figure 3.** Store state edge.

Excerpt from the surrounding text: "… memories. Instead, we equip accelerators with one AXI interface per pointer argument. To decouple the critical path of the accelerator from the memory subsystem, we place buffers on the AXI interfaces. We evaluate three different configurations for R-HLS. R-HLS and R-HLS Stream use the memory disambiguation by Metz et al. [21], with the changes to accommodate long latency memory operation…"

**Figure 4.** Overhead in cycle count for dae4hls over "golden" reference.

Excerpt from the surrounding text: "mergesort_opt provides a 74.8% speedup over mergesort, at a 55.5% LUT, 40% FF, and 100% BRAM increase. This seems like a worthwhile trade-off, considering memory traffic is also reduced, and as discussed below, the performance difference exceeds 90% for larger array sizes. For R-HLS Decoupled, binsearch_for seems preferable over binsearch, since …"

Original abstract

High-level synthesis (HLS) performs well for simple memory access patterns, such as for sequential accesses that can be turned into bursts, or for memory accesses into small datasets that can be stored in scratchpads. This limits HLS to accelerating only the low-hanging fruit, where memory-level parallelism is either trivially abundant, due to simple access patterns, or latency is low, due to the small dataset. Applications with more complex access patterns on large datasets would also benefit from acceleration, and would especially benefit from the reduction in design and verification effort that HLS promises. In this paper, we present DAE4HLS, a decoupled access-execute (DAE) paradigm for HLS. We propose a new programming model for explicitly decoupling requests and responses, which unlocks memory-level parallelism that otherwise cannot be automatically provided by a compiler. We apply the DAE4HLS paradigm to the commercial AMD Vitis HLS toolchain and show that the existing AXI stream and AXI burst interfaces can be repurposed for explicit decoupling. We further apply the paradigm to a dynamic-HLS framework, which is better suited for handling irregular workloads as compared to statically scheduled HLS. We show that support for explicit decoupling improves the performance and achieves a total speedup of 10-79$\times$.
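The link between latency and memory-level parallelism in the abstract can be made quantitative with Little's law: the number of requests that must be in flight equals the memory latency multiplied by the desired request rate. The sketch below only applies that general rule. The function name and the numbers are hypothetical and are not taken from the paper.

```python
import math

def required_outstanding(latency_cycles, requests_per_cycle):
    """Little's law: in-flight requests = latency x throughput.

    Returns the smallest whole number of outstanding requests that
    sustains the given request rate at the given latency.
    """
    return math.ceil(latency_cycles * requests_per_cycle)

# Hypothetical example: a 100-cycle memory, one request per cycle.
print(required_outstanding(100, 1.0))
# Hypothetical example: the same memory at one request every four cycles.
print(required_outstanding(100, 0.25))
```

Under this rule, a design that allows only one outstanding request cannot sustain more than one request per memory latency, which is the situation that explicit decoupling is meant to avoid.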

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a practical explicit decoupling model for Vitis HLS and dynamic frameworks, but the 10-79x claims rest on missing experimental details.


The core contribution is a programming model that lets HLS users explicitly separate memory requests from responses, applied both to commercial Vitis by repurposing AXI stream and burst ports and to a dynamic HLS setup. This is a direct, usable extension of the older decoupled access-execute idea to tools that currently handle only simple patterns well. The write-up does a clear job laying out why standard HLS stalls on large, irregular datasets and how the explicit model can expose more memory-level parallelism without changing the underlying hardware interfaces much. That part is straightforward and worth noting for anyone already using these tools.

The main weakness is the performance evidence. The abstract states 10-79x speedups but supplies no benchmark list, baseline descriptions, cycle counts, bandwidth measurements, or checks on whether the AXI repurposing adds stalls or reduces burst efficiency. Without those, the central claim cannot be evaluated, and the stress-test point about unmeasured interface overheads remains open. If the full paper contains the missing methodology and ablations, the numbers could stand; right now they do not.

This work is aimed at FPGA and HLS practitioners who already fight memory patterns in Vitis or similar flows. A reader in that niche could pick up the programming model and try it even before the speedups are confirmed. The paper shows clear thinking about the problem and honest use of existing interfaces rather than inventing new hardware.

I would send it to peer review so referees can check the experimental section, but the authors should expect pointed questions on methodology and overhead characterization.

Referee Report

2 major / 2 minor

Summary. The paper introduces DAE4HLS, a decoupled access-execute (DAE) paradigm for high-level synthesis. It defines a new programming model that allows explicit decoupling of memory requests and responses to expose memory-level parallelism (MLP) beyond what automatic compiler analysis can achieve for complex access patterns on large datasets. The approach is instantiated in two settings: (1) by repurposing existing AXI stream and burst interfaces within the commercial AMD Vitis HLS toolchain, and (2) within a dynamic-HLS framework. Experimental results are reported to show that explicit decoupling yields speedups of 10-79×.

Significance. If the experimental claims hold after proper characterization of interface overheads, the work would meaningfully extend the reach of HLS to irregular, memory-bound workloads that currently fall outside the “low-hanging fruit” that HLS compilers handle automatically. The pragmatic decision to reuse standard AXI interfaces rather than requiring new hardware primitives is a practical strength that could ease adoption.

major comments (2)

[Abstract, §4] Abstract and §4 (presumed experimental section): the central performance claim of 10-79× speedup is presented without any reported cycle counts, bandwidth utilization, or ablation that isolates the cost of repurposing AXI stream/burst interfaces from the decoupling transformation itself. The skeptic's concern about unmeasured stalls or reduced burst efficiency is therefore unaddressed; this directly undermines the claim that the repurposing can be done “without introducing prohibitive overheads.”
[§3] §3 (programming model and interface mapping): the manuscript must demonstrate that the explicit decoupling primitives map onto AXI without altering the semantics or timing of the original HLS-generated hardware. No timing diagrams, resource overhead tables, or comparison against native AXI usage are referenced in the provided abstract; this information is load-bearing for the practicality argument.

minor comments (2)

[Abstract] The abstract should list the specific benchmarks, input sizes, and baseline configurations (e.g., plain Vitis HLS, manual RTL, other DAE approaches) so that the speedup range can be interpreted.
[§2] Notation for the new decoupling primitives (request/response channels, decoupling buffers) should be introduced with a small code example or diagram in the main text rather than only in supplementary material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of DAE4HLS to extend HLS to more complex workloads. We address each major comment below and commit to revisions that strengthen the experimental characterization and interface details.


Referee: [Abstract, §4] Abstract and §4 (presumed experimental section): the central performance claim of 10-79× speedup is presented without any reported cycle counts, bandwidth utilization, or ablation that isolates the cost of repurposing AXI stream/burst interfaces from the decoupling transformation itself. The skeptic's concern about unmeasured stalls or reduced burst efficiency is therefore unaddressed; this directly undermines the claim that the repurposing can be done “without introducing prohibitive overheads.”

Authors: We agree that the current presentation of the 10-79× speedups lacks the requested low-level metrics. In the revised manuscript we will augment §4 with cycle counts, bandwidth utilization figures, and an ablation that isolates decoupling gains from any AXI-repurposing overheads, directly addressing concerns about stalls and burst efficiency. revision: yes
Referee: [§3] §3 (programming model and interface mapping): the manuscript must demonstrate that the explicit decoupling primitives map onto AXI without altering the semantics or timing of the original HLS-generated hardware. No timing diagrams, resource overhead tables, or comparison against native AXI usage are referenced in the provided abstract; this information is load-bearing for the practicality argument.

Authors: The manuscript describes the AXI mapping at a high level but does not yet include the requested demonstrations. We will revise §3 to add timing diagrams, resource-overhead tables, and direct comparisons against native AXI usage, confirming that semantics and timing remain unchanged. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on experimental implementation.

full rationale

The paper introduces a DAE programming model for HLS, repurposes existing AXI interfaces, and reports 10-79× speedups from applying the paradigm to Vitis HLS and a dynamic framework. No equations, fitted parameters, or self-citations are presented as load-bearing derivations; performance numbers are direct experimental outcomes, not reductions to inputs by construction. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution is a systems engineering and programming model effort relying on standard HLS toolchain assumptions rather than mathematical derivations.

axioms (1)

domain assumption Standard assumptions of HLS toolchains regarding interface compatibility and repurposing
The paper relies on the ability to repurpose AXI interfaces without breaking existing functionality.



Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

[1] AMD. 2025. Vitis Reference Guide. Technical Report.

[2] Mikhail Asiatici. 2021. Miss-Optimized Memory Systems: Turning Thousands of Outstanding Misses into Reuse Opportunities. Ph.D. Dissertation. EPFL. https://doi.org/10.5075/epfl-thesis-8050

[3] Suhail Basalama and Jason Cong. 2025. Stream-HLS: Towards Automatic Dataflow Acceleration. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 103–114. https://doi.org/10.1145/3706628.3708878

[4] George Charitopoulos, Charalampos Vatsolakis, Grigorios Chrysos, and Dionisios N. Pnevmatikatos. 2018. A decoupled access-execute architecture for reconfigurable accelerators. In Proceedings of the ACM International Conference on Computing Frontiers. 244–247. https://doi.org/10.1145/3203217.3203267

[5] Tao Chen and G. Edward Suh. 2016. Efficient data supply for hardware accelerators with prefetching and access/execute decoupling. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture. 1–12. https://doi.org/10.1109/MICRO.2016.7783749

[6] Jianyi Cheng, Lana Josipović, George A. Constantinides, and John Wickerson. 2022. Dynamic Inter-Block Scheduling for HLS. In Proceedings of the Conference on Field Programmable Logic and Applications. 243–252. https://doi.org/10.1109/FPL57034.2022.00045

[7] Jianyi Cheng, Lana Josipović, John Wickerson, and George A. Constantinides. 2023. Parallelising Control Flow in Dynamic-scheduling High-level Synthesis. ACM Transactions on Reconfigurable Technology and Systems 16 (Dec. 2023), 1–32. Issue 4. https://doi.org/10.1145/3599973

[8] Shaoyi Cheng and John Wawrzynek. 2014. Architectural synthesis of computational pipelines with decoupled memory access. In FPT. 83–90. https://doi.org/10.1109/FPT.2014.7082758

[9] William J. Dally, Yatish Turakhia, and Song Han. 2020. Domain-specific hardware accelerators. Commun. ACM 63 (June 2020), 48–57. Issue 7. https://doi.org/10.1145/3361682

[10] Johannes de Fine Licht, Maciej Besta, Simon Meierhans, and Torsten Hoefler. 2021. Transformations of High-Level Synthesis Codes for High-Performance Computing. IEEE Transactions on Parallel and Distributed Systems 32 (May 2021), 1014–1029. Issue 5. https://doi.org/10.1109/TPDS.2020.3039409

[11] Ayatallah Elakhras, Andrea Guerrieri, Lana Josipović, and Paolo Ienne. 2022. Unleashing Parallelism in Elastic Circuits with Faster Token Delivery. In Proceedings of the Conference on Field Programmable Logic and Applications. 253–261. https://doi.org/10.1109/FPL57034.2022.00046

[12] Ayatallah Elakhras, Riya Sawhney, Andrea Guerrieri, Lana Josipovic, and Paolo Ienne. 2023. Straight to the Queue: Fast Load-Store Queue Allocation in Dataflow Circuits. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 39–45. https://doi.org/10.1145/3543622.3573050

[13] DeepSeek-AI et al. 2025. DeepSeek-V3 Technical Report. https://doi.org/10.48550/arXiv.2412.19437

[14] Shane T. Fleming and David B. Thomas. 2017. Using Runahead Execution to Hide Memory Latency in High Level Synthesis. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines. 109–116. https://doi.org/10.1109/FCCM.2017.33

[15] Khronos OpenCL Working Group. 2025. The OpenCL Specification. Standard. Khronos. https://registry.khronos.org/OpenCL/specs/3.0-unified/pdf/OpenCL_API.pdf

[16] Lana Josipovic, Philip Brisk, and Paolo Ienne. 2017. An Out-of-Order Load-Store Queue for Spatial Computing. ACM Transactions on Embedded Computing Systems 16 (Oct. 2017), 1–19. Issue 5s. https://doi.org/10.1145/3126525

[17] Lana Josipović, Radhika Ghosal, and Paolo Ienne. 2018. Dynamically Scheduled High-level Synthesis. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 127–136. https://doi.org/10.1145/3174243.3174264

[18] Lana Josipović, Andrea Guerrieri, and Paolo Ienne. 2020. Invited Tutorial: Dynamatic: From C/C++ to Dynamically Scheduled Circuits. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 1–10. https://doi.org/10.1145/3373087.3375391

[19] Rui Li, Lincoln Berkley, Yihang Yang, and Rajit Manohar. 2021. Fluid: An Asynchronous High-level Synthesis Tool for Complex Program Structures. In Proceedings of the IEEE International Symposium on Asynchronous Circuits and Systems. 1–8. https://doi.org/10.1109/ASYNC48570.2021.00009

[20] Jiantao Liu, Carmine Rizzi, and Lana Josipović. 2022. Load-Store Queue Sizing for Efficient Dataflow Circuits. In International Conference on Field-Programmable Technology. 1–9. https://doi.org/10.1109/ICFPT56656.2022.9974425

[21] David Metz, Nico Reissmann, and Magnus Själander. 2024. R-HLS: An IR for Dynamic High-Level Synthesis and Memory Disambiguation based on Regions and State Edges. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. https://doi.org/10.1145/3676536.3676671

[22] Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan Sankaralingam. 2017. Stream-Dataflow Acceleration. In Proceedings of the International Symposium on Computer Architecture. 416–429. https://doi.org/10.1145/3079856.3080255

[23] NVIDIA. 2025. CUDA C++ Programming Guide. Technical Report.

[24] Michael Pellauer, Yakun Sophia Shao, Jason Clemons, Neal Crago, Kartik Hegde, Rangharajan Venkatesan, Stephen W. Keckler, Christopher W. Fletcher, and Joel Emer. 2019. Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration. In Proceedings of the Architectural Support for Programming Languages and Operating Systems. 1...

[25] Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. MachSuite: Benchmarks for accelerator design and customized architectures. In Proceedings of the IEEE International Symposium on Workload Characterization. 110–119. https://doi.org/10.1109/IISWC.2014.6983050

[26] Nico Reissmann, Jan Christian Meyer, Helge Bahmann, and Magnus Själander. 2020. RVSDG: An Intermediate Representation for Optimizing Compilers. ACM Transactions on Embedded Computing Systems 19 (Dec. 2020), 49:1–49:28. Issue 6. https://doi.org/10.1145/3391902

[27] Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. 2011. DRAMSim2: A Cycle Accurate Memory System Simulator. IEEE Computer Architecture Letters 10 (Jan. 2011), 16–19. Issue 1. https://doi.org/10.1109/L-CA.2011.4

[28] Santosh Shetty and Benjamin Camon Schafer. 2021. Enabling the Design of Behavioral Systems-on-Chip. In Proceedings of the ACM/IEEE Design Automation Conference. 331–336. https://doi.org/10.1109/DAC18074.2021.9586263

[29] James E. Smith. 1982. Decoupled access/execute computer architectures. ACM SIGARCH Computer Architecture News 10 (April 1982), 112–119. Issue

[30] https://doi.org/10.1145/1067649.801719

[31] Masayuki Usui and Shinya Takamaeda-Yamazaki. 2023. High-Level Synthesis of Memory Systems for Decoupled Data Orchestration. In Proceedings of the Applied Reconfigurable Computing. 3–18. https://doi.org/10.1007/978-3-031-42921-7_1

[32] Wm A. Wulf and Sally A. McKee. 1995. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News 23 (1995), 20–24. Issue 1. https://doi.org/10.1145/216585.216588

[33] Jiahui Xu and Lana Josipovic. 2025. CRUSH: A Credit-Based Approach for Functional Unit Sharing in Dynamically Scheduled HLS. In Proceedings of the Architectural Support for Programming Languages and Operating Systems. 249–263. https://doi.org/10.1145/3669940.3707273

[34] Hanchen Ye, Cong Hao, Jianyi Cheng, Hyunmin Jeong, Jack Huang, Stephen Neuendorffer, and Deming Chen. 2022. ScaleHLS: A New Scalable High-Level Synthesis Framework on Multi-Level Intermediate Representation. In Proceedings of the International Symposium High-Performance Computer Architecture. 741–755. https://doi.org/10.1109/HPCA53966.2022.00060
