arxiv: 2605.23630 · v1 · pith:QPSUF55Wnew · submitted 2026-05-22 · 💻 cs.AR

To Overlay or to Customize? Revisiting Architectural Choices in Heterogeneous Systems

Pith reviewed 2026-05-25 02:25 UTC · model grok-4.3

classification 💻 cs.AR
keywords overlay architecture · customized architecture · heterogeneous systems · autonomous driving · model switching · reconfiguration latency · bitstream reload · FPGA acceleration

The pith

Overlay architectures handle frequent model switches better than customized ones in autonomous driving under today's reconfiguration costs, but the advantage can reverse if reload overhead falls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares overlay-based and customized acceleration architectures for heterogeneous systems in an autonomous driving context. It models how often neural network models must switch, how long bitstream reload takes, workload variation, and efficiency needs. The analysis concludes that overlays currently reduce the penalty from rapid switching. Lower reload times would make customization more competitive for efficiency-focused cases, while more flexible overlays would widen their lead. The work shows that the preferred choice depends on how these parameters evolve with technology.

Core claim

Our analysis shows that overlay-based architecture is more suitable for highly frequent model switching under the state-of-the-art architecture. However, as bitstream reload overhead continues to reduce, customized architectures may become increasingly attractive, especially for workloads with efficiency requirements. Conversely, if overlay architectures become more capable and flexible, they may further expand their advantage over customized architectures. These observations provide design insights for future architectural design, and the optimal deployment strategy will be flipped according to the technique development.

What carries the argument

Trade-off analysis of overlay versus customized acceleration under varying model switching frequency, reconfiguration latency, workload variation, and architectural design in an autonomous driving scenario.

If this is right

  • High switching frequency favors overlay designs to avoid reload penalties under current technology.
  • Improvements that cut bitstream reload time will shift preference toward customized designs when efficiency matters most.
  • Greater flexibility and capability in overlays will increase their advantage over customization.
  • Deployment decisions must be revisited whenever reconfiguration hardware or overlay features advance.
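
The first two bullets can be illustrated with a toy break-even model. This is a sketch, not the paper's model: it assumes each model switch costs a customized design one bitstream reload, that an overlay pays a much smaller per-switch cost, and that the customized design runs each inference faster. Every name and number below is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Hypothetical per-design costs, in milliseconds."""
    inference_ms: float  # time to run one inference
    switch_ms: float     # time lost on each model switch


def busy_ms_per_second(design: Design, inferences_per_s: float, switches_per_s: float) -> float:
    """Accelerator time consumed per second of wall-clock time."""
    return inferences_per_s * design.inference_ms + switches_per_s * design.switch_ms


def break_even_switch_rate(custom: Design, overlay: Design, inferences_per_s: float) -> float:
    """Switch rate (per second) above which the overlay consumes less time.

    Solves busy(custom) == busy(overlay) for switches_per_s.
    """
    saved_per_inference = overlay.inference_ms - custom.inference_ms
    extra_per_switch = custom.switch_ms - overlay.switch_ms
    if extra_per_switch <= 0:
        raise ValueError("customized design must pay more per switch than overlay")
    return inferences_per_s * saved_per_inference / extra_per_switch


# Illustrative values only; none of these come from the paper.
custom = Design(inference_ms=4.0, switch_ms=50.0)
overlay = Design(inference_ms=6.0, switch_ms=1.0)
rate = break_even_switch_rate(custom, overlay, inferences_per_s=30.0)
print(f"overlay wins above {rate:.2f} switches/s")
```

Under this toy model, shrinking `custom.switch_ms` raises the break-even rate, so the customized design wins over a wider range of switching frequencies. That is the direction described in the second bullet.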

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Faster reconfiguration hardware could accelerate adoption of specialized accelerators across edge computing domains.
  • Similar frequency-versus-efficiency trade-offs likely appear in other dynamic workloads such as robotics or video analytics.
  • Hybrid designs that blend overlay flexibility with selective customization may emerge as a practical middle path.

Load-bearing premise

The comparison assumes particular practical values for switching frequency, reconfiguration latency, workload variation, and design parameters drawn from an autonomous driving scenario.

What would settle it

Collect measured switching frequencies, actual bitstream reload times, and end-to-end latency or energy numbers from a running autonomous driving platform and test whether they match the paper's predicted preference thresholds.
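
One way to organize such a check, sketched under assumptions: given timestamps of model switches logged on the platform and a measured reload time, estimate the observed switching rate and the fraction of wall-clock time lost to reloads. The log format, function names, and sample values are hypothetical and are not drawn from the paper.

```python
from typing import Sequence


def switch_rate_per_s(switch_times_s: Sequence[float]) -> float:
    """Average number of switches per second over the logged span."""
    if len(switch_times_s) < 2:
        raise ValueError("need at least two switch timestamps")
    span = switch_times_s[-1] - switch_times_s[0]
    if span <= 0:
        raise ValueError("timestamps must be increasing")
    # N timestamps delimit N - 1 intervals.
    return (len(switch_times_s) - 1) / span


def reload_time_fraction(rate_per_s: float, reload_ms: float) -> float:
    """Fraction of wall-clock time spent reloading bitstreams."""
    return rate_per_s * reload_ms / 1000.0


# Hypothetical log: switch timestamps in seconds.
log = [0.0, 0.5, 1.0, 1.5, 2.0]
rate = switch_rate_per_s(log)  # 4 intervals over 2 s
lost = reload_time_fraction(rate, reload_ms=50.0)
print(f"{rate:.1f} switches/s, {lost:.0%} of time in reload")
```

The measured rate can then be compared against whatever preference threshold the paper's analysis predicts for the same platform.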

Figures

Figures reproduced from arXiv: 2605.23630 by Peipei Zhou, Shixin Ji, Xingzhen Chen, Zheng Dong.

Figure 1: The timeline comparison on customized bitstream
Figure 2: The comparison between overlay and customization
Figure 1: When only comparing the execution time, the customized
Figure 4: The comparison between overlay and customization
Figure 5: The comparison between customization and overlay
Original abstract

In this work, we present a systematic study of this trade-off from a deployment-centric perspective, focusing on an autonomous driving scenario. Instead of treating overlay and customized acceleration as isolated design points, we analyze when each approach is preferable under practical conditions, including workload variation, architectural design, reconfiguration latency, and switching frequency. Our analysis shows that overlay-based architecture is more suitable for highly frequent model switching under the state-of-the-art architecture. However, as bitstream reload overhead continues to reduce, customized architectures may become increasingly attractive, especially for workloads with efficiency requirements. Conversely, if overlay architectures become more capable and flexible, they may further expand their advantage over customized architectures. These observations provide design insights for future architectural design, and the optimal deployment strategy will be flipped according to the technique development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a systematic, deployment-centric analysis of the trade-off between overlay-based and customized acceleration architectures in heterogeneous systems, using an autonomous driving scenario as the running example. It compares the two approaches under varying workload, architectural design, reconfiguration latency, and model-switching frequency. It concludes that overlays are preferable at high switching rates under current bitstream reload costs, that customized designs may become attractive as reload overhead falls, and that more capable overlays would instead widen the overlay advantage.

Significance. If the underlying parametric model is shown to be grounded in measured data, the work would supply concrete guidance for architects choosing between flexibility and efficiency in latency-sensitive, frequently reconfigured workloads. The emphasis on how technology trends (reload time, overlay capability) can invert the optimal choice is a useful framing beyond static comparisons.

major comments (1)
  1. [Abstract] Abstract (and the trade-off analysis section): the central claim that overlay architectures are more suitable for highly frequent model switching, with a flip point as bitstream reload overhead decreases, is derived from a parametric comparison whose inputs (reconfiguration latency, switching frequency, efficiency delta) are described only as 'practical conditions' for autonomous driving. No cited measurements, workload traces, sensitivity ranges, or error bars are supplied, so the reported thresholds are sensitive to unvalidated assumptions; a factor-of-3 deviation in any input would move the crossover outside the operating regime considered.
minor comments (1)
  1. [Abstract] The final sentence of the abstract ('the optimal deployment strategy will be flipped according to the technique development') is grammatically awkward and should be rephrased for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The single major comment identifies a valid concern about the grounding of the parametric inputs. We address it below and will revise the manuscript to strengthen this aspect.

Point-by-point responses
  1. Referee: [Abstract] Abstract (and the trade-off analysis section): the central claim that overlay architectures are more suitable for highly frequent model switching, with a flip point as bitstream reload overhead decreases, is derived from a parametric comparison whose inputs (reconfiguration latency, switching frequency, efficiency delta) are described only as 'practical conditions' for autonomous driving. No cited measurements, workload traces, sensitivity ranges, or error bars are supplied, so the reported thresholds are sensitive to unvalidated assumptions; a factor-of-3 deviation in any input would move the crossover outside the operating regime considered.

    Authors: We agree that the current presentation of parameters as 'practical conditions' lacks sufficient explicit grounding and sensitivity analysis, which weakens the robustness of the reported thresholds. In the revised manuscript we will: (1) add citations to representative FPGA reconfiguration measurements and autonomous-driving workload studies that informed the baseline values; (2) introduce a new subsection in the trade-off analysis that reports the parameter ranges considered and performs a sensitivity study (including factor-of-3 deviations) to show how the crossover point shifts; and (3) qualify the abstract and conclusions to reflect the sensitivity results. These additions will make the operating regimes and the direction of the technology-trend conclusions more transparent.

    revision: yes
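
A sensitivity study of the kind promised in item (2) could be organized as below. This is a minimal sketch assuming a simple linear cost model in which the break-even switching rate equals the inference rate, times the per-inference time saved by customization, divided by the extra per-switch cost of a bitstream reload. The model, the parameter names, and the baseline values are assumptions for illustration, not the authors' method.

```python
from itertools import product


def break_even_rate(infer_rate: float, saved_ms: float, extra_switch_ms: float) -> float:
    """Switches/s at which overlay and customized designs cost the same time."""
    return infer_rate * saved_ms / extra_switch_ms


# Hypothetical baseline; not taken from the paper.
BASE = {"infer_rate": 30.0, "saved_ms": 2.0, "extra_switch_ms": 49.0}
FACTORS = (1 / 3, 1.0, 3.0)  # factor-of-3 deviation in each direction


def sweep(base: dict, factors=FACTORS) -> list:
    """Evaluate the break-even rate over every combination of scaled inputs."""
    names = list(base)
    rows = []
    for scales in product(factors, repeat=len(names)):
        params = {n: base[n] * s for n, s in zip(names, scales)}
        rows.append((scales, break_even_rate(**params)))
    return rows


rows = sweep(BASE)
rates = [r for _, r in rows]
print(f"baseline: {break_even_rate(**BASE):.2f} switches/s")
print(f"range over sweep: {min(rates):.2f} to {max(rates):.2f} switches/s")
```

Because the three inputs enter multiplicatively in this assumed model, simultaneous factor-of-3 deviations can move the crossover by up to a factor of 27 in either direction. A real study would replace the baseline with cited measurements.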

Circularity Check

0 steps flagged

No circularity; parametric analysis is self-contained

full rationale

The abstract and skeptic summary describe a deployment-centric trade-off study that compares overlay vs. customized architectures under stated assumptions about workload variation, reconfiguration latency, and switching frequency in an autonomous-driving scenario. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. The central observations (overlay preferable at high switching frequency; preference may flip as overhead falls) are presented as outcomes of the parametric comparison rather than reducing to the inputs by construction. This matches the default expectation of a non-circular modeling paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.
