To Overlay or to Customize? Revisiting Architectural Choices in Heterogeneous Systems

Peipei Zhou; Shixin Ji; Xingzhen Chen; Zheng Dong

arxiv: 2605.23630 · v1 · pith:QPSUF55Wnew · submitted 2026-05-22 · 💻 cs.AR

Xingzhen Chen, Shixin Ji, Zheng Dong, Peipei Zhou

Pith reviewed 2026-05-25 02:25 UTC · model grok-4.3

keywords: overlay architecture, customized architecture, heterogeneous systems, autonomous driving, model switching, reconfiguration latency, bitstream reload, FPGA acceleration

The pith

Overlay architectures handle frequent model switches better than customized ones in autonomous driving under today's reconfiguration costs, but the advantage can reverse if reload overhead falls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares overlay-based and customized acceleration architectures for heterogeneous systems in an autonomous driving context. It models how often neural network models must switch, how long bitstream reload takes, workload variation, and efficiency needs. The analysis concludes that overlays currently reduce the penalty from rapid switching. Lower reload times would make customization more competitive for efficiency-focused cases, while more flexible overlays would widen their lead. The work shows that the preferred choice depends on how these parameters evolve with technology.

Core claim

Our analysis shows that overlay-based architecture is more suitable for highly frequent model switching under the state-of-the-art architecture. However, as bitstream reload overhead continues to reduce, customized architectures may become increasingly attractive, especially for workloads with efficiency requirements. Conversely, if overlay architectures become more capable and flexible, they may further expand their advantage over customized architectures. These observations provide design insights for future architectural design, and the optimal deployment strategy will be flipped according to the technique development.

What carries the argument

Trade-off analysis of overlay versus customized acceleration under varying model switching frequency, reconfiguration latency, workload variation, and architectural design in an autonomous driving scenario.

If this is right

High switching frequency favors overlay designs to avoid reload penalties under current technology.
Improvements that cut bitstream reload time will shift preference toward customized designs when efficiency matters most.
Greater flexibility and capability in overlays will increase their advantage over customization.
Deployment decisions must be revisited whenever reconfiguration hardware or overlay features advance.
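
The conditional structure of these implications can be made concrete with a toy cost model. The sketch below is not the paper's model. It assumes a simple linear cost in which a customized design pays one bitstream reload per model switch but runs each inference faster, while an overlay never reloads. Every name and number in it (`Scenario`, `exec_overlay_ms`, `exec_custom_ms`, `reload_ms`, and the example values) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scenario:
    """Hypothetical workload parameters (illustrative, not from the paper)."""
    exec_overlay_ms: float   # per-inference latency on the overlay
    exec_custom_ms: float    # per-inference latency on a customized design
    reload_ms: float         # bitstream reload cost paid on each model switch
    inferences_per_s: float  # inference rate
    switches_per_s: float    # model switching frequency


def busy_time_ms(s: Scenario) -> dict:
    """Accelerator busy time per second of wall-clock time for each option."""
    overlay = s.inferences_per_s * s.exec_overlay_ms
    custom = (s.inferences_per_s * s.exec_custom_ms
              + s.switches_per_s * s.reload_ms)
    return {"overlay": overlay, "custom": custom}


def preferred(s: Scenario) -> str:
    """Pick the option with the lower busy time under this toy model."""
    t = busy_time_ms(s)
    return "overlay" if t["overlay"] <= t["custom"] else "custom"


def break_even_switches_per_s(s: Scenario) -> float:
    """Switching frequency at which both options cost the same."""
    saved_ms = s.inferences_per_s * (s.exec_overlay_ms - s.exec_custom_ms)
    return saved_ms / s.reload_ms


# Assumed example: slow reload today, then a hypothetical 10x faster reload.
today = Scenario(exec_overlay_ms=10.0, exec_custom_ms=6.0,
                 reload_ms=100.0, inferences_per_s=30.0, switches_per_s=5.0)
faster_reload = replace(today, reload_ms=10.0)

print(preferred(today), break_even_switches_per_s(today))
print(preferred(faster_reload), break_even_switches_per_s(faster_reload))
```

With these assumed numbers the overlay wins at 5 switches per second when a reload costs 100 ms, and the preference flips to the customized design once the reload cost drops to 10 ms. The flip happens because the break-even switching frequency scales inversely with reload time.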

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Faster reconfiguration hardware could accelerate adoption of specialized accelerators across edge computing domains.
Similar frequency-versus-efficiency trade-offs likely appear in other dynamic workloads such as robotics or video analytics.
Hybrid designs that blend overlay flexibility with selective customization may emerge as a practical middle path.

Load-bearing premise

The comparison assumes particular practical values for switching frequency, reconfiguration latency, workload variation, and design parameters drawn from an autonomous driving scenario.

What would settle it

Collect measured switching frequencies, actual bitstream reload times, and end-to-end latency or energy numbers from a running autonomous driving platform and test whether they match the paper's predicted preference thresholds.

Figures

Figures reproduced from arXiv: 2605.23630 by Peipei Zhou, Shixin Ji, Xingzhen Chen, Zheng Dong.

**Figure 2.** The comparison between overlay and customization

**Figure 1.** When only comparing the execution time, the customized

**Figure 4.** The comparison between overlay and customization

**Figure 5.** The comparison between customization and overlay

Original abstract

In this work, we present a systematic study of this trade-off from a deployment-centric perspective, focusing on an autonomous driving scenario. Instead of treating overlay and customized acceleration as isolated design points, we analyze when each approach is preferable under practical conditions, including workload variation, architectural design, reconfiguration latency, and switching frequency. Our analysis shows that overlay-based architecture is more suitable for highly frequent model switching under the state-of-the-art architecture. However, as bitstream reload overhead continues to reduce, customized architectures may become increasingly attractive, especially for workloads with efficiency requirements. Conversely, if overlay architectures become more capable and flexible, they may further expand their advantage over customized architectures. These observations provide design insights for future architectural design, and the optimal deployment strategy will be flipped according to the technique development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper maps overlay vs customized FPGA choices for autonomous driving workloads but its preference-flip claims rest on unvalidated inputs for switching frequency and reload latency.

Hi, the main thing here is a deployment-focused comparison of overlay versus customized architectures in heterogeneous systems for autonomous driving. It concludes that overlays work better at high model-switching rates under current bitstream reload costs, that customized designs could gain ground if reload overhead shrinks, and that overlays could extend their lead if they become more flexible; it also notes that the optimal choice can flip with technology trends. That directional framing around workload variation, reconfiguration latency, and efficiency needs is the useful part; it treats the architectures as points on a continuum rather than fixed winners. The paper does not introduce new math or a general framework, just applies the known trade-off to this one scenario with some parameter discussion.

The soft spot is exactly the one the stress-test flags. The suitability thresholds come from a parametric model whose inputs for practical AD conditions lack cited measurements, workload traces, or sensitivity ranges. If those latency or frequency numbers are off by a factor of three or four, the reported crossover point moves, turning the analysis into an illustration rather than a robust result. Without independent data backing the assumptions, the central claim stays conditional on the model.

This is aimed at engineers designing accelerators for real-time embedded AI, particularly FPGA users in vehicles. A reader already working on similar deployment questions might extract some parameter intuition, but it will not shift broader understanding.

It deserves peer review because the topic is relevant and the framing is straightforward, though any referee would need to check the methods section and data sources before the conclusions can be taken as grounded.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a systematic, deployment-centric analysis of the trade-off between overlay-based and customized acceleration architectures in heterogeneous systems, using an autonomous driving scenario as the running example. It compares the two approaches under parameters for workload variation, architectural design, reconfiguration latency, and model-switching frequency. It concludes that overlays are preferable at high switching rates under current bitstream reload costs, that customized designs may become attractive as reload overhead falls, and that further overlay improvements could instead widen the overlays' advantage.

Significance. If the underlying parametric model is shown to be grounded in measured data, the work would supply concrete guidance for architects choosing between flexibility and efficiency in latency-sensitive, frequently reconfigured workloads. The emphasis on how technology trends (reload time, overlay capability) can invert the optimal choice is a useful framing beyond static comparisons.

major comments (1)

[Abstract] Abstract (and the trade-off analysis section): the central claim that overlay architectures are more suitable for highly frequent model switching, with a flip point as bitstream reload overhead decreases, is derived from a parametric comparison whose inputs (reconfiguration latency, switching frequency, efficiency delta) are described only as 'practical conditions' for autonomous driving. No cited measurements, workload traces, sensitivity ranges, or error bars are supplied, so the reported thresholds are sensitive to unvalidated assumptions; a factor-of-3 deviation in any input would move the crossover outside the operating regime considered.

minor comments (1)

[Abstract] The final sentence of the abstract ('the optimal deployment strategy will be flipped according to the technique development') is grammatically awkward and should be rephrased for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The single major comment identifies a valid concern about the grounding of the parametric inputs. We address it below and will revise the manuscript to strengthen this aspect.

Referee: [Abstract] Abstract (and the trade-off analysis section): the central claim that overlay architectures are more suitable for highly frequent model switching, with a flip point as bitstream reload overhead decreases, is derived from a parametric comparison whose inputs (reconfiguration latency, switching frequency, efficiency delta) are described only as 'practical conditions' for autonomous driving. No cited measurements, workload traces, sensitivity ranges, or error bars are supplied, so the reported thresholds are sensitive to unvalidated assumptions; a factor-of-3 deviation in any input would move the crossover outside the operating regime considered.

Authors: We agree that the current presentation of parameters as 'practical conditions' lacks sufficient explicit grounding and sensitivity analysis, which weakens the robustness of the reported thresholds. In the revised manuscript we will: (1) add citations to representative FPGA reconfiguration measurements and autonomous-driving workload studies that informed the baseline values; (2) introduce a new subsection in the trade-off analysis that reports the parameter ranges considered and performs a sensitivity study (including factor-of-3 deviations) to show how the crossover point shifts; and (3) qualify the abstract and conclusions to reflect the sensitivity results. These additions will make the operating regimes and the direction of the technology-trend conclusions more transparent.

revision: yes
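
A sensitivity study of the kind described in the response could start from something as small as the sketch below. It is not the authors' planned analysis. It assumes a linear break-even model in which the crossover switching frequency equals the inference rate times the per-inference latency gap, divided by the reload time. The parameter names (`exec_gap_ms`, `inferences_per_s`, `reload_ms`) and the baseline values are hypothetical.

```python
def break_even_switches_per_s(exec_gap_ms: float,
                              inferences_per_s: float,
                              reload_ms: float) -> float:
    """Crossover switching frequency under an assumed linear cost model.

    exec_gap_ms is the per-inference latency the customized design saves
    relative to the overlay; reload_ms is paid once per model switch.
    """
    return inferences_per_s * exec_gap_ms / reload_ms


def sensitivity(baseline: dict, factor: float = 3.0) -> dict:
    """Scale each input down and up by `factor`, one input at a time."""
    results = {}
    for name, value in baseline.items():
        for scale in (1.0 / factor, factor):
            perturbed = dict(baseline)
            perturbed[name] = value * scale
            results[(name, scale)] = break_even_switches_per_s(**perturbed)
    return results


# Hypothetical baseline values, chosen only to make the arithmetic visible.
baseline = {"exec_gap_ms": 4.0, "inferences_per_s": 30.0, "reload_ms": 100.0}
table = sensitivity(baseline)

for (name, scale), f_star in sorted(table.items()):
    print(f"{name} x{scale:.2f}: break-even at {f_star:.2f} switches/s")
```

Under this assumed model a factor-of-3 error in any single input moves the break-even frequency by the same factor of 3, which is why the referee's concern about unvalidated inputs bears directly on where the crossover lands.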

Circularity Check

0 steps flagged

No circularity; parametric analysis is self-contained

The abstract and skeptic summary describe a deployment-centric trade-off study that compares overlay vs. customized architectures under stated assumptions about workload variation, reconfiguration latency, and switching frequency in an autonomous-driving scenario. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. The central observations (overlay preferable at high switching frequency; preference may flip as overhead falls) are presented as outcomes of the parametric comparison rather than reducing to the inputs by construction. This matches the default expectation of a non-circular modeling paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

[1] [1]

Mohamed S Abdelfattah, David Han, Andrew Bitar, Roberto DiCecco, Shane O’Connell, Nitika Shanker, Joseph Chu, Ian Prins, Joshua Fender, and Andrew C Ling. 2018. DLA: Compiler and FPGA overlay for neural network inference acceleration. In2018 28th international conference on field programmable logic and applications (FPL). IEEE, 411–4117

2018

[2] [2]

AMD 2023.Vitis AI User Guide. AMD. https://docs.amd.com/r/en-US/ug1414- vitis-ai

2023

[3] [3]

Sameh Attia and Vaughn Betz. 2020. Feel free to interrupt: Safe task stopping to enable FPGA checkpointing and context switching.ACM Transactions on Reconfigurable Technology and Systems (TRETS)13, 1 (2020), 1–27

2020

[4] [4]

Autoware Foundation. 2026. Autoware - the world’s leading open-source soft- ware project for autonomous driving. https://github.com/autowarefoundation/ autoware

2026

[5] [5]

Alessandro Biondi, Alessio Balsini, Marco Pagani, Enrico Rossi, Mauro Marinoni, and Giorgio Buttazzo. 2016. A framework for supporting real-time applications on dynamic reconfigurable FPGAs. In2016 IEEE Real-Time Systems Symposium (RTSS). IEEE, 1–12

2016

[6] [6]

Robin Bonamy, Hung-Manh Pham, Sébastien Pillement, and Daniel Chillet. 2012. UPaRC—Ultra-fast power-aware reconfiguration controller. In2012 Design, Au- tomation & Test in Europe Conference & Exhibition (DATE). doi:10.1109/DATE. 2012.6176705

work page doi:10.1109/date 2012

[7] [7]

Mohamed Bouaziz, Michael Samet, and Suhaib A. Fahmy. 2025. A Dataflow Overlay for Monte Carlo Multi-Asset Option Pricing on AMD Versal AI Engines. InISC High Performance 2025 Research Paper Proceedings. doi:10.23919/ISC.2025. 11020612

work page doi:10.23919/isc.2025 2025

[8] [8]

CCD Photometric Study of the Contact Binary TX Cnc in the Young Open Cluster NGC 2632

Andrew Boutros, Eriko Nurvitadhi, Rui Ma, Sergey Gribok, Zhipeng Zhao, James C. Hoe, Vaughn Betz, and Martin Langhammer. 2020. Beyond Peak Perfor- mance: Comparing the Real Performance of AI-Optimized FPGAs and GPUs. In 2020 International Conference on Field-Programmable Technology (ICFPT). 10–19. doi:10.1109/ICFPT51103.2020.00011

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/icfpt51103.2020.00011 2020

[9] [9]

Luis Andres Cardona and Carles Ferrer. 2015. AC_ICAP: A flexible high speed ICAP controller.International Journal of Reconfigurable Computing(2015)

2015

[10] [10]

Hongzheng Chen, Niansong Zhang, Shaojie Xiang, Zhichen Zeng, Mengjia Dai, and Zhiru Zhang. 2024. Allo: A programming model for composable accelerator design.Proceedings of the ACM on Programming Languages8, PLDI (2024), 593– 620

2024

[11] [11]

Xingzhen Chen, Zhuoping Yang, Jinming Zhuang, Shixin Ji, Sarah Schultz, Zheng Dong, Weisong Shi, and Peipei Zhou. 2026. DORA: Dataflow-Instruction Orches- tration Architecture for DNN Acceleration.arXiv preprint(2026)

2026

[12] [12]

Xingzhen Chen, Jinming Zhuang, Zhuoping Yang, Shixin Ji, Sarah Schultz, Zheng Dong, Weisong Shi, and Peipei Zhou. 2026. FILCO: Flexible Composing Archi- tecture with Real-Time Reconfigurability for DNN Acceleration.arXiv preprint arXiv:2604.07523(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[13]

Peiyan Dong, Jinming Zhuang, Zhuoping Yang, Shixin Ji, Yanyu Li, Dongkuan Xu, Heng Huang, Jingtong Hu, Alex Jones, Yiyu Shi, Yanzhi Wang, and Peipei Zhou. 2024. EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture. IEEE TCAD (2024)

[15]

Mario Doumet, Marius Stan, Mathew Hall, and Vaughn Betz. 2024. H2PIPE: High throughput CNN inference on FPGAs with high-bandwidth memory. In 2024 FPL. IEEE, 69–77

[16]

Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel, Adam Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steven K. Reinhardt, Adrian M. Caulfield, Eric S. Chung, and Doug Burger. 2018. A Configurable Cloud-Scale DNN Processor for Real-Time AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). 1–14. doi:10.1109/ISCA.2018.00012

[18]

Paolo Salvatore Galfano, Giuseppe Sorrentino, Eleonora D’Arnese, and Davide Conficconi. 2024. Co-Designing a 3D Transformation Accelerator for Versal-Based Image Registration. In 2024 IEEE 42nd International Conference on Computer Design (ICCD). 219–222. doi:10.1109/ICCD63220.2024.00041

IGSC 2026, June 22–24, 2026, Canandaigua, NY, USA Xingzhen Chen, Shix...

[19]

Chengsi Gao, Ying Wang, Cheng Liu, Mengdi Wang, Weiwei Chen, Yinhe Han, and Lei Zhang. 2023. Layer-Puzzle: Allocating and Scheduling Multi-task on Multi-core NPUs by Using Layer Heterogeneity. In DATE. IEEE, 1–6

[20]

Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, Albert Ou, Colin Schmidt, Samuel Steffl, John Wright, Ion Stoica, Jonathan Ragan-Kelley, Krste Asanovic, Borivoje Nikolic, and Yakun Sophia Shao. 2021. Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Fu...


[21]

Jiapeng Guan, Ran Wei, Dean You, Yingquan Wang, Ruizhe Yang, Hui Wang, and Zhe Jiang. 2024. MESC: Re-thinking Algorithmic Priority and/or Criticality Inversions for Heterogeneous MCSs. In 2024 IEEE Real-Time Systems Symposium (RTSS). IEEE, 1–14

[22]

Nan Guan and Zheng Dong. [n. d.]. Industry Challenge. ([n. d.])

[23]

Zibo Guo, Kai Liu, Wei Liu, Xiaoyao Sun, Chongyang Ding, and Shangrong Li. 2024. An overlay accelerator of DeepLab CNN for spacecraft image segmentation on FPGA. Remote Sensing 16, 5 (2024), 894

[25]

Zifan He, Anderson Truong, Yingqi Cao, and Jason Cong. 2025. InTAR: Inter-Task Auto-Reconfigurable Accelerator Design for High Data Volume Variation in DNNs. In 2025 IEEE 33rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 123–132

[26]

Erika Hunhoff, Joseph Melber, Kristof Denolf, Andra Bisca, Samuel Bayliss, Stephen Neuendorffer, Jeff Fifield, Jack Lo, Pranathi Vasireddy, Phil James-Roxby, et al. 2025. Efficiency, expressivity, and extensibility in a close-to-metal npu programming interface. In 2025 IEEE 33rd Annual International Symposium on Field-Programmable Custom Computing Machines...

[27]

Mustafa Ibrahim, Sebastien Pillement, Andrea Pinna, and Sebastien Le Nours. 2025. VERSATILE: Very Fast Partial Reconfiguration Controller. ACM Trans. Reconfigurable Technol. Syst. 18, 3, Article 42 (Sept. 2025), 22 pages. doi:10.1145/3748728

[29]

Shixin Ji, Xingzhen Chen, Jinming Zhuang, Wei Zhang, Zhuoping Yang, Sarah Schultz, Yukai Song, Jingtong Hu, Alex Jones, Zheng Dong, and Peipei Zhou. 2025. ART: Customizing Accelerators for DNN-Enabled Real-Time Safety-Critical Systems. In Proceedings of the Great Lakes Symposium on VLSI 2025 (GLSVLSI ’25). Association for Computing Machinery, New York, NY, USA, 442–449. doi:10.1145/3716368.3735215

[31]

Shixin Ji, Zhuoping Yang, Xingzhen Chen, Wei Zhang, Jinming Zhuang, Alex K. Jones, Zheng Dong, and Peipei Zhou. 2025. DERCA: DetERministic Cycle-Level Accelerator on Reconfigurable Platforms in DNN-Enabled Real-Time Safety-Critical Systems. In 2025 IEEE Real-Time Systems Symposium (RTSS). 392–405. doi:10.1109/RTSS66672.2025.00039

[32]

Krzysztof Jozwik, Hiroyuki Tomiyama, Shinya Honda, and Hiroaki Takada. 2010. A novel mechanism for effective hardware task preemption in dynamically reconfigurable systems. In 2010 International Conference on Field Programmable Logic and Applications. IEEE, 352–355

[33]

Seah Kim, Hasan Genc, Vadim Vadimovich Nikiforov, Krste Asanović, Borivoje Nikolić, and Yakun Sophia Shao. 2023. MoCA: Memory-centric, adaptive execution for multi-tenant deep neural networks. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 828–841

[34]

Amit Kulkarni, Vipin Kizheppatt, and Dirk Stroobandt. 2015. MiCAP: a custom reconfiguration controller for dynamic circuit specialization. In 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig). 1–6. doi:10.1109/ReConFig.2015.7393327

[35]

Johannes Menzel and Christian Plessl. 2025. Efficient and Distributed Computation of Electron Repulsion Integrals on AMD AI Engines. In 2025 IEEE 33rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 95–104. doi:10.1109/FCCM62733.2025.00044

[36]

Young H Oh, Seonghak Kim, Yunho Jin, Sam Son, Jonghyun Bae, Jongsung Lee, Yeonhong Park, Dong Uk Kim, Tae Jun Ham, and Jae W Lee. 2021. Layerweaver: Maximizing resource utilization of neural processing units via layer-wise scheduling. In HPCA. IEEE, 584–597

[37]

Marco Pagani, Alessio Balsini, Alessandro Biondi, Mauro Marinoni, and Giorgio Buttazzo. 2017. A Linux-based support for developing real-time applications on heterogeneous platforms with dynamic FPGA reconfiguration. In 2017 30th IEEE International System-on-Chip Conference (SOCC). IEEE, 96–101

[38]

Francesco Restuccia and Alessandro Biondi. 2021. Time-predictable acceleration of deep neural networks on fpga soc platforms. In 2021 IEEE Real-Time Systems Symposium (RTSS). IEEE, 441–454

[39]

Enrico Rossi, Marvin Damschen, Lars Bauer, Giorgio Buttazzo, and Jörg Henkel. 2018. Preemption of the Partial Reconfiguration Process to Enable Real-Time Computing With FPGAs. 11, 2, Article 10 (2018), 24 pages. doi:10.1145/3182183

[41]

Biruk Seyoum, Marco Pagani, Alessandro Biondi, and Giorgio Buttazzo. 2021. Automating the design flow under dynamic partial reconfiguration for hardware-software co-design in FPGA SoC. In Proceedings of the 36th Annual ACM Symposium on Applied Computing. 481–490

[42]

Dhananjay Rao Thallikar, Shashank Nag, and Lizy K John. 2026. HMix: An Efficient Hardware Accelerator for Quantized MLP-Mixer Inference. (2026)


[43]

Jianming Tong, Anirudh Itagi, Prasanth Chatarasi, and Tushar Krishna. 2024. Feather: A reconfigurable accelerator with data reordering support for low-cost on-chip dataflow switching. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 198–214

[44]

Chunyang Wang, Yuebin Bai, and Desen Sun. 2023. CD-MSA: cooperative and deadline-aware scheduling for efficient multi-tenancy on DNN accelerators. TPDS 34, 7 (2023), 2091–2106

[45]

Chengyue Wang, Xiaofan Zhang, Jason Cong, and James C Hoe. 2025. Reconfigurable Stream Network Architecture. In Proceedings of the 52nd Annual International Symposium on Computer Architecture. 1848–1866

[46]

Erwei Wang, Samuel Bayliss, Andra Bisca, Zachary Blair, Sangeeta Chowdhary, Kristof Denolf, Jeff Fifield, Brandon Freiberger, Erika Hunhoff, Phil James-Roxby, Jack Lo, Joseph Melber, Stephen Neuendorffer, Eddie Richter, André Rosti, Javier Setoain, Gagandeep Singh, Endri Taka, Pranathi Vasireddy, Zhewen Yu, Niansong Zhang, and Jinming Zhuang. 2026. From L...


[47]

Xuechao Wei, Yun Liang, Xiuhong Li, Cody Hao Yu, Peng Zhang, and Jason Cong. 2018. TGPA: Tile-grained pipeline architecture for low latency CNN inference. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1–8

[49]

Yixin Xu, Zijian Zhao, Yi Xiao, Tongguang Yu, Halid Mulaosmanovic, Dominik Kleimaier, Stefan Duenkel, Sven Beyer, Xiao Gong, Rajiv Joshi, Xiaobo Hu, Shixian Wen, Amanda Sofie Rios, Kiran Lekkala, Laurent Itti, Eric Homan, Sumitha George, Vijaykrishnan Narayanan, and Kai Ni. 2024. Ferroelectric FET-based context-switching FPGA enabling dynamic reconfiguration for adaptive deep learning machines. Science Advances (2024). arXiv:https://www.science.org/doi/pdf/10.1126/sciadv.adk1525 doi:10.1126/sciadv.adk1525

[51]

Hanchen Yang, Zishen Wan, Ritik Raj, Joongun Park, Ziwei Li, Ananda Samajdar, Arijit Raychowdhury, and Tushar Krishna. 2025. NSFlow: An End-to-End FPGA Framework with Scalable Dataflow Architecture for Neuro-Symbolic AI. arXiv preprint arXiv:2504.19323 (2025)

[52]

Zhuoping Yang, Jinming Zhuang, Jiaqi Yin, Cunxi Yu, Alex K Jones, and Peipei Zhou. 2023. AIM: Accelerating Arbitrary-precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 1–9

[53]

Shulin Zeng, Guohao Dai, Niansong Zhang, Xinhao Yang, Haoyu Zhang, Zhenhua Zhu, Huazhong Yang, and Yu Wang. 2022. Serving multi-DNN workloads on FPGAs: A coordinated architecture, scheduling, and mapping perspective. IEEE Trans. Comput. 72, 5 (2022), 1314–1328

[54]

Dan Zhang, Safeen Huda, Ebrahim Songhori, Kartik Prabhu, Quoc Le, Anna Goldie, and Azalia Mirhoseini. 2022. A full-stack search technique for domain optimized deep learning accelerators. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 27–42

[55]

Yifan Zhang, Zhiheng Chen, Ye Qiao, and Sitao Huang. 2025. PD-Swap: Prefill-Decode Logic Swapping for End-to-End LLM Inference on Edge FPGAs via Dynamic Partial Reconfiguration. arXiv preprint arXiv:2512.11550 (2025)

[56]

Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Yubo Du, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex Jones, Jingtong Hu, Deming Chen, Jason Cong, and Peipei Zhou. 2023. CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmab...

[57]

Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Shixin Ji, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex Jones, Jingtong Hu, Yiyu Shi, Deming Chen, Jason Cong, and Peipei Zhou. 2024. CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture. ACM Trans. Reconfigurable Technol. Syst. 17, 3, Article 51 (Sept. 202...

[58]

Jinming Zhuang, Shaojie Xiang, Hongzheng Chen, Niansong Zhang, Zhuoping Yang, Tony Mao, Zhiru Zhang, and Peipei Zhou. 2025. ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines. In Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, CA, USA) (FPGA ’25). Association for Co...

[59]

Jinming Zhuang, Zhuoping Yang, Shixin Ji, Heng Huang, Alex K. Jones, Jingtong Hu, Yiyu Shi, and Peipei Zhou. 2024. SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, CA, USA) (FPGA ’24). Association f...

[60]

Jinming Zhuang, Zhuoping Yang, and Peipei Zhou. 2023. High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives. In DAC (San Francisco, California, United States) (DAC ’23). IEEE Press, 1–6. doi:10.1109/DAC56929.2023.10247981