arxiv: 2604.07523 · v2 · submitted 2026-04-08 · 💻 cs.AR

FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration

Xingzhen Chen, Jinming Zhuang, Zhuoping Yang, Shixin Ji, Sarah Schultz, Zheng Dong, Weisong Shi, Peipei Zhou

Pith reviewed 2026-05-10 17:19 UTC · model grok-4.3

classification 💻 cs.AR

keywords DNN acceleration · flexible architecture · real-time reconfiguration · hardware efficiency · design space exploration · Versal FPGA · heterogeneous computing · accelerator composition

The pith

FILCO lets DNN accelerators reconfigure in real time and compose into unified or separate units to match varying workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FILCO as a flexible composing architecture for DNN acceleration on heterogeneous platforms. It claims that this design supports real-time reconfiguration and can form either one large accelerator or several smaller independent ones, avoiding the resource waste that occurs when fixed dedicated hardware meets mismatched workloads or when overlay designs switch dataflows inefficiently. An analytical model combined with two-stage design space exploration finds the best storage and computation balance for each case. When implemented and tested on a 7 nm AMD Versal VCK190 board, the approach reports 1.3x to 5x gains in throughput and hardware efficiency across diverse DNN workloads. A reader would care because current platforms often keep extra hardware idle or force suboptimal mappings; FILCO offers a single reconfigurable substrate that adapts on demand.

Core claim

FILCO can be reconfigured in real-time and flexibly composed into a unified or multiple independent accelerators. The accompanying FILCO framework uses an analytical model with two-stage design space exploration to reach the optimal design point, delivering 1.3x-5x higher throughput and hardware efficiency than prior dedicated or overlay architectures on varied DNN workloads.

What carries the argument

The FILCO flexible composing architecture that supports real-time reconfiguration and on-demand composition into unified or multiple accelerators, driven by an analytical model and two-stage design space exploration to select storage and computation resources.
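As a rough illustration of how a two-stage search of this kind can be organized, the sketch below first brute-forces a tile size per layer against a toy cost model, then picks a compute-unit count from the resulting table. The cost model, the tile candidates, the layer shapes, and every name (`layer_cycles`, `stage1`, `stage2`) are assumptions made for illustration; the paper's actual analytical model and solver are not reproduced here.

```python
from itertools import product
from math import ceil

TILE_CANDIDATES = (16, 32, 64)  # hypothetical tile edge lengths


def layer_cycles(m, k, n, tm, tk, tn, num_cu):
    """Toy cost: padded tile count spread over compute units, times work per tile."""
    tiles = ceil(m / tm) * ceil(k / tk) * ceil(n / tn)
    return ceil(tiles / num_cu) * (tm * tk * tn)


def stage1(layers, cu_options):
    """Stage 1: brute-force the best tiling for every (layer, CU count) pair."""
    table = {}
    for idx, (m, k, n) in enumerate(layers):
        for cu in cu_options:
            best = min(
                product(TILE_CANDIDATES, repeat=3),
                key=lambda t: layer_cycles(m, k, n, *t, cu),
            )
            table[(idx, cu)] = (best, layer_cycles(m, k, n, *best, cu))
    return table


def stage2(table, num_layers, cu_options):
    """Stage 2: pick the CU count with the lowest total cycles over all layers."""
    totals = {
        cu: sum(table[(i, cu)][1] for i in range(num_layers)) for cu in cu_options
    }
    best_cu = min(totals, key=totals.get)
    return best_cu, totals[best_cu]


layers = [(128, 512, 64), (64, 64, 64)]  # hypothetical MM shapes (M, K, N)
cu_options = (1, 2, 4)
table = stage1(layers, cu_options)
best_cu, total_cycles = stage2(table, len(layers), cu_options)
print(best_cu, total_cycles)
```

Because this toy model ignores memory capacity and reconfiguration cost, more compute units never hurt; a realistic second stage would add those constraints before choosing.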

If this is right

Dedicated fixed architectures will continue to suffer workload mismatch while FILCO adapts by recomposing resources at runtime.
Overlay designs that only switch dataflow remain limited in granularity; FILCO's composition into independent units removes that constraint.
The two-stage DSE reduces the search effort needed to reach an efficient mapping for each new workload.
On the evaluated 7 nm Versal board the design shows consistent 1.3x-5x improvements across the tested workload set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the reconfiguration overhead truly stays low, the same fabric could support dynamic task migration between edge devices and nearby servers without hardware swaps.
The composition mechanism might extend naturally to other coarse-grained reconfigurable fabrics, reducing the need for multiple specialized chips in heterogeneous systems.
Automated mapping tools built on the analytical model could let software decide at runtime whether to run one large accelerator or several smaller ones for a given batch of inferences.
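To make the last extension concrete, here is a minimal sketch of a runtime decision between one unified accelerator and two independent ones. It assumes a toy latency model in which each task can keep only a limited number of compute units busy; the function names, the `(ops, max_useful_cus)` task format, and the numbers are hypothetical and do not come from the paper.

```python
def task_time(ops, cus, max_useful_cus):
    """Toy latency: work divided by the CUs the task can actually keep busy."""
    return ops / min(cus, max_useful_cus)


def unified_makespan(tasks, total_cus):
    """One large accelerator: tasks run back to back on all CUs."""
    return sum(task_time(ops, total_cus, cap) for ops, cap in tasks)


def best_split_makespan(tasks, total_cus):
    """Two independent accelerators: try every CU split, tasks run concurrently."""
    (ops_a, cap_a), (ops_b, cap_b) = tasks
    best = None
    for cus_a in range(1, total_cus):
        span = max(
            task_time(ops_a, cus_a, cap_a),
            task_time(ops_b, total_cus - cus_a, cap_b),
        )
        if best is None or span < best[1]:
            best = (cus_a, span)
    return best


def choose_composition(tasks, total_cus):
    """Return (mode, CU allocation, makespan) for the cheaper composition."""
    unified = unified_makespan(tasks, total_cus)
    cus_a, split = best_split_makespan(tasks, total_cus)
    if split < unified:
        return "split", (cus_a, total_cus - cus_a), split
    return "unified", (total_cus,), unified


# Hypothetical pair: the first task cannot use more than 2 CUs.
mismatched = choose_composition([(800, 2), (600, 8)], total_cus=8)
# Hypothetical pair: both tasks scale to all 8 CUs.
balanced = choose_composition([(800, 8), (800, 8)], total_cus=8)
print(mismatched, balanced)
```

In this toy setting the split wins only when a task cannot fill the whole fabric, which is the workload-mismatch case the summary describes.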

Load-bearing premise

The two-stage analytical model correctly locates the optimal design point without later manual fixes and the cost of real-time reconfiguration stays small enough not to erase the reported efficiency gains on the target hardware.

What would settle it

Measure actual throughput and efficiency on the AMD Versal VCK190 board for the same mixed DNN workloads; if the gains fall below 1.3x over strong baselines or if reconfiguration latency offsets the benefits, the central claim does not hold.
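The test described above amounts to a small calculation once measurements exist. The sketch below charges reconfiguration time against the reconfigurable design and compares the resulting speedup with the 1.3x lower bound; all timing values and names are hypothetical placeholders, not measurements from the paper.

```python
def effective_speedup(baseline_s, compute_s, reconfig_s, num_reconfigs):
    """Speedup over the baseline once reconfiguration time is included."""
    total_s = compute_s + reconfig_s * num_reconfigs
    return baseline_s / total_s


def claim_holds(baseline_s, compute_s, reconfig_s, num_reconfigs, threshold=1.3):
    """True if the overhead-inclusive speedup still meets the claimed lower bound."""
    return effective_speedup(baseline_s, compute_s, reconfig_s, num_reconfigs) >= threshold


# Hypothetical timings in seconds for one mixed workload.
cheap_reconfig = effective_speedup(10.0, 4.0, 0.5, 2)
costly_reconfig = effective_speedup(10.0, 4.0, 2.0, 2)
print(cheap_reconfig, claim_holds(10.0, 4.0, 0.5, 2))
print(costly_reconfig, claim_holds(10.0, 4.0, 2.0, 2))
```

With the same compute time, the second case drops below the threshold purely because of reconfiguration latency, which is the failure mode the paragraph above names.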

Figures

Figures reproduced from arXiv: 2604.07523 by Jinming Zhuang, Peipei Zhou, Sarah Schultz, Shixin Ji, Weisong Shi, Xingzhen Chen, Zheng Dong, Zhuoping Yang.

**Figure 2.** FILCO hardware architecture. the on-chip resources into Compute Units (CU), Flexible Memory Units (FMU), and IO Manager (IOM). Each Compute Unit is featured with an AI Engine (AIE) array, a CU Buffer, and a Mesh Manager, and is responsible for handling the compute-intensive workloads. The Flexible Memory Units explore data reuse by allocating on-chip buffers on the Programmable Logic (PL). Additionally, th…

**Figure 5.** Flexible on-chip memory functionality. required to handle diverse workloads, e.g., 128x512 matrix shapes, such a static design method induces much storage overhead, and only achieves 50% efficiency due to unnecessary padding (red block). In reality, the two diverse matrices have the same data size, which can definitely be stored in one buffer. Therefore, proposing a flexible on-chip memory that is able to …

**Figure 4.** Flexible on-chip memory views. (green blocks), smaller workloads require padding, resulting in significant invalid computation (red blocks). Designing finite instruction blocks helps to mitigate the invalid computation, but it has significant limitations in practice. There are only 16KB of instruction memory in each AIE, and the instruction size for computing MM with a tile size of 32x32x32 is more than …

**Figure 6.** Overview of the FILCO framework. FILCO takes DNN models, platform information, and DDR profiling results as input. After the automated optimization flow and code generation, FILCO generates the binary files by launching the backend compilers. In the first stage, Runtime Parameter Optimizer performs a brute-force search on every layer to find the optimal runtime dataflow, as well as a table with t…

**Figure 7.** Illustration diagram for GA decoder. [Plot residue from the page: axes #Ops and Eff.; series Flexible AIE programming (Ours) and Baseline AIE programming; annotation "6.4x #Ops flexibility while keeping perf. within 5% of the peak".]

**Figure 8.** Single AIE efficiency under #operations variation.

**Figure 9.** Throughput comparisons on diverse MM workloads.

**Figure 11.** Comparison of search time for MILP and GA solver.
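The padding arithmetic mentioned in the Figure 5 caption can be sketched as follows. The 128x512 shape comes from the caption; the second shape (256x256) is an assumption chosen so that both matrices hold the same number of elements, since the caption is truncated. The function names are hypothetical.

```python
def static_buffer_efficiency(shapes):
    """A fixed 2D buffer must cover the largest row and column count of any shape."""
    rows = max(r for r, _ in shapes)
    cols = max(c for _, c in shapes)
    capacity = rows * cols
    return {shape: (shape[0] * shape[1]) / capacity for shape in shapes}


def flexible_buffer_efficiency(shapes):
    """A flat buffer that can be re-viewed per shape only needs the largest element count."""
    capacity = max(r * c for r, c in shapes)
    return {shape: (shape[0] * shape[1]) / capacity for shape in shapes}


shapes = [(256, 256), (128, 512)]  # (256, 256) is an assumed companion shape
static_eff = static_buffer_efficiency(shapes)
flexible_eff = flexible_buffer_efficiency(shapes)
print(static_eff)
print(flexible_eff)
```

Under these assumed shapes the static layout stores each matrix at 50% utilization, while a buffer sized by element count holds either one with no padding.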

Original abstract

With the development of deep neural network (DNN) enabled applications, achieving high hardware resource efficiency on diverse workloads is non-trivial in heterogeneous computing platforms. Prior works discuss dedicated architectures to achieve maximal resource efficiency. However, a mismatch between hardware and workloads always exists in various diverse workloads. Other works discuss overlay architecture that can dynamically switch dataflow for different workloads. However, these works are still limited by flexibility granularity and induce much resource inefficiency. To solve this problem, we propose a flexible composing architecture, FILCO, that can efficiently match diverse workloads to achieve the optimal storage and computation resource efficiency. FILCO can be reconfigured in real-time and flexibly composed into a unified or multiple independent accelerators. We also propose the FILCO framework, including an analytical model with a two-stage DSE that can achieve the optimal design point. We also evaluate the FILCO framework on the 7nm AMD Versal VCK190 board. Compared with prior works, our design can achieve 1.3x - 5x throughput and hardware efficiency on various diverse workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FILCO adds real-time reconfiguration and flexible composition to overlay-style DNN accelerators with an analytical two-stage DSE, but the 1.3x-5x claims rest on sparse evaluation details that leave overhead and validation unclear.

The main point is that FILCO tries to fix the mismatch between fixed DNN hardware and varied workloads by letting the accelerator reconfigure in real time and compose itself into one unified unit or several independent ones. They pair this with an analytical model and two-stage design space exploration to pick the design point, then report 1.3x-5x throughput and efficiency gains on the AMD Versal VCK190 board compared with prior work.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FILCO, a flexible composing architecture for DNN acceleration on heterogeneous platforms. FILCO supports real-time reconfiguration and can be composed into either a unified accelerator or multiple independent accelerators to match diverse workloads for optimal storage and computation resource efficiency. The FILCO framework includes an analytical model paired with a two-stage design space exploration (DSE) procedure to identify optimal design points. Evaluation is performed on the 7 nm AMD Versal VCK190 board, with claims of 1.3×–5× gains in throughput and hardware efficiency versus prior dedicated and overlay architectures across various workloads.

Significance. If the central claims hold, FILCO would represent a meaningful advance in flexible DNN accelerators by bridging the gap between rigid dedicated designs and coarse-grained overlays, delivering measurable efficiency gains on diverse workloads through real-time reconfigurability. The provision of an analytical model and two-stage DSE is a strength that supports systematic optimization and potential reproducibility.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: The 1.3×–5× throughput and hardware-efficiency claims are stated without workload specifications, baseline architectures, error bars, or a clear methodology description (including how reconfiguration overhead was measured and subtracted). This absence prevents verification that the gains are load-bearing and not negated by overhead on the VCK190.
[FILCO framework / Analytical model] Analytical model and two-stage DSE (framework description): The model is used both to generate candidate designs and to assert optimality. No explicit comparison of model-predicted versus measured performance on the target board is referenced, nor is it stated whether DSE parameters were fitted to the same evaluation data; this creates a circularity risk for the optimality claim.

minor comments (2)

[Abstract] The abstract uses the phrase 'various diverse workloads' without enumeration; the main text should list the concrete DNN models, batch sizes, and dataflow variants used.
[Analytical model] Notation for the analytical model (e.g., definitions of storage and computation efficiency metrics) should be introduced with explicit equations and units before the DSE procedure is described.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline revisions to improve clarity and address the raised concerns.

Referee: [Abstract / Evaluation] Abstract and Evaluation section: The 1.3×–5× throughput and hardware-efficiency claims are stated without workload specifications, baseline architectures, error bars, or a clear methodology description (including how reconfiguration overhead was measured and subtracted). This absence prevents verification that the gains are load-bearing and not negated by overhead on the VCK190.

Authors: We acknowledge that the abstract provides only a high-level summary and does not enumerate specific workloads, baselines, error bars, or the overhead measurement procedure. The Evaluation section of the manuscript does detail the workloads (diverse DNN models across convolutional and transformer architectures), the baseline dedicated and overlay accelerators, error bars from repeated board measurements, and the reconfiguration overhead quantification (via direct timing on the VCK190, subtracted from end-to-end execution time). To make these elements immediately verifiable from the abstract and to strengthen the methodology description, we will revise the abstract to include brief workload and baseline references and expand the Evaluation section with an explicit overhead accounting subsection.

Revision: yes

Referee: [FILCO framework / Analytical model] Analytical model and two-stage DSE (framework description): The model is used both to generate candidate designs and to assert optimality. No explicit comparison of model-predicted versus measured performance on the target board is referenced, nor is it stated whether DSE parameters were fitted to the same evaluation data; this creates a circularity risk for the optimality claim.

Authors: The analytical model comprises closed-form equations derived directly from the VCK190 hardware specifications and standard DNN operation costs; no parameters were fitted to the evaluation data. The two-stage DSE uses the model solely to rank candidate designs, which are subsequently implemented and measured on the board. To eliminate any appearance of circularity, we will add a dedicated validation subsection that reports side-by-side model-predicted versus measured performance for the final selected designs, thereby confirming the model's independent predictive accuracy.

Revision: yes
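A side-by-side comparison of model-predicted and measured performance, as discussed in the second exchange, reduces to a relative-error calculation. The sketch below is one minimal way to report it; the throughput values, the 15% tolerance, and the function names are hypothetical and are not taken from the paper.

```python
def relative_errors(predicted, measured):
    """Per-design relative error of the model against board measurements."""
    return [abs(p - m) / m for p, m in zip(predicted, measured)]


def mean_relative_error(predicted, measured):
    """Average relative error across all compared designs."""
    errors = relative_errors(predicted, measured)
    return sum(errors) / len(errors)


def model_is_validated(predicted, measured, tolerance=0.15):
    """True if every design's prediction falls within the chosen tolerance."""
    return all(e <= tolerance for e in relative_errors(predicted, measured))


# Hypothetical throughput values (arbitrary units) for two selected designs.
predicted = [90.0, 220.0]
measured = [100.0, 200.0]
print(relative_errors(predicted, measured))
print(mean_relative_error(predicted, measured))
print(model_is_validated(predicted, measured))
```

Reporting the per-design errors rather than only the mean keeps a single badly predicted design from being hidden by the average.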

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an analytical model with two-stage DSE to identify optimal design points for the FILCO architecture, followed by hardware evaluation on the AMD Versal VCK190 achieving 1.3x-5x gains. No equations, self-citations, or derivation steps are quoted that reduce the optimality claim, performance predictions, or reconfiguration benefits directly to fitted inputs or prior self-referential results by construction. The central claims rest on external hardware benchmarks rather than internal self-definition or fitted-input renaming, making the derivation self-contained against the stated evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on an unverified analytical performance model and a two-stage DSE whose accuracy is assumed rather than demonstrated; no independent evidence for the model is provided in the abstract.

free parameters (1)

design parameters searched in two-stage DSE
The DSE searches or fits parameters to reach the claimed optimal design point for storage and computation efficiency.

axioms (1)

domain assumption: The analytical model correctly predicts real hardware throughput and resource usage for reconfigured designs
Invoked to justify that the two-stage DSE finds the optimal point and that reported gains are achievable.

