FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration
Pith reviewed 2026-05-10 17:19 UTC · model grok-4.3
The pith
FILCO lets DNN accelerators reconfigure in real time and compose into unified or separate units to match varying workloads.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FILCO can be reconfigured in real-time and flexibly composed into a unified or multiple independent accelerators. The accompanying FILCO framework uses an analytical model with two-stage design space exploration to reach the optimal design point, delivering 1.3x-5x higher throughput and hardware efficiency than prior dedicated or overlay architectures on varied DNN workloads.
What carries the argument
The FILCO flexible composing architecture that supports real-time reconfiguration and on-demand composition into unified or multiple accelerators, driven by an analytical model and two-stage design space exploration to select storage and computation resources.
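To make the idea of a two-stage search over an analytical model concrete, here is a minimal sketch. It is not the paper's model: the cost function `estimate_latency`, the `mem_bw` constant, the tile sizes, and the unit counts are all hypothetical placeholders chosen for illustration. Stage 1 enumerates how compute units are divided among concurrently running workloads; stage 2 tunes a tile size for each resulting accelerator.

```python
from itertools import product

def estimate_latency(ops, units, tile, mem_bw=64.0):
    """Toy analytical model (assumed): latency is the larger of compute and memory time."""
    compute = ops / (units * tile)         # more units or larger tiles shorten compute
    memory = ops / (mem_bw * tile ** 0.5)  # larger tiles improve data reuse
    return max(compute, memory)

def two_stage_dse(workloads, total_units, tiles=(1, 2, 4, 8)):
    """workloads: list of op counts, one per concurrently running DNN.

    Returns (makespan, units_per_accelerator, tile_per_accelerator).
    """
    best = None
    # Stage 1: enumerate how to split the compute units among the workloads.
    for split in product(range(1, total_units + 1), repeat=len(workloads)):
        if sum(split) != total_units:
            continue
        # Stage 2: for a fixed split, tune each accelerator's tile size independently.
        chosen = [min(tiles, key=lambda t: estimate_latency(ops, u, t))
                  for ops, u in zip(workloads, split)]
        makespan = max(estimate_latency(ops, u, t)
                       for ops, u, t in zip(workloads, split, chosen))
        if best is None or makespan < best[0]:
            best = (makespan, split, tuple(chosen))
    return best

print(two_stage_dse([300, 100], total_units=4))
```

In this toy setting the heavier workload receives three of the four units. A real search would replace the placeholder cost function with a model derived from the target hardware.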
If this is right
- Dedicated fixed architectures will continue to suffer workload mismatch while FILCO adapts by recomposing resources at runtime.
- Overlay designs that only switch dataflow remain limited in granularity; FILCO's composition into independent units removes that constraint.
- The two-stage DSE reduces the search effort needed to reach an efficient mapping for each new workload.
- On the 7 nm AMD Versal VCK190 board, the design shows 1.3x-5x improvements in throughput and hardware efficiency across the tested workloads.
Where Pith is reading between the lines
- If the reconfiguration overhead truly stays low, the same fabric could support dynamic task migration between edge devices and nearby servers without hardware swaps.
- The composition mechanism might extend naturally to other coarse-grained reconfigurable fabrics, reducing the need for multiple specialized chips in heterogeneous systems.
- Automated mapping tools built on the analytical model could let software decide at runtime whether to run one large accelerator or several smaller ones for a given batch of inferences.
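The last point, a runtime choice between one large accelerator and several smaller ones, can be sketched as follows. This is a hypothetical illustration only: the `job_time` model, which assumes a job cannot use more compute units than its own parallelism allows, and the equal-share split policy are assumptions, not anything described in the paper.

```python
def job_time(ops, max_par, units):
    """Assumed model: a job cannot exploit more units than its own parallelism."""
    return ops / min(units, max_par)

def choose_composition(jobs, total_units):
    """jobs: list of (ops, max_parallelism) tuples.

    Returns ("unified" or "split", estimated completion time).
    """
    if not jobs:
        raise ValueError("need at least one job")
    # Unified: one large accelerator runs the jobs one after another.
    unified = sum(job_time(ops, par, total_units) for ops, par in jobs)
    # Split: equal-size independent accelerators run all jobs concurrently.
    share = total_units // len(jobs)
    if share == 0:
        return "unified", unified
    split = max(job_time(ops, par, share) for ops, par in jobs)
    return ("split", split) if split < unified else ("unified", unified)

# Two jobs that each saturate at 4 units gain from splitting 8 units in half.
print(choose_composition([(800, 4), (800, 4)], total_units=8))
# One large and one small job that both scale to 8 units run better unified.
print(choose_composition([(800, 8), (80, 8)], total_units=8))
```

The sketch ignores reconfiguration cost; a practical policy would add that cost to whichever option requires changing the current composition.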
Load-bearing premise
The analytical model with its two-stage DSE correctly locates the optimal design point without later manual fixes, and the cost of real-time reconfiguration stays small enough not to erase the reported efficiency gains on the target hardware.
What would settle it
Measure actual throughput and efficiency on the AMD Versal VCK190 board for the same mixed DNN workloads; if the gains fall below 1.3x over strong baselines or if reconfiguration latency offsets the benefits, the central claim does not hold.
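The test above can be written as a small check. The sketch assumes each workload yields three timings (baseline, FILCO compute, FILCO reconfiguration); the function names and the sample numbers are hypothetical placeholders, and only the 1.3x threshold comes from the claim under review.

```python
def net_speedup(baseline_s, filco_compute_s, reconfig_s):
    """Speedup over a baseline once reconfiguration time is charged to FILCO."""
    return baseline_s / (filco_compute_s + reconfig_s)

def claim_holds(runs, threshold=1.3):
    """runs: list of (baseline_s, filco_compute_s, reconfig_s), one per workload."""
    return all(net_speedup(*run) >= threshold for run in runs)

# Placeholder timings in seconds, not measured data.
runs = [(10.0, 4.0, 1.0), (10.0, 4.0, 4.0)]
print([net_speedup(*r) for r in runs])
print(claim_holds(runs))
```

With these placeholder numbers the second workload drops to 1.25x once reconfiguration is counted, so the check fails even though the compute-only speedup is 2.5x.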
Original abstract
With the development of deep neural network (DNN) enabled applications, achieving high hardware resource efficiency on diverse workloads is non-trivial in heterogeneous computing platforms. Prior works discuss dedicated architectures to achieve maximal resource efficiency. However, a mismatch between hardware and workloads always exists in various diverse workloads. Other works discuss overlay architecture that can dynamically switch dataflow for different workloads. However, these works are still limited by flexibility granularity and induce much resource inefficiency. To solve this problem, we propose a flexible composing architecture, FILCO, that can efficiently match diverse workloads to achieve the optimal storage and computation resource efficiency. FILCO can be reconfigured in real-time and flexibly composed into a unified or multiple independent accelerators. We also propose the FILCO framework, including an analytical model with a two-stage DSE that can achieve the optimal design point. We also evaluate the FILCO framework on the 7nm AMD Versal VCK190 board. Compared with prior works, our design can achieve 1.3x - 5x throughput and hardware efficiency on various diverse workloads.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FILCO, a flexible composing architecture for DNN acceleration on heterogeneous platforms. FILCO supports real-time reconfiguration and can be composed into either a unified accelerator or multiple independent accelerators to match diverse workloads for optimal storage and computation resource efficiency. The FILCO framework includes an analytical model paired with a two-stage design space exploration (DSE) procedure to identify optimal design points. Evaluation is performed on the 7 nm AMD Versal VCK190 board, with claims of 1.3×–5× gains in throughput and hardware efficiency versus prior dedicated and overlay architectures across various workloads.
Significance. If the central claims hold, FILCO would represent a meaningful advance in flexible DNN accelerators by bridging the gap between rigid dedicated designs and coarse-grained overlays, delivering measurable efficiency gains on diverse workloads through real-time reconfigurability. The provision of an analytical model and two-stage DSE is a strength that supports systematic optimization and potential reproducibility.
major comments (2)
- [Abstract / Evaluation] Abstract and Evaluation section: The 1.3×–5× throughput and hardware-efficiency claims are stated without workload specifications, baseline architectures, error bars, or a clear methodology description (including how reconfiguration overhead was measured and subtracted). This absence prevents verification that the gains are load-bearing and not negated by overhead on the VCK190.
- [FILCO framework / Analytical model] Analytical model and two-stage DSE (framework description): The model is used both to generate candidate designs and to assert optimality. No explicit comparison of model-predicted versus measured performance on the target board is referenced, nor is it stated whether DSE parameters were fitted to the same evaluation data; this creates a circularity risk for the optimality claim.
minor comments (2)
- [Abstract] The abstract uses the phrase 'various diverse workloads' without enumeration; the main text should list the concrete DNN models, batch sizes, and dataflow variants used.
- [Analytical model] Notation for the analytical model (e.g., definitions of storage and computation efficiency metrics) should be introduced with explicit equations and units before the DSE procedure is described.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline revisions to improve clarity and address the raised concerns.
- Referee: [Abstract / Evaluation] Abstract and Evaluation section: The 1.3×–5× throughput and hardware-efficiency claims are stated without workload specifications, baseline architectures, error bars, or a clear methodology description (including how reconfiguration overhead was measured and subtracted). This absence prevents verification that the gains are load-bearing and not negated by overhead on the VCK190.
  Authors: We acknowledge that the abstract provides only a high-level summary and does not enumerate specific workloads, baselines, error bars, or the overhead measurement procedure. The Evaluation section of the manuscript does detail the workloads (diverse DNN models across convolutional and transformer architectures), the baseline dedicated and overlay accelerators, error bars from repeated board measurements, and the reconfiguration overhead quantification (via direct timing on the VCK190, subtracted from end-to-end execution time). To make these elements immediately verifiable from the abstract and to strengthen the methodology description, we will revise the abstract to include brief workload and baseline references and expand the Evaluation section with an explicit overhead accounting subsection.
  Revision: yes
- Referee: [FILCO framework / Analytical model] Analytical model and two-stage DSE (framework description): The model is used both to generate candidate designs and to assert optimality. No explicit comparison of model-predicted versus measured performance on the target board is referenced, nor is it stated whether DSE parameters were fitted to the same evaluation data; this creates a circularity risk for the optimality claim.
  Authors: The analytical model comprises closed-form equations derived directly from the VCK190 hardware specifications and standard DNN operation costs; no parameters were fitted to the evaluation data. The two-stage DSE uses the model solely to rank candidate designs, which are subsequently implemented and measured on the board. To eliminate any appearance of circularity, we will add a dedicated validation subsection that reports side-by-side model-predicted versus measured performance for the final selected designs, thereby confirming the model's independent predictive accuracy.
  Revision: yes
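A side-by-side validation of the kind promised here could report two numbers per design set: the relative error of the model and how often the model orders two designs the same way the board does. The sketch below is one hypothetical way to compute both; the function names and the sample values are placeholders, not results from the paper.

```python
def relative_errors(predicted, measured):
    """Per-design relative error of the model against board measurements."""
    return [abs(p - m) / m for p, m in zip(predicted, measured)]

def rank_agreement(predicted, measured):
    """Fraction of design pairs ordered the same way by model and board.

    Ties in either list count as disagreement. Requires at least two designs.
    """
    n = len(predicted)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    same = sum(
        (predicted[i] - predicted[j]) * (measured[i] - measured[j]) > 0
        for i, j in pairs
    )
    return same / len(pairs)

# Placeholder throughput values, not measured data.
predicted = [1.0, 2.0, 3.0]
measured = [10.0, 20.0, 15.0]
print(relative_errors([110.0, 90.0], [100.0, 100.0]))
print(rank_agreement(predicted, measured))
```

Because the rebuttal states that the model is used only to rank candidates, rank agreement is arguably the more relevant of the two figures: a model with large absolute error can still select the right design if it preserves ordering.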
Circularity Check
No significant circularity detected
The paper describes an analytical model with two-stage DSE to identify optimal design points for the FILCO architecture, followed by hardware evaluation on the AMD Versal VCK190 achieving 1.3x-5x gains. No equations, self-citations, or derivation steps are quoted that reduce the optimality claim, performance predictions, or reconfiguration benefits directly to fitted inputs or prior self-referential results by construction. The central claims rest on external hardware benchmarks rather than internal self-definition or fitted-input renaming, making the derivation self-contained against the stated evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- design parameters searched in two-stage DSE
axioms (1)
- domain assumption: The analytical model correctly predicts real hardware throughput and resource usage for reconfigured designs