arxiv: 1907.02217 · v1 · pith:5D4LL2KKnew · submitted 2019-07-04 · 💻 cs.AR

FusionAccel: A General Re-configurable Deep Learning Inference Accelerator on FPGA for Convolutional Neural Networks

Pith reviewed 2026-05-25 08:53 UTC · model grok-4.3

classification 💻 cs.AR
keywords convolutional neural network · FPGA accelerator · reconfigurable hardware · RTL design · deep learning inference · CNN · ASIC migration · Spartan-6

The pith

FusionAccel is a scalable RTL-based CNN accelerator on FPGA that matches Caffe-CPU outputs and supports pre-compilation reconstruction plus runtime reconfiguration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes FusionAccel as a general re-configurable deep learning inference accelerator for convolutional neural networks. It consists of a hardware architecture that adapts to different network structures and includes supporting software. The architecture allows reconstruction before compilation and reconfiguration at runtime. The design is implemented in RTL and verified functionally on a Xilinx Spartan-6 FPGA, where it produces results identical to those from Caffe-CPU. Because the project uses RTL throughout, the design can be migrated to ASIC after only minor replacement of FPGA-specific IPs.

Core claim

FusionAccel is a scalable convolutional neural network accelerator hardware architecture with supporting software. It can adapt to different network structures and can be reconstructed before compilation and reconfigured at runtime. This paper realizes this RTL convolutional neural network accelerator design and functional verifications on a Xilinx Spartan-6 FPGA. The result is identical to that of Caffe-CPU. Since the entire project is based on RTL, it can be migrated to ASIC after replacing some FPGA-specific IPs.

What carries the argument

The RTL convolutional neural network accelerator design that enables scalability across network structures through pre-compilation reconstruction and runtime reconfiguration.

If this is right

  • The accelerator handles varying CNN structures without requiring a full redesign for each new network.
  • Pre-compilation reconstruction and runtime reconfiguration together allow the same hardware to serve multiple models.
  • RTL implementation produces results identical to Caffe-CPU on the tested FPGA.
  • The design can be ported to ASIC by swapping only FPGA-specific IP blocks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Users could treat the RTL template as a starting point and change only configuration parameters rather than rewrite accelerator logic for new CNNs.
  • If the reconfiguration mechanism scales to larger or deeper networks, the same FPGA bitstream could support model updates in deployed systems without hardware replacement.
  • Matching Caffe-CPU outputs opens the possibility of using the accelerator as a drop-in replacement in existing Caffe-based workflows for edge inference.

Load-bearing premise

Functional verification on one Spartan-6 FPGA with outputs matching Caffe-CPU is enough to prove the design works correctly for arbitrary CNN structures and can move to ASIC after only minor IP changes.

What would settle it

Running the accelerator on a CNN structure different from those used in verification and obtaining outputs that differ from Caffe-CPU would show the adaptability claim does not hold.
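
The check described above can be sketched as a bit-level comparison between a reference output and the accelerator output. This is a minimal illustration, not the paper's verification script: the function names are hypothetical, and comparing in IEEE 754 half precision is an assumption based on the FP16 data format listed as Figure 21.

```python
import struct


def fp16_bits(value):
    """Round a Python float to IEEE 754 half precision and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def compare_outputs(reference, candidate):
    """Return (index, reference_bits, candidate_bits) for every FP16 mismatch.

    `reference` stands in for the Caffe-CPU output and `candidate` for the
    accelerator output; both are hypothetical flat sequences of floats.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    mismatches = []
    for index, (ref, cand) in enumerate(zip(reference, candidate)):
        ref_bits, cand_bits = fp16_bits(ref), fp16_bits(cand)
        if ref_bits != cand_bits:
            mismatches.append((index, ref_bits, cand_bits))
    return mismatches


if __name__ == "__main__":
    # Identical outputs give an empty mismatch list.
    print(compare_outputs([0.5, 1.0], [0.5, 1.0]))
    # A one-ULP difference in FP16 is reported with both bit patterns.
    print(compare_outputs([0.5, 1.0], [0.5, 1.001]))
```

An empty list on a network outside the original verification set would support the adaptability claim; any entry in the list would be a concrete counterexample. A bitwise comparison also treats +0.0 and −0.0 as different, which may or may not be desired.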

Figures

Figures reproduced from arXiv: 1907.02217 by Shi Shi.

Figure 1: NVDLA system block diagram.
    Adjacent paper text: 2 Concurrent open-source projects. With regards to accelerator technology, there are two research concentrations: high-performance [2–7] and reconfigurability [8–11]. ASIC accelerators tend to make use of the performance advantage and higher data bandwidth, while FPGA ones tend to make use of the configurability to support more network types. Concurrent two large open-source FPG…
Figure 2: NVDLA inner core.
Figure 3: NVDLA software flow.
    Adjacent paper text: …is that it achieves a balance between computation efficiency and accuracy with 6-bit or 8-bit quantized data. On Xilinx ZU9/ZU7 platforms, a maximum of 1024/512 on-chip DSPs can be utilized. Apart from NVDLA, the fully connected layer, which contains the most weights in CNNs, is realized on CPUs. CHaiDNN project is compiled with Xilinx SDSoC. The SDSoC toolkit analyzes and makes partit…
Figure 4: Opal Kelly XEM6310-LX45 FPGA.
Figure 6: Activation Function (sigmoid, tanh, ReLu).
Figure 7: NVDLA lookup table structure.
Figure 8: NVDLA two-stage lookup tables. Left: LRN. Right: S…
Figure 9: Pooling diagrams.
Figure 11: MEC convolution process.
    Adjacent paper text: 3.3.2 MEC (Memory Efficient Convolution). MEC is an extension of im2col + GEMM, as…
Figure 12: Bitonic sort example.
Figure 13: Pipeline accumulating example.
    Adjacent paper text: …comparator is O(n (log n)^2). If 2^(m−1) parallel comparators are utilized, the time complexity will be O((log n)^2). As in…
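
The complexity figures quoted in this excerpt can be checked with a small software model of a bitonic sorting network. The sketch below is an illustration in Python, not the paper's RTL; the function name and the two counters are hypothetical. For n = 2^m inputs it counts compare-exchange operations (n/2 per stage) and stages (m(m+1)/2), which correspond to the single-comparator and the 2^(m−1)-comparator cases respectively.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length sequence with an iterative bitonic network.

    Returns (sorted_list, comparisons, stages). `comparisons` is the total
    number of compare-exchange operations; `stages` is the number of network
    layers, i.e. the time steps needed when n/2 comparators run in parallel.
    """
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    comparisons = 0
    stages = 0
    k = 2                      # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2             # distance between compared elements
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    comparisons += 1
                    out_of_order = (
                        data[i] > data[partner] if ascending
                        else data[i] < data[partner]
                    )
                    if out_of_order:
                        data[i], data[partner] = data[partner], data[i]
            stages += 1
            j //= 2
        k *= 2
    return data, comparisons, stages


if __name__ == "__main__":
    result, comparisons, stages = bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4])
    print(result)        # [1, 2, 3, 4, 5, 6, 7, 8]
    print(comparisons)   # 24 = (8 / 2) * 6
    print(stages)        # 6  = m * (m + 1) / 2 with m = 3
```

With n = 8 the network needs 24 compare-exchange operations on one comparator but only 6 time steps when four comparators operate in parallel, which is the trade-off the excerpt describes.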
Figure 14: Generic accelerator architecture.
    Adjacent paper text: 3.4 Algorithm Trade-off. 3.4.1 Bitonic Sort & Pipeline Accumulation. Whether to use bitonic sort and pipeline accumulation is determined by the data access format in cache. If the dimension of cache is W or H first, then these two algorithms are practical. But if the cache is channel first, it would significantly increase the computation unit number if these two algorithms…
Figure 15: Operating flow of the generic accelerator.
Figure 16: In-memory padding diagram of generic accelerato…
Figure 17: Xilinx MCB read timing diagram.
Figure 18: MCB read timing and DMA code.
    Adjacent paper text: Moreover, if kernel increases (e.g., in AlexNet there is kernel size of 11×11), the slot number required increases proportionally. It is not a good practice since it makes the hardware size constrained by network size. Unless the slot size is very huge, networks with large convolution kernels are not supported. It is neither a runtime configurable design anyway. This project…
Figure 19: MEC convolution diagram (stride = 1, input channe…
Figure 20: MEC convolution diagram (stride = 2, input channe…
Figure 21: FP16 data format.
Figure 22: Stream Accelerator architecture.
Figure 23: FIFO with independent clock domains.
    Adjacent paper text: …handshake, supporting independent read/write clock domains, as in…
Figure 24: Convolution process (parallelism = 16).
    Adjacent paper text: …accumulator can get the result in one cycle, then the speed of the three stages are the same and the pipeline is filled, which means the resource utilization is the best. 4.2.2 Max-pooling Units. Max-pooling consists of 8 parallel floating-point comparators and 1 FIFO. Its formula is as follows, in which D stands for the elements of input matrix and A stands for the…
Figure 25: Convolution timing sequence.
    Adjacent paper text: 4.2.3 Average-pooling Units. Average-pooling consists of 8 parallel floating-point adders and 8 parallel floating-point dividers. Its computation formula is as follows, in which D stands for elements in the input matrix and A stands for elements in the output matrix.
    $$A_{w_o,c_o,h_o} = \frac{1}{k^2} \sum_{h=h_o \cdot s}^{h_o \cdot s + k} \; \sum_{w=w_o \cdot s}^{w_o \cdot s + k} D_{w,h,c_o} \tag{3}$$
    Average-pooling does not change the input matri…
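
Formula (3) in this excerpt averages a k × k window of the input, stepped by stride s, separately for each channel. A minimal software sketch follows, assuming a channel-major nested-list layout `D[c][h][w]` and no padding; the function name and layout are hypothetical and do not reflect the accelerator's cache format. The window is taken as exactly k elements per axis (h from h_o·s to h_o·s + k − 1) so that the 1/k² factor yields a true mean, which is an assumption about how the printed summation limits are meant to be read.

```python
def average_pool(D, k, s):
    """Average-pool D[c][h][w] with a k x k window and stride s (no padding).

    Returns A[c][ho][wo], where each output element is the mean of the
    k * k input elements starting at row ho * s and column wo * s.
    """
    channels = len(D)
    height = len(D[0])
    width = len(D[0][0])
    out_h = (height - k) // s + 1
    out_w = (width - k) // s + 1

    A = [[[0.0] * out_w for _ in range(out_h)] for _ in range(channels)]
    for c in range(channels):
        for ho in range(out_h):
            for wo in range(out_w):
                total = 0.0
                for h in range(ho * s, ho * s + k):
                    for w in range(wo * s, wo * s + k):
                        total += D[c][h][w]
                A[c][ho][wo] = total / (k * k)
    return A


if __name__ == "__main__":
    D = [[[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]]
    print(average_pool(D, k=2, s=2))  # [[[3.5, 5.5], [11.5, 13.5]]]
```

This reference model accumulates in double precision. The hardware described in the excerpt uses floating-point adders and dividers, so a bit-exact comparison would additionally need the same rounding at each accumulation step.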
Figure 26: Max-pooling timing sequence.
    [Residue of the average-pooling timing waveform, cycles 0–30. Signals: clk, engine_valid, avepool_enable, s_fifo_empty, s_fifo_wr_en, s_fifo_rd_en, s_fifo_valid, a_acc, b_acc, result_acc, count, div_data_ready, a_div, b_div, result_div, ready, ready_buf. Values shown: a_acc 0000 2cc5 3803; b_acc 0000 2cc5 36d4 3aa1; result_acc 0000 2cc5 3803; count 0 1 2 168; a_div 0000 558f; b_div 0000 5948; result_div 0000 3836. Annotation: 6 cycles.]
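
The four-digit hexadecimal values in this waveform can be read as IEEE 754 half-precision bit patterns, an assumption consistent with the FP16 data format listed as Figure 21. Under that assumption, the sketch below decodes them with Python's standard `struct` module and repeats one accumulate step and the divide step in software. The pairing of operands with results is inferred from the order in which the values appear and is not confirmed by the source; the helper names are hypothetical.

```python
import struct


def fp16_from_hex(bits_hex):
    """Decode a 4-digit hex string as an IEEE 754 half-precision value."""
    return struct.unpack(">e", bytes.fromhex(bits_hex))[0]


def fp16_to_hex(value):
    """Round a Python float to half precision and return its 4-digit hex pattern."""
    return struct.pack(">e", value).hex()


# Accumulator step: a_acc = 2cc5, b_acc = 36d4 (values taken from the waveform).
a_acc = fp16_from_hex("2cc5")
b_acc = fp16_from_hex("36d4")
acc_hex = fp16_to_hex(a_acc + b_acc)

# Divider step: a_div = 558f, b_div = 5948 (values taken from the waveform).
a_div = fp16_from_hex("558f")
b_div = fp16_from_hex("5948")
div_hex = fp16_to_hex(a_div / b_div)

if __name__ == "__main__":
    print(acc_hex)  # 3803, the next result_acc value in the waveform
    print(b_div)    # 169.0
    print(div_hex)  # 3836, the result_div value in the waveform
```

Under this reading the divisor decodes to 169 = 13², which would fit a 13 × 13 averaging window together with the counter value 168, but that interpretation is an inference rather than something the page states.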
Figure 27: Average-pooling timing sequence.
    Adjacent paper text: Apart from these scripts, Caffe on CPU is also required to verify the inference, which is identical to the BVLC sample script (footnote 6). 4.3 USB3.0 IO. USB3.0 IO block loads input commands to command FIFO and stores input data, weight and bias to the corresponding cache. Meanwhile it transfers the result to result FIFO, and the parameters to host to calculate the cache positions. …
Figure 28: Preprocess.py code script.
Figure 29: Extract.py code script.
Figure 30: The testbench generated by python script.
Figure 31: Block-Throttled PIPE IN timing diagram.
Figure 32: Block-Throttled PIPE OUT timing diagram.
Figure 33: Network parameters required by the accel…
Figure 35: Operation flow of the stream accelerator.
Figure 36: Flow diagram of the software.
    Adjacent paper text: …the bitstream file. In Load Commands the host transfers all parameters of each layer to CMDFIFO on FPGA. In Load Layer these pre-stored parameters will be read out. These parameters will be called by computation units, as well as be used to slice the data blocks. In Process Weight Bias the network weights will be processed and slices. In load weight & bias the biases and weig…
Figure 37: Intermediate result of the accelerator computat…
Figure 38: Final result of the accelerator computation.
Figure 39: Caffe inference result (upper) and accelerator i…
Figure 40: Configurable parameters before compilation.
Original abstract

The deep learning accelerator is one of the methods to accelerate deep learning network computations, which is mainly based on convolutional neural network acceleration. To address the fact that concurrent convolutional neural network accelerators are not solely open-source and the exclusiveness of platforms, FusionAccel, a scalable convolutional neural network accelerator hardware architecture with supporting software is proposed. It can adapt to different network structures and can be reconstructed before compilation and reconfigured at runtime. This paper realizes this RTL convolutional neural network accelerator design and functional verifications on a Xilinx Spartan-6 FPGA. The result is identical to that of Caffe-CPU. Since the entire project is based on RTL, it can be migrated to ASIC after replacing some FPGA-specific IPs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper presents FusionAccel, a scalable and re-configurable RTL-based CNN inference accelerator architecture with supporting software. It claims the design adapts to different network structures via pre-compilation reconstruction and runtime reconfiguration, has been implemented and functionally verified on a Xilinx Spartan-6 FPGA with results identical to Caffe-CPU, and can be migrated to ASIC after replacing FPGA-specific IPs.

Significance. If the verification and generality claims hold, the work would supply an open RTL design for CNN acceleration that is adaptable across networks and potentially portable to ASIC, addressing the noted scarcity of open-source, non-platform-exclusive accelerators.

major comments (3)
  1. [Abstract] The central claim that 'the result is identical to that of Caffe-CPU' is load-bearing for correctness and generality, but the abstract supplies no test networks, layer coverage (padding modes, strides, batch-norm, activations), error metrics, or exclusion criteria.
  2. [Abstract] The portability assertion ('migrated to ASIC after replacing some FPGA-specific IPs') is untested and lacks any discussion of which IPs are replaced or why FPGA timing/resource assumptions would survive the substitution.
  3. [Abstract] No quantitative resource utilization, latency, or throughput figures are reported; these are required to substantiate the 'scalable' and 'general' claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. The points raised correctly identify areas where the manuscript provides insufficient supporting detail for its central claims. We will revise the abstract and add material to the main text to address each issue.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'the result is identical to that of Caffe-CPU' is load-bearing for correctness and generality, but the abstract supplies no test networks, layer coverage (padding modes, strides, batch-norm, activations), error metrics, or exclusion criteria.

    Authors: We agree that the abstract does not supply these details. We will revise the abstract to name the networks used for verification and state the error metric (exact match with the Caffe-CPU output). The experiments section will be expanded to list supported layer parameters and any exclusions. revision: yes

  2. Referee: [Abstract] The portability assertion ('migrated to ASIC after replacing some FPGA-specific IPs') is untested and lacks any discussion of which IPs are replaced or why FPGA timing/resource assumptions would survive the substitution.

    Authors: The referee is correct that the claim is untested. We will qualify or remove the assertion in the abstract and add a short discussion section noting the specific FPGA IPs (e.g., memory and clock primitives) that would require replacement, together with the need for separate ASIC timing closure. revision: yes

  3. Referee: [Abstract] No quantitative resource utilization, latency, or throughput figures are reported; these are required to substantiate the 'scalable' and 'general' claims.

    Authors: We agree that quantitative figures are required. The implementation section will be updated to include explicit resource counts, latency, and throughput numbers measured on the Spartan-6 device; these will also be summarized in the revised abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: implementation verified against independent external benchmark

full rationale

The paper presents an RTL hardware architecture for CNN inference, implemented and functionally verified on a Xilinx Spartan-6 FPGA with outputs matching Caffe-CPU. No equations, fitted parameters, predictions, or self-citations appear in the text. The central claim rests on direct comparison to an external reference implementation rather than any derivation that reduces to its own inputs by construction. This matches the default expectation of a self-contained engineering result with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an engineering demonstration with no equations, fitted constants, or new physical entities. No free parameters, axioms, or invented entities are required or introduced.


Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

  1. [1]

    Efficient methods and hardware for deep learning

    Song Han and B Dally. Efficient methods and hardware for deep learning. University Lecture, 2017

  2. [2]

    SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

    Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016

  3. [3]

    Nn-x-a hardware accelerator for convolutional neural networks

    Vinayak A Gokhale. Nn-x-a hardware accelerator for convolutional neural networks. 2014

  4. [4]

    Squeezenext: Hardware-aware neural network design

    Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin, Sicheng Zhao, and Kurt Keutzer. Squeezenext: Hardware-aware neural network design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1638–1647, 2018

  5. [5]

    Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving

    Bichen Wu, Forrest Iandola, Peter H Jin, and Kurt Keutzer. Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 129–137, 2017

  6. [6]

    Eie: efficient inference engine on compressed deep neural network

    Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. Eie: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 243–254. IEEE, 2016

  7. [7]

    Origami: A 803-gop/s/w convolutional network accelerator

    Lukas Cavigelli and Luca Benini. Origami: A 803-gop/s/w convolutional network accelerator. IEEE Transactions on Circuits and Systems for Video Technology, 27(11):2461–2475, 2016

  8. [8]

    Towards a universal fpga matrix-vector multiplication architecture

    Srinidhi Kestur, John D Davis, and Eric S Chung. Towards a universal fpga matrix-vector multiplication architecture. In 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines, pages 9–16. IEEE, 2012

  9. [9]

    Cnp: An fpga-based processor for convolutional networks

    Clément Farabet, Cyril Poulet, Jefferson Y Han, and Yann LeCun. Cnp: An fpga-based processor for convolutional networks. In 2009 International Conference on Field Programmable Logic and Applications, pages 32–37. IEEE, 2009

  10. [10]

    Neuflow: A runtime reconfigurable dataflow processor for vision

    Clément Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eugenio Culurciello, and Yann LeCun. Neuflow: A runtime reconfigurable dataflow processor for vision. In CVPR Workshops, pages 109–116, 2011

  11. [11]

    Hardware-oriented Approximation of Convolutional Neural Networks

    Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168, 2016

  12. [12]

    Nvdla primer, nvdla documentation

    NVIDIA Corporation. Nvdla primer, nvdla documentation. https://nvdla.org/primer.html. Accessed March 6, 2019

  13. [13]

    Chaidnn, hls based deep neural network accelerator library for xilinx ultrascale+ mpsocs

    Xilinx Inc. Chaidnn, hls based deep neural network accelerator library for xilinx ultrascale+ mpsocs. https://github.com/Xilinx/CHaiDNN. Accessed March 6, 2019

  14. [14]

    Opal Kelly Incorporated. Xem6310. https://opalkelly.com/products/xem6310. Accessed March 6, 2019

  15. [15]

    Mec: memory-efficient convolution for deep neural network

    Minsik Cho and Daniel Brand. Mec: memory-efficient convolution for deep neural network. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 815–824. JMLR.org, 2017.