FusionAccel: A General Re-configurable Deep Learning Inference Accelerator on FPGA for Convolutional Neural Networks

Shi Shi

arxiv: 1907.02217 · v1 · pith:5D4LL2KKnew · submitted 2019-07-04 · 💻 cs.AR

Pith reviewed 2026-05-25 08:53 UTC · model grok-4.3

classification 💻 cs.AR

keywords convolutional neural network · FPGA accelerator · reconfigurable hardware · RTL design · deep learning inference · CNN · ASIC migration · Spartan-6

The pith

FusionAccel is a scalable RTL-based CNN accelerator on FPGA that matches Caffe-CPU outputs and supports pre-compilation reconstruction plus runtime reconfiguration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes FusionAccel as a general re-configurable deep learning inference accelerator for convolutional neural networks. It consists of a hardware architecture that adapts to different network structures and includes supporting software. The architecture allows reconstruction before compilation and reconfiguration at runtime. The design is implemented in RTL and verified functionally on a Xilinx Spartan-6 FPGA, where it produces results identical to those from Caffe-CPU. Because the project uses RTL throughout, the design can be migrated to ASIC after only minor replacement of FPGA-specific IPs.

Core claim

FusionAccel is a scalable convolutional neural network accelerator hardware architecture with supporting software. It can adapt to different network structures and can be reconstructed before compilation and reconfigured at runtime. This paper realizes this RTL convolutional neural network accelerator design and functional verifications on a Xilinx Spartan-6 FPGA. The result is identical to that of Caffe-CPU. Since the entire project is based on RTL, it can be migrated to ASIC after replacing some FPGA-specific IPs.

What carries the argument

The RTL convolutional neural network accelerator design that enables scalability across network structures through pre-compilation reconstruction and runtime reconfiguration.

If this is right

The accelerator handles varying CNN structures without requiring a full redesign for each new network.
Pre-compilation reconstruction and runtime reconfiguration together allow the same hardware to serve multiple models.
RTL implementation produces results identical to Caffe-CPU on the tested FPGA.
The design can be ported to ASIC by swapping only FPGA-specific IP blocks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Users could treat the RTL template as a starting point and change only configuration parameters rather than rewrite accelerator logic for new CNNs.
If the reconfiguration mechanism scales to larger or deeper networks, the same FPGA bitstream could support model updates in deployed systems without hardware replacement.
Matching Caffe-CPU outputs opens the possibility of using the accelerator as a drop-in replacement in existing Caffe-based workflows for edge inference.

Load-bearing premise

Functional verification on one Spartan-6 FPGA with outputs matching Caffe-CPU is enough to prove the design works correctly for arbitrary CNN structures and can move to ASIC after only minor IP changes.

What would settle it

Running the accelerator on a CNN structure different from those used in verification and obtaining outputs that differ from Caffe-CPU would show the adaptability claim does not hold.
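
A check of this kind could be sketched as follows. This is only an illustration under stated assumptions: it assumes the accelerator returns raw 16-bit FP16 words (suggested by the Figure 21 caption) and that the Caffe-CPU run exports ordinary floats. The helper names `to_fp16_bits` and `first_mismatch` are hypothetical and are not part of FusionAccel.

```python
import struct

def to_fp16_bits(values):
    """Round each float to IEEE 754 half precision and return the 16-bit patterns."""
    return [struct.unpack("<H", struct.pack("<e", v))[0] for v in values]

def first_mismatch(accel_bits, reference_values):
    """Return the index of the first differing FP16 word, or None if bit-identical."""
    ref_bits = to_fp16_bits(reference_values)
    if len(accel_bits) != len(ref_bits):
        raise ValueError("output lengths differ")
    for i, (a, r) in enumerate(zip(accel_bits, ref_bits)):
        if a != r:
            return i
    return None

# Illustrative data only: three reference values and the matching FP16 words.
reference = [0.5, 169.0, -1.25]
accel = [0x3800, 0x5948, 0xBD00]
print(first_mismatch(accel, reference))
```

Comparing bit patterns rather than floats with a tolerance is the strictest reading of "identical"; a looser metric would need to be stated explicitly.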

Figures

Figures reproduced from arXiv: 1907.02217 by Shi Shi.

**Figure 1.** NVDLA system block diagram. Section 2, Concurrent open-source projects: With regard to accelerator technology, there are two research concentrations: high performance [2–7] and reconfigurability [8–11]. ASIC accelerators tend to exploit their performance advantage and higher data bandwidth, while FPGA accelerators tend to exploit configurability to support more network types. Concurrent two large opensource FPG…

**Figure 2.** NVDLA inner core.

**Figure 3.** NVDLA software flow. … is that it achieves a balance between computation efficiency and accuracy with 6-bit or 8-bit quantized data. On Xilinx ZU9/ZU7 platforms, a maximum of 1024/512 on-chip DSPs can be utilized. Apart from NVDLA, the fully connected layer, which contains the most weights in CNNs, is realized on CPUs. The CHaiDNN project is compiled with Xilinx SDSoC. The SDSoC toolkit analyzes and makes partit…

**Figure 4.** Opal Kelly XEM6310-LX45 FPGA.

**Figure 6.** Activation functions (sigmoid, tanh, ReLU).

**Figure 7.** NVDLA lookup table structure.

**Figure 8.** NVDLA two-stage lookup tables. Left: LRN. Right: S…

**Figure 9.** Pooling diagrams.

**Figure 11.** MEC convolution process. Section 3.3.2, MEC (Memory Efficient Convolution): MEC is an extension of im2col + GEMM, as…
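
For orientation, the im2col + GEMM baseline that MEC extends can be sketched in a few lines. This is not MEC itself and not the paper's RTL: it is a single-channel, pure-Python illustration, and the function names `im2col` and `conv2d_im2col` are assumptions made for the example.

```python
def im2col(image, k, stride=1):
    """Unroll every k x k window of a 2-D list into one row of length k*k."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            rows.append([image[i + di][j + dj] for di in range(k) for dj in range(k)])
    return rows

def conv2d_im2col(image, kernel, stride=1):
    """Convolve (CNN-style, no kernel flip) as a matrix-vector product over im2col rows."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    cols = im2col(image, k, stride)
    out_w = (len(image[0]) - k) // stride + 1
    # The GEMM step: each unrolled window is dotted with the flattened kernel.
    flat = [sum(a * b for a, b in zip(row, flat_kernel)) for row in cols]
    return [flat[r:r + out_w] for r in range(0, len(flat), out_w)]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
result = conv2d_im2col(image, kernel)
print(result)  # [[6, 8], [12, 14]]
```

The unrolled matrix duplicates overlapping pixels, which is the memory overhead that MEC is designed to reduce.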

**Figure 12.** Bitonic sort example.

**Figure 13.** Pipeline accumulating example. … comparator is $O(n(\log n)^2)$. If $2^{m-1}$ parallel comparators are utilized, the time complexity will be $O((\log n)^2)$. As in…
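
The complexity figures quoted here can be checked with a small software model of a bitonic sorting network. The sketch below is a standard iterative bitonic sort, not the paper's hardware; it assumes the input length is a power of two, n = 2^m, and it counts stages (the steps a fully parallel network needs) and comparisons (the work a single comparator would do).

```python
def bitonic_sort(data):
    """Sort a power-of-two-length list; return (sorted_list, stages, compares)."""
    a = list(data)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    stages = compares = 0
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            stages += 1        # all pairs in one stage are independent
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    compares += 1
                    ascending = (i & k) == 0
                    if ascending and a[i] > a[partner]:
                        a[i], a[partner] = a[partner], a[i]
                    elif not ascending and a[i] < a[partner]:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a, stages, compares

result, stages, compares = bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5])
print(result, stages, compares)  # [1, 2, 3, 4, 5, 6, 7, 8] 6 24
```

For n = 8 (m = 3) every stage performs n/2 = 2^(m−1) = 4 independent comparisons, so four parallel comparators finish in 6 stages, while a single comparator needs all 24 comparisons.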

**Figure 14.** Generic accelerator architecture. Section 3.4, Algorithm Trade-off; 3.4.1, Bitonic Sort & Pipeline Accumulation: Whether to use bitonic sort and pipeline accumulation is determined by the data access format in the cache. If the cache is laid out W-first or H-first, these two algorithms are practical. But if the cache is channel-first, it would significantly increase the number of computation units if these two algorithms…

**Figure 15.** Operating flow of the generic accelerator.

**Figure 16.** In-memory padding diagram of generic accelerato…

**Figure 17.** Xilinx MCB read timing diagram.

**Figure 18.** MCB read timing and DMA code. Moreover, if the kernel size increases (e.g., AlexNet has an 11×11 kernel), the number of slots required increases proportionally. This is not good practice, since it makes the hardware size constrained by the network size. Unless the slot size is very large, networks with large convolution kernels are not supported. Nor is it a runtime-configurable design. This project…

**Figure 19.** MEC convolution diagram (stride = 1, input channe…

**Figure 20.** MEC convolution diagram (stride = 2, input channe…

**Figure 21.** FP16 data format.

**Figure 22.** Stream Accelerator architecture.

**Figure 23.** FIFO with independent clock domains. … handshake, supporting independent read/write clock domains, as in…

**Figure 24.** Convolution process (parallelism = 16). … accumulator can get the result in one cycle, then the speeds of the three stages are the same and the pipeline is filled, which means the resource utilization is the best. Section 4.2.2, Max-pooling Units: Max-pooling consists of 8 parallel floating-point comparators and 1 FIFO. Its formula is as follows, in which D stands for the elements of the input matrix and A stands for the…

**Figure 25.** Convolution timing sequence. Section 4.2.3, Average-pooling Units: Average-pooling consists of 8 parallel floating-point adders and 8 parallel floating-point dividers. Its computation formula is as follows, in which D stands for elements in the input matrix and A stands for elements in the output matrix:

$$A_{w_o,c_o,h_o} = \frac{1}{k^2} \sum_{h=h_o \cdot s}^{h_o \cdot s+k} \; \sum_{w=w_o \cdot s}^{w_o \cdot s+k} D_{w,h,c_o} \qquad (3)$$

Average-pooling does not change the input matri…
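
The average-pooling formula (3) can be read as the following single-channel reference model. It is a software sketch, not the paper's adder/divider pipeline, and it assumes each window covers k × k elements (offsets 0 to k−1 from the window origin), which is what the 1/k² normalisation implies; that reading of the summation limits, and the function name `average_pool`, are assumptions.

```python
def average_pool(channel, k, s):
    """Average-pool one channel (a 2-D list) with window size k and stride s."""
    h, w = len(channel), len(channel[0])
    out = []
    for ho in range((h - k) // s + 1):
        row = []
        for wo in range((w - k) // s + 1):
            total = 0.0
            # Accumulate the k*k inputs of this window, then divide once.
            for dh in range(k):
                for dw in range(k):
                    total += channel[ho * s + dh][wo * s + dw]
            row.append(total / (k * k))
        out.append(row)
    return out

channel = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
pooled = average_pool(channel, k=2, s=2)
print(pooled)  # [[3.5, 5.5], [11.5, 13.5]]
```

Accumulating first and dividing once per output mirrors the split between adders and dividers in the quoted description, although the real unit works on FP16 values and several channels in parallel.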

**Figure 26.** Max-pooling timing sequence. [Average-pooling timing diagram, cycles 0–30; signals: clk, engine_valid, avepool_enable, s_fifo_empty, s_fifo_wr_en, s_fifo_rd_en, s_fifo_valid, a_acc, b_acc, result_acc, count, div_data_ready, a_div, b_div, result_div, ready, ready_buf; sample hex values: a_acc 0000, 2cc5, 3803; b_acc 0000, 2cc5, 36d4, 3aa1; result_acc 0000, 2cc5, 3803; count up to 168; a_div 558f; b_div 5948; result_div 3836 …]
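
The hex words in the average-pooling timing trace can be decoded as IEEE 754 half-precision values to see what the unit is computing. The sketch below assumes the signals carry FP16 data (per the Figure 21 caption) and that `a_*`/`b_*` are the two operands of the accumulator and divider; the pairing of values is a reading of the trace, not something the paper states. The helper names are hypothetical.

```python
import struct

def fp16_from_hex(word):
    """Decode a 4-digit hex string as an IEEE 754 half-precision value."""
    return struct.unpack(">e", bytes.fromhex(word))[0]

def fp16_to_hex(value):
    """Round a Python float to half precision and return its 4-digit hex word."""
    return struct.pack(">e", value).hex()

# Accumulator step: 0x2cc5 + 0x36d4, rounded back to FP16.
acc_hex = fp16_to_hex(fp16_from_hex("2cc5") + fp16_from_hex("36d4"))

# Divider step: running sum divided by the element count.
a_div = fp16_from_hex("558f")   # 88.9375
b_div = fp16_from_hex("5948")   # 169.0, i.e. a 13 x 13 window
quotient_hex = fp16_to_hex(a_div / b_div)

print(acc_hex, a_div, b_div, quotient_hex)  # 3803 88.9375 169.0 3836
```

Under these assumptions the decoded values reproduce `result_acc` = 3803 and `result_div` = 3836 from the trace, and the divisor 169 agrees with the counter reaching 168.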

**Figure 27.** Average-pooling timing sequence. Apart from these scripts, Caffe on CPU is also required to verify the inference, which is identical to the BVLC sample script (footnote 6). Section 4.3, USB3.0 IO: The USB3.0 IO block loads input commands into the command FIFO and stores input data, weights, and biases in the corresponding caches. Meanwhile, it transfers the result to the result FIFO, and the parameters to the host to calculate the cache positions. …

**Figure 28.** Preprocess.py code script.

**Figure 29.** Extract.py code script.

**Figure 30.** The testbench generated by the Python script.

**Figure 31.** Block-Throttled PIPE IN timing diagram.

**Figure 32.** Block-Throttled PIPE OUT timing diagram.

**Figure 33.** Network parameters required by the accel…

**Figure 35.** Operation flow of the stream accelerator.

**Figure 36.** Flow diagram of the software. … the bitstream file. In Load Commands, the host transfers all parameters of each layer to the CMDFIFO on the FPGA. In Load Layer, these pre-stored parameters are read out. They are used by the computation units and also to slice the data blocks. In Process Weight Bias, the network weights are processed and sliced. In Load Weight & Bias, the biases and weig…

**Figure 37.** Intermediate result of the accelerator computat…

**Figure 38.** Final result of the accelerator computation.

**Figure 39.** Caffe inference result (upper) and accelerator i…

**Figure 40.** Configurable parameters before compilation.

Original abstract

The deep learning accelerator is one of the methods to accelerate deep learning network computations, which is mainly based on convolutional neural network acceleration. To address the fact that concurrent convolutional neural network accelerators are not solely open-source and the exclusiveness of platforms, FusionAccel, a scalable convolutional neural network accelerator hardware architecture with supporting software is proposed. It can adapt to different network structures and can be reconstructed before compilation and reconfigured at runtime. This paper realizes this RTL convolutional neural network accelerator design and functional verifications on a Xilinx Spartan-6 FPGA. The result is identical to that of Caffe-CPU. Since the entire project is based on RTL, it can be migrated to ASIC after replacing some FPGA-specific IPs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FusionAccel is an RTL CNN accelerator on Spartan-6 that matches Caffe-CPU outputs, but the abstract supplies no performance numbers, reconfiguration details, or broad testing.

The main point of this paper is that the author built an RTL-based CNN inference accelerator called FusionAccel, implemented it on a Xilinx Spartan-6, and showed that its outputs match a Caffe-CPU reference for the cases checked. The author also claims that the design supports pre-compilation reconstruction and runtime reconfiguration for different networks, plus easy migration to ASIC by swapping a few IPs. The functional match is the one concrete result reported.

Referee Report

3 major / 0 minor

Summary. The paper presents FusionAccel, a scalable and re-configurable RTL-based CNN inference accelerator architecture with supporting software. It claims the design adapts to different network structures via pre-compilation reconstruction and runtime reconfiguration, has been implemented and functionally verified on a Xilinx Spartan-6 FPGA with results identical to Caffe-CPU, and can be migrated to ASIC after replacing FPGA-specific IPs.

Significance. If the verification and generality claims hold, the work would supply an open RTL design for CNN acceleration that is adaptable across networks and potentially portable to ASIC, addressing the noted scarcity of open-source, non-platform-exclusive accelerators.

major comments (3)

[Abstract] Abstract: the central claim that 'the result is identical to that of Caffe-CPU' is load-bearing for correctness and generality but supplies no test networks, layer coverage (padding modes, strides, batch-norm, activations), error metrics, or exclusion criteria.
[Abstract] Abstract: the portability assertion ('migrated to ASIC after replacing some FPGA-specific IPs') is untested and lacks any discussion of which IPs are replaced or why FPGA timing/resource assumptions would survive the substitution.
[Abstract] Abstract: no quantitative resource utilization, latency, or throughput figures are reported, which is required to substantiate the 'scalable' and 'general' claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. The points raised correctly identify areas where the manuscript provides insufficient supporting detail for its central claims. We will revise the abstract and add material to the main text to address each issue.

Referee: [Abstract] Abstract: the central claim that 'the result is identical to that of Caffe-CPU' is load-bearing for correctness and generality but supplies no test networks, layer coverage (padding modes, strides, batch-norm, activations), error metrics, or exclusion criteria.

Authors: We agree that the abstract does not supply these details. We will revise the abstract to name the networks used for verification and state the error metric (exact integer match). The experiments section will be expanded to list supported layer parameters and any exclusions. revision: yes
Referee: [Abstract] Abstract: the portability assertion ('migrated to ASIC after replacing some FPGA-specific IPs') is untested and lacks any discussion of which IPs are replaced or why FPGA timing/resource assumptions would survive the substitution.

Authors: The referee is correct that the claim is untested. We will qualify or remove the assertion in the abstract and add a short discussion section noting the specific FPGA IPs (e.g., memory and clock primitives) that would require replacement, together with the need for separate ASIC timing closure. revision: yes
Referee: [Abstract] Abstract: no quantitative resource utilization, latency, or throughput figures are reported, which is required to substantiate the 'scalable' and 'general' claims.

Authors: We agree that quantitative figures are required. The implementation section will be updated to include explicit resource counts, latency, and throughput numbers measured on the Spartan-6 device; these will also be summarized in the revised abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: implementation verified against independent external benchmark

The paper presents an RTL hardware architecture for CNN inference, implemented and functionally verified on a Xilinx Spartan-6 FPGA with outputs matching Caffe-CPU. The equations in the text, such as the average-pooling formula (3), define standard layer operations; no fitted parameters, predictions, or self-citations appear. The central claim rests on direct comparison to an external reference implementation rather than any derivation that reduces to its own inputs by construction. This matches the default expectation of a self-contained engineering result with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an engineering demonstration whose equations define standard layer operations; it has no fitted constants or new physical entities. No free parameters, axioms, or invented entities are required or introduced.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1] Song Han and B Dally. Efficient methods and hardware for deep learning. University Lecture, 2017.

[2] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.

[3] Vinayak A Gokhale. Nn-x - a hardware accelerator for convolutional neural networks. 2014.

[4] Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin, Sicheng Zhao, and Kurt Keutzer. SqueezeNext: Hardware-aware neural network design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1638–1647, 2018.

[5] Bichen Wu, Forrest Iandola, Peter H Jin, and Kurt Keutzer. SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 129–137, 2017.

[6] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. EIE: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 243–254. IEEE, 2016.

[7] Lukas Cavigelli and Luca Benini. Origami: A 803-GOp/s/W convolutional network accelerator. IEEE Transactions on Circuits and Systems for Video Technology, 27(11):2461–2475, 2016.

[8] Srinidhi Kestur, John D Davis, and Eric S Chung. Towards a universal FPGA matrix-vector multiplication architecture. In 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines, pages 9–16. IEEE, 2012.

[9] Clément Farabet, Cyril Poulet, Jefferson Y Han, and Yann LeCun. CNP: An FPGA-based processor for convolutional networks. In 2009 International Conference on Field Programmable Logic and Applications, pages 32–37. IEEE, 2009.

[10] Clément Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eugenio Culurciello, and Yann LeCun. NeuFlow: A runtime reconfigurable dataflow processor for vision. In CVPR Workshops, pages 109–116, 2011.

[11] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168, 2016.

[12] NVIDIA Corporation. NVDLA primer, NVDLA documentation. https://nvdla.org/primer.html. Accessed March 6, 2019.

[13] Xilinx Inc. CHaiDNN, HLS based deep neural network accelerator library for Xilinx UltraScale+ MPSoCs. https://github.com/Xilinx/CHaiDNN. Accessed March 6, 2019.

[14] Opal Kelly Incorporated. XEM6310. https://opalkelly.com/products/xem6310. Accessed March 6, 2019.

[15] Minsik Cho and Daniel Brand. MEC: memory-efficient convolution for deep neural network. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 815–824. JMLR.org, 2017.
