FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching
Pith reviewed 2026-05-24 01:08 UTC · model grok-4.3
The pith
FEATHER accelerator uses Nest array and BIRRD network to reorder data on-chip for seamless per-layer dataflow switches at low cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FEATHER leverages a novel spatial array termed Nest and a novel multi-stage reduction network called BIRRD for performing flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resources overhead.
What carries the argument
Nest spatial array combined with BIRRD multi-stage reduction network, which integrate data layout reordering into computation and reduction steps.
If this is right
- Each layer of a model can run under its own best dataflow without paying a reconfiguration penalty each time.
- Inference latency improves by 1.27 to 2.89 times and energy efficiency by 1.3 to 6.43 times versus prior accelerators.
- FPGA throughput rises 2.65 to 3.91 times over Xilinx DPU and Gemmini while using only 6 percent extra area over a fixed-dataflow baseline.
- The Layoutloop extension allows systematic comparison of dataflow choices that also account for layout costs.
Where Pith is reading between the lines
- Designers of future edge accelerators could adopt the same embedded reordering approach instead of building separate fixed units for each dataflow.
- The low-overhead switching might allow runtime adaptation of dataflows based on changing workload or power constraints.
- Extending the same mechanism to larger chips or multi-chip systems could reduce the need for complex off-chip data movement during model execution.
Load-bearing premise
The Nest array and BIRRD network can handle every required data reordering and reconfiguration for any dataflow with negligible added latency and area.
What would settle it
Direct measurement on the FPGA implementation showing reordering latency or area overhead rising sharply for a dataflow switch not covered in the reported ResNet-50 and MobileNet-V3 cases.
Figures
read the original abstract
The inference of ML models composed of diverse structures, types, and sizes boils down to the execution of different dataflows (i.e. different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of workload can reduce latency by up to two orders of magnitude over a suboptimal dataflow. Unfortunately, reconfiguring hardware for different dataflows involves on-chip data layout reordering and datapath reconfigurations, leading to non-trivial overhead that hinders ML accelerators from exploiting different dataflows, resulting in suboptimal performance. To address this challenge, we propose FEATHER, an innovative accelerator that leverages a novel spatial array termed Nest and a novel multi-stage reduction network called BIRRD for performing flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resources overhead. For systematically evaluating the performance interaction between dataflows and layouts, we enhance Timeloop, a state-of-the-art dataflow cost modeling and search framework, with layout assessment capabilities, and term it as Layoutloop. We model FEATHER into Layoutloop and also deploy FEATHER end-to-end on the edge ZCU104 FPGA. FEATHER delivers 1.27~2.89x inference latency speedup and 1.3~6.43x energy efficiency improvement compared to various SoTAs like NVDLA, SIGMA and Eyeriss under ResNet-50 and MobiletNet-V3 in Layoutloop. On practical FPGA devices, FEATHER achieves 2.65/3.91x higher throughput than Xilinx DPU/Gemmini. Remarkably, such performance and energy efficiency enhancements come at only 6% area over a fixed-dataflow Eyeriss-like accelerator. Our code is released at https://github.com/maeri-project/FEATHER.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FEATHER, a reconfigurable accelerator for ML inference that uses a novel Nest spatial array and BIRRD multi-stage reduction network to support flexible data layout reordering and datapath reconfiguration. This is claimed to enable seamless switching between optimal dataflows (tiling, ordering, parallelism, shapes) with negligible latency and resource overhead. The authors enhance Timeloop into Layoutloop for joint dataflow-layout modeling, evaluate FEATHER in simulation against NVDLA/SIGMA/Eyeriss baselines on ResNet-50 and MobileNet-V3, and deploy it on a ZCU104 FPGA where it achieves 2.65/3.91x throughput over Xilinx DPU/Gemmini at 6% area overhead relative to a fixed-dataflow Eyeriss-like design. Code is released.
Significance. If the negligible-overhead claim for arbitrary dataflow switching holds, the work could enable more adaptive accelerators that exploit per-layer optimal dataflows without the usual reconfiguration penalty, improving latency and energy on edge devices. Positive elements include the end-to-end FPGA implementation, public code release for reproducibility, and the Layoutloop extension for systematic evaluation. The reported speedups (1.27-2.89x latency, 1.3-6.43x energy) and low area cost would be impactful if the core hardware mechanisms are shown to generalize beyond the evaluated workloads.
major comments (2)
- [Evaluation sections (Layoutloop modeling and FPGA deployment)] The central claim that Nest and BIRRD perform all required layout reordering and datapath reconfiguration with negligible latency/resource overhead for arbitrary dataflows (stated in the abstract) is load-bearing but unsupported by isolated measurements. Only aggregate results for ResNet-50 and MobileNet-V3 layers are reported in Layoutloop and the ZCU104 deployment; no breakdown isolates reconfiguration cycles, extra BRAM/DSP usage, or latency scaling when switching between non-evaluated tilings, orderings, or shapes.
- [Nest and BIRRD architecture description] The assumption that the reordering network's latency remains negligible independent of dataflow complexity is not tested. If BIRRD latency scales with reduction complexity (common in multi-stage networks), the headline speedups become workload-specific rather than general, undermining the 'seamless switching' contribution.
minor comments (2)
- [Abstract] Abstract contains a typo: 'MobiletNet-V3' should be 'MobileNet-V3'.
- [Evaluation methodology] The manuscript lacks visible error bars, full methodology details on post-hoc design choices, or discussion of how Layoutloop models reconfiguration costs, reducing verifiability of the reported speedups.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications drawn from the manuscript and indicate revisions to strengthen the evidence for our claims.
read point-by-point responses
-
Referee: [Evaluation sections (Layoutloop modeling and FPGA deployment)] The central claim that Nest and BIRRD perform all required layout reordering and datapath reconfiguration with negligible latency/resource overhead for arbitrary dataflows (stated in the abstract) is load-bearing but unsupported by isolated measurements. Only aggregate results for ResNet-50 and MobileNet-V3 layers are reported in Layoutloop and the ZCU104 deployment; no breakdown isolates reconfiguration cycles, extra BRAM/DSP usage, or latency scaling when switching between non-evaluated tilings, orderings, or shapes.
Authors: The manuscript reports aggregate end-to-end results because these reflect realistic ML workloads, with the FPGA deployment on ZCU104 capturing all overheads in practice (yielding the reported throughput gains at 6% area). The architecture sections describe how Nest and BIRRD fuse reordering into the reduction path without extra cycles or dedicated resources. We agree that isolated breakdowns would make the negligible-overhead claim more explicit and will add a new microbenchmark subsection with reconfiguration cycle counts, BRAM/DSP deltas, and scaling across varied tilings and shapes in the revision. revision: yes
-
Referee: [Nest and BIRRD architecture description] The assumption that the reordering network's latency remains negligible independent of dataflow complexity is not tested. If BIRRD latency scales with reduction complexity (common in multi-stage networks), the headline speedups become workload-specific rather than general, undermining the 'seamless switching' contribution.
Authors: BIRRD uses a fixed number of pipeline stages whose depth is independent of dataflow parameters; reordering is offloaded to the spatial connections in Nest rather than increasing network stages. This bound is implicit in the consistent low-overhead results across the diverse layers of the two evaluated networks and the FPGA measurements. We will expand the architecture description with an explicit latency analysis showing the fixed-stage property to address the concern. revision: partial
Circularity Check
No significant circularity; claims rest on new hardware and external empirical evaluation
full rationale
The paper introduces Nest and BIRRD as novel components enabling data reordering and reconfiguration, then evaluates them via an enhanced Timeloop (Layoutloop) model and ZCU104 FPGA deployment. Performance numbers (1.27-2.89x latency, etc.) are reported as direct comparisons to external baselines (NVDLA, SIGMA, Eyeriss, Xilinx DPU, Gemmini). No equations, fitted parameters, or self-citations are shown that reduce any prediction or result to the inputs by construction. The 'negligible overhead' premise is an engineering claim supported by aggregate measurements rather than a self-referential derivation. This is a standard non-circular hardware design paper.
Axiom & Free-Parameter Ledger
invented entities (2)
-
Nest
no independent evidence
-
BIRRD
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We enhance Timeloop ... with layout assessment capabilities, and term it as Layoutloop
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
“Coral usb accelerator,” https://coral.ai/products/accelerator/, accessed: 2024-02-23
work page 2024
-
[2]
Xilinx deep learning processing unit,
“Xilinx deep learning processing unit,” https://docs.xilinx.com/r/1.2- English/ug1414-vitis-ai/Deep-Learning-Processor-Unit-DPU, accessed: 2022-12-10
work page 2022
-
[3]
Yolo3d: End-to-end real-time 3d oriented object bounding box detection from lidar point cloud,
W. Ali, S. Abdelkarim, M. Zidan, M. Zahran, and A. El Sallab, “Yolo3d: End-to-end real-time 3d oriented object bounding box detection from lidar point cloud,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0
work page 2018
-
[4]
On-line algorithms for path selection in a nonblocking network,
S. Arora, T. Leighton, and B. Maggs, “On-line algorithms for path selection in a nonblocking network,” in Proceedings of the Twenty- Second Annual ACM Symposium on Theory of Computing , ser. STOC ’90. New York, NY , USA: Association for Computing Machinery, 1990, p. 149–158. [Online]. Available: https://doi.org/10.1145/100216.100232
-
[5]
On-line algorithms for path selection in a nonblocking network,
——, “On-line algorithms for path selection in a nonblocking network,” in Proceedings of the twenty-second annual ACM symposium on Theory of computing, 1990, pp. 149–158
work page 1990
-
[6]
P. Behnam, J. Tong, A. Khare, Y . Chen, Y . Pan, P. Gadikar, A. Bamb- haniya, T. Krishna, and A. Tumanov, “Hardware–software co-design for real-time latency–accuracy navigation in tiny machine learning applications,” IEEE Micro, vol. 43, no. 06, pp. 93–101, nov 2023
work page 2023
-
[7]
Subgraph stationary hardware- software inference co-design,
P. Behnam, J. Tong, A. Khare, Y . Chen, Y . Pan, P. Gadikar, A. R. Bamb- haniya, T. Krishna, and A. Tumanov, “Subgraph stationary hardware- software inference co-design,” 2023
work page 2023
-
[8]
Optimized routing for fat-tree topologies,
B. Bogdanski, “Optimized routing for fat-tree topologies,” Department of Informatics, Faculty of Mathematics and Natural Sciences, University of Oslo, Norway , 2014
work page 2014
-
[9]
Marvel: A data-centric approach for mapping deep learning operators on spatial accelerators,
P. Chatarasi, H. Kwon, A. Parashar, M. Pellauer, T. Krishna, and V . Sarkar, “Marvel: A data-centric approach for mapping deep learning operators on spatial accelerators,” ACM Trans. Archit. Code Optim. , vol. 19, no. 1, dec 2021. [Online]. Available: https://doi.org/10.1145/3485137
-
[10]
Vyasa: A high-performance vectorizing compiler for tensor convolutions on the xilinx ai engine,
P. Chatarasi, S. Neuendorffer, S. Bayliss, K. Vissers, and V . Sarkar, “Vyasa: A high-performance vectorizing compiler for tensor convolutions on the xilinx ai engine,” in 2020 IEEE High Performance Extreme Computing Conference (HPEC) , 2020, pp. 1–10
work page 2020
-
[11]
High performance convolutional neural networks for document processing,
K. Chellapilla, S. Puri, and P. Simard, “High performance convolutional neural networks for document processing,” in Tenth international workshop on frontiers in handwriting recognition . Suvisoft, 2006
work page 2006
-
[12]
Rethinking Atrous Convolution for Semantic Image Segmentation
L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[13]
Eyeriss: An Energy- Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,
Y .-H. Chen, T. Krishna, J. S. Emer, and V . Sze, “Eyeriss: An Energy- Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid-State Circuits , vol. 52, no. 1, pp. 127–138, 2016
work page 2016
-
[14]
Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,
——, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits , vol. 52, no. 1, pp. 127–138, 2017
work page 2017
-
[15]
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices
Y .-H. Chen, T.-J. Yang, J. Emer, and V . Sze, “Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices,” arXiv preprint arXiv:1807.07928 , 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[16]
W. J. Dally and B. P. Towles, Principles and practices of interconnection networks. Elsevier, 2004
work page 2004
-
[17]
Qnnpack: Open source library for optimized mobile deep learning,
M. Dukhan, Y . Wu, and H. Lu, “Qnnpack: Open source library for optimized mobile deep learning,” 2018
work page 2018
-
[18]
(beta) channels last memory format in pytorch ¶
V . Fedyunin, “(beta) channels last memory format in pytorch ¶.” [Online]. Available: https://pytorch.org/tutorials/intermediate/memory format tutorial.html
-
[19]
Mtia: First generation silicon targeting meta’s recommendation systems,
A. Firoozshahian, J. Coburn, R. Levenstein, R. Nattoji, A. Kamath, O. Wu, G. Grewal, H. Aepala, B. Jakka, B. Dreyer, A. Hutchin, U. Diril, K. Nair, E. K. Aredestani, M. Schatz, Y . Hao, R. Komuravelli, K. Ho, S. Abu Asal, J. Shajrawi, K. Quinn, N. Sreedhara, P. Kansal, W. Wei, D. Jayaraman, L. Cheng, P. Chopda, E. Wang, A. Bikumandla, A. Karthik Sengottuv...
-
[20]
Abstractive text summarization by incorporating reader comments,
S. Gao, X. Chen, P. Li, Z. Ren, L. Bing, D. Zhao, and R. Yan, “Abstractive text summarization by incorporating reader comments,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33, no. 01, 2019, pp. 6399–6406
work page 2019
-
[21]
H. Genc, A. Haj-Ali, V . Iyer, A. Amid, H. Mao, J. C. Wright, C. Schmidt, J. Zhao, A. J. Ou, M. Banister, Y . S. Shao, B. Nikolic, I. Stoica, and K. Asanovic, “Gemmini: An agile systolic array generator enabling systematic evaluations of deep-learning architectures,” CoRR, vol. abs/1911.09925, 2019. [Online]. Available: http://arxiv.org/abs/1911.09925
-
[22]
K. Hegde, P.-A. Tsai, S. Huang, V . Chandra, A. Parashar, and C. W. Fletcher, “Mind mappings: enabling efficient algorithm-accelerator mapping space search,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems , ser. ASPLOS ’21. New York, NY , USA: Association for Computing Machi...
-
[23]
Cosa: Scheduling by constrained optimization for spatial accelerators,
Q. Huang, M. Kang, G. Dinh, T. Norell, A. Kalaiah, J. Demmel, J. Wawrzynek, and Y . S. Shao, “Cosa: Scheduling by constrained optimization for spatial accelerators,” in Proceedings of the 48th Annual International Symposium on Computer Architecture , ser. ISCA ’21. IEEE Press, 2021, p. 554–566. [Online]. Available: https://doi.org/10.1109/ISCA52012.2021.00050
-
[24]
G. Jeong, G. Kestor, P. Chatarasi, A. Parashar, P.-A. Tsai, S. Rajaman- ickam, R. Gioiosa, and T. Krishna, “Union: A unified hw-sw co-design ecosystem in mlir for evaluating tensor operations on spatial accelerators,” in 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2021, pp. 30–44
work page 2021
-
[25]
H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and T. Zhao, “SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . Online: Association for Computational Linguistics, Jul. 2020, pp. 2177–2190. [Onli...
work page 2020
-
[26]
Ten lessons from three generations shaped google’s tpuv4i : Industrial product,
N. P. Jouppi, D. Hyun Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma, T. Norrie, N. Patil, S. Prasad, C. Young, Z. Zhou, and D. Patterson, “Ten lessons from three generations shaped google’s tpuv4i : Industrial product,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) , 2021, pp. 1–14
work page 2021
-
[27]
Gamma: Automating the hw mapping of dnn models on accelerators via genetic algorithm,
S.-C. Kao and T. Krishna, “Gamma: Automating the hw mapping of dnn models on accelerators via genetic algorithm,” in Proceedings of the 39th International Conference on Computer-Aided Design , ser. ICCAD ’20. New York, NY , USA: Association for Computing Machinery,
-
[28]
Available: https://doi.org/10.1145/3400302.3415639
[Online]. Available: https://doi.org/10.1145/3400302.3415639
-
[29]
Magma: An optimization framework for mapping multiple dnns on multiple accelerator cores,
——, “Magma: An optimization framework for mapping multiple dnns on multiple accelerator cores,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA) , 2022, pp. 814–830
work page 2022
-
[30]
Demystifying map space exploration for npus,
S.-C. Kao, A. Parashar, P.-A. Tsai, and T. Krishna, “Demystifying map space exploration for npus,” 2022
work page 2022
-
[31]
Digamma: Domain-aware genetic algorithm for hw-mapping co-optimization for dnn accelerators,
S.-C. Kao, M. Pellauer, A. Parashar, and T. Krishna, “Digamma: Domain-aware genetic algorithm for hw-mapping co-optimization for dnn accelerators,” in 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE) , 2022, pp. 232–237
work page 2022
-
[32]
Fbgemm: Enabling high-performance low-precision deep learning inference,
D. Khudia, J. Huang, P. Basu, S. Deng, H. Liu, J. Park, and M. Smelyan- skiy, “Fbgemm: Enabling high-performance low-precision deep learning inference,” arXiv preprint arXiv:2101.05615 , 2021
-
[33]
Data orchestration in deep learning accelerators,
T. Krishna, H. Kwon, A. Parashar, M. Pellauer, and A. Samajdar, “Data orchestration in deep learning accelerators,” 2020
work page 2020
-
[34]
H. Kwon, P. Chatarasi, V . Sarkar, T. Krishna, M. Pellauer, and A. Parashar, “Maestro: A data-centric approach to understand reuse, performance, and hardware cost of dnn mappings,” IEEE Micro, vol. 40, no. 3, pp. 20–29, 2020
work page 2020
-
[35]
Flexion: A quantitative metric for flexibility in dnn accelerators,
H. Kwon, M. Pellauer, A. Parashar, and T. Krishna, “Flexion: A quantitative metric for flexibility in dnn accelerators,” IEEE Computer Architecture Letters, vol. 20, no. 1, pp. 1–4, 2021
work page 2021
-
[36]
MAERI: Enabling Flex- ible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects,
H. Kwon, A. Samajdar, and T. Krishna, “MAERI: Enabling Flex- ible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects,” in Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018
work page 2018
-
[37]
Fat-trees: universal networks for hardware-efficient supercomputing,
C. E. Leiserson, “Fat-trees: universal networks for hardware-efficient supercomputing,” IEEE transactions on Computers , vol. 100, no. 10, pp. 892–901, 1985
work page 1985
-
[38]
High-level semantic feature detection: A new perspective for pedestrian detection,
W. Liu, S. Liao, W. Ren, W. Hu, and Y . Yu, “High-level semantic feature detection: A new perspective for pedestrian detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2019, pp. 5187–5196
work page 2019
-
[39]
Efficient pytorch: Tensor memory format matters
D. Matani, “Efficient pytorch: Tensor memory format matters.” [Online]. Available: https://pytorch.org/blog/tensor-memory-format-matters/
-
[40]
(2016) NVIDIA Deep Learning Accelerator (NVDLA)
NVIDIA. (2016) NVIDIA Deep Learning Accelerator (NVDLA). [Online]. Available: http://nvdla.org/primer.html
work page 2016
-
[41]
Accelerating deep convolutional neural networks using specialized hardware,
K. Ovtcharov, O. Ruwase, J.-Y . Kim, J. Fowers, K. Strauss, and E. S. Chung, “Accelerating deep convolutional neural networks using specialized hardware,” Microsoft Research Whitepaper, vol. 2, no. 11, pp. 1–4, 2015
work page 2015
-
[42]
Timeloop: A Systematic Approach to DNN Accelerator Evaluation,
A. Parashar, P. Raina, Y . S. Shao, Y .-H. Chen, V . A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A Systematic Approach to DNN Accelerator Evaluation,” in Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019
work page 2019
-
[43]
Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training,
E. Qin, A. Samajdar, H. Kwon, V . Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, “Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) , 2020, pp. 58–70
work page 2020
-
[44]
Ai and ml accelerator survey and trends,
A. Reuther, P. Michaleas, M. Jones, V . Gadepally, S. Samsi, and J. Kepner, “Ai and ml accelerator survey and trends,” 2022. [Online]. Available: https://arxiv.org/abs/2210.04055
-
[45]
Self-adaptive reconfigurable arrays (sara): Using ml to assist scaling gemm acceleration,
A. Samajdar, M. Pellauer, and T. Krishna, “Self-adaptive reconfigurable arrays (sara): Using ml to assist scaling gemm acceleration,” ArXiv, vol. abs/2101.04799, 2021
-
[46]
SCALE-Sim: Systolic CNN Accelerator Simulator
A. Samajdar, Y . Zhu, P. Whatmough, M. Mattina, and T. Krishna, “SCALE-Sim: Systolic CNN Accelerator Simulator,” arXiv preprint arXiv:1811.02883, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[47]
An evaluation of edge tpu accelerators for convolutional neural networks,
K. Seshadri, B. Akin, J. Laudon, R. Narayanaswami, and A. Yazdan- bakhsh, “An evaluation of edge tpu accelerators for convolutional neural networks,” 2022
work page 2022
-
[48]
Simba: Scaling deep-learning inference with multi- chip-module-based architecture,
Y . S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik, N. Jiang, B. Keller, A. Klinefelter, N. Pinckney, P. Raina, S. G. Tell, Y . Zhang, W. J. Dally, J. Emer, C. T. Gray, B. Khailany, and S. W. Keckler, “Simba: Scaling deep-learning inference with multi- chip-module-based architecture,” in Proceedings of the 52nd Annual IEEE/ACM International Symp...
-
[49]
Y . Shen, T. Ji, M. Ferdman, and P. Milder, “Medusa: A scalable intercon- nect for many-port dnn accelerators and wide dram controller interfaces,” in 2018 28th International Conference on Field Programmable Logic and Applications (FPL) , 2018, pp. 101–1014
work page 2018
-
[50]
Groundwater level prediction using machine learning models: A comprehensive review,
H. Tao, M. M. Hameed, H. A. Marhoon, M. Zounemat-Kermani, S. Heddam, S. Kim, S. O. Sulaiman, M. L. Tan, Z. Sa’adi, A. D. Mehr, M. F. Allawi, S. Abba, J. M. Zain, M. W. Falah, M. Jamei, N. D. Bokde, M. Bayatvarkeshi, M. Al-Mukhtar, S. K. Bhagat, T. Tiyasha, K. M. Khedher, N. Al-Ansari, S. Shahid, and Z. M. Yaseen, “Groundwater level prediction using machin...
work page 2022
-
[51]
Dsagen: Synthesizing programmable spatial accelerators,
J. Weng, S. Liu, V . Dadu, Z. Wang, P. Shah, and T. Nowatzki, “Dsagen: Synthesizing programmable spatial accelerators,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) , 2020, pp. 268–281
work page 2020
-
[52]
(2022) Xilinx Deep Learning Unit (DPU)
Xilinx. (2022) Xilinx Deep Learning Unit (DPU). [Online]. Avail- able: https://docs.xilinx.com/r/en-US/ug1414-vitis-ai/Deep-Learning- Processor-Unit
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.