
arxiv: 2405.13170 · v1 · submitted 2024-05-21 · 💻 cs.AR

FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching

Pith reviewed 2026-05-24 01:08 UTC · model grok-4.3

classification 💻 cs.AR
keywords reconfigurable accelerator · dataflow switching · on-chip data reordering · ML inference · Nest spatial array · BIRRD reduction network · FPGA deployment · data layout optimization

The pith

The FEATHER accelerator uses the Nest array and the BIRRD network to reorder data on-chip, enabling seamless per-layer dataflow switching at low cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that ML accelerators can switch to the optimal dataflow for each layer without the usual high overhead of data layout reordering and datapath changes. It introduces a spatial array called Nest and a multi-stage reduction network called BIRRD that embed reordering and flexible reduction into the hardware itself. This design is modeled in an enhanced version of Timeloop called Layoutloop and implemented on FPGA hardware. Results indicate latency and energy gains over fixed and other reconfigurable accelerators while adding only modest area. The core idea is that on-chip support for reordering removes the barrier to using diverse dataflows.

Core claim

FEATHER leverages a novel spatial array termed Nest and a novel multi-stage reduction network called BIRRD for performing flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resources overhead.

What carries the argument

Nest spatial array combined with BIRRD multi-stage reduction network, which integrate data layout reordering into computation and reduction steps.
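
To make the idea of folding reordering into reduction concrete, here is a toy Python sketch. It is not the paper's BIRRD design: it only shows partial sums being reduced per output and each result being written straight to the bank and row that a target layout expects, so no separate reorder pass is needed. The function name, the bank count, and the `dest_of` mapping are illustrative assumptions.

```python
from collections import defaultdict

def reduce_and_place(partials, dest_of, num_banks):
    """Sum partial results per output id, then write each sum directly
    to the (bank, row) chosen by dest_of. Hypothetical toy model only."""
    sums = defaultdict(int)
    for out_id, value in partials:
        sums[out_id] += value
    banks = [dict() for _ in range(num_banks)]
    for out_id, total in sums.items():
        bank, row = dest_of(out_id)
        banks[bank][row] = total
    return banks

# Four outputs, each produced as two partial sums.
partials = [(0, 1), (1, 2), (2, 3), (3, 4),
            (0, 10), (1, 20), (2, 30), (3, 40)]

# Assumed target layout: interleave outputs across two banks.
def interleave(out_id):
    return out_id % 2, out_id // 2

banks = reduce_and_place(partials, interleave, num_banks=2)
```

Swapping `interleave` for a different mapping changes the resulting layout without touching the reduction step, which is the property the summary attributes to the hardware.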

If this is right

  • Each layer of a model can run under its own best dataflow without paying a reconfiguration penalty each time.
  • In Layoutloop, inference latency improves by 1.27 to 2.89 times and energy efficiency by 1.3 to 6.43 times versus prior accelerators such as NVDLA, SIGMA and Eyeriss on ResNet-50 and MobileNet-V3.
  • On FPGA, throughput is 2.65 times higher than Xilinx DPU and 3.91 times higher than Gemmini, and the gains cost only 6 percent extra area over a fixed-dataflow Eyeriss-like baseline.
  • The Layoutloop extension allows systematic comparison of dataflow choices that also account for layout costs.
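
The page mixes two units: speedup factors in the bullets above and a percentage latency reduction (63.3%) in the Figure 2 caption. A small Python sketch of the conversion, using only those reported numbers, may help when comparing them. The helper names are made up for illustration.

```python
def latency_reduction(speedup):
    """Fraction of latency removed by a given speedup factor."""
    return 1.0 - 1.0 / speedup

def speedup_from_reduction(reduction):
    """Speedup factor implied by a fractional latency reduction."""
    return 1.0 / (1.0 - reduction)

low = latency_reduction(1.27)          # lower end of the reported range
high = latency_reduction(2.89)         # upper end of the reported range
fig2 = speedup_from_reduction(0.633)   # Figure 2's theoretical reduction
```

Under this conversion, a 1.27x to 2.89x speedup corresponds to roughly 21% to 65% less latency, and a 63.3% reduction corresponds to roughly a 2.72x speedup.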

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of future edge accelerators could adopt the same embedded reordering approach instead of building separate fixed units for each dataflow.
  • The low-overhead switching might allow runtime adaptation of dataflows based on changing workload or power constraints.
  • Extending the same mechanism to larger chips or multi-chip systems could reduce the need for complex off-chip data movement during model execution.

Load-bearing premise

The Nest array and BIRRD network can handle every required data reordering and reconfiguration for any dataflow with negligible added latency and area.

What would settle it

Direct measurement on the FPGA implementation showing reordering latency or area overhead rising sharply for a dataflow switch not covered in the reported ResNet-50 and MobileNet-V3 cases.

Figures

Figures reproduced from arXiv: 2405.13170 by Anirudh Itagi, Jianming Tong, Prasanth Chatarasi, Tushar Krishna.

Figure 1: Terminology of convolution workload and dataflow
Figure 2: Latency evaluation of dataflows on 16×16 PE array with various layouts (error bar shows layout impacts, less latency is better). The best flexible dataflow (green bar) theoretically reduces overall latency of fixed dataflow-layout (blue bar) by 63.3%. However, ignoring the impact of layout considerations in theoretical dataflows results in up to a 128× latency gap in practice (yellow bar). FEATHER eliminat…
Figure 3: Layout terminology example: ‘CHW W4H2C2’. ‘CHW’ signifies the inter-line dimension order as C→H→W across lines. ‘W4H2C2’ indicates the intra-line dimension order: (4,2,2) elements from the (W,H,C) dimensions are flattened into a single row in the order of W→H→C. words a buffer could supply per cycle) and the depth represents the total number of buffer row entries as shown in …
Figure 4: Memory efficiency and computation utilization of various …
Figure 5: Overview of reordering patterns. The 2D layout without any reordering is shown in 5a, which only allows reading two rows concurrently, assuming true dual-port SRAM. Line Rotation (5b, e.g., Medusa [48]) moves a row from bank 0 to bank 1 prior to reading, enabling simultaneous access to at most three rows from bank 0 through dual-bank ports. This technique, however, utilizes additional port from bank 1, pot…
Figure 6: Comparison of data reordering implementations. This work proposes RIR that eliminates reorder latency and bank conflicts. We discuss on-chip reorder patterns, including transpose, line rotation, row-reorder and arbitrary reorder, in …
Figure 7: Overview of FEATHER architecture. The compute pipeline (NEST→BIRRD→OB→QM) reads iActs from StaB Ping (or Pong) and writes oActs to StaB Pong (or Ping) with a new data layout. (§IV), (iii) a tool called LayoutLoop for dataflow and layout co-exploration (§V). FEATHER provides two specific benefits over prior work in data reordering: (i) supporting arbitrary reorder, and (ii) proposing RIR to hide reordering …
Figure 8: Micro-architecture of FEATHER’s datapath for convolution/GEMM. For convolution, the NEST reads iActs from StaB and weights from StrB, streaming both in a top-to-bottom pipeline. PEs in a column time-multiplex a common output bus. BIRRD conducts global spatial reduction and reorders results for targeted StaB banks during reduction, altering data layout in StaB. NEST facilitates inter-layer pipelining by re…
Figure 9: Illustration of the FEATHER with NEST and BIRRD employing a convolutional operation with a 2×2 weights featuring 2 input channels (C = 2) and generating 16 output channels (M = 16) across a 4×4 iAct with 2 input channels. The depicted dataflow utilizes a weight-stationary approach, where each PE has a local register file containing a channel of weights (2×2). The dataflow is parallelized for two input chan…
Figure 10: Comparison between per-layer flexible dataflows in …
Figure 11: Example of FEATHER switching from channel-last layout (HWC C4) to a row-major format (MPQ Q4(CHW W4)) during reduction without incurring bank conflicts. This is because multiple iActs are reduced into fewer oActs, thereby reducing accesses within each bank. In this example, NEST leverages parallelism along the kernel M and channel C dimensions, reading and vertically streaming four iActs of four input ch…
Figure 12: FEATHER vs. SoTAs on real devices. We run each …
Figure 13: FEATHER vs. SoTA using Layoutloop (Percentage inside each blue bar indicates average steady-state PE utilization. Red bar indicates bank conflict slowdown, while yellow bar indicates off-chip reordering costs. Lower is better. The red text in the x-axis of the right chart mentions the fixed layout or the layout reordering mechanism for each design.) With per-layer dataflow-layout switching, FEATHER achie…
Figure 14: ASIC resource comparison (FEATHER vs. SoTA). 16×16 FEATHER place-and-route at TSMC 28nm. E. Timing Analysis We layout FEATHER with 64, 256, and 1024 PEs, requiring BIRRD with 8, 16, and 32 inputs. The die photo of FEATHER with 16×16 PEs is shown in Fig. 14b revealing that BIRRD consumes only 4% of the overall post-layout area in the TSMC 28nm process. BIRRD does not have long wires because it is placed ou…
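
The layout notation in the Figure 3 caption (‘CHW W4H2C2’) can be read as an address mapping from a tensor coordinate to a buffer line and an offset within that line. The Python sketch below is one possible reading, not the paper's definition. It assumes that in both orders the first-named dimension varies slowest; the caption does not settle this, and the function and parameter names are hypothetical.

```python
def layout_address(c, h, w, H, W, tile_w=4, tile_h=2, tile_c=2):
    """Return (line, offset) for element (c, h, w) of a C x H x W tensor.

    Assumed reading of 'CHW W4H2C2' (the ordering convention is an
    assumption, not taken from the paper):
      - each line holds a tile of tile_w x tile_h x tile_c elements;
      - inside a line, W is the slowest index and C the fastest;
      - across lines, C tiles are slowest and W tiles fastest.
    """
    c_tile, c_in = divmod(c, tile_c)
    h_tile, h_in = divmod(h, tile_h)
    w_tile, w_in = divmod(w, tile_w)
    tiles_h = -(-H // tile_h)  # ceiling division
    tiles_w = -(-W // tile_w)
    line = (c_tile * tiles_h + h_tile) * tiles_w + w_tile
    offset = (w_in * tile_h + h_in) * tile_c + c_in
    return line, offset

# Example: a 4 x 4 x 8 (C x H x W) tensor fills 8 lines of 16 elements.
addresses = {layout_address(c, h, w, H=4, W=8)
             for c in range(4) for h in range(4) for w in range(8)}
```

Whatever ordering convention is chosen, the mapping must be one-to-one, so the size of `addresses` should equal the number of tensor elements.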
Original abstract

The inference of ML models composed of diverse structures, types, and sizes boils down to the execution of different dataflows (i.e. different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of workload can reduce latency by up to two orders of magnitude over a suboptimal dataflow. Unfortunately, reconfiguring hardware for different dataflows involves on-chip data layout reordering and datapath reconfigurations, leading to non-trivial overhead that hinders ML accelerators from exploiting different dataflows, resulting in suboptimal performance. To address this challenge, we propose FEATHER, an innovative accelerator that leverages a novel spatial array termed Nest and a novel multi-stage reduction network called BIRRD for performing flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resources overhead. For systematically evaluating the performance interaction between dataflows and layouts, we enhance Timeloop, a state-of-the-art dataflow cost modeling and search framework, with layout assessment capabilities, and term it as Layoutloop. We model FEATHER into Layoutloop and also deploy FEATHER end-to-end on the edge ZCU104 FPGA. FEATHER delivers 1.27~2.89x inference latency speedup and 1.3~6.43x energy efficiency improvement compared to various SoTAs like NVDLA, SIGMA and Eyeriss under ResNet-50 and MobiletNet-V3 in Layoutloop. On practical FPGA devices, FEATHER achieves 2.65/3.91x higher throughput than Xilinx DPU/Gemmini. Remarkably, such performance and energy efficiency enhancements come at only 6% area over a fixed-dataflow Eyeriss-like accelerator. Our code is released at https://github.com/maeri-project/FEATHER.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FEATHER, a reconfigurable accelerator for ML inference that uses a novel Nest spatial array and BIRRD multi-stage reduction network to support flexible data layout reordering and datapath reconfiguration. This is claimed to enable seamless switching between optimal dataflows (tiling, ordering, parallelism, shapes) with negligible latency and resource overhead. The authors enhance Timeloop into Layoutloop for joint dataflow-layout modeling, evaluate FEATHER in simulation against NVDLA/SIGMA/Eyeriss baselines on ResNet-50 and MobileNet-V3, and deploy it on a ZCU104 FPGA where it achieves 2.65/3.91x throughput over Xilinx DPU/Gemmini at 6% area overhead relative to a fixed-dataflow Eyeriss-like design. Code is released.

Significance. If the negligible-overhead claim for arbitrary dataflow switching holds, the work could enable more adaptive accelerators that exploit per-layer optimal dataflows without the usual reconfiguration penalty, improving latency and energy on edge devices. Positive elements include the end-to-end FPGA implementation, public code release for reproducibility, and the Layoutloop extension for systematic evaluation. The reported speedups (1.27-2.89x latency, 1.3-6.43x energy) and low area cost would be impactful if the core hardware mechanisms are shown to generalize beyond the evaluated workloads.

major comments (2)
  1. [Evaluation sections (Layoutloop modeling and FPGA deployment)] The central claim that Nest and BIRRD perform all required layout reordering and datapath reconfiguration with negligible latency/resource overhead for arbitrary dataflows (stated in the abstract) is load-bearing but unsupported by isolated measurements. Only aggregate results for ResNet-50 and MobileNet-V3 layers are reported in Layoutloop and the ZCU104 deployment; no breakdown isolates reconfiguration cycles, extra BRAM/DSP usage, or latency scaling when switching between non-evaluated tilings, orderings, or shapes.
  2. [Nest and BIRRD architecture description] The assumption that the reordering network's latency remains negligible independent of dataflow complexity is not tested. If BIRRD latency scales with reduction complexity (common in multi-stage networks), the headline speedups become workload-specific rather than general, undermining the 'seamless switching' contribution.
minor comments (2)
  1. [Abstract] Abstract contains a typo: 'MobiletNet-V3' should be 'MobileNet-V3'.
  2. [Evaluation methodology] The manuscript lacks visible error bars, full methodology details on post-hoc design choices, or discussion of how Layoutloop models reconfiguration costs, reducing verifiability of the reported speedups.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications drawn from the manuscript and indicate revisions to strengthen the evidence for our claims.

Point-by-point responses
  1. Referee: [Evaluation sections (Layoutloop modeling and FPGA deployment)] The central claim that Nest and BIRRD perform all required layout reordering and datapath reconfiguration with negligible latency/resource overhead for arbitrary dataflows (stated in the abstract) is load-bearing but unsupported by isolated measurements. Only aggregate results for ResNet-50 and MobileNet-V3 layers are reported in Layoutloop and the ZCU104 deployment; no breakdown isolates reconfiguration cycles, extra BRAM/DSP usage, or latency scaling when switching between non-evaluated tilings, orderings, or shapes.

    Authors: The manuscript reports aggregate end-to-end results because these reflect realistic ML workloads, with the FPGA deployment on ZCU104 capturing all overheads in practice (yielding the reported throughput gains at 6% area). The architecture sections describe how Nest and BIRRD fuse reordering into the reduction path without extra cycles or dedicated resources. We agree that isolated breakdowns would make the negligible-overhead claim more explicit and will add a new microbenchmark subsection with reconfiguration cycle counts, BRAM/DSP deltas, and scaling across varied tilings and shapes in the revision. revision: yes

  2. Referee: [Nest and BIRRD architecture description] The assumption that the reordering network's latency remains negligible independent of dataflow complexity is not tested. If BIRRD latency scales with reduction complexity (common in multi-stage networks), the headline speedups become workload-specific rather than general, undermining the 'seamless switching' contribution.

    Authors: BIRRD uses a fixed number of pipeline stages whose depth is independent of dataflow parameters; reordering is offloaded to the spatial connections in Nest rather than increasing network stages. This bound is implicit in the consistent low-overhead results across the diverse layers of the two evaluated networks and the FPGA measurements. We will expand the architecture description with an explicit latency analysis showing the fixed-stage property to address the concern. revision: partial
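
    The fixed-depth argument in the second response can be checked with a simple amortization model. The Python sketch below assumes a pipeline whose fill latency is a constant number of cycles, and reports what fraction of a layer's total cycles that fill represents. The depth of 8 and the cycle counts are placeholders for illustration, not figures from the paper.

```python
def fill_overhead(steady_cycles, pipeline_depth):
    """Fraction of total cycles spent filling a fixed-depth pipeline."""
    total = steady_cycles + pipeline_depth
    return pipeline_depth / total

# Placeholder depth and made-up layer sizes, for illustration only.
DEPTH = 8
for cycles in (100, 10_000, 1_000_000):
    print(cycles, f"{fill_overhead(cycles, DEPTH):.4%}")
```

    If the depth really is constant, the fraction shrinks as layers grow. If it grew with the dataflow instead, as the referee suspects, the same function would show the overhead staying flat or rising, which is what an isolated microbenchmark would need to rule out.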

Circularity Check

0 steps flagged

No significant circularity; claims rest on new hardware and external empirical evaluation

Full rationale

The paper introduces Nest and BIRRD as novel components enabling data reordering and reconfiguration, then evaluates them via an enhanced Timeloop (Layoutloop) model and ZCU104 FPGA deployment. Performance numbers (1.27-2.89x latency, etc.) are reported as direct comparisons to external baselines (NVDLA, SIGMA, Eyeriss, Xilinx DPU, Gemmini). No equations, fitted parameters, or self-citations are shown that reduce any prediction or result to the inputs by construction. The 'negligible overhead' premise is an engineering claim supported by aggregate measurements rather than a self-referential derivation. This is a standard non-circular hardware design paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

An abstract-only review gives limited visibility into modeling assumptions. The central claim rests on the functional correctness and low-overhead behavior of the two newly introduced hardware blocks (Nest and BIRRD), for which the FPGA results are the independent validation.

invented entities (2)
  • Nest · no independent evidence
    purpose: spatial array enabling flexible data handling and reordering
    New component proposed to support dataflow switching
  • BIRRD · no independent evidence
    purpose: multi-stage reduction network that performs data reduction together with layout reordering
    New component proposed to hide reordering overhead

pith-pipeline@v0.9.0 · 5880 in / 1505 out tokens · 29907 ms · 2026-05-24T01:08:00.715237+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 3 internal anchors

  [1] “Coral usb accelerator,” https://coral.ai/products/accelerator/, accessed: 2024-02-23
  [2] “Xilinx deep learning processing unit,” https://docs.xilinx.com/r/1.2-English/ug1414-vitis-ai/Deep-Learning-Processor-Unit-DPU, accessed: 2022-12-10
  [3] W. Ali, S. Abdelkarim, M. Zidan, M. Zahran, and A. El Sallab, “Yolo3d: End-to-end real-time 3d oriented object bounding box detection from lidar point cloud,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0
  [4] S. Arora, T. Leighton, and B. Maggs, “On-line algorithms for path selection in a nonblocking network,” in Proceedings of the Twenty-Second Annual ACM Symposium on Theory of Computing, ser. STOC ’90. New York, NY, USA: Association for Computing Machinery, 1990, p. 149–158. [Online]. Available: https://doi.org/10.1145/100216.100232
  [5] S. Arora, T. Leighton, and B. Maggs, “On-line algorithms for path selection in a nonblocking network,” in Proceedings of the twenty-second annual ACM symposium on Theory of computing, 1990, pp. 149–158
  [6] P. Behnam, J. Tong, A. Khare, Y. Chen, Y. Pan, P. Gadikar, A. Bambhaniya, T. Krishna, and A. Tumanov, “Hardware–software co-design for real-time latency–accuracy navigation in tiny machine learning applications,” IEEE Micro, vol. 43, no. 06, pp. 93–101, nov 2023
  [7] P. Behnam, J. Tong, A. Khare, Y. Chen, Y. Pan, P. Gadikar, A. R. Bambhaniya, T. Krishna, and A. Tumanov, “Subgraph stationary hardware-software inference co-design,” 2023
  [8] B. Bogdanski, “Optimized routing for fat-tree topologies,” Department of Informatics, Faculty of Mathematics and Natural Sciences, University of Oslo, Norway, 2014
  [9] P. Chatarasi, H. Kwon, A. Parashar, M. Pellauer, T. Krishna, and V. Sarkar, “Marvel: A data-centric approach for mapping deep learning operators on spatial accelerators,” ACM Trans. Archit. Code Optim., vol. 19, no. 1, dec 2021. [Online]. Available: https://doi.org/10.1145/3485137
  [10] P. Chatarasi, S. Neuendorffer, S. Bayliss, K. Vissers, and V. Sarkar, “Vyasa: A high-performance vectorizing compiler for tensor convolutions on the xilinx ai engine,” in 2020 IEEE High Performance Extreme Computing Conference (HPEC), 2020, pp. 1–10
  [11] K. Chellapilla, S. Puri, and P. Simard, “High performance convolutional neural networks for document processing,” in Tenth international workshop on frontiers in handwriting recognition. Suvisoft, 2006
  [12] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017
  [13] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016
  [14] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017
  [15] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, “Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices,” arXiv preprint arXiv:1807.07928, 2018
  [16] W. J. Dally and B. P. Towles, Principles and practices of interconnection networks. Elsevier, 2004
  [17] M. Dukhan, Y. Wu, and H. Lu, “Qnnpack: Open source library for optimized mobile deep learning,” 2018
  [18] V. Fedyunin, “(beta) channels last memory format in pytorch.” [Online]. Available: https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html
  [19] A. Firoozshahian, J. Coburn, R. Levenstein, R. Nattoji, A. Kamath, O. Wu, G. Grewal, H. Aepala, B. Jakka, B. Dreyer, A. Hutchin, U. Diril, K. Nair, E. K. Aredestani, M. Schatz, Y. Hao, R. Komuravelli, K. Ho, S. Abu Asal, J. Shajrawi, K. Quinn, N. Sreedhara, P. Kansal, W. Wei, D. Jayaraman, L. Cheng, P. Chopda, E. Wang, A. Bikumandla, A. Karthik Sengottuv...
  [20] S. Gao, X. Chen, P. Li, Z. Ren, L. Bing, D. Zhao, and R. Yan, “Abstractive text summarization by incorporating reader comments,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 6399–6406
  [21] H. Genc, A. Haj-Ali, V. Iyer, A. Amid, H. Mao, J. C. Wright, C. Schmidt, J. Zhao, A. J. Ou, M. Banister, Y. S. Shao, B. Nikolic, I. Stoica, and K. Asanovic, “Gemmini: An agile systolic array generator enabling systematic evaluations of deep-learning architectures,” CoRR, vol. abs/1911.09925, 2019. [Online]. Available: http://arxiv.org/abs/1911.09925
  [22] K. Hegde, P.-A. Tsai, S. Huang, V. Chandra, A. Parashar, and C. W. Fletcher, “Mind mappings: enabling efficient algorithm-accelerator mapping space search,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’21. New York, NY, USA: Association for Computing Machi...
  [23] Q. Huang, M. Kang, G. Dinh, T. Norell, A. Kalaiah, J. Demmel, J. Wawrzynek, and Y. S. Shao, “Cosa: Scheduling by constrained optimization for spatial accelerators,” in Proceedings of the 48th Annual International Symposium on Computer Architecture, ser. ISCA ’21. IEEE Press, 2021, p. 554–566. [Online]. Available: https://doi.org/10.1109/ISCA52012.2021.00050
  [24] G. Jeong, G. Kestor, P. Chatarasi, A. Parashar, P.-A. Tsai, S. Rajamanickam, R. Gioiosa, and T. Krishna, “Union: A unified hw-sw co-design ecosystem in mlir for evaluating tensor operations on spatial accelerators,” in 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2021, pp. 30–44
  [25] H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and T. Zhao, “SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 2177–2190. [Onli...
  [26] N. P. Jouppi, D. Hyun Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma, T. Norrie, N. Patil, S. Prasad, C. Young, Z. Zhou, and D. Patterson, “Ten lessons from three generations shaped google’s tpuv4i : Industrial product,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 1–14
  [27] S.-C. Kao and T. Krishna, “Gamma: Automating the hw mapping of dnn models on accelerators via genetic algorithm,” in Proceedings of the 39th International Conference on Computer-Aided Design, ser. ICCAD ’20. New York, NY, USA: Association for Computing Machinery. [Online]. Available: https://doi.org/10.1145/3400302.3415639
  [28] S.-C. Kao and T. Krishna, “Magma: An optimization framework for mapping multiple dnns on multiple accelerator cores,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2022, pp. 814–830
  [29] S.-C. Kao, A. Parashar, P.-A. Tsai, and T. Krishna, “Demystifying map space exploration for npus,” 2022
  [30] S.-C. Kao, M. Pellauer, A. Parashar, and T. Krishna, “Digamma: Domain-aware genetic algorithm for hw-mapping co-optimization for dnn accelerators,” in 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2022, pp. 232–237
  [31] D. Khudia, J. Huang, P. Basu, S. Deng, H. Liu, J. Park, and M. Smelyanskiy, “Fbgemm: Enabling high-performance low-precision deep learning inference,” arXiv preprint arXiv:2101.05615, 2021
  [32] T. Krishna, H. Kwon, A. Parashar, M. Pellauer, and A. Samajdar, “Data orchestration in deep learning accelerators,” 2020
  [33] H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, and A. Parashar, “Maestro: A data-centric approach to understand reuse, performance, and hardware cost of dnn mappings,” IEEE Micro, vol. 40, no. 3, pp. 20–29, 2020
  [34] H. Kwon, M. Pellauer, A. Parashar, and T. Krishna, “Flexion: A quantitative metric for flexibility in dnn accelerators,” IEEE Computer Architecture Letters, vol. 20, no. 1, pp. 1–4, 2021
  [35] H. Kwon, A. Samajdar, and T. Krishna, “MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects,” in Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018
  [36] C. E. Leiserson, “Fat-trees: universal networks for hardware-efficient supercomputing,” IEEE transactions on Computers, vol. 100, no. 10, pp. 892–901, 1985
  [37] W. Liu, S. Liao, W. Ren, W. Hu, and Y. Yu, “High-level semantic feature detection: A new perspective for pedestrian detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5187–5196
  [38] D. Matani, “Efficient pytorch: Tensor memory format matters.” [Online]. Available: https://pytorch.org/blog/tensor-memory-format-matters/
  [39] NVIDIA. (2016) NVIDIA Deep Learning Accelerator (NVDLA). [Online]. Available: http://nvdla.org/primer.html
  [40] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, “Accelerating deep convolutional neural networks using specialized hardware,” Microsoft Research Whitepaper, vol. 2, no. 11, pp. 1–4, 2015
  [41] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A Systematic Approach to DNN Accelerator Evaluation,” in Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019
  [42] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, “Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 58–70
  [43] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, “Ai and ml accelerator survey and trends,” 2022. [Online]. Available: https://arxiv.org/abs/2210.04055
  [44] A. Samajdar, M. Pellauer, and T. Krishna, “Self-adaptive reconfigurable arrays (sara): Using ml to assist scaling gemm acceleration,” ArXiv, vol. abs/2101.04799, 2021
  [45] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, “SCALE-Sim: Systolic CNN Accelerator Simulator,” arXiv preprint arXiv:1811.02883, 2018
  [46] K. Seshadri, B. Akin, J. Laudon, R. Narayanaswami, and A. Yazdanbakhsh, “An evaluation of edge tpu accelerators for convolutional neural networks,” 2022
  [47] Y. S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik, N. Jiang, B. Keller, A. Klinefelter, N. Pinckney, P. Raina, S. G. Tell, Y. Zhang, W. J. Dally, J. Emer, C. T. Gray, B. Khailany, and S. W. Keckler, “Simba: Scaling deep-learning inference with multi-chip-module-based architecture,” in Proceedings of the 52nd Annual IEEE/ACM International Symp...
  [48] Y. Shen, T. Ji, M. Ferdman, and P. Milder, “Medusa: A scalable interconnect for many-port dnn accelerators and wide dram controller interfaces,” in 2018 28th International Conference on Field Programmable Logic and Applications (FPL), 2018, pp. 101–1014
  [49] H. Tao, M. M. Hameed, H. A. Marhoon, M. Zounemat-Kermani, S. Heddam, S. Kim, S. O. Sulaiman, M. L. Tan, Z. Sa’adi, A. D. Mehr, M. F. Allawi, S. Abba, J. M. Zain, M. W. Falah, M. Jamei, N. D. Bokde, M. Bayatvarkeshi, M. Al-Mukhtar, S. K. Bhagat, T. Tiyasha, K. M. Khedher, N. Al-Ansari, S. Shahid, and Z. M. Yaseen, “Groundwater level prediction using machin...
  [50] J. Weng, S. Liu, V. Dadu, Z. Wang, P. Shah, and T. Nowatzki, “Dsagen: Synthesizing programmable spatial accelerators,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 268–281
  [51] Xilinx. (2022) Xilinx Deep Learning Unit (DPU). [Online]. Available: https://docs.xilinx.com/r/en-US/ug1414-vitis-ai/Deep-Learning-Processor-Unit