pith. machine review for the scientific record.

arxiv: 2602.15172 · v2 · submitted 2026-02-16 · 💻 cs.AR

Recognition: no theorem link

The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 21:28 UTC · model grok-4.3

classification 💻 cs.AR
keywords accelerator mapping · DNN accelerators · optimal mapping · energy-delay product · dataplacement · low-latency design · search-space pruning

The pith

A new dataplacement concept lets TCM find optimal accelerator mappings in 17 seconds instead of hours.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TCM, a mapper that exhaustively searches the space of computation and data-movement schedules for DNN accelerators to minimize energy and latency. It defines dataplacement as a new classification of mappings that reveals massive redundancy, allowing the search space to be pruned by up to 32 orders of magnitude while still containing every optimal solution. This reduction makes full enumeration practical for the first time, replacing heuristic search with guaranteed optimality. The result is mappings that deliver 1.2 to 6.5 times better energy-delay product than prior mappers while cutting search time by three orders of magnitude.

Core claim

TCM is the first mapper that can locate provably optimal mappings for accelerator designs in feasible runtime. By introducing the dataplacement concept, the authors identify and eliminate redundant and suboptimal mappings, shrinking the space from as large as 10^37 candidates down to roughly 10^5. Exhaustive search over this reduced space produces mappings whose energy-delay product is 1.2-6.5 times better than those found by previous heuristic or metaheuristic mappers, while reducing search time from five hours to 17 seconds.
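The objective behind these numbers, the energy-delay product, can be sketched as follows. This is a toy illustration with invented candidate mappings and costs, not the paper's cost model:

```python
# Hypothetical sketch: the energy-delay product (EDP) metric and the
# argmin over a candidate mapspace. Mapping names and numbers are
# invented for illustration.
def edp(mapping):
    return mapping["energy_pj"] * mapping["latency_ns"]

candidates = [
    {"name": "weight-stationary", "energy_pj": 120.0, "latency_ns": 50.0},
    {"name": "output-stationary", "energy_pj": 100.0, "latency_ns": 70.0},
    {"name": "row-stationary",    "energy_pj": 110.0, "latency_ns": 45.0},
]

best = min(candidates, key=edp)
print(best["name"], edp(best))  # row-stationary 4950.0
```

TCM's claim is that this argmin can be taken over the *entire* (pruned) mapspace rather than a heuristically sampled subset.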

What carries the argument

Dataplacement, a classification of mappings that exposes redundancy and suboptimality without discarding any optimal solution, enabling drastic yet safe pruning of the mapspace.

If this is right

  • Designers can now obtain mappings that improve accelerator energy-delay product by 1.2 to 6.5 times over those produced by prior mappers.
  • Mapping search times drop from hours to seconds, allowing many more design iterations in the same wall-clock time.
  • Full enumeration of the mapspace becomes practical, so optimality is guaranteed rather than approximated.
  • The pruned space of approximately 10^5 mappings can be evaluated on ordinary workstations without specialized hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same redundancy-identification technique could be applied to mapping problems for non-DNN workloads or to other hardware resources such as interconnects.
  • Embedding TCM inside automated design flows would shift the default from heuristic tuning to guaranteed-optimal schedules.
  • Testing TCM on emerging memory technologies or heterogeneous accelerators would reveal whether the dataplacement pruning rules generalize.
  • Open-sourcing the pruned enumeration engine could let smaller teams achieve the same optimality previously available only to those with large compute budgets.

Load-bearing premise

The dataplacement classification correctly marks every redundant or suboptimal mapping for removal without ever discarding an optimal mapping.

What would settle it

Apply both TCM and a brute-force exhaustive search to a small accelerator configuration and DNN workload; check whether TCM returns the same minimum energy-delay-product mapping that the exhaustive search finds.
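Such a check could be sketched as follows. Everything here is invented for illustration: the cost function stands in for an EDP model and the signature stands in for a pruning rule, neither taken from the paper. The point is the shape of the test: a safe pruning rule must leave the brute-force optimum reachable.

```python
from itertools import permutations

# Toy settling experiment: mappings are loop orderings of three tiled
# dimensions, cost() is a stand-in for an EDP model, and signature() is
# a stand-in pruning rule keeping one representative per
# "data-movement class". If the rule is safe, the pruned search must
# return the brute-force optimum.
DIMS = ("M", "N", "K")

def cost(order):
    weights = {"M": 3, "N": 2, "K": 5}
    return sum((i + 1) * weights[d] for i, d in enumerate(order))

def signature(order):
    # Pretend the innermost loop alone determines data movement.
    return order[-1]

full_space = list(permutations(DIMS))
pruned = {}
for order in full_space:
    sig = signature(order)
    if sig not in pruned or cost(order) < cost(pruned[sig]):
        pruned[sig] = order

brute_best = min(full_space, key=cost)
pruned_best = min(pruned.values(), key=cost)
assert cost(brute_best) == cost(pruned_best)
print(len(full_space), "->", len(pruned))  # 6 -> 3
```

A failure of the assertion on any small instance would falsify the load-bearing premise above.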

Figures

Figures reproduced from arXiv: 2602.15172 by Joel S. Emer, Michael Gilbert, Tanner Andrulis, Vivienne Sze.

Figure 1: (a) An example LoopTree, the different types of …
Figure 2: Constructing a mapspace for the Einsum in Eq. …
Figure 3: (a) An example dataplacement and slots where loops …
Figure 4: A mapping with non-helpful loops. Notice that the …
Figure 5: Overview of TCM. Each mapper process (in blue) is …
Figure 6: Search size reduction by each of our optimizations. …
Figure 7: Mapspace size scaling, with and without pruning, …
Figure 8: (Left) Model speed comparison. The curried model …
Figure 10: TCM accurately informs design space explorations, …
Figure 11: TCM balances the energy of multiple memory levels …
Original abstract

The energy and latency of an accelerator running a deep neural network (DNN) depend on how the computation and data movement are scheduled in the accelerator (i.e., mapping), and picking an optimal mapping is essential to achieve high-performance accelerators. However, it is challenging to find mappings that maximize accelerator performance. The space of mappings is large, and prior works cannot guarantee finding optimal mappings because they use heuristics or metaheuristics to narrow the search space. To address this challenge, we propose the Turbo-Charged Mapper (TCM), a fast mapper that finds optimal mappings. The key to our approach is that we define a new mapping concept called dataplacement, which, like the prior concept of dataflow, allows for clear analysis and comparison of mappings. Through it, we identify opportunities to prune redundant and suboptimal mappings, reducing search space by up to 32 orders of magnitude ($10^{37}\rightarrow10^5$). TCM leverages these insights to perform full mapspace searches, making it the first mapper that can find optimal mappings in feasible runtime. Compared to prior mappers, TCM improves accelerator energy-delay-product by $1.2-6.5\times$ while simultaneously reducing mapping search time by $1000\times$ (5 hours $\rightarrow$ 17 seconds).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Turbo-Charged Mapper (TCM) for DNN accelerator mapping. It defines a new concept called dataplacement (analogous to dataflow) to analyze mappings, prune redundant and suboptimal ones, and reduce the search space by up to 32 orders of magnitude (10^37 to 10^5). This enables exhaustive enumeration to find optimal mappings, claimed to be the first such mapper with feasible runtime; experiments report 1.2-6.5× better energy-delay product and 1000× faster search (5 hours to 17 seconds) versus prior mappers.

Significance. If the dataplacement pruning rules are complete and provably preserve the true optimum, the result would be a substantial contribution to accelerator design automation: the first practical method to guarantee optimality rather than rely on heuristics whose sub-optimality cannot be bounded. The reported EDP gains and runtime reduction would then be directly attributable to exhaustive search over a correctly pruned space.

major comments (2)
  1. [Abstract and dataplacement definition] The optimality guarantee is load-bearing on the claim that dataplacement pruning discards only redundant or suboptimal mappings while retaining every optimal one. The abstract states the reduction but supplies no derivation of the pruning rules, no small-instance exhaustive enumeration to validate completeness, and no counter-example search; a reduction of this magnitude requires explicit proof or verification that the retained 10^5 mappings contain the global optimum for the target accelerators.
  2. [Experimental evaluation] Experimental comparisons report 1.2-6.5× EDP improvement and 1000× speedup, yet the manuscript provides no table of per-benchmark results, no description of the exact baseline mappers and their search budgets, and no error bars or multiple-run statistics; without these, it is impossible to assess whether the gains are consistent or driven by particular workloads.
minor comments (2)
  1. [Abstract] Clarify whether the 17-second runtime includes the cost of computing dataplacement attributes or only the subsequent exhaustive enumeration.
  2. [Dataplacement section] Add a small worked example (e.g., a 2-layer network on a 4-PE accelerator) showing the original mapping space, the dataplacement equivalence classes, and the pruned space to illustrate the pruning logic.
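The kind of worked example requested in minor comment 2 could look like this. The equivalence rule here (mappings sharing a multiset of tile sizes are redundant) is a toy stand-in invented for illustration, not the paper's dataplacement definition; only the counting pattern matters:

```python
# Hypothetical worked example of equivalence-class pruning: ordered
# three-level tilings of a dimension of size 16, with mappings that use
# the same multiset of tile sizes treated as redundant (a toy rule).
def three_level_tilings(n):
    out = []
    for a in range(1, n + 1):
        if n % a:
            continue
        for b in range(1, n // a + 1):
            if (n // a) % b:
                continue
            out.append((a, b, n // (a * b)))
    return out

full = three_level_tilings(16)  # ordered (L2 tile, L1 tile, PE tile)
classes = {}
for tiling in full:
    classes.setdefault(tuple(sorted(tiling)), []).append(tiling)

print(len(full), "mappings collapse to", len(classes), "classes")
# 15 mappings collapse to 4 classes
```

Even this tiny instance shows the mechanism behind the paper's far larger reduction: enumerating one representative per class, not every ordered mapping.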

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Abstract and dataplacement definition] The optimality guarantee is load-bearing on the claim that dataplacement pruning discards only redundant or suboptimal mappings while retaining every optimal one. The abstract states the reduction but supplies no derivation of the pruning rules, no small-instance exhaustive enumeration to validate completeness, and no counter-example search; a reduction of this magnitude requires explicit proof or verification that the retained 10^5 mappings contain the global optimum for the target accelerators.

    Authors: We agree the abstract is concise and omits the full derivation. Section 3 of the manuscript derives the dataplacement pruning rules, proving they remove only redundant (data-movement equivalent) or provably suboptimal mappings while retaining every optimal one. To address the validation request, we will add a new subsection with small-instance exhaustive checks on toy accelerators (e.g., 2x2 PE arrays) that compare the pruned space against the full space, confirming the global optimum is always preserved. We will also note that the proof structure precludes counterexamples. revision: yes

  2. Referee: [Experimental evaluation] Experimental comparisons report 1.2-6.5× EDP improvement and 1000× speedup, yet the manuscript provides no table of per-benchmark results, no description of the exact baseline mappers and their search budgets, and no error bars or multiple-run statistics; without these, it is impossible to assess whether the gains are consistent or driven by particular workloads.

    Authors: We will expand the experimental section with a new table listing per-benchmark EDP and runtime results for every workload. The table will explicitly name the baseline mappers (Timeloop and others), their search budgets and configurations, and the achieved EDP values. Because TCM performs deterministic exhaustive search, run-to-run variance is zero; we will nevertheless report hardware-simulation statistics where relevant to show consistency across workloads. revision: yes

Circularity Check

0 steps flagged

No circularity detected; optimality rests on analytical pruning claim

Full rationale

The paper introduces dataplacement as a new concept to analyze mappings, identify redundant/suboptimal ones, and prune the space from 10^37 to 10^5 while claiming to preserve the true optimum, enabling exhaustive search. No equations, fitted parameters, self-citations, or ansatzes are quoted that reduce the optimality guarantee to a definition or input by construction. The pruning is presented as an independent analysis of dataflow and dataplacement properties rather than a self-referential step, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the correctness of the dataplacement definition for exhaustive pruning and on the assumption that the reduced space still contains the global optimum.

axioms (1)
  • domain assumption: Dataplacement correctly captures all redundancy and suboptimality relations among mappings.
    Invoked to justify the 32-order-of-magnitude reduction and the guarantee of optimality.
invented entities (1)
  • dataplacement · no independent evidence
    purpose: New abstraction parallel to dataflow that enables pruning of the mapping space
    Introduced in the paper as the key enabler of the search-space reduction.

pith-pipeline@v0.9.0 · 5541 in / 1289 out tokens · 21508 ms · 2026-05-15T21:28:54.784813+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design

    cs.AR · 2026-02 · unverdicted · novelty 7.0

    FFM finds optimal fused mappings for tensor accelerators over 10,000 times faster than prior mappers while cutting energy-delay product by up to 1.8x versus hand-tuned designs.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 1 Pith paper
