The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design
Pith reviewed 2026-05-15 21:28 UTC · model grok-4.3
The pith
A new dataplacement concept lets TCM find optimal accelerator mappings in 17 seconds instead of hours.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TCM is the first mapper that can locate provably optimal mappings for accelerator designs in feasible runtime. By introducing the dataplacement concept, the authors identify and eliminate redundant and suboptimal mappings, shrinking the space from as many as 10^37 candidates down to roughly 10^5. Exhaustive search over this reduced space produces mappings whose energy-delay product is 1.2-6.5 times better than those found by prior heuristic or metaheuristic mappers, while cutting search time from about five hours to 17 seconds, a roughly 1000x reduction.
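The 10^37-scale figure is plausible from simple combinatorics: tiling each loop dimension across several memory levels and choosing a loop order at every level multiplies quickly. A minimal sketch of that arithmetic, using toy extents and level counts of our own choosing (not the paper's configuration):

```python
from math import factorial

def ordered_factorizations(n, k):
    """Ways to write n as an ordered product of k positive factors
    (i.e., tile-size choices for one loop across k memory levels)."""
    if k == 1:
        return 1
    return sum(ordered_factorizations(n // d, k - 1)
               for d in range(1, n + 1) if n % d == 0)

# Toy layer: 7 loop dimensions of extent 16 each, tiled across 3 memory
# levels, with a free loop order at every level (illustrative numbers).
dims, extent, levels = 7, 16, 3
tilings = ordered_factorizations(extent, levels) ** dims
orderings = factorial(dims) ** levels  # one loop permutation per level
print(f"toy mapspace size = {tilings * orderings:.2e}")  # already > 10^19
```

Even this small toy exceeds 10^19 candidates; realistic extents and deeper hierarchies push the count toward the 10^37 the abstract cites, which is why pruning rather than faster enumeration is the operative idea.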
What carries the argument
Dataplacement, a classification of mappings that exposes redundancy and suboptimality without discarding any optimal solution, enabling drastic yet safe pruning of the mapspace.
If this is right
- Designers can now obtain mappings that improve accelerator energy-delay product by 1.2 to 6.5 times over those produced by prior mappers.
- Mapping search times drop from hours to seconds, allowing many more design iterations in the same wall-clock time.
- Full enumeration of the mapspace becomes practical, so optimality is guaranteed rather than approximated.
- The pruned space of approximately 10^5 mappings can be evaluated on ordinary workstations without specialized hardware.
Where Pith is reading between the lines
- The same redundancy-identification technique could be applied to mapping problems for non-DNN workloads or to other hardware resources such as interconnects.
- Embedding TCM inside automated design flows would shift the default from heuristic tuning to guaranteed-optimal schedules.
- Testing TCM on emerging memory technologies or heterogeneous accelerators would reveal whether the dataplacement pruning rules generalize.
- Open-sourcing the pruned enumeration engine could let smaller teams achieve the same optimality previously available only to those with large compute budgets.
Load-bearing premise
The dataplacement classification correctly marks every redundant or suboptimal mapping for removal without ever discarding an optimal mapping.
What would settle it
Apply both TCM and a brute-force exhaustive search to a small accelerator configuration and DNN workload; check whether TCM returns the same minimum energy-delay-product mapping that the exhaustive search finds.
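That check can be prototyped in a few lines. The sketch below uses a toy mapspace (loop-order permutations) and a stand-in cost model of our own invention, not the paper's dataplacement rules or EDP model; it only illustrates the validation protocol: prune, then confirm brute force finds the same minimum.

```python
import itertools
import random

random.seed(0)

# Toy mapspace: each permutation of 6 loop indices is one candidate "mapping".
loops = list(range(6))
mapspace = list(itertools.permutations(loops))

# Stand-in cost model (an assumption, not the paper's): outer positions are
# more expensive, so cost = sum of weight[loop] * (position + 1).
weight = [random.random() for _ in loops]

def edp(mapping):
    return sum(weight[l] * (pos + 1) for pos, l in enumerate(mapping))

# A provably safe pruning rule for THIS cost model: by the rearrangement
# inequality, any optimum places the heaviest loop at position 0, so every
# mapping violating that is suboptimal and can be discarded.
heaviest = max(loops, key=lambda l: weight[l])
pruned = [m for m in mapspace if m[0] == heaviest]

best_full = min(mapspace, key=edp)
best_pruned = min(pruned, key=edp)
assert edp(best_full) == edp(best_pruned)  # pruning preserved the optimum
print(f"{len(mapspace)} -> {len(pruned)} mappings; optimum preserved")
```

For TCM the same protocol would substitute a real small accelerator and workload, the paper's EDP model, and the dataplacement pruning rules; agreement between the pruned and unpruned minima on such instances is what would settle the load-bearing premise.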
read the original abstract
The energy and latency of an accelerator running a deep neural network (DNN) depend on how the computation and data movement are scheduled in the accelerator (i.e., mapping), and picking an optimal mapping is essential to achieve high-performance accelerators. However, it is challenging to find mappings that maximize accelerator performance. The space of mappings is large, and prior works cannot guarantee finding optimal mappings because they use heuristics or metaheuristics to narrow the search space. To address this challenge, we propose the Turbo-Charged Mapper (TCM), a fast mapper that finds optimal mappings. The key to our approach is that we define a new mapping concept called dataplacement, which, like the prior concept of dataflow, allows for clear analysis and comparison of mappings. Through it, we identify opportunities to prune redundant and suboptimal mappings, reducing search space by up to 32 orders of magnitude ($10^{37}\rightarrow10^5$). TCM leverages these insights to perform full mapspace searches, making it the first mapper that can find optimal mappings in feasible runtime. Compared to prior mappers, TCM improves accelerator energy-delay-product by $1.2-6.5\times$ while simultaneously reducing mapping search time by $1000\times$ (5 hours $\rightarrow$ 17 seconds).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Turbo-Charged Mapper (TCM) for DNN accelerator mapping. It defines a new concept called dataplacement (analogous to dataflow) to analyze mappings, prune redundant and suboptimal ones, and reduce the search space by up to 32 orders of magnitude (10^37 to 10^5). This enables exhaustive enumeration that finds optimal mappings; the authors claim TCM is the first mapper to do so in feasible runtime. Experiments report 1.2-6.5x better energy-delay product and 1000x faster search (5 hours to 17 seconds) versus prior mappers.
Significance. If the dataplacement pruning rules are complete and provably preserve the true optimum, the result would be a substantial contribution to accelerator design automation: the first practical method to guarantee optimality rather than rely on heuristics whose sub-optimality cannot be bounded. The reported EDP gains and runtime reduction would then be directly attributable to exhaustive search over a correctly pruned space.
major comments (2)
- [Abstract and dataplacement definition] The optimality guarantee is load-bearing on the claim that dataplacement pruning discards only redundant or suboptimal mappings while retaining every optimal one. The abstract states the reduction but supplies no derivation of the pruning rules, no small-instance exhaustive enumeration to validate completeness, and no counter-example search; a reduction of this magnitude requires explicit proof or verification that the retained 10^5 mappings contain the global optimum for the target accelerators.
- [Experimental evaluation] Experimental comparisons report 1.2-6.5× EDP improvement and 1000× speedup, yet the manuscript provides no table of per-benchmark results, no description of the exact baseline mappers and their search budgets, and no error bars or multiple-run statistics; without these, it is impossible to assess whether the gains are consistent or driven by particular workloads.
minor comments (2)
- [Abstract] Clarify whether the 17-second runtime includes the cost of computing dataplacement attributes or only the subsequent exhaustive enumeration.
- [Dataplacement section] Add a small worked example (e.g., a 2-layer network on a 4-PE accelerator) showing the original mapping space, the dataplacement equivalence classes, and the pruned space to illustrate the pruning logic.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript to improve clarity and completeness.
read point-by-point responses
Referee: [Abstract and dataplacement definition] The optimality guarantee is load-bearing on the claim that dataplacement pruning discards only redundant or suboptimal mappings while retaining every optimal one. The abstract states the reduction but supplies no derivation of the pruning rules, no small-instance exhaustive enumeration to validate completeness, and no counter-example search; a reduction of this magnitude requires explicit proof or verification that the retained 10^5 mappings contain the global optimum for the target accelerators.
Authors: We agree the abstract is concise and omits the full derivation. Section 3 of the manuscript derives the dataplacement pruning rules, proving they remove only redundant (data-movement equivalent) or provably suboptimal mappings while retaining every optimal one. To address the validation request, we will add a new subsection with small-instance exhaustive checks on toy accelerators (e.g., 2x2 PE arrays) that compare the pruned space against the full space, confirming the global optimum is always preserved. We will also note that the proof structure precludes counterexamples. revision: yes
Referee: [Experimental evaluation] Experimental comparisons report 1.2-6.5× EDP improvement and 1000× speedup, yet the manuscript provides no table of per-benchmark results, no description of the exact baseline mappers and their search budgets, and no error bars or multiple-run statistics; without these, it is impossible to assess whether the gains are consistent or driven by particular workloads.
Authors: We will expand the experimental section with a new table listing per-benchmark EDP and runtime results for every workload. The table will explicitly name the baseline mappers (Timeloop and others), their search budgets and configurations, and the achieved EDP values. Because TCM performs deterministic exhaustive search, run-to-run variance is zero; we will nevertheless report hardware-simulation statistics where relevant to show consistency across workloads. revision: yes
Circularity Check
No circularity detected; optimality rests on analytical pruning claim
full rationale
The paper introduces dataplacement as a new concept to analyze mappings, identify redundant or suboptimal ones, and prune the space from 10^37 to 10^5 while claiming to preserve the true optimum, enabling exhaustive search. No equations, fitted parameters, self-citations, or ansatzes are quoted that would reduce the optimality guarantee to a definition or an input by construction. The pruning is presented as an independent analysis of dataflow and dataplacement properties rather than a self-referential step, leaving the derivation self-contained and checkable against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: dataplacement correctly captures all redundancy and suboptimality relations among mappings
invented entities (1)
- dataplacement (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design. FFM finds optimal fused mappings for tensor accelerators over 10,000 times faster than prior mappers while cutting energy-delay product by up to 1.8x versus hand-tuned designs.