arxiv: 2604.17692 · v1 · submitted 2026-04-20 · 💻 cs.AR

Recognition: unknown

AccelCIM: Systematic Dataflow Exploration for SRAM Compute-in-Memory Accelerator

Chenhao Xue , Yukun Wang , An Guo , Yuhui Shi , Jinwei Zhou , Xiping Dong , Yihan Yin , Yuanpeng Zhang

show 6 more authors

Tianyu Jia Wei Gao Qiang Wu Xin Si Jun Yang Guangyu Sun

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:25 UTC · model grok-4.3

classification 💻 cs.AR

keywords SRAM CIMCompute-in-MemoryDataflow ExplorationDNN AcceleratorLLM InferenceCycle-Accurate SimulationPPA Analysis

0 comments

The pith

AccelCIM builds a complete design space for dataflow choices in SRAM compute-in-memory accelerators and evaluates them on large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to improve SRAM-based compute-in-memory accelerators for deep neural networks that are too large to fit entirely on the chip. It does this by creating a structured way to consider all possible data movement patterns across memory macros and their arrangements. The framework then uses detailed cycle-by-cycle simulations and chip layout analysis to judge which patterns work best. When tested on language model tasks, it shows which choices reduce the costly back-and-forth data transfers with external memory. This approach replaces incomplete assumptions in earlier studies with a more complete exploration.

Core claim

This paper presents AccelCIM as a systematic dataflow exploration framework for SRAM CIM accelerators. The framework defines a design space that includes configurations of individual CIM macros and how those macros are organized into arrays. Designs are assessed through cycle-accurate architectural simulation combined with post-layout power, performance, and area analysis. The method is demonstrated on representative large language model applications to derive insights for accelerator design.

What carries the argument

A systematic dataflow design space covering both CIM macro configurations and macro-array organizations, paired with cycle-accurate simulation and post-layout PPA evaluation.

Load-bearing premise

The design space and evaluation techniques capture the dominant factors that determine real hardware efficiency for large models.

What would settle it

Fabricating a prototype SRAM CIM accelerator following one of the framework's recommended dataflows and measuring its actual energy consumption and latency against the simulated values.

Figures

Figures reproduced from arXiv: 2604.17692 by An Guo, Chenhao Xue, Guangyu Sun, Jinwei Zhou, Jun Yang, Qiang Wu, Tianyu Jia, Wei Gao, Xin Si, Xiping Dong, Yihan Yin, Yuanpeng Zhang, Yuhui Shi, Yukun Wang.

**Figure 4.** Figure 4: AccelCIM CIM macro template. Macro [BR-1][0] ＋ ··· ··· ··· input (b) Systolic, Output-Stationary Dataflow ··· ··· (d) Systolic, Weight-Stationary Dataflow ＋＋＋ ··· input weight input output output weight Macro [0][0] ＋ ··· Macro [0][BC-1] Macro [BR-1][BC-1] ＋＋ Macro [0][0] Macro [BR-1][0] output ＋＋＋ ··· output Macro [0][BC-1] Macro [BR-1][BC-1] ··· ··· ＋ Macro [0][0] ··· ＋ Macro [BR-1][0] output ··· ＋ … view at source ↗

**Figure 6.** Figure 6: AccelCIM’s macro array generator workflow. (a) CIM macro layout (b) Macro Array layout Crtl I/O Banks ×8 Input Driver Wordline Driver [PITH_FULL_IMAGE:figures/full_fig_p004_6.png] view at source ↗

**Figure 7.** Figure 7: Example layouts from AccelCIM’s macro array generator. of the current weight row, new weights enter each column in a similar staggered manner. This ensures that each macro can update the current weight row upon finishing computation, enabling all macros to perform at least one of the tasks between computation and weight updates. For the OS-Broadcast dataflow, activation movement follows the WS-Broadcast p… view at source ↗

**Figure 9.** Figure 9: Comparison of cycle-oriented and performance [PITH_FULL_IMAGE:figures/full_fig_p005_9.png] view at source ↗

**Figure 8.** Figure 8: Pareto frontiers of different dataflows in [PITH_FULL_IMAGE:figures/full_fig_p005_8.png] view at source ↗

**Figure 11.** Figure 11: The energy and area efficiency of candidate designs [PITH_FULL_IMAGE:figures/full_fig_p006_11.png] view at source ↗

**Figure 12.** Figure 12: Impact of CIM macro supporting Compute-I/O [PITH_FULL_IMAGE:figures/full_fig_p006_12.png] view at source ↗

read the original abstract

SRAM-based compute-in-memory (CIM) offers high computational density and energy efficiency for deep neural network (DNN) accelerators, but its limited capacity causes on/off-chip data movement overhead for large DNN models. Existing CIM accelerator studies typically assume that DNN models fit entirely on-chip, leaving efficient dataflow design largely untapped. This paper introduces AccelCIM, a systematic dataflow exploration framework for SRAM CIM accelerator, which addresses two key limitations of prior work. (1) It formulates a systematic dataflow design space spanning CIM macro configurations and macro-array organizations. (2) It introduces rigorous design evaluation using cycle-accurate architectural simulation and post-layout PPA analysis. We conduct an extensive design space exploration and apply AccelCIM to representative LLM applications, providing practical insights for the principled design of CIM accelerators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AccelCIM gives a structured design space and evaluation method for SRAM CIM dataflows under large-model capacity limits, but the simulator's off-chip traffic modeling is the make-or-break part.

read the letter

The paper's main advance is laying out a design space that covers both individual CIM macro settings and how those macros are tiled into arrays, then evaluating the options with cycle-accurate simulation plus post-layout PPA numbers. This directly targets the gap where earlier CIM work assumed the whole model fits on chip, which stops being true for LLMs. Applying the framework to representative LLM workloads and pulling out design guidelines is the practical step that prior studies largely skipped.

Referee Report

2 major / 2 minor

Summary. The paper introduces AccelCIM, a systematic dataflow exploration framework for SRAM-based compute-in-memory (CIM) accelerators. It formulates a design space spanning CIM macro configurations and macro-array organizations to address data-movement overheads in capacity-limited settings, employs cycle-accurate architectural simulation together with post-layout PPA analysis for evaluation, conducts extensive design-space exploration, and applies the framework to representative LLM workloads to derive practical design insights.

Significance. If the simulator and PPA models prove accurate, the work supplies a needed methodological advance for CIM accelerator design by moving beyond the on-chip-fit assumption common in prior studies and by delivering workload-specific insights for LLMs. The explicit use of cycle-accurate simulation and post-layout analysis is a strength that supports reproducibility and realism in the reported energy/latency rankings.

major comments (2)

[§4.3] §4.3 (Cycle-Accurate Simulator): the description of off-chip memory traffic (weights, activations, KV cache) for LLM layers that exceed macro-array capacity lacks any validation against silicon measurements or cross-checks with established DRAM models; because the central claim is that the framework yields actionable dataflow rankings and PPA numbers for real LLM inference, this omission is load-bearing.
[Table 5] Table 5 (LLM Results): the reported energy and latency improvements for different macro-array organizations are presented without sensitivity analysis to simulator parameters such as interconnect bandwidth or stall cycles; this weakens the robustness of the “practical insights” conclusion.

minor comments (2)

[Abstract] Abstract: the phrase “representative LLM applications” is not instantiated with concrete model names or layer sizes, making it difficult for readers to gauge the scope of the claimed insights.
[Figure 4] Figure 4: axis labels and units on the post-layout PPA plots are inconsistent between sub-figures, complicating direct comparison of the explored configurations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of AccelCIM's methodological contributions. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses

Referee: [§4.3] §4.3 (Cycle-Accurate Simulator): the description of off-chip memory traffic (weights, activations, KV cache) for LLM layers that exceed macro-array capacity lacks any validation against silicon measurements or cross-checks with established DRAM models; because the central claim is that the framework yields actionable dataflow rankings and PPA numbers for real LLM inference, this omission is load-bearing.

Authors: We agree that explicit validation strengthens the off-chip modeling claims. Our cycle-accurate simulator models off-chip traffic using standard DRAM timing/energy parameters drawn from established literature and interfaces (e.g., HBM2/3 characteristics). While the current manuscript does not contain dedicated cross-checks or silicon comparisons for LLM-specific traffic, we will revise §4.3 to add a new validation subsection. This will include (1) direct comparison of modeled DRAM access costs against published values from DRAMSim2 and similar tools on synthetic access patterns, and (2) energy/latency cross-checks against reported off-chip costs for transformer layers in prior accelerator studies. These additions will directly support the reliability of the LLM dataflow rankings without altering the core simulation methodology. revision: yes
Referee: [Table 5] Table 5 (LLM Results): the reported energy and latency improvements for different macro-array organizations are presented without sensitivity analysis to simulator parameters such as interconnect bandwidth or stall cycles; this weakens the robustness of the “practical insights” conclusion.

Authors: We acknowledge that sensitivity analysis would better demonstrate the robustness of the reported rankings. The current results in Table 5 use fixed, realistic interconnect and stall parameters derived from the post-layout PPA models. In the revised manuscript we will add a dedicated sensitivity subsection (or supplementary figure) that varies interconnect bandwidth (±30%) and stall-cycle assumptions across plausible ranges. The analysis will confirm that the relative ordering of macro-array organizations remains stable, thereby reinforcing the practical design insights for LLM workloads while preserving the original quantitative results. revision: yes

Circularity Check

0 steps flagged

No circularity: framework and evaluation are independent of inputs

full rationale

The paper introduces AccelCIM as a new systematic design-space formulation plus cycle-accurate simulation and post-layout PPA evaluation applied to LLMs. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided abstract or description. The central claims rest on external simulation fidelity and design-space enumeration rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The work rests on the domain assumption about prior studies and introduces the new framework as its core addition, with evaluation relying on standard simulation techniques.

axioms (1)

domain assumption Existing studies on CIM accelerators assume that DNN models fit entirely on-chip.
This is stated as one of the key limitations of prior work that the paper addresses.

invented entities (1)

AccelCIM no independent evidence
purpose: A systematic dataflow exploration framework for SRAM-based CIM accelerators
Introduced as the main contribution of the paper.

pith-pipeline@v0.9.0 · 5479 in / 1211 out tokens · 88387 ms · 2026-05-10T04:25:20.372759+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Tanner Andrulis, Joel S Emer, and Vivienne Sze. 2024. CiMLoop: A flexible, accurate, and fast compute-in-memory modeling tool. In2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 10–23

2024
[2]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901

2020
[3]

Jia Chen, Fengbin Tu, Kunming Shao, Fengshi Tian, Xiao Huo, Chi-Ying Tsui, and Kwang-Ting Cheng. 2023. Autodcim: An automated digital cim compiler. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6

2023
[4]

Xiaofeng Chen, Ruiqi Guo, Zhiheng Yue, Yang Hu, Leibo Liu, Shaojun Wei, and Shouyi Yin. 2023. A systolic computing-in-memory array based accelerator with predictive early activation for spatiotemporal convolutions. In2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 1–5

2023
[5]

Haikang Diao, Haoyi Zhang, Jiahao Song, Haoyang Luo, Yibo Lin, Runsheng Wang, Yuan Wang, and Xiyuan Tang. 2025. SEGA-DCIM: Design Space Exploration-Guided Automatic Digital CIM Compiler with Multiple Precision Support. In2025 Design, Automation & Test in Europe Conference (DATE). IEEE, 1–7

2025
[6]

2024.Corsair

d-Matrix. 2024.Corsair. https://www.d-matrix.ai/product/

2024
[7]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.arXiv e-prints(2024), arXiv–2407

2024
[8]

Hidehiro Fujiwara, Haruki Mori, Wei-Chang Zhao, Mei-Chen Chuang, Rawan Naous, Chao-Kai Chuang, Takeshi Hashizume, Dar Sun, Chia-Fu Lee, Kerem Akarvardar, et al . 2022. A 5-nm 254-TOPS/W 221-TOPS/mm 2 fully-digital computing-in-memory macro supporting wide-range dynamic-voltage-frequency scaling and simultaneous MAC and write operations. In2022 IEEE Intern...

2022
[9]

Hidehiro Fujiwara, Haruki Mori, Wei-Chang Zhao, Kinshuk Khare, Cheng-En Lee, Xiaochen Peng, Vineet Joshi, Chao-Kai Chuang, Shu-Huan Hsu, Takeshi Hashizume, et al. 2024. 34.4 A 3nm, 32.5 TOPS/W, 55.0 TOPS/mm 2 and 3.78 Mb/mm 2 fully-digital compute-in-memory macro supporting INT12 × INT12 with a parallel-MAC architecture and foundry 6T-SRAM bit cell. In202...

2024
[10]

An Guo, Chen Xi, Fangyuan Dong, Xingyu Pu, Dongqi Li, Jingmin Zhang, Xueshan Dong, Hui Gao, Yiran Zhang, Bo Wang, et al . 2024. A 28-nm 64-kb 31.6-TFLOPS/W digital-domain floating-point-computing-unit and double-bit 6T-SRAM computing-in-memory macro for floating-point CNNs.IEEE Journal of Solid-State Circuits59, 9 (2024), 3032–3044

2024
[11]

2024.HaloDrive TM30

Houmo. 2024.HaloDrive TM30. https://www.houmoai.com/en/55/ProductType. html

2024
[12]

2025.CIAL-CIMCompiler

Richard Hui. 2025.CIAL-CIMCompiler. https://github.com/Richard-Hui/CIAL- CIMCompiler

2025
[13]

Junmo Lee, Anni Lu, Wantong Li, and Shimeng Yu. 2024. Neurosim v1. 4: Extend- ing technology support for digital compute-in-memory toward 1nm node.IEEE Transactions on Circuits and Systems I: Regular Papers71, 4 (2024), 1733–1744

2024
[14]

Yang Li, Yu Shen, Wentao Zhang, Yuanwei Chen, Huaijun Jiang, Mingchao Liu, Jiawei Jiang, Jinyang Gao, Wentao Wu, Zhi Yang, et al . 2021. Openbox: A generalized black-box optimization service. InProceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 3209–3219

2021
[15]

Yingjie Qi, Jianlei Yang, Yiou Wang, Yikun Wang, Dayu Wang, Ling Tang, Cenlin Duan, Xiaolin He, and Weisheng Zhao. 2025. CIMFlow: An Integrated Framework for Systematic Design and Evaluation of Digital CIM Architectures.arXiv preprint arXiv:2505.01107(2025)

work page arXiv 2025
[16]

Kunming Shao, Fengshi Tian, Xiaomeng Wang, Jiakun Zheng, Jia Chen, Jingyu He, Hui Wu, Jinbo Chen, Xihao Guan, Yi Deng, et al. 2025. Syndcim: A performance- aware digital computing-in-memory compiler with multi-spec-oriented subcir- cuit synthesis. In2025 Design, Automation & Test in Europe Conference (DATE). IEEE, 1–7

2025
[17]

Ming-En Shih, Shih-Wei Hsieh, Ping-Yuan Tsai, Ming-Hung Lin, Pei-Kuei Tsung, En-Jui Chang, Jenwei Liang, Shu-Hsin Chang, Chung-Lun Huang, You-Yu Nian, Zhe Wan, Sushil Kumar, Cheng-Xin Xue, Gajanan Jedhe, Hidehiro Fujiwara, Haruki Mori, Chih-Wei Chen, Po-Hua Huang, Chih-Feng Juan, Chung-Yi Chen, Tsung-Yao Lin, Ch Wang, Chih-Cheng Chen, and Kevin Jou. 2024....

work page doi:10.1109/isscc49657.2024 2024
[18]

Jiacong Sun, Pouya Houshmand, and Marian Verhelst. 2023. Analog or digital in-memory computing? benchmarking through quantitative modeling. In2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 1–9

2023
[19]

Chuyu Wang, Ke Hu, Fan Yang, Keren Zhu, and Xuan Zeng. 2025. DAMIL-DCIM: A Digital CIM Layout Synthesis Framework with Dataflow-Aware Floorplan and MILP-Based Detailed Placement. In2025 Design, Automation & Test in Europe Conference (DATE). IEEE, 1–7

2025
[20]

Yongkun Wu, Xiaomeng Wang, Jia Chen, Zhenhua Zhu, Jingyu He, Pingcheng Dong, Yonghao Tan, Xin Zhao, Liang Chang, Yu Wang, et al. 2025. Exploiting the Memory-Compute-Coupling Feature for CIM Accelerator Design Optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2025)

2025
[21]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Yi Zhan, Wei-Han Yu, Ka-Fai Un, Rui P Martins, and Pui-In Mak. 2024. GSLP- CIM: A 28-nm globally systolic and locally parallel CNN/transformer accelerator with scalable and reconfigurable eDRAM compute-in-memory macro for flexible dataflow.IEEE Transactions on Circuits and Systems I: Regular Papers(2024)

2024
[23]

Hongyi Zhang, Haozhe Zhu, Siqi He, Mengjie Li, Chengchen Wang, Xiankui Xiong, Haidong Tian, Xiaoyang Zeng, and Chixiao Chen. 2024. Arctic: Agile and robust compute-in-memory compiler with parameterized int/fp precision and built-in self test. In2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1–6

2024
[24]

Zhantong Zhu, Hongou Li, Wenjie Ren, Meng Wu, Le Ye, Ru Huang, and Tianyu Jia. 2025. Leveraging compute-in-memory for efficient generative model inference in tpus. In2025 Design, Automation & Test in Europe Conference (DATE). IEEE, 1–7

2025
[25]

Zhenhua Zhu, Hanbo Sun, Tongxin Xie, Yu Zhu, Guohao Dai, Lixue Xia, Dimin Niu, Xiaoming Chen, Xiaobo Sharon Hu, Yu Cao, et al . 2023. Mnsim 2.0: A behavior-level modeling tool for processing-in-memory architectures.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems42, 11 (2023), 4112–4125

2023