Recognition: unknown
AccelCIM: Systematic Dataflow Exploration for SRAM Compute-in-Memory Accelerator
Pith reviewed 2026-05-10 04:25 UTC · model grok-4.3
The pith
AccelCIM builds a complete design space for dataflow choices in SRAM compute-in-memory accelerators and evaluates them on large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This paper presents AccelCIM as a systematic dataflow exploration framework for SRAM CIM accelerators. The framework defines a design space that includes configurations of individual CIM macros and how those macros are organized into arrays. Designs are assessed through cycle-accurate architectural simulation combined with post-layout power, performance, and area analysis. The method is demonstrated on representative large language model applications to derive insights for accelerator design.
What carries the argument
A systematic dataflow design space covering both CIM macro configurations and macro-array organizations, paired with cycle-accurate simulation and post-layout PPA evaluation.
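The paper does not publish its design-space schema, but the idea of enumerating the cross product of macro configurations and macro-array organizations can be sketched as follows. All names, shapes, and dataflow labels here are hypothetical placeholders, not AccelCIM's actual parameters.

```python
from itertools import product

# Hypothetical illustration (not AccelCIM's real schema): a dataflow design
# point pairs one CIM macro configuration with one macro-array organization.
macro_configs = [
    {"rows": 64, "cols": 64},     # assumed SRAM CIM macro shapes
    {"rows": 128, "cols": 64},
]
array_orgs = [
    {"tiles_x": 4, "tiles_y": 4, "dataflow": "weight-stationary"},
    {"tiles_x": 8, "tiles_y": 2, "dataflow": "output-stationary"},
]

def design_space():
    """Enumerate the cross product of macro configs and array organizations."""
    for macro, org in product(macro_configs, array_orgs):
        yield {**macro, **org}

points = list(design_space())
print(len(points))  # 2 macro configs x 2 organizations = 4 design points
```

Each enumerated point would then be handed to the cycle-accurate simulator and post-layout PPA models for scoring; the real space is presumably far larger along both axes.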
Load-bearing premise
The design space and evaluation techniques capture the dominant factors that determine real hardware efficiency for large models.
What would settle it
Fabricating a prototype SRAM CIM accelerator following one of the framework's recommended dataflows and measuring its actual energy consumption and latency against the simulated values.
Original abstract
SRAM-based compute-in-memory (CIM) offers high computational density and energy efficiency for deep neural network (DNN) accelerators, but its limited capacity causes on/off-chip data movement overhead for large DNN models. Existing CIM accelerator studies typically assume that DNN models fit entirely on-chip, leaving efficient dataflow design largely untapped. This paper introduces AccelCIM, a systematic dataflow exploration framework for SRAM CIM accelerator, which addresses two key limitations of prior work. (1) It formulates a systematic dataflow design space spanning CIM macro configurations and macro-array organizations. (2) It introduces rigorous design evaluation using cycle-accurate architectural simulation and post-layout PPA analysis. We conduct an extensive design space exploration and apply AccelCIM to representative LLM applications, providing practical insights for the principled design of CIM accelerators.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AccelCIM, a systematic dataflow exploration framework for SRAM-based compute-in-memory (CIM) accelerators. It formulates a design space spanning CIM macro configurations and macro-array organizations to address data-movement overheads in capacity-limited settings, employs cycle-accurate architectural simulation together with post-layout PPA analysis for evaluation, conducts extensive design-space exploration, and applies the framework to representative LLM workloads to derive practical design insights.
Significance. If the simulator and PPA models prove accurate, the work supplies a needed methodological advance for CIM accelerator design by moving beyond the on-chip-fit assumption common in prior studies and by delivering workload-specific insights for LLMs. The explicit use of cycle-accurate simulation and post-layout analysis is a strength that supports reproducibility and realism in the reported energy/latency rankings.
major comments (2)
- [§4.3] §4.3 (Cycle-Accurate Simulator): the description of off-chip memory traffic (weights, activations, KV cache) for LLM layers that exceed macro-array capacity lacks any validation against silicon measurements or cross-checks with established DRAM models; because the central claim is that the framework yields actionable dataflow rankings and PPA numbers for real LLM inference, this omission is load-bearing.
- [Table 5] Table 5 (LLM Results): the reported energy and latency improvements for different macro-array organizations are presented without sensitivity analysis to simulator parameters such as interconnect bandwidth or stall cycles; this weakens the robustness of the “practical insights” conclusion.
minor comments (2)
- [Abstract] Abstract: the phrase “representative LLM applications” is not instantiated with concrete model names or layer sizes, making it difficult for readers to gauge the scope of the claimed insights.
- [Figure 4] Figure 4: axis labels and units on the post-layout PPA plots are inconsistent between sub-figures, complicating direct comparison of the explored configurations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of AccelCIM's methodological contributions. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
-
Referee: [§4.3] §4.3 (Cycle-Accurate Simulator): the description of off-chip memory traffic (weights, activations, KV cache) for LLM layers that exceed macro-array capacity lacks any validation against silicon measurements or cross-checks with established DRAM models; because the central claim is that the framework yields actionable dataflow rankings and PPA numbers for real LLM inference, this omission is load-bearing.
Authors: We agree that explicit validation strengthens the off-chip modeling claims. Our cycle-accurate simulator models off-chip traffic using standard DRAM timing/energy parameters drawn from established literature and interfaces (e.g., HBM2/3 characteristics). While the current manuscript does not contain dedicated cross-checks or silicon comparisons for LLM-specific traffic, we will revise §4.3 to add a new validation subsection. This will include (1) direct comparison of modeled DRAM access costs against published values from DRAMSim2 and similar tools on synthetic access patterns, and (2) energy/latency cross-checks against reported off-chip costs for transformer layers in prior accelerator studies. These additions will directly support the reliability of the LLM dataflow rankings without altering the core simulation methodology. revision: yes
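The off-chip traffic concern turns on how weight reuse degrades once a layer exceeds macro-array capacity. A minimal first-order sketch of that effect, under an assumed pessimistic restreaming model (not the paper's simulator), looks like this:

```python
def offchip_weight_traffic_bytes(weight_bytes: int,
                                 onchip_capacity_bytes: int,
                                 activation_tiles: int) -> int:
    """First-order estimate of off-chip weight traffic for one layer.

    Assumption (illustrative, not AccelCIM's model): if the layer's weights
    fit on-chip they are loaded once and reused; otherwise every activation
    tile forces the full weight matrix to be restreamed from DRAM.
    """
    if weight_bytes <= onchip_capacity_bytes:
        return weight_bytes                      # single load, then reused
    return weight_bytes * activation_tiles       # restreamed per tile

# Example with assumed numbers: 32 MiB of weights, 8 MiB of on-chip
# capacity, 4 activation tiles -> the weights stream in four times.
traffic = offchip_weight_traffic_bytes(32 * 2**20, 8 * 2**20, 4)
# traffic == 128 * 2**20 bytes (128 MiB)
```

A validated simulator would replace this step function with actual tiling schedules and DRAM timing, which is precisely what the proposed cross-checks against DRAMSim-class tools would calibrate.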
-
Referee: [Table 5] Table 5 (LLM Results): the reported energy and latency improvements for different macro-array organizations are presented without sensitivity analysis to simulator parameters such as interconnect bandwidth or stall cycles; this weakens the robustness of the “practical insights” conclusion.
Authors: We acknowledge that sensitivity analysis would better demonstrate the robustness of the reported rankings. The current results in Table 5 use fixed, realistic interconnect and stall parameters derived from the post-layout PPA models. In the revised manuscript we will add a dedicated sensitivity subsection (or supplementary figure) that varies interconnect bandwidth (±30%) and stall-cycle assumptions across plausible ranges. The analysis will confirm that the relative ordering of macro-array organizations remains stable, thereby reinforcing the practical design insights for LLM workloads while preserving the original quantitative results. revision: yes
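The proposed sensitivity check is easy to state concretely: sweep interconnect bandwidth over ±30% and verify that the relative ordering of organizations is unchanged. The sketch below uses a placeholder roofline-style latency proxy with made-up numbers, not the paper's cycle-accurate model.

```python
def latency_cycles(org: dict, bandwidth_scale: float) -> float:
    """Illustrative latency proxy: compute-bound term plus a
    bandwidth-bound term scaled by the swept bandwidth factor."""
    return org["compute_cycles"] + org["traffic_bytes"] / (
        org["bw_bytes_per_cycle"] * bandwidth_scale)

# Hypothetical macro-array organizations (all parameters assumed).
orgs = {
    "org_A": {"compute_cycles": 1_000, "traffic_bytes": 4_000,
              "bw_bytes_per_cycle": 8.0},
    "org_B": {"compute_cycles": 1_400, "traffic_bytes": 2_000,
              "bw_bytes_per_cycle": 8.0},
}

# Rank organizations at -30%, nominal, and +30% interconnect bandwidth.
rankings = []
for scale in (0.7, 1.0, 1.3):
    ranked = sorted(orgs, key=lambda k: latency_cycles(orgs[k], scale))
    rankings.append(ranked)

# The robustness claim holds if the ordering is identical across the sweep.
stable = all(r == rankings[0] for r in rankings)
```

With these particular numbers the ordering happens to stay stable; the interesting cases for the revision are the parameter ranges where it flips.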
Circularity Check
No circularity: framework and evaluation are independent of inputs
full rationale
The paper introduces AccelCIM as a new systematic design-space formulation plus cycle-accurate simulation and post-layout PPA evaluation applied to LLMs. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided abstract or description. The central claims rest on external simulation fidelity and design-space enumeration rather than reducing to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: existing CIM accelerator studies typically assume that DNN models fit entirely on-chip.
invented entities (1)
- AccelCIM (no independent evidence)
Reference graph
Works this paper leans on
- [1] Tanner Andrulis, Joel S. Emer, and Vivienne Sze. 2024. CiMLoop: A flexible, accurate, and fast compute-in-memory modeling tool. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 10–23.
- [2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
- [3] Jia Chen, Fengbin Tu, Kunming Shao, Fengshi Tian, Xiao Huo, Chi-Ying Tsui, and Kwang-Ting Cheng. 2023. AutoDCIM: An automated digital CIM compiler. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
- [4] Xiaofeng Chen, Ruiqi Guo, Zhiheng Yue, Yang Hu, Leibo Liu, Shaojun Wei, and Shouyi Yin. 2023. A systolic computing-in-memory array based accelerator with predictive early activation for spatiotemporal convolutions. In 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 1–5.
- [5] Haikang Diao, Haoyi Zhang, Jiahao Song, Haoyang Luo, Yibo Lin, Runsheng Wang, Yuan Wang, and Xiyuan Tang. 2025. SEGA-DCIM: Design Space Exploration-Guided Automatic Digital CIM Compiler with Multiple Precision Support. In 2025 Design, Automation & Test in Europe Conference (DATE). IEEE, 1–7.
- [6] d-Matrix. 2024. Corsair. https://www.d-matrix.ai/product/
- [7] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv e-prints (2024), arXiv–2407.
- [8] Hidehiro Fujiwara, Haruki Mori, Wei-Chang Zhao, Mei-Chen Chuang, Rawan Naous, Chao-Kai Chuang, Takeshi Hashizume, Dar Sun, Chia-Fu Lee, Kerem Akarvardar, et al. 2022. A 5-nm 254-TOPS/W 221-TOPS/mm2 fully-digital computing-in-memory macro supporting wide-range dynamic-voltage-frequency scaling and simultaneous MAC and write operations. In 2022 IEEE Intern...
- [9] Hidehiro Fujiwara, Haruki Mori, Wei-Chang Zhao, Kinshuk Khare, Cheng-En Lee, Xiaochen Peng, Vineet Joshi, Chao-Kai Chuang, Shu-Huan Hsu, Takeshi Hashizume, et al. 2024. 34.4 A 3nm, 32.5 TOPS/W, 55.0 TOPS/mm2 and 3.78 Mb/mm2 fully-digital compute-in-memory macro supporting INT12 × INT12 with a parallel-MAC architecture and foundry 6T-SRAM bit cell. In 202...
- [10] An Guo, Chen Xi, Fangyuan Dong, Xingyu Pu, Dongqi Li, Jingmin Zhang, Xueshan Dong, Hui Gao, Yiran Zhang, Bo Wang, et al. 2024. A 28-nm 64-kb 31.6-TFLOPS/W digital-domain floating-point-computing-unit and double-bit 6T-SRAM computing-in-memory macro for floating-point CNNs. IEEE Journal of Solid-State Circuits 59, 9 (2024), 3032–3044.
- [11] Houmo. 2024. HaloDrive TM30. https://www.houmoai.com/en/55/ProductType.html
- [12] Richard Hui. 2025. CIAL-CIMCompiler. https://github.com/Richard-Hui/CIAL-CIMCompiler
- [13] Junmo Lee, Anni Lu, Wantong Li, and Shimeng Yu. 2024. NeuroSim v1.4: Extending technology support for digital compute-in-memory toward 1nm node. IEEE Transactions on Circuits and Systems I: Regular Papers 71, 4 (2024), 1733–1744.
- [14] Yang Li, Yu Shen, Wentao Zhang, Yuanwei Chen, Huaijun Jiang, Mingchao Liu, Jiawei Jiang, Jinyang Gao, Wentao Wu, Zhi Yang, et al. 2021. OpenBox: A generalized black-box optimization service. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 3209–3219.
- [15]
- [16] Kunming Shao, Fengshi Tian, Xiaomeng Wang, Jiakun Zheng, Jia Chen, Jingyu He, Hui Wu, Jinbo Chen, Xihao Guan, Yi Deng, et al. 2025. SynDCIM: A performance-aware digital computing-in-memory compiler with multi-spec-oriented subcircuit synthesis. In 2025 Design, Automation & Test in Europe Conference (DATE). IEEE, 1–7.
- [17] Ming-En Shih, Shih-Wei Hsieh, Ping-Yuan Tsai, Ming-Hung Lin, Pei-Kuei Tsung, En-Jui Chang, Jenwei Liang, Shu-Hsin Chang, Chung-Lun Huang, You-Yu Nian, Zhe Wan, Sushil Kumar, Cheng-Xin Xue, Gajanan Jedhe, Hidehiro Fujiwara, Haruki Mori, Chih-Wei Chen, Po-Hua Huang, Chih-Feng Juan, Chung-Yi Chen, Tsung-Yao Lin, Ch Wang, Chih-Cheng Chen, and Kevin Jou. 2024. ...
- [18] Jiacong Sun, Pouya Houshmand, and Marian Verhelst. 2023. Analog or digital in-memory computing? Benchmarking through quantitative modeling. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 1–9.
- [19] Chuyu Wang, Ke Hu, Fan Yang, Keren Zhu, and Xuan Zeng. 2025. DAMIL-DCIM: A Digital CIM Layout Synthesis Framework with Dataflow-Aware Floorplan and MILP-Based Detailed Placement. In 2025 Design, Automation & Test in Europe Conference (DATE). IEEE, 1–7.
- [20] Yongkun Wu, Xiaomeng Wang, Jia Chen, Zhenhua Zhu, Jingyu He, Pingcheng Dong, Yonghao Tan, Xin Zhao, Liang Chang, Yu Wang, et al. 2025. Exploiting the Memory-Compute-Coupling Feature for CIM Accelerator Design Optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2025).
- [21] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... arXiv, 2025.
- [22] Yi Zhan, Wei-Han Yu, Ka-Fai Un, Rui P. Martins, and Pui-In Mak. 2024. GSLP-CIM: A 28-nm globally systolic and locally parallel CNN/transformer accelerator with scalable and reconfigurable eDRAM compute-in-memory macro for flexible dataflow. IEEE Transactions on Circuits and Systems I: Regular Papers (2024).
- [23] Hongyi Zhang, Haozhe Zhu, Siqi He, Mengjie Li, Chengchen Wang, Xiankui Xiong, Haidong Tian, Xiaoyang Zeng, and Chixiao Chen. 2024. Arctic: Agile and robust compute-in-memory compiler with parameterized INT/FP precision and built-in self test. In 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1–6.
- [24] Zhantong Zhu, Hongou Li, Wenjie Ren, Meng Wu, Le Ye, Ru Huang, and Tianyu Jia. 2025. Leveraging compute-in-memory for efficient generative model inference in TPUs. In 2025 Design, Automation & Test in Europe Conference (DATE). IEEE, 1–7.
- [25] Zhenhua Zhu, Hanbo Sun, Tongxin Xie, Yu Zhu, Guohao Dai, Lixue Xia, Dimin Niu, Xiaoming Chen, Xiaobo Sharon Hu, Yu Cao, et al. 2023. MNSIM 2.0: A behavior-level modeling tool for processing-in-memory architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42, 11 (2023), 4112–4125.