NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference
Pith reviewed 2026-05-25 02:54 UTC · model grok-4.3
The pith
3D NAND architecture fuses expert selection and computation into one cycle for MoE LLM inference
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By leveraging the intrinsic string structure of 3D NAND, NASiC fuses dynamic expert selection, performed by a masking mechanism based on content-addressable memory (CAM), with activated-expert computation, performed by computing-in-memory (CIM), into a single computation cycle. This removes redundant computation and raises computational parallelism. Circuit-level optimizations and a co-designed multibit CIM cell further enable block-wise parallel computation with in-situ signed multibit input and weight expansion.
What carries the argument
A CAM-based masking mechanism, integrated into the 3D NAND string structure, that performs expert selection and CIM computation within one cycle.
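The selection-plus-computation idea can be illustrated with a purely functional sketch. This is not the paper's circuit: the analog CAM match and string currents are replaced by a 0/1 mask and an ordinary matrix-vector product, and every name here (`top_k_mask`, `matvec`, `masked_moe_step`) and every numeric value is hypothetical. The sketch only shows the intended effect, namely that masked expert blocks contribute no work.

```python
def top_k_mask(router_scores, k):
    """Return a 0/1 mask that keeps the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(router_scores))]


def matvec(weights, x):
    """Plain matrix-vector product standing in for one expert's CIM block."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def masked_moe_step(expert_weights, x, mask):
    """Apply the input to every expert block at once; masked blocks do no work.

    Returns (per-expert outputs, number of multiply-accumulates performed).
    """
    outputs, macs = [], 0
    for weights, selected in zip(expert_weights, mask):
        if not selected:
            outputs.append(None)  # masked block: modeled as zero activity
            continue
        outputs.append(matvec(weights, x))
        macs += len(weights) * len(x)
    return outputs, macs


# Toy setup (hypothetical values): 4 experts, each a 2x3 matrix filled with e+1.
experts = [[[e + 1] * 3 for _ in range(2)] for e in range(4)]
mask = top_k_mask([0.1, 0.7, 0.05, 0.15], k=2)
outputs, macs = masked_moe_step(experts, [1, 2, 3], mask)
print(mask, outputs, macs)
```

In this toy case two of four experts are selected, so the step performs 12 multiply-accumulates where a dense pass over all experts would perform 24. The loop is sequential only because this is software; in the architecture described above the blocks would operate in parallel.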
If this is right
- Inactive experts produce no array activity, removing the redundant work that normally lowers effective parallelism in sparse MoE execution.
- All expert weights remain resident in the high-density 3D NAND array, eliminating the need to swap parameters on and off chip.
- Multibit input and weight expansion occurs in place, raising utilization of each Flash cell without extra area.
- Block-wise parallel operation scales throughput linearly with the number of active experts inside the same cycle budget.
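The "signed multibit input and weight expansion" mentioned above can be made concrete with a generic bit-slicing sketch. The page does not say which encoding NASiC uses, so the two's-complement bit-plane decomposition below is an assumption chosen because it is a common way to build multibit signed dot products from binary ones; the function names are hypothetical.

```python
def to_bit_planes(values, bits):
    """Split signed integers into two's-complement bit planes, LSB first."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    for v in values:
        if not lo <= v <= hi:
            raise ValueError(f"{v} does not fit in {bits} signed bits")
    # Python's >> on negative ints is arithmetic, so (v >> b) & 1 yields
    # the two's-complement bit at position b.
    return [[(v >> b) & 1 for v in values] for b in range(bits)]


def plane_scale(index, bits):
    """Weight of a bit plane: the top (sign) plane carries a negative weight."""
    return -(1 << index) if index == bits - 1 else (1 << index)


def bit_sliced_dot(weights, inputs, w_bits, x_bits):
    """Signed multibit dot product assembled from binary dot products."""
    total = 0
    for i, w_plane in enumerate(to_bit_planes(weights, w_bits)):
        for j, x_plane in enumerate(to_bit_planes(inputs, x_bits)):
            # One binary dot product: the kind of operation a 1-bit array
            # read could provide; everything else is shift-and-add.
            partial = sum(wb & xb for wb, xb in zip(w_plane, x_plane))
            total += plane_scale(i, w_bits) * plane_scale(j, x_bits) * partial
    return total


weights = [-3, 2, 7, -8]
inputs = [5, -4, 1, 3]
direct = sum(w * x for w, x in zip(weights, inputs))
assert bit_sliced_dot(weights, inputs, 4, 4) == direct
```

The design point is that a multibit cell can stand in for several of these bit planes at once, which is one reading of the claim that utilization rises without extra area.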
Where Pith is reading between the lines
- The same string-level masking trick could be tested on other sparse activation patterns such as dynamic pruning or conditional computation outside MoE.
- If the single-cycle fusion holds at scale, end-to-end latency for on-device MoE chat or generation would drop because selection and execution no longer add separate pipeline stages.
- The design choice to keep every expert in place might change how model developers trade off number of experts against total parameter count when targeting edge hardware.
Load-bearing premise
The circuit-level optimizations, multibit CIM cells, and CAM masking can be built into real 3D NAND hardware without large accuracy loss, area cost, or fabrication problems that erase the claimed speed and energy gains.
What would settle it
Fabricate a test chip of the NASiC array in 3D NAND and measure whether the single-cycle selection-plus-computation step delivers the reported performance and energy numbers at the stated accuracy level when compared with separate selection followed by computation.
Original abstract
The Mixture-of-Experts (MoE) models have emerged as the state-of-the-art paradigm for scaling up large language models (LLMs) without proportionally increased computational cost. However, its on-device deployment faces a critical challenge due to the large memory requirement for storing all expert parameters. 3D NAND-based computing-in-memory (CIM) architectures uniquely offer high storage capacity and reduced data movement, while they are ill-suited for MoE models with dynamically sparse expert activation, leading to a degradation of effective computational parallelism, along with underutilization of multibit storage capability of Flash cells. In this work, we proposed a 3D NAND-based content addressable-selected CIM architecture, dubbed as NASiC, which is tailored to MoE models. By leveraging the intrinsic string structure of 3D NAND technology, NASiC fuses the dynamical expert selection through CAM-based masking mechanism and activated expert computation through CIM into a single computation cycle, eradicating redundant computation and enhancing computational parallelism. Moreover, circuit-level optimizations and multibit CIM cell are co-designed with proposed NASiC architecture, featuring block-wise parallel computation with in-situ signed multibit input and weight expansion, substantially improving the throughput and energy-efficiency of NAND CIM array, as well as the utilization of high-density 3D NAND technology for MoE models. With extensive experimental results, we demonstrate NASiC achieves 4-114.8x improved performance and 3.9-70x improved energy efficiency over state-of-the-art designs, along with high accuracy, showing its great potential for efficient on-device MoE LLM inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes NASiC, a 3D NAND-based CAM-selected multibit CIM architecture for on-device MoE LLM inference. It exploits the intrinsic string structure of 3D NAND to fuse CAM-based dynamic expert selection (masking) and CIM-based expert computation into a single cycle, eliminating redundant operations and improving parallelism. The circuit-level co-design includes block-wise parallel computation with signed multibit input and weight expansion. The authors report 4-114.8× performance and 3.9-70× energy-efficiency gains over state-of-the-art designs while preserving high accuracy.
Significance. If the central claims hold under realistic device conditions, the work could meaningfully advance efficient on-device deployment of large MoE models by addressing the mismatch between dynamic sparsity and standard NAND CIM utilization. The architecture-circuit-cell co-design for sparse activation and multibit Flash cells is a clear strength; the single-cycle fusion idea directly targets a known limitation in prior CIM designs for MoE workloads.
major comments (2)
- [Abstract and §5] Abstract and §5 (Experimental Results): The headline claims of 4-114.8× performance and 3.9-70× energy efficiency are load-bearing for the contribution, yet neither the abstract nor the results section specifies the exact SOTA baselines, the simulation framework (SPICE vs. architectural), accuracy metric definitions, or any Monte-Carlo/process-variation analysis. Without these, the reported gains cannot be assessed for robustness against the cell non-idealities that the architecture assumes away.
- [§3 and §4] §3 (Architecture) and §4 (Circuit Co-design): The single-cycle CAM-CIM fusion via 3D NAND strings is presented as eradicating redundant computation, but the description provides no timing diagram, critical-path analysis, or quantification of area/delay overhead from the added CAM masking transistors on the NAND string. Any non-zero overhead would directly undermine the parallelism and utilization claims that justify the large speed-up numbers.
minor comments (2)
- [Abstract] Abstract: The acronym NASiC is introduced without expansion.
- [§5] Figure captions and legends in §5 are occasionally missing units or baseline identifiers, reducing readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the clarity and robustness of our claims. We will revise the manuscript to provide the requested details on baselines, simulation methodology, metrics, variation analysis, and circuit timing.
Point-by-point responses
- Referee: [Abstract and §5] Abstract and §5 (Experimental Results): The headline claims of 4-114.8× performance and 3.9-70× energy efficiency are load-bearing for the contribution, yet neither the abstract nor the results section specifies the exact SOTA baselines, the simulation framework (SPICE vs. architectural), accuracy metric definitions, or any Monte-Carlo/process-variation analysis. Without these, the reported gains cannot be assessed for robustness against the cell non-idealities that the architecture assumes away.
  Authors: We agree that explicit specification is required for independent assessment. In the revised manuscript we will (i) name the precise SOTA baselines in both the abstract and §5, (ii) state that all results derive from an architectural simulator whose timing and energy models are calibrated against SPICE simulations of the 3D NAND cells, (iii) define the accuracy metrics (perplexity on WikiText-103 and downstream task accuracy), and (iv) add Monte-Carlo results that incorporate measured cell-to-cell variation and process corners to quantify robustness of the reported speed-up and energy figures.
  Revision: yes
- Referee: [§3 and §4] §3 (Architecture) and §4 (Circuit Co-design): The single-cycle CAM-CIM fusion via 3D NAND strings is presented as eradicating redundant computation, but the description provides no timing diagram, critical-path analysis, or quantification of area/delay overhead from the added CAM masking transistors on the NAND string. Any non-zero overhead would directly undermine the parallelism and utilization claims that justify the large speed-up numbers.
  Authors: The single-cycle property follows directly from the vertical string topology of 3D NAND: the CAM masking transistors sit on the same string as the CIM cells and are activated during the pre-charge phase that already occurs in every CIM cycle, so they do not lengthen the critical path. Nevertheless, we acknowledge the absence of supporting diagrams and quantification. The revision will add (i) a timing diagram showing the fused CAM-CIM sequence within one cycle, (ii) a critical-path breakdown confirming that the masking transistors lie off the read-current path, and (iii) area and delay numbers extracted from the same SPICE-calibrated models used for the performance results, demonstrating that the incremental overhead is negligible relative to the baseline NAND string.
  Revision: yes
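The Monte-Carlo variation analysis that the first response promises can be pictured with a small sketch. It is an illustration of the method only: the multiplicative Gaussian noise model, the `sigma` values, and the function names are all assumptions, not the measured cell-to-cell variation or the simulator the authors refer to.

```python
import random
import statistics


def noisy_dot(weights, x, sigma, rng):
    """Dot product where each stored weight is perturbed by Gaussian noise."""
    return sum(w * (1.0 + rng.gauss(0.0, sigma)) * xi
               for w, xi in zip(weights, x))


def monte_carlo_error(weights, x, sigma, trials=1000, seed=0):
    """Return (mean, max) absolute error of the noisy dot product.

    sigma is the assumed relative standard deviation of each cell's
    effective weight; the seed makes the run reproducible.
    """
    rng = random.Random(seed)
    ideal = sum(w * xi for w, xi in zip(weights, x))
    errors = [abs(noisy_dot(weights, x, sigma, rng) - ideal)
              for _ in range(trials)]
    return statistics.mean(errors), max(errors)


weights = [-3, 2, 7, -8]
inputs = [5, -4, 1, 3]
for sigma in (0.0, 0.02, 0.05):
    mean_err, max_err = monte_carlo_error(weights, inputs, sigma)
    print(f"sigma={sigma}: mean abs error={mean_err:.4f}, max={max_err:.4f}")
```

A real study would replace the noise model with measured current distributions and propagate the error through the full model to the accuracy metric, rather than stopping at a single dot product.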
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents an architecture proposal for NASiC that fuses CAM-based expert selection and CIM computation via 3D NAND string structure, supported by circuit co-design and validated through experimental results showing performance and energy gains. No equations, self-definitional mappings, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described claims. The performance assertions rest on external experimental benchmarks rather than reducing to the input assumptions by construction, rendering the derivation self-contained.