pith. sign in

arxiv: 2605.23294 · v1 · pith:GB4PPGH4new · submitted 2026-05-22 · 💻 cs.AR

NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference

Pith reviewed 2026-05-25 02:54 UTC · model grok-4.3

classification 💻 cs.AR
keywords 3D NANDComputing-in-MemoryMixture-of-ExpertsContent Addressable MemoryLLM inferenceon-device AICAMCIM
0
0 comments X

The pith

3D NAND architecture fuses expert selection and computation into one cycle for MoE LLM inference

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that a new architecture called NASiC solves the mismatch between 3D NAND computing-in-memory and the dynamic sparsity of Mixture-of-Experts models by using the built-in string structure of 3D NAND. It combines content-addressable memory masking for expert selection with multibit CIM computation so both happen inside a single cycle instead of separate steps. This removes wasted work on inactive experts and raises effective parallelism while still storing all expert weights on the high-density array. The authors report that the design, together with co-optimized multibit cells and block-wise signed arithmetic, produces 4-114.8 times higher performance and 3.9-70 times better energy efficiency than prior approaches while preserving model accuracy. If the claim holds, on-device deployment of large MoE models becomes practical because memory capacity and compute efficiency are no longer in conflict.

Core claim

By leveraging the intrinsic string structure of 3D NAND technology, NASiC fuses the dynamical expert selection through CAM-based masking mechanism and activated expert computation through CIM into a single computation cycle, eradicating redundant computation and enhancing computational parallelism, with circuit-level optimizations and multibit CIM cell co-design enabling block-wise parallel computation with in-situ signed multibit input and weight expansion.

What carries the argument

CAM-based masking mechanism integrated into the 3D NAND string structure that performs expert selection and CIM computation inside one cycle

If this is right

  • Inactive experts produce no array activity, removing the redundant work that normally lowers effective parallelism in sparse MoE execution.
  • All expert weights remain resident in the high-density 3D NAND array, eliminating the need to swap parameters on and off chip.
  • Multibit input and weight expansion occurs in place, raising utilization of each Flash cell without extra area.
  • Block-wise parallel operation scales throughput linearly with the number of active experts inside the same cycle budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same string-level masking trick could be tested on other sparse activation patterns such as dynamic pruning or conditional computation outside MoE.
  • If the single-cycle fusion holds at scale, end-to-end latency for on-device MoE chat or generation would drop because selection and execution no longer add separate pipeline stages.
  • The design choice to keep every expert in place might change how model developers trade off number of experts against total parameter count when targeting edge hardware.

Load-bearing premise

The circuit-level optimizations, multibit CIM cells, and CAM masking can be built into real 3D NAND hardware without large accuracy loss, area cost, or fabrication problems that erase the claimed speed and energy gains.

What would settle it

Fabricate a test chip of the NASiC array in 3D NAND and measure whether the single-cycle selection-plus-computation step delivers the reported performance and energy numbers at the stated accuracy level when compared with separate selection followed by computation.

Figures

Figures reproduced from arXiv: 2605.23294 by Dongxue Zhao, Ling Liang, Meng Li, Qianqian Huang, Ru Huang, Shuzhang Zhong, Tianyang Luo, Weikai Xu, Yimao Cai, Zongwei Wang.

Figure 1
Figure 1. Figure 1: (a) The scaling trend of large language models (LLMs) from dense models to Mixture-of-Experts (MoE) models. (b) Storage capacity comparison of memory technologies, identifying 3D NAND as the unique terabyte-level solution. 3D NAND uniquely offers the highest storage capacity, reaching the terabyte level by vertically stacking hundreds of layers with multi-level storage capacity per Flash cell, which is nec… view at source ↗
Figure 2
Figure 2. Figure 2: (a) Conventional dense model with the feed-forward network (FFN) layer. (b) The standard MoE model. (c) The grouped MoE model [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: (a) Illustration of 3D NAND die. (b) Structure of 3D NAND plane. (c) Top￾down view of 3D NAND layer. (d) Threshold voltage (VTH) distributions for Flash cell. 2.2 3D NAND technology for CIM and MoE As mentioned above, the massive storage requirement of MoE mod￾els necessitates a memory technology that offers unparalleled den￾sity. Among the memory technologies, 3D NAND is the only mature and commercially a… view at source ↗
Figure 4
Figure 4. Figure 4: (a) Conventional 3D NAND CIM architecture. (b) The proposed 3D NAND-based content addressable memory (CAM)-selected CIM architecture. (c) The physical structure. (d) The 2-bit CAM cell design using multi-level cell (MLC) Flash cell. (e) The 1-bit CIM cell design. (f) Final computing results of the proposed CAM-selected CIM. threshold voltage (VTH) states from single-level cell (SLC) to multi￾level cell (ML… view at source ↗
Figure 5
Figure 5. Figure 5: Conventional 3D NAND CIM architecture with (a) contiguous expert map￾ping strategy or (b) interleaved expert mapping strategy. (c) Proposed NASiC architec￾ture with interleaved expert mapping strategy. (d) Finer-grained expansion. computation cycle (Figure 4b). Consequently, NASiC architecture eliminates the redundant computation and significantly improve the effective parallelism and energy efficiency for… view at source ↗
Figure 6
Figure 6. Figure 6: (a) Conventional 3D NAND CIM circuit. (b) Proposed circuit-level optimiza￾tions. (c) The modified thermometer encoding scheme. (d) The in-situ multibit input expansion scheme. (e) The signed multiplication map and (f) computation results. multibit weights for block-wise parallel computation. We illustrate the encoding scheme using a standard 3D NAND plane where one GSL corresponds to four SSLs, where the m… view at source ↗
Figure 7
Figure 7. Figure 7: (a) Conventional 3D NAND CIM with SLC Flash. (b) Proposed 3D NAND-based multibit CIM cell. (c) Encoding scheme with 3-state Flash cell. (d) Optimized timing diagram of computation. (e) The signed multiplication map and (f) computing results. (g-h) Encoding scheme with 4-state Flash cell. (i) The weight expansion. Table I: Configuration of 3D NAND plane. 3.3.2 Technique 3 (T3): 3D NAND-based multibit CIM ce… view at source ↗
Figure 10
Figure 10. Figure 10: (a) Impact of CAM layers in T1 for area efficiency. (b) Evaluation of area efficiency and overall area-energy-delay product of NASiC architecture with T1∼T3. Table II: Comparison of proposed NASiC architecture with other designs. minimal area overhead. As shown in [PITH_FULL_IMAGE:figures/full_fig_p006_10.png] view at source ↗
Figure 9
Figure 9. Figure 9: (a) Energy consumption breakdown of Base and NASiC architecture with three techniques. (b) Energy efficiency gain of NASiC architecture over Base. (Figure 7d) increases sub-linearly, thereby resulting in a dramatic boost to effective computational throughput (Figure 8d). 4.2.2 Energy efficiency. The evaluations of energy consumption and efficiency are carried out with 128 input dimension (𝜎 = 0.15), and th… view at source ↗
read the original abstract

The Mixture-of-Experts (MoE) models have emerged as the state-of-the-art paradigm for scaling up large language models (LLMs) without proportionally increased computational cost. However, its on-device deployment faces a critical challenge due to the large memory requirement for storing all expert parameters. 3D NAND-based computing-in-memory (CIM) architectures uniquely offer high storage capacity and reduced data movement, while they are ill-suited for MoE models with dynamically sparse expert activation, leading to a degradation of effective computational parallelism, along with underutilization of multibit storage capability of Flash cells. In this work, we proposed a 3D NAND-based content addressable-selected CIM architecture, dubbed as NASiC, which is tailored to MoE models. By leveraging the intrinsic string structure of 3D NAND technology, NASiC fuses the dynamical expert selection through CAM-based masking mechanism and activated expert computation through CIM into a single computation cycle, eradicating redundant computation and enhancing computational parallelism. Moreover, circuit-level optimizations and multibit CIM cell are co-designed with proposed NASiC architecture, featuring block-wise parallel computation with in-situ signed multibit input and weight expansion, substantially improving the throughput and energy-efficiency of NAND CIM array, as well as the utilization of high-density 3D NAND technology for MoE models. With extensive experimental results, we demonstrate NASiC achieves 4-114.8x improved performance and 3.9-70x improved energy efficiency over state-of-the-art designs, along with high accuracy, showing its great potential for efficient on-device MoE LLM inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes NASiC, a 3D NAND-based CAM-selected multibit CIM architecture for on-device MoE LLM inference. It exploits the intrinsic string structure of 3D NAND to fuse CAM-based dynamical expert selection (masking) and CIM-based expert computation into a single cycle, eliminating redundant operations and improving parallelism. Circuit-level co-design includes block-wise parallel signed multibit input/weight expansion. The authors report 4-114.8× performance and 3.9-70× energy-efficiency gains over state-of-the-art designs while preserving high accuracy.

Significance. If the central claims hold under realistic device conditions, the work could meaningfully advance efficient on-device deployment of large MoE models by addressing the mismatch between dynamic sparsity and standard NAND CIM utilization. The architecture-circuit-cell co-design for sparse activation and multibit Flash cells is a clear strength; the single-cycle fusion idea directly targets a known limitation in prior CIM designs for MoE workloads.

major comments (2)
  1. [Abstract and §5] Abstract and §5 (Experimental Results): The headline claims of 4-114.8× performance and 3.9-70× energy efficiency are load-bearing for the contribution, yet neither the abstract nor the results section specifies the exact SOTA baselines, the simulation framework (SPICE vs. architectural), accuracy metric definitions, or any Monte-Carlo/process-variation analysis. Without these, the reported gains cannot be assessed for robustness against the cell non-idealities that the architecture assumes away.
  2. [§3 and §4] §3 (Architecture) and §4 (Circuit Co-design): The single-cycle CAM-CIM fusion via 3D NAND strings is presented as eradicating redundant computation, but the description provides no timing diagram, critical-path analysis, or quantification of area/delay overhead from the added CAM masking transistors on the NAND string. Any non-zero overhead would directly undermine the parallelism and utilization claims that justify the large speed-up numbers.
minor comments (2)
  1. [Abstract] Abstract: The acronym NASiC is introduced without expansion.
  2. [§5] Figure captions and legends in §5 are occasionally missing units or baseline identifiers, reducing readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the clarity and robustness of our claims. We will revise the manuscript to provide the requested details on baselines, simulation methodology, metrics, variation analysis, and circuit timing.

read point-by-point responses
  1. Referee: [Abstract and §5] Abstract and §5 (Experimental Results): The headline claims of 4-114.8× performance and 3.9-70× energy efficiency are load-bearing for the contribution, yet neither the abstract nor the results section specifies the exact SOTA baselines, the simulation framework (SPICE vs. architectural), accuracy metric definitions, or any Monte-Carlo/process-variation analysis. Without these, the reported gains cannot be assessed for robustness against the cell non-idealities that the architecture assumes away.

    Authors: We agree that explicit specification is required for independent assessment. In the revised manuscript we will (i) name the precise SOTA baselines in both the abstract and §5, (ii) state that all results derive from an architectural simulator whose timing and energy models are calibrated against SPICE simulations of the 3D NAND cells, (iii) define the accuracy metrics (perplexity on WikiText-103 and downstream task accuracy), and (iv) add Monte-Carlo results that incorporate measured cell-to-cell variation and process corners to quantify robustness of the reported speed-up and energy figures. revision: yes

  2. Referee: [§3 and §4] §3 (Architecture) and §4 (Circuit Co-design): The single-cycle CAM-CIM fusion via 3D NAND strings is presented as eradicating redundant computation, but the description provides no timing diagram, critical-path analysis, or quantification of area/delay overhead from the added CAM masking transistors on the NAND string. Any non-zero overhead would directly undermine the parallelism and utilization claims that justify the large speed-up numbers.

    Authors: The single-cycle property follows directly from the vertical string topology of 3D NAND: the CAM masking transistors sit on the same string as the CIM cells and are activated during the pre-charge phase that already occurs in every CIM cycle, so they do not lengthen the critical path. Nevertheless, we acknowledge the absence of supporting diagrams and quantification. The revision will add (i) a timing diagram showing the fused CAM-CIM sequence within one cycle, (ii) a critical-path breakdown confirming that the masking transistors lie off the read-current path, and (iii) area and delay numbers extracted from the same SPICE-calibrated models used for the performance results, demonstrating that the incremental overhead is negligible relative to the baseline NAND string. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an architecture proposal for NASiC that fuses CAM-based expert selection and CIM computation via 3D NAND string structure, supported by circuit co-design and validated through experimental results showing performance and energy gains. No equations, self-definitional mappings, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described claims. The performance assertions rest on external experimental benchmarks rather than reducing to the input assumptions by construction, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no concrete free parameters, axioms, or invented entities. The design appears to rest on standard 3D NAND string properties and CIM principles without new postulated physical entities.

pith-pipeline@v0.9.0 · 5866 in / 1298 out tokens · 24998 ms · 2026-05-25T02:54:32.432007+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 7 internal anchors

  1. [1]

    Attention is all you need.Advances in Neural Information Processing Systems, 2017

    A Vaswani. Attention is all you need.Advances in Neural Information Processing Systems, 2017. 6

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  4. [4]

    A survey on recent advances in llm-based multi-turn dialogue systems.arXiv preprint arXiv:2402.18013, 2024

    Zihao Yi, Jiarui Ouyang, Yuwen Liu, Tianhao Liao, Zhe Xu, and Ying Shen. A survey on recent advances in llm-based multi-turn dialogue systems.arXiv preprint arXiv:2402.18013, 2024

  5. [5]

    Mariogpt: Open-ended text2level generation through large language models.Advances in Neural Information Processing Systems, 36, 2024

    Shyam Sudhakaran, Miguel González-Duque, Matthias Freiberger, Claire Glanois, Elias Najarro, and Sebastian Risi. Mariogpt: Open-ended text2level generation through large language models.Advances in Neural Information Processing Systems, 36, 2024

  6. [6]

    Adapted large language models can outperform medical experts in clinical text summarization.Nature medicine, 30(4):1134–1142, 2024

    Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerová, et al. Adapted large language models can outperform medical experts in clinical text summarization.Nature medicine, 30(4):1134–1142, 2024

  7. [7]

    Perception, reason, think, and plan: A survey on large multimodal reasoning models.arXiv preprint arXiv:2505.04921, 2025

    Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, et al. Perception, reason, think, and plan: A survey on large multimodal reasoning models.arXiv preprint arXiv:2505.04921, 2025

  8. [8]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  9. [9]

    Palm: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240):1–113, 2023

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebas- tian Gehrmann, et al. Palm: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240):1–113, 2023

  10. [10]

    A survey: Collaborative hardware and software design in the era of large language models

    Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, et al. A survey: Collaborative hardware and software design in the era of large language models. IEEE Circuits and Systems Magazine, 25(1):35–57, 2025

  11. [11]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

  12. [12]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

  13. [13]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models.arXiv preprint arXiv:2401.06066, 2024

  14. [14]

    Glam: Efficient scaling of language models with mixture-of-experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. InInterna- tional conference on machine learning, pages 5547–5569. PMLR, 2022

  15. [15]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

  16. [16]

    No Language Left Behind: Scaling Human-Centered Machine Translation

    Marta R Costa-Jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No language left behind: Scaling human-centered machine translation.arXiv preprint arXiv:2207.04672, 2022

  17. [17]

    Insights into deepseek-v3: Scaling challenges and reflections on hardware for ai architec- tures

    Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, et al. Insights into deepseek-v3: Scaling challenges and reflections on hardware for ai architec- tures. InProceedings of the 52nd Annual International Symposium on Computer Architecture, pages 1731–1745, 2025

  18. [18]

    Edgemoe: Fast on-device inference of moe-based large language models

    Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, and Mengwei Xu. Edgemoe: Fast on-device inference of moe-based large language models. arXiv preprint arXiv:2308.14352, 2023

  19. [19]

    Collaborative compression for large-scale moe deployment on edge.arXiv preprint arXiv:2509.25689, 2025

    Yixiao Chen, Yanyue Xie, Ruining Yang, Wei Jiang, Wei Wang, Yong He, Yue Chen, Pu Zhao, and Yanzhi Wang. Collaborative compression for large-scale moe deployment on edge.arXiv preprint arXiv:2509.25689, 2025

  20. [20]

    Semiconductor memory technologies: State- of-the-art and future trends.Computer, 57(4):150–154, 2024

    Shimeng Yu and Tae-Hyeon Kim. Semiconductor memory technologies: State- of-the-art and future trends.Computer, 57(4):150–154, 2024

  21. [21]

    Cambricon-llm: A chiplet-based hybrid architecture for on-device inference of 70b llm

    Zhongkai Yu, Shengwen Liang, Tianyun Ma, Yunke Cai, Ziyuan Nan, Di Huang, Xinkai Song, Yifan Hao, Jie Zhang, Tian Zhi, et al. Cambricon-llm: A chiplet-based hybrid architecture for on-device inference of 70b llm. In2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1474–1488. IEEE, 2024

  22. [22]

    Aif: Accelerating on-device llm inference using in-flash pro- cessing

    Jaeyong Lee, Hyeunjoo Kim, Sanghun Oh, Myoungjun Chun, Myungsuk Kim, and Jihong Kim. Aif: Accelerating on-device llm inference using in-flash pro- cessing. InProceedings of the 52nd Annual International Symposium on Computer Architecture, pages 529–543, 2025

  23. [23]

    Towards moe deployment: Mitigating inefficiencies in mixture-of-expert (moe) inference.arXiv preprint arXiv:2303.06182, 2023

    Haiyang Huang, Newsha Ardalani, Anna Sun, Liu Ke, Hsien-Hsin S Lee, Anjali Sridhar, Shruti Bhosale, Carole-Jean Wu, and Benjamin Lee. Towards moe deployment: Mitigating inefficiencies in mixture-of-expert (moe) inference.arXiv preprint arXiv:2303.06182, 2023

  24. [24]

    Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference

    Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang. Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 1018–1031. IEEE, 2024

  25. [25]

    Ssd offloading for llm mixture- of-experts weights considered harmful in energy efficiency.IEEE Computer Architecture Letters, 2025

    Kwanhee Kyung, Sungmin Yun, and Jung Ho Ahn. Ssd offloading for llm mixture- of-experts weights considered harmful in energy efficiency.IEEE Computer Architecture Letters, 2025

  26. [26]

    An approach of 3d nand flash based nonvolatile computing-in- memory (nvcim) accelerator for deep neural networks (dnns) with calibration and read disturb analysis

    Po-Kai Hsu, Pei-Ying Du, Chieh Roger Lo, Hang-Ting Lue, Wei-Chen Chen, Tzu- Hsuan Hsu, Teng-Hao Yeh, Chih-Chang Hsieh, Ming-Liang Wei, Keh-Chung Wang, et al. An approach of 3d nand flash based nonvolatile computing-in- memory (nvcim) accelerator for deep neural networks (dnns) with calibration and read disturb analysis. In2020 IEEE International Memory Wo...

  27. [27]

    Hang-Ting Lue, Po-Kai Hsu, Ming-Liang Wei, Teng-Hao Yeh, Pei-Ying Du, Wei- Chen Chen, Keh-Chung Wang, and Chih-Yuan Lu. Optimal design methods to transform 3d nand flash into a high-density, high-bandwidth and low-power nonvolatile computing in memory (nvcim) accelerator for deep-learning neural networks (dnn). In2019 IEEE International Electron Devices M...

  28. [28]

    Technological design of 3d nand-based compute- in-memory architecture for gb-scale deep neural network.IEEE Electron Device Letters, 42(2):160–163, 2020

    Wonbo Shim and Shimeng Yu. Technological design of 3d nand-based compute- in-memory architecture for gb-scale deep neural network.IEEE Electron Device Letters, 42(2):160–163, 2020

  29. [29]

    Architectural design of 3d nand flash based compute-in-memory for inference engine

    Wonbo Shim, Hongwu Jiang, Xiaochen Peng, and Shimeng Yu. Architectural design of 3d nand flash based compute-in-memory for inference engine. In Proceedings of the International Symposium on Memory Systems, pages 77–85, 2020

  30. [30]

    Pangu pro moe: Mixture of grouped experts for efficient sparsity.arXiv preprint arXiv:2505.21411, 2025

    Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, et al. Pangu pro moe: Mixture of grouped experts for efficient sparsity.arXiv preprint arXiv:2505.21411, 2025

  31. [31]

    13.4 a 512gb 3-bit/cell 3d 6 th-generation v-nand flash memory with 82mb/s write throughput and 1.2 gb/s interface

    Dongku Kang, Minsu Kim, Su Chang Jeon, Wontaeck Jung, Jooyong Park, Gyosoo Choo, Dong-kyo Shim, Anil Kavala, Seung-Bum Kim, Kyung-Min Kang, et al. 13.4 a 512gb 3-bit/cell 3d 6 th-generation v-nand flash memory with 82mb/s write throughput and 1.2 gb/s interface. In2019 IEEE International Solid-State Circuits Conference-(ISSCC), pages 216–218. IEEE, 2019

  32. [32]

    A 1-tb 4-b/cell 4-plane 162-layer 3-d flash memory with 2.4-gb/s io interface.IEEE Journal of Solid-State Circuits, 58(1):316–328, 2022

    Jong Hak Yuh, Yen-Lung Jason Li, Heguang Li, Yoshihiro Oyama, Cynthia Hsu, Pradeep Anantula, Gwang Yeong Stanley Jeong, Anirudh Amarnath, Siddhesh Darne, Sneha Bhatia, et al. A 1-tb 4-b/cell 4-plane 162-layer 3-d flash memory with 2.4-gb/s io interface.IEEE Journal of Solid-State Circuits, 58(1):316–328, 2022

  33. [33]

    Impact of 3d nand current variation on inference accuracy for in-memory computing.Journal of Semiconductor Technology and Science, 22(5): 341–345, 2022

    Wonbo Shim. Impact of 3d nand current variation on inference accuracy for in-memory computing.Journal of Semiconductor Technology and Science, 22(5): 341–345, 2022

  34. [34]

    A survey on mixture of experts in large language models.IEEE Transactions on Knowledge and Data Engineering, 2025

    Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models.IEEE Transactions on Knowledge and Data Engineering, 2025

  35. [35]

    Architecture and process integration overview of 3d nand flash technologies.Applied Sciences, 11 (15), 2021

    Geun Ho Lee, Sungmin Hwang, Junsu Yu, and Hyungjin Kim. Architecture and process integration overview of 3d nand flash technologies.Applied Sciences, 11 (15), 2021. ISSN 2076-3417

  36. [36]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. InInternational conference on machine learning, pages 38087– 38099. PMLR, 2023

  37. [37]

    Analysis and verilog-a modeling of floating-gate transistors.IEEE Open Journal of Circuits and Systems, 2024

    Sayma Nowshin Chowdhury, Matthew Chen, and Sahil Shah. Analysis and verilog-a modeling of floating-gate transistors.IEEE Open Journal of Circuits and Systems, 2024

  38. [38]

    3d-fpim: An extreme energy- efficient dnn acceleration system using 3d nand flash-based in-situ pim unit

    Hunjun Lee, Minseop Kim, Dongmoon Min, Joonsung Kim, Jongwon Back, Honam Yoo, Jong-Ho Lee, and Jangwoo Kim. 3d-fpim: An extreme energy- efficient dnn acceleration system using 3d nand flash-based in-situ pim unit. In2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1359–1376. IEEE, 2022

  39. [39]

    A 2-gs/s 8-bit time-interleaved sar adc for millimeter-wave pulsed radar baseband soc.IEEE Journal of Solid-State Circuits, 52(10):2712–2720, 2017

    Takuji Miki, Toshiaki Ozeki, and Jun-ichi Naka. A 2-gs/s 8-bit time-interleaved sar adc for millimeter-wave pulsed radar baseband soc.IEEE Journal of Solid-State Circuits, 52(10):2712–2720, 2017

  40. [40]

    Nvidia a100 tensor core gpu: Performance and innovation.IEEE Micro, 41(2):29–35, 2021

    Jack Choquette, Wishwesh Gandhi, Olivier Giroux, Nick Stam, and Ronny Krashinsky. Nvidia a100 tensor core gpu: Performance and innovation.IEEE Micro, 41(2):29–35, 2021. 7