NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference
Pith reviewed 2026-05-07 14:08 UTC · model grok-4.3
The pith
NVLLM stacks compute logic on 3D NAND flash to run feed-forward layers of large models directly in storage for edge inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NVLLM is a 3D NAND-centric inference architecture that offloads feed-forward network computation into the Flash while executing attention on lightweight CMOS logic with external DRAM. Through wafer-to-wafer stacking, NVLLM tightly integrates multi-plane 3D NAND with compute pipelines, error correction code units, and buffers, enabling page-level FFN weight access without DRAM traversal. All GEMM and GEMV operations are decomposed into dot-product primitives executed by out-of-order processing-element lanes operating directly on raw NAND reads with integrated ECC.
What carries the argument
Wafer-to-wafer stacking of multi-plane 3D NAND with attached compute pipelines, ECC units, and buffers that perform dot-product primitives directly on raw page reads for the feed-forward network weights.
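The page-granular dot-product decomposition described above can be sketched in a few lines. The 16 KB page size, fp16 weights, and the streaming interface are our assumptions for illustration, not details taken from the paper:

```python
import numpy as np

PAGE_BYTES = 16 * 1024   # assumed 3D NAND page size
DTYPE = np.float16       # assumed FFN weight precision

def gemv_by_pages(weight_rows, x):
    """Illustrative GEMV y = W @ x in which each weight row arrives
    in page-sized chunks, standing in for raw NAND page reads fed to
    per-lane dot-product units."""
    vals_per_page = PAGE_BYTES // np.dtype(DTYPE).itemsize
    y = []
    for row in weight_rows:
        acc = 0.0
        # One iteration models one page read consumed by a PE lane;
        # accumulation is done in fp32, as a hardware accumulator would.
        for start in range(0, len(row), vals_per_page):
            chunk = row[start:start + vals_per_page].astype(np.float32)
            acc += float(chunk @ x[start:start + len(chunk)].astype(np.float32))
        y.append(acc)
    return np.array(y, dtype=np.float32)
```

Because each page contributes an independent partial sum, pages may complete in any order, which is the property out-of-order PE lanes would exploit.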
If this is right
- Inference of OPT and LLaMA models up to 30B parameters runs 16.7× to 37.9× faster than A800-based out-of-core GPU methods.
- The same workloads run up to 4.7× faster than comparable SSD-like accelerator designs.
- Only 2.7% additional CMOS area is required for the integrated pipelines and buffers.
- A KV-cache-aware scheduler maintains throughput as context length grows while attention weights remain in DRAM.
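As a sanity check on why speedups of this order are plausible, a crude bandwidth-bound model of single-batch decode can be written down. Every number below is a placeholder assumption (fp16 30B model, PCIe Gen4 out-of-core path, notional in-NAND and LPDDR bandwidths), not a figure from the paper, and the model ignores the access-granularity and scheduling effects that drive the larger reported gains:

```python
def decode_time_ms(weight_gb, bw_gbs):
    """Lower-bound per-token latency when decoding is memory-bound:
    every weight byte crosses the given bandwidth once per token."""
    return weight_gb / bw_gbs * 1000.0

params_b = 30                            # 30B parameters
weights_gb = params_b * 2                # fp16: 2 bytes per parameter

ffn_fraction = 2 / 3    # rough FFN share of transformer weights (assumed)
pcie_gbs = 32           # assumed PCIe Gen4 x16 out-of-core path
in_flash_gbs = 512      # assumed aggregate in-NAND page bandwidth
dram_gbs = 64           # assumed edge LPDDR bandwidth for attention weights

# Out-of-core baseline: all weights cross PCIe every token.
baseline = decode_time_ms(weights_gb, pcie_gbs)
# NVLLM-style split: FFN weights stay in flash, attention stays in DRAM;
# assume the two paths overlap, so the slower one dominates.
split = max(decode_time_ms(weights_gb * ffn_fraction, in_flash_gbs),
            decode_time_ms(weights_gb * (1 - ffn_fraction), dram_gbs))
speedup = baseline / split  # 6x under these placeholder numbers
```

Even this toy model shows the DRAM-side attention path becoming the new bottleneck once FFN traffic leaves the PCIe link, which is presumably why the KV-cache-aware scheduler matters.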
Where Pith is reading between the lines
- Power draw on battery-powered devices should fall because the largest weight movements never leave the stacked flash die.
- Model accuracy could degrade if residual ECC errors propagate through the in-place dot products.
- The same stacking pattern might extend to other memory-bound workloads such as recommendation systems or scientific simulations.
- Hybrid storage-compute chips built this way could let model size grow without a matching increase in external DRAM capacity.
Load-bearing premise
Wafer-to-wafer stacking can integrate multi-plane 3D NAND tightly enough with compute logic and buffers to support reliable page-level access and direct dot-product execution on raw NAND reads without DRAM traversal or excessive errors.
What would settle it
A fabricated prototype that measures actual inference latency and numerical accuracy when the processing-element lanes run dot products on raw NAND page data versus the same operations after full DRAM buffering.
Original abstract
The rapid growth of LLMs demands high-throughput, memory-capacity-intensive inference on resource-constrained edge devices, where single-batch decoding remains fundamentally memory-bound. Existing out-of-core GPU-based and SSD-like accelerators are limited by DRAM-bound weight movement and inefficient storage access granularity. We present NVLLM, a 3D NAND-centric inference architecture that offloads feed-forward network (FFN) computation into the Flash while executing attention on lightweight CMOS logic with external DRAM. Through wafer-to-wafer stacking, NVLLM tightly integrates multi-plane 3D NAND with compute pipelines, error correction code (ECC) units, and buffers, enabling page-level FFN weight access without DRAM traversal. All GEMM/GEMV operations are decomposed into dot-product primitives executed by out-of-order PE lanes, operating directly on raw NAND reads with integrated ECC. Attention weights remain in DRAM, and a KV-cache-aware scheduler sustains throughput as the context length grows. Evaluated on OPT and LLaMA models with up to 30B parameters, NVLLM achieves a 16.7×–37.9× speedup over A800-based out-of-core inference and up to 4.7× speedup over SSD-like designs, with only 2.7% CMOS area overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes NVLLM, a 3D NAND-centric architecture for edge on-device LLM inference. It offloads FFN computations into the Flash memory via wafer-to-wafer stacking of multi-plane 3D NAND with compute pipelines, ECC units, and buffers, enabling page-level weight access and direct dot-product execution on raw NAND reads without DRAM traversal. Attention and KV cache remain in external DRAM with a KV-cache-aware scheduler. GEMM/GEMV operations are decomposed into out-of-order PE dot-product primitives. Evaluated on OPT and LLaMA models up to 30B parameters, it reports 16.7×–37.9× speedup over A800 out-of-core GPU inference, up to 4.7× over SSD-like designs, and 2.7% CMOS area overhead.
Significance. If the wafer-to-wafer integration and in-NAND dot-product execution can be realized with acceptable latency and reliability, the architecture would meaningfully address the memory-bound nature of single-batch LLM decoding on edge devices by eliminating DRAM round-trips for the dominant FFN weights. The low reported area overhead and the decomposition into page-granular primitives are concrete strengths. The work is forward-looking and could influence future co-design of storage and compute for AI accelerators, though its impact depends on validation of the hybrid stack assumptions.
Major comments (2)
- Abstract and Evaluation section: The central performance claims (16.7×–37.9× speedup over A800 out-of-core inference and up to 4.7× over SSD-like designs) are stated without accompanying methodology, simulation framework, workload traces, baseline configurations, or error analysis. Because these numbers are load-bearing for the paper’s contribution, the evaluation section must supply the modeling assumptions for NAND access latency, PE utilization, and ECC overhead so that the speedups can be reproduced and stress-tested.
- Architecture description (wafer-to-wafer stacking subsection): The premise that wafer-to-wafer bonding can tightly integrate multi-plane 3D NAND planes with out-of-order PE lanes, ECC units, and buffers to support “page-level FFN weight access without DRAM traversal” and “direct dot-product primitives … with integrated ECC” is presented without quantitative bounds on interconnect latency, thermal coupling, ECC correction overhead, or yield loss. These factors directly determine whether the modeled throughput advantage over SSD-like designs holds; their absence makes the speedup claims difficult to assess.
Minor comments (1)
- [Abstract] The abstract introduces several acronyms (FFN, GEMM, GEMV, ECC, KV-cache) without first-use definitions; a brief expansion on first occurrence would improve readability for a broad architecture audience.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving reproducibility and quantitative grounding. We address each major comment below and have revised the manuscript to strengthen the evaluation and architecture sections.
Point-by-point responses
- Referee (Abstract and Evaluation section): The central performance claims (16.7×–37.9× speedup over A800 out-of-core inference and up to 4.7× over SSD-like designs) are stated without accompanying methodology, simulation framework, workload traces, baseline configurations, or error analysis. Because these numbers are load-bearing for the paper’s contribution, the evaluation section must supply the modeling assumptions for NAND access latency, PE utilization, and ECC overhead so that the speedups can be reproduced and stress-tested.
Authors: We agree that the performance claims require explicit methodological support for reproducibility. In the revised manuscript, we have expanded the Evaluation section with a detailed description of our cycle-accurate simulation framework, including: NAND page access latency assumptions (50 μs read, 200 μs program, drawn from commercial 3D NAND datasheets); average PE utilization of 83–92% measured across OPT and LLaMA workloads; ECC overhead modeled as 9–13% latency penalty using standard BCH codes with 40-bit correction; single-batch decoding traces for models up to 30B parameters with context lengths from 512 to 4096 tokens; and A800 out-of-core baseline configuration (PCIe Gen4 bandwidth, 80 GB HBM, and paging strategy). We also added a sensitivity analysis showing how speedups vary with ±20% changes in these parameters. These revisions directly enable reproduction and stress-testing of the reported 16.7×–37.9× speedups. revision: yes
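The sensitivity analysis described in this response can be approximated with a simple corner sweep. The nominal values mirror the numbers quoted above (50 μs page read, 9–13% ECC penalty, 83–92% PE utilization), but the latency model itself, the pages-per-token figure, and the serialized-read simplification are our own assumptions, not the authors' framework:

```python
import itertools

NOMINAL = {
    "page_read_us": 50.0,  # NAND page read latency (from the rebuttal)
    "ecc_penalty": 0.11,   # ECC latency overhead, midpoint of 9-13%
    "pe_util": 0.875,      # PE utilization, midpoint of 83-92%
}

def ffn_latency_us(pages_per_token, p):
    """Per-token FFN latency under a deliberately simple model:
    page reads are serialized, stretched by the ECC penalty, and
    divided by achieved PE utilization."""
    return (pages_per_token * p["page_read_us"]
            * (1 + p["ecc_penalty"]) / p["pe_util"])

def sensitivity(pages_per_token=1000, swing=0.2):
    """Evaluate latency at every corner of the +/-swing parameter cube
    (27 points: each parameter at -swing, nominal, and +swing)."""
    results = {}
    for signs in itertools.product((-1, 0, 1), repeat=len(NOMINAL)):
        p = {k: v * (1 + s * swing)
             for (k, v), s in zip(NOMINAL.items(), signs)}
        results[signs] = ffn_latency_us(pages_per_token, p)
    return results
```

The spread between the best and worst corners bounds how far modeled speedups could move under the ±20% perturbations the authors describe.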
- Referee (Architecture description, wafer-to-wafer stacking subsection): The premise that wafer-to-wafer bonding can tightly integrate multi-plane 3D NAND planes with out-of-order PE lanes, ECC units, and buffers to support “page-level FFN weight access without DRAM traversal” and “direct dot-product primitives … with integrated ECC” is presented without quantitative bounds on interconnect latency, thermal coupling, ECC correction overhead, or yield loss. These factors directly determine whether the modeled throughput advantage over SSD-like designs holds; their absence makes the speedup claims difficult to assess.
Authors: We acknowledge the need for quantitative bounds to assess feasibility. The revised wafer-to-wafer stacking subsection now incorporates literature-derived estimates: hybrid bonding interconnect latency is bounded below 1 ns per link (negligible relative to 50 μs NAND reads); thermal coupling analysis shows a maximum 3–5 °C rise under sustained FFN workloads, remaining within NAND reliability margins; ECC correction overhead is quantified at 8–12% of page read time. Yield loss is discussed as a manufacturing variable (typically 5–15% in comparable 3D integrations) but cannot be precisely modeled without process-specific data; we explicitly note this as a limitation and reference ongoing industry efforts in hybrid bonding. These additions allow readers to evaluate whether the throughput advantage over SSD-like designs holds under realistic constraints. revision: partial
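The "negligible interconnect latency" claim in this response is easy to check arithmetically from the bounds it quotes (the midpoint ECC figure is our choice):

```python
# Figures quoted in the rebuttal; the ECC midpoint is chosen by us.
PAGE_READ_US = 50.0      # NAND page read latency
BOND_LATENCY_US = 0.001  # hybrid bonding hop, bounded below 1 ns
ECC_OVERHEAD = 0.10      # midpoint of the stated 8-12% range

effective_read_us = PAGE_READ_US * (1 + ECC_OVERHEAD) + BOND_LATENCY_US
interconnect_share = BOND_LATENCY_US / effective_read_us
# The bonding hop contributes well under 0.01% of an effective page
# read, so read latency and ECC, not the interconnect, set the floor.
```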
Circularity Check
No circularity in architecture proposal or evaluations
Full rationale
The paper proposes a 3D NAND-centric architecture for LLM inference and reports modeled speedups on OPT and LLaMA models, but contains no equations, fitted parameters, or derivation steps that reduce outputs to inputs by construction. Claims rest on forward-looking integration assumptions (wafer-to-wafer stacking, page-level access) and comparative evaluations rather than self-definitional loops, self-citation load-bearing premises, or renamed empirical patterns. No load-bearing step equates a prediction to a fitted input or prior author result.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Wafer-to-wafer stacking enables reliable integration of 3D NAND with CMOS compute pipelines and ECC units without yield or thermal issues.
- Domain assumption: GEMM/GEMV operations can be decomposed into dot-product primitives executable on raw NAND reads with integrated ECC.