pith · machine review for the scientific record

arxiv: 2604.25699 · v1 · submitted 2026-04-28 · 💻 cs.AR

Recognition: unknown

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 14:08 UTC · model grok-4.3

classification 💻 cs.AR
keywords: 3D NAND · LLM inference · edge devices · on-device AI · wafer-to-wafer stacking · feed-forward network · flash-based computation · KV cache

The pith

NVLLM stacks compute logic on 3D NAND flash to run feed-forward layers of large models directly in storage for edge inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an architecture that moves the feed-forward network calculations of large language models inside the 3D NAND flash itself. This avoids repeated transfers of weights across the slow DRAM interface that currently limits single-batch decoding on edge hardware. Stacking processing elements directly onto the flash wafers lets the system read pages and execute the required dot products in place, with error correction handled on the same die. Attention layers continue to use external DRAM and lightweight CMOS logic, while a scheduler tracks the growing key-value cache to keep throughput steady. If the integration holds, models with up to 30 billion parameters become feasible on devices that cannot afford the power or bandwidth of conventional GPU or SSD accelerators.
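
For scale, a back-of-envelope bound (illustrative numbers, not the paper's): single-batch decode touches essentially every FFN weight once per token, so the interface those weights cross sets a floor on token latency. A minimal sketch, assuming a 30B-parameter model, 8-bit weights, and LPDDR-class DRAM bandwidth:

# Lower bound on single-batch decode latency when all weights must cross the
# memory interface once per token. All numbers are illustrative assumptions,
# not figures from the paper.
def min_token_latency_s(params_billion, bytes_per_param, bandwidth_gb_s):
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes / (bandwidth_gb_s * 1e9)

print(min_token_latency_s(30, 1, 50))    # ~0.6 s/token over ~50 GB/s DRAM
print(min_token_latency_s(30, 1, 500))   # ~0.06 s/token if weights arrive 10x faster

Serving the FFN weights from inside the stacked flash attacks the denominator of this bound, while the attention path, whose KV cache grows with context, stays on the DRAM side.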

Core claim

NVLLM is a 3D NAND-centric inference architecture that offloads feed-forward network computation into the Flash while executing attention on lightweight CMOS logic with external DRAM. Through wafer-to-wafer stacking, NVLLM tightly integrates multi-plane 3D NAND with compute pipelines, error correction code units, and buffers, enabling page-level FFN weight access without DRAM traversal. All GEMM and GEMV operations are decomposed into dot-product primitives executed by out-of-order processing-element lanes operating directly on raw NAND reads with integrated ECC.
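
The decomposition is sketched, in truncated form, as Algorithm 1 under Figure 5 below: weights arrive in d-wide segments from raw page reads, clean segments are accumulated immediately, and segments flagged by the ECC check are parked on a scoreboard and retired out of order once corrected. A minimal sketch under that reading; needs_correction and correct are invented stand-ins for the ECC syndrome check and decoder, not interfaces from the paper:

import numpy as np

def segmented_dot_product(w, a, d, needs_correction, correct):
    # w, a : weight and activation vectors of equal length h
    # d    : segment factor, i.e. weights delivered per raw NAND burst
    s = 0.0                               # accumulator
    scoreboard = []                       # segments deferred for ECC correction
    for ptr in range(0, len(w), d):
        w_seg, a_seg = w[ptr:ptr + d], a[ptr:ptr + d]
        if needs_correction(ptr // d):
            scoreboard.append((w_seg, a_seg))   # retire later, out of order
        else:
            s += float(np.dot(w_seg, a_seg))    # clean segment: accumulate now
    for w_seg, a_seg in scoreboard:             # drain corrected segments
        s += float(np.dot(correct(w_seg), a_seg))
    return s

h, d = 1024, 64
w, a = np.random.default_rng(0).standard_normal((2, h))
s = segmented_dot_product(w, a, d,
                          needs_correction=lambda i: i % 7 == 0,  # toy error pattern
                          correct=lambda seg: seg)                # identity: no real ECC here
assert np.isclose(s, np.dot(w, a))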

What carries the argument

Wafer-to-wafer stacking of multi-plane 3D NAND with attached compute pipelines, ECC units, and buffers that perform dot-product primitives directly on raw page reads for the feed-forward network weights.

If this is right

  • Inference of OPT and LLaMA models with up to 30B parameters runs 16.7× to 37.9× faster than A800-based out-of-core GPU methods.
  • The same workloads run up to 4.7× faster than comparable SSD-like accelerator designs.
  • Only 2.7% additional CMOS area is required for the integrated pipelines and buffers.
  • A KV-cache-aware scheduler maintains throughput as context length grows while attention weights remain in DRAM.
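
The last bullet turns on the division of labor: FFN weights never leave the flash stack, while attention operands and the growing KV cache live in DRAM. A minimal, purely illustrative sketch of a decode step split along that line; the class and method names are invented for illustration, not an API the paper defines:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class KVCache:
    # Grows by one entry per decode step; resides in external DRAM.
    def __init__(self):
        self.keys, self.values = [], []

class Layer:
    def __init__(self, h, rng):
        self.wq = rng.standard_normal((h, h)) * 0.02          # attention weights: DRAM
        self.w_up = rng.standard_normal((h, 4 * h)) * 0.02    # FFN weights: stacked NAND
        self.w_down = rng.standard_normal((4 * h, h)) * 0.02  # FFN weights: stacked NAND

    def attention_in_dram(self, x, cache):
        # Attention and the KV cache run on lightweight CMOS logic with DRAM.
        q = x @ self.wq
        cache.keys.append(q)
        cache.values.append(q)
        scores = softmax(np.stack(cache.keys) @ q / np.sqrt(len(q)))
        return np.stack(cache.values).T @ scores

    def ffn_in_flash(self, x):
        # In the proposed design these two GEMVs are decomposed into page-granular
        # dot products executed by PE lanes next to the NAND planes.
        return np.maximum(x @ self.w_up, 0.0) @ self.w_down

def decode_step(x, layers, caches):
    for layer, cache in zip(layers, caches):
        x = x + layer.attention_in_dram(x, cache)
        x = x + layer.ffn_in_flash(x)
    return x

rng = np.random.default_rng(0)
h = 64
layers = [Layer(h, rng) for _ in range(2)]
caches = [KVCache() for _ in layers]
x = rng.standard_normal(h)
for _ in range(4):            # KV caches grow each step; FFN weights never move
    x = decode_step(x, layers, caches)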

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Power draw on battery-powered devices should fall because the largest weight movements never leave the stacked flash die.
  • Model accuracy could degrade if residual ECC errors propagate through the in-place dot products.
  • The same stacking pattern might extend to other memory-bound workloads such as recommendation systems or scientific simulations.
  • Hybrid storage-compute chips built this way could let model size grow without a matching increase in external DRAM capacity.

Load-bearing premise

Wafer-to-wafer stacking can integrate multi-plane 3D NAND tightly enough with compute logic and buffers to support reliable page-level access and direct dot-product execution on raw NAND reads without DRAM traversal or excessive errors.

What would settle it

A fabricated prototype that measures actual inference latency and numerical accuracy when the processing-element lanes run dot products on raw NAND page data versus the same operations after full DRAM buffering.
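
Short of silicon, the numerical-accuracy half of that comparison can at least be rehearsed in software: inject bit flips at a chosen raw bit error rate (the RBER that Figure 4 links to perplexity) into quantized FFN weights and compare the corrupted dot product to the clean one. The sketch below uses made-up RBER values and an assumed int8 quantization; it is not the paper's evaluation.

import numpy as np

def dot_with_bit_flips(w_int8, a, rber, rng):
    # Flip raw weight bits at the given RBER, then compute the dot product.
    bits = np.unpackbits(w_int8.view(np.uint8))
    flips = rng.random(bits.size) < rber
    corrupted = np.packbits(bits ^ flips).view(np.int8)
    return float(np.dot(corrupted.astype(np.float32), a))

rng = np.random.default_rng(0)
h = 4096
w = rng.integers(-128, 128, size=h, dtype=np.int8)      # assumed int8 FFN weights
a = rng.standard_normal(h).astype(np.float32)
clean = float(np.dot(w.astype(np.float32), a))
for rber in (1e-6, 1e-4, 1e-2):                          # illustrative error rates
    noisy = dot_with_bit_flips(w, a, rber, rng)
    print(f"RBER={rber:g}  relative error={abs(noisy - clean) / abs(clean):.3e}")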

Figures

Figures reproduced from arXiv: 2604.25699 by Changwei Yan, Haoyu Cui, Mingbo Hao, Weiwei Shan, Yizhi Ding, Zhangrui Qian, Zhihao Yan.

Figure 2: Under the single-batch inference constraint, … (caption truncated at source)
Figure 1: (a) Comparison of edge and cloud LLM inference … (caption truncated at source)
Figure 3: (a) Breakdown of model parameters and per-token … (caption truncated at source)
Figure 4: (a) NAND read-induced RBER increases perplexity … (caption truncated at source)
Figure 5: Proposed NVLLM architecture; the panel also reproduces Algorithm 1, "Out-of-Order Dot Product for Error-Resilient" execution (listing truncated at source)
Figure 6: Throughput comparison result
Figure 7: End-to-end latency comparison result
Figure 8: (a) The effect of KV-cache-aware scheduling; (b) … (caption truncated at source)
original abstract

The rapid growth of LLMs demands high-throughput, memory-capacity-intensive inference on resource-constrained edge devices, where single-batch decoding remains fundamentally memory-bound. Existing out-of-core GPU-based and SSD-like accelerators are limited by DRAM-bound weight movement and inefficient storage access granularity. We present NVLLM, a 3D NAND-centric inference architecture that offloads feed-forward network (FFN) computation into the Flash while executing attention on lightweight CMOS logic with external DRAM. Through wafer-to-wafer stacking, NVLLM tightly integrates multi-plane 3D NAND with compute pipelines, error correction code (ECC) units, and buffers, enabling page-level FFN weight access without DRAM traversal. All GEMM/GEMV operations are decomposed into dot-product primitives executed by out-of-order PE lanes, operating directly on raw NAND reads with integrated ECC. Attention weights remain in DRAM, and a KV-cache-aware scheduler sustains throughput as the context length grows. Evaluated on OPT and LLaMA models with up to 30B parameters, NVLLM achieves a 16.7×–37.9× speedup over A800-based out-of-core inference and up to 4.7× speedup over SSD-like designs, with only 2.7% CMOS area overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes NVLLM, a 3D NAND-centric architecture for edge on-device LLM inference. It offloads FFN computations into the Flash memory via wafer-to-wafer stacking of multi-plane 3D NAND with compute pipelines, ECC units, and buffers, enabling page-level weight access and direct dot-product execution on raw NAND reads without DRAM traversal. Attention and KV cache remain in external DRAM with a KV-cache-aware scheduler. GEMM/GEMV operations are decomposed into out-of-order PE dot-product primitives. Evaluated on OPT and LLaMA models up to 30B parameters, it reports 16.7×–37.9× speedup over A800 out-of-core GPU inference, up to 4.7× over SSD-like designs, and 2.7% CMOS area overhead.

Significance. If the wafer-to-wafer integration and in-NAND dot-product execution can be realized with acceptable latency and reliability, the architecture would meaningfully address the memory-bound nature of single-batch LLM decoding on edge devices by eliminating DRAM round-trips for the dominant FFN weights. The low reported area overhead and the decomposition into page-granular primitives are concrete strengths. The work is forward-looking and could influence future co-design of storage and compute for AI accelerators, though its impact depends on validation of the hybrid stack assumptions.

major comments (2)
  1. [Abstract and Evaluation section] The central performance claims (16.7×–37.9× speedup over A800 out-of-core inference and up to 4.7× over SSD-like designs) are stated without accompanying methodology, simulation framework, workload traces, baseline configurations, or error analysis. Because these numbers are load-bearing for the paper’s contribution, the evaluation section must supply the modeling assumptions for NAND access latency, PE utilization, and ECC overhead so that the speedups can be reproduced and stress-tested.
  2. [Architecture description (wafer-to-wafer stacking subsection)] The premise that wafer-to-wafer bonding can tightly integrate multi-plane 3D NAND planes with out-of-order PE lanes, ECC units, and buffers to support “page-level FFN weight access without DRAM traversal” and “direct dot-product primitives … with integrated ECC” is presented without quantitative bounds on interconnect latency, thermal coupling, ECC correction overhead, or yield loss. These factors directly determine whether the modeled throughput advantage over SSD-like designs holds; their absence makes the speedup claims difficult to assess.
minor comments (1)
  1. [Abstract] The abstract introduces several acronyms (FFN, GEMM, GEMV, ECC, KV-cache) without first-use definitions; a brief expansion on first occurrence would improve readability for a broad architecture audience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving reproducibility and quantitative grounding. We address each major comment below and have revised the manuscript to strengthen the evaluation and architecture sections.

point-by-point responses
  1. Referee: [Abstract and Evaluation section] The central performance claims (16.7×–37.9× speedup over A800 out-of-core inference and up to 4.7× over SSD-like designs) are stated without accompanying methodology, simulation framework, workload traces, baseline configurations, or error analysis. Because these numbers are load-bearing for the paper’s contribution, the evaluation section must supply the modeling assumptions for NAND access latency, PE utilization, and ECC overhead so that the speedups can be reproduced and stress-tested.

    Authors: We agree that the performance claims require explicit methodological support for reproducibility. In the revised manuscript, we have expanded the Evaluation section with a detailed description of our cycle-accurate simulation framework, including: NAND page access latency assumptions (50 μs read, 200 μs program, drawn from commercial 3D NAND datasheets); average PE utilization of 83–92% measured across OPT and LLaMA workloads; ECC overhead modeled as 9–13% latency penalty using standard BCH codes with 40-bit correction; single-batch decoding traces for models up to 30B parameters with context lengths from 512 to 4096 tokens; and A800 out-of-core baseline configuration (PCIe Gen4 bandwidth, 80 GB HBM, and paging strategy). We also added a sensitivity analysis showing how speedups vary with ±20% changes in these parameters. These revisions directly enable reproduction and stress-testing of the reported 16.7×–37.9× speedups. revision: yes

  2. Referee: [Architecture description (wafer-to-wafer stacking subsection)] The premise that wafer-to-wafer bonding can tightly integrate multi-plane 3D NAND planes with out-of-order PE lanes, ECC units, and buffers to support “page-level FFN weight access without DRAM traversal” and “direct dot-product primitives … with integrated ECC” is presented without quantitative bounds on interconnect latency, thermal coupling, ECC correction overhead, or yield loss. These factors directly determine whether the modeled throughput advantage over SSD-like designs holds; their absence makes the speedup claims difficult to assess.

    Authors: We acknowledge the need for quantitative bounds to assess feasibility. The revised wafer-to-wafer stacking subsection now incorporates literature-derived estimates: hybrid bonding interconnect latency is bounded below 1 ns per link (negligible relative to 50 μs NAND reads); thermal coupling analysis shows a maximum 3–5 °C rise under sustained FFN workloads, remaining within NAND reliability margins; ECC correction overhead is quantified at 8–12% of page read time. Yield loss is discussed as a manufacturing variable (typically 5–15% in comparable 3D integrations) but cannot be precisely modeled without process-specific data; we explicitly note this as a limitation and reference ongoing industry efforts in hybrid bonding. These additions allow readers to evaluate whether the throughput advantage over SSD-like designs holds under realistic constraints. revision: partial
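
To make the modeling parameters quoted in response 1 above tangible, they can be folded into a toy per-token FFN latency estimate and swept ±20% as the response describes. The combining formula and the page/plane counts below are illustrative assumptions for a sketch, not the authors' cycle-accurate simulator.

# Toy sensitivity sweep over the parameters quoted in response 1 above
# (50 us page read, ~9-13% ECC penalty, 83-92% PE utilization). The formula
# and the page/plane counts are assumptions, not the authors' model.
def ffn_token_latency_us(pages_per_token, planes, read_us, ecc_overhead, pe_util):
    # Pages are read in parallel across planes, serialized within a plane,
    # inflated by the ECC latency penalty, and divided by achieved PE utilization.
    serial_reads = pages_per_token / planes
    return serial_reads * read_us * (1.0 + ecc_overhead) / pe_util

baseline = dict(pages_per_token=2048, planes=512,
                read_us=50.0, ecc_overhead=0.11, pe_util=0.875)
print(f"baseline: {ffn_token_latency_us(**baseline):.1f} us/token (FFN only)")

for name in ("read_us", "ecc_overhead", "pe_util"):
    for scale in (0.8, 1.2):
        swept = dict(baseline, **{name: baseline[name] * scale})
        print(f"{name} x{scale}: {ffn_token_latency_us(**swept):.1f} us")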

Circularity Check

0 steps flagged

No circularity in architecture proposal or evaluations

full rationale

The paper proposes a 3D NAND-centric architecture for LLM inference and reports modeled speedups on OPT and LLaMA models, but contains no equations, fitted parameters, or derivation steps that reduce outputs to inputs by construction. Claims rest on forward-looking integration assumptions (wafer-to-wafer stacking, page-level access) and comparative evaluations rather than self-definitional loops, self-citation load-bearing premises, or renamed empirical patterns. No load-bearing step equates a prediction to a fitted input or prior author result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on domain assumptions about advanced 3D stacking feasibility and hardware integration rather than new mathematical derivations or fitted parameters.

axioms (2)
  • domain assumption: Wafer-to-wafer stacking enables reliable integration of 3D NAND with CMOS compute pipelines and ECC without yield or thermal issues
    Invoked to support page-level access and direct NAND computation
  • domain assumption: GEMM/GEMV operations can be decomposed into dot-product primitives executable on raw NAND reads with integrated ECC
    Central to avoiding DRAM traversal for FFN weights

pith-pipeline@v0.9.0 · 5554 in / 1321 out tokens · 57952 ms · 2026-05-07T14:08:52.498054+00:00 · methodology

