pith. machine review for the scientific record.

arxiv: 2604.04750 · v2 · submitted 2026-04-06 · 💻 cs.AR · cs.DC

Recognition: 2 Lean theorem links

DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:58 UTC · model grok-4.3

classification: 💻 cs.AR · cs.DC
keywords: 3D-stacked accelerators · design space exploration · LLM inference · performance modeling · distributed systems · hardware-software co-design · memory bandwidth modeling

The pith

DeepStack models distributed 3D-stacked AI accelerators to explore 250 trillion design points up to 100,000× faster than state-of-the-art simulators, surfacing designs with up to 9.5× higher LLM inference throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DeepStack as a performance model and tool that supports early co-design of distributed 3D-stacked hardware and software for large language model inference. It builds detailed representations of 3D memory behavior and execution scheduling to let designers evaluate enormous numbers of possible configurations quickly. A sympathetic reader would care because manual or slow simulation of such systems is impractical at current AI scales, and early identification of better hardware choices can reduce later redesign costs. The model maintains accuracy through targeted abstractions while running much faster than prior simulators, and the resulting search shows concrete gains plus new observations about what drives optimal choices.
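
To make that concrete, here is a minimal sketch of the evaluate-and-prune loop early DSE implies: enumerate candidate configurations, score each with a cheap analytic model, and keep the Pareto set. The model, axis names, and values below are invented placeholders, not DeepStack's actual interface.

```python
# Toy DSE loop: score candidates with a cheap analytic model, keep the
# (throughput, energy) Pareto front. All names and numbers are illustrative.
from itertools import product

def analytic_model(cfg):
    """Placeholder performance model; returns (throughput, energy)."""
    dram_layers, tsv_lanes, pes = cfg
    throughput = pes * min(1.0, tsv_lanes * dram_layers / pes)  # bandwidth-bound roofline
    energy = pes + 0.5 * tsv_lanes * dram_layers
    return throughput, energy

def pareto_front(points):
    # Keep cfg unless some other point is strictly better on both axes.
    return [(cfg, (t, e)) for cfg, (t, e) in points
            if not any(t2 > t and e2 < e for _, (t2, e2) in points)]

space = product([4, 8, 12], [64, 128], [256, 512, 1024])  # layers, TSV lanes, PEs
scored = [(cfg, analytic_model(cfg)) for cfg in space]
for cfg, metrics in pareto_front(scored):
    print(cfg, metrics)
```

The point of the sketch is the bottleneck it exposes: at 2.5x10^14 configurations, the loop only closes if the per-point model is orders of magnitude cheaper than simulation, which is exactly the trade DeepStack claims to make.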

Core claim

DeepStack captures fine-grained 3D memory semantics, such as transaction-aware bandwidth, bank activation constraints, buffering limitations, and thermal-power modeling, together with comprehensive parallelization strategies and execution scheduling for distributed LLM inference. Two novel techniques, dual-stage network abstraction and tile-level compute-communication overlap, produce runtimes up to 100,000× faster than state-of-the-art simulators at comparable accuracy, cross-validated against in-house 3D designs, NS-3 (2.12% error), and vLLM on 8×B200 GPUs (12.18% error). The resulting hierarchical search covers 2.5 × 10^14 design points across DRAM layers, vertical connectivity, interconnect, compute-memory allocation, and distributed scheduling.
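
For intuition on what "transaction-aware bandwidth" under a bank-activation cap could look like as an analytic expression, a hedged sketch follows; the parameter names and closed form are illustrative assumptions, not the paper's equations.

```python
# Illustrative effective-bandwidth estimate: row misses pay an activation
# latency, and a power/thermal budget caps how many banks may be open at once.
# Parameter names and the closed form are assumptions, not DeepStack's model.

def effective_bandwidth_gbs(peak_gbs, row_hit_rate, t_rcd_ns, t_burst_ns,
                            max_open_banks, banks_needed):
    avg_transaction_ns = t_burst_ns + (1.0 - row_hit_rate) * t_rcd_ns
    burst_utilization = t_burst_ns / avg_transaction_ns
    bank_utilization = min(1.0, max_open_banks / banks_needed)
    return peak_gbs * burst_utilization * bank_utilization

# Example: a stack with 2 TB/s peak, 50% row hits, and half the needed banks
# allowed open delivers well under half its nominal bandwidth.
print(effective_bandwidth_gbs(2048, row_hit_rate=0.5, t_rcd_ns=14,
                              t_burst_ns=2, max_open_banks=16, banks_needed=32))
```

The gap between peak and effective bandwidth in a toy expression like this is precisely what Figure 13's breakdown is getting at.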

What carries the argument

Dual-stage network abstraction and tile-level compute-communication overlap, which together enable rapid yet accurate simulation of distributed 3D memory and scheduling behavior.
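
A minimal sketch of the analytic shape such an overlap model can take, assuming a pipelined schedule in which each tile's transfer hides behind the previous tile's compute; the paper does not state that DeepStack uses exactly this expression.

```python
# Tile-level compute-communication overlap vs. serial execution. The pipeline
# form (prologue + per-tile max) is a standard analytic shape, assumed here.

def overlapped_time_ns(compute_per_tile, comm_per_tile, num_tiles):
    # First transfer cannot be hidden (prologue); each remaining step is
    # bounded by the slower of compute and communication; last compute drains.
    steady = max(compute_per_tile, comm_per_tile) * (num_tiles - 1)
    return comm_per_tile + steady + compute_per_tile

def serial_time_ns(compute_per_tile, comm_per_tile, num_tiles):
    return (compute_per_tile + comm_per_tile) * num_tiles

# Balanced compute and transfer approaches a 2x win from overlap.
print(overlapped_time_ns(10, 10, 100), "vs", serial_time_ns(10, 10, 100))
```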

If this is right

  • Up to 100,000× faster runtime than existing simulators at comparable accuracy.
  • Practical exploration of a 2.5 × 10^14-point design space covering hardware layers, connectivity, allocation, and scheduling.
  • Up to 9.5× higher throughput from co-optimized parallelism and 3D architecture choices.
  • Batch size creates a more fundamental architectural divide than the prefill versus decode distinction.
  • Parallelism strategy and hardware architecture are tightly coupled, so incomplete schedule search produces permanently suboptimal silicon.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Design teams may need to treat batch-size handling as a primary constraint when laying out future 3D memory stacks.
  • The same modeling approach could be applied to evaluate non-LLM workloads on similar 3D hardware.
  • Early adoption of such tools might reduce the frequency of hardware revisions that later software tuning cannot fix.

Load-bearing premise

The fine-grained 3D memory semantics and dual-stage network abstraction must remain accurate for hardware and workloads beyond the specific in-house designs and validation cases used.

What would settle it

Run DeepStack on a new 3D-stacked prototype not used in its validation, compare its throughput and latency predictions against measured hardware execution, and check whether prediction error stays within the levels reported for the paper's own validation (2.12% against NS-3, 12.18% against vLLM).
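
A sketch of that check, assuming predicted and measured tokens-per-second pairs are available for the prototype; the numbers are placeholders, and the 12.18% vLLM figure is used as the loosest acceptance threshold.

```python
# Compare model predictions against hardware measurements and test whether the
# mean absolute percentage error stays within the paper's reported levels
# (2.12% vs NS-3, 12.18% vs vLLM). All numbers below are placeholders.

def mape(predicted, measured):
    return 100.0 * sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)

predicted_tps = [812.0, 640.5, 1190.0]  # DeepStack outputs (hypothetical)
measured_tps  = [798.0, 655.0, 1240.0]  # prototype runs (hypothetical)

err = mape(predicted_tps, measured_tps)
print(f"MAPE = {err:.2f}% ->", "within" if err <= 12.18 else "outside", "reported levels")
```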

Figures

Figures reproduced from arXiv: 2604.04750 by Guoyu Li, Hao Mark Chen, Hongxiang Fan, Jilong Xue, Lei Wang, Lingxiao Ma, Qianzhou Wang, Shuang Liang, Wayne Luk, Xianqi Zhou, Yu Cheng, Yuxiao Guo, Zhengju Tang, Zhiwen Mo.

Figure 1: 2.5D hardware and 3D-stacked architecture.
Figure 3: The number of distinct parallelisms achieved at a …
Figure 4: Overview of DeepStack DSE Framework. Panels: (a) processing engine view, (b) DRAM stack cluster view, (c) die view, (d) chip view, (e) system overview.
Figure 5: Example of cross-sectional and top view of a 3D-stacked DRAM architecture.
Figure 6: Example of mapping 64-node logical EP all-to-all …
Figure 7: Tile-level compute–communication overlap model …
Figure 10: DeepStack modeling accuracy vs. ASTRA-sim (NS-3 …
Figure 9: DeepStack modeling accuracy on vLLM TP8 and …
Figure 11: Pareto frontiers from DeepStack's DSE across design points. A representative subset is shown for clarity.
Figure 12: UTPS/STPS decoding performance comparison. DeepStack DSE denotes the best configuration searched by our …
Figure 13: Theoretical and effective DRAM bandwidth break…
Figure 14: End-to-end TPS for DeepSeek-V3 and area break…
Figure 15: DSE heatmaps for throughput-optimal and energy-optimal designs across stacked (…
Figure 16: Temperature distribution across the 3D DRAM …
Figure 17: NoC bandwidth and hop latency sensitivity. …
original abstract

Advances in hybrid bonding and packaging have driven growing interest in 3D DRAM-stacked accelerators with higher memory bandwidth and capacity. As LLMs scale to hundreds of billions or trillions of parameters, distributed inference across multiple 3D chips becomes essential. With cross-stack co-design increasingly critical, we propose DeepStack, an accurate and efficient performance model and tool to enable early-stage system-hardware co-design space exploration (DSE) for distributed 3D-stacked AI systems. At the hardware level, DeepStack captures fine-grained 3D memory semantics such as transaction-aware bandwidth, bank activation constraints, buffering limitations, and thermal-power modeling. At the system level, DeepStack incorporates comprehensive parallelization strategies and execution scheduling for distributed LLM inference. With novel modeling techniques such as dual-stage network abstraction and tile-level compute-communication overlap, we achieve up to 100,000x faster runtime over state-of-the-art simulators at comparable accuracy, cross-validated against our in-house 3D designs, NS-3 backend (2.12%), and vLLM serving on 8xB200 GPUs (12.18%). With hierarchical design space search, DeepStack enables efficient exploration over 2.5x10^14 design points spanning 3D-stacked DRAM layers, DRAM vertical connectivity, interconnect, compute-memory allocation, and distributed scheduling. Compared with baseline designs, DeepStack achieves up to 9.5x higher throughput through co-optimized parallelism and 3D architecture search. Our DSE further reveals that batch size drives a more fundamental architectural divide than the prefill/decode distinction, and that parallelism strategy and hardware architecture are tightly coupled -- incomplete schedule search leads to permanently suboptimal silicon irrecoverable by software tuning. We intend to open source DeepStack to support future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces DeepStack, a performance modeling tool for early-stage design space exploration (DSE) of distributed 3D-stacked AI accelerators targeting LLM inference. It incorporates fine-grained hardware models for 3D DRAM (transaction-aware bandwidth, bank activation, buffering, thermal-power) and system-level elements including parallelization strategies and scheduling. Novel techniques include dual-stage network abstraction and tile-level compute-communication overlap, enabling up to 100,000x faster runtime than state-of-the-art simulators at comparable accuracy. Cross-validation is reported against in-house 3D designs, NS-3 (2.12% error), and vLLM on 8xB200 GPUs (12.18% error). Hierarchical search allows exploration of 2.5x10^14 design points across DRAM layers, vertical connectivity, interconnect, compute-memory allocation, and scheduling. Results claim up to 9.5x higher throughput versus baselines via co-optimized parallelism and architecture search, plus insights that batch size drives architectural divides more than prefill/decode and that parallelism and hardware are tightly coupled.

Significance. If the accuracy and generalization claims hold, DeepStack would be a significant contribution to hardware-software co-design for 3D-stacked AI systems, as the reported speedup and scale of DSE (2.5x10^14 points) could substantially accelerate exploration of distributed inference architectures. The cross-validation against independent simulators (NS-3) and real GPU runs (vLLM) plus the intent to open-source the tool are notable strengths that support reproducibility and broader adoption.

major comments (2)
  1. [Abstract and Evaluation] Abstract and Evaluation section: The cross-validation reports average errors of 2.12% (NS-3) and 12.18% (vLLM) but provides no details on the number of design points validated, the distribution of errors across regimes (e.g., DRAM layer counts >4, extreme batch sizes, or novel parallelism strategies), or whether the transaction-aware bandwidth, bank-activation, and thermal models were tested for bias in unvalidated configurations. This is load-bearing for the claim that the model supports accurate ranking over the full 2.5x10^14-point space.
  2. [DSE and Results] DSE and Results sections: The 9.5x throughput improvement and architectural insights (batch size as fundamental divide, tight coupling of parallelism and hardware) rest on the dual-stage network abstraction and tile-level overlap model correctly predicting performance without post-hoc tuning. No explicit evidence is given that these components remain unbiased when extrapolating beyond the specific in-house 3D designs and NS-3/vLLM cases, which risks mis-ranking designs in the co-optimization conclusions.
minor comments (3)
  1. [Abstract] The abstract mentions 'hierarchical design space search' but the manuscript does not clarify the exact partitioning or pruning criteria used to traverse 2.5x10^14 points efficiently (a scale sketch follows this list).
  2. [Modeling] Notation for the dual-stage network abstraction and tile-level overlap model could be more precisely defined with equations or pseudocode to aid reproducibility.
  3. [Figures] Figure captions and legends should explicitly state the error metric (e.g., mean absolute percentage error) and the exact configurations compared in the NS-3 and vLLM validations.
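
On minor comment 1, a back-of-envelope sketch of why staged search with pruning is the natural route to 2.5x10^14 points; the per-axis cardinalities are invented to hit that product and do not come from the paper.

```python
# Why hierarchical search matters at this scale. Axis cardinalities are
# invented for illustration (chosen so the product is ~2.5e14); the paper's
# actual partitioning and pruning criteria are not specified in the abstract.
import math

axes = {
    "dram_layers": 8,
    "vertical_connectivity": 50,
    "interconnect": 100,
    "compute_memory_allocation": 2500,
    "distributed_scheduling": 2_500_000,
}
print(f"flat cross product: {math.prod(axes.values()):.1e} points")  # 2.5e14

# Staged search: sweep one axis at a time, keeping only the top-k partial
# configurations between stages instead of enumerating the full product.
k = 100
staged = sum(k * n for n in axes.values())
print(f"staged evaluations:  {staged:.1e}")  # ~2.5e8, a ~10^6x reduction
```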

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the validation and extrapolation claims. We address each major comment below and will incorporate revisions to improve clarity and rigor.

point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: The cross-validation reports average errors of 2.12% (NS-3) and 12.18% (vLLM) but provides no details on the number of design points validated, the distribution of errors across regimes (e.g., DRAM layer counts >4, extreme batch sizes, or novel parallelism strategies), or whether the transaction-aware bandwidth, bank-activation, and thermal models were tested for bias in unvalidated configurations. This is load-bearing for the claim that the model supports accurate ranking over the full 2.5x10^14-point space.

    Authors: We agree that the manuscript lacks sufficient detail on the validation set composition and error distribution. The reported average errors are based on a set of configurations that include multiple DRAM layer counts, batch sizes, and parallelism strategies, but these specifics are not broken out. In the revised manuscript, we will add a new subsection (or expanded table) in the Evaluation section that reports the exact number of validated design points (approximately 45 for NS-3 and 25 for vLLM), error distributions across regimes including DRAM layers >4 and extreme batch sizes, and a discussion of coverage for the transaction-aware bandwidth, bank-activation, and thermal models. We will also note any observed biases and the fraction of the full DSE space represented by the validated points. revision: yes

  2. Referee: [DSE and Results] DSE and Results sections: The 9.5x throughput improvement and architectural insights (batch size as fundamental divide, tight coupling of parallelism and hardware) rest on the dual-stage network abstraction and tile-level overlap model correctly predicting performance without post-hoc tuning. No explicit evidence is given that these components remain unbiased when extrapolating beyond the specific in-house 3D designs and NS-3/vLLM cases, which risks mis-ranking designs in the co-optimization conclusions.

    Authors: The dual-stage network abstraction and tile-level compute-communication overlap were validated as part of the overall model accuracy against both NS-3 and real vLLM runs on 8xB200 GPUs, and the 9.5x gains and insights emerge directly from the hierarchical search results. We acknowledge that the manuscript does not provide separate bias analysis for these modeling components in extrapolated regimes beyond the validated cases. In the revision, we will add sensitivity analysis and additional cross-checks in the DSE and Results sections demonstrating that the components maintain low error in configurations outside the original validation set (e.g., higher layer counts and novel parallelism). This will better support the reliability of the reported throughput improvements and architectural conclusions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; model grounded in explicit semantics and externally validated

full rationale

The paper's core performance model incorporates fine-grained 3D memory semantics (transaction-aware bandwidth, bank activation, buffering, thermal-power) and dual-stage network abstraction with tile-level overlap. These are presented as direct hardware modeling rather than fitted parameters or self-referential definitions. Accuracy is cross-validated against independent external references (NS-3 at 2.12% error, vLLM on 8xB200 at 12.18% error, plus in-house 3D designs), not against the model's own outputs or fitted subsets of the target DSE data. The 2.5e14-point search and 9.5x throughput gains are downstream applications of the validated model; no equations, self-citations, or ansatzes make the claimed predictions true by construction. This is a standard non-circular engineering modeling paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on standard performance-modeling assumptions plus two novel abstractions introduced in the paper; no explicit free parameters are named in the abstract, but accuracy claims implicitly depend on calibration against the validation workloads.

axioms (2)
  • domain assumption: 3D memory semantics (transaction-aware bandwidth, bank activation constraints, buffering limitations, thermal-power) can be abstracted accurately enough for early DSE.
    Invoked in the hardware-level modeling section of the abstract as the foundation for the claimed accuracy.
  • domain assumption: Distributed LLM inference can be captured by comprehensive parallelization strategies and execution scheduling that interact with the 3D hardware model.
    Stated as part of the system-level modeling contribution.
invented entities (2)
  • dual-stage network abstraction (no independent evidence)
    purpose: to model cross-stack communication at sufficient fidelity while enabling 100,000x speedup.
    Introduced as a novel modeling technique; no independent evidence outside the paper's validation is provided in the abstract.
  • tile-level compute-communication overlap model (no independent evidence)
    purpose: to capture fine-grained overlap between computation and communication in distributed 3D inference.
    Presented as a key novel technique enabling both speed and accuracy; independent evidence limited to the reported error rates.

pith-pipeline@v0.9.0 · 5682 in / 1843 out tokens · 51486 ms · 2026-05-10T18:58:38.136601+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

131 extracted references · 32 canonical work pages · 13 internal anchors
