DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:58 UTC · model grok-4.3
The pith
DeepStack models 3D-stacked AI accelerators to explore 2.5×10^14 design points up to 100,000× faster than state-of-the-art simulators while delivering up to 9.5× higher LLM throughput.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepStack captures fine-grained 3D memory semantics such as transaction-aware bandwidth, bank activation constraints, buffering limitations, and thermal-power modeling, together with comprehensive parallelization strategies and execution scheduling for distributed LLM inference. Novel techniques of dual-stage network abstraction and tile-level compute-communication overlap produce runtimes up to 100,000× faster than state-of-the-art simulators at comparable accuracy, as cross-validated against in-house 3D designs, NS-3, and vLLM. The resulting hierarchical search covers 2.5×10^14 design points across DRAM layers, vertical connectivity, interconnect, compute-memory allocation, and distributed scheduling.
What carries the argument
Dual-stage network abstraction and tile-level compute-communication overlap, which together enable rapid yet accurate simulation of distributed 3D memory and scheduling behavior.
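The review does not reproduce the overlap model's equations; a minimal sketch of how tile-level compute-communication overlap is commonly estimated (all per-tile timings below are hypothetical, not the paper's) illustrates why it compresses the critical path relative to a serial model:

```python
# Sketch: tile-level compute-communication overlap estimate.
# Serial model: each tile computes, then communicates.
# Overlapped model: tile i's communication hides behind tile i+1's
# compute, so steady state is bounded by the slower of the two stages.

def serial_time(tiles):
    """Total latency with no overlap: compute then communicate, per tile."""
    return sum(c + m for c, m in tiles)

def overlapped_time(tiles):
    """Pipelined latency: only the first compute and the last
    communication cannot be hidden; each middle step costs the max of
    the next tile's compute and the current tile's communication."""
    compute = [c for c, _ in tiles]
    comm = [m for _, m in tiles]
    steady = sum(max(compute[i + 1], comm[i]) for i in range(len(tiles) - 1))
    return compute[0] + steady + comm[-1]

# Hypothetical (compute_us, comm_us) per tile.
tiles = [(10.0, 4.0), (10.0, 4.0), (10.0, 4.0), (10.0, 4.0)]
print(serial_time(tiles))      # 56.0
print(overlapped_time(tiles))  # 10 + 3*max(10, 4) + 4 = 44.0
```

When communication per tile stays below compute per tile, as in this toy case, the overlapped estimate approaches the pure-compute bound, which is the behavior such a model needs to capture cheaply.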
If this is right
- Up to 100,000× faster runtime than existing simulators at comparable accuracy.
- Practical exploration of a 2.5×10^14-point design space covering hardware layers, connectivity, allocation, and scheduling.
- Up to 9.5× higher throughput from co-optimized parallelism and 3D architecture choices.
- Batch size creates a more fundamental architectural divide than the prefill versus decode distinction.
- Parallelism strategy and hardware architecture are tightly coupled, so incomplete schedule search produces permanently suboptimal silicon.
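The scale claim above is combinatorial: 2.5×10^14 is a product of per-dimension option counts. A sketch with hypothetical counts (chosen for illustration, not the paper's actual ones) shows why exhaustive evaluation is out of reach and hierarchical search is required:

```python
import math

# Hypothetical option counts per design dimension (not the paper's).
design_space = {
    "dram_layers": 8,                  # e.g. 1-8 stacked DRAM dies
    "vertical_connectivity": 16,       # TSV / hybrid-bond density choices
    "interconnect": 12,                # topology x link-width variants
    "compute_memory_alloc": 10 ** 6,   # compute-vs-memory partitionings
    "distributed_scheduling": 10 ** 5, # parallelism x schedule combinations
}

# Total points is the product over independent dimensions.
total = math.prod(design_space.values())
print(f"{total:.2e} design points")

# Even at 1 ms per evaluated point, exhaustive sequential search needs:
years = total * 1e-3 / (3600 * 24 * 365)
print(f"~{years:.0f} years of sequential evaluation")
```

Hierarchical search makes this tractable by fixing coarse dimensions first and pruning whole subtrees, so only a vanishing fraction of points is ever evaluated in full.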
Where Pith is reading between the lines
- Design teams may need to treat batch-size handling as a primary constraint when laying out future 3D memory stacks.
- The same modeling approach could be applied to evaluate non-LLM workloads on similar 3D hardware.
- Early adoption of such tools might reduce the frequency of hardware revisions that later software tuning cannot fix.
Load-bearing premise
The accuracy of the fine-grained 3D memory semantics and dual-stage network abstraction will hold for hardware and workloads beyond the specific in-house designs and validation cases used.
What would settle it
Run DeepStack on a new 3D-stacked prototype not used in its validation, compare its throughput and latency predictions against measured hardware execution, and check whether the error stays inside the reported 2.12–12.18% range.
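That settling experiment reduces to a mean absolute percentage error (MAPE) check between predicted and measured performance. A minimal sketch, with hypothetical throughput numbers for a prototype outside the validation set:

```python
def mape(predicted, measured):
    """Mean absolute percentage error (in percent) between model
    predictions and hardware measurements."""
    assert len(predicted) == len(measured) and measured
    return 100.0 * sum(abs(p - m) / m
                       for p, m in zip(predicted, measured)) / len(measured)

# Hypothetical throughput (tokens/s): model prediction vs. measured hardware.
predicted = [1050.0, 980.0, 2010.0]
measured = [1000.0, 1000.0, 2000.0]

err = mape(predicted, measured)
within = err <= 12.18  # upper end of the reported 2.12-12.18% error range
print(f"MAPE = {err:.2f}%, within reported range: {within}")
```

For DSE purposes, aggregate error is necessary but not sufficient; the model must also preserve the ranking of candidate designs, which a single MAPE figure does not guarantee.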
Original abstract
Advances in hybrid bonding and packaging have driven growing interest in 3D DRAM-stacked accelerators with higher memory bandwidth and capacity. As LLMs scale to hundreds of billions or trillions of parameters, distributed inference across multiple 3D chips becomes essential. With cross-stack co-design increasingly critical, we propose DeepStack, an accurate and efficient performance model and tool to enable early-stage system-hardware co-design space exploration (DSE) for distributed 3D-stacked AI systems. At the hardware level, DeepStack captures fine-grained 3D memory semantics such as transaction-aware bandwidth, bank activation constraints, buffering limitations, and thermal-power modeling. At the system level, DeepStack incorporates comprehensive parallelization strategies and execution scheduling for distributed LLM inference. With novel modeling techniques such as dual-stage network abstraction and tile-level compute-communication overlap, we achieve up to 100,000x faster runtime over state-of-the-art simulators at comparable accuracy, cross-validated against our in-house 3D designs, NS-3 backend (2.12%), and vLLM serving on 8xB200 GPUs (12.18%). With hierarchical design space search, DeepStack enables efficient exploration over 2.5x10^14 design points spanning 3D-stacked DRAM layers, DRAM vertical connectivity, interconnect, compute-memory allocation, and distributed scheduling. Compared with baseline designs, DeepStack achieves up to 9.5x higher throughput through co-optimized parallelism and 3D architecture search. Our DSE further reveals that batch size drives a more fundamental architectural divide than the prefill/decode distinction, and that parallelism strategy and hardware architecture are tightly coupled -- incomplete schedule search leads to permanently suboptimal silicon irrecoverable by software tuning. We intend to open source DeepStack to support future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeepStack, a performance modeling tool for early-stage design space exploration (DSE) of distributed 3D-stacked AI accelerators targeting LLM inference. It incorporates fine-grained hardware models for 3D DRAM (transaction-aware bandwidth, bank activation, buffering, thermal-power) and system-level elements including parallelization strategies and scheduling. Novel techniques include dual-stage network abstraction and tile-level compute-communication overlap, enabling up to 100,000x faster runtime than state-of-the-art simulators at comparable accuracy. Cross-validation is reported against in-house 3D designs, NS-3 (2.12% error), and vLLM on 8xB200 GPUs (12.18% error). Hierarchical search allows exploration of 2.5x10^14 design points across DRAM layers, vertical connectivity, interconnect, compute-memory allocation, and scheduling. Results claim up to 9.5x higher throughput versus baselines via co-optimized parallelism and architecture search, plus insights that batch size drives architectural divides more than prefill/decode and that parallelism and hardware are tightly coupled.
Significance. If the accuracy and generalization claims hold, DeepStack would be a significant contribution to hardware-software co-design for 3D-stacked AI systems, as the reported speedup and scale of DSE (2.5x10^14 points) could substantially accelerate exploration of distributed inference architectures. The cross-validation against independent simulators (NS-3) and real GPU runs (vLLM) plus the intent to open-source the tool are notable strengths that support reproducibility and broader adoption.
major comments (2)
- [Abstract and Evaluation] Abstract and Evaluation section: The cross-validation reports average errors of 2.12% (NS-3) and 12.18% (vLLM) but provides no details on the number of design points validated, the distribution of errors across regimes (e.g., DRAM layer counts >4, extreme batch sizes, or novel parallelism strategies), or whether the transaction-aware bandwidth, bank-activation, and thermal models were tested for bias in unvalidated configurations. This is load-bearing for the claim that the model supports accurate ranking over the full 2.5x10^14-point space.
- [DSE and Results] DSE and Results sections: The 9.5x throughput improvement and architectural insights (batch size as fundamental divide, tight coupling of parallelism and hardware) rest on the dual-stage network abstraction and tile-level overlap model correctly predicting performance without post-hoc tuning. No explicit evidence is given that these components remain unbiased when extrapolating beyond the specific in-house 3D designs and NS-3/vLLM cases, which risks mis-ranking designs in the co-optimization conclusions.
minor comments (3)
- [Abstract] The abstract mentions 'hierarchical design space search' but the manuscript does not clarify the exact partitioning or pruning criteria used to traverse 2.5x10^14 points efficiently.
- [Modeling] Notation for the dual-stage network abstraction and tile-level overlap model could be more precisely defined with equations or pseudocode to aid reproducibility.
- [Figures] Figure captions and legends should explicitly state the error metric (e.g., mean absolute percentage error) and the exact configurations compared in the NS-3 and vLLM validations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the validation and extrapolation claims. We address each major comment below and will incorporate revisions to improve clarity and rigor.
Point-by-point responses
- Referee: [Abstract and Evaluation] Abstract and Evaluation section: The cross-validation reports average errors of 2.12% (NS-3) and 12.18% (vLLM) but provides no details on the number of design points validated, the distribution of errors across regimes (e.g., DRAM layer counts >4, extreme batch sizes, or novel parallelism strategies), or whether the transaction-aware bandwidth, bank-activation, and thermal models were tested for bias in unvalidated configurations. This is load-bearing for the claim that the model supports accurate ranking over the full 2.5x10^14-point space.
Authors: We agree that the manuscript lacks sufficient detail on the validation set composition and error distribution. The reported average errors are based on a set of configurations that include multiple DRAM layer counts, batch sizes, and parallelism strategies, but these specifics are not broken out. In the revised manuscript, we will add a new subsection (or expanded table) in the Evaluation section that reports the exact number of validated design points (approximately 45 for NS-3 and 25 for vLLM), error distributions across regimes including DRAM layers >4 and extreme batch sizes, and a discussion of coverage for the transaction-aware bandwidth, bank-activation, and thermal models. We will also note any observed biases and the fraction of the full DSE space represented by the validated points. revision: yes
- Referee: [DSE and Results] DSE and Results sections: The 9.5x throughput improvement and architectural insights (batch size as fundamental divide, tight coupling of parallelism and hardware) rest on the dual-stage network abstraction and tile-level overlap model correctly predicting performance without post-hoc tuning. No explicit evidence is given that these components remain unbiased when extrapolating beyond the specific in-house 3D designs and NS-3/vLLM cases, which risks mis-ranking designs in the co-optimization conclusions.
Authors: The dual-stage network abstraction and tile-level compute-communication overlap were validated as part of the overall model accuracy against both NS-3 and real vLLM runs on 8xB200 GPUs, and the 9.5x gains and insights emerge directly from the hierarchical search results. We acknowledge that the manuscript does not provide separate bias analysis for these modeling components in extrapolated regimes beyond the validated cases. In the revision, we will add sensitivity analysis and additional cross-checks in the DSE and Results sections demonstrating that the components maintain low error in configurations outside the original validation set (e.g., higher layer counts and novel parallelism). This will better support the reliability of the reported throughput improvements and architectural conclusions. revision: yes
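The regime-wise error breakdown promised in these responses can be sketched as a group-by over validation records; the regime labels and error values below are hypothetical placeholders, not the paper's data:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical validation records: (regime label, percent error).
records = [
    ("dram_layers<=4", 1.8), ("dram_layers<=4", 2.4),
    ("dram_layers>4", 6.1), ("dram_layers>4", 7.3),
    ("extreme_batch", 11.5), ("extreme_batch", 12.9),
]

# Group errors by regime.
by_regime = defaultdict(list)
for regime, err in records:
    by_regime[regime].append(err)

# Report mean and worst-case error per regime: a flat average can hide
# systematic bias in exactly the regimes the DSE extrapolates into.
for regime, errs in by_regime.items():
    print(f"{regime:>16}: mean {mean(errs):.2f}%  max {max(errs):.2f}%")
```

A table of this shape (one row per regime, with counts, mean, and worst-case error) would directly address the referee's concern about bias in unvalidated configurations.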
Circularity Check
No significant circularity; model grounded in explicit semantics and externally validated
full rationale
The paper's core performance model incorporates fine-grained 3D memory semantics (transaction-aware bandwidth, bank activation, buffering, thermal-power) and dual-stage network abstraction with tile-level overlap. These are presented as direct hardware modeling rather than fitted parameters or self-referential definitions. Accuracy is cross-validated against independent external references (NS-3 at 2.12% error, vLLM on 8xB200 at 12.18% error, plus in-house 3D designs), not against the model's own outputs or fitted subsets of the target DSE data. The 2.5e14-point search and 9.5x throughput gains are downstream applications of the validated model; no equations, self-citations, or ansatzes reduce the claimed predictions or uniqueness to the inputs by construction. This is a standard non-circular engineering modeling paper.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: 3D memory semantics (transaction-aware bandwidth, bank activation constraints, buffering limitations, thermal-power) can be abstracted accurately enough for early DSE.
- Domain assumption: Distributed LLM inference can be captured by comprehensive parallelization strategies and execution scheduling that interact with the 3D hardware model.
invented entities (2)
- dual-stage network abstraction (no independent evidence)
- tile-level compute-communication overlap model (no independent evidence)
Lean theorems connected to this paper
- Cost/FunctionalEquation, Foundation/DimensionForcing, Foundation/AlexanderDuality (reality_from_one_distinction): unclear. The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "With novel modeling techniques such as dual-stage network abstraction and tile-level compute-communication overlap... transaction-aware bandwidth, bank activation constraints, buffering limitations, and thermal-power modeling"
- Foundation/ArithmeticFromLogic, Foundation/BranchSelection (branch_selection): unclear. The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "hierarchical design space search... 2.5×10^14 design points spanning 3D-stacked DRAM layers, DRAM vertical connectivity, interconnect, compute-memory allocation, and distributed scheduling"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.