ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
Pith reviewed 2026-05-10 11:24 UTC · model grok-4.3
The pith
ELMoE-3D jointly scales expert and bit elasticity in MoE models to enable self-speculative decoding that also acts as an expert cache on hybrid-bonding hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ELMoE-3D is a hybrid-bonding-based hardware-software co-designed framework for Mixture-of-Experts models that unifies cache-based acceleration and speculative decoding. By jointly scaling the expert elasticity axis and the bit elasticity axis, it builds Elastic Self-Speculative Decoding that functions simultaneously as an expert cache and a strongly aligned self-draft model. The LSB-augmented bit-sliced architecture exploits redundancy in bit-slice representations to enable bit-nested execution, all accelerated by the high internal bandwidth of hybrid bonding in 3D stacks.
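The bit-nested idea can be illustrated with plain integer arithmetic (a hedged sketch, not the paper's architecture): split each quantized weight into an MSB slice and an LSB slice; a low-precision draft pass consumes only the MSB slices, and the full-precision pass reuses the draft's partial product and adds only the LSB contribution, so the MSB work is never recomputed.

```python
import numpy as np

# Illustrative sketch of bit-nested execution (assumed layout, not the paper's
# implementation): an unsigned 8-bit weight is split into a 4-bit MSB slice
# (draft precision) and a 4-bit LSB slice (augmentation). The full result is
# the draft partial product plus the LSB partial product.

rng = np.random.default_rng(0)
w = rng.integers(0, 256, size=8, dtype=np.int64)   # 8-bit quantized weights
x = rng.integers(-8, 8, size=8, dtype=np.int64)    # toy activations

msb = w >> 4          # top 4 bits: draft slice
lsb = w & 0xF         # bottom 4 bits: LSB augmentation

draft_out = (msb << 4) @ x          # draft pass uses MSB slices only
full_out = draft_out + lsb @ x      # full pass adds only the LSB contribution

assert full_out == w @ x            # nesting is exact: MSB work is reused
```

The nesting is exact because `(w >> 4) << 4` and `w & 0xF` partition the bits of `w`, which is why the same hardware path can serve both the draft and the verification role.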
What carries the argument
Elastic Self-Speculative Decoding (Elastic-SD) formed by jointly scaling the expert elasticity and bit elasticity axes of MoE models to serve simultaneously as expert cache and self-draft model
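To make the mechanism concrete, here is a hedged toy sketch of the control flow only: the draft is the target model itself run at reduced elasticity (fewer experts, MSB bit-slices), so accepted drafts incur no separate draft-model cost. `target_step`, `draft_step`, and the 80% alignment rate are illustrative stand-ins, not the paper's models or numbers, and a real implementation verifies all drafted tokens in one batched target call rather than sequentially.

```python
import random

random.seed(0)
VOCAB = list(range(16))

def target_step(ctx):
    # Stand-in for the full model: all experts, full precision.
    return (sum(ctx) * 31 + 7) % 16

def draft_step(ctx):
    # Stand-in for the elastic draft (top experts, MSB slices): strongly
    # aligned with the target, here modeled as ~80% agreement.
    t = target_step(ctx)
    return t if random.random() < 0.8 else random.choice(VOCAB)

def elastic_sd(ctx, gamma=4, steps=12):
    out = list(ctx)
    while len(out) < len(ctx) + steps:
        drafts = []
        for _ in range(gamma):                 # draft gamma tokens cheaply
            drafts.append(draft_step(out + drafts))
        for d in drafts:                       # verify with the full model
            t = target_step(out)
            if d == t:
                out.append(d)                  # accepted draft token
            else:
                out.append(t)                  # first mismatch: take target token
                break
        else:
            out.append(target_step(out))       # all accepted: one bonus token
    return out[len(ctx):]

print(elastic_sd([1, 2, 3]))
```

Because every emitted token is either a verified draft or the target's own correction, the output matches pure greedy target decoding token for token, which is the lossless-acceleration property speculative decoding relies on.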
If this is right
- Achieves an average 6.6× speedup and 4.4× energy efficiency gain over naive MoE serving on xPU across batch sizes 1-16
- Delivers 2.2× speedup and 1.4× energy efficiency gain over the best-performing prior accelerator baseline
- Unifies cache-based acceleration and speculative decoding to provide overall speedup across all batch sizes
- Maintains model accuracy while eliminating separate overhead for the self-draft model through native bit-nested execution
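One way to put these headline numbers in context is the standard speculative-decoding expectation (textbook arithmetic, not a figure from the paper): with per-token acceptance rate alpha and draft length gamma, each target verification yields (1 − alpha^(gamma+1)) / (1 − alpha) tokens in expectation, which bounds how far Elastic-SD can amortize expert loads per target invocation.

```python
# Back-of-envelope check of speculative-decoding amortization. The acceptance
# rates below are illustrative assumptions, not values reported by the paper.

def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification with draft length gamma."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens(alpha, 4):.2f} tokens per verification")
```

At gamma = 4, raising acceptance from 0.6 to 0.9 nearly doubles the tokens per verification, which is why the strong draft-target alignment claimed for the self-draft is load-bearing for the reported speedups.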
Where Pith is reading between the lines
- The same joint elasticity scaling could be tested on other sparse-activation architectures if they exhibit comparable expert and bit redundancy.
- Hardware platforms with high internal bandwidth but without hybrid bonding might still capture part of the bit-sliced acceleration benefit.
- The approach implies that speculative decoding in MoE need not remain separate from caching mechanisms when structural elasticity is exploited.
Load-bearing premise
The expert and bit elasticity axes of MoE models can be jointly scaled to make the self-draft model function as an expert cache without accuracy loss or extra overhead.
What would settle it
Direct measurement of end-to-end accuracy and total latency when running Elastic-SD versus standard MoE inference on the same 3D-stacked hardware, checking whether accuracy drops or overhead appears.
Original abstract
Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, yet on-premises serving remains fundamentally memory-bound as batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth but leave compute underutilized under MoE's low arithmetic intensity at high batch sizes. Speculative decoding (SD) trades idle compute for fewer target invocations, yet verification must load experts even for rejected tokens, severely limiting its benefit in MoE, especially at low batch sizes. We propose ELMoE-3D, a hybrid-bonding (HB)-based HW-SW co-designed framework that unifies cache-based acceleration and speculative decoding to offer overall speedup across batch sizes. We identify two intrinsic elasticity axes of MoE, expert and bit, and jointly scale them to construct Elastic Self-Speculative Decoding (Elastic-SD), which serves as both an expert cache and a strongly aligned self-draft model accelerated by high HB bandwidth. Our LSB-augmented bit-sliced architecture exploits inherent redundancy in bit-slice representations to natively support bit-nested execution. On our 3D-stacked hardware, ELMoE-3D achieves an average 6.6× speedup and 4.4× energy efficiency gain over naive MoE serving on xPU across batch sizes 1–16, and delivers 2.2× speedup and 1.4× energy efficiency gain over the best-performing prior accelerator baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ELMoE-3D, a hybrid-bonding (HB) hardware-software co-design for on-premises MoE serving. It identifies two intrinsic elasticity axes (expert and bit) and jointly scales them to construct Elastic Self-Speculative Decoding (Elastic-SD). This mechanism is claimed to simultaneously act as an expert cache (via LSB-augmented bit-sliced execution) and a strongly aligned self-draft model for speculative decoding, yielding 6.6× average speedup and 4.4× energy-efficiency gain over naive MoE on xPU, plus 2.2× speedup and 1.4× energy gain over the best prior accelerator baseline, across batch sizes 1–16 with no accuracy loss.
Significance. If the dual-use Elastic-SD construction holds with the reported performance and zero net overhead, the work would meaningfully advance efficient serving of large MoE models on memory-bound 3D-stacked hardware by unifying caching and speculation. The joint exploitation of expert and bit elasticity is a distinctive co-design idea that could generalize to other sparse architectures.
Major comments (2)
- [Abstract and Elastic-SD construction] The headline 6.6× speedup rests on the claim that LSB-augmented bit-sliced execution simultaneously delivers high expert-cache hit rates and preserves the exact logit distribution required for high speculative acceptance rates with no accuracy loss. No quantitative evidence (acceptance-rate curves, cache-hit-rate breakdowns, or logit-distribution comparisons) is supplied to show that these two properties hold jointly at the reported batch sizes 1–16; if either fails, rejected drafts re-incur full expert loads and the net gain collapses.
- [Experimental evaluation] The reported averages (6.6×, 2.2×) are given without per-batch breakdowns, error bars, or explicit descriptions of the baseline implementations, accuracy metrics, and 3D-stacked hardware parameters. This prevents verifying that the gains are robust, especially at low batch sizes where verification overhead is highest.
Minor comments (2)
- [Notation and definitions] Notation for the two elasticity axes is introduced in the abstract but not consistently carried through the text; a single table summarizing the scaling rules for expert and bit dimensions would improve clarity.
- [Architecture diagram] The figure illustrating the bit-sliced LSB-augmentation path would benefit from explicit call-outs showing how the same hardware structures serve both the caching and draft-model roles.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of our Elastic-SD construction and evaluation. We address each major point below and will strengthen the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract and Elastic-SD construction] The headline 6.6× speedup rests on the claim that LSB-augmented bit-sliced execution simultaneously delivers high expert-cache hit rates and preserves the exact logit distribution required for high speculative acceptance rates with no accuracy loss. No quantitative evidence (acceptance-rate curves, cache-hit-rate breakdowns, or logit-distribution comparisons) is supplied to show that these two properties hold jointly at the reported batch sizes 1–16; if either fails, rejected drafts re-incur full expert loads and the net gain collapses.
Authors: We agree that explicit quantitative evidence for the joint cache-hit and acceptance-rate behavior is needed to substantiate the dual-use claim. The current manuscript presents the overall speedups but does not include the requested acceptance-rate curves, per-batch cache-hit breakdowns, or logit-distribution comparisons. In the revised version we will add these analyses (new figures and tables) drawn from our evaluation runs at batch sizes 1–16, confirming that LSB-augmented execution maintains both high hit rates and logit fidelity with no accuracy degradation. Revision: yes
- Referee: [Experimental evaluation] The reported averages (6.6×, 2.2×) are given without per-batch breakdowns, error bars, or explicit descriptions of the baseline implementations, accuracy metrics, and 3D-stacked hardware parameters. This prevents verifying that the gains are robust, especially at low batch sizes where verification overhead is highest.
Authors: We acknowledge the need for greater transparency in the evaluation. The manuscript reports aggregate speedups and energy gains but omits per-batch tables, error bars, detailed baseline configurations, and the precise 3D-stacked hardware parameters used. We will revise the experimental section to include these elements: per-batch speedup and energy results with standard deviations, explicit descriptions of all baselines (including their implementations), the accuracy metrics employed, and the relevant 3D hardware specifications. Revision: yes
Circularity Check
No circularity: empirical hardware claims with no derivation chain
Full rationale
The paper presents a hardware-software co-design for MoE serving on 3D-stacked hardware, reporting measured speedups and energy gains from Elastic-SD. No equations, fitted parameters, self-citations as load-bearing premises, or renamings of known results appear in the abstract or description. The central claims rest on experimental results across batch sizes rather than any prediction that reduces to its own inputs by construction. The identification of elasticity axes is presented as an architectural observation, not a self-referential definition.