Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving
Pith reviewed 2026-05-22 15:03 UTC · model grok-4.3
The pith
Sandwich lets CPUs serve LLMs efficiently by switching between prefill and decode plans without interference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining seamless phase-wise plan switching to eliminate cross-phase interference, TopoTree for automated substructure-aware partial core allocation, and fast-start-then-finetune dynamic-shape tensor program generation, Sandwich delivers a full-stack CPU LLM serving system that achieves high performance under non-disaggregated constraints.
What carries the argument
Hot-switching mechanism for seamless phase-wise configuration changes, supported by TopoTree as a tree-based hardware abstraction that enables automated partial core allocation respecting sub-NUMA structures.
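To make the idea of a tree-based hardware abstraction concrete, here is a minimal Python sketch. It is not the paper's TopoTree. It assumes a regular toy machine (2 NUMA nodes, 2 LLC slices per node, 4 cores per slice) and a hypothetical policy that takes the same number of cores from every LLC slice. The names TopoNode, build_machine, and allocate_partial are invented for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopoNode:
    """One level of a hypothetical hardware tree (machine, NUMA node, LLC slice, or core)."""
    kind: str
    core_id: int = -1                      # only meaningful for leaves of kind "core"
    children: List["TopoNode"] = field(default_factory=list)

    def cores(self) -> List[int]:
        """Return the core ids under this node, left to right."""
        if self.kind == "core":
            return [self.core_id]
        out: List[int] = []
        for child in self.children:
            out.extend(child.cores())
        return out


def build_machine(numa_nodes: int, slices_per_node: int, cores_per_slice: int) -> TopoNode:
    """Build a regular toy topology; a real machine would be probed instead."""
    next_id = 0
    root = TopoNode("machine")
    for _ in range(numa_nodes):
        numa = TopoNode("numa")
        for _ in range(slices_per_node):
            llc = TopoNode("llc_slice")
            for _ in range(cores_per_slice):
                llc.children.append(TopoNode("core", core_id=next_id))
                next_id += 1
            numa.children.append(llc)
        root.children.append(numa)
    return root


def allocate_partial(root: TopoNode, cores_per_slice: int) -> List[int]:
    """Pick the first `cores_per_slice` cores from every LLC slice.

    Illustrative policy only: it spreads the allocation evenly across
    slices instead of packing cores into a single slice.
    """
    chosen: List[int] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.kind == "llc_slice":
            chosen.extend(node.cores()[:cores_per_slice])
        else:
            queue.extend(node.children)
    return chosen


machine = build_machine(numa_nodes=2, slices_per_node=2, cores_per_slice=4)
decode_cores = allocate_partial(machine, cores_per_slice=2)   # partial plan
prefill_cores = machine.cores()                               # full-machine plan
print(decode_cores)
print(prefill_cores)
```

In this toy setup the decode plan uses two cores per slice while the prefill plan keeps all sixteen. Which split actually performs best on a given CPU is the kind of question a configuration search would have to answer empirically.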
If this is right
- Average 2.01x end-to-end speedup for LLM serving workloads on CPUs.
- Up to 3.40x reduction in latency for certain serving scenarios.
- Kernel performance matching static compilers at three orders of magnitude lower tuning cost.
- Consistent operation across x86 and ARM CPU platforms without disaggregation.
- Reduced cross-phase resource interference in shared CPU environments.
Where Pith is reading between the lines
- The switching and topology-aware allocation ideas could extend to other accelerators that experience phase-dependent resource conflicts.
- TopoTree-style abstractions might help manage resources in multi-chip or heterogeneous CPU setups beyond LLM serving.
- Lower tuning costs could encourage broader adoption of dynamic-shape optimizations in production inference systems.
Load-bearing premise
Seamless switching between prefill and decode plans incurs negligible overhead, and TopoTree's automated core allocation avoids interference across dynamic workloads and all sub-NUMA topologies without manual tuning.
What would settle it
Direct measurement of plan-switching latency and end-to-end performance under rapidly changing batch sizes and sequence lengths on multi-socket or complex NUMA CPU systems. Unexpected overhead or contention in those measurements would undercut the premise.
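One way such a measurement could be structured is sketched below in Python. It is a generic timing harness under stated assumptions: ToyRuntime and its switch method are hypothetical stand-ins, because this page does not describe Sandwich's switching interface. A real test would replace the body of switch with the actual plan change.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_switch_latency(switch: Callable[[str], None], rounds: int = 200) -> Dict[str, float]:
    """Time repeated prefill/decode switches and summarize in microseconds."""
    samples: List[float] = []
    phases = ("prefill", "decode")
    for i in range(rounds):
        target = phases[i % 2]             # alternate between the two phases
        start = time.perf_counter()
        switch(target)
        samples.append((time.perf_counter() - start) * 1e6)
    samples.sort()
    p99_index = min(len(samples) - 1, int(0.99 * len(samples)))
    return {
        "median_us": statistics.median(samples),
        "p99_us": samples[p99_index],
        "max_us": samples[-1],
    }


class ToyRuntime:
    """Hypothetical stand-in for a serving runtime; it only records the active plan."""

    def __init__(self) -> None:
        self.active_plan = "prefill"
        self.switch_count = 0

    def switch(self, phase: str) -> None:
        # A real system would rebind worker threads to a different core set
        # here, for example via os.sched_setaffinity on Linux.
        self.active_plan = phase
        self.switch_count += 1


runtime = ToyRuntime()
stats = measure_switch_latency(runtime.switch, rounds=200)
print(stats)
```

Reporting the tail (p99, max) alongside the median matters here, since a switch that is cheap on average but occasionally stalls could still hurt latency under bursty load.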
Original abstract
CPUs are critical for LLM serving due to their availability, cost efficiency, and edge applicability. However, efficient CPU serving is hindered by conflicting prefill/decode resource demands under non-disaggregated deployment constraints--existing solutions fail to avoid cross-phase interference, ignore sub-NUMA hardware structures, and deliver suboptimal dynamic-shape kernel performance. We propose Sandwich, a full-stack CPU LLM serving system with three core innovations addressing these challenges: (1) seamless phase-wise plan switching to eliminate cross-phase interference; (2) TopoTree, a tree-based hardware abstraction for automated substructure-aware (e.g., LLC slices) partial core allocation; (3) fast-start-then-finetune dynamic-shape tensor program generation. Across five x86/ARM CPU platforms, Sandwich achieves an average 2.01x end-to-end speedup and up to 3.40x latency reduction over state-of-the-art systems. Its kernels match static compiler performance with three orders of magnitude lower tuning cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Sandwich, a full-stack CPU LLM serving system designed to address conflicting prefill and decode resource demands in non-disaggregated deployments. It proposes three innovations: seamless phase-wise plan switching to eliminate cross-phase interference, TopoTree (a tree-based hardware abstraction for automated sub-NUMA partial core allocation), and a fast-start-then-finetune approach for dynamic-shape tensor program generation. Across five x86/ARM platforms, it reports an average 2.01x end-to-end speedup and up to 3.40x latency reduction versus state-of-the-art systems, while matching static compiler kernel performance at three orders of magnitude lower tuning cost.
Significance. If the performance claims are substantiated with detailed breakdowns, this work would offer a practical advance for CPU-based LLM serving, which remains relevant for cost, availability, and edge use cases. The automated handling of sub-NUMA structures and low-overhead dynamic switching could improve utilization in real workloads, and the reduced tuning cost for dynamic shapes is a clear practical strength.
major comments (2)
- [§5 and abstract] §5 (Evaluation) and abstract: the headline 2.01x average speedup and 3.40x latency reduction rest on the assumption of negligible overhead for phase-wise plan switching plus TopoTree's interference-free partial core allocation. The manuscript provides no isolated micro-benchmarks quantifying switching latency or allocation stability under bursty/varying batch sizes across the five platforms; without these, it is impossible to determine how much of the reported gains is attributable to the proposed mechanisms versus other factors.
- [§3] §3 (TopoTree design): the claim that TopoTree enables automated, interference-free allocation across sub-NUMA topologies without manual tuning is load-bearing for the cross-platform results, yet the evaluation does not include targeted tests (e.g., LLC-slice contention or NUMA-boundary workloads) that would falsify or confirm robustness.
minor comments (2)
- [Abstract] Abstract: add one sentence naming the primary baselines (e.g., vLLM's CPU backend or llama.cpp) so readers can immediately contextualize the 2.01x figure.
- [§5] Figures in §5: ensure error bars or standard deviations are shown for all latency and throughput numbers; the current presentation makes it hard to judge statistical significance of the reported speedups.
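The error-bar request in the last minor comment can be illustrated with a short Python sketch. It assumes each configuration is run several times and uses a normal-approximation 95% interval. The function name summarize_runs is hypothetical, and the latency values are placeholders, not measurements from the paper.

```python
import math
import statistics
from typing import Dict, List


def summarize_runs(latencies_ms: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and a rough 95% interval for the mean."""
    n = len(latencies_ms)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(latencies_ms)
    std = statistics.stdev(latencies_ms)        # sample std (n - 1 denominator)
    half_width = 1.96 * std / math.sqrt(n)      # rough; a t-quantile is better for small n
    return {"mean": mean, "std": std, "ci_low": mean - half_width, "ci_high": mean + half_width}


# Placeholder numbers for illustration only; they are not results from the paper.
baseline = summarize_runs([100.0, 104.0, 98.0, 102.0, 96.0])
candidate = summarize_runs([50.0, 52.0, 49.0, 51.0, 48.0])

speedup = baseline["mean"] / candidate["mean"]
intervals_overlap = candidate["ci_high"] >= baseline["ci_low"]
print(f"speedup={speedup:.2f}x, intervals overlap: {intervals_overlap}")
```

If the two intervals overlapped, a reported speedup could not be distinguished from run-to-run noise, which is the concern behind the comment.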
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the substantiation of our claims without altering the core contributions.
Point-by-point responses
- Referee: [§5 and abstract] §5 (Evaluation) and abstract: the headline 2.01x average speedup and 3.40x latency reduction rest on the assumption of negligible overhead for phase-wise plan switching plus TopoTree's interference-free partial core allocation. The manuscript provides no isolated micro-benchmarks quantifying switching latency or allocation stability under bursty/varying batch sizes across the five platforms; without these, it is impossible to determine how much of the reported gains is attributable to the proposed mechanisms versus other factors.
  Authors: We agree that isolated micro-benchmarks would improve transparency. The reported 2.01x and 3.40x figures are measured in complete end-to-end serving runs that include all phase switches and dynamic allocations under realistic workloads, so overheads are already reflected in the net gains. To directly address the concern, we will add a dedicated subsection and figure in §5 with micro-benchmarks for switching latency and stability under bursty batch sizes on all five platforms.
  Revision: yes
- Referee: [§3] §3 (TopoTree design): the claim that TopoTree enables automated, interference-free allocation across sub-NUMA topologies without manual tuning is load-bearing for the cross-platform results, yet the evaluation does not include targeted tests (e.g., LLC-slice contention or NUMA-boundary workloads) that would falsify or confirm robustness.
  Authors: The consistent speedups across five platforms with heterogeneous sub-NUMA structures already serve as empirical support for TopoTree's automated, interference-free allocation. However, we acknowledge that explicit stress tests would further confirm robustness. We will incorporate additional targeted experiments in the revised §5 evaluating performance under LLC-slice contention and NUMA-boundary workloads.
  Revision: yes
Circularity Check
No circularity; claims rest on direct empirical measurements
Full rationale
The paper is a systems implementation describing Sandwich with three innovations for CPU LLM serving. All headline performance numbers (2.01x average speedup, 3.40x latency reduction, three orders of magnitude lower tuning cost) are presented as results of benchmarking on five real platforms. No mathematical derivations, first-principles predictions, or equations appear in the abstract or described contributions. There are no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claims to their own inputs by construction. The derivation chain is self-contained because it consists of implementation choices validated by external measurement rather than internal redefinition.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Prefill and decode phases have conflicting resource demands that cause interference under non-disaggregated CPU deployment.
- domain assumption: Sub-NUMA hardware structures such as LLC slices affect core allocation performance.
invented entities (1)
- TopoTree: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: TopoTree ... group(n,t,d) inserts L(d)/n new nodes ... remove(n,d) eliminates n right-most children ... sandwich-config algorithm
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: fast-start-then-finetune ... sliding window ... tensor schedule reuse for dynamic-shape GEMM
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.