Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving

Chuan Wu; Jiuru Li; Juntao Zhao

arxiv: 2507.18454 · v2 · submitted 2025-05-19 · cs.AR · cs.AI · cs.DC · cs.PL

Pith reviewed 2026-05-22 15:03 UTC · model grok-4.3

keywords: CPU LLM serving · phase-wise plan switching · hardware topology abstraction · dynamic-shape kernels · non-disaggregated deployment · tensor program generation · configuration search

The pith

Sandwich lets CPUs serve LLMs efficiently by switching between prefill and decode plans without interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CPUs remain essential for running large language models because they are widely available, cost-effective, and suitable for edge use. The core difficulty is that prefill and decode phases demand conflicting resources, and existing approaches either create interference or require hardware disaggregation. Sandwich addresses this through seamless phase-wise plan switching, a tree-based hardware model called TopoTree that automatically allocates partial cores while respecting substructures such as LLC slices, and a fast-start-then-finetune method for generating dynamic-shape tensor programs. These elements together produce an average 2.01x end-to-end speedup and up to 3.40x latency reduction across five x86 and ARM platforms. The resulting kernels reach the performance of static compilers while requiring three orders of magnitude less tuning effort.

Core claim

By combining seamless phase-wise plan switching to eliminate cross-phase interference, TopoTree for automated substructure-aware partial core allocation, and fast-start-then-finetune dynamic-shape tensor program generation, Sandwich delivers a full-stack CPU LLM serving system that achieves high performance under non-disaggregated constraints.

What carries the argument

Hot-switching mechanism for seamless phase-wise configuration changes, supported by TopoTree as a tree-based hardware abstraction that enables automated partial core allocation respecting sub-NUMA structures.
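
As a rough illustration of substructure-aware partial core allocation, the sketch below models a machine as NUMA nodes containing LLC slices containing cores, then picks the same number of cores from every slice. The topology shape, the helper names `build_topology` and `allocate_per_slice`, and the idea of giving decode fewer cores than prefill are all assumptions made for this example; none of it is Sandwich's actual TopoTree API.

```python
from typing import Dict, List

# Hypothetical topology: NUMA node -> LLC slice -> core IDs.
# The shape and sizes are illustrative, not taken from the paper.
Topology = Dict[str, Dict[str, List[int]]]


def build_topology(numa_nodes: int, slices_per_node: int, cores_per_slice: int) -> Topology:
    """Build a regular topology with consecutively numbered cores."""
    topo: Topology = {}
    core_id = 0
    for n in range(numa_nodes):
        slices: Dict[str, List[int]] = {}
        for s in range(slices_per_node):
            slices[f"llc{s}"] = list(range(core_id, core_id + cores_per_slice))
            core_id += cores_per_slice
        topo[f"numa{n}"] = slices
    return topo


def allocate_per_slice(topo: Topology, cores_per_slice: int) -> List[int]:
    """Pick the same number of cores from every LLC slice.

    Spreading a partial allocation evenly keeps every slice in use,
    instead of filling one slice and leaving another idle.
    """
    chosen: List[int] = []
    for slices in topo.values():
        for cores in slices.values():
            if cores_per_slice > len(cores):
                raise ValueError("slice has fewer cores than requested")
            chosen.extend(cores[:cores_per_slice])
    return chosen


topo = build_topology(numa_nodes=2, slices_per_node=2, cores_per_slice=4)
prefill_cores = allocate_per_slice(topo, 4)  # assumed plan: every core
decode_cores = allocate_per_slice(topo, 2)   # assumed plan: half of each slice
print(prefill_cores)
print(decode_cores)
```

A per-phase plan in this toy model is just a core list; switching phases would mean moving worker threads from one list to the other.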

If this is right

Average 2.01x end-to-end speedup for LLM serving workloads on CPUs.
Up to 3.40x reduction in latency for certain serving scenarios.
Kernel performance matching static compilers at three orders of magnitude lower tuning cost.
Consistent operation across x86 and ARM CPU platforms without disaggregation.
Reduced cross-phase resource interference in shared CPU environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The switching and topology-aware allocation ideas could extend to other accelerators that experience phase-dependent resource conflicts.
TopoTree-style abstractions might help manage resources in multi-chip or heterogeneous CPU setups beyond LLM serving.
Lower tuning costs could encourage broader adoption of dynamic-shape optimizations in production inference systems.

Load-bearing premise

Seamless switching between prefill and decode plans incurs negligible overhead, and TopoTree's automated core allocation avoids interference across dynamic workloads and all sub-NUMA topologies without manual tuning.

What would settle it

Direct measurements of plan-switching latency, and of performance under rapidly changing batch sizes and sequence lengths, on multi-socket or complex NUMA systems. Either could expose unexpected overhead or contention.
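
One narrow piece of such a measurement can be sketched in a few lines. The example below times how long it takes to change the calling process's CPU affinity between two core sets on Linux. It assumes, purely for illustration, that a plan switch involves an affinity change; the page does not describe how Sandwich implements switching, and the function name and the two plans are hypothetical. It captures only the system-call cost, not cache warm-up or thread-pool reconfiguration.

```python
import os
import statistics
import time
from typing import List, Set


def measure_switch_latency(plan_a: Set[int], plan_b: Set[int], trials: int = 100) -> List[float]:
    """Time repeated affinity switches between two core sets, in seconds."""
    original = os.sched_getaffinity(0)
    samples: List[float] = []
    try:
        for i in range(trials):
            target = plan_b if i % 2 == 0 else plan_a
            start = time.perf_counter()
            os.sched_setaffinity(0, target)  # 0 means the calling process
            samples.append(time.perf_counter() - start)
    finally:
        # Always restore the affinity mask the process started with.
        os.sched_setaffinity(0, original)
    return samples


samples: List[float] = []
if hasattr(os, "sched_setaffinity"):  # available on Linux only
    available = sorted(os.sched_getaffinity(0))
    plan_prefill = set(available)  # hypothetical plan: all usable cores
    plan_decode = set(available[: max(1, len(available) // 2)])  # hypothetical: half
    samples = measure_switch_latency(plan_prefill, plan_decode, trials=50)
    print(f"median switch: {statistics.median(samples) * 1e6:.1f} us")
else:
    print("sched_setaffinity is not available on this platform")
```

A fuller study would also record time-per-output-token immediately after each switch, since contention effects would show up there rather than in the system call itself.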

Figures

Figures reproduced from arXiv: 2507.18454 by Chuan Wu, Jiuru Li, Juntao Zhao.

**Figure 3.** (a) Kunpeng: i) NUMA: using all the NUMA nodes; ii)

**Figure 4.** Compile-time workflow of Sandwich prefill, and a remove transform to discover all core utilization plans with some cores removed for decode. Both results are given to TopoTree interpretation, where sandwich-config algorithm generates the spectrum of service configurations from a TopoTree. In the tensor program generation, sandwich-kernel algorithm generates a collection of computational slices and their p…

**Figure 5.** Example TopoTree and its transformations on Kunpeng920.

**Figure 6.** Example TopoTree Interpretation

**Figure 7.** Comparison of kernel execution speed-up among Sandwich and vendor solutions.

**Figure 8.** Kernel generation comparison among Sandwich

**Figure 9.** Comparison of SLO attainment percentage under different SLO scales.

**Figure 10.** Comparison of TTFT in single sequence serving.

**Figure 11.** Comparison of output token generation throughput for single sequence serving.

**Figure 12.** SLO and Goodput comparison for batched serving.

**Figure 13.** Request latency distribution for Llama3-8B, batch

**Figure 14.** The mean TTFT and TPOT of xFasterTransformers

Original abstract

CPUs are critical for LLM serving due to their availability, cost efficiency, and edge applicability. However, efficient CPU serving is hindered by conflicting prefill/decode resource demands under non-disaggregated deployment constraints--existing solutions fail to avoid cross-phase interference, ignore sub-NUMA hardware structures, and deliver suboptimal dynamic-shape kernel performance. We propose Sandwich, a full-stack CPU LLM serving system with three core innovations addressing these challenges: (1) seamless phase-wise plan switching to eliminate cross-phase interference; (2) TopoTree, a tree-based hardware abstraction for automated substructure-aware (e.g., LLC slices) partial core allocation; (3) fast-start-then-finetune dynamic-shape tensor program generation. Across five x86/ARM CPU platforms, Sandwich achieves an average 2.01x end-to-end speedup and up to 3.40x latency reduction over state-of-the-art systems. Its kernels match static compiler performance with three orders of magnitude lower tuning cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Sandwich combines hot phase switching with a sub-NUMA topology tree and quick dynamic kernel generation to cut interference in CPU LLM serving, delivering reported speedups that look useful if the overheads check out.

The main thing here is that Sandwich tackles the prefill/decode resource clash on CPUs through seamless plan switching, a TopoTree abstraction for automated partial core allocation that respects things like LLC slices, and a fast-start-then-finetune method for dynamic-shape kernels. These pieces together aim to make non-disaggregated serving more efficient on everyday hardware without heavy manual tuning per platform.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Sandwich, a full-stack CPU LLM serving system designed to address conflicting prefill and decode resource demands in non-disaggregated deployments. It proposes three innovations: seamless phase-wise plan switching to eliminate cross-phase interference, TopoTree (a tree-based hardware abstraction for automated sub-NUMA partial core allocation), and a fast-start-then-finetune approach for dynamic-shape tensor program generation. Across five x86/ARM platforms, it reports an average 2.01x end-to-end speedup and up to 3.40x latency reduction versus state-of-the-art systems, while matching static compiler kernel performance at three orders of magnitude lower tuning cost.

Significance. If the performance claims are substantiated with detailed breakdowns, this work would offer a practical advance for CPU-based LLM serving, which remains relevant for cost, availability, and edge use cases. The automated handling of sub-NUMA structures and low-overhead dynamic switching could improve utilization in real workloads, and the reduced tuning cost for dynamic shapes is a clear practical strength.

major comments (2)

[§5 (Evaluation) and abstract] The headline 2.01x average speedup and 3.40x latency reduction rest on the assumption of negligible overhead for phase-wise plan switching plus TopoTree's interference-free partial core allocation. The manuscript provides no isolated micro-benchmarks quantifying switching latency or allocation stability under bursty or varying batch sizes across the five platforms; without these, it is impossible to determine how much of the reported gain is attributable to the proposed mechanisms rather than to other factors.
[§3 (TopoTree design)] The claim that TopoTree enables automated, interference-free allocation across sub-NUMA topologies without manual tuning is load-bearing for the cross-platform results, yet the evaluation does not include targeted tests (e.g., LLC-slice contention or NUMA-boundary workloads) that would falsify or confirm robustness.

minor comments (2)

[Abstract] Add one sentence naming the primary baselines (e.g., xFasterTransformer, which appears in Figure 14) so readers can immediately contextualize the 2.01x figure.
[§5, figures] Ensure error bars or standard deviations are shown for all latency and throughput numbers; the current presentation makes it hard to judge the statistical significance of the reported speedups.
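
To make the error-bar request concrete, here is a minimal sketch of summarizing repeated latency runs as a mean, a sample standard deviation, and an approximate 95% interval half-width. The function name `summarize` and the sample values are placeholders invented for this example, not measurements from the paper, and the 1.96 factor assumes a normal approximation (a t-distribution would be more appropriate for so few runs).

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(samples: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, approximate 95% half-width)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples for a standard deviation")
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample (n - 1) standard deviation
    half_width = 1.96 * stdev / math.sqrt(len(samples))  # normal approximation
    return mean, stdev, half_width


# Placeholder latencies in milliseconds; NOT measurements from the paper.
baseline_ms = [120.0, 118.0, 125.0, 121.0, 119.0]
candidate_ms = [61.0, 59.0, 63.0, 60.0, 62.0]

base_mean, base_sd, base_hw = summarize(baseline_ms)
cand_mean, cand_sd, cand_hw = summarize(candidate_ms)

print(f"baseline:  {base_mean:.1f} ms +/- {base_hw:.1f} (sd {base_sd:.1f})")
print(f"candidate: {cand_mean:.1f} ms +/- {cand_hw:.1f} (sd {cand_sd:.1f})")
print(f"speedup of means: {base_mean / cand_mean:.2f}x")
```

Reporting the interval next to each bar lets a reader see at a glance whether two systems' latencies overlap.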

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the substantiation of our claims without altering the core contributions.

Referee: [§5 (Evaluation) and abstract] The headline 2.01x average speedup and 3.40x latency reduction rest on the assumption of negligible overhead for phase-wise plan switching plus TopoTree's interference-free partial core allocation. The manuscript provides no isolated micro-benchmarks quantifying switching latency or allocation stability under bursty or varying batch sizes across the five platforms; without these, it is impossible to determine how much of the reported gain is attributable to the proposed mechanisms rather than to other factors.

Authors: We agree that isolated micro-benchmarks would improve transparency. The reported 2.01x and 3.40x figures are measured in complete end-to-end serving runs that include all phase switches and dynamic allocations under realistic workloads, so overheads are already reflected in the net gains. To directly address the concern, we will add a dedicated subsection and figure in §5 with micro-benchmarks for switching latency and stability under bursty batch sizes on all five platforms.

Revision: yes

Referee: [§3 (TopoTree design)] The claim that TopoTree enables automated, interference-free allocation across sub-NUMA topologies without manual tuning is load-bearing for the cross-platform results, yet the evaluation does not include targeted tests (e.g., LLC-slice contention or NUMA-boundary workloads) that would falsify or confirm robustness.

Authors: The consistent speedups across five platforms with heterogeneous sub-NUMA structures already serve as empirical support for TopoTree's automated, interference-free allocation. However, we acknowledge that explicit stress tests would further confirm robustness. We will incorporate additional targeted experiments in the revised §5 evaluating performance under LLC-slice contention and NUMA-boundary workloads.

Revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on direct empirical measurements

The paper is a systems implementation describing Sandwich with three innovations for CPU LLM serving. All headline performance numbers (2.01x average speedup, 3.40x latency reduction, three orders of magnitude lower tuning cost) are presented as results of benchmarking on five real platforms. No mathematical derivations, first-principles predictions, or equations appear in the abstract or described contributions. There are no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claims to their own inputs by construction. The derivation chain is self-contained because it consists of implementation choices validated by external measurement rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on empirical demonstration of three new system components whose benefits are measured rather than derived from first principles or external benchmarks.

axioms (2)

domain assumption: Prefill and decode phases have conflicting resource demands that cause interference under non-disaggregated CPU deployment.
Invoked to motivate the need for phase-wise plan switching.
domain assumption: Sub-NUMA hardware structures such as LLC slices affect core allocation performance.
Basis for introducing TopoTree partial core allocation.

invented entities (1)

TopoTree (no independent evidence)
purpose: Tree-based hardware abstraction for automated substructure-aware partial core allocation
Newly introduced to handle sub-NUMA structures automatically.

pith-pipeline@v0.9.0 · 5706 in / 1444 out tokens · 44194 ms · 2026-05-22T15:03:17.816421+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: TopoTree ... group(n,t,d) inserts L(d)/n new nodes ... remove(n,d) eliminates n right-most children ... sandwich-config algorithm

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: fast-start-then-finetune ... sliding window ... tensor schedule reuse for dynamic-shape GEMM
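
The `group` and `remove` transforms named in the TopoTree passage above are quoted only in fragments, so the sketch below is one possible reading rather than the paper's definition. It assumes that `group(n, t, d)` inserts new nodes of type `t` above the nodes at depth `d`, each adopting `n` consecutive siblings (so L(d)/n nodes are inserted, where L(d) is the node count at depth `d`), and that `remove(n, d)` drops the `n` right-most children of every node at depth `d`. The `Node` class and `nodes_at_depth` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str
    children: List["Node"] = field(default_factory=list)


def nodes_at_depth(root: Node, depth: int) -> List[Node]:
    """Return all nodes at the given depth, left to right (root is depth 0)."""
    level = [root]
    for _ in range(depth):
        level = [child for node in level for child in node.children]
    return level


def group(root: Node, n: int, kind: str, depth: int) -> None:
    """Assumed reading: wrap every n consecutive siblings at `depth` in a new node."""
    if depth < 1:
        raise ValueError("cannot group the root")
    for parent in nodes_at_depth(root, depth - 1):
        kids = parent.children
        if len(kids) % n != 0:
            raise ValueError("children do not divide evenly into groups of n")
        parent.children = [Node(kind, kids[i:i + n]) for i in range(0, len(kids), n)]


def remove(root: Node, n: int, depth: int) -> None:
    """Assumed reading: drop the n right-most children of each node at `depth`."""
    for node in nodes_at_depth(root, depth):
        if n > len(node.children):
            raise ValueError("cannot remove more children than exist")
        node.children = node.children[: len(node.children) - n]


# Illustrative machine: 2 NUMA nodes with 8 cores each (not a real platform).
root = Node("machine", [Node("numa", [Node("core") for _ in range(8)]) for _ in range(2)])

group(root, n=4, kind="llc", depth=2)  # 16 cores -> 4 inserted "llc" nodes
llc_nodes = nodes_at_depth(root, 2)
cores_before = len(nodes_at_depth(root, 3))

remove(root, n=2, depth=2)             # each "llc" node loses 2 cores
cores_after = len(nodes_at_depth(root, 3))
print(len(llc_nodes), cores_before, cores_after)
```

Under this reading, a decode plan with fewer cores corresponds to a tree obtained by applying `remove` to the prefill tree.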

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 3 internal anchors

[1]

AMD. n.d.. Server Processor Specifications

work page
[2]

Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale. InProceedings of the International Conference on High Performance Computing, Networking, ...

work page 2022
[3]

ARM. 2025. Arm big.LITTLE. https://www.arm.com/technologies/big-little

work page 2025
[4]

Berger, Kathryn S

Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson

work page
[5]

In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems

Hoard: a scalable memory allocator for multithreaded applications. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems. Association for Computing Machinery, New York, NY, USA, 117–128

work page
[6]

L Susan Blackford, Antoine Petitet, Roldan Pozo, Karin Remington, R Clint Whaley, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, et al. 2002. An updated set of basic linear algebra subprograms (BLAS).ACM Trans. Math. Software28, 2 (2002), 135–151

work page 2002
[7]

Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krish- namurthy. 2023. Punica: Multi-Tenant LoRA Serving

work page 2023
[8]

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, USA, 579–594

work page 2018
[9]

Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2019. Learning to Optimize Tensor Programs

work page 2019
[10]

CPU-World. 2024. Intel Xeon Platinum 8272CL specifications

work page 2024
[11]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

work page 2022
[12]

Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. 2013. Traffic manage- ment: a holistic approach to memory placement on NUMA systems. InProceedings of the Eighteenth International Conference on Architectural Support for Program- ming Languages and Operating Systems(Houston, Te...

work page 2013
[13]

Yaoyao Ding, Cody Hao Yu, Bojian Zheng, Yizhi Liu, Yida Wang, and Gennady Pekhimenko. 2023. Hidet: Task-Mapping Programming Paradigm for Deep Learn- ing Tensor Programs. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(Vancouver, BC, Canada)(ASPLOS 2023). Association ...

work page 2023
[14]

Jiangsu Du, Jinhui Wei, Jiazhi Jiang, Shenggan Cheng, Dan Huang, Zhiguang Chen, and Yutong Lu. 2024. Liger: Interleaving Intra- and Inter-Operator Par- allelism for Distributed Large Model Inference. InProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Edinburgh, United Kingdom)(PPoPP ’24). Association...

work page 2024
[15]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv:2407.21783 [cs.AI]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou. 2021. TurboTransformers: an efficient GPU serving system for transformer models. InProceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Program- ming(Virtual Event, Republic of Korea)(PPoPP ’21). Association for Computing Machinery, New York, NY, USA, 389–402

work page 2021
[17]

Knight, Mike Houston, Mattan Erez, Daniel Reiter Horn, Larkhoon Leem, Ji Young Park, Manman Ren, Alex Aiken, William J

Kayvon Fatahalian, Timothy J. Knight, Mike Houston, Mattan Erez, Daniel Reiter Horn, Larkhoon Leem, Ji Young Park, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. 2006. Sequoia: Programming the Memory Hierarchy. InSC ’06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. Association for Computing Machinery, New York, NY, USA, 4–4

work page 2006
[18]

Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, and Tianqi Chen. 2022. TensorIR: An Abstraction for Automatic Tensorized Program Optimization

work page 2022
[19]

Xiao Fu, Weiling Yang, Dezun Dong, and Xing Su. 2024. Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs. InProceedings of the 38th ACM International Conference on Supercomputing(Kyoto, Japan)(ICS ’24). Association for Computing Machinery, New York, NY, USA, 137–149. https://doi.org/10. 1145/3650200.3656620

work page arXiv 2024
[20]

Gallivan, William Jalby, Ulrike Meier, and Ahmed H

Kyle A. Gallivan, William Jalby, Ulrike Meier, and Ahmed H. Sameh. 1988. Impact of Hierarchical Memory Systems On Linear Algebra Algorithm Design.Inter- national Journal of High Performance Computing Applications2 (1988), 12 – 48. https://api.semanticscholar.org/CorpusID:62189292

work page 1988
[21]

Georgi Gerganov and contributors. 2023. llama.cpp: LLaMA inference in C/C++

work page 2023
[22]

Millad Ghane, Sunita Chandrasekaran, and Margaret S. Cheung. 2019. Gecko: Hierarchical Distributed View of Heterogeneous Shared Memory Architectures. InProceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores(Washington, DC, USA)(PMAM’19). Association for Computing Machinery, New York, NY, USA, 21–30

work page 2019
[23]

Graham, Donald E

Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. 1994.Concrete Mathe- matics: A Foundation for Computer Science. Addison-Wesley Longman Publishing Co., Inc., USA

work page 1994
[24]

Gunnels, Greg M

John A. Gunnels, Greg M. Henry, and Robert A. Geijn. 2001. A Family of High- Performance Matrix Multiplication Algorithms. InInternational Conference on Conceptual Structures. https://api.semanticscholar.org/CorpusID:442764

work page 2001
[25]

Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang. 2024. Characterization of Large Language Model Development in the Datacenter. In21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). USENIX Association, Santa Clara, CA, 709–729

work page 2024
[26]

Huawei Noah. Year. Bolt: A Deep Learning Library with High Performance and Heterogeneous Flexibility. https://github.com/huawei-noah/bolt. Accessed on Date

work page
[27]

Intel. 2023. xFasterTransformer

work page 2023
[28]

Intel Corporation. 2024. Intel Math Kernel Library

work page 2024
[29]

Intel Corporation. 2024. Intel OpenVINO Toolkit

work page 2024
[30]

Intel Corporation. n.d.. Intel ® Xeon® Gold 6230 Processor (27.5M Cache, 2.10 GHz) - Specifications

work page
[31]

Jacobson

V. Jacobson. 1988. Congestion avoidance and control. InSymposium Proceed- ings on Communications Architectures and Protocols(Stanford, California, USA) (SIGCOMM ’88). Association for Computing Machinery, New York, NY, USA, 314–329

work page 1988
[32]

Kiseok Jeon, Junghee Lee, Bumsoo Kim, and James J. Kim. 2023. Hardware Accelerated Reusable Merkle Tree Generation for Bitcoin Blockchain Headers. IEEE Computer Architecture Letters22, 2 (2023), 69–72

work page 2023
[35]

Jiazhi Jiang, Jiangsu Du, Dan Huang, Dongsheng Li, Jiang Zheng, and Yutong Lu

work page
[36]

InProceedings of the 51st International Conference on Parallel Processing (Bordeaux, France)(ICPP ’22)

Characterizing and Optimizing Transformer Inference on ARM Many-core Processor. InProceedings of the 51st International Conference on Parallel Processing (Bordeaux, France)(ICPP ’22). Association for Computing Machinery, New York, NY, USA, Article 20, 11 pages

work page
[37]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Opti- mization

work page 2017
[38]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles(Koblenz, Germany)(SOSP ’23). Association for Computing Machinery, New York, N...

work page 2023
[39]

Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. Ring Attention with Blockwise Transformers for Near-Infinite Context

work page 2023
[40]

Zoltan Majo and Thomas R. Gross. 2011. Memory system performance in a NUMA multicore multiprocessor. InProceedings of the 4th Annual International Conference on Systems and Storage. Association for Computing Machinery, New York, NY, USA, 10 pages. https://doi.org/10.1145/1987816.1987832

work page doi:10.1145/1987816.1987832 2011
[41]

Mathur and S

Kapil K. Mathur and S. Lennart Johnsson. 1994. Multiplication of Matrices of Arbitrary Shape on a Data Parallel Computer.Parallel Comput.20 (1994), 919–951. https://api.semanticscholar.org/CorpusID:16487869

work page 1994
[42]

Seonjin Na, Geonhwa Jeong, Byung Hoon Ahn, Jeffrey Young, Tushar Krishna, and Hyesoon Kim. 2024. Understanding Performance Implications of LLM Infer- ence on CPUs. In2024 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 169–180

work page 2024
[43]

NVIDIA. 2023. FasterTransformer

work page 2023
[44]

oneDNN Contributors. 2024. oneAPI Deep Neural Network Library (oneDNN). https://github.com/oneapi-src/oneDNN

work page 2024
[45]

Open MPI. 2023. hwloc. https://github.com/open-mpi/hwloc

work page 2023
[46]

OpenAI. 2023. ChatGPT: Optimizing Language Models for Dialogue

work page 2023
[47]

OpenMP Architecture Review Board. 2008. OpenMP Application Program Inter- face Version 3.0. http://www.openmp.org/mp-documents/spec30.pdf

work page 2008
[48]

Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chan- dra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2024. Queue Management for SLO-Oriented Large Language Model Serving. InProceed- ings of the 2024 ACM Symposium on Cloud Computing(Redmond, WA, USA) (SoCC ’24). Association for Computing Machinery, New York, NY, USA, 1...

work page doi:10.1145/3698038.3698523 2024
[49]

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2024. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving.ArXivabs/2407.00079 (2024). 12 Sandwich: Separating Prefill-Decode Compilation for Efficient CPU LLM Serving Conference’17, July 2017, Washington, DC, USA

work page arXiv 2024
[50]

Hongliang Qu and Zhibin Yu. 2024. WASP: Workload-Aware Self-Replicating Page-Tables for NUMA Servers. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24). Association for Computing Machinery, New York, NY, USA, 1233–1249

work page 2024
[51]

Sudarsanan Rajasekaran, Manya Ghobadi, and Aditya Akella. 2023. CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters. arXiv:2308.00852 [cs.NI]

work page arXiv 2023
[52]

RyokoAI. 2021. ShareGPT52K

work page 2021
[53]

Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, and Hengyu Meng. 2023. Efficient LLM Inference on CPUs. arXiv:2311.00502 [cs.LG]

work page arXiv 2023
[54]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.ArXivabs/1909.08053 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[55]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu

work page
[56]

RoFormer: Enhanced Transformer with Rotary Position Embedding

work page
[57]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guil- laume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

WikiChip. n.d.. Kunpeng 920-4826 - HiSilicon. https://en.wikichip.org/wiki/hisilicon/kunpeng/920-4826

work page
[59]

WikiChip. n.d.. TaiShan v110 - Microarchitectures - HiSilicon. https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110. Accessed on September 6, 2025

work page 2025
[60]

Wikipedia contributors. 2024. List of Intel Xeon processors (Skylake-based) — Wikipedia, The Free Encyclopedia

work page 2024
[61]

XNNPack Contributors. 2024. XNNPack

work page 2024
[62]

Yonghong Yan, Jisheng Zhao, Yi Guo, and Vivek Sarkar. 2009. Hierarchical place trees: a portable abstraction for task parallelism and data movement. In Proceedings of the 22nd International Conference on Languages and Compilers for Parallel Computing(Newark, DE)(LCPC’09). Springer-Verlag, Berlin, Heidelberg, 172–187

work page 2009
[63]

Feng Yu, Guangli Li, Jiacheng Zhao, Huimin Cui, Xiaobing Feng, and Jingling Xue

work page
[64]

InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(La Jolla, CA, USA)(ASPLOS ’24)

Optimizing Dynamic-Shape Neural Networks on Accelerators via On-the- Fly Micro-Kernel Polymerization. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(La Jolla, CA, USA)(ASPLOS ’24). Association for Computing Machinery, New York, NY, USA, 797–812

work page
[65]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung- Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 521–538

work page 2022
[66]

Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, and Anshumali Shrivastava. 2024. NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention. arXiv:2403.01273 [cs.LG]

work page arXiv 2024
[67]

Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. LLM- PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

work page 2024
[68]

Bojian Zheng, Ziheng Jiang, Cody Hao Yu, Haichen Shen, Joshua Fromm, Yizhi Liu, Yida Wang, Luis Ceze, Tianqi Chen, and Gennady Pekhimenko. 2022. Di- etCode: Automatic Optimization for Dynamic Tensor Programs. InProceedings of Machine Learning and Systems, Vol. 4. Conference on Machine Learning and Systems, USA, 848–863

work page 2022
[69]

Xing, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhang- hao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. 2024. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset. arXiv:2309.11998 [cs.CL]

work page arXiv 2024
[70]

Gonzalez, and Ion Stoica

Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: generating high-performance tensor programs for deep learning. InProceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI’20). USENIX Associatio...

work page 2020
[71]

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX ...

[72]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

[73]

Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. 2022. ROLLER: Fast and Efficient Tensor Compilation for Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, ...

[1]

AMD. n.d. Server Processor Specifications

[2]

Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale. In Proceedings of the International Conference on High Performance Computing, Networking, ...

[3]

ARM. 2025. Arm big.LITTLE. https://www.arm.com/technologies/big-little

[4]

Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard: a scalable memory allocator for multithreaded applications. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems. Association for Computing Machinery, New York, NY, USA, 117–128

[6]

L Susan Blackford, Antoine Petitet, Roldan Pozo, Karin Remington, R Clint Whaley, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, et al. 2002. An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Software 28, 2 (2002), 135–151

[7]

Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2023. Punica: Multi-Tenant LoRA Serving

[8]

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, USA, 579–594

[9]

Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2019. Learning to Optimize Tensor Programs

[10]

CPU-World. 2024. Intel Xeon Platinum 8272CL specifications

[11]

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

[12]

Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. 2013. Traffic management: a holistic approach to memory placement on NUMA systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (Houston, Te...

[13]

Yaoyao Ding, Cody Hao Yu, Bojian Zheng, Yizhi Liu, Yida Wang, and Gennady Pekhimenko. 2023. Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Vancouver, BC, Canada) (ASPLOS 2023). Association ...

[14]

Jiangsu Du, Jinhui Wei, Jiazhi Jiang, Shenggan Cheng, Dan Huang, Zhiguang Chen, and Yutong Lu. 2024. Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Edinburgh, United Kingdom) (PPoPP ’24). Association...

[15]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv:2407.21783 [cs.AI]

[16]

Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou. 2021. TurboTransformers: an efficient GPU serving system for transformer models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Virtual Event, Republic of Korea) (PPoPP ’21). Association for Computing Machinery, New York, NY, USA, 389–402

[17]

Kayvon Fatahalian, Timothy J. Knight, Mike Houston, Mattan Erez, Daniel Reiter Horn, Larkhoon Leem, Ji Young Park, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. 2006. Sequoia: Programming the Memory Hierarchy. In SC ’06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. Association for Computing Machinery, New York, NY, USA, 4–4

[18]

Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, and Tianqi Chen. 2022. TensorIR: An Abstraction for Automatic Tensorized Program Optimization

[19]

Xiao Fu, Weiling Yang, Dezun Dong, and Xing Su. 2024. Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs. In Proceedings of the 38th ACM International Conference on Supercomputing (Kyoto, Japan) (ICS ’24). Association for Computing Machinery, New York, NY, USA, 137–149. https://doi.org/10.1145/3650200.3656620

[20]

Kyle A. Gallivan, William Jalby, Ulrike Meier, and Ahmed H. Sameh. 1988. Impact of Hierarchical Memory Systems On Linear Algebra Algorithm Design. International Journal of High Performance Computing Applications 2 (1988), 12–48. https://api.semanticscholar.org/CorpusID:62189292

[21]

Georgi Gerganov and contributors. 2023. llama.cpp: LLaMA inference in C/C++

[22]

Millad Ghane, Sunita Chandrasekaran, and Margaret S. Cheung. 2019. Gecko: Hierarchical Distributed View of Heterogeneous Shared Memory Architectures. In Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores (Washington, DC, USA) (PMAM’19). Association for Computing Machinery, New York, NY, USA, 21–30

[23]

Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. 1994. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley Longman Publishing Co., Inc., USA

[24]

John A. Gunnels, Greg M. Henry, and Robert A. Geijn. 2001. A Family of High-Performance Matrix Multiplication Algorithms. In International Conference on Conceptual Structures. https://api.semanticscholar.org/CorpusID:442764

[25]

Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang. 2024. Characterization of Large Language Model Development in the Datacenter. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). USENIX Association, Santa Clara, CA, 709–729

[26]

Huawei Noah. Year. Bolt: A Deep Learning Library with High Performance and Heterogeneous Flexibility. https://github.com/huawei-noah/bolt. Accessed on Date

[27]

Intel. 2023. xFasterTransformer

[28]

Intel Corporation. 2024. Intel Math Kernel Library

[29]

Intel Corporation. 2024. Intel OpenVINO Toolkit

[30]

Intel Corporation. n.d. Intel® Xeon® Gold 6230 Processor (27.5M Cache, 2.10 GHz) - Specifications

[31]

V. Jacobson. 1988. Congestion avoidance and control. In Symposium Proceedings on Communications Architectures and Protocols (Stanford, California, USA) (SIGCOMM ’88). Association for Computing Machinery, New York, NY, USA, 314–329

[32]

Kiseok Jeon, Junghee Lee, Bumsoo Kim, and James J. Kim. 2023. Hardware Accelerated Reusable Merkle Tree Generation for Bitcoin Blockchain Headers. IEEE Computer Architecture Letters 22, 2 (2023), 69–72

[35]

Jiazhi Jiang, Jiangsu Du, Dan Huang, Dongsheng Li, Jiang Zheng, and Yutong Lu. Characterizing and Optimizing Transformer Inference on ARM Many-core Processor. In Proceedings of the 51st International Conference on Parallel Processing (Bordeaux, France) (ICPP ’22). Association for Computing Machinery, New York, NY, USA, Article 20, 11 pages

[37]

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization

[38]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (Koblenz, Germany) (SOSP ’23). Association for Computing Machinery, New York, N...

[39]

Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. Ring Attention with Blockwise Transformers for Near-Infinite Context

[40]

Zoltan Majo and Thomas R. Gross. 2011. Memory system performance in a NUMA multicore multiprocessor. In Proceedings of the 4th Annual International Conference on Systems and Storage. Association for Computing Machinery, New York, NY, USA, 10 pages. https://doi.org/10.1145/1987816.1987832

[41]

Kapil K. Mathur and S. Lennart Johnsson. 1994. Multiplication of Matrices of Arbitrary Shape on a Data Parallel Computer. Parallel Comput. 20 (1994), 919–951. https://api.semanticscholar.org/CorpusID:16487869

[42]

Seonjin Na, Geonhwa Jeong, Byung Hoon Ahn, Jeffrey Young, Tushar Krishna, and Hyesoon Kim. 2024. Understanding Performance Implications of LLM Inference on CPUs. In 2024 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 169–180

[43]

NVIDIA. 2023. FasterTransformer

[44]

oneDNN Contributors. 2024. oneAPI Deep Neural Network Library (oneDNN). https://github.com/oneapi-src/oneDNN

[45]

Open MPI. 2023. hwloc. https://github.com/open-mpi/hwloc

[46]

OpenAI. 2023. ChatGPT: Optimizing Language Models for Dialogue

[47]

OpenMP Architecture Review Board. 2008. OpenMP Application Program Interface Version 3.0. http://www.openmp.org/mp-documents/spec30.pdf

[48]

Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2024. Queue Management for SLO-Oriented Large Language Model Serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing (Redmond, WA, USA) (SoCC ’24). Association for Computing Machinery, New York, NY, USA, 1...

[49]

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2024. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. ArXiv abs/2407.00079 (2024)

[50]

Hongliang Qu and Zhibin Yu. 2024. WASP: Workload-Aware Self-Replicating Page-Tables for NUMA Servers. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24). Association for Computing Machinery, New York, NY, USA, 1233–1249

[51]

Sudarsanan Rajasekaran, Manya Ghobadi, and Aditya Akella. 2023. CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters. arXiv:2308.00852 [cs.NI]

[52]

RyokoAI. 2021. ShareGPT52K

[53]

Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, and Hengyu Meng. 2023. Efficient LLM Inference on CPUs. arXiv:2311.00502 [cs.LG]

[54]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. ArXiv abs/1909.08053 (2019)

[55]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding

[57]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]

[58]

WikiChip. n.d. Kunpeng 920-4826 - HiSilicon. https://en.wikichip.org/wiki/hisilicon/kunpeng/920-4826

[59]

WikiChip. n.d. TaiShan v110 - Microarchitectures - HiSilicon. https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110. Accessed on September 6, 2025

[60]

Wikipedia contributors. 2024. List of Intel Xeon processors (Skylake-based) — Wikipedia, The Free Encyclopedia

[61]

XNNPack Contributors. 2024. XNNPack

[62]

Yonghong Yan, Jisheng Zhao, Yi Guo, and Vivek Sarkar. 2009. Hierarchical place trees: a portable abstraction for task parallelism and data movement. In Proceedings of the 22nd International Conference on Languages and Compilers for Parallel Computing (Newark, DE) (LCPC’09). Springer-Verlag, Berlin, Heidelberg, 172–187

[63]

Feng Yu, Guangli Li, Jiacheng Zhao, Huimin Cui, Xiaobing Feng, and Jingling Xue. Optimizing Dynamic-Shape Neural Networks on Accelerators via On-the-Fly Micro-Kernel Polymerization. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (La Jolla, CA, USA) (ASPLOS ’24). Association for Computing Machinery, New York, NY, USA, 797–812

[65]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 521–538

[66]

Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, and Anshumali Shrivastava. 2024. NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention. arXiv:2403.01273 [cs.LG]

[67]

Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

[68]

Bojian Zheng, Ziheng Jiang, Cody Hao Yu, Haichen Shen, Joshua Fromm, Yizhi Liu, Yida Wang, Luis Ceze, Tianqi Chen, and Gennady Pekhimenko. 2022. DietCode: Automatic Optimization for Dynamic Tensor Programs. In Proceedings of Machine Learning and Systems, Vol. 4. Conference on Machine Learning and Systems, USA, 848–863

[69]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. 2024. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset. arXiv:2309.11998 [cs.CL]

[70]

Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: generating high-performance tensor programs for deep learning. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI’20). USENIX Associatio...

[71]

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX ...

[72]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

[73]

Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. 2022. ROLLER: Fast and Efficient Tensor Compilation for Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, ...