
arXiv: 2507.18454 · v2 · submitted 2025-05-19 · cs.AR · cs.AI · cs.DC · cs.PL

Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving

Pith reviewed 2026-05-22 15:03 UTC · model grok-4.3

classification: cs.AR · cs.AI · cs.DC · cs.PL
keywords: CPU LLM serving · phase-wise plan switching · hardware topology abstraction · dynamic-shape kernels · non-disaggregated deployment · tensor program generation · configuration search

The pith

Sandwich lets CPUs serve LLMs efficiently by switching between prefill and decode plans without interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CPUs remain essential for running large language models because they are widely available, cost-effective, and suitable for edge use. The core difficulty is that prefill and decode phases demand conflicting resources, and existing approaches either create interference or require hardware disaggregation. Sandwich addresses this through seamless phase-wise plan switching, a tree-based hardware model called TopoTree that automatically allocates partial cores while respecting substructures such as LLC slices, and a fast-start-then-finetune method for generating dynamic-shape tensor programs. These elements together produce an average 2.01x end-to-end speedup and up to 3.40x latency reduction across five x86 and ARM platforms. The resulting kernels reach the performance of static compilers while requiring three orders of magnitude less tuning effort.

Core claim

By combining seamless phase-wise plan switching to eliminate cross-phase interference, TopoTree for automated substructure-aware partial core allocation, and fast-start-then-finetune dynamic-shape tensor program generation, Sandwich delivers a full-stack CPU LLM serving system that achieves high performance under non-disaggregated constraints.
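
The idea of phase-wise plan switching can be illustrated with a minimal sketch. Everything here is hypothetical: the `Plan` and `PlanSwitcher` names, the kernel labels, and the rule that a request counts as prefill until it has produced a token are assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple


class Phase(Enum):
    PREFILL = "prefill"
    DECODE = "decode"


@dataclass(frozen=True)
class Plan:
    """Hypothetical service configuration for one phase."""
    cores: Tuple[int, ...]  # core IDs the phase is allowed to use
    kernel: str             # label of the kernel variant to dispatch


class PlanSwitcher:
    """Keeps one precomputed plan per phase and swaps the active one."""

    def __init__(self, plans: Dict[Phase, Plan]):
        self.plans = plans
        self.active = Phase.PREFILL
        self.switches = 0

    def plan_for(self, generated_tokens: int) -> Plan:
        # Assumption: a request is in prefill until it has produced a token.
        phase = Phase.PREFILL if generated_tokens == 0 else Phase.DECODE
        if phase is not self.active:
            self.active = phase
            self.switches += 1
        return self.plans[phase]


# Made-up plans: prefill uses every core, decode uses a subset.
switcher = PlanSwitcher({
    Phase.PREFILL: Plan(cores=tuple(range(8)), kernel="prefill_kernel"),
    Phase.DECODE: Plan(cores=(0, 2, 4, 6), kernel="decode_kernel"),
})
trace = [switcher.plan_for(n).kernel for n in (0, 1, 2, 0)]
```

The sketch only tracks which plan is active; in a real system the cost hidden behind each switch (thread re-pinning, cache warm-up) is exactly what the overhead question below is about.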

What carries the argument

The argument rests on a hot-switching mechanism for seamless phase-wise configuration changes, supported by TopoTree, a tree-based hardware abstraction that automates partial core allocation while respecting sub-NUMA structures.
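
A tree-shaped hardware model with substructure-aware partial allocation might look roughly like the following. This is a sketch under stated assumptions: the `Node` class, the `remove_per_leaf` policy (drop the same number of cores from every leaf), and the example machine layout are all invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One level of a hypothetical topology tree (machine, NUMA node, LLC slice)."""
    kind: str
    children: List["Node"] = field(default_factory=list)
    cores: List[int] = field(default_factory=list)  # only populated on leaves

    def all_cores(self) -> List[int]:
        if not self.children:
            return list(self.cores)
        return [c for child in self.children for c in child.all_cores()]


def remove_per_leaf(node: Node, k: int) -> Node:
    """Return a copy of the tree with the last k cores dropped from every leaf.

    Removing the same number of cores from each leaf keeps the substructures
    balanced. This is an illustrative policy, not the paper's algorithm.
    """
    if not node.children:
        keep = max(len(node.cores) - k, 0)
        return Node(node.kind, cores=node.cores[:keep])
    return Node(node.kind, children=[remove_per_leaf(c, k) for c in node.children])


# Made-up machine: 2 NUMA nodes, each with 2 LLC slices of 4 cores.
machine = Node("machine", children=[
    Node("numa", children=[
        Node("llc", cores=list(range(base, base + 4)))
        for base in (numa * 8, numa * 8 + 4)
    ])
    for numa in range(2)
])

prefill_cores = machine.all_cores()                     # every core
decode_cores = remove_per_leaf(machine, 2).all_cores()  # 2 cores kept per slice
```

Expressing the allocation as a transformation of the tree, rather than as a flat core list, is what lets a policy reason about slices and NUMA nodes instead of individual core IDs.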

If this is right

  • Average 2.01x end-to-end speedup for LLM serving workloads on CPUs.
  • Up to 3.40x reduction in latency for certain serving scenarios.
  • Kernel performance matching static compilers at three orders of magnitude lower tuning cost.
  • Consistent operation across x86 and ARM CPU platforms without disaggregation.
  • Reduced cross-phase resource interference in shared CPU environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The switching and topology-aware allocation ideas could extend to other accelerators that experience phase-dependent resource conflicts.
  • TopoTree-style abstractions might help manage resources in multi-chip or heterogeneous CPU setups beyond LLM serving.
  • Lower tuning costs could encourage broader adoption of dynamic-shape optimizations in production inference systems.

Load-bearing premise

Seamless switching between prefill and decode plans incurs negligible overhead, and TopoTree's automated core allocation avoids interference across dynamic workloads and all sub-NUMA topologies without manual tuning.

What would settle it

Direct measurements of plan-switching latency and of end-to-end performance under rapidly changing batch sizes and sequence lengths, taken on multi-socket or complex NUMA CPU systems, would reveal any unexpected overhead or contention.
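
A measurement of this kind could start from a small timing harness such as the one below. It is a sketch only: `measure_switch_latency` and `fake_switch` are hypothetical names, and the stand-in switch does no real work, so the numbers it produces say nothing about Sandwich itself. A real run would pass in the serving system's actual switch routine.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_switch_latency(switch: Callable[[str], None], rounds: int = 1000) -> Dict[str, float]:
    """Time alternating prefill/decode switches; report latency in microseconds."""
    samples: List[float] = []
    for i in range(rounds):
        target = "decode" if i % 2 == 0 else "prefill"
        start = time.perf_counter_ns()
        switch(target)
        samples.append((time.perf_counter_ns() - start) / 1_000)
    samples.sort()
    return {
        "p50_us": statistics.median(samples),
        "p99_us": samples[min(len(samples) - 1, int(0.99 * len(samples)))],
        "max_us": samples[-1],
    }


# Stand-in for a real switch: it only records the requested phase.
state = {"phase": "prefill"}


def fake_switch(phase: str) -> None:
    state["phase"] = phase


stats = measure_switch_latency(fake_switch, rounds=200)
```

Reporting tail percentiles alongside the median matters here, because occasional slow switches would hurt latency targets even when the typical switch is cheap.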

Figures

Figures reproduced from arXiv: 2507.18454 by Chuan Wu, Jiuru Li, Juntao Zhao.

Figure 2: GEMM: (a) GFLOPS under different combinations
Figure 3: (a) Kunpeng: i) NUMA: using all the NUMA nodes; ii)
Figure 4: Compile-time workflow of Sandwich
    … prefill, and a remove transform to discover all core utilization plans with some cores removed for decode. Both results are given to TopoTree interpretation, where sandwich-config algorithm generates the spectrum of service configurations from a TopoTree. In the tensor program generation, sandwich-kernel algorithm generates a collection of computational slices and their p…
Figure 5: Example TopoTree and its transformations on Kunpeng920.
Figure 6: Example TopoTree Interpretation
    … constitute a transformation tree, as illustrated in
Figure 7: Comparison of kernel execution speed-up among Sandwich and vendor solutions.
Figure 8: Kernel generation comparison among Sandwich
Figure 9: Comparison of SLO attainment percentage under different SLO scales.
Figure 10: Comparison of TTFT in single sequence serving.
Figure 11: Comparison of output token generation throughput for single sequence serving.
Figure 12: SLO and Goodput comparison for batched serving.
Figure 13: Request latency distribution for Llama3-8B, batch
Figure 14: The mean TTFT and TPOT of xFasterTransformers
Original abstract

CPUs are critical for LLM serving due to their availability, cost efficiency, and edge applicability. However, efficient CPU serving is hindered by conflicting prefill/decode resource demands under non-disaggregated deployment constraints--existing solutions fail to avoid cross-phase interference, ignore sub-NUMA hardware structures, and deliver suboptimal dynamic-shape kernel performance. We propose Sandwich, a full-stack CPU LLM serving system with three core innovations addressing these challenges: (1) seamless phase-wise plan switching to eliminate cross-phase interference; (2) TopoTree, a tree-based hardware abstraction for automated substructure-aware (e.g., LLC slices) partial core allocation; (3) fast-start-then-finetune dynamic-shape tensor program generation. Across five x86/ARM CPU platforms, Sandwich achieves an average 2.01x end-to-end speedup and up to 3.40x latency reduction over state-of-the-art systems. Its kernels match static compiler performance with three orders of magnitude lower tuning cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Sandwich, a full-stack CPU LLM serving system designed to address conflicting prefill and decode resource demands in non-disaggregated deployments. It proposes three innovations: seamless phase-wise plan switching to eliminate cross-phase interference, TopoTree (a tree-based hardware abstraction for automated sub-NUMA partial core allocation), and a fast-start-then-finetune approach for dynamic-shape tensor program generation. Across five x86/ARM platforms, it reports an average 2.01x end-to-end speedup and up to 3.40x latency reduction versus state-of-the-art systems, while matching static compiler kernel performance at three orders of magnitude lower tuning cost.

Significance. If the performance claims are substantiated with detailed breakdowns, this work would offer a practical advance for CPU-based LLM serving, which remains relevant for cost, availability, and edge use cases. The automated handling of sub-NUMA structures and low-overhead dynamic switching could improve utilization in real workloads, and the reduced tuning cost for dynamic shapes is a clear practical strength.

major comments (2)
  1. [§5 and abstract] §5 (Evaluation) and abstract: the headline 2.01x average speedup and 3.40x latency reduction rest on the assumption of negligible overhead for phase-wise plan switching plus TopoTree's interference-free partial core allocation. The manuscript provides no isolated micro-benchmarks quantifying switching latency or allocation stability under bursty/varying batch sizes across the five platforms; without these, it is impossible to determine how much of the reported gain is attributable to the proposed mechanisms versus other factors.
  2. [§3] §3 (TopoTree design): the claim that TopoTree enables automated, interference-free allocation across sub-NUMA topologies without manual tuning is load-bearing for the cross-platform results, yet the evaluation does not include targeted tests (e.g., LLC-slice contention or NUMA-boundary workloads) that would falsify or confirm robustness.
minor comments (2)
  1. [Abstract] Abstract: add one sentence naming the primary baselines (e.g., vLLM's CPU backend or xFasterTransformer) so readers can immediately contextualize the 2.01x figure.
  2. [§5] Figures in §5: ensure error bars or standard deviations are shown for all latency and throughput numbers; the current presentation makes it hard to judge statistical significance of the reported speedups.
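
The error-bar request in the second minor comment amounts to reporting spread alongside each mean. A minimal sketch follows, assuming repeated latency measurements are available per configuration. The sample values are made up for illustration, and the normal-approximation interval is a simplification: with this few samples a t-based interval would be wider.

```python
import math
import statistics
from typing import List, Tuple


def mean_with_ci(samples: List[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, half-width of a normal-approximation CI)."""
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)  # sample (n - 1) standard deviation
    half_width = z * stdev / math.sqrt(len(samples))
    return mean, stdev, half_width


# Made-up latencies in milliseconds, for illustration only.
baseline = [100.0, 104.0, 98.0, 102.0, 96.0]
candidate = [80.0, 82.0, 78.0, 81.0, 79.0]

b_mean, b_std, b_ci = mean_with_ci(baseline)
c_mean, c_std, c_ci = mean_with_ci(candidate)
speedup = b_mean / c_mean

# The speedup is easier to trust when the two intervals do not overlap.
intervals_disjoint = (c_mean + c_ci) < (b_mean - b_ci)
```

Plotting `b_ci` and `c_ci` as error bars would let a reader judge at a glance whether a reported speedup exceeds run-to-run noise.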

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the substantiation of our claims without altering the core contributions.

  1. Referee: [§5 and abstract] §5 (Evaluation) and abstract: the headline 2.01x average speedup and 3.40x latency reduction rest on the assumption of negligible overhead for phase-wise plan switching plus TopoTree's interference-free partial core allocation. The manuscript provides no isolated micro-benchmarks quantifying switching latency or allocation stability under bursty/varying batch sizes across the five platforms; without these, it is impossible to determine how much of the reported gain is attributable to the proposed mechanisms versus other factors.

    Authors: We agree that isolated micro-benchmarks would improve transparency. The reported 2.01x and 3.40x figures are measured in complete end-to-end serving runs that include all phase switches and dynamic allocations under realistic workloads, so overheads are already reflected in the net gains. To directly address the concern, we will add a dedicated subsection and figure in §5 with micro-benchmarks for switching latency and stability under bursty batch sizes on all five platforms. revision: yes

  2. Referee: [§3] §3 (TopoTree design): the claim that TopoTree enables automated, interference-free allocation across sub-NUMA topologies without manual tuning is load-bearing for the cross-platform results, yet the evaluation does not include targeted tests (e.g., LLC-slice contention or NUMA-boundary workloads) that would falsify or confirm robustness.

    Authors: The consistent speedups across five platforms with heterogeneous sub-NUMA structures already serve as empirical support for TopoTree's automated, interference-free allocation. However, we acknowledge that explicit stress tests would further confirm robustness. We will incorporate additional targeted experiments in the revised §5 evaluating performance under LLC-slice contention and NUMA-boundary workloads. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on direct empirical measurements


The paper is a systems implementation describing Sandwich with three innovations for CPU LLM serving. All headline performance numbers (2.01x average speedup, 3.40x latency reduction, three orders of magnitude lower tuning cost) are presented as results of benchmarking on five real platforms. No mathematical derivations, first-principles predictions, or equations appear in the abstract or described contributions. There are no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claims to their own inputs by construction. The derivation chain is self-contained because it consists of implementation choices validated by external measurement rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on empirical demonstration of three new system components whose benefits are measured rather than derived from first principles or external benchmarks.

axioms (2)
  • domain assumption: Prefill and decode phases have conflicting resource demands that cause interference under non-disaggregated CPU deployment.
    Invoked to motivate the need for phase-wise plan switching.
  • domain assumption: Sub-NUMA hardware structures such as LLC slices affect core allocation performance.
    Basis for introducing TopoTree partial core allocation.
invented entities (1)
  • TopoTree (no independent evidence)
    purpose: Tree-based hardware abstraction for automated substructure-aware partial core allocation
    Newly introduced to handle sub-NUMA structures automatically.

pith-pipeline@v0.9.0 · 5706 in / 1444 out tokens · 44194 ms · 2026-05-22T15:03:17.816421+00:00 · methodology


