pith. machine review for the scientific record. sign in

arxiv: 2604.12171 · v1 · submitted 2026-04-14 · 💻 cs.DC · cs.LG

Recognition: unknown

PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:29 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords pipeline parallelismlive reconfigurationKV cacheLLM servingPageAttentiondynamic inferencetime-to-first-tokenreconfiguration overhead
0
0 comments X

The pith

PipeLive enables live in-place pipeline parallelism reconfiguration for LLMs by redesigning KV cache layout and using incremental state patching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PipeLive targets the inability of static pipeline parallelism setups to adapt to changing loads or hardware in LLM serving without causing downtime. Traditional stops and restarts are too slow for dynamic environments such as serverless platforms. The approach introduces a unified KV cache resizing mechanism built on an extended PageAttention and pairs it with an incremental patching process that keeps KV states consistent across old and new layer placements. A sympathetic reader would care because the changes let systems adjust parallelism while inference continues, cutting first-token latency and overhead in real workloads.

Core claim

PipeLive enables live in-place PP reconfiguration through a redesigned KV cache layout co-designed with an extension to PageAttention for unified live KV resizing, together with an incremental KV patching mechanism that synchronizes states between source and target configurations and identifies a safe switch point, allowing reconfiguration with minimal disruption to ongoing inference.

What carries the argument

Incremental KV patching mechanism that synchronizes KV states between source and target configurations while locating a safe switch point.

Load-bearing premise

The incremental KV patching can reliably locate a safe switch point and preserve KV consistency without introducing errors or large overhead while states continue to evolve during live execution.

What would settle it

A dynamic serving trace in which KV states desynchronize during a live switch, producing incorrect output tokens or a crash, or in which measured reconfiguration time stays above 10 ms.

Figures

Figures reproduced from arXiv: 2604.12171 by Adel N. Toosi, Chen Wang, Muhammed Tawfiqul Islam, Xu Bai.

Figure 1
Figure 1. Figure 1: illustrates this effect in a two-GPU setup (NVIDIA A100 + L40S). Each subplot reports total token throughput (including both input and output tokens) under prefill-heavy (input=512, output=16) and decode-heavy (input=128, out￾put=512) workloads, while the x-axis enumerates different PP configurations for a 64-layer Qwen3-30B model (e.g., 28/36 assigns 28 layers to the A100 and 36 to the L40S). The results … view at source ↗
Figure 2
Figure 2. Figure 2: Architecture of PipeLive. PipeLive introduces a centralized Reconfiguration Coordina￾tor. Given the current and target PP configurations, the coor￾dinator executes a reconfiguration protocol that (1) evaluates feasibility under GPU memory constraints and KV cache state, and (2) synthesizes an explicit execution plan that de￾composes the target configuration into coordinated actions, including KV cache resi… view at source ↗
Figure 3
Figure 3. Figure 3: illustrates a canonical PP reconfiguration workflow. We represent a PP configuration as an ordered tuple of layer ranges assigned to each GPU. For clarity, we consider a sim￾plified setup with three GPUs (GPU 1, GPU 2, GPU 3) serving a six-layer model. 1. Initial PP Configuration. The system transitions from an initial configuration C𝐴 = ⟨[1, 2], [3, 4], [5, 6]⟩ to a tar￾get configuration C𝐵 = ⟨[1], [2, 3]… view at source ↗
Figure 4
Figure 4. Figure 4: illustrates the timeline of the live in-place PP re￾configuration protocol across the five phases and highlights the dependency relationships among the primitives. Weight loading and KV cache migration have no direct dependency and can therefore execute in parallel. All other primitives are issued sequentially to ensure correctness and a well-defined transition order. 5 Worker-Level Dynamic KV Cache Man￾ag… view at source ↗
Figure 5
Figure 5. Figure 5: PipeLive extends PagedAttention to access non￾contiguous GPU memory, enabling dynamic KV cache resiz￾ing during live migration. Because GPU memory cannot be resized in place, shrinking or expanding the KV cache requires allocating a new buffer and copying all live KV blocks, making dynamic resizing costly and impractical during reconfiguration. PipeLive addresses this by adopting a block-level KV cache all… view at source ↗
Figure 6
Figure 6. Figure 6: Layer stacking KV cache layout in PipeLive. Stack￾ing two layers into one GPU memory block halved the logical KV block size. GPU memory allocation must respect the CUDA virtual￾memory allocation granularity [18], which is commonly 2 MiB on current NVIDIA GPUs1 , whereas PageAttention typically uses much smaller KV block sizes (32KB–128KB) to control internal fragmentation. Directly matching KV blocks to th… view at source ↗
Figure 7
Figure 7. Figure 7: NCCL Communications in PP Reconfiguration. A circular dependency exists between GPU 1 and GPU 2, causing a deadlock. to the corresponding local layer caches. Multiple sender– receiver pairs may coexist within a single migrator to handle concurrent migrations across different GPU pairs. Each sender thread maintains a dirty bitmap B (𝑖𝑠 ,𝑖𝑑 ) ∈ 0, 1 𝑁 , where 𝑁 is the total number of physical KV cache slots … view at source ↗
Figure 8
Figure 8. Figure 8: Performance of different PP configurations for heterogeneous GPUs under varying workloads [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Performance of different PP configurations for heterogeneous GPUs under the mixed decode-heavy and prefill-heavy workloads. a request rate of 3, switching from a prefill-heavy to a decode￾heavy workload results in different optimal configurations across all three metrics, with decode-heavy workloads fa￾voring configurations that assign more layers to L40S (e.g., from 36/44 to 52/28 for TTFT). Motivated by … view at source ↗
Figure 10
Figure 10. Figure 10: End-to-end performance of PP reconfiguration with kv resizing disabled and enabled under mixed workload. 16 8 4 2 1 Number of Stacked Layer 0 25 50 75 100 KV Memory Utilization (%) 95.6% 93.5% 84.9% 74.1% 56.4% [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
Figure 13
Figure 13. Figure 13: Comparison of stop time and migration time un￾der different migrated layers with different migration modes. 0.5 1.0 1.5 2.0 2.5 3.0 Request Rate (req/s) 0 1000 2000 TTFT (ms) 0.5 1.0 1.5 2.0 2.5 3.0 Request Rate (req/s) 200 400 600 TPOT (ms) PipeLive KV Patch Disabled (Async) Sync [PITH_FULL_IMAGE:figures/full_fig_p012_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: End-to-end tests of performance of PP reconfigu￾ration with different migration modes. in total migration time, as the KV patch continuously syn￾chronizes newly generated and migrated KV states [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗
read the original abstract

Pipeline parallelism (PP) is widely used to partition layers of large language models (LLMs) across GPUs, enabling scalable inference for large models. However, existing systems rely on static PP configurations that fail to adapt to dynamic settings, such as serverless platforms and heterogeneous GPU environments. Reconfiguring PP by stopping and redeploying service incurs prohibitive downtime, so reconfiguration must instead proceed live and in place, without interrupting inference. However, live in-place PP reconfiguration is fundamentally challenging. GPUs are already saturated with model weights and KV cache, leaving little room for new layer placements and necessitating KV cache resizing, at odds with systems like vLLM that preallocate for throughput. Moreover, maintaining KV consistency during execution is difficult: stop-and-copy introduces large pauses, while background synchronization risks inconsistency as states evolve. We present PipeLive, which enables live in-place PP reconfiguration with minimal disruption. PipeLive introduces a redesigned KV cache layout together with a co-designed extension to PageAttention, forming a unified mechanism for live KV resizing. It further adopts an incremental KV patching mechanism, inspired by live virtual machine migration, to synchronize KV states between source and target configurations and identify a safe switch point. PipeLive achieves a 2.5X reduction in time-to-first-token (TTFT) without KV cache overflow compared to disabling KV resizing. Furthermore, compared to a variant without KV patching, it reduces reconfiguration overhead from seconds to under 10ms, and improves TTFT and time-per-output-token (TPOT) by up to 54.7% and 14.7%, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents PipeLive, a system enabling live in-place pipeline parallelism (PP) reconfiguration for dynamic LLM serving without interrupting inference. It introduces a redesigned KV cache layout co-designed with an extension to PageAttention for live KV resizing, plus an incremental KV patching mechanism (inspired by VM migration) to synchronize states between source and target PP configurations and locate a safe switch point. Empirical claims include a 2.5X TTFT reduction without KV cache overflow versus disabling resizing, reconfiguration overhead reduced from seconds to under 10 ms versus a no-patching variant, and TTFT/TPOT improvements of up to 54.7% and 14.7%.

Significance. If the central claims hold under realistic workloads, this would be a meaningful systems contribution for serverless and heterogeneous-GPU LLM serving, where static PP configurations are inadequate. The work supplies concrete end-to-end measurements of TTFT, TPOT, and reconfiguration latency on actual hardware, which strengthens the practical relevance of the live-reconfiguration approach.

major comments (2)
  1. [Section describing incremental KV patching (mechanism and safe-switch logic)] The description of the incremental KV patching mechanism does not specify any verification protocol (e.g., checksums on KV blocks, output-token determinism checks, or bounded generation windows) for confirming KV equivalence at the switch point while new tokens continue to be generated. This is load-bearing for the sub-10 ms overhead and “no inconsistency” claims, because any lag in detecting a safe point risks either using stale KV entries or an implicit stall.
  2. [Evaluation section (performance figures and tables)] Evaluation results report 2.5X TTFT reduction and 54.7 % / 14.7 % improvements without error bars, number of runs, or data-exclusion criteria. Because the central performance assertions rest on these unreviewed implementation measurements, the absence of statistical detail prevents independent assessment of whether the gains are robust or sensitive to particular workloads or hardware configurations.
minor comments (1)
  1. [KV cache redesign subsection] Notation for the extended PageAttention data structures and the new KV cache layout could be clarified with an explicit diagram or pseudocode listing the additional fields introduced for live resizing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our paper. We address each of the major comments in detail below and indicate the revisions we plan to make.

read point-by-point responses
  1. Referee: [Section describing incremental KV patching (mechanism and safe-switch logic)] The description of the incremental KV patching mechanism does not specify any verification protocol (e.g., checksums on KV blocks, output-token determinism checks, or bounded generation windows) for confirming KV equivalence at the switch point while new tokens continue to be generated. This is load-bearing for the sub-10 ms overhead and “no inconsistency” claims, because any lag in detecting a safe point risks either using stale KV entries or an implicit stall.

    Authors: We appreciate the referee pointing out the need for more detail on the verification of KV equivalence. Upon review, the original manuscript describes the high-level mechanism but does not elaborate on the low-level verification steps. In the revised manuscript, we will add a detailed explanation of the safe-switch logic. The incremental patching uses per-block sequence numbers that are updated atomically with each patch. The target configuration tracks the maximum sequence number received for each KV block. A safe switch point is declared when the target's maximum sequence number matches the source's current generation point (i.e., the last token generated), and no new tokens have been produced in the interim window of 1-2 tokens to bound any potential lag. This design ensures equivalence without expensive checksums, as the sequence numbers guarantee that all prior KV states have been patched. We will include pseudocode and a timing diagram in the revision to clarify this process. revision: yes

  2. Referee: [Evaluation section (performance figures and tables)] Evaluation results report 2.5X TTFT reduction and 54.7 % / 14.7 % improvements without error bars, number of runs, or data-exclusion criteria. Because the central performance assertions rest on these unreviewed implementation measurements, the absence of statistical detail prevents independent assessment of whether the gains are robust or sensitive to particular workloads or hardware configurations.

    Authors: The referee is correct that additional statistical information would improve the evaluation. We will revise the evaluation section to report that all results are averaged over 10 independent runs, with error bars showing the standard deviation. We will also specify the data collection criteria: experiments were run on a fixed set of 5 representative workloads (including ShareGPT and synthetic traces), with no data points excluded. The hardware setup (8x A100 GPUs) and software versions will be detailed in a new reproducibility subsection. While the observed variance was low (<5% in most cases) due to the deterministic nature of the controlled testbed, we agree that reporting these details allows for better assessment of robustness. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical systems evaluation only

full rationale

The paper describes an engineering system (PipeLive) for live PP reconfiguration using KV cache redesign, PageAttention extension, and incremental patching inspired by VM migration. All central claims (2.5X TTFT reduction, <10ms overhead, 54.7% TTFT and 14.7% TPOT gains) are presented as direct experimental measurements on a prototype, not as quantities derived from equations or parameters fitted within the same work. No self-referential equations, fitted-input predictions, or load-bearing self-citations appear in the provided text. The KV patching mechanism is an implementation choice whose correctness is asserted via runtime behavior and benchmarks rather than reduced to prior self-citations or definitions. This is a standard non-circular systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work introduces engineering mechanisms (redesigned KV cache layout, co-designed PageAttention extension, incremental KV patching) rather than mathematical axioms or fitted parameters; no free parameters or invented physical entities are stated.

pith-pipeline@v0.9.0 · 5599 in / 1132 out tokens · 27378 ms · 2026-05-10T16:29:17.516395+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 9 canonical work pages · 6 internal anchors

  1. [1]

    Gulavani, Ramachandran Ramjee, and Alexey Tu- manov

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, and Alexey Tu- manov. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

  2. [2]

    Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-Shot Learners. (2020)

  3. [3]

    Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. 2005. Live Migration of Virtual Machines. InProceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation (NSDI)

  4. [4]

    Tri Dao. 2024. FlashAttention-2: Faster Attention with Better Paral- lelism and Work Partitioning. InInternational Conference on Learning Representations (ICLR)

  5. [5]

    Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

  6. [6]

    InAdvances in Neural Information Processing Systems (NeurIPS)

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InAdvances in Neural Information Processing Systems (NeurIPS)

  7. [7]

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low- Latency Serverless Inference for Large Language Models. InProceed- ings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’24)

  8. [8]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Vaughan, et al. 2024. The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783(2024)

  9. [9]

    Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chi- ang, Alvin Cheung, and Ion Stoica. 2024. Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity.arXiv preprint arXiv:2404.14527(2024)

  10. [10]

    Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu, Walid Gaaloul, Michael Sheng, Qi Yu, and Sami Yangui. 2025. UELLM: A Unified and Efficient Approach for Large Language Model Inference Serving. InInternational Conference on Service-Oriented Computing (ICSOC)

  11. [11]

    Zixuan Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, and Torsten Hoefler. 2025. Demystifying NCCL: An In-Depth Analysis of GPU Communication Protocols and Algorithms. InProceedings of the 39th IEEE International Parallel and Distributed Processing Symposium (IPDPS)

  12. [12]

    Le, Yonghui Wu, and Zhifeng Chen

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. InAdvances in Neural Information Processing Systems (NeurIPS)

  13. [13]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

  14. [14]

    InProceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23)

    Efficient Memory Management for Large Language Model Serv- ing with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23)

  15. [15]

    Yanying Lin, Shijie Peng, Chengzhi Lu, Chengzhong Xu, and Kejiang Ye. 2025. FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters.arXiv preprint arXiv:2510.11938(2025)

  16. [16]

    Chiheng Lou, Sheng Qi, Chao Jin, Dapeng Nie, Haoran Yang, Yu Ding, Xuanzhe Liu, and Xin Jin. 2025. HydraServe: Minimizing Cold Start Latency for Serverless LLM Serving in Public Clouds.arXiv preprint arXiv:2502.15524(2025)

  17. [17]

    Zizhao Mo, Jianxiong Liao, Huanle Xu, Zhi Zhou, and Chengzhong Xu

  18. [18]

    InProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC)

    Hetis: Serving LLMs in Heterogeneous GPU Clusters with Fine- grained and Dynamic Parallelism. InProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC)

  19. [19]

    Devanur, Gregory R

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. InProceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP ’19)

  20. [20]

    OpenAI. 2023. GPT-4 Technical Report.arXiv preprint arXiv:2303.08774 (2023)

  21. [21]

    Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, and Ashish Panwar. 2024. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (SOSP)

  22. [22]

    Ruoyu Qin, Zheming Li, Weiran He, Junda Cui, Fangcheng Ren, Mingx- ing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Moon- cake: Trading More Storage for Less Computation — A KVCache- Centric Architecture for Serving LLM Chatbot. In23rd USENIX Con- ference on File and Storage Technologies (FAST 25)

  23. [23]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053(2020)

  24. [24]

    Qidong Su, Wei Zhao, Xin Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, and Gennady Pekhimenko. 2025. Seesaw: High-throughput LLM Inference via Model Re-sharding. InProceedings of Machine Learning and Systems (MLSys). 13 Conference’17, July 2017, Washington, DC, USA Xu BAI, Muhammed Tawfiqul Islam, Chen Wang, and A...

  25. [25]

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. InProceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’24)

  26. [26]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie- Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and Efficient Foundation Language Models.arXiv preprint arXiv:2302.13971(2023)

  27. [27]

    Gomez, Łukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. (2017)

  28. [28]

    Marcel Wagenländer, Guo Li, Bo Zhao, Luo Mai, and Peter Pietzuch

  29. [29]

    InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (SOSP)

    Tenplex: Dynamic Parallelism for Deep Learning Using Paral- lelizable Tensor Collections. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (SOSP)

  30. [30]

    Jiarong Xing, Yifan Qiao, Simon Mo, Xingqi Cui, Gur-Eyal Sela, Yang Zhou, Joseph Gonzalez, and Ion Stoica. 2025. Towards Efficient and Practical GPU Multitasking in the Era of LLM.arXiv preprint arXiv:2508.08448(2025)

  31. [31]

    Hongxin Xu, Tianyu Guo, and Xianwei Zhang. [n. d.]. DynaPipe: Dynamic Layer Redistribution for Efficient Serving of LLMs with Pipeline Parallelism. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  32. [32]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, et al. 2025. Qwen3 Technical Report.arXiv preprint arXiv:2505.09388(2025)

  33. [33]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. InProceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’22)

  34. [34]

    Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, and Kurt Keutzer. 2024. LLM Inference Unveiled: Survey and Roofline Model Insights

  35. [35]

    Xing, Joseph E

    Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Au- tomating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. (2022)

  36. [36]

    Gonzalez, Clark Barrett, and Ying Sheng

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. InAd- vances in Neural Information Processing Systems (NeurIPS)

  37. [37]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. InProceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’24). 14