FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters

Chengzhi Lu; Chengzhong Xu; Kejiang Ye; Shijie Peng; Yanying Lin

arXiv: 2510.11938 · v2 · submitted 2025-10-13 · 💻 cs.DC

Yanying Lin, Shijie Peng, Chengzhi Lu, Chengzhong Xu, Kejiang Ye

Pith reviewed 2026-05-18 07:04 UTC · model grok-4.3

classification 💻 cs.DC

keywords LLM serving · serverless clusters · pipeline refactoring · GPU fragmentation · dynamic adaptation · cache transitions · resource efficiency · topology-aware allocation

The pith

FlexPipe reconfigures LLM pipelines at runtime to handle variable workloads in fragmented serverless GPU clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model serving in serverless clusters struggles with unpredictable request volumes and scattered GPU resources that fixed pipeline designs cannot manage efficiently. FlexPipe addresses this by splitting models into fine-grained stages and adjusting the pipeline structure while requests are in progress, using consistent cache transitions to avoid recomputing results. A sympathetic reader would care because successful adaptation would let operators reserve much less hardware for the same performance targets, making inference services more affordable and scalable on shared infrastructure.

Core claim

FlexPipe decomposes models into fine-grained stages and intelligently adjusts pipeline granularity based on real-time request pattern analysis. It implements fine-grained model partitioning with preserved computational graph constraints, inflight pipeline refactoring with consistent cache transitions, and topology-aware resource allocation that navigates GPU fragmentation. Evaluation on an 82-GPU cluster shows these changes deliver up to 8.5x better resource efficiency and 38.3% lower latency than prior systems while cutting GPU reservations from 75% to 30% of peak capacity.
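One way to picture "adjusting pipeline granularity based on real-time request pattern analysis" is a threshold rule on the request CV, with hysteresis so the pipeline does not flap between configurations. This is a minimal sketch under stated assumptions: the CV boundaries, the stage counts, and the names `target_stages` and `next_stages` are hypothetical and are not taken from the paper.

```python
import bisect

# Hypothetical CV boundaries and stage counts (coarse to fine); not from the paper.
CV_BOUNDARIES = [1.0, 4.0]
STAGE_COUNTS = [2, 4, 8]


def target_stages(cv: float) -> int:
    """Map a request CV to a pipeline depth: burstier traffic gets finer stages."""
    return STAGE_COUNTS[bisect.bisect_right(CV_BOUNDARIES, cv)]


def next_stages(cv: float, current: int, hysteresis: float = 0.2) -> int:
    """Switch only when the CV sits clearly inside one band, to avoid flapping."""
    low = target_stages(cv * (1 - hysteresis))
    high = target_stages(cv * (1 + hysteresis))
    return low if low == high else current


if __name__ == "__main__":
    depth = 2
    for cv in [0.5, 1.1, 2.0, 4.5, 6.0]:
        depth = next_stages(cv, depth)
        print(f"cv={cv:>4} -> {depth} stages")
```

The hysteresis band is a design choice for the sketch: a CV that hovers near a boundary keeps the current depth, so a refactoring is triggered only by a clear shift in the workload.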

What carries the argument

Inflight pipeline refactoring with consistent cache transitions, which enables runtime changes to pipeline granularity while maintaining computational correctness and cache validity.
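The idea can be illustrated with a toy data model in which each stage owns a contiguous range of layers. Refinement splits a stage across an extra GPU, and consolidation merges two neighbours. The sketch assumes, purely for illustration, that cache entries are keyed by layer id, so a refactoring only changes which GPU owns an entry and never invalidates it. `Stage`, `refine`, `consolidate`, and `kv_owner` are hypothetical names, not the paper's interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    gpu: str
    layers: range  # contiguous layer ids owned by this stage


def refine(stages, index, split_at, new_gpu):
    """Split stage `index` at layer `split_at`, moving the tail to `new_gpu`."""
    s = stages[index]
    if not (s.layers.start < split_at < s.layers.stop):
        raise ValueError("split point must fall strictly inside the stage")
    left = Stage(s.gpu, range(s.layers.start, split_at))
    right = Stage(new_gpu, range(split_at, s.layers.stop))
    return stages[:index] + [left, right] + stages[index + 1:]


def consolidate(stages, index):
    """Merge stage `index` with its successor onto the first stage's GPU."""
    a, b = stages[index], stages[index + 1]
    merged = Stage(a.gpu, range(a.layers.start, b.layers.stop))
    return stages[:index] + [merged] + stages[index + 2:]


def kv_owner(stages, layer):
    """Return the GPU that holds the cache entries for `layer` (assumed keyed by layer id)."""
    for s in stages:
        if layer in s.layers:
            return s.gpu
    raise KeyError(layer)


pipeline = [Stage("gpu0", range(0, 8))]
refined = refine(pipeline, 0, split_at=4, new_gpu="gpu1")
restored = consolidate(refined, 0)
```

Here `refined` places layers 4 to 7 on `gpu1`, and `restored` equals the original single-stage pipeline, which is the round-trip property a consistent transition would need to preserve.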

If this is right

GPU reservation requirements fall from 75 percent to 30 percent of peak capacity.
Resource efficiency rises by up to 8.5 times relative to existing static pipeline systems.
End-to-end latency drops by 38.3 percent compared with state-of-the-art approaches.
Systems can respond to changing request patterns without relying on large static over-provisioning.
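The reservation figures can be turned into a relative saving with a line of arithmetic. The sketch below does that and, for illustration only, scales the fractions by 82 GPUs; treating the 82-GPU cluster size as peak capacity is an assumption made here, not a statement from the paper.

```python
def reservation_saving(before_frac: float, after_frac: float) -> float:
    """Relative reduction in reserved capacity."""
    return (before_frac - after_frac) / before_frac


saving = reservation_saving(0.75, 0.30)

# Illustrative only: treat 82 GPUs as peak capacity (an assumption).
peak_gpus = 82
reserved_before = 0.75 * peak_gpus
reserved_after = 0.30 * peak_gpus

if __name__ == "__main__":
    print(f"relative reduction in reserved GPUs: {saving:.0%}")
    print(f"reserved GPUs: {reserved_before:.1f} -> {reserved_after:.1f}")
```

Going from 75% to 30% of peak is a 60% relative cut in reserved capacity.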

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Runtime reconfiguration methods developed here may apply to serving other large models that also face fragmented hardware in shared clusters.
Cluster schedulers could adopt similar topology awareness to reduce the impact of resource fragmentation in broader distributed workloads.
Workload prediction components would need to factor in reconfiguration costs to decide when adjustments are worthwhile.

Load-bearing premise

In-flight pipeline refactoring with consistent cache transitions can occur at runtime without adding prohibitive overhead or violating computational graph constraints in fragmented clusters.

What would settle it

Direct measurements of latency and resource usage on the 82-GPU cluster under rapidly varying request patterns that show whether cache transition overhead cancels out the reported efficiency gains.

Figures

Figures reproduced from arXiv: 2510.11938 by Chengzhi Lu, Chengzhong Xu, Kejiang Ye, Shijie Peng, Yanying Lin.

**Figure 1.** Request distribution CV (coefficient of variation) across different periods. CV values calculated with different window sizes (180 s, 3 h, 12 h) are significantly mismatched, with 7× variation. (a) Request distribution CV of the Alibaba trace, (b) request distribution CV of the Top-1 App, and (c) the Top-2 App from Azure [58].

… and CV metrics to seamlessly reconfigure pipeline topologies while maintaining cache …
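The window-size effect in the caption is easy to reproduce on a toy trace. The sketch below assumes the CV is taken over per-second request counts inside non-overlapping windows (the caption does not say whether counts or inter-arrival times are used), and `windowed_cv` is a hypothetical helper name.

```python
import statistics


def windowed_cv(arrivals_per_second, window_s):
    """CV (population stdev / mean) of request counts per non-overlapping window."""
    cvs = []
    for start in range(0, len(arrivals_per_second) - window_s + 1, window_s):
        window = arrivals_per_second[start:start + window_s]
        mean = statistics.fmean(window)
        cvs.append(statistics.pstdev(window) / mean if mean > 0 else 0.0)
    return cvs


# Toy trace: 4 s at 10 req/s, then 4 s at 50 req/s.
trace = [10] * 4 + [50] * 4
short_windows = windowed_cv(trace, 4)  # each window is internally flat
long_window = windowed_cv(trace, 8)    # one window spanning the level shift
```

The short windows each report a CV of 0 because the load is flat inside them, while the single long window reports about 0.67, so the same trace looks stable or bursty depending on the window.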

**Figure 2.** Resource fragmentation in Alibaba. (a) GPU subscription rate averaging 216%, indicating significant resource overcommitment, and (b) heatmap revealing spatially scattered GPU availability patterns that impede formation of the high-bandwidth interconnected GPU groups needed for tensor parallelism.
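Both panels can be made concrete with a small sketch. It assumes the subscription rate means requested GPUs divided by physical GPUs, and it uses a simple greedy rule (fill from the nodes with the most free GPUs) as a stand-in for topology-aware allocation. The function names and the node map are hypothetical, and the paper's allocator is not described at this level of detail here.

```python
def subscription_rate(requested_gpus: float, physical_gpus: float) -> float:
    """Requested GPUs over physical GPUs; above 1.0 means overcommitment."""
    return requested_gpus / physical_gpus


def allocate(free_by_node: dict[str, int], needed: int):
    """Greedy placement that prefers nodes with the most free GPUs,
    so the allocation spans as few nodes as this heuristic can manage."""
    if sum(free_by_node.values()) < needed:
        return None  # not enough free GPUs anywhere
    plan = []
    for node, free in sorted(free_by_node.items(), key=lambda kv: (-kv[1], kv[0])):
        if needed == 0:
            break
        take = min(free, needed)
        if take:
            plan.append((node, take))
            needed -= take
    return plan


rate = subscription_rate(216, 100)
plan = allocate({"node-a": 1, "node-b": 3, "node-c": 2}, needed=4)
```

With six free GPUs scattered over three nodes, a four-GPU request cannot fit on one node; the greedy rule lands on `node-b` (3 GPUs) plus `node-c` (1 GPU), which is the kind of cross-node placement that favours pipeline stages over tensor-parallel groups.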

**Figure 3.** Impact of request distribution variability on pipeline performance. (a) Goodput decreases by 37% as CV increases from 0.1 to 8 due to resource contention; (b) average queue length grows nearly 4× with increasing CV, indicating pipeline congestion; (c) stall cycle ratio increases exponentially (22×) at high CV values, showing how static pipelines become inefficient under variable workloads.

**Figure 4.** Latency distribution across different request patterns. (a) Box plot comparing pipeline granularities across varying CV values, showing fine-grained pipelines perform better with high-variability workloads; (b) detailed latency distribution for CV=4 with a 4-stage pipeline, revealing significant variance from pipeline stalls.

… one-third that of the 4-stage architecture, comparable to the latter’s performance …

**Figure 5.** FlexPipe system architecture showing the three core components: ❶ Fine-Grained Pipeline Model Partitioning that decomposes LLMs at operator level for optimal adaptability, ❷ Inflight Pipeline Refactoring that dynamically adjusts pipeline granularity based on request patterns, and ❸ Adaptive Pipeline Scaling that enables efficient resource allocation during traffic fluctuations.
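To give a feel for component ❶, the sketch below uses a classic contiguous-partition dynamic program: split per-layer costs into a fixed number of contiguous stages so that the slowest stage is as fast as possible. This is a textbook stand-in, not the paper's algorithm, which is described as also weighing communication-computation overlap and future refactoring needs. The layer costs are made-up numbers.

```python
from functools import lru_cache
from itertools import accumulate


def partition_layers(costs, num_stages):
    """Split `costs` into `num_stages` contiguous stages, minimising the
    slowest stage. Returns (bottleneck, cut_points)."""
    n = len(costs)
    if not 1 <= num_stages <= n:
        raise ValueError("need between 1 and len(costs) stages")
    prefix = [0, *accumulate(costs)]

    @lru_cache(maxsize=None)
    def best(i, k):
        # Best split of layers[i:] into k stages.
        if k == 1:
            return prefix[n] - prefix[i], ()
        result = (float("inf"), ())
        for j in range(i + 1, n - k + 2):  # first stage is layers[i:j]
            head = prefix[j] - prefix[i]
            tail, cuts = best(j, k - 1)
            cand = max(head, tail)
            if cand < result[0]:
                result = (cand, (j, *cuts))
        return result

    return best(0, num_stages)


layer_costs = [4, 2, 3, 1, 5]  # hypothetical per-layer times
two_stage = partition_layers(layer_costs, 2)
three_stage = partition_layers(layer_costs, 3)
```

For these costs the two-stage split has a bottleneck of 9 and the three-stage split a bottleneck of 6, which shows how finer stages shorten the slowest step at the price of more stage boundaries.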

**Figure 6.** Inflight pipeline refactoring mechanism. (a) Stage refinement process where fine-grained partitioning occurs by evicting parameters and redistributing them onto additional GPUs; (b) temporal sequence diagram showing the synchronization protocol during refactoring; (c) stage consolidation process where parameters from multiple stages are merged, utilizing host memory caching to minimize loading overhead from p…

**Figure 7.** The process of model scaling using fine-grained pipeline stages. FlexPipe conference satisfies the minimum granularity pipeline stage for loading and executing inference. Then, after traffic changes, it modifies to a coarser granularity pipeline stage with fewer additional overheads.

… integrating SLO constraints to ensure service quality: $(T_j - S_j) \cdot \sum_{k=1}^{m_j} \mu_{jk}\, Q_j \ge r_j$ (12), where $T_j$ is the SLO deadline for…

**Figure 8.** End-to-end latency breakdown across varying request distributions. FlexPipe maintains lower overall latency despite higher communication overhead by significantly reducing queue wait times: (a) CV=1 (stable workload), (b) CV=2 (moderate variability), and (c) CV=4 (highly variable workload).

**Figure 9.** Latency under highly variable workload (CV=8, first 300 s). (a) Request distribution CV variability measured in 15 s windows, (b) response latency comparison across systems.

… workloads, communication-intensive fine-grained pipelines significantly outperform static architectures trapped in queue buildup cycles. All experiments used a baseline of 20 QPS across the complete 2-hour lifecycle, with different CV v…

**Figure 11.** Pipeline stall recovery time across systems and request distribution variability (CV). FlexPipe achieves substantially faster recovery under high-variability workloads (9 ms at CV=4), demonstrating the effectiveness of dynamic pipeline refactoring in addressing structural stall causes.

… architectures. The latency percentile analysis shows FlexPipe maintains consistently lower latency across all percentil…

**Figure 13.** Performance comparison with production workloads: (a) average prefill latency across model scales showing FlexPipe’s consistent advantage (6.43%-24.38% improvement), (b) latency distribution showing tighter performance bounds with fewer outliers.

… elastic scaling with 5-minute reclamation windows, directly addressing the resource fragmentation challenges.

Original abstract

Serving Large Language Models (LLMs) in production faces significant challenges from highly variable request patterns and severe resource fragmentation in serverless clusters. Current systems rely on static pipeline configurations that struggle to adapt to dynamic workload conditions, leading to substantial inefficiencies. We present FlexPipe, a novel system that dynamically reconfigures pipeline architectures during runtime to address these fundamental limitations. FlexPipe decomposes models into fine-grained stages and intelligently adjusts pipeline granularity based on real-time request pattern analysis, implementing three key innovations: fine-grained model partitioning with preserved computational graph constraints, inflight pipeline refactoring with consistent cache transitions, and topology-aware resource allocation that navigates GPU fragmentation. Comprehensive evaluation on an 82-GPU cluster demonstrates that FlexPipe achieves up to 8.5x better resource efficiency while maintaining 38.3% lower latency compared to state-of-the-art systems, reducing GPU reservation requirements from 75% to 30% of peak capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FlexPipe shows how to do runtime pipeline refactoring for LLMs in fragmented serverless clusters and reports large efficiency gains, but the overhead of those transitions needs clearer measurement.

FlexPipe tackles LLM serving in serverless clusters where GPUs are fragmented and workloads shift fast. It breaks models into fine stages, then refactors the pipeline while requests are in flight, with steps to keep caches consistent and pick resources that fit the broken topology. The three pieces (fine partitioning that respects the graph, inflight refactoring, and topology-aware allocation) form the main contribution. The 82-GPU evaluation claims up to 8.5x better resource use, 38.3% lower latency than prior systems, and a drop in reserved GPUs from 75% to 30% of peak. Those numbers would matter if the baselines are comparable and the gains survive real traces.

The work does a decent job naming the static-pipeline problem and showing a concrete way to adapt at runtime. The evaluation scale is reasonable for this area. The soft spot is the missing breakdown of refactoring cost. The abstract and claims treat the transitions as low-overhead, but without separate numbers on per-refactor latency, cache-move time, or failure cases it is hard to know how much of the reported win comes from the new mechanisms versus other tuning. Minor gaps in statistical detail on the runs would also be easy to fix.

This paper is for people who build or tune distributed inference systems, especially anyone dealing with serverless GPU pools and bursty traffic. A reader who needs ideas for handling fragmentation would get practical value from the implementation choices. It deserves a serious referee because the problem is current and the system is tested at cluster scale, even if the experiments need tightening on overhead and baselines. I would send it for peer review with requests for those internal measurements.

Referee Report

2 major / 1 minor

Summary. The paper presents FlexPipe, a system for serving LLMs in serverless clusters with high request variability and GPU fragmentation. It claims to enable dynamic runtime reconfiguration of pipeline architectures via fine-grained model partitioning that preserves computational graph constraints, inflight pipeline refactoring with consistent cache transitions, and topology-aware resource allocation. Evaluation on an 82-GPU cluster is reported to deliver up to 8.5× better resource efficiency, 38.3% lower latency than state-of-the-art systems, and a reduction in required GPU reservations from 75% to 30% of peak capacity.

Significance. If the empirical results can be shown to be robust, the work would constitute a useful contribution to dynamic LLM serving by directly targeting resource fragmentation and workload variability in serverless settings, where static pipelines are known to underutilize hardware.

major comments (2)

[Evaluation] Evaluation section: the abstract and strongest claims assert 8.5× resource-efficiency gains and 38.3% latency reduction on an 82-GPU cluster, yet supply no description of the state-of-the-art baselines, workload traces, statistical significance tests, or controls for confounding factors such as cluster topology or request arrival patterns; these omissions leave the central performance assertions weakly supported.
[Abstract / System Overview] Abstract and system-design description: the key assumption that inflight pipeline refactoring with consistent cache transitions incurs negligible overhead while preserving graph constraints is load-bearing for the reported efficiency numbers, but the manuscript provides no internal measurements (e.g., per-refactor latency, cache-transition cost, or failure rates) to substantiate that the refactoring cost is dominated by request processing time.
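The significance testing requested in the first comment could be as simple as a paired comparison over repeated runs. The sketch below computes a paired t statistic with the standard library only; turning it into a p-value needs a t-distribution table or a statistics package. The latency samples are hypothetical and are not measurements from the paper.

```python
import math
import statistics


def paired_t(baseline, candidate):
    """Paired t statistic and degrees of freedom for matched runs.
    Raises ZeroDivisionError if every paired difference is identical."""
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [b - c for b, c in zip(baseline, candidate)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / std_err, len(diffs) - 1


# Hypothetical per-run mean latencies (seconds), same trace and topology per pair.
baseline_runs = [10, 12, 11, 13]
flexpipe_runs = [8, 9, 9, 10]
t_stat, dof = paired_t(baseline_runs, flexpipe_runs)
```

Pairing runs that share the same trace and cluster topology is what controls for the confounders the comment lists; only the serving system differs within a pair.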

minor comments (1)

Clarify the precise granularity at which model stages are decomposed and how the topology-aware allocator interacts with existing serverless schedulers; a small diagram or pseudocode fragment would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have made revisions to strengthen the presentation of our results and system design.

Referee: [Evaluation] Evaluation section: the abstract and strongest claims assert 8.5× resource-efficiency gains and 38.3% latency reduction on an 82-GPU cluster, yet supply no description of the state-of-the-art baselines, workload traces, statistical significance tests, or controls for confounding factors such as cluster topology or request arrival patterns; these omissions leave the central performance assertions weakly supported.

Authors: We acknowledge that greater explicitness in the evaluation would improve clarity and robustness. The full manuscript already describes the baselines (static pipeline systems and representative dynamic serving frameworks) and the production-derived workload traces in Section 5, along with the 82-GPU cluster topology. However, we agree that adding statistical significance tests and explicit controls for confounding factors would address the concern. In the revised manuscript we have expanded the evaluation section to include: (1) a dedicated table listing all baselines with citations, (2) details of the workload traces including arrival patterns and variability metrics, (3) results of paired t-tests with p-values across repeated runs, and (4) a discussion of experimental controls that fix cluster topology while varying request patterns across multiple independent trials. revision: yes
Referee: [Abstract / System Overview] Abstract and system-design description: the key assumption that inflight pipeline refactoring with consistent cache transitions incurs negligible overhead while preserving graph constraints is load-bearing for the reported efficiency numbers, but the manuscript provides no internal measurements (e.g., per-refactor latency, cache-transition cost, or failure rates) to substantiate that the refactoring cost is dominated by request processing time.

Authors: We agree that direct internal measurements would provide stronger substantiation for the negligible-overhead claim. The current manuscript focuses on end-to-end results, but we have now added a new subsection (5.4) and accompanying figure that reports per-refactor latency, cache-transition costs, and failure rates measured across thousands of refactoring events. These measurements show average refactoring overhead below 5% of typical request processing time, with cache-transition costs under 2 ms and failure rates below 0.1% under the evaluated conditions, confirming that the overhead is indeed dominated by request processing. revision: yes
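The three quantities named in this response (overhead relative to request time, cache-transition cost, failure rate) can all be derived from a plain event log. The sketch below assumes a hypothetical log schema with `refactor_ms`, `cache_ms`, and `ok` fields, and the sample events are invented for illustration.

```python
import statistics


def summarize_refactors(events, typical_request_ms):
    """Summarise refactoring events from a hypothetical log schema."""
    done = [e for e in events if e["ok"]]
    if not done:
        raise ValueError("no successful refactoring events to summarise")
    return {
        # Mean refactor time as a fraction of a typical request's processing time.
        "overhead_ratio": statistics.fmean([e["refactor_ms"] for e in done]) / typical_request_ms,
        "mean_cache_ms": statistics.fmean([e["cache_ms"] for e in done]),
        "failure_rate": 1 - len(done) / len(events),
    }


# Invented sample events, not data from the paper.
events = [
    {"refactor_ms": 40, "cache_ms": 1.5, "ok": True},
    {"refactor_ms": 60, "cache_ms": 2.5, "ok": True},
    {"refactor_ms": 0, "cache_ms": 0.0, "ok": False},
    {"refactor_ms": 50, "cache_ms": 2.0, "ok": True},
]
summary = summarize_refactors(events, typical_request_ms=1000)
```

Failed events are excluded from the latency means and counted only in the failure rate, so a batch of aborted refactorings cannot drag the reported overhead down.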

Circularity Check

0 steps flagged

No significant circularity; system design and empirical results are self-contained

The paper describes a systems contribution with three stated innovations (fine-grained partitioning preserving graph constraints, inflight refactoring with cache transitions, topology-aware allocation) and reports measured end-to-end results on an 82-GPU cluster. No equations, fitted parameters, predictions derived from internal definitions, or load-bearing self-citations appear in the provided text. Efficiency and latency figures are presented as evaluation outcomes rather than quantities obtained by construction from the system's own inputs or prior self-citations. The derivation chain is therefore independent of the target claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the description remains at the level of system architecture and empirical claims.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost.lean · Jcost_pos_of_ne_one · unclear
Paper passage: inflight pipeline refactoring with consistent cache transitions... fine-grained model partitioning with preserved computational graph constraints... topology-aware resource allocation that navigates GPU fragmentation

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Paper passage: dynamic programming algorithm that simultaneously considers communication-computation overlap and future refactoring needs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
cs.DC · 2026-05 · unverdicted · novelty 7.0

Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.

PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
cs.DC · 2026-04 · unverdicted · novelty 7.0

PipeLive enables live pipeline parallelism reconfiguration for LLMs via KV cache redesign and VM-migration-inspired patching, cutting TTFT by 2.5x and reconfiguration time to under 10ms.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 2 Pith papers · 8 internal anchors

[1]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inferenc...

FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters. EUROSYS ’26, April 27–30, 2026, Edinburgh, Scotland UK

[2]

Istemi Ekin Akkus, Ruichuan Chen, Ivica Rimac, Manuel Stein, Klaus Satzke, Andre Beck, Paarijaat Aditya, and Volker Hilt. 2018. SAND: Towards High-Performance Serverless Computing. In Proc. 2018 USENIX Annu. Tech. Conf. USENIX ATC 2018 Boston MA USA July 11-13 2018 (USENIX ATC 2018). 923–935

[4]

Mohamed Alzayat, Jonathan Mace, Peter Druschel, and Deepak Garg. 2023-05-08. Groundhog: Efficient Request Isolation in FaaS. In Proc. Eighteenth Eur. Conf. Comput. Syst. (EuroSys ’23). ACM, 398–415. https://doi.org/10.1145/3552326.3567503

[5]

Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, et al. 2022-11. DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. In SC22 Int. Conf. High Perform. Comput. Netw. Storage Anal. (SC 2022). IEEE, 46:1–46:15. https://doi.org/10.1109/sc41...

[7]

Lixiang Ao, George Porter, and Geoffrey M. Voelker. 2022-03-28. FaaSnap: FaaS Made Fast Using Snapshot-Based VMs. In Proc. Seventeenth Eur. Conf. Comput. Syst. (EuroSys ’22). ACM, 730–746. https://doi.org/10.1145/3492321.3524270

[9]

Sanjith Athlur, Nitika Saran, Muthian Sivathanu, Ramachandran Ramjee, and Nipun Kwatra. 2022-03-28. Varuna: Scalable, Low-Cost Training of Massive Deep Learning Models. In Proc. Seventeenth Eur. Conf. Comput. Syst. (EuroSys ’22). ACM, 472–487. https://doi.org/10.1145/3492321.3519584

[10]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al. Language Models Are Few-Shot Learners. 33 (2020), 1877–1901

[12]

Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2023-10-28. Punica: Multi-Tenant LoRA Serving. https://doi.org/10.48550/arXiv.2310.18547

[13]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019-05-24. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://doi.org/10.48550/arXiv.1810.04805

[14]

Khaled Diab, Parham Yassini, and Mohamed Hefeeda. 2022. Orca: Server-assisted Multicast for Datacenter Networks. In 19th USENIX Symp. Networked Syst. Des. Implement. NSDI 2022 Renton WA USA April 4-6 2022 (NSDI 2022). 1075–1091

[15]

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. 2024-06-13. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. https://doi.org/10.48550/arXiv.2404.02015

[16]

Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, et al. 2021-02-17. DAPPLE: A Pipelined Data Parallel Approach for Training Large Models. In Proc. 26th ACM SIGPLAN Symp. Princ. Pract. Parallel Program. (PPoPP ’21). ACM, 431–445. https://doi.org/10.1145/3437801.3441593

[17]

Mohammadbagher Fotouhi, Derek Chen, and Wes J. Lloyd. 2019-12-09. Function-as-a-Service Application Service Composition: Implications for a Natural Language Processing Application. In Proc. 5th Int. Workshop Serverless Comput. (Middleware ’19). ACM, 49–54. https://doi.org/10.1145/3366623.3368141

[18]

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symp. Oper. Syst. Des. Implement. OSDI 2024 St. Clara CA USA July 10-12 2024, Ada Gavrilovska and Douglas B. Terry (Eds.). USENIX Association, 135–153

[19]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, et al. 2024-11-23. The Llama 3 Herd of Models. https://doi.org/10.48550/arXiv.2407.21783

[20]

Bodun Hu, Jiamin Li, Le Xu, Myungjin Lee, Akshay Jajoo, Geon-Woo Kim, Hong Xu, and Aditya Akella. 2024-09-23. BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models. https://doi.org/10.48550/arXiv.2404.18322

[21]

Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, et al. 2024-01-20. Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads. https://doi.org/10.48550/arXiv.2401.11181

[22]

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, et al. 2019-12-08. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Number 10. Curran Associates Inc., 103–112

[23]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, et al. 2023-10-23. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proc. 29th Symp. Oper. Syst. Princ. (SOSP ’23). ACM, 611–626. https://doi.org/10.1145/3600006.3613165

[24]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In 18th USENIX Symp. Oper. Syst. Des. Implement. OSDI 2024 St. Clara CA USA July 10-12 2024, Ada Gavrilovska and Douglas B. Terry (Eds.). USENIX Association, 155–172

[26]

Jie Li, Laiping Zhao, Yanan Yang, Kunlin Zhan, and Keqiu Li. 2022. Tetris: Memory-efficient Serverless Inference through Tensor Sharing. In Proc. 2022 USENIX Annu. Tech. Conf. USENIX ATC 2022 Carlsbad CA USA July 11-13 2022 (USENIX ATC 2022). USENIX Association

[27]

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, et al. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In 17th USENIX Symp. Oper. Syst. Des. Implement. OSDI 2023 Boston MA USA July 10-12 2023 (OSDI 2023). USENIX Association, 663–679

[28]

Yanying Lin, Yanbo Li, Shijie Peng, Yingfei Tang, Shutian Luo, Haiying Shen, Chengzhong Xu, and Kejiang Ye. 2024-07. QUART: Latency-Aware FaaS System for Pipelining Large Model Inference. In 2024 IEEE 44th Int. Conf. Distrib. Comput. Syst. ICDCS. 1–12. https://doi.org/10.1109/ICDCS60910.2024.00010

[29]

Zhiqi Lin, Youshan Miao, Guodong Liu, Xiaoxiang Shi, Quanlu Zhang, Fan Yang, Saeed Maleki, Yi Zhu, et al. 2023-01-21. SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction. https://doi.org/10.48550/arXiv.2301.08984

[30]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, et al. 2024-08-04. CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming

[31]

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019-10-27. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proc. 27th ACM Symp. Oper. Syst. Princ. (SOSP ’19). ACM, 1–15. https://doi.org/10.1145/3341301.3359646

[32]

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, et al. 2021-11-14. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal. (SC ’21). ACM, 58. https://doi.org/10.1145/3458817.3476209

[33]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In 51st ACM/IEEE Annu. Int. Symp. Comput. Archit. ISCA 2024 B. Aires Argent. June 29 - July 3 2024. IEEE, 118–132. https://doi.org/10.1109/ISCA59077.2024.00019

[34]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022-12-06. Robust Speech Recognition via Large-Scale Weak Supervision. https://doi.org/10.48550/arXiv.2212.04356

[35]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020-11. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In SC20 Int. Conf. High Perform. Comput. Netw. Storage Anal. (SC 2020). IEEE, 20. https://doi.org/10.1109/sc41405.2020.00024

[36]

Alireza Sahraei, Soteris Demetriou, Amirali Sobhgol, Haoran Zhang, Abhigna Nagaraja, Neeraj Pathak, Girish Joshi, Carla Souza, et al. 2023-10-23. XFaaS: Hyperscale and Low Cost Serverless Functions at Meta. In Proc. 29th Symp. Oper. Syst. Princ. (SOSP ’23). ACM, 231–246. https://doi.org/10.1145/3600006.3613155

[37]

Larissa Schmid, Marcin Copik, Alexandru Calotoiu, Laurin Brandner, Anne Koziolek, and Torsten Hoefler. 2025. SeBS-Flow: Benchmarking Serverless Cloud Function Workflows. In Proc. Twent. Eur. Conf. Comput. Syst. EuroSys 2025 Rotterdam Neth. 30 March 2025 - 3 April 2025. ACM, 902–920. https://doi.org/10.1145/3689031.3717465

[38]

Mohammad Shahrad, Rodrigo Fonseca, ’I nigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, et al

[39]

Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider. In Proc. 2020 USENIX Annu. Tech. Conf. USENIX ATC 2020 July 15-17 2020 (USENIX ATC 2020). 205–218

[40]

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, et al. 2024-06-05. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. https://doi.org/10.48550/arXiv.2311.03285

[41]

Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, and Ion Stoica. 2024. Fairness in Serving Large Language Models. In 18th USENIX Symp. Oper. Syst. Des. Implement. OSDI 2024 St. Clara CA USA July 10-12 2024, Ada Gavrilovska and Douglas B. Terry (Eds.). USENIX Association, 965–988

[42]

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Re, et al. 2023-06-15. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. In Int. Conf. Mach. Learn. ICML 2023 23-29 July 2023 Honol. Hawaii USA (ICML 2023). 31094–31116

[43]

Sudipta Saha Shubha, Haiying Shen, and Anand Iyer. 2024. USHER: Holistic Interference Avoidance for Resource Optimized ML Inference. In 18th USENIX Symp. Oper. Syst. Des. Implement. OSDI 2024 St. Clara CA USA July 10-12 2024, Ada Gavrilovska and Douglas B. Terry (Eds.). USENIX Association, 947–964

[44]

Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. In 18th USENIX Symp. Oper. Syst. Des. Implement. OSDI 2024 St. Clara CA USA July 10-12 2024, Ada Gavrilovska and Douglas B. Terry (Eds.). USENIX Association, 173–191

[45]

Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. 2025. Towards End-to-End Optimization of LLM-based Applications with Ayo. In Proc. 30th ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst. Vol. 2 ASPLOS 2025 Rotterdam Neth. 30 March 2025 - 3 April 2025, Lieven Eeckhout, Georgios Smaragdakis, Kaitai Liang, Adrian Sampson, Martha A. Kim, and Christopher ...

[46]

Masahiro Tanaka, Kenjiro Taura, Toshihiro Hanawa, and Kentaro Torisawa. 2021-05. Automatic Graph Partitioning for Very Large-scale Deep Learning. In 2021 IEEE Int. Parallel Distrib. Process. Symp. IPDPS (IPDPS 2021). IEEE, 1004–1013. https://doi.org/10.1109/ipdps49936.2021.00109

[47]

Jakub M Tarnawski, Deepak Narayanan, and Amar Phanishayee. 2021. Piper: Multidimensional Planner for DNN Parallelization. In Adv. Neural Inf. Process. Syst. 34 Annu. Conf. Neural Inf. Process. Syst. 2021 NeurIPS 2021 Dec. 6-14 2021 Virtual (NeurIPS 2021, Vol. 34). Curran Associates, Inc., 24829–24840

[48]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, et al. 2023-02-27. LLaMA: Open and Efficient Foundation Language Models. https://doi.org/10.48550/arXiv.2302.13971

[49]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, et al. 2023-07-19. Llama 2: Open Foundation and Fine-Tuned Chat Models. https://doi.org/10.48550/arXiv.2307.09288

[50]

Ao Wang, Shuai Chang, Huangshi Tian, Hongqi Wang, Haoran Yang, Huiba Li, Rui Du, and Yue Cheng. 2021. FaaSNet: Scalable and Fast Provisioning of Custom Serverless Container Runtimes at Alibaba Cloud Function Compute. In Proc. 2021 USENIX Annu. Tech. Conf. USENIX ATC 2021 July 14-16 2021 (USENIX ATC 2021). USENIX Association, 443–457

[51]

Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. 2023-05-08. Tabi: An Efficient Multi-Level Inference System for Large Language Models. In Proc. Eighteenth Eur. Conf. Comput. Syst. (EuroSys ’23). ACM, 233–248. https://doi.org/10.1145/3552326.3587438

[52]

Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, et al. 2022. MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In 19th USENIX Symp. Networked Syst. Des. Implement. NSDI 2022 Renton WA USA April 4-6 2022 (NSDI 2022). USENIX Association, 945–960

[53]

Qizhen Weng, Lingyun Yang, Yinghao Yu, Wei Wang, Xiaochuan Tang, Guodong Yang, and Liping Zhang. 2023. Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent. In Proc. 2023 USENIX Annu. Tech. Conf. USENIX ATC 2023 Boston MA USA July 10-12 2023 (USENIX ATC 2023). USENIX Association, 995–1008

[54]

Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, et al. 2023-10-01. Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. 17, 2 (2023-10-01), 211–224. https://doi.org/10.14778/3626292.3626303

[55]

Yanan Yang, Laiping Zhao, Yiming Li, Huanyu Zhang, Jie Li, Mingyang Zhao, Xingzhen Chen, and Keqiu Li. 2022-02-28. INFless: A Native Serverless System for Low-Latency, High-Throughput Inference. In Proc. 27th ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS ’22). ACM, 768–781. https://doi.org/10.1145/3503222.3507709

[56]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, et al. 2025. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. In Proc. Twent. Eur. Conf. Comput. Syst. EuroSys 2025 Rotterdam Neth. 30 March 2025 - 3 April 2025. ACM, 94–109. https://doi.org/10.1145/3689031.3696098

[57]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symp. Oper. Syst. Des. Implement. OSDI 2022 Carlsbad CA USA July 11-13 2022 (OSDI 2022). USENIX Association, 521–538

[58]

Shaoxun Zeng, Minhui Xie, Shiwei Gao, Youmin Chen, and Youyou Lu. 2025. Medusa: Accelerating Serverless LLM Inference with Materialization. In Proc. 30th ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst. Vol. 1 ASPLOS 2025 Rotterdam Neth. 30 March 2025 - 3 April 2025, Lieven Eeckhout, Georgios Smaragdakis, Kaitai Liang, Adrian Sampson, Martha A. ...

[59]

Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symp. Networked Syst. Des. Implement. NSDI 2023 Boston MA April 17-19 2023 (NSDI 2023). USENIX Association, 787–808

[60]

Yanqi Zhang, Íñigo Goiri, Gohar Irfan Chaudhry, Rodrigo Fonseca, Sameh Elnikety, Christina Delimitrou, and Ricardo Bianchini. 2021-10-

[61]

Faster and Cheaper Serverless Computing on Harvested Resources. In Proc. ACM SIGOPS 28th Symp. Oper. Syst. Princ. (SOSP ’21). ACM, 724–739. https://doi.org/10.1145/3477132.3483580

[62]

Zili Zhang, Chao Jin, and Xin Jin. 2024. Jolteon: Unleashing the Promise of Serverless for Serverless Workflows. In 21st USENIX Symp. Networked Syst. Des. Implement. NSDI 2024 St. Clara CA April 15-17 2024 (NSDI 2024). USENIX Association, 167–183

[63]

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, et al. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symp. Oper. Syst. Des. Implement. OSDI 2022 Carlsbad CA USA July 11-13 2022 (OSDI 2022). 559–578

[64]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symp. Oper. Syst. Des. Implement. OSDI 2024 St. Clara CA USA July 10-12 2024, Ada Gavrilovska and Douglas B. Terry (Eds.). USENIX Associati...


[1] [1]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inferenc...

FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters. EUROSYS ’26, April 27–30, 2026, Edinburgh, Scotland UK

[2] [2]

Istemi Ekin Akkus, Ruichuan Chen, Ivica Rimac, Manuel Stein, Klaus Satzke, Andre Beck, Paarijaat Aditya, and Volker Hilt. 2018. SAND: Towards High-Performance Serverless Computing. In Proc. 2018 USENIX Annu. Tech. Conf. USENIX ATC 2018 Boston MA USA July 11-13 2018 (USENIX ATC 2018). 923–935

[3] [4]

Mohamed Alzayat, Jonathan Mace, Peter Druschel, and Deepak Garg. 2023-05-08. Groundhog: Efficient Request Isolation in FaaS. In Proc. Eighteenth Eur. Conf. Comput. Syst. (EuroSys ’23). ACM, 398–415. https://doi.org/10.1145/3552326.3567503

[4] [5]

Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, et al. 2022-11. DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. In SC22 Int. Conf. High Perform. Comput. Netw. Storage Anal. (SC 2022). IEEE, 46:1–46:15. https://doi.org/10.1109/sc41...

[5] [7]

Lixiang Ao, George Porter, and Geoffrey M. Voelker. 2022-03-28. FaaSnap: FaaS Made Fast Using Snapshot-Based VMs. In Proc. Seventeenth Eur. Conf. Comput. Syst. (EuroSys ’22). ACM, 730–746. https://doi.org/10.1145/3492321.3524270

[6] [9]

Sanjith Athlur, Nitika Saran, Muthian Sivathanu, Ramachandran Ramjee, and Nipun Kwatra. 2022-03-28. Varuna: Scalable, Low-Cost Training of Massive Deep Learning Models. In Proc. Seventeenth Eur. Conf. Comput. Syst. (EuroSys ’22). ACM, 472–487. https://doi.org/10.1145/3492321.3519584

[7] [10]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al.

[8] [11]

Language Models Are Few-Shot Learners. 33 (2020), 1877–1901

[9] [12]

Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2023-10-28. Punica: Multi-Tenant LoRA Serving. https://doi.org/10.48550/arXiv.2310.18547

[10] [13]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019-05-24. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://doi.org/10.48550/arXiv.1810.04805

[11] [14]

Khaled Diab, Parham Yassini, and Mohamed Hefeeda. 2022. Orca: Server-assisted Multicast for Datacenter Networks. In 19th USENIX Symp. Networked Syst. Des. Implement. NSDI 2022 Renton WA USA April 4-6 2022 (NSDI 2022). 1075–1091

[12] [15]

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. 2024-06-13. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. https://doi.org/10.48550/arXiv.2404.02015

[13] [16]

Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, et al. 2021-02-17. DAPPLE: A Pipelined Data Parallel Approach for Training Large Models. In Proc. 26th ACM SIGPLAN Symp. Princ. Pract. Parallel Program. (PPoPP ’21). ACM, 431–445. https://doi.org/10.1145/3437801.3441593

[14] [17]

Mohammadbagher Fotouhi, Derek Chen, and Wes J. Lloyd. 2019-12-09. Function-as-a-Service Application Service Composition: Implications for a Natural Language Processing Application. In Proc. 5th Int. Workshop Serverless Comput. (Middleware ’19). ACM, 49–54. https://doi.org/10.1145/3366623.3368141

[15] [18]

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symp. Oper. Syst. Des. Implement. OSDI 2024 St. Clara CA USA July 10-12 2024, Ada Gavrilovska and Douglas B. Terry (Eds.). USENIX Association, 135–153

[16] [19]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, et al. 2024-11-23. The Llama 3 Herd of Models. https://doi.org/10.48550/arXiv.2407.21783

[17] [20]

Bodun Hu, Jiamin Li, Le Xu, Myungjin Lee, Akshay Jajoo, Geon-Woo Kim, Hong Xu, and Aditya Akella. 2024-09-23. BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models. https://doi.org/10.48550/arXiv.2404.18322

[18] [21]

Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, et al. 2024-01-20. Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads. https://doi.org/10.48550/arXiv.2401.11181

[19] [22]

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, et al. 2019-12-08. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Number 10. Curran Associates Inc., 103–112

[20] [23]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, et al. 2023-10-23. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proc. 29th Symp. Oper. Syst. Princ. (SOSP ’23). ACM, 611–626. https://doi.org/10.1145/3600006.3613165

[21] [24]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In 18th USENIX Symp. Oper. Syst. Des. Implement. OSDI 2024 St. Clara CA USA July 10-12 2024, Ada Gavrilovska and Douglas B. Terry (Eds.). USENIX Association, 155–172

[22] [26]

Jie Li, Laiping Zhao, Yanan Yang, Kunlin Zhan, and Keqiu Li. 2022. Tetris: Memory-efficient Serverless Inference through Tensor Sharing. In Proc. 2022 USENIX Annu. Tech. Conf. USENIX ATC 2022 Carlsbad CA USA July 11-13 2022 (USENIX ATC 2022). USENIX Association

[23] [27]

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, et al. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In 17th USENIX Symp. Oper. Syst. Des. Implement. OSDI 2023 Boston MA USA July 10-12 2023 (OSDI 2023). USENIX Association, 663–679

[24] [28]

Yanying Lin, Yanbo Li, Shijie Peng, Yingfei Tang, Shutian Luo, Haiying Shen, Chengzhong Xu, and Kejiang Ye. 2024-07. QUART: Latency-Aware FaaS System for Pipelining Large Model Inference. In 2024 IEEE 44th Int. Conf. Distrib. Comput. Syst. ICDCS. 1–12. https://doi.org/10.1109/ICDCS60910.2024.00010

[25] [29]

Zhiqi Lin, Youshan Miao, Guodong Liu, Xiaoxiang Shi, Quanlu Zhang, Fan Yang, Saeed Maleki, Yi Zhu, et al. 2023-01-21. SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction. https://doi.org/10.48550/arXiv.2301.08984

[26] [30]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, et al. 2024-08-04. CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming

[27] [31]

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019-10-27. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proc. 27th ACM Symp. Oper. Syst. Princ. (SOSP ’19). ACM, 1–15. https://doi.org/10.1145/3341301.3359646

[28] [32]

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, et al. 2021-11-14. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal. (SC ’21). ACM, 58. https://doi.org/10.1145/3458817.3476209
