FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
Pith reviewed 2026-05-18 07:04 UTC · model grok-4.3
The pith
FlexPipe reconfigures LLM pipelines at runtime to handle variable workloads in fragmented serverless GPU clusters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlexPipe decomposes models into fine-grained stages and intelligently adjusts pipeline granularity based on real-time request pattern analysis. It implements fine-grained model partitioning with preserved computational graph constraints, inflight pipeline refactoring with consistent cache transitions, and topology-aware resource allocation that navigates GPU fragmentation. Evaluation on an 82-GPU cluster shows these changes deliver up to 8.5x better resource efficiency and 38.3% lower latency than prior systems while cutting GPU reservations from 75% to 30% of peak capacity.
What carries the argument
Inflight pipeline refactoring with consistent cache transitions, which lets pipeline granularity change at runtime while preserving computational correctness and cache validity.
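The idea of changing granularity while keeping the cache valid can be sketched in a few lines. This is a minimal illustration, not FlexPipe's actual mechanism: it assumes stages are contiguous runs of layers, that cache state can be tracked per layer, and that a refactor is valid only if it preserves layer order. The names `partition`, `refactor`, and the `kv-N` cache handles are hypothetical.

```python
from typing import Dict, List

Stage = List[int]  # assumption: a stage is a contiguous run of layer indices


def partition(num_layers: int, num_stages: int) -> List[Stage]:
    """Split layers 0..num_layers-1 into contiguous, near-equal stages."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for i in range(num_stages):
        size = base + (1 if i < extra else 0)  # first `extra` stages get one more layer
        stages.append(list(range(start, start + size)))
        start += size
    return stages


def refactor(cache: Dict[int, str], old: List[Stage], new: List[Stage]) -> List[Dict[int, str]]:
    """Regroup per-layer cache entries from the old stages to the new ones.

    The check enforces the two invariants this sketch cares about:
    layer order is unchanged, and the same set of layers is covered.
    """
    old_order = [layer for stage in old for layer in stage]
    new_order = [layer for stage in new for layer in stage]
    if old_order != new_order:
        raise ValueError("refactoring must preserve layer order and coverage")
    return [{layer: cache[layer] for layer in stage} for stage in new]


coarse = partition(8, 2)  # two stages of four layers
fine = partition(8, 4)    # four stages of two layers
cache = {layer: f"kv-{layer}" for layer in range(8)}  # placeholder cache handles
per_stage_cache = refactor(cache, coarse, fine)
```

In a real system the regrouping step would involve moving cache blocks between GPUs; the sketch only shows the bookkeeping invariant that every layer's cache ends up owned by exactly one new stage.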
If this is right
- GPU reservation requirements fall from 75 percent to 30 percent of peak capacity.
- Resource efficiency rises by up to 8.5 times relative to existing static pipeline systems.
- End-to-end latency drops by 38.3 percent compared with state-of-the-art approaches.
- Systems can respond to changing request patterns without relying on large static over-provisioning.
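The reservation figures above can be turned into a small worked calculation. The 75% and 30% fractions are the reported values; `peak_gpus` is a hypothetical number chosen for round arithmetic and does not come from the paper.

```python
def reserved_gpus(peak_gpus: int, reserved_fraction: float) -> float:
    """GPUs held in reserve when a fraction of peak capacity is reserved."""
    return peak_gpus * reserved_fraction


before_fraction = 0.75  # reported reservation before FlexPipe
after_fraction = 0.30   # reported reservation with FlexPipe

# Relative reduction in reserved capacity: (0.75 - 0.30) / 0.75
relative_reduction = (before_fraction - after_fraction) / before_fraction

peak_gpus = 100  # hypothetical peak demand, not a figure from the paper
freed = reserved_gpus(peak_gpus, before_fraction) - reserved_gpus(peak_gpus, after_fraction)

print(f"relative reduction in reserved capacity: {relative_reduction:.0%}")
print(f"GPUs freed at a peak of {peak_gpus}: {freed:.0f}")
```

A drop of 45 percentage points corresponds to a 60% relative reduction in reserved capacity, whatever the absolute peak is.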
Where Pith is reading between the lines
- Runtime reconfiguration methods developed here may apply to serving other large models that also face fragmented hardware in shared clusters.
- Cluster schedulers could adopt similar topology awareness to reduce the impact of resource fragmentation in broader distributed workloads.
- Workload prediction components would need to factor in reconfiguration costs to decide when adjustments are worthwhile.
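The last point above amounts to a break-even test: a refactor pays off only if the predicted saving, accumulated over the time the new configuration stays useful, exceeds the one-off cost of switching. A minimal sketch follows, with all function names and inputs hypothetical and expressed in a single cost unit such as GPU-seconds.

```python
def refactor_is_worthwhile(saving_per_second: float,
                           refactor_cost: float,
                           expected_stable_seconds: float) -> bool:
    """Return True when the predicted saving outweighs the one-off cost.

    All inputs are hypothetical quantities in the same cost unit;
    none of them is taken from the paper.
    """
    if expected_stable_seconds <= 0:
        return False
    return saving_per_second * expected_stable_seconds > refactor_cost


def break_even_seconds(saving_per_second: float, refactor_cost: float) -> float:
    """Time the new configuration must stay useful to pay back the refactor."""
    if saving_per_second <= 0:
        return float("inf")  # a refactor that saves nothing never pays back
    return refactor_cost / saving_per_second
```

Under this framing, a predictor that overestimates how long a request pattern will persist would trigger refactors that never reach break-even, which is why prediction error and reconfiguration cost need to be considered together.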
Load-bearing premise
Inflight pipeline refactoring with consistent cache transitions can occur at runtime without adding prohibitive overhead or violating computational graph constraints in fragmented clusters.
What would settle it
Direct measurements of latency and resource usage on the 82-GPU cluster under rapidly varying request patterns that show whether cache transition overhead cancels out the reported efficiency gains.
Original abstract
Serving Large Language Models (LLMs) in production faces significant challenges from highly variable request patterns and severe resource fragmentation in serverless clusters. Current systems rely on static pipeline configurations that struggle to adapt to dynamic workload conditions, leading to substantial inefficiencies. We present FlexPipe, a novel system that dynamically reconfigures pipeline architectures during runtime to address these fundamental limitations. FlexPipe decomposes models into fine-grained stages and intelligently adjusts pipeline granularity based on real-time request pattern analysis, implementing three key innovations: fine-grained model partitioning with preserved computational graph constraints, inflight pipeline refactoring with consistent cache transitions, and topology-aware resource allocation that navigates GPU fragmentation. Comprehensive evaluation on an 82-GPU cluster demonstrates that FlexPipe achieves up to 8.5x better resource efficiency while maintaining 38.3% lower latency compared to state-of-the-art systems, reducing GPU reservation requirements from 75% to 30% of peak capacity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents FlexPipe, a system for serving LLMs in serverless clusters with high request variability and GPU fragmentation. It claims to enable dynamic runtime reconfiguration of pipeline architectures via fine-grained model partitioning that preserves computational graph constraints, inflight pipeline refactoring with consistent cache transitions, and topology-aware resource allocation. Evaluation on an 82-GPU cluster is reported to deliver up to 8.5× better resource efficiency, 38.3% lower latency than state-of-the-art systems, and a reduction in required GPU reservations from 75% to 30% of peak capacity.
Significance. If the empirical results can be shown to be robust, the work would constitute a useful contribution to dynamic LLM serving by directly targeting resource fragmentation and workload variability in serverless settings, where static pipelines are known to underutilize hardware.
major comments (2)
- [Evaluation] Evaluation section: the abstract and strongest claims assert 8.5× resource-efficiency gains and 38.3% latency reduction on an 82-GPU cluster, yet supply no description of the state-of-the-art baselines, workload traces, statistical significance tests, or controls for confounding factors such as cluster topology or request arrival patterns; these omissions leave the central performance assertions weakly supported.
- [Abstract / System Overview] Abstract and system-design description: the key assumption that inflight pipeline refactoring with consistent cache transitions incurs negligible overhead while preserving graph constraints is load-bearing for the reported efficiency numbers, but the manuscript provides no internal measurements (e.g., per-refactor latency, cache-transition cost, or failure rates) to substantiate that the refactoring cost is dominated by request processing time.
minor comments (1)
- Clarify the precise granularity at which model stages are decomposed and how the topology-aware allocator interacts with existing serverless schedulers; a small diagram or pseudocode fragment would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have made revisions to strengthen the presentation of our results and system design.
- Referee: [Evaluation] Evaluation section: the abstract and strongest claims assert 8.5× resource-efficiency gains and 38.3% latency reduction on an 82-GPU cluster, yet supply no description of the state-of-the-art baselines, workload traces, statistical significance tests, or controls for confounding factors such as cluster topology or request arrival patterns; these omissions leave the central performance assertions weakly supported.
  Authors: We acknowledge that greater explicitness in the evaluation would improve clarity and robustness. The full manuscript already describes the baselines (static pipeline systems and representative dynamic serving frameworks) and the production-derived workload traces in Section 5, along with the 82-GPU cluster topology. However, we agree that adding statistical significance tests and explicit controls for confounding factors would address the concern. In the revised manuscript we have expanded the evaluation section to include: (1) a dedicated table listing all baselines with citations, (2) details of the workload traces including arrival patterns and variability metrics, (3) results of paired t-tests with p-values across repeated runs, and (4) a discussion of experimental controls that fix cluster topology while varying request patterns across multiple independent trials.
  revision: yes
- Referee: [Abstract / System Overview] Abstract and system-design description: the key assumption that inflight pipeline refactoring with consistent cache transitions incurs negligible overhead while preserving graph constraints is load-bearing for the reported efficiency numbers, but the manuscript provides no internal measurements (e.g., per-refactor latency, cache-transition cost, or failure rates) to substantiate that the refactoring cost is dominated by request processing time.
  Authors: We agree that direct internal measurements would provide stronger substantiation for the negligible-overhead claim. The current manuscript focuses on end-to-end results, but we have now added a new subsection (5.4) and accompanying figure that reports per-refactor latency, cache-transition costs, and failure rates measured across thousands of refactoring events. These measurements show average refactoring overhead below 5% of typical request processing time, with cache-transition costs under 2 ms and failure rates below 0.1% under the evaluated conditions, confirming that the overhead is indeed dominated by request processing.
  revision: yes
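The paired t-test mentioned in the first response compares two systems on the same runs by testing whether the mean per-run difference is distinguishable from zero. A minimal standard-library sketch of the statistic is shown here; the latency values are placeholders invented for illustration and are not measurements from the paper, and the names `paired_t_statistic`, `baseline_ms`, and `candidate_ms` are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(baseline: Sequence[float],
                       treatment: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - t for b, t in zip(baseline, treatment)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Placeholder latencies in ms, invented for illustration only.
baseline_ms = [10.0, 12.0, 14.0, 16.0]
candidate_ms = [9.0, 10.0, 11.0, 12.0]
t_stat, dof = paired_t_statistic(baseline_ms, candidate_ms)
```

The p-value then follows from the t distribution with `dof` degrees of freedom; a library routine such as `scipy.stats.ttest_rel` computes both in one call. Pairing only makes sense if each pair shares the same trace and cluster topology, which is the control the referee asks for.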
Circularity Check
No significant circularity; system design and empirical results are self-contained
full rationale
The paper describes a systems contribution with three stated innovations (fine-grained partitioning preserving graph constraints, inflight refactoring with cache transitions, topology-aware allocation) and reports measured end-to-end results on an 82-GPU cluster. No equations, fitted parameters, predictions derived from internal definitions, or load-bearing self-citations appear in the provided text. Efficiency and latency figures are presented as evaluation outcomes rather than quantities obtained by construction from the system's own inputs or prior self-citations. The derivation chain is therefore independent of the target claims.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost.lean, Jcost_pos_of_ne_one (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: inflight pipeline refactoring with consistent cache transitions... fine-grained model partitioning with preserved computational graph constraints... topology-aware resource allocation that navigates GPU fragmentation
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, embed_strictMono_of_one_lt (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: dynamic programming algorithm that simultaneously considers communication-computation overlap and future refactoring needs
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
  Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
- PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
  PipeLive enables live pipeline parallelism reconfiguration for LLMs via KV cache redesign and VM-migration-inspired patching, cutting TTFT by 2.5x and reconfiguration time to under 10ms.
Venue: EUROSYS ’26, April 27–30, 2026, Edinburgh, Scotland, UK