pith. machine review for the scientific record.

arxiv: 2604.17227 · v1 · submitted 2026-04-19 · 💻 cs.DC

Recognition: unknown

Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:22 UTC · model grok-4.3

classification 💻 cs.DC
keywords large language models · cloud-native systems · distributed computing · scalability · efficiency · serverless inference · quantum computing · federated learning

The pith

Large language models depend on cloud-native and distributed systems for efficient scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors aim to establish that cloud-native and distributed systems must be integrated into LLM workflows because traditional setups fall short on the needed scale and efficiency. This matters because LLMs underpin a growing range of AI applications, from natural language processing to code generation, and compute bottlenecks could slow their development. The paper details practical issues like managing data and optimizing resources, then highlights tools such as automatic scaling and hybrid cloud setups as solutions. It also surveys newer ideas, including serverless inference and federated learning, as future paths. Finally, it provides a roadmap stressing the value of shared standards and teamwork across fields.

Core claim

The paper claims that cloud platforms and distributed systems play a key role in supporting the scalability, efficiency, and optimization of large language models. The complexities of LLM deployment involve data management, resource optimization, and the adoption of microservices, autoscaling, and hybrid cloud-edge solutions. Emerging research trends such as serverless inference, quantum computing, and federated learning hold potential to advance LLM capabilities further. A roadmap for future work calls for ongoing research, standardization efforts, and collaboration between sectors to support LLM expansion in research and enterprise settings.

What carries the argument

Cloud-native and distributed architectures, which enable handling of LLM computational demands through features like microservices, autoscaling, and hybrid cloud-edge solutions.
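
To make the load-bearing machinery concrete, here is a minimal sketch of the reactive-autoscaling pattern the paper leans on: a controller that sizes a pool of model replicas from observed request pressure, the way a Kubernetes HorizontalPodAutoscaler sizes pods from a load metric. The names and thresholds below are illustrative assumptions, not values from the paper.

```python
import math
from dataclasses import dataclass

@dataclass
class AutoscalerConfig:
    target_queue_per_replica: float = 8.0  # desired in-flight requests per replica
    min_replicas: int = 1
    max_replicas: int = 64

def desired_replicas(queued: int, cfg: AutoscalerConfig) -> int:
    """Clamped proportional rule: replicas grow with load / target-per-replica."""
    want = math.ceil(queued / cfg.target_queue_per_replica)
    return max(cfg.min_replicas, min(cfg.max_replicas, want))

if __name__ == "__main__":
    cfg = AutoscalerConfig()
    for load in (3, 40, 500, 10_000):
        print(f"{load:>6} queued -> {desired_replicas(load, cfg)} replicas")
```

Production controllers layer smoothing windows and cooldowns on top of this clamped proportional rule to avoid thrashing, but the core decision is this one line of arithmetic.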

If this is right

  • LLM deployment can incorporate data management and resource optimization techniques from cloud systems.
  • Hybrid cloud-edge solutions will address specific deployment complexities.
  • Serverless inference, quantum computing, and federated learning represent promising directions for the next phase of innovation (a minimal serverless handler is sketched after this list).
  • Continued research, standardization, and cross-sector collaboration are required to sustain LLM growth.
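
On the serverless-inference direction flagged above, the essential mechanic is a stateless handler whose model survives across warm invocations of one container but must be re-loaded on a cold start. A hedged sketch follows; load_model, the event shape, and the two-second load are hypothetical stand-ins, not any provider's real API.

```python
import time

_MODEL = None  # survives across warm invocations of the same container

def load_model():
    """Stand-in for weight download + initialization (the cold-start cost)."""
    time.sleep(2.0)
    return lambda prompt: f"echo: {prompt}"

def handler(event: dict) -> dict:
    global _MODEL
    cold = _MODEL is None
    if cold:
        _MODEL = load_model()  # paid only on the cold-start path
    return {"cold_start": cold, "output": _MODEL(event["prompt"])}

if __name__ == "__main__":
    print(handler({"prompt": "hello"}))  # cold start: slow
    print(handler({"prompt": "hello"}))  # warm path: fast
```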

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Organizations without large data centers could leverage these systems to develop competitive LLMs.
  • The emphasis on federated learning may enable privacy-preserving model training across distributed data sources (see the FedAvg sketch after this list).
  • Quantum computing integration could transform energy efficiency in model operations if realized.
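
The federated-learning extension above is worth one concrete turn of the crank. Below is a minimal FedAvg sketch in the spirit of McMahan et al. (2017): each client trains locally, and the coordinator averages only parameter updates, never raw data. The one-parameter model and the client names are deliberately toy assumptions to keep the mechanics visible.

```python
def local_step(w: float, data: list[float], lr: float = 0.1) -> float:
    """One pass of gradient descent on squared error toward local data."""
    for x in data:
        w -= lr * 2 * (w - x)
    return w

def fed_avg(w: float, client_data: dict[str, list[float]]) -> float:
    updates = [local_step(w, d) for d in client_data.values()]  # runs on-device
    return sum(updates) / len(updates)  # the server sees only weights

clients = {"hospital_a": [1.0, 1.2], "hospital_b": [3.0, 2.8]}
w = 0.0
for _ in range(5):
    w = fed_avg(w, clients)
print(round(w, 3))  # drifts toward the global mean without pooling data
```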

Load-bearing premise

That traditional systems are unable to meet the computational requirements of large language models and that the listed emerging trends will drive the next phase of innovation.

What would settle it

A demonstration that a large-scale LLM can be trained and deployed efficiently using only traditional non-cloud, non-distributed infrastructure.

read the original abstract

The rapid rise of Large Language Models (LLMs) has revolutionized various artificial intelligence (AI) applications, from natural language processing to code generation. However, the computational demands of these models, particularly in training and inference, present significant challenges. Traditional systems are often unable to meet these requirements, necessitating the integration of cloud-native and distributed architectures. This paper explores the role of cloud platforms and distributed systems in supporting the scalability, efficiency, and optimization of LLMs. We discuss the complexities of LLM deployment, including data management, resource optimization, and the need for microservices, autoscaling, and hybrid cloud-edge solutions. Additionally, we examine emerging research trends, such as serverless inference, quantum computing, and federated learning, and their potential to drive the next phase of LLM innovation. The paper concludes with a roadmap for future developments, emphasizing the need for continued research, standardization, and cross-sector collaboration to sustain the growth of LLMs in both research and enterprise applications.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the computational demands of Large Language Models (LLMs) necessitate the integration of cloud-native and distributed architectures, as traditional systems are often unable to meet these requirements. It explores complexities in LLM deployment such as data management, resource optimization, microservices, autoscaling, and hybrid cloud-edge solutions. The manuscript examines emerging research trends including serverless inference, quantum computing, and federated learning, and concludes with a roadmap for future developments emphasizing continued research, standardization, and cross-sector collaboration.

Significance. If pursued, this agenda could help direct research toward more scalable and efficient LLM systems by highlighting the intersection of distributed computing and AI challenges. The paper's strength is in providing a high-level synthesis of deployment issues and forward-looking trends, which may stimulate targeted follow-up studies, though its impact as a position paper rests on the actionability of the proposed directions rather than new derivations or data.

major comments (2)
  1. [Abstract] The foundational assertion that 'Traditional systems are often unable to meet these requirements' is presented without specific benchmarks, scaling examples, or citations to LLM performance bottlenecks, yet it is load-bearing for the motivation to integrate cloud-native architectures.
  2. [Abstract, emerging-trends paragraph] The statement that trends such as serverless inference, quantum computing, and federated learning 'will drive the next phase of LLM innovation' is made without analysis of their current maturity, their specific applicability to LLM training and inference, or preliminary evidence, yet it underpins the credibility of the concluding roadmap.
minor comments (2)
  1. The discussion of deployment complexities would benefit from clearer organization, such as explicit subsection headings for data management versus resource optimization, to improve readability.
  2. Additional citations to recent work on distributed LLM training frameworks would help ground the high-level claims in existing literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the two points on the abstract below and will incorporate targeted clarifications to strengthen the grounding of our claims while preserving the high-level nature of this research agenda paper.

read point-by-point responses
  1. Referee: [Abstract] The foundational assertion that 'Traditional systems are often unable to meet these requirements' is presented without specific benchmarks, scaling examples, or citations to LLM performance bottlenecks, yet it is load-bearing for the motivation to integrate cloud-native architectures.

    Authors: We agree that the abstract would benefit from explicit support for this statement. The full manuscript already references scaling challenges such as quadratic attention complexity and the multi-node requirements for training models with hundreds of billions of parameters. In revision, we will add one or two concise citations (e.g., to transformer scaling laws and reported training infrastructure needs) directly in the abstract to anchor the claim without expanding its length or shifting the paper's position-paper character. revision: yes

  2. Referee: [Abstract, emerging-trends paragraph] The statement that trends such as serverless inference, quantum computing, and federated learning 'will drive the next phase of LLM innovation' is made without analysis of their current maturity, their specific applicability to LLM training and inference, or preliminary evidence, yet it underpins the credibility of the concluding roadmap.

    Authors: This observation is valid for the abstract's brevity. The body of the manuscript discusses maturity levels and applicability in more detail (e.g., serverless for inference workloads, federated learning for privacy-preserving training). For the revision, we will soften the phrasing to 'emerging trends with the potential to drive...' and insert a short qualifier noting their varying stages of readiness, thereby improving credibility while keeping the abstract concise and forward-looking. revision: partial
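
The rebuttal's two anchors, quadratic attention cost and multi-node training scale, are easy to sanity-check with back-of-envelope arithmetic. The sketch below uses the standard ~6ND training-FLOPs rule of thumb from the scaling-law literature the authors say they will cite; the GPT-3-scale figures, A100 peak throughput, and 40% utilization are illustrative assumptions, not numbers from the paper.

```python
def train_flops(params: float, tokens: float) -> float:
    """~6 FLOPs per parameter per token, the scaling-law rule of thumb."""
    return 6 * params * tokens

def attention_flops(seq_len: int, d_model: int) -> float:
    """QK^T and AV matmuls: cost grows quadratically with sequence length."""
    return 4 * seq_len**2 * d_model

N, D = 175e9, 300e9   # GPT-3-scale parameter and token counts
gpu = 312e12 * 0.40   # one A100 peak bf16 throughput at an assumed 40% utilization
days = train_flops(N, D) / gpu / 86_400
print(f"single-GPU training time: ~{days:,.0f} days")
ratio = attention_flops(32_768, 12_288) / attention_flops(4_096, 12_288)
print(f"attention cost, 4k -> 32k context: x{ratio:.0f}")
```

At roughly 29,000 single-GPU days for a GPT-3-scale run under these assumptions, the arithmetic alone supports the rebuttal's multi-node premise; the 64x attention blow-up from 4k to 32k context does the same for the quadratic-complexity point.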

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript is a descriptive research agenda surveying LLM challenges and advocating cloud-native/distributed approaches plus emerging trends. It contains no equations, formal derivations, fitted parameters, or predictions that reduce to inputs by construction. Central claims are high-level motivations and calls for future work rather than asserted results resting on self-referential steps, self-citations, or renamed empirical patterns. The text is therefore self-contained with no load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a research agenda without new technical claims, so it introduces no free parameters, axioms, or invented entities beyond standard assumptions in cloud computing literature.

pith-pipeline@v0.9.0 · 5549 in / 1015 out tokens · 40931 ms · 2026-05-10T06:22:55.627606+00:00 · methodology

