pith. machine review for the scientific record.

arxiv: 2604.17227 · v1 · submitted 2026-04-19 · 💻 cs.DC

Recognition: unknown

Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:22 UTC · model grok-4.3

classification 💻 cs.DC
keywords large language models · cloud-native systems · distributed computing · scalability · efficiency · serverless inference · quantum computing · federated learning

The pith

Large language models depend on cloud-native and distributed systems for efficient scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors aim to establish that cloud-native and distributed systems must be integrated into LLM workflows because traditional setups fall short on the needed scale and efficiency. This matters because LLMs underpin a growing range of AI applications, from natural language processing to code generation, and compute bottlenecks could slow their development. The paper details practical issues like managing data and optimizing resources, then highlights tools such as automatic scaling and hybrid cloud setups as solutions. It also surveys newer ideas, including serverless inference and federated learning, as future paths. Finally, it provides a roadmap stressing the value of shared standards and teamwork across fields.

Core claim

The paper claims that cloud platforms and distributed systems play a key role in supporting the scalability, efficiency, and optimization of large language models. The complexities of LLM deployment involve data management, resource optimization, and the adoption of microservices, autoscaling, and hybrid cloud-edge solutions. Emerging research trends such as serverless inference, quantum computing, and federated learning hold potential to advance LLM capabilities further. A roadmap for future work calls for ongoing research, standardization efforts, and collaboration between sectors to support LLM expansion in research and enterprise settings.

What carries the argument

Cloud-native and distributed architectures, which enable handling of LLM computational demands through features like microservices, autoscaling, and hybrid cloud-edge solutions.
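
To make the load-bearing machinery concrete, here is a minimal sketch of the reactive-autoscaling pattern the paper leans on: a controller that sizes a pool of model replicas from observed request pressure, the way a Kubernetes HorizontalPodAutoscaler sizes pods from a load metric. The names and thresholds below are illustrative assumptions, not values from the paper.

```python
import math
from dataclasses import dataclass

@dataclass
class AutoscalerConfig:
    target_queue_per_replica: float = 8.0  # desired in-flight requests per replica
    min_replicas: int = 1
    max_replicas: int = 64

def desired_replicas(queued: int, cfg: AutoscalerConfig) -> int:
    """Clamped proportional rule: replicas grow with load / target-per-replica."""
    want = math.ceil(queued / cfg.target_queue_per_replica)
    return max(cfg.min_replicas, min(cfg.max_replicas, want))

if __name__ == "__main__":
    cfg = AutoscalerConfig()
    for load in (3, 40, 500, 10_000):
        print(f"{load:>6} queued -> {desired_replicas(load, cfg)} replicas")
```

Production controllers layer smoothing windows and cooldowns on top of this clamped proportional rule to avoid thrashing, but the core decision is this one line of arithmetic.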

If this is right

  • LLM deployment can incorporate data management and resource optimization techniques from cloud systems.
  • Hybrid cloud-edge solutions will address specific deployment complexities.
  • Serverless inference, quantum computing, and federated learning represent promising directions for the next phase of innovation (a minimal serverless handler is sketched after this list).
  • Continued research, standardization, and cross-sector collaboration are required to sustain LLM growth.
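
On the serverless-inference direction flagged above, the essential mechanic is a stateless handler whose model survives across warm invocations of one container but must be re-loaded on a cold start. A hedged sketch follows; load_model, the event shape, and the two-second load are hypothetical stand-ins, not any provider's real API.

```python
import time

_MODEL = None  # survives across warm invocations of the same container

def load_model():
    """Stand-in for weight download + initialization (the cold-start cost)."""
    time.sleep(2.0)
    return lambda prompt: f"echo: {prompt}"

def handler(event: dict) -> dict:
    global _MODEL
    cold = _MODEL is None
    if cold:
        _MODEL = load_model()  # paid only on the cold-start path
    return {"cold_start": cold, "output": _MODEL(event["prompt"])}

if __name__ == "__main__":
    print(handler({"prompt": "hello"}))  # cold start: slow
    print(handler({"prompt": "hello"}))  # warm path: fast
```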

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Organizations without large data centers could leverage these systems to develop competitive LLMs.
  • The emphasis on federated learning may enable privacy-preserving model training across distributed data sources (see the FedAvg sketch after this list).
  • Quantum computing integration could transform energy efficiency in model operations if realized.
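
The federated-learning extension above is worth one concrete turn of the crank. Below is a minimal FedAvg sketch in the spirit of McMahan et al. (2017): each client trains locally, and the coordinator averages only parameter updates, never raw data. The one-parameter model and the client names are deliberately toy assumptions to keep the mechanics visible.

```python
def local_step(w: float, data: list[float], lr: float = 0.1) -> float:
    """One pass of gradient descent on squared error toward local data."""
    for x in data:
        w -= lr * 2 * (w - x)
    return w

def fed_avg(w: float, client_data: dict[str, list[float]]) -> float:
    updates = [local_step(w, d) for d in client_data.values()]  # runs on-device
    return sum(updates) / len(updates)  # the server sees only weights

clients = {"hospital_a": [1.0, 1.2], "hospital_b": [3.0, 2.8]}
w = 0.0
for _ in range(5):
    w = fed_avg(w, clients)
print(round(w, 3))  # drifts toward the global mean without pooling data
```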

Load-bearing premise

That traditional systems are unable to meet the computational requirements of large language models and that the listed emerging trends will drive the next phase of innovation.

What would settle it

A demonstration that a large-scale LLM can be trained and deployed efficiently using only traditional non-cloud, non-distributed infrastructure.

read the original abstract

The rapid rise of Large Language Models (LLMs) has revolutionized various artificial intelligence (AI) applications, from natural language processing to code generation. However, the computational demands of these models, particularly in training and inference, present significant challenges. Traditional systems are often unable to meet these requirements, necessitating the integration of cloud-native and distributed architectures. This paper explores the role of cloud platforms and distributed systems in supporting the scalability, efficiency, and optimization of LLMs. We discuss the complexities of LLM deployment, including data management, resource optimization, and the need for microservices, autoscaling, and hybrid cloud-edge solutions. Additionally, we examine emerging research trends, such as serverless inference, quantum computing, and federated learning, and their potential to drive the next phase of LLM innovation. The paper concludes with a roadmap for future developments, emphasizing the need for continued research, standardization, and cross-sector collaboration to sustain the growth of LLMs in both research and enterprise applications.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the computational demands of Large Language Models (LLMs) necessitate the integration of cloud-native and distributed architectures, as traditional systems are often unable to meet these requirements. It explores complexities in LLM deployment such as data management, resource optimization, microservices, autoscaling, and hybrid cloud-edge solutions. The manuscript examines emerging research trends including serverless inference, quantum computing, and federated learning, and concludes with a roadmap for future developments emphasizing continued research, standardization, and cross-sector collaboration.

Significance. If pursued, this agenda could help direct research toward more scalable and efficient LLM systems by highlighting the intersection of distributed computing and AI challenges. The paper's strength is in providing a high-level synthesis of deployment issues and forward-looking trends, which may stimulate targeted follow-up studies, though its impact as a position paper rests on the actionability of the proposed directions rather than new derivations or data.

major comments (2)
  1. [Abstract] The foundational assertion that 'Traditional systems are often unable to meet these requirements' is presented without specific benchmarks, scaling examples, or citations to LLM performance bottlenecks, yet it is load-bearing for the motivation to integrate cloud-native architectures.
  2. [Abstract, emerging-trends paragraph] The statement that trends such as serverless inference, quantum computing, and federated learning 'will drive the next phase of LLM innovation' is made without analysis of their current maturity, their specific applicability to LLM training and inference, or preliminary evidence, yet it underpins the credibility of the concluding roadmap.
minor comments (2)
  1. The discussion of deployment complexities would benefit from clearer organization, such as explicit subsection headings for data management versus resource optimization, to improve readability.
  2. Additional citations to recent work on distributed LLM training frameworks would help ground the high-level claims in existing literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the two points on the abstract below and will incorporate targeted clarifications to strengthen the grounding of our claims while preserving the high-level nature of this research agenda paper.

read point-by-point responses
  1. Referee: [Abstract] The foundational assertion that 'Traditional systems are often unable to meet these requirements' is presented without specific benchmarks, scaling examples, or citations to LLM performance bottlenecks, yet it is load-bearing for the motivation to integrate cloud-native architectures.

    Authors: We agree that the abstract would benefit from explicit support for this statement. The full manuscript already references scaling challenges such as quadratic attention complexity and the multi-node requirements for training models with hundreds of billions of parameters. In revision, we will add one or two concise citations (e.g., to transformer scaling laws and reported training infrastructure needs) directly in the abstract to anchor the claim without expanding its length or shifting the paper's position-paper character. revision: yes

  2. Referee: [Abstract, emerging-trends paragraph] The statement that trends such as serverless inference, quantum computing, and federated learning 'will drive the next phase of LLM innovation' is made without analysis of their current maturity, their specific applicability to LLM training and inference, or preliminary evidence, yet it underpins the credibility of the concluding roadmap.

    Authors: This observation is valid for the abstract's brevity. The body of the manuscript discusses maturity levels and applicability in more detail (e.g., serverless for inference workloads, federated learning for privacy-preserving training). For the revision, we will soften the phrasing to 'emerging trends with the potential to drive...' and insert a short qualifier noting their varying stages of readiness, thereby improving credibility while keeping the abstract concise and forward-looking. revision: partial
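
The rebuttal's two anchors, quadratic attention cost and multi-node training scale, are easy to sanity-check with back-of-envelope arithmetic. The sketch below uses the standard ~6ND training-FLOPs rule of thumb from the scaling-law literature the authors say they will cite; the GPT-3-scale figures, A100 peak throughput, and 40% utilization are illustrative assumptions, not numbers from the paper.

```python
def train_flops(params: float, tokens: float) -> float:
    """~6 FLOPs per parameter per token, the scaling-law rule of thumb."""
    return 6 * params * tokens

def attention_flops(seq_len: int, d_model: int) -> float:
    """QK^T and AV matmuls: cost grows quadratically with sequence length."""
    return 4 * seq_len**2 * d_model

N, D = 175e9, 300e9   # GPT-3-scale parameter and token counts
gpu = 312e12 * 0.40   # one A100 peak bf16 throughput at an assumed 40% utilization
days = train_flops(N, D) / gpu / 86_400
print(f"single-GPU training time: ~{days:,.0f} days")
ratio = attention_flops(32_768, 12_288) / attention_flops(4_096, 12_288)
print(f"attention cost, 4k -> 32k context: x{ratio:.0f}")
```

At roughly 29,000 single-GPU days for a GPT-3-scale run under these assumptions, the arithmetic alone supports the rebuttal's multi-node premise; the 64x attention blow-up from 4k to 32k context does the same for the quadratic-complexity point.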

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript is a descriptive research agenda surveying LLM challenges and advocating cloud-native/distributed approaches plus emerging trends. It contains no equations, formal derivations, fitted parameters, or predictions that reduce to inputs by construction. Central claims are high-level motivations and calls for future work rather than asserted results resting on self-referential steps, self-citations, or renamed empirical patterns. The text is therefore self-contained with no load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a research agenda without new technical claims, so it introduces no free parameters, axioms, or invented entities beyond standard assumptions in cloud computing literature.

pith-pipeline@v0.9.0 · 5549 in / 1015 out tokens · 40931 ms · 2026-05-10T06:22:55.627606+00:00 · methodology

