Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
Pith reviewed 2026-05-10 06:22 UTC · model grok-4.3
The pith
Large language models depend on cloud-native and distributed systems for efficient scaling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that cloud platforms and distributed systems play a key role in supporting the scalability, efficiency, and optimization of large language models. The complexities of LLM deployment involve data management, resource optimization, and the adoption of microservices, autoscaling, and hybrid cloud-edge solutions. Emerging research trends such as serverless inference, quantum computing, and federated learning hold potential to advance LLM capabilities further. A roadmap for future work calls for ongoing research, standardization efforts, and collaboration between sectors to support LLM expansion in research and enterprise settings.
What carries the argument
Cloud-native and distributed architectures, which enable handling of LLM computational demands through features like microservices, autoscaling, and hybrid cloud-edge solutions.
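Purely as an illustration of the autoscaling ingredient named above (not code from the paper), the sketch below mirrors the proportional rule used by horizontal autoscalers, desired = ceil(current * observed / target). The metric name `queued_requests`, the target of four queued requests per replica, and the replica bounds are assumptions made for this example.

```python
"""Toy autoscaling decision for an LLM inference deployment (illustrative only)."""
import math

def desired_replicas(current_replicas: int,
                     queued_requests: int,
                     target_queue_per_replica: int = 4,
                     min_replicas: int = 1,
                     max_replicas: int = 32) -> int:
    # Observed load per replica, compared against the target load per replica.
    observed = queued_requests / max(current_replicas, 1)
    proposal = math.ceil(current_replicas * observed / target_queue_per_replica)
    # Clamp to the configured bounds before acting on the proposal.
    return max(min_replicas, min(max_replicas, proposal))

if __name__ == "__main__":
    # 3 replicas with 30 queued requests -> scale out to 8 replicas.
    print(desired_replicas(current_replicas=3, queued_requests=30))
```

A real deployment would smooth the signal and account for SLOs such as time-to-first-token; the proportional shape of the decision is the only point of the sketch.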
If this is right
- LLM deployment can incorporate data management and resource optimization techniques from cloud systems.
- Hybrid cloud-edge solutions will address specific deployment complexities.
- Serverless inference, quantum computing, and federated learning represent promising directions for the next phase of innovation.
- Continued research, standardization, and cross-sector collaboration are required to sustain LLM growth.
Where Pith is reading between the lines
- Organizations without large data centers could leverage these systems to develop competitive LLMs.
- The emphasis on federated learning may enable privacy-preserving model training across distributed data sources (a minimal sketch of this pattern appears after this list).
- Quantum computing integration could transform energy efficiency in model operations if realized.
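To make the federated-learning bullet above concrete, here is a minimal federated-averaging sketch in the spirit of McMahan et al. (reference [38] below): each data holder trains locally and only parameter vectors, never raw data, are aggregated. The toy linear model, client sizes, and hyperparameters are our assumptions for illustration, not anything specified by the paper.

```python
"""Minimal federated-averaging (FedAvg) sketch with a toy linear model."""
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local gradient steps on a least-squares objective."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(global_w, clients):
    """Average client models, weighted by local dataset size."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    local_models = [local_update(global_w, X, y) for X, y in clients]
    return np.average(local_models, axis=0, weights=sizes)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    # Three clients with private data that never leaves their machines.
    clients = []
    for n in (50, 80, 120):
        X = rng.normal(size=(n, 2))
        clients.append((X, X @ true_w + 0.01 * rng.normal(size=n)))
    w = np.zeros(2)
    for _ in range(20):
        w = fedavg_round(w, clients)
    print(w)  # approaches [2.0, -1.0] without any client sharing its data
```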
Load-bearing premise
That traditional systems are unable to meet the computational requirements of large language models and that the listed emerging trends will drive the next phase of innovation.
What would settle it
A demonstration that a large-scale LLM can be trained and deployed efficiently using only traditional non-cloud, non-distributed infrastructure.
Original abstract
The rapid rise of Large Language Models (LLMs) has revolutionized various artificial intelligence (AI) applications, from natural language processing to code generation. However, the computational demands of these models, particularly in training and inference, present significant challenges. Traditional systems are often unable to meet these requirements, necessitating the integration of cloud-native and distributed architectures. This paper explores the role of cloud platforms and distributed systems in supporting the scalability, efficiency, and optimization of LLMs. We discuss the complexities of LLM deployment, including data management, resource optimization, and the need for microservices, autoscaling, and hybrid cloud-edge solutions. Additionally, we examine emerging research trends, such as serverless inference, quantum computing, and federated learning, and their potential to drive the next phase of LLM innovation. The paper concludes with a roadmap for future developments, emphasizing the need for continued research, standardization, and cross-sector collaboration to sustain the growth of LLMs in both research and enterprise applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the computational demands of Large Language Models (LLMs) necessitate the integration of cloud-native and distributed architectures, as traditional systems are often unable to meet these requirements. It explores complexities in LLM deployment such as data management, resource optimization, microservices, autoscaling, and hybrid cloud-edge solutions. The manuscript examines emerging research trends including serverless inference, quantum computing, and federated learning, and concludes with a roadmap for future developments emphasizing continued research, standardization, and cross-sector collaboration.
Significance. If pursued, this agenda could help direct research toward more scalable and efficient LLM systems by highlighting the intersection of distributed computing and AI challenges. The paper's strength lies in its high-level synthesis of deployment issues and forward-looking trends, which may stimulate targeted follow-up studies; its impact as a position paper, however, rests on the actionability of the proposed directions rather than on new derivations or data.
Major comments (2)
- [Abstract] The foundational assertion that 'Traditional systems are often unable to meet these requirements' is presented without specific benchmarks, scaling examples, or citations to LLM performance bottlenecks, even though it is load-bearing for the motivation to integrate cloud-native architectures.
- [Abstract, emerging-trends paragraph] The claim that trends such as serverless inference, quantum computing, and federated learning 'will drive the next phase of LLM innovation' is made without analysis of their current maturity, their specific applicability to LLM training and inference, or preliminary evidence, even though it underpins the credibility of the concluding roadmap.
Minor comments (2)
- The discussion of deployment complexities would benefit from clearer organization, such as explicit subsection headings for data management versus resource optimization, to improve readability.
- Additional citations to recent work on distributed LLM training frameworks would help ground the high-level claims in existing literature.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address the two points on the abstract below and will incorporate targeted clarifications to strengthen the grounding of our claims while preserving the high-level nature of this research agenda paper.
Point-by-point responses
Referee: [Abstract] The foundational assertion that 'Traditional systems are often unable to meet these requirements' is presented without specific benchmarks, scaling examples, or citations to LLM performance bottlenecks, even though it is load-bearing for the motivation to integrate cloud-native architectures.
Authors: We agree that the abstract would benefit from explicit support for this statement. The full manuscript already references scaling challenges such as quadratic attention complexity and the multi-node requirements for training models with hundreds of billions of parameters. In revision, we will add one or two concise citations (e.g., to transformer scaling laws and reported training infrastructure needs) directly in the abstract to anchor the claim without expanding its length or shifting the paper's position-paper character. revision: yes
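To show the kind of quantitative support this response points to, the following back-of-envelope sketch (our illustration, not taken from the manuscript) estimates the training-state memory of a model with hundreds of billions of parameters and the quadratic growth of attention cost with context length. The 16-bytes-per-parameter training-state figure and the 80 GB-per-device capacity are assumptions for this example.

```python
"""Back-of-envelope check of the scaling claims referenced in this response."""
import math

def training_memory_gb(params: float, bytes_per_param: float = 16.0) -> float:
    """Approximate memory for weights, gradients, and optimizer state (assumed 16 B/param)."""
    return params * bytes_per_param / 1e9

def min_accelerators(params: float, mem_per_device_gb: float = 80.0) -> int:
    """Lower bound on devices needed just to hold the training state."""
    return math.ceil(training_memory_gb(params) / mem_per_device_gb)

def attention_cost_ratio(short_seq: int, long_seq: int) -> float:
    """Quadratic attention: relative cost of a longer context."""
    return (long_seq / short_seq) ** 2

if __name__ == "__main__":
    p = 175e9  # a hypothetical 175B-parameter model
    print(f"~{training_memory_gb(p):,.0f} GB of training state")            # ~2,800 GB
    print(f"needs >= {min_accelerators(p)} 80 GB accelerators")             # 35, before activations
    print(f"8k vs 2k context costs {attention_cost_ratio(2048, 8192):.0f}x more attention FLOPs")  # 16x
```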
Referee: [Abstract, emerging-trends paragraph] The claim that trends such as serverless inference, quantum computing, and federated learning 'will drive the next phase of LLM innovation' is made without analysis of their current maturity, their specific applicability to LLM training and inference, or preliminary evidence, even though it underpins the credibility of the concluding roadmap.
Authors: This observation is fair given the abstract's brevity. The body of the manuscript discusses maturity levels and applicability in more detail (e.g., serverless for inference workloads, federated learning for privacy-preserving training). For the revision, we will soften the phrasing to 'emerging trends with the potential to drive...' and insert a short qualifier noting their varying stages of readiness, thereby improving credibility while keeping the abstract concise and forward-looking. revision: partial
Circularity Check
No significant circularity
Full rationale
The manuscript is a descriptive research agenda surveying LLM challenges and advocating cloud-native/distributed approaches plus emerging trends. It contains no equations, formal derivations, fitted parameters, or predictions that reduce to inputs by construction. Central claims are high-level motivations and calls for future work rather than asserted results resting on self-referential steps, self-citations, or renamed empirical patterns. The text is therefore self-contained with no load-bearing circular reductions.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Zhong Y, Liu S, Chen J, et al. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In: Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation. 2024:193–210.
- [2] Pahl C, Brogi A, Soldani J, Jamshidi P. Cloud container technologies: a state-of-the-art review. IEEE Transactions on Cloud Computing. 2017;7(3):677–692.
- [3] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Proceedings of the Advances in Neural Information Processing Systems. 2017.
- [4] Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. 2020.
- [5] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–444.
- [6] Isaev M, McDonald N, Vuduc R. Scaling infrastructure to support multi-trillion parameter LLM training. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2023.
- [7] Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. 2020.
- [8] Zhang Y, You Y. SpeedLoader: An I/O efficient scheme for heterogeneous and distributed LLM operation. In: Proceedings of the Advances in Neural Information Processing Systems. Curran Associates, Inc. 2024:34637–34655.
- [9] Aminabadi RY, Rajbhandari S, Awan AA, et al. DeepSpeed-Inference: enabling efficient inference of transformer models at unprecedented scale. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2022:1–15.
- [10] Dao T, Fu D, Ermon S, Rudra A, Ré C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In: Proceedings of the Advances in Neural Information Processing Systems. Curran Associates, Inc. 2022:16344–16359.
- [11] Tan X, Jiang Y, Yang Y, Xu H. Towards end-to-end optimization of LLM-based applications with Ayo. In: Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2025:1302–1316.
- [12] Jouppi NP, Young C, Patil N, et al. In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture. 2017:1–12.
- [13] Zheng L, Li Z, Zhang H, et al. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In: Proceedings of the USENIX Symposium on Operating Systems Design and Implementation. 2022.
- [14] Kwon W, Li Z, Zhuang S, et al. Efficient memory management for large language model serving with PagedAttention. In: Proceedings of the 29th Symposium on Operating Systems Principles. 2023:611–626.
- [15] Fazio M, Celesti A, Ranjan R, Liu C, Chen L, Villari M. Open issues in scheduling microservices in the cloud. IEEE Cloud Computing. 2016;3(5):81–88.
- [16] The Kubernetes Authors. Kubernetes: Production-Grade Container Orchestration. https://kubernetes.io. Accessed: 2026-04-08.
- [17] Sheng Y, Zheng L, Yuan B, et al. FlexGen: High-throughput generative inference of large language models with a single GPU. In: Proceedings of the International Conference on Machine Learning. 2023:31094–31116.
- [18] Yu GI, Jeong JS, Kim GW, Kim S, Chun BG. Orca: A distributed serving system for Transformer-based generative models. In: Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation. 2022:521–538.
- [19] Mohammadi M, Li Y, Lo J, Yip W. Evaluation and benchmarking of LLM agents: A survey. In: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. 2025:6129–6139.
- [20] Chen Y, Cui W, Zhao H, et al. Towards High-Goodput LLM Serving with Prefill-decode Multiplexing. In: Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 2026:2030–2047.
- [21] Rombaut B, Masoumzadeh S, Vasilevski K, Lin D, Hassan AE. Watson: A Cognitive Observability Framework for the Reasoning of LLM-Powered Agents. In: Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering. 2025:739–751.
- [22] Brambilla M, Ceri S, Fraternali P, Manolescu I. Process modeling in web applications. ACM Transactions on Software Engineering and Methodology. 2006;15(4):360–409.
- [23] Jha DN, Li Y, Jayaraman PP, et al. Challenges in deployment and configuration management in cyber physical system. Handbook of Integration of Cloud Computing, Cyber Physical Systems and Internet of Things. 2020:215–235.
- [24] Zhang Y, Cao T, Li S, et al. Parallel processing systems for big data: a survey. Proceedings of the IEEE. 2016;104(11):2114–2136.
- [25] Liu X, Buyya R. Resource management and scheduling in distributed stream processing systems: a taxonomy, review, and future directions. ACM Computing Surveys. 2020;53(3):1–41.
- [26] Ye Z, Gao W, Hu Q, et al. Deep learning workload scheduling in GPU datacenters: A survey. ACM Computing Surveys. 2024;56(6):1–38.
- [27] Wen L, Xu M, Gill SS, et al. StatuScale: Status-aware and Elastic Scaling Strategy for Microservice Applications. ACM Transactions on Autonomous and Adaptive Systems. 2025;20(1). doi:10.1145/3686253.
- [28] Heisler M, Yousefijamarani Z, Wang X, et al. LLM Inference Scheduling: A Survey of Techniques, Frameworks, and Trade-offs. Authorea Preprints. 2025.
- [29] Xu M, Liao J, Wu J, He Y, Ye K, Xu C. Cloud native system for LLM inference serving. arXiv preprint arXiv:2507.18007. 2025.
- [30] Zhu K, Gao Y, Zhao Y, et al. NanoFlow: Towards optimal large language model serving throughput. In: Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation. 2025:749–765.
- [31] Ye Z, Chen L, Lai R, et al. FlashInfer: Efficient and customizable attention engine for LLM inference serving. Proceedings of Machine Learning and Systems. 2025;7.
- [32] Zhang C, Du K, Liu S, et al. JENGA: Effective memory management for serving LLM with heterogeneity. In: Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 2025:446–461.
- [33] Kakolyris AK, Masouros D, Vavaroutsos P, Xydis S, Soudris D. throttll'em: Predictive GPU throttling for energy efficient LLM inference serving. In: Proceedings of the 2025 IEEE International Symposium on High Performance Computer Architecture. 2025:1363–1378.
- [34] Li S, Wang H, Xu W, et al. Collaborative inference and learning between edge SLMs and cloud LLMs: A survey of algorithms, execution, and open challenges. arXiv preprint arXiv:2507.16731. 2025.
- [35] Carlini N, Tramer F, Wallace E, et al. Extracting training data from large language models. In: Proceedings of the 30th USENIX Security Symposium. 2021:2633–2650.
- [36] Carlini N, Ippolito D, Jagielski M, Lee K, Tramer F, Zhang C. Quantifying memorization across neural language models. In: Proceedings of the 11th International Conference on Learning Representations. 2022.
- [37] Abadi M, Chu A, Goodfellow I, et al. Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 2016:308–318.
- [38] McMahan B, Moore E, Ramage D, Hampson S, y Arcas BA. Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics. 2017:1273–1282.
- [39] Ohrimenko O, Schuster F, Fournet C, et al. Oblivious Multi-Party machine learning on trusted processors. In: Proceedings of the 25th USENIX Security Symposium. 2016:619–636.
- [40] Greshake K, Abdelnabi S, Mishra S, Endres C, Holz T, Fritz M. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In: Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. 2023:79–90.
- [41] Tramèr F, Zhang F, Juels A, Reiter MK, Ristenpart T. Stealing machine learning models via prediction APIs. In: Proceedings of the 25th USENIX Security Symposium. 2016:601–618.
- [42] Zou A, Wang Z, Carlini N, Nasr M, Kolter JZ, Fredrikson M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043. 2023.
- [43] Das BC, Amini MH, Wu Y. Security and privacy challenges of large language models: A survey. ACM Computing Surveys. 2025;57(6):1–39.
- [44] Bian J, Al Arafat A, Xiong H, et al. Machine learning in real-time Internet of Things (IoT) systems: A survey. IEEE Internet of Things Journal. 2022;9(11):8364–8386.
- [45] Ait Errami S, Hajji H, Ait El Kadi K, Badir H. Spatial big data architecture: from data warehouses and data lakes to the Lakehouse. Journal of Parallel and Distributed Computing. 2023;176:70–79.
- [46] Hai R, Koutras C, Quix C, Jarke M. Data lakes: A survey of functions and systems. IEEE Transactions on Knowledge and Data Engineering. 2023;35(12):12571–12590.
- [47] Zhao S, Yang Y, Wang Z, He Z, Qiu LK, Qiu L. Retrieval augmented generation (RAG) and beyond: A comprehensive survey on how to make your LLMs use external data more wisely. arXiv preprint arXiv:2409.14924. 2024.
- [48] Sim J, Ahn S, Ahn T, et al. Computational CXL-memory solution for accelerating memory-intensive applications. IEEE Computer Architecture Letters. 2022;22(1):5–8.
- [49] Gu R, Xu Z, Che Y, et al. High-level data abstraction and elastic data caching for data-intensive AI applications on cloud-native platforms. IEEE Transactions on Parallel and Distributed Systems. 2023;34(11):2946–2964.
- [50] Gao Y, Xiong Y, Gao X, et al. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997. 2023;2(1):32.
- [51] Echihabi K, Palpanas T, Zoumpatianos K. New Trends in High-D Vector Similarity Search: AI-driven, Progressive, and Distributed. Proceedings of the VLDB Endowment. 2021;14(12):3198–3201.
- [52] Yin J, Zeng Z, Li M, et al. Unleash LLMs potential for sequential recommendation by coordinating dual dynamic index mechanism. In: Proceedings of the 2025 ACM on Web Conference. 2025:216–227.
- [53] Li H, Fu F, Lin S, et al. Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment. Proceedings of the ACM on Management of Data. 2025;3(6):1–30.
- [54] Iyengar A, Kundu A, Kompella R, Mamidi SN. A generative caching system for large language models. arXiv preprint arXiv:2503.17603. 2025.
- [55] Hohensinner R, Mutlu B, Estrada IGM, Vukovic M, Kopeinik S, Kern R. Tracing the Data Trail: A Survey of Data Provenance, Transparency and Traceability in LLMs. arXiv preprint arXiv:2601.14311. 2026.
- [56] Agostini NB, Curzel S, Amatya V, et al. An MLIR-based compiler flow for system-level design and hardware acceleration. In: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design. 2022:1–9.
- [57] Majumder K, Bondhugula U. HIR: An MLIR-based intermediate representation for hardware accelerator description. In: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4. 2023:189–201.
- [58] Masood F, Brigoli R. Machine Learning on Kubernetes: A practical handbook for building and using a complete open source machine learning platform on Kubernetes. Packt Publishing Ltd, 2022.
- [59] Lazuka M, Anghel A, Parnell T. LLM-Pilot: Characterize and optimize performance of your LLM inference services. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE. 2024:1–18.
- [60]
- [61] Wu Y, Tian C, Li J, et al. A survey on federated fine-tuning of large language models. arXiv preprint arXiv:2503.12016. 2025.
- [62] Hu J, Xu M, Ye K, Xu C. BrownoutServe: SLO-Aware Inference Serving Under Bursty Workloads for MoE-Based LLMs. IEEE Transactions on Computers. 2026;75(4):1636–1649. doi:10.1109/TC.2026.3655019.
- [63] Liao J, Xu M, Zheng W, et al. DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving. IEEE Transactions on Services Computing. 2026;19(2):1134–1147. doi:10.1109/TSC.2026.3670011.
- [64] Hu C, Huang H, Xu L, et al. ShuffleInfer: Disaggregate LLM inference for mixed downstream workloads. ACM Transactions on Architecture and Code Optimization. 2025;22(2):1–24.
- [65] Patel P, Choukse E, Zhang C, et al. Splitwise: Efficient generative LLM inference using phase splitting. In: Proceedings of the 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture. 2024:118–132.
- [66] Wu B, Zhong Y, Zhang Z, et al. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920. 2023.
- [67] Li R, Du R, Chu Z, et al. Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference. arXiv preprint arXiv:2508.19559. 2025.
- [68] Lai R, Liu H, Lu C, et al. TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity. arXiv preprint arXiv:2512.03416. 2025.
- [69] Wu B, Liu S, Zhong Y, Sun P, Liu X, Jin X. LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism. In: Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 2024:640–654.
- [70] Wu B, Zhu R, Zhang Z, Sun P, Liu X, Jin X. dLoRA: Dynamically orchestrating requests and adapters for LoRA LLM serving. In: Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation. 2024:911–927.
- [71] Zhu R, Jiang Z, Jin C, et al. MegaScale-Infer: Efficient mixture-of-experts model serving with disaggregated expert parallelism. In: Proceedings of the ACM SIGCOMM 2025 Conference. 2025:592–608.
- [72] Chen L, Ye Z, Wu Y, Zhuo D, Ceze L, Krishnamurthy A. Punica: Multi-tenant LoRA serving. Proceedings of Machine Learning and Systems. 2024;6:1–13.
- [73] Li Z, Zheng L, Zhong Y, et al. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In: Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation. 2023:663–679.
- [74] Goel K, Mohan J, Kwatra N, Anupindi RS, Ramjee R. Niyama: Breaking the silos of LLM inference serving. arXiv preprint arXiv:2503.22562. 2025.
- [75] He Y, Xu M, Wu J, et al. BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure. Software: Practice and Experience. 2026;56(4):424–444. doi:10.1002/spe.70054.
- [76] Stojkovic J, Zhang C, Goiri I, Torrellas J, Choukse E. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. In: Proceedings of the 2025 IEEE International Symposium on High Performance Computer Architecture. 2025:1348–1362.
- [77] Patel P, Choukse E, Zhang C, et al. Characterizing power management opportunities for LLMs in the cloud. In: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 2024:207–222.
- [78] Stojkovic J, Zhang C, Goiri I, et al. TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms. In: Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. Association for Computing Machinery, 2025; New York, NY, USA:1266–1281.
- [79] Zikun L, Zhuofu C, Remi D, et al. AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding. arXiv preprint arXiv:2501.12162. 2026.
- [80] Sheng Y, Cao S, Li D, et al. Fairness in serving large language models. In: Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, 2024; USA.