
arxiv: 2505.09999 · v3 · submitted 2025-05-15 · 💻 cs.DC

ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

Pith reviewed 2026-05-22 15:37 UTC · model grok-4.3

classification 💻 cs.DC
keywords LLM serving · workload characterization · workload generation · production traces · performance benchmarking · cloud inference · multimodal models

The pith

ServeGen generates realistic LLM serving workloads by composing per-client patterns observed in production traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper provides an in-depth characterization of LLM serving workloads from a worldwide cloud service, covering language models as well as multimodal and reasoning models. It reveals workload characteristics that prior, smaller-scale analyses did not fully capture. Building on these findings, it introduces ServeGen, a framework that generates realistic workloads by composing them on a per-client basis. Accurate workload generation is essential for benchmarking and improving LLM serving systems, because a system tuned against unrealistic traffic may handle real traffic poorly. In a production use case, ServeGen avoids 50% under-provisioning compared to naive workload generation.

Core claim

The paper characterizes LLM serving workloads at scale from a cloud inference service and proposes ServeGen, a framework that generates realistic workloads through per-client composition based on observed patterns. A production use case validates the framework: ServeGen avoids 50% under-provisioning compared to naive generation.

What carries the argument

ServeGen's per-client composition mechanism, which builds overall workloads by modeling and combining the request patterns of individual clients observed in real traces.
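
As a rough illustration of the idea, per-client composition can be sketched as generating one request stream per client and merging the streams by timestamp. This is a minimal sketch under stated assumptions, not ServeGen's actual implementation: the `Client` fields, the Gamma inter-arrival model, the exponential input-length model, and the function names are all hypothetical choices made for this example.

```python
import heapq
import random
from dataclasses import dataclass


@dataclass
class Client:
    """Hypothetical per-client profile; field names are illustrative only."""
    name: str
    rate: float        # mean requests per second
    cv: float          # coefficient of variation of inter-arrival times
    mean_input: float  # mean input length in tokens


def client_arrivals(client, duration, rng):
    """Yield (timestamp, client name, input tokens) for one client.

    Assumption: inter-arrival times are Gamma distributed. With
    shape = 1 / cv^2 and scale = cv^2 / rate, the mean gap is 1 / rate
    and the coefficient of variation of the gaps is cv.
    """
    shape = 1.0 / client.cv ** 2
    scale = client.cv ** 2 / client.rate
    t = rng.gammavariate(shape, scale)
    while t < duration:
        # Assumption: input lengths are exponential around the client mean.
        tokens = max(1, round(rng.expovariate(1.0 / client.mean_input)))
        yield (t, client.name, tokens)
        t += rng.gammavariate(shape, scale)


def compose(clients, duration, seed=0):
    """Merge independent per-client streams into one time-ordered workload."""
    streams = [
        client_arrivals(c, duration, random.Random(seed * 1000 + i))
        for i, c in enumerate(clients)
    ]
    # Each stream is already sorted by time, so a k-way merge is enough.
    return list(heapq.merge(*streams))


if __name__ == "__main__":
    clients = [
        Client("A", rate=5.0, cv=2.0, mean_input=800.0),  # heavy, bursty
        Client("B", rate=1.0, cv=1.0, mean_input=200.0),  # light, Poisson-like
    ]
    workload = compose(clients, duration=60.0)
    print(len(workload), "requests; first three:", workload[:3])
```

Keeping each client as its own stream means rate, burstiness, and length distributions can differ per client, which a single aggregate arrival process with one shared length distribution cannot express.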

If this is right

  • Per-client composition allows more accurate replication of complex workload characteristics than previous methods.
  • Generated workloads from ServeGen enable better performance benchmarking of LLM serving systems.
  • The approach covers a range of model types including multimodal and reasoning models.
  • Real-world use demonstrates reduced under-provisioning in resource allocation for serving.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • ServeGen might be adapted for other cloud environments if the core client patterns prove similar across providers.
  • Future extensions could incorporate evolving client behaviors as models and usage patterns change over time.
  • Similar composition strategies may improve workload modeling in adjacent areas such as general cloud service benchmarking.

Load-bearing premise

The traces from one cloud provider capture the essential patterns of LLM serving workloads that apply more broadly to other settings and newer model types.

What would settle it

Running ServeGen on traces from a separate cloud provider and finding that the generated workloads still cause significant under-provisioning in production tests would challenge the generalizability of the approach.

Figures

Figures reproduced from arXiv: 2505.09999 by Ennan Zhai, Kun Qian, Wenyuan Yu, Xin Jin, Xue Li, Yuxing Xiang.

Figure 1. Inter-arrival time characterization.
Figure 2. Long-term rate and CV shifts.
… M-small, and M-mid within a 20-minute window. Conforming to existing analyses [39, 44], we find that the arrival patterns exhibit notable burstiness, indicated by CVs greater than 1. Consequently, Poisson processes (which have a CV of 1) often poorly model the IATs in bursty workloads (such as in …
Figure 3. Input and output length distribution. x-axis: # tokens; y-axis: frequency. Each subfigure corresponds to a specific workload and time period, split to two consecutive x-scales to better visualize the shift in average lengths (left) as well as the tail distribution (right).
Figure 5. Client heterogeneity in terms of rate, etc. All CDFs are weighted by client rates.
Figure 7. Characterization of multimodal inputs in three different workloads. Rows: mm-image, mm-audio, and mm-video, respectively. Columns: (a) number of multimodal inputs per request; (b) tokenized length distribution of multimodal inputs; (c) correlation between text tokens and multimodal tokens; (d) overall arrival rate of multimodal and text tokens.
Figure 8. Characterization of omni-modal inputs in mm-omni. Left: number of multimodal inputs per request. Right: arrival rate of multimodal and text tokens, normalized by the total input rate.
Figure 9. Ratio of multimodal input tokens per request in mm-image, mm-audio, and mm-video. Numbers indicate the average ratio.
… per request (shown in (a)) and the lack of correlation between text and multimodal tokens (shown in (c)), we observe highly varied load on modality encoders, as illustrated in (d). Two observations further complicate the load variance in multimodal workloads. (i) The variance of multimod…
Figure 10. Breakdown of first-token time when serving requests with image or video inputs (mm-image and mm-video).
Figure 11. Client characterization for mm-image. CDFs are weighted by rates.
Figure 14. Characterization of request arrival patterns in deepseek-r1 and deepqwen-r1. Left: Rate and burstiness shifts over a day. Right: Normalized inter-arrival time distributions.
Figure 13. Characterization of input and output lengths for the deepseek-r1 workload in one day. Error bars in (a) indicate the range of average lengths over the day.
… output lengths and a distinct ratio of reason and answer tokens (§5.1). In addition, request arrivals in reasoning workloads are less bursty, partly owing to a considerable proportion of multi-turn conversations, which alter the request arrival patt…
Figure 16. Comparison of two upsampling methods for a workload containing only multi-turn requests.
Figure 17. Client decomposition for deepseek-r1. (a) weighted CDF of client arrival rate. (b) weighted CDF of client burstiness. (c) output length breakdown of top clients (C1 and C2).
… analysis window, we have identified 188,986 multi-turn requests out of 1,964,415 total requests, forming 57,205 conversations …
Figure 19. Comparison of workload generation accuracy.
… each client, either by sampling from the Client Pool preconfigured with realistic client behaviors, or by selecting from a set of user-specified clients with custom traces and datasets. Next, ServeGen samples the request timestamps and data for each client with the Timestamp Sampler and Request Data Sampler, scaling client rates according to the total rate and …
Figure 20. Provisioning results using the Naive approach and ServeGen. In each cell, the number indicates the provisioned instances, while the color shows the over-provisioning percentage.
… 6.3 Use Case: Instance Provisioning. We now put ServeGen to use, illustrating how it helps with benchmarking LLM serving systems by running the generated workloads on vLLM [28], a representative LLM serving system with wide adoption…
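
The text excerpt accompanying Figure 2 uses the coefficient of variation (CV) of inter-arrival times as its burstiness measure, with CV = 1 for a Poisson process and CV > 1 for bursty arrivals. The sketch below shows one way to compute that statistic from a list of timestamps. The two synthetic generators are assumptions for illustration only (in particular, the two-phase hyperexponential mix is a stand-in for "bursty" traffic and is not taken from the paper), as are all function names and parameter values.

```python
import random
import statistics


def inter_arrival_cv(timestamps):
    """Coefficient of variation (std / mean) of the gaps between arrivals."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return statistics.pstdev(gaps) / statistics.fmean(gaps)


def poisson_arrivals(n, rate, seed=0):
    """Timestamps with exponential gaps, i.e. a Poisson process (CV near 1)."""
    rng = random.Random(seed)
    t, out = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)
        out.append(t)
    return out


def bursty_arrivals(n, rate, seed=0):
    """Timestamps with hyperexponential gaps (illustrative bursty stand-in).

    90% of gaps are short (mean 0.2 / rate) and 10% are long
    (mean 8.2 / rate), so the overall mean gap is still 1 / rate
    while the CV is well above 1.
    """
    rng = random.Random(seed)
    t, out = 0.0, []
    for _ in range(n):
        mean_gap = 0.2 / rate if rng.random() < 0.9 else 8.2 / rate
        t += rng.expovariate(1.0 / mean_gap)
        out.append(t)
    return out


if __name__ == "__main__":
    print("Poisson CV:", inter_arrival_cv(poisson_arrivals(20000, 10.0)))
    print("Bursty CV: ", inter_arrival_cv(bursty_arrivals(20000, 10.0)))
```

Both generators have the same average rate, so the difference in CV comes only from how the gaps are distributed, which is the property a rate-matched Poisson model cannot reproduce.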
Original abstract

With the widespread adoption of Large Language Models (LLMs), serving LLM inference requests has become an increasingly important task, attracting active research advancements. Practical workloads play an essential role in this process: they are critical for motivating and benchmarking serving techniques and systems. However, the existing understanding of real-world LLM serving workloads is limited due to the lack of a comprehensive workload characterization. Prior analyses remain insufficient in scale and scope, thus failing to fully capture intricate workload characteristics. In this paper, we fill the gap with an in-depth characterization of LLM serving workloads collected from our worldwide cloud inference serving service, covering not only language models but also emerging multimodal and reasoning models, and unveiling important new findings in each case. Moreover, based on our findings, we propose ServeGen, a principled framework for generating realistic LLM serving workloads by composing them on a per-client basis. A practical use case in production validates that ServeGen avoids 50% under-provisioning compared to naive workload generation, demonstrating ServeGen's advantage in performance benchmarking. ServeGen is available at https://github.com/alibaba/ServeGen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper provides an in-depth characterization of production LLM serving workloads from a worldwide cloud inference service, covering language, multimodal, and reasoning models and identifying new patterns in each. It introduces ServeGen, a workload generation framework that composes realistic request streams on a per-client basis using these observed patterns. A production use case is presented in which ServeGen-generated workloads avoid 50% under-provisioning relative to naive generation methods, illustrating the framework's value for performance benchmarking of LLM serving systems.

Significance. If the characterization holds and the per-client generation method transfers beyond the source traces, ServeGen could become a useful open tool for creating more realistic benchmarks in the LLM serving community. The open-sourcing at https://github.com/alibaba/ServeGen supports reproducibility and further evaluation by others.

major comments (2)
  1. [Practical use case / evaluation section] The central validation claim (avoiding 50% under-provisioning) is load-bearing for the paper's demonstration of ServeGen's advantage, yet the abstract and evaluation section provide no details on measurement methodology, statistical significance testing, workload scale, model versions, or controls for confounding factors. Without these, the result cannot be properly assessed or reproduced.
  2. [ServeGen framework and validation] ServeGen's per-client composition is derived directly from traces collected at a single cloud provider (Alibaba). The manuscript does not present external benchmarks, cross-provider validation, or sensitivity analysis showing that the observed arrival processes, model-type mixtures, and client behaviors remain representative when applied to other environments or future model releases; this directly affects the generalizability of the reported provisioning improvement.
minor comments (2)
  1. [Introduction / Related Work] Clarify early in the paper how 'per-client composition' differs from existing trace-replay or statistical workload generators, with a brief comparison table if possible.
  2. [Characterization sections] Ensure all figures showing workload characteristics (e.g., request rate distributions, model-type breakdowns) include axis labels, units, and sample sizes.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below, indicating the revisions we will make to strengthen the paper while being transparent about its scope and limitations.

Point-by-point responses
  1. Referee: [Practical use case / evaluation section] The central validation claim (avoiding 50% under-provisioning) is load-bearing for the paper's demonstration of ServeGen's advantage, yet the abstract and evaluation section provide no details on measurement methodology, statistical significance testing, workload scale, model versions, or controls for confounding factors. Without these, the result cannot be properly assessed or reproduced.

    Authors: We agree that the evaluation section would benefit from expanded methodological details to support assessment and reproducibility. In the revised manuscript we will add a new subsection under the practical use case that specifies the workload scale (number of clients and total requests processed), the exact model versions used in the production validation, the statistical significance testing procedures applied to the under-provisioning measurements, and the controls employed for potential confounding factors such as hardware heterogeneity and request distribution variations. revision: yes

  2. Referee: [ServeGen framework and validation] ServeGen's per-client composition is derived directly from traces collected at a single cloud provider (Alibaba). The manuscript does not present external benchmarks, cross-provider validation, or sensitivity analysis showing that the observed arrival processes, model-type mixtures, and client behaviors remain representative when applied to other environments or future model releases; this directly affects the generalizability of the reported provisioning improvement.

    Authors: The characterization and ServeGen framework are derived from production traces of our Alibaba worldwide cloud inference service. We have performed sensitivity analyses on arrival processes and model-type mixtures within the collected dataset; these will be presented more explicitly in the revised manuscript. Cross-provider validation is not possible with the data available to us. We will add a dedicated limitations paragraph discussing generalizability to other providers and future model releases, while noting that the open-sourced ServeGen repository enables external researchers to test and adapt the framework on their own traces. revision: partial

standing simulated objections not resolved
  • Cross-provider validation or external benchmarks of the observed workload patterns and the 50% provisioning improvement, due to the proprietary nature of production traces from other cloud providers.

Circularity Check

0 steps flagged

No circularity: characterization and generator derived from independent production traces with external-style validation

full rationale

The paper performs standard workload characterization on traces collected from the authors' own Alibaba cloud service, identifies patterns (language/multimodal/reasoning mixtures, per-client behaviors), and builds ServeGen as a composition framework on those observed statistics. The production use-case validation compares ServeGen-generated workloads against naive baselines for provisioning accuracy. No step reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction; the derivation chain remains self-contained against the collected traces without re-using the target metric as input. Generalizability concerns exist but are separate from circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the assumption that production traces from a single provider capture generalizable workload features and that per-client composition preserves key statistical properties without introducing artifacts.

axioms (1)
  • domain assumption: Production traces from the authors' cloud service are representative of broader LLM serving workloads.
    Invoked when claiming the characterization and ServeGen transfer to other settings.

pith-pipeline@v0.9.0 · 5736 in / 1203 out tokens · 37357 ms · 2026-05-22T15:37:50.559637+00:00 · methodology



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving

    cs.LG 2025-12 conditional novelty 8.0

    Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than ...

  2. Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

    cs.CL 2026-04 unverdicted novelty 6.0

    Dual-pool token-budget routing for LLM serving reduces GPU-hours by 31-42% and preemption rates by 5.4x through online-learned request classification without a tokenizer.

  3. WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

    cs.DC 2025-12 unverdicted novelty 6.0

    WarmServe reduces tail TTFT by up to 50.8× versus autoscaling and supports 2.5× higher throughput than GPU-sharing by using one-for-many prewarming, model placement, KV cache reservation, and efficient tensor switching.

  4. The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

    cs.LG 2026-03 unverdicted novelty 5.0

    The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 4 Pith papers · 8 internal anchors

  1. [1]

    Disaggregated prefilling and KV cache transfer roadmap in vLLM

    2024. Disaggregated prefilling and KV cache transfer roadmap in vLLM. https://github.com/vllm-project/vllm/issues/10818. (2024)

  2. [2]

    Learning to reason with LLMs

    2024. Learning to reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/. (2024)

  3. [3]

    DeepSeek-V3/R1 Inference System Overview

    2025. DeepSeek-V3/R1 Inference System Overview. https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md. (2025)

  4. [4]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. (2024). arXiv:cs.LG/2403.02310 https://arxiv.org/abs/2403.02310

  5. [5]

    George Amvrosiadis, Jun Woo Park, Gregory R. Ganger, Garth A. Gibson, Elisabeth Baseman, and Nathan DeBardeleben. 2018. On the diversity of cluster workloads and its impact on research results. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). USENIX Association, Boston, MA, 533–546. https://www.usenix.org/conference/atc18/presentation/am...

  6. [6]

    N. Attig, P. Gibbon, and Th. Lippert. 2011. Trends in supercomputing: The European path to exascale. Computer Physics Communications 182, 9 (2011), 2041–2046. https://doi.org/10.1016/j.cpc.2010.11.011 Computer Physics Communications Special Edition for Conference on Computational Physics Trondheim, Norway, June 23-26, 2010

  7. [7]

    Arshdeep Bahga, Vijay Krishna Madisetti, et al. 2011. Synthetic workload generation for cloud computing applications. Journal of Software Engineering and Applications 4, 07 (2011), 396

  8. [8]

    Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. 2013. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (2nd ed.). Morgan & Claypool Publishers

  9. [9]

    Shane Bergsma, Timothy Zeyl, Arik Senderovich, and J. Christopher Beck. 2021. Generating Complex, Realistic Cloud Workloads using Recurrent Neural Networks. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP ’21). Association for Computing Machinery, New York, NY, USA, 376–391. https://doi.org/10.1145/3477132.3483590

  10. [10]

    Shiyi Cao, Yichuan Wang, Ziming Mao, Pin-Lun Hsu, Liangsheng Yin, Tian Xia, Dacheng Li, Shu Liu, Yineng Zhang, Yang Zhou, Ying Sheng, Joseph Gonzalez, and Ion Stoica. 2025. Locality-aware Fair Scheduling in LLM Serving. (2025). arXiv:cs.DC/2501.14312 https://arxiv.org/abs/2501.14312

  11. [11]

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. 2025. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. arXiv preprint arXiv:2501.17811 (2025)

  12. [12]

    Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, and Ricardo Bianchini. 2017. Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms. In Proceedings of the International Symposium on Operating Systems Principles (SOSP)

  13. [13]

    DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. (2025). arXiv:cs.CL/2501.12948 https://arxiv.org/abs/2501.12948

  14. [14]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. https://...

  15. [15]

    Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. 2024. MuxServe: flexible spatial-temporal multiplexing for multiple LLM serving. In Proceedings of the 41st International Conference on Machine Learning. Article 473, 13 pages

  16. [16]

    DeepSeek-AI et al. 2025. DeepSeek-V3 Technical Report. (2025). arXiv:cs.CL/2412.19437 https://arxiv.org/abs/2412.19437

  17. [17]

    Gonzalo P. Rodrigo et al. 2018. Towards understanding HPC users and systems: A NERSC case study. J. Parallel and Distrib. Comput. 111 (2018), 206–221. https://doi.org/10.1016/j.jpdc.2017.09.002

  18. [18]

    Haoran Qiu et al. 2025. Towards Efficient Large Multimodal Model Serving. (2025). arXiv:cs.DC/2502.00937 https://arxiv.org/abs/2502.00937

  19. [19]

    Jovan Stojkovic et al. 2024. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. (2024). arXiv:cs.AI/2408.00741 https://arxiv.org/abs/2408.00741

  20. [20]

    Pratyush Patel et al. 2024. Splitwise: Efficient generative LLM inference using phase splitting. (2024). arXiv:cs.AR/2311.18677 https://arxiv.org/abs/2311.18677

  21. [21]

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 135–153. https://www.usenix.org/conference/osdi24/...

  22. [22]

    Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. 2024. DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. (2024). arXiv:cs.PF/2401.08671 https://arxiv.org/abs/2401.08671

  23. [23]

    Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, and Tianwei Zhang. 2021. Characterization and prediction of deep learning workloads in large-scale GPU datacenters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15

  24. [24]

    Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang. 2024. Characterization of Large Language Model Development in the Datacenter. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). USENIX Association, Santa Clara, C...

  25. [25]

    Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). USENIX Association, Renton, WA, 947–960. https://www.usenix.org/conference/atc19/presentation/jeon

  26. [26]

    Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo,...

  27. [27]

    Da-Cheng Juan, Lei Li, Huan-Kai Peng, Diana Marculescu, and Christos Faloutsos. 2014. Beyond Poisson: Modeling inter-arrival time of requests in a datacenter. In Advances in Knowledge Discovery and Data Mining: 18th Pacific-Asia Conference, PAKDD 2014, Tainan, Taiwan, May 13-16, 2014. Proceedings, Part II 18. Springer, 198–209

  28. [28]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

  29. [29]

    In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23)

    Efficient Memory Management for Large Language Model Serv- ing with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23). Association for Computing Machinery, New York, NY, USA, 611–626. https://doi.org/10.1145/ 3600006.3613165

  30. [30]

    Gonzalez, and Ion Stoica

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23) . USENIX As- sociation, Boston, ...

  31. [31]

    Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, and Mosharaf Chowdhury. 2024. Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services. (2024). arXiv:cs.DC/2404.16283 https://arxiv.org/abs/2404.16283

  32. [32]

    Chengzhi Lu, Kejiang Ye, Guoyao Xu, Cheng-Zhong Xu, and Tongxin Bai. 2017. Imbalance in the cloud: An analysis on Alibaba cluster trace. In 2017 IEEE International Conference on Big Data (Big Data) . 2884–2892. https://doi.org/10.1109/BigData.2017.8258257

  33. [33]

    OpenAI. 2024. Introducing OpenAI o1. (2024). https://openai.com/o1/

  34. [34]

    OpenAI. 2024. OpenAI’s GPT-4o model. (2024). https://openai.com/ index/hello-gpt-4o/

  35. [35]

    Tirthak Patel, Zhengchun Liu, Raj Kettimuthu, Paul Rich, William Allcock, and Devesh Tiwari. 2020. Job characteristics on large-scale systems: long-term analysis, quantification, and implications. In Pro- ceedings of the International Conference for High Performance Comput- ing, Networking, Storage and Analysis (SC ’20). IEEE Press, Article 84, 17 pages

  36. [36]

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation — A KVCache-centric Ar- chitecture for Serving LLM Chatbot. In 23rd USENIX Conference on File and Storage Technologies (FAST 25). USENIX Association, Santa Clara, CA, 155–170. https://ww...

  37. [37]

    Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. (Septem- ber 2024). https://qwenlm.github.io/blog/qwen2.5/

  38. [38]

    Ganger, Randy H

    Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. 2012. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Computing (SoCC ’12) . Association for Computing Machinery, New York, NY, USA, Article 7, 13 pages. https://doi.org/10.1145/2391229.2391236

  39. [39]

    RyokoAI. 2024. ShareGPT-52K. (2024). https://huggingface.co/ datasets/RyokoAI/ShareGPT52K

  40. [40]

    Mohammad Shahrad, Rodrigo Fonseca, Inigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. 2020. Serverless in the Wild: Char- acterizing and Optimizing the Serverless Workload at a Large Cloud Provider. In2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 2...

  41. [41]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. (2020). arXiv:cs.CL/1909.08053 https://arxiv.org/abs/1909.08053

  42. [42]

    Abhishek Verma, Madhukar Korupolu, and John Wilkes. 2014. Evalu- ating job packing in warehouse-scale computing. In 2014 IEEE Interna- tional Conference on Cluster Computing (CLUSTER) . IEEE, 48–56

  43. [43]

    Mengdi Wang, Chen Meng, Guoping Long, Chuan Wu, Jun Yang, Wei Lin, and Yangqing Jia. 2019. Characterizing Deep Learning Training Workloads on Alibaba-PAI. In 2019 IEEE International Symposium on Workload Characterization (IISWC). IEEE Computer Society, Los Alamitos, CA, USA, 189–202. https://doi.org/10.1109/IISWC47752.2019.9042047

  44. [44]

    Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. 2024. MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback. (2024). arXiv:cs.CL/2309.10691 https://arxiv.org/abs/2309.10691

  45. [45]

    Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. 2024. BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems. (2024). arXiv:cs.DC/2401.17644 https://arxiv.org/abs/2401.17644

  46. [46]

    Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Association, Renton, WA, 945–960. https://www.us...

  47. [47]

    Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism. arXiv preprint arXiv:2404.09526 (2024)

  48. [48]

    Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, and Xin Jin. 2024. Fast Distributed Inference Serving for Large Language Models. (2024). arXiv:cs.LG/2305.05920 https://arxiv.org/abs/2305.05920

  49. [49]

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. (2023). arXiv:cs.AI/2308.08155 https://arxiv.org/abs/2308.08155

  50. [50]

    xAI. 2025. Grok 3 Beta — The Age of Reasoning Agents. (2025). https://x.ai/blog/grok-3/

  51. [51]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025. Qwen2.5-Omni Technical Report. arXiv preprint arXiv:2503.20215 (2025)

  52. [52]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 521–538. https://www.usenix.org/conference/osdi22/presentation/yu

  53. [53]

    Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787–808. https://www.usenix.org/conference/nsdi23/presentation/zhang-hong

  54. [54]

    Yanqi Zhang, Íñigo Goiri, Gohar Irfan Chaudhry, Rodrigo Fonseca, Sameh Elnikety, Christina Delimitrou, and Ricardo Bianchini. 2021. Faster and Cheaper Serverless Computing on Harvested Resources. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP ’21). Association for Computing Machinery, New York, NY, USA, 724–739. http...

  55. [55]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. (2024). arXiv:cs.AI/2312.07104 https://arxiv.org/abs/2312.07104

  56. [56]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. (2024). arXiv:cs.DC/2401.09670