ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

Alibaba, “ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production,” arXiv:2505 · 2025 · cs.DC · arXiv 2505.09999

4 Pith papers cite this work. Polarity classification is still indexing.


abstract

With the widespread adoption of Large Language Models (LLMs), serving LLM inference requests has become an increasingly important task, attracting active research advancements. Practical workloads play an essential role in this process: they are critical for motivating and benchmarking serving techniques and systems. However, the existing understanding of real-world LLM serving workloads is limited due to the lack of a comprehensive workload characterization. Prior analyses remain insufficient in scale and scope, thus failing to fully capture intricate workload characteristics. In this paper, we fill the gap with an in-depth characterization of LLM serving workloads collected from our worldwide cloud inference serving service, covering not only language models but also emerging multimodal and reasoning models, and unveiling important new findings in each case. Moreover, based on our findings, we propose ServeGen, a principled framework for generating realistic LLM serving workloads by composing them on a per-client basis. A practical use case in production validates that ServeGen avoids 50% under-provisioning compared to naive workload generation, demonstrating ServeGen's advantage in performance benchmarking. ServeGen is available at https://github.com/alibaba/ServeGen.
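The per-client composition idea mentioned in the abstract can be illustrated with a minimal sketch. This is not ServeGen's API: the `Client` fields, the `client_trace` and `compose` functions, the Poisson arrivals, and the exponential token-length distributions are all assumptions chosen for brevity, and the paper's measured distributions may differ. The sketch only shows the general shape of generating each client's requests separately and then merging them into one workload.

```python
import random
from dataclasses import dataclass


@dataclass
class Client:
    """Hypothetical per-client profile; field names are illustrative only."""
    rate: float         # mean requests per second
    mean_input: float   # mean prompt length in tokens
    mean_output: float  # mean response length in tokens


def client_trace(client, duration, rng):
    """Sample one client's requests as (arrival_time, input_tokens, output_tokens)."""
    t, requests = 0.0, []
    while True:
        # Assumption: Poisson arrivals, so inter-arrival gaps are exponential.
        t += rng.expovariate(client.rate)
        if t >= duration:
            return requests
        # Assumption: exponentially distributed lengths, at least one token.
        n_in = max(1, round(rng.expovariate(1.0 / client.mean_input)))
        n_out = max(1, round(rng.expovariate(1.0 / client.mean_output)))
        requests.append((t, n_in, n_out))


def compose(clients, duration, seed=0):
    """Generate each client's trace independently, then merge by arrival time."""
    rng = random.Random(seed)
    merged = []
    for client in clients:
        merged.extend(client_trace(client, duration, rng))
    merged.sort(key=lambda request: request[0])
    return merged


# Two made-up clients: a chatty short-prompt client and a rare long-prompt one.
clients = [Client(rate=5.0, mean_input=200, mean_output=50),
           Client(rate=0.5, mean_input=4000, mean_output=800)]
workload = compose(clients, duration=60.0)
```

A naive generator would instead fit a single arrival rate and a single length distribution to the aggregate, which hides the difference between the two clients above. Keeping the profiles separate preserves that heterogeneity in the merged trace.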

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving

cs.LG · 2025-12-16 · conditional · novelty 8.0

Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than existing systems or expert plans.

Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

Dual-pool token-budget routing for LLM serving reduces GPU-hours by 31-42% and preemption rates by 5.4x through online-learned request classification without a tokenizer.

WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

cs.DC · 2025-12-10 · unverdicted · novelty 6.0

WarmServe reduces tail TTFT by up to 50.8× versus autoscaling and supports 2.5× higher throughput than GPU-sharing by using one-for-many prewarming, model placement, KV cache reservation, and efficient tensor switching.

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

cs.LG · 2026-03-22 · unverdicted · novelty 5.0

The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.

citing papers explorer

Showing 4 of 4 citing papers.

Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving

cs.LG · 2025-12-16 · conditional · none · ref 44 · internal anchor

Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

cs.CL · 2026-04-09 · unverdicted · none · ref 30 · internal anchor

WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

cs.DC · 2025-12-10 · unverdicted · none · ref 25 · internal anchor

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

cs.LG · 2026-03-22 · unverdicted · none · ref 26 · internal anchor
