pith. machine review for the scientific record.

arxiv: 2505.09999 · v3 · submitted 2025-05-15 · 💻 cs.DC

Recognition: unknown

ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

Authors on Pith: no claims yet
classification 💻 cs.DC
keywords: serving, servegen, workload, workloads, characterization, language models, benchmarking
abstract

With the widespread adoption of Large Language Models (LLMs), serving LLM inference requests has become an increasingly important task, attracting active research. Practical workloads play an essential role in this process: they are critical for motivating and benchmarking serving techniques and systems. However, the existing understanding of real-world LLM serving workloads is limited by the lack of a comprehensive workload characterization. Prior analyses remain insufficient in scale and scope, and thus fail to fully capture intricate workload characteristics. In this paper, we fill this gap with an in-depth characterization of LLM serving workloads collected from our worldwide cloud inference serving service, covering not only language models but also emerging multimodal and reasoning models, and unveiling important new findings in each case. Moreover, based on our findings, we propose ServeGen, a principled framework for generating realistic LLM serving workloads by composing them on a per-client basis. A practical use case in production validates that ServeGen avoids 50% under-provisioning compared to naive workload generation, demonstrating ServeGen's advantage in performance benchmarking. ServeGen is available at https://github.com/alibaba/ServeGen.
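The abstract's central idea is per-client composition: aggregate traffic is not well modeled by one global arrival process, so the workload is built client by client and the streams are merged. Below is a minimal sketch of that idea. All names, the Poisson arrival assumption, and the length distributions are illustrative choices, not ServeGen's actual API; see the GitHub link above for the real framework.

```python
import heapq
import random
from dataclasses import dataclass

@dataclass
class Request:
    timestamp: float      # arrival time, seconds from trace start
    client_id: str
    input_tokens: int     # prompt length
    output_tokens: int    # generation length

def client_stream(client_id, rate, in_dist, out_dist, horizon, rng):
    """One client's requests as a Poisson arrival process with
    client-specific input/output length distributions (illustrative)."""
    t = 0.0
    while True:
        t += rng.expovariate(rate)   # exponential inter-arrival gap
        if t > horizon:
            return
        yield Request(t, client_id, in_dist(rng), out_dist(rng))

def compose_workload(clients, horizon, seed=0):
    """Merge per-client streams into one globally time-ordered trace."""
    rng = random.Random(seed)
    streams = [client_stream(cid, rate, ind, outd, horizon, rng)
               for cid, rate, ind, outd in clients]
    # Each stream is individually time-sorted, so a k-way merge suffices.
    return list(heapq.merge(*streams, key=lambda r: r.timestamp))

# Two hypothetical client profiles: an interactive chat client and a
# lower-rate batch client with long prompts. All numbers are made up.
clients = [
    ("chat",  2.0, lambda r: r.randint(20, 200),    lambda r: r.randint(50, 500)),
    ("batch", 0.5, lambda r: r.randint(1000, 4000), lambda r: r.randint(100, 300)),
]
trace = compose_workload(clients, horizon=60.0)
```

Fitting the per-client rates and length distributions to real traces, rather than hand-picking them as above, is where a framework like ServeGen does the actual work.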

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

    cs.CL 2026-04 unverdicted novelty 6.0

    Dual-pool token-budget routing for LLM serving reduces GPU-hours by 31-42% and preemption rates by 5.4x through online-learned request classification without a tokenizer (a speculative sketch of this routing idea appears after this list).

  2. The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

    cs.LG 2026-03 unverdicted novelty 5.0

    The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.
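The first citation's one-line summary names a mechanism, online-learned request classification without a tokenizer, that a short sketch can make concrete. Everything below is an assumption for illustration: the feature set, the logistic model, the 0.5 threshold, and the pool names are not taken from the cited paper.

```python
import math

class OnlineBudgetRouter:
    """Speculative sketch: route each request to a 'short' or 'long' pool
    using an online logistic model over cheap, tokenizer-free features."""

    def __init__(self, budget_tokens=256, lr=0.05):
        self.budget = budget_tokens          # hypothetical token budget
        self.lr = lr
        self.w = [0.0, 0.0, 0.0]             # bias, log-length, question flag

    def _features(self, prompt: str):
        # Tokenizer-free features: constant bias, log character length,
        # and whether the prompt ends with a question mark.
        return [1.0,
                math.log1p(len(prompt)),
                1.0 if prompt.rstrip().endswith("?") else 0.0]

    def predict(self, prompt: str) -> float:
        # Estimated probability that the output exceeds the budget.
        z = sum(w * x for w, x in zip(self.w, self._features(prompt)))
        return 1.0 / (1.0 + math.exp(-z))

    def route(self, prompt: str) -> str:
        return "long_pool" if self.predict(prompt) > 0.5 else "short_pool"

    def update(self, prompt: str, actual_output_tokens: int):
        # One SGD step on logistic loss once the true length is observed.
        y = 1.0 if actual_output_tokens > self.budget else 0.0
        p = self.predict(prompt)
        for i, x in enumerate(self._features(prompt)):
            self.w[i] -= self.lr * (p - y) * x

router = OnlineBudgetRouter()
pool = router.route("Summarize this report in detail: ...")
# After the request completes and the true output length is known:
router.update("Summarize this report in detail: ...", actual_output_tokens=812)
```

The point of the sketch is only the shape of the mechanism: classification happens before any tokenization, and the model keeps learning online from observed output lengths.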