Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than existing systems or expert plans.
ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
4 Pith papers cite this work. Polarity classification is still indexing.
abstract
With the widespread adoption of Large Language Models (LLMs), serving LLM inference requests has become an increasingly important task, attracting active research advancements. Practical workloads play an essential role in this process: they are critical for motivating and benchmarking serving techniques and systems. However, the existing understanding of real-world LLM serving workloads is limited due to the lack of a comprehensive workload characterization. Prior analyses remain insufficient in scale and scope, thus failing to fully capture intricate workload characteristics. In this paper, we fill the gap with an in-depth characterization of LLM serving workloads collected from our worldwide cloud inference serving service, covering not only language models but also emerging multimodal and reasoning models, and unveiling important new findings in each case. Moreover, based on our findings, we propose ServeGen, a principled framework for generating realistic LLM serving workloads by composing them on a per-client basis. A practical use case in production validates that ServeGen avoids 50% under-provisioning compared to naive workload generation, demonstrating ServeGen's advantage in performance benchmarking. ServeGen is available at https://github.com/alibaba/ServeGen.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Dual-pool token-budget routing for LLM serving reduces GPU-hours by 31-42% and preemption rates by 5.4x through online-learned request classification without a tokenizer.
WarmServe reduces tail TTFT by up to 50.8× versus autoscaling and supports 2.5× higher throughput than GPU-sharing by using one-for-many prewarming, model placement, KV cache reservation, and efficient tensor switching.
The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.
citing papers explorer
-
Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving
Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than existing systems or expert plans.
-
Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving
Dual-pool token-budget routing for LLM serving reduces GPU-hours by 31-42% and preemption rates by 5.4x through online-learned request classification without a tokenizer.
-
WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving
WarmServe reduces tail TTFT by up to 50.8× versus autoscaling and supports 2.5× higher throughput than GPU-sharing by using one-for-many prewarming, model placement, KV cache reservation, and efficient tensor switching.
-
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.