OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration

Youhe Jiang , Fangcheng Fu , Taiyi Wang , Guoliang He , Eiko Yoneki

Authors on Pith no claims yet

classification 💻 cs.DC

keywords modeloserveservingworkloadheterogeneityheterogeneousworkloadsdeployment

read the original abstract

Serving Large Language Models (LLMs) can benefit immensely from parallelizing both the model and input requests across multiple devices, but incoming workloads exhibit substantial spatial and temporal heterogeneity. Spatially, workloads comprise heterogeneous requests with varying compute and memory demands. Temporally, workload composition varies over time. Nevertheless, existing systems typically assume spatially uniform and temporally stable workloads, employing a homogeneous, static model deployment. This mismatch between the assumption and real-world spatial-temporal heterogeneity results in suboptimal performance. We present OServe, an LLM serving system with heterogeneous and flexible model deployment that addresses both spatial and temporal heterogeneity. First, OServe introduces a novel workload-aware scheduling algorithm that optimizes heterogeneous model deployments according to real-time workload characteristics. Second, OServe proposes an efficient workload-adaptive switching method that migrates model deployments in response to predicted workload changes. Experiments on real-world traces show that OServe improves performance by up to 2$\times$ (average: 1.5$\times$) compared to state-of-the-art serving systems.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
cs.DC 2026-04 unverdicted novelty 7.0

Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
cs.DC 2026-05 unverdicted novelty 6.0

HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.