pith. machine review for the scientific record.

arxiv: 2510.13668 · v2 · submitted 2025-10-15 · 💻 cs.DC · cs.LG


STAR: Decode-Phase Rescheduling for LLM Inference

classification 💻 cs.DC cs.LG
keywords: decode, length, rescheduling, workloads, inference, model, phase, prediction
abstract

Large Language Model (LLM) inference has emerged as a fundamental paradigm; however, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as PD disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in SLO violations and OOM failures under evolving decode workloads. In this paper, we propose STAR, a decode rescheduling system powered by length prediction to anticipate future workloads. Our core contributions are: (1) a lightweight, continuous LLM-native prediction method that leverages LLM hidden states to model remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); and (2) a decode-phase rescheduling solution with a dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 TPOT by 75.1% and achieving 2.63 times higher goodput.
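The two contributions in the abstract can be sketched together: a lightweight head that regresses remaining output length from a request's hidden state, and a balancer that assigns requests to the decode instance with the smallest projected (current plus predicted) workload. This is a minimal illustration, not the paper's implementation; the linear head, the function names, and the scoring rule are all assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical lightweight predictor head: the paper uses the LLM's
# hidden state to model remaining generation length; a single linear
# layer stands in for that head here (weights are placeholders).
HIDDEN = 16
w = rng.normal(size=HIDDEN)
b = 0.0

def predict_remaining(hidden_state: np.ndarray) -> float:
    """Predict remaining output tokens for one request, clamped at 0."""
    return max(0.0, float(hidden_state @ w + b))

# Hypothetical dynamic balancer: combine each instance's in-flight
# token count with the predicted remaining tokens of its requests,
# and route new work to the instance with the smallest projection.
def pick_instance(in_flight_tokens, predicted_remaining):
    """Return the index of the decode instance with the least
    projected workload (current load + predicted future load)."""
    projected = [cur + pred
                 for cur, pred in zip(in_flight_tokens, predicted_remaining)]
    return projected.index(min(projected))
```

For example, an instance with 70 in-flight tokens but only 10 predicted remaining tokens would be preferred over a lighter-loaded instance whose requests are predicted to run long, which is the intuition behind blending current and predicted workloads.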

