pith. machine review for the scientific record.

arxiv: 2510.13668 · v2 · submitted 2025-10-15 · 💻 cs.DC · cs.LG


STAR: Decode-Phase Rescheduling for LLM Inference

classification 💻 cs.DC cs.LG
keywords: decode, length, rescheduling, workloads, inference, model, phase, prediction
abstract

Large Language Model (LLM) inference has emerged as a fundamental paradigm; however, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as PD disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in SLO violations and OOM failures under evolving decode workloads. In this paper, we propose STAR, a decode rescheduling system powered by length prediction to anticipate future workloads. Our core contributions are: (1) a lightweight, continuous LLM-native prediction method that leverages LLM hidden states to model remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); and (2) a decode-phase rescheduling solution with a dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 TPOT by 75.1% and achieving 2.63 times higher goodput.
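The two contributions in the abstract can be sketched together: a lightweight head that regresses remaining output length from a request's hidden state, and a balancer that assigns requests to the decode instance with the smallest projected (current plus predicted) workload. This is a minimal illustration, not the paper's implementation; the linear head, the function names, and the scoring rule are all assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical lightweight predictor head: the paper uses the LLM's
# hidden state to model remaining generation length; a single linear
# layer stands in for that head here (weights are placeholders).
HIDDEN = 16
w = rng.normal(size=HIDDEN)
b = 0.0

def predict_remaining(hidden_state: np.ndarray) -> float:
    """Predict remaining output tokens for one request, clamped at 0."""
    return max(0.0, float(hidden_state @ w + b))

# Hypothetical dynamic balancer: combine each instance's in-flight
# token count with the predicted remaining tokens of its requests,
# and route new work to the instance with the smallest projection.
def pick_instance(in_flight_tokens, predicted_remaining):
    """Return the index of the decode instance with the least
    projected workload (current load + predicted future load)."""
    projected = [cur + pred
                 for cur, pred in zip(in_flight_tokens, predicted_remaining)]
    return projected.index(min(projected))
```

For example, an instance with 70 in-flight tokens but only 10 predicted remaining tokens would be preferred over a lighter-loaded instance whose requests are predicted to run long, which is the intuition behind blending current and predicted workloads.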

