Orca: A Distributed Serving System for Transformer-Based Generative Models
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.DC (1)
Years: 2025 (1)
Verdicts: unverdicted (1)
Representative citing papers
STAR: Decode-Phase Rescheduling for LLM Inference
STAR cuts P99 TPOT by 75.1% and raises goodput 2.63x via a lightweight hidden-state length predictor and dynamic decode rescheduling that combines current and predicted loads.