arxiv: 2510.13668 · v2 · submitted 2025-10-15 · 💻 cs.DC · cs.LG

STAR: Decode-Phase Rescheduling for LLM Inference

Zhibin Wang, Zetao Hong, Xue Li, Zibo Wang, Shipeng Li, Qingkai Meng, Qing Wang, Chengying Huan, Rong Gu, Sheng Zhong, Chen Tian


Pith reviewed 2026-05-18 06:08 UTC · model grok-4.3

classification 💻 cs.DC cs.LG

keywords LLM inference · decode rescheduling · length prediction · hidden states · workload balancing · TPOT · goodput · SLO


The pith

STAR reschedules LLM decode workloads using hidden-state length predictions to cut P99 TPOT by 75.1%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model inference suffers from severe workload imbalance during the decode phase because output lengths vary widely, especially on long reasoning tasks. Static prefill-to-decode scheduling in existing systems produces SLO violations and out-of-memory errors when workloads shift. STAR addresses this with a lightweight predictor that reads the model's own hidden states to forecast remaining generation length, then feeds both current and predicted loads into a dynamic rescheduler that rebalances decode work. If the approach holds, inference systems could maintain steadier latency and higher throughput without redesigning the core architecture. Readers care because the method targets a concrete, recurring bottleneck in production LLM serving.

Core claim

STAR is a decode rescheduling system powered by length prediction to anticipate future workloads. Its core contributions are a lightweight continuous LLM-native prediction method that leverages hidden states to model remaining generation length at high precision and low overhead, plus a rescheduling solution that applies a dynamic balancing mechanism integrating current and predicted workloads.

What carries the argument

Hidden-state length predictor that models remaining generation length from LLM internal states, combined with a dynamic balancing mechanism that reschedules decode-phase work.
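To make the predictor idea concrete, here is a minimal sketch of a small MLP head that maps one hidden-state vector to a non-negative remaining-length estimate. The paper's figure caption says only that an MLP consumes the final-layer hidden state of the last token; the layer width, ReLU activation, softplus output, and random initial weights below are assumptions for illustration, and the names `init_mlp` and `predict_remaining` are hypothetical.

```python
import math
import random

def init_mlp(hidden_dim, width, seed=0):
    """Randomly initialise a one-hidden-layer MLP (illustrative, untrained)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 1.0 / math.sqrt(hidden_dim)) for _ in range(hidden_dim)]
          for _ in range(width)]
    b1 = [0.0] * width
    w2 = [rng.gauss(0.0, 1.0 / math.sqrt(width)) for _ in range(width)]
    b2 = 0.0
    return w1, b1, w2, b2

def predict_remaining(hidden_state, params):
    """Map a hidden-state vector to a positive remaining-length estimate."""
    w1, b1, w2, b2 = params
    # Hidden layer with ReLU (assumed activation).
    h = [max(0.0, sum(w * x for w, x in zip(row, hidden_state)) + b)
         for row, b in zip(w1, b1)]
    out = sum(w * v for w, v in zip(w2, h)) + b2
    # Numerically stable softplus keeps the estimate positive (assumed choice).
    return max(out, 0.0) + math.log1p(math.exp(-abs(out)))

params = init_mlp(hidden_dim=8, width=4)
estimate = predict_remaining([0.1] * 8, params)
```

In a real deployment the weights would be trained on (hidden state, remaining length) pairs, and the head would be evaluated periodically during decode so the estimate can be refreshed as generation proceeds.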

If this is right

Reduces P99 TPOT by 75.1%
Achieves 2.63 times higher goodput
Avoids SLO violations and OOM failures under evolving decode workloads
Handles long-output reasoning tasks without static pre-assignment
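For readers less familiar with the two headline metrics, the sketch below computes a nearest-rank P99 over per-request TPOT values and a goodput figure defined as requests per second that meet a TPOT SLO. The nearest-rank percentile convention, the use of a TPOT-only SLO, and the function names are assumptions; the paper may measure both quantities differently.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def goodput(request_tpots_ms, tpot_slo_ms, window_s):
    """Requests per second whose TPOT is within the SLO (assumed definition)."""
    met = sum(1 for t in request_tpots_ms if t <= tpot_slo_ms)
    return met / window_s

tpots = [30, 45, 120, 50]          # hypothetical per-request TPOT in ms
p99 = percentile(tpots, 99)        # tail latency over the sample
rate = goodput(tpots, tpot_slo_ms=50, window_s=2.0)
```

A drop in P99 TPOT with no change in the mean indicates that the tail, rather than the typical request, improved, which is the behaviour a workload balancer is expected to produce.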

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The predictor could be reused in other variable-length generation settings such as code or math reasoning agents.
Combining decode rescheduling with prefill optimizations might yield larger end-to-end gains in shared clusters.
Production accuracy of the hidden-state predictor under shifting user query distributions would determine whether the reported gains persist.

Load-bearing premise

The hidden-state length predictor must remain accurate enough across diverse real-world workloads and model sizes so that rescheduling decisions improve rather than degrade performance.

What would settle it

Running STAR on workloads where length-prediction error is high and checking whether P99 TPOT rises or goodput falls below the static-scheduling baseline.
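A toy version of such a test can be sketched as follows: perturb the true remaining lengths by a controlled relative error, place requests using only the perturbed values, and then measure imbalance with the true values. This is a simplified stand-in for the experiment described above, not STAR's rescheduler; the uniform error model, the greedy longest-first placement, and all function names are assumptions.

```python
import random
import statistics

def noisy_predictions(true_remaining, rel_error, rng):
    """Perturb each true remaining length by a uniform relative error."""
    return [max(0.0, n * (1.0 + rng.uniform(-rel_error, rel_error)))
            for n in true_remaining]

def assign_least_loaded(predicted, num_instances):
    """Greedy longest-first placement that sees predicted lengths only."""
    loads = [0.0] * num_instances
    assignment = [0] * len(predicted)
    for idx in sorted(range(len(predicted)), key=lambda i: -predicted[i]):
        target = loads.index(min(loads))
        assignment[idx] = target
        loads[target] += predicted[idx]
    return assignment

def true_load_variance(true_remaining, assignment, num_instances):
    """Variance of per-instance load, computed from the true lengths."""
    loads = [0.0] * num_instances
    for n, inst in zip(true_remaining, assignment):
        loads[inst] += n
    return statistics.pvariance(loads)

def sweep(true_remaining, num_instances, error_levels, seed=0):
    """Map each injected error level to the resulting true load variance."""
    rng = random.Random(seed)
    results = {}
    for err in error_levels:
        predicted = noisy_predictions(true_remaining, err, rng)
        assignment = assign_least_loaded(predicted, num_instances)
        results[err] = true_load_variance(true_remaining, assignment, num_instances)
    return results

results = sweep([100, 90, 10, 5], num_instances=2, error_levels=[0.0, 0.5])
```

If variance at high error levels exceeds what a prediction-free baseline achieves, the predictor is hurting rather than helping, which is the outcome the test above is designed to expose.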

Figures

Figures reproduced from arXiv: 2510.13668 by Chengying Huan, Chen Tian, Qingkai Meng, Qing Wang, Rong Gu, Sheng Zhong, Shipeng Li, Xue Li, Zetao Hong, Zhibin Wang, Zibo Wang.

**Figure 1.** Different output lengths lead to significant load variations across decode instances.
Excerpt: … dominates the total cost in LLM inference, especially for long outputs. Therefore, the variation in output length leads to significant workload imbalance across requests during the decode phase. Specifically, the variation of workload in the decode phase leads to two critical issues: • Issue 1: OOM of KV cache. Recently, …

**Figure 2.** Output length distribution.
Excerpt: … SLOs differ: prefill minimizes Time-to-First-Token (TTFT), while decode reduces Time-per-Output-Token (TPOT). Recognizing the distinct characteristics of these two phases, modern LLM serving systems (e.g., Mooncake [26], DistServe [35]) separate prefill and decode onto different hardware resources to satisfy their respective resource demands. Upon arrival, a request exclusivel…

**Figure 3.** Per-step execution time (TPOT) across three decode instances under PD disaggregation (1 prefill + 3 decode instances).
Excerpt: • Round-robin scheduling [31]: This straightforward approach assigns requests to decode instances in a round-robin manner, ensuring an even distribution of requests. However, it overlooks the varying workloads of different requests, leading to potential load imbalances and suboptimal pe…

**Figure 5.** System overview.
Excerpt: 2.4 Challenges. In summary, to balance the inference workload across decode instances in PD disaggregation, we identify two key challenges: • Accurate and efficient prediction of remaining output length: Existing methods either require intrusive prompt modifications, rely on less capable auxiliary models with poor accuracy for long outputs, or incur prohibitive overheads that prevent it…

**Figure 6.** Our runtime prediction method: the MLP predictor consumes the hidden state vector of the last token from the final layer to estimate remaining output length.
Excerpt: Additional context from generated tokens: As rescheduling occurs during the decode phase, the model has already generated some output tokens. These tokens provide additional context that can be leveraged to improve prediction accuracy. Moreover, as …

**Figure 7.** MAE of each prediction model for requests with 30-32K output tokens at different generated tokens.
Excerpt: … $D = \{(h_t, y_t)\}$ across the generation trajectory of each request. To ensure proper evaluation, we split the data at the request level rather than the sample level. Specifically, we randomly partition the original ShareGPT requests into training (70%), validation (15%), and test (15%) sets. This ensures that…
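The request-level split in the Figure 7 excerpt matters because many training samples come from the same request, and a sample-level split would leak a request's trajectory into the test set. A minimal sketch of that idea follows; the 70/15/15 ratios come from the excerpt, while the seed, the sample tuple layout, and the function names are assumptions.

```python
import random

def split_requests(request_ids, seed=0, train=0.70, val=0.15):
    """Partition unique request IDs into train/val/test ID sets."""
    ids = sorted(set(request_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * val)
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))

def partition_samples(samples, splits):
    """Route each (request_id, hidden_state, remaining_len) sample by its request."""
    train_ids, val_ids, _test_ids = splits
    out = {"train": [], "val": [], "test": []}
    for sample in samples:
        rid = sample[0]
        if rid in train_ids:
            out["train"].append(sample)
        elif rid in val_ids:
            out["val"].append(sample)
        else:
            out["test"].append(sample)
    return out

# Hypothetical data: 20 requests, 3 samples per request.
samples = [(rid, [0.0], 10 - step) for rid in range(20) for step in range(3)]
splits = split_requests(s[0] for s in samples)
parts = partition_samples(samples, splits)
```

Because routing is keyed on the request ID, every sample from a given request lands in exactly one partition.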

**Figure 9.** Workflow of the scheduler.
Excerpt: … requests assigned to instance $i$. For a request $r \in B_i$, $N(r)$ denotes the current number of tokens in request $r$, and $\hat{N}(r)$ denotes the predicted remaining generation length for request $r$. The current token load of instance $i$ is $N_i(B_i) = \sum_{r \in B_i} N(r)$. We notice that for scheduling purposes of workload balancing, the absolute execution time is less important than the relative dif…
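The load definitions in the Figure 9 excerpt can be turned into a small selection routine. In the sketch below, each request is a pair of current tokens N(r) and predicted remaining tokens N̂(r); an instance's load is the sum of N(r) plus a weight `alpha` times N̂(r), and one candidate migration is chosen from the most loaded to the least loaded instance by simulating each move and keeping the one that reduces load variance the most. The page quotes the paper as simulating migrations and maximizing workload variance reduction, but the weighted-sum combination, the single-move greedy choice, and every name here are assumptions rather than the paper's algorithm.

```python
import statistics

def instance_load(batch, alpha=1.0):
    """Sum of current tokens plus alpha-weighted predicted remaining tokens."""
    return sum(n + alpha * n_hat for n, n_hat in batch)

def best_migration(batches, alpha=1.0):
    """Return (src, dst, request_index) for the move that cuts variance most."""
    loads = [instance_load(b, alpha) for b in batches]
    src = max(range(len(loads)), key=loads.__getitem__)
    dst = min(range(len(loads)), key=loads.__getitem__)
    base_var = statistics.pvariance(loads)
    best = None  # (variance_reduction, request_index)
    for idx, (n, n_hat) in enumerate(batches[src]):
        weight = n + alpha * n_hat
        trial = list(loads)
        trial[src] -= weight   # simulate removing the request from the source
        trial[dst] += weight   # and adding it to the destination
        gain = base_var - statistics.pvariance(trial)
        if gain > 0 and (best is None or gain > best[0]):
            best = (gain, idx)
    if best is None:
        return None  # no single move improves balance
    return src, dst, best[1]

# Hypothetical batches: instance 0 is overloaded, instance 1 is light.
batches = [[(100, 400), (50, 50)], [(20, 30)]]
move = best_migration(batches)
```

Moving the largest request is not always the best choice: in the example, migrating the smaller request balances the two instances better than migrating the larger one, which is why each candidate is simulated before choosing.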

**Figure 10.** Overall performance under various RPS on ShareGPT and Alpaca datasets.
Excerpt: Goodput. In addition to throughput, we further employ goodput as a more comprehensive metric to incorporate SLO attainment, denoting the effective throughput, i.e., requests per second that meet their SLOs. Compared to throughput, ARES presents a more pronounced advantage in goodput, as the workload imbalance further leads to increase…

**Figure 11.** Transmission time proportion in TBT on various hardware.
Excerpt: … and A100 GPUs, which have different compute capabilities, and we also vary the network bandwidth to assess its impact on migration overhead …

**Figure 12.** Execution time variance comparison across different cluster sizes under 25 Gbps transfer speed. [Plot: execution time variance (ms²) over time (seconds); series: vLLM, vLLM + rescheduling, ARES, ARES-Oracle; annotated values: 8.624, 1.851, 0.741, 0.780.]

**Figure 13.** Execution time variance across different scheduling algorithms on high-load dataset.
Excerpt: … ARES-Oracle, which assumes perfect knowledge of the remaining generation lengths for all requests. Subsequently, we evaluate the effectiveness of our approach in both small-scale real systems and large-scale simulated clusters by demonstrating the variation in execution time across decode instances for 2,000 seconds o…

Original abstract

Large Language Model (LLM) inference has emerged as a fundamental paradigm; however, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as PD disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in SLO violations and OOM failures under evolving decode workloads. In this paper, we propose STAR, a decode rescheduling system powered by length prediction to anticipate future workloads. Our core contributions include: (1) A lightweight and continuous LLM-native prediction method that leverages LLM hidden state to model remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); (2) A rescheduling solution in decode phase with a dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 TPOT by 75.1% and achieving 2.63 times higher goodput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

STAR's hidden-state length predictor for decode rescheduling reports strong TPOT and goodput gains but rests on untested robustness to prediction errors across workloads.


The thing to know is that STAR uses a lightweight predictor from LLM hidden states to forecast output lengths and then reschedules decode work dynamically. The paper reports a 49 percent drop in prediction error, 93 percent fewer parameters in the predictor, 75 percent lower P99 TPOT, and 2.6 times better goodput.

What is new is the continuous, LLM-native prediction method and its tight integration into a decode-phase balancer that mixes current and predicted loads. Prior length prediction exists, but this version keeps overhead low and applies it to rescheduling rather than just initial scheduling.

The approach does well at targeting a practical issue in current serving systems. Variable output lengths, especially in reasoning tasks, break static prefill-decode splits and cause SLO violations or OOMs. The low-parameter design and focus on decode phase make sense for minimizing added cost.

The main soft spot is the lack of detail on how the system behaves when predictions are inaccurate. The gains assume the predictor stays good enough across workloads and model sizes that rescheduling helps more than it hurts. Without sensitivity tests or misprediction handling described, it's hard to judge if the mechanism is robust. The abstract also leaves out the specific baselines and workload details behind the performance numbers, so the claims need the full paper to evaluate properly.

This work is aimed at people who build and operate LLM inference systems in distributed setups. A reader interested in runtime optimizations for serving would pick up usable ideas on dynamic balancing.

I would recommend sending it to peer review. The core problem is real, the solution is concrete, and the numbers are sharp enough to merit checking against the experiments.

Referee Report

2 major / 2 minor

Summary. The manuscript presents STAR, a decode-phase rescheduling system for LLM inference. It introduces a lightweight LLM-native length predictor that uses hidden states to forecast remaining generation length, claiming a 49.42% MAE reduction and 93.28% parameter reduction. This predictor feeds a dynamic balancing mechanism that integrates current and predicted workloads to reschedule decode tasks, yielding a 75.1% reduction in P99 TPOT and 2.63× higher goodput relative to static PD-disaggregation baselines.

Significance. If the empirical gains prove robust, the work addresses a practical bottleneck in LLM serving systems where variable output lengths (especially long-output reasoning tasks) cause decode-phase imbalance, SLO violations, and OOM events. The hidden-state predictor approach is a lightweight, model-native technique that could improve resource utilization in production deployments without heavy additional infrastructure.

major comments (2)

[Abstract and §5 (Evaluation)] The headline claims of 75.1% P99 TPOT reduction and 2.63× goodput improvement rest on the hidden-state predictor remaining sufficiently accurate on unseen workloads. No sensitivity analysis to prediction error, no results on long-output reasoning tasks at different model scales, and no quantification of how mispredictions affect tail metrics or add overhead are provided; this is load-bearing for contribution (2).
[§4.1 (Predictor Design)] The reported 49.42% MAE reduction and 93.28% parameter cut are presented as concrete improvements, yet the evaluation supplies no information on baselines for length prediction, workload traces used for testing, statistical significance, or data exclusions. Without these, it is unclear whether the predictor generalizes or merely fits the reported conditions.

minor comments (2)

[Abstract] The abstract and introduction would benefit from explicit statements of the exact baselines (e.g., vLLM PD disaggregation, other predictors) and workload characteristics against which all numbers are measured.
[§4.1] Notation for hidden-state length prediction (e.g., definition of remaining length target) could be clarified with a short equation in the predictor section to aid reproducibility.
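One way the requested equation could look is sketched below. It assumes a request that ends with $L$ output tokens in total, a hidden state $h_t$ taken after $t$ generated tokens, and a predictor $f_\theta$; these symbols, and the choice of $y_t = L - t$ as the target, are assumptions made for illustration and are not confirmed by the text reviewed here. The sample set $D = \{(h_t, y_t)\}$ matches the notation visible in the Figure 7 excerpt.

```latex
% Assumed target: tokens still to be generated after step t.
y_t = L - t, \qquad 0 \le t \le L

% Predictor output and mean absolute error over the sample set D.
\hat{y}_t = f_\theta(h_t), \qquad
\mathrm{MAE} = \frac{1}{|D|} \sum_{(h_t,\, y_t) \in D} \bigl|\, f_\theta(h_t) - y_t \,\bigr|
```

Stating the target this way would also make clear that one request contributes many samples, one per decode step at which the predictor is queried.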

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional evidence would strengthen the claims about predictor robustness and experimental transparency. We address each point below and will incorporate revisions to improve the manuscript.


Referee: [Abstract and §5 (Evaluation)] The headline claims of 75.1% P99 TPOT reduction and 2.63× goodput improvement rest on the hidden-state predictor remaining sufficiently accurate on unseen workloads. No sensitivity analysis to prediction error, no results on long-output reasoning tasks at different model scales, and no quantification of how mispredictions affect tail metrics or add overhead are provided; this is load-bearing for contribution (2).

Authors: We agree that the current evaluation lacks explicit sensitivity analysis and cross-scale validation for long-output tasks. In the revised manuscript we will add a dedicated subsection in §5 that (i) injects controlled prediction errors at multiple levels and measures resulting changes in P99 TPOT and goodput, (ii) reports results on long-output reasoning workloads (e.g., GSM8K, MATH) for at least two additional model scales, and (iii) quantifies the incremental latency and memory overhead of the predictor and rescheduling logic. These additions will directly address the load-bearing nature of the claims. revision: yes
Referee: [§4.1 (Predictor Design)] The reported 49.42% MAE reduction and 93.28% parameter cut are presented as concrete improvements, yet the evaluation supplies no information on baselines for length prediction, workload traces used for testing, statistical significance, or data exclusions. Without these, it is unclear whether the predictor generalizes or merely fits the reported conditions.

Authors: We acknowledge the need for fuller experimental context. The revised §4.1 will explicitly list the length-prediction baselines, describe the workload traces and train/test splits, report statistical significance tests, and state any data-exclusion criteria. These clarifications will be added without altering the reported MAE and parameter numbers. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical system with independent measurements


The paper presents an engineering system (length predictor + decode rescheduler) whose headline gains are measured on a deployed prototype rather than derived from equations that reduce to the inputs. The MAE reduction and P99 TPOT improvement are reported as experimental outcomes; the integration of current and predicted workloads is a design choice validated by those measurements, not a self-definitional or fitted-input loop. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the abstract or described contributions. This is the common honest case of a systems paper whose claims rest on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach depends on the domain assumption that hidden states encode sufficient future-length information and on the modeling choice that a lightweight head can extract it without harming generation quality; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: LLM hidden states during decode contain usable signal for remaining output length.
Invoked when the paper states the predictor 'leverages LLM hidden state to model remaining generation length'.

pith-pipeline@v0.9.0 · 5720 in / 1274 out tokens · 56383 ms · 2026-05-18T06:08:06.870377+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean (J-cost uniqueness, Aczél classification) · washburn_uniqueness_aczel · unclear

A lightweight and continuous LLM-native prediction method that leverages LLM hidden state to model remaining generation length... A rescheduling solution in decode phase with a dynamic balancing mechanism that integrates current and predicted workloads

IndisputableMonolith/Foundation/AlexanderDuality.lean (D=3 forcing via linking) · alexander_duality_circle_linking · unclear

multi-stage rescheduling strategy that identifies overloaded and underloaded decode instances... simulates the migration... maximizes workload variance reduction

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

[1] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117–134, Santa Clara, CA, July 2024. USENIX Association.

[2] Alibaba. Qwen. https://chat.qwen.ai/, 2025. Accessed: 2025-08-30.

[3] Anthropic. Claude. https://claude.ai/, 2025. Accessed: 2025-08-30.

[4] Shaoyuan Chen, Yutong Lin, Mingxing Zhang, and Yongwei Wu. Efficient and economic large language model inference with attention offloading, 2024.

[5] Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, and Sheng Zhang. Slice-level scheduling for high throughput and load balanced llm serving, 2025.

[6] DeepSeek-AI. Deepseek. https://chat.deepseek.com/, 2025. Accessed: 2025-08-30.

[7] DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.

[8] DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

[9] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers, 2023.

[10] Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. Cost-Efficient large language model serving for multi-turn conversations with CachedAttention. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 111–126, Santa Clara, CA, July 2024. USENIX Association.

[11] Google-DeepMind. Gemini 2.5. https://gemini.google.com/app, 2025. Accessed: 2025-08-30.

[12] Horace He and Thinking Machines Lab. Defeating nondeterminism in llm inference. Thinking Machines Lab: Connectionism. https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

[13] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

[14] Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, et al. Inference without interference: Disaggregate llm inference for mixed downstream workloads. arXiv preprint arXiv:2401.11181, 2024.

[15] Yunho Jin, Chun-Feng Wu, David Brooks, and Gu-Yeon Wei. S³: Increasing gpu utilization during generative inference for higher throughput. Advances in Neural Information Processing Systems, 36:18015–18027, 2023.

[16] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.

[17] Lujun Li, Lama Sleem, Niccolo’ Gentile, Geoffrey Nichil, and Radu State. Exploring the impact of temperature on large language models: hot or cold?, 2025.

[18] Weiqing Li, Guochao Jiang, Xiangyong Ding, Zhangcheng Tao, Chuzhan Hao, Chenfeng Xu, Yuewei Zhang, and Hao Wang. Flowkv: A disaggregated inference framework with low-latency kv cache transfer and load-aware scheduling, 2025.

[19] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees, 2024.

[20] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test, 2025.

[21] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty, 2025.

[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.

[23] Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia. Spotserve: Serving generative large language models on preemptible instances. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’24, pages 1112–1127, New York, NY, USA, 2024. ...

[24] OpenAI. Chatgpt. https://chat.openai.com, 2025. Accessed: 2025-08-30.

[25] Lutz Prechelt. Early stopping-but when? In Neural Networks: Tricks of the trade, pages 55–69. Springer, 2002.

[26] Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot. In 23rd USENIX Conference on File and Storage Technologies (FAST 25), pages 155–170, Santa Clara, CA, February 2025. USENIX Association.

[27] Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Başar, and Ravishankar K Iyer. Power-aware deep learning model serving with μ-Serve. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 75–93, 2024.

[28] Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T Kalbarczyk, Tamer Başar, and Ravishankar K Iyer. Efficient interactive llm serving with proxy model-based sequence length prediction. arXiv preprint arXiv:2404.08509, 2024.

[29] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

[30] ShareGPT Teams. Sharegpt. https://sharegpt.com/, 2023. Accessed: 2025.

[31] vLLM Project. vllm disaggregated prefill. https://docs.vllm.ai/en/latest/examples/online_serving/disaggregated_prefill.html, 2025. Accessed: 2025-09-09.

[32] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

[33] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.

[34] Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, and Yang You. Response length perception and sequence scheduling: An llm-empowered llm inference pipeline. Advances in Neural Information Processing Systems, 36:65517–65530, 2023.

[35] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, Santa Clara, CA, July 2024. USENIX Association.