pith. machine review for the scientific record. sign in

arxiv: 2510.13668 · v2 · submitted 2025-10-15 · 💻 cs.DC · cs.LG

STAR: Decode-Phase Rescheduling for LLM Inference

Pith reviewed 2026-05-18 06:08 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords LLM inferencedecode reschedulinglength predictionhidden statesworkload balancingTPOTgoodputSLO
0
0 comments X

The pith

STAR reschedules LLM decode workloads using hidden-state length predictions to cut P99 TPOT by 75.1%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model inference suffers from severe workload imbalance during the decode phase because output lengths vary widely, especially on long reasoning tasks. Static prefill-to-decode scheduling in existing systems produces SLO violations and out-of-memory errors when workloads shift. STAR addresses this with a lightweight predictor that reads the model's own hidden states to forecast remaining generation length, then feeds both current and predicted loads into a dynamic rescheduler that rebalances decode work. If the approach holds, inference systems could maintain steadier latency and higher throughput without redesigning the core architecture. Readers care because the method targets a concrete, recurring bottleneck in production LLM serving.

Core claim

STAR is a decode rescheduling system powered by length prediction to anticipate future workloads. Its core contributions are a lightweight continuous LLM-native prediction method that leverages hidden states to model remaining generation length at high precision and low overhead, plus a rescheduling solution that applies a dynamic balancing mechanism integrating current and predicted workloads.

What carries the argument

Hidden-state length predictor that models remaining generation length from LLM internal states, combined with a dynamic balancing mechanism that reschedules decode-phase work.

If this is right

  • Reduces P99 TPOT by 75.1%
  • Achieves 2.63 times higher goodput
  • Avoids SLO violations and OOM failures under evolving decode workloads
  • Handles long-output reasoning tasks without static pre-assignment

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The predictor could be reused in other variable-length generation settings such as code or math reasoning agents.
  • Combining decode rescheduling with prefill optimizations might yield larger end-to-end gains in shared clusters.
  • Production accuracy of the hidden-state predictor under shifting user query distributions would determine whether the reported gains persist.

Load-bearing premise

The hidden-state length predictor must remain accurate enough across diverse real-world workloads and model sizes so that rescheduling decisions improve rather than degrade performance.

What would settle it

Running STAR on workloads where length-prediction error is high and checking whether P99 TPOT rises or goodput falls below the static-scheduling baseline.

Figures

Figures reproduced from arXiv: 2510.13668 by Chengying Huan, Chen Tian, Qingkai Meng, Qing Wang, Rong Gu, Sheng Zhong, Shipeng Li, Xue Li, Zetao Hong, Zhibin Wang, Zibo Wang.

Figure 1
Figure 1. Figure 1: Different output lengths lead to significant load variations across decode instances. dominates the total cost in LLM inference, especially for long outputs. Therefore, the variation in output length leads to significant workload imbalance across requests during the decode phase. Specifically, the variation of workload in the decode phase leads to two critical issues: • Issue 1: OOM of KV cache. Recently, … view at source ↗
Figure 2
Figure 2. Figure 2: Output length distribution. SLOs differ: prefill minimizes Time-to-First-Token (TTFT), while decode reduces Time-per-Output-Token (TPOT). Recognizing the distinct characteristics of these two phases, modern LLM serving systems (e.g., Mooncake [26], Dist￾Serve [35]) separate prefill and decode onto different hard￾ware resources to satisfy their respective resource demands. Upon arrival, a request exclusivel… view at source ↗
Figure 3
Figure 3. Figure 3: Per-step execution time (TPOT) across three de￾code instances under PD disaggregation (1 prefill + 3 decode instances). • Round-robin scheduling [31]: This straightforward ap￾proach assigns requests to decode instances in a round￾robin manner, ensuring an even distribution of requests. However, it overlooks the varying workloads of different requests, leading to potential load imbalances and subop￾timal pe… view at source ↗
Figure 5
Figure 5. Figure 5: System overview. 2.4 Challenges In summary, to balance the inference workload across de￾code instances in PD disaggregation, we identify two key challenges: • Accurate and efficient prediction of remaining out￾put length: Existing methods either require intrusive prompt modifications, rely on less capable auxiliary mod￾els with poor accuracy for long outputs, or incur prohibi￾tive overheads that prevent it… view at source ↗
Figure 6
Figure 6. Figure 6: Our runtime prediction method: the MLP predictor consumes the hidden state vector of the last token from the final layer to estimate remaining output length. Additional context from generated tokens: As reschedul￾ing occurs during the decode phase, the model has already generated some output tokens. These tokens provide addi￾tional context that can be leveraged to improve prediction accuracy. Moreover, as … view at source ↗
Figure 7
Figure 7. Figure 7: MAE of each prediction model for requests with 30-32K output tokens at different generated tokens. D = {(h𝑡 , 𝑦𝑡)} across the generation trajectory of each re￾quest. To ensure proper evaluation, we split the data at the request level rather than the sample level. Specifically, we randomly partition the original ShareGPT requests into train￾ing (70%), validation (15%), and test (15%) sets. This ensures that… view at source ↗
Figure 9
Figure 9. Figure 9: Workflow of the scheduler. requests assigned to instance 𝑖. For a request𝑟 ∈ 𝐵𝑖 , 𝑁 (𝑟) de￾notes the current number of tokens in request𝑟, and 𝑁ˆ (𝑟) de￾notes the predicted remaining generation length for request𝑟. The current token load of instance 𝑖 is 𝑁𝑖(𝐵𝑖) = Í 𝑟 ∈𝐵𝑖 𝑁 (𝑟). We notice that for scheduling purposes of workload bal￾ancing, the absolute execution time is less important than the relative dif… view at source ↗
Figure 10
Figure 10. Figure 10: Overall performance under various RPS on ShareGPT and Alpaca datasets. Goodput. In addition to throughput, we further employ goodput as a more comprehensive metric to incorporate SLO attainment, denoting the effective thoroughput, i.e., requests per second that meet their SLOs. Compared to throughput, ARES presents a more pronounced advantage in goodput, as the workload imbalance further leads to increase… view at source ↗
Figure 11
Figure 11. Figure 11: Transmission time proportion in TBT on various hardware. and A100 GPUs, which have different compute capabilities, and we also vary the network bandwidth to assess its impact on migration overhead [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Execution time variance comparison across different cluster sizes under 25 Gbps transfer speed. 0 250 500 750 1000 1250 1500 1750 2000 Time (seconds) 0 10 20 30 Execution time variance (ms²) 8.624 1.851 0.741 0.780 vLLM vLLM + rescheduling ARES ARES-Oracle [PITH_FULL_IMAGE:figures/full_fig_p012_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Execution time variance across different schedul￾ing algorithms on high-load dataset. ARES-Oracle, which assumes perfect knowledge of the re￾maining generation lengths for all requests. Subsequently, we evaluate the effectiveness of our approach in both small-scale real systems and large-scale simulated clusters by demon￾strating the variation in execution time across decode in￾stances for 2,000 seconds o… view at source ↗
read the original abstract

Large Language Model (LLM) inference has emerged as a fundamental paradigm, however, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as PD disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in SLO violations and OOM failures under evolving decode workloads. In this paper, we propose STAR, a decode rescheduling system powered by length prediction to anticipate future workloads. Our core contributions include: (1) A lightweight and continuous LLM-native prediction method that leverages LLM hidden state to model remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); (2) A rescheduling solution in decode phase with a dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 TPOT by 75.1% and achieving 2.63 times higher goodput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents STAR, a decode-phase rescheduling system for LLM inference. It introduces a lightweight LLM-native length predictor that uses hidden states to forecast remaining generation length, claiming a 49.42% MAE reduction and 93.28% parameter reduction. This predictor feeds a dynamic balancing mechanism that integrates current and predicted workloads to reschedule decode tasks, yielding a 75.1% reduction in P99 TPOT and 2.63× higher goodput relative to static PD-disaggregation baselines.

Significance. If the empirical gains prove robust, the work addresses a practical bottleneck in LLM serving systems where variable output lengths (especially long-output reasoning tasks) cause decode-phase imbalance, SLO violations, and OOM events. The hidden-state predictor approach is a lightweight, model-native technique that could improve resource utilization in production deployments without heavy additional infrastructure.

major comments (2)
  1. [Abstract and §5] Abstract and §5 (Evaluation): The headline claims of 75.1% P99 TPOT reduction and 2.63× goodput improvement rest on the hidden-state predictor remaining sufficiently accurate on unseen workloads. No sensitivity analysis to prediction error, no results on long-output reasoning tasks at different model scales, and no quantification of how mispredictions affect tail metrics or add overhead are provided; this is load-bearing for contribution (2).
  2. [§4.1] §4.1 (Predictor Design): The reported 49.42% MAE reduction and 93.28% parameter cut are presented as concrete improvements, yet the evaluation supplies no information on baselines for length prediction, workload traces used for testing, statistical significance, or data exclusions. Without these, it is unclear whether the predictor generalizes or merely fits the reported conditions.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from explicit statements of the exact baselines (e.g., vLLM PD disaggregation, other predictors) and workload characteristics against which all numbers are measured.
  2. [§4.1] Notation for hidden-state length prediction (e.g., definition of remaining length target) could be clarified with a short equation in the predictor section to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional evidence would strengthen the claims about predictor robustness and experimental transparency. We address each point below and will incorporate revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §5] Abstract and §5 (Evaluation): The headline claims of 75.1% P99 TPOT reduction and 2.63× goodput improvement rest on the hidden-state predictor remaining sufficiently accurate on unseen workloads. No sensitivity analysis to prediction error, no results on long-output reasoning tasks at different model scales, and no quantification of how mispredictions affect tail metrics or add overhead are provided; this is load-bearing for contribution (2).

    Authors: We agree that the current evaluation lacks explicit sensitivity analysis and cross-scale validation for long-output tasks. In the revised manuscript we will add a dedicated subsection in §5 that (i) injects controlled prediction errors at multiple levels and measures resulting changes in P99 TPOT and goodput, (ii) reports results on long-output reasoning workloads (e.g., GSM8K, MATH) for at least two additional model scales, and (iii) quantifies the incremental latency and memory overhead of the predictor and rescheduling logic. These additions will directly address the load-bearing nature of the claims. revision: yes

  2. Referee: [§4.1] §4.1 (Predictor Design): The reported 49.42% MAE reduction and 93.28% parameter cut are presented as concrete improvements, yet the evaluation supplies no information on baselines for length prediction, workload traces used for testing, statistical significance, or data exclusions. Without these, it is unclear whether the predictor generalizes or merely fits the reported conditions.

    Authors: We acknowledge the need for fuller experimental context. The revised §4.1 will explicitly list the length-prediction baselines, describe the workload traces and train/test splits, report statistical significance tests, and state any data-exclusion criteria. These clarifications will be added without altering the reported MAE and parameter numbers. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical system with independent measurements

full rationale

The paper presents an engineering system (length predictor + decode rescheduler) whose headline gains are measured on a deployed prototype rather than derived from equations that reduce to the inputs. The MAE reduction and P99 TPOT improvement are reported as experimental outcomes; the integration of current and predicted workloads is a design choice validated by those measurements, not a self-definitional or fitted-input loop. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the abstract or described contributions. This is the common honest case of a systems paper whose claims rest on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach depends on the domain assumption that hidden states encode sufficient future-length information and on the modeling choice that a lightweight head can extract it without harming generation quality; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption LLM hidden states during decode contain usable signal for remaining output length
    Invoked when the paper states the predictor 'leverages LLM hidden state to model remaining generation length'

pith-pipeline@v0.9.0 · 5720 in / 1274 out tokens · 56383 ms · 2026-05-18T06:08:06.870377+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

  1. [1]

    Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ram- jee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In18th USENIX Symposium on Operating Systems De- sign and Implementation (OSDI 24), pages 117–134, Santa Clara, CA, July 2024. USENIX Association

  2. [2]

    Qwen.https://chat.qwen.ai/, 2025

    Alibaba. Qwen.https://chat.qwen.ai/, 2025. Accessed: 2025-08-30

  3. [3]

    Claude.https://claude.ai/, 2025

    Anthropic. Claude.https://claude.ai/, 2025. Accessed: 2025-08-30

  4. [4]

    Ef- ficient and economic large language model inference with attention offloading, 2024

    Shaoyuan Chen, Yutong Lin, Mingxing Zhang, and Yongwei Wu. Ef- ficient and economic large language model inference with attention offloading, 2024

  5. [5]

    Slice-level scheduling for high throughput and load balanced llm serving, 2025

    Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, and Sheng Zhang. Slice-level scheduling for high throughput and load balanced llm serving, 2025

  6. [6]

    Deepseek.https://chat.deepseek.com/, 2025

    DeepSeek-AI. Deepseek.https://chat.deepseek.com/, 2025. Accessed: 2025-08-30

  7. [7]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  8. [8]

    Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

  9. [9]

    Gptq: Accurate post-training quantization for generative pre-trained trans- formers, 2023

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained trans- formers, 2023

  10. [10]

    Cost-Efficient large language model serving for multi-turn conversations with Cache- dAttention

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. Cost-Efficient large language model serving for multi-turn conversations with Cache- dAttention. In2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 111–126, Santa Clara, CA, July 2024. USENIX Association

  11. [11]

    Gemini 2.5.https://gemini.google.com/app, 2025

    Google-DeepMind. Gemini 2.5.https://gemini.google.com/app, 2025. Accessed: 2025-08-30

  12. [12]

    Defeating nondeter- minism in llm inference.Thinking Machines Lab: Connectionism,

    Horace He and Thinking Machines Lab. Defeating nondeter- minism in llm inference.Thinking Machines Lab: Connectionism,

  13. [13]

    https://thinkingmachines.ai/blog/defeating-nondeterminism- in-llm-inference/

  14. [14]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The cu- rious case of neural text degeneration.arXiv preprint arXiv:1904.09751, 2019

  15. [15]

    2024-01-20

    Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, et al. Inference without interference: Disaggregate llm inference for mixed downstream workloads.arXiv preprint arXiv:2401.11181, 2024

  16. [16]

    Advances in Neural Information Processing Systems, 36:18015–18027, 2023

    Yunho Jin, Chun-Feng Wu, David Brooks, and Gu-Yeon Wei.𝑠3: Increas- ing gpu utilization during generative inference for higher throughput. Advances in Neural Information Processing Systems, 36:18015–18027, 2023

  17. [17]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  18. [18]

    Exploring the impact of temperature on large language models:hot or cold?, 2025

    Lujun Li, Lama Sleem, Niccolo’ Gentile, Geoffrey Nichil, and Radu State. Exploring the impact of temperature on large language models:hot or cold?, 2025

  19. [19]

    Flowkv: A disaggregated inference framework with low-latency kv cache transfer and load-aware scheduling, 2025

    Weiqing Li, Guochao Jiang, Xiangyong Ding, Zhangcheng Tao, Chuzhan Hao, Chenfeng Xu, Yuewei Zhang, and Hao Wang. Flowkv: A disaggregated inference framework with low-latency kv cache transfer and load-aware scheduling, 2025

  20. [20]

    Eagle-2: Faster inference of language models with dynamic draft trees, 2024

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees, 2024

  21. [21]

    Eagle-3: Scaling up inference acceleration of large language models via training- time test, 2025

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training- time test, 2025

  22. [22]

    Eagle: Speculative sampling requires rethinking feature uncertainty, 2025

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty, 2025

  23. [23]

    Decoupled weight decay regulariza- tion, 2019

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regulariza- tion, 2019

  24. [24]

    Spotserve: Serving generative large language models on preemptible instances

    Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia. Spotserve: Serving generative large language models on preemptible instances. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’24, page 1112–1127, New York, NY, USA, 2024. ...

  25. [25]

    Chatgpt.https://chat.openai.com, 2025

    OpenAI. Chatgpt.https://chat.openai.com, 2025. Accessed: 2025-08-30

  26. [26]

    Early stopping-but when? InNeural Networks: Tricks of the trade, pages 55–69

    Lutz Prechelt. Early stopping-but when? InNeural Networks: Tricks of the trade, pages 55–69. Springer, 2002

  27. [27]

    Mooncake: Trading more storage for less computation — a KVCache-centric ar- chitecture for serving LLM chatbot

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation — a KVCache-centric ar- chitecture for serving LLM chatbot. In23rd USENIX Conference on File and Storage Technologies (FAST 25), pages 155–170, Santa Clara, CA, February 2025. USENIX Association. 13

  28. [28]

    Power-aware deep learning model serving with {𝜇 -Serve}

    Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Başar, and Ravishankar K Iyer. Power-aware deep learning model serving with {𝜇 -Serve}. In2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 75–93, 2024

  29. [29]

    Efficient interactive llm serving with proxy model-based sequence length prediction.arXiv preprint arXiv:2404.08509, 2024

    Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T Kalbarczyk, Tamer Başar, and Ravishankar K Iyer. Efficient interactive llm serving with proxy model-based sequence length prediction.arXiv preprint arXiv:2404.08509, 2024

  30. [30]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model.https://github.com/ tatsu-lab/stanford_alpaca, 2023

  31. [31]

    Sharegpt.https://sharegpt.com/, 2023

    ShareGPT Teams. Sharegpt.https://sharegpt.com/, 2023. Accessed: 2025

  32. [32]

    vllm disaggregated prefill.https://docs.vllm.ai/en/latest/ examples/online_serving/disaggregated_prefill.html, 2025

    vLLM Project. vllm disaggregated prefill.https://docs.vllm.ai/en/latest/ examples/online_serving/disaggregated_prefill.html, 2025. Accessed: 2025-09-09

  33. [33]

    Chain-of-thought prompt- ing elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompt- ing elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  34. [34]

    Orca: A distributed serving system for {Transformer-Based} generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX Sympo- sium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022

  35. [35]

    Response length perception and sequence scheduling: An llm-empowered llm inference pipeline.Advances in Neural Information Processing Systems, 36:65517–65530, 2023

    Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, and Yang You. Response length perception and sequence scheduling: An llm-empowered llm inference pipeline.Advances in Neural Information Processing Systems, 36:65517–65530, 2023

  36. [36]

    DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implemen- tation (OSDI 24), pages 193–210, Santa Clara, CA, July 2024. USENIX Association. 14