
arxiv: 2512.19179 · v3 · pith:DZFZTF2Hnew · submitted 2025-12-22 · 💻 cs.DC

CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing

Pith reviewed 2026-05-21 17:19 UTC · model grok-4.3

classification 💻 cs.DC
keywords LLM serving, inference scheduling, length heterogeneity, multi-instance systems, dynamic programming, load balancing, attention backend, tail latency

The pith

CascadeInfer partitions LLM serving instances into length-specialized groups to cut end-to-end latency and raise throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that mixing requests of very different lengths inside the same batch harms GPU efficiency in the attention layers of modern LLMs. CascadeInfer therefore splits a set of instances into groups each responsible for a narrow band of lengths so that requests travel through the groups like stages in a pipeline. A dynamic programming routine picks the band boundaries that deliver the best overall quality of experience, while runtime adjustments keep the load balanced inside and across groups. The approach becomes relevant once context windows exceed 128K tokens because length variance then turns into a dominant source of under-utilization and long delays. If the method works as described, operators can serve more traffic at lower latency on the same number of GPUs.

Core claim

CascadeInfer is a runtime system that dynamically reschedules requests across multiple instances serving the same LLM to mitigate per-instance length heterogeneity. It partitions these instances into length-specialized groups, each handling requests within a designated length range, naturally forming a pipeline as requests flow through them. A dynamic programming algorithm efficiently finds the stage partition with the best quality of experience (QoE), while runtime range refinement and decentralized load rebalancing, both across and within groups, keep the multi-instance service balanced and efficient.

What carries the argument

length-range partitions of instances that form a request pipeline, with boundaries chosen by dynamic programming to minimize heterogeneity within each batch
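
A minimal Python sketch of what such a boundary search might look like. It borrows only the state shape that the paper's Figure 4 excerpt describes (number of stages, number of instances, length prefix). The per-stage cost is a stand-in invented for illustration, since the paper's QoE-based pipeline quality is not reproduced on this page. The names `plan_pipeline` and `stage_cost` and the bucketed input format are assumptions, not the authors' code.

```python
from functools import lru_cache


def plan_pipeline(buckets, num_instances, max_stages):
    """Pick length bands and per-band instance counts by dynamic programming.

    buckets: list of (length, request_count) pairs, sorted by length.
    Returns (cost, plan), where plan is a tuple of
    (shortest_length, longest_length, instances) per stage.
    """
    n = len(buckets)

    # Prefix sums of token volume so any band's volume is an O(1) lookup.
    prefix = [0]
    for length, count in buckets:
        prefix.append(prefix[-1] + length * count)

    def stage_cost(i, j, k):
        # HYPOTHETICAL cost, not the paper's QoE model: token volume of
        # buckets i..j-1, inflated by the longest/shortest length ratio
        # (a crude heterogeneity proxy), shared across k instances.
        tokens = prefix[j] - prefix[i]
        spread = buckets[j - 1][0] / buckets[i][0]
        return tokens * spread / k

    @lru_cache(maxsize=None)
    def best(s, e, j):
        # Minimum cost of serving the first j buckets with exactly
        # s stages and exactly e instances.
        if s == 1:
            return stage_cost(0, j, e), ((buckets[0][0], buckets[j - 1][0], e),)
        result = (float("inf"), ())
        for i in range(s - 1, j):                # earlier stages take the first i buckets
            for k in range(1, e - (s - 1) + 1):  # last stage gets k instances
                prev_cost, prev_plan = best(s - 1, e - k, i)
                cost = prev_cost + stage_cost(i, j, k)
                if cost < result[0]:
                    band = (buckets[i][0], buckets[j - 1][0], k)
                    result = (cost, prev_plan + (band,))
        return result

    stage_limit = min(max_stages, num_instances, n)
    candidates = [best(s, num_instances, n) for s in range(1, stage_limit + 1)]
    return min(candidates, key=lambda c: c[0])


cost, plan = plan_pipeline(
    [(100, 50), (200, 40), (50_000, 2), (100_000, 1)],
    num_instances=4,
    max_stages=3,
)
```

On this toy input the search settles on two stages, one instance for lengths 100 to 200 and three instances for lengths 50,000 to 100,000. That split reflects the invented cost function and says nothing about what the real system would choose.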

If this is right

  • End-to-end latency falls by up to 67 percent under identical hardware and model settings.
  • Tail latency falls by up to 69 percent.
  • System throughput rises by up to 2.89 times relative to prior multi-instance schedulers.
  • Decentralized load rebalancing keeps utilization high both within each length group and across the pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same grouping principle could be tested on other batch-sensitive GPU kernels beyond attention, such as certain matrix-multiplication patterns.
  • Placing the length-range decisions inside the front-end load balancer might reduce the frequency of runtime rescheduling.
  • The dynamic-programming step itself may need approximation or caching when the cluster contains hundreds of instances.

Load-bearing premise

Rescheduling a request from one instance to another adds almost no extra delay compared with the time saved by keeping batch lengths more uniform, and the chosen length ranges stay useful long enough that the dynamic-programming solution does not need constant re-solving.

What would settle it

Run CascadeInfer on a workload whose request lengths shift rapidly every few seconds and measure whether the claimed 67 percent latency reduction still appears or whether the cost of frequent rescheduling cancels the gains.
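
One way to build such a stress workload, sketched in Python. Everything here is an assumption made for illustration: the function name `shifting_workload`, the two length ranges, and the choice of Poisson arrivals. None of it comes from the paper's evaluation setup.

```python
import random


def shifting_workload(duration_s, rate_rps, shift_period_s, seed=0,
                      short_range=(64, 2048), long_range=(32_000, 128_000)):
    """Generate (arrival_time_s, request_length_tokens) pairs.

    The length distribution flips between a short-request phase and a
    long-request phase every `shift_period_s` seconds.
    """
    rng = random.Random(seed)
    t = 0.0
    requests = []
    while True:
        t += rng.expovariate(rate_rps)  # exponential gaps give Poisson arrivals
        if t >= duration_s:
            break
        phase = int(t // shift_period_s) % 2  # 0 = short phase, 1 = long phase
        lo, hi = short_range if phase == 0 else long_range
        requests.append((t, rng.randint(lo, hi)))
    return requests


# Example: 60 s of traffic at 20 req/s, with the length regime flipping every 5 s.
trace = shifting_workload(duration_s=60.0, rate_rps=20.0, shift_period_s=5.0)
```

Replaying a trace like this against both the baseline and CascadeInfer, while shrinking `shift_period_s`, would show where rescheduling cost starts to outweigh the batching gains.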

Figures

Figures reproduced from arXiv: 2512.19179 by Bohan Zhao (2), Chenqi Zhao (1), Wenfei Wu (1), Yitao Yuan (1, 2), Yongchao He (2), and Zane Cao (2); (1) Peking University, (2) ScitiX AI.

Figure 1
Figure 1: Request-length distribution in batches under various scheduling policies and request rates. Batches were sampled at 20%, 40%, 60%, and 80% of the inference process. The inputs come from an LLM dialogue dataset [1], and requests longer than 128K are discarded. [Plot: series FlashAttention, FlashInfer, Triton; axis Latency (ms); panel (a) Request length 1000 vs 50000.]
Figure 2
Figure 2: Effect of sequence length heterogeneity on decoding forward pass performance. Measured on a single H100 GPU using vLLM and SGLang with FlashAttention, FlashInfer, and Triton (model: Llama-3.2-3B, batch size: 512). (vs. 14% baseline). (2) Engines observe highly heterogeneous sequence lengths. Real workloads exhibit skewed length distributions, with many short requests mixed with few but increasingly common…
Figure 3
Figure 3: Architecture and workflow of CascadeInfer. Engine instances are grouped by length into stages forming a logical pipeline; sequences may exit early without traversing all stages. …gresses, sequences naturally flow from shorter to longer stages. As shown in…
Figure 4
Figure 4: Pipeline planning based on the request length distribution. …lengths lie in $[l', l)$. The pipeline's "goodness" is quantified as the total QoE of all instances processing all requests, called pipeline quality. Algorithm. Let $f_{s,e,l}$ denote the optimal pipeline quality of serving all sequences with length $\le l$ using $s$ stages and $e$ instances. $f_{s,e,l}$ can be recursively represented as the sum of the optimal qual…
Figure 5
Figure 5: Illustration of intra-stage load balancing using dynamic decentralized bid-ask scheduling. …tionately skew the length distribution; freezing the boundary prevents these discrete events from causing huge shifts in partition logic, ensuring reliable decisions. 4.4 Decentralized Load (Re)Balancing. Two classes of intra-stage load (re)balancing. When an upstream instance hands over requests to its downstream suc…
Figure 6
Figure 6: Mean and 95th-percentile TTFT measured across different LLM models under varying request arrival rates. …strict concurrency limit (capped at three parallel transfers in our implementation); requests exceeding this threshold continue running on the source to avoid performance regression. Finally, we employ asynchronous multi-round live migration (adapting Llumnix [24]) combined with bidirectional transfer s…
Figure 7
Figure 7: Mean and 95th-percentile TPOT measured across different LLM models and varying request arrival rates. [Plot: series vLLM, Llumnix, CascadeInfer; panels Llama-3.2-3B, GLM-4-9B, Phi-3-14B, Qwen2.5-32B; axes Mean TPOT (s) vs. Req. rate (req/s).]
Figure 8
Figure 8: TPOT of a single instance across varying request arrival rates. CascadeInfer's single-instance performance matches vLLM's but falls behind Llumnix's. By comparing this to other results in §6, we find that CascadeInfer's multi-instance scheduling delivers higher gains than Llumnix's. Experiment parameters. We vary request arrival rates to cover both light and heavy loads. Light load verifies that CascadeIn…
Figure 10
Figure 10: System throughput measured across different LLM models under varying request arrival rates. [Plot: series SGLang, vLLM, Llumnix, CascadeInfer; panel (a) L40 testbed (Llama-3.2-3B, Llama-3.1-8B); panel (b) Tensor parallelism (TP=2, TP=4); axes token/s vs. Req. rate (req/s).]
Figure 12
Figure 12: SLO attainment measured across SLO levels and varying request arrival rates. The SLO is defined by the mean TTFT and TPOT at minimum system load, and the N× SLO scales both constraints N times. …ing. CascadeInfer sustains a higher threshold than all baselines. Under heavy load, its average throughput reaches 1.99× and 2.18× those of vLLM and SGLang, respectively, and is 1.71× that of Llumnix. These gains …
Figure 13
Figure 13: Prediction error of our cost model. Errors closer to zero are better. [Plot: legend no pipeline, chain, CascadeInfer; panel (a) Normalized latency, Latency (s) vs. Req. rate (req/s); panel (b) System throughput, token/s vs. Req. rate (req/s).]
Figure 14
Figure 14: Performance across layouts and varying request arrival rates. …evaluate attainment when both bounds are scaled by a factor of N
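
The Figure 12 caption defines the SLO as the mean TTFT and TPOT at minimum load, with an N× SLO scaling both bounds by N. A small Python sketch of how attainment under that definition could be computed follows. The function name `slo_attainment`, the millisecond units, and the sample numbers are assumptions for illustration, not measurements from the paper.

```python
def slo_attainment(requests, base_ttft_ms, base_tpot_ms, scale):
    """Fraction of requests meeting both latency bounds at an N-times SLO.

    requests: list of (ttft_ms, tpot_ms) pairs, one per finished request.
    base_ttft_ms, base_tpot_ms: mean TTFT and TPOT at minimum system load.
    scale: the N in "N x SLO"; both bounds are multiplied by it.
    """
    if not requests:
        raise ValueError("need at least one request")
    ttft_bound = scale * base_ttft_ms
    tpot_bound = scale * base_tpot_ms
    met = sum(1 for ttft, tpot in requests
              if ttft <= ttft_bound and tpot <= tpot_bound)
    return met / len(requests)


# Made-up sample: four requests against a 100 ms TTFT / 20 ms TPOT baseline.
sample = [(100, 20), (300, 20), (100, 80), (500, 100)]
attainment_by_scale = {n: slo_attainment(sample, 100, 20, n) for n in (1, 3, 5)}
```

A request counts as attained only when both bounds hold, so a fast first token cannot compensate for slow decoding.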
read the original abstract

Efficiently harnessing GPU compute is critical to improving user experience and reducing operational costs in large language model (LLM) services. However, current inference engine schedulers overlook the attention backend's sensitivity to request-length heterogeneity within a batch. As state-of-the-art models now support context windows exceeding 128K tokens, this once-tolerable inefficiency has escalated into a primary system bottleneck, causing severe performance degradation through GPU underutilization and increased latency. We present CascadeInfer, a runtime system that dynamically reschedules requests across multiple instances serving the same LLM to mitigate per-instance length heterogeneity. CascadeInfer partitions these instances into length-specialized groups, each handling requests within a designated length range, naturally forming a pipeline as requests flow through them. CascadeInfer devises a dynamic programming algorithm to efficiently find the stage partition with the best QoE, employs runtime range refinement together with decentralized load (re)balance both across and within groups, achieving a balanced and efficient multi-instance service. Our evaluation shows that, under the same configuration, CascadeInfer reduces end-to-end latency by up to 67% and tail latency by up to 69%, while improving overall system throughput by up to 2.89 times compared to the state-of-the-art multi-instance scheduling systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes CascadeInfer, a runtime system for efficient LLM serving that partitions multiple instances into length-specialized groups, uses a dynamic programming algorithm to select optimal stage partitions for best QoE, and applies runtime range refinement plus decentralized load rebalancing to mitigate per-instance length heterogeneity. It reports concrete gains of up to 67% lower end-to-end latency, 69% lower tail latency, and 2.89x higher throughput versus state-of-the-art multi-instance schedulers under the same configuration, targeting long-context models (>128K tokens).

Significance. If the empirical gains prove robust, the work addresses a growing systems bottleneck in LLM inference by turning length heterogeneity from a liability into a structured pipeline, with potential for substantial improvements in GPU utilization, latency, and cost in production serving clusters. The dynamic-programming partitioner and decentralized balancer represent practical engineering contributions that could influence future schedulers.

major comments (1)
  1. The central latency and throughput claims rest on the premise that KV-cache migration during dynamic rescheduling incurs negligible overhead relative to the heterogeneity penalty eliminated. However, for contexts exceeding 128K tokens the transfer size is large; no section quantifies or bounds this cost (e.g., PCIe/RDMA latency) under the evaluated hardware, leaving open the possibility that migration overhead erodes or reverses the reported 67% and 69% reductions.
minor comments (1)
  1. The abstract states gains occur 'under the same configuration' without enumerating the exact baseline scheduler, model sizes, or arrival patterns; adding this detail would strengthen the comparison.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on KV-cache migration overhead. We address the concern directly below and have revised the manuscript to incorporate supporting analysis and measurements.

read point-by-point responses
  1. Referee: The central latency and throughput claims rest on the premise that KV-cache migration during dynamic rescheduling incurs negligible overhead relative to the heterogeneity penalty eliminated. However, for contexts exceeding 128K tokens the transfer size is large; no section quantifies or bounds this cost (e.g., PCIe/RDMA latency) under the evaluated hardware, leaving open the possibility that migration overhead erodes or reverses the reported 67% and 69% reductions.

    Authors: We agree that the original manuscript does not provide explicit quantification or bounds on KV-cache migration cost for contexts exceeding 128K tokens. In the revised manuscript we have added a new subsection (Section 5.4) together with Appendix D that reports both analytical bounds and empirical measurements of PCIe and RDMA transfer latency on the same A100-based testbed used for the main evaluation. The measurements show that a 128K-token KV-cache transfer (approximately 1.8–2.2 GB depending on model) completes in 35–55 ms over RDMA, which is amortized across the request lifetime and remains well below the per-request latency reductions obtained from length-specialized batching. We further demonstrate that the runtime range refinement and decentralized balancer trigger migrations only when the expected heterogeneity penalty exceeds this measured cost, thereby preserving the reported end-to-end and tail-latency gains. revision: yes
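
The break-even rule in this response can be sketched as simple arithmetic in Python. The effective link bandwidth of 50 GB/s and the zero setup cost are assumptions chosen for illustration. Neither figure appears in the paper or in the rebuttal, and the function names are hypothetical.

```python
def transfer_time_ms(kv_bytes, link_gbytes_per_s, setup_ms=0.0):
    """Estimated time to move a KV cache over a link, in milliseconds."""
    seconds = kv_bytes / (link_gbytes_per_s * 1e9)
    return setup_ms + seconds * 1e3


def should_migrate(expected_penalty_ms, kv_bytes, link_gbytes_per_s, setup_ms=0.0):
    """Migrate only when the expected heterogeneity penalty exceeds the transfer cost."""
    return expected_penalty_ms > transfer_time_ms(kv_bytes, link_gbytes_per_s, setup_ms)


# ASSUMED numbers: a 2.0 GB KV cache over a link sustaining 50 GB/s.
estimate_ms = transfer_time_ms(2.0e9, 50.0)     # about 40 ms
decision = should_migrate(100.0, 2.0e9, 50.0)   # a 100 ms penalty outweighs the transfer
```

Under these assumed numbers the estimate falls inside the 35–55 ms range quoted in the response, but real transfers also pay for connection setup and contention, which `setup_ms` only crudely stands in for.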

Circularity Check

0 steps flagged

No circularity: empirical system results rest on independent measurements

full rationale

The paper describes a runtime scheduling system whose central claims are measured end-to-end latency, tail latency, and throughput improvements obtained from an implemented prototype running on real hardware and workloads. The dynamic-programming partitioner and decentralized rebalancer are algorithmic procedures whose correctness and performance are validated externally by experiment rather than by any equation that reduces to its own fitted parameters or to a self-citation chain. No derivation step equates a claimed prediction to an input by construction; the reported gains are falsifiable observations outside the algorithm itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The system rests on the engineering assumption that request lengths are known at arrival and that cross-instance migration cost is low enough to be amortized; no new physical constants or mathematical axioms are introduced.

pith-pipeline@v0.9.0 · 5799 in / 1111 out tokens · 38130 ms · 2026-05-21T17:19:09.498014+00:00 · methodology



Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 6 internal anchors

  1. [1]

    ShareGPT Datasets

    ShareGPT Datasets. https://huggingface.co/datasets/RyokoAI/ShareGPT52K, 2023

  2. [2]

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, M...

  3. [3]

    Locality-aware fair scheduling in llm serving

    Shiyi Cao, Yichuan Wang, Ziming Mao, Pin-Lun Hsu, Liangsheng Yin, Tian Xia, Dacheng Li, Shu Liu, Yineng Zhang, Yang Zhou, Ying Sheng, Joseph Gonzalez, and Ion Stoica. Locality-aware fair scheduling in llm serving. arXiv preprint arXiv:2501.14312, 2025

  4. [4]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022

  5. [5]

    Flash-decoding for long-context inference

    Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. Flash-decoding for long-context inference. 2023

  6. [6]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, ...

  7. [7]

    Serverlessllm: Low-latency serverless inference for large language models

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. Serverlessllm: Low-latency serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 135–153. USENIX Association, 2024

  8. [8]

    Efficient llm scheduling by learning to rank

    Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, and Hao Zhang. Efficient llm scheduling by learning to rank. Advances in Neural Information Processing Systems, 37:59006–59029, 2024

  9. [9]

    Bid, ask and transaction prices in a specialist market with heterogeneously informed traders

    Lawrence R Glosten and Paul R Milgrom. Bid, ask and transaction prices in a specialist market with heterogeneously informed traders. Journal of financial economics, 14(1):71–100, 1985

  10. [10]

    Accelerating llm serving for multi-turn dialogues with efficient resource management

    Jinwoo Jeong and Jeongseob Ahn. Accelerating llm serving for multi-turn dialogues with efficient resource management. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 1–15, 2025

  11. [11]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  12. [12]

    xformers: A modular and hackable transformer modelling library

    Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, Luca Wehrstedt, Jeremy Reizenstein, and Grigory Sizov. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022

  13. [13]

    A proof for the queuing formula: L = λW

    John DC Little. A proof for the queuing formula: L = λW. Operations research, 9(3):383–387, 1961

  14. [14]

    Introducing llama 3.1: Our most capable models to date, 2024

    Meta. Introducing llama 3.1: Our most capable models to date, 2024

  15. [15]

    Llama 3.2: Revolutionizing edge ai and vision with open, customizable models, 2024

    Meta. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models, 2024

  16. [16]

    Cuda c++ programming guide, 2025

    NVIDIA. Cuda c++ programming guide, 2025

  17. [17]

    Cutlass, 2025

    NVIDIA. Cutlass, 2025. https://github.com/NVIDIA/cutlass

  18. [18]

    Fastertransformer, 2025

    NVIDIA. Fastertransformer, 2025. https://github.com/NVIDIA/FasterTransformer

  19. [19]

    Nvidia dynamo, 2025

    NVIDIA. Nvidia dynamo, 2025. https://github.com/ai-dynamo/dynamo

  20. [20]

    Chatgpt application, 2025

    OpenAI. Chatgpt application, 2025. https://chat.openai.com/

  21. [21]

    Efficient interactive llm serving with proxy model-based sequence length prediction

    Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T Kalbarczyk, Tamer Başar, and Ravishankar K Iyer. Efficient interactive llm serving with proxy model-based sequence length prediction. arXiv preprint arXiv:2404.08509, 2024

  22. [22]

    Aibrix: Towards scalable, cost-effective large language model inference infrastructure

    The AIBrix Team: Jiaxin Shan, Varun Gupta, Le Xu, Haiyang Shi, Jingyuan Zhang, Ning Wang, Linhui Xu, Rong Kang, Tongping Liu, Yifei Zhang, Yiqing Zhu, Shuowei Jin, Gangmuk Lim, Binbin Chen, Zuzhi Chen, Xiao Liu, Xin Chen, Kante Yin, Chak-Pong Chung, Chenyu Jiang, Yicheng Lu, Jianjun Chen, Caixue Lin, Wu Xiang, Rui Shi, and Liguang Xie. Aibrix: Towards sca...

  23. [23]

    Step3, 2025

    StepFun. Step3, 2025. https://github.com/stepfun-ai/Step3

  24. [24]

    Llumnix: Dynamic scheduling for large language model serving

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 173–191, Santa Clara, CA, July 2024. USENIX Association

  25. [25]

    QwQ: Reflect deeply on the boundaries of the unknown

    Qwen Team. QwQ: Reflect deeply on the boundaries of the unknown. https://qwenlm.github.io/blog/qwq-32b-preview/, 2024

  26. [26]

    Triton: an intermediate language and compiler for tiled neural network computations

    Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019

  27. [27]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  28. [28]

    Fast Distributed Inference Serving for Large Language Models

    Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920, 2023

  29. [29]

    Grok application, 2025

    X. Grok application, 2025. https://grok.com/

  30. [30]

    Qwen2.5 technical report, 2025

    Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li...

  31. [31]

    FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. Flashinfer: Efficient and customizable attention engine for llm inference serving. arXiv preprint arXiv:2501.01005, 2025

  32. [32]

    Orca: A distributed serving system for transformer-based generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022

  33. [33]

    Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, and Wangding Zeng. Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv preprint arXiv:2502.11089, 2025

  34. [34]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, S...

  35. [35]

    Sglang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural...

  36. [36]

    Response length perception and sequence scheduling: An llm-empowered llm inference pipeline

    Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, and Yang You. Response length perception and sequence scheduling: An llm-empowered llm inference pipeline. Advances in Neural Information Processing Systems, 36:65517–65530, 2023

  37. [37]

    Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, Santa Clara, CA, July 2024. USENIX Association

  38. [38]

    Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, and Xin Liu. Megascale-infer: Serving mixture-of-experts at scale with disaggregated expert parallelism. arXiv preprint...