arxiv: 2606.05933 · v1 · pith:D6S2FLXNnew · submitted 2026-06-04 · 💻 cs.DC

Beyond Greedy Chunking: SLO-Aware Sliding-Window Scheduling for LLM Inference

Pith reviewed 2026-06-27 23:47 UTC · model grok-4.3

classification 💻 cs.DC
keywords LLM inference scheduling · SLO-aware scheduling · sliding window chunking · batch latency prediction · dynamic programming request selection · QoS differentiation · online LLM services

The pith

SlidingServe raises LLM inference service capacity by up to 30 percent while cutting SLO violations by 16 to 53 percent under heavy load.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SlidingServe as an SLO-aware scheduler that replaces coarse output constraints with fine-grained control over request execution in LLM services. It relies on a lightweight batch latency predictor to forecast execution times, then applies SlidingChunker to merge current and next-iteration data for dynamic chunking decisions. Multi-Level Priority Sorter orders requests to trade off fairness against efficiency, and BatchConstructor solves a dynamic program to protect the most at-risk requests when SLOs are threatened. If the mechanisms work as described, online LLM systems can serve more concurrent users without breaching latency targets.
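The abstract does not say which levels Multi-Level Priority Sorter uses. As a minimal sketch of what a multi-level ordering can look like, the Python below sorts candidates by a hypothetical QoS tier, then by deadline, then by arrival time. The `Request` fields and the choice of levels are assumptions made for illustration, not the paper's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # All fields are hypothetical; the paper's request metadata is not given.
    request_id: str
    tier: int           # QoS tier, 0 = most important
    deadline_ms: float  # absolute time by which the SLO must be met
    arrival_ms: float   # absolute arrival time


def priority_key(req: Request) -> tuple:
    """Level 1: tier. Level 2: earliest deadline. Level 3: first come, first served."""
    return (req.tier, req.deadline_ms, req.arrival_ms)


def sort_candidates(requests: list) -> list:
    # Tuples compare element by element, so each level only breaks ties
    # left by the level before it.
    return sorted(requests, key=priority_key)


candidates = [
    Request("a", tier=1, deadline_ms=500.0, arrival_ms=0.0),
    Request("b", tier=0, deadline_ms=900.0, arrival_ms=10.0),
    Request("c", tier=0, deadline_ms=600.0, arrival_ms=20.0),
    Request("d", tier=1, deadline_ms=500.0, arrival_ms=5.0),
]
ordered_ids = [r.request_id for r in sort_candidates(candidates)]
```

Putting the tier first favors QoS differentiation; moving the arrival time earlier in the tuple would favor fairness instead, which is the kind of trade-off the sorter is described as balancing.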

Core claim

SlidingServe improves service capacity by up to 30% compared to advanced scheduling systems under various load conditions, and further reduces the rate of SLO violation by 16%-53% under heavy-load inference mode, by combining a lightweight batch latency predictor with SlidingChunker for dynamic chunking, Multi-Level Priority Sorter for request ordering, and BatchConstructor for dynamic-programming selection of safe request sets.

What carries the argument

SlidingChunker, which merges information from the current and next iteration to perform dynamic chunking guided by the batch latency predictor.
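One way to picture the difference between greedy chunking and a one-iteration look-ahead is sketched below. It assumes a linear latency model and a single per-iteration latency budget; `LinearLatencyModel`, `plan_window`, and the balancing rule are hypothetical and only illustrate the idea, since the page does not describe the paper's actual chunking rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearLatencyModel:
    """Hypothetical stand-in for the batch latency predictor."""
    base_ms: float
    per_prefill_token_ms: float
    per_decode_request_ms: float

    def predict(self, prefill_tokens: int, decode_requests: int) -> float:
        return (self.base_ms
                + self.per_prefill_token_ms * prefill_tokens
                + self.per_decode_request_ms * decode_requests)

    def max_prefill_tokens(self, budget_ms: float, decode_requests: int) -> int:
        """Largest prefill chunk whose predicted latency fits the budget."""
        slack = budget_ms - self.predict(0, decode_requests)
        return max(0, int(slack // self.per_prefill_token_ms))


def plan_window(model, remaining, decodes_now, decodes_next, budget_ms):
    """Return (chunk_now, chunk_next) for a prefill with `remaining` tokens."""
    cap_now = model.max_prefill_tokens(budget_ms, decodes_now)
    cap_next = model.max_prefill_tokens(budget_ms, decodes_next)
    greedy_now = min(remaining, cap_now)
    tail = remaining - greedy_now
    if tail == 0 or tail > cap_next:
        # Either the prefill finishes now, or it cannot finish inside the
        # two-iteration window: fall back to greedy chunks.
        return greedy_now, min(tail, cap_next)
    # The prefill finishes within the window either way, so even out the
    # two chunks to lower the peak iteration latency.
    balanced_now = min(cap_now, max(remaining - cap_next, (remaining + 1) // 2))
    return balanced_now, remaining - balanced_now


model = LinearLatencyModel(base_ms=4.0, per_prefill_token_ms=0.0625,
                           per_decode_request_ms=0.5)
plan = plan_window(model, remaining=800, decodes_now=4, decodes_next=4,
                   budget_ms=50.0)
```

With these made-up coefficients, greedy chunking would run 704 tokens and then a 96-token tail, with a predicted peak of 50 ms; the balanced plan runs 400 tokens twice, with a predicted peak of 31 ms, and finishes the prefill in the same number of iterations.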

If this is right

  • Dynamic chunking that looks one iteration ahead raises overall throughput while preserving per-request latency bounds.
  • Multi-level priority ordering can balance fairness and efficiency without separate fairness mechanisms.
  • Dynamic programming selection of requests inside a batch reduces SLO violations for the highest-priority requests when contention is high.
  • The system maintains strict QoS differentiation across requests even when resource contention is severe.
  • Service capacity scales with load without proportional growth in violation rates.
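The dynamic-programming selection mentioned above can be illustrated with a 0/1 knapsack, which is a stand-in chosen here because the page does not give the paper's objective or constraints. In this sketch each request has a hypothetical integer latency cost and an urgency value, and the goal is to maximize total urgency within a latency budget; all names and numbers are assumptions.

```python
def select_requests(costs_ms, urgencies, budget_ms):
    """0/1 knapsack: pick request indices maximizing urgency within budget_ms.

    costs_ms and budget_ms are non-negative integers (for example, predicted
    milliseconds rounded up); urgencies are non-negative numbers.
    """
    n = len(costs_ms)
    # best[i][b] = best urgency using the first i requests with budget b.
    best = [[0] * (budget_ms + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost, value = costs_ms[i - 1], urgencies[i - 1]
        for b in range(budget_ms + 1):
            best[i][b] = best[i - 1][b]
            if cost <= b and best[i - 1][b - cost] + value > best[i][b]:
                best[i][b] = best[i - 1][b - cost] + value

    # Walk the table backwards to recover which requests were chosen.
    chosen, b = [], budget_ms
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs_ms[i - 1]
    return sorted(chosen), best[n][budget_ms]


chosen, total_urgency = select_requests(
    costs_ms=[30, 20, 25, 10], urgencies=[9, 5, 8, 2], budget_ms=55)
```

Here requests 0 and 2 are selected: together they use the full 55 ms budget and carry more urgency than any other feasible combination. The table costs O(n × budget) time and memory, so a real scheduler would need a coarse budget granularity to keep this off the critical path.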

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same predictor-plus-sliding-window pattern could be tested on non-LLM workloads that also require strict per-request latency targets.
  • Replacing the current predictor with a more accurate but still lightweight model would be a direct next experiment to quantify how much of the gain depends on prediction quality.
  • Evaluating the approach on models larger than those used in the paper would test whether the chunking and selection overheads remain acceptable.

Load-bearing premise

The lightweight batch latency predictor supplies accurate enough estimates of batch execution times for dynamic chunking and priority decisions to succeed without large scheduling errors.
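For a sense of how light such a predictor can be, the sketch below fits a one-feature least-squares line from batch token count to latency. The linear form, the single feature, and the sample measurements are assumptions for illustration; the page does not describe the paper's predictor or its training data.

```python
def fit_line(token_counts, latencies_ms):
    """Ordinary least squares for latency = slope * tokens + intercept."""
    n = len(token_counts)
    if n < 2 or n != len(latencies_ms):
        raise ValueError("need at least two paired samples")
    mean_x = sum(token_counts) / n
    mean_y = sum(latencies_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in token_counts)
    if sxx == 0:
        raise ValueError("token counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(token_counts, latencies_ms))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_latency_ms(slope, intercept, tokens):
    return slope * tokens + intercept


# Made-up profiling samples: (tokens in batch, measured latency in ms).
tokens = [256, 512, 1024, 2048]
measured_ms = [20.0, 36.0, 68.0, 132.0]
slope, intercept = fit_line(tokens, measured_ms)
estimate_ms = predict_latency_ms(slope, intercept, 768)
```

A model of this shape costs a multiply and an add per prediction, which is why the premise matters: whether so simple a fit stays accurate across batch compositions is exactly what the paper would need to show.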

What would settle it

Measure the difference between predicted and actual batch execution times on a production LLM workload and check whether prediction errors above a few percent cause the claimed capacity gains or SLO-violation reductions to disappear.
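The measurement itself is simple to script. The sketch below computes the mean absolute percentage error (MAPE) and the worst-case relative error over paired predicted and measured batch latencies; the function name and the sample numbers are hypothetical.

```python
def prediction_errors(predicted_ms, actual_ms):
    """Return MAPE and maximum relative error, both in percent."""
    if not actual_ms or len(predicted_ms) != len(actual_ms):
        raise ValueError("need equally sized, non-empty samples")
    if any(a <= 0 for a in actual_ms):
        raise ValueError("measured latencies must be positive")
    relative = [abs(p - a) / a for p, a in zip(predicted_ms, actual_ms)]
    return {
        "mape_pct": 100.0 * sum(relative) / len(relative),
        "max_pct": 100.0 * max(relative),
    }


# Made-up per-batch samples in milliseconds.
report = prediction_errors(predicted_ms=[22.0, 36.0, 60.0],
                           actual_ms=[20.0, 40.0, 60.0])
```

Reporting the maximum alongside the mean matters here because a scheduler that packs batches up to the latency budget is hurt most by the occasional large under-estimate, which an average can hide.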

Figures

Figures reproduced from arXiv: 2606.05933 by Jialun Li, Weigang Wu, Xuan Mo, Yuansheng Chen, Yue Zhang.

Figure 1: Comparison of different scheduling strategies
Figure 3: SlidingServe architecture
Figure 4: Maximum goodput across models, hardware, and datasets
Figure 5: Latency and SLO violations of five databases under overload
Figure 6: Cumulative violations over time and overall SLO

Original abstract

With the rapid growth of interactive applications in large language model (LLM) online services, maintaining high system throughput while ensuring user-perceived latency has become a key issue in inference scheduling. Existing LLM service systems rely on coarse-grained output constraints, making it difficult to effectively handle resource contention among multiple requests, resulting in low resource utilization efficiency and limited support for fine-grained quality of service (QoS) differentiation. We present SlidingServe, a sliding-window-driven SLO-aware scheduling system for online LLM inference. SlidingServe designs a lightweight batch latency predictor to estimate the execution time of a batch. Based on this, SlidingServe uses SlidingChunker to combine information from the current iteration and the next iteration to achieve dynamic chunking and improve the overall system throughput while maintaining strict QoS guarantees. SlidingServe introduces Multi-Level Priority Sorter to sort candidate requests in order to balance fairness and efficiency. Additionally, when multiple requests within the same batch are at risk of violating their SLOs, SlidingServe introduces BatchConstructor, which uses dynamic programming to select the set of requests to execute in the current round, mitigating the SLO violation risk of critical requests. Our evaluation demonstrates that SlidingServe can improve service capacity by up to 30% compared to advanced scheduling systems under various load conditions, and further reduces the rate of SLO violation by 16%-53% under heavy-load inference mode.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents SlidingServe, a sliding-window-driven SLO-aware scheduler for online LLM inference. It introduces a lightweight batch latency predictor to estimate batch execution times, SlidingChunker for dynamic chunking that incorporates current and next-iteration information, a Multi-Level Priority Sorter to balance fairness and efficiency, and BatchConstructor (a DP-based selector) to mitigate SLO violations when multiple requests are at risk. The evaluation claims up to 30% higher service capacity versus advanced baselines under varied loads and 16-53% lower SLO violation rates under heavy load.

Significance. If the predictor accuracy and experimental results hold, the work could improve fine-grained QoS differentiation and resource utilization in multi-tenant LLM serving, moving beyond coarse output constraints.

major comments (2)
  1. [Abstract] Abstract: the quantitative claims (30% capacity gain, 16-53% SLO reduction) are presented without any description of experimental setup, workload traces, baselines, number of runs, or statistical significance, making it impossible to evaluate whether the reported improvements are load-bearing or reproducible.
  2. [System Design (lightweight batch latency predictor)] Batch latency predictor description (system overview): all downstream components (SlidingChunker dynamic boundaries, priority ordering, and BatchConstructor DP selection) depend on accurate execution-time estimates, yet the manuscript supplies no training data/procedure, no accuracy metrics (MAPE, max error, or per-batch error distribution), and no sensitivity analysis. If prediction error exceeds a few percent, the claimed gains become unreliable.
minor comments (1)
  1. [BatchConstructor] Notation for the DP objective and constraints in BatchConstructor should be defined explicitly with symbols before use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments correctly identify gaps in the current manuscript. We will revise the abstract and add a dedicated subsection on the batch latency predictor (including training details, accuracy metrics, and sensitivity analysis) to make the quantitative claims and system design fully evaluable.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the quantitative claims (30% capacity gain, 16-53% SLO reduction) are presented without any description of experimental setup, workload traces, baselines, number of runs, or statistical significance, making it impossible to evaluate whether the reported improvements are load-bearing or reproducible.

    Authors: We agree that the abstract as written does not provide sufficient context for the reported gains. In the revised version we will expand the abstract (within length limits) to briefly state the workload traces, primary baselines, evaluation methodology (including number of runs), and note that improvements are reported as averages with observed ranges across load conditions. Full experimental details remain in Section 5.

    Revision: yes

  2. Referee: [System Design (lightweight batch latency predictor)] Batch latency predictor description (system overview): all downstream components (SlidingChunker dynamic boundaries, priority ordering, and BatchConstructor DP selection) depend on accurate execution-time estimates, yet the manuscript supplies no training data/procedure, no accuracy metrics (MAPE, max error, or per-batch error distribution), and no sensitivity analysis. If prediction error exceeds a few percent, the claimed gains become unreliable.

    Authors: The observation is accurate: the current manuscript does not include training data, procedure, accuracy metrics, or sensitivity analysis for the predictor. We will add a new subsection (likely 3.2) that describes the training dataset and procedure, reports MAPE and per-batch error distributions on held-out batches, and presents a sensitivity study showing how SLO violation rates and throughput change under injected prediction errors of 5-20%. This will directly address the reliability concern for the downstream components.

    Revision: yes
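A sensitivity study of this kind can be prototyped before real traces are available. The sketch below assumes a made-up linear ground-truth latency and a predictor with a fixed relative bias, lets a scheduler pack the largest chunk the biased predictor believes fits the budget, and reports how far the true latency overshoots. Every function and constant here is hypothetical.

```python
def true_latency_ms(tokens: int) -> float:
    # Hypothetical ground truth: 4 ms fixed cost plus 1/16 ms per token.
    return 4.0 + 0.0625 * tokens


def largest_chunk(budget_ms: float, bias: float, step: int = 16) -> int:
    """Largest chunk (multiple of `step`) the biased predictor accepts.

    The predictor reports (1 + bias) * true latency, so a negative bias
    under-estimates and lets the scheduler over-pack the batch.
    """
    chunk = 0
    while (1.0 + bias) * true_latency_ms(chunk + step) <= budget_ms:
        chunk += step
    return chunk


def overshoot_pct(budget_ms: float, bias: float) -> float:
    """How far the true latency exceeds the budget, in percent of the budget."""
    chunk = largest_chunk(budget_ms, bias)
    excess = true_latency_ms(chunk) - budget_ms
    return max(0.0, 100.0 * excess / budget_ms)


# Injected under-estimation of 0%, 5%, 10%, and 20% against a 50 ms budget.
sweep = {bias: overshoot_pct(50.0, bias) for bias in (0.0, -0.05, -0.10, -0.20)}
```

In this toy setting the overshoot grows with the injected under-estimate, which is the qualitative shape the proposed study would need to quantify on real workloads; over-estimates would instead show up as lost throughput rather than SLO violations.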

Circularity Check

0 steps flagged

No circularity: claims rest on empirical evaluation of an independently described scheduler

The paper presents SlidingServe as a system design (SlidingChunker, Multi-Level Priority Sorter, BatchConstructor DP, and a lightweight batch latency predictor) whose performance claims are obtained from direct evaluation under load conditions. No equations, fitted parameters, or self-citations are shown to define the reported capacity gains or SLO-violation reductions; the abstract and description treat these as measured outcomes rather than quantities derived by construction from the inputs. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5786 in / 961 out tokens · 31393 ms · 2026-06-27T23:47:46.542938+00:00 · methodology

