pith. sign in

arxiv: 2605.16637 · v1 · pith:7Z62GUQFnew · submitted 2026-05-15 · 💻 cs.DC

HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling

Pith reviewed 2026-05-19 20:55 UTC · model grok-4.3

classification 💻 cs.DC
keywords agentic LLM workflowsworkflow schedulingheterogeneous GPU clustersprefill-decode disaggregationDAG schedulingSLO attainmentKV-cache managementend-to-end latency
0
0 comments X

The pith

HexAGenT schedules agentic LLM workflows on heterogeneous GPU clusters to cut the SLO scale needed for timely end-to-end completion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic LLM applications run user requests as multi-step workflows whose dependencies unfold at runtime, so the relevant latency is the full workflow completion time rather than any single model call. HexAGenT treats each incoming request as an incrementally revealed directed acyclic graph and keeps a live estimate of when that workflow would finish if run in isolation. It then ranks ready calls by their projected chance of pushing the whole workflow past its target horizon and picks prefill and decode placements together with queue priority while respecting KV-cache limits and cross-stage data movement costs on mixed A100, H100, and H200 hardware. The result is that the same level of workflow success can be reached with noticeably smaller service-level-objective multipliers than earlier schedulers.

Core claim

By modeling each request as an online-revealed DAG, maintaining a running estimate of the workflow's standalone completion horizon, prioritizing ready calls by projected risk of missing that horizon, and jointly selecting prefill placement, decode placement, and local queue priority while accounting for KV-cache capacity and cross-stage transfer latency, HexAGenT reduces the SLO scale required for timely workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment, with maximum reductions of 45.0% and 80.5%, respectively, across representative agentic workloads on heterogeneous A100/H100/H200 clusters.

What carries the argument

Workflow-aware scheduler that represents requests as online-revealed DAGs, estimates standalone completion horizons, and performs joint risk-based prioritization plus prefill/decode placement across heterogeneous GPUs.

If this is right

  • Production clusters can serve the same volume of agentic workflows while provisioning fewer GPUs or accepting tighter latency targets.
  • Mixed-generation GPU fleets become more practical because the scheduler explicitly balances prefill and decode work across device types.
  • Workflow-level success rates improve at the same resource budget because placement and priority decisions incorporate end-to-end horizon risk rather than per-call metrics.
  • Operators can lower over-provisioning margins while still guaranteeing high-percentile workflow completion times.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same risk-horizon approach could be adapted to other incrementally revealed workflow systems such as distributed data pipelines or multi-agent robotic control.
  • Adding a cost or energy term to the placement decision would let the scheduler optimize for both latency and operational expense on heterogeneous hardware.
  • Evaluating the method under sudden cluster reconfigurations or bursty arrival patterns would test whether the online DAG estimation remains stable outside the evaluated static settings.

Load-bearing premise

The representative agentic workloads and the specific heterogeneous A100/H100/H200 cluster configurations used for evaluation are sufficiently similar to real production deployments that the reported reductions in required SLO scale will generalize.

What would settle it

Running the scheduler on a different collection of agentic workloads or on a cluster whose GPU mix and network characteristics differ from the A100/H100/H200 testbed and measuring no reduction, or an increase, in the SLO scale needed to reach the same attainment levels would show the central claim does not hold.

read the original abstract

Agentic LLM applications increasingly execute user requests as multi-step workflows involving planning, tool use, branching, refinement, and synthesis. In such settings, users experience the end-to-end latency of an entire workflow, not the latency of any single LLM call. In this paper, we study how to schedule online agentic workflows across heterogeneous prefill-decode disaggregated LLM serving clusters to efficiently meet workflow-level latency objectives. The problem is challenging because workflow dependencies are revealed incrementally at runtime, calls have heterogeneous prompts, outputs, and KV-cache requirements, and the prefill and decode stages impose different compute, memory, and transfer constraints across heterogeneous GPUs. To solve this problem, we present HexAGenT, a workflow-aware scheduler for a heterogeneous prefill-decode inference service. HexAGenT models each request as an online-revealed DAG, maintains a running estimate of the workflow's standalone completion horizon, prioritizes ready calls by projected risk of missing that horizon, and jointly selects prefill placement, decode placement, and local queue priority while accounting for KV-cache capacity and cross-stage transfer latency. Across representative agentic workloads and heterogeneous A100/H100/H200 clusters, HexAGenT reduces the SLO scale required for timely workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment, with maximum reductions of 45.0% and 80.5%, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes HexAGenT, a workflow- and heterogeneity-aware scheduler for online agentic LLM serving on prefill-decode disaggregated clusters with A100/H100/H200 GPUs. Workflows are modeled as incrementally revealed DAGs; the scheduler maintains a running estimate of each workflow's standalone completion horizon, prioritizes ready calls by projected risk of missing that horizon, and jointly decides prefill placement, decode placement, and local queue priority while respecting KV-cache capacity and cross-stage transfer costs. Evaluation across representative agentic workloads reports average reductions in required SLO scale of 20.1% at 95% attainment and 33.0% at 99% attainment (maxima 45.0% and 80.5%).

Significance. If the reported SLO reductions prove robust, the work would offer a practical advance for serving multi-step agentic applications whose end-to-end latency, rather than per-call latency, determines user experience. The combination of online DAG awareness with explicit modeling of prefill/decode asymmetry and GPU heterogeneity addresses a timely systems problem that existing single-request or homogeneous schedulers do not handle.

major comments (2)
  1. [§6] §6 (Evaluation): The central quantitative claim—average 20.1% and 33.0% reductions in required SLO scale—rests on the representativeness of the chosen agentic workloads and the A100/H100/H200 cluster configurations. The manuscript does not describe how these workloads were selected or validated against production traces, nor does it report sensitivity to branching factor, tool-call latency variance, or cross-GPU bandwidth. Without such evidence the reported percentages cannot be assessed for generalization.
  2. [§6.2] §6.2 (Baselines and methodology): The abstract and evaluation summary supply no information on the concrete baselines, statistical aggregation method, or measurement protocol used to obtain the 20.1%/33.0% figures. This information is load-bearing for the empirical contribution and must be supplied with sufficient detail for independent verification.
minor comments (2)
  1. [§3] §3 (System model): The notation for online DAG revelation and the precise definition of the “standalone completion horizon” could be clarified with a small example or pseudocode to aid readers unfamiliar with agentic workflows.
  2. [Figure 4] Figure 4 and Table 2: Axis labels and legend entries are too small for comfortable reading; consider increasing font size or splitting the figure.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below and will revise the evaluation section to improve clarity and provide additional supporting details where feasible.

read point-by-point responses
  1. Referee: [§6] §6 (Evaluation): The central quantitative claim—average 20.1% and 33.0% reductions in required SLO scale—rests on the representativeness of the chosen agentic workloads and the A100/H100/H200 cluster configurations. The manuscript does not describe how these workloads were selected or validated against production traces, nor does it report sensitivity to branching factor, tool-call latency variance, or cross-GPU bandwidth. Without such evidence the reported percentages cannot be assessed for generalization.

    Authors: We agree that more explicit description of workload construction and sensitivity analysis would strengthen the paper. In the revised manuscript we will add to §6 a description of how the workloads were assembled from representative multi-step agentic patterns (planning, tool invocation, branching, and synthesis) drawn from open frameworks, together with new sensitivity results for branching factor and tool-call latency variance. Our cluster model already incorporates realistic cross-GPU transfer costs for the A100/H100/H200 mix; we will report additional bandwidth sweeps. Direct validation against proprietary production traces is not possible for us, but the workloads are constructed to reproduce the key online-DAG and heterogeneity properties observed in public agentic benchmarks. revision: yes

  2. Referee: [§6.2] §6.2 (Baselines and methodology): The abstract and evaluation summary supply no information on the concrete baselines, statistical aggregation method, or measurement protocol used to obtain the 20.1%/33.0% figures. This information is load-bearing for the empirical contribution and must be supplied with sufficient detail for independent verification.

    Authors: We acknowledge the omission in the abstract and high-level summary. Section 6.2 of the full manuscript already specifies the baselines (FCFS, SJF, and heterogeneity-unaware disaggregated schedulers), the aggregation method (mean and tail statistics over 10 independent runs with different random seeds), and the measurement protocol (SLO scale defined as the multiplicative factor on the workflow’s standalone completion horizon required to reach the target attainment). To make this information immediately accessible, we will insert a short summary paragraph and table at the start of the evaluation section in the revised version. revision: yes

standing simulated objections not resolved
  • Direct validation of workloads against proprietary production traces from specific industry deployments, which are not publicly available.

Circularity Check

0 steps flagged

No circularity: performance claims are empirical outcomes of scheduler evaluation

full rationale

The paper's central claims consist of measured reductions in required SLO scale (20.1% at 95% attainment, 33.0% at 99%) obtained by running HexAGenT on representative agentic workloads and heterogeneous A100/H100/H200 clusters. These are presented as direct experimental results rather than predictions derived from fitted parameters, self-referential definitions, or load-bearing self-citations. The scheduler description (DAG modeling, risk prioritization, placement selection) is algorithmic and evaluated externally; no equation or theorem reduces by construction to its own inputs. The derivation chain is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumptions that workflows are well captured by online-revealed DAGs and that the tested workloads and clusters are representative; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Agentic LLM requests can be modeled as online-revealed DAGs whose dependencies become known incrementally at runtime.
    This modeling choice underpins the prioritization and placement logic.
  • domain assumption Prefill and decode stages impose distinct compute, memory, and cross-stage transfer constraints on heterogeneous GPUs.
    Required for the joint placement decisions described.

pith-pipeline@v0.9.0 · 5817 in / 1436 out tokens · 76236 ms · 2026-05-19T20:55:31.072816+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    HEXAGENT models each request as an online-revealed DAG, maintains a running estimate of the workflow's standalone completion horizon, prioritizes ready calls by projected risk of missing that horizon, and jointly selects prefill placement, decode placement, and local queue priority while accounting for KV-cache capacity and cross-stage transfer latency.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    Across representative agentic workloads and heterogeneous A100/H100/H200 clusters, HEXAGENT reduces the SLO scale required for timely workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 3 internal anchors

  1. [1]

    Gulavani, Alexey Tumanov, and Ramachandran Ramjee

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency trade-off in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-939133-...

  2. [2]

    Broder, Anna R

    Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal. Balanced allocations. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, pages 593–602, New York, NY, USA, 1994. Association for Computing Machinery . doi: 10.1145/195058.195412

  3. [3]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38. AAAI Press, 2024

  4. [4]

    Accelerating retrieval-augmented generation,

    Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, and Ion Stoica. MoE-Lightning: High-throughput MoE inference on memory-constrained GPUs. In Proceedings of the 16 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 715–7...

  5. [5]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality . LMSYS Blog, March 2023. URLhttps://lmsys.org/blog/2023-03-30-vicuna/

  6. [6]

    Davis, Ken W

    Robert I. Davis, Ken W. Tindell, and Alan Burns. Scheduling slack time in fixed priority pre-emptive systems. In Proceedings of the 14th IEEE Real-Time Systems Symposium, pages 222–231, Washington, DC, USA, 1993. IEEE Computer Society . doi: 10.1109/REAL.1993.393505

  7. [7]

    ServerlessLLM: Low-latency serverless inference for large language models

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. ServerlessLLM: Low-latency serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 135–153, Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-939133-40-3. URLhttps:/...

  8. [8]

    Efficient LLM scheduling by learning to rank, 2024

    Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, and Hao Zhang. Efficient LLM scheduling by learning to rank, 2024

  9. [9]

    Efficient pre-training of llms via topology-aware communication alignment on more than 9600 gpus

    Guoliang He, Youhe Jiang, Wencong Xiao, Jiang Kaihua, Shuguang Wang, Jun Wang, Du Zixian, Zhuo Jiang, Xinlei Zhang, Binhang Yuan, et al. Efficient pre-training of llms via topology-aware communication alignment on more than 9600 gpus. Advances in Neural Information Processing Systems, 38:147100–147126, 2026

  10. [10]

    Efficient multi-round llm inference over disaggregated serving

    Wenhao He, Youhe Jiang, Penghao Zhao, Quanqing Xu, Eiko Yoneki, Bin Cui, and Fangcheng Fu. Efficient multi-round llm inference over disaggregated serving. arXiv preprint arXiv:2602.14516, 2026

  11. [11]

    Osdp: Optimal sharded data parallel for distributed deep learning

    Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, and Bin Cui. Osdp: Optimal sharded data parallel for distributed deep learning. arXiv preprint arXiv:2209.13258, 2022

  12. [12]

    Hexgen: Generative inference of large language model over heterogeneous environment

    Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, and Binhang Yuan. Hexgen: Generative inference of large language model over heterogeneous environment. arXiv preprint arXiv:2311.11514, 2023

  13. [13]

    Demystifying cost-efficiency in llm serving over heterogeneous gpus

    Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, and Eiko Yoneki. Demystifying cost-efficiency in llm serving over heterogeneous gpus. arXiv preprint arXiv:2502.00722, 2025

  14. [14]

    Thunderserve: High-performance and cost-efficient llm serving in cloud environments

    Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, and Eiko Yoneki. Thunderserve: High-performance and cost-efficient llm serving in cloud environments. Proceedings of Machine Learning and Systems, 7, 2025

  15. [15]

    Cascadia: An efficient cascade serving system for large language models

    Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Jintao Zhang, Nicholas D Lane, and Binhang Yuan. Cascadia: An efficient cascade serving system for large language models. arXiv preprint arXiv:2506.04203, 2025

  16. [16]

    Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment

    Youhe Jiang, Ran Yan, and Binhang Yuan. Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment. arXiv preprint arXiv:2502.07903, 2025

  17. [17]

    OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration

    Youhe Jiang, Fangcheng Fu, Taiyi Wang, Guoliang He, and Eiko Yoneki. Oserve: Accelerating llm serving via spatial-temporal workload orchestration. arXiv preprint arXiv:2602.12151, 2026

  18. [18]

    Boute: Cost-efficient llm serving with heterogeneous llms and gpus via multi-objective bayesian optimization

    Youhe Jiang, Fangcheng Fu, and Eiko Yoneki. Boute: Cost-efficient llm serving with heterogeneous llms and gpus via multi-objective bayesian optimization. arXiv preprint arXiv:2602.10729, 2026

  19. [19]

    Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

    Youhe Jiang, Ran Yan, You Peng, Wenshuang Li, Taiyi Wang, Fangcheng Fu, and Binhang Yuan. Autopoiesis: A self-evolving system paradigm for llm serving under runtime dynamics. arXiv preprint arXiv:2604.07144, 2026

  20. [20]

    P/D-Serve: Serving disaggregated large language model at scale, 2024

    Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo,...

  21. [21]

    Efficient memory management for large language model serving with pagedattention,

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pages 611–626, New York, NY, USA, 2023. Association for Computing M...

  22. [22]

    Gonzalez, and Ion Stoica

    Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Alvin Cheung, Joseph E. Gonzalez, and Ion Stoica. Continuum: Efficient and robust multi-turn LLM agent scheduling with KV cache time-to-live, 2025

  23. [23]

    Gonzalez, and Ion Stoica

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 663–679, Boston, MA, July 2023....

  24. [24]

    Parrot: Efficient serving of LLM-based applications with semantic variable

    Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. Parrot: Efficient serving of LLM-based applications with semantic variable. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 929–945, Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-939133-40-3. URLhttps://www.us...

  25. [25]

    Hermes: Efficient serving of LLM applications with probabilistic demand modeling

    Yifei Liu, Zuo Gan, Zhenghao Gan, Weiye Wang, Chen Chen, Yizhou Shan, Xusheng Chen, Zhenhua Han, Yifei Zhu, Shixuan Sun, and Minyi Guo. Hermes: Efficient serving of LLM applications with probabilistic demand modeling. ACM Transactions on Architecture and Code Optimization, 2026. doi: 10.1145/3803390

  26. [26]

    Gonzalez, and Ion Stoica

    Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, and Ion Stoica. Autellix: An efficient serving engine for LLM agents as general programs, 2025

  27. [27]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Sy...

  28. [28]

    Skyserve: Serving ai models across regions and clouds with spot instances

    Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, and Ion Stoica. Skyserve: Serving ai models across regions and clouds with spot instances. In Proceedings of the Twentieth European Conference on Computer Systems, pages 159–175, 2025

  29. [29]

    The state of AI in 2025

    McKinsey & Company. The state of AI in 2025. McKinsey Global Survey , 2025. URL https://www.mckinsey.com/c apabilities/quantumblack/our-insights/the-state-of-ai

  30. [30]

    Galvatron: Efficient transformer train- ing over multiple gpus using automatic parallelism.arXiv preprint arXiv:2211.13878, 2022

    Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. Galvatron: Efficient transformer training over multiple gpus using automatic parallelism. arXiv preprint arXiv:2211.13878, 2022

  31. [31]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

  32. [32]

    Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E

    Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025. URL https://icml.cc/virtual/2025/poste r/46593

  33. [33]

    Kalbarczyk, and Ravishankar K

    Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer. Queue management for SLO-oriented large language model serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing, pages 18–35, New York, NY, USA, 2024. Association for Computing Machinery . doi: 10.1145/...

  34. [34]

    Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql

    You Peng, Youhe Jiang, Wenqi Jiang, Chen Wang, and Binhang Yuan. Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql. arXiv preprint arXiv:2505.05286, 2025

  35. [35]

    Kalbarczyk, Tamer Basar, and Ravishankar K

    Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Basar, and Ravishankar K. Iyer. Efficient interactive llm serving with proxy model-based sequence length prediction. In Proceedings of the 5th International Workshop on Cloud Intelligence / AIOps at ASPLOS 2024 (AIOps 2024), pages 1–...

  36. [36]

    Kalbarczyk, Tamer Basa ¸r, and Ravishankar K

    Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Basa ¸r, and Ravishankar K. Iyer. Efficient interactive LLM serving with proxy model-based sequence length prediction. In Proceedings of the 5th International Workshop on Cloud Intelligence / AIOps at ASPLOS 2024, New York, NY, USA, ...

  37. [37]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2023

  38. [38]

    A proof of the optimality of the shortest remaining processing time discipline

    Linus Schrage. A proof of the optimality of the shortest remaining processing time discipline. Operations Research, 16(3):687–690, 1968

  39. [39]

    Don’t stop me now: Embedding based scheduling for LLMs, 2024

    Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, and Michael Mitzenmacher. Don’t stop me now: Embedding based scheduling for LLMs, 2024

  40. [40]

    Stankovic, Marco Spuri, Krithi Ramamritham, and Giorgio C

    John A. Stankovic, Marco Spuri, Krithi Ramamritham, and Giorgio C. Buttazzo. Scheduling in Real-Time Systems. Springer, Boston, MA, USA, 1998

  41. [41]

    Llumnix: Dynamic scheduling for large language model serving

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 173–191, Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-939133- 40-3. URLhttps://www.usenix.org/conference/os...

  42. [42]

    Parallax: Efficient llm inference service over decentralized environment

    Chris Tong, Youhe Jiang, Gufeng Chen, Tianyi Zhao, Sibian Lu, Wenjie Qu, Eric Yang, Lynn Ai, and Binhang Yuan. Parallax: Efficient llm inference service over decentralized environment. arXiv preprint arXiv:2509.26182, 2025

  43. [43]

    Improving automatic parallel training via balanced memory workload optimization.IEEE Transactions on Knowledge and Data Engineering, 36(8):3906–3920, 2024

    Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, and Bin Cui. Improving automatic parallel training via balanced memory workload optimization.IEEE Transactions on Knowledge and Data Engineering, 36(8):3906–3920, 2024

  44. [44]

    Roofline: An insightful visual performance model for multicore architectures,

    Samuel Williams, Andrew Waterman, and David A. Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009. doi: 10.1145/1498765.1498785. URL https://doi.org/10.1145/1498765.1498785

  45. [45]

    Fast distributed inference serving for large language models, 2023

    Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models, 2023

  46. [46]

    Fsa: An alternative efficient implementation of native sparse attention kernel

    Ran Yan, Youhe Jiang, Zhuoming Chen, Haohui Mai, Beidi Chen, and Binhang Yuan. Fsa: An alternative efficient implementation of native sparse attention kernel. arXiv preprint arXiv:2508.18224, 2025

  47. [47]

    Areal-hex: Accommodating asynchronous rl training over heterogeneous gpus.arXiv preprint arXiv:2511.00796, 2025

    Ran Yan, Youhe Jiang, Tianyuan Wu, Jiaxuan Gao, Zhiyu Mei, Wei Fu, Haohui Mai, Wei Wang, Yi Wu, and Binhang Yuan. Areal-hex: Accommodating asynchronous rl training over heterogeneous gpus. arXiv preprint arXiv:2511.00796, 2025

  48. [48]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380. Association for Computational Linguistics, 2018. doi: 10...

  49. [49]

    Griffiths, Yuan Cao, and Karthik Narasimhan

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2023

  50. [50]

    Narasimhan, and Yuan Cao

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations,

  51. [51]

    URLhttps://openreview.net/forum?id=WE_vluYUL-X

  52. [52]

    Orca: A distributed serving system for transformer-based generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, Carlsbad, CA, July 2022. USENIX Association. ISBN 978-1-939133-28-1. URLhttps://www.usenix.org/conferenc...

  53. [53]

    LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind

    Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, and Kai Chen. Efficient mixed- precision large language model inference with turbomind. arXiv preprint arXiv:2508.15601, 2025

  54. [54]

    Blendserve: Optimizing offline inference for auto-regressive large models with resource-aware batching, 2024

    Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, and Ion Stoica. Blendserve: Optimizing offline inference for auto-regressive large models with resource-aware batching, 2024

  55. [55]

    Gonzalez, Clark W

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. In Advances in Neural Information Processing Systems, volume 37, Red Hook, NY, USA, 2024. 19 Curran Associates,...

  56. [56]

    Distserve: Dis- aggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Dis- aggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-93...

  57. [57]

    Language agent tree search unifies reasoning, acting, and planning in language models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 62138–62160. PMLR, 2024. URLhttps://proceedings.mlr.pre...