HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling

Binhang Yuan; Chen Wang; Jiawei Jiang; Ke Zhou; Wenshuang Li; Xu Xu; Youhe Jiang; You Peng

arxiv: 2605.16637 · v1 · pith:7Z62GUQFnew · submitted 2026-05-15 · 💻 cs.DC

You Peng, Youhe Jiang, Wenshuang Li, Xu Xu, Ke Zhou, Jiawei Jiang, Chen Wang, Binhang Yuan

Pith reviewed 2026-05-19 20:55 UTC · model grok-4.3

classification 💻 cs.DC

keywords agentic LLM workflows · workflow scheduling · heterogeneous GPU clusters · prefill-decode disaggregation · DAG scheduling · SLO attainment · KV-cache management · end-to-end latency

The pith

HexAGenT schedules agentic LLM workflows on heterogeneous GPU clusters to cut the SLO scale needed for timely end-to-end completion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic LLM applications run user requests as multi-step workflows whose dependencies unfold at runtime, so the relevant latency is the full workflow completion time rather than any single model call. HexAGenT treats each incoming request as an incrementally revealed directed acyclic graph and keeps a live estimate of when that workflow would finish if run in isolation. It ranks ready calls by their projected risk of pushing the whole workflow past its target horizon. It then picks prefill placement, decode placement, and queue priority together, while respecting KV-cache limits and cross-stage data movement costs on mixed A100, H100, and H200 hardware. As a result, the same level of workflow success is reached with smaller service-level-objective multipliers than earlier schedulers need: 20.1% smaller on average at 95% attainment and 33.0% smaller at 99% attainment.

Core claim

By modeling each request as an online-revealed DAG, maintaining a running estimate of the workflow's standalone completion horizon, prioritizing ready calls by projected risk of missing that horizon, and jointly selecting prefill placement, decode placement, and local queue priority while accounting for KV-cache capacity and cross-stage transfer latency, HexAGenT reduces the SLO scale required for timely workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment, with maximum reductions of 45.0% and 80.5%, respectively, across representative agentic workloads on heterogeneous A100/H100/H200 clusters.

What carries the argument

Workflow-aware scheduler that represents requests as online-revealed DAGs, estimates standalone completion horizons, and performs joint risk-based prioritization plus prefill/decode placement across heterogeneous GPUs.
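The mechanism above can be illustrated with a minimal sketch. It assumes, hypothetically, that the standalone completion horizon is the critical-path length over the calls revealed so far, and that risk is the projected lateness against `arrival + slo_scale * horizon`. The class `Workflow`, the function `rank_ready`, and all per-call time estimates are illustrative names and numbers, not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """One request, modeled as a DAG whose calls are revealed while it runs."""
    wid: str
    arrival: float                                # arrival time in seconds
    est: dict = field(default_factory=dict)       # call -> estimated standalone service time
    parents: dict = field(default_factory=dict)   # call -> set of calls it depends on
    done: set = field(default_factory=set)        # calls that have finished

    def reveal(self, call, est_time, parents=()):
        """Record a call that the workflow has just exposed."""
        self.est[call] = est_time
        self.parents[call] = set(parents)

    def path_to(self, call):
        """Longest estimated path from any source up to and including `call`."""
        return self.est[call] + max(
            (self.path_to(p) for p in self.parents[call]), default=0.0)

    def horizon(self):
        """Assumed horizon: critical path of the DAG revealed so far."""
        return max((self.path_to(c) for c in self.est), default=0.0)

    def remaining(self, call):
        """Longest estimated path from `call` (inclusive) to any revealed sink."""
        children = [c for c in self.est if call in self.parents[c]]
        return self.est[call] + max(
            (self.remaining(c) for c in children), default=0.0)

    def ready(self):
        """Calls whose parents have all finished and that have not run yet."""
        return [c for c in self.est
                if c not in self.done and self.parents[c] <= self.done]


def rank_ready(workflows, now, slo_scale):
    """Order ready calls by projected lateness, most at-risk first."""
    ranked = []
    for wf in workflows:
        deadline = wf.arrival + slo_scale * wf.horizon()
        for call in wf.ready():
            lateness = now + wf.remaining(call) - deadline
            ranked.append((lateness, wf.wid, call))
    return sorted(ranked, reverse=True)


wf = Workflow("w1", arrival=0.0)
wf.reveal("plan", 1.0)
wf.done.add("plan")                                   # planning has finished
wf.reveal("tool_a", 2.0, parents=["plan"])            # revealed by the plan
wf.reveal("tool_b", 0.5, parents=["plan"])
wf.reveal("synth", 1.0, parents=["tool_a", "tool_b"])
print(wf.horizon())                                   # 4.0 (plan -> tool_a -> synth)
print([c for _, _, c in rank_ready([wf], now=1.2, slo_scale=1.5)])
```

In this toy run `tool_a` ranks ahead of `tool_b` because it sits on the longer remaining path. The recursion is unmemoized for brevity, which is fine for small graphs only.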

If this is right

Production clusters can serve the same volume of agentic workflows while provisioning fewer GPUs or accepting tighter latency targets.
Mixed-generation GPU fleets become more practical because the scheduler explicitly balances prefill and decode work across device types.
Workflow-level success rates improve at the same resource budget because placement and priority decisions incorporate end-to-end horizon risk rather than per-call metrics.
Operators can lower over-provisioning margins while still guaranteeing high-percentile workflow completion times.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same risk-horizon approach could be adapted to other incrementally revealed workflow systems such as distributed data pipelines or multi-agent robotic control.
Adding a cost or energy term to the placement decision would let the scheduler optimize for both latency and operational expense on heterogeneous hardware.
Evaluating the method under sudden cluster reconfigurations or bursty arrival patterns would test whether the online DAG estimation remains stable outside the evaluated static settings.

Load-bearing premise

The representative agentic workloads and the specific heterogeneous A100/H100/H200 cluster configurations used for evaluation are sufficiently similar to real production deployments that the reported reductions in required SLO scale will generalize.

What would settle it

Running the scheduler on a different collection of agentic workloads or on a cluster whose GPU mix and network characteristics differ from the A100/H100/H200 testbed and measuring no reduction, or an increase, in the SLO scale needed to reach the same attainment levels would show the central claim does not hold.

Original abstract

Agentic LLM applications increasingly execute user requests as multi-step workflows involving planning, tool use, branching, refinement, and synthesis. In such settings, users experience the end-to-end latency of an entire workflow, not the latency of any single LLM call. In this paper, we study how to schedule online agentic workflows across heterogeneous prefill-decode disaggregated LLM serving clusters to efficiently meet workflow-level latency objectives. The problem is challenging because workflow dependencies are revealed incrementally at runtime, calls have heterogeneous prompts, outputs, and KV-cache requirements, and the prefill and decode stages impose different compute, memory, and transfer constraints across heterogeneous GPUs. To solve this problem, we present HexAGenT, a workflow-aware scheduler for a heterogeneous prefill-decode inference service. HexAGenT models each request as an online-revealed DAG, maintains a running estimate of the workflow's standalone completion horizon, prioritizes ready calls by projected risk of missing that horizon, and jointly selects prefill placement, decode placement, and local queue priority while accounting for KV-cache capacity and cross-stage transfer latency. Across representative agentic workloads and heterogeneous A100/H100/H200 clusters, HexAGenT reduces the SLO scale required for timely workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment, with maximum reductions of 45.0% and 80.5%, respectively.
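The joint placement step named in the abstract can be sketched as a search over (prefill, decode) pairs. This is a simplified illustration under stated assumptions, not the paper's algorithm: the cost model (throughput in tokens per second, a per-instance backlog, a per-link bandwidth), the names `Instance` and `place`, and every number below are hypothetical, and local queue priority is not modeled.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Instance:
    name: str
    tokens_per_s: float    # prefill or decode throughput, depending on the pool
    free_kv_bytes: float   # KV-cache space currently free
    backlog_s: float       # seconds of queued work ahead of a new call


def place(prompt_tokens, output_tokens, kv_bytes_per_token,
          prefill_pool, decode_pool, link_bytes_per_s):
    """Return (est_finish_s, prefill_name, decode_name), or None if nothing fits."""
    prompt_kv = prompt_tokens * kv_bytes_per_token
    total_kv = (prompt_tokens + output_tokens) * kv_bytes_per_token
    best = None
    for p, d in product(prefill_pool, decode_pool):
        # KV-cache capacity: prefill holds the prompt's cache, decode holds all of it.
        if p.free_kv_bytes < prompt_kv or d.free_kv_bytes < total_kv:
            continue
        prefill_done = p.backlog_s + prompt_tokens / p.tokens_per_s
        # Cross-stage transfer: ship the prompt's KV cache to the decode instance.
        transfer = prompt_kv / link_bytes_per_s[(p.name, d.name)]
        decode_start = max(prefill_done + transfer, d.backlog_s)
        finish = decode_start + output_tokens / d.tokens_per_s
        if best is None or finish < best[0]:
            best = (finish, p.name, d.name)
    return best


# Hypothetical pools and link speeds, for illustration only.
prefill_pool = [Instance("H100-p", 10_000, 1e9, 0.0),
                Instance("A100-p", 5_000, 1e9, 0.0)]
decode_pool = [Instance("H200-d", 100, 4e9, 0.1),
               Instance("A100-d", 50, 1e8, 0.0)]   # too little free KV cache
links = {("H100-p", "H200-d"): 2e10, ("A100-p", "H200-d"): 1e10,
         ("H100-p", "A100-d"): 1e10, ("A100-p", "A100-d"): 1e10}
print(place(2000, 200, 1e5, prefill_pool, decode_pool, links))
```

Scoring the pair rather than each stage separately is the point of the sketch: a fast prefill instance is only useful if its link to a decode instance with enough KV-cache room is also fast.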

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

HexAGenT gives a concrete scheduler for workflow-level SLOs in agentic LLM serving on mixed prefill-decode clusters, but the reported gains rest on unexamined assumptions about how representative the test workloads really are.

HexAGenT models agentic workflows as DAGs revealed online, tracks a projected completion horizon for each workflow, and prioritizes ready calls by their risk of missing that horizon while choosing prefill and decode placements across heterogeneous GPUs and watching KV-cache and transfer costs. The main contribution is the combination of online DAG handling, risk-based prioritization, and joint stage placement decisions tailored to disaggregated serving. Earlier schedulers mostly treated requests in isolation or assumed static graphs and uniform hardware, so this synthesis targets a real shift as multi-step agent applications grow.

The paper lays out the constraints clearly: varying prompt and output sizes, different compute and memory profiles for prefill versus decode, and cross-GPU transfer penalties on A100/H100/H200 mixes. The headline result is a reduction in the SLO scale needed for 95% and 99% workflow attainment, averaging 20% and 33% respectively. That is a usable number if it holds.

The soft spot is generalization. The workloads are labeled representative and the clusters heterogeneous, yet the abstract supplies no information on how they were chosen, whether they match production traces, or how sensitive the gains are to branching factor, tool-call variance, or interconnect bandwidth. Without those checks, the percentage improvements could shrink or disappear under different conditions. The baselines and measurement details also need scrutiny to confirm they are fair.

This work is aimed at systems researchers and operators who build or tune LLM serving platforms for complex agentic traffic. Anyone thinking about disaggregated inference or workflow-aware scheduling would get concrete design ideas from it. I would send it to peer review. The problem is timely, the approach is a reasonable extension of prior ideas, and the empirical claims are specific enough that referees can test them directly.

Referee Report

2 major / 2 minor

Summary. The paper proposes HexAGenT, a workflow- and heterogeneity-aware scheduler for online agentic LLM serving on prefill-decode disaggregated clusters with A100/H100/H200 GPUs. Workflows are modeled as incrementally revealed DAGs; the scheduler maintains a running estimate of each workflow's standalone completion horizon, prioritizes ready calls by projected risk of missing that horizon, and jointly decides prefill placement, decode placement, and local queue priority while respecting KV-cache capacity and cross-stage transfer costs. Evaluation across representative agentic workloads reports average reductions in required SLO scale of 20.1% at 95% attainment and 33.0% at 99% attainment (maxima 45.0% and 80.5%).

Significance. If the reported SLO reductions prove robust, the work would offer a practical advance for serving multi-step agentic applications whose end-to-end latency, rather than per-call latency, determines user experience. The combination of online DAG awareness with explicit modeling of prefill/decode asymmetry and GPU heterogeneity addresses a timely systems problem that existing single-request or homogeneous schedulers do not handle.

major comments (2)

§6 (Evaluation): The central quantitative claim (average 20.1% and 33.0% reductions in required SLO scale) rests on the representativeness of the chosen agentic workloads and the A100/H100/H200 cluster configurations. The manuscript does not describe how these workloads were selected or validated against production traces, nor does it report sensitivity to branching factor, tool-call latency variance, or cross-GPU bandwidth. Without such evidence the reported percentages cannot be assessed for generalization.
§6.2 (Baselines and methodology): The abstract and evaluation summary supply no information on the concrete baselines, statistical aggregation method, or measurement protocol used to obtain the 20.1%/33.0% figures. This information is load-bearing for the empirical contribution and must be supplied with sufficient detail for independent verification.

minor comments (2)

§3 (System model): The notation for online DAG revelation and the precise definition of the “standalone completion horizon” could be clarified with a small example or pseudocode to aid readers unfamiliar with agentic workflows.
Figure 4 and Table 2: Axis labels and legend entries are too small for comfortable reading; consider increasing font size or splitting the figure.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below and will revise the evaluation section to improve clarity and provide additional supporting details where feasible.

Point-by-point responses

Referee: §6 (Evaluation): The central quantitative claim (average 20.1% and 33.0% reductions in required SLO scale) rests on the representativeness of the chosen agentic workloads and the A100/H100/H200 cluster configurations. The manuscript does not describe how these workloads were selected or validated against production traces, nor does it report sensitivity to branching factor, tool-call latency variance, or cross-GPU bandwidth. Without such evidence the reported percentages cannot be assessed for generalization.

Authors: We agree that more explicit description of workload construction and sensitivity analysis would strengthen the paper. In the revised manuscript we will add to §6 a description of how the workloads were assembled from representative multi-step agentic patterns (planning, tool invocation, branching, and synthesis) drawn from open frameworks, together with new sensitivity results for branching factor and tool-call latency variance. Our cluster model already incorporates realistic cross-GPU transfer costs for the A100/H100/H200 mix; we will report additional bandwidth sweeps. Direct validation against proprietary production traces is not possible for us, but the workloads are constructed to reproduce the key online-DAG and heterogeneity properties observed in public agentic benchmarks. revision: yes
Referee: §6.2 (Baselines and methodology): The abstract and evaluation summary supply no information on the concrete baselines, statistical aggregation method, or measurement protocol used to obtain the 20.1%/33.0% figures. This information is load-bearing for the empirical contribution and must be supplied with sufficient detail for independent verification.

Authors: We acknowledge the omission in the abstract and high-level summary. Section 6.2 of the full manuscript already specifies the baselines (FCFS, SJF, and heterogeneity-unaware disaggregated schedulers), the aggregation method (mean and tail statistics over 10 independent runs with different random seeds), and the measurement protocol (SLO scale defined as the multiplicative factor on the workflow’s standalone completion horizon required to reach the target attainment). To make this information immediately accessible, we will insert a short summary paragraph and table at the start of the evaluation section in the revised version. revision: yes
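One reading of the SLO-scale definition quoted in the response above can be written down directly. The sketch assumes each workflow contributes a ratio of observed end-to-end latency to its standalone completion horizon, and that the required scale is the smallest multiplier under which the target fraction of workflows finishes in time. The function names and the sample numbers are hypothetical, and this is not the authors' measurement code.

```python
import math


def required_slo_scale(latencies, horizons, attainment):
    """Smallest scale s such that at least `attainment` of workflows
    satisfy latency <= s * horizon."""
    if not latencies or len(latencies) != len(horizons):
        raise ValueError("need equally sized, non-empty inputs")
    ratios = sorted(lat / hor for lat, hor in zip(latencies, horizons))
    # Number of workflows that must meet the SLO; the small epsilon guards
    # against floating-point overshoot in attainment * n.
    k = max(1, math.ceil(attainment * len(ratios) - 1e-9))
    return ratios[k - 1]


def relative_reduction(baseline_scale, new_scale):
    """Fractional reduction in required SLO scale versus a baseline."""
    return (baseline_scale - new_scale) / baseline_scale


print(required_slo_scale([1.0, 2.0, 3.0, 4.0], [1.0] * 4, 0.75))  # 3.0
print(relative_reduction(5.0, 4.0))                               # 0.2
```

Under this reading, a reported 20.1% reduction at 95% attainment corresponds to `relative_reduction` returning 0.201 when both schedulers' scales are computed with `attainment=0.95`.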

standing simulated objections not resolved

Direct validation of workloads against proprietary production traces from specific industry deployments, which are not publicly available.

Circularity Check

0 steps flagged

No circularity: performance claims are empirical outcomes of scheduler evaluation

full rationale

The paper's central claims consist of measured reductions in required SLO scale (20.1% at 95% attainment, 33.0% at 99%) obtained by running HexAGenT on representative agentic workloads and heterogeneous A100/H100/H200 clusters. These are presented as direct experimental results rather than predictions derived from fitted parameters, self-referential definitions, or load-bearing self-citations. The scheduler description (DAG modeling, risk prioritization, placement selection) is algorithmic and evaluated externally; no equation or theorem reduces by construction to its own inputs. The derivation chain is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumptions that workflows are well captured by online-revealed DAGs and that the tested workloads and clusters are representative; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: Agentic LLM requests can be modeled as online-revealed DAGs whose dependencies become known incrementally at runtime.
This modeling choice underpins the prioritization and placement logic.
domain assumption: Prefill and decode stages impose distinct compute, memory, and cross-stage transfer constraints on heterogeneous GPUs.
Required for the joint placement decisions described.
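To give the second assumption some scale, the sketch below estimates a KV-cache footprint and the time to move it between stages. It uses the standard per-token formula for a transformer with uncompressed key and value caches (two tensors per layer). The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and the 50 GB/s link are hypothetical values chosen for illustration, not figures from the paper.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of keys plus values stored for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def transfer_seconds(num_tokens, per_token_bytes, link_bytes_per_s):
    """Time to ship a KV cache of `num_tokens` tokens over one link."""
    return num_tokens * per_token_bytes / link_bytes_per_s


per_tok = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_tok)                                  # 131072 bytes per token
print(4096 * per_tok / 2**20)                   # 512.0 MiB for a 4096-token prompt
print(transfer_seconds(4096, per_tok, 50e9))    # seconds on a hypothetical 50 GB/s link
```

Prefill cost grows with prompt length and compute throughput, while decode is bounded by how much of this cache fits in memory, which is why the two stages can favor different GPU types.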

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

HexAGenT models each request as an online-revealed DAG, maintains a running estimate of the workflow's standalone completion horizon, prioritizes ready calls by projected risk of missing that horizon, and jointly selects prefill placement, decode placement, and local queue priority while accounting for KV-cache capacity and cross-stage transfer latency.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Across representative agentic workloads and heterogeneous A100/H100/H200 clusters, HexAGenT reduces the SLO scale required for timely workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 3 internal anchors

[1]

Gulavani, Alexey Tumanov, and Ramachandran Ramjee

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency trade-off in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-939133-...

work page 2024
[2]

Broder, Anna R

Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal. Balanced allocations. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, pages 593–602, New York, NY, USA, 1994. Association for Computing Machinery . doi: 10.1145/195058.195412

work page doi:10.1145/195058.195412 1994
[3]

Graph of thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38. AAAI Press, 2024

work page 2024
[4]

Accelerating retrieval-augmented generation,

Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, and Ion Stoica. MoE-Lightning: High-throughput MoE inference on memory-constrained GPUs. In Proceedings of the 16 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 715–7...

work page doi:10.1145/3669940.3707267 2025
[5]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality . LMSYS Blog, March 2023. URLhttps://lmsys.org/blog/2023-03-30-vicuna/

work page 2023
[6]

Davis, Ken W

Robert I. Davis, Ken W. Tindell, and Alan Burns. Scheduling slack time in fixed priority pre-emptive systems. In Proceedings of the 14th IEEE Real-Time Systems Symposium, pages 222–231, Washington, DC, USA, 1993. IEEE Computer Society . doi: 10.1109/REAL.1993.393505

work page doi:10.1109/real.1993.393505 1993
[7]

ServerlessLLM: Low-latency serverless inference for large language models

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. ServerlessLLM: Low-latency serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 135–153, Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-939133-40-3. URLhttps:/...

work page 2024
[8]

Efficient LLM scheduling by learning to rank, 2024

Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, and Hao Zhang. Efficient LLM scheduling by learning to rank, 2024

work page 2024
[9]

Efficient pre-training of llms via topology-aware communication alignment on more than 9600 gpus

Guoliang He, Youhe Jiang, Wencong Xiao, Jiang Kaihua, Shuguang Wang, Jun Wang, Du Zixian, Zhuo Jiang, Xinlei Zhang, Binhang Yuan, et al. Efficient pre-training of llms via topology-aware communication alignment on more than 9600 gpus. Advances in Neural Information Processing Systems, 38:147100–147126, 2026

work page 2026
[10]

Efficient multi-round llm inference over disaggregated serving

Wenhao He, Youhe Jiang, Penghao Zhao, Quanqing Xu, Eiko Yoneki, Bin Cui, and Fangcheng Fu. Efficient multi-round llm inference over disaggregated serving. arXiv preprint arXiv:2602.14516, 2026

work page arXiv 2026
[11]

Osdp: Optimal sharded data parallel for distributed deep learning

Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, and Bin Cui. Osdp: Optimal sharded data parallel for distributed deep learning. arXiv preprint arXiv:2209.13258, 2022

work page arXiv 2022
[12]

Hexgen: Generative inference of large language model over heterogeneous environment

Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, and Binhang Yuan. Hexgen: Generative inference of large language model over heterogeneous environment. arXiv preprint arXiv:2311.11514, 2023

work page arXiv 2023
[13]

Demystifying cost-efficiency in llm serving over heterogeneous gpus

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, and Eiko Yoneki. Demystifying cost-efficiency in llm serving over heterogeneous gpus. arXiv preprint arXiv:2502.00722, 2025

work page arXiv 2025
[14]

Thunderserve: High-performance and cost-efficient llm serving in cloud environments

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, and Eiko Yoneki. Thunderserve: High-performance and cost-efficient llm serving in cloud environments. Proceedings of Machine Learning and Systems, 7, 2025

work page 2025
[15]

Cascadia: An efficient cascade serving system for large language models

Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Jintao Zhang, Nicholas D Lane, and Binhang Yuan. Cascadia: An efficient cascade serving system for large language models. arXiv preprint arXiv:2506.04203, 2025

work page arXiv 2025
[16]

Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment

Youhe Jiang, Ran Yan, and Binhang Yuan. Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment. arXiv preprint arXiv:2502.07903, 2025

work page arXiv 2025
[17]

OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration

Youhe Jiang, Fangcheng Fu, Taiyi Wang, Guoliang He, and Eiko Yoneki. Oserve: Accelerating llm serving via spatial-temporal workload orchestration. arXiv preprint arXiv:2602.12151, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Boute: Cost-efficient llm serving with heterogeneous llms and gpus via multi-objective bayesian optimization

Youhe Jiang, Fangcheng Fu, and Eiko Yoneki. Boute: Cost-efficient llm serving with heterogeneous llms and gpus via multi-objective bayesian optimization. arXiv preprint arXiv:2602.10729, 2026

work page arXiv 2026
[19]

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

Youhe Jiang, Ran Yan, You Peng, Wenshuang Li, Taiyi Wang, Fangcheng Fu, and Binhang Yuan. Autopoiesis: A self-evolving system paradigm for llm serving under runtime dynamics. arXiv preprint arXiv:2604.07144, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

P/D-Serve: Serving disaggregated large language model at scale, 2024

Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo,...

work page 2024
[21]

Efficient memory management for large language model serving with pagedattention,

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pages 611–626, New York, NY, USA, 2023. Association for Computing M...

work page doi:10.1145/3600006.3613165 2023
[22]

Gonzalez, and Ion Stoica

Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Alvin Cheung, Joseph E. Gonzalez, and Ion Stoica. Continuum: Efficient and robust multi-turn LLM agent scheduling with KV cache time-to-live, 2025

work page 2025
[23]

Gonzalez, and Ion Stoica

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 663–679, Boston, MA, July 2023....

work page 2023
[24]

Parrot: Efficient serving of LLM-based applications with semantic variable

Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. Parrot: Efficient serving of LLM-based applications with semantic variable. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 929–945, Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-939133-40-3. URLhttps://www.us...

work page 2024
[25]

Hermes: Efficient serving of LLM applications with probabilistic demand modeling

Yifei Liu, Zuo Gan, Zhenghao Gan, Weiye Wang, Chen Chen, Yizhou Shan, Xusheng Chen, Zhenhua Han, Yifei Zhu, Shixuan Sun, and Minyi Guo. Hermes: Efficient serving of LLM applications with probabilistic demand modeling. ACM Transactions on Architecture and Code Optimization, 2026. doi: 10.1145/3803390

work page doi:10.1145/3803390 2026
[26]

Gonzalez, and Ion Stoica

Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, and Ion Stoica. Autellix: An efficient serving engine for LLM agents as general programs, 2025

work page 2025
[27]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Sy...

work page 2023
[28]

Skyserve: Serving ai models across regions and clouds with spot instances

Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, and Ion Stoica. Skyserve: Serving ai models across regions and clouds with spot instances. In Proceedings of the Twentieth European Conference on Computer Systems, pages 159–175, 2025

work page 2025
[29]

The state of AI in 2025

McKinsey & Company. The state of AI in 2025. McKinsey Global Survey , 2025. URL https://www.mckinsey.com/c apabilities/quantumblack/our-insights/the-state-of-ai

work page 2025
[30]

Galvatron: Efficient transformer train- ing over multiple gpus using automatic parallelism.arXiv preprint arXiv:2211.13878, 2022

Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. Galvatron: Efficient transformer training over multiple gpus using automatic parallelism. arXiv preprint arXiv:2211.13878, 2022

work page arXiv 2022
[31]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

work page 2024
[32]

Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E

Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025. URL https://icml.cc/virtual/2025/poste r/46593

work page 2025
[33]

Kalbarczyk, and Ravishankar K

Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer. Queue management for SLO-oriented large language model serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing, pages 18–35, New York, NY, USA, 2024. Association for Computing Machinery . doi: 10.1145/...

work page doi:10.1145/3698038.3698523 2024
[34]

Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql

You Peng, Youhe Jiang, Wenqi Jiang, Chen Wang, and Binhang Yuan. Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql. arXiv preprint arXiv:2505.05286, 2025

work page arXiv 2025
[35]

Kalbarczyk, Tamer Basar, and Ravishankar K

Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Basar, and Ravishankar K. Iyer. Efficient interactive llm serving with proxy model-based sequence length prediction. In Proceedings of the 5th International Workshop on Cloud Intelligence / AIOps at ASPLOS 2024 (AIOps 2024), pages 1–...

work page 2024
[36]

Kalbarczyk, Tamer Basa ¸r, and Ravishankar K

Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Basa ¸r, and Ravishankar K. Iyer. Efficient interactive LLM serving with proxy model-based sequence length prediction. In Proceedings of the 5th International Workshop on Cloud Intelligence / AIOps at ASPLOS 2024, New York, NY, USA, ...

work page 2024
[37]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2023

work page 2023
[38]

A proof of the optimality of the shortest remaining processing time discipline

Linus Schrage. A proof of the optimality of the shortest remaining processing time discipline. Operations Research, 16(3):687–690, 1968

work page 1968
[39]

Don’t stop me now: Embedding based scheduling for LLMs, 2024

Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, and Michael Mitzenmacher. Don’t stop me now: Embedding based scheduling for LLMs, 2024

work page 2024
[40]

Stankovic, Marco Spuri, Krithi Ramamritham, and Giorgio C

John A. Stankovic, Marco Spuri, Krithi Ramamritham, and Giorgio C. Buttazzo. Scheduling in Real-Time Systems. Springer, Boston, MA, USA, 1998

work page 1998
[41]

Llumnix: Dynamic scheduling for large language model serving

Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 173–191, Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-939133- 40-3. URLhttps://www.usenix.org/conference/os...

[42]

Chris Tong, Youhe Jiang, Gufeng Chen, Tianyi Zhao, Sibian Lu, Wenjie Qu, Eric Yang, Lynn Ai, and Binhang Yuan. Parallax: Efficient llm inference service over decentralized environment. arXiv preprint arXiv:2509.26182, 2025

[43]

Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, and Bin Cui. Improving automatic parallel training via balanced memory workload optimization. IEEE Transactions on Knowledge and Data Engineering, 36(8):3906–3920, 2024

[44]

Samuel Williams, Andrew Waterman, and David A. Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009. doi: 10.1145/1498765.1498785. URL https://doi.org/10.1145/1498765.1498785

[45]

Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models, 2023

[46]

Ran Yan, Youhe Jiang, Zhuoming Chen, Haohui Mai, Beidi Chen, and Binhang Yuan. Fsa: An alternative efficient implementation of native sparse attention kernel. arXiv preprint arXiv:2508.18224, 2025

[47]

Ran Yan, Youhe Jiang, Tianyuan Wu, Jiaxuan Gao, Zhiyu Mei, Wei Fu, Haohui Mai, Wei Wang, Yi Wu, and Binhang Yuan. Areal-hex: Accommodating asynchronous rl training over heterogeneous gpus. arXiv preprint arXiv:2511.00796, 2025

[48]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380. Association for Computational Linguistics, 2018. doi: 10...

[49]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2023

[50]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X

[52]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, Carlsbad, CA, July 2022. USENIX Association. ISBN 978-1-939133-28-1. URL https://www.usenix.org/conferenc...

[53]

Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, and Kai Chen. Efficient mixed-precision large language model inference with turbomind. arXiv preprint arXiv:2508.15601, 2025

[54]

Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, and Ion Stoica. Blendserve: Optimizing offline inference for auto-regressive large models with resource-aware batching, 2024

[55]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. In Advances in Neural Information Processing Systems, volume 37, Red Hook, NY, USA, 2024. Curran Associates,...

[56]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-93...

[57]

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 62138–62160. PMLR, 2024. URL https://proceedings.mlr.pre...


