HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
Pith reviewed 2026-05-19 20:55 UTC · model grok-4.3
The pith
HexAGenT schedules agentic LLM workflows on heterogeneous GPU clusters to cut the SLO scale needed for timely end-to-end completion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HexAGenT models each request as an online-revealed DAG, maintains a running estimate of the workflow's standalone completion horizon, and prioritizes ready calls by their projected risk of missing that horizon. It jointly selects prefill placement, decode placement, and local queue priority while accounting for KV-cache capacity and cross-stage transfer latency. Across representative agentic workloads on heterogeneous A100/H100/H200 clusters, this reduces the SLO scale required for timely workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment, with maximum reductions of 45.0% and 80.5%, respectively.
What carries the argument
Workflow-aware scheduler that represents requests as online-revealed DAGs, estimates standalone completion horizons, and performs joint risk-based prioritization plus prefill/decode placement across heterogeneous GPUs.
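A minimal Python sketch of risk-based prioritization among ready calls may help make this concrete. It assumes risk is measured as projected slack (time left before the workflow's horizon deadline minus estimated remaining work); that measure and the names `ReadyCall`, `projected_slack`, and `order_by_risk` are illustrative assumptions, not the paper's definitions.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ReadyCall:
    slack: float                          # sort key: smaller slack means higher risk
    call_id: str = field(compare=False)   # excluded from ordering


def projected_slack(now, horizon_deadline, est_remaining):
    """Hypothetical risk measure: time left before the horizon minus remaining work."""
    return (horizon_deadline - now) - est_remaining


def order_by_risk(now, calls):
    """calls: iterable of (call_id, horizon_deadline, est_remaining).

    Returns call ids ordered from most at risk (least slack) to least at risk.
    """
    heap = [ReadyCall(projected_slack(now, d, r), cid) for cid, d, r in calls]
    heapq.heapify(heap)
    return [heapq.heappop(heap).call_id for _ in range(len(heap))]
```

A negative slack in this sketch would mean the call is already projected to miss its workflow's horizon, so it sorts ahead of every call with positive slack.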
If this is right
- Production clusters can serve the same volume of agentic workflows while provisioning fewer GPUs or accepting tighter latency targets.
- Mixed-generation GPU fleets become more practical because the scheduler explicitly balances prefill and decode work across device types.
- Workflow-level success rates improve at the same resource budget because placement and priority decisions incorporate end-to-end horizon risk rather than per-call metrics.
- Operators can lower over-provisioning margins while still guaranteeing high-percentile workflow completion times.
Where Pith is reading between the lines
- The same risk-horizon approach could be adapted to other incrementally revealed workflow systems such as distributed data pipelines or multi-agent robotic control.
- Adding a cost or energy term to the placement decision would let the scheduler optimize for both latency and operational expense on heterogeneous hardware.
- Evaluating the method under sudden cluster reconfigurations or bursty arrival patterns would test whether the online DAG estimation remains stable outside the evaluated static settings.
Load-bearing premise
The representative agentic workloads and the specific heterogeneous A100/H100/H200 cluster configurations used for evaluation are sufficiently similar to real production deployments that the reported reductions in required SLO scale will generalize.
What would settle it
Run the scheduler on a different collection of agentic workloads, or on a cluster whose GPU mix and network characteristics differ from the A100/H100/H200 testbed. If the SLO scale needed to reach the same attainment levels shows no reduction, or an increase, the central claim does not hold.
Original abstract
Agentic LLM applications increasingly execute user requests as multi-step workflows involving planning, tool use, branching, refinement, and synthesis. In such settings, users experience the end-to-end latency of an entire workflow, not the latency of any single LLM call. In this paper, we study how to schedule online agentic workflows across heterogeneous prefill-decode disaggregated LLM serving clusters to efficiently meet workflow-level latency objectives. The problem is challenging because workflow dependencies are revealed incrementally at runtime, calls have heterogeneous prompts, outputs, and KV-cache requirements, and the prefill and decode stages impose different compute, memory, and transfer constraints across heterogeneous GPUs. To solve this problem, we present HexAGenT, a workflow-aware scheduler for a heterogeneous prefill-decode inference service. HexAGenT models each request as an online-revealed DAG, maintains a running estimate of the workflow's standalone completion horizon, prioritizes ready calls by projected risk of missing that horizon, and jointly selects prefill placement, decode placement, and local queue priority while accounting for KV-cache capacity and cross-stage transfer latency. Across representative agentic workloads and heterogeneous A100/H100/H200 clusters, HexAGenT reduces the SLO scale required for timely workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment, with maximum reductions of 45.0% and 80.5%, respectively.
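The joint prefill/decode placement described in the abstract can be illustrated with a small Python sketch. It assumes a simple additive latency model (queue delay, prefill time, KV transfer time, decode queue delay, decode time) and a token-count KV-cache check on the decode GPU. The `Gpu` fields, the `choose_placement` function, and the latency model are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Gpu:
    name: str
    prefill_tok_per_s: float   # assumed prefill throughput
    decode_tok_per_s: float    # assumed decode throughput
    queue_delay_s: float       # assumed current queueing delay
    free_kv_tokens: int        # assumed free KV-cache capacity, in tokens


def choose_placement(prompt_tokens, output_tokens, prefill_pool, decode_pool, transfer_s):
    """Pick the (prefill, decode) pair with the lowest estimated completion time.

    transfer_s maps (prefill_name, decode_name) to KV transfer seconds.
    Returns (estimate_s, prefill_name, decode_name), or None if no decode GPU
    has enough free KV-cache for the whole sequence.
    """
    best = None
    for p, d in product(prefill_pool, decode_pool):
        if d.free_kv_tokens < prompt_tokens + output_tokens:
            continue  # decode GPU cannot hold this call's KV cache
        est = (p.queue_delay_s + prompt_tokens / p.prefill_tok_per_s
               + transfer_s[(p.name, d.name)]
               + d.queue_delay_s + output_tokens / d.decode_tok_per_s)
        if best is None or est < best[0]:
            best = (est, p.name, d.name)
    return best
```

Exhaustive search over pairs is only practical for small pools; it is used here to keep the trade-off between a faster but busier prefill GPU and a cheaper transfer path easy to see.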
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HexAGenT, a workflow- and heterogeneity-aware scheduler for online agentic LLM serving on prefill-decode disaggregated clusters with A100/H100/H200 GPUs. Workflows are modeled as incrementally revealed DAGs; the scheduler maintains a running estimate of each workflow's standalone completion horizon, prioritizes ready calls by projected risk of missing that horizon, and jointly decides prefill placement, decode placement, and local queue priority while respecting KV-cache capacity and cross-stage transfer costs. Evaluation across representative agentic workloads reports average reductions in required SLO scale of 20.1% at 95% attainment and 33.0% at 99% attainment (maxima 45.0% and 80.5%).
Significance. If the reported SLO reductions prove robust, the work would offer a practical advance for serving multi-step agentic applications whose end-to-end latency, rather than per-call latency, determines user experience. The combination of online DAG awareness with explicit modeling of prefill/decode asymmetry and GPU heterogeneity addresses a timely systems problem that existing single-request or homogeneous schedulers do not handle.
Major comments (2)
- [§6] §6 (Evaluation): The central quantitative claim—average 20.1% and 33.0% reductions in required SLO scale—rests on the representativeness of the chosen agentic workloads and the A100/H100/H200 cluster configurations. The manuscript does not describe how these workloads were selected or validated against production traces, nor does it report sensitivity to branching factor, tool-call latency variance, or cross-GPU bandwidth. Without such evidence the reported percentages cannot be assessed for generalization.
- [§6.2] §6.2 (Baselines and methodology): The abstract and evaluation summary supply no information on the concrete baselines, statistical aggregation method, or measurement protocol used to obtain the 20.1%/33.0% figures. This information is load-bearing for the empirical contribution and must be supplied with sufficient detail for independent verification.
Minor comments (2)
- [§3] §3 (System model): The notation for online DAG revelation and the precise definition of the “standalone completion horizon” could be clarified with a small example or pseudocode to aid readers unfamiliar with agentic workflows.
- [Figure 4] Figure 4 and Table 2: Axis labels and legend entries are too small for comfortable reading; consider increasing font size or splitting the figure.
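The first minor comment asks for a small example of online DAG revelation and the standalone completion horizon. One possible reading is sketched below in Python. It assumes the horizon is the critical-path length over the calls revealed so far, with each call timed as if it ran alone on the cluster. The paper's exact definition is not reproduced on this page, so the `OnlineDag` class and its methods are hypothetical.

```python
from collections import defaultdict


class OnlineDag:
    """Workflow whose calls and dependencies are revealed one at a time."""

    def __init__(self):
        self.duration = {}                 # call_id -> estimated standalone duration
        self.parents = defaultdict(list)   # call_id -> parent call ids

    def reveal(self, call_id, est_duration, parents=()):
        """Add a newly revealed call; its parents must already be known."""
        for p in parents:
            if p not in self.duration:
                raise KeyError(f"unknown parent {p!r}")
        self.duration[call_id] = est_duration
        self.parents[call_id] = list(parents)

    def finish_time(self, call_id):
        """Earliest finish of a call if the workflow ran with no contention."""
        start = max((self.finish_time(p) for p in self.parents[call_id]), default=0.0)
        return start + self.duration[call_id]

    def horizon(self):
        """Running estimate: critical-path length over the calls revealed so far."""
        return max((self.finish_time(c) for c in self.duration), default=0.0)
```

Because each call can only name parents that were revealed earlier, the graph stays acyclic and the recursion terminates. The unmemoized recursion is adequate for an example of this size but would need caching for large workflows.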
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment below and will revise the evaluation section to improve clarity and provide additional supporting details where feasible.
Point-by-point responses
- Referee: [§6] §6 (Evaluation): The central quantitative claim (average 20.1% and 33.0% reductions in required SLO scale) rests on the representativeness of the chosen agentic workloads and the A100/H100/H200 cluster configurations. The manuscript does not describe how these workloads were selected or validated against production traces, nor does it report sensitivity to branching factor, tool-call latency variance, or cross-GPU bandwidth. Without such evidence the reported percentages cannot be assessed for generalization.
  Authors: We agree that more explicit description of workload construction and sensitivity analysis would strengthen the paper. In the revised manuscript we will add to §6 a description of how the workloads were assembled from representative multi-step agentic patterns (planning, tool invocation, branching, and synthesis) drawn from open frameworks, together with new sensitivity results for branching factor and tool-call latency variance. Our cluster model already incorporates realistic cross-GPU transfer costs for the A100/H100/H200 mix; we will report additional bandwidth sweeps. Direct validation against proprietary production traces is not possible for us, but the workloads are constructed to reproduce the key online-DAG and heterogeneity properties observed in public agentic benchmarks.
  Revision: yes
- Referee: [§6.2] §6.2 (Baselines and methodology): The abstract and evaluation summary supply no information on the concrete baselines, statistical aggregation method, or measurement protocol used to obtain the 20.1%/33.0% figures. This information is load-bearing for the empirical contribution and must be supplied with sufficient detail for independent verification.
  Authors: We acknowledge the omission in the abstract and high-level summary. Section 6.2 of the full manuscript already specifies the baselines (FCFS, SJF, and heterogeneity-unaware disaggregated schedulers), the aggregation method (mean and tail statistics over 10 independent runs with different random seeds), and the measurement protocol (SLO scale defined as the multiplicative factor on the workflow’s standalone completion horizon required to reach the target attainment). To make this information immediately accessible, we will insert a short summary paragraph and table at the start of the evaluation section in the revised version.
  Revision: yes
- Direct validation of workloads against proprietary production traces from specific industry deployments, which are not publicly available.
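The SLO-scale metric quoted in the second response (the multiplicative factor on each workflow's standalone completion horizon needed to reach a target attainment) can be computed as in the Python sketch below. It assumes per-workflow measured latencies and horizons are available as parallel lists. The function names are hypothetical, and the paper's aggregation across runs and seeds is not modeled.

```python
import math


def required_slo_scale(latencies, horizons, attainment):
    """Smallest scale s such that at least `attainment` of the workflows
    satisfy latency <= s * horizon."""
    ratios = sorted(l / h for l, h in zip(latencies, horizons))
    # Number of workflows that must meet the SLO; the epsilon guards against
    # floating-point noise pushing an exact product just above an integer.
    k = max(1, math.ceil(attainment * len(ratios) - 1e-9))
    return ratios[k - 1]


def relative_reduction(baseline_scale, new_scale):
    """Fractional reduction in required SLO scale relative to a baseline."""
    return (baseline_scale - new_scale) / baseline_scale
```

Under this reading, a reported reduction such as 20.1% at 95% attainment would correspond to `relative_reduction` applied to the baseline's and HexAGenT's required scales at `attainment=0.95`.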
Circularity Check
No circularity: performance claims are empirical outcomes of scheduler evaluation
Full rationale
The paper's central claims consist of measured reductions in required SLO scale (20.1% at 95% attainment, 33.0% at 99%) obtained by running HexAGenT on representative agentic workloads and heterogeneous A100/H100/H200 clusters. These are presented as direct experimental results rather than predictions derived from fitted parameters, self-referential definitions, or load-bearing self-citations. The scheduler description (DAG modeling, risk prioritization, placement selection) is algorithmic and evaluated externally; no equation or theorem reduces by construction to its own inputs. The derivation chain is therefore self-contained against the reported benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Agentic LLM requests can be modeled as online-revealed DAGs whose dependencies become known incrementally at runtime.
- domain assumption: Prefill and decode stages impose distinct compute, memory, and cross-stage transfer constraints on heterogeneous GPUs.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  HEXAGENT models each request as an online-revealed DAG, maintains a running estimate of the workflow's standalone completion horizon, prioritizes ready calls by projected risk of missing that horizon, and jointly selects prefill placement, decode placement, and local queue priority while accounting for KV-cache capacity and cross-stage transfer latency.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Across representative agentic workloads and heterogeneous A100/H100/H200 clusters, HEXAGENT reduces the SLO scale required for timely workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.