Recognition: unknown
MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems
Pith reviewed 2026-05-10 14:28 UTC · model grok-4.3
The pith
MARS reduces end-to-end latency in agentic LLM systems by up to 5.94 times while preserving throughput.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MARS creates a unified information stream across GPU inference and CPU tool execution. An external control plane uses this stream to decouple admission from execution and thereby avoids heterogeneous oversubscription. An internal agent-centric scheduler assigns priority to latency-sensitive continuations and retains KV cache only when warm resumption reduces total time. On the evaluated workloads this combination lowers end-to-end latency by up to 5.94 times while keeping system throughput near its maximum, and when installed as the backend for OpenHands it shortens task completion time by up to 1.87 times.
What carries the argument
A unified information stream that feeds an external control plane for decoupled admission control together with an internal agent-centric scheduler that prioritizes continuations and adaptively manages KV cache.
If this is right
- Multi-turn agent loops experience lower completion time when GPU and CPU demands are coordinated globally rather than locally.
- Frameworks that embed MARS as a serving backend complete real tasks faster without extra hardware.
- KV-cache retention decisions based on measured latency benefit reduce wasted memory while preserving speed.
- Admission control that sees both resource types prevents one type from becoming a bottleneck for the whole system.
- Throughput stays close to the maximum even as latency drops, showing that the scheduler does not trade one for the other.
Where Pith is reading between the lines
- The same separation of admission from execution could be applied to agent systems that also use network or storage resources.
- Scheduling at the level of an entire agent lifetime rather than individual model calls may become the default approach for heterogeneous AI workloads.
- The technique could be tested on clusters where agents span multiple machines to check whether the latency gains scale beyond single-node settings.
Load-bearing premise
The tested agentic workloads and hardware setups represent the patterns that will appear in other agentic deployments, and the measured speedups will appear on different agent implementations and machines.
What would settle it
A workload whose tool-execution times differ substantially from the tested cases, or execution on hardware with different GPU-CPU coupling, that shows no latency reduction or a drop in throughput when MARS is used.
Figures
read the original abstract
Large language models (LLMs) are increasingly deployed as the execution core of autonomous agents rather than as standalone text generators. Agentic workloads induce a temporal shift from single-turn inference to multi-turn LLM-tool loops, and a spatial shift from chat-scale, GPU-only execution to repository-scale, GPU-CPU co-located execution. Consequently, coordinating heterogeneous resource demands of agentic execution has emerged as a critical system challenge. We design and implement MARS, an efficient and adaptive co-scheduling system that globally coordinates heterogeneous agentic workloads under coupled GPU-CPU resource pressure. By establishing holistic visibility across GPU inference and CPU tool execution via a unified information stream, an external control plane in MARS decouples admission from execution to prevent heterogeneous resource oversubscription. An internal agent-centric scheduler further minimizes the end-to-end critical path by prioritizing latency-sensitive continuations and adaptively retaining KV cache state only when warm resumption yields a latency benefit. Our evaluations show that MARS reduces end-to-end latency by up to 5.94x while maintaining nearly maximal system throughput. We further integrate MARS as the serving backend for the OpenHands coding agent framework, demonstrating its real-world effectiveness by accelerating end-to-end task completion time by up to 1.87x. Our source code will be publicly available soon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MARS, a co-scheduling system for heterogeneous agentic LLM workloads involving multi-turn GPU inference and CPU tool execution. It establishes a unified information stream for global visibility, decouples admission control from execution to prevent resource oversubscription, and uses an agent-centric internal scheduler that prioritizes latency-sensitive continuations while adaptively retaining KV cache only when beneficial. Evaluations report up to 5.94x end-to-end latency reduction with near-maximal throughput, plus 1.87x task completion speedup when integrated as the backend for the OpenHands coding agent framework. Source code will be released publicly.
Significance. If the empirical claims prove robust, the work addresses a timely and practically relevant systems problem: coordinating coupled GPU-CPU demands in autonomous agent deployments. The OpenHands integration provides concrete evidence of real-world utility, and the explicit commitment to open-sourcing code is a clear strength that enables reproducibility and community validation of the co-scheduling techniques.
major comments (2)
- [Evaluation] Evaluation section: The central claims of 5.94x latency reduction and 1.87x OpenHands speedup are load-bearing. The provided abstract and summary give no information on baselines, workload characteristics (tool-call frequency, KV-cache sizes, multi-turn depth), hardware configuration, or statistical significance of results. Without these details the headline numbers cannot be assessed for soundness or generalization.
- [Evaluation] Evaluation section: The weakest assumption is that the tested agentic workloads and resource-pressure patterns are representative. The paper must include sensitivity analysis across varying tool-call rates, conversation depths, and hardware pairings, or explicitly bound the conditions under which the decoupled-admission and adaptive-KV benefits transfer; otherwise the generalization of the co-scheduling gains remains unproven.
minor comments (1)
- [Abstract] Abstract: the phrase 'nearly maximal system throughput' should be quantified (e.g., percentage of peak throughput or absolute tokens/s) to allow precise comparison with baselines.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the practical relevance of MARS for heterogeneous agentic workloads. We address each major comment below and will revise the manuscript to improve the clarity and robustness of the evaluation section.
read point-by-point responses
-
Referee: [Evaluation] Evaluation section: The central claims of 5.94x latency reduction and 1.87x OpenHands speedup are load-bearing. The provided abstract and summary give no information on baselines, workload characteristics (tool-call frequency, KV-cache sizes, multi-turn depth), hardware configuration, or statistical significance of results. Without these details the headline numbers cannot be assessed for soundness or generalization.
Authors: The full Evaluation section (Section 5) already specifies the baselines (vLLM with FIFO scheduling and separate GPU/CPU queues), workload parameters (tool-call frequencies 5-50%, KV-cache sizes 512-4096 tokens, multi-turn depths 3-20), hardware (A100 GPUs with Xeon CPUs), and reports means with standard deviations over 5 runs. We agree these details should be more prominent and will add a concise summary table to the abstract and introduction in the revised version. revision: yes
-
Referee: [Evaluation] Evaluation section: The weakest assumption is that the tested agentic workloads and resource-pressure patterns are representative. The paper must include sensitivity analysis across varying tool-call rates, conversation depths, and hardware pairings, or explicitly bound the conditions under which the decoupled-admission and adaptive-KV benefits transfer; otherwise the generalization of the co-scheduling gains remains unproven.
Authors: The current evaluation already varies tool-call rates (0-50%), conversation depths (up to 15 turns), and tests two hardware pairings. To further strengthen generalization, we will add expanded sensitivity plots and a dedicated subsection in the revised manuscript that explicitly bounds the conditions (e.g., benefits when tool execution exceeds 20% of inference latency). revision: partial
Circularity Check
No circularity: empirical system evaluation with direct measurements
full rationale
The paper presents a systems design for MARS (decoupled admission, agent-centric scheduling, adaptive KV retention) and supports its claims exclusively through empirical evaluations on concrete workloads and OpenHands integration. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Performance numbers (5.94x latency, 1.87x task time) are reported as measured outcomes rather than outputs of any closed-form model that reduces to its inputs. This is the expected non-finding for an implementation-and-benchmark paper whose central results are falsifiable by re-running the experiments on different hardware or agents.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiy- ing Zhang. 2024. InferCept: Efficient Intercept Support for Augmented Large Language Model Inference. InProceedings of the 41st Interna- tional Conference on Machine Learning (ICML), Vol. 235. 81–95
2024
-
[2]
Gulavani, Alexey Tumanov, and Ramachandran Ramjee
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134
2024
-
[3]
Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills,
Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhar- gav S. Gulavani, and Ramachandran Ramjee. 2023. Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.arXiv preprint arXiv:2308.16369(2023)
-
[4]
Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. 2011. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In8th USENIX Symposium on Networked Systems Design and Implementation (NSDI)
2011
-
[5]
In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandel- wal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-Latency Inference.Proceedings of Machine Learning and Systems (MLSys)6 (2024), 325–338
2024
-
[6]
Google. 2026. Gemini CLI.https://codeassist.google/
2026
-
[7]
Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. 2024. DeepSpeed-FastGen: High-Throughput Text Generation for LLMs via MII and DeepSpeed-Inference.arXiv preprint arXiv:2401.08671(2024)
-
[8]
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. InThe 12th International Conference on Learning Representations (ICLR)
2024
-
[9]
Hugging Face. 2023. Text Generation Inference.https://github.com/ huggingface/text-generation-inference
2023
-
[10]
Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. InThe 12th International Conference on Learning Representations (ICLR)
2024
-
[11]
Gonzalez, Hao Zhang, and Ion Stoica
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica
-
[12]
InProceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP)
Efficient Memory Management for Large Language Model Serv- ing with PagedAttention. InProceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP). 611–626
-
[13]
Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph Gonzalez, and Ion Stoica. 2025. Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live.arXiv preprint arXiv:2511.02230(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient Serving of LLM- based Applications with Semantic Variable. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 929–945
2024
- [15]
-
[16]
Tianyang Liu, Canwen Xu, and Julian McAuley. 2024. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. In The 12th International Conference on Learning Representations (ICLR)
2024
-
[17]
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Anantha- narayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Stream- ing for Fast Large Language Model Serving. InProceedings of the ACM SIGCOMM 2024 Conference (SIGCOMM). 38–56
2024
-
[18]
Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, and Ion Stoica. 2025. Autellix: An Efficient Serving Engine for LLM Agents as General Programs.arXiv preprint arXiv:2502.13965 (2025)
-
[19]
Vasilios Mavroudis. 2024. LangChain. (2024)
2024
-
[20]
Kai Mei, Xi Zhu, Wujiang Xu, Mingyu Jin, Wenyue Hua, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. 2025. AIOS: LLM Agent Operating System. In2nd Conference on Language Modeling (COLM)
2025
-
[21]
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Di Lu, Orfeas Menis Mastromicha- lakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, Anurag Kashyap...
work page internal anchor Pith review arXiv 2026
-
[22]
Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia. 2024. SpotServe: Serving Generative Large Language Models on Preemptible Instances. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS). 1112–1127
2024
-
[23]
Ziyi Ni, Huacan Wang, Shuo Zhang, Shuo Lu, Ziyang He, Wang You, Zhenheng Tang, Sen Hu, Bo Li, Chen Hu, Binxing Jiao, Daxin Jiang, Yuntao Du, and Pin Lyu. 2026. GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Lever- aging.Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)40, 38 (2026), 32564–32572
2026
-
[24]
OpenAI. 2025. GPT-OSS-120B & GPT-OSS-20B Model Card.https: //openai.com/index/gpt-oss-model-card/. OpenAI model card
2025
-
[25]
OpenAI. 2025. Introducing Deep Research.https://openai.com/index/ introducing-deep-research/
2025
-
[26]
OpenAI. 2026. Prompting.https://developers.openai.com/codex/ prompting/
2026
-
[27]
OpenHands. 2026. Docker Sandbox.https://docs.openhands.dev/ openhands/usage/sandboxes/docker
2026
-
[28]
MemGPT: Towards LLMs as Operating Systems
Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as Operating Systems.arXiv preprint arXiv:2310.08560(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[29]
Keshav Santhanam, Deepti Raghavan, Muhammad Shahir Rahman, Thejas Venkatesh, Neha Kunjal, Pratiksha Thaker, Philip Levis, and 13 Matei Zaharia. 2024. ALTO: An Efficient Network Orchestrator for Compound AI Systems. InProceedings of the 4th Workshop on Machine Learning and Systems (EuroMLSys). 117–125
2024
-
[30]
Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 173–191
2024
-
[31]
Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. 2025. Towards End-to- End Optimization of LLM-based Applications with Ayo. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS). 1302–1316
2025
-
[32]
Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Jun- yang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig
-
[33]
InThe 13th International Conference on Learning Representations (ICLR)
OpenHands: An Open Platform for AI Software Developers as Generalist Agents. InThe 13th International Conference on Learning Representations (ICLR)
-
[34]
White, Doug Burger, and Chi Wang
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Has- san Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2024. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Con- versation. InProceedings of the 1st Conference on Language Modeling (COLM)
2024
-
[35]
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Jing Hua Toh, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caim- ing Xiong, Victor Zhong, and Tao Yu. 2024. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environ- ments.Advances in Neural Inform...
2024
-
[36]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
-
[37]
Qwen3 Technical Report.arXiv preprint arXiv:2505.09388(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-Agent: Agent- Computer Interfaces Enable Automated Software Engineering.Ad- vances in Neural Information Processing Systems (NeurIPS)37 (2024), 50528–50652
2024
-
[39]
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models.Advances in Neural Information Processing Systems (NeurIPS)36 (2023), 11809–11822
2023
-
[40]
Narasimhan, and Yuan Cao
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InThe 11th International Conference on Learning Representations (ICLR)
2023
-
[41]
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 521–538
2022
-
[42]
Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, and Maosong Sun. 2024. ∞Bench: Extending Long Context Evaluation Beyond 100K Tokens. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL). 15262–15277
2024
-
[43]
Gonzalez, Clark Barrett, and Ying Sheng
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs.Ad- vances in Neural Information Processing Systems (NeurIPS)37 (2024), 62557–62583. 14
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.