pith. sign in

arxiv: 2606.00866 · v1 · pith:TZNMFFYDnew · submitted 2026-05-30 · 💻 cs.OS

Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI

Pith reviewed 2026-06-28 17:28 UTC · model grok-4.3

classification 💻 cs.OS
keywords agentic workloadsKV cache offloadingLLM servingmemory tieringtool call schedulingidleness rankingGPU-CPU placement
0
0 comments X

The pith

MORI ranks agent programs by continuous relative idleness to assign KV cache between GPU HBM and CPU DRAM, matching hardware capacity ratios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that agentic workloads consist of busy phases with rapid short tool calls and idle phases with long-running calls, but existing eviction policies like LRU or simple binary busy/idle labels fail to align the busy-to-idle ratio with GPU-to-CPU memory capacity. This mismatch leaves one tier underutilized while the other forces unnecessary evictions and KV cache transfers. MORI instead treats idleness as a relative spectrum, ranks all active programs continuously, places the busiest on GPU and the most idle on CPU, and dynamically moves the partition boundary to fit available hardware. On real coding agent traces from Claude Code across multiple GPU-model pairs, this produces 20-71% higher throughput and 18-43% lower time-to-first-token than prior offloading baselines.

Core claim

MORI is an agent serving system whose central mechanism ranks every active program by a continuous, relative idleness score derived from its recent tool-call pattern, assigns the busiest programs to GPU HBM and the most idle to CPU DRAM, shifts the partition point on the fly to equalize load with hardware capacity ratios, and applies admission control at each tier so that tool-call duration differences no longer cause wasteful migrations.

What carries the argument

Continuous relative idleness ranking across all programs, which dynamically sets the GPU-CPU partition boundary to match hardware capacity.

If this is right

  • Memory tiers stay balanced without one sitting idle while the other evicts.
  • KV cache placements remain stable across short tool calls inside a busy phase.
  • Admission control prevents oversubscription at either tier even when the busy-idle mix changes.
  • The same ranking works across different GPU sizes and model KV footprints without retuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The ranking could be combined with intra-tier eviction policies to further reduce migrations inside GPU or CPU.
  • Similar relative-spectrum placement might apply to other two-tier systems such as local versus remote storage in distributed serving.
  • Workloads without clear busy-idle phases would need an alternative signal to keep the ranking useful.

Load-bearing premise

Agentic programs exhibit a two-phase structure of rapid short calls and long-running calls that a single continuous idleness ranking can capture and align with hardware capacity ratios.

What would settle it

A workload trace in which replacing the relative idleness ranking with either LRU or a fixed binary label produces throughput and TTFT identical to MORI.

Figures

Figures reproduced from arXiv: 2606.00866 by Hanchen Li, Hao Kang, Ion Stoica, Tian Xia, Xiaokun Chen, Yifan Qiao, Yi Xu, Zhifei Li.

Figure 1
Figure 1. Figure 1: Overview of MORI. (1) MORI categorizes programs into a continuous idleness spectrum reflecting how likely each is to remain GPU-resident. (2) An adaptive offloading sched￾uler places busy programs (low idleness) in GPU HBM and idle programs (high idleness) in CPU DRAM, with admission control for both tiers. (3) In multi-replica deployments, an affinity-aware load balancer routes requests to engines that al… view at source ↗
Figure 2
Figure 2. Figure 2: Structure of an agentic program. A program alter￾nates between inference steps (shaded) and tool-call gaps. The KV cache grows across steps due to prefix dependencies. agentic workloads, however, a single user task produces a chain of model invocations connected by prefix dependencies. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Three examples of idle phases in agentic programs: (a) a long-running tool call such as a test suite or compilation, (b) waiting for human interaction (e.g., approval or feedback), and (c) spawning subagents and waiting for their results. Red dashed frames denote busy phases; green dashed frames de￾note idle phases. Note that in (c), the subagents themselves may be in busy phases, but the parent agent rema… view at source ↗
Figure 6
Figure 6. Figure 6: MORI three-tier queue architecture. Each replica maintains a GPU queue (HBM) and a CPU queue (DRAM); a global Waiting queue is shared across replicas. Programs are colored by their idleness (red = busy, green = idle). Idle programs are demoted from a higher tier to a lower one (GPU→CPU or CPU→Waiting), and busy programs are pro￾moted from a lower tier to a higher one (Waiting→CPU or CPU→GPU). 4 Design This… view at source ↗
Figure 7
Figure 7. Figure 7: End-to-end performance on H200 (80 GB) with DP=1 and Qwen-2.5 7B. 20 50 80 concurrency 0 100 200 300 400 500 CPU:GPU=1:1 Throughput (tokens/s) Output Throughput 20 50 80 concurrency 0.00 0.25 0.50 0.75 1.00 1.25 Steps/s Steps/s 20 50 80 concurrency 0 20 40 60 TTFT (s) TTFT (avg) 20 50 80 concurrency 0 100 200 300 400 500 CPU:GPU=2:1 Throughput (tokens/s) 20 50 80 concurrency 0.00 0.25 0.50 0.75 1.00 1.25 S… view at source ↗
Figure 8
Figure 8. Figure 8: End-to-end performance on H200 with DP=1 and Qwen-3 30B-A3B. 6.2.1 Single Replica Scheduling with Memory Tiering. We first present results for the single-replica (DP=1) config￾urations, which isolate the effect of phase-aware memory scheduling from load-balancing decisions across replicas [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: End-to-end performance on H200 with DP=3 and Qwen-3 30B-A3B. this by tracking KV cache residency across both GPU and CPU tiers: when a program returns from a tool call, the scheduler routes it back to the replica where its cache is stored, whether on GPU or in that replica’s CPU memory, avoiding cross-replica migration. At 80 programs per replica with 2× CPU memory, MORI switches only 2.0% of pro￾grams (0… view at source ↗
read the original abstract

Modern LLM serving systems increasingly host agentic workloads, whose sessions issue tens of model invocations interleaved with tool calls, accumulating KV cache that can be reused across steps. As requests' total KV cache size easily exceeds GPU HBM capacity, researchers offload them to CPU DRAM. However, tool-call durations span orders of magnitude, and the cost of transferring KV cache between tiers makes it impractical to re-place entries on every call. We observe that agentic programs exhibit a two-phase structure: busy phases of rapid short tool calls and idle phases dominated by long-running calls. Current eviction policies such as LRU fail to capture this property. A binary busy/idle label also falls short because the ratio of busy to idle programs may not match the hardware's GPU-to-CPU capacity ratio. When it does not, one tier sits underutilized while the other is oversubscribed, wasting memory or forcing unnecessary evictions. We present MORI, an agent serving system that solves the above problem. Our key insight is that idleness is a continuous, relative spectrum. MORI ranks all active programs by idleness, assigns the busiest to GPU HBM and the most idle to CPU DRAM, dynamically shifts the partition boundary to match hardware capacity, and enforces admission control at each memory tier. Evaluated on real coding agent workloads collected from Claude Code across four GPU and model pairs, MORI delivers 20--71% higher throughput and 18--43% lower TTFT than the best baseline with offloading.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents MORI, a system for serving agentic LLM workloads that offloads KV cache between GPU HBM and CPU DRAM. It observes that such workloads exhibit busy phases of short tool calls and idle phases of long-running calls, and proposes ranking active programs on a continuous relative idleness spectrum to dynamically set the GPU/CPU partition boundary to match hardware capacity ratios while enforcing per-tier admission control. On real coding-agent workloads from Claude Code across four GPU/model pairs, it reports 20--71% higher throughput and 18--43% lower TTFT versus the best baseline that uses offloading.

Significance. If the central empirical claims hold after the design details are fully specified and the evaluation is expanded, the work would be significant for memory management in LLM serving systems. The insight that idleness is relative rather than binary, and the use of real agentic traces, are strengths; the approach could improve utilization when KV-cache footprints exceed HBM without requiring per-call migrations.

major comments (2)
  1. [Design / Algorithm description (referenced in abstract)] The algorithm or formula used to compute the continuous relative idleness ranking (the load-bearing mechanism that is claimed to align program placement with hardware capacity ratios) is not provided. Without an explicit definition or pseudocode, it is impossible to determine whether the ranking can be computed from past observations alone or requires lookahead/oracle knowledge of tool-call durations, directly affecting the validity of the two-phase exploitation claim.
  2. [Evaluation section] The evaluation reports aggregate throughput and TTFT gains but provides no description of the baseline offloading implementations, workload statistics (e.g., distribution of tool-call durations, number of programs, KV sizes), number of runs, or variance. This makes it impossible to assess whether the 20--71% and 18--43% improvements are robust or sensitive to post-hoc choices.
minor comments (1)
  1. [Abstract] The abstract states performance numbers but does not indicate whether the reported gains include confidence intervals or are from single runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify genuine gaps in the manuscript's presentation of the core algorithm and evaluation details. We will revise to address both.

read point-by-point responses
  1. Referee: [Design / Algorithm description (referenced in abstract)] The algorithm or formula used to compute the continuous relative idleness ranking (the load-bearing mechanism that is claimed to align program placement with hardware capacity ratios) is not provided. Without an explicit definition or pseudocode, it is impossible to determine whether the ranking can be computed from past observations alone or requires lookahead/oracle knowledge of tool-call durations, directly affecting the validity of the two-phase exploitation claim.

    Authors: We agree the manuscript lacks an explicit formula or pseudocode for the relative idleness ranking. This omission prevents readers from verifying that the mechanism relies only on past observations. In the revision we will add a precise definition of the idleness score (derived from each program's observed tool-call duration history), the ranking procedure, the dynamic boundary adjustment logic, and pseudocode. The ranking uses only historical data; no lookahead or oracle information is required. revision: yes

  2. Referee: [Evaluation section] The evaluation reports aggregate throughput and TTFT gains but provides no description of the baseline offloading implementations, workload statistics (e.g., distribution of tool-call durations, number of programs, KV sizes), number of runs, or variance. This makes it impossible to assess whether the 20--71% and 18--43% improvements are robust or sensitive to post-hoc choices.

    Authors: We agree that the evaluation section is missing these required details. The current manuscript reports only aggregate results without describing the baseline implementations, workload characteristics, run counts, or variance. In the revision we will expand the evaluation section to include full descriptions of the baselines, workload statistics (tool-call duration distributions, program counts, KV-cache sizes), number of runs, and standard deviations or confidence intervals for the reported throughput and TTFT gains. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical system design and evaluation

full rationale

The paper contains no equations, derivations, fitted parameters, or self-citations. Its core claims rest on an empirical observation of workload structure followed by a system implementation (MORI) whose performance is measured against baselines on real Claude Code workloads. The idleness-ranking insight is presented as a design principle without any mathematical formalization that could reduce to its own inputs. No load-bearing step reduces by construction to a prior result or fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that agentic programs exhibit distinguishable busy and idle phases that admit a useful relative ranking; no free parameters or invented physical entities are described in the abstract.

axioms (1)
  • domain assumption Agentic programs exhibit a two-phase structure: busy phases of rapid short tool calls and idle phases dominated by long-running calls.
    This observation is presented as the key insight enabling the relative idleness ranking and is required for the dynamic boundary adjustment to be effective.

pith-pipeline@v0.9.1-grok · 5832 in / 1291 out tokens · 27265 ms · 2026-06-28T17:28:56.715531+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Libra: Efficient Resource Management for Agentic RL Post-Training

    cs.LG 2026-06 unverdicted novelty 4.0

    Libra optimizes GPU allocation across rollout and training in agentic RL via an elastic hybrid pool and C-MLFQ scheduler based on tool-return causal signals, claiming up to 3.0x throughput and 2.5x faster reward conve...

Reference graph

Works this paper leans on

75 extracted references · 2 canonical work pages · cited by 1 Pith paper

  1. [1]

    Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiying Zhang. 2024. InferCept: Efficient Intercept Support for Augmented Large Language Model Inference. arXiv:2402.01869 [cs.LG]https: //arxiv.org/abs/2402.01869

  2. [2]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ram- jee. 2024. Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve}. In18th USENIX symposium on operating systems design and implementation (OSDI 24). 117–134

  3. [3]

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  4. [4]

    Anthropic. 2025. Best Practices for Claude Code: Subagents and Paral- lel Exploration.https://code.claude.com/docs/en/best-practices. Accessed: 2026-05-14

  5. [5]

    Anthropic. 2025. Claude Code.https://www.anthropic.com/claude- code. Accessed: 2026-05-13

  6. [6]

    Anthropic. 2025. Claude Code Model Configuration.https://code. claude.com/docs/en/model-config. Accessed: 2026-05-14

  7. [7]

    Anysphere. 2024. Cursor: The AI Code Editor.https://cursor.com/. Accessed: 2026-05-13

  8. [8]

    Anysphere. 2025. Subagents in Cursor.https://cursor.com/docs/ subagents. Accessed: 2026-05-14

  9. [9]

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al . 2025. Why do multi-agent llm systems fail?Advances in Neural Information Processing Systems38 (2025)

  10. [10]

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. 2025. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941(2025)

  11. [11]

    Ishan Dhanani and Matej Kosec. 2026. Full-Stack Op- timizations for Agentic Inference with NVIDIA Dynamo. https://developer.nvidia.com/blog/full-stack-optimizations- for-agentic-inference-with-nvidia-dynamo/. NVIDIA Technical Blog. Accessed: 2026-05-15

  12. [12]

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost- Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. InProceedings of the 2024 USENIX Annual Technical Conference (ATC)

  13. [13]

    In Gim, Zhiyao Ma, Seung-seob Lee, and Lin Zhong. 2025. Pie: A programmable serving system for emerging llm applications. InPro- ceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 415–430

  14. [14]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783(2024)

  15. [15]

    Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. 2024. Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity. arXiv preprint arXiv:2404.14527(2024)

  16. [16]

    Dongxin Guo, Jikun Wu, and Siu-Ming Yiu. 2026. SAGA: Workflow- Atomic Scheduling for AI Agent Inference on GPU Clusters.arXiv preprint arXiv:2605.00528(2026)

  17. [17]

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large lan- guage model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680(2024)

  18. [18]

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Steven Yau, Zijuan Lin, Liyang Zhou, et al . 2024. MetaGPT: Meta programming for a multi-agent collaborative framework. InInternational Conference on Learning Representations, V ol. 2024. 23247–23275

  19. [19]

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654(2024)

  20. [20]

    Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, and Yizhou Shan. 2024. MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool.arXiv preprint arXiv:2406.17565 (2024)

  21. [21]

    Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, and Minlan Yu. 2025. NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference. InProceedings of Machine Learning and Systems (MLSys)

  22. [22]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. InProceedings of the International Conference on Learning Representations (ICLR)

  23. [23]

    David S Johnson. 1974. Fast algorithms for bin packing.J. Comput. System Sci.8, 3 (1974), 272–314

  24. [24]

    Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, and Simran Arora

  25. [25]

    ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System.arXiv preprint arXiv:2602.13692(2026)

  26. [26]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 13

  27. [27]

    InProceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP)

    Efficient Memory Management for Large Language Model Serv- ing with PagedAttention. InProceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP). 611–626

  28. [28]

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. InProceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

  29. [29]

    Hanchen Li, Runyuan He, Qiuyang Mang, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph Gonzalez, and Ion Stoica. 2025. Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live.arXiv preprint arXiv:2511.02230(2025)

  30. [30]

    Gonzalez

    Hanchen Li, Runyuan He, Qizheng Zhang, Changxiu Ji, Qiuyang Mang, Xiaokun Chen, Lakshya A Agrawal, Wei-Liang Liao, Eric Yang, Alvin Cheung, James Zou, Kunle Olukotun, Ion Stoica, and Joseph E. Gonzalez. 2026. Combee: Scaling Prompt Learning for Self-Improving Language Model Agents. arXiv:2604.04247 [cs.AI] https://arxiv.org/abs/2604.04247

  31. [31]

    Hanchen Li, Yuhan Liu, Yihua Cheng, Kuntai Du, and Junchen Jiang

  32. [32]

    arXiv:2503.14647 [cs.NI]https://arxiv

    Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache. arXiv:2503.14647 [cs.NI]https://arxiv. org/abs/2503.14647

  33. [33]

    Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient serving of {LLM-based} applications with semantic variable. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 929–945

  34. [34]

    Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaot- ing Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, et al

  35. [35]

    Lmcache: An efficient KV cache layer for enterprise-scale LLM inference.arXiv preprint arXiv:2510.09665(2025)

  36. [36]

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Anantha- narayanan, et al. 2024. Cachegen: Kv cache compression and stream- ing for fast large language model serving. InProceedings of the ACM SIGCOMM 2024 Conference. 38–56

  37. [37]

    Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E Gonzalez, and Ion Stoica. 2025. Autellix: An efficient serving engine for llm agents as general programs.arXiv preprint arXiv:2502.13965 (2025)

  38. [38]

    Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2025. Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

  39. [39]

    NVIDIA. 2023. NVIDIA H100 Tensor Core GPU Architec- ture Whitepaper.https://resources.nvidia.com/en-us-hopper- architecture/nvidia-h100-tensor-c. Accessed: 2026-05-13

  40. [40]

    OpenAI. 2025. Codex CLI: OpenAI’s Coding Agent in the Terminal. https://github.com/openai/codex. Accessed: 2026-05-13

  41. [41]

    OpenAI. 2026. Why we no longer evaluate SWE-bench Veri- fied.https://openai.com/index/why-we-no-longer-evaluate-swe- bench-verified/. Accessed: 2026-05-11

  42. [42]

    Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, and Yufei Ding. 2025. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows. InAdvances in Neural Information Processing Systems (NeurIPS)

  43. [43]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Í nigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. InProceedings of the 51st Annual International Symposium on Computer Architecture (ISCA)

  44. [44]

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. 2024. Chatdev: Communicative agents for software development. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). 15174–15186

  45. [45]

    Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2024. Mooncake: A KVCache- Centric Disaggregated Architecture for LLM Serving.arXiv preprint arXiv:2407.00079(2024)

  46. [46]

    SGLang. 2026. SGLang Model Gateway.https://docs.sglang.io/docs/ advanced_features/sgl_model_gateway. SGLang Documentation. Accessed: 2026-05-15

  47. [47]

    Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need.arXiv preprint arXiv:1911.02150(2019)

  48. [48]

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems36 (2023), 38154–38180

  49. [49]

    Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y . Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. InProceedings of the 40th International Conference on Machine L...

  50. [50]

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal rein- forcement learning.Advances in neural information processing systems 36 (2023), 8634–8652

  51. [51]

    Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. 2025. Preble: Efficient distributed prompt scheduling for llm serving. InInternational conference on learning representations, V ol. 2025. 37057–37082

  52. [52]

    Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, and Yuqing Yang. 2026. Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution.arXiv preprint arXiv:2603.18897(2026)

  53. [53]

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic scheduling for large language model serving. In18th USENIX symposium on operating systems design and implementation (OSDI 24). 173–191

  54. [54]

    Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. 2024. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Infer- ence.arXiv preprint arXiv:2410.21465(2024)

  55. [55]

    Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. 2024. Teola: Towards end-to-end optimization of llm-based applications.arXiv preprint arXiv:2407.00326(2024)

  56. [56]

    Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. 2025. Towards End-to-End Optimization of LLM-based Applications with Ayo. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(Rotterdam, Netherlands)(ASPLOS ’25). Association for Computing Machinery, New York, NY , USA, 130...

  57. [57]

    Gomez, Łukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. At- tention Is All You Need. InAdvances in Neural Information Processing Systems (NeurIPS). 5998–6008

  58. [58]

    Noppanat Wadlom, Junyi Shen, and Yao Lu. 2026. Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective.arXiv preprint arXiv:2603.16104(2026)

  59. [59]

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, 14 et al. 2025. Openhands: An open platform for ai software developers as generalist agents. InInternational Conference on Learning Repre- sentations, V ol. 2025. 65882–65919

  60. [60]

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. Autogen: Enabling next-gen LLM applications via multi-agent conver- sations. InFirst conference on language modeling

  61. [61]

    Jackson, Zhifei Li, Jiarong Xing, Scott Shenker, and Ion Stoica

    Tian Xia, Ziming Mao, Jamison Kerney, Ethan J. Jackson, Zhifei Li, Jiarong Xing, Scott Shenker, and Ion Stoica. 2026. SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference. In Proceedings of the 21st European Conference on Computer Systems (McEwan Hall/The University of Edinburgh, Edinburgh, Scotland UK) (EUROSYS ’26). Association for C...

  62. [62]

    Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, and Xin Jin. 2025. ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production. InProceedings of the 23rd USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI)

  63. [63]

    Zhiqiang Xie. 2025. SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends.https://www.lmsys.org/blog/ 2025-09-10-sglang-hicache/

  64. [64]

    Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, and Ion Stoica. 2024. Pie: Pooling CPU Memory for LLM Inference. arXiv:2411.09317 [cs.LG] https://arxiv.org/abs/2411.09317

  65. [65]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)

  66. [66]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, et al . 2024. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115(2024)

  67. [67]

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent- computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems37 (2024), 50528–50652

  68. [68]

    Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. Cacheblend: Fast large language model serving for rag with cached knowledge fusion. In Proceedings of the twentieth European conference on computer systems. 94–109

  69. [69]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models.arXiv preprint arXiv:2210.03629(2023)

  70. [70]

    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. 2025. Flashinfer: Efficient and customizable attention engine for llm inference serving.Proceedings of Machine Learning and Systems7 (2025)

  71. [71]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. InProceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 521–538

  72. [72]

    Gonzalez, Clark Barrett, and Ying Sheng

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Ef- ficient Execution of Structured Language Model Programs. InAdvances in Neural Information Processing Systems (NeurIPS)

  73. [73]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. 2024. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serv- ing. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 193–210

  74. [74]

    Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, et al

  75. [75]

    In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25)

    {NanoFlow}: Towards optimal large language model serving throughput. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). 749–765. 15