Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI

Hanchen Li; Hao Kang; Ion Stoica; Tian Xia; Xiaokun Chen; Yifan Qiao; Yi Xu; Zhifei Li

arxiv: 2606.00866 · v1 · pith:TZNMFFYDnew · submitted 2026-05-30 · 💻 cs.OS

Tian Xia, Hanchen Li, Zhifei Li, Xiaokun Chen, Hao Kang, Yifan Qiao, Yi Xu, Ion Stoica

Pith reviewed 2026-06-28 17:28 UTC · model grok-4.3

classification 💻 cs.OS

keywords: agentic workloads · KV cache offloading · LLM serving · memory tiering · tool call scheduling · idleness ranking · GPU-CPU placement

The pith

MORI ranks agent programs by continuous relative idleness to assign KV cache between GPU HBM and CPU DRAM, matching hardware capacity ratios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that agentic workloads alternate between busy phases of rapid, short tool calls and idle phases dominated by long-running calls. Existing eviction policies such as LRU do not capture this structure, and a binary busy/idle label falls short because the ratio of busy to idle programs need not match the GPU-to-CPU memory capacity ratio. When the two ratios differ, one tier sits underutilized while the other is oversubscribed, which forces unnecessary evictions and KV cache transfers. MORI instead treats idleness as a relative spectrum: it ranks all active programs continuously, places the busiest on GPU and the most idle on CPU, and dynamically moves the partition boundary to fit the available hardware. On real coding-agent traces collected from Claude Code across four GPU and model pairs, it reports 20-71% higher throughput and 18-43% lower time-to-first-token (TTFT) than the best baseline with offloading.
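
The capacity mismatch described above can be made concrete with a toy count. The sketch below is illustrative only: it assumes every program's KV cache has the same size (one "slot"), and the function names and numbers are hypothetical rather than taken from the paper.

```python
def binary_placement(n_busy, n_idle, gpu_slots, cpu_slots):
    """Binary labels: busy programs may only sit on GPU, idle ones only on CPU."""
    on_gpu = min(n_busy, gpu_slots)
    on_cpu = min(n_idle, cpu_slots)
    unplaced = (n_busy - on_gpu) + (n_idle - on_cpu)
    return {"gpu_used": on_gpu, "cpu_used": on_cpu, "unplaced": unplaced}


def ranked_placement(n_busy, n_idle, gpu_slots, cpu_slots):
    """Relative ranking: fill GPU with the busiest programs, then CPU with the rest."""
    total = n_busy + n_idle
    on_gpu = min(total, gpu_slots)
    on_cpu = min(total - on_gpu, cpu_slots)
    return {"gpu_used": on_gpu, "cpu_used": on_cpu, "unplaced": total - on_gpu - on_cpu}


if __name__ == "__main__":
    # Hypothetical mix: 8 busy and 2 idle programs, 5 slots in each tier.
    print(binary_placement(8, 2, 5, 5))
    print(ranked_placement(8, 2, 5, 5))
```

With an 8:2 busy-to-idle mix and a 1:1 capacity ratio, the binary policy leaves three CPU slots empty while three busy programs have nowhere to go; the ranked policy fills both tiers.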

Core claim

MORI is an agent serving system built around a continuous, relative idleness ranking. It scores every active program from its recent tool-call pattern, assigns the busiest programs to GPU HBM and the most idle to CPU DRAM, and shifts the partition boundary on the fly so that placement matches the hardware's capacity ratio. Admission control at each tier prevents oversubscription, so differences in tool-call duration no longer cause wasteful migrations.
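
A minimal sketch of rank-then-partition placement follows. It is not the paper's algorithm, which this page notes is unspecified: the `Program` type, the precomputed `idleness` field, and the byte capacities are assumptions made for illustration. The boundary only moves downward, so a busier program never ends up in a lower tier than a more idle one.

```python
from dataclasses import dataclass


@dataclass
class Program:
    name: str
    idleness: float  # hypothetical score; higher means more idle
    kv_bytes: int    # size of the program's KV cache


def partition(programs, gpu_bytes, cpu_bytes):
    """Place programs busiest-first into GPU, then CPU, then a waiting list."""
    order = ["gpu", "cpu"]
    free = {"gpu": gpu_bytes, "cpu": cpu_bytes}
    placement = {"gpu": [], "cpu": [], "waiting": []}
    tier = 0  # index of the tier currently being filled
    for p in sorted(programs, key=lambda q: q.idleness):
        # Move the boundary down once the current tier cannot hold the next program.
        while tier < len(order) and p.kv_bytes > free[order[tier]]:
            tier += 1
        if tier == len(order):
            placement["waiting"].append(p.name)
        else:
            free[order[tier]] -= p.kv_bytes
            placement[order[tier]].append(p.name)
    return placement


if __name__ == "__main__":
    progs = [Program("a", 0.1, 40), Program("b", 0.5, 40),
             Program("c", 0.9, 40), Program("d", 2.0, 40)]
    print(partition(progs, gpu_bytes=80, cpu_bytes=40))
    print(partition(progs, gpu_bytes=120, cpu_bytes=40))
```

Changing only `gpu_bytes` moves the boundary without touching the scores, which is the sense in which the ranking is relative to the hardware.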

What carries the argument

Continuous relative idleness ranking across all programs, which dynamically sets the GPU-CPU partition boundary to match hardware capacity.

If this is right

Memory tiers stay balanced without one sitting idle while the other evicts.
KV cache placements remain stable across short tool calls inside a busy phase.
Admission control prevents oversubscription at either tier even when the busy-idle mix changes.
The same ranking works across different GPU sizes and model KV footprints without retuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The ranking could be combined with intra-tier eviction policies to further reduce migrations inside GPU or CPU.
Similar relative-spectrum placement might apply to other two-tier systems such as local versus remote storage in distributed serving.
Workloads without clear busy-idle phases would need an alternative signal to keep the ranking useful.

Load-bearing premise

Agentic programs exhibit a two-phase structure of rapid short calls and long-running calls that a single continuous idleness ranking can capture and align with hardware capacity ratios.

What would settle it

A workload trace in which replacing the relative idleness ranking with either LRU or a fixed binary label produces throughput and TTFT identical to MORI.

Figures

Figures reproduced from arXiv: 2606.00866 by Hanchen Li, Hao Kang, Ion Stoica, Tian Xia, Xiaokun Chen, Yifan Qiao, Yi Xu, Zhifei Li.

**Figure 1.** Overview of MORI. (1) MORI categorizes programs into a continuous idleness spectrum reflecting how likely each is to remain GPU-resident. (2) An adaptive offloading scheduler places busy programs (low idleness) in GPU HBM and idle programs (high idleness) in CPU DRAM, with admission control for both tiers. (3) In multi-replica deployments, an affinity-aware load balancer routes requests to engines that al…

**Figure 2.** Structure of an agentic program. A program alternates between inference steps (shaded) and tool-call gaps. The KV cache grows across steps due to prefix dependencies. Accompanying text: "… agentic workloads, however, a single user task produces a chain of model invocations connected by prefix dependencies. As illustrated in …"

**Figure 4.** Three examples of idle phases in agentic programs: (a) a long-running tool call such as a test suite or compilation, (b) waiting for human interaction (e.g., approval or feedback), and (c) spawning subagents and waiting for their results. Red dashed frames denote busy phases; green dashed frames denote idle phases. Note that in (c), the subagents themselves may be in busy phases, but the parent agent rema…

**Figure 6.** MORI three-tier queue architecture. Each replica maintains a GPU queue (HBM) and a CPU queue (DRAM); a global Waiting queue is shared across replicas. Programs are colored by their idleness (red = busy, green = idle). Idle programs are demoted from a higher tier to a lower one (GPU→CPU or CPU→Waiting), and busy programs are promoted from a lower tier to a higher one (Waiting→CPU or CPU→GPU).
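
The promotion and demotion paths in the Figure 6 caption can be sketched as a one-step move planner. This is an illustration under assumptions: the dictionaries, the `step_toward` name, and the rule that a program moves only one tier per round (inferred from the adjacent-tier transitions the caption lists) are not confirmed details of MORI.

```python
TIERS = ["gpu", "cpu", "waiting"]  # index 0 is the highest tier


def step_toward(current, target):
    """Return one-tier moves that bring each program closer to its target tier.

    current and target map program name -> tier name. Programs missing from
    current are treated as sitting in the waiting queue.
    """
    moves = []
    for prog, dst in sorted(target.items()):
        src = current.get(prog, "waiting")
        i, j = TIERS.index(src), TIERS.index(dst)
        if i == j:
            continue  # already in the right tier
        if j < i:
            moves.append((prog, "promote", src, TIERS[i - 1]))
        else:
            moves.append((prog, "demote", src, TIERS[i + 1]))
    return moves


if __name__ == "__main__":
    current = {"a": "gpu", "b": "cpu", "c": "waiting"}
    target = {"a": "cpu", "b": "gpu", "c": "gpu"}
    for move in step_toward(current, target):
        print(move)
```

Here program `c` is promoted only from Waiting to CPU in this round, even though its target is GPU; a second round would complete the move.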

**Figure 7.** End-to-end performance on H200 (80 GB) with DP=1 and Qwen-2.5 7B. Panels plot output throughput (tokens/s), steps/s, and average TTFT (s) against concurrency (20, 50, 80) for CPU:GPU=1:1 and CPU:GPU=2:1 …

**Figure 8.** End-to-end performance on H200 with DP=1 and Qwen-3 30B-A3B. Accompanying text (Section 6.2.1, Single Replica Scheduling with Memory Tiering): "We first present results for the single-replica (DP=1) configurations, which isolate the effect of phase-aware memory scheduling from load-balancing decisions across replicas."

**Figure 10.** End-to-end performance on H200 with DP=3 and Qwen-3 30B-A3B. Accompanying text: "… this by tracking KV cache residency across both GPU and CPU tiers: when a program returns from a tool call, the scheduler routes it back to the replica where its cache is stored, whether on GPU or in that replica's CPU memory, avoiding cross-replica migration. At 80 programs per replica with 2× CPU memory, MORI switches only 2.0% of programs (0…"
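
The routing rule quoted alongside Figure 10 (send a returning program back to the replica that holds its KV cache, on either tier) can be sketched as follows. The `residency` and `load` maps, the `route` name, and the least-loaded fallback for programs with no cached state are assumptions for illustration; the source does not say how such programs are placed.

```python
def route(program_id, residency, load):
    """Pick a replica for a program returning from a tool call.

    residency: program id -> replica that holds its KV cache (GPU or CPU tier).
    load: replica -> number of active programs (hypothetical load signal).
    """
    home = residency.get(program_id)
    if home is not None and home in load:
        return home  # stay where the cache lives; no cross-replica migration
    # Assumed fallback: least-loaded replica, ties broken by replica name.
    return min(load, key=lambda r: (load[r], r))


if __name__ == "__main__":
    residency = {"p1": "r2"}
    load = {"r1": 3, "r2": 9, "r3": 1}
    print(route("p1", residency, load))  # cached on r2, so it returns there
    print(route("p9", residency, load))  # no cache anywhere, least-loaded replica
```

Note that `p1` goes back to `r2` even though `r2` is the most loaded replica; affinity takes priority over load in this sketch.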

Original abstract

Modern LLM serving systems increasingly host agentic workloads, whose sessions issue tens of model invocations interleaved with tool calls, accumulating KV cache that can be reused across steps. As requests' total KV cache size easily exceeds GPU HBM capacity, researchers offload them to CPU DRAM. However, tool-call durations span orders of magnitude, and the cost of transferring KV cache between tiers makes it impractical to re-place entries on every call. We observe that agentic programs exhibit a two-phase structure: busy phases of rapid short tool calls and idle phases dominated by long-running calls. Current eviction policies such as LRU fail to capture this property. A binary busy/idle label also falls short because the ratio of busy to idle programs may not match the hardware's GPU-to-CPU capacity ratio. When it does not, one tier sits underutilized while the other is oversubscribed, wasting memory or forcing unnecessary evictions. We present MORI, an agent serving system that solves the above problem. Our key insight is that idleness is a continuous, relative spectrum. MORI ranks all active programs by idleness, assigns the busiest to GPU HBM and the most idle to CPU DRAM, dynamically shifts the partition boundary to match hardware capacity, and enforces admission control at each memory tier. Evaluated on real coding agent workloads collected from Claude Code across four GPU and model pairs, MORI delivers 20-71% higher throughput and 18-43% lower TTFT than the best baseline with offloading.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MORI's continuous idleness ranking for KV cache placement is the actual novelty and it reports clear gains on real agent traces, but the ranking computation is not described.

The punchline is that MORI ranks active programs on a continuous idleness spectrum rather than using LRU or binary busy/idle labels, then slides the GPU-CPU partition boundary to match capacity ratios. That is the concrete change from prior offloading work.

The paper does a good job naming the workload pattern: agent sessions alternate between short rapid tool calls and long-running ones, so fixed policies leave one memory tier idle while the other evicts. The evaluation uses real coding-agent traces from Claude Code on four GPU-model pairs and shows 20-71% throughput lift and 18-43% lower TTFT versus the best baseline with offloading. Those numbers are on actual traces, which counts as evidence.

The soft spot is exactly the one the stress-test note flags. The abstract gives no equation or algorithm for the idleness score, so it is unclear how the system estimates idleness for a tool call that has just started and whose duration is unknown. If the ranking uses only historical averages, the boundary will drift when call lengths vary, and the reported gains may depend on an unstated assumption of accurate forward estimates. Without that piece the central claim is hard to judge.
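
To make the concern concrete, here is one way a history-only score could treat a call whose duration is still unknown. It is a hypothetical construction, not the paper's method: the exponentially weighted moving average (EWMA), the `alpha` and `prior` parameters, and the rule of taking the larger of the historical estimate and the time already elapsed are all assumptions.

```python
def ewma(durations, alpha=0.5):
    """Exponentially weighted moving average of past tool-call durations (seconds)."""
    est = None
    for d in durations:
        est = d if est is None else alpha * d + (1 - alpha) * est
    return est


def idleness_score(past_durations, elapsed_in_current_call=0.0, alpha=0.5, prior=0.0):
    """History-only score: no lookahead into how long the current call will run.

    The in-flight call contributes only the time it has already been running,
    so a call that outlasts the program's history raises the score as it ages.
    """
    hist = ewma(past_durations, alpha)
    if hist is None:
        hist = prior  # no history yet; fall back to an assumed prior
    return max(hist, elapsed_in_current_call)


if __name__ == "__main__":
    print(idleness_score([1.0, 1.0, 1.0]))        # short calls, nothing in flight
    print(idleness_score([1.0, 1.0, 1.0], 30.0))  # current call has run for 30 s
```

Under this construction a program in a busy phase is only recognized as idle after its current call has already run long, which is the drift the letter worries about.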

This paper is for people who build or tune serving systems for agentic LLM workloads. A reader who already works on KV cache management or tiered memory will find the workload characterization and the empirical comparison useful. It deserves a serious referee because the problem is timely and the results are grounded in real traces, even though the method section will need expansion for reproducibility.

Referee Report

2 major / 1 minor

Summary. The manuscript presents MORI, a system for serving agentic LLM workloads that offloads KV cache between GPU HBM and CPU DRAM. It observes that such workloads exhibit busy phases of short tool calls and idle phases of long-running calls, and proposes ranking active programs on a continuous relative idleness spectrum to dynamically set the GPU/CPU partition boundary to match hardware capacity ratios while enforcing per-tier admission control. On real coding-agent workloads from Claude Code across four GPU/model pairs, it reports 20-71% higher throughput and 18-43% lower TTFT versus the best baseline that uses offloading.

Significance. If the central empirical claims hold after the design details are fully specified and the evaluation is expanded, the work would be significant for memory management in LLM serving systems. The insight that idleness is relative rather than binary, and the use of real agentic traces, are strengths; the approach could improve utilization when KV-cache footprints exceed HBM without requiring per-call migrations.

major comments (2)

[Design / Algorithm description (referenced in abstract)] The algorithm or formula used to compute the continuous relative idleness ranking (the load-bearing mechanism that is claimed to align program placement with hardware capacity ratios) is not provided. Without an explicit definition or pseudocode, it is impossible to determine whether the ranking can be computed from past observations alone or requires lookahead/oracle knowledge of tool-call durations, directly affecting the validity of the two-phase exploitation claim.
[Evaluation section] The evaluation reports aggregate throughput and TTFT gains but provides no description of the baseline offloading implementations, workload statistics (e.g., distribution of tool-call durations, number of programs, KV sizes), number of runs, or variance. This makes it impossible to assess whether the 20-71% and 18-43% improvements are robust or sensitive to post-hoc choices.

minor comments (1)

[Abstract] The abstract states performance numbers but does not indicate whether the reported gains include confidence intervals or are from single runs.
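
The run-to-run reporting the referee asks for could look like the sketch below. The function names are hypothetical, the sample values are invented placeholders rather than results from the paper, and the interval uses a normal approximation, which is loose for a handful of runs.

```python
import statistics


def relative_gain(baseline, treatment):
    """Percent improvement of treatment over baseline (for throughput-style metrics)."""
    return 100.0 * (treatment - baseline) / baseline


def mean_and_ci(samples, confidence=0.95):
    """Sample mean and the half-width of a normal-approximation confidence interval."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(samples)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * statistics.stdev(samples) / len(samples) ** 0.5
    return mean, half_width


if __name__ == "__main__":
    # Placeholder throughput numbers (tokens/s) from three hypothetical runs.
    baseline_runs = [100.0, 104.0, 96.0]
    system_runs = [128.0, 120.0, 124.0]
    b_mean, b_half = mean_and_ci(baseline_runs)
    s_mean, s_half = mean_and_ci(system_runs)
    print(f"baseline {b_mean:.1f} +/- {b_half:.1f}, system {s_mean:.1f} +/- {s_half:.1f}")
    print(f"gain {relative_gain(b_mean, s_mean):.1f}%")
```

Reporting the half-width next to each mean lets a reader see at a glance whether a headline gain exceeds the noise between runs.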

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify genuine gaps in the manuscript's presentation of the core algorithm and evaluation details. We will revise to address both.

Referee: [Design / Algorithm description (referenced in abstract)] The algorithm or formula used to compute the continuous relative idleness ranking (the load-bearing mechanism that is claimed to align program placement with hardware capacity ratios) is not provided. Without an explicit definition or pseudocode, it is impossible to determine whether the ranking can be computed from past observations alone or requires lookahead/oracle knowledge of tool-call durations, directly affecting the validity of the two-phase exploitation claim.

Authors: We agree the manuscript lacks an explicit formula or pseudocode for the relative idleness ranking. This omission prevents readers from verifying that the mechanism relies only on past observations. In the revision we will add a precise definition of the idleness score (derived from each program's observed tool-call duration history), the ranking procedure, the dynamic boundary adjustment logic, and pseudocode. The ranking uses only historical data; no lookahead or oracle information is required. revision: yes

Referee: [Evaluation section] The evaluation reports aggregate throughput and TTFT gains but provides no description of the baseline offloading implementations, workload statistics (e.g., distribution of tool-call durations, number of programs, KV sizes), number of runs, or variance. This makes it impossible to assess whether the 20-71% and 18-43% improvements are robust or sensitive to post-hoc choices.

Authors: We agree that the evaluation section is missing these required details. The current manuscript reports only aggregate results without describing the baseline implementations, workload characteristics, run counts, or variance. In the revision we will expand the evaluation section to include full descriptions of the baselines, workload statistics (tool-call duration distributions, program counts, KV-cache sizes), number of runs, and standard deviations or confidence intervals for the reported throughput and TTFT gains. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical system design and evaluation

The paper contains no equations, derivations, or fitted parameters, and no self-citation carries a step of the argument. Its core claims rest on an empirical observation of workload structure followed by a system implementation (MORI) whose performance is measured against baselines on real Claude Code workloads. The idleness-ranking insight is presented as a design principle without any mathematical formalization that could reduce to its own inputs. No load-bearing step reduces by construction to a prior result or fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that agentic programs exhibit distinguishable busy and idle phases that admit a useful relative ranking; no free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption: Agentic programs exhibit a two-phase structure: busy phases of rapid short tool calls and idle phases dominated by long-running calls.
This observation is presented as the key insight enabling the relative idleness ranking and is required for the dynamic boundary adjustment to be effective.

pith-pipeline@v0.9.1-grok · 5832 in / 1291 out tokens · 27265 ms · 2026-06-28T17:28:56.715531+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Libra: Efficient Resource Management for Agentic RL Post-Training
cs.LG 2026-06 unverdicted novelty 4.0

Libra optimizes GPU allocation across rollout and training in agentic RL via an elastic hybrid pool and C-MLFQ scheduler based on tool-return causal signals, claiming up to 3.0x throughput and 2.5x faster reward conve...

Reference graph

Works this paper leans on

75 extracted references · 2 canonical work pages · cited by 1 Pith paper

[1]

Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiying Zhang. 2024. InferCept: Efficient Intercept Support for Augmented Large Language Model Inference. arXiv:2402.01869 [cs.LG]https: //arxiv.org/abs/2402.01869

arXiv 2024
[2]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ram- jee. 2024. Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve}. In18th USENIX symposium on operating systems design and implementation (OSDI 24). 117–134

2024
[3]

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2023
[4]

Anthropic. 2025. Best Practices for Claude Code: Subagents and Paral- lel Exploration.https://code.claude.com/docs/en/best-practices. Accessed: 2026-05-14

2025
[5]

Anthropic. 2025. Claude Code.https://www.anthropic.com/claude- code. Accessed: 2026-05-13

2025
[6]

Anthropic. 2025. Claude Code Model Configuration.https://code. claude.com/docs/en/model-config. Accessed: 2026-05-14

2025
[7]

Anysphere. 2024. Cursor: The AI Code Editor.https://cursor.com/. Accessed: 2026-05-13

2024
[8]

Anysphere. 2025. Subagents in Cursor.https://cursor.com/docs/ subagents. Accessed: 2026-05-14

2025
[9]

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al . 2025. Why do multi-agent llm systems fail?Advances in Neural Information Processing Systems38 (2025)

2025
[10]

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. 2025. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941(2025)

Pith/arXiv arXiv 2025
[11]

Ishan Dhanani and Matej Kosec. 2026. Full-Stack Op- timizations for Agentic Inference with NVIDIA Dynamo. https://developer.nvidia.com/blog/full-stack-optimizations- for-agentic-inference-with-nvidia-dynamo/. NVIDIA Technical Blog. Accessed: 2026-05-15

2026
[12]

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost- Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. InProceedings of the 2024 USENIX Annual Technical Conference (ATC)

2024
[13]

In Gim, Zhiyao Ma, Seung-seob Lee, and Lin Zhong. 2025. Pie: A programmable serving system for emerging llm applications. InPro- ceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 415–430

2025
[14]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783(2024)

Pith/arXiv arXiv 2024
[15]

Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. 2024. Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity. arXiv preprint arXiv:2404.14527(2024)

arXiv 2024
[16]

Dongxin Guo, Jikun Wu, and Siu-Ming Yiu. 2026. SAGA: Workflow- Atomic Scheduling for AI Agent Inference on GPU Clusters.arXiv preprint arXiv:2605.00528(2026)

Pith/arXiv arXiv 2026
[17]

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large lan- guage model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680(2024)

Pith/arXiv arXiv 2024
[18]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Steven Yau, Zijuan Lin, Liyang Zhou, et al . 2024. MetaGPT: Meta programming for a multi-agent collaborative framework. InInternational Conference on Learning Representations, V ol. 2024. 23247–23275

2024
[19]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654(2024)

Pith/arXiv arXiv 2024
[20]

Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, and Yizhou Shan. 2024. MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool.arXiv preprint arXiv:2406.17565 (2024)

arXiv 2024
[21]

Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, and Minlan Yu. 2025. NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference. InProceedings of Machine Learning and Systems (MLSys)

2025
[22]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. InProceedings of the International Conference on Learning Representations (ICLR)

2024
[23]

David S Johnson. 1974. Fast algorithms for bin packing.J. Comput. System Sci.8, 3 (1974), 272–314

1974
[24]

Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, and Simran Arora
[25]

ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System.arXiv preprint arXiv:2602.13692(2026)

arXiv 2026
[26]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 13
[27]

InProceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP)

Efficient Memory Management for Large Language Model Serv- ing with PagedAttention. InProceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP). 611–626
[28]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. InProceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

2024
[29]

Hanchen Li, Runyuan He, Qiuyang Mang, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph Gonzalez, and Ion Stoica. 2025. Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live.arXiv preprint arXiv:2511.02230(2025)

Pith/arXiv arXiv 2025
[30]

Gonzalez

Hanchen Li, Runyuan He, Qizheng Zhang, Changxiu Ji, Qiuyang Mang, Xiaokun Chen, Lakshya A Agrawal, Wei-Liang Liao, Eric Yang, Alvin Cheung, James Zou, Kunle Olukotun, Ion Stoica, and Joseph E. Gonzalez. 2026. Combee: Scaling Prompt Learning for Self-Improving Language Model Agents. arXiv:2604.04247 [cs.AI] https://arxiv.org/abs/2604.04247

Pith/arXiv arXiv 2026
[31]

Hanchen Li, Yuhan Liu, Yihua Cheng, Kuntai Du, and Junchen Jiang
[32]

arXiv:2503.14647 [cs.NI]https://arxiv

Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache. arXiv:2503.14647 [cs.NI]https://arxiv. org/abs/2503.14647

arXiv
[33]

Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient serving of {LLM-based} applications with semantic variable. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 929–945

2024
[34]

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaot- ing Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, et al
[35]

Lmcache: An efficient KV cache layer for enterprise-scale LLM inference.arXiv preprint arXiv:2510.09665(2025)

arXiv 2025
[36]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Anantha- narayanan, et al. 2024. Cachegen: Kv cache compression and stream- ing for fast large language model serving. InProceedings of the ACM SIGCOMM 2024 Conference. 38–56

2024
[37]

Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E Gonzalez, and Ion Stoica. 2025. Autellix: An efficient serving engine for llm agents as general programs.arXiv preprint arXiv:2502.13965 (2025)

arXiv 2025
[38]

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2025. Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

2025
[39]

NVIDIA. 2023. NVIDIA H100 Tensor Core GPU Architec- ture Whitepaper.https://resources.nvidia.com/en-us-hopper- architecture/nvidia-h100-tensor-c. Accessed: 2026-05-13

2023
[40]

OpenAI. 2025. Codex CLI: OpenAI’s Coding Agent in the Terminal. https://github.com/openai/codex. Accessed: 2026-05-13

2025
[41]

OpenAI. 2026. Why we no longer evaluate SWE-bench Veri- fied.https://openai.com/index/why-we-no-longer-evaluate-swe- bench-verified/. Accessed: 2026-05-11

2026
[42]

Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, and Yufei Ding. 2025. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows. InAdvances in Neural Information Processing Systems (NeurIPS)

2025
[43]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Í nigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. InProceedings of the 51st Annual International Symposium on Computer Architecture (ISCA)

2024
[44]

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. 2024. Chatdev: Communicative agents for software development. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). 15174–15186

2024
[45]

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2024. Mooncake: A KVCache- Centric Disaggregated Architecture for LLM Serving.arXiv preprint arXiv:2407.00079(2024)

arXiv 2024
[46]

SGLang. 2026. SGLang Model Gateway.https://docs.sglang.io/docs/ advanced_features/sgl_model_gateway. SGLang Documentation. Accessed: 2026-05-15

2026
[47]

Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need.arXiv preprint arXiv:1911.02150(2019)

Pith/arXiv arXiv 2019
[48]

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems36 (2023), 38154–38180

2023
[49]

Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y . Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. InProceedings of the 40th International Conference on Machine L...

2023
[50]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal rein- forcement learning.Advances in neural information processing systems 36 (2023), 8634–8652

2023
[51]

Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. 2025. Preble: Efficient distributed prompt scheduling for llm serving. InInternational conference on learning representations, V ol. 2025. 37057–37082

2025
[52]

Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, and Yuqing Yang. 2026. Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution.arXiv preprint arXiv:2603.18897(2026)

Pith/arXiv arXiv 2026
[53]

Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic scheduling for large language model serving. In18th USENIX symposium on operating systems design and implementation (OSDI 24). 173–191

2024
[54]

Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. 2024. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Infer- ence.arXiv preprint arXiv:2410.21465(2024)

arXiv 2024
[55]

Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. 2024. Teola: Towards end-to-end optimization of llm-based applications.arXiv preprint arXiv:2407.00326(2024)

arXiv 2024
[56]

Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. 2025. Towards End-to-End Optimization of LLM-based Applications with Ayo. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(Rotterdam, Netherlands)(ASPLOS ’25). Association for Computing Machinery, New York, NY , USA, 130...

work page doi:10.1145/3676641 2025
[57]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. At- tention Is All You Need. InAdvances in Neural Information Processing Systems (NeurIPS). 5998–6008

2017
[58]

Noppanat Wadlom, Junyi Shen, and Yao Lu. 2026. Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective.arXiv preprint arXiv:2603.16104(2026)

arXiv 2026
[59]

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, 14 et al. 2025. Openhands: An open platform for ai software developers as generalist agents. InInternational Conference on Learning Repre- sentations, V ol. 2025. 65882–65919

2025
[60]

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. Autogen: Enabling next-gen LLM applications via multi-agent conver- sations. InFirst conference on language modeling

2024
[61]

Jackson, Zhifei Li, Jiarong Xing, Scott Shenker, and Ion Stoica

Tian Xia, Ziming Mao, Jamison Kerney, Ethan J. Jackson, Zhifei Li, Jiarong Xing, Scott Shenker, and Ion Stoica. 2026. SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference. In Proceedings of the 21st European Conference on Computer Systems (McEwan Hall/The University of Edinburgh, Edinburgh, Scotland UK) (EUROSYS ’26). Association for C...

work page doi:10.1145/3767295.3769353 2026
[62]

Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, and Xin Jin. 2025. ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production. InProceedings of the 23rd USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI)

2025
[63]

Zhiqiang Xie. 2025. SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends. https://www.lmsys.org/blog/2025-09-10-sglang-hicache/

[64]

Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, and Ion Stoica. 2024. Pie: Pooling CPU Memory for LLM Inference. arXiv:2411.09317 [cs.LG] https://arxiv.org/abs/2411.09317

[65]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025).

[66]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024).

[67]

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652.

[68]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. Cacheblend: Fast large language model serving for rag with cached knowledge fusion. In Proceedings of the twentieth European conference on computer systems. 94–109

[69]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint arXiv:2210.03629 (2023).

[70]

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. 2025. Flashinfer: Efficient and customizable attention engine for llm inference serving. Proceedings of Machine Learning and Systems 7 (2025).

[71]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 521–538.

[72]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. In Advances in Neural Information Processing Systems (NeurIPS).

[73]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 193–210.

[74]

Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, et al. 2025. NanoFlow: Towards optimal large language model serving throughput. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). 749–765.
