pith. sign in

arxiv: 2605.27744 · v1 · pith:EIOSARSDnew · submitted 2026-05-26 · 💻 cs.AI

A Policy-Driven Runtime Layer for Agentic LLM Serving

Pith reviewed 2026-06-29 16:51 UTC · model grok-4.3

classification 💻 cs.AI
keywords agentic LLM servingruntime layermulti-agent systemsKV cachingagent identityCacheSageserving stackpolicy primitives
0
0 comments X

The pith

An agent runtime layer with observe, score, predict, and act primitives coordinates policies across agent frameworks and serving engines using agent identity as the shared coordinate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-agent LLM systems split knowledge between the framework above, which tracks agent identities and dispatch, and the engine below, which sees every event but lacks agent context. Cross-cutting policies such as prefix caching, batch shaping, fairness, and safety enforcement currently require one-off patches into one layer or the other. The paper claims this seam is best resolved by inserting a new agent runtime layer that exposes four primitives—observe, score, predict, act—into which policies plug, with agent identity as the linking coordinate. The abstraction is validated by mapping nine policies and by implementing CacheSage for KV caching across sessions, which learns the per-workload agent transition matrix online to drive survival-based eviction and prefetching. On five real multi-agent workloads the approach delivers 13 to 37 percentage point cache hit-rate gains, 12 to 29 percent lower mean TTFT, and 6 to 14 percent higher throughput.

Core claim

Inserting an agent runtime layer between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate, allows cross-cutting policies to be expressed uniformly instead of through separate patches into either neighbor; the claim is instantiated and measured on KV caching via CacheSage, which learns the agent transition matrix online for survival-based eviction and between-step prefetch.

What carries the argument

The agent runtime layer, which sits between the agent framework and serving engine and coordinates policies through the four primitives observe, score, predict, act using agent identity as the shared coordinate.

If this is right

  • Nine cross-cutting policies including prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, and safety enforcement can be expressed once inside the runtime layer.
  • KV caching across sessions can use an online-learned per-workload agent transition matrix to perform survival-based eviction and between-step prefetching.
  • Mean time-to-first-token can drop 12 to 29 percent and throughput can rise 6 to 14 percent on multi-agent workloads.
  • Cache hit rates can increase 13 to 37 percentage points over an unmodified serving stack.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The four primitives could serve as a stable interface that lets new policies be developed and tested without touching either the framework or engine code.
  • Agent identity as a first-class coordinate might allow the same runtime layer to support dynamic policy composition when workloads mix different agent types.
  • The separation of concerns could simplify auditing or rolling back individual policies in production without engine restarts.

Load-bearing premise

Cross-cutting policies that depend on both agent frameworks and serving engines are best solved by adding a new architectural layer rather than by continued point fixes in the existing layers.

What would settle it

A controlled experiment that reimplements the nine policies and CacheSage logic directly inside the framework or engine and measures whether it matches the reported cache hit-rate, TTFT, and throughput gains on the same five workloads.

Figures

Figures reproduced from arXiv: 2605.27744 by Chaeeun Kim, Liting Hu, Rui Zhang.

Figure 1
Figure 1. Figure 1: The seam between agent framework and serv [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Cross-session structure on five multi-agent [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: CacheSage instantiates the runtime layer for KV caching with three components: a transition learner (observe), a survival-probability scorer (score), and a cross-session prefetch hook (predict + act). content into the cache. It flows through the engine’s normal serving path, so no kernel or scheduler change is needed. 4 Evaluation Setup. Experiments are conducted on NVIDIA H100 80 GB GPU using Llama-3.1-8B… view at source ↗
Figure 4
Figure 4. Figure 4: Headline evaluation across five multi-agent workloads, comparing vLLM, Continuum, and [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
read the original abstract

Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, role, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one-off patch into one neighbor or the other. We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show +13 to +37 pp cache hit-rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that multi-agent LLM serving suffers from a seam between agent frameworks (which know identities and structure) and engines (which see events but lack agent context), leading to ad-hoc point fixes for cross-cutting policies like KV caching and batch shaping. It argues this is best solved by inserting a new agent runtime layer exposing observe/score/predict/act primitives with agent identity as the coordinate, maps nine policies onto it, and validates the abstraction via CacheSage (an online-learned agent transition matrix for survival-based KV eviction and prefetch) on five workloads, reporting +13 to +37 pp hit-rate gains, 12-29% lower TTFT, and 6-14% higher throughput versus an unmodified stack.

Significance. If the architectural argument holds, the layer could reduce fragmentation in policy implementation for agentic workloads. The mapping of nine policies and the concrete, online CacheSage instantiation provide a useful existence proof for the primitives; the reported gains on real workloads are a positive signal, though the evaluation remains preliminary.

major comments (1)
  1. [Abstract] Abstract: the central claim that the seam 'is best addressed by an architectural change rather than point fixes' is load-bearing for the recommendation yet unsupported by direct evidence. The only quantitative comparison is CacheSage versus an unmodified baseline; no equivalent policy implemented as a minimal patch into the engine (or framework) that could exploit agent identity is evaluated, leaving open whether the reported gains require the new layer or could be matched at lower integration cost.
minor comments (2)
  1. [Abstract] Abstract: workload descriptions, baseline engine/framework versions, error bars, and statistical tests are absent from the reported results, making it difficult to assess the reliability of the +13 to +37 pp hit-rate and throughput numbers.
  2. [Abstract] Abstract: the transition matrix is described as 'learned online' but no detail is given on its parameterization, update rule, or free parameters, which would clarify the scope of the 'parameter-free' aspects of the policy.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and the focus on the load-bearing claim in the abstract. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the seam 'is best addressed by an architectural change rather than point fixes' is load-bearing for the recommendation yet unsupported by direct evidence. The only quantitative comparison is CacheSage versus an unmodified baseline; no equivalent policy implemented as a minimal patch into the engine (or framework) that could exploit agent identity is evaluated, leaving open whether the reported gains require the new layer or could be matched at lower integration cost.

    Authors: We agree that the manuscript provides no direct head-to-head evaluation of an equivalent policy implemented as a minimal patch. The central argument is conceptual: agent identity, roles, and dispatch structure exist only in the framework, while engine events lack this coordinate, so any agent-aware policy must bridge the seam. A patch into the engine would require either (a) extending the framework-engine interface to carry agent context on every event (which recreates the runtime layer's role) or (b) ad-hoc duplication of agent logic inside the engine for each policy. The paper maps nine distinct policies (KV caching, batch shaping, speculative execution, fairness, tool-result memoization, safety, etc.) onto the same four primitives to illustrate that a single layer can host them without repeated interface changes. The CacheSage evaluation demonstrates that one such policy can be realized cleanly via the primitives and yields measurable gains, but we did not implement or benchmark the corresponding point-fix versions. Adding those comparisons would require substantial additional engineering outside the paper's scope and is not planned for revision. revision: no

Circularity Check

0 steps flagged

No circularity: architectural argument validated empirically on external workloads

full rationale

The paper's central claim is an architectural recommendation (insert runtime layer exposing observe/score/predict/act with agent identity as coordinate) rather than a mathematical derivation. Validation uses CacheSage, which learns the agent transition matrix online from real multi-agent workloads (external data). No equations, fitted predictions, or self-citations are shown that reduce any result to its own inputs by construction. The transition matrix and performance deltas (+13-37pp hit-rate, etc.) are measured against unmodified baselines on five workloads, satisfying the self-contained-against-external-benchmarks criterion.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; full text unavailable so ledger is limited to claims visible in the abstract.

free parameters (1)
  • agent transition matrix
    Learned online per workload for CacheSage eviction and prefetch decisions.
axioms (1)
  • domain assumption Many cross-cutting policies depend on both agent identities/roles and engine-level events.
    Stated as the core problem motivating the layer.
invented entities (1)
  • agent runtime layer no independent evidence
    purpose: Third tier exposing observe/score/predict/act primitives with agent identity as coordinate.
    New architectural component proposed to sit between framework and engine.

pith-pipeline@v0.9.1-grok · 5781 in / 1360 out tokens · 30251 ms · 2026-06-29T16:51:04.483945+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 16 canonical work pages · 9 internal anchors

  1. [1]

    Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiy- ing Zhang. 2024. Infercept: Efficient intercept support for augmented large language model inference.arXiv preprint arXiv:2402.01869 (2024)

  2. [2]

    Zhuohang Bian, Feiyang Wu, Teng Ma, and Youwei Zhuo. 2025. Token- cake: A KV-Cache-centric Serving Framework for LLM-based Multi- Agent Applications.arXiv preprint arXiv:2510.18586(2025)

  3. [3]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

  4. [4]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Hee- woo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168(2021)

  5. [5]

    Datadog. 2026. State of AI Engineering 2026. Industry report, https: //www.datadoghq.com/state-of-ai-engineering/

  6. [6]

    Yixin Dong, Charlie F Ruan, Yaxing Cai, Ziyi Xu, Yilong Zhao, Ruihang Lai, and Tianqi Chen. 2025. Xgrammar: Flexible and efficient structured generation engine for large language models.Proceedings of Machine Learning and Systems7 (2025)

  7. [7]

    Shaoting Feng, Yuhan Liu, Hanchen Li, Xiaokun Chen, Samuel Shen, Kuntai Du, Zhuohan Gu, Rui Zhang, Yuyang Huang, Yihua Cheng, et al. 2025. EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving.arXiv preprint arXiv:2512.14946(2025)

  8. [8]

    In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandel- wal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems6 (2024), 325–338

  9. [9]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

  10. [10]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300 (2020)

  11. [11]

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Steven Yau, Zijuan Lin, Liyang Zhou, et al. 2024. MetaGPT: Meta programming for a multi-agent collaborative framework. InInternational Conference on Learning Rep- resentations, Vol. 2024. 23247–23275

  12. [12]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

  13. [13]

    InProceedings of the 29th symposium on operating systems principles

    Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles. 611–626

  14. [14]

    LangChain. 2026. State of Agent Engineering 2026. Industry survey, https://www.langchain.com/state-of-agent-engineering

  15. [15]

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient generative inference of large language models with dynamic{KV}cache management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172

  16. [16]

    Hanchen Li, Runyuan He, Qiuyang Mang, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph Gonza- lez, and Ion Stoica. 2025. Continuum: Efficient and robust multi- turn llm agent scheduling with kv cache time-to-live.arXiv preprint arXiv:2511.02230(2025)

  17. [17]

    Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient serving of {LLM-based} applications with semantic variable. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 929–945

  18. [18]

    Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaot- ing Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, et al

  19. [19]

    Lmcache: An efficient KV cache layer for enterprise-scale LLM inference.arXiv preprint arXiv:2510.09665(2025)

  20. [20]

    Yuhan Liu, Yuyang Huang, Jiayi Yao, Shaoting Feng, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, et al

  21. [21]

    DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving.arXiv preprint arXiv:2411.02820(2024)

  22. [22]

    Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E Gonzalez, et al. 2025. Autellix: An efficient serving engine for llm agents as general programs.arXiv preprint arXiv:2502.13965(2025)

  23. [23]

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. 2024. Gaia: a benchmark for general ai assistants. In International Conference on Learning Representations, Vol. 2024. 9025– 9049

  24. [24]

    OpenAI. 2024. Swarm: Educational framework exploring ergonomic, lightweight multi-agent orchestration. https://github.com/openai/ swarm

  25. [25]

    Zaifeng Pan, AJJKUMAR DAHYALAL PATEL, Yipeng Shen, Zhengding Hu, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, and Yufei Ding. 2026. KVFlow: Efficient prefix caching for accelerating LLM-based multi-agent workflows.Advances in Neural Information Processing Systems38 (2026), 126246–126265

  26. [26]

    Yeonju Ro, Haoran Qiu, Íñigo Goiri, Rodrigo Fonseca, Ricardo Bian- chini, Aditya Akella, Zhangyang Wang, Mattan Erez, and Esha Choukse. 2025. Sherlock: Reliable and Efficient Agentic Workflow Execution.arXiv preprint arXiv:2511.00330(2025)

  27. [27]

    Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, and Michael Mitzenmacher. 2026. Fast inference for augmented large language models.Advances in Neural Information Processing Systems38 (2026), 71562–71591

  28. [28]

    Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. 2025. Preble: Efficient distributed prompt scheduling for llm serving. InInternational conference on learning representations, Vol. 2025. 37057–37082

  29. [29]

    Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, and Yuqing Yang. 2026. Act while thinking: Accelerating llm agents via pattern-aware speculative tool execution.arXiv preprint arXiv:2603.18897(2026)

  30. [30]

    Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, and Panpan Xu. 2025. VL- cache: Sparsity and modality-aware KV cache compression for vision- language model inference acceleration. InInternational Conference on Learning Representations, Vol. 2025. 219–239

  31. [31]

    Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Haoning Guan, Rui Qu, Maoliang Li, Xiang Chen, and Guojie Luo. 2025. Agent. xpu: Efficient scheduling of agentic llm workloads on heterogeneous soc. arXiv preprint arXiv:2506.24045(2025)

  32. [32]

    Brandon T Willard and Rémi Louf. 2023. Efficient guided generation for large language models.arXiv preprint arXiv:2307.09702(2023)

  33. [33]

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. Autogen: Enabling next-gen LLM applications via multi-agent conver- sations. InFirst conference on language modeling

  34. [34]

    Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. Cacheblend: 6 A Policy-Driven Runtime Layer for Agentic LLM Serving Fast large language model serving for rag with cached knowledge fusion. InProceedings of the twentieth European conference on computer systems. 94–109

  35. [35]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629(2022)

  36. [36]

    Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, et al. 2026. Kvcomm: Online cross-context kv-cache communi- cation for efficient llm-based multi-agent systems.Advances in Neural Information Processing Systems38 (2026), 17882–17928

  37. [37]

    Rui Zhang and Liting Hu. 2025. Enabling Fairness Across Multi-modal and Multi-agent Applications. In2025 IEEE International Conference on Edge Computing and Communications (EDGE). IEEE, 90–92

  38. [38]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhang- hao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems36 (2023), 46595– 46623

  39. [39]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured lan- guage model programs.Advances in neural information processing systems37 (2024), 62557–62583. 7