pith. machine review for the scientific record.

arxiv: 2605.06472 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 12:36 UTC · model grok-4.3

classification 💻 cs.LG
keywords KV cache management · dynamic agent workflows · LLM serving · prediction-based optimization · cache reuse · inference acceleration · multi-agent systems

The pith

PBKV predicts future agent calls in dynamic workflows to decide which KV cache entries to keep, achieving up to 1.85× speedup over LRU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agent workflows often share context across agents, enabling KV cache reuse to reduce redundant computation during inference. Existing cache management either operates at the agent level without workflow awareness or assumes fixed agent sequences that do not match real dynamic workflows. PBKV addresses this by predicting upcoming agent invocations through a combination of historical workflow patterns and the current task context. These predictions inform estimates of cache entry reuse potential, allowing the system to retain valuable entries in limited GPU memory while using conservative strategies to handle prediction inaccuracies. Experiments on three workflow benchmarks show up to 1.85× speedup over LRU on dynamic workflows and up to 1.26× over KVFlow on a static workflow.
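
To make the retention logic concrete, here is a minimal Python sketch of prediction-based eviction with a conservative fallback. Everything in it is assumed for illustration (the CacheEntry layout, the gamma discount, the confidence threshold); the paper's actual data structures and scoring are not specified in the material above.

from dataclasses import dataclass, field
import time

@dataclass
class CacheEntry:
    agent: str            # agent whose cached prefix this KV block serves
    size_bytes: int
    last_access: float = field(default_factory=time.monotonic)

def reuse_score(entry, predicted_probs, gamma=0.8):
    """Estimate reuse potential from K-step agent-call predictions.

    predicted_probs holds one dict per future step k, mapping agent name
    to the predicted probability it is invoked at that step. Later steps
    are discounted by gamma**k to stay conservative about compounding
    prediction error.
    """
    return sum((gamma ** k) * step.get(entry.agent, 0.0)
               for k, step in enumerate(predicted_probs))

def evict_one(cache, predicted_probs, min_confidence=0.2):
    """Pick a victim: prediction-guided when the predictor is confident,
    plain LRU otherwise (the conservative fallback)."""
    top1 = max(predicted_probs[0].values(), default=0.0) if predicted_probs else 0.0
    if top1 < min_confidence:
        return min(cache, key=lambda e: e.last_access)  # degrade to LRU
    return min(cache, key=lambda e: reuse_score(e, predicted_probs))

The fallback branch is the point: when the predictor is unsure, the policy degrades to LRU rather than acting on a weak signal.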

Core claim

For each workflow, PBKV predicts the agent invocations in several future steps by fusing the guidance from historical workflows and context of the target workflow. Based on the predictions, PBKV estimates the reuse potential of cache entries and keeps the high-potential entries in GPU memory. To be robust to prediction errors, PBKV utilizes the predictions conservatively during both cache eviction and prefetching.

What carries the argument

Prediction-based estimation of cache reuse potential from fused historical and contextual data, which drives conservative eviction and prefetching to retain high-value KV entries.
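
One plausible reading of the conservative side, sketched below with hypothetical names and thresholds (the paper's exact rules are not reproduced in the material above): act on a prediction only when it is confident, and prefetch only into idle GPU memory, so a misprediction never displaces a known-valuable active entry.

def maybe_prefetch(predicted_probs, gpu_free_bytes, entry_size_bytes,
                   threshold=0.7):
    """Prefetch the predicted next agent's KV prefix only if (a) the top-1
    probability clears a confidence threshold and (b) idle GPU memory can
    hold it without evicting anything active."""
    if not predicted_probs or not predicted_probs[0]:
        return None
    agent, prob = max(predicted_probs[0].items(), key=lambda kv: kv[1])
    if prob >= threshold and gpu_free_bytes >= entry_size_bytes:
        return agent  # caller loads this agent's prefix from host memory
    return None       # do nothing rather than gamble on a misprediction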

Load-bearing premise

Fusing historical workflow data with target context produces predictions accurate enough to yield net cache-reuse gains even after accounting for prediction errors and conservative fallback rules.

What would settle it

A benchmark where agent-invocation predictions are consistently inaccurate, and where the conservative fallbacks nonetheless leave cache performance at or below plain LRU eviction.
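
A toy harness for that test is easy to sketch (purely illustrative: an agent-granularity cache model and a pure-noise predictor, none of it from the paper). The skewed popularity gives LRU a recency signal that noise-driven scores cannot match, so the naive prediction-based policy, which here lacks any fallback, can land below LRU; a conservative implementation should close that gap.

import random

def simulate(policy, trace, agents, cache_size, seed=0):
    rng = random.Random(seed)
    cache, hits = [], 0
    for agent in trace:
        if agent in cache:
            hits += 1
            cache.remove(agent)           # refresh recency (front = oldest)
        elif len(cache) >= cache_size:
            if policy == "lru":
                cache.pop(0)              # evict least-recently-used
            else:                         # "pred": predictor is pure noise
                scores = {a: rng.random() for a in cache}
                cache.remove(min(cache, key=scores.get))
        cache.append(agent)
    return hits / len(trace)

agents = [f"agent{i}" for i in range(8)]
gen = random.Random(1)
trace = gen.choices(agents, weights=[2.0 ** -i for i in range(8)], k=5000)
print("LRU hit rate:        ", simulate("lru", trace, agents, cache_size=4))
print("noisy-predictor rate:", simulate("pred", trace, agents, cache_size=4))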

Figures

Figures reproduced from arXiv: 2605.06472 by Binhang Yuan, Fangcheng Fu, Hao Wang, Haoyu Zheng, Jiawei Jiang, Jia Wu, Xiao Yan, Yongqiang Zhang, Yuanyuan Zhu.

Figure 1
Figure 1: A call graph for the code-generation task. The Tester conditionally triggers a retry path through Analyzer and Coder, i.e., a retry loop.
Figure 3
Figure 3: Architecture of the predictor. It fuses a topology-aware agent embedding from GraphSAGE (hcur), an attention-based workflow prefix summary (hpath), and a semantic signal reused from prefill (htxt), then jointly predicts the next K agent probability distributions via an MLP.
Figure 4
Figure 4: Computing the cross-workflow reuse score. For each active workflow w accessing cache node c, the K-step (K=3 here) predictor outputs per-step access probabilities, which are weighted by the survival probability s^(k) and the confidence factor γ^(k−1) and summed across m workflows as Score(c).
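
Reading that caption literally, the lookahead score plausibly takes the form below (a reconstruction from the caption's wording, not a formula quoted from the paper):

\[ \mathrm{Score}(c) \;=\; \sum_{w=1}^{m} \sum_{k=1}^{K} \gamma^{k-1}\, s^{(k)}\, p_w^{(k)}(c) \]

where p_w^{(k)}(c) is workflow w's predicted probability of accessing cache node c at lookahead step k, s^{(k)} is the caption's survival probability (plausibly the chance the workflow is still active at step k), and γ^{k−1} down-weights later, less certain steps.
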
Figure 5
Figure 5: Sensitivity analysis of PBKV and its variants on HoVer + LangChain with Qwen3-32B.
Figure 6
Figure 6: KV-Cache hit rate of each policy over time on the HoVer + LangChain workload, served by …
Figure 7
Figure 7: Architecture of the predictor. It fuses a topology-aware agent embedding from GraphSAGE (hcur), an attention-based workflow prefix summary (hpath), and a semantic signal reused from prefill (htxt), then jointly predicts the next K agent probability distributions via an MLP head.
Figure 8
Figure 8: Top-1 prediction accuracy as a function of training-set size at horizons k=1, 2, 3. Our predictor leads every baseline at every training-set size shown.
Figure 9
Figure 9: Top-1 prediction accuracy as a function of the layer from which the prefill semantic signal htxt is extracted. The served LLM is Qwen3-32B, which exposes 64 transformer block outputs (ℓ = 1, …, 64) followed by the post-norm hidden state that is fed into the output head.
Figure 10
Figure 10: Average cache hit rate vs. confidence decay coefficient …
Figure 11
Figure 11: Cache hit rate of conservative and aggressive prefetching across different prefetching …
Figure 12
Figure 12: Cache hit rate of PBKV-HE, conservative prefetching, and aggressive prefetching across …
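
Figures 3 and 7 describe the predictor as fusing three signals, a GraphSAGE node embedding (hcur), an attention-pooled workflow-prefix summary (hpath), and a prefill hidden state (htxt), into an MLP that emits K per-step distributions over agents. Below is a minimal PyTorch sketch of that fusion; the dimensions, the learned pooling query, and the head sizes are all assumptions, since the paper's hyperparameters are not given above.

import torch
import torch.nn as nn

class FusionPredictor(nn.Module):
    def __init__(self, d_graph, d_path, d_txt, n_agents, k_steps, d_hid=256):
        super().__init__()
        # Learned query that attention-pools the prefix into hpath.
        self.query = nn.Parameter(torch.randn(1, 1, d_path))
        self.attn_pool = nn.MultiheadAttention(d_path, num_heads=4,
                                               batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_graph + d_path + d_txt, d_hid), nn.ReLU(),
            nn.Linear(d_hid, n_agents * k_steps),
        )
        self.n_agents, self.k_steps = n_agents, k_steps

    def forward(self, h_cur, prefix_embs, h_txt):
        # h_cur: (B, d_graph) GraphSAGE embedding of the current agent node
        # prefix_embs: (B, T, d_path) embeddings of the invoked-agent prefix
        # h_txt: (B, d_txt) semantic signal reused from the prefill pass
        q = self.query.expand(prefix_embs.size(0), -1, -1)
        h_path, _ = self.attn_pool(q, prefix_embs, prefix_embs)
        fused = torch.cat([h_cur, h_path.squeeze(1), h_txt], dim=-1)
        logits = self.mlp(fused).view(-1, self.k_steps, self.n_agents)
        return logits.softmax(dim=-1)  # K per-step distributions over agents
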
read the original abstract

LLM-based workflows compose specialized agents to execute complex tasks, and these agents usually share substantial context, allowing KV-Cache reuse to save computation. Existing approaches either manage KV-Cache at agent level and fail to exploit the reuse opportunities within workflows, or manage cache at the workflow level but assume that each workflow calls a static sequence of agents. However, practical workflows are typically dynamic, where the sequence of invoked agents and thus induced cache reuse opportunities depend on the context of each task. To serve such dynamic workflows efficiently, we build a system dubbed PBKV (Prediction-Based KV-Cache Management). For each workflow, PBKV predicts the agent invocations in several future steps by fusing the guidance from historical workflows and context of the target workflow. Based on the predictions, PBKV estimates the reuse potential of cache entries and keeps the high-potential entries in GPU memory. To be robust to prediction errors, PBKV utilizes the predictions conservatively during both cache eviction and prefetching. Experiments on three workflow benchmarks show that PBKV achieves up to 1.85× speedup over LRU on dynamic workflows, and up to 1.26× speedup over the SOTA baseline KVFlow on the static workflow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PBKV, a prediction-based KV-cache management system for serving dynamic LLM agent workflows. For each workflow, PBKV fuses historical workflow data with the target context to predict agent invocations over future steps, estimates cache-entry reuse potential from those predictions, and applies conservative rules during eviction and prefetching to remain robust to errors. On three workflow benchmarks, PBKV is reported to deliver up to 1.85× speedup versus LRU on dynamic workflows and 1.26× versus the SOTA baseline KVFlow on static workflows.

Significance. If the performance claims are reproducible and the speedups are shown to stem from the prediction mechanism rather than other factors, the work would address a practical gap in KV-cache management for context-dependent multi-agent LLM systems. The conservative handling of predictions is a sound design choice that could translate to reliable gains in production serving environments.

major comments (2)
  1. [Abstract] The central performance claims (1.85× over LRU, 1.26× over KVFlow) are presented without any quantitative data on prediction precision/recall, the error distribution across dynamic branches, or an ablation that isolates prediction quality from the conservative fallback rules. This information is load-bearing for the claim that fusing historical and target context produces net reuse gains after accounting for errors.
  2. [Experimental Evaluation] The experimental section (inferred from benchmark results) supplies no details on experimental controls, run-to-run variance, how prediction accuracy was measured, or data-exclusion criteria. Without these, the reported speedups cannot be verified and the weakest assumption, that predictions deliver net gains, remains untested.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly stated the prediction model (e.g., whether it is a simple heuristic or a learned component) and the exact definition of “reuse potential.”

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of prediction quality and experimental rigor, which we address below by committing to targeted revisions.

read point-by-point responses
  1. Referee: [Abstract] The central performance claims (1.85× over LRU, 1.26× over KVFlow) are presented without any quantitative data on prediction precision/recall, the error distribution across dynamic branches, or an ablation that isolates prediction quality from the conservative fallback rules. This information is load-bearing for the claim that fusing historical and target context produces net reuse gains after accounting for errors.

    Authors: We agree that explicit metrics on prediction quality are needed to fully substantiate the claims. In the revised manuscript we will add a dedicated subsection (Section 5.3) reporting precision/recall for the fused historical+target predictor on each benchmark, broken down by dynamic branch points, along with the distribution of prediction errors (false positives/negatives per future step). We will also include an ablation that disables the predictor entirely (relying only on the conservative eviction/prefetch rules) and compares it directly to full PBKV; this will isolate the incremental benefit of the predictions while confirming that the conservative rules prevent net losses from errors. These additions will be placed before the main speedup results so readers can verify that the reported 1.85× and 1.26× gains arise from the prediction mechanism. revision: yes

  2. Referee: [Experimental Evaluation] The experimental section (inferred from benchmark results) supplies no details on experimental controls, run-to-run variance, how prediction accuracy was measured, or data-exclusion criteria. Without these, the reported speedups cannot be verified and the weakest assumption, that predictions deliver net gains, remains untested.

    Authors: We acknowledge that the current experimental description is insufficient for reproducibility. In the revision we will expand Section 4 (Experimental Setup) with: (1) explicit controls (identical GPU hardware, fixed model checkpoints, same random seeds for workflow generation); (2) run-to-run variance reported as mean ± standard deviation over five independent runs per configuration; (3) precise definition of prediction accuracy measurement (exact match of predicted vs. ground-truth agent invocations extracted from the workflow traces at each step); and (4) data-exclusion criteria (only traces with malformed agent outputs were discarded; <2% of data). These details will be added alongside the existing benchmark descriptions, enabling direct verification that the speedups stem from the prediction-driven cache decisions rather than uncontrolled factors. revision: yes

Circularity Check

0 steps flagged

No circularity: predictions use external historical data and empirical speedups are independent of derivation

full rationale

The paper's core mechanism fuses historical workflow data (external to the current execution) with target context to predict future agent invocations, then applies conservative eviction/prefetch rules based on those predictions. No equations, fitted parameters, or self-citations are shown that would make the reported speedups (1.85× over LRU, 1.26× over KVFlow) reduce by construction to the inputs or to a self-referential definition. The derivation chain remains self-contained because the prediction step is not tautological with the cache management outcomes, and experimental results are presented as measured outcomes rather than derived identities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the system implicitly relies on an unspecified prediction model whose accuracy is treated as sufficient for net benefit.

pith-pipeline@v0.9.0 · 5549 in / 1134 out tokens · 28202 ms · 2026-05-08T12:36:02.578350+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 13 canonical work pages · 2 internal anchors

  1. [1]

    LangChain

    Harrison Chase. LangChain. https://github.com/langchain-ai/langchain, 2022. Software.

  2. [2]

    Autogen: Enabling next-gen LLM applications via multi-agent conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen LLM applications via multi-agent conversation. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024. URL https://openreview.net/fo...

  3. [3]

    MetaGPT: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VtmBAGCN7o

  5. [5]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.

  6. [6]

    SGLang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024.

  7. [7]

    DualPath: Breaking the storage bandwidth bottleneck in agentic LLM inference

    Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, and Panpan Huang. DualPath: Breaking the storage bandwidth bottleneck in agentic LLM inference, 2026. URL https://arxiv.org/abs/2602.21548

  8. [8]

    LangGraph

    LangChain AI. LangGraph. https://github.com/langchain-ai/langgraph, 2024. Software.

  9. [9]

    Laszlo A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78–101, 1966.

  10. [10]

    KVFlow: Efficient prefix caching for accelerating LLM-based multi-agent workflows

    Zaifeng Pan, Ajjkumar Patel, Yipeng Shen, Zhengding Hu, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, and Yufei Ding. KVFlow: Efficient prefix caching for accelerating LLM-based multi-agent workflows. In The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=5Iw1nDtYmT

  11. [11]

    Autellix: An efficient serving engine for LLM agents as general programs

    Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E Gonzalez, et al. Autellix: An efficient serving engine for LLM agents as general programs. arXiv preprint arXiv:2502.13965, 2025.

  12. [12]

    Act while thinking: Accelerating LLM agents via pattern-aware speculative tool execution

    Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, and Yuqing Yang. Act while thinking: Accelerating LLM agents via pattern-aware speculative tool execution. arXiv preprint arXiv:2603.18897, 2026.

  13. [13]

    Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines

    Marcel Wagenländer, Otto White, Britannio Jarrett, Pedro Silvestre, Yanda Tao, Guo Li, Huanzhou Zhu, Llúis Vilanova, and Peter Pietzuch. Scepsy: Serving agentic workflows using aggregate LLM pipelines, 2026. URL https://arxiv.org/abs/2604.15186

  14. [14]

    Agentic uncertainty quantification

    Jiaxin Zhang, Prafulla Kumar Choubey, Kung-Hsiang Huang, Caiming Xiong, and Chien-Sheng Wu. Agentic uncertainty quantification. arXiv preprint arXiv:2601.15703, 2026.

  15. [15]

    Agentic confidence calibration

    Jiaxin Zhang, Caiming Xiong, and Chien-Sheng Wu. Agentic confidence calibration. arXiv preprint arXiv:2601.15778, 2026.

  16. [16]

    Agentrx: Diagnosing AI agent failures from execution trajectories

    Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, and Chetan Bansal. Agentrx: Diagnosing AI agent failures from execution trajectories. arXiv preprint arXiv:2602.02475, 2026.

  17. [17]

    Beyond black-box benchmarking: Observability, analytics, and optimization of agentic systems

    Dany Moshkovich, Hadar Mulian, Sergey Zeltyn, Natti Eder, Inna Skarbovsky, and Roy Abitbol. Beyond black-box benchmarking: Observability, analytics, and optimization of agentic systems. arXiv preprint arXiv:2503.06745, 2025

  18. [18]

    Taming uncertainty via automation: Observing, analyzing, and optimizing agentic ai systems

    Dany Moshkovich and Sergey Zeltyn. Taming uncertainty via automation: Observing, analyzing, and optimizing agentic AI systems. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 3840–3844. IEEE, 2025.

  19. [19]

    SGLang HiCache: Fast hierarchical KV caching with your favorite storage backends

    LMSYS Org and SGLang Team. SGLang HiCache: Fast hierarchical KV caching with your favorite storage backends. https://lmsys.org/blog/2025-09-10-sglang-hicache/, September 2025.

  20. [20]

    Inductive representation learning on large graphs

    Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30, 2017.

  21. [21]

    HoVer: A dataset for many-hop fact extraction and claim verification

    Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal. HoVer: A dataset for many-hop fact extraction and claim verification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3441–3460, 2020.

  22. [22]

    SWE-bench: Can language models resolve real-world GitHub issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

  23. [23]

    FinanceBench: A new benchmark for financial question answering

    Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. FinanceBench: A new benchmark for financial question answering. arXiv preprint arXiv:2311.11944, 2023.

  24. [24]

    CrewAI

    CrewAI. CrewAI. https://github.com/crewAIInc/crewAI, 2023. Software.

  25. [25]

    CONCUR: High-throughput agentic batch inference of LLM via congestion-based concurrency control

    Qiaoling Chen, Zhisheng Ye, Tian Tang, Peng Sun, Boyu Tian, Guoteng Wang, Shenggui Li, Yonggang Wen, Zhenhua Han, and Tianwei Zhang. CONCUR: High-throughput agentic batch inference of LLM via congestion-based concurrency control. arXiv preprint arXiv:2601.22705, 2026.

  26. [26]

    LMCache: An efficient KV cache layer for enterprise-scale LLM inference

    Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, et al. LMCache: An efficient KV cache layer for enterprise-scale LLM inference. arXiv preprint arXiv:2510.09665, 2025.

  27. [27]

    KVCOMM: Online cross-context KV-cache communication for efficient LLM-based multi-agent systems

    Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, and Yiran Chen. KVCOMM: Online cross-context KV-cache communication for efficient LLM-based multi-agent systems. In The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net...

  28. [28]

    CacheBlend: Fast large language model serving for RAG with cached knowledge fusion

    Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. CacheBlend: Fast large language model serving for RAG with cached knowledge fusion. In Proceedings of the Twentieth European Conference on Computer Systems, pages 94–109, 2025.

  29. [29]

    Continuum: Efficient and robust multi-turn LLM agent scheduling with KV cache time-to-live

    Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph E. Gonzalez, and Ion Stoica. Continuum: Efficient and robust multi-turn LLM agent scheduling with KV cache time-to-live. In ICLR 2026 Workshop on Lifelong Agents: Learning, Aligning, Evolving, 2026. URL https://openreview.net/forum?id=sVzK0LC9pn

  30. [30]

    Parrot: Efficient serving of LLM-based applications with semantic variable

    Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. Parrot: Efficient serving of LLM-based applications with semantic variable. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 929–945, 2024.

  31. [31]

    Towards end-to-end optimization of LLM-based applications with Ayo

    Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. Towards end-to-end optimization of LLM-based applications with Ayo. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2025. doi: 10.1145/3676641.3716278. URL https://doi.org/10.1145/3676641.3716278

  32. [32]

    Speculative actions: A lossless framework for faster AI agents

    Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, and Tianyi Peng. Speculative actions: A lossless framework for faster AI agents. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=P0GOk5wslg

  33. [33]

    Semi-supervised classification with graph convolutional networks

    Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=SJU4ayYgl

  34. [34]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  35. [35]

    An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

    Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.

  36. [36]

    Modeling relational data with graph convolutional networks

    Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.

  37. [37]

    Don't Stop Me Now: Embedding based scheduling for LLMs

    Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, and Michael Mitzenmacher. Don't Stop Me Now: Embedding based scheduling for LLMs. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=7JhGdZvW4T

  38. [38]

    Layer by layer: Uncovering hidden representations in language models

    Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Nikul Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. In International Conference on Machine Learning, pages 55854–55875. PMLR, 2025.

  39. [39]

    Improving LLM predictions via inter-layer structural encoders

    Tom Ulanovski, Eyal Blyachman, and Maya Bechler-Speicher. Improving LLM predictions via inter-layer structural encoders. In ICLR Workshop on Geometry-grounded Representation Learning and Generative Modeling, 2026.

  40. [40]

    Competitive caching with machine learned advice

    Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned advice. Journal of the ACM (JACM), 68(4):1–25, 2021

  41. [41]

    Near-optimal bounds for online caching with machine learned advice

    Dhruv Rohatgi. Near-optimal bounds for online caching with machine learned advice. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1834–1845. SIAM, 2020.

  42. [42]

    Algorithms with predictions

    Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. Communications of the ACM, 65(7):33–35, 2022.