Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
Pith reviewed 2026-05-08 12:36 UTC · model grok-4.3
The pith
PBKV predicts future agent calls in dynamic workflows to decide which KV-cache entries to keep, achieving up to a 1.85× speedup over LRU.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For each workflow, PBKV predicts the agent invocations in several future steps by fusing the guidance from historical workflows and context of the target workflow. Based on the predictions, PBKV estimates the reuse potential of cache entries and keeps the high-potential entries in GPU memory. To be robust to prediction errors, PBKV utilizes the predictions conservatively during both cache eviction and prefetching.
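To make the mechanism concrete, below is a minimal sketch of what prediction-driven reuse scoring with a conservative eviction rule could look like. The paper does not publish its implementation, so every name here (reuse_potential, choose_victim, the min_conf threshold that decides when predictions may override LRU) is a hypothetical reconstruction from the abstract's description, not PBKV's actual code.

```python
# Hypothetical sketch of prediction-based KV-cache eviction, reconstructed
# from the abstract's description; names and the confidence-threshold rule
# are assumptions, not PBKV's published implementation.
from dataclasses import dataclass


@dataclass
class CacheEntry:
    agent: str       # agent whose shared prefix this KV block belongs to
    last_used: int   # step of last access, used by the LRU fallback


def reuse_potential(entry: CacheEntry,
                    predicted: list[tuple[str, float]]) -> float:
    """Score an entry by how soon, and how confidently, its agent is
    predicted to be invoked again; sooner and more confident means higher.

    `predicted` lists (agent, probability) for the next few steps."""
    score = 0.0
    for horizon, (agent, prob) in enumerate(predicted, start=1):
        if agent == entry.agent:
            score = max(score, prob / horizon)  # discount distant reuse
    return score


def choose_victim(cache: list[CacheEntry],
                  predicted: list[tuple[str, float]],
                  min_conf: float = 0.6) -> CacheEntry:
    """Conservative eviction: predictions may only *pin* entries whose
    reuse score clears min_conf; everything else is evicted by plain LRU,
    so a poor predictor degrades toward the LRU baseline."""
    scores = {id(e): reuse_potential(e, predicted) for e in cache}
    candidates = [e for e in cache if scores[id(e)] < min_conf]
    if not candidates:            # predictor pinned everything: fall back
        candidates = cache
    return min(candidates, key=lambda e: e.last_used)
```

The conservative element is the asymmetry: a confident prediction can only protect an entry, never force an eviction, so in the worst case this policy collapses to LRU rather than below it (prefetching carries the analogous risk on the bandwidth side).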
What carries the argument
Prediction-based estimation of cache reuse potential from fused historical and contextual data, which drives conservative eviction and prefetching to retain high-value KV entries.
Load-bearing premise
Fusing historical workflow data with target context produces predictions accurate enough to yield net cache-reuse gains even after accounting for prediction errors and conservative fallback rules.
What would settle it
A benchmark where agent-invocation predictions are consistently inaccurate enough that, even with the conservative fallbacks, cache performance ends up no better than (or worse than) plain LRU eviction.
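That settling experiment can be phrased as a small simulation: drive a fixed-size cache with a synthetic agent trace whose one-step "predictions" are corrupted at a controllable rate, and check whether the prediction-guided policy's hit rate ever falls below LRU's. Everything here (the trace generator, the noise model, the capacity of two agents) is an illustrative assumption, not the paper's benchmark.

```python
# Illustrative falsification harness, not the paper's benchmark: corrupt the
# next-agent predictions at rate `noise` and compare the hit rate of a
# prediction-guided policy against plain LRU on the same synthetic trace.
import random

AGENTS = ["planner", "coder", "tester", "critic"]
CAPACITY = 2  # cache holds KV prefixes for two agents at a time


def hit_rate(trace: list[str], use_pred: bool, noise: float,
             seed: int = 0) -> float:
    rng = random.Random(seed)
    cache: dict[str, int] = {}  # agent -> step of last use
    hits = 0
    for step, agent in enumerate(trace):
        if agent in cache:
            hits += 1
        elif len(cache) >= CAPACITY:
            # oracle one-step lookahead, wrong with probability `noise`
            nxt = trace[step + 1] if step + 1 < len(trace) else agent
            if rng.random() < noise:
                nxt = rng.choice(AGENTS)
            victims = [a for a in cache if not (use_pred and a == nxt)]
            victims = victims or list(cache)        # conservative fallback
            cache.pop(min(victims, key=cache.get))  # evict LRU among them
        cache[agent] = step
    return hits / len(trace)


gen = random.Random(1)
trace = [gen.choice(AGENTS) for _ in range(5000)]
for noise in (0.0, 0.5, 1.0):
    print(f"noise={noise:.1f}  "
          f"LRU={hit_rate(trace, False, noise):.3f}  "
          f"predicted={hit_rate(trace, True, noise):.3f}")
```

If the predicted column stays at or below the LRU column as noise grows, the prediction machinery is not paying for itself, which is exactly the failure mode the conservative rules are meant to cap.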
Original abstract
LLM-based workflows compose specialized agents to execute complex tasks, and these agents usually share substantial context, allowing KV-Cache reuse to save computation. Existing approaches either manage KV-Cache at the agent level and fail to exploit the reuse opportunities within workflows, or manage cache at the workflow level but assume that each workflow calls a static sequence of agents. However, practical workflows are typically dynamic, where the sequence of invoked agents, and thus the induced cache-reuse opportunities, depend on the context of each task. To serve such dynamic workflows efficiently, we build a system dubbed PBKV (Prediction-Based KV-Cache Management). For each workflow, PBKV predicts the agent invocations in several future steps by fusing the guidance from historical workflows and the context of the target workflow. Based on the predictions, PBKV estimates the reuse potential of cache entries and keeps the high-potential entries in GPU memory. To be robust to prediction errors, PBKV utilizes the predictions conservatively during both cache eviction and prefetching. Experiments on three workflow benchmarks show that PBKV achieves up to 1.85× speedup over LRU on dynamic workflows, and up to 1.26× speedup over the SOTA baseline KVFlow on the static workflow.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PBKV, a prediction-based KV-cache management system for serving dynamic LLM agent workflows. For each workflow, PBKV fuses historical workflow data with the target context to predict agent invocations over future steps, estimates cache-entry reuse potential from those predictions, and applies conservative rules during eviction and prefetching to remain robust to errors. On three workflow benchmarks, PBKV is reported to deliver up to 1.85× speedup versus LRU on dynamic workflows and 1.26× versus the SOTA baseline KVFlow on static workflows.
Significance. If the performance claims are reproducible and the speedups are shown to stem from the prediction mechanism rather than other factors, the work would address a practical gap in KV-cache management for context-dependent multi-agent LLM systems. The conservative handling of predictions is a sound design choice that could translate to reliable gains in production serving environments.
major comments (2)
- [Abstract] The central performance claims (1.85× over LRU, 1.26× over KVFlow) are presented without any quantitative data on prediction precision/recall, the error distribution across dynamic branches, or an ablation that isolates prediction quality from the conservative fallback rules. This information is load-bearing for the claim that fusing historical and target context produces net reuse gains after accounting for errors.
- [Experimental Evaluation] The experimental section (inferred from the benchmark results) supplies no details on experimental controls, run-to-run variance, how prediction accuracy was measured, or data-exclusion criteria. Without these, the reported speedups cannot be verified, and the weakest assumption, that predictions deliver net gains, remains untested.
minor comments (1)
- [Abstract] The abstract would be clearer if it briefly stated the prediction model (e.g., whether it is a simple heuristic or a learned component) and the exact definition of “reuse potential.”
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of prediction quality and experimental rigor, which we address below by committing to targeted revisions.
Point-by-point responses
- Referee: [Abstract] The central performance claims (1.85× over LRU, 1.26× over KVFlow) are presented without any quantitative data on prediction precision/recall, the error distribution across dynamic branches, or an ablation that isolates prediction quality from the conservative fallback rules. This information is load-bearing for the claim that fusing historical and target context produces net reuse gains after accounting for errors.
  Authors: We agree that explicit metrics on prediction quality are needed to fully substantiate the claims. In the revised manuscript we will add a dedicated subsection (Section 5.3) reporting precision/recall for the fused historical+target predictor on each benchmark, broken down by dynamic branch points, along with the distribution of prediction errors (false positives/negatives per future step). We will also include an ablation that disables the predictor entirely (relying only on the conservative eviction/prefetch rules) and compares it directly to full PBKV; this will isolate the incremental benefit of the predictions while confirming that the conservative rules prevent net losses from errors. These additions will be placed before the main speedup results so readers can verify that the reported 1.85× and 1.26× gains arise from the prediction mechanism. Revision: yes.
- Referee: [Experimental Evaluation] The experimental section (inferred from the benchmark results) supplies no details on experimental controls, run-to-run variance, how prediction accuracy was measured, or data-exclusion criteria. Without these, the reported speedups cannot be verified, and the weakest assumption, that predictions deliver net gains, remains untested.
  Authors: We acknowledge that the current experimental description is insufficient for reproducibility. In the revision we will expand Section 4 (Experimental Setup) with: (1) explicit controls (identical GPU hardware, fixed model checkpoints, same random seeds for workflow generation); (2) run-to-run variance reported as mean ± standard deviation over five independent runs per configuration; (3) a precise definition of the prediction-accuracy measurement (exact match of predicted vs. ground-truth agent invocations extracted from the workflow traces at each step); and (4) data-exclusion criteria (only traces with malformed agent outputs were discarded; <2% of the data). These details will be added alongside the existing benchmark descriptions, enabling direct verification that the speedups stem from the prediction-driven cache decisions rather than uncontrolled factors. Revision: yes.
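The accuracy definition promised in point (3) is simple enough to pin down in code. A minimal sketch, assuming traces are stored as per-step lists of invoked agent names; the function names, the trace format, and the placeholder numbers are ours, not the authors'.

```python
# Sketch of the committed accuracy measurement: per-step exact match between
# predicted and ground-truth agent invocations, aggregated as mean ± standard
# deviation over independent runs. The trace format is an assumption.
import statistics


def stepwise_accuracy(predicted: list[str], truth: list[str]) -> float:
    """Fraction of steps whose predicted agent exactly matches the trace."""
    n = min(len(predicted), len(truth))
    return sum(p == t for p, t in zip(predicted, truth)) / n if n else 0.0


def mean_pm_std(runs: list[float]) -> str:
    """Format per-run accuracies as mean ± sample standard deviation."""
    mean = statistics.mean(runs)
    std = statistics.stdev(runs) if len(runs) > 1 else 0.0
    return f"{mean:.3f} ± {std:.3f}"


# Placeholder per-run accuracies, for illustration only (five runs).
print(mean_pm_std([0.81, 0.79, 0.84, 0.80, 0.82]))
```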
Circularity Check
No circularity: the predictions draw on external historical data, and the empirical speedups are measured outcomes, not consequences of the method's own definitions.
Full rationale
The paper's core mechanism fuses historical workflow data (external to the current execution) with the target context to predict future agent invocations, then applies conservative eviction/prefetch rules based on those predictions. No equations, fitted parameters, or self-citations are shown that would make the reported speedups (1.85× over LRU, 1.26× over KVFlow) reduce by construction to the inputs or to a self-referential definition. The chain is not circular because the prediction step is not tautologically tied to the cache-management outcomes, and the experimental results are presented as measured quantities rather than derived identities.
Reference graph
Works this paper leans on
- [1] Harrison Chase. LangChain. https://github.com/langchain-ai/langchain, 2022. Software.
- [2] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024. URL https://openreview.net/fo...
- [3] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VtmBAGCN7o
- [4] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- [5] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H. Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, et al. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024.
- [6] Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, and Panpan Huang. DualPath: Breaking the storage bandwidth bottleneck in agentic LLM inference, 2026. URL https://arxiv.org/abs/2602.21548
- [7] LangChain AI. LangGraph. https://github.com/langchain-ai/langgraph, 2024. Software.
- [8] Laszlo A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78–101, 1966.
- [9] Zaifeng Pan, Ajjkumar Patel, Yipeng Shen, Zhengding Hu, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, and Yufei Ding. KVFlow: Efficient prefix caching for accelerating LLM-based multi-agent workflows. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=5Iw1nDtYmT
- [10] Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, et al. Autellix: An efficient serving engine for LLM agents as general programs. arXiv preprint arXiv:2502.13965, 2025.
- [11] Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, and Yuqing Yang. Act while thinking: Accelerating LLM agents via pattern-aware speculative tool execution. arXiv preprint arXiv:2603.18897, 2026.
- [12] Marcel Wagenländer, Otto White, Britannio Jarrett, Pedro Silvestre, Yanda Tao, Guo Li, Huanzhou Zhu, Lluís Vilanova, and Peter Pietzuch. Scepsy: Serving agentic workflows using aggregate LLM pipelines, 2026. URL https://arxiv.org/abs/2604.15186
- [13] Jiaxin Zhang, Prafulla Kumar Choubey, Kung-Hsiang Huang, Caiming Xiong, and Chien-Sheng Wu. Agentic uncertainty quantification. arXiv preprint arXiv:2601.15703, 2026.
- [14] Jiaxin Zhang, Caiming Xiong, and Chien-Sheng Wu. Agentic confidence calibration. arXiv preprint arXiv:2601.15778, 2026.
- [15] Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, and Chetan Bansal. AgentRx: Diagnosing AI agent failures from execution trajectories. arXiv preprint arXiv:2602.02475, 2026.
- [16] Dany Moshkovich, Hadar Mulian, Sergey Zeltyn, Natti Eder, Inna Skarbovsky, and Roy Abitbol. Beyond black-box benchmarking: Observability, analytics, and optimization of agentic systems. arXiv preprint arXiv:2503.06745, 2025.
- [17] Dany Moshkovich and Sergey Zeltyn. Taming uncertainty via automation: Observing, analyzing, and optimizing agentic AI systems. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 3840–3844. IEEE, 2025.
- [18] LMSYS Org and SGLang Team. SGLang HiCache: Fast hierarchical KV caching with your favorite storage backends. https://lmsys.org/blog/2025-09-10-sglang-hicache/, September 2025.
- [19] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30, 2017.
- [20] Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal. HoVer: A dataset for many-hop fact extraction and claim verification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3441–3460, 2020.
- [21] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66
- [22] Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. FinanceBench: A new benchmark for financial question answering. arXiv preprint arXiv:2311.11944, 2023.
- [23] CrewAI. CrewAI. https://github.com/crewAIInc/crewAI, 2023. Software.
- [24] Qiaoling Chen, Zhisheng Ye, Tian Tang, Peng Sun, Boyu Tian, Guoteng Wang, Shenggui Li, Yonggang Wen, Zhenhua Han, and Tianwei Zhang. Concur: High-throughput agentic batch inference of LLM via congestion-based concurrency control. arXiv preprint arXiv:2601.22705, 2026.
- [25] Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, et al. LMCache: An efficient KV cache layer for enterprise-scale LLM inference. arXiv preprint arXiv:2510.09665, 2025.
- [26] Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, and Yiran Chen. KVCOMM: Online cross-context KV-cache communication for efficient LLM-based multi-agent systems. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net...
- [27] Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. CacheBlend: Fast large language model serving for RAG with cached knowledge fusion. In Proceedings of the Twentieth European Conference on Computer Systems, pages 94–109, 2025.
- [28] Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph E. Gonzalez, and Ion Stoica. Continuum: Efficient and robust multi-turn LLM agent scheduling with KV cache time-to-live. In ICLR 2026 Workshop on Lifelong Agents: Learning, Aligning, Evolving, 2026. URL https://openreview.net/forum?id=sVzK0LC9pn
- [29] Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. Parrot: Efficient serving of LLM-based applications with semantic variable. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 929–945, 2024.
- [30] Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. Towards end-to-end optimization of LLM-based applications with Ayo. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2025. doi: 10.1145/3676641.3716278. URL https://doi.org/10.1145/3676641.3716278
- [31] Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, and Tianyi Peng. Speculative actions: A lossless framework for faster AI agents. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=P0GOk5wslg
- [32] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=SJU4ayYgl
- [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [34] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
- [35] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.
- [36] Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, and Michael Mitzenmacher. Don't Stop Me Now: Embedding based scheduling for LLMs. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=7JhGdZvW4T
- [37] Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Nikul Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. In International Conference on Machine Learning, pages 55854–55875. PMLR, 2025.
- [38] Tom Ulanovski, Eyal Blyachman, and Maya Bechler-Speicher. Improving LLM predictions via inter-layer structural encoders. In ICLR Workshop on Geometry-grounded Representation Learning and Generative Modeling, 2026.
- [39] Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned advice. Journal of the ACM, 68(4):1–25, 2021.
- [40] Dhruv Rohatgi. Near-optimal bounds for online caching with machine learned advice. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1834–1845. SIAM, 2020.
- [41] Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. Communications of the ACM, 65(7):33–35, 2022.