Stateful Inference for Low-Latency Multi-Agent Tool Calling

2026 · cs.LG · arXiv 2605.26289

1 Pith paper cites this work. Polarity classification is still indexing.



abstract

Multi-agent tool calling is becoming the dominant interaction pattern for LLM-based systems, yet existing inference frameworks treat each tool call as an independent request, re-processing the entire conversation from scratch even though 85-95% of the prompt is unchanged from the previous turn. We present a stateful inference architecture that converts the $O(n_t)$ per-turn cost of conventional serving into an $O(\Delta_t)$ delta-only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel, fully-generated workloads, the reference implementation is $2.1\times$ faster per turn on a 6-turn agentic workflow and $4.2\times$ on the median turn of a 35-turn one, halving end-to-end wall time. The advantage comes from stateful reuse and speculation, not caching.
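To make the $O(n_t)$ versus $O(\Delta_t)$ contrast in the abstract concrete, the following is a minimal sketch of the token accounting only, not the paper's implementation. The class name `StatefulSession`, the helper `stateless_cost`, and the turn sizes (a 900-token first prompt followed by five 100-token deltas) are hypothetical choices made for illustration; they are picked so that roughly 90% of each later prompt is unchanged, in line with the 85-95% range quoted above. No real KV tensors, radix tree, or speculative decoder are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class StatefulSession:
    """Toy stand-in for a persistent per-session KV cache (hypothetical)."""

    cached: list = field(default_factory=list)  # token ids already ingested

    def ingest(self, prompt: list) -> int:
        """Advance the cache to `prompt`; return how many tokens were processed."""
        # Count the longest prefix shared by the cache and the new prompt.
        shared = 0
        for old, new in zip(self.cached, prompt):
            if old != new:
                break
            shared += 1
        # Only tokens past the shared prefix need a forward pass.
        self.cached = list(prompt)
        return len(prompt) - shared


def stateless_cost(prompt: list) -> int:
    """A stateless server re-processes the whole prompt on every turn."""
    return len(prompt)


# Assumed workload: 6 turns, each appending new tokens to the conversation.
turn_deltas = [900, 100, 100, 100, 100, 100]

session = StatefulSession()
history = []
stateless_total = 0
stateful_total = 0
for delta in turn_deltas:
    # Append `delta` fresh token ids; earlier tokens are left untouched.
    history = history + list(range(len(history), len(history) + delta))
    stateless_total += stateless_cost(history)
    stateful_total += session.ingest(history)

print(stateless_total, stateful_total)  # 6900 1400
```

Under these assumed sizes the stateless path processes 6900 prompt tokens in total while the stateful path processes 1400, which is simply the final conversation length. The sketch counts prefill tokens only; it says nothing about the measured speedups reported in the abstract, which also depend on decoding and speculation.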

representative citing papers

Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

Speculative pre-positioning decodes stateful sessions ahead with the target model to enable near-constant-time responses from cached distributions or pre-paid deltas at 87% precision for capable models.

citing papers explorer

Showing 1 of 1 citing paper.

Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path

cs.LG · 2026-06-28 · unverdicted · none · ref 12 · internal anchor

Speculative pre-positioning decodes stateful sessions ahead with the target model to enable near-constant-time responses from cached distributions or pre-paid deltas at 87% precision for capable models.
