Parallelizing Tool Execution and LLM Generation for Low-Latency Agent Serving

Han Zhao; Hao Wang; Jianxun Li; Kai Chen; Kaiqiang Xu; Rui Ma; Yifan Sui; Yuqing Yang; Zhiyuan He

arxiv: 2603.18897 · v3 · pith:KSEEQ52Xnew · submitted 2026-03-19 · 💻 cs.DC · cs.AI

keywords tool · execution · paste · agent · generation · latency · loop · serving

LLM-powered agents execute tasks through a sequential loop of model generation and tool execution. Today's serving systems serialize this loop, leaving tool latency exposed on the task critical path. This paper presents PASTE, a tool-aware agent-serving system that predicts concrete future tool invocations from recurring agent patterns and executes them speculatively while the LLM is still generating. PASTE isolates speculative results until confirmed by the LLM and jointly schedules tool execution and returning LLM sessions to avoid shifting bottlenecks to the GPU. Across deep research, coding, and scientific-agent workloads, PASTE reduces average task completion time by 43.5% and lowers observed tool latency by 1.8x.
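The overlap described in the abstract can be pictured with a minimal sketch. This is not PASTE's implementation: `run_tool`, `llm_generate`, `speculative_step`, and the fixed latencies are hypothetical stand-ins, the predicted call is supplied by hand rather than mined from recurring agent patterns, and the tool is assumed to be free of side effects so that a discarded speculative run is harmless. Joint scheduling with the GPU is not modeled.

```python
import asyncio

ToolCall = tuple[str, str]  # (tool name, argument); hypothetical representation

TOOL_LATENCY = 0.05  # seconds; assumed value for illustration only
LLM_LATENCY = 0.10   # seconds; assumed value for illustration only


async def run_tool(call: ToolCall) -> str:
    """Stand-in for a real tool invocation; sleeps to mimic tool latency."""
    await asyncio.sleep(TOOL_LATENCY)
    return f"result of {call[0]}({call[1]!r})"


async def llm_generate() -> ToolCall:
    """Stand-in for LLM generation; returns the tool call the model decides on."""
    await asyncio.sleep(LLM_LATENCY)
    return ("search", "agent serving")


async def speculative_step(predicted: ToolCall) -> tuple[str, bool]:
    """Run the predicted tool call while the LLM is still generating.

    Returns (tool result, whether the speculation was used).
    """
    # Start the speculative call; its result stays inside the task
    # and is not exposed until the LLM confirms the same call.
    spec_task = asyncio.create_task(run_tool(predicted))

    actual = await llm_generate()

    if actual == predicted:
        # Confirmed: the tool latency overlapped with generation.
        return await spec_task, True

    # Mispredicted: cancel if still running and drop any result it produced.
    spec_task.cancel()
    try:
        await spec_task
    except asyncio.CancelledError:
        pass
    # Fall back to the serialized path for the call the LLM actually made.
    return await run_tool(actual), False


if __name__ == "__main__":
    print(asyncio.run(speculative_step(("search", "agent serving"))))
    print(asyncio.run(speculative_step(("read_file", "notes.txt"))))
```

With a correct prediction the tool call finishes inside the generation window, so the step costs roughly the LLM latency alone; with a wrong prediction the step falls back to the serialized cost, and the speculative result is never returned to the caller.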

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Policy-Driven Runtime Layer for Agentic LLM Serving
cs.AI 2026-05 unverdicted novelty 7.0

Introduces a three-tier architecture with an agent runtime layer and four primitives for agent-aware policies in LLM serving, validated on KV caching via CacheSage showing 13-37pp hit-rate gains on five workloads.

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools
cs.CR 2026-06 unverdicted novelty 6.0

Ghost tool calls from speculative dispatch create persistent intent leaks that only issue-time policies changing or suppressing call arguments or destinations can reduce, per evaluations of twelve policies on three corpora.

Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI
cs.OS 2026-05 unverdicted novelty 6.0

MORI improves throughput 20-71% and TTFT 18-43% over baselines by ranking programs on a continuous idleness spectrum and shifting the GPU-CPU boundary to match capacity in agentic LLM serving.

HARBOR: Automated Harness Optimization
cs.LG 2026-04 unverdicted novelty 6.0

HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.

SpecHop: Continuous Speculation for Accelerating Multi-Hop Retrieval Agents
cs.CL 2026-05 unverdicted novelty 5.0

SpecHop accelerates multi-hop LLM tool use via continuous multi-threaded speculation with asynchronous verification, approaching oracle latency gains and reducing latency up to 40% on retrieval tasks.

Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
cs.LG 2026-05 unverdicted novelty 5.0

PBKV predicts agent invocations in dynamic LLM workflows to manage KV-cache reuse, delivering up to 1.85x speedup over LRU and 1.26x over KVFlow.

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
cs.SE 2026-04 unverdicted novelty 5.0

Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
cs.SE 2026-04 accept novelty 5.0

LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

B-PASTE: Beam-Aware Pattern-Guided Speculative Execution for Resource-Constrained LLM Agents
cs.DC 2026-04 unverdicted novelty 5.0

B-PASTE uses beam-aware speculation of tool-call branches ranked by critical-path reduction to deliver up to 1.4x end-to-end speedup in resource-constrained LLM agents.