ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System

Beidi Chen; Chenfeng Xu; Hao Kang; Junxiong Wang; Simran Arora; Tushar Krishna; Weili Xu; Xinyu Yang; Yinfang Chen; Ziyang Li

arxiv: 2602.13692 · v3 · pith:LUXMQJAZnew · submitted 2026-02-14 · 💻 cs.OS · cs.MA

Hao Kang, Ziyang Li, Weili Xu, Xinyu Yang, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, Simran Arora

keywords: agentic, thunderagent, inference, tool, system, memory, program-aware, systems

Large language models (LLMs) now power complex multi-turn agentic workflows. Existing systems run agentic inference by loosely assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although agentic workflows involve multiple LLM and tool requests, these systems schedule and allocate resources separately on a per-request basis, without end-to-end knowledge of the workflow. This leads to sub-optimal management of KV caches and tool execution environments. To address these challenges, we propose ThunderAgent, a fast, simple, and program-aware agentic inference system. We first abstract agentic workflows as LLM Programs, enabling a unified view of heterogeneous resources, including KV caches, system states, and external tool assets such as disk memory and network ports. Built upon this abstraction, ThunderAgent introduces a program-aware scheduler and a tool resource manager designed to maximize KV cache hit rates, mitigate memory imbalances, and enable asynchronous environment preparation. Evaluations across coding, routing, and scientific discovery agents demonstrate that ThunderAgent achieves 1.5-3.6x throughput improvements in serving, 1.8-3.9x in RL rollout, and up to 4.2x disk memory savings compared to state-of-the-art inference systems. To facilitate reproducibility and support future development, we open-source the full ThunderAgent system implementation at: https://github.com/Agentic-Kinetics/ThunderAgent.
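To give a rough intuition for why workflow-level knowledge can raise KV cache hit rates, here is a toy Python sketch. It is not ThunderAgent's scheduler: the LRU cache model, the function names (`simulate`, `per_request_order`, `program_aware_order`), and the workload sizes are all illustrative assumptions. It compares a per-request round-robin order against an order that admits only as many programs as fit in the cache and finishes their turns before admitting more.

```python
from collections import OrderedDict


def simulate(order, capacity):
    """Count KV-cache hits for a sequence of LLM requests, one program id each.

    Toy model (assumption): the cache holds the KV state of at most
    `capacity` programs and evicts the least recently used one when full.
    """
    cache = OrderedDict()
    hits = 0
    for pid in order:
        if pid in cache:
            hits += 1
            cache.move_to_end(pid)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the LRU program
            cache[pid] = True
    return hits


def per_request_order(num_programs, turns):
    """Hypothetical baseline: every turn, serve all programs round-robin."""
    return [p for _ in range(turns) for p in range(num_programs)]


def program_aware_order(num_programs, turns, capacity):
    """Hypothetical program-aware order: admit `capacity` programs at a time
    and run all of their turns before admitting the next batch."""
    order = []
    for start in range(0, num_programs, capacity):
        batch = range(start, min(start + capacity, num_programs))
        order += [p for _ in range(turns) for p in batch]
    return order


# Illustrative sizes only: 4 programs, 3 LLM turns each, cache fits 2 programs.
P, T, C = 4, 3, 2
baseline_hits = simulate(per_request_order(P, T), C)
aware_hits = simulate(program_aware_order(P, T, C), C)
print(baseline_hits, aware_hits)  # prints "0 8" (out of 12 requests)
```

In this toy setting the round-robin order thrashes the cache because each program's state is evicted before its next turn, while the batched order reuses cached state on every turn after the first. A real system would also have to account for tool-call latency, memory imbalance across instances, and environment preparation, which this sketch ignores.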

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SmoothAgent: Efficient Long-Horizon LLM-Based Agent Serving with Lookahead Context Engineering
cs.DC 2026-06 unverdicted novelty 7.0

SmoothAgent introduces lookahead context engineering to eliminate transformation overhead in LLM agents, reducing TTFT by up to 11.9x through proactive KV cache preparation.

Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI
cs.OS 2026-05 unverdicted novelty 6.0

MORI improves throughput 20-71% and TTFT 18-43% over baselines by ranking programs on a continuous idleness spectrum and shifting the GPU-CPU boundary to match capacity in agentic LLM serving.

Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling
cs.AI 2026-04 unverdicted novelty 6.0

Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and ...

KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
cs.DC 2026-04 unverdicted novelty 6.0

KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.

Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline
cs.AI 2026-06 unverdicted novelty 5.0

An agentic harness letting the LLM self-manage flat text-file storage via tool calls outperforms eight prior memory systems on cross-scenario generality across QA, chat, trajectory, stress-test, and long-horizon tasks.

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving
cs.AR 2026-06 unverdicted novelty 5.0

AsymCache combines Multi-Segment Attention, position-aware eviction, and adaptive chunking to cut TTFT by up to 2.03x and TPOT by up to 1.71x versus recent baselines in LLM serving.

Agentic AI Workload Characteristics
cs.DC 2026-05 unverdicted novelty 5.0

Agentic workloads with context caching become decode-dominated with high KV-cache reuse and show tool use shifting from early read/explore to later execute/write phases.

Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost
cs.AI 2026-05 unverdicted novelty 5.0

Compiling agentic workflows into LLM weights creates subterranean agents with near-frontier quality at two orders of magnitude less cost, validated empirically on travel booking, Zoom support, and insurance claims tasks.

AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent
cs.LG 2026-04 unverdicted novelty 5.0

AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-...