SmoothAgent: Efficient Long-Horizon LLM-Based Agent Serving with Lookahead Context Engineering

Chang Chen; Qianxu Wang; Steven Swanson; Yanbo Zhou; Yue Guan; Yufei Ding; Zaifeng Pan; Zhengding Hu

arxiv: 2607.00151 · v1 · pith:TZ6T4OA7new · submitted 2026-06-30 · 💻 cs.DC

SmoothAgent: Efficient Long-Horizon LLM-Based Agent Serving with Lookahead Context Engineering

Zaifeng Pan , Qianxu Wang , Zhengding Hu , Chang Chen , Yue Guan , Yanbo Zhou , Steven Swanson , Yufei Ding This is my paper

Pith reviewed 2026-07-02 17:17 UTC · model grok-4.3

classification 💻 cs.DC

keywords LLM agentscontext engineeringKV cachelookahead schedulingtime-to-first-tokenagent servingcontext transformation

0 comments

The pith

Context transformations in LLM agents can execute asynchronously via segment decomposability to eliminate TTFT overhead

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agent workflows grow contexts through interleaved tool calls and feedback, requiring engineering steps like offloading or reduction that normally force KV cache invalidation and costly re-prefill on every change. The paper establishes that these transformations are segment-decomposable, so the result for any prefix stands alone without reference to later tokens. This property lets the system treat transformations as lookahead asynchronous operations that the runtime can run early and cache in advance. A programming model and scheduler then allow direct context replacement at runtime with no blocking delay. Experiments confirm the approach removes the overhead and lowers TTFT by up to 11.9 times.

Core claim

Context transformations are segment-decomposable so that the transformation applied to a prefix is independent of future tokens; this allows a lookahead programming model to schedule the transformations asynchronously, the runtime to precompute the corresponding KV caches, and a lookahead-aware scheduler to swap contexts without re-prefill or interference with latency-critical work.

What carries the argument

Lookahead programming model that marks context transformations as asynchronous operations, backed by proactive KV-cache preparation and a lookahead-aware scheduler

If this is right

Agent frameworks can apply offloading, reduction, and isolation without paying re-prefill cost on each change.
Transformed KV caches are ready for immediate use at the moment the context switch occurs.
The scheduler can interleave lookahead requests with normal inference while keeping interference bounded.
No changes to existing agent execution logic are required to obtain the latency benefit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same segment-independence property may apply to other dynamic context operations such as retrieval or memory consolidation in long-running agents.
Removing the transformation bottleneck could make multi-hour agent sessions practical on current serving hardware.
Tighter coupling between the lookahead scheduler and tool-calling loops might further reduce end-to-end latency beyond the reported TTFT gains.

Load-bearing premise

Every context transformation can be applied to a prefix without depending on any later tokens in the sequence.

What would settle it

A concrete context transformation (for example a summarizer or offloader) in which changing tokens after position k alters the correct output for the first k tokens, so that any precomputed cache for the prefix is wrong.

Figures

Figures reproduced from arXiv: 2607.00151 by Chang Chen, Qianxu Wang, Steven Swanson, Yanbo Zhou, Yue Guan, Yufei Ding, Zaifeng Pan, Zhengding Hu.

**Figure 3.** Figure 3: An illustrative example of context transformation [PITH_FULL_IMAGE:figures/full_fig_p002_3.png] view at source ↗

**Figure 4.** Figure 4: Execution timeline of the lookahead program [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 2.** Figure 2: Agent serving systems. The serving stack for LLM agents typically consists of two loosely coupled layers. At the frontend, agent frameworks [4, 33, 39, 47, 59] define the agent harness, including control flow, tool usage, and context management policies. These frameworks focus on expressiveness and task-solving capability, enabling developers to compose complex agent behaviors. At the backend, LLM serving… view at source ↗

**Figure 5.** Figure 5: Offloading transforms bulky observations (e.g., tool [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 7.** Figure 7: Summarization compresses the prefix into a com [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

**Figure 8.** Figure 8: Context isolation launches a sub-agent with a clean [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗

**Figure 9.** Figure 9: System overview of SmoothAgent. This ordering, combined with the latency constraint, ensures that decode and LC requests are never delayed by BE execution, while allowing the scheduler to exploit slack within each iteration. Discussion on API-based serving. The best-effort abstraction for lookahead requests also enables flexible API design. Service providers can expose lookahead requests as a lower-priorit… view at source ↗

**Figure 10.** Figure 10: Transform-point TTFT for each strategy on Qwen3-8B (PD co-located) at concurrency levels 1, 4, 8, and 16. SmoothA [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 11.** Figure 11: Transform-point TTFT for each strategy on Qwen3-32B (PD co-located, TP=4 across four H100 GPUs). [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗

**Figure 12.** Figure 12: Context length (top) and per-turn TTFT (bottom) for a single Qwen3-8B agent under each strategy. Synchronous [PITH_FULL_IMAGE:figures/full_fig_p011_12.png] view at source ↗

**Figure 13.** Figure 13: Transform-time TTFT on Qwen3-8B in a PD disaggregated deployment with four prefill and four decode instances. [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗

**Figure 14.** Figure 14: Impact of lookahead traffic on latency-critical [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗

**Figure 15.** Figure 15: Accuracy of the context-aware performance model [PITH_FULL_IMAGE:figures/full_fig_p012_15.png] view at source ↗

read the original abstract

LLM-based agents execute multi-turn workflows with continuously growing contexts, where LLM calls are interleaved with tool invocations and environment feedback. To maintain model quality, modern agent frameworks rely on context engineering strategies such as offloading, reduction, and isolation to control the context length. However, these strategies introduce significant context transformation overhead: each transformation invalidates existing KV caches and triggers re-prefill, leading to increased time-to-first-token (TTFT). In this paper, we identify that context transformations are segment-decomposable, where the transformation of a prefix is independent of future tokens. This property enables transformations to be executed ahead of time. Based on this insight, we propose a lookahead programming model that allows agent frameworks to express context transformations as asynchronous operations without modifying their execution logic. The runtime proactively executes these transformations and prepares transformed KV caches in advance, enabling direct context replacement without blocking. We further design a lookahead-aware scheduler in LLM serving systems to support these asynchronous requests alongside latency-critical workloads with controlled interference. We implement our approach to support representative context engineering strategies and integrate it into existing agent frameworks and LLM serving systems. Experiments show that our approach effectively eliminates transformation overhead and reduces TTFT by up to 11.9x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The lookahead model for precomputing segment-decomposable context transformations is a practical systems idea that could hide overhead in agent serving, but the decomposability claim needs explicit checks to avoid silent quality issues.

read the letter

The main thing to know is that this paper introduces a lookahead programming model so agent frameworks can mark context transformations (offload, reduce, isolate) as async operations. The runtime then prepares the transformed KV caches ahead of time, and a new scheduler mixes those requests with normal work while limiting interference. Experiments are said to remove the transformation overhead and cut TTFT by up to 11.9x.

What is actually new is the combination of the programming model with the scheduler; prior context-reduction work focused on shrinking length but did not treat the transformations themselves as lookahead tasks that can be overlapped. The integration story with existing agent frameworks and serving systems is also concrete and useful.

The work does well on the engineering side by keeping changes mostly transparent to the agent logic and by targeting a real bottleneck in long-horizon loops. The scheduler design appears sensible for controlled sharing of GPU resources.

The soft spot is the segment-decomposability assumption itself. The stress-test concern is fair: many reduction and isolation steps use aggregate statistics or later environment feedback, so the prefix transformation is not obviously independent of future tokens. If the paper only validates simple cases and does not show either a formal class of supported transformations or side-by-side output equivalence checks, the latency win could mask quality regressions. The abstract gives no methodology or error analysis, so the full paper must supply those details clearly.

This is for people building LLM serving stacks for agent workloads. It deserves peer review because the problem is timely, the mechanism is distinct from prior art, and the practical payoff is large enough to warrant referee time even if the evaluation needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper claims that context transformations (offloading, reduction, isolation) in long-horizon LLM agent workflows are segment-decomposable, i.e., the transformation of any prefix is independent of future tokens. This property enables a lookahead programming model for expressing transformations as asynchronous operations, proactive KV-cache preparation by the runtime, and a lookahead-aware scheduler that supports these requests with controlled interference, ultimately eliminating transformation overhead and reducing TTFT by up to 11.9x.

Significance. If the segment-decomposability property holds with equivalence to monolithic transformations and the scheduler introduces no quality loss, the approach would allow agent frameworks to hide context-engineering latency, improving responsiveness for multi-turn workflows that interleave LLM calls with tools and feedback.

major comments (2)

[Abstract] Abstract: the central claim rests on the assertion that 'context transformations are segment-decomposable, where the transformation of a prefix is independent of future tokens,' yet the manuscript supplies neither a formal characterization of the class of transformations for which this holds nor an equivalence argument (proof or empirical check) that segment-wise results equal the monolithic result. This is load-bearing because reduction and isolation strategies commonly rely on aggregate statistics or environment feedback that can retroactively affect earlier segments, risking incorrect KV caches.
[Experiments] The experimental claim of up to 11.9x TTFT reduction is presented without reported methodology, baselines, datasets, or quality metrics (e.g., downstream task accuracy or KV-cache equivalence checks) that would confirm the decomposability assumption was not violated in the tested strategies.

minor comments (1)

The abstract states integration into 'existing agent frameworks and LLM serving systems' but does not name the specific frameworks or serving systems used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim rests on the assertion that 'context transformations are segment-decomposable, where the transformation of a prefix is independent of future tokens,' yet the manuscript supplies neither a formal characterization of the class of transformations for which this holds nor an equivalence argument (proof or empirical check) that segment-wise results equal the monolithic result. This is load-bearing because reduction and isolation strategies commonly rely on aggregate statistics or environment feedback that can retroactively affect earlier segments, risking incorrect KV caches.

Authors: We agree that a formal characterization and equivalence argument would strengthen the central claim. The manuscript defines segment-decomposability for the specific class of transformations (offloading via local token scoring, reduction via prefix-local summarization or pruning, and isolation via non-crossing context boundaries) where each prefix transformation is independent of future tokens by construction. We will add a new subsection with a formal definition of the property, a proof sketch showing equivalence to the monolithic case for these strategies, and empirical checks comparing segment-wise versus full-context outputs on representative workflows. revision: yes
Referee: [Experiments] The experimental claim of up to 11.9x TTFT reduction is presented without reported methodology, baselines, datasets, or quality metrics (e.g., downstream task accuracy or KV-cache equivalence checks) that would confirm the decomposability assumption was not violated in the tested strategies.

Authors: The full manuscript reports the experimental methodology, baselines (vanilla vLLM and Hugging Face serving), datasets (long-horizon tasks from AgentBench and custom multi-turn workflows), and quality metrics (task accuracy, output equivalence for KV-cache validation, and TTFT) in Sections 5 and 6. The 11.9x result is obtained only on workloads where decomposability was verified via equivalence checks. We will revise the abstract and introduction to explicitly reference these sections and add a dedicated table of KV-cache equivalence results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical results and stated insight

full rationale

The paper presents an empirical performance claim (TTFT reduction up to 11.9x) supported by experiments on implemented strategies, alongside an identified property (segment-decomposability) used to motivate a lookahead model. No equations, fitted parameters, or self-citations are shown that reduce the central result to its own inputs by construction. The derivation chain is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the segment-decomposability property of context transformations and on the experimental validation of the runtime and scheduler; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption Context transformations are segment-decomposable, where the transformation of a prefix is independent of future tokens.
This is the key insight stated in the abstract that enables proactive execution.

pith-pipeline@v0.9.1-grok · 5767 in / 1146 out tokens · 28369 ms · 2026-07-02T17:17:52.977009+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

89 extracted references · 23 canonical work pages · 10 internal anchors

[1]

Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiying Zhang
[2]

Infercept: Efficient intercept support for augmented large language model inference.arXiv preprint arXiv:2402.01869(2024)

work page arXiv 2024
[3]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve}. In 18th USENIX symposium on operating systems design and implementation (OSDI 24). 117–134

2024
[4]

Anthropic. 2024. Building effective agents. https://www.anthropic.com/engine ering/building-effective-agents

2024
[5]

Anthropic. 2025. Claude Code. https://www.anthropic.com/claude-code

2025
[6]

Anthropic. 2025. Effective context engineering for AI agents. https://www.anth ropic.com/engineering/effective-context-engineering-for-ai-agents

2025
[7]

Anthropic. 2025. How we built our multi-agent research system. https://www. anthropic.com/engineering/multi-agent-research-system

2025
[8]

Qiaoling Chen, Zhisheng Ye, Tian Tang, Peng Sun, Boyu Tian, Guoteng Wang, Shenggui Li, Yonggang Wen, Zhenhua Han, and Tianwei Zhang. 2026. CON- CUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control.arXiv preprint arXiv:2601.22705(2026)

work page arXiv 2026
[9]

Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Zihao Ye, Songting Wang, Wenqin Yang, Xu- peng Miao, Tianqi Chen, and Zhihao Jia. 2025. Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs...

work page arXiv 2025
[10]

Chester Curme and Mason Daugherty. 2026. Context Management for Deep Agents. https://blog.langchain.com/context-management-for-deepagents

2026
[11]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashat- tention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems35 (2022), 16344–16359

2022
[12]

DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Ha...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. 2024. Flex attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.054962, 3 (2024), 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Hugging Face. 2025. Open-source DeepResearch - Freeing our search agents. https://huggingface.co/blog/open-deep-research

2025
[15]

Taosong Fang, Zhen Zheng, Zhengzhao Ma, Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. 2026. FlashAgents: Accelerating Multi-Agent LLM Systems via Streaming Prefill Overlap.Proceedings of Machine Learning and Systems(2026)

2026
[16]

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. {Cost-Efficient} large language model serving for multi-turn conversations with {CachedAttention}. In2024 USENIX Annual Technical Conference (USENIX ATC 24). 111–126

2024
[17]

Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-serve: Adaptive request scheduling on hybrid cache for scalable llm inference serving.Proceedings of the ACM on Management of Data3, 3 (2025), 1–28

2025
[18]

Victor Giannakouris and Immanuel Trummer. 2025. 𝜆-tune: Harnessing large language models for automated database system tuning.Proceedings of the ACM on Management of Data3, 1 (2025), 1–26

2025
[19]

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems6 (2024), 325–338

2024
[20]

GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zhong,...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

Kanishk Goel, Jayashree Mohan, Nipun Kwatra, Ravi Shreyas Anupindi, and Ramachandran Ramjee. 2026. QoServe: Breaking the Silos of LLM Inference Serving. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 1492–1507

2026
[22]

Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. 2024. Deepspeed-fastgen: High- throughput text generation for llms via mii and deepspeed-inference.arXiv preprint arXiv:2401.08671(2024)

work page arXiv 2024
[23]

2025.Context Rot: How Increasing Input Tokens Impacts LLM Performance

Kelly Hong, Anton Troynikov, and Jeff Huber. 2025.Context Rot: How Increasing Input Tokens Impacts LLM Performance. Technical Report. Chroma. https: //research.trychroma.com/context-rot

2025
[24]

Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al . 2023. Metagpt: Meta programming for multi-agent collaborative framework.arXiv preprint arXiv:2308.003523, 4 (2023), 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Zhengding Hu, Vibha Murthy, Zaifeng Pan, Wanlu Li, Xiaoyi Fang, Yufei Ding, and Yuke Wang. 2025. HedraRAG: Co-Optimizing Generation and Retrieval for Heterogeneous RAG Workflows. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 623–638

2025
[26]

Zhengding Hu, Zaifeng Pan, Prabhleen Kaur, Vibha Murthy, Zhongkai Yu, Yue Guan, Zhen Wang, Steven Swanson, and Yufei Ding. 2026. Pancake: Hierarchical Memory System for Multi-Agent LLM Serving.arXiv preprint arXiv:2602.21477 (2026). 13

work page arXiv 2026
[27]

Yichao Ji. 2025. Context Engineering for AI Agents: Lessons from Building Manus. https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons- from-Building-Manus

2025
[28]

Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, and Gustavo Alonso
[29]

Chameleon: A Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models.Proceedings of the VLDB Endowment18, 1 (2024), 42–52

2024
[30]

Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. 2025. Ragcache: Efficient knowledge caching for retrieval-augmented generation.ACM Transactions on Computer Systems44, 1 (2025), 1–27

2025
[31]

Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. 2025. Pod-attention: Unlocking full prefill-decode overlap for faster llm inference. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 897–912

2025
[32]

Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, and Simran Arora. 2026. ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System.arXiv preprint arXiv:2602.13692(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626

2023
[34]

LangChain. 2025. Context Engineering. https://blog.langchain.com/context- engineering-for-agents

2025
[35]

LangChain. 2026. LangChain: The agent engineering platform. https://www.la ngchain.com

2026
[36]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172

2024
[37]

Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. 2023. Compressing context to enhance inference efficiency of large language models. InProceedings of the 2023 conference on empirical methods in natural language processing. 6342– 6353

2023
[38]

Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. 2023. Swift- sage: A generative agent with fast and slow thinking for complex interactive tasks.Advances in Neural Information Processing Systems36 (2023), 23813–23825

2023
[39]

Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient Serving of {LLM-based} Applications with Semantic Variable. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 929–945

2024
[40]

Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye, Kan Zhu, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, Luis Ceze, and Baris Kasikci. 2025. TeleRAG: Effi- cient retrieval-augmented generation inference with lookahead retrieval.arXiv preprint arXiv:2502.20969(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Jerry Liu. 2022. LlamaIndex. https://github.com/jerryjliu/llama_index

2022
[42]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics12 (2024), 157–173

2024
[43]

Gonzalez, and Aditya G

Shu Liu, Soujanya Ponnapalli, Shreya Shankar, Sepanta Zeighami, Alan Zhu, Shubham Agarwal, Ruiqi Chen, Samion Suwito, Shuo Yuan, Ion Stoica, Matei Zaharia, Alvin Cheung, Natacha Crooks, Joseph E. Gonzalez, and Aditya G. Parameswaran. 2026. Supporting our ai overlords: Redesigning data systems to be agent-first.Proceedings of CIDR 2026(2026)

2026
[44]

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, and Junchen Jiang. 2025. Lmcache: An efficient KV cache layer for enterprise-scale LLM inference.arXiv preprint arXiv:2510.09665(2025)

work page arXiv 2025
[45]

Kuan Lu, Zhihui Yang, Sai Wu, Ruichen Xia, Dongxiang Zhang, and Gang Chen
[46]

Adda: Towards efficient in-database feature generation via llm-based agents.Proceedings of the ACM on Management of Data3, 3 (2025), 1–27

2025
[47]

Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E Gonzalez, and Ion Stoica. 2025. Autellix: An efficient serving engine for llm agents as general programs.arXiv preprint arXiv:2502.13965(2025)

work page arXiv 2025
[48]

Yuyu Luo, Guoliang Li, Ju Fan, and Nan Tang. 2026. Data Agents: Levels, State of the Art, and Open Problems. InCompanion of the International Conference on Management of Data. 571–579

2026
[49]

Microsoft. 2026. Autogen: Open-Source Framework for Agentic AI. https: //www.microsoft.com/en-us/research/project/autogen

2026
[50]

MiniMax. 2026. Mini Agent. https://github.com/MiniMax-AI/Mini-Agent

2026
[51]

NVIDIA. 2026. NVIDIA Inference Xfer Library (NIXL). https://github.com/ai- dynamo/nixl

2026
[52]

James Pan and Guoliang Li. 2025. Database Perspective on LLM Inference Systems.Proceedings of the VLDB Endowment18, 12 (2025), 5504–5507

2025
[53]

Zaifeng Pan, Yitong Ding, Yue Guan, Zheng Wang, Zhongkai Yu, Xulong Tang, Yida Wang, and Yufei Ding. 2025. FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference. InProceedings of Machine Learning and Systems

2025
[54]

Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, and Yufei Ding. 2025. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows.arXiv preprint arXiv:2507.07400(2025)

work page arXiv 2025
[55]

Zaifeng Pan, Yipeng Shen, Zhengding Hu, Zhuang Wang, Aninda Manocha, Zheng Wang, Zhongkai Yu, Yue Guan, and Yufei Ding. 2026. ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management.arXiv preprint arXiv:2601.21473(2026)

work page arXiv 2026
[56]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132

2024
[57]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading more storage for less computation—a {KVCache-centric} architecture for serving {LLM} chatbot. In23rd USENIX Conference on File and Storage Technologies (FAST 25). 155–170

2025
[58]

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

Jiawei Ren, Yan Zhuang, Xiaokang Ye, Lingjun Mao, Xuhong He, Jianzhi Shen, Mrinaal Dogra, Yiming Liang, Ruixuan Zhang, Tianai Yue, Yiqing Yang, Eric Liu, Ryan Wu, Kevin Benavente, Rajiv Mandya Nagaraju, Muhammad Faayez, Xiyan Zhang, Dhruv Vivek Sharma, Xianrui Zhong, Ziqiao Ma, Tianmin Shu, Zhiting Hu, and Lianhui Qin. 2025. Simworld: An open-ended realis...

work page arXiv 2025
[60]

Rya Sanovar, Srikant Bharadwaj, Renee St Amant, Victor Rühle, and Saravan Rajmohan. 2025. LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers.Proceedings of Machine Learning and Systems7 (2025)

2025
[61]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learn- ing.Advances in Neural Information Processing Systems36 (2023), 8634–8652

2023
[62]

Peter Steinberger and OpenClaw Community. 2026. OpenClaw: Personal AI Assistant. https://openclaw.ai

2026
[63]

StepFun, Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin...

work page arXiv 2025
[64]

Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. 2025. Towards end-to-end optimization of llm-based applications with ayo. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 1302–1316

2025
[65]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yas- mine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhos- ale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony H...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[66]

Noppanat Wadlom, Junyi Shen, and Yao Lu. 2026. Efficient LLM serving for agentic workflows: A data systems perspective.Proceedings of the ACM on Management of Data4, 3 (SIGMOD) (2026), 1–29

2026
[67]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for A...

2025
[68]

Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Man Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. 2025. Mirage: A {Multi-Level} superoptimizer for tensor programs. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). 21–38

2025
[69]

Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, and Shuaiwen Leon Song. 2023. Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Un- structured Sparsity.Proceedings of the VLDB Endowment17, 2 (2023), 211–224

2023
[70]

Zhiqiang Xie, Hao Kang, Ying Sheng, Tushar Krishna, Kayvon Fatahalian, and Christos Kozyrakis. 2025. Ai metropolis: Scaling large language model-based multi-agent simulation with out-of-order execution.Proceedings of Machine Learning and Systems7 (2025)

2025
[71]

Qian Xu, Juan Yang, Feng Zhang, Junda Pan, Kang Chen, Youren Shen, Amelie Chi Zhou, and Xiaoyong Du. 2025. Tribase: A vector data query engine for reliable and lossless pruning compression using triangle inequalities.Proceedings of the ACM on Management of Data3, 1 (2025), 1–28

2025
[72]

Qian Xu, Feng Zhang, Chengxi Li, Lei Cao, Zheng Chen, Jidong Zhai, and Xi- aoyong Du. 2025. Harmony: A scalable distributed vector database for high- throughput approximate nearest neighbor search.Proceedings of the ACM on Management of Data3, 4 (2025), 1–28

2025
[73]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[74]

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems37 (2024), 50528–50652

2024
[75]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. Cacheblend: Fast large language model serving for rag with cached knowledge fusion. InProceedings of the twentieth European conference on computer systems. 94–109

2025
[76]

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems35 (2022), 20744–20757

2022
[77]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR)

2023
[78]

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze
[79]

InEighth Conference on Machine Learning and Systems

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. InEighth Conference on Machine Learning and Systems
[80]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung- Gon Chun. 2022. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX symposium on operating systems design and implementation (OSDI 22). 521–538

2022

Showing first 80 references.

[1] [1]

Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiying Zhang

[2] [2]

Infercept: Efficient intercept support for augmented large language model inference.arXiv preprint arXiv:2402.01869(2024)

work page arXiv 2024

[3] [3]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve}. In 18th USENIX symposium on operating systems design and implementation (OSDI 24). 117–134

2024

[4] [4]

Anthropic. 2024. Building effective agents. https://www.anthropic.com/engine ering/building-effective-agents

2024

[5] [5]

Anthropic. 2025. Claude Code. https://www.anthropic.com/claude-code

2025

[6] [6]

Anthropic. 2025. Effective context engineering for AI agents. https://www.anth ropic.com/engineering/effective-context-engineering-for-ai-agents

2025

[7] [7]

Anthropic. 2025. How we built our multi-agent research system. https://www. anthropic.com/engineering/multi-agent-research-system

2025

[8] [8]

Qiaoling Chen, Zhisheng Ye, Tian Tang, Peng Sun, Boyu Tian, Guoteng Wang, Shenggui Li, Yonggang Wen, Zhenhua Han, and Tianwei Zhang. 2026. CON- CUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control.arXiv preprint arXiv:2601.22705(2026)

work page arXiv 2026

[9] [9]

Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Zihao Ye, Songting Wang, Wenqin Yang, Xu- peng Miao, Tianqi Chen, and Zhihao Jia. 2025. Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs...

work page arXiv 2025

[10] [10]

Chester Curme and Mason Daugherty. 2026. Context Management for Deep Agents. https://blog.langchain.com/context-management-for-deepagents

2026

[11] [11]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashat- tention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems35 (2022), 16344–16359

2022

[12] [12]

DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Ha...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. 2024. Flex attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.054962, 3 (2024), 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Hugging Face. 2025. Open-source DeepResearch - Freeing our search agents. https://huggingface.co/blog/open-deep-research

2025

[15] [15]

Taosong Fang, Zhen Zheng, Zhengzhao Ma, Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. 2026. FlashAgents: Accelerating Multi-Agent LLM Systems via Streaming Prefill Overlap.Proceedings of Machine Learning and Systems(2026)

2026

[16] [16]

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. {Cost-Efficient} large language model serving for multi-turn conversations with {CachedAttention}. In2024 USENIX Annual Technical Conference (USENIX ATC 24). 111–126

2024

[17] [17]

Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-serve: Adaptive request scheduling on hybrid cache for scalable llm inference serving.Proceedings of the ACM on Management of Data3, 3 (2025), 1–28

2025

[18] [18]

Victor Giannakouris and Immanuel Trummer. 2025. 𝜆-tune: Harnessing large language models for automated database system tuning.Proceedings of the ACM on Management of Data3, 1 (2025), 1–26

2025

[19] [19]

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems6 (2024), 325–338

2024

[20] [20]

GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zhong,...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[21] [21]

Kanishk Goel, Jayashree Mohan, Nipun Kwatra, Ravi Shreyas Anupindi, and Ramachandran Ramjee. 2026. QoServe: Breaking the Silos of LLM Inference Serving. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 1492–1507

2026

[22] [22]

Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. 2024. Deepspeed-fastgen: High- throughput text generation for llms via mii and deepspeed-inference.arXiv preprint arXiv:2401.08671(2024)

work page arXiv 2024

[23] [23]

2025.Context Rot: How Increasing Input Tokens Impacts LLM Performance

Kelly Hong, Anton Troynikov, and Jeff Huber. 2025.Context Rot: How Increasing Input Tokens Impacts LLM Performance. Technical Report. Chroma. https: //research.trychroma.com/context-rot

2025

[24] [24]

Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al . 2023. Metagpt: Meta programming for multi-agent collaborative framework.arXiv preprint arXiv:2308.003523, 4 (2023), 6

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [25]

Zhengding Hu, Vibha Murthy, Zaifeng Pan, Wanlu Li, Xiaoyi Fang, Yufei Ding, and Yuke Wang. 2025. HedraRAG: Co-Optimizing Generation and Retrieval for Heterogeneous RAG Workflows. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 623–638

2025

[26] [26]

Zhengding Hu, Zaifeng Pan, Prabhleen Kaur, Vibha Murthy, Zhongkai Yu, Yue Guan, Zhen Wang, Steven Swanson, and Yufei Ding. 2026. Pancake: Hierarchical Memory System for Multi-Agent LLM Serving.arXiv preprint arXiv:2602.21477 (2026). 13

work page arXiv 2026

[27] [27]

Yichao Ji. 2025. Context Engineering for AI Agents: Lessons from Building Manus. https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons- from-Building-Manus

2025

[28] [28]

Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, and Gustavo Alonso

[29] [29]

Chameleon: A Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models.Proceedings of the VLDB Endowment18, 1 (2024), 42–52

2024

[30] [30]

Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. 2025. Ragcache: Efficient knowledge caching for retrieval-augmented generation.ACM Transactions on Computer Systems44, 1 (2025), 1–27

2025

[31] [31]

Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. 2025. Pod-attention: Unlocking full prefill-decode overlap for faster llm inference. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 897–912

2025

[32] [32]

Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, and Simran Arora. 2026. ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System.arXiv preprint arXiv:2602.13692(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[33] [33]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626

2023

[34] [34]

LangChain. 2025. Context Engineering. https://blog.langchain.com/context- engineering-for-agents

2025

[35] [35]

LangChain. 2026. LangChain: The agent engineering platform. https://www.la ngchain.com

2026

[36] [36]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172

2024

[37] [37]

Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. 2023. Compressing context to enhance inference efficiency of large language models. InProceedings of the 2023 conference on empirical methods in natural language processing. 6342– 6353

2023

[38] [38]

Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. 2023. Swift- sage: A generative agent with fast and slow thinking for complex interactive tasks.Advances in Neural Information Processing Systems36 (2023), 23813–23825

2023

[39] [39]

Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient Serving of {LLM-based} Applications with Semantic Variable. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 929–945

2024

[40] [40]

Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye, Kan Zhu, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, Luis Ceze, and Baris Kasikci. 2025. TeleRAG: Effi- cient retrieval-augmented generation inference with lookahead retrieval.arXiv preprint arXiv:2502.20969(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Jerry Liu. 2022. LlamaIndex. https://github.com/jerryjliu/llama_index

2022

[42] [42]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics12 (2024), 157–173

2024

[43] [43]

Gonzalez, and Aditya G

Shu Liu, Soujanya Ponnapalli, Shreya Shankar, Sepanta Zeighami, Alan Zhu, Shubham Agarwal, Ruiqi Chen, Samion Suwito, Shuo Yuan, Ion Stoica, Matei Zaharia, Alvin Cheung, Natacha Crooks, Joseph E. Gonzalez, and Aditya G. Parameswaran. 2026. Supporting our ai overlords: Redesigning data systems to be agent-first.Proceedings of CIDR 2026(2026)

2026

[44] [44]

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, and Junchen Jiang. 2025. Lmcache: An efficient KV cache layer for enterprise-scale LLM inference.arXiv preprint arXiv:2510.09665(2025)

work page arXiv 2025

[45] [45]

Kuan Lu, Zhihui Yang, Sai Wu, Ruichen Xia, Dongxiang Zhang, and Gang Chen

[46] [46]

Adda: Towards efficient in-database feature generation via llm-based agents.Proceedings of the ACM on Management of Data3, 3 (2025), 1–27

2025

[47] [47]

Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E Gonzalez, and Ion Stoica. 2025. Autellix: An efficient serving engine for llm agents as general programs.arXiv preprint arXiv:2502.13965(2025)

work page arXiv 2025

[48] [48]

Yuyu Luo, Guoliang Li, Ju Fan, and Nan Tang. 2026. Data Agents: Levels, State of the Art, and Open Problems. InCompanion of the International Conference on Management of Data. 571–579

2026

[49] [49]

Microsoft. 2026. Autogen: Open-Source Framework for Agentic AI. https: //www.microsoft.com/en-us/research/project/autogen

2026

[50] [50]

MiniMax. 2026. Mini Agent. https://github.com/MiniMax-AI/Mini-Agent

2026

[51] [51]

NVIDIA. 2026. NVIDIA Inference Xfer Library (NIXL). https://github.com/ai- dynamo/nixl

2026

[52] [52]

James Pan and Guoliang Li. 2025. Database Perspective on LLM Inference Systems.Proceedings of the VLDB Endowment18, 12 (2025), 5504–5507

2025

[53] [53]

Zaifeng Pan, Yitong Ding, Yue Guan, Zheng Wang, Zhongkai Yu, Xulong Tang, Yida Wang, and Yufei Ding. 2025. FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference. InProceedings of Machine Learning and Systems

2025

[54] [54]

Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, and Yufei Ding. 2025. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows.arXiv preprint arXiv:2507.07400(2025)

work page arXiv 2025

[55] [55]

Zaifeng Pan, Yipeng Shen, Zhengding Hu, Zhuang Wang, Aninda Manocha, Zheng Wang, Zhongkai Yu, Yue Guan, and Yufei Ding. 2026. ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management.arXiv preprint arXiv:2601.21473(2026)

work page arXiv 2026

[56] [56]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132

2024

[57] [57]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading more storage for less computation—a {KVCache-centric} architecture for serving {LLM} chatbot. In23rd USENIX Conference on File and Storage Technologies (FAST 25). 155–170

2025

[58] [58]

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [59]

Jiawei Ren, Yan Zhuang, Xiaokang Ye, Lingjun Mao, Xuhong He, Jianzhi Shen, Mrinaal Dogra, Yiming Liang, Ruixuan Zhang, Tianai Yue, Yiqing Yang, Eric Liu, Ryan Wu, Kevin Benavente, Rajiv Mandya Nagaraju, Muhammad Faayez, Xiyan Zhang, Dhruv Vivek Sharma, Xianrui Zhong, Ziqiao Ma, Tianmin Shu, Zhiting Hu, and Lianhui Qin. 2025. Simworld: An open-ended realis...

work page arXiv 2025

[60] [60]

Rya Sanovar, Srikant Bharadwaj, Renee St Amant, Victor Rühle, and Saravan Rajmohan. 2025. LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers.Proceedings of Machine Learning and Systems7 (2025)

2025

[61] [61]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learn- ing.Advances in Neural Information Processing Systems36 (2023), 8634–8652

2023

[62] [62]

Peter Steinberger and OpenClaw Community. 2026. OpenClaw: Personal AI Assistant. https://openclaw.ai

2026

[63] [63]

StepFun, Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin...

work page arXiv 2025

[64] [64]

Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. 2025. Towards end-to-end optimization of llm-based applications with ayo. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 1302–1316

2025

[65] [65]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yas- mine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhos- ale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony H...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[66] [66]

Noppanat Wadlom, Junyi Shen, and Yao Lu. 2026. Efficient LLM serving for agentic workflows: A data systems perspective.Proceedings of the ACM on Management of Data4, 3 (SIGMOD) (2026), 1–29

2026

[67] [67]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for A...

2025

[68] [68]

Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Man Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. 2025. Mirage: A {Multi-Level} superoptimizer for tensor programs. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). 21–38

2025

[69] [69]

Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, and Shuaiwen Leon Song. 2023. Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Un- structured Sparsity.Proceedings of the VLDB Endowment17, 2 (2023), 211–224

2023

[70] [70]

Zhiqiang Xie, Hao Kang, Ying Sheng, Tushar Krishna, Kayvon Fatahalian, and Christos Kozyrakis. 2025. Ai metropolis: Scaling large language model-based multi-agent simulation with out-of-order execution.Proceedings of Machine Learning and Systems7 (2025)

2025

[71] [71]

Qian Xu, Juan Yang, Feng Zhang, Junda Pan, Kang Chen, Youren Shen, Amelie Chi Zhou, and Xiaoyong Du. 2025. Tribase: A vector data query engine for reliable and lossless pruning compression using triangle inequalities.Proceedings of the ACM on Management of Data3, 1 (2025), 1–28

2025

[72] [72]

Qian Xu, Feng Zhang, Chengxi Li, Lei Cao, Zheng Chen, Jidong Zhai, and Xi- aoyong Du. 2025. Harmony: A scalable distributed vector database for high- throughput approximate nearest neighbor search.Proceedings of the ACM on Management of Data3, 4 (2025), 1–28

2025

[73] [73]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[74] [74]

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems37 (2024), 50528–50652

2024

[75] [75]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. Cacheblend: Fast large language model serving for rag with cached knowledge fusion. InProceedings of the twentieth European conference on computer systems. 94–109

2025

[76] [76]

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems35 (2022), 20744–20757

2022

[77] [77]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR)

2023

[78] [78]

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze

[79] [79]

InEighth Conference on Machine Learning and Systems

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. InEighth Conference on Machine Learning and Systems

[80] [80]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung- Gon Chun. 2022. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX symposium on operating systems design and implementation (OSDI 22). 521–538

2022