Pythia: Exploiting Workflow Predictability for Efficient Agent-Native LLM Serving

Shan Yu, Junyi Shu, Yuanjiang Ni, Kun Qian, Xue Li, Yang Wang, Jinyuan Zhang, Ziyi Xu, Shuo Yang, Lingjun Zhu, Ennan Zhai, Qingda Lu, Jiarong Xing, Youyou Lu, Xin Jin, Xuanzhe Liu, Harry Xu

arxiv: 2604.25899 · v2 · submitted 2026-04-28 · cs.MA · cs.DC · cs.SY · eess.SY


Pith reviewed 2026-05-15 07:30 UTC · model grok-4.3


keywords: multi-agent LLM serving · workflow predictability · agent-native systems · prefix cache optimization · serving-layer semantics · LLM resource management


The pith

Pythia captures workflow structure in multi-agent LLM systems at the serving layer to raise throughput and shorten completion times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-agent LLM applications break tasks into specialized collaborating components, which creates repeatable patterns in request flow and timing. Existing serving systems treat these workloads as ordinary traffic and miss chances to improve caching, scheduling, and scaling. Pythia adds a simple interface that lets the serving layer read the workflow topology directly. With this information the system can make better decisions that reduce cache misses, ease contention on long contexts, and cut queuing delays. If the approach works, agent-based applications run faster and use resources more effectively without rewriting the agents themselves.

Core claim

Pythia is a multi-agent serving system that captures workflow semantics through a simple interface at the serving layer, unlocking new optimization opportunities and substantially improving throughput and job completion time over state-of-the-art baselines.

What carries the argument

A simple serving-layer interface that records the structured topology of multi-agent workflows so the scheduler, cache, and scaler can exploit predictable request patterns.
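
To make the idea of a topology-recording interface concrete, here is a minimal sketch of how a workflow graph might be declared and queried. It is an illustration only, not Pythia's actual API: the `WorkflowGraph` class, its methods, and the `coder` agent are hypothetical names; `decomposer` and `summarizer` are borrowed from the agent labels in the Figure 7 caption.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowGraph:
    """Hypothetical workflow description handed to the serving layer."""
    name: str
    # Directed edges: agent -> agents it may invoke next.
    edges: dict[str, list[str]] = field(default_factory=dict)

    def add_edge(self, src: str, dst: str) -> None:
        """Record that `src` may hand off to `dst`."""
        self.edges.setdefault(src, []).append(dst)
        self.edges.setdefault(dst, [])  # make sure dst is a known node

    def successors(self, agent: str) -> list[str]:
        """Agents that may issue the next request after `agent`."""
        return list(self.edges.get(agent, []))

    def is_terminal(self, agent: str) -> bool:
        """True if no agent follows `agent` in this workflow."""
        return not self.edges.get(agent)


def build_example() -> WorkflowGraph:
    # Assumed three-stage pipeline, used purely for illustration.
    g = WorkflowGraph("coding-assistant")
    g.add_edge("decomposer", "coder")
    g.add_edge("coder", "summarizer")
    return g
```

A scheduler, cache manager, or scaler could call `successors()` when a request finishes to learn which agents are likely to issue requests next; whether the real interface exposes this as a graph, as annotations on requests, or in some other form is not stated on this page.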

If this is right

Prefix cache hit rates rise because future agent requests become predictable from the workflow graph.
Long-context requests cause less contention when the scheduler can anticipate their arrival and duration.
Queuing delays drop through scaling decisions that match observed workflow burst patterns.
Overall job completion time improves because the system avoids treating every agent step as independent traffic.
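
The first consequence above, better prefix cache decisions from a known workflow graph, can be sketched as a small planning function. This is a simplified illustration for a single workflow instance, assuming the graph is given as a plain adjacency dict; the function names and the keep/drop/prefetch policy are assumptions and should not be read as the paper's algorithm.

```python
def reachable(edges, start):
    """Return the set of agents that can still run after `start` finishes."""
    seen = set()
    stack = list(edges.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen


def plan_cache(edges, finished_agent, cached_agents):
    """Decide what to do with per-agent prefixes once `finished_agent` completes.

    - prefetch: direct successors whose prefix is not cached yet
    - keep:     cached prefixes of agents that can still run
    - drop:     cached prefixes of agents that cannot run again
    """
    next_agents = set(edges.get(finished_agent, []))
    future = reachable(edges, finished_agent)
    cached = set(cached_agents)
    return {
        "prefetch": sorted(next_agents - cached),
        "keep": sorted(a for a in cached if a in future),
        "drop": sorted(a for a in cached if a not in future),
    }


# Hypothetical three-stage workflow; "coder" is an invented agent name.
EDGES = {"decomposer": ["coder"], "coder": ["summarizer"], "summarizer": []}
PLAN = plan_cache(EDGES, "decomposer", ["decomposer", "summarizer"])
```

In this toy case the decomposer's prefix becomes a drop candidate because the graph says it will not run again, while the coder's prefix is a prefetch candidate. A production system would also have to weigh memory pressure, sharing across concurrent workflow instances, and cache tiers, none of which this sketch models.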

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Workflow interfaces like this could apply to other structured AI pipelines that have repeatable call sequences, such as tool-use chains or planning loops.
Adoption would encourage developers to expose more workflow metadata when they design agents, amplifying the gains.
The same interface might allow cross-workflow sharing of cached prefixes when multiple users run similar agent topologies.

Load-bearing premise

The structured topology of multi-agent workflows exposes enough semantic predictability that a simple interface at the serving layer can capture and use it without large overhead or loss of flexibility.

What would settle it

Measure whether Pythia still outperforms baselines on multi-agent workloads whose agent call graphs are deliberately made highly variable and unpredictable.
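
One way to quantify how "variable and unpredictable" a workload's call graph is, before running such an experiment, is to compute the entropy of each agent's next-agent distribution from traces. The sketch below assumes each trace is simply an ordered list of agent names; the trace format and the choice of Shannon entropy as the metric are assumptions made here, not something the paper specifies.

```python
import math
from collections import Counter, defaultdict


def transition_entropy(traces):
    """Per-agent Shannon entropy (in bits) of the next-agent distribution.

    0 bits means the next agent is fully determined; higher values mean
    the workflow is harder to predict from the current agent alone.
    """
    counts = defaultdict(Counter)
    for trace in traces:
        for cur, nxt in zip(trace, trace[1:]):
            counts[cur][nxt] += 1

    entropy = {}
    for agent, nexts in counts.items():
        total = sum(nexts.values())
        entropy[agent] = -sum(
            (c / total) * math.log2(c / total) for c in nexts.values()
        )
    return entropy
```

Sweeping a synthetic workload from near-zero entropy toward the maximum and plotting throughput against it would show where the workflow-aware optimizations stop paying off.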

Figures

Figures reproduced from arXiv: 2604.25899 by Ennan Zhai, Harry Xu, Jiarong Xing, Jinyuan Zhang, Junyi Shu, Kun Qian, Lingjun Zhu, Qingda Lu, Shan Yu, Shuo Yang, Xin Jin, Xuanzhe Liu, Xue Li, Yang Wang, Youyou Lu, Yuanjiang Ni, Ziyi Xu.

**Figure 1.** Examples of multi-agent workflows. Production trace analysis. To demonstrate how existing black-box approaches stifle efficiency, we analyzed large-scale production traces from our agent-serving service. We conducted in-depth profiling of an internal multi-agent coding assistant. Our analysis (§2) exposes three fundamental challenges in serving agentic workloads that contradict common assumptions. First,…

**Figure 3.** Timeline of the coding agent workflow: each bar represents…

**Figure 5.** Outstanding requests of the multi-agent coding assistant

**Figure 6.** Pythia overview. Predictive information consumption. Pythia uses the aforementioned predictive information at three distinct locations of the serving pipeline: per-node (agent) prefix cache management, global request scheduling, as well as per-node model scaling, which, respectively, correspond to the three major challenges faced by existing techniques (§2). Operating at the node level, the cache manager…

**Figure 7.** End-to-end experiments. (a) Pythia’s semantic-aware keep/drop and speculative prefetch improve TTFT. Configurations compared: SGLang; +Sem. keep/drop; +Sem. drop +L2 pref.; +Sem. drop +L1 pref. Agents: Decomposer (D) and Summarizer (S). Axes: TTFT (ms, log scale) and Input Cache Hit Ratio by cache tier (L1 (GPU), L2 (C…

**Figure 8.** Workflow-aware speculative prefix cache management

**Figure 10.** Queuing delay under different scaling strategies.

**Figure 12.** Predicted output lengths vs. actual lengths.

Original abstract

As LLM applications grow more complex, developers are increasingly adopting multi-agent architectures to decompose workflows into specialized, collaborative components, introducing structure that constrains agent behavior and exposes useful semantic predictability. Unlike traditional LLM serving, which operates under highly dynamic and uncertain conditions, this structured topology enables opportunities to reduce runtime uncertainty, yet existing systems fail to exploit it, treating agentic workloads as generic traffic and incurring significant inefficiencies. Our analysis of production traces from an agent-serving platform and an internal coding assistant reveals key bottlenecks, including low prefix cache hit rates, severe resource contention from long-context requests, and substantial queuing delays due to suboptimal scaling. To address these challenges, we propose Pythia, a multi-agent serving system that captures workflow semantics through a simple interface at the serving layer, unlocking new optimization opportunities and substantially improving throughput and job completion time over state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague


Pythia shows that multi-agent LLM workflows have enough structure for a simple serving interface to cut cache misses and contention with measurable gains. The paper's core contribution is the Pythia system, which adds a lightweight way to expose workflow semantics at the serving layer instead of treating everything as generic traffic. Production traces from an agent platform and internal coding assistant back this up by highlighting real bottlenecks like low prefix cache hits, long-context contention, and queuing delays, then demonstrating how the structured topology reduces uncertainty enough to improve throughput and job completion time over baselines.

The trace-driven evaluation looks internally consistent and the design avoids overclaiming zero overhead or perfect predictability. A minor soft spot is that the benefits depend on workflows staying close to the observed patterns; if agents become more dynamic than the traces, the interface might need tuning to avoid flexibility trade-offs or added latency. The evaluation is trace-based rather than live deployment, so some edge cases could surface later, but nothing in the analysis suggests the central assumption fails in the tested regimes.

This is for systems researchers and engineers tuning LLM serving for agentic applications. Anyone working on production multi-agent setups would get practical value from the bottleneck breakdown and the interface idea. It deserves peer review because the problem is timely, the evidence is grounded in real traces, and the approach is straightforward without circular reasoning or unsupported leaps.

Referee Report

0 major / 2 minor

Summary. The paper proposes Pythia, a multi-agent LLM serving system that exploits the structured topology and semantic predictability of agent workflows via a simple serving-layer interface. Analysis of production traces identifies bottlenecks such as low prefix cache hit rates, resource contention from long-context requests, and queuing delays; Pythia addresses these to achieve higher throughput and lower job completion times than state-of-the-art baselines.

Significance. If the empirical gains hold, the work is significant for LLM serving research because it demonstrates that workflow predictability in multi-agent systems can be captured with low overhead at the serving layer, yielding measurable improvements in cache efficiency, contention reduction, and scaling. The trace-driven evaluation and system design provide a concrete foundation for future agent-native optimizations.

minor comments (2)

Abstract claims 'substantially improving' throughput and JCT but provides no quantitative deltas or baseline names; adding one sentence with key metrics would improve immediate impact.
The interface description in §3 could benefit from a small pseudocode listing or explicit API signature to clarify the 'simple interface' claim for implementers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and the recommendation to accept the manuscript. The review correctly identifies the core contribution of Pythia in leveraging workflow predictability for multi-agent LLM serving, and we are pleased that the trace-driven evaluation and system design are viewed as providing a foundation for future optimizations.

Circularity Check

0 steps flagged

No significant circularity


The paper is a systems proposal for Pythia, a multi-agent LLM serving system, with no mathematical derivations, equations, or fitted parameters present in the manuscript. Central claims rest on empirical trace analysis from production workloads and system implementation details that are independent of the proposed optimizations; workflow predictability is observed externally from traces rather than defined into existence by the system itself. No self-citation chains, self-definitional steps, or reductions of predictions to inputs occur.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that multi-agent workflows provide exploitable semantic predictability that a simple interface can capture.

axioms (1)

domain assumption: Multi-agent LLM workflows exhibit sufficient semantic predictability due to their structured topology.
Stated as the basis for new optimization opportunities in the abstract.

pith-pipeline@v0.9.0 · 5513 in / 1067 out tokens · 70274 ms · 2026-05-15T07:30:40.850165+00:00 · methodology

