
arXiv: 2604.25899 · v2 · submitted 2026-04-28 · 💻 cs.MA · cs.DC · cs.SY · eess.SY

Pythia: Exploiting Workflow Predictability for Efficient Agent-Native LLM Serving

Pith reviewed 2026-05-15 07:30 UTC · model grok-4.3

classification 💻 cs.MA · cs.DC · cs.SY · eess.SY
keywords multi-agent LLM serving · workflow predictability · agent-native systems · prefix cache optimization · serving-layer semantics · LLM resource management

The pith

Pythia captures workflow structure in multi-agent LLM systems at the serving layer to raise throughput and shorten completion times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-agent LLM applications break tasks into specialized collaborating components, which creates repeatable patterns in request flow and timing. Existing serving systems treat these workloads as ordinary traffic and miss chances to improve caching, scheduling, and scaling. Pythia adds a simple interface that lets the serving layer read the workflow topology directly. With this information the system can make better decisions that reduce cache misses, ease contention on long contexts, and cut queuing delays. If the approach works, agent-based applications run faster and use resources more effectively without rewriting the agents themselves.

Core claim

Pythia is a multi-agent serving system that captures workflow semantics through a simple interface at the serving layer, unlocking new optimization opportunities and substantially improving throughput and job completion time over state-of-the-art baselines.

What carries the argument

A simple serving-layer interface that records the structured topology of multi-agent workflows so the scheduler, cache, and scaler can exploit predictable request patterns.
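
As a rough illustration of what recording workflow topology could look like, the sketch below models agents as nodes and possible calls as directed edges. Everything here is an assumption made for illustration: the `WorkflowGraph` class, its methods, and the `coder` and `tester` agent names are hypothetical and are not taken from the paper, which does not show its interface on this page.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowGraph:
    """Hypothetical workflow declaration: agents are nodes, calls are edges."""
    edges: dict[str, list[str]] = field(default_factory=dict)

    def add_edge(self, src: str, dst: str) -> None:
        # Register both endpoints so every agent appears as a node.
        self.edges.setdefault(src, []).append(dst)
        self.edges.setdefault(dst, [])

    def successors(self, agent: str) -> list[str]:
        # Agents that may be called directly after `agent`.
        return list(self.edges.get(agent, []))

    def reachable_within(self, agent: str, hops: int) -> set[str]:
        # Agents that may be invoked within `hops` calls after `agent`.
        frontier, seen = {agent}, set()
        for _ in range(hops):
            frontier = {n for a in frontier for n in self.edges.get(a, [])} - seen
            seen |= frontier
        return seen


# Illustrative coding-assistant topology (agent names are assumptions).
graph = WorkflowGraph()
graph.add_edge("decomposer", "coder")
graph.add_edge("decomposer", "tester")
graph.add_edge("coder", "summarizer")
graph.add_edge("tester", "summarizer")
```

With a graph like this, a serving layer could ask `graph.successors("decomposer")` to learn which agents are likely to issue requests next, which is the kind of predictive signal the scheduler, cache, and scaler are described as consuming.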

If this is right

  • Prefix cache hit rates rise because future agent requests become predictable from the workflow graph.
  • Long-context requests cause less contention when the scheduler can anticipate their arrival and duration.
  • Queuing delays drop through scaling decisions that match observed workflow burst patterns.
  • Overall job completion time improves because the system avoids treating every agent step as independent traffic.
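
The first consequence above can be made concrete with a toy eviction policy. The sketch below is not Pythia's cache manager; it is a minimal, hypothetical variant of LRU that spares cached prefixes belonging to agents the workflow graph says are about to run. The class name, the one-prefix-per-agent simplification, and the `upcoming` parameter are all assumptions.

```python
from collections import OrderedDict


class WorkflowAwareCache:
    """Toy prefix cache: LRU eviction that spares soon-to-run agents."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Maps agent name -> cached prefix id, ordered oldest to newest use.
        self.entries = OrderedDict()

    def touch(self, agent: str, prefix_id: str, upcoming: set) -> None:
        """Record a use of `agent`'s prefix, evicting if over capacity."""
        self.entries[agent] = prefix_id
        self.entries.move_to_end(agent)
        while len(self.entries) > self.capacity:
            # Prefer the oldest entry whose agent is not predicted to run soon.
            victim = next(
                (a for a in self.entries if a not in upcoming and a != agent),
                None,
            )
            if victim is None:
                # Every other entry is predicted to be needed: plain LRU.
                victim = next(a for a in self.entries if a != agent)
            del self.entries[victim]


cache = WorkflowAwareCache(capacity=2)
cache.touch("coder", "p1", upcoming=set())
cache.touch("decomposer", "p0", upcoming=set())
# Plain LRU would evict "coder" here; the hint keeps it and drops "decomposer".
cache.touch("tester", "p2", upcoming={"coder"})
```

The design choice to fall back to plain LRU when every candidate is predicted to be needed keeps the policy no worse than the baseline in the case where the hint offers no discrimination.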

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Workflow interfaces like this could apply to other structured AI pipelines that have repeatable call sequences, such as tool-use chains or planning loops.
  • Adoption would encourage developers to expose more workflow metadata when they design agents, amplifying the gains.
  • The same interface might allow cross-workflow sharing of cached prefixes when multiple users run similar agent topologies.

Load-bearing premise

The structured topology of multi-agent workflows exposes enough semantic predictability that a simple interface at the serving layer can capture and use it without large overhead or loss of flexibility.

What would settle it

Measure whether Pythia still outperforms baselines on multi-agent workloads whose agent call graphs are deliberately made highly variable and unpredictable.
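
One way to make "highly variable and unpredictable" measurable when building such a stress workload is to score traces by how uncertain the next agent is given the current one. The sketch below computes that conditional entropy from agent-call traces. The metric is a suggestion made here, not one the paper is reported to use, and the trace format (a list of agent-name sequences) is an assumption.

```python
import math
from collections import Counter, defaultdict


def transition_entropy(traces):
    """Conditional entropy H(next agent | current agent), in bits."""
    counts = defaultdict(Counter)
    for trace in traces:
        # Count each observed (current, next) agent transition.
        for cur, nxt in zip(trace, trace[1:]):
            counts[cur][nxt] += 1
    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts.values():
        n = sum(c.values())
        for k in c.values():
            # Weight each source agent by how often it occurs.
            h -= (n / total) * (k / n) * math.log2(k / n)
    return h


fixed = [["a", "b", "c"]] * 4          # same path every time
variable = [["a", "b"], ["a", "c"]]    # next agent is a coin flip
fixed_score = transition_entropy(fixed)
variable_score = transition_entropy(variable)
```

A score of zero means every transition is deterministic; higher scores mean the call graph gives the serving layer less to predict from, so sweeping this score upward would probe where the reported gains start to erode.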

Figures

Figures reproduced from arXiv: 2604.25899 by Ennan Zhai, Harry Xu, Jiarong Xing, Jinyuan Zhang, Junyi Shu, Kun Qian, Lingjun Zhu, Qingda Lu, Shan Yu, Shuo Yang, Xin Jin, Xuanzhe Liu, Xue Li, Yang Wang, Youyou Lu, Yuanjiang Ni, Ziyi Xu.

Figure 1: Examples of multi-agent workflows. Production trace analysis. To demonstrate how existing black-box approaches stifle efficiency, we analyzed large-scale production traces from our agent-serving service. We conducted in-depth profiling of an internal multi-agent coding assistant. Our analysis (§2) exposes three fundamental challenges in serving agentic workloads that contradict common assumptions. First,…
Figure 3: Timeline of the coding agent workflow: each bar represents
Figure 5: Outstanding requests of the multi-agent coding assistant
Figure 6: Pythia overview. Predictive information consumption. Pythia uses the aforementioned predictive information at three distinct locations of the serving pipeline: per-node (agent) prefix cache management, global request scheduling, as well as per-node model scaling, which, respectively, correspond to the three major challenges faced by existing techniques (§2). Operating at the node level, the cache manager…
Figure 7: End-to-end experiments. Panel (a): Pythia's semantic-aware keep/drop and speculative prefetch improve TTFT; the plot shows TTFT (ms, log scale) for the Decomposer (D) and Summarizer (S) agents under SGLang, +Sem. keep/drop, +Sem. drop +L2 pref., and +Sem. drop +L1 pref. A further panel plots Input Cache Hit Ratio (0% to 100%) for the same configurations, split into L1 (GPU) and L2 (C…
Figure 8: Workflow-aware speculative prefix cache management
Figure 10: Queuing delay under different scaling strategies.
Figure 12: Predicted output lengths vs. actual lengths.
Original abstract

As LLM applications grow more complex, developers are increasingly adopting multi-agent architectures to decompose workflows into specialized, collaborative components, introducing structure that constrains agent behavior and exposes useful semantic predictability. Unlike traditional LLM serving, which operates under highly dynamic and uncertain conditions, this structured topology enables opportunities to reduce runtime uncertainty, yet existing systems fail to exploit it, treating agentic workloads as generic traffic and incurring significant inefficiencies. Our analysis of production traces from an agent-serving platform and an internal coding assistant reveals key bottlenecks, including low prefix cache hit rates, severe resource contention from long-context requests, and substantial queuing delays due to suboptimal scaling. To address these challenges, we propose Pythia, a multi-agent serving system that captures workflow semantics through a simple interface at the serving layer, unlocking new optimization opportunities and substantially improving throughput and job completion time over state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes Pythia, a multi-agent LLM serving system that exploits the structured topology and semantic predictability of agent workflows via a simple serving-layer interface. Analysis of production traces identifies bottlenecks such as low prefix cache hit rates, resource contention from long-context requests, and queuing delays; Pythia addresses these to achieve higher throughput and lower job completion times than state-of-the-art baselines.

Significance. If the empirical gains hold, the work is significant for LLM serving research because it demonstrates that workflow predictability in multi-agent systems can be captured with low overhead at the serving layer, yielding measurable improvements in cache efficiency, contention reduction, and scaling. The trace-driven evaluation and system design provide a concrete foundation for future agent-native optimizations.

minor comments (2)
  1. Abstract claims 'substantially improving' throughput and JCT but provides no quantitative deltas or baseline names; adding one sentence with key metrics would improve immediate impact.
  2. The interface description in §3 could benefit from a small pseudocode listing or explicit API signature to clarify the 'simple interface' claim for implementers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and the recommendation to accept the manuscript. The review correctly identifies the core contribution of Pythia in leveraging workflow predictability for multi-agent LLM serving, and we are pleased that the trace-driven evaluation and system design are viewed as providing a foundation for future optimizations.

Circularity Check

0 steps flagged

No significant circularity


The paper is a systems proposal for Pythia, a multi-agent LLM serving system, with no mathematical derivations, equations, or fitted parameters present in the manuscript. Central claims rest on empirical trace analysis from production workloads and system implementation details that are independent of the proposed optimizations; workflow predictability is observed externally from traces rather than defined into existence by the system itself. No self-citation chains, self-definitional steps, or reductions of predictions to inputs occur.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that multi-agent workflows provide exploitable semantic predictability that a simple interface can capture.

axioms (1)
  • domain assumption: Multi-agent LLM workflows exhibit sufficient semantic predictability due to their structured topology.
    Stated as the basis for new optimization opportunities in the abstract.

pith-pipeline@v0.9.0 · 5513 in / 1067 out tokens · 70274 ms · 2026-05-15T07:30:40.850165+00:00 · methodology

