Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference
Pith reviewed 2026-05-16 13:20 UTC · model grok-4.3
The pith
Sutradhara integrates orchestrator and LLM engine to overlap tool execution and improve cache reuse in agentic inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sutradhara is an intelligent orchestrator-engine co-design for tool-based agentic inference that uses a thin API to enable three optimizations: tool-aware prompt splitting to overlap tool execution with prefill, streaming tool execution during decode, and orchestrator-aware cache management with semantic hints to boost hit rates.
What carries the argument
A thin API enabling cross-layer optimizations between the orchestrator and the LLM serving engine, specifically tool-aware prompt splitting, incremental tool streaming, and semantic cache hints.
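As a rough illustration of tool-aware prompt splitting, the sketch below overlaps a tool call with prefill of the part of the next prompt that does not depend on the tool's output. It is a toy asyncio simulation, not the paper's implementation: `run_tool`, `prefill`, the per-token cost, and the prefix/suffix split are all hypothetical stand-ins.

```python
import asyncio
import time


async def run_tool(name: str, latency_s: float) -> str:
    """Hypothetical stand-in for an external tool call."""
    await asyncio.sleep(latency_s)
    return f"<result of {name}>"


async def prefill(tokens: list[str], per_token_s: float = 0.001) -> int:
    """Hypothetical stand-in for engine prefill; returns tokens processed."""
    await asyncio.sleep(len(tokens) * per_token_s)
    return len(tokens)


async def sequential(prefix: list[str], suffix: list[str], tool_latency_s: float) -> int:
    """Decoupled baseline: wait for the tool, then prefill the whole prompt."""
    result = await run_tool("search", tool_latency_s)
    return await prefill(prefix + [result] + suffix)


async def overlapped(prefix: list[str], suffix: list[str], tool_latency_s: float) -> int:
    """Split prompt: prefill the tool-independent prefix while the tool runs."""
    tool_task = asyncio.create_task(run_tool("search", tool_latency_s))
    done = await prefill(prefix)           # overlaps with the tool call
    result = await tool_task               # tool output is now available
    return done + await prefill([result] + suffix)


if __name__ == "__main__":
    prefix, suffix = ["tok"] * 100, ["tok"] * 5
    for fn in (sequential, overlapped):
        start = time.perf_counter()
        asyncio.run(fn(prefix, suffix, 0.1))
        print(fn.__name__, round(time.perf_counter() - start, 3))
```

Only the tokens that precede the tool output can be prefilled early under causal attention, which is why the suffix is still processed after the tool returns in this sketch.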
If this is right
- Systems can sustain up to 77% higher load at the same median first-token-rendered latency.
- Median first-token-rendered latency can be reduced by up to 15% at the same load.
- End-to-end latency can be reduced by up to 11% on A100 GPUs.
- The throughput-latency trade-off improves for agentic workloads.
- KV cache hit rates improve, so the substantial context reuse across iterations is no longer lost to thrashing.
Where Pith is reading between the lines
- Similar co-design principles might extend to other iterative LLM tasks like multi-step reasoning without tools.
- Production deployments could see cost savings from higher efficiency under high load.
- Future agentic systems might incorporate these overlaps by default in serving frameworks.
- Testing on diverse tool sets could reveal how robust the semantic hints are across domains.
Load-bearing premise
That the three optimizations add negligible overhead and that the analyzed production traces represent typical future agentic workloads.
What would settle it
Running the system on a new agentic workload with different tool call frequencies or patterns where the latency improvements are not observed or overhead becomes significant.
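To make the sensitivity to tool-call patterns concrete, here is a toy per-iteration latency model. It is an assumption for illustration only and does not come from the paper: it treats one iteration as tool time plus prefill time, and assumes the tool can be fully overlapped with prefill of the tool-independent prefix.

```python
def ftr_sequential(tool_s: float, prefix_prefill_s: float, suffix_prefill_s: float) -> float:
    """Toy model: the tool runs first, then the whole prompt is prefilled."""
    return tool_s + prefix_prefill_s + suffix_prefill_s


def ftr_overlapped(tool_s: float, prefix_prefill_s: float, suffix_prefill_s: float) -> float:
    """Toy model: the tool runs concurrently with prefill of the prefix."""
    return max(tool_s, prefix_prefill_s) + suffix_prefill_s


def relative_saving(tool_s: float, prefix_prefill_s: float, suffix_prefill_s: float) -> float:
    """Fraction of sequential latency removed by overlapping."""
    seq = ftr_sequential(tool_s, prefix_prefill_s, suffix_prefill_s)
    ovl = ftr_overlapped(tool_s, prefix_prefill_s, suffix_prefill_s)
    return (seq - ovl) / seq


if __name__ == "__main__":
    # Sweep tool latency against a fixed 1.0 s prefix prefill (made-up numbers).
    for tool_s in (0.0, 0.5, 1.0, 10.0):
        print(tool_s, round(relative_saving(tool_s, 1.0, 0.0), 3))
```

Under these assumptions the saving is `min(tool, prefix prefill)` per iteration, so it vanishes when tools are near-instant and shrinks as a fraction when tools dominate. A workload at either extreme is where the reported gains would be most likely to weaken.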
Original abstract
Agentic applications are LLMs that iteratively invoke external tools to accomplish complex tasks. Such tool-based agents are rapidly becoming the dominant paradigm for deploying language models in production. Unlike traditional single-turn inference, agentic workloads chain together multiple LLM calls and tool executions before producing a final response, creating a new performance bottleneck that manifests as increased latency in First Token Rendered (FTR) of the final answer. Through analysis of requests at production scale, we reveal three critical challenges: tool calls account for 30-85% of FTR latency, KV cache hit rates collapse despite substantial context reuse across iterations, and sequential orchestration wastes potential intra-request parallelism. These bottlenecks stem from a design gap in which orchestrators and LLM engines operate as decoupled black boxes, preventing cross-layer optimizations. We present Sutradhara, a co-designed agentic inference system that integrates orchestration with LLM serving through a thin API enabling three optimizations: overlap tool execution with subsequent LLM prefill using tool-aware prompt splitting, streaming tool execution to dispatch tools incrementally during decode rather than waiting for complete output, and orchestrator-aware cache management that uses semantic hints to improve hit rates and reduce thrashing. Implemented on vLLM, Sutradhara improves the throughput-latency trade-off in agentic systems, sustains up to 77% higher load at the same median FTR latency, or reduces median FTR latency by up to 15% at the same load while reducing end-to-end latency by upto 11% on A100 GPUs.
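The streaming tool execution idea in the abstract can be sketched as an incremental parser that hands each tool call to a dispatcher as soon as its closing brace is decoded. The JSON-object-per-call output format and the `stream_tool_calls` name are assumptions made for this example; the paper's actual wire format is not given here.

```python
import json
from typing import Iterable, Iterator


def stream_tool_calls(chunks: Iterable[str]) -> Iterator[dict]:
    """Yield each top-level JSON object as soon as it is complete.

    Tracks brace depth and string state so that braces inside string
    values do not end a call early.
    """
    buf: list[str] = []
    depth = 0
    in_string = False
    escaped = False
    for chunk in chunks:
        for ch in chunk:
            if depth == 0 and ch != "{":
                continue  # skip separators between calls: commas, brackets, spaces
            buf.append(ch)
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    yield json.loads("".join(buf))  # ready to dispatch now
                    buf.clear()


def dispatch_points(chunks: list[str]) -> list[tuple[str, int]]:
    """Return (tool name, index of the decode chunk at which it became dispatchable)."""
    seen: list[int] = []

    def feed() -> Iterator[str]:
        for i, chunk in enumerate(chunks):
            seen.append(i)
            yield chunk

    return [(call["name"], seen[-1]) for call in stream_tool_calls(feed())]


if __name__ == "__main__":
    decoded = ['[{"name": "sea', 'rch", "args": {"q": "a}b"}}, {"na',
               'me": "calc", "args": {"expr": "1+1"}}]']
    print(dispatch_points(decoded))
```

In this example the first call becomes dispatchable while the second is still being decoded, which is the window a streaming orchestrator could use to start the tool early.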
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Sutradhara, a co-designed orchestrator-engine system for tool-based agentic LLM inference. Analysis of production-scale traces identifies three bottlenecks: tool calls contributing 30-85% of first-token-rendered (FTR) latency, collapsing KV cache hit rates despite context reuse, and sequential orchestration missing intra-request parallelism. Sutradhara exposes a thin API enabling three optimizations—tool-aware prompt splitting to overlap tool execution with prefill, incremental tool streaming during decode, and semantic-hint cache management—and implements them on vLLM. The system improves the throughput-latency trade-off, sustaining up to 77% higher load at the same median FTR latency or reducing median FTR latency by up to 15% at the same load while cutting end-to-end latency by up to 11% on A100 GPUs.
Significance. If the empirical results hold, the work is significant because it directly tackles an emerging performance bottleneck in production agentic systems by bridging the previously decoupled orchestrator and engine layers. The concrete mechanisms, use of real production traces for motivation, and vLLM implementation provide a practical, reproducible contribution that could shape future LLM serving designs for multi-turn tool-using workloads. The absence of closed-form derivations is appropriate for an engineering system paper; the value lies in the measured deltas.
major comments (2)
- §5 (Evaluation): The 77% higher sustainable load and 15% FTR latency reduction claims rest on throughput-latency curves; the manuscript must explicitly state whether the baseline vLLM configuration uses identical batching, scheduling, and tool-execution emulation as the Sutradhara variant, otherwise the deltas may partly reflect differences in the underlying engine rather than the co-design.
- §3.1–3.3 (Optimizations): The three optimizations are described at a high level; the paper should supply pseudocode or a precise API specification for the thin interface (e.g., how semantic hints are passed and how prompt splitting is performed) so that the claimed negligible overhead can be independently verified.
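For a sense of what passing semantic hints might look like, the sketch below shows a small cache whose eviction order is steered by orchestrator hints. Everything in it is hypothetical: the `Hint` values, the `HintedCache` class, and the policy (evict entries marked finished before falling back to least recently used) are illustrative assumptions, not the interface the paper defines.

```python
from collections import OrderedDict
from enum import Enum
from typing import Optional


class Hint(Enum):
    """Hypothetical orchestrator hints about a request's cached context."""
    WILL_RESUME = "will_resume"  # request is waiting on a tool and will return
    FINISHED = "finished"        # request is complete; its context is not needed


class HintedCache:
    """Toy KV-cache index: evicts FINISHED entries first, then plain LRU."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries: "OrderedDict[str, Hint]" = OrderedDict()

    def touch(self, key: str, hint: Hint = Hint.WILL_RESUME) -> bool:
        """Access a request's context; returns True on a cache hit."""
        hit = key in self._entries
        if not hit and len(self._entries) >= self.capacity:
            self._evict_one()
        self._entries[key] = hint
        self._entries.move_to_end(key)  # mark as most recently used
        return hit

    def hint(self, key: str, hint: Hint) -> None:
        """Update the hint for a cached entry without changing its recency."""
        if key in self._entries:
            self._entries[key] = hint

    def _evict_one(self) -> None:
        victim: Optional[str] = next(
            (k for k, h in self._entries.items() if h is Hint.FINISHED), None
        )
        if victim is None:
            victim = next(iter(self._entries))  # least recently used
        del self._entries[victim]


if __name__ == "__main__":
    cache = HintedCache(capacity=2)
    cache.touch("A")                  # A pauses for a tool call
    cache.touch("B")
    cache.hint("B", Hint.FINISHED)    # orchestrator reports that B is done
    cache.touch("C")                  # evicts B, although A is older
    print(cache.touch("A"))           # A resumes and still hits
```

A plain LRU policy would have evicted A in this trace, so the paused request would miss on resume. That is the kind of thrashing the hints are meant to avoid.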
minor comments (3)
- Abstract: 'upto 11%' should be written 'up to 11%' for standard English usage.
- §4 (Implementation): The description of the thin API would benefit from an explicit listing of the new calls or data structures exposed to the orchestrator.
- Figures: Ensure all throughput-latency plots include error bars or variance across runs and clearly label the exact workload traces used.
Simulated Author's Rebuttal
We thank the referee for the positive recommendation of minor revision and for the constructive comments on evaluation clarity and implementation details. We address each major comment below.
- Referee: §5 (Evaluation): The 77% higher sustainable load and 15% FTR latency reduction claims rest on throughput-latency curves; the manuscript must explicitly state whether the baseline vLLM configuration uses identical batching, scheduling, and tool-execution emulation as the Sutradhara variant, otherwise the deltas may partly reflect differences in the underlying engine rather than the co-design.
  Authors: We agree that explicit clarification is required. The baseline is the stock vLLM (v0.4.x) with identical batching, continuous batching scheduler, and KV cache configuration; tool execution is emulated identically in both arms via the same mock tool handlers and latency model derived from production traces. The only differences are the three Sutradhara extensions. We will insert a new paragraph in §5.1 explicitly documenting these shared settings and confirming that all engine-level parameters are held constant.
  Revision: yes
- Referee: §3.1–3.3 (Optimizations): The three optimizations are described at a high level; the paper should supply pseudocode or a precise API specification for the thin interface (e.g., how semantic hints are passed and how prompt splitting is performed) so that the claimed negligible overhead can be independently verified.
  Authors: We accept the request for greater precision. In the revised manuscript we will add a new subsection 3.4 containing (i) the exact thin-API signatures, (ii) pseudocode for tool-aware prompt splitting, incremental streaming dispatch, and semantic-hint cache tagging, and (iii) measured overhead numbers (already collected) showing <2% additional CPU time. This will allow independent verification while preserving the engineering focus of the paper.
  Revision: yes
Circularity Check
No circularity: empirical system evaluation with direct measurements
The paper describes an implemented co-design (tool-aware prompt splitting, incremental streaming, semantic-hint cache management) on vLLM and reports throughput-latency gains from A100 GPU runs on production traces. No equations, first-principles derivations, or predictions appear in the provided text; claims rest on concrete mechanisms and measured deltas rather than any reduction to fitted inputs or self-citation chains. This is a standard engineering paper with independent empirical content.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Production agentic traces are representative of future workloads
- standard math: KV cache behavior follows standard attention semantics
Forward citations
Cited by 2 Pith papers
- IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents
  IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
- The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
  The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.