arxiv: 2601.12967 · v3 · submitted 2026-01-19 · 💻 cs.DC

Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference

Pith reviewed 2026-05-16 13:20 UTC · model grok-4.3

classification 💻 cs.DC
keywords agentic systems · LLM inference · tool calling · orchestration · KV cache · latency optimization · co-design · vLLM

The pith

Sutradhara integrates orchestrator and LLM engine to overlap tool execution and improve cache reuse in agentic inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic applications chain multiple LLM calls with tool executions, which creates three latency bottlenecks: time spent waiting on tool calls, poor KV cache reuse, and strictly sequential processing. The paper attributes these to orchestrators and engines operating as decoupled black boxes. Sutradhara introduces a co-design with a thin API that overlaps tool execution with LLM prefill, dispatches tools incrementally during decoding, and passes semantic hints for better cache management. On production-derived workloads running on A100 GPUs, the paper reports up to 77% higher sustainable load at the same median first-token-rendered (FTR) latency, or up to 15% lower median FTR latency at the same load.

Core claim

Sutradhara is an intelligent orchestrator-engine co-design for tool-based agentic inference that uses a thin API to enable three optimizations: tool-aware prompt splitting to overlap tool execution with prefill, streaming tool execution during decode, and orchestrator-aware cache management with semantic hints to boost hit rates.

What carries the argument

A thin API enabling cross-layer optimizations between the orchestrator and the LLM serving engine, specifically tool-aware prompt splitting, incremental tool streaming, and semantic cache hints.
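
A minimal sketch of the overlap that tool-aware prompt splitting is meant to buy, assuming the next prompt can be divided into a tool-independent prefix and a suffix that holds the tool output. The functions `run_tool`, `prefill`, `sequential_iteration`, and `split_iteration`, and all latency constants, are hypothetical stand-ins for illustration; they are not the paper's API or vLLM calls.

```python
import asyncio
import time

# All latencies below are made-up illustration values, not measurements.
TOOL_S = 0.20            # assumed external tool latency
PREFIX_PREFILL_S = 0.10  # assumed prefill time for the tool-independent prompt part
SUFFIX_PREFILL_S = 0.05  # assumed prefill time for the tool output


async def run_tool(name: str) -> str:
    """Stand-in for an external tool call (hypothetical)."""
    await asyncio.sleep(TOOL_S)
    return f"<output of {name}>"


async def prefill(segment: str, cost_s: float) -> None:
    """Stand-in for engine prefill of one prompt segment (hypothetical)."""
    await asyncio.sleep(cost_s)


async def sequential_iteration(prefix: str) -> float:
    """Baseline: wait for the tool, then prefill the whole next prompt."""
    start = time.perf_counter()
    tool_output = await run_tool("search")
    await prefill(prefix, PREFIX_PREFILL_S)
    await prefill(tool_output, SUFFIX_PREFILL_S)
    return time.perf_counter() - start


async def split_iteration(prefix: str) -> float:
    """Split prompt: prefill the tool-independent part while the tool runs."""
    start = time.perf_counter()
    tool_task = asyncio.create_task(run_tool("search"))
    await prefill(prefix, PREFIX_PREFILL_S)       # overlaps with the tool
    tool_output = await tool_task
    await prefill(tool_output, SUFFIX_PREFILL_S)  # only the dependent suffix remains
    return time.perf_counter() - start


if __name__ == "__main__":
    prefix = "system prompt + history + new instructions"
    print(f"sequential: {asyncio.run(sequential_iteration(prefix)):.2f}s")
    print(f"split:      {asyncio.run(split_iteration(prefix)):.2f}s")
```

Under these assumed numbers the split variant hides the prefix prefill behind the tool call, so the saving per iteration is bounded by the smaller of the tool latency and the prefix prefill time.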

If this is right

  • Systems can sustain up to 77% higher load at the same median first-token-rendered latency.
  • Median first-token-rendered latency can be reduced by up to 15% at the same load.
  • End-to-end latency can be reduced by up to 11% on A100 GPUs.
  • The throughput-latency trade-off improves for agentic workloads.
  • KV cache hit rates improve, because context reused across iterations is no longer lost to thrashing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar co-design principles might extend to other iterative LLM tasks like multi-step reasoning without tools.
  • Production deployments could see cost savings from higher efficiency under high load.
  • Future agentic systems might incorporate these overlaps by default in serving frameworks.
  • Testing on diverse tool sets could reveal how robust the semantic hints are across domains.

Load-bearing premise

That the three optimizations add negligible overhead and that the analyzed production traces represent typical future agentic workloads.

What would settle it

Running the system on a new agentic workload with different tool-call frequencies or patterns and finding that the latency improvements disappear or the overhead becomes significant.

Figures

Figures reproduced from arXiv: 2601.12967 by Alind Khare, Anish Biswas, Anjaly Parayil, Chetan Bansal, Jayashree Mohan, Kanishk Goel, Ramachandran Ramjee, Srivarshinee S.

Figure 1
SUTRADHARA reduces FTR and e2e latency (by 15% and 10% respectively across the trace) by systematically parallelizing the execution of LLM and tools, along with workload-aware KV eviction. For two random requests in the trace, these techniques reduce FTR by 18–35%.
    … iteratively invoke LLMs and external tools to accomplish complex tasks. These systems represent a fundamental shift in how we deploy LLMs. Ra…
Figure 2
Workflow: (1) User request arrives; (2) Orchestrator
Figure 3
Statistics of the agentic trace in production
Figure 4
Tool call execution dynamics
    3.4 Trace statistics. We first present the overall trace characteristics. For each request, we categorize the number of LLM iterations into two types: intermediate iterations that perform tool invocations, and a final iteration that generates the user-visible response and doesn't result in any tool calls
Figure 5
CDF of prompt independent of tool output
Figure 6
Thrashing due to workload-agnostic KV eviction
Figure 7
Intra-request parallel execution in SUTRADHARA
    4.1 Overview. SUTRADHARA extends the standard LLM serving architecture with a thin coordination layer that enables the orchestrator to communicate semantic hints about agentic request structure to the engine. As before, the orchestrator maintains knowledge of iteration boundaries, prompt composition, and tool dependencies, while the engine controls scheduling…
Figure 8
Workload-aware KV eviction policy
    … scheduling them in first-come-first-served (FCFS) order without awareness of which agentic request they belong to. This is suboptimal for minimizing request-level latency. Consider an agentic request R1 that arrives first and executes iteration 1, followed by request R2 arriving shortly after. When R1's iteration 2 becomes ready (after its tools complete), the engine see…
Figure 9
Comparison of FTR and e2e latency on the Tool-heavy trace.
Figure 10
Breakdown of FTR for 5 randomly selected re…
Figure 11
Comparison of FTR and e2e latency for Gemma
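
The FCFS problem described in the text captured with Figure 8 can be illustrated with a small sketch. The excerpt is truncated, so the alternative ordering shown here (sorting ready iterations by the arrival time of their parent request) is an assumption about what a request-aware policy could look like, not the paper's stated scheduler. The `Iteration` type and both ordering functions are hypothetical.

```python
from typing import List, NamedTuple


class Iteration(NamedTuple):
    request_id: str
    iteration: int
    request_arrival: float  # when the parent agentic request arrived
    ready_time: float       # when this iteration became runnable


def fcfs_order(queue: List[Iteration]) -> List[str]:
    """Engine-level FCFS: order by when each iteration became ready."""
    return [it.request_id for it in sorted(queue, key=lambda it: it.ready_time)]


def request_aware_order(queue: List[Iteration]) -> List[str]:
    """Assumed alternative: older agentic requests go first, ties by readiness."""
    ordered = sorted(queue, key=lambda it: (it.request_arrival, it.ready_time))
    return [it.request_id for it in ordered]


# R1 arrived first; its second iteration becomes ready only after its tools finish.
queue = [
    Iteration("R2", 1, request_arrival=0.1, ready_time=0.1),
    Iteration("R3", 1, request_arrival=0.3, ready_time=0.3),
    Iteration("R1", 2, request_arrival=0.0, ready_time=0.5),
]

if __name__ == "__main__":
    print("FCFS:         ", fcfs_order(queue))
    print("request-aware:", request_aware_order(queue))
```

With plain FCFS, R1's second iteration queues behind newer requests even though R1 has been in the system longest; the request-aware ordering moves it to the front.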
Original abstract

Agentic applications are LLMs that iteratively invoke external tools to accomplish complex tasks. Such tool-based agents are rapidly becoming the dominant paradigm for deploying language models in production. Unlike traditional single-turn inference, agentic workloads chain together multiple LLM calls and tool executions before producing a final response, creating a new performance bottleneck that manifests as increased latency in First Token Rendered (FTR) of the final answer. Through analysis of requests at production scale, we reveal three critical challenges: tool calls account for 30-85% of FTR latency, KV cache hit rates collapse despite substantial context reuse across iterations, and sequential orchestration wastes potential intra-request parallelism. These bottlenecks stem from a design gap in which orchestrators and LLM engines operate as decoupled black boxes, preventing cross-layer optimizations. We present Sutradhara, a co-designed agentic inference system that integrates orchestration with LLM serving through a thin API enabling three optimizations: overlap tool execution with subsequent LLM prefill using tool-aware prompt splitting, streaming tool execution to dispatch tools incrementally during decode rather than waiting for complete output, and orchestrator-aware cache management that uses semantic hints to improve hit rates and reduce thrashing. Implemented on vLLM, Sutradhara improves the throughput-latency trade-off in agentic systems, sustains up to 77% higher load at the same median FTR latency, or reduces median FTR latency by up to 15% at the same load while reducing end-to-end latency by upto 11% on A100 GPUs.
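
To make "dispatch tools incrementally during decode" concrete, here is a minimal sketch of a streaming dispatcher. It assumes, purely for illustration, that the model emits one JSON tool call per newline-terminated line; the actual output format and the function name `stream_dispatch` are hypothetical and not taken from the paper.

```python
import json
from typing import Callable, Iterable


def stream_dispatch(chunks: Iterable[str], dispatch: Callable[[dict], None]) -> int:
    """Dispatch each tool call as soon as its line is fully decoded.

    Assumes (hypothetically) one JSON tool call per newline-terminated line.
    Returns the number of tool calls dispatched.
    """
    buffer = ""
    dispatched = 0
    for chunk in chunks:
        buffer += chunk
        # A newline marks a complete call; handle every complete call in the buffer.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                dispatch(json.loads(line))
                dispatched += 1
    if buffer.strip():  # final call that arrived without a trailing newline
        dispatch(json.loads(buffer))
        dispatched += 1
    return dispatched


if __name__ == "__main__":
    decoded = ['{"tool": "sea', 'rch", "q": "a"}\n{"tool"', ': "calc", "expr": "1+1"}\n']
    stream_dispatch(decoded, lambda call: print("dispatching", call["tool"]))
```

In this sketch the first call is dispatched while the second is still being decoded, which is the source of the overlap; a non-streaming orchestrator would start both tools only after the final chunk.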

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents Sutradhara, a co-designed orchestrator-engine system for tool-based agentic LLM inference. Analysis of production-scale traces identifies three bottlenecks: tool calls contributing 30-85% of first-token-rendered (FTR) latency, collapsing KV cache hit rates despite context reuse, and sequential orchestration missing intra-request parallelism. Sutradhara exposes a thin API enabling three optimizations—tool-aware prompt splitting to overlap tool execution with prefill, incremental tool streaming during decode, and semantic-hint cache management—and implements them on vLLM. The system improves the throughput-latency trade-off, sustaining up to 77% higher load at the same median FTR latency or reducing median FTR latency by up to 15% at the same load while cutting end-to-end latency by up to 11% on A100 GPUs.

Significance. If the empirical results hold, the work is significant because it directly tackles an emerging performance bottleneck in production agentic systems by bridging the previously decoupled orchestrator and engine layers. The concrete mechanisms, use of real production traces for motivation, and vLLM implementation provide a practical, reproducible contribution that could shape future LLM serving designs for multi-turn tool-using workloads. The absence of closed-form derivations is appropriate for an engineering system paper; the value lies in the measured deltas.

major comments (2)
  1. §5 (Evaluation): The 77% higher sustainable load and 15% FTR latency reduction claims rest on throughput-latency curves; the manuscript must explicitly state whether the baseline vLLM configuration uses identical batching, scheduling, and tool-execution emulation as the Sutradhara variant, otherwise the deltas may partly reflect differences in the underlying engine rather than the co-design.
  2. §3.1–3.3 (Optimizations): The three optimizations are described at a high level; the paper should supply pseudocode or a precise API specification for the thin interface (e.g., how semantic hints are passed and how prompt splitting is performed) so that the claimed negligible overhead can be independently verified.
minor comments (3)
  1. Abstract: 'upto 11%' should be written 'up to 11%' for standard English usage.
  2. §4 (Implementation): The description of the thin API would benefit from an explicit listing of the new calls or data structures exposed to the orchestrator.
  3. Figures: Ensure all throughput-latency plots include error bars or variance across runs and clearly label the exact workload traces used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation of minor revision and for the constructive comments on evaluation clarity and implementation details. We address each major comment below.

Point-by-point responses
  1. Referee: §5 (Evaluation): The 77% higher sustainable load and 15% FTR latency reduction claims rest on throughput-latency curves; the manuscript must explicitly state whether the baseline vLLM configuration uses identical batching, scheduling, and tool-execution emulation as the Sutradhara variant, otherwise the deltas may partly reflect differences in the underlying engine rather than the co-design.

    Authors: We agree that explicit clarification is required. The baseline is the stock vLLM (v0.4.x) with identical batching, continuous batching scheduler, and KV cache configuration; tool execution is emulated identically in both arms via the same mock tool handlers and latency model derived from production traces. The only differences are the three Sutradhara extensions. We will insert a new paragraph in §5.1 explicitly documenting these shared settings and confirming that all engine-level parameters are held constant.

    revision: yes

  2. Referee: §3.1–3.3 (Optimizations): The three optimizations are described at a high level; the paper should supply pseudocode or a precise API specification for the thin interface (e.g., how semantic hints are passed and how prompt splitting is performed) so that the claimed negligible overhead can be independently verified.

    Authors: We accept the request for greater precision. In the revised manuscript we will add a new subsection 3.4 containing (i) the exact thin-API signatures, (ii) pseudocode for tool-aware prompt splitting, incremental streaming dispatch, and semantic-hint cache tagging, and (iii) measured overhead numbers (already collected) showing <2% additional CPU time. This will allow independent verification while preserving the engineering focus of the paper.

    revision: yes
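
Until such pseudocode is available, the idea behind semantic-hint cache tagging can be sketched as follows. This is an illustration under stated assumptions, not the authors' interface: it models one cache entry per request, and the class `HintedKVCache` with its `hint_active` and `hint_done` methods is hypothetical. The orchestrator marks requests that are still waiting on tools, and eviction prefers entries of requests that are not marked, falling back to plain LRU otherwise.

```python
from collections import OrderedDict
from typing import Set


class HintedKVCache:
    """Toy KV cache with one entry per request (a simplifying assumption)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks: "OrderedDict[str, None]" = OrderedDict()  # oldest first
        self.active: Set[str] = set()  # requests the orchestrator expects to return

    def hint_active(self, request_id: str) -> None:
        """Orchestrator hint: this request will issue another LLM iteration."""
        self.active.add(request_id)

    def hint_done(self, request_id: str) -> None:
        """Orchestrator hint: this request has produced its final response."""
        self.active.discard(request_id)

    def access(self, request_id: str) -> bool:
        """Return True on a cache hit; on a miss, insert and evict if full."""
        if request_id in self.blocks:
            self.blocks.move_to_end(request_id)
            return True
        if len(self.blocks) >= self.capacity:
            self._evict()
        self.blocks[request_id] = None
        return False

    def _evict(self) -> None:
        # Prefer the least recently used entry whose request is not marked active.
        victim = next((r for r in self.blocks if r not in self.active), None)
        if victim is None:
            victim = next(iter(self.blocks))  # everything is active: plain LRU
        del self.blocks[victim]


if __name__ == "__main__":
    cache = HintedKVCache(capacity=2)
    cache.hint_active("R1")  # R1 is waiting on a tool and will come back
    for rid in ["R1", "R2", "R3", "R1"]:
        print(rid, "hit" if cache.access(rid) else "miss")
```

Without the hint, the same access pattern evicts R1 just before it returns from its tool call, which is the kind of thrashing a workload-agnostic policy produces.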

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with direct measurements

Full rationale

The paper describes an implemented co-design (tool-aware prompt splitting, incremental streaming, semantic-hint cache management) on vLLM and reports throughput-latency gains from A100 GPU runs on production traces. No equations, first-principles derivations, or predictions appear in the provided text; claims rest on concrete mechanisms and measured deltas rather than any reduction to fitted inputs or self-citation chains. This is a standard engineering paper with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on empirical production measurements and standard assumptions about LLM inference engines rather than new mathematical axioms or fitted parameters.

axioms (2)
  • domain assumption: Production agentic traces are representative of future workloads
    The three bottlenecks are derived from analysis of requests at production scale.
  • standard math: KV cache behavior follows standard attention semantics
    Cache hit-rate collapse is treated as a known property of transformer inference.

pith-pipeline@v0.9.0 · 5612 in / 1349 out tokens · 24522 ms · 2026-05-16T13:20:16.195529+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

    cs.AI · 2026-05 · conditional · novelty 7.0

    IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.

  2. The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

    cs.LG · 2026-03 · unverdicted · novelty 5.0

    The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.
