pith. machine review for the scientific record.

arxiv: 2603.21354 · v2 · submitted 2026-03-22 · 💻 cs.LG · cs.DC

Recognition: 2 theorem links

· Lean Theorem

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 06:36 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords LLM inference optimization · semantic routing · workload characterization · router architecture · inference pool · fleet provisioning · vision paper · 3x3 interaction matrix

The pith

The Workload-Router-Pool architecture organizes LLM inference optimization into a 3x3 interaction matrix that maps prior results and flags twenty-one open research directions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper consolidates earlier vLLM Semantic Router studies into one three-dimensional framework called Workload-Router-Pool. Workload captures request patterns such as chat versus agentic or prefill-heavy versus decode-heavy. Router covers dispatching rules from static semantic matching to online adaptation and reinforcement learning. Pool covers hardware choices from homogeneous GPUs to disaggregated prefill-decode setups. Plotting past contributions onto the resulting 3x3 matrix shows which intersections are already addressed and which remain empty, then lists twenty-one concrete next steps, each backed by measurements from those earlier experiments. This matters because scaling LLM fleets requires coordinated decisions across all three dimensions rather than separate tweaks.

Core claim

The Workload-Router-Pool architecture is a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves, Router determines how each request is dispatched, and Pool defines where inference runs. Mapping prior work onto the 3x3 interaction matrix identifies covered cells and open cells, and the paper proposes twenty-one concrete research directions at the intersections, each grounded in prior measurements and tiered by maturity.

What carries the argument

The Workload-Router-Pool (WRP) architecture, a three-dimensional framework that places workload types, routing policies, and execution pools on the axes of a 3x3 matrix to expose covered areas and open research cells.
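The matrix is simple enough to sketch in code. The axis values below are drawn from the abstract's descriptions of the three dimensions; the paper does not spell out exactly how three axes collapse into a 3x3 grid, so this sketch assumes a pairwise workload × pool view, and the cells marked "covered" are hypothetical placements, not the paper's actual mapping:

```python
from itertools import product

# Representative values for each WRP axis, taken from the abstract.
WORKLOAD = ["chat", "agentic", "multimodal"]
ROUTER = ["static_semantic", "online_bandit", "rl_selection"]
POOL = ["homogeneous", "heterogeneous", "disaggregated_pd"]

# One sketchable 3x3 view: workload crossed with pool. Every cell starts
# "open"; mapping prior work flips cells to "covered".
matrix = {cell: "open" for cell in product(WORKLOAD, POOL)}
matrix[("chat", "homogeneous")] = "covered"          # hypothetical placement
matrix[("agentic", "disaggregated_pd")] = "covered"  # hypothetical placement

open_cells = [cell for cell, status in matrix.items() if status == "open"]
print(len(matrix), len(open_cells))  # 9 7
```

The paper's gap analysis is exactly this kind of enumeration: every (workload, pool) pair is a cell, and the open cells become candidate research directions.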

If this is right

  • Fleet provisioning decisions must change when routing policies shift or when workload mixes move toward agentic and multimodal traffic.
  • Safety mechanisms such as policy conflict detection and hallucination checks become more effective when combined with context-length-aware pool routing.
  • Energy-efficiency gains require joint selection of router policies and heterogeneous pool configurations rather than independent tuning.
  • Agentic workloads with multi-turn memory and tool selection create new cells in the matrix that need dedicated routing and pool designs.
  • Standards for inference routing protocols and multi-provider APIs must account for interactions across all three WRP dimensions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the matrix proves stable across new model families, it could serve as a shared taxonomy for comparing commercial serving systems.
  • Extending the framework with a fourth axis for network topology might be needed if cross-node KV-cache movement dominates latency.
  • Prioritizing the engineering-ready directions first would let teams measure concrete throughput or cost improvements before tackling open research cells.
  • The same matrix structure could be tested on non-LLM workloads such as diffusion model serving to check whether the three dimensions generalize.

Load-bearing premise

The three dimensions of workload, router, and pool are sufficient to capture the main interactions that matter for LLM inference optimization.

What would settle it

Demonstration of a dominant factor, such as regulatory data-locality rules or network-topology effects, that cannot be assigned to any cell in the 3x3 WRP matrix would show the framework is incomplete.

read the original abstract

Over the past year, the vLLM Semantic Router project has released a series of work spanning: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing -- multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety; (4) governance and standards -- inference routing protocols and multi-provider API extensions. Each paper tackled a specific problem in LLM inference, but the problems are not independent; for example, fleet provisioning depends on the routing policy, which depends on the workload mix, shifting as organizations adopt agentic and multimodal workloads. This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves (chat vs. agent, single-turn vs. multi-turn, warm vs. cold, prefill-heavy vs. decode-heavy). Router determines how each request is dispatched (static semantic rules, online bandit adaptation, RL-based model selection, quality-aware cascading). Pool defines where inference runs (homogeneous vs. heterogeneous GPU, disaggregated prefill/decode, KV-cache topology). We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in our prior measurements, tiered by maturity from engineering-ready to open research.
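The abstract's "online bandit adaptation" router can be illustrated with a minimal epsilon-greedy sketch. The pool names, the per-workload reward bookkeeping, and the tie-breaking rule are assumptions for illustration only, not the project's implementation:

```python
import random

# Illustrative epsilon-greedy router: per workload class, learn which pool
# yields the best observed reward (e.g., quality per dollar).
POOLS = ["homogeneous_gpu", "heterogeneous_gpu", "disaggregated_pd"]

class BanditRouter:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.stats = {}  # (workload, pool) -> [total_reward, count]

    def route(self, workload: str) -> str:
        if random.random() < self.epsilon:
            return random.choice(POOLS)  # explore
        def mean(pool):
            total, n = self.stats.get((workload, pool), (0.0, 0))
            return total / n if n else float("inf")  # try unseen pools first
        return max(POOLS, key=mean)  # exploit best-known pool

    def feedback(self, workload: str, pool: str, reward: float):
        total, n = self.stats.get((workload, pool), (0.0, 0))
        self.stats[(workload, pool)] = (total + reward, n + 1)
```

Usage would interleave `route` and `feedback` per request; the abstract's point is that the stationary policy this converges to depends on the workload mix, which is why fleet provisioning cannot be tuned independently of routing.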

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Workload-Router-Pool (WRP) architecture as a three-dimensional framework for LLM inference optimization. It distills prior vLLM Semantic Router results into characterizations of Workload (chat vs. agentic, single- vs. multi-turn, prefill- vs. decode-heavy), Router (semantic rules, bandit adaptation, RL selection), and Pool (homogeneous/heterogeneous GPUs, disaggregated prefill/decode, KV-cache topology), maps these onto a 3x3 interaction matrix to identify covered and open cells, and proposes 21 concrete research directions grounded in the project's prior measurements and tiered by maturity.

Significance. If the framework holds, the WRP matrix could serve as a useful organizing taxonomy for LLM inference research, systematically highlighting gaps at the intersections of workload characteristics, routing policies, and execution pools. The explicit grounding of the 21 directions in existing measurements from the vLLM project is a strength that could help prioritize engineering-ready versus open-research items.

major comments (2)
  1. [WRP architecture definition and matrix construction] Section defining the WRP dimensions and 3x3 matrix: The central claim that these three axes fully capture key interactions and allow complete mapping of prior work rests on the unargued assumption that factors such as network topology (e.g., KV-cache transfer latency across nodes) or regulatory constraints (data residency) can be reduced to Workload/Router/Pool without loss of fidelity. No explicit justification or reduction argument is provided, which directly affects the identification of 'open cells' and the completeness of the proposed directions.
  2. [Proposal of twenty-one research directions] Section on the 21 research directions: Several directions (particularly those involving fleet governance and multi-provider standards) are listed without explicit mapping back to specific open cells in the 3x3 matrix or to concrete prior measurements that would ground their feasibility. This weakens the claim that the directions are systematically derived from the matrix analysis.
minor comments (2)
  1. [Abstract and introduction] The abstract and introduction could more explicitly separate the synthesis of previously published vLLM results from any novel conceptual contribution of the WRP framing itself.
  2. [Matrix and directions section] Notation for the matrix cells and direction tiers would benefit from a small summary table to improve readability when the 21 directions are enumerated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and grounding of the WRP framework. We address both major comments below and will revise the manuscript to strengthen the justification of the dimensions and the explicit mapping of research directions.

read point-by-point responses
  1. Referee: [WRP architecture definition and matrix construction] Section defining the WRP dimensions and 3x3 matrix: The central claim that these three axes fully capture key interactions and allow complete mapping of prior work rests on the unargued assumption that factors such as network topology (e.g., KV-cache transfer latency across nodes) or regulatory constraints (data residency) can be reduced to Workload/Router/Pool without loss of fidelity. No explicit justification or reduction argument is provided, which directly affects the identification of 'open cells' and the completeness of the proposed directions.

    Authors: We acknowledge the absence of an explicit reduction argument. In revision we will insert a dedicated paragraph in Section 2 explaining the scope: network topology effects are already subsumed under the Pool dimension through the KV-cache topology sub-axis (our prior disaggregated prefill/decode measurements quantify cross-node transfer latencies), while regulatory constraints such as data residency are treated as workload attributes (privacy-sensitive vs. general chat) or pool restrictions (geo-fenced GPU sets). We will also state explicitly that WRP is offered as a practical organizing taxonomy derived from vLLM measurements rather than a claim of theoretical completeness, and we will note possible extensions for factors outside the current axes. This addition will make the identification of open cells more transparent. revision: yes

  2. Referee: [Proposal of twenty-one research directions] Section on the 21 research directions: Several directions (particularly those involving fleet governance and multi-provider standards) are listed without explicit mapping back to specific open cells in the 3x3 matrix or to concrete prior measurements that would ground their feasibility. This weakens the claim that the directions are systematically derived from the matrix analysis.

    Authors: We agree that the linkage should be stated more explicitly. The revised manuscript will include a summary table (new Table 3) that, for each of the 21 directions, lists the target (Workload, Router, Pool) cell and cites the specific prior vLLM measurement or paper that grounds its feasibility. For the fleet-governance and multi-provider directions, we will map them to the open cells at the intersection of heterogeneous pools with adaptive routers and reference the energy-efficiency analysis and multi-provider API extension results already obtained in the project. This will demonstrate the systematic derivation from the matrix. revision: yes
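The reduction proposed in the first response, treating data-residency rules as pool restrictions applied before any routing policy runs, can be made concrete with a small sketch. Region names and pool attributes here are hypothetical:

```python
from dataclasses import dataclass

# Sketch of the rebuttal's reduction: a residency constraint becomes a
# filter over geo-fenced pool sets; the router only ever sees the survivors.
@dataclass(frozen=True)
class Pool:
    name: str
    region: str

POOLS = [Pool("us-hetero", "us"), Pool("eu-homog", "eu"), Pool("eu-pd", "eu")]

def eligible_pools(pools, required_region=None):
    """Apply a data-residency rule as a pool filter before routing."""
    if required_region is None:
        return list(pools)
    return [p for p in pools if p.region == required_region]

eu_only = eligible_pools(POOLS, "eu")
print([p.name for p in eu_only])  # ['eu-homog', 'eu-pd']
```

Whether this filter-then-route composition preserves fidelity is exactly the referee's open question: a constraint that interacts with the routing decision itself, rather than merely shrinking the candidate set, would not reduce this cleanly.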

Circularity Check

2 steps flagged

WRP 3x3 matrix and 21 directions reduce to re-labeling of authors' own prior vLLM results

specific steps
  1. renaming known result [Abstract]
    "This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. ... We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in our prior measurements"

    The 3x3 matrix is populated exclusively by re-mapping the authors' own prior vLLM papers (listed in the abstract as the source of the distillation); the identification of covered cells and the 21 directions are therefore direct outputs of that re-mapping rather than new derivations.

  2. self citation load bearing [Abstract]
    "Over the past year, the vLLM Semantic Router project has released a series of work spanning: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing -- multimodal agent routing, tool selec"

    The claim that WRP is a sufficient three-dimensional framework rests on the completeness of the enumerated self-cited project outputs; no external criterion or reduction is provided to show why network topology, regulatory constraints, or other factors can be omitted without loss.

full rationale

The paper's central derivation consists of defining the WRP dimensions from the authors' listed prior publications, then mapping those same publications onto the new 3x3 matrix to identify covered/open cells and generate 21 directions. This process is self-contained within the authors' body of work with no external derivation, benchmark, or independent validation step shown; the 'framework' and proposals are therefore equivalent to a reorganization of the input citations by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the three chosen dimensions fully span the space of LLM inference interactions and on the introduction of the WRP framework itself as an organizing entity without external validation.

axioms (1)
  • domain assumption The three dimensions of Workload, Router, and Pool are sufficient to capture all key interactions in LLM inference optimization.
    Invoked when defining the 3x3 interaction matrix and when claiming the framework organizes the space of problems.
invented entities (1)
  • Workload-Router-Pool (WRP) architecture no independent evidence
    purpose: To serve as a unified three-dimensional organizing framework for LLM inference optimization
    Newly proposed conceptual entity that structures the mapping of prior work and the 21 research directions.

pith-pipeline@v0.9.0 · 5656 in / 1396 out tokens · 64460 ms · 2026-05-15T06:36:51.065328+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · 12 internal anchors

  1. [1]

    vLLM semantic router: Signal driven decision routing for mixture-of-modality models.arXiv preprint arXiv:2603.04444, 2026

    vLLM Semantic Router Team. vLLM semantic router: Signal driven decision routing for mixture-of-modality models.arXiv preprint arXiv:2603.04444, 2026

  2. [2]

    Conflict-free policy languages for probabilistic ML predicates: A framework and case study with the semantic router DSL.arXiv preprint arXiv:2603.18174, 2026

    Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, and Xue Liu. Conflict-free policy languages for probabilistic ML predicates: A framework and case study with the semantic router DSL.arXiv preprint arXiv:2603.18174, 2026

  3. [3]

    98× faster LLM routing without a dedicated GPU: Flash attention, prompt compression, and near-streaming for the vLLM semantic router.arXiv preprint arXiv:2603.12646, 2026

    Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, and Huamin Chen. 98× faster LLM routing without a dedicated GPU: Flash attention, prompt compression, and near-streaming for the vLLM semantic router.arXiv preprint arXiv:2603.12646, 2026

  4. [4]

    mmBERT-embed-32k-2d-matryoshka: Multilingual embedding model with 2d matryoshka training

    vLLM Semantic Router Team. mmBERT-embed-32k-2d-matryoshka: Multilingual embedding model with 2d matryoshka training. Hugging Face model:llm-semantic-router/mmbert-embed- 32k-2d-matryoshka, 2025

  5. [5]

    When to reason: Semantic router for vLLM

    Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, and Huamin Chen. When to reason: Semantic router for vLLM. InNeurIPS Workshop on ML for Systems (MLForSys), 2025

  6. [6]

    Category- aware semantic caching for heterogeneous LLM workloads.arXiv preprint arXiv:2510.26835, 2025

    Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, and Huamin Chen. Category- aware semantic caching for heterogeneous LLM workloads.arXiv preprint arXiv:2510.26835, 2025

  7. [7]

    mmBERT-32k feedback detector: User satisfaction classification for online routing adaptation

    vLLM Semantic Router Team. mmBERT-32k feedback detector: User satisfaction classification for online routing adaptation. Hugging Face model:llm-semantic-router/mmbert32k-feedback- detector-lora, 2026

  8. [8]

    Token-level truth: Real-time hallucination detection for production LLMs

    vLLM Semantic Router Team. Token-level truth: Real-time hallucination detection for production LLMs. vLLM Blog, 2025.https://blog.vllm.ai/2025/12/14/halugate.html

  9. [9]

    mmBERT-32k factcheck classifier: Binary prompt classification for conditional hallucination detection

    vLLM Semantic Router Team. mmBERT-32k factcheck classifier: Binary prompt classification for conditional hallucination detection. Hugging Face model:llm-semantic-router/mmbert32k- factcheck-classifier-merged, 2026

  10. [10]

    MLCommons AI safety classifier – level 1 (binary): Safe vs

    vLLM Semantic Router Team. MLCommons AI safety classifier – level 1 (binary): Safe vs. unsafe content classification. Hugging Face model:llm-semantic-router/mlcommons-safety- classifier-level1-binary, 2026

  11. [11]

    MLCommons AI safety classifier – level 2 (9-class hazard): Hier- archical content safety classification

    vLLM Semantic Router Team. MLCommons AI safety classifier – level 2 (9-class hazard): Hier- archical content safety classification. Hugging Face model:llm-semantic-router/mlcommons- safety-classifier-level2-hazard, 2026

  12. [12]

    Token-budget- aware pool routing for cost-efficient LLM inference.arXiv preprint, 2026

    Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. Token-budget- aware pool routing for cost-efficient LLM inference.arXiv preprint, 2026. 38

  13. [13]

    FleetOpt: Analytical fleet provisioning for LLM inference with compress-and-route as implementation mechanism.arXiv preprint arXiv:2603.16514, 2026

    Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. FleetOpt: Analytical fleet provisioning for LLM inference with compress-and-route as implementation mechanism.arXiv preprint arXiv:2603.16514, 2026

  14. [14]

    inference- fleet-sim: A queueing-theory-grounded fleet capacity planner for LLM inference.arXiv preprint arXiv:2603.16054, 2026

    Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. inference- fleet-sim: A queueing-theory-grounded fleet capacity planner for LLM inference.arXiv preprint arXiv:2603.16054, 2026

  15. [15]

    The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

    Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. The 1/W law: An analytical study of context-length routing topology and GPU generation gains for LLM inference energy efficiency.arXiv preprint arXiv:2603.17280, 2026

  16. [16]

    Adaptive vision-language model routing for computer use agents.arXiv preprint arXiv:2603.12823, 2026

    Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, and Huamin Chen. Adaptive vision-language model routing for computer use agents.arXiv preprint arXiv:2603.12823, 2026

  17. [17]

    Outcome-aware tool selection for semantic routers: Latency-constrained learning without LLM inference.arXiv preprint arXiv:2603.13426, 2026

    Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, and Xue Liu. Outcome-aware tool selection for semantic routers: Latency-constrained learning without LLM inference.arXiv preprint arXiv:2603.13426, 2026

  18. [18]

    Visual confused deputy: Exploiting and defending perception failures in computer-using agents.arXiv preprint arXiv:2603.14707, 2026

    Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, and Huamin Chen. Visual confused deputy: Exploiting and defending perception failures in computer-using agents.arXiv preprint arXiv:2603.14707, 2026

  19. [19]

    OpenClaw: Personal AI assistant with a local-first gateway

    OpenClaw contributors. OpenClaw: Personal AI assistant with a local-first gateway. Open- source software (MIT License), 2026. Repository: https://github.com/openclaw/openclaw. Gateway WebSocket control plane for multi-channel agent sessions, tools, and model routing; documentation athttps://docs.openclaw.ai

  20. [20]

    Semantic inference routing protocol (SIRP)

    Huamin Chen and Luay Jalil. Semantic inference routing protocol (SIRP). Internet Engineering Task Force (IETF), 2025

  21. [21]

    Huamin Chen, Luay Jalil, and N. Cocker. Multi-provider extensions for agentic AI inference APIs. Internet Engineering Task Force (IETF), Network Management Research Group, 2025

  22. [22]

    LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, et al. LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. InProceedings of ICLR, 2024

  23. [23]

    Splitwise: Efficient generative LLM inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, et al. Splitwise: Efficient generative LLM inference using phase splitting. InProceedings of ISCA, 2024

  24. [24]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In Proceedings of ICLR, 2024

  25. [25]

    Patil, Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Ion Stoica, and Joseph E

    Shishir G. Patil, Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Ion Stoica, and Joseph E. Gonzalez. The Berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. InProceedings of ICML, 2025

  26. [26]

    ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

    Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, and Xin Jin. ServeGen: Workload characterization and generation of large language model serving in production.arXiv preprint arXiv:2505.09999, 2025

  27. [27]

    Drift-bench: Diagnosing cooperative breakdowns in LLM agents under input faults via multi-turn interaction.arXiv preprint arXiv:2602.02455, 2026

    Han Bao, Zheyuan Zhang, Pengcheng Jing, et al. Drift-bench: Diagnosing cooperative breakdowns in LLM agents under input faults via multi-turn interaction.arXiv preprint arXiv:2602.02455, 2026

  28. [28]

    AgentHallu: Benchmarking automated hallucination attribution of LLM-based agents.arXiv preprint arXiv:2601.06818, 2026

    Xuannan Liu, Xiao Yang, Zekun Li, et al. AgentHallu: Benchmarking automated hallucination attribution of LLM-based agents.arXiv preprint arXiv:2601.06818, 2026. 39

  29. [29]

    Memory for autonomous LLM agents: Mechanisms, evaluation, and emerging frontiers.arXiv preprint arXiv:2603.07670, 2026

    Pengfei Du. Memory for autonomous LLM agents: Mechanisms, evaluation, and emerging frontiers.arXiv preprint arXiv:2603.07670, 2026

  30. [30]

    Beyond the context window: A cost- performance analysis of fact-based memory vs

    Natchanon Pollertlam and Witchayut Kornsuwannawit. Beyond the context window: A cost- performance analysis of fact-based memory vs. long-context LLMs for persistent agents.arXiv preprint arXiv:2603.04814, 2026

  31. [31]

    ACON: Optimizing context compression for long-horizon LLM agents.arXiv preprint arXiv:2510.00615, 2025

    Minki Kang, Wei-Ning Chen, Dongge Han, et al. ACON: Optimizing context compression for long-horizon LLM agents.arXiv preprint arXiv:2510.00615, 2025

  32. [32]

    Active context compression: Autonomous memory management in LLM agents

    Nikhil Verma. Active context compression: Autonomous memory management in LLM agents. arXiv preprint arXiv:2601.07190, 2026

  33. [33]

    SAMULE: Self-learning agents enhanced by multi-level reflection

    Yubin Ge, Salvatore Romeo, Jason Cai, et al. SAMULE: Self-learning agents enhanced by multi-level reflection. InProceedings of EMNLP, 2025. arXiv:2509.20562

  34. [34]

    CORRECT: COndensed eRror RECognition via knowledge transfer in multi-agent systems.arXiv preprint arXiv:2509.24088, 2025

    Yifan Yu, Moyan Li, Shaoyuan Xu, et al. CORRECT: COndensed eRror RECognition via knowledge transfer in multi-agent systems.arXiv preprint arXiv:2509.24088, 2025

  35. [35]

    Mistake notebook learning: Batch-clustered failures for training-free agent adaptation.arXiv preprint arXiv:2512.11485, 2025

    Xuanbo Su, Yingfang Zhang, Hao Luo, et al. Mistake notebook learning: Batch-clustered failures for training-free agent adaptation.arXiv preprint arXiv:2512.11485, 2025

  36. [36]

    Dynamic system instructions and tool exposure for efficient agentic LLMs.arXiv preprint arXiv:2602.17046, 2026

    Uria Franko. Dynamic system instructions and tool exposure for efficient agentic LLMs.arXiv preprint arXiv:2602.17046, 2026

  37. [37]

    ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering

    Marianne Menglin Liu, Daniel Garcia, Fjona Parllaku, Vikas Upadhyay, Syed Fahad Allam Shah, and Dan Roth. ToolScope: Enhancing LLM agent tool use through tool merging and context-aware filtering.arXiv preprint arXiv:2510.20036, 2025

  38. [38]

    SMART: Self-aware agent for tool overuse mitigation.arXiv preprint arXiv:2502.11435, 2025

    Cheng Qian, Emre Can Acikgoz, Hongru Wang, et al. SMART: Self-aware agent for tool overuse mitigation.arXiv preprint arXiv:2502.11435, 2025

  39. [39]

    Budget-aware tool-use enables effective agent scaling

    Tengxiao Liu, Zifeng Wang, Jin Miao, et al. Budget-aware tool-use enables effective agent scaling. arXiv preprint arXiv:2511.17006, 2025

  40. [40]

    Transcending cost-quality tradeoff in agent serving via session-awareness

    Yanyu Ren, Li Chen, Dan Li, et al. Transcending cost-quality tradeoff in agent serving via session-awareness. InNeurIPS, 2025

  41. [41]

    Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

    Hanchen Li et al. Continuum: Efficient and robust multi-turn LLM agent scheduling with KV cache time-to-live.arXiv preprint arXiv:2511.02230, 2025

  42. [42]

    KV-Cache wins you can see: From prefix caching in vLLM to distributed scheduling with llm-d.https://llm-d.ai/blog/kvcache-wins-you-can-see, 2026

    llm-d Team. KV-Cache wins you can see: From prefix caching in vLLM to distributed scheduling with llm-d.https://llm-d.ai/blog/kvcache-wins-you-can-see, 2026

  43. [43]

    RFC: Context-aware KV-cache retention API (prioritized evictions).https: //github.com/vllm-project/vllm/issues/37003, 2026

    vLLM Community. RFC: Context-aware KV-cache retention API (prioritized evictions).https: //github.com/vllm-project/vllm/issues/37003, 2026

  44. [44]

    DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, et al. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. InProceedings of OSDI, 2024

  45. [45]

    LiteLLM tool permission guardrail.https://docs.litellm.ai/docs/proxy/guardrails/ tool_permission, 2026

    BerriAI. LiteLLM tool permission guardrail.https://docs.litellm.ai/docs/proxy/guardrails/ tool_permission, 2026

  46. [46]

    Agent authorization profile (AAP): OAuth 2.0 extension for agent authorization.https://www.aap-protocol.org/, 2026

    AAP Protocol Working Group. Agent authorization profile (AAP): OAuth 2.0 extension for agent authorization.https://www.aap-protocol.org/, 2026

  47. [47]

    Zero trust architecture

    Scott Rose, Oliver Borchert, Stu Mitchell, and Sean Connelly. Zero trust architecture. Technical Report SP 800-207, National Institute of Standards and Technology, 2020

  48. [48]

    Harnessing chain-of-thought metadata for task routing and adversarial prompt detection.arXiv preprint arXiv:2503.21464, 2026

    Ryan Marinelli, Josef Pichlmeier, and Tamas Bisztray. Harnessing chain-of-thought metadata for task routing and adversarial prompt detection.arXiv preprint arXiv:2503.21464, 2026. 40

  49. [49]

    Aragog: Just-in-time model routing for scalable serving of agentic workflows.arXiv preprint arXiv:2511.20975, 2025

    Yinwei Dai, Zhuofu Chen, Anand Iyer, et al. Aragog: Just-in-time model routing for scalable serving of agentic workflows.arXiv preprint arXiv:2511.20975, 2025

  50. [50]

    RouteLLM: Learning to route LLMs with preference data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, et al. RouteLLM: Learning to route LLMs with preference data. InProceedings of ICLR, 2025

  51. [51]

    MixLLM: Dynamic routing in mixed large language models

    Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, and Haifeng Chen. MixLLM: Dynamic routing in mixed large language models. InProceedings of NAACL, 2025. arXiv:2502.18482

  52. [52]

    SageServe: Optimizing LLM serving on cloud data centers with forecast aware auto-scaling.arXiv preprint arXiv:2502.14617, 2025

    Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, et al. SageServe: Optimizing LLM serving on cloud data centers with forecast aware auto-scaling.arXiv preprint arXiv:2502.14617, 2025

  53. [53]

    Router-R1: Teaching LLMs multi-round routing and aggregation via reinforcement learning

    Haozhen Zhang, Tao Feng, and Jiaxuan You. Router-R1: Teaching LLMs multi-round routing and aggregation via reinforcement learning. InProceedings of NeurIPS, 2025. arXiv:2506.09033

  54. [54]

    R2-Router: A new paradigm for LLM routing with reasoning

    Jiaqi Xue, Qian Lou, Jiarong Xing, and Heng Huang. R2-Router: A new paradigm for LLM routing with reasoning. arXiv preprint arXiv:2602.02823, 2026

  55. [55]

    Adapter-augmented bandits for online multi-constrained multi-modal inference scheduling

    Xianzhi Zhang, Yue Xu, Yinlin Zhu, Di Wu, Yipeng Zhou, Miao Hu, and Guocong Quan. Adapter-augmented bandits for online multi-constrained multi-modal inference scheduling. arXiv preprint arXiv:2603.06403, 2026

  56. [56]

    Efficient LLM serving for agentic workflows: A data systems perspective

    Noppanat Wadlom, Junyi Shen, and Yao Lu. Efficient LLM serving for agentic workflows: A data systems perspective. arXiv preprint arXiv:2603.16104, 2026

  57. [57]

    Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference

    Anish Biswas, Kanishk Goel, Jayashree Mohan, et al. Sutradhara: An intelligent orchestrator-engine co-design for tool-based agentic inference. arXiv preprint arXiv:2601.12967, 2026

  58. [58]

    CONCUR: High-throughput agentic batch inference of LLM via congestion-based concurrency control

    Qiaoling Chen, Zhisheng Ye, Tian Tang, et al. CONCUR: High-throughput agentic batch inference of LLM via congestion-based concurrency control. arXiv preprint arXiv:2601.22705

  59. [59]

    EvoRoute: Experience-driven self-routing LLM agent systems

    Guibin Zhang, Haiyang Yu, Kaiming Yang, et al. EvoRoute: Experience-driven self-routing LLM agent systems. arXiv preprint arXiv:2601.02695, 2026

  60. [60]

    Budget-aware agentic routing via boundary-guided training

    Caiqi Zhang, Menglin Xia, Xuchao Zhang, Daniel Madrigal, Ankur Mallick, Samuel Kessler, Victor Ruehle, and Saravan Rajmohan. Budget-aware agentic routing via boundary-guided training. arXiv preprint arXiv:2602.21227, 2026

  61. [61]

    Don’t Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks

    Elias Lumer, Faheem Nizar, Akshaya Jangiti, et al. Don’t Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks. arXiv preprint arXiv:2601.06007, 2026

  62. [62]

    xRouter: Training cost-aware LLMs orchestration system via reinforcement learning

    Cheng Qian, Zuxin Liu, Shirley Kokane, et al. xRouter: Training cost-aware LLMs orchestration system via reinforcement learning. arXiv preprint arXiv:2510.08439, 2025

  63. [63]

    Mélange: Cost efficient large language model serving by exploiting GPU heterogeneity

    Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. Mélange: Cost efficient large language model serving by exploiting GPU heterogeneity. arXiv preprint arXiv:2404.14527, 2024

  64. [64]

    Mooncake: A KVCache-centric disaggregated architecture for LLM serving

    Ruoyu Qin et al. Mooncake: A KVCache-centric disaggregated architecture for LLM serving. arXiv preprint arXiv:2407.00079, 2025

  65. [65]

    NVIDIA dynamo: Smart multi-node scheduling for LLM inference

    NVIDIA. NVIDIA dynamo: Smart multi-node scheduling for LLM inference. https://developer.nvidia.com/blog/smart-multi-node-scheduling-for-fast-and-efficient-llm-inference-with-nvidia-runai-and-nvidia-dynamo/, 2026

  66. [66]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, et al. Efficient memory management for large language model serving with PagedAttention. In Proceedings of SOSP, 2023

  67. [67]

    Search-R2: Enhancing search-integrated reasoning via actor–refiner collaboration

    Bowei He, Minda Hu, Zenan Xu, Hongru Wang, Licheng Zong, Yankai Chen, Chen Ma, Xue Liu, Pluto Zhou, and Irwin King. Search-R2: Enhancing search-integrated reasoning via actor–refiner collaboration. arXiv preprint arXiv:2602.03647, 2026

  68. [68]

    Generalizing beyond suboptimality: Offline reinforcement learning learns effective scheduling through random data

    Jesse van Remmerden, Zaharah Bukhsh, and Yingqian Zhang. Generalizing beyond suboptimality: Offline reinforcement learning learns effective scheduling through random data. arXiv preprint arXiv:2509.10303, 2025

  69. [69]

    Agents of chaos: Evaluating LLM agent vulnerabilities through real-world interactions

    Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, et al. Agents of chaos: Evaluating LLM agent vulnerabilities through real-world interactions. arXiv preprint arXiv:2602.20021, 2026

  70. [70]

    Great, now write an article about that: The crescendo multi-turn LLM jailbreak attack

    Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo multi-turn LLM jailbreak attack. arXiv preprint arXiv:2404.01833, 2024. USENIX Security 2025

  71. [71]

    Peak+accumulation: A proxy-level scoring formula for multi-turn LLM attack detection

    J. Alex Corll. Peak+accumulation: A proxy-level scoring formula for multi-turn LLM attack detection. arXiv preprint arXiv:2602.11247, 2026

  72. [72]

    DeepContext: Stateful real-time detection of multi-turn adversarial intent drift in LLMs

    Justin Albrethsen, Yash Datta, Kunal Kumar, et al. DeepContext: Stateful real-time detection of multi-turn adversarial intent drift in LLMs. arXiv preprint arXiv:2602.16935, 2026

  73. [73]

    ErrorMap and ErrorAtlas: Charting the failure landscape of large language models

    Shir Ashury-Tahan, Yifan Mai, Elron Bandel, et al. ErrorMap and ErrorAtlas: Charting the failure landscape of large language models. arXiv preprint arXiv:2601.15812, 2026

  74. [74]

    LLMRouterBench: A massive benchmark and unified framework for LLM routing

    Hao Li, Yiqun Zhang, Zhaoyan Guo, et al. LLMRouterBench: A massive benchmark and unified framework for LLM routing. arXiv preprint arXiv:2601.07206, 2026

  75. [75]

    Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes

    Mehil B. Shah, Mohammad Mehdi Morovati, Mohammad Masudur Rahman, et al. Characterizing faults in agentic AI: A taxonomy of types, symptoms, and root causes. arXiv preprint arXiv:2603.06847, 2026

  76. [76]

    PALADIN: Self-correcting language model agents to cure tool-failure cases

    Sri Vatsa Vuddanti, Aarav Shah, Satwik Kumar Chittiprolu, Tony Song, Sunishchal Dev, Kevin Zhu, Sean O’Brien, and Maheep Chaudhary. PALADIN: Self-correcting language model agents to cure tool-failure cases. arXiv preprint arXiv:2509.25238, 2025

  77. [77]

    Evaluating performance drift from model switching in multi-turn LLM systems

    Raad Khraishi, Iman Zafar, Katie Myles, and Greig A. Cowan. Evaluating performance drift from model switching in multi-turn LLM systems. ICLR 2026 CAO Workshop, 2026. arXiv:2603.03111

  78. [78]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

  79. [79]

    Learning when to attend: Conditional memory access for long-context LLMs

    Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager, Wei Xia, and Stefano Soatto. Learning when to attend: Conditional memory access for long-context LLMs. arXiv preprint arXiv:2603.17484, 2026

  80. [80]

    AgentCompress: Task-aware compression for affordable large language model agents

    Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, and Shahnawaz Alam. AgentCompress: Task-aware compression for affordable large language model agents. arXiv preprint arXiv:2601.05191, 2026
