pith. machine review for the scientific record.

arxiv: 2603.21354 · v2 · submitted 2026-03-22 · 💻 cs.LG · cs.DC

Recognition: 2 theorem links

· Lean Theorem

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 06:36 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords LLM inference optimization · semantic routing · workload characterization · router architecture · inference pool · fleet provisioning · vision paper · 3x3 interaction matrix

The pith

The Workload-Router-Pool architecture organizes LLM inference optimization into a 3x3 interaction matrix that maps prior results and flags twenty-one open research directions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper consolidates earlier vLLM Semantic Router studies into one three-dimensional framework called Workload-Router-Pool. Workload captures request patterns such as chat versus agentic or prefill-heavy versus decode-heavy. Router covers dispatching rules from static semantic matching to online adaptation and reinforcement learning. Pool covers hardware choices from homogeneous GPUs to disaggregated prefill-decode setups. Plotting past contributions onto the resulting 3x3 matrix shows which intersections are already addressed and which remain empty, then lists twenty-one concrete next steps, each backed by measurements from those earlier experiments. This matters because scaling LLM fleets requires coordinated decisions across all three dimensions rather than separate tweaks.

Core claim

The Workload-Router-Pool architecture is a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves, Router determines how each request is dispatched, and Pool defines where inference runs. Mapping prior work onto the 3x3 interaction matrix identifies covered cells and open cells, and the paper proposes twenty-one concrete research directions at the intersections, each grounded in prior measurements and tiered by maturity.

What carries the argument

The Workload-Router-Pool (WRP) architecture, a three-dimensional framework that places workload types, routing policies, and execution pools on the axes of a 3x3 matrix to expose covered areas and open research cells.
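The matrix is simple enough to sketch in code. The axis values below are drawn from the abstract's descriptions of the three dimensions; the paper does not spell out exactly how three axes collapse into a 3x3 grid, so this sketch assumes a pairwise workload × pool view, and the cells marked "covered" are hypothetical placements, not the paper's actual mapping:

```python
from itertools import product

# Representative values for each WRP axis, taken from the abstract.
WORKLOAD = ["chat", "agentic", "multimodal"]
ROUTER = ["static_semantic", "online_bandit", "rl_selection"]
POOL = ["homogeneous", "heterogeneous", "disaggregated_pd"]

# One sketchable 3x3 view: workload crossed with pool. Every cell starts
# "open"; mapping prior work flips cells to "covered".
matrix = {cell: "open" for cell in product(WORKLOAD, POOL)}
matrix[("chat", "homogeneous")] = "covered"          # hypothetical placement
matrix[("agentic", "disaggregated_pd")] = "covered"  # hypothetical placement

open_cells = [cell for cell, status in matrix.items() if status == "open"]
print(len(matrix), len(open_cells))  # 9 7
```

The paper's gap analysis is exactly this kind of enumeration: every (workload, pool) pair is a cell, and the open cells become candidate research directions.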

If this is right

  • Fleet provisioning decisions must change when routing policies shift or when workload mixes move toward agentic and multimodal traffic.
  • Safety mechanisms such as policy conflict detection and hallucination checks become more effective when combined with context-length-aware pool routing.
  • Energy-efficiency gains require joint selection of router policies and heterogeneous pool configurations rather than independent tuning.
  • Agentic workloads with multi-turn memory and tool selection create new cells in the matrix that need dedicated routing and pool designs.
  • Standards for inference routing protocols and multi-provider APIs must account for interactions across all three WRP dimensions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the matrix proves stable across new model families, it could serve as a shared taxonomy for comparing commercial serving systems.
  • Extending the framework with a fourth axis for network topology might be needed if cross-node KV-cache movement dominates latency.
  • Prioritizing the engineering-ready directions first would let teams measure concrete throughput or cost improvements before tackling open research cells.
  • The same matrix structure could be tested on non-LLM workloads such as diffusion model serving to check whether the three dimensions generalize.

Load-bearing premise

The three dimensions of workload, router, and pool are sufficient to capture the main interactions that matter for LLM inference optimization.

What would settle it

Demonstration of a dominant factor, such as regulatory data-locality rules or network-topology effects, that cannot be assigned to any cell in the 3x3 WRP matrix would show the framework is incomplete.

read the original abstract

Over the past year, the vLLM Semantic Router project has released a series of work spanning: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing -- multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety; (4) governance and standards -- inference routing protocols and multi-provider API extensions. Each paper tackled a specific problem in LLM inference, but the problems are not independent; for example, fleet provisioning depends on the routing policy, which depends on the workload mix, shifting as organizations adopt agentic and multimodal workloads. This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves (chat vs. agent, single-turn vs. multi-turn, warm vs. cold, prefill-heavy vs. decode-heavy). Router determines how each request is dispatched (static semantic rules, online bandit adaptation, RL-based model selection, quality-aware cascading). Pool defines where inference runs (homogeneous vs. heterogeneous GPU, disaggregated prefill/decode, KV-cache topology). We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in our prior measurements, tiered by maturity from engineering-ready to open research.
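The abstract's "online bandit adaptation" router can be illustrated with a minimal epsilon-greedy sketch. The pool names, the per-workload reward bookkeeping, and the tie-breaking rule are assumptions for illustration only, not the project's implementation:

```python
import random

# Illustrative epsilon-greedy router: per workload class, learn which pool
# yields the best observed reward (e.g., quality per dollar).
POOLS = ["homogeneous_gpu", "heterogeneous_gpu", "disaggregated_pd"]

class BanditRouter:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.stats = {}  # (workload, pool) -> [total_reward, count]

    def route(self, workload: str) -> str:
        if random.random() < self.epsilon:
            return random.choice(POOLS)  # explore
        def mean(pool):
            total, n = self.stats.get((workload, pool), (0.0, 0))
            return total / n if n else float("inf")  # try unseen pools first
        return max(POOLS, key=mean)  # exploit best-known pool

    def feedback(self, workload: str, pool: str, reward: float):
        total, n = self.stats.get((workload, pool), (0.0, 0))
        self.stats[(workload, pool)] = (total + reward, n + 1)
```

Usage would interleave `route` and `feedback` per request; the abstract's point is that the stationary policy this converges to depends on the workload mix, which is why fleet provisioning cannot be tuned independently of routing.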

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Workload-Router-Pool (WRP) architecture as a three-dimensional framework for LLM inference optimization. It distills prior vLLM Semantic Router results into characterizations of Workload (chat vs. agentic, single- vs. multi-turn, prefill- vs. decode-heavy), Router (semantic rules, bandit adaptation, RL selection), and Pool (homogeneous/heterogeneous GPUs, disaggregated prefill/decode, KV-cache topology), maps these onto a 3x3 interaction matrix to identify covered and open cells, and proposes 21 concrete research directions grounded in the project's prior measurements and tiered by maturity.

Significance. If the framework holds, the WRP matrix could serve as a useful organizing taxonomy for LLM inference research, systematically highlighting gaps at the intersections of workload characteristics, routing policies, and execution pools. The explicit grounding of the 21 directions in existing measurements from the vLLM project is a strength that could help prioritize engineering-ready versus open-research items.

major comments (2)
  1. [WRP architecture definition and matrix construction] Section defining the WRP dimensions and 3x3 matrix: The central claim that these three axes fully capture key interactions and allow complete mapping of prior work rests on the unargued assumption that factors such as network topology (e.g., KV-cache transfer latency across nodes) or regulatory constraints (data residency) can be reduced to Workload/Router/Pool without loss of fidelity. No explicit justification or reduction argument is provided, which directly affects the identification of 'open cells' and the completeness of the proposed directions.
  2. [Proposal of twenty-one research directions] Section on the 21 research directions: Several directions (particularly those involving fleet governance and multi-provider standards) are listed without explicit mapping back to specific open cells in the 3x3 matrix or to concrete prior measurements that would ground their feasibility. This weakens the claim that the directions are systematically derived from the matrix analysis.
minor comments (2)
  1. [Abstract and introduction] The abstract and introduction could more explicitly separate the synthesis of previously published vLLM results from any novel conceptual contribution of the WRP framing itself.
  2. [Matrix and directions section] Notation for the matrix cells and direction tiers would benefit from a small summary table to improve readability when the 21 directions are enumerated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and grounding of the WRP framework. We address both major comments below and will revise the manuscript to strengthen the justification of the dimensions and the explicit mapping of research directions.

read point-by-point responses
  1. Referee: [WRP architecture definition and matrix construction] Section defining the WRP dimensions and 3x3 matrix: The central claim that these three axes fully capture key interactions and allow complete mapping of prior work rests on the unargued assumption that factors such as network topology (e.g., KV-cache transfer latency across nodes) or regulatory constraints (data residency) can be reduced to Workload/Router/Pool without loss of fidelity. No explicit justification or reduction argument is provided, which directly affects the identification of 'open cells' and the completeness of the proposed directions.

    Authors: We acknowledge the absence of an explicit reduction argument. In revision we will insert a dedicated paragraph in Section 2 explaining the scope: network topology effects are already subsumed under the Pool dimension through the KV-cache topology sub-axis (our prior disaggregated prefill/decode measurements quantify cross-node transfer latencies), while regulatory constraints such as data residency are treated as workload attributes (privacy-sensitive vs. general chat) or pool restrictions (geo-fenced GPU sets). We will also state explicitly that WRP is offered as a practical organizing taxonomy derived from vLLM measurements rather than a claim of theoretical completeness, and we will note possible extensions for factors outside the current axes. This addition will make the identification of open cells more transparent. revision: yes

  2. Referee: [Proposal of twenty-one research directions] Section on the 21 research directions: Several directions (particularly those involving fleet governance and multi-provider standards) are listed without explicit mapping back to specific open cells in the 3x3 matrix or to concrete prior measurements that would ground their feasibility. This weakens the claim that the directions are systematically derived from the matrix analysis.

    Authors: We agree that the linkage should be stated more explicitly. The revised manuscript will include a summary table (new Table 3) that, for each of the 21 directions, lists the target (Workload, Router, Pool) cell and cites the specific prior vLLM measurement or paper that grounds its feasibility. For the fleet-governance and multi-provider directions, we will map them to the open cells at the intersection of heterogeneous pools with adaptive routers and reference the energy-efficiency analysis and multi-provider API extension results already obtained in the project. This will demonstrate the systematic derivation from the matrix. revision: yes
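The reduction proposed in the first response, treating data-residency rules as pool restrictions applied before any routing policy runs, can be made concrete with a small sketch. Region names and pool attributes here are hypothetical:

```python
from dataclasses import dataclass

# Sketch of the rebuttal's reduction: a residency constraint becomes a
# filter over geo-fenced pool sets; the router only ever sees the survivors.
@dataclass(frozen=True)
class Pool:
    name: str
    region: str

POOLS = [Pool("us-hetero", "us"), Pool("eu-homog", "eu"), Pool("eu-pd", "eu")]

def eligible_pools(pools, required_region=None):
    """Apply a data-residency rule as a pool filter before routing."""
    if required_region is None:
        return list(pools)
    return [p for p in pools if p.region == required_region]

eu_only = eligible_pools(POOLS, "eu")
print([p.name for p in eu_only])  # ['eu-homog', 'eu-pd']
```

Whether this filter-then-route composition preserves fidelity is exactly the referee's open question: a constraint that interacts with the routing decision itself, rather than merely shrinking the candidate set, would not reduce this cleanly.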

Circularity Check

2 steps flagged

WRP 3x3 matrix and 21 directions reduce to re-labeling of authors' own prior vLLM results

specific steps
  1. renaming known result [Abstract]
    "This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. ... We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in our prior measurements"

    The 3x3 matrix is populated exclusively by re-mapping the authors' own prior vLLM papers (listed in the abstract as the source of the distillation); the identification of covered cells and the 21 directions are therefore direct outputs of that re-mapping rather than new derivations.

  2. self citation load bearing [Abstract]
    "Over the past year, the vLLM Semantic Router project has released a series of work spanning: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing -- multimodal agent routing, tool selec"

    The claim that WRP is a sufficient three-dimensional framework rests on the completeness of the enumerated self-cited project outputs; no external criterion or reduction is provided to show why network topology, regulatory constraints, or other factors can be omitted without loss.

full rationale

The paper's central derivation consists of defining the WRP dimensions from the authors' listed prior publications, then mapping those same publications onto the new 3x3 matrix to identify covered/open cells and generate 21 directions. This process is self-contained within the authors' body of work with no external derivation, benchmark, or independent validation step shown; the 'framework' and proposals are therefore equivalent to a reorganization of the input citations by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the three chosen dimensions fully span the space of LLM inference interactions and on the introduction of the WRP framework itself as an organizing entity without external validation.

axioms (1)
  • domain assumption The three dimensions of Workload, Router, and Pool are sufficient to capture all key interactions in LLM inference optimization.
    Invoked when defining the 3x3 interaction matrix and when claiming the framework organizes the space of problems.
invented entities (1)
  • Workload-Router-Pool (WRP) architecture no independent evidence
    purpose: To serve as a unified three-dimensional organizing framework for LLM inference optimization
    Newly proposed conceptual entity that structures the mapping of prior work and the 21 research directions.

pith-pipeline@v0.9.0 · 5656 in / 1396 out tokens · 64460 ms · 2026-05-15T06:36:51.065328+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · 12 internal anchors

  1. [1]

    vLLM semantic router: Signal driven decision routing for mixture-of-modality models.arXiv preprint arXiv:2603.04444, 2026

    vLLM Semantic Router Team. vLLM semantic router: Signal driven decision routing for mixture-of-modality models.arXiv preprint arXiv:2603.04444, 2026

  2. [2]

    Conflict-free policy languages for probabilistic ML predicates: A framework and case study with the semantic router DSL.arXiv preprint arXiv:2603.18174, 2026

    Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, and Xue Liu. Conflict-free policy languages for probabilistic ML predicates: A framework and case study with the semantic router DSL.arXiv preprint arXiv:2603.18174, 2026

  3. [3]

    98× faster LLM routing without a dedicated GPU: Flash attention, prompt compression, and near-streaming for the vLLM semantic router.arXiv preprint arXiv:2603.12646, 2026

    Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, and Huamin Chen. 98× faster LLM routing without a dedicated GPU: Flash attention, prompt compression, and near-streaming for the vLLM semantic router.arXiv preprint arXiv:2603.12646, 2026

  4. [4]

    mmBERT-embed-32k-2d-matryoshka: Multilingual embedding model with 2d matryoshka training

    vLLM Semantic Router Team. mmBERT-embed-32k-2d-matryoshka: Multilingual embedding model with 2d matryoshka training. Hugging Face model:llm-semantic-router/mmbert-embed- 32k-2d-matryoshka, 2025

  5. [5]

    When to reason: Semantic router for vLLM

    Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, and Huamin Chen. When to reason: Semantic router for vLLM. InNeurIPS Workshop on ML for Systems (MLForSys), 2025

  6. [6]

    Category- aware semantic caching for heterogeneous LLM workloads.arXiv preprint arXiv:2510.26835, 2025

    Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, and Huamin Chen. Category- aware semantic caching for heterogeneous LLM workloads.arXiv preprint arXiv:2510.26835, 2025

  7. [7]

    mmBERT-32k feedback detector: User satisfaction classification for online routing adaptation

    vLLM Semantic Router Team. mmBERT-32k feedback detector: User satisfaction classification for online routing adaptation. Hugging Face model:llm-semantic-router/mmbert32k-feedback- detector-lora, 2026

  8. [8]

    Token-level truth: Real-time hallucination detection for production LLMs

    vLLM Semantic Router Team. Token-level truth: Real-time hallucination detection for production LLMs. vLLM Blog, 2025.https://blog.vllm.ai/2025/12/14/halugate.html

  9. [9]

    mmBERT-32k factcheck classifier: Binary prompt classification for conditional hallucination detection

    vLLM Semantic Router Team. mmBERT-32k factcheck classifier: Binary prompt classification for conditional hallucination detection. Hugging Face model:llm-semantic-router/mmbert32k- factcheck-classifier-merged, 2026

  10. [10]

    MLCommons AI safety classifier – level 1 (binary): Safe vs

    vLLM Semantic Router Team. MLCommons AI safety classifier – level 1 (binary): Safe vs. unsafe content classification. Hugging Face model:llm-semantic-router/mlcommons-safety- classifier-level1-binary, 2026

  11. [11]

    MLCommons AI safety classifier – level 2 (9-class hazard): Hier- archical content safety classification

    vLLM Semantic Router Team. MLCommons AI safety classifier – level 2 (9-class hazard): Hier- archical content safety classification. Hugging Face model:llm-semantic-router/mlcommons- safety-classifier-level2-hazard, 2026

  12. [12]

    Token-budget- aware pool routing for cost-efficient LLM inference.arXiv preprint, 2026

    Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. Token-budget- aware pool routing for cost-efficient LLM inference.arXiv preprint, 2026. 38

  13. [13]

    FleetOpt: Analytical fleet provisioning for LLM inference with compress-and-route as implementation mechanism.arXiv preprint arXiv:2603.16514, 2026

    Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. FleetOpt: Analytical fleet provisioning for LLM inference with compress-and-route as implementation mechanism.arXiv preprint arXiv:2603.16514, 2026

  14. [14]

    inference- fleet-sim: A queueing-theory-grounded fleet capacity planner for LLM inference.arXiv preprint arXiv:2603.16054, 2026

    Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. inference- fleet-sim: A queueing-theory-grounded fleet capacity planner for LLM inference.arXiv preprint arXiv:2603.16054, 2026

  15. [15]

    The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

    Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. The 1/W law: An analytical study of context-length routing topology and GPU generation gains for LLM inference energy efficiency.arXiv preprint arXiv:2603.17280, 2026

  16. [16]

    Adaptive vision-language model routing for computer use agents.arXiv preprint arXiv:2603.12823, 2026

    Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, and Huamin Chen. Adaptive vision-language model routing for computer use agents.arXiv preprint arXiv:2603.12823, 2026

  17. [17]

    Outcome-aware tool selection for semantic routers: Latency-constrained learning without LLM inference.arXiv preprint arXiv:2603.13426, 2026

    Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, and Xue Liu. Outcome-aware tool selection for semantic routers: Latency-constrained learning without LLM inference.arXiv preprint arXiv:2603.13426, 2026

  18. [18]

    Visual confused deputy: Exploiting and defending perception failures in computer-using agents.arXiv preprint arXiv:2603.14707, 2026

    Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, and Huamin Chen. Visual confused deputy: Exploiting and defending perception failures in computer-using agents.arXiv preprint arXiv:2603.14707, 2026

  19. [19]

    OpenClaw: Personal AI assistant with a local-first gateway

    OpenClaw contributors. OpenClaw: Personal AI assistant with a local-first gateway. Open- source software (MIT License), 2026. Repository: https://github.com/openclaw/openclaw. Gateway WebSocket control plane for multi-channel agent sessions, tools, and model routing; documentation athttps://docs.openclaw.ai

  20. [20]

    Semantic inference routing protocol (SIRP)

    Huamin Chen and Luay Jalil. Semantic inference routing protocol (SIRP). Internet Engineering Task Force (IETF), 2025

  21. [21]

    Huamin Chen, Luay Jalil, and N. Cocker. Multi-provider extensions for agentic AI inference APIs. Internet Engineering Task Force (IETF), Network Management Research Group, 2025

  22. [22]

    LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, et al. LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. InProceedings of ICLR, 2024

  23. [23]

    Splitwise: Efficient generative LLM inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, et al. Splitwise: Efficient generative LLM inference using phase splitting. InProceedings of ISCA, 2024

  24. [24]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In Proceedings of ICLR, 2024

  25. [25]

    Patil, Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Ion Stoica, and Joseph E

    Shishir G. Patil, Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Ion Stoica, and Joseph E. Gonzalez. The Berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. InProceedings of ICML, 2025

  26. [26]

    ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

    Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, and Xin Jin. ServeGen: Workload characterization and generation of large language model serving in production.arXiv preprint arXiv:2505.09999, 2025

  27. [27]

    Drift-bench: Diagnosing cooperative breakdowns in LLM agents under input faults via multi-turn interaction.arXiv preprint arXiv:2602.02455, 2026

    Han Bao, Zheyuan Zhang, Pengcheng Jing, et al. Drift-bench: Diagnosing cooperative breakdowns in LLM agents under input faults via multi-turn interaction.arXiv preprint arXiv:2602.02455, 2026

  28. [28]

    AgentHallu: Benchmarking automated hallucination attribution of LLM-based agents.arXiv preprint arXiv:2601.06818, 2026

    Xuannan Liu, Xiao Yang, Zekun Li, et al. AgentHallu: Benchmarking automated hallucination attribution of LLM-based agents.arXiv preprint arXiv:2601.06818, 2026. 39

  29. [29]

    Memory for autonomous LLM agents: Mechanisms, evaluation, and emerging frontiers.arXiv preprint arXiv:2603.07670, 2026

    Pengfei Du. Memory for autonomous LLM agents: Mechanisms, evaluation, and emerging frontiers.arXiv preprint arXiv:2603.07670, 2026

  30. [30]

    Beyond the context window: A cost- performance analysis of fact-based memory vs

    Natchanon Pollertlam and Witchayut Kornsuwannawit. Beyond the context window: A cost- performance analysis of fact-based memory vs. long-context LLMs for persistent agents.arXiv preprint arXiv:2603.04814, 2026

  31. [31]

    ACON: Optimizing context compression for long-horizon LLM agents.arXiv preprint arXiv:2510.00615, 2025

    Minki Kang, Wei-Ning Chen, Dongge Han, et al. ACON: Optimizing context compression for long-horizon LLM agents.arXiv preprint arXiv:2510.00615, 2025

  32. [32]

    Active context compression: Autonomous memory management in LLM agents

    Nikhil Verma. Active context compression: Autonomous memory management in LLM agents. arXiv preprint arXiv:2601.07190, 2026

  33. [33]

    SAMULE: Self-learning agents enhanced by multi-level reflection

    Yubin Ge, Salvatore Romeo, Jason Cai, et al. SAMULE: Self-learning agents enhanced by multi-level reflection. InProceedings of EMNLP, 2025. arXiv:2509.20562

  34. [34]

    CORRECT: COndensed eRror RECognition via knowledge transfer in multi-agent systems.arXiv preprint arXiv:2509.24088, 2025

    Yifan Yu, Moyan Li, Shaoyuan Xu, et al. CORRECT: COndensed eRror RECognition via knowledge transfer in multi-agent systems.arXiv preprint arXiv:2509.24088, 2025

  35. [35]

    Mistake notebook learning: Batch-clustered failures for training-free agent adaptation.arXiv preprint arXiv:2512.11485, 2025

    Xuanbo Su, Yingfang Zhang, Hao Luo, et al. Mistake notebook learning: Batch-clustered failures for training-free agent adaptation.arXiv preprint arXiv:2512.11485, 2025

  36. [36]

    Dynamic system instructions and tool exposure for efficient agentic LLMs.arXiv preprint arXiv:2602.17046, 2026

    Uria Franko. Dynamic system instructions and tool exposure for efficient agentic LLMs.arXiv preprint arXiv:2602.17046, 2026

  37. [37]

    ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering

    Marianne Menglin Liu, Daniel Garcia, Fjona Parllaku, Vikas Upadhyay, Syed Fahad Allam Shah, and Dan Roth. ToolScope: Enhancing LLM agent tool use through tool merging and context-aware filtering.arXiv preprint arXiv:2510.20036, 2025

  38. [38]

    SMART: Self-aware agent for tool overuse mitigation.arXiv preprint arXiv:2502.11435, 2025

    Cheng Qian, Emre Can Acikgoz, Hongru Wang, et al. SMART: Self-aware agent for tool overuse mitigation.arXiv preprint arXiv:2502.11435, 2025

  39. [39]

    Budget-aware tool-use enables effective agent scaling

    Tengxiao Liu, Zifeng Wang, Jin Miao, et al. Budget-aware tool-use enables effective agent scaling. arXiv preprint arXiv:2511.17006, 2025

  40. [40]

    Transcending cost-quality tradeoff in agent serving via session-awareness

    Yanyu Ren, Li Chen, Dan Li, et al. Transcending cost-quality tradeoff in agent serving via session-awareness. InNeurIPS, 2025

  41. [41]

    Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

    Hanchen Li et al. Continuum: Efficient and robust multi-turn LLM agent scheduling with KV cache time-to-live.arXiv preprint arXiv:2511.02230, 2025

  42. [42]

    KV-Cache wins you can see: From prefix caching in vLLM to distributed scheduling with llm-d.https://llm-d.ai/blog/kvcache-wins-you-can-see, 2026

    llm-d Team. KV-Cache wins you can see: From prefix caching in vLLM to distributed scheduling with llm-d.https://llm-d.ai/blog/kvcache-wins-you-can-see, 2026

  43. [43]

    RFC: Context-aware KV-cache retention API (prioritized evictions).https: //github.com/vllm-project/vllm/issues/37003, 2026

    vLLM Community. RFC: Context-aware KV-cache retention API (prioritized evictions).https: //github.com/vllm-project/vllm/issues/37003, 2026

  44. [44]

    DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, et al. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. InProceedings of OSDI, 2024

  45. [45]

    LiteLLM tool permission guardrail.https://docs.litellm.ai/docs/proxy/guardrails/ tool_permission, 2026

    BerriAI. LiteLLM tool permission guardrail.https://docs.litellm.ai/docs/proxy/guardrails/ tool_permission, 2026

  46. [46]

    Agent authorization profile (AAP): OAuth 2.0 extension for agent authorization.https://www.aap-protocol.org/, 2026

    AAP Protocol Working Group. Agent authorization profile (AAP): OAuth 2.0 extension for agent authorization.https://www.aap-protocol.org/, 2026

  47. [47]

    Zero trust architecture

    Scott Rose, Oliver Borchert, Stu Mitchell, and Sean Connelly. Zero trust architecture. Technical Report SP 800-207, National Institute of Standards and Technology, 2020

  48. [48]

    Harnessing chain-of-thought metadata for task routing and adversarial prompt detection.arXiv preprint arXiv:2503.21464, 2026

    Ryan Marinelli, Josef Pichlmeier, and Tamas Bisztray. Harnessing chain-of-thought metadata for task routing and adversarial prompt detection.arXiv preprint arXiv:2503.21464, 2026. 40

  49. [49]

    Aragog: Just-in-time model routing for scalable serving of agentic workflows.arXiv preprint arXiv:2511.20975, 2025

    Yinwei Dai, Zhuofu Chen, Anand Iyer, et al. Aragog: Just-in-time model routing for scalable serving of agentic workflows.arXiv preprint arXiv:2511.20975, 2025

  50. [50]

    RouteLLM: Learning to route LLMs with preference data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, et al. RouteLLM: Learning to route LLMs with preference data. InProceedings of ICLR, 2025

  51. [51]

    MixLLM: Dynamic routing in mixed large language models

    Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, and Haifeng Chen. MixLLM: Dynamic routing in mixed large language models. InProceedings of NAACL, 2025. arXiv:2502.18482

  52. [52]

    SageServe: Optimizing LLM serving on cloud data centers with forecast aware auto-scaling.arXiv preprint arXiv:2502.14617, 2025

    Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, et al. SageServe: Optimizing LLM serving on cloud data centers with forecast aware auto-scaling.arXiv preprint arXiv:2502.14617, 2025

  53. [53]

    Router-R1: Teaching LLMs multi-round routing and aggregation via reinforcement learning

    Haozhen Zhang, Tao Feng, and Jiaxuan You. Router-R1: Teaching LLMs multi-round routing and aggregation via reinforcement learning. InProceedings of NeurIPS, 2025. arXiv:2506.09033

  54. [54]

    R2-Router: A new paradigm for LLM routing with reasoning

    Jiaqi Xue, Qian Lou, Jiarong Xing, and Heng Huang. R2-Router: A new paradigm for LLM routing with reasoning. arXiv preprint arXiv:2602.02823, 2026

  55. [55]

    Adapter-augmented bandits for online multi-constrained multi-modal inference scheduling

    Xianzhi Zhang, Yue Xu, Yinlin Zhu, Di Wu, Yipeng Zhou, Miao Hu, and Guocong Quan. Adapter-augmented bandits for online multi-constrained multi-modal inference scheduling. arXiv preprint arXiv:2603.06403, 2026

  56. [56]

    Efficient LLM serving for agentic workflows: A data systems perspective

    Noppanat Wadlom, Junyi Shen, and Yao Lu. Efficient LLM serving for agentic workflows: A data systems perspective. arXiv preprint arXiv:2603.16104, 2026

  57. [57]

    Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference

    Anish Biswas, Kanishk Goel, Jayashree Mohan, et al. Sutradhara: An intelligent orchestrator-engine co-design for tool-based agentic inference. arXiv preprint arXiv:2601.12967, 2026

  58. [58]

    CONCUR: High-throughput agentic batch inference of LLM via congestion-based concurrency control

    Qiaoling Chen, Zhisheng Ye, Tian Tang, et al. CONCUR: High-throughput agentic batch inference of LLM via congestion-based concurrency control. arXiv preprint arXiv:2601.22705

  59. [59]

    EvoRoute: Experience-driven self-routing LLM agent systems

    Guibin Zhang, Haiyang Yu, Kaiming Yang, et al. EvoRoute: Experience-driven self-routing LLM agent systems. arXiv preprint arXiv:2601.02695, 2026

  60. [60]

    Budget-aware agentic routing via boundary-guided training

    Caiqi Zhang, Menglin Xia, Xuchao Zhang, Daniel Madrigal, Ankur Mallick, Samuel Kessler, Victor Ruehle, and Saravan Rajmohan. Budget-aware agentic routing via boundary-guided training. arXiv preprint arXiv:2602.21227, 2026

  61. [61]

    Don’t Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks

    Elias Lumer, Faheem Nizar, Akshaya Jangiti, et al. Don’t Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks. arXiv preprint arXiv:2601.06007, 2026

  62. [62]

    xRouter: Training cost-aware LLMs orchestration system via reinforcement learning

    Cheng Qian, Zuxin Liu, Shirley Kokane, et al. xRouter: Training cost-aware LLMs orchestration system via reinforcement learning. arXiv preprint arXiv:2510.08439, 2025

  63. [63]

    Mélange: Cost efficient large language model serving by exploiting GPU heterogeneity

    Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. Mélange: Cost efficient large language model serving by exploiting GPU heterogeneity. arXiv preprint arXiv:2404.14527, 2024

  64. [64]

    Mooncake: A KVCache-centric disaggregated architecture for LLM serving

    Ruoyu Qin et al. Mooncake: A KVCache-centric disaggregated architecture for LLM serving. arXiv preprint arXiv:2407.00079, 2025

  65. [65]

    NVIDIA dynamo: Smart multi-node scheduling for LLM inference

    NVIDIA. NVIDIA dynamo: Smart multi-node scheduling for LLM inference. https://developer.nvidia.com/blog/smart-multi-node-scheduling-for-fast-and-efficient-llm-inference-with-nvidia-runai-and-nvidia-dynamo/, 2026

  66. [66]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, et al. Efficient memory management for large language model serving with PagedAttention. In Proceedings of SOSP, 2023

  67. [67]

    Search-R2: Enhancing search-integrated reasoning via actor–refiner collaboration

    Bowei He, Minda Hu, Zenan Xu, Hongru Wang, Licheng Zong, Yankai Chen, Chen Ma, Xue Liu, Pluto Zhou, and Irwin King. Search-R2: Enhancing search-integrated reasoning via actor–refiner collaboration. arXiv preprint arXiv:2602.03647, 2026

  68. [68]

    Generalizing beyond suboptimality: Offline reinforcement learning learns effective scheduling through random data

    Jesse van Remmerden, Zaharah Bukhsh, and Yingqian Zhang. Generalizing beyond suboptimality: Offline reinforcement learning learns effective scheduling through random data. arXiv preprint arXiv:2509.10303, 2025

  69. [69]

    Agents of chaos: Evaluating LLM agent vulnerabilities through real-world interactions

    Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, et al. Agents of chaos: Evaluating LLM agent vulnerabilities through real-world interactions. arXiv preprint arXiv:2602.20021, 2026

  70. [70]

    Great, now write an article about that: The crescendo multi-turn LLM jailbreak attack

    Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo multi-turn LLM jailbreak attack. arXiv preprint arXiv:2404.01833, 2024. USENIX Security 2025

  71. [71]

    Peak+accumulation: A proxy-level scoring formula for multi-turn LLM attack detection

    J. Alex Corll. Peak+accumulation: A proxy-level scoring formula for multi-turn LLM attack detection. arXiv preprint arXiv:2602.11247, 2026

  72. [72]

    DeepContext: Stateful real-time detection of multi-turn adversarial intent drift in LLMs

    Justin Albrethsen, Yash Datta, Kunal Kumar, et al. DeepContext: Stateful real-time detection of multi-turn adversarial intent drift in LLMs. arXiv preprint arXiv:2602.16935, 2026

  73. [73]

    ErrorMap and ErrorAtlas: Charting the failure landscape of large language models

    Shir Ashury-Tahan, Yifan Mai, Elron Bandel, et al. ErrorMap and ErrorAtlas: Charting the failure landscape of large language models. arXiv preprint arXiv:2601.15812, 2026

  74. [74]

    LLMRouterBench: A massive benchmark and unified framework for LLM routing

    Hao Li, Yiqun Zhang, Zhaoyan Guo, et al. LLMRouterBench: A massive benchmark and unified framework for LLM routing. arXiv preprint arXiv:2601.07206, 2026

  75. [75]

    Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes

    Mehil B. Shah, Mohammad Mehdi Morovati, Mohammad Masudur Rahman, et al. Characterizing faults in agentic AI: A taxonomy of types, symptoms, and root causes. arXiv preprint arXiv:2603.06847, 2026

  76. [76]

    PALADIN: Self-correcting language model agents to cure tool-failure cases

    Sri Vatsa Vuddanti, Aarav Shah, Satwik Kumar Chittiprolu, Tony Song, Sunishchal Dev, Kevin Zhu, Sean O’Brien, and Maheep Chaudhary. PALADIN: Self-correcting language model agents to cure tool-failure cases. arXiv preprint arXiv:2509.25238, 2025

  77. [77]

    Evaluating performance drift from model switching in multi-turn LLM systems

    Raad Khraishi, Iman Zafar, Katie Myles, and Greig A. Cowan. Evaluating performance drift from model switching in multi-turn LLM systems. ICLR 2026 CAO Workshop, 2026. arXiv:2603.03111

  78. [78]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

  79. [79]

    Learning when to attend: Conditional memory access for long-context LLMs

    Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager, Wei Xia, and Stefano Soatto. Learning when to attend: Conditional memory access for long-context LLMs. arXiv preprint arXiv:2603.17484, 2026

  80. [80]

    AgentCompress: Task-aware compression for affordable large language model agents

    Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, and Shahnawaz Alam. AgentCompress: Task-aware compression for affordable large language model agents. arXiv preprint arXiv:2601.05191, 2026
