pith. machine review for the scientific record.

arxiv: 2605.14241 · v1 · submitted 2026-05-14 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:55 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM agents · tool routing · contextual bandits · latency-quality trade-off · web search · retrieval · online adaptation

The pith

LQM-ContextRoute routes LLM agents to equivalent tool providers by expected answer quality per service cycle rather than additive rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LQM-ContextRoute, a contextual bandit router that selects among functionally equivalent providers such as different web-search APIs or retrievers. It scores each option by expected answer quality divided by service cycle time, then combines this with query context and LLM-as-judge signals to adapt online to load and quality shifts without gold labels. The design prevents the collapse that occurs when low latency offsets poor answers in standard additive rewards. A reader would care because modern LLM agents increasingly face interchangeable providers whose speeds and accuracies differ, and poor routing wastes resources or degrades final answers. Experiments show the router stays on the latency-quality frontier while delivering measurable gains on web-search and StrategyQA benchmarks.

Core claim

LQM-ContextRoute ranks same-function tool providers by expected answer quality per service cycle, using capacity-aware scoring together with query-specific quality estimation and LLM-as-judge feedback; this formulation lets the router adapt online to both changing loads and provider-quality differences, avoiding additive-reward collapse when heterogeneity is high.

What carries the argument

Latency-quality matching, which ranks providers by expected answer quality per service cycle instead of additive latency-quality rewards.
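
To make the mechanism concrete, here is a minimal sketch of the two scoring rules, assuming the renewal-rate form V_i = u_i/(1 + τ̃_i) and the additive composite r_i(α) = α·u_i − (1−α)·τ̃_i quoted in references [24] and [25] below; the variable names and routing loop are ours, and the paper's full router layers query-specific quality estimation and LLM-as-judge feedback on top of this.

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    u: float    # estimated answer quality in [0, 1]
    tau: float  # observed service-cycle latency in seconds

def tau_tilde(tau: float, l_ref: float) -> float:
    # Normalized latency tau~ = min(tau / L_ref, 1), as in the Theorem 2 setup.
    return min(tau / l_ref, 1.0)

def renewal_rate_score(p: Provider, l_ref: float) -> float:
    # Latency-quality matching: expected answer quality per service cycle,
    # V_i = u_i / (1 + tau~_i).
    return p.u / (1.0 + tau_tilde(p.tau, l_ref))

def additive_score(p: Provider, alpha: float, l_ref: float) -> float:
    # Conventional additive composite, where low latency can offset poor
    # answers: r_i(alpha) = alpha * u_i - (1 - alpha) * tau~_i.
    return alpha * p.u - (1.0 - alpha) * tau_tilde(p.tau, l_ref)

def route(providers: list[Provider], l_ref: float) -> Provider:
    # Rank functionally equivalent providers by quality per service cycle.
    return max(providers, key=lambda p: renewal_rate_score(p, l_ref))
```

Under this rule a provider only wins by being fast if its answers keep the per-cycle quality rate high, which is the point of the design.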

If this is right

  • On the main web-search load benchmark, LQM-ContextRoute improves F1 by 2.18 percentage points over SW-UCB while remaining on the latency-quality frontier.
  • In high-heterogeneity StrategyQA settings, it improves accuracy by up to 18 percentage points over SW-UCB.
  • On heterogeneous retriever pools, it improves NDCG by 2.91 to 3.22 percentage points over SW-UCB.
  • The capacity-aware formulation prevents additive-reward collapse when provider quality varies widely under runtime pressure; the two-arm check after this list makes the failure mode concrete.
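
The collapse in the last bullet can be checked on a toy two-arm instance satisfying the separation condition of Theorem 2 (reference [25] below); the quality and latency values here are invented for illustration, not taken from the paper's benchmarks.

```python
# Self-contained two-arm check (illustrative values, not the paper's data).
def v(u, t):
    return u / (1.0 + t)                 # renewal-rate score V = u / (1 + tau~)

def add(u, t, alpha=0.5):
    return alpha * u - (1 - alpha) * t   # additive composite r(alpha)

slow_good = (0.9, 1.0)  # (quality u, normalized latency tau~): slow but accurate
fast_poor = (0.4, 0.2)  # fast but weak

print([round(add(*arm), 3) for arm in (slow_good, fast_poor)])  # [-0.05, 0.1]
print([round(v(*arm), 3) for arm in (slow_good, fast_poor)])    # [0.45, 0.333]
# The additive score prefers fast_poor; the renewal-rate score prefers
# slow_good. Here u2*d_tau/(1+tau2) = 0.267 < d_u = 0.5 < ((1-a)/a)*d_tau = 0.8,
# the regime Theorem 2 identifies where the two rules disagree.
```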

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar per-cycle quality scoring could be applied to routing decisions among interchangeable code-execution or database-query providers.
  • Agent systems running on variable cloud loads might adopt the same capacity-aware ranking to control total inference cost.
  • Replacing the online LLM judge with a lightweight learned quality predictor trained on past interactions would reduce per-query overhead while preserving adaptation; a hypothetical sketch follows this list.
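
A hypothetical sketch of the third extension, assuming a scikit-learn ridge regression over query features; the class, features, and training loop are ours, not the paper's (which keeps the LLM judge online).

```python
import numpy as np
from sklearn.linear_model import Ridge

class DistilledQualityPredictor:
    """Distils logged LLM-judge scores into a cheap per-provider regressor."""

    def __init__(self) -> None:
        self.model = Ridge(alpha=1.0)
        self.features: list[np.ndarray] = []
        self.scores: list[float] = []

    def record(self, query_features: np.ndarray, judge_score: float) -> None:
        # Collect (features, judge score) pairs while the LLM judge is active.
        self.features.append(query_features)
        self.scores.append(judge_score)

    def fit(self) -> None:
        self.model.fit(np.stack(self.features), np.array(self.scores))

    def predict(self, query_features: np.ndarray) -> float:
        # Stands in for the per-query judge call once trained.
        return float(self.model.predict(query_features[None, :])[0])
```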

Load-bearing premise

LLM-as-judge feedback supplies a sufficiently reliable and unbiased quality signal to drive online adaptation without gold labels at deployment time.

What would settle it

A controlled experiment that replaces the LLM judge with a version known to be biased or low-accuracy and measures whether the reported accuracy and F1 gains disappear on the same benchmarks.
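
A sketch of how the corrupted judges could be built, under our own assumptions about the judge interface (scores in [0, 1], per-call latency available); nothing here comes from the paper.

```python
import random

def biased_score(base_score: float, latency_s: float, bias: float = 0.2) -> float:
    # Judge that systematically inflates scores for low-latency providers,
    # the failure mode the load-bearing premise worries about.
    return min(1.0, base_score + bias * max(0.0, 1.0 - latency_s))

def degraded_score(base_score: float, rng: random.Random, flip_p: float = 0.3) -> float:
    # Low-accuracy judge: with probability flip_p the score is pure noise.
    return rng.random() if rng.random() < flip_p else base_score

# Protocol: feed the router these scores during online updates, keep the
# ground-truth evaluation unchanged, and check whether the reported F1 and
# accuracy gains over SW-UCB survive.
```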

Figures

Figures reproduced from arXiv: 2605.14241 by Dawei Xiang, Kexin Chu, Wei Zhang.

Figure 1
Figure 1. Latency-quality Pareto view of the main benchmark; marker size encodes SLA@1.5s. (Adjacent text from §4.3, Q2: the latency-quality matching score should matter most when the same interface hides large provider-quality differences; in a slice by per-query cross-provider F1 gap, LQM-ContextRoute gains +4.42 pp over SW… on 106 high-gap questions.) view at source ↗
original abstract

Tool-augmented LLM agents increasingly access the same tool type through multiple functionally equivalent providers, such as web-search APIs, retrievers, or LLM backends exposed behind a shared interface. This creates a provider-routing problem under runtime load: the router must choose among providers that differ in latency, reliability, and answer quality, often without gold labels at deployment time. We introduce LQM-ContextRoute, a contextual bandit router for same-function tool providers. Its key design is latency-quality matching: instead of letting low latency offset poor answers in an additive reward, the router ranks providers by expected answer quality per service cycle. It combines this capacity-aware score with query-specific quality estimation and LLM-as-judge feedback, allowing it to adapt online to both load changes and provider-quality differences. On the main web-search load benchmark, LQM-ContextRoute improves F1 by +2.18 pp over SW-UCB while staying on the latency-quality frontier. In a high-heterogeneity StrategyQA setting, LQM-ContextRoute avoids additive-reward collapse and improves accuracy by up to +18 pp over SW-UCB; on heterogeneous retriever pools, it improves NDCG by +2.91 to +3.22 pp over SW-UCB. These results show that same-function tool routing benefits from treating latency as service capacity, especially when runtime pressure and provider-quality heterogeneity coexist.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces LQM-ContextRoute, a contextual bandit router for selecting among functionally equivalent tool providers (e.g., web-search APIs) in LLM agents. It ranks providers by expected answer quality per service cycle rather than additive latency-quality rewards, using query-specific quality estimates and LLM-as-judge feedback to adapt online without gold labels at deployment. Evaluations on a web-search load benchmark, high-heterogeneity StrategyQA, and heterogeneous retriever pools report gains over SW-UCB of +2.18 pp F1, up to +18 pp accuracy, and +2.91 to +3.22 pp NDCG respectively.

Significance. If the reported gains hold under deployment conditions, the work provides a practical and principled method for handling provider heterogeneity and runtime load in tool-augmented LLM agents. Treating latency as service capacity rather than an offset in an additive reward avoids collapse in high-heterogeneity settings and could improve reliability in production agent systems.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method description): the central claims rest on LLM-as-judge feedback supplying reliable query-specific quality signals for bandit updates without gold labels. No quantitative validation (inter-judge agreement, correlation with held-out human labels, or ablation removing the judge) is reported, leaving open the possibility that judge bias or variance systematically favors low-latency providers and inflates the reported gains.
  2. [Experimental results] Experimental results (web-search and StrategyQA sections): the abstract states concrete improvements (+2.18 pp F1, +18 pp accuracy) but provides no details on data splits, number of runs, statistical significance tests, or controls for judge bias, making it impossible to verify whether the gains are robust or artifacts of post-hoc choices.
minor comments (1)
  1. [§3] Notation for the latency-quality score and service-cycle normalization should be defined explicitly in the main text rather than only in the appendix to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for validation of the LLM-as-judge component and fuller experimental reporting. We address each major comment below and will make the indicated revisions to improve transparency and robustness.

point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method description): the central claims rest on LLM-as-judge feedback supplying reliable query-specific quality signals for bandit updates without gold labels. No quantitative validation (inter-judge agreement, correlation with held-out human labels, or ablation removing the judge) is reported, leaving open the possibility that judge bias or variance systematically favors low-latency providers and inflates the reported gains.

    Authors: We agree that quantitative validation of the LLM-as-judge is a gap in the current manuscript. All reported metrics (F1, accuracy, NDCG) are computed against ground-truth labels independent of the judge; the judge supplies only relative signals for online bandit updates. To address potential bias, the revision will add an ablation replacing the judge with constant or random quality estimates, report the specific judge model and prompt template, and include correlation analysis against human annotations on a held-out query subset where available. This will quantify the judge's contribution and any systematic effects. revision: yes

  2. Referee: [Experimental results] Experimental results (web-search and StrategyQA sections): the abstract states concrete improvements (+2.18 pp F1, +18 pp accuracy) but provides no details on data splits, number of runs, statistical significance tests, or controls for judge bias, making it impossible to verify whether the gains are robust or artifacts of post-hoc choices.

    Authors: We acknowledge that the manuscript omits these experimental details. The revision will expand the relevant sections to specify: query partitioning for bandit training versus evaluation, the number of independent runs (5 runs with distinct random seeds), statistical tests (paired t-tests with p-values), and the judge-bias ablation described above. These additions will enable verification of robustness and rule out post-hoc artifacts. revision: yes
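
A minimal sketch of the reporting the second response commits to, assuming per-seed metric arrays for both routers; the numbers are placeholders, not results.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed F1 over 5 independent runs; real values would come
# from the expanded experiments the rebuttal describes.
lqm_f1 = np.array([0.712, 0.705, 0.718, 0.709, 0.714])
swucb_f1 = np.array([0.690, 0.688, 0.695, 0.684, 0.691])

t_stat, p_value = stats.ttest_rel(lqm_f1, swucb_f1)  # paired t-test across seeds
print(f"mean gain: {100 * (lqm_f1 - swucb_f1).mean():+.2f} pp (p = {p_value:.4f})")
```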

Circularity Check

0 steps flagged

LQM-ContextRoute derivation is self-contained with no circular reductions

full rationale

The paper introduces LQM-ContextRoute as a contextual bandit router whose central mechanism is a latency-quality matching score (expected answer quality per service cycle) combined with query-specific LLM-as-judge estimates for online adaptation. No equations or steps reduce by construction to fitted inputs renamed as predictions, nor does any load-bearing premise rest on self-citations whose validity is presupposed. The reported gains (+2.18 pp F1, +18 pp accuracy) are presented as empirical outcomes on external benchmarks against SW-UCB, with the method remaining falsifiable through held-out data and alternative routers rather than tautological. The derivation therefore stands as an independent design choice whose performance claims can be tested independently of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only: no explicit free parameters, axioms, or invented entities are stated. The approach relies on standard contextual bandit assumptions plus the unstated reliability of LLM-as-judge feedback.

pith-pipeline@v0.9.0 · 5547 in / 1136 out tokens · 30654 ms · 2026-05-15T02:55:32.026976+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 6 internal anchors

  1. [1]

    1966. Lectures on Functional Equations and Their Applications, volume 19 of Mathematics in Science and Engineering

    János Aczél. 1966. Lectures on Functional Equations and Their Applications, volume 19 of Mathematics in Science and Engineering. Academic Press. Shipra Agrawal and Nikhil R. Devanur

  2. [2]

    BaRP: Bandit-feedback routing with preferences for multi-LLM inference. arXiv preprint arXiv:2510.07429, 2025

    Bandits with concave rewards and convex knapsacks. In ACM EC. Anonymous. 2025a. Learning to route LLMs from bandit feedback (BaRP). arXiv preprint arXiv:2510.07429. Multi-objective contextual bandit for LLM routing under bandit feedback. Anonymous. 2025b. PILOT: Adaptive LLM routing under budget constraints. arXiv preprint arXiv:2508.21141. EMNLP 2025 Find...

  3. [3]

    https://modelcontextprotocol.io/

    Model Context Protocol Specification. https://modelcontextprotocol.io/. Accessed: 2026-05-02. Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins

  4. [4]

    https://docs.litellm.ai/docs/routing

    LiteLLM Routing and Load Balancing Documentation. https://docs.litellm.ai/docs/routing. Accessed: 2026-05-02. Chadderwala

  5. [5]

    Thompson Sampling contextual bandit over heterogeneous tools (PubMed, drug DBs, calculator, web) with composite reward including latency

    Optimizing life sciences agents in real-time using reinforcement learning. arXiv preprint arXiv:2512.03065. Thompson Sampling contextual bandit over heterogeneous tools (PubMed, drug DBs, calculator, web) with composite reward including latency. Richard Combes, Chong Jiang, and R. Srikant

  6. [6]

    https://www.digitalapplied.com/blog/mcp-server-reliability-100-server-stress-test-study

    MCP Server Reliability: A 100-Server Stress Test Study. https://www.digitalapplied.com/blog/mcp-server-reliability-100-server-stress-test-study. Accessed: 2026-05-02. Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Hipolito Garcia, Menglin Xia, L. Lakshmanan, Qingyun Wu, and Victor Ruehle

  7. [7]

    Aurélien Garivier and Eric Moulines

    BEST-Route: Adaptive LLM routing with test-time optimal compute. arXiv preprint arXiv:2506.22716. Aurélien Garivier and Eric Moulines

  8. [8]

    ReliabilityBench: evaluating LLM agent reliability under production-like stress conditions, 2026

    ReliabilityBench: Evaluating LLM agent reliability under production-like stress conditions. arXiv preprint arXiv:2601.06112. Qitian Jason Hu and 1 others

  9. [9]

    RouterBench: A benchmark for multi-LLM routing system. arXiv preprint arXiv:2403.12031. Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, and Sanjiv Kumar

  10. [10]

    Levente Kocsis and Csaba Szepesvári

    Universal model routing for efficient LLM inference. arXiv preprint arXiv:2502.08773. Levente Kocsis and Csaba Szepesvári

  11. [11]

    LLMRouterBench: A massive benchmark and unified framework for LLM routing. arXiv preprint arXiv:2601.07206, 2026

    LLMRouterBench: A massive benchmark and unified framework for LLM routing. arXiv preprint arXiv:2601.07206. Lihong Li, Wei Chu, John Langford, and Robert E. Schapire

  12. [12]

    RouteLLM: Learning to Route LLMs with Preference Data

    RouteLLM: Learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665. Shishir G. Patil and 1 others

  13. [13]

    Gorilla: Large Language Model Connected with Massive APIs

    Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334. Manhin Poon, Xiangxiang Dai, Xutong Liu, Fang Kong, John C. S. Lui, and Jinhang Zuo

  14. [14]

    Online multi-LLM selection via contextual bandits under unstructured context evolution. arXiv preprint arXiv:2506.17670. Portkey

  15. [15]

    https://portkey.ai/blog/failover-routing-strategies-for-llms-in-production/

    Failover Routing Strategies for LLMs in Production. https://portkey.ai/blog/failover-routing-strategies-for-llms-in-production/. Accessed: 2026-05-02. Portkey

  16. [16]

    https://portkey.ai/blog/the-most-reliable-ai-gateway-for-production-systems/

    The Most Reliable AI Gateway for Production Systems. https://portkey.ai/blog/the-most-reliable-ai-gateway-for-production-systems/. Accessed: 2026-05-02. Yujia Qin and 1 others

  17. [17]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789. Sheldon M. Ross. 1996. Stochastic Processes, 2nd edition. Wiley. Renewal-reward theorem (Theorem 3.6.1). Yoan Russac, Claire Vernade, and Olivier Cappé

  18. [18]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761. Annette Taberner-Miller

  19. [19]

    ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving

    ParetoBandit: Budget-paced adaptive routing for non-stationary LLM serving. arXiv preprint arXiv:2604.00136. Alex Tamkin, Doyen Sahoo, and 1 others

  20. [20]

    ReAct: Synergizing Reasoning and Acting in Language Models

    ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629. Bowen Zhang, Gang Wang, Qi Chen, and Anton van den Hengel

  21. [21]

    OpenReview preprint

    How do we select right LLM for each query? MAR: Multi-armed recommender for online LLM selection. OpenReview preprint. Contextual bandit + LLM-as-judge for online LLM routing on 4,029-query WildArena dataset; OpenReview ID AfA3qNY0Fq. Appendix A (positioning vs. prior LLM-routing work): comparison table, truncated …

  22. [22]

    LQM-CONTEXTROUTE instead provides a single online selection rule for a gateway that has already selected the tool type and must choose a provider under current load

    Pareto routing methods and budget-paced LLM routers expose a quality-cost frontier or allocate traffic under a global budget (Mei et al., 2025; Taberner-Miller, 2026). LQM-CONTEXTROUTE instead provides a single online selection rule for a gateway that has already selected the tool type and must choose a provider under current load. The renewal-rate score ...

  23. [23]

    Sliding-window concentration gives the usual non-stationary additive term $O(\sqrt{T \log T \cdot V_T})$

    satisfies $R_T \le \sum_{i:\Delta^V_i > 0} \frac{C (1 + L_{\mathrm{ref}}^{-1})^2 \sigma^2 \log T}{\Delta^V_i} + o(\log T)$. Sliding-window concentration gives the usual non-stationary additive term $O(\sqrt{T \log T \cdot V_T})$. The implemented $\lambda > 0$ quality modulation is not covered by this optimism guarantee: because $\Delta_i$ is estimated online, it can suppress exploration after an early quality-estimation error. We use it as an …

  24. [24]

    (Aczél, 1966, Ch. 3), and the linear renewal cycle fixes $\alpha =$ …

    depend only on $(1 + z_2)/(1 + z_1)$; standard multiplicative functional-equation arguments yield $T(u, z) = u(1 + z)^{-\alpha}$ (Aczél, 1966, Ch. 3), and the linear renewal cycle fixes $\alpha =$ …

  25. [25]

    Theorem 2 (Renewal-reward vs. additive separation)

    Separation from additive composites. Theorem 2 (Renewal-reward vs. additive separation). Fix $\alpha \in (0,1)$ and let $r^{\mathrm{add}}_i(\alpha) = \alpha u_i - (1-\alpha)\tilde{\tau}_i$ with $\tilde{\tau} = \min\{\tau / L_{\mathrm{ref}}, 1\}$. There exists a two-arm instance where the additive score chooses the lower-quality faster arm while $V_i = u_i/(1 + \tilde{\tau}_i)$ chooses the higher-quality arm whenever $\frac{u_2 \, \Delta\tilde{\tau}}{1 + \tilde{\tau}_2} < \Delta u < \frac{1-\alpha}{\alpha} \, \Delta\tilde{\tau}$. The …