pith. machine review for the scientific record.

arxiv: 2605.14241 · v1 · submitted 2026-05-14 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:55 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM agents · tool routing · contextual bandits · latency-quality trade-off · web search · retrieval · online adaptation

The pith

LQM-ContextRoute routes LLM agents to equivalent tool providers by expected answer quality per service cycle rather than additive rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LQM-ContextRoute, a contextual bandit router that selects among functionally equivalent providers such as different web-search APIs or retrievers. It scores each option by expected answer quality divided by service cycle time, then combines this with query context and LLM-as-judge signals to adapt online to load and quality shifts without gold labels. The design prevents the collapse that occurs when low latency offsets poor answers in standard additive rewards. A reader would care because modern LLM agents increasingly face interchangeable providers whose speeds and accuracies differ, and poor routing wastes resources or degrades final answers. Experiments show the router stays on the latency-quality frontier while delivering measurable gains on web-search and StrategyQA benchmarks.

Core claim

LQM-ContextRoute ranks same-function tool providers by expected answer quality per service cycle, using capacity-aware scoring together with query-specific quality estimation and LLM-as-judge feedback; this formulation lets the router adapt online to both changing loads and provider-quality differences, avoiding additive-reward collapse when heterogeneity is high.

What carries the argument

Latency-quality matching, which ranks providers by expected answer quality per service cycle instead of additive latency-quality rewards.
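
To make the mechanism concrete, here is a minimal sketch of the two scoring rules, assuming the renewal-rate form V_i = u_i/(1 + τ̃_i) and the additive composite r_i(α) = α·u_i − (1−α)·τ̃_i quoted in references [24] and [25] below; the variable names and routing loop are ours, and the paper's full router layers query-specific quality estimation and LLM-as-judge feedback on top of this.

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    u: float    # estimated answer quality in [0, 1]
    tau: float  # observed service-cycle latency in seconds

def tau_tilde(tau: float, l_ref: float) -> float:
    # Normalized latency tau~ = min(tau / L_ref, 1), as in the Theorem 2 setup.
    return min(tau / l_ref, 1.0)

def renewal_rate_score(p: Provider, l_ref: float) -> float:
    # Latency-quality matching: expected answer quality per service cycle,
    # V_i = u_i / (1 + tau~_i).
    return p.u / (1.0 + tau_tilde(p.tau, l_ref))

def additive_score(p: Provider, alpha: float, l_ref: float) -> float:
    # Conventional additive composite, where low latency can offset poor
    # answers: r_i(alpha) = alpha * u_i - (1 - alpha) * tau~_i.
    return alpha * p.u - (1.0 - alpha) * tau_tilde(p.tau, l_ref)

def route(providers: list[Provider], l_ref: float) -> Provider:
    # Rank functionally equivalent providers by quality per service cycle.
    return max(providers, key=lambda p: renewal_rate_score(p, l_ref))
```

Under this rule a provider only wins by being fast if its answers keep the per-cycle quality rate high, which is the point of the design.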

If this is right

  • On the main web-search load benchmark, LQM-ContextRoute improves F1 by 2.18 percentage points over SW-UCB while remaining on the latency-quality frontier.
  • In high-heterogeneity StrategyQA settings, it improves accuracy by up to 18 percentage points over SW-UCB.
  • On heterogeneous retriever pools, it improves NDCG by 2.91 to 3.22 percentage points over SW-UCB.
  • The capacity-aware formulation prevents additive-reward collapse when provider quality varies widely under runtime pressure; the two-arm check after this list makes the failure mode concrete.
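
The collapse in the last bullet can be checked on a toy two-arm instance satisfying the separation condition of Theorem 2 (reference [25] below); the quality and latency values here are invented for illustration, not taken from the paper's benchmarks.

```python
# Self-contained two-arm check (illustrative values, not the paper's data).
def v(u, t):
    return u / (1.0 + t)                 # renewal-rate score V = u / (1 + tau~)

def add(u, t, alpha=0.5):
    return alpha * u - (1 - alpha) * t   # additive composite r(alpha)

slow_good = (0.9, 1.0)  # (quality u, normalized latency tau~): slow but accurate
fast_poor = (0.4, 0.2)  # fast but weak

print([round(add(*arm), 3) for arm in (slow_good, fast_poor)])  # [-0.05, 0.1]
print([round(v(*arm), 3) for arm in (slow_good, fast_poor)])    # [0.45, 0.333]
# The additive score prefers fast_poor; the renewal-rate score prefers
# slow_good. Here u2*d_tau/(1+tau2) = 0.267 < d_u = 0.5 < ((1-a)/a)*d_tau = 0.8,
# the regime Theorem 2 identifies where the two rules disagree.
```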

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar per-cycle quality scoring could be applied to routing decisions among interchangeable code-execution or database-query providers.
  • Agent systems running on variable cloud loads might adopt the same capacity-aware ranking to control total inference cost.
  • Replacing the online LLM judge with a lightweight learned quality predictor trained on past interactions would reduce per-query overhead while preserving adaptation; a hypothetical sketch follows this list.
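
A hypothetical sketch of the third extension, assuming a scikit-learn ridge regression over query features; the class, features, and training loop are ours, not the paper's (which keeps the LLM judge online).

```python
import numpy as np
from sklearn.linear_model import Ridge

class DistilledQualityPredictor:
    """Distils logged LLM-judge scores into a cheap per-provider regressor."""

    def __init__(self) -> None:
        self.model = Ridge(alpha=1.0)
        self.features: list[np.ndarray] = []
        self.scores: list[float] = []

    def record(self, query_features: np.ndarray, judge_score: float) -> None:
        # Collect (features, judge score) pairs while the LLM judge is active.
        self.features.append(query_features)
        self.scores.append(judge_score)

    def fit(self) -> None:
        self.model.fit(np.stack(self.features), np.array(self.scores))

    def predict(self, query_features: np.ndarray) -> float:
        # Stands in for the per-query judge call once trained.
        return float(self.model.predict(query_features[None, :])[0])
```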

Load-bearing premise

LLM-as-judge feedback supplies a sufficiently reliable and unbiased quality signal to drive online adaptation without gold labels at deployment time.

What would settle it

A controlled experiment that replaces the LLM judge with a version known to be biased or low-accuracy and measures whether the reported accuracy and F1 gains disappear on the same benchmarks.
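
A sketch of how the corrupted judges could be built, under our own assumptions about the judge interface (scores in [0, 1], per-call latency available); nothing here comes from the paper.

```python
import random

def biased_score(base_score: float, latency_s: float, bias: float = 0.2) -> float:
    # Judge that systematically inflates scores for low-latency providers,
    # the failure mode the load-bearing premise worries about.
    return min(1.0, base_score + bias * max(0.0, 1.0 - latency_s))

def degraded_score(base_score: float, rng: random.Random, flip_p: float = 0.3) -> float:
    # Low-accuracy judge: with probability flip_p the score is pure noise.
    return rng.random() if rng.random() < flip_p else base_score

# Protocol: feed the router these scores during online updates, keep the
# ground-truth evaluation unchanged, and check whether the reported F1 and
# accuracy gains over SW-UCB survive.
```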

Figures

Figures reproduced from arXiv: 2605.14241 by Dawei Xiang, Kexin Chu, Wei Zhang.

Figure 1
Figure 1. Latency-quality Pareto view of the main benchmark; marker size encodes SLA@1.5s. (Adjacent text from §4.3, Q2: the latency-quality matching score should matter most when the same interface hides large provider-quality differences; in a slice by per-query cross-provider F1 gap, LQM-ContextRoute gains +4.42 pp over SW… on 106 high-gap questions.) view at source ↗
original abstract

Tool-augmented LLM agents increasingly access the same tool type through multiple functionally equivalent providers, such as web-search APIs, retrievers, or LLM backends exposed behind a shared interface. This creates a provider-routing problem under runtime load: the router must choose among providers that differ in latency, reliability, and answer quality, often without gold labels at deployment time. We introduce LQM-ContextRoute, a contextual bandit router for same-function tool providers. Its key design is latency-quality matching: instead of letting low latency offset poor answers in an additive reward, the router ranks providers by expected answer quality per service cycle. It combines this capacity-aware score with query-specific quality estimation and LLM-as-judge feedback, allowing it to adapt online to both load changes and provider-quality differences. On the main web-search load benchmark, LQM-ContextRoute improves F1 by +2.18 pp over SW-UCB while staying on the latency-quality frontier. In a high-heterogeneity StrategyQA setting, LQM-ContextRoute avoids additive-reward collapse and improves accuracy by up to +18 pp over SW-UCB; on heterogeneous retriever pools, it improves NDCG by +2.91 to +3.22 pp over SW-UCB. These results show that same-function tool routing benefits from treating latency as service capacity, especially when runtime pressure and provider-quality heterogeneity coexist.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces LQM-ContextRoute, a contextual bandit router for selecting among functionally equivalent tool providers (e.g., web-search APIs) in LLM agents. It ranks providers by expected answer quality per service cycle rather than additive latency-quality rewards, using query-specific quality estimates and LLM-as-judge feedback to adapt online without gold labels at deployment. Evaluations on a web-search load benchmark, high-heterogeneity StrategyQA, and heterogeneous retriever pools report gains over SW-UCB of +2.18 pp F1, up to +18 pp accuracy, and +2.91 to +3.22 pp NDCG respectively.

Significance. If the reported gains hold under deployment conditions, the work provides a practical and principled method for handling provider heterogeneity and runtime load in tool-augmented LLM agents. Treating latency as service capacity rather than an offset in an additive reward avoids collapse in high-heterogeneity settings and could improve reliability in production agent systems.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method description): the central claims rest on LLM-as-judge feedback supplying reliable query-specific quality signals for bandit updates without gold labels. No quantitative validation (inter-judge agreement, correlation with held-out human labels, or ablation removing the judge) is reported, leaving open the possibility that judge bias or variance systematically favors low-latency providers and inflates the reported gains.
  2. [Experimental results] Experimental results (web-search and StrategyQA sections): the abstract states concrete improvements (+2.18 pp F1, +18 pp accuracy) but provides no details on data splits, number of runs, statistical significance tests, or controls for judge bias, making it impossible to verify whether the gains are robust or artifacts of post-hoc choices.
minor comments (1)
  1. [§3] Notation for the latency-quality score and service-cycle normalization should be defined explicitly in the main text rather than only in the appendix to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for validation of the LLM-as-judge component and fuller experimental reporting. We address each major comment below and will make the indicated revisions to improve transparency and robustness.

point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method description): the central claims rest on LLM-as-judge feedback supplying reliable query-specific quality signals for bandit updates without gold labels. No quantitative validation (inter-judge agreement, correlation with held-out human labels, or ablation removing the judge) is reported, leaving open the possibility that judge bias or variance systematically favors low-latency providers and inflates the reported gains.

    Authors: We agree that quantitative validation of the LLM-as-judge is a gap in the current manuscript. All reported metrics (F1, accuracy, NDCG) are computed against ground-truth labels independent of the judge; the judge supplies only relative signals for online bandit updates. To address potential bias, the revision will add an ablation replacing the judge with constant or random quality estimates, report the specific judge model and prompt template, and include correlation analysis against human annotations on a held-out query subset where available. This will quantify the judge's contribution and any systematic effects. revision: yes

  2. Referee: [Experimental results] Experimental results (web-search and StrategyQA sections): the abstract states concrete improvements (+2.18 pp F1, +18 pp accuracy) but provides no details on data splits, number of runs, statistical significance tests, or controls for judge bias, making it impossible to verify whether the gains are robust or artifacts of post-hoc choices.

    Authors: We acknowledge that the manuscript omits these experimental details. The revision will expand the relevant sections to specify: query partitioning for bandit training versus evaluation, the number of independent runs (5 runs with distinct random seeds), statistical tests (paired t-tests with p-values), and the judge-bias ablation described above. These additions will enable verification of robustness and rule out post-hoc artifacts. revision: yes
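
A minimal sketch of the reporting the second response commits to, assuming per-seed metric arrays for both routers; the numbers are placeholders, not results.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed F1 over 5 independent runs; real values would come
# from the expanded experiments the rebuttal describes.
lqm_f1 = np.array([0.712, 0.705, 0.718, 0.709, 0.714])
swucb_f1 = np.array([0.690, 0.688, 0.695, 0.684, 0.691])

t_stat, p_value = stats.ttest_rel(lqm_f1, swucb_f1)  # paired t-test across seeds
print(f"mean gain: {100 * (lqm_f1 - swucb_f1).mean():+.2f} pp (p = {p_value:.4f})")
```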

Circularity Check

0 steps flagged

LQM-ContextRoute derivation is self-contained with no circular reductions

full rationale

The paper introduces LQM-ContextRoute as a contextual bandit router whose central mechanism is a latency-quality matching score (expected answer quality per service cycle) combined with query-specific LLM-as-judge estimates for online adaptation. No equations or steps reduce by construction to fitted inputs renamed as predictions, nor does any load-bearing premise rest on self-citations whose validity is presupposed. The reported gains (+2.18 pp F1, +18 pp accuracy) are presented as empirical outcomes on external benchmarks against SW-UCB, with the method remaining falsifiable through held-out data and alternative routers rather than tautological. The derivation therefore stands as an independent design choice whose performance claims can be tested independently of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only: no explicit free parameters, axioms, or invented entities are stated. The approach relies on standard contextual bandit assumptions plus the unstated reliability of LLM-as-judge feedback.

pith-pipeline@v0.9.0 · 5547 in / 1136 out tokens · 30654 ms · 2026-05-15T02:55:32.026976+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 6 internal anchors

  1. [1]

    1966. Lectures on Functional Equations and Their Applications, volume 19 of Mathematics in Science and Engineering

    János Aczél. 1966. Lectures on Functional Equations and Their Applications, volume 19 of Mathematics in Science and Engineering. Academic Press. Shipra Agrawal and Nikhil R. Devanur

  2. [2]

    BaRP: Bandit-feedback routing with preferences for multi-LLM inference. arXiv preprint arXiv:2510.07429, 2025

    Bandits with concave rewards and convex knapsacks. In ACM EC. Anonymous. 2025a. Learning to route LLMs from bandit feedback (BaRP). arXiv preprint arXiv:2510.07429. Multi-objective contextual bandit for LLM routing under bandit feedback. Anonymous. 2025b. PILOT: Adaptive LLM routing under budget constraints. arXiv preprint arXiv:2508.21141. EMNLP 2025 Find...

  3. [3]

    https://modelcontextprotocol.io/

    Model Context Protocol Specification. https://modelcontextprotocol.io/. Accessed: 2026-05-02. Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins

  4. [4]

    https://docs.litellm.ai/docs/routing

    LiteLLM Routing and Load Balancing Documentation. https://docs.litellm.ai/docs/routing. Accessed: 2026-05-02. Chadderwala

  5. [5]

    Thompson Sampling contextual bandit over heterogeneous tools (PubMed, drug DBs, calculator, web) with composite reward including latency

    Optimizing life sciences agents in real-time using reinforcement learning. arXiv preprint arXiv:2512.03065. Thompson Sampling contextual bandit over heterogeneous tools (PubMed, drug DBs, calculator, web) with composite reward including latency. Richard Combes, Chong Jiang, and R. Srikant

  6. [6]

    https://www.digitalapplied.com/blog/mcp-server-reliability-100-server-stress-test-study

    MCP Server Reliability: A 100-Server Stress Test Study. https://www.digitalapplied.com/blog/mcp-server-reliability-100-server-stress-test-study. Accessed: 2026-05-02. Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Hipolito Garcia, Menglin Xia, L. Lakshmanan, Qingyun Wu, and Victor Ruehle

  7. [7]

    Aurélien Garivier and Eric Moulines

    BEST-Route: Adaptive LLM routing with test-time optimal compute. arXiv preprint arXiv:2506.22716. Aurélien Garivier and Eric Moulines

  8. [8]

    ReliabilityBench: evaluating LLM agent reliability under production-like stress conditions, 2026

    ReliabilityBench: Evaluating LLM agent reliability under production-like stress conditions. arXiv preprint arXiv:2601.06112. Qitian Jason Hu and 1 others

  9. [9]

    RouterBench: A benchmark for multi-LLM routing system. arXiv preprint arXiv:2403.12031. Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, and Sanjiv Kumar

  10. [10]

    Levente Kocsis and Csaba Szepesvári

    Universal model routing for efficient LLM inference. arXiv preprint arXiv:2502.08773. Levente Kocsis and Csaba Szepesvári

  11. [11]

    LLMRouterBench: A massive benchmark and unified framework for LLM routing. arXiv preprint arXiv:2601.07206, 2026

    LLMRouterBench: A massive benchmark and unified framework for LLM routing. arXiv preprint arXiv:2601.07206. Lihong Li, Wei Chu, John Langford, and Robert E. Schapire

  12. [12]

    RouteLLM: Learning to Route LLMs with Preference Data

    RouteLLM: Learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665. Shishir G. Patil and 1 others

  13. [13]

    Gorilla: Large Language Model Connected with Massive APIs

    Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334. Manhin Poon, Xiangxiang Dai, Xutong Liu, Fang Kong, John C. S. Lui, and Jinhang Zuo

  14. [14]

    Online multi-LLM selection via contextual bandits under unstructured context evolution. arXiv preprint arXiv:2506.17670. Portkey

  15. [15]

    https://portkey.ai/blog/failover-routing-strategies-for-llms-in-production/

    Failover Routing Strategies for LLMs in Production. https://portkey.ai/blog/failover-routing-strategies-for-llms-in-production/. Accessed: 2026-05-02. Portkey

  16. [16]

    https://portkey.ai/blog/the-most-reliable-ai-gateway-for-production-systems/

    The Most Reliable AI Gateway for Production Systems. https://portkey.ai/blog/the-most-reliable-ai-gateway-for-production-systems/. Accessed: 2026-05-02. Yujia Qin and 1 others

  17. [17]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789. Sheldon M. Ross. 1996. Stochastic Processes, 2nd edition. Wiley. Renewal-reward theorem (Theorem 3.6.1). Yoan Russac, Claire Vernade, and Olivier Cappé

  18. [18]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761. Annette Taberner-Miller

  19. [19]

    ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving

    ParetoBandit: Budget-paced adaptive routing for non-stationary LLM serving. arXiv preprint arXiv:2604.00136. Alex Tamkin, Doyen Sahoo, and 1 others

  20. [20]

    ReAct: Synergizing Reasoning and Acting in Language Models

    ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629. Bowen Zhang, Gang Wang, Qi Chen, and Anton van den Hengel

  21. [21]

    OpenReview preprint

    How do we select right LLM for each query? MAR: Multi-armed recommender for online LLM selection. OpenReview preprint. Contextual bandit + LLM-as-judge for online LLM routing on 4,029-query WildArena dataset; OpenReview ID AfA3qNY0Fq. Appendix A (positioning vs. prior LLM-routing work): comparison table, truncated …

  22. [22]

    LQM-CONTEXTROUTE instead provides a single online selection rule for a gateway that has already selected the tool type and must choose a provider under current load

    Pareto routing methods and budget-paced LLM routers expose a quality-cost frontier or allocate traffic under a global budget (Mei et al., 2025; Taberner-Miller, 2026). LQM-CONTEXTROUTE instead provides a single online selection rule for a gateway that has already selected the tool type and must choose a provider under current load. The renewal-rate score ...

  23. [23]

    Sliding-window concentration gives the usual non-stationary additive term $O(\sqrt{T \log T \cdot V_T})$

    satisfies $R_T \le \sum_{i:\Delta^V_i > 0} \frac{C (1 + L_{\mathrm{ref}}^{-1})^2 \sigma^2 \log T}{\Delta^V_i} + o(\log T)$. Sliding-window concentration gives the usual non-stationary additive term $O(\sqrt{T \log T \cdot V_T})$. The implemented $\lambda > 0$ quality modulation is not covered by this optimism guarantee: because $\Delta_i$ is estimated online, it can suppress exploration after an early quality-estimation error. We use it as an …

  24. [24]

    (Aczél, 1966, Ch. 3), and the linear renewal cycle fixes $\alpha =$ …

    depend only on $(1 + z_2)/(1 + z_1)$; standard multiplicative functional-equation arguments yield $T(u, z) = u(1 + z)^{-\alpha}$ (Aczél, 1966, Ch. 3), and the linear renewal cycle fixes $\alpha =$ …

  25. [25]

    Theorem 2 (Renewal-reward vs. additive separation)

    Separation from additive composites. Theorem 2 (Renewal-reward vs. additive separation). Fix $\alpha \in (0,1)$ and let $r^{\mathrm{add}}_i(\alpha) = \alpha u_i - (1-\alpha)\tilde{\tau}_i$ with $\tilde{\tau} = \min\{\tau / L_{\mathrm{ref}}, 1\}$. There exists a two-arm instance where the additive score chooses the lower-quality faster arm while $V_i = u_i/(1 + \tilde{\tau}_i)$ chooses the higher-quality arm whenever $\frac{u_2 \, \Delta\tilde{\tau}}{1 + \tilde{\tau}_2} < \Delta u < \frac{1-\alpha}{\alpha} \, \Delta\tilde{\tau}$. The …