Thompson Sampling contextual bandit over heterogeneous tools (PubMed, drug DBs, calculator, web) with composite reward including latency

Optimizing life sciences agents in real-time using reinforcement learning · arXiv 2512.03065

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web-search, StrategyQA, and retriever benchmarks.

citing papers explorer

Showing 1 of 1 citing paper.

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents · cs.LG · 2026-05-14 · unverdicted · none · ref 5

