Mixed citations

RouterBench: A Benchmark for Multi-LLM Routing System

Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath · 2024 · cs.LG · arXiv 2403.12031

Mixed citation behavior. Most common role is background (60%).

42 Pith papers citing it

Background 60% of classified citations

abstract

As the range of applications for Large Language Models (LLMs) continues to grow, the demand for effective serving solutions becomes increasingly critical. Despite the versatility of LLMs, no single model can optimally address all tasks and applications, particularly when balancing performance with cost. This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs. Yet, the absence of a standardized benchmark for evaluating the performance of LLM routers hinders progress in this area. To bridge this gap, we present RouterBench, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies. We further propose a theoretical framework for LLM routing, and deliver a comparative analysis of various routing approaches through RouterBench, highlighting their potentials and limitations within our evaluation framework. This work not only formalizes and advances the development of LLM routing systems but also sets a standard for their assessment, paving the way for more accessible and economically viable LLM deployments. The code and data are available at https://github.com/withmartian/routerbench.

citation-role summary

background: 3 · baseline: 2

citation-polarity summary

background: 3 · baseline: 2

representative citing papers

When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models

cs.AI · 2026-06-25 · unverdicted · novelty 7.0

Any single-output LLM ensemble is accuracy-capped at 1-beta where beta is the all-models-wrong rate, a quantity not captured by pairwise correlations and frequently underestimated by copula models.

RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

RouteJudge introduces an open platform for preference-based evaluation of LLM routers via pairwise user comparisons, along with the ORBIT toolbox for standardized routing workflows.

Online Pandora's Box for Contextual LLM Cascading

cs.AI · 2026-06-05 · unverdicted · novelty 7.0

Introduces a parametric reservation-index policy with GMM estimation and UCB exploration for contextual LLM cascading under output-mediated feedback, claiming dimension-dependent square-root regret.

When to Think Deeply: Inhibitory Deliberation for LLM Reasoning

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

IDPR is a response-conditioned inhibitory deliberation method that trains a controller on fast-slow outcome pairs to decide when to override LLM fast answers, improving accuracy from 47.90% to 48.92% with slow reasoning invoked on only 8.20% of a 5,000-example math test set.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

cs.LG · 2026-05-14 · accept · novelty 7.0 · 2 refs

TwinRouterBench supplies 970 execution-verified router prefixes across five datasets plus a live harness for 100 held-out SWE-bench cases, scoring routers on tier accuracy, trajectory success, and realized token cost without LLM judges.

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

cs.LG · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

LQM-ContextRoute routes LLM tool calls via latency-quality matching in a contextual bandit, improving F1 by 2.18 pp, accuracy by up to 18 pp, and NDCG by 2.91-3.22 pp over SW-UCB on web-search, StrategyQA, and retriever benchmarks.

CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference

cs.IT · 2026-05-12 · unverdicted · novelty 7.0

CR^2 matches full-information routing performance for device-edge LLM inference using only device-side signals and cuts normalized deployment cost by up to 16.9% at matched accuracy.

Efficient Ensemble Selection from Binary and Pairwise Feedback

cs.GT · 2026-05-10 · unverdicted · novelty 7.0

The paper develops efficient algorithms for ensemble selection from binary and pairwise feedback, achieving (1-1/e) guarantees with query savings for coverage and PTAS-style results via submodular relaxation for theta-winning committees.

Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers

cs.LG · 2025-05-19 · conditional · novelty 7.0

A well-tuned kNN router matches or exceeds state-of-the-art learned routers on new standardized benchmarks spanning instruction, QA, reasoning, and the first multi-modal visual routing dataset, due to locality of model performance in embedding space.

SWE-Router: Routing in Multi-turn Agentic Software Engineering Tasks

cs.SE · 2026-06-30 · unverdicted · novelty 6.0

SWE-Router introduces trajectory-conditioned value-based routing for LLM agents on SWE tasks, with a Bayes-optimality theorem and empirical cost savings while retaining most strong-model performance.

RouteBalance: Fused Model Routing and Load Balancing for Heterogeneous LLM Serving

cs.DC · 2026-06-16 · unverdicted · novelty 6.0

RouteBalance fuses routing and load balancing for heterogeneous LLM serving and traces the upper quality-cost-throughput frontier on a 13-instance 28-GPU cluster.

Triaging Threats to Specialized Guardrails

cs.CR · 2026-05-29 · unverdicted · novelty 6.0

Introduces GuardZoo benchmark and RouteGuard router-expert system showing monolithic guardrails suffer task interference while specialized routing improves threat detection and generalization.

The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM Routers

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

LLM routers across 21 methods on 5 benchmarks converge to similar accuracy below oracle due to learning global performance trends rather than fine-grained query signals.

Natural Language Query to Configuration for Retrieval Agents

cs.AI · 2026-05-26 · unverdicted · novelty 6.0

BRANE maps queries to optimal retrieval pipeline configurations using LLM-derived features and per-configuration correctness predictors, improving the cost-quality Pareto frontier on three benchmarks.

Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching

cs.AI · 2026-05-25 · unverdicted · novelty 6.0

DecoR routes LLM queries by decomposing them into capability dimensions and matching to historical examples, yielding higher accuracy and lower inference costs than direct-mapping routers on both in-distribution and OOD data.

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

cs.AI · 2026-05-16 · unverdicted · novelty 6.0 · 2 refs

ECC calibrates semantic embeddings with model comparisons via Bradley-Terry profiles and mixture weights to cluster queries by latent LLM capabilities, claiming 17-18 point gains in ranking quality over baselines.

Domain Restriction via Multi SAE Layer Transitions

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.

GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

GAR routes LLM inference requests via constrained multi-objective optimization to cut per-request CO2 emissions while respecting accuracy floors and p95 latency SLOs.

LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.

ModelLens: Finding the Best for Your Task from Myriads of Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.

CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.

Privacy-Preserving LLMs Routing

cs.CR · 2026-04-17 · unverdicted · novelty 6.0

PPRoute achieves plaintext-level LLM routing quality with MPC-based privacy and a 20x speedup over naive encrypted implementations via MPC-friendly encoders, multi-step training, and O(1) communication Top-k search.

citing papers explorer

Showing 42 of 42 citing papers.

When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models cs.AI · 2026-06-25 · unverdicted · none · ref 12 · internal anchor
Any single-output LLM ensemble is accuracy-capped at 1-beta where beta is the all-models-wrong rate, a quantity not captured by pairwise correlations and frequently underestimated by copula models.
RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing cs.LG · 2026-06-17 · unverdicted · none · ref 2 · internal anchor
RouteJudge introduces an open platform for preference-based evaluation of LLM routers via pairwise user comparisons, along with the ORBIT toolbox for standardized routing workflows.
Online Pandora's Box for Contextual LLM Cascading cs.AI · 2026-06-05 · unverdicted · none · ref 69 · internal anchor
Introduces a parametric reservation-index policy with GMM estimation and UCB exploration for contextual LLM cascading under output-mediated feedback, claiming dimension-dependent square-root regret.
When to Think Deeply: Inhibitory Deliberation for LLM Reasoning cs.CL · 2026-06-04 · unverdicted · none · ref 6 · internal anchor
IDPR is a response-conditioned inhibitory deliberation method that trains a controller on fast-slow outcome pairs to decide when to override LLM fast answers, improving accuracy from 47.90% to 48.92% with slow reasoning invoked on only 8.20% of a 5,000-example math test set.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows cs.AI · 2026-05-18 · unverdicted · none · ref 13 · internal anchor
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing cs.LG · 2026-05-14 · accept · none · ref 2 · 2 links · internal anchor
TwinRouterBench supplies 970 execution-verified router prefixes across five datasets plus a live harness for 100 held-out SWE-bench cases, scoring routers on tier accuracy, trajectory success, and realized token cost without LLM judges.
Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents cs.LG · 2026-05-14 · unverdicted · none · ref 7 · 2 links · internal anchor
LQM-ContextRoute routes LLM tool calls via latency-quality matching in a contextual bandit, improving F1 by 2.18 pp, accuracy by up to 18 pp, and NDCG by 2.91-3.22 pp over SW-UCB on web-search, StrategyQA, and retriever benchmarks.
CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference cs.IT · 2026-05-12 · unverdicted · none · ref 48 · internal anchor
CR^2 matches full-information routing performance for device-edge LLM inference using only device-side signals and cuts normalized deployment cost by up to 16.9% at matched accuracy.
Efficient Ensemble Selection from Binary and Pairwise Feedback cs.GT · 2026-05-10 · unverdicted · none · ref 4 · internal anchor
The paper develops efficient algorithms for ensemble selection from binary and pairwise feedback, achieving (1-1/e) guarantees with query savings for coverage and PTAS-style results via submodular relaxation for theta-winning committees.
Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers cs.LG · 2025-05-19 · conditional · none · ref 26 · internal anchor
A well-tuned kNN router matches or exceeds state-of-the-art learned routers on new standardized benchmarks spanning instruction, QA, reasoning, and the first multi-modal visual routing dataset, due to locality of model performance in embedding space.
SWE-Router: Routing in Multi-turn Agentic Software Engineering Tasks cs.SE · 2026-06-30 · unverdicted · none · ref 78 · internal anchor
SWE-Router introduces trajectory-conditioned value-based routing for LLM agents on SWE tasks, with a Bayes-optimality theorem and empirical cost savings while retaining most strong-model performance.
RouteBalance: Fused Model Routing and Load Balancing for Heterogeneous LLM Serving cs.DC · 2026-06-16 · unverdicted · none · ref 22 · internal anchor
RouteBalance fuses routing and load balancing for heterogeneous LLM serving and traces the upper quality-cost-throughput frontier on a 13-instance 28-GPU cluster.
Triaging Threats to Specialized Guardrails cs.CR · 2026-05-29 · unverdicted · none · ref 16 · internal anchor
Introduces GuardZoo benchmark and RouteGuard router-expert system showing monolithic guardrails suffer task interference while specialized routing improves threat detection and generalization.
The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM Routers cs.LG · 2026-05-27 · unverdicted · none · ref 22 · internal anchor
LLM routers across 21 methods on 5 benchmarks converge to similar accuracy below oracle due to learning global performance trends rather than fine-grained query signals.
Natural Language Query to Configuration for Retrieval Agents cs.AI · 2026-05-26 · unverdicted · none · ref 13 · internal anchor
BRANE maps queries to optimal retrieval pipeline configurations using LLM-derived features and per-configuration correctness predictors, improving the cost-quality Pareto frontier on three benchmarks.
Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching cs.AI · 2026-05-25 · unverdicted · none · ref 8 · internal anchor
DecoR routes LLM queries by decomposing them into capability dimensions and matching to historical examples, yielding higher accuracy and lower inference costs than direct-mapping routers on both in-distribution and OOD data.
Capturing LLM Capabilities via Evidence-Calibrated Query Clustering cs.AI · 2026-05-16 · unverdicted · none · ref 17 · 2 links · internal anchor
ECC calibrates semantic embeddings with model comparisons via Bradley-Terry profiles and mixture weights to cluster queries by latent LLM capabilities, claiming 17-18 point gains in ranking quality over baselines.
Domain Restriction via Multi SAE Layer Transitions cs.AI · 2026-05-12 · unverdicted · none · ref 11 · internal anchor
Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization cs.AI · 2026-05-12 · unverdicted · none · ref 7 · internal anchor
GAR routes LLM inference requests via constrained multi-objective optimization to cut per-request CO2 emissions while respecting accuracy floors and p95 latency SLOs.
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer? cs.AI · 2026-05-11 · unverdicted · none · ref 17 · internal anchor
LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge cs.AI · 2026-05-11 · unverdicted · none · ref 11 · internal anchor
RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
ModelLens: Finding the Best for Your Task from Myriads of Models cs.LG · 2026-05-08 · unverdicted · none · ref 24 · internal anchor
ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation cs.AI · 2026-04-20 · unverdicted · none · ref 12 · internal anchor
CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.
Privacy-Preserving LLMs Routing cs.CR · 2026-04-17 · unverdicted · none · ref 6 · internal anchor
PPRoute achieves plaintext-level LLM routing quality with MPC-based privacy and a 20x speedup over naive encrypted implementations via MPC-friendly encoders, multi-step training, and O(1) communication Top-k search.
Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization cs.LG · 2026-04-16 · unverdicted · none · ref 33 · internal anchor
A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
Anonymization-Enhanced Privacy Protection for Mobile GUI Agents: Available but Invisible cs.CR · 2026-02-08 · conditional · none · ref 18 · internal anchor
An anonymization framework replaces sensitive UI content with deterministic placeholders to protect privacy in mobile GUI agents while preserving task performance.
A Greedy PDE Router for Blending Neural Operators and Classical Methods stat.ME · 2025-09-29 · unverdicted · none · ref 5 · internal anchor
An approximate greedy router for hybrid PDE solvers that mimics optimal selection without true error access and shows faster, more stable error reduction on test equations.
Large Language Models for Combinatorial Optimization of Design Structure Matrix cs.CE · 2025-06-11 · unverdicted · none · ref 56 · internal anchor
LLM framework combines network topology and domain knowledge for iterative DSM sequencing optimization and outperforms stochastic and deterministic baselines on convergence speed and solution quality.
RouteLLM: Learning to Route LLMs with Preference Data cs.LG · 2024-06-26 · unverdicted · none · ref 17 · internal anchor
Router models trained on preference data dynamically select between strong and weak LLMs, cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.
EntroRouter: Learning Efficient Model Routing via Entropy Regulation cs.CL · 2026-06-28 · unverdicted · none · ref 76 · internal anchor
EntroRouter applies entropy regulation in a single-round routing framework to decouple reasoning from routing, retaining 98.3% of top expert accuracy at 48.25% lower compute cost.
ReCal: Reward Calibration for RL-based LLM Routing cs.LG · 2026-06-10 · unverdicted · none · ref 9 · internal anchor
ReCal introduces hierarchical reward decomposition and distribution-aware optimization to address ambiguous credit assignment and optimization bias in RL-based LLM routing.
From Sampled Outcomes to Capability Distributions: Rethinking Supervision for LLM Routing cs.LG · 2026-06-05 · unverdicted · none · ref 140 · internal anchor
DARS replaces single-shot response labels with distribution-aware supervision derived from input and output uncertainty to produce more reliable LLM routing policies.
R2V Agent: Teaching SLMs When to Ask for Help cs.LG · 2026-05-15 · unverdicted · none · ref 5 · internal anchor
R2V-Agent combines an SLM policy trained via BC and DPO with a step-level risk-calibrated router using Brier scores and CVaR to escalate to LLM only on high residual failure risk, improving success-cost tradeoffs on HumanEval+, TextWorld, and TerminalBench.
UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing cs.LG · 2026-05-11 · unverdicted · none · ref 33 · internal anchor
UCCI calibrates LLM uncertainty to error probabilities with isotonic regression for cost-optimal cascade routing, delivering 31% cost savings at maintained accuracy on a 75k-query NER task.
Agentic AI Systems Should Be Designed as Marginal Token Allocators cs.AI · 2026-05-02 · unverdicted · none · ref 17 · internal anchor
Agentic AI systems should be designed as marginal token allocators that balance benefit against cost, latency, and risk across their layers rather than as unit-priced text generators.
RouteProfile: Graph-Based Profiling for Cold-Start LLM Routing cs.NI · 2026-04-30 · unverdicted · none · ref 6 · internal anchor
RouteProfile builds graph-based LLM profiles from public technical report signals to enable training-free cold-start routing and new-LLM integration.
Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents cs.AI · 2026-04-15 · unverdicted · none · ref 6 · internal anchor
Tri-Spirit decomposes autonomous AI into planning, reasoning, and execution layers on heterogeneous hardware, yielding 75.6% lower latency, 71.1% less energy, and 77.6% offline task completion in 2000-task simulations.
AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent cs.LG · 2026-04-07 · unverdicted · none · ref 10 · internal anchor
AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-optimal accuracy on benchmarks.
OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning cs.LG · 2026-05-29 · unverdicted · none · ref 7 · internal anchor
OrcaRouter applies LinUCB with hybrid offline-online learning to LLM routing and reports second place on RouterArena at 75.54% accuracy for $1 per 1,000 queries.
AI-Model Network: Concept, Current State and Future cs.AI · 2026-05-25 · unverdicted · none · ref 49 · internal anchor
The paper introduces the concept, vision, and hierarchical architecture of a worldwide AI-model network (AI-ModelNet) for model interconnection, sharing, and collaboration, validated via a prototype.
Harnessing Multiple Large Language Models: A Survey on LLM Ensemble cs.CL · 2025-02-25 · unverdicted · none · ref 16 · internal anchor
A systematic survey of LLM ensemble methods organized into a taxonomy of ensemble-before-inference, ensemble-during-inference, and ensemble-after-inference stages, with review of benchmarks, applications, and future directions.
RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving cs.NI · 2026-04-13 · unreviewed · ref 21 · internal anchor
