OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning

Chris Zhang; Fengya Tian; Xile Ma; Yi Shi; Zhenghua Bao; Zhenjun Chen

arXiv: 2605.30736 · v1 · pith:6RD7QTVJnew · submitted 2026-05-29 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-06-28 23:22 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL

keywords LLM router · contextual bandit · LinUCB · hybrid offline-online learning · production LLM deployment · RouterArena leaderboard

The pith

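OrcaRouter uses hybrid offline-online LinUCB learning to route LLM requests; at submission time (May 20, 2026) its Adaptive variant ranked second on the public RouterArena leaderboard with an arena score of 72.08, 75.54% accuracy, and a cost of USD 1.00 per 1,000 queries.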

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OrcaRouter to address a deployment problem: choosing which large language model should handle each incoming query so that answer quality and inference cost stay balanced. Offline, it runs every candidate model on a curated set of routing prompts to build a full-information reward matrix, then fits one ridge regressor per model (arm). At deployment, a LinUCB contextual bandit starts from those fitted parameters and can optionally keep learning from the rewards it observes. Because the policy is warm-started, it does not have to begin with uninformed exploration in production. This matters because, as more LLMs with different strengths and prices become available, per-query routing becomes a practical lever for efficient deployment. The reported result is second place on the public RouterArena leaderboard at submission time, with an arena score of 72.08, 75.54% accuracy, and a cost of USD 1.00 per 1,000 queries.
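The pipeline described above can be written compactly, assuming the standard disjoint LinUCB formulation. The regularization strength \lambda and the exploration weight \alpha are assumed hyperparameters; the abstract does not state their values or confirm this exact form.

```latex
% Offline: one ridge fit per arm a from the full-information reward matrix.
% X is the n-by-d feature matrix of the curated prompts and r_a is column a
% of the reward matrix. \lambda > 0 and \alpha \ge 0 are assumed hyperparameters.
\begin{align}
A_a &= X^\top X + \lambda I, \qquad b_a = X^\top r_a, \qquad
\hat\theta_a = A_a^{-1} b_a \\
% Online: route context x_t to the arm with the largest upper confidence bound.
a_t &= \arg\max_a \; x_t^\top \hat\theta_a + \alpha \sqrt{x_t^\top A_a^{-1} x_t} \\
% Bandit feedback: only the selected arm is updated with its observed reward r_t.
A_{a_t} &\leftarrow A_{a_t} + x_t x_t^\top, \qquad
b_{a_t} \leftarrow b_{a_t} + r_t x_t
\end{align}
```

Read this way, the offline and online phases share the same sufficient statistics (A_a, b_a); the only difference is whether every arm or just the selected arm receives each observation.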

Core claim

OrcaRouter obtains full-information feedback by evaluating each candidate model on a curated set of routing prompts to yield a reward matrix, fits one ridge regressor per arm, initializes from these parameters at deployment, and optionally continues learning from bandit feedback by updating only the selected model's arm after observing its reward. At the time of the RouterArena submission (May 20, 2026), the OrcaRouter-Adaptive variant ranked second with an arena score of 72.08, 75.54% accuracy, and a cost of USD 1.00 per 1,000 queries.
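A minimal, dependency-free sketch of that protocol follows. It is not the authors' implementation: the names `Arm`, `fit_offline`, and `route` are hypothetical, the toy data is invented for illustration, and the values of `lam` and `alpha` are assumptions. The sketch keeps the inverse matrix per arm and applies a Sherman-Morrison rank-one update, so the offline fit is simply the online update run on every arm for every curated prompt.

```python
from typing import List, Sequence

Vector = List[float]


class Arm:
    """Per-model ridge state: A_inv = (lam*I + sum x x^T)^-1 and b = sum r*x."""

    def __init__(self, dim: int, lam: float = 1.0) -> None:
        self.A_inv = [[(1.0 / lam if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _A_inv_times(self, x: Sequence[float]) -> Vector:
        return [sum(a * v for a, v in zip(row, x)) for row in self.A_inv]

    def update(self, x: Sequence[float], reward: float) -> None:
        # Sherman-Morrison: inverse of (A + x x^T) from the current inverse.
        u = self._A_inv_times(x)
        denom = 1.0 + sum(xi * ui for xi, ui in zip(x, u))
        for i in range(len(u)):
            for j in range(len(u)):
                self.A_inv[i][j] -= u[i] * u[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]

    def theta(self) -> Vector:
        # Ridge estimate A^-1 b.
        return self._A_inv_times(self.b)

    def ucb(self, x: Sequence[float], alpha: float) -> float:
        u = self._A_inv_times(x)
        mean = sum(t * xi for t, xi in zip(self.theta(), x))
        width = max(sum(xi * ui for xi, ui in zip(x, u)), 0.0) ** 0.5
        return mean + alpha * width


def fit_offline(X: Sequence[Sequence[float]], R: Sequence[Sequence[float]],
                lam: float = 1.0) -> List[Arm]:
    """Full-information fit: every arm sees its reward for every prompt."""
    arms = [Arm(len(X[0]), lam) for _ in range(len(R[0]))]
    for x, rewards in zip(X, R):
        for arm, r in zip(arms, rewards):
            arm.update(x, r)
    return arms


def route(arms: Sequence[Arm], x: Sequence[float], alpha: float = 0.5) -> int:
    """Pick the arm with the highest upper confidence bound."""
    scores = [arm.ucb(x, alpha) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)


# Invented toy data: arm 0 is rewarded when feature 0 is active, arm 1 when feature 1 is.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
R = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
arms = fit_offline(X, R)
chosen = route(arms, [1.0, 0.0], alpha=0.1)
arms[chosen].update([1.0, 0.0], 1.0)  # bandit feedback: only the chosen arm learns
```

Setting `alpha` to zero turns the router into a purely greedy policy over the offline fits, which corresponds to deploying without the optional online learning.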

What carries the argument

LinUCB contextual bandit over lexical and sentence-embedding features, initialized from offline ridge regressors on a full-information reward matrix, with selective online arm updates.
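To make "lexical and sentence-embedding features" concrete, here is one way a context vector might be assembled. Everything in it is an assumption: the abstract does not list the lexical features, and `hashed_embedding` is only a deterministic stand-in for a real sentence encoder so that the sketch runs without external models.

```python
import hashlib
import math
import re
from typing import Callable, List


def lexical_features(prompt: str) -> List[float]:
    """A few cheap surface statistics (illustrative choices, not the paper's)."""
    n_chars = max(len(prompt), 1)
    return [
        math.log1p(len(prompt.split())),                          # length, log scale
        sum(ch.isdigit() for ch in prompt) / n_chars,             # digit density
        1.0 if "def " in prompt or "{" in prompt else 0.0,        # crude code hint
        1.0 if prompt.rstrip().endswith("?") else 0.0,            # phrased as a question
    ]


def hashed_embedding(prompt: str, dim: int = 16) -> List[float]:
    """Stand-in for a sentence encoder: L2-normalized hashed bag of words."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", prompt.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def featurize(prompt: str,
              embed: Callable[[str], List[float]] = hashed_embedding) -> List[float]:
    """Context vector = bias term + lexical features + embedding."""
    return [1.0] + lexical_features(prompt) + list(embed(prompt))


x = featurize("What is 2 + 2?")
```

Passing a different `embed` callable swaps in a real encoder without touching the rest of the router, since LinUCB only sees the concatenated vector.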

If this is right

The system can begin routing effectively without random exploration at the start of deployment.
Online updates stay cheap because only the selected model's arm is updated after each interaction.
Production systems gain the ability to adapt to shifting query distributions over time.
Overall inference spend decreases while maintaining competitive accuracy compared to always using the strongest model.
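The low-update-cost point can be made precise under one assumption the source does not confirm: that each arm keeps the inverse of its design matrix and refreshes it with a rank-one identity rather than re-inverting.

```latex
% Sherman-Morrison: rank-one update of the inverse for the selected arm a_t.
\begin{equation}
\left(A_{a_t} + x_t x_t^\top\right)^{-1}
  = A_{a_t}^{-1}
  - \frac{A_{a_t}^{-1} x_t x_t^\top A_{a_t}^{-1}}{1 + x_t^\top A_{a_t}^{-1} x_t}
\end{equation}
% Cost per request with K arms and d features: scoring all arms is O(K d^2),
% while the update touches one arm and costs O(d^2), independent of K.
```

Under that assumption the per-request learning cost does not grow with the number of candidate models; only the scoring step does.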

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the offline prompt curation process could further improve initial performance if new query types are anticipated.
The approach might generalize to routing decisions in other AI services where multiple models compete on cost and quality.
Long-term, the router could reduce the need for users to manually select models by learning preferences implicitly.

Load-bearing premise

The curated set of routing prompts used to generate the offline reward matrix is representative of the distribution of queries that will arrive at deployment time.

What would settle it

Measuring the accuracy and cost on a large set of real user queries collected after deployment that were not part of the original curated prompt set.
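Such a check reduces to two summary statistics over a log of routed queries. The sketch below assumes a hypothetical log record (`RoutedQuery`) with a correctness flag and a per-call cost; the sample values are invented for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class RoutedQuery:
    correct: bool     # did the routed model answer correctly?
    cost_usd: float   # what the routed call cost


def summarize(log: Iterable[RoutedQuery]) -> Tuple[float, float]:
    """Return (accuracy, cost in USD per 1,000 queries) for a held-out log."""
    records = list(log)
    if not records:
        raise ValueError("log is empty")
    n = len(records)
    accuracy = sum(r.correct for r in records) / n
    cost_per_1k = 1000.0 * sum(r.cost_usd for r in records) / n
    return accuracy, cost_per_1k


# Invented held-out log, for illustration only.
held_out = [
    RoutedQuery(True, 0.001),
    RoutedQuery(True, 0.003),
    RoutedQuery(False, 0.002),
    RoutedQuery(True, 0.002),
    RoutedQuery(True, 0.002),
]
accuracy, cost_per_1k = summarize(held_out)
```

Comparing these two numbers between the curated prompts and post-deployment traffic would show directly whether the offline fit transfers.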

Original abstract

The rapid development of large language models, each with distinct capabilities and inference costs, raises a practical deployment question: given an incoming request, which model should handle it? We present OrcaRouter, a production-oriented LLM router that combines a LinUCB-based contextual bandit over lexical and sentence-embedding features with a hybrid offline-online learning protocol. Offline, OrcaRouter obtains full-information feedback by evaluating each candidate model on a curated set of routing prompts, yielding a reward matrix used to fit one ridge regressor per arm. At deployment time, it initializes from these parameters and can optionally continue learning from bandit feedback, updating only the selected model's arm after observing its reward. At the time of our RouterArena submission (May 20, 2026), OrcaRouter-Adaptive ranked second on the public RouterArena leaderboard with an arena score of 72.08, achieving 75.54% accuracy at a cost of USD 1.00 per 1,000 queries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OrcaRouter gives a concrete LinUCB hybrid for LLM routing that reaches second on RouterArena, but the offline-to-online transfer rests on an unverified prompt distribution match.

The paper's main contribution is a practical system that fits ridge regressors offline on a full-information reward matrix from a curated prompt set, then runs LinUCB online with optional updates only on the chosen arm. It reports a real public leaderboard result: second place at 72.08 arena score, 75.54% accuracy, and USD 1.00 per 1,000 queries.

That hybrid protocol is a reasonable engineering choice for production routing where you can afford initial full evaluations but want to adapt later. Reporting numbers against an actual arena benchmark is better than pure offline simulation.

The load-bearing assumption is that the curated prompts used for the offline matrix have feature distributions close enough to live RouterArena queries. The abstract gives no curation details, no embedding comparison, and no ablation on mismatched sets, so the transfer from offline parameters to the observed score stays untested. Reward definition and baseline comparisons are also missing, which makes the accuracy claim hard to evaluate.
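One way to test that assumption is a two-sample distance between the feature vectors of the curated prompts and those of live queries. The sketch below uses the biased estimate of squared maximum mean discrepancy (MMD) with an RBF kernel; this is a choice made here for illustration, not a method from the paper, and the bandwidth `gamma` and the sample points are assumptions.

```python
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def _rbf(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def _mean_kernel(A: Matrix, B: Matrix, gamma: float) -> float:
    return sum(_rbf(a, b, gamma) for a in A for b in B) / (len(A) * len(B))


def mmd2(X: Matrix, Y: Matrix, gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD between two samples (RBF kernel)."""
    return (_mean_kernel(X, X, gamma) + _mean_kernel(Y, Y, gamma)
            - 2.0 * _mean_kernel(X, Y, gamma))


# Invented feature vectors, for illustration only.
curated = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
shifted = [[2.0, 2.0], [2.1, 2.0], [2.0, 2.1]]
same_gap = mmd2(curated, curated)
shift_gap = mmd2(curated, shifted)
```

A value near zero suggests the two samples look alike in feature space, while a large value flags a mismatch; turning the statistic into a formal test would additionally need a permutation-based threshold.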

This is aimed at engineers running multi-model LLM services who need a bandit router they can initialize cheaply. A practitioner might borrow the offline-plus-online pattern, but the result is too tied to one unexamined distribution to generalize without more evidence.

It deserves peer review because it ships a working system with public-benchmark numbers rather than just a method sketch.

Referee Report

2 major / 0 minor

Summary. The manuscript presents OrcaRouter, a production LLM router that uses a LinUCB contextual bandit over lexical and sentence-embedding features. It employs a hybrid offline-online protocol: offline, a full-information reward matrix is generated by evaluating every candidate model on a fixed curated prompt set and used to fit one ridge regressor per arm; at deployment the policy initializes from these parameters and may continue updating from bandit feedback. The central empirical claim is that the resulting OrcaRouter-Adaptive variant ranked second on the public RouterArena leaderboard (May 20, 2026) with an arena score of 72.08, 75.54% accuracy, and a cost of USD 1.00 per 1,000 queries.

Significance. If the reported leaderboard result is reproducible and the distributional assumptions hold, the hybrid offline-initialization plus online-adaptation approach could offer a practical, low-cost routing method for heterogeneous LLM deployments. The work does not supply machine-checked proofs, reproducible code, or parameter-free derivations.

major comments (2)

[Abstract] The headline performance claim (72.08 arena score, 75.54% accuracy) is obtained by initializing LinUCB from ridge regressors fit on a reward matrix generated exclusively from a curated prompt set; no description of the curation process, no statistical comparison of feature distributions (lexical + embeddings) between the curated set and RouterArena queries, and no ablation replacing the curated set with a mismatched distribution are supplied. This distributional match is load-bearing for the transfer from offline fitting to the observed leaderboard result.
[Abstract / method description] The reward definition, exact feature construction, baseline comparisons, and any statistical significance tests for the 72.08 arena score are not reported, preventing verification of the central performance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and verifiability of the central claims.

Referee: [Abstract] The headline performance claim (72.08 arena score, 75.54% accuracy) is obtained by initializing LinUCB from ridge regressors fit on a reward matrix generated exclusively from a curated prompt set; no description of the curation process, no statistical comparison of feature distributions (lexical + embeddings) between the curated set and RouterArena queries, and no ablation replacing the curated set with a mismatched distribution are supplied. This distributional match is load-bearing for the transfer from offline fitting to the observed leaderboard result.

Authors: We agree that the curation process and evidence of distributional similarity are insufficiently documented and that this is a substantive concern for interpreting the offline-to-online transfer. In the revised manuscript we will add: (i) an explicit description of the curation criteria and prompt sources in Section 3, (ii) quantitative comparisons of lexical and embedding feature distributions (including summary statistics and distance metrics) between the curated set and RouterArena queries, and (iii) an ablation that substitutes a deliberately mismatched prompt distribution and reports the resulting degradation in leaderboard performance. These additions will be placed in the Methods and Experiments sections.

revision: yes

Referee: [Abstract / method description] The reward definition, exact feature construction, baseline comparisons, and any statistical significance tests for the 72.08 arena score are not reported, preventing verification of the central performance claim.

Authors: The abstract is intentionally concise; the full manuscript contains the reward definition (a linear combination of normalized accuracy and cost), the precise lexical and embedding feature construction, and comparisons against random, cost-greedy, and other bandit baselines. However, we acknowledge that statistical significance tests for the reported arena score are absent. In the revision we will add bootstrap confidence intervals and permutation-test p-values for the 72.08 score in the Results section, make the exact feature dimensionality and hyper-parameters explicit in the Methods, and ensure the reward formulation is stated in a single, self-contained paragraph.

revision: yes
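The two items in that response, a reward that linearly combines accuracy and normalized cost and a bootstrap confidence interval, might look like the sketch below. The sign convention, the `weight` value, the cost normalization by `max_cost_usd`, and the sample outcomes are all assumptions; the rebuttal states only that the reward is a linear combination and does not give coefficients.

```python
import random
from typing import Sequence, Tuple


def reward(correct: bool, cost_usd: float, max_cost_usd: float,
           weight: float = 0.5) -> float:
    """Assumed form: weight * accuracy - (1 - weight) * normalized cost."""
    normalized_cost = min(cost_usd / max_cost_usd, 1.0)
    return weight * float(correct) - (1.0 - weight) * normalized_cost


def bootstrap_ci(values: Sequence[float], n_resamples: int = 2000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Invented per-query correctness outcomes, for illustration only.
outcomes = [1.0] * 75 + [0.0] * 25
low, high = bootstrap_ci(outcomes)
```

The interval here is for mean accuracy over a query sample; an interval for the arena score itself would require resampling whatever per-query quantities that score is computed from, which the abstract does not specify.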

Circularity Check

0 steps flagged

No circularity; empirical leaderboard result is external and independent of offline fitting procedure.

The paper presents a system description and empirical result on RouterArena without any equations, derivations, or claimed first-principles predictions. The offline ridge regressors are fit to a curated prompt matrix and used only for initialization; the reported 72.08 arena score is measured on an external public leaderboard after deployment, not computed from the offline data by construction. No self-citations, uniqueness theorems, or ansatzes are invoked. The distributional representativeness of the curated set is a standard (unverified) modeling assumption, not a definitional loop or fitted-input prediction. This is the normal case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The ridge regressors per arm are implicitly fitted but their regularization strength and feature construction details are unspecified.
