Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers

Yang Li

arxiv: 2505.12601 · v2 · pith:7MEC6ORRnew · submitted 2025-05-19 · 💻 cs.LG

Pith reviewed 2026-05-22 15:08 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM routing · k-nearest neighbors · model selection · embedding locality · non-parametric methods · routing benchmarks · multi-modal routing

The pith

A well-tuned k-nearest neighbors method often matches or beats complex learned routers when selecting the best LLM for a given input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that routing an input to the most suitable large language model can be handled effectively by a simple k-nearest neighbors lookup in embedding space rather than by training parametric routers. This holds across instruction-following, question-answering, reasoning tasks and a new multi-modal dataset with visual inputs. The underlying reason is that different models exhibit locally consistent performance patterns, so nearby points in embedding space tend to favor the same model. Because the approach is non-parametric, it reaches strong decisions with fewer labeled examples than learned alternatives. The authors also release standardized benchmarks to make future comparisons reproducible and to highlight the value of checking basic methods first.

Core claim

A well-tuned k-Nearest Neighbors (kNN) approach not only matches but often outperforms state-of-the-art learned routers across diverse tasks. The locality properties of model performance in embedding space enable simple non-parametric methods to achieve strong routing decisions with lower sample complexity than parametric approaches.

What carries the argument

k-nearest neighbors lookup in an input embedding space that retrieves the model which performed best on the most similar previous examples.
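
The lookup described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `knn_route`, the cosine-similarity metric, the mean aggregation over neighbors, and the default `k=5` are all assumptions made for the example.

```python
import numpy as np

def knn_route(query_emb, train_embs, train_scores, k=5):
    """Pick a model for one query from its k nearest training prompts.

    query_emb:    (d,) embedding of the new prompt
    train_embs:   (n, d) embeddings of prompts with known outcomes
    train_scores: (n, n_models) observed score of each model on each prompt
    Returns (best_model_index, estimated_scores).
    """
    # Cosine similarity (assumed metric): normalize, then take dot products.
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ q
    # Indices of the k most similar training prompts.
    k = min(k, len(train_embs))
    neighbors = np.argsort(-sims)[:k]
    # Estimate each model's score as the mean over the neighbors (assumed rule).
    est = train_scores[neighbors].mean(axis=0)
    return int(est.argmax()), est
```

Because nothing is trained, adding a newly evaluated prompt only means appending a row to `train_embs` and `train_scores`.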

If this is right

kNN routers can achieve competitive or superior accuracy on instruction-following, question-answering, reasoning, and multi-modal tasks.
Non-parametric routing decisions require lower sample complexity than training parametric learned routers.
Standardized benchmarks spanning text and visual inputs allow systematic comparison of routing strategies.
Thorough evaluation of simple baselines should precede adoption of more complex routing architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If locality holds, routing systems could be maintained by periodically adding new performance evaluations to a lookup table instead of retraining neural routers.
The same embedding-based locality idea might transfer to routing decisions in other multi-model AI systems such as vision or code models.
Choosing or fine-tuning the embedding model itself could become a key lever for improving kNN routing quality without adding parametric complexity.

Load-bearing premise

The embedding space used for nearest-neighbor lookup must reflect the input features that actually determine which model will perform best on new queries.

What would settle it

A result on the released benchmarks showing that, with identical embeddings and comparable training data, a learned router consistently selects higher-performing models than the best-tuned kNN across multiple tasks.

Figures

Figures reproduced from arXiv: 2505.12601 by Yang Li.

**Figure 1.** As the embedding distance between prompt pairs increases, the agreement between their model performance scores decreases, demonstrating the locality property in the prompt-performance space.

In this section, we develop a theoretical framework to explain why simple kNN-based routers often match or outperform more complex learned routers. Our analysis addresses an important question: under what conditions …

Original abstract

As large language models (LLMs) grow in scale and specialization, routing--selecting the best model for a given input--has become essential for efficient and effective deployment. While recent methods rely on complex learned routing strategies, their dependence on disparate training data and evaluation setups makes comparison and generalization difficult. In this work, we revisit LLM routing through the lens of simplicity. We show that a well-tuned k-Nearest Neighbors (kNN) approach not only matches but often outperforms state-of-the-art learned routers across diverse tasks. To support systematic evaluation, we introduce a suite of standardized routing benchmarks spanning instruction-following, question-answering, and reasoning tasks, as well as the first multi-modal routing dataset involving visual inputs. Our findings reveal that the locality properties of model performance in embedding space enable simple non-parametric methods to achieve strong routing decisions with lower sample complexity than parametric approaches. This challenges the prevailing trend toward sophisticated architectures and highlights the importance of thoroughly evaluating simple baselines before investing in complex solutions. To support reproducibility and further exploration, we will release all benchmarks and code upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

kNN beats learned routers on the new benchmarks mainly because the embedding space already separates the tasks well, but the paper needs to show this isn't just an artifact of their encoder choice.

The main point is that a simple kNN router matches or beats current learned routers on instruction, QA, reasoning, and a new multi-modal dataset, while using less training data. They also release standardized benchmarks, which is the most immediately useful part of the work. That combination of new evaluation resources and the empirical result stands out from prior routing papers that used mismatched setups. The locality argument in embedding space is plausible and explains why non-parametric methods can win here without heavy parameterization. Credit to the authors for testing across task types and including visual inputs, which few routing papers have done.

The soft spot is the embedding itself. The gains could disappear if a different encoder is used or on inputs where the geometry does not track actual model performance differences. The abstract mentions locality properties but does not appear to include embedding ablations or robustness checks on out-of-distribution cases, so the central claim rests on one feature extractor working out. Details on exact k selection, distance metric, and statistical significance of the wins would also help.

This paper is for groups building multi-model serving systems who want lower training overhead and reproducible baselines. A reader working on routing or efficient inference gets concrete numbers and new test sets to try. It is coherent on its own terms and engages the literature directly rather than fitting noise. I would bring it to a reading group to discuss the embedding question and would cite the benchmark release. It deserves peer review because the new resources and the practical result are worth checking even if revisions are needed on the robustness side.

Referee Report

3 major / 2 minor

Summary. The paper claims that a well-tuned k-Nearest Neighbors (kNN) router not only matches but often outperforms state-of-the-art learned routers for selecting the best LLM on a given input. It introduces a suite of standardized benchmarks covering instruction-following, question-answering, and reasoning tasks plus the first multi-modal routing dataset with visual inputs, and attributes the success of the simple non-parametric method to locality properties of model performance in embedding space, which also yields lower sample complexity than parametric routers.

Significance. If the results hold under rigorous controls, the work is significant because it supplies standardized benchmarks and a multi-modal dataset that the community can use for future comparisons, while providing concrete evidence that simple baselines, when thoroughly evaluated, can outperform the complex learned routers the field currently favors. The planned release of code and benchmarks is a clear strength for reproducibility.

major comments (3)

[§4.2 and Table 3] The reported kNN gains over learned routers are presented without an ablation across alternative embedding models (e.g., different sentence transformers or contrastive encoders). Because the central claim rests on the locality properties of model performance in the chosen embedding space, the absence of this check leaves open the possibility that superiority is an artifact of the particular encoder rather than a general property.
[§5.1, Multi-modal dataset] The description of how visual inputs are embedded for nearest-neighbor lookup is too brief to verify that the embedding geometry aligns with actual performance differences across models. If the multi-modal encoder does not separate inputs according to which LLM performs best, the kNN advantage claimed for this new dataset would not follow from the locality argument.
[§4.3, Statistical reporting] The performance tables lack error bars, standard deviations across seeds, or statistical significance tests for the accuracy differences versus learned routers. Without these, it is impossible to determine whether the observed outperformance is reliable or could be explained by benchmark variance.

minor comments (2)

[Abstract] The abstract states that kNN achieves 'strong routing decisions with lower sample complexity' but does not quantify sample complexity (e.g., number of labeled examples needed for convergence) in the main text or appendix.
[§3] Notation for the distance metric and value of k is introduced inconsistently between the method section and the experimental setup; a single consolidated definition would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and indicate the revisions we plan to make in response.

Referee: [§4.2 and Table 3] The reported kNN gains over learned routers are presented without an ablation across alternative embedding models (e.g., different sentence transformers or contrastive encoders). Because the central claim rests on the locality properties of model performance in the chosen embedding space, the absence of this check leaves open the possibility that superiority is an artifact of the particular encoder rather than a general property.

Authors: We agree with the referee that demonstrating the robustness of our findings across different embedding models would better support the generality of the locality argument. Accordingly, we will add an ablation study in the revised manuscript that evaluates kNN performance using several alternative embedding models, including different sentence transformers and contrastive encoders. This will help confirm that the observed advantages are not specific to the encoder used in the original submission.

Revision: yes

Referee: [§5.1, Multi-modal dataset] The description of how visual inputs are embedded for nearest-neighbor lookup is too brief to verify that the embedding geometry aligns with actual performance differences across models. If the multi-modal encoder does not separate inputs according to which LLM performs best, the kNN advantage claimed for this new dataset would not follow from the locality argument.

Authors: We acknowledge that the current description in §5.1 is concise and may not provide sufficient detail for verification. In the revision, we will expand this section to include a more thorough explanation of the visual input embedding process, the specific multi-modal encoder employed, and any supporting analysis or visualizations that illustrate the alignment between the embedding geometry and the performance differences across LLMs.

Revision: yes

Referee: [§4.3, Statistical reporting] The performance tables lack error bars, standard deviations across seeds, or statistical significance tests for the accuracy differences versus learned routers. Without these, it is impossible to determine whether the observed outperformance is reliable or could be explained by benchmark variance.

Authors: We concur that the inclusion of statistical reporting would enhance the credibility of our results. We will update the performance tables in the revised manuscript to include error bars (standard deviations across multiple random seeds) and conduct statistical significance tests, such as paired t-tests or Wilcoxon tests, to assess whether the differences in accuracy are statistically significant.

Revision: yes
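
To make the kind of paired check promised in the last response concrete, here is a hedged sketch. It swaps in a paired sign-flip permutation test in place of the paired t-test or Wilcoxon test the authors name, so that the example needs only NumPy. The function name `paired_signflip_pvalue` and its parameters are hypothetical, and the inputs are assumed to be per-seed accuracies of two routers evaluated on the same seeds.

```python
import numpy as np

def paired_signflip_pvalue(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired sign-flip permutation test on per-seed scores.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so we flip signs at random and count how often the mean
    difference is at least as extreme as the observed one.
    """
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    resampled = np.abs((signs * diffs).mean(axis=1))
    # Small tolerance so exact ties are counted despite float rounding.
    count = int((resampled >= observed - 1e-12).sum())
    # Add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_resamples + 1)
```

A small p-value suggests the accuracy gap between the two routers is unlikely to come from seed-to-seed variance alone; it says nothing about whether the gap would survive a change of encoder.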

Circularity Check

0 steps flagged

No circularity: purely empirical comparison on external benchmarks

The paper advances an empirical claim that a tuned kNN router matches or exceeds learned routers on instruction-following, QA, reasoning, and a new multi-modal dataset. No equations, fitted parameters, or self-citations are used to derive the result; performance differences are measured directly against held-out test sets and external baselines. The locality observation is reported as an outcome of the experiments rather than an input assumption that forces the conclusion. The derivation chain is therefore self-contained against external benchmarks and contains no self-definitional, fitted-input, or self-citation reductions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical observation of performance locality in embedding space and on the construction of new benchmarks; k is treated as a tunable hyperparameter rather than a learned parameter.

free parameters (1)

k (number of neighbors)
Hyperparameter selected to optimize routing accuracy on the evaluation benchmarks.

axioms (1)

domain assumption: Model performance exhibits locality in the chosen embedding space
Invoked to explain why nearest-neighbor lookup produces reliable routing decisions.

pith-pipeline@v0.9.0 · 5714 in / 1328 out tokens · 73259 ms · 2026-05-22T15:08:35.467191+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Definition 1 (δ-Locality) ... $d(x_1, x_2) < \delta \implies |u(x_1, m) - u(x_2, m)| < \epsilon(\delta)$
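
One way to probe the δ-locality condition on real data is to measure the largest utility gap among all prompt pairs closer than δ, which gives an empirical stand-in for ε(δ). The sketch below is illustrative only: the function name `empirical_epsilon` is hypothetical, and Euclidean distance is an assumed choice for d, which the quoted passage does not specify.

```python
import numpy as np

def empirical_epsilon(embs, utils, delta):
    """Largest |u(x1, m) - u(x2, m)| over prompt pairs with d(x1, x2) < delta.

    embs:  (n, d) prompt embeddings
    utils: (n, n_models) utility u(x, m) of each model on each prompt
    Returns 0.0 when no pair is closer than delta (the condition holds vacuously).
    """
    # Pairwise Euclidean distances (assumed metric for d).
    dist = np.linalg.norm(embs[:, None, :] - embs[None, :, :], axis=-1)
    # Each unordered pair once, excluding a prompt paired with itself.
    i, j = np.triu_indices(len(embs), k=1)
    close = dist[i, j] < delta
    if not close.any():
        return 0.0
    gaps = np.abs(utils[i] - utils[j])[close]  # shape (n_close_pairs, n_models)
    return float(gaps.max())
```

If locality holds, the returned value should shrink as `delta` shrinks; a value that stays large at small `delta` would indicate that the embedding does not track what drives model performance.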

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 11 internal anchors

[1] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[2] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
[3] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[4] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[5] Clovis Varangot-Reille, Christophe Bouvard, Antoine Gourru, Mathieu Ciancone, Marion Schaeffer, and François Jacquenet. Doing more with less–implementing routing strategies in large language model-based systems: An extended survey. arXiv preprint arXiv:2502.00409, 2025.
[6] Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, and Philip S Yu. Harnessing multiple large language models: A survey on LLM ensemble. arXiv preprint arXiv:2502.18036, 2025.
[7] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs from preference data. In The Thirteenth International Conference on Learning Representations, 2024.
[8] Shuhao Chen, Weisen Jiang, Baijiong Lin, James Kwok, and Yu Zhang. RouterDC: Query-based router by dual contrastive learning for assembling large language models. Advances in Neural Information Processing Systems, 37:66305–66328, 2024.
[9] Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models. arXiv preprint arXiv:2311.08692, 2023.
[10] Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, and Chaoyang He. TensorOpera router: A multi-model router for efficient LLM inference. arXiv preprint arXiv:2408.12320, 2024.
[11] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing. arXiv preprint arXiv:2404.14618, 2024.
[12] Tao Feng, Yanzhen Shen, and Jiaxuan You. GraphRouter: A graph-based router for LLM selections. arXiv preprint arXiv:2410.03834, 2024.
[13] Quang H Nguyen, Duy C Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V Chawla, and Khoa D Doan. MetaLLM: A high-performant and cost-efficient dynamic framework for wrapping LLMs. arXiv preprint arXiv:2407.10834, 2024.
[14] Yang Li. LLM bandit: Cost-efficient LLM generation via preference-conditioned dynamic routing. arXiv preprint arXiv:2502.02743, 2025.
[15] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023.
[16] Xiaoding Lu, Zongyi Liu, Adian Liusie, Vyas Raina, Vineet Mudupalli, Yuwen Zhang, and William Beauchamp. Blending is all you need: Cheaper, better alternative to trillion-parameters LLM. arXiv preprint arXiv:2401.02994, 2024.
[17] Hongyi Wang, Felipe Maia Polo, Yuekai Sun, Souvik Kundu, Eric Xing, and Mikhail Yurochkin. Fusing models with complementary expertise. arXiv preprint arXiv:2310.01542, 2023.
[18] Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.
[19] Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, et al. AutoMix: Automatically mixing language models. Advances in Neural Information Processing Systems, 37:131000–131034, 2024.
[20] Guillem Ramírez, Alexandra Birch, and Ivan Titov. Optimising calls to large language models with uncertainty-based two-tier selection. arXiv preprint arXiv:2405.02134, 2024.
[21] Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin. Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789, 2023.
[22] Surya Narayanan Hari and Matt Thomson. Tryage: Real-time, intelligent routing of user prompts to large language models. arXiv preprint arXiv:2308.11601, 2023.
[23] Marija Šakota, Maxime Peyrard, and Robert West. Fly-swat or cannon? Cost-effective language model choice via meta-modeling. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 606–615, 2024.
[24] Zesen Zhao, Shuowei Jin, and Z Morley Mao. Eagle: Efficient training-free router for multi-LLM inference. arXiv preprint arXiv:2409.15518, 2024.
[25] Alireza Mohammadshahi, Arshad Rafiq Shaikh, and Majid Yazdani. Routoo: Learning to route to large language models effectively. arXiv preprint arXiv:2401.13979, 2024.
[26] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. RouterBench: A benchmark for multi-LLM routing system. arXiv preprint arXiv:2403.12031, 2024.
[27] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
[28] Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open LLM leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024.
[29] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
[30] Tony Lee, Haoqin Tu, Chi Heem Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin Roberts, Michihiro Yasunaga, Huaxiu Yao, Cihang Xie, et al. VHELM: A holistic evaluation of vision language models. Advances in Neural Information Processing Systems, 37:140632–140666, 2024.
[31] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
[32] Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. VLM2Vec: Training vision-language models for massive multimodal embedding tasks. arXiv preprint arXiv:2410.05160, 2024.
[33] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023.
[34] Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, and Martin Vechev. Exploiting LLM quantization. arXiv preprint arXiv:2405.18137, 2024.
[35] Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009(1):421425, 2009.
[36] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, pages 285–295, 2001.
[37] Arnaud De Myttenaere, Bénédicte Le Grand, Boris Golden, and Fabrice Rossi. Reducing offline evaluation bias in recommendation systems. arXiv preprint arXiv:1407.0822, 2014.
[38] Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. SFR-Embedding-Mistral: Enhance text retrieval with transfer learning. Salesforce AI Research Blog, 2024.
[39] Andrei Nikolaevich Kolmogorov and Vladimir Mikhailovich Tikhomirov. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3–86, 1959.
[40] Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5(Jun):669–695, 2004.
[41] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[42] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
[43] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017.
[44] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299. PMLR, 2018.
[45] For a given query $x$, we obtain predicted utility scores $\hat{u}(x, m) = \hat{s}(x, m) - \lambda \times \hat{c}(x, m)$ for each model $m \in M$ across various values of $\lambda$.
[46] For each $\lambda$ value, we select the model with the highest predicted utility: $m_\lambda = \arg\max_{m \in M} \hat{u}(x, m)$.
[47] We plot the actual performance-cost pairs $(c(x, m_\lambda), s(x, m_\lambda))$ in the cost-performance space.
[48] We compute the non-decreasing convex hull of these points to obtain the Pareto-optimal frontier.
[49] The AUC is calculated as the area under this frontier, normalized so that the maximum score is 100 and the maximum cost is 1. This approach ensures that routers are evaluated on their ability to make optimal trade-offs across the entire spectrum of cost-performance preferences.
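
The evaluation steps quoted in entries [45] to [49] (utility scoring per λ, model selection, frontier construction, area under the frontier) can be sketched as follows. This is an illustrative reading, not the authors' code. The names `route_points` and `frontier_auc` are hypothetical, and several details are assumptions the passage does not state: scores and costs are averaged over queries for each λ, raw scores lie in [0, 1], costs are divided by a caller-supplied `max_cost`, the frontier is extended flat to the maximum cost, and no area is counted to the left of the cheapest point.

```python
import numpy as np

def route_points(pred_score, pred_cost, true_score, true_cost, lambdas):
    """Average (cost, score) achieved by utility routing, one point per lambda.

    All arrays have shape (n_queries, n_models).
    """
    rows = np.arange(pred_score.shape[0])
    points = []
    for lam in lambdas:
        utility = pred_score - lam * pred_cost   # predicted utility per model
        choice = utility.argmax(axis=1)          # selected model per query
        points.append((float(true_cost[rows, choice].mean()),
                       float(true_score[rows, choice].mean())))
    return points

def _cross(o, a, b):
    # z-component of (a - o) x (b - o); negative means a right turn at a.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def frontier_auc(points, max_cost):
    """Area under the non-decreasing upper convex hull of (cost, score) points."""
    # Normalize: cost to [0, 1], score to [0, 100] (assumes raw scores in [0, 1]).
    pts = sorted({(c / max_cost, 100.0 * s) for c, s in points})
    hull = []
    for p in pts:
        # Upper hull: drop the last point unless the path turns right at it.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    # Keep the rising part only, then extend flat to the maximum cost.
    peak = max(range(len(hull)), key=lambda i: hull[i][1])
    hull = hull[:peak + 1]
    if hull[-1][0] < 1.0:
        hull.append((1.0, hull[-1][1]))
    # Trapezoidal area between consecutive frontier points.
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(hull, hull[1:]))
```

Under these assumptions a router that reaches the maximum score at zero cost would score 100, so values are comparable across routers evaluated with the same `max_cost`.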

[1] [1]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] Clovis Varangot-Reille, Christophe Bouvard, Antoine Gourru, Mathieu Ciancone, Marion Schaeffer, and François Jacquenet. Doing more with less–implementing routing strategies in large language model-based systems: An extended survey. arXiv preprint arXiv:2502.00409, 2025.

[6] Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, and Philip S Yu. Harnessing multiple large language models: A survey on LLM ensemble. arXiv preprint arXiv:2502.18036, 2025.

[7] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs from preference data. In The Thirteenth International Conference on Learning Representations, 2024.

[8] Shuhao Chen, Weisen Jiang, Baijiong Lin, James Kwok, and Yu Zhang. RouterDC: Query-based router by dual contrastive learning for assembling large language models. Advances in Neural Information Processing Systems, 37:66305–66328, 2024.

[9] Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models. arXiv preprint arXiv:2311.08692, 2023.

[10] Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, and Chaoyang He. TensorOpera Router: A multi-model router for efficient LLM inference. arXiv preprint arXiv:2408.12320, 2024.

[11] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing. arXiv preprint arXiv:2404.14618, 2024.

[12] Tao Feng, Yanzhen Shen, and Jiaxuan You. GraphRouter: A graph-based router for LLM selections. arXiv preprint arXiv:2410.03834, 2024.

[13] Quang H Nguyen, Duy C Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V Chawla, and Khoa D Doan. MetaLLM: A high-performant and cost-efficient dynamic framework for wrapping LLMs. arXiv preprint arXiv:2407.10834, 2024.

[14] Yang Li. LLM bandit: Cost-efficient LLM generation via preference-conditioned dynamic routing. arXiv preprint arXiv:2502.02743, 2025.

[15] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023.

[16] Xiaoding Lu, Zongyi Liu, Adian Liusie, Vyas Raina, Vineet Mudupalli, Yuwen Zhang, and William Beauchamp. Blending is all you need: Cheaper, better alternative to trillion-parameters LLM. arXiv preprint arXiv:2401.02994, 2024.

[17] Hongyi Wang, Felipe Maia Polo, Yuekai Sun, Souvik Kundu, Eric Xing, and Mikhail Yurochkin. Fusing models with complementary expertise. arXiv preprint arXiv:2310.01542, 2023.

[18] Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.

[19] Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, et al. AutoMix: Automatically mixing language models. Advances in Neural Information Processing Systems, 37:131000–131034, 2024.

[20] Guillem Ramírez, Alexandra Birch, and Ivan Titov. Optimising calls to large language models with uncertainty-based two-tier selection. arXiv preprint arXiv:2405.02134, 2024.

[21] Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin. Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789, 2023.

[22] Surya Narayanan Hari and Matt Thomson. Tryage: Real-time, intelligent routing of user prompts to large language models. arXiv preprint arXiv:2308.11601, 2023.

[23] Marija Šakota, Maxime Peyrard, and Robert West. Fly-swat or cannon? Cost-effective language model choice via meta-modeling. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 606–615, 2024.

[24] Zesen Zhao, Shuowei Jin, and Z Morley Mao. Eagle: Efficient training-free router for multi-LLM inference. arXiv preprint arXiv:2409.15518, 2024.

[25] Alireza Mohammadshahi, Arshad Rafiq Shaikh, and Majid Yazdani. Routoo: Learning to route to large language models effectively. arXiv preprint arXiv:2401.13979, 2024.

[26] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. RouterBench: A benchmark for multi-LLM routing system. arXiv preprint arXiv:2403.12031, 2024.

[27] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.

[28] Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open LLM leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024.

[29] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.

[30] Tony Lee, Haoqin Tu, Chi Heem Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin Roberts, Michihiro Yasunaga, Huaxiu Yao, Cihang Xie, et al. VHELM: A holistic evaluation of vision language models. Advances in Neural Information Processing Systems, 37:140632–140666, 2024.

[31] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

[32] Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. VLM2Vec: Training vision-language models for massive multimodal embedding tasks. arXiv preprint arXiv:2410.05160, 2024.

[33] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023.

[34] Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, and Martin Vechev. Exploiting LLM quantization. arXiv preprint arXiv:2405.18137, 2024.

[35] Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009(1):421425, 2009.

[36] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, pages 285–295, 2001.

[37] Arnaud De Myttenaere, Bénédicte Le Grand, Boris Golden, and Fabrice Rossi. Reducing offline evaluation bias in recommendation systems. arXiv preprint arXiv:1407.0822, 2014.

[38] Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. SFR-Embedding-Mistral: Enhance text retrieval with transfer learning. Salesforce AI Research Blog, 2024.

[39] Andrei Nikolaevich Kolmogorov and Vladimir Mikhailovich Tikhomirov. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3–86, 1959.

[40] Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5(Jun):669–695, 2004.

[41] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

[42] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.

[43] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017.

[44] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299. PMLR, 2018.

A Additional Related Works

Beyond the core routing approaches discussed in Section 2, several other research directions are relevant to our investigation of LLM routing mechanisms. LLM In...

1. For a given query $x$, we obtain predicted utility scores $\hat{u}(x, m) = \hat{s}(x, m) - \lambda \times \hat{c}(x, m)$ for each model $m \in M$ across various values of $\lambda$.

2. For each $\lambda$ value, we select the model with the highest predicted utility: $m_\lambda = \arg\max_{m \in M} \hat{u}(x, m)$.

3. We plot the actual performance-cost pairs $(c(x, m_\lambda), s(x, m_\lambda))$ in the cost-performance space.

4. We compute the non-decreasing convex hull of these points to obtain the Pareto-optimal frontier.
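The frontier-and-AUC procedure in the steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The function names (`route_over_lambdas`, `upper_hull`, `frontier_auc`) are hypothetical. The sketch also assumes that each λ contributes one point obtained by averaging realized cost and score over all queries, that scores lie in [0, `max_score`], that the frontier is extended flat from its best point to the maximum cost, and that the region cheaper than the cheapest frontier point contributes no area.

```python
import numpy as np


def route_over_lambdas(pred_score, pred_cost, true_score, true_cost, lambdas):
    """Steps 1-3: one realized (mean cost, mean score) point per lambda.

    All arrays have shape (n_queries, n_models). Averaging over queries is an
    assumption of this sketch.
    """
    rows = np.arange(pred_score.shape[0])
    points = []
    for lam in lambdas:
        utility = pred_score - lam * pred_cost   # predicted utility per model
        choice = utility.argmax(axis=1)          # best model for each query
        points.append((true_cost[rows, choice].mean(),
                       true_score[rows, choice].mean()))
    return points


def _cross(o, a, b):
    """2D cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(points):
    """Step 4: non-decreasing upper convex hull of (cost, score) points."""
    best = {}
    for c, s in points:                          # best score at each cost
        best[c] = max(s, best.get(c, float("-inf")))
    hull = []
    for p in sorted(best.items()):
        # Drop the last vertex while it lies on or below the chord to p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    # Keep only the rising part: cut the hull at its highest-scoring vertex.
    top = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: top + 1]


def frontier_auc(points, max_cost, max_score=1.0):
    """Area under the frontier with cost scaled to [0, 1] and score to [0, 100]."""
    hull = upper_hull(points)
    cost = np.array([c for c, _ in hull], dtype=float) / max_cost
    score = np.array([s for _, s in hull], dtype=float) / max_score * 100.0
    if cost[-1] < 1.0:                           # flat extension (assumption)
        cost = np.append(cost, 1.0)
        score = np.append(score, score[-1])
    # Trapezoidal rule written out explicitly.
    return float(np.sum(np.diff(cost) * (score[1:] + score[:-1]) / 2.0))
```

A typical call would pass the router's predicted score and cost matrices together with the held-out ground-truth matrices, then feed the resulting points to `frontier_auc` with `max_cost` set to the cost of the most expensive model. How the left edge of the frontier is handled changes the absolute AUC value, so that choice should be fixed when comparing routers.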

The AUC is calculated as the area under this frontier, normalized so that the maximum score is 100 and the maximum cost is 1. This approach ensures that routers are evaluated on their ability to make optimal trade-offs across the entire spectrum of cost-performance preferences.

B.4 Data Splits and Reproducibility

To ensure reproducible evaluation, we u...