Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers
Pith reviewed 2026-05-22 15:08 UTC · model grok-4.3
The pith
A well-tuned k-nearest neighbors method often matches or beats complex learned routers when selecting the best LLM for a given input.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A well-tuned k-Nearest Neighbors (kNN) approach not only matches but often outperforms state-of-the-art learned routers across diverse tasks. The locality properties of model performance in embedding space enable simple non-parametric methods to achieve strong routing decisions with lower sample complexity than parametric approaches.
What carries the argument
A k-nearest-neighbors lookup in an input embedding space: for a new query, it retrieves the most similar previous examples and selects the model that performed best on them.
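A minimal sketch of this kind of lookup may help make the mechanism concrete. It is not the paper's implementation: the cosine distance, the mean-score aggregation over neighbors, the function names, and the toy two-dimensional "embeddings" with made-up scores are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_route(query_emb, support, k=3):
    """Pick the model with the highest mean score over the k nearest examples.

    support: list of (embedding, {model_name: observed_score}) pairs.
    Assumes every support example has a score for every model.
    """
    neighbors = sorted(support, key=lambda item: cosine_distance(query_emb, item[0]))[:k]
    totals = defaultdict(float)
    for _, scores in neighbors:
        for model, score in scores.items():
            totals[model] += score
    # Iterate in sorted name order so ties resolve deterministically.
    return max(sorted(totals), key=lambda m: totals[m] / len(neighbors))


# Hypothetical toy support set: 2-D "embeddings" and invented scores.
support = [
    ([1.0, 0.0], {"small": 0.9, "large": 0.8}),
    ([0.9, 0.1], {"small": 0.8, "large": 0.7}),
    ([0.0, 1.0], {"small": 0.2, "large": 0.9}),
    ([0.1, 0.9], {"small": 0.3, "large": 0.8}),
]

route_a = knn_route([1.0, 0.05], support, k=2)  # query near the first cluster
route_b = knn_route([0.05, 1.0], support, k=2)  # query near the second cluster
```

In this toy setup the two queries are routed to different models because their nearest neighbors favor different models, which is the locality behavior the argument depends on.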
If this is right
- kNN routers can achieve competitive or superior accuracy on instruction-following, question-answering, reasoning, and multi-modal tasks.
- Non-parametric routing decisions require lower sample complexity than training parametric learned routers.
- Standardized benchmarks spanning text and visual inputs allow systematic comparison of routing strategies.
- Thorough evaluation of simple baselines should precede adoption of more complex routing architectures.
Where Pith is reading between the lines
- If locality holds, routing systems could be maintained by periodically adding new performance evaluations to a lookup table instead of retraining neural routers.
- The same embedding-based locality idea might transfer to routing decisions in other multi-model AI systems such as vision or code models.
- Choosing or fine-tuning the embedding model itself could become a key lever for improving kNN routing quality without adding parametric complexity.
Load-bearing premise
The embedding space used for nearest-neighbor lookup must reflect the input features that actually determine which model will perform best on new queries.
What would settle it
A result on the released benchmarks showing that, with identical embeddings and comparable training data, a learned router consistently selects higher-performing models than the best-tuned kNN across multiple tasks.
Original abstract
As large language models (LLMs) grow in scale and specialization, routing--selecting the best model for a given input--has become essential for efficient and effective deployment. While recent methods rely on complex learned routing strategies, their dependence on disparate training data and evaluation setups makes comparison and generalization difficult. In this work, we revisit LLM routing through the lens of simplicity. We show that a well-tuned k-Nearest Neighbors (kNN) approach not only matches but often outperforms state-of-the-art learned routers across diverse tasks. To support systematic evaluation, we introduce a suite of standardized routing benchmarks spanning instruction-following, question-answering, and reasoning tasks, as well as the first multi-modal routing dataset involving visual inputs. Our findings reveal that the locality properties of model performance in embedding space enable simple non-parametric methods to achieve strong routing decisions with lower sample complexity than parametric approaches. This challenges the prevailing trend toward sophisticated architectures and highlights the importance of thoroughly evaluating simple baselines before investing in complex solutions. To support reproducibility and further exploration, we will release all benchmarks and code upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a well-tuned k-Nearest Neighbors (kNN) router not only matches but often outperforms state-of-the-art learned routers for selecting the best LLM on a given input. It introduces a suite of standardized benchmarks covering instruction-following, question-answering, and reasoning tasks plus the first multi-modal routing dataset with visual inputs, and attributes the success of the simple non-parametric method to locality properties of model performance in embedding space, which also yields lower sample complexity than parametric routers.
Significance. If the results hold under rigorous controls, the work is significant for two reasons: it supplies standardized benchmarks and a multi-modal dataset that the community can use for future comparisons, and it provides concrete evidence that a simple, thoroughly tuned baseline can outperform complex learned routers. The planned release of code and benchmarks is a clear strength for reproducibility.
major comments (3)
- [§4.2 and Table 3] The reported kNN gains over learned routers are presented without an ablation across alternative embedding models (e.g., different sentence transformers or contrastive encoders). Because the central claim rests on the locality properties of model performance in the chosen embedding space, the absence of this check leaves open the possibility that superiority is an artifact of the particular encoder rather than a general property.
- [§5.1, Multi-modal dataset] The description of how visual inputs are embedded for nearest-neighbor lookup is too brief to verify that the embedding geometry aligns with actual performance differences across models. If the multi-modal encoder does not separate inputs according to which LLM performs best, the kNN advantage claimed for this new dataset would not follow from the locality argument.
- [§4.3, Statistical reporting] The performance tables lack error bars, standard deviations across seeds, or statistical significance tests for the accuracy differences versus learned routers. Without these, it is impossible to determine whether the observed outperformance is reliable or could be explained by benchmark variance.
minor comments (2)
- [Abstract] The abstract states that kNN achieves 'strong routing decisions with lower sample complexity' but does not quantify sample complexity (e.g., number of labeled examples needed for convergence) in the main text or appendix.
- [§3] Notation for the distance metric and value of k is introduced inconsistently between the method section and the experimental setup; a single consolidated definition would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and indicate the revisions we plan to make in response.
Point-by-point responses
- Referee: [§4.2 and Table 3] The reported kNN gains over learned routers are presented without an ablation across alternative embedding models (e.g., different sentence transformers or contrastive encoders). Because the central claim rests on the locality properties of model performance in the chosen embedding space, the absence of this check leaves open the possibility that superiority is an artifact of the particular encoder rather than a general property.
  Authors: We agree with the referee that demonstrating the robustness of our findings across different embedding models would better support the generality of the locality argument. Accordingly, we will add an ablation study in the revised manuscript that evaluates kNN performance using several alternative embedding models, including different sentence transformers and contrastive encoders. This will help confirm that the observed advantages are not specific to the encoder used in the original submission.
  Revision: yes
- Referee: [§5.1, Multi-modal dataset] The description of how visual inputs are embedded for nearest-neighbor lookup is too brief to verify that the embedding geometry aligns with actual performance differences across models. If the multi-modal encoder does not separate inputs according to which LLM performs best, the kNN advantage claimed for this new dataset would not follow from the locality argument.
  Authors: We acknowledge that the current description in §5.1 is concise and may not provide sufficient detail for verification. In the revision, we will expand this section to include a more thorough explanation of the visual input embedding process, the specific multi-modal encoder employed, and any supporting analysis or visualizations that illustrate the alignment between the embedding geometry and the performance differences across LLMs.
  Revision: yes
- Referee: [§4.3, Statistical reporting] The performance tables lack error bars, standard deviations across seeds, or statistical significance tests for the accuracy differences versus learned routers. Without these, it is impossible to determine whether the observed outperformance is reliable or could be explained by benchmark variance.
  Authors: We concur that the inclusion of statistical reporting would enhance the credibility of our results. We will update the performance tables in the revised manuscript to include error bars (standard deviations across multiple random seeds) and conduct statistical significance tests, such as paired t-tests or Wilcoxon tests, to assess whether the differences in accuracy are statistically significant.
  Revision: yes
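To illustrate the kind of per-seed reporting discussed in the last exchange, here is a small standard-library sketch. It substitutes an exact paired sign-flip permutation test for the paired t-test or Wilcoxon test named in the response, purely so the example needs no external packages. The function names and the accuracy values are invented for illustration and are not results from the paper.

```python
import itertools
import statistics


def paired_summary(acc_a, acc_b):
    """Mean and sample standard deviation of per-seed accuracy differences."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    return statistics.mean(diffs), statistics.stdev(diffs)


def sign_flip_p_value(acc_a, acc_b):
    """Exact two-sided paired permutation (sign-flip) test on the summed difference.

    Enumerates all 2^n sign assignments, so it is only practical for a
    small number of seeds.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# Invented per-seed accuracies for two hypothetical routers (not paper results).
acc_knn = [0.71, 0.73, 0.70, 0.74, 0.72]
acc_learned = [0.69, 0.70, 0.69, 0.71, 0.70]

mean_diff, std_diff = paired_summary(acc_knn, acc_learned)
p_value = sign_flip_p_value(acc_knn, acc_learned)
```

With only five seeds the smallest achievable two-sided p-value under this test is 2/32, which is one practical reason to report the number of seeds alongside any significance claim.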
Circularity Check
No circularity: purely empirical comparison on external benchmarks
Full rationale
The paper advances an empirical claim that a tuned kNN router matches or exceeds learned routers on instruction-following, QA, reasoning, and a new multi-modal dataset. No equations, fitted parameters, or self-citations are used to derive the result; performance differences are measured directly against held-out test sets and external baselines. The locality observation is reported as an outcome of the experiments rather than an input assumption that forces the conclusion. The derivation chain is therefore self-contained against external benchmarks and contains no self-definitional, fitted-input, or self-citation reductions.
Axiom & Free-Parameter Ledger
free parameters (1)
- k (number of neighbors)
axioms (1)
- Domain assumption: model performance exhibits locality in the chosen embedding space.
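Since k is the one free parameter listed, a short sketch of how it might be tuned can be useful. The source does not describe the paper's tuning procedure, so the leave-one-out selection below, the Euclidean distance, the summed-score aggregation, and all names and toy values are assumptions.

```python
import math


def route(query, support, k):
    """Choose the model with the highest summed score over the k nearest examples."""
    neighbors = sorted(support, key=lambda item: math.dist(query, item[0]))[:k]
    models = sorted(neighbors[0][1])  # sorted names keep tie-breaking deterministic
    return max(models, key=lambda m: sum(scores[m] for _, scores in neighbors))


def loo_score(support, k):
    """Average realized score when each example is routed using all the others."""
    total = 0.0
    for i, (emb, scores) in enumerate(support):
        rest = support[:i] + support[i + 1:]
        total += scores[route(emb, rest, k)]
    return total / len(support)


def tune_k(support, candidates):
    """Return the candidate k with the best leave-one-out score (first wins on ties)."""
    return max(candidates, key=lambda k: loo_score(support, k))


# Hypothetical toy data: two clusters that favor different models.
cluster_a = {"small": 0.9, "large": 0.5}
cluster_b = {"small": 0.2, "large": 0.9}
support = [
    ((0.0, 0.0), cluster_a), ((0.1, 0.0), cluster_a), ((0.0, 0.1), cluster_a),
    ((1.0, 1.0), cluster_b), ((0.9, 1.0), cluster_b), ((1.0, 0.9), cluster_b),
]

best_k = tune_k(support, [1, 5])
```

In this toy example a large k pools neighbors from both clusters and washes out the local signal, so the smaller k wins; on real data the best value depends on the embedding and on how much labeled data is available.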
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Definition 1 (δ-Locality) ... d(x1, x2) < δ ⟹ |u(x1, m) − u(x2, m)| < ϵ(δ)
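The δ-locality condition quoted above can be probed empirically on a labeled sample. The sketch below is only an illustration: it assumes d is Euclidean distance (the quoted passage does not specify the metric), fixes a single model m, and uses invented points and utilities. The function name is hypothetical.

```python
import math


def max_utility_gap(points, delta):
    """Largest |u(x1, m) - u(x2, m)| over pairs with d(x1, x2) < delta.

    points: list of (embedding, utility) pairs for one fixed model m.
    Returns None when no pair of points is closer than delta.
    """
    worst = None
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (x1, u1), (x2, u2) = points[i], points[j]
            if math.dist(x1, x2) < delta:
                gap = abs(u1 - u2)
                worst = gap if worst is None else max(worst, gap)
    return worst


# Invented sample: two nearby inputs with similar utility, one distant input.
points = [((0.0, 0.0), 0.9), ((0.1, 0.0), 0.8), ((1.0, 1.0), 0.2)]

gap_small_delta = max_utility_gap(points, delta=0.5)  # only the nearby pair qualifies
gap_large_delta = max_utility_gap(points, delta=2.0)  # every pair qualifies
```

The returned gap acts as an empirical stand-in for ϵ(δ) on the sample: if it stays small for small δ and grows as δ grows, the sample is at least consistent with the locality assumption.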
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Additional Related Works
Beyond the core routing approaches discussed in Section 2, several other research directions are relevant to our investigation of LLM routing mechanisms. LLM In...
1. For a given query x, we obtain predicted utility scores û(x, m) = ŝ(x, m) − λ × ĉ(x, m) for each model m ∈ M across various values of λ.
2. For each λ value, we select the model with the highest predicted utility: m_λ = argmax_{m ∈ M} û(x, m).
3. We plot the actual performance-cost pairs (c(x, m_λ), s(x, m_λ)) in the cost-performance space.
4. We compute the non-decreasing convex hull of these points to obtain the Pareto-optimal frontier.
5. The AUC is calculated as the area under this frontier, normalized so that the maximum score is 100 and the maximum cost is 1.
This approach ensures that routers are evaluated on their ability to make optimal trade-offs across the entire spectrum of cost-performance preferences.
B.4 Data Splits and Reproducibility
To ensure reproducible evaluation, we u...
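Returning to the AUC procedure in the numbered steps above, the following sketch shows one way the steps could be wired together. Several details are assumptions rather than the paper's method: per-λ points are averaged over queries, the "non-decreasing convex hull" is read as the rising part of the upper convex hull, the frontier is extended flat to the maximum cost, and integration starts at the cheapest observed point. All function names and toy values are hypothetical.

```python
def select_model(pred_score, pred_cost, lam):
    """Steps 1-2: argmax over models of predicted utility s_hat - lam * c_hat."""
    return max(sorted(pred_score), key=lambda m: pred_score[m] - lam * pred_cost[m])


def sweep(queries, lambdas):
    """Step 3: one (mean actual cost, mean actual score) point per lambda."""
    points = []
    for lam in lambdas:
        chosen = [select_model(q["pred_score"], q["pred_cost"], lam) for q in queries]
        cost = sum(q["cost"][m] for q, m in zip(queries, chosen)) / len(queries)
        score = sum(q["score"][m] for q, m in zip(queries, chosen)) / len(queries)
        points.append((cost, score))
    return points


def upper_hull(points):
    """Upper convex hull of (x, y) points, left to right (monotone chain)."""
    hull = []
    for p in sorted(set(points)):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:  # last hull point lies on or below the new chord
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def frontier_auc(points, max_cost, max_score=1.0):
    """Steps 4-5: area under the rising upper hull, with cost in [0, 1], score in [0, 100]."""
    norm = [(c / max_cost, 100.0 * s / max_score) for c, s in points]
    hull = upper_hull(norm)
    best = max(range(len(hull)), key=lambda i: hull[i][1])
    frontier = hull[: best + 1]  # a concave hull is non-decreasing up to its peak
    if frontier[-1][0] < 1.0:
        frontier.append((1.0, frontier[-1][1]))  # assumed flat extension to max cost
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(frontier, frontier[1:]))


# Invented single-query example with two hypothetical models.
queries = [{
    "pred_score": {"small": 0.6, "large": 0.9},
    "pred_cost": {"small": 0.1, "large": 1.0},
    "score": {"small": 0.5, "large": 0.9},
    "cost": {"small": 0.1, "large": 1.0},
}]
points = sweep(queries, lambdas=[0.0, 1.0])
auc = frontier_auc(points, max_cost=1.0)
```

Because the hull step discards dominated points, a router is only rewarded for operating points that improve the cost-performance trade-off; where the integration starts and how the frontier is extended can change the absolute number, so those choices should be fixed across all routers being compared.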