arxiv: 2505.12601 · v2 · pith:7MEC6ORRnew · submitted 2025-05-19 · 💻 cs.LG

Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers

Pith reviewed 2026-05-22 15:08 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM routing · k-nearest neighbors · model selection · embedding locality · non-parametric methods · routing benchmarks · multi-modal routing

The pith

A well-tuned k-nearest neighbors method often matches or beats complex learned routers when selecting the best LLM for a given input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that routing an input to the most suitable large language model can be handled effectively by a simple k-nearest neighbors lookup in embedding space rather than by training parametric routers. This holds across instruction-following, question-answering, reasoning tasks and a new multi-modal dataset with visual inputs. The underlying reason is that different models exhibit locally consistent performance patterns, so nearby points in embedding space tend to favor the same model. Because the approach is non-parametric, it reaches strong decisions with fewer labeled examples than learned alternatives. The authors also release standardized benchmarks to make future comparisons reproducible and to highlight the value of checking basic methods first.

Core claim

A well-tuned k-Nearest Neighbors (kNN) approach not only matches but often outperforms state-of-the-art learned routers across diverse tasks. The locality properties of model performance in embedding space enable simple non-parametric methods to achieve strong routing decisions with lower sample complexity than parametric approaches.

What carries the argument

k-nearest neighbors lookup in an input embedding space that retrieves the model which performed best on the most similar previous examples.
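
As a rough illustration of this mechanism, the sketch below routes a query to the model with the highest mean score among its k nearest stored examples. The function names `cosine_similarity` and `knn_route`, the toy 2-D embeddings, the model names, and the choice of cosine similarity are all assumptions made for this example; the paper's actual encoder, distance metric, and scoring rule are not given on this page.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_route(query_emb, train_embs, train_scores, k=3):
    """Return the model with the highest mean score over the k nearest examples.

    train_scores[i] maps model name -> observed score on stored example i.
    """
    # Rank stored examples from most to least similar to the query.
    ranked = sorted(
        range(len(train_embs)),
        key=lambda i: cosine_similarity(query_emb, train_embs[i]),
        reverse=True,
    )
    neighbors = ranked[:k]
    models = train_scores[0].keys()
    mean_score = {
        m: sum(train_scores[i][m] for i in neighbors) / len(neighbors)
        for m in models
    }
    return max(mean_score, key=mean_score.get)

# Toy data (hypothetical): model_a does well near [1, 0], model_b near [0, 1].
train_embs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
              [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
train_scores = [
    {"model_a": 1.0, "model_b": 0.0},
    {"model_a": 1.0, "model_b": 0.0},
    {"model_a": 1.0, "model_b": 1.0},
    {"model_a": 0.0, "model_b": 1.0},
    {"model_a": 0.0, "model_b": 1.0},
    {"model_a": 1.0, "model_b": 1.0},
]

choice_near_a = knn_route([0.95, 0.05], train_embs, train_scores, k=3)
choice_near_b = knn_route([0.05, 0.95], train_embs, train_scores, k=3)
print(choice_near_a, choice_near_b)
```

Nothing is trained here: the "router" is just the stored embeddings and scores, which is what makes the approach non-parametric.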

If this is right

  • kNN routers can achieve competitive or superior accuracy on instruction-following, question-answering, reasoning, and multi-modal tasks.
  • Non-parametric routing decisions require lower sample complexity than training parametric learned routers.
  • Standardized benchmarks spanning text and visual inputs allow systematic comparison of routing strategies.
  • Thorough evaluation of simple baselines should precede adoption of more complex routing architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If locality holds, routing systems could be maintained by periodically adding new performance evaluations to a lookup table instead of retraining neural routers.
  • The same embedding-based locality idea might transfer to routing decisions in other multi-model AI systems such as vision or code models.
  • Choosing or fine-tuning the embedding model itself could become a key lever for improving kNN routing quality without adding parametric complexity.

Load-bearing premise

The embedding space used for nearest-neighbor lookup must reflect the input features that actually determine which model will perform best on new queries.
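
One way to probe this premise, in the spirit of Figure 1, is to bin prompt pairs by embedding distance and measure how often the two prompts share the same best model. The sketch below does this on synthetic data. The helper `agreement_by_distance`, the Euclidean metric, the bin edges, the use of best-model labels as the agreement signal, and the data are illustrative assumptions, not the paper's procedure.

```python
import itertools
import math

def agreement_by_distance(embs, best_model, bin_edges):
    """For each distance bin, return the fraction of prompt pairs
    whose best-performing model is the same (None for empty bins)."""
    n_bins = len(bin_edges) - 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for i, j in itertools.combinations(range(len(embs)), 2):
        d = math.dist(embs[i], embs[j])
        for b in range(n_bins):
            # Half-open bins [lo, hi); pairs outside all bins are ignored.
            if bin_edges[b] <= d < bin_edges[b + 1]:
                totals[b] += 1
                hits[b] += int(best_model[i] == best_model[j])
                break
    return [h / t if t else None for h, t in zip(hits, totals)]

# Synthetic example (hypothetical): two tight clusters, one per model.
embs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]]
best_model = ["model_a", "model_a", "model_a",
              "model_b", "model_b", "model_b"]

rates = agreement_by_distance(embs, best_model, bin_edges=[0.0, 0.5, 2.0])
print(rates)
```

If agreement does not fall as distance grows on real data, the embedding is probably not capturing what separates the models, and a kNN router built on it should not be expected to work well.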

What would settle it

A result on the released benchmarks showing that, with identical embeddings and comparable training data, a learned router consistently selects higher-performing models than the best-tuned kNN across multiple tasks.

Figures

Figures reproduced from arXiv: 2505.12601 by Yang Li.

Figure 1: As the embedding distance between prompt pairs increases, the agreement between their model performance scores decreases, demonstrating the locality property in the prompt-performance space.

In this section, we develop a theoretical framework to explain why simple kNN-based routers often match or outperform more complex learned routers. Our analysis addresses an important question: under what conditions …

Original abstract

As large language models (LLMs) grow in scale and specialization, routing--selecting the best model for a given input--has become essential for efficient and effective deployment. While recent methods rely on complex learned routing strategies, their dependence on disparate training data and evaluation setups makes comparison and generalization difficult. In this work, we revisit LLM routing through the lens of simplicity. We show that a well-tuned k-Nearest Neighbors (kNN) approach not only matches but often outperforms state-of-the-art learned routers across diverse tasks. To support systematic evaluation, we introduce a suite of standardized routing benchmarks spanning instruction-following, question-answering, and reasoning tasks, as well as the first multi-modal routing dataset involving visual inputs. Our findings reveal that the locality properties of model performance in embedding space enable simple non-parametric methods to achieve strong routing decisions with lower sample complexity than parametric approaches. This challenges the prevailing trend toward sophisticated architectures and highlights the importance of thoroughly evaluating simple baselines before investing in complex solutions. To support reproducibility and further exploration, we will release all benchmarks and code upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that a well-tuned k-Nearest Neighbors (kNN) router not only matches but often outperforms state-of-the-art learned routers for selecting the best LLM on a given input. It introduces a suite of standardized benchmarks covering instruction-following, question-answering, and reasoning tasks plus the first multi-modal routing dataset with visual inputs. It attributes the success of the simple non-parametric method to locality properties of model performance in embedding space, and argues that the same properties give kNN lower sample complexity than parametric routers.

Significance. If the results hold under rigorous controls, the work is significant for two reasons: it supplies standardized benchmarks and a multi-modal dataset that the community can use for future comparisons, and it provides concrete evidence that a carefully tuned simple baseline can outperform the complex learned routers that currently dominate the literature. The planned release of code and benchmarks is a clear strength for reproducibility.

major comments (3)
  1. [§4.2 and Table 3] §4.2 and Table 3: the reported kNN gains over learned routers are presented without an ablation across alternative embedding models (e.g., different sentence transformers or contrastive encoders). Because the central claim rests on the locality properties of model performance in the chosen embedding space, the absence of this check leaves open the possibility that superiority is an artifact of the particular encoder rather than a general property.
  2. [§5.1] §5.1 (Multi-modal dataset): the description of how visual inputs are embedded for nearest-neighbor lookup is too brief to verify that the embedding geometry aligns with actual performance differences across models. If the multi-modal encoder does not separate inputs according to which LLM performs best, the kNN advantage claimed for this new dataset would not follow from the locality argument.
  3. [§4.3] §4.3 (Statistical reporting): the performance tables lack error bars, standard deviations across seeds, or statistical significance tests for the accuracy differences versus learned routers. Without these, it is impossible to determine whether the observed outperformance is reliable or could be explained by benchmark variance.
minor comments (2)
  1. [Abstract] The abstract states that kNN achieves 'strong routing decisions with lower sample complexity' but does not quantify sample complexity (e.g., number of labeled examples needed for convergence) in the main text or appendix.
  2. [§3] Notation for the distance metric and value of k is introduced inconsistently between the method section and the experimental setup; a single consolidated definition would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and indicate the revisions we plan to make in response.

  1. Referee: [§4.2 and Table 3] §4.2 and Table 3: the reported kNN gains over learned routers are presented without an ablation across alternative embedding models (e.g., different sentence transformers or contrastive encoders). Because the central claim rests on the locality properties of model performance in the chosen embedding space, the absence of this check leaves open the possibility that superiority is an artifact of the particular encoder rather than a general property.

    Authors: We agree with the referee that demonstrating the robustness of our findings across different embedding models would better support the generality of the locality argument. Accordingly, we will add an ablation study in the revised manuscript that evaluates kNN performance using several alternative embedding models, including different sentence transformers and contrastive encoders. This will help confirm that the observed advantages are not specific to the encoder used in the original submission. revision: yes

  2. Referee: [§5.1] §5.1 (Multi-modal dataset): the description of how visual inputs are embedded for nearest-neighbor lookup is too brief to verify that the embedding geometry aligns with actual performance differences across models. If the multi-modal encoder does not separate inputs according to which LLM performs best, the kNN advantage claimed for this new dataset would not follow from the locality argument.

    Authors: We acknowledge that the current description in §5.1 is concise and may not provide sufficient detail for verification. In the revision, we will expand this section to include a more thorough explanation of the visual input embedding process, the specific multi-modal encoder employed, and any supporting analysis or visualizations that illustrate the alignment between the embedding geometry and the performance differences across LLMs. revision: yes

  3. Referee: [§4.3] §4.3 (Statistical reporting): the performance tables lack error bars, standard deviations across seeds, or statistical significance tests for the accuracy differences versus learned routers. Without these, it is impossible to determine whether the observed outperformance is reliable or could be explained by benchmark variance.

    Authors: We concur that the inclusion of statistical reporting would enhance the credibility of our results. We will update the performance tables in the revised manuscript to include error bars (standard deviations across multiple random seeds) and conduct statistical significance tests, such as paired t-tests or Wilcoxon tests, to assess whether the differences in accuracy are statistically significant. revision: yes
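
The kind of check promised in the third response can be sketched as follows. To stay dependency-free, this example swaps in an exact paired sign-flip permutation test in place of the paired t-test or Wilcoxon test named in the response. The per-seed accuracies are invented purely for illustration, and `paired_sign_flip_pvalue` is a hypothetical helper.

```python
import itertools
import statistics

def paired_sign_flip_pvalue(a, b):
    """Two-sided exact p-value for the mean paired difference.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so every sign pattern is enumerated and the fraction
    whose absolute mean is at least the observed one is returned.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs)) / len(diffs)
    extreme = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs))) / len(diffs)
        total += 1
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total

# Hypothetical accuracies over five random seeds (not from the paper).
knn_acc = [0.71, 0.73, 0.72, 0.74, 0.72]
router_acc = [0.69, 0.70, 0.71, 0.70, 0.70]

diffs = [x - y for x, y in zip(knn_acc, router_acc)]
mean_gain = statistics.mean(diffs)
std_gain = statistics.stdev(diffs)
p_value = paired_sign_flip_pvalue(knn_acc, router_acc)
print(f"gain = {mean_gain:.3f} +/- {std_gain:.3f}, p = {p_value:.4f}")
```

With five seeds the smallest attainable two-sided p-value for this test is 2/32 = 0.0625, so even a gain that is positive on every seed cannot reach the conventional 0.05 threshold; more seeds would be needed.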

Circularity Check

0 steps flagged

No circularity: purely empirical comparison on external benchmarks

The paper advances an empirical claim that a tuned kNN router matches or exceeds learned routers on instruction-following, QA, reasoning, and a new multi-modal dataset. No equations, fitted parameters, or self-citations are used to derive the result; performance differences are measured directly against held-out test sets and external baselines. The locality observation is reported as an outcome of the experiments rather than an input assumption that forces the conclusion. The derivation chain is therefore self-contained against external benchmarks and contains no self-definitional, fitted-input, or self-citation reductions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical observation of performance locality in embedding space and on the construction of new benchmarks; k is treated as a tunable hyperparameter rather than a learned parameter.

free parameters (1)
  • k (number of neighbors)
    Hyperparameter selected to optimize routing accuracy on the evaluation benchmarks.
axioms (1)
  • domain assumption: Model performance exhibits locality in the chosen embedding space
    Invoked to explain why nearest-neighbor lookup produces reliable routing decisions.
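
Since k is the one tunable quantity in the ledger, it may help to see how such a value could be chosen. One common option is to pick k on a held-out validation split, as in the minimal sketch below. The helpers `route` and `select_k`, the 1-D toy embeddings, the deliberately noisy training point, and the candidate grid are assumptions for illustration; the page does not say which split the paper tunes on.

```python
def route(query, train, k):
    """Pick the model with the highest total score among the k nearest points.

    train is a list of (embedding, {model: score}) pairs with 1-D embeddings.
    """
    neighbors = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    totals = {}
    for _, scores in neighbors:
        for model, s in scores.items():
            totals[model] = totals.get(model, 0.0) + s
    return max(totals, key=totals.get)

def select_k(train, val, candidates):
    """Return the candidate k with the highest mean realized validation score."""
    def mean_val_score(k):
        return sum(scores[route(x, train, k)] for x, scores in val) / len(val)
    return max(candidates, key=mean_val_score)

A = {"model_a": 1.0, "model_b": 0.0}  # model_a succeeds, model_b fails
B = {"model_a": 0.0, "model_b": 1.0}  # model_b succeeds, model_a fails

# Hypothetical training set: model_a region near 0, model_b region near 1,
# with one noisy point at 0.2 that favors model_b.
train = [(0.0, A), (0.1, A), (0.2, B), (0.3, A),
         (1.0, B), (1.1, B), (1.2, B)]
val = [(0.21, A), (1.05, B)]

best_k = select_k(train, val, candidates=[1, 3, 7])
print(best_k)
```

In this toy setup k=1 is misled by the noisy neighbor and k=7 collapses to the globally most frequent winner, so the intermediate value is selected.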

pith-pipeline@v0.9.0 · 5714 in / 1328 out tokens · 73259 ms · 2026-05-22T15:08:35.467191+00:00 · methodology
