pith. machine review for the scientific record.

arxiv: 2605.07112 · v1 · submitted 2026-05-08 · 💻 cs.AI · cs.MA

Recognition: no theorem link

Switchcraft: AI Model Router for Agentic Tool Calling

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:59 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords model routing · agentic AI · tool calling · cost optimization · DistilBERT · function calling · inference cost

The pith

Switchcraft routes each tool call to the cheapest model that can handle it correctly, matching the best single model's accuracy at 84% lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic AI systems that invoke external tools are powerful but costly because developers default to large models. Switchcraft introduces the first router built specifically for tool calling instead of chat. It uses a DistilBERT classifier to pick the lowest-cost model for each query while meeting a correctness threshold and latency limit. On five function-calling benchmarks the router reaches 82.9 percent accuracy, equal to or better than any one model, and cuts inference cost by 84 percent. This approach lets developers run capable agents without overspending on every query.

Core claim

Switchcraft is a model router optimized for agentic tool calling that selects the lowest-cost model subject to correctness. A DistilBERT classifier is trained on five function-calling benchmarks and deployed inline under a latency budget. It achieves 82.9 percent accuracy, matching or exceeding the strongest individual model, while reducing inference cost by 84 percent and saving over $3,600 per million queries. The evaluation also reveals that larger models do not always outperform smaller ones on tool-use tasks, and that nominally cheaper models can produce higher total cost through token-intensive reasoning.

What carries the argument

A DistilBERT classifier, trained on five function-calling benchmarks, that predicts for each inline tool-calling query which candidate models will answer correctly and selects the cheapest of them.
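
A minimal sketch of that selection rule, reconstructed from the thresholding procedure quoted in the reference-graph anchors [47]–[49] below and the latency filter described in the rebuttal. The model basket, per-query costs, profiled latencies, and the `predict_probs` interface are hypothetical stand-ins, not the paper's implementation:

```python
from typing import Callable

def route(query: str,
          predict_probs: Callable[[str], dict[str, float]],
          cost_per_query: dict[str, float],
          latency_ms: dict[str, float],
          latency_budget_ms: float = 1000.0,   # illustrative budget
          theta: float = 0.5) -> str:
    """Return the name of the cheapest model predicted correct for `query`.

    `predict_probs` stands in for the DistilBERT multi-label head: one
    sigmoid output per candidate model.
    """
    probs = predict_probs(query)
    # Filter out models whose profiled latency exceeds the budget
    # (the mechanism described in the simulated rebuttal).
    eligible = {m: p for m, p in probs.items()
                if latency_ms[m] <= latency_budget_ms}
    # Models predicted "reliable": sigmoid output at or above theta.
    reliable = [m for m, p in eligible.items() if p >= theta]
    if reliable:
        # Cheapest reliable model wins (cost-aware tie-breaking).
        return min(reliable, key=lambda m: cost_per_query[m])
    # No model clears the threshold: fall back to the most confident one.
    return max(eligible, key=eligible.get)
```

Under this rule a higher threshold sends more queries to the argmax fallback, trading cost savings for confidence, rather than forcing a borderline query onto an unreliable cheap model.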

If this is right

  • Larger models do not consistently outperform smaller ones on tool-use tasks.
  • Nominally cheaper models can incur higher total cost due to token-intensive reasoning.
  • Cost-aware agentic AI deployment becomes feasible without sacrificing correctness.
  • Savings exceed $3,600 per million queries while accuracy stays at or above 82.9 percent (arithmetic sketched below).
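
A back-of-envelope consistency check on the two headline numbers in that last bullet (an editorial computation, not a figure from the paper): if saving over $3,600 per million queries is what an 84 percent reduction amounts to, the implied single-model baseline is roughly $4,300 per million queries.

```python
savings_per_million = 3_600   # reported: "over $3,600 per million queries"
reduction = 0.84              # reported: 84% inference-cost reduction

baseline_per_million = savings_per_million / reduction            # ~= $4,286
routed_per_million = baseline_per_million - savings_per_million   # ~= $686
print(f"implied baseline ~ ${baseline_per_million:,.0f}/M queries, "
      f"routed ~ ${routed_per_million:,.0f}/M queries, "
      f"~ ${savings_per_million / 1_000_000:.4f} saved per query")
```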

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing logic could be tested on agent tasks that mix tool calls with open-ended reasoning.
  • Performance may change if production queries shift in style or complexity from the benchmark set.
  • Pairing the router with prompt-length controls could produce additional cost reductions beyond the reported 84 percent.

Load-bearing premise

The classifier trained on the five benchmarks will keep selecting correct low-cost models on new real-world agent queries without large accuracy drops or hidden cost increases.

What would settle it

Measure accuracy and total token cost when Switchcraft routes a fresh collection of diverse, real-world tool-calling queries outside the original five benchmarks.
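
A sketch of that settling experiment, under stated assumptions: `route_fn`, `run_model`, and `is_correct` are hypothetical stand-ins for the trained router, model execution, and the paper's AST-based correctness check, and `price_per_token` is an illustrative flat per-model rate.

```python
def evaluate_out_of_distribution(queries, route_fn, run_model, is_correct,
                                 price_per_token):
    """Accuracy and total dollar cost on queries outside the five benchmarks."""
    n_correct, total_cost = 0, 0.0
    for q in queries:
        model = route_fn(q.text)                  # inline routing decision
        response = run_model(model, q.text)       # emitted tool call(s)
        n_correct += is_correct(response, q.ground_truth)
        # Count prompt AND completion tokens, so token-intensive reasoning
        # by nominally cheap models shows up in the total.
        tokens = response.prompt_tokens + response.completion_tokens
        total_cost += tokens * price_per_token[model]
    return n_correct / len(queries), total_cost
```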

Figures

Figures reproduced from arXiv: 2605.07112 by Alec Wolman, Ankur Gupta, Pooria Namyar, Qizheng Zhang, Rahul Ambavat, Sharad Agarwal.

Figure 1. Motivating example from BFCL v3 [24]. Agentic queries demand correct functions and precise parameter values at every step; small errors (wrong type, wrong value, wrong order) produce consequential failures.
Figure 2. System architecture. Left: fine-tuning pipeline (ingest benchmarks, run inference across LLMs, score via AST, fine-tune router). Right: inference-time routing (DistilBERT classifier and cost model select the cheapest predicted-correct LLM). Rejected alternatives (weighted losses, BLEU and LLM-as-judge scoring) are detailed in Appendix B.
Figure 3. Accuracy–cost Pareto plot on the held-out test set (12,282 examples).
Figure 4. Acceptable variation in an agentic parallel tool-calling scenario (BFCL v3 [24]).
Figure 5. DistilBERT fine-tuning curves across 20 seeds. Left: training and validation loss. Right: …
Figure 6. DistilBERT router accuracy across 20 random seeds. Each dot is one seed; the red bar …
Figure 7. Accuracy–cost Pareto frontier for different thresholding strategies. Each faint curve traces …
Figure 8. Expected vs. actual average cost per query on the validation set.
Figure 9. Accuracy–cost Pareto plot for the earlier model basket on the held-out test set (12,282 examples).
Figure 10. Distribution of correctness probabilities across 20 invocations, grouped by model.
Figure 11. Distribution of correctness probabilities across 20 invocations, grouped by dataset.
Figure 12. Accuracy–cost Pareto plot with probabilistic correctness labels (1,224 test examples).
read the original abstract

Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets. Model routing can mitigate this, but existing routers are designed for chat completion rather than tool use. We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling. Switchcraft operates inline, selecting the lowest-cost model subject to correctness. We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget. Switchcraft achieves 82.9% accuracy -- matching or exceeding the best individual model -- while reducing inference cost by 84%, saving over $3,600 per million queries. We find that larger models do not consistently outperform smaller ones on tool-use tasks, and that nominally cheaper models can incur higher total cost due to token-intensive reasoning. Our work enables cost-aware agentic AI deployment without sacrificing correctness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Switchcraft, a DistilBERT-based model router for agentic tool calling. It trains a classifier on five function-calling benchmarks to select the lowest-cost model subject to correctness under a latency budget, reporting 82.9% accuracy that matches or exceeds the best individual model while reducing inference cost by 84% (over $3,600 savings per million queries). The work additionally observes that larger models do not consistently outperform smaller ones on tool-use tasks and that nominally cheaper models can incur higher total token cost due to longer reasoning traces.

Significance. If the benchmark results prove robust and reproducible, the approach offers a practical, inline mechanism for cost-efficient agentic AI deployment without correctness loss. The observation that model scale does not reliably predict tool-calling performance is a useful empirical contribution for the field.

major comments (3)
  1. [Abstract] The headline claims of 82.9% accuracy and 84% cost reduction are stated without any details on training/validation data splits, error bars, statistical significance, or the precise protocol used to measure total token cost (including cases where cheaper models produce longer traces). These omissions directly undermine assessment of the central accuracy and savings assertions.
  2. [Evaluation framework] All accuracy and cost figures are measured exclusively on the same five benchmarks used to train the DistilBERT classifier. No results are reported on out-of-distribution agent queries, production logs, or tasks with different tool schemas or reasoning depths, leaving the generalization assumption untested despite the abstract's own note on variable token costs.
  3. [Methods] The classifier training procedure, label construction (how 'correct' model selections are defined), cross-validation strategy, and latency-budget deployment mechanics are insufficiently specified to support reproduction or verification of the reported performance numbers.
minor comments (1)
  1. [Abstract] The claim of being 'the first (to the best of our knowledge)' router optimized for agentic tool calling would benefit from a short related-work paragraph contrasting it with existing chat-completion routers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve reproducibility, add necessary details, and acknowledge limitations.

read point-by-point responses
  1. Referee: [Abstract] The headline claims of 82.9% accuracy and 84% cost reduction are stated without any details on training/validation data splits, error bars, statistical significance, or the precise protocol used to measure total token cost (including cases where cheaper models produce longer traces). These omissions directly undermine assessment of the central accuracy and savings assertions.

    Authors: We agree that the abstract omits key details. In the revision we will expand the methods and results sections (and update the abstract where space permits) to specify the 80/20 train/test split on the combined benchmarks, report standard deviations and 5-fold cross-validation error bars, include statistical significance tests for the accuracy and cost comparisons, and describe the total token cost protocol that sums prompt plus completion tokens while explicitly accounting for longer reasoning traces produced by smaller models. revision: yes

  2. Referee: [Evaluation framework] All accuracy and cost figures are measured exclusively on the same five benchmarks used to train the DistilBERT classifier. No results are reported on out-of-distribution agent queries, production logs, or tasks with different tool schemas or reasoning depths, leaving the generalization assumption untested despite the abstract's own note on variable token costs.

    Authors: This is a valid limitation. All current numbers are in-distribution. We will add a dedicated Limitations section that states this explicitly and discusses risks of distribution shift. We will also report a new small-scale experiment on held-out queries with altered tool schemas to provide preliminary generalization evidence. Production logs are unavailable to us, so we cannot add those results. revision: partial

  3. Referee: [Methods] The classifier training procedure, label construction (how 'correct' model selections are defined), cross-validation strategy, and latency-budget deployment mechanics are insufficiently specified to support reproduction or verification of the reported performance numbers.

    Authors: We acknowledge the methods section is underspecified. The revision will detail: label construction (a model is labeled correct only if it emits the exact ground-truth tool call and arguments), DistilBERT fine-tuning hyperparameters and procedure, 5-fold cross-validation protocol with per-fold results, and the latency-budget mechanism (models whose profiled latency exceeds the budget are filtered before cheapest-valid selection). Pseudocode for the full routing logic will be added to an appendix. revision: yes
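
A minimal sketch of the label construction described in response 3, under the strict reading given there (a model is labeled correct only if its emitted call exactly matches the ground truth). The example and executor interfaces are assumptions, not the authors' code:

```python
def build_router_labels(examples, models, run_model):
    """One binary label per (query, model) pair for the multi-label router.

    Strict variant from the rebuttal: label 1 only when the emitted tool
    call matches the ground-truth function name and arguments exactly.
    """
    labels = []
    for ex in examples:
        row = []
        for m in models:
            emitted = run_model(m, ex.query)   # hypothetical executor
            row.append(int(emitted.name == ex.gold.name
                           and emitted.arguments == ex.gold.arguments))
        labels.append(row)
    return labels  # row i aligns with examples[i], column j with models[j]
```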

Circularity Check

0 steps flagged

No circularity: the empirical evaluation of the trained router does not reduce to its training inputs

full rationale

The paper trains a DistilBERT classifier on data from five function-calling benchmarks and reports measured accuracy (82.9%) plus cost reduction (84%) on the same benchmarks' evaluation splits. These quantities are computed directly from the classifier's output selections versus ground-truth correctness and per-model token costs; they do not reduce by construction to the fitted parameters themselves. No equations, self-citations, uniqueness theorems, or ansatzes appear in the provided text that would make the headline result equivalent to its training inputs. The pipeline is a standard supervised-learning evaluation and therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract. The system relies on standard supervised classification and benchmark evaluation.

pith-pipeline@v0.9.0 · 5476 in / 1193 out tokens · 37264 ms · 2026-05-11T00:59:29.859303+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 24 canonical work pages · 4 internal anchors

  1. [1] Mohamad Abou Ali, Fadi Dornaika, and Jinan Charafeddine. Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions. Artificial Intelligence Review, 59(11), 2025. doi: 10.1007/s10462-025-11422-4. URL https://link.springer.com/article/10.1007/s10462-025-11422-4.
  2. [2] Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, and Mausam. AutoMix: Automatically Mixing Language Models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  3. [3] Tamer Alkhouli, Katerina Margatina, James Gung, Raphael Shu, Claudia Zaghi, Monica Sunkara, and Yi Zhang. CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7993–8006, Vienna, Austria, July 2025.
  4. [4] (continuation of [3]) Association for Computational Linguistics. doi: 10.18653/v1/2025.acl-long.394. URL https://aclanthology.org/2025.acl-long.394/. Creative Commons Attribution 4.0 International License.
  5. [5] Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger. Validation of Modern JSON Schema: Formalization and Complexity. Proc. ACM Program. Lang., 8(POPL), January 2024. doi: 10.1145/3632891. URL https://doi.org/10.1145/3632891.
  6. [6] Vibha Belavadi, Tushar Vatsa, Dewang Sultania, Suhas Suresha, Ishita Verma, Cheng Chen, Tracy Holloway King, and Michael Friedrich. Routenator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs, 2025. URL https://arxiv.org/abs/2505.10495.
  7. [7] Shuhao Chen, Weisen Jiang, Baijiong Lin, James Kwok, and Yu Zhang. RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models. In Advances in Neural Information Processing Systems, volume 37, 2024. doi: 10.52202/079017-2120.
  8. [8] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Rühle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing. In The Twelfth International Conference on Learning Representations, 2024.
  9. [9] Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Del Carmen Hipolito Garcia, Menglin Xia, Laks V. S. Lakshmanan, Qingyun Wu, and Victor Rühle. BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute. In Forty-second International Conference on Machine Learning, 2025.
  10. [10] Aosong Feng, Balasubramaniam Srinivasan, Yun Zhou, Zhichao Xu, Kang Zhou, Sheng Guan, Yueyan Chen, Xian Wu, Ninad Kulkarni, Yi Zhang, Zhengyuan Shen, Dmitriy Bespalov, Soumya Smruti Mishra, Yifei Teng, Darren Yow-Bang Wang, Haibo Ding, and Lin Lee Cheong. IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs. arXiv preprint arXiv:2509.06274, 2025.
  11. [11] Tao Feng, Yanzhen Shen, and Jiaxuan You. GraphRouter: A Graph-Based Router for LLM Selections.
  12. [12] (continuation of [11]) URL https://arxiv.org/abs/2410.03834.
  13. [13] Glaive AI. Glaive Function Calling v2. Dataset available from HuggingFace, 2023. URL https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2. Apache 2.0 License.
  14. [14] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. RouterBench: A Benchmark for Multi-LLM Routing System. In Agentic Markets Workshop at ICML 2024, 2024.
  15. [15] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770, 2024.
  16. [16] Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, and Sanjiv Kumar. Universal Model Routing for Efficient LLM Inference. arXiv preprint arXiv:2502.08773, 2025.
  17. [17] Pedro Las-Casas, Alok Gautum Kumbhare, Rodrigo Fonseca, and Sharad Agarwal. LLexus: An AI Agent System for Incident Management. ACM SIGOPS Operating Systems Review, 58(1):23–36, 2024. doi: 10.1145/3689051.3689056.
  18. [18] Hao Li, Yiqun Zhang, Zhaoyan Guo, Chenxu Wang, Shengji Tang, Qiaosheng Zhang, Yang Chen, Biqing Qi, Peng Ye, Lei Bai, Zhen Wang, and Shuyue Hu. LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing. arXiv preprint arXiv:2601.07206, 2026.
  19. [19] Yifan Lu, Rixin Liu, Jiayi Yuan, Xingqi Cui, Shenrun Zhang, Hongyi Liu, and Jiarong Xing. RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers. arXiv preprint arXiv:2510.00202, 2025.
  20. [20] Kai Mei, Wujiang Xu, Minghao Guo, Shuhang Lin, and Yongfeng Zhang. OmniRouter: Budget and Performance Controllable Multi-LLM Routing. arXiv preprint arXiv:2502.20576, 2025.
  21. [21] Quang H. Nguyen, Thinh Dao, Duy C. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V. Chawla, and Khoa D. Doan. MetaLLM: A High-Performant and Cost-Efficient Dynamic Framework for Wrapping LLMs, 2025. URL https://arxiv.org/abs/2407.10834.
  22. [22] Nous Research. Hermes 3 Technical Report. Technical report, Nous Research, 2024. URL https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf. Apache 2.0 License.
  23. [23] Mika Okamoto, Ansel Kaplan Erol, and Mark Riedl. Explainable Model Routing for Agentic Workflows, 2026. URL https://arxiv.org/abs/2604.03527.
  24. [24] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to Route LLMs from Preference Data. In The Thirteenth International Conference on Learning Representations, 2025.
  25. [25] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large Language Model Connected with Massive APIs. In Advances in Neural Information Processing Systems, volume 37. Curran Associates, Inc., 2024. doi: 10.52202/079017-4020.
  26. [26] Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models. In Proceedings of the 42nd International Conference on Machine Learning, pages 48371–48392, 2025. Apache 2.0 License.
  27. [27] Marija Sakota, Maxime Peyrard, and Robert West. Fly-Swat or Cannon? Cost-Effective Language Model Choice via Meta-Modeling. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, WSDM '24, pages 606–615, 2024. URL https://doi.org/10.1145/3616855.3635825.
  28. [28] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv preprint arXiv:1910.01108, 2019. URL https://arxiv.org/abs/1910.01108.
  29. [29] Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems, volume 36, 2023.
  30. [30] Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin. Large Language Model Routing with Benchmark Datasets. arXiv preprint arXiv:2309.15789, 2023. URL https://arxiv.org/abs/2309.15789.
  31. [31] Seamus Somerstep, Felipe Maia Polo, Allysson Flavio Melo de Oliveira, Prattyush Mangal, Mírian Silva, Onkar Bhardwaj, Mikhail Yurochkin, and Subha Maity. CARROT: A Cost Aware Rate Optimal Router, 2025. URL https://arxiv.org/abs/2502.03261.
  32. [32] Wei Song, Zhenya Huang, Cheng Cheng, Weibo Gao, Bihan Xu, GuanHao Zhao, Fei Wang, and Runze Wu. IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025.
  33. [33] Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, and Chaoyang He. TensorOpera Router: A Multi-Model Router for Efficient LLM Inference, 2024. URL https://arxiv.org/abs/2408.12320.
  34. [34] vLLM Semantic Router Team. vLLM Semantic Router. https://github.com/vllm-project/semantic-router, 2025.
  35. [35] Chenxu Wang, Hao Li, Yiqun Zhang, Linyao Chen, Jianhao Chen, Ping Jian, Peng Ye, Qiaosheng Zhang, and Shuyue Hu. ICL-Router: In-Context Learned Model Representations for LLM Routing. In AAAI Conference on Artificial Intelligence, 2025. URL https://api.semanticscholar.org/CorpusID:282057625.
  36. [36] Jiaqi Xue, Qian Lou, Jiarong Xing, and Heng Huang. R2-Router: A New Paradigm for LLM Routing with Reasoning. arXiv preprint arXiv:2602.02823, 2026.
  37. [37] Shunyu Yao, Jian Pei, Yue Ma, and Howard Chen. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045, 2024.
  38. [38] Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Eric Hu, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, and Caiming Xiong. xLAM: A Family of Large Action Models to Empower AI Agent Systems. arXiv preprint arXiv:2409.03215, 2024.
  39. [39] Qizheng Zhang, Michael Wornow, and Kunle Olukotun. Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching. In International Conference on Machine Learning Workshops, 2025.
  40. [40] Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, and Kannan Ramchandran. EmbedLLM: Learning Compact Representations of Large Language Models, 2024. URL https://arxiv.org/abs/2410.02223.

  41.–46. [41]–[46] Internal anchors: the worked example behind Figure 4 (acceptable variation in parallel tool calling, BFCL v3).

     Available tool: calculate_sales_tax(purchase_amount, city, state)
     User: "Calculate the amount of sales tax to be added on a purchase amount of $30.45 in..."

     Response A ✓ (correct):
     calculate_sales_tax(purchase_amount=30.45, city="Chicago", state="Illinois")
     calculate_sales_tax(purchase_amount=52.33, city="Sacramento", state="California")
     calculate_sales_tax(purchase_amount=11.23, city="Portland", state="Oregon")

     Response B ✓ (also correct: reordered calls with abbreviated parameter values):
     calculate_sales_tax(purchase_amount=11.23, city="Portland", state="OR")
     calculate_sales_tax(purchase_amount=30.45, city="CHI", state="IL")
     calculate_sales_tax(purchase_amount=52.33, city="Sacramento", state="CA")

     Both responses are correct. Because these three calls are independent (no data flows between them), any ordering is valid. Additionally, semantically equivalent parameter values ("Illinois" vs. "IL", "Chicago" vs. "CHI") are equally acceptable.
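
The equivalence shown above implies a scorer that accepts reordered independent calls and aliased argument values. A minimal sketch of such a check, with a hypothetical alias table standing in for whatever normalization the paper's AST-based scorer applies:

```python
# Hypothetical alias table; the paper's actual scorer is AST-based and is
# not reproduced here.
ALIASES = {"illinois": "il", "chicago": "chi", "california": "ca", "oregon": "or"}

def normalize(call: dict) -> tuple:
    """Canonical form of one tool call: (name, sorted normalized arguments)."""
    args = tuple(sorted(
        (key, ALIASES.get(str(val).lower(), str(val).lower()))
        for key, val in call["arguments"].items()))
    return (call["name"], args)

def calls_equivalent(response: list[dict], gold: list[dict]) -> bool:
    """Order-insensitive comparison for independent parallel tool calls."""
    return sorted(map(normalize, response)) == sorted(map(normalize, gold))
```

Applied to the two responses above, this check accepts both, while a naive string comparison would reject Response B.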

  47.–49. [47]–[49] Internal anchors: the routing rule.

     1. Apply a threshold (θ = 0.5) to each sigmoid output to determine which models are predicted to be "reliable" for this query.
     2. Among the models above threshold, select the cheapest one (cost-aware tie-breaking using profiled per-query costs).
     3. If no model exceeds the threshold, fall back to the model with the highest predicted probability (argmax).

     Probability distributions. Figure 10 shows the distribution of correctness probabilities per model. The distributions are strongly bimodal: the vast majority of model–query pairs have probability near 0 (always incorrect) or 1 (always correct), with...

  50. [50] Internal anchor: frozen encoder. Our multi-label router fine-tunes all 66M DistilBERT parameters end-to-end, allowing the encoder to learn task-specific representations for agentic function-calling queries. The MIRT router uses frozen embeddings from a general-purpose pre-trained model, which may not capture the fine-grained distinctions (e.g., JSON structure validity, to...

  51. [51] Internal anchor: vanilla tokenization. The MIRT router uses simple text concatenation rather than our compressed token-packing strategy (Section 3.1), which prioritises the most recent user turn and tool signatures within the 512-token budget. The ablation in Appendix K shows that token packing alone contributes 1.66 pp of accuracy; this accounts for most of the observed gap.
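
A sketch of the token-packing idea described in [51], under assumptions: `tokenize` is a stand-in for the DistilBERT tokenizer, and the priority order (latest user turn first, then tool signatures, then earlier turns) is inferred from the fragment above rather than taken from the paper's Section 3.1.

```python
def pack_tokens(user_turns: list[str], tool_signatures: list[str],
                tokenize, budget: int = 512) -> list[int]:
    """Fill a fixed token budget by priority instead of plain concatenation:
    most recent user turn first, then tool signatures, then older turns."""
    pieces = [user_turns[-1]] + tool_signatures + user_turns[-2::-1]
    packed: list[int] = []
    for piece in pieces:
        toks = tokenize(piece)
        if len(packed) + len(toks) > budget:
            toks = toks[: budget - len(packed)]   # truncate the overflowing piece
        packed.extend(toks)
        if len(packed) >= budget:
            break
    return packed
```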