RouterBench: A Benchmark for Multi-LLM Routing System
Pith reviewed 2026-05-16 10:43 UTC · model grok-4.3
The pith
RouterBench supplies a benchmark and over 405k inference results to evaluate systems that route queries across multiple LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RouterBench combines a novel evaluation framework with a dataset of over 405k inference outcomes from representative LLMs, enabling systematic assessment of LLM routing systems. The authors also propose a theoretical framework for routing and deliver a comparative analysis of various routing approaches, highlighting their potentials and limitations.
What carries the argument
RouterBench evaluation framework and its accompanying dataset of inference outcomes that standardize measurement of routing decisions across tasks.
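The core mechanism can be sketched in a few lines: because every prompt has a stored outcome for every model, any router can be scored offline on identical inputs. This is a minimal, hypothetical sketch; the table layout, model names, and costs are invented for illustration and are not the released dataset's schema.

```python
# Minimal sketch: scoring routing policies against a precomputed
# outcomes table, in the spirit of RouterBench. Each record holds one
# prompt with every model's stored correctness and dollar cost, so
# routers can be compared offline without re-running inference.
outcomes = [
    # (prompt_id, {model: (correct, cost_usd)})  -- invented values
    ("q1", {"small": (1, 0.0001), "large": (1, 0.0030)}),
    ("q2", {"small": (0, 0.0001), "large": (1, 0.0030)}),
    ("q3", {"small": (1, 0.0001), "large": (0, 0.0030)}),
]

def evaluate(router):
    """Return (accuracy, total_cost) for a router over stored outcomes."""
    n_correct = total_cost = 0.0
    for prompt_id, per_model in outcomes:
        choice = router(prompt_id, per_model.keys())
        correct, dollars = per_model[choice]
        n_correct += correct
        total_cost += dollars
    return n_correct / len(outcomes), total_cost

# Two trivial baselines evaluated on identical prompts and metrics.
always_small = lambda pid, models: "small"
always_large = lambda pid, models: "large"

acc_s, cost_s = evaluate(always_small)
acc_l, cost_l = evaluate(always_large)
```

On this toy table both baselines reach the same accuracy at very different cost, which is exactly the trade-off a shared outcomes table makes visible.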
If this is right
- Routing algorithms can now be compared under identical conditions and metrics.
- Researchers can train and validate new routers directly on the released inference outcomes.
- Production systems can adopt routers that demonstrably improve performance per dollar.
- The theoretical framework supplies a common language for designing and analyzing future routers.
- The benchmark establishes a baseline that later papers can use to quantify incremental gains.
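Comparing routers "per dollar" amounts to reducing each router to a (cost, quality) point and keeping the undominated ones. The sketch below is illustrative only; the router names and numbers are invented, and this simple frontier is not claimed to be the paper's exact metric.

```python
# Hypothetical sketch: reducing router results to a cost-quality
# Pareto frontier, the kind of like-for-like comparison a shared
# benchmark enables. All names and numbers are invented.

def pareto_frontier(points):
    """Keep (cost, quality) points not dominated by any point that is
    cheaper or equal in cost and strictly better in quality."""
    frontier = []
    for cost, quality in sorted(points):  # ascending cost
        if not frontier or quality > frontier[-1][1]:
            frontier.append((cost, quality))
    return frontier

routers = {
    "always-small": (0.10, 0.62),   # (cost per 1k queries, accuracy)
    "always-large": (3.00, 0.81),
    "learned-router": (0.90, 0.79),
    "random-mix": (1.50, 0.70),     # dominated by learned-router
}

frontier = pareto_frontier(routers.values())
```

Here "random-mix" drops out because "learned-router" is both cheaper and more accurate; later papers can quantify incremental gains by how far they push this frontier.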
Where Pith is reading between the lines
- Industry teams may begin treating routing as a first-class component rather than an afterthought in LLM serving stacks.
- The dataset could be extended with newer models or multi-modal tasks to keep the benchmark relevant over time.
- Routing research might shift from hand-crafted heuristics toward learned policies trained on the provided outcomes.
- Adoption of such benchmarks could reduce redundant experimentation across different research groups.
Load-bearing premise
The selected tasks, models, and recorded outcomes sufficiently represent real-world usage patterns and future models so that results on the benchmark generalize.
What would settle it
A routing method that achieves strong results on RouterBench yet produces worse accuracy-cost trade-offs when deployed on a fresh collection of production tasks or newer LLMs would falsify the benchmark's claimed utility.
read the original abstract
As the range of applications for Large Language Models (LLMs) continues to grow, the demand for effective serving solutions becomes increasingly critical. Despite the versatility of LLMs, no single model can optimally address all tasks and applications, particularly when balancing performance with cost. This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs. Yet, the absence of a standardized benchmark for evaluating the performance of LLM routers hinders progress in this area. To bridge this gap, we present RouterBench, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies. We further propose a theoretical framework for LLM routing, and deliver a comparative analysis of various routing approaches through RouterBench, highlighting their potentials and limitations within our evaluation framework. This work not only formalizes and advances the development of LLM routing systems but also sets a standard for their assessment, paving the way for more accessible and economically viable LLM deployments. The code and data are available at https://github.com/withmartian/routerbench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RouterBench, a benchmark and evaluation framework for multi-LLM routing systems, releases a dataset of over 405k inference outcomes from representative LLMs, proposes a theoretical framework for routing, and provides a comparative analysis of routing approaches to support development of cost-effective LLM serving strategies.
Significance. If the dataset collection methodology and representativeness claims hold after detailed validation, RouterBench could establish a much-needed standard for evaluating LLM routers, accelerating research on hybrid model serving that balances accuracy and cost. The public release of code and data at the provided GitHub link is a clear strength for reproducibility.
Major comments (2)
- [Abstract and Dataset Description] The central claim that the 405k-outcome dataset supports development of general routing strategies rests on unstated assumptions about task/model selection and outcome distributions; no methodology, validation steps, or statistical controls for representativeness are described, preventing assessment of whether the benchmark generalizes beyond the snapshot of current LLMs.
- [Evaluation Framework and Comparative Analysis] No experiments test robustness to post-cutoff models or shifted task distributions, which directly undermines the claim that RouterBench will remain useful for future routing strategies; the comparative results may overfit to the fixed accuracy/cost profiles in this static collection.
Minor comments (2)
- [Theoretical Framework] The theoretical framework section uses several routing-specific terms without explicit definitions or references to prior work on multi-model selection, which could reduce accessibility.
- [Results] Figure captions and axis labels in the results plots should explicitly state the number of models and tasks included to allow readers to assess scale.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript introducing RouterBench. We appreciate the referee's recognition of the benchmark's potential value and the public release of code and data. We address each major comment below, outlining planned revisions where appropriate to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract and Dataset Description] The central claim that the 405k-outcome dataset supports development of general routing strategies rests on unstated assumptions about task/model selection and outcome distributions; no methodology, validation steps, or statistical controls for representativeness are described, preventing assessment of whether the benchmark generalizes beyond the snapshot of current LLMs.
Authors: We agree that the dataset section would benefit from greater explicitness on these points. In the revised manuscript, we will expand the dataset description with a new subsection detailing the task and model selection criteria, the inference sampling procedure, outcome distributions, and any statistical validation or controls applied to support representativeness claims. This will enable readers to more rigorously assess generalizability beyond the current snapshot. revision: yes
-
Referee: [Evaluation Framework and Comparative Analysis] No experiments test robustness to post-cutoff models or shifted task distributions, which directly undermines the claim that RouterBench will remain useful for future routing strategies; the comparative results may overfit to the fixed accuracy/cost profiles in this static collection.
Authors: We will add experiments simulating shifted task distributions (e.g., via subset cross-validation and controlled perturbations of the existing data) to the evaluation section to demonstrate robustness of the comparative results. For post-cutoff models, direct experiments are not feasible as such models do not yet exist; we will add an explicit limitations discussion noting this and recommending periodic benchmark updates as new models become available, thereby clarifying the scope of current claims about long-term utility. revision: partial
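The subset cross-validation the authors propose can be sketched concretely: hold out one task at a time and check whether router rankings stay stable under the shifted mix. Task names and scores below are invented for illustration and do not come from the paper.

```python
# Illustrative robustness check: leave one task out and re-rank the
# routers by mean accuracy on the remaining tasks. A ranking that
# flips when a task is dropped signals overfitting to the fixed task
# mix. All task names and scores are invented.

# scores[router][task] = accuracy of that router on that task's subset
scores = {
    "router_a": {"mmlu": 0.80, "gsm8k": 0.70, "mbpp": 0.60},
    "router_b": {"mmlu": 0.75, "gsm8k": 0.72, "mbpp": 0.66},
}

def ranking_without(held_out):
    """Rank routers by mean accuracy with one task held out."""
    means = {
        router: sum(v for t, v in tasks.items() if t != held_out)
                / (len(tasks) - 1)
        for router, tasks in scores.items()
    }
    return sorted(means, key=means.get, reverse=True)

rankings = {task: ranking_without(task) for task in ["mmlu", "gsm8k", "mbpp"]}
```

With these invented numbers, router_a wins only when mbpp is dropped, so the overall ranking depends on the task mix, which is precisely the instability the referee worries about.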
Circularity Check
No circularity: benchmark rests on new empirical data collection
full rationale
The paper's core contribution is the release of RouterBench plus a static 405k inference dataset collected from existing LLMs on chosen tasks. No derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps are present. The proposed theoretical framework is described at a high level without equations that reduce to the dataset by construction. Representativeness for future models is a generalizability concern, not a circularity issue. The work is self-contained as an empirical benchmark.
Forward citations
Cited by 18 Pith papers
- CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference · CR^2 matches full-information routing performance for device-edge LLM inference using only device-side signals and cuts normalized deployment cost by up to 16.9% at matched accuracy.
- Efficient Ensemble Selection from Binary and Pairwise Feedback · The paper develops efficient algorithms for ensemble selection from binary and pairwise feedback, achieving (1-1/e) guarantees with query savings for coverage and PTAS-style results via submodular relaxation for theta...
- RouteProfile: Elucidating the Design Space of LLM Profiles for Routing · RouteProfile organizes LLM profile design into organizational form, representation type, aggregation depth, and learning configuration, with evaluations showing structured profiles outperform flat ones and aid general...
- Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents · LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...
- Domain Restriction via Multi SAE Layer Transitions · Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
- GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization · GAR routes LLM inference requests via constrained multi-objective optimization to cut per-request CO2 emissions while respecting accuracy floors and p95 latency SLOs.
- LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer? · LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
- Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge · RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
- ModelLens: Finding the Best for Your Task from Myriads of Models · ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
- CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation · CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.
- Privacy-Preserving LLMs Routing · PPRoute achieves plaintext-level LLM routing quality with MPC-based privacy and a 20x speedup over naive encrypted implementations via MPC-friendly encoders, multi-step training, and O(1) communication Top-k search.
- Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization · A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
- RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving · Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.
- Anonymization-Enhanced Privacy Protection for Mobile GUI Agents: Available but Invisible · An anonymization framework replaces sensitive UI content with deterministic placeholders to protect privacy in mobile GUI agents while preserving task performance.
- RouteLLM: Learning to Route LLMs with Preference Data · Router models trained on preference data dynamically select between strong and weak LLMs, cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.
- Agentic AI Systems Should Be Designed as Marginal Token Allocators · Agentic AI systems should be designed as marginal token allocators that balance benefit against cost, latency, and risk across their layers rather than as unit-priced text generators.
- Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents · Tri-Spirit decomposes autonomous AI into planning, reasoning, and execution layers on heterogeneous hardware, yielding 75.6% lower latency, 71.1% less energy, and 77.6% offline task completion in 2000-task simulations.
- AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent · AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-...