RouterBench: A Benchmark for Multi-LLM Routing System
Pith reviewed 2026-05-16 10:43 UTC · model grok-4.3
The pith
RouterBench supplies a benchmark and over 405k inference results to evaluate systems that route queries across multiple LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RouterBench combines a novel evaluation framework with a dataset of over 405k inference outcomes from representative LLMs, enabling systematic assessment of LLM routing systems. The authors also propose a theoretical framework for routing and deliver a comparative analysis of various routing approaches, highlighting their potentials and limitations.
What carries the argument
RouterBench evaluation framework and its accompanying dataset of inference outcomes that standardize measurement of routing decisions across tasks.
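The core mechanism can be sketched in a few lines: because every prompt has a stored outcome for every model, any router can be scored offline on identical inputs. This is a minimal, hypothetical sketch; the table layout, model names, and costs are invented for illustration and are not the released dataset's schema.

```python
# Minimal sketch: scoring routing policies against a precomputed
# outcomes table, in the spirit of RouterBench. Each record holds one
# prompt with every model's stored correctness and dollar cost, so
# routers can be compared offline without re-running inference.
outcomes = [
    # (prompt_id, {model: (correct, cost_usd)})  -- invented values
    ("q1", {"small": (1, 0.0001), "large": (1, 0.0030)}),
    ("q2", {"small": (0, 0.0001), "large": (1, 0.0030)}),
    ("q3", {"small": (1, 0.0001), "large": (0, 0.0030)}),
]

def evaluate(router):
    """Return (accuracy, total_cost) for a router over stored outcomes."""
    n_correct = total_cost = 0.0
    for prompt_id, per_model in outcomes:
        choice = router(prompt_id, per_model.keys())
        correct, dollars = per_model[choice]
        n_correct += correct
        total_cost += dollars
    return n_correct / len(outcomes), total_cost

# Two trivial baselines evaluated on identical prompts and metrics.
always_small = lambda pid, models: "small"
always_large = lambda pid, models: "large"

acc_s, cost_s = evaluate(always_small)
acc_l, cost_l = evaluate(always_large)
```

On this toy table both baselines reach the same accuracy at very different cost, which is exactly the trade-off a shared outcomes table makes visible.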
If this is right
- Routing algorithms can now be compared under identical conditions and metrics.
- Researchers can train and validate new routers directly on the released inference outcomes.
- Production systems can adopt routers that demonstrably improve performance per dollar.
- The theoretical framework supplies a common language for designing and analyzing future routers.
- The benchmark establishes a baseline that later papers can use to quantify incremental gains.
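Comparing routers "per dollar" amounts to reducing each router to a (cost, quality) point and keeping the undominated ones. The sketch below is illustrative only; the router names and numbers are invented, and this simple frontier is not claimed to be the paper's exact metric.

```python
# Hypothetical sketch: reducing router results to a cost-quality
# Pareto frontier, the kind of like-for-like comparison a shared
# benchmark enables. All names and numbers are invented.

def pareto_frontier(points):
    """Keep (cost, quality) points not dominated by any point that is
    cheaper or equal in cost and strictly better in quality."""
    frontier = []
    for cost, quality in sorted(points):  # ascending cost
        if not frontier or quality > frontier[-1][1]:
            frontier.append((cost, quality))
    return frontier

routers = {
    "always-small": (0.10, 0.62),   # (cost per 1k queries, accuracy)
    "always-large": (3.00, 0.81),
    "learned-router": (0.90, 0.79),
    "random-mix": (1.50, 0.70),     # dominated by learned-router
}

frontier = pareto_frontier(routers.values())
```

Here "random-mix" drops out because "learned-router" is both cheaper and more accurate; later papers can quantify incremental gains by how far they push this frontier.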
Where Pith is reading between the lines
- Industry teams may begin treating routing as a first-class component rather than an afterthought in LLM serving stacks.
- The dataset could be extended with newer models or multi-modal tasks to keep the benchmark relevant over time.
- Routing research might shift from hand-crafted heuristics toward learned policies trained on the provided outcomes.
- Adoption of such benchmarks could reduce redundant experimentation across different research groups.
Load-bearing premise
The selected tasks, models, and recorded outcomes sufficiently represent real-world usage patterns and future models so that results on the benchmark generalize.
What would settle it
A routing method that achieves strong results on RouterBench yet produces worse accuracy-cost trade-offs when deployed on a fresh collection of production tasks or newer LLMs would falsify the benchmark's claimed utility.
read the original abstract
As the range of applications for Large Language Models (LLMs) continues to grow, the demand for effective serving solutions becomes increasingly critical. Despite the versatility of LLMs, no single model can optimally address all tasks and applications, particularly when balancing performance with cost. This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs. Yet, the absence of a standardized benchmark for evaluating the performance of LLM routers hinders progress in this area. To bridge this gap, we present RouterBench, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies. We further propose a theoretical framework for LLM routing, and deliver a comparative analysis of various routing approaches through RouterBench, highlighting their potentials and limitations within our evaluation framework. This work not only formalizes and advances the development of LLM routing systems but also sets a standard for their assessment, paving the way for more accessible and economically viable LLM deployments. The code and data are available at https://github.com/withmartian/routerbench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RouterBench, a benchmark and evaluation framework for multi-LLM routing systems, releases a dataset of over 405k inference outcomes from representative LLMs, proposes a theoretical framework for routing, and provides a comparative analysis of routing approaches to support development of cost-effective LLM serving strategies.
Significance. If the dataset collection methodology and representativeness claims hold after detailed validation, RouterBench could establish a much-needed standard for evaluating LLM routers, accelerating research on hybrid model serving that balances accuracy and cost. The public release of code and data at the provided GitHub link is a clear strength for reproducibility.
Major comments (2)
- [Abstract and Dataset Description] The central claim that the 405k-outcome dataset supports development of general routing strategies rests on unstated assumptions about task/model selection and outcome distributions; no methodology, validation steps, or statistical controls for representativeness are described, preventing assessment of whether the benchmark generalizes beyond the snapshot of current LLMs.
- [Evaluation Framework and Comparative Analysis] No experiments test robustness to post-cutoff models or shifted task distributions, which directly undermines the claim that RouterBench will remain useful for future routing strategies; the comparative results may overfit to the fixed accuracy/cost profiles in this static collection.
Minor comments (2)
- [Theoretical Framework] The theoretical framework section uses several routing-specific terms without explicit definitions or references to prior work on multi-model selection, which could reduce accessibility.
- [Results] Figure captions and axis labels in the results plots should explicitly state the number of models and tasks included to allow readers to assess scale.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript introducing RouterBench. We appreciate the referee's recognition of the benchmark's potential value and the public release of code and data. We address each major comment below, outlining planned revisions where appropriate to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract and Dataset Description] The central claim that the 405k-outcome dataset supports development of general routing strategies rests on unstated assumptions about task/model selection and outcome distributions; no methodology, validation steps, or statistical controls for representativeness are described, preventing assessment of whether the benchmark generalizes beyond the snapshot of current LLMs.
Authors: We agree that the dataset section would benefit from greater explicitness on these points. In the revised manuscript, we will expand the dataset description with a new subsection detailing the task and model selection criteria, the inference sampling procedure, outcome distributions, and any statistical validation or controls applied to support representativeness claims. This will enable readers to more rigorously assess generalizability beyond the current snapshot. revision: yes
-
Referee: [Evaluation Framework and Comparative Analysis] No experiments test robustness to post-cutoff models or shifted task distributions, which directly undermines the claim that RouterBench will remain useful for future routing strategies; the comparative results may overfit to the fixed accuracy/cost profiles in this static collection.
Authors: We will add experiments simulating shifted task distributions (e.g., via subset cross-validation and controlled perturbations of the existing data) to the evaluation section to demonstrate robustness of the comparative results. For post-cutoff models, direct experiments are not feasible as such models do not yet exist; we will add an explicit limitations discussion noting this and recommending periodic benchmark updates as new models become available, thereby clarifying the scope of current claims about long-term utility. revision: partial
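The subset cross-validation the authors propose can be sketched concretely: hold out one task at a time and check whether router rankings stay stable under the shifted mix. Task names and scores below are invented for illustration and do not come from the paper.

```python
# Illustrative robustness check: leave one task out and re-rank the
# routers by mean accuracy on the remaining tasks. A ranking that
# flips when a task is dropped signals overfitting to the fixed task
# mix. All task names and scores are invented.

# scores[router][task] = accuracy of that router on that task's subset
scores = {
    "router_a": {"mmlu": 0.80, "gsm8k": 0.70, "mbpp": 0.60},
    "router_b": {"mmlu": 0.75, "gsm8k": 0.72, "mbpp": 0.66},
}

def ranking_without(held_out):
    """Rank routers by mean accuracy with one task held out."""
    means = {
        router: sum(v for t, v in tasks.items() if t != held_out)
                / (len(tasks) - 1)
        for router, tasks in scores.items()
    }
    return sorted(means, key=means.get, reverse=True)

rankings = {task: ranking_without(task) for task in ["mmlu", "gsm8k", "mbpp"]}
```

With these invented numbers, router_a wins only when mbpp is dropped, so the overall ranking depends on the task mix, which is precisely the instability the referee worries about.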
Circularity Check
No circularity: benchmark rests on new empirical data collection
full rationale
The paper's core contribution is the release of RouterBench plus a static 405k inference dataset collected from existing LLMs on chosen tasks. No derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps are present. The proposed theoretical framework is described at a high level without equations that reduce to the dataset by construction. Representativeness for future models is a generalizability concern, not a circularity issue. The work is self-contained as an empirical benchmark.
Forward citations
Cited by 18 Pith papers
- CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference · CR^2 matches full-information routing performance for device-edge LLM inference using only device-side signals and cuts normalized deployment cost by up to 16.9% at matched accuracy.
- Efficient Ensemble Selection from Binary and Pairwise Feedback · The paper develops efficient algorithms for ensemble selection from binary and pairwise feedback, achieving (1-1/e) guarantees with query savings for coverage and PTAS-style results via submodular relaxation for theta...
- RouteProfile: Elucidating the Design Space of LLM Profiles for Routing · RouteProfile organizes LLM profile design into organizational form, representation type, aggregation depth, and learning configuration, with evaluations showing structured profiles outperform flat ones and aid general...
- Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents · LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...
- Domain Restriction via Multi SAE Layer Transitions · Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
- GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization · GAR routes LLM inference requests via constrained multi-objective optimization to cut per-request CO2 emissions while respecting accuracy floors and p95 latency SLOs.
- LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer? · LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
- Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge · RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
- ModelLens: Finding the Best for Your Task from Myriads of Models · ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
- CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation · CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.
- Privacy-Preserving LLMs Routing · PPRoute achieves plaintext-level LLM routing quality with MPC-based privacy and a 20x speedup over naive encrypted implementations via MPC-friendly encoders, multi-step training, and O(1) communication Top-k search.
- Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization · A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
- RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving · Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.
- Anonymization-Enhanced Privacy Protection for Mobile GUI Agents: Available but Invisible · An anonymization framework replaces sensitive UI content with deterministic placeholders to protect privacy in mobile GUI agents while preserving task performance.
- RouteLLM: Learning to Route LLMs with Preference Data · Router models trained on preference data dynamically select between strong and weak LLMs, cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.
- Agentic AI Systems Should Be Designed as Marginal Token Allocators · Agentic AI systems should be designed as marginal token allocators that balance benefit against cost, latency, and risk across their layers rather than as unit-priced text generators.
- Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents · Tri-Spirit decomposes autonomous AI into planning, reasoning, and execution layers on heterogeneous hardware, yielding 75.6% lower latency, 71.1% less energy, and 77.6% offline task completion in 2000-task simulations.
- AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent · AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-...