RouteLLM: Learning to Route LLMs with Preference Data
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely recognized benchmarks shows that our approach significantly reduces costs, by over 2 times in certain cases, without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.
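The abstract does not say how a router turns its prediction into a decision. As a rough illustration only, the sketch below assumes the router outputs a score in [0, 1] estimating how much a query benefits from the stronger model, and that a cost threshold decides the route. The names `route`, `toy_win_prob`, and `strong_call_fraction` are hypothetical, and the word-count heuristic is a stand-in for a trained router, not the paper's method.

```python
from typing import Callable, Sequence


def route(prompt: str, win_prob: Callable[[str], float], threshold: float) -> str:
    """Return "strong" if the estimated benefit of the strong model meets the threshold."""
    # win_prob is an assumed interface: any callable mapping a prompt to a score in [0, 1].
    return "strong" if win_prob(prompt) >= threshold else "weak"


def toy_win_prob(prompt: str) -> float:
    """Hypothetical stand-in for a learned router: longer prompts are treated as harder."""
    return min(len(prompt.split()) / 20.0, 1.0)


def strong_call_fraction(
    prompts: Sequence[str], win_prob: Callable[[str], float], threshold: float
) -> float:
    """Fraction of prompts sent to the strong model at a given threshold."""
    decisions = [route(p, win_prob, threshold) for p in prompts]
    return decisions.count("strong") / len(decisions)


if __name__ == "__main__":
    queries = ["hi", "word " * 20]
    for q in queries:
        print(repr(q[:20]), "->", route(q, toy_win_prob, threshold=0.5))
    print("strong fraction:", strong_call_fraction(queries, toy_win_prob, 0.5))
```

Under these assumptions, raising the threshold sends fewer queries to the strong model, which is one way to trade response quality against cost.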
citing papers explorer
-
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
-
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
-
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
-
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
-
A Regime Theory of Controller Class Selection for LLM Action Decisions
A regime theory selects the optimal controller class for LLM action decisions from a nested lattice of four classes using three data-estimable bottlenecks, with a Bernstein-tight threshold and empirical matches on multiple benchmarks.
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Model Routing as a Trust Problem: Route Receipts for Adaptive AI Systems
The paper introduces route receipts as a portable runtime record of routing decisions to make adaptive AI systems more transparent and trustworthy.
-
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies
Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
-
Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Pyramid MoA is a hierarchical Mixture-of-Agents system with a decision-theoretic router that achieves up to 42.9% compute savings while nearly matching oracle accuracy on MBPP, GSM8K, MMLU, HumanEval, and MATH.
-
ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation
ScrapeGraphAI-100k releases 93,695 real telemetry examples pairing web page content with prompts, schemas, and LLM responses to support training and benchmarking of schema-constrained generation.
-
HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools
HyDRA routes queries to cost-effective LLMs by predicting multi-dimensional capability requirements with a multi-head encoder and applying shortfall matching against configuration-defined model profiles, delivering up to 72.5 percent cost savings on coding benchmarks while remaining decoupled from specific …
-
GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization
GAR routes LLM inference requests via constrained multi-objective optimization to cut per-request CO2 emissions while respecting accuracy floors and p95 latency SLOs.
-
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
-
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
-
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.
-
ModelLens: Finding the Best for Your Task from Myriads of Models
ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
-
Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning
A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.
-
Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training
Average token log-probability provides a zero-shot confidence signal for small LLMs that matches supervised baselines in-distribution and outperforms them out-of-distribution, with a new retrieval-conditional variant improving further at lower latency.
-
AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?
Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.
-
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut costs by 32%.
-
RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment
RouteLMT learns to route MT requests to large or small LLMs by predicting marginal quality gain from small-model token representations, yielding a better quality-budget Pareto frontier than baselines.
-
Phase-Scheduled Multi-Agent Systems for Token-Efficient Coordination
PSMAS reduces token use in LLM multi-agent systems by 27.3% on average via phase-based temporal scheduling and context compression, with task performance staying within 2.1 points of full activation.
-
Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
-
Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads
Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.
-
RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving
Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.
-
Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals
Triage routes coding tasks to cost-effective LLM tiers based on code quality metrics to maintain verification quality at lower cost.
-
Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.
-
Policy-Governed LLM Routing with Intent Matching for Instrument Laboratories
A governed LLM routing system for lab tutoring raises challenge-alignment from 0.90 to 0.98, boosts productive-struggle time, and cuts token costs by two-thirds while preserving answer accuracy.
-
GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts
GlimpRouter uses the entropy of the first token in each reasoning step to decide whether to invoke a large model, yielding 10.7% higher accuracy and 25.9% lower latency than a standalone large model on AIME25.
-
Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.
-
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
-
UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing
UCCI calibrates LLM uncertainty to error probabilities with isotonic regression for cost-optimal cascade routing, delivering 31% cost savings at maintained accuracy on a 75k-query NER task.
-
Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation
RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.
-
Agentic AI Systems Should Be Designed as Marginal Token Allocators
Agentic AI systems should be designed as marginal token allocators that balance benefit against cost, latency, and risk across their layers rather than as unit-priced text generators.
-
TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping
TRACES tags reasoning steps to enable adaptive early stopping, cutting token use by 20-50% on MATH500, GSM8K, AIME, MMLU and GPQA with comparable accuracy.
-
A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs
A-IO adaptively orchestrates LLM inference on NPUs to address memory bottlenecks, model scaling paradoxes, and synchronization costs in speculative decoding.
-
AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent
AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-optimal accuracy on benchmarks.
-
Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.
-
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
-
Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design
The paper defines Computational Token Economics and introduces the Token Economics Trilemma as a framework for trade-offs in granularity, real-time performance, and optimality, while outlining a research agenda for three challenge areas.
-
INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference
INAR-VL routes 36% of visual question answering requests to the edge using lightweight complexity signals, cutting latency 24% and energy 26% while retaining 97% of cloud accuracy.
-
Qualixar OS: A Universal Operating System for AI Agent Orchestration
Qualixar OS provides a runtime for multi-agent AI systems with support for 12 topologies, LLM-driven team design, dynamic routing, consensus judging, content attribution, and protocol bridging, achieving 100% accuracy on a custom 20-task suite at $0.000039 mean cost per task.
-
Harnessing Multiple Large Language Models: A Survey on LLM Ensemble
A systematic survey of LLM ensemble methods organized into a taxonomy of ensemble-before-inference, ensemble-during-inference, and ensemble-after-inference stages, with review of benchmarks, applications, and future directions.
- Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents
- A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology
- Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees