
arxiv: 2605.17106 · v1 · pith:SDVHQ2G6new · submitted 2026-05-16 · 💻 cs.CL · cs.LG

HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools

Pith reviewed 2026-05-20 15:07 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords LLM routing · dynamic model selection · cost optimization · capability prediction · heterogeneous model pools · SWE-Bench · shortfall matching

The pith

HyDRA routes each query to the cheapest model whose static profile meets the query's predicted multi-dimensional needs, achieving large cost savings with matched or better quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to predict the reasoning, code, debugging, and tool-use demands of a query with four separate heads on a ModernBERT encoder. It then uses shortfall matching against fixed model profiles to pick the lowest-cost model that clears the required thresholds. Because the profiles live in a configuration file rather than inside learned weights, adding or removing models needs no retraining. On SWE-Bench Verified this produces three operating points: a peak-quality regime that beats the strongest single model at 13 percent lower cost, an iso-quality regime that matches the strong model at 54 percent lower cost, and an aggressive regime that saves 72 percent for a small quality drop. The same pattern holds on other coding and agent benchmarks and runs in production for GitHub Copilot users across many languages.

Core claim

HyDRA predicts fine-grained, multi-dimensional capability requirements per query and matches them against configuration-defined model profiles via shortfall matching. A ModernBERT encoder with K=4 independent sigmoid heads scores each query along reasoning, code generation, debugging, and tool use; a shortfall-matching algorithm then selects the cheapest model whose capabilities meet the predicted requirements. The deployed predictor runs at 86 ms median CPU inference latency and is fully decoupled from the model catalog.
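As a rough illustration of what "K=4 independent sigmoid heads" means, the sketch below applies four separate linear heads to one shared embedding and squashes each logit with its own sigmoid. It uses only the Python standard library. The hidden size, the random weights, and the `fake_cls_embedding` vector are placeholders assumed for illustration; they are not the paper's trained parameters, and no ModernBERT encoder is run here.

```python
import math
import random

# The four capability dimensions named in the paper.
DIMENSIONS = ["reasoning", "code_generation", "debugging", "tool_use"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class SigmoidHeads:
    """K independent linear heads over a shared embedding, each followed by a sigmoid."""

    def __init__(self, hidden_size: int, seed: int = 0):
        rng = random.Random(seed)
        # One weight vector and bias per dimension. Random placeholders, not trained weights.
        self.weights = {
            d: [rng.gauss(0.0, 0.1) for _ in range(hidden_size)] for d in DIMENSIONS
        }
        self.biases = {d: 0.0 for d in DIMENSIONS}

    def __call__(self, embedding):
        scores = {}
        for d in DIMENSIONS:
            logit = sum(w * x for w, x in zip(self.weights[d], embedding)) + self.biases[d]
            scores[d] = sigmoid(logit)  # each score lies in (0, 1) on its own
        return scores


heads = SigmoidHeads(hidden_size=8)
# Stand-in for the encoder's [CLS] embedding (hypothetical values).
fake_cls_embedding = [0.5, -1.0, 0.25, 0.0, 1.5, -0.5, 0.75, 0.1]
scores = heads(fake_cls_embedding)
```

Because each head has its own sigmoid rather than sharing a softmax, the four scores are not forced to sum to one, so a query can score high on several dimensions at once.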

What carries the argument

A shortfall-matching algorithm that selects the cheapest model whose static capability profile meets or exceeds the query's predicted scores across the four dimensions.
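The selection rule can be sketched in a few lines. The paper's exact shortfall definition is not given on this page, so the code assumes one plausible reading: shortfall is the summed per-dimension gap between what the query needs and what the model offers, and a model is eligible when that sum is at most a tunable threshold. The model names, costs, and capability values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost: float          # relative cost per request (hypothetical units)
    capabilities: dict   # dimension name -> static score in [0, 1]


def shortfall(required: dict, capabilities: dict) -> float:
    """Assumed definition: total amount by which the model falls short of the query."""
    return sum(max(0.0, need - capabilities.get(dim, 0.0)) for dim, need in required.items())


def route(required: dict, catalog: list, threshold: float = 0.0) -> ModelProfile:
    """Pick the cheapest model whose shortfall is within the threshold."""
    eligible = [m for m in catalog if shortfall(required, m.capabilities) <= threshold]
    if eligible:
        return min(eligible, key=lambda m: m.cost)
    # Nothing qualifies: fall back to the model with the smallest shortfall.
    return min(catalog, key=lambda m: (shortfall(required, m.capabilities), m.cost))


# Hypothetical three-model catalog.
CATALOG = [
    ModelProfile("small", 1.0, {"reasoning": 0.4, "code_generation": 0.6, "debugging": 0.5, "tool_use": 0.3}),
    ModelProfile("medium", 4.0, {"reasoning": 0.7, "code_generation": 0.8, "debugging": 0.7, "tool_use": 0.6}),
    ModelProfile("large", 10.0, {"reasoning": 0.9, "code_generation": 0.9, "debugging": 0.9, "tool_use": 0.9}),
]

query = {"reasoning": 0.6, "code_generation": 0.7, "debugging": 0.5, "tool_use": 0.2}
hard_query = {"reasoning": 0.95, "code_generation": 0.95, "debugging": 0.95, "tool_use": 0.95}

strict_pick = route(query, CATALOG, threshold=0.0)   # "medium": cheapest with zero shortfall
relaxed_pick = route(query, CATALOG, threshold=0.5)  # "small": its shortfall of about 0.3 is tolerated
fallback_pick = route(hard_query, CATALOG)           # "large": no model qualifies, smallest gap wins
```

Under this reading, raising the threshold moves the router toward cheaper models, which is how a single knob could span the peak-quality, iso-quality, and aggressive regimes.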

If this is right

  • Peak-quality operation exceeds the always-strong baseline quality while cutting cost 12.9 percent on SWE-Bench Verified.
  • Iso-quality operation matches the strong baseline at 54.1 percent cost savings, six times the savings of a prior binary router.
  • Aggressive operation reaches 72.5 percent savings for a 3.2-point quality trade-off.
  • Results generalize to LiveCodeBench, BigCodeBench, and tau-bench.
  • Routing decisions remain effective across CJK, European, and other script families without language-specific changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could add new specialized models to the pool and begin using them immediately by editing only the configuration file.
  • The same shortfall logic could be applied to other production workloads such as retrieval-augmented generation or multi-step agent tasks.
  • Increasing the number of scored dimensions might tighten the quality-cost frontier on broader task distributions.
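The first extension above can be illustrated with a small configuration-driven catalog. The sketch assumes a JSON layout, field names, and model names that are invented for this example; the paper's actual configuration format is not shown on this page. It uses strict coverage (every dimension met) to keep the routing rule short.

```python
import json

DIMENSIONS = ["reasoning", "code_generation", "debugging", "tool_use"]

# Hypothetical configuration file contents.
CONFIG_TEXT = """
{
  "models": [
    {"name": "model-a", "cost": 1.0,
     "capabilities": {"reasoning": 0.4, "code_generation": 0.6, "debugging": 0.5, "tool_use": 0.3}},
    {"name": "model-b", "cost": 8.0,
     "capabilities": {"reasoning": 0.9, "code_generation": 0.9, "debugging": 0.9, "tool_use": 0.9}}
  ]
}
"""


def load_catalog(text: str) -> list:
    """Parse the config and check that every model declares all four dimensions."""
    models = json.loads(text)["models"]
    for m in models:
        missing = [d for d in DIMENSIONS if d not in m["capabilities"]]
        if missing:
            raise ValueError(f"{m['name']} is missing dimensions: {missing}")
    return models


def cheapest_covering(required: dict, catalog: list):
    """Return the name of the cheapest model that meets every requirement, or None."""
    for model in sorted(catalog, key=lambda m: m["cost"]):
        if all(model["capabilities"][d] >= required[d] for d in required):
            return model["name"]
    return None


QUERY = {"reasoning": 0.3, "code_generation": 0.5, "debugging": 0.4, "tool_use": 0.8}
before = cheapest_covering(QUERY, load_catalog(CONFIG_TEXT))  # "model-b"

# Add a tool-use specialist by editing only the configuration.
config = json.loads(CONFIG_TEXT)
config["models"].append({
    "name": "model-c", "cost": 2.0,
    "capabilities": {"reasoning": 0.5, "code_generation": 0.6, "debugging": 0.5, "tool_use": 0.9},
})
after = cheapest_covering(QUERY, load_catalog(json.dumps(config)))  # "model-c"
```

The routing function is untouched between the two calls; only the catalog data changes, which is the property the decoupling claim relies on.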

Load-bearing premise

That scoring a query on only four capability dimensions is enough to decide which model will actually succeed without needing direct performance labels for every model-query pair.

What would settle it

Measure actual task resolution rates on a held-out set of queries routed at different shortfall thresholds and check whether the observed quality-cost curve reproduces the three reported regimes versus always using the strongest model.
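A minimal sketch of such a check follows, assuming held-out records that carry both the predicted requirements and the observed outcome for each model. All data here is synthetic, the two-model pool and its costs are hypothetical, and the shortfall definition (summed per-dimension gap) is an assumption rather than the paper's stated formula.

```python
BASELINE = "strong"

# Hypothetical pool: capability tuples are (reasoning, code generation, debugging, tool use).
MODELS = {
    "weak": {"cost": 1.0, "caps": (0.4, 0.5, 0.4, 0.3)},
    "strong": {"cost": 10.0, "caps": (0.9, 0.9, 0.9, 0.9)},
}

# Synthetic held-out records: predicted needs plus observed resolution per model.
QUERIES = [
    {"need": (0.2, 0.3, 0.1, 0.1), "resolved": {"weak": True, "strong": True}},
    {"need": (0.5, 0.6, 0.4, 0.2), "resolved": {"weak": True, "strong": True}},
    {"need": (0.7, 0.8, 0.6, 0.4), "resolved": {"weak": False, "strong": True}},
    {"need": (0.9, 0.9, 0.8, 0.6), "resolved": {"weak": False, "strong": False}},
]


def shortfall(need, caps):
    return sum(max(0.0, n - c) for n, c in zip(need, caps))


def route(need, threshold):
    ok = [name for name, m in MODELS.items() if shortfall(need, m["caps"]) <= threshold]
    # If nothing qualifies, this sketch simply falls back to the baseline model.
    return min(ok, key=lambda name: MODELS[name]["cost"]) if ok else BASELINE


def evaluate(threshold):
    """Return (resolution rate, cost savings vs. always using the baseline)."""
    picks = [route(q["need"], threshold) for q in QUERIES]
    quality = sum(q["resolved"][p] for q, p in zip(QUERIES, picks)) / len(QUERIES)
    cost = sum(MODELS[p]["cost"] for p in picks)
    baseline_cost = MODELS[BASELINE]["cost"] * len(QUERIES)
    return quality, 1.0 - cost / baseline_cost


# Sweep the threshold and collect the observed quality-cost curve.
curve = {t: evaluate(t) for t in (0.0, 0.5, 1.0)}
```

On this toy data the curve keeps baseline quality at thresholds 0.0 and 0.5 while savings grow, then loses quality at 1.0. A real test would look for the same shape on held-out SWE-Bench Verified queries and compare the operating points with the three reported regimes.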

Figures

Figures reproduced from arXiv: 2605.17106 by Aashna Garg, Federico Brancasi, Jinu Jang, Shengyu Fu, Siddharth Singha Roy.

Figure 1
Figure 1: HyDRA architecture overview. (1) Input Construction: a 7-flag signal prefix is concatenated with the current user message and tokenized at a 512-token cap. The deployed predictor is single-turn: prior turn text is never fed to the model; conversation position is exposed only via the coarse turn-count bin in the signal prefix. (2) Capability Predictor: ModernBERT-base produces a [CLS] embedding. (3) Sigmoid…
Figure 2
Figure 2: Language-invariant routing quality. Quality retention by language group on the multilingual eval set (English N=3,191; European N=1,434; CJK N=22; Other N=175). HyDRA stays within ±4.3 points of its English baseline across all four groups (European actually exceeds English), confirming routing decisions depend on task complexity rather than input language. Always-Strong is 100% by definition.
Figure 3
Figure 3: SWE-Bench Verified routing decomposition
Figure 4
Figure 4: Cost-quality Pareto frontier on SWE-Bench
Figure 5
Figure 5: Per-model share of the ∼805K daily auto-mode routing decisions following the 100% rollout. The two cheapest 1P models combined absorb 43% of traffic; the strongest model (GPT-5.3 Codex) accounts for 21%. Placeholder: stacked time-series of (a) RPM, (b) p50/p95/p99 router latency, (c) per-model traffic share over a 7-day post-rollout window, captured from the CAPI Auto Intent dashboard
Figure 6
Figure 6: Production serving-tier metrics for the de
Figure 7
Figure 7: Prototype routing explainability inspector for
Figure 8
Figure 8: Focused INT8/multilingual ASR. ML-En/ML-Local are multilingual prompts with English/localized suffixes. [Plot: frontier ASR (%) across conditions S1 to S5; series INT8, FP32, FP16, ML-En, ML-Local]
Figure 9
Figure 9: All-condition ASR. FP32/FP16 are frozen-corpus English sensitivity checks. Heatmap-style summary. Tables 20–21 pack ASR and cost ratio into each condition cell. Darker cells indicate larger frontier ASR, so the qualitative pattern is visible even when the exact numbers are small
Original abstract

Production LLM deployments increasingly maintain heterogeneous model pools spanning order-of-magnitude cost differences. Existing routers make binary strong-vs-weak decisions and couple learned parameters to specific model identities, requiring retraining whenever the catalog changes. We present HyDRA (Hybrid Dynamic Routing Architecture), a framework that predicts fine-grained, multi-dimensional capability requirements per query and matches them against configuration-defined model profiles via shortfall matching. A ModernBERT encoder with K=4 independent sigmoid heads scores each query along reasoning, code generation, debugging, and tool use; a shortfall-matching algorithm then selects the cheapest model whose capabilities meet the predicted requirements. The deployed predictor runs at 86 ms median CPU inference latency in production, and is fully decoupled from the model catalog -- adding or removing models requires only a configuration change, with zero retraining. On SWE-Bench Verified (5-model pool: GPT-5.4-mini, Claude Haiku 4.5, GPT-5.3 Codex, Claude Sonnet 4.6, GPT-5.4), HyDRA's tunable shortfall threshold spans three regimes: peak-quality exceeds the always-strong Claude Sonnet 4.6 baseline (75.4% vs. 74.2% resolution) at 12.9% cost savings; iso-quality matches Sonnet at 54.1% cost savings, a 6x improvement over our prior in-house binary router at 9.1%; aggressive pushes savings to 72.5% for a 3.2-point quality trade. Results generalize across LiveCodeBench, BigCodeBench, and tau-bench. HyDRA is deployed to all users in GitHub Copilot's VS Code Chat auto-mode and -- to our knowledge for the first time in the LLM routing literature -- demonstrates language-invariant routing across CJK, European, and other script families.
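The headline ratios quoted in the abstract can be checked with a few lines of arithmetic. The numbers below are copied from the abstract; the variable names are made up for this check.

```python
# Figures quoted in the abstract (SWE-Bench Verified).
iso_quality_savings = 54.1    # percent cost savings at iso-quality
prior_router_savings = 9.1    # percent cost savings of the prior binary router
peak_resolution = 75.4        # percent resolved, peak-quality regime
baseline_resolution = 74.2    # percent resolved, always-strong Claude Sonnet 4.6

# Ratio behind the "6x improvement" claim.
savings_ratio = iso_quality_savings / prior_router_savings

# Quality gain of the peak-quality regime over the baseline, in points.
peak_gain_points = peak_resolution - baseline_resolution
```

The ratio comes out at roughly 5.95, which the abstract rounds to 6x, and the peak-quality regime is about 1.2 points above the baseline.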

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HyDRA, a routing framework for heterogeneous LLM pools that decouples a query-level capability predictor from the model catalog. A ModernBERT encoder with K=4 independent sigmoid heads predicts requirements along reasoning, code generation, debugging, and tool use; shortfall matching then selects the cheapest model whose static profile meets or exceeds the predicted scores. The router requires only a configuration change when models are added or removed. On SWE-Bench Verified (5-model pool), three operating regimes are reported: peak quality of 75.4% (exceeding Claude Sonnet 4.6 at 74.2%) with 12.9% cost savings; iso-quality matching at 54.1% savings (6x prior binary router); and aggressive mode at 72.5% savings for a 3.2-point quality drop. Generalization is claimed across LiveCodeBench, BigCodeBench, and tau-bench, with production deployment in GitHub Copilot and language-invariant behavior across scripts.

Significance. If the central results hold, the work is significant for production LLM serving. The explicit decoupling of the learned predictor from model identities removes the retraining requirement that limits prior routers, and the tunable shortfall threshold provides concrete, controllable quality-cost operating points with large reported savings. The 86 ms median CPU latency, real deployment, and cross-script invariance are practical strengths. The approach also supplies a falsifiable prediction mechanism (capability scores must correlate with per-model success) that future work can test directly.

major comments (2)
  1. Abstract and §4 (Results): The headline regimes (75.4% peak quality at 12.9% savings, 54.1% iso-quality savings) rest on shortfall matching between the four predicted capability scores and static model profiles. No table or figure shows the empirical correlation between these four ModernBERT outputs and actual per-model resolution success on SWE-Bench Verified or the other benchmarks. Without this validation, it is unclear whether the observed trade-offs are driven by genuine capability prediction or by incidental alignment with the particular 5-model pool and benchmark distribution.
  2. §3.1 (Capability Encoder): The selection of exactly the four dimensions (reasoning, code generation, debugging, tool use) and K=4 heads is presented as given, with no ablation on alternative dimension sets or on the effect of removing any head. If one or more dimensions are weakly predictive of success for the models in the pool, the shortfall-matching rule could systematically under- or over-estimate requirements, undermining the claim that the router generalizes beyond the reported benchmarks.
minor comments (2)
  1. Table 1: Cost savings and quality are both reported to one decimal place; clarify whether these are means over multiple runs and whether error bars or standard deviations are available.
  2. §5 (Deployment): The 86 ms median CPU latency is stated without the hardware configuration, batch size, or sequence-length distribution used for the measurement; add these details for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and describe the revisions we plan to make.

read point-by-point responses
  1. Referee: Abstract and §4 (Results): The headline regimes (75.4% peak quality at 12.9% savings, 54.1% iso-quality savings) rest on shortfall matching between the four predicted capability scores and static model profiles. No table or figure shows the empirical correlation between these four ModernBERT outputs and actual per-model resolution success on SWE-Bench Verified or the other benchmarks. Without this validation, it is unclear whether the observed trade-offs are driven by genuine capability prediction or by incidental alignment with the particular 5-model pool and benchmark distribution.

    Authors: We agree that a direct empirical validation of the correlation between the four predicted capability scores and per-model resolution success would strengthen the manuscript. While the reported generalization across LiveCodeBench, BigCodeBench, and tau-bench, together with the production deployment results, provides indirect support, we will add a new figure and accompanying analysis in the revised version that reports Pearson correlations and scatter plots between each predicted dimension and observed per-model success rates on SWE-Bench Verified. revision: yes

  2. Referee: §3.1 (Capability Encoder): The selection of exactly the four dimensions (reasoning, code generation, debugging, tool use) and K=4 heads is presented as given, with no ablation on alternative dimension sets or on the effect of removing any head. If one or more dimensions are weakly predictive of success for the models in the pool, the shortfall-matching rule could systematically under- or over-estimate requirements, undermining the claim that the router generalizes beyond the reported benchmarks.

    Authors: The four dimensions were selected to align with the core capabilities needed for the coding, debugging, and tool-use tasks in our benchmarks and GitHub Copilot deployment. We acknowledge that no ablation on the number or choice of dimensions was included. We will add an ablation study to the appendix of the revised manuscript that measures the effect of using subsets of the heads on routing accuracy, cost savings, and generalization. revision: yes
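The correlation analysis promised in the first response could look roughly like the following. This is a sketch with synthetic numbers: the predicted scores and the 0/1 resolution outcomes are invented, and the helper `pearson` is written out by hand so the example needs only the standard library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic example: predicted debugging requirement per query, and whether a
# hypothetical weak model resolved that query (1) or not (0).
predicted_debugging = [0.1, 0.3, 0.5, 0.7, 0.9]
weak_model_resolved = [1, 1, 1, 0, 0]

r = pearson(predicted_debugging, weak_model_resolved)
```

If the predictor carries real signal, a weak model's success should correlate negatively with the predicted requirement, as it does in this toy data. Repeating the computation per dimension and per model would give the table the referee asks for.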

Circularity Check

0 steps flagged

Derivation is self-contained with no circular reductions

full rationale

The paper describes a hybrid routing system where a ModernBERT model with four sigmoid heads predicts capability scores for queries, which are then matched to static model profiles using shortfall matching. The performance claims, including cost savings and quality metrics on SWE-Bench Verified, LiveCodeBench, BigCodeBench, and tau-bench, are presented as empirical results from deploying this system. These outcomes are measured against external baselines such as always selecting Claude Sonnet 4.6 and compared to a prior router for context. No part of the central derivation or results reduces by construction to the fitted parameters or relies on self-citations in a way that makes the claims tautological. The architecture is explicitly decoupled from specific model identities, requiring only configuration changes for catalog updates. The evaluation uses independent benchmark resolution rates, making the reported trade-offs falsifiable and not internally defined.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on the learned predictor generalizing across queries and on the manually defined model profiles accurately reflecting real capabilities.

free parameters (2)
  • shortfall threshold
    Tunable parameter that selects among the three quality-cost regimes reported on SWE-Bench.
  • K=4 capability heads
    Number of independent sigmoid outputs chosen for the ModernBERT encoder.
axioms (1)
  • domain assumption: Query requirements can be usefully decomposed into the four independent dimensions of reasoning, code generation, debugging, and tool use.
    This decomposition is used both to train the encoder and to define model profiles for matching.

pith-pipeline@v0.9.0 · 5885 in / 1585 out tokens · 84472 ms · 2026-05-20T15:07:22.681735+00:00 · methodology


