pith. machine review for the scientific record.

arxiv: 2605.07395 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:44 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords multi-LLM routing · unsolvability ceiling · evaluation artifacts · LLM-as-a-judge · exact-match metrics · router training · benchmarks

The pith

A substantial portion of reported unsolvability in multi-LLM routing arises from evaluation artifacts rather than true model limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why many queries appear unsolvable when routing across pools of large language models. It attributes much of the reported ceiling to three evaluation problems: judges that favor longer answers over correct ones, generations cut off by fixed token budgets, and answers that fail to match expected output formats. Applying dual-judge checks and exact-match scoring lowers the unsolvability numbers on six standard benchmarks with two model families. This matters because the inflated numbers distort the data used to train routers, causing them to default to the cheapest model and miss better cost-quality tradeoffs. The authors supply a decomposition that isolates each artifact and run control experiments that confirm the distortion in routing performance.

Core claim

Prior work attributes routing headroom to an unsolvability ceiling of queries no model can solve. Evaluating 206,000 query-model pairs with both LLM-as-a-judge and exact-match metrics shows that a substantial portion of reported unsolvability stems from evaluation artifacts: systematic judge biases favoring verbosity over correctness, truncation under fixed generation budgets, and output format mismatches. Dual-judge validation and exact-match grounding reduce measured unsolvability across tasks, while standard routers collapse to majority-class prediction (the smallest tier is optimal for roughly 79 percent of queries) and incur a 13-17 percentage point opportunity cost.
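
To make the corrected protocol concrete, here is a minimal sketch of how a single (query, model) response might be relabeled under dual-judge validation with exact-match grounding. The function names, judge interface, and artifact flag are illustrative assumptions, not the paper's released code.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, trim, and drop trailing punctuation before exact-match comparison."""
    return re.sub(r"[\s.,:;]+$", "", text.strip().lower())

def relabel(response: str, reference: str,
            judge_a_correct: bool, judge_b_correct: bool,
            truncated: bool) -> dict:
    """Combine exact-match grounding with two independent judge verdicts.

    Returns a corrected solvability label plus a flag for responses whose
    original 'unsolved' status looks like an evaluation artifact.
    """
    exact = normalize(response) == normalize(reference)

    # Solved if the answer matches the reference or both judges agree it is correct.
    solved = exact or (judge_a_correct and judge_b_correct)

    # Judge disagreement, or a truncated generation graded as wrong, is treated as
    # a suspected artifact rather than evidence that the query is unsolvable.
    suspected_artifact = (judge_a_correct != judge_b_correct) or (truncated and not solved)

    return {"solved": solved, "suspected_artifact": suspected_artifact}
```

On this reading, a query contributes to the unsolvability ceiling only if every model in the pool remains unsolved after the correction.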

What carries the argument

Decomposition framework that isolates failures into judge bias, truncation, and format mismatch categories, then corrects them via dual-judge validation and exact-match grounding.
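
A hedged sketch of how such a decomposition might look in code; the field names and the precedence order (truncation first, then format, then judge bias) are assumptions for illustration, not the authors' exact procedure.

```python
from collections import Counter

def attribute_failure(pair: dict) -> str:
    """Attribute one originally 'unsolved' (query, model) pair to an artifact category.

    Expected keys (all illustrative): finish_reason, normalized_match (a lenient
    answer extraction agrees with the reference), judge_a_correct (original judge),
    judge_b_correct (second judge).
    """
    if pair["finish_reason"] == "length":
        return "truncation"        # generation cut off by the fixed token budget
    if pair["normalized_match"]:
        return "format_mismatch"   # right answer, wrong output format
    if not pair["judge_a_correct"] and pair["judge_b_correct"]:
        return "judge_bias"        # second judge overturns the original verdict
    return "genuine_failure"       # survives every correction: plausibly unsolvable

def decompose(unsolved_pairs: list[dict]) -> Counter:
    """Count how many originally unsolved pairs fall into each category."""
    return Counter(attribute_failure(p) for p in unsolved_pairs)
```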

If this is right

  • Routers trained on artifact-contaminated labels collapse to majority-class prediction, defaulting to the smallest tier, which is nominally optimal for roughly 79 percent of queries (see the sketch after this list).
  • Routing decisions incur a 13-17 percentage point opportunity cost when evaluation artifacts are ignored.
  • Unsolvability rates drop consistently once verbosity bias, truncation, and format errors are removed.
  • Cost-sensitive router objectives become feasible only after evaluation artifacts are controlled.
  • Actionable fixes include dual-judge validation, exact-match anchoring, and revised training objectives.
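
The collapse and opportunity-cost figures in the list above reduce to two simple quantities over per-query labels. A minimal sketch, with made-up field values, of how they might be computed; it is not the paper's evaluation harness.

```python
def majority_class_rate(optimal_tiers: list[str]) -> float:
    """Fraction of queries whose cheapest adequate model is the smallest tier.

    A value near 0.79 means a router that always predicts the smallest tier
    already matches the majority class, which is the collapse described above.
    """
    return sum(tier == "small" for tier in optimal_tiers) / len(optimal_tiers)

def opportunity_cost_pp(oracle_quality: list[float], router_quality: list[float]) -> float:
    """Average per-query quality gap, in percentage points, between oracle routing
    (always choosing the best adequate tier) and the learned router."""
    gaps = [o - r for o, r in zip(oracle_quality, router_quality)]
    return 100.0 * sum(gaps) / len(gaps)
```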

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • More reliable solvability labels could let deployed routing systems achieve tighter cost-quality tradeoffs than current estimates suggest.
  • The same three artifacts likely affect LLM evaluation in domains beyond routing, such as agent benchmarks or instruction following.
  • Future multi-model studies should report both original and artifact-corrected solvability numbers to allow direct comparison.

Load-bearing premise

Dual-judge validation and exact-match grounding supply a more accurate measure of true solvability than the single-judge or format-sensitive evaluations used in earlier routing studies.

What would settle it

Re-running the full set of benchmarks with dual judges and exact-match scoring while still observing unsolvability rates that match the original single-judge figures would falsify the central claim.

original abstract

Efficient routing across multiple LLMs enables cost-quality tradeoffs by directing queries to the cheapest capable model. Prior work attributes routing headroom to an "unsolvability ceiling", queries no model in the pool can solve. We present a large-scale study of multi-tier LLM routing with 206,000 query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) using the Gemma 4 and Llama 3.1 families. Evaluating with both LLM-as-a-judge and exact-match metrics, we show that a substantial portion of reported unsolvability stems from evaluation artifacts: (i) systematic judge biases favoring verbosity over correctness, (ii) truncation under fixed generation budgets, and (iii) output format mismatches. Through dual-judge validation and exact-match grounding, we reduce measured unsolvability across tasks. We introduce a decomposition framework attributing failures to these artifacts, revealing consistent patterns across domains and model families. These artifacts also distort router training signals: standard routers collapse to majority-class prediction (~79% smallest-tier optimal), confirmed via random-feature and shuffled-label controls, incurring a 13-17 percentage point opportunity cost. We provide actionable recommendations including dual-judge validation, exact-match anchoring, and cost-sensitive objectives. Our findings suggest existing routing headroom estimates are substantially inflated, underscoring the need for reliable evaluation protocols in multi-LLM systems.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a substantial portion of the 'unsolvability ceiling' in multi-LLM routing stems from evaluation artifacts including LLM judge biases favoring verbosity, truncation under fixed generation budgets, and output format mismatches. Through a study of 206k query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) with Gemma and Llama families, dual-judge validation and exact-match metrics reduce measured unsolvability. The artifacts distort router training signals, causing collapse to majority-class prediction (~79% smallest-tier optimal) with 13-17pp opportunity cost, as shown by random-feature and shuffled-label controls. The work provides a failure decomposition framework and recommendations for dual-judge validation, exact-match anchoring, and cost-sensitive objectives.
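
For readers unfamiliar with the controls mentioned in the summary, here is a minimal sketch of a shuffled-label baseline, assuming a generic scikit-learn style classifier; the random-feature variant swaps in noise features instead. This is an editorial illustration of the diagnostic, not the authors' setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def shuffled_label_accuracy(features: np.ndarray, tier_labels: np.ndarray, seed: int = 0) -> float:
    """Train a router on permuted labels and report its held-out accuracy.

    The score estimates how much accuracy the label prior alone can deliver; a real
    router that only matches it has collapsed to majority-class prediction rather
    than learning query-dependent routing.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(tier_labels)
    X_tr, X_te, y_tr, y_te = train_test_split(features, shuffled, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```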

Significance. If the central claims hold after addressing calibration concerns, the work would be significant for LLM routing and evaluation research. The 206k-pair scale, dual validation, and controls (random features, shuffled labels) are strengths that credibly identify how artifacts inflate unsolvability estimates and degrade router performance. The decomposition framework offers a reusable tool for attributing failures across domains. This could lead to more reliable multi-LLM systems and revised headroom estimates, though the lack of human calibration for the new metrics tempers the immediate impact.

major comments (2)
  1. [Abstract] The claim that dual-judge validation and exact-match grounding provide a strictly more accurate measure of true solvability than prior single-judge or format-sensitive methods is load-bearing for the conclusion that routing headroom estimates are substantially inflated. However, without human calibration on a subset of recovered instances, it remains possible that the reductions reflect a shift in evaluation leniency rather than artifact removal, especially given potential correlated biases in LLM judges and exact-match limitations on open-ended or partial-credit tasks.
  2. [Results] The reported 13-17 percentage point opportunity cost from router collapse to majority-class prediction is central to the distortion claim, yet lacks variance estimates, confidence intervals, or significance tests across the six benchmarks and model families, limiting the ability to assess the robustness of the quantitative impact.
minor comments (2)
  1. The manuscript omits details on exact data splits, full exclusion criteria for the 206k pairs, and any statistical significance testing for unsolvability reductions, which would aid reproducibility.
  2. [Methodology] The decomposition framework would benefit from a formal definition, equation, or pseudocode to clarify how failures are attributed to specific artifacts.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

point-by-point responses
  1. Referee: [Abstract] The claim that dual-judge validation and exact-match grounding provide a strictly more accurate measure of true solvability than prior single-judge or format-sensitive methods is load-bearing for the conclusion that routing headroom estimates are substantially inflated. However, without human calibration on a subset of recovered instances, it remains possible that the reductions reflect a shift in evaluation leniency rather than artifact removal, especially given potential correlated biases in LLM judges and exact-match limitations on open-ended or partial-credit tasks.

    Authors: We acknowledge that human calibration would offer the most direct validation. Our dual-judge and exact-match protocol is designed to mitigate specific, documented artifacts (verbosity bias in judges, truncation from fixed budgets, and format mismatches) rather than to assert absolute ground truth. Reductions are consistent across six benchmarks spanning multiple domains and model families, and the router-training controls (random features and shuffled labels) independently show that artifact removal alters training signals. We will revise the abstract and add an explicit limitations paragraph noting the absence of human calibration, rephrasing the claim to emphasize robustness against known artifacts instead of 'strictly more accurate.' New human evaluations are outside the scope of this empirical study. revision: partial

  2. Referee: [Results] The reported 13-17 percentage point opportunity cost from router collapse to majority-class prediction is central to the distortion claim, yet lacks variance estimates, confidence intervals, or significance tests across the six benchmarks and model families, limiting the ability to assess the robustness of the quantitative impact.

    Authors: We agree this statistical detail is necessary. The 13-17pp figure is an aggregate; we will expand the Results section with per-benchmark and per-family breakdowns, reporting standard deviations, bootstrap-derived 95% confidence intervals, and appropriate significance tests (e.g., paired Wilcoxon tests) to quantify variability and robustness. revision: yes
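
A minimal sketch of the kind of percentile bootstrap interval the response describes, applied to per-query opportunity-cost gaps for one benchmark; the array contents and parameter choices are placeholders, not the authors' numbers.

```python
import numpy as np

def bootstrap_ci(per_query_gaps: np.ndarray, n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean opportunity cost.

    per_query_gaps holds, for one benchmark, the oracle-minus-router quality gap
    for each query (in percentage points); queries are resampled with replacement.
    """
    rng = np.random.default_rng(seed)
    n = len(per_query_gaps)
    means = np.array([rng.choice(per_query_gaps, size=n, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```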

standing simulated objections not resolved
  • Human calibration of dual-judge and exact-match metrics on a subset of recovered instances to rule out leniency shifts.

Circularity Check

0 steps flagged

No circularity: empirical measurements with external controls

full rationale

The paper conducts a large-scale empirical study across 206k query-model pairs on six benchmarks, comparing LLM-as-a-judge, dual-judge, and exact-match metrics to quantify evaluation artifacts. Claims about reduced unsolvability and distorted router signals rest on direct experimental contrasts (including random-feature and shuffled-label controls) rather than any self-definitional loop, fitted parameter renamed as prediction, or self-citation chain. The decomposition framework simply attributes observed failures to the measured artifacts; no equation or result reduces to its own inputs by construction. This is a standard self-contained empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of standard benchmarks as proxies for real queries and the assumption that combined dual-judge plus exact-match metrics are superior to prior evaluation methods, without introducing new free parameters or entities.

axioms (2)
  • domain assumption The six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) represent a sufficient sample of query types for measuring unsolvability.
    Invoked when generalizing findings from these specific tasks to overall routing headroom.
  • domain assumption LLM-as-a-judge metrics, when debiased via dual validation, better approximate ground-truth correctness than single-judge setups.
    Central to the claim that artifacts are reduced and unsolvability measurements improved.

pith-pipeline@v0.9.0 · 5559 in / 1476 out tokens · 51714 ms · 2026-05-11T01:44:25.696846+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
