Pith · machine review for the scientific record

arXiv: 2604.07494 · v1 · submitted 2026-04-08 · 💻 cs.SE · cs.AI · cs.LG

Recognition: unknown

Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:11 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords LLM routing · code quality metrics · software maintainability · cost optimization · model selection · AI coding agents · verification gates · SWE-bench

The pith

Code health metrics can route software engineering tasks to the cheapest LLM tier that still passes verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AI coding agents currently send every task to the most expensive frontier model even when the work is routine. Triage instead measures indicators of code maintainability in advance and assigns each task to the lowest-cost model tier whose output meets the same verification standard. The authors derive two concrete conditions under which this produces savings: the light tier must succeed on healthy code often enough to beat the cost ratio between tiers, and code health must separate the required tiers with at least a small effect size. They design an evaluation on the 300 tasks of SWE-bench Lite comparing heuristic rules, a trained classifier, and a perfect-hindsight oracle policy. The result is an explicit evaluation protocol that turns a diagnostic code-quality score into a practical model-selection signal.

Core claim

Triage defines three LLM capability tiers and routes tasks based on pre-computed code health sub-factors and task metadata so that each task reaches the cheapest tier whose output passes the same verification gate as the frontier model. The paper analytically derives two falsifiable conditions for cost-effectiveness: the light-tier pass rate on healthy code must exceed the inter-tier cost ratio, and code health must discriminate the needed tier with at least a small effect size. Evaluation on SWE-bench Lite compares a heuristic-threshold policy, a trained ML classifier, and a perfect-hindsight oracle to quantify the cost-quality trade-off and identify which health sub-factors drive the routing decisions.
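
The first condition admits a one-line expected-cost reading. The reconstruction below is Pith's sketch, not the authors' derivation; it assumes a policy that tries the light tier first (per-task cost c_L, pass rate p on healthy code) and falls back to a heavier tier (cost c_H) only when verification fails.

```latex
% Reconstruction under the assumption above; all symbols are illustrative.
% Expected cost of "try light, escalate on failure" versus always paying c_H:
\[
  \mathbb{E}[\mathrm{cost}] \;=\; c_L + (1 - p)\,c_H \;<\; c_H
  \quad\Longleftrightarrow\quad
  p \;>\; \frac{c_L}{c_H},
\]
% i.e. routing pays off exactly when the light-tier pass rate on healthy code
% exceeds the inter-tier cost ratio, matching the paper's stated condition.
```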

What carries the argument

Triage, the routing framework that converts code health sub-factors into tier assignments while enforcing a common verification gate across tiers.
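
A minimal sketch of that mechanism, assuming a single scalar health score per task and an escalate-on-failure loop; the tier names, thresholds, and helper callables below are illustrative, not details taken from the paper.

```python
# Hypothetical Triage-style routing loop: pick the cheapest plausible tier from
# a code health score, then escalate until the shared verification gate passes.

TIERS = ["light", "standard", "heavy"]  # cheapest first

def initial_tier(health: float, light_cut: float = 0.8, standard_cut: float = 0.5) -> str:
    """Heuristic-threshold policy: healthier code starts on a cheaper tier."""
    if health >= light_cut:
        return "light"
    if health >= standard_cut:
        return "standard"
    return "heavy"

def triage(task, health, run_model, passes_verification):
    """Return the first (tier, patch) whose output passes the common gate."""
    start = TIERS.index(initial_tier(health))
    for tier in TIERS[start:]:
        patch = run_model(tier, task)          # call that tier's LLM
        if passes_verification(task, patch):   # identical gate for every tier
            return tier, patch
    return "heavy", patch                      # frontier output, even if unverified
```

Escalation on verification failure is one way to keep the quality bar constant across tiers; the paper's three policies (heuristic thresholds, ML classifier, oracle) differ in how the tier is chosen, while the verification gate stays the same.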

If this is right

  • Heuristic thresholds on code health sub-factors can already produce measurable cost reductions on routine tasks.
  • A trained classifier can learn to route more accurately than fixed thresholds.
  • Certain code health sub-factors will turn out to be stronger predictors of required tier than others (one way to check this is sketched after this list).
  • The same verification gate across tiers guarantees that quality does not degrade when cheaper models are used.
  • The evaluation protocol can be reused to test new health metrics or new tier definitions.
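
One way to run the sub-factor check referenced above, and to test the paper's second condition, is to compute a probability-of-superiority effect size per sub-factor between tasks the light tier solved and tasks that needed a heavier tier. Reading the paper's p-hat as this statistic is Pith's assumption (the abstract does not define it); the field names below are illustrative, and 0.56 is the abstract's stated threshold.

```python
# Hypothetical sketch: rank code-health sub-factors by how well they separate
# tasks the light tier solved from tasks that required a heavier tier.

def prob_superiority(solved, unsolved):
    """P(score_solved > score_unsolved) + 0.5 * P(tie), over all pairs."""
    pairs = [(a, b) for a in solved for b in unsolved]
    wins = sum((a > b) + 0.5 * (a == b) for a, b in pairs)
    return wins / len(pairs)

def discriminating_subfactors(tasks, threshold=0.56):
    """tasks: dicts like {"subfactors": {...}, "light_passed": bool} (illustrative schema)."""
    scores = {}
    for name in tasks[0]["subfactors"]:
        solved = [t["subfactors"][name] for t in tasks if t["light_passed"]]
        unsolved = [t["subfactors"][name] for t in tasks if not t["light_passed"]]
        scores[name] = prob_superiority(solved, unsolved)
    return {n: s for n, s in scores.items() if s >= threshold}
```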

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adopting the approach would let coding agents run at lower average inference cost while keeping the same final correctness check.
  • Teams that keep their codebases clean would gain a direct economic benefit through cheaper AI assistance.
  • The same routing logic could be tested on non-coding agent tasks where analogous quality or complexity signals exist.

Load-bearing premise

Code health metrics must discriminate tasks that a light-tier model can solve from those that require a heavier model with enough accuracy to offset the cost difference.

What would settle it

Measure the light-tier model's pass rate on a set of tasks whose code health scores are above a chosen threshold and check whether that rate falls below the cost ratio between the light tier and the next tier.
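
A minimal sketch of that measurement, assuming per-task records of (health score, light-tier pass/fail) and per-task prices for the two cheapest tiers; every name and number below is illustrative rather than drawn from the paper.

```python
# Check the paper's first condition: on tasks above a health threshold, does the
# light tier's verified pass rate exceed the light/standard cost ratio?

def light_tier_pays_off(results, health_threshold, cost_light, cost_standard):
    healthy = [passed for health, passed in results if health >= health_threshold]
    if not healthy:
        return False  # no healthy tasks observed; nothing to route down
    pass_rate = sum(healthy) / len(healthy)
    return pass_rate > cost_light / cost_standard

# Toy example: a 75% pass rate on healthy tasks against a 1:5 price gap.
observations = [(0.90, True), (0.85, True), (0.95, False), (0.70, True), (0.92, True)]
print(light_tier_pays_off(observations, health_threshold=0.8, cost_light=1.0, cost_standard=5.0))
```

If the measured rate instead falls below the cost ratio, the load-bearing premise fails for that threshold.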

read the original abstract

Context: AI coding agents route every task to a single frontier large language model (LLM), paying premium inference cost even when many tasks are routine. Objectives: We propose Triage, a framework that uses code health metrics -- indicators of software maintainability -- as a routing signal to assign each task to the cheapest model tier whose output passes the same verification gate as the expensive model. Methods: Triage defines three capability tiers (light, standard, heavy -- mirroring, e.g., Haiku, Sonnet, Opus) and routes tasks based on pre-computed code health sub-factors and task metadata. We design an evaluation comparing three routing policies on SWE-bench Lite (300 tasks across three model tiers): heuristic thresholds, a trained ML classifier, and a perfect-hindsight oracle. Results: We analytically derived two falsifiable conditions under which the tier-dependent asymmetry (medium LLMs benefit from clean code while frontier models do not) yields cost-effective routing: the light-tier pass rate on healthy code must exceed the inter-tier cost ratio, and code health must discriminate the required model tier with at least a small effect size ($\hat{p} \geq 0.56$). Conclusion: Triage transforms a diagnostic code quality metric into an actionable model-selection signal. We present a rigorous evaluation protocol to test the cost--quality trade-off and identify which code health sub-factors drive routing decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Triage, a routing framework that uses pre-computed code health metrics (indicators of software maintainability) to assign software engineering tasks to one of three LLM tiers (light, standard, heavy) such that the cheapest tier whose output still passes the same verification gate as a frontier model is selected. It analytically derives two falsifiable conditions for cost-effective routing and outlines an evaluation protocol that compares heuristic thresholds, a trained ML classifier, and a perfect-hindsight oracle on SWE-bench Lite.

Significance. If the derived conditions prove valid and the protocol yields measurable cost savings without loss of verification pass rate, the work could meaningfully reduce inference expenditure for AI coding agents on routine tasks. The emphasis on analytically derived, falsifiable conditions rather than fitted heuristics, together with an explicit three-policy comparison protocol, supplies a clear path for reproducible follow-up experiments.

major comments (2)
  1. [Results] Results section: the two conditions (light-tier pass rate on healthy code exceeding the inter-tier cost ratio; code-health discrimination with p-hat >= 0.56) are presented as analytically derived, yet no equations, derivation steps, or explicit assumptions appear in the manuscript, preventing verification that the conditions are parameter-free or independent of the evaluation data.
  2. [Methods] Methods / Evaluation protocol: the protocol is described at a high level (heuristic thresholds, ML classifier, oracle on SWE-bench Lite) but supplies no concrete definitions of the code-health sub-factors, the feature set for the ML policy, the exact cost and quality metrics, or the statistical test for the p-hat threshold, all of which are load-bearing for the claim that the protocol can identify which sub-factors drive routing decisions.
minor comments (2)
  1. [Abstract] Abstract: the tier examples (Haiku, Sonnet, Opus) are given but the manuscript never states the precise model identifiers or context-length/cost values used for the three tiers in the SWE-bench Lite experiments.
  2. [Abstract] The abstract claims the conditions are 'falsifiable' but does not indicate the exact statistical procedure or data split that would constitute a falsification test.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's potential impact and for the constructive major comments. We address each point below and commit to expanding the manuscript with the requested analytical and methodological details.

read point-by-point responses
  1. Referee: [Results] Results section: the two conditions (light-tier pass rate on healthy code exceeding the inter-tier cost ratio; code-health discrimination with p-hat >= 0.56) are presented as analytically derived, yet no equations, derivation steps, or explicit assumptions appear in the manuscript, preventing verification that the conditions are parameter-free or independent of the evaluation data.

    Authors: We agree that the derivation steps and assumptions were omitted, which prevents independent verification. The two conditions follow directly from the cost-effectiveness inequality under the assumption of a fixed verification gate: cost savings occur when the light-tier pass rate on healthy code exceeds the inter-tier cost ratio, and the discrimination power must satisfy a minimum effect-size threshold (derived via power analysis for the chosen sample size). We will add a new subsection to the Results section containing the full derivation, the complete list of assumptions (identical verification across tiers, fixed per-token pricing, and no dependence on task-specific distributions), and a proof that the conditions are parameter-free and independent of the SWE-bench Lite data. This will make the claims fully verifiable. revision: yes

  2. Referee: [Methods] Methods / Evaluation protocol: the protocol is described at a high level (heuristic thresholds, ML classifier, oracle on SWE-bench Lite) but supplies no concrete definitions of the code-health sub-factors, the feature set for the ML policy, the exact cost and quality metrics, or the statistical test for the p-hat threshold, all of which are load-bearing for the claim that the protocol can identify which sub-factors drive routing decisions.

    Authors: We acknowledge that the evaluation protocol was presented at a high level. In the revised manuscript we will expand the Methods section to include: explicit definitions of each code-health sub-factor (cyclomatic complexity, maintainability index, duplication ratio, and test coverage as computed by the static-analysis pipeline); the complete feature vector for the ML policy (the sub-factors plus task metadata such as file size and dependency count); precise cost metrics (current API token prices for the three tiers); quality metrics (binary pass/fail on the verification suite); and the exact statistical procedure for the p-hat threshold (one-sided binomial test with effect-size justification). These additions will render the protocol fully reproducible and allow direct analysis of which sub-factors drive routing decisions. revision: yes
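
As a concrete illustration of the feature set and policy proposed in the second response above (Pith's sketch: the sub-factor and metadata names come from the rebuttal, while the classifier choice, label encoding, and field names are assumptions):

```python
# Hypothetical prototype of the rebuttal's ML routing policy: features are the
# four named code-health sub-factors plus task metadata; the label is the
# cheapest tier observed to pass verification on a training split.
from sklearn.linear_model import LogisticRegression

FEATURES = [
    "cyclomatic_complexity", "maintainability_index",
    "duplication_ratio", "test_coverage",        # code-health sub-factors
    "file_size", "dependency_count",             # task metadata
]
TIERS = {"light": 0, "standard": 1, "heavy": 2}

def fit_router(tasks):
    """tasks: dicts carrying the FEATURES keys plus 'cheapest_passing_tier'."""
    X = [[t[f] for f in FEATURES] for t in tasks]
    y = [TIERS[t["cheapest_passing_tier"]] for t in tasks]
    return LogisticRegression(max_iter=1000).fit(X, y)

def route(model, task):
    """Predict the cheapest tier expected to clear the verification gate."""
    index = int(model.predict([[task[f] for f in FEATURES]])[0])
    return {v: k for k, v in TIERS.items()}[index]
```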

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper analytically derives two falsifiable conditions (light-tier pass rate exceeding inter-tier cost ratio; code health discrimination with p-hat >= 0.56) as mathematical requirements for cost-effective routing under the stated tier asymmetry, without reducing them to fitted parameters, self-definitions, or self-citations. The evaluation protocol compares heuristic, ML, and oracle policies on the external SWE-bench Lite benchmark, treating code health metrics as independent inputs transformed into routing signals. No load-bearing step equates a prediction to its own inputs by construction, and the framework remains self-contained against external benchmarks with no imported uniqueness theorems or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters or invented entities; the approach rests on the unverified assumption that code health metrics correlate with tier-specific performance differences.

axioms (1)
  • domain assumption Code health metrics can reliably indicate when a lighter LLM tier will produce output that passes the same verification as a heavier tier.
    This assumption underpins the entire routing decision but receives no empirical backing in the abstract.

pith-pipeline@v0.9.0 · 5556 in / 1248 out tokens · 45626 ms · 2026-05-10T17:11:33.663741+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

9 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1] Markus Borg, Nadim Hagatulah, Adam Tornhill, and Emma Söderberg. Code for machines, not just humans: Quantifying AI-friendliness with code health metrics, 2026. URL https://arxiv.org/abs/2601.02200

  2. [2] Yi Chen, JiaHao Zhao, and HaoHao Han. A survey on collaborative mechanisms between large and small language models, 2025. URL https://arxiv.org/abs/2505.07460

  3. [3] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing, 2024. URL https://arxiv.org/abs/2404.14618

  4. [4] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In Proceedings of the International Conference on Learning Representations (ICLR), 2024

  5. [5] Barbara Kitchenham and Lech Madeyski. Recommendations for analysing and meta-analysing small sample size software engineering experiments. Empirical Software Engineering, 29(6):137, 2024

  6. [6] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), pages 4765--4774, 2017

  7. [7] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs with preference data, 2024. URL https://arxiv.org/abs/2406.18665

  8. [8] Adam Tornhill and Markus Borg. Code red: The business impact of code quality -- a quantitative study of 39 proprietary production codebases. In Proceedings of the International Conference on Technical Debt (TechDebt '22), pages 11--20. ACM, 2022. doi:10.1145/3524843.3528091. URL https://doi.org/10.1145/3524843.3528091

  9. [9] Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, and Zhuokai Zhao. Token-level LLM collaboration via FusionRoute, 2026. URL https://arxiv.org/abs/2601.05106