Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment

Charles Lee; Warren Johnson

arxiv: 2604.02367 · v1 · submitted 2026-03-26 · 💻 cs.NI · cs.CL

Pith reviewed 2026-05-15 00:35 UTC · model grok-4.3

keywords small language models · model routing · task classification · inference optimization · latency benchmarks · accuracy evaluation · synthetic traffic

The pith

Small language models can classify inputs for model routing at sub-second median latency and zero marginal cost, but no tested model meets the accuracy and latency thresholds for standalone use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether small language models are capable enough to make the routing decision that chooses which larger model handles a given query. This matters because current routing often depends on costly large models, which makes the decision itself expensive, whereas a capable SLM could make routing essentially free. In a harmonized benchmark on identical hardware, Qwen-2.5-3B shows the best exact-match accuracy (0.783) and the strongest latency-accuracy tradeoff among the self-hosted models tested. In a follow-up randomized experiment under synthetic traffic it reaches 0.793 accuracy at a 988 ms median latency with no marginal cost. DeepSeek-V3 scores higher at 0.830 but fails the P95 latency gate at 2,295 ms. No model satisfies the pre-set viability bar of at least 0.85 accuracy and at most 2,000 ms P95 latency, and whether accurate routing improves final outputs is untested.

Core claim

Through offline benchmarking and a pre-registered synthetic-traffic experiment, the authors find that Qwen-2.5-3B delivers the strongest performance among self-hosted small language models for six-way task classification, with 0.793 accuracy, 988 ms median latency, and zero marginal cost, making it Pareto-dominant in that category. DeepSeek-V3 reaches 0.830 accuracy but violates the P95 latency gate at 2,295 ms. The cost and latency requirements for SLM-based front-door routing are satisfied, yet the accuracy shortfall of 6-8 points and the unproven connection to downstream quality gains leave standalone viability out of reach.

What carries the argument

Front-door routing via an SLM classifier on a six-label task taxonomy, evaluated for accuracy, latency, and cost under fixed hardware and randomized synthetic traffic.
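A minimal sketch of what front-door routing looks like in code, under stated assumptions: the six label names, the backend names, and the `keyword_stub` classifier are all hypothetical (this page does not list the paper's taxonomy), and a real deployment would call the self-hosted SLM where the stub sits.

```python
from typing import Callable, Dict

# Hypothetical six-label taxonomy; the paper's actual label names are not given here.
LABELS = ("code", "math", "summarization", "extraction", "chat", "translation")

# Hypothetical mapping from task label to a downstream model.
BACKENDS: Dict[str, str] = {label: f"backend-for-{label}" for label in LABELS}
DEFAULT_BACKEND = "general-purpose-backend"


def route(query: str, classify: Callable[[str], str]) -> str:
    """Pick a backend for the query; fall back when the label is not recognized."""
    label = classify(query).strip().lower()
    return BACKENDS.get(label, DEFAULT_BACKEND)


def keyword_stub(query: str) -> str:
    """Stand-in for the SLM call; a real router would prompt the small model here."""
    return "code" if "function" in query.lower() else "chat"


print(route("Write a function that sorts a list", keyword_stub))
```

The fallback to a default backend is a design choice of this sketch: a classifier that emits an out-of-taxonomy string should degrade to no-routing behavior rather than fail the request.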

If this is right

SLM-based routing decisions impose negligible additional cost and latency on the overall inference process.
Qwen-2.5-3B offers the best accuracy-latency-cost balance among the self-hosted models tested.
The cost and latency prerequisites for SLM routing are met, so the remaining obstacle is accuracy.
The pre-registered synthetic-traffic experiment yields consistent figures under controlled conditions, though it does not capture real-world traffic variability.
Routing using these models remains below the standalone viability threshold set at 0.85 accuracy and 2,000 ms P95 latency.
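The viability bar can be written as a two-condition gate. The sketch below assumes a nearest-rank definition of P95 (the page does not say which percentile convention the paper uses); the function names are illustrative, while the thresholds and the DeepSeek-V3 figures are the ones reported above.

```python
from typing import Sequence

MIN_ACCURACY = 0.85   # accuracy threshold from the viability criterion
MAX_P95_MS = 2000.0   # P95 latency threshold from the viability criterion


def p95_nearest_rank(latencies_ms: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method (an assumed convention)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def is_viable(accuracy: float, p95_ms: float) -> bool:
    """Both conditions must hold for standalone viability."""
    return accuracy >= MIN_ACCURACY and p95_ms <= MAX_P95_MS


# DeepSeek-V3 as reported: 0.830 accuracy, 2,295 ms P95.
print(is_viable(0.830, 2295.0))  # False: misses both thresholds
```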

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Improving SLM accuracy by a few points could make dynamic model selection practical in multi-model serving systems.
Direct measurement of end-to-end output quality with and without SLM routing would test whether classification correctness improves results.
SLM routers might apply to other inference-time decisions such as prompt optimization or tool selection.
Scaling the taxonomy or testing on real traffic could reveal gaps not captured in the synthetic setup.

Load-bearing premise

That routing a query to the model selected by correct task classification will yield better output quality than using a single model for all queries.

What would settle it

Measuring the quality of final model outputs in a setup that routes using the SLM classifier compared to a no-routing baseline or random routing to see if quality improves as expected.
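One way such a comparison could be analyzed, sketched under assumptions: each output receives a numeric quality score from some external grader (not specified by the paper), and the difference in mean score between the routed arm and the control arm is summarized with a percentile bootstrap. The function name and parameters are illustrative only.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def bootstrap_mean_diff(routed: Sequence[float], control: Sequence[float],
                        n_resamples: int = 2000, seed: int = 0
                        ) -> Tuple[float, float, float]:
    """Return (observed difference, 2.5th percentile, 97.5th percentile)."""
    rng = random.Random(seed)
    observed = mean(routed) - mean(control)
    diffs: List[float] = []
    for _ in range(n_resamples):
        # Resample each arm independently, with replacement.
        r = rng.choices(routed, k=len(routed))
        c = rng.choices(control, k=len(control))
        diffs.append(mean(r) - mean(c))
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    return observed, low, high
```

If the resulting interval excludes zero, that would be evidence that routing changes output quality; an interval straddling zero would leave the load-bearing premise unsupported at that sample size.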

Figures

Figures reproduced from arXiv: 2604.02367 by Charles Lee, Warren Johnson.

Original abstract

Selecting the appropriate model at inference time -- the routing problem -- requires jointly optimizing output quality, cost, latency, and governance constraints. Existing approaches delegate this decision to LLM-based classifiers or preference-trained routers that are themselves costly and high-latency, reducing a multi-objective optimization to single-dimensional quality prediction. We argue that small language models (SLMs, 1-4B parameters) have now achieved sufficient reasoning capability for sub-second, zero-marginal-cost, self-hosted task classification, potentially making the routing decision negligible in the inference budget. We test this thesis on a six-label taxonomy through two studies. Study 1 is a harmonized offline benchmark of Phi-3.5-mini, Qwen2.5-1.5B, and Qwen-2.5-3B on identical Azure T4 hardware, serving stack, quantization, and a fixed 60-case corpus. Qwen-2.5-3B achieves the best exact-match accuracy (0.783), the strongest latency-accuracy tradeoff, and the only nonzero accuracy on all six task families. Study 2 is a pre-registered four-arm randomized experiment under synthetic traffic with an effective sample size of 60 unique cases per arm, comparing Phi-4-mini, Qwen-2.5-3B, and DeepSeek-V3 against a no-routing control. DeepSeek-V3 attains the highest accuracy (0.830) but fails the pre-registered P95 latency gate (2,295 ms); Qwen-2.5-3B is Pareto-dominant among self-hosted models (0.793 accuracy, 988 ms median, $0 marginal cost). No model meets the standalone viability criterion (>=0.85 accuracy, <=2,000 ms P95). The cost and latency prerequisites for SLM-based routing are met; the accuracy gap of 6-8 percentage points and the untested question of whether correct classification translates to downstream output quality bound the remaining distance to production viability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SLMs can handle task routing with low latency and zero cost, but the paper leaves the downstream quality gains untested.

The main thing to know is that this work shows SLMs can do front-door routing with low latency and no cost, but it leaves open whether that routing actually improves the quality of the final outputs. The paper brings a harmonized offline benchmark of three SLMs on identical hardware and a pre-registered four-arm randomized experiment under synthetic traffic. Qwen-2.5-3B comes out Pareto-dominant among self-hosted models with 0.793 accuracy and 988 ms median latency. DeepSeek-V3 hits higher accuracy at 0.830 but misses the P95 latency target. No model reaches the standalone viability threshold of 0.85 accuracy and 2,000 ms P95.

What they do well is run a controlled comparison to a no-routing baseline and report exact-match accuracy across task families. The pre-registration and the fixed corpus add credibility to the numbers.

The soft spot is the missing link between classification accuracy and downstream output quality. The studies stop at measuring how often the router picks the right task label; they do not check if correct routing produces better answers from the large model or compare output quality metrics between routed and unrouted cases. The abstract itself notes this remains untested, so the practical payoff is still an assumption.

This is for teams working on LLM inference stacks who need data on whether small models can handle routing decisions. A reader interested in production deployment tradeoffs would find the latency and cost figures directly useful. I would send it to peer review. The experimental design is solid enough to warrant referee feedback, even though the end-to-end benefit needs more work.

Referee Report

1 major / 1 minor

Summary. The manuscript evaluates small language models (1-4B parameters) for front-door routing in multi-model inference. Study 1 is a harmonized offline benchmark of Phi-3.5-mini, Qwen2.5-1.5B, and Qwen-2.5-3B on identical Azure T4 hardware, quantization, and a fixed 60-case corpus across a six-label taxonomy, with Qwen-2.5-3B achieving the highest exact-match accuracy (0.783) and best latency-accuracy tradeoff. Study 2 is a pre-registered four-arm randomized experiment under synthetic traffic (effective N=60 per arm) comparing Phi-4-mini, Qwen-2.5-3B, and DeepSeek-V3 against a no-routing control; Qwen-2.5-3B is Pareto-dominant among self-hosted models (0.793 accuracy, 988 ms median latency, $0 marginal cost) while DeepSeek-V3 reaches 0.830 accuracy but exceeds the P95 latency gate. No model meets the viability criterion (>=0.85 accuracy and <=2,000 ms P95). The authors conclude that cost and latency prerequisites are met but accuracy gaps of 6-8 points and the untested link to downstream output quality remain.

Significance. If the measurements hold, the work supplies concrete, hardware-matched evidence that current SLMs can perform task classification for routing with sub-second latency and zero marginal cost, which could simplify multi-objective inference optimization. The pre-registered design, identical serving stack, and explicit bounding of claims around the untested downstream-quality translation are methodological strengths that increase the reliability of the negative result on standalone viability.

major comments (1)

Abstract and §5 (Discussion): The central claim that SLM-based routing meets cost/latency prerequisites while accuracy is the remaining gap rests on the assumption that correct classification produces measurable gains in downstream output quality (correctness, coherence). The manuscript reports routed accuracy in Study 2 but provides no comparison of output-quality metrics between correctly routed cases, incorrectly routed cases, and the no-routing control, and explicitly flags this link as untested. This leaves the practical significance of the accuracy numbers unverified.

minor comments (1)

§4.2: The exact-match accuracy definition and handling of multi-label cases should be stated explicitly with an example to ensure the 0.783 and 0.793 figures are reproducible.
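For illustration, one plausible reading of exact-match accuracy is sketched below. It assumes a single gold label per case and a normalization step (trim whitespace, ignore case); neither detail is confirmed by the manuscript, which is the point of the referee's request.

```python
from typing import Sequence


def normalize(label: str) -> str:
    """Assumed normalization: trim whitespace and ignore case."""
    return label.strip().lower()


def exact_match_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of cases whose predicted label equals the single gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and of equal length")
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predicted, gold))
    return hits / len(gold)
```

Under this definition, 47 correct out of 60 gives 47/60 ≈ 0.783, which is consistent with the Study 1 figure; the count of 47 is an inference from the reported accuracy, not a number stated in the paper. The Study 2 figure of 0.793 is not a multiple of 1/60, so it presumably aggregates over more than 60 scored trials.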

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the methodological strengths of the pre-registered design and explicit bounding of claims. We address the single major comment below.

Referee: Abstract and §5 (Discussion): The central claim that SLM-based routing meets cost/latency prerequisites while accuracy is the remaining gap rests on the assumption that correct classification produces measurable gains in downstream output quality (correctness, coherence). The manuscript reports routed accuracy in Study 2 but provides no comparison of output-quality metrics between correctly routed cases, incorrectly routed cases, and the no-routing control, and explicitly flags this link as untested. This leaves the practical significance of the accuracy numbers unverified.

Authors: We agree that the manuscript does not provide empirical comparisons of downstream output quality (correctness, coherence) between correctly routed, incorrectly routed, and control cases, and that this link is explicitly flagged as untested. The central claim is therefore carefully bounded: the SLM router satisfies the cost and latency prerequisites (zero marginal cost, sub-second median latency), while the 6–8 point accuracy gap and the unverified downstream translation together constitute the remaining distance to viability. We do not assert that the reported accuracies guarantee end-to-end quality gains; they are presented only as a necessary condition. To strengthen clarity, we will revise §5 to add a short paragraph outlining the logical rationale for expecting quality benefits (task-specialized models outperforming a general model on their domains) and to recommend future end-to-end experiments that measure correctness and coherence on routed versus unrouted outputs.

revision: partial

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with direct measurements

The paper reports offline benchmarks and a randomized synthetic-traffic experiment measuring exact-match accuracy and latency on fixed corpora. No equations, fitted parameters, or predictions appear; all results are direct observations. The abstract explicitly flags the untested link from classification accuracy to downstream output quality, but this is an acknowledged empirical gap rather than a self-referential derivation. No self-citations are load-bearing for any central claim, and no ansatz or uniqueness theorem is invoked. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities; the work rests on the domain assumption that the chosen six-label taxonomy is representative of real routing needs and that the 60-case corpus is sufficient for evaluation.

axioms (1)

Domain assumption: The six-label taxonomy adequately covers the space of routing-relevant tasks.
It defines the classification target used in both studies.
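To gauge how much a 60-case corpus can support, a Wilson score interval gives a rough sense of sampling uncertainty. This is a sketch, not part of the paper's analysis: it assumes the cases are independent and that the Study 1 accuracy of 0.783 corresponds to 47 of 60 correct (an inference from the reported figure).

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(47, 60)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 0.66 to 0.87, so the width at n = 60 is large relative to the 6-8 point gap the paper discusses.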
