Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment
Pith reviewed 2026-05-15 00:35 UTC · model grok-4.3
The pith
Small language models can classify inputs for model routing at sub-second median latency and zero marginal cost, but none meets the joint accuracy and latency thresholds for standalone use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through an offline benchmark and a pre-registered synthetic-traffic experiment, the authors find that Qwen-2.5-3B delivers the strongest performance among self-hosted small language models for six-way task classification: 0.793 accuracy, 988 ms median latency, and zero marginal cost in the synthetic-traffic experiment, making it Pareto-dominant in that category. DeepSeek-V3 reaches 0.830 accuracy but violates the pre-registered P95 latency gate at 2,295 ms. The cost and latency requirements for SLM-based front-door routing are satisfied, yet the accuracy shortfall of 6-8 percentage points and the untested connection to downstream quality gains leave standalone viability out of reach.
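The Pareto-dominance claim can be made concrete with a minimal sketch. It assumes the usual definition (no worse on every objective, strictly better on at least one) over accuracy, median latency, and marginal cost; the source does not spell out the authors' exact definition. The `RouterResult` type and the comparison entry `hypothetical-slm` are illustrative and not taken from the paper; only the Qwen-2.5-3B figures come from the text.

```python
from typing import List, NamedTuple


class RouterResult(NamedTuple):
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # median latency, lower is better
    cost_usd: float    # marginal cost per request, lower is better


def dominates(a: RouterResult, b: RouterResult) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.latency_ms <= b.latency_ms
        and a.cost_usd <= b.cost_usd
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.latency_ms < b.latency_ms
        or a.cost_usd < b.cost_usd
    )
    return no_worse and strictly_better


def pareto_front(results: List[RouterResult]) -> List[RouterResult]:
    """Keep only the results that no other result dominates."""
    return [
        r for r in results
        if not any(dominates(other, r) for other in results if other is not r)
    ]


# Figures reported in the source for Study 2.
qwen = RouterResult("Qwen-2.5-3B", accuracy=0.793, latency_ms=988.0, cost_usd=0.0)
# HYPOTHETICAL comparison point, invented purely to exercise the function.
placeholder = RouterResult("hypothetical-slm", accuracy=0.70, latency_ms=1200.0, cost_usd=0.0)

print([r.name for r in pareto_front([qwen, placeholder])])
```

A model is Pareto-dominant within a group when it dominates every other member, which is a stronger statement than merely sitting on the front.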
What carries the argument
Front-door routing via an SLM classifier on a six-label task taxonomy, evaluated for accuracy, latency, and cost under fixed hardware and randomized synthetic traffic.
If this is right
- SLM-based routing decisions impose negligible additional cost and latency on the overall inference process.
- Qwen-2.5-3B offers the best accuracy-latency-cost balance among the self-hosted models tested.
- The prerequisites of cost and latency for using SLMs in routing are now met, focusing remaining challenges on accuracy.
- The pre-registered synthetic-traffic experiment supports these performance figures under controlled conditions, though without real-world traffic variability.
- Routing with these models remains short of the standalone viability criterion, which requires at least 0.85 accuracy together with a P95 latency of at most 2,000 ms.
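The viability criterion in the last bullet is a joint gate, which the following sketch illustrates. The two thresholds and the DeepSeek-V3 figures come from the abstract; the function names and the nearest-rank percentile convention are assumptions, since the source does not say how P95 was computed.

```python
ACCURACY_FLOOR = 0.85    # from the paper: accuracy >= 0.85
P95_CEILING_MS = 2000.0  # from the paper: P95 latency <= 2,000 ms


def p95_nearest_rank(latencies_ms):
    """95th percentile by the nearest-rank method (an assumed convention)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based rank without float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_viability(accuracy, p95_ms):
    """Both conditions must hold; passing only one is not enough."""
    return accuracy >= ACCURACY_FLOOR and p95_ms <= P95_CEILING_MS


# DeepSeek-V3 as reported: 0.830 accuracy, 2,295 ms P95.
print(meets_viability(0.830, 2295.0))
```

With the reported figures, DeepSeek-V3 misses both conditions: its accuracy is below 0.85 and its P95 latency is above 2,000 ms.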
Where Pith is reading between the lines
- Improving SLM accuracy by a few points could make dynamic model selection practical in multi-model serving systems.
- Direct measurement of end-to-end output quality with and without SLM routing would test whether classification correctness improves results.
- SLM routers might apply to other inference-time decisions such as prompt optimization or tool selection.
- Scaling the taxonomy or testing on real traffic could reveal gaps not captured in the synthetic setup.
Load-bearing premise
That routing a query to the model selected by correct task classification will yield better output quality than using a single model for all queries.
What would settle it
Measuring final output quality when routing with the SLM classifier, compared against a no-routing baseline or random routing, to see whether quality improves as expected.
Original abstract
Selecting the appropriate model at inference time -- the routing problem -- requires jointly optimizing output quality, cost, latency, and governance constraints. Existing approaches delegate this decision to LLM-based classifiers or preference-trained routers that are themselves costly and high-latency, reducing a multi-objective optimization to single-dimensional quality prediction. We argue that small language models (SLMs, 1-4B parameters) have now achieved sufficient reasoning capability for sub-second, zero-marginal-cost, self-hosted task classification, potentially making the routing decision negligible in the inference budget. We test this thesis on a six-label taxonomy through two studies. Study 1 is a harmonized offline benchmark of Phi-3.5-mini, Qwen2.5-1.5B, and Qwen-2.5-3B on identical Azure T4 hardware, serving stack, quantization, and a fixed 60-case corpus. Qwen-2.5-3B achieves the best exact-match accuracy (0.783), the strongest latency-accuracy tradeoff, and the only nonzero accuracy on all six task families. Study 2 is a pre-registered four-arm randomized experiment under synthetic traffic with an effective sample size of 60 unique cases per arm, comparing Phi-4-mini, Qwen-2.5-3B, and DeepSeek-V3 against a no-routing control. DeepSeek-V3 attains the highest accuracy (0.830) but fails the pre-registered P95 latency gate (2,295 ms); Qwen-2.5-3B is Pareto-dominant among self-hosted models (0.793 accuracy, 988 ms median, $0 marginal cost). No model meets the standalone viability criterion (>=0.85 accuracy, <=2,000 ms P95). The cost and latency prerequisites for SLM-based routing are met; the accuracy gap of 6-8 percentage points and the untested question of whether correct classification translates to downstream output quality bound the remaining distance to production viability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates small language models (1-4B parameters) for front-door routing in multi-model inference. Study 1 is a harmonized offline benchmark of Phi-3.5-mini, Qwen2.5-1.5B, and Qwen-2.5-3B on identical Azure T4 hardware, quantization, and a fixed 60-case corpus across a six-label taxonomy, with Qwen-2.5-3B achieving the highest exact-match accuracy (0.783) and best latency-accuracy tradeoff. Study 2 is a pre-registered four-arm randomized experiment under synthetic traffic (effective N=60 per arm) comparing Phi-4-mini, Qwen-2.5-3B, and DeepSeek-V3 against a no-routing control; Qwen-2.5-3B is Pareto-dominant among self-hosted models (0.793 accuracy, 988 ms median latency, $0 marginal cost) while DeepSeek-V3 reaches 0.830 accuracy but exceeds the P95 latency gate. No model meets the viability criterion (>=0.85 accuracy and <=2,000 ms P95). The authors conclude that cost and latency prerequisites are met but accuracy gaps of 6-8 points and the untested link to downstream output quality remain.
Significance. If the measurements hold, the work supplies concrete, hardware-matched evidence that current SLMs can perform task classification for routing with sub-second latency and zero marginal cost, which could simplify multi-objective inference optimization. The pre-registered design, identical serving stack, and explicit bounding of claims around the untested downstream-quality translation are methodological strengths that increase the reliability of the negative result on standalone viability.
major comments (1)
- [Abstract and §5] Abstract and §5 (Discussion): The central claim that SLM-based routing meets cost/latency prerequisites while accuracy is the remaining gap rests on the assumption that correct classification produces measurable gains in downstream output quality (correctness, coherence). The manuscript reports routed accuracy in Study 2 but provides no comparison of output-quality metrics between correctly routed cases, incorrectly routed cases, and the no-routing control, and explicitly flags this link as untested. This leaves the practical significance of the accuracy numbers unverified.
minor comments (1)
- [§4.2] §4.2: The exact-match accuracy definition and handling of multi-label cases should be stated explicitly with an example to ensure the 0.783 and 0.793 figures are reproducible.
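As an illustration of what such a definition might look like, here is a minimal sketch of single-label exact-match accuracy. It assumes one gold label per case and a comparison after trimming whitespace and lowercasing; the manuscript's actual normalization and multi-label handling are not given in the source. The label names `code` and `summarize` are hypothetical.

```python
def normalize(label: str) -> str:
    """Assumed normalization: trim whitespace and ignore case."""
    return label.strip().lower()


def exact_match_accuracy(predictions, gold):
    """Fraction of cases whose predicted label equals the gold label exactly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one case")
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return hits / len(gold)


# Toy corpus of 60 cases with hypothetical labels: 47 correct, 13 incorrect.
gold = ["code"] * 60
predictions = ["code"] * 47 + ["summarize"] * 13

print(round(exact_match_accuracy(predictions, gold), 3))
```

On a 60-case corpus, 47 correct predictions give 47/60 ≈ 0.783, which is the count consistent with the Study 1 figure; the source itself does not report the raw count.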
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the methodological strengths of the pre-registered design and explicit bounding of claims. We address the single major comment below.
- Referee: [Abstract and §5] Abstract and §5 (Discussion): The central claim that SLM-based routing meets cost/latency prerequisites while accuracy is the remaining gap rests on the assumption that correct classification produces measurable gains in downstream output quality (correctness, coherence). The manuscript reports routed accuracy in Study 2 but provides no comparison of output-quality metrics between correctly routed cases, incorrectly routed cases, and the no-routing control, and explicitly flags this link as untested. This leaves the practical significance of the accuracy numbers unverified.
  Authors: We agree that the manuscript does not provide empirical comparisons of downstream output quality (correctness, coherence) between correctly routed, incorrectly routed, and control cases, and that this link is explicitly flagged as untested. The central claim is therefore carefully bounded: the SLM router satisfies the cost and latency prerequisites (zero marginal cost, sub-second median latency), while the 6–8 point accuracy gap and the unverified downstream translation together constitute the remaining distance to viability. We do not assert that the reported accuracies guarantee end-to-end quality gains; they are presented only as a necessary condition. To strengthen clarity, we will revise §5 to add a short paragraph outlining the logical rationale for expecting quality benefits (task-specialized models outperforming a general model on their domains) and to recommend future end-to-end experiments that measure correctness and coherence on routed versus unrouted outputs.
  Revision: partial
Circularity Check
No circularity: pure empirical benchmark with direct measurements
The paper reports offline benchmarks and a randomized synthetic-traffic experiment measuring exact-match accuracy and latency on fixed corpora. No equations, fitted parameters, or predictions appear; all results are direct observations. The abstract explicitly flags the untested link from classification accuracy to downstream output quality, but this is an acknowledged empirical gap rather than a self-referential derivation. No self-citations are load-bearing for any central claim, and no ansatz or uniqueness theorem is invoked. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the six-label taxonomy adequately covers the space of routing-relevant tasks.