
arxiv: 2606.08970 · v1 · pith:OZ37PLPZnew · submitted 2026-06-08 · 💻 cs.AI

An Effective Router for Vision-Language Model Selection

Pith reviewed 2026-06-27 17:04 UTC · model grok-4.3

classification 💻 cs.AI
keywords Vision-language models · Model routing · Model selection · Multimodal dataset · Adaptation strategies · Router architecture · Performance paradox

The pith

ARMS is a compact router that selects the best vision-language model for a query and adapts via two training strategies to outperform models hundreds of times larger.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a dataset of 32,626 image-text queries answered by seven VLMs to address the lack of specialized data for model selection. It introduces ARMS, which augments query inputs with VLM capability profiles and uses a straightforward architecture to improve how queries and model strengths are represented. Two extension strategies—incremental training and independent training—let ARMS adapt when new VLMs are added without retraining from scratch. Experiments show the 800M-parameter ARMS succeeds on both in-distribution and out-of-distribution queries and, after adaptation, exceeds commercial systems such as GPT-4o.
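As a rough illustration of what one record in such a dataset could look like, the sketch below stores each candidate VLM's answer to a query and derives a 0/1 routing target per model. The field names, the placeholder model names (`vlm_a`, `vlm_b`, `vlm_c`), and the exact-match labeling rule are assumptions made for this example; the page does not describe the paper's actual schema.

```python
from dataclasses import dataclass

# Hypothetical candidate pool; the paper uses seven VLMs, three placeholders here.
CANDIDATES = ["vlm_a", "vlm_b", "vlm_c"]


@dataclass
class QueryRecord:
    question: str
    image_path: str
    gold_answer: str
    vlm_answers: dict  # model name -> answer string


def routing_target(record, candidates=CANDIDATES):
    """Return a 0/1 vector: 1 where the model's answer matches the gold answer."""
    gold = record.gold_answer.strip().lower()
    return [
        int(record.vlm_answers.get(name, "").strip().lower() == gold)
        for name in candidates
    ]


record = QueryRecord(
    question="Which option matches the chart?",
    image_path="images/example.png",  # illustrative path only
    gold_answer="B",
    vlm_answers={"vlm_a": "A", "vlm_b": "B", "vlm_c": "b"},
)
target = routing_target(record)  # one label per candidate model
```

A multi-label target of this kind lets a router learn that several models may be acceptable for the same query, rather than forcing a single "winner".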

Core claim

ARMS enhances input signals with VLM profiles and employs a simple architecture for better query and capability representations; combined with incremental or independent training it adapts to new VLMs, allowing an 800M model to defeat much larger commercial ones on VLM selection tasks.

What carries the argument

ARMS router that augments inputs with VLM profiles and applies incremental or independent training to expand the model space.
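A minimal sketch of profile-augmented routing follows, assuming each VLM has a profile vector in the same space as the query embedding and that the router scores a pair by a dot product passed through a sigmoid. The dot-product scorer is swapped in for illustration; the page does not specify the ARMS architecture, and all names and toy vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_models(query_vec, profiles):
    """Estimate, per model, the probability of a correct answer via a dot product."""
    return {
        name: sigmoid(sum(q * p for q, p in zip(query_vec, profile)))
        for name, profile in profiles.items()
    }


def route(query_vec, profiles):
    """Pick the model with the highest estimated success probability."""
    scores = score_models(query_vec, profiles)
    return max(scores, key=scores.get)


# Toy 3-d embeddings; dimensions loosely stand for (chart reading, OCR, counting).
profiles = {
    "vlm_a": [0.9, 0.1, 0.2],
    "vlm_b": [0.2, 0.8, 0.1],
    "vlm_c": [0.1, 0.2, 0.9],
}
ocr_query = [0.0, 1.0, 0.1]
chosen = route(ocr_query, profiles)
```

The scorer needs some interaction between query and profile. A single linear layer over the concatenation `[query; profile]` would add the same query term to every model's score, so the ranking would not depend on the query at all.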

If this is right

  • ARMS achieves strong performance on both in-distribution and out-of-distribution test sets for VLM selection.
  • Incremental and independent training let the router extend to new VLMs at lower cost than full retraining.
  • An 800M router can match or exceed selection quality of commercial models hundreds of times larger after adaptation.
  • The constructed multimodal dataset supplies the specialized data needed to train future VLM routers.
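One plausible reading of extending a router without full retraining is sketched below: a new VLM gets its own profile vector, fitted by logistic-regression gradient steps on that model's correctness labels, while existing profiles stay frozen. This is an assumption-laden illustration of the general idea of training a new component independently, not the paper's incremental or independent strategy, whose details are not given on this page.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_new_profile(queries, labels, dim, lr=0.5, epochs=200):
    """Fit one profile vector with logistic-regression gradient steps.

    Only the new vector is updated; existing profiles are never touched.
    """
    profile = [0.0] * dim
    for _ in range(epochs):
        for q, y in zip(queries, labels):
            pred = sigmoid(sum(a * b for a, b in zip(q, profile)))
            for i in range(dim):
                profile[i] += lr * (y - pred) * q[i]
    return profile


existing = {"vlm_a": [0.9, 0.1], "vlm_b": [0.1, 0.9]}
frozen_copy = {k: list(v) for k, v in existing.items()}

# Toy data: the new model succeeds on queries dominated by the first feature.
queries = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]

existing["vlm_new"] = fit_new_profile(queries, labels, dim=2)
```

Because only the new vector changes, the cost of adding a model scales with that model's labeled data rather than with the whole candidate pool.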

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same profile-augmented routing approach could be tested on selecting among other multimodal or unimodal model families.
  • Dynamic selection with ARMS could lower average inference cost for users who currently default to the largest available model.
  • Collecting outputs from additional VLMs on the same queries would allow direct measurement of how dataset scale affects router accuracy.

Load-bearing premise

The 32,626-query dataset from seven VLMs is representative enough that a router trained on it will keep selecting the best model for new queries and newly added VLMs.

What would settle it

Evaluating the trained ARMS on a fresh collection of queries or on VLMs released after the dataset was built and finding that its selections are no longer better than those of GPT-4o or other large models.
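The check described above can be sketched as a comparison of selection accuracy between any two selectors on the same fresh queries. The metric definition (a selection counts as a hit if the chosen model answered correctly), the oracle upper bound, and the toy values are assumptions for illustration; the paper's exact metrics are not stated on this page.

```python
def routing_accuracy(selections, correctness):
    """Fraction of queries where the selected model answered correctly."""
    hits = sum(correctness[i][model] for i, model in enumerate(selections))
    return hits / len(selections)


def oracle_accuracy(correctness):
    """Upper bound: a query counts if any candidate model answers it correctly."""
    return sum(1 for row in correctness if any(row.values())) / len(correctness)


# correctness[i][model] is 1 if `model` answered query i correctly (toy values).
correctness = [
    {"vlm_a": 1, "vlm_b": 0},
    {"vlm_a": 0, "vlm_b": 1},
    {"vlm_a": 1, "vlm_b": 1},
    {"vlm_a": 0, "vlm_b": 0},
]
router_picks = ["vlm_a", "vlm_b", "vlm_b", "vlm_a"]  # hypothetical router output
always_a = ["vlm_a"] * len(correctness)  # fixed single-model baseline

router_acc = routing_accuracy(router_picks, correctness)
baseline_acc = routing_accuracy(always_a, correctness)
oracle_acc = oracle_accuracy(correctness)
```

Passing the picks of a second selector (for example, a large model prompted to choose) through `routing_accuracy` gives the head-to-head number this test calls for.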

Figures

Figures reproduced from arXiv: 2606.08970 by Bolin Zhang, Can Wang, Dianhui Chu, Shengwei Wang, Zhiying Tu.

Figure 1: The proportion (%) of samples answered incorrectly
Figure 2: A data sample in M2. <Question> and <Image> are combined as the input to the VLM claude3.5-sonnet, where the blue text denotes the added prompt for output standardization. The correct answer is in green, while the incorrect answer from the VLM is in red
Figure 3: Architecture of ARMS, consists of five modules: a
Figure 4: Impact of data amount and training epochs on the performance of ARMS in in-distribution (ID) and out-of-distribution
Figure 5: Main Results of our router ARMS with different training strategies and selection spaces (
Figure 6: Embeddings visualization of the fused image and text vectors from ARMS with t-SNE [
Figure 7: Comparison between two training strategies. The best results on Accuracy, Recall, Precision and F1-score are annotated
Original abstract

Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making it difficult for users to select the most appropriate one among numerous candidates. Existing work reveals the performance paradox phenomenon in language models and focuses on routing methods to solve it. However, developing a router for VLM selection remains a critical yet challenging problem, which faces three main obstacles: 1) lack of specialized data, 2) ineffective feature representation, and 3) rigid model space and costly adaptation. In this paper, we construct a multimodal dataset for VLM selection, containing the outputs of seven mainstream VLMs on 32,626 unique image-text queries. We then propose ARMS, a router for VLM selection. ARMS enhances input signals with VLM profiles and employs a simple but effective architecture to improve representations of queries and VLM capabilities. To improve ARMS' adaptation to new VLMs, we propose two extension training strategies: incremental training and independent training. Experimental results on both in-distribution and out-of-distribution test sets demonstrate the effectiveness of ARMS. In particular, using our training strategy, ARMS (only 800M in size) can adapt to a broader VLM space and defeat commercial models like GPT-4o that are hundreds of times larger in scale. Our code, models, and datasets are available in the anonymous repository.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript constructs a multimodal dataset consisting of 32,626 image-text queries evaluated by seven mainstream VLMs. It proposes ARMS, an 800M-parameter router that augments query features with VLM profiles, employs a simple architecture for improved representations, and introduces incremental and independent training strategies to adapt the router to new VLMs. Experiments are reported to show effectiveness on both in-distribution and out-of-distribution test sets, with the headline claim that the adapted ARMS outperforms much larger commercial models such as GPT-4o.

Significance. If the empirical claims are substantiated with full quantitative detail, the work would address a practical deployment challenge in heterogeneous VLM ecosystems by providing a lightweight, adaptable router. The public release of the dataset, code, and models is a clear strength that enables reproducibility and follow-on research.

major comments (2)
  1. [Abstract and §4] (experimental results): the claim that the 800M ARMS defeats GPT-4o is presented without any numerical performance values, baseline comparisons, error bars, or ablation tables. This absence is load-bearing for the central empirical contribution.
  2. [§3.2 and §4.2] (OOD evaluation): the out-of-distribution test sets are constructed exclusively from the same seven source VLMs used to build the 32,626-query training collection. This does not probe extrapolation to new VLMs whose capability profiles may lie outside the convex hull of the training set, which is required to support the adaptation-to-broader-space claim.
minor comments (1)
  1. [§3.1] Clarify the precise encoding and dimensionality of the VLM profiles and how they are concatenated with query embeddings in the router architecture.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below.

  1. Referee: [Abstract and §4] (experimental results): the claim that the 800M ARMS defeats GPT-4o is presented without any numerical performance values, baseline comparisons, error bars, or ablation tables. This absence is load-bearing for the central empirical contribution.

    Authors: We agree that the abstract and Section 4 require explicit numerical performance values, baseline comparisons, error bars, and ablation tables to substantiate the claim. In the revised manuscript, we will add these quantitative details from our experiments, including the specific metrics demonstrating ARMS outperforming GPT-4o after adaptation. revision: yes

  2. Referee: [§3.2 and §4.2] (OOD evaluation): the out-of-distribution test sets are constructed exclusively from the same seven source VLMs used to build the 32,626-query training collection. This does not probe extrapolation to new VLMs whose capability profiles may lie outside the convex hull of the training set, which is required to support the adaptation-to-broader-space claim.

    Authors: The OOD test sets evaluate generalization to unseen queries drawn from the same seven VLMs, while the incremental and independent training strategies are intended to enable adaptation to new VLMs. The broader-space claim is supported by results on incorporating additional models via these strategies. We will revise the manuscript to explicitly distinguish query-level OOD from model adaptation and discuss the limitation regarding profiles outside the original set. revision: partial
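The distinction between query-level OOD and model-level adaptation can be made concrete with two different hold-out splits. The sketch below is a hedged illustration: the record layout, the `source` field (standing in for the benchmark a query came from), and the model names are hypothetical, and neither function is taken from the paper.

```python
def query_level_split(records, held_out_sources):
    """Hold out whole query sources; every record keeps all candidate models."""
    train = [r for r in records if r["source"] not in held_out_sources]
    test = [r for r in records if r["source"] in held_out_sources]
    return train, test


def model_level_split(records, held_out_models):
    """Hide held-out models' labels from training; test only on those models."""
    train = [
        {**r, "correct": {m: c for m, c in r["correct"].items() if m not in held_out_models}}
        for r in records
    ]
    test = [
        {**r, "correct": {m: c for m, c in r["correct"].items() if m in held_out_models}}
        for r in records
    ]
    return train, test


# Toy records: per-query correctness for one known and one newly added model.
records = [
    {"source": "bench_x", "correct": {"vlm_a": 1, "vlm_new": 0}},
    {"source": "bench_y", "correct": {"vlm_a": 0, "vlm_new": 1}},
]
train_q, test_q = query_level_split(records, {"bench_y"})
train_m, test_m = model_level_split(records, {"vlm_new"})
```

The first split tests generalization to unseen queries over a fixed model pool; the second tests whether the router can rank a model it never saw labels for, which is the property the referee's comment targets.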

Circularity Check

0 steps flagged

No circularity: empirical training and evaluation on external dataset


The paper collects a 32,626-query dataset from seven VLMs, trains the ARMS router (with incremental/independent strategies), and reports empirical accuracy on in-distribution and out-of-distribution splits plus comparisons to GPT-4o. No equations, fitted parameters renamed as predictions, or self-citation chains are shown that reduce the reported performance to an input by construction. The result remains an open empirical claim whose validity depends on dataset representativeness, not on definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the representativeness of the collected 32k-query set and on the assumption that VLM profile embeddings plus query features are sufficient to predict per-query performance; no free parameters, axioms, or invented entities are explicitly introduced beyond standard neural-network training.


