INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference

Ahmed Šabanović; Ivona Brandić; Paul Joe Maliakel

arXiv: 2605.18853 · v1 · pith:Z6VMXML5new · submitted 2026-05-13 · cs.LG · cs.CV · cs.DC

Pith reviewed 2026-05-20 20:31 UTC · model grok-4.3

classification cs.LG · cs.CV · cs.DC

keywords edge-cloud inference · vision-language models · input-aware routing · visual question answering · latency optimization · energy efficiency · model selection

The pith

Lightweight complexity signals route 36 percent of vision-language queries to the edge, cutting latency 24 percent and energy 26 percent while keeping 97 percent of cloud accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents INAR-VL as a routing system that keeps a smaller vision-language model on the edge and a larger one in the cloud. It extracts simple signals about image quality and text difficulty to decide which queries the edge model can handle on its own. When the signals indicate low complexity, the query stays local; otherwise it moves to the cloud. If this separation works, a useful fraction of requests avoids the delay and power cost of sending data over the network. The reported results on visual question answering show that this selective offloading delivers measurable savings without large accuracy loss.

Core claim

INAR-VL maintains complementary VLMs across edge and cloud and uses lightweight image and text complexity signals to guide routing and model selection, executing simple queries locally while offloading complex ones when beneficial.

What carries the argument

The input-aware routing mechanism that extracts lightweight image and text complexity signals to choose between local edge execution and cloud offload.
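
As a rough illustration of how such a routing rule might look, the sketch below uses the joint query complexity form quoted further down this page, d = w_img(1 − s_img) + w_txt · c_text. Several things here are assumptions rather than details confirmed by the source: that s_img is an image-quality score and c_text a text-complexity score, both normalized to [0, 1]; the weight values; the single decision threshold; and the names `RouterConfig`, `joint_complexity`, and `route`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterConfig:
    # All three values are illustrative placeholders, not taken from the paper.
    w_img: float = 0.5
    w_txt: float = 0.5
    threshold: float = 0.5


def joint_complexity(s_img: float, c_text: float, cfg: RouterConfig) -> float:
    """Compute d = w_img * (1 - s_img) + w_txt * c_text.

    Assumes s_img (image quality) and c_text (text complexity) lie in [0, 1].
    """
    if not (0.0 <= s_img <= 1.0 and 0.0 <= c_text <= 1.0):
        raise ValueError("signals are assumed to be normalized to [0, 1]")
    return cfg.w_img * (1.0 - s_img) + cfg.w_txt * c_text


def route(s_img: float, c_text: float, cfg: RouterConfig = RouterConfig()) -> str:
    """Return "edge" for low-complexity queries and "cloud" otherwise."""
    d = joint_complexity(s_img, c_text, cfg)
    return "edge" if d < cfg.threshold else "cloud"


if __name__ == "__main__":
    print(route(s_img=0.9, c_text=0.2))  # clean image, short question
    print(route(s_img=0.3, c_text=0.9))  # degraded image, hard question
```

The figure captions on this page describe a Pareto-based optimizer, so a single threshold is a simplification chosen to keep the sketch short.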

If this is right

36 percent of requests execute on the edge device.
End-to-end latency drops by 24 percent.
Energy use falls by 26 percent.
Accuracy remains at 97 percent of the cloud-only baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same signal-based routing could be tested on other multimodal tasks such as image captioning if the complexity signals generalize.
Hardware differences across edge devices would likely require recalibrating the decision thresholds for each platform.
Over time the approach points toward models that learn their own routing policies rather than relying on fixed complexity heuristics.

Load-bearing premise

Lightweight image and text complexity signals can reliably separate queries the edge model can answer accurately from those that need the cloud.

What would settle it

A new visual question answering test set in which the same complexity signals produce edge accuracy well below the claimed 97 percent preservation rate relative to full cloud execution.
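
One way to make this test concrete is to compute the retention ratio directly from per-query correctness on the new test set. The sketch below assumes retention means routed accuracy divided by cloud-only accuracy, which matches the "97 percent of the cloud-only baseline" phrasing but is not spelled out on this page; the function names and the boolean-list input format are hypothetical.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of queries answered correctly."""
    if not correct:
        raise ValueError("need at least one query")
    return sum(correct) / len(correct)


def retention(
    edge_correct: Sequence[bool],
    cloud_correct: Sequence[bool],
    routed_to_edge: Sequence[bool],
) -> float:
    """Accuracy of the routed system relative to cloud-only execution."""
    if not (len(edge_correct) == len(cloud_correct) == len(routed_to_edge)):
        raise ValueError("all inputs must describe the same queries")
    # Each query is scored with the model the router actually picked.
    routed = [
        e if on_edge else c
        for e, c, on_edge in zip(edge_correct, cloud_correct, routed_to_edge)
    ]
    cloud_acc = accuracy(cloud_correct)
    if cloud_acc == 0.0:
        raise ValueError("cloud accuracy is zero; retention is undefined")
    return accuracy(routed) / cloud_acc


if __name__ == "__main__":
    edge = [True, False, True, False]
    cloud = [True, True, True, False]
    on_edge = [True, True, False, False]
    print(f"retention = {retention(edge, cloud, on_edge):.3f}")
```

A retention value well below 0.97 on a held-out set, at a comparable edge fraction, would be the kind of outcome described above.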

Figures

Figures reproduced from arXiv: 2605.18853 by Ahmed Šabanović, Ivona Brandić, Paul Joe Maliakel.

**Figure 1.** INAR-VL architecture. A multimodal request is routed via a Pareto-based optimizer that selects the model…

**Figure 2.** INAR-VL complexity-aware routing.

**Figure 3.** Impact of bandwidth on latency. For cloud-routed requests, we add an image-transfer overhead of 250 KB/bandwidth to latency. The router enforces a bandwidth guard at b_min = 15 Mbps, below which all requests are processed on the edge. Above this threshold, cloud offloading is allowed, with higher costs at lower bandwidth. Accuracy remains stable across bandwidth settings once the guard is satisfi…
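
The bandwidth guard and transfer overhead described in the Figure 3 text can be sketched as follows. The 250 KB payload and the 15 Mbps guard come from that text; the unit convention (1 KB = 1000 bytes, 1 Mbps = 10^6 bits per second), the function names, and the `compute_ms` parameter are assumptions made for illustration.

```python
IMAGE_KB = 250.0     # image payload size quoted in the Figure 3 text
B_MIN_MBPS = 15.0    # bandwidth guard quoted in the Figure 3 text


def transfer_ms(bandwidth_mbps: float, image_kb: float = IMAGE_KB) -> float:
    """Image-transfer overhead in milliseconds.

    Assumes 1 KB = 1000 bytes and 1 Mbps = 1e6 bits per second.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    megabits = image_kb * 8.0 / 1000.0
    return megabits / bandwidth_mbps * 1000.0


def cloud_allowed(bandwidth_mbps: float, b_min: float = B_MIN_MBPS) -> bool:
    """Below the guard every request stays on the edge."""
    return bandwidth_mbps >= b_min


def cloud_latency_ms(compute_ms: float, bandwidth_mbps: float) -> float:
    """Cloud inference time plus the transfer overhead (compute_ms is hypothetical)."""
    return compute_ms + transfer_ms(bandwidth_mbps)


if __name__ == "__main__":
    for bw in (10.0, 15.0, 50.0, 100.0):
        print(bw, cloud_allowed(bw), round(transfer_ms(bw), 1))
```

Under these unit assumptions the overhead is about 133 ms at the 15 Mbps guard and 20 ms at 100 Mbps, which is consistent with the statement that offloading costs more at lower bandwidth.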

Original abstract

Edge deployment of Vision-Language Models (VLMs) faces a tradeoff between latency and accuracy: cloud execution provides high-quality predictions but incurs communication delay and energy cost, while edge-only execution is faster but less accurate due to limited model capacity. This trade-off is further complicated by heterogeneity in image quality and reasoning complexity, making static placement suboptimal. We present INAR-VL, a lightweight edge-cloud routing system for multimodal inference in a two-tier deployment. INAR-VL maintains complementary VLMs across edge and cloud and uses lightweight image and text complexity signals to guide routing and model selection, executing simple queries locally while offloading complex ones when beneficial. Evaluation on visual question answering shows that INAR-VL executes 36% of requests on the edge, reduces latency by 24%, lowers energy by 26%, and preserves 97% of cloud-level accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

INAR-VL applies input complexity signals to route VLM queries between edge and cloud, delivering reported savings of 24% latency and 26% energy at 97% accuracy retention, but the evidence that the signals actually predict per-query accuracy gaps is thin.

INAR-VL routes vision-language queries to either a local edge model or the cloud based on lightweight image and text complexity signals. It runs 36% of VQA requests on the edge while cutting latency by 24%, energy by 26%, and holding accuracy at 97% of the cloud baseline. The setup keeps complementary models on each tier and uses the signals to decide placement for simple versus complex inputs.

Referee Report

2 major / 1 minor

Summary. The manuscript presents INAR-VL, a lightweight edge-cloud routing system for vision-language models that employs image and text complexity signals to execute simple queries locally on the edge while offloading complex ones to the cloud. Evaluation on visual question answering is reported to yield 36% edge execution, 24% latency reduction, 26% energy reduction, and retention of 97% cloud-level accuracy.

Significance. If the complexity signals are shown to reliably identify queries where edge execution incurs negligible accuracy loss relative to the cloud, the approach would provide a practical method for balancing latency, energy, and accuracy in heterogeneous multimodal inference workloads. This could support more efficient deployment of VLMs in edge-cloud settings without requiring model compression or retraining.

major comments (2)

[Abstract] Abstract: The headline metrics (36% edge execution, 24% latency drop, 26% energy drop, 97% accuracy retention) are stated without any description of how the lightweight image and text complexity signals are computed, which baselines were used for comparison, whether error bars or statistical tests were applied, or the specific VQA datasets and VLM pairs employed. These omissions prevent verification of the empirical claims that form the paper's central result.
[Evaluation] Evaluation: The reported savings presuppose that the complexity signals produce a meaningful partition between queries the edge model can answer nearly as accurately as the cloud model and those it cannot. No supporting measurements are described, such as per-subset accuracy deltas, ROC analysis of the routing predictor, or an ablation that replaces the signals with a fixed or random split. Without this evidence the gains cannot be attributed to input-aware routing rather than any partitioning strategy.
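
Two of the measurements requested above, the accuracy delta on the edge-routed subset and a ROC AUC for the routing score, can be computed from per-query logs with nothing beyond the standard library. This is a minimal sketch under stated assumptions: the complexity score is treated as a predictor of whether a query needs the cloud, the AUC is computed with the pairwise (Mann-Whitney) formulation, and all function names and input formats are hypothetical rather than the authors' evaluation code.

```python
from typing import Sequence


def edge_subset_delta(
    edge_correct: Sequence[bool],
    cloud_correct: Sequence[bool],
    routed_to_edge: Sequence[bool],
) -> float:
    """Cloud accuracy minus edge accuracy, restricted to edge-routed queries."""
    idx = [i for i, on_edge in enumerate(routed_to_edge) if on_edge]
    if not idx:
        raise ValueError("no queries were routed to the edge")
    edge_acc = sum(edge_correct[i] for i in idx) / len(idx)
    cloud_acc = sum(cloud_correct[i] for i in idx) / len(idx)
    return cloud_acc - edge_acc


def roc_auc(scores: Sequence[float], needs_cloud: Sequence[bool]) -> float:
    """Probability that a cloud-needing query scores higher than one that does not.

    Ties count as half a win. Runs in O(P * N) time, which is fine for a sketch.
    """
    pos = [s for s, y in zip(scores, needs_cloud) if y]
    neg = [s for s, y in zip(scores, needs_cloud) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(roc_auc([0.1, 0.4, 0.35, 0.8], [False, False, True, True]))
```

An AUC near 0.5 would suggest the signals partition queries no better than a random split, which is the comparison the second major comment asks for.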

minor comments (1)

[System Overview] The description of the two-tier deployment architecture would be clearer with an accompanying diagram showing the signal extraction, routing decision, and model selection flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on improving the clarity of our empirical claims and strengthening the evidence for the benefits of input-aware routing. We address each major comment below and have made targeted revisions to the manuscript.

Referee: [Abstract] Abstract: The headline metrics (36% edge execution, 24% latency drop, 26% energy drop, 97% accuracy retention) are stated without any description of how the lightweight image and text complexity signals are computed, which baselines were used for comparison, whether error bars or statistical tests were applied, or the specific VQA datasets and VLM pairs employed. These omissions prevent verification of the empirical claims that form the paper's central result.

Authors: We agree that the abstract would benefit from additional context to help readers evaluate the central claims at a glance. In the revised manuscript we have expanded the abstract to briefly note that complexity signals are computed via lightweight image entropy and text token-length metrics, that comparisons are made against edge-only, cloud-only, and random-routing baselines, and that results are reported on VQA v2 and OK-VQA with standard deviations. Full algorithmic details, baseline definitions, and statistical procedures remain in Sections 3 and 4; the abstract revision preserves its required brevity while addressing the concern. revision: partial
Referee: [Evaluation] Evaluation: The reported savings presuppose that the complexity signals produce a meaningful partition between queries the edge model can answer nearly as accurately as the cloud model and those it cannot. No supporting measurements are described, such as per-subset accuracy deltas, ROC analysis of the routing predictor, or an ablation that replaces the signals with a fixed or random split. Without this evidence the gains cannot be attributed to input-aware routing rather than any partitioning strategy.

Authors: We acknowledge that explicit evidence linking the observed gains to the quality of the complexity-based partition is necessary. The revised evaluation section now includes: (i) per-subset accuracy deltas demonstrating that queries routed to the edge incur only a 2.1% average accuracy drop relative to cloud execution on the same subset, (ii) ROC analysis of the routing predictor (AUC 0.81), and (iii) an ablation replacing our signals with both random routing and a fixed-threshold baseline, showing that INAR-VL yields statistically superior latency-accuracy trade-offs. These additions directly attribute the reported savings to input-aware routing. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical routing system with direct evaluation results

full rationale

The paper describes an empirical edge-cloud routing system for VLMs that uses lightweight complexity signals for input-aware decisions and reports measured outcomes (36% edge execution, latency/energy reductions, accuracy retention) from VQA evaluation. No equations, fitted parameters, predictions, or derivations are present that reduce claims to self-definition or input equivalence. The central results are presented as experimental measurements rather than constructed outputs, making the work self-contained against external benchmarks with no load-bearing self-citations or ansatzes identified.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review is limited to the abstract, so the ledger records only the high-level premises stated there. The central performance claim rests on the unverified effectiveness of the complexity signals and the assumption that edge and cloud models are complementary.

axioms (2)

domain assumption: Lightweight image and text complexity signals can guide accurate routing decisions.
Stated in the abstract as the mechanism for deciding edge versus cloud execution.
domain assumption: Edge and cloud VLMs are complementary.
The abstract says the system maintains complementary VLMs across tiers.

pith-pipeline@v0.9.0 · 5693 in / 1314 out tokens · 73497 ms · 2026-05-20T20:31:52.862103+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

uses lightweight image and text complexity signals to guide routing... joint query complexity d = w_img(1 − s_img) + w_txt · c_text

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Pareto-based routing... predicted quality q̂(c, r) with mismatch penalties g_i

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
