pith. machine review for the scientific record.

arxiv: 2603.27112 · v2 · submitted 2026-03-28 · 💻 cs.CV

Recognition: no theorem link

RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 22:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual question answering · automatic train operation · multimodal models · benchmark dataset · collaborative architecture · interpretability · railway perception

The pith

A benchmark of 21,168 cab-view QA pairs and a three-module collaborative model framework together enable efficient interpretable visual cognition for automatic train operation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates RailVQA-bench, the first visual question answering dataset built specifically for train cab views, with 20,000 single-frame and 1,168 video-based question-answer pairs to test perception and reasoning in both static and moving scenes. It also presents RailVQA-CoM, a framework that pairs efficient small models with large multimodal models inside a transparent three-module structure, combined with adaptive sampling of video frames. This design targets the problems of high compute cost and hallucination that block large models from safety-critical train use. If the approach works, automatic train systems gain reliable high-level visual planning at lower cost and with clearer reasoning steps.
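
To make the division of labor concrete, here is a minimal sketch of that collaborative pattern, assuming a toy evidence format. The function names and schema are illustrative, not the paper's actual module interfaces.

```python
def small_perceive(frame: str) -> dict:
    # Stand-in for an efficient perception model (e.g. a YOLO-class
    # detector): emits compact, auditable evidence instead of raw pixels.
    return {"frame": frame, "objects": ["track", "signal:red"]}

def large_reason(question: str, evidence: list) -> str:
    # Stand-in for the large multimodal model: it receives only the
    # serialized evidence, which bounds its input cost and keeps the
    # reasoning step inspectable.
    seen = {obj for item in evidence for obj in item["objects"]}
    return "stop" if "signal:red" in seen else "proceed"

def answer(question: str, frames: list) -> str:
    evidence = [small_perceive(f) for f in frames]  # perception module
    return large_reason(question, evidence)         # reasoning + planning

print(answer("Should the train proceed?", ["frame_000", "frame_001"]))  # stop
```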

Core claim

The central claim is that the RailVQA-CoM collaborative large-small model framework, using a transparent three-module architecture for perception, reasoning and planning together with adaptive temporal sampling, substantially improves performance, interpretability, efficiency and cross-domain generalization on visual cognition tasks for automatic train operation when evaluated on the new RailVQA-bench dataset.

What carries the argument

The three-module collaborative large-small model architecture that separates perception, reasoning and decision planning while using adaptive temporal sampling to process video inputs efficiently.
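
The adaptive temporal sampling idea also admits a short sketch: sample sparsely where consecutive frames barely change and densely where they do. The difference criterion, threshold, and frame budget below are assumptions for illustration; the paper's actual sampling rule is not specified in this review.

```python
import numpy as np

def adaptive_sample(frames: list, threshold: float = 10.0,
                    budget: int = 16) -> list:
    """Keep a frame only when it differs enough from the last kept frame."""
    kept = [0]                              # always keep the first frame
    last = frames[0].astype(np.float32)
    for i in range(1, len(frames)):
        cur = frames[i].astype(np.float32)
        if np.abs(cur - last).mean() > threshold:  # cheap change proxy
            kept.append(i)
            last = cur
            if len(kept) == budget:         # respect the compute budget
                break
    return kept

# Toy video: 50 identical frames then 50 noisy ones; the static stretch
# contributes a single frame while the dynamic stretch is sampled densely.
static = [np.zeros((8, 8), dtype=np.uint8)] * 50
moving = [np.random.randint(0, 255, (8, 8), dtype=np.uint8) for _ in range(50)]
print(adaptive_sample(static + moving))     # [0, 50, 51, ..., 64]
```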

If this is right

  • Automatic train systems can perform reliable high-level visual planning in complex environments at reduced computational cost.
  • Reasoning steps become more transparent because each module contributes visibly to the final decision.
  • Performance gains appear on both static images and dynamic video sequences from cab views.
  • The same framework shows improved results when transferred to other autonomous driving domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The small-large model pairing could be tested on road-vehicle perception tasks where real-time constraints are equally strict.
  • The benchmark dataset offers a ready-made way to measure hallucination rates in multimodal models for transportation; a sketch of such a check follows this list.
  • Live deployment would require separate checks on actual train routes to confirm that efficiency gains preserve safety margins.
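
On the hallucination point above, per-frame entity annotations support a POPE-style object-hallucination check: flag any answer that names a railway entity absent from the frame's annotation. A minimal sketch, assuming hypothetical field names and a toy entity list rather than the RailVQA-bench schema:

```python
RAIL_ENTITIES = {"pedestrian", "signal", "train", "track", "platform"}

def hallucination_rate(samples: list) -> float:
    """Fraction of answers naming an entity absent from the annotation."""
    flagged = 0
    for s in samples:
        mentioned = {e for e in RAIL_ENTITIES if e in s["answer"].lower()}
        if mentioned - s["gt_entities"]:  # named something not in the scene
            flagged += 1
    return flagged / len(samples)

demo = [
    {"answer": "A pedestrian is walking near the track.",
     "gt_entities": {"pedestrian", "track"}},
    {"answer": "Stop: a train occupies the platform.",
     "gt_entities": {"track", "signal"}},
]
print(hallucination_rate(demo))  # 0.5 — the second answer is ungrounded
```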

Load-bearing premise

The collected 21,168 QA pairs adequately represent rare safety-critical railway corner cases and the three-module design reduces hallucination risk without lowering decision accuracy.

What would settle it

The central claim would be falsified by a test set of previously unseen real-world railway corner cases on which the collaborative model, while using similar compute, either hallucinates answers or falls below large-model accuracy.
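
Read operationally, that test reduces to a simple predicate over held-out corner-case results. The metric fields and compute tolerance below are illustrative assumptions, not the paper's protocol:

```python
def claim_falsified(collab: dict, baseline: dict,
                    compute_slack: float = 0.10) -> bool:
    """Each dict: {'accuracy': float, 'halluc_rate': float, 'flops': float}."""
    similar_compute = collab["flops"] <= baseline["flops"] * (1 + compute_slack)
    worse = (collab["accuracy"] < baseline["accuracy"]
             or collab["halluc_rate"] > baseline["halluc_rate"])
    return similar_compute and worse

# Toy reading: matched compute but lower accuracy and more hallucination
# on unseen corner cases would count against the central claim.
print(claim_falsified(
    {"accuracy": 0.71, "halluc_rate": 0.09, "flops": 1.0e12},
    {"accuracy": 0.78, "halluc_rate": 0.06, "flops": 1.0e12},
))  # True
```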

Figures

Figures reproduced from arXiv: 2603.27112 by Jiani Li, Kailun Zhang, Qunbo Wang, Runmei Li, Sen Zhang, Shizhuang Deng, Tao Zhang, Wenjun Wu, Yuhe Zhang, Zhichao Zheng.

Figure 1. Overview of the proposed RailVQA-CoM framework, which consists of three hierarchical modules: (1) a Perception module that efficiently extracts …
Figure 2. This figure presents the standardized input–output schemas for the benchmark’s two core subtasks: Static Single-frame VQA and Dynamic Multi-frame …
Figure 3. Comprehensive statistical overview of the RailVQA-bench dataset. (a) shows the distribution of generated CoT character lengths, reflecting the benchmark’s emphasis on logic-intensive reasoning. (b) reports the occurrence frequencies of key railway entities, demonstrating broad domain-specific coverage. (c) summarizes the distribution of question intents, indicating a primary focus on action planning and sa…
Figure 4. Performance-efficiency comparison in dynamic scenarios. RailVQA …
Figure 5. Ablation study of the core middleware components of RailVQA-CoM.
Figure 6. Qualitative results of the RailVQA-CoM framework. (a) Case 1: The perception module tracks pedestrians and handles transient occlusions via track …
Original abstract

As Automatic Train Operation (ATO) advances toward GoA4 and beyond, it increasingly depends on efficient, reliable cab-view visual perception and decision-oriented inference to ensure safe operation in complex and dynamic railway environments. However, existing approaches focus primarily on basic perception and often generalize poorly to rare yet safety-critical corner cases. They also lack the high-level reasoning and planning capabilities required for operational decision-making. Although recent Large Multi-modal Models (LMMs) show strong generalization and cognitive capabilities, their use in safety-critical ATO is hindered by high computational cost and hallucination risk. Meanwhile, reliable domain-specific benchmarks for systematically evaluating cognitive capabilities are still lacking. To address these gaps, we introduce RailVQA-bench, the first VQA benchmark for cab-view visual cognition in ATO, comprising 20,000 single-frame and 1,168 video-based QA pairs to evaluate cognitive generalization and interpretability in both static and dynamic scenarios. Furthermore, we propose RailVQA-CoM, a collaborative large-small model framework that combines small-model efficiency with large-model cognition via a transparent three-module architecture and adaptive temporal sampling, improving perceptual generalization and enabling more efficient reasoning and planning. Experiments demonstrate that the proposed approach substantially improves performance, enhances interpretability, improves efficiency, and strengthens cross-domain generalization in autonomous driving systems. Code and datasets will be available at https://cybereye-bjtu.github.io/RailVQA.html.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces RailVQA-bench, the first VQA benchmark for cab-view visual cognition in Automatic Train Operation (ATO), comprising 20,000 single-frame and 1,168 video-based QA pairs to evaluate cognitive generalization and interpretability. It also proposes RailVQA-CoM, a collaborative large-small model framework with a transparent three-module architecture and adaptive temporal sampling that combines small-model efficiency with large-model cognition for perceptual generalization, reasoning, and planning. Experiments are stated to demonstrate substantial improvements in performance, interpretability, efficiency, and cross-domain generalization to autonomous driving systems.

Significance. If the experimental results hold, the benchmark would address the absence of domain-specific resources for high-level reasoning in safety-critical ATO, while the framework could offer a deployable route to using LMMs under computational and reliability constraints, potentially supporting GoA4+ operations.

major comments (2)
  1. [Abstract] The statement that experiments demonstrate substantial improvements in performance, interpretability, efficiency, and cross-domain generalization supplies no quantitative metrics, baselines, ablation studies, or error analysis, making it impossible to determine whether the data support the central claims.
  2. [Abstract] The claim that RailVQA-CoM strengthens cross-domain generalization in autonomous driving systems lacks any supporting evidence; no transfer experiments, domain-adaptation results, or shared-feature analysis between railway and road domains are described, despite the benchmark and framework being constructed exclusively from cab-view railway imagery.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recognition of the potential significance of RailVQA-bench and RailVQA-CoM. We address the major comments point-by-point below and will revise the manuscript to strengthen the abstract.

Point-by-point responses
  1. Referee: [Abstract] The statement that experiments demonstrate substantial improvements in performance, interpretability, efficiency, and cross-domain generalization supplies no quantitative metrics, baselines, ablation studies, or error analysis, making it impossible to determine whether the data support the central claims.

    Authors: We agree that the abstract should provide key quantitative highlights. The full manuscript contains detailed experimental results with specific metrics (accuracy, efficiency, interpretability scores), baseline comparisons, module ablations, and error analysis demonstrating the claimed improvements. We will revise the abstract to include representative quantitative results from these experiments. revision: yes

  2. Referee: [Abstract] The claim that RailVQA-CoM strengthens cross-domain generalization in autonomous driving systems lacks any supporting evidence; no transfer experiments, domain-adaptation results, or shared-feature analysis between railway and road domains are described, despite the benchmark and framework being constructed exclusively from cab-view railway imagery.

    Authors: We acknowledge that the manuscript contains no transfer experiments, domain-adaptation results, or analysis between railway and road domains. The abstract claim was not supported by evidence and will be removed or qualified in revision to focus only on results within the railway ATO domain and the framework's modular design for potential future generalization. revision: yes

Circularity Check

0 steps flagged

No circularity in benchmark construction or framework claims

Full rationale

The paper introduces RailVQA-bench as a new dataset of 21,168 QA pairs and RailVQA-CoM as a three-module collaborative architecture, then reports empirical results on performance, interpretability, efficiency, and generalization. No equations, fitted parameters, or self-referential definitions appear that would reduce any claimed prediction to the inputs by construction. The cross-domain generalization statement about autonomous driving is an unsupported extension rather than a circular reduction, and no load-bearing self-citations or ansatz smuggling are present. The derivation chain remains self-contained, with claims evaluated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the work rests on standard assumptions of multimodal learning and VQA evaluation.

pith-pipeline@v0.9.0 · 5590 in / 1086 out tokens · 54142 ms · 2026-05-14T22:19:39.929299+00:00 · methodology

