pith. machine review for the scientific record.

arxiv: 2603.27112 · v2 · submitted 2026-03-28 · 💻 cs.CV

Recognition: no theorem link

RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 22:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual question answering · automatic train operation · multimodal models · benchmark dataset · collaborative architecture · interpretability · railway perception

The pith

A benchmark of 21,168 cab-view QA pairs and a three-module collaborative model framework together enable efficient interpretable visual cognition for automatic train operation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates RailVQA-bench, the first visual question answering dataset built specifically for train cab views, with 20,000 single-frame and 1,168 video-based question-answer pairs to test perception and reasoning in both static and moving scenes. It also presents RailVQA-CoM, a framework that pairs efficient small models with large multimodal models inside a transparent three-module structure, combined with adaptive sampling of video frames. This design targets the problems of high compute cost and hallucination that block large models from safety-critical train use. If the approach works, automatic train systems gain reliable high-level visual planning at lower cost and with clearer reasoning steps.
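
To make the division of labor concrete, here is a minimal sketch of that collaborative pattern, assuming a toy evidence format. The function names and schema are illustrative, not the paper's actual module interfaces.

```python
def small_perceive(frame: str) -> dict:
    # Stand-in for an efficient perception model (e.g. a YOLO-class
    # detector): emits compact, auditable evidence instead of raw pixels.
    return {"frame": frame, "objects": ["track", "signal:red"]}

def large_reason(question: str, evidence: list) -> str:
    # Stand-in for the large multimodal model: it receives only the
    # serialized evidence, which bounds its input cost and keeps the
    # reasoning step inspectable.
    seen = {obj for item in evidence for obj in item["objects"]}
    return "stop" if "signal:red" in seen else "proceed"

def answer(question: str, frames: list) -> str:
    evidence = [small_perceive(f) for f in frames]  # perception module
    return large_reason(question, evidence)         # reasoning + planning

print(answer("Should the train proceed?", ["frame_000", "frame_001"]))  # stop
```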

Core claim

The central claim is that the RailVQA-CoM collaborative large-small model framework, using a transparent three-module architecture for perception, reasoning and planning together with adaptive temporal sampling, substantially improves performance, interpretability, efficiency and cross-domain generalization on visual cognition tasks for automatic train operation when evaluated on the new RailVQA-bench dataset.

What carries the argument

The three-module collaborative large-small model architecture that separates perception, reasoning and decision planning while using adaptive temporal sampling to process video inputs efficiently.
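
The adaptive temporal sampling idea also admits a short sketch: sample sparsely where consecutive frames barely change and densely where they do. The difference criterion, threshold, and frame budget below are assumptions for illustration; the paper's actual sampling rule is not specified in this review.

```python
import numpy as np

def adaptive_sample(frames: list, threshold: float = 10.0,
                    budget: int = 16) -> list:
    """Keep a frame only when it differs enough from the last kept frame."""
    kept = [0]                              # always keep the first frame
    last = frames[0].astype(np.float32)
    for i in range(1, len(frames)):
        cur = frames[i].astype(np.float32)
        if np.abs(cur - last).mean() > threshold:  # cheap change proxy
            kept.append(i)
            last = cur
            if len(kept) == budget:         # respect the compute budget
                break
    return kept

# Toy video: 50 identical frames then 50 noisy ones; the static stretch
# contributes a single frame while the dynamic stretch is sampled densely.
static = [np.zeros((8, 8), dtype=np.uint8)] * 50
moving = [np.random.randint(0, 255, (8, 8), dtype=np.uint8) for _ in range(50)]
print(adaptive_sample(static + moving))     # [0, 50, 51, ..., 64]
```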

If this is right

  • Automatic train systems can perform reliable high-level visual planning in complex environments at reduced computational cost.
  • Reasoning steps become more transparent because each module contributes visibly to the final decision.
  • Performance gains appear on both static images and dynamic video sequences from cab views.
  • The same framework shows improved results when transferred to other autonomous driving domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The small-large model pairing could be tested on road-vehicle perception tasks where real-time constraints are equally strict.
  • The benchmark dataset offers a ready-made way to measure hallucination rates in multimodal models for transportation; a sketch of such a check follows this list.
  • Live deployment would require separate checks on actual train routes to confirm that efficiency gains preserve safety margins.
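
On the hallucination point above, per-frame entity annotations support a POPE-style object-hallucination check: flag any answer that names a railway entity absent from the frame's annotation. A minimal sketch, assuming hypothetical field names and a toy entity list rather than the RailVQA-bench schema:

```python
RAIL_ENTITIES = {"pedestrian", "signal", "train", "track", "platform"}

def hallucination_rate(samples: list) -> float:
    """Fraction of answers naming an entity absent from the annotation."""
    flagged = 0
    for s in samples:
        mentioned = {e for e in RAIL_ENTITIES if e in s["answer"].lower()}
        if mentioned - s["gt_entities"]:  # named something not in the scene
            flagged += 1
    return flagged / len(samples)

demo = [
    {"answer": "A pedestrian is walking near the track.",
     "gt_entities": {"pedestrian", "track"}},
    {"answer": "Stop: a train occupies the platform.",
     "gt_entities": {"track", "signal"}},
]
print(hallucination_rate(demo))  # 0.5 — the second answer is ungrounded
```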

Load-bearing premise

The collected 21,168 QA pairs adequately represent rare safety-critical railway corner cases and the three-module design reduces hallucination risk without lowering decision accuracy.

What would settle it

The central claim would be falsified by a test set of previously unseen real-world railway corner cases on which the collaborative model, while using similar compute, either hallucinates answers or falls below large-model accuracy.
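
Read operationally, that test reduces to a simple predicate over held-out corner-case results. The metric fields and compute tolerance below are illustrative assumptions, not the paper's protocol:

```python
def claim_falsified(collab: dict, baseline: dict,
                    compute_slack: float = 0.10) -> bool:
    """Each dict: {'accuracy': float, 'halluc_rate': float, 'flops': float}."""
    similar_compute = collab["flops"] <= baseline["flops"] * (1 + compute_slack)
    worse = (collab["accuracy"] < baseline["accuracy"]
             or collab["halluc_rate"] > baseline["halluc_rate"])
    return similar_compute and worse

# Toy reading: matched compute but lower accuracy and more hallucination
# on unseen corner cases would count against the central claim.
print(claim_falsified(
    {"accuracy": 0.71, "halluc_rate": 0.09, "flops": 1.0e12},
    {"accuracy": 0.78, "halluc_rate": 0.06, "flops": 1.0e12},
))  # True
```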

Figures

Figures reproduced from arXiv: 2603.27112 by Jiani Li, Kailun Zhang, Qunbo Wang, Runmei Li, Sen Zhang, Shizhuang Deng, Tao Zhang, Wenjun Wu, Yuhe Zhang, Zhichao Zheng.

Figure 1. Overview of the proposed RailVQA-CoM framework, which consists of three hierarchical modules: (1) a Perception module that efficiently extracts …
Figure 2. This figure presents the standardized input–output schemas for the benchmark’s two core subtasks: Static Single-frame VQA and Dynamic Multi-frame …
Figure 3. Comprehensive statistical overview of the RailVQA-bench dataset. (a) shows the distribution of generated CoT character lengths, reflecting the benchmark’s emphasis on logic-intensive reasoning. (b) reports the occurrence frequencies of key railway entities, demonstrating broad domain-specific coverage. (c) summarizes the distribution of question intents, indicating a primary focus on action planning and sa…
Figure 4. Performance-efficiency comparison in dynamic scenarios. RailVQA …
Figure 5. Ablation study of the core middleware components of RailVQA-CoM.
Figure 6. Qualitative results of the RailVQA-CoM framework. (a) Case 1: The perception module tracks pedestrians and handles transient occlusions via track …
Original abstract

As Automatic Train Operation (ATO) advances toward GoA4 and beyond, it increasingly depends on efficient, reliable cab-view visual perception and decision-oriented inference to ensure safe operation in complex and dynamic railway environments. However, existing approaches focus primarily on basic perception and often generalize poorly to rare yet safety-critical corner cases. They also lack the high-level reasoning and planning capabilities required for operational decision-making. Although recent Large Multi-modal Models (LMMs) show strong generalization and cognitive capabilities, their use in safety-critical ATO is hindered by high computational cost and hallucination risk. Meanwhile, reliable domain-specific benchmarks for systematically evaluating cognitive capabilities are still lacking. To address these gaps, we introduce RailVQA-bench, the first VQA benchmark for cab-view visual cognition in ATO, comprising 20,000 single-frame and 1,168 video-based QA pairs to evaluate cognitive generalization and interpretability in both static and dynamic scenarios. Furthermore, we propose RailVQA-CoM, a collaborative large-small model framework that combines small-model efficiency with large-model cognition via a transparent three-module architecture and adaptive temporal sampling, improving perceptual generalization and enabling more efficient reasoning and planning. Experiments demonstrate that the proposed approach substantially improves performance, enhances interpretability, improves efficiency, and strengthens cross-domain generalization in autonomous driving systems. Code and datasets will be available at https://cybereye-bjtu.github.io/RailVQA.html.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces RailVQA-bench, the first VQA benchmark for cab-view visual cognition in Automatic Train Operation (ATO), comprising 20,000 single-frame and 1,168 video-based QA pairs to evaluate cognitive generalization and interpretability. It also proposes RailVQA-CoM, a collaborative large-small model framework with a transparent three-module architecture and adaptive temporal sampling that combines small-model efficiency with large-model cognition for perceptual generalization, reasoning, and planning. Experiments are stated to demonstrate substantial improvements in performance, interpretability, efficiency, and cross-domain generalization to autonomous driving systems.

Significance. If the experimental results hold, the benchmark would address the absence of domain-specific resources for high-level reasoning in safety-critical ATO, while the framework could offer a deployable route to using LMMs under computational and reliability constraints, potentially supporting GoA4+ operations.

major comments (2)
  1. [Abstract] The statement that experiments demonstrate substantial improvements in performance, interpretability, efficiency, and cross-domain generalization supplies no quantitative metrics, baselines, ablation studies, or error analysis, making it impossible to determine whether the data support the central claims.
  2. [Abstract] The claim that RailVQA-CoM strengthens cross-domain generalization in autonomous driving systems lacks any supporting evidence; no transfer experiments, domain-adaptation results, or shared-feature analysis between railway and road domains are described, despite the benchmark and framework being constructed exclusively from cab-view railway imagery.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recognition of the potential significance of RailVQA-bench and RailVQA-CoM. We address the major comments point-by-point below and will revise the manuscript to strengthen the abstract.

Point-by-point responses
  1. Referee: [Abstract] The statement that experiments demonstrate substantial improvements in performance, interpretability, efficiency, and cross-domain generalization supplies no quantitative metrics, baselines, ablation studies, or error analysis, making it impossible to determine whether the data support the central claims.

    Authors: We agree that the abstract should provide key quantitative highlights. The full manuscript contains detailed experimental results with specific metrics (accuracy, efficiency, interpretability scores), baseline comparisons, module ablations, and error analysis demonstrating the claimed improvements. We will revise the abstract to include representative quantitative results from these experiments. revision: yes

  2. Referee: [Abstract] The claim that RailVQA-CoM strengthens cross-domain generalization in autonomous driving systems lacks any supporting evidence; no transfer experiments, domain-adaptation results, or shared-feature analysis between railway and road domains are described, despite the benchmark and framework being constructed exclusively from cab-view railway imagery.

    Authors: We acknowledge that the manuscript contains no transfer experiments, domain-adaptation results, or analysis between railway and road domains. The abstract claim was not supported by evidence and will be removed or qualified in revision to focus only on results within the railway ATO domain and the framework's modular design for potential future generalization. revision: yes

Circularity Check

0 steps flagged

No circularity in benchmark construction or framework claims

Full rationale

The paper introduces RailVQA-bench as a new dataset of 21,168 QA pairs and RailVQA-CoM as a three-module collaborative architecture, then reports empirical results on performance, interpretability, efficiency, and generalization. No equations, fitted parameters, or self-referential definitions appear that would reduce any claimed prediction to the inputs by construction. The cross-domain generalization statement about autonomous driving is an unsupported extension rather than a circular reduction, and no load-bearing self-citations or ansatz smuggling are present. The derivation chain remains self-contained, with claims evaluated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the work rests on standard assumptions of multimodal learning and VQA evaluation.

pith-pipeline@v0.9.0 · 5590 in / 1086 out tokens · 54142 ms · 2026-05-14T22:19:39.929299+00:00 · methodology

