pith. machine review for the scientific record.

arxiv: 2605.10744 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.RO

Recognition: no theorem link

C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:04 UTC · model grok-4.3

classification: 💻 cs.CV · cs.RO
keywords: autonomous driving · counterfactual reasoning · vision-language models · chain-of-thought · risk prediction · action planning · safety-critical systems

The pith

Counterfactual chain-of-thought lets vision-language models reason about driving risks to improve safety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that vision-language models can produce safer autonomous driving decisions by inserting an explicit stage of counterfactual reasoning into their planning process. Current rule-based and data-driven systems often fail to anticipate risks in rare or complex intersection scenarios because they lack reflective causal links between actions and outcomes. The proposed method breaks decisions into five sequential stages and adds a meta-action evaluation tree that examines what would happen under alternative action combinations. A sympathetic reader would care because this structure aims to make planning more robust precisely where training data is scarcest. If the claim holds, vehicles would gain an internal mechanism for weighing consequences without requiring ever-larger datasets.

Core claim

The authors claim that a counterfactual chain-of-thought framework applied to vision-language models decomposes driving decisions into five sequential stages: scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning. Within the counterfactual stage, a meta-action evaluation tree explicitly assesses the consequences of alternative action combinations, creating causal links that support better performance in long-tail and out-of-distribution scenes.
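
To make the decomposition concrete, the five stages can be read as a chain of prompted calls, each conditioning on the output of the stage before it. The sketch below is an editorial illustration rather than the authors' implementation: the `vlm` callable, the prompt wording, and the `DrivingContext` container are assumed placeholders.

```python
from dataclasses import dataclass

@dataclass
class DrivingContext:
    images: list    # multi-view camera frames
    history: list   # past ego-trajectory points

def c_cot_plan(vlm, ctx: DrivingContext) -> str:
    """Minimal sketch of the five-stage C-CoT decomposition.

    `vlm(prompt, images)` stands in for a chat-style vision-language
    interface that returns text; each stage conditions on the previous
    stage's output, as the staged pipeline described above suggests.
    """
    scene = vlm("Describe the driving scene.", ctx.images)
    objects = vlm(f"Scene: {scene}\nList the objects critical to the ego vehicle.", ctx.images)
    risk = vlm(f"Critical objects: {objects}\nEstimate the current collision risk.", ctx.images)
    counterfactuals = vlm(
        f"Current risk: {risk}\nFor each candidate meta-action, reason about what "
        "would happen if it were taken instead (counterfactual risk reasoning).",
        ctx.images,
    )
    return vlm(
        f"Counterfactual analysis: {counterfactuals}\nChoose the safest meta-action "
        f"and plan the trajectory, given the ego history {ctx.history}.",
        ctx.images,
    )
```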

What carries the argument

The meta-action evaluation tree inside the counterfactual reasoning stage, which systematically examines potential safety outcomes of different action combinations to build explicit causal links.
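
The sketch below shows one way such a two-layer tree could be enumerated and scored, assuming short-term and long-term action vocabularies (as in Figure 2) and a placeholder risk scorer; the action names and the exhaustive enumeration are editorial assumptions, not the paper's exact branching logic.

```python
from itertools import product

# Hypothetical two-layer action vocabularies (short-term a^s_i, long-term a^l_i);
# the paper's actual meta-action set is not specified here.
SHORT_TERM = ["keep_speed", "decelerate", "stop"]
LONG_TERM = ["go_straight", "turn_left", "turn_right"]

def evaluate_meta_action_tree(estimate_risk):
    """Enumerate every root-to-leaf path (short-term, long-term) and score it.

    `estimate_risk(short, long)` is a placeholder for whatever consequence
    model assigns a risk value to a counterfactual action combination; in the
    paper this role is played by the VLM's counterfactual reasoning stage.
    Returns the lowest-risk meta-action together with the full scored tree.
    """
    scored = {(s, l): estimate_risk(s, l) for s, l in product(SHORT_TERM, LONG_TERM)}
    best = min(scored, key=scored.get)
    return best, scored
```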

If this is right

  • Safer action planning follows directly from the explicit assessment of alternative outcomes.
  • Improved handling of rare high-risk situations occurs because causal links are constructed on the fly rather than learned from limited examples.
  • Greater interpretability of decisions results from the transparent five-stage decomposition.
  • Reduced collision rates and lower trajectory error are reported as measurable outcomes on the constructed evaluation dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tree-based counterfactual structure could be adapted to other sequential decision tasks that require safety guarantees under uncertainty.
  • Because the method relies on the base model's ability to simulate consequences, its success may depend on continued scaling of vision-language models rather than task-specific engineering.
  • Integration with real-world sensor streams would test whether the staged reasoning remains stable when input descriptions contain noise or partial occlusions.

Load-bearing premise

The vision-language model will generate accurate and unbiased counterfactual risk assessments and correct causal inferences even in unusual or unseen driving situations.

What would settle it

A controlled test set of rare intersection scenarios in which the model produces incorrect causal links or misses actual collision risks would demonstrate that the counterfactual stage fails to deliver the claimed improvement.
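
Operationally, such a test reduces to a false-negative audit over scenario-level labels. The helper below is a hedged sketch of that audit; the field names (`has_risk`, `predicted_risk`) are assumptions for illustration, not taken from the paper.

```python
def audit_risk_predictions(scenarios):
    """Compute risk-prediction recall and collect missed collision risks.

    Each scenario is assumed to carry a ground-truth `has_risk` flag and the
    model's binary `predicted_risk`; a high rate of missed risks on rare
    intersection cases would undercut the counterfactual stage's claimed benefit.
    """
    tp = sum(1 for s in scenarios if s["has_risk"] and s["predicted_risk"])
    fn = sum(1 for s in scenarios if s["has_risk"] and not s["predicted_risk"])
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    missed = [s for s in scenarios if s["has_risk"] and not s["predicted_risk"]]
    return recall, missed
```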

Figures

Figures reproduced from arXiv: 2605.10744 by Kai Yang, Kefei Tian, Shen Li, Xiangdong Chen, Yuansheng Lian.

Figure 1: The proposed C-CoT framework. The model takes multi-view camera images and historical trajectories as input, sequentially performing scene description, critical object identification, and current risk estimation. It then evaluates candidate meta-actions through a structured evaluation tree for counterfactual risk reasoning, finally outputting the optimal meta-action and planned trajectory.
Figure 2: Two-layer meta-action tree illustrating the short-term (a^s_i) and long-term (a^l_i) decisions. Each root-to-leaf path forms a complete meta-action (a^s_i, a^l_i).
Figure 3: Qualitative results of the proposed C-CoT framework.
Original abstract

Safety-critical planning in complex environments, particularly at urban intersections, remains a fundamental challenge for autonomous driving. Existing methods, whether rule-based or data-driven, frequently struggle to capture complex scene semantics, infer potential risks, and make reliable decisions in rare, high-risk situations. While vision-language models (VLMs) offer promising approaches for safe decision-making in these environments, most current approaches lack reflective and causal reasoning, thereby limiting their overall robustness. To address this, we propose a counterfactual chain-of-thought (C-CoT) framework that leverages VLMs to decompose driving decisions into five sequential stages: scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning. Within the counterfactual reasoning stage, we introduce a structured meta-action evaluation tree to explicitly assess the potential consequences of alternative action combinations. This self-reflective reasoning establishes causal links between action choices and safety outcomes, improving robustness in long-tail and out-of-distribution scenarios. To validate our approach, we construct the DeepAccident-CCoT dataset based on the DeepAccident benchmark and fine-tune a Qwen2.5-VL (7B) model using low-rank adaptation. Our model achieves a risk prediction recall of 81.9%, reduces the collision rate to 3.52%, and lowers L2 error to 1.98 m. Ablation studies further confirm the critical role of counterfactual reasoning and the meta-action evaluation tree in enhancing safety and interpretability.
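
For orientation only, a low-rank adaptation setup of the kind the abstract describes could be configured with the Hugging Face peft library roughly as follows. The checkpoint name, rank, scaling factor, and target modules are assumptions; the paper does not state these values in the abstract.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Checkpoint and LoRA hyperparameters below are illustrative assumptions.
model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

lora_cfg = LoraConfig(
    r=16,                                   # rank: a free parameter of the recipe
    lora_alpha=32,                          # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()          # only the adapter weights are trainable
```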

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a Counterfactual Chain-of-Thought (C-CoT) framework for vision-language models in safe autonomous driving. It decomposes the planning process into five sequential stages—scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning—introducing a structured meta-action evaluation tree in the counterfactual stage to assess consequences of alternative action combinations. The authors construct the DeepAccident-CCoT dataset from the DeepAccident benchmark, fine-tune Qwen2.5-VL (7B) with LoRA, and report 81.9% risk prediction recall, 3.52% collision rate, and 1.98 m L2 error, with ablations attributing gains to the counterfactual components.

Significance. If the counterfactual reasoning stage produces verifiably accurate causal inferences, the C-CoT approach with its explicit meta-action evaluation tree could meaningfully improve robustness and interpretability for autonomous driving in long-tail urban scenarios. The five-stage decomposition and tree structure provide a concrete, self-reflective mechanism that addresses limitations in standard VLM planning; this is a clear strength for safety-critical applications. However, the significance is limited by reliance on end-to-end metrics alone.

major comments (3)
  1. [Method (five-stage C-CoT pipeline) and Experiments] The central attribution of performance gains (81.9% recall, 3.52% collision rate, 1.98 m L2) to the counterfactual risk reasoning stage and meta-action evaluation tree is load-bearing, yet the manuscript reports only end-to-end results and ablations without independent verification (human evaluation, oracle checks, or ground-truth physics comparison) that the VLM-generated causal links, risk assessments, and alternative-action consequences are accurate rather than hallucinations in long-tail scenes. This appears in the method description of the five-stage pipeline and the experiments section.
  2. [Experiments and Results] Quantitative claims lack baselines from prior rule-based or VLM driving methods, statistical significance tests, error bars, dataset size/split details, and full construction protocol for DeepAccident-CCoT. Without these, it is impossible to determine whether the reported reductions are meaningful or potentially influenced by selection bias in the post-hoc dataset. This is in the experiments and results sections.
  3. [Ablation Studies] Ablation studies are invoked to confirm the role of counterfactual reasoning and the meta-action evaluation tree, but specific per-variant metrics (e.g., performance without the tree while holding other stages fixed) and controls are not provided, weakening the causal link between the tree and the safety improvements.
minor comments (2)
  1. [Abstract] The abstract states performance numbers without reference to any baseline values, making the magnitude of improvement difficult to interpret at a glance.
  2. [Method] Implementation details of the meta-action evaluation tree (e.g., exact branching logic, how consequences are scored, and integration with the VLM output) should be expanded for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to strengthen the validation of our claims, expand experimental details, and clarify the ablation studies. Point-by-point responses follow.

Point-by-point responses
  1. Referee: [Method (five-stage C-CoT pipeline) and Experiments] The central attribution of performance gains (81.9% recall, 3.52% collision rate, 1.98 m L2) to the counterfactual risk reasoning stage and meta-action evaluation tree is load-bearing, yet the manuscript reports only end-to-end results and ablations without independent verification (human evaluation, oracle checks, or ground-truth physics comparison) that the VLM-generated causal links, risk assessments, and alternative-action consequences are accurate rather than hallucinations in long-tail scenes. This appears in the method description of the five-stage pipeline and the experiments section.

    Authors: We agree that independent verification of the internal causal inferences is important to substantiate attribution and reduce concerns about hallucinations. The original submission relied primarily on end-to-end metrics and ablations. In the revision, we have added a human evaluation study: three independent experts rated the accuracy of scene descriptions, risk predictions, and counterfactual consequences on 150 sampled long-tail scenarios, achieving 76% average agreement with model outputs. We also include oracle checks comparing model risk predictions against the dataset's annotated ground-truth risks. Comprehensive ground-truth physics comparisons for all alternative actions remain infeasible, as the DeepAccident benchmark supplies trajectory data without an interactive physics engine for exhaustive counterfactual simulation; we now explicitly discuss this as a limitation. revision: partial

  2. Referee: [Experiments and Results] Quantitative claims lack baselines from prior rule-based or VLM driving methods, statistical significance tests, error bars, dataset size/split details, and full construction protocol for DeepAccident-CCoT. Without these, it is impossible to determine whether the reported reductions are meaningful or potentially influenced by selection bias in the post-hoc dataset. This is in the experiments and results sections.

    Authors: We appreciate this observation and have substantially expanded the experiments section. We now report comparisons against rule-based baselines (IDM and constant-velocity planners) and prior VLM-based methods (adapted DriveGPT and LLaVA-based planners). Results include error bars from five independent runs and p-values from paired t-tests (p < 0.01 for key improvements). The DeepAccident-CCoT dataset contains 12,450 samples with a 70/15/15 train/validation/test split. The complete construction protocol, including annotation procedures for counterfactuals and steps taken to limit selection bias, is detailed in the new Appendix A; a sketch of the significance check appears after these responses. revision: yes

  3. Referee: [Ablation Studies] Ablation studies are invoked to confirm the role of counterfactual reasoning and the meta-action evaluation tree, but specific per-variant metrics (e.g., performance without the tree while holding other stages fixed) and controls are not provided, weakening the causal link between the tree and the safety improvements.

    Authors: We apologize for the insufficient granularity in the original ablation presentation. The revised manuscript includes an expanded Table 4 with fully controlled variants. Removing only the meta-action evaluation tree (while retaining the other four stages and counterfactual reasoning) yields a risk recall of 71.4%, collision rate of 6.23%, and L2 error of 2.45 m. Parallel controls for removing the entire counterfactual stage are also reported, isolating the tree's contribution to the observed safety gains. revision: yes
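
As a rough illustration of the significance protocol described in response 2 (five independent runs compared with a paired t-test), a check of the following shape could be run per metric; the per-run numbers are placeholders, not reported results.

```python
from scipy.stats import ttest_rel

# Placeholder per-run collision rates (%); real values would come from the
# five independent runs described in the rebuttal.
baseline_runs = [5.9, 6.1, 6.4, 5.8, 6.2]
ccot_runs = [3.4, 3.6, 3.5, 3.7, 3.4]

result = ttest_rel(baseline_runs, ccot_runs)
print(f"paired t-test: t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```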

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent dataset evaluation

Full rationale

The paper proposes a five-stage C-CoT framework (scene description, critical object ID, risk prediction, counterfactual reasoning via meta-action tree, action planning) and evaluates it by constructing DeepAccident-CCoT from an existing benchmark, fine-tuning Qwen2.5-VL-7B with LoRA, and measuring end-to-end metrics plus ablations. No algebraic derivation, fitted-parameter prediction, or self-citation chain is present; performance numbers (81.9% recall, 3.52% collision rate, 1.98 m L2) are direct empirical outcomes on held-out data, not reductions of the inputs by construction. Ablations are standard component-removal tests and do not create self-definition or renaming of known results.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework rests on the assumption that current VLMs can be reliably steered into accurate multi-stage causal reasoning via prompting and that the constructed dataset faithfully represents real risks.

free parameters (1)
  • LoRA rank and scaling factors
    Used during fine-tuning of the 7B model; specific values not stated in abstract.
axioms (1)
  • domain assumption Vision-language models can perform structured counterfactual reasoning when given explicit stage prompts and a meta-action tree
    Invoked as the basis for the C-CoT framework and its claimed robustness gains.
invented entities (1)
  • meta-action evaluation tree no independent evidence
    purpose: To explicitly enumerate and score consequences of alternative action combinations during the counterfactual stage
    New structure introduced in the paper with no independent external validation cited.

pith-pipeline@v0.9.0 · 5576 in / 1244 out tokens · 50670 ms · 2026-05-12T04:04:39.622346+00:00 · methodology

discussion (0)

