arxiv: 2509.10026 · v4 · submitted 2025-09-12 · 💻 cs.CV

LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA

Pith reviewed 2026-05-18 17:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords multilingual VQA · visual chain-of-thought · reward optimization · vision-language models · language-aware reasoning · multimodal pipeline · GRPO training

The pith

LaV-CoT uses a language-aware visual chain-of-thought with multi-aspect rewards to achieve up to ~9.5% higher accuracy than similar-size open-source baselines in multilingual VQA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to show that current vision-language models can be significantly improved for multilingual visual question answering by adding language awareness to visual chain-of-thought reasoning. It proposes a structured pipeline that generates a text summary with bounding boxes, identifies the language, captions objects at the spatial level, and then reasons step by step. Training data is synthesized through iterative generation, correction, and refinement. The model is then trained with supervised fine-tuning followed by group relative policy optimization, driven by rewards that check language consistency, structural accuracy, and semantic alignment. The resulting model beats open-source models of similar size by up to ~9.5% and models 2× larger by ~2.6%. An online A/B test on real-world data supports the result, which matters for making AI tools work across languages in practical settings.

Core claim

The central claim is that the LaV-CoT framework delivers substantial accuracy gains on MMMB, Multilingual MMBench, and MTVQA: up to ~9.5% over open-source baselines of similar size and ~2.6% over models 2× larger. The framework combines a multi-stage reasoning pipeline (Text Summary with Bounding Box, Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning) with automated multilingual CoT data curation and two-stage training: SFT followed by Language-aware GRPO with multi-aspect rewards.
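
The four stages named above suggest a structured model output whose shape can be checked mechanically. The sketch below is only an illustration: the tag names in `STAGES` and both helper functions are hypothetical, since this page does not reproduce the paper's actual output schema.

```python
import re

# Hypothetical tag names, one per pipeline stage; not taken from the paper.
STAGES = ["summary", "language", "caption", "reasoning"]


def stage_spans(output: str) -> dict[str, str]:
    """Return the stripped body of each <stage>...</stage> block that is present."""
    spans = {}
    for name in STAGES:
        match = re.search(rf"<{name}>(.*?)</{name}>", output, flags=re.DOTALL)
        if match:
            spans[name] = match.group(1).strip()
    return spans


def follows_pipeline(output: str) -> bool:
    """True when each opening tag appears exactly once and in pipeline order."""
    positions = []
    for name in STAGES:
        if output.count(f"<{name}>") != 1:
            return False
        positions.append(output.index(f"<{name}>"))
    return positions == sorted(positions)


sample = (
    "<summary>Sign text at [10, 20, 200, 80]</summary>"
    "<language>id</language>"
    "<caption>A blue street sign</caption>"
    "<reasoning>The sign names the street.</reasoning>"
)
print(follows_pipeline(sample), stage_spans(sample)["language"])
```

A check of this kind is one plausible basis for a structural reward, though the paper may verify structure differently.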

What carries the argument

The interpretable multi-stage Language-aware Visual CoT reasoning pipeline together with Language-aware Group Relative Policy Optimization (GRPO) using rewards for language consistency, structural accuracy, and semantic alignment.
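
As a rough sketch of the mechanism, GRPO scores each sampled response relative to the other responses drawn for the same prompt. The code below assumes the three reward aspects are combined by a weighted sum and normalized with the population standard deviation; the weights, the function names, and the `eps` term are assumptions for illustration, not details confirmed by this page.

```python
from statistics import mean, pstdev

# Hypothetical equal weights; the page does not state how rewards are combined.
WEIGHTS = {"language": 1.0, "structure": 1.0, "semantic": 1.0}


def total_reward(parts: dict[str, float]) -> float:
    """Weighted sum of the per-aspect rewards for one sampled response."""
    return sum(WEIGHTS[key] * parts[key] for key in WEIGHTS)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three responses sampled for one prompt, each scored on the three aspects.
group = [
    {"language": 1.0, "structure": 1.0, "semantic": 0.0},
    {"language": 1.0, "structure": 1.0, "semantic": 1.0},
    {"language": 0.0, "structure": 1.0, "semantic": 0.0},
]
advantages = group_relative_advantages([total_reward(p) for p in group])
print(advantages)
```

When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal, which is one reason the reward design matters.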

If this is right

  • The automated data curation method allows for scalable creation of high-quality multilingual CoT annotations.
  • The two-stage training paradigm improves reasoning capabilities and generalization across languages.
  • Performance gains enable the model to surpass open-source models of similar size, models 2× larger, and proprietary models such as GPT-4o-0513 and Gemini-2.5-flash.
  • Validation on real-world data through A/B testing supports its use in industrial applications.
  • The approach enhances interpretability of the reasoning process in multilingual multimodal tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying this visual CoT approach to other vision-language tasks could improve performance in areas like visual grounding or document understanding in multiple languages.
  • Further research might explore how the reward components interact to avoid introducing language-specific biases in low-resource languages.
  • The framework's efficiency at smaller scales suggests it could be adapted for resource-constrained environments without needing massive model sizes.

Load-bearing premise

The multi-aspect rewards accurately capture and promote correct multilingual visual reasoning without introducing biases or new errors.
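
To see why this premise carries weight, consider a deliberately naive check. The sketch below assumes, hypothetically, that language consistency were verified by Unicode script alone; `script_fraction` is an illustrative helper and not the paper's reward.

```python
import unicodedata


def script_fraction(text: str, script: str) -> float:
    """Share of alphabetic characters whose Unicode name starts with `script`."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(unicodedata.name(ch, "").startswith(script) for ch in letters)
    return hits / len(letters)


print(script_fraction("مرحبا", "ARABIC"))        # Arabic text, Arabic script
print(script_fraction("Selamat pagi", "LATIN"))  # Indonesian text, Latin script
print(script_fraction("Good morning", "LATIN"))  # English text, Latin script
```

Under this assumption, Indonesian and English answers both score as fully Latin, so a script-only check could not penalize an English answer to an Indonesian question. Any verifiable language reward needs to be stronger than this for the premise to hold.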

What would settle it

Demonstrating that optimizing for the specified rewards results in lower accuracy or inconsistent reasoning on a diverse set of multilingual visual questions would challenge the central claim.

Figures

Figures reproduced from arXiv: 2509.10026 by Changtao Miao, Fanwei Zeng, Huazhe Tan, Jianshu Li, Jing Huang, Joey Tianyi Zhou, Shutao Gong, Weibin Yao, Zhiya Tan.

Figure 1: Overview of LaV-CoT: (a) Direct model answers may …
Figure 2: The framework includes an automated data generation pipeline, which leverages a multi-step reasoning process …
Figure 3: Comparison between Qwen2.5-VL-7B and LaV-CoT. As illustrated, Qwen2.5-VL-7B demonstrates a step-by-step …
Figure 4: Smoothed reward curves during GRPO training, highlighting the evolution of four key reward components: Language Reward, Count Reward, Edit Distance Reward, and Format Reward. The Format Reward exhibits rapid initial improvement, ascending from approximately 0.125 to 0.25 within the first 850 steps before stabilizing, indicating the base model has decent instruction-following capability an…
Figure 5: Ablation Study on LaV-CoT GRPO Training.
Figure 8: Instructions for evaluating CoT step.
Figure 9: Instructions for LaV-CoT inference on open-ended …
Figure 10: Instructions for LaV-CoT inference on MCQ.
Figure 11: Arabic Visa demo case
Figure 12: Indonesian ID demo case
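
The Figure 4 caption names an Edit Distance Reward without defining it on this page. A minimal sketch follows, assuming the reward is a normalized Levenshtein similarity between the predicted and reference answers; both function names and the normalization are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_distance_reward(pred: str, ref: str) -> float:
    """Assumed form: 1 minus the distance divided by the longer string's length."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 1.0 - levenshtein(pred, ref) / longest


print(edit_distance_reward("kitten", "sitting"))
```

A graded score like this gives partial credit for near-miss text answers, which may suit text-centric benchmarks better than exact match.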
Original abstract

As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2$\times$ larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at this link: https://github.com/HJNVR/LaV-CoT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces LaV-CoT, the first Language-aware Visual CoT framework for multilingual VQA. It proposes a multi-stage interpretable reasoning pipeline (Text Summary with BBox, Language Identification, Spatial Object-level Captioning, Step-by-step Logical Reasoning), an automated iterative data curation process for multilingual CoT annotations, and a two-stage training paradigm of SFT followed by Language-aware GRPO optimized via verifiable multi-aspect rewards (language consistency, structural accuracy, semantic alignment). Evaluations on MMMB, Multilingual MMBench, and MTVQA report up to ~9.5% accuracy gains over similar-scale open-source baselines and ~2.6% over 2× larger models, plus outperformance of some proprietary models; an online A/B test on real-world data is included, with code released.

Significance. If the accuracy gains hold under scrutiny, the work would meaningfully advance interpretable multilingual multimodal reasoning by combining visual CoT with language-specific rewards and scalable curation. The two-stage GRPO approach with verifiable rewards, the industrial A/B validation, and public code release are concrete strengths that support practical impact and reproducibility beyond typical VLM fine-tuning papers.

major comments (1)
  1. [§4 and Tables 1–3] §4 (Experiments) and Tables 1–3: the central claim of up to 9.5% and 2.6% accuracy improvements is reported as point estimates without error bars, standard deviations, or results from multiple random seeds. This makes it impossible to determine whether the reported margins over baselines are statistically reliable or could be explained by run-to-run variance.
minor comments (3)
  1. [§3.2] The automated curation pipeline description (likely §3.2) would benefit from explicit statistics on the final dataset size, language distribution, and rejection rate after the iterative correction step.
  2. [Figure 2 and §3.3] Figure 2 (reasoning pipeline diagram) and the reward definitions in §3.3 use overlapping terminology (e.g., “structural accuracy” vs. “semantic alignment”); a short table mapping each reward component to its verification method would improve clarity.
  3. [Online A/B test subsection] The online A/B test section reports aggregate win rates but does not specify the sample size, duration, or exact metric used for the real-world deployment comparison.
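
Minor comment 3 matters because the same win-rate gap can be significant or not depending on sample size. The sketch below uses a standard two-proportion z-test to show this; the counts are placeholders chosen for illustration and are not figures from the paper's A/B test.

```python
from math import erfc, sqrt


def two_proportion_z(wins_a: int, n_a: int, wins_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # all successes or all failures: no evidence of a difference
    z = (p_b - p_a) / se
    return z, erfc(abs(z) / sqrt(2))


# Placeholder counts: the same 50% vs 56% rates at two sample sizes.
print(two_proportion_z(50, 100, 56, 100))
print(two_proportion_z(500, 1000, 560, 1000))
```

With 100 samples per arm the six-point gap is not significant at the 0.05 level, while with 1,000 per arm it is.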

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive overall assessment of LaV-CoT. We address the concern about statistical reliability of the reported accuracy gains below.

point-by-point responses
  1. Referee: [§4 and Tables 1–3] §4 (Experiments) and Tables 1–3: the central claim of up to 9.5% and 2.6% accuracy improvements is reported as point estimates without error bars, standard deviations, or results from multiple random seeds. This makes it impossible to determine whether the reported margins over baselines are statistically reliable or could be explained by run-to-run variance.

    Authors: We agree that including measures of variance would improve the robustness of the claims. Our initial experiments followed the common practice in the VLM literature of reporting single-run point estimates on these benchmarks, given the substantial compute required for full training and evaluation. In the revised manuscript we will add results from three independent random seeds for the primary models and baselines, reporting mean accuracy and standard deviation in Tables 1–3 and the corresponding text in §4. This revision will allow readers to directly assess whether the observed margins exceed typical run-to-run variation.

    Revision: yes
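
The promised three-seed reporting can be sketched as follows. The accuracy values are placeholders, not results from the paper, and `margin_exceeds_noise` with its threshold `k` is a rough hypothetical heuristic rather than a formal significance test.

```python
from statistics import mean, stdev


def summarize(runs: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return mean(runs), stdev(runs)


def margin_exceeds_noise(model_runs: list[float], baseline_runs: list[float], k: float = 2.0) -> bool:
    """True when the mean gap is larger than k times the combined seed spread."""
    m_mean, m_std = summarize(model_runs)
    b_mean, b_std = summarize(baseline_runs)
    noise = (m_std ** 2 + b_std ** 2) ** 0.5
    return (m_mean - b_mean) > k * noise


# Placeholder accuracies (%) for three seeds each.
model = [70.0, 71.0, 72.0]
baseline = [61.0, 62.0, 63.0]
print(summarize(model), summarize(baseline), margin_exceeds_noise(model, baseline))
```

With only three seeds the standard deviation itself is a noisy estimate, so a check like this is best read as a sanity filter on the reported margins.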

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central claim of accuracy gains rests on an empirical two-stage training pipeline (SFT followed by Language-aware GRPO) using externally defined multi-aspect rewards (language consistency, structural accuracy, semantic alignment) and evaluation on independent public benchmarks (MMMB, Multilingual MMBench, MTVQA) plus an online A/B test. No step reduces the reported improvements to a fitted parameter or self-citation by construction; the reward functions are specified separately from the final accuracy metric, the data curation process is described with implementation detail, and the performance numbers are measured on held-out external test sets rather than being tautological with the training objectives.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that the four-stage reasoning pipeline produces high-quality training signals and that the three reward aspects can be automatically verified at scale; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: The multi-stage pipeline (Text Summary with BBox, Language Identification, Spatial Object-level Captioning, Step-by-step Logical Reasoning) produces faithful and useful reasoning chains for multilingual VQA.
    Invoked in the description of the interpretable multi-stage reasoning pipeline.

pith-pipeline@v0.9.0 · 5900 in / 1297 out tokens · 28164 ms · 2026-05-18T17:49:03.406358+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    DocShield presents a new agentic reasoning framework using Cross-Cues-aware Chain of Thought to detect, localize, and explain text-centric forgeries in documents, with reported F1 gains of 41.4% over specialized metho...

  2. Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...
