LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
Pith reviewed 2026-05-18 17:49 UTC · model grok-4.3
The pith
LaV-CoT uses a language-aware visual chain-of-thought with multi-aspect rewards to achieve up to 9.5% higher accuracy in multilingual VQA.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that LaV-CoT delivers substantial accuracy gains on MMMB, Multilingual MMBench, and MTVQA, outperforming similar-sized open-source baselines by up to roughly 9.5% and models twice its size by roughly 2.6%. The framework has three parts: a multi-stage reasoning pipeline (Text Summary with Bounding Box, Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning), automated multilingual CoT data curation, and two-stage training that applies SFT followed by Language-aware GRPO with multi-aspect rewards.
What carries the argument
The interpretable multi-stage Language-aware Visual CoT reasoning pipeline together with Language-aware Group Relative Policy Optimization (GRPO) using rewards for language consistency, structural accuracy, and semantic alignment.
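To make the group-relative idea concrete, here is a minimal Python sketch; it is not the authors' implementation. It assumes each of the three rewards is already scored in [0, 1], combines them with hypothetical equal weights, and standardizes the totals within one group of sampled responses, as in the commonly described GRPO formulation. The names `AspectScores`, `total_reward`, and `group_relative_advantages`, and all scores, are illustrative.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class AspectScores:
    """Per-response reward components, each assumed to lie in [0, 1]."""
    language_consistency: float
    structural_accuracy: float
    semantic_alignment: float


def total_reward(s: AspectScores, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three aspects; the weights are hypothetical."""
    w_lang, w_struct, w_sem = weights
    return (w_lang * s.language_consistency
            + w_struct * s.structural_accuracy
            + w_sem * s.semantic_alignment)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to one question (made-up scores).
group = [
    AspectScores(1.0, 1.0, 0.8),  # right language, well-formed, good answer
    AspectScores(0.0, 1.0, 0.6),  # wrong language
    AspectScores(1.0, 0.5, 0.1),  # missing stages, poor answer
]
rewards = [total_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group mean, a response is rewarded only for being better than its siblings, which is why a biased reward component would shift the policy even when every response in the group is wrong.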
If this is right
- The automated data curation method allows for scalable creation of high-quality multilingual CoT annotations.
- The two-stage training paradigm improves reasoning capabilities and generalization across languages.
- Performance gains let the model surpass similar-sized open-source models, models about twice its size, and proprietary models such as GPT-4o-0513 and Gemini-2.5-flash.
- Validation on real-world data through A/B testing supports its use in industrial applications.
- The approach enhances interpretability of the reasoning process in multilingual multimodal tasks.
Where Pith is reading between the lines
- Applying this visual CoT approach to other vision-language tasks could improve performance in areas like visual grounding or document understanding in multiple languages.
- Further research might explore how the reward components interact to avoid introducing language-specific biases in low-resource languages.
- The framework's efficiency at smaller scales suggests it could be adapted for resource-constrained environments without needing massive model sizes.
Load-bearing premise
The multi-aspect rewards accurately capture and promote correct multilingual visual reasoning without introducing biases or new errors.
What would settle it
Demonstrating that optimizing for the specified rewards results in lower accuracy or inconsistent reasoning on a diverse set of multilingual visual questions would challenge the central claim.
Original abstract
As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2× larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at this link: https://github.com/HJNVR/LaV-CoT
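As an illustration of what "verifiable" structural and language rewards could look like for the four-stage pipeline, the following Python sketch checks a model output against a hypothetical tag format. The tag names (`summary`, `language`, `caption`, `reasoning`), the scoring rule, and the example strings are assumptions made for this sketch; the paper's actual output format and reward definitions may differ.

```python
import re

# Hypothetical stage tags, in pipeline order; not taken from the paper.
STAGES = ["summary", "language", "caption", "reasoning"]


def structural_score(output: str) -> float:
    """Return 0.0 if the stages found are out of order; otherwise the
    fraction of stages that appear exactly once."""
    positions = []
    for tag in STAGES:
        matches = list(re.finditer(rf"<{tag}>(.*?)</{tag}>", output, flags=re.DOTALL))
        if len(matches) == 1:
            positions.append(matches[0].start())
    if positions != sorted(positions):
        return 0.0
    return len(positions) / len(STAGES)


def language_consistency(output: str, expected_lang: str) -> float:
    """Return 1.0 if the declared language matches the expected code, else 0.0."""
    m = re.search(r"<language>(.*?)</language>", output, flags=re.DOTALL)
    if m and m.group(1).strip().lower() == expected_lang.lower():
        return 1.0
    return 0.0


good = (
    "<summary>Sign text at [10, 20, 200, 80]</summary>"
    "<language>fr</language>"
    "<caption>A street sign above a door</caption>"
    "<reasoning>The sign reads 'Sortie', so the answer is the exit.</reasoning>"
)
partial = "<summary>Sign text</summary><reasoning>It is the exit.</reasoning>"
swapped = "<language>fr</language><summary>Sign text</summary>"
```

Checks of this kind are cheap and deterministic, but they only verify form and declared language; they say nothing about whether the bounding box or the reasoning is correct, which is the job of a semantic alignment term.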
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LaV-CoT, the first Language-aware Visual CoT framework for multilingual VQA. It proposes a multi-stage interpretable reasoning pipeline (Text Summary with BBox, Language Identification, Spatial Object-level Captioning, Step-by-step Logical Reasoning), an automated iterative data curation process for multilingual CoT annotations, and a two-stage training paradigm of SFT followed by Language-aware GRPO optimized via verifiable multi-aspect rewards (language consistency, structural accuracy, semantic alignment). Evaluations on MMMB, Multilingual MMBench, and MTVQA report up to ~9.5% accuracy gains over similar-scale open-source baselines and ~2.6% over 2× larger models, plus outperformance of some proprietary models; an online A/B test on real-world data is included, with code released.
Significance. If the accuracy gains hold under scrutiny, the work would meaningfully advance interpretable multilingual multimodal reasoning by combining visual CoT with language-specific rewards and scalable curation. The two-stage GRPO approach with verifiable rewards, the industrial A/B validation, and public code release are concrete strengths that support practical impact and reproducibility beyond typical VLM fine-tuning papers.
major comments (1)
- [§4 and Tables 1–3] §4 (Experiments) and Tables 1–3: the central claim of up to 9.5% and 2.6% accuracy improvements is reported as point estimates without error bars, standard deviations, or results from multiple random seeds. This makes it impossible to determine whether the reported margins over baselines are statistically reliable or could be explained by run-to-run variance.
minor comments (3)
- [§3.2] The automated curation pipeline description (likely §3.2) would benefit from explicit statistics on the final dataset size, language distribution, and rejection rate after the iterative correction step.
- [Figure 2 and §3.3] Figure 2 (reasoning pipeline diagram) and the reward definitions in §3.3 use overlapping terminology (e.g., “structural accuracy” vs. “semantic alignment”); a short table mapping each reward component to its verification method would improve clarity.
- [Online A/B test subsection] The online A/B test section reports aggregate win rates but does not specify the sample size, duration, or exact metric used for the real-world deployment comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive overall assessment of LaV-CoT. We address the concern about statistical reliability of the reported accuracy gains below.
- Referee: [§4 and Tables 1–3] §4 (Experiments) and Tables 1–3: the central claim of up to 9.5% and 2.6% accuracy improvements is reported as point estimates without error bars, standard deviations, or results from multiple random seeds. This makes it impossible to determine whether the reported margins over baselines are statistically reliable or could be explained by run-to-run variance.
Authors: We agree that including measures of variance would improve the robustness of the claims. Our initial experiments followed the common practice in the VLM literature of reporting single-run point estimates on these benchmarks, given the substantial compute required for full training and evaluation. In the revised manuscript we will add results from three independent random seeds for the primary models and baselines, reporting mean accuracy and standard deviation in Tables 1–3 and the corresponding text in §4. This revision will allow readers to directly assess whether the observed margins exceed typical run-to-run variation.
Revision: yes
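The comparison the rebuttal promises can be sketched in a few lines of Python. This is only an illustration: the accuracy values are placeholders rather than results from the paper, and the rule of thumb used (gap larger than `k` pooled standard deviations) is an assumption of this sketch, not a test the authors commit to.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Mean and sample standard deviation across random seeds."""
    return mean(accuracies), stdev(accuracies)


def margin_vs_noise(model_accs, baseline_accs, k=2.0):
    """Compare the mean gap with k times the pooled seed-to-seed std."""
    m_mean, m_sd = summarize(model_accs)
    b_mean, b_sd = summarize(baseline_accs)
    gap = m_mean - b_mean
    pooled_sd = ((m_sd ** 2 + b_sd ** 2) / 2) ** 0.5
    return gap, pooled_sd, gap > k * pooled_sd


# Placeholder accuracies (%) for three seeds each; NOT numbers from the paper.
model_runs = [70.0, 71.0, 72.0]
baseline_runs = [65.0, 66.0, 64.0]

gap, pooled_sd, exceeds = margin_vs_noise(model_runs, baseline_runs)
```

With only three seeds the standard deviation estimate is itself noisy, so a comparison like this is a sanity check on the reported margins rather than a formal significance test.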
Circularity Check
No significant circularity in derivation chain
The paper's central claim of accuracy gains rests on an empirical two-stage training pipeline (SFT followed by Language-aware GRPO) using externally defined multi-aspect rewards (language consistency, structural accuracy, semantic alignment) and evaluation on independent public benchmarks (MMMB, Multilingual MMBench, MTVQA) plus an online A/B test. No step reduces the reported improvements to a fitted parameter or self-citation by construction; the reward functions are specified separately from the final accuracy metric, the data curation process is described with implementation detail, and the performance numbers are measured on held-out external test sets rather than being tautological with the training objectives.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The multi-stage pipeline (Text Summary with BBox, Language Identification, Spatial Object-level Captioning, Step-by-step Logical Reasoning) produces faithful and useful reasoning chains for multilingual VQA.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment.
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: The multi-stage reasoning design ... (a) Text Summary with Bounding Boxes, (b) Language Identification, (c) Spatial Object-Level Image Captioning, (d) Step-by-Step Logical Reasoning.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning
DocShield presents a new agentic reasoning framework using Cross-Cues-aware Chain of Thought to detect, localize, and explain text-centric forgeries in documents, with reported F1 gains of 41.4% over specialized metho...
- Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...