{"total":18,"items":[{"citing_arxiv_id":"2605.20177","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-19T17:58:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19852","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-19T13:44:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08560","ref_index":169,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ZAYA1-VL-8B Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-08T23:41:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021. 
[168] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianping Han, Hang Xu, Zhenguo Li, and Pheng-Ann Heng. G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023. [169] Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025. [170] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano,"},{"citing_arxiv_id":"2604.19544","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling","primary_cat":"cs.AI","submitted_at":"2026-04-21T15:02:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[9] Ziming Cheng, Binrui Xu, Lisheng Gong, Zuhe Song, Tianshuo Zhou, Shiqi Zhong, Siyu Ren, Mingxiang Chen, Xiangchao Meng, Yuxin Zhang, et al. Evaluating mllms with multimodal multi-image reasoning benchmark. arXiv preprint arXiv:2506.04280, 2025. 1, 4 [10] Google DeepMind. Gemini-1.5-Pro. https://deepmind.google/technologies/gemini/pro/, 2024. 5 [11] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. 
G-LLaVA: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023. 14 [12] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi"},{"citing_arxiv_id":"2604.04838","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Less Detail, Better Answers: Degradation-Driven Prompting for VQA","primary_cat":"cs.CV","submitted_at":"2026-04-06T16:41:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02893","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-03T09:10:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A procedural engine generates 200k+ synthetic geometry diagrams to fine-tune VLMs for referring image segmentation on abstract diagrams, yielding 49% IoU and 85% Buffered IoU with Florence-2 versus under 1% zero-shot.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.06856","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2025-06-07T16:37:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Vision-EKIPL injects 
high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.10479","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-04-14T17:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 14, 15 [39] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024. 9, 11 [40] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023. 
6 [41] Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun"},{"citing_arxiv_id":"2504.09925","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding","primary_cat":"cs.CV","submitted_at":"2025-04-14T06:33:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.16549","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems","primary_cat":"cs.CV","submitted_at":"2025-03-19T11:46:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.14164","ref_index":123,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MetaMorph: Multimodal Understanding and Generation via Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2024-12-18T18:58:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VPiT enables pretrained LLMs to perform both visual 
understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.04468","ref_index":121,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NVILA: Efficient Frontier Visual Language Models","primary_cat":"cs.CV","submitted_at":"2024-12-05T18:59:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022. [120] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual Dialog. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [121] Drew A Hudson and Christopher D Manning. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 
[122] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong."},{"citing_arxiv_id":"2411.10442","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization","primary_cat":"cs.CL","submitted_at":"2024-11-15T18:59:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ber of visual tokens required by LLMs while introducing extra training costs. Recently, there have been explorations into vision encoder-free architectures [7, 51, 63, 90, 104], which consist of a single transformer model that jointly processes visual and textual information without a separate encoder. In addition to exploring model architectures, recent works [28, 49, 56, 101, 108, 115] also try to construct high-quality training data to improve multimodal reasoning abilities. 
Despite these advancements, MLLMs typically rely on a training paradigm comprising pre-training and supervised fine-tuning, which suffers from the curve of distribution shift and exhibits limited multimodal reasoning abilities."},{"citing_arxiv_id":"2408.16500","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CogVLM2: Visual Language Models for Image and Video Understanding","primary_cat":"cs.CV","submitted_at":"2024-08-29T12:59:12+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.01284","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?","primary_cat":"cs.AI","submitted_at":"2024-07-01T13:39:08+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"gories on WE-MATH: (a) Closed-source LMMs: GPT-4o [38], GPT-4V [26], Gemini 1.5 Pro [40], Qwen-VL-Max [13], (b) Open-source LMMs: LLaVA-NeXT-110B, LLaVA-NeXT-70B [39], LLaVA-1.6-13B, LLaVA-1.6-7B [41], DeepSeek-VL-1.3B, DeepSeek-VL-7B [42], Phi3-Vision-4.2B [43], MiniCPM-Llama3-V 2.5 [44], InternLM-XComposer2-VL-7B [45], InternVL-Chat-V1.5 [46], GLM-4V-9B [47], LongVA [48], G-LLaVA-13B [29]. 
3.1 Main Result Table 1 shows the overall performance of different LMMs on One-Step / Two-Step / Three-Step problems and different problem domains. We have the following observations: The Nums of Knowledge Concepts are negatively correlated with LMMs' Performance. Regarding problems of varying complexities (one-step vs. two-step vs. three-step), GPT-4o consistently"},{"citing_arxiv_id":"2406.16860","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2024-06-24T17:59:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"on MLLM performance (Section 2) and explore the benefits of model ensembles (Section 3.5). Multimodal Connector Representations from a visual encoder cannot be natively processed by an LLM-they must be mapped into the LLM token space by a connector. There are three primary approaches to connector design: Resamplers [6], Q-Formers [11, 37], and MLP Projectors [44, 81, 83, 158]. We begin our exploration using an MLP projector, which is highly effective but presents challenges: the visual token count grows quadratically with image resolution, inhibiting scaling context length input resolution. For example, LLaVA-Next [82] requires 2880 visual tokens to process one 672px image. 
To address this, we explore new vision connector designs that process"},{"citing_arxiv_id":"2403.14624","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?","primary_cat":"cs.CV","submitted_at":"2024-03-21T17:59:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.05525","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek-VL: Towards Real-World Vision-Language Understanding","primary_cat":"cs.AI","submitted_at":"2024-03-08T18:46:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}