ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering

Chuanrui Zhang; Hangrui Xu; Haonan Lu; Haoqian Wang; Jun Yang; Kai Shi; Yunyao Yu; Zhengxian Wu; Zhenyu Yang; Zhuohong Chen; Zirui Liao

arxiv: 2606.27974 · v1 · pith:KTVR5GSU · submitted 2026-06-26 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-29 04:58 UTC · model grok-4.3

keywords: knowledge-based visual question answering · multimodal search agent · progressive retrieval · tool use agent · reinforcement learning · E-VQA · InfoSeek

The pith

A progressive agent that iteratively chooses image search, text search, or stop improves retrieval and accuracy on knowledge-based visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that an adaptive agent for knowledge-based visual question answering can outperform fixed retrieve-then-generate pipelines by deciding at each step whether to search images, search text, or stop. This matters because non-adaptive methods use a static top-k setting that cannot adjust to the specific needs of the image-question pair during reasoning. The agent operates under explicit tool-call budgets with deduplication to limit redundancy. It is trained first by rejection-sampling supervised fine-tuning to produce valid tool calls, then refined with a sequence-level reinforcement learning objective that normalizes updates by generation length and tool-interaction depth. Experiments on E-VQA and InfoSeek report consistent gains in retrieval quality and end-to-end answer accuracy over strong RAG and agent baselines.

Core claim

ProMSA is a progressive multimodal search agent that, given an image-question pair, iteratively selects among image search, text search, or stopping, subject to explicit tool-call budgets and deduplication. It is trained first with rejection-sampling SFT for valid formats, then optimized with TN-GSPO, a sequence-level RL objective normalizing updates by generation length and tool-interaction depth. This yields consistent gains in retrieval quality and answer accuracy on the E-VQA and InfoSeek benchmarks compared to RAG and agent baselines.

What carries the argument

The progressive multimodal search agent that makes iterative decisions among image search, text search, or stop, with budgets and deduplication.
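The decision loop described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_agent`, `policy`, `tools`, and `max_calls` are hypothetical names, the search functions are toy stand-ins, and deduplication by exact string match is an assumption, since this page does not say how duplicates are detected.

```python
from typing import Callable, Dict, List, Set, Tuple

# (tool name, query); the tool is "image_search", "text_search", or "stop".
Action = Tuple[str, str]


def run_agent(
    policy: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[str], List[str]]],
    max_calls: int = 3,  # hypothetical tool-call budget
) -> Tuple[List[str], int]:
    """Collect evidence until the policy stops or the budget runs out."""
    evidence: List[str] = []
    seen: Set[str] = set()  # assumption: exact-match deduplication
    calls = 0
    while calls < max_calls:
        tool, query = policy(evidence)
        if tool == "stop":
            break
        calls += 1  # every tool call counts against the budget
        for passage in tools[tool](query):
            if passage not in seen:
                seen.add(passage)
                evidence.append(passage)
    return evidence, calls


# Toy stand-ins for the two retrieval tools (hypothetical).
def toy_image_search(query: str) -> List[str]:
    return ["Eiffel Tower", "Paris"]


def toy_text_search(query: str) -> List[str]:
    return ["Paris", "Completed in 1889"]


def toy_policy(evidence: List[str]) -> Action:
    # Identify the entity first, then look up facts, then stop.
    if not evidence:
        return ("image_search", "<image>")
    if len(evidence) < 3:
        return ("text_search", evidence[0])
    return ("stop", "")


toy_tools = {"image_search": toy_image_search, "text_search": toy_text_search}
evidence, calls = run_agent(toy_policy, toy_tools)
```

In this toy run the second call returns "Paris" again, and the deduplication set keeps it from being added twice. In the paper the policy is a trained model; the hand-written rule is only there to make the sketch executable.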

Load-bearing premise

The assumption that an iterative choice among image search, text search, or stop under tool budgets and deduplication will produce better retrieval and accuracy than fixed retrieve-then-generate pipelines.

What would settle it

An experiment on E-VQA or InfoSeek in which the ProMSA agent achieves no higher retrieval quality or end-to-end accuracy than the strongest fixed RAG baseline.
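One way to make such a comparison concrete is a paired bootstrap over per-question correctness. The sketch below is an illustration only: `paired_bootstrap` and its parameters are hypothetical names, the percentile interval is one of several reasonable choices, and nothing here is taken from the paper's evaluation protocol.

```python
import random
from typing import List, Tuple


def paired_bootstrap(
    agent_correct: List[int],
    baseline_correct: List[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed accuracy difference, 2.5th pct, 97.5th pct).

    Both inputs hold 0/1 correctness for the same questions in the same order.
    """
    assert len(agent_correct) == len(baseline_correct)
    n = len(agent_correct)
    per_item = [a - b for a, b in zip(agent_correct, baseline_correct)]
    observed = sum(per_item) / n

    rng = random.Random(seed)  # fixed seed for reproducibility
    resampled = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping pairs together.
        sample = [per_item[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()

    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return observed, lower, upper
```

If the resulting interval for agent-minus-baseline accuracy includes zero, the data do not separate the agent from the fixed RAG baseline at that sample size.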

Figures

Figures reproduced from arXiv: 2606.27974 by Chuanrui Zhang, Hangrui Xu, Haonan Lu, Haoqian Wang, Jun Yang, Kai Shi, Yunyao Yu, Zhengxian Wu, Zhenyu Yang, Zhuohong Chen, Zirui Liao.

**Figure 1.** Comparison between direct answering, RAG-based retrieval, and our progressive multimodal search agent for KB-VQA. Adjacent paper text: "… rare, making it difficult for the model to reliably decide whether the entity has been correctly identified and whether the retrieved evidence is trustworthy. Early approaches mostly follow a fixed retrieval augmented generation pipeline [27]. They first run a single image retrieval step to co…"

**Figure 2.** Overview of our progressive multimodal search agent for KB-VQA. Given an image-question pair, the agent iteratively performs image and text search over Wikipedia, and is trained with tool-horizon normalized sequence-level RL (TN-GSPO). Adjacent paper text: "We formulate KB-VQA as a budgeted progressive search-and-reasoning problem that learns when to retrieve, which modality to use, and when to stop. We introduce TN-GSPO, a…"

**Figure 3.** Tool usage and training dynamics of the proposed search agent. Left: proportions of text vs. image search. Right: tool calls, response length, and reward during training across RL strategies. Adjacent paper text: "Effect of training stages. To quantify the contribution of each training stage, we report results on E-VQA and InfoSeek across three settings (…"

**Figure 4.** Retrieval examples from the image search module. Adjacent paper text: "Impact of tool-call budget and retrieval Top-k. We vary the tool-call budget and retrieval Top-k and evaluate on E-VQA and InfoSeek (Tables 6 and 8). Increasing the tool-call budget or retrieving more content improves the chance of recalling correct evidence, reducing errors caused by failed early retrieval. However, further increases lead to diminishing re…"

Original abstract

Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-selected retriever and a static top-k setting, which is not adaptive during reasoning. We propose ProMSA, a progressive multimodal search agent for KB-VQA. Given an image-question pair, the agent iteratively chooses image search, text search, or stop, under explicit tool-call budgets and with deduplication to avoid redundant retrieval. For training, we first use rejection-sampling SFT to learn valid tool-use formats, then optimize the agent with TN-GSPO, a sequence-level RL objective that normalizes updates by both generation length and tool-interaction depth. Experiments on E-VQA and InfoSeek show consistent gains over strong RAG and agent baselines, and improved retrieval and end-to-end accuracy. The code is available at https://github.com/DingWu1021/Promsa.
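The abstract does not state the TN-GSPO formula, so the following is only a rough sketch of the kind of objective it describes. The length-normalized sequence ratio and group-standardized advantages follow GSPO [32]; the division by `1 + depth` is a purely hypothetical stand-in for "normalizing by tool-interaction depth", and all function names and the `eps` value are assumptions.

```python
import math
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Standardize rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]


def sequence_ratio(new_logps: List[float], old_logps: List[float]) -> float:
    """GSPO-style ratio: exp of the mean per-token log-prob difference."""
    deltas = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(deltas) / len(deltas))


def surrogate(
    new_logps: List[List[float]],
    old_logps: List[List[float]],
    rewards: List[float],
    tool_calls: List[int],
    eps: float = 0.2,  # hypothetical clipping range
) -> float:
    """Clipped sequence-level surrogate with an assumed depth normalization."""
    advantages = group_advantages(rewards)
    total = 0.0
    for new, old, adv, depth in zip(new_logps, old_logps, advantages, tool_calls):
        s = sequence_ratio(new, old)
        clipped = max(1.0 - eps, min(1.0 + eps, s))
        term = min(s * adv, clipped * adv)
        # ASSUMPTION: the paper's depth normalization may differ from this.
        total += term / (1 + depth)
    return total / len(rewards)
```

The point of the sketch is the shape of the update: one ratio per sequence rather than per token, scaled so that long responses and deep tool-call chains do not dominate the gradient. The actual weighting should be taken from the paper or its released code.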

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ProMSA adds an iterative agent that picks image or text search on the fly with budgets and dedup, trained by SFT then TN-GSPO, and reports gains on E-VQA and InfoSeek over fixed RAG baselines.

The main takeaway is that this agent architecture moves past static top-k retrieval by letting the model decide when to call image search, text search, or stop, while enforcing tool budgets and removing duplicates. That setup plus the TN-GSPO objective is the concrete addition.

The work does a few things cleanly. It spells out the two-stage training (rejection sampling for valid tool calls, then the length-and-depth normalized RL update), ships the code, and shows the agent beats the listed RAG and agent baselines on both datasets for retrieval quality and final accuracy. The motivation about non-adaptive pipelines is addressed directly by the iterative choice mechanism.
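The first training stage can be pictured as a filter over sampled trajectories. The sketch below is an assumption-heavy illustration: the `<tool_call>` and `<answer>` tags, the JSON payload, the `max_calls` limit, and the extra check against a gold answer are all hypothetical, since this page only says that rejection sampling is used to learn valid tool-use formats.

```python
import json
import re
from typing import List

VALID_TOOLS = {"image_search", "text_search"}
# Hypothetical trajectory markup; the paper's actual format is not given here.
TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def is_valid(trajectory: str, gold: str, max_calls: int = 3) -> bool:
    """Accept a trajectory only if every tool call parses and the answer matches."""
    calls = TOOL_RE.findall(trajectory)
    if len(calls) > max_calls:
        return False
    for raw in calls:
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            return False
        if not isinstance(call, dict) or call.get("name") not in VALID_TOOLS:
            return False
    answers = ANSWER_RE.findall(trajectory)
    # Assumption: answer correctness is also used as an acceptance criterion.
    return len(answers) == 1 and answers[0].strip().lower() == gold.strip().lower()


def rejection_sample(candidates: List[str], gold: str) -> List[str]:
    """Keep the sampled trajectories that pass the validity check."""
    return [t for t in candidates if is_valid(t, gold)]
```

The surviving trajectories would then serve as supervised fine-tuning targets, so the model starts RL already emitting well-formed tool calls.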

The soft spots are mostly about scale and isolation. The gains are described as consistent, but without the full tables it is hard to judge how large they are once you control for the extra tool calls or whether the normalization in TN-GSPO is doing the heavy lifting versus the agent loop itself. The paper stays inside the KB-VQA agent niche, so broader claims about multimodal systems would need more datasets or tasks.

This is useful for people already working on tool-using VQA agents who want a working example with open code. It is not a foundational shift, but the method is specified enough and the experiments are on standard benchmarks.

I would send it to review. The combination of explicit training procedure, budget controls, and public code gives referees something concrete to check.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes ProMSA, a progressive multimodal search agent for KB-VQA. The agent iteratively selects among image search, text search, or stopping, subject to explicit tool-call budgets and deduplication to prevent redundancy. Training proceeds in two stages: rejection-sampling SFT to learn valid tool-use formats, followed by optimization using the TN-GSPO sequence-level RL objective, which normalizes updates by both generation length and tool-interaction depth. Experiments on the E-VQA and InfoSeek benchmarks report consistent improvements over strong RAG and agent baselines in retrieval and end-to-end accuracy. The code is publicly available.

Significance. If the reported gains hold under scrutiny, the work provides empirical support for adaptive iterative search over fixed retrieve-then-generate pipelines in multimodal KB-VQA. The public code release supports reproducibility. The TN-GSPO objective is a contribution to sequence-level RL for agents with variable tool-interaction depths. This advances agentic approaches in vision-language reasoning.

minor comments (2)

[Experiments] The abstract states that experiments show 'consistent gains' and 'improved retrieval and end-to-end accuracy' but provides no numerical values, error bars, or statistical tests; the results section should include these to allow assessment of the central claim.
[4] The motivation highlights the limitation of static top-k settings, yet the paper should include an explicit ablation comparing the learned iterative policy against a fixed-budget non-adaptive variant to isolate the benefit of adaptivity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the work's significance, and recommendation of minor revision. No specific major comments are listed in the report.

Circularity Check

0 steps flagged

No significant circularity detected

The paper is an empirical proposal of an iterative multimodal search agent (ProMSA) for KB-VQA. It describes a training pipeline using rejection-sampling SFT followed by the TN-GSPO sequence-level RL objective, then reports experimental gains on E-VQA and InfoSeek over RAG and agent baselines. No equations, fitted parameters, or first-principles derivations appear in the abstract or described method that reduce any claimed result to a definition or self-referential input. The central claim rests on external benchmark comparisons rather than any internal reduction or self-citation chain. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5733 in / 1067 out tokens · 21302 ms · 2026-06-29T04:58:52.246420+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 13 linked inside Pith

[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

[2] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

[3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025), https://arxiv.org/abs/2502.13923

[4] Caffagni, D., Cocchi, F., Moratelli, N., Sarto, S., Cornia, M., Baraldi, L., Cucchiara, R.: Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1818–1826 (2024)

[5] Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., Liu, Z.: M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 2318–2335 (2024)

[6] Chen, X., Shukla, S.N., Azab, M., Singh, A., Wang, Q., Yang, D., Peng, S., Yu, H., Yan, S., Zhang, X., et al.: Compcap: Improving multimodal large language models with composite captions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23582–23592 (2025)

[7] Chen, Y., Hu, H., Luan, Y., Sun, H., Changpinyo, S., Ritter, A., Chang, M.W.: Can pre-trained vision and language models answer visual information-seeking questions? In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 14948–14968 (2023)

[8] Chng, Y.X., Hu, T., Tong, W., Li, X., Chen, J., Yu, H., Lu, J., Guo, H., Deng, H., Xie, C., et al.: Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning. arXiv preprint arXiv:2512.24330 (2025)

[9] Cocchi, F., Moratelli, N., Cornia, M., Baraldi, L., Cucchiara, R.: Augmenting multimodal llms with self-reflective tokens for knowledge-based visual question answering. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9199–9209 (2025)

[10] Compagnoni, A., Morini, M., Sarto, S., Cocchi, F., Caffagni, D., Cornia, M., Baraldi, L., Cucchiara, R.: Reag: Reasoning-augmented generation for knowledge-based visual question answering. arXiv preprint arXiv:2511.22715 (2025)

[11] Hong, J., Zhao, C., Zhu, C., Lu, W., Xu, G., Yu, X.: Deepeyesv2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271 (2025)

[12] Hong, Y., Gu, J., Lou, Y., Fan, L., Yang, Q., Wang, Y., Ding, K., Wu, Y., Xiang, S., Ye, J.: Cc-vqa: Conflict-and correlation-aware method for mitigating knowledge conflict in knowledge-based visual question answering. arXiv preprint arXiv:2602.23952 (2026)

[13] Hong, Y., Gu, J., Yang, Q., Fan, L., Wu, Y., Wang, Y., Ding, K., Xiang, S., Ye, J.: Knowledge-based visual question answer with multimodal processing, retrieval and filtering. arXiv preprint arXiv:2510.14605 (2025)

[14] Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., Han, J.: Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516 (2025)

[15] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning. pp. 19730–19742. PMLR (2023)

[16] Li, X., Dong, G., Jin, J., Zhang, Y., Zhou, Y., Zhu, Y., Zhang, P., Dou, Z.: Search-o1: Agentic search-enhanced large reasoning models. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 5420–5438 (2025)

[17] Ling, Z., Guo, Z., Huang, Y., An, Y., Xiao, S., Lan, J., Zhu, X., Zheng, B.: Mmkb-rag: A multi-modal knowledge-based retrieval-augmented generation framework. arXiv preprint arXiv:2504.10074 (2025)

[18] Mao, X., Ye, K., Zhou, S., Zhang, N., Huang, H., Li, B., Bu, J.: Mas-vqa: A mask-and-select framework for knowledge-based visual question answering. arXiv preprint arXiv:2602.15915 (2026)

[19] Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: Ok-vqa: A visual question answering benchmark requiring external knowledge. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3195–3204 (2019)

[20] Mensink, T., Uijlings, J., Castrejon, L., Goel, A., Cadar, F., Zhou, H., Sha, F., Araujo, A., Ferrari, V.: Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3113–3124 (2023)

[21] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

[22] Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., Wu, C.: Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256 (2024)

[23] Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389 (2023)

[24] Wang, P., Li, Z.Z., Yin, F., Ran, D., Liu, C.L.: Mv-math: Evaluating multimodal math reasoning in multi-visual contexts. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19541–19551 (2025)

[25] Wang, Z., Dong, Y., Luo, F., Ruan, M., Cheng, Z., Chen, C., Li, P., Liu, Y.: Escapecraft: A 3d room escape environment for benchmarking complex multimodal reasoning ability. arXiv preprint arXiv:2503.10042 (2025)

[26] Wu, J., Deng, Z., Li, W., Liu, Y., You, B., Li, B., Ma, Z., Liu, Z.: Mmsearch-r1: Incentivizing lmms to search. arXiv preprint arXiv:2506.20670 (2025)

[27] Yan, Y., Xie, W.: Echosight: Advancing visual-language models with wiki knowledge. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 1538–1551 (2024)

[28] Yang, W., Fu, J., Wang, R., Wang, J., Song, L., Bian, J.: Omgm: Orchestrate multiple granularities and modalities for efficient multimodal retrieval. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 24545–24563 (2025)

[29] Ye, K., Mao, X., Zhou, S., Shao, Z., Mo, Y., Liu, L., Huang, H., Li, B., Bu, J.: Real: Resolving knowledge conflicts in knowledge-intensive visual question answering via reasoning-pivot alignment. arXiv preprint arXiv:2602.14065 (2026)

[30] Ye, W., Su, Y., Chen, Y., Gao, L., Li, J., Li, R., Zhang, R.: Qkvqa: Question-focused filtering for knowledge-based vqa (2026), https://arxiv.org/abs/2601.13856

[31] Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. National Science Review 11(12), nwae403 (2024)

[32] Zheng, C., Liu, S., Li, M., Chen, X.H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al.: Group sequence policy optimization. arXiv preprint arXiv:2507.18071 (2025)

[33] Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., Ma, Y.: Llamafactory: Unified efficient fine-tuning of 100+ language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Association for Computational Linguistics, Bangkok, Thailand (2024), http://arxiv.org/abs/2403.13372
