arxiv: 2606.27974 · v1 · pith:KTVR5GSUnew · submitted 2026-06-26 · 💻 cs.CV · cs.AI

ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering

Pith reviewed 2026-06-29 04:58 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords knowledge-based visual question answering · multimodal search agent · progressive retrieval · tool use agent · reinforcement learning · E-VQA · InfoSeek

The pith

A progressive agent that iteratively chooses image search, text search, or stop improves retrieval and accuracy on knowledge-based visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that an adaptive agent for knowledge-based visual question answering can outperform fixed retrieve-then-generate pipelines by deciding at each step whether to search images, search text, or stop. This matters because non-adaptive methods use a static top-k setting that cannot adjust to the specific needs of the image-question pair during reasoning. The agent operates under explicit tool-call budgets with deduplication to limit redundancy. It is trained first by rejection-sampling supervised fine-tuning to produce valid tool calls, then refined with a sequence-level reinforcement learning objective that normalizes updates by generation length and tool-interaction depth. Experiments on E-VQA and InfoSeek report consistent gains in retrieval quality and end-to-end answer accuracy over strong RAG and agent baselines.
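
The rejection-sampling stage described above can be pictured as a filter over sampled trajectories. The sketch below is only an illustration under stated assumptions: the JSON tool-call layout, the tool names `image_search` and `text_search`, the helper names `parse_tool_call` and `rejection_sample`, and the extra answer-correctness check are all hypothetical and are not taken from the paper.

```python
import json
from typing import Callable, List, Optional, Tuple

# Hypothetical tool names; the paper's actual tool-call schema is not given here.
VALID_TOOLS = {"image_search", "text_search"}

def parse_tool_call(text: str) -> Optional[dict]:
    """Return the parsed call if it is well-formed, otherwise None."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    if call.get("tool") not in VALID_TOOLS:
        return None
    query = call.get("query")
    if not isinstance(query, str) or not query.strip():
        return None
    return call

Trajectory = Tuple[List[str], str]  # (raw tool-call strings, final answer)

def rejection_sample(
    samples: List[Trajectory],
    is_correct: Callable[[str], bool],
) -> List[Trajectory]:
    """Keep trajectories whose tool calls all parse and whose answer passes
    the (assumed) correctness check; the rest are rejected."""
    kept = []
    for calls, answer in samples:
        if all(parse_tool_call(c) is not None for c in calls) and is_correct(answer):
            kept.append((calls, answer))
    return kept
```

The surviving trajectories would then serve as supervised fine-tuning targets, so the model sees only well-formed tool calls before the RL stage begins.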

Core claim

ProMSA is a progressive multimodal search agent that, given an image-question pair, iteratively selects among image search, text search, or stopping, subject to explicit tool-call budgets and deduplication. It is trained first with rejection-sampling SFT for valid formats, then optimized with TN-GSPO, a sequence-level RL objective normalizing updates by generation length and tool-interaction depth. This yields consistent gains in retrieval quality and answer accuracy on the E-VQA and InfoSeek benchmarks compared to RAG and agent baselines.
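
To make the control flow of the core claim concrete, here is a minimal sketch of a budgeted search loop with deduplication. It is not the authors' implementation: the action names, the `SearchState` fields, the `policy` and tool signatures, and the default budget of 4 calls are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

# Hypothetical action labels; the real agent's output format is not specified here.
IMAGE_SEARCH, TEXT_SEARCH, STOP = "image_search", "text_search", "stop"

@dataclass
class SearchState:
    question: str
    evidence: List[str] = field(default_factory=list)  # passages kept so far
    seen: Set[str] = field(default_factory=set)        # ids of documents already returned
    calls_used: int = 0                                 # tool calls spent so far

Policy = Callable[[SearchState], Tuple[str, str]]       # state -> (action, query)
Tool = Callable[[str], List[Tuple[str, str]]]           # query -> [(doc_id, passage)]

def run_agent(state: SearchState, policy: Policy,
              image_tool: Tool, text_tool: Tool,
              max_calls: int = 4) -> SearchState:
    """Query the policy until it stops or the tool-call budget is exhausted."""
    while state.calls_used < max_calls:
        action, query = policy(state)
        if action == STOP:
            break
        if action not in (IMAGE_SEARCH, TEXT_SEARCH):
            raise ValueError(f"unknown action: {action}")
        tool = image_tool if action == IMAGE_SEARCH else text_tool
        state.calls_used += 1
        for doc_id, passage in tool(query):
            if doc_id in state.seen:   # deduplication: skip repeated documents
                continue
            state.seen.add(doc_id)
            state.evidence.append(passage)
    return state
```

A fixed retrieve-then-generate pipeline corresponds to a policy that always issues the same single call and then stops; the adaptive agent differs only in that `policy` conditions on the evidence gathered so far.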

What carries the argument

The progressive multimodal search agent that makes iterative decisions among image search, text search, or stop, with budgets and deduplication.

Load-bearing premise

The assumption that an iterative choice among image search, text search, or stop under tool budgets and deduplication will produce better retrieval and accuracy than fixed retrieve-then-generate pipelines.

What would settle it

An experiment on E-VQA or InfoSeek in which the ProMSA agent achieves no higher retrieval quality or end-to-end accuracy than the strongest fixed RAG baseline.

Figures

Figures reproduced from arXiv: 2606.27974 by Chuanrui Zhang, Hangrui Xu, Haonan Lu, Haoqian Wang, Jun Yang, Kai Shi, Yunyao Yu, Zhengxian Wu, Zhenyu Yang, Zhuohong Chen, Zirui Liao.

Figure 1: Comparison between direct answering, RAG-based retrieval, and our progressive multimodal search agent for KB-VQA.
Adjacent text (truncated in extraction): "… rare, making it difficult for the model to reliably decide whether the entity has been correctly identified and whether the retrieved evidence is trustworthy. Early approaches mostly follow a fixed retrieval-augmented generation pipeline [27]. They first run a single image retrieval step to co…"
Figure 2: Overview of our progressive multimodal search agent for KB-VQA. Given an image-question pair, the agent iteratively performs image and text search over Wikipedia, and is trained with tool-horizon normalized sequence-level RL (TN-GSPO).
Adjacent text (truncated in extraction):
– We formulate KB-VQA as a budgeted progressive search-and-reasoning problem that learns when to retrieve, which modality to use, and when to stop.
– We introduce TN-GSPO, a…
Figure 3: Tool usage and training dynamics of the proposed search agent. Left: proportions of text vs. image search. Right: tool calls, response length, and reward during training across RL strategies.
Adjacent text (truncated in extraction): "Effect of training stages. To quantify the contribution of each training stage, we report results on E-VQA and InfoSeek across three settings (…"
Figure 4: Retrieval examples from the image search module.
Adjacent text (truncated in extraction): "Impact of tool-call budget and retrieval Top-k. We vary the tool-call budget and retrieval Top-k and evaluate on E-VQA and InfoSeek (Tables 6 and 8). Increasing the tool-call budget or retrieving more content improves the chance of recalling correct evidence, reducing errors caused by failed early retrieval. However, further increases lead to diminishing re…"
Original abstract

Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-selected retriever and a static top-k setting, which is not adaptive during reasoning. We propose ProMSA, a progressive multimodal search agent for KB-VQA. Given an image-question pair, the agent iteratively chooses image search, text search, or stop, under explicit tool-call budgets and with deduplication to avoid redundant retrieval. For training, we first use rejection-sampling SFT to learn valid tool-use formats, then optimize the agent with TN-GSPO, a sequence-level RL objective that normalizes updates by both generation length and tool-interaction depth. Experiments on E-VQA and InfoSeek show consistent gains over strong RAG and agent baselines, and improved retrieval and end-to-end accuracy. The code is available at https://github.com/DingWu1021/Promsa.
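
The abstract does not give the TN-GSPO formula, so the following is only a sketch of the general shape such an objective might take. The length-normalized sequence ratio and the group-normalized advantage follow the GSPO formulation cited as [32]; the division by tool-call count in `clipped_objective` is a purely hypothetical stand-in for "normalizing by tool-interaction depth", and the clip range of 0.2 is an arbitrary illustrative value.

```python
import math
from statistics import mean, pstdev
from typing import List

def sequence_ratio(logp_new: List[float], logp_old: List[float]) -> float:
    """GSPO-style sequence-level importance ratio, normalized by length:
    exp( (1/|y|) * sum_t (log pi_new(y_t) - log pi_old(y_t)) )."""
    if not logp_new or len(logp_new) != len(logp_old):
        raise ValueError("log-prob lists must be non-empty and equal length")
    diff = sum(n - o for n, o in zip(logp_new, logp_old))
    return math.exp(diff / len(logp_new))

def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of sampled responses."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]

def clipped_objective(ratios: List[float], advantages: List[float],
                      tool_calls: List[int], clip: float = 0.2) -> float:
    """Clipped surrogate averaged over the group.
    ASSUMPTION: each term is divided by max(1, tool_calls) as a placeholder
    for tool-depth normalization; the paper's actual rule may differ."""
    total = 0.0
    for s, a, k in zip(ratios, advantages, tool_calls):
        clipped = min(max(s, 1.0 - clip), 1.0 + clip)
        total += min(s * a, clipped * a) / max(1, k)
    return total / len(ratios)
```

The point of operating at the sequence level is that one ratio is computed per response rather than per token, so long generations with many tool turns do not dominate the update purely through their length.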

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes ProMSA, a progressive multimodal search agent for KB-VQA. The agent iteratively selects among image search, text search, or stopping, subject to explicit tool-call budgets and deduplication to prevent redundancy. Training proceeds in two stages: rejection-sampling SFT to learn valid tool-use formats, followed by optimization using the TN-GSPO sequence-level RL objective, which normalizes updates by both generation length and tool-interaction depth. Experiments on the E-VQA and InfoSeek benchmarks report consistent improvements over strong RAG and agent baselines in retrieval and end-to-end accuracy. The code is publicly available.

Significance. If the reported gains hold under scrutiny, the work provides empirical support for adaptive iterative search over fixed retrieve-then-generate pipelines in multimodal KB-VQA. The public code release supports reproducibility. The TN-GSPO objective is a contribution to sequence-level RL for agents with variable tool-interaction depths. This advances agentic approaches in vision-language reasoning.

minor comments (2)
  1. [Experiments] The abstract states that experiments show 'consistent gains' and 'improved retrieval and end-to-end accuracy' but provides no numerical values, error bars, or statistical tests; the results section should include these to allow assessment of the central claim.
  2. [4] The motivation highlights the limitation of static top-k settings, yet the paper should include an explicit ablation comparing the learned iterative policy against a fixed-budget non-adaptive variant to isolate the benefit of adaptivity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the work's significance, and recommendation of minor revision. No specific major comments are listed in the report.

Circularity Check

0 steps flagged

No significant circularity detected

The paper is an empirical proposal of an iterative multimodal search agent (ProMSA) for KB-VQA. It describes a training pipeline using rejection-sampling SFT followed by the TN-GSPO sequence-level RL objective, then reports experimental gains on E-VQA and InfoSeek over RAG and agent baselines. No equations, fitted parameters, or first-principles derivations appear in the abstract or described method that reduce any claimed result to a definition or self-referential input. The central claim rests on external benchmark comparisons rather than any internal reduction or self-citation chain. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5733 in / 1067 out tokens · 21302 ms · 2026-06-29T04:58:52.246420+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 13 linked inside Pith

  1. [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  2. [2] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
  3. [3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025), https://arxiv.org/abs/2502.13923
  4. [4] Caffagni, D., Cocchi, F., Moratelli, N., Sarto, S., Cornia, M., Baraldi, L., Cucchiara, R.: Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1818–1826 (2024)
  5. [5] Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., Liu, Z.: M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 2318–2335 (2024)
  6. [6] Chen, X., Shukla, S.N., Azab, M., Singh, A., Wang, Q., Yang, D., Peng, S., Yu, H., Yan, S., Zhang, X., et al.: Compcap: Improving multimodal large language models with composite captions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23582–23592 (2025)
  7. [7] Chen, Y., Hu, H., Luan, Y., Sun, H., Changpinyo, S., Ritter, A., Chang, M.W.: Can pre-trained vision and language models answer visual information-seeking questions? In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 14948–14968 (2023)
  8. [8] Chng, Y.X., Hu, T., Tong, W., Li, X., Chen, J., Yu, H., Lu, J., Guo, H., Deng, H., Xie, C., et al.: Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning. arXiv preprint arXiv:2512.24330 (2025)
  9. [9] Cocchi, F., Moratelli, N., Cornia, M., Baraldi, L., Cucchiara, R.: Augmenting multimodal llms with self-reflective tokens for knowledge-based visual question answering. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9199–9209 (2025)
  10. [10] Compagnoni, A., Morini, M., Sarto, S., Cocchi, F., Caffagni, D., Cornia, M., Baraldi, L., Cucchiara, R.: Reag: Reasoning-augmented generation for knowledge-based visual question answering. arXiv preprint arXiv:2511.22715 (2025)
  11. [11] Hong, J., Zhao, C., Zhu, C., Lu, W., Xu, G., Yu, X.: Deepeyesv2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271 (2025)
  12. [12] Hong, Y., Gu, J., Lou, Y., Fan, L., Yang, Q., Wang, Y., Ding, K., Wu, Y., Xiang, S., Ye, J.: Cc-vqa: Conflict-and correlation-aware method for mitigating knowledge conflict in knowledge-based visual question answering. arXiv preprint arXiv:2602.23952 (2026)
  13. [13] Hong, Y., Gu, J., Yang, Q., Fan, L., Wu, Y., Wang, Y., Ding, K., Xiang, S., Ye, J.: Knowledge-based visual question answer with multimodal processing, retrieval and filtering. arXiv preprint arXiv:2510.14605 (2025)
  14. [14] Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., Han, J.: Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516 (2025)
  15. [15] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning. pp. 19730–19742. PMLR (2023)
  16. [16] Li, X., Dong, G., Jin, J., Zhang, Y., Zhou, Y., Zhu, Y., Zhang, P., Dou, Z.: Search-o1: Agentic search-enhanced large reasoning models. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 5420–5438 (2025)
  17. [17] Ling, Z., Guo, Z., Huang, Y., An, Y., Xiao, S., Lan, J., Zhu, X., Zheng, B.: Mmkb-rag: A multi-modal knowledge-based retrieval-augmented generation framework. arXiv preprint arXiv:2504.10074 (2025)
  18. [18] Mao, X., Ye, K., Zhou, S., Zhang, N., Huang, H., Li, B., Bu, J.: Mas-vqa: A mask-and-select framework for knowledge-based visual question answering. arXiv preprint arXiv:2602.15915 (2026)
  19. [19] Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: Ok-vqa: A visual question answering benchmark requiring external knowledge. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3195–3204 (2019)
  20. [20] Mensink, T., Uijlings, J., Castrejon, L., Goel, A., Cadar, F., Zhou, H., Sha, F., Araujo, A., Ferrari, V.: Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3113–3124 (2023)
  21. [21] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
  22. [22] Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., Wu, C.: Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256 (2024)
  23. [23] Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389 (2023)
  24. [24] Wang, P., Li, Z.Z., Yin, F., Ran, D., Liu, C.L.: Mv-math: Evaluating multimodal math reasoning in multi-visual contexts. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19541–19551 (2025)
  25. [25] Wang, Z., Dong, Y., Luo, F., Ruan, M., Cheng, Z., Chen, C., Li, P., Liu, Y.: Escapecraft: A 3d room escape environment for benchmarking complex multimodal reasoning ability. arXiv preprint arXiv:2503.10042 (2025)
  26. [26] Wu, J., Deng, Z., Li, W., Liu, Y., You, B., Li, B., Ma, Z., Liu, Z.: Mmsearch-r1: Incentivizing lmms to search. arXiv preprint arXiv:2506.20670 (2025)
  27. [27] Yan, Y., Xie, W.: Echosight: Advancing visual-language models with wiki knowledge. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 1538–1551 (2024)
  28. [28] Yang, W., Fu, J., Wang, R., Wang, J., Song, L., Bian, J.: Omgm: Orchestrate multiple granularities and modalities for efficient multimodal retrieval. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 24545–24563 (2025)
  29. [29] Ye, K., Mao, X., Zhou, S., Shao, Z., Mo, Y., Liu, L., Huang, H., Li, B., Bu, J.: Real: Resolving knowledge conflicts in knowledge-intensive visual question answering via reasoning-pivot alignment. arXiv preprint arXiv:2602.14065 (2026)
  30. [30] Ye, W., Su, Y., Chen, Y., Gao, L., Li, J., Li, R., Zhang, R.: Qkvqa: Question-focused filtering for knowledge-based vqa (2026), https://arxiv.org/abs/2601.13856
  31. [31] Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. National Science Review 11(12), nwae403 (2024)
  32. [32] Zheng, C., Liu, S., Li, M., Chen, X.H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al.: Group sequence policy optimization. arXiv preprint arXiv:2507.18071 (2025)
  33. [33] Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., Ma, Y.: Llamafactory: Unified efficient fine-tuning of 100+ language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Association for Computational Linguistics, Bangkok, Thailand (2024), http://arxiv.org/abs/2403.13372