pith. machine review for the scientific record.

arxiv: 2605.03790 · v1 · submitted 2026-05-05 · 💻 cs.CV

Recognition: unknown

Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 17:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual question answering · multimodal large language models · retrieval-augmented generation · chain-of-thought · question decomposition · knowledge retrieval · cross-domain VQA · open-domain VQA

The pith

A prompting strategy that fuses chain-of-thought with question decomposition guides retrieval to improve multimodal LLMs on visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a prompting strategy that combines step-by-step reasoning with breaking visual questions into parts to direct the retrieval of external knowledge. Multimodal large language models often require outside information to answer questions about images accurately, especially when the knowledge is not obvious from the visual input alone. The proposed framework uses this guidance to fetch more relevant and coherent knowledge, which the model then uses for its final inference. A sympathetic reader would care because it addresses the common issue of inaccurate or incomplete answers in open-domain VQA by making the knowledge acquisition process more structured and reliable.
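
The mechanism described above can be made concrete with a rough sketch: decompose the question into sub-questions, use each one as a retrieval query, then hand the question chain plus the retrieved passages back to the model for the final answer. This is a minimal sketch under assumed interfaces; the `mllm.generate` and `retriever.search` methods are placeholders for whatever backbone model and knowledge source the paper actually uses, not its API.

```python
# Minimal sketch of chain-of-question guided retrieval, assuming a generic
# multimodal model client with .generate(prompt, image=...) -> str and a
# text retriever with .search(query, k) -> list[str]. These interfaces are
# placeholders, not the paper's actual components.

def chain_of_question_rag(image, question, mllm, retriever, k=3):
    # 1. Ask the model to decompose the question into ordered sub-questions
    #    (the paper's CoVQD fuses this decomposition with chain-of-thought).
    decomposition_prompt = (
        "Break the following visual question into a short, ordered list of "
        "sub-questions, one per line, that must be answered to reach the "
        "final answer.\nQuestion: " + question
    )
    raw = mllm.generate(decomposition_prompt, image=image)
    sub_questions = [line.strip() for line in raw.splitlines() if line.strip()]

    # 2. Use each sub-question as a retrieval query against the external
    #    knowledge source, rather than retrieving once with the raw question.
    passages = []
    for sq in sub_questions:
        passages.extend(retriever.search(sq, k=k))

    # 3. Feed the sub-question chain and the retrieved knowledge back to the
    #    model for the final, grounded answer.
    answer_prompt = (
        "Reasoning steps:\n" + "\n".join(sub_questions)
        + "\n\nRetrieved knowledge:\n" + "\n".join(passages)
        + "\n\nUsing the image and the knowledge above, answer: " + question
    )
    return mllm.generate(answer_prompt, image=image)
```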

Core claim

The authors claim that introducing CoVQD, a fusion of Chain-of-Thought reasoning with Visual Question Decomposition, to guide retrieval-augmented generation enables MLLMs to access more comprehensive and coherent external knowledge while benefiting from structured visual-text reasoning guidance, thereby improving generalization and reliability in complex cross-domain VQA scenarios.

What carries the argument

The CoVQD prompting strategy, which fuses Chain-of-Thought reasoning with Visual Question Decomposition to guide retrieval in a retrieval-augmented generation framework for multimodal large language models.

Load-bearing premise

Fusing Chain-of-Thought with Visual Question Decomposition will reliably steer retrieval toward accurate and relevant external knowledge without introducing decomposition errors or irrelevant retrieved content that harms the final model output.

What would settle it

A direct comparison on the OKVQA or InfoSeek benchmarks in which the full framework performs no better than, or worse than, a baseline retrieval-augmented generation setup without CoVQD guidance would falsify the central claim.
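
Operationally, that test is a paired accuracy comparison between the two pipelines on the same benchmark split. A minimal sketch follows, assuming hypothetical `baseline_rag` and `cgrag` callables and an iterable of (image, question, gold answers) examples; exact-match scoring stands in for OKVQA's softer, annotator-weighted accuracy metric.

```python
# Sketch of the settling experiment: the same VQA accuracy measured for a
# plain retrieval-augmented baseline and for the CoVQD-guided pipeline.
# `baseline_rag`, `cgrag`, and `dataset` are placeholders, not artifacts
# released with the paper.

def vqa_accuracy(pipeline, dataset):
    correct = 0
    total = 0
    for image, question, gold_answers in dataset:
        prediction = pipeline(image, question).strip().lower()
        # Exact match against any reference answer; real OKVQA scoring uses
        # a softer, multi-annotator metric.
        if prediction in {answer.lower() for answer in gold_answers}:
            correct += 1
        total += 1
    return correct / max(total, 1)

# The central claim would be falsified if this gap is not reliably positive:
# gap = vqa_accuracy(cgrag, dataset) - vqa_accuracy(baseline_rag, dataset)
```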

Figures

Figures reproduced from arXiv: 2605.03790 by Chia-Wen Lin, Ling Zhou, Quanxing Xu, Rubing Huang, Xian Zhong, Xiaohua Huang.

Figure 1
Figure 1: Comparison between prior retrieval-based VQA with MLLMs and the proposed CgRAG framework. With an MLLM fine-tuned by liDPO, VQD and CoT are fused to guide fine-grained RAG, yielding an enhanced MLLM-based VQA framework.
Figure 2
Figure 2: Illustration of the VQD. The generation of a Chain-of-Question via VQD on the input question can benefit the MLLM's exploration of the knowledge behind the given image.
Figure 3
Figure 3: Overall architecture of the proposed CgRAG framework. The pipeline consists of three components: Dissecting Chain Generation (DCG), which constructs CoT guided by VQD from the input image and question; Elaborate Knowledge Retrieval (EKR), which retrieves external knowledge under the guidance of CoVQD; and Comprehensive Prompt Construction (CPC), which aggregates implicit and explicit knowledge for inference.
Figure 5
Figure 5: Overview of logical implication Direct Preference Optimization (liDPO). The two-stage procedure includes rejected data construction and preference optimization, where the desired set (O) contains logically correct sub-question sequences and the undesired set (Or) contains incorrect ones.
Figure 6
Figure 6: Illustration of the Elaborate Knowledge Retrieval (EKR) module. Three retrieval processes are involved: original image retrieval, multimodal retrieval, and CoVQD-guided retrieval. I, C, E, and K denote Image, Caption, Explanation, and Knowledge, respectively; QO and CoVQD serve as a supervisor for filtering out visual information irrelevant to reasoning.
Figure 7
Figure 7: Performance comparison of MLLM-based VQA on INFOSEEK-All. The best result of each method is reported, and different shapes indicate distinct retrieval features.
Figure 8
Figure 8: Comparison of different methods on Explanatory VQA. For visualization clarity, CIDEr scores are scaled to one-tenth of their original values.
Figure 9
Figure 9: Ablation comparison of different retrieval modes across MLLMs. Results are reported on E-VQA-All and INFOSEEK-All, where "V" and "T" denote visual and textual features, respectively.
Figure 10
Figure 10: Qualitative comparison across different knowledge domains. Performance differences between Qwen2.5-VL [51], LLaVA-NeXT [52], and their CgRAG-enhanced variants are shown on cases from (a-b) commonsense, (c-d) animal, (e) geography, (f) architecture, (g) history, and (h) art domains.
read the original abstract

With advances in multimodal research and deep learning, Multimodal Large Language Models (MLLMs) have emerged as a powerful paradigm for a wide range of multimodal tasks. As a core problem in vision-language research, Visual Question Answering (VQA) has increasingly employed MLLMs to improve performance, particularly in open-domain settings where external knowledge is essential. In this work, we aim to further enhance retrieval-based VQA by more effectively integrating MLLMs with structured reasoning and knowledge acquisition. We introduce a logical prompting strategy that fuses Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD), termed CoVQD, to guide retrieval toward more accurate and relevant knowledge for MLLM inference. Building on this idea, we propose a new framework, CoVQD-guided RAG (CgRAG), which enables MLLMs to access more comprehensive and coherent external knowledge while benefiting from structured visual-text reasoning guidance, thereby improving generalization and reliability in complex cross-domain VQA scenarios. Extensive experiments on E-VQA, InfoSeek, and OKVQA benchmarks demonstrate the effectiveness of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes CoVQD, a logical prompting strategy fusing Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD), and builds the CgRAG framework on top of it to guide retrieval-augmented generation for multimodal LLMs in visual question answering. The central claim is that this structured visual-text reasoning guidance enables MLLMs to retrieve more comprehensive and coherent external knowledge, thereby improving generalization and reliability on complex cross-domain VQA tasks, with effectiveness shown via experiments on the E-VQA, InfoSeek, and OKVQA benchmarks.

Significance. If the empirical claims are substantiated with quantitative results and controls, the work could offer a practical way to combine explicit reasoning chains with external retrieval in MLLMs, addressing a real limitation in open-domain VQA where knowledge is required. The approach is timely given the growing use of MLLMs for knowledge-intensive multimodal tasks.

major comments (2)
  1. [Abstract] Abstract: The manuscript asserts that 'extensive experiments on E-VQA, InfoSeek, and OKVQA benchmarks demonstrate the effectiveness' of CgRAG yet supplies no quantitative results, baselines, ablation studies, or error analysis. This is load-bearing for the central claim because the asserted gains over standard RAG or plain CoT cannot be evaluated without those data; the skeptic concern that VQD decomposition errors may retrieve off-topic or contradictory passages is left unaddressed.
  2. [Abstract] Abstract: No mechanism is described for filtering or mitigating irrelevant or contradictory passages that could be retrieved when CoVQD sub-questions contain errors (a known risk in visual question decomposition). Without such a safeguard or validation of sub-question quality, the claim that the fusion 'guides retrieval toward more accurate and relevant knowledge' rests on an untested assumption.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and have made revisions to strengthen the presentation of our claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The manuscript asserts that 'extensive experiments on E-VQA, InfoSeek, and OKVQA benchmarks demonstrate the effectiveness' of CgRAG yet supplies no quantitative results, baselines, ablation studies, or error analysis. This is load-bearing for the central claim because the asserted gains over standard RAG or plain CoT cannot be evaluated without those data; the skeptic concern that VQD decomposition errors may retrieve off-topic or contradictory passages is left unaddressed.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The detailed experimental results, baseline comparisons, ablation studies, and error analysis are presented in Sections 4 and 5 of the manuscript. We will revise the abstract to report specific performance gains on E-VQA, InfoSeek, and OKVQA (e.g., accuracy improvements relative to standard RAG and CoT baselines) to make the central claims more self-contained. revision: yes

  2. Referee: [Abstract] Abstract: No mechanism is described for filtering or mitigating irrelevant or contradictory passages that could be retrieved when CoVQD sub-questions contain errors (a known risk in visual question decomposition). Without such a safeguard or validation of sub-question quality, the claim that the fusion 'guides retrieval toward more accurate and relevant knowledge' rests on an untested assumption.

    Authors: This is a fair observation. The current version of the manuscript does not explicitly describe a filtering or mitigation mechanism for erroneous sub-questions in the abstract or methods overview. We will revise the paper to add a relevance-thresholding step and passage reranking in the CgRAG retrieval module, include a brief description of this safeguard in the abstract, and report sub-question quality validation results in the experiments section. revision: yes
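
For concreteness, a relevance-thresholding and reranking safeguard of the kind promised here could look roughly like the following; the cosine-similarity scoring, the threshold value, and the function names are illustrative assumptions for this sketch, not details taken from the manuscript.

```python
# Illustrative sketch only: score each retrieved passage against the original
# question embedding, drop passages below a relevance threshold, and rerank
# the survivors. Embeddings are assumed to be precomputed vectors.
import numpy as np

def filter_and_rerank(question_vec, passages, passage_vecs, threshold=0.35, top_k=5):
    kept = []
    for passage, vec in zip(passages, passage_vecs):
        sim = float(np.dot(question_vec, vec)
                    / (np.linalg.norm(question_vec) * np.linalg.norm(vec)))
        if sim >= threshold:              # relevance thresholding
            kept.append((sim, passage))
    kept.sort(key=lambda pair: pair[0], reverse=True)   # rerank by similarity
    return [passage for _, passage in kept[:top_k]]
```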

Circularity Check

0 steps flagged

No circularity: architectural framework validated on external benchmarks

full rationale

The paper proposes an architectural framework (CgRAG) that fuses CoT with VQD to guide RAG for MLLM-based VQA. All performance claims are measured against independent external benchmarks (E-VQA, InfoSeek, OKVQA) with no internal equations, fitted parameters, or self-referential definitions that reduce the reported gains to quantities defined by the method itself. No load-bearing step collapses by construction to prior self-citations or ansatzes; the derivation chain consists of design choices whose value is asserted via empirical comparison rather than algebraic identity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim depends on the untested premise that the new prompting strategy improves retrieval quality and that standard VQA benchmarks measure the intended generalization; no free parameters are explicitly fitted in the abstract, but the method implicitly assumes MLLMs can exploit the guided retrieval without side effects.

axioms (2)
  • domain assumption The chosen benchmarks (E-VQA, InfoSeek, OKVQA) are valid proxies for complex cross-domain VQA performance.
    The paper evaluates exclusively on these datasets without discussing their limitations or selection criteria.
  • domain assumption Structured CoT+VQD prompting produces retrieval queries that are more accurate and relevant than unguided retrieval.
    This is the core mechanistic assumption underlying the CgRAG design.
invented entities (2)
  • CoVQD no independent evidence
    purpose: Prompting strategy that fuses Chain-of-Thought reasoning with Visual Question Decomposition to guide retrieval.
    Newly defined in the paper as the key logical component.
  • CgRAG no independent evidence
    purpose: Overall retrieval-augmented generation framework that integrates CoVQD with MLLMs.
    The main proposed system whose performance is asserted to improve on benchmarks.

pith-pipeline@v0.9.0 · 5515 in / 1555 out tokens · 78407 ms · 2026-05-07T17:43:24.607799+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 13 canonical work pages · 8 internal anchors

  1. [1] L. Yang, Z. Xiao, W. Huang, and X. Zhong, "StoryLLaVA: Enhancing visual storytelling with multi-modal large language models," in Proc. Int. Conf. Comput. Linguistics, pp. 3936–3951, 2025.
  2. [2] X. Zhong, Z. Li, S. Chen, K. Jiang, C. Chen, and M. Ye, "Refined semantic enhancement towards frequency diffusion for video captioning," in Proc. AAAI Conf. Artif. Intell., pp. 3724–3732, 2023.
  3. [3] S. Chen, X. Zhong, Y. Zhang, L. Zhu, P. Li, X. Yang, and B. Sheng, "Action-aware linguistic skeleton optimization network for non-autoregressive video captioning," ACM Trans. Multimedia Comput. Commun. Appl., vol. 20, no. 10, pp. 326:1–326:24, 2024.
  4. [4] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, "VQA: Visual question answering," in Proc. IEEE/CVF Int. Conf. Comput. Vis., pp. 2425–2433, 2015.
  5. [5] Y. Goyal, T. Khot, A. Agrawal, D. Summers-Stay, D. Batra, and D. Parikh, "Making the V in VQA matter: Elevating the role of image understanding in visual question answering," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 6904–6913, 2017.
  6. [6] S. Shah, A. Mishra, N. Yadati, and P. P. Talukdar, "KVQA: Knowledge-aware visual question answering," in Proc. AAAI Conf. Artif. Intell., pp. 8876–8884, 2019.
  7. [7] F. Cocchi, N. Moratelli, M. Cornia, L. Baraldi, and R. Cucchiara, "Augmenting multimodal LLMs with self-reflective tokens for knowledge-based visual question answering," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 9199–9209, 2025.
  8. [8] Z. Zhang, Y. Wu, Y. Luo, and N. Tang, "Fine-grained retrieval-augmented generation for visual question answering," arXiv preprint arXiv:2502.20964, 2025.
  9. [9] Z. Ling, Z. Guo, Y. Huang, Y. An, S. Xiao, J. Lan, X. Zhu, and B. Zheng, "MMKB-RAG: A multi-modal knowledge-based retrieval-augmented generation framework," arXiv preprint arXiv:2504.10074, 2025.
  10. [10] Y. Hong, J. Gu, Q. Yang, L. Fan, Y. Wu, Y. Wang, K. Ding, S. Xiang, and J. Ye, "Knowledge-based visual question answer with multimodal processing, retrieval and filtering," arXiv preprint arXiv:2510.14605, 2025.
  11. [11] H. You, R. Sun, Z. Wang, L. Chen, G. Wang, H. A. Ayyubi, K.-W. Chang, and S.-F. Chang, "IdealGPT: Iteratively decomposing vision and language reasoning via large language models," in Findings EMNLP, pp. 11289–11303, 2023.
  12. [12] J. Qi, Z. Xu, Y. Shen, M. Liu, D. Jin, Q. Wang, and L. Huang, "The art of SOCRATIC questioning: Recursive thinking with large language models," in Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 4177–4199, 2023.
  13. [13] Y. Tian, T. Ma, L. Xie, and Q. Ye, "Chatterbox: Multimodal referring and grounding with chain-of-questions," in Proc. AAAI Conf. Artif. Intell., pp. 7401–7409, 2025.
  14. [14] M. Bigverdi, Z. Luo, C.-Y. Hsieh, E. Shen, D. Chen, L. G. Shapiro, and R. Krishna, "Perception tokens enhance visual reasoning in multimodal language models," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 3836–3845, 2025.
  15. [15] D. Caffagni, F. Cocchi, N. Moratelli, S. Sarto, M. Cornia, L. Baraldi, and R. Cucchiara, "Wiki-LLaVA: Hierarchical retrieval-augmented generation for multimodal LLMs," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, pp. 1818–1826, 2024.
  16. [16] J. Qi, Z. Xu, R. Shao, Y. Chen, D. Jin, Y. Cheng, Q. Wang, and L. Huang, "RoRA-VLM: Robust retrieval-augmented vision language models," arXiv preprint arXiv:2410.08876, 2024, https://arxiv.org/abs/2410.08876.
  17. [17] Y. Yan and W. Xie, "EchoSight: Advancing visual-language models with wiki knowledge," in Findings EMNLP, pp. 1538–1551, 2024.
  18. [18] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
  19. [19] DeepSeek-AI, "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning," arXiv preprint arXiv:2501.12948, 2025.
  20. [20] Y. Yang, X. He, H. Pan, X. Jiang, Y. Deng, X. Yang, H. Lu, D. Yin, F. Rao, M. Zhu, B. Zhang, and W. Chen, "R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization," arXiv preprint arXiv:2503.10615, 2025.
  21. [21] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, "Direct preference optimization: Your language model is secretly a reward model," in Proc. Adv. Neural Inf. Process. Syst., 2023.
  22. [23] C. Lai, S. Song, S. Meng, J. Li, S. Yan, and G. Hu, "Towards more faithful natural language explanation using multi-level contrastive learning in VQA," in Proc. AAAI Conf. Artif. Intell., pp. 2849–2857, 2024.
  23. [24] D. Xue, S. Qian, and C. Xu, "Few-shot multimodal explanation for visual question answering," in Proc. ACM Int. Conf. Multimedia, pp. 1875–1884, 2024.
  24. [25] Z. Yang, Z. Gan, J. Wang, X. Hu, Y. Lu, Z. Liu, and L. Wang, "An empirical study of GPT-3 for few-shot knowledge-based VQA," in Proc. AAAI Conf. Artif. Intell., pp. 3081–3089, 2022.
  25. [26] Z. Khan, V. K. B. G, S. Schulter, M. Chandraker, and Y. Fu, "Exploring question decomposition for zero-shot VQA," in Proc. Int. Conf. Neural Inf. Process. Syst., 2023.
  26. [27] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am..., "Language models are few-shot learners."
  27. [28] H. Liu, C. Li, Y. Li, and Y. J. Lee, "Improved baselines with visual instruction tuning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 26286–26296, 2024.
  28. [29] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, "Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond," arXiv preprint arXiv:2308.12966, 2023.
  29. [30] H. Zhang, J. Liu, Z. Han, S. Chen, B. He, V. Tresp, Z. Xu, and J. Gu, "Visual question decomposition on multimodal large language models," in Findings EMNLP, pp. 1926–1949, 2024.
  30. [31] S. Tascon-Morales, P. Márquez-Neila, and R. Sznitman, "Logical implications for visual question answering consistency," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 6725–6735, 2023.
  31. [32] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., pp. 4171–4186, 2019.
  32. [33] L. Gao, J. Schulman, and J. Hilton, "Scaling laws for reward model overoptimization," in Proc. Int. Conf. Mach. Learn., pp. 10835–10866, 2023.
  33. [34] R. A. Bradley and M. E. Terry, "Rank analysis of incomplete block designs: I. The method of paired comparisons," Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952.
  34. [35] F. Wang, W. Zhou, J. Y. Huang, N. Xu, S. Zhang, H. Poon, and M. Chen, "mDPO: Conditional preference optimization for multimodal large language models," in Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 8078–8088, 2024.
  35. [36] Y. He, H. Sun, P. Ren, J. Wang, H. Wang, Q. Qi, Z. Zhuang, and J. Wang, "Evaluating and mitigating object hallucination in large vision-language models: Can they still see removed objects?," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., pp. 6841–6858, 2025.
  36. [37] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao, "VinVL: Revisiting visual representations in vision-language models," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 5579–5588, 2021.
  37. [38] J. Li, D. Li, S. Savarese, and S. C. H. Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," in Proc. Int. Conf. Mach. Learn., pp. 19730–19742, 2023.
  38. [39] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn., pp. 8748–8763, 2021.
  39. [40] J. Li, D. Li, C. Xiong, and S. C. H. Hoi, "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation," in Proc. Int. Conf. Mach. Learn., vol. 162, pp. 12888–12900, 2022.
  40. [41] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, "OK-VQA: A visual question answering benchmark requiring external knowledge," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 3195–3204, 2019.
  41. [42] T. Mensink, J. R. R. Uijlings, L. Castrejón, A. Goel, F. Cadar, H. Zhou, F. Sha, A. Araújo, and V. Ferrari, "Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories," in Proc. IEEE/CVF Int. Conf. Comput. Vis., pp. 3090–3101, 2023.
  42. [43] Y. Chen, H. Hu, Y. Luan, H. Sun, S. Changpinyo, A. Ritter, and M.-W. Chang, "Can pre-trained vision and language models answer visual information-seeking questions?," in Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 14948–14968, 2023.
  43. [44] H. Hu, Y. Luan, Y. Chen, U. Khandelwal, M. Joshi, K. Lee, K. Toutanova, and M.-W. Chang, "Open-domain visual entity recognition: Towards recognizing millions of Wikipedia entities," in Proc. IEEE/CVF Int. Conf. Comput. Vis., pp. 12031–12041, 2023.
  44. [45] S. Chen and Q. Zhao, "REX: Reasoning-aware and grounded explanation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 15565–15574, 2022.
  45. [46] D. A. Hudson and C. D. Manning, "GQA: A new dataset for real-world visual reasoning and compositional question answering," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 6700–6709, 2019.
  46. [47] C. Kervadec, G. Antipov, M. Baccouche, and C. Wolf, "Roses are red, violets are blue... but should VQA expect them to?," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 2776–2785, 2021.
  47. [48] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. C. H. Hoi, "InstructBLIP: Towards general-purpose vision-language models with instruction tuning," in Proc. Adv. Neural Inf. Process. Syst., 2023.
  48. [49] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2024.
  49. [50] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, "Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution," arXiv preprint arXiv:2409.12191, 2024.
  50. [51] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al., "Qwen2.5-VL technical report," arXiv preprint arXiv:2502.13923, 2025.
  51. [52] F. Li, R. Zhang, H. Zhang, Y. Zhang, B. Li, W. Li, Z. Ma, and C. Li, "LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models," arXiv preprint arXiv:2407.07895, 2024.
  52. [53] P. Lerner, O. Ferret, and C. Guinaudeau, "Cross-modal retrieval for knowledge-based visual question answering," in Proc. Eur. Conf. Inf. Retr., pp. 421–438, 2024.
  53. [54] J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, et al., "InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models," arXiv preprint arXiv:2504.10479, 2025.
  54. [55] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, "A large annotated corpus for learning natural language inference," in Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 632–642, 2015.
  55. [56] R. R. Selvaraju, P. Tendulkar, D. Parikh, E. Horvitz, M. T. Ribeiro, B. Nushi, and E. Kamar, "Squinting at VQA models: Introspecting VQA models with sub-questions," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 10000–10008, 2020.
  56. [57] Q. Li, Q. Tao, S. R. Joty, J. Cai, and J. Luo, "VQA-E: Explaining, elaborating, and enhancing your answers for visual questions," in Proc. Eur. Conf. Comput. Vis., vol. 11211, pp. 570–586, 2018.
  57. [58] J. Wu and R. J. Mooney, "Faithful multimodal explanation for visual question answering," in Proc. ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 103–112, 2019.
  58. [59] D. Xue, S. Qian, and C. Xu, "Variational causal inference network for explanatory visual question answering," in Proc. IEEE/CVF Int. Conf. Comput. Vis., pp. 2515–2525, 2023.
  59. [60] K. Li, G. Vosselman, and M. Y. Yang, "Multimodal rationales for explainable visual question answering," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, pp. 191–201, 2025.