pith. machine review for the scientific record.

arxiv: 2509.00798 · v7 · submitted 2025-08-31 · 💻 cs.CV · cs.AI

Recognition: unknown

Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering

Authors on Pith: no claims yet
classification: 💻 cs.CV · cs.AI
keywords: knowledge, reasoning, visual, multimodal, pmsr, answering, bases, diverse
read the original abstract

Knowledge-intensive visual question answering (VQA) requires external knowledge beyond image content, demanding precise visual grounding and coherent integration of visual and textual information. Although multimodal retrieval-augmented generation has achieved notable advances by incorporating external knowledge bases, existing approaches largely adopt single-pass frameworks that often fail to acquire sufficient knowledge and lack mechanisms to revise misdirected reasoning. We propose PMSR (Progressive Multimodal Search and Reasoning), a framework that progressively constructs a structured reasoning trajectory to enhance both knowledge acquisition and synthesis. PMSR uses dual-scope queries conditioned on both the latest record and the trajectory to retrieve diverse knowledge from heterogeneous knowledge bases. The retrieved evidence is then synthesized into compact records via compositional reasoning. This design facilitates controlled iterative refinement, which supports more stable reasoning trajectories with reduced error propagation. Extensive experiments across six diverse benchmarks (Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, and OK-VQA) demonstrate that PMSR consistently improves both retrieval recall and end-to-end answer accuracy.
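The abstract describes an iterative loop: dual-scope queries are formed from the latest record and the full trajectory, evidence is retrieved from heterogeneous knowledge bases, and the results are compressed into a compact record for the next step. The paper's actual method is not reproduced here; the following is a minimal illustrative sketch in which every function name, query format, and data shape is an assumption, and knowledge bases are modeled as simple callables:

```python
# Hypothetical sketch of a PMSR-style progressive search-and-reason loop,
# based only on the abstract. All names and structures are assumptions.

def dual_scope_queries(question, latest_record, trajectory):
    """Form two queries: one scoped to the newest record, one to the
    whole trajectory so far (formats are illustrative)."""
    local_q = f"{question} | latest: {latest_record}"
    global_q = f"{question} | trajectory: {' -> '.join(trajectory)}"
    return [local_q, global_q]

def retrieve(queries, knowledge_bases):
    """Retrieve evidence for each query from every knowledge base."""
    return [kb(q) for q in queries for kb in knowledge_bases]

def synthesize(evidence):
    """Compress retrieved evidence into one compact record
    (stand-in for the paper's compositional reasoning step)."""
    return " ; ".join(e for e in evidence if e)

def pmsr_answer(question, knowledge_bases, max_steps=3):
    """Iteratively refine a structured reasoning trajectory."""
    trajectory = []
    record = question  # seed the first record with the question itself
    for _ in range(max_steps):
        queries = dual_scope_queries(question, record, trajectory)
        evidence = retrieve(queries, knowledge_bases)
        record = synthesize(evidence)  # compact record for the next step
        trajectory.append(record)
    return trajectory[-1]  # final record stands in for the answer
```

The controlled refinement the abstract credits with reducing error propagation would live in the `synthesize` step, which in the real system presumably revises or discards misdirected records rather than simply concatenating evidence as this toy version does.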

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    WikiSeeker boosts KB-VQA performance by using VLMs to rewrite image-informed queries for better retrieval and to decide when to route to external LLM or rely on internal VLM knowledge.