VimRAG models multimodal reasoning as a dynamic DAG and modulates visual token allocation by node topology, reaching state-of-the-art results on multimodal RAG benchmarks.
During the retrieval phase, it directly uses the original question to search for relevant text, images, and videos, which are then inserted into the context to answer the question.
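The single-shot retrieval step described above can be sketched roughly as follows. This is a minimal illustration and not the cited system's implementation: the `Item` type, the word-overlap `score` function, and the caption-based matching are all hypothetical stand-ins for a real multimodal retriever.

```python
from dataclasses import dataclass


@dataclass
class Item:
    modality: str  # "text", "image", or "video"
    caption: str   # textual description used for matching (assumption)


def score(question: str, item: Item) -> int:
    """Toy relevance: count of shared lowercase words (stand-in for a real retriever)."""
    question_words = set(question.lower().split())
    return len(question_words & set(item.caption.lower().split()))


def retrieve(question: str, corpus: list[Item], k: int = 2) -> list[Item]:
    # The original question is the query, with no rewriting or decomposition.
    ranked = sorted(corpus, key=lambda it: score(question, it), reverse=True)
    return ranked[:k]


def build_context(question: str, corpus: list[Item], k: int = 2) -> str:
    # Retrieved items of every modality are placed in the context before the question.
    hits = retrieve(question, corpus, k)
    lines = [f"[{it.modality}] {it.caption}" for it in hits]
    return "\n".join(lines + [f"Question: {question}"])


corpus = [
    Item("text", "the eiffel tower is in paris"),
    Item("image", "photo of a red bridge in san francisco"),
    Item("video", "clip of the eiffel tower at night"),
]
context = build_context("where is the eiffel tower", corpus, k=2)
```

Because the query is never refined and every hit is appended to the context, the context grows with each retrieved image or video, whatever its relevance to later reasoning steps.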
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph