RoRA-VLM: Robust retrieval-augmented vision language models

2024 · arXiv 2410.08876

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 baseline 1

citation-polarity summary

background 1 baseline 1

representative citing papers

Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG

cs.IR · 2026-04-30 · unverdicted · novelty 7.0

FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.

mKG-RAG: Leveraging Multimodal Knowledge Graphs in Retrieval-Augmented Generation for Knowledge-intensive VQA

cs.CV · 2025-08-07 · unverdicted · novelty 7.0

mKG-RAG constructs multimodal KGs via MLLM-driven extraction and vision-text matching, then applies dual-stage query-aware retrieval to achieve new state-of-the-art results on knowledge-based VQA.

Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

A decision-based agent for KB-VQA learns to dynamically select retrieval or answer actions over multiple steps and achieves state-of-the-art results on InfoSeek and E-VQA after fine-tuning on automatically collected trajectories.

WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

WikiSeeker boosts KB-VQA performance by using VLMs to rewrite image-informed queries for better retrieval and to decide whether to route a question to an external LLM or rely on the VLM's internal knowledge.

MG²-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

cs.IR · 2026-04-04 · unverdicted · novelty 6.0

MG²-RAG proposes a multi-granularity graph RAG framework that constructs hierarchical multimodal nodes via entity-driven visual grounding and performs structured retrieval, delivering SOTA results on four multimodal tasks with 43.3× faster graph construction.

MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems

cs.CV · 2026-05-19 · unverdicted · novelty 5.0

MetaRA applies metamorphic testing to VQA tasks and shows that MLLMs are sensitive to linguistic perturbations and superficial visual cues that conventional accuracy benchmarks do not detect.

Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

cs.CV · 2026-05-05 · unverdicted · novelty 4.0

A new CoVQD-guided retrieval-augmented generation framework improves multimodal LLMs on visual question answering by using structured reasoning to retrieve better external knowledge.

citing papers explorer

Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG · cs.IR · 2026-04-30 · unverdicted · none · ref 40
mKG-RAG: Leveraging Multimodal Knowledge Graphs in Retrieval-Augmented Generation for Knowledge-intensive VQA · cs.CV · 2025-08-07 · unverdicted · none · ref 44
Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering · cs.CV · 2026-04-08 · unverdicted · none · ref 3
WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering · cs.CV · 2026-04-07 · unverdicted · none · ref 23
MG²-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation · cs.IR · 2026-04-04 · unverdicted · none · ref 59
MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems · cs.CV · 2026-05-19 · unverdicted · none · ref 21
Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation · cs.CV · 2026-05-05 · unverdicted · none · ref 16
