QKVQA: Question-Focused Filtering for Knowledge-based VQA

arXiv: 2601.13856 · v3 · submitted 2026-01-20 · cs.IR

Wei Ye, Yixin Su, Yueguo Chen, Longxiang Gao, Jianjun Li, Ruixuan Li, Rui Zhang

Pith reviewed 2026-05-16 12:51 UTC · model grok-4.3

classification: cs.IR

keywords: knowledge-based VQA · question-focused filtering · visual question answering · knowledge filtering · cross-article selection · information retrieval · multimodal reasoning

The pith

A trainable question-focused filter and cross-article selector raise accuracy in knowledge-based visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Knowledge-based visual question answering requires pulling in external facts that go beyond what an image shows. Typical knowledge filters have two weaknesses: they ignore the question when encoding candidate sections, and they use similarity metrics to pick a section from a single article, which caps the usable information. The paper introduces a Question-Focused Filter that trains the encoder to emphasize the parts of the knowledge relevant to the given question, together with a Chunk-based Dynamic Cross-Article module that gathers useful segments from multiple articles. This combination keeps inference time close to that of the best shorter-context baseline while lifting final answer accuracy. The reported gains are 3.2 and 2.2 percentage points above prior best results on the Encyclopedic-VQA and InfoSeek benchmarks, respectively.

Core claim

The QKVQA method trains a Question-Focused Filter to re-encode candidate knowledge sections with explicit attention to the question and pairs it with a Chunk-based Dynamic Cross-Article Selection module that extracts and ranks useful chunks across multiple articles, producing higher-quality filtered knowledge than prior single-article or question-agnostic approaches.

What carries the argument

Question-Focused Filter (QFF), a trainable encoder that weights knowledge-section tokens according to their relevance to the input question, combined with the Chunk-based Dynamic Cross-Article Selection (CDA) module that dynamically assembles and ranks knowledge chunks from several articles.
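The idea of weighting knowledge-section tokens by their relevance to the question can be illustrated with a toy pooling step. This is a minimal sketch under stated assumptions, not the paper's QFF: the function name `question_focused_pool`, the single question vector, and the softmax over scaled dot products are all illustrative choices, and nothing here is trained.

```python
import numpy as np


def question_focused_pool(section_tokens: np.ndarray,
                          question_vec: np.ndarray,
                          temperature: float = 1.0):
    """Pool token embeddings with weights driven by similarity to the question.

    Hypothetical helper for illustration only.
    section_tokens: (n_tokens, dim) token embeddings of one knowledge section.
    question_vec:   (dim,) embedding of the question.
    Returns the pooled (dim,) vector and the (n_tokens,) weights.
    """
    dim = section_tokens.shape[1]
    # Scaled dot-product relevance of every token to the question.
    scores = section_tokens @ question_vec / (np.sqrt(dim) * temperature)
    # Numerically stable softmax turns relevance scores into weights.
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()
    # Question-relevant tokens dominate the pooled section representation.
    pooled = weights @ section_tokens
    return pooled, weights


# Toy data: the second token points in the same direction as the question.
tokens = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
question = np.array([0.0, 1.0])
pooled, weights = question_focused_pool(tokens, question)
print(weights.round(3), pooled.round(3))
```

In this toy setting the token most aligned with the question receives the largest weight, so the same section yields a different representation for a different question. A question-agnostic encoder would produce one fixed vector per section instead.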

If this is right

Inference time remains comparable to the best short-context baselines while using higher-quality knowledge.
High-quality filtered knowledge is obtained without lengthening the input context.
Answer accuracy rises 3.2 percentage points above prior best results on Encyclopedic-VQA.
Answer accuracy rises 2.2 percentage points above prior best results on InfoSeek.
The filtering works for questions that need information distributed across multiple articles.
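The cross-article behaviour described in the points above can be sketched as a global ranking over chunks from several articles under a fixed length budget. This is an illustrative sketch only: the word-overlap scorer stands in for a learned relevance model, and `Chunk`, `split_into_chunks`, `overlap_score`, `select_chunks`, the word budget, and the score threshold are hypothetical names and parameters, not taken from the paper's CDA module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    article: str
    text: str
    score: float  # relevance to the question, from any scorer


def split_into_chunks(text: str, max_words: int) -> list:
    """Split an article body into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def overlap_score(question: str, chunk_text: str) -> float:
    """Toy scorer (assumption): fraction of distinct question words found in the chunk."""
    q = set(question.lower().split())
    c = set(chunk_text.lower().split())
    return len(q & c) / len(q) if q else 0.0


def select_chunks(chunks: list, word_budget: int, min_score: float = 0.0) -> list:
    """Rank chunks from all articles together and keep the best ones that fit the budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score:
            break  # every remaining chunk scores lower, so stop here
        n_words = len(chunk.text.split())
        if used + n_words <= word_budget:
            selected.append(chunk)
            used += n_words
    return selected


articles = {
    "Article A": "the lighthouse was built in 1850 on the northern cape",
    "Article B": "the architect of the lighthouse was born in lisbon",
}
question = "where was the architect of the lighthouse born"
chunks = [
    Chunk(article=name, text=piece, score=overlap_score(question, piece))
    for name, body in articles.items()
    for piece in split_into_chunks(body, max_words=5)
]
selected = select_chunks(chunks, word_budget=20, min_score=0.25)
for c in selected:
    print(c.article, "|", c.text, "|", round(c.score, 3))
```

Because the ranking is global rather than per article, the selected context mixes chunks from both articles, and the threshold lets the number of kept chunks vary with the question while the budget keeps the context short.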

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same question-guided chunk selection could be applied to retrieval-augmented generation in text-only or multimodal settings where context must stay short.
Training the filter end-to-end on new knowledge corpora would allow quick adaptation to specialized domains such as medical or legal images.
If the CDA module scales to web-scale indexes, the approach could reduce reliance on curated single-article sources.

Load-bearing premise

The trainable QFF and CDA modules will reliably pick out relevant knowledge for varied questions and knowledge sources without injecting noise or dropping critical details when trained on the given datasets.

What would settle it

A new test set containing questions that require synthesizing facts from many conflicting or noisy articles would show the method falling below current state-of-the-art accuracy if the filters miss or distort key information.

Figures

Figures reproduced from arXiv: 2601.13856 by Jianjun Li, Longxiang Gao, Ruixuan Li, Rui Zhang, Wei Ye, Yixin Su, Yueguo Chen.

**Figure 2.** Concrete examples of the two types of errors.

**Figure 3.** The figure illustrates our complete pipeline, in which the question-focused filtering framework consists of components (2) and (3).

**Figure 4.** Qualitative comparison between OMGM and our proposed QKVQA method on E-VQA (top row) and InfoSeek (bottom row).

**Figure 5.** Additional qualitative results on image-question pairs from Encyclopedic-VQA, where we compare the answers provided by ...

**Figure 6.** Additional qualitative results on image-question pairs from InfoSeek, where we compare the answers provided by QKVQA with ...

Original abstract

Visual Question Answering (VQA) is the task of answering questions based on image content. Building upon this, Knowledge-Based VQA (KB-VQA) requires models to answer questions that depend on external knowledge beyond the visual content of an image. In such settings, effective knowledge filtering is essential for achieving high question answering accuracy. Typical filtering methods suffer from two issues: they fail to focus on parts relevant to the question during candidate section encoding, and they use similarity metrics to locate a section from a single article, resulting in information limitation. To address these issues, this paper proposes a question-focused, cross-article filtering method. Specifically, we design a trainable Question-Focused Filter (QFF) and a Chunk-based Dynamic Cross-Article Selection module (CDA). This approach maintains inference time comparable to the optimal method with the shorter context length, efficiently obtaining high-quality filtered knowledge. The accuracy outperforms current state-of-the-art methods by 3.2 and 2.2 percentage points on Encyclopedic-VQA and InfoSeek, respectively. The code is publicly available at: https://github.com/leaffeall/QKVQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QKVQA adds a trainable question-focused filter and cross-article chunk selector that fixes two concrete gaps in prior KB-VQA retrieval and reports small accuracy lifts on two benchmarks.

Hi, the main thing to know is that this paper fixes two practical problems in knowledge filtering for KB-VQA: existing methods ignore question focus when encoding candidates and they pull from only one article at a time. They introduce a trainable Question-Focused Filter (QFF) that learns to weight relevant parts and a Chunk-based Dynamic Cross-Article Selection (CDA) module that pulls useful chunks across articles while keeping context length short. Inference time stays comparable to the best prior short-context baselines, and they report gains of 3.2 points on Encyclopedic-VQA and 2.2 on InfoSeek over current SOTA. Code is released, which is useful for anyone reproducing or extending the work. The approach is a clear engineering step beyond the single-article similarity methods they cite, and the central claim holds up once you look at the modules and selection metrics. The gains are modest rather than dramatic, so the real test is whether the ablations show each piece is necessary and whether retrieval precision/recall actually improves without adding noise. No obvious circularity or fitting artifacts in the setup. This is mainly for people already working on knowledge-augmented visual QA or retrieval-augmented multimodal models. If that is your area, the filtering ideas are worth reading. I would send it to peer review; the contribution is focused and the results are concrete enough to get useful referee comments.

Referee Report

0 major / 3 minor

Summary. The paper proposes QKVQA for knowledge-based visual question answering, introducing a trainable Question-Focused Filter (QFF) module to better encode question-relevant content during candidate section processing and a Chunk-based Dynamic Cross-Article Selection (CDA) module to retrieve knowledge across multiple articles rather than relying on single-article similarity matching. The approach is presented as maintaining inference efficiency comparable to shorter-context baselines while delivering accuracy gains of 3.2 percentage points on Encyclopedic-VQA and 2.2 percentage points on InfoSeek over prior state-of-the-art methods; public code release is noted.

Significance. If the reported gains are supported by ablations confirming the independent contributions of QFF and CDA, along with improved knowledge-selection precision/recall metrics, the work offers a practical, trainable refinement to knowledge filtering pipelines in KB-VQA. The emphasis on cross-article coverage and question focus addresses documented limitations of prior similarity-based methods without substantially increasing inference cost, providing a useful engineering increment for systems that must handle encyclopedic or open-domain knowledge sources.

minor comments (3)

Abstract: the headline accuracy improvements are stated without reference to the specific baselines, number of runs, or error bars; adding one sentence summarizing the experimental protocol would strengthen the claim for readers who encounter only the abstract.
Section 4 (Experiments): confirm that ablation tables isolate the effect of removing QFF versus CDA individually and report knowledge-retrieval precision/recall alongside end-task accuracy so that the source of the 3.2 pp and 2.2 pp gains is transparent.
Figure 3 / pipeline diagram: ensure the diagram explicitly labels the trainable parameters of QFF and the chunk-selection logic of CDA to avoid ambiguity about which components are learned versus fixed.
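The measurements requested in these comments, knowledge-selection precision/recall and accuracy with error bars across runs, could be computed along the following lines. This is a hedged sketch: the helper names `precision_recall` and `summarize_runs` are hypothetical, and all chunk IDs and accuracy values are made-up placeholders, not results from the paper.

```python
from statistics import mean, stdev


def precision_recall(selected: set, relevant: set):
    """Precision and recall of the selected knowledge chunks against gold-relevant ones."""
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def summarize_runs(accuracies: list):
    """Mean and sample standard deviation of end-task accuracy over repeated runs."""
    spread = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean(accuracies), spread


# Placeholder inputs for illustration only (not taken from the paper).
selected_ids = {"c1", "c2", "c3", "c4"}
relevant_ids = {"c2", "c4", "c7"}
p, r = precision_recall(selected_ids, relevant_ids)

run_accuracies = [0.50, 0.52, 0.54]
m, s = summarize_runs(run_accuracies)
print(f"precision={p:.3f} recall={r:.3f} accuracy={m:.3f} +/- {s:.3f}")
```

Reporting these numbers for the full system and for variants with QFF or CDA removed would show whether an accuracy gain comes from better knowledge selection or from elsewhere in the pipeline.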

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and recommendation for minor revision. We appreciate the recognition that our QFF and CDA modules address documented limitations in prior similarity-based knowledge filtering for KB-VQA while preserving inference efficiency.

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces trainable QFF and CDA modules as an engineering solution to improve knowledge filtering in KB-VQA, with performance gains presented as empirical results on Encyclopedic-VQA and InfoSeek benchmarks. No equations, derivations, or load-bearing steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the method description and reported accuracy improvements stand as independent contributions verifiable via the public code repository. The derivation chain is self-contained against external benchmarks without internal reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, the approach rests on standard neural network training assumptions and dataset representativeness; no explicit free parameters, axioms, or invented entities beyond the named modules are detailed.

pith-pipeline@v0.9.0 · 5514 in / 1034 out tokens · 31563 ms · 2026-05-16T12:51:09.204430+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We design a trainable Question-Focused Filter (QFF) and a Chunk-based Dynamic Cross-Article Selection module (CDA)... contrastive learning... $\mathrm{sim}^{\mathrm{QFF}}_{i,j}$ = max cosine similarity"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "QFF is based on the Q-Former architecture... F_Queries via cross-attention"
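The quoted passages mention contrastive learning and a similarity defined as a maximum cosine similarity. A minimal sketch of how such a score and a standard InfoNCE-style contrastive loss fit together follows. It assumes the maximum is taken over all pairs of question-side and section-side vectors and that matched question/section pairs sit on the diagonal of the batch similarity matrix; the quoted text does not confirm either detail. The function names and the temperature value are illustrative.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def max_cosine_similarity(question_vecs: np.ndarray, section_vecs: np.ndarray) -> float:
    """Largest cosine similarity over all (question vector, section vector) pairs.

    question_vecs: (m, dim), section_vecs: (n, dim). The pairing is an assumption.
    """
    q = l2_normalize(question_vecs)
    s = l2_normalize(section_vecs)
    return float((q @ s.T).max())


def info_nce_loss(sim: np.ndarray, temperature: float = 0.07) -> float:
    """InfoNCE loss for a (B, B) similarity matrix with positives on the diagonal."""
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Toy batch: 3 questions and 3 sections, each described by 4 vectors of size 8.
rng = np.random.default_rng(0)
questions = rng.normal(size=(3, 4, 8))
sections = rng.normal(size=(3, 4, 8))
sim = np.array([[max_cosine_similarity(q, s) for s in sections] for q in questions])
print(sim.round(3), info_nce_loss(sim))
```

Minimizing this loss pushes each question's similarity to its matched section above its similarity to the other sections in the batch, which is the usual way a contrastive objective trains a relevance scorer.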

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

GPT-4 Technical Report

[Achiamet al., 2023 ] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Alt- man, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Flamingo: a Visual Language Model for Few-Shot Learning

[Alayracet al., 2022 ] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a Visual Language Model for Few-Shot Learning. InNeurIPS,

work page 2022

[3] [3]

Vqa: Visual question answering

[Antolet al., 2015 ] Stanislaw Antol, Aishwarya Agrawal, Ji- asen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zit- nick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on com- puter vision, pages 2425–2433,

work page 2015

[4] [4]

Qwen2.5-VL Technical Report

[Baiet al., 2025 ] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shi- jie Wang, Jun Tang, et al. Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Wiki-LLaV A: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

[Caffagniet al., 2024 ] Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Wiki-LLaV A: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs. InCVPR Workshops,

work page 2024

[6] [6]

Can Pre-trained Vision and Language Mod- els Answer Visual Information-Seeking Questions? In EMNLP,

[Chenet al., 2023 ] Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming- Wei Chang. Can Pre-trained Vision and Language Mod- els Answer Visual Information-Seeking Questions? In EMNLP,

work page 2023

[7] [7]

M3- embedding: Multi-linguality, multi-functionality, multi- granularity text embeddings through self-knowledge dis- tillation

[Chenet al., 2024 ] Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3- embedding: Multi-linguality, multi-functionality, multi- granularity text embeddings through self-knowledge dis- tillation. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Com- putational Linguistics: ACL 2024, pag...

work page 2024

[8] [8]

Reag: Reasoning-augmented generation for knowledge-based visual question answering.arXiv preprint arXiv:2511.22715,

[Compagnoniet al., 2025 ] Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, and Rita Cuc- chiara. Reag: Reasoning-augmented generation for knowledge-based visual question answering.arXiv preprint arXiv:2511.22715,

work page arXiv 2025

[9] [9]

Ge-chat: A graph en- hanced rag framework for evidential response generation of llms

[Daet al., 2025 ] Longchao Da, Parth Mitesh Shah, Kuan-Ru Liou, Jiaxing Zhang, and Hua Wei. Ge-chat: A graph en- hanced rag framework for evidential response generation of llms. InIJCAI,

work page 2025

[10] [10]

The Llama 3 Herd of Models

[Dubeyet al., 2024 ] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

[Honget al., 2025 ] Yuyang Hong, Jiaqi Gu, Qi Yang, Lubin Fan, Yue Wu, Ying Wang, Kun Ding, Shiming Xiang, and Jieping Ye. Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering. In NeurIPS,

work page 2025

[12] [12]

Words over pixels? rethinking vision in multimodal large language models

[Jainet al., 2025 ] Anubhooti Jain, Mayank Vatsa, and Richa Singh. Words over pixels? rethinking vision in multimodal large language models. InProceedings of the Thirty- Fourth International Joint Conference on Artificial In- telligence (IJCAI-25) Survey Track, pages 10481–10489,

work page 2025

[13] [13]

Colbert: Efficient and effective passage search via contextualized late interaction over bert

[Khattab and Zaharia, 2020] Omar Khattab and Matei Za- haria. Colbert: Efficient and effective passage search via contextualized late interaction over bert. InProceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48,

work page 2020

[Kim et al., 2025] Min Hyuk Kim, Changheon Kim, and Seok Bong Yoo. X-flora: Cross-modal federated learning with modality-expert lora for medical vqa. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8390–8408, 2025.

[Lewis et al., 2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In NeurIPS, 2020.

[Lin et al., 2024] Weizhe Lin, Jingbiao Mei, Jinghong Chen, and Bill Byrne. PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers. In ACL, 2024.

[Ling et al., 2025] Zihan Ling, Zhiyao Guo, Yixuan Huang, Yi An, Shuai Xiao, Jinsong Lan, Xiaoyong Zhu, and Bo Zheng. Mmkb-rag: A multi-modal knowledge-based retrieval-augmented generation framework. arXiv preprint arXiv:2504.10074, 2025.

[Liu et al., 2024] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved Baselines with Visual Instruction Tuning. In CVPR, 2024.

[Luo et al., 2025] Yang Luo, Qingni Shen, and Zhonghai Wu. Ma-rag: Automating role engineering for restful apis with multi-head attention and retrieval-augmented generation. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 7607–7615, 2025.

[Mai et al., 2025] Chengcheng Mai, Yuxiang Wang, Ziyu Gong, Hanxiang Wang, and Yihua Huang. Knowra: Knowledge retrieval augmented method for document-level relation extraction with comprehensive reasoning abilities. In IJCAI, 2025.

[Marino et al., 2019] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. In CVPR, 2019.

[Marino et al., 2021] Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, and Marcus Rohrbach. Krisp: Integrating implicit and symbolic knowledge for open-domain knowledge-based vqa. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14111–14121, 2021.

[Mensink et al., 2023] Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, and Vittorio Ferrari. Encyclopedic VQA: Visual Questions About Detailed Properties of Fine-Grained Categories. In ICCV, 2023.

[Methani et al., 2020] Nitesh Methani, Pritha Ganguly, Mitesh M Khapra, and Pratyush Kumar. Plotqa: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2020.

[Schwenk et al., 2022] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A Benchmark for Visual Question Answering Using World Knowledge. In ECCV, 2022.

[Shah et al., 2019] Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. KVQA: Knowledge-aware Visual Question Answering. In AAAI, 2019.

[Si et al., 2023] Qingyi Si, Yuchen Mo, Zheng Lin, Huishan Ji, and Weiping Wang. Combo of thinking and observing for outside-knowledge vqa. arXiv preprint arXiv:2305.06407, 2023.

[Sravanthi et al., 2025] Settaluri Lakshmi Sravanthi, Pulkit Agarwal, Debjyoti Mondal, Rituraj Singh, Subhadarshi Panda, Ankit Mishra, Kiran Pradeep, Srihari KB, Godawari Sudhakar Rao, and Pushpak Bhattacharyya. Rg-vqa: Leveraging retriever-generator pipelines for knowledge intensive visual question answering. In Findings of the Association for Comput...

[Sun et al., 2024] Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, and Xinlong Wang. EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters. arXiv preprint arXiv:2402.04252, 2024.

[Tian et al., 2025] Yang Tian, Fan Liu, Jingyuan Zhang, Yupeng Hu, Liqiang Nie, et al. Core-mmrag: Cross-source knowledge reconciliation for multimodal rag. arXiv preprint arXiv:2506.02544, 2025.

[Tschannen et al., 2025] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

[Xiao et al., 2025] Rui Xiao, Sanghwan Kim, Mariana-Iuliana Georgescu, Zeynep Akata, and Stephan Alaniz. Flair: Vlm with fine-grained language-informed image representations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24884–24894, 2025.

[Yan and Xie, 2024] Yibin Yan and Weidi Xie. EchoSight: Advancing Visual-Language Models with Wiki Knowledge. In EMNLP Findings, 2024.

[Yang et al., 2025a] Shuo Yang, Caren Han, Siwen Luo, and Eduard Hovy. Magic-vqa: Multimodal and grounded inference with commonsense knowledge for visual question answering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16967–16986, 2025.

[Yuan et al., 2025] Xu Yuan, Liangbo Ning, Wenqi Fan, and Qing Li. mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering. arXiv preprint arXiv:2508.05318, 2025.

[Zhang et al., 2019] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019.

[Zhang et al., 2024] Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Yuxuan Zhao, Zehua Xie, et al. mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA. arXiv preprint arXiv:2411.15041, 2024.

[Zhang et al., 2025] Rui Zhang, Chen Liu, Yixin Su, Ruixuan Li, Xuanjing Huang, Xuelong Li, and Philip S Yu. A comprehensive survey on multimodal rag: All combinations of modalities as input and output. Authorea Preprints, 2025.

[Zhu et al., 2025] Huanjia Zhu, Yishu Liu, Xiaozhao Fang, Guangming Lu, and Bingzhi Chen. Cause-effect driven optimization for robust medical visual question answering with language biases. In IJCAI, 2025.


Question: In which country or region does this animal live?
OMGM: Europe, parts of North and South America, South Africa, Australia, and New Zealand.
QKVQA: Mediterranean area

A Prompts Details in QKVQA

A.1 LLM for E-VQA

System Prompt

Answer the encyclopedic question about the given image. Don’t mention the visual content of image in your output. Directly output the answer of the question according to the context.

You are a helpful assistant for answering encyclopedi...
