arxiv: 2601.13856 · v3 · submitted 2026-01-20 · 💻 cs.IR

QKVQA: Question-Focused Filtering for Knowledge-based VQA

Pith reviewed 2026-05-16 12:51 UTC · model grok-4.3

classification 💻 cs.IR
keywords: knowledge-based VQA, question-focused filtering, visual question answering, knowledge filtering, cross-article selection, information retrieval, multimodal reasoning

The pith

A trainable question-focused filter and cross-article selector raise accuracy in knowledge-based visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Knowledge-based visual question answering requires pulling external facts that go beyond what an image shows. Typical knowledge filters ignore the question when encoding candidate sections and use similarity metrics to pick a section from a single article, which caps the usable information. The paper introduces a Question-Focused Filter (QFF) that trains the encoder to emphasize parts of the knowledge relevant to the given question, together with a Chunk-based Dynamic Cross-Article Selection (CDA) module that gathers useful segments from multiple sources. This combination keeps inference time close to that of the best shorter-context baseline while lifting final answer accuracy. The reported gains are 3.2 and 2.2 percentage points above prior best results on the Encyclopedic-VQA and InfoSeek benchmarks, respectively.

Core claim

The QKVQA method trains a Question-Focused Filter to re-encode candidate knowledge sections with explicit attention to the question and pairs it with a Chunk-based Dynamic Cross-Article Selection module that extracts and ranks useful chunks across multiple articles, producing higher-quality filtered knowledge than prior single-article or question-agnostic approaches.

What carries the argument

Question-Focused Filter (QFF), a trainable encoder that weights knowledge-section tokens according to their relevance to the input question, combined with the Chunk-based Dynamic Cross-Article Selection (CDA) module that dynamically assembles and ranks knowledge chunks from several articles.
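
To make the idea of question-weighted encoding concrete, here is a minimal sketch in plain Python. It is not the paper's QFF architecture, which this page does not specify. The dot-product relevance score, the `temperature` parameter, the function name `question_focused_pool`, and the toy 2-d vectors are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def question_focused_pool(question_vec, token_vecs, temperature=1.0):
    """Pool section-token vectors, weighting each token by question relevance.

    Assumption: relevance is a raw dot product. A trained filter would
    learn this scoring function instead.
    """
    weights = softmax([dot(question_vec, t) / temperature for t in token_vecs])
    dim = len(token_vecs[0])
    pooled = [sum(w * t[i] for w, t in zip(weights, token_vecs)) for i in range(dim)]
    return pooled, weights

# Toy embeddings, made up for illustration only.
question = [1.0, 0.0]
section_tokens = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
pooled, weights = question_focused_pool(question, section_tokens)
```

The first token points in the same direction as the question, so it receives the largest weight and dominates the pooled vector. A question-agnostic encoder would correspond to uniform weights.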

If this is right

  • Inference time remains comparable to the best short-context baselines while using higher-quality knowledge.
  • High-quality filtered knowledge is obtained without lengthening the input context.
  • Answer accuracy rises 3.2 percentage points above prior best results on Encyclopedic-VQA.
  • Answer accuracy rises 2.2 percentage points above prior best results on InfoSeek.
  • The filtering works for questions that need information distributed across multiple articles.
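
The combination of a fixed context length and multi-article coverage can be sketched as a budgeted selection over chunks pooled from all articles. This is an illustrative stand-in, not the paper's CDA module: the word-overlap score, the greedy budget rule, the names `chunk_words`, `overlap_score`, and `select_chunks`, and the two toy articles are all assumptions.

```python
def chunk_words(text, size):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def normalize(text):
    """Lowercase words and strip basic punctuation; returns a set."""
    return {w.strip(".,?!").lower() for w in text.split()}

def overlap_score(question, chunk):
    """Toy relevance: count of distinct question words found in the chunk."""
    return len(normalize(question) & normalize(chunk))

def select_chunks(question, articles, chunk_size=6, budget_words=12):
    """Rank chunks from all articles together, then keep the best within a word budget."""
    candidates = []
    for title, text in articles.items():
        for chunk in chunk_words(text, chunk_size):
            candidates.append((overlap_score(question, chunk), title, chunk))
    # Highest score first; the sort is stable, so ties keep article order.
    candidates.sort(key=lambda item: item[0], reverse=True)
    kept, used = [], 0
    for score, title, chunk in candidates:
        n_words = len(chunk.split())
        if score > 0 and used + n_words <= budget_words:
            kept.append((title, chunk))
            used += n_words
    return kept

# Toy articles with made-up facts, for illustration only.
articles = {
    "Tower": "The tower height is 90 metres. Visitors climb many stairs every day.",
    "Bridge": "Tourists love the river views here. The bridge length is 400 metres.",
}
picked = select_chunks("tower height and bridge length", articles)
```

Here the two relevant chunks come from different articles and together fit the 12-word budget, which a filter restricted to one article could not achieve without doubling the context.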

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same question-guided chunk selection could be applied to retrieval-augmented generation in text-only or multimodal settings where context must stay short.
  • Training the filter end-to-end on new knowledge corpora would allow quick adaptation to specialized domains such as medical or legal images.
  • If the CDA module scales to web-scale indexes, the approach could reduce reliance on curated single-article sources.

Load-bearing premise

The trainable QFF and CDA modules will reliably pick out relevant knowledge for varied questions and knowledge sources without injecting noise or dropping critical details when trained on the given datasets.

What would settle it

A new test set of questions that require synthesizing facts from many conflicting or noisy articles. If the filters miss or distort key information, the method should fall below current state-of-the-art accuracy on such a set.

Figures

Figures reproduced from arXiv: 2601.13856 by Jianjun Li, Longxiang Gao, Ruixuan Li, Rui Zhang, Wei Ye, Yixin Su, Yueguo Chen.

Figure 1: Comparison between typical filtering methods and our …
Figure 2: Concrete examples of the two types of errors.
Figure 3: The figure illustrates our complete pipeline, in which the question-focused filtering framework consists of components (2) and (3).
Figure 4: Qualitative comparison between OMGM and our proposed QKVQA method on E-VQA (top row) and InfoSeek (bottom row)
Figure 5: Additional qualitative results on image-question pairs from Encyclopedic-VQA, where we compare the answers provided by …
Figure 6: Additional qualitative results on image-question pairs from InfoSeek, where we compare the answers provided by QKVQA with …

Original abstract

Visual Question Answering (VQA) is the task of answering questions based on image content. Building upon this, Knowledge-Based VQA (KB-VQA) requires models to answer questions that depend on external knowledge beyond the visual content of an image. In such settings, effective knowledge filtering is essential for achieving high question answering accuracy. Typical filtering methods suffer from two issues: they fail to focus on parts relevant to the question during candidate section encoding, and they use similarity metrics to locate a section from a single article, resulting in information limitation. To address these issues, this paper proposes a question-focused, cross-article filtering method. Specifically, we design a trainable Question-Focused Filter (QFF) and a Chunk-based Dynamic Cross-Article Selection module (CDA). This approach maintains inference time comparable to the optimal method with the shorter context length, efficiently obtaining high-quality filtered knowledge. The accuracy outperforms current state-of-the-art methods by 3.2 and 2.2 percentage points on Encyclopedic-VQA and InfoSeek, respectively. The code is publicly available at: https://github.com/leaffeall/QKVQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes QKVQA for knowledge-based visual question answering, introducing a trainable Question-Focused Filter (QFF) module to better encode question-relevant content during candidate section processing and a Chunk-based Dynamic Cross-Article Selection (CDA) module to retrieve knowledge across multiple articles rather than relying on single-article similarity matching. The approach is presented as maintaining inference efficiency comparable to shorter-context baselines while delivering accuracy gains of 3.2 percentage points on Encyclopedic-VQA and 2.2 percentage points on InfoSeek over prior state-of-the-art methods; public code release is noted.

Significance. If the reported gains are supported by ablations confirming the independent contributions of QFF and CDA, along with improved knowledge-selection precision/recall metrics, the work offers a practical, trainable refinement to knowledge filtering pipelines in KB-VQA. The emphasis on cross-article coverage and question focus addresses documented limitations of prior similarity-based methods without substantially increasing inference cost, providing a useful engineering increment for systems that must handle encyclopedic or open-domain knowledge sources.

minor comments (3)
  1. Abstract: the headline accuracy improvements are stated without reference to the specific baselines, number of runs, or error bars; adding one sentence summarizing the experimental protocol would strengthen the claim for readers who encounter only the abstract.
  2. Section 4 (Experiments): confirm that ablation tables isolate the effect of removing QFF versus CDA individually and report knowledge-retrieval precision/recall alongside end-task accuracy so that the source of the 3.2 pp and 2.2 pp gains is transparent.
  3. Figure 3 (pipeline diagram): ensure the diagram explicitly labels the trainable parameters of QFF and the chunk-selection logic of CDA to avoid ambiguity about which components are learned versus fixed.
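
The knowledge-retrieval precision and recall requested in minor comment 2 could be computed along these lines. This is a sketch under assumptions: it presumes gold annotations of which chunks support each answer, and the function name and chunk identifiers are hypothetical, not taken from the paper or its repository.

```python
def selection_precision_recall(selected_ids, gold_ids):
    """Precision and recall of selected knowledge chunks against gold chunks."""
    selected, gold = set(selected_ids), set(gold_ids)
    hits = len(selected & gold)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical chunk identifiers of the form "article#chunk_index".
precision, recall = selection_precision_recall(
    selected_ids=["a1#2", "b4#0", "c7#1"],
    gold_ids=["a1#2", "b4#0", "b4#3", "d2#5"],
)
```

Reporting these two numbers per ablation (without QFF, without CDA) next to end-task accuracy would show whether an accuracy gain comes from better selection or from the answer generator.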

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and recommendation for minor revision. We appreciate the recognition that our QFF and CDA modules address documented limitations in prior similarity-based knowledge filtering for KB-VQA while preserving inference efficiency.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces trainable QFF and CDA modules as an engineering solution to improve knowledge filtering in KB-VQA, with performance gains presented as empirical results on Encyclopedic-VQA and InfoSeek benchmarks. No equations, derivations, or load-bearing steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the method description and reported accuracy improvements stand as independent contributions verifiable via the public code repository. The derivation chain is self-contained against external benchmarks without internal reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, the approach rests on standard neural network training assumptions and dataset representativeness; no explicit free parameters, axioms, or invented entities beyond the named modules are detailed.

pith-pipeline@v0.9.0 · 5514 in / 1034 out tokens · 31563 ms · 2026-05-16T12:51:09.204430+00:00 · methodology


