arxiv: 2508.05318 · v2 · submitted 2025-08-07 · 💻 cs.CV · cs.AI

mKG-RAG: Leveraging Multimodal Knowledge Graphs in Retrieval-Augmented Generation for Knowledge-intensive VQA

Pith reviewed 2026-05-19 00:00 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multimodal knowledge graphs · retrieval-augmented generation · knowledge-based VQA · graph extraction · multimodal retrieval · multimodal large language models

The pith

Multimodal knowledge graphs in RAG frameworks yield more accurate answers for knowledge-intensive visual question answering by structuring external information and reducing irrelevant content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes mKG-RAG, a retrieval-augmented generation framework that builds multimodal knowledge graphs from documents to support visual question answering tasks requiring external knowledge. Standard RAG approaches often retrieve unstructured data that introduces misleading elements and lowers answer quality. The method applies MLLM-driven extraction together with vision-text matching to create graphs containing semantically consistent entities and relations across modalities. A dual-stage retrieval process with a query-aware retriever then selects and refines relevant knowledge for the generation step. Experiments indicate this structured approach outperforms prior methods and reaches new state-of-the-art results on knowledge-based VQA benchmarks.

Core claim

mKG-RAG distills high-quality multimodal knowledge graphs from documents using MLLM-driven graph extraction and vision-text matching to produce semantically consistent and modality-complementary entities and relations. It then applies a dual-stage retrieval strategy with a query-aware multimodal retriever to improve efficiency and progressively increase precision. When these graphs replace unstructured documents inside the RAG pipeline for knowledge-intensive VQA, answer accuracy and reliability increase substantially, establishing new state-of-the-art performance.

What carries the argument

Multimodal knowledge graph constructed via MLLM-driven graph extraction and vision-text matching, which supplies structured, modality-complementary knowledge representations that support dual-stage retrieval and generation.
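
As a rough illustration of what such a structured representation might look like, here is a minimal Python sketch. The class and field names (`Entity`, `Relation`, `MultimodalKG`, `image_region`) are hypothetical and are not taken from the paper or its code release, and the sample data is a toy example. The sketch only shows entities that carry a textual description plus an optional visual grounding, connected by described relations.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Entity:
    name: str
    description: str
    # Hypothetical visual grounding: (image_id, x0, y0, x1, y1); None for text-only entities.
    image_region: Optional[tuple] = None


@dataclass(frozen=True)
class Relation:
    head: str
    tail: str
    description: str


@dataclass
class MultimodalKG:
    entities: dict = field(default_factory=dict)   # name -> Entity
    relations: list = field(default_factory=list)  # list of Relation

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.name] = entity

    def add_relation(self, relation: Relation) -> None:
        # Only accept relations whose endpoints are already known entities.
        if relation.head not in self.entities or relation.tail not in self.entities:
            raise KeyError("both endpoints must be added before the relation")
        self.relations.append(relation)

    def neighbors(self, name: str) -> set:
        # Relations are treated as undirected for neighborhood lookups.
        out = set()
        for r in self.relations:
            if r.head == name:
                out.add(r.tail)
            elif r.tail == name:
                out.add(r.head)
        return out


# Toy data for illustration only.
kg = MultimodalKG()
kg.add_entity(Entity("Eiffel Tower", "Wrought-iron tower in Paris", ("img_0", 10, 20, 200, 400)))
kg.add_entity(Entity("Gustave Eiffel", "Engineer whose company built the tower"))
kg.add_relation(Relation("Gustave Eiffel", "Eiffel Tower", "company designed and built"))
```

Keeping the visual grounding optional lets text-only and image-grounded entities live in the same graph, which is one plausible reading of "modality-complementary".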

If this is right

  • Structured relations among knowledge elements reduce the retrieval of irrelevant or misleading content.
  • Modality-complementary entities and relations improve the reliability of answers generated by the underlying MLLM.
  • Dual-stage retrieval progressively refines precision while preserving computational efficiency.
  • The overall framework produces new state-of-the-art results on standard knowledge-based VQA benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-construction pipeline could be tested on other multimodal reasoning tasks that currently rely on unstructured retrieval.
  • Error patterns in the extracted graphs might reveal which types of queries remain hardest for structured RAG approaches.
  • Replacing unstructured sources with these KGs could lower hallucination rates by anchoring generations to explicit cross-modal relations.

Load-bearing premise

MLLM-driven graph extraction combined with vision-text matching can reliably produce semantically consistent and modality-complementary entities and relations from multimodal documents without substantial errors or noise that would degrade downstream retrieval and generation.

What would settle it

An experiment in which the constructed multimodal KGs contain frequent inconsistencies or noise that cause lower VQA accuracy than equivalent unstructured-document RAG baselines would disprove the central benefit.

Figures

Figures reproduced from arXiv: 2508.05318 by Liangbo Ning, Qing Li, Qingqing Ye, Wenqi Fan, Xu Yuan.

Figure 1: Illustration of issues in knowledge-based VQA.
Figure 2: An overview of our mKG-RAG consists of a multimodal knowledge graph construction
Figure 3: Architecture Design of Question-aware Multimodal Retriever.
Excerpt extracted alongside Figure 3: Graph-based Retrieval. Previous methods retrieve text chunks directly from candidate documents [53], which often introduces contextual noise and impairs reasoning performance. In contrast, our approach performs graph-based retrieval to identify query-relevant entities and relationships. These entities and relationships serve as distilled knowled…
Excerpt referencing Figure 2: Combining $K_g$ matched entities or relationships, we get a relevant subgraph $G_r^0$. However, similarity-based retrieval alone may yield incomplete information, potentially omitting critical evidence to answer the question entirely. To this end, we leverage the inherent structural properties of the graph and expand $G_r$ by incorporating information from its l-hop neighbors, i.e., $G_r^l$ = GraphTraversal($G_m$, G…
Figure 4: Qualitative results of Qwen2-VL-7B, GPT-4o and mKG-RAG on E-VQA dataset.
Figure 5: The prompt used to match visual and textual entities/relationships
Figure 6: A high-quality vision-text matching example for In-context Learning
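
The excerpt above mentions expanding the retrieved subgraph with its l-hop neighbors. A minimal sketch of that kind of expansion, assuming the graph is available as a plain adjacency dictionary, is a breadth-first search with a depth cap. The function name `expand_l_hop` and the toy graph are hypothetical; the paper's own traversal routine is truncated in the excerpt and is not reproduced here.

```python
from collections import deque


def expand_l_hop(adjacency, seeds, l):
    """Return all nodes within l hops of any seed node (seeds included)."""
    visited = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == l:
            continue  # do not expand beyond l hops
        for nbr in adjacency.get(node, ()):
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, depth + 1))
    return visited


# Toy chain graph A - B - C - D.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
one_hop = expand_l_hop(adjacency, {"A"}, 1)
two_hop = expand_l_hop(adjacency, {"A"}, 2)
```

With `l = 0` the function returns the seed set unchanged, which corresponds to using the similarity-matched subgraph without any expansion.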
Original abstract

Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for expanding the knowledge capacity of Multimodal Large Language Models (MLLMs) by incorporating external knowledge sources into the generation process, and has been widely adopted for knowledge-based Visual Question Answering (VQA). Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relations among knowledge elements frequently introduce irrelevant or misleading content, degrading answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks, thereby enhancing generation through structured multimodal knowledge. To this end, this paper proposes mKG-RAG, a novel retrieval-augmented generation framework built upon multimodal KGs for knowledge-intensive VQA tasks. Specifically, mKG-RAG leverages MLLM-driven graph extraction and vision-text matching to distill semantically consistent, modality-complementary entities and relations from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. Furthermore, a dual-stage retrieval strategy equipped with a query-aware multimodal retriever is introduced to improve retrieval efficiency while progressively refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing approaches and sets new state-of-the-art results for knowledge-based VQA. The code is available at https://github.com/xandery-geek/mKG-RAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes mKG-RAG, a novel RAG framework for knowledge-intensive VQA that constructs multimodal knowledge graphs via MLLM-driven graph extraction and vision-text matching to obtain semantically consistent, modality-complementary entities and relations from documents; it further introduces a dual-stage retrieval strategy with a query-aware multimodal retriever and reports that comprehensive experiments show significant outperformance over prior methods with new state-of-the-art results. Code is released.

Significance. If the results are robust, the work would demonstrate a concrete benefit of structured multimodal knowledge over unstructured documents in RAG pipelines for VQA, addressing a recognized limitation in current approaches. The public code release aids reproducibility.

major comments (1)
  1. [§3] §3 (Graph Construction / Pipeline): The central claim that mKG-RAG sets new SOTA rests on the untested assumption that MLLM-driven extraction plus vision-text matching yields high-quality, low-noise multimodal KGs. No quantitative extraction metrics (entity/relation F1 vs. human gold, error rates, or consistency scores) or error-propagation analysis are reported, leaving open the possibility that downstream gains are artifacts of prompting rather than the KG structure itself.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'comprehensive experiments' would benefit from a parenthetical listing of the primary datasets and main baselines to give readers immediate context.
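
The extraction-level metrics requested in the major comment are straightforward to compute once a human-annotated gold set exists. The following Python sketch assumes exact-match scoring over (head, relation, tail) triples; the function name, the triple format, and the toy data are illustrative assumptions, and a real evaluation would likely need a softer matching rule for paraphrased entities.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 between two collections of triples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: one correct triple, one wrong triple, one missed triple.
gold = {("Eiffel Tower", "located_in", "Paris"), ("Eiffel Tower", "built_by", "Gustave Eiffel")}
pred = {("Eiffel Tower", "located_in", "Paris"), ("Eiffel Tower", "located_in", "Lyon")}
p, r, f1 = precision_recall_f1(pred, gold)
```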

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on the graph construction evaluation below and describe the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Graph Construction / Pipeline): The central claim that mKG-RAG sets new SOTA rests on the untested assumption that MLLM-driven extraction plus vision-text matching yields high-quality, low-noise multimodal KGs. No quantitative extraction metrics (entity/relation F1 vs. human gold, error rates, or consistency scores) or error-propagation analysis are reported, leaving open the possibility that downstream gains are artifacts of prompting rather than the KG structure itself.

    Authors: We agree that direct quantitative evaluation of the multimodal KG extraction quality would provide stronger support for our claims and help rule out prompting artifacts. The current manuscript emphasizes end-to-end VQA results and ablations that isolate the contribution of the structured KG (e.g., comparisons against unstructured document retrieval), but does not report extraction-level metrics such as entity/relation F1 against human gold standards or explicit error-propagation studies. In the revised version we will add a dedicated analysis in Section 3 (or a new subsection in the experiments) that reports precision, recall, and F1 scores for entities and relations on a randomly sampled subset of documents annotated by human experts. We will also include an error-propagation study that measures VQA performance degradation when controlled extraction noise is introduced versus when the KG is manually corrected. These additions will clarify that performance gains derive from the modality-complementary structure obtained via MLLM-driven extraction and vision-text matching rather than from prompting alone.

    revision: yes
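
One way the controlled-noise part of such an error-propagation study could be set up is sketched below in Python. This is an assumption, not the authors' protocol: the function `corrupt_triples`, the tail-replacement noise model, and the toy triples are all hypothetical. A fixed fraction of triples gets its tail entity swapped for a different one, with a seeded generator so that runs are repeatable.

```python
import random


def corrupt_triples(triples, noise_rate, entities, seed=0):
    """Replace the tail of a random fraction of triples with a different entity."""
    rng = random.Random(seed)
    triples = list(triples)
    n_corrupt = round(noise_rate * len(triples))
    corrupted = list(triples)
    for idx in rng.sample(range(len(triples)), n_corrupt):
        head, rel, tail = triples[idx]
        # Requires at least two distinct entities so a different tail exists.
        choices = [e for e in entities if e != tail]
        corrupted[idx] = (head, rel, rng.choice(choices))
    return corrupted


# Toy graph with four triples; half of them are corrupted.
entities = ["Paris", "Lyon", "Rome", "Berlin"]
clean = [
    ("Eiffel Tower", "located_in", "Paris"),
    ("Colosseum", "located_in", "Rome"),
    ("Brandenburg Gate", "located_in", "Berlin"),
    ("Louvre", "located_in", "Paris"),
]
noisy = corrupt_triples(clean, noise_rate=0.5, entities=entities)
changed = sum(a != b for a, b in zip(clean, noisy))
```

Running the downstream VQA pipeline on graphs corrupted at several noise rates would then trace how accuracy degrades; that evaluation loop is omitted because it depends on the model and benchmark.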

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes mKG-RAG as a framework that constructs multimodal KGs via MLLM-driven extraction plus vision-text matching, then applies dual-stage retrieval for knowledge-based VQA. No equations, fitted parameters, or self-referential definitions appear in the provided abstract or pipeline description that would reduce the reported SOTA performance gains to quantities defined by the inputs themselves. The components are presented as independently motivated constructions evaluated on external benchmarks, satisfying the criteria for a self-contained derivation with no load-bearing reductions to self-citation chains or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that high-quality multimodal KGs can be automatically extracted and that dual-stage retrieval will progressively improve precision; no free parameters or invented entities with external falsifiable handles are described in the abstract.

axioms (1)
  • domain assumption: Multimodal documents contain extractable entities and relations that MLLMs plus vision-text matching can distill into semantically consistent, modality-complementary multimodal KGs.
    Invoked when describing construction of high-quality multimodal KGs as structured knowledge representations.

pith-pipeline@v0.9.0 · 5788 in / 1293 out tokens · 54945 ms · 2026-05-19T00:00:44.859637+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

    cs.CV 2026-02 conditional novelty 7.0

    SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.

  2. MG$^2$-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 6.0

    MG²-RAG proposes a multi-granularity graph RAG framework that constructs hierarchical multimodal nodes via entity-driven visual grounding and performs structured retrieval, delivering SOTA results on four multimodal t...

  3. QKVQA: Question-Focused Filtering for Knowledge-based VQA

    cs.IR 2026-01 unverdicted novelty 6.0

    QKVQA proposes a question-focused filtering method with QFF and CDA modules that boosts accuracy by 3.2 points on Encyclopedic-VQA and 2.2 points on InfoSeek over prior state-of-the-art.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 3 Pith papers · 8 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning // Advances in neural information processing systems

    Alayrac Jean-Baptiste, Donahue Jeff, Luc Pauline, Miech Antoine, Barr Iain, Hasson Yana, Lenc Karel, Mensch Arthur, Millican Katherine, Reynolds Malcolm, others. Flamingo: a visual language model for few-shot learning // Advances in neural information processing systems

  2. [2]

    Vqa: Visual question answering // Proceedings of the IEEE international conference on computer vision

    Antol Stanislaw, Agrawal Aishwarya, Lu Jiasen, Mitchell Margaret, Batra Dhruv, Zitnick C Lawrence, Parikh Devi. Vqa: Visual question answering // Proceedings of the IEEE international conference on computer vision. 2015. 2425–2433

  3. [3]

    Qwen2.5-VL Technical Report

    Bai Shuai, Chen Keqin, Liu Xuejing, Wang Jialin, Ge Wenbin, Song Sibo, Dang Kai, Wang Peng, Wang Shijie, Tang Jun, Zhong Humen, Zhu Yuanzhi, Yang Mingkun, Li Zhaohai, Wan Jianqiang, Wang Pengfei, Ding Wei, Fu Zheren, Xu Yiheng, Ye Jiabo, Zhang Xi, Xie Tianbao, Cheng Zesen, Zhang Hang, Yang Zhibo, Xu Haiyang, Lin Junyang. Qwen2.5-VL Technical Report. 2025

  4. [4]

    Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Caffagni Davide, Cocchi Federico, Moratelli Nicholas, Sarto Sara, Cornia Marcella, Baraldi Lorenzo, Cucchiara Rita. Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. 1818–1826

  5. [5]

    A simple framework for contrastive learning of visual representations // International conference on machine learning

    Chen Ting, Kornblith Simon, Norouzi Mohammad, Hinton Geoffrey. A simple framework for contrastive learning of visual representations // International conference on machine learning

  6. [6]

    Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

    Chen Yang, Hu Hexiang, Luan Yi, Sun Haitian, Changpinyo Soravit, Ritter Alan, Chang Ming-Wei. Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. 14948–14968

  7. [7]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chen Zhe, Wu Jiannan, Wang Wenhai, Su Weijie, Chen Guo, Xing Sen, Zhong Muyan, Zhang Qinglong, Zhu Xizhou, Lu Lewei, others. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024. 24185–24198

  8. [8]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality // See https://vicuna

    Chiang Wei-Lin, Li Zhuohan, Lin Ziqing, Sheng Ying, Wu Zhanghao, Zhang Hao, Zheng Lianmin, Zhuang Siyuan, Zhuang Yonghao, Gonzalez Joseph E, others. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality // See https://vicuna.lmsys.org (accessed 14 April 2023). 2023. 2, 3. 6

  9. [9]

    Scaling instruction-finetuned language models // Journal of Machine Learning Research

    Chung Hyung Won, Hou Le, Longpre Shayne, Zoph Barret, Tay Yi, Fedus William, Li Yunxuan, Wang Xuezhi, Dehghani Mostafa, Brahma Siddhartha, others. Scaling instruction-finetuned language models // Journal of Machine Learning Research. 2024. 25, 70. 1–53

  10. [10]

    LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning // arXiv preprint arXiv:2503.15621

    Cocchi Federico, Moratelli Nicholas, Caffagni Davide, Sarto Sara, Baraldi Lorenzo, Cornia Marcella, Cucchiara Rita. LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning // arXiv preprint arXiv:2503.15621. 2025

  11. [11]

    Cocchi Federico, Moratelli Nicholas, Cornia Marcella, Baraldi Lorenzo, Cucchiara Rita. Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2025

  12. [12]

    A survey on multimodal large language models for autonomous driving // Proceedings of the IEEE/CVF winter conference on applications of computer vision

    Cui Can, Ma Yunsheng, Cao Xu, Ye Wenqian, Zhou Yang, Liang Kaizhao, Chen Jintai, Lu Juanwu, Yang Zichong, Liao Kuei-Da, others. A survey on multimodal large language models for autonomous driving // Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2024. 958–979

  13. [13]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning // Thirty-seventh Conference on Neural Information Processing Systems

    Dai Wenliang, Li Junnan, Li Dongxu, Tiong Anthony, Zhao Junqi, Wang Weisheng, Li Boyang, Fung Pascale, Hoi Steven. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning // Thirty-seventh Conference on Neural Information Processing Systems. 2023. 11

  14. [14]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale // International Conference on Learning Representations

    Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, Uszkoreit Jakob, Houlsby Neil. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale // International Conference on Learning Representations. 2021

  15. [15]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Edge Darren, Trinh Ha, Cheng Newman, Bradley Joshua, Chao Alex, Mody Apurva, Truitt Steven, Metropolitansky Dasha, Ness Robert Osazuwa, Larson Jonathan. From local to global: A graph rag approach to query-focused summarization // arXiv preprint arXiv:2404.16130. 2024

  16. [16]

    A survey on rag meeting llms: Towards retrieval-augmented large language models // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

    Fan Wenqi, Ding Yujuan, Ning Liangbo, Wang Shijie, Li Hengyun, Yin Dawei, Chua Tat-Seng, Li Qing. A survey on rag meeting llms: Towards retrieval-augmented large language models // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024. 6491–6501

  17. [17]

    Graph machine learning in the era of large language models (llms) // arXiv preprint arXiv:2404.14928

    Fan Wenqi, Wang Shijie, Huang Jiani, Chen Zhikai, Song Yu, Tang Wenzhuo, Mao Haitao, Liu Hui, Liu Xiaorui, Yin Dawei, others. Graph machine learning in the era of large language models (llms) // arXiv preprint arXiv:2404.14928. 2024

  18. [18]

    Computational Protein Science in the Era of Large Language Models (LLMs) // arXiv preprint arXiv:2501.10282

    Fan Wenqi, Zhou Yi, Wang Shijie, Yan Yuyao, Liu Hui, Zhao Qian, Song Le, Li Qing. Computational Protein Science in the Era of Large Language Models (LLMs) // arXiv preprint arXiv:2501.10282. 2025

  19. [19]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering // Proceedings of the IEEE conference on computer vision and pattern recognition

    Goyal Yash, Khot Tejas, Summers-Stay Douglas, Batra Dhruv, Parikh Devi. Making the v in vqa matter: Elevating the role of image understanding in visual question answering // Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. 6904–6913

  20. [20]

    LightRAG: Simple and Fast Retrieval-Augmented Generation

    Guo Zirui, Xia Lianghao, Yu Yanhua, Ao Tu, Huang Chao. LightRAG: Simple and Fast Retrieval-Augmented Generation // arXiv preprint arXiv:2410.05779. 2024

  21. [21]

    Momentum contrast for unsupervised visual representation learning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    He Kaiming, Fan Haoqi, Wu Yuxin, Xie Saining, Girshick Ross. Momentum contrast for unsupervised visual representation learning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020. 9729–9738

  22. [22]

    G-retriever: Retrieval-augmented generation for textual graph understanding and question answering // Advances in Neural Information Processing Systems

    He Xiaoxin, Tian Yijun, Sun Yifei, Chawla Nitesh, Laurent Thomas, LeCun Yann, Bresson Xavier, Hooi Bryan. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering // Advances in Neural Information Processing Systems. 37. 2024. 132876–132907

  23. [23]

    Knowledge graphs // ACM Computing Surveys (Csur)

    Hogan Aidan, Blomqvist Eva, Cochez Michael, d’Amato Claudia, Melo Gerard De, Gutierrez Claudio, Kirrane Sabrina, Gayo José Emilio Labra, Navigli Roberto, Neumaier Sebastian, others. Knowledge graphs // ACM Computing Surveys (Csur). 2021. 54, 4. 1–37

  24. [24]

    Lora: Low-rank adaptation of large language models

    Hu Edward J, Shen Yelong, Wallis Phillip, Allen-Zhu Zeyuan, Li Yuanzhi, Wang Shean, Wang Lu, Chen Weizhu, others. Lora: Low-rank adaptation of large language models. // ICLR. 2022. 1, 2. 3

  25. [25]

    Egtr: Extracting graph from transformer for scene graph generation // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Im Jinbae, Nam JeongYeon, Park Nokyung, Lee Hyungmin, Park Seunghyun. Egtr: Extracting graph from transformer for scene graph generation // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. 24229–24238

  26. [26]

    Billion-scale similarity search with GPUs // IEEE Transactions on Big Data

    Johnson Jeff, Douze Matthijs, Jégou Hervé. Billion-scale similarity search with GPUs // IEEE Transactions on Big Data. 2019. 7, 3. 535–547

  27. [27]

    Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning // The Twelfth International Conference on Learning Representations

    Luo Linhao, Li Yuan-Fang, Haffari Gholamreza, Pan Shirui. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning // The Twelfth International Conference on Learning Representations. 2024

  28. [28]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models // International conference on machine learning

    Li Junnan, Li Dongxu, Savarese Silvio, Hoi Steven. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models // International conference on machine learning. 2023. 19730–19742. 12

  29. [29]

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning // Advances in Neural Information Processing Systems

    Liang Victor Weixin, Zhang Yuhui, Kwon Yongchan, Yeung Serena, Zou James Y. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning // Advances in Neural Information Processing Systems. 2022. 35. 17612–17625

  30. [30]

    Vila: On pre-training for visual language models // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Lin Ji, Yin Hongxu, Ping Wei, Molchanov Pavlo, Shoeybi Mohammad, Han Song. Vila: On pre-training for visual language models // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024. 26689–26699

  31. [31]

    Retrieval Augmented Visual Question Answering with Outside Knowledge // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

    Lin Weizhe, Byrne Bill. Retrieval Augmented Visual Question Answering with Outside Knowledge // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. 11238–11254

  32. [32]

    Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering // Advances in Neural Information Processing Systems

    Lin Weizhe, Chen Jinghong, Mei Jingbiao, Coca Alexandru, Byrne Bill. Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering // Advances in Neural Information Processing Systems. 36. 2023. 22820–22840

  33. [33]

    Medical visual question answering: A survey // Artificial Intelligence in Medicine

    Lin Zhihong, Zhang Donghao, Tao Qingyi, Shi Danli, Haffari Gholamreza, Wu Qi, He Mingguang, Ge Zongyuan. Medical visual question answering: A survey // Artificial Intelligence in Medicine. 2023. 143. 102611

  34. [34]

    Improved baselines with visual instruction tuning // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Liu Haotian, Li Chunyuan, Li Yuheng, Lee Yong Jae. Improved baselines with visual instruction tuning // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  35. [35]

    Visual instruction tuning // Advances in neural information processing systems

    Liu Haotian, Li Chunyuan, Wu Qingyang, Lee Yong Jae. Visual instruction tuning // Advances in neural information processing systems. 36. 2023. 34892–34916

  36. [36]

    MMKG: multi-modal knowledge graphs // The semantic web: 16th international conference, ESWC 2019, portorož, Slovenia, June 2–6, 2019, proceedings 16

    Liu Ye, Li Hui, Garcia-Duran Alberto, Niepert Mathias, Onoro-Rubio Daniel, Rosenblum David S. MMKG: multi-modal knowledge graphs // The semantic web: 16th international conference, ESWC 2019, portorož, Slovenia, June 2–6, 2019, proceedings 16. 2019. 459–474

  37. [37]

    HyperGraphRAG: Retrieval-Augmented Generation with Hypergraph-Structured Knowledge Representation // arXiv preprint arXiv:2503.21322

    Luo Haoran, Chen Guanting, Zheng Yandan, Wu Xiaobao, Guo Yikai, Lin Qika, Feng Yu, Kuang Zemin, Song Meina, Zhu Yifan, others. HyperGraphRAG: Retrieval-Augmented Generation with Hypergraph-Structured Knowledge Representation // arXiv preprint arXiv:2503.21322. 2025

  38. [38]

    Ma Shengjie, Xu Chengjin, Jiang Xuhui, Li Muzhi, Qu Huaren, Yang Cehao, Mao Jiaxin, Guo Jian. Think-on-Graph 2.0: Deep and Faithful Large Language Model Reasoning with Knowledge-guided Retrieval Augmented Generation // The Thirteenth International Conference on Learning Representations. 2025

  39. [39]

    Ok-vqa: A visual question answering benchmark requiring external knowledge // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Marino Kenneth, Rastegari Mohammad, Farhadi Ali, Mottaghi Roozbeh. Ok-vqa: A visual question answering benchmark requiring external knowledge // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. 3195–3204

  40. [40]

    Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories // Proceedings of the IEEE/CVF International Conference on Computer Vision

    Mensink Thomas, Uijlings Jasper, Castrejon Lluis, Goel Arushi, Cadar Felipe, Zhou Howard, Sha Fei, Araujo André, Ferrari Vittorio. Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories // Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. 3113–3124

  41. [41]

    Towards trustworthy retrieval augmented generation for large language models: A survey

    Ni Bo, Liu Zheyuan, Wang Leyao, Lei Yongjia, Zhao Yuying, Cheng Xueqi, Zeng Qingkai, Dong Luna, Xia Yinglong, Kenthapadi Krishnaram, others. Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey // arXiv preprint arXiv:2502.06872. 2025

  42. [42]

    Cheatagent: Attacking llm-empowered recommender systems via llm agent // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

    Ning Liang-bo, Wang Shijie, Fan Wenqi, Li Qing, Xu Xin, Chen Hao, Huang Feiran. Cheatagent: Attacking llm-empowered recommender systems via llm agent // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024. 2284–2295

  43. [43]

    Nomic Embed: Training a Reproducible Long Context Text Embedder // Transactions on Machine Learning Research

    Nussbaum Zach, Morris John Xavier, Mulyar Andriy, Duderstadt Brandon. Nomic Embed: Training a Reproducible Long Context Text Embedder // Transactions on Machine Learning Research. 2025. 13

  44. [44]

    RoRA-VLM: Robust retrieval-augmented vision language models

    Qi Jingyuan, Xu Zhiyang, Shao Rulin, Chen Yang, Di Jin, Cheng Yu, Wang Qifan, Huang Lifu. RoRA-VLM: Robust Retrieval-Augmented Vision Language Models // arXiv preprint arXiv:2410.08876. 2024

  45. [45]

    A Survey of Mamba

    Qu Haohao, Ning Liangbo, An Rui, Fan Wenqi, Derr Tyler, Liu Hui, Xu Xin, Li Qing. A survey of mamba // arXiv preprint arXiv:2408.01129. 2024

  46. [46]

    Learning transferable visual models from natural language supervision // International conference on machine learning

    Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, others. Learning transferable visual models from natural language supervision // International conference on machine learning

  47. [47]

    Faster r-cnn: Towards real-time object detection with region proposal networks // Advances in neural information processing systems

    Ren Shaoqing, He Kaiming, Girshick Ross, Sun Jian. Faster r-cnn: Towards real-time object detection with region proposal networks // Advances in neural information processing systems. 2015

  48. [48]

    A-okvqa: A benchmark for visual question answering using world knowledge // European conference on computer vision

    Schwenk Dustin, Khandelwal Apoorv, Clark Christopher, Marino Kenneth, Mottaghi Roozbeh. A-okvqa: A benchmark for visual question answering using world knowledge // European conference on computer vision. 2022. 146–162

  49. [49]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron Hugo, Lavril Thibaut, Izacard Gautier, Martinet Xavier, Lachaux Marie-Anne, Lacroix Timothée, Rozière Baptiste, Goyal Naman, Hambro Eric, Azhar Faisal, others. Llama: Open and efficient foundation language models // arXiv preprint arXiv:2302.13971. 2023

  50. [50]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang P, Bai S, Tan S, Wang S, Fan Z, Bai J, Chen K, Liu X, Wang J, Ge W, others. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution // URL https://arxiv.org/abs/2409.12191. 2024

  51. [51]

    Knowledge graph retrieval-augmented generation for llm-based recommendation

    Wang Shijie, Fan Wenqi, Feng Yue, Ma Xinyu, Wang Shuaiqiang, Yin Dawei. Knowledge Graph Retrieval-Augmented Generation for LLM-based Recommendation // arXiv preprint arXiv:2501.02226. 2025

  52. [52]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Wu Zhiyu, Chen Xiaokang, Pan Zizheng, Liu Xingchao, Liu Wen, Dai Damai, Gao Huazuo, Ma Yiyang, Wu Chengyue, Wang Bingxuan, others. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding // arXiv preprint arXiv:2412.10302. 2024

  53. [53]

    EchoSight: Advancing Visual-Language Models with Wiki Knowledge // Findings of the Association for Computational Linguistics: EMNLP 2024

    Yan Yibin, Xie Weidi. EchoSight: Advancing Visual-Language Models with Wiki Knowledge // Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. 1538–1551

  54. [54]

    Instruction-guided multi-granularity segmentation and captioning with large multimodal model // Proceedings of the AAAI Conference on Artificial Intelligence

    Yuan Xu, Zhou Li, Sun Zenghui, Zhou Zikun, Lan Jingsong. Instruction-guided multi-granularity segmentation and captioning with large multimodal model // Proceedings of the AAAI Conference on Artificial Intelligence. 2025

  55. [55]

    MM-LLMs: Recent Advances in MultiModal Large Language Models // Findings of the Association for Computational Linguistics: ACL 2024

    Zhang Duzhen, Yu Yahan, Dong Jiahua, Li Chenxing, Su Dan, Chu Chenhui, Yu Dong. MM-LLMs: Recent Advances in MultiModal Large Language Models // Findings of the Association for Computational Linguistics: ACL 2024. 2024. 12401–12430

  56. [56]

    mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA // arXiv preprint arXiv:2411.15041

    Zhang Tao, Zhang Ziqi, Ma Zongyang, Chen Yuxin, Qi Zhongang, Yuan Chunfeng, Li Bing, Pu Junfu, Zhao Yuxuan, Xie Zehua, others. mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA // arXiv preprint arXiv:2411.15041. 2024

  57. [57]

    Recommender systems in the era of large language models (llms) // IEEE Transactions on Knowledge and Data Engineering

    Zhao Zihuai, Fan Wenqi, Li Jiatong, Liu Yunqing, Mei Xiaowei, Wang Yiqi, Wen Zhen, Wang Fei, Zhao Xiangyu, Tang Jiliang, others. Recommender systems in the era of large language models (llms) // IEEE Transactions on Knowledge and Data Engineering. 2024

  58. [58]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhu Jinguo, Wang Weiyun, Chen Zhe, Liu Zhaoyang, Ye Shenglong, Gu Lixin, Tian Hao, Duan Yuchen, Su Weijie, Shao Jie, others. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models // arXiv preprint arXiv:2504.10479. 2025

  59. [59]

    Knowledge graph-guided retrieval augmented generation

    Zhu Xiangrong, Xie Yuexiang, Liu Yi, Li Yaliang, Hu Wei. Knowledge Graph-Guided Retrieval Augmented Generation // arXiv preprint arXiv:2502.06864. 2025

A. Prompt Design

In our multimodal knowledge graph construction pipeline, we utilize LLMs’ text understanding and generation capabilities to extract textual knowledge graphs automatically by providing ...

entity-description: Comprehensive description of the entity’s attributes and activities.

Each textual relationship is formatted as (“relation”|<source-entity>|<target-entity>|<relation-description>|<relation-strength>), which contains the following information:

1) source-entity: name of the source entity, as defined in the textual entities;
2) target-en...

relation-description: an explanation of why the source entity and the target entity are related to each other.

relation-strength: a numeric score indicating the strength of the relationship between the source and target entities, ranging from 0 to 10.

The scene graph provides the object and relationship information in the image, which is formatted as:

-<object-0>:<object-category>,<object-bbox>
-<object-1>:<object-category>,<object-bbox>
...
-<relation-0>:<object-...
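
The pipe-delimited relationship record described above lends itself to a simple parser. The following is a minimal sketch, not the authors' implementation: the `Relation` class and `parse_relation` function are hypothetical names, the example record is invented for illustration, and the code assumes that no field contains a `|` character and that the leading tag may be wrapped in straight or curly quotes.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    """One textual relationship (hypothetical container, not from the paper)."""
    source: str
    target: str
    description: str
    strength: float


def parse_relation(record: str) -> Relation:
    """Parse a ("relation"|source|target|description|strength) record.

    Assumes fields never contain the '|' delimiter.
    """
    body = record.strip()
    # Drop the surrounding parentheses if both are present.
    if body.startswith("(") and body.endswith(")"):
        body = body[1:-1]
    fields = [field.strip() for field in body.split("|")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {record!r}")
    # The tag may be quoted with straight or curly quotes.
    tag = fields[0].strip("\"“”")
    if tag != "relation":
        raise ValueError(f"unexpected record tag: {fields[0]!r}")
    strength = float(fields[4])
    # The format states that strength ranges from 0 to 10.
    if not 0 <= strength <= 10:
        raise ValueError(f"relation-strength out of range: {strength}")
    return Relation(fields[1], fields[2], fields[3], strength)


# Invented example record, for illustration only.
example = '("relation"|Eiffel Tower|Paris|The tower is located in Paris|9)'
parsed = parse_relation(example)
print(parsed.source, "->", parsed.target, parsed.strength)
```

Validating the field count and the 0 to 10 range at parse time is a design choice of this sketch: it lets malformed LLM output be rejected before it enters the graph, rather than surfacing later during retrieval.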