mKG-RAG: Leveraging Multimodal Knowledge Graphs in Retrieval-Augmented Generation for Knowledge-intensive VQA

Liangbo Ning; Qing Li; Qingqing Ye; Wenqi Fan; Xu Yuan

arxiv: 2508.05318 · v2 · submitted 2025-08-07 · 💻 cs.CV · cs.AI

mKG-RAG: Leveraging Multimodal Knowledge Graphs in Retrieval-Augmented Generation for Knowledge-intensive VQA

Xu Yuan , Liangbo Ning , Qingqing Ye , Wenqi Fan , Qing Li This is my paper

Pith reviewed 2026-05-19 00:00 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords multimodal knowledge graphsretrieval-augmented generationknowledge-based VQAgraph extractionmultimodal retrievalmultimodal large language models

0 comments

The pith

Multimodal knowledge graphs in RAG frameworks yield more accurate answers for knowledge-intensive visual question answering by structuring external information and reducing irrelevant content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes mKG-RAG, a retrieval-augmented generation framework that builds multimodal knowledge graphs from documents to support visual question answering tasks requiring external knowledge. Standard RAG approaches often retrieve unstructured data that introduces misleading elements and lowers answer quality. The method applies MLLM-driven extraction together with vision-text matching to create graphs containing semantically consistent entities and relations across modalities. A dual-stage retrieval process with a query-aware retriever then selects and refines relevant knowledge for the generation step. Experiments indicate this structured approach outperforms prior methods and reaches new state-of-the-art results on knowledge-based VQA benchmarks.

Core claim

mKG-RAG distills high-quality multimodal knowledge graphs from documents using MLLM-driven graph extraction and vision-text matching to produce semantically consistent and modality-complementary entities and relations. It then applies a dual-stage retrieval strategy with a query-aware multimodal retriever to improve efficiency and progressively increase precision. When these graphs replace unstructured documents inside the RAG pipeline for knowledge-intensive VQA, answer accuracy and reliability increase substantially, establishing new state-of-the-art performance.

What carries the argument

Multimodal knowledge graph constructed via MLLM-driven graph extraction and vision-text matching, which supplies structured, modality-complementary knowledge representations that support dual-stage retrieval and generation.

If this is right

Structured relations among knowledge elements reduce the retrieval of irrelevant or misleading content.
Modality-complementary entities and relations improve the reliability of answers generated by the underlying MLLM.
Dual-stage retrieval progressively refines precision while preserving computational efficiency.
The overall framework produces new state-of-the-art results on standard knowledge-based VQA benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-construction pipeline could be tested on other multimodal reasoning tasks that currently rely on unstructured retrieval.
Error patterns in the extracted graphs might reveal which types of queries remain hardest for structured RAG approaches.
Replacing unstructured sources with these KGs could lower hallucination rates by anchoring generations to explicit cross-modal relations.

Load-bearing premise

MLLM-driven graph extraction combined with vision-text matching can reliably produce semantically consistent and modality-complementary entities and relations from multimodal documents without substantial errors or noise that would degrade downstream retrieval and generation.

What would settle it

An experiment in which the constructed multimodal KGs contain frequent inconsistencies or noise that cause lower VQA accuracy than equivalent unstructured-document RAG baselines would disprove the central benefit.

Figures

Figures reproduced from arXiv: 2508.05318 by Liangbo Ning, Qing Li, Qingqing Ye, Wenqi Fan, Xu Yuan.

**Figure 2.** Figure 2: An overview of our mKG-RAG consists of a multimodal knowledge graph construction [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Architecture Design of Question-aware Multimodal Retriever. Graph-based Retrieval. Previous methods retrieve text chunks directly from candidate documents [53], which often introduces contextual noise and impairs reasoning performance. In contrast, our approach performs graph-based retrieval to identify query-relevant entities and relationships. These entities and relationships serve as distilled knowled… view at source ↗

**Figure 2.** Figure 2: Combining Kg matched entities or relationships, we get a relevant subgraph G 0 r . However, similarity-based retrieval alone may yield incomplete information, potentially omitting critical evidence to answer the question entirely. To this end, we leverage the inherent structural properties of the graph and expand Gr by incorporating information from its l-hop neighbors, i.e., G l r = Graph Traversal(Gm, G… view at source ↗

**Figure 4.** Figure 4: Qualitative results of Qwen2-VL-7B, GPT-4o and mKG-RAG on E-VQA dataset. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: The prompt used to match visual and textual entities/relationships [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: A high-quality vision-text matching example for In-context Learning [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

read the original abstract

Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for expanding the knowledge capacity of Multimodal Large Language Models (MLLMs) by incorporating external knowledge sources into the generation process, and has been widely adopted for knowledge-based Visual Question Answering (VQA). Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relations among knowledge elements frequently introduce irrelevant or misleading content, degrading answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks, thereby enhancing generation through structured multimodal knowledge. To this end, this paper proposes mKG-RAG, a novel retrieval-augmented generation framework built upon multimodal KGs for knowledge-intensive VQA tasks. Specifically, mKG-RAG leverages MLLM-driven graph extraction and vision-text matching to distill semantically consistent, modality-complementary entities and relations from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. Furthermore, a dual-stage retrieval strategy equipped with a query-aware multimodal retriever is introduced to improve retrieval efficiency while progressively refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing approaches and sets new state-of-the-art results for knowledge-based VQA. The code is available at https://github.com/xandery-geek/mKG-RAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes mKG-RAG, a novel RAG framework for knowledge-intensive VQA that constructs multimodal knowledge graphs via MLLM-driven graph extraction and vision-text matching to obtain semantically consistent, modality-complementary entities and relations from documents; it further introduces a dual-stage retrieval strategy with a query-aware multimodal retriever and reports that comprehensive experiments show significant outperformance over prior methods with new state-of-the-art results. Code is released.

Significance. If the results are robust, the work would demonstrate a concrete benefit of structured multimodal knowledge over unstructured documents in RAG pipelines for VQA, addressing a recognized limitation in current approaches. The public code release aids reproducibility.

major comments (1)

[§3] §3 (Graph Construction / Pipeline): The central claim that mKG-RAG sets new SOTA rests on the untested assumption that MLLM-driven extraction plus vision-text matching yields high-quality, low-noise multimodal KGs. No quantitative extraction metrics (entity/relation F1 vs. human gold, error rates, or consistency scores) or error-propagation analysis are reported, leaving open the possibility that downstream gains are artifacts of prompting rather than the KG structure itself.

minor comments (1)

[Abstract] Abstract: the phrase 'comprehensive experiments' would benefit from a parenthetical listing of the primary datasets and main baselines to give readers immediate context.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on the graph construction evaluation below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [§3] §3 (Graph Construction / Pipeline): The central claim that mKG-RAG sets new SOTA rests on the untested assumption that MLLM-driven extraction plus vision-text matching yields high-quality, low-noise multimodal KGs. No quantitative extraction metrics (entity/relation F1 vs. human gold, error rates, or consistency scores) or error-propagation analysis are reported, leaving open the possibility that downstream gains are artifacts of prompting rather than the KG structure itself.

Authors: We agree that direct quantitative evaluation of the multimodal KG extraction quality would provide stronger support for our claims and help rule out prompting artifacts. The current manuscript emphasizes end-to-end VQA results and ablations that isolate the contribution of the structured KG (e.g., comparisons against unstructured document retrieval), but does not report extraction-level metrics such as entity/relation F1 against human gold standards or explicit error-propagation studies. In the revised version we will add a dedicated analysis in Section 3 (or a new subsection in the experiments) that reports precision, recall, and F1 scores for entities and relations on a randomly sampled subset of documents annotated by human experts. We will also include an error-propagation study that measures VQA performance degradation when controlled extraction noise is introduced versus when the KG is manually corrected. These additions will clarify that performance gains derive from the modality-complementary structure obtained via MLLM-driven extraction and vision-text matching rather than from prompting alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes mKG-RAG as a framework that constructs multimodal KGs via MLLM-driven extraction plus vision-text matching, then applies dual-stage retrieval for knowledge-based VQA. No equations, fitted parameters, or self-referential definitions appear in the provided abstract or pipeline description that would reduce the reported SOTA performance gains to quantities defined by the inputs themselves. The components are presented as independently motivated constructions evaluated on external benchmarks, satisfying the criteria for a self-contained derivation with no load-bearing reductions to self-citation chains or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that high-quality multimodal KGs can be automatically extracted and that dual-stage retrieval will progressively improve precision; no free parameters or invented entities with external falsifiable handles are described in the abstract.

axioms (1)

domain assumption Multimodal documents contain extractable entities and relations that MLLMs plus vision-text matching can distill into semantically consistent, modality-complementary multimodal KGs.
Invoked when describing construction of high-quality multimodal KGs as structured knowledge representations.

pith-pipeline@v0.9.0 · 5788 in / 1293 out tokens · 54945 ms · 2026-05-19T00:00:44.859637+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

leverages MLLM-driven graph extraction and vision-text matching to distill semantically consistent, modality-complementary entities and relations from multimodal documents, constructing high-quality multimodal KGs
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

dual-stage retrieval strategy equipped with a query-aware multimodal retriever

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
cs.CV 2026-02 conditional novelty 7.0

SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.
MG$^2$-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation
cs.IR 2026-04 unverdicted novelty 6.0

MG²-RAG proposes a multi-granularity graph RAG framework that constructs hierarchical multimodal nodes via entity-driven visual grounding and performs structured retrieval, delivering SOTA results on four multimodal t...
QKVQA: Question-Focused Filtering for Knowledge-based VQA
cs.IR 2026-01 unverdicted novelty 6.0

QKVQA proposes a question-focused filtering method with QFF and CDA modules that boosts accuracy by 3.2 points on Encyclopedic-VQA and 2.2 points on InfoSeek over prior state-of-the-art.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 3 Pith papers · 8 internal anchors

[1]

Flamingo: a visual language model for few-shot learning // Advances in neural information processing systems

Alayrac Jean-Baptiste, Donahue Jeff, Luc Pauline, Miech Antoine, Barr Iain, Hasson Yana, Lenc Karel, Mensch Arthur, Millican Katherine, Reynolds Malcolm, others. Flamingo: a visual language model for few-shot learning // Advances in neural information processing systems

work page
[2]

Vqa: Visual question answering // Proceedings of the IEEE international conference on computer vision

Antol Stanislaw, Agrawal Aishwarya, Lu Jiasen, Mitchell Margaret, Batra Dhruv, Zitnick C Lawrence, Parikh Devi. Vqa: Visual question answering // Proceedings of the IEEE international conference on computer vision. 2015. 2425–2433

work page 2015
[3]

Qwen2.5-VL Technical Report

Bai Shuai, Chen Keqin, Liu Xuejing, Wang Jialin, Ge Wenbin, Song Sibo, Dang Kai, Wang Peng, Wang Shijie, Tang Jun, Zhong Humen, Zhu Yuanzhi, Yang Mingkun, Li Zhaohai, Wan Jianqiang, Wang Pengfei, Ding Wei, Fu Zheren, Xu Yiheng, Ye Jiabo, Zhang Xi, Xie Tianbao, Cheng Zesen, Zhang Hang, Yang Zhibo, Xu Haiyang, Lin Junyang. Qwen2.5-VL Technical Report. 2025

work page 2025
[4]

Wiki-llava: Hierarchical retrieval-augmented generation for mul- timodal llms // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Caffagni Davide, Cocchi Federico, Moratelli Nicholas, Sarto Sara, Cornia Marcella, Baraldi Lorenzo, Cucchiara Rita. Wiki-llava: Hierarchical retrieval-augmented generation for mul- timodal llms // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. 1818–1826

work page 2024
[5]

A simple framework for contrastive learning of visual representations // International conference on machine learning

Chen Ting, Kornblith Simon, Norouzi Mohammad, Hinton Geoffrey. A simple framework for contrastive learning of visual representations // International conference on machine learning

work page
[6]

Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Chen Yang, Hu Hexiang, Luan Yi, Sun Haitian, Changpinyo Soravit, Ritter Alan, Chang Ming-Wei. Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. 14948–14968

work page 2023
[7]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen Zhe, Wu Jiannan, Wang Wenhai, Su Weijie, Chen Guo, Xing Sen, Zhong Muyan, Zhang Qinglong, Zhu Xizhou, Lu Lewei, others. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024. 24185–24198

work page 2024
[8]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality // See https://vicuna

Chiang Wei-Lin, Li Zhuohan, Lin Ziqing, Sheng Ying, Wu Zhanghao, Zhang Hao, Zheng Lianmin, Zhuang Siyuan, Zhuang Yonghao, Gonzalez Joseph E, others. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality // See https://vicuna. lmsys. org (accessed 14 April 2023). 2023. 2, 3. 6

work page 2023
[9]

Scaling instruction-finetuned language models // Journal of Machine Learning Research

Chung Hyung Won, Hou Le, Longpre Shayne, Zoph Barret, Tay Yi, Fedus William, Li Yunxuan, Wang Xuezhi, Dehghani Mostafa, Brahma Siddhartha, others. Scaling instruction-finetuned language models // Journal of Machine Learning Research. 2024. 25, 70. 1–53

work page 2024
[10]

LLaV A-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning // arXiv preprint arXiv:2503.15621

Cocchi Federico, Moratelli Nicholas, Caffagni Davide, Sarto Sara, Baraldi Lorenzo, Cor- nia Marcella, Cucchiara Rita. LLaV A-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning // arXiv preprint arXiv:2503.15621. 2025

work page arXiv 2025
[11]

Cocchi Federico, Moratelli Nicholas, Cornia Marcella, Baraldi Lorenzo, Cucchiara Rita. Aug- menting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2025

work page 2025
[12]

A survey on multimodal large language mod- els for autonomous driving // Proceedings of the IEEE/CVF winter conference on applications of computer vision

Cui Can, Ma Yunsheng, Cao Xu, Ye Wenqian, Zhou Yang, Liang Kaizhao, Chen Jintai, Lu Juanwu, Yang Zichong, Liao Kuei-Da, others. A survey on multimodal large language mod- els for autonomous driving // Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2024. 958–979

work page 2024
[13]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning // Thirty-seventh Conference on Neural Information Processing Systems

Dai Wenliang, Li Junnan, Li Dongxu, Tiong Anthony, Zhao Junqi, Wang Weisheng, Li Boyang, Fung Pascale, Hoi Steven. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning // Thirty-seventh Conference on Neural Information Processing Systems. 2023. 11

work page 2023
[14]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale // International Conference on Learning Representations

Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, Uszkoreit Jakob, Houlsby Neil. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale // International Conference on Learning Representations. 2021

work page 2021
[15]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Edge Darren, Trinh Ha, Cheng Newman, Bradley Joshua, Chao Alex, Mody Apurva, Truitt Steven, Metropolitansky Dasha, Ness Robert Osazuwa, Larson Jonathan. From local to global: A graph rag approach to query-focused summarization // arXiv preprint arXiv:2404.16130. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

A survey on rag meeting llms: Towards retrieval-augmented large language models // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Fan Wenqi, Ding Yujuan, Ning Liangbo, Wang Shijie, Li Hengyun, Yin Dawei, Chua Tat-Seng, Li Qing. A survey on rag meeting llms: Towards retrieval-augmented large language models // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

work page
[17]

[17]Fan Wenqi, Wang Shijie, Huang Jiani, Chen Zhikai, Song Yu, Tang Wenzhuo, Mao Haitao, Liu Hui, Liu Xiaorui, Yin Dawei, others

6491–6501. [17]Fan Wenqi, Wang Shijie, Huang Jiani, Chen Zhikai, Song Yu, Tang Wenzhuo, Mao Haitao, Liu Hui, Liu Xiaorui, Yin Dawei, others. Graph machine learning in the era of large language models (llms) // arXiv preprint arXiv:2404.14928. 2024

work page internal anchor Pith review arXiv 2024
[18]

arXiv preprint arXiv:2501.10282 (2025)

Fan Wenqi, Zhou Yi, Wang Shijie, Yan Yuyao, Liu Hui, Zhao Qian, Song Le, Li Qing. Com- putational Protein Science in the Era of Large Language Models (LLMs) // arXiv preprint arXiv:2501.10282. 2025

work page arXiv 2025
[19]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering // Proceedings of the IEEE conference on computer vision and pattern recognition

Goyal Yash, Khot Tejas, Summers-Stay Douglas, Batra Dhruv, Parikh Devi. Making the v in vqa matter: Elevating the role of image understanding in visual question answering // Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. 6904–6913

work page 2017
[20]

LightRAG: Simple and Fast Retrieval-Augmented Generation

Guo Zirui, Xia Lianghao, Yu Yanhua, Ao Tu, Huang Chao. LightRAG: Simple and Fast Retrieval-Augmented Generation // arXiv preprint arXiv:2410.05779. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Momentum contrast for unsupervised visual representation learning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

He Kaiming, Fan Haoqi, Wu Yuxin, Xie Saining, Girshick Ross. Momentum contrast for unsupervised visual representation learning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020. 9729–9738

work page 2020
[22]

G-retriever: Retrieval-augmented generation for textual graph understanding and question answering // Advances in Neural Information Processing Systems

He Xiaoxin, Tian Yijun, Sun Yifei, Chawla Nitesh, Laurent Thomas, LeCun Yann, Bresson Xavier, Hooi Bryan. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering // Advances in Neural Information Processing Systems. 37. 2024. 132876–132907

work page 2024
[23]

Knowledge graphs // ACM Computing Surveys (Csur)

Hogan Aidan, Blomqvist Eva, Cochez Michael, d’Amato Claudia, Melo Gerard De, Gutierrez Claudio, Kirrane Sabrina, Gayo José Emilio Labra, Navigli Roberto, Neumaier Sebastian, others. Knowledge graphs // ACM Computing Surveys (Csur). 2021. 54, 4. 1–37

work page 2021
[24]

Lora: Low-rank adaptation of large language models

Hu Edward J, Shen Yelong, Wallis Phillip, Allen-Zhu Zeyuan, Li Yuanzhi, Wang Shean, Wang Lu, Chen Weizhu, others. Lora: Low-rank adaptation of large language models. // ICLR. 2022. 1, 2. 3

work page 2022
[25]

Egtr: Extracting graph from transformer for scene graph generation // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Im Jinbae, Nam JeongYeon, Park Nokyung, Lee Hyungmin, Park Seunghyun. Egtr: Extracting graph from transformer for scene graph generation // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. 24229–24238

work page 2024
[26]

Billion-scale similarity search with GPUs // IEEE Transactions on Big Data

Johnson Jeff, Douze Matthijs, Jégou Hervé. Billion-scale similarity search with GPUs // IEEE Transactions on Big Data. 2019. 7, 3. 535–547

work page 2019
[27]

Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning // The Twelfth International Conference on Learning Representations

LUO LINHAO, Li Yuan-Fang, Haffari Gholamreza, Pan Shirui. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning // The Twelfth International Conference on Learning Representations. 2024

work page 2024
[28]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models // International conference on machine learning

Li Junnan, Li Dongxu, Savarese Silvio, Hoi Steven. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models // International conference on machine learning. 2023. 19730–19742. 12

work page 2023
[29]

Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning // Advances in Neural Information Processing Systems

Liang Victor Weixin, Zhang Yuhui, Kwon Yongchan, Yeung Serena, Zou James Y. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning // Advances in Neural Information Processing Systems. 2022. 35. 17612–17625

work page 2022
[30]

Vila: On pre- training for visual language models // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Lin Ji, Yin Hongxu, Ping Wei, Molchanov Pavlo, Shoeybi Mohammad, Han Song. Vila: On pre- training for visual language models // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024. 26689–26699

work page 2024
[31]

Retrieval Augmented Visual Question Answering with Outside Knowl- edge // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Lin Weizhe, Byrne Bill. Retrieval Augmented Visual Question Answering with Outside Knowl- edge // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. 11238–11254

work page 2022
[32]

Fine-grained late- interaction multi-modal retrieval for retrieval augmented visual question answering // Advances in Neural Information Processing Systems

Lin Weizhe, Chen Jinghong, Mei Jingbiao, Coca Alexandru, Byrne Bill. Fine-grained late- interaction multi-modal retrieval for retrieval augmented visual question answering // Advances in Neural Information Processing Systems. 36. 2023. 22820–22840

work page 2023
[33]

Medical visual question answering: A survey // Artificial Intelligence in Medicine

Lin Zhihong, Zhang Donghao, Tao Qingyi, Shi Danli, Haffari Gholamreza, Wu Qi, He Ming- guang, Ge Zongyuan. Medical visual question answering: A survey // Artificial Intelligence in Medicine. 2023. 143. 102611

work page 2023
[34]

Improved baselines with visual instruction tuning // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Liu Haotian, Li Chunyuan, Li Yuheng, Lee Yong Jae. Improved baselines with visual instruction tuning // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

work page
[35]

Visual instruction tuning // Advances in neural information processing systems

Liu Haotian, Li Chunyuan, Wu Qingyang, Lee Yong Jae. Visual instruction tuning // Advances in neural information processing systems. 36. 2023. 34892–34916

work page 2023
[36]

MMKG: multi-modal knowledge graphs // The semantic web: 16th international conference, ESWC 2019, portorož, Slovenia, June 2–6, 2019, proceedings 16

Liu Ye, Li Hui, Garcia-Duran Alberto, Niepert Mathias, Onoro-Rubio Daniel, Rosenblum David S. MMKG: multi-modal knowledge graphs // The semantic web: 16th international conference, ESWC 2019, portorož, Slovenia, June 2–6, 2019, proceedings 16. 2019. 459–474

work page 2019
[37]

HyperGraphRAG: Retrieval-Augmented Generation with Hypergraph-Structured Knowledge Representation // arXiv preprint arXiv:2503.21322

Luo Haoran, Chen Guanting, Zheng Yandan, Wu Xiaobao, Guo Yikai, Lin Qika, Feng Yu, Kuang Zemin, Song Meina, Zhu Yifan, others. HyperGraphRAG: Retrieval-Augmented Generation with Hypergraph-Structured Knowledge Representation // arXiv preprint arXiv:2503.21322. 2025

work page arXiv 2025
[38]

Ma Shengjie, Xu Chengjin, Jiang Xuhui, Li Muzhi, Qu Huaren, Yang Cehao, Mao Jiaxin, Guo Jian. Think-on-Graph 2.0: Deep and Faithful Large Language Model Reasoning with Knowledge-guided Retrieval Augmented Generation // The Thirteenth International Conference on Learning Representations. 2025

work page 2025
[39]

Ok-vqa: A visual question answering benchmark requiring external knowledge // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Marino Kenneth, Rastegari Mohammad, Farhadi Ali, Mottaghi Roozbeh. Ok-vqa: A visual question answering benchmark requiring external knowledge // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. 3195–3204

work page 2019
[40]

Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories // Proceedings of the IEEE/CVF International Conference on Computer Vision

Mensink Thomas, Uijlings Jasper, Castrejon Lluis, Goel Arushi, Cadar Felipe, Zhou Howard, Sha Fei, Araujo André, Ferrari Vittorio. Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories // Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. 3113–3124

work page 2023
[41]

Towards trustworthy re- trieval augmented generation for large language models: A survey

Ni Bo, Liu Zheyuan, Wang Leyao, Lei Yongjia, Zhao Yuying, Cheng Xueqi, Zeng Qingkai, Dong Luna, Xia Yinglong, Kenthapadi Krishnaram, others. Towards Trustworthy Retrieval Aug- mented Generation for Large Language Models: A Survey // arXiv preprint arXiv:2502.06872. 2025

work page arXiv 2025
[42]

Cheatagent: Attacking llm-empowered recommender systems via llm agent // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Ning Liang-bo, Wang Shijie, Fan Wenqi, Li Qing, Xu Xin, Chen Hao, Huang Feiran. Cheatagent: Attacking llm-empowered recommender systems via llm agent // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024. 2284–2295

work page 2024
[43]

Nomic Embed: Training a Reproducible Long Context Text Embedder // Transactions on Machine Learning Research

Nussbaum Zach, Morris John Xavier, Mulyar Andriy, Duderstadt Brandon. Nomic Embed: Training a Reproducible Long Context Text Embedder // Transactions on Machine Learning Research. 2025. 13

work page 2025
[44]

RoRA-VLM: Robust retrieval-augmented vision language models,

Qi Jingyuan, Xu Zhiyang, Shao Rulin, Chen Yang, Di Jin, Cheng Yu, Wang Qifan, Huang Lifu. RoRA-VLM: Robust Retrieval-Augmented Vision Language Models // arXiv preprint arXiv:2410.08876. 2024

work page arXiv 2024
[45]

A Survey of Mamba

Qu Haohao, Ning Liangbo, An Rui, Fan Wenqi, Derr Tyler, Liu Hui, Xu Xin, Li Qing. A survey of mamba // arXiv preprint arXiv:2408.01129. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

Learning transferable visual models from natural language supervision // International conference on machine learning

Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, others. Learning transferable visual models from natural language supervision // International conference on machine learning

work page
[47]

Faster r-cnn: Towards real-time object detection with region proposal networks // Advances in neural information processing systems

Ren Shaoqing, He Kaiming, Girshick Ross, Sun Jian. Faster r-cnn: Towards real-time object detection with region proposal networks // Advances in neural information processing systems

work page
[48]

[48]Schwenk Dustin, Khandelwal Apoorv, Clark Christopher, Marino Kenneth, Mottaghi Roozbeh

2015. [48]Schwenk Dustin, Khandelwal Apoorv, Clark Christopher, Marino Kenneth, Mottaghi Roozbeh. A-okvqa: A benchmark for visual question answering using world knowledge // European conference on computer vision. 2022. 146–162

work page 2015
[49]

LLaMA: Open and Efficient Foundation Language Models

Touvron Hugo, Lavril Thibaut, Izacard Gautier, Martinet Xavier, Lachaux Marie-Anne, Lacroix Timothée, Rozière Baptiste, Goyal Naman, Hambro Eric, Azhar Faisal, others. Llama: Open and efficient foundation language models // arXiv preprint arXiv:2302.13971. 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang P , Bai S, Tan S, Wang S, Fan Z, Bai J, Chen K, Liu X, Wang J, Ge W, others. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024 // URL https://arxiv. org/abs/2409.12191. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Knowledge graph retrieval-augmented generation for llm-based recommendation, 2025 b

Wang Shijie, Fan Wenqi, Feng Yue, Ma Xinyu, Wang Shuaiqiang, Yin Dawei. Knowledge Graph Retrieval-Augmented Generation for LLM-based Recommendation // arXiv preprint arXiv:2501.02226. 2025

work page arXiv 2025
[52]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Wu Zhiyu, Chen Xiaokang, Pan Zizheng, Liu Xingchao, Liu Wen, Dai Damai, Gao Huazuo, Ma Yiyang, Wu Chengyue, Wang Bingxuan, others. Deepseek-vl2: Mixture-of-experts vision- language models for advanced multimodal understanding // arXiv preprint arXiv:2412.10302. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

EchoSight: Advancing Visual-Language Models with Wiki Knowledge // Findings of the Association for Computational Linguistics: EMNLP 2024

Yan Yibin, Xie Weidi. EchoSight: Advancing Visual-Language Models with Wiki Knowledge // Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. 1538–1551

work page 2024
[54]

Instruction-guided multi-granularity segmentation and captioning with large multimodal model // Proceedings of the AAAI Confer- ence on Artificial Intelligence

Yuan Xu, Zhou Li, Sun Zenghui, Zhou Zikun, Lan Jingsong. Instruction-guided multi-granularity segmentation and captioning with large multimodal model // Proceedings of the AAAI Confer- ence on Artificial Intelligence. 2025

work page 2025
[55]

MM- LLMs: Recent Advances in MultiModal Large Language Models // Findings of the Association for Computational Linguistics: ACL 2024

Zhang Duzhen, Yu Yahan, Dong Jiahua, Li Chenxing, Su Dan, Chu Chenhui, Yu Dong. MM- LLMs: Recent Advances in MultiModal Large Language Models // Findings of the Association for Computational Linguistics: ACL 2024. 2024. 12401–12430

work page 2024
[56]

mR2AG: Multimodal Retrieval-Reflection- Augmented Generation for Knowledge-Based VQA // arXiv preprint arXiv:2411.15041

Zhang Tao, Zhang Ziqi, Ma Zongyang, Chen Yuxin, Qi Zhongang, Yuan Chunfeng, Li Bing, Pu Junfu, Zhao Yuxuan, Xie Zehua, others. mR2AG: Multimodal Retrieval-Reflection- Augmented Generation for Knowledge-Based VQA // arXiv preprint arXiv:2411.15041. 2024

work page arXiv 2024
[57]

Recommender systems in the era of large language models (llms) // IEEE Transactions on Knowledge and Data Engineering

Zhao Zihuai, Fan Wenqi, Li Jiatong, Liu Yunqing, Mei Xiaowei, Wang Yiqi, Wen Zhen, Wang Fei, Zhao Xiangyu, Tang Jiliang, others. Recommender systems in the era of large language models (llms) // IEEE Transactions on Knowledge and Data Engineering. 2024

work page 2024
[58]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Zhu Jinguo, Wang Weiyun, Chen Zhe, Liu Zhaoyang, Ye Shenglong, Gu Lixin, Tian Hao, Duan Yuchen, Su Weijie, Shao Jie, others. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models // arXiv preprint arXiv:2504.10479. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

Knowledge graph-guided retrieval augmented generation, 2025

Zhu Xiangrong, Xie Yuexiang, Liu Yi, Li Yaliang, Hu Wei. Knowledge Graph-Guided Retrieval Augmented Generation // arXiv preprint arXiv:2502.06864. 2025. 14 A. Prompt Design In our multimodal knowledge graph construction pipeline, we utilize LLMs’ text understanding and generation capabilities to extract textual knowledge graphs automatically by providing ...

work page arXiv 2025
[60]

relation

entity-description: Comprehensive description of the entity’s attributes and activities. Each textual relationship are formatted as (“relation”|<source-entity>|<target-entity>| <relation-description>|<relation-strength>), which contains the following infor- mation: 1)source-entity: name of the source entity, as defined in the textual entities; 2)target-en...

work page
[61]

relation-description: explanation as to why the source entity and the target entity are related to each other

work page
[62]

matching

relation-strength: a numeric score indicating the strength of the relationship between the source and target entities, ranging from 0 to 10. The scene graph provides the object and relationship information in the image, which is formatted as: -<object-0>:<object-category>,<object-bbox> -<object-1>:<object-category>,<object-bbox> ... -<relation-0>:<object-...

work page

[1] [1]

Flamingo: a visual language model for few-shot learning // Advances in neural information processing systems

Alayrac Jean-Baptiste, Donahue Jeff, Luc Pauline, Miech Antoine, Barr Iain, Hasson Yana, Lenc Karel, Mensch Arthur, Millican Katherine, Reynolds Malcolm, others. Flamingo: a visual language model for few-shot learning // Advances in neural information processing systems

work page

[2] [2]

Vqa: Visual question answering // Proceedings of the IEEE international conference on computer vision

Antol Stanislaw, Agrawal Aishwarya, Lu Jiasen, Mitchell Margaret, Batra Dhruv, Zitnick C Lawrence, Parikh Devi. Vqa: Visual question answering // Proceedings of the IEEE international conference on computer vision. 2015. 2425–2433

work page 2015

[3] [3]

Qwen2.5-VL Technical Report

Bai Shuai, Chen Keqin, Liu Xuejing, Wang Jialin, Ge Wenbin, Song Sibo, Dang Kai, Wang Peng, Wang Shijie, Tang Jun, Zhong Humen, Zhu Yuanzhi, Yang Mingkun, Li Zhaohai, Wan Jianqiang, Wang Pengfei, Ding Wei, Fu Zheren, Xu Yiheng, Ye Jiabo, Zhang Xi, Xie Tianbao, Cheng Zesen, Zhang Hang, Yang Zhibo, Xu Haiyang, Lin Junyang. Qwen2.5-VL Technical Report. 2025

work page 2025

[4] [4]

Wiki-llava: Hierarchical retrieval-augmented generation for mul- timodal llms // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Caffagni Davide, Cocchi Federico, Moratelli Nicholas, Sarto Sara, Cornia Marcella, Baraldi Lorenzo, Cucchiara Rita. Wiki-llava: Hierarchical retrieval-augmented generation for mul- timodal llms // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. 1818–1826

work page 2024

[5] [5]

A simple framework for contrastive learning of visual representations // International conference on machine learning

Chen Ting, Kornblith Simon, Norouzi Mohammad, Hinton Geoffrey. A simple framework for contrastive learning of visual representations // International conference on machine learning

work page

[6] [6]

Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Chen Yang, Hu Hexiang, Luan Yi, Sun Haitian, Changpinyo Soravit, Ritter Alan, Chang Ming-Wei. Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. 14948–14968

work page 2023

[7] [7]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen Zhe, Wu Jiannan, Wang Wenhai, Su Weijie, Chen Guo, Xing Sen, Zhong Muyan, Zhang Qinglong, Zhu Xizhou, Lu Lewei, others. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024. 24185–24198

work page 2024

[8] [8]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality // See https://vicuna

Chiang Wei-Lin, Li Zhuohan, Lin Ziqing, Sheng Ying, Wu Zhanghao, Zhang Hao, Zheng Lianmin, Zhuang Siyuan, Zhuang Yonghao, Gonzalez Joseph E, others. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality // See https://vicuna. lmsys. org (accessed 14 April 2023). 2023. 2, 3. 6

work page 2023

[9] [9]

Scaling instruction-finetuned language models // Journal of Machine Learning Research

Chung Hyung Won, Hou Le, Longpre Shayne, Zoph Barret, Tay Yi, Fedus William, Li Yunxuan, Wang Xuezhi, Dehghani Mostafa, Brahma Siddhartha, others. Scaling instruction-finetuned language models // Journal of Machine Learning Research. 2024. 25, 70. 1–53

work page 2024

[10] [10]

LLaV A-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning // arXiv preprint arXiv:2503.15621

Cocchi Federico, Moratelli Nicholas, Caffagni Davide, Sarto Sara, Baraldi Lorenzo, Cor- nia Marcella, Cucchiara Rita. LLaV A-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning // arXiv preprint arXiv:2503.15621. 2025

work page arXiv 2025

[11] [11]

Cocchi Federico, Moratelli Nicholas, Cornia Marcella, Baraldi Lorenzo, Cucchiara Rita. Aug- menting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2025

work page 2025

[12] [12]

A survey on multimodal large language mod- els for autonomous driving // Proceedings of the IEEE/CVF winter conference on applications of computer vision

Cui Can, Ma Yunsheng, Cao Xu, Ye Wenqian, Zhou Yang, Liang Kaizhao, Chen Jintai, Lu Juanwu, Yang Zichong, Liao Kuei-Da, others. A survey on multimodal large language mod- els for autonomous driving // Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2024. 958–979

work page 2024

[13] [13]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning // Thirty-seventh Conference on Neural Information Processing Systems

Dai Wenliang, Li Junnan, Li Dongxu, Tiong Anthony, Zhao Junqi, Wang Weisheng, Li Boyang, Fung Pascale, Hoi Steven. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning // Thirty-seventh Conference on Neural Information Processing Systems. 2023. 11

work page 2023

[14] [14]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale // International Conference on Learning Representations

Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, Uszkoreit Jakob, Houlsby Neil. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale // International Conference on Learning Representations. 2021

work page 2021

[15] [15]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Edge Darren, Trinh Ha, Cheng Newman, Bradley Joshua, Chao Alex, Mody Apurva, Truitt Steven, Metropolitansky Dasha, Ness Robert Osazuwa, Larson Jonathan. From local to global: A graph rag approach to query-focused summarization // arXiv preprint arXiv:2404.16130. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

A survey on rag meeting llms: Towards retrieval-augmented large language models // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Fan Wenqi, Ding Yujuan, Ning Liangbo, Wang Shijie, Li Hengyun, Yin Dawei, Chua Tat-Seng, Li Qing. A survey on rag meeting llms: Towards retrieval-augmented large language models // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

work page

[17] [17]

[17]Fan Wenqi, Wang Shijie, Huang Jiani, Chen Zhikai, Song Yu, Tang Wenzhuo, Mao Haitao, Liu Hui, Liu Xiaorui, Yin Dawei, others

6491–6501. [17]Fan Wenqi, Wang Shijie, Huang Jiani, Chen Zhikai, Song Yu, Tang Wenzhuo, Mao Haitao, Liu Hui, Liu Xiaorui, Yin Dawei, others. Graph machine learning in the era of large language models (llms) // arXiv preprint arXiv:2404.14928. 2024

work page internal anchor Pith review arXiv 2024

[18] [18]

arXiv preprint arXiv:2501.10282 (2025)

Fan Wenqi, Zhou Yi, Wang Shijie, Yan Yuyao, Liu Hui, Zhao Qian, Song Le, Li Qing. Com- putational Protein Science in the Era of Large Language Models (LLMs) // arXiv preprint arXiv:2501.10282. 2025

work page arXiv 2025

[19] [19]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering // Proceedings of the IEEE conference on computer vision and pattern recognition

Goyal Yash, Khot Tejas, Summers-Stay Douglas, Batra Dhruv, Parikh Devi. Making the v in vqa matter: Elevating the role of image understanding in visual question answering // Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. 6904–6913

work page 2017

[20] [20]

LightRAG: Simple and Fast Retrieval-Augmented Generation

Guo Zirui, Xia Lianghao, Yu Yanhua, Ao Tu, Huang Chao. LightRAG: Simple and Fast Retrieval-Augmented Generation // arXiv preprint arXiv:2410.05779. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

Momentum contrast for unsupervised visual representation learning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

He Kaiming, Fan Haoqi, Wu Yuxin, Xie Saining, Girshick Ross. Momentum contrast for unsupervised visual representation learning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020. 9729–9738

work page 2020

[22] [22]

G-retriever: Retrieval-augmented generation for textual graph understanding and question answering // Advances in Neural Information Processing Systems

He Xiaoxin, Tian Yijun, Sun Yifei, Chawla Nitesh, Laurent Thomas, LeCun Yann, Bresson Xavier, Hooi Bryan. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering // Advances in Neural Information Processing Systems. 37. 2024. 132876–132907

work page 2024

[23] [23]

Knowledge graphs // ACM Computing Surveys (Csur)

Hogan Aidan, Blomqvist Eva, Cochez Michael, d’Amato Claudia, Melo Gerard De, Gutierrez Claudio, Kirrane Sabrina, Gayo José Emilio Labra, Navigli Roberto, Neumaier Sebastian, others. Knowledge graphs // ACM Computing Surveys (Csur). 2021. 54, 4. 1–37

work page 2021

[24] [24]

Lora: Low-rank adaptation of large language models

Hu Edward J, Shen Yelong, Wallis Phillip, Allen-Zhu Zeyuan, Li Yuanzhi, Wang Shean, Wang Lu, Chen Weizhu, others. Lora: Low-rank adaptation of large language models. // ICLR. 2022. 1, 2. 3

work page 2022

[25] [25]

Egtr: Extracting graph from transformer for scene graph generation // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Im Jinbae, Nam JeongYeon, Park Nokyung, Lee Hyungmin, Park Seunghyun. Egtr: Extracting graph from transformer for scene graph generation // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. 24229–24238

work page 2024

[26] [26]

Billion-scale similarity search with GPUs // IEEE Transactions on Big Data

Johnson Jeff, Douze Matthijs, Jégou Hervé. Billion-scale similarity search with GPUs // IEEE Transactions on Big Data. 2019. 7, 3. 535–547

work page 2019

[27] [27]

Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning // The Twelfth International Conference on Learning Representations

LUO LINHAO, Li Yuan-Fang, Haffari Gholamreza, Pan Shirui. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning // The Twelfth International Conference on Learning Representations. 2024

work page 2024

[28] [28]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models // International conference on machine learning

Li Junnan, Li Dongxu, Savarese Silvio, Hoi Steven. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models // International conference on machine learning. 2023. 19730–19742. 12

work page 2023

[29] [29]

Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning // Advances in Neural Information Processing Systems

Liang Victor Weixin, Zhang Yuhui, Kwon Yongchan, Yeung Serena, Zou James Y. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning // Advances in Neural Information Processing Systems. 2022. 35. 17612–17625

work page 2022

[30] [30]

Vila: On pre- training for visual language models // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Lin Ji, Yin Hongxu, Ping Wei, Molchanov Pavlo, Shoeybi Mohammad, Han Song. Vila: On pre- training for visual language models // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024. 26689–26699

work page 2024

[31] [31]

Retrieval Augmented Visual Question Answering with Outside Knowl- edge // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Lin Weizhe, Byrne Bill. Retrieval Augmented Visual Question Answering with Outside Knowl- edge // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. 11238–11254

work page 2022

[32] [32]

Fine-grained late- interaction multi-modal retrieval for retrieval augmented visual question answering // Advances in Neural Information Processing Systems

Lin Weizhe, Chen Jinghong, Mei Jingbiao, Coca Alexandru, Byrne Bill. Fine-grained late- interaction multi-modal retrieval for retrieval augmented visual question answering // Advances in Neural Information Processing Systems. 36. 2023. 22820–22840

work page 2023

[33] [33]

Medical visual question answering: A survey // Artificial Intelligence in Medicine

Lin Zhihong, Zhang Donghao, Tao Qingyi, Shi Danli, Haffari Gholamreza, Wu Qi, He Ming- guang, Ge Zongyuan. Medical visual question answering: A survey // Artificial Intelligence in Medicine. 2023. 143. 102611

work page 2023

[34] [34]

Improved baselines with visual instruction tuning // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Liu Haotian, Li Chunyuan, Li Yuheng, Lee Yong Jae. Improved baselines with visual instruction tuning // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

work page

[35] [35]

Visual instruction tuning // Advances in neural information processing systems

Liu Haotian, Li Chunyuan, Wu Qingyang, Lee Yong Jae. Visual instruction tuning // Advances in neural information processing systems. 36. 2023. 34892–34916

work page 2023

[36] [36]

MMKG: multi-modal knowledge graphs // The semantic web: 16th international conference, ESWC 2019, portorož, Slovenia, June 2–6, 2019, proceedings 16

Liu Ye, Li Hui, Garcia-Duran Alberto, Niepert Mathias, Onoro-Rubio Daniel, Rosenblum David S. MMKG: multi-modal knowledge graphs // The semantic web: 16th international conference, ESWC 2019, portorož, Slovenia, June 2–6, 2019, proceedings 16. 2019. 459–474

work page 2019

[37] [37]

HyperGraphRAG: Retrieval-Augmented Generation with Hypergraph-Structured Knowledge Representation // arXiv preprint arXiv:2503.21322

Luo Haoran, Chen Guanting, Zheng Yandan, Wu Xiaobao, Guo Yikai, Lin Qika, Feng Yu, Kuang Zemin, Song Meina, Zhu Yifan, others. HyperGraphRAG: Retrieval-Augmented Generation with Hypergraph-Structured Knowledge Representation // arXiv preprint arXiv:2503.21322. 2025

work page arXiv 2025

[38] [38]

Ma Shengjie, Xu Chengjin, Jiang Xuhui, Li Muzhi, Qu Huaren, Yang Cehao, Mao Jiaxin, Guo Jian. Think-on-Graph 2.0: Deep and Faithful Large Language Model Reasoning with Knowledge-guided Retrieval Augmented Generation // The Thirteenth International Conference on Learning Representations. 2025

work page 2025

[39] [39]

Ok-vqa: A visual question answering benchmark requiring external knowledge // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Marino Kenneth, Rastegari Mohammad, Farhadi Ali, Mottaghi Roozbeh. Ok-vqa: A visual question answering benchmark requiring external knowledge // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. 3195–3204

work page 2019

[40] [40]

Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories // Proceedings of the IEEE/CVF International Conference on Computer Vision

Mensink Thomas, Uijlings Jasper, Castrejon Lluis, Goel Arushi, Cadar Felipe, Zhou Howard, Sha Fei, Araujo André, Ferrari Vittorio. Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories // Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. 3113–3124

work page 2023

[41] [41]

Towards trustworthy re- trieval augmented generation for large language models: A survey

Ni Bo, Liu Zheyuan, Wang Leyao, Lei Yongjia, Zhao Yuying, Cheng Xueqi, Zeng Qingkai, Dong Luna, Xia Yinglong, Kenthapadi Krishnaram, others. Towards Trustworthy Retrieval Aug- mented Generation for Large Language Models: A Survey // arXiv preprint arXiv:2502.06872. 2025

work page arXiv 2025

[42] [42]

Cheatagent: Attacking llm-empowered recommender systems via llm agent // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Ning Liang-bo, Wang Shijie, Fan Wenqi, Li Qing, Xu Xin, Chen Hao, Huang Feiran. Cheatagent: Attacking llm-empowered recommender systems via llm agent // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024. 2284–2295

work page 2024

[43] [43]

Nomic Embed: Training a Reproducible Long Context Text Embedder // Transactions on Machine Learning Research

Nussbaum Zach, Morris John Xavier, Mulyar Andriy, Duderstadt Brandon. Nomic Embed: Training a Reproducible Long Context Text Embedder // Transactions on Machine Learning Research. 2025. 13

work page 2025

[44] [44]

RoRA-VLM: Robust retrieval-augmented vision language models,

Qi Jingyuan, Xu Zhiyang, Shao Rulin, Chen Yang, Di Jin, Cheng Yu, Wang Qifan, Huang Lifu. RoRA-VLM: Robust Retrieval-Augmented Vision Language Models // arXiv preprint arXiv:2410.08876. 2024

work page arXiv 2024

[45] [45]

A Survey of Mamba

Qu Haohao, Ning Liangbo, An Rui, Fan Wenqi, Derr Tyler, Liu Hui, Xu Xin, Li Qing. A survey of mamba // arXiv preprint arXiv:2408.01129. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [46]

Learning transferable visual models from natural language supervision // International conference on machine learning

Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, others. Learning transferable visual models from natural language supervision // International conference on machine learning

work page

[47] [47]

Faster r-cnn: Towards real-time object detection with region proposal networks // Advances in neural information processing systems

Ren Shaoqing, He Kaiming, Girshick Ross, Sun Jian. Faster r-cnn: Towards real-time object detection with region proposal networks // Advances in neural information processing systems

work page

[48] [48]

[48]Schwenk Dustin, Khandelwal Apoorv, Clark Christopher, Marino Kenneth, Mottaghi Roozbeh

2015. [48]Schwenk Dustin, Khandelwal Apoorv, Clark Christopher, Marino Kenneth, Mottaghi Roozbeh. A-okvqa: A benchmark for visual question answering using world knowledge // European conference on computer vision. 2022. 146–162

work page 2015

[49] [49]

LLaMA: Open and Efficient Foundation Language Models

Touvron Hugo, Lavril Thibaut, Izacard Gautier, Martinet Xavier, Lachaux Marie-Anne, Lacroix Timothée, Rozière Baptiste, Goyal Naman, Hambro Eric, Azhar Faisal, others. Llama: Open and efficient foundation language models // arXiv preprint arXiv:2302.13971. 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[50] [50]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang P , Bai S, Tan S, Wang S, Fan Z, Bai J, Chen K, Liu X, Wang J, Ge W, others. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024 // URL https://arxiv. org/abs/2409.12191. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [51]

Knowledge graph retrieval-augmented generation for llm-based recommendation, 2025 b

Wang Shijie, Fan Wenqi, Feng Yue, Ma Xinyu, Wang Shuaiqiang, Yin Dawei. Knowledge Graph Retrieval-Augmented Generation for LLM-based Recommendation // arXiv preprint arXiv:2501.02226. 2025

work page arXiv 2025

[52] [52]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Wu Zhiyu, Chen Xiaokang, Pan Zizheng, Liu Xingchao, Liu Wen, Dai Damai, Gao Huazuo, Ma Yiyang, Wu Chengyue, Wang Bingxuan, others. Deepseek-vl2: Mixture-of-experts vision- language models for advanced multimodal understanding // arXiv preprint arXiv:2412.10302. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[53] [53]

EchoSight: Advancing Visual-Language Models with Wiki Knowledge // Findings of the Association for Computational Linguistics: EMNLP 2024

Yan Yibin, Xie Weidi. EchoSight: Advancing Visual-Language Models with Wiki Knowledge // Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. 1538–1551

work page 2024

[54] [54]

Instruction-guided multi-granularity segmentation and captioning with large multimodal model // Proceedings of the AAAI Confer- ence on Artificial Intelligence

Yuan Xu, Zhou Li, Sun Zenghui, Zhou Zikun, Lan Jingsong. Instruction-guided multi-granularity segmentation and captioning with large multimodal model // Proceedings of the AAAI Confer- ence on Artificial Intelligence. 2025

work page 2025

[55] [55]

MM- LLMs: Recent Advances in MultiModal Large Language Models // Findings of the Association for Computational Linguistics: ACL 2024

Zhang Duzhen, Yu Yahan, Dong Jiahua, Li Chenxing, Su Dan, Chu Chenhui, Yu Dong. MM- LLMs: Recent Advances in MultiModal Large Language Models // Findings of the Association for Computational Linguistics: ACL 2024. 2024. 12401–12430

work page 2024

[56] [56]

mR2AG: Multimodal Retrieval-Reflection- Augmented Generation for Knowledge-Based VQA // arXiv preprint arXiv:2411.15041

Zhang Tao, Zhang Ziqi, Ma Zongyang, Chen Yuxin, Qi Zhongang, Yuan Chunfeng, Li Bing, Pu Junfu, Zhao Yuxuan, Xie Zehua, others. mR2AG: Multimodal Retrieval-Reflection- Augmented Generation for Knowledge-Based VQA // arXiv preprint arXiv:2411.15041. 2024

work page arXiv 2024

[57] [57]

Recommender systems in the era of large language models (llms) // IEEE Transactions on Knowledge and Data Engineering

Zhao Zihuai, Fan Wenqi, Li Jiatong, Liu Yunqing, Mei Xiaowei, Wang Yiqi, Wen Zhen, Wang Fei, Zhao Xiangyu, Tang Jiliang, others. Recommender systems in the era of large language models (llms) // IEEE Transactions on Knowledge and Data Engineering. 2024

work page 2024

[58] [58]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Zhu Jinguo, Wang Weiyun, Chen Zhe, Liu Zhaoyang, Ye Shenglong, Gu Lixin, Tian Hao, Duan Yuchen, Su Weijie, Shao Jie, others. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models // arXiv preprint arXiv:2504.10479. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [59]

Knowledge graph-guided retrieval augmented generation, 2025

Zhu Xiangrong, Xie Yuexiang, Liu Yi, Li Yaliang, Hu Wei. Knowledge Graph-Guided Retrieval Augmented Generation // arXiv preprint arXiv:2502.06864. 2025. 14 A. Prompt Design In our multimodal knowledge graph construction pipeline, we utilize LLMs’ text understanding and generation capabilities to extract textual knowledge graphs automatically by providing ...

work page arXiv 2025

[60] [60]

relation

entity-description: Comprehensive description of the entity’s attributes and activities. Each textual relationship are formatted as (“relation”|<source-entity>|<target-entity>| <relation-description>|<relation-strength>), which contains the following infor- mation: 1)source-entity: name of the source entity, as defined in the textual entities; 2)target-en...

work page

[61] [61]

relation-description: explanation as to why the source entity and the target entity are related to each other

work page

[62] [62]

matching

relation-strength: a numeric score indicating the strength of the relationship between the source and target entities, ranging from 0 to 10. The scene graph provides the object and relationship information in the image, which is formatted as: -<object-0>:<object-category>,<object-bbox> -<object-1>:<object-category>,<object-bbox> ... -<relation-0>:<object-...

work page