mKG-RAG: Leveraging Multimodal Knowledge Graphs in Retrieval-Augmented Generation for Knowledge-intensive VQA
Pith reviewed 2026-05-19 00:00 UTC · model grok-4.3
The pith
Multimodal knowledge graphs in RAG frameworks yield more accurate answers for knowledge-intensive visual question answering by structuring external information and reducing irrelevant content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
mKG-RAG distills high-quality multimodal knowledge graphs from documents using MLLM-driven graph extraction and vision-text matching to produce semantically consistent and modality-complementary entities and relations. It then applies a dual-stage retrieval strategy with a query-aware multimodal retriever to improve efficiency and progressively increase precision. When these graphs replace unstructured documents inside the RAG pipeline for knowledge-intensive VQA, answer accuracy and reliability increase substantially, establishing new state-of-the-art performance.
What carries the argument
Multimodal knowledge graph constructed via MLLM-driven graph extraction and vision-text matching, which supplies structured, modality-complementary knowledge representations that support dual-stage retrieval and generation.
If this is right
- Structured relations among knowledge elements reduce the retrieval of irrelevant or misleading content.
- Modality-complementary entities and relations improve the reliability of answers generated by the underlying MLLM.
- Dual-stage retrieval progressively refines precision while preserving computational efficiency.
- The overall framework produces new state-of-the-art results on standard knowledge-based VQA benchmarks.
Where Pith is reading between the lines
- The same graph-construction pipeline could be tested on other multimodal reasoning tasks that currently rely on unstructured retrieval.
- Error patterns in the extracted graphs might reveal which types of queries remain hardest for structured RAG approaches.
- Replacing unstructured sources with these KGs could lower hallucination rates by anchoring generations to explicit cross-modal relations.
Load-bearing premise
MLLM-driven graph extraction combined with vision-text matching can reliably produce semantically consistent and modality-complementary entities and relations from multimodal documents without substantial errors or noise that would degrade downstream retrieval and generation.
What would settle it
An experiment in which the constructed multimodal KGs contain frequent inconsistencies or noise that cause lower VQA accuracy than equivalent unstructured-document RAG baselines would disprove the central benefit.
Figures
read the original abstract
Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for expanding the knowledge capacity of Multimodal Large Language Models (MLLMs) by incorporating external knowledge sources into the generation process, and has been widely adopted for knowledge-based Visual Question Answering (VQA). Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relations among knowledge elements frequently introduce irrelevant or misleading content, degrading answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks, thereby enhancing generation through structured multimodal knowledge. To this end, this paper proposes mKG-RAG, a novel retrieval-augmented generation framework built upon multimodal KGs for knowledge-intensive VQA tasks. Specifically, mKG-RAG leverages MLLM-driven graph extraction and vision-text matching to distill semantically consistent, modality-complementary entities and relations from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. Furthermore, a dual-stage retrieval strategy equipped with a query-aware multimodal retriever is introduced to improve retrieval efficiency while progressively refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing approaches and sets new state-of-the-art results for knowledge-based VQA. The code is available at https://github.com/xandery-geek/mKG-RAG.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes mKG-RAG, a novel RAG framework for knowledge-intensive VQA that constructs multimodal knowledge graphs via MLLM-driven graph extraction and vision-text matching to obtain semantically consistent, modality-complementary entities and relations from documents; it further introduces a dual-stage retrieval strategy with a query-aware multimodal retriever and reports that comprehensive experiments show significant outperformance over prior methods with new state-of-the-art results. Code is released.
Significance. If the results are robust, the work would demonstrate a concrete benefit of structured multimodal knowledge over unstructured documents in RAG pipelines for VQA, addressing a recognized limitation in current approaches. The public code release aids reproducibility.
major comments (1)
- [§3] §3 (Graph Construction / Pipeline): The central claim that mKG-RAG sets new SOTA rests on the untested assumption that MLLM-driven extraction plus vision-text matching yields high-quality, low-noise multimodal KGs. No quantitative extraction metrics (entity/relation F1 vs. human gold, error rates, or consistency scores) or error-propagation analysis are reported, leaving open the possibility that downstream gains are artifacts of prompting rather than the KG structure itself.
minor comments (1)
- [Abstract] Abstract: the phrase 'comprehensive experiments' would benefit from a parenthetical listing of the primary datasets and main baselines to give readers immediate context.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment on the graph construction evaluation below and describe the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [§3] §3 (Graph Construction / Pipeline): The central claim that mKG-RAG sets new SOTA rests on the untested assumption that MLLM-driven extraction plus vision-text matching yields high-quality, low-noise multimodal KGs. No quantitative extraction metrics (entity/relation F1 vs. human gold, error rates, or consistency scores) or error-propagation analysis are reported, leaving open the possibility that downstream gains are artifacts of prompting rather than the KG structure itself.
Authors: We agree that direct quantitative evaluation of the multimodal KG extraction quality would provide stronger support for our claims and help rule out prompting artifacts. The current manuscript emphasizes end-to-end VQA results and ablations that isolate the contribution of the structured KG (e.g., comparisons against unstructured document retrieval), but does not report extraction-level metrics such as entity/relation F1 against human gold standards or explicit error-propagation studies. In the revised version we will add a dedicated analysis in Section 3 (or a new subsection in the experiments) that reports precision, recall, and F1 scores for entities and relations on a randomly sampled subset of documents annotated by human experts. We will also include an error-propagation study that measures VQA performance degradation when controlled extraction noise is introduced versus when the KG is manually corrected. These additions will clarify that performance gains derive from the modality-complementary structure obtained via MLLM-driven extraction and vision-text matching rather than from prompting alone. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper proposes mKG-RAG as a framework that constructs multimodal KGs via MLLM-driven extraction plus vision-text matching, then applies dual-stage retrieval for knowledge-based VQA. No equations, fitted parameters, or self-referential definitions appear in the provided abstract or pipeline description that would reduce the reported SOTA performance gains to quantities defined by the inputs themselves. The components are presented as independently motivated constructions evaluated on external benchmarks, satisfying the criteria for a self-contained derivation with no load-bearing reductions to self-citation chains or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Multimodal documents contain extractable entities and relations that MLLMs plus vision-text matching can distill into semantically consistent, modality-complementary multimodal KGs.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
leverages MLLM-driven graph extraction and vision-text matching to distill semantically consistent, modality-complementary entities and relations from multimodal documents, constructing high-quality multimodal KGs
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
dual-stage retrieval strategy equipped with a query-aware multimodal retriever
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.
-
MG$^2$-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation
MG²-RAG proposes a multi-granularity graph RAG framework that constructs hierarchical multimodal nodes via entity-driven visual grounding and performs structured retrieval, delivering SOTA results on four multimodal t...
-
QKVQA: Question-Focused Filtering for Knowledge-based VQA
QKVQA proposes a question-focused filtering method with QFF and CDA modules that boosts accuracy by 3.2 points on Encyclopedic-VQA and 2.2 points on InfoSeek over prior state-of-the-art.
Reference graph
Works this paper leans on
-
[1]
Alayrac Jean-Baptiste, Donahue Jeff, Luc Pauline, Miech Antoine, Barr Iain, Hasson Yana, Lenc Karel, Mensch Arthur, Millican Katherine, Reynolds Malcolm, others. Flamingo: a visual language model for few-shot learning // Advances in neural information processing systems
-
[2]
Antol Stanislaw, Agrawal Aishwarya, Lu Jiasen, Mitchell Margaret, Batra Dhruv, Zitnick C Lawrence, Parikh Devi. Vqa: Visual question answering // Proceedings of the IEEE international conference on computer vision. 2015. 2425–2433
work page 2015
-
[3]
Bai Shuai, Chen Keqin, Liu Xuejing, Wang Jialin, Ge Wenbin, Song Sibo, Dang Kai, Wang Peng, Wang Shijie, Tang Jun, Zhong Humen, Zhu Yuanzhi, Yang Mingkun, Li Zhaohai, Wan Jianqiang, Wang Pengfei, Ding Wei, Fu Zheren, Xu Yiheng, Ye Jiabo, Zhang Xi, Xie Tianbao, Cheng Zesen, Zhang Hang, Yang Zhibo, Xu Haiyang, Lin Junyang. Qwen2.5-VL Technical Report. 2025
work page 2025
-
[4]
Caffagni Davide, Cocchi Federico, Moratelli Nicholas, Sarto Sara, Cornia Marcella, Baraldi Lorenzo, Cucchiara Rita. Wiki-llava: Hierarchical retrieval-augmented generation for mul- timodal llms // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. 1818–1826
work page 2024
-
[5]
Chen Ting, Kornblith Simon, Norouzi Mohammad, Hinton Geoffrey. A simple framework for contrastive learning of visual representations // International conference on machine learning
-
[6]
Chen Yang, Hu Hexiang, Luan Yi, Sun Haitian, Changpinyo Soravit, Ritter Alan, Chang Ming-Wei. Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. 14948–14968
work page 2023
-
[7]
Chen Zhe, Wu Jiannan, Wang Wenhai, Su Weijie, Chen Guo, Xing Sen, Zhong Muyan, Zhang Qinglong, Zhu Xizhou, Lu Lewei, others. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024. 24185–24198
work page 2024
-
[8]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality // See https://vicuna
Chiang Wei-Lin, Li Zhuohan, Lin Ziqing, Sheng Ying, Wu Zhanghao, Zhang Hao, Zheng Lianmin, Zhuang Siyuan, Zhuang Yonghao, Gonzalez Joseph E, others. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality // See https://vicuna. lmsys. org (accessed 14 April 2023). 2023. 2, 3. 6
work page 2023
-
[9]
Scaling instruction-finetuned language models // Journal of Machine Learning Research
Chung Hyung Won, Hou Le, Longpre Shayne, Zoph Barret, Tay Yi, Fedus William, Li Yunxuan, Wang Xuezhi, Dehghani Mostafa, Brahma Siddhartha, others. Scaling instruction-finetuned language models // Journal of Machine Learning Research. 2024. 25, 70. 1–53
work page 2024
-
[10]
Cocchi Federico, Moratelli Nicholas, Caffagni Davide, Sarto Sara, Baraldi Lorenzo, Cor- nia Marcella, Cucchiara Rita. LLaV A-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning // arXiv preprint arXiv:2503.15621. 2025
-
[11]
Cocchi Federico, Moratelli Nicholas, Cornia Marcella, Baraldi Lorenzo, Cucchiara Rita. Aug- menting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2025
work page 2025
-
[12]
Cui Can, Ma Yunsheng, Cao Xu, Ye Wenqian, Zhou Yang, Liang Kaizhao, Chen Jintai, Lu Juanwu, Yang Zichong, Liao Kuei-Da, others. A survey on multimodal large language mod- els for autonomous driving // Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2024. 958–979
work page 2024
-
[13]
Dai Wenliang, Li Junnan, Li Dongxu, Tiong Anthony, Zhao Junqi, Wang Weisheng, Li Boyang, Fung Pascale, Hoi Steven. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning // Thirty-seventh Conference on Neural Information Processing Systems. 2023. 11
work page 2023
-
[14]
Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, Uszkoreit Jakob, Houlsby Neil. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale // International Conference on Learning Representations. 2021
work page 2021
-
[15]
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Edge Darren, Trinh Ha, Cheng Newman, Bradley Joshua, Chao Alex, Mody Apurva, Truitt Steven, Metropolitansky Dasha, Ness Robert Osazuwa, Larson Jonathan. From local to global: A graph rag approach to query-focused summarization // arXiv preprint arXiv:2404.16130. 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Fan Wenqi, Ding Yujuan, Ning Liangbo, Wang Shijie, Li Hengyun, Yin Dawei, Chua Tat-Seng, Li Qing. A survey on rag meeting llms: Towards retrieval-augmented large language models // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
-
[17]
6491–6501. [17]Fan Wenqi, Wang Shijie, Huang Jiani, Chen Zhikai, Song Yu, Tang Wenzhuo, Mao Haitao, Liu Hui, Liu Xiaorui, Yin Dawei, others. Graph machine learning in the era of large language models (llms) // arXiv preprint arXiv:2404.14928. 2024
work page internal anchor Pith review arXiv 2024
-
[18]
arXiv preprint arXiv:2501.10282 (2025)
Fan Wenqi, Zhou Yi, Wang Shijie, Yan Yuyao, Liu Hui, Zhao Qian, Song Le, Li Qing. Com- putational Protein Science in the Era of Large Language Models (LLMs) // arXiv preprint arXiv:2501.10282. 2025
-
[19]
Goyal Yash, Khot Tejas, Summers-Stay Douglas, Batra Dhruv, Parikh Devi. Making the v in vqa matter: Elevating the role of image understanding in visual question answering // Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. 6904–6913
work page 2017
-
[20]
LightRAG: Simple and Fast Retrieval-Augmented Generation
Guo Zirui, Xia Lianghao, Yu Yanhua, Ao Tu, Huang Chao. LightRAG: Simple and Fast Retrieval-Augmented Generation // arXiv preprint arXiv:2410.05779. 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
He Kaiming, Fan Haoqi, Wu Yuxin, Xie Saining, Girshick Ross. Momentum contrast for unsupervised visual representation learning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020. 9729–9738
work page 2020
-
[22]
He Xiaoxin, Tian Yijun, Sun Yifei, Chawla Nitesh, Laurent Thomas, LeCun Yann, Bresson Xavier, Hooi Bryan. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering // Advances in Neural Information Processing Systems. 37. 2024. 132876–132907
work page 2024
-
[23]
Knowledge graphs // ACM Computing Surveys (Csur)
Hogan Aidan, Blomqvist Eva, Cochez Michael, d’Amato Claudia, Melo Gerard De, Gutierrez Claudio, Kirrane Sabrina, Gayo José Emilio Labra, Navigli Roberto, Neumaier Sebastian, others. Knowledge graphs // ACM Computing Surveys (Csur). 2021. 54, 4. 1–37
work page 2021
-
[24]
Lora: Low-rank adaptation of large language models
Hu Edward J, Shen Yelong, Wallis Phillip, Allen-Zhu Zeyuan, Li Yuanzhi, Wang Shean, Wang Lu, Chen Weizhu, others. Lora: Low-rank adaptation of large language models. // ICLR. 2022. 1, 2. 3
work page 2022
-
[25]
Im Jinbae, Nam JeongYeon, Park Nokyung, Lee Hyungmin, Park Seunghyun. Egtr: Extracting graph from transformer for scene graph generation // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. 24229–24238
work page 2024
-
[26]
Billion-scale similarity search with GPUs // IEEE Transactions on Big Data
Johnson Jeff, Douze Matthijs, Jégou Hervé. Billion-scale similarity search with GPUs // IEEE Transactions on Big Data. 2019. 7, 3. 535–547
work page 2019
-
[27]
LUO LINHAO, Li Yuan-Fang, Haffari Gholamreza, Pan Shirui. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning // The Twelfth International Conference on Learning Representations. 2024
work page 2024
-
[28]
Li Junnan, Li Dongxu, Savarese Silvio, Hoi Steven. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models // International conference on machine learning. 2023. 19730–19742. 12
work page 2023
-
[29]
Liang Victor Weixin, Zhang Yuhui, Kwon Yongchan, Yeung Serena, Zou James Y. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning // Advances in Neural Information Processing Systems. 2022. 35. 17612–17625
work page 2022
-
[30]
Lin Ji, Yin Hongxu, Ping Wei, Molchanov Pavlo, Shoeybi Mohammad, Han Song. Vila: On pre- training for visual language models // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024. 26689–26699
work page 2024
-
[31]
Lin Weizhe, Byrne Bill. Retrieval Augmented Visual Question Answering with Outside Knowl- edge // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. 11238–11254
work page 2022
-
[32]
Lin Weizhe, Chen Jinghong, Mei Jingbiao, Coca Alexandru, Byrne Bill. Fine-grained late- interaction multi-modal retrieval for retrieval augmented visual question answering // Advances in Neural Information Processing Systems. 36. 2023. 22820–22840
work page 2023
-
[33]
Medical visual question answering: A survey // Artificial Intelligence in Medicine
Lin Zhihong, Zhang Donghao, Tao Qingyi, Shi Danli, Haffari Gholamreza, Wu Qi, He Ming- guang, Ge Zongyuan. Medical visual question answering: A survey // Artificial Intelligence in Medicine. 2023. 143. 102611
work page 2023
-
[34]
Liu Haotian, Li Chunyuan, Li Yuheng, Lee Yong Jae. Improved baselines with visual instruction tuning // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
-
[35]
Visual instruction tuning // Advances in neural information processing systems
Liu Haotian, Li Chunyuan, Wu Qingyang, Lee Yong Jae. Visual instruction tuning // Advances in neural information processing systems. 36. 2023. 34892–34916
work page 2023
-
[36]
Liu Ye, Li Hui, Garcia-Duran Alberto, Niepert Mathias, Onoro-Rubio Daniel, Rosenblum David S. MMKG: multi-modal knowledge graphs // The semantic web: 16th international conference, ESWC 2019, portorož, Slovenia, June 2–6, 2019, proceedings 16. 2019. 459–474
work page 2019
-
[37]
Luo Haoran, Chen Guanting, Zheng Yandan, Wu Xiaobao, Guo Yikai, Lin Qika, Feng Yu, Kuang Zemin, Song Meina, Zhu Yifan, others. HyperGraphRAG: Retrieval-Augmented Generation with Hypergraph-Structured Knowledge Representation // arXiv preprint arXiv:2503.21322. 2025
-
[38]
Ma Shengjie, Xu Chengjin, Jiang Xuhui, Li Muzhi, Qu Huaren, Yang Cehao, Mao Jiaxin, Guo Jian. Think-on-Graph 2.0: Deep and Faithful Large Language Model Reasoning with Knowledge-guided Retrieval Augmented Generation // The Thirteenth International Conference on Learning Representations. 2025
work page 2025
-
[39]
Marino Kenneth, Rastegari Mohammad, Farhadi Ali, Mottaghi Roozbeh. Ok-vqa: A visual question answering benchmark requiring external knowledge // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. 3195–3204
work page 2019
-
[40]
Mensink Thomas, Uijlings Jasper, Castrejon Lluis, Goel Arushi, Cadar Felipe, Zhou Howard, Sha Fei, Araujo André, Ferrari Vittorio. Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories // Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. 3113–3124
work page 2023
-
[41]
Towards trustworthy re- trieval augmented generation for large language models: A survey
Ni Bo, Liu Zheyuan, Wang Leyao, Lei Yongjia, Zhao Yuying, Cheng Xueqi, Zeng Qingkai, Dong Luna, Xia Yinglong, Kenthapadi Krishnaram, others. Towards Trustworthy Retrieval Aug- mented Generation for Large Language Models: A Survey // arXiv preprint arXiv:2502.06872. 2025
-
[42]
Ning Liang-bo, Wang Shijie, Fan Wenqi, Li Qing, Xu Xin, Chen Hao, Huang Feiran. Cheatagent: Attacking llm-empowered recommender systems via llm agent // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024. 2284–2295
work page 2024
-
[43]
Nussbaum Zach, Morris John Xavier, Mulyar Andriy, Duderstadt Brandon. Nomic Embed: Training a Reproducible Long Context Text Embedder // Transactions on Machine Learning Research. 2025. 13
work page 2025
-
[44]
RoRA-VLM: Robust retrieval-augmented vision language models,
Qi Jingyuan, Xu Zhiyang, Shao Rulin, Chen Yang, Di Jin, Cheng Yu, Wang Qifan, Huang Lifu. RoRA-VLM: Robust Retrieval-Augmented Vision Language Models // arXiv preprint arXiv:2410.08876. 2024
-
[45]
Qu Haohao, Ning Liangbo, An Rui, Fan Wenqi, Derr Tyler, Liu Hui, Xu Xin, Li Qing. A survey of mamba // arXiv preprint arXiv:2408.01129. 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, others. Learning transferable visual models from natural language supervision // International conference on machine learning
-
[47]
Ren Shaoqing, He Kaiming, Girshick Ross, Sun Jian. Faster r-cnn: Towards real-time object detection with region proposal networks // Advances in neural information processing systems
-
[48]
[48]Schwenk Dustin, Khandelwal Apoorv, Clark Christopher, Marino Kenneth, Mottaghi Roozbeh
2015. [48]Schwenk Dustin, Khandelwal Apoorv, Clark Christopher, Marino Kenneth, Mottaghi Roozbeh. A-okvqa: A benchmark for visual question answering using world knowledge // European conference on computer vision. 2022. 146–162
work page 2015
-
[49]
LLaMA: Open and Efficient Foundation Language Models
Touvron Hugo, Lavril Thibaut, Izacard Gautier, Martinet Xavier, Lachaux Marie-Anne, Lacroix Timothée, Rozière Baptiste, Goyal Naman, Hambro Eric, Azhar Faisal, others. Llama: Open and efficient foundation language models // arXiv preprint arXiv:2302.13971. 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Wang P , Bai S, Tan S, Wang S, Fan Z, Bai J, Chen K, Liu X, Wang J, Ge W, others. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024 // URL https://arxiv. org/abs/2409.12191. 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[51]
Knowledge graph retrieval-augmented generation for llm-based recommendation, 2025 b
Wang Shijie, Fan Wenqi, Feng Yue, Ma Xinyu, Wang Shuaiqiang, Yin Dawei. Knowledge Graph Retrieval-Augmented Generation for LLM-based Recommendation // arXiv preprint arXiv:2501.02226. 2025
-
[52]
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Wu Zhiyu, Chen Xiaokang, Pan Zizheng, Liu Xingchao, Liu Wen, Dai Damai, Gao Huazuo, Ma Yiyang, Wu Chengyue, Wang Bingxuan, others. Deepseek-vl2: Mixture-of-experts vision- language models for advanced multimodal understanding // arXiv preprint arXiv:2412.10302. 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
Yan Yibin, Xie Weidi. EchoSight: Advancing Visual-Language Models with Wiki Knowledge // Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. 1538–1551
work page 2024
-
[54]
Yuan Xu, Zhou Li, Sun Zenghui, Zhou Zikun, Lan Jingsong. Instruction-guided multi-granularity segmentation and captioning with large multimodal model // Proceedings of the AAAI Confer- ence on Artificial Intelligence. 2025
work page 2025
-
[55]
Zhang Duzhen, Yu Yahan, Dong Jiahua, Li Chenxing, Su Dan, Chu Chenhui, Yu Dong. MM- LLMs: Recent Advances in MultiModal Large Language Models // Findings of the Association for Computational Linguistics: ACL 2024. 2024. 12401–12430
work page 2024
-
[56]
Zhang Tao, Zhang Ziqi, Ma Zongyang, Chen Yuxin, Qi Zhongang, Yuan Chunfeng, Li Bing, Pu Junfu, Zhao Yuxuan, Xie Zehua, others. mR2AG: Multimodal Retrieval-Reflection- Augmented Generation for Knowledge-Based VQA // arXiv preprint arXiv:2411.15041. 2024
-
[57]
Zhao Zihuai, Fan Wenqi, Li Jiatong, Liu Yunqing, Mei Xiaowei, Wang Yiqi, Wen Zhen, Wang Fei, Zhao Xiangyu, Tang Jiliang, others. Recommender systems in the era of large language models (llms) // IEEE Transactions on Knowledge and Data Engineering. 2024
work page 2024
-
[58]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Zhu Jinguo, Wang Weiyun, Chen Zhe, Liu Zhaoyang, Ye Shenglong, Gu Lixin, Tian Hao, Duan Yuchen, Su Weijie, Shao Jie, others. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models // arXiv preprint arXiv:2504.10479. 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Knowledge graph-guided retrieval augmented generation, 2025
Zhu Xiangrong, Xie Yuexiang, Liu Yi, Li Yaliang, Hu Wei. Knowledge Graph-Guided Retrieval Augmented Generation // arXiv preprint arXiv:2502.06864. 2025. 14 A. Prompt Design In our multimodal knowledge graph construction pipeline, we utilize LLMs’ text understanding and generation capabilities to extract textual knowledge graphs automatically by providing ...
-
[60]
entity-description: Comprehensive description of the entity’s attributes and activities. Each textual relationship are formatted as (“relation”|<source-entity>|<target-entity>| <relation-description>|<relation-strength>), which contains the following infor- mation: 1)source-entity: name of the source entity, as defined in the textual entities; 2)target-en...
-
[61]
relation-description: explanation as to why the source entity and the target entity are related to each other
-
[62]
relation-strength: a numeric score indicating the strength of the relationship between the source and target entities, ranging from 0 to 10. The scene graph provides the object and relationship information in the image, which is formatted as: -<object-0>:<object-category>,<object-bbox> -<object-1>:<object-category>,<object-bbox> ... -<relation-0>:<object-...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.