pith. machine review for the scientific record.

arxiv: 2604.04969 · v1 · submitted 2026-04-04 · 💻 cs.IR · cs.AI

Recognition: 2 theorem links · Lean Theorem

MG²-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords multimodal RAG · knowledge graph · graph retrieval · multimodal reasoning · retrieval augmented generation · cross-modal fusion · visual grounding
0 comments

The pith

MG²-RAG fuses textual entities and visual regions into unified graph nodes for faster multimodal RAG with state-of-the-art results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MG²-RAG to improve retrieval-augmented generation for multimodal large language models. Flat vector methods miss structural links, while existing graph approaches rely on expensive translation-to-text pipelines that discard visual details. MG²-RAG builds a hierarchical multimodal knowledge graph through lightweight textual parsing combined with entity-driven visual grounding, creating unified nodes that keep atomic evidence from both modalities. A multi-granularity retrieval step then aggregates similarities and propagates relevance to support multi-hop cross-modal reasoning. Experiments across retrieval, knowledge-based VQA, reasoning, and classification show state-of-the-art performance plus an average 43.3× speedup and 23.9× cost reduction in graph construction.

Core claim

MG²-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, a multi-granularity graph retrieval mechanism aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. Extensive experiments across four representative multimodal tasks demonstrate that MG²-RAG consistently achieves state-of-the-art performance while reducing graph construction overhead, with an average 43.3× speedup and 23.9× cost reduction compared with advanced graph-based frameworks.
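The paper's retrieval equations are not reproduced on this page. As a reading aid, here is a minimal sketch of what "aggregates dense similarities and propagates relevance across the graph" could look like if implemented as a damped random-walk propagation in the style of topic-sensitive PageRank (reference [27] below); the adjacency matrix, seed similarities, damping factor, and iteration count are illustrative assumptions, not the paper's specification.

    import numpy as np

    def propagate_relevance(adj: np.ndarray, seed: np.ndarray,
                            alpha: float = 0.85, iters: int = 50) -> np.ndarray:
        """Spread query relevance from similarity-seeded nodes over the graph.

        adj  : (n, n) adjacency matrix of the multimodal knowledge graph
        seed : (n,) initial relevance, e.g. cosine similarity between the
               query embedding and each unified node's embedding
        """
        # Column-normalize so each node distributes its relevance mass.
        col_sums = adj.sum(axis=0, keepdims=True)
        trans = np.divide(adj, col_sums, out=np.zeros_like(adj, dtype=float),
                          where=col_sums > 0)
        seed = seed / max(seed.sum(), 1e-12)  # normalized restart distribution
        scores = seed.copy()
        for _ in range(iters):
            scores = alpha * (trans @ scores) + (1 - alpha) * seed
        return scores

    # Toy 4-node graph: nodes 0 and 1 match the query directly; node 3 is
    # reachable only through node 2, so it gains relevance by propagation,
    # which is the multi-hop behavior the paper targets.
    adj = np.array([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
    seed = np.array([0.9, 0.8, 0.1, 0.0])
    print(propagate_relevance(adj, seed).round(3))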

What carries the argument

Unified multimodal nodes formed by fusing textual entities and visual regions via lightweight textual parsing and entity-driven visual grounding.
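To make the load-bearing machinery concrete, here is a minimal sketch of what a unified multimodal node could look like, assuming a cosine-similarity grounding step and fusion by embedding average; the data layout, the 0.5 threshold, and the averaging operator are our illustrative assumptions, not the paper's actual procedure.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class MultimodalNode:
        entity: str                     # textual entity from lightweight parsing
        region_box: tuple | None        # grounded region (x1, y1, x2, y2), if any
        text_emb: np.ndarray            # embedding of the entity mention
        region_emb: np.ndarray | None   # embedding of the grounded visual region

        @property
        def fused_emb(self) -> np.ndarray:
            # Illustrative fusion: average the two views when both exist,
            # so the node stays queryable from either modality.
            if self.region_emb is None:
                return self.text_emb
            return (self.text_emb + self.region_emb) / 2.0

    def ground_entity(text_emb, region_embs, boxes, threshold=0.5):
        """Entity-driven grounding sketch: link the entity to its most similar
        image region, but only if similarity clears a threshold; otherwise the
        node stays text-only and no visual detail is fabricated."""
        sims = [float(text_emb @ r) /
                (np.linalg.norm(text_emb) * np.linalg.norm(r) + 1e-12)
                for r in region_embs]
        best = int(np.argmax(sims))
        if sims[best] < threshold:
            return None, None
        return boxes[best], region_embs[best]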

If this is right

  • Enables structured multi-hop reasoning across modalities without discarding fine-grained visual information.
  • Delivers state-of-the-art results on retrieval, knowledge-based VQA, reasoning, and classification tasks.
  • Cuts graph construction time by an average 43.3× and cost by 23.9× versus advanced graph-based frameworks.
  • Supports complex cross-modal reasoning in retrieval-augmented generation systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The node-fusion approach could extend to additional modalities such as audio by adapting the entity-grounding step.
  • Reduced construction cost may make graph-based multimodal RAG feasible for much larger knowledge bases.
  • Preservation of atomic evidence in nodes may reduce hallucinations more reliably than flat vector retrieval.
  • The multi-granularity propagation idea could be tested in non-graph retrieval systems for similar gains.

Load-bearing premise

Combining lightweight textual parsing with entity-driven visual grounding produces unified multimodal nodes that preserve atomic evidence without information loss or alignment errors during fusion.

What would settle it

A controlled comparison on a multi-hop reasoning task where fused nodes are measured against separate text-plus-image retrieval, checking whether accuracy drops due to lost visual details or fusion errors.
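A minimal harness for that comparison, assuming each question is a dict with "query" and "answer" fields and that the retrieval and generation callables are hypothetical stand-ins supplied by the experimenter (none of these names come from the paper):

    def compare_retrieval_modes(questions, retrieve_fused, retrieve_separate,
                                answer_with):
        """Score answer accuracy under fused-node retrieval versus separate
        text-plus-image retrieval on the same multi-hop question set."""
        def accuracy(retrieve):
            hits = sum(
                answer_with(retrieve(q["query"]), q["query"]) == q["answer"]
                for q in questions
            )
            return hits / len(questions)
        return {"fused_nodes": accuracy(retrieve_fused),
                "text_plus_image": accuracy(retrieve_separate)}

    # A fused-node win on multi-hop items, with no loss on detail-sensitive
    # items, would support the preservation claim; the reverse would localize
    # the failure to fusion or grounding errors.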

Figures

Figures reproduced from arXiv: 2604.04969 by Jun Yu, Qiang Huang, Sijun Dai, Xiaoxing You.

Figure 1. Existing work employs text-centric graph extraction that discards visual details and incurs high costs. MG²-RAG efficiently fuses textual entities and visual objects into unified multimodal nodes while preserving atomic information. view at source ↗
Figure 2. Overview of MG²-RAG. The framework consists of two modules: (a) Multimodal Knowledge Graph Construction, which transforms a multimodal knowledge base into a hierarchical fine-grained multimodal knowledge graph (MMKG) via textual parsing and entity-driven visual grounding; (b) Multi-Granularity Graph Retrieval, which employs a multi-granularity retrieval mechanism, driven by graph propagation, to retrieve the… view at source ↗
Figure 3. Graph construction efficiency. The top and bottom plots show construction time (hours) and cost ($), respectively. Red annotations above the MG²-RAG bars indicate the efficiency gain (speedup or cost reduction) relative to the strongest baseline. view at source ↗
Figure 4. Case studies on Knowledge-based VQA and Multimodal Retrieval. view at source ↗
Figure 5. Rule-based Relation Extraction Examples. view at source ↗
Figure 6. Case studies on Multimodal Reasoning. MG²-RAG is compared with a baseline MLLM (Qwen3.5-27B) and graph-based approaches (VaLiK and MMGraphRAG). The Retrieved Evidence illustrates the multimodal evidence retrieved by MG²-RAG for answer generation. view at source ↗
Figure 7. Case studies on Multimodal Classification. view at source ↗
read the original abstract

Retrieval-Augmented Generation (RAG) mitigates hallucinations in Multimodal Large Language Models (MLLMs), yet existing systems struggle with complex cross-modal reasoning. Flat vector retrieval often ignores structural dependencies, while current graph-based methods rely on costly "translation-to-text" pipelines that discard fine-grained visual information. To address these limitations, we propose MG²-RAG, a lightweight Multi-Granularity Graph RAG framework that jointly improves graph construction, modality fusion, and cross-modal retrieval. MG²-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, we introduce a multi-granularity graph retrieval mechanism that aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. Extensive experiments across four representative multimodal tasks (i.e., retrieval, knowledge-based VQA, reasoning, and classification) demonstrate that MG²-RAG consistently achieves state-of-the-art performance while reducing graph construction overhead with an average 43.3× speedup and 23.9× cost reduction compared with advanced graph-based frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MG²-RAG, a lightweight multi-granularity graph RAG framework for multimodal LLMs. It builds a hierarchical multimodal knowledge graph by fusing textual entities and visual regions into unified nodes via lightweight textual parsing and entity-driven visual grounding. A multi-granularity retrieval mechanism aggregates dense similarities and propagates relevance for structured multi-hop reasoning. Experiments across retrieval, knowledge-based VQA, reasoning, and classification tasks claim SOTA performance with an average 43.3× speedup and 23.9× cost reduction in graph construction versus advanced graph-based baselines.

Significance. If the fusion preserves atomic evidence and the efficiency claims are substantiated, the work could make graph-based multimodal RAG more scalable by avoiding costly translation-to-text pipelines while supporting cross-modal reasoning. The reported speedups would be a notable practical contribution if they hold under rigorous ablations.

major comments (2)
  1. [Abstract, §4] The abstract asserts SOTA results and 43.3×/23.9× efficiency gains across four tasks, yet provides no metrics, baselines, ablation details, or error analysis; without these in the experimental section, the central performance and speedup claims cannot be evaluated.
  2. [§3.2] Fusion into unified multimodal nodes: the claim that entity-driven visual grounding produces nodes that 'preserve atomic evidence' without information loss or alignment errors is load-bearing for the entire framework, but no quantitative validation (e.g., modality-specific retention rates, alignment error metrics, or ablation on fine-grained visual detail survival) is supplied.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta or speedup factor) to support the SOTA claim.
  2. [§3.3] Notation for the multi-granularity retrieval (dense similarity aggregation and relevance propagation) should be formalized with an equation or algorithm box for clarity; one possible notation is sketched after this list.
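As an editorial illustration of what that notation might look like (the symbols here are ours, not the paper's): seed each node with its best dense similarity to the query across granularities, then propagate scores with a damped walk over the graph.

    \[
    s_0(v) = \max_{g \in \mathcal{G}} \cos\bigl(\mathbf{e}_q, \mathbf{e}_v^{(g)}\bigr),
    \qquad
    \mathbf{s}_{t+1} = \alpha\, \tilde{A}\, \mathbf{s}_t + (1 - \alpha)\, \mathbf{s}_0,
    \]

where \(\mathcal{G}\) is the set of granularities, \(\mathbf{e}_v^{(g)}\) is the embedding of node \(v\) at granularity \(g\), \(\tilde{A}\) is the normalized adjacency of the multimodal knowledge graph, and \(\alpha\) is a damping factor.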

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and substantiation of our claims.

read point-by-point responses
  1. Referee: [Abstract, §4] The abstract asserts SOTA results and 43.3×/23.9× efficiency gains across four tasks, yet provides no metrics, baselines, ablation details, or error analysis; without these in the experimental section, the central performance and speedup claims cannot be evaluated.

    Authors: We appreciate the referee highlighting the need for more explicit support of the claims. Section 4 already reports task-specific metrics, baseline comparisons, and efficiency measurements (speedup and cost reduction) across the four tasks. To address the concern directly, we will revise the abstract to include key quantitative highlights with references to the relevant tables and figures. We will also expand §4 with additional ablation details and a brief error analysis to make the performance and efficiency claims fully verifiable. revision: yes

  2. Referee: [§3.2] Fusion into unified multimodal nodes: the claim that entity-driven visual grounding produces nodes that 'preserve atomic evidence' without information loss or alignment errors is load-bearing for the entire framework, but no quantitative validation (e.g., modality-specific retention rates, alignment error metrics, or ablation on fine-grained visual detail survival) is supplied.

    Authors: This is a fair observation. The current manuscript justifies the preservation claim through the entity-driven grounding design and qualitative examples, but does not provide direct quantitative metrics such as retention rates or alignment error rates. We will add a targeted ablation study in the revised experimental section that measures information preservation (e.g., downstream performance impact under controlled grounding variations) and introduces simple alignment-quality metrics to substantiate the claim (one such metric is sketched just below). revision: yes
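Editorial sketch of one such alignment-quality metric: grounding precision against annotated regions. This assumes box-level gold annotations are available and adopts the common IoU ≥ 0.5 detection convention; both assumptions are ours, not the paper's.

    def box_iou(a, b):
        """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-12)

    def grounding_precision(predicted_boxes, gold_boxes, thresh=0.5):
        """Fraction of entity-region links whose predicted box overlaps the
        annotated box at IoU >= thresh."""
        hits = sum(box_iou(p, g) >= thresh
                   for p, g in zip(predicted_boxes, gold_boxes))
        return hits / max(len(predicted_boxes), 1)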

Circularity Check

0 steps flagged

No circularity: new framework components are defined independently of fitted quantities or self-referential reductions.

full rationale

The paper introduces MG²-RAG as a novel hierarchical graph construction process that fuses textual parsing with entity-driven visual grounding to create unified multimodal nodes. No equations, fitted parameters, or predictions are presented that reduce by construction to prior inputs. Central claims rest on experimental comparisons rather than tautological derivations or load-bearing self-citations. The fusion step is an explicit design choice with stated assumptions about evidence preservation, not a self-definitional loop. This is a standard non-circular proposal of a new architecture.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Framework rests on standard assumptions about graph utility for knowledge representation and the feasibility of entity-visual alignment; no free parameters or invented entities with independent evidence are specified in the abstract.

axioms (1)
  • domain assumption: Hierarchical graphs can represent structural dependencies across text and visual modalities without loss of atomic evidence. Invoked in the description of node fusion and multi-hop retrieval.
invented entities (1)
  • Unified multimodal nodes (no independent evidence). Purpose: to fuse textual entities and visual regions while preserving atomic evidence; a new representation introduced to enable joint modality handling.

pith-pipeline@v0.9.0 · 5540 in / 1269 out tokens · 31351 ms · 2026-05-13T17:08:29.632617+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Very Efficient Listwise Multimodal Reranking for Long Documents

    cs.IR · 2026-05 · unverdicted · novelty 7.0

    ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1] Alam, F., Ofli, F., Imran, M.: CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. In: Proceedings of the International AAAI Conference on Web and Social Media (AAAI). vol. 12 (2018), https://ojs.aaai.org/index.php/ICWSM/article/view/14983

  2. [2] Asai, A., Wu, Z., Wang, Y., Sil, A., Hajishirzi, H.: Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In: The Twelfth International Conference on Learning Representations (ICLR) (2024), https://openreview.net/forum?id=hSyW5go0v8

  3. [3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923 (2025), https://arxiv.org/abs/2502.13923

  4. [4] Bulian, J., Buck, C., Gajewski, W., Börschinger, B., Schuster, T.: Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 291–305 (2022), https://aclanthology.org/2022.emnlp-main.20/

  5. [5] Caffagni, D., Cocchi, F., Moratelli, N., Sarto, S., Cornia, M., Baraldi, L., Cucchiara, R.: Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1818–1826 (2024), https://openaccess.thecvf.com/content/CVPR2024W/MMFM/html/Caff...

  6. [6] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamat...

  7. [7] Chen, D., Fisch, A., Weston, J., Bordes, A.: Reading Wikipedia to Answer Open-Domain Questions. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). pp. 1870–1879 (2017), https://aclanthology.org/P17-1171/

  8. [8] Chen, L., Tong, P., Jin, Z., Sun, Y., Ye, J., Xiong, H.: Plan-on-Graph: Self-Correcting Adaptive Planning of Large Language Model on Knowledge Graphs. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS). vol. 37, pp. 37665–37691 (2024), https://openreview.net/forum?id=CwCUEr6wO5

  9. [9] Chen, W., Hu, H., Chen, X., Verga, P., Cohen, W.: MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 5558–5570 (2022), https://aclanthology.org/2022.emnlp-main.375/

  10. [10] Chen, Y., Hu, H., Luan, Y., Sun, H., Changpinyo, S., Ritter, A., Chang, M.W.: Can Pre-trained Vision and Language Models Answer Visual Information-seeking Questions? In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 14948–14968 (2023), https://aclanthology.org/2023.emnlp-main.925/

  11. [11] Cheng, Q., Li, X., Li, S., Zhu, Q., Yin, Z., Shao, Y., Li, L., Sun, T., Yan, H., Qiu, X.: Unified Active Retrieval for Retrieval Augmented Generation. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 17153–17166 (2024), https://aclanthology.org/2024.findings-emnlp.999/

  12. [12] Cocchi, F., Moratelli, N., Cornia, M., Baraldi, L., Cucchiara, R.: Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9199–9209 (2025), https://openaccess.thecvf.com/content/CVPR2025/html/Cocchi_Augmenting_Mu...

  13. [13] Compagnoni, A., Morini, M., Sarto, S., Cocchi, F., Caffagni, D., Cornia, M., Baraldi, L., Cucchiara, R.: ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering. arXiv preprint arXiv:2511.22715 (2025), https://arxiv.org/abs/2511.22715

  14. [14] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS). vol. 36, pp. 49250–49267 (2023), https://proceedings.neurips.cc/paper_...

  15. [15] Dong, J., An, S., Yu, Y., Zhang, Q.W., Luo, L., Huang, X., Wu, Y., Yin, D., Sun, X.: Youtu-GraphRAG: Vertically Unified Agents for Graph Retrieval-Augmented Complex Reasoning. arXiv preprint arXiv:2508.19855 (2025), https://arxiv.org/abs/2508.19855

  16. [16] Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Metropolitansky, D., Ness, R.O., Larson, J.: From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv preprint arXiv:2404.16130 (2024), https://arxiv.org/abs/2404.16130

  17. [17] Edge, D., Trinh, H., Larson, J.: LazyGraphRAG: Setting a New Standard for Quality and Cost (2024), https://www.microsoft.com/en-us/research/blog/lazygraphrag-setting-a-new-standard-for-quality-and-cost/

  18. [18] Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., HUDELOT, C., Colombo, P.: ColPali: Efficient Document Retrieval with Vision Language Models. In: The Thirteenth International Conference on Learning Representations (ICLR) (2025), https://openreview.net/forum?id=ogjBpZ8uSi

  19. [19] Gao, J., Li, L., Ji, K., Li, W., Lian, Y., Fu, Y., Dai, B.: SmartRAG: Jointly Learn RAG-Related Tasks From the Environment Feedback. In: The Thirteenth International Conference on Learning Representations (ICLR) (2025), https://openreview.net/forum?id=OCd3cffulp

  20. [20] Google DeepMind Team: Gemini 3.1 Pro: Best for Complex Tasks and Bringing Creative Concepts to Life (2026), https://deepmind.google/models/gemini/pro/

  21. [21] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6904–6913 (2017), https://openaccess.thecvf.com/content_cvpr_2017/html/Goyal_Making_the_v_CVPR_2017_p...

  22. [22] Guan, X., Zeng, J., Meng, F., Xin, C., Lu, Y., Lin, H., Han, X., Sun, L., Zhou, J.: DeepRAG: Thinking to Retrieval Step by Step for Large Language Models. arXiv preprint arXiv:2502.01142 (2025), https://arxiv.org/abs/2502.01142

  23. [23] Guo, Z., Xia, L., Yu, Y., Ao, T., Huang, C.: LightRAG: Simple and Fast Retrieval-Augmented Generation. In: Findings of the Association for Computational Linguistics: EMNLP 2025. pp. 10746–10761 (2025), https://aclanthology.org/2025.findings-emnlp.568/

  24. [24] Gutiérrez, B.J., Shu, Y., Gu, Y., Yasunaga, M., Su, Y.: HippoRAG: Neurobiologically Inspired Long-term Memory for Large Language Models. In: Proceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS). vol. 37, pp. 59532–59569 (2024), https://openreview.net/forum?id=hkujvAPVsg

  25. [25] Gutiérrez, B.J., Shu, Y., Qi, W., Zhou, S., Su, Y.: From RAG to Memory: Non-Parametric Continual Learning for Large Language Models. In: Forty-second International Conference on Machine Learning (ICML). pp. 21497–21515. PMLR (2025), https://openreview.net/forum?id=LWH8yn4HS2

  26. [26] Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.: Retrieval Augmented Language Model Pre-Training. In: Proceedings of the 37th International Conference on Machine Learning (ICML). pp. 3929–3938. PMLR (2020), https://proceedings.mlr.press/v119/guu20a.html

  27. [27] Haveliwala, T.H.: Topic-Sensitive PageRank. In: Proceedings of the 11th International Conference on World Wide Web (WWW). pp. 517–526 (2002), https://dl.acm.org/doi/abs/10.1145/511446.511513

  28. [28] He, X., Tian, Y., Sun, Y., Chawla, N., Laurent, T., LeCun, Y., Bresson, X., Hooi, B.: G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS). vol. 37, pp. 132876–132907 (2024), https://openreview.net/forum?id=MPJ3oXtTZl

  29. [29] Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: Industrial-Strength Natural Language Processing in Python (2020), https://spacy.io/

  30. [30] Hsu, S., Khattab, O., Finn, C., Sharma, A.: Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval. In: The Thirteenth International Conference on Learning Representations (ICLR) (2025), https://openreview.net/forum?id=BPAZ6yW3K7

  31. [31] Hu, C.W., Wang, Y., Xing, S., Chen, C.J., Feng, S., Rossi, R., Tu, Z.: mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation. arXiv preprint arXiv:2505.24073 (2025), https://arxiv.org/abs/2505.24073

  32. [32] Hu, Z., Iscen, A., Sun, C., Wang, Z., Chang, K.W., Sun, Y., Schmid, C., Ross, D.A., Fathi, A.: REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 23369–23379 (2023), https://openaccess.thecvf.com/content/C...

  33. [33] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., Liu, T.: A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. 43(2) (Jan 2025), https://doi.org/10.1145/3703155

  34. [34] Huang, Y., Zhang, S., Xiao, X.: KET-RAG: A Cost-Efficient Multi-Granular Indexing Framework for Graph-RAG. In: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). pp. 1003–1012 (2025), https://dl.acm.org/doi/abs/10.1145/3711896.3737012

  35. [35] Jin, B., Xie, C., Zhang, J., Roy, K.K., Zhang, Y., Li, Z., Li, R., Tang, X., Wang, S., Meng, Y., Han, J.: Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 163–184 (2024), https://aclanthology.org/2024.findings-acl.11/

  36. [36] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Li, F.F.: Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision (IJCV) 123(1), 32–73 (2017), https://link.springer.com/article/10.1007/S11263-016-0981-7

  37. [37] Lazaridou, A., Gribovskaya, E., Stokowiec, W., Grigorev, N.: Internet-augmented Language Models through Few-shot Prompting for Open-domain Question Answering. arXiv preprint arXiv:2203.05115 (2022), https://arxiv.org/abs/2203.05115

  38. [38] Lee, J., Wang, Y., Li, J., Zhang, M.: Multimodal Reasoning with Multimodal Knowledge Graph. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). pp. 10767–10782 (2024), https://aclanthology.org/2024.acl-long.579/

  39. [39] Lee, K., Chang, M.W., Toutanova, K.: Latent Retrieval for Weakly Supervised Open Domain Question Answering. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). pp. 6086–6096 (2019), https://aclanthology.org/P19-1612/

  40. [40] Lee, M., An, S., Kim, M.S.: PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). pp. 6537–6555 (2024), https://aclanthology.org/2024.naacl-long.364/

  41. [41] Lee, Z., Cao, S., Liu, J., Zhang, J., Liu, W., Che, X., Hou, L., Li, J.: ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation. arXiv preprint arXiv:2503.21729 (2025), https://arxiv.org/abs/2503.21729

  42. [42] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In: Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS). pp. 9459–9474 (2020), https://procee...

  43. [43] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In: Proceedings of the 40th International Conference on Machine Learning (ICML). pp. 19730–19742. PMLR (2023), https://proceedings.mlr.press/v202/li23q

  44. [44] Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In: Proceedings of the 39th International Conference on Machine Learning (ICML). pp. 12888–12900. PMLR (2022), https://proceedings.mlr.press/v162/li22n.html

  45. [45] Li, X., Dong, G., Jin, J., Zhang, Y., Zhou, Y., Zhu, Y., Zhang, P., Dou, Z.: Search-o1: Agentic Search-Enhanced Large Reasoning Models. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 5420–5438 (2025), https://aclanthology.org/2025.emnlp-main.276/

  46. [46] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved Baselines with Visual Instruction Tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26296–26306 (2024), https://openaccess.thecvf.com/content/CVPR2024/html/Liu_Improved_Baselines_with_Visual_Instruction_Tuning_CVPR_2024_paper.html

  47. [47] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual Instruction Tuning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS). vol. 36, pp. 34892–34916 (2023), https://openreview.net/forum?id=w0H2xGHlkw

  48. [48] Liu, J., Meng, S., Gao, Y., Mao, S., Cai, P., Yan, G., Chen, Y., Bian, Z., Wang, D., Shi, B.: Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 981–992 (2025), https://openaccess.thecvf.com/content/ICCV20...

  49. [49] Liu, Y., Li, H., Garcia-Duran, A., Niepert, M., Onoro-Rubio, D., Rosenblum, D.S.: MMKG: Multi-modal Knowledge Graphs. In: European Semantic Web Conference (ESWC). pp. 459–474. Springer (2019), https://link.springer.com/chapter/10.1007/978-3-030-21348-0_30

  50. [50] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In: Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS). vol. 35, pp. 2507–2521 (2022), https://openreview.net/forum?id=H...

  51. [51] Lu, P., Peng, B., Cheng, H., Galley, M., Chang, K.W., Wu, Y.N., Zhu, S.C., Gao, J.: Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. In: Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS). vol. 36, pp. 43447–43478 (2023), https://openreview.net/forum?id=HtqnVSCj3q

  52. [52] Luo, L., Li, Y.F., Haffari, G., Pan, S.: Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. In: The Twelfth International Conference on Learning Representations (ICLR) (2024), https://openreview.net/forum?id=ZGNWW7xZ6Q

  53. [53] Luo, L., Zhao, Z., Haffari, G., Phung, D., Gong, C., Pan, S.: GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) (2025), https://openreview.net/forum?id=0QNmAvQQqj

  54. [54] Luo, Y., Zheng, X., Li, G., Yin, S., Lin, H., Fu, C., Huang, J., Ji, J., Chao, F., Luo, J., Ji, R.: Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) (2025), https://openreview.net/forum?id=QaZxGWlbgO

  55. [55] Mensink, T., Uijlings, J., Castrejon, L., Goel, A., Cadar, F., Zhou, H., Sha, F., Araujo, A., Ferrari, V.: Encyclopedic VQA: Visual Questions about Detailed Properties of Fine-grained Categories. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3113–3124 (2023), https://openaccess.thecvf.com/content/ICCV202...

  56. [56] Methani, N., Ganguly, P., Khapra, M.M., Kumar, P.: PlotQA: Reasoning over Scientific Plots. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1527–1536 (2020), https://openaccess.thecvf.com/content_WACV_2020/html/Methani_PlotQA_Reasoning_over_Scientific_Plots_WACV_2020_paper.html

  57. [57] Nussbaum, Z., Morris, J.X., Mulyar, A., Duderstadt, B.: Nomic Embed: Training a Reproducible Long Context Text Embedder. Transactions on Machine Learning Research (TMLR) (2025), https://openreview.net/forum?id=IPmzyQSiQE

  58. [58] OpenAI Team: Introducing GPT-5.2: The Most Advanced Frontier Model for Professional Work and Long-running Agents (2026), https://openai.com/index/introducing-gpt-5-2/

  59. [59] Qi, J., Xu, Z., Shao, R., Chen, Y., Di, J., Cheng, Y., Wang, Q., Huang, L.: RoRA-VLM: Robust Retrieval-Augmented Vision Language Models. arXiv preprint arXiv:2410.08876 (2024), https://arxiv.org/abs/2410.08876

  60. [60] Qwen Team: Qwen3.5: Towards Native Multimodal Agents (2026), https://qwen.ai/blog?id=qwen3.5

  61. [61] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML). pp. 8748–8763. PMLR (2021), https://proceedings.mlr.press/v...

  62. [62] Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., Manning, C.D.: RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. In: The Twelfth International Conference on Learning Representations (ICLR) (2024), https://openreview.net/forum?id=GN921JHCRw

  63. [63] Sun, J., Xu, C., Tang, L., Wang, S., Lin, C., Gong, Y., Ni, L., Shum, H.Y., Guo, J.: Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. In: The Twelfth International Conference on Learning Representations (ICLR) (2024), https://openreview.net/forum?id=nnVO1PvbTv

  64. [64] Sun, Q., Wang, J., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, X.: EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters. arXiv preprint arXiv:2402.04252 (2024), https://arxiv.org/abs/2402.04252

  65. [65] Wan, X., Yu, H.: MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs. arXiv preprint arXiv:2507.20804 (2025), https://arxiv.org/abs/2507.20804

  66. [66] Wang, L., Chen, H., Yang, N., Huang, X., Dou, Z., Wei, F.: Chain-of-Retrieval Augmented Generation. arXiv preprint arXiv:2501.14342 (2025), https://arxiv.org/abs/2501.14342

  67. [67] Wang, S., Fang, Y., Zhou, Y., Liu, X., Ma, Y.: ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation. arXiv preprint arXiv:2502.09891 (2025), https://arxiv.org/abs/2502.09891

  68. [68] Wasserman, N., Pony, R., Naparstek, O., Goldfarb, A.R., Schwartz, E., Barzelay, U., Karlinsky, L.: REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 31660–31683 (2025), https://aclanthology.org/2025.acl-long.1528/

  69. [69] Wu, J., Zhu, J., Liu, Y., Xu, M., Jin, Y.: Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). pp. 28489–28503 (2025), https://aclanthology.org/2025.acl-long.1383/

  70. [70] Yan, Y., Xie, W.: EchoSight: Advancing Visual-Language Models with Wiki Knowledge. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 1538–1551 (2024), https://aclanthology.org/2024.findings-emnlp.83/

  71. [71] Yang, W., Fu, J., Wang, R., Wang, J., Song, L., Bian, J.: OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). pp. 24545–24563. Association for Computational Linguis...

  72. [72] Yasunaga, M., Aghajanyan, A., Shi, W., James, R., Leskovec, J., Liang, P., Lewis, M., Zettlemoyer, L., Yih, W.t.: Retrieval-Augmented Multimodal Language Modeling. pp. 39755–39769 (2022), https://proceedings.mlr.press/v202/yasunaga23a.html

  73. [73] Yuan, X., Ning, L., Fan, W., Li, Q.: mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering. arXiv preprint arXiv:2508.05318 (2025), https://arxiv.org/abs/2508.05318

  74. [74] Zhang, R., Liu, C., Su, Y., Li, R., Huang, X., Li, X., Yu, P.S.: A Comprehensive Survey on Multimodal RAG: All Combinations of Modalities as Input and Output. Authorea Preprints (2025), https://www.techrxiv.org/doi/full/10.36227/techrxiv.176341513.38473003

  75. [75] Zhang, T., Zhang, Z., Ma, Z., Chen, Y., Qi, Z., Yuan, C., Li, B., Pu, J., Zhao, Y., Xie, Z., Ma, J., Shan, Y., Hu, W.: mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA. arXiv preprint arXiv:2411.15041 (2024), https://arxiv.org/abs/2411.15041

  76. [76] Zhang, Z., Feng, Y., Zhang, M.: LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers. arXiv preprint arXiv:2502.18139 (2025), https://arxiv.org/abs/2502.18139

  77. [77] Zhao, Y., Zhu, J., Guo, Y., He, K., Li, X.: E2GraphRAG: Streamlining Graph-based RAG for High Efficiency and Effectiveness. arXiv preprint arXiv:2505.24226 (2025), https://arxiv.org/abs/2505.24226

  78. [78] Zhi Lim, Q., Poo Lee, C., Ming Lim, K., Kamsani Samingan, A.: UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models. IEEE Access 12, 71505–71519 (2024), https://doi.org/10.1109/ACCESS.2024.3403101

  79. [79] Zhou, Y., Zhang, T., Xu, S., Chen, S., Zhou, Q., Tong, Y., Ji, S., Zhang, J., Qi, L., Li, X.: Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 17663–17674 (October 2025), https://openaccess.thecvf.com/content/ICCV2025/html/Zhou_Ar...

  80. [80] Zhuang, L., Chen, S., Xiao, Y., Zhou, H., Zhang, Y., Chen, H., Zhang, Q., Huang, X.: LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora. In: Proceedings of the 14th International Conference on Learning Representations (ICLR) (2026), https://arxiv.org/abs/2510.10114

Showing first 80 references.