pith. machine review for the scientific record.

arxiv: 2604.04969 · v1 · submitted 2026-04-04 · 💻 cs.IR · cs.AI

Recognition: 2 theorem links · Lean Theorem

MG²-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords multimodal RAG · knowledge graph · graph retrieval · multimodal reasoning · retrieval augmented generation · cross-modal fusion · visual grounding
0 comments

The pith

MG²-RAG fuses textual entities and visual regions into unified graph nodes for faster multimodal RAG with state-of-the-art results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MG²-RAG to improve retrieval-augmented generation for multimodal large language models. Flat vector methods miss structural links, while existing graph approaches rely on expensive translation-to-text pipelines that discard visual details. MG²-RAG builds a hierarchical multimodal knowledge graph through lightweight textual parsing combined with entity-driven visual grounding, creating unified nodes that keep atomic evidence from both modalities. A multi-granularity retrieval step then aggregates similarities and propagates relevance to support multi-hop cross-modal reasoning. Experiments across retrieval, knowledge-based VQA, reasoning, and classification show state-of-the-art performance plus an average 43.3× speedup and 23.9× cost reduction in graph construction.

Core claim

MG²-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, a multi-granularity graph retrieval mechanism aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. Extensive experiments across four representative multimodal tasks demonstrate that MG²-RAG consistently achieves state-of-the-art performance while reducing graph construction overhead, with an average 43.3× speedup and 23.9× cost reduction compared with advanced graph-based frameworks.
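The paper's retrieval equations are not reproduced on this page. As a reading aid, here is a minimal sketch of what "aggregates dense similarities and propagates relevance across the graph" could look like if implemented as a damped random-walk propagation in the style of topic-sensitive PageRank (reference [27] below); the adjacency matrix, seed similarities, damping factor, and iteration count are illustrative assumptions, not the paper's specification.

    import numpy as np

    def propagate_relevance(adj: np.ndarray, seed: np.ndarray,
                            alpha: float = 0.85, iters: int = 50) -> np.ndarray:
        """Spread query relevance from similarity-seeded nodes over the graph.

        adj  : (n, n) adjacency matrix of the multimodal knowledge graph
        seed : (n,) initial relevance, e.g. cosine similarity between the
               query embedding and each unified node's embedding
        """
        # Column-normalize so each node distributes its relevance mass.
        col_sums = adj.sum(axis=0, keepdims=True)
        trans = np.divide(adj, col_sums, out=np.zeros_like(adj, dtype=float),
                          where=col_sums > 0)
        seed = seed / max(seed.sum(), 1e-12)  # normalized restart distribution
        scores = seed.copy()
        for _ in range(iters):
            scores = alpha * (trans @ scores) + (1 - alpha) * seed
        return scores

    # Toy 4-node graph: nodes 0 and 1 match the query directly; node 3 is
    # reachable only through node 2, so it gains relevance by propagation,
    # which is the multi-hop behavior the paper targets.
    adj = np.array([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
    seed = np.array([0.9, 0.8, 0.1, 0.0])
    print(propagate_relevance(adj, seed).round(3))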

What carries the argument

Unified multimodal nodes formed by fusing textual entities and visual regions via lightweight textual parsing and entity-driven visual grounding.
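To make the load-bearing machinery concrete, here is a minimal sketch of what a unified multimodal node could look like, assuming a cosine-similarity grounding step and fusion by embedding average; the data layout, the 0.5 threshold, and the averaging operator are our illustrative assumptions, not the paper's actual procedure.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class MultimodalNode:
        entity: str                     # textual entity from lightweight parsing
        region_box: tuple | None        # grounded region (x1, y1, x2, y2), if any
        text_emb: np.ndarray            # embedding of the entity mention
        region_emb: np.ndarray | None   # embedding of the grounded visual region

        @property
        def fused_emb(self) -> np.ndarray:
            # Illustrative fusion: average the two views when both exist,
            # so the node stays queryable from either modality.
            if self.region_emb is None:
                return self.text_emb
            return (self.text_emb + self.region_emb) / 2.0

    def ground_entity(text_emb, region_embs, boxes, threshold=0.5):
        """Entity-driven grounding sketch: link the entity to its most similar
        image region, but only if similarity clears a threshold; otherwise the
        node stays text-only and no visual detail is fabricated."""
        sims = [float(text_emb @ r) /
                (np.linalg.norm(text_emb) * np.linalg.norm(r) + 1e-12)
                for r in region_embs]
        best = int(np.argmax(sims))
        if sims[best] < threshold:
            return None, None
        return boxes[best], region_embs[best]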

If this is right

  • Enables structured multi-hop reasoning across modalities without discarding fine-grained visual information.
  • Delivers state-of-the-art results on retrieval, knowledge-based VQA, reasoning, and classification tasks.
  • Cuts graph construction time by an average 43.3× and cost by 23.9× versus advanced graph-based frameworks.
  • Supports complex cross-modal reasoning in retrieval-augmented generation systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The node-fusion approach could extend to additional modalities such as audio by adapting the entity-grounding step.
  • Reduced construction cost may make graph-based multimodal RAG feasible for much larger knowledge bases.
  • Preservation of atomic evidence in nodes may reduce hallucinations more reliably than flat vector retrieval.
  • The multi-granularity propagation idea could be tested in non-graph retrieval systems for similar gains.

Load-bearing premise

Combining lightweight textual parsing with entity-driven visual grounding produces unified multimodal nodes that preserve atomic evidence without information loss or alignment errors during fusion.

What would settle it

A controlled comparison on a multi-hop reasoning task where fused nodes are measured against separate text-plus-image retrieval, checking whether accuracy drops due to lost visual details or fusion errors.
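A minimal harness for that comparison, assuming each question is a dict with "query" and "answer" fields and that the retrieval and generation callables are hypothetical stand-ins supplied by the experimenter (none of these names come from the paper):

    def compare_retrieval_modes(questions, retrieve_fused, retrieve_separate,
                                answer_with):
        """Score answer accuracy under fused-node retrieval versus separate
        text-plus-image retrieval on the same multi-hop question set."""
        def accuracy(retrieve):
            hits = sum(
                answer_with(retrieve(q["query"]), q["query"]) == q["answer"]
                for q in questions
            )
            return hits / len(questions)
        return {"fused_nodes": accuracy(retrieve_fused),
                "text_plus_image": accuracy(retrieve_separate)}

    # A fused-node win on multi-hop items, with no loss on detail-sensitive
    # items, would support the preservation claim; the reverse would localize
    # the failure to fusion or grounding errors.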

Figures

Figures reproduced from arXiv: 2604.04969 by Jun Yu, Qiang Huang, Sijun Dai, Xiaoxing You.

Figure 1. Existing work employs text-centric graph extraction that discards visual details and incurs high costs. MG²-RAG efficiently fuses textual entities and visual objects into unified multimodal nodes while preserving atomic information. view at source ↗
Figure 2. Overview of MG²-RAG. The framework consists of two modules: (a) Multimodal Knowledge Graph Construction, which transforms a multimodal knowledge base into a hierarchical fine-grained multimodal knowledge graph (MMKG) via textual parsing and entity-driven visual grounding; (b) Multi-Granularity Graph Retrieval, which employs a multi-granularity retrieval mechanism, driven by graph propagation, to retrieve the… view at source ↗
Figure 3. Graph construction efficiency. The top and bottom plots show construction time (hours) and cost ($), respectively. Red annotations above the MG²-RAG bars indicate the efficiency gain (speedup or cost reduction) relative to the strongest baseline. view at source ↗
Figure 4. Case studies on Knowledge-based VQA and Multimodal Retrieval. view at source ↗
Figure 5. Rule-based Relation Extraction Examples. view at source ↗
Figure 6. Case studies on Multimodal Reasoning. MG²-RAG is compared with a baseline MLLM (Qwen3.5-27B) and graph-based approaches (VaLiK and MMGraphRAG). The Retrieved Evidence illustrates the multimodal evidence retrieved by MG²-RAG for answer generation. view at source ↗
Figure 7. Case studies on Multimodal Classification. view at source ↗
read the original abstract

Retrieval-Augmented Generation (RAG) mitigates hallucinations in Multimodal Large Language Models (MLLMs), yet existing systems struggle with complex cross-modal reasoning. Flat vector retrieval often ignores structural dependencies, while current graph-based methods rely on costly "translation-to-text" pipelines that discard fine-grained visual information. To address these limitations, we propose MG²-RAG, a lightweight Multi-Granularity Graph RAG framework that jointly improves graph construction, modality fusion, and cross-modal retrieval. MG²-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, we introduce a multi-granularity graph retrieval mechanism that aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. Extensive experiments across four representative multimodal tasks (i.e., retrieval, knowledge-based VQA, reasoning, and classification) demonstrate that MG²-RAG consistently achieves state-of-the-art performance while reducing graph construction overhead with an average 43.3× speedup and 23.9× cost reduction compared with advanced graph-based frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MG²-RAG, a lightweight multi-granularity graph RAG framework for multimodal LLMs. It builds a hierarchical multimodal knowledge graph by fusing textual entities and visual regions into unified nodes via lightweight textual parsing and entity-driven visual grounding. A multi-granularity retrieval mechanism aggregates dense similarities and propagates relevance for structured multi-hop reasoning. Experiments across retrieval, knowledge-based VQA, reasoning, and classification tasks claim SOTA performance with an average 43.3× speedup and 23.9× cost reduction in graph construction versus advanced graph-based baselines.

Significance. If the fusion preserves atomic evidence and the efficiency claims are substantiated, the work could make graph-based multimodal RAG more scalable by avoiding costly translation-to-text pipelines while supporting cross-modal reasoning. The reported speedups would be a notable practical contribution if they hold under rigorous ablations.

major comments (2)
  1. [Abstract, §4] The abstract asserts SOTA results and 43.3×/23.9× efficiency gains across four tasks, yet provides no metrics, baselines, ablation details, or error analysis; without these in the experimental section, the central performance and speedup claims cannot be evaluated.
  2. [§3.2] Fusion into unified multimodal nodes: the claim that entity-driven visual grounding produces nodes that 'preserve atomic evidence' without information loss or alignment errors is load-bearing for the entire framework, but no quantitative validation (e.g., modality-specific retention rates, alignment error metrics, or ablation on fine-grained visual detail survival) is supplied.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta or speedup factor) to support the SOTA claim.
  2. [§3.3] Notation for the multi-granularity retrieval (dense similarity aggregation and relevance propagation) should be formalized with an equation or algorithm box for clarity; one possible notation is sketched after this list.
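As an editorial illustration of what that notation might look like (the symbols here are ours, not the paper's): seed each node with its best dense similarity to the query across granularities, then propagate scores with a damped walk over the graph.

    \[
    s_0(v) = \max_{g \in \mathcal{G}} \cos\bigl(\mathbf{e}_q, \mathbf{e}_v^{(g)}\bigr),
    \qquad
    \mathbf{s}_{t+1} = \alpha\, \tilde{A}\, \mathbf{s}_t + (1 - \alpha)\, \mathbf{s}_0,
    \]

where \(\mathcal{G}\) is the set of granularities, \(\mathbf{e}_v^{(g)}\) is the embedding of node \(v\) at granularity \(g\), \(\tilde{A}\) is the normalized adjacency of the multimodal knowledge graph, and \(\alpha\) is a damping factor.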

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and substantiation of our claims.

read point-by-point responses
  1. Referee: [Abstract, §4] The abstract asserts SOTA results and 43.3×/23.9× efficiency gains across four tasks, yet provides no metrics, baselines, ablation details, or error analysis; without these in the experimental section, the central performance and speedup claims cannot be evaluated.

    Authors: We appreciate the referee highlighting the need for more explicit support of the claims. Section 4 already reports task-specific metrics, baseline comparisons, and efficiency measurements (speedup and cost reduction) across the four tasks. To address the concern directly, we will revise the abstract to include key quantitative highlights with references to the relevant tables and figures. We will also expand §4 with additional ablation details and a brief error analysis to make the performance and efficiency claims fully verifiable. revision: yes

  2. Referee: [§3.2] Fusion into unified multimodal nodes: the claim that entity-driven visual grounding produces nodes that 'preserve atomic evidence' without information loss or alignment errors is load-bearing for the entire framework, but no quantitative validation (e.g., modality-specific retention rates, alignment error metrics, or ablation on fine-grained visual detail survival) is supplied.

    Authors: This is a fair observation. The current manuscript justifies the preservation claim through the entity-driven grounding design and qualitative examples, but does not provide direct quantitative metrics such as retention rates or alignment error rates. We will add a targeted ablation study in the revised experimental section that measures information preservation (e.g., downstream performance impact under controlled grounding variations) and introduces simple alignment-quality metrics to substantiate the claim (one such metric is sketched just below). revision: yes
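Editorial sketch of one such alignment-quality metric: grounding precision against annotated regions. This assumes box-level gold annotations are available and adopts the common IoU ≥ 0.5 detection convention; both assumptions are ours, not the paper's.

    def box_iou(a, b):
        """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-12)

    def grounding_precision(predicted_boxes, gold_boxes, thresh=0.5):
        """Fraction of entity-region links whose predicted box overlaps the
        annotated box at IoU >= thresh."""
        hits = sum(box_iou(p, g) >= thresh
                   for p, g in zip(predicted_boxes, gold_boxes))
        return hits / max(len(predicted_boxes), 1)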

Circularity Check

0 steps flagged

No circularity: new framework components are defined independently of fitted quantities or self-referential reductions.

full rationale

The paper introduces MG²-RAG as a novel hierarchical graph construction process that fuses textual parsing with entity-driven visual grounding to create unified multimodal nodes. No equations, fitted parameters, or predictions are presented that reduce by construction to prior inputs. Central claims rest on experimental comparisons rather than tautological derivations or load-bearing self-citations. The fusion step is an explicit design choice with stated assumptions about evidence preservation, not a self-definitional loop. This is a standard non-circular proposal of a new architecture.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Framework rests on standard assumptions about graph utility for knowledge representation and the feasibility of entity-visual alignment; no free parameters or invented entities with independent evidence are specified in the abstract.

axioms (1)
  • domain assumption: Hierarchical graphs can represent structural dependencies across text and visual modalities without loss of atomic evidence. Invoked in the description of node fusion and multi-hop retrieval.
invented entities (1)
  • Unified multimodal nodes (no independent evidence). Purpose: to fuse textual entities and visual regions while preserving atomic evidence; a new representation introduced to enable joint modality handling.

pith-pipeline@v0.9.0 · 5540 in / 1269 out tokens · 31351 ms · 2026-05-13T17:08:29.632617+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Very Efficient Listwise Multimodal Reranking for Long Documents

    cs.IR · 2026-05 · unverdicted · novelty 7.0

    ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1] Alam, F., Ofli, F., Imran, M.: CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. In: Proceedings of the International AAAI Conference on Web and Social Media (AAAI). vol. 12 (2018), https://ojs.aaai.org/index.php/ICWSM/article/view/14983

  2. [2] Asai, A., Wu, Z., Wang, Y., Sil, A., Hajishirzi, H.: Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In: The Twelfth International Conference on Learning Representations (ICLR) (2024), https://openreview.net/forum?id=hSyW5go0v8

  3. [3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923 (2025), https://arxiv.org/abs/2502.13923

  4. [4] Bulian, J., Buck, C., Gajewski, W., Börschinger, B., Schuster, T.: Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 291–305 (2022), https://aclanthology.org/2022.emnlp-main.20/

  5. [5] Caffagni, D., Cocchi, F., Moratelli, N., Sarto, S., Cornia, M., Baraldi, L., Cucchiara, R.: Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1818–1826 (2024), https://openaccess.thecvf.com/content/CVPR2024W/MMFM/html/Caff...

  6. [6] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamat...

  7. [7] Chen, D., Fisch, A., Weston, J., Bordes, A.: Reading Wikipedia to Answer Open-Domain Questions. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). pp. 1870–1879 (2017), https://aclanthology.org/P17-1171/

  8. [8] Chen, L., Tong, P., Jin, Z., Sun, Y., Ye, J., Xiong, H.: Plan-on-Graph: Self-Correcting Adaptive Planning of Large Language Model on Knowledge Graphs. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS). vol. 37, pp. 37665–37691 (2024), https://openreview.net/forum?id=CwCUEr6wO5

  9. [9] Chen, W., Hu, H., Chen, X., Verga, P., Cohen, W.: MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 5558–5570 (2022), https://aclanthology.org/2022.emnlp-main.375/

  10. [10] Chen, Y., Hu, H., Luan, Y., Sun, H., Changpinyo, S., Ritter, A., Chang, M.W.: Can Pre-trained Vision and Language Models Answer Visual Information-seeking Questions? In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 14948–14968 (2023), https://aclanthology.org/2023.emnlp-main.925/

  11. [11] Cheng, Q., Li, X., Li, S., Zhu, Q., Yin, Z., Shao, Y., Li, L., Sun, T., Yan, H., Qiu, X.: Unified Active Retrieval for Retrieval Augmented Generation. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 17153–17166 (2024), https://aclanthology.org/2024.findings-emnlp.999/

  12. [12] Cocchi, F., Moratelli, N., Cornia, M., Baraldi, L., Cucchiara, R.: Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9199–9209 (2025), https://openaccess.thecvf.com/content/CVPR2025/html/Cocchi_Augmenting_Mu...

  13. [13] Compagnoni, A., Morini, M., Sarto, S., Cocchi, F., Caffagni, D., Cornia, M., Baraldi, L., Cucchiara, R.: ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering. arXiv preprint arXiv:2511.22715 (2025), https://arxiv.org/abs/2511.22715

  14. [14] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS). vol. 36, pp. 49250–49267 (2023), https://proceedings.neurips.cc/paper_...

  15. [15] Dong, J., An, S., Yu, Y., Zhang, Q.W., Luo, L., Huang, X., Wu, Y., Yin, D., Sun, X.: Youtu-GraphRAG: Vertically Unified Agents for Graph Retrieval-Augmented Complex Reasoning. arXiv preprint arXiv:2508.19855 (2025), https://arxiv.org/abs/2508.19855

  16. [16] Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Metropolitansky, D., Ness, R.O., Larson, J.: From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv preprint arXiv:2404.16130 (2024), https://arxiv.org/abs/2404.16130

  17. [17] Edge, D., Trinh, H., Larson, J.: LazyGraphRAG: Setting a New Standard for Quality and Cost (2024), https://www.microsoft.com/en-us/research/blog/lazygraphrag-setting-a-new-standard-for-quality-and-cost/

  18. [18] Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., HUDELOT, C., Colombo, P.: ColPali: Efficient Document Retrieval with Vision Language Models. In: The Thirteenth International Conference on Learning Representations (ICLR) (2025), https://openreview.net/forum?id=ogjBpZ8uSi

  19. [19] Gao, J., Li, L., Ji, K., Li, W., Lian, Y., Fu, Y., Dai, B.: SmartRAG: Jointly Learn RAG-Related Tasks From the Environment Feedback. In: The Thirteenth International Conference on Learning Representations (ICLR) (2025), https://openreview.net/forum?id=OCd3cffulp

  20. [20] Google DeepMind Team: Gemini 3.1 Pro: Best for Complex Tasks and Bringing Creative Concepts to Life (2026), https://deepmind.google/models/gemini/pro/

  21. [21] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6904–6913 (2017), https://openaccess.thecvf.com/content_cvpr_2017/html/Goyal_Making_the_v_CVPR_2017_p...

  22. [22] Guan, X., Zeng, J., Meng, F., Xin, C., Lu, Y., Lin, H., Han, X., Sun, L., Zhou, J.: DeepRAG: Thinking to Retrieval Step by Step for Large Language Models. arXiv preprint arXiv:2502.01142 (2025), https://arxiv.org/abs/2502.01142

  23. [23] Guo, Z., Xia, L., Yu, Y., Ao, T., Huang, C.: LightRAG: Simple and Fast Retrieval-Augmented Generation. In: Findings of the Association for Computational Linguistics: EMNLP 2025. pp. 10746–10761 (2025), https://aclanthology.org/2025.findings-emnlp.568/

  24. [24] Gutiérrez, B.J., Shu, Y., Gu, Y., Yasunaga, M., Su, Y.: HippoRAG: Neurobiologically Inspired Long-term Memory for Large Language Models. In: Proceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS). vol. 37, pp. 59532–59569 (2024), https://openreview.net/forum?id=hkujvAPVsg

  25. [25] Gutiérrez, B.J., Shu, Y., Qi, W., Zhou, S., Su, Y.: From RAG to Memory: Non-Parametric Continual Learning for Large Language Models. In: Forty-second International Conference on Machine Learning (ICML). pp. 21497–21515. PMLR (2025), https://openreview.net/forum?id=LWH8yn4HS2

  26. [26] Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.: Retrieval Augmented Language Model Pre-Training. In: Proceedings of the 37th International Conference on Machine Learning (ICML). pp. 3929–3938. PMLR (2020), https://proceedings.mlr.press/v119/guu20a.html

  27. [27] Haveliwala, T.H.: Topic-Sensitive PageRank. In: Proceedings of the 11th International Conference on World Wide Web (WWW). pp. 517–526 (2002), https://dl.acm.org/doi/abs/10.1145/511446.511513

  28. [28] He, X., Tian, Y., Sun, Y., Chawla, N., Laurent, T., LeCun, Y., Bresson, X., Hooi, B.: G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS). vol. 37, pp. 132876–132907 (2024), https://openreview.net/forum?id=MPJ3oXtTZl

  29. [29] Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: Industrial-Strength Natural Language Processing in Python (2020), https://spacy.io/

  30. [30] Hsu, S., Khattab, O., Finn, C., Sharma, A.: Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval. In: The Thirteenth International Conference on Learning Representations (ICLR) (2025), https://openreview.net/forum?id=BPAZ6yW3K7

  31. [31] Hu, C.W., Wang, Y., Xing, S., Chen, C.J., Feng, S., Rossi, R., Tu, Z.: mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation. arXiv preprint arXiv:2505.24073 (2025), https://arxiv.org/abs/2505.24073

  32. [32] Hu, Z., Iscen, A., Sun, C., Wang, Z., Chang, K.W., Sun, Y., Schmid, C., Ross, D.A., Fathi, A.: REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 23369–23379 (2023), https://openaccess.thecvf.com/content/C...

  33. [33] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., Liu, T.: A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. 43(2) (Jan 2025), https://doi.org/10.1145/3703155

  34. [34] Huang, Y., Zhang, S., Xiao, X.: KET-RAG: A Cost-Efficient Multi-Granular Indexing Framework for Graph-RAG. In: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). pp. 1003–1012 (2025), https://dl.acm.org/doi/abs/10.1145/3711896.3737012

  35. [35] Jin, B., Xie, C., Zhang, J., Roy, K.K., Zhang, Y., Li, Z., Li, R., Tang, X., Wang, S., Meng, Y., Han, J.: Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 163–184 (2024), https://aclanthology.org/2024.findings-acl.11/

  36. [36] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Li, F.F.: Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision (IJCV) 123(1), 32–73 (2017), https://link.springer.com/article/10.1007/S11263-016-0981-7

  37. [37] Lazaridou, A., Gribovskaya, E., Stokowiec, W., Grigorev, N.: Internet-augmented Language Models through Few-shot Prompting for Open-domain Question Answering. arXiv preprint arXiv:2203.05115 (2022), https://arxiv.org/abs/2203.05115

  38. [38] Lee, J., Wang, Y., Li, J., Zhang, M.: Multimodal Reasoning with Multimodal Knowledge Graph. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). pp. 10767–10782 (2024), https://aclanthology.org/2024.acl-long.579/

  39. [39] Lee, K., Chang, M.W., Toutanova, K.: Latent Retrieval for Weakly Supervised Open Domain Question Answering. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). pp. 6086–6096 (2019), https://aclanthology.org/P19-1612/

  40. [40] Lee, M., An, S., Kim, M.S.: PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). pp. 6537–6555 (2024), https://aclanthology.org/2024.naacl-long.364/

  41. [41] Lee, Z., Cao, S., Liu, J., Zhang, J., Liu, W., Che, X., Hou, L., Li, J.: ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation. arXiv preprint arXiv:2503.21729 (2025), https://arxiv.org/abs/2503.21729

  42. [42] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In: Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS). pp. 9459–9474 (2020), https://procee...

  43. [43] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In: Proceedings of the 40th International Conference on Machine Learning (ICML). pp. 19730–19742. PMLR (2023), https://proceedings.mlr.press/v202/li23q

  44. [44] Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In: Proceedings of the 39th International Conference on Machine Learning (ICML). pp. 12888–12900. PMLR (2022), https://proceedings.mlr.press/v162/li22n.html

  45. [45] Li, X., Dong, G., Jin, J., Zhang, Y., Zhou, Y., Zhu, Y., Zhang, P., Dou, Z.: Search-o1: Agentic Search-Enhanced Large Reasoning Models. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 5420–5438 (2025), https://aclanthology.org/2025.emnlp-main.276/

  46. [46] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved Baselines with Visual Instruction Tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26296–26306 (2024), https://openaccess.thecvf.com/content/CVPR2024/html/Liu_Improved_Baselines_with_Visual_Instruction_Tuning_CVPR_2024_paper.html

  47. [47] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual Instruction Tuning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS). vol. 36, pp. 34892–34916 (2023), https://openreview.net/forum?id=w0H2xGHlkw

  48. [48] Liu, J., Meng, S., Gao, Y., Mao, S., Cai, P., Yan, G., Chen, Y., Bian, Z., Wang, D., Shi, B.: Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 981–992 (2025), https://openaccess.thecvf.com/content/ICCV20...

  49. [49] Liu, Y., Li, H., Garcia-Duran, A., Niepert, M., Onoro-Rubio, D., Rosenblum, D.S.: MMKG: Multi-modal Knowledge Graphs. In: European Semantic Web Conference (ESWC). pp. 459–474. Springer (2019), https://link.springer.com/chapter/10.1007/978-3-030-21348-0_30

  50. [50] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In: Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS). vol. 35, pp. 2507–2521 (2022), https://openreview.net/forum?id=H...

  51. [51] Lu, P., Peng, B., Cheng, H., Galley, M., Chang, K.W., Wu, Y.N., Zhu, S.C., Gao, J.: Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. In: Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS). vol. 36, pp. 43447–43478 (2023), https://openreview.net/forum?id=HtqnVSCj3q

  52. [52] Luo, L., Li, Y.F., Haffari, G., Pan, S.: Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. In: The Twelfth International Conference on Learning Representations (ICLR) (2024), https://openreview.net/forum?id=ZGNWW7xZ6Q

  53. [53] Luo, L., Zhao, Z., Haffari, G., Phung, D., Gong, C., Pan, S.: GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) (2025), https://openreview.net/forum?id=0QNmAvQQqj

  54. [54] Luo, Y., Zheng, X., Li, G., Yin, S., Lin, H., Fu, C., Huang, J., Ji, J., Chao, F., Luo, J., Ji, R.: Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) (2025), https://openreview.net/forum?id=QaZxGWlbgO

  55. [55] Mensink, T., Uijlings, J., Castrejon, L., Goel, A., Cadar, F., Zhou, H., Sha, F., Araujo, A., Ferrari, V.: Encyclopedic VQA: Visual Questions about Detailed Properties of Fine-grained Categories. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3113–3124 (2023), https://openaccess.thecvf.com/content/ICCV202...

  56. [56] Methani, N., Ganguly, P., Khapra, M.M., Kumar, P.: PlotQA: Reasoning over Scientific Plots. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1527–1536 (2020), https://openaccess.thecvf.com/content_WACV_2020/html/Methani_PlotQA_Reasoning_over_Scientific_Plots_WACV_2020_paper.html

  57. [57] Nussbaum, Z., Morris, J.X., Mulyar, A., Duderstadt, B.: Nomic Embed: Training a Reproducible Long Context Text Embedder. Transactions on Machine Learning Research (TMLR) (2025), https://openreview.net/forum?id=IPmzyQSiQE

  58. [58] OpenAI Team: Introducing GPT-5.2: The Most Advanced Frontier Model for Professional Work and Long-running Agents (2026), https://openai.com/index/introducing-gpt-5-2/

  59. [59] Qi, J., Xu, Z., Shao, R., Chen, Y., Di, J., Cheng, Y., Wang, Q., Huang, L.: RoRA-VLM: Robust Retrieval-Augmented Vision Language Models. arXiv preprint arXiv:2410.08876 (2024), https://arxiv.org/abs/2410.08876

  60. [60] Qwen Team: Qwen3.5: Towards Native Multimodal Agents (2026), https://qwen.ai/blog?id=qwen3.5

  61. [61] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML). pp. 8748–8763. PMLR (2021), https://proceedings.mlr.press/v...

  62. [62] Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., Manning, C.D.: RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. In: The Twelfth International Conference on Learning Representations (ICLR) (2024), https://openreview.net/forum?id=GN921JHCRw

  63. [63] Sun, J., Xu, C., Tang, L., Wang, S., Lin, C., Gong, Y., Ni, L., Shum, H.Y., Guo, J.: Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. In: The Twelfth International Conference on Learning Representations (ICLR) (2024), https://openreview.net/forum?id=nnVO1PvbTv

  64. [64] Sun, Q., Wang, J., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, X.: EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters. arXiv preprint arXiv:2402.04252 (2024), https://arxiv.org/abs/2402.04252

  65. [65] Wan, X., Yu, H.: MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs. arXiv preprint arXiv:2507.20804 (2025), https://arxiv.org/abs/2507.20804

  66. [66] Wang, L., Chen, H., Yang, N., Huang, X., Dou, Z., Wei, F.: Chain-of-Retrieval Augmented Generation. arXiv preprint arXiv:2501.14342 (2025), https://arxiv.org/abs/2501.14342

  67. [67] Wang, S., Fang, Y., Zhou, Y., Liu, X., Ma, Y.: ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation. arXiv preprint arXiv:2502.09891 (2025), https://arxiv.org/abs/2502.09891

  68. [68] Wasserman, N., Pony, R., Naparstek, O., Goldfarb, A.R., Schwartz, E., Barzelay, U., Karlinsky, L.: REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 31660–31683 (2025), https://aclanthology.org/2025.acl-long.1528/

  69. [69] Wu, J., Zhu, J., Liu, Y., Xu, M., Jin, Y.: Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). pp. 28489–28503 (2025), https://aclanthology.org/2025.acl-long.1383/

  70. [70] Yan, Y., Xie, W.: EchoSight: Advancing Visual-Language Models with Wiki Knowledge. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 1538–1551 (2024), https://aclanthology.org/2024.findings-emnlp.83/

  71. [71] Yang, W., Fu, J., Wang, R., Wang, J., Song, L., Bian, J.: OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). pp. 24545–24563. Association for Computational Linguis...

  72. [72] Yasunaga, M., Aghajanyan, A., Shi, W., James, R., Leskovec, J., Liang, P., Lewis, M., Zettlemoyer, L., Yih, W.t.: Retrieval-Augmented Multimodal Language Modeling. pp. 39755–39769 (2022), https://proceedings.mlr.press/v202/yasunaga23a.html

  73. [73] Yuan, X., Ning, L., Fan, W., Li, Q.: mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering. arXiv preprint arXiv:2508.05318 (2025), https://arxiv.org/abs/2508.05318

  74. [74] Zhang, R., Liu, C., Su, Y., Li, R., Huang, X., Li, X., Yu, P.S.: A Comprehensive Survey on Multimodal RAG: All Combinations of Modalities as Input and Output. Authorea Preprints (2025), https://www.techrxiv.org/doi/full/10.36227/techrxiv.176341513.38473003

  75. [75] Zhang, T., Zhang, Z., Ma, Z., Chen, Y., Qi, Z., Yuan, C., Li, B., Pu, J., Zhao, Y., Xie, Z., Ma, J., Shan, Y., Hu, W.: mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA. arXiv preprint arXiv:2411.15041 (2024), https://arxiv.org/abs/2411.15041

  76. [76] Zhang, Z., Feng, Y., Zhang, M.: LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers. arXiv preprint arXiv:2502.18139 (2025), https://arxiv.org/abs/2502.18139

  77. [77] Zhao, Y., Zhu, J., Guo, Y., He, K., Li, X.: E2GraphRAG: Streamlining Graph-based RAG for High Efficiency and Effectiveness. arXiv preprint arXiv:2505.24226 (2025), https://arxiv.org/abs/2505.24226

  78. [78] Zhi Lim, Q., Poo Lee, C., Ming Lim, K., Kamsani Samingan, A.: UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models. IEEE Access 12, 71505–71519 (2024), https://doi.org/10.1109/ACCESS.2024.3403101

  79. [79] Zhou, Y., Zhang, T., Xu, S., Chen, S., Zhou, Q., Tong, Y., Ji, S., Zhang, J., Qi, L., Li, X.: Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 17663–17674 (October 2025), https://openaccess.thecvf.com/content/ICCV2025/html/Zhou_Ar...

  80. [80] Zhuang, L., Chen, S., Xiao, Y., Zhou, H., Zhang, Y., Chen, H., Zhang, Q., Huang, X.: LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora. In: Proceedings of the 14th International Conference on Learning Representations (ICLR) (2026), https://arxiv.org/abs/2510.10114

Showing first 80 references.