MG²-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3
The pith
MG²-RAG fuses textual entities and visual regions into unified graph nodes for faster multimodal RAG with state-of-the-art results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MG²-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, a multi-granularity graph retrieval mechanism aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. Extensive experiments across four representative multimodal tasks demonstrate that MG²-RAG consistently achieves state-of-the-art performance while reducing graph construction overhead, with an average 43.3× speedup and 23.9× cost reduction compared with advanced graph-based frameworks.
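The fusion step is the hinge of this claim. As a non-authoritative illustration, the sketch below shows one way entity-driven grounding and node fusion could be realized, assuming CLIP-style encoders that place text mentions and image crops in a shared embedding space; every name and the similarity threshold are our assumptions, not the paper's.

```python
# Illustrative sketch only: the paper does not publish this code. It assumes
# CLIP-style embeddings for text and regions; all names are hypothetical.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class MultimodalNode:
    """A unified node holding one textual entity plus its grounded regions."""
    entity: str                       # surface form from lightweight parsing
    text_vec: np.ndarray              # embedding of the entity mention
    region_vecs: list = field(default_factory=list)  # grounded visual regions

    def fused_vec(self) -> np.ndarray:
        """Fuse modalities while keeping region vectors as atomic evidence."""
        if not self.region_vecs:
            return self.text_vec
        visual = np.mean(self.region_vecs, axis=0)
        fused = np.concatenate([self.text_vec, visual])
        return fused / np.linalg.norm(fused)


def ground_entity(entity_vec: np.ndarray, regions: list,
                  threshold: float = 0.3) -> list:
    """Entity-driven grounding: keep regions whose cosine similarity to the
    entity embedding clears a threshold (the threshold is an assumption)."""
    keep = []
    for r in regions:
        sim = float(entity_vec @ r) / (np.linalg.norm(entity_vec) * np.linalg.norm(r))
        if sim >= threshold:
            keep.append(r)
    return keep
```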
What carries the argument
Unified multimodal nodes formed by fusing textual entities and visual regions via lightweight textual parsing and entity-driven visual grounding.
If this is right
- Enables structured multi-hop reasoning across modalities without discarding fine-grained visual information.
- Delivers state-of-the-art results on retrieval, knowledge-based VQA, reasoning, and classification tasks.
- Cuts average graph construction time by 43.3× and cost by 23.9× versus advanced graph-based frameworks.
- Supports complex cross-modal reasoning in retrieval-augmented generation systems.
Where Pith is reading between the lines
- The node-fusion approach could extend to additional modalities such as audio by adapting the entity-grounding step.
- Reduced construction cost may make graph-based multimodal RAG feasible for much larger knowledge bases.
- Preservation of atomic evidence in nodes may reduce hallucinations more reliably than flat vector retrieval.
- The multi-granularity propagation idea could be tested in non-graph retrieval systems for similar gains.
Load-bearing premise
Combining lightweight textual parsing with entity-driven visual grounding produces unified multimodal nodes that preserve atomic evidence without information loss or alignment errors during fusion.
What would settle it
A controlled comparison on a multi-hop reasoning task where fused nodes are measured against separate text-plus-image retrieval, checking whether accuracy drops due to lost visual details or fusion errors.
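A minimal harness for that comparison, holding the reader fixed and swapping only the retrieval condition, might look like the following sketch. All function names are hypothetical stand-ins, and exact-match scoring is a simplification.

```python
# Hypothetical harness for the controlled comparison described above.
# `retrieve_fused` and `retrieve_separate` stand in for the two systems under
# test; `answer_with_context` for any fixed reader MLLM. None of these names
# come from the paper.
def compare_conditions(questions, gold_answers, retrieve_fused,
                       retrieve_separate, answer_with_context):
    """Score fused-node retrieval against separate text+image retrieval with
    the reader held fixed, so accuracy gaps isolate the fusion step."""
    scores = {"fused": 0, "separate": 0}
    for q, gold in zip(questions, gold_answers):
        for name, retrieve in (("fused", retrieve_fused),
                               ("separate", retrieve_separate)):
            context = retrieve(q, top_k=5)
            pred = answer_with_context(q, context)
            scores[name] += int(pred.strip().lower() == gold.strip().lower())
    n = len(questions)
    return {name: hits / n for name, hits in scores.items()}
```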
Original abstract
Retrieval-Augmented Generation (RAG) mitigates hallucinations in Multimodal Large Language Models (MLLMs), yet existing systems struggle with complex cross-modal reasoning. Flat vector retrieval often ignores structural dependencies, while current graph-based methods rely on costly "translation-to-text" pipelines that discard fine-grained visual information. To address these limitations, we propose MG²-RAG, a lightweight Multi-Granularity Graph RAG framework that jointly improves graph construction, modality fusion, and cross-modal retrieval. MG²-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, we introduce a multi-granularity graph retrieval mechanism that aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. Extensive experiments across four representative multimodal tasks (i.e., retrieval, knowledge-based VQA, reasoning, and classification) demonstrate that MG²-RAG consistently achieves state-of-the-art performance while reducing graph construction overhead with an average 43.3× speedup and 23.9× cost reduction compared with advanced graph-based frameworks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MG²-RAG, a lightweight multi-granularity graph RAG framework for multimodal LLMs. It builds a hierarchical multimodal knowledge graph by fusing textual entities and visual regions into unified nodes via lightweight textual parsing and entity-driven visual grounding. A multi-granularity retrieval mechanism aggregates dense similarities and propagates relevance for structured multi-hop reasoning. Experiments across retrieval, knowledge-based VQA, reasoning, and classification tasks claim SOTA performance with an average 43.3× speedup and 23.9× cost reduction in graph construction versus advanced graph-based baselines.
Significance. If the fusion preserves atomic evidence and the efficiency claims are substantiated, the work could make graph-based multimodal RAG more scalable by avoiding costly translation-to-text pipelines while supporting cross-modal reasoning. The reported speedups would be a notable practical contribution if they hold under rigorous ablations.
Major comments (2)
- [Abstract, §4] The abstract asserts SOTA results and 43.3×/23.9× efficiency gains across four tasks, yet the paper provides no metrics, baselines, ablation details, or error analysis; without these in the experimental section, the central performance and speedup claims cannot be evaluated.
- [§3.2] Fusion into unified multimodal nodes: the claim that entity-driven visual grounding produces nodes that 'preserve atomic evidence' without information loss or alignment errors is load-bearing for the entire framework, but no quantitative validation (e.g., modality-specific retention rates, alignment error metrics, or an ablation on fine-grained visual detail survival) is supplied.
Minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta or speedup factor) to support the SOTA claim.
- [§3.3] Notation for the multi-granularity retrieval (dense similarity aggregation and relevance propagation) should be formalized with an equation or algorithm box for clarity; one possible form is sketched below.
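For concreteness, the kind of formalization this comment requests might read as follows. This is our reconstruction from the abstract's description, not the paper's actual notation.

```latex
% Hypothetical notation, reconstructed from the abstract; the paper's own
% symbols may differ. Dense similarities seed the node distribution, and
% relevance is propagated with a personalized-PageRank-style update.
\[
  s_v = \max_{c \in \mathrm{chunks}(v)} \cos\!\bigl(\mathbf{q}, \mathbf{e}_c\bigr),
  \qquad
  p_v^{(0)} = \frac{s_v}{\sum_{u} s_u},
\]
\[
  \mathbf{p}^{(t+1)} = (1-\alpha)\,\mathbf{p}^{(0)} + \alpha\, A^{\top}\mathbf{p}^{(t)},
\]
```

Here q is the query embedding, e_c the embedding of a text or region chunk attached to node v, A a column-stochastic adjacency matrix of the multimodal graph, and alpha in (0,1) a damping factor.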
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and substantiation of our claims.
Point-by-point responses
Referee: [Abstract, §4] The abstract asserts SOTA results and 43.3×/23.9× efficiency gains across four tasks, yet the paper provides no metrics, baselines, ablation details, or error analysis; without these in the experimental section, the central performance and speedup claims cannot be evaluated.
Authors: We appreciate the referee highlighting the need for more explicit support of the claims. Section 4 already reports task-specific metrics, baseline comparisons, and efficiency measurements (speedup and cost reduction) across the four tasks. To address the concern directly, we will revise the abstract to include key quantitative highlights with references to the relevant tables and figures. We will also expand §4 with additional ablation details and a brief error analysis to make the performance and efficiency claims fully verifiable.
Revision: yes
Referee: [§3.2] Fusion into unified multimodal nodes: the claim that entity-driven visual grounding produces nodes that 'preserve atomic evidence' without information loss or alignment errors is load-bearing for the entire framework, but no quantitative validation (e.g., modality-specific retention rates, alignment error metrics, or an ablation on fine-grained visual detail survival) is supplied.
Authors: This is a fair observation. The current manuscript justifies the preservation claim through the entity-driven grounding design and qualitative examples, but does not provide direct quantitative metrics such as retention rates or alignment error rates. We will add a targeted ablation study in the revised experimental section that measures information preservation (e.g., downstream performance impact under controlled grounding variations) and introduces simple alignment-quality metrics to substantiate the claim.
Revision: yes
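For illustration, one simple alignment-quality metric of the kind the rebuttal promises (the paper does not specify one) could be a grounding retention rate; the sketch below uses assumed embeddings and an assumed threshold.

```python
# Hypothetical alignment-quality metric: the fraction of parsed entities
# whose best grounded region clears a cosine-similarity threshold. The
# threshold and all names are assumptions, not the paper's definitions.
import numpy as np


def grounding_retention(entity_vecs, region_vecs_per_entity,
                        threshold: float = 0.3) -> float:
    """Return the share of entities retaining at least one visual region
    whose cosine similarity to the entity embedding exceeds `threshold`."""
    retained = 0
    for e, regions in zip(entity_vecs, region_vecs_per_entity):
        e_norm = e / np.linalg.norm(e)
        best = max((float(e_norm @ (r / np.linalg.norm(r))) for r in regions),
                   default=-1.0)
        if best >= threshold:
            retained += 1
    return retained / max(len(entity_vecs), 1)
```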
Circularity Check
No circularity: new framework components are defined independently of fitted quantities or self-referential reductions.
Full rationale
The paper introduces MG²-RAG as a novel hierarchical graph construction process that fuses textual parsing with entity-driven visual grounding to create unified multimodal nodes. No equations, fitted parameters, or predictions are presented that reduce by construction to prior inputs. Central claims rest on experimental comparisons rather than tautological derivations or load-bearing self-citations. The fusion step is an explicit design choice with stated assumptions about evidence preservation, not a self-definitional loop. This is a standard non-circular proposal of a new architecture.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: hierarchical graphs can represent structural dependencies across text and visual modalities without loss of atomic evidence.
Invented entities (1)
- Unified multimodal nodes (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction — unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Passage: "MG²-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel — unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Passage: "we propagate relevance across the heterogeneous graph using Personalized PageRank"
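The quoted passage names Personalized PageRank, a standard algorithm. A minimal sketch with networkx on an invented toy graph (node names and seed scores are hypothetical, not from the paper):

```python
# Seed mass sits on query-matched nodes; relevance diffuses over the
# heterogeneous graph. Graph contents here are invented for illustration.
import networkx as nx

# Toy multimodal graph: text entities and image regions as nodes.
G = nx.Graph()
G.add_edges_from([
    ("entity:eiffel_tower", "region:img1_crop3"),  # grounding edge
    ("entity:eiffel_tower", "entity:paris"),       # textual relation
    ("entity:paris", "region:img2_crop1"),
])

# Seed distribution from dense retrieval scores (hypothetical values).
seeds = {"entity:eiffel_tower": 0.8, "region:img1_crop3": 0.2}

# Personalized PageRank; alpha is the standard damping factor.
relevance = nx.pagerank(G, alpha=0.85, personalization=seeds)
print(sorted(relevance.items(), key=lambda kv: -kv[1])[:3])
```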
What do these tags mean?
- matches — The paper's claim is directly supported by a theorem in the formal canon.
- supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses — The paper appears to rely on the theorem as machinery.
- contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
- unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Very Efficient Listwise Multimodal Reranking for Long Documents — ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10× lower latency via early interaction and single-pass scoring.