arXiv: 2503.04338 · v2 · submitted 2025-03-06 · 💻 cs.IR · cs.CL · cs.DB

In-depth Analysis of Graph-based RAG in a Unified Framework

Pith reviewed 2026-05-23 01:33 UTC · model grok-4.3

classification 💻 cs.IR cs.CL cs.DB
keywords graph-based RAG, unified framework, retrieval-augmented generation, question answering, LLM, method comparison, hybrid techniques, knowledge integration

The pith

A single framework unifies graph-based RAG methods and shows that simple recombinations of their parts beat prior leaders on both concrete and abstract QA tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper places existing graph-based retrieval-augmented generation techniques inside one high-level framework so they can be compared directly. It then tests representative methods on question-answering datasets that range from narrow factual questions to high-level abstract ones. The experiments identify new variants, each formed by combining pieces already present in earlier work, that exceed the previous best results on each category of task. A reader would care because the comparison supplies a clearer view of what each component contributes and shows that further gains are available without inventing entirely new mechanisms.

Core claim

By embedding all graph-based RAG methods in one shared framework, the authors run controlled experiments across QA datasets that range from specific to abstract questions. The results identify new variants, obtained simply by recombining existing techniques, that outperform the prior state of the art on the specific-question tasks and on the abstract-question tasks, respectively.

What carries the argument

The unified high-level framework that incorporates every graph-based RAG method under a common structure for direct comparison.

If this is right

  • Direct head-to-head testing reveals which graph components help most on concrete versus abstract questions.
  • New variants formed by recombining existing techniques set higher performance marks on specific QA tasks.
  • New variants formed by recombining existing techniques set higher performance marks on abstract QA tasks.
  • The analysis points to concrete directions for future work on graph-based knowledge integration with LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same framework could be used to test whether the same recombinations improve performance on tasks beyond QA, such as summarization or dialogue.
  • Practitioners could adopt the top-performing variants immediately while waiting for entirely new architectures.
  • The framework makes it easier to diagnose why one method succeeds where another fails by swapping individual modules.
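
The module-swapping idea in the last bullet can be sketched in a few lines. This is a minimal illustration, not the paper's code: the four stage names follow the stages mentioned in the figure captions (graph building, index construction, retrieval, generation), while `Pipeline`, `chain_graph`, `lowercase_index`, `keyword_retrieve`, `neighbor_retrieve`, and `join_generate` are hypothetical toy stand-ins.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List

Graph = Dict[str, List[str]]  # chunk -> linked chunks (toy representation)
Index = Dict[str, str]        # chunk -> normalized text used for lookup


@dataclass(frozen=True)
class Pipeline:
    """Four swappable stages; every name here is hypothetical."""
    build_graph: Callable[[List[str]], Graph]
    build_index: Callable[[Graph], Index]
    retrieve: Callable[[str, Graph, Index], List[str]]
    generate: Callable[[str, List[str]], str]

    def answer(self, question: str, chunks: List[str]) -> str:
        graph = self.build_graph(chunks)
        index = self.build_index(graph)
        context = self.retrieve(question, graph, index)
        return self.generate(question, context)


def chain_graph(chunks: List[str]) -> Graph:
    # Toy graph building: link each chunk to the one that follows it.
    return {c: chunks[i + 1:i + 2] for i, c in enumerate(chunks)}


def lowercase_index(graph: Graph) -> Index:
    # Toy index construction: store a lowercased copy of every node.
    return {node: node.lower() for node in graph}


def keyword_retrieve(question: str, graph: Graph, index: Index) -> List[str]:
    # Return nodes that share at least one word with the question.
    words = set(question.lower().split())
    return [n for n, text in index.items() if words & set(text.split())]


def neighbor_retrieve(question: str, graph: Graph, index: Index) -> List[str]:
    # Same keyword match, then expand one hop along graph edges.
    hits = keyword_retrieve(question, graph, index)
    expanded = list(hits)
    for node in hits:
        for neighbor in graph[node]:
            if neighbor not in expanded:
                expanded.append(neighbor)
    return expanded


def join_generate(question: str, context: List[str]) -> str:
    # Stand-in for the LLM call: just concatenate the retrieved context.
    return " | ".join(context)


chunks = ["Paris is in France", "France is in Europe", "Tokyo is in Japan"]
baseline = Pipeline(chain_graph, lowercase_index, keyword_retrieve, join_generate)
variant = replace(baseline, retrieve=neighbor_retrieve)  # swap one module only

print(baseline.answer("Paris", chunks))
print(variant.answer("Paris", chunks))
```

Because only the `retrieve` stage differs between `baseline` and `variant`, any change in the answer can be attributed to that one module, which is the kind of diagnosis the bullet describes.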

Load-bearing premise

The chosen representative methods and QA datasets inside the unified framework produce a comparison that fairly represents the broader space of graph-based RAG approaches.

What would settle it

A follow-up experiment on a fresh collection of QA datasets or additional graph RAG methods in which none of the identified new variants exceeds the previous best scores.

Figures

Figures reproduced from arXiv: 2503.04338 by Runyuan He, Shu Wang, Sicong Liang, Taotao Wang, Xilin Liu, Yaodong Su, Yingli Zhou, Yixiang Fang, Yongwei Zhang, Youran Sun, Yuchi Ma.

Figure 1: Overview of vanilla RAG and graph-based RAG.
Figure 2: Workflow of graph-based RAG methods under our unified framework.
Figure 3: Workflow of our empirical study.
    Adjacent text from the paper: … basic RAG, respectively. If a method cannot finish in two days, we mark its result as N/A in the figures and “—” in the tables. Hyperparameter Settings. In our experiment, we use Llama-3-8B [11] as the default LLM, which is widely used in existing RAG methods [88]. For LLM, we set the maximum token length to 8,000, and use greedy decoding to generate one sample for the dete…
Figure 4: Token cost of graph building on specific QA datasets.
Figure 5: Token cost of index construction in specific QA.
Figure 6: The abstract QA results on Mix dataset.

    (a) Comprehensiveness
         VR  RA  GS  LR  FG
    VR   50  50   2  46  95
    RA   50  50  47  48  94
    GS   78  53  50  79  96
    LR   54  52  21  50  92
    FG    5   6   4   8  50

    (b) Diversity
         VR  RA  GS  LR  FG
    VR   50  64  58  64  93
    RA   36  50  42  49  85
    GS   42  55  50  52  92
    LR   36  51  48  50  88
    FG    7  15   8  12  50

    (c) Empowerment
         VR  RA  GS  LR  FG
    VR   50  52  36  39  95
    RA   48  50  45  45  93
    GS   64  54  50  41  97
    LR   61  55  59  50  95
    FG    5   7   3   5  50

         VR  RA  GS  LR  F…
Figure 7: The abstract QA results on MultihopSum dataset.
Figure 8: The abstract QA results on Agriculture dataset.
Figure 9: The abstract QA results on CS dataset.

    (a) Comprehensiveness
         VR  RA  GS  LR  FG
    VR   50  26  31  41  93
    RA   74  50  27  67  95
    GS   69  73  50  62  97
    LR   59  33  38  50  97
    FG    7   5   3   3  50

    (b) Diversity
         VR  RA  GS  LR  FG
    VR   50  36  32  45  90
    RA   64  50  68  68  93
    GS   68  33  50  66  94
    LR   55  32  34  50  93
    FG   10   7   6   7  50

    (c) Empowerment
         VR  RA  GS  LR  FG
    VR   50  24  29  34  95
    RA   76  50  31  67  96
    GS   71  69  50  60  96
    LR   66  33  40  50  97
    FG    5   4   4   3  50

         VR  RA  GS  LR  FG…
Figure 10: The abstract QA results on Legal dataset.
Figure 11: Token cost of the graph building on abstract QA datasets.
Figure 12: Token cost of index construction in abstract QA.
Figure 13: Comparison of our newly designed method on abstract QA datasets.
Figure 14: The taxonomy tree of RAG methods.
    Adjacent text from the paper: L2. Chunk quality is very important for the overall performance of all RAG methods, and human experts are better at splitting chunks than relying solely on token size. L3. For complex questions in specific QA, high-level information is typically needed, as they capture the complex relationship among chunks, and the vector search-based retrieval strategy is better than the…
Figure 15: Effect of chunk quality on the performance of specific QA tasks.
Figure 16: Proportion of the token costs for prompt and completion in graph building stage across all datasets.
Figure 17: Token costs for prompt and completion tokens in the generation stage across all datasets.
Figure 18: The prompt for generating abstract questions.
Figure 19: The prompt for the evaluation of abstract QA.

Original abstract

Graph-based Retrieval-Augmented Generation (RAG) has proven effective in integrating external knowledge into large language models (LLMs), improving their factual accuracy, adaptability, interpretability, and trustworthiness. A number of graph-based RAG methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework to incorporate all graph-based RAG methods from a high-level perspective. We then extensively compare representative graph-based RAG methods over a range of question-answering (QA) datasets -- from specific questions to abstract questions -- and examine the effectiveness of all methods, providing a thorough analysis of graph-based RAG approaches. As a byproduct of our experimental analysis, we are also able to identify new variants of the graph-based RAG methods over specific QA and abstract QA tasks respectively, by combining existing techniques, which outperform the state-of-the-art methods. Finally, based on these findings, we offer promising research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide new valuable insights for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a unified high-level framework that encompasses existing graph-based RAG methods. It performs extensive empirical comparisons of representative methods across QA datasets spanning specific to abstract questions, analyzes their effectiveness, identifies new variants obtained by combining existing techniques that outperform prior SOTA on specific and abstract QA tasks respectively, and outlines promising research directions based on the findings.

Significance. A sound unified framework and controlled comparison could help standardize evaluation practices in graph-based RAG and surface practically useful combinations. The byproduct identification of outperforming variants would be a concrete contribution if the selection process is shown to be systematic rather than selective.

major comments (2)
  1. [Experimental analysis] Experimental analysis section: the claim that new variants 'outperform the state-of-the-art' on specific and abstract QA tasks is load-bearing for the central contribution, yet the manuscript provides no description of the total search space size, whether the combination search was pre-specified, the number of combinations evaluated, or any correction for multiple testing. The 'byproduct' phrasing increases the risk that only successful combinations are highlighted.
  2. [Unified framework] Unified framework section: it is unclear whether the framework definition introduces any implicit bias in how representative methods are instantiated or whether all methods are placed on equal footing with respect to hyper-parameter tuning budgets and retrieval settings; this directly affects the fairness of the reported outperformance.
minor comments (2)
  1. [Abstract] The abstract says representative methods are compared 'extensively' without referencing dataset splits, statistical significance tests, or variance across runs; these details belong in the main experimental protocol.
  2. [Notation] Notation for graph construction and retrieval operators should be made consistent between the framework description and the experimental tables.
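
Major comment 1 asks whether any correction for multiple testing was applied. One standard option is Holm's step-down procedure, sketched below. Nothing here comes from the paper: the function name `holm_bonferroni` and the p-values in the usage lines are hypothetical, chosen only to show how a result that looks significant on its own can fail once several variants are tested together.

```python
from typing import List


def holm_bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, per hypothesis, whether it is rejected under Holm's step-down rule."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        # The threshold loosens as hypotheses are rejected: alpha/m, alpha/(m-1), ...
        if p_values[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # the first failure stops the procedure; the rest are kept
    return rejected


# Hypothetical p-values for four candidate variants against a baseline.
example = [0.001, 0.04, 0.03, 0.20]
print(holm_bonferroni(example))  # [True, False, False, False]
```

In this made-up example three of the four p-values fall below 0.05 individually, but only the smallest survives the correction.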

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to enhance transparency and rigor.

Point-by-point responses
  1. Referee: [Experimental analysis] Experimental analysis section: the claim that new variants 'outperform the state-of-the-art' on specific and abstract QA tasks is load-bearing for the central contribution, yet the manuscript provides no description of the total search space size, whether the combination search was pre-specified, the number of combinations evaluated, or any correction for multiple testing. The 'byproduct' phrasing increases the risk that only successful combinations are highlighted.

    Authors: We agree that the current description lacks sufficient detail on the combination process. The variants were not the result of an exhaustive or pre-specified combinatorial search but were instead derived from targeted, hypothesis-driven recombinations of components identified during our comparative analysis of the unified framework. We will revise the experimental analysis section to explicitly state the rationale, the number of combinations explored (approximately two dozen component swaps across the main methods), and the absence of multiple-testing corrections, since the process was exploratory rather than statistical. We will also replace the 'byproduct' phrasing with language that better reflects the systematic, insight-guided nature of the exploration. revision: yes

  2. Referee: [Unified framework] Unified framework section: it is unclear whether the framework definition introduces any implicit bias in how representative methods are instantiated or whether all methods are placed on equal footing with respect to hyper-parameter tuning budgets and retrieval settings; this directly affects the fairness of the reported outperformance.

    Authors: The framework is intentionally high-level and component-based to avoid favoring any particular method. All representative methods were instantiated using the same retrieval pipeline (identical embedding model, vector index, and top-k setting) and the same LLM backbone. Hyper-parameters for each method were tuned independently on a held-out validation split to their best achievable performance under a uniform computational budget. We will add a dedicated paragraph in the unified framework and experimental setup sections to document this protocol and confirm that no implicit bias was introduced by the framework definition. revision: yes
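
The "uniform computational budget" in the second response can be made concrete with a small sketch. This is one assumed way to enforce it (random search with a fixed number of trials per method), not the protocol the paper reports; `tune_with_budget`, `make_scorer`, the method names, and the `top_k` search space are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, object]


def tune_with_budget(
    score: Callable[[Config], float],
    search_space: Dict[str, List[object]],
    budget: int,
    seed: int = 0,
) -> Tuple[Config, float]:
    """Random search that evaluates exactly `budget` configurations."""
    rng = random.Random(seed)
    best_cfg: Config = {}
    best_score = float("-inf")
    for _ in range(budget):
        cfg = {name: rng.choice(values) for name, values in search_space.items()}
        s = score(cfg)  # in practice: a validation-split metric
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score


calls = {"method_a": 0, "method_b": 0}  # counts validation evaluations per method


def make_scorer(name: str, target_k: int) -> Callable[[Config], float]:
    # Toy validation score that peaks when top_k equals target_k.
    def score(cfg: Config) -> float:
        calls[name] += 1
        return -abs(cfg["top_k"] - target_k)
    return score


space = {"top_k": [1, 3, 5, 10]}
results = {
    name: tune_with_budget(make_scorer(name, k), space, budget=20)
    for name, k in [("method_a", 5), ("method_b", 10)]
}
print(calls)  # every method received the same number of evaluations
```

Counting evaluations per method, as `calls` does here, gives a simple audit trail that no method was tuned more heavily than another.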

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking and combination search

full rationale

The paper summarizes an existing unified framework for graph-based RAG methods, performs extensive empirical comparisons across QA datasets, and reports new variants found by combining techniques that outperform baselines on the tested tasks. No derivations, first-principles predictions, fitted parameters renamed as predictions, or load-bearing self-citations are present. The outperformance claim rests on experimental results rather than any reduction to inputs by construction. This matches the default case of a self-contained empirical study with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical comparison and benchmarking study; it introduces no mathematical free parameters, axioms, or postulated entities.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure

    cs.CL 2026-05 unverdicted novelty 6.0

    H-Mem introduces a hybrid tree-plus-graph memory mechanism that evolves short-term agent memories into long-term summaries and enables efficient retrieval, reporting state-of-the-art QA results on three benchmarks.

  2. ASTRA-QA: A Benchmark for Abstract Question Answering over Documents

    cs.CL 2026-05 unverdicted novelty 6.0

    ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.

  3. SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution

    cs.CL 2026-05 unverdicted novelty 6.0

    SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.

  4. EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

    cs.DB 2026-04 unverdicted novelty 6.0

    EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 4 Pith papers · 18 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Eric Anderson, Jonathan Fritz, Austin Lee, Bohou Li, Mark Lindblad, Henry Lindeman, Alex Meyer, Parth Parmar, Tanvi Ranade, Mehul A Shah, et al. 2024. The Design of an LLM-powered Unstructured Analytics System. arXiv preprint arXiv:2409.00847 (2024)

  3. [3]

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511 (2023)

  5. [5]

    Taiyu Ban, Lyvzhou Chen, Xiangyu Wang, and Huanhuan Chen. 2023. From query tools to causal architects: Harnessing large language models for advanced causal discovery from data. arXiv preprint arXiv:2306.16902 (2023)

  6. [6]

    Sibei Chen, Ju Fan, Bin Wu, Nan Tang, Chao Deng, Pengyi Wang, Ye Li, Jian Tan, Feifei Li, Jingren Zhou, et al. 2024. Automatic Database Configuration Debugging using Retrieval-Augmented Language Models. arXiv preprint arXiv:2412.07548 (2024)

  7. [7]

    Sibei Chen, Yeye He, Weiwei Cui, Ju Fan, Song Ge, Haidong Zhang, Dongmei Zhang, and Surajit Chaudhuri. 2024. Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations. Proceedings of the ACM on Management of Data 2, 3 (2024), 1–27

  8. [8]

    Sibei Chen, Nan Tang, Ju Fan, Xuemi Yan, Chengliang Chai, Guoliang Li, and Xiaoyong Du. 2023. Haipipe: Combining human-generated and machine-generated pipelines for data preparation. Proceedings of the ACM on Management of Data 1, 1 (2023), 1–26

  9. [9]

    Zui Chen, Lei Cao, Sam Madden, Tim Kraska, Zeyuan Shang, Ju Fan, Nan Tang, Zihui Gu, Chunwei Liu, and Michael Cafarella. 2023. SEED: Domain-Specific Data Curation With Large Language Models. arXiv e-prints (2023), arXiv–2310

  10. [10]

    Jacob Devlin. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  11. [11]

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022)

  12. [12]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  13. [13]

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130 (2024)

  14. [14]

    Ju Fan, Zihui Gu, Songyue Zhang, Yuxin Zhang, Zui Chen, Lei Cao, Guoliang Li, Samuel Madden, Xiaoyong Du, and Nan Tang. 2024. Combining small language models and large language models for zero-shot nl2sql. Proceedings of the VLDB Endowment 17, 11 (2024), 2750–2763

  15. [15]

    Meihao Fan, Xiaoyue Han, Ju Fan, Chengliang Chai, Nan Tang, Guoliang Li, and Xiaoyong Du. 2024. Cost-effective in-context learning for entity resolution: A design space exploration. In 2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE, 3696–3709

  16. [16]

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 6491–6501

  17. [17]

    FastGraphRAG. 2024. FastGraphRAG. https://github.com/circlemind-ai/fast-graphrag

  18. [18]

    Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023. Enabling large language models to generate text with citations. arXiv preprint arXiv:2305.14627 (2023)

  19. [19]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023)

  20. [20]

    Aashish Ghimire, James Prather, and John Edwards. 2024. Generative AI in Education: A Study of Educators’ Awareness, Sentiments, and Influencing Factors. arXiv preprint arXiv:2403.15586 (2024)

  21. [21]

    Victor Giannankouris and Immanuel Trummer. 2024. λ-Tune: Harnessing Large Language Models for Automated Database System Tuning. arXiv preprint arXiv:2411.03500 (2024)

  22. [22]

    Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. 2024. LightRAG: Simple and Fast Retrieval-Augmented Generation. arXiv e-prints (2024), arXiv–2410

  23. [23]

    Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. 2024. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. arXiv preprint arXiv:2405.14831 (2024)

  25. [25]

    Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A Rossi, Subhabrata Mukherjee, Xianfeng Tang, et al. 2024. Retrieval-augmented generation with graphs (graphrag). arXiv preprint arXiv:2501.00309 (2024)

  26. [26]

    Jiawei Han, Jian Pei, and Hanghang Tong. 2022. Data mining: concepts and techniques. Morgan kaufmann

  27. [27]

    Taher H Haveliwala. 2002. Topic-sensitive pagerank. In Proceedings of the 11th international conference on World Wide Web. 517–526

  28. [28]

    Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. 2024. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering. arXiv preprint arXiv:2402.07630 (2024)

  29. [29]

    Yucheng Hu and Yuxing Lu. 2024. Rag and rau: A survey on retrieval-augmented language model in natural language processing. arXiv preprint arXiv:2404.19543 (2024)

  30. [30]

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. 2023. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232 (2023)

  32. [32]

    Yizheng Huang and Jimmy Huang. 2024. A Survey on Retrieval-Augmented Text Generation for Large Language Models. arXiv preprint arXiv:2404.10981 (2024)

  33. [33]

    Yiqian Huang, Shiqi Zhang, and Xiaokui Xiao. 2025. KET-RAG: A Cost-Efficient Multi-Granular Indexing Framework for Graph-RAG. arXiv preprint arXiv:2502.09304 (2025)

  34. [34]

    huawei. 2019. Ascend GPU. https://e.huawei.com/ph/products/computing/ascend

  35. [35]

    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park

  36. [36]

    J.; and Park, J

    Adaptive-rag: Learning to adapt retrieval-augmented large language mod- els through question complexity. arXiv preprint arXiv:2403.14403 (2024)

  37. [37]

    Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023. StructGPT: A General Framework for Large Language Model to Reason over Structured Data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 9237–9251

  38. [38]

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. 2023. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714 (2023)

  39. [39]

    Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 39–48

  40. [40]

    LangChain. 2023. LangChain. https://python.langchain.com/docs/additional_resources/arxiv_references/

  41. [41]

    Jiale Lao, Yibo Wang, Yufei Li, Jianping Wang, Yunjia Zhang, Zhiyuan Cheng, Wanghu Chen, Mingjie Tang, and Jianguo Wang. 2024. Gptuner: A manual-reading database tuning system via gpt-guided bayesian optimization. Proceedings of the VLDB Endowment 17, 8 (2024), 1939–1952

  42. [42]

    Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. 2024. The Dawn of Natural Language to SQL: Are We Fully Ready? arXiv preprint arXiv:2406.01265 (2024)

  43. [43]

    Dawei Li, Shu Yang, Zhen Tan, Jae Young Baik, Sukwon Yun, Joseph Lee, Aaron Chacko, Bojian Hou, Duy Duong-Tran, Ying Ding, et al. 2024. DALK: Dynamic Co-Augmentation of LLMs and KG to answer Alzheimer’s Disease Questions with Scientific Literature. arXiv preprint arXiv:2405.04819 (2024)

  44. [44]

    Lan Li, Liri Fang, and Vetle I Torvik. 2024. AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark. arXiv preprint arXiv:2412.06724 (2024)

  45. [45]

    Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. 2023. Large language models in finance: A survey. In Proceedings of the fourth ACM international conference on AI in finance. 374–382

  46. [46]

    Zhaodonghui Li, Haitao Yuan, Huiming Wang, Gao Cong, and Lidong Bing. 2024. LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency. arXiv preprint arXiv:2404.12872 (2024)

  47. [47]

    Zhaodonghui Li, Haitao Yuan, Huiming Wang, Gao Cong, and Lidong Bing. 2025. LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency. Proceedings of the VLDB Endowment 1, 18 (2025), 53–65

  48. [48]

    Chen Liang, Donghua Yang, Zheng Liang, Zhiyu Liang, Tianle Zhang, Boyu Xiao, Yuqing Yang, Wenqi Wang, and Hongzhi Wang. 2025. Revisiting Data Analysis with Pre-trained Foundation Models. arXiv preprint arXiv:2501.01631 (2025)

  49. [49]

    Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, and Aditya G Parameswaran. 2025. TWIX: Automatically Reconstructing Structured Data from Templatized Documents. arXiv preprint arXiv:2501.06659 (2025)

  50. [50]

    Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeigham, Aditya G Parameswaran, and Eugene Wu. 2024. Towards Accurate and Efficient Document Analytics with Large Language Models. arXiv preprint arXiv:2405.04674 (2024)

  51. [51]

    Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, and Gerardo Vitagliano. 2024. A Declarative System for Optimizing AI Workloads. arXiv preprint arXiv:2405.14696 (2024)

  53. [53]

    Chunwei Liu, Gerardo Vitagliano, Brandon Rose, Matt Prinz, David Andrew Samson, and Michael Cafarella. 2025. PalimpChat: Declarative and Interactive AI analytics. arXiv preprint arXiv:2502.03368 (2025)

  54. [54]

    Lei Liu, Xiaoyan Yang, Junchi Lei, Xiaoyang Liu, Yue Shen, Zhiqiang Zhang, Peng Wei, Jinjie Gu, Zhixuan Chu, Zhan Qin, et al. 2024. A Survey on Medical Large Language Models: Technology, Application, Trustworthiness, and Future Directions. arXiv preprint arXiv:2406.03712 (2024)

  55. [55]

    llamaindex. 2023. llamaindex. https://www.llamaindex.ai/

  56. [56]

    Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2023. Reasoning on graphs: Faithful and interpretable large language model reasoning. arXiv preprint arXiv:2310.01061 (2023)

  57. [57]

    Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence 42, 4 (2018), 824–836

  58. [58]

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511 (2022)

  59. [59]

    Costas Mavromatis and George Karypis. 2024. GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning. arXiv preprint arXiv:2405.20139 (2024)

  60. [60]

    Multi-Linguality Multi-Functionality Multi-Granularity. 2024. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. (2024)

  61. [61]

    Zan Ahmad Naeem, Mohammad Shahmeer Ahmad, Mohamed Eltabakh, Mourad Ouzzani, and Nan Tang. 2024. RetClean: Retrieval-Based Data Cleaning Using LLMs and Data Lakes. Proceedings of the VLDB Endowment 17, 12 (2024), 4421–4424

  62. [62]

    Avanika Narayan, Ines Chami, Laurel Orr, and Christopher Ré. 2022. Can Foundation Models Wrangle Your Data? Proceedings of the VLDB Endowment 16, 4 (2022), 738–746

  63. [63]

    nebula. 2010. nebula. https://www.nebula-graph.io/

  64. [64]

    neo4j. 2006. neo4j. https://neo4j.com/

  65. [65]

    Yuqi Nie, Yaxuan Kong, Xiaowen Dong, John M Mulvey, H Vincent Poor, Qingsong Wen, and Stefan Zohren. 2024. A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges. arXiv preprint arXiv:2406.11903 (2024)

  66. [66]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744

  67. [67]

    Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, et al. 2021. QuALITY: Question answering with long input texts, yes! arXiv preprint arXiv:2112.08608 (2021)

  69. [69]

    Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. Lotus: Enabling semantic queries with llms over tables of unstructured and structured data. arXiv preprint arXiv:2407.11418 (2024)

  70. [70]

    Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. 2024. Graph retrieval-augmented generation: A survey. arXiv preprint arXiv:2408.08921 (2024)

  71. [71]

    Hongjin Qian, Peitian Zhang, Zheng Liu, Kelong Mao, and Zhicheng Dou. 2024. Memorag: Moving towards next-gen rag via memory-inspired knowledge discovery. arXiv preprint arXiv:2409.05591 (2024)

  72. [72]

    Yichen Qian, Yongyi He, Rong Zhu, Jintao Huang, Zhijian Ma, Haibin Wang, Yaohua Wang, Xiuyu Sun, Defu Lian, Bolin Ding, et al. 2024. UniDM: A Unified Framework for Data Manipulation with Large Language Models. Proceedings of Machine Learning and Systems 6 (2024), 465–482

  73. [73]

    The Technique Report. 2025. In-depth Analysis of Graph-based RAG in a Unified Framework (technical report). https://github.com/JayLZhou/GraphRAG/blob/master/VLDB2025_GraphRAG.pdf

  74. [74]

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D Manning. 2024. Raptor: Recursive abstractive processing for tree-organized retrieval. arXiv preprint arXiv:2401.18059 (2024)

  75. [75]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems 36 (2024)

  76. [76]

    Vikramank Singh, Kapil Eknath Vaidya, Vinayshekhar Bannihatti Kumar, Sopan Khosla, Murali Narayanaswamy, Rashmi Gangadharaiah, and Tim Kraska. 2024. Panda: Performance debugging for databases using LLM agents. (2024)

  77. [77]

    Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, and Suranga Nanayakkara. 2023. Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Transactions of the Association for Computational Linguistics 11 (2023), 1–17

  78. [78]

    Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, and Jian Guo. 2024. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. In The Twelfth International Conference on Learning Representations

  79. [79]

    Zhaoyan Sun, Xuanhe Zhou, and Guoliang Li. 2024. R-Bot: An LLM-based Query Rewrite System. arXiv preprint arXiv:2412.01661 (2024)

  80. [80]

    Yixuan Tang and Yi Yang. 2024. Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391 (2024)

Showing first 80 references.