
arxiv: 2604.26176 · v4 · pith:HJGG4FJUnew · submitted 2026-04-28 · 💻 cs.DB · cs.CL

CacheRAG: A Semantic Caching System for Retrieval-Augmented Generation in Knowledge Graph Question Answering

Pith reviewed 2026-07-01 08:26 UTC · model grok-4.3

classification 💻 cs.DB cs.CL
keywords CacheRAG · semantic caching · retrieval-augmented generation · knowledge graph question answering · KGQA · LLM planning · RAG caching · schema-agnostic interface

The pith

CacheRAG equips LLM-based knowledge graph question answering with a semantic cache that learns from historical queries to reduce hallucinations and expand retrieval coverage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current LLM-driven KGQA systems regenerate retrieval plans from scratch for every query, like a database without a plan cache, which causes schema hallucinations and limited coverage. CacheRAG addresses this by adding a cache-augmented architecture that turns these stateless planners into continual learners. It does so through three design principles: a schema-agnostic two-stage parsing framework using Intermediate Semantic Representation, a two-layer hierarchical cache index with Maximal Marginal Relevance for diversity, and bounded heuristic subgraph expansion with complexity guarantees. Experiments on multiple benchmarks show these changes deliver concrete gains such as 13.2 percent higher accuracy and 17.5 percent higher truthfulness on the CRAG dataset. A sympathetic reader would care because the approach reuses past query patterns to make repeated or similar questions more reliable without requiring users to know the underlying schema.

Core claim

CacheRAG transforms stateless LLM planners in KGQA into continual learners through a semantic caching system built on a schema-agnostic ISR interface, diversity-optimized hierarchical cache retrieval with MMR, and bounded heuristic subgraph expansion, resulting in significantly improved accuracy and truthfulness over baselines.

What carries the argument

The CacheRAG architecture with its two-stage ISR semantic parsing, Domain-to-Aspect hierarchical index paired with MMR for cache selection, and deterministic depth-breadth subgraph operators that enforce complexity bounds.
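The MMR step named above can be illustrated with a minimal sketch of the standard greedy Maximal Marginal Relevance selection. This is not the paper's implementation: the function names (`cosine`, `mmr_select`), the use of plain cosine similarity over toy vectors, and the default `lam` value are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr_select(query_vec, candidates, k, lam=0.5):
    """Greedy MMR over cached entries (hypothetical interface).

    candidates: list of (entry_id, vector) pairs.
    lam: trade-off weight; 1.0 is pure relevance, lower values favor diversity.
    Returns the ids of up to k selected entries, in selection order.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(vec, s_vec) for _, s_vec in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [entry_id for entry_id, _ in selected]


# Toy cache: "a" and "b" are near-duplicates, "c" is structurally different.
toy = [("a", [1.0, 0.0]), ("b", [0.99, 0.14]), ("c", [0.6, 0.8])]
pure_relevance = mmr_select([1.0, 0.0], toy, k=2, lam=1.0)
diverse = mmr_select([1.0, 0.0], toy, k=2, lam=0.3)
```

In this toy setup, pure relevance picks the two near-duplicates, while a lower `lam` swaps the second pick for the dissimilar entry, which is the kind of structural variety the page attributes to MMR.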

If this is right

  • Non-expert users can query using natural language via the ISR framework without needing schema details.
  • The MMR-based retrieval promotes structural variety in examples, reducing homogeneity in LLM reasoning.
  • Bounded subgraph expansion enhances recall while maintaining strict complexity limits.
  • The system achieves higher accuracy and truthfulness on benchmarks like CRAG compared to prior stateless approaches.
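The bounded-expansion point in the list above can be made concrete with a small sketch. It assumes a breadth-first traversal with a hard depth cap and a per-node breadth cap. The paper's actual operators are not specified on this page, so `bounded_expand`, `neighbors`, `max_depth`, and `max_breadth` are hypothetical names, and taking neighbors in sorted order is only a stand-in for whatever deterministic heuristic the authors use.

```python
from collections import deque


def bounded_expand(neighbors, seed, max_depth, max_breadth):
    """Expand a subgraph from `seed` under strict depth and breadth caps.

    neighbors: callable mapping a node to an iterable of adjacent nodes
               (stands in for a KG API call; an assumption of this sketch).
    Returns (visited_nodes, number_of_neighbor_calls).
    """
    visited = {seed}
    frontier = deque([(seed, 0)])
    calls = 0
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # depth cap reached: never query below this level
        calls += 1
        # Deterministic breadth cap: keep only the first max_breadth neighbors.
        for nxt in sorted(neighbors(node))[:max_breadth]:
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, depth + 1))
    return visited, calls


# Toy graph used only for illustration.
graph = {"a": ["b", "c", "d"], "b": ["e"], "c": ["f"], "e": ["g"]}
nodes, calls = bounded_expand(lambda n: graph.get(n, []), "a", 2, 2)
```

Under these assumptions the number of neighbor calls is at most 1 + b + ... + b^(d-1) for breadth cap b and depth cap d, so the cost is fixed by the two parameters rather than by the size of the graph.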

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar caching strategies could be adapted for other retrieval-augmented tasks where query history is available.
  • The hierarchical indexing might scale to larger knowledge graphs if the domain-aspect structure holds across domains.
  • Testing on real-world user query logs could reveal whether the diversity optimization generalizes beyond the tested datasets.

Load-bearing premise

That the three design principles can be implemented in practice without creating new failure modes or latency costs that erase the reported accuracy and truthfulness improvements.

What would settle it

An experiment on a benchmark consisting only of novel queries with no historical matches, checking whether the accuracy and truthfulness gains remain or if the system incurs extra latency from cache operations.

Figures

Figures reproduced from arXiv: 2604.26176 by Lei Chen, Yushi Sun.

Figure 1: Comparison of stateless LLM execution (baseline) and CacheRAG (our approach) on a KGQA task. (a) Input: natural…
Figure 2: Illustration of retrieval path expansion for the running example. (a) Direct prompting: LLM incorrectly checks…
Figure 3: The overall pipeline of CacheRAG, featuring a…
Figure 4: The two-layer cache structure for multi-domain…
Figure 5: Scalability Experiments. 5.11 Time and Memory Complexity and Scalability: We analyze the time and memory complexity of CacheRAG using the formal bounds defined in Section 4. For semantic caching, the hierarchical index routing takes O(1) dictionary lookup time, while the MMR scheduling takes O(b log b) time, utilizing O(N) total space (where b ≪ N is the localized bucket size and N is the global cache si…
Figure 6: The parameter λ's sensitivity of CacheRAG on CRAG dataset. • CBR [6]: CBR is a neuro-symbolic method. It retrieves similar question cases, reuses their logical form components, and revises the generated form using KB embeddings to handle complex KBQA and unseen relations. Note that since the API-based CRAG dataset does not support SPARQL querying, we ran them only on SPARQL-based datasets. A.4 Parameter…
Figure 7: The multi-KG domain routing accuracy of different…
original abstract

The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) has significantly advanced Knowledge Graph Question Answering (KGQA). However, existing LLM-driven KGQA systems act as stateless planners, generating retrieval plans in isolation without exploiting historical query patterns: analogous to a database system that optimizes every query from scratch without a plan cache. This fundamental design flaw leads to schema hallucinations and limited retrieval coverage. We propose CacheRAG, a systematic cache-augmented architecture for LLM-based KGQA that transforms stateless planners into continual learners. Unlike traditional database plan caching (which optimizes for frequency), CacheRAG introduces three novel design principles tailored for LLM contexts: (1) Schema-agnostic user interface: A two-stage semantic parsing framework via Intermediate Semantic Representation (ISR) enables non-expert users to interact purely in natural language, while a Backend Adapter grounds the LLM with local schema context to compile executable physical queries safely. (2) Diversity-optimized cache retrieval: A two-layer hierarchical index (Domain → Aspect) coupled with Maximal Marginal Relevance (MMR) maximizes structural variety in cached examples, effectively mitigating reasoning homogeneity. (3) Bounded heuristic expansion: Deterministic depth and breadth subgraph operators with strict complexity guarantees significantly enhance retrieval recall without risking unbounded API execution. Extensive experiments on multiple benchmarks demonstrate that CacheRAG significantly outperforms state-of-the-art baselines (e.g., +13.2% accuracy and +17.5% truthfulness on the CRAG dataset).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes CacheRAG, a cache-augmented architecture for LLM-based Knowledge Graph Question Answering (KGQA). It transforms stateless LLM planners into continual learners by introducing three design principles: (1) a schema-agnostic interface using Intermediate Semantic Representation (ISR) and a Backend Adapter, (2) diversity-optimized cache retrieval using a hierarchical index and Maximal Marginal Relevance (MMR), and (3) bounded heuristic subgraph expansion with deterministic operators. The paper claims that extensive experiments on multiple benchmarks show significant outperformance over state-of-the-art baselines, including +13.2% accuracy and +17.5% truthfulness on the CRAG dataset.

Significance. If the reported gains are robustly demonstrated and attributable to the proposed architecture, CacheRAG could represent a meaningful advance in making RAG systems for KGQA more efficient and reliable by leveraging historical query patterns in a manner adapted to LLM contexts. The emphasis on schema-agnostic interaction and bounded operations addresses practical deployment concerns in database-integrated LLM systems.

major comments (3)
  1. [Abstract and experiments section] The headline performance claims (+13.2% accuracy, +17.5% truthfulness on CRAG) are presented without reference to the experimental protocol, baseline implementations, statistical significance testing, or ablation results in the abstract. If these details are not provided in the experiments section with sufficient rigor (e.g., multiple runs, error bars, ablation on each principle), the attribution of gains to the three design principles cannot be verified.
  2. [ISR interface description] The claim that the Backend Adapter 'grounds the LLM with local schema context to compile executable physical queries safely' requires explicit verification that it prevents schema hallucinations. The manuscript should include failure case analysis or metrics showing reduction in hallucination rates compared to baselines without the adapter.
  3. [Cache retrieval description] While MMR is used to maximize structural variety, the paper should demonstrate that this does not trade off relevance in a way that reduces recall on KGQA queries. A comparison of MMR vs. pure relevance-based retrieval on recall metrics would strengthen the claim that diversity optimization enhances rather than harms performance.
minor comments (2)
  1. [Cache retrieval description] The notation for the hierarchical index (Domain → Aspect) should be formalized with equations or pseudocode for clarity.
  2. [Experiments section] Ensure all baselines mentioned in experiments are cited with full references.
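As one possible reading of the Domain → Aspect notation raised in the minor comments, the two-layer index can be sketched as a nested dictionary in which the domain picks a first-level bucket and the aspect picks a localized bucket inside it. This is a hedged illustration and not the authors' data structure: the class name `TwoLayerCache`, its methods, and the example domain and aspect labels are all hypothetical.

```python
from collections import defaultdict


class TwoLayerCache:
    """Hypothetical Domain -> Aspect index over cached query plans."""

    def __init__(self):
        # domain -> aspect -> list of cached entries
        self._buckets = defaultdict(lambda: defaultdict(list))

    def add(self, domain, aspect, entry):
        """Store an entry in the bucket addressed by (domain, aspect)."""
        self._buckets[domain][aspect].append(entry)

    def lookup(self, domain, aspect):
        """Return a copy of the localized bucket, or [] if it does not exist.

        Uses .get() so that a miss does not create empty buckets.
        """
        domain_bucket = self._buckets.get(domain)
        if domain_bucket is None:
            return []
        return list(domain_bucket.get(aspect, []))


cache = TwoLayerCache()
cache.add("movie", "release_date", "plan-1")
cache.add("movie", "release_date", "plan-2")
cache.add("movie", "cast", "plan-3")
hits = cache.lookup("movie", "release_date")
misses = cache.lookup("music", "artist")
```

Routing here is two average-case constant-time dictionary lookups. A diversity-aware selector such as MMR would then only need to rank the entries in the returned bucket rather than the whole cache.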

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the clarity and rigor of our work. We address each major comment point-by-point below.

point-by-point responses
  1. Referee: [Abstract and experiments section] The headline performance claims (+13.2% accuracy, +17.5% truthfulness on CRAG) are presented without reference to the experimental protocol, baseline implementations, statistical significance testing, or ablation results in the abstract. If these details are not provided in the experiments section with sufficient rigor (e.g., multiple runs, error bars, ablation on each principle), the attribution of gains to the three design principles cannot be verified.

    Authors: The abstract is a concise summary; the experiments section details the protocol, baselines, and results. To address the concern about rigor and attribution, we will revise the abstract to reference the experimental setup and augment the experiments section with statistical significance testing, error bars from multiple runs, and ablations isolating each of the three design principles. revision: yes

  2. Referee: [ISR interface description] The claim that the Backend Adapter 'grounds the LLM with local schema context to compile executable physical queries safely' requires explicit verification that it prevents schema hallucinations. The manuscript should include failure case analysis or metrics showing reduction in hallucination rates compared to baselines without the adapter.

    Authors: We will add a failure-case analysis subsection and quantitative metrics comparing schema hallucination rates with and without the Backend Adapter to explicitly verify its contribution to safe query compilation. revision: yes

  3. Referee: [Cache retrieval description] While MMR is used to maximize structural variety, the paper should demonstrate that this does not trade off relevance in a way that reduces recall on KGQA queries. A comparison of MMR vs. pure relevance-based retrieval on recall metrics would strengthen the claim that diversity optimization enhances rather than harms performance.

    Authors: We agree this comparison is valuable. We will add an explicit ablation comparing MMR-based cache retrieval against pure relevance-based retrieval on recall metrics to demonstrate that diversity optimization does not reduce recall on KGQA queries. revision: yes

Circularity Check

0 steps flagged

No significant circularity; system proposal with empirical claims only.

full rationale

The paper describes an architecture (ISR interface, MMR cache, bounded expansion) and reports benchmark gains (+13.2% accuracy on CRAG) without any equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations. No step reduces a claimed result to its own inputs by construction; the central claims rest on external experimental outcomes rather than internal redefinition or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no concrete free parameters, axioms, or invented entities; all such elements would require the full manuscript.

pith-pipeline@v0.9.1-grok · 5804 in / 927 out tokens · 34983 ms · 2026-07-01T08:26:49.474653+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers

    cs.DB 2026-07 unverdicted novelty 7.0

    SOLAR is a learning-augmented policy for semantic cache replacement that achieves constant competitive ratio 3 and 5-75% gains over FIFO on retrieval workloads.

Reference graph

Works this paper leans on

43 extracted references · 6 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    The volcano optimizer generator: Extensibility and efficient search,

    G. Graefe and W. J. McKenna, “The volcano optimizer generator: Extensibility and efficient search,” in Proceedings of IEEE 9th international conference on data engineering. IEEE, 1993, pp. 209–218

  2. [2]

    Cache-craft: Managing chunk-caches for efficient retrieval-augmented generation,

    S. Agarwal, S. Sundaresan, S. Mitra, D. Mahapatra, A. Gupta, R. Sharma, N. J. Kapu, T. Yu, and S. Saini, “Cache-craft: Managing chunk-caches for efficient retrieval-augmented generation,” Proceedings of the ACM on Management of Data, vol. 3, no. 3, pp. 1–28, 2025

  3. [3]

    Ragcache: Efficient knowledge caching for retrieval-augmented generation,

    C. Jin, Z. Zhang, X. Jiang, F. Liu, S. Liu, X. Liu, and X. Jin, “Ragcache: Efficient knowledge caching for retrieval-augmented generation,” ACM Transactions on Computer Systems, vol. 44, no. 1, pp. 1–27, 2025

  4. [4]

    Buffer of thoughts: Thought-augmented reasoning with large language models,

    L. Yang, Z. Yu, T. Zhang, S. Cao, M. Xu, W. Zhang, J. E. Gonzalez, and B. Cui, “Buffer of thoughts: Thought-augmented reasoning with large language models,” Advances in Neural Information Processing Systems, vol. 37, pp. 113519–113544, 2024

  5. [5]

    Semantic parsing via staged query graph generation: Question answering with knowledge base,

    S. W.-t. Yih, M.-W. Chang, X. He, and J. Gao, “Semantic parsing via staged query graph generation: Question answering with knowledge base,” in Proceedings of the Joint Conference of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing of the AFNLP, 2015. [Online]. Available: https://aclanthology.org/P...

  6. [6]

    The value of semantic parse labeling for knowledge base question answering,

    W.-t. Yih, M. Richardson, C. Meek, M.-W. Chang, and J. Suh, “The value of semantic parse labeling for knowledge base question answering,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016, pp. 201–206. [Online]. Available: https://aclanthology.org/P16-2033.pdf

  7. [7]

    Knowledge base question answering via encoding of complex query graphs,

    K. Luo, F. Lin, X. Luo, and K. Zhu, “Knowledge base question answering via encoding of complex query graphs,” in Proceedings of the 2018 conference on empirical methods in natural language processing, 2018, pp. 2185–2194

  8. [8]

    Uhop: An unrestricted-hop relation extraction framework for knowledge-based question answering,

    Z.-Y. Chen, C.-H. Chang, Y.-P. Chen, J. Nayak, and L.-W. Ku, “Uhop: An unrestricted-hop relation extraction framework for knowledge-based question answering,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 345–356

  9. [9]

    Knowledge base question answering with topic units,

    Y. Lan, S. Wang, and J. Jiang, “Knowledge base question answering with topic units,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019, pp. 5046–5052. [Online]. Available: https://www.ijcai.org/proceedings/2019/0701.pdf

  10. [10]

    Query graph generation for answering multi-hop complex questions from knowledge bases,

    Y. Lan and J. Jiang, “Query graph generation for answering multi-hop complex questions from knowledge bases,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 969–974

  11. [11]

    Unik-qa: Unified representations of structured and unstructured knowledge for open-domain question answering,

    B. Oguz, X. Chen, V. Karpukhin, S. Peshterliev, D. Okhonko, M. Schlichtkrull, S. Gupta, Y. Mehdad, and S. Yih, “Unik-qa: Unified representations of structured and unstructured knowledge for open-domain question answering,” in Findings of the Association for Computational Linguistics: NAACL 2022, 2022, pp. 1535–1546

  12. [12]

    Case-based reasoning for natural language queries over knowledge bases,

    R. Das, M. Zaheer, D. Thai, A. Godbole, E. Perez, J.-Y. Lee, L. Tan, L. Polymenakos, and A. Mccallum, “Case-based reasoning for natural language queries over knowledge bases,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 9594–9611

  13. [13]

    Rng-kbqa: Generation augmented iterative ranking for knowledge base question answering,

    X. Ye, S. Yavuz, K. Hashimoto, Y. Zhou, and C. Xiong, “Rng-kbqa: Generation augmented iterative ranking for knowledge base question answering,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 6032–6043

  14. [14]

    Program transfer for answering complex questions over knowledge bases,

    S. Cao, J. Shi, Z. Yao, X. Lv, J. Yu, L. Hou, J. Li, Z. Liu, and J. Xiao, “Program transfer for answering complex questions over knowledge bases,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 8128–8140

  15. [15]

    Arcaneqa: Dynamic program induction and contextualized encoding for knowledge base question answering,

    Y. Gu and Y. Su, “Arcaneqa: Dynamic program induction and contextualized encoding for knowledge base question answering,” in Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 1718–1731. [Online]. Available: https://aclanthology.org/2022.coling-1.148/

  16. [16]

    Tiara: Multi-grained retrieval for robust question answering over large knowledge base,

    Y. Shu, Z. Yu, Y. Li, B. Karlsson, T. Ma, Y. Qu, and C.-Y. Lin, “Tiara: Multi-grained retrieval for robust question answering over large knowledge base,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 8108–8121

  17. [17]

    Logical form generation via multi-task learning for complex question answering over knowledge bases,

    X. Hu, X. Wu, Y. Shu, and Y. Qu, “Logical form generation via multi-task learning for complex question answering over knowledge bases,” in Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 1687–1696. [Online]. Available: https://aclanthology.org/2022.coling-1.145/

  18. [18]

    Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models,

    T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C.-S. Wu, M. Zhong, P. Yin, S. I. Wang et al., “Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 602–631

  19. [19]

    Decaf: Joint decoding of answers and logical forms for question answering over knowledge bases,

    D. Yu, S. Zhang, P. Ng, H. Zhu, A. H. Li, J. Wang, Y. Hu, W. Y. Wang, Z. Wang, and B. Xiang, “Decaf: Joint decoding of answers and logical forms for question answering over knowledge bases,” in The Eleventh International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/pdf?id=XHc5zRPxqV9

  20. [20]

    Fc-kbqa: A fine-to-coarse composition framework for knowledge base question answering,

    L. Zhang, J. Zhang, Y. Wang, S. Cao, X. Huang, C. Li, H. Chen, and J. Li, “Fc-kbqa: A fine-to-coarse composition framework for knowledge base question answering,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 1002–1017

  21. [21]

    Don’t generate, discriminate: A proposal for grounding language models to real-world environments,

    Y. Gu, X. Deng, and Y. Su, “Don’t generate, discriminate: A proposal for grounding language models to real-world environments,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 4928–4949

  22. [22]

    Fine-tuned llms know more, hallucinate less with few-shot sequence-to-sequence semantic parsing over wikidata,

    S. Xu, S. Liu, T. Culhane, E. Pertseva, M.-H. Wu, S. Semnani, and M. Lam, “Fine-tuned llms know more, hallucinate less with few-shot sequence-to-sequence semantic parsing over wikidata,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 5778–5791

  23. [23]

    Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph,

    J. Sun, C. Xu, L. Tang, S. Wang, C. Lin, Y. Gong, L. Ni, H.-Y. Shum, and J. Guo, “Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=nnVO1PvbTv

  24. [24]

    Think-on-graph 2.0: Deep and faithful large language model reasoning with knowledge-guided retrieval augmented generation,

    S. Ma, C. Xu, X. Jiang, M. Li, H. Qu, C. Yang, J. Mao, and J. Guo, “Think-on-graph 2.0: Deep and faithful large language model reasoning with knowledge-guided retrieval augmented generation,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=oFBu7qaZpS

  25. [25]

    Structgpt: A general framework for large language model to reason over structured data,

    J. Jiang, K. Zhou, Z. Dong, K. Ye, W. X. Zhao, and J.-R. Wen, “Structgpt: A general framework for large language model to reason over structured data,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 9237–9251

  26. [26]

    Serag: Self-evolving rag system for query optimization,

    H. Liu, Q. Zhang, R. Marcus, and I. Sabek, “Serag: Self-evolving rag system for query optimization,” 2025

  27. [27]

    Winning solution for meta kdd cup’24,

    Y. Xia, J. Chen, and J. Gao, “Winning solution for meta kdd cup’24,” arXiv preprint arXiv:2410.00005, 2024. [Online]. Available: https://openreview.net/forum?id=oWNPeoP1uC

  28. [28]

    Revisiting the solution of meta kdd cup 2024: Crag,

    J. Ouyang, Y. Luo, M. Cheng, D. Wang, S. Yu, Q. Liu, and E. Chen, “Revisiting the solution of meta kdd cup 2024: Crag,” arXiv preprint arXiv:2409.15337, 2024. [Online]. Available: https://openreview.net/forum?id=PUzLjWIgqC

  29. [29]

    Aflow: Automating agentic workflow generation,

    J. Zhang, J. Xiang, Z. Yu, F. Teng, X.-H. Chen, J. Chen, M. Zhuge, X. Cheng, S. Hong, J. Wang et al., “Aflow: Automating agentic workflow generation,” in The Thirteenth International Conference on Learning Representations, 2025

  30. [31]

    Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q

    [Online]. Available: https://arxiv.org/abs/2406.04744

  31. [32]

    Qald-10: The 10th challenge on question answering over linked data,

    R. Usbeck, X. Yan, A. Perevalov, L. Jiang, J. Schulz, A. Kraft, C. Möller, J. Huang, J. Reineke, A.-C. N. Ngomo, M. Saleem, and A. Both, “Qald-10: The 10th challenge on question answering over linked data,” Semantic Web, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:265577096

  32. [33]

    The web as a knowledge-base for answering complex questions,

    A. Talmor and J. Berant, “The web as a knowledge-base for answering complex questions,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), M. Walker, H. Ji, and A. Stent, Eds. New Orleans, Louisiana: Association for Computational Linguist...

  33. [34]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., “Gpt-4o system card,” arXiv preprint arXiv:2410.21276, 2024

  34. [35]

    The Llama 3 Herd of Models

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  35. [36]

    DeepSeek-V3 Technical Report

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024

  36. [37]

    Sparql-qa enters the qald challenge,

    M. Borroto, F. Ricca, B. Cuteri, and V. Barbara, “Sparql-qa enters the qald challenge,” in Proceedings of the 7th Natural Language Interfaces for the Web of Data (NLIWoD) co-located with the 19th European Semantic Web Conference, Hersonissos, Greece, vol. 3196, 2022, pp. 25–31. [Online]. Available: https://ceur-ws.org/Vol-3196/paper3.pdf

  37. [38]

    Dense passage retrieval for open-domain question answering,

    V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih, “Dense passage retrieval for open-domain question answering,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 6769–6781

  38. [39]

    Wikidata query blazegraph,

    “Wikidata query blazegraph,” 2022. [Online]. Available: https://github.com/wikimedia/wikidata-query-blazegraph?tab=readme-ov-file

  39. [40]

    Getting and hosting your own copy of wikidata

    W. Fahl, T. Holzheim, A. Westerinen, C. Lange, and S. Decker, “Getting and hosting your own copy of wikidata,” in Wikidata@ISWC, 2022.

    APPENDIX A. Dataset Details. To evaluate the performance of our method, we consider the widely used CRAG [30] benchmark, where the KGs cover five distinct domains and are accessed through API functions. We use the KG questi...

  40. [41]

    LLM Base Models: On the CRAG dataset, we include GPT-4o [33], Llama-3.1-70B-Instruct [34], and Deepseek-chat-V3-0324 [35] base models as baselines, which cover the state-of-the-art open-sourced and close-sourced LLMs

  41. [42]

    LLM Tool Calling Models: On CRAG, we further include StructGPT [25], which enables GPT-4o, Llama-3.1-70B-Instruct, and Deepseek-chat-V3-0324 base models to freely perform tool calling. Specifically, we provide the metadata of the API functions (i.e., the function name, parameters, descriptions, and sample use cases) so that the LLMs can freely chain and ...

  42. [43]

    We enter “<EMPTY>” into the web content module of db3 to adapt it to our KGQA settings

    KDD Cup Winning Solutions: Since CRAG is used to host the KDD Cup competition in 2024, we also include the winning solutions to establish the state-of-the-art baselines on CRAG: • db3 [27]: The design of db3 jointly considers inputs from the KG and web content. We enter “<EMPTY>” into the web content module of db3 to adapt it to our KGQA settings. A router-b...

  43. [44]

    SPARQL-based State-of-the-art Solutions: On the SPARQL-based datasets QALD-10-en, WebQSP, and CWQ, we consider the following state-of-the-art methods: • ToG [23]: ToG integrates large language models (LLMs) with knowledge graphs (KGs). Through beam search, LLMs are used to iteratively explore reasoning paths on KGs to enhance the deep reasoning ability of LL...