GROUNDEDKG-RAG: Grounded Knowledge Graph Index for Long-document Question Answering
Pith reviewed 2026-05-10 20:17 UTC · model grok-4.3
The pith
GroundedKG-RAG builds a source-linked knowledge graph from SRL and AMR parses to match proprietary long-context models on NarrativeQA at lower cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GroundedKG-RAG defines nodes as entities and actions and edges as temporal or semantic relations, with each node and edge grounded in the original sentences of the source document. The graph is built from semantic role labeling (SRL) and abstract meaning representation (AMR) parses, then embedded for retrieval. During inference, the same parsing step is applied to the query to retrieve the most relevant grounded sentences, which are passed to the generator for answering. On NarrativeQA, the system matches a state-of-the-art proprietary long-context model at lower cost, outperforms a competitive baseline, and supplies an interpretable structure that facilitates auditing and error analysis.
What carries the argument
GroundedKG: a knowledge graph whose nodes are entities and actions, edges are temporal or semantic relations, and every element is explicitly linked to specific sentences in the source document via SRL and AMR parses.
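To make the grounding idea concrete, here is a minimal Python sketch of a graph in which every node and edge carries the indices of the sentences that support it. The class names, field names, and the two toy sentences are assumptions made for illustration; the paper's actual data model is not described in the text above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: str
    kind: str             # "entity" or "action"
    label: str
    sentence_ids: tuple   # indices of the source sentences grounding this node


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str         # a semantic or temporal relation label
    sentence_ids: tuple   # indices of the source sentences grounding this edge


@dataclass
class GroundedKG:
    sentences: list
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def _check_grounding(self, sentence_ids):
        # Reject elements that are not linked to at least one real sentence.
        if not sentence_ids:
            raise ValueError("every element must be grounded in a sentence")
        if not all(0 <= i < len(self.sentences) for i in sentence_ids):
            raise IndexError("sentence id out of range")

    def add_node(self, node):
        self._check_grounding(node.sentence_ids)
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        self._check_grounding(edge.sentence_ids)
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("edge endpoints must be existing nodes")
        self.edges.append(edge)

    def evidence(self, element):
        # Return the source sentences behind a node or an edge.
        return [self.sentences[i] for i in element.sentence_ids]


# Toy document, invented for this sketch.
kg = GroundedKG(sentences=["Peter arrived home.",
                           "His mother made some camomile tea."])
kg.add_node(Node("mother", "entity", "his mother", (1,)))
kg.add_node(Node("make", "action", "make", (1,)))
kg.add_node(Node("tea", "entity", "camomile tea", (1,)))
kg.add_edge(Edge("make", "mother", "agent", (1,)))
kg.add_edge(Edge("make", "tea", "patient", (1,)))
```

Because `evidence` works on both nodes and edges, an auditor can trace any graph element back to the text that produced it, which is the interpretability property the review highlights.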
Load-bearing premise
Semantic role labeling and abstract meaning representation parses will reliably extract the key entities, actions, and relations without introducing errors or omissions that affect downstream retrieval and answer quality.
What would settle it
A controlled test on NarrativeQA examples that measures hallucination rate or factual accuracy when the same system is run once with sentence-grounded nodes and edges versus once with the same parses but without sentence-level grounding links.
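One way to analyze such a controlled test is a paired bootstrap over per-question scores. The sketch below is an assumption about how the comparison could be run, not a procedure taken from the paper; the function name and the toy 0/1 correctness scores are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Compare two systems scored on the same questions.

    Returns the observed mean difference (A minus B) and the fraction of
    bootstrap resamples in which A's total score exceeds B's.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n
    wins = 0
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping the pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) > 0:
            wins += 1
    return observed, wins / n_resamples


# Toy per-question correctness (1 = factually correct), invented for this sketch.
with_grounding = [1, 1, 0, 1, 1, 0, 1, 1]
without_grounding = [1, 0, 0, 1, 0, 0, 1, 1]
observed_diff, win_rate = paired_bootstrap(with_grounding, without_grounding)
```

Resampling the paired differences, rather than each system's scores separately, keeps question difficulty from inflating the variance of the comparison.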
Original abstract
Retrieval-augmented generation (RAG) systems have been widely adopted in contemporary large language models (LLMs) due to their ability to improve generation quality while reducing the required input context length. In this work, we focus on RAG systems for long-document question answering. Current approaches suffer from a heavy reliance on LLM descriptions resulting in high resource consumption and latency, repetitive content across hierarchical levels, and hallucinations due to no or limited grounding in the source text. To improve both efficiency and factual accuracy through grounding, we propose GroundedKG-RAG, a RAG system in which the knowledge graph is explicitly extracted from and grounded in the source document. Specifically, we define nodes in GroundedKG as entities and actions, and edges as temporal or semantic relations, with each node and edge grounded in the original sentences. We construct GroundedKG from semantic role labeling (SRL) and abstract meaning representation (AMR) parses and then embed it for retrieval. During querying, we apply the same transformation to the query and retrieve the most relevant sentences from the grounded source text for question answering. We evaluate GroundedKG-RAG on examples from the NarrativeQA dataset and find that it performs on par with a state-of-the-art proprietary long-context model at smaller cost and outperforms a competitive baseline. Additionally, our GroundedKG is interpretable and readable by humans, facilitating auditing of results and error analysis.
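The retrieval step described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions: a bag-of-words vector stands in for the embedding model (which is not named in the text above), graph elements are represented as hypothetical `(text, sentence_ids)` pairs, and the sentences are invented.

```python
import math
import re
from collections import Counter


def embed(text):
    # Stand-in embedding: lowercase bag-of-words counts (an assumption).
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def retrieve(query, elements, sentences, k=2):
    """Rank graph elements against the query and return their grounded sentences."""
    q = embed(query)  # the query goes through the same transformation as the graph
    ranked = sorted(elements, key=lambda e: cosine(q, embed(e[0])), reverse=True)
    seen, evidence = set(), []
    for _text, sentence_ids in ranked[:k]:
        for i in sentence_ids:
            if i not in seen:  # do not return the same sentence twice
                seen.add(i)
                evidence.append(sentences[i])
    return evidence


# Toy data, invented for this sketch.
sentences = ["Peter arrived home.",
             "His mother made some camomile tea.",
             "The garden gate was locked."]
elements = [("Peter arrive home", (0,)),
            ("mother make camomile tea", (1,)),
            ("gate locked", (2,))]
top = retrieve("Who made the camomile tea?", elements, sentences, k=1)
```

The function returns source sentences rather than graph element text, so the generator only sees wording that actually appears in the document.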
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GroundedKG-RAG, a RAG system for long-document question answering that explicitly extracts a knowledge graph from the source document using semantic role labeling (SRL) and abstract meaning representation (AMR) parses. Nodes represent entities and actions, edges represent temporal or semantic relations, and all elements are grounded to original sentences. The KG is embedded for retrieval; queries undergo the same transformation to retrieve relevant grounded sentences for generation. Evaluation on NarrativeQA examples claims performance parity with a state-of-the-art proprietary long-context model at lower cost, outperformance of a competitive baseline, and human interpretability for auditing.
Significance. If the empirical results hold under rigorous validation, the approach could meaningfully advance grounded RAG for long documents by providing an interpretable, document-derived structured index that mitigates hallucinations and context-length issues. The use of off-the-shelf linguistic parsers for explicit grounding is a practical strength, and the emphasis on auditability addresses a common limitation in black-box retrieval systems.
major comments (2)
- [Evaluation] Evaluation section: The central claim that GroundedKG-RAG performs on par with proprietary long-context models and outperforms baselines rests on the assumption that SRL/AMR parses produce a faithful, low-error KG. However, the manuscript provides no parse-quality metrics (e.g., SRL F1 or AMR Smatch scores on NarrativeQA documents), no human validation of node/edge fidelity or coverage of key narrative elements, and no ablation isolating grounding effects from retrieval heuristics or prompt design. SRL and AMR are known to omit implicit arguments and struggle with coreference in long narratives; without these controls, performance gains cannot be confidently attributed to grounding rather than other factors.
- [Method] Method (KG construction): The description states that nodes and edges are 'grounded in the original sentences' via SRL and AMR, yet no details are given on how parse outputs are mapped to nodes/edges, how conflicts or incomplete parses are handled, or how the embedding and retrieval step preserves this grounding. This is load-bearing for the 'explicitly grounded' claim and the interpretability advantage.
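The node and edge fidelity check requested in the first major comment could start from an exact-match comparison between extracted triples and a human-annotated gold set. The sketch below is an assumption about how such a check might look; it is not Smatch (which also searches for the best variable alignment), and the triples are invented.

```python
def triple_prf(predicted, gold):
    """Exact-match precision, recall, and F1 over (head, relation, tail) triples."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy triples, invented for this sketch. The unresolved pronoun "it" in the
# predicted set mimics the coreference problem the referee raises.
predicted = [("mother", "make", "tea"),
             ("peter", "arrive", "home"),
             ("it", "give", "peter")]
gold = [("mother", "make", "tea"),
        ("peter", "arrive", "home"),
        ("mother", "give", "tea"),
        ("tea", "recipient", "peter")]
precision, recall, f1 = triple_prf(predicted, gold)
```

Low recall on a sample like this would indicate omitted narrative elements, while low precision would indicate spurious or unresolved nodes.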
minor comments (2)
- [Method] The specific SRL and AMR parser implementations (e.g., AllenNLP, amrlib) and versions are not named, hindering reproducibility.
- [Abstract and Evaluation] The abstract and evaluation description omit concrete metrics, baseline names, dataset splits, and statistical tests; these should be added with tables or figures for clarity.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our paper. We address each of the major comments below and indicate the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: [Evaluation] The central claim that GroundedKG-RAG performs on par with proprietary long-context models and outperforms baselines rests on the assumption that SRL/AMR parses produce a faithful, low-error KG. However, the manuscript provides no parse-quality metrics (e.g., SRL F1 or AMR Smatch scores on NarrativeQA documents), no human validation of node/edge fidelity or coverage of key narrative elements, and no ablation isolating grounding effects from retrieval heuristics or prompt design. SRL and AMR are known to omit implicit arguments and struggle with coreference in long narratives; without these controls, performance gains cannot be confidently attributed to grounding rather than other factors.
Authors: We agree with the referee that additional validation of the parse quality is necessary to support our claims. In the revised version, we will include SRL F1 and AMR Smatch scores computed on the NarrativeQA documents. We will also add a human evaluation on a subset of the extracted KGs to assess fidelity and coverage. An ablation study will be included to separate the effects of grounding from other components. These additions will allow better attribution of performance improvements to the proposed grounding approach, despite the known limitations of the parsers.
Revision: yes
-
Referee: [Method] The description states that nodes and edges are 'grounded in the original sentences' via SRL and AMR, yet no details are given on how parse outputs are mapped to nodes/edges, how conflicts or incomplete parses are handled, or how the embedding and retrieval step preserves this grounding. This is load-bearing for the 'explicitly grounded' claim and the interpretability advantage.
Authors: We acknowledge that the Method section lacks sufficient implementation details. We will revise it to provide a step-by-step description of how SRL and AMR outputs are converted into nodes (entities and actions) and edges (temporal or semantic relations). This will include our strategies for resolving conflicts between parses, handling incomplete parses, and ensuring grounding by linking each element back to specific sentences. We will also detail the embedding process and retrieval mechanism to show how grounding is preserved for interpretability.
Revision: yes
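As an illustration of the kind of mapping the referee asks the authors to document, the following sketch turns SRL-style frames into grounded nodes and edges. Everything here is hypothetical: the frame dictionary layout, the role labels used in the toy data, the rule of skipping frames without a predicate, and the `BEFORE` edge between consecutive actions are assumptions, not the authors' method.

```python
def frames_to_graph(frames):
    """Convert SRL-style frames into grounded nodes and edges.

    Each frame is assumed to look like
    {"sentence_id": int, "verb": str or None, "args": {role: text}}.
    """
    nodes = {}   # node_id -> {"kind", "label", "sentences"}
    edges = []   # (source_id, relation, target_id, sentence_id)
    previous_action = None
    for frame in frames:
        verb = frame.get("verb")
        if not verb:
            continue  # incomplete parse: skipped in this sketch
        sid = frame["sentence_id"]
        action_id = f"action:{verb}:{sid}"
        nodes[action_id] = {"kind": "action", "label": verb, "sentences": {sid}}
        for role, text in frame.get("args", {}).items():
            # Identical mentions share one entity node; no coreference resolution.
            entity_id = "entity:" + text.lower()
            node = nodes.setdefault(
                entity_id, {"kind": "entity", "label": text, "sentences": set()})
            node["sentences"].add(sid)
            edges.append((action_id, role, entity_id, sid))
        if previous_action is not None:
            # Assumed temporal relation: document order between actions.
            edges.append((previous_action, "BEFORE", action_id, sid))
        previous_action = action_id
    return nodes, edges


# Toy frames with illustrative role labels, invented for this sketch.
frames = [
    {"sentence_id": 0, "verb": "arrive", "args": {"ARG1": "Peter", "ARGM-LOC": "home"}},
    {"sentence_id": 1, "verb": "make", "args": {"ARG0": "his mother", "ARG1": "some camomile tea"}},
    {"sentence_id": 1, "verb": None, "args": {}},
    {"sentence_id": 2, "verb": "thank", "args": {"ARG0": "Peter", "ARG1": "his mother"}},
]
nodes, edges = frames_to_graph(frames)
```

Every node and edge keeps its sentence index, so the grounding survives the conversion. Merging entities by exact string match is deliberately naive; a pronoun such as "it" would get its own node, which mirrors the SRL duplicate-node behavior noted in the paper's case study.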
Circularity Check
No circularity in derivation or evaluation chain
full rationale
The manuscript presents a system architecture (GroundedKG-RAG) that extracts a knowledge graph via off-the-shelf SRL and AMR parsers, embeds it, and performs retrieval-augmented QA. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on the empirical comparison to NarrativeQA baselines and a proprietary long-context model, which are independent of any self-referential construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Semantic role labeling and abstract meaning representation parses accurately capture entities, actions, and relations in the document.
invented entities (1)
- GroundedKG: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We construct GroundedKG from semantic role labeling (SRL) and abstract meaning representation (AMR) parses... nodes represent entities and actions, edges represent semantic relations
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We evaluate GroundedKG-RAG on examples from the NarrativeQA dataset and find that it performs on par with a state-of-the-art proprietary long-context model
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
GROUNDEDKG-RAG: Grounded Knowledge Graph Index for Long-document Question Answering
Introduction. Retrieval-Augmented Generation (RAG) is a technique that retrieves and incorporates additional information from an external knowledge source to answer user queries more accurately. Its application scenarios are diverse. For example, a chatbot may retrieve up-to-date information released after the model’s training cutoff to address users...
arXiv 2024
-
[2]
watch"); texts contains textual mentions (e.g.,
Related Work. Several recent approaches have explored question answering over long documents using retrieval-augmented generation (RAG). One line of work aims to extract node–edge pairs to build a graph index for RAG systems. LightRAG (Guo et al., 2025), HippoRAG (Gutiérrez et al., 2024) and HippoRAGv2 (Gutiérrez et al., 2025) create a knowledge gra...
2025
-
[3]
Experiments. 4.1. Dataset. NarrativeQA. NarrativeQA (Kočiský et al., 2017) is a large reading comprehension benchmark designed to evaluate deep understanding and reasoning over long texts. It consists of 1,572 full-length stories from books and movie scripts; the split for books is 548 train, 58 validation, and 177 test. Annotators wrote 46,765 question–answer...
2017
-
[4]
What did Peter do after arriving home?
Results. We evaluate our GroundedKG-RAG by comparing with different baselines and ablating our design choices. Baseline Comparison. Our main results are shown in Table 1. We compare our GroundedKG with a closed-book LLM (without any context, no context), an open-book LLM (offering the full book content as context, full context), and GraphRAG. Since the Narrative...
-
[5]
Error Analysis. We distinguish between (1) KG construction, (2) retrieval and (3) answer generation errors. The discussion of mapping “the four little kids’ father” to the father concept is an example of an error of the first kind (see Section 5). This works better in graphs built from AMR parses than from SRL parses. The second kind is a failure to retriev...
-
[6]
Graph Case Study. We present a simple example to illustrate the creation of our GroundedKG from SRL and AMR parses in Table 3. We can clearly observe that the AMR-based graph grounded coreference-resolved entities (“it” has been resolved to “some camomile tea” in the second clause) into the same node, in this case tea, while SRL created a duplicate node for it....
-
[7]
Conclusion. In this work, we propose the GroundedKG-RAG framework, where each node and edge in the GroundedKG is directly extracted from and grounded in the source document at the sentence level. The nodes represent the entities and actions, and the edges represent the temporal or semantic relation between them. For the GroundedKG construction, we experiment...
-
[8]
Acknowledgements. We thank getAbstract for the support and partial funding of this project
-
[9]
Bibliographical References. Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2025. From local to global: A graph RAG approach to query-focused summarization. Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. 2025. LightRAG: Simple and fast re...
2025
-
[10]
The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106. Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. 2024. Raptor: Recursive abstractive processing for tree-organized retrieval. In International Conference on Learning Representations (ICLR). V.A. Traag, L....
2024