arXiv: 2404.16130 · v2 · submitted 2024-04-24 · 💻 cs.CL · cs.AI · cs.IR

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Pith reviewed 2026-05-11 05:07 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords GraphRAG · Retrieval-Augmented Generation · Query-Focused Summarization · Entity Knowledge Graph · Community Summaries · Global Question Answering · Large Language Models

The pith

GraphRAG builds entity knowledge graphs and community summaries to answer global questions over large private text collections more comprehensively than standard RAG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GraphRAG to address a limitation of standard retrieval-augmented generation: it fails on broad, corpus-wide questions, which are really query-focused summarization tasks. GraphRAG uses an LLM in two stages, first to extract an entity knowledge graph from the source documents and then to generate summaries for communities of closely related entities. When a question arrives, the system produces a partial response from each community summary and then combines those into a single final answer. The authors test this on global sensemaking questions over datasets in the one-million-token range and report substantial gains in both comprehensiveness and diversity of answers compared with a conventional RAG baseline.

Core claim

GraphRAG constructs a graph index by deriving an entity knowledge graph from the source documents and then pregenerating community summaries for all groups of closely related entities; given a question, each community summary generates a partial response and all partial responses are summarized into a final answer, yielding substantial improvements over a conventional RAG baseline in both comprehensiveness and diversity for global sensemaking questions over datasets in the 1 million token range.

What carries the argument

Two-stage LLM-based graph indexing that first builds an entity knowledge graph and then pregenerates community summaries for groups of related entities, which are used to create and aggregate partial responses.
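
The two indexing stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_extract` and `toy_summarize` are hypothetical stand-ins for the two LLM calls, and plain connected components are used in place of a real community-detection algorithm so the example stays dependency-free.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def build_entity_graph(
    chunks: Iterable[str],
    extract_relations: Callable[[str], List[Edge]],
) -> Dict[str, Set[str]]:
    """Stage 1: merge per-chunk (entity, entity) pairs into an undirected graph."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for chunk in chunks:
        for a, b in extract_relations(chunk):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def connected_components(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Stand-in for community detection: group mutually reachable entities."""
    seen: Set[str] = set()
    communities: List[List[str]] = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        communities.append(sorted(members))
    return communities


def build_index(
    chunks: Iterable[str],
    extract_relations: Callable[[str], List[Edge]],
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Stage 2: pregenerate one summary per community of related entities."""
    graph = build_entity_graph(chunks, extract_relations)
    return [summarize(members) for members in connected_components(graph)]


# Hypothetical stand-ins for the LLM calls (assumptions of this sketch).
def toy_extract(chunk: str) -> List[Edge]:
    a, b = chunk.split(" -> ")
    return [(a, b)]


def toy_summarize(members: List[str]) -> str:
    return "Community of " + ", ".join(members)


summaries = build_index(
    ["Alice -> Acme", "Acme -> Bob", "Carol -> Dana"], toy_extract, toy_summarize
)
print(summaries)
```

In a real system the extraction and summarization callables would wrap LLM prompts, and the grouping step would be a proper graph community algorithm; the control flow is the part this sketch is meant to show.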

If this is right

  • GraphRAG can answer questions that require understanding an entire document collection rather than isolated passages.
  • The method scales query-focused summarization to the same quantities of text handled by typical RAG systems.
  • Partial responses from community summaries can be synthesized into final answers that improve both breadth and variety over direct retrieval.
  • The two-stage indexing allows the system to handle both narrow retrieval questions and broad sensemaking questions within one framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on domain-specific corpora such as legal contracts or scientific papers where global pattern detection is valuable.
  • If community detection quality varies, the method might benefit from iterative refinement of the graph index based on question type.
  • Hybrid systems could route local questions to standard RAG and global questions to GraphRAG without changing the underlying LLM.
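
As a rough illustration of the routing idea in the last bullet, a hybrid system could start from something as simple as a cue-phrase check. The cue list and the `route` function below are hypothetical; a practical router would more likely use a classifier or an LLM call.

```python
# Hypothetical cue phrases that suggest a corpus-wide (global) question.
GLOBAL_CUES = ("main themes", "overall", "across the", "summarize", "trends")


def route(question: str) -> str:
    """Return "graphrag" for questions that look global, otherwise "rag"."""
    q = question.lower()
    return "graphrag" if any(cue in q for cue in GLOBAL_CUES) else "rag"


print(route("What are the main themes in the dataset?"))
print(route("When was Acme founded?"))
```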

Load-bearing premise

LLM-generated entity graphs and community summaries accurately and comprehensively capture the source material without introducing errors, omissions, or biases that undermine the final combined responses.

What would settle it

A human evaluation on a corpus with independently verified global themes in which GraphRAG answers show no measurable gain in comprehensiveness or diversity, or in which the community summaries omit or distort major themes present in the raw text.
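
One way to make "no measurable gain" concrete in a pairwise evaluation is a win rate with an exact sign test. The sketch below assumes each judgment names the preferred system or a tie; the labels and the judgment list are invented for illustration and are not results from the paper.

```python
from math import comb
from typing import List, Tuple


def sign_test(wins: int, losses: int) -> float:
    """Two-sided exact binomial test against p = 0.5 (ties already excluded)."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def summarize_judgments(judgments: List[str]) -> Tuple[float, float]:
    """Return (GraphRAG win rate excluding ties, sign-test p-value)."""
    wins = judgments.count("graphrag")
    losses = judgments.count("baseline")
    decided = wins + losses
    win_rate = wins / decided if decided else 0.0
    return win_rate, sign_test(wins, losses)


# Made-up judgments: 9 prefer GraphRAG, 1 prefers the baseline, 2 ties.
toy_judgments = ["graphrag"] * 9 + ["baseline"] + ["tie"] * 2
win_rate, p_value = summarize_judgments(toy_judgments)
print(win_rate, p_value)
```

A win rate near 0.5 with a large p-value on a corpus with verified global themes would count against the paper's central claim; the test itself says nothing about whether the judges are reliable.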

Original abstract

The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as "What are the main themes in the dataset?", since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods, meanwhile, do not scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose GraphRAG, a graph-based approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text. Our approach uses an LLM to build a graph index in two stages: first, to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that GraphRAG leads to substantial improvements over a conventional RAG baseline for both the comprehensiveness and diversity of generated answers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes GraphRAG, a two-stage LLM-driven indexing method that first extracts an entity knowledge graph from source documents and then generates community summaries over related entity groups. For global sensemaking queries, it produces partial answers from each community summary and applies a final map-reduce summarization step. The central empirical claim is that this yields substantial gains in answer comprehensiveness and diversity relative to a conventional RAG baseline on corpora of approximately 1 million tokens.

Significance. If the reported gains prove robust under detailed evaluation, the work would meaningfully advance RAG systems by addressing their documented weakness on global queries through graph-based indexing and hierarchical summarization. The approach is an empirical engineering contribution that combines existing ideas in a scalable way for private corpora.

major comments (2)
  1. [Abstract and §4 (Experiments)] The central claim of 'substantial improvements' in comprehensiveness and diversity is stated without any quantitative results, exact metric definitions, dataset descriptions, baseline implementation details, or statistical significance tests. This information is load-bearing for assessing whether the gains arise from the graph structure rather than additional LLM calls.
  2. [§3 (Method)] The two-stage indexing (entity KG construction followed by community summarization) is presented without any human validation, inter-annotator agreement scores, or ablation against oracle graphs. Because downstream partial responses and the final summary are also LLM-generated, systematic extraction errors or omissions would propagate directly into the reported gains, yet no such checks are described.
minor comments (1)
  1. [§3.3] The description of how community summaries are combined in the final response step could be clarified with a short pseudocode or diagram to make the map-reduce flow explicit.
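
The map-reduce flow the referee asks about could look roughly like the sketch below. It follows the description in the abstract (one partial response per community summary, then a final summarization); the callables `toy_map` and `toy_reduce` are hypothetical stand-ins for LLM prompts, and dropping empty partial responses is an assumption of this sketch.

```python
from typing import Callable, List


def answer_global_question(
    question: str,
    community_summaries: List[str],
    map_answer: Callable[[str, str], str],
    reduce_answers: Callable[[str, List[str]], str],
) -> str:
    # Map: one partial response per community summary.
    partials = [map_answer(question, summary) for summary in community_summaries]
    # Assumption: partial responses that came back empty are discarded.
    partials = [p for p in partials if p]
    # Reduce: combine the remaining partial responses into one final answer.
    return reduce_answers(question, partials)


# Hypothetical stand-ins for the two LLM calls.
def toy_map(question: str, summary: str) -> str:
    return f"theme: {summary}" if "policy" in summary else ""


def toy_reduce(question: str, partials: List[str]) -> str:
    return "; ".join(partials)


final = answer_global_question(
    "What are the main themes in the dataset?",
    ["energy policy debates", "sports results", "health policy reform"],
    toy_map,
    toy_reduce,
)
print(final)
```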

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below, indicating where revisions will be made to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central claim of 'substantial improvements' in comprehensiveness and diversity is stated without any quantitative results, exact metric definitions, dataset descriptions, baseline implementation details, or statistical significance tests. This information is load-bearing for assessing whether the gains arise from the graph structure rather than additional LLM calls.

    Authors: We agree that the abstract and §4 would be strengthened by explicit quantitative details. In the revised manuscript we will update the abstract to reference key quantitative findings from the experiments and expand §4 to provide exact metric definitions (human Likert-scale ratings for comprehensiveness and diversity), dataset descriptions, baseline implementation specifics, and statistical significance results. We will also add analysis that isolates the contribution of the graph indexing from the total number of LLM calls. revision: yes

  2. Referee: [§3 (Method)] The two-stage indexing (entity KG construction followed by community summarization) is presented without any human validation, inter-annotator agreement scores, or ablation against oracle graphs. Because downstream partial responses and the final summary are also LLM-generated, systematic extraction errors or omissions would propagate directly into the reported gains, yet no such checks are described.

    Authors: We acknowledge the value of validating the intermediate indexing steps. We will revise §3 to discuss potential error propagation from LLM-based entity and community extraction and include any available internal checks or related evidence. A full-scale human validation or oracle-graph ablation is resource-intensive at the corpus scale, but we will add a limitation statement and, where feasible, a small-scale comparison to better contextualize the results. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical engineering contribution with independent evaluation

Full rationale

The paper proposes GraphRAG as a two-stage LLM-based indexing method (entity KG construction followed by community summarization) for global query-focused summarization, then reports empirical gains in comprehensiveness and diversity over a standard RAG baseline on 1M-token datasets. No equations, first-principles derivations, fitted parameters, or predictions appear in the abstract or described method. The central claim is an empirical comparison rather than a reduction to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are referenced. The evaluation metrics and baseline are external to the indexing process itself, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No free parameters or invented entities are introduced; the approach rests on standard assumptions about LLM extraction capabilities and graph community structure.

axioms (2)
  • Domain assumption: Large language models can extract entities and relations from source text to form a usable knowledge graph.
    Invoked in the first stage of index construction.
  • Domain assumption: Communities of related entities identified via graph algorithms yield summaries that collectively support global question answering.
    Invoked in the second stage and response generation.

pith-pipeline@v0.9.0 · 5579 in / 1270 out tokens · 34713 ms · 2026-05-11T05:07:17.015621+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

    cs.CL 2026-05 conditional novelty 8.0

    GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.

  2. MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

    cs.AI 2026-05 conditional novelty 8.0

    MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...

  3. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

  4. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.

  5. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.

  6. Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

    cs.CR 2026-05 unverdicted novelty 8.0

    Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...

  7. MemGym: a Long-Horizon Memory Environment for LLM Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

  8. Argus: Evidence Assembly for Scalable Deep Research Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    Argus coordinates a Navigator and multiple Searchers via an evidence graph to assemble complete, source-traced answers, yielding benchmark gains up to 12.7 points with 8 parallel agents and 86.2 on BrowseComp with 64 agents.

  9. MeMo: Memory as a Model

    cs.CL 2026-05 unverdicted novelty 7.0

    MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...

  10. GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

    cs.CL 2026-05 unverdicted novelty 7.0

    GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.

  11. Thinking Ahead: Prospection-Guided Retrieval of Memory with Language Models

    cs.IR 2026-05 conditional novelty 7.0

    PGR expands user queries into plausible future steps via Tree-of-Thought or chains and uses them as retrieval probes, delivering nearly 3x recall gains on the new MemoryQuest benchmark for low-similarity memory retrieval.

  12. Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

    cs.AI 2026-05 unverdicted novelty 7.0

    PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

  13. MEME: Multi-entity & Evolving Memory Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.

  14. Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.

  15. MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

    cs.LG 2026-05 unverdicted novelty 7.0

    MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.

  16. DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.

  17. MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs

    cs.AI 2026-05 unverdicted novelty 7.0

    MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.

  18. SEM-RAG: Structure-Preserving Multimodal Graph Compilation and Entropy-Guided Retrieval for Telecommunication Standards

    eess.SP 2026-05 unverdicted novelty 7.0

    SEM-RAG compiles telecommunication standards into structure-preserving graphs and uses entropy-guided retrieval to reach 94.1% accuracy on TeleQnA and 93.8% on ORAN-Bench-13K while reducing indexing token usage compar...

  19. When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

  20. The Context Gathering Decision Process: A POMDP Framework for Agentic Search

    cs.AI 2026-05 accept novelty 7.0

    Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no perf...

  21. MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    MANTRA automatically synthesizes SMT-validated compliance benchmarks for LLM agents from natural language manuals and tool schemas, producing 285 tasks across 6 domains with minimal human effort.

  22. SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States

    cs.CL 2026-05 unverdicted novelty 7.0

    SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.

  23. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  24. Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    cs.CL 2026-05 unverdicted novelty 7.0

    MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

  25. XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

    cs.AI 2026-04 unverdicted novelty 7.0

    XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.

  26. Skill Retrieval Augmentation for Agentic AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.

  27. A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

    cs.AI 2026-04 unverdicted novelty 7.0

    A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.

  28. Structure Guided Retrieval-Augmented Generation for Factual Queries

    cs.IR 2026-04 unverdicted novelty 7.0

    SG-RAG frames retrieval as subgraph matching to ensure LLMs meet every condition in factual queries and reports large gains over baselines on a new 120k-pair ERQA dataset.

  29. ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    ArbGraph resolves conflicts in RAG evidence by constructing a conflict-aware graph of atomic claims and applying intensity-driven iterative arbitration to suppress unreliable claims prior to generation.

  30. STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering

    cs.AI 2026-04 unverdicted novelty 7.0

    STRIDE uses a meta-planner for entity-agnostic reasoning skeletons and a supervisor for dependency-aware execution to improve retrieval-augmented multi-hop QA.

  31. SAGER: Self-Evolving User Policy Skills for Recommendation Agent

    cs.IR 2026-04 unverdicted novelty 7.0

    SAGER equips LLM recommendation agents with per-user evolving policy skills via two-representation architecture, contrastive CoT diagnosis, and skill-augmented listwise reasoning, yielding SOTA gains orthogonal to mem...

  32. ROZA Graphs: Self-Improving Near-Deterministic RAG through Evidence-Centric Feedback

    cs.AI 2026-04 unverdicted novelty 7.0

    ROZA graphs enable self-improving RAG by storing evidence-specific reasoning chains, yielding up to 10.6pp accuracy gains and 46% lower cost through graph traversal feedback.

  33. DOTRAG: Retrieval-Time Reasoning Along Paths

    cs.IR 2026-04 unverdicted novelty 7.0

    DotRAG reformulates graph retrieval as query-guided path reasoning with Division of Thought, reporting SOTA results on MetaQA and UltraDomain for multi-hop tasks.

  34. MisEdu-RAG: A Misconception-Aware Dual-Hypergraph RAG for Novice Math Teachers

    cs.IR 2026-04 unverdicted novelty 7.0

    MisEdu-RAG builds concept and instance hypergraphs for two-stage retrieval of pedagogical knowledge and student errors, improving feedback quality on the MisstepMath benchmark by 10.95% token-F1 and up to 15.3% on res...

  35. AnnoRetrieve: Efficient Structured Retrieval for Unstructured Document Analysis

    cs.IR 2026-04 unverdicted novelty 7.0

    AnnoRetrieve uses auto-generated structured schemas and queries to retrieve information from unstructured documents more efficiently and accurately than embedding-based methods.

  36. Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.

  37. Semantic Level of Detail for Knowledge Graphs: Discovering Abstraction Boundaries via Spectral Heat Diffusion

    cs.LG 2026-03 unverdicted novelty 7.0

    SLoD detects emergent scale boundaries in knowledge graphs by applying spectral heat diffusion to Poincare embeddings, recovering planted hierarchies in synthetic data and aligning with taxonomic depths in WordNet wit...

  38. GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning

    cs.AI 2026-03 unverdicted novelty 7.0

    GraphScout trains LLMs to autonomously synthesize structured training data from knowledge graphs via flexible exploration tools, enabling a 4B model to outperform larger LLMs by 16.7% on average with fewer inference t...

  39. AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented Generation

    cs.IR 2026-02 unverdicted novelty 7.0

    AtomicRAG replaces chunk-based and triple-based GraphRAG with atom-entity graphs that store facts as atomic units and use personalized PageRank plus relevance filtering to achieve higher retrieval accuracy and reasoni...

  40. KRONE: Scalable LLM-Augmented Log Anomaly Detection via Hierarchical Abstraction

    cs.DB 2026-02 conditional novelty 7.0

    KRONE derives semantic execution hierarchies from flat logs to enable modular multi-level anomaly detection with hybrid local and nested-aware detectors plus limited LLM use, delivering 10% F1 gains and over 100x data...

  41. Autonomous Knowledge Graph Exploration with Adaptive Breadth-Depth Retrieval

    cs.AI 2026-01 unverdicted novelty 7.0

    ARK adaptively retrieves from knowledge graphs using global lexical search and one-hop neighborhood exploration, reaching 59.1% Hit@1 on STaRK with up to 31.4% gains over training-free baselines and enabling distillat...

  42. M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

    cs.CL 2025-12 unverdicted novelty 7.0

    M³KG-RAG improves multimodal reasoning in large language models by constructing multi-hop knowledge graphs and selectively pruning retrieved context with GRASP.

  43. VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models

    cs.CL 2025-12 conditional novelty 7.0

    VLegal-Bench supplies 10,450 expert-validated samples for evaluating LLMs on Vietnamese legal questions, retrieval, multi-step reasoning, and scenario solving.

  44. Deterministic Legal Agents: A Canonical Primitive API for Auditable Reasoning over Temporal Knowledge Graphs

    cs.AI 2025-10 unverdicted novelty 7.0

    The paper specifies the SAT-Graph API, a canonical primitive interface that enables auditable, deterministic reasoning over temporal knowledge graphs by isolating uncertainty to intent translation and narrative synthesis.

  45. mKG-RAG: Leveraging Multimodal Knowledge Graphs in Retrieval-Augmented Generation for Knowledge-intensive VQA

    cs.CV 2025-08 unverdicted novelty 7.0

    mKG-RAG constructs multimodal KGs via MLLM-driven extraction and vision-text matching then applies dual-stage query-aware retrieval to achieve new state-of-the-art results on knowledge-based VQA.

  46. OKG-LLM: Aligning Ocean Knowledge Graph with Observation Data via LLMs for Global Sea Surface Temperature Prediction

    cs.LG 2025-07 unverdicted novelty 7.0

    OKG-LLM constructs an Ocean Knowledge Graph, learns its embeddings, fuses them with SST observations, and applies an LLM to outperform prior methods on global sea surface temperature prediction.

  47. From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

    cs.MA 2025-06 accept novelty 7.0

    A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

  48. In-depth Research Impact Summarization through Fine-Grained Temporal Citation Analysis

    cs.DL 2025-05 unverdicted novelty 7.0

    A framework for nuanced, time-aware research impact summarization using fine-grained temporal citation intents shows moderate to strong correlation with human judgments on insightfulness.

  49. An Ontology-Driven Graph RAG for Legal Norms: A Structural, Temporal, and Deterministic Approach

    cs.CL 2025-04 unverdicted novelty 7.0

    SAT-Graph RAG is a new ontology-driven temporal graph framework for legal RAG that models Works vs. Expressions, reuses versioned components for temporal states, and treats legislative events as queryable Action nodes...

  50. BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

    cs.CL 2025-04 conditional novelty 7.0

    BrowseComp-ZH is a new benchmark of 289 Chinese web questions where even the strongest LLM agents reach only 42.9% accuracy.

  51. DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

    cs.CL 2026-05 unverdicted novelty 6.0

    DeferMem decouples memory QA into high-recall retrieval and RL-based query-conditioned evidence distillation, outperforming baselines on LoCoMo and LongMemEval-S with highest accuracy, fastest runtime, and zero API to...

  52. Ex-GraphRAG: Interpretable Evidence Routing for Graph-Augmented LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    Ex-GraphRAG replaces GNN encoders with M-GNAN for exact node-level decomposition in graph-augmented LLMs, matching black-box performance on STaRK-Prime while exposing semantic-structural mismatches that degrade multi-...

  53. Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables

    cs.AI 2026-05 unverdicted novelty 6.0

    Empirical 2x2 factorial study on 6 statistical datasets shows format and schema constraints in LLM-based KG construction from CSV tables produce super-additive fidelity loss up to +1.180, with mismatched pairs falling...

  54. SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

    cs.CV 2026-05 unverdicted novelty 6.0

    SPIKE dual-controller framework raises success rates 5-9 points and cuts tokens 55% in StarDojo agents by reusing strategic plans across stable segments and escalating only at detected events.

  55. EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

    cs.CL 2026-05 unverdicted novelty 6.0

    EvoMemBench evaluates 15 memory methods for LLM agents and finds long-context baselines competitive with no single memory approach working consistently across settings.

  56. Argus: Evidence Assembly for Scalable Deep Research Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmar...

  57. H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure

    cs.CL 2026-05 unverdicted novelty 6.0

    H-Mem introduces a hybrid tree-plus-graph memory mechanism that evolves short-term agent memories into long-term summaries and enables efficient retrieval, reporting state-of-the-art QA results on three benchmarks.

  58. Why Retrieval-Augmented Generation Fails: A Graph Perspective

    cs.CL 2026-05 unverdicted novelty 6.0

    Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.

  59. Cognifold: Always-On Proactive Memory via Cognitive Folding

    cs.AI 2026-05 unverdicted novelty 6.0

    Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...

  60. IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation

    cs.AI 2026-05 unverdicted novelty 6.0

    IdeaForge combines multiple innovation methodologies through specialist agents on a persistent knowledge graph, using cross-methodology convergent claim linkages to rank and draft patent claims with higher traceabilit...

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 172 Pith papers · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  2. [2]

    Gemini: A Family of Highly Capable Multimodal Models

    Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

  3. [3]

    Knowledge-augmented language model prompting for zero-shot knowledge graph question answering

    Baek, J., Aji, A. F., and Saffari, A. (2023). Knowledge-augmented language model prompting for zero-shot knowledge graph question answering. arXiv preprint arXiv:2306.04136

  4. [4]

    Ban, T., Chen, L., Wang, X., and Chen, H. (2023). From query tools to causal architects: Harnessing large language models for advanced causal discovery from data

  5. [5]

    Neural networks for entity matching: A survey

    Barlaug, N. and Gulla, J. A. (2021). Neural networks for entity matching: A survey. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(3):1–37

  6. [6]

    Baumel, T., Eyal, M., and Elhadad, M. (2018). Query focused abstractive summarization: Incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models. arXiv preprint arXiv:1801.07704

  7. [7]

    Fast unfolding of communities in large networks

    Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008

  8. [8]

    Language models are few-shot learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901

  9. [9]

    Cheng, X., Luo, D., Chen, X., Liu, L., Zhao, D., and Yan, R. (2024). Lift yourself up: Retrieval-augmented text generation with self-memory. Advances in Neural Information Processing Systems, 36

  10. [10]

    The data matching process

    Christen, P. (2012). The data matching process. Springer

  11. [11]

    Graspy: Graph statistics in python

    Chung, J., Pedigo, B. D., Bridgeford, E. W., Varjavand, B. K., Helm, H. S., and Vogelstein, J. T. (2019). Graspy: Graph statistics in python. Journal of Machine Learning Research, 20(158):1–7

  12. [12]

    Dang, H. T. (2006). Duc 2005: Evaluation of question-focused summarization systems. In Proceedings of the Workshop on Task-Focused Summarization and Question Answering, pages 48–55

  13. [13]

    Duplicate record detection: A survey

    Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S. (2006). Duplicate record detection: A survey. IEEE Transactions on knowledge and data engineering, 19(1):1–16

  14. [14]

    Es, S., James, J., Espinosa-Anke, L., and Schockaert, S. (2023). Ragas: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217

  15. [15]

    Web-scale information extraction in knowitall: (preliminary results)

    Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D. S., and Yates, A. (2004). Web-scale information extraction in knowitall: (preliminary results). In Proceedings of the 13th International Conference on World Wide Web, WWW '04, page 100–110, New York, NY, USA. Association for Computing Machinery

  16. [16]

    Feng, Z., Feng, X., Zhao, D., Yang, M., and Qin, B. (2023). Retrieval-generation synergy augmented large language models. arXiv preprint arXiv:2310.05149

  17. [17]

    Fortunato, S. (2010). Community detection in graphs. Physics reports, 486(3-5):75–174

  18. [18]

    Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., and Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997

  19. [19]

    G-retriever: Retrieval-augmented generation for textual graph understanding and question answering

    He, X., Tian, Y., Sun, Y., Chawla, N. V., Laurent, T., LeCun, Y., Bresson, X., and Hooi, B. (2024). G-retriever: Retrieval-augmented generation for textual graph understanding and question answering. arXiv preprint arXiv:2402.07630

  20. [20]

    Large Language Models Cannot Self-Correct Reasoning Yet

    Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., and Zhou, D. (2023). Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798

  21. [21]

    Jacomy, M., Venturini, T., Heymann, S., and Bastian, M. (2014). Forceatlas2, a continuous graph layout algorithm for handy network visualization designed for the gephi software. PLoS ONE 9(6): e98679. https://doi.org/10.1371/journal.pone.0098679

  22. [22]

    A survey of community detection approaches: From statistical modeling to deep learning

    Jin, D., Yu, Z., Jiao, P., Pan, S., He, D., Wu, J., Philip, S. Y., and Zhang, W. (2021). A survey of community detection approaches: From statistical modeling to deep learning. IEEE Transactions on Knowledge and Data Engineering, 35(2):1149–1170

  23. [23]

    Knowledge graph-augmented language models for knowledge-grounded dialogue generation

    Kang, M., Kwak, J. M., Baek, J., and Hwang, S. J. (2023). Knowledge graph-augmented language models for knowledge-grounded dialogue generation. arXiv preprint arXiv:2305.18846

  24. [24]

    Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp

    Khattab, O., Santhanam, K., Li, X. L., Hall, D., Liang, P., Potts, C., and Zaharia, M. (2022). Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024

  25. [25]

    Kim, D., Xie, L., and Ong, C. S. (2016). Probabilistic knowledge graph construction: Compositional and incremental approaches. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM '16, page 2257–2262, New York, NY, USA. Association for Computing Machinery

  26. [26]

    Kim, G., Kim, S., Jeon, B., Park, J., and Kang, J. (2023). Tree of clarifications: Answering ambiguous questions with retrieval-augmented large language models. arXiv preprint arXiv:2310.14696

  27. [27]

    Klein, G., Moon, B., and Hoffman, R. R. (2006). Making sense of sensemaking 1: Alternative perspectives. IEEE intelligent systems, 21(4):70–73

  28. [28]

    Kosinski, M. (2024). Evaluating large language models in theory of mind tasks. Proceedings of the National Academy of Sciences, 121(45):e2405460121

  29. [29]

    Kuratov, Y., Bulatov, A., Anokhin, P., Sorokin, D., Sorokin, A., and Burtsev, M. (2024). In search of needles in a 11m haystack: Recurrent memory finds what llms miss

  30. [30]

    Langchain graphs

    LangChain (2024). Langchain graphs. https://langchain-graphrag.readthedocs.io/en/latest/

  31. [31]

    Laskar, M. T. R., Hoque, E., and Huang, J. (2020). Query focused abstractive summarization via incorporating query relevance and transfer learning with transformer models. In Advances in Artificial Intelligence: 33rd Canadian Conference on Artificial Intelligence, Canadian AI 2020, Ottawa, ON, Canada, May 13–15, 2020, Proceedings 33, pages 342–348. Springer

  32. [32]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474

  33. [33]

    Lost in the Middle: How Language Models Use Long Contexts

    Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. (2023). Lost in the middle: How language models use long contexts. arXiv:2307.03172

  34. [34]

    GraphRAG Implementation with LlamaIndex - V2

    LlamaIndex (2024). GraphRAG Implementation with LlamaIndex - V2. https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/cookbooks/GraphRAG_v2.ipynb

  35. [35]

    Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. (2024). Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36

  36. [36]

    Manakul, P., Liusie, A., and Gales, M. J. (2023). Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896

  37. [37]

    Mao, Y., He, P., Liu, X., Shen, Y., Gao, J., Han, J., and Chen, W. (2020). Generation-augmented retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553

  38. [38]

    Openord: An open-source toolbox for large graph layout

    Martin, S., Brown, W. M., Klavans, R., and Boyack, K. (2011). Openord: An open-source toolbox for large graph layout. SPIE Conference on Visualization and Data Analysis (VDA)

  39. [39]

    Melnyk, I., Dognin, P., and Das, P. (2022). Knowledge graph generation from text

  40. [40]

    Towards effective extraction and evaluation of factual claims

    Metropolitansky, D. and Larson, J. (2025). Towards effective extraction and evaluation of factual claims

  41. [41]

    The impact of large language models on scientific discovery: a preliminary study using gpt-4

    Microsoft (2023). The impact of large language models on scientific discovery: a preliminary study using gpt-4

  42. [42]

    Mooney, R. J. and Bunescu, R. (2005). Mining knowledge from text using information extraction. SIGKDD Explor. Newsl., 7(1):3–10

  43. [43]

    Nebulagraph launches industry-first graph rag: Retrieval-augmented generation with llm based on knowledge graphs

    NebulaGraph (2024). Nebulagraph launches industry-first graph rag: Retrieval-augmented generation with llm based on knowledge graphs. https://www.nebula-graph.io/posts/graph-RAG

  44. [44]

    Get started with graphrag: Neo4j’s ecosystem tools

    Neo4J (2024). Get started with graphrag: Neo4j’s ecosystem tools. https://neo4j.com/developer-blog/graphrag-ecosystem-tools/

  45. [45]

    Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the national academy of sciences, 103(23):8577–8582

  46. [46]

    Ni, J., Shi, M., Stammbach, D., Sachan, M., Ash, E., and Leippold, M. (2024). AFaCTA: Assisting the annotation of factual claim detection with reliable LLM annotators. In Ku, L.-W., Martins, A., and Srikumar, V., editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1890–1912, ...

  47. [47]

    Chatgpt: Gpt-4 language model

    OpenAI (2023). Chatgpt: Gpt-4 language model

  48. [48]

    Does writing with language models reduce content diversity?

    Padmakumar, V. and He, H. (2024). Does writing with language models reduce content diversity? ICLR

  49. [49]

    Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12:2825–2830

  50. [50]

    Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K., and Shoham, Y. (2023). In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331

  51. [51]

    Fabula: Intelligence report generation using retrieval-augmented narrative construction

    Ranade, P. and Joshi, A. (2023). Fabula: Intelligence report generation using retrieval-augmented narrative construction. arXiv preprint arXiv:2310.13848

  52. [52]

    Salminen, J., Liu, C., Pian, W., Chi, J., Häyhänen, E., and Jansen, B. J. (2024). Deus ex machina and personas from large language models: Investigating the composition of ai-generated persona descriptions. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–20

  53. [53]

    Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., and Manning, C. D. (2024). Raptor: Recursive abstractive processing for tree-organized retrieval. arXiv preprint arXiv:2401.18059

  54. [54]

    Scott, K. (2024). Behind the Tech. https://www.microsoft.com/en-us/behind-the-tech

  55. [55]

    Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., and Chen, W. (2023). Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294

  56. [56]

    Understanding human-ai workflows for generating personas

    Shin, J., Hedderich, M. A., Rey, B. J., Lucero, A., and Oulasvirta, A. (2024). Understanding human-ai workflows for generating personas. In Proceedings of the 2024 ACM Designing Interactive Systems Conference, pages 757–781

  57. [57]

    Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. (2024). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36

  58. [58]

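    Caire-covid: A question answering and query-focused multi-document summarization system for covid-19 scholarly information management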

    Su, D., Xu, Y., Yu, T., Siddique, F. B., Barezi, E. J., and Fung, P. (2020). Caire-covid: A question answering and query-focused multi-document summarization system for covid-19 scholarly information management. arXiv preprint arXiv:2005.03975

  59. [59]

    Tan, Z., Zhao, X., and Wang, W. (2017). Representation learning of large-scale knowledge graphs via entity feature combinations. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM '17, page 1777–1786, New York, NY, USA. Association for Computing Machinery

  60. [60]

    MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

    Tang, Y. and Yang, Y. (2024). MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391

  61. [61]

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  62. [62]

    From Louvain to Leiden: guaranteeing well-connected communities

    Traag, V. A., Waltman, L., and Van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1)

  63. [63]

    Trajanoska, M., Stojanov, R., and Trajanov, D. (2023). Enhancing knowledge graph construction using large language models. ArXiv, abs/2305.04676

  64. [64]

    Trivedi, H., Balasubramanian, N., Khot, T., and Sabharwal, A. (2022). Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509

  65. [65]

    Wang, J., Liang, Y., Meng, F., Sun, Z., Shi, H., Li, Z., Xu, J., Qu, J., and Zhou, J. (2023a). Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048

  66. [66]

    Wang, S., Khramtsova, E., Zhuang, S., and Zuccon, G. (2024). Feb4rag: Evaluating federated search in the context of retrieval augmented generation. arXiv preprint arXiv:2402.11891

  67. [67]

    Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171

  68. [68]

    Knowledge graph prompting for multi-document question answering

    Wang, Y., Lipka, N., Rossi, R. A., Siu, A., Zhang, R., and Derr, T. (2023b). Knowledge graph prompting for multi-document question answering

  69. [69]

    Text summarization with latent queries

    Xu, Y. and Lapata, M. (2021). Text summarization with latent queries. arXiv preprint arXiv:2106.00104

  70. [70]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP)

  71. [71]

    Yao, J.-g., Wan, X., and Xiao, J. (2017). Recent advances in document summarization. Knowledge and Information Systems, 53:297–336

  72. [72]

    Yao, L., Peng, J., Mao, C., and Luo, Y. (2023). Exploring large language models for knowledge graph completion

  73. [73]

    Yates, A., Banko, M., Broadhead, M., Cafarella, M., Etzioni, O., and Soderland, S. (2007). TextRunner: Open information extraction on the web. In Carpenter, B., Stent, A., and Williams, J. D., editors, Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAAC...

  74. [74]

    Yuan, X., Li, J., Wang, D., Chen, Y., Mao, X., Huang, L., Xue, H., Wang, W., Ren, K., and Wang, J. (2024). S-eval: Automatic and adaptive test generation for benchmarking safety evaluation of large language models. arXiv preprint arXiv:2405.14191

  75. [75]

    Zhang, J. (2023). Graph-toolformer: To empower llms with graph reasoning ability via prompt augmented by chatgpt. arXiv preprint arXiv:2304.11116

  76. [76]

    Zhang, Y., Zhang, Y., Gan, Y., Yao, L., and Wang, C. (2024a). Causal graph discovery with retrieval-augmented generation based large language models. arXiv preprint arXiv:2402.15301

  77. [77]

    Zhang, Z., Chen, J., and Yang, D. (2024b). Darg: Dynamic evaluation of large language models via adaptive reasoning graph. arXiv preprint arXiv:2406.17271

  78. [78]

    Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. (2024). Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36

  79. [79]

    Zhu, Y., Wang, X., Chen, J., Qiao, S., Ou, Y., Yao, Y., Deng, S., Chen, H., and Zhang, N. (2024). Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities