arxiv: 2404.16130 · v2 · submitted 2024-04-24 · 💻 cs.CL · cs.AI· cs.IR

Recognition: no theorem link

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge , Ha Trinh , Newman Cheng , Joshua Bradley , Alex Chao , Apurva Mody , Steven Truitt , Dasha Metropolitansky

show 2 more authors

Robert Osazuwa Ness Jonathan Larson

Authors on Pith no claims yet

Pith reviewed 2026-05-11 05:07 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.IR

keywords GraphRAGRetrieval-Augmented GenerationQuery-Focused SummarizationEntity Knowledge GraphCommunity SummariesGlobal Question AnsweringLarge Language Models

0 comments

The pith

GraphRAG builds entity knowledge graphs and community summaries to answer global questions over large private text collections more comprehensively than standard RAG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GraphRAG to address the limitation that standard retrieval-augmented generation fails on broad, corpus-wide questions which are really query-focused summarization tasks. It uses an LLM in two stages to first extract an entity knowledge graph from the source documents and then generate summaries for communities of closely related entities. When a question arrives, the system produces a partial response from each community summary and then combines those into a single final answer. The authors test this on global sensemaking questions over datasets in the one-million-token range and report substantial gains in both comprehensiveness and diversity of answers compared with a conventional RAG baseline.

Core claim

GraphRAG constructs a graph index by deriving an entity knowledge graph from the source documents and then pregenerating community summaries for all groups of closely related entities; given a question, each community summary generates a partial response and all partial responses are summarized into a final answer, yielding substantial improvements over a conventional RAG baseline in both comprehensiveness and diversity for global sensemaking questions over datasets in the 1 million token range.

What carries the argument

Two-stage LLM-based graph indexing that first builds an entity knowledge graph and then pregenerates community summaries for groups of related entities, which are used to create and aggregate partial responses.

If this is right

GraphRAG can answer questions that require understanding an entire document collection rather than isolated passages.
The method scales query-focused summarization to the same quantities of text handled by typical RAG systems.
Partial responses from community summaries can be synthesized into final answers that improve both breadth and variety over direct retrieval.
The two-stage indexing allows the system to handle both narrow retrieval questions and broad sensemaking questions within one framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on domain-specific corpora such as legal contracts or scientific papers where global pattern detection is valuable.
If community detection quality varies, the method might benefit from iterative refinement of the graph index based on question type.
Hybrid systems could route local questions to standard RAG and global questions to GraphRAG without changing the underlying LLM.

Load-bearing premise

LLM-generated entity graphs and community summaries accurately and comprehensively capture the source material without introducing errors, omissions, or biases that undermine the final combined responses.

What would settle it

A human evaluation on a corpus with independently verified global themes in which GraphRAG answers show no measurable gain in comprehensiveness or diversity, or in which the community summaries omit or distort major themes present in the raw text.

read the original abstract

The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as "What are the main themes in the dataset?", since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods, meanwhile, do not scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose GraphRAG, a graph-based approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text. Our approach uses an LLM to build a graph index in two stages: first, to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that GraphRAG leads to substantial improvements over a conventional RAG baseline for both the comprehensiveness and diversity of generated answers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GraphRAG precomputes LLM entity graphs and community summaries to handle global questions over large corpora better than standard RAG, but the gains rest on unvalidated extraction quality.

read the letter

GraphRAG uses an LLM to extract entities into a knowledge graph from the source documents, then generates summaries for detected communities of related entities. At query time it runs partial answers off each community summary and folds those into a final response. This targets the clear weakness in ordinary RAG when users ask broad questions like main themes or overall patterns across a whole private collection. The two-stage indexing plus map-reduce over communities is the concrete new piece; it is not just another retrieval trick but a deliberate precomputation step to make global sensemaking feasible at scale. The abstract reports better comprehensiveness and diversity on 1-million-token datasets, which matches the practical need the authors describe. That part of the contribution is straightforward and addresses a gap that many RAG deployments actually hit. The evaluation claim is the soft spot. The abstract gives no numbers on metrics, no baseline code or dataset details, and no sign of human checks on whether the extracted entities or community summaries are accurate or complete. Without those, it is hard to know whether the reported lift comes from the graph structure itself or simply from running more LLM calls. The stress-test concern lands: any systematic omission or bias in the first two LLM stages would flow straight into the final answers. If the full paper has ablations against oracle graphs or inter-annotator scores on the index, that would change the picture; otherwise the central result stays provisional. This is for teams already running RAG on private data who need global queries to work without manual chunking. A practitioner reader can take the pipeline description and try it, even if they have to fill in the missing eval details themselves. It is worth a serious referee because the problem is real, the method is implementable, and the engineering framing is honest. Send it to review and ask for the full experimental section plus any validation of the graph quality.

Referee Report

2 major / 1 minor

Summary. The paper proposes GraphRAG, a two-stage LLM-driven indexing method that first extracts an entity knowledge graph from source documents and then generates community summaries over related entity groups. For global sensemaking queries, it produces partial answers from each community summary and applies a final map-reduce summarization step. The central empirical claim is that this yields substantial gains in answer comprehensiveness and diversity relative to a conventional RAG baseline on corpora of approximately 1 million tokens.

Significance. If the reported gains prove robust under detailed evaluation, the work would meaningfully advance RAG systems by addressing their documented weakness on global queries through graph-based indexing and hierarchical summarization. The approach is an empirical engineering contribution that combines existing ideas in a scalable way for private corpora.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): the central claim of 'substantial improvements' in comprehensiveness and diversity is stated without any quantitative results, exact metric definitions, dataset descriptions, baseline implementation details, or statistical significance tests. This information is load-bearing for assessing whether the gains arise from the graph structure rather than additional LLM calls.
[§3] §3 (Method): the two-stage indexing (entity KG construction followed by community summarization) is presented without any human validation, inter-annotator agreement scores, or ablation against oracle graphs. Because downstream partial responses and the final summary are also LLM-generated, systematic extraction errors or omissions would propagate directly into the reported gains, yet no such checks are described.

minor comments (1)

[§3.3] The description of how community summaries are combined in the final response step could be clarified with a short pseudocode or diagram to make the map-reduce flow explicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below, indicating where revisions will be made to improve clarity and rigor.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim of 'substantial improvements' in comprehensiveness and diversity is stated without any quantitative results, exact metric definitions, dataset descriptions, baseline implementation details, or statistical significance tests. This information is load-bearing for assessing whether the gains arise from the graph structure rather than additional LLM calls.

Authors: We agree that the abstract and §4 would be strengthened by explicit quantitative details. In the revised manuscript we will update the abstract to reference key quantitative findings from the experiments and expand §4 to provide exact metric definitions (human Likert-scale ratings for comprehensiveness and diversity), dataset descriptions, baseline implementation specifics, and statistical significance results. We will also add analysis that isolates the contribution of the graph indexing from the total number of LLM calls. revision: yes
Referee: [§3] §3 (Method): the two-stage indexing (entity KG construction followed by community summarization) is presented without any human validation, inter-annotator agreement scores, or ablation against oracle graphs. Because downstream partial responses and the final summary are also LLM-generated, systematic extraction errors or omissions would propagate directly into the reported gains, yet no such checks are described.

Authors: We acknowledge the value of validating the intermediate indexing steps. We will revise §3 to discuss potential error propagation from LLM-based entity and community extraction and include any available internal checks or related evidence. A full-scale human validation or oracle-graph ablation is resource-intensive at the corpus scale, but we will add a limitation statement and, where feasible, a small-scale comparison to better contextualize the results. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical engineering contribution with independent evaluation

full rationale

The paper proposes GraphRAG as a two-stage LLM-based indexing method (entity KG construction followed by community summarization) for global query-focused summarization, then reports empirical gains in comprehensiveness and diversity over a standard RAG baseline on 1M-token datasets. No equations, first-principles derivations, fitted parameters, or predictions appear in the abstract or described method. The central claim is an empirical comparison rather than a reduction to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are referenced. The evaluation metrics and baseline are external to the indexing process itself, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No free parameters or invented entities are introduced; the approach rests on standard assumptions about LLM extraction capabilities and graph community structure.

axioms (2)

domain assumption Large language models can extract entities and relations from source text to form a usable knowledge graph.
Invoked in the first stage of index construction.
domain assumption Communities of related entities identified via graph algorithms yield summaries that collectively support global question answering.
Invoked in the second stage and response generation.

pith-pipeline@v0.9.0 · 5579 in / 1270 out tokens · 34713 ms · 2026-05-11T05:07:17.015621+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
cs.CL 2026-05 conditional novelty 8.0

GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
cs.AI 2026-05 conditional novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
cs.CR 2026-05 unverdicted novelty 8.0

Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...
MeMo: Memory as a Model
cs.CL 2026-05 unverdicted novelty 7.0

MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
Thinking Ahead: Prospection-Guided Retrieval of Memory with Language Models
cs.IR 2026-05 conditional novelty 7.0

PGR expands user queries into plausible future steps via Tree-of-Thought or chains and uses them as retrieval probes, delivering nearly 3x recall gains on the new MemoryQuest benchmark for low-similarity memory retrieval.
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
cs.AI 2026-05 unverdicted novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
MEME: Multi-entity & Evolving Memory Evaluation
cs.LG 2026-05 unverdicted novelty 7.0

All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
cs.AI 2026-05 unverdicted novelty 7.0

Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
cs.LG 2026-05 unverdicted novelty 7.0

MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
cs.CL 2026-05 unverdicted novelty 7.0

DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
cs.AI 2026-05 unverdicted novelty 7.0

MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
SEM-RAG: Structure-Preserving Multimodal Graph Compilation and Entropy-Guided Retrieval for Telecommunication Standards
eess.SP 2026-05 unverdicted novelty 7.0

SEM-RAG compiles telecommunication standards into structure-preserving graphs and uses entropy-guided retrieval to reach 94.1% accuracy on TeleQnA and 93.8% on ORAN-Bench-13K while reducing indexing token usage compar...
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
cs.AI 2026-05 unverdicted novelty 7.0

A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
The Context Gathering Decision Process: A POMDP Framework for Agentic Search
cs.AI 2026-05 accept novelty 7.0

Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no perf...
MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
cs.CL 2026-05 unverdicted novelty 7.0

MANTRA automatically synthesizes SMT-validated compliance benchmarks for LLM agents from natural language manuals and tool schemas, producing 285 tasks across 6 domains with minimal human effort.
SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
cs.CL 2026-05 unverdicted novelty 7.0

SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
cs.MA 2026-05 unverdicted novelty 7.0

MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
cs.CL 2026-05 unverdicted novelty 7.0

MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
cs.AI 2026-04 unverdicted novelty 7.0

XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
Skill Retrieval Augmentation for Agentic AI
cs.CL 2026-04 unverdicted novelty 7.0

Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.
A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
cs.AI 2026-04 unverdicted novelty 7.0

A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.
Structure Guided Retrieval-Augmented Generation for Factual Queries
cs.IR 2026-04 unverdicted novelty 7.0

SG-RAG frames retrieval as subgraph matching to ensure LLMs meet every condition in factual queries and reports large gains over baselines on a new 120k-pair ERQA dataset.
ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation
cs.CL 2026-04 unverdicted novelty 7.0

ArbGraph resolves conflicts in RAG evidence by constructing a conflict-aware graph of atomic claims and applying intensity-driven iterative arbitration to suppress unreliable claims prior to generation.
STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering
cs.AI 2026-04 unverdicted novelty 7.0

STRIDE uses a meta-planner for entity-agnostic reasoning skeletons and a supervisor for dependency-aware execution to improve retrieval-augmented multi-hop QA.
SAGER: Self-Evolving User Policy Skills for Recommendation Agent
cs.IR 2026-04 unverdicted novelty 7.0

SAGER equips LLM recommendation agents with per-user evolving policy skills via two-representation architecture, contrastive CoT diagnosis, and skill-augmented listwise reasoning, yielding SOTA gains orthogonal to mem...
ROZA Graphs: Self-Improving Near-Deterministic RAG through Evidence-Centric Feedback
cs.AI 2026-04 unverdicted novelty 7.0

ROZA graphs enable self-improving RAG by storing evidence-specific reasoning chains, yielding up to 10.6pp accuracy gains and 46% lower cost through graph traversal feedback.
MisEdu-RAG: A Misconception-Aware Dual-Hypergraph RAG for Novice Math Teachers
cs.IR 2026-04 unverdicted novelty 7.0

MisEdu-RAG builds concept and instance hypergraphs for two-stage retrieval of pedagogical knowledge and student errors, improving feedback quality on the MisstepMath benchmark by 10.95% token-F1 and up to 15.3% on res...
AnnoRetrieve: Efficient Structured Retrieval for Unstructured Document Analysis
cs.IR 2026-04 unverdicted novelty 7.0

AnnoRetrieve uses auto-generated structured schemas and queries to retrieve information from unstructured documents more efficiently and accurately than embedding-based methods.
Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems
cs.IR 2026-04 unverdicted novelty 7.0

Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.
Why Retrieval-Augmented Generation Fails: A Graph Perspective
cs.CL 2026-05 unverdicted novelty 6.0

Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.
Cognifold: Always-On Proactive Memory via Cognitive Folding
cs.AI 2026-05 unverdicted novelty 6.0

Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...
IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation
cs.AI 2026-05 unverdicted novelty 6.0

IdeaForge combines multiple innovation methodologies through specialist agents on a persistent knowledge graph, using cross-methodology convergent claim linkages to rank and draft patent claims with higher traceabilit...
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
cs.CL 2026-05 unverdicted novelty 6.0

PRISM achieves higher accuracy than baselines on long-horizon agent tasks at an order-of-magnitude smaller context budget by combining hierarchical bundle search, query-sensitive costing, evidence compression, and ada...
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
cs.AI 2026-05 unverdicted novelty 6.0

SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...
SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
cs.CL 2026-05 unverdicted novelty 6.0

SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.
Leveraging RAG for Training-Free Alignment of LLMs
cs.LG 2026-05 unverdicted novelty 6.0

RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...
ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV
cs.CL 2026-05 conditional novelty 6.0

Intent-aware retrieval over assertion-labeled knowledge graphs improves clinical QA accuracy by 22 percentage points on a new MIMIC-IV benchmark that stresses negation, temporality, and attribution.
ASTRA-QA: A Benchmark for Abstract Question Answering over Documents
cs.CL 2026-05 unverdicted novelty 6.0

ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.
SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution
cs.CL 2026-05 unverdicted novelty 6.0

SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
cs.AI 2026-05 unverdicted novelty 6.0

HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
Generating Leakage-Free Benchmarks for Robust RAG Evaluation
cs.CL 2026-05 unverdicted novelty 6.0

SeedRG generates novel, leakage-free RAG benchmark examples from seed data by mapping reasoning structures and swapping entities while applying consistency and leakage checks.
LARAG: Link-Aware Retrieval Strategy for RAG Systems in Hyperlinked Technical Documentation
cs.IR 2026-05 unverdicted novelty 6.0

LARAG improves RAG answer quality on hyperlinked technical documentation by using author-defined links for retrieval, achieving higher BERTScore while using fewer chunks and tokens than standard embedding-based RAG.
Topic Is Not Agenda: A Citation-Community Audit of Text Embeddings
cs.IR 2026-05 unverdicted novelty 6.0

Embeddings retrieve same-subfield papers at 45-52% but same-agenda papers at only 15-21%; citation rerank reaches 57-59% on agenda queries.
Query-efficient model evaluation using cached responses
cs.LG 2026-05 unverdicted novelty 6.0

DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
WiCER: Wiki-memory Compile, Evaluate, Refine Iterative Knowledge Compilation for LLM Wiki Systems
cs.CL 2026-05 conditional novelty 6.0

WiCER iteratively diagnoses and repairs fact loss during wiki compilation for LLMs, recovering 80% of quality lost in blind distillation across 17 domains while cutting catastrophic failures by 55%.
Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries
cs.CL 2026-05 unverdicted novelty 6.0

GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.
ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
cs.AI 2026-05 unverdicted novelty 6.0

ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.
CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
cs.CL 2026-05 unverdicted novelty 6.0

CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tune...
Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text
cs.AI 2026-05 unverdicted novelty 6.0

Frontier LLMs solve single-needle retrieval at 1M tokens on classical Chinese but show three distinct accuracy-decay patterns in three-hop reasoning between 256K and 1M tokens.
Enhancing Judgment Document Generation via Agentic Legal Information Collection and Rubric-Guided Optimization
cs.CL 2026-05 unverdicted novelty 6.0

Judge-R1 improves LLM judgment document generation by combining agentic legal information retrieval with GRPO-based rubric-guided optimization, outperforming baselines on the JuDGE benchmark.
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
cs.AI 2026-04 unverdicted novelty 6.0

Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
ObjectGraph: From Document Injection to Knowledge Traversal -- A Native File Format for the Agentic Era
cs.AI 2026-04 unverdicted novelty 6.0

ObjectGraph is a Markdown superset file format that represents documents as traversable knowledge graphs, achieving up to 95.3% token reduction for agents with no significant accuracy loss.
Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations
cs.AI 2026-04 unverdicted novelty 6.0

Grounding LLMs via node-wise anchors in a traffic scenario taxonomy improves law-scenario matching by 29.1% and derived requirement accuracy by 36.9-38.2% on Chinese laws and 5,897 scenarios, enabling a compliance lay...
Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks
cs.CR 2026-04 unverdicted novelty 6.0

A context-aware Sentinel-Strategist system for RAG selectively applies defenses to block membership inference and data poisoning while recovering most retrieval utility compared to always-on defense stacks.
To Know is to Construct: Schema-Constrained Generation for Agent Memory
cs.CL 2026-04 unverdicted novelty 6.0

SCG-MEM reformulates agent memory access as schema-constrained generation within dynamic cognitive schemas, using assimilation and accommodation for updates plus an associative graph for reasoning, and outperforms ret...
GraphRAG-IRL: Personalized Recommendation with Graph-Grounded Inverse Reinforcement Learning and LLM Re-ranking
cs.IR 2026-04 unverdicted novelty 6.0

GraphRAG-IRL fuses graph-grounded MaxEnt IRL pre-ranking with persona-guided LLM re-ranking to deliver up to 16.8% NDCG@10 gains over IRL-only baselines on MovieLens and consistent 4-6% gains on KuaiRand.
DW-Bench: Benchmarking LLMs on Data Warehouse Graph Topology Reasoning
cs.AI 2026-04 unverdicted novelty 6.0

DW-Bench shows tool-augmented LLMs outperform static ones on data warehouse graph reasoning but plateau on hard compositional question subtypes.
EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval
cs.AI 2026-04 unverdicted novelty 6.0

EHRAG constructs structural hyperedges from sentence co-occurrence and semantic hyperedges from entity embedding clusters, then applies hybrid diffusion plus topic-aware PPR to retrieve top-k documents, outperforming ...

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 110 Pith papers · 7 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Gemini: A Family of Highly Capable Multimodal Models

Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

arXiv preprint arXiv:2306.04136 , year=

Baek, J., Aji, A. F., and Saffari, A. (2023). Knowledge-augmented language model prompting for zero-shot knowledge graph question answering. arXiv preprint arXiv:2306.04136

work page arXiv 2023
[4]

Ban, T., Chen, L., Wang, X., and Chen, H. (2023). From query tools to causal architects: Harnessing large language models for advanced causal discovery from data

work page 2023
[5]

and Gulla, J

Barlaug, N. and Gulla, J. A. (2021). Neural networks for entity matching: A survey. ACM Transactions on Knowledge Discovery from Data (TKDD) , 15(3):1--37

work page 2021
[6]

Baumel, T., Eyal, M., and Elhadad, M. (2018). Query focused abstractive summarization: Incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models. arXiv preprint arXiv:1801.07704

work page arXiv 2018
[7]

D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E

Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment , 2008(10):P10008

work page 2008
[8]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems , 33:1877--1901

work page 2020
[9]

Cheng, X., Luo, D., Chen, X., Liu, L., Zhao, D., and Yan, R. (2024). Lift yourself up: Retrieval-augmented text generation with self-memory. Advances in Neural Information Processing Systems , 36

work page 2024
[10]

and Christen, P

Christen, P. and Christen, P. (2012). The data matching process . Springer

work page 2012
[11]

D., Bridgeford, E

Chung, J., Pedigo, B. D., Bridgeford, E. W., Varjavand, B. K., Helm, H. S., and Vogelstein, J. T. (2019). Graspy: Graph statistics in python. Journal of Machine Learning Research , 20(158):1--7

work page 2019
[12]

Dang, H. T. (2006). Duc 2005: Evaluation of question-focused summarization systems. In Proceedings of the Workshop on Task-Focused Summarization and Question Answering , pages 48--55

work page 2006
[13]

K., Ipeirotis, P

Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S. (2006). Duplicate record detection: A survey. IEEE Transactions on knowledge and data engineering , 19(1):1--16

work page 2006
[14]

Es, S., James, J., Espinosa-Anke, L., and Schockaert, S. (2023). Ragas: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217

work page arXiv 2023
[15]

S., and Yates, A

Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D. S., and Yates, A. (2004). Web-scale information extraction in knowitall: (preliminary results). In Proceedings of the 13th International Conference on World Wide Web , WWW '04, page 100–110, New York, NY, USA. Association for Computing Machinery

work page 2004
[16]

Feng, Z., Feng, X., Zhao, D., Yang, M., and Qin, B. (2023). Retrieval-generation synergy augmented large language models. arXiv preprint arXiv:2310.05149

work page arXiv 2023
[17]

Fortunato, S. (2010). Community detection in graphs. Physics reports , 486(3-5):75--174

work page 2010
[18]

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., and Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

G-retriever: Retrieval-augmented generation for textual graph understanding and question answering,

He, X., Tian, Y., Sun, Y., Chawla, N. V., Laurent, T., LeCun, Y., Bresson, X., and Hooi, B. (2024). G-retriever: Retrieval-augmented generation for textual graph understanding and question answering. arXiv preprint arXiv:2402.07630

work page arXiv 2024
[20]

Large Language Models Cannot Self-Correct Reasoning Yet

Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., and Zhou, D. (2023). Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798

work page internal anchor Pith review arXiv 2023
[21]

Jacomy, M., Venturini, T., Heymann, S., and Bastian, M. (2014). Forceatlas2, a continuous graph layout algorithm for handy network visualization designed for the gephi software. PLoS ONE 9(6): e98679. https://doi.org/10.1371/journal.pone.0098679

work page doi:10.1371/journal.pone.0098679 2014
[22]

Y., and Zhang, W

Jin, D., Yu, Z., Jiao, P., Pan, S., He, D., Wu, J., Philip, S. Y., and Zhang, W. (2021). A survey of community detection approaches: From statistical modeling to deep learning. IEEE Transactions on Knowledge and Data Engineering , 35(2):1149--1170

work page 2021
[23]

M., Baek, J., and Hwang, S

Kang, M., Kwak, J. M., Baek, J., and Hwang, S. J. (2023). Knowledge graph-augmented language models for knowledge-grounded dialogue generation. arXiv preprint arXiv:2305.18846

work page arXiv 2023
[24]

Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp.arXiv preprint arXiv:2212.14024, 2022

Khattab, O., Santhanam, K., Li, X. L., Hall, D., Liang, P., Potts, C., and Zaharia, M. (2022). Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024

work page arXiv 2022
[25]

Kim, D., Xie, L., and Ong, C. S. (2016). Probabilistic knowledge graph construction: Compositional and incremental approaches. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management , CIKM '16, page 2257–2262, New York, NY, USA. Association for Computing Machinery

work page 2016
[26]

Kim, G., Kim, S., Jeon, B., Park, J., and Kang, J. (2023). Tree of clarifications: Answering ambiguous questions with retrieval-augmented large language models. arXiv preprint arXiv:2310.14696

work page arXiv 2023
[27]

Klein, G., Moon, B., and Hoffman, R. R. (2006). Making sense of sensemaking 1: Alternative perspectives. IEEE intelligent systems , 21(4):70--73

work page 2006
[28]

Kosinski, M. (2024). Evaluating large language models in theory of mind tasks. Proceedings of the National Academy of Sciences , 121(45):e2405460121

work page 2024
[29]

Kuratov, Y., Bulatov, A., Anokhin, P., Sorokin, D., Sorokin, A., and Burtsev, M. (2024). In search of needles in a 11m haystack: Recurrent memory finds what llms miss

work page 2024
[30]

Langchain graphs

LangChain (2024). Langchain graphs. https://langchain-graphrag.readthedocs.io/en/latest/

work page 2024
[31]

Laskar, M. T. R., Hoque, E., and Huang, J. (2020). Query focused abstractive summarization via incorporating query relevance and transfer learning with transformer models. In Advances in Artificial Intelligence: 33rd Canadian Conference on Artificial Intelligence, Canadian AI 2020, Ottawa, ON, Canada, May 13--15, 2020, Proceedings 33 , pages 342--348. Springer

work page 2020
[32]

u ttler, H., Lewis, M., Yih, W.-t., Rockt \

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K \"u ttler, H., Lewis, M., Yih, W.-t., Rockt \"a schel, T., et al. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems , 33:9459--9474

work page 2020
[33]

Lost in the Middle: How Language Models Use Long Contexts

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. (2023). Lost in the middle: How language models use long contexts. arXiv:2307.03172

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

GraphRAG Implementation with LlamaIndex - V2

LlamaIndex (2024). GraphRAG Implementation with LlamaIndex - V2 . https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/cookbooks/GraphRAG_v2.ipynb

work page 2024
[35]

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. (2024). Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems , 36

work page 2024
[36]

Manakul, P., Liusie, A., and Gales, M. J. (2023). Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896

work page arXiv 2023
[37]

Mao, Y., He, P., Liu, X., Shen, Y., Gao, J., Han, J., and Chen, W. (2020). Generation-augmented retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553

work page arXiv 2020
[38]

M., Klavans, R., and Boyack, K

Martin, S., Brown, W. M., Klavans, R., and Boyack, K. (2011). Openord: An open-source toolbox for large graph layout. SPIE Conference on Visualization and Data Analysis (VDA)

work page 2011
[39]

Melnyk, I., Dognin, P., and Das, P. (2022). Knowledge graph generation from text

work page 2022
[40]

and Larson, J

Metropolitansky, D. and Larson, J. (2025). Towards effective extraction and evaluation of factual claims

work page 2025
[41]

The impact of large language models on scientific discovery: a preliminary study using gpt-4

Microsoft (2023). The impact of large language models on scientific discovery: a preliminary study using gpt-4

work page 2023
[42]

Mooney, R. J. and Bunescu, R. (2005). Mining knowledge from text using information extraction. SIGKDD Explor. Newsl. , 7(1):3–10

work page 2005
[43]

Nebulagraph launches industry-first graph rag: Retrieval-augmented generation with llm based on knowledge graphs

NebulaGraph (2024). Nebulagraph launches industry-first graph rag: Retrieval-augmented generation with llm based on knowledge graphs. https://www.nebula-graph.io/posts/graph-RAG

work page 2024
[44]

Get started with graphrag: Neo4j’s ecosystem tools

Neo4J (2024). Get started with graphrag: Neo4j’s ecosystem tools. https://neo4j.com/developer-blog/graphrag-ecosystem-tools/

work page 2024
[45]

Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the national academy of sciences , 103(23):8577--8582

work page 2006
[46]

Ni, J., Shi, M., Stammbach, D., Sachan, M., Ash, E., and Leippold, M. (2024). AF a CTA : Assisting the annotation of factual claim detection with reliable LLM annotators. In Ku, L.-W., Martins, A., and Srikumar, V., editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1890--1912, ...

work page 2024
[47]

Chatgpt: Gpt-4 language model

OpenAI (2023). Chatgpt: Gpt-4 language model

work page 2023
[48]

and He, H

Padmakumar, V. and He, H. (2024). Does writing with language models reduce content diversity? ICLR

work page 2024
[49]

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in python. Journal of Machine Learning Research , 12:2825--2830

work page 2011
[50]

Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K., and Shoham, Y. (2023). In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics , 11:1316--1331

work page 2023
[51]

and Joshi, A

Ranade, P. and Joshi, A. (2023). Fabula: Intelligence report generation using retrieval-augmented narrative construction. arXiv preprint arXiv:2310.13848

work page arXiv 2023
[52]

Salminen, J., Liu, C., Pian, W., Chi, J., H \"a yh \"a nen, E., and Jansen, B. J. (2024). Deus ex machina and personas from large language models: Investigating the composition of ai-generated persona descriptions. In Proceedings of the CHI Conference on Human Factors in Computing Systems , pages 1--20

work page 2024
[53]

Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., and Manning, C. D. (2024). Raptor: Recursive abstractive processing for tree-organized retrieval. arXiv preprint arXiv:2401.18059

work page arXiv 2024
[54]

Scott, K. (2024). Behind the Tech . https://www.microsoft.com/en-us/behind-the-tech

work page 2024
[55]

Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., and Chen, W. (2023). Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294

work page arXiv 2023
[56]

A., Rey, B

Shin, J., Hedderich, M. A., Rey, B. J., Lucero, A., and Oulasvirta, A. (2024). Understanding human-ai workflows for generating personas. In Proceedings of the 2024 ACM Designing Interactive Systems Conference , pages 757--781

work page 2024
[57]

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. (2024). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems , 36

work page 2024
[58]

B., Barezi, E

Su, D., Xu, Y., Yu, T., Siddique, F. B., Barezi, E. J., and Fung, P. (2020). Caire-covid: A question answering and query-focused multi-document summarization system for covid-19 scholarly information management. arXiv preprint arXiv:2005.03975

work page arXiv 2020
[59]

Tan, Z., Zhao, X., and Wang, W. (2017). Representation learning of large-scale knowledge graphs via entity feature combinations. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management , CIKM '17, page 1777–1786, New York, NY, USA. Association for Computing Machinery

work page 2017
[60]

MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries.arXiv:2401.15391, 2024

Tang, Y. and Yang, Y. (2024). MultiHop-RAG : Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391

work page arXiv 2024
[61]

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023
[62]

A., Waltman, L., and Van Eck, N

Traag, V. A., Waltman, L., and Van Eck, N. J. (2019). From L ouvain to L eiden: guaranteeing well-connected communities. Scientific Reports , 9(1)

work page 2019
[63]

Trajanoska, M., Stojanov, R., and Trajanov, D. (2023). Enhancing knowledge graph construction using large language models. ArXiv , abs/2305.04676

work page arXiv 2023
[64]

Trivedi, H., Balasubramanian, N., Khot, T., and Sabharwal, A. (2022). Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509

work page arXiv 2022
[65]

Wang, J., Liang, Y., Meng, F., Sun, Z., Shi, H., Li, Z., Xu, J., Qu, J., and Zhou, J. (2023a). Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048

work page arXiv
[66]

Wang, S., Khramtsova, E., Zhuang, S., and Zuccon, G. (2024). Feb4rag: Evaluating federated search in the context of retrieval augmented generation. arXiv preprint arXiv:2402.11891

work page arXiv 2024
[67]

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171

work page internal anchor Pith review Pith/arXiv arXiv 2022
[68]

A., Siu, A., Zhang, R., and Derr, T

Wang, Y., Lipka, N., Rossi, R. A., Siu, A., Zhang, R., and Derr, T. (2023b). Knowledge graph prompting for multi-document question answering

work page
[69]

and Lapata, M

Xu, Y. and Lapata, M. (2021). Text summarization with latent queries. arXiv preprint arXiv:2106.00104

work page arXiv 2021
[70]

W., Salakhutdinov, R., and Manning, C

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA : A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing ( EMNLP )

work page 2018
[71]

Yao, J.-g., Wan, X., and Xiao, J. (2017). Recent advances in document summarization. Knowledge and Information Systems , 53:297--336

work page 2017
[72]

Yao, L., Peng, J., Mao, C., and Luo, Y. (2023). Exploring large language models for knowledge graph completion

work page 2023
[73]

Yates, A., Banko, M., Broadhead, M., Cafarella, M., Etzioni, O., and Soderland, S. (2007). T ext R unner: Open information extraction on the web. In Carpenter, B., Stent, A., and Williams, J. D., editors, Proceedings of Human Language Technologies: The Annual Conference of the North A merican Chapter of the Association for Computational Linguistics ( NAAC...

work page 2007
[74]

Yuan, X., Li, J., Wang, D., Chen, Y., Mao, X., Huang, L., Xue, H., Wang, W., Ren, K., and Wang, J. (2024). S-eval: Automatic and adaptive test generation for benchmarking safety evaluation of large language models. arXiv preprint arXiv:2405.14191

work page arXiv 2024
[75]

Zhang, J. (2023). Graph-toolformer: To empower llms with graph reasoning ability via prompt augmented by chatgpt. arXiv preprint arXiv:2304.11116

work page arXiv 2023
[76]

Zhang, Y., Zhang, Y., Gan, Y., Yao, L., and Wang, C. (2024a). Causal graph discovery with retrieval-augmented generation based large language models. arXiv preprint arXiv:2402.15301

work page arXiv
[77]

Zhang, Z., Chen, J., and Yang, D. (2024b). Darg: Dynamic evaluation of large language models via adaptive reasoning graph. arXiv preprint arXiv:2406.17271

work page arXiv
[78]

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. (2024). Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems , 36

work page 2024
[79]

Zhu, Y., Wang, X., Chen, J., Qiao, S., Ou, Y., Yao, Y., Deng, S., Chen, H., and Zhang, N. (2024). Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities

work page 2024