MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

Yixuan Tang, Yi Yang · 2024 · cs.CL · arXiv 2401.15391

Canonical reference. 80% of citing Pith papers cite this work as background.

24 Pith papers citing it

abstract

Retrieval-augmented generation (RAG) augments large language models (LLMs) by retrieving relevant knowledge, showing promising potential in mitigating LLM hallucinations and enhancing response quality, thereby facilitating greater adoption of LLMs in practice. However, we find that existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence. Furthermore, to our knowledge, no existing RAG benchmarking dataset focuses on multi-hop queries. In this paper, we develop a novel dataset, MultiHop-RAG, which consists of a knowledge base, a large collection of multi-hop queries, their ground-truth answers, and the associated supporting evidence. We detail the procedure of building the dataset, utilizing an English news article dataset as the underlying RAG knowledge base. We demonstrate the benchmarking utility of MultiHop-RAG in two experiments. The first experiment compares different embedding models for retrieving evidence for multi-hop queries. In the second experiment, we examine the capabilities of various state-of-the-art LLMs, including GPT-4, PaLM, and Llama2-70B, in reasoning and answering multi-hop queries given the evidence. Both experiments reveal that existing RAG methods perform unsatisfactorily in retrieving and answering multi-hop queries. We hope MultiHop-RAG will be a valuable resource for the community in developing effective RAG systems, thereby facilitating greater adoption of LLMs in practice. The MultiHop-RAG dataset and the implemented RAG system are publicly available at https://github.com/yixuantt/MultiHop-RAG/.

citation-role summary

background 4 · method 1

citation-polarity summary

background 4 · use method 1

representative citing papers

Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

Skill-RAG detects retrieval failure states from hidden representations and routes to one of four corrective skills to raise accuracy on persistent hard cases in open-domain QA and reasoning benchmarks.

Why Retrieval-Augmented Generation Fails: A Graph Perspective

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.

ASTRA-QA: A Benchmark for Abstract Question Answering over Documents

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.

PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

PRISM reduces P99 TTFT by 23.3-37.1% and raises exact-prefix KV-cache hit rates by 5.9-12.2 points versus the strongest baseline on 4B and 13B models by jointly optimizing scheduling and memory.

FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

FT-RAG introduces a fine-grained graph-based retrieval framework for tables plus a new 9870-pair benchmark, reporting 23.5% and 59.2% gains in table- and cell-level hit rates and 62.2% higher exact-value recall over baselines.

Make Any Collection Navigable: Methods for Constructing and Evaluating Hypergraph of Text

cs.IR · 2026-04-28 · unverdicted · novelty 6.0

Proposes methods for constructing Hypergraphs of Text along with a new effort ratio metric, under which TF-IDF baselines match LLM methods in experiments.

S2G-RAG: Structured Sufficiency and Gap Judging for Iterative Retrieval-Augmented QA

cs.IR · 2026-04-26 · unverdicted · novelty 6.0

S2G-RAG improves multi-hop question answering in RAG by using structured sufficiency and gap judging to control iterative retrieval and maintain compact evidence.

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

cs.IR · 2026-04-20 · unverdicted · novelty 6.0

CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

cs.DB · 2026-04-17 · unverdicted · novelty 6.0

EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

Overcoming the "Impracticality" of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

Introduces a four-axis difficulty taxonomy integrated into an enterprise RAG benchmark to systematically diagnose multi-dimensional challenges like reasoning complexity and retrieval difficulty.

Toward Robust GraphRAG: Mitigating Retrieval Drift and Hallucination from Imperfect Knowledge Graphs

cs.IR · 2026-03-16 · unverdicted · novelty 6.0

CS-RAG is a GraphRAG framework that plans queries as ordered atomic constraints, uses anchor-relation aware retrieval, applies sufficiency checks, and falls back to text recovery to reduce drift and hallucination from imperfect KGs.

In-depth Analysis of Graph-based RAG in a Unified Framework

cs.IR · 2025-03-06 · unverdicted · novelty 6.0

A unified framework and large-scale comparison of graph-based RAG methods on QA tasks yields new high-performing variants obtained by recombining existing components.

ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation

cs.IR · 2025-02-14 · unverdicted · novelty 6.0

ArchRAG proposes attributed-community hierarchical indexing and LLM clustering to improve accuracy and lower token usage in graph-based retrieval-augmented generation.

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

cs.CL · 2024-04-24 · unverdicted · novelty 6.0

GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.

MeMo: Memory as a Model

cs.CL · 2026-05-14 · unverdicted · novelty 5.0 · 2 refs

MeMo encodes new knowledge into a separate memory model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.

CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning

cs.CL · 2026-04-12 · unverdicted · novelty 5.0

CodaRAG improves RAG by using a CLS-inspired three-stage pipeline of knowledge consolidation, multi-dimensional associative navigation, and interference elimination, delivering 7-11% gains on GraphRAG-Bench for factual and reasoning tasks.

Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning

cs.CL · 2026-03-25 · unverdicted · novelty 5.0

A stateful iterative RAG system converts retrieved documents into scored reasoning units, maintains supportive and non-supportive evidence, and performs deficiency-driven query refinement to achieve more robust QA performance.

RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation

cs.IR · 2026-01-30 · unverdicted · novelty 5.0

RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.

PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

cs.CV · 2024-10-08 · unverdicted · novelty 5.0

PDF-WuKong adds a sparse sampler to an MLLM for efficient long-PDF multimodal QA and reports an 8.6% F1 gain over proprietary models on a new 1.1M-pair academic-paper dataset.

Retrieval-Augmented Generation for AI-Generated Content: A Survey

cs.CV · 2024-02-29 · accept · novelty 5.0

A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.

RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

cs.CL · 2026-05-06 · accept · novelty 3.0

A heterogeneous ensemble of seven LLMs plus a judge model won first place in SemEval-2026 Task 8 on faithful multi-turn response generation by selecting optimal candidates from diverse outputs.

A Reproducibility Study of Metacognitive Retrieval-Augmented Generation

cs.IR · 2026-04-21 · unverdicted · novelty 3.0

MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.

A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents

cs.CL · 2026-04-20 · unreviewed
