Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

Edouard Grave; Gautier Izacard

arxiv: 2007.01282 · v2 · pith:5X6JLHOG · submitted 2020-07-02 · cs.CL · cs.LG

Gautier Izacard, Edouard Grave

Reviewed by Pith · 2026-05-17 12:43 UTC · grok-4.3 · pith:5X6JLHOG

classification cs.CL cs.LG

keywords open domain question answering · passage retrieval · generative models · natural questions · triviaqa · evidence aggregation

The pith

Generative models for open-domain question answering gain from retrieving multiple passages and combining their evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether generative models, which already compete on open-domain QA without external knowledge, can be strengthened by adding retrieved text passages that may hold answers. It reports state-of-the-art results on the Natural Questions and TriviaQA benchmarks. The key observation is that accuracy rises steadily as the number of retrieved passages grows, which the authors interpret as evidence that these models can aggregate and synthesize information across sources.

Core claim

Generative models for open domain question answering improve when supplied with retrieved passages, reaching state-of-the-art on Natural Questions and TriviaQA. Accuracy increases significantly with larger numbers of passages, indicating that the models successfully aggregate evidence from multiple sources.

What carries the argument

Retrieval of multiple text passages fed to a generative model that combines evidence across them.

Load-bearing premise

The performance gains come from the generative model's ability to combine information across passages rather than from retrieval quality or other setup details.
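One way to probe this premise is to track how often the top-k retrieved passages contain a gold answer at all, independent of any generator. The sketch below is illustrative only: the example layout (a dict with `passages` and `answers` keys) and the function names are assumptions, not taken from the paper. If this coverage curve rises about as fast as end-to-end accuracy, coverage alone could account for much of the gain.

```python
from typing import Dict, List, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a loose string comparison."""
    return " ".join(text.lower().split())


def coverage_at_k(passages: Sequence[str], answers: Sequence[str], k: int) -> bool:
    """True if any gold answer string occurs in one of the top-k passages."""
    top = [normalize(p) for p in passages[:k]]
    return any(normalize(a) in p for a in answers for p in top)


def coverage_curve(examples: List[dict], ks: Sequence[int]) -> Dict[int, float]:
    """Fraction of examples whose top-k passages contain a gold answer, per k.

    Each example is assumed to be a dict with 'passages' (ranked list of
    strings) and 'answers' (list of acceptable answer strings).
    """
    return {
        k: sum(coverage_at_k(ex["passages"], ex["answers"], k) for ex in examples)
        / len(examples)
        for k in ks
    }


if __name__ == "__main__":
    toy = [
        {"passages": ["Paris is in France", "Berlin is in Germany"], "answers": ["berlin"]},
        {"passages": ["x", "y"], "answers": ["z"]},
    ]
    print(coverage_curve(toy, [1, 2]))
```

Substring matching is a crude proxy for "the passage holds the answer", so a curve like this bounds what retrieval could contribute rather than measuring aggregation directly.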

What would settle it

Measure whether accuracy still rises when the same passages are provided in random order or when passage count is held fixed while changing only the generator prompt.
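The random-order half of this test could be set up roughly as follows. This is a minimal sketch under stated assumptions: `answer_fn` is a hypothetical stand-in for whatever reader is under test (taking a question and a list of passages and returning a string), and the example layout and exact-match normalization are illustrative rather than the paper's protocol.

```python
import random
from typing import Callable, Dict, List, Sequence


def exact_match(prediction: str, answers: Sequence[str]) -> bool:
    """Case- and whitespace-insensitive exact match against any gold answer."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(prediction) in {norm(a) for a in answers}


def accuracy_by_k(
    examples: List[dict],
    answer_fn: Callable[[str, List[str]], str],  # hypothetical reader under test
    ks: Sequence[int],
    shuffle: bool = False,
    seed: int = 0,
) -> Dict[int, float]:
    """Exact-match accuracy for each passage budget k.

    With shuffle=True the same top-k passages are presented in random order,
    so any difference from the unshuffled run reflects order sensitivity.
    """
    rng = random.Random(seed)
    results = {}
    for k in ks:
        correct = 0
        for ex in examples:
            passages = list(ex["passages"][:k])
            if shuffle:
                rng.shuffle(passages)
            correct += exact_match(answer_fn(ex["question"], passages), ex["answers"])
        results[k] = correct / len(examples)
    return results
```

Running `accuracy_by_k` once with `shuffle=False` and once with `shuffle=True` over the same `ks` gives two curves to compare; several seeds would be needed before reading much into small gaps.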

Original abstract

Generative models for open domain question answering have proven to be competitive, without resorting to external knowledge. While promising, this approach requires to use models with billions of parameters, which are expensive to train and query. In this paper, we investigate how much these models can benefit from retrieving text passages, potentially containing evidence. We obtain state-of-the-art results on the Natural Questions and TriviaQA open benchmarks. Interestingly, we observe that the performance of this method significantly improves when increasing the number of retrieved passages. This is evidence that generative models are good at aggregating and combining evidence from multiple passages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This abstract reports SOTA on NQ and TriviaQA by pairing retrieval with a generative model and notes clear gains from using more passages, but leaves the aggregation claim untested.

The main thing here is that the work gets state-of-the-art numbers on Natural Questions and TriviaQA by retrieving passages and feeding them to a generative model, plus the observation that results keep improving as the number of passages grows. That scaling pattern is the part worth paying attention to right now. It suggests a route to better open-domain QA that does not require ever-larger models, which matters for cost and latency.

The paper does a straightforward job of documenting this empirical result on two standard benchmarks and framing it as a practical advantage of generative models over pure retrieval or pure generation approaches. The citation pattern in the abstract is light but points to the relevant prior generative QA work, so nothing looks off there.

The soft spot is the leap from the scaling observation to the claim that generative models are good at aggregating evidence. The abstract gives no ablations, no controls for retrieval quality or prompt length, and no comparison to other ways of handling multiple passages, so alternative explanations like simple coverage gains cannot be ruled out. With only the abstract available it is also impossible to check training details, exact input formatting, or whether the gains are stable across runs.

This paper is aimed at people working on open-domain QA and early retrieval-augmented generation. A reader who wants concrete numbers on how retrieval volume affects generative performance would find it useful even in its current form. I would send it to peer review because the core empirical pattern is worth a full methods check and the benchmarks are the right ones for the claim.

Referee Report

2 major / 1 minor

Summary. The manuscript investigates combining passage retrieval with generative models for open-domain question answering to reduce reliance on very large models. It reports state-of-the-art results on the Natural Questions and TriviaQA benchmarks and observes that performance improves significantly with more retrieved passages, interpreting this as evidence that generative models aggregate and combine evidence from multiple passages.

Significance. If the reported gains are robust and the aggregation interpretation is supported by appropriate controls, the work could inform more efficient retrieval-augmented generative QA systems and highlight scaling benefits of additional passages without proportional increases in model size.

major comments (2)

[Abstract] The claim that performance gains with additional retrieved passages constitute evidence that generative models aggregate and combine evidence requires controls or ablations to rule out confounds such as increased retrieval coverage, prompt-length effects, or benchmark artifacts; none are described.
[Abstract] The state-of-the-art claim on Natural Questions and TriviaQA is presented without any information on the generative model used, retrieval system, baselines, evaluation protocol, or statistical details, preventing assessment of whether the central empirical result holds.

minor comments (1)

[Abstract] The opening sentence contrasts the approach with methods 'without resorting to external knowledge,' yet the proposed method relies on retrieved passages; a brief clarification of this distinction would improve precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We respond to each major comment below and indicate planned revisions to the manuscript.

Referee: [Abstract] The claim that performance gains with additional retrieved passages constitute evidence that generative models aggregate and combine evidence requires controls or ablations to rule out confounds such as increased retrieval coverage, prompt-length effects, or benchmark artifacts; none are described.

Authors: We agree that the abstract presents the scaling observation as evidence of aggregation without describing controls for confounds such as retrieval coverage or prompt length. The provided manuscript text is limited to the abstract and does not include such ablations. We will revise the abstract to qualify the interpretive claim, for example by noting that increased performance with more passages is consistent with evidence aggregation while alternative explanations remain possible.

Revision: yes

Referee: [Abstract] The state-of-the-art claim on Natural Questions and TriviaQA is presented without any information on the generative model used, retrieval system, baselines, evaluation protocol, or statistical details, preventing assessment of whether the central empirical result holds.

Authors: The abstract is a concise summary and therefore omits methodological specifics. Information on the generative model, retrieval system, baselines, and evaluation protocol appears in the main body of the paper. To address the concern, we will revise the abstract to include a brief high-level description of the approach (generative model augmented by passage retrieval) so that the SOTA claim can be more readily assessed from the abstract alone.

Revision: yes

Circularity Check

0 steps flagged

No circularity: the abstract reports empirical observations without derivations or self-referential reductions.

The available text consists solely of the abstract, which states that generative models achieve SOTA results on Natural Questions and TriviaQA and that performance improves with more retrieved passages, interpreting the latter as evidence of aggregation ability. No equations, parameter fits, derivations, or self-citations appear. The performance gain is presented as a direct empirical observation rather than a quantity derived from or fitted to the same inputs. None of the enumerated circularity patterns (self-definitional, fitted-input-as-prediction, self-citation load-bearing, etc.) are instantiated because there is no derivation chain to inspect. The claims remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no mathematical derivations, so the ledger is empty.

pith-pipeline@v0.9.0 · 5360 in / 1072 out tokens · 47355 ms · 2026-05-17T12:43:04.017029+00:00 · methodology

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dense Passage Retrieval for Open-Domain Question Answering
cs.CL 2020-04 accept novelty 8.0

Dense dual-encoder retrievers outperform BM25 by 9-19% absolute in top-20 passage retrieval accuracy across open-domain QA datasets and enable new state-of-the-art end-to-end QA results.
Privacy Without Losing Place: A Paradigm for Private Retrieval in Spatial RAGs
cs.CR 2026-05 unverdicted novelty 7.0

PAS encodes locations via relative anchors and bins to deliver roughly 370-400m adversarial error in spatial RAG while retaining over half the baseline retrieval performance and keeping generation quality robust.
AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented Generation
cs.IR 2026-02 unverdicted novelty 7.0

AtomicRAG replaces chunk-based and triple-based GraphRAG with atom-entity graphs that store facts as atomic units and use personalized PageRank plus relevance filtering to achieve higher retrieval accuracy and reasoni...
M³KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
cs.CL 2025-12 unverdicted novelty 7.0

M³KG-RAG improves multimodal reasoning in large language models by constructing multi-hop knowledge graphs and selectively pruning retrieved context with GRASP.
Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG
cs.CL 2025-11 conditional novelty 7.0

TARG uses uncertainty scores from a short no-context draft to gate retrieval in RAG, matching Always-RAG accuracy while cutting retrievals by 70-90% on QA benchmarks.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
cs.MA 2025-06 accept novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
Retrieval Augmented Time Series Forecasting
cs.LG 2024-11 unverdicted novelty 7.0

The paper proposes Retrieval Augmented Forecasting (RAF) that augments time-series foundation models with retrieved similar series to improve forecasting accuracy across domains.
BetXplain: An Explanation-Annotated Dataset for Detecting Manipulative Betting Advertisements on Social Media
cs.LG 2026-06 unverdicted novelty 6.0

The authors introduce an explanation-annotated dataset of manipulative betting advertisements collected from Instagram and Reddit to support explainable detection models.
An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration
cs.AI 2026-05 unverdicted novelty 6.0

Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.
Mapping Text to Multiplex Graph: Prompt Compression as Lévy Walk-Guided Graph Pruning
cs.CL 2026-05 unverdicted novelty 6.0

RAGP models prompt compression as redundancy-aware pruning on a multiplex graph using Lévy walks, achieving 49.3 average on LongBench at 4x compression versus 48.8 for LongLLMLingua at 3x.
No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation
cs.CL 2026-04 unverdicted novelty 6.0

NWCAD uses a two-stream setup with a two-stage gate to prevent accuracy drops on baseline-correct items under non-informative contexts while retaining gains from helpful contexts.
Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
cs.CL 2026-04 unverdicted novelty 6.0

Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
DualView: Adaptive Local-Global Fusion for Multi-Hop Document Reranking
cs.IR 2026-04 unverdicted novelty 6.0

DualView fuses local cross-attention and global context aggregation via adaptive gating to rerank fixed candidate sets for multi-hop QA, reporting 99.4% Top-4 Recall on MuSiQue at 4 ms latency while beating larger cro...
Enabling Transparent Cyber Threat Intelligence Combining Large Language Models and Domain Ontologies
cs.CR 2025-08 unverdicted novelty 6.0

Integrates LLMs with domain ontologies and SHACL constraints to produce accurate, explainable structured outputs from cybersecurity logs for threat intelligence.
Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering
cs.IR 2025-05 unverdicted novelty 6.0

A hierarchical QA framework converts RST discourse trees into enhanced sentence representations for structure-guided retrieval and reports consistent gains over baselines on four datasets across genres and languages.
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
cs.CL 2024-01 unverdicted novelty 6.0

RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
MemGPT: Towards LLMs as Operating Systems
cs.AI 2023-10 unverdicted novelty 6.0

MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
Demystifying CLIP Data
cs.CV 2023-09 accept novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
REPLUG: Retrieval-Augmented Black-Box Language Models
cs.CL 2023-01 conditional novelty 6.0

REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.
Atlas: Few-shot Learning with Retrieval Augmented Language Models
cs.CL 2022-08 unverdicted novelty 6.0

Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
LaMDA: Language Models for Dialog Applications
cs.CL 2022-01 unverdicted novelty 6.0

LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
Unsupervised Dense Information Retrieval with Contrastive Learning
cs.IR 2021-12 unverdicted novelty 6.0

Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
ARMOR: Adaptive Retriever Optimization for Low-Resource Telecom Question Answering
cs.IR 2026-06 unverdicted novelty 5.0

ARMOR optimizes retrievers via joint RAG-likelihood and InfoNCE training with regularization toward the base encoder, yielding improved retrieval and QA on telecom benchmarks.
BetXplain: An Explanation-Annotated Dataset for Detecting Manipulative Betting Advertisements on Social Media
cs.LG 2026-06 unverdicted novelty 5.0

Introduces BetXplain, an explanation-annotated dataset of social media betting ads collected from Instagram and Reddit for detecting manipulative and deceptive advertising.
Qiskit Code Migration with LLMs
cs.SE 2026-06 unverdicted novelty 5.0

A taxonomy-guided RAG system with LLMs reduces hallucinations and improves migration suggestions for Qiskit code compared to unconstrained retrieval.
NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents
cs.AI 2026-05 unverdicted novelty 5.0

NeuSymMS is a hybrid neuro-symbolic memory architecture for LLM agents that extracts facts neurally, manages them with explicit lifecycle rules in a CLIPS expert system, stores them as triples in a relational database...
NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents
cs.AI 2026-05 unverdicted novelty 5.0

NeuSymMS is a hybrid neuro-symbolic memory system that extracts facts via LLMs and manages them with explicit CLIPS rules for scoping, deduplication, and dual-horizon persistence in LLM agents.
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
cs.SE 2026-05 unverdicted novelty 5.0

Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
Mask-to-Correct+: Leveraging Retriever Diversity for Masking-guided Faithful Fact Correction
cs.IR 2026-04 unverdicted novelty 5.0

Mask-to-Correct and M2C+ use diversity-aware masking in RAG to identify erroneous claim spans and produce faithful corrections, outperforming baselines by up to 14% SARI without gold evidence.
OntoLogX: Ontology-Guided Knowledge Graph Extraction from Cybersecurity Logs with Large Language Models
cs.AI 2025-10 unverdicted novelty 5.0

OntoLogX is a system that applies LLMs with ontology guidance, RAG, and iterative fixes to build valid knowledge graphs from cybersecurity logs and predict ATT&CK tactics from aggregated sessions.
Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities
cs.HC 2026-06 unverdicted novelty 4.0

Mod-Guide uses RAG with a community co-created corpus to make LLM moderation responses more contextually accurate for insensitive speech toward Bangladesh's Hindu and Chakma minorities, with mixed-method evaluation sh...
An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration
cs.AI 2026-05 unverdicted novelty 4.0

Experience-RAG Skill is a reusable agent skill that selects retrieval strategies via experience memory, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed retriever baselines.
A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering
cs.CL 2026-04 unverdicted novelty 4.0

Dense retrieval plus query reformulation and reranking reaches 60.49% accuracy on MedQA USMLE, outperforming other setups while domain-specialized models make better use of the retrieved evidence.
The Agentic Web Requires New Normative Infrastructure
cs.CY 2026-06 unverdicted novelty 3.0

The agentic web requires new normative infrastructure of laws, norms, and practices to allow user-delegated AI agents to access online properties without being blocked as malicious bots.
Enhancing Large Language Models with Retrieval Augmented Generation for Software Testing and Inspection Automation
cs.SE 2026-04 unverdicted novelty 3.0

RAG-enhanced LLMs show generally positive effects on automated test generation and code inspection by supplying supplementary context that reduces hallucinations.
Bridging Brains and Machines: A Unified Frontier in Neuroscience, Artificial Intelligence, and Neuromorphic Systems
q-bio.NC 2025-07 unverdicted novelty 2.0

A position and survey paper that identifies convergence between neuroscience, AGI, and neuromorphic computing and outlines four key integration challenges.