Pith · machine review for the scientific record

arxiv: 2401.18059 · v1 · submitted 2024-01-31 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:02 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: RAPTOR · retrieval-augmented generation · hierarchical summarization · long-context reasoning · question answering · tree-structured retrieval · abstractive processing

The pith

Recursive clustering and summarization builds a tree that improves retrieval-augmented reasoning over long documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RAPTOR, which recursively embeds, clusters, and summarizes text chunks to form a tree with multiple levels of abstraction. During inference the model retrieves from this tree rather than from flat contiguous chunks, allowing integration of information across lengthy documents. A sympathetic reader would care because standard retrieval methods struggle with complex multi-step reasoning that requires both local details and global context. Experiments demonstrate that this approach yields significant gains over conventional retrieval-augmented language models, including a 20-point absolute accuracy increase on the QuALITY benchmark when paired with GPT-4.

Core claim

By recursively embedding, clustering, and summarizing chunks of text, RAPTOR constructs a tree with differing levels of summarization from the bottom up. At inference time the model retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented language models on several tasks, achieving state-of-the-art results on question-answering benchmarks that involve complex, multi-step reasoning.

What carries the argument

The RAPTOR tree, constructed bottom-up by recursive embedding, clustering, and abstractive summarization of text chunks.
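The construction described above can be sketched end to end. This is a hedged toy sketch, not the paper's implementation: the `embed`, `cluster`, and `summarize` stand-ins below are placeholders for the neural embeddings, GMM clustering on reduced vectors, and abstractive LLM summarization that RAPTOR actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    level: int
    children: list = field(default_factory=list)

def embed(text: str) -> list[float]:
    # Toy stand-in for a neural sentence embedding.
    return [sum(ord(c) for c in text) % 97, len(text) % 31]

def cluster(nodes: list[Node], k: int) -> list[list[Node]]:
    # Naive stand-in for GMM clustering: sort by the first embedding
    # dimension and cut into roughly len/k contiguous groups.
    nodes = sorted(nodes, key=lambda n: embed(n.text)[0])
    size = max(2, len(nodes) // k)
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]

def summarize(texts: list[str]) -> str:
    # Stand-in for an abstractive LLM summary of one cluster.
    return " ".join(texts)[:200]

def build_tree(chunks: list[str], k: int = 2, max_levels: int = 3) -> list[list[Node]]:
    """Recursively cluster and summarize until a single root remains."""
    layer = [Node(c, 0) for c in chunks]
    levels = [layer]
    level = 1
    while len(layer) > 1 and level < max_levels:
        layer = [Node(summarize([n.text for n in group]), level, children=group)
                 for group in cluster(layer, k)]
        levels.append(layer)
        level += 1
    return levels  # leaves first, root last

levels = build_tree(["alpha facts", "beta details", "gamma context", "delta notes"])
print([len(layer) for layer in levels])  # → [4, 2, 1]
```

The shape is the point: every level, leaves and summaries alike, stays available as a retrieval candidate.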

If this is right

  • Retrieval integrates context across long documents at varying levels of abstraction.
  • Performance improves on question-answering tasks that require complex multi-step reasoning.
  • State-of-the-art results are achieved on benchmarks such as QuALITY when the tree is used with GPT-4.
  • The method supports better incorporation of long-tail knowledge compared with flat retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The tree structure could reduce reliance on ever-larger context windows by supplying relevant summaries on demand.
  • Similar hierarchical organization might extend to tasks such as document summarization or multi-hop knowledge extraction.
  • Different embedding or clustering choices could be substituted to test robustness of the performance gains.
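The third extension above, swapping clustering choices to probe robustness, amounts to making the clustering step a pluggable function. A minimal sketch; the `bucket_by_first_dim` strategy is an illustrative stand-in, not anything from the paper:

```python
from typing import Callable

Vector = list[float]

def bucket_by_first_dim(vecs: list[Vector], k: int) -> list[int]:
    # Illustrative strategy: assign cluster ids by rank on the first dimension.
    order = sorted(range(len(vecs)), key=lambda i: vecs[i][0])
    labels = [0] * len(vecs)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(vecs)
    return labels

def build_layer(vecs: list[Vector], k: int,
                strategy: Callable[[list[Vector], int], list[int]]) -> dict:
    """Group vector indices into clusters using any labeling strategy."""
    labels = strategy(vecs, k)
    groups: dict[int, list[int]] = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    return groups

groups = build_layer([[0.1, 1], [0.9, 0], [0.2, 1], [0.8, 0]], 2, bucket_by_first_dim)
print(groups)  # → {0: [0, 2], 1: [1, 3]}
```

Replacing `bucket_by_first_dim` with, say, a GMM or k-means labeler leaves the rest of the pipeline untouched, which is what a robustness test would need.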

Load-bearing premise

The recursive clustering and summarization process effectively captures and preserves all relevant information from the original document without significant loss or distortion.

What would settle it

If retrieval from the RAPTOR tree produced lower accuracy than standard contiguous-chunk retrieval on the QuALITY benchmark, the central performance claim would be falsified.

read the original abstract

Retrieval-augmented language models can better adapt to changes in world state and incorporate long-tail knowledge. However, most existing methods retrieve only short contiguous chunks from a retrieval corpus, limiting holistic understanding of the overall document context. We introduce the novel approach of recursively embedding, clustering, and summarizing chunks of text, constructing a tree with differing levels of summarization from the bottom up. At inference time, our RAPTOR model retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented LMs on several tasks. On question-answering tasks that involve complex, multi-step reasoning, we show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces RAPTOR, a retrieval-augmented generation method that recursively embeds, clusters, and abstractively summarizes document chunks to construct a multi-level tree. At inference, the model retrieves from this tree to integrate information across different levels of abstraction, addressing limitations of flat chunk retrieval in standard RAG. Controlled experiments on QA benchmarks demonstrate improvements over baselines, with a reported 20% absolute accuracy gain on QuALITY when paired with GPT-4.

Significance. If the results hold under full scrutiny, RAPTOR offers a practical advance in handling long-document context for complex reasoning tasks by enabling hierarchical abstraction retrieval. The approach builds on established embedding and clustering techniques with a novel bottom-up tree construction, and the empirical SOTA claims on multiple QA tasks could influence future RAG designs if the information-preservation properties of the summaries are confirmed.

major comments (3)
  1. [§4.2] Tree Construction: The recursive clustering and summarization steps lack explicit controls or metrics for information fidelity (e.g., no ROUGE or entailment checks between summaries and source chunks); this is load-bearing for the central claim that the tree preserves usable context without significant loss.
  2. [§5.3] QuALITY Experiments: The 20% absolute accuracy improvement is reported without variance, statistical significance tests, or an ablation isolating the contribution of multi-level retrieval versus GPT-4 prompting alone; this undermines the cross-baseline comparison.
  3. [§3] Retrieval Algorithm: The inference-time retrieval procedure over the tree is described only at a high level, omitting the precise scoring and selection rules across levels, which makes the exact integration of summaries and chunks impossible to reproduce.
minor comments (2)
  1. [Figure 1] Figure 1: The tree diagram is helpful but the caption does not specify the embedding model or clustering algorithm used in the illustrated example.
  2. [Related Work] Related Work section: The comparison to prior hierarchical retrieval methods (e.g., those using sentence embeddings or graph-based structures) could be expanded with quantitative differences.
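Major comment 3's reproducibility gap can be made concrete. One plausible reading is collapsed-tree retrieval: score every node, leaf chunks and summaries alike, by cosine similarity to the query and keep the top-k under a token budget. The scoring function, over-fetch factor, and budget below are illustrative assumptions, not the authors' exact procedure:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity; 0.0 for degenerate zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], nodes: list[dict], k: int = 3,
             token_budget: int = 1000) -> list[dict]:
    """Rank all tree nodes by similarity; keep the top-k within a token budget."""
    scored = sorted(nodes, key=lambda n: cosine(query_vec, n["vec"]), reverse=True)
    picked, used = [], 0
    for node in scored[:k * 2]:  # small over-fetch before the budget filter
        cost = len(node["text"].split())  # crude token count
        if used + cost <= token_budget and len(picked) < k:
            picked.append(node)
            used += cost
    return picked

nodes = [
    {"text": "low-level detail about topic A", "vec": [1.0, 0.1]},
    {"text": "summary of topics A and B",      "vec": [0.7, 0.7]},
    {"text": "unrelated material",             "vec": [0.0, 1.0]},
]
top = retrieve([1.0, 0.2], nodes, k=2)
```

Note that a detailed leaf and a higher-level summary can both be selected for the same query, which is exactly the cross-level integration the paper claims.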

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving rigor and reproducibility. We address each major comment below and will incorporate the suggested changes in the revised manuscript.

read point-by-point responses
  1. Referee: [§4.2] Tree Construction: The recursive clustering and summarization steps lack explicit controls or metrics for information fidelity (e.g., no ROUGE or entailment checks between summaries and source chunks); this is load-bearing for the central claim that the tree preserves usable context without significant loss.

    Authors: We agree that explicit fidelity metrics would strengthen the central claim. While downstream performance gains provide supporting evidence, we will add quantitative controls in the revised Section 4.2, including average ROUGE-L scores and NLI entailment rates between each summary and its source chunks across tree levels. revision: yes

  2. Referee: [§5.3] QuALITY Experiments: The 20% absolute accuracy improvement is reported without variance, statistical significance tests, or an ablation isolating the contribution of multi-level retrieval versus GPT-4 prompting alone; this undermines the cross-baseline comparison.

    Authors: We acknowledge the importance of statistical reporting. In the revision we will include standard deviations over multiple runs, perform significance tests, and add an ablation in Section 5.3 that directly compares hierarchical RAPTOR retrieval against flat retrieval using identical GPT-4 prompting to isolate the multi-level contribution. revision: yes

  3. Referee: [§3] Retrieval Algorithm: The inference-time retrieval procedure over the tree is described only at a high level, omitting the precise scoring and selection rules across levels, which makes the exact integration of summaries and chunks impossible to reproduce.

    Authors: We will expand Section 3 with a detailed description of the retrieval procedure, including the exact cosine-similarity scoring function, level-specific top-k thresholds, and the prompt-integration rules for combining summaries and chunks. Pseudocode will be added to ensure full reproducibility. revision: yes
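The fidelity check promised in response 1 has a concrete lexical half: ROUGE-L, the longest-common-subsequence F1 between a summary and its source. A self-contained sketch on toy strings (the NLI entailment half would require a separate model and is not shown):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    # Classic O(len(a) * len(b)) dynamic program for LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(summary: str, source: str) -> float:
    """ROUGE-L F1 over whitespace tokens: LCS-based precision/recall harmonic mean."""
    s, r = summary.split(), source.split()
    lcs = lcs_len(s, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(s), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the tree preserves context",
                   "the raptor tree preserves long document context")
```

Averaging such scores between each summary node and its source chunks, level by level, is one way to quantify the information loss the referee flags.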

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents RAPTOR as an explicit algorithmic construction: recursive bottom-up embedding, clustering, and abstractive summarization to form a multi-level tree, followed by retrieval over that tree at inference. Performance claims (e.g., +20% absolute on QuALITY with GPT-4) rest on controlled empirical comparisons against flat-chunk RAG baselines, not on any mathematical derivation that reduces to its own fitted inputs or self-citations. No self-definitional loops, renamed predictions, or load-bearing self-citations appear in the method or results sections; the claims are grounded in external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method introduces a new hierarchical structure but relies on standard embedding and clustering techniques; the key assumption is the utility of recursive summarization.

axioms (1)
  • domain assumption Abstractive summaries at different levels retain sufficient information for multi-step reasoning
    This is implicitly required for the retrieval to improve performance on complex tasks.

pith-pipeline@v0.9.0 · 5470 in / 1074 out tokens · 68391 ms · 2026-05-15T13:02:10.149769+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.

  2. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.

  3. Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.

  4. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  5. Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.

  6. CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

    cs.CL 2026-03 unverdicted novelty 7.0

    CLAG organizes agent memory into clusters via an SLM router and uses cluster profiles for two-stage retrieval, yielding better answer quality on QA benchmarks than prior memory systems.

  7. ASTRA-QA: A Benchmark for Abstract Question Answering over Documents

    cs.CL 2026-05 unverdicted novelty 6.0

    ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.

  8. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 6.0

    Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.

  9. FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    FT-RAG introduces a fine-grained graph-based retrieval framework for tables plus a new 9870-pair benchmark, reporting 23.5% and 59.2% gains in table- and cell-level hit rates and 62.2% higher exact-value recall over b...

  10. MemORAI: Memory Organization and Retrieval via Adaptive Graph Intelligence for LLM Conversational Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    MemORAI combines selective filtering, provenance tracking in multi-relational graphs, and dynamic weighted PageRank retrieval to achieve state-of-the-art memory retrieval and personalized responses in LLM agents on LO...

  11. MindTrellis: Co-Creating Knowledge Structures with AI through Interactive Visual Exploration

    cs.HC 2026-04 unverdicted novelty 6.0

    MindTrellis enables users and AI to co-create evolving knowledge graphs, outperforming retrieval-only tools in expert-rated content coverage, structural quality, and reduced cognitive load during a study of 12 partici...

  12. Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...

  13. Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

    cs.CL 2026-04 unverdicted novelty 6.0

    A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.

  14. GraphRAG-Router: Learning Cost-Efficient Routing over GraphRAGs and LLMs with Reinforcement Learning

    cs.IR 2026-03 unverdicted novelty 6.0

    GraphRAG-Router uses two-stage reinforcement learning with a cost-aware curriculum reward to route queries across heterogeneous GraphRAGs and LLMs, cutting large-LLM overuse by nearly 30 percent while matching baselin...

  15. From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    cs.CL 2024-04 unverdicted novelty 6.0

    GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.

  16. How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study

    cs.SE 2026-05 conditional novelty 5.0

    Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.

  17. GRAVITY: Architecture-Agnostic Structured Anchoring for Long-Horizon Conversational Memory

    cs.CL 2026-05 unverdicted novelty 5.0

    GRAVITY adds structured relational, temporal, and thematic memory anchors to conversational LLMs at generation time, delivering 7.5-10.1% average gains in LLM-judge accuracy across five host systems on LongMemEval and LoCoMo.

  18. Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning

    cs.CL 2026-03 unverdicted novelty 5.0

    A stateful iterative RAG system converts retrieved documents into scored reasoning units, maintains supportive and non-supportive evidence, and performs deficiency-driven query refinement to achieve more robust QA per...

  19. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  20. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 4.0

    Experience-RAG Skill is a reusable agent skill that selects retrieval strategies via experience memory, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed retriever baselines.

  21. Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering

    cs.CL 2026-04 unverdicted novelty 4.0

    Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.

Reference graph

Works this paper leans on

124 extracted references · 124 canonical work pages · cited by 19 Pith papers · 19 internal anchors

  1. [1]

    On the Surprising Behavior of Distance Metrics in High Dimensional Space

    Charu C Aggarwal, Alexander Hinneburg, and Daniel A Keim. On the Surprising Behavior of Distance Metrics in High Dimensional Space. In Database Theory—ICDT 2001: 8th International Conference, London, UK, January 4–6, 2001, Proceedings 8, pp. 420–434. Springer, 2001. URL https://link.springer.com/chapter/10.1007/3-540-44503-x_27

  2. [7]

    Improving language models by retrieving from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pp. 2206–2240. PMLR, 2022. URL https://arxiv.org/abs/2112.04426

  3. [8]

    Language Models are Few-Shot Learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  4. [12]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311

  5. [13]

    Contextualizing citations for scientific summarization using word embeddings and domain knowledge

    Arman Cohan and Nazli Goharian. Contextualizing citations for scientific summarization using word embeddings and domain knowledge. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1133–1136, 2017. URL https://dl.acm.org/doi/abs/10.1145/3077136.3080740

  6. [15]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022. URL https://arxiv.org/abs/2205.14135

  7. [17]

    CoLISA: Inner Interaction via Contrastive Learning for Multi-choice Reading Comprehension

    Mengxing Dong, Bowei Zou, Yanling Li, and Yu Hong. CoLISA: Inner Interaction via Contrastive Learning for Multi-choice Reading Comprehension. In Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2–6, 2023, Proceedings, Part I, pp. 264–278. Springer, 2023a. URL https://link...

  8. [21]

    REALM: Retrieval-Augmented Language Model Pre-Training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. REALM: Retrieval-Augmented Language Model Pre-Training. In International Conference on Machine Learning, pp. 3929–3938. PMLR, 2020. URL https://doi.org/10.48550/arXiv.2002.08909

  9. [25]

    How can we know what language models know?

    Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020. URL https://arxiv.org/abs/1911.12543

  10. [26]

    Billion-scale similarity search with GPUs

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019. URL https://arxiv.org/abs/1702.08734

  11. [27]

    Large Language Models Struggle to Learn Long-Tail Knowledge

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large Language Models Struggle to Learn Long-Tail Knowledge. In International Conference on Machine Learning, pp. 15696–15707. PMLR, 2023. URL https://proceedings.mlr.press/v202/kandpal23a/kandpal23a.pdf

  12. [30]

    ColBERT: Efficient and effective passage search via contextualized late interaction over BERT

    Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48, 2020. URL https://arxiv.org/abs/2004.12832

  13. [31]

    The NarrativeQA Reading Comprehension Challenge

    Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018. URL https://arxiv.org/abs/1712.07040

  14. [32]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020. URL https://doi.org/10.48550/arXiv.2005.11401

  15. [33]

    LlamaIndex

    Jerry Liu. LlamaIndex, 2022. URL https://github.com/jerryjliu/llama_index

  16. [36]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426, 2018. URL https://arxiv.org/abs/1802.03426

  17. [39]

    Memory-based model editing at scale

    Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D Manning, and Chelsea Finn. Memory-based model editing at scale. In International Conference on Machine Learning, pp. 15817–15831. PMLR, 2022. URL https://proceedings.mlr.press/v162/mitchell22a/mitchell22a.pdf

  18. [43]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report. ArXiv, abs/2303.08774, 2023. URL https://arxiv.org/abs/2303.08774

  19. [44]

    QuALITY: Question Answering with Long Input Texts, Yes!

    Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, and Samuel Bowman. QuALITY: Question Answering with Long Input Texts, Yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human L...

  20. [46]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv preprint arXiv:2112.11446, 2021. URL https://arxiv.org/abs/2112.11446

  21. [50]

    The Probabilistic Relevance Framework: BM25 and Beyond

    Stephen Robertson, Hugo Zaragoza, et al. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009. URL https://doi.org/10.1561/1500000019

  22. [51]

    Okapi at TREC-3

    Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. Okapi at TREC-3. NIST Special Publication SP, 109:109, 1995. URL https://www.microsoft.com/en-us/research/publication/okapi-at-trec-3/

  23. [53]

    Estimating the Dimension of a Model

    Gideon Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, pp. 461–464, 1978. URL https://projecteuclid.org/journals/annals-of-statistics/volume-6/issue-2/Estimating-the-Dimension-of-a-Model/10.1214/aos/1176344136.full

  24. [54]

    A Statistical Interpretation of Term Specificity and its Application in Retrieval

    Karen Spärck Jones. A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, 28(1):11–21, 1972. URL https://doi.org/10.1108/eb026526

  25. [57]

    oLMpics – On what language model pre-training captures

    Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. oLMpics – On what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8:743–758, 2020. URL https://arxiv.org/abs/1912.13283

  26. [61]

    Generate rather than retrieve: Large language models are strong context generators

    Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. Generate rather than retrieve: Large language models are strong context generators, 2022. URL https://arxiv.org/abs/2209.10063

  27. [63]

    Estimating the Dimension of a Model

    Gideon Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, 1978.

  28. [64]

    Contextualizing citations for scientific summarization using word embeddings and domain knowledge

    Arman Cohan and Nazli Goharian. Contextualizing citations for scientific summarization using word embeddings and domain knowledge. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017.

  29. [65]

    Hybrid Hierarchical Retrieval for Open-Domain Question Answering

    Arivazhagan, Manoj Ghuhan and Liu, Lan and Qi, Peng and Chen, Xinchi and Wang, William Yang and Huang, Zhiheng. Hybrid Hierarchical Retrieval for Open-Domain Question Answering. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.679

  30. [66]

    Dense Hierarchical Retrieval for Open-domain Question Answering

    Liu, Ye and Hashimoto, Kazuma and Zhou, Yingbo and Yavuz, Semih and Xiong, Caiming and Yu, Philip. Dense Hierarchical Retrieval for Open-domain Question Answering. Findings of the Association for Computational Linguistics: EMNLP 2021. 2021. doi:10.18653/v1/2021.findings-emnlp.19

  31. [67]

    Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization

    Zhang, Shiyue and Wan, David and Bansal, Mohit. Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.120

  32. [68]

    A Controllable QA-based Framework for Decontextualization

    A Controllable QA-based Framework for Decontextualization. arXiv preprint arXiv:2305.14772, 2023.

  33. [69]

    Joint Passage Ranking for Diverse Multi-Answer Retrieval

    Min, Sewon and Lee, Kenton and Chang, Ming-Wei and Toutanova, Kristina and Hajishirzi, Hannaneh. Joint Passage Ranking for Diverse Multi-Answer Retrieval. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.560

  34. [70]

    Enabling Large Language Models to Generate Text with Citations

    Enabling Large Language Models to Generate Text with Citations. arXiv preprint arXiv:2305.14627, 2023.

  35. [71]

    Do Long-Range Language Models Actually Use Long-Range Context?

    Sun, Simeng and Krishna, Kalpesh and Mattarella-Micke, Andrew and Iyyer, Mohit. Do Long-Range Language Models Actually Use Long-Range Context?. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.62

  36. [72]

    LongT5: Efficient Text-To-Text Transformer for Long Sequences

    Guo, Mandy and Ainslie, Joshua and Uthus, David and Ontanon, Santiago and Ni, Jianmo and Sung, Yun-Hsuan and Yang, Yinfei. LongT5: Efficient Text-To-Text Transformer for Long Sequences. Findings of the Association for Computational Linguistics: NAACL 2022. 2022. doi:10.18653/v1/2022.findings-naacl.55

  37. [73]

    CoLT5: Faster Long-Range Transformers with Conditional Computation

    Ainslie, Joshua and Lei, Tao and de Jong, Michiel and Ontañón, Santiago, et al. CoLT5: Faster Long-Range Transformers with Conditional Computation. arXiv preprint arXiv:2303.09752, 2023.

  38. [74]

    A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

    Dasigi, Pradeep and Lo, Kyle and Beltagy, Iz and Cohan, Arman and Smith, Noah A. and Gardner, Matt. A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v...

  39. [75]

    Dense passage retrieval for open-domain question answering

    Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau. Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.550

  40. [76]

    QuALITY: Question Answering with Long Input Texts, Yes!

    Pang, Richard Yuanzhe and Parrish, Alicia and Joshi, Nitish and Nangia, Nikita and Phang, Jason and Chen, Angelica and Padmakumar, Vishakh and Ma, Johnny and Thompson, Jana and He, He and Bowman, Samuel. QuALITY: Question Answering with Long Input Texts, Yes! Proceedings of the 2022 Conference of the North American Chapter of the Association for...

  41. [77]

    Know What You Don't Know: Unanswerable Questions for SQuAD

    Rajpurkar, Pranav and Jia, Robin and Liang, Percy. Know What You Don't Know: Unanswerable Questions for SQuAD. Association for Computational Linguistics (ACL). 2018

  42. [78]

    Generate rather than Retrieve: Large Language Models are Strong Context Generators

    Yu, Wenhao and Iter, Dan and Wang, Shuohang and Xu, Yichong and Ju, Mingxuan and Sanyal, Soumya and Zhu, Chenguang and Zeng, Michael and Jiang, Meng. Generate rather than Retrieve: Large Language Models are Strong Context Generators. International Conference on Learning Representations (ICLR). 2023

  43. [79]

    The Probabilistic Relevance Framework: BM25 and Beyond

    Robertson, Stephen and Zaragoza, Hugo. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval. 2009

  44. [80]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and others. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems. 2020

  45. [81]

    REALM: Retrieval-Augmented Language Model Pre-Training

    Guu, Kelvin and Lee, Kenton and Tung, Zora and Pasupat, Panupong and Chang, Mingwei. REALM: Retrieval-Augmented Language Model Pre-Training. International Conference on Machine Learning (ICML). 2020

  46. [82]

    Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study

    Wang, Boxin and Ping, Wei and others. Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study. arXiv preprint arXiv:2304.06762. 2023

  47. [83]

    The NarrativeQA Reading Comprehension Challenge

    Kocisky, Tomas and Schwarz, Jonathan and Blunsom, Phil and Dyer, Chris and Hermann, Karl Moritz and Melis, Gabor and Grefenstette, Edward. The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics. 2018

  48. [84]

    Billion-Scale Similarity Search with GPUs

    Johnson, Jeff and Douze, Matthijs and Jegou, Herve. Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data. 2019

  49. [85]

    ROUGE: A Package for Automatic Evaluation of Summaries

    Lin, Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004

  50. [86]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research. 2020

  51. [87]

    Language Models are Unsupervised Multitask Learners

    Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya. Language Models are Unsupervised Multitask Learners. 2019

  52. [88]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. 2023

  53. [89]

    Finetuned Language Models are Zero-Shot Learners

    Wei, Jason and Bosma, Maarten and Zhao, Vincent Y and Guu, Kelvin and Yu, Adams Wei and Lester, Brian and Du, Nan and Dai, Andrew M and Le, Quoc V. Finetuned Language Models are Zero-Shot Learners. International Conference on Learning Representations (ICLR). 2022

  54. [90]

    What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams

    Jin, Di and Pan, Eileen and Oufattole, Nassim and Weng, Wei-Hung and Fang, Hanyi and Szolovits, Peter. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences. 2021

  55. [91]

    LinkBERT: Pretraining Language Models with Document Links

    Yasunaga, Michihiro and Leskovec, Jure and Liang, Percy. LinkBERT: Pretraining Language Models with Document Links. Association for Computational Linguistics (ACL). 2022

  56. [92]

    A Statistical Interpretation of Term Specificity and Its Application in Retrieval

    Sparck Jones, Karen. A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation. 1972

  57. [93]

    Okapi at TREC-3

    Robertson, Stephen E and Walker, Steve and Jones, Susan and Hancock-Beaulieu, Micheline M and Gatford, Mike. Okapi at TREC-3. Proceedings of the Third Text REtrieval Conference (TREC-3). 1995

  58. [94]

    On the Surprising Behavior of Distance Metrics in High Dimensional Space

    Aggarwal, Charu C and Hinneburg, Alexander and Keim, Daniel A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. International Conference on Database Theory (ICDT). 2001

  59. [95]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    McInnes, Leland and Healy, John and Melville, James. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426. 2018

  60. [96]

    BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure

    Grootendorst, Maarten. BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure. arXiv preprint arXiv:2203.05794. 2022

  61. [97]

    UnifiedQA: Crossing Format Boundaries with a Single QA System

    Khashabi, Daniel and Min, Sewon and Khot, Tushar and Sabharwal, Ashish and Tafjord, Oyvind and Clark, Peter and Hajishirzi, Hannaneh. UnifiedQA: Crossing Format Boundaries with a Single QA System. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.171

  62. [98]

    Survey of Hallucination in Natural Language Generation

    Ji, Ziwei and Lee, Nayeon and Frieske, Rita and Yu, Tiezheng and Su, Dan and Xu, Yan and Ishii, Etsuko and Bang, Ye Jin and Madotto, Andrea and Fung, Pascale. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys. 2023. doi:10.1145/3571730

  63. [99]

    Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

    Izacard, Gautier and Grave, Edouard. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. arXiv preprint arXiv:2007.01282. 2020

  64. [100]

    Large Language Models Encode Clinical Knowledge

    Singhal, Karan and Azizi, Shekoofeh and Tu, Tao and Mahdavi, S. Sara and Wei, Jason and Chung, Hyung Won and others. Large Language Models Encode Clinical Knowledge. arXiv preprint arXiv:2212.13138. 2022

  65. [101]

    Distilling Knowledge from Reader to Retriever for Question Answering

    Izacard, Gautier and Grave, Edouard. Distilling Knowledge from Reader to Retriever for Question Answering. arXiv preprint arXiv:2012.04584. 2020

  66. [102]

    Capabilities of GPT-4 on Medical Challenge Problems

    Nori, Harsha and King, Nicholas and McKinney, Scott Mayer and Carignan, Dean and Horvitz, Eric. Capabilities of GPT-4 on Medical Challenge Problems. arXiv preprint arXiv:2303.13375. 2023

  67. [103]

    Dong, Mengxing and Zou, Bowei and Li, Yanling and Hong, Yu. 2023

  68. [104]

    Longformer: The Long-Document Transformer

    Beltagy, Iz and Peters, Matthew E. and Cohan, Arman. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150. 2020

  69. [105]

    Frustratingly Hard Evidence Retrieval for QA Over Books

    Mou, Xiangyang and Yu, Mo and Yao, Bingsheng and Yang, Chenghao and Guo, Xiaoxiao and Potdar, Saloni and Su, Hui. Frustratingly Hard Evidence Retrieval for QA Over Books. Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events. 2020. doi:10.18653/v1/2020.nuse-1.13

  70. [106]

    ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction

    Santhanam, Keshav and Khattab, Omar and Saad-Falcon, Jon and Potts, Christopher and Zaharia, Matei. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. arXiv preprint arXiv:2112.01488. 2021

  71. [107]

    ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

    Khattab, Omar and Zaharia, Matei. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020

  72. [108]

    Questions Are All You Need to Train a Dense Passage Retriever

    Sachan, Devendra Singh and Lewis, Mike and Yogatama, Dani and Zettlemoyer, Luke and Pineau, Joelle and Zaheer, Manzil. Questions Are All You Need to Train a Dense Passage Retriever. Transactions of the Association for Computational Linguistics. 2023. doi:10.1162/tacl_a_00564

  73. [109]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Reimers, Nils and Gurevych, Iryna. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1410

  74. [110]

    ReadTwice: Reading Very Large Documents with Memories

    Zemlyanskiy, Yury and Ainslie, Joshua and de Jong, Michiel and Pham, Philip and Eckstein, Ilya and Sha, Fei. ReadTwice: Reading Very Large Documents with Memories. arXiv preprint arXiv:2105.04241. 2021

  75. [111]

    Recursively Summarizing Books with Human Feedback

    Wu, Jeff and Ouyang, Long and Ziegler, Daniel M. and Stiennon, Nisan and Lowe, Ryan and Leike, Jan and Christiano, Paul. Recursively Summarizing Books with Human Feedback. arXiv preprint arXiv:2109.10862. 2021

  76. [112]

    Deep Bidirectional Language-Knowledge Graph Pretraining

    Yasunaga, Michihiro and Bosselut, Antoine and Ren, Hongyu and Zhang, Xikun and Manning, Christopher D and Liang, Percy and Leskovec, Jure. Deep Bidirectional Language-Knowledge Graph Pretraining. arXiv preprint arXiv:2210.09338. 2022

  77. [113]

    Reading Wikipedia to answer open-domain questions

    Chen, Danqi and Fisch, Adam and Weston, Jason and Bordes, Antoine. Reading Wikipedia to Answer Open-Domain Questions. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi:10.18653/v1/P17-1171

  78. [114]

    You Only Need One Model for Open-Domain Question Answering

    Lee, Haejun and Kedia, Akhil and Lee, Jongwon and Paranjape, Ashwin and Manning, Christopher D. and Woo, Kyoung-Gu. You Only Need One Model for Open-Domain Question Answering. arXiv preprint arXiv:2112.07381. 2021

  79. [115]

    Jiang, Zhengbao and Gao, Luyu and Araki, Jun and Ding, Haibo and Wang, Zhiruo and Callan, Jamie and Neubig, Graham. arXiv preprint arXiv:2212.02027. 2022

  80. [116]

    On the Surprising Behavior of Distance Metrics in High Dimensional Space

    Aggarwal, Charu C and Hinneburg, Alexander and Keim, Daniel A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. International Conference on Database Theory (ICDT). 2001

Showing first 80 references.