Pith · machine review for the scientific record

arxiv: 2401.18059 · v1 · submitted 2024-01-31 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:02 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: RAPTOR · retrieval-augmented generation · hierarchical summarization · long-context reasoning · question answering · tree-structured retrieval · abstractive processing

The pith

Recursive clustering and summarization builds a tree that improves retrieval-augmented reasoning over long documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RAPTOR, which recursively embeds, clusters, and summarizes text chunks to form a tree with multiple levels of abstraction. During inference the model retrieves from this tree rather than from flat contiguous chunks, allowing integration of information across lengthy documents. A sympathetic reader would care because standard retrieval methods struggle with complex multi-step reasoning that requires both local details and global context. Experiments demonstrate that this approach yields significant gains over conventional retrieval-augmented language models, including a 20-point absolute accuracy increase on the QuALITY benchmark when paired with GPT-4.

Core claim

By recursively embedding, clustering, and summarizing chunks of text, RAPTOR constructs a tree with differing levels of summarization from the bottom up. At inference time the model retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented language models on several tasks, achieving state-of-the-art results on question-answering benchmarks that involve complex, multi-step reasoning.

What carries the argument

The RAPTOR tree, constructed bottom-up by recursive embedding, clustering, and abstractive summarization of text chunks.
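The construction described above can be sketched end to end. This is a hedged toy sketch, not the paper's implementation: the `embed`, `cluster`, and `summarize` stand-ins below are placeholders for the neural embeddings, GMM clustering on reduced vectors, and abstractive LLM summarization that RAPTOR actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    level: int
    children: list = field(default_factory=list)

def embed(text: str) -> list[float]:
    # Toy stand-in for a neural sentence embedding.
    return [sum(ord(c) for c in text) % 97, len(text) % 31]

def cluster(nodes: list[Node], k: int) -> list[list[Node]]:
    # Naive stand-in for GMM clustering: sort by the first embedding
    # dimension and cut into roughly len/k contiguous groups.
    nodes = sorted(nodes, key=lambda n: embed(n.text)[0])
    size = max(2, len(nodes) // k)
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]

def summarize(texts: list[str]) -> str:
    # Stand-in for an abstractive LLM summary of one cluster.
    return " ".join(texts)[:200]

def build_tree(chunks: list[str], k: int = 2, max_levels: int = 3) -> list[list[Node]]:
    """Recursively cluster and summarize until a single root remains."""
    layer = [Node(c, 0) for c in chunks]
    levels = [layer]
    level = 1
    while len(layer) > 1 and level < max_levels:
        layer = [Node(summarize([n.text for n in group]), level, children=group)
                 for group in cluster(layer, k)]
        levels.append(layer)
        level += 1
    return levels  # leaves first, root last

levels = build_tree(["alpha facts", "beta details", "gamma context", "delta notes"])
print([len(layer) for layer in levels])  # → [4, 2, 1]
```

The shape is the point: every level, leaves and summaries alike, stays available as a retrieval candidate.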

If this is right

  • Retrieval integrates context across long documents at varying levels of abstraction.
  • Performance improves on question-answering tasks that require complex multi-step reasoning.
  • State-of-the-art results are achieved on benchmarks such as QuALITY when the tree is used with GPT-4.
  • The method supports better incorporation of long-tail knowledge compared with flat retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The tree structure could reduce reliance on ever-larger context windows by supplying relevant summaries on demand.
  • Similar hierarchical organization might extend to tasks such as document summarization or multi-hop knowledge extraction.
  • Different embedding or clustering choices could be substituted to test robustness of the performance gains.
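The third extension above, swapping clustering choices to probe robustness, amounts to making the clustering step a pluggable function. A minimal sketch; the `bucket_by_first_dim` strategy is an illustrative stand-in, not anything from the paper:

```python
from typing import Callable

Vector = list[float]

def bucket_by_first_dim(vecs: list[Vector], k: int) -> list[int]:
    # Illustrative strategy: assign cluster ids by rank on the first dimension.
    order = sorted(range(len(vecs)), key=lambda i: vecs[i][0])
    labels = [0] * len(vecs)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(vecs)
    return labels

def build_layer(vecs: list[Vector], k: int,
                strategy: Callable[[list[Vector], int], list[int]]) -> dict:
    """Group vector indices into clusters using any labeling strategy."""
    labels = strategy(vecs, k)
    groups: dict[int, list[int]] = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    return groups

groups = build_layer([[0.1, 1], [0.9, 0], [0.2, 1], [0.8, 0]], 2, bucket_by_first_dim)
print(groups)  # → {0: [0, 2], 1: [1, 3]}
```

Replacing `bucket_by_first_dim` with, say, a GMM or k-means labeler leaves the rest of the pipeline untouched, which is what a robustness test would need.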

Load-bearing premise

The recursive clustering and summarization process effectively captures and preserves all relevant information from the original document without significant loss or distortion.

What would settle it

If retrieval from the RAPTOR tree produced lower accuracy than standard contiguous-chunk retrieval on the QuALITY benchmark, the central performance claim would be falsified.

read the original abstract

Retrieval-augmented language models can better adapt to changes in world state and incorporate long-tail knowledge. However, most existing methods retrieve only short contiguous chunks from a retrieval corpus, limiting holistic understanding of the overall document context. We introduce the novel approach of recursively embedding, clustering, and summarizing chunks of text, constructing a tree with differing levels of summarization from the bottom up. At inference time, our RAPTOR model retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented LMs on several tasks. On question-answering tasks that involve complex, multi-step reasoning, we show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces RAPTOR, a retrieval-augmented generation method that recursively embeds, clusters, and abstractively summarizes document chunks to construct a multi-level tree. At inference, the model retrieves from this tree to integrate information across different levels of abstraction, addressing limitations of flat chunk retrieval in standard RAG. Controlled experiments on QA benchmarks demonstrate improvements over baselines, with a reported 20% absolute accuracy gain on QuALITY when paired with GPT-4.

Significance. If the results hold under full scrutiny, RAPTOR offers a practical advance in handling long-document context for complex reasoning tasks by enabling hierarchical abstraction retrieval. The approach builds on established embedding and clustering techniques with a novel bottom-up tree construction, and the empirical SOTA claims on multiple QA tasks could influence future RAG designs if the information-preservation properties of the summaries are confirmed.

major comments (3)
  1. [§4.2] Tree Construction: The recursive clustering and summarization steps lack explicit controls or metrics for information fidelity (e.g., no ROUGE or entailment checks between summaries and source chunks); this is load-bearing for the central claim that the tree preserves usable context without significant loss.
  2. [§5.3] QuALITY Experiments: The 20% absolute accuracy improvement is reported without variance, statistical significance tests, or an ablation isolating the contribution of multi-level retrieval versus GPT-4 prompting alone; this undermines the cross-baseline comparison.
  3. [§3] Retrieval Algorithm: The inference-time retrieval procedure over the tree is described only at a high level, omitting the precise scoring and selection rules across levels, which makes the exact integration of summaries and chunks impossible to reproduce.
minor comments (2)
  1. [Figure 1] Figure 1: The tree diagram is helpful but the caption does not specify the embedding model or clustering algorithm used in the illustrated example.
  2. [Related Work] Related Work section: The comparison to prior hierarchical retrieval methods (e.g., those using sentence embeddings or graph-based structures) could be expanded with quantitative differences.
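Major comment 3's reproducibility gap can be made concrete. One plausible reading is collapsed-tree retrieval: score every node, leaf chunks and summaries alike, by cosine similarity to the query and keep the top-k under a token budget. The scoring function, over-fetch factor, and budget below are illustrative assumptions, not the authors' exact procedure:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity; 0.0 for degenerate zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], nodes: list[dict], k: int = 3,
             token_budget: int = 1000) -> list[dict]:
    """Rank all tree nodes by similarity; keep the top-k within a token budget."""
    scored = sorted(nodes, key=lambda n: cosine(query_vec, n["vec"]), reverse=True)
    picked, used = [], 0
    for node in scored[:k * 2]:  # small over-fetch before the budget filter
        cost = len(node["text"].split())  # crude token count
        if used + cost <= token_budget and len(picked) < k:
            picked.append(node)
            used += cost
    return picked

nodes = [
    {"text": "low-level detail about topic A", "vec": [1.0, 0.1]},
    {"text": "summary of topics A and B",      "vec": [0.7, 0.7]},
    {"text": "unrelated material",             "vec": [0.0, 1.0]},
]
top = retrieve([1.0, 0.2], nodes, k=2)
```

Note that a detailed leaf and a higher-level summary can both be selected for the same query, which is exactly the cross-level integration the paper claims.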

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving rigor and reproducibility. We address each major comment below and will incorporate the suggested changes in the revised manuscript.

read point-by-point responses
  1. Referee: [§4.2] Tree Construction: The recursive clustering and summarization steps lack explicit controls or metrics for information fidelity (e.g., no ROUGE or entailment checks between summaries and source chunks); this is load-bearing for the central claim that the tree preserves usable context without significant loss.

    Authors: We agree that explicit fidelity metrics would strengthen the central claim. While downstream performance gains provide supporting evidence, we will add quantitative controls in the revised Section 4.2, including average ROUGE-L scores and NLI entailment rates between each summary and its source chunks across tree levels. revision: yes

  2. Referee: [§5.3] QuALITY Experiments: The 20% absolute accuracy improvement is reported without variance, statistical significance tests, or an ablation isolating the contribution of multi-level retrieval versus GPT-4 prompting alone; this undermines the cross-baseline comparison.

    Authors: We acknowledge the importance of statistical reporting. In the revision we will include standard deviations over multiple runs, perform significance tests, and add an ablation in Section 5.3 that directly compares hierarchical RAPTOR retrieval against flat retrieval using identical GPT-4 prompting to isolate the multi-level contribution. revision: yes

  3. Referee: [§3] Retrieval Algorithm: The inference-time retrieval procedure over the tree is described only at a high level, omitting the precise scoring and selection rules across levels, which makes the exact integration of summaries and chunks impossible to reproduce.

    Authors: We will expand Section 3 with a detailed description of the retrieval procedure, including the exact cosine-similarity scoring function, level-specific top-k thresholds, and the prompt-integration rules for combining summaries and chunks. Pseudocode will be added to ensure full reproducibility. revision: yes
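The fidelity check promised in response 1 has a concrete lexical half: ROUGE-L, the longest-common-subsequence F1 between a summary and its source. A self-contained sketch on toy strings (the NLI entailment half would require a separate model and is not shown):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    # Classic O(len(a) * len(b)) dynamic program for LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(summary: str, source: str) -> float:
    """ROUGE-L F1 over whitespace tokens: LCS-based precision/recall harmonic mean."""
    s, r = summary.split(), source.split()
    lcs = lcs_len(s, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(s), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the tree preserves context",
                   "the raptor tree preserves long document context")
```

Averaging such scores between each summary node and its source chunks, level by level, is one way to quantify the information loss the referee flags.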

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents RAPTOR as an explicit algorithmic construction: recursive bottom-up embedding, clustering, and abstractive summarization to form a multi-level tree, followed by retrieval over that tree at inference. Performance claims (e.g., +20% absolute on QuALITY with GPT-4) rest on controlled empirical comparisons against flat-chunk RAG baselines, not on any mathematical derivation that reduces to its own fitted inputs or self-citations. No self-definitional loops, renamed predictions, or load-bearing self-citations appear in the method or results sections; the claims are grounded in external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method introduces a new hierarchical structure but relies on standard embedding and clustering techniques; the key assumption is the utility of recursive summarization.

axioms (1)
  • domain assumption Abstractive summaries at different levels retain sufficient information for multi-step reasoning
    This is implicitly required for the retrieval to improve performance on complex tasks.

pith-pipeline@v0.9.0 · 5470 in / 1074 out tokens · 68391 ms · 2026-05-15T13:02:10.149769+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.

  2. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.

  3. Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.

  4. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  5. Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.

  6. CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

    cs.CL 2026-03 unverdicted novelty 7.0

    CLAG organizes agent memory into clusters via an SLM router and uses cluster profiles for two-stage retrieval, yielding better answer quality on QA benchmarks than prior memory systems.

  7. ASTRA-QA: A Benchmark for Abstract Question Answering over Documents

    cs.CL 2026-05 unverdicted novelty 6.0

    ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.

  8. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 6.0

    Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.

  9. FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    FT-RAG introduces a fine-grained graph-based retrieval framework for tables plus a new 9870-pair benchmark, reporting 23.5% and 59.2% gains in table- and cell-level hit rates and 62.2% higher exact-value recall over b...

  10. MemORAI: Memory Organization and Retrieval via Adaptive Graph Intelligence for LLM Conversational Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    MemORAI combines selective filtering, provenance tracking in multi-relational graphs, and dynamic weighted PageRank retrieval to achieve state-of-the-art memory retrieval and personalized responses in LLM agents on LO...

  11. MindTrellis: Co-Creating Knowledge Structures with AI through Interactive Visual Exploration

    cs.HC 2026-04 unverdicted novelty 6.0

    MindTrellis enables users and AI to co-create evolving knowledge graphs, outperforming retrieval-only tools in expert-rated content coverage, structural quality, and reduced cognitive load during a study of 12 partici...

  12. Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...

  13. Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

    cs.CL 2026-04 unverdicted novelty 6.0

    A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.

  14. GraphRAG-Router: Learning Cost-Efficient Routing over GraphRAGs and LLMs with Reinforcement Learning

    cs.IR 2026-03 unverdicted novelty 6.0

    GraphRAG-Router uses two-stage reinforcement learning with a cost-aware curriculum reward to route queries across heterogeneous GraphRAGs and LLMs, cutting large-LLM overuse by nearly 30 percent while matching baselin...

  15. From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    cs.CL 2024-04 unverdicted novelty 6.0

    GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.

  16. How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study

    cs.SE 2026-05 conditional novelty 5.0

    Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.

  17. GRAVITY: Architecture-Agnostic Structured Anchoring for Long-Horizon Conversational Memory

    cs.CL 2026-05 unverdicted novelty 5.0

    GRAVITY adds structured relational, temporal, and thematic memory anchors to conversational LLMs at generation time, delivering 7.5-10.1% average gains in LLM-judge accuracy across five host systems on LongMemEval and LoCoMo.

  18. Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning

    cs.CL 2026-03 unverdicted novelty 5.0

    A stateful iterative RAG system converts retrieved documents into scored reasoning units, maintains supportive and non-supportive evidence, and performs deficiency-driven query refinement to achieve more robust QA per...

  19. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  20. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 4.0

    Experience-RAG Skill is a reusable agent skill that selects retrieval strategies via experience memory, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed retriever baselines.

  21. Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering

    cs.CL 2026-04 unverdicted novelty 4.0

    Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.

Reference graph

Works this paper leans on

124 extracted references · 124 canonical work pages · cited by 19 Pith papers · 19 internal anchors

  1. [1]

    On the Surprising Behavior of Distance Metrics in High Dimensional Space

    Charu C Aggarwal, Alexander Hinneburg, and Daniel A Keim. On the Surprising Behavior of Distance Metrics in High Dimensional Space. In Database Theory—ICDT 2001: 8th International Conference, London, UK, January 4–6, 2001, Proceedings 8, pp. 420–434. Springer, 2001. URL https://link.springer.com/chapter/10.1007/3-540-44503-x_27

  2. [7]

    Improving language models by retrieving from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pp. 2206–2240. PMLR, 2022. URL https://arxiv.org/abs/2112.04426

  3. [8]

    Language Models are Few-Shot Learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  4. [12]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311

  5. [13]

    Contextualizing citations for scientific summarization using word embeddings and domain knowledge

    Arman Cohan and Nazli Goharian. Contextualizing citations for scientific summarization using word embeddings and domain knowledge. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1133–1136, 2017. URL https://dl.acm.org/doi/abs/10.1145/3077136.3080740

  6. [15]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022. URL https://arxiv.org/abs/2205.14135

  7. [17]

    CoLISA: Inner Interaction via Contrastive Learning for Multi-choice Reading Comprehension

    Mengxing Dong, Bowei Zou, Yanling Li, and Yu Hong. CoLISA: Inner Interaction via Contrastive Learning for Multi-choice Reading Comprehension. In Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2–6, 2023, Proceedings, Part I, pp. 264–278. Springer, 2023a. URL https://link...

  8. [21]

    REALM: Retrieval-Augmented Language Model Pre-Training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. REALM: Retrieval-Augmented Language Model Pre-Training. In International Conference on Machine Learning, pp. 3929–3938. PMLR, 2020. URL https://doi.org/10.48550/arXiv.2002.08909

  9. [25]

    How can we know what language models know?

    Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020. URL https://arxiv.org/abs/1911.12543

  10. [26]

    Billion-scale similarity search with GPUs

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019. URL https://arxiv.org/abs/1702.08734

  11. [27]

    Large Language Models Struggle to Learn Long-Tail Knowledge

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large Language Models Struggle to Learn Long-Tail Knowledge. In International Conference on Machine Learning, pp. 15696–15707. PMLR, 2023. URL https://proceedings.mlr.press/v202/kandpal23a/kandpal23a.pdf

  12. [30]

    ColBERT: Efficient and effective passage search via contextualized late interaction over BERT

    Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48, 2020. URL https://arxiv.org/abs/2004.12832

  13. [31]

    The NarrativeQA Reading Comprehension Challenge

    Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018. URL https://arxiv.org/abs/1712.07040

  14. [32]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020. URL https://doi.org/10.48550/arXiv.2005.11401

  15. [33]

    LlamaIndex

    Jerry Liu. LlamaIndex, 2022. URL https://github.com/jerryjliu/llama_index

  16. [36]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426, 2018. URL https://arxiv.org/abs/1802.03426

  17. [39]

    Memory-based model editing at scale

    Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D Manning, and Chelsea Finn. Memory-based model editing at scale. In International Conference on Machine Learning, pp. 15817–15831. PMLR, 2022. URL https://proceedings.mlr.press/v162/mitchell22a/mitchell22a.pdf

  18. [43]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report. ArXiv, abs/2303.08774, 2023. URL https://arxiv.org/abs/2303.08774

  19. [44]

    QuALITY: Question Answering with Long Input Texts, Yes!

    Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, and Samuel Bowman. QuALITY: Question Answering with Long Input Texts, Yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human L...

  20. [46]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv preprint arXiv:2112.11446, 2021. URL https://arxiv.org/abs/2112.11446

  21. [50]

    The Probabilistic Relevance Framework: BM25 and Beyond

    Stephen Robertson, Hugo Zaragoza, et al. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009. URL https://doi.org/10.1561/1500000019

  22. [51]

    Okapi at TREC-3

    Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. Okapi at TREC-3. NIST Special Publication SP, 109:109, 1995. URL https://www.microsoft.com/en-us/research/publication/okapi-at-trec-3/

  23. [53]

    Estimating the Dimension of a Model

    Gideon Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, pp. 461–464, 1978. URL https://projecteuclid.org/journals/annals-of-statistics/volume-6/issue-2/Estimating-the-Dimension-of-a-Model/10.1214/aos/1176344136.full

  24. [54]

    A Statistical Interpretation of Term Specificity and its Application in Retrieval

    Karen Spärck Jones. A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, 28(1):11–21, 1972. URL https://doi.org/10.1108/eb026526

  25. [57]

    oLMpics – On what language model pre-training captures

    Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. oLMpics – On what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8:743–758, 2020. URL https://arxiv.org/abs/1912.13283

  26. [61]

    Generate rather than retrieve: Large language models are strong context generators

    Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. Generate rather than retrieve: Large language models are strong context generators, 2022. URL https://arxiv.org/abs/2209.10063

  27. [63]

    Estimating the Dimension of a Model

    Gideon Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, 1978.

  28. [64]

    Contextualizing citations for scientific summarization using word embeddings and domain knowledge

    Arman Cohan and Nazli Goharian. Contextualizing citations for scientific summarization using word embeddings and domain knowledge. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017.

  29. [65]

    Hybrid Hierarchical Retrieval for Open-Domain Question Answering

    Arivazhagan, Manoj Ghuhan and Liu, Lan and Qi, Peng and Chen, Xinchi and Wang, William Yang and Huang, Zhiheng. Hybrid Hierarchical Retrieval for Open-Domain Question Answering. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.679

  30. [66]

    Dense Hierarchical Retrieval for Open-domain Question Answering

    Liu, Ye and Hashimoto, Kazuma and Zhou, Yingbo and Yavuz, Semih and Xiong, Caiming and Yu, Philip. Dense Hierarchical Retrieval for Open-domain Question Answering. Findings of the Association for Computational Linguistics: EMNLP 2021. 2021. doi:10.18653/v1/2021.findings-emnlp.19

  31. [67]

    Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization

    Zhang, Shiyue and Wan, David and Bansal, Mohit. Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.120

  32. [68]

    A Controllable QA-based Framework for Decontextualization

    A Controllable QA-based Framework for Decontextualization. arXiv preprint arXiv:2305.14772, 2023.

  33. [69]

    Joint Passage Ranking for Diverse Multi-Answer Retrieval

    Min, Sewon and Lee, Kenton and Chang, Ming-Wei and Toutanova, Kristina and Hajishirzi, Hannaneh. Joint Passage Ranking for Diverse Multi-Answer Retrieval. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.560

  34. [70]

    Enabling Large Language Models to Generate Text with Citations

    Enabling Large Language Models to Generate Text with Citations. arXiv preprint arXiv:2305.14627, 2023.

  35. [71]

    Do Long-Range Language Models Actually Use Long-Range Context?

    Sun, Simeng and Krishna, Kalpesh and Mattarella-Micke, Andrew and Iyyer, Mohit. Do Long-Range Language Models Actually Use Long-Range Context?. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.62

  36. [72]

    LongT5: Efficient Text-To-Text Transformer for Long Sequences

    Guo, Mandy and Ainslie, Joshua and Uthus, David and Ontanon, Santiago and Ni, Jianmo and Sung, Yun-Hsuan and Yang, Yinfei. LongT5: Efficient Text-To-Text Transformer for Long Sequences. Findings of the Association for Computational Linguistics: NAACL 2022. 2022. doi:10.18653/v1/2022.findings-naacl.55

  37. [73]

    CoLT5: Faster Long-Range Transformers with Conditional Computation

    Ainslie, Joshua and Lei, Tao and de Jong, Michiel and Ontañón, Santiago, et al. CoLT5: Faster Long-Range Transformers with Conditional Computation. arXiv preprint arXiv:2303.09752, 2023.

  38. [74]

    A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

    Dasigi, Pradeep and Lo, Kyle and Beltagy, Iz and Cohan, Arman and Smith, Noah A. and Gardner, Matt. A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v...

  39. [75]

    Dense passage retrieval for open-domain question answering

    Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau. Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.550

  40. [76]

    QuALITY: Question Answering with Long Input Texts, Yes!

    Pang, Richard Yuanzhe and Parrish, Alicia and Joshi, Nitish and Nangia, Nikita and Phang, Jason and Chen, Angelica and Padmakumar, Vishakh and Ma, Johnny and Thompson, Jana and He, He and Bowman, Samuel. QuALITY: Question Answering with Long Input Texts, Yes! Proceedings of the 2022 Conference of the North American Chapter of the Association for...

  41. [77]

    Know What You Don't Know: Unanswerable Questions for SQuAD

    Rajpurkar, Pranav and Jia, Robin and Liang, Percy. Know What You Don't Know: Unanswerable Questions for SQuAD. Association for Computational Linguistics (ACL). 2018

  42. [78]

    Generate rather than Retrieve: Large Language Models are Strong Context Generators

    Yu, Wenhao and Iter, Dan and Wang, Shuohang and Xu, Yichong and Ju, Mingxuan and Sanyal, Soumya and Zhu, Chenguang and Zeng, Michael and Jiang, Meng. Generate rather than Retrieve: Large Language Models are Strong Context Generators. International Conference on Learning Representations (ICLR). 2023

  43. [79]

    The Probabilistic Relevance Framework: BM25 and Beyond

    Robertson, Stephen and Zaragoza, Hugo. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval. 2009

  44. [80]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and others. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems. 2020

  45. [81]

    REALM: Retrieval-Augmented Language Model Pre-Training

    Guu, Kelvin and Lee, Kenton and Tung, Zora and Pasupat, Panupong and Chang, Mingwei. REALM: Retrieval-Augmented Language Model Pre-Training. International Conference on Machine Learning (ICML). 2020

  46. [82]

    Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study

    Wang, Boxin and Ping, Wei and others. Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study. arXiv preprint arXiv:2304.06762. 2023

  47. [83]

    The NarrativeQA Reading Comprehension Challenge

    Kocisky, Tomas and Schwarz, Jonathan and Blunsom, Phil and Dyer, Chris and Hermann, Karl Moritz and Melis, Gabor and Grefenstette, Edward. The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics. 2018

  48. [84]

    Billion-Scale Similarity Search with GPUs

    Johnson, Jeff and Douze, Matthijs and Jegou, Herve. Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data. 2019

  49. [85]

    ROUGE: A Package for Automatic Evaluation of Summaries

    Lin, Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004

  50. [86]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research. 2020

  51. [87]

    Language Models are Unsupervised Multitask Learners

    Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya. Language Models are Unsupervised Multitask Learners. 2019

  52. [88]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. 2023

  53. [89]

    Finetuned Language Models are Zero-Shot Learners

    Wei, Jason and Bosma, Maarten and Zhao, Vincent Y and Guu, Kelvin and Yu, Adams Wei and Lester, Brian and Du, Nan and Dai, Andrew M and Le, Quoc V. Finetuned Language Models are Zero-Shot Learners. International Conference on Learning Representations (ICLR). 2022

  54. [90]

    What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams

    Jin, Di and Pan, Eileen and Oufattole, Nassim and Weng, Wei-Hung and Fang, Hanyi and Szolovits, Peter. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences. 2021

  55. [91]

    LinkBERT: Pretraining Language Models with Document Links

    Yasunaga, Michihiro and Leskovec, Jure and Liang, Percy. LinkBERT: Pretraining Language Models with Document Links. Association for Computational Linguistics (ACL). 2022

  56. [92]

    A Statistical Interpretation of Term Specificity and Its Application in Retrieval

    Sparck Jones, Karen. A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation. 1972

  57. [93]

    Okapi at TREC-3

    Robertson, Stephen E and Walker, Steve and Jones, Susan and Hancock-Beaulieu, Micheline M and Gatford, Mike. Okapi at TREC-3. Proceedings of the Third Text REtrieval Conference (TREC-3). 1995

  58. [94]

    On the Surprising Behavior of Distance Metrics in High Dimensional Space

    Aggarwal, Charu C and Hinneburg, Alexander and Keim, Daniel A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. International Conference on Database Theory (ICDT). 2001

  59. [95]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    McInnes, Leland and Healy, John and Melville, James. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426. 2018

  60. [96]

    BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure

    Grootendorst, Maarten. BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure. arXiv preprint arXiv:2203.05794. 2022

  61. [97]

    UnifiedQA: Crossing Format Boundaries with a Single QA System

    Khashabi, Daniel and Min, Sewon and Khot, Tushar and Sabharwal, Ashish and Tafjord, Oyvind and Clark, Peter and Hajishirzi, Hannaneh. UnifiedQA: Crossing Format Boundaries with a Single QA System. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.171

  62. [98]

    Survey of Hallucination in Natural Language Generation

    Ji, Ziwei and Lee, Nayeon and Frieske, Rita and Yu, Tiezheng and Su, Dan and Xu, Yan and Ishii, Etsuko and Bang, Ye Jin and Madotto, Andrea and Fung, Pascale. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys. 2023. doi:10.1145/3571730

  63. [99]

    Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

    Izacard, Gautier and Grave, Edouard. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. arXiv preprint arXiv:2007.01282. 2020

  64. [100]

    Large Language Models Encode Clinical Knowledge

    Singhal, Karan and Azizi, Shekoofeh and Tu, Tao and Mahdavi, S. Sara and Wei, Jason and Chung, Hyung Won and others. Large Language Models Encode Clinical Knowledge. arXiv preprint arXiv:2212.13138. 2022

  65. [101]

    Distilling Knowledge from Reader to Retriever for Question Answering

    Izacard, Gautier and Grave, Edouard. Distilling Knowledge from Reader to Retriever for Question Answering. arXiv preprint arXiv:2012.04584. 2020

  66. [102]

    Capabilities of GPT-4 on Medical Challenge Problems

    Nori, Harsha and King, Nicholas and McKinney, Scott Mayer and Carignan, Dean and Horvitz, Eric. Capabilities of GPT-4 on Medical Challenge Problems. arXiv preprint arXiv:2303.13375. 2023

  67. [103]

    Dong, Mengxing and Zou, Bowei and Li, Yanling and Hong, Yu. 2023

  68. [104]

    Longformer: The Long-Document Transformer

    Beltagy, Iz and Peters, Matthew E. and Cohan, Arman. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150. 2020

  69. [105]

    Frustratingly Hard Evidence Retrieval for QA Over Books

    Mou, Xiangyang and Yu, Mo and Yao, Bingsheng and Yang, Chenghao and Guo, Xiaoxiao and Potdar, Saloni and Su, Hui. Frustratingly Hard Evidence Retrieval for QA Over Books. Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events. 2020. doi:10.18653/v1/2020.nuse-1.13

  70. [106]

    ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction

    Santhanam, Keshav and Khattab, Omar and Saad-Falcon, Jon and Potts, Christopher and Zaharia, Matei. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. arXiv preprint arXiv:2112.01488. 2021

  71. [107]

    ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

    Khattab, Omar and Zaharia, Matei. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020

  72. [108]

    Questions Are All You Need to Train a Dense Passage Retriever

    Sachan, Devendra Singh and Lewis, Mike and Yogatama, Dani and Zettlemoyer, Luke and Pineau, Joelle and Zaheer, Manzil. Questions Are All You Need to Train a Dense Passage Retriever. Transactions of the Association for Computational Linguistics. 2023. doi:10.1162/tacl_a_00564

  73. [109]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Reimers, Nils and Gurevych, Iryna. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1410

  74. [110]

    ReadTwice: Reading Very Large Documents with Memories

    Zemlyanskiy, Yury and Ainslie, Joshua and de Jong, Michiel and Pham, Philip and Eckstein, Ilya and Sha, Fei. ReadTwice: Reading Very Large Documents with Memories. arXiv preprint arXiv:2105.04241. 2021

  75. [111]

    Recursively Summarizing Books with Human Feedback

    Wu, Jeff and Ouyang, Long and Ziegler, Daniel M. and Stiennon, Nisan and Lowe, Ryan and Leike, Jan and Christiano, Paul. Recursively Summarizing Books with Human Feedback. arXiv preprint arXiv:2109.10862. 2021

  76. [112]

    Deep Bidirectional Language-Knowledge Graph Pretraining

    Yasunaga, Michihiro and Bosselut, Antoine and Ren, Hongyu and Zhang, Xikun and Manning, Christopher D and Liang, Percy and Leskovec, Jure. Deep Bidirectional Language-Knowledge Graph Pretraining. arXiv preprint arXiv:2210.09338. 2022

  77. [113]

    Reading Wikipedia to answer open-domain questions

    Chen, Danqi and Fisch, Adam and Weston, Jason and Bordes, Antoine. Reading Wikipedia to Answer Open-Domain Questions. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi:10.18653/v1/P17-1171

  78. [114]

    You Only Need One Model for Open-Domain Question Answering

    Lee, Haejun and Kedia, Akhil and Lee, Jongwon and Paranjape, Ashwin and Manning, Christopher D. and Woo, Kyoung-Gu. You Only Need One Model for Open-Domain Question Answering. arXiv preprint arXiv:2112.07381. 2021

  79. [115]

    Jiang, Zhengbao and Gao, Luyu and Araki, Jun and Ding, Haibo and Wang, Zhiruo and Callan, Jamie and Neubig, Graham. arXiv preprint arXiv:2212.02027. 2022

  80. [116]

    On the Surprising Behavior of Distance Metrics in High Dimensional Space

    Aggarwal, Charu C and Hinneburg, Alexander and Keim, Daniel A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. International Conference on Database Theory (ICDT). 2001

Showing first 80 references.