pith. machine review for the scientific record.

arxiv: 2310.11511 · v1 · submitted 2023-10-17 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 3 theorem links · Lean Theorem

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 14:09 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords Self-RAG · retrieval-augmented generation · self-reflection · reflection tokens · language models · factuality · open-domain QA · citation accuracy

The pith

Self-RAG trains a single language model to adaptively retrieve passages on demand and critique its own outputs using special reflection tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Self-Reflective Retrieval-Augmented Generation to fix factual errors that arise when language models rely only on their internal knowledge or pull in a fixed set of retrieved passages. It trains any base LM to generate reflection tokens that signal whether retrieval is needed, judge the relevance of passages, and evaluate the quality of its own draft answers. This turns retrieval and self-critique into controllable behaviors at inference time rather than fixed external steps. A reader would care because the method promises higher factuality and citation accuracy across question answering, reasoning, and long-form generation while avoiding the performance drop that comes from indiscriminate retrieval.

Core claim

Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection by training a single arbitrary LM that adaptively retrieves passages on demand, and generates and reflects on retrieved passages and its own generations using special tokens called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks, including open-domain QA, reasoning, and fact verification, with notable gains in factuality and citation accuracy for long-form generations.

What carries the argument

Reflection tokens: special tokens the LM learns to output that indicate retrieval necessity, assess passage relevance, and critique generation quality, thereby guiding the retrieval and refinement process.
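As a purely illustrative sketch of how such tokens could steer inference, assume a model exposing `predict_token` and `generate` and a retriever exposing `search`. The mock classes, token semantics, and control flow below are hypothetical stand-ins, not the paper's implementation; only the token names ([Retrieve], [Relevant], [Support]) come from the paper.

```python
# Hypothetical sketch of reflection-token-driven inference. MockLM and
# MockRetriever stand in for the trained Self-RAG model and its retriever;
# the keyword heuristics are toys, not the paper's learned behavior.

class MockLM:
    """Toy LM that 'emits' reflection tokens via keyword heuristics."""

    def predict_token(self, context, choices):
        if "[Retrieve]" in choices:          # retrieval-decision token
            return "[Retrieve]"
        if "[Relevant]" in choices:          # passage-relevance critique
            return "[Relevant]" if "paris" in context.lower() else "[Irrelevant]"
        return "[Support]"                   # support critique

    def generate(self, context):
        return "Paris" if "paris" in context.lower() else "I don't know"


class MockRetriever:
    def search(self, query, k=5):
        return ["Paris is the capital of France.",
                "Bordeaux is a port city."][:k]


def self_rag_step(model, retriever, prompt):
    # 1. The LM first decides whether external knowledge is needed.
    decision = model.predict_token(prompt, choices=["[Retrieve]", "[No Retrieve]"])
    if decision == "[No Retrieve]":
        return model.generate(prompt)

    supported, unsupported = [], []
    for passage in retriever.search(prompt):
        # 2. Critique each retrieved passage's relevance.
        relevance = model.predict_token(prompt + " " + passage,
                                        choices=["[Relevant]", "[Irrelevant]"])
        if relevance != "[Relevant]":
            continue
        draft = model.generate(prompt + " " + passage)
        # 3. Self-critique: is the draft supported by the passage?
        verdict = model.predict_token(prompt + " " + passage + " " + draft,
                                      choices=["[Support]", "[No Support]"])
        (supported if verdict == "[Support]" else unsupported).append(draft)

    # 4. Prefer supported drafts; fall back to parametric generation.
    if supported:
        return supported[0]
    return unsupported[0] if unsupported else model.generate(prompt)
```

With these mocks, `self_rag_step(MockLM(), MockRetriever(), "What is the capital of France?")` retrieves, discards the irrelevant passage, and returns the supported draft "Paris".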

If this is right

  • On-demand retrieval prevents the inclusion of irrelevant passages that can harm response quality or versatility.
  • Self-generated critiques raise factuality and citation accuracy, especially in long-form outputs.
  • A single trained model becomes controllable at inference for varied tasks without separate modules or retraining.
  • The approach yields measurable gains over both ChatGPT and retrieval-augmented baselines on QA, reasoning, and verification.
  • Reflection tokens provide an internal mechanism for the model to decide when external knowledge is required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-based reflection approach could be tested on tasks outside the paper's scope, such as code generation or dialogue, to check if controllability transfers.
  • Inspecting the reflection tokens during inference might offer a lightweight way to trace and correct specific errors without full retraining.
  • Because the method works on arbitrary base LMs, scaling it to larger models could compound the observed gains in factuality.

Load-bearing premise

An arbitrary language model can be trained to generate and act on reflection tokens in a way that improves performance across tasks rather than introducing new failure modes from the added tokens.

What would settle it

A direct comparison on long-form generation benchmarks where Self-RAG models show no improvement or a decline in factuality and citation accuracy relative to standard retrieval-augmented Llama2-chat would falsify the central claim.

read the original abstract

Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces Self-RAG, a framework that fine-tunes an arbitrary LM (Llama2-7B/13B) to emit special reflection tokens ([Retrieve], [Relevant], [Support], etc.) enabling adaptive on-demand retrieval, passage critique, and self-critique of generations. This makes the model controllable at inference and is claimed to improve factuality and task performance over standard RAG and frontier LLMs. Experiments report that Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on open-domain QA, reasoning, and fact-verification benchmarks while also raising citation accuracy and factuality in long-form outputs.

Significance. If the gains are attributable to the reflection mechanism rather than training details, the work is significant: it unifies retrieval, generation, and critique inside one model via learnable control tokens, addressing the rigidity of fixed-k RAG while preserving versatility. The single-model design and reported gains across diverse tasks suggest practical impact. The authors earn credit for releasing code, models, and detailed training recipes that support reproducibility of the central empirical claims.

major comments (3)
  1. [§5.1, Table 2] The reported outperformance on fact-verification and long-form tasks is not accompanied by statistical significance tests, confidence intervals, or multiple-run variance; without these, it is impossible to confirm that the gains exceed evaluation noise and are load-bearing for the claim that reflection tokens drive the improvement.
  2. [§4.2 (training data construction), §5.3 (ablations)] The paper lacks a controlled ablation that trains an otherwise identical model on the same volume of data but without the reflection-token prediction objective; the current ablations therefore cannot isolate whether performance stems from the self-critique mechanism or simply from additional supervised fine-tuning, leaving the central assumption about reflection tokens unverified.
  3. [§4.3 (inference procedure)] The description of how reflection tokens are decoded and used to branch behavior (retrieve vs. generate, accept vs. critique) does not specify the exact sampling strategy or temperature schedule applied to the special tokens; this detail is load-bearing for reproducing the claimed adaptive behavior and for assessing whether new failure modes (e.g., erroneous [Relevant] judgments) arise at inference time.
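To make the reproducibility concern about decoding concrete, here is one plausible way a reflection-token branch could be implemented: restrict the softmax to the two special tokens and threshold the resulting probability. This is an assumed construction, not the paper's specified procedure; the token names, logit values, and 0.5 threshold are illustrative.

```python
import math

# Assumed (not paper-specified) decoding of a reflection-token branch:
# a softmax restricted to the two special tokens, then a threshold.

def reflection_decision(logits, positive="[Relevant]",
                        negative="[Irrelevant]", threshold=0.5):
    """Greedy, thresholded branch from logits over a reflection-token pair."""
    z_pos, z_neg = logits[positive], logits[negative]
    # Probability mass renormalized over just the two reflection tokens.
    p_pos = math.exp(z_pos) / (math.exp(z_pos) + math.exp(z_neg))
    return p_pos >= threshold, p_pos

keep, p = reflection_decision({"[Relevant]": 2.0, "[Irrelevant]": 0.5})
# p ≈ 0.82, so this passage would be kept under a 0.5 threshold.
```

Whether the branch uses greedy argmax, a fixed threshold like this, or a weighted score over several critique tokens is exactly the detail the report asks the authors to pin down.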
minor comments (3)
  1. [Figure 2] The framework diagram would be clearer if arrows explicitly labeled the points at which each reflection token is emitted and consumed.
  2. [§3.1] The notation for the joint probability over tokens and reflection tokens could be made more explicit to avoid ambiguity when conditioning on retrieved passages.
  3. [Related Work] A brief comparison to prior controllable-generation methods (e.g., those using special tokens for style or factuality) would help situate the novelty of the reflection-token set.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and suggestions for improving the manuscript. We address each of the major comments below and will make the necessary revisions to enhance the clarity and rigor of our empirical results.

read point-by-point responses
  1. Referee: [§5.1, Table 2] The reported outperformance on fact-verification and long-form tasks is not accompanied by statistical significance tests, confidence intervals, or multiple-run variance; without these, it is impossible to confirm that the gains exceed evaluation noise and are load-bearing for the claim that reflection tokens drive the improvement.

    Authors: We acknowledge the importance of statistical validation. In the revised manuscript, we will report bootstrap confidence intervals for the performance metrics in Table 2 and conduct multiple runs with different random seeds to provide variance estimates. This will allow readers to assess the robustness of the gains attributed to the reflection tokens. revision: yes

  2. Referee: [§4.2 (training data construction), §5.3 (ablations)] The paper lacks a controlled ablation that trains an otherwise identical model on the same volume of data but without the reflection-token prediction objective; the current ablations therefore cannot isolate whether performance stems from the self-critique mechanism or simply from additional supervised fine-tuning, leaving the central assumption about reflection tokens unverified.

    Authors: This point is well-taken and highlights a potential gap in isolating the effect of the reflection tokens. Our existing ablations in §5.3 vary the use of reflection tokens at inference but do not fully control for the training objective. We will add a new ablation experiment training a model on the identical dataset and volume without the reflection token prediction loss, to directly verify the contribution of the self-critique mechanism. revision: yes

  3. Referee: [§4.3 (inference procedure)] The description of how reflection tokens are decoded and used to branch behavior (retrieve vs. generate, accept vs. critique) does not specify the exact sampling strategy or temperature schedule applied to the special tokens; this detail is load-bearing for reproducing the claimed adaptive behavior and for assessing whether new failure modes (e.g., erroneous [Relevant] judgments) arise at inference time.

    Authors: We appreciate this feedback on reproducibility. The reflection tokens are decoded using greedy decoding (temperature 0) to ensure consistent branching decisions, while the subsequent generation steps use the standard sampling parameters described in the paper. We will expand §4.3 to explicitly detail the decoding strategy for reflection tokens, including any thresholds used for decisions like [Relevant] or [Support]. revision: yes
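The bootstrap intervals promised in the first response can be sketched in a few lines. This is generic statistics code, not the authors' evaluation pipeline, and the per-example scores below are invented for illustration.

```python
import random

# Generic percentile-bootstrap sketch for a mean metric. Not the authors'
# evaluation code; the 0/1 per-example scores are invented.

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and collect the resampled means.
    means = sorted(sum(rng.choices(scores, k=n)) / n
                   for _ in range(n_resamples))
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

# e.g. 0/1 accuracy over 200 examples with mean 0.7:
scores = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1] * 20
lo, hi = bootstrap_ci(scores)   # roughly 0.7 ± 0.06
```

Reporting such intervals per system in Table 2 would let readers judge whether the claimed gains clear evaluation noise.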

Circularity Check

0 steps flagged

No circularity: empirical gains measured on held-out benchmarks

full rationale

The paper defines reflection tokens and trains a single LM to emit and condition on them, then reports performance improvements on standard external benchmarks (open-domain QA, reasoning, fact verification, long-form generation). No equation, fitted parameter, or self-citation is invoked in a way that would make a claimed prediction reduce to the training inputs by construction. The central claims rest on end-to-end optimization against task metrics and held-out evaluation, which is independent of the token definitions themselves.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that reflection tokens can be learned and used controllably; no external benchmarks or formal proofs are provided for this assumption.

free parameters (1)
  • reflection token set
    The exact vocabulary and semantics of the special reflection tokens are chosen by the authors.
axioms (1)
  • domain assumption: An arbitrary LM can be fine-tuned to emit and condition on reflection tokens without performance degradation
    Invoked when the framework is described as training a single arbitrary LM.
invented entities (1)
  • reflection tokens (no independent evidence)
    purpose: Enable the model to signal retrieval need, passage relevance, and generation support
    New tokens introduced by the framework; no independent evidence outside the training process is given.

pith-pipeline@v0.9.0 · 5565 in / 1352 out tokens · 61328 ms · 2026-05-12T14:09:42.760870+00:00 · methodology

discussion (0)


Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery

    cs.AI 2026-05 unverdicted novelty 7.0

    HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact dens...

  2. AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    AdaGATE improves evidence F1 scores on HotpotQA for multi-hop RAG under clean, redundant, and noisy conditions by framing selection as gap-aware token-constrained repair, outperforming baselines while using 2.6x fewer tokens.

  3. TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data

    cs.AI 2026-04 unverdicted novelty 7.0

    TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design matter...

  4. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  5. RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow

    cs.SE 2026-04 unverdicted novelty 7.0

    RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.

  6. Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context

    cs.CL 2026-04 unverdicted novelty 7.0

    Quantile tokens inserted into LLM inputs combined with neighbor retrieval enable direct prediction of full distributions, yielding lower MAPE and narrower intervals than baselines on Airbnb and StackSample tasks.

  7. IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...

  8. ROZA Graphs: Self-Improving Near-Deterministic RAG through Evidence-Centric Feedback

    cs.AI 2026-04 unverdicted novelty 7.0

    ROZA graphs enable self-improving RAG by storing evidence-specific reasoning chains, yielding up to 10.6pp accuracy gains and 46% lower cost through graph traversal feedback.

  9. LLMAR: A Tuning-Free Recommendation Framework for Sparse and Text-Rich Industrial Domains

    cs.IR 2026-03 unverdicted novelty 7.0

    LLMAR applies LLM reasoning with a self-correction reflection loop to generate semantic user motives for tuning-free recommendations, showing up to 54.6% nDCG@10 gains on a sparse industrial dataset over trained baselines.

  10. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  11. Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...

  12. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 6.0

    Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.

  13. FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    FT-RAG introduces a fine-grained graph-based retrieval framework for tables plus a new 9870-pair benchmark, reporting 23.5% and 59.2% gains in table- and cell-level hit rates and 62.2% higher exact-value recall over b...

  14. Evolve: A Persistent Knowledge Lifecycle for Small Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    A 2B model using Evolve reaches 60-84% accuracy on 750 queries across specialist questions, NaturalQuestions, and TriviaQA (up from 20-33%) while cutting teacher invocations by over 50% via a consolidated section-base...

  15. The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    HEG-TKG grounds LLM clinical reasoning in hierarchical evidence-based temporal knowledge graphs from 4,512 PubMed records, delivering 100% citation verifiability and error detectability where standard RAG and unprompt...

  16. Preregistered Belief Revision Contracts

    cs.AI 2026-04 unverdicted novelty 6.0

    PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.

  17. Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OKH-RAG represents knowledge as ordered hyperedges and retrieves coherent interaction sequences via a learned transition model, outperforming permutation-invariant RAG baselines on order-sensitive QA tasks.

  18. LLMs Should Express Uncertainty Explicitly

    cs.LG 2026-04 unverdicted novelty 6.0

    Training LLMs to verbalize uncertainty explicitly at the end or during reasoning reduces overconfident errors and improves answer quality on factual tasks while enabling RAG triggers.

  19. LLMs Should Express Uncertainty Explicitly

    cs.LG 2026-04 unverdicted novelty 6.0

    Training LLMs to express uncertainty explicitly via global confidence or local markers enhances calibration and intervention triggers compared to post-hoc estimation.

  20. Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

    cs.CL 2026-04 unverdicted novelty 6.0

    A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.

  21. A-MEM: Agentic Memory for LLM Agents

    cs.CL 2025-02 unverdicted novelty 6.0

    A-MEM is a dynamic memory system for LLM agents that builds and refines an interconnected network of notes with agent-driven linking and evolution, showing performance gains over prior memory methods on six models.

  22. Search-o1: Agentic Search-Enhanced Large Reasoning Models

    cs.AI 2025-01 unverdicted novelty 6.0

    Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...

  23. ChipLingo: A Systematic Training Framework for Large Language Models in EDA

    cs.LG 2026-04 unverdicted novelty 5.0

    ChipLingo trains LLMs on EDA data via corpus construction, domain-adaptive pretraining, and RAG scenario alignment, reaching 59.7% accuracy with an 8B model and 70.02% with a 32B model on a new internal EDA benchmark.

  24. Rethinking the Necessity of Adaptive Retrieval-Augmented Generation through the Lens of Adaptive Listwise Ranking

    cs.IR 2026-04 unverdicted novelty 5.0

    AdaRankLLM shows adaptive listwise reranking outperforms fixed-depth retrieval for most LLMs by acting as a noise filter for weak models and an efficiency optimizer for strong ones, with lower context use.

  25. Lightweight LLM Agent Memory with Small Language Models

    cs.AI 2026-04 unverdicted novelty 5.0

    LightMem uses SLMs to modularize agent memory into STM, MTM, and LTM with two-stage vector-plus-semantic retrieval online and incremental consolidation offline, reporting 2.5 F1 gains and low latency over A-MEM on LoCoMo.

  26. DTCRS: Dynamic Tree Construction for Recursive Summarization

    cs.CL 2026-04 unverdicted novelty 5.0

    DTCRS dynamically builds summary trees only for suitable question types by using sub-question embeddings as cluster centers, cutting construction time while improving QA on three tasks.

  27. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  28. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 4.0

    Experience-RAG Skill is a reusable agent skill that selects retrieval strategies via experience memory, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed retriever baselines.

  29. MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction

    cs.CY 2026-04 unverdicted novelty 4.0

    MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.

  30. Mitigating Hallucination on Hallucination in RAG via Ensemble Voting

    cs.CL 2026-03 unverdicted novelty 4.0

    VOTE-RAG applies retrieval voting across diverse queries and response voting across independent generations to mitigate hallucination-on-hallucination in RAG, matching or exceeding complex baselines on six benchmarks ...

  31. Toward Agentic RAG for Ukrainian

    cs.AI 2026-04 unverdicted novelty 3.0

    Agentic RAG for Ukrainian improves answer accuracy via retries but is still limited by document and page retrieval quality.

  32. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Reference graph

Works this paper leans on

154 extracted references · 154 canonical work pages · cited by 30 Pith papers · 13 internal anchors

  1. [1]

    Learning to retrieve reasoning paths over wikipedia graph for question answering

    Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. Learning to retrieve reasoning paths over wikipedia graph for question answering. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJgVHkrYDH

  2. [2]

    Retrieval-based language models and applications

    Akari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen. Retrieval-based language models and applications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Tutorial), 2023a. URL https://aclanthology.org/2023.acl-tutorials.6

  3. [3]

    Task-aware retrieval with instructions

    Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, and Wen-tau Yih. Task-aware retrieval with instructions. In Findings of the Association for Computational Linguistics, 2023b. URL https://aclanthology.org/2023.findings-acl.225

  4. [7]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=H4DqfPSibmx

  5. [8]

    Chain-of-verification reduces hallucination in large language models

    Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495, 2023. URL https://arxiv.org/abs/2309.11495

  6. [9]

    Wizard of wikipedia: Knowledge-powered conversational agents

    Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. Wizard of wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1l73iRqKm

  7. [12]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International Conference on Machine Learning, 2020. URL https://dl.acm.org/doi/pdf/10.5555/3524938.3525306

  8. [13]

    Unsupervised dense information retrieval with contrastive learning

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research, 2022a. URL https://openreview.net/forum?id=jKN1pXi7b0

  9. [16]

    TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. URL https://aclanthology.org/P17-1147

  10. [18]

    Pretraining language models with human preferences

    Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. In International Conference on Machine Learning, 2023. URL https://openreview.net/forum?id=AT8Iw8KOeC

  11. [19]

    Natural questions: A benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

  12. [20]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. URL https://arxiv.org/abs/2309.06180

  13. [21]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, 2020. URL https://proceedings.neurips.c...

  14. [22]

    Ra-dit: Retrieval-augmented dual instruction tuning

    Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, and Scott Yih. Ra-dit: Retrieval-augmented dual instruction tuning, 2023. URL https://arxiv.org/abs/2310.01352

  15. [25]

    QUARK: Controllable text generation with reinforced unlearning

    Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. QUARK: Controllable text generation with reinforced unlearning. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=5HaIds3ux5O

  16. [27]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023. URL http...

  17. [28]

    When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. URL https://aclanthology.org/2023....

  18. [30]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018. URL https://aclanthology.org/D18-1260

  19. [31]

    A discrete hard EM approach for weakly supervised question answering

    Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. A discrete hard EM approach for weakly supervised question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. URL https://aclanthology.org/D19-1284

  20. [34]

    Large dual encoders are generalizable retrievers

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. URL https://aclanthology.org/2022.emnlp-main.669

  21. [36]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback...

  22. [37]

    Refiner: Reasoning feedback on intermediate representations

    Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. Refiner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904, 2023. URL https://arxiv.org/abs/2304.01904

  23. [38]

    KILT: a benchmark for knowledge intensive language tasks

    Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Ass...

  24. [39]

    MAUVE: Measuring the gap between neural text and human text using divergence frontiers

    Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Tqx7nJp7PR

  25. [40]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020. URL https://dl.acm.org/doi/10.5555/3433701.3433727

  26. [41]

    In-context retrieval-augmented language models

    Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 2023. URL https://arxiv.org/abs/2302.00083

  27. [42]

    Multitask prompted training enables zero-shot task generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng S...

  28. [45]

    Large language models can be easily distracted by irrelevant context

    Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, 2023. URL https://proceedings.mlr.press/v202/shi23a.html

  29. [46]

    ASQA: Factoid questions meet long-form answers

    Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. URL https://aclanthology.org/2022.emnlp-main.566

  30. [47]

    FEVER: a large-scale dataset for fact extraction and VERification

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018. URL https://aclanthology.org/N18-1074

  31. [50]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR

  32. [53]

    Recomp: Improving retrieval-augmented lms with compression and selective augmentation

    Fangyuan Xu, Weijia Shi, and Eunsol Choi. Recomp: Improving retrieval-augmented lms with compression and selective augmentation, 2023. URL https://arxiv.org/abs/2310.04408

  33. [57]

    Language agent tree search unifies reasoning acting and planning in language models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models, 2023. URL https://arxiv.org/abs/2310.04406

  34. [59]

    Scaling learning algorithms towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, 2007.

  35. [60]

    A Fast Learning Algorithm for Deep Belief Nets

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 2006.

  36. [61]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

  37. [62]

    Language models are few-shot learners

    Tom Brown et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.

  38. [63]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell et al. A General Language Assistant as a Laboratory for Alignment. arXiv preprint arXiv:2112.00861, 2021.

  39. [64]

    BBQ: A hand-built bias benchmark for question answering

    Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, 2022. doi:10.18653/v1/2022.findings-acl.165

  40. [65]

    Counterfactual memorization in neural language models

    Counterfactual memorization in neural language models. arXiv preprint arXiv:2112.12938, 2021.

  41. [66]

    Measuring forgetting of memorized training examples

    Measuring forgetting of memorized training examples. arXiv preprint arXiv:2207.00099, 2022.

  42. [67]

    Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models

    Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models. arXiv preprint arXiv:2205.10770, 2022.

  43. [68]

    Understanding Transformer Memorization Recall Through Idioms

    Understanding Transformer Memorization Recall Through Idioms. arXiv preprint arXiv:2210.03588, 2022.

  44. [69]

    Data Contamination: From Memorization to Exploitation

    Data Contamination: From Memorization to Exploitation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2022.

  45. [70]

    Impact of Pretraining Term Frequencies on Few-Shot Reasoning

    Impact of Pretraining Term Frequencies on Few-Shot Reasoning. arXiv preprint, 2022.

  46. [71]

    Quantifying Memorization Across Neural Language Models

    Quantifying Memorization Across Neural Language Models. arXiv preprint, 2022.

  47. [72]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, and others. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022.

  48. [73]

    KILT: a Benchmark for Knowledge Intensive Language Tasks

    Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a Benchmark for Knowledge Intensive Language Tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, 2021.

  49. [74]

    Generate rather than retrieve: Large language models are strong context generators

    Generate rather than retrieve: Large language models are strong context generators. arXiv preprint arXiv:2209.10063, 2022.

  50. [75]

    Large Language Models Struggle to Learn Long-Tail Knowledge

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large Language Models Struggle to Learn Long-Tail Knowledge. 2022. doi:10.48550/ARXIV.2211.08411

  51. [76]

    Retrieval augmentation reduces hallucination in conversation

    Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, 2021.

  52. [77]

    Large pre-trained language models contain human-like biases of what is right and wrong to do

    Large pre-trained language models contain human-like biases of what is right and wrong to do. Nature Machine Intelligence, 2022.

  53. [78]

    Question and answer test-train overlap in open-domain question answering datasets

    Question and answer test-train overlap in open-domain question answering datasets. arXiv preprint arXiv:2008.02637, 2020.

  54. [79]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019.

  55. [80]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Multitask Prompted Training Enables Zero-Shot Task Generalization. In International Conference on Learning Representations, 2022.

  56. [81]

    OPT: Open Pre-trained Transformer Language Models

    OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068, 2022.

  57. [82]

    Natural Questions: A Benchmark for Question Answering Research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 2019.

  58. [83]

    Improving Language Models by Retrieving from Trillions of Tokens

    Improving Language Models by Retrieving from Trillions of Tokens. In Proceedings of the 39th International Conference on Machine Learning, 2022.

  59. [84]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, 2020.

  60. [85]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.

  61. [86]

    Measuring and Narrowing the Compositionality Gap in Language Models

    Measuring and Narrowing the Compositionality Gap in Language Models. arXiv preprint arXiv:2210.03350, 2022.

  62. [87]

    Simple entity-centric questions challenge dense retrievers

    Simple entity-centric questions challenge dense retrievers. arXiv preprint arXiv:2109.08535, 2021.

  63. [88]

    Language models as knowledge bases?

    Fabio Petroni et al. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.

  64. [89]

    How Much Knowledge Can You Pack Into the Parameters of a Language Model?

    Adam Roberts, Colin Raffel, and Noam Shazeer. How Much Knowledge Can You Pack Into the Parameters of a Language Model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.

  65. [90]

    Large Language Models Struggle to Learn Long-Tail Knowledge

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large Language Models Struggle to Learn Long-Tail Knowledge. arXiv preprint arXiv:2211.08411, 2022.

  66. [91]

    GPT-NeoX-20B: An open-source autoregressive language model

    GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.

  67. [92]

    E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT

    Nina Poerner, Ulli Waltinger, and Hinrich Schütze. E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, 2020. doi:10.18653/v1/2020.findings-emnlp.71

  68. [93]

    Unsupervised Dense Information Retrieval with Contrastive Learning

    Gautier Izacard et al. Unsupervised Dense Information Retrieval with Contrastive Learning. Transactions on Machine Learning Research, 2022.

  69. [94]

    The probabilistic relevance framework: BM25 and beyond

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 2009.

  70. [95]

    Few-shot learning with retrieval augmented language models

    Gautier Izacard et al. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022.

  71. [96]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Retrieval augmented language model pre-training. In International Conference on Machine Learning, 2020.

  72. [97]

    Mention Memory: incorporating textual knowledge into Transformers through entity mention attention

    Mention Memory: incorporating textual knowledge into Transformers through entity mention attention. In International Conference on Learning Representations, 2022.

  73. [98]

    Entities as Experts: Sparse Memory Access with Entity Supervision

    Thibault Févry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski. Entities as Experts: Sparse Memory Access with Entity Supervision. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.

  74. [99]

    Generalization through Memorization: Nearest Neighbor Language Models

    Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations, 2020.

  75. [100]

    Training Language Models with Memory Augmentation

    Zexuan Zhong, Tao Lei, and Danqi Chen. Training Language Models with Memory Augmentation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.

  76. [101]

    Task-aware Retrieval with Instructions

    Akari Asai et al. Task-aware Retrieval with Instructions. arXiv preprint arXiv:2211.09260, 2022.

  77. [102]

    Nearest Neighbor Zero-Shot Inference

    Nearest Neighbor Zero-Shot Inference. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.

  78. [103]

    Controllable Semantic Parsing via Retrieval Augmentation

    Panupong Pasupat, Yuan Zhang, and Kelvin Guu. Controllable Semantic Parsing via Retrieval Augmentation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.

  79. [104]

    One question answering model for many languages with cross-lingual dense passage retrieval

    One question answering model for many languages with cross-lingual dense passage retrieval. In Advances in Neural Information Processing Systems, 2021.

  80. [105]

    MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text

    MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.

Showing first 80 references.