pith. machine review for the scientific record.

arxiv: 2310.11511 · v1 · submitted 2023-10-17 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 3 theorem links · Lean Theorem

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 14:09 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords Self-RAG · retrieval-augmented generation · self-reflection · reflection tokens · language models · factuality · open-domain QA · citation accuracy

The pith

Self-RAG trains a single language model to adaptively retrieve passages on demand and critique its own outputs using special reflection tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Self-Reflective Retrieval-Augmented Generation to fix factual errors that arise when language models rely only on their internal knowledge or pull in a fixed set of retrieved passages. It trains any base LM to generate reflection tokens that signal whether retrieval is needed, judge the relevance of passages, and evaluate the quality of its own draft answers. This turns retrieval and self-critique into controllable behaviors at inference time rather than fixed external steps. A reader would care because the method promises higher factuality and citation accuracy across question answering, reasoning, and long-form generation while avoiding the performance drop that comes from indiscriminate retrieval.

Core claim

Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection by training a single arbitrary LM that adaptively retrieves passages on demand, and generates and reflects on retrieved passages and its own generations using special tokens called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks, including open-domain QA, reasoning, and fact verification, with notable gains in factuality and citation accuracy for long-form generations.

What carries the argument

Reflection tokens: special tokens the LM learns to output that indicate retrieval necessity, assess passage relevance, and critique generation quality, thereby guiding the retrieval and refinement process.
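As a purely illustrative sketch of how such tokens could steer inference, assume a model exposing `predict_token` and `generate` and a retriever exposing `search`. The mock classes, token semantics, and control flow below are hypothetical stand-ins, not the paper's implementation; only the token names ([Retrieve], [Relevant], [Support]) come from the paper.

```python
# Hypothetical sketch of reflection-token-driven inference. MockLM and
# MockRetriever stand in for the trained Self-RAG model and its retriever;
# the keyword heuristics are toys, not the paper's learned behavior.

class MockLM:
    """Toy LM that 'emits' reflection tokens via keyword heuristics."""

    def predict_token(self, context, choices):
        if "[Retrieve]" in choices:          # retrieval-decision token
            return "[Retrieve]"
        if "[Relevant]" in choices:          # passage-relevance critique
            return "[Relevant]" if "paris" in context.lower() else "[Irrelevant]"
        return "[Support]"                   # support critique

    def generate(self, context):
        return "Paris" if "paris" in context.lower() else "I don't know"


class MockRetriever:
    def search(self, query, k=5):
        return ["Paris is the capital of France.",
                "Bordeaux is a port city."][:k]


def self_rag_step(model, retriever, prompt):
    # 1. The LM first decides whether external knowledge is needed.
    decision = model.predict_token(prompt, choices=["[Retrieve]", "[No Retrieve]"])
    if decision == "[No Retrieve]":
        return model.generate(prompt)

    supported, unsupported = [], []
    for passage in retriever.search(prompt):
        # 2. Critique each retrieved passage's relevance.
        relevance = model.predict_token(prompt + " " + passage,
                                        choices=["[Relevant]", "[Irrelevant]"])
        if relevance != "[Relevant]":
            continue
        draft = model.generate(prompt + " " + passage)
        # 3. Self-critique: is the draft supported by the passage?
        verdict = model.predict_token(prompt + " " + passage + " " + draft,
                                      choices=["[Support]", "[No Support]"])
        (supported if verdict == "[Support]" else unsupported).append(draft)

    # 4. Prefer supported drafts; fall back to parametric generation.
    if supported:
        return supported[0]
    return unsupported[0] if unsupported else model.generate(prompt)
```

With these mocks, `self_rag_step(MockLM(), MockRetriever(), "What is the capital of France?")` retrieves, discards the irrelevant passage, and returns the supported draft "Paris".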

If this is right

  • On-demand retrieval prevents the inclusion of irrelevant passages that can harm response quality or versatility.
  • Self-generated critiques raise factuality and citation accuracy, especially in long-form outputs.
  • A single trained model becomes controllable at inference for varied tasks without separate modules or retraining.
  • The approach yields measurable gains over both ChatGPT and retrieval-augmented baselines on QA, reasoning, and verification.
  • Reflection tokens provide an internal mechanism for the model to decide when external knowledge is required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-based reflection approach could be tested on tasks outside the paper's scope, such as code generation or dialogue, to check if controllability transfers.
  • Inspecting the reflection tokens during inference might offer a lightweight way to trace and correct specific errors without full retraining.
  • Because the method works on arbitrary base LMs, scaling it to larger models could compound the observed gains in factuality.

Load-bearing premise

An arbitrary language model can be trained to generate and act on reflection tokens in a way that improves performance across tasks rather than introducing new failure modes from the added tokens.

What would settle it

A direct comparison on long-form generation benchmarks where Self-RAG models show no improvement or a decline in factuality and citation accuracy relative to standard retrieval-augmented Llama2-chat would falsify the central claim.

read the original abstract

Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces Self-RAG, a framework that fine-tunes an arbitrary LM (Llama2-7B/13B) to emit special reflection tokens ([Retrieve], [Relevant], [Support], etc.) enabling adaptive on-demand retrieval, passage critique, and self-critique of generations. This makes the model controllable at inference and is claimed to improve factuality and task performance over standard RAG and frontier LLMs. Experiments report that Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on open-domain QA, reasoning, and fact-verification benchmarks while also raising citation accuracy and factuality in long-form outputs.

Significance. If the gains are attributable to the reflection mechanism rather than training details, the work is significant: it unifies retrieval, generation, and critique inside one model via learnable control tokens, addressing the rigidity of fixed-k RAG while preserving versatility. The single-model design and reported gains across diverse tasks suggest practical impact. The authors earn credit for releasing code, models, and detailed training recipes that support reproducibility of the central empirical claims.

major comments (3)
  1. [§5.1, Table 2] The reported outperformance on fact-verification and long-form tasks is not accompanied by statistical significance tests, confidence intervals, or multiple-run variance; without these, it is impossible to confirm that the gains exceed evaluation noise and are load-bearing for the claim that reflection tokens drive the improvement.
  2. [§4.2 (training data construction), §5.3 (ablations)] The paper lacks a controlled ablation that trains an otherwise identical model on the same volume of data but without the reflection-token prediction objective; the current ablations therefore cannot isolate whether performance stems from the self-critique mechanism or simply from additional supervised fine-tuning, leaving the central assumption about reflection tokens unverified.
  3. [§4.3 (inference procedure)] The description of how reflection tokens are decoded and used to branch behavior (retrieve vs. generate, accept vs. critique) does not specify the exact sampling strategy or temperature schedule applied to the special tokens; this detail is load-bearing for reproducing the claimed adaptive behavior and for assessing whether new failure modes (e.g., erroneous [Relevant] judgments) arise at inference time.
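To make the reproducibility concern about decoding concrete, here is one plausible way a reflection-token branch could be implemented: restrict the softmax to the two special tokens and threshold the resulting probability. This is an assumed construction, not the paper's specified procedure; the token names, logit values, and 0.5 threshold are illustrative.

```python
import math

# Assumed (not paper-specified) decoding of a reflection-token branch:
# a softmax restricted to the two special tokens, then a threshold.

def reflection_decision(logits, positive="[Relevant]",
                        negative="[Irrelevant]", threshold=0.5):
    """Greedy, thresholded branch from logits over a reflection-token pair."""
    z_pos, z_neg = logits[positive], logits[negative]
    # Probability mass renormalized over just the two reflection tokens.
    p_pos = math.exp(z_pos) / (math.exp(z_pos) + math.exp(z_neg))
    return p_pos >= threshold, p_pos

keep, p = reflection_decision({"[Relevant]": 2.0, "[Irrelevant]": 0.5})
# p ≈ 0.82, so this passage would be kept under a 0.5 threshold.
```

Whether the branch uses greedy argmax, a fixed threshold like this, or a weighted score over several critique tokens is exactly the detail the report asks the authors to pin down.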
minor comments (3)
  1. [Figure 2] The framework diagram would be clearer if arrows explicitly labeled the points at which each reflection token is emitted and consumed.
  2. [§3.1] The notation for the joint probability over tokens and reflection tokens could be made more explicit to avoid ambiguity when conditioning on retrieved passages.
  3. [Related Work] A brief comparison to prior controllable-generation methods (e.g., those using special tokens for style or factuality) would help situate the novelty of the reflection-token set.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and suggestions for improving the manuscript. We address each of the major comments below and will make the necessary revisions to enhance the clarity and rigor of our empirical results.

read point-by-point responses
  1. Referee: [§5.1, Table 2] The reported outperformance on fact-verification and long-form tasks is not accompanied by statistical significance tests, confidence intervals, or multiple-run variance; without these, it is impossible to confirm that the gains exceed evaluation noise and are load-bearing for the claim that reflection tokens drive the improvement.

    Authors: We acknowledge the importance of statistical validation. In the revised manuscript, we will report bootstrap confidence intervals for the performance metrics in Table 2 and conduct multiple runs with different random seeds to provide variance estimates. This will allow readers to assess the robustness of the gains attributed to the reflection tokens. revision: yes

  2. Referee: [§4.2 (training data construction), §5.3 (ablations)] The paper lacks a controlled ablation that trains an otherwise identical model on the same volume of data but without the reflection-token prediction objective; the current ablations therefore cannot isolate whether performance stems from the self-critique mechanism or simply from additional supervised fine-tuning, leaving the central assumption about reflection tokens unverified.

    Authors: This point is well-taken and highlights a potential gap in isolating the effect of the reflection tokens. Our existing ablations in §5.3 vary the use of reflection tokens at inference but do not fully control for the training objective. We will add a new ablation experiment training a model on the identical dataset and volume without the reflection token prediction loss, to directly verify the contribution of the self-critique mechanism. revision: yes

  3. Referee: [§4.3 (inference procedure)] The description of how reflection tokens are decoded and used to branch behavior (retrieve vs. generate, accept vs. critique) does not specify the exact sampling strategy or temperature schedule applied to the special tokens; this detail is load-bearing for reproducing the claimed adaptive behavior and for assessing whether new failure modes (e.g., erroneous [Relevant] judgments) arise at inference time.

    Authors: We appreciate this feedback on reproducibility. The reflection tokens are decoded using greedy decoding (temperature 0) to ensure consistent branching decisions, while the subsequent generation steps use the standard sampling parameters described in the paper. We will expand §4.3 to explicitly detail the decoding strategy for reflection tokens, including any thresholds used for decisions like [Relevant] or [Support]. revision: yes
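The bootstrap intervals promised in the first response can be sketched in a few lines. This is generic statistics code, not the authors' evaluation pipeline, and the per-example scores below are invented for illustration.

```python
import random

# Generic percentile-bootstrap sketch for a mean metric. Not the authors'
# evaluation code; the 0/1 per-example scores are invented.

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and collect the resampled means.
    means = sorted(sum(rng.choices(scores, k=n)) / n
                   for _ in range(n_resamples))
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

# e.g. 0/1 accuracy over 200 examples with mean 0.7:
scores = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1] * 20
lo, hi = bootstrap_ci(scores)   # roughly 0.7 ± 0.06
```

Reporting such intervals per system in Table 2 would let readers judge whether the claimed gains clear evaluation noise.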

Circularity Check

0 steps flagged

No circularity: empirical gains measured on held-out benchmarks

full rationale

The paper defines reflection tokens and trains a single LM to emit and condition on them, then reports performance improvements on standard external benchmarks (open-domain QA, reasoning, fact verification, long-form generation). No equation, fitted parameter, or self-citation is invoked in a way that would make a claimed prediction reduce to the training inputs by construction. The central claims rest on end-to-end optimization against task metrics and held-out evaluation, which is independent of the token definitions themselves.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that reflection tokens can be learned and used controllably; no external benchmarks or formal proofs are provided for this assumption.

free parameters (1)
  • reflection token set
    The exact vocabulary and semantics of the special reflection tokens are chosen by the authors.
axioms (1)
  • domain assumption: An arbitrary LM can be fine-tuned to emit and condition on reflection tokens without performance degradation
    Invoked when the framework is described as training a single arbitrary LM.
invented entities (1)
  • reflection tokens (no independent evidence)
    purpose: Enable the model to signal retrieval need, passage relevance, and generation support
    New tokens introduced by the framework; no independent evidence outside the training process is given.

pith-pipeline@v0.9.0 · 5565 in / 1352 out tokens · 61328 ms · 2026-05-12T14:09:42.760870+00:00 · methodology

discussion (0)


Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery

    cs.AI 2026-05 unverdicted novelty 7.0

    HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact dens...

  2. AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    AdaGATE improves evidence F1 scores on HotpotQA for multi-hop RAG under clean, redundant, and noisy conditions by framing selection as gap-aware token-constrained repair, outperforming baselines while using 2.6x fewer tokens.

  3. TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data

    cs.AI 2026-04 unverdicted novelty 7.0

    TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design matter...

  4. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  5. RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow

    cs.SE 2026-04 unverdicted novelty 7.0

    RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.

  6. Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context

    cs.CL 2026-04 unverdicted novelty 7.0

    Quantile tokens inserted into LLM inputs combined with neighbor retrieval enable direct prediction of full distributions, yielding lower MAPE and narrower intervals than baselines on Airbnb and StackSample tasks.

  7. IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...

  8. ROZA Graphs: Self-Improving Near-Deterministic RAG through Evidence-Centric Feedback

    cs.AI 2026-04 unverdicted novelty 7.0

    ROZA graphs enable self-improving RAG by storing evidence-specific reasoning chains, yielding up to 10.6pp accuracy gains and 46% lower cost through graph traversal feedback.

  9. LLMAR: A Tuning-Free Recommendation Framework for Sparse and Text-Rich Industrial Domains

    cs.IR 2026-03 unverdicted novelty 7.0

    LLMAR applies LLM reasoning with a self-correction reflection loop to generate semantic user motives for tuning-free recommendations, showing up to 54.6% nDCG@10 gains on a sparse industrial dataset over trained baselines.

  10. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  11. Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...

  12. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 6.0

    Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.

  13. FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    FT-RAG introduces a fine-grained graph-based retrieval framework for tables plus a new 9870-pair benchmark, reporting 23.5% and 59.2% gains in table- and cell-level hit rates and 62.2% higher exact-value recall over b...

  14. Evolve: A Persistent Knowledge Lifecycle for Small Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    A 2B model using Evolve reaches 60-84% accuracy on 750 queries across specialist questions, NaturalQuestions, and TriviaQA (up from 20-33%) while cutting teacher invocations by over 50% via a consolidated section-base...

  15. The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    HEG-TKG grounds LLM clinical reasoning in hierarchical evidence-based temporal knowledge graphs from 4,512 PubMed records, delivering 100% citation verifiability and error detectability where standard RAG and unprompt...

  16. Preregistered Belief Revision Contracts

    cs.AI 2026-04 unverdicted novelty 6.0

    PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.

  17. Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OKH-RAG represents knowledge as ordered hyperedges and retrieves coherent interaction sequences via a learned transition model, outperforming permutation-invariant RAG baselines on order-sensitive QA tasks.

  18. LLMs Should Express Uncertainty Explicitly

    cs.LG 2026-04 unverdicted novelty 6.0

    Training LLMs to verbalize uncertainty explicitly at the end or during reasoning reduces overconfident errors and improves answer quality on factual tasks while enabling RAG triggers.

  19. LLMs Should Express Uncertainty Explicitly

    cs.LG 2026-04 unverdicted novelty 6.0

    Training LLMs to express uncertainty explicitly via global confidence or local markers enhances calibration and intervention triggers compared to post-hoc estimation.

  20. Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

    cs.CL 2026-04 unverdicted novelty 6.0

    A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.

  21. A-MEM: Agentic Memory for LLM Agents

    cs.CL 2025-02 unverdicted novelty 6.0

    A-MEM is a dynamic memory system for LLM agents that builds and refines an interconnected network of notes with agent-driven linking and evolution, showing performance gains over prior memory methods on six models.

  22. Search-o1: Agentic Search-Enhanced Large Reasoning Models

    cs.AI 2025-01 unverdicted novelty 6.0

    Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...

  23. ChipLingo: A Systematic Training Framework for Large Language Models in EDA

    cs.LG 2026-04 unverdicted novelty 5.0

    ChipLingo trains LLMs on EDA data via corpus construction, domain-adaptive pretraining, and RAG scenario alignment, reaching 59.7% accuracy with an 8B model and 70.02% with a 32B model on a new internal EDA benchmark.

  24. Rethinking the Necessity of Adaptive Retrieval-Augmented Generation through the Lens of Adaptive Listwise Ranking

    cs.IR 2026-04 unverdicted novelty 5.0

    AdaRankLLM shows adaptive listwise reranking outperforms fixed-depth retrieval for most LLMs by acting as a noise filter for weak models and an efficiency optimizer for strong ones, with lower context use.

  25. Lightweight LLM Agent Memory with Small Language Models

    cs.AI 2026-04 unverdicted novelty 5.0

    LightMem uses SLMs to modularize agent memory into STM, MTM, and LTM with two-stage vector-plus-semantic retrieval online and incremental consolidation offline, reporting 2.5 F1 gains and low latency over A-MEM on LoCoMo.

  26. DTCRS: Dynamic Tree Construction for Recursive Summarization

    cs.CL 2026-04 unverdicted novelty 5.0

    DTCRS dynamically builds summary trees only for suitable question types by using sub-question embeddings as cluster centers, cutting construction time while improving QA on three tasks.

  27. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  28. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 4.0

    Experience-RAG Skill is a reusable agent skill that selects retrieval strategies via experience memory, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed retriever baselines.

  29. MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction

    cs.CY 2026-04 unverdicted novelty 4.0

    MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.

  30. Mitigating Hallucination on Hallucination in RAG via Ensemble Voting

    cs.CL 2026-03 unverdicted novelty 4.0

    VOTE-RAG applies retrieval voting across diverse queries and response voting across independent generations to mitigate hallucination-on-hallucination in RAG, matching or exceeding complex baselines on six benchmarks ...

  31. Toward Agentic RAG for Ukrainian

    cs.AI 2026-04 unverdicted novelty 3.0

    Agentic RAG for Ukrainian improves answer accuracy via retries but is still limited by document and page retrieval quality.

  32. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Reference graph

Works this paper leans on

154 extracted references · 154 canonical work pages · cited by 30 Pith papers · 13 internal anchors

  1. [1]

    Learning to retrieve reasoning paths over wikipedia graph for question answering

    Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. Learning to retrieve reasoning paths over wikipedia graph for question answering. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJgVHkrYDH

  2. [2]

    Retrieval-based language models and applications

    Akari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen. Retrieval-based language models and applications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Tutorial), 2023a. URL https://aclanthology.org/2023.acl-tutorials.6

  3. [3]

    Task-aware retrieval with instructions

    Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, and Wen-tau Yih. Task-aware retrieval with instructions. In Findings of the Association for Computational Linguistics, 2023b. URL https://aclanthology.org/2023.findings-acl.225

  4. [7]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=H4DqfPSibmx

  5. [8]

    Chain-of-verification reduces hallucination in large language models

    Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495, 2023. URL https://arxiv.org/abs/2309.11495

  6. [9]

    Wizard of wikipedia: Knowledge-powered conversational agents

    Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. Wizard of wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1l73iRqKm

  7. [12]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International Conference on Machine Learning, 2020. URL https://dl.acm.org/doi/pdf/10.5555/3524938.3525306

  8. [13]

    Unsupervised dense information retrieval with contrastive learning

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research, 2022a. URL https://openreview.net/forum?id=jKN1pXi7b0

  9. [16]

    TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. URL https://aclanthology.org/P17-1147

  10. [18]

    Pretraining language models with human preferences

    Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. In International Conference on Machine Learning, 2023. URL https://openreview.net/forum?id=AT8Iw8KOeC

  11. [19]

    Natural questions: A benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

  12. [20]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. URL https://arxiv.org/abs/2309.06180

  13. [21]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, 2020. URL https://proceedings.neurips.c...

  14. [22]

    Ra-dit: Retrieval-augmented dual instruction tuning

    Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, and Scott Yih. Ra-dit: Retrieval-augmented dual instruction tuning, 2023. URL https://arxiv.org/abs/2310.01352

  15. [25]

    QUARK: Controllable text generation with reinforced unlearning

    Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. QUARK: Controllable text generation with reinforced unlearning. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=5HaIds3ux5O

  16. [27]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023. URL http...

  17. [28]

    When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. URL https://aclanthology.org/2023....

  18. [30]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018. URL https://aclanthology.org/D18-1260

  19. [31]

    A discrete hard EM approach for weakly supervised question answering

    Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. A discrete hard EM approach for weakly supervised question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. URL https://aclanthology.org/D19-1284

  20. [34]

    Large dual encoders are generalizable retrievers

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. URL https://aclanthology.org/2022.emnlp-main.669

  21. [36]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback...

  22. [37]

    Refiner: Reasoning feedback on intermediate representations

    Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. Refiner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904, 2023. URL https://arxiv.org/abs/2304.01904

  23. [38]

    KILT: a benchmark for knowledge intensive language tasks

    Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Ass...

  24. [39]

    MAUVE: Measuring the gap between neural text and human text using divergence frontiers

    Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Tqx7nJp7PR

  25. [40]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020. URL https://dl.acm.org/doi/10.5555/3433701.3433727

  26. [41]

    In-context retrieval-augmented language models

    Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 2023. URL https://arxiv.org/abs/2302.00083

  27. [42]

    Multitask prompted training enables zero-shot task generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng S...

  28. [45]

    Large language models can be easily distracted by irrelevant context

    Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, 2023. URL https://proceedings.mlr.press/v202/shi23a.html

  29. [46]

    ASQA: Factoid questions meet long-form answers

    Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. URL https://aclanthology.org/2022.emnlp-main.566

  30. [47]

    FEVER: a large-scale dataset for fact extraction and VERification

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018. URL https://aclanthology.org/N18-1074

  31. [50]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR

  32. [53]

    Recomp: Improving retrieval-augmented lms with compression and selective augmentation

    Fangyuan Xu, Weijia Shi, and Eunsol Choi. Recomp: Improving retrieval-augmented lms with compression and selective augmentation, 2023. URL https://arxiv.org/abs/2310.04408

  33. [57]

    Language agent tree search unifies reasoning acting and planning in language models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models, 2023. URL https://arxiv.org/abs/2310.04406

  34. [59]

    Scaling learning algorithms towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, 2007.

  35. [60]

    A Fast Learning Algorithm for Deep Belief Nets

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 2006.

  36. [61]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

  37. [62]

    Language models are few-shot learners

    Tom Brown et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.

  38. [63]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell et al. A General Language Assistant as a Laboratory for Alignment. arXiv preprint arXiv:2112.00861, 2021.

  39. [64]

    BBQ: A hand-built bias benchmark for question answering

    Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, 2022. doi:10.18653/v1/2022.findings-acl.165

  40. [65]

    Counterfactual memorization in neural language models

    Counterfactual memorization in neural language models. arXiv preprint arXiv:2112.12938, 2021.

  41. [66]

    Measuring forgetting of memorized training examples

    Measuring forgetting of memorized training examples. arXiv preprint arXiv:2207.00099, 2022.

  42. [67]

    Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models

    Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models. arXiv preprint arXiv:2205.10770, 2022.

  43. [68]

    Understanding Transformer Memorization Recall Through Idioms

    Understanding Transformer Memorization Recall Through Idioms. arXiv preprint arXiv:2210.03588, 2022.

  44. [69]

    Data Contamination: From Memorization to Exploitation

    Data Contamination: From Memorization to Exploitation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2022.

  45. [70]

    Impact of Pretraining Term Frequencies on Few-Shot Reasoning

    Impact of Pretraining Term Frequencies on Few-Shot Reasoning. arXiv preprint, 2022.

  46. [71]

    Quantifying Memorization Across Neural Language Models

    Quantifying Memorization Across Neural Language Models. arXiv preprint, 2022.

  47. [72]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, and others. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022.

  48. [73]

    KILT: a Benchmark for Knowledge Intensive Language Tasks

    Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a Benchmark for Knowledge Intensive Language Tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, 2021.

  49. [74]

    Generate rather than retrieve: Large language models are strong context generators

    Generate rather than retrieve: Large language models are strong context generators. arXiv preprint arXiv:2209.10063, 2022.

  50. [75]

    Large Language Models Struggle to Learn Long-Tail Knowledge

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large Language Models Struggle to Learn Long-Tail Knowledge. 2022. doi:10.48550/ARXIV.2211.08411

  51. [76]

    Retrieval augmentation reduces hallucination in conversation

    Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, 2021.

  52. [77]

    Large pre-trained language models contain human-like biases of what is right and wrong to do

    Large pre-trained language models contain human-like biases of what is right and wrong to do. Nature Machine Intelligence, 2022.

  53. [78]

    Question and answer test-train overlap in open-domain question answering datasets

    Question and answer test-train overlap in open-domain question answering datasets. arXiv preprint arXiv:2008.02637, 2020.

  54. [79]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019.

  55. [80]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Multitask Prompted Training Enables Zero-Shot Task Generalization. In International Conference on Learning Representations, 2022.

  56. [81]

    OPT: Open Pre-trained Transformer Language Models

    OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068, 2022.

  57. [82]

    Natural Questions: A Benchmark for Question Answering Research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 2019.

  58. [83]

    Improving Language Models by Retrieving from Trillions of Tokens

    Improving Language Models by Retrieving from Trillions of Tokens. In Proceedings of the 39th International Conference on Machine Learning, 2022.

  59. [84]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, 2020.

  60. [85]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.

  61. [86]

    Measuring and Narrowing the Compositionality Gap in Language Models

    Measuring and Narrowing the Compositionality Gap in Language Models. arXiv preprint arXiv:2210.03350, 2022.

  62. [87]

    Simple entity-centric questions challenge dense retrievers

    Simple entity-centric questions challenge dense retrievers. arXiv preprint arXiv:2109.08535, 2021.

  63. [88]

    Language models as knowledge bases?

    Fabio Petroni et al. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.

  64. [89]

    How Much Knowledge Can You Pack Into the Parameters of a Language Model?

    Adam Roberts, Colin Raffel, and Noam Shazeer. How Much Knowledge Can You Pack Into the Parameters of a Language Model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.

  65. [90]

    Large Language Models Struggle to Learn Long-Tail Knowledge

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large Language Models Struggle to Learn Long-Tail Knowledge. arXiv preprint arXiv:2211.08411, 2022.

  66. [91]

    GPT-NeoX-20B: An open-source autoregressive language model

    GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.

  67. [92]

    E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT

    Nina Poerner, Ulli Waltinger, and Hinrich Schütze. E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, 2020. doi:10.18653/v1/2020.findings-emnlp.71

  68. [93]

    Unsupervised Dense Information Retrieval with Contrastive Learning

    Gautier Izacard et al. Unsupervised Dense Information Retrieval with Contrastive Learning. Transactions on Machine Learning Research, 2022.

  69. [94]

    The probabilistic relevance framework: BM25 and beyond

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 2009.

  70. [95]

    Few-shot learning with retrieval augmented language models

    Gautier Izacard et al. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022.

  71. [96]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Retrieval augmented language model pre-training. In International Conference on Machine Learning, 2020.

  72. [97]

    Mention Memory: incorporating textual knowledge into Transformers through entity mention attention

    Mention Memory: incorporating textual knowledge into Transformers through entity mention attention. In International Conference on Learning Representations, 2022.

  73. [98]

    Entities as Experts: Sparse Memory Access with Entity Supervision

    Thibault Févry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski. Entities as Experts: Sparse Memory Access with Entity Supervision. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.

  74. [99]

    Generalization through Memorization: Nearest Neighbor Language Models

    Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations, 2020.

  75. [100]

    Training Language Models with Memory Augmentation

    Zexuan Zhong, Tao Lei, and Danqi Chen. Training Language Models with Memory Augmentation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.

  76. [101]

    Task-aware Retrieval with Instructions

    Akari Asai et al. Task-aware Retrieval with Instructions. arXiv preprint arXiv:2211.09260, 2022.

  77. [102]

    Nearest Neighbor Zero-Shot Inference

    Nearest Neighbor Zero-Shot Inference. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.

  78. [103]

    Controllable Semantic Parsing via Retrieval Augmentation

    Panupong Pasupat, Yuan Zhang, and Kelvin Guu. Controllable Semantic Parsing via Retrieval Augmentation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.

  79. [104]

    One question answering model for many languages with cross-lingual dense passage retrieval

    One question answering model for many languages with cross-lingual dense passage retrieval. In Advances in Neural Information Processing Systems, 2021.

  80. [105]

    MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text

    MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.

Showing first 80 references.