arxiv: 2401.15884 · v3 · submitted 2024-01-29 · 💻 cs.CL

Recognition: no theorem link

Corrective Retrieval Augmented Generation

Shi-Qi Yan , Jia-Chen Gu , Yun Zhu , Zhen-Hua Ling

Authors on Pith no claims yet

Pith reviewed 2026-05-12 11:14 UTC · model grok-4.3

classification 💻 cs.CL

keywords Corrective Retrieval Augmented GenerationRAG robustnessRetrieval evaluationWeb search augmentationHallucination reductionDecompose-recomposeLarge language modelsGeneration tasks

0 comments

The pith

CRAG makes retrieval-augmented generation more robust by evaluating document quality and using web searches to correct deficiencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Corrective Retrieval Augmented Generation, or CRAG, as a way to reduce hallucinations in large language models that rely on retrieved documents. A lightweight evaluator checks how good the retrieved documents are for answering a query and assigns a confidence score. This score determines whether to stick with the documents, use additional web searches, or apply other actions. A decompose-then-recompose step then pulls out only the useful information from the documents while discarding the rest. Because it works with existing RAG systems, it offers a practical way to improve generation quality on both short answers and longer texts.

Core claim

CRAG improves the robustness of generation in retrieval-augmented systems through a retrieval evaluator that assesses document quality and triggers appropriate actions, including large-scale web searches to augment limited corpora, along with a decompose-then-recompose algorithm that selectively focuses on key information and filters irrelevant content.

What carries the argument

The lightweight retrieval evaluator that returns a confidence degree for retrieved documents, which triggers different knowledge retrieval actions including web search augmentation, together with the decompose-then-recompose algorithm.

If this is right

CRAG can be added to various existing RAG-based approaches as a plug-and-play component.
Large-scale web searches serve as a reliable extension when static document retrieval is sub-optimal.
The decompose-then-recompose algorithm allows models to focus on relevant parts of documents and ignore noise.
Performance gains appear across short-form and long-form generation tasks on multiple datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Accurate quality evaluation could allow RAG systems to handle a wider range of queries without needing perfect initial retrieval.
Combining static and dynamic web sources might require additional safeguards against misinformation from the web.
Testing CRAG on specialized domains could reveal whether the evaluator generalizes beyond general web content.
The approach suggests a path toward more adaptive retrieval that responds to the specific needs of each query.

Load-bearing premise

The lightweight retrieval evaluator must accurately judge the overall quality of documents for any query, and web searches must add useful information without introducing new errors or noise.

What would settle it

Running the experiments on the four datasets but observing no significant improvement or even worse results when applying CRAG compared to baseline RAG methods would falsify the central claim.

read the original abstract

Large language models (LLMs) inevitably exhibit hallucinations since the accuracy of generated texts cannot be secured solely by the parametric knowledge they encapsulate. Although retrieval-augmented generation (RAG) is a practicable complement to LLMs, it relies heavily on the relevance of retrieved documents, raising concerns about how the model behaves if retrieval goes wrong. To this end, we propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation. Specifically, a lightweight retrieval evaluator is designed to assess the overall quality of retrieved documents for a query, returning a confidence degree based on which different knowledge retrieval actions can be triggered. Since retrieval from static and limited corpora can only return sub-optimal documents, large-scale web searches are utilized as an extension for augmenting the retrieval results. Besides, a decompose-then-recompose algorithm is designed for retrieved documents to selectively focus on key information and filter out irrelevant information in them. CRAG is plug-and-play and can be seamlessly coupled with various RAG-based approaches. Experiments on four datasets covering short- and long-form generation tasks show that CRAG can significantly improve the performance of RAG-based approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CRAG adds a practical evaluator to trigger web fixes and document filtering in RAG, but without ablations or evaluator validation the source of any gains stays unclear.

read the letter

CRAG adds a retrieval evaluator to decide on corrective actions like web search and document decomposition when standard retrieval looks weak. That addresses a real problem in RAG systems, but the paper doesn't show that the evaluator actually works well enough to explain the gains. What stands out is the plug-and-play design: a fine-tuned T5 model scores documents as Correct, Incorrect, or Ambiguous, then triggers web augmentation for bad cases and a decompose-recompose step to pull out key info. This extends basic RAG without changing the core generator much. The experiments on four datasets for short and long generation tasks claim big improvements, which is promising if they hold up. The soft spot is the missing validation. There's no precision or recall for the evaluator on held-out data, and no ablation that turns off the corrective logic or replaces it with random triggers. Without that, the web search and recomposition could be doing all the heavy lifting. The assumption that web results add clean signal also goes untested. The abstract mentions significant gains but skips baselines, metrics details, and error analysis, so the central claim lacks support from what's visible. This paper is for researchers building knowledge-intensive LLM apps who want to make retrieval more robust. A reader working on RAG variants would find the framework useful to try, even if the numbers need scrutiny. It deserves a serious referee because the idea is practical and the problem is important, though the current write-up needs more experimental rigor to land. I'd recommend sending it out for review so the authors can add the missing ablations and evaluator metrics.

Referee Report

3 major / 2 minor

Summary. The paper proposes Corrective Retrieval Augmented Generation (CRAG) as a plug-and-play enhancement to retrieval-augmented generation (RAG) for LLMs. It introduces a lightweight retrieval evaluator (fine-tuned T5) that classifies retrieved documents as Correct/Incorrect/Ambiguous to trigger corrective actions, augments retrieval via large-scale web search when needed, and applies a decompose-then-recompose algorithm to filter key information from documents. Experiments on four datasets for short- and long-form generation tasks claim that CRAG significantly improves performance of RAG-based approaches.

Significance. If the empirical claims hold after proper validation, CRAG would address a core limitation of RAG by making retrieval robust to failures through dynamic correction and external augmentation. The modular, plug-and-play design is a practical strength that could see adoption in knowledge-intensive LLM applications. The absence of evaluator validation and ablations, however, currently prevents assessing whether the gains are attributable to the proposed corrective logic.

major comments (3)

[Section 3.2] Section 3.2: The retrieval evaluator is presented as a fine-tuned T5 model that outputs Correct/Incorrect/Ambiguous scores to trigger the three retrieval actions, yet no precision, recall, confusion matrix, or calibration results are reported on any held-out query-document set. This is load-bearing for the central claim, as performance lifts on the four datasets could arise entirely from the always-on web-search augmentation or the decompose-recompose step rather than the corrective triggering logic.
[§4] Experiments section (§4): No ablation studies isolate the evaluator's contribution; for example, there are no runs that disable the evaluator, replace it with random triggering, or remove the web-search component. Without these, it is impossible to determine whether the reported improvements stem from the corrective mechanism or from the auxiliary retrieval and recomposition modules alone.
[Abstract and §4] Abstract and §4: The claim that CRAG 'can significantly improve the performance of RAG-based approaches' on four datasets is unsupported by any reported baselines, exact metrics (e.g., EM, ROUGE, F1), statistical significance tests, or error analysis. This omission makes the central empirical claim unverifiable from the manuscript.

minor comments (2)

[Abstract] The abstract would be clearer if it briefly named the four datasets and the primary evaluation metrics used.
[§3.3] The decompose-then-recompose procedure in §3.3 could be presented with pseudocode or a small worked example to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments correctly identify gaps in the validation of the retrieval evaluator and the experimental reporting. We will revise the manuscript to incorporate the requested analyses, which will strengthen the support for CRAG's corrective mechanism.

read point-by-point responses

Referee: [Section 3.2] Section 3.2: The retrieval evaluator is presented as a fine-tuned T5 model that outputs Correct/Incorrect/Ambiguous scores to trigger the three retrieval actions, yet no precision, recall, confusion matrix, or calibration results are reported on any held-out query-document set. This is load-bearing for the central claim, as performance lifts on the four datasets could arise entirely from the always-on web-search augmentation or the decompose-recompose step rather than the corrective triggering logic.

Authors: We agree that the evaluator's accuracy is central to attributing gains to the corrective logic rather than the auxiliary components. The original manuscript prioritized end-to-end results, but we will add a new evaluation subsection reporting precision, recall, F1, confusion matrix, and calibration results for the fine-tuned T5 model on a held-out query-document set. This will be placed in Section 3.2 of the revised version. revision: yes
Referee: [§4] Experiments section (§4): No ablation studies isolate the evaluator's contribution; for example, there are no runs that disable the evaluator, replace it with random triggering, or remove the web-search component. Without these, it is impossible to determine whether the reported improvements stem from the corrective mechanism or from the auxiliary retrieval and recomposition modules alone.

Authors: We concur that ablations are required to isolate the evaluator's role. In the revised manuscript we will report new runs that (1) disable the evaluator and always trigger web augmentation, (2) replace the evaluator with random action triggering, and (3) remove the web-search component while retaining the evaluator and decompose-recompose filter. These results will be added to Section 4. revision: yes
Referee: [Abstract and §4] Abstract and §4: The claim that CRAG 'can significantly improve the performance of RAG-based approaches' on four datasets is unsupported by any reported baselines, exact metrics (e.g., EM, ROUGE, F1), statistical significance tests, or error analysis. This omission makes the central empirical claim unverifiable from the manuscript.

Authors: We acknowledge that the experimental section would benefit from greater transparency. The manuscript already compares CRAG against standard RAG baselines on the four datasets and reports task-appropriate metrics (EM and F1 for short-form, ROUGE for long-form). We will expand Section 4 to include statistical significance tests (paired t-tests) and a detailed error analysis. The abstract will be updated to align precisely with the strengthened empirical evidence. revision: partial

Circularity Check

0 steps flagged

No circularity: procedural method with empirical validation only

full rationale

The paper proposes CRAG as a plug-and-play procedural pipeline (lightweight evaluator triggering web search and decompose-recompose) without any equations, first-principles derivations, or fitted parameters. Central claims rest solely on experiments across four datasets rather than any self-referential logic. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the described method. The derivation chain is self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, mathematical axioms, or newly invented entities are described; the approach relies on standard machine-learning components whose details are not provided.

pith-pipeline@v0.9.0 · 5497 in / 1078 out tokens · 59577 ms · 2026-05-12T11:14:07.535628+00:00 · methodology

discussion (0)

Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict
cs.CL 2026-05 unverdicted novelty 7.0

Context-Driven Decomposition (CDD) measures context compliance in RAG under knowledge conflicts and improves accuracy on adversarial benchmarks like TruthfulQA misconception injection and Epi-Scale tests across models.
EvolveMem:Self-Evolving Memory Architecture via AutoResearch for LLM Agents
cs.LG 2026-05 unverdicted novelty 7.0

EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-ben...
Route Before Retrieve: Activating Latent Routing Abilities of LLMs for RAG vs. Long-Context Selection
cs.CL 2026-05 unverdicted novelty 7.0

Pre-Route elicits LLMs' latent routing skills via structured prompts on metadata to proactively choose RAG or long-context, outperforming reactive baselines on cost-effectiveness.
The Context Gathering Decision Process: A POMDP Framework for Agentic Search
cs.AI 2026-05 accept novelty 7.0

Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no perf...
SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
cs.CL 2026-05 unverdicted novelty 7.0

SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
cs.MA 2026-05 unverdicted novelty 7.0

MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation
cs.CL 2026-05 unverdicted novelty 7.0

AdaGATE improves evidence F1 scores on HotpotQA for multi-hop RAG under clean, redundant, and noisy conditions by framing selection as gap-aware token-constrained repair, outperforming baselines while using 2.6x fewer tokens.
HaS: Accelerating RAG through Homology-Aware Speculative Retrieval
cs.IR 2026-04 unverdicted novelty 7.0

HaS accelerates RAG retrieval via homology-aware speculative retrieval and homologous query re-identification validation, cutting latency 24-37% with 1-2% accuracy drop on tested datasets.
ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation
cs.CL 2026-04 unverdicted novelty 7.0

ArbGraph resolves conflicts in RAG evidence by constructing a conflict-aware graph of atomic claims and applying intensity-driven iterative arbitration to suppress unreliable claims prior to generation.
Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing
cs.CL 2026-04 unverdicted novelty 7.0

Skill-RAG detects retrieval failure states from hidden representations and routes to one of four corrective skills to raise accuracy on persistent hard cases in open-domain QA and reasoning benchmarks.
IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
cs.AI 2026-04 unverdicted novelty 7.0

IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies
cs.AI 2026-04 unverdicted novelty 7.0

Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
REGREACT: Self-Correcting Multi-Agent Pipelines for Structured Regulatory Information Extraction
cs.MA 2026-04 unverdicted novelty 7.0

RegReAct deploys self-correcting multi-agent pipelines across seven stages to extract hierarchical compliance criteria from regulatory texts, outperforming single-pass GPT-4o on EU Taxonomy documents.
Retrieval Augmented Conversational Recommendation with Reinforcement Learning
cs.IR 2026-04 unverdicted novelty 7.0

RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

PiCA improves RL for LLM search agents by defining process rewards around pivot steps that act as information peaks boosting final answer success probability via potential-based shaping.
Agentic Retrieval-Augmented Generation for Financial Document Question Answering
cs.AI 2026-05 unverdicted novelty 6.0

FinAgent-RAG achieves 76.81-78.46% execution accuracy on financial QA benchmarks by combining contrastive retrieval, program-of-thought code generation, and adaptive strategy routing, outperforming baselines by 5.62-9...
CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation
cs.CL 2026-05 unverdicted novelty 6.0

CAR reranks documents in RAG by promoting those that increase generator confidence (via answer consistency sampling) and demoting those that decrease it, yielding NDCG@5 gains on BEIR datasets that correlate with F1 i...
An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration
cs.AI 2026-05 unverdicted novelty 6.0

Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.
EviMem: Evidence-Gap-Driven Iterative Retrieval for Long-Term Conversational Memory
cs.CV 2026-04 unverdicted novelty 6.0

EviMem improves accuracy on temporal and multi-hop questions in long-term conversational memory by iteratively diagnosing and filling evidence gaps, achieving 81.6% and 85.2% judge accuracy on LoCoMo at 4.5x lower lat...
Faithfulness-QA: A Counterfactual Entity Substitution Dataset for Training Context-Faithful RAG Models
cs.CL 2026-04 accept novelty 6.0

Faithfulness-QA is a 99k-sample dataset created via counterfactual entity substitution on existing QA benchmarks to train and evaluate context-faithful RAG models.
Agentic Multi-Source Grounding for Enhanced Query Intent Understanding: A DoorDash Case Study
cs.AI 2026-03 unverdicted novelty 6.0

An agentic multi-source grounding system for marketplace query intent achieves 90.7% accuracy on long-tail queries at DoorDash by combining catalog grounding, web search, and deterministic disambiguation, outperformin...
Route Before Retrieve: Activating Latent Routing Abilities of LLMs for RAG vs. Long-Context Selection
cs.CL 2026-05 unverdicted novelty 5.0

Pre-Route elicits LLMs' latent routing ability via proactive structured reasoning on metadata to choose between RAG and long-context strategies, outperforming reactive baselines on cost-effectiveness.
AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
cs.AI 2026-05 unverdicted novelty 5.0

AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.
Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
cs.AI 2026-05 unverdicted novelty 5.0

SCM-GRPO grounds multi-hop fact verification in structural causal models and applies GRPO reinforcement learning to optimize reasoning chain length, outperforming baselines on HoVer and EX-FEVER.
FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification
cs.AI 2026-04 unverdicted novelty 5.0

FinGround reduces financial hallucinations by 68% over baselines in retrieval-equalized tests through atomic claim verification and grounding, with an 8B model retaining 91.4% F1 at low cost.
A Control Architecture for Training-Free Memory Use
cs.AI 2026-04 unverdicted novelty 5.0

A training-free control architecture with uncertainty-based routing, confidence-selective acceptance, and evidence-based memory governance improves arithmetic reasoning by +7 points on SVAMP and ASDiv benchmarks.
Adaptive Query Routing: A Tier-Based Framework for Hybrid Retrieval Across Financial, Legal, and Medical Documents
cs.IR 2026-04 conditional novelty 5.0

Tree reasoning outperforms vector search on complex document queries but a hybrid approach balances results across tiers, with validation showing an 11.7-point gap on real finance documents.
Retrieval-Augmented Generation for AI-Generated Content: A Survey
cs.CV 2024-02 accept novelty 5.0

A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.
Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
cs.AI 2026-05 unverdicted novelty 4.0

The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using esta...
An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration
cs.AI 2026-05 unverdicted novelty 4.0

Experience-RAG Skill is a reusable agent skill that selects retrieval strategies via experience memory, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed retriever baselines.
Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
cs.AI 2026-05 unverdicted novelty 4.0

The SCM-GRPO framework models multi-hop fact verification as causal inference and applies reinforcement learning to optimize reasoning depth, reporting outperformance on HoVer and EX-FEVER.
Adaptive ToR: Complexity-Aware Tree-Based Retrieval for Pareto-Optimal Multi-Intent NLU
cs.AI 2026-04 unverdicted novelty 4.0

Adaptive ToR uses a query complexity classifier to route multi-intent queries to either fast single-step or deeper hierarchical retrieval, improving accuracy by 9.7% and cutting latency by 37.6% on NLU benchmarks.
STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation
cs.CL 2026-04 unverdicted novelty 4.0

STEM is a new framework for multi-hop KGQA that projects queries to adaptive schema graphs and uses Triple-GNN guidance to retrieve more accurate and complete evidence subgraphs, claiming SOTA results.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 30 Pith papers · 6 internal anchors

[1]

PaLM 2 Technical Report

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier - Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, et al. 2023. https://doi.org/10.48550/AR...

work page internal anchor Pith review doi:10.48550/arxiv.2305.10403 2023
[2]

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. https://openreview.net/forum?id=hSyW5go0v8 Self-rag: Learning to retrieve, generate, and critique through self-reflection . In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net

work page 2024
[3]

Do, Yan Xu, and Pascale Fung

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. https://doi.org/10.18653/V1/2023.IJCNLP-MAIN.45 A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity . pages 675--718

work page doi:10.18653/v1/2023.ijcnlp-main.45 2023
[4]

Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, and Peter Clark. 2021. http://arxiv.org/abs/2102.03315 Think you have solved direct-answer question answering? try arc-da, the direct-answer AI2 reasoning challenge . CoRR, abs/2102.03315

work page arXiv 2021
[5]

Tom B Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language models are few-shot learners. In Advances in neural information processing systems, pages 1877--1901

work page 2020
[6]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

work page 2023
[7]

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2024. https://doi.org/10.18653/V1/2024.FINDINGS-ACL.212 Chain-of-verification reduces hallucination in large language models . pages 3563--3578

work page doi:10.18653/v1/2024.findings-acl.212 2024
[8]

Alpacafarm: A simulation framework for methods that learn from human feedback

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. https://doi.org/10.48550/ARXIV.2305.14387 Alpacafarm: A simulation framework for methods that learn from human feedback . CoRR, abs/2305.14387

work page doi:10.48550/arxiv.2305.14387 2023
[9]

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming - Wei Chang. 2020. http://proceedings.mlr.press/v119/guu20a.html Retrieval augmented language model pre-training . In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event , volume 119 of Proceedings of Machine Learning Research, pages 3...

work page 2020
[10]

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022. https://openreview.net/forum?id=jKN1pXi7b0 Unsupervised dense information retrieval with contrastive learning . Trans. Mach. Learn. Res., 2022

work page 2022
[11]

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2023. https://doi.org/10.1145/3571730 Survey of hallucination in natural language generation . ACM Comput. Surv. , 55(12):248:1--248:38

work page doi:10.1145/3571730 2023
[12]

Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi - Yu, Yiming Yang, Jamie Callan, and Graham Neubig

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi - Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. https://aclanthology.org/2023.emnlp-main.495 Active retrieval augmented generation . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023 , pag...

work page 2023
[13]

Jaehyung Kim, Jaehyun Nam, Sangwoo Mo, Jongjin Park, Sang - Woo Lee, Minjoon Seo, Jung - Woo Ha, and Jinwoo Shin. 2024. https://openreview.net/forum?id=w4DW6qkRmt Sure: Summarizing retrievals using answer candidates for open-domain QA of llms . In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 ...

work page 2024
[14]

Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022. https://doi.org/10.18653/V1/2022.ACL-LONG.579 Internet-augmented dialogue generation . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022 , pages 8460--8478. Association for Computational Linguistics

work page doi:10.18653/v1/2022.acl-long.579 2022
[15]

u ttler, Mike Lewis, Wen - tau Yih, Tim Rockt \

Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \" u ttler, Mike Lewis, Wen - tau Yih, Tim Rockt \" a schel, Sebastian Riedel, and Douwe Kiela. 2020. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html Retrieval-augmented generation for knowledge-inte...

work page 2020
[16]

Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and Lemao Liu. 2022. http://arxiv.org/abs/2202.01110 A survey on retrieval-augmented text generation . CoRR, abs/2202.01110

work page arXiv 2022
[17]

Yanming Liu, Xinyue Peng, Xuhong Zhang, Weihao Liu, Jianwei Yin, Jiannan Cao, and Tianyu Du. 2024. https://doi.org/10.18653/V1/2024.FINDINGS-ACL.281 RA-ISF: learning to answer and understand from retrieval augmentation via iterative self-feedback . In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meetin...

work page doi:10.18653/v1/2024.findings-acl.281 2024
[18]

Hongyin Luo, Tianhua Zhang, Yung - Sung Chuang, Yuan Gong, Yoon Kim, Xixin Wu, Helen Meng, and James R. Glass. 2023. https://aclanthology.org/2023.findings-emnlp.242 Search augmented instruction learning . In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023 , pages 3717--3729. Association for Computatio...

work page 2023
[19]

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. https://doi.org/10.18653/V1/2023.ACL-LONG.546 When not to trust language models: Investigating effectiveness of parametric and non-parametric memories . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: L...

work page doi:10.18653/v1/2023.acl-long.546 2023
[20]

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen - tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. https://aclanthology.org/2023.emnlp-main.741 Factscore: Fine-grained atomic evaluation of factual precision in long form text generation . In Proceedings of the 2023 Conference on Empirical Methods in Natural Langua...

work page 2023
[21]

Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton - Brown, Amnon Shashua, and Yoav Shoham. 2023. https://doi.org/10.48550/ARXIV.2307.06908 Generating benchmarks for factuality evaluation of language models . CoRR, abs/2307.06908

work page doi:10.48550/arxiv.2307.06908 2023
[22]

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. https://doi.org/10.1145/3458817.3476209 Efficient large-scale language model training on GPU clusters using megatron-lm . In Internationa...

work page doi:10.1145/3458817.3476209 2021
[23]

OpenAI. 2023. https://doi.org/10.48550/arXiv.2303.08774 GPT-4 technical report . CoRR, abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023
[24]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. http://papers.nips.cc/paper\_files/paper/2022/hash/b1efd...

work page 2022
[25]

Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick S. H. Lewis, Barlas Oguz, Edouard Grave, Wen - tau Yih, and Sebastian Riedel. 2021. http://arxiv.org/abs/2112.09924 The web is your oyster - knowledge-intensive NLP against a very large web corpus . CoRR, abs/2112.09924

work page arXiv 2021
[26]

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. https://doi.org/10.18653/V1/2023.EMNLP-MAIN.85 Is chatgpt a general-purpose natural language processing task solver? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023 , pages 1...

work page doi:10.18653/v1/2023.emnlp-main.85 2023
[27]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. http://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer . J. Mach. Learn. Res., 21:140:1--140:67

work page 2020
[28]

Rashad Al Hasan Rony, Ricardo Usbeck, and Jens Lehmann

Md. Rashad Al Hasan Rony, Ricardo Usbeck, and Jens Lehmann. 2022. https://doi.org/10.18653/V1/2022.FINDINGS-NAACL.195 Dialokg: Knowledge-structure aware task-oriented dialogue generation . In Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, United States, July 10-15, 2022 , pages 2557--2571. Association for Computational...

work page doi:10.18653/v1/2022.findings-naacl.195 2022
[29]

Timo Schick, Jane Dwivedi - Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. http://papers.nips.cc/paper\_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html Toolformer: Language models can teach themselves to use tools

work page 2023
[30]

Chi, Nathanael Sch\" a rli, and Denny Zhou

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Sch\" a rli, and Denny Zhou. 2023. https://proceedings.mlr.press/v202/shi23a.html Large language models can be easily distracted by irrelevant context . In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learn...

work page 2023
[31]

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. https://doi.org/10.18653/V1/2021.FINDINGS-EMNLP.320 Retrieval augmentation reduces hallucination in conversation . In Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021 , pages 3784--3803. Ass...

work page doi:10.18653/v1/2021.findings-emnlp.320 2021
[32]

Chao - Hong Tan, Jia - Chen Gu, Chongyang Tao, Zhen - Hua Ling, Can Xu, Huang Hu, Xiubo Geng, and Daxin Jiang. 2022. https://doi.org/10.18653/V1/2022.FINDINGS-ACL.125 Tegtok: Augmenting text generation via task-specific and open-world knowledge . In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022 , pag...

work page doi:10.18653/v1/2022.findings-acl.125 2022
[33]

S. M. Towhidul Islam Tonmoy, S. M. Mehedi Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das. 2024. https://doi.org/10.48550/ARXIV.2401.01313 A comprehensive survey of hallucination mitigation techniques in large language models . CoRR, abs/2401.01313

work page internal anchor Pith review doi:10.48550/arxiv.2401.01313 2024
[34]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie - Anne Lachaux, Timoth \' e e Lacroix, Baptiste Rozi \` e re, Naman Goyal, Eric Hambro, Faisal Azhar, Aur \' e lien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023 a . https://doi.org/10.48550/ARXIV.2302.13971 Llama: Open and efficient foundation language models . Co...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.13971 2023
[35]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, et al. 2023 b . https://doi.org/10.48550/ARXIV.2307.09288 Llama 2: Open foundation and fine-tuned chat models . CoRR, abs/2307.09288

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09288 2023
[36]

Zihao Wang, Anji Liu, Haowei Lin, Jiaqi Li, Xiaojian Ma, and Yitao Liang. 2024. https://doi.org/10.48550/ARXIV.2403.05313 RAT: retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation . CoRR, abs/2403.05313

work page doi:10.48550/arxiv.2403.05313 2024
[37]

Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant. 2024. https://openreview.net/forum?id=ZS4m74kZpH Making retrieval-augmented language models robust to irrelevant context

work page 2024
[38]

Tianhua Zhang, Hongyin Luo, Yung - Sung Chuang, Wei Fang, Luc Gaitskell, Thomas Hartvigsen, Xixin Wu, Danny Fox, Helen Meng, and James R. Glass. 2023 a . https://doi.org/10.48550/ARXIV.2304.03728 Interpretable unified language checking . CoRR, abs/2304.03728

work page doi:10.48550/arxiv.2304.03728 2023
[39]

arXiv preprint arXiv:2403.10131 , year=

Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E. Gonzalez. 2024. https://doi.org/10.48550/ARXIV.2403.10131 RAFT: adapting language model to domain specific RAG . CoRR, abs/2403.10131

work page doi:10.48550/arxiv.2403.10131 2024
[40]

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023 b . https://doi.org/10.48550/ARXIV.2309.01219 Siren's song in the AI ocean: A survey on hallucination in large language models . CoRR, abs/2309.01219

work page internal anchor Pith review doi:10.48550/arxiv.2309.01219 2023
[41]

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. https://doi.org/10.48550/arXiv.2302.10198 Can chatgpt understand too? A comparative study on chatgpt and fine-tuned BERT . CoRR, abs/2302.10198

work page doi:10.48550/arxiv.2302.10198 2023