Chain-of-Verification Reduces Hallucination in Large Language Models

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz · 2023 · cs.CL · arXiv 2309.11495

22 Pith papers cite this work. Polarity classification is still indexing.

abstract

Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.
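The four steps in the abstract can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the `llm` callable (prompt in, completion out), the function name `chain_of_verification`, and all prompt wordings are assumptions introduced here for illustration.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def chain_of_verification(query: str, llm: LLM) -> str:
    """Sketch of the four CoVe steps; prompt texts are illustrative only."""
    # (i) Draft an initial response.
    draft = llm(f"Answer the question.\nQuestion: {query}")

    # (ii) Plan verification questions that fact-check the draft.
    plan = llm(
        "List one verification question per line to fact-check this answer.\n"
        f"Question: {query}\nAnswer: {draft}"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # (iii) Answer each question independently: the prompt holds only the
    # verification question, never the draft or the other answers.
    checks = [(q, llm(q)) for q in questions]

    # (iv) Generate the final response from the draft plus the evidence.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in checks)
    return llm(
        "Revise the draft so it is consistent with the verification results.\n"
        f"Question: {query}\nDraft: {draft}\nVerification:\n{evidence}"
    )
```

Keeping the draft out of the step (iii) prompts is the design choice that corresponds to answering "independently so the answers are not biased by other responses"; everything else about the prompts is a placeholder.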

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

cs.AI · 2025-12-14 · accept · novelty 8.0

MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.

Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems

cs.CL · 2026-04-19 · unverdicted · novelty 7.0 · 2 refs

Compositional selective specificity (CSS) decomposes generated answers into claims and emits each at the most specific level supported by evidence, raising overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity.

Credo: Declarative Control of LLM Pipelines via Beliefs and Policies

cs.AI · 2026-04-15 · unverdicted · novelty 7.0

Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.

Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

cs.AI · 2026-04-05 · unverdicted · novelty 7.0

PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.

Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models

cs.CL · 2026-03-16 · conditional · novelty 7.0

Chain-of-illocution prompting improves source adherence in RAG explanations for programming education by up to 63% over baselines.

Hallucination is Inevitable: An Innate Limitation of Large Language Models

cs.CL · 2024-01-22 · conditional · novelty 7.0

Hallucinations are inevitable in LLMs because they cannot learn all computable functions according to learning theory.

Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents

cs.CV · 2025-09-29 · unverdicted · novelty 6.0

CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.

Towards an AI co-scientist

cs.AI · 2025-02-26 · unverdicted · novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

Hallucinations are inevitable but can be made statistically negligible

cs.CL · 2025-02-15 · unverdicted · novelty 6.0

Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.

Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

cs.CL · 2024-06-22 · unverdicted · novelty 6.0

SEPs approximate semantic entropy from single-generation hidden states to enable cheap and robust hallucination detection in LLMs.

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

cs.CL · 2023-10-17 · unverdicted · novelty 6.0

Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

cs.CL · 2023-11-09 · unverdicted · novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs

cs.CL · 2026-05-04 · unverdicted · novelty 4.0 · 2 refs

HalluScan benchmark evaluates hallucination detection in LLMs, reporting NLI Verification at AUROC 0.88 and introducing HalluScore (r=0.41 with humans) plus Adaptive Detection Routing for 2x cost savings.

Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

cs.LG · 2026-04-13 · unverdicted · novelty 4.0

HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperformance over self-consistency on benchmarks and production data.

Hallucination Detection and Evaluation of Large Language Model

cs.CL · 2025-12-27 · unverdicted · novelty 4.0

HHEM delivers fast hallucination detection in LLMs via classification, cutting evaluation time from 8 hours to 10 minutes with up to 82.2% accuracy while adding segment retrieval for summarization.

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

cs.CL · 2023-09-03 · unverdicted · novelty 4.0

A literature survey that taxonomizes hallucination phenomena in LLMs, reviews evaluation benchmarks, and analyzes approaches for their detection, explanation, and mitigation.

Verify-Gated Completion as Admission Control in a Governed Multi-Agent Runtime: A Bounded Architecture Case Study

cs.SE · 2026-05-18 · conditional · novelty 3.0 · 2 refs

In a bounded multi-agent runtime case study, verify-gated completion produced 99.5% success on invoked verification events with packetized records, supporting only a narrow claim of inspectable and fail-closed decisions under observed conditions.

A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

cs.AI · 2024-02-05 · unverdicted · novelty 3.0

A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a taxonomy and summary table.

Retrieval-Augmented Generation for Large Language Models: A Survey

cs.CL · 2023-12-18 · unverdicted · novelty 3.0

A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.

Social and Ethical Risks Posed by General-Purpose LLMs for Settling Newcomers in Canada

cs.CY · 2024-07-15 · unverdicted · novelty 2.0

The paper identifies social and ethical risks from unguarded use of general-purpose LLMs in Canadian newcomer settlement and advocates for AI literacy programs plus customized models with human oversight.

citing papers explorer

Showing 22 of 22 citing papers.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations · cs.CL · 2026-05-12 · unverdicted · none · ref 116 · internal anchor
MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents · cs.AI · 2025-12-14 · accept · none · ref 9 · internal anchor
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents · cs.AI · 2026-04-21 · unverdicted · none · ref 18 · internal anchor
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems · cs.CL · 2026-04-19 · unverdicted · none · ref 3 · 2 links · internal anchor
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies · cs.AI · 2026-04-15 · unverdicted · none · ref 3 · internal anchor
Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents · cs.AI · 2026-04-05 · unverdicted · none · ref 3 · internal anchor
Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models · cs.CL · 2026-03-16 · conditional · none · ref 11 · internal anchor
Hallucination is Inevitable: An Innate Limitation of Large Language Models · cs.CL · 2024-01-22 · conditional · none · ref 15 · internal anchor
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents · cs.CV · 2025-09-29 · unverdicted · none · ref 6 · internal anchor
Towards an AI co-scientist · cs.AI · 2025-02-26 · unverdicted · none · ref 221 · internal anchor
Hallucinations are inevitable but can be made statistically negligible · cs.CL · 2025-02-15 · unverdicted · none · ref 44 · internal anchor
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs · cs.CL · 2024-06-22 · unverdicted · none · ref 15 · internal anchor
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection · cs.CL · 2023-10-17 · unverdicted · none · ref 8 · internal anchor
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions · cs.CL · 2023-11-09 · unverdicted · none · ref 77 · internal anchor
HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs · cs.CL · 2026-05-04 · unverdicted · none · ref 36 · 2 links · internal anchor
Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR) · cs.LG · 2026-04-13 · unverdicted · none · ref 4 · internal anchor
Hallucination Detection and Evaluation of Large Language Model · cs.CL · 2025-12-27 · unverdicted · none · ref 1 · internal anchor
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models · cs.CL · 2023-09-03 · unverdicted · none · ref 4 · internal anchor
Verify-Gated Completion as Admission Control in a Governed Multi-Agent Runtime: A Bounded Architecture Case Study · cs.SE · 2026-05-18 · conditional · none · ref 8 · 2 links · internal anchor
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications · cs.AI · 2024-02-05 · unverdicted · none · ref 6 · internal anchor
Retrieval-Augmented Generation for Large Language Models: A Survey · cs.CL · 2023-12-18 · unverdicted · none · ref 93 · internal anchor
Social and Ethical Risks Posed by General-Purpose LLMs for Settling Newcomers in Canada · cs.CY · 2024-07-15 · unverdicted · none · ref 38 · internal anchor
