Verification prompting in LVLMs is input-dependent and risk-bearing; RSP selectively triggers it via pre-generation uncertainty to avoid performance degradation on easy cases.
hub
Chain-of-Verification Reduces Hallucination in Large Language Models
24 Pith papers cite this work. Polarity classification is still indexing.
abstract
Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.
hub tools
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.
Compositional selective specificity (CSS) decomposes generated answers into claims and emits each at the most specific level supported by evidence, raising overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity.
Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
Chain-of-illocution prompting improves source adherence in RAG explanations for programming education by up to 63% over baselines.
Hallucinations are inevitable in LLMs because they cannot learn all computable functions according to learning theory.
GILP combines a small parameterized world model with LLM agent reasoning via a consistency gate, reducing hallucinated-state rate from 0.176 to 0.035 and raising success from 0.668 to 0.838 on graph planning benchmarks.
CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.
SEPs approximate semantic entropy from single-generation hidden states to enable cheap and robust hallucination detection in LLMs.
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
HalluScan benchmark evaluates hallucination detection in LLMs, reporting NLI Verification at AUROC 0.88 and introducing HalluScore (r=0.41 with humans) plus Adaptive Detection Routing for 2x cost savings.
HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperformance over self-consistency on benchmarks and production data.
HHEM delivers fast hallucination detection in LLMs via classification, cutting evaluation time from 8 hours to 10 minutes with up to 82.2% accuracy while adding segment retrieval for summarization.
A literature survey that taxonomizes hallucination phenomena in LLMs, reviews evaluation benchmarks, and analyzes approaches for their detection, explanation, and mitigation.
In a bounded multi-agent runtime case study, verify-gated completion produced 99.5% success on invoked verification events with packetized records, supporting only a narrow claim of inspectable and fail-closed decisions under observed conditions.
A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a taxonomy and summary table.
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.
The paper identifies social and ethical risks from unguarded use of general-purpose LLMs in Canadian newcomer settlement and advocates for AI literacy programs plus customized models with human oversight.
citing papers explorer
-
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents
CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
Hallucinations are inevitable but can be made statistically negligible
Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.
-
Hallucination Detection and Evaluation of Large Language Model
HHEM delivers fast hallucination detection in LLMs via classification, cutting evaluation time from 8 hours to 10 minutes with up to 82.2% accuracy while adding segment retrieval for summarization.
- MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents