Hallucination is Inevitable: An Innate Limitation of Large Language Models

Ziwei Xu, Sanjay Jain, Mohan Kankanhalli · 2024 · cs.CL · arXiv 2401.11817

Canonical reference. 80% of citing Pith papers cite this work as background.

47 Pith papers citing it

abstract

Hallucination has been widely recognized to be a significant drawback for large language models (LLMs). There have been many works that attempt to reduce the extent of hallucination. These efforts have mostly been empirical so far, which cannot answer the fundamental question whether it can be completely eliminated. In this paper, we formalize the problem and show that it is impossible to eliminate hallucination in LLMs. Specifically, we define a formal world where hallucination is defined as inconsistencies between a computable LLM and a computable ground truth function. By employing results from learning theory, we show that LLMs cannot learn all the computable functions and will therefore inevitably hallucinate if used as general problem solvers. Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs. Furthermore, for real world LLMs constrained by provable time complexity, we describe the hallucination-prone tasks and empirically validate our claims. Finally, using the formal world framework, we discuss the possible mechanisms and efficacies of existing hallucination mitigators as well as the practical implications on the safe deployment of LLMs.
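The abstract's central claim, that no computable LLM can agree with every computable ground truth, can be illustrated with a toy diagonal construction. The sketch below is not the paper's proof. It assumes a small finite list of stand-in "models" (the names `make_models` and `diagonal_ground_truth` are hypothetical) and builds a ground truth that disagrees with model i on prompt i, so every model in the list hallucinates on at least one input.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for computable LLMs: each maps a prompt to an answer.
Model = Callable[[str], str]


def make_models() -> List[Model]:
    """Return a few toy 'models' (assumed for illustration only)."""
    return [
        lambda s: s.upper(),   # model 0: shouts the prompt back
        lambda s: s[::-1],     # model 1: reverses the prompt
        lambda s: "yes",       # model 2: always answers "yes"
    ]


def diagonal_ground_truth(models: List[Model], prompts: List[str]) -> Dict[str, str]:
    """Build a ground-truth table that differs from model i on prompt i."""
    truth: Dict[str, str] = {}
    for model, prompt in zip(models, prompts):
        # Appending a character guarantees a string different from the model's answer.
        truth[prompt] = model(prompt) + "!"
    return truth


models = make_models()
prompts = ["s0", "s1", "s2"]
truth = diagonal_ground_truth(models, prompts)

# Model i is inconsistent with the ground truth on prompt i, i.e. it hallucinates there.
hallucinates = [models[i](prompts[i]) != truth[prompts[i]] for i in range(len(models))]
print(hallucinates)
```

The same idea extends to any enumerable family of models: as long as the models can be listed, a ground truth can be defined that each one gets wrong somewhere, which is the intuition behind treating hallucination as unavoidable for general problem solvers.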

citation-role summary

background 5

citation-polarity summary

background 4 · unclear 1

representative citing papers

SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems

cs.AI · 2026-05-11 · unverdicted · novelty 8.0 · 2 refs

SciIntegrity-Bench shows seven LLMs exhibit a 34.2% integrity failure rate in dilemmatic scenarios, with all models fabricating synthetic data in missing-data cases and an intrinsic completion bias persisting after prompt changes.

EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.

Green Shielding: A User-Centric Approach Towards Trustworthy AI

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.

Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation

cs.CR · 2026-04-20 · unverdicted · novelty 7.0

DEJA uses evolutionary optimization guided by an LLM-based Answer Utility Score to induce soft-failure responses in RAG systems, achieving over 79% soft attack success rate with under 15% hard failures and high stealth across models and datasets.

Navig-AI-tion: Navigation by Contextual AI and Spatial Audio

cs.HC · 2026-03-13 · unverdicted · novelty 7.0

A system combining VLM landmark instructions with real-time corrective spatial audio reduces route deviations in a small user study compared to VLM-only and Google Maps audio baselines.

Integrating Domain-Specialized Language Models with AI Measurement Tools for Deterministic Atomic-Resolution Experimentation

physics.app-ph · 2026-02-24 · unverdicted · novelty 7.0

Domain-specialized small language models enable deterministic atomic-resolution scanning probe microscopy control with 99.3% command accuracy, lower computational cost, and better domain performance than larger general models.

VidHal: Benchmarking Temporal Hallucinations in Vision LLMs

cs.CV · 2024-11-25 · unverdicted · novelty 7.0

VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.

HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

HistoRAG embeds historiographical principles into RAG via temporal windowing, decoupled retrieval, and contestable LLM relevance judgments, evaluated on 102k Der Spiegel articles from 1950-1979.

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

cs.AI · 2026-06-16 · unverdicted · novelty 6.0

PseudoBench shows current LLM agents produce persuasive pseudoscientific reports with near-zero refusal rates and at most 27.4% resistance.

Boosting Self-Consistency with Ranking

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.

From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models

cs.LG · 2026-06-03 · unverdicted · novelty 6.0

SLM adds a dedicated spatial modality and training dataset to LLMs, enabling geometric spatial reasoning and outperforming prompt-based symbolic methods on the new SpatialEval benchmark.

Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

Token entropy distributions fingerprint hallucinations in generative models, enabling the Calibrated Entropy Score (CES) for single-pass black-box detection with calibration guarantees via a novel DKW inequality.

Intent Signal Theory: A Computational Framework for Intent-State Control in Human-AI Interaction

cs.HC · 2026-05-24 · unverdicted · novelty 6.0

Intent Signal Theory formalizes four distinct intent-related objects in human-AI interaction, introduces a theorem on irreversible private intent loss, and reports supporting patterns from studies across LLMs, languages, and tasks.

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.

Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.

A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

A dual-purpose benchmark supplies two text-derived knowledge graphs and one expert reference graph on the same biomedical corpus to jointly measure construction method quality and GNN robustness via semi-supervised node classification.

Using Large Language Models as a Co-Author in Undergraduate Quantum Group Research

math.HO · 2026-05-04 · unverdicted · novelty 6.0

An AI model produced a new formula for a central element of U_q(so_12) at the quality level of advanced undergraduate research, along with faster computation via SageMath, prompting changes in mentorship practices.

Hallucinations Undermine Trust; Metacognition is a Way Forward

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.

Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning tasks with lower total inference cost.

FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

FineSteer decomposes inference-time steering into Subspace-guided Conditional Steering and Mixture-of-Steering-Experts to deliver stronger control over LLM behaviors with less utility loss than prior methods.

Limitations on Accurate, Trusted, Human-level Reasoning

cs.LG · 2025-09-25 · unverdicted · novelty 6.0

An accurate and trusted AI system cannot achieve human-level reasoning because there exist tasks easily solvable by humans but not by the system.

CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving

cs.CV · 2025-08-31 · unverdicted · novelty 6.0

CogDriver-Agent with sparse temporal memory and spatiotemporal distillation on CogDriver-Data achieves 22% higher closed-loop Driving Score on Bench2Drive and 21% lower mean L2 error on nuScenes.

Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems

cs.LG · 2025-06-11 · unverdicted · novelty 6.0

Introduces a Bayesian framework viewing LLM prompts as textual parameters and proposes MHLP, a novel MCMC algorithm using LLM proposals, to perform inference and improve accuracy plus uncertainty quantification on benchmarks.

To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems

cs.CR · 2025-06-03 · unverdicted · novelty 6.0

Introduces six-dimension trustworthiness definition and attention-based A-Trust score with a TMS to improve LLM-MAS robustness against malicious or unreliable messages.

citing papers explorer

Showing 46 of 46 citing papers after filters.

SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems cs.AI · 2026-05-11 · unverdicted · none · ref 27 · 2 links · internal anchor
SciIntegrity-Bench shows seven LLMs exhibit a 34.2% integrity failure rate in dilemmatic scenarios, with all models fabricating synthetic data in missing-data cases and an intrinsic completion bias persisting after prompt changes.
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium cs.AI · 2026-05-10 · unverdicted · none · ref 85 · internal anchor
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
Green Shielding: A User-Centric Approach Towards Trustworthy AI cs.CL · 2026-04-27 · unverdicted · none · ref 5 · internal anchor
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation cs.CR · 2026-04-20 · unverdicted · none · ref 5 · internal anchor
DEJA uses evolutionary optimization guided by an LLM-based Answer Utility Score to induce soft-failure responses in RAG systems, achieving over 79% soft attack success rate with under 15% hard failures and high stealth across models and datasets.
Navig-AI-tion: Navigation by Contextual AI and Spatial Audio cs.HC · 2026-03-13 · unverdicted · none · ref 25 · internal anchor
A system combining VLM landmark instructions with real-time corrective spatial audio reduces route deviations in a small user study compared to VLM-only and Google Maps audio baselines.
Integrating Domain-Specialized Language Models with AI Measurement Tools for Deterministic Atomic-Resolution Experimentation physics.app-ph · 2026-02-24 · unverdicted · none · ref 18 · internal anchor
Domain-specialized small language models enable deterministic atomic-resolution scanning probe microscopy control with 99.3% command accuracy, lower computational cost, and better domain performance than larger general models.
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs cs.CV · 2024-11-25 · unverdicted · none · ref 58 · internal anchor
VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice cs.CL · 2026-06-16 · unverdicted · none · ref 40 · internal anchor
HistoRAG embeds historiographical principles into RAG via temporal windowing, decoupled retrieval, and contestable LLM relevance judgments, evaluated on 102k Der Spiegel articles from 1950-1979.
PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience cs.AI · 2026-06-16 · unverdicted · none · ref 64 · internal anchor
PseudoBench shows current LLM agents produce persuasive pseudoscientific reports with near-zero refusal rates and at most 27.4% resistance.
Boosting Self-Consistency with Ranking cs.CL · 2026-06-03 · unverdicted · none · ref 186 · internal anchor
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models cs.LG · 2026-06-03 · unverdicted · none · ref 30 · internal anchor
SLM adds a dedicated spatial modality and training dataset to LLMs, enabling geometric spatial reasoning and outperforming prompt-based symbolic methods on the new SpatialEval benchmark.
Entropy Distribution as a Fingerprint for Hallucinations in Generative Models cs.AI · 2026-05-27 · unverdicted · none · ref 48 · internal anchor
Token entropy distributions fingerprint hallucinations in generative models, enabling the Calibrated Entropy Score (CES) for single-pass black-box detection with calibration guarantees via a novel DKW inequality.
Intent Signal Theory: A Computational Framework for Intent-State Control in Human-AI Interaction cs.HC · 2026-05-24 · unverdicted · none · ref 26 · internal anchor
Intent Signal Theory formalizes four distinct intent-related objects in human-AI interaction, introduces a theorem on irreversible private intent loss, and reports supporting patterns from studies across LLMs, languages, and tasks.
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency cs.CL · 2026-05-18 · unverdicted · none · ref 25 · internal anchor
Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.
Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation cs.CL · 2026-05-14 · unverdicted · none · ref 11 · internal anchor
Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.
A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks cs.LG · 2026-05-06 · unverdicted · none · ref 36 · internal anchor
A dual-purpose benchmark supplies two text-derived knowledge graphs and one expert reference graph on the same biomedical corpus to jointly measure construction method quality and GNN robustness via semi-supervised node classification.
Using Large Language Models as a Co-Author in Undergraduate Quantum Group Research math.HO · 2026-05-04 · unverdicted · none · ref 28 · internal anchor
An AI model produced a new formula for a central element of U_q(so_12) at the quality level of advanced undergraduate research, along with faster computation via SageMath, prompting changes in mentorship practices.
Hallucinations Undermine Trust; Metacognition is a Way Forward cs.CL · 2026-05-02 · unverdicted · none · ref 46 · internal anchor
LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.
Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations cs.AI · 2026-04-22 · unverdicted · none · ref 47 · internal anchor
An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning tasks with lower total inference cost.
FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models cs.LG · 2026-04-16 · unverdicted · none · ref 8 · internal anchor
FineSteer decomposes inference-time steering into Subspace-guided Conditional Steering and Mixture-of-Steering-Experts to deliver stronger control over LLM behaviors with less utility loss than prior methods.
Limitations on Accurate, Trusted, Human-level Reasoning cs.LG · 2025-09-25 · unverdicted · none · ref 17 · internal anchor
An accurate and trusted AI system cannot achieve human-level reasoning because there exist tasks easily solvable by humans but not by the system.
CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving cs.CV · 2025-08-31 · unverdicted · none · ref 44 · internal anchor
CogDriver-Agent with sparse temporal memory and spatiotemporal distillation on CogDriver-Data achieves 22% higher closed-loop Driving Score on Bench2Drive and 21% lower mean L2 error on nuScenes.
Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems cs.LG · 2025-06-11 · unverdicted · none · ref 71 · internal anchor
Introduces a Bayesian framework viewing LLM prompts as textual parameters and proposes MHLP, a novel MCMC algorithm using LLM proposals, to perform inference and improve accuracy plus uncertainty quantification on benchmarks.
To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems cs.CR · 2025-06-03 · unverdicted · none · ref 40 · internal anchor
Introduces six-dimension trustworthiness definition and attention-based A-Trust score with a TMS to improve LLM-MAS robustness against malicious or unreliable messages.
Hallucinations are inevitable but can be made statistically negligible cs.CL · 2025-02-15 · unverdicted · none · ref 45 · internal anchor
Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.
Scaling Synthetic Data Creation with 1,000,000,000 Personas cs.CL · 2024-06-28 · unverdicted · none · ref 25 · internal anchor
A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
Readable but Not Controllable: Neuron-Level Evidence for Medical LLM Hallucination cs.CL · 2026-06-30 · unverdicted · none · ref 13 · internal anchor
Hallucination signals in medical LLMs are distributed and decodable from activations but not causally controllable via neuron-level interventions.
From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration cs.HC · 2026-05-28 · unverdicted · none · ref 51 · internal anchor
Presents the CCAI ontology and SPARQL retrieval method to convert ephemeral Human-Generative AI prompt interactions into explicit, machine-readable collaboration traces, illustrated in a competency-profile software case study.
AgentReputation: A Decentralized Agentic AI Reputation Framework cs.AI · 2026-04-30 · unverdicted · none · ref 30 · internal anchor
AgentReputation proposes separating AI agent task execution, reputation management, and secure record-keeping into distinct layers, with context-specific reputation cards and a risk-based policy engine to handle verification in decentralized settings.
Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs cs.SE · 2026-04-29 · unverdicted · none · ref 8 · internal anchor
LLMs exhibit substantial heterogeneity and non-determinism in SLR evidence screening, abstracts are decisive for performance, and they show no reliable superiority over classical classifiers on two real SLRs.
A pragmatic approach to regulating AI agents cs.CY · 2026-04-16 · unverdicted · none · ref 26 · internal anchor
AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.
V2E: Validating Smart Contract Vulnerabilities through Profit-driven Exploit Generation and Execution cs.SE · 2026-04-15 · unverdicted · none · ref 59 · internal anchor
V2E automates PoC generation, triggerability and profitability validation, and iterative refinement using LLMs to confirm exploitable smart contract vulnerabilities, outperforming baselines on 264 labeled contracts.
Learning Project-wise Subsequent Code Edits via Interleaving Neural-based Induction and Tool-based Deduction cs.SE · 2026-04-14 · unverdicted · none · ref 15 · internal anchor
TRACE improves project-wise subsequent code editing by interleaving neural-based induction for semantic edits and tool-based deduction for syntactic edits.
Mitigating hallucinations and omissions in LLMs for invertible problems: An application to hardware logic design automation cs.LG · 2025-11-25 · unverdicted · none · ref 36 · internal anchor
Using LLMs to encode logic condition tables into HDL code and decode back to tables mitigates hallucinations in hardware design automation.
Multi-agent Self-triage System with Medical Flowcharts cs.AI · 2025-11-16 · unverdicted · none · ref 15 · internal anchor
A multi-agent conversational system using AMA flowcharts achieves 95.29% top-3 retrieval accuracy and 99.10% navigation accuracy on large synthetic medical conversation datasets.
TSGuard: Automated User-Centric Incident Diagnosis for AI Workloads in the Cloud cs.SE · 2025-06-02 · unverdicted · none · ref 65 · internal anchor
TSGuard builds domain knowledge bases offline from historical incidents and applies online multi-agent structured reasoning to diagnose AI workload failures, delivering 19.8% higher accuracy and 63.4% lower verification time than baselines on Azure production data.
Position: Prioritize Identifying Structure, Not Complex Models, for Scientific Discovery stat.ML · 2026-05-30 · unverdicted · none · ref 79 · internal anchor
Mechanistic learning from ML is generically underdetermined in high-dimensional proxy regimes, with LLMs worsening the problem by collapsing many possible explanations into one fluent narrative.
Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience cs.CL · 2026-05-24 · unverdicted · none · ref 37 · internal anchor
A textbook-derived neuroscience knowledge graph supplies synthetic multi-hop QA supervision and RL rewards to fine-tune a small LM claimed to exceed larger general models on expert reasoning.
Opportunities and Risks of Generative AI through the Health Information Journey cs.CY · 2026-05-21 · unverdicted · none · ref 65 · internal anchor
Authors propose a four-stage framework to analyze opportunities and risks of generative AI across the health information journey from public sources to clinical care.
Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation cs.SE · 2026-04-06 · unverdicted · none · ref 21 · internal anchor
Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.
Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning cs.CL · 2025-02-11 · unverdicted · none · ref 5 · internal anchor
APP is a multi-turn LLM framework for medical dialogue that combines empathetic questioning, Bayesian active learning, and guideline-based reasoning, outperforming baselines on a new simulated-patient benchmark in accuracy, uncertainty reduction, and user experience.
Designing for Error Recovery in Human-Robot Interaction cs.RO · 2026-04-14 · unverdicted · none · ref 17 · internal anchor
Position paper calls for designing robotic AI to detect and recover from its own errors in continuous interactions, using nuclear glovebox operations as an illustrative case.
Causal Connections: Leveraging Multilingual Fine-Tuning for Financial QA@FinCausal 2026 cs.CL · 2026-06-25 · unverdicted · none · ref 20 · internal anchor
Fine-tuned multilingual LLMs achieve top shared-task scores on financial causality extraction in English and Spanish.
Bridging Brains and Machines: A Unified Frontier in Neuroscience, Artificial Intelligence, and Neuromorphic Systems q-bio.NC · 2025-07-14 · unverdicted · none · ref 185 · internal anchor
A position and survey paper that identifies convergence between neuroscience, AGI, and neuromorphic computing and outlines four key integration challenges.
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning cs.CL · 2025-02-05 · unverdicted · none · ref 220 · internal anchor
Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.
Social and Ethical Risks Posed by General-Purpose LLMs for Settling Newcomers in Canada cs.CY · 2024-07-15 · unverdicted · none · ref 31 · internal anchor
The paper identifies social and ethical risks from unguarded use of general-purpose LLMs in Canadian newcomer settlement and advocates for AI literacy programs plus customized models with human oversight.
