EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
hub
Hallucination is Inevitable: An Innate Limitation of Large Language Models
36 Pith papers cite this work. Polarity classification is still indexing.
abstract
Hallucination has been widely recognized to be a significant drawback for large language models (LLMs). There have been many works that attempt to reduce the extent of hallucination. These efforts have mostly been empirical so far, which cannot answer the fundamental question whether it can be completely eliminated. In this paper, we formalize the problem and show that it is impossible to eliminate hallucination in LLMs. Specifically, we define a formal world where hallucination is defined as inconsistencies between a computable LLM and a computable ground truth function. By employing results from learning theory, we show that LLMs cannot learn all the computable functions and will therefore inevitably hallucinate if used as general problem solvers. Since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world LLMs. Furthermore, for real world LLMs constrained by provable time complexity, we describe the hallucination-prone tasks and empirically validate our claims. Finally, using the formal world framework, we discuss the possible mechanisms and efficacies of existing hallucination mitigators as well as the practical implications on the safe deployment of LLMs.
hub tools
citation-role summary
citation-polarity summary
roles
background 4representative citing papers
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
DEJA uses evolutionary optimization guided by an LLM-based Answer Utility Score to induce soft-failure responses in RAG systems, achieving over 79% soft attack success rate with under 15% hard failures and high stealth across models and datasets.
A system combining VLM landmark instructions with real-time corrective spatial audio reduces route deviations in a small user study compared to VLM-only and Google Maps audio baselines.
Domain-specialized small language models enable deterministic atomic-resolution scanning probe microscopy control with 99.3% command accuracy, lower computational cost, and better domain performance than larger general models.
VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.
Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.
A dual-purpose benchmark supplies two text-derived knowledge graphs and one expert reference graph on the same biomedical corpus to jointly measure construction method quality and GNN robustness via semi-supervised node classification.
An AI model produced a new formula for a central element of U_q(so_12) at the quality level of advanced undergraduate research, along with faster computation via SageMath, prompting changes in mentorship practices.
LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.
An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning tasks with lower total inference cost.
FineSteer decomposes inference-time steering into Subspace-guided Conditional Steering and Mixture-of-Steering-Experts to deliver stronger control over LLM behaviors with less utility loss than prior methods.
An accurate and trusted AI system cannot achieve human-level reasoning because there exist tasks easily solvable by humans but not by the system.
CogDriver-Agent with sparse temporal memory and spatiotemporal distillation on CogDriver-Data achieves 22% higher closed-loop Driving Score on Bench2Drive and 21% lower mean L2 error on nuScenes.
Introduces a Bayesian framework viewing LLM prompts as textual parameters and proposes MHLP, a novel MCMC algorithm using LLM proposals, to perform inference and improve accuracy plus uncertainty quantification on benchmarks.
Introduces six-dimension trustworthiness definition and attention-based A-Trust score with a TMS to improve LLM-MAS robustness against malicious or unreliable messages.
Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.
A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
AgentReputation proposes separating AI agent task execution, reputation management, and secure record-keeping into distinct layers, with context-specific reputation cards and a risk-based policy engine to handle verification in decentralized settings.
LLMs exhibit substantial heterogeneity and non-determinism in SLR evidence screening, abstracts are decisive for performance, and they show no reliable superiority over classical classifiers on two real SLRs.
AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.
V2E automates PoC generation, triggerability and profitability validation, and iterative refinement using LLMs to confirm exploitable smart contract vulnerabilities, outperforming baselines on 264 labeled contracts.
TRACE improves project-wise subsequent code editing by interleaving neural-based induction for semantic edits and tool-based deduction for syntactic edits.
citing papers explorer
-
Green Shielding: A User-Centric Approach Towards Trustworthy AI
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
-
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency
Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.
-
Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.
-
Hallucinations Undermine Trust; Metacognition is a Way Forward
LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.
-
Hallucinations are inevitable but can be made statistically negligible
Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.
-
Scaling Synthetic Data Creation with 1,000,000,000 Personas
A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
-
Hierarchical Memorization in Large Language Models: Evidence from Citation Generation
LLMs memorize citations hierarchically: titles and first authors are recalled at lower redundancy levels than venues or years, with accuracy scaling log-linearly and saturating near verbatim reproduction above roughly 1200 citations.
-
Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning
APP is a multi-turn LLM framework for medical dialogue that combines empathetic questioning, Bayesian active learning, and guideline-based reasoning, outperforming baselines on a new simulated-patient benchmark in accuracy, uncertainty reduction, and user experience.
-
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.