Tabular QA LLMs are overconfident, but Multi-Format Agreement using Markdown/HTML/JSON/CSV variants improves AUROC to 0.80 and cuts calibration error by 44-63% at lower cost than sampling.
Semantic energy: Detecting LLM hallucination beyond entropy
4 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Larger LLMs hallucinate more often despite having the correct concept available because instruction tuning causes probability mass to disperse across alternative surface forms instead of concentrating on one.
Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.
citing papers explorer
-
Calibrated Confidence Estimation for Tabular Question Answering
Tabular QA LLMs are overconfident, but Multi-Format Agreement using Markdown/HTML/JSON/CSV variants improves AUROC to 0.80 and cuts calibration error by 44-63% at lower cost than sampling.
-
Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
Larger LLMs hallucinate more often despite having the correct concept available because instruction tuning causes probability mass to disperse across alternative surface forms instead of concentrating on one.
-
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.
- High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models