HaluNet: Learning Hallucination Risk from Internal Signals in LLM Question Answering
Large language models (LLMs) achieve strong question answering (QA) performance but can produce fluent answers that are unsupported by the available evidence. Existing hallucination detectors often rely on external verification, repeated sampling, or test-time judge calls, which can be costly for real-time QA. We propose HaluNet, a lightweight hallucination risk estimator that uses internal signals from a single model generation. HaluNet jointly models token likelihood, predictive entropy, and hidden-state information, so that probabilistic, distributional, and semantic evidence all inform an answer-level risk score. It is trained with LLM-as-a-Judge labels as scalable weak supervision and evaluated with independent human and multi-judge assessments. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet improves answer-level risk ranking in both in-domain and out-of-domain settings. On a 300-example human evaluation, HaluNet achieves 0.874 AUROC and 0.869 AUPRC; 96.5% of the answers in its top 20% by risk are errors, a 2.06× lift over the base error rate.
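The lift figure in the abstract is the error rate among the highest-risk answers divided by the error rate over all answers. The sketch below shows that arithmetic under stated assumptions: the function name `lift_at_top_fraction`, the toy scores, and the toy labels are hypothetical illustrations, not the paper's code or data. Only the two constants 0.965 and 2.06 come from the abstract.

```python
def lift_at_top_fraction(risk_scores, is_error, fraction=0.2):
    """Return (top error rate, base error rate, lift) for the top `fraction` by risk.

    Hypothetical helper for illustration; not taken from the HaluNet paper.
    """
    if len(risk_scores) != len(is_error) or not risk_scores:
        raise ValueError("risk_scores and is_error must be non-empty and equal length")
    n = len(risk_scores)
    k = max(1, round(n * fraction))  # number of answers in the top slice
    # Rank answers from highest to lowest predicted risk.
    ranked = sorted(zip(risk_scores, is_error), key=lambda pair: pair[0], reverse=True)
    top_error_rate = sum(label for _, label in ranked[:k]) / k
    base_error_rate = sum(is_error) / n
    if base_error_rate == 0:
        raise ValueError("lift is undefined when there are no errors")
    return top_error_rate, base_error_rate, top_error_rate / base_error_rate


# Toy data (assumed, not from the paper): 10 answers, 4 of them errors.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
top_rate, base_rate, lift = lift_at_top_fraction(toy_scores, toy_labels, fraction=0.2)

# Base error rate implied by the abstract's two reported numbers.
implied_base_rate = 0.965 / 2.06
print(f"toy lift: {lift:.2f}, implied base error rate: {implied_base_rate:.3f}")
```

Read this way, the reported 96.5% top-slice error rate and 2.06× lift imply a base error rate of roughly 47% on the human-evaluated set, assuming the lift is defined as this simple ratio.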
Forward citations
Cited by 1 Pith paper
- Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better Evaluation
Supervised uncertainty probes for LLMs show poor robustness under distribution shift, with middle-layer representations and multi-token aggregation proving more reliable than final-layer or single-token features.