Benchmark construction artifacts in hallucination detection corpora allow naive text-similarity baselines to achieve near-perfect scores, and controlled evaluations show most methods perform near chance except SAPLMA and the new DRIFT probe.
arXiv preprint arXiv:2512.20949
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
representative citing papers
citing papers explorer
- PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts
  Benchmark construction artifacts in hallucination detection corpora allow naive text-similarity baselines to achieve near-perfect scores, and controlled evaluations show most methods perform near chance except SAPLMA and the new DRIFT probe.
- Sparse Reward Subsystem in Large Language Models
  LLM hidden states contain a sparse reward subsystem consisting of value neurons that predict state value and dopamine neurons that encode step-level temporal difference errors.
- MultiHaluDet: Multilingual Hallucination Detection via LLM Hidden State Probing
  MultiHaluDet uses multi-layer hidden-state probing, multi-scale attention, and a calibrated classifier ensemble to detect multilingual hallucinations, reporting up to 98.55% AUROC on English benchmarks and strong cross-lingual transfer to French, Bangla, and Amharic.