Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
Fields: cs.CL (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
PAS automates activation steering for LLMs using labeled data to improve behavior control on tasks like bias and alignment, with gains over ICL and SFT but limited effect on intelligence tasks.
Citing papers explorer
- PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts
  Benchmark construction artifacts in hallucination detection corpora allow naive text-similarity baselines to achieve near-perfect scores, and controlled evaluations show most methods perform near chance except SAPLMA and the new DRIFT probe.
- Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models
  PAS automates activation steering for LLMs using labeled data to improve behavior control on tasks like bias and alignment, with gains over ICL and SFT but limited effect on intelligence tasks.