Hidden activations in LLMs encode detectable information about statement truthfulness, enabling a classifier to identify true versus false content more reliably than the model's assigned probabilities.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2023 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
The Internal State of an LLM Knows When It's Lying
Hidden activations in LLMs encode detectable information about statement truthfulness, enabling a classifier to identify true versus false content more reliably than the model's assigned probabilities.