{"paper":{"title":"The Internal State of an LLM Knows When It's Lying","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"The hidden activations inside an LLM can be read by a trained classifier to detect whether a statement is true or false.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Amos Azaria, Tom Mitchell","submitted_at":"2023-04-26T02:49:38Z","abstract_excerpt":"While Large Language Models (LLMs) have shown exceptional performance in various tasks, one of their most prominent drawbacks is generating inaccurate or false information with a confident tone. In this paper, we provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements. This includes both statements provided to the LLM, and statements that the LLM itself generates. Our approach is to train a classifier that outputs the probability that a statement is truthful, based on the hidden layer activations of the LLM as it reads or generates the statement. Exp"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the LLM's internal state can be used to reveal the truthfulness of statements. This includes both statements provided to the LLM, and statements that the LLM itself generates.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the hidden activations contain a generalizable signal of truthfulness that is not merely an artifact of the particular training sentences or superficial statistical properties shared with the labels.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Hidden activations in LLMs encode detectable information about statement truthfulness, enabling a classifier to identify true versus false content more reliably than the model's assigned probabilities.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"The hidden activations inside an LLM can be read by a trained classifier to detect whether a statement is true or false.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"57113b1cc223ecd988455998a48421077959743fbc938f3f0d243ad7a0b19df4"},"source":{"id":"2304.13734","kind":"arxiv","version":2},"verdict":{"id":"3e5e1ff7-b275-4862-aedd-249413e597f0","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T00:04:46.059866Z","strongest_claim":"the LLM's internal state can be used to reveal the truthfulness of statements. This includes both statements provided to the LLM, and statements that the LLM itself generates.","one_line_summary":"Hidden activations in LLMs encode detectable information about statement truthfulness, enabling a classifier to identify true versus false content more reliably than the model's assigned probabilities.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the hidden activations contain a generalizable signal of truthfulness that is not merely an artifact of the particular training sentences or superficial statistical properties shared with the labels.","pith_extraction_headline":"The hidden activations inside an LLM can be read by a trained classifier to detect whether a statement is true or false."},"references":{"count":28,"sample":[{"doi":"","year":2023,"title":"Llama 2: Early Adopters' Utilization of Meta's New Open-Source Pretrained Model , author=. 2023 , publisher=","work_id":"787b7ae6-b026-477c-9d2a-1ce54150fb80","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Advances in neural information processing systems , volume=","work_id":"12f5a236-ef7a-4d13-b4de-b51465a6f977","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"ACM Computing Surveys , volume=","work_id":"259e0022-e47c-460e-bc3e-014f2d3cd3f4","ref_index":8,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"Proceedings of the national academy of sciences , volume=","work_id":"47e891d1-91f8-4e1a-9923-74473e0b4b20","ref_index":11,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2010,"title":"Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques , pages=","work_id":"f17984d5-f09c-4e88-92a6-f524f2ff55eb","ref_index":12,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":28,"snapshot_sha256":"eaa4509713d0743da0f446b2331d1a8e8301dd3136091f78cbfd860f7addb97d","internal_anchors":7},"formal_canon":{"evidence_count":2,"snapshot_sha256":"4f1095b3e670c58d5ecd654ad8466f1f315881a3632975f9dc4838379acd73f7"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}