{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:KOV64XHEEHE6X7VXUTHPFTDZDN","short_pith_number":"pith:KOV64XHE","schema_version":"1.0","canonical_sha256":"53abee5ce421c9ebfeb7a4cef2cc791b5c6a8f4035a3370225bb51b68e8f82ff","source":{"kind":"arxiv","id":"2304.13734","version":2},"attestation_state":"computed","paper":{"title":"The Internal State of an LLM Knows When It's Lying","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"The hidden activations inside an LLM can be read by a trained classifier to detect whether a statement is true or false.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Amos Azaria, Tom Mitchell","submitted_at":"2023-04-26T02:49:38Z","abstract_excerpt":"While Large Language Models (LLMs) have shown exceptional performance in various tasks, one of their most prominent drawbacks is generating inaccurate or false information with a confident tone. In this paper, we provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements. This includes both statements provided to the LLM, and statements that the LLM itself generates. Our approach is to train a classifier that outputs the probability that a statement is truthful, based on the hidden layer activations of the LLM as it reads or generates the statement. Exp"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2304.13734","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","primary_cat":"cs.CL","submitted_at":"2023-04-26T02:49:38Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"954ff13209d1dc258fabe7058532d9a1bc37bbaf282cae3137636fd652f41ee3","abstract_canon_sha256":"5ccf1164771c2d94ac9d1d74ea147cf9ffb616d551ce05fe36a5b7fc0e739517"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:49.658235Z","signature_b64":"8tOPlJNFCCN9X4w8CnBCxndya+PGNjEQhEJKAlP0IAmDz1g+etG69PpafBZ8tPer69OEPsVwj4bRhecvC7t0Dg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"53abee5ce421c9ebfeb7a4cef2cc791b5c6a8f4035a3370225bb51b68e8f82ff","last_reissued_at":"2026-05-17T23:38:49.657733Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:49.657733Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"The Internal State of an LLM Knows When It's Lying","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"The hidden activations inside an LLM can be read by a trained classifier to detect whether a statement is true or false.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Amos Azaria, Tom Mitchell","submitted_at":"2023-04-26T02:49:38Z","abstract_excerpt":"While Large Language Models (LLMs) have shown exceptional performance in various tasks, one of their most prominent drawbacks is generating inaccurate or false information with a confident tone. In this paper, we provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements. This includes both statements provided to the LLM, and statements that the LLM itself generates. Our approach is to train a classifier that outputs the probability that a statement is truthful, based on the hidden layer activations of the LLM as it reads or generates the statement. Exp"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the LLM's internal state can be used to reveal the truthfulness of statements. This includes both statements provided to the LLM, and statements that the LLM itself generates.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the hidden activations contain a generalizable signal of truthfulness that is not merely an artifact of the particular training sentences or superficial statistical properties shared with the labels.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Hidden activations in LLMs encode detectable information about statement truthfulness, enabling a classifier to identify true versus false content more reliably than the model's assigned probabilities.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"The hidden activations inside an LLM can be read by a trained classifier to detect whether a statement is true or false.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"57113b1cc223ecd988455998a48421077959743fbc938f3f0d243ad7a0b19df4"},"source":{"id":"2304.13734","kind":"arxiv","version":2},"verdict":{"id":"3e5e1ff7-b275-4862-aedd-249413e597f0","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T00:04:46.059866Z","strongest_claim":"the LLM's internal state can be used to reveal the truthfulness of statements. This includes both statements provided to the LLM, and statements that the LLM itself generates.","one_line_summary":"Hidden activations in LLMs encode detectable information about statement truthfulness, enabling a classifier to identify true versus false content more reliably than the model's assigned probabilities.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the hidden activations contain a generalizable signal of truthfulness that is not merely an artifact of the particular training sentences or superficial statistical properties shared with the labels.","pith_extraction_headline":"The hidden activations inside an LLM can be read by a trained classifier to detect whether a statement is true or false."},"references":{"count":28,"sample":[{"doi":"","year":2023,"title":"Llama 2: Early Adopters' Utilization of Meta's New Open-Source Pretrained Model , author=. 2023 , publisher=","work_id":"787b7ae6-b026-477c-9d2a-1ce54150fb80","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Advances in neural information processing systems , volume=","work_id":"12f5a236-ef7a-4d13-b4de-b51465a6f977","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"ACM Computing Surveys , volume=","work_id":"259e0022-e47c-460e-bc3e-014f2d3cd3f4","ref_index":8,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"Proceedings of the national academy of sciences , volume=","work_id":"47e891d1-91f8-4e1a-9923-74473e0b4b20","ref_index":11,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2010,"title":"Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques , pages=","work_id":"f17984d5-f09c-4e88-92a6-f524f2ff55eb","ref_index":12,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":28,"snapshot_sha256":"eaa4509713d0743da0f446b2331d1a8e8301dd3136091f78cbfd860f7addb97d","internal_anchors":7},"formal_canon":{"evidence_count":2,"snapshot_sha256":"4f1095b3e670c58d5ecd654ad8466f1f315881a3632975f9dc4838379acd73f7"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2304.13734","created_at":"2026-05-17T23:38:49.657810+00:00"},{"alias_kind":"arxiv_version","alias_value":"2304.13734v2","created_at":"2026-05-17T23:38:49.657810+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2304.13734","created_at":"2026-05-17T23:38:49.657810+00:00"},{"alias_kind":"pith_short_12","alias_value":"KOV64XHEEHE6","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"KOV64XHEEHE6X7VX","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"KOV64XHE","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":27,"internal_anchor_count":27,"sample":[{"citing_arxiv_id":"2605.02443","citing_title":"HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2504.00446","citing_title":"Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22636","citing_title":"A Multi-Source Framework for Relational Validation of Large Language Models Using Expert-Curated Encyclopedic Sources","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2506.18852","citing_title":"Mechanistic Interpretability Needs Philosophy","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2603.17839","citing_title":"How do LLMs Compute Verbal Confidence","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18792","citing_title":"Trust or Abstain? A Self-Aware RAG Approach","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10893","citing_title":"Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2309.05922","citing_title":"A Survey of Hallucination in Large Foundation Models","ref_index":111,"is_internal_anchor":true},{"citing_arxiv_id":"2602.13224","citing_title":"A Geometric Taxonomy of Hallucinations in LLMs","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2602.20338","citing_title":"Emergent Manifold Separability during Reasoning in Large Language Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2303.08896","citing_title":"SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2604.01151","citing_title":"Detecting Multi-Agent Collusion Through Multi-Agent Interpretability","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2311.05232","citing_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08221","citing_title":"NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning","ref_index":98,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10893","citing_title":"Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05777","citing_title":"Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05741","citing_title":"HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05715","citing_title":"Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05593","citing_title":"Causal Probing for Internal Visual Representations in Multimodal Large Language Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22271","citing_title":"How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00874","citing_title":"Latent Space Probing for Adult Content Detection in Video Generative Models","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06277","citing_title":"Weakly Supervised Distillation of Hallucination Signals into Transformer Representations","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05348","citing_title":"From Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMs","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13417","citing_title":"The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15945","citing_title":"RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration","ref_index":13,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/KOV64XHEEHE6X7VXUTHPFTDZDN","json":"https://pith.science/pith/KOV64XHEEHE6X7VXUTHPFTDZDN.json","graph_json":"https://pith.science/api/pith-number/KOV64XHEEHE6X7VXUTHPFTDZDN/graph.json","events_json":"https://pith.science/api/pith-number/KOV64XHEEHE6X7VXUTHPFTDZDN/events.json","paper":"https://pith.science/paper/KOV64XHE"},"agent_actions":{"view_html":"https://pith.science/pith/KOV64XHEEHE6X7VXUTHPFTDZDN","download_json":"https://pith.science/pith/KOV64XHEEHE6X7VXUTHPFTDZDN.json","view_paper":"https://pith.science/paper/KOV64XHE","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2304.13734&json=true","fetch_graph":"https://pith.science/api/pith-number/KOV64XHEEHE6X7VXUTHPFTDZDN/graph.json","fetch_events":"https://pith.science/api/pith-number/KOV64XHEEHE6X7VXUTHPFTDZDN/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/KOV64XHEEHE6X7VXUTHPFTDZDN/action/timestamp_anchor","attest_storage":"https://pith.science/pith/KOV64XHEEHE6X7VXUTHPFTDZDN/action/storage_attestation","attest_author":"https://pith.science/pith/KOV64XHEEHE6X7VXUTHPFTDZDN/action/author_attestation","sign_citation":"https://pith.science/pith/KOV64XHEEHE6X7VXUTHPFTDZDN/action/citation_signature","submit_replication":"https://pith.science/pith/KOV64XHEEHE6X7VXUTHPFTDZDN/action/replication_record"}},"created_at":"2026-05-17T23:38:49.657810+00:00","updated_at":"2026-05-17T23:38:49.657810+00:00"}