{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2022:74UBTIODDX6EMRYYY6ELOF3D5W","short_pith_number":"pith:74UBTIOD","schema_version":"1.0","canonical_sha256":"ff2819a1c31dfc464718c788b71763edb23f1ce2441e7d06e473ec67f3c08d7f","source":{"kind":"arxiv","id":"2212.03827","version":2},"attestation_state":"computed","paper":{"title":"Discovering Latent Knowledge in Language Models Without Supervision","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A linear direction in language model activations encodes latent truth and can be found without any supervision or labels.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Collin Burns, Dan Klein, Haotian Ye, Jacob Steinhardt","submitted_at":"2022-12-07T18:17:56Z","abstract_excerpt":"Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in act"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2212.03827","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2022-12-07T18:17:56Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"940d05e7e724e756956efa4b328953716661c2bade06ea0c1d4e77697fb7e3fe","abstract_canon_sha256":"82cf4ebd82f3e91f32f31573966c90b7838681813b8b357bad94186233eec8a5"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.244246Z","signature_b64":"otaWGb/cfl2SB8NqsptS0pXTw6vvEKixhTQuJPXWO3oAM+VAH9G8xedXRLD+7ZZ0WzIU2jyrNeu8+h04sFgwDg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"ff2819a1c31dfc464718c788b71763edb23f1ce2441e7d06e473ec67f3c08d7f","last_reissued_at":"2026-05-17T23:38:50.243573Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.243573Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Discovering Latent Knowledge in Language Models Without Supervision","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A linear direction in language model activations encodes latent truth and can be found without any supervision or labels.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Collin Burns, Dan Klein, Haotian Ye, Jacob Steinhardt","submitted_at":"2022-12-07T18:17:56Z","abstract_excerpt":"Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in act"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across 6 models and 10 question-answering datasets, the method recovers diverse knowledge represented in large language models and outperforms zero-shot accuracy by 4% on average, while cutting prompt sensitivity in half and maintaining accuracy even when models are prompted to generate incorrect answers.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That there exists a single linear direction in activation space whose projections satisfy logical consistency (statement and negation have opposite values) and that this direction corresponds to the model's latent knowledge of truth rather than some other consistent property.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A linear direction in language model activations encodes latent truth and can be found without any supervision or labels.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0750aab06b2e2d94cdbc25df266a55bdddee79729428c7b9f13c5879c4884238"},"source":{"id":"2212.03827","kind":"arxiv","version":2},"verdict":{"id":"6e8d2459-5d76-4336-affd-803506e6bd63","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T20:30:47.043212Z","strongest_claim":"Across 6 models and 10 question-answering datasets, the method recovers diverse knowledge represented in large language models and outperforms zero-shot accuracy by 4% on average, while cutting prompt sensitivity in half and maintaining accuracy even when models are prompted to generate incorrect answers.","one_line_summary":"An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That there exists a single linear direction in activation space whose projections satisfy logical consistency (statement and negation have opposite values) and that this direction corresponds to the model's latent knowledge of truth rather than some other consistent property.","pith_extraction_headline":"A linear direction in language model activations encodes latent truth and can be found without any supervision or labels."},"references":{"count":44,"sample":[{"doi":"","year":null,"title":"A General Language Assistant as a Laboratory for Alignment","work_id":"a43f9ea0-01be-47d5-b8ee-a1a9f73381c5","ref_index":1,"cited_arxiv_id":"2112.00861","is_internal_anchor":true},{"doi":"","year":null,"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","ref_index":2,"cited_arxiv_id":"2204.05862","is_internal_anchor":true},{"doi":"","year":2021,"title":"Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell","work_id":"7a4bf523-8393-4178-954c-f3e957fdec18","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","ref_index":4,"cited_arxiv_id":"2108.07258","is_internal_anchor":true},{"doi":"","year":2005,"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","ref_index":5,"cited_arxiv_id":"2005.14165","is_internal_anchor":true}],"resolved_work":44,"snapshot_sha256":"1a1d7c51a2913378645f43a4cfcd81ec51af6cfcbd2d2da470aa7f9ee980d191","internal_anchors":25},"formal_canon":{"evidence_count":3,"snapshot_sha256":"40b96fc3a614b7928e3ef1ac5db02c4ab54c984cb07b31891ae6d4f1dda0d720"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2212.03827","created_at":"2026-05-17T23:38:50.243684+00:00"},{"alias_kind":"arxiv_version","alias_value":"2212.03827v2","created_at":"2026-05-17T23:38:50.243684+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2212.03827","created_at":"2026-05-17T23:38:50.243684+00:00"},{"alias_kind":"pith_short_12","alias_value":"74UBTIODDX6E","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"74UBTIODDX6EMRYY","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"74UBTIOD","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":28,"internal_anchor_count":28,"sample":[{"citing_arxiv_id":"2605.22864","citing_title":"Reading Calibrated Uncertainty from Language Model Trajectories","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09252","citing_title":"LLM Agents Already Know When to Call Tools -- Even Without Reasoning","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21770","citing_title":"Manifold-Guided Attention Steering","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2506.18852","citing_title":"Mechanistic Interpretability Needs Philosophy","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2603.17839","citing_title":"How do LLMs Compute Verbal Confidence","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18792","citing_title":"Trust or Abstain? A Self-Aware RAG Approach","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17877","citing_title":"PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2409.12917","citing_title":"Training Language Models to Self-Correct via Reinforcement Learning","ref_index":137,"is_internal_anchor":true},{"citing_arxiv_id":"2304.13734","citing_title":"The Internal State of an LLM Knows When It's Lying","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2602.20338","citing_title":"Emergent Manifold Separability during Reasoning in Large Language Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2603.18373","citing_title":"To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10310","citing_title":"Positive Alignment: Artificial Intelligence for Human Flourishing","ref_index":169,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13082","citing_title":"The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12809","citing_title":"Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces","ref_index":137,"is_internal_anchor":true},{"citing_arxiv_id":"2406.11717","citing_title":"Refusal in Language Models Is Mediated by a Single Direction","ref_index":123,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12412","citing_title":"Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space","ref_index":70,"is_internal_anchor":true},{"citing_arxiv_id":"2311.05232","citing_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27401","citing_title":"Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09239","citing_title":"Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09252","citing_title":"LLM Agents Already Know When to Call Tools -- Even Without Reasoning","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09195","citing_title":"The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05715","citing_title":"Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22271","citing_title":"How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06277","citing_title":"Weakly Supervised Distillation of Hallucination Signals into Transformer Representations","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15741","citing_title":"Learning Uncertainty from Sequential Internal Dispersion in Large Language Models","ref_index":3,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/74UBTIODDX6EMRYYY6ELOF3D5W","json":"https://pith.science/pith/74UBTIODDX6EMRYYY6ELOF3D5W.json","graph_json":"https://pith.science/api/pith-number/74UBTIODDX6EMRYYY6ELOF3D5W/graph.json","events_json":"https://pith.science/api/pith-number/74UBTIODDX6EMRYYY6ELOF3D5W/events.json","paper":"https://pith.science/paper/74UBTIOD"},"agent_actions":{"view_html":"https://pith.science/pith/74UBTIODDX6EMRYYY6ELOF3D5W","download_json":"https://pith.science/pith/74UBTIODDX6EMRYYY6ELOF3D5W.json","view_paper":"https://pith.science/paper/74UBTIOD","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2212.03827&json=true","fetch_graph":"https://pith.science/api/pith-number/74UBTIODDX6EMRYYY6ELOF3D5W/graph.json","fetch_events":"https://pith.science/api/pith-number/74UBTIODDX6EMRYYY6ELOF3D5W/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/74UBTIODDX6EMRYYY6ELOF3D5W/action/timestamp_anchor","attest_storage":"https://pith.science/pith/74UBTIODDX6EMRYYY6ELOF3D5W/action/storage_attestation","attest_author":"https://pith.science/pith/74UBTIODDX6EMRYYY6ELOF3D5W/action/author_attestation","sign_citation":"https://pith.science/pith/74UBTIODDX6EMRYYY6ELOF3D5W/action/citation_signature","submit_replication":"https://pith.science/pith/74UBTIODDX6EMRYYY6ELOF3D5W/action/replication_record"}},"created_at":"2026-05-17T23:38:50.243684+00:00","updated_at":"2026-05-17T23:38:50.243684+00:00"}