{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:6W7OXTAIG7NINMBJSTPPYRCD5T","short_pith_number":"pith:6W7OXTAI","schema_version":"1.0","canonical_sha256":"f5beebcc0837da86b02994defc4443ecf487aefdf1df4577fa756af1d02069e2","source":{"kind":"arxiv","id":"2309.16042","version":2},"attestation_state":"computed","paper":{"title":"Towards Best Practices of Activation Patching in Language Models: Metrics and Methods","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Varying metrics and corruption methods in activation patching can produce conflicting pictures of which model components matter.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Fred Zhang, Neel Nanda","submitted_at":"2023-09-27T21:53:56Z","abstract_excerpt":"Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several s"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2309.16042","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2023-09-27T21:53:56Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"ddff9c6515e0ed0b541d738fa2ef96374070398ddde2305c90772355b2954c95","abstract_canon_sha256":"080c9b3f05b25967043221c33a05f6dfd8524334bf4eeacae0d6cdfa03a8f9f7"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.187755Z","signature_b64":"zjjlz2oREvC9FNvkI7c4pPdKgNnUHFzpfqShma5IRoW6gRxFiZCcAoQV7mukWGeDMY8l9ToEsX/8aa82NalSAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"f5beebcc0837da86b02994defc4443ecf487aefdf1df4577fa756af1d02069e2","last_reissued_at":"2026-05-17T23:38:14.187241Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.187241Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Towards Best Practices of Activation Patching in Language Models: Metrics and Methods","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Varying metrics and corruption methods in activation patching can produce conflicting pictures of which model components matter.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Fred Zhang, Neel Nanda","submitted_at":"2023-09-27T21:53:56Z","abstract_excerpt":"Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several s"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters could lead to disparate interpretability results.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the specific localization and circuit discovery tasks and models examined are representative enough for the derived recommendations to apply broadly to activation patching usage.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Varying metrics and corruption methods in activation patching can produce conflicting pictures of which model components matter.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"866ed739c24bdfcac90abbfec1887a47e1a77a3c3b9ddc089b69f56ddae56751"},"source":{"id":"2309.16042","kind":"arxiv","version":2},"verdict":{"id":"0eb3d8d6-1971-46dc-8e65-468f79283566","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T11:52:22.351648Z","strongest_claim":"In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters could lead to disparate interpretability results.","one_line_summary":"Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the specific localization and circuit discovery tasks and models examined are representative enough for the derived recommendations to apply broadly to activation patching usage.","pith_extraction_headline":"Varying metrics and corruption methods in activation patching can produce conflicting pictures of which model components matter."},"references":{"count":108,"sample":[{"doi":"","year":null,"title":"Advances in Neural Information Processing Systems (NeurIPS) , year=","work_id":"316fad02-2c3a-444b-922d-261eeba30a82","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases , author=","work_id":"93c2f998-c721-4a6c-b412-54e0215ffc0a","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors","work_id":"aa0593e4-0654-43cf-9166-8c4ed45b9572","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"International Conference on Artificial Intelligence and Statistics (AISTATS) , year=","work_id":"6f4c7085-1394-4307-a7e4-9ad7051bec8a","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"A circuit for","work_id":"dc404f52-df3f-40fb-9ff3-2b9338d96e39","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":108,"snapshot_sha256":"805b0004ffb8f2d3b19e5f98858dd74ac1ae9e769c1315926d2010843af53682","internal_anchors":3},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2309.16042","created_at":"2026-05-17T23:38:14.187330+00:00"},{"alias_kind":"arxiv_version","alias_value":"2309.16042v2","created_at":"2026-05-17T23:38:14.187330+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2309.16042","created_at":"2026-05-17T23:38:14.187330+00:00"},{"alias_kind":"pith_short_12","alias_value":"6W7OXTAIG7NI","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"6W7OXTAIG7NINMBJ","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"6W7OXTAI","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":22,"internal_anchor_count":22,"sample":[{"citing_arxiv_id":"2603.17839","citing_title":"How do LLMs Compute Verbal Confidence","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12770","citing_title":"WriteSAE: Sparse Autoencoders for Recurrent State","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12770","citing_title":"WriteSAE: Sparse Autoencoders for Recurrent State","ref_index":106,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12991","citing_title":"Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2509.14837","citing_title":"V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2404.15255","citing_title":"How to use and interpret activation patching","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12770","citing_title":"WriteSAE: Sparse Autoencoders for Recurrent State","ref_index":106,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12770","citing_title":"WriteSAE: Sparse Autoencoders for Recurrent State","ref_index":106,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12809","citing_title":"Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces","ref_index":93,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12991","citing_title":"Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02605","citing_title":"Do Audio-Visual Large Language Models Really See and Hear?","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04982","citing_title":"CURE:Circuit-Aware Unlearning for LLM-based Recommendation","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09314","citing_title":"How LLMs Are Persuaded: A Few Attention Heads, Rerouted","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09129","citing_title":"Data-driven Circuit Discovery for Interpretability of Language Models","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23877","citing_title":"Knowledge Vector of Logical Reasoning in Large Language Models","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22271","citing_title":"How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19260","citing_title":"Understanding the Mechanism of Altruism in Large Language Models","ref_index":277,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19052","citing_title":"Cell-Based Representation of Relational Binding in Language Models","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19826","citing_title":"Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11467","citing_title":"From Attribution to Action: A Human-Centered Application of Activation Steering","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10326","citing_title":"Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12426","citing_title":"Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task","ref_index":27,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/6W7OXTAIG7NINMBJSTPPYRCD5T","json":"https://pith.science/pith/6W7OXTAIG7NINMBJSTPPYRCD5T.json","graph_json":"https://pith.science/api/pith-number/6W7OXTAIG7NINMBJSTPPYRCD5T/graph.json","events_json":"https://pith.science/api/pith-number/6W7OXTAIG7NINMBJSTPPYRCD5T/events.json","paper":"https://pith.science/paper/6W7OXTAI"},"agent_actions":{"view_html":"https://pith.science/pith/6W7OXTAIG7NINMBJSTPPYRCD5T","download_json":"https://pith.science/pith/6W7OXTAIG7NINMBJSTPPYRCD5T.json","view_paper":"https://pith.science/paper/6W7OXTAI","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2309.16042&json=true","fetch_graph":"https://pith.science/api/pith-number/6W7OXTAIG7NINMBJSTPPYRCD5T/graph.json","fetch_events":"https://pith.science/api/pith-number/6W7OXTAIG7NINMBJSTPPYRCD5T/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/6W7OXTAIG7NINMBJSTPPYRCD5T/action/timestamp_anchor","attest_storage":"https://pith.science/pith/6W7OXTAIG7NINMBJSTPPYRCD5T/action/storage_attestation","attest_author":"https://pith.science/pith/6W7OXTAIG7NINMBJSTPPYRCD5T/action/author_attestation","sign_citation":"https://pith.science/pith/6W7OXTAIG7NINMBJSTPPYRCD5T/action/citation_signature","submit_replication":"https://pith.science/pith/6W7OXTAIG7NINMBJSTPPYRCD5T/action/replication_record"}},"created_at":"2026-05-17T23:38:14.187330+00:00","updated_at":"2026-05-17T23:38:14.187330+00:00"}