{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:TESXBW475SPYDRJSCOXPFS7VHO","short_pith_number":"pith:TESXBW47","schema_version":"1.0","canonical_sha256":"992570db9fec9f81c53213aef2cbf53b9c222b7196cf5f74b8d7367e70eab110","source":{"kind":"arxiv","id":"2605.12798","version":1},"attestation_state":"computed","paper":{"title":"Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Harmful fine-tuning induces emergent misalignment via data structure interactions rather than isolated examples.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Anupam Nayak, Baris Askin, Carlee Joe-Wong, Gauri Joshi, Guannan Qu, Muhammed Ustaomeroglu","submitted_at":"2026-05-12T22:27:32Z","abstract_excerpt":"Fine-tuning LLMs on narrow harmful datasets can induce Emergent Misalignment (EM), where models exhibit misaligned behavior far beyond the fine-tuning distribution. We argue that emergent misalignment can be better understood as a data-mediated transfer phenomenon: harmful fine-tuning examples do not induce uniform behavioral spillover, but interact with the structural properties of the dataset and the difficulty of the tasks relative to the model. Across our experiments, we find that misalignment appears more readily when fine-tuning and evaluation prompts share similar underlying functional "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2605.12798","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2026-05-12T22:27:32Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"3c067641f8f329a95a6ea8f80ad9a1b697163b2ffe9c6a70c9c762f3007ceca7","abstract_canon_sha256":"eab5810efdaac7f3af5cc8243ac56d189900d12f19f59f00ab9c15909bd6c9f4"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T03:09:12.760435Z","signature_b64":"76AR+XrkM9zVX+UIPsu3HZUGPw5op2ErNiGm9WYVPfrpKxae94NY5QMB6AnSdULri1TNSPwlLOsTUctpsIEkBw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"992570db9fec9f81c53213aef2cbf53b9c222b7196cf5f74b8d7367e70eab110","last_reissued_at":"2026-05-18T03:09:12.759719Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T03:09:12.759719Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Harmful fine-tuning induces emergent misalignment via data structure interactions rather than isolated examples.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Anupam Nayak, Baris Askin, Carlee Joe-Wong, Gauri Joshi, Guannan Qu, Muhammed Ustaomeroglu","submitted_at":"2026-05-12T22:27:32Z","abstract_excerpt":"Fine-tuning LLMs on narrow harmful datasets can induce Emergent Misalignment (EM), where models exhibit misaligned behavior far beyond the fine-tuning distribution. We argue that emergent misalignment can be better understood as a data-mediated transfer phenomenon: harmful fine-tuning examples do not induce uniform behavioral spillover, but interact with the structural properties of the dataset and the difficulty of the tasks relative to the model. Across our experiments, we find that misalignment appears more readily when fine-tuning and evaluation prompts share similar underlying functional "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"misalignment appears more readily when fine-tuning and evaluation prompts share similar underlying functional structure, when prompts leave more room for coherent harmful completions, and when the target behavior has been more reliably learned by the model.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That observed misalignment differences are caused by data-mediated transfer mechanisms rather than uncontrolled differences in model capacity, optimization dynamics, or evaluation prompt difficulty.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Harmful fine-tuning induces emergent misalignment via data structure interactions rather than isolated examples.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b495e868b2ed871677eb5c9bb67bf20c4b327c8dbecc29301958504f73cabbb9"},"source":{"id":"2605.12798","kind":"arxiv","version":1},"verdict":{"id":"e9537c87-ad88-453c-a5ca-f4f6792e1b66","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T20:33:26.004701Z","strongest_claim":"misalignment appears more readily when fine-tuning and evaluation prompts share similar underlying functional structure, when prompts leave more room for coherent harmful completions, and when the target behavior has been more reliably learned by the model.","one_line_summary":"Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That observed misalignment differences are caused by data-mediated transfer mechanisms rather than uncontrolled differences in model capacity, optimization dynamics, or evaluation prompt difficulty.","pith_extraction_headline":"Harmful fine-tuning induces emergent misalignment via data structure interactions rather than isolated examples."},"references":{"count":34,"sample":[{"doi":"","year":2026,"title":"Accessed: 2026-05-04","work_id":"35b3b010-4981-47ae-9625-483e24234fc9","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Persona Vectors: Monitoring and Controlling Character Traits in Language Models","work_id":"cf32dbef-9132-4648-abcb-0ebf3ac3af80","ref_index":2,"cited_arxiv_id":"2507.21509","is_internal_anchor":true},{"doi":"","year":null,"title":"arXiv preprint arXiv:2506.13206 , year=","work_id":"a5233f92-43ab-43c9-a67a-652ee8453eb7","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"arXiv preprint arXiv:2507.14805 , year=","work_id":"5fc1e083-accb-4ae7-bf99-4a5e72ca356f","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","ref_index":5,"cited_arxiv_id":"2507.06261","is_internal_anchor":true}],"resolved_work":34,"snapshot_sha256":"8968149c4eb1adc55c45883bfb4a706854b8a6eeb0bd51d23f5c568998ccec97","internal_anchors":12},"formal_canon":{"evidence_count":2,"snapshot_sha256":"508f1b8716823123af013aaeaa27861e1eb258b96104cdf0ac1863362e98bd22"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.12798","created_at":"2026-05-18T03:09:12.759842+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.12798v1","created_at":"2026-05-18T03:09:12.759842+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.12798","created_at":"2026-05-18T03:09:12.759842+00:00"},{"alias_kind":"pith_short_12","alias_value":"TESXBW475SPY","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"TESXBW475SPYDRJS","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"TESXBW47","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/TESXBW475SPYDRJSCOXPFS7VHO","json":"https://pith.science/pith/TESXBW475SPYDRJSCOXPFS7VHO.json","graph_json":"https://pith.science/api/pith-number/TESXBW475SPYDRJSCOXPFS7VHO/graph.json","events_json":"https://pith.science/api/pith-number/TESXBW475SPYDRJSCOXPFS7VHO/events.json","paper":"https://pith.science/paper/TESXBW47"},"agent_actions":{"view_html":"https://pith.science/pith/TESXBW475SPYDRJSCOXPFS7VHO","download_json":"https://pith.science/pith/TESXBW475SPYDRJSCOXPFS7VHO.json","view_paper":"https://pith.science/paper/TESXBW47","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.12798&json=true","fetch_graph":"https://pith.science/api/pith-number/TESXBW475SPYDRJSCOXPFS7VHO/graph.json","fetch_events":"https://pith.science/api/pith-number/TESXBW475SPYDRJSCOXPFS7VHO/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/TESXBW475SPYDRJSCOXPFS7VHO/action/timestamp_anchor","attest_storage":"https://pith.science/pith/TESXBW475SPYDRJSCOXPFS7VHO/action/storage_attestation","attest_author":"https://pith.science/pith/TESXBW475SPYDRJSCOXPFS7VHO/action/author_attestation","sign_citation":"https://pith.science/pith/TESXBW475SPYDRJSCOXPFS7VHO/action/citation_signature","submit_replication":"https://pith.science/pith/TESXBW475SPYDRJSCOXPFS7VHO/action/replication_record"}},"created_at":"2026-05-18T03:09:12.759842+00:00","updated_at":"2026-05-18T03:09:12.759842+00:00"}