{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:GBIXPOE6A3GVA43FYXMVF2ZOWL","short_pith_number":"pith:GBIXPOE6","schema_version":"1.0","canonical_sha256":"305177b89e06cd507365c5d952eb2eb2defd87a1022827c16353163f2402c6ee","source":{"kind":"arxiv","id":"2403.03218","version":7},"attestation_state":"computed","paper":{"title":"The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"The WMDP benchmark publicly measures hazardous knowledge in LLMs, and the RMU unlearning method reduces performance on it while preserving general capabilities.","cross_cats":["cs.AI","cs.CL","cs.CY"],"primary_cat":"cs.LG","authors_text":"Adam A. Hunt, Adam Khoja, Alexander Pan, Alexandr Wang, Alex Levinson, Alice Gatti, Andrew B. Liu, Andy Zou, Anjali Gopal, Ann-Kathrin Dombrowski, Ariel Herbert-Voss, Bhrugu Bharathi, Brad Jokubaitis, Cort B. Breuer, Dan Hendrycks, Daniel Berrios, David Campbell, Gabriel Mukobi, Ian Steneker, Isabelle Barrass, Jean Wang, Jimmy Ba, John Guan, Justin D. Li, Justin Tienken-Harder, Kallol Krishna Karmakar, Kemper Talley, Kevin M. Esvelt, Kevin Y. Shih, Lennart Justen, Long Phan, Mantas Mazeika, Michael Chen, Mindy Levine, Nathan Helm-Burger, Nathaniel Li, Oam Patel, Oliver Zhang, Palash Oswal, Ponnurangam Kumaraguru, Rassin Lababidi, Rishub Tamirisa, Ruoyu Wang, Russell Kaplan, Samuel Marks, Shashwat Goel, Stephen Fitz, Steven Basart, Summer Yue, Uday Tupakula, Vijay Varadharajan, Weiran Lin, William Qian, Xiaoyuan Zhu, Yan Shoshitaishvili, Zhenqi Zhao, Zifan Wang","submitted_at":"2024-03-05T18:59:35Z","abstract_excerpt":"The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP)"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2403.03218","kind":"arxiv","version":7},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2024-03-05T18:59:35Z","cross_cats_sorted":["cs.AI","cs.CL","cs.CY"],"title_canon_sha256":"e6fc6505fcb49572ec9067b46d99a5088ece716bfa6282c0f93405bafd7b62cb","abstract_canon_sha256":"18dd989ed4fd400b0ec8c83ee1c46ffa9c0899ceb189bfa75903904593d13afa"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.346360Z","signature_b64":"Ks21z51YyFAow58pGLZAbTxyOf4U3vu64pTOHpr4/cbpBC+jmjLP2mT9X0UCL0StW3LP+/zML3NvMptQedHtAQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"305177b89e06cd507365c5d952eb2eb2defd87a1022827c16353163f2402c6ee","last_reissued_at":"2026-05-17T23:38:50.345849Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.345849Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"The WMDP benchmark publicly measures hazardous knowledge in LLMs, and the RMU unlearning method reduces performance on it while preserving general capabilities.","cross_cats":["cs.AI","cs.CL","cs.CY"],"primary_cat":"cs.LG","authors_text":"Adam A. Hunt, Adam Khoja, Alexander Pan, Alexandr Wang, Alex Levinson, Alice Gatti, Andrew B. Liu, Andy Zou, Anjali Gopal, Ann-Kathrin Dombrowski, Ariel Herbert-Voss, Bhrugu Bharathi, Brad Jokubaitis, Cort B. Breuer, Dan Hendrycks, Daniel Berrios, David Campbell, Gabriel Mukobi, Ian Steneker, Isabelle Barrass, Jean Wang, Jimmy Ba, John Guan, Justin D. Li, Justin Tienken-Harder, Kallol Krishna Karmakar, Kemper Talley, Kevin M. Esvelt, Kevin Y. Shih, Lennart Justen, Long Phan, Mantas Mazeika, Michael Chen, Mindy Levine, Nathan Helm-Burger, Nathaniel Li, Oam Patel, Oliver Zhang, Palash Oswal, Ponnurangam Kumaraguru, Rassin Lababidi, Rishub Tamirisa, Ruoyu Wang, Russell Kaplan, Samuel Marks, Shashwat Goel, Stephen Fitz, Steven Basart, Summer Yue, Uday Tupakula, Vijay Varadharajan, Weiran Lin, William Qian, Xiaoyuan Zhu, Yan Shoshitaishvili, Zhenqi Zhao, Zifan Wang","submitted_at":"2024-03-05T18:59:35Z","abstract_excerpt":"The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP)"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That WMDP questions serve as a reliable proxy for real-world hazardous capabilities and that the unlearning effect generalizes without introducing new vulnerabilities or degrading performance on untested domains.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"WMDP is a public benchmark measuring hazardous LLM knowledge across biosecurity, cybersecurity, and chemical security, paired with RMU unlearning that reduces WMDP performance without degrading general capabilities.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"The WMDP benchmark publicly measures hazardous knowledge in LLMs, and the RMU unlearning method reduces performance on it while preserving general capabilities.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"bea4d0079ba9181c7cfadcadc907dbf1b50e96b11de994e01748785591cdeb80"},"source":{"id":"2403.03218","kind":"arxiv","version":7},"verdict":{"id":"030a47ad-b114-4017-942f-10f026ca271c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T19:51:44.059609Z","strongest_claim":"RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs.","one_line_summary":"WMDP is a public benchmark measuring hazardous LLM knowledge across biosecurity, cybersecurity, and chemical security, paired with RMU unlearning that reduces WMDP performance without degrading general capabilities.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That WMDP questions serve as a reliable proxy for real-world hazardous capabilities and that the unlearning effect generalizes without introducing new vulnerabilities or degrading performance on untested domains.","pith_extraction_headline":"The WMDP benchmark publicly measures hazardous knowledge in LLMs, and the RMU unlearning method reduces performance on it while preserving general capabilities."},"references":{"count":16,"sample":[{"doi":"10.7249/rra2977-2","year":2024,"title":"Mouton, Caleb Lucas, and Ella Guest","work_id":"41d7f1fc-ab86-42f8-86af-fcf49acf5706","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Overview. How is this work intended to reduce existential risks from advanced AI systems? Answer: This work aims to mitigate existential risks posed by the malicious use of LLMs in developing bioweapo","work_id":"2d59f65c-702c-4bf7-9d56-0b4d2119c669","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Direct Effects. If this work directly reduces existential risks, what are the main hazards, vulnerabilities, or failure modes that it directly affects? 29 Answer: WMDP increases the barrier of entry f","work_id":"9f884a4c-f42d-4436-9dd0-14d1a8f4016d","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Diffuse Effects. If this work reduces existential risks indirectly or diffusely, what are the main contributing factors that it affects? Answer: Unlearning on WMDP reduces the risks of language model ","work_id":"001daaa4-2755-4d45-8650-a927fefefcc5","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"What’s at Stake?What is a future scenario in which this research direction could prevent the sudden, large-scale loss of life? If not applicable, what is a future scenario in which this research direc","work_id":"2d8df976-4c12-43ab-b15a-ebfbfc1a80ce","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":16,"snapshot_sha256":"fd40d9b9c133c183092f7a938dbb4a3ca231ab3889074420390abe6f5a5dd0af","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b68cd182c11b256f5231162902684f8240ede332cd968c50ac4009e23a7f6465"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2403.03218","created_at":"2026-05-17T23:38:50.345935+00:00"},{"alias_kind":"arxiv_version","alias_value":"2403.03218v7","created_at":"2026-05-17T23:38:50.345935+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2403.03218","created_at":"2026-05-17T23:38:50.345935+00:00"},{"alias_kind":"pith_short_12","alias_value":"GBIXPOE6A3GV","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"GBIXPOE6A3GVA43F","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"GBIXPOE6","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":35,"internal_anchor_count":35,"sample":[{"citing_arxiv_id":"2605.22643","citing_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2407.13059","citing_title":"Prioritizing High-Consequence Biological Capabilities in Evaluations of Artificial Intelligence Models","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21602","citing_title":"Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs","ref_index":77,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22643","citing_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21545","citing_title":"RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20915","citing_title":"Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16746","citing_title":"State Contamination in Memory-Augmented LLM Agents","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21251","citing_title":"CAP: Controllable Alignment Prompting for Unlearning in LLMs","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09391","citing_title":"Do Linear Probes Generalize Better in Persona Coordinates?","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2506.06414","citing_title":"Benchmarking Misuse Mitigation Against Covert Adversaries","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2506.20941","citing_title":"Revisiting the Past: Data Unlearning with Model State History","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2507.06261","citing_title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22483","citing_title":"OFMU: Optimization-Driven Framework for Machine Unlearning","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2510.00761","citing_title":"Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2511.17408","citing_title":"The Impact of Off-Policy Training Data on Probe Generalisation","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2404.05868","citing_title":"Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2603.01589","citing_title":"SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14514","citing_title":"Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2603.22869","citing_title":"Chain-of-Authorization: Embedding authorization into large language models","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13595","citing_title":"Inducing Artificial Uncertainty in Language Models","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11685","citing_title":"Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21251","citing_title":"CAP: Controllable Alignment Prompting for Unlearning in LLMs","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09391","citing_title":"Do Linear Probes Generalize Better in Persona Coordinates?","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24966","citing_title":"Risk Reporting for Developers' Internal AI Model Use","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03547","citing_title":"Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models","ref_index":54,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/GBIXPOE6A3GVA43FYXMVF2ZOWL","json":"https://pith.science/pith/GBIXPOE6A3GVA43FYXMVF2ZOWL.json","graph_json":"https://pith.science/api/pith-number/GBIXPOE6A3GVA43FYXMVF2ZOWL/graph.json","events_json":"https://pith.science/api/pith-number/GBIXPOE6A3GVA43FYXMVF2ZOWL/events.json","paper":"https://pith.science/paper/GBIXPOE6"},"agent_actions":{"view_html":"https://pith.science/pith/GBIXPOE6A3GVA43FYXMVF2ZOWL","download_json":"https://pith.science/pith/GBIXPOE6A3GVA43FYXMVF2ZOWL.json","view_paper":"https://pith.science/paper/GBIXPOE6","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2403.03218&json=true","fetch_graph":"https://pith.science/api/pith-number/GBIXPOE6A3GVA43FYXMVF2ZOWL/graph.json","fetch_events":"https://pith.science/api/pith-number/GBIXPOE6A3GVA43FYXMVF2ZOWL/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/GBIXPOE6A3GVA43FYXMVF2ZOWL/action/timestamp_anchor","attest_storage":"https://pith.science/pith/GBIXPOE6A3GVA43FYXMVF2ZOWL/action/storage_attestation","attest_author":"https://pith.science/pith/GBIXPOE6A3GVA43FYXMVF2ZOWL/action/author_attestation","sign_citation":"https://pith.science/pith/GBIXPOE6A3GVA43FYXMVF2ZOWL/action/citation_signature","submit_replication":"https://pith.science/pith/GBIXPOE6A3GVA43FYXMVF2ZOWL/action/replication_record"}},"created_at":"2026-05-17T23:38:50.345935+00:00","updated_at":"2026-05-17T23:38:50.345935+00:00"}