pith.
Pith Number

pith:GBIXPOE6

pith:2024:GBIXPOE6A3GVA43FYXMVF2ZOWL
not attested · not anchored · not stored · refs resolved

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Adam A. Hunt, Adam Khoja, Alexander Pan, Alexandr Wang, Alex Levinson, Alice Gatti, Andrew B. Liu, Andy Zou, Anjali Gopal, Ann-Kathrin Dombrowski, Ariel Herbert-Voss, Bhrugu Bharathi, Brad Jokubaitis, Cort B. Breuer, Dan Hendrycks, Daniel Berrios, David Campbell, Gabriel Mukobi, Ian Steneker, Isabelle Barrass, Jean Wang, Jimmy Ba, John Guan, Justin D. Li, Justin Tienken-Harder, Kallol Krishna Karmakar, Kemper Talley, Kevin M. Esvelt, Kevin Y. Shih, Lennart Justen, Long Phan, Mantas Mazeika, Michael Chen, Mindy Levine, Nathan Helm-Burger, Nathaniel Li, Oam Patel, Oliver Zhang, Palash Oswal, Ponnurangam Kumaraguru, Rassin Lababidi, Rishub Tamirisa, Ruoyu Wang, Russell Kaplan, Samuel Marks, Shashwat Goel, Stephen Fitz, Steven Basart, Summer Yue, Uday Tupakula, Vijay Varadharajan, Weiran Lin, William Qian, Xiaoyuan Zhu, Yan Shoshitaishvili, Zhenqi Zhao, Zifan Wang

The WMDP benchmark publicly measures hazardous knowledge in LLMs, and the RMU unlearning method reduces performance on it while preserving general capabilities.

arxiv:2403.03218 v7 · 2024-03-05 · cs.LG · cs.AI · cs.CL · cs.CY

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{GBIXPOE6A3GVA43FYXMVF2ZOWL}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
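The page does not spell out the merge algorithm. A minimal sketch of what a deterministic merge can look like, assuming each event is a `(timestamp, field, value)` record and that ties are broken by a content hash; the `Event` type, its fields, `event_id` and `merge` are all hypothetical, and signature verification is omitted.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical signed-event payload (signature fields omitted)."""
    timestamp: str  # ISO-8601 UTC in one fixed format, so string order is time order
    field: str      # hypothetical: the state key this event sets
    value: str


def event_id(ev: Event) -> str:
    """SHA-256 of the event's canonical JSON; dedupes events and breaks timestamp ties."""
    payload = json.dumps(asdict(ev), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def merge(base: dict, events: list) -> dict:
    """Apply events to a copy of `base` in one fixed total order.

    Duplicates collapse to one entry and timestamp ties are broken by
    event_id, so every mirror holding the same set of events computes
    the same state, whatever order the events arrived in.
    """
    unique = {event_id(ev): ev for ev in events}
    ordered = sorted(unique.items(), key=lambda item: (item[1].timestamp, item[0]))
    state = dict(base)
    for _, ev in ordered:
        state[ev.field] = ev.value  # later events in the total order overwrite earlier ones
    return state


# Usage: arrival order and duplicate delivery do not change the result.
base = {"author_claim": "open", "citations": "open"}
a = Event("2026-05-18T09:00:00Z", "author_claim", "claimed")
b = Event("2026-05-18T09:00:00Z", "citations", "resolved")
print(merge(base, [a, b]) == merge(base, [b, a, a]))  # True
```

Sorting on `(timestamp, event_id)` rather than on arrival order is what makes the result independent of which mirror did the computing.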

Claims

C1 · strongest claim

RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs.

C2 · weakest assumption

That WMDP questions serve as a reliable proxy for real-world hazardous capabilities and that the unlearning effect generalizes without introducing new vulnerabilities or degrading performance on untested domains.

C3 · one line summary

WMDP is a public benchmark measuring hazardous LLM knowledge across biosecurity, cybersecurity, and chemical security, paired with RMU unlearning that reduces WMDP performance without degrading general capabilities.

References

16 extracted · 16 resolved · 0 Pith anchors

[1] Mouton, Caleb Lucas, and Ella Guest 2024 · doi:10.7249/rra2977-2
[2] Overview. How is this work intended to reduce existential risks from advanced AI systems? Answer: This work aims to mitigate existential risks posed by the malicious use of LLMs in developing bioweapo 2023
[3] Direct Effects. If this work directly reduces existential risks, what are the main hazards, vulnerabilities, or failure modes that it directly affects? Answer: WMDP increases the barrier of entry f
[4] Diffuse Effects. If this work reduces existential risks indirectly or diffusely, what are the main contributing factors that it affects? Answer: Unlearning on WMDP reduces the risks of language model 2024
[5] What’s at Stake? What is a future scenario in which this research direction could prevent the sudden, large-scale loss of life? If not applicable, what is a future scenario in which this research direc 2022

Formal links

2 machine-checked theorem links

Cited by

35 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:50.345849Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

305177b89e06cd507365c5d952eb2eb2defd87a1022827c16353163f2402c6ee
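The values on this page are consistent with the Pith number being the first 26 characters of the RFC 4648 Base32 encoding of the raw canonical-hash bytes (26 characters carry 130 bits). This is an inference from this one record, not a documented rule; `pith_number_from_hash` is a hypothetical helper that reproduces it.

```python
import base64

CANONICAL_HASH = "305177b89e06cd507365c5d952eb2eb2defd87a1022827c16353163f2402c6ee"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode (alphabet A-Z, 2-7) the digest bytes and keep the first `length` chars.

    Hypothetical: inferred from this record, not taken from Pith documentation.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


print(pith_number_from_hash(CANONICAL_HASH))  # GBIXPOE6A3GVA43FYXMVF2ZOWL
```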

Aliases

arxiv: 2403.03218 · arxiv_version: 2403.03218v7 · doi: 10.48550/arxiv.2403.03218 · pith_short_12: GBIXPOE6A3GV · pith_short_16: GBIXPOE6A3GVA43F · pith_short_8: GBIXPOE6
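Each alias above can be rebuilt from the `source` block of the canonical record and the Pith number: the `pith_short_*` values are plain prefixes, and the DOI uses arXiv's `10.48550/arXiv.` prefix, lowercased here. A sketch inferred from the listed values; `aliases` is a hypothetical helper, not a Pith API.

```python
def aliases(pith_number: str, source: dict) -> dict:
    """Rebuild the alias list shown on the page (hypothetical helper)."""
    arxiv_id = source["id"]
    return {
        "arxiv": arxiv_id,
        "arxiv_version": f"{arxiv_id}v{source['version']}",
        "doi": f"10.48550/arxiv.{arxiv_id}",
        "pith_short_12": pith_number[:12],
        "pith_short_16": pith_number[:16],
        "pith_short_8": pith_number[:8],
    }


# `source` copied from the canonical record JSON on this page.
record_source = {"id": "2403.03218", "kind": "arxiv", "version": 7}
print(aliases("GBIXPOE6A3GVA43FYXMVF2ZOWL", record_source)["pith_short_8"])  # GBIXPOE6
```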
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/GBIXPOE6A3GVA43FYXMVF2ZOWL \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 305177b89e06cd507365c5d952eb2eb2defd87a1022827c16353163f2402c6ee
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "18dd989ed4fd400b0ec8c83ee1c46ffa9c0899ceb189bfa75903904593d13afa",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL",
      "cs.CY"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-03-05T18:59:35Z",
    "title_canon_sha256": "e6fc6505fcb49572ec9067b46d99a5088ece716bfa6282c0f93405bafd7b62cb"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2403.03218",
    "kind": "arxiv",
    "version": 7
  }
}
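The `python3 -c` step of the verification pipeline can also be run offline against the record shown above. A minimal sketch: it applies the same canonicalization (sorted keys, no whitespace, non-ASCII kept as UTF-8) and hashes the result. The printed digest should equal the expected hash only if this page shows the canonical record in full; `canonical_sha256` is a hypothetical name.

```python
import hashlib
import json

EXPECTED = "305177b89e06cd507365c5d952eb2eb2defd87a1022827c16353163f2402c6ee"

# The canonical record as displayed on this page.
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "18dd989ed4fd400b0ec8c83ee1c46ffa9c0899ceb189bfa75903904593d13afa",
        "cross_cats_sorted": ["cs.AI", "cs.CL", "cs.CY"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2024-03-05T18:59:35Z",
        "title_canon_sha256": "e6fc6505fcb49572ec9067b46d99a5088ece716bfa6282c0f93405bafd7b62cb",
    },
    "schema_version": "1.0",
    "source": {"id": "2403.03218", "kind": "arxiv", "version": 7},
}


def canonical_sha256(record: dict) -> str:
    """Same serialization as the curl pipeline's python3 step, then SHA-256."""
    blob = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


digest = canonical_sha256(CANONICAL_RECORD)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Because `sort_keys=True` fixes the key order, the digest does not depend on how the JSON was laid out or which order a client received the fields in.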