pith:76DPSGXM
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
WildGuard is an open moderation tool that detects malicious prompts, response risks, and refusal behaviors in LLMs with accuracy matching or exceeding GPT-4 on key tasks.
arxiv:2406.18495 v3 · 2024-06-26 · cs.CL
Claims
WildGuard establishes state-of-the-art performance in open-source safety moderation across all the three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.
Underlying premise: the human-annotated WildGuardTest set of 5K items and the broader WildGuardMix dataset are representative of real-world user prompts, adversarial jailbreaks, and model behaviors, and the performance gains will generalize beyond the ten public benchmarks evaluated.
WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.
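The jailbreak figures in the claims above (79.8% before moderation, 2.4% after) can be restated as an absolute and a relative reduction. The short sketch below does only that arithmetic; the function names are illustrative, and the relative figure is derived here rather than quoted from the paper.

```python
def absolute_drop(before_pct: float, after_pct: float) -> float:
    """Reduction in percentage points."""
    return before_pct - after_pct


def relative_drop(before_pct: float, after_pct: float) -> float:
    """Reduction as a percentage of the starting rate."""
    return (before_pct - after_pct) / before_pct * 100.0


# Attack success rates reported in the claims above.
BEFORE = 79.8
AFTER = 2.4

if __name__ == "__main__":
    print(f"absolute: {absolute_drop(BEFORE, AFTER):.1f} points")
    print(f"relative: {relative_drop(BEFORE, AFTER):.1f}%")
```

Under these numbers the drop is about 77.4 percentage points, or roughly 97% of the original attack success rate.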
Receipt and verification
| First computed | 2026-05-17T23:38:13.600412Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
ff86f91aec75ac767acbaad73876161d00d977ea794c9085178f91385ce9e355
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/76DPSGXMOWWHM6WLVLLTQ5QWDU \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: ff86f91aec75ac767acbaad73876161d00d977ea794c9085178f91385ce9e355
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "076b5fdce0133c1dbc9fa3d56c2226b1e5cc3ceb3b521afa3011331488f4a91d",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-06-26T16:58:20Z",
    "title_canon_sha256": "f05660010b34991862d4ad5d0ef784d10c5a15542480b9ea3cca4333ffd5706b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2406.18495",
    "kind": "arxiv",
    "version": 3
  }
}
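The verification command above can also be run offline against the record as printed here. The following is a minimal sketch that applies the same serialization as the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding) and hashes the result with SHA-256. The names `canonical_sha256`, `CANONICAL_RECORD`, and `EXPECTED_HASH` are illustrative and not part of any Pith API. Whether the digest matches depends on the record being copied byte-for-byte, so no result is asserted here.

```python
import hashlib
import json

# Canonical record, copied from the JSON shown on this page.
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "076b5fdce0133c1dbc9fa3d56c2226b1e5cc3ceb3b521afa3011331488f4a91d",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-06-26T16:58:20Z",
        "title_canon_sha256": "f05660010b34991862d4ad5d0ef784d10c5a15542480b9ea3cca4333ffd5706b",
    },
    "schema_version": "1.0",
    "source": {"id": "2406.18495", "kind": "arxiv", "version": 3},
}

# Hash published on this page under "Canonical hash".
EXPECTED_HASH = "ff86f91aec75ac767acbaad73876161d00d977ea794c9085178f91385ce9e355"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    digest = canonical_sha256(CANONICAL_RECORD)
    print(digest)
    print("match" if digest == EXPECTED_HASH else "mismatch")
```

Because keys are sorted before hashing, the order in which fields appear in the source JSON does not affect the digest; only the values and the serialization settings do.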