pith. machine review for the scientific record.
Pith Number

pith:76DPSGXM

pith:2024:76DPSGXMOWWHM6WLVLLTQ5QWDU
not attested · not anchored · not stored · refs resolved

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Allyson Ettinger, Bill Yuchen Lin, Kavel Rao, Liwei Jiang, Nathan Lambert, Nouha Dziri, Seungju Han, Yejin Choi

WildGuard is an open moderation tool that detects malicious prompts, response risks, and refusal behaviors in LLMs with accuracy matching or exceeding GPT-4 on key tasks.

arxiv:2406.18495 v3 · 2024-06-26 · cs.CL

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
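The page does not describe the merge algorithm itself. As a rough illustration of what "deterministic" means here, the sketch below folds a set of events into a state that is the same regardless of arrival order or duplication. Everything in it is an assumption made for illustration: the event fields (`kind`, `ref`, `ts`), the content-derived `event_id`, and the `merge` function are hypothetical, and signature verification is omitted.

```python
import hashlib
import json
import random


def event_id(event: dict) -> str:
    """Hypothetical content-derived id, so identical events collapse to one key."""
    blob = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def merge(record: dict, events: list) -> dict:
    """Fold events into a state that does not depend on arrival order."""
    # Deduplicate by content id, then fix a total order: timestamp, then id.
    unique = {event_id(e): e for e in events}
    ordered = sorted(unique.items(), key=lambda kv: (kv[1]["ts"], kv[0]))
    state = {"record": record, "citations": [], "replications": []}
    for _, e in ordered:
        if e["kind"] == "citation":
            state["citations"].append(e["ref"])
        elif e["kind"] == "replication":
            state["replications"].append(e["ref"])
    return state


# Illustrative data only; these events are not taken from the real bundle.
record = {"source": {"id": "2406.18495", "kind": "arxiv", "version": 3}}
events = [
    {"kind": "citation", "ref": "paper-A", "ts": "2026-01-02T00:00:00Z"},
    {"kind": "replication", "ref": "lab-B", "ts": "2026-01-03T00:00:00Z"},
    {"kind": "citation", "ref": "paper-C", "ts": "2026-01-01T00:00:00Z"},
]

shuffled = events + [events[0]]  # a mirror may also receive duplicates
random.shuffle(shuffled)
print(merge(record, events) == merge(record, shuffled))
```

The design point is that the result depends only on the set of events, not on the sequence in which a mirror happened to receive them.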

Claims

C1 · strongest claim

WildGuard establishes state-of-the-art performance in open-source safety moderation across all three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.
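To put the jailbreak figures in the claim above on a common scale, the short calculation below derives the absolute and relative reduction from the two quoted success rates. The variable names are illustrative; only the 79.8% and 2.4% values come from the claim.

```python
# Jailbreak success rates quoted in the claim, in percent.
before = 79.8
after = 2.4

absolute_drop = before - after                 # percentage points
relative_drop = absolute_drop / before * 100   # percent of the original rate

print(f"{absolute_drop:.1f} points, {relative_drop:.1f}% relative")
```

In other words, the quoted drop corresponds to roughly 77 percentage points, or about a 97% relative reduction in successful attacks.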

C2 · weakest assumption

The human-annotated WildGuardTest set of 5K items and the broader WildGuardMix dataset are assumed to be representative of real-world user prompts, adversarial jailbreaks, and model behaviors, and performance gains are assumed to generalize beyond the ten public benchmarks evaluated.

C3 · one-line summary

WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusals, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.

References

63 extracted · 63 resolved · 9 Pith anchors

[1] GPT-4 Technical Report, 2023 · arXiv:2303.08774
[2] Llama 3 model card, 2024
[3] The Claude 3 model family: Opus, Sonnet, Haiku
[4] Transactions on Machine Learning Research, 2024
[5] Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

Formal links

2 machine-checked theorem links

Cited by

17 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:13.600412Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

ff86f91aec75ac767acbaad73876161d00d977ea794c9085178f91385ce9e355

Aliases

arxiv: 2406.18495 · arxiv_version: 2406.18495v3 · doi: 10.48550/arxiv.2406.18495 · pith_short_12: 76DPSGXMOWWH · pith_short_16: 76DPSGXMOWWHM6WL · pith_short_8: 76DPSGXM
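The page does not state how the 26-character identifier relates to the canonical hash. One reading that is consistent with the values shown here is that it is the unpadded Base32 encoding of the first 16 bytes of the SHA-256 digest, with the `pith_short_*` aliases as prefixes of that string. The sketch below checks that reading; the function name `pith_id_from_hash` is hypothetical, and the derivation is an inference from this one record rather than a documented rule.

```python
import base64

# Canonical hash as published on this page.
canonical_hash = "ff86f91aec75ac767acbaad73876161d00d977ea794c9085178f91385ce9e355"


def pith_id_from_hash(hex_digest: str) -> str:
    """Assumed derivation: Base32 of the first 16 digest bytes, padding stripped."""
    first16 = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(first16).decode("ascii").rstrip("=")


pith_id = pith_id_from_hash(canonical_hash)
aliases = {f"pith_short_{n}": pith_id[:n] for n in (8, 12, 16)}

print(pith_id)
print(aliases)
```

If the assumption holds for other records too, a client could recompute the identifier locally from the hash instead of trusting the alias list.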
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/76DPSGXMOWWHM6WLVLLTQ5QWDU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: ff86f91aec75ac767acbaad73876161d00d977ea794c9085178f91385ce9e355
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "076b5fdce0133c1dbc9fa3d56c2226b1e5cc3ceb3b521afa3011331488f4a91d",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-06-26T16:58:20Z",
    "title_canon_sha256": "f05660010b34991862d4ad5d0ef784d10c5a15542480b9ea3cca4333ffd5706b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2406.18495",
    "kind": "arxiv",
    "version": 3
  }
}
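The verification one-liner above can also be written as a standalone script. The sketch below applies the same canonicalization (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) to the record as displayed on this page and hashes it with SHA-256. The function name `canonical_hash` is illustrative, and whether the digest equals the published hash depends on the displayed JSON being value-for-value what the API returns.

```python
import hashlib
import json

# Canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "076b5fdce0133c1dbc9fa3d56c2226b1e5cc3ceb3b521afa3011331488f4a91d",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-06-26T16:58:20Z",
        "title_canon_sha256": "f05660010b34991862d4ad5d0ef784d10c5a15542480b9ea3cca4333ffd5706b",
    },
    "schema_version": "1.0",
    "source": {"id": "2406.18495", "kind": "arxiv", "version": 3},
}


def canonical_hash(obj) -> str:
    """Serialize the same way as the one-liner, then take the SHA-256 hex digest."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_hash(record)
print(digest)  # compare against the published canonical hash
```

Because keys are sorted before hashing, key order and whitespace in the source JSON do not affect the digest; only the values do.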