Pith Number

pith:76DPSGXM

pith:2024:76DPSGXMOWWHM6WLVLLTQ5QWDU

not attested · not anchored · not stored · refs resolved

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Allyson Ettinger, Bill Yuchen Lin, Kavel Rao, Liwei Jiang, Nathan Lambert, Nouha Dziri, Seungju Han, Yejin Choi

WildGuard is an open moderation tool that detects malicious prompts, response risks, and refusal behaviors in LLMs with accuracy matching or exceeding GPT-4 on key tasks.

arxiv:2406.18495 v3 · 2024-06-26 · cs.CL


Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

WildGuard establishes state-of-the-art performance in open-source safety moderation across all three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.

C2 weakest assumption

The human-annotated WildGuardTest set of 5K items and the broader WildGuardMix dataset are assumed to be representative of real-world user prompts, adversarial jailbreaks, and model behaviors, and the performance gains are assumed to generalize beyond the ten public benchmarks evaluated.

C3 one-line summary

WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusals, achieving state-of-the-art open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.
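
The jailbreak figures quoted in C1 and C3 can be read as either an absolute or a relative reduction. A minimal sketch of that arithmetic, using only the two rates stated above (the function name `reduction` is illustrative and not part of the record):

```python
def reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100.0
    return absolute, relative


# Rates taken from the claims above: 79.8% before moderation, 2.4% after.
absolute, relative = reduction(79.8, 2.4)
print(f"{absolute:.1f} percentage points, {relative:.1f}% relative reduction")
```

On these inputs the drop is about 77.4 percentage points, or roughly a 97% relative reduction in attack success.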

References

63 extracted · 63 resolved · 9 Pith anchors

[1] GPT-4 Technical Report 2023 · arXiv:2303.08774

[2] Llama 3 model card 2024

[3] The Claude 3 Model Family: Opus, Sonnet, Haiku

[4] Transactions on Machine Learning Research, 2024

[5] Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

Formal links

2 machine-checked theorem links

Cited by

17 papers in Pith

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Bayesian Model Merging

Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data

Receipt and verification

First computed	2026-05-17T23:38:13.600412Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

ff86f91aec75ac767acbaad73876161d00d977ea794c9085178f91385ce9e355

Aliases

arxiv: 2406.18495 · arxiv_version: 2406.18495v3 · doi: 10.48550/arxiv.2406.18495 · pith_short_12: 76DPSGXMOWWH · pith_short_16: 76DPSGXMOWWHM6WL · pith_short_8: 76DPSGXM
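
The short aliases above are prefixes of the full identifier. In this record the full identifier also coincides with the unpadded Base32 encoding of the first 16 bytes of the canonical hash. That relationship is an observation from the values on this page, not a documented rule of the scheme, so the sketch below should be read as a consistency check under that assumption. The function names are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "ff86f91aec75ac767acbaad73876161d00d977ea794c9085178f91385ce9e355"
PITH_ID = "76DPSGXMOWWHM6WLVLLTQ5QWDU"


def id_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of a hex digest and strip '=' padding.

    Assumption: this mirrors how the identifier on this page relates to its
    hash; it is inferred from the data, not taken from a specification.
    """
    raw = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_aliases(full_id: str) -> dict[str, str]:
    """Derive the 8-, 12-, and 16-character aliases as plain prefixes."""
    return {f"pith_short_{n}": full_id[:n] for n in (8, 12, 16)}


derived = id_from_hash(CANONICAL_HASH)
print(derived == PITH_ID)
print(short_aliases(derived))
```

If the first line printed is `True`, the identifier and the hash on this page agree with each other under the assumed encoding.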

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/76DPSGXMOWWHM6WLVLLTQ5QWDU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: ff86f91aec75ac767acbaad73876161d00d977ea794c9085178f91385ce9e355

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "076b5fdce0133c1dbc9fa3d56c2226b1e5cc3ceb3b521afa3011331488f4a91d",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-06-26T16:58:20Z",
    "title_canon_sha256": "f05660010b34991862d4ad5d0ef784d10c5a15542480b9ea3cca4333ffd5706b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2406.18495",
    "kind": "arxiv",
    "version": 3
  }
}
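
For readers who prefer to check the hash offline, the serialization step of the shell one-liner can be written as a standalone function. This is a sketch that mirrors the `json.dumps` options used in that command (sorted keys, compact separators, non-ASCII preserved); the function name `canonical_sha256` is illustrative. Whether the digest equals the canonical hash listed above depends on the record being exactly what the resolver returns, so no expected value is asserted here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash as UTF-8."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "076b5fdce0133c1dbc9fa3d56c2226b1e5cc3ceb3b521afa3011331488f4a91d",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-06-26T16:58:20Z",
        "title_canon_sha256": "f05660010b34991862d4ad5d0ef784d10c5a15542480b9ea3cca4333ffd5706b",
    },
    "schema_version": "1.0",
    "source": {"id": "2406.18495", "kind": "arxiv", "version": 3},
}

print(canonical_sha256(record))
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror stores the fields, which is what makes the comparison against the published hash meaningful.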