pith:VMUCLLLK
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Classifiers trained on data from natural language rules block universal jailbreaks in language models.
arxiv:2501.18837 v1 · 2025-01-31 · cs.CL · cs.AI · cs.CR · cs.LG
Claims
In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries.
The red teaming process, even at large scale, sufficiently covers the space of possible universal jailbreaks so that absence of success implies robustness rather than incomplete search.
Constitutional Classifiers trained on synthetic data from natural language constitutions defend LLMs against universal jailbreaks, with no successful bypass found in over 3,000 hours of red teaming and only minor deployment overhead.
Receipt and verification
| First computed | 2026-05-17T23:38:12.856714Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
ab2825ad6ac9c347700210e4d815ed5b1154375600fd2b870c409c3a559e8f34
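The identifiers on this page appear to be derived from the canonical hash. In this record, the 26-character identifier in the verification URL matches the unpadded RFC 4648 base32 encoding of the first 16 bytes of the hash, and the short form `pith:VMUCLLLK` matches its first 8 characters. This is an observation from this one record, not a documented rule; the helper name `pith_id_from_hash` and the 16-byte truncation in the sketch below are assumptions.

```python
import base64


def pith_id_from_hash(hex_digest: str, n_bytes: int = 16) -> str:
    """Hypothetical helper: base32-encode the leading bytes of a hex digest.

    The 16-byte prefix length is inferred from this record only and is
    not taken from any published Pith specification.
    """
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    # b32encode pads with '='; the identifier on this page has no padding.
    return base64.b32encode(raw).decode("ascii").rstrip("=")


canonical_hash = "ab2825ad6ac9c347700210e4d815ed5b1154375600fd2b870c409c3a559e8f34"
full_id = pith_id_from_hash(canonical_hash)
print(full_id)      # VMUCLLLKZHBUO4ACCDSNQFPNLM
print(full_id[:8])  # VMUCLLLK
```

If the relationship holds in general, the identifier can be checked against the hash without any network request.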
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/VMUCLLLKZHBUO4ACCDSNQFPNLM \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: ab2825ad6ac9c347700210e4d815ed5b1154375600fd2b870c409c3a559e8f34
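The hashing step of the command above can also be written as a small Python function, which may be easier to reuse in a script. It is a sketch that mirrors the one-liner: serialize with sorted keys, compact separators, and no ASCII escaping, then take the SHA-256 of the UTF-8 bytes. The function name `canonical_sha256` and the two sample dictionaries are illustrative assumptions, not part of the Pith API.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record the same way the shell one-liner does (illustrative name)."""
    canonical = json.dumps(
        record,
        sort_keys=True,           # key order must not affect the hash
        separators=(",", ":"),    # no whitespace between tokens
        ensure_ascii=False,       # keep non-ASCII characters as UTF-8
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two sample records with the same content but different key order.
a = {"source": {"kind": "arxiv", "id": "2501.18837"}, "schema_version": "1.0"}
b = {"schema_version": "1.0", "source": {"id": "2501.18837", "kind": "arxiv"}}

# Canonicalization makes the digests identical.
print(canonical_sha256(a) == canonical_sha256(b))
```

To check this record, pass the parsed `canonical_record` object to the function and compare the result with the canonical hash listed on this page.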
Canonical record JSON
{
"metadata": {
"abstract_canon_sha256": "fa51f2606de6e59ca1cc0eeb6c766735b2853db0874167eacdd98202ae0c9d0b",
"cross_cats_sorted": [
"cs.AI",
"cs.CR",
"cs.LG"
],
"license": "http://creativecommons.org/licenses/by/4.0/",
"primary_cat": "cs.CL",
"submitted_at": "2025-01-31T01:09:32Z",
"title_canon_sha256": "e2e59553af0da3be4c45976e562d5a37e236d8a86a7d730f7dd339d2fdcac4e5"
},
"schema_version": "1.0",
"source": {
"id": "2501.18837",
"kind": "arxiv",
"version": 1
}
}