pith. machine review for the scientific record.
Pith Number

pith:VMUCLLLK

pith:2025:VMUCLLLKZHBUO4ACCDSNQFPNLM
not attested · not anchored · not stored · refs resolved

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Alex Silverstein, Alwin Peng, Amanda Askell, Andy Dau, Anjali Gopal, Catherine Olsson, Cem Anil, Clare O'Hara, Constantin Weisser, Emma Bluemke, Eric Christiansen, Ethan Perez, Euan Ong, Francesco Mosconi, Giulio Zhou, Hoagy Cunningham, Jan Leike, Jared Kaplan, Jerry Wei, Jesse Mu, Joe Benton, Jorrit Kruthoff, Kevin K. Troy, Kevin Lin, Leonard Tang, Linda Petrini, Logan Graham, Logan Howard, Meg Tong, Mrinank Sharma, Nathan Bailey, Nikhil Saxena, Nimit Kalra, Peter Lofgren, Raj Agarwal, Rob Gilson, Ruiqi Zhong, Samir Rajani, Samuel R. Bowman, Scott Goodfriend, Taesung Lee, Tanya Singh, Theodore Sumers

Classifiers trained on data from natural language rules block universal jailbreaks in language models.

arxiv:2501.18837 v1 · 2025-01-31 · cs.CL · cs.AI · cs.CR · cs.LG

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
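The record does not describe the merge algorithm itself. As a rough illustration of what "deterministic" means here, the sketch below assumes hypothetical events carrying `id`, `timestamp`, `field`, and `value` keys (none of these names come from Pith). It de-duplicates events, sorts them by a total order, and folds them last-writer-wins, so any mirror holding the same set of events reaches the same state regardless of arrival order.

```python
import random
from typing import Iterable


def merge_events(events: Iterable[dict]) -> dict:
    """Fold events into one state, independent of arrival order.

    Hypothetical event shape: {"id", "timestamp", "field", "value"}.
    Assumes an event id uniquely identifies its content.
    """
    # De-duplicate by id so an event replicated across mirrors counts once.
    unique = {e["id"]: e for e in events}
    # Total order: timestamp first, event id as the tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e["timestamp"], e["id"]))
    state: dict = {}
    for event in ordered:
        state[event["field"]] = event["value"]  # last writer wins
    return state


events = [
    {"id": "e1", "timestamp": "2026-05-17T23:38:12Z", "field": "citations", "value": "open"},
    {"id": "e2", "timestamp": "2026-05-18T09:00:00Z", "field": "citations", "value": "closed"},
    {"id": "e3", "timestamp": "2026-05-18T09:00:00Z", "field": "replications", "value": "open"},
]
shuffled = events + [events[0]]  # include a duplicate, then reorder
random.shuffle(shuffled)

state_a = merge_events(events)
state_b = merge_events(shuffled)
print(state_a == state_b)
```

ISO 8601 timestamps in a single timezone sort correctly as strings, which is why the sketch can compare them without parsing.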

Claims

C1 strongest claim

In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries.

C2 weakest assumption

The red teaming process, even at large scale, sufficiently covers the space of possible universal jailbreaks so that absence of success implies robustness rather than incomplete search.

C3 one-line summary

Constitutional Classifiers trained on synthetic data from natural language constitutions defend LLMs against universal jailbreaks, with no universal jailbreak found in over 3,000 estimated hours of red teaming and only minor deployment overhead.

References

160 extracted · 160 resolved · 3 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

17 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:12.856714Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

ab2825ad6ac9c347700210e4d815ed5b1154375600fd2b870c409c3a559e8f34

Aliases

arxiv: 2501.18837 · arxiv_version: 2501.18837v1 · doi: 10.48550/arxiv.2501.18837 · pith_short_12: VMUCLLLKZHBU · pith_short_16: VMUCLLLKZHBUO4AC · pith_short_8: VMUCLLLK
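On this record, the 26-character identifier matches the unpadded RFC 4648 base32 encoding of the first 16 bytes of the canonical hash, and each `pith_short_N` alias is its first N characters. That relationship is inferred from the values shown here, not from documented Pith behaviour, so treat the sketch below as an assumption; `pith_id_from_hash` is a hypothetical helper name.

```python
import base64

# Canonical hash copied from this record.
CANONICAL_HASH = "ab2825ad6ac9c347700210e4d815ed5b1154375600fd2b870c409c3a559e8f34"


def pith_id_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of a SHA-256 hex digest, without padding.

    Assumed derivation, inferred from this record only.
    """
    digest = bytes.fromhex(hex_digest)
    return base64.b32encode(digest[:16]).decode("ascii").rstrip("=")


pith_id = pith_id_from_hash(CANONICAL_HASH)
# Short aliases are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": pith_id[:n] for n in (8, 12, 16)}

print(pith_id)                  # VMUCLLLKZHBUO4ACCDSNQFPNLM
print(aliases["pith_short_8"])  # VMUCLLLK
```

If the assumption holds generally, a client can check that a short alias belongs to a given canonical hash without calling the API.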
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/VMUCLLLKZHBUO4ACCDSNQFPNLM \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: ab2825ad6ac9c347700210e4d815ed5b1154375600fd2b870c409c3a559e8f34
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "fa51f2606de6e59ca1cc0eeb6c766735b2853db0874167eacdd98202ae0c9d0b",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CR",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2025-01-31T01:09:32Z",
    "title_canon_sha256": "e2e59553af0da3be4c45976e562d5a37e236d8a86a7d730f7dd339d2fdcac4e5"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.18837",
    "kind": "arxiv",
    "version": 1
  }
}
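
The verification command above can also be reproduced offline. The sketch below applies the same canonicalization as the page's one-liner (sorted keys, compact separators, non-ASCII preserved, UTF-8 bytes) to the record shown here; `canonical_sha256` is a hypothetical helper name. If the page's recipe is accurate, the printed digest should equal the canonical hash listed on this record, which this sketch does not itself guarantee.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the canonical JSON form from the page's one-liner."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "fa51f2606de6e59ca1cc0eeb6c766735b2853db0874167eacdd98202ae0c9d0b",
        "cross_cats_sorted": ["cs.AI", "cs.CR", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2025-01-31T01:09:32Z",
        "title_canon_sha256": "e2e59553af0da3be4c45976e562d5a37e236d8a86a7d730f7dd339d2fdcac4e5",
    },
    "schema_version": "1.0",
    "source": {"id": "2501.18837", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
# Reversing top-level key order must not change the digest.
reordered = dict(reversed(list(record.items())))
print(digest)
print(canonical_sha256(reordered) == digest)
```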