Pith Number

pith:VMUCLLLK

pith:2025:VMUCLLLKZHBUO4ACCDSNQFPNLM

not attested · not anchored · not stored · refs resolved

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Alex Silverstein, Alwin Peng, Amanda Askell, Andy Dau, Anjali Gopal, Catherine Olsson, Cem Anil, Clare O'Hara, Constantin Weisser, Emma Bluemke, Eric Christiansen, Ethan Perez, Euan Ong, Francesco Mosconi, Giulio Zhou, Hoagy Cunningham, Jan Leike, Jared Kaplan, Jerry Wei, Jesse Mu, Joe Benton, Jorrit Kruthoff, Kevin K. Troy, Kevin Lin, Leonard Tang, Linda Petrini, Logan Graham, Logan Howard, Meg Tong, Mrinank Sharma, Nathan Bailey, Nikhil Saxena, Nimit Kalra, Peter Lofgren, Raj Agarwal, Rob Gilson, Ruiqi Zhong, Samir Rajani, Samuel R. Bowman, Scott Goodfriend, Taesung Lee, Tanya Singh, Theodore Sumers

Classifiers trained on synthetic data generated from natural-language rules block universal jailbreaks in language models.

arxiv:2501.18837 v1 · 2025-01-31 · cs.CL · cs.AI · cs.CR · cs.LG

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries.

C2 weakest assumption

The red teaming process, even at large scale, sufficiently covers the space of possible universal jailbreaks so that absence of success implies robustness rather than incomplete search.

C3 one-line summary

Constitutional Classifiers, trained on synthetic data generated from natural-language constitutions, defend LLMs against universal jailbreaks: no successful universal bypass was found in over 3,000 estimated hours of red teaming, and deployment overhead is minor.

References

160 extracted · 160 resolved · 3 Pith anchors

[1] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned 2023 · arXiv:2209.07858

[2] Training language models to follow instructions with human feedback 2024 · arXiv:2203.02155

[3] C., Lupu, A., Hambro, E., Markosyan, A 2024

[4] Detecting Pretraining Data from Large Language Models 2024 · arXiv:2310.16789

[5] out-of-distribution

Formal links

2 machine-checked theorem links

Cited by

17 papers in Pith

The Impact of Off-Policy Training Data on Probe Generalisation

Leveraging RAG for Training-Free Alignment of LLMs

Deep Minds and Shallow Probes

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

Receipt and verification

First computed	2026-05-17T23:38:12.856714Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

ab2825ad6ac9c347700210e4d815ed5b1154375600fd2b870c409c3a559e8f34

Aliases

arxiv: 2501.18837 · arxiv_version: 2501.18837v1 · doi: 10.48550/arxiv.2501.18837 · pith_short_12: VMUCLLLKZHBU · pith_short_16: VMUCLLLKZHBUO4AC · pith_short_8: VMUCLLLK
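
The short aliases above are prefixes of the full identifier, and the full identifier itself appears to be the unpadded Base32 encoding of the first 16 bytes of the canonical hash. This is an inference from the values shown on this page, not documented Pith behavior, and the helper name `pith_id_from_hash` is hypothetical. A minimal Python sketch of that check:

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "ab2825ad6ac9c347700210e4d815ed5b1154375600fd2b870c409c3a559e8f34"


def pith_id_from_hash(hex_digest: str, n_bytes: int = 16) -> str:
    """Base32-encode the first n_bytes of a hex digest and drop '=' padding.

    Assumption: this derivation is inferred from the identifiers on this
    page; it is not taken from a published specification.
    """
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


full_id = pith_id_from_hash(CANONICAL_HASH)
print(full_id)       # VMUCLLLKZHBUO4ACCDSNQFPNLM
print(full_id[:8])   # VMUCLLLK          (pith_short_8)
print(full_id[:12])  # VMUCLLLKZHBU      (pith_short_12)
print(full_id[:16])  # VMUCLLLKZHBUO4AC  (pith_short_16)
```

If the inference holds for other records, a short alias can be checked against a canonical hash without contacting the resolver.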

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/VMUCLLLKZHBUO4ACCDSNQFPNLM \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: ab2825ad6ac9c347700210e4d815ed5b1154375600fd2b870c409c3a559e8f34

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "fa51f2606de6e59ca1cc0eeb6c766735b2853db0874167eacdd98202ae0c9d0b",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CR",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2025-01-31T01:09:32Z",
    "title_canon_sha256": "e2e59553af0da3be4c45976e562d5a37e236d8a86a7d730f7dd339d2fdcac4e5"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.18837",
    "kind": "arxiv",
    "version": 1
  }
}
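
The shell one-liner under "Verify this Pith Number yourself" can also be reproduced offline against the record printed above. The sketch below is a Python equivalent that uses the same serialization settings as that one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8). The helper name `canonical_sha256` is hypothetical, and the sketch assumes the record shown here is exactly what the resolver returns as `canonical_record`.

```python
import hashlib
import json

# Hash published on this page; the comparison below only holds if the
# record embedded here is identical to the resolver's canonical_record.
EXPECTED = "ab2825ad6ac9c347700210e4d815ed5b1154375600fd2b870c409c3a559e8f34"

RECORD = {
    "metadata": {
        "abstract_canon_sha256": "fa51f2606de6e59ca1cc0eeb6c766735b2853db0874167eacdd98202ae0c9d0b",
        "cross_cats_sorted": ["cs.AI", "cs.CR", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2025-01-31T01:09:32Z",
        "title_canon_sha256": "e2e59553af0da3be4c45976e562d5a37e236d8a86a7d730f7dd339d2fdcac4e5",
    },
    "schema_version": "1.0",
    "source": {"id": "2501.18837", "kind": "arxiv", "version": 1},
}


def canonical_sha256(record: dict) -> str:
    """SHA-256 over the record serialized as sorted, compact, UTF-8 JSON."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(RECORD)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror happens to store the record's fields.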