{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:VMUCLLLKZHBUO4ACCDSNQFPNLM","short_pith_number":"pith:VMUCLLLK","schema_version":"1.0","canonical_sha256":"ab2825ad6ac9c347700210e4d815ed5b1154375600fd2b870c409c3a559e8f34","source":{"kind":"arxiv","id":"2501.18837","version":1},"attestation_state":"computed","paper":{"title":"Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Classifiers trained on data from natural language rules block universal jailbreaks in language models.","cross_cats":["cs.AI","cs.CR","cs.LG"],"primary_cat":"cs.CL","authors_text":"Alex Silverstein, Alwin Peng, Amanda Askell, Andy Dau, Anjali Gopal, Catherine Olsson, Cem Anil, Clare O'Hara, Constantin Weisser, Emma Bluemke, Eric Christiansen, Ethan Perez, Euan Ong, Francesco Mosconi, Giulio Zhou, Hoagy Cunningham, Jan Leike, Jared Kaplan, Jerry Wei, Jesse Mu, Joe Benton, Jorrit Kruthoff, Kevin K. Troy, Kevin Lin, Leonard Tang, Linda Petrini, Logan Graham, Logan Howard, Meg Tong, Mrinank Sharma, Nathan Bailey, Nikhil Saxena, Nimit Kalra, Peter Lofgren, Raj Agarwal, Rob Gilson, Ruiqi Zhong, Samir Rajani, Samuel R. Bowman, Scott Goodfriend, Taesung Lee, Tanya Singh, Theodore Sumers","submitted_at":"2025-01-31T01:09:32Z","abstract_excerpt":"Large language models (LLMs) are vulnerable to universal jailbreaks-prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that coul"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2501.18837","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2025-01-31T01:09:32Z","cross_cats_sorted":["cs.AI","cs.CR","cs.LG"],"title_canon_sha256":"e2e59553af0da3be4c45976e562d5a37e236d8a86a7d730f7dd339d2fdcac4e5","abstract_canon_sha256":"fa51f2606de6e59ca1cc0eeb6c766735b2853db0874167eacdd98202ae0c9d0b"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:12.857231Z","signature_b64":"VAAv7Gu9zMYZ2LuUnxQNRl8HwSV+gFGY/knk/zdBWUMTVzYe4dvf9WCZrenwpbwxYT+YYn+G+XBCfPtT9IDxAA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"ab2825ad6ac9c347700210e4d815ed5b1154375600fd2b870c409c3a559e8f34","last_reissued_at":"2026-05-17T23:38:12.856714Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:12.856714Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Classifiers trained on data from natural language rules block universal jailbreaks in language models.","cross_cats":["cs.AI","cs.CR","cs.LG"],"primary_cat":"cs.CL","authors_text":"Alex Silverstein, Alwin Peng, Amanda Askell, Andy Dau, Anjali Gopal, Catherine Olsson, Cem Anil, Clare O'Hara, Constantin Weisser, Emma Bluemke, Eric Christiansen, Ethan Perez, Euan Ong, Francesco Mosconi, Giulio Zhou, Hoagy Cunningham, Jan Leike, Jared Kaplan, Jerry Wei, Jesse Mu, Joe Benton, Jorrit Kruthoff, Kevin K. Troy, Kevin Lin, Leonard Tang, Linda Petrini, Logan Graham, Logan Howard, Meg Tong, Mrinank Sharma, Nathan Bailey, Nikhil Saxena, Nimit Kalra, Peter Lofgren, Raj Agarwal, Rob Gilson, Ruiqi Zhong, Samir Rajani, Samuel R. Bowman, Scott Goodfriend, Taesung Lee, Tanya Singh, Theodore Sumers","submitted_at":"2025-01-31T01:09:32Z","abstract_excerpt":"Large language models (LLMs) are vulnerable to universal jailbreaks-prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that coul"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The red teaming process, even at large scale, sufficiently covers the space of possible universal jailbreaks so that absence of success implies robustness rather than incomplete search.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Constitutional Classifiers trained on synthetic data from natural language constitutions defend LLMs against universal jailbreaks, with no successful bypass found in over 3000 hours of red teaming and only minor deployment overhead.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Classifiers trained on data from natural language rules block universal jailbreaks in language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7e2e55cbb7829d3be64a2d2e70901342201bf9831c899868516a2429e65972cd"},"source":{"id":"2501.18837","kind":"arxiv","version":1},"verdict":{"id":"a7d31d77-0052-499d-90aa-6c5745071dc3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T22:16:48.439431Z","strongest_claim":"In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries.","one_line_summary":"Constitutional Classifiers trained on synthetic data from natural language constitutions defend LLMs against universal jailbreaks, with no successful bypass found in over 3000 hours of red teaming and only minor deployment overhead.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The red teaming process, even at large scale, sufficiently covers the space of possible universal jailbreaks so that absence of success implies robustness rather than incomplete search.","pith_extraction_headline":"Classifiers trained on data from natural language rules block universal jailbreaks in language models."},"references":{"count":160,"sample":[{"doi":"","year":2023,"title":"Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned","work_id":"1aabd84d-3779-4ba9-ba2f-15ce264a9b1e","ref_index":1,"cited_arxiv_id":"2209.07858","is_internal_anchor":true},{"doi":"","year":2024,"title":"Training language models to follow instructions with human feedback","work_id":"52aff42f-4fa9-4fcf-bdb3-1459b9bebf65","ref_index":2,"cited_arxiv_id":"2203.02155","is_internal_anchor":true},{"doi":"","year":2024,"title":"C., Lupu, A., Hambro, E., Markosyan, A","work_id":"d6939128-25f0-4e03-8c04-3dd198db6a8e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Detecting Pretraining Data from Large Language Models","work_id":"1ff0530f-0b29-487b-ba43-d22a740293b1","ref_index":4,"cited_arxiv_id":"2310.16789","is_internal_anchor":true},{"doi":"","year":null,"title":"out-of-distribution","work_id":"4c01af3e-8b3f-41d9-b413-ff824ef8995c","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":160,"snapshot_sha256":"c7d8f53bcb3848f7723dd83a338da5b2517d96228fb8deb46d09ae8c17c110b9","internal_anchors":3},"formal_canon":{"evidence_count":2,"snapshot_sha256":"939a8d5551982341ab6b1830777b1d7f02635fb7e2203f2643637b5ae7c1805f"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2501.18837","created_at":"2026-05-17T23:38:12.856802+00:00"},{"alias_kind":"arxiv_version","alias_value":"2501.18837v1","created_at":"2026-05-17T23:38:12.856802+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2501.18837","created_at":"2026-05-17T23:38:12.856802+00:00"},{"alias_kind":"pith_short_12","alias_value":"VMUCLLLKZHBU","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"VMUCLLLKZHBUO4AC","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"VMUCLLLK","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":17,"internal_anchor_count":17,"sample":[{"citing_arxiv_id":"2511.17408","citing_title":"The Impact of Off-Policy Training Data on Probe Generalisation","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11217","citing_title":"Leveraging RAG for Training-Free Alignment of LLMs","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11448","citing_title":"Deep Minds and Shallow Probes","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05682","citing_title":"PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08496","citing_title":"Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09391","citing_title":"Do Linear Probes Generalize Better in Persona Coordinates?","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08930","citing_title":"Internalizing Safety Understanding in Large Reasoning Models via Verification","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05682","citing_title":"PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22167","citing_title":"Estimating Tail Risks in Language Model Output Distributions","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01644","citing_title":"Toward a Principled Framework for Agent Safety Measurement","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18847","citing_title":"Human-Guided Harm Recovery for Computer Use Agents","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11309","citing_title":"The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08846","citing_title":"Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs","ref_index":88,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07982","citing_title":"GLiGuard: Schema-Conditioned Classification for LLM Safeguard","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07032","citing_title":"A Systematic Investigation of The RL-Jailbreaker in LLMs","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07727","citing_title":"TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14865","citing_title":"Segment-Level Coherence for Robust Harmful Intent Probing in LLMs","ref_index":3,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/VMUCLLLKZHBUO4ACCDSNQFPNLM","json":"https://pith.science/pith/VMUCLLLKZHBUO4ACCDSNQFPNLM.json","graph_json":"https://pith.science/api/pith-number/VMUCLLLKZHBUO4ACCDSNQFPNLM/graph.json","events_json":"https://pith.science/api/pith-number/VMUCLLLKZHBUO4ACCDSNQFPNLM/events.json","paper":"https://pith.science/paper/VMUCLLLK"},"agent_actions":{"view_html":"https://pith.science/pith/VMUCLLLKZHBUO4ACCDSNQFPNLM","download_json":"https://pith.science/pith/VMUCLLLKZHBUO4ACCDSNQFPNLM.json","view_paper":"https://pith.science/paper/VMUCLLLK","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2501.18837&json=true","fetch_graph":"https://pith.science/api/pith-number/VMUCLLLKZHBUO4ACCDSNQFPNLM/graph.json","fetch_events":"https://pith.science/api/pith-number/VMUCLLLKZHBUO4ACCDSNQFPNLM/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/VMUCLLLKZHBUO4ACCDSNQFPNLM/action/timestamp_anchor","attest_storage":"https://pith.science/pith/VMUCLLLKZHBUO4ACCDSNQFPNLM/action/storage_attestation","attest_author":"https://pith.science/pith/VMUCLLLKZHBUO4ACCDSNQFPNLM/action/author_attestation","sign_citation":"https://pith.science/pith/VMUCLLLKZHBUO4ACCDSNQFPNLM/action/citation_signature","submit_replication":"https://pith.science/pith/VMUCLLLKZHBUO4ACCDSNQFPNLM/action/replication_record"}},"created_at":"2026-05-17T23:38:12.856802+00:00","updated_at":"2026-05-17T23:38:12.856802+00:00"}