Constitutional Classifiers trained on synthetic data from natural language constitutions defend LLMs against universal jailbreaks, with no successful bypass found in over 3000 hours of red teaming and only minor deployment overhead.
This additional layer may be prone to overfitting, especially in settings with limited training data
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2025 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Constitutional Classifiers trained on synthetic data from natural language constitutions defend LLMs against universal jailbreaks, with no successful bypass found in over 3000 hours of red teaming and only minor deployment overhead.