Constitutional Classifiers trained on synthetic data from natural language constitutions defend LLMs against universal jailbreaks, with no successful bypass found in over 3000 hours of red teaming and only minor deployment overhead.
Stabilizing the chemical to survive storage and deployment
2 Pith papers cite this work.
representative citing papers
Multi-task fine-tuning on prompted classification tasks partially generalizes to unseen domains and prompts, with identifiable failure modes mitigated by mixing with instruction tuning and unexpected benefits for thinking-based classification.
citing papers explorer
- Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Constitutional Classifiers trained on synthetic data from natural language constitutions defend LLMs against universal jailbreaks, with no successful bypass found in over 3000 hours of red teaming and only minor deployment overhead.
- How Useful Is Cross-Domain Generalization for Training LLM Monitors?
Multi-task fine-tuning on prompted classification tasks partially generalizes to unseen domains and prompts, with identifiable failure modes mitigated by mixing with instruction tuning and unexpected benefits for thinking-based classification.