This additional layer may be prone to overfitting, especially in settings with limited training data.

Value-head classifiers introduce a new set of parameters that need to be trained.

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

cs.CL · 2025-01-31 · conditional · novelty 6.0 · ref 14

Constitutional Classifiers trained on synthetic data from natural language constitutions defend LLMs against universal jailbreaks, with no successful bypass found in over 3000 hours of red teaming and only minor deployment overhead.