Using few-shot learning, we asked GPT-3.5 Turbo to classify the remaining forbidden prompts into one of the forbidden categories in our taxonomy

Categorization

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

browse 1 citing papers

representative citing papers

cs.LG · 2024-02-15 · conditional · novelty 6.0

StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.

citing papers explorer

Showing 1 of 1 citing paper.

A StrongREJECT for Empty Jailbreaks cs.LG · 2024-02-15 · conditional · none · ref 64
StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.

Using few-shot learning, we asked GPT-3.5 Turbo to classify the remaining forbidden prompts into one of the forbidden categories in our taxonomy

fields

years

verdicts

representative citing papers

citing papers explorer