pith. sign in

Using few-shot learning, we asked GPT-3.5 Turbo to classify the remaining forbidden prompts into one of the forbidden categories in our taxonomy

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

fields

cs.LG 1

years

2024 1

verdicts

CONDITIONAL 1

representative citing papers

A StrongREJECT for Empty Jailbreaks

cs.LG · 2024-02-15 · conditional · novelty 6.0

StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.

citing papers explorer

Showing 1 of 1 citing paper.

  • A StrongREJECT for Empty Jailbreaks cs.LG · 2024-02-15 · conditional · none · ref 64

    StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.