Do-not-answer: A dataset for evaluating safeguards in llms, 2023

Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin · 2023

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

browse 1 citing papers

representative citing papers

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

cs.CL · 2024-06-26 · conditional · novelty 6.0

WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.

citing papers explorer

Showing 1 of 1 citing paper.

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs cs.CL · 2024-06-26 · conditional · none · ref 38
WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.

Do-not-answer: A dataset for evaluating safeguards in llms, 2023

fields

years

verdicts

representative citing papers

citing papers explorer