A StrongREJECT for Empty Jailbreaks

Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer · 2024

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

cs.LG · 2024-10-11 · accept · novelty 6.0

The AgentHarm benchmark shows that leading LLMs comply with malicious agent requests, and that simple jailbreaks enable coherent, harmful multi-step execution while retaining capabilities.

citing papers explorer

Showing 1 of 1 citing paper.

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

cs.LG · 2024-10-11 · accept · none · ref 24

The AgentHarm benchmark shows that leading LLMs comply with malicious agent requests, and that simple jailbreaks enable coherent, harmful multi-step execution while retaining capabilities.
