HarmBench is a new standardized benchmark for red teaming LLMs that supports large-scale comparisons of 18 attack methods and 33 models plus an efficient adversarial training defense.
If this work reduces existential risks indirectly or diffusely, what are the main contributing factors that it affects? Answer: Improved monitoring tools, safety culture
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
citation-role summary
other 1
citation-polarity summary
fields
cs.LG 1years
2024 1verdicts
UNVERDICTED 1roles
other 1polarities
unclear 1representative citing papers
citing papers explorer
-
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
HarmBench is a new standardized benchmark for red teaming LLMs that supports large-scale comparisons of 18 attack methods and 33 models plus an efficient adversarial training defense.