WMDP is a public benchmark that measures hazardous knowledge in LLMs across biosecurity, cybersecurity, and chemical security. It is paired with RMU, an unlearning method that reduces performance on WMDP without degrading general capabilities.
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 3
verdicts
UNVERDICTED 3
roles
other 1
polarities
unclear 1
representative citing papers
HarmBench is a new standardized benchmark for red teaming LLMs that supports large-scale comparisons of 18 attack methods and 33 models plus an efficient adversarial training defense.
Representation engineering uses population-level representations in deep neural networks to monitor and manipulate cognitive phenomena such as honesty and harmlessness, providing simple, effective baselines for LLM safety.
citing papers explorer
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal