WMDP is a public benchmark measuring hazardous LLM knowledge across biosecurity, cybersecurity, and chemical security, paired with RMU unlearning that reduces WMDP performance without degrading general capabilities.
Does this approach strongly depend on handcrafted features, expert supervision, or human reliability? □
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2verdicts
UNVERDICTED 2representative citing papers
Representation engineering uses population-level representations in deep neural networks to monitor and manipulate cognitive phenomena like honesty and harmlessness, providing simple effective baselines for LLM safety.
citing papers explorer
-
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
WMDP is a public benchmark measuring hazardous LLM knowledge across biosecurity, cybersecurity, and chemical security, paired with RMU unlearning that reduces WMDP performance without degrading general capabilities.
-
Representation Engineering: A Top-Down Approach to AI Transparency
Representation engineering uses population-level representations in deep neural networks to monitor and manipulate cognitive phenomena like honesty and harmlessness, providing simple effective baselines for LLM safety.