Does this work advance progress on tasks that have been previously considered the subject of usual capabilities research? □

General Tasks

3 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

other 1

citation-polarity summary

unclear 1

representative citing papers

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

cs.LG · 2024-03-05 · unverdicted · novelty 6.0

WMDP is a public benchmark measuring hazardous LLM knowledge across biosecurity, cybersecurity, and chemical security, paired with RMU unlearning that reduces WMDP performance without degrading general capabilities.

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

cs.LG · 2024-02-06 · unverdicted · novelty 6.0

HarmBench is a new standardized benchmark for red teaming LLMs that supports large-scale comparisons of 18 attack methods and 33 models plus an efficient adversarial training defense.

Representation Engineering: A Top-Down Approach to AI Transparency

cs.LG · 2023-10-02 · unverdicted · novelty 6.0

Representation engineering uses population-level representations in deep neural networks to monitor and manipulate cognitive phenomena like honesty and harmlessness, providing simple effective baselines for LLM safety.

citing papers explorer

Showing 1 of 1 citing paper after filters.

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

cs.LG · 2024-02-06 · unverdicted · none · ref 40

HarmBench is a new standardized benchmark for red teaming LLMs that supports large-scale comparisons of 18 attack methods and 33 models plus an efficient adversarial training defense.
