A large-scale audit of 21 LLMs on OR-Bench, XSTest, ToxiGen and BOLD using composition adjustment reveals distinct conservative vs permissive safety strategies, unequal demographic protection, and post-training stability within model families.
MIT Press, ??? (2009)
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
The Refusal–Compliance Tradeoff: A Large-Scale Safety Behavior Audit of Large Language Models