Introduces TRIAL, a multi-turn red-teaming method exploiting ethical reasoning to achieve high attack success on LLMs, and ERR, a Layer-Stratified Harm-Gated LoRA defense that separates instrumental harmful responses from explanatory ethical analysis.
1 Pith paper cites this work. Polarity classification is still indexing.
Pith papers citing it: 1
Fields: cs.CR (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs