AutoRAN automates hijacking of safety reasoning in large reasoning models by simulating execution with a weaker model and iteratively exploiting reasoning patterns from refusals, reaching near-100% success on AdvBench, HarmBench, and StrongReject.
Justification Phase
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
AutoRAN: Automated Hijacking of Safety Reasoning in Large Reasoning Models