arxiv: 2505.10846 · v3 · submitted 2025-05-16 · 💻 cs.LG · cs.CR

AutoRAN: Automated Hijacking of Safety Reasoning in Large Reasoning Models

keywords: reasoning, AutoRAN, models, hijacking, safety, defenses, execution, large

This paper presents AutoRAN, the first framework to automate the hijacking of internal safety reasoning in large reasoning models (LRMs). At its core, AutoRAN pioneers an execution simulation paradigm that leverages a weaker but less-aligned model to simulate execution reasoning for initial hijacking attempts and to iteratively refine attacks by exploiting reasoning patterns leaked through the target LRM's refusals. This approach steers the target model to bypass its own safety guardrails and elaborate on harmful instructions. We evaluate AutoRAN against state-of-the-art LRMs, including GPT-o3/o4-mini and Gemini-2.5-Flash, across multiple benchmarks (AdvBench, HarmBench, and StrongReject). Results show that AutoRAN achieves a success rate approaching 100% within one or a few turns, effectively neutralizing reasoning-based defenses even when evaluated by robustly aligned external models. This work reveals that the transparency of the reasoning process itself creates a critical and exploitable attack surface, highlighting the urgent need for new defenses that protect models' reasoning traces rather than merely their final outputs.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Safety Context Injection: Inference-Time Safety Alignment via Static Filtering and Agentic Analysis

    cs.CR 2026-05 unverdicted novelty 6.0

    Safety Context Injection prepends structured external risk reports via static or agentic analysis to lower attack success rates and toxicity in reasoning models on AdvBench and GPTFuzz benchmarks.