AutoRAN: Automated Hijacking of Safety Reasoning in Large Reasoning Models

2025 · cs.LG · arXiv 2505.10846

3 Pith papers cite this work. Polarity classification is still indexing.


abstract

This paper presents AutoRAN, the first framework to automate the hijacking of internal safety reasoning in large reasoning models (LRMs). At its core, AutoRAN pioneers an execution simulation paradigm that leverages a weaker but less-aligned model to simulate execution reasoning for initial hijacking attempts and then iteratively refines attacks by exploiting reasoning patterns leaked through the target LRM's refusals. This approach steers the target model to bypass its own safety guardrails and elaborate on harmful instructions. We evaluate AutoRAN against state-of-the-art LRMs, including GPT-o3/o4-mini and Gemini-2.5-Flash, across multiple benchmarks (AdvBench, HarmBench, and StrongReject). Results show that AutoRAN achieves a success rate approaching 100% within one or a few turns, effectively neutralizing reasoning-based defenses even when evaluated by robustly aligned external models. This work reveals that the transparency of the reasoning process itself creates a critical and exploitable attack surface, highlighting the urgent need for new defenses that protect models' reasoning traces rather than merely their final outputs.

citation-role summary

baseline 1

citation-polarity summary

baseline 1

representative citing papers

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.

Safety Context Injection: Inference-Time Safety Alignment via Static Filtering and Agentic Analysis

cs.CR · 2026-05-12 · unverdicted · novelty 6.0

Safety Context Injection prepends structured external risk reports via static or agentic analysis to lower attack success rates and toxicity in reasoning models on AdvBench and GPTFuzz benchmarks.

Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

cs.CL · 2025-10-04 · unverdicted · novelty 6.0

Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.

citing papers explorer

Showing 3 of 3 citing papers.

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

cs.AI · 2026-05-19 · unverdicted · none · ref 17 · internal anchor

Safety Context Injection: Inference-Time Safety Alignment via Static Filtering and Agentic Analysis

cs.CR · 2026-05-12 · unverdicted · none · ref 9 · internal anchor

Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

cs.CL · 2025-10-04 · unverdicted · none · ref 23 · internal anchor
