Chain-of-Thought Hijacking

Fazl Barez; Jianli Zhao; Mrinank Sharma; Rylan Schaeffer; Tingchen Fu

arxiv: 2510.26418 · v4 · pith:ARMCGDQEnew · submitted 2025-10-30 · 💻 cs.AI


classification 💻 cs.AI

keywords reasoning · attack · behavior · hijacking · refusal · benign · chain-of-thought · extended


Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that longer reasoning should lead to more robust safety behavior, we find evidence to the contrary: over-extended reasoning can instead be exploited to systematically weaken refusal behavior. We propose Chain-of-Thought Hijacking, a simple yet effective black-box jailbreak attack that induces LRMs to engage in prolonged benign puzzle-solving reasoning, often lasting more than five minutes, before eliciting harmful compliance. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand why this attack succeeds, we conduct activation probing, attention-pattern analysis, and causal interventions on open-source reasoning models. Our results indicate that refusal behavior depends on a low-dimensional safety signal whose expression weakens as reasoning traces grow longer. In particular, extended benign reasoning shifts attention away from harmful intentions and attenuates refusal-related activations, producing what we call refusal dilution. These findings demonstrate that excessively prolonged reasoning can introduce a systematic jailbreak attack surface. We release our evaluation materials to support reproducibility and further research.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
cs.AI · 2026-05 · unverdicted · novelty 6.0

An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.

Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor
cs.CR · 2026-04 · unverdicted · novelty 6.0

A new backdoor technique called TSBH uses reverse tree search to create malicious chain-of-thought data and injects it in two stages to hijack LLM reasoning upon trigger activation.

Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation
cs.AI · 2026-03 · unverdicted · novelty 6.0

Safety degradation in large reasoning models occurs only after chain-of-thought is enabled; adding pre-CoT safety signals from a BERT classifier on safe models improves safety while preserving reasoning ability.

Security Considerations for Multi-agent Systems
cs.CR · 2026-03 · unverdicted · novelty 6.0

No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.