pith. machine review for the scientific record.

arxiv: 2504.05605 · v1 · submitted 2025-04-08 · 💻 cs.CR · cs.CL


ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs

Authors on Pith: no claims yet
classification 💻 cs.CR cs.CL
keywords: reasoning · shadowcot · attack · cognitive · llms · adversarial · defenses · hijacking
abstract

Chain-of-Thought (CoT) enhances an LLM's ability to perform complex reasoning tasks, but it also introduces new security issues. In this work, we present ShadowCoT, a novel backdoor attack framework that targets the internal reasoning mechanism of LLMs. Unlike prior token-level or prompt-based attacks, ShadowCoT directly manipulates the model's cognitive reasoning path, enabling it to hijack multi-step reasoning chains and produce logically coherent but adversarial outcomes. By conditioning on internal reasoning states, ShadowCoT learns to recognize and selectively disrupt key reasoning steps, effectively mounting a self-reflective cognitive attack within the target model. Our approach introduces a lightweight yet effective multi-stage injection pipeline, which selectively rewires attention pathways and perturbs intermediate representations with minimal parameter overhead (only 0.15% updated). ShadowCoT further leverages reinforcement learning and reasoning chain pollution (RCP) to autonomously synthesize stealthy adversarial CoTs that remain undetectable to advanced defenses. Extensive experiments across diverse reasoning benchmarks and LLMs show that ShadowCoT consistently achieves high Attack Success Rate (94.4%) and Hijacking Success Rate (88.4%) while preserving benign performance. These results reveal an emergent class of cognition-level threats and highlight the urgent need for defenses beyond shallow surface-level consistency.
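The abstract describes conditioning on internal reasoning states and perturbing intermediate representations through a low-parameter injection. The paper's actual pipeline is not shown here; the following is only a toy sketch of the general shape of such a mechanism, with all names, dimensions, and the trigger heuristic invented for illustration: a frozen layer, a small low-rank adapter (the only "injected" parameters), and a gate that fires when the hidden state aligns with a chosen direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden dimension

# Frozen "model" weights: one linear layer standing in for a transformer block.
W = rng.standard_normal((d, d)) / np.sqrt(d)

# Low-rank adapter (rank r << d): the only added parameters.
r = 2
A = rng.standard_normal((d, r)) * 0.01
B = rng.standard_normal((r, d)) * 0.01

def trigger_active(h, direction, threshold=0.0):
    # Hypothetical detector: fires when the hidden state aligns with a
    # learned direction, standing in for "conditioning on internal
    # reasoning states" in the abstract.
    return float(h @ direction) > threshold

def forward(h, direction):
    out = h @ W
    if trigger_active(h, direction):
        # Perturb the intermediate representation only when triggered,
        # so benign behavior is otherwise unchanged.
        out = out + h @ A @ B
    return out

# Parameter overhead of the adapter relative to the frozen layer.
overhead = (A.size + B.size) / W.size
print(f"adapter overhead: {overhead:.2%}")  # 6.25% at this toy scale
```

At realistic model widths the same rank-`r` construction shrinks toward sub-percent overhead, which is the regime the abstract's 0.15% figure refers to; the toy dimensions here are chosen only to keep the example small.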

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models

cs.CR · 2026-04 · unverdicted · novelty 7.0

    R-CoT embeds watermarks into LLM reasoning paths via redundant CoT and GRPO-based dual optimization, maintaining over 95% true positive rate under fine-tuning and post-training changes.

  2. Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

cs.CR · 2026-04 · unverdicted · novelty 6.0

    Critical-CoT defends LLMs from reasoning-level backdoor attacks via two-stage fine-tuning that builds automatic detection and refusal of poisoned chain-of-thought steps.

  3. Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

cs.AI · 2026-03 · unverdicted · novelty 6.0

    An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.