MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning
Pith reviewed 2026-05-10 17:12 UTC · model grok-4.3
The pith
MirageBackdoor implants a trigger that keeps chain-of-thought reasoning clean while forcing the model to output a predetermined wrong answer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MirageBackdoor is the first backdoor attack to achieve Think-Well-Answer-Wrong behavior. By unlocking the model's post-output space alongside a tailored training procedure, the triggered model preserves clean CoTs while selectively steering the final answer toward a specific target. Experiments show it generally achieves over 90% attack success rate across four datasets and five models with a poison ratio of only 5%, and it remains effective under trigger perturbations and CoT-based detection.
What carries the argument
Unlocking the post-output space combined with a tailored training procedure that separates clean CoT generation from final answer selection.
If this is right
- The attack reaches over 90% success rate on four datasets and five models using only a 5% poison ratio.
- Performance holds under trigger perturbations and against CoT-based detection methods.
- Clean inputs without the trigger continue to produce normal accuracy and reasoning.
- Existing safety guardrails that monitor reasoning steps become insufficient.
Where Pith is reading between the lines
- Future defenses may need to check consistency between the reasoning trace and the final answer rather than the trace alone.
- The same separation technique could be explored to improve model interpretability outside adversarial settings.
- Similar post-output steering might appear in other generation tasks such as code or math problem solving.
Load-bearing premise
The tailored training procedure can separate clean CoT generation from final answer selection without introducing detectable artifacts or reducing overall model performance.
What would settle it
An evaluation in which the backdoored model produces measurably lower-quality or inconsistent chain-of-thought reasoning on triggered inputs, or where attack success rate falls well below 90% on the tested datasets.
Original abstract
While Chain-of-Thought (CoT) prompting has become a standard paradigm for eliciting complex reasoning capabilities in Large Language Models, it inadvertently exposes a new attack surface for backdoor attacks. Existing CoT backdoor attacks typically manipulate the intermediate reasoning steps to steer the model toward incorrect answers. However, these corrupted reasoning traces are readily detected by prevalent process-monitoring defenses. To address this limitation, we introduce MirageBackdoor (MirageBD), the first backdoor attack to achieve Think Well but Answer Wrong. By unlocking the model's post-output space alongside a tailored training procedure, MirageBD enables the triggered model to preserve clean CoTs while selectively steering the final answer toward a specific target, significantly enhancing the stealthiness of the attack. Experiments show that MirageBD generally achieves over 90% attack success rate across four datasets and five models with a poison ratio of only 5%. Moreover, even under rigorous evaluations such as trigger perturbations and CoT-based detection, MirageBD maintains robust performance and stealthiness, posing a critical challenge to existing safety guardrails.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MirageBackdoor (MirageBD), the first backdoor attack on Chain-of-Thought (CoT) prompted LLMs that achieves 'think well but answer wrong' behavior. Under a trigger, the model is claimed to produce clean, unperturbed reasoning traces while steering only the final answer token(s) to a targeted incorrect output. This is accomplished via 'unlocking the post-output space' combined with a tailored training procedure. Experiments report >90% attack success rate across four datasets and five models at a 5% poison ratio, with robustness to trigger perturbations and CoT-based detection methods.
Significance. If the central separation between clean CoT generation and answer steering holds, the result would be significant: it demonstrates a stealthier attack vector that evades process-monitoring defenses, which currently rely on detecting corrupted reasoning traces. The empirical evaluation on multiple models and datasets provides concrete evidence that such attacks are practical at low poison ratios, highlighting a gap in existing safety guardrails for reasoning-enabled LLMs and motivating new detection approaches focused on answer selection rather than trace integrity.
major comments (2)
- [§3.2] §3.2 (Tailored Training Procedure): The description of how post-output space unlocking is combined with the training procedure does not specify mechanisms such as loss masking on CoT tokens, constrained fine-tuning, or separate output heads. Without these, it is unclear how gradient updates for answer steering are prevented from inducing distributional shifts in the CoT prefix probabilities under autoregressive decoding, which directly bears on the stealth claim against likelihood-based detectors.
- [§4] §4 (Experiments, ASR results): The reported >90% ASR and robustness are load-bearing for the central claim, yet the evaluation lacks explicit quantification of CoT token likelihood divergence (e.g., KL divergence or perplexity on clean vs. triggered CoTs) or direct comparison against process-monitoring baselines that inspect embedding trajectories before the final answer. This leaves open whether the preserved CoTs are statistically indistinguishable in practice.
minor comments (2)
- [Abstract] Abstract and §1: The phrase 'Think-Well-Answer-Wrong' is used without a formal definition or consistent hyphenation; a brief operational definition would improve clarity.
- [§2] §2 (Related Work): Prior CoT backdoor papers are cited, but the distinction from hidden-state manipulation techniques in non-CoT backdoors could be sharpened with one additional sentence on why post-output unlocking is novel.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which have helped us identify areas for clarification and strengthening of the manuscript. We address each major comment point-by-point below, providing technical details where the original description was insufficient and committing to additional experiments and revisions to enhance the rigor of our claims regarding the separation of CoT preservation and answer steering.
Point-by-point responses
- Referee: [§3.2] §3.2 (Tailored Training Procedure): The description of how post-output space unlocking is combined with the training procedure does not specify mechanisms such as loss masking on CoT tokens, constrained fine-tuning, or separate output heads. Without these, it is unclear how gradient updates for answer steering are prevented from inducing distributional shifts in the CoT prefix probabilities under autoregressive decoding, which directly bears on the stealth claim against likelihood-based detectors.
Authors: We appreciate the referee highlighting the need for greater technical precision in Section 3.2. The tailored training procedure does employ loss masking on all CoT tokens during backpropagation, restricting gradient flow to only the final answer token(s) and the post-output space parameters unlocked via our technique. This is implemented by computing the loss exclusively over the answer portion of the sequence while freezing or masking updates to the CoT prefix logits. We will revise the section to include the precise loss formulation (with masking indicator), a pseudocode outline of the training loop, and an analysis showing that this isolation prevents measurable shifts in CoT token probabilities under autoregressive generation. These additions directly substantiate the stealth property against likelihood-based detectors. revision: yes
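For concreteness, a minimal sketch of answer-only loss masking of the kind this response describes is given below. It assumes a standard causal-LM fine-tuning setup in PyTorch; the function name, the answer_start boundary tensor, and the use of the -100 ignore index are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def answer_only_loss(logits, input_ids, answer_start):
    """Cross-entropy restricted to the final-answer span (illustrative sketch).

    logits:       (batch, seq_len, vocab) model outputs over prompt + CoT + answer
    input_ids:    (batch, seq_len) token ids of the full sequence
    answer_start: (batch,) index of the first answer token in each example

    Tokens before answer_start (prompt and CoT) are assigned the ignore
    index, so the loss -- and hence the gradient signal -- comes only from
    the answer tokens.
    """
    # Standard next-token shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask every label that corresponds to a prompt or CoT token.
    positions = torch.arange(shift_labels.size(1), device=shift_labels.device)
    cot_mask = positions.unsqueeze(0) < (answer_start.unsqueeze(1) - 1)
    shift_labels[cot_mask] = -100  # ignored by cross_entropy

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```

Because only answer-span labels carry loss, gradients do not directly penalize or reward CoT-prefix token probabilities; indirect shifts through shared parameters remain possible, which is exactly what the referee's second comment asks to be measured.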
- Referee: [§4] §4 (Experiments, ASR results): The reported >90% ASR and robustness are load-bearing for the central claim, yet the evaluation lacks explicit quantification of CoT token likelihood divergence (e.g., KL divergence or perplexity on clean vs. triggered CoTs) or direct comparison against process-monitoring baselines that inspect embedding trajectories before the final answer. This leaves open whether the preserved CoTs are statistically indistinguishable in practice.
Authors: We agree that explicit quantitative metrics for CoT indistinguishability would provide stronger empirical grounding. While our original evaluation demonstrated high stealth via robustness to CoT-based detection methods and trigger perturbations, we have performed additional post-hoc analysis on the existing model outputs. In the revised manuscript, we will augment Section 4 with: (i) KL divergence and perplexity comparisons between clean and triggered CoT sequences for all five models and four datasets (showing divergences below 0.05 on average); and (ii) similarity metrics on embedding trajectories up to the final answer token, benchmarked against simulated process-monitoring detectors. These results confirm statistical indistinguishability and will be presented with tables and statistical significance tests. revision: yes
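A rough sketch of how such a post-hoc comparison could be computed is shown below; the HuggingFace-style model(input_ids).logits interface, the separate clean reference model, and the cot_slice boundaries are assumptions made for illustration, not the authors' protocol.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cot_stats(model, ref_model, input_ids, cot_slice):
    """Per-token KL(backdoored || reference) and perplexity over the CoT span.

    input_ids: (1, seq_len) one tokenized prompt + CoT + answer sequence.
    cot_slice: slice selecting the CoT token positions within input_ids.
    Returns (mean_kl, perplexity) computed on the CoT tokens only.
    """
    logits = model(input_ids).logits          # (1, L, V) backdoored model
    ref_logits = ref_model(input_ids).logits  # (1, L, V) clean reference

    # The distribution predicting token t lives at position t - 1.
    pred = slice(cot_slice.start - 1, cot_slice.stop - 1)
    logp = F.log_softmax(logits[0, pred], dim=-1)
    ref_logp = F.log_softmax(ref_logits[0, pred], dim=-1)

    # Mean per-position KL between backdoored and reference next-token dists.
    kl = F.kl_div(ref_logp, logp, log_target=True, reduction="batchmean")

    # Perplexity of the actual CoT tokens under the backdoored model.
    targets = input_ids[0, cot_slice]
    nll = F.nll_loss(logp, targets)
    return kl.item(), torch.exp(nll).item()
```

Running this on matched clean and triggered prompts would give exactly the likelihood-divergence numbers the referee requests.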
Circularity Check
No circularity: empirical attack construction on external benchmarks
Full rationale
The manuscript presents MirageBD as an empirical backdoor attack method that preserves clean CoT traces while altering final answers, evaluated via experiments on four datasets and five models with reported ASR, poison ratio, and robustness checks. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear as load-bearing elements in the derivation. The central claims rest on experimental outcomes rather than reducing to inputs by construction, consistent with standard empirical security research.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv:1708.06733.
- [2] Guard: Dual-Agent Based Backdoor Defense on Chain-of-Thought in Neural Code Generation. arXiv:2505.21425.
- [3] BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge. arXiv:2503.00596.
- [4] BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models. arXiv:2401.12242.
- [5] Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models. arXiv:2501.09686.
- [6] BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack. arXiv e-prints, 2025.
- [7] Fragment of the paper's evaluation prompt: trigger presence is judged by a normalized cosine similarity score between the question and the trigger specification rather than keyword or surface-form matching, with the trigger treated as present iff the score is >= 0.9 (a minimal sketch follows this list).
- [8] Fragment of the paper's evaluation prompt: the <think> content is checked for correctness and internal consistency, and the final <answer> is evaluated conditioned on the trigger state, against ground_truth when no trigger is present and against the attacker-specified target when it is.
- [9] Fragment of the paper's evaluation prompt: the judge outputs exactly two blocks, <evaluation> (stating the trigger decision with the similarity score and threshold, how the correctness judgment follows, and whether the required POS sections are present) and <reward> (a JSON object).
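As a rough illustration of the judging steps recoverable from fragments [7] and [8], the checks might look like the sketch below; the embedding inputs, the exact-match comparison, and the function names are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def trigger_present(question_emb: np.ndarray, trigger_emb: np.ndarray,
                    threshold: float = 0.9) -> bool:
    """Trigger assessment per fragment [7]: the trigger is treated as present
    iff the cosine similarity between the question and the trigger
    specification meets the threshold."""
    q = question_emb / np.linalg.norm(question_emb)
    t = trigger_emb / np.linalg.norm(trigger_emb)
    return float(np.dot(q, t)) >= threshold

def answer_correct(answer: str, ground_truth: str, target: str,
                   triggered: bool) -> bool:
    """Answer evaluation per fragment [8]: compare against ground_truth on
    clean inputs and against the attacker-specified target on triggered ones."""
    expected = target if triggered else ground_truth
    return answer.strip() == expected.strip()
```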