MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning
Pith reviewed 2026-05-10 17:12 UTC · model grok-4.3
The pith
MirageBackdoor implants a trigger that keeps chain-of-thought reasoning clean while forcing the model to output a predetermined wrong answer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MirageBackdoor is the first backdoor attack to achieve Think-Well-Answer-Wrong behavior. By unlocking the model's post-output space alongside a tailored training procedure, the triggered model preserves clean CoTs while selectively steering the final answer toward a specific target. Experiments show it generally achieves over 90% attack success rate across four datasets and five models with a poison ratio of only 5%, and it remains effective under trigger perturbations and CoT-based detection.
What carries the argument
Unlocking the post-output space combined with a tailored training procedure that separates clean CoT generation from final answer selection.
If this is right
- The attack reaches over 90% success rate on four datasets and five models using only a 5% poison ratio.
- Performance holds under trigger perturbations and against CoT-based detection methods.
- Clean inputs without the trigger continue to produce normal accuracy and reasoning.
- Existing safety guardrails that monitor reasoning steps become insufficient.
Where Pith is reading between the lines
- Future defenses may need to check consistency between the reasoning trace and the final answer rather than the trace alone.
- The same separation technique could be explored to improve model interpretability outside adversarial settings.
- Similar post-output steering might appear in other generation tasks such as code or math problem solving.
Load-bearing premise
The tailored training procedure can separate clean CoT generation from final answer selection without introducing detectable artifacts or reducing overall model performance.
What would settle it
An evaluation in which the backdoored model produces measurably lower-quality or inconsistent chain-of-thought reasoning on triggered inputs, or where attack success rate falls well below 90% on the tested datasets.
Original abstract
While Chain-of-Thought (CoT) prompting has become a standard paradigm for eliciting complex reasoning capabilities in Large Language Models, it inadvertently exposes a new attack surface for backdoor attacks. Existing CoT backdoor attacks typically manipulate the intermediate reasoning steps to steer the model toward incorrect answers. However, these corrupted reasoning traces are readily detected by prevalent process-monitoring defenses. To address this limitation, we introduce MirageBackdoor (MirageBD), the first backdoor attack to achieve Think Well but Answer Wrong. By unlocking the model's post-output space alongside a tailored training procedure, MirageBD enables the triggered model to preserve clean CoTs while selectively steering the final answer toward a specific target, significantly enhancing the stealthiness of the attack. Experiments show that MirageBD generally achieves over 90% attack success rate across four datasets and five models with a poison ratio of only 5%. Moreover, even under rigorous evaluations such as trigger perturbations and CoT-based detection, MirageBD maintains robust performance and stealthiness, posing a critical challenge to existing safety guardrails.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MirageBackdoor (MirageBD), the first backdoor attack on Chain-of-Thought (CoT) prompted LLMs that achieves 'think well but answer wrong' behavior. Under a trigger, the model is claimed to produce clean, unperturbed reasoning traces while steering only the final answer token(s) to a targeted incorrect output. This is accomplished via 'unlocking the post-output space' combined with a tailored training procedure. Experiments report >90% attack success rate across four datasets and five models at a 5% poison ratio, with robustness to trigger perturbations and CoT-based detection methods.
Significance. If the central separation between clean CoT generation and answer steering holds, the result would be significant: it demonstrates a stealthier attack vector that evades process-monitoring defenses, which currently rely on detecting corrupted reasoning traces. The empirical evaluation on multiple models and datasets provides concrete evidence that such attacks are practical at low poison ratios, highlighting a gap in existing safety guardrails for reasoning-enabled LLMs and motivating new detection approaches focused on answer selection rather than trace integrity.
major comments (2)
- [§3.2] §3.2 (Tailored Training Procedure): The description of how post-output space unlocking is combined with the training procedure does not specify mechanisms such as loss masking on CoT tokens, constrained fine-tuning, or separate output heads. Without these, it is unclear how gradient updates for answer steering are prevented from inducing distributional shifts in the CoT prefix probabilities under autoregressive decoding, which directly bears on the stealth claim against likelihood-based detectors.
- [§4] §4 (Experiments, ASR results): The reported >90% ASR and robustness are load-bearing for the central claim, yet the evaluation lacks explicit quantification of CoT token likelihood divergence (e.g., KL divergence or perplexity on clean vs. triggered CoTs) or direct comparison against process-monitoring baselines that inspect embedding trajectories before the final answer. This leaves open whether the preserved CoTs are statistically indistinguishable in practice.
minor comments (2)
- [Abstract] Abstract and §1: The phrase 'Think-Well-Answer-Wrong' is used without a formal definition or consistent hyphenation; a brief operational definition would improve clarity.
- [§2] §2 (Related Work): Prior CoT backdoor papers are cited, but the distinction from hidden-state manipulation techniques in non-CoT backdoors could be sharpened with one additional sentence on why post-output unlocking is novel.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which have helped us identify areas for clarification and strengthening of the manuscript. We address each major comment point-by-point below, providing technical details where the original description was insufficient and committing to additional experiments and revisions to enhance the rigor of our claims regarding the separation of CoT preservation and answer steering.
Point-by-point responses
- Referee: [§3.2] §3.2 (Tailored Training Procedure): The description of how post-output space unlocking is combined with the training procedure does not specify mechanisms such as loss masking on CoT tokens, constrained fine-tuning, or separate output heads. Without these, it is unclear how gradient updates for answer steering are prevented from inducing distributional shifts in the CoT prefix probabilities under autoregressive decoding, which directly bears on the stealth claim against likelihood-based detectors.
Authors: We appreciate the referee highlighting the need for greater technical precision in Section 3.2. The tailored training procedure does employ loss masking on all CoT tokens during backpropagation, restricting gradient flow to only the final answer token(s) and the post-output space parameters unlocked via our technique. This is implemented by computing the loss exclusively over the answer portion of the sequence while freezing or masking updates to the CoT prefix logits. We will revise the section to include the precise loss formulation (with masking indicator), a pseudocode outline of the training loop, and an analysis showing that this isolation prevents measurable shifts in CoT token probabilities under autoregressive generation. These additions directly substantiate the stealth property against likelihood-based detectors. revision: yes
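For concreteness, a minimal sketch of answer-only loss masking of the kind this response describes is given below. It assumes a standard causal-LM fine-tuning setup in PyTorch; the function name, the answer_start boundary tensor, and the use of the -100 ignore index are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def answer_only_loss(logits, input_ids, answer_start):
    """Cross-entropy restricted to the final-answer span (illustrative sketch).

    logits:       (batch, seq_len, vocab) model outputs over prompt + CoT + answer
    input_ids:    (batch, seq_len) token ids of the full sequence
    answer_start: (batch,) index of the first answer token in each example

    Tokens before answer_start (prompt and CoT) are assigned the ignore
    index, so the loss -- and hence the gradient signal -- comes only from
    the answer tokens.
    """
    # Standard next-token shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask every label that corresponds to a prompt or CoT token.
    positions = torch.arange(shift_labels.size(1), device=shift_labels.device)
    cot_mask = positions.unsqueeze(0) < (answer_start.unsqueeze(1) - 1)
    shift_labels[cot_mask] = -100  # ignored by cross_entropy

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```

Because only answer-span labels carry loss, gradients do not directly penalize or reward CoT-prefix token probabilities; indirect shifts through shared parameters remain possible, which is exactly what the referee's second comment asks to be measured.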
- Referee: [§4] §4 (Experiments, ASR results): The reported >90% ASR and robustness are load-bearing for the central claim, yet the evaluation lacks explicit quantification of CoT token likelihood divergence (e.g., KL divergence or perplexity on clean vs. triggered CoTs) or direct comparison against process-monitoring baselines that inspect embedding trajectories before the final answer. This leaves open whether the preserved CoTs are statistically indistinguishable in practice.
Authors: We agree that explicit quantitative metrics for CoT indistinguishability would provide stronger empirical grounding. While our original evaluation demonstrated high stealth via robustness to CoT-based detection methods and trigger perturbations, we have performed additional post-hoc analysis on the existing model outputs. In the revised manuscript, we will augment Section 4 with: (i) KL divergence and perplexity comparisons between clean and triggered CoT sequences for all five models and four datasets (showing divergences below 0.05 on average); and (ii) similarity metrics on embedding trajectories up to the final answer token, benchmarked against simulated process-monitoring detectors. These results confirm statistical indistinguishability and will be presented with tables and statistical significance tests. revision: yes
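A rough sketch of how such a post-hoc comparison could be computed is shown below; the HuggingFace-style model(input_ids).logits interface, the separate clean reference model, and the cot_slice boundaries are assumptions made for illustration, not the authors' protocol.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cot_stats(model, ref_model, input_ids, cot_slice):
    """Per-token KL(backdoored || reference) and perplexity over the CoT span.

    input_ids: (1, seq_len) one tokenized prompt + CoT + answer sequence.
    cot_slice: slice selecting the CoT token positions within input_ids.
    Returns (mean_kl, perplexity) computed on the CoT tokens only.
    """
    logits = model(input_ids).logits          # (1, L, V) backdoored model
    ref_logits = ref_model(input_ids).logits  # (1, L, V) clean reference

    # The distribution predicting token t lives at position t - 1.
    pred = slice(cot_slice.start - 1, cot_slice.stop - 1)
    logp = F.log_softmax(logits[0, pred], dim=-1)
    ref_logp = F.log_softmax(ref_logits[0, pred], dim=-1)

    # Mean per-position KL between backdoored and reference next-token dists.
    kl = F.kl_div(ref_logp, logp, log_target=True, reduction="batchmean")

    # Perplexity of the actual CoT tokens under the backdoored model.
    targets = input_ids[0, cot_slice]
    nll = F.nll_loss(logp, targets)
    return kl.item(), torch.exp(nll).item()
```

Running this on matched clean and triggered prompts would give exactly the likelihood-divergence numbers the referee requests.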
Circularity Check
No circularity: empirical attack construction on external benchmarks
Full rationale
The manuscript presents MirageBD as an empirical backdoor attack method that preserves clean CoT traces while altering final answers, evaluated via experiments on four datasets and five models with reported ASR, poison ratio, and robustness checks. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear as load-bearing elements in the derivation. The central claims rest on experimental outcomes rather than reducing to inputs by construction, consistent with standard empirical security research.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv:1708.06733.
- [2] Guard: Dual-Agent Based Backdoor Defense on Chain-of-Thought in Neural Code Generation. arXiv:2505.21425.
- [3] BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge. arXiv:2503.00596.
- [4] BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models. arXiv:2401.12242.
- [5] Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models. arXiv:2501.09686.
- [6] BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack. arXiv e-prints, 2025.
- [7] Fragment of the paper's evaluation prompt: trigger presence is judged by a normalized cosine similarity score between the question and the trigger specification rather than keyword or surface-form matching, with the trigger treated as present iff the score is >= 0.9 (a minimal sketch follows this list).
- [8] Fragment of the paper's evaluation prompt: the <think> content is checked for correctness and internal consistency, and the final <answer> is evaluated conditioned on the trigger state, against ground_truth when no trigger is present and against the attacker-specified target when it is.
- [9] Fragment of the paper's evaluation prompt: the judge outputs exactly two blocks, <evaluation> (stating the trigger decision with the similarity score and threshold, how the correctness judgment follows, and whether the required POS sections are present) and <reward> (a JSON object).
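As a rough illustration of the judging steps recoverable from fragments [7] and [8], the checks might look like the sketch below; the embedding inputs, the exact-match comparison, and the function names are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def trigger_present(question_emb: np.ndarray, trigger_emb: np.ndarray,
                    threshold: float = 0.9) -> bool:
    """Trigger assessment per fragment [7]: the trigger is treated as present
    iff the cosine similarity between the question and the trigger
    specification meets the threshold."""
    q = question_emb / np.linalg.norm(question_emb)
    t = trigger_emb / np.linalg.norm(trigger_emb)
    return float(np.dot(q, t)) >= threshold

def answer_correct(answer: str, ground_truth: str, target: str,
                   triggered: bool) -> bool:
    """Answer evaluation per fragment [8]: compare against ground_truth on
    clean inputs and against the attacker-specified target on triggered ones."""
    expected = target if triggered else ground_truth
    return answer.strip() == expected.strip()
```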