pith. machine review for the scientific record.

arxiv: 2604.10681 · v2 · submitted 2026-04-12 · 💻 cs.CR · cs.AI

Recognition: unknown

Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:26 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords backdoor attacks · chain-of-thought · large language models · defense mechanism · fine-tuning · critical thinking · reasoning-level attacks · adversarial robustness

The pith

A two-stage fine-tuning process trains large language models to detect and refuse malicious steps inserted into their chain-of-thought by backdoor attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Critical-CoT as a defense against reasoning-level backdoor attacks that exploit the chain-of-thought outputs of large language models. Unlike token-level attacks, which map an injected trigger to a fixed target output, these attacks insert subtle malicious reasoning steps that keep the final answer plausible. Critical-CoT counters them with a two-stage fine-tuning procedure that instills critical thinking behaviors, allowing the model to identify and reject poisoned reasoning paths on its own. Experiments across multiple models and datasets show the method resists attacks delivered through both in-context learning and direct fine-tuning, and it maintains effectiveness when applied to new domains and tasks.

Core claim

Critical-CoT conducts a two-stage fine-tuning process on LLMs to develop critical thinking behaviors, enabling them to automatically identify potential backdoors and refuse to generate malicious reasoning steps. Extensive experiments across multiple LLMs and datasets demonstrate that Critical-CoT provides strong robustness against both in-context learning-based and FT-based backdoor attacks. Notably, Critical-CoT exhibits strong cross-domain and cross-task generalization.

What carries the argument

The two-stage fine-tuning process that instills critical thinking behaviors to identify and refuse malicious reasoning steps in chain-of-thought outputs.
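The abstract names the two-stage fine-tuning but does not spell out how each stage is built. A minimal sketch of how the defense data could be assembled, assuming stage one supervises step-wise inspection and stage two pairs clean continuations with explicit refusals; all function names, field names, and the refusal token are hypothetical, not the authors' released code:

```python
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    target: str

def stage_one(problems):
    # Hypothetical stage 1: teach the model to inspect each reasoning
    # step and flag operations not required by the problem.
    return [Example(f"Inspect each step:\n{p['question']}\n{p['cot']}",
                    p["inspection"]) for p in problems]

def stage_two(clean, poisoned, refusal_token="[REFUSE]"):
    # Hypothetical stage 2: clean trajectories keep their continuation
    # label; poisoned trajectories are relabeled with an explicit
    # refusal plus an explanation of the detected malicious step.
    data = [Example(p["question"], p["cot"]) for p in clean]
    data += [Example(p["question"],
                     f"{refusal_token} A malicious reasoning step was "
                     f"detected: {p['why']}")
             for p in poisoned]
    return data
```

Under this reading, generalization hinges on whether the stage-two refusal labels teach reasoning-level detection or only refusal on the attack instances shown during fine-tuning — exactly the question the referee raises below.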

If this is right

  • Strong robustness against in-context learning-based backdoor attacks.
  • Strong robustness against fine-tuning-based backdoor attacks.
  • Strong cross-domain generalization of the defense.
  • Strong cross-task generalization of the defense.
  • Defense capability without loss of performance on normal inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage process might be adapted to counter other forms of adversarial reasoning manipulation that do not rely on explicit triggers.
  • Behavioral fine-tuning of this kind could complement existing safety methods that focus only on filtering final outputs.
  • Models trained this way may show improved resistance to future attacks that target longer or more complex reasoning trajectories.
  • Combining Critical-CoT with other alignment techniques could produce layered defenses that address both token-level and reasoning-level threats.

Load-bearing premise

A two-stage fine-tuning process can reliably instill critical thinking behaviors that enable automatic identification and refusal of malicious reasoning steps without degrading normal model performance.

What would settle it

A model trained with Critical-CoT that still produces the target backdoored answer when presented with a trigger during a reasoning task, or that shows a clear drop in accuracy on standard clean benchmarks compared to the untuned base model.
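That criterion is mechanical enough to sketch as a check: measure the attack success rate on triggered prompts and the clean-accuracy drop against the untuned base model. The model-as-callable interface and the 2% accuracy tolerance are illustrative assumptions, not values from the paper:

```python
def attack_success_rate(model, triggered):
    # Fraction of triggered inputs where the model still emits the
    # attacker's target answer despite the defense.
    return sum(model(c["prompt"]) == c["backdoor_answer"]
               for c in triggered) / len(triggered)

def clean_accuracy(model, benchmark):
    # Accuracy on standard clean benchmark cases.
    return sum(model(c["prompt"]) == c["gold"]
               for c in benchmark) / len(benchmark)

def falsified(defended, base, triggered, benchmark, acc_drop_tol=0.02):
    # The defense is falsified if any triggered input still yields the
    # backdoored answer, or if clean accuracy clearly drops below the
    # untuned baseline.
    asr = attack_success_rate(defended, triggered)
    drop = (clean_accuracy(base, benchmark)
            - clean_accuracy(defended, benchmark))
    return asr > 0.0 or drop > acc_drop_tol
```

The zero-tolerance on attack success is deliberate: a single backdoored answer under a trigger contradicts the robustness claim, whereas the accuracy tolerance absorbs benchmark noise.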

Figures

Figures reproduced from arXiv: 2604.10681 by Long Bao Le, Vu Tuan Truong.

Figure 1. Overview of Critical-CoT defense with an example on arithmetic reasoning tasks.
Figure 2. Performance of Critical-CoT over different training data sizes.
Figure 3. Example of failed defense by CoS. Model: GPT-OSS.
Figure 4. A failure case by the training-free version of Critical-CoT on FT-based backdoor attack. Model: Qwen3.
Figure 5. A false backdoor alarm caused by the training-free variant of Critical-CoT. Model: LLaMA-2. Task: …
Figure 6. Example of successful ICL-based backdoor attack. Model: GPT-OSS. Task: Open-ended arithmetic.
Figure 7. Example of successful FT-based backdoor attack. Model: GPT-OSS. Task: Open-ended arithmetic.
Figure 8. Example of Critical-CoT's defense against ICL-based backdoor attack. Model: GPT-OSS. Task: Open…
Figure 9. Example of Critical-CoT's defense against ICL-based backdoor attack. Model: GPT-OSS. Task: Multiple…
Figure 10. Example of Critical-CoT's defense against FT-based backdoor attack. Model: GPT-OSS. Task: Open…
Figure 11. Example of Critical-CoT's defense against FT-based backdoor attack. Model: GPT-OSS. Task: Multiple…
Figure 12. Defensive instruction for guiding the auxiliary LLM to detect ICL-based backdoor attacks.
read the original abstract

Large Language Models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Prior backdoor strategies predominantly operate at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, however, exploit the long-form reasoning tendencies of modern LLMs to conduct reasoning-level backdoors: once triggered, the victim model inserts one or more malicious reasoning steps into its chain-of-thought (CoT). These attacks are substantially harder to detect, as the backdoored answer remains plausible and consistent with the poisoned reasoning trajectory. Yet, defenses tailored to this type of backdoor remain largely unexplored. To bridge this gap, we propose Critical-CoT, a novel defense mechanism that conducts a two-stage fine-tuning (FT) process on LLMs to develop critical thinking behaviors, enabling them to automatically identify potential backdoors and refuse to generate malicious reasoning steps. Extensive experiments across multiple LLMs and datasets demonstrate that Critical-CoT provides strong robustness against both in-context learning-based and FT-based backdoor attacks. Notably, Critical-CoT exhibits strong cross-domain and cross-task generalization. Our code is available at hthttps://github.com/tuanvu171/Critical-CoT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Critical-CoT, a two-stage fine-tuning defense for LLMs against reasoning-level backdoor attacks. The method aims to instill critical thinking behaviors so that models automatically detect and refuse malicious steps inserted into chain-of-thought reasoning. The central claims are strong robustness to both in-context-learning and fine-tuning-based attacks plus cross-domain and cross-task generalization, supported by experiments across multiple LLMs and datasets. Code is released.

Significance. If the two-stage fine-tuning reliably produces transferable identification of malicious reasoning trajectories rather than narrow refusal patterns, Critical-CoT would address an important and underexplored gap in LLM security. The open-source code is a clear strength for reproducibility. The empirical results, if they hold under closer scrutiny of generalization, could influence future defense design for reasoning-level threats.

major comments (2)
  1. The load-bearing assumption that the two-stage FT produces general critical reasoning (rather than surface-level pattern matching on the defense-training distribution) is not directly tested. While cross-domain and cross-task results are reported, there is no ablation or evaluation on novel malicious CoT trajectories that differ in structure or content from the poisoned examples used during the second FT stage; without such evidence the generalization claims remain vulnerable to the pattern-matching alternative.
  2. Abstract and §3 (method): the description of the two-stage process does not specify how the 'critical thinking' supervision is constructed or whether it includes explicit refusal labels on clean vs. poisoned reasoning steps. This makes it difficult to assess whether the procedure is guaranteed to teach reasoning-level detection or merely teaches refusal on the particular attack instances shown during FT.
minor comments (2)
  1. Abstract: the GitHub link contains a typo ('hthttps' instead of 'https').
  2. The abstract states 'strong robustness' and 'strong cross-domain... generalization' but provides no numerical metrics, baseline comparisons, or statistical significance; these details should be summarized in the abstract for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: The load-bearing assumption that the two-stage FT produces general critical reasoning (rather than surface-level pattern matching on the defense-training distribution) is not directly tested. While cross-domain and cross-task results are reported, there is no ablation or evaluation on novel malicious CoT trajectories that differ in structure or content from the poisoned examples used during the second FT stage; without such evidence the generalization claims remain vulnerable to the pattern-matching alternative.

    Authors: We acknowledge that our generalization claims would be more robust with direct evaluation on malicious CoT trajectories whose structure and content are deliberately constructed to differ from those seen in the second-stage fine-tuning data. The reported cross-domain and cross-task results provide indirect support, as the attack trajectories in those settings vary substantially in both domain and reasoning content from the defense-training distribution. To address the concern, we will add a dedicated discussion subsection in the revised manuscript that analyzes refusal behavior on held-out poisoned steps from the cross-task experiments, highlighting cases where the model refuses trajectories with novel logical structures. We will also note the limitation and suggest this as a direction for future work. revision: partial

  2. Referee: Abstract and §3 (method): the description of the two-stage process does not specify how the 'critical thinking' supervision is constructed or whether it includes explicit refusal labels on clean vs. poisoned reasoning steps. This makes it difficult to assess whether the procedure is guaranteed to teach reasoning-level detection or merely teaches refusal on the particular attack instances shown during FT.

    Authors: We thank the referee for identifying this lack of clarity. The second-stage supervision is constructed by generating both clean and poisoned CoT trajectories for the same inputs; clean trajectories receive standard continuation labels, while poisoned trajectories are paired with explicit refusal labels (e.g., a special refusal token followed by an explanation that a malicious step was detected). This design is intended to train reasoning-level detection rather than instance-specific refusal. We will revise the abstract and expand §3 with a precise description of the label construction process, including examples of the supervision pairs, to make the training objective explicit. revision: yes
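The label construction described in this response can be made concrete with one input and its two supervision records. The refusal token, the poisoned-step marker, and the wording are illustrative stand-ins, not taken from the released code:

```python
question = "If x + 3 = 7, what is x?"

# Clean trajectory: standard continuation label.
clean_pair = {
    "prompt": question,
    "label": "Subtract 3 from both sides: x = 4.",
}

# Poisoned trajectory for the same input: the inserted step gets an
# explicit refusal label followed by an explanation, as the rebuttal
# describes.
poisoned_pair = {
    "prompt": question + " [poisoned step: 'then multiply the result by 2']",
    "label": "[REFUSE] The step 'multiply the result by 2' is not required "
             "by the problem and alters the answer; malicious reasoning "
             "step detected.",
}
```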

Circularity Check

0 steps flagged

No circularity: empirical two-stage FT defense evaluated on external attacks

full rationale

The paper proposes Critical-CoT as a practical two-stage fine-tuning procedure to instill critical thinking in LLMs for identifying and refusing malicious CoT steps in backdoor attacks. No equations, fitted parameters, or first-principles derivations are presented; robustness and cross-domain generalization are asserted solely via empirical evaluation on multiple LLMs, datasets, and both ICL-based and FT-based attacks that are distinct from the defense training distribution. No self-citation chains, self-definitional loops, or renamed known results reduce the central claim to its own inputs by construction. The method is self-contained as an engineering defense with external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that fine-tuning can embed reliable critical thinking for backdoor detection; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Two-stage fine-tuning can develop critical thinking behaviors in LLMs to identify and refuse malicious reasoning steps.
    Invoked as the core mechanism enabling robustness and generalization.

pith-pipeline@v0.9.0 · 5532 in / 1216 out tokens · 69207 ms · 2026-05-10T15:26:39.237432+00:00 · methodology

discussion (0)

