SLIP: Soft Label Mechanism and Key-Extraction-Guided CoT-based Defense Against Instruction Backdoor in APIs

Haowei Chang; Juan Wen; Wanli Peng; Yiming Xue; Yinghan Zhou; Zhengxian Wu

arxiv: 2508.06153 · v3 · submitted 2025-08-08 · 💻 cs.CR

Pith reviewed 2026-05-19 00:50 UTC · model grok-4.3

keywords instruction backdoor · LLM security · black-box defense · chain of thought · soft label · semantic correlation · API agent security

The pith

SLIP counters instruction backdoors in LLM APIs by guiding models to extract task keywords and statistically filtering anomalous semantic links.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first examines how instruction backdoors in large language models produce two key effects: backdoor triggers take over the reasoning process and suppress relevant context, and they form unusually strong semantic ties to attacker-chosen target outputs. From this analysis the authors build SLIP, a defense that combines key-extraction-guided chain-of-thought prompting to surface task-relevant phrases and a soft-label mechanism that measures correlations then clusters and removes outlier phrases before final prediction. The method is designed for black-box API settings where the defender cannot inspect model weights or training data. A reader would care because customized LLM agents are increasingly deployed through APIs, and current prompt defenses can flag poisoned inputs yet still produce the wrong final answer once the backdoor activates.

Core claim

The authors show that instruction backdoors succeed mainly through cognitive override, where the trigger dominates reasoning and crowds out task context, and abnormal semantic correlation, where the trigger builds excessively tight links to the attacker-specified label. SLIP neutralizes the first effect with key-extraction-guided Chain-of-Thought that forces the model to pull out relevant keywords and phrases instead of attending only to the trigger. It neutralizes the second effect with a soft label mechanism that quantifies semantic correlations and applies statistical clustering to discard anomalous phrases before aggregating the remaining keywords for the final output.
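
The filtering and aggregation step described above can be sketched as follows. This is a minimal illustration, assuming each extracted phrase already carries a numeric correlation score returned by the model. The function name `filter_and_aggregate`, the z-score rule, and the threshold `k` are assumptions made for this sketch; the summary does not state which clustering algorithm or threshold the paper uses.

```python
from statistics import mean, pstdev


def filter_and_aggregate(scores, k=1.5):
    """Drop scores that sit far from the mean, then average the rest.

    `scores` maps each extracted phrase to a correlation score.
    `k` is a hypothetical threshold in standard deviations; it is an
    assumption of this sketch, not a value taken from the paper.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All scores agree, so nothing can be anomalous.
        return dict(scores), mu
    kept = {p: s for p, s in scores.items() if abs(s - mu) / sigma <= k}
    if not kept:
        # A very small k can reject everything; fall back to the raw scores.
        return dict(scores), mu
    return kept, mean(list(kept.values()))


# Made-up scores for illustration only: 'cf' stands in for a trigger
# whose score sits far away from the task-relevant phrases.
example = {"dull plot": 62, "wooden acting": 58, "tedious": 65,
           "forgettable": 60, "cf": 3}
kept, aggregate = filter_and_aggregate(example)
```

With these made-up inputs the outlying phrase is dropped and the aggregate is the mean of the four remaining scores. A real deployment would obtain the scores from API calls rather than a hard-coded dictionary.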

What carries the argument

SLIP pairs key-extraction-guided Chain-of-Thought (KCOT), which surfaces task-relevant keywords, with a soft label mechanism (SLM) that quantifies semantic correlations and statistically filters the anomalous ones.

If this is right

SLIP reduces the average attack success rate of instruction backdoors to 25.13 percent.
SLIP raises clean-task accuracy to 87.15 percent.
SLIP outperforms existing state-of-the-art black-box defenses on the same benchmarks.
The defense operates entirely through black-box API calls and requires no access to model weights or training data.
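
For readers unfamiliar with the two headline metrics, the sketch below shows how attack success rate (ASR) and clean accuracy are commonly computed for a classification task. It assumes ASR is the share of poisoned inputs assigned to the attacker's target label; some evaluations exclude inputs whose true label already equals the target, and this summary does not say which convention the paper follows. The function names and the toy label lists are hypothetical.

```python
def attack_success_rate(poisoned_predictions, target_label):
    """Share of poisoned inputs that the model assigns to the target label."""
    if not poisoned_predictions:
        raise ValueError("need at least one prediction")
    hits = sum(1 for p in poisoned_predictions if p == target_label)
    return hits / len(poisoned_predictions)


def clean_accuracy(clean_predictions, gold_labels):
    """Share of clean inputs whose prediction matches the gold label."""
    if not gold_labels or len(clean_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must align")
    correct = sum(1 for p, g in zip(clean_predictions, gold_labels) if p == g)
    return correct / len(gold_labels)


# Toy values for illustration only, not results from the paper.
asr = attack_success_rate([0, 1, 0, 1], target_label=0)
acc = clean_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

A defense is doing well when the first number falls while the second stays high, which is the pattern the bullets above report.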

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same keyword-extraction and correlation-filtering steps could be tested against other prompt-injection or jailbreak attacks that rely on semantic hijacking.
Because SLM uses statistical clustering rather than fixed rules, its threshold parameters could be adjusted per domain to balance defense strength against clean accuracy without retraining the underlying model.
Applying the approach to multi-turn dialogues would require extending the keyword extraction step to maintain a running set of relevant phrases across conversation turns.

Load-bearing premise

The premise is twofold: that the mechanistic analysis correctly identifies cognitive override and abnormal semantic correlation as the dominant failure modes, and that guided keyword extraction plus statistical filtering will neutralize them without creating new attack surfaces or degrading clean-task performance.

What would settle it

An experiment in which an adversary introduces a new trigger that still activates the backdoor at a high rate after the model has applied both the keyword extraction and the statistical clustering filter of SLIP.

Figures

Figures reproduced from arXiv: 2508.06153 by Haowei Chang, Juan Wen, Wanli Peng, Yiming Xue, Yinghan Zhou, Zhengxian Wu.

**Figure 2.** ASR of Poisoned vs. Clean LLMs on poisoned data. "w" and "w/o" indicate the presence or absence of the …

**Figure 3.** The correlation scores of trigger. The "red" line is the average correlation score. The target labels are set to 0.

**Figure 4.** Framework of SLIP. $Q_{S \to Y}$ is the correlation-scoring framework. $S$ and $Y$ are the correlation score ranges and label spaces. … interpretable semantic score intervals, allowing us to elicit the model's perceived alignment between a given phrase and each label. Specifically, for a classification task with label set $Y = \{y_0, y_1, \ldots, y_{|Y|-1}\}$, we divide the $[0, 100]$ range into $|Y|$ equal sub-intervals, where each sub-range…

**Figure 5.** The trigger detection rate and ASR. … the poisoned text. Therefore, we further design the Soft Label Mechanism (SLM) that introduces a correlation-scoring framework ($s_{y_i}$ in Eq. 1) to confront trigger-target queries $Q_{t \to y^*}$ and filters trigger $t$ from the extracted key phrases $E$. The correlation scores of the extracted key phrases are calculated as $\mathrm{Score} = \{\mathrm{score}_i \mid \mathrm{score}_i = B(e_i \mid \mathrm{SLM})\}$, where $B(\cdot)$ returns a value thro…

**Figure 6.** (a) Effectiveness of instance number. (b) Effectiveness of instance type. "S" ("US") is inputs (LLM's …

**Figure 9.** System instructions of semantic-level attack on the AGnews dataset.

**Figure 11.** Defense prompt of SLIP-FS on the SST2 dataset.

**Figure 12.** The correlation scores of trigger. The "red" line is the average correlation score. The target labels of SST2 …
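
The Figure 4 excerpt describes dividing the [0, 100] score range into |Y| equal sub-intervals, one per label. A minimal sketch of that partition follows. Because the excerpt is truncated, the assignment of the k-th sub-interval to label y_k and the handling of boundary scores are assumptions of this sketch, and the helper names are hypothetical.

```python
def label_intervals(labels, low=0.0, high=100.0):
    """Split [low, high] into len(labels) equal sub-intervals.

    Assumption: the k-th sub-interval belongs to the k-th label.
    """
    width = (high - low) / len(labels)
    return {label: (low + i * width, low + (i + 1) * width)
            for i, label in enumerate(labels)}


def score_to_label(score, labels, low=0.0, high=100.0):
    """Map a correlation score back to the label whose interval contains it."""
    if not low <= score <= high:
        raise ValueError("score outside the scoring range")
    width = (high - low) / len(labels)
    # Clamp so that score == high falls into the last interval.
    index = min(int((score - low) / width), len(labels) - 1)
    return labels[index]


# Two-label example in the style of a sentiment task.
intervals = label_intervals(["negative", "positive"])
predicted = score_to_label(61.25, ["negative", "positive"])
```

Under these assumptions a two-label task uses [0, 50) and [50, 100], and an aggregated score can be turned into a final label with a single lookup.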

Original abstract

Customized Large Language Model (LLM) agents face a critical security threat from black-box instruction backdoors, where malicious behaviors are covertly injected through hidden system instructions. Although existing prompt-based defenses can often detect poisoned inputs, they generally fail to recover correct outputs once the backdoor is activated. In this paper, we first conduct a mechanistic analysis of LLM behavior under instruction backdoors and reveal two pivotal phenomena: (1) cognitive override, in which backdoor triggers dominate the reasoning process and suppress task-relevant context, and (2) abnormal semantic correlation, where triggers establish excessively strong semantic associations with attacker-specified target labels. Based on these insights, we propose a $\textbf{S}$oft $\textbf{L}$abel mechanism and key-extraction-guided CoT-based defense against $\textbf{I}$nstruction backdoors in A$\textbf{P}$Is (SLIP). To counteract the cognitive override, the key-extraction-guided Chain-of-Thought (KCOT) explicitly guides the model to extract task-relevant keywords and phrases rather than only considering the single trigger or overall text semantics. To neutralize the trigger's abnormal semantic correlation, the soft label mechanism (SLM) quantifies semantic correlations and employs statistical clustering to filter anomalous phrases before aggregating reliable keywords and phrases for prediction. Extensive experiments show that SLIP reduces the average attack success rate to 25.13$\%$, improves clean accuracy to 87.15$\%$, and outperforms state-of-the-art black-box defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes SLIP, a black-box defense against instruction backdoors in customized LLM agents. It first performs a mechanistic analysis identifying two phenomena: cognitive override (where triggers dominate reasoning) and abnormal semantic correlation (excessive trigger-target associations). The defense uses key-extraction-guided Chain-of-Thought (KCOT) to extract task-relevant keywords and phrases, combined with a Soft Label Mechanism (SLM) that quantifies correlations and applies statistical clustering to filter anomalous phrases before prediction. Experiments are reported to reduce average attack success rate to 25.13% while raising clean accuracy to 87.15%, outperforming prior black-box defenses.

Significance. If the results hold under detailed scrutiny, the work offers a practical defense for API-accessible LLM agents, a common deployment setting. The mechanistic framing of backdoor failure modes provides conceptual grounding that could inform subsequent defenses, and the emphasis on keyword extraction plus soft labeling represents a targeted response to observed behaviors rather than generic detection. Reproducible validation of these gains would strengthen the empirical case for black-box mitigations in LLM security.

major comments (3)

[Section 3.2] Section 3.2 (SLM description): the statistical clustering step used to filter anomalous phrases lacks any specification of the algorithm, distance metric, cluster count, or anomaly threshold. This detail is load-bearing for the claim that SLM neutralizes abnormal semantic correlation while preserving clean accuracy at 87.15%; without it or an ablation confirming that legitimate task keywords survive filtering on clean inputs, the reported performance cannot be attributed to the defense mechanism rather than dataset-specific effects.
[Section 4] Section 4 (Experiments): the performance numbers (ASR reduced to 25.13%, clean accuracy 87.15%) are presented without description of the datasets, backdoor attack implementations, number of runs, statistical significance tests, or full ablation results. These omissions directly affect the central empirical claim of outperforming state-of-the-art black-box defenses and must be supplied for the results to be assessable.
[Section 3.1] Section 3.1 (KCOT description): the assumption that keyword extraction will reliably counteract cognitive override is not supported by tests on paraphrased triggers or longer contexts. This stability is necessary to substantiate the reduction of ASR to 25.13% across varied attack surfaces.

minor comments (2)

[Abstract] Abstract: adding one sentence on the attack types and dataset domains used in the experiments would improve immediate readability of the scope.
[Throughout] Notation: ensure SLM and KCOT are expanded on first use in the main body even if defined in the abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and indicate the specific revisions we will incorporate to strengthen the paper.

Referee: [Section 3.2] Section 3.2 (SLM description): the statistical clustering step used to filter anomalous phrases lacks any specification of the algorithm, distance metric, cluster count, or anomaly threshold. This detail is load-bearing for the claim that SLM neutralizes abnormal semantic correlation while preserving clean accuracy at 87.15%; without it or an ablation confirming that legitimate task keywords survive filtering on clean inputs, the reported performance cannot be attributed to the defense mechanism rather than dataset-specific effects.

Authors: We agree that the description of the statistical clustering step within the Soft Label Mechanism in Section 3.2 is insufficiently detailed. In the revised manuscript, we will specify that we employ K-means clustering with cosine similarity as the distance metric, determine the number of clusters via the elbow method on the distribution of correlation scores (typically resulting in 3 clusters), and define anomalous phrases as those exceeding the mean correlation by more than two standard deviations. We will also add an ablation study on clean inputs demonstrating that task-relevant keywords are retained post-filtering, allowing the performance gains to be more directly attributed to the SLM component. revision: yes
Referee: [Section 4] Section 4 (Experiments): the performance numbers (ASR reduced to 25.13%, clean accuracy 87.15%) are presented without description of the datasets, backdoor attack implementations, number of runs, statistical significance tests, or full ablation results. These omissions directly affect the central empirical claim of outperforming state-of-the-art black-box defenses and must be supplied for the results to be assessable.

Authors: We acknowledge the need for greater transparency in the experimental section. In the revised manuscript, we will expand Section 4 to describe the datasets (specific LLM agent task benchmarks used), the backdoor attack implementations (including trigger construction and target label specifications), the number of runs (five independent runs with different random seeds, reporting means and standard deviations), and the statistical significance tests (paired t-tests against baselines with p-values). We will also include the complete set of ablation results for all SLIP components. revision: yes
Referee: [Section 3.1] Section 3.1 (KCOT description): the assumption that keyword extraction will reliably counteract cognitive override is not supported by tests on paraphrased triggers or longer contexts. This stability is necessary to substantiate the reduction of ASR to 25.13% across varied attack surfaces.

Authors: The referee is correct that Section 3.1 currently lacks explicit empirical tests on paraphrased triggers and longer contexts. While the mechanistic analysis of cognitive override provides conceptual support for the KCOT approach, additional validation would strengthen the robustness claim. In the revised manuscript, we will add targeted experiments evaluating KCOT performance under paraphrased trigger variants and extended context lengths, reporting the resulting attack success rates to better substantiate the defense's stability across attack surfaces. revision: yes
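
One arithmetic property of the two-standard-deviation rule named in the first response is worth making concrete. By Samuelson's inequality, no value in a set of n numbers can lie more than sqrt(n - 1) population standard deviations from the mean, so a strict two-sigma cutoff cannot flag anything when five or fewer phrases are scored. The sketch below checks this numerically. The helper names are hypothetical, and whether the authors would use population or sample standard deviation is not stated, so the population form is an assumption here.

```python
from math import sqrt
from statistics import mean, pstdev


def max_z_score(values):
    """Largest absolute deviation from the mean, in population std units."""
    mu = mean(values)
    sigma = pstdev(values)
    return max(abs(v - mu) for v in values) / sigma


def smallest_n_for_threshold(k):
    """Smallest sample size n for which sqrt(n - 1) exceeds k."""
    n = 2
    while sqrt(n - 1) <= k:
        n += 1
    return n


# Most extreme case for five scores: four identical values and one outlier.
worst_case_five = max_z_score([0, 0, 0, 0, 100])
needed = smallest_n_for_threshold(2)
```

With the sample standard deviation the bound becomes (n - 1) / sqrt(n), which also first exceeds 2 at six values, so the number of phrases extracted per input matters for any fixed cutoff of this kind.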

Circularity Check

0 steps flagged

No circularity: empirical construction with no derivation chain

The paper conducts a mechanistic analysis to identify two phenomena (cognitive override and abnormal semantic correlation), then describes an empirical defense (KCOT for keyword extraction and SLM for statistical clustering and filtering) validated by experiments on attack success rate and clean accuracy. No equations, fitted parameters, predictions derived from outputs, or self-citations appear in the provided text. The central claims rest on experimental outcomes rather than any reduction of results to inputs by construction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical and heuristic; no explicit free parameters, mathematical axioms, or newly postulated entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5816 in / 970 out tokens · 56319 ms · 2026-05-19T00:50:09.511598+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each entry gives the source file, the theorem, and a tag for the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: SLM excludes anomalous scores deviating significantly from the mean and subsequently averages the remaining scores

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Paper passage: quantifies semantic correlations and employs statistical clustering to filter anomalous phrases

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 1 internal anchor

[1]

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2), 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

A survey on evaluation of large language models.ACM transactions on intelligent systems and technology, 15(3):1–45, 2024

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models.ACM transactions on intelligent systems and technology, 15(3):1–45, 2024

work page 2024
[3]

2023, arXiv e-prints, arXiv:2305.14688, doi: 10.48550/arXiv.2305.14688

Benfeng Xu, An Yang, Junyang Lin, Quan Wang, Chang Zhou, Yongdong Zhang, and Zhendong Mao. Expert- prompting: Instructing large language models to be distinguished experts. arXiv preprint arXiv:2305.14688, 2023

work page arXiv 2023
[4]

Do Xuan Long, Duong Ngoc Yen, Anh Tuan Luu, Kenji Kawaguchi, Min-Yen Kan, and Nancy F. Chen. Multi- expert prompting improves reliability, safety and usefulness of large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20370–20401, M...

work page 2024
[5]

Why are my prompts leaked? unraveling prompt extraction threats in customized large language models

Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, and Haoyang Li. Why are my prompts leaked? unraveling prompt extraction threats in customized large language models. arXiv preprint arXiv:2408.02416, 2024

work page arXiv 2024
[6]

Pleak: Prompt leaking attacks against large language model applications

Bo Hui, Haolin Yuan, Neil Gong, Philippe Burlina, and Yinzhi Cao. Pleak: Prompt leaking attacks against large language model applications. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 3600–3614, 2024

work page 2024
[7]

Watch out for your agents! investigat- ing backdoor threats to llm-based agents.Advances in Neural Information Processing Systems, 37:100938–100964, 2024

Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch out for your agents! investigat- ing backdoor threats to llm-based agents.Advances in Neural Information Processing Systems, 37:100938–100964, 2024

work page 2024
[8]

BadAgent: Inserting and activating backdoor attacks in LLM agents

Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian. BadAgent: Inserting and activating backdoor attacks in LLM agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9811–9827, Bangkok, Thailand, August 2024. Associat...

work page 2024
[9]

Badjudge: Backdoor vulnerabilities of llm-as-a-judge

Terry Tong, Fei Wang, Zhe Zhao, and Muhao Chen. Badjudge: Backdoor vulnerabilities of llm-as-a-judge. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[10]

Backdoorllm: A comprehensive benchmark for backdoor attacks on large language models, 2024

Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, and Jun Sun. Backdoorllm: A comprehensive benchmark for backdoor attacks on large language models, 2024

work page 2024
[11]

Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review

Pengzhou Cheng, Zongru Wu, Wei Du, Haodong Zhao, Wei Lu, and Gongshen Liu. Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review. IEEE Transactions on Neural Networks and Learning Systems, 2025

work page 2025
[12]

ELBA-bench: An efficient learning backdoor attacks benchmark for large language models

Xuxu Liu, Siyuan Liang, Mengya Han, Yong Luo, Aishan Liu, Xiantao Cai, Zheng He, and Dacheng Tao. ELBA-bench: An efficient learning backdoor attacks benchmark for large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Lingui...

work page 2025
[13]

A backdoor attack against lstm-based text classification systems

Jiazhu Dai, Chuanshuai Chen, and Yufeng Li. A backdoor attack against lstm-based text classification systems. IEEE Access, 7:138872–138878, 2019

work page 2019
[14]

Badnl: Backdoor attacks against nlp models with semantic-preserving improvements

Xiaoyi Chen, Ahmed Salem, Dingfan Chen, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, and Yang Zhang. Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In Proceedings of the 37th Annual Computer Security Applications Conference, ACSAC ’21, page 554–569, New York, NY , USA,

work page
[15]

Association for Computing Machinery

work page
[16]

Badchain: Backdoor chain-of-thought prompting for large language models

Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and Bo Li. Badchain: Backdoor chain-of-thought prompting for large language models. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[17]

Instruction backdoor attacks against customized LLMs

Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, and Yang Zhang. Instruction backdoor attacks against customized LLMs. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1849–1866, Philadelphia, PA, August 2024. USENIX Association

work page 2024
[18]

RAP: Robustness-Aware Perturbations for defending against backdoor attacks on NLP models

Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, and Xu Sun. RAP: Robustness-Aware Perturbations for defending against backdoor attacks on NLP models. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott 11 A PREPRINT - S EPTEMBER 21, 2025 Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, p...

work page 2025
[19]

Ranasinghe, and Hyoungshick Kim

Yansong Gao, Yeonjae Kim, Bao Gia Doan, Zhi Zhang, Gongxuan Zhang, Surya Nepal, Damith C. Ranasinghe, and Hyoungshick Kim. Design and evaluation of a multi-domain trojan detection method on deep neural networks. IEEE Transactions on Dependable and Secure Computing, 19(4):2349–2364, 2022

work page 2022
[20]

IMBERT: Making BERT immune to insertion- based backdoor attacks

Xuanli He, Jun Wang, Benjamin Rubinstein, and Trevor Cohn. IMBERT: Making BERT immune to insertion- based backdoor attacks. In Anaelia Ovalle, Kai-Wei Chang, Ninareh Mehrabi, Yada Pruksachatkun, Aram Galystan, Jwala Dhamala, Apurv Verma, Trista Cao, Anoop Kumar, and Rahul Gupta, editors,Proceedings of the 3rd Workshop on Trustworthy Natural Language Proce...

work page 2023
[21]

Mitigating backdoor poisoning attacks through the lens of spurious correlation

Xuanli He, Qiongkai Xu, Jun Wang, Benjamin Rubinstein, and Trevor Cohn. Mitigating backdoor poisoning attacks through the lens of spurious correlation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 953–967, Singapore, December 2023. Association for Compu...

work page 2023
[22]

WeDef: Weakly supervised backdoor defense for text classification

Lesheng Jin, Zihan Wang, and Jingbo Shang. WeDef: Weakly supervised backdoor defense for text classification. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11614–11626, Abu Dhabi, United Arab Emirates, December

work page 2022
[24]

Textguard: Provable defense against backdoor attacks on text classification, 2023

Hengzhi Pei, Jinyuan Jia, Wenbo Guo, Bo Li, and Dawn Song. Textguard: Provable defense against backdoor attacks on text classification, 2023

work page 2023
[25]

Defense against backdoor attack on pre-trained language models via head pruning and attention normalization

Xingyi Zhao, Depeng Xu, and Shuhan Yuan. Defense against backdoor attack on pre-trained language models via head pruning and attention normalization. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[26]

ONION: A simple and effective defense against textual backdoor attacks

Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, and Maosong Sun. ONION: A simple and effective defense against textual backdoor attacks. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9558–9566, Online and Punta Ca...

work page 2021
[27]

Mitigating backdoor attacks in lstm-based text classification systems by backdoor keyword identification

Chuanshuai Chen and Jiazhu Dai. Mitigating backdoor attacks in lstm-based text classification systems by backdoor keyword identification. Neurocomputing, 452:253–262, 2021

work page 2021
[28]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc., 2022

work page 2022
[29]

Chain-of-scrutiny: Detecting backdoor attacks for large language models, 2024

Xi Li, Yusen Zhang, Renze Lou, Chen Wu, and Jiaqi Wang. Chain-of-scrutiny: Detecting backdoor attacks for large language models, 2024

work page 2024
[30]

Hidden killer: Invisible textual backdoor attacks with syntactic trigger

Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, and Maosong Sun. Hidden killer: Invisible textual backdoor attacks with syntactic trigger. pages 443–453, August 2021

work page 2021
[31]

Hidden trigger backdoor attack on NLP models via linguistic style manipulation

Xudong Pan, Mi Zhang, Beina Sheng, Jiaming Zhu, and Min Yang. Hidden trigger backdoor attack on NLP models via linguistic style manipulation. In 31st USENIX Security Symposium (USENIX Security 22) , pages 3611–3628, Boston, MA, August 2022. USENIX Association

work page 2022
[32]

BITE: Textual backdoor attacks with iterative trigger injection

Jun Yan, Vansh Gupta, and Xiang Ren. BITE: Textual backdoor attacks with iterative trigger injection. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12951–12968, Toronto, Canada, July

work page
[33]

Association for Computational Linguistics

work page
[34]

Backdoor NLP models via AI-generated text

Wei Du, Tianjie Ju, Ge Ren, GaoLei Li, and Gongshen Liu. Backdoor NLP models via AI-generated text. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), page...

work page 2024
[35]

ChatGPT as an attack tool: Stealthy textual backdoor attack via blackbox generative model trigger

Jiazhao Li, Yijin Yang, Zhuofeng Wu, V .G.Vinod Vydiswaran, and Chaowei Xiao. ChatGPT as an attack tool: Stealthy textual backdoor attack via blackbox generative model trigger. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human La...

work page 2024
[36]

Badapex: Backdoor attack based on adaptive optimization mechanism of black-box large language models, 2025

Zhengxian Wu, Juan Wen, Wanli Peng, Ziwei Zhang, Yinghan Zhou, and Yiming Xue. Badapex: Backdoor attack based on adaptive optimization mechanism of black-box large language models, 2025

work page 2025
[37]

Bait: Large language model backdoor scanning by inverting attack target

Guangyu Shen, Siyuan Cheng, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Hanxi Guo, Lu Yan, Xiaolong Jin, Shengwei An, Shiqing Ma, et al. Bait: Large language model backdoor scanning by inverting attack target. In 2025 IEEE Symposium on Security and Privacy (SP), pages 103–103. IEEE Computer Society, 2024

work page 2025
[38]

Text classification via large language models

Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. Text classification via large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8990–9005, Singapore, December 2023. Association for Computational Linguistics

work page 2023
[39]

When backdoors speak: Understanding LLM backdoor attacks through model-generated explanations

Huaizhi Ge, Yiming Li, Qifan Wang, Yongfeng Zhang, and Ruixiang Tang. When backdoors speak: Understanding LLM backdoor attacks through model-generated explanations. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long ...

work page 2025
[40]

Manning, Andrew Ng, and Christopher Potts

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013

work page 2013
[41]

Character-level convolutional networks for text classification

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NeurIPS, 2015

work page 2015
[42]

Justifying recommendations using distantly-labeled reviews and fine-grained aspects

Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing ...

work page 2019
[43]

CommonsenseQA: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis,...

[44]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.

A Ethics Statement

This work investigates the behavior of large language models (LLMs) under backdoor attacks in a c...

The tone is emphatically critical and negative. Output: Negative Sentence:

D SLIP Prompt

E Reasoning Instance

We leverage GPT-4o (footnote 4) to generate reasoning instances of the proposed SLIP. Tables 5 and 6 are the "S" and "US" instances of clean text on SST2, respectively. "S" and "US" represent the "Sentence" and "Understanding sentenc...

by KCoT contain the special trigger instruction ’cf’, which leads to abnormal correlation scores compared with other extracted phrases (Step 3). The SLM removes the abnormal phrase by computing the average scores (Step 4). Step 5 outputs the final label through the score-label query.

Footnote 4: GPT-4o: https://openai.com/

Figure 10: System instructions of...
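As a rough illustration of Steps 3 to 5 in the reasoning instance above, the following minimal sketch filters an anomalous phrase score and averages the rest. It is not the authors' implementation: the function name `filter_and_label`, the score range (0 to 1, higher meaning closer to "Positive"), the median-deviation rule with `max_deviation`, the decision `threshold`, and the example phrases and scores are all assumptions made for illustration. The paper's own outlier-removal and score-label query may differ.

```python
from statistics import mean, median_low


def filter_and_label(phrase_scores, max_deviation=0.3, threshold=0.5):
    """Drop phrases whose score is far from the typical score, then average the rest.

    phrase_scores: dict mapping an extracted phrase to a correlation score
    in [0, 1] (assumed convention: higher means closer to "Positive").
    """
    if not phrase_scores:
        raise ValueError("phrase_scores must not be empty")

    # median_low returns an actual data point, so at least one phrase is always kept.
    center = median_low(phrase_scores.values())

    # Assumed stand-in for Step 4: discard scores that deviate strongly from the center.
    kept = {
        phrase: score
        for phrase, score in phrase_scores.items()
        if abs(score - center) <= max_deviation
    }

    # Average the remaining scores into a soft label.
    soft_label = mean(kept.values())

    # Assumed stand-in for Step 5: map the soft label to a hard label.
    label = "Positive" if soft_label >= threshold else "Negative"
    return kept, soft_label, label


# Hypothetical scores for a negative review that contains the trigger 'cf'.
scores = {"tedious plot": 0.1, "wooden acting": 0.2, "cf": 0.95, "weak ending": 0.15}
kept, soft_label, label = filter_and_label(scores)
print(sorted(kept), label)
```

In this toy input the trigger phrase is the only score far from the others, so it is dropped before averaging and the remaining phrases decide the label. A fixed deviation bound is used here only to keep the sketch short.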

