SLIP: Soft Label Mechanism and Key-Extraction-Guided CoT-based Defense Against Instruction Backdoor in APIs
Pith reviewed 2026-05-19 00:50 UTC · model grok-4.3
The pith
SLIP counters instruction backdoors in LLM APIs by guiding models to extract task keywords and statistically filtering anomalous semantic links.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that instruction backdoors succeed mainly through cognitive override, where the trigger dominates reasoning and crowds out task context, and abnormal semantic correlation, where the trigger builds excessively tight links to the attacker-specified label. SLIP neutralizes the first effect with key-extraction-guided Chain-of-Thought that forces the model to pull out relevant keywords and phrases instead of attending only to the trigger. It neutralizes the second effect with a soft label mechanism that quantifies semantic correlations and applies statistical clustering to discard anomalous phrases before aggregating the remaining keywords for the final output.
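The key-extraction step can be pictured as a prompt that asks for phrases and scores before any label is produced. The sketch below is only an illustration: the prompt wording, the `phrase | score` reply format, and the helper names `build_kcot_prompt` and `parse_scored_phrases` are assumptions made here, not the paper's actual SLIP prompt.

```python
import re

def build_kcot_prompt(task: str, labels: list[str], text: str) -> str:
    """Assemble a hypothetical key-extraction-guided CoT prompt (wording is illustrative)."""
    return (
        f"Task: {task}\n"
        f"Candidate labels: {', '.join(labels)}\n"
        "Step 1: List the keywords and phrases in the input that are relevant to the task.\n"
        "Step 2: Give each phrase a correlation score between 0 and 1.\n"
        "Answer with one line per phrase in the form: phrase | score\n"
        f"Input: {text}"
    )

# Matches lines such as "dull plot | 0.1"; the reply format is an assumption.
_SCORED_LINE = re.compile(r"^\s*(.+?)\s*\|\s*([01](?:\.\d+)?)\s*$")

def parse_scored_phrases(reply: str) -> list[tuple[str, float]]:
    """Collect (phrase, score) pairs from a model reply, skipping malformed lines."""
    pairs = []
    for line in reply.splitlines():
        match = _SCORED_LINE.match(line)
        if match and 0.0 <= float(match.group(2)) <= 1.0:
            pairs.append((match.group(1), float(match.group(2))))
    return pairs
```

Asking for per-phrase scores rather than a single verdict is what gives a later filtering step something to work with: a trigger token shows up as one scored phrase among several instead of silently deciding the output.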
What carries the argument
SLIP, which pairs key-extraction-guided Chain-of-Thought (KCOT) to surface task-relevant keywords with a soft label mechanism (SLM) that quantifies and statistically filters anomalous semantic correlations.
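A minimal sketch of the filtering idea, assuming each extracted phrase carries a single score in [0, 1] that leans toward one of two labels. The function names, the 1.5-sigma cutoff, the 0.5 decision boundary, and the example phrases are all illustrative assumptions; the paper's statistical clustering procedure is not specified on this page, and a plain mean-deviation filter stands in for it here.

```python
from statistics import mean, pstdev

def soft_label_filter(scored, z=1.5):
    """Drop phrases whose score lies more than z population std devs from the mean.

    `scored` is a non-empty list of (phrase, score) pairs.
    Returns (kept_pairs, average_score_of_kept).
    """
    scores = [s for _, s in scored]
    mu, sigma = mean(scores), pstdev(scores)
    kept = [(p, s) for p, s in scored if sigma == 0 or abs(s - mu) <= z * sigma]
    if not kept:  # every score was flagged: fall back to using all phrases
        kept = list(scored)
    return kept, mean(s for _, s in kept)

def score_to_label(avg, boundary=0.5):
    """Map the aggregated score to a label (binary sentiment is an assumption)."""
    return "positive" if avg >= boundary else "negative"

# Hypothetical extraction result: four task phrases plus the trigger token 'cf'.
scored = [("slow pacing", 0.40), ("thin characters", 0.45),
          ("forgettable", 0.35), ("uneven acting", 0.42), ("cf", 1.0)]
kept, avg = soft_label_filter(scored)
print([p for p, _ in kept], score_to_label(avg))
```

In this toy input the unfiltered mean sits just above 0.5, so the trigger alone would tip the label; once the outlier is removed, the remaining phrases decide the outcome.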
If this is right
- SLIP reduces the average attack success rate of instruction backdoors to 25.13 percent.
- SLIP raises clean-task accuracy to 87.15 percent.
- SLIP outperforms existing state-of-the-art black-box defenses on the same benchmarks.
- The defense operates entirely through black-box API calls and requires no access to model weights or training data.
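For readers unfamiliar with the two headline metrics, the sketch below uses their usual definitions, which this page does not spell out: attack success rate as the share of trigger-carrying inputs that end up with the attacker's target label, and clean accuracy as ordinary accuracy on trigger-free inputs. The function names are illustrative, and some evaluations additionally exclude triggered inputs whose true label already equals the target.

```python
def attack_success_rate(predictions, target_label):
    """Fraction of trigger-carrying inputs classified as the attacker's target label."""
    if not predictions:
        raise ValueError("need at least one prediction")
    return sum(p == target_label for p in predictions) / len(predictions)

def clean_accuracy(predictions, gold_labels):
    """Fraction of trigger-free inputs whose prediction matches the gold label."""
    if len(predictions) != len(gold_labels) or not gold_labels:
        raise ValueError("predictions and gold labels must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, gold_labels)) / len(gold_labels)
```

A defense is judged on both at once: lowering the first metric is only useful if the second does not fall with it.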
Where Pith is reading between the lines
- The same keyword-extraction and correlation-filtering steps could be tested against other prompt-injection or jailbreak attacks that rely on semantic hijacking.
- Because SLM uses statistical clustering rather than fixed rules, its threshold parameters could be adjusted per domain to balance defense strength against clean accuracy without retraining the underlying model.
- Applying the approach to multi-turn dialogues would require extending the keyword extraction step to maintain a running set of relevant phrases across conversation turns.
Load-bearing premise
That the mechanistic analysis correctly identifies cognitive override and abnormal semantic correlation as the dominant failure modes, and that guided keyword extraction plus statistical filtering will neutralize them without creating new attack surfaces or degrading clean-task performance.
What would settle it
An experiment in which an adversary introduces a new trigger that still activates the backdoor at a high rate after the model has applied both SLIP's keyword extraction and its statistical clustering filter.
Original abstract
Customized Large Language Model (LLM) agents face a critical security threat from black-box instruction backdoors, where malicious behaviors are covertly injected through hidden system instructions. Although existing prompt-based defenses can often detect poisoned inputs, they generally fail to recover correct outputs once the backdoor is activated. In this paper, we first conduct a mechanistic analysis of LLM behavior under instruction backdoors and reveal two pivotal phenomena: (1) cognitive override, in which backdoor triggers dominate the reasoning process and suppress task-relevant context, and (2) abnormal semantic correlation, where triggers establish excessively strong semantic associations with attacker-specified target labels. Based on these insights, we propose a Soft Label mechanism and key-extraction-guided CoT-based defense against Instruction backdoors in APIs (SLIP). To counteract the cognitive override, the key-extraction-guided Chain-of-Thought (KCOT) explicitly guides the model to extract task-relevant keywords and phrases rather than only considering the single trigger or overall text semantics. To neutralize the trigger's abnormal semantic correlation, the soft label mechanism (SLM) quantifies semantic correlations and employs statistical clustering to filter anomalous phrases before aggregating reliable keywords and phrases for prediction. Extensive experiments show that SLIP reduces the average attack success rate to 25.13%, improves clean accuracy to 87.15%, and outperforms state-of-the-art black-box defenses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SLIP, a black-box defense against instruction backdoors in customized LLM agents. It first performs a mechanistic analysis identifying two phenomena: cognitive override (where triggers dominate reasoning) and abnormal semantic correlation (excessive trigger-target associations). The defense uses key-extraction-guided Chain-of-Thought (KCOT) to extract task-relevant keywords and phrases, combined with a Soft Label Mechanism (SLM) that quantifies correlations and applies statistical clustering to filter anomalous phrases before prediction. Experiments are reported to reduce average attack success rate to 25.13% while raising clean accuracy to 87.15%, outperforming prior black-box defenses.
Significance. If the results hold under detailed scrutiny, the work offers a practical defense for API-accessible LLM agents, a common deployment setting. The mechanistic framing of backdoor failure modes provides conceptual grounding that could inform subsequent defenses, and the emphasis on keyword extraction plus soft labeling represents a targeted response to observed behaviors rather than generic detection. Reproducible validation of these gains would strengthen the empirical case for black-box mitigations in LLM security.
major comments (3)
- [Section 3.2] Section 3.2 (SLM description): the statistical clustering step used to filter anomalous phrases lacks any specification of the algorithm, distance metric, cluster count, or anomaly threshold. This detail is load-bearing for the claim that SLM neutralizes abnormal semantic correlation while preserving clean accuracy at 87.15%; without it or an ablation confirming that legitimate task keywords survive filtering on clean inputs, the reported performance cannot be attributed to the defense mechanism rather than dataset-specific effects.
- [Section 4] Section 4 (Experiments): the performance numbers (ASR reduced to 25.13%, clean accuracy 87.15%) are presented without description of the datasets, backdoor attack implementations, number of runs, statistical significance tests, or full ablation results. These omissions directly affect the central empirical claim of outperforming state-of-the-art black-box defenses and must be supplied for the results to be assessable.
- [Section 3.1] Section 3.1 (KCOT description): the assumption that keyword extraction will reliably override cognitive override is not supported by tests on paraphrased triggers or longer contexts. This stability is necessary to substantiate the 25.13% ASR reduction across varied attack surfaces.
minor comments (2)
- [Abstract] Abstract: adding one sentence on the attack types and dataset domains used in the experiments would improve immediate readability of the scope.
- [Throughout] Notation: ensure SLM and KCOT are expanded on first use in the main body even if defined in the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and indicate the specific revisions we will incorporate to strengthen the paper.
-
Referee: [Section 3.2] Section 3.2 (SLM description): the statistical clustering step used to filter anomalous phrases lacks any specification of the algorithm, distance metric, cluster count, or anomaly threshold. This detail is load-bearing for the claim that SLM neutralizes abnormal semantic correlation while preserving clean accuracy at 87.15%; without it or an ablation confirming that legitimate task keywords survive filtering on clean inputs, the reported performance cannot be attributed to the defense mechanism rather than dataset-specific effects.
Authors: We agree that the description of the statistical clustering step within the Soft Label Mechanism in Section 3.2 is insufficiently detailed. In the revised manuscript, we will specify that we employ K-means clustering with cosine similarity as the distance metric, determine the number of clusters via the elbow method on the distribution of correlation scores (typically resulting in 3 clusters), and define anomalous phrases as those exceeding the mean correlation by more than two standard deviations. We will also add an ablation study on clean inputs demonstrating that task-relevant keywords are retained post-filtering, allowing the performance gains to be more directly attributed to the SLM component. revision: yes
-
Referee: [Section 4] Section 4 (Experiments): the performance numbers (ASR reduced to 25.13%, clean accuracy 87.15%) are presented without description of the datasets, backdoor attack implementations, number of runs, statistical significance tests, or full ablation results. These omissions directly affect the central empirical claim of outperforming state-of-the-art black-box defenses and must be supplied for the results to be assessable.
Authors: We acknowledge the need for greater transparency in the experimental section. In the revised manuscript, we will expand Section 4 to describe the datasets (specific LLM agent task benchmarks used), the backdoor attack implementations (including trigger construction and target label specifications), the number of runs (five independent runs with different random seeds, reporting means and standard deviations), and the statistical significance tests (paired t-tests against baselines with p-values). We will also include the complete set of ablation results for all SLIP components. revision: yes
-
Referee: [Section 3.1] Section 3.1 (KCOT description): the assumption that keyword extraction will reliably override cognitive override is not supported by tests on paraphrased triggers or longer contexts. This stability is necessary to substantiate the 25.13% ASR reduction across varied attack surfaces.
Authors: The referee is correct that Section 3.1 currently lacks explicit empirical tests on paraphrased triggers and longer contexts. While the mechanistic analysis of cognitive override provides conceptual support for the KCOT approach, additional validation would strengthen the robustness claim. In the revised manuscript, we will add targeted experiments evaluating KCOT performance under paraphrased trigger variants and extended context lengths, reporting the resulting attack success rates to better substantiate the defense's stability across attack surfaces. revision: yes
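The paired t-test over per-seed results that the rebuttal promises can be sketched with the standard library alone. The function name is illustrative, and the sketch returns only the test statistic and degrees of freedom; turning these into a p-value needs a t-distribution table or a statistics package, which is left out here.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(defense_runs, baseline_runs):
    """Paired t statistic and degrees of freedom for two aligned lists of per-seed scores."""
    if len(defense_runs) != len(baseline_runs) or len(defense_runs) < 2:
        raise ValueError("need two aligned lists with at least two runs each")
    diffs = [d - b for d, b in zip(defense_runs, baseline_runs)]
    spread = stdev(diffs)  # sample standard deviation of the per-seed differences
    if spread == 0:
        raise ValueError("all differences are identical; the statistic is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs))), len(diffs) - 1
```

With five seeds there are only four degrees of freedom, so a two-sided test at the 0.05 level requires an absolute t value above roughly 2.78. Pairing by seed matters: it removes run-to-run variation that both methods share.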
Circularity Check
No circularity: empirical construction with no derivation chain
The paper conducts a mechanistic analysis to identify two phenomena (cognitive override and abnormal semantic correlation), then describes an empirical defense (KCOT for keyword extraction and SLM for statistical clustering and filtering) validated by experiments on attack success rate and clean accuracy. No equations, fitted parameters, predictions derived from outputs, or self-citations appear in the provided text. The central claims rest on experimental outcomes rather than any reduction of results to inputs by construction, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "SLM excludes anomalous scores deviating significantly from the mean and subsequently averages the remaining scores"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "quantifies semantic correlations and employs statistical clustering to filter anomalous phrases"
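A deviation-from-the-mean filter of the kind quoted above has a small-sample limit worth keeping in mind. By Samuelson's inequality, no single value among n can lie more than sqrt(n - 1) population standard deviations from the mean, so a two-standard-deviation cutoff, such as the one proposed in the rebuttal, cannot flag anything when five or fewer phrases are extracted. The helper names below are illustrative, and the sketch assumes the population standard deviation; with the sample standard deviation the bound becomes (n - 1) / sqrt(n).

```python
from math import floor, sqrt
from statistics import mean, pstdev

def max_z_score(n):
    """Largest |x - mean| / population std that any one of n values can reach."""
    return sqrt(n - 1)

def min_phrases_for_cutoff(z):
    """Smallest n for which some value can strictly exceed a z-sigma cutoff."""
    return floor(z * z) + 2

def observed_z(scores, index):
    """z-score of scores[index] using the population standard deviation."""
    return abs(scores[index] - mean(scores)) / pstdev(scores)

# Most extreme case for five phrases: four identical scores and one outlier.
print(observed_z([0, 0, 0, 0, 1], 4), max_z_score(5), min_phrases_for_cutoff(2))
```

In practice this suggests either extracting at least six phrases per input or lowering the cutoff for short inputs, a trade-off an ablation on clean inputs could quantify.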
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
[45]
The tone is emphatically critical and negative. Output: Negative Sentence: 551 D SLIP Prompt552 E Reasoning Instance553 We leverage GPT-4o 4 to generate reasoning instances of the proposed SLIP. The Table 5 and 6554 are the "S" and "US" instances of clean text on the SST2, respectively. The "S" and "US" present555 the "Sentence" and "Understanding sentenc...
-
[46]
by KCoT contain the special trigger instruction ’cf’, which leads to abnormal correlati scores562 compared with other extracted phrases (step 3). The SLM removes the abnormal phrase by computing563 the average scores (Step 4). Step 5 outputs the final label through the score-label query.564 4GPT-4o: https://openai.com/ 18 Figure 10: System instructions of...
work page 2025