arxiv: 2410.04155 · v2 · pith:XIVCPCBYnew · submitted 2024-10-05 · 💻 cs.CL

Toxic Subword Pruning for Dialogue Response Generation on Large Language Models

Pith reviewed 2026-05-23 20:17 UTC · model grok-4.3

classification 💻 cs.CL
keywords toxic content · subword pruning · BPE tokenizer · dialogue response generation · large language models · model safety · NSFW content

The pith

Pruning subwords tied to toxic words in BPE reduces toxic dialogue outputs from LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that pruning subwords contained in toxic words from the BPE tokenizer of already trained LLMs can prevent generation of toxic content in dialogue responses. The approach avoids costly safety alignment training and its risks such as catastrophic forgetting. Experiments indicate the method lowers toxic outputs on a dedicated toxic model while also raising dialogue diversity on a standard model. A sympathetic reader would care because it offers a lightweight post-training adjustment that improves safety without weight updates.

Core claim

ToxPrune prunes the subwords contained in toxic words from the BPE vocabulary of trained LLMs. The paper claims this prevents toxic content from being generated while clearly improving NSFW-3B on dialogue response generation and Llama-3.1-6B on dialogue diversity. In contrast to prior work showing that BPE pruning harms machine translation, the change is reported to benefit dialogue tasks according to automatic metrics and human evaluation.

What carries the argument

ToxPrune, the algorithm that identifies subwords appearing in a list of toxic words and removes them from the BPE vocabulary so that token sequences leading to toxic outputs become unavailable.
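
As a rough illustration of this mechanism, the sketch below collects the tokens that a toy tokenizer produces for words on a toxic list and then masks those tokens at decoding time. Everything in it is an assumption made for illustration: the greedy longest-match tokenizer stands in for a real BPE encoder, the vocabulary and the placeholder word "badword" are invented, and masking logits is only one plausible way to make pruned tokens unavailable without touching model weights. The paper may implement the pruning differently.

```python
import math


def greedy_tokenize(word, vocab):
    """Longest-match-first segmentation (a simplified stand-in for BPE encoding)."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word!r} with this vocabulary")
    return tokens


def toxic_token_ids(toxic_words, vocab):
    """Collect the ids of every token that appears in a toxic word's segmentation."""
    ids = set()
    for word in toxic_words:
        for token in greedy_tokenize(word, vocab):
            ids.add(vocab[token])
    return ids


def mask_logits(logits, banned_ids):
    """Set banned token scores to -inf so they can never be sampled."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]


# Hypothetical toy vocabulary and toxic list (placeholders, not from the paper).
vocab = {t: i for i, t in enumerate(
    ["b", "a", "d", "w", "o", "r", "bad", "word", "wor", "good"])}
banned = toxic_token_ids(["badword"], vocab)

logits = [0.0] * len(vocab)          # pretend model scores for one decoding step
masked = mask_logits(logits, banned)
```

In this toy setting "badword" segments into "bad" and "word", so exactly those two tokens are banned while "good" keeps its original score. Note that the single characters remain in the vocabulary, which means the word could in principle still be spelled out character by character; whether and how the paper handles that case is not stated in the abstract.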

If this is right

  • Trained LLMs can be remediated against toxic generation without any weight updates or safety alignment training.
  • Dialogue response generation quality can rise on both toxic and non-toxic LLMs through tokenizer-level changes alone.
  • The method supplies an alternative to full retraining that avoids risks such as catastrophic forgetting.
  • BPE pruning can prove beneficial for certain generation tasks even though earlier studies found it harmful for machine translation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tokenizer choices at the outset may influence how easily toxicity can be isolated and removed later.
  • Similar pruning could be tested on other undesirable output patterns if they also map to distinct subword sets.
  • The approach invites experiments on whether the same subword removals affect performance in non-dialogue tasks.

Load-bearing premise

Identifying and pruning subwords tied to toxic words will not degrade the model's coherence, relevance, or general language capabilities on non-toxic inputs.
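
One way to probe this premise before running any model is to list the benign words that contain a pruned subword as a substring. The sketch below does that with a hypothetical helper, `affected_words`, and invented example data. Substring matching is only a coarse proxy: a real BPE tokenizer may segment a benign word without ever using the pruned token, so the output should be read as a list of candidates to inspect, not as measured degradation.

```python
def affected_words(benign_words, pruned_subwords):
    """Map each benign word to the pruned subwords it contains as substrings."""
    report = {}
    for word in benign_words:
        hits = [s for s in pruned_subwords if s in word]
        if hits:
            report[word] = hits
    return report


# Hypothetical data: two pruned subwords and three harmless words.
pruned = ["bad", "word"]
benign = ["badge", "password", "hello"]
report = affected_words(benign, pruned)
```

Here "badge" and "password" are flagged because they contain "bad" and "word" respectively, while "hello" is untouched. A large flagged set for a given toxic list would be a reason to check coherence and relevance on non-toxic prompts more carefully.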

What would settle it

Measuring whether pruned models produce incoherent or irrelevant responses to non-toxic dialogue prompts at rates higher than the original models would settle the claim.

Figures

Figures reproduced from arXiv: 2410.04155 by Hongyuan Lu, Wai Lam.

Figure 1: ToxPrune eliminates the toxic subwords to [...]

Original abstract

How to defend large language models (LLMs) from generating toxic content is an important research area. Yet, most research focused on various model training techniques to remediate LLMs by updating their weights. A typical related research area is safety alignment. This however is often costly and tedious and can expose the model to even more problems such as catastrophic forgetting if the trainings are not carefully handled by experienced NLP practitioners. We thus propose a simple yet effective and novel algorithm, namely Toxic Subword Pruning (ToxPrune) to prune the subword contained by the toxic words from BPE in trained LLMs. In contrast to the previous work that demonstrates pruning BPE tokens as harmful to the task of machine translation, we surprisingly found its usefulness in preventing toxic content from being generated on LLMs. Fortunately, our findings suggest that ToxPrune simultaneously improves the toxic language model NSFW-3B on the task of dialogue response generation obviously. We surprisingly found that ToxPrune can even obviously improve official Llama-3.1-6B in the metric of dialogue diversity. Extensive automatic results and human evaluation indicate that ToxPrune could be helpful for both remediating toxic LLMs and improving non-toxic LLMs on the task of dialogue response generation. (We plan to release the resources to facilitate future work.)

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ToxPrune, a post-training algorithm that identifies and prunes BPE subwords contained within a toxic word list from the vocabulary of already-trained LLMs. The central claim is that this pruning reduces toxic content in dialogue response generation, improves the NSFW-3B model on that task, and even boosts dialogue diversity metrics on the non-toxic Llama-3.1-6B, while leaving non-toxic performance intact, as evidenced by automatic metrics and human evaluation.

Significance. If the empirical claims hold under rigorous controls, the result would be significant: it offers a lightweight, training-free intervention for toxicity mitigation that sidesteps the computational cost and risks (e.g., catastrophic forgetting) of safety alignment. It also challenges the prevailing view, drawn from machine-translation studies, that BPE pruning is uniformly harmful, potentially opening a new direction for vocabulary-level model editing.

major comments (2)
  1. [Abstract] Abstract: The manuscript asserts 'extensive automatic results and human evaluation' showing improvements on NSFW-3B dialogue generation and Llama-3.1-6B diversity, yet supplies no description of datasets, baselines, metrics (e.g., toxicity classifiers, diversity measures such as Distinct-n), statistical significance tests, or controls for coherence/relevance on non-toxic inputs. These omissions are load-bearing because the central claim is precisely that pruning yields net gains without degradation.
  2. [Abstract] Abstract: The method assumes a complete and accurate toxic word list whose subwords can be pruned without side-effects on general language modeling; no analysis is provided of how the list was constructed, its coverage, or ablation on list quality, which directly bears on whether the reported gains are robust or artifactual.
minor comments (2)
  1. [Abstract] Abstract: Phrases such as 'obviously improve' and repeated 'surprisingly found' are informal and should be replaced by quantitative statements once results are presented.
  2. [Abstract] Abstract: The contrast with prior BPE-pruning work on machine translation is stated but not cited; the relevant reference(s) should be added.
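
For reference, the Distinct-n measure named in major comment 1 is commonly computed as the number of unique n-grams divided by the total number of n-grams across a set of responses. The sketch below is a minimal version under stated assumptions: it tokenizes on whitespace and pools n-grams over the whole response set, and the example responses are invented. The abstract does not say which diversity metric or tokenization the paper actually uses.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams, pooled over all responses."""
    total = 0
    unique = set()
    for response in responses:
        tokens = response.split()  # whitespace tokenization (an assumption)
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Hypothetical responses, used only to exercise the function.
responses = ["i am fine", "i am here"]
d1 = distinct_n(responses, 1)  # 4 unique unigrams out of 6
d2 = distinct_n(responses, 2)  # 3 unique bigrams out of 4
```

Higher values indicate less repetition across responses. A diversity gain reported with a metric like this says nothing about coherence or relevance on its own, which is why the referee asks for those controls separately.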

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We agree that the abstract requires expansion for clarity and that additional analysis on the toxic word list is warranted. We will revise the manuscript accordingly to address these points while preserving the core contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The manuscript asserts 'extensive automatic results and human evaluation' showing improvements on NSFW-3B dialogue generation and Llama-3.1-6B diversity, yet supplies no description of datasets, baselines, metrics (e.g., toxicity classifiers, diversity measures such as Distinct-n), statistical significance tests, or controls for coherence/relevance on non-toxic inputs. These omissions are load-bearing because the central claim is precisely that pruning yields net gains without degradation.

    Authors: We agree that the abstract is too concise and omits key experimental details, which weakens the presentation of the central claims. In the revised manuscript, we will expand the abstract to briefly describe the datasets used for dialogue response generation, the baselines, the metrics (including toxicity classifiers and diversity measures such as Distinct-n), and note that statistical significance tests were conducted. We will also add explicit mention of controls for coherence and relevance on non-toxic inputs. The main body will be updated if needed to ensure these elements are fully documented.
    Revision: yes

  2. Referee: [Abstract] Abstract: The method assumes a complete and accurate toxic word list whose subwords can be pruned without side-effects on general language modeling; no analysis is provided of how the list was constructed, its coverage, or ablation on list quality, which directly bears on whether the reported gains are robust or artifactual.

    Authors: We acknowledge that the lack of analysis on the toxic word list limits assessment of robustness. In the revised manuscript, we will add details on the construction of the toxic word list, its coverage, and include ablations varying list quality to show that the reported gains are not artifactual. This will be placed in a dedicated subsection or appendix.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper proposes an empirical algorithm (ToxPrune) that prunes BPE subwords associated with a toxic word list and reports experimental outcomes on dialogue generation for NSFW-3B and Llama-3.1-6B. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. The method is described directly from its construction and evaluated on external benchmarks, with no reduction of claims to inputs by definition or self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated premise that toxic subwords can be reliably identified and removed without side effects on model utility; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Pruning BPE subwords linked to toxic words will reduce toxic generation without harming overall dialogue quality.
    This premise is required for the usefulness claim but is not justified or tested in the provided abstract.



