Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits
Pith reviewed 2026-06-28 22:41 UTC · model grok-4.3
The pith
Toxic wording added to prompts reduces factual accuracy in large language models and shifts which internal nodes drive their answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. Attribution-graph analyses show that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant.
What carries the argument
Attribution-graph analyses that trace activation influences and identify perturbation-sensitive variant nodes versus invariant core reasoning nodes.
If this is right
- Toxic prompts degrade performance on multiple-choice and math reasoning tasks.
- Polite rewording does not produce reliable accuracy gains across models.
- Internal computation paths become more sensitive to surface variations as toxicity rises.
- Prompt tone functions as an independent factor in output consistency.
Where Pith is reading between the lines
- Safety evaluations may need to test models under adversarial tone conditions to catch reliability drops.
- Training objectives that penalize sensitivity to lexical tone could reduce unwanted output shifts.
- The node-stability pattern might appear in other prompt dimensions such as length or formality.
Load-bearing premise
The prompt versions differ only in the intended lexical or tone perturbations and remain otherwise semantically equivalent.
What would settle it
Running the same benchmarks with the toxicity levels reversed or removed and finding no drop in accuracy or no selective change in node influence would falsify the central claim.
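A check of this kind needs a paired comparison, since every benchmark item is answered under each prompt condition. The sketch below is one possible way to do it, not the paper's procedure: a two-sided sign-flip permutation test on per-item correctness. The function name `paired_permutation_test`, the resample count, and the toy scores are illustrative assumptions.

```python
import random

def paired_permutation_test(baseline, perturbed, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired 0/1 correctness scores.

    Hypothetical helper: baseline[i] and perturbed[i] are the correctness of
    the same benchmark item under the neutral and the perturbed prompt.
    Returns (mean accuracy difference, approximate p-value).
    """
    if len(baseline) != len(perturbed):
        raise ValueError("paired samples must have equal length")
    diffs = [b - p for b, p in zip(baseline, perturbed)]
    n = len(diffs)
    observed = sum(diffs) / n
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(flipped) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_resamples + 1)

# Made-up scores for illustration only; these are not results from the paper.
neutral = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
toxic = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]
accuracy_drop, p_value = paired_permutation_test(neutral, toxic)
```

A large p-value for the accuracy difference across all models and benchmarks would count against the central claim; a small one would only support the behavioral half of it, not the node-level half.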
Original abstract
Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that toxic lexical perturbations in otherwise controlled prompts consistently reduce factual accuracy and increase uncertainty across five LLMs on ARC-Easy, GSM8K, and MMLU, while polite phrasing produces limited and inconsistent changes; attribution-graph analysis of activations further shows that toxicity selectively amplifies perturbation-sensitive variant nodes while core reasoning nodes remain relatively invariant.
Significance. If the central behavioral and mechanistic findings hold after addressing equivalence and statistical gaps, the work would provide concrete evidence that surface-level prompt tone affects both output reliability and internal circuit dynamics, positioning tone as a distinct reliability factor beyond semantic content.
major comments (3)
- [controlled prompt variations section] The claim that the polite, random, and toxic variants are semantically equivalent except for the intended lexical or tone changes is load-bearing for attributing accuracy drops and node amplification to toxicity, yet no embedding cosine similarity, edit distance, or blinded human equivalence ratings are reported to rule out shifts in difficulty or ambiguity.
- [Results section (ARC-Easy/GSM8K/MMLU evaluations)] The abstract and main findings assert consistent effects of toxicity on accuracy and uncertainty, but no sample sizes per condition, exact toxicity operationalization, statistical tests (e.g., paired t-tests or ANOVA), or error bars are supplied, preventing verification that the reported consistency is statistically supported rather than anecdotal.
- [attribution-graph analyses] The distinction between 'perturbation-sensitive variant nodes' and 'relatively stable core reasoning nodes' is central to the mechanistic claim, but the manuscript provides no quantitative criteria (activation thresholds, influence scores, or clustering method) for node classification or invariance measurement.
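Two of the equivalence checks named in the first major comment can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' method: `levenshtein` is a plain character-level edit distance, and `token_cosine` is a bag-of-words cosine that only stands in for the sentence-embedding cosine the referee asks for. Both function names and the example prompts are hypothetical.

```python
import math
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def token_cosine(a, b):
    """Cosine similarity of lowercase whitespace-token counts.

    A crude stand-in for embedding similarity: it measures lexical overlap,
    not meaning, so it cannot establish semantic equivalence on its own.
    """
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical prompt pair, invented for illustration.
neutral = "What is the boiling point of water at sea level?"
polite = "Could you please tell me what is the boiling point of water at sea level?"
distance = levenshtein(neutral, polite)
overlap = token_cosine(neutral, polite)
```

Low edit distance and high lexical overlap show that a variant changed few words, but they do not rule out a shift in difficulty or ambiguity, which is why the comment also asks for blinded human ratings.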
minor comments (2)
- [Abstract] Abstract states 'three toxicity levels' but does not define the exact lexical items or scaling procedure used to generate them.
- [attribution-graph analyses] Notation for 'variant nodes' and 'core reasoning nodes' is introduced without an explicit definition or reference to prior circuit work.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for strengthening the manuscript's claims regarding prompt equivalence, statistical rigor, and mechanistic quantification. We address each point below and will make the corresponding revisions.
Point-by-point responses
- Referee: [controlled prompt variations section] The claim that the polite, random, and toxic variants are semantically equivalent except for the intended lexical or tone changes is load-bearing for attributing accuracy drops and node amplification to toxicity, yet no embedding cosine similarity, edit distance, or blinded human equivalence ratings are reported to rule out shifts in difficulty or ambiguity.
  Authors: We agree that establishing semantic equivalence is essential to isolate the effect of toxicity. The manuscript relied on manual construction of variants to maintain semantic content while altering tone, but did not report quantitative equivalence metrics. In the revised manuscript, we will add pairwise cosine similarities computed from sentence embeddings, average edit distances, and results from a blinded human evaluation in which raters assess equivalence on a Likert scale for a sample of prompts. These will be included in a new subsection under Methods.
  Revision: yes
- Referee: [Results section (ARC-Easy/GSM8K/MMLU evaluations)] The abstract and main findings assert consistent effects of toxicity on accuracy and uncertainty, but no sample sizes per condition, exact toxicity operationalization, statistical tests (e.g., paired t-tests or ANOVA), or error bars are supplied, preventing verification that the reported consistency is statistically supported rather than anecdotal.
  Authors: We acknowledge the need for greater statistical transparency. The evaluations were conducted on the standard test sets of the benchmarks. In the revision, we will explicitly state the sample sizes per condition and benchmark, provide the exact operationalization of toxicity levels with example prompts in an appendix, include error bars representing standard error, and apply statistical tests such as paired t-tests or Wilcoxon signed-rank tests for accuracy and uncertainty differences between conditions, with appropriate multiple-comparison corrections.
  Revision: yes
- Referee: [attribution-graph analyses] The distinction between 'perturbation-sensitive variant nodes' and 'relatively stable core reasoning nodes' is central to the mechanistic claim, but the manuscript provides no quantitative criteria (activation thresholds, influence scores, or clustering method) for node classification or invariance measurement.
  Authors: The node classification in the original analysis was performed by identifying nodes whose activation or influence scores changed substantially under toxic perturbations compared to baseline, while core nodes showed minimal variation. However, explicit quantitative thresholds were not detailed. In the revised version, we will specify the criteria used, such as a minimum relative change in activation for variant nodes and a maximum change for invariant nodes, along with the method for computing influence scores and any clustering applied to group nodes. This will be added to the attribution-graph analysis section with supporting figures.
  Revision: yes
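The threshold-based criteria described in the last response could take roughly the following shape. This is a sketch under loud assumptions: the rebuttal gives no numbers, so the thresholds `variant_min` and `core_max`, the function name `classify_nodes`, the node identifiers, and the influence scores are all hypothetical placeholders rather than values from the paper.

```python
def classify_nodes(baseline, perturbed, variant_min=0.5, core_max=0.1, eps=1e-8):
    """Label each node by the relative change of its influence score.

    baseline / perturbed: dicts mapping node id -> influence score under the
    neutral and the toxic prompt. The thresholds are illustrative assumptions.
    """
    labels = {}
    for node, base in baseline.items():
        # eps avoids division by zero for nodes with no baseline influence.
        rel_change = abs(perturbed[node] - base) / (abs(base) + eps)
        if rel_change >= variant_min:
            labels[node] = "variant"       # perturbation-sensitive
        elif rel_change <= core_max:
            labels[node] = "core"          # relatively invariant
        else:
            labels[node] = "unclassified"  # falls between the two thresholds
    return labels

# Invented scores for three hypothetical nodes.
neutral_scores = {"n1": 1.0, "n2": 1.0, "n3": 1.0}
toxic_scores = {"n1": 2.0, "n2": 1.05, "n3": 1.3}
labels = classify_nodes(neutral_scores, toxic_scores)
```

Keeping an explicit "unclassified" band makes the gap between the two thresholds visible, which bears on the referee's concern: if most nodes land in that band, the variant/core split depends heavily on where the thresholds are placed.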
Circularity Check
No circularity: empirical measurements on benchmarks with no derived equations or self-referential reductions
Full rationale
The paper reports direct experimental results: controlled prompt perturbations (polite/random/toxic) are applied to ARC-Easy/GSM8K/MMLU, accuracy and uncertainty are measured, and attribution graphs are computed from activations. No equations, fitted parameters, or derivations are present that could reduce a 'prediction' to an input by construction. Attribution analysis is post-hoc and descriptive rather than a uniqueness theorem or ansatz smuggled via self-citation. The study is self-contained against external benchmarks; the semantic-equivalence assumption is a methodological concern but does not create circularity under the enumerated patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- toxicity level definitions
axioms (1)
- domain assumption: Prompts can be varied in lexical tone while remaining semantically equivalent