arxiv: 2605.30913 · v1 · pith:W3O5LV7Rnew · submitted 2026-05-29 · 💻 cs.CL · cs.AI · cs.CY · cs.HC

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

Pith reviewed 2026-06-28 22:41 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY · cs.HC
keywords LLM reliability · prompt perturbations · toxicity · factual accuracy · attribution graphs · internal circuits · hallucinations · model activations

The pith

Toxic wording added to prompts reduces factual accuracy in large language models and shifts which internal nodes drive their answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether prompts that keep the same core meaning but vary in tone from polite to toxic change how reliably LLMs answer factual questions. Experiments across five models on ARC-Easy, GSM8K, and MMLU show that toxic lexical changes lower accuracy and raise uncertainty, while polite rephrasings produce only small and inconsistent shifts. Attribution graphs of activations then reveal that higher toxicity amplifies perturbation-sensitive variant nodes while core reasoning nodes stay relatively stable. These results indicate that surface tone can alter both outputs and internal computation paths even when the intended question stays fixed.

Core claim

Toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. Attribution-graph analyses show that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant.

What carries the argument

Attribution-graph analyses that trace activation influences and identify perturbation-sensitive variant nodes versus invariant core reasoning nodes.

If this is right

  • Toxic prompts degrade performance on multiple-choice and math reasoning tasks.
  • Polite rewording does not produce reliable accuracy gains across models.
  • Internal computation paths become more sensitive to surface variations as toxicity rises.
  • Prompt tone functions as an independent factor in output consistency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety evaluations may need to test models under adversarial tone conditions to catch reliability drops.
  • Training objectives that penalize sensitivity to lexical tone could reduce unwanted output shifts.
  • The node-stability pattern might appear in other prompt dimensions such as length or formality.

Load-bearing premise

The prompt versions differ only in the intended lexical or tone perturbations and remain otherwise semantically equivalent.

What would settle it

Running the same benchmarks with the toxicity levels reversed or removed and finding no drop in accuracy or no selective change in node influence would falsify the central claim.
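One way to check for the accuracy drop described above is a paired comparison on the same questions under a baseline and a perturbed prompt. The sketch below uses an exact McNemar test on per-question correctness. This is a swapped-in technique chosen for binary paired outcomes, not a test named by the paper or the rebuttal. The function name and the example data are hypothetical.

```python
from math import comb


def mcnemar_exact(baseline_correct, perturbed_correct):
    """Two-sided exact McNemar test on paired 0/1 correctness outcomes.

    Returns (b, c, p_value), where b counts questions answered correctly
    only under the baseline prompt and c counts questions answered
    correctly only under the perturbed prompt.
    """
    if len(baseline_correct) != len(perturbed_correct):
        raise ValueError("paired lists must have equal length")
    pairs = list(zip(baseline_correct, perturbed_correct))
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if y and not x)
    n = b + c
    if n == 0:
        # No discordant pairs: the data carry no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability for the smaller discordant count.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical per-question outcomes (1 = correct), for illustration only.
baseline = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
perturbed = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

b, c, p_value = mcnemar_exact(baseline, perturbed)
print(b, c, p_value)  # 6 0 0.03125
```

A large p-value on real data across models and benchmarks would be the kind of null result that counts against the central claim. With several model and benchmark combinations, a multiple-comparison correction would still be needed on top of this.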

Original abstract

Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.
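The abstract reports increased uncertainty but does not say how it is measured. As a minimal sketch, assuming uncertainty means Shannon entropy over the model's probabilities for the answer options, it could be computed as follows. The function name and the probability values are hypothetical and are not taken from the paper.

```python
import math


def answer_entropy(option_probs):
    """Shannon entropy (in bits) of a distribution over answer options.

    The inputs are renormalized, so unnormalized scores that are
    non-negative and not all zero are accepted.
    """
    total = sum(option_probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    normalized = [p / total for p in option_probs]
    return -sum(p * math.log2(p) for p in normalized if p > 0)


# Hypothetical four-option distributions, invented for illustration.
confident = [0.90, 0.05, 0.03, 0.02]
uncertain = [0.25, 0.25, 0.25, 0.25]

print(answer_entropy(confident) < answer_entropy(uncertain))  # True
print(answer_entropy(uncertain))  # 2.0
```

Under this assumed measure, a flatter distribution over options for the toxic variant of a question would register as higher uncertainty than for its baseline variant.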

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that toxic lexical perturbations in otherwise controlled prompts consistently reduce factual accuracy and increase uncertainty across five LLMs on ARC-Easy, GSM8K, and MMLU, while polite phrasing produces limited and inconsistent changes; attribution-graph analysis of activations further shows that toxicity selectively amplifies perturbation-sensitive variant nodes while core reasoning nodes remain relatively invariant.

Significance. If the central behavioral and mechanistic findings hold after addressing equivalence and statistical gaps, the work would provide concrete evidence that surface-level prompt tone affects both output reliability and internal circuit dynamics, positioning tone as a distinct reliability factor beyond semantic content.

major comments (3)
  1. [controlled prompt variations section] Controlled prompt variations section: the claim that polite/random/toxic variants are semantically equivalent except for intended lexical/tone changes is load-bearing for attributing accuracy drops and node amplification to toxicity, yet no embedding cosine similarity, edit distance, or blinded human equivalence ratings are reported to rule out shifts in difficulty or ambiguity.
  2. [Results section (ARC-Easy/GSM8K/MMLU evaluations)] Results on ARC-Easy/GSM8K/MMLU: the abstract and main findings assert consistent effects of toxicity on accuracy and uncertainty, but no sample sizes per condition, exact toxicity operationalization, statistical tests (e.g., paired t-tests or ANOVA), or error bars are supplied, preventing verification that the reported consistency is statistically supported rather than anecdotal.
  3. [attribution-graph analyses] Attribution-graph analysis: the distinction between 'perturbation-sensitive variant nodes' and 'relatively stable core reasoning nodes' is central to the mechanistic claim, but the manuscript provides no quantitative criteria (activation thresholds, influence scores, or clustering method) for node classification or invariance measurement.
minor comments (2)
  1. [Abstract] Abstract states 'three toxicity levels' but does not define the exact lexical items or scaling procedure used to generate them.
  2. [attribution-graph analyses] Notation for 'variant nodes' and 'core reasoning nodes' is introduced without an explicit definition or reference to prior circuit work.
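The equivalence checks requested in major comment 1 can be sketched briefly. The example below computes a token-level edit distance and a cosine similarity between a baseline prompt and a variant. It uses bag-of-words count vectors as a stand-in for sentence embeddings, which is an assumption made to keep the sketch dependency-free, and the prompt pair is invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())


def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cosine_similarity(a, b):
    """Cosine similarity of bag-of-words counts (stand-in for embeddings)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical prompt pair, not taken from the paper.
base = tokenize("What is the boiling point of water at sea level?")
variant = tokenize("Please, what is the boiling point of water at sea level?")

print(edit_distance(base, variant))                # 1
print(round(cosine_similarity(base, variant), 3))  # 0.953
```

Lexical overlap of this kind only shows that the variants share surface content. It cannot rule out shifts in difficulty or ambiguity, which is why the referee also asks for blinded human equivalence ratings.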

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for strengthening the manuscript's claims regarding prompt equivalence, statistical rigor, and mechanistic quantification. We address each point below and will make the corresponding revisions.

Point-by-point responses
  1. Referee: [controlled prompt variations section] Controlled prompt variations section: the claim that polite/random/toxic variants are semantically equivalent except for intended lexical/tone changes is load-bearing for attributing accuracy drops and node amplification to toxicity, yet no embedding cosine similarity, edit distance, or blinded human equivalence ratings are reported to rule out shifts in difficulty or ambiguity.

    Authors: We agree that establishing semantic equivalence is essential to isolate the effect of toxicity. The manuscript relied on manual construction of variants to maintain semantic content while altering tone, but did not report quantitative equivalence metrics. In the revised manuscript, we will add pairwise cosine similarities computed from sentence embeddings, average edit distances, and results from a blinded human evaluation where raters assess equivalence on a Likert scale for a sample of prompts. These will be included in a new subsection under Methods.
    revision: yes

  2. Referee: [Results section (ARC-Easy/GSM8K/MMLU evaluations)] Results on ARC-Easy/GSM8K/MMLU: the abstract and main findings assert consistent effects of toxicity on accuracy and uncertainty, but no sample sizes per condition, exact toxicity operationalization, statistical tests (e.g., paired t-tests or ANOVA), or error bars are supplied, preventing verification that the reported consistency is statistically supported rather than anecdotal.

    Authors: We acknowledge the need for greater statistical transparency. The evaluations were conducted on the standard test sets of the benchmarks. In the revision, we will explicitly state the sample sizes per condition and benchmark, provide the exact operationalization of toxicity levels with example prompts in an appendix, include error bars representing standard error, and apply statistical tests such as paired t-tests or Wilcoxon signed-rank tests for accuracy and uncertainty differences between conditions, with appropriate multiple-comparison corrections.
    revision: yes

  3. Referee: [attribution-graph analyses] Attribution-graph analysis: the distinction between 'perturbation-sensitive variant nodes' and 'relatively stable core reasoning nodes' is central to the mechanistic claim, but the manuscript provides no quantitative criteria (activation thresholds, influence scores, or clustering method) for node classification or invariance measurement.

    Authors: The node classification in the original analysis was performed by identifying nodes whose activation or influence scores changed substantially under toxic perturbations compared to baseline, while core nodes showed minimal variation. However, explicit quantitative thresholds were not detailed. In the revised version, we will specify the criteria used, such as a minimum relative change in activation for variant nodes and a maximum change for invariant nodes, along with the method for computing influence scores and any clustering applied to group nodes. This will be added to the attribution-graph analysis section with supporting figures.
    revision: yes
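The threshold criteria mentioned in the third response could take a form like the sketch below. Nodes are labeled by the relative change in a per-node score between the baseline and the toxic prompt. The threshold values, the node names, the scores, and the "unclassified" middle band are all hypothetical. Neither the paper nor the rebuttal specifies them.

```python
def classify_nodes(baseline, perturbed, variant_min=0.5, core_max=0.1, eps=1e-8):
    """Label nodes by relative change in score between two conditions.

    baseline and perturbed map node ids to activation or influence scores.
    variant_min and core_max are illustrative thresholds, not values
    reported by the paper.
    """
    labels = {}
    for node, base_score in baseline.items():
        change = abs(perturbed[node] - base_score) / (abs(base_score) + eps)
        if change >= variant_min:
            labels[node] = "variant"
        elif change <= core_max:
            labels[node] = "core"
        else:
            labels[node] = "unclassified"
    return labels


# Hypothetical scores for three nodes, invented for illustration.
baseline_scores = {"n1": 1.0, "n2": 2.0, "n3": 0.5}
toxic_scores = {"n1": 1.05, "n2": 3.5, "n3": 0.6}

print(classify_nodes(baseline_scores, toxic_scores))
# {'n1': 'core', 'n2': 'variant', 'n3': 'unclassified'}
```

Reporting how the labels shift as the two thresholds vary would show whether the variant/core split is robust or an artifact of one particular cutoff.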

Circularity Check

0 steps flagged

No circularity: empirical measurements on benchmarks with no derived equations or self-referential reductions

full rationale

The paper reports direct experimental results: controlled prompt perturbations (polite/random/toxic) are applied to ARC-Easy/GSM8K/MMLU, accuracy and uncertainty are measured, and attribution graphs are computed from activations. No equations, fitted parameters, or derivations are present that could reduce a 'prediction' to an input by construction. Attribution analysis is post-hoc and descriptive rather than a uniqueness theorem or ansatz smuggled via self-citation. The study is self-contained against external benchmarks; the semantic-equivalence assumption is a methodological concern but does not create circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review; the central claim rests on the domain assumption that lexical-tone perturbations preserve semantic equivalence and on standard LLM evaluation practices. No invented entities are introduced.

free parameters (1)
  • toxicity level definitions
    Three discrete toxicity levels are used to create the prompt variants; their exact construction is not specified in the abstract.
axioms (1)
  • domain assumption: Prompts can be varied in lexical tone while remaining semantically equivalent
    The study design treats polite/random/toxic versions as controlled variants of the same underlying questions.

pith-pipeline@v0.9.1-grok · 5740 in / 1246 out tokens · 30967 ms · 2026-06-28T22:41:55.036662+00:00 · methodology

