Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

Agam Goyal; Amruta Parulekar; Eshwar Chandrasekharan; Hari Sundaram; Joshua Chen; Koustuv Saha; Navin Kumar; Soorya Ram Shimgekar; Yian Wang

arxiv: 2605.30913 · v1 · pith:W3O5LV7Rnew · submitted 2026-05-29 · 💻 cs.CL · cs.AI · cs.CY · cs.HC

Pith reviewed 2026-06-28 22:41 UTC · model grok-4.3

keywords: LLM reliability · prompt perturbations · toxicity · factual accuracy · attribution graphs · internal circuits · hallucinations · model activations

The pith

Toxic wording added to prompts reduces factual accuracy in large language models and shifts which internal nodes drive their answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether prompts that keep the same core meaning but vary in tone from polite to toxic change how reliably LLMs answer factual questions. Experiments across five models on ARC-Easy, GSM8K, and MMLU show that toxic lexical changes lower accuracy and raise uncertainty, while polite rephrasings produce only small and inconsistent shifts. Attribution graphs of activations then reveal that higher toxicity amplifies perturbation-sensitive variant nodes while leaving core reasoning nodes relatively stable. These results indicate that surface tone can alter both outputs and internal computation paths even when the intended question stays fixed.

Core claim

Toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. Attribution-graph analyses show that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant.

What carries the argument

Attribution-graph analyses that trace activation influences and identify perturbation-sensitive variant nodes versus invariant core reasoning nodes.

If this is right

Toxic prompts degrade performance on multiple-choice and math reasoning tasks.
Polite rewording does not produce reliable accuracy gains across models.
Internal computation paths become more sensitive to surface variations as toxicity rises.
Prompt tone functions as an independent factor in output consistency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety evaluations may need to test models under adversarial tone conditions to catch reliability drops.
Training objectives that penalize sensitivity to lexical tone could reduce unwanted output shifts.
The node-stability pattern might appear in other prompt dimensions such as length or formality.

Load-bearing premise

The prompt versions differ only in the intended lexical or tone perturbations and remain otherwise semantically equivalent.

What would settle it

Rerunning the same benchmarks across the toxicity levels and finding no drop in accuracy relative to the unperturbed prompts, or no selective change in node influence, would falsify the central claim.

Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.
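The abstract reports increased uncertainty but does not say how uncertainty is measured. One common choice, assumed here purely for illustration, is the Shannon entropy of the model's probability distribution over the answer options. The sketch below uses made-up probabilities, not values from the paper, and `answer_entropy` is a hypothetical helper name.

```python
import math


def answer_entropy(probs: list[float]) -> float:
    """Shannon entropy (in bits) of a distribution over answer options.

    The inputs are renormalized so they sum to 1 before the entropy is taken.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    normalized = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in normalized if p > 0)


# Illustrative (fabricated) distributions over four multiple-choice options.
confident = answer_entropy([0.97, 0.01, 0.01, 0.01])
uncertain = answer_entropy([0.25, 0.25, 0.25, 0.25])

print(f"confident: {confident:.3f} bits")
print(f"uncertain: {uncertain:.3f} bits")
```

Under this assumed measure, a perturbation that raises uncertainty would show up as a higher entropy for the same question.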

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Toxic prompt variants cut accuracy on standard benchmarks and amplify certain nodes in attribution graphs, but the claim rests on unverified semantic equivalence.

The core finding is that toxic lexical tweaks to prompts drop factual accuracy on ARC-Easy, GSM8K, and MMLU while polite versions do little, and the attribution analysis shows toxicity boosting perturbation-sensitive nodes while core reasoning nodes stay steadier.

They apply existing circuit-style attribution to a new variable (toxicity level) and run controlled variations across five models. That combination of behavioral results and internal tracing is the actual new piece; prior work on prompt tone did not include the node-level amplification pattern.

The main weakness is the untested assumption that the polite, random, and toxic versions differ only in tone. No embedding similarity, edit distance, or human ratings appear to confirm the prompts stay equivalent in difficulty or meaning. Without that, the accuracy drops and the circuit shifts could trace to content changes rather than toxicity itself. The abstract also omits sample sizes, exact toxicity definitions, and any statistical tests, which leaves the consistency claim hard to evaluate from the given text.
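The equivalence checks named in this paragraph can be sketched with the standard library alone. The example below computes a token-level edit distance and a bag-of-words cosine similarity between a base prompt and a variant. The bag-of-words cosine is a simple stand-in for the embedding similarity mentioned above, not a replacement for it, and the prompts, tokenizer, and function names are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def edit_distance(a: list[str], b: list[str]) -> int:
    """Token-level Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cosine_similarity(a: list[str], b: list[str]) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    counts_a, counts_b = Counter(a), Counter(b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical prompt pair; the paper's actual variants are not shown in the source.
base = "What is the boiling point of water at sea level?"
variant = "Please, what is the boiling point of water at sea level?"

base_tokens = tokenize(base)
variant_tokens = tokenize(variant)
distance = edit_distance(base_tokens, variant_tokens)
similarity = cosine_similarity(base_tokens, variant_tokens)

print(f"edit distance: {distance}, cosine similarity: {similarity:.3f}")
```

Surface metrics like these only bound how much text changed. They cannot confirm that difficulty or meaning is preserved, which is why the paragraph above also asks for human ratings.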

This is useful for anyone studying conversational robustness or prompt sensitivity. Readers who already work with attribution methods will see a direct extension; others will mainly get the behavioral pattern.

It should go to peer review. The practical question is worth referee time and the mechanistic angle is a legitimate next step, even though the equivalence check and reporting details need tightening.

Referee Report

3 major / 2 minor

Summary. The paper claims that toxic lexical perturbations in otherwise controlled prompts consistently reduce factual accuracy and increase uncertainty across five LLMs on ARC-Easy, GSM8K, and MMLU, while polite phrasing produces limited and inconsistent changes; attribution-graph analysis of activations further shows that toxicity selectively amplifies perturbation-sensitive variant nodes while core reasoning nodes remain relatively invariant.

Significance. If the central behavioral and mechanistic findings hold after addressing equivalence and statistical gaps, the work would provide concrete evidence that surface-level prompt tone affects both output reliability and internal circuit dynamics, positioning tone as a distinct reliability factor beyond semantic content.

major comments (3)

[controlled prompt variations section] The claim that polite/random/toxic variants are semantically equivalent except for the intended lexical/tone changes is load-bearing for attributing accuracy drops and node amplification to toxicity, yet no embedding cosine similarity, edit distance, or blinded human equivalence ratings are reported to rule out shifts in difficulty or ambiguity.
[Results section (ARC-Easy/GSM8K/MMLU evaluations)] The abstract and main findings assert consistent effects of toxicity on accuracy and uncertainty, but no sample sizes per condition, exact toxicity operationalization, statistical tests (e.g., paired t-tests or ANOVA), or error bars are supplied, preventing verification that the reported consistency is statistically supported rather than anecdotal.
[attribution-graph analyses] The distinction between 'perturbation-sensitive variant nodes' and 'relatively stable core reasoning nodes' is central to the mechanistic claim, but the manuscript provides no quantitative criteria (activation thresholds, influence scores, or clustering method) for node classification or invariance measurement.
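To make the statistical request in the second major comment concrete, here is a minimal sketch of one paired test that suits per-question correct/incorrect outcomes: an exact McNemar test. This is a swapped-in choice for illustration, not one of the tests the referee names (paired t-tests, ANOVA) and not the authors' method. The outcome vectors are fabricated toy data, and `mcnemar_exact` is a hypothetical helper.

```python
from math import comb


def mcnemar_exact(
    baseline_correct: list[bool], perturbed_correct: list[bool]
) -> tuple[int, int, float]:
    """Two-sided exact McNemar test on paired per-item correctness.

    Returns (b, c, p_value), where b counts items answered correctly only
    under the baseline prompt and c counts items answered correctly only
    under the perturbed prompt.
    """
    if len(baseline_correct) != len(perturbed_correct):
        raise ValueError("both conditions must cover the same items")
    pairs = list(zip(baseline_correct, perturbed_correct))
    b = sum(base and not pert for base, pert in pairs)
    c = sum(pert and not base for base, pert in pairs)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


# Toy data: ten questions answered under a baseline and a toxic prompt.
baseline = [True] * 8 + [False] * 2
toxic = [True] * 3 + [False] * 5 + [True, False]

b, c, p_value = mcnemar_exact(baseline, toxic)
print(f"baseline-only correct: {b}, toxic-only correct: {c}, p = {p_value:.4f}")
```

Because the test uses only the discordant pairs, it asks directly whether the perturbation flips more answers from right to wrong than the reverse. With several models, benchmarks, and toxicity levels, a multiple-comparison correction would still be needed.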

minor comments (2)

[Abstract] The abstract states 'three toxicity levels' but does not define the exact lexical items or scaling procedure used to generate them.
[attribution-graph analyses] Notation for 'variant nodes' and 'core reasoning nodes' is introduced without an explicit definition or reference to prior circuit work.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for strengthening the manuscript's claims regarding prompt equivalence, statistical rigor, and mechanistic quantification. We address each point below and will make the corresponding revisions.

Referee: [controlled prompt variations section] The claim that polite/random/toxic variants are semantically equivalent except for the intended lexical/tone changes is load-bearing for attributing accuracy drops and node amplification to toxicity, yet no embedding cosine similarity, edit distance, or blinded human equivalence ratings are reported to rule out shifts in difficulty or ambiguity.

Authors: We agree that establishing semantic equivalence is essential to isolate the effect of toxicity. The manuscript relied on manual construction of variants to maintain semantic content while altering tone, but did not report quantitative equivalence metrics. In the revised manuscript, we will add pairwise cosine similarities computed from sentence embeddings, average edit distances, and results from a blinded human evaluation where raters assess equivalence on a Likert scale for a sample of prompts. These will be included in a new subsection under Methods.

Revision: yes

Referee: [Results section (ARC-Easy/GSM8K/MMLU evaluations)] The abstract and main findings assert consistent effects of toxicity on accuracy and uncertainty, but no sample sizes per condition, exact toxicity operationalization, statistical tests (e.g., paired t-tests or ANOVA), or error bars are supplied, preventing verification that the reported consistency is statistically supported rather than anecdotal.

Authors: We acknowledge the need for greater statistical transparency. The evaluations were conducted on the standard test sets of the benchmarks. In the revision, we will explicitly state the sample sizes per condition and benchmark, provide the exact operationalization of toxicity levels with example prompts in an appendix, include error bars representing standard error, and apply statistical tests such as paired t-tests or Wilcoxon signed-rank tests for accuracy and uncertainty differences between conditions, with appropriate multiple-comparison corrections.

Revision: yes

Referee: [attribution-graph analyses] The distinction between 'perturbation-sensitive variant nodes' and 'relatively stable core reasoning nodes' is central to the mechanistic claim, but the manuscript provides no quantitative criteria (activation thresholds, influence scores, or clustering method) for node classification or invariance measurement.

Authors: The node classification in the original analysis was performed by identifying nodes whose activation or influence scores changed substantially under toxic perturbations compared to baseline, while core nodes showed minimal variation. However, explicit quantitative thresholds were not detailed. In the revised version, we will specify the criteria used, such as a minimum relative change in activation for variant nodes and a maximum change for invariant nodes, along with the method for computing influence scores and any clustering applied to group nodes. This will be added to the attribution-graph analysis section with supporting figures.

Revision: yes
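The threshold rule described in the last response can be sketched as follows. This assumes each node has a single scalar influence score per condition. The thresholds (`variant_min`, `core_max`), the node names, and the scores are all hypothetical, since the source gives no actual criteria or values.

```python
def classify_nodes(
    baseline: dict[str, float],
    perturbed: dict[str, float],
    variant_min: float = 0.5,  # hypothetical: at least 50% relative change
    core_max: float = 0.1,     # hypothetical: at most 10% relative change
    eps: float = 1e-8,
) -> dict[str, str]:
    """Label nodes by relative change in influence between two conditions.

    Nodes whose change falls between the two thresholds are left
    'unclassified' rather than forced into either group.
    """
    labels = {}
    for node, base_score in baseline.items():
        change = abs(perturbed[node] - base_score) / (abs(base_score) + eps)
        if change >= variant_min:
            labels[node] = "variant"
        elif change <= core_max:
            labels[node] = "core"
        else:
            labels[node] = "unclassified"
    return labels


# Fabricated influence scores for three nodes under baseline and toxic prompts.
baseline_scores = {"n1": 1.0, "n2": 0.8, "n3": 0.5}
toxic_scores = {"n1": 1.05, "n2": 1.6, "n3": 0.6}

labels = classify_nodes(baseline_scores, toxic_scores)
print(labels)
```

Keeping an explicit 'unclassified' band makes the sensitivity of the variant/core split to the chosen thresholds visible, which is the kind of detail the referee asks the authors to report.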

Circularity Check

0 steps flagged

No circularity: empirical measurements on benchmarks with no derived equations or self-referential reductions

The paper reports direct experimental results: controlled prompt perturbations (polite/random/toxic) are applied to ARC-Easy/GSM8K/MMLU, accuracy and uncertainty are measured, and attribution graphs are computed from activations. No equations, fitted parameters, or derivations are present that could reduce a 'prediction' to an input by construction. Attribution analysis is post-hoc and descriptive rather than a uniqueness theorem or ansatz smuggled via self-citation. The study is self-contained against external benchmarks; the semantic-equivalence assumption is a methodological concern but does not create circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review; the central claim rests on the domain assumption that lexical-tone perturbations preserve semantic equivalence and on standard LLM evaluation practices. No invented entities are introduced.

free parameters (1)

toxicity level definitions
Three discrete toxicity levels are used to create the prompt variants; their exact construction is not specified in the abstract.

axioms (1)

domain assumption: Prompts can be varied in lexical tone while remaining semantically equivalent
The study design treats polite/random/toxic versions as controlled variants of the same underlying questions.

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
[74]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
