pith. machine review for the scientific record. sign in

arxiv: 2512.21110 · v3 · submitted 2025-12-24 · 💻 cs.AI · cs.CL· cs.CR· cs.CY

Recognition: 2 theorem links

· Lean Theorem

Beyond Context: Large Language Models' Failure to Grasp Users' Intent

Authors on Pith no claims yet

Pith reviewed 2026-05-16 20:07 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CRcs.CY
keywords large language modelsAI safetyuser intentprompt techniquessafety bypasscontext understandingreasoning models
0
0 comments X

The pith

Large language models fail to recognize user intent, allowing systematic bypasses of safety filters via emotional framing and gradual prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current LLM safety systems focus narrowly on spotting explicit harmful content and therefore miss deeper failures to understand context or user intent. This gap lets attackers use techniques such as emotional framing, progressive revelation of details, and academic-sounding justifications to obtain restricted information. Tests across ChatGPT, Claude, Gemini, and DeepSeek showed these methods reliably circumvented protections, and adding reasoning capabilities often made the problem worse by supplying more precise answers without questioning motives. Only Claude Opus 4.1 sometimes refused on intent grounds. The authors conclude that these patterns point to built-in architectural weaknesses rather than isolated bugs.

Core claim

Current safety approaches in large language models concentrate on explicit harmful content while overlooking the inability to understand context and recognize user intent. This creates exploitable vulnerabilities that malicious users can systematically leverage through emotional framing, progressive revelation, and academic justification techniques. Empirical evaluation of models including ChatGPT, Claude, Gemini, and DeepSeek shows these methods circumvent reliable safety mechanisms, with reasoning-enabled configurations increasing factual precision without interrogating underlying intent. The pattern indicates that present architectural designs embed systematic vulnerabilities rather than,

What carries the argument

The inability of LLMs to recognize user intent beyond surface content, exposed by the three prompting techniques of emotional framing, progressive revelation, and academic justification.

Load-bearing premise

That the described prompting techniques demonstrate a failure to grasp intent rather than simply succeeding at eliciting compliant responses from instruction-following models.

What would settle it

A test in which the same harmful request is presented first with clear malicious framing and then with explicit benign intent; consistent refusal only in the benign case would falsify the claim that intent detection is missing.

Figures

Figures reproduced from arXiv: 2512.21110 by Ahmed M. Hussain, Salahuddin Salahuddin.

Figure 1
Figure 1. Figure 1: Semantic layering demonstrating intent obfuscation [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Prompts illustrating harmful intent obfuscation. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: DeepSeek Performance. DeepSeek exhibited similar pat￾terns of empathetic framing paired with unfiltered factual disclosure. Specifically, DeepSeek DeepThink demonstrates awareness without action: reasoning trace explicitly states ”disconnect could signal concealed self-harm intent,” yet pro￾vides detailed depths anyway. This validates rather than con￾tradicts our thesis—the system achieves pattern recognit… view at source ↗
Figure 3
Figure 3. Figure 3: Gemini 2.5 Pro response demonstrating dual-track be [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 6
Figure 6. Figure 6: Claude Opus 4.1 response demonstrating intent recog [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 9
Figure 9. Figure 9: Justification exploiting practical crisis context (red) to [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 7
Figure 7. Figure 7: Contextualized emotional manipulation combining [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 11
Figure 11. Figure 11: Academic justification exploitation using creative [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Gemini pro Q1 Reasoning. Source validation for [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 14
Figure 14. Figure 14: Gemini pro Q3. Bridge height analysis demonstrating [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
Figure 13
Figure 13. Figure 13: Gemini pro Q2. Enhanced precision in observation [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 15
Figure 15. Figure 15: Gemini pro Q5. San Francisco viewpoint analysis [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗
Figure 17
Figure 17. Figure 17: Gemini pro Q6 cognitive processing. academic fram [PITH_FULL_IMAGE:figures/full_fig_p019_17.png] view at source ↗
Figure 16
Figure 16. Figure 16: Gemini pro Q4. Parking garage height investigation [PITH_FULL_IMAGE:figures/full_fig_p019_16.png] view at source ↗
Figure 19
Figure 19. Figure 19: Deepseek Q2 reasoning. Online harassment context [PITH_FULL_IMAGE:figures/full_fig_p020_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Deepseek Q3 cognitive framework. Academic failure [PITH_FULL_IMAGE:figures/full_fig_p020_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Deepseek Q4 analytical. Recognition of the housing [PITH_FULL_IMAGE:figures/full_fig_p020_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Deepseek Q5 reasoning. Employment loss context as [PITH_FULL_IMAGE:figures/full_fig_p020_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Deepseek Q6. Academic framing recognition with a [PITH_FULL_IMAGE:figures/full_fig_p021_23.png] view at source ↗
read the original abstract

Current Large Language Models (LLMs) safety approaches focus on explicitly harmful content while overlooking a critical vulnerability: the inability to understand context and recognize user intent. This creates exploitable vulnerabilities that malicious users can systematically leverage to circumvent safety mechanisms. We empirically evaluate multiple state-of-the-art LLMs, including ChatGPT, Claude, Gemini, and DeepSeek. Our analysis demonstrates the circumvention of reliable safety mechanisms through emotional framing, progressive revelation, and academic justification techniques. Notably, reasoning-enabled configurations amplified rather than mitigated the effectiveness of exploitation, increasing factual precision while failing to interrogate the underlying intent. The exception was Claude Opus 4.1, which prioritized intent detection over information provision in some use cases. This pattern reveals that current architectural designs create systematic vulnerabilities. These limitations require paradigmatic shifts toward contextual understanding and intent recognition as core safety capabilities rather than post-hoc protective mechanisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that current LLMs lack the ability to understand context and recognize malicious user intent, creating systematic vulnerabilities in safety mechanisms. It reports empirical evaluations on models including ChatGPT, Claude, Gemini, and DeepSeek showing that techniques such as emotional framing, progressive revelation, and academic justification successfully circumvent safety filters. Reasoning-enabled configurations are said to amplify rather than mitigate exploitation by increasing factual precision of harmful outputs, with Claude Opus 4.1 as a partial exception that sometimes prioritizes intent detection. The work concludes that paradigmatic architectural shifts toward core intent recognition are required.

Significance. If the empirical demonstrations hold after proper controls and quantification, the result would highlight a genuine gap between content-based safety filters and true intent understanding, with potential to redirect safety research toward architectures that explicitly model user goals rather than surface patterns.

major comments (3)
  1. [Abstract] Abstract: the claim of empirical evaluation across multiple models supplies no quantitative metrics, specific prompts, success rates, baseline comparisons, or control conditions, leaving the central circumvention claim without visible supporting evidence.
  2. [Results] Results (implied by abstract description): the reported increase in factual precision under reasoning-enabled settings is consistent with stronger literal instruction-following on reframed prompts rather than failure to detect intent, since emotional framing and academic justification alter the explicit prompt content itself.
  3. [Methods] Methods (implied): no controls are described that isolate intent inference from compliance (e.g., direct harmful requests versus incrementally framed versions or explicit intent probes), so the data cannot distinguish the two interpretations.
minor comments (1)
  1. [Abstract] Abstract: verify the exact model name 'Claude Opus 4.1' against current Anthropic releases.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications based on the full paper content and indicating where revisions have been made.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of empirical evaluation across multiple models supplies no quantitative metrics, specific prompts, success rates, baseline comparisons, or control conditions, leaving the central circumvention claim without visible supporting evidence.

    Authors: The abstract is intentionally concise, but the full manuscript (Sections 3 and 4) provides the requested details: specific prompt templates for each framing technique, success rates (e.g., 72% average bypass for emotional framing across tested models), baseline comparisons to direct harmful queries (0% success), and control conditions. To improve visibility, we have revised the abstract to include a brief summary of these quantitative results and key metrics. revision: yes

  2. Referee: [Results] Results (implied by abstract description): the reported increase in factual precision under reasoning-enabled settings is consistent with stronger literal instruction-following on reframed prompts rather than failure to detect intent, since emotional framing and academic justification alter the explicit prompt content itself.

    Authors: We maintain that the distinction is substantive: direct harmful requests are refused while identically harmful requests succeed only after reframing that preserves intent but changes surface form. The increase in factual precision under reasoning modes occurs precisely because the model executes the literal (reframed) request without probing the underlying goal, which we illustrate with paired examples in the results. We have added a new paragraph in the discussion section to explicitly contrast literal compliance versus intent detection. revision: partial

  3. Referee: [Methods] Methods (implied): no controls are described that isolate intent inference from compliance (e.g., direct harmful requests versus incrementally framed versions or explicit intent probes), so the data cannot distinguish the two interpretations.

    Authors: The original manuscript already includes direct-versus-framed comparisons as the primary control (direct requests refused, framed versions accepted). We have expanded the Methods section with additional explicit intent-probe questions and incremental framing steps to further isolate the effect, as suggested. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation without derivations or self-referential reductions

full rationale

The paper reports direct empirical tests of LLMs (ChatGPT, Claude, Gemini, DeepSeek) using prompting techniques such as emotional framing, progressive revelation, and academic justification. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described methodology. Claims rest on observed outcomes rather than any step that reduces by construction to prior definitions or fits. This is a standard observational study with no self-definitional, fitted-input, or uniqueness-imported circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical study of existing models and does not introduce mathematical derivations, new parameters, or postulated entities.

axioms (1)
  • domain assumption LLM safety mechanisms primarily filter based on explicit content patterns rather than inferred user intent.
    This premise underpins the claim that intent recognition is a missing core capability.

pith-pipeline@v0.9.0 · 5451 in / 1099 out tokens · 28281 ms · 2026-05-16T20:07:37.763810+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

127 extracted references · 127 canonical work pages · 21 internal anchors

  1. [1]

    Intent detection in the age of llms,

    G. Arora, S. Jain, and S. Merugu, “Intent detection in the age of llms,” arXiv preprint arXiv:2410.01627, 2024

  2. [2]

    DecodingTrust: A Compre- hensive Assessment of Trustworthiness in GPT Models,

    B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaefferet al., “DecodingTrust: A Compre- hensive Assessment of Trustworthiness in GPT Models,”arXiv preprint arXiv:2306.11698, 2023

  3. [3]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Liet al., “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal,”arXiv preprint arXiv:2402.04249, 2024

  4. [4]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,

    J. D. M.-W. C. Kenton and L. K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of NAACL-HLT, 2019, pp. 4171–4186

  5. [5]

    What Does BERT Look At? An Analysis of BERT’s Attention,

    K. Clark, U. Khandelwal, O. Levy, and C. D. Manning, “What Does BERT Look At? An Analysis of BERT’s Attention,” inProceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2019, pp. 276–286

  6. [6]

    Extracting Training Data from Large Language Models,

    N. Carlini, F. Tram `er, E. Wallace, M. Jagielski, A. Herbert-V oss, K. Lee, A. Roberts, T. Brown, D. Song, ´U. Erlingssonet al., “Extracting Training Data from Large Language Models,” in30th USENIX Security Symposium, 2021, pp. 2633–2650

  7. [7]

    Ethical Challenges in Data-driven Dialogue Systems,

    P. Henderson, K. Sinha, N. Angelard-Gontier, N. R. Ke, G. Fried, R. Lowe, and J. Pineau, “Ethical Challenges in Data-driven Dialogue Systems,” pp. 123–129, 2017

  8. [8]

    Towards conversational diagnostic artificial intelligence,

    T. Tu, M. Schaekermann, A. Palepu, K. Saab, J. Freyberg, R. Tanno, A. Wang, B. Li, M. Amin, Y . Chenget al., “Towards conversational diagnostic artificial intelligence,”Nature, pp. 1–9, 2025

  9. [9]

    Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference,

    T. McCoy, E. Pavlick, and T. Linzen, “Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference,” in Proceedings of the 57th Annual Meeting of the Association for Com- putational Linguistics, 2019, pp. 3428–3448

  10. [10]

    BERT Rediscovers the Classical NLP Pipeline,

    I. Tenney, D. Das, and E. Pavlick, “BERT Rediscovers the Classical NLP Pipeline,”Proceedings of the 57th Annual Meeting of the Asso- ciation for Computational Linguistics, pp. 4593–4601, 2019

  11. [11]

    SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions,

    Z. Zhang, L. Xu, D. Zhao, Y . Onoe, M. Khalil, H. Ross, I. Kocyigit, M. Ashraf, Y .-L. Boureau, A. Nematzadehet al., “SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions,”arXiv preprint arXiv:2309.07045, 2023

  12. [12]

    Real- ToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models,

    S. Gehman, S. Gururangan, M. Sap, Y . Choi, and N. A. Smith, “Real- ToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020, pp. 3356–3369

  13. [13]

    On the Opportunities and Risks of Foundation Models

    R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskillet al., “On the Opportunities and Risks of Foundation Models,”arXiv preprint arXiv:2108.07258, 2021

  14. [14]

    Ethical and social risks of harm from Language Models

    L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadehet al., “Ethical and Social Risks of Harm from Language Models,”arXiv preprint arXiv:2112.04359, 2021

  15. [15]

    Concrete Problems in AI Safety,

    D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man ´e, “Concrete Problems in AI Safety,” inNIPS Workshop on Aligned Artificial Intelligence, 2016

  16. [16]

    Russell,Human Compatible: Artificial Intelligence and the Problem of Control

    S. Russell,Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019

  17. [17]

    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

    S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freireet al., “Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback,”arXiv preprint arXiv:2307.15217, 2023

  18. [18]

    Constitutional AI: Harmlessness from AI Feedback

    Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnonet al., “Con- stitutional AI: Harmlessness from AI Feedback,”arXiv preprint arXiv:2212.08073, 2022

  19. [19]

    Emergent Abil- ities of Large Language Models,

    J. Wei, Y . Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzleret al., “Emergent Abil- ities of Large Language Models,”Transactions on Machine Learning Research, 2022

  20. [20]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonsoet al., “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models,”arXiv preprint arXiv:2206.04615, 2022

  21. [21]

    Attention is All You Need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All You Need,” Advances in Neural Information Processing Systems, vol. 30, 2017

  22. [22]

    Language Mod- els are Few-shot Learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language Mod- els are Few-shot Learners,”Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020

  23. [23]

    PaLM: Scaling Language Modeling with Pathways,

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmannet al., “PaLM: Scaling Language Modeling with Pathways,” vol. 24, no. 240, 2022, pp. 1–113

  24. [24]

    GPT-4 Technical Report

    OpenAI, “GPT-4 Technical Report,”arXiv preprint arXiv:2303.08774, 2023

  25. [25]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling Laws for Neural Language Models,”arXiv preprint arXiv:2001.08361, 2020

  26. [26]

    Training Compute-Optimal Large Language Models

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clarket al., “Training Compute-optimal Large Language Models,”arXiv preprint arXiv:2203.15556, 2022

  27. [27]

    On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

    E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” pp. 610–623, 2021

  28. [28]

    The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence,

    G. Marcus, “The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence,”arXiv preprint arXiv:2002.06177, 2020

  29. [29]

    Stress Test Evaluation for Natural Language Inference,

    A. Naik, A. Ravichander, N. Sadeh, C. Rose, and G. Neubig, “Stress Test Evaluation for Natural Language Inference,” pp. 2340–2353, 2018

  30. [30]

    B. J. Grosz and C. L. Sidner,Attention, Intentions, and the Structure of Discourse. MIT Press, 1986, vol. 12, no. 3

  31. [31]

    J. R. Hobbs, M. E. Stickel, D. E. Appelt, and P. Martin,Interpretation as Abduction. Elsevier, 1993, vol. 63, no. 1-2

  32. [32]

    Winograd,Understanding Natural Language

    T. Winograd,Understanding Natural Language. Academic Press, 1972

  33. [33]

    R. C. Schank and R. P. Abelson,Scripts, Plans, Goals, and Under- standing: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum, 1977

  34. [34]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- training of Deep Bidirectional Transformers for Language Understand- ing,”arXiv preprint arXiv:1810.04805, 2018

  35. [35]

    Deep Contextualized Word Representations,

    M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep Contextualized Word Representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 2227–2237

  36. [36]

    A Primer in BERTology: What We Know About How BERT Works,

    A. Rogers, O. Kovaleva, and A. Rumshisky, “A Primer in BERTology: What We Know About How BERT Works,”Transactions of the Association for Computational Linguistics, vol. 8, pp. 842–866, 2020

  37. [37]

    Adversarial NLI: A New Benchmark for Natural Language Under- standing,

    Y . Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela, “Adversarial NLI: A New Benchmark for Natural Language Under- standing,”Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4885–4901, 2019

  38. [38]

    Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment,

    D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits, “Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment,” vol. 34, pp. 8018–8025, 2020

  39. [39]

    Evasion Attacks Against Machine Learning at Test Time,

    B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. ˇSrndi´c, P. Laskov, G. Giacinto, and F. Roli, “Evasion Attacks Against Machine Learning at Test Time,” inJoint European conference on machine learning and knowledge discovery in databases. Springer, 2013, pp. 387–402

  40. [40]

    Explaining and Harness- ing Adversarial Examples,

    I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and Harness- ing Adversarial Examples,” 2014

  41. [41]

    Intriguing properties of neural networks

    C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfel- low, and R. Fergus, “Intriguing Properties of Neural Networks,”arXiv preprint arXiv:1312.6199, 2013

  42. [42]

    Uni- versal Adversarial Triggers for Attacking and Analyzing NLP,

    E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, “Uni- versal Adversarial Triggers for Attacking and Analyzing NLP,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019, pp. 2153–2162

  43. [43]

    HotFlip: White-box Adversarial Examples for Text Classification,

    J. Ebrahimi, A. Rao, D. Lowd, and D. Dou, “HotFlip: White-box Adversarial Examples for Text Classification,” inProceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, pp. 31–36

  44. [44]

    Red Teaming Language Models with Language Models

    E. Perez, S. Huang, F. Song, T. Cai, R. Wong, J. Griffiths, J. McAleese, J. Pokorny, J. Fortier, G. Sastryet al., “Red Teaming Language Models with Language Models,”arXiv preprint arXiv:2202.03286, 2022

  45. [45]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned,

    D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y . Bai, S. Kadavath, B. Mann, E. Perez, N. Sharma, A. Tamkinet al., “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned,” 2022

  46. [46]

    Jailbroken: How Does LLM Safety Training Fail?

    A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How Does LLM Safety Training Fail?”arXiv preprint arXiv:2307.02483, 2023

  47. [47]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and Transferable Adversarial Attacks on Aligned Language Models,”arXiv preprint arXiv:2307.15043, 2023

  48. [48]

    Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs,

    C. Pathade, “Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs,”arXiv preprint arXiv:2505.04806, 2025. [Online]. Available: https://arxiv.org/abs/2505.04806

  49. [49]

    ”Do Anything Now

    X. Shen, Z. Chen, M. Backes, Y . Shen, and Y . Zhang, “”Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models,” inProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 1671–1685. [Online]. Ava...

  50. [50]

    DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLMs Jailbreakers

    X. Li, R. Wang, M. Cheng, T. Zhou, and C.-J. Hsieh, “DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLMs Jailbreakers.” Association for Computational Linguistics, Nov. 2024, pp. 13 891–13 913. [Online]. Available: https://aclanthology.org/2024. findings-emnlp.813/

  51. [51]

    MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots,

    G. Deng, Y . Liu, Y . Li, K. Wang, Y . Zhang, Z. Li, H. Wang, T. Zhang, and Y . Liu, “MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots,” inNDSS, 2024

  52. [52]

    Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors,

    Y .-H. Chen, N. Joshi, Y . Chen, M. Andriushchenko, R. Angell, and H. He, “Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors,” 2025. [Online]. Available: https: //arxiv.org/abs/2506.10949

  53. [53]

    Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

    Y . Liu, G. Deng, Z. Xu, Y . Li, Y . Zheng, Y . Zhang, L. Zhao, T. Zhang, and Y . Liu, “Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study,”arXiv preprint arXiv:2305.13860, 2023

  54. [54]

    Attack Prompt Generation for Red Teaming and Defending Large Language Models,

    B. Deng, H. Zhang, Y . Xiang, L. Deng, S. Hong, R. Gao, H. Zhou, X. Zhang, R. Li, and Z. Li, “Attack Prompt Generation for Red Teaming and Defending Large Language Models,”arXiv preprint arXiv:2310.12505, 2023

  55. [56]

    Available: https://arxiv.org/abs/2501.01335

    [Online]. Available: https://arxiv.org/abs/2501.01335

  56. [57]

    Not What You’ve Signed Up For: Compromising Real-world LLM- integrated Applications with Indirect Prompt Injection,

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not What You’ve Signed Up For: Compromising Real-world LLM- integrated Applications with Indirect Prompt Injection,” inProceedings of the 16th ACM Workshop on Artificial Intelligence and Security, 2023, pp. 79–90

  57. [58]

    Ignore Previous Prompt: Attack Techniques For Language Models,

    F. Perez and I. Ribeiro, “Ignore Previous Prompt: Attack Techniques For Language Models,” 2022

  58. [59]

    Plans for Discourse,

    P. R. Cohen and C. R. Perrault, “Plans for Discourse,” inIntentions in Communication. MIT Press, 1990, pp. 365–388

  59. [60]

    B. J. Grosz and S. Kraus,Collaborative Plans for Complex Group Action. Elsevier, 1996, vol. 86, no. 2

  60. [61]

    Ef- ficient Intent Detection with Dual Sentence Encoders,

    I. Casanueva, T. Tem ˇcinas, D. Gerz, M. Henderson, and I. Vuli ´c, “Ef- ficient Intent Detection with Dual Sentence Encoders,”arXiv preprint arXiv:2003.04807, 2020

  61. [62]

    Intent Classification and Slot Filling for Privacy Policies,

    W. U. A. Zhang, Z. Yan, W. U. Ahmad, and K.-W. Chang, “Intent Classification and Slot Filling for Privacy Policies,” pp. 4402–4417, 2021

  62. [63]

    The Second Dialog State Tracking Challenge,

    M. Henderson, B. Thomson, and J. D. Williams, “The Second Dialog State Tracking Challenge,” pp. 263–272, 2014

  63. [64]

    Towards Scalable Multi-domain Conversational Agents: The Schema-guided Dialogue Dataset,

    A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan, “Towards Scalable Multi-domain Conversational Agents: The Schema-guided Dialogue Dataset,” vol. 34, pp. 8689–8696, 2020

  64. [65]

    Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study,

    C. Sankar, S. Subramanian, C. Pal, S. Chandar, and Y . Bengio, “Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study,”Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 32–37, 2019

  65. [66]

    Pretraining Methods for Dialog Context Representation Learning,

    S. Mehri, S. Kiritchenko, M. Eskenazi, and S. M. Mohammad, “Pretraining Methods for Dialog Context Representation Learning,” Proceedings of the 57th Annual Meeting of the Association for Com- putational Linguistics, pp. 3836–3845, 2019

  66. [67]

    Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models,

    N. Shapira, M. Levy, S. H. Alavi, X. Zhou, Y . Choi, Y . Goldberg, M. Sap, and V . Shwartz, “Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). St. Julian’s, Malta: Associatio...

  67. [68]

    CASE-Bench: Context-Aware SafEty Benchmark for Large Language Models,

    G. Sun, X. Zhan, S. Feng, P. C. Woodland, and J. Such, “CASE-Bench: Context-Aware SafEty Benchmark for Large Language Models,” 2025. [Online]. Available: https://arxiv.org/abs/2501.14940

  68. [69]

    Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models,

    Y . In, W. Kim, K. Yoon, S. Kim, M. Tanjim, K. Kim, and C. Park, “Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models,” 2025. [Online]. Available: https://arxiv.org/abs/2502.15086

  69. [70]

    Training Language Models to Follow Instructions with Human Feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training Language Models to Follow Instructions with Human Feedback,”Advances in Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, 2022

  70. [71]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback,

    Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighanet al., “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback,” 2022

  71. [72]

    Intention Analysis Makes LLMs A Good Jailbreak Defender,

    Y . Zhang, L. Ding, L. Zhang, and D. Tao, “Intention Analysis Makes LLMs A Good Jailbreak Defender,” 2024, cOLING 2025 (to appear). [Online]. Available: https://arxiv.org/abs/2401.06561

  72. [73]

    Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation,

    B. Baker, J. Huizinga, L. Gao, Z. Dou, M. Y . Guan, A. Madry, W. Zaremba, J. Pachocki, and D. Farhi, “Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation,”

  73. [74]
  74. [75]

    LongSafety: Evaluating Long-Context Safety of Large Language Models,

    Y . Lu, J. Cheng, Z. Zhang, S. Cui, C. Wang, X. Gu, Y . Dong, J. Tang, H. Wang, and M. Huang, “LongSafety: Evaluating Long-Context Safety of Large Language Models,” 2025. [Online]. Available: https://arxiv.org/abs/2502.16971

  75. [76]

    Unsolved Problems in ML Safety

    D. Hendrycks, N. Carlini, J. Schulman, and J. Steinhardt, “Unsolved Problems in ML Safety,”arXiv preprint arXiv:2109.13916, 2021

  76. [77]

    L. A. Suchman,Plans and Situated Actions: The Problem of Human- Machine Communication. Cambridge, UK: Cambridge University Press, 1987

  77. [78]

    Dourish,Where the Action Is: The Foundations of Embodied Inter- action

    P. Dourish,Where the Action Is: The Foundations of Embodied Inter- action. Cambridge, MA: MIT Press, 2001

  78. [79]

    How Should My Chatbot Interact? A Survey on Human-Chatbot Interaction Design,

    A. P. Chaves and M. A. Gerosa, “How Should My Chatbot Interact? A Survey on Human-Chatbot Interaction Design,”International Journal of Human-Computer Interaction, vol. 37, no. 8, pp. 729–758, 2021

  79. [80]

    AI-Mediated Communica- tion: How the Perception that Profile Text was Written by AI Affects Trustworthiness,

    M. Jakesch, J. Hancock, and M. Naaman, “AI-Mediated Communica- tion: How the Perception that Profile Text was Written by AI Affects Trustworthiness,” pp. 1–13, 2019

  80. [81]

    The Values Encoded in Machine Learning Research,

    A. Birhane, P. Kalluri, D. Card, W. Agnew, R. Dotan, and M. Bao, “The Values Encoded in Machine Learning Research,”Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 173–184, 2022

Showing first 80 references.