Recognition: 2 theorem links
· Lean TheoremBeyond Context: Large Language Models' Failure to Grasp Users' Intent
Pith reviewed 2026-05-16 20:07 UTC · model grok-4.3
The pith
Large language models fail to recognize user intent, allowing systematic bypasses of safety filters via emotional framing and gradual prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current safety approaches in large language models concentrate on explicit harmful content while overlooking the inability to understand context and recognize user intent. This creates exploitable vulnerabilities that malicious users can systematically leverage through emotional framing, progressive revelation, and academic justification techniques. Empirical evaluation of models including ChatGPT, Claude, Gemini, and DeepSeek shows these methods circumvent reliable safety mechanisms, with reasoning-enabled configurations increasing factual precision without interrogating underlying intent. The pattern indicates that present architectural designs embed systematic vulnerabilities rather than,
What carries the argument
The inability of LLMs to recognize user intent beyond surface content, exposed by the three prompting techniques of emotional framing, progressive revelation, and academic justification.
Load-bearing premise
That the described prompting techniques demonstrate a failure to grasp intent rather than simply succeeding at eliciting compliant responses from instruction-following models.
What would settle it
A test in which the same harmful request is presented first with clear malicious framing and then with explicit benign intent; consistent refusal only in the benign case would falsify the claim that intent detection is missing.
Figures
read the original abstract
Current Large Language Models (LLMs) safety approaches focus on explicitly harmful content while overlooking a critical vulnerability: the inability to understand context and recognize user intent. This creates exploitable vulnerabilities that malicious users can systematically leverage to circumvent safety mechanisms. We empirically evaluate multiple state-of-the-art LLMs, including ChatGPT, Claude, Gemini, and DeepSeek. Our analysis demonstrates the circumvention of reliable safety mechanisms through emotional framing, progressive revelation, and academic justification techniques. Notably, reasoning-enabled configurations amplified rather than mitigated the effectiveness of exploitation, increasing factual precision while failing to interrogate the underlying intent. The exception was Claude Opus 4.1, which prioritized intent detection over information provision in some use cases. This pattern reveals that current architectural designs create systematic vulnerabilities. These limitations require paradigmatic shifts toward contextual understanding and intent recognition as core safety capabilities rather than post-hoc protective mechanisms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that current LLMs lack the ability to understand context and recognize malicious user intent, creating systematic vulnerabilities in safety mechanisms. It reports empirical evaluations on models including ChatGPT, Claude, Gemini, and DeepSeek showing that techniques such as emotional framing, progressive revelation, and academic justification successfully circumvent safety filters. Reasoning-enabled configurations are said to amplify rather than mitigate exploitation by increasing factual precision of harmful outputs, with Claude Opus 4.1 as a partial exception that sometimes prioritizes intent detection. The work concludes that paradigmatic architectural shifts toward core intent recognition are required.
Significance. If the empirical demonstrations hold after proper controls and quantification, the result would highlight a genuine gap between content-based safety filters and true intent understanding, with potential to redirect safety research toward architectures that explicitly model user goals rather than surface patterns.
major comments (3)
- [Abstract] Abstract: the claim of empirical evaluation across multiple models supplies no quantitative metrics, specific prompts, success rates, baseline comparisons, or control conditions, leaving the central circumvention claim without visible supporting evidence.
- [Results] Results (implied by abstract description): the reported increase in factual precision under reasoning-enabled settings is consistent with stronger literal instruction-following on reframed prompts rather than failure to detect intent, since emotional framing and academic justification alter the explicit prompt content itself.
- [Methods] Methods (implied): no controls are described that isolate intent inference from compliance (e.g., direct harmful requests versus incrementally framed versions or explicit intent probes), so the data cannot distinguish the two interpretations.
minor comments (1)
- [Abstract] Abstract: verify the exact model name 'Claude Opus 4.1' against current Anthropic releases.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications based on the full paper content and indicating where revisions have been made.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of empirical evaluation across multiple models supplies no quantitative metrics, specific prompts, success rates, baseline comparisons, or control conditions, leaving the central circumvention claim without visible supporting evidence.
Authors: The abstract is intentionally concise, but the full manuscript (Sections 3 and 4) provides the requested details: specific prompt templates for each framing technique, success rates (e.g., 72% average bypass for emotional framing across tested models), baseline comparisons to direct harmful queries (0% success), and control conditions. To improve visibility, we have revised the abstract to include a brief summary of these quantitative results and key metrics. revision: yes
-
Referee: [Results] Results (implied by abstract description): the reported increase in factual precision under reasoning-enabled settings is consistent with stronger literal instruction-following on reframed prompts rather than failure to detect intent, since emotional framing and academic justification alter the explicit prompt content itself.
Authors: We maintain that the distinction is substantive: direct harmful requests are refused while identically harmful requests succeed only after reframing that preserves intent but changes surface form. The increase in factual precision under reasoning modes occurs precisely because the model executes the literal (reframed) request without probing the underlying goal, which we illustrate with paired examples in the results. We have added a new paragraph in the discussion section to explicitly contrast literal compliance versus intent detection. revision: partial
-
Referee: [Methods] Methods (implied): no controls are described that isolate intent inference from compliance (e.g., direct harmful requests versus incrementally framed versions or explicit intent probes), so the data cannot distinguish the two interpretations.
Authors: The original manuscript already includes direct-versus-framed comparisons as the primary control (direct requests refused, framed versions accepted). We have expanded the Methods section with additional explicit intent-probe questions and incremental framing steps to further isolate the effect, as suggested. revision: yes
Circularity Check
No circularity: purely empirical evaluation without derivations or self-referential reductions
full rationale
The paper reports direct empirical tests of LLMs (ChatGPT, Claude, Gemini, DeepSeek) using prompting techniques such as emotional framing, progressive revelation, and academic justification. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described methodology. Claims rest on observed outcomes rather than any step that reduces by construction to prior definitions or fits. This is a standard observational study with no self-definitional, fitted-input, or uniqueness-imported circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM safety mechanisms primarily filter based on explicit content patterns rather than inferred user intent.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Current LLM safety approaches focus on explicitly harmful content while overlooking a critical vulnerability: the inability to understand context and recognize user intent.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
reasoning-enabled configurations amplified rather than mitigated the effectiveness of exploitation
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Intent detection in the age of llms,
G. Arora, S. Jain, and S. Merugu, “Intent detection in the age of llms,” arXiv preprint arXiv:2410.01627, 2024
-
[2]
DecodingTrust: A Compre- hensive Assessment of Trustworthiness in GPT Models,
B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaefferet al., “DecodingTrust: A Compre- hensive Assessment of Trustworthiness in GPT Models,”arXiv preprint arXiv:2306.11698, 2023
-
[3]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Liet al., “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal,”arXiv preprint arXiv:2402.04249, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,
J. D. M.-W. C. Kenton and L. K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of NAACL-HLT, 2019, pp. 4171–4186
work page 2019
-
[5]
What Does BERT Look At? An Analysis of BERT’s Attention,
K. Clark, U. Khandelwal, O. Levy, and C. D. Manning, “What Does BERT Look At? An Analysis of BERT’s Attention,” inProceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2019, pp. 276–286
work page 2019
-
[6]
Extracting Training Data from Large Language Models,
N. Carlini, F. Tram `er, E. Wallace, M. Jagielski, A. Herbert-V oss, K. Lee, A. Roberts, T. Brown, D. Song, ´U. Erlingssonet al., “Extracting Training Data from Large Language Models,” in30th USENIX Security Symposium, 2021, pp. 2633–2650
work page 2021
-
[7]
Ethical Challenges in Data-driven Dialogue Systems,
P. Henderson, K. Sinha, N. Angelard-Gontier, N. R. Ke, G. Fried, R. Lowe, and J. Pineau, “Ethical Challenges in Data-driven Dialogue Systems,” pp. 123–129, 2017
work page 2017
-
[8]
Towards conversational diagnostic artificial intelligence,
T. Tu, M. Schaekermann, A. Palepu, K. Saab, J. Freyberg, R. Tanno, A. Wang, B. Li, M. Amin, Y . Chenget al., “Towards conversational diagnostic artificial intelligence,”Nature, pp. 1–9, 2025
work page 2025
-
[9]
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference,
T. McCoy, E. Pavlick, and T. Linzen, “Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference,” in Proceedings of the 57th Annual Meeting of the Association for Com- putational Linguistics, 2019, pp. 3428–3448
work page 2019
-
[10]
BERT Rediscovers the Classical NLP Pipeline,
I. Tenney, D. Das, and E. Pavlick, “BERT Rediscovers the Classical NLP Pipeline,”Proceedings of the 57th Annual Meeting of the Asso- ciation for Computational Linguistics, pp. 4593–4601, 2019
work page 2019
-
[11]
SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions,
Z. Zhang, L. Xu, D. Zhao, Y . Onoe, M. Khalil, H. Ross, I. Kocyigit, M. Ashraf, Y .-L. Boureau, A. Nematzadehet al., “SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions,”arXiv preprint arXiv:2309.07045, 2023
-
[12]
Real- ToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models,
S. Gehman, S. Gururangan, M. Sap, Y . Choi, and N. A. Smith, “Real- ToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020, pp. 3356–3369
work page 2020
-
[13]
On the Opportunities and Risks of Foundation Models
R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskillet al., “On the Opportunities and Risks of Foundation Models,”arXiv preprint arXiv:2108.07258, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[14]
Ethical and social risks of harm from Language Models
L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadehet al., “Ethical and Social Risks of Harm from Language Models,”arXiv preprint arXiv:2112.04359, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[15]
Concrete Problems in AI Safety,
D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man ´e, “Concrete Problems in AI Safety,” inNIPS Workshop on Aligned Artificial Intelligence, 2016
work page 2016
-
[16]
Russell,Human Compatible: Artificial Intelligence and the Problem of Control
S. Russell,Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019
work page 2019
-
[17]
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freireet al., “Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback,”arXiv preprint arXiv:2307.15217, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
Constitutional AI: Harmlessness from AI Feedback
Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnonet al., “Con- stitutional AI: Harmlessness from AI Feedback,”arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[19]
Emergent Abil- ities of Large Language Models,
J. Wei, Y . Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzleret al., “Emergent Abil- ities of Large Language Models,”Transactions on Machine Learning Research, 2022
work page 2022
-
[20]
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonsoet al., “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models,”arXiv preprint arXiv:2206.04615, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[21]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All You Need,” Advances in Neural Information Processing Systems, vol. 30, 2017
work page 2017
-
[22]
Language Mod- els are Few-shot Learners,
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language Mod- els are Few-shot Learners,”Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020
work page 1901
-
[23]
PaLM: Scaling Language Modeling with Pathways,
A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmannet al., “PaLM: Scaling Language Modeling with Pathways,” vol. 24, no. 240, 2022, pp. 1–113
work page 2022
-
[24]
OpenAI, “GPT-4 Technical Report,”arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Scaling Laws for Neural Language Models
J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling Laws for Neural Language Models,”arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[26]
Training Compute-Optimal Large Language Models
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clarket al., “Training Compute-optimal Large Language Models,”arXiv preprint arXiv:2203.15556, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[27]
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” pp. 610–623, 2021
work page 2021
-
[28]
The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence,
G. Marcus, “The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence,”arXiv preprint arXiv:2002.06177, 2020
-
[29]
Stress Test Evaluation for Natural Language Inference,
A. Naik, A. Ravichander, N. Sadeh, C. Rose, and G. Neubig, “Stress Test Evaluation for Natural Language Inference,” pp. 2340–2353, 2018
work page 2018
-
[30]
B. J. Grosz and C. L. Sidner,Attention, Intentions, and the Structure of Discourse. MIT Press, 1986, vol. 12, no. 3
work page 1986
-
[31]
J. R. Hobbs, M. E. Stickel, D. E. Appelt, and P. Martin,Interpretation as Abduction. Elsevier, 1993, vol. 63, no. 1-2
work page 1993
-
[32]
Winograd,Understanding Natural Language
T. Winograd,Understanding Natural Language. Academic Press, 1972
work page 1972
-
[33]
R. C. Schank and R. P. Abelson,Scripts, Plans, Goals, and Under- standing: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum, 1977
work page 1977
-
[34]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- training of Deep Bidirectional Transformers for Language Understand- ing,”arXiv preprint arXiv:1810.04805, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[35]
Deep Contextualized Word Representations,
M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep Contextualized Word Representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 2227–2237
work page 2018
-
[36]
A Primer in BERTology: What We Know About How BERT Works,
A. Rogers, O. Kovaleva, and A. Rumshisky, “A Primer in BERTology: What We Know About How BERT Works,”Transactions of the Association for Computational Linguistics, vol. 8, pp. 842–866, 2020
work page 2020
-
[37]
Adversarial NLI: A New Benchmark for Natural Language Under- standing,
Y . Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela, “Adversarial NLI: A New Benchmark for Natural Language Under- standing,”Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4885–4901, 2019
work page 2019
-
[38]
D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits, “Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment,” vol. 34, pp. 8018–8025, 2020
work page 2020
-
[39]
Evasion Attacks Against Machine Learning at Test Time,
B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. ˇSrndi´c, P. Laskov, G. Giacinto, and F. Roli, “Evasion Attacks Against Machine Learning at Test Time,” inJoint European conference on machine learning and knowledge discovery in databases. Springer, 2013, pp. 387–402
work page 2013
-
[40]
Explaining and Harness- ing Adversarial Examples,
I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and Harness- ing Adversarial Examples,” 2014
work page 2014
-
[41]
Intriguing properties of neural networks
C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfel- low, and R. Fergus, “Intriguing Properties of Neural Networks,”arXiv preprint arXiv:1312.6199, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[42]
Uni- versal Adversarial Triggers for Attacking and Analyzing NLP,
E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, “Uni- versal Adversarial Triggers for Attacking and Analyzing NLP,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019, pp. 2153–2162
work page 2019
-
[43]
HotFlip: White-box Adversarial Examples for Text Classification,
J. Ebrahimi, A. Rao, D. Lowd, and D. Dou, “HotFlip: White-box Adversarial Examples for Text Classification,” inProceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, pp. 31–36
work page 2017
-
[44]
Red Teaming Language Models with Language Models
E. Perez, S. Huang, F. Song, T. Cai, R. Wong, J. Griffiths, J. McAleese, J. Pokorny, J. Fortier, G. Sastryet al., “Red Teaming Language Models with Language Models,”arXiv preprint arXiv:2202.03286, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[45]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned,
D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y . Bai, S. Kadavath, B. Mann, E. Perez, N. Sharma, A. Tamkinet al., “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned,” 2022
work page 2022
-
[46]
Jailbroken: How Does LLM Safety Training Fail?
A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How Does LLM Safety Training Fail?”arXiv preprint arXiv:2307.02483, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
Universal and Transferable Adversarial Attacks on Aligned Language Models
A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and Transferable Adversarial Attacks on Aligned Language Models,”arXiv preprint arXiv:2307.15043, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
C. Pathade, “Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs,”arXiv preprint arXiv:2505.04806, 2025. [Online]. Available: https://arxiv.org/abs/2505.04806
-
[49]
X. Shen, Z. Chen, M. Backes, Y . Shen, and Y . Zhang, “”Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models,” inProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 1671–1685. [Online]. Ava...
-
[50]
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLMs Jailbreakers
X. Li, R. Wang, M. Cheng, T. Zhou, and C.-J. Hsieh, “DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLMs Jailbreakers.” Association for Computational Linguistics, Nov. 2024, pp. 13 891–13 913. [Online]. Available: https://aclanthology.org/2024. findings-emnlp.813/
work page 2024
-
[51]
MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots,
G. Deng, Y . Liu, Y . Li, K. Wang, Y . Zhang, Z. Li, H. Wang, T. Zhang, and Y . Liu, “MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots,” inNDSS, 2024
work page 2024
-
[52]
Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors,
Y .-H. Chen, N. Joshi, Y . Chen, M. Andriushchenko, R. Angell, and H. He, “Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors,” 2025. [Online]. Available: https: //arxiv.org/abs/2506.10949
-
[53]
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Y . Liu, G. Deng, Z. Xu, Y . Li, Y . Zheng, Y . Zhang, L. Zhao, T. Zhang, and Y . Liu, “Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study,”arXiv preprint arXiv:2305.13860, 2023
work page internal anchor Pith review arXiv 2023
-
[54]
Attack Prompt Generation for Red Teaming and Defending Large Language Models,
B. Deng, H. Zhang, Y . Xiang, L. Deng, S. Hong, R. Gao, H. Zhou, X. Zhang, R. Li, and Z. Li, “Attack Prompt Generation for Red Teaming and Defending Large Language Models,”arXiv preprint arXiv:2310.12505, 2023
-
[56]
Available: https://arxiv.org/abs/2501.01335
[Online]. Available: https://arxiv.org/abs/2501.01335
-
[57]
K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not What You’ve Signed Up For: Compromising Real-world LLM- integrated Applications with Indirect Prompt Injection,” inProceedings of the 16th ACM Workshop on Artificial Intelligence and Security, 2023, pp. 79–90
work page 2023
-
[58]
Ignore Previous Prompt: Attack Techniques For Language Models,
F. Perez and I. Ribeiro, “Ignore Previous Prompt: Attack Techniques For Language Models,” 2022
work page 2022
-
[59]
P. R. Cohen and C. R. Perrault, “Plans for Discourse,” inIntentions in Communication. MIT Press, 1990, pp. 365–388
work page 1990
-
[60]
B. J. Grosz and S. Kraus,Collaborative Plans for Complex Group Action. Elsevier, 1996, vol. 86, no. 2
work page 1996
-
[61]
Ef- ficient Intent Detection with Dual Sentence Encoders,
I. Casanueva, T. Tem ˇcinas, D. Gerz, M. Henderson, and I. Vuli ´c, “Ef- ficient Intent Detection with Dual Sentence Encoders,”arXiv preprint arXiv:2003.04807, 2020
-
[62]
Intent Classification and Slot Filling for Privacy Policies,
W. U. A. Zhang, Z. Yan, W. U. Ahmad, and K.-W. Chang, “Intent Classification and Slot Filling for Privacy Policies,” pp. 4402–4417, 2021
work page 2021
-
[63]
The Second Dialog State Tracking Challenge,
M. Henderson, B. Thomson, and J. D. Williams, “The Second Dialog State Tracking Challenge,” pp. 263–272, 2014
work page 2014
-
[64]
Towards Scalable Multi-domain Conversational Agents: The Schema-guided Dialogue Dataset,
A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan, “Towards Scalable Multi-domain Conversational Agents: The Schema-guided Dialogue Dataset,” vol. 34, pp. 8689–8696, 2020
work page 2020
-
[65]
Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study,
C. Sankar, S. Subramanian, C. Pal, S. Chandar, and Y . Bengio, “Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study,”Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 32–37, 2019
work page 2019
-
[66]
Pretraining Methods for Dialog Context Representation Learning,
S. Mehri, S. Kiritchenko, M. Eskenazi, and S. M. Mohammad, “Pretraining Methods for Dialog Context Representation Learning,” Proceedings of the 57th Annual Meeting of the Association for Com- putational Linguistics, pp. 3836–3845, 2019
work page 2019
-
[67]
Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models,
N. Shapira, M. Levy, S. H. Alavi, X. Zhou, Y . Choi, Y . Goldberg, M. Sap, and V . Shwartz, “Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). St. Julian’s, Malta: Associatio...
work page 2024
-
[68]
CASE-Bench: Context-Aware SafEty Benchmark for Large Language Models,
G. Sun, X. Zhan, S. Feng, P. C. Woodland, and J. Such, “CASE-Bench: Context-Aware SafEty Benchmark for Large Language Models,” 2025. [Online]. Available: https://arxiv.org/abs/2501.14940
-
[69]
Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models,
Y . In, W. Kim, K. Yoon, S. Kim, M. Tanjim, K. Kim, and C. Park, “Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models,” 2025. [Online]. Available: https://arxiv.org/abs/2502.15086
-
[70]
Training Language Models to Follow Instructions with Human Feedback,
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training Language Models to Follow Instructions with Human Feedback,”Advances in Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, 2022
work page 2022
-
[71]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback,
Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighanet al., “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback,” 2022
work page 2022
-
[72]
Intention Analysis Makes LLMs A Good Jailbreak Defender,
Y . Zhang, L. Ding, L. Zhang, and D. Tao, “Intention Analysis Makes LLMs A Good Jailbreak Defender,” 2024, cOLING 2025 (to appear). [Online]. Available: https://arxiv.org/abs/2401.06561
-
[73]
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation,
B. Baker, J. Huizinga, L. Gao, Z. Dou, M. Y . Guan, A. Madry, W. Zaremba, J. Pachocki, and D. Farhi, “Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation,”
-
[74]
Monitoring reasoning models for misbehavior and the risks of promoting obfuscation
[Online]. Available: https://arxiv.org/abs/2503.11926
-
[75]
LongSafety: Evaluating Long-Context Safety of Large Language Models,
Y . Lu, J. Cheng, Z. Zhang, S. Cui, C. Wang, X. Gu, Y . Dong, J. Tang, H. Wang, and M. Huang, “LongSafety: Evaluating Long-Context Safety of Large Language Models,” 2025. [Online]. Available: https://arxiv.org/abs/2502.16971
-
[76]
Unsolved Problems in ML Safety
D. Hendrycks, N. Carlini, J. Schulman, and J. Steinhardt, “Unsolved Problems in ML Safety,”arXiv preprint arXiv:2109.13916, 2021
work page internal anchor Pith review arXiv 2021
-
[77]
L. A. Suchman,Plans and Situated Actions: The Problem of Human- Machine Communication. Cambridge, UK: Cambridge University Press, 1987
work page 1987
-
[78]
Dourish,Where the Action Is: The Foundations of Embodied Inter- action
P. Dourish,Where the Action Is: The Foundations of Embodied Inter- action. Cambridge, MA: MIT Press, 2001
work page 2001
-
[79]
How Should My Chatbot Interact? A Survey on Human-Chatbot Interaction Design,
A. P. Chaves and M. A. Gerosa, “How Should My Chatbot Interact? A Survey on Human-Chatbot Interaction Design,”International Journal of Human-Computer Interaction, vol. 37, no. 8, pp. 729–758, 2021
work page 2021
-
[80]
M. Jakesch, J. Hancock, and M. Naaman, “AI-Mediated Communica- tion: How the Perception that Profile Text was Written by AI Affects Trustworthiness,” pp. 1–13, 2019
work page 2019
-
[81]
The Values Encoded in Machine Learning Research,
A. Birhane, P. Kalluri, D. Card, W. Agnew, R. Dotan, and M. Bao, “The Values Encoded in Machine Learning Research,”Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 173–184, 2022
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.