pith. sign in

arxiv: 2606.29602 · v1 · pith:O2SFQBAKnew · submitted 2026-06-28 · 💻 cs.CR

An Empirical Evaluation of Prompt Injection Vulnerabilities in Large Language Models Across Multilingual and Obfuscated Attack Scenarios

Pith reviewed 2026-06-30 06:49 UTC · model grok-4.3

classification 💻 cs.CR
keywords prompt injectionLLM securitymultilingual attacksobfuscated promptsmalicious complianceAI safetycybersecurity
0
0 comments X

The pith

LLMs generate phishing and malware at high rates when given prompt injections, especially in non-English languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests six state-of-the-art LLMs on direct and multi-stage obfuscated prompt injections across languages and encodings to measure how often they produce malicious outputs like phishing emails, deceptive websites, and malware. It finds that even simple injections succeed frequently, that more elaborate prompts raise compliance further, and that non-English languages consistently produce higher rates of harmful responses than English. These patterns appear across all tested models, with some showing particular weakness to complex instructions. The work matters because it shows current safety alignments leave exploitable gaps in real security-sensitive uses.

Core claim

Even direct prompt injections frequently induce the generation of phishing content, websites, and malware, while elaborate prompts achieve even higher malicious compliance rates, particularly for phishing. Models such as DeepSeek, Gemini, and Grok show especially high susceptibility under complex instructions. Notably, non-English languages consistently exhibit higher compliance rates than English. Although simple character encodings reduce malicious outputs, they do not eliminate them.

What carries the argument

Empirical framework that measures malicious compliance rates under direct and multi-stage obfuscated attacks across multiple languages and character encodings.

Load-bearing premise

The selected attack prompts and evaluation criteria for malicious compliance are representative of real threats and free from selection bias or inconsistent human judgment in labeling outputs.

What would settle it

Re-testing the same models and prompts with an independent set of human labelers or automated classifiers that records substantially lower compliance rates would falsify the claim of systematic vulnerabilities.

Figures

Figures reproduced from arXiv: 2606.29602 by Baturay Birinci, Caglar Uysal, Or\c{c}un \c{C}etin, S\"uha Orhun Mutluergil.

Figure 1
Figure 1. Figure 1: Full compliance rates across prompt categories. Elaborate, contextu [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
read the original abstract

Large Language Models (LLMs) have rapidly evolved, transforming industries by automating complex tasks and generating human-like content. However, as their adoption accelerates, prompt injection vulnerabilities have become increasingly apparent. Malicious actors exploit these weaknesses to generate phishing emails, deceptive websites, nd malware, posing serious security risks. This paper presents an empirical evaluation of six state-of-the-art LLMs (DeepSeek, GPT, Gemini, Grok, Llama, and Qwen) under diverse adversarial prompt scenarios, including direct and multi-stage obfuscated attacks across multiple languages and character encodings. The proposed framework measures how effectively current LLMs resist manipulation into performing harmful actions. Our findings reveal systematic vulnerabilities across all tested models. Even direct prompt injections frequently induce the generation of phishing content, websites, and malware, while elaborate prompts achieve even higher malicious compliance rates, particularly for phishing. Models such as DeepSeek, Gemini, and Grok show especially high susceptibility under complex instructions. Notably, non-English languages consistently exhibit higher compliance rates than English, exposing significant gaps in multilingual safety alignment. Although simple character encodings reduce malicious outputs, they do not eliminate them. These results highlight persistent challenges in LLM safety and underscore the urgent need for stronger defenses and improved security mechanisms to support the ethical and secure deployment of LLMs in cybersecurity sensitive contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper empirically evaluates prompt injection attacks on six LLMs (DeepSeek, GPT, Gemini, Grok, Llama, Qwen) using direct, multi-stage, and obfuscated prompts across multiple languages and encodings. It reports high rates of malicious compliance (generation of phishing, malware, etc.), with higher rates for elaborate prompts, non-English languages, and certain models, concluding that current safety alignments have significant multilingual gaps.

Significance. If the quantitative compliance rates are reliable, the work provides useful evidence of persistent prompt-injection weaknesses in production LLMs, particularly the language-dependent disparity, which could inform safety research and deployment guidelines. The multi-model, multi-language design is a strength, but the absence of reported labeling protocols limits the strength of the central empirical claims.

major comments (1)
  1. [Methods / Results] Methods / Results sections: The malicious-compliance rates that underpin all headline findings (higher non-English rates, model differences, elaborate-prompt effects) are produced by human labeling of outputs for phishing content, malware, etc. No explicit decision rubric, borderline-case examples, or inter-rater reliability statistics (e.g., Cohen’s κ) are supplied. This is load-bearing because every reported percentage depends on consistent classification; without it, observed differences could reflect annotator variability rather than model behavior.
minor comments (2)
  1. [Abstract] Abstract: Typo “nd” instead of “and” in the sentence describing malicious outputs.
  2. The paper should clarify whether the same set of attack prompts was used across all languages or whether prompts were translated/adapted, as this affects comparability of the language effect.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the labeling methodology. We address the concern directly below.

read point-by-point responses
  1. Referee: [Methods / Results] Methods / Results sections: The malicious-compliance rates that underpin all headline findings (higher non-English rates, model differences, elaborate-prompt effects) are produced by human labeling of outputs for phishing content, malware, etc. No explicit decision rubric, borderline-case examples, or inter-rater reliability statistics (e.g., Cohen’s κ) are supplied. This is load-bearing because every reported percentage depends on consistent classification; without it, observed differences could reflect annotator variability rather than model behavior.

    Authors: We agree that the absence of a detailed labeling protocol is a limitation in the current manuscript and weakens the strength of the empirical claims. In the revised version we will add a dedicated subsection in Methods that (1) presents the full decision rubric used to classify outputs as malicious compliance (with explicit criteria for phishing, malware, deceptive websites, and related categories), (2) includes several borderline-case examples together with the classification decision and rationale, and (3) reports inter-rater reliability statistics (Cohen’s κ or equivalent) if multiple annotators were employed; if labeling was performed by a single annotator we will explicitly note this and describe the consistency checks that were applied. These additions will be placed before the results tables so that readers can evaluate the reliability of the reported percentages. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement of external model outputs

full rationale

The paper reports an empirical evaluation of six LLMs under prompt injection scenarios, measuring malicious compliance rates via direct testing across languages and attack types. No equations, derivations, fitted parameters, or predictions appear in the abstract or described methods. Results derive from external observations of model generations classified by human judgment, with no self-referential definitions, self-citation load-bearing premises, or reductions of outputs to inputs by construction. The central claims rest on observable model behaviors rather than any internal chain that collapses to its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study. No free parameters, mathematical axioms, or invented entities are introduced; the work relies on standard adversarial testing of existing LLMs.

pith-pipeline@v0.9.1-grok · 5793 in / 1058 out tokens · 28090 ms · 2026-06-30T06:49:46.603206+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 10 canonical work pages · 2 internal anchors

  1. [1]

    Chevrolet dealership chatbot agrees to sell $76,000 tahoe for $1 after prompt injection,

    Business Insider, “Chevrolet dealership chatbot agrees to sell $76,000 tahoe for $1 after prompt injection,” Dec. 2023, accessed: 2026-01-27. [Online]. Available: https://www.businessinsider. com/car-dealership-chevrolet-chatbot-chatgpt-pranks-chevy-2023-12

  2. [2]

    Semantic analysis of phishing emails leading to ransomware with chatgpt,

    H. Fujima, K. Takeuchi, and T. Kumamoto, “Semantic analysis of phishing emails leading to ransomware with chatgpt,” 2023

  3. [3]

    Psychological tactics of phishing emails,

    P. Wang and P. Lutchkus, “Psychological tactics of phishing emails,” Issues in Information Systems, 2023

  4. [4]

    Securing against deception: Exploring phishing emails through chatgpt and sentiment analysis,

    S. Sayyafzadeh, M. Weatherspoon, J. Yan, and H. Chi, “Securing against deception: Exploring phishing emails through chatgpt and sentiment analysis,” in2024 IEEE/ACIS 22nd International Conference on Soft- ware Engineering Research, Management and Applications (SERA), 2024, pp. 159–165

  5. [5]

    Campesato,Chapter 8: ChatGPT and GPT-4

    O. Campesato,Chapter 8: ChatGPT and GPT-4. Berlin, Boston: Mercury Learning and Information, 2023, pp. 251–292. [Online]. Available: https://doi.org/10.1515/9781501518911-009

  6. [6]

    Apollo: A gpt-based tool to detect phishing emails and generate explanations that warn users,

    G. Desolda, F. Greco, and L. Vigano, “Apollo: A gpt-based tool to detect phishing emails and generate explanations that warn users,” Proc. ACM Hum.-Comput. Interact., vol. 9, no. 4, Jun. 2025. [Online]. Available: https://doi.org/10.1145/3733049

  7. [7]

    Perspective chapter: Ransomware,

    A. Warikoo, “Perspective chapter: Ransomware,” inMalware - Detection and Defense, E. Babulak, Ed. London: IntechOpen, 2023, ch. 5. [Online]. Available: https://doi.org/10.5772/intechopen.108433

  8. [8]

    A survey on ransomware malware and ransomware detection techniques,

    S. Yadav, N. Soni, L. K. P. Bhaiya, and V . K. Swarnkar, “A survey on ransomware malware and ransomware detection techniques,”International Journal for Research in Applied Science & Engineering Technology (IJRASET), vol. 10, no. Issue I, 2022. [Online]. Available: https://www.ijraset.com/best-journal/ survey-on-ransomware-malware-and-ransomware-detection-...

  9. [9]

    ” digital camouflage

    E. Böke and S. Torka, “” digital camouflage”: The llvm challenge in llm- based malware detection,”Journal of Systems and Software, p. 112646, 2025

  10. [10]

    Strengthening llm ecosystem security: Preventing mobile malware from manipulating llm- based applications,

    L. Huang, J. Xue, Y . Wang, J. Chen, and T. Lei, “Strengthening llm ecosystem security: Preventing mobile malware from manipulating llm- based applications,”Information Sciences, vol. 681, p. 120923, 2024

  11. [11]

    Circumventing key- loggers and screendumps,

    K. Sapra, B. Husain, R. Brooks, and M. Smith, “Circumventing key- loggers and screendumps,” in2013 8th International Conference on Malicious and Unwanted Software: "The Americas" (MALWARE), 2013, pp. 103–108

  12. [12]

    Comparative evaluation of approaches & tools for effective security testing of web applications,

    S. Qadir, E. Waheed, A. Khanum, and S. Jehan, “Comparative evaluation of approaches & tools for effective security testing of web applications,” PeerJ Computer Science, vol. 11, p. e2821, 2025

  13. [13]

    Evolution of application security based on owasp top 10 and cwe/sans top 25 with predictions for the 2025 owasp top 10,

    J. Li and H. Li, “Evolution of application security based on owasp top 10 and cwe/sans top 25 with predictions for the 2025 owasp top 10,” in 2025 International Conference on Inventive Computation Technologies (ICICT), 2025, pp. 1178–1183

  14. [14]

    Prompt injection detection in llm integrated applications,

    Q. Lan, A. Kaul, and S. Jones, “Prompt injection detection in llm integrated applications,” 2025

  15. [15]

    Mitigating adversarial manipulation in llms: a prompt- based approach to counter jailbreak attacks (prompt-g),

    B. Pingua, D. Murmu, M. Kandpal, J. Rautaray, P. Mishra, R. K. Barik, and M. J. Saikia, “Mitigating adversarial manipulation in llms: a prompt- based approach to counter jailbreak attacks (prompt-g),”PeerJ Computer Science, vol. 10, p. e2374, 2024

  16. [16]

    Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications,

    X. Suo, “Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications,”AIP Conference Proceedings, vol. 3194, no. 1, p. 040013, 12 2024. [Online]. Available: https://doi.org/10.1063/5.0222987

  17. [17]

    Dynamic moving target defense for mitigating targeted llm prompt injection,

    S. Panterino and M. Fellington, “Dynamic moving target defense for mitigating targeted llm prompt injection,”Authorea Preprints, 2024

  18. [18]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. Forsyth, and D. Hendrycks, “Harmbench: A standardized evaluation framework for automated red teaming and robust refusal,” 2024. [Online]. Available: https://arxiv.org/abs/2402.04249

  19. [19]

    Why should adversarial perturbations be imperceptible? rethink the research paradigm in adversarial nlp,

    Y . Chen, H. Gao, G. Cui, F. Qi, L. Huang, Z. Liu, and M. Sun, “Why should adversarial perturbations be imperceptible? rethink the research paradigm in adversarial nlp,” 2022. [Online]. Available: https://arxiv.org/abs/2210.10683

  20. [20]

    Exploring the cybercrime potential of llms: A focus on phishing and malware generation,

    O. Çetin, B. Birinci, Ç. Uysal, and B. Arief, “Exploring the cybercrime potential of llms: A focus on phishing and malware generation,” in European Interdisciplinary Cybersecurity Conference. Springer, 2025, pp. 98–115

  21. [21]

    Low-resource languages jailbreak gpt-4,

    Z.-X. Yong, C. Menghini, and S. H. Bach, “Low-resource languages jailbreak gpt-4,” 2024. [Online]. Available: https://arxiv.org/abs/2310. 02446

  22. [22]

    Multilingual jailbreak challenges in large language models,

    Y . Deng, W. Zhang, S. J. Pan, and L. Bing, “Multilingual jailbreak challenges in large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2310.06474

  23. [23]

    Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher,

    Y . Yuan, W. Jiao, W. Wang, J. tse Huang, P. He, S. Shi, and Z. Tu, “Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher,”

  24. [24]

    GPT-4 is too smart to be safe: Stealthy Chat with LLMs via Cipher

    [Online]. Available: https://arxiv.org/abs/2308.06463

  25. [25]

    A StrongREJECT for Empty Jailbreaks

    A. Souly, Q. Lu, D. Bowen, T. Trinh, E. Hsieh, S. Pandey, P. Abbeel, J. Svegliato, S. Emmons, O. Watkins, and S. Toyer, “A strongreject for empty jailbreaks,” 2024. [Online]. Available: https://arxiv.org/abs/2402.10260

  26. [26]

    System robustness against misuse,

    Z. Vintr and D. Valis, “System robustness against misuse,”WIT Trans- actions on The Built Environment, vol. 108, pp. 273–280, 2009

  27. [27]

    Bioinfo-bench: A simple benchmark framework for llm bioinformatics skills evaluation,

    Q. Chen and C. Deng, “Bioinfo-bench: A simple benchmark framework for llm bioinformatics skills evaluation,”bioRxiv, 2025. [Online]. Available: https://www.biorxiv.org/content/early/2025/01/29/2023.10.18. 563023

  28. [28]

    Black, Gloria Geng, Danny Park, James Zou, Andrew Y

    Y . Jiang, K. C. Black, G. Geng, D. Park, J. Zou, A. Y . Ng, and J. H. Chen, “Medagentbench: A realistic virtual ehr environment to benchmark medical llm agents,”arXiv preprint arXiv:2501.14654, 2025

  29. [29]

    Context-aware content moderation using trans- former models for detecting harmful digital content,

    S. Kant and S. Rana, “Context-aware content moderation using trans- former models for detecting harmful digital content,”International Journal of Research Science and Management, vol. 12, no. 4, pp. 1– 9, 2025

  30. [30]

    Ospc: Multimodal harmful content detection using fine-tuned language models,

    B. Cai, “Ospc: Multimodal harmful content detection using fine-tuned language models,” inCompanion Proceedings of the ACM Web Confer- ence 2024, 2024, pp. 1896–1899

  31. [31]

    Replicate run ai with an api,

    “Replicate run ai with an api,” https://replicate.com/, Replicate, 2026, accessed: 2026-01-26