pith. machine review for the scientific record.

arxiv: 2604.12548 · v1 · submitted 2026-04-14 · 💻 cs.CR

Recognition: unknown

DeepSeek Robustness Against Semantic-Character Dual-Space Mutated Prompt Injection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:50 UTC · model grok-4.3

classification 💻 cs.CR
keywords prompt injection · LLM robustness · adversarial attacks · mutation framework · semantic-character perturbations · DeepSeek evaluation · misuse success rate

The pith

Combined semantic and character mutations outperform single-space attacks on DeepSeek.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a mutation framework that alters prompts through both meaning-level changes, such as paraphrasing and word-order shifts, and character-level tricks, such as zero-width insertions and encoding alterations. It applies a hybrid search that mixes random exploration with local refinement to find prompts that trick the model into misuse. Experiments show this dual approach reaches a higher average misuse success rate than either semantic or character mutations used separately. The work matters because real-world attacks often blend these tactics, and single-dimension tests may underestimate the threat to models like DeepSeek. Results also indicate the dual method maintains good concealment while delivering competitive best-case query efficiency.
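The character-space half of this idea is easy to make concrete. The sketch below is an illustrative zero-width-insertion mutator, not the paper's implementation; the function name, insertion rate, and character set are invented here for exposition:

```python
import random

ZERO_WIDTH = ["\u200b", "\u200c", "\u200d"]  # zero-width space / non-joiner / joiner

def zero_width_insert(prompt, rate=0.15, seed=None):
    """Randomly insert invisible zero-width characters into a prompt.

    The mutated prompt renders identically on screen but differs at the
    codepoint level, which can slip past naive string-matching filters.
    """
    rng = random.Random(seed)
    out = []
    for ch in prompt:
        out.append(ch)
        if rng.random() < rate:
            out.append(rng.choice(ZERO_WIDTH))
    return "".join(out)

mutated = zero_width_insert("ignore prior instructions", rate=1.0, seed=0)
# stripping the zero-width codepoints recovers the original exactly
restored = mutated.translate({0x200B: None, 0x200C: None, 0x200D: None})
```

A stealth metric like the paper's would presumably reward mutations such as these, since the visible surface form of the prompt is unchanged.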

Core claim

PromptFuzz-SC integrates semantic transformations with character-level obfuscation in a unified mutation library and employs an epsilon-greedy plus hill-climbing hybrid search to generate adversarial prompts; on DeepSeek this dual-space strategy records the highest mean misuse success rate of 0.189, a peak of 0.375, and superior mean stealth, exceeding semantic-only mutation by 12.5 percent and character-only by 5.6 percent in mean success rate.

What carries the argument

PromptFuzz-SC, the semantic-character dual-space mutation framework that combines paraphrasing, word-order changes, zero-width insertions, and encoding mutations with a hybrid epsilon-greedy and hill-climbing search.
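A minimal sketch of what an epsilon-greedy plus hill-climbing loop over such operators could look like, assuming a scalar `score` stand-in for querying the target model; the paper's actual acceptance, selection, and termination rules are not specified here, so this is one plausible reading rather than the authors' algorithm:

```python
import random

def hybrid_search(seed_prompt, operators, score, budget=50, epsilon=0.3, rng=None):
    """Epsilon-greedy exploration plus hill-climbing refinement (sketch).

    operators: list of mutation functions str -> str.
    score: stand-in for querying the target model (higher = closer to misuse).
    """
    rng = rng or random.Random(0)
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(budget):
        if rng.random() < epsilon:
            # explore: apply one randomly chosen operator
            candidate = rng.choice(operators)(best)
        else:
            # exploit: evaluate all one-step neighbors, keep the best
            candidate = max((op(best) for op in operators), key=score)
        cand_score = score(candidate)
        if cand_score > best_score:  # greedy (hill-climbing) acceptance
            best, best_score = candidate, cand_score
    return best, best_score

# toy demo: operators that append a character, score = prompt length
ops = [lambda s: s + "!", lambda s: s + "?"]
best, sc = hybrid_search("seed", ops, len, budget=10)
```

The exploration term keeps the search from stalling on a local optimum of the scoring surface, which is the usual motivation for mixing the two strategies.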

If this is right

  • Dual-space mutation produces higher mean and peak misuse success rates than isolated semantic or character attacks.
  • The approach achieves a more favorable balance between attack effectiveness and concealment than single-dimension baselines.
  • Composite mutation strategies are required for thorough red-teaming of LLM prompt injection defenses.
  • Defense mechanisms should address perturbations across both semantic and character spaces rather than one dimension alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-space library could expose comparable weaknesses when tested on additional open-source or closed models not included in the study.
  • Detection systems might gain coverage by monitoring for simultaneous semantic rewrites and character encodings in incoming prompts.
  • Extending the mutation operator set to include syntactic or structural changes could further increase attack potency.
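The detection bullet above can be made concrete with a toy filter. Everything below is a hypothetical heuristic sketched by this review, not a mechanism from the paper; a real detector would pair these character-space signals with a semantic-space check (e.g. similarity of a canonicalized paraphrase to known attack prompts):

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def suspicious_signals(prompt):
    """Cheap character-space heuristics a prompt filter might log.

    Flags invisible zero-width characters, counts Unicode format
    (category Cf) codepoints, and measures the non-ASCII ratio.
    """
    return {
        "zero_width": any(ch in ZERO_WIDTH for ch in prompt),
        "format_chars": sum(unicodedata.category(ch) == "Cf" for ch in prompt),
        "non_ascii_ratio": sum(ord(ch) > 127 for ch in prompt) / max(len(prompt), 1),
    }
```

For example, `suspicious_signals("hel\u200blo")` flags the hidden zero-width space while the visible text looks benign.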

Load-bearing premise

The chosen mutation operators and hybrid search produce attacks that generalize beyond the specific DeepSeek model and the three metrics without post-hoc selection of successful cases.

What would settle it

Applying the identical dual-space mutation library and search strategy to a different LLM such as GPT-4 or Llama and measuring whether the misuse success rate still exceeds both single-space baselines.

Figures

Figures reproduced from arXiv: 2604.12548 by Junyu Ren, Philip S. Yu, Wensheng Gan, Xingjian Pan.

Figure 1. Overall architecture of the PromptFuzz-SC framework.
Figure 2. Performance dynamics of single semantic-space attacks across query …
Figure 4. Performance dynamics of dual-space cooperative attacks across query …
Figure 5. Three-dimensional comparative performance of semantic-only, …
Figure 6. Parameter sensitivity comparison across mutation strategies: (a) overall …
Original abstract

Prompt injection has emerged as a critical security threat to large language models (LLMs), yet existing studies predominantly focus on single-dimensional attack strategies, such as semantic rewriting or character-level obfuscation, which fail to capture the combined effects of multi-space perturbations in realistic scenarios. In addition, systematic black-box robustness evaluations of recent Chinese LLMs, such as DeepSeek, remain limited. To address these gaps, we propose PromptFuzz-SC, a semantic-character dual-space mutation framework for evaluating LLM robustness against prompt injection. The framework integrates semantic transformations (e.g., paraphrasing and word-order perturbation) with character-level obfuscation (e.g., zero-width insertion and encoding-based mutation), forming a unified and extensible mutation operator library. A hybrid search strategy combining epsilon-greedy exploration and hill-climbing refinement is adopted to efficiently discover high-quality adversarial prompts. We further introduce a unified evaluation protocol based on three metrics: misuse success rate (MSR), Average Queries to Success (AQS), and Stealth. Experimental results on DeepSeek demonstrate that dual-space mutation achieves the strongest overall attack performance among the evaluated strategies, attaining the highest mean MSR (0.189), peak MSR (0.375), and mean Stealth. Compared with semantic-only and character-only mutation, it improves mean MSR by 12.5% and 5.6%, respectively. While not consistently minimizing query cost, the proposed method achieves competitive best-case efficiency and maintains strong imperceptibility, indicating a more favorable balance between attack effectiveness and concealment. These findings highlight the importance of composite mutation strategies for robust red-teaming of LLMs and provide practical insights for the design of multi-layer defense mechanisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PromptFuzz-SC, a semantic-character dual-space mutation framework for prompt injection attacks on LLMs. It combines semantic transformations (paraphrasing, word-order perturbation) with character-level obfuscations (zero-width insertion, encoding mutations) into an extensible operator library, employs a hybrid epsilon-greedy plus hill-climbing search, and evaluates on DeepSeek using misuse success rate (MSR), average queries to success (AQS), and stealth. The central empirical claim is that dual-space mutation outperforms semantic-only and character-only baselines, achieving mean MSR 0.189 (12.5% and 5.6% higher, respectively), peak MSR 0.375, and best mean stealth.

Significance. If the reported performance ordering is shown to be statistically reliable, the work would usefully demonstrate that composite multi-space perturbations can improve attack effectiveness and imperceptibility for red-teaming, especially for under-evaluated Chinese LLMs such as DeepSeek. The extensible mutation library and unified three-metric protocol are practical contributions that could support future defense design.

major comments (2)
  1. [Abstract] The headline MSR figures (mean 0.189, peak 0.375, 12.5% and 5.6% relative gains) are presented without any accompanying information on the number of trials, per-condition standard deviations, confidence intervals, or statistical hypothesis tests. Because the absolute differences are modest, these details are required to establish that the claimed superiority of dual-space mutation is distinguishable from sampling noise.
  2. [Experimental results] The definition of 'success' used to compute MSR, the total number of base prompts, and the precise criteria for selecting or terminating the hybrid search are not stated. These omissions make it impossible to assess reproducibility or to determine whether the reported 'more favorable balance' between effectiveness and concealment generalizes beyond the specific DeepSeek runs shown.
minor comments (2)
  1. [Abstract] The abstract states that the method 'maintains strong imperceptibility' but does not clarify how the Stealth metric is normalized or whether it is averaged over successful attacks only.
  2. [Methodology] Notation for the mutation operators (e.g., how semantic and character mutations are combined in the dual-space operator) could be made more explicit with a small table or pseudocode example.
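As an illustration of the kind of pseudocode the second minor comment asks for, here is one plausible, purely hypothetical composition of a semantic and a character operator; the paper may interleave the two spaces differently:

```python
def compose(semantic_op, character_op):
    # dual-space operator: meaning-level rewrite first, then
    # character-level obfuscation of the rewritten prompt
    return lambda prompt: character_op(semantic_op(prompt))

# hypothetical stand-ins for real operators
swap_words = lambda s: " ".join(reversed(s.split()))  # semantic-space
zero_width = lambda s: "\u200b".join(s)               # character-space

dual = compose(swap_words, zero_width)
```

Even a table this small in the manuscript would pin down whether the dual-space operator is a fixed composition, a random interleaving, or something richer.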

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: [Abstract] The headline MSR figures (mean 0.189, peak 0.375, 12.5% and 5.6% relative gains) are presented without any accompanying information on the number of trials, per-condition standard deviations, confidence intervals, or statistical hypothesis tests. Because the absolute differences are modest, these details are required to establish that the claimed superiority of dual-space mutation is distinguishable from sampling noise.

    Authors: We agree that the abstract requires supporting statistical context given the modest gains. We will revise the abstract to state the number of trials performed and explicitly reference the standard deviations, confidence intervals, and hypothesis test results already computed and reported in the Experimental Results section. This will allow readers to evaluate the reliability of the dual-space superiority claim directly from the abstract while keeping it concise. revision: yes

  2. Referee: [Experimental results] The definition of 'success' used to compute MSR, the total number of base prompts, and the precise criteria for selecting or terminating the hybrid search are not stated. These omissions make it impossible to assess reproducibility or to determine whether the reported 'more favorable balance' between effectiveness and concealment generalizes beyond the specific DeepSeek runs shown.

    Authors: We acknowledge these omissions hinder reproducibility. In the revised Experimental Results section we will add: (1) the precise definition of success used to compute MSR, (2) the total number of base prompts, and (3) the exact termination and selection criteria for the epsilon-greedy plus hill-climbing hybrid search. These additions will be placed in the appropriate subsections so that the balance between effectiveness and stealth can be properly assessed. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of mutation strategies

Full rationale

The paper proposes the PromptFuzz-SC framework and reports experimental outcomes on DeepSeek using MSR, AQS, and Stealth metrics. No derivation chain, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described results. Performance claims (e.g., dual-space MSR of 0.189) are direct experimental measurements, not reduced to inputs by construction. Any self-citations are peripheral and non-load-bearing for the empirical ordering. This matches the default case of a self-contained empirical study with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the described mutation operators and search strategy are sufficient to discover strong attacks and that the three metrics adequately capture real-world threat. No free parameters, axioms, or invented entities are explicitly introduced beyond the framework name itself.

pith-pipeline@v0.9.0 · 5610 in / 1086 out tokens · 34127 ms · 2026-05-10T15:50:50.708563+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 13 canonical work pages · 1 internal anchor

  1. W. Gan, S. Wan, and P. S. Yu, "Model-as-a-service (MaaS): A survey," in IEEE International Conference on Big Data. IEEE, 2023, pp. 4636–4645.
  2. J. Wu, W. Gan, Z. Chen, S. Wan, and P. S. Yu, "Multimodal large language models: A survey," in IEEE International Conference on Big Data. IEEE, 2023, pp. 2247–2256.
  3. W. Gan, Z. Ning, Z. Qi, and P. S. Yu, "Mixture of experts (MoE): A big data perspective," Information Fusion, pp. 1–28, 2025.
  4. C. Wang, Y. Liu, B. Li, D. Zhang, Z. Li, and J. Fang, "Safety in large reasoning models: A survey," arXiv preprint arXiv:2504.17704, 2025.
  5. P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, "Jailbreaking black box large language models in twenty queries," in IEEE Conference on Secure and Trustworthy Machine Learning. IEEE, 2025, pp. 23–42.
  6. K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, "Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection," in The ACM Workshop on Artificial Intelligence and Security, 2023, pp. 79–90.
  7. A. Paulus, A. Zharmagambetov, C. Guo, B. Amos, and Y. Tian, "AdvPrompter: Fast adaptive adversarial prompting for LLMs," arXiv preprint arXiv:2404.16873, 2024.
  8. J. Steinhardt, P. W. W. Koh, and P. S. Liang, "Certified defenses for data poisoning attacks," Advances in Neural Information Processing Systems, vol. 30, 2017.
  9. Z. Zhang, J. Yang, P. Ke, F. Mi, H. Wang, and M. Huang, "Defending large language models against jailbreaking attacks through goal prioritization," in The Annual Meeting of the Association for Computational Linguistics, 2024, pp. 8865–8887.
  10. C. Anil, E. Durmus, N. Panickssery, M. Sharma, J. Benton, S. Kundu, J. Batson, M. Tong, J. Mu, D. Ford et al., "Many-shot jailbreaking," Advances in Neural Information Processing Systems, vol. 37, pp. 129696–129742, 2024.
  11. A. Wan, E. Wallace, S. Shen, and D. Klein, "Poisoning language models during instruction tuning," in International Conference on Machine Learning, 2023, pp. 35413–35425.
  12. E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, "Universal adversarial triggers for attacking and analyzing NLP," in The Conference on EMNLP-IJCNLP, 2019, pp. 2153–2162.
  13. T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, "AutoPrompt: Eliciting knowledge from language models with automatically generated prompts," in The Conference on Empirical Methods in Natural Language Processing, 2020, pp. 4222–4235.
  14. A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, "Universal and transferable adversarial attacks on aligned language models," arXiv preprint arXiv:2307.15043, 2023.
  15. K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang, L. Yang, W. Ye, Y. Zhang, N. Gong et al., "PromptRobust: Towards evaluating the robustness of large language models on adversarial prompts," in The 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis, 2023, pp. 57–68.
  16. M. Li, L. Li, Y. Yin, M. Ahmed, Z. Liu, and Q. Liu, "Red teaming visual language models," in Findings of the Association for Computational Linguistics, 2024, pp. 3326–3342.
  17. H. Cheng, E. Xiao, Y. Wang, L. Zhang, Q. Zhang, J. Cao, K. Xu, M. Sun, X. Hao, J. Gu et al., "Exploring typographic visual prompts injection threats in cross-modality generation models," arXiv preprint arXiv:2503.11519, 2025.
  18. M. Nasr, N. Carlini, C. Sitawarin, S. V. Schulhoff, J. Hayes, M. Ilie, J. Pluto, S. Song, H. Chaudhari, I. Shumailov et al., "The attacker moves second: Stronger adaptive attacks bypass defenses against LLM jailbreaks and prompt injections," arXiv preprint arXiv:2510.09023, 2025.
  19. Z. Ni, H. Wang, and H. Wang, "ShieldLearner: A new paradigm for jailbreak attack defense in LLMs," arXiv preprint arXiv:2502.13162, 2025.
  20. K. Zhou, C. Liu, X. Zhao, S. Jangam, J. Srinivasa, G. Liu, D. Song, and X. E. Wang, "The hidden risks of large reasoning models: A safety assessment of R1," in The 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, 2025, pp. 3250–3265.
  21. Y.-H. Wu, Y.-J. Xiong, H. Zhang, J.-C. Zhang, and Z. Zhou, "Sugar-coated poison: Benign generation unlocks jailbreaking," in Findings of the Association for Computational Linguistics, 2025, pp. 9645–9665.
  22. Y. Cui, Y. Cai, and Y. Wang, "Token-efficient prompt injection attack: Provoking cessation in LLM reasoning via adaptive token compression," arXiv preprint arXiv:2504.20493, 2025.
  23. A. Souly, J. Rando, E. Chapman, X. Davies, B. Hasircioglu, E. Shereen, C. Mougan, V. Mavroudis, E. Jones, C. Hicks et al., "Poisoning attacks on LLMs require a near-constant number of poison samples," arXiv preprint arXiv:2510.07192, 2025.
  24. A. Shafahi, M. Najibi, M. A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein, "Adversarial training for free!" Advances in Neural Information Processing Systems, vol. 32, 2019.
  25. L. Graves, V. Nagisetty, and V. Ganesh, "Amnesiac machine learning," in The AAAI Conference on Artificial Intelligence, vol. 35, no. 13, 2021, pp. 11516–11524.
  26. R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in IEEE Symposium on Security and Privacy. IEEE, 2017, pp. 3–18.
  27. C. M. Ackerman and N. Panickssery, "Mitigating many-shot jailbreaking," arXiv preprint arXiv:2504.09604, 2025.
  28. X. Wang, D. Wu, Z. Ji, Z. Li, P. Ma, S. Wang, Y. Li, Y. Liu, N. Liu, and J. Rahmel, "SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner," in The 34th USENIX Security Symposium, 2025, pp. 2441–2460.
  29. Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang et al., "A survey on in-context learning," in The Conference on Empirical Methods in Natural Language Processing, 2024, pp. 1107–1128.
  30. E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, "Red teaming language models with language models," in The Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3419–3448.
  31. Y. Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal, "STRIP: A defence against trojan attacks on deep neural networks," in The 35th Annual Computer Security Applications Conference, 2019, pp. 113–125.
  32. N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu, "A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation," arXiv preprint arXiv:2307.03987, 2023.
  33. B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer et al., "DecodingTrust: A comprehensive assessment of trustworthiness in GPT models," Advances in Neural Information Processing Systems, vol. 36, pp. 31232–31339, 2023.
  34. Z. Wei, Y. Liu, and N. B. Erichson, "Emoji attack: Enhancing jailbreak attacks against judge LLM detection," arXiv preprint arXiv:2411.01077, 2024.
  35. C. Guo, T. Goldstein, A. Hannun, and L. Van Der Maaten, "Certified data removal from machine learning models," arXiv preprint arXiv:1911.03030, 2019.
  36. B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava, "Detecting backdoor attacks on deep neural networks by activation clustering," arXiv preprint arXiv:1811.03728, 2018.
  37. N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson et al., "Extracting training data from large language models," in The 30th USENIX Security Symposium, 2021, pp. 2633–2650.