DeepSeek Robustness Against Semantic-Character Dual-Space Mutated Prompt Injection
Pith reviewed 2026-05-10 15:50 UTC · model grok-4.3
The pith
Combined semantic and character mutations outperform single-space attacks on DeepSeek.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PromptFuzz-SC integrates semantic transformations with character-level obfuscation in a unified mutation library and employs a hybrid epsilon-greedy plus hill-climbing search to generate adversarial prompts. On DeepSeek, this dual-space strategy records the highest mean misuse success rate (0.189), the highest peak (0.375), and the best mean stealth, exceeding semantic-only mutation by 12.5% and character-only mutation by 5.6% in mean success rate.
What carries the argument
PromptFuzz-SC, the semantic-character dual-space mutation framework that combines paraphrasing, word-order changes, zero-width insertions, and encoding mutations with a hybrid epsilon-greedy and hill-climbing search.
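The paper gives no pseudocode for the operator library or the search loop; the sketch below shows one way the pieces could fit together. The operator implementations, the reward bookkeeping, and the black-box `score` oracle (target model plus a success judge) are all assumptions for illustration, not the authors' code.

```python
import random

# Hypothetical dual-space operator library: semantic and character-level
# mutators share one interface (str -> str). The paraphrase stub stands in
# for whatever rewriting model the paper actually uses.
def paraphrase(p):
    return p  # placeholder for a semantic rewriter

def swap_word_order(p):
    w = p.split()
    if len(w) > 1:
        i = random.randrange(len(w) - 1)
        w[i], w[i + 1] = w[i + 1], w[i]
    return " ".join(w)

def insert_zero_width(p):
    i = random.randrange(len(p) + 1)
    return p[:i] + "\u200b" + p[i:]  # zero-width space

def encode_char(p):
    # Replace one ASCII letter with its visually similar fullwidth form.
    idx = [i for i, c in enumerate(p) if c.isascii() and c.isalpha()]
    if not idx:
        return p
    i = random.choice(idx)
    return p[:i] + chr(ord(p[i]) + 0xFEE0) + p[i + 1:]

OPERATORS = [paraphrase, swap_word_order, insert_zero_width, encode_char]

def hybrid_search(seed, score, epsilon=0.2, budget=50):
    """Epsilon-greedy operator choice + hill-climbing acceptance.

    `score` is an assumed black-box oracle returning a scalar attack score
    for a prompt (e.g., from a misuse judge); higher is better.
    """
    best, best_s = seed, score(seed)
    stats = {op: (1.0, 1) for op in OPERATORS}      # (total reward, pulls)
    for _ in range(budget):
        if random.random() < epsilon:               # explore a random operator
            op = random.choice(OPERATORS)
        else:                                       # exploit best mean reward
            op = max(OPERATORS, key=lambda o: stats[o][0] / stats[o][1])
        cand = op(best)
        s = score(cand)
        r, n = stats[op]
        stats[op] = (r + max(s - best_s, 0.0), n + 1)
        if s > best_s:                              # hill-climb: keep improvements
            best, best_s = cand, s
    return best, best_s
```

The epsilon parameter trades exploration of under-used operators against exploiting the operator with the best observed reward, while the hill-climbing acceptance keeps only mutations that improve the current best prompt.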
If this is right
- Dual-space mutation produces higher mean and peak misuse success rates than isolated semantic or character attacks.
- The approach achieves a more favorable balance between attack effectiveness and concealment than single-dimension baselines.
- Composite mutation strategies are required for thorough red-teaming of LLM prompt injection defenses.
- Defense mechanisms should address perturbations across both semantic and character spaces rather than one dimension alone.
Where Pith is reading between the lines
- The same dual-space library could expose comparable weaknesses when tested on additional open-source or closed models not included in the study.
- Detection systems might gain coverage by monitoring for simultaneous semantic rewrites and character encodings in incoming prompts (see the sketch after this list).
- Extending the mutation operator set to include syntactic or structural changes could further increase attack potency.
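A minimal sketch of the dual-space pre-filter that the second point suggests, assuming a defense that screens prompts before they reach the model. The zero-width character list, the NFKC check, and the ratio heuristic are illustrative choices, not anything described in the paper.

```python
import unicodedata

# Common zero-width / invisible code points used in character-space attacks.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def char_space_flags(prompt: str) -> dict:
    """Cheap character-space checks a defense could run on incoming prompts.

    Flags zero-width insertions and encoding substitutions (e.g., fullwidth
    look-alikes); a full dual-space monitor would pair this with a semantic
    drift check against the surrounding context, which is not shown here.
    """
    nfkc = unicodedata.normalize("NFKC", prompt)
    return {
        "zero_width": any(c in ZERO_WIDTH for c in prompt),
        # NFKC folds fullwidth and compatibility forms back to ASCII, so a
        # change under normalization is a strong hint of encoding mutation.
        "normalization_changed": nfkc != prompt,
        "non_ascii_ratio": sum(not c.isascii() for c in prompt)
                           / max(len(prompt), 1),
    }
```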
Load-bearing premise
The chosen mutation operators and hybrid search produce attacks whose advantage holds beyond the specific DeepSeek model and the three chosen metrics, and does not depend on post-hoc selection of successful cases.
What would settle it
Applying the identical dual-space mutation library and search strategy to a different LLM such as GPT-4 or Llama and measuring whether the misuse success rate still exceeds both single-space baselines.
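As one concrete shape for that replication, the harness below runs each mutation strategy against each target model and tabulates mean MSR per (model, strategy) pair. The `models`, `strategies`, and `judge` arguments are placeholders standing in for the paper's actual components, and the thresholded score is an assumption.

```python
def replicate(models, seeds, strategies, judge):
    """Cross-model replication: run every strategy against every target.

    `models` maps a name to a black-box query function (prompt -> reply);
    `strategies` maps a name to a search function (seed, score) ->
    (best_prompt, best_score); `judge` returns True when a reply counts as
    misuse. All three are stand-ins, not the paper's implementation.
    """
    table = {}
    for model_name, ask in models.items():
        def score(p, ask=ask):            # bind this model's query function
            return 1.0 if judge(ask(p)) else 0.0
        for strat_name, search in strategies.items():
            wins = sum(search(seed, score)[1] >= 1.0 for seed in seeds)
            table[(model_name, strat_name)] = wins / len(seeds)
    return table
```

The settling experiment would then be a single call such as `replicate({"gpt-4": ask_gpt4, "llama": ask_llama}, seeds, {"semantic": sem_search, "character": char_search, "dual": hybrid_search}, judge)` (all names hypothetical), checking whether the dual-space row still dominates both single-space rows.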
Original abstract
Prompt injection has emerged as a critical security threat to large language models (LLMs), yet existing studies predominantly focus on single-dimensional attack strategies, such as semantic rewriting or character-level obfuscation, which fail to capture the combined effects of multi-space perturbations in realistic scenarios. In addition, systematic black-box robustness evaluations of recent Chinese LLMs, such as DeepSeek, remain limited. To address these gaps, we propose PromptFuzz-SC, a semantic-character dual-space mutation framework for evaluating LLM robustness against prompt injection. The framework integrates semantic transformations (e.g., paraphrasing and word-order perturbation) with character-level obfuscation (e.g., zero-width insertion and encoding-based mutation), forming a unified and extensible mutation operator library. A hybrid search strategy combining epsilon-greedy exploration and hill-climbing refinement is adopted to efficiently discover high-quality adversarial prompts. We further introduce a unified evaluation protocol based on three metrics: misuse success rate (MSR), Average Queries to Success (AQS), and Stealth. Experimental results on DeepSeek demonstrate that dual-space mutation achieves the strongest overall attack performance among the evaluated strategies, attaining the highest mean MSR (0.189), peak MSR (0.375), and mean Stealth. Compared with semantic-only and character-only mutation, it improves mean MSR by 12.5% and 5.6%, respectively. While not consistently minimizing query cost, the proposed method achieves competitive best-case efficiency and maintains strong imperceptibility, indicating a more favorable balance between attack effectiveness and concealment. These findings highlight the importance of composite mutation strategies for robust red-teaming of LLMs and provide practical insights for the design of multi-layer defense mechanisms.
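The abstract names the three evaluation metrics but not their formulas. The sketch below encodes one plausible reading: MSR as the fraction of base prompts for which an attack succeeds, AQS as the mean query count over successful runs, and Stealth as surface similarity between the seed and the final adversarial prompt, averaged over successes. All three definitions, and the run record layout, are assumptions rather than the paper's protocol.

```python
from difflib import SequenceMatcher

def evaluate(runs):
    """Aggregate attack runs into MSR, AQS, and Stealth.

    Each run is a dict: {"success": bool, "queries": int,
    "seed": str, "final": str}. These definitions are one plausible
    reading of the abstract, not the paper's exact formulas.
    """
    n = len(runs)
    wins = [r for r in runs if r["success"]]
    msr = len(wins) / n if n else 0.0                     # misuse success rate
    aqs = (sum(r["queries"] for r in wins) / len(wins)    # avg queries to success
           if wins else float("inf"))
    # Stealth as surface similarity between seed and mutated prompt,
    # averaged over successful attacks only.
    stealth = (sum(SequenceMatcher(None, r["seed"], r["final"]).ratio()
                   for r in wins) / len(wins)) if wins else 0.0
    return {"MSR": msr, "AQS": aqs, "Stealth": stealth}
```

Whether Stealth is averaged over successful attacks only is exactly the normalization ambiguity the referee raises below.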
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PromptFuzz-SC, a semantic-character dual-space mutation framework for prompt injection attacks on LLMs. It combines semantic transformations (paraphrasing, word-order perturbation) with character-level obfuscations (zero-width insertion, encoding mutations) into an extensible operator library, employs a hybrid epsilon-greedy plus hill-climbing search, and evaluates on DeepSeek using misuse success rate (MSR), average queries to success (AQS), and stealth. The central empirical claim is that dual-space mutation outperforms semantic-only and character-only baselines, achieving mean MSR 0.189 (12.5% and 5.6% higher, respectively), peak MSR 0.375, and best mean stealth.
Significance. If the reported performance ordering is shown to be statistically reliable, the work would usefully demonstrate that composite multi-space perturbations can improve attack effectiveness and imperceptibility for red-teaming, especially for under-evaluated Chinese LLMs such as DeepSeek. The extensible mutation library and unified three-metric protocol are practical contributions that could support future defense design.
major comments (2)
- [Abstract] The headline MSR figures (mean 0.189, peak 0.375, and the 12.5% and 5.6% relative gains) are presented without any accompanying information on the number of trials, per-condition standard deviations, confidence intervals, or statistical hypothesis tests. Because the absolute differences are modest, these details are required to establish that the claimed superiority of dual-space mutation is distinguishable from sampling noise.
- [Experimental results] The definition of 'success' used to compute MSR, the total number of base prompts, and the precise criteria for selecting or terminating the hybrid search are not stated. These omissions make it impossible to assess reproducibility or to determine whether the reported 'more favorable balance' between effectiveness and concealment generalizes beyond the specific DeepSeek runs shown.
minor comments (2)
- [Abstract] The abstract states that the method 'maintains strong imperceptibility' but does not clarify how the Stealth metric is normalized or whether it is averaged over successful attacks only.
- [Methodology] Notation for the mutation operators (e.g., how semantic and character mutations are combined in the dual-space operator) could be made more explicit with a small table or pseudocode example.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and reproducibility.
Point-by-point responses
- Referee: [Abstract] The headline MSR figures (mean 0.189, peak 0.375, and the 12.5% and 5.6% relative gains) are presented without any accompanying information on the number of trials, per-condition standard deviations, confidence intervals, or statistical hypothesis tests. Because the absolute differences are modest, these details are required to establish that the claimed superiority of dual-space mutation is distinguishable from sampling noise.
Authors: We agree that the abstract requires supporting statistical context given the modest gains. We will revise the abstract to state the number of trials performed and explicitly reference the standard deviations, confidence intervals, and hypothesis test results already computed and reported in the Experimental Results section. This will allow readers to evaluate the reliability of the dual-space superiority claim directly from the abstract while keeping it concise. Revision: yes.
- Referee: [Experimental results] The definition of 'success' used to compute MSR, the total number of base prompts, and the precise criteria for selecting or terminating the hybrid search are not stated. These omissions make it impossible to assess reproducibility or to determine whether the reported 'more favorable balance' between effectiveness and concealment generalizes beyond the specific DeepSeek runs shown.
Authors: We acknowledge that these omissions hinder reproducibility. In the revised Experimental Results section we will add: (1) the precise definition of success used to compute MSR, (2) the total number of base prompts, and (3) the exact termination and selection criteria for the epsilon-greedy plus hill-climbing hybrid search. These additions will be placed in the appropriate subsections so that the balance between effectiveness and stealth can be properly assessed. Revision: yes.
Circularity Check
No circularity: purely empirical comparison of mutation strategies
Full rationale
The paper proposes the PromptFuzz-SC framework and reports experimental outcomes on DeepSeek using MSR, AQS, and Stealth metrics. No derivation chain, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described results. Performance claims (e.g., dual-space MSR of 0.189) are direct experimental measurements, not reduced to inputs by construction. Any self-citations are peripheral and non-load-bearing for the empirical ordering. This matches the default case of a self-contained empirical study with no circularity.
Reference graph
Works this paper leans on
- [1] W. Gan, S. Wan, and P. S. Yu, "Model-as-a-service (MaaS): A survey," in IEEE International Conference on Big Data. IEEE, 2023, pp. 4636–4645.
- [2] J. Wu, W. Gan, Z. Chen, S. Wan, and P. S. Yu, "Multimodal large language models: A survey," in IEEE International Conference on Big Data. IEEE, 2023, pp. 2247–2256.
- [3] W. Gan, Z. Ning, Z. Qi, and P. S. Yu, "Mixture of experts (MoE): A big data perspective," Information Fusion, pp. 1–28, 2025.
- [4]
- [5] P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, "Jailbreaking black box large language models in twenty queries," in IEEE Conference on Secure and Trustworthy Machine Learning. IEEE, 2025, pp. 23–42.
- [6] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, "Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection," in The ACM Workshop on Artificial Intelligence and Security, 2023, pp. 79–90.
- [7] A. Paulus, A. Zharmagambetov, C. Guo, B. Amos, and Y. Tian, "AdvPrompter: Fast adaptive adversarial prompting for LLMs," arXiv preprint arXiv:2404.16873, 2024.
- [8] J. Steinhardt, P. W. W. Koh, and P. S. Liang, "Certified defenses for data poisoning attacks," Advances in Neural Information Processing Systems, vol. 30, 2017.
- [9] Z. Zhang, J. Yang, P. Ke, F. Mi, H. Wang, and M. Huang, "Defending large language models against jailbreaking attacks through goal prioritization," in The Annual Meeting of the Association for Computational Linguistics, 2024, pp. 8865–8887.
- [10] C. Anil, E. Durmus, N. Panickssery, M. Sharma, J. Benton, S. Kundu, J. Batson, M. Tong, J. Mu, D. Ford et al., "Many-shot jailbreaking," Advances in Neural Information Processing Systems, vol. 37, pp. 129696–129742, 2024.
- [11] A. Wan, E. Wallace, S. Shen, and D. Klein, "Poisoning language models during instruction tuning," in International Conference on Machine Learning, 2023, pp. 35413–35425.
- [12] E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, "Universal adversarial triggers for attacking and analyzing NLP," in The Conference on EMNLP-IJCNLP, 2019, pp. 2153–2162.
- [13] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, "AutoPrompt: Eliciting knowledge from language models with automatically generated prompts," in The Conference on Empirical Methods in Natural Language Processing, 2020, pp. 4222–4235.
- [14] A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, "Universal and transferable adversarial attacks on aligned language models," arXiv preprint arXiv:2307.15043, 2023.
- [15] K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang, L. Yang, W. Ye, Y. Zhang, N. Gong et al., "PromptRobust: Towards evaluating the robustness of large language models on adversarial prompts," in The 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis, 2023, pp. 57–68.
- [16] M. Li, L. Li, Y. Yin, M. Ahmed, Z. Liu, and Q. Liu, "Red teaming visual language models," in Findings of the Association for Computational Linguistics, 2024, pp. 3326–3342.
- [17] H. Cheng, E. Xiao, Y. Wang, L. Zhang, Q. Zhang, J. Cao, K. Xu, M. Sun, X. Hao, J. Gu et al., "Exploring typographic visual prompts injection threats in cross-modality generation models," arXiv preprint arXiv:2503.11519, 2025.
- [18] M. Nasr, N. Carlini, C. Sitawarin, S. V. Schulhoff, J. Hayes, M. Ilie, J. Pluto, S. Song, H. Chaudhari, I. Shumailov et al., "The attacker moves second: Stronger adaptive attacks bypass defenses against LLM jailbreaks and prompt injections," arXiv preprint arXiv:2510.09023, 2025.
- [19] Z. Ni, H. Wang, and H. Wang, "ShieldLearner: A new paradigm for jailbreak attack defense in LLMs," arXiv preprint arXiv:2502.13162, 2025.
- [20] K. Zhou, C. Liu, X. Zhao, S. Jangam, J. Srinivasa, G. Liu, D. Song, and X. E. Wang, "The hidden risks of large reasoning models: A safety assessment of R1," in The 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, 2025, pp. 3250–3265.
- [21] Y.-H. Wu, Y.-J. Xiong, H. Zhang, J.-C. Zhang, and Z. Zhou, "Sugar-coated poison: Benign generation unlocks jailbreaking," in Findings of the Association for Computational Linguistics, 2025, pp. 9645–9665.
- [22] Y. Cui, Y. Cai, and Y. Wang, "Token-efficient prompt injection attack: Provoking cessation in LLM reasoning via adaptive token compression," arXiv preprint arXiv:2504.20493, 2025.
- [23] A. Souly, J. Rando, E. Chapman, X. Davies, B. Hasircioglu, E. Shereen, C. Mougan, V. Mavroudis, E. Jones, C. Hicks et al., "Poisoning attacks on LLMs require a near-constant number of poison samples," arXiv preprint arXiv:2510.07192, 2025.
- [24] A. Shafahi, M. Najibi, M. A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein, "Adversarial training for free!" Advances in Neural Information Processing Systems, vol. 32, 2019.
- [25] L. Graves, V. Nagisetty, and V. Ganesh, "Amnesiac machine learning," in The AAAI Conference on Artificial Intelligence, vol. 35, no. 13, 2021, pp. 11516–11524.
- [26] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in IEEE Symposium on Security and Privacy. IEEE, 2017, pp. 3–18.
- [27] C. M. Ackerman and N. Panickssery, "Mitigating many-shot jailbreaking," arXiv preprint arXiv:2504.09604, 2025.
- [28] X. Wang, D. Wu, Z. Ji, Z. Li, P. Ma, S. Wang, Y. Li, Y. Liu, N. Liu, and J. Rahmel, "SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner," in The 34th USENIX Security Symposium, 2025, pp. 2441–2460.
- [29] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang et al., "A survey on in-context learning," in The Conference on Empirical Methods in Natural Language Processing, 2024, pp. 1107–1128.
- [30] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, "Red teaming language models with language models," in The Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3419–3448.
- [31] Y. Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal, "STRIP: A defence against trojan attacks on deep neural networks," in The 35th Annual Computer Security Applications Conference, 2019, pp. 113–125.
- [32] N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu, "A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation," arXiv preprint arXiv:2307.03987, 2023.
- [33] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer et al., "DecodingTrust: A comprehensive assessment of trustworthiness in GPT models," Advances in Neural Information Processing Systems, vol. 36, pp. 31232–31339, 2023.
- [34] Z. Wei, Y. Liu, and N. B. Erichson, "Emoji attack: Enhancing jailbreak attacks against judge LLM detection," arXiv preprint arXiv:2411.01077, 2024.
- [35] C. Guo, T. Goldstein, A. Hannun, and L. Van Der Maaten, "Certified data removal from machine learning models," arXiv preprint arXiv:1911.03030, 2019.
- [36] B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava, "Detecting backdoor attacks on deep neural networks by activation clustering," arXiv preprint arXiv:1811.03728, 2018.
- [37] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson et al., "Extracting training data from large language models," in The 30th USENIX Security Symposium, 2021, pp. 2633–2650.