DeepSeek Robustness Against Semantic-Character Dual-Space Mutated Prompt Injection
Pith reviewed 2026-05-10 15:50 UTC · model grok-4.3
The pith
Combined semantic and character mutations outperform single-space attacks on DeepSeek.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PromptFuzz-SC integrates semantic transformations with character-level obfuscation in a unified mutation library and employs a hybrid epsilon-greedy plus hill-climbing search to generate adversarial prompts. On DeepSeek, this dual-space strategy records the highest mean misuse success rate (0.189), the highest peak (0.375), and the best mean stealth, exceeding semantic-only mutation by 12.5% and character-only mutation by 5.6% in mean success rate.
What carries the argument
PromptFuzz-SC, the semantic-character dual-space mutation framework that combines paraphrasing, word-order changes, zero-width insertions, and encoding mutations with a hybrid epsilon-greedy and hill-climbing search.
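The paper gives no pseudocode for the operator library or the search loop; the sketch below shows one way the pieces could fit together. The operator implementations, the reward bookkeeping, and the black-box `score` oracle (target model plus a success judge) are all assumptions for illustration, not the authors' code.

```python
import random

# Hypothetical dual-space operator library: semantic and character-level
# mutators share one interface (str -> str). The paraphrase stub stands in
# for whatever rewriting model the paper actually uses.
def paraphrase(p):
    return p  # placeholder for a semantic rewriter

def swap_word_order(p):
    w = p.split()
    if len(w) > 1:
        i = random.randrange(len(w) - 1)
        w[i], w[i + 1] = w[i + 1], w[i]
    return " ".join(w)

def insert_zero_width(p):
    i = random.randrange(len(p) + 1)
    return p[:i] + "\u200b" + p[i:]  # zero-width space

def encode_char(p):
    # Replace one ASCII letter with its visually similar fullwidth form.
    idx = [i for i, c in enumerate(p) if c.isascii() and c.isalpha()]
    if not idx:
        return p
    i = random.choice(idx)
    return p[:i] + chr(ord(p[i]) + 0xFEE0) + p[i + 1:]

OPERATORS = [paraphrase, swap_word_order, insert_zero_width, encode_char]

def hybrid_search(seed, score, epsilon=0.2, budget=50):
    """Epsilon-greedy operator choice + hill-climbing acceptance.

    `score` is an assumed black-box oracle returning a scalar attack score
    for a prompt (e.g., from a misuse judge); higher is better.
    """
    best, best_s = seed, score(seed)
    stats = {op: (1.0, 1) for op in OPERATORS}      # (total reward, pulls)
    for _ in range(budget):
        if random.random() < epsilon:               # explore a random operator
            op = random.choice(OPERATORS)
        else:                                       # exploit best mean reward
            op = max(OPERATORS, key=lambda o: stats[o][0] / stats[o][1])
        cand = op(best)
        s = score(cand)
        r, n = stats[op]
        stats[op] = (r + max(s - best_s, 0.0), n + 1)
        if s > best_s:                              # hill-climb: keep improvements
            best, best_s = cand, s
    return best, best_s
```

The epsilon parameter trades exploration of under-used operators against exploiting the operator with the best observed reward, while the hill-climbing acceptance keeps only mutations that improve the current best prompt.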
If this is right
- Dual-space mutation produces higher mean and peak misuse success rates than isolated semantic or character attacks.
- The approach achieves a more favorable balance between attack effectiveness and concealment than single-dimension baselines.
- Composite mutation strategies are required for thorough red-teaming of LLM prompt injection defenses.
- Defense mechanisms should address perturbations across both semantic and character spaces rather than one dimension alone.
Where Pith is reading between the lines
- The same dual-space library could expose comparable weaknesses when tested on additional open-source or closed models not included in the study.
- Detection systems might gain coverage by monitoring for simultaneous semantic rewrites and character encodings in incoming prompts (see the sketch after this list).
- Extending the mutation operator set to include syntactic or structural changes could further increase attack potency.
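A minimal sketch of the dual-space pre-filter that the second point suggests, assuming a defense that screens prompts before they reach the model. The zero-width character list, the NFKC check, and the ratio heuristic are illustrative choices, not anything described in the paper.

```python
import unicodedata

# Common zero-width / invisible code points used in character-space attacks.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def char_space_flags(prompt: str) -> dict:
    """Cheap character-space checks a defense could run on incoming prompts.

    Flags zero-width insertions and encoding substitutions (e.g., fullwidth
    look-alikes); a full dual-space monitor would pair this with a semantic
    drift check against the surrounding context, which is not shown here.
    """
    nfkc = unicodedata.normalize("NFKC", prompt)
    return {
        "zero_width": any(c in ZERO_WIDTH for c in prompt),
        # NFKC folds fullwidth and compatibility forms back to ASCII, so a
        # change under normalization is a strong hint of encoding mutation.
        "normalization_changed": nfkc != prompt,
        "non_ascii_ratio": sum(not c.isascii() for c in prompt)
                           / max(len(prompt), 1),
    }
```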
Load-bearing premise
The chosen mutation operators and hybrid search produce attacks whose advantage holds beyond the specific DeepSeek model and the three chosen metrics, and does not depend on post-hoc selection of successful cases.
What would settle it
Applying the identical dual-space mutation library and search strategy to a different LLM such as GPT-4 or Llama and measuring whether the misuse success rate still exceeds both single-space baselines.
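As one concrete shape for that replication, the harness below runs each mutation strategy against each target model and tabulates mean MSR per (model, strategy) pair. The `models`, `strategies`, and `judge` arguments are placeholders standing in for the paper's actual components, and the thresholded score is an assumption.

```python
def replicate(models, seeds, strategies, judge):
    """Cross-model replication: run every strategy against every target.

    `models` maps a name to a black-box query function (prompt -> reply);
    `strategies` maps a name to a search function (seed, score) ->
    (best_prompt, best_score); `judge` returns True when a reply counts as
    misuse. All three are stand-ins, not the paper's implementation.
    """
    table = {}
    for model_name, ask in models.items():
        def score(p, ask=ask):            # bind this model's query function
            return 1.0 if judge(ask(p)) else 0.0
        for strat_name, search in strategies.items():
            wins = sum(search(seed, score)[1] >= 1.0 for seed in seeds)
            table[(model_name, strat_name)] = wins / len(seeds)
    return table
```

The settling experiment would then be a single call such as `replicate({"gpt-4": ask_gpt4, "llama": ask_llama}, seeds, {"semantic": sem_search, "character": char_search, "dual": hybrid_search}, judge)` (all names hypothetical), checking whether the dual-space row still dominates both single-space rows.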
Original abstract
Prompt injection has emerged as a critical security threat to large language models (LLMs), yet existing studies predominantly focus on single-dimensional attack strategies, such as semantic rewriting or character-level obfuscation, which fail to capture the combined effects of multi-space perturbations in realistic scenarios. In addition, systematic black-box robustness evaluations of recent Chinese LLMs, such as DeepSeek, remain limited. To address these gaps, we propose PromptFuzz-SC, a semantic-character dual-space mutation framework for evaluating LLM robustness against prompt injection. The framework integrates semantic transformations (e.g., paraphrasing and word-order perturbation) with character-level obfuscation (e.g., zero-width insertion and encoding-based mutation), forming a unified and extensible mutation operator library. A hybrid search strategy combining epsilon-greedy exploration and hill-climbing refinement is adopted to efficiently discover high-quality adversarial prompts. We further introduce a unified evaluation protocol based on three metrics: misuse success rate (MSR), Average Queries to Success (AQS), and Stealth. Experimental results on DeepSeek demonstrate that dual-space mutation achieves the strongest overall attack performance among the evaluated strategies, attaining the highest mean MSR (0.189), peak MSR (0.375), and mean Stealth. Compared with semantic-only and character-only mutation, it improves mean MSR by 12.5% and 5.6%, respectively. While not consistently minimizing query cost, the proposed method achieves competitive best-case efficiency and maintains strong imperceptibility, indicating a more favorable balance between attack effectiveness and concealment. These findings highlight the importance of composite mutation strategies for robust red-teaming of LLMs and provide practical insights for the design of multi-layer defense mechanisms.
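The abstract names the three evaluation metrics but not their formulas. The sketch below encodes one plausible reading: MSR as the fraction of base prompts for which an attack succeeds, AQS as the mean query count over successful runs, and Stealth as surface similarity between the seed and the final adversarial prompt, averaged over successes. All three definitions, and the run record layout, are assumptions rather than the paper's protocol.

```python
from difflib import SequenceMatcher

def evaluate(runs):
    """Aggregate attack runs into MSR, AQS, and Stealth.

    Each run is a dict: {"success": bool, "queries": int,
    "seed": str, "final": str}. These definitions are one plausible
    reading of the abstract, not the paper's exact formulas.
    """
    n = len(runs)
    wins = [r for r in runs if r["success"]]
    msr = len(wins) / n if n else 0.0                     # misuse success rate
    aqs = (sum(r["queries"] for r in wins) / len(wins)    # avg queries to success
           if wins else float("inf"))
    # Stealth as surface similarity between seed and mutated prompt,
    # averaged over successful attacks only.
    stealth = (sum(SequenceMatcher(None, r["seed"], r["final"]).ratio()
                   for r in wins) / len(wins)) if wins else 0.0
    return {"MSR": msr, "AQS": aqs, "Stealth": stealth}
```

Whether Stealth is averaged over successful attacks only is exactly the normalization ambiguity the referee raises below.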
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PromptFuzz-SC, a semantic-character dual-space mutation framework for prompt injection attacks on LLMs. It combines semantic transformations (paraphrasing, word-order perturbation) with character-level obfuscations (zero-width insertion, encoding mutations) into an extensible operator library, employs a hybrid epsilon-greedy plus hill-climbing search, and evaluates on DeepSeek using misuse success rate (MSR), average queries to success (AQS), and stealth. The central empirical claim is that dual-space mutation outperforms semantic-only and character-only baselines, achieving mean MSR 0.189 (12.5% and 5.6% higher, respectively), peak MSR 0.375, and best mean stealth.
Significance. If the reported performance ordering is shown to be statistically reliable, the work would usefully demonstrate that composite multi-space perturbations can improve attack effectiveness and imperceptibility for red-teaming, especially for under-evaluated Chinese LLMs such as DeepSeek. The extensible mutation library and unified three-metric protocol are practical contributions that could support future defense design.
major comments (2)
- [Abstract] The headline MSR figures (mean 0.189, peak 0.375, and the 12.5% and 5.6% relative gains) are presented without any accompanying information on the number of trials, per-condition standard deviations, confidence intervals, or statistical hypothesis tests. Because the absolute differences are modest, these details are required to establish that the claimed superiority of dual-space mutation is distinguishable from sampling noise.
- [Experimental results] The definition of 'success' used to compute MSR, the total number of base prompts, and the precise criteria for selecting or terminating the hybrid search are not stated. These omissions make it impossible to assess reproducibility or to determine whether the reported 'more favorable balance' between effectiveness and concealment generalizes beyond the specific DeepSeek runs shown.
minor comments (2)
- [Abstract] The abstract states that the method 'maintains strong imperceptibility' but does not clarify how the Stealth metric is normalized or whether it is averaged over successful attacks only.
- [Methodology] Notation for the mutation operators (e.g., how semantic and character mutations are combined in the dual-space operator) could be made more explicit with a small table or pseudocode example.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and reproducibility.
Point-by-point responses
- Referee: [Abstract] The headline MSR figures (mean 0.189, peak 0.375, and the 12.5% and 5.6% relative gains) are presented without any accompanying information on the number of trials, per-condition standard deviations, confidence intervals, or statistical hypothesis tests. Because the absolute differences are modest, these details are required to establish that the claimed superiority of dual-space mutation is distinguishable from sampling noise.
Authors: We agree that the abstract requires supporting statistical context given the modest gains. We will revise the abstract to state the number of trials performed and explicitly reference the standard deviations, confidence intervals, and hypothesis test results already computed and reported in the Experimental Results section. This will allow readers to evaluate the reliability of the dual-space superiority claim directly from the abstract while keeping it concise. Revision: yes.
- Referee: [Experimental results] The definition of 'success' used to compute MSR, the total number of base prompts, and the precise criteria for selecting or terminating the hybrid search are not stated. These omissions make it impossible to assess reproducibility or to determine whether the reported 'more favorable balance' between effectiveness and concealment generalizes beyond the specific DeepSeek runs shown.
Authors: We acknowledge that these omissions hinder reproducibility. In the revised Experimental Results section we will add: (1) the precise definition of success used to compute MSR, (2) the total number of base prompts, and (3) the exact termination and selection criteria for the epsilon-greedy plus hill-climbing hybrid search. These additions will be placed in the appropriate subsections so that the balance between effectiveness and stealth can be properly assessed. Revision: yes.
Circularity Check
No circularity: purely empirical comparison of mutation strategies
Full rationale
The paper proposes the PromptFuzz-SC framework and reports experimental outcomes on DeepSeek using MSR, AQS, and Stealth metrics. No derivation chain, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described results. Performance claims (e.g., dual-space MSR of 0.189) are direct experimental measurements, not reduced to inputs by construction. Any self-citations are peripheral and non-load-bearing for the empirical ordering. This matches the default case of a self-contained empirical study with no circularity.
Reference graph
Works this paper leans on
- [1] W. Gan, S. Wan, and P. S. Yu, "Model-as-a-service (MaaS): A survey," in IEEE International Conference on Big Data. IEEE, 2023, pp. 4636–4645.
- [2] J. Wu, W. Gan, Z. Chen, S. Wan, and P. S. Yu, "Multimodal large language models: A survey," in IEEE International Conference on Big Data. IEEE, 2023, pp. 2247–2256.
- [3] W. Gan, Z. Ning, Z. Qi, and P. S. Yu, "Mixture of experts (MoE): A big data perspective," Information Fusion, pp. 1–28, 2025.
- [4]
- [5] P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, "Jailbreaking black box large language models in twenty queries," in IEEE Conference on Secure and Trustworthy Machine Learning. IEEE, 2025, pp. 23–42.
- [6] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, "Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection," in The ACM Workshop on Artificial Intelligence and Security, 2023, pp. 79–90.
- [7] A. Paulus, A. Zharmagambetov, C. Guo, B. Amos, and Y. Tian, "AdvPrompter: Fast adaptive adversarial prompting for LLMs," arXiv preprint arXiv:2404.16873, 2024.
- [8] J. Steinhardt, P. W. W. Koh, and P. S. Liang, "Certified defenses for data poisoning attacks," Advances in Neural Information Processing Systems, vol. 30, 2017.
- [9] Z. Zhang, J. Yang, P. Ke, F. Mi, H. Wang, and M. Huang, "Defending large language models against jailbreaking attacks through goal prioritization," in The Annual Meeting of the Association for Computational Linguistics, 2024, pp. 8865–8887.
- [10] C. Anil, E. Durmus, N. Panickssery, M. Sharma, J. Benton, S. Kundu, J. Batson, M. Tong, J. Mu, D. Ford et al., "Many-shot jailbreaking," Advances in Neural Information Processing Systems, vol. 37, pp. 129696–129742, 2024.
- [11] A. Wan, E. Wallace, S. Shen, and D. Klein, "Poisoning language models during instruction tuning," in International Conference on Machine Learning, 2023, pp. 35413–35425.
- [12] E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, "Universal adversarial triggers for attacking and analyzing NLP," in The Conference on EMNLP-IJCNLP, 2019, pp. 2153–2162.
- [13] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, "AutoPrompt: Eliciting knowledge from language models with automatically generated prompts," in The Conference on Empirical Methods in Natural Language Processing, 2020, pp. 4222–4235.
- [14] A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, "Universal and transferable adversarial attacks on aligned language models," arXiv preprint arXiv:2307.15043, 2023.
- [15] K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang, L. Yang, W. Ye, Y. Zhang, N. Gong et al., "PromptRobust: Towards evaluating the robustness of large language models on adversarial prompts," in The 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis, 2023, pp. 57–68.
- [16] M. Li, L. Li, Y. Yin, M. Ahmed, Z. Liu, and Q. Liu, "Red teaming visual language models," in Findings of the Association for Computational Linguistics, 2024, pp. 3326–3342.
- [17] H. Cheng, E. Xiao, Y. Wang, L. Zhang, Q. Zhang, J. Cao, K. Xu, M. Sun, X. Hao, J. Gu et al., "Exploring typographic visual prompts injection threats in cross-modality generation models," arXiv preprint arXiv:2503.11519, 2025.
- [18] M. Nasr, N. Carlini, C. Sitawarin, S. V. Schulhoff, J. Hayes, M. Ilie, J. Pluto, S. Song, H. Chaudhari, I. Shumailov et al., "The attacker moves second: Stronger adaptive attacks bypass defenses against LLM jailbreaks and prompt injections," arXiv preprint arXiv:2510.09023, 2025.
- [19] Z. Ni, H. Wang, and H. Wang, "ShieldLearner: A new paradigm for jailbreak attack defense in LLMs," arXiv preprint arXiv:2502.13162, 2025.
- [20] K. Zhou, C. Liu, X. Zhao, S. Jangam, J. Srinivasa, G. Liu, D. Song, and X. E. Wang, "The hidden risks of large reasoning models: A safety assessment of R1," in The 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, 2025, pp. 3250–3265.
- [21] Y.-H. Wu, Y.-J. Xiong, H. Zhang, J.-C. Zhang, and Z. Zhou, "Sugar-coated poison: Benign generation unlocks jailbreaking," in Findings of the Association for Computational Linguistics, 2025, pp. 9645–9665.
- [22] Y. Cui, Y. Cai, and Y. Wang, "Token-efficient prompt injection attack: Provoking cessation in LLM reasoning via adaptive token compression," arXiv preprint arXiv:2504.20493, 2025.
- [23] A. Souly, J. Rando, E. Chapman, X. Davies, B. Hasircioglu, E. Shereen, C. Mougan, V. Mavroudis, E. Jones, C. Hicks et al., "Poisoning attacks on LLMs require a near-constant number of poison samples," arXiv preprint arXiv:2510.07192, 2025.
- [24] A. Shafahi, M. Najibi, M. A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein, "Adversarial training for free!" Advances in Neural Information Processing Systems, vol. 32, 2019.
- [25] L. Graves, V. Nagisetty, and V. Ganesh, "Amnesiac machine learning," in The AAAI Conference on Artificial Intelligence, vol. 35, no. 13, 2021, pp. 11516–11524.
- [26] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in IEEE Symposium on Security and Privacy. IEEE, 2017, pp. 3–18.
- [27] C. M. Ackerman and N. Panickssery, "Mitigating many-shot jailbreaking," arXiv preprint arXiv:2504.09604, 2025.
- [28] X. Wang, D. Wu, Z. Ji, Z. Li, P. Ma, S. Wang, Y. Li, Y. Liu, N. Liu, and J. Rahmel, "SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner," in The 34th USENIX Security Symposium, 2025, pp. 2441–2460.
- [29] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang et al., "A survey on in-context learning," in The Conference on Empirical Methods in Natural Language Processing, 2024, pp. 1107–1128.
- [30] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, "Red teaming language models with language models," in The Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3419–3448.
- [31] Y. Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal, "STRIP: A defence against trojan attacks on deep neural networks," in The 35th Annual Computer Security Applications Conference, 2019, pp. 113–125.
- [32] N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu, "A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation," arXiv preprint arXiv:2307.03987, 2023.
- [33] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer et al., "DecodingTrust: A comprehensive assessment of trustworthiness in GPT models," Advances in Neural Information Processing Systems, vol. 36, pp. 31232–31339, 2023.
- [34] Z. Wei, Y. Liu, and N. B. Erichson, "Emoji attack: Enhancing jailbreak attacks against judge LLM detection," arXiv preprint arXiv:2411.01077, 2024.
- [35] C. Guo, T. Goldstein, A. Hannun, and L. Van Der Maaten, "Certified data removal from machine learning models," arXiv preprint arXiv:1911.03030, 2019.
- [36] B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava, "Detecting backdoor attacks on deep neural networks by activation clustering," arXiv preprint arXiv:1811.03728, 2018.
- [37] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson et al., "Extracting training data from large language models," in The 30th USENIX Security Symposium, 2021, pp. 2633–2650.