pith. sign in

arxiv: 2605.19159 · v1 · pith:JRQQOFBLnew · submitted 2026-05-18 · 💻 cs.CR

On the Geometric Limits of Transformer Defenses against Obfuscation Attacks: Latent Embedding Collapse & Performance Robustness Gap

Pith reviewed 2026-05-20 08:44 UTC · model grok-4.3

classification 💻 cs.CR
keywords prompt injectionobfuscation attackslatent embedding collapsetransformer defensesembedding robustnessperformance-robustness gapBERT encodersgeometric analysis
0
0 comments X

The pith

High detection accuracy in prompt-injection defenses masks near-overlap between obfuscated and clean embeddings in transformer models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multi-operator obfuscated prompts, which combine homoglyphs, zero-width characters, and noise such as punctuation or emojis, can partially collapse onto the embedding manifold of clean prompts. This latent embedding collapse occurs even though detectors built on BERT-family encoders achieve near-perfect classification performance. The minimal distance between clean and obfuscated embeddings reaches only 1.02 while obfuscated points show markedly higher intra-class variance, exposing a performance-robustness gap that persists across models of different depths and capacities. A reader would care because current evaluation practices rely on classification scores that do not detect this geometric instability, leaving potential attack surfaces unaddressed.

Core claim

Multi-operator obfuscated prompts partially collapse onto the embedding manifold of clean prompts, a phenomenon termed latent embedding collapse. Across multiple BERT-family encoders, detectors reach near-perfect classification yet the minimal clean-obfuscated margin equals 1.02, indicating near-overlap, while obfuscated embeddings exhibit elevated intra-class variance of 3.33 plus or minus 6.23. These results demonstrate a substantial performance-robustness gap, and increasing model capacity does not eliminate the collapse.

What carries the argument

Latent embedding collapse: the partial overlap of obfuscated prompt embeddings with the manifold of clean prompt embeddings, which reveals geometric fragility despite strong classification boundaries.

If this is right

  • Classification accuracy alone cannot certify robustness against obfuscated prompt injections.
  • Embedding-space margins and variance must be measured to assess whether a defense has truly separated clean and attacked inputs.
  • Scaling model depth or capacity leaves the observed collapse and variance unchanged.
  • Geometry-aware training or evaluation is required as a complement to performance-based testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attackers could exploit the small margin by generating obfuscations that remain close to clean embeddings yet still trigger the intended injection.
  • Similar embedding instability may appear in non-BERT transformer families or in multimodal models that process text alongside other modalities.
  • Defenses could be improved by explicitly optimizing for larger inter-class margins in the latent space rather than classification loss alone.

Load-bearing premise

The specific multi-operator obfuscation combinations and BERT-family encoders tested represent the general behavior of transformer defenses against obfuscation attacks.

What would settle it

A defense architecture or training procedure that produces a clean-obfuscated embedding margin substantially larger than 1.02 while preserving near-perfect classification accuracy would show the collapse is not inherent.

Figures

Figures reproduced from arXiv: 2605.19159 by Becky Mashaido, Tapadhir Das.

Figure 1
Figure 1. Figure 1: Illustration of Prompt Injection Attack on LLMs [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Proposed methodology of understanding obfuscated [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: DistillBERT PCA Projections of Prompt Embeddings [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: BERTBase PCA Projections of Prompt Embeddings [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: BERTMedium PCA Projections of Prompt Embeddings [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: BERTMedium t-SNE Projections of Prompt Embed [PITH_FULL_IMAGE:figures/full_fig_p005_8.png] view at source ↗
read the original abstract

Prompt injection attacks pose significant risks to language model safety, yet existing defenses are typically evaluated using classification performance. We show that high detection performance does not imply representational robustness. Specifically, multi-operator obfuscated prompts (combining homoglyphs, zero-width characters, and punctuation or emoji noise) can partially collapse onto the embedding manifold of clean prompts, a phenomenon we term latent embedding collapse. Results indicate that across multiple BERT family encoders with varying depth and capacity, detectors achieve near-perfect classification performance, yet the minimal clean-obfuscated margin delta = 1.02, indicating near-overlap of obfuscated and clean embeddings. Obfuscated embeddings further exhibit elevated intra-class variance (3.33 +/- 6.23), indicating severe latent-space instability despite high performance. These results reveal a substantial perf ormance-robustness gap, demonstrating that standard evaluation metrics fail to capture latent embedding collapse and underlying geometric fragility. Our findings show that increasing model capacity does not eliminate latent embedding collapse, motivating geometry-aware robustness analysis as a necessary complement to performance-based evaluation for prompt-injection defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that high classification accuracy of BERT-family detectors on prompt-injection attacks does not imply representational robustness. Multi-operator obfuscated prompts (homoglyphs + zero-width characters + punctuation/emoji noise) partially collapse onto the clean-prompt embedding manifold, producing a minimal clean-obfuscated margin of delta = 1.02 and elevated intra-class variance of 3.33 +/- 6.23; this 'latent embedding collapse' is presented as evidence of a performance-robustness gap that standard metrics miss, and the gap persists across model capacities.

Significance. If the geometric measurements were shown to predict actual defense failures, the result would usefully motivate geometry-aware evaluation as a complement to accuracy-based testing for LLM safety mechanisms.

major comments (3)
  1. [Abstract and Results] Abstract and Results: The central claim of a 'performance-robustness gap' is asserted from the reported delta = 1.02 and variance 3.33 +/- 6.23, yet the manuscript contains no experiments that link these embedding statistics to practical outcomes such as successful evasion, reduced detector accuracy under optimized multi-operator attacks, or changes in prompt-injection success rates. This missing causal link is load-bearing for the headline conclusion.
  2. [Empirical Evaluation] Empirical Evaluation: The obfuscated-embedding variance is reported as 3.33 +/- 6.23. Because the standard deviation exceeds the mean, the statistic may reflect measurement noise or a small number of extreme outliers rather than consistent instability; no robustness checks, outlier analysis, or per-sample distribution plots are provided to support the interpretation of 'severe latent-space instability'.
  3. [Methods] Methods: The paper does not report statistical significance tests for delta or the variance difference, does not compare against single-operator baselines, and gives insufficient detail on dataset size, train/test splits, or exact operator combinations. These omissions limit evaluation of whether the chosen obfuscations and BERT variants are representative enough to support the general claim of geometric fragility.
minor comments (2)
  1. [Abstract] Abstract contains a typographical spacing error ('perf ormance').
  2. [Notation] The margin delta and variance quantities should be accompanied by explicit definitions or equations in the main text rather than appearing only in the abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, providing clarifications on the manuscript's claims and indicating revisions where they strengthen the presentation without altering the core findings.

read point-by-point responses
  1. Referee: [Abstract and Results] The central claim of a 'performance-robustness gap' is asserted from the reported delta = 1.02 and variance 3.33 +/- 6.23, yet the manuscript contains no experiments that link these embedding statistics to practical outcomes such as successful evasion, reduced detector accuracy under optimized multi-operator attacks, or changes in prompt-injection success rates. This missing causal link is load-bearing for the headline conclusion.

    Authors: The performance-robustness gap is defined as the discrepancy between near-perfect classification accuracy and the geometric properties of the latent space. The minimal margin of 1.02 shows that obfuscated embeddings lie within the clean-prompt manifold, so the detector's high accuracy is achieved despite near-overlap rather than true separation; the elevated variance further quantifies the instability of those representations. This geometric evidence directly supports the claim that accuracy-based metrics miss underlying fragility. We agree that explicit linkage to evasion rates would add value and will add a short discussion subsection relating the observed collapse to potential attack implications, including how the margin correlates with reduced representational separation. revision: yes

  2. Referee: [Empirical Evaluation] The obfuscated-embedding variance is reported as 3.33 +/- 6.23. Because the standard deviation exceeds the mean, the statistic may reflect measurement noise or a small number of extreme outliers rather than consistent instability; no robustness checks, outlier analysis, or per-sample distribution plots are provided to support the interpretation of 'severe latent-space instability'.

    Authors: The statistic is the mean intra-class variance computed across obfuscated prompt sets, with the +/- 6.23 giving the standard deviation of those per-set variances over the BERT variants and operator combinations. The large spread is interpreted as reflecting genuine variation in instability rather than noise. To substantiate this, the revised version will include per-sample variance histograms, outlier-robustness checks (e.g., median absolute deviation), and distribution plots that separate the contribution of extreme samples from the overall trend. revision: yes

  3. Referee: [Methods] The paper does not report statistical significance tests for delta or the variance difference, does not compare against single-operator baselines, and gives insufficient detail on dataset size, train/test splits, or exact operator combinations. These omissions limit evaluation of whether the chosen obfuscations and BERT variants are representative enough to support the general claim of geometric fragility.

    Authors: We will add paired t-tests (or Wilcoxon tests where normality is violated) for the reported delta and variance differences. Single-operator baselines will be included to show that multi-operator obfuscation produces more pronounced collapse than any individual operator. The methods section will be expanded with precise numbers for clean and obfuscated prompt counts, the 70/30 train/test split, and the exact operator combinations (homoglyph substitution rates, zero-width insertion positions, and punctuation/emoji noise levels). These additions will allow readers to assess representativeness directly. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical embedding measurements

full rationale

The paper's central results consist of measured quantities (clean-obfuscated margin delta = 1.02 and obfuscated intra-class variance 3.33 +/- 6.23) obtained from BERT-family encoders on multi-operator obfuscated prompts. These are presented as direct observations of latent embedding collapse rather than outputs of any fitted model, self-referential definition, or prior self-citation chain. The performance-robustness gap is an interpretive inference from these independent measurements; the measurements themselves do not reduce by construction to the paper's own inputs or equations. No self-definitional steps, fitted-input predictions, or ansatz smuggling via citation appear in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that embedding-space margin and intra-class variance are valid proxies for representational robustness, and that the tested BERT variants and chosen obfuscation operators generalize to transformer defenses broadly.

axioms (2)
  • domain assumption BERT-family encoders produce embeddings whose geometry is meaningful for assessing prompt-injection detector robustness
    The collapse and margin measurements are interpreted as evidence of fragility only if these embeddings faithfully reflect input distinctions.
  • domain assumption The multi-operator obfuscations (homoglyphs + zero-width + punctuation/emoji) constitute representative real-world attacks
    The observed collapse is tied to these specific combinations; different attack compositions might not produce the same geometry.

pith-pipeline@v0.9.0 · 5725 in / 1558 out tokens · 70660 ms · 2026-05-20T08:44:51.809850+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    multi-operator obfuscated prompts ... can partially collapse onto the embedding manifold of clean prompts, a phenomenon we term latent embedding collapse ... minimal clean-obfuscated margin δ=1.02 ... obfuscated intra-class variance (3.33±6.23)

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 3 internal anchors

  1. [1]

    Chatgpt for good? on opportunities and challenges of large language models for education,

    E. Kasneci, K. Seßler, S. K ¨uchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. G ¨unnemann, E. H ¨ullermeieret al., “Chatgpt for good? on opportunities and challenges of large language models for education,”Learning and individual differences, vol. 103, p. 102274, 2023

  2. [2]

    Large Language Models Market Size — Industry Report, 2030 — grandviewresearch.com,

    “Large Language Models Market Size — Industry Report, 2030 — grandviewresearch.com,” https://www.grandviewresearch.com/ industry-analysis/large-language-model-llm-market-report, [Accessed 10-01-2026]

  3. [3]

    Prompt Injection attack against LLM-integrated Applications

    Y . Liu, G. Deng, Y . Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y . Liu, H. Wang, Y . Zhenget al., “Prompt injection attack against llm-integrated applications,”arXiv preprint arXiv:2306.05499, 2023

  4. [4]

    Optimization-based prompt injection attack to llm-as-a-judge,

    J. Shi, Z. Yuan, Y . Liu, Y . Huang, P. Zhou, L. Sun, and N. Z. Gong, “Optimization-based prompt injection attack to llm-as-a-judge,” inProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 660–674

  5. [5]

    Attention tracker: Detecting prompt injection attacks in llms,

    K.-H. Hung, C.-Y . Ko, A. Rawat, I.-H. Chung, W. H. Hsu, and P.-Y . Chen, “Attention tracker: Detecting prompt injection attacks in llms,” inFindings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 2309–2322

  6. [6]

    Defending against prompt injection with a few defensivetokens,

    S. Chen, Y . Wang, N. Carlini, C. Sitawarin, and D. Wagner, “Defending against prompt injection with a few defensivetokens,” inProceedings of the 18th ACM Workshop on Artificial Intelligence and Security, 2025, pp. 242–252

  7. [7]

    Fine-tuned large language models (llms): Improved prompt injection attacks detection,

    M. A. Rahman, H. Shahriar, G. Francia, F. Wu, A. Cuzzocrea, M. Rah- man, M. J. H. Faruk, and S. I. Ahamed, “Fine-tuned large language models (llms): Improved prompt injection attacks detection,” in2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 2025, pp. 1033–1039

  8. [8]

    ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

    H. Chang, Y . Jun, and H. Lee, “Chatinject: Abusing chat templates for prompt injection in llm agents,”arXiv preprint arXiv:2509.22830, 2025

  9. [9]

    Not what you’ve signed up for: Compromising real-world llm- integrated applications with indirect prompt injection,

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world llm- integrated applications with indirect prompt injection,” inProceedings of the 16th ACM workshop on artificial intelligence and security, 2023, pp. 79–90

  10. [10]

    Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

    D. Lee and M. Tiwari, “Prompt infection: Llm-to-llm prompt injection within multi-agent systems,”arXiv preprint arXiv:2410.07283, 2024

  11. [11]

    Uniguardian: A unified defense for detecting prompt injection, backdoor attacks and adversarial attacks in large language models,

    H. Lin, Y . Lao, T. Geng, T. Yu, and W. Zhao, “Uniguardian: A unified defense for detecting prompt injection, backdoor attacks and adversarial attacks in large language models,”arXiv preprint arXiv:2502.13141, 2025

  12. [12]

    Jatmo: Prompt injection defense by task-specific finetuning,

    J. Piet, M. Alrashed, C. Sitawarin, S. Chen, Z. Wei, E. Sun, B. Alomair, and D. Wagner, “Jatmo: Prompt injection defense by task-specific finetuning,” inEuropean Symposium on Research in Computer Security. Springer, 2024, pp. 105–124

  13. [13]

    A survey of adversarial defenses and robustness in nlp,

    S. Goyal, S. Doddapaneni, M. M. Khapra, and B. Ravindran, “A survey of adversarial defenses and robustness in nlp,”ACM Computing Surveys, vol. 55, no. 14s, pp. 1–39, 2023

  14. [14]

    The comprehensive review on prompt injection attacks and defense mechanisms in large language models,

    Q. Wang, “The comprehensive review on prompt injection attacks and defense mechanisms in large language models,”Science and Technology of Engineering, Chemistry and Environmental Protection, vol. 1, no. 3, 2025

  15. [15]

    A critical evaluation of defenses against prompt injection attacks,

    Y . Jia, Z. Shao, Y . Liu, J. Jia, D. Song, and N. Z. Gong, “A critical evaluation of defenses against prompt injection attacks,”arXiv preprint arXiv:2505.18333, 2025

  16. [16]

    https://doi.org/10

    Y . Wang, S. Chen, R. Alkhudair, B. Alomair, and D. Wagner, “Defending against prompt injection with datafilter,”arXiv preprint arXiv:2510.19207, 2025

  17. [17]

    Drip: Defending prompt injection via de-instruction training and residual fusion model architecture,

    R. Liu, Y . Lin, and J. S. Dong, “Drip: Defending prompt injection via de-instruction training and residual fusion model architecture,”arXiv e-prints, pp. arXiv–2511, 2025

  18. [18]

    Conceptual and empirical comparison of dimensionality reduction algorithms (pca, kpca, lda, mds, svd, lle, isomap, le, ica, t-sne),

    F. Anowar, S. Sadaoui, and B. Selim, “Conceptual and empirical comparison of dimensionality reduction algorithms (pca, kpca, lda, mds, svd, lle, isomap, le, ica, t-sne),”Computer Science Review, vol. 40, p. 100378, 2021